Series Overview
This four-chapter educational series teaches you how to apply Materials Informatics (MI) methods to drug discovery and pharmaceutical development. You will learn the challenges facing traditional drug discovery and acquire practical skills in efficient drug design using AI and machine learning.
Features:
- ✅ Drug Discovery Specialized: Comprehensive coverage of essential technologies including molecular representation, QSAR, and ADMET prediction
- ✅ Practice-Oriented: 30 executable code examples leveraging RDKit/ChEMBL
- ✅ Latest Trends: Case studies from AI drug discovery companies like Exscientia and Insilico Medicine
- ✅ Industrial Applications: Implementation patterns usable in real drug discovery projects
Prerequisites:
- Completion of the Materials Informatics Introduction Series recommended
- Python basics, fundamental machine learning concepts
- Basic chemistry knowledge (introductory organic chemistry, biochemistry)
How to Study
Recommended Learning Path
flowchart TD
    A["Chapter 1: Role of MI in Drug Discovery"] --> B["Chapter 2: Drug Discovery Specialized MI Methods"]
    B --> C["Chapter 3: Python Implementation (RDKit & ChEMBL)"]
    C --> D["Chapter 4: Latest Case Studies in AI Drug Discovery"]
    style A fill:#e3f2fd
    style B fill:#fff3e0
    style C fill:#f3e5f5
    style D fill:#e8f5e9
For Drug Discovery Beginners (learning drug discovery processes for the first time):
- Chapter 1 → Chapter 2 → Chapter 3 (basic code only) → Chapter 4
- Duration: 80-100 minutes
- Chapter 2 → Chapter 3 → Chapter 4
- Duration: 70-90 minutes
- Chapter 3 (full code implementation) → Chapter 4
- Duration: 60-75 minutes
Chapter Details
Chapter 1: The Role of Materials Informatics in Drug Discovery
Difficulty: Beginner
Reading time: 20-25 minutes
Learning Content
- Current Status and Challenges of Drug Discovery Processes
- Traditional drug discovery: 10-15 years, $2.6B/drug, 0.01% success rate
- Drug discovery stages: Discovery → Preclinical → Clinical → Approval
- Bottleneck analysis: candidate molecule search, toxicity prediction, optimization
- Three Challenges Solved by MI
- Challenge 1: Efficient search through vast chemical space (10^60 molecules)
- Challenge 2: Early prediction of ADMET properties (Absorption, Distribution, Metabolism, Excretion, Toxicity)
- Challenge 3: Optimization of formulation design (solubility, stability, controlled release)
- Industrial Impact of AI Drug Discovery
- Market size: $2T (2024) → $3T (2030 forecast)
- Development time reduction: 10-15 years → 3-5 years
- Cost reduction: $2.6B → $500M-$1B
- AI drug discovery startups: Exscientia, Insilico Medicine, Atomwise
- History of MI in Drug Discovery
- 1960s: Birth of QSAR (Quantitative Structure-Activity Relationship)
- 1990s: High-Throughput Screening (HTS)
- 2010s: Deep Learning for Drug Discovery
- 2020s: Foundation Models (MatBERT, MolGPT)
Learning Objectives
- ✅ Explain the 5 stages of the drug discovery process
- ✅ Identify 3 limitations of traditional drug discovery with specific examples
- ✅ Demonstrate the industrial impact of AI drug discovery with numbers
- ✅ Understand the background for applying MI to drug discovery
Chapter 2: Drug Discovery Specialized MI Methods
Difficulty: Intermediate
Reading time: 25-30 minutes
Learning Content
- Molecular Representation and Descriptors
- SMILES Representation: `CC(=O)OC1=CC=CC=C1C(=O)O` (aspirin)
- Molecular Fingerprints: ECFP (Extended Connectivity Fingerprints), MACCS keys
- 3D Descriptors: pharmacophore coordinates, charge distribution, surface area
- Graph Representation: atoms as nodes, bonds as edges in graph structure
- QSAR (Quantitative Structure-Activity Relationship)
- Principle: molecular structure → descriptors → activity prediction
- Methods: Random Forest, SVM, Neural Networks
- Applications: IC50 prediction, binding affinity prediction
- Limitations and cautions: Applicability Domain, dangers of extrapolation
- ADMET Prediction
- Absorption: Caco-2 permeability, oral bioavailability
- Distribution: plasma protein binding rate, blood-brain barrier permeability
- Metabolism: CYP450 inhibition/induction
- Excretion: renal clearance, half-life
- Toxicity: hERG inhibition, hepatotoxicity, mutagenicity
- Molecular Generative Models
- VAE (Variational Autoencoder): molecular optimization in latent space
- GAN (Generative Adversarial Network): generation of novel molecules
- Transformer: SMILES-based generation (GPT-like models)
- Graph Neural Networks: direct generation of molecular graphs
- Major Databases and Tools
- ChEMBL: 2 million compounds, bioactivity data
- PubChem: 100 million compounds, structure and property information
- DrugBank: database of approved drugs and clinical trial drugs
- BindingDB: protein-ligand interactions
- RDKit: open-source cheminformatics library
- Drug Discovery MI Workflow
flowchart LR
    A[Target Identification] --> B["Compound Library Construction"]
    B --> C["In Silico Screening"]
    C --> D[ADMET Prediction]
    D --> E["Lead Compound Optimization"]
    E --> F[Experimental Validation]
    F --> G{Activity OK?}
    G -->|Yes| H[Preclinical Trial]
    G -->|No| E
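The feedback loop in the workflow above, where a failed validation sends the lead back to optimization, can be sketched in a few lines of Python. This is a minimal illustration rather than a real pipeline: the function `run_workflow`, the dictionary fields `potency` and `toxic`, and all numeric values are hypothetical stand-ins for model predictions and wet-lab results.

```python
from typing import Callable, Optional

Molecule = dict  # hypothetical record, e.g. {"id": "m1", "potency": 30, "toxic": False}


def run_workflow(
    library: list,
    predict_activity: Callable[[Molecule], float],
    passes_admet: Callable[[Molecule], bool],
    optimize: Callable[[Molecule], Molecule],
    validate: Callable[[Molecule], bool],
    activity_cutoff: float = 50.0,
    max_rounds: int = 3,
) -> Optional[Molecule]:
    """Return the first lead that passes validation, or None if none does."""
    # In silico screening: keep molecules whose predicted activity clears the cutoff.
    hits = [m for m in library if predict_activity(m) >= activity_cutoff]
    # ADMET prediction: drop hits rejected by the ADMET filter.
    leads = [m for m in hits if passes_admet(m)]
    for lead in leads:
        candidate = lead
        # Optimization/validation loop: "Activity OK?" -> No -> optimize again.
        for _ in range(max_rounds):
            candidate = optimize(candidate)
            if validate(candidate):
                return candidate  # would proceed to preclinical studies
    return None


# Toy data: integer potency scores stand in for predictions and assay results.
library = [
    {"id": "m1", "potency": 30, "toxic": False},
    {"id": "m2", "potency": 65, "toxic": True},
    {"id": "m3", "potency": 60, "toxic": False},
]

result = run_workflow(
    library,
    predict_activity=lambda m: m["potency"],
    passes_admet=lambda m: not m["toxic"],
    optimize=lambda m: {**m, "potency": m["potency"] + 10},
    validate=lambda m: m["potency"] >= 80,
)
print(result)
```

Passing each stage as a callable keeps the control flow separate from the models, so a QSAR predictor or an ADMET classifier can later replace the toy lambdas without changing the loop.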
Learning Objectives
- ✅ Explain and differentiate 4 types of molecular representation methods
- ✅ Understand the principles and application examples of QSAR
- ✅ Explain the 5 ADMET items specifically
- ✅ Compare 4 molecular generative model methods
- ✅ Grasp characteristics and use cases of major databases
- ✅ Draw the overall picture of drug discovery MI workflow
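The applicability-domain caution listed under QSAR in this chapter can be made concrete with fingerprint similarity. The sketch below computes the Tanimoto coefficient between fingerprints stored as sets of on-bit indices, then flags a query as out-of-domain when its nearest training neighbour is too dissimilar. The bit sets, the function names, and the 0.3 threshold are illustrative assumptions; in practice the fingerprints would come from a toolkit such as RDKit and the threshold would be tuned per dataset.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient |A & B| / |A | B| for two sets of on-bit indices."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(fp_a & fp_b) / union


def in_applicability_domain(query: set, training: list, threshold: float = 0.3) -> bool:
    """True if the query's most similar training fingerprint reaches the threshold."""
    if not training:
        return False
    return max(tanimoto(query, fp) for fp in training) >= threshold


# Toy fingerprints (hypothetical on-bit indices, not derived from real molecules).
training_fps = [{1, 2, 3, 4}, {2, 3, 7, 9}]
similar_query = {1, 2, 3, 8}
distant_query = {20, 21, 22, 23}

print(tanimoto(similar_query, training_fps[0]))
print(in_applicability_domain(similar_query, training_fps))
print(in_applicability_domain(distant_query, training_fps))
```

A nearest-neighbour similarity check is only one of several ways to define an applicability domain; it is used here because it reuses the fingerprint representation directly.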
Chapter 3: Implementing Drug Discovery MI with Python - RDKit & ChEMBL Practice
Difficulty: Intermediate
Reading time: 35-45 minutes
Code examples: 30 (all executable)
Learning Content
- Environment Setup
- RDKit installation: `conda install -c conda-forge rdkit`
- ChEMBL Web Resource Client: `pip install chembl_webresource_client`
- Dependencies: pandas, scikit-learn, matplotlib
- RDKit Basics (10 code examples)
- Example 1: Create molecule object from SMILES string
- Example 2: 2D molecular drawing
- Example 3: Calculate molecular weight and LogP
- Example 4: Lipinski's Rule of Five check
- Example 5: Generate molecular fingerprints (ECFP)
- Example 6: Calculate Tanimoto similarity
- Example 7: Substructure search (SMARTS)
- Example 8: 3D structure generation and optimization
- Example 9: Batch calculation of molecular descriptors
- Example 10: Read and write SDF/MOL files
- ChEMBL Data Acquisition (5 code examples)
- Example 11: Target protein search
- Example 12: Retrieve compound bioactivity data
- Example 13: Filter IC50 data
- Example 14: Build structure-activity dataset
- Example 15: Data preprocessing and cleaning
- QSAR Model Building (8 code examples)
- Example 16: Dataset splitting (train/test)
- Example 17: Random Forest classifier (active/inactive)
- Example 18: Random Forest regression (IC50 prediction)
- Example 19: SVM classifier
- Example 20: Neural Network (Keras/TensorFlow)
- Example 21: Feature importance analysis
- Example 22: Cross-validation and hyperparameter tuning
- Example 23: Model performance comparison (ROC-AUC, R^2)
- ADMET Prediction (4 code examples)
- Example 24: Solubility prediction
- Example 25: LogP (lipophilicity) prediction
- Example 26: Caco-2 permeability prediction
- Example 27: hERG inhibition prediction (cardiotoxicity)
- Graph Neural Network (3 code examples)
- Example 28: Molecular graph representation (PyTorch Geometric)
- Example 29: GCN (Graph Convolutional Network) implementation
- Example 30: GNN vs traditional ML performance comparison
- Project Challenge
- Goal: Predict COVID-19 protease inhibitors with ChEMBL data (ROC-AUC > 0.80)
- 6-Step Guide:
- Retrieve target (SARS-CoV-2 Mpro) data
- Collect 1,000 active compound samples
- Generate ECFP fingerprints
- Train Random Forest model
- Performance evaluation (ROC-AUC, Confusion Matrix)
- Screen novel candidate molecules
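The evaluation step of the project (ROC-AUC and confusion matrix) can be understood without any library. The sketch below computes ROC-AUC as the fraction of active/inactive pairs that the model ranks correctly, with ties counted as half, and tallies a confusion matrix at a fixed threshold. The labels and scores are made-up toy values and the function names are illustrative; in practice `sklearn.metrics.roc_auc_score` and `sklearn.metrics.confusion_matrix` would normally be used.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random active outscores a random inactive."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


def confusion_counts(labels, scores, threshold=0.5):
    """Return (TP, FP, TN, FN) when scores >= threshold are called active."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted_active = s >= threshold
        if predicted_active and y == 1:
            tp += 1
        elif predicted_active and y == 0:
            fp += 1
        elif not predicted_active and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


# Toy predictions: 1 = active, 0 = inactive (illustrative values only).
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]

auc = roc_auc(labels, scores)
print(f"ROC-AUC: {auc:.3f}")
print("TP, FP, TN, FN:", confusion_counts(labels, scores))
print("Meets project goal (> 0.80):", auc > 0.80)
```

The pairwise formulation is quadratic in the number of samples, which is fine for understanding the metric but slower than the sorting-based implementations in standard libraries.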
Learning Objectives
- ✅ Load molecules, draw, and calculate descriptors using RDKit
- ✅ Retrieve bioactivity data using ChEMBL API
- ✅ Implement QSAR models (RF, SVM, NN) and compare performance
- ✅ Build models to predict ADMET properties
- ✅ Understand Graph Neural Network basics and implement them
- ✅ Execute actual drug discovery projects end-to-end
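Before IC50 values retrieved from ChEMBL feed a classifier or regressor, they are commonly converted to pIC50 and, for classification, thresholded into active/inactive labels. The sketch below assumes IC50 is given in nanomolar and uses a 1 µM activity cutoff (pIC50 ≥ 6); the cutoff, the toy values, and the function names are illustrative assumptions and should be adapted to the target.

```python
import math


def pic50_from_nm(ic50_nm: float) -> float:
    """pIC50 = -log10(IC50 in mol/L); with IC50 in nM this equals 9 - log10(IC50)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return 9.0 - math.log10(ic50_nm)


def label_active(ic50_nm: float, cutoff_pic50: float = 6.0) -> bool:
    """Binary label for a classifier: True when pIC50 reaches the assumed cutoff."""
    return pic50_from_nm(ic50_nm) >= cutoff_pic50


# Toy IC50 values in nM (not real measurements).
ic50_values_nm = [10.0, 500.0, 50000.0]
pic50_values = [pic50_from_nm(v) for v in ic50_values_nm]
activity_labels = [label_active(v) for v in ic50_values_nm]

for ic50, pic50, active in zip(ic50_values_nm, pic50_values, activity_labels):
    print(f"IC50 = {ic50:>8.1f} nM  pIC50 = {pic50:.2f}  active = {active}")
```

Working on the logarithmic pIC50 scale keeps regression targets in a narrow range, since raw IC50 values often span several orders of magnitude.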
Chapter 4: Latest Case Studies and Industrial Applications in AI Drug Discovery
Difficulty: Intermediate to Advanced
Reading time: 20-25 minutes
Learning Content
- 5 Detailed Case Studies
- Disease: Obsessive-Compulsive Disorder (OCD)
- Technology: Active Learning, Multi-objective Optimization
- Results: Candidate compound discovery in 12 months (conventional 4.5 years)
- Status: Phase II clinical trial (started 2023)
- Impact: Demonstrated feasibility of AI drug discovery
- Technology: Generative Chemistry (GAN), Reinforcement Learning
- Results: Phase I reached in 18 months (conventional 3-5 years)
- Cost: $2.6M (conventional $100M+)
- Target: TNIK kinase inhibitor
- Publication: Zhavoronkov et al. (2019), *Nature Biotechnology*
- Technology: AtomNet (Deep Convolutional Neural Network)
- Screening: Evaluated 7 million compounds in 1 day
- Results: 2 candidate compounds (in vitro validated)
- Conventional method: Several months for equivalent scale screening
- Applications: Expanded to COVID-19, malaria
- Approach: Finding new indications for approved drugs (Drug Repurposing)
- Technology: Knowledge Graph, Natural Language Processing
- Discovery: Baricitinib (rheumatoid arthritis drug) for ALS indication
- Status: Clinical trial preparation
- Advantage: Leveraging existing safety data, shortened development period
- Technology: Transformer, Attention Mechanism
- Achievement: Protein structure prediction accuracy 90%+ (conventional 40-60%)
- Impact: Accelerated structure-based drug design
- Database: Released 200 million protein structure predictions
- Publication: Jumper et al. (2021), *Nature*
- AI Drug Discovery Strategies of Major Companies
- Pfizer: Building AI drug discovery platform, partnership with IBM
- Roche: Established Genentech AI Lab, $3B investment
- GSK: Created AI Hub, partnership with DeepMind
- Novartis: Leveraging Microsoft Azure, $1B investment
- Exscientia: Raised $525M, market cap $2.4B (IPO 2021)
- Insilico Medicine: Raised $400M, 30+ pipeline
- Recursion Pharmaceuticals: Raised $500M, robotic laboratory
- Schrödinger: Raised $532M, computational chemistry platform
- Best Practices for AI Drug Discovery
- ✅ Securing high-quality data (ChEMBL, in-house data)
- ✅ Integration with domain knowledge (chemist + data scientist)
- ✅ Iteration with experimental validation (wet lab feedback loop)
- ✅ Emphasis on interpretability (avoiding black box)
- ❌ Neglecting data quality (GIGO: Garbage In, Garbage Out)
- ❌ Overfitting (excessive adaptation to training data)
- ❌ Ignoring Applicability Domain (prediction reliability)
- ❌ Delayed experimental validation (in silico bias)
- Regulation and Ethics
- FDA/PMDA: Developing review guidelines for AI-designed drugs
- Data Privacy: Handling patient data (GDPR, HIPAA)
- Explainability: Accountability to regulatory authorities
- Bias: Training data bias, ensuring fairness
- Career Paths in AI Drug Discovery
- Positions: Postdoctoral researcher, assistant professor, associate professor
- Salary: ¥5-12M/year (Japan), $60-120K (US)
- Institutions: University of Tokyo, Kyoto University, MIT, Stanford
- Positions: Computational Chemist, AI Scientist, Drug Designer
- Salary: ¥8-20M/year (Japan), $80-250K (US)
- Companies: Pfizer, Roche, Exscientia, Insilico Medicine
- Risk/Return: High risk, high impact
- Salary: ¥6-15M/year + stock options
- Required skills: Technical + business + pitching
- Learning Resources
- Coursera: "Drug Discovery" (UC San Diego)
- edX: "Medicinal Chemistry" (Davidson College)
- Udacity: "AI for Healthcare"
- "Deep Learning for the Life Sciences" (O'Reilly)
- "Artificial Intelligence in Drug Discovery" (Royal Society of Chemistry)
- RDKit Users Group
- AI in Drug Discovery Conference
- ChEMBL Community
Learning Objectives
- ✅ Explain 5 AI drug discovery success cases with technical details
- ✅ Compare and evaluate AI strategies of major companies
- ✅ Understand best practices and pitfalls of AI drug discovery
- ✅ Recognize regulatory and ethical challenges and consider responses
- ✅ Plan career paths in AI drug discovery field
- ✅ Select resources for continuous learning
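The overfitting pitfall named in this chapter is usually checked with k-fold cross-validation: a model that scores well on its training folds but poorly on held-out folds is fitting noise. The sketch below generates shuffled k-fold index splits using only the standard library; the function name `kfold_indices` and the fixed seed are illustrative assumptions, and scikit-learn's `KFold` offers the same idea in practice.

```python
import random


def kfold_indices(n_samples: int, k: int, seed: int = 0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding by k gives k disjoint folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


splits = list(kfold_indices(n_samples=10, k=5))
for fold_number, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"fold {fold_number}: {len(train_idx)} train, {len(test_idx)} test")
```

For molecular data, a random split can be optimistic when close analogues land on both sides of the split; grouping compounds by scaffold before splitting is a common alternative.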
Overall Learning Outcomes
Upon completing this series, you will acquire the following skills and knowledge:
Knowledge Level (Understanding)
- ✅ Explain drug discovery processes and limitations of traditional methods
- ✅ Understand concepts of molecular representation, QSAR, and ADMET
- ✅ Grasp AI drug discovery industry trends and major players
- ✅ Detail 5 or more latest AI drug discovery case studies
Practical Skills (Doing)
- ✅ Load molecules, draw, and calculate descriptors using RDKit
- ✅ Retrieve bioactivity data using ChEMBL API
- ✅ Implement QSAR models (RF, SVM, NN, GNN)
- ✅ Build ADMET prediction models
- ✅ Execute actual drug discovery projects end-to-end
Application Ability (Applying)
- ✅ Design new drug discovery projects
- ✅ Evaluate industrial implementation cases and apply to your own research
- ✅ Plan AI drug discovery career path specifically
- ✅ Follow latest technology trends and continue learning
Recommended Learning Patterns
Pattern 1: Complete Mastery (For Drug Discovery Beginners)
Target: Those learning drug discovery for the first time, wanting systematic understanding
Duration: 2-3 weeks
Approach:
Week 1:
- Day 1-2: Chapter 1 (Drug discovery process and background)
- Day 3-4: Chapter 2 (MI methods)
- Day 5-7: Chapter 2 exercises, terminology review
Week 2:
- Day 1-2: Chapter 3 (RDKit basics, Examples 1-10)
- Day 3-4: Chapter 3 (ChEMBL & QSAR, Examples 11-23)
- Day 5-7: Chapter 3 (ADMET & GNN, Examples 24-30)
Week 3:
- Day 1-3: Chapter 3 (Project Challenge)
- Day 4-5: Chapter 4 (Case Studies)
- Day 6-7: Chapter 4 (Career plan creation)
Deliverables:
- COVID-19 protease inhibitor prediction project (ROC-AUC > 0.80)
- Personal career roadmap (3 months/1 year/3 years)
Pattern 2: Quick Learning (With Chemistry/Pharmacy Background)
Target: Those with chemistry/pharmacy basics wanting to acquire AI techniques
Duration: 1-2 weeks
Approach:
Day 1-2: Chapter 2 (MI methods, focusing on drug discovery specialization)
Day 3-5: Chapter 3 (Full code implementation)
Day 6: Chapter 3 (Project Challenge)
Day 7-8: Chapter 4 (Case Studies and Career)
Deliverables:
- QSAR model performance comparison report
- Project portfolio (GitHub publication recommended)
Pattern 3: Implementation Skills Enhancement (For Experienced ML Practitioners)
Target: Those with machine learning experience who want to apply it to the drug discovery domain
Duration: 3-5 days
Approach:
Day 1: Chapter 2 (Molecular representation and databases)
Day 2-3: Chapter 3 (Full code implementation)
Day 4: Chapter 3 (Project Challenge)
Day 5: Chapter 4 (Industrial application cases)
Deliverables:
- Drug discovery MI code library (reusable)
- ADMET prediction web app (Streamlit/Flask)
FAQ (Frequently Asked Questions)
Q1: Can I understand without chemistry knowledge?
A: Chapters 1 and 2 are easier to follow with basic chemistry knowledge (introductory organic chemistry, biochemistry), but it is not essential; important chemical concepts are explained as needed. For the Chapter 3 code, programming skills are sufficient because RDKit handles the chemical calculations. If you are concerned, we recommend reviewing high-school-level chemistry beforehand.
Q2: RDKit installation is difficult.
A: We recommend installing RDKit via conda:
```bash
conda create -n rdkit_env python=3.9
conda activate rdkit_env
conda install -c conda-forge rdkit
```
If you still have issues, use Google Colab (free, browser-only). You can install on Colab with `!pip install rdkit`.
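Once installation finishes, it helps to confirm that the packages are visible from the active environment. The helper below is a small sketch using only the standard library; the module names checked reflect the dependencies listed for Chapter 3, and the function name `is_installed` is illustrative.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Import names for the Chapter 3 dependencies (scikit-learn imports as "sklearn").
required = ("rdkit", "chembl_webresource_client", "pandas", "sklearn", "matplotlib")

for name in required:
    status = "OK" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If `rdkit` is reported as missing, the usual cause is that the script is running outside the `rdkit_env` environment created above.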
Q3: Can ChEMBL data be used commercially?
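A: ChEMBL data is released under the CC BY-SA 3.0 license, which permits commercial use provided that you give attribution and distribute derived data under the same license (share-alike). The share-alike condition can matter when ChEMBL data is combined with proprietary data, so check the ChEMBL License for details and consult your legal department before corporate use.
Q4: What's needed for an AI drug discovery job?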
A: The following skill set is required:
- Essential: Python, machine learning (scikit-learn, TensorFlow/PyTorch), RDKit
- Recommended: Chemistry/biology knowledge, QSAR experience, domain literature understanding
- Advantageous: GNN implementation experience, large-scale data processing, paper writing
- Build foundation with this series (2-4 weeks)
- Publish original projects on GitHub (3-6 months)
- Internship or collaborative research (6-12 months)
- Join industry (pharmaceutical companies, AI drug discovery startups) or academia
Q5: Are Graph Neural Networks essential?
A: Not essential today, but strongly recommended. Traditional QSAR models (Random Forest, SVM) can achieve sufficient performance, but GNNs offer these advantages:
- Learn directly from the molecular graph (atoms and bonds), and in 3D-aware variants from atomic coordinates
- Little or no manual feature engineering
- State-of-the-art (SOTA) results on many molecular property benchmarks
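To illustrate how a GNN operates on a molecule, the sketch below encodes the heavy atoms of ethanol (SMILES `CCO`) as a graph and runs one round of neighbour aggregation, the core step of a graph convolution. There are no learned weights here, and the one-hot atom features and sum aggregation are simplifying assumptions; libraries such as PyTorch Geometric provide trainable versions.

```python
# Heavy-atom graph of ethanol (SMILES "CCO"): atoms are nodes, bonds are edges.
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]

# One-hot atom features over a tiny element vocabulary.
vocabulary = ["C", "O"]
features = [[1.0 if atom == element else 0.0 for element in vocabulary] for atom in atoms]

# Undirected adjacency list built from the bond list.
neighbors = {i: [] for i in range(len(atoms))}
for i, j in bonds:
    neighbors[i].append(j)
    neighbors[j].append(i)


def message_pass(features, neighbors):
    """One round: each atom's new vector = its own vector + sum of neighbour vectors."""
    updated = []
    for i, own in enumerate(features):
        total = list(own)
        for j in neighbors[i]:
            total = [t + f for t, f in zip(total, features[j])]
        updated.append(total)
    return updated


hidden = message_pass(features, neighbors)

# Sum-pool the atom vectors into a single molecule-level vector.
molecule_vector = [sum(column) for column in zip(*hidden)]

print(hidden)
print(molecule_vector)
```

After one round, each atom vector already encodes its immediate chemical environment; a trained GNN stacks several such rounds with learned weight matrices and nonlinearities before pooling.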
Q6: Can I become an AI drug discovery expert with this series alone?
A: This series targets "beginner to intermediate" levels. To reach expert level:
- Build foundation with this series (2-4 weeks)
- Read papers intensively (*Journal of Medicinal Chemistry*, *Nature Biotechnology*) (3-6 months)
- Execute original projects (Kaggle drug discovery competitions, etc.) (6-12 months)
- Conference presentations or paper writing (1-2 years)
Next Steps
Recommended Actions After Series Completion
Immediate (Within 1-2 weeks):
- ✅ Create GitHub portfolio
- ✅ Publish Project Challenge results with README
- ✅ Add "AI Drug Discovery" skill to LinkedIn profile
- ✅ Participate in Kaggle drug discovery competitions (e.g., "Predicting Molecular Properties")
- ✅ Select one learning resource from Chapter 4 for deep dive
- ✅ Join RDKit Users Group, ask questions, discuss
- ✅ Execute own small-scale project (e.g., candidate molecule search for specific disease)
- ✅ Read 10 papers intensively (*Journal of Medicinal Chemistry*, *J. Chem. Inf. Model.*)
- ✅ Contribute to open-source projects (RDKit, DeepChem, etc.)
- ✅ Present at domestic conferences (Pharmaceutical Society of Japan, Medicinal Chemistry Society)
- ✅ Participate in internship or collaborative research
- ✅ Present at international conferences (ACS, EFMC)
- ✅ Submit peer-reviewed papers
- ✅ Work in AI drug discovery field (pharmaceutical companies or startups)
- ✅ Nurture next generation AI drug discovery researchers/engineers
Feedback and Support
About This Series
This series was created under Dr. Yusuke Hashimoto, Tohoku University, as part of the MI Knowledge Hub project.
Created: October 19, 2025
Version: 1.0
We Welcome Your Feedback
To improve this series, we welcome your feedback:
- Typos, errors, technical mistakes: Report via GitHub repository Issues
- Improvement suggestions: New topics, additional code examples desired, etc.
- Questions: Difficult parts to understand, sections needing additional explanation
- Success stories: Projects using what you learned from this series
License and Terms of Use
This series is published under the CC BY 4.0 (Creative Commons Attribution 4.0 International) license.
What's allowed:
- ✅ Free viewing and downloading
- ✅ Educational use (classes, study groups, etc.)
- ✅ Modification and derivative works (translation, summarization, etc.)
- 📌 Author credit required
- 📌 Modifications must be indicated
- 📌 Contact beforehand for commercial use
Let's Begin!
Ready? Start with Chapter 1 and begin your journey into the world of AI drug discovery!
Chapter 1: The Role of Materials Informatics in Drug Discovery →
Update History
- 2025-10-19: v1.0 Initial release
The journey to transform healthcare's future with AI drug discovery starts here!