Chapter 1: Protein Structure and Biomaterials

This chapter introduces protein structure and its role in biomaterials. You will learn what bioinformatics is, how to read structure files with Biopython, and how to evaluate the accuracy of AlphaFold2 predictions.

Understanding Biomolecular Shapes with PDB and AlphaFold

Learning Objectives

By reading this chapter, you will be able to:

✅ Understand the definition of bioinformatics and its applications to materials science
✅ Explain the hierarchical structure of proteins (primary to quaternary)
✅ Retrieve and analyze protein structures from the PDB database
✅ Read structure files and extract information using Biopython
✅ Evaluate AlphaFold2 prediction accuracy and assess practical utility
✅ Explain collagen structure and biomaterial applications

Reading Time: 25-30 minutes
Code Examples: 8
Exercises: 3

1.1 What is Bioinformatics?

Definition and Interdisciplinarity

Bioinformatics is an interdisciplinary field that analyzes biological data using computational and information sciences.

Figure (flowchart): Biology, Computer Science, and Materials Science feed into Bioinformatics, which in turn supports Biomaterial Design, Drug Delivery, and Biosensor Development.

Application Domains

1. Biomaterial Design
- Collagen artificial skin (tissue engineering)
- Silk fiber structure optimization (Spiber Inc.)
- Peptide hydrogels (drug delivery)

2. Drug Delivery Systems (DDS)
- Antibody therapeutics (targeted therapy)
- Nanoparticle carriers (cancer treatment)
- Liposomal formulations (vaccines)

3. Biosensor Development
- Glucose sensors (diabetes management)
- Antibody biosensors (COVID-19 diagnostics)
- DNA chips (genetic diagnostics)

Protein Structural Hierarchy

Protein structure is described at four levels:

Primary Structure
- Amino acid sequence
- Connected by peptide bonds
- Example: Gly-Pro-Ala-Ser-...

Secondary Structure
- Alpha helix (α-helix, helical structure)
- Beta sheet (β-sheet, planar structure)
- Loops

Tertiary Structure
- Three-dimensional folding of polypeptide chain
- Hydrophobic interactions, hydrogen bonds, disulfide bonds

Quaternary Structure
- Assembly of multiple subunits
- Example: Hemoglobin (4 subunits)

1.2 Utilizing PDB (Protein Data Bank)

What is PDB?

PDB (Protein Data Bank) is the world's largest database for publicly sharing three-dimensional structures of proteins and nucleic acids.

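Statistics (approximate, as of 2025):
- Registered structures: Over 200,000
- X-ray crystallography: the large majority (roughly 80-85%)
- Electron microscopy: roughly 10% and growing quickly
- NMR: roughly 6%

Exact shares change with every weekly release; the RCSB statistics pages give current figures.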

Access: https://www.rcsb.org/

PDB File Structure

PDB files are text-based and contain atomic coordinates with associated information.

Example 1: Basic PDB File Structure

# Main sections of a PDB file
"""
HEADER    Structure classification and date
TITLE     Brief description of structure
COMPND    Compound name
SOURCE    Biological species
ATOM      Atomic coordinates
HELIX     Alpha helix regions
SHEET     Beta sheet regions
CONECT    Bond information
"""

# Example of ATOM record
"""
ATOM      1  N   MET A   1      20.154  29.699   5.276  1.00 49.05           N
ATOM      2  CA  MET A   1      21.289  28.803   5.063  1.00 49.05           C
ATOM      3  C   MET A   1      21.628  28.123   6.377  1.00 49.05           C
"""

# Column meanings
"""
ATOM: Record type
1: Atom number
N: Atom name
MET: Amino acid (Methionine)
A: Chain ID
1: Residue number
20.154, 29.699, 5.276: x, y, z coordinates (Å)
1.00: Occupancy
49.05: B-factor (temperature factor)
N: Element symbol
"""

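Because the ATOM record is fixed-width, its fields can also be sliced by character position without any library. The following is a minimal sketch, assuming the standard PDB column ranges for ATOM records. The function name `parse_atom_line` is made up for illustration, and Biopython's parser remains the more robust choice for real files.

```python
def parse_atom_line(line):
    """Split one fixed-width PDB ATOM/HETATM record into named fields.

    Column ranges follow the PDB format (1-based, inclusive) and are
    converted here to Python's 0-based, end-exclusive slices.
    """
    return {
        "record": line[0:6].strip(),        # columns 1-6
        "serial": int(line[6:11]),          # columns 7-11
        "atom_name": line[12:16].strip(),   # columns 13-16
        "res_name": line[17:20].strip(),    # columns 18-20
        "chain_id": line[21],               # column 22
        "res_seq": int(line[22:26]),        # columns 23-26
        "x": float(line[30:38]),            # columns 31-38 (Å)
        "y": float(line[38:46]),            # columns 39-46 (Å)
        "z": float(line[46:54]),            # columns 47-54 (Å)
        "occupancy": float(line[54:60]),    # columns 55-60
        "b_factor": float(line[60:66]),     # columns 61-66
        "element": line[76:78].strip(),     # columns 77-78
    }


# First ATOM record from the example above
line = "ATOM      1  N   MET A   1      20.154  29.699   5.276  1.00 49.05           N"
atom = parse_atom_line(line)

print(atom["atom_name"], atom["res_name"], atom["chain_id"], atom["res_seq"])
print(atom["x"], atom["y"], atom["z"], atom["b_factor"], atom["element"])
```

Slicing by column rather than splitting on whitespace matters because adjacent fields can touch (for example, large coordinates or four-character atom names), which would break a naive `line.split()`.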
Reading PDB Files with Biopython

Example 2: Reading PDB Structure and Extracting Basic Information

from Bio.PDB import PDBParser
import warnings

# Suppress warnings
warnings.filterwarnings('ignore')

# Initialize PDB parser
parser = PDBParser(QUIET=True)

# Read PDB file (example: 1UBQ - Ubiquitin)
structure = parser.get_structure('ubiquitin', '1ubq.pdb')

# Display basic information
print("=== Basic Structure Information ===")
print(f"Structure name: {structure.id}")
print(f"Number of models: {len(structure)}")

# Get first model
model = structure[0]
print(f"Number of chains: {len(model)}")

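# Information for each chain
for chain in model:
    # Keep standard residues only; crystallographic waters and ligands
    # are stored as hetero residues (residue.id[0] != ' ')
    residues = [res for res in chain if res.id[0] == ' ']
    print(f"Chain {chain.id}: {len(residues)} residues")

    # Count atoms
    atom_count = sum(1 for residue in residues for atom in residue)
    print(f"  Number of atoms: {atom_count}")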

Output Example:

=== Basic Structure Information ===
Structure name: ubiquitin
Number of models: 1
Number of chains: 1
Chain A: 76 residues
  Number of atoms: 602

Atomic Coordinates and Distance Calculation

Example 3: Extracting Atomic Coordinates and Calculating Distances

# Requirements:
# - Python 3.9+
# - biopython>=1.81
# - numpy>=1.24.0, <2.0.0

"""
Example 3: Extracting Atomic Coordinates and Calculating Distances

Purpose: Read atomic coordinates from a residue and measure an interatomic distance
Target: Beginner to Intermediate
Execution time: ~5 seconds
Dependencies: Biopython, NumPy, and a local copy of 1ubq.pdb
"""

from Bio.PDB import PDBParser
import numpy as np

parser = PDBParser(QUIET=True)
structure = parser.get_structure('ubiquitin', '1ubq.pdb')
model = structure[0]
chain = model['A']

# Get specific residue (isoleucine 13; residue 10 of ubiquitin is a
# glycine, which has no CB atom)
residue = chain[13]

print(f"=== Residue {residue.get_resname()} {residue.id[1]} ===")

# Get atomic coordinates
for atom in residue:
    coord = atom.get_coord()
    print(f"Atom {atom.name}: "
          f"({coord[0]:.2f}, {coord[1]:.2f}, {coord[2]:.2f})")

# Calculate distance between two atoms
atom1 = residue['CA']  # Alpha carbon
atom2 = residue['CB']  # Beta carbon

distance = atom1 - atom2  # Bio.PDB automatically calculates distance
print(f"\nCA-CB distance: {distance:.2f} Å")

# Manual distance calculation
coord1 = atom1.get_coord()
coord2 = atom2.get_coord()
manual_distance = np.linalg.norm(coord1 - coord2)
print(f"Manual calculation: {manual_distance:.2f} Å")

Output Example (coordinate values are shown as placeholders; both distances should be about 1.5 Å, the typical C-C single bond length):

=== Residue ILE 13 ===
Atom N: (x, y, z)
Atom CA: (x, y, z)
Atom C: (x, y, z)
Atom O: (x, y, z)
Atom CB: (x, y, z)
Atom CG1: (x, y, z)
Atom CG2: (x, y, z)
Atom CD1: (x, y, z)

CA-CB distance: ~1.5 Å
Manual calculation: ~1.5 Å

Retrieving Secondary Structure

Example 4: DSSP (Secondary Structure Analysis)

from Bio.PDB import PDBParser, DSSP
import warnings
warnings.filterwarnings('ignore')

# Read structure
parser = PDBParser(QUIET=True)
structure = parser.get_structure('protein', '1ubq.pdb')
model = structure[0]

# Run DSSP (Define Secondary Structure of Proteins)
# Note: Requires dssp executable (conda install -c salilab dssp)
try:
    dssp = DSSP(model, '1ubq.pdb', dssp='mkdssp')

    print("=== Secondary Structure Analysis ===")
    print("Residue | AA | Secondary Structure | Relative ASA")
    print("-" * 45)

    # Information for each residue
    for key in list(dssp.keys())[:20]:  # First 20 residues
        residue_info = dssp[key]
        chain_id = key[0]
        res_id = key[1][1]
        aa = residue_info[1]
        ss = residue_info[2]  # Secondary structure
        # Index 3 is the relative accessible surface area (a fraction of
        # the residue's maximum exposure), not an absolute area in Å²
        rel_asa = residue_info[3]

        # Secondary structure symbols
        # H: Alpha helix, B: Beta bridge,
        # E: Beta strand, G: 3-10 helix
        # I: Pi helix, T: Turn, S: Bend
        # -: Coil

        print(f"{res_id:4d} | {aa:3s} | {ss:^12s} | "
              f"{rel_asa:6.2f}")

except FileNotFoundError:
    print("DSSP not installed")
    print("Install: conda install -c salilab dssp")

1.3 Utilizing AlphaFold

The AlphaFold2 Revolution

AlphaFold2 (DeepMind, 2020) is a deep learning model that predicts protein three-dimensional structures from amino acid sequences with high accuracy.

Achievements:
- Decisive first place in CASP14, the community-wide blind structure prediction assessment
- Prediction accuracy: median backbone RMSD of about 1 Å against experimental structures on CASP14 targets (Jumper et al., 2021), close to experimental accuracy for many proteins
- Prediction time: Minutes to hours (compared to months to years for traditional experiments)
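RMSD (root-mean-square deviation) is the accuracy measure quoted above. As a rough illustration of how it is computed, the sketch below superimposes two matched coordinate sets with the Kabsch algorithm and then takes the RMSD. The function name `kabsch_rmsd` and the toy coordinates are assumptions made for this example; for real structures, Biopython's `Superimposer` performs the same kind of fit on atom lists.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate arrays after optimal superposition.

    Row i of P must correspond to row i of Q (e.g. matching CA atoms).
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)

    # Remove translation by centering both sets on their centroids
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)

    # Optimal rotation from the SVD of the covariance matrix
    H = Pc.T @ Qc
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # avoid an improper rotation (reflection)
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    P_rot = Pc @ R.T  # rotate P onto Q
    return float(np.sqrt(np.mean(np.sum((P_rot - Qc) ** 2, axis=1))))


# Toy coordinates (made up): Q is P rotated by 30 degrees about z and shifted
P = np.array([[0.0, 0.0, 0.0],
              [1.5, 0.0, 0.0],
              [1.5, 1.5, 0.0],
              [0.0, 1.5, 1.0],
              [0.5, 0.2, 2.0]])
theta = np.deg2rad(30.0)
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])
Q = P @ Rz.T + np.array([10.0, -5.0, 3.0])

# Displace one atom so the two shapes genuinely differ
Q_noisy = Q.copy()
Q_noisy[0] += np.array([0.5, 0.0, 0.0])

print(f"Identical shapes: {kabsch_rmsd(P, Q):.4f} Å")
print(f"One atom moved:   {kabsch_rmsd(P, Q_noisy):.4f} Å")
```

Superposition comes first because RMSD is only meaningful once rigid-body translation and rotation have been removed; otherwise two identical structures in different orientations would appear to differ.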

AlphaFold Database

The AlphaFold Protein Structure Database publicly shares over 200 million predicted protein structures.

Access: https://alphafold.ebi.ac.uk/

Example 5: Downloading AlphaFold Predicted Structures

# Requirements:
# - Python 3.9+
# - requests>=2.31.0

import requests

def download_alphafold_structure(uniprot_id, output_file):
    """
    Download PDB structure from AlphaFold Database

    Parameters:
    -----------
    uniprot_id : str
        UniProt ID (e.g., P68871)
    output_file : str
        Output filename

    Returns True on success, False on any request error.
    """
    # AlphaFold DB URL (the model version suffix changes with database
    # releases; adjust "v4" if the server returns 404)
    base_url = "https://alphafold.ebi.ac.uk/files/"
    pdb_url = f"{base_url}AF-{uniprot_id}-F1-model_v4.pdb"

    print(f"Downloading: {pdb_url}")

    try:
        response = requests.get(pdb_url, timeout=30)
        response.raise_for_status()

        with open(output_file, 'w') as f:
            f.write(response.text)

        print(f"Saved: {output_file}")
        return True
    except requests.exceptions.RequestException as e:
        print(f"Error: {e}")
        return False

# Example: Download hemoglobin beta chain (UniProt P68871;
# P69905 is the alpha chain)
download_alphafold_structure('P68871', 'hemoglobin_beta.pdb')

Evaluating Prediction Confidence (pLDDT)

pLDDT (predicted Local Distance Difference Test) is AlphaFold's prediction confidence score (0-100).

Interpretation:
- 90-100: Very high confidence (often comparable to experimental structures)
- 70-90: High confidence (backbone generally accurate)
- 50-70: Low confidence (often loop regions; interpret with caution)
- < 50: Very low confidence (often intrinsically disordered regions)
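The bands above translate directly into a small lookup function. This is a minimal sketch: the function name `plddt_band`, the choice to treat each lower bound as inclusive, and the example scores are all assumptions for illustration.

```python
from collections import Counter


def plddt_band(score):
    """Map a pLDDT score (0-100) to the confidence band used in this chapter."""
    if not 0 <= score <= 100:
        raise ValueError(f"pLDDT must be between 0 and 100, got {score}")
    if score >= 90:
        return "very high"
    if score >= 70:
        return "high"
    if score >= 50:
        return "low"
    return "very low"


# Made-up scores, one per residue
example_scores = [96.1, 88.4, 72.0, 65.3, 41.7, 90.0]
summary = Counter(plddt_band(s) for s in example_scores)

for band in ("very high", "high", "low", "very low"):
    print(f"{band:>9}: {summary[band]} residues")
```

Counting residues per band gives a quick overview of a model before plotting the full per-residue profile.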

Example 6: Extracting and Visualizing pLDDT

# Requirements:
# - Python 3.9+
# - biopython>=1.81
# - matplotlib>=3.7.0
# - numpy>=1.24.0, <2.0.0

"""
Example 6: Extracting and Visualizing pLDDT

Purpose: Read per-residue pLDDT from the B-factor column and plot it
Target: Beginner to Intermediate
Execution time: 2-5 seconds
Dependencies: Biopython, Matplotlib, NumPy, and hemoglobin_beta.pdb from Example 5
"""

from Bio.PDB import PDBParser
import matplotlib.pyplot as plt
import numpy as np

# Read AlphaFold structure
parser = PDBParser(QUIET=True)
structure = parser.get_structure('alphafold', 'hemoglobin_beta.pdb')

# pLDDT is stored in the B-factor column
model = structure[0]
chain = model['A']

residue_numbers = []
plddt_scores = []

for residue in chain:
    if residue.id[0] == ' ':  # Only standard residues
        residue_numbers.append(residue.id[1])
        # Get B-factor of CA atom (pLDDT)
        ca_atom = residue['CA']
        plddt_scores.append(ca_atom.bfactor)

# Visualization
plt.figure(figsize=(12, 5))
plt.plot(residue_numbers, plddt_scores, linewidth=2)
plt.axhline(y=90, color='green', linestyle='--',
            label='Very High Confidence (>90)')
plt.axhline(y=70, color='orange', linestyle='--',
            label='High Confidence (>70)')
plt.axhline(y=50, color='red', linestyle='--',
            label='Low Confidence (>50)')

plt.xlabel('Residue Number', fontsize=12)
plt.ylabel('pLDDT Score', fontsize=12)
plt.title('AlphaFold Prediction Confidence (Hemoglobin Beta Chain)',
          fontsize=14)
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('plddt_plot.png', dpi=300)
plt.show()

# Statistical information
print(f"=== pLDDT Statistics ===")
print(f"Mean: {np.mean(plddt_scores):.1f}")
print(f"Median: {np.median(plddt_scores):.1f}")
print(f"Minimum: {np.min(plddt_scores):.1f}")
print(f"Maximum: {np.max(plddt_scores):.1f}")
print(f"High confidence region (>70): "
      f"{100 * np.sum(np.array(plddt_scores) > 70) / len(plddt_scores):.1f}%")

1.4 Case Study: Collagen Structure Analysis

What is Collagen?

Collagen is the most abundant protein in mammals and is the main component of skin, bone, and cartilage.

Structural Features:
- Triple helix structure (3 polypeptide chains)
- Repeating sequence: (Gly-X-Y)n, where X is often proline and Y is often hydroxyproline
- Full length: Approximately 300 nm
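The (Gly-X-Y)n repeat can be checked on a plain sequence string before touching any structure file. The sketch below is one possible approach, not a standard tool: the function name `gly_xy_fraction` is made up, and it tries all three reading frames because a peptide may start at any position of the triplet (for example, Pro-Pro-Gly puts glycine third).

```python
def gly_xy_fraction(sequence):
    """Return (offset, fraction): the reading frame (0, 1, or 2) in which
    glycine most often occupies every third position, and the fraction of
    positions in that frame that are glycine."""
    best_offset, best_fraction = 0, 0.0
    for offset in range(3):
        positions = range(offset, len(sequence), 3)
        if len(positions) == 0:
            continue
        hits = sum(1 for i in positions if sequence[i] == "G")
        fraction = hits / len(positions)
        if fraction > best_fraction:
            best_offset, best_fraction = offset, fraction
    return best_offset, best_fraction


# (Pro-Pro-Gly)10, a common collagen model peptide sequence
print(gly_xy_fraction("PPG" * 10))

# Start of ubiquitin, a globular protein without the repeat
print(gly_xy_fraction("MQIFVKTLTG"))
```

A fraction near 1.0 in one frame is consistent with a collagen-like repeat; globular proteins typically score far lower.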

Biomaterial Applications:
- Artificial skin (burn treatment)
- Bone regeneration materials
- Cosmetic medicine (collagen injections)

Retrieving Collagen Structure from PDB

Example 7: Collagen Structure Analysis

# Requirements:
# - Python 3.9+
# - biopython>=1.81
# - numpy>=1.24.0, <2.0.0

"""
Example 7: Collagen Structure Analysis

Purpose: Measure inter-chain distances and rise per residue in a collagen-like triple helix
Target: Intermediate
Execution time: 2-5 seconds
Dependencies: Biopython, NumPy, and a local copy of 1k6f.pdb
"""

from Bio.PDB import PDBParser
import numpy as np

# Collagen-like peptide structure (1K6F)
parser = PDBParser(QUIET=True)
structure = parser.get_structure('collagen', '1k6f.pdb')

model = structure[0]

print("=== Collagen Structure Analysis ===")
print(f"Number of chains: {len(model)}")

# Length of each chain
for chain in model:
    residue_count = len([res for res in chain if res.id[0] == ' '])
    print(f"Chain {chain.id}: {residue_count} residues")

# Confirming triple helix structure: inter-chain distances
chainA = model['A']
chainB = model['B']
chainC = model['C']

# Get CA atom of central residue in each chain
def get_central_ca(chain):
    residues = [res for res in chain if res.id[0] == ' ']
    central_idx = len(residues) // 2
    return residues[central_idx]['CA']

ca_A = get_central_ca(chainA)
ca_B = get_central_ca(chainB)
ca_C = get_central_ca(chainC)

# Inter-chain distances
dist_AB = ca_A - ca_B
dist_BC = ca_B - ca_C
dist_CA = ca_C - ca_A

print(f"\n=== Triple Helix Inter-Chain Distances ===")
print(f"A-B: {dist_AB:.2f} Å")
print(f"B-C: {dist_BC:.2f} Å")
print(f"C-A: {dist_CA:.2f} Å")
print(f"Average: {np.mean([dist_AB, dist_BC, dist_CA]):.2f} Å")

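# Estimate the rise per residue along chain A.
# Crystal coordinates are generally not aligned with the z axis, so use the
# straight-line CA-CA distance between two residues as an approximation
# instead of the difference in z.
# (Adjust the residue numbers if they are missing from the chain.)
first_ca = chainA[7]['CA']
last_ca = chainA[20]['CA']
span = last_ca - first_ca  # Distance in Å
num_residues = 20 - 7

rise_per_residue = span / num_residues
print(f"\nHelix rise (per residue): {rise_per_residue:.2f} Å")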

Output Example (illustrative; exact values depend on the deposited coordinates):

=== Collagen Structure Analysis ===
Number of chains: 3
Chain A: 27 residues
Chain B: 27 residues
Chain C: 27 residues

=== Triple Helix Inter-Chain Distances ===
A-B: 10.2 Å
B-C: 10.1 Å
C-A: 10.3 Å
Average: 10.2 Å

Helix rise (per residue): 2.9 Å

Collagen Sequence Pattern Analysis

Example 8: Validating Gly-X-Y Repeats

from Bio.PDB import PDBParser
from collections import Counter

# 3-letter to 1-letter codes (simplified version).
# HYP (hydroxyproline) is written as 'O'.
AA_DICT = {
    'GLY': 'G', 'PRO': 'P', 'ALA': 'A',
    'VAL': 'V', 'LEU': 'L', 'ILE': 'I',
    'MET': 'M', 'PHE': 'F', 'TYR': 'Y',
    'TRP': 'W', 'SER': 'S', 'THR': 'T',
    'CYS': 'C', 'ASN': 'N', 'GLN': 'Q',
    'ASP': 'D', 'GLU': 'E', 'LYS': 'K',
    'ARG': 'R', 'HIS': 'H', 'HYP': 'O'
}

parser = PDBParser(QUIET=True)
structure = parser.get_structure('collagen', '1k6f.pdb')
chain = structure[0]['A']

# Get amino acid sequence
sequence = []
for residue in chain:
    resname = residue.get_resname()
    # Keep standard residues, plus hydroxyproline, which PDB files
    # usually store as a hetero residue
    if residue.id[0] == ' ' or resname == 'HYP':
        sequence.append(AA_DICT.get(resname, 'X'))  # 'X' = unknown

seq_str = ''.join(sequence)
print(f"Sequence: {seq_str}")

# Validate Gly-X-Y pattern
gly_positions = [i for i, aa in enumerate(sequence) if aa == 'G']
print(f"\nGly positions: {gly_positions}")

# Check if Gly appears every 3 residues. The repeat can start in any of
# three frames (Pro-Pro-Gly puts Gly at index 2), so group the Gly
# positions by index modulo 3 and report the dominant frame.
if gly_positions:
    frame_counts = Counter(pos % 3 for pos in gly_positions)
    best_frame, match_count = frame_counts.most_common(1)[0]
    print(f"Gly-X-Y pattern match rate: "
          f"{100 * match_count / len(gly_positions):.1f}% "
          f"(frame offset {best_frame})")
else:
    print("No glycine residues found")

# Amino acid composition
aa_composition = Counter(sequence)
print(f"\nAmino acid composition:")
for aa, count in aa_composition.most_common():
    percentage = 100 * count / len(sequence)
    print(f"  {aa}: {count} ({percentage:.1f}%)")

1.5 Chapter Summary

What We Learned

Definition of Bioinformatics
- Biology × Computer Science × Materials Science
- Application domains: Biomaterials, DDS, biosensors

Protein Structural Hierarchy
- Primary structure (sequence) → Secondary structure (α-helix, β-sheet)
- Tertiary structure (folding) → Quaternary structure (complex)

PDB Database
- Over 200,000 structural data entries
- Analysis methods with Biopython

AlphaFold2
- High-accuracy structure prediction from sequence
- Reliability assessment using pLDDT

Collagen Structure
- Triple helix structure
- Gly-X-Y repeating pattern
- Biomaterial applications

Key Points

✅ Protein structure determines function
✅ PDB is a treasure trove of experimental structures
✅ AlphaFold has revolutionized prediction
✅ Biopython makes structure analysis easy
✅ Collagen is an important biomaterial

To the Next Chapter

In Chapter 2, we will learn about sequence analysis and machine learning:
- Homology sequence search with BLAST
- Feature extraction from sequences
- Functional prediction using machine learning
- Case study: Enzyme activity prediction


Exercises

Exercise 1 (Difficulty: Easy)

Extract specific information from a PDB file.

Task:
1. Download PDB file (1UBQ)
2. Calculate total number of atoms
3. Aggregate counts for each element (C, N, O, S)

Hint

- Use `atom.element` to get element symbol
- Use Counter for aggregation

Solution Example

from Bio.PDB import PDBParser
from collections import Counter

parser = PDBParser(QUIET=True)
structure = parser.get_structure('ubiquitin', '1ubq.pdb')

elements = []
for model in structure:
    for chain in model:
        for residue in chain:
            # Skip waters and other hetero residues so that only
            # protein atoms are counted
            if residue.id[0] != ' ':
                continue
            for atom in residue:
                elements.append(atom.element)

print(f"Total atoms: {len(elements)}")
print("\nElement composition:")
for elem, count in Counter(elements).most_common():
    print(f"  {elem}: {count}")

Output Example:

Total atoms: 602

Element composition:
  C: 378
  O: 118
  N: 105
  S: 1

Exercise 2 (Difficulty: Medium)

Analyze the pLDDT scores of an AlphaFold predicted structure and identify regions of high and low confidence.

Solution Example

# Requirements:
# - Python 3.9+
# - biopython>=1.81
# - numpy>=1.24.0, <2.0.0

"""
Exercise 2: Analyze the pLDDT scores of an AlphaFold predicted structure

Purpose: Sort residues into high- and low-confidence groups by pLDDT
Target: Beginner to Intermediate
Execution time: ~5 seconds
Dependencies: Biopython and a local AlphaFold model (alphafold_structure.pdb)
"""

from Bio.PDB import PDBParser
import numpy as np

parser = PDBParser(QUIET=True)
structure = parser.get_structure('alphafold', 'alphafold_structure.pdb')

chain = structure[0]['A']
high_confidence = []
low_confidence = []

for residue in chain:
    if residue.id[0] == ' ':
        res_id = residue.id[1]
        plddt = residue['CA'].bfactor

        if plddt >= 90:
            high_confidence.append(res_id)
        elif plddt < 70:
            low_confidence.append(res_id)

print(f"High confidence region (pLDDT >= 90): {len(high_confidence)} residues")
if high_confidence:  # min()/max() raise ValueError on an empty list
    print(f"Range: {min(high_confidence)}-{max(high_confidence)}")

print(f"\nLow confidence region (pLDDT < 70): {len(low_confidence)} residues")
if low_confidence:
    print(f"Range: {min(low_confidence)}-{max(low_confidence)}")
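A single min-max range can hide the fact that low-confidence residues often fall into several separate stretches. As a possible extension of the solution, the sketch below groups consecutive residue numbers into segments. The function name `low_confidence_segments`, the default threshold of 70, and the example scores are assumptions for illustration.

```python
def low_confidence_segments(residue_numbers, plddt_scores, threshold=70.0):
    """Group consecutive residues with pLDDT below threshold into
    (start, end) segments, using the residue numbers from the structure."""
    segments = []
    start = prev = None
    for num, score in zip(residue_numbers, plddt_scores):
        if score < threshold:
            if start is None:
                start = num                      # open a new segment
            elif num != prev + 1:
                segments.append((start, prev))   # numbering gap: close and reopen
                start = num
            prev = num
        elif start is not None:
            segments.append((start, prev))       # confident residue ends the segment
            start = prev = None
    if start is not None:
        segments.append((start, prev))           # segment running to the chain end
    return segments


# Made-up scores for residues 1-10
numbers = list(range(1, 11))
scores = [45, 48, 82, 91, 95, 93, 66, 62, 58, 88]

for first, last in low_confidence_segments(numbers, scores):
    print(f"Low confidence: residues {first}-{last}")
```

With the per-residue lists built in Example 6 (`residue_numbers` and `plddt_scores`), the same function can be applied to a real model.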

Exercise 3 (Difficulty: Hard)

Analyze the triple helix structure of collagen in detail.

Task:
1. Extract CA atom coordinates from 3 chains
2. Calculate helix radius
3. Estimate helix pitch (rise per turn)

Hint

- Find the helix central axis (a line through the center of mass of the 3 chains, pointing along their long direction)
- Distance from each CA atom to central axis = radius
- Pitch = (rise along the helix axis) / (number of turns)

Solution Example

# Requirements:
# - Python 3.9+
# - biopython>=1.81
# - numpy>=1.24.0, <2.0.0

"""
Exercise 3: Triple helix geometry of a collagen-like peptide

Purpose: Extract CA coordinates from 3 chains, then estimate helix radius and pitch
Target: Intermediate
Execution time: 10-30 seconds
Dependencies: Biopython, NumPy, and a local copy of 1k6f.pdb
"""

from Bio.PDB import PDBParser
import numpy as np

parser = PDBParser(QUIET=True)
structure = parser.get_structure('collagen', '1k6f.pdb')
model = structure[0]

# Get CA coordinates from 3 chains
chains_coords = {}
for chain_id in ['A', 'B', 'C']:
    chain = model[chain_id]
    coords = []
    for residue in chain:
        if residue.id[0] == ' ':
            coords.append(residue['CA'].get_coord())
    chains_coords[chain_id] = np.array(coords)

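# Helix axis: a line through the centroid of all CA atoms, pointing along
# the direction of largest spread (first principal component via SVD).
# Crystal coordinates are generally not aligned with the z axis.
all_coords = np.vstack(list(chains_coords.values()))
center = np.mean(all_coords, axis=0)
_, _, vt = np.linalg.svd(all_coords - center, full_matrices=False)
axis = vt[0]  # Unit vector along the helix axis

# Helix radius (perpendicular distance of each CA atom from the axis)
rel = all_coords - center
along = rel @ axis
radii = np.linalg.norm(rel - np.outer(along, axis), axis=1)

print(f"Helix radius: {np.mean(radii):.2f} ± "
      f"{np.std(radii):.2f} Å")

# Pitch (rise along the axis per turn), estimated from chain A
proj_A = (chains_coords['A'] - center) @ axis
axial_rise = abs(proj_A[-1] - proj_A[0])
num_steps = len(chains_coords['A']) - 1

# Assumption: each collagen chain makes roughly one turn per 3.3 residues
# (10/3 helix model; a 7/2 model gives 3.5)
RESIDUES_PER_TURN = 3.3
turns = num_steps / RESIDUES_PER_TURN
pitch = axial_rise / turns

print(f"Helix pitch: {pitch:.2f} Å/turn")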
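Before trusting radius and rise values from a real structure, it can help to run the geometry on a synthetic helix whose parameters are known. The sketch below does this with a principal-axis fit (SVD) for the helix axis. The function name `helix_axis_radius_rise` and the helix parameters are assumptions chosen for the test and are not measured collagen values.

```python
import numpy as np


def helix_axis_radius_rise(coords):
    """Estimate the axis direction, mean radius, and rise per residue of a
    helix from an ordered (N, 3) array of coordinates."""
    coords = np.asarray(coords, dtype=float)
    center = coords.mean(axis=0)
    rel = coords - center

    # Axis = direction of largest spread of the points
    _, _, vt = np.linalg.svd(rel, full_matrices=False)
    axis = vt[0]

    along = rel @ axis                      # position of each point along the axis
    perp = rel - np.outer(along, axis)      # component perpendicular to the axis
    radius = np.linalg.norm(perp, axis=1).mean()
    rise = abs(along[-1] - along[0]) / (len(coords) - 1)
    return axis, radius, rise


# Synthetic helix along z with known parameters (illustrative values)
n_points, true_radius, true_rise, per_turn = 40, 1.5, 2.9, 4
i = np.arange(n_points)
angles = 2 * np.pi * i / per_turn
helix = np.column_stack([true_radius * np.cos(angles),
                         true_radius * np.sin(angles),
                         true_rise * i])

axis, radius, rise = helix_axis_radius_rise(helix)
print(f"Recovered radius: {radius:.2f} Å (expected {true_radius})")
print(f"Recovered rise:   {rise:.2f} Å (expected {true_rise})")
```

The principal-axis fit works best when the helix is much longer than it is wide; for short or strongly bent helices the estimated axis can tilt and the radius becomes less reliable.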

Data Licenses and Citations

Structure Databases

Biological databases used in code examples in this chapter:

1. PDB (Protein Data Bank)

License: CC0 1.0 Public Domain
Citation: Berman, H. M. et al. (2000). "The Protein Data Bank." Nucleic Acids Research, 28(1), 235-242.
Access: https://www.rcsb.org/
Purpose: Experimental structure data for proteins and nucleic acids
Data Examples: 1UBQ (Ubiquitin), 1K6F (Collagen-like peptide)

2. AlphaFold Protein Structure Database

License: CC BY 4.0
Citation: Jumper, J. et al. (2021). "Highly accurate protein structure prediction with AlphaFold." Nature, 596(7873), 583-589.
Access: https://alphafold.ebi.ac.uk/
Purpose: AlphaFold2 structure prediction data (over 200 million proteins)
API: https://alphafold.ebi.ac.uk/api/

3. UniProt

License: CC BY 4.0
Citation: The UniProt Consortium. (2023). "UniProt: the Universal Protein Knowledgebase in 2023." Nucleic Acids Research, 51(D1), D523-D531.
Access: https://www.uniprot.org/
Purpose: Protein sequences and annotation information

Library Licenses

Library	Version	License	Purpose
Biopython	1.81+	BSD-3-Clause	Structure analysis/sequence manipulation
NumPy	1.24+	BSD-3-Clause	Numerical computation
Matplotlib	3.7+	PSF (BSD-like)	Visualization
requests	2.31+	Apache 2.0	Data download

License Compliance:
- All allow commercial use
- Proper citations required in academic papers
- Always cite sources when using PDB/AlphaFold structures

Code Reproducibility

Environment Setup

Installing Required Packages

# requirements.txt
biopython==1.81
numpy==1.24.3
matplotlib==3.7.1
requests==2.31.0
scipy==1.11.1

# Install
pip install -r requirements.txt

Optional Tools

# DSSP (secondary structure analysis)
conda install -c salilab dssp

# PyMOL (structure visualization)
conda install -c conda-forge pymol-open-source

Obtaining PDB Files

# Reproducible PDB download
from Bio.PDB import PDBList

def download_pdb_reproducible(pdb_id, output_dir='./pdb_files'):
    """
    Download PDB structure reproducibly

    Parameters:
    -----------
    pdb_id : str
        PDB ID (e.g., '1ubq')
    output_dir : str
        Save directory
    """
    pdbl = PDBList()
    filename = pdbl.retrieve_pdb_file(
        pdb_id,
        pdir=output_dir,
        file_format='pdb'
    )

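    print(f"Downloaded: {pdb_id} → {filename}")
    return filename

# Usage example
# Note: with file_format='pdb', Biopython saves files as pdb<id>.ent
# (e.g. ./pdb_files/pdb1ubq.ent). Pass the returned path to
# PDBParser.get_structure(), or rename the file to 1ubq.pdb.
ubq_path = download_pdb_reproducible('1ubq')
collagen_path = download_pdb_reproducible('1k6f')  # Collagen-like peptide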

Obtaining Fixed Version AlphaFold Structures

import requests

def download_alphafold_with_version(uniprot_id, version='v4'):
    """
    Download AlphaFold structure with specific version

    Parameters:
    -----------
    uniprot_id : str
        UniProt ID
    version : str
        AlphaFold version (v1-v4)
    """
    base_url = "https://alphafold.ebi.ac.uk/files/"
    pdb_url = f"{base_url}AF-{uniprot_id}-F1-model_{version}.pdb"

    response = requests.get(pdb_url)
    response.raise_for_status()

    filename = f"AF_{uniprot_id}_{version}.pdb"
    with open(filename, 'w') as f:
        f.write(response.text)

    print(f"Downloaded AlphaFold {version}: {filename}")
    return filename

Common Pitfalls and Solutions

1. PDB File Parsing Errors

Problem: Errors due to format differences or missing atoms in PDB files

Symptom:

# NG example
structure = parser.get_structure('protein', 'broken.pdb')
ca_atom = structure[0]['A'][1]['CA']
# KeyError: 'CA'  (raised when the residue has no CA atom)

Solution:

# OK example: With error handling
from Bio.PDB import PDBParser

parser = PDBParser(QUIET=True)
structure = parser.get_structure('protein', 'protein.pdb')

for chain in structure[0]:
    for residue in chain:
        try:
            ca_atom = residue['CA']
            coord = ca_atom.get_coord()
        except KeyError:
            # Skip residues without CA atom (waters, ligands, incomplete residues)
            print(f"Warning: No CA atom in residue {residue.id}")
            continue

2. Misinterpreting AlphaFold pLDDT

Problem: Misunderstanding pLDDT score as absolute accuracy

NG:

# Assuming all pLDDT > 70 is reliable
if plddt > 70:
    use_structure()  # Dangerous!

OK:

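# Setting thresholds based on application
# (plddt and use_structure() are placeholders, as in the NG example)
PLDDT_THRESHOLDS = {
    'drug_design': 90,        # High precision required
    'homology_modeling': 70,  # Medium acceptable
}

application = 'drug_design'
threshold = PLDDT_THRESHOLDS[application]

if plddt > threshold:
    use_structure()

Recommended Standards (rules of thumb, not hard limits):
- Docking studies: pLDDT > 90
- Homology modeling: pLDDT > 70
- Structure prediction validation: pLDDT > 50

pLDDT is a per-residue, local confidence measure. A high pLDDT does not guarantee that domains are placed correctly relative to each other; check the PAE map for that.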

3. Confusion of Coordinate System Units

Problem: Confusion between Å (Angstrom) and nm

NG:

distance = atom1 - atom2  # Å units
if distance < 5:  # nm? Å? Unclear
    print("Close contact")

OK:

distance_angstrom = atom1 - atom2  # Explicitly Å units
CONTACT_THRESHOLD_A = 5.0  # Å

if distance_angstrom < CONTACT_THRESHOLD_A:
    print(f"Close contact: {distance_angstrom:.2f} Å")

# Convert to nm if needed
distance_nm = distance_angstrom / 10.0

4. DSSP Execution Errors

Problem: DSSP not installed or path not set

Symptom:

dssp = DSSP(model, 'protein.pdb')
# FileNotFoundError: dssp executable not found

Solution:

import shutil
from Bio.PDB import DSSP

# Check for DSSP executable
dssp_path = shutil.which('mkdssp') or shutil.which('dssp')

if dssp_path:
    dssp = DSSP(model, 'protein.pdb', dssp=dssp_path)
else:
    print("DSSP not installed. Please run:")
    print("  conda install -c salilab dssp")
    # Switch to alternative method
    use_alternative_ss_prediction()

5. Memory Issues (Large Structures)

Problem: Memory errors with huge complexes (viruses, ribosomes, etc.)

Solution:

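# PDBParser always reads the whole file, so deleting a chain variable
# does not free the structure. The savings come from keeping per-chain
# results small. (analyze_chain() is a placeholder for your own function.)
# Process by chain
for chain in structure[0]:
    # Analyze per chain and keep only the summary you need
    analyze_chain(chain)

# Or restrict the analysis to specific chains
chain_selection = ['A', 'B']  # Only needed chains
for chain_id in chain_selection:
    chain = structure[0][chain_id]
    analyze_chain(chain)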

6. Network Errors (Download Failures)

Problem: Downloads from AlphaFold/PDB fail mid-way

Solution:

import time
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def download_with_retry(url, max_retries=3):
    """
    Download with retry
    """
    session = requests.Session()
    retry = Retry(
        total=max_retries,
        backoff_factor=1,
        status_forcelist=[500, 502, 503, 504]
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)

    for attempt in range(max_retries):
        try:
            response = session.get(url, timeout=30)
            response.raise_for_status()
            return response.text
        except requests.exceptions.RequestException as e:
            print(f"Attempt {attempt+1}/{max_retries} failed: {e}")
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
            else:
                raise

Quality Checklist

Data Acquisition Phase

[ ] Confirm PDB ID is correct (search on PDB site)
[ ] Confirm UniProt ID is correct
[ ] Verify downloaded file MD5 checksum
[ ] File size is reasonable (not empty file)
[ ] Visual inspection in text editor (ATOM records exist)
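Several of the data acquisition checks above can be automated. The sketch below is one possible way to do so on the text of a downloaded file: it reports whether the content is non-empty, how many ATOM/HETATM records it contains, and its MD5 checksum. The function name `check_pdb_text` is made up, and the expected checksum is assumed to come from a trusted record you keep yourself (for example, one noted at first download).

```python
import hashlib


def check_pdb_text(text, expected_md5=None):
    """Run basic sanity checks on the contents of a downloaded PDB file."""
    data = text.encode("utf-8")
    atom_lines = [ln for ln in text.splitlines()
                  if ln.startswith(("ATOM  ", "HETATM"))]
    report = {
        "non_empty": len(data) > 0,
        "n_atom_records": len(atom_lines),
        "has_atom_records": len(atom_lines) > 0,
        "md5": hashlib.md5(data).hexdigest(),
    }
    if expected_md5 is not None:
        report["md5_matches"] = report["md5"] == expected_md5.lower()
    return report


# A two-record snippet stands in for a real file here
good = (
    "ATOM      1  N   MET A   1      20.154  29.699   5.276  1.00 49.05           N\n"
    "ATOM      2  CA  MET A   1      21.289  28.803   5.063  1.00 49.05           C\n"
)
# An HTML error page saved by mistake contains no ATOM records
bad = "<html><body>Not found</body></html>"

print(check_pdb_text(good))
print(check_pdb_text(bad))
```

For a file on disk, read it with `open(path).read()` and pass the result to the function. The check for ATOM records catches the common case where a failed download silently saves an error page.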

Structure Analysis Phase

[ ] Confirm CA atom exists in all residues
[ ] Check for missing residues (REMARK 465)
[ ] Check for abnormal coordinate values (typically -100 to 100 Å range)
[ ] B-factor values are reasonable (typically 0-100)
[ ] Check pLDDT score distribution (for AlphaFold structures)

AlphaFold-Specific Checks

[ ] pLDDT score average > 70 (reliability)
[ ] Confirm positions of low confidence regions (pLDDT < 50)
[ ] Distinguish loop regions vs structured regions
[ ] Check PAE map for relative placement reliability (if available)

Code Quality

[ ] Type hints added (def func(x: Structure) -> float:)
[ ] Error handling implemented
[ ] Logging implemented (progress checkable)
[ ] Units explicitly stated (Å, nm, degrees, etc.)

Materials Science-Specific Checks

[ ] Understand protein biological function
[ ] Confirm feasibility of biomaterial application
[ ] Consider stability (thermal stability, pH stability)
[ ] Synthesis/manufacturing feasibility

References

Berman, H. M. et al. (2000). "The Protein Data Bank." Nucleic Acids Research, 28, 235-242. DOI: 10.1093/nar/28.1.235
Jumper, J. et al. (2021). "Highly accurate protein structure prediction with AlphaFold." Nature, 596, 583-589. DOI: 10.1038/s41586-021-03819-2
Shoulders, M. D. & Raines, R. T. (2009). "Collagen Structure and Stability." Annual Review of Biochemistry, 78, 929-958. DOI: 10.1146/annurev.biochem.77.032207.120833


Author Information

Created by: AI Terakoya Content Team
Created: 2025-10-17
Version: 1.0

License: Creative Commons BY 4.0

Let's learn sequence analysis and machine learning in detail in Chapter 2!
