2  Python Fundamentals

Let’s dive into the basic building blocks of Python. Think of these as the fundamental vocabulary and grammar you’ll use to tell the computer what to do. Mastering these basics will empower you to process and understand biological data.

2.1 Variables and Data Types

In programming, a variable is like a named container or a labeled box in your computer’s memory where you can store data. When you want to use that data later, you just refer to the variable’s name.

Python is a dynamically typed language. This means you don’t need to explicitly tell Python what type of data a variable will hold (e.g., integer, text, decimal number) when you create it. Python automatically infers the type based on the value you assign. This makes Python very flexible and quicker to write for beginners, but it also means you need to be mindful of the data types you’re working with during operations.

Variable Naming Rules (PEP 8 Style): To make your code readable and consistent (especially important when working with others or revisiting your own code later), follow these common conventions:

  • Lowercase with underscores (snake_case): This is the standard for variable and function names. Examples: pdb_id, atom_count, active_site_residues.

  • Start with a letter or underscore: Variable names cannot begin with a number.

    • Valid: _temp_value, residue_10, chain_A.
    • Invalid: 10_residues, 2nd_atom
  • Case-sensitive: Python distinguishes between uppercase and lowercase letters. atom_name is a completely different variable from Atom_Name.

  • Avoid Python keywords: Do not use words that Python reserves for its own syntax (e.g., if, for, list, def, return, True, False, None). Using them will cause a SyntaxError.

  • Be descriptive: Choose names that clearly indicate the purpose or content of the variable. This makes your code much easier to understand.Good: active_site_residues, rmsd_value_angstroms, protein_sequenceBad: asr, x, p_seq

Common Data Types in Python for Structural Biology

Python comes with several built-in data types that are fundamental for storing and manipulating information. Here are the most common ones you’ll encounter, with a focus on their applications in structural bioinformatics:

  • Integers (int): What they are: Whole numbers (positive, negative, or zero) without any fractional part.

  • Use Cases in Structural Bioinformatics:

    • Counting discrete entities: number_of_atoms = 4582, chain_count = 2.
    • Identifiers: residue_number = 101, atom_serial_number = 1560.
    • Simulation steps: simulation_step = 50000.
number_of_amino_acids = 287
atom_serial_number = 1560

print(f"Number of amino acids: {number_of_amino_acids}")
print(f"Atom serial number: {atom_serial_number}")

# The type() function tells you what data type a variable holds
print(f"Type of atom_serial_number: {type(atom_serial_number)}") # Output: <class 'int'>
Number of amino acids: 287
Atom serial number: 1560
Type of atom_serial_number: <class 'int'>

Exercise: Define a variable total_atoms_in_myoglobin and assign it an approximate number of atoms (e.g., 1260). Print the variable and its type.

  • Floating-Point Numbers (float):

What they are: Numbers that have a decimal point or are expressed in scientific notation. They are used to represent real numbers, often for measurements that are not whole numbers.

  • Use Cases in Structural Bioinformatics:

  • Atomic coordinates: x_coord = 12.345, y_coord = -4.567, z_coord = 18.901.

  • Measurements: RMSD values (rmsd_value = 1.5), B-factors (b_factor = 45.67), bond lengths (bond_length_angstroms = 1.54), solvent accessible surface area (SASA), resolution of crystal structures.

# Atomic coordinates for a C-alpha atom
ca_x_coord = 10.253
ca_y_coord = -5.781
ca_z_coord = 22.019
print(f"C-alpha coordinates: X={ca_x_coord}, Y={ca_y_coord}, Z={ca_z_coord}")
C-alpha coordinates: X=10.253, Y=-5.781, Z=22.019
rmsd_to_native = 0.85 # Angstroms (Å is the symbol for Angstrom)
print(f"RMSD to native structure: {rmsd_to_native} Å")
print(f"Type of rmsd_to_native: {type(rmsd_to_native)}") # Output: <class 'float'>
RMSD to native structure: 0.85 Å
Type of rmsd_to_native: <class 'float'>

Exercise: Define three float variables: coord_x = 10.123, coord_y = -4.567, coord_z = 18.901. Print these coordinates in a formatted string, perhaps like “Coordinates: (X: 10.123, Y: -4.567, Z: 18.901)”.

  • Strings (str):

What they are: Sequences of characters (text). Strings are immutable, meaning once created, their individual characters cannot be changed. Any operation that seems to modify a string actually creates a new string.

How to define strings:

  • Single quotes (‘…’) or Double quotes (“…”): Use these for single-line strings. It’s largely a matter of personal preference or team convention, but choose one and stick to it for consistency.

    • If your string contains an apostrophe, use double quotes to avoid issues (e.g., “It’s a protein”).

    • If your string contains double quotes, use single quotes (e.g., ‘The “alpha” carbon’).

  • Triple quotes (“““…”“” or ’’‘…’’’): Use triple single or double quotes for multi-line strings. This is extremely useful for storing blocks of text, such as descriptions, help messages, or raw data snippets that span multiple lines, like a portion of a PDB file header or a long protein sequence.

Use Cases in Structural Bioinformatics:

  • PDB IDs: pdb_id = '1AKE', protein_data_bank_id = "2HHB".

  • Atom names: atom_name = 'CA', element_symbol = "C".

  • Residue names: residue_name = 'ALA', three_letter_code = "GLY".

  • Chain identifiers: chain_id = 'A'.

  • Biological sequences: protein_sequence = "MVLSEGEWQLVLHVWAKVEAD", dna_sequence = "ATGCGTAGCATG".

  • File paths: pdb_file_path = "data/1crn.pdb".

  • Lines from data files: atom_record_line_pdb = "ATOM 1 N MET A 1 26.654 24.145 24.409 1.00 10.00 N".

protein_data_bank_id = "2HHB" # Hemoglobin
amino_acid_sequence_snippet = "VLSPADKTNVKAAWGKVGAHAGEYGAEALE"

# A multi-line string for a simplified PDB ATOM record example
atom_record_line_pdb = """
ATOM      1  N   MET A   1      26.654  24.145  24.409  1.00 10.00           N
ATOM      2  CA  MET A   1      25.500  23.800  23.100  1.00 12.00           C
"""
print(f"Analyzing PDB ID: {protein_data_bank_id}")
print(f"Amino acid sequence snippet: {amino_acid_sequence_snippet}")
print(f"Example PDB ATOM line (multi-line): {atom_record_line_pdb}")
print(f"Type of protein_data_bank_id: {type(protein_data_bank_id)}") # Output: <class 'str'>
Analyzing PDB ID: 2HHB
Amino acid sequence snippet: VLSPADKTNVKAAWGKVGAHAGEYGAEALE
Example PDB ATOM line (multi-line): 
ATOM      1  N   MET A   1      26.654  24.145  24.409  1.00 10.00           N
ATOM      2  CA  MET A   1      25.500  23.800  23.100  1.00 12.00           C

Type of protein_data_bank_id: <class 'str'>

Exercise: Create a string variable target_pdb_code and assign it "1CBS". Create another string variable ligand_name and assign it "STI" (a substrate analog for 1CBS). Print both variables in a sentence, e.g., “The target PDB is [code], and its ligand is [name].”

  • Booleans (bool):

What they are: Truth values. A boolean variable can only hold one of two possible values: True or False. These are special keywords in Python (notice the capital first letter). Booleans are the foundation of decision-making in your code.

Use Cases in Structural Bioinformatics:

  • Flags for presence/absence: is_ligand_present = True, has_hydrogen_atoms_in_pdb = False.

  • Results of comparisons: resolution_ok = (pdb_resolution < 2.0), is_distance_significant = (2.8 < 3.5).

  • Categorization: is_alpha_helix = True, is_clash_detected = False.

is_structure_solved_by_xray = True
has_hydrogen_atoms_in_pdb = False # Often true for NMR, sometimes for high-res X-ray, but usually False for older X-ray PDBs
is_distance_significant = (2.8 < 3.5) # This is a comparison, which directly evaluates to True or False
print(f"Structure solved by X-ray? {is_structure_solved_by_xray}")
print(f"Hydrogen atoms present? {has_hydrogen_atoms_in_pdb}")
print(f"Is the distance significant for an H-bond (2.8 < 3.5)? {is_distance_significant}")
print(f"Type of is_distance_significant: {type(is_distance_significant)}") # Output: <class 'bool'>
Structure solved by X-ray? True
Hydrogen atoms present? False
Is the distance significant for an H-bond (2.8 < 3.5)? True
Type of is_distance_significant: <class 'bool'>

Exercise: Imagine you’ve calculated the B-factor of an atom as b_factor_value = 55.2. Create a boolean variable is_atom_flexible that is True if b_factor_value is greater than 50.0, and False otherwise. Print the value of is_atom_flexible.

  • Type Conversion (Casting):

Sometimes, you need to convert data from one type to another. This process is called “type conversion” or “casting.” Python provides built-in functions for this: int(), float(), str(), bool().

Context in Structural Bioinformatics: A common scenario is when you read data from a file (like a PDB file). Although numbers (coordinates, B-factors, occupancy) are stored as text (strings) in the file, you’ll need to convert them to float or int to perform numerical calculations.

coordinate_str = "12.345" # This value is read as a string from a PDB file line
coordinate_float = float(coordinate_str) # Convert the string to a floating-point number
print(f"Coordinate string '{coordinate_str}' as float: {coordinate_float}")
print(f"Type of coordinate_float: {type(coordinate_float)}") # Output: <class 'float'>
Coordinate string '12.345' as float: 12.345
Type of coordinate_float: <class 'float'>
atom_count_str = "5000"
atom_count_int = int(atom_count_str) # Convert the string to an integer
print(f"Atom count '{atom_count_str}' as int: {atom_count_int}")
print(atom_count_int * 2) # Now arithmetic operations are possible: 10000
Atom count '5000' as int: 5000
10000
# Careful with float to int conversion: it truncates (chops off the decimal part), it does NOT round.
occupancy_str = "0.75"
occupancy_float = float(occupancy_str)
occupancy_int_truncated = int(occupancy_float) # Becomes 0 (truncates 0.75 to 0)
print(f"Occupancy '{occupancy_str}' as float: {occupancy_float}, as int (truncated): {occupancy_int_truncated}")
Occupancy '0.75' as float: 0.75, as int (truncated): 0
# Converting to boolean:
# Most non-empty strings, non-zero numbers, non-empty lists/tuples/dictionaries evaluate to True
# Empty strings, 0, None, empty lists/tuples/dictionaries evaluate to False
is_present_str = "True"
is_present_bool = bool(is_present_str) # This will be True because the string "True" is not empty
print(f"'{is_present_str}' as boolean: {is_present_bool}")

empty_sequence_str = ""
is_empty_bool = bool(empty_sequence_str) # This will be False
print(f"'{empty_sequence_str}' as boolean: {is_empty_bool}")
'True' as boolean: True
'' as boolean: False

2.2 Operators: The Tools for Data Manipulation

Operators are special symbols or keywords that perform operations on variables and values (called operands). Understanding operators is crucial for performing calculations, making comparisons, and combining conditions in your code.

Arithmetic Operators:

Used to perform mathematical calculations.

Operators: * + (Addition) * - (Subtraction) * * (Multiplication) * / (True Division): Performs standard division, always returning a float (even if the result is a whole number).

  • // (Floor Division): Performs division and returns the integer part of the quotient (rounds down to the nearest whole number).

  • % (Modulo): Returns the remainder of a division.

  • ** (Exponentiation): Raises the first operand to the power of the second.

Use Cases in Structural Bioinformatics:

  • Calculating distances between atoms (e.g., Euclidean distance:
  • Averaging B-factors or other numerical properties.
  • Scaling coordinates.
  • Calculating molecular weights (summing up atomic weights).
# Simplified distance calculation in 1D for now
atom1_x = 10.0
atom2_x = 15.5
delta_x = atom2_x - atom1_x # Difference in X-coordinate: 5.5
distance_sq_x = delta_x ** 2 # Square of the difference: 5.5 * 5.5 = 30.25
print(f"Difference in X: {delta_x}, Squared difference: {distance_sq_x}")

# Average B-factor of two atoms
b_factor1 = 20.5
b_factor2 = 30.1
average_b_factor = (b_factor1 + b_factor2) / 2 # Division results in float: 25.3
print(f"Average B-factor: {average_b_factor}")

sequence_length = 101 # Example: number of residues
residues_per_turn_helix = 3.6 # Approximate number of residues per turn in an alpha-helix
approx_turns = sequence_length / residues_per_turn_helix # True division: 101 / 3.6 = 28.055...
full_turns = sequence_length // residues_per_turn_helix # Floor division: 101 // 3.6 = 28.0 (type is float but value is integer part)
# Note: If both operands are integers, // returns an integer. If either is float, it returns a float.
print(f"Approximate turns in helix: {approx_turns:.2f}, Full turns: {full_turns}")

remainder_check = 10 % 3 # Modulo: 10 divided by 3 is 3 with a remainder of 1
print(f"Remainder of 10 / 3: {remainder_check}") # Output: 1
Difference in X: 5.5, Squared difference: 30.25
Average B-factor: 25.3
Approximate turns in helix: 28.06, Full turns: 28.0
Remainder of 10 / 3: 1

Exercise: Given x1 = 5.0, y1 = 10.0 and x2 = 8.0, y2 = 14.0. Calculate the squared distance in 2D: \[(x2−x1)^2+(y2−y1)^2\] Store the result in a variable sq_dist_2d and print it.

Comparison Operators:

What they are: Used to compare two values. They always return a Boolean value (True or False).

Operators:

  • == (Equal to): Returns True if two operands have the same value.
  • != (Not equal to): Returns True if two operands have different values.
  • > (Greater than)
  • < (Less than)
  • >= (Greater than or equal to)
  • <= (Less than or equal to)

Use Cases in Structural Bioinformatics:

  • Comparing resolution values: resolution1 < resolution2.
  • Checking if a B-factor exceeds a threshold: b_factor > 70.0.
  • Comparing atom names: atom_name == 'CA'.
  • Checking if a residue number is within a specific range: residue_number >= 10 and residue_number <= 20.
my_protein_resolution = 1.8 # Angstroms
acceptable_resolution_max = 2.5
is_good_resolution = my_protein_resolution <= acceptable_resolution_max # 1.8 is less than or equal to 2.5, so True
print(f"Is resolution {my_protein_resolution}Å acceptable (<= {acceptable_resolution_max}Å)? {is_good_resolution}")

atom_type = "C"
is_carbon_atom = atom_type == "C" # Is "C" equal to "C"? True
print(f"Is atom_type '{atom_type}' a Carbon? {is_carbon_atom}")

# Check if a chain ID is not 'X' (often used for unknown chains)
chain_id = "A"
is_not_unknown_chain = chain_id != "X" # "A" is not equal to "X", so True
print(f"Is chain '{chain_id}' not an unknown chain? {is_not_unknown_chain}")
Is resolution 1.8Å acceptable (<= 2.5Å)? True
Is atom_type 'C' a Carbon? True
Is chain 'A' not an unknown chain? True

Exercise: A typical hydrogen bond length is between 2.7 and 3.3 Angstroms. Given measured_hbond_length = 2.9, write a comparison to check if it’s less than or equal to 3.3. Print the result.

Logical Operators What they are: Used to combine multiple Boolean expressions. They also return a Boolean value (True or False).

Operators:

  • and: Returns True if both operands are True.
  • or: Returns True if at least one of the operands is True.
  • not: Reverses the logical state of its operand (if True becomes False, if False becomes True).

Use Cases in Structural Bioinformatics:

  • Checking if a residue is part of an alpha-helix AND is solvent exposed: is_helix and is_exposed.
  • Checking if an atom is a backbone atom (C-alpha OR N OR C OR O): (atom_name == 'CA') or (atom_name == 'N') or (atom_name == 'C') or (atom_name == 'O').
  • Checking if a structure is not an NMR structure: not (experimental_method == 'NMR').
is_hydrophobic_residue = True
is_surface_exposed = False
# Is it a surface-exposed hydrophobic residue? Both conditions must be True.
is_hydrophobic_and_surface = is_hydrophobic_residue and is_surface_exposed # True and False -> False
print(f"Is the residue hydrophobic AND surface exposed? {is_hydrophobic_and_surface}")

atom_name_check = "CB" # C-beta
# Is it a main chain atom? (N, CA, C, or O)
is_main_chain_atom = (atom_name_check == "N" or atom_name_check == "CA" or atom_name_check == "C" or atom_name_check == "O")
# "CB" == "N" (False) OR "CB" == "CA" (False) OR "CB" == "C" (False) OR "CB" == "O" (False)
# False or False or False or False -> False
print(f"Is atom '{atom_name_check}' a main chain atom? {is_main_chain_atom}")
print(f"Is atom '{atom_name_check}' NOT a main chain atom? {not is_main_chain_atom}") # not False -> True
Is the residue hydrophobic AND surface exposed? False
Is atom 'CB' a main chain atom? False
Is atom 'CB' NOT a main chain atom? True

Assignment Operators What they are: Used to assign values to variables.

Operators:

  • = (Assignment): The basic assignment operator.
  • += (Add and assign): x += yis equivalent to x = x + y.
  • -= (Subtract and assign): x -= y is equivalent to x = x - y.
  • *= (Multiply and assign): x *= y is equivalent to x = x * y.
  • /= (Divide and assign): x /= yis equivalent to x = x / y.
  • And others for floor division (//=), modulo (%=), exponentiation (**=).

Use Cases:

  • Incrementing counters: atom_count += 1.
  • Accumulating sums: total_energy += current_interaction_energy.
  • Updating properties based on calculations.
num_hydrogen_bonds = 0
print(f"Initial hydrogen bonds: {num_hydrogen_bonds}")
# ... some code finds a hydrogen bond ...
num_hydrogen_bonds += 1 # Increment by 1. Equivalent to: num_hydrogen_bonds = num_hydrogen_bonds + 1
print(f"Number of hydrogen bonds found: {num_hydrogen_bonds}")

protein_weight = 1000.0 # Initial hypothetical weight
# During refinement, suppose a modification adds 50.0 Daltons
protein_weight += 50.0
print(f"Updated protein weight: {protein_weight} Da")
Initial hydrogen bonds: 0
Number of hydrogen bonds found: 1
Updated protein weight: 1050.0 Da

Membership Operators

What they are: Used to test if a sequence (like a string, list, or tuple) contains a specific value.

Operators: * in: Returns True if the value is found in the sequence. * not in: Returns True if the value is not found in the sequence.

Use Cases in Structural Bioinformatics: * Checking if a specific amino acid is in a protein sequence: 'LYS' in protein_sequence. * Checking if a ligand ID (e.g., 'HOH' for water) is in a list of ligands found in a PDB file. * Checking if an atom name is in a list of known backbone atoms: atom_name in ['N', 'CA', 'C', 'O'].

protein_sequence_example = "MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG"
is_tryptophan_present = "W" in protein_sequence_example # W is Tryptophan: True
print(f"Is Tryptophan ('W') in the sequence? {is_tryptophan_present}")

active_site_residue_numbers = [45, 48, 97, 101, 152] # Example list of residue numbers
current_residue_num = 101
is_in_active_site = current_residue_num in active_site_residue_numbers # 101 is in the list: True
print(f"Is residue {current_residue_num} in the active site list? {is_in_active_site}")

# Check for a non-existent residue
non_existent_residue_num = 200
is_not_in_active_site = non_existent_residue_num not in active_site_residue_numbers # 200 is not in the list: True
print(f"Is residue {non_existent_residue_num} NOT in the active site list? {is_not_in_active_site}")
Is Tryptophan ('W') in the sequence? False
Is residue 101 in the active site list? True
Is residue 200 NOT in the active site list? True

2.3 Essential Built-in Functions

Python has many built-in functions that are always available for use without needing to import them from a module. These functions perform common tasks, making your coding more efficient. print():

What it does: Displays output (text, variable values) to the console. This is your primary tool for seeing what your code is doing and for debugging.

How to use: You can pass multiple arguments separated by commas, and print() will separate them with a space by default.

  • f-strings (formatted string literals): The most modern and recommended way to embed variable values directly into strings for printing. They start with an f before the opening quote, and variables are enclosed in curly braces {} within the string. You can also apply formatting (e.g., :.2f for two decimal places for a float).

  • Use Cases:

  • Printing PDB ID, number of atoms, specific coordinates, results of calculations (e.g., print(f'Distance between atom {atom_serial_1} and atom {atom_serial_2}: {dist_value:.2f} Angstroms')). Providing progress updates during long analyses.

pdb_code_to_print = "1TIM"
chain_id_to_print = "A"
num_residues_to_print = 248

# Basic printing with commas
print("Analyzing structure", pdb_code_to_print, "chain", chain_id_to_print)

# Using an f-string for clear, embedded values and formatting
print(f"Analyzing structure {pdb_code_to_print}, chain {chain_id_to_print}, which has {num_residues_to_print} residues.")

# Formatting a float in an f-string
calculated_rmsd = 1.23456
print(f"The RMSD is {calculated_rmsd:.3f} Å.") # Prints 1.235 (rounded to 3 decimal places)
Analyzing structure 1TIM chain A
Analyzing structure 1TIM, chain A, which has 248 residues.
The RMSD is 1.235 Å.

Exercise: Ask the user to input a residue number. Convert the input to an integer and print “You selected residue number: [number]”. Add a try-except block to handle a potential ValueError if the user doesn’t enter a valid number (e.g., they type “twenty” instead of “20”).

2.4 Control Flow: Directing Your Code’s Path

Control flow statements determine the order in which the lines of your code are executed. Without control flow, your program would simply execute line by line from top to bottom. Control flow allows your program to make decisions, repeat actions, and respond dynamically to data, which is essential for any non-trivial task in structural bioinformatics.

2.4.1 Conditional Statements: if, elif, else

Conditional statements allow your program to make decisions and execute different blocks of code based on whether certain conditions are True or False. This enables your code to behave dynamically depending on the data it is processing or the state of your program.

  • if statement: Purpose: Executes a block of code only if a specified condition is True.

Syntax: if condition: # Code to execute if condition is True statement1 statement2

  • The condition is any expression that evaluates to True or False (e.g., x > 10, atom_name == "CA", is_ligand_present). The indented block of code (often called the “suite”) is executed only if the condition is True.

  • elif (else if) statement: Purpose: Used after an if (or another elif) to check an additional condition only if all preceding if or elif conditions were False. You can have any number of elif blocks.

Syntax: if condition1: # Code if condition1 is True elif condition2: # Only checked if condition1 was False # Code if condition2 is True

  • else statement: Purpose: An optional final block that executes if none of the preceding if or elif conditions were True. It acts as a “catch-all” or default action.

Syntax: if condition1: # Code if condition1 is True elif condition2: # Code if condition2 is True else: # Only executed if condition1 and condition2 were both False # Default code

Important Indentation: Python uses indentation (typically 4 spaces) to define code blocks. This is critical for if, elif, else, loops, and function definitions. Incorrect indentation will lead to IndentationError or logical bugs.

Use Cases in Structural Bioinformatics:

Data Quality Assessment:

experimental_method = "X-RAY DIFFRACTION" # From a PDB header
resolution_angstroms = 1.9

if "X-RAY" in experimental_method.upper(): # Check if "X-RAY" is part of the method string (case-insensitive search)
    print(f"Method is X-ray. Resolution: {resolution_angstroms} Å.")
    # Nested if statement: this inner decision only happens if the outer condition is true
    if resolution_angstroms < 1.5:
        print("  This is a very high-resolution X-ray structure (excellent for details).")
    elif resolution_angstroms < 2.5: # Only checked if resolution_angstroms was not < 1.5
        print("  This is a good-resolution X-ray structure (suitable for most analyses).")
    else: # Only executed if resolution_angstroms was not < 1.5 AND not < 2.5
        print("  This X-ray structure has moderate to lower resolution (interpret with caution).")
elif "NMR" in experimental_method.upper(): # This block runs if the first 'if' (X-RAY) was False
    print("Method is NMR. Analysis might involve an ensemble of models and different validation criteria.")
elif "ELECTRON MICROSCOPY" in experimental_method.upper() or "EM" in experimental_method.upper(): # Can check multiple terms
    print("Method is Electron Microscopy. Resolution ranges widely, often higher for overall shape.")
else: # This block runs if none of the above conditions were True
    print(f"Method is {experimental_method}. Specific analysis protocol may be needed for this experimental technique.")
Method is X-ray. Resolution: 1.9 Å.
  This is a good-resolution X-ray structure (suitable for most analyses).

Atom/Residue Classification:

atom_name_to_classify = "CB" # C-beta atom, a sidechain atom
atom_category = ""

# Check if the atom name is one of the standard backbone atoms
if atom_name_to_classify in ["N", "CA", "C", "O"]:
    atom_category = "Backbone"
# If not a standard backbone atom, check if it starts with 'H' (for hydrogen atoms)
elif atom_name_to_classify.startswith("H"): # Example: HA, HB1, HG2 (all start with 'H')
    atom_category = "Hydrogen"
# If neither of the above, it's likely a sidechain heavy atom (e.g., CB, CG, SD, NE2 etc.)
else:
    atom_category = "Sidechain Heavy Atom"
print(f"Atom '{atom_name_to_classify}' is categorized as: {atom_category}")

# Example for a water molecule vs. protein residue
residue_type = "HOH"
if residue_type == "HOH":
    print(f"{residue_type} is a water molecule.")
elif residue_type in ["ALA", "GLY", "LEU", "LYS"]: # Check against a list of common amino acids
    print(f"{residue_type} is a standard amino acid.")
else:
    print(f"{residue_type} is a ligand or unknown residue.")
Atom 'CB' is categorized as: Sidechain Heavy Atom
HOH is a water molecule.

Exercise: Given b_factor_value = 65.0. Write an if-elif-else statement that prints: * “Very stable” if b_factor_value is less than 20.0. * “Moderately stable” if b_factor_value is between 20.0 (inclusive) and 50.0 (exclusive). * “Flexible” if b_factor_value is 50.0or greater.

2.4.2 Loops: Repeating Actions

Loops are fundamental control flow statements that allow you to execute a block of code multiple times. This is incredibly powerful for automating repetitive tasks, processing large datasets, or performing iterative calculations that would be tedious or impossible to do manually.

  • for Loops: Purpose: A for loop is used to iterate over a sequence (like a string, list, or tuple) or any other iterable object. In each iteration, a temporary variable (often called the “loop variable”) takes on the value of the next item in the sequence, and the indented code block is executed.

Syntax:

for item_variable in sequence_or_iterable: # Code to execute for each item statement1 statement2

  • The range() function: Often used with for loops to generate a sequence of numbers.
  • range(stop): Generates numbers from 0 up to (but not including) stop.
  • range(start, stop): Generates numbers from start up to (but not including) stop.
  • range(start, stop, step): Generates numbers from start up to (but not including) stop, incrementing by step each time.

Use Cases in Structural Bioinformatics: * Iterating through atoms/residues: Processing each atom or residue parsed from a PDB file to perform calculations (e.g., distances, B-factor analysis) or extract specific information. * Sequence analysis: Processing each amino acid or nucleotide in a protein/DNA sequence to identify specific residues, calculate composition, or find motifs. * Batch processing: Looping through a list of PDB IDs to download, analyze, or generate reports for each one automatically. * Molecular Dynamics (MD) trajectory analysis: Iterating through frames of an MD trajectory to calculate properties over time.

protein_sequence_to_scan = "ACEGIKMNPQSTVWYFL" # All single letter codes except a few hydrophobic
hydrophobic_amino_acids = "AILMFWVP" # Alanine, Isoleucine, Leucine, Methionine, Phenylalanine, Tryptophan, Valine, Proline
hydrophobic_count = 0

print(f"Scanning sequence: {protein_sequence_to_scan} for hydrophobic residues...")
# 'residue_code' will take on each character from the string 'protein_sequence_to_scan' one by one
for residue_code in protein_sequence_to_scan:
    if residue_code in hydrophobic_amino_acids: # Check if the current residue code is in our set of hydrophobic amino acids
        print(f"  Found hydrophobic residue: {residue_code}")
        hydrophobic_count += 1 # Increment the counter
print(f"Total hydrophobic residues found: {hydrophobic_count}")

# Using range() to simulate processing numbered models in an NMR structure
num_nmr_models = 10
print("\nSimulating analysis of NMR models:")
# range(1, num_nmr_models + 1) generates numbers from 1 to 10 (inclusive)
for model_number in range(1, num_nmr_models + 1):
    print(f"  Processing NMR model {model_number}...")
    # (Imagine complex analysis code here for each model, e.g., calculating RMSD for each model)
Scanning sequence: ACEGIKMNPQSTVWYFL for hydrophobic residues...
  Found hydrophobic residue: A
  Found hydrophobic residue: I
  Found hydrophobic residue: M
  Found hydrophobic residue: P
  Found hydrophobic residue: V
  Found hydrophobic residue: W
  Found hydrophobic residue: F
  Found hydrophobic residue: L
Total hydrophobic residues found: 8

Simulating analysis of NMR models:
  Processing NMR model 1...
  Processing NMR model 2...
  Processing NMR model 3...
  Processing NMR model 4...
  Processing NMR model 5...
  Processing NMR model 6...
  Processing NMR model 7...
  Processing NMR model 8...
  Processing NMR model 9...
  Processing NMR model 10...

Exercise: Given the peptide sequence peptide = "GATPLES". Iterate through each character (residue) in the sequence using a for loop. If a residue is 'P' (Proline), print “Proline found!”. Also, count how many 'S' (Serine) residues are present in the sequence and print the final count at the end of the loop.

  • while Loops: Purpose: A while loop repeatedly executes a code block as long as a given condition remains True. The condition is checked before each iteration.

Syntax: while condition: # Code to execute as long as condition is True statement1 statement2

Use Cases in Structural Bioinformatics: * Iterative refinement: Iteratively refining a protein structure model until convergence criteria (e.g., RMSD change below a threshold) are met. * Simulation steps: Simulating steps in a molecular dynamics trajectory until a certain simulation time is reached. * File reading: Reading lines from a large PDB file until a specific section (e.g., ‘HETATM’ records) is found or the end of the file is reached. * User input validation: Repeatedly asking the user for input until valid input is provided.

# Simulate iterative structure refinement until RMSD target is met or max iterations
current_rmsd_value = 5.0 # Initial RMSD in Angstroms
target_rmsd_value = 0.5  # Desired RMSD value for convergence
iteration_count = 0
max_iterations = 100    # Safety limit to prevent infinite loops

print("\nStarting structure refinement simulation:")
# The loop continues as long as RMSD is above target AND max iterations not reached
while current_rmsd_value > target_rmsd_value and iteration_count < max_iterations:
    print(f"  Iteration {iteration_count + 1}: Current RMSD = {current_rmsd_value:.3f} Å")
    # In a real scenario, complex refinement calculations would occur here
    current_rmsd_value -= 0.25 # Simulate RMSD improvement in each iteration
    iteration_count += 1       # Increment the iteration counter

# After the loop finishes, check why it stopped
if current_rmsd_value <= target_rmsd_value:
    print(f"Refinement converged after {iteration_count} iterations. Final RMSD: {current_rmsd_value:.3f} Å.")
else:
    print(f"Max iterations ({max_iterations}) reached. Refinement stopped. Final RMSD: {current_rmsd_value:.3f} Å.")

Starting structure refinement simulation:
  Iteration 1: Current RMSD = 5.000 Å
  Iteration 2: Current RMSD = 4.750 Å
  Iteration 3: Current RMSD = 4.500 Å
  Iteration 4: Current RMSD = 4.250 Å
  Iteration 5: Current RMSD = 4.000 Å
  Iteration 6: Current RMSD = 3.750 Å
  Iteration 7: Current RMSD = 3.500 Å
  Iteration 8: Current RMSD = 3.250 Å
  Iteration 9: Current RMSD = 3.000 Å
  Iteration 10: Current RMSD = 2.750 Å
  Iteration 11: Current RMSD = 2.500 Å
  Iteration 12: Current RMSD = 2.250 Å
  Iteration 13: Current RMSD = 2.000 Å
  Iteration 14: Current RMSD = 1.750 Å
  Iteration 15: Current RMSD = 1.500 Å
  Iteration 16: Current RMSD = 1.250 Å
  Iteration 17: Current RMSD = 1.000 Å
  Iteration 18: Current RMSD = 0.750 Å
Refinement converged after 18 iterations. Final RMSD: 0.500 Å.

Caution: When using while loops, it is absolutely critical to ensure that the condition will eventually become False. If the condition never evaluates to False, your program will enter an infinite loop, running forever and potentially freezing your environment. Always include a mechanism (like an incrementing counter with a max_iterations limit, or a condition that will definitely change state) to guarantee that the loop will terminate.

  • break and continue Statements (within loops): These statements provide more fine-grained control over loop execution, allowing you to alter the normal flow within a loop.
    • break:

    • Purpose: Immediately terminates the innermost loop it is in. Execution then continues with the first statement after the loop (if any).

    • Use Cases: Stop processing atoms in a PDB file if a critical formatting error is found. If you are searching for the first structure meeting certain criteria in a list of PDB IDs, break after finding it to save computation time and move on to the next part of your script.

    • continue:

    • Purpose: Skips the rest of the current iteration of the loop and immediately proceeds to the next iteration (checks the loop condition again). The loop itself continues to run, but the current item is effectively ignored for the remaining part of the current pass.

    • Use Cases: When iterating through atoms, use continue to skip hydrogen atoms if they are not needed for the current analysis (e.g., only interested in heavy atoms). If a residue in a PDB file has incomplete coordinate data, continue to the next residue to avoid errors, rather than stopping the whole process.

pdb_ids_to_process = ["1AKE", "2HHB", "MISSING_FILE", "1CRN", "6M0J"]
print("\nProcessing PDB IDs, stopping if a 'MISSING_FILE' is encountered:")
for pdb_id in pdb_ids_to_process:
    if pdb_id == "MISSING_FILE":
        print(f"  Error: PDB ID '{pdb_id}' indicates a problem. Halting process.")
        break # Exit the 'for' loop entirely
    print(f"  Successfully processed (simulated) PDB ID: {pdb_id}")
print("Loop finished (due to break or exhaustion of list).")


print("\nCalculating average B-factor for non-hydrogen atoms (simplified):")
# Assume atom_data is a list of tuples: (atom_name, b_factor)
atom_data_from_pdb_residue = [("N", 22.5), ("CA", 20.1), ("C", 21.0), ("O", 23.5),
                              ("CB", 25.0), ("H", 18.0), ("HA", 19.5), ("HB1", 20.0)] # H, HA, HB1 are hydrogens
total_b_factor_heavy_atoms = 0
count_heavy_atoms = 0

for atom_name, b_factor_val in atom_data_from_pdb_residue:
    if atom_name.startswith("H"): # If it's a hydrogen atom (names usually start with H)
        print(f"  Skipping hydrogen atom: {atom_name}")
        continue # Skip the rest of this iteration and go directly to the next atom in the list

    print(f"  Including heavy atom: {atom_name}, B-factor: {b_factor_val}")
    total_b_factor_heavy_atoms += b_factor_val
    count_heavy_atoms += 1

if count_heavy_atoms > 0:
    average_b_heavy = total_b_factor_heavy_atoms / count_heavy_atoms
    print(f"Average B-factor for heavy atoms: {average_b_heavy:.2f}")
else:
    print("No heavy atoms found to calculate average B-factor.")

Processing PDB IDs, stopping if a 'MISSING_FILE' is encountered:
  Successfully processed (simulated) PDB ID: 1AKE
  Successfully processed (simulated) PDB ID: 2HHB
  Error: PDB ID 'MISSING_FILE' indicates a problem. Halting process.
Loop finished (due to break or exhaustion of list).

Calculating average B-factor for non-hydrogen atoms (simplified):
  Including heavy atom: N, B-factor: 22.5
  Including heavy atom: CA, B-factor: 20.1
  Including heavy atom: C, B-factor: 21.0
  Including heavy atom: O, B-factor: 23.5
  Including heavy atom: CB, B-factor: 25.0
  Skipping hydrogen atom: H
  Skipping hydrogen atom: HA
  Skipping hydrogen atom: HB1
Average B-factor for heavy atoms: 22.42

Exercise Idea: You have a list of residue names: residues = [“ALA”, “HIS”, “UNK”, “LEU”, “GLY”]. Loop through them. If a residue is “UNK” (unknown), print “Unknown residue found, skipping!” and use continue. Otherwise, print “Processing residue: [name]”.