Arthur Leung

Parallelizing "embarrassingly parallel" bioinformatics jobs, e.g., FoldX

In parallel computing, an embarrassingly parallel workload or problem (also called embarrassingly parallelizable, perfectly parallel, delightfully parallel or pleasingly parallel) is one where little or no effort is needed to split the problem into a number of parallel tasks. This is due to minimal or no dependency upon communication between the parallel tasks, or for results between them.

“Embarrassingly parallel” on Wikipedia

One of FoldX’s functions is predicting the effect of a mutation on the stability of a protein crystal structure, as indexed by the change in free energy of folding (∆∆Gfold).

In BuildModel, you give it an individual_list file in which each line represents one or more mutations to apply to the crystal structure, in the format of original residue, chain, position, mutated residue, e.g., MA309I; mutates a methionine at site 309 on chain A to an isoleucine. You can combine multiple mutations on one line, separated by commas, e.g., MA309I,MB309I,MC309I,MD309I; applies the same mutation to chains A to D.
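To make the format concrete, here is a minimal sketch of a helper that builds one individual_list line from its parts. The names format_mutation and format_line are made up for illustration and are not part of FoldX; the sketch assumes single-letter residue codes, as in the examples above.

```python
def format_mutation(wild_type, chain, position, mutant):
    """Build one mutation token: original residue, chain, position, new residue."""
    return f"{wild_type}{chain}{position}{mutant}"


def format_line(mutations):
    """Join one or more mutations into a single individual_list line ending in ';'."""
    return ",".join(format_mutation(*m) for m in mutations) + ";"


# Single mutation: methionine 309 on chain A to isoleucine
print(format_line([("M", "A", 309, "I")]))  # MA309I;

# The same mutation applied to chains A to D
print(format_line([("M", chain, 309, "I") for chain in "ABCD"]))  # MA309I,MB309I,MC309I,MD309I;
```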

FoldX then iterates over the individual_list and spits out files with the predicted ∆∆Gfold and its various subcomponents. Each line is independent of the others in the individual_list file, so the jobs can be considered “embarrassingly parallel”.

I needed to run a file with 174 different sets of mutations, which was going to take a loooong time on my computer. So, thanks to a suggestion by my colleague Matt, I ended up using GNU Parallel to split it up into 8 jobs to run across 8 cores.

I split my individual_list_mutations.txt into 8 approximately equal chunks, each written to its own individual_list_mutations_{}.txt file, in Python:


import os

# Split into 8 chunks
input_file = "individual_list_mutations.txt"
output_dir = "mutations_split"
with open(input_file, "r") as file:
    lines = file.readlines()

total_lines = len(lines)
print("Read", total_lines, "lines")

# Calculate the number of lines per chunk and the remainder (if any)
num_chunks = 8
chunk_size = total_lines // num_chunks
remainder = total_lines % num_chunks

# Split lines into chunks
chunks = []
start_index = 0
for i in range(num_chunks):
    # Distribute the remainder by adding 1 line to the first 'remainder' chunks
    end_index = start_index + chunk_size + (1 if i < remainder else 0)
    chunks.append(lines[start_index:end_index])
    print(f"Got chunk {i + 1} with lines {start_index} to {end_index - 1}, inclusive")
    start_index = end_index

# Write each chunk to a new file
# splitext() removes only the ".txt" extension, whereas rstrip('.txt') would
# strip any trailing '.', 't' or 'x' characters from the name
base_name = os.path.splitext(input_file)[0]
os.makedirs(output_dir, exist_ok=True)  # Make sure the output folder exists
for i, chunk in enumerate(chunks, 1):
    output_filename = f"{output_dir}/{base_name}_{i}.txt"
    with open(output_filename, "w") as output_file:
        output_file.writelines(chunk)
    print(f"Created: {output_filename}")
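As a quick sanity check of the arithmetic, here is a small sketch that computes the chunk sizes for my 174 lines and 8 chunks using the same rule as the script above (the helper name chunk_sizes is just for illustration). Since 174 = 8 × 21 + 6, the first six chunks get 22 lines and the last two get 21.

```python
def chunk_sizes(total_lines, num_chunks):
    """Return the size of each chunk when the remainder is spread over the first chunks."""
    base, remainder = divmod(total_lines, num_chunks)
    return [base + (1 if i < remainder else 0) for i in range(num_chunks)]


sizes = chunk_sizes(174, 8)
print(sizes)       # [22, 22, 22, 22, 22, 22, 21, 21]
print(sum(sizes))  # 174
```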

Using the individual_list_mutations_{}.txt files, I then used GNU Parallel in bash (on MSYS2, Windows) to run 8 instances of FoldX. One neat thing I did was have it send me a message on Pushover when a job was done (which is free up to 10,000 messages a month, or you can upgrade for a one-time fee of US$5).

Importantly, by default FoldX ignores water molecules (--water=-IGNORE), assigns strong van der Waals forces, which requires a very precise crystal structure (--vdwDesign=2), and assumes an ionic strength of 0.05 M (--ionStrength=0.05). So I made some tweaks to make it more applicable to my enzyme of interest: --water=-CRYSTAL, --vdwDesign=0, and --ionStrength=0.15.

Let me know what you think!


#!/bin/bash

# ---- Setup ----

cd "$(dirname "$0")"

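njobs_parallel=8
# Pushover credentials (placeholders). They are exported because GNU Parallel
# starts each job in a new shell, which only sees exported variables.
export TOKEN="12345"
export USER="67890"
export TITLE="FoldX analysis progress"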

mkdir -p logs runs

# ---- Function to run each FoldX job ----
run_foldx_job() {
  local i="$1"  # Chunk number, 1 to njobs_parallel
  # Assumes the chunk files sit next to this script; adjust the --mutant-file
  # path if they are still in mutations_split/.
  # The Pushover message is only sent if FoldX exits successfully (&&).
  ./foldx.exe \
    --command=BuildModel \
    --pdb=8ruc-assembly-clean_Repair.pdb \
    --mutant-file="individual_list_mutations_${i}.txt" \
    --fix-residues-file=individual_list_constraints.txt \
    --out-pdb=false \
    --output-dir=runs \
    --output-file="run_${i}" \
    --water=-CRYSTAL \
    --vdwDesign=0 \
    --ionStrength=0.15 \
    --numberOfRuns=3 \
    > "logs/foldx_run_${i}.log" 2>&1 && \
  curl -s -F "token=$TOKEN" -F "user=$USER" -F "title=$TITLE" -F "message=Run $i is done." https://api.pushover.net/1/messages.json
}
export -f run_foldx_job

# ---- Run FoldX jobs in parallel ----
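# --jobs caps the number of simultaneous jobs; without it, GNU Parallel defaults to one job per CPU core
parallel --progress --jobs "$njobs_parallel" run_foldx_job ::: $(seq 1 "$njobs_parallel")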

# ---- Clean up ----
echo -e "\n✅ All FoldX jobs are done!"
read -p "Press enter to exit."
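
If GNU Parallel isn't available, the same fan-out can be approximated in Python. This is only a rough sketch under a few assumptions, not what I actually ran: it swaps GNU Parallel for concurrent.futures.ThreadPoolExecutor, reuses the file names and FoldX options from the bash script above, and leaves out the Pushover notification. The names build_command, run_job and main are made up for illustration.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

N_JOBS = 8  # Number of chunk files and simultaneous FoldX processes


def build_command(i):
    """Return the FoldX argument list for chunk i (same options as the bash script)."""
    return [
        "./foldx.exe",
        "--command=BuildModel",
        "--pdb=8ruc-assembly-clean_Repair.pdb",
        f"--mutant-file=individual_list_mutations_{i}.txt",
        "--fix-residues-file=individual_list_constraints.txt",
        "--out-pdb=false",
        "--output-dir=runs",
        f"--output-file=run_{i}",
        "--water=-CRYSTAL",
        "--vdwDesign=0",
        "--ionStrength=0.15",
        "--numberOfRuns=3",
    ]


def run_job(i):
    """Run one FoldX job, sending stdout and stderr to logs/foldx_run_<i>.log."""
    with open(f"logs/foldx_run_{i}.log", "w") as log:
        result = subprocess.run(build_command(i), stdout=log, stderr=subprocess.STDOUT)
    return i, result.returncode


def main():
    Path("logs").mkdir(exist_ok=True)
    Path("runs").mkdir(exist_ok=True)
    # Threads are enough here: each one just waits on an external FoldX process.
    with ThreadPoolExecutor(max_workers=N_JOBS) as pool:
        # map() yields results in chunk order, not in order of completion
        for i, code in pool.map(run_job, range(1, N_JOBS + 1)):
            print(f"Run {i} finished with exit code {code}")
```

Calling main() from the folder that contains foldx.exe and the chunk files would start all 8 jobs; a non-zero exit code in the output points to the log file worth checking.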

#Code #Science