Structure prediction with Boltz

Boltz is an open-source model from the MIT Jameel Clinic, available for commercial use and designed to accurately model complex biomolecular interactions. It can predict a structure from a single protein sequence, or handle more complex problems involving multimeric proteins and ligands.

Usage with Ginkgo's AI Python client

You will need a GINKGOAI_API_KEY. To get one, go to models.ginkgobioworks.ai and create a free account. Your API key should be visible once you have logged in.

Next, install the Ginkgo AI Python client, ginkgo-ai-client:

pip install ginkgo-ai-client

Then initialize the client:

from ginkgo_ai_client import GinkgoAIClient, BoltzStructurePredictionQuery

# If the api_key argument is omitted, the key is read from the
# GINKGOAI_API_KEY environment variable
client = GinkgoAIClient(api_key="xxxxx-xxxx-xx-xxxxx-xxx")

Structure prediction with a single protein sequence

Let's ask Ginkgo's Boltz server for the structure of GFP (green fluorescent protein)!

sequence = (
    "MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTL"
    "VTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLV"
    "NRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLAD"
    "HYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK"
)

query = BoltzStructurePredictionQuery.from_protein_sequence(sequence)
response = client.send_request(query)

The response contains a link to a structure file which can be downloaded either as CIF or PDB:

response.download_structure(path="GFP.cif")
response.download_structure(path="GFP.pdb") # conversion on-the-fly

The response also contains confidence data. Here, a confidence_score of 0.95, close to the maximum of 1, indicates high confidence in the result.

>>> response.confidence_data
{'ptm': 0.9286148548126221,
 'iptm': 0,
 'chains_ptm': {'0': 0.9286148548126221},
 'complex_pde': 0.3618420660495758,
 'ligand_iptm': 0,
 'complex_ipde': 0,
 'protein_iptm': 0,
 'complex_plddt': 0.9573345184326172,
 'complex_iplddt': 0,
 'confidence_score': 0.9515905380249023,
 'pair_chains_iptm': {'0': {'0': 0.9286148548126221}}}
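
To make the numbers above easier to read, here is a small sketch that recomputes the aggregate score from its parts. It assumes the weighting described in the Boltz documentation (0.8 × complex pLDDT plus 0.2 × ipTM, with pTM standing in for ipTM on single chains). That weighting is not stated on this page, and recompute_confidence is a hypothetical helper, not part of the client.

```python
# Values copied from the GFP response shown above.
confidence_data = {
    "ptm": 0.9286148548126221,
    "iptm": 0,
    "complex_plddt": 0.9573345184326172,
    "confidence_score": 0.9515905380249023,
}


def recompute_confidence(data):
    """Assumed Boltz weighting: 0.8 * complex pLDDT + 0.2 * ipTM.

    For a single chain there is no interface, so pTM is used instead of ipTM.
    """
    interface_term = data["iptm"] if data["iptm"] > 0 else data["ptm"]
    return 0.8 * data["complex_plddt"] + 0.2 * interface_term


recomputed = recompute_confidence(confidence_data)
# Compare with the server-reported value, allowing for float32 rounding.
difference = abs(recomputed - confidence_data["confidence_score"])
```

Under this assumption the recomputed value agrees with the reported confidence_score to within float rounding, so for a single chain the score mostly reflects complex_plddt.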

Predictions with multimers and ligands

For more complex problems with multiple protein chains and ligands, the best approach is to start from a Boltz YAML input file (see the Boltz instructions and examples for more details).

In this example, we start from a typical Boltz YAML input file defining a dimer (two chains, A and B, sharing an identical sequence) and two kinds of ligands, one specified by its CCD code and one by a SMILES string.

sequences:
  - protein:
      id: [A, B]
      sequence: MVTPEGNVSLVDESLLVGVTDEDRAVRSAHQFYERLIGLWAPAVME...
  - ligand:
      id: [C, D]
      ccd: SAH
  - ligand:
      id: [E, F]
      smiles: N[C@@H](Cc1ccc(O)cc1)C(=O)O

We load the file with from_yaml_file and submit it to our Boltz server:

query = BoltzStructurePredictionQuery.from_yaml_file("with_ligand.yaml")
response = client.send_request(query, timeout=1000)
response.download_structure("with_ligand.pdb")

Handling larger batches

For large sets of queries, it is better to use an iterator, managed via send_requests_by_batches.

Processing a FASTA file with many protein sequences

If you have a FASTA file with many protein sequences, you can define an iterator that builds a query for each protein sequence in the file:

from Bio import SeqIO

# Queries are built lazily, one per FASTA record, as batches are requested
queries_iterator = (
    BoltzStructurePredictionQuery.from_protein_sequence(
        sequence=str(record.seq),
        query_name=record.id,
    )
    for record in SeqIO.parse("my_proteins.fa", format="fasta")
)

Then the queries are generated and sent in batches via send_requests_by_batches and the results are processed as they become available:

for batch_result in client.send_requests_by_batches(queries_iterator, batch_size=10):
    for result in batch_result:
        result.download_structure(f"{result.query_name}.pdb")
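
To illustrate why an iterator helps here, the sketch below shows lazy batching in plain Python. It is a conceptual illustration only, not the client's implementation. iter_batches and make_queries are hypothetical helpers, and the strings stand in for real query objects.

```python
from itertools import islice


def iter_batches(items, batch_size):
    """Yield lists of up to batch_size items, consuming `items` lazily."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


built = []  # records which stand-in queries have actually been constructed


def make_queries(n):
    """Stand-in for a generator of queries (e.g. one per FASTA record)."""
    for i in range(n):
        built.append(i)
        yield f"query_{i}"


batches = iter_batches(make_queries(25), batch_size=10)
first_batch = next(batches)
built_after_first_batch = len(built)  # only the first batch has been built
remaining_sizes = [len(batch) for batch in batches]
```

Because the generator is only advanced when a batch is requested, a FASTA file with thousands of records never needs to be turned into thousands of in-memory queries at once.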

Processing a folder with multiple files

from pathlib import Path

folder = Path("my_yaml_inputs")  # hypothetical folder of Boltz YAML files

queries_iterator = (
    BoltzStructurePredictionQuery.from_yaml_file(f)
    for f in folder.glob("*.yaml")
)

for batch_result in client.send_requests_by_batches(queries_iterator, batch_size=10):
    for result in batch_result:
        # the query name is the name of the original yaml file (without extension)
        result.download_structure(f"{result.query_name}.pdb")

Throughput and pricing

The Boltz model can complete predictions on our servers in as little as 20 seconds for short proteins, and in up to about 4 minutes for more complex problems with multiple chains and ligands.

We currently only accept protein sequences under 1000 amino acids.
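
When preparing a batch, it can help to set aside over-long sequences before sending anything. The sketch below assumes "under 1000" means strictly fewer than 1000 residues per sequence. split_by_length is a hypothetical helper, not part of the client.

```python
MAX_LENGTH = 1000  # assumption: sequences must have strictly fewer residues


def split_by_length(sequences, max_length=MAX_LENGTH):
    """Split a {name: sequence} mapping into (accepted, too_long) dicts."""
    accepted, too_long = {}, {}
    for name, seq in sequences.items():
        target = accepted if len(seq) < max_length else too_long
        target[name] = seq
    return accepted, too_long


# Toy sequences of different lengths (poly-M placeholders, not real proteins).
example = {"short": "M" * 238, "edge": "M" * 1000, "long": "M" * 1500}
accepted, too_long = split_by_length(example)
```

Only the entries in accepted would then be turned into queries. The ones in too_long can be logged or truncated, depending on your use case.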

The current pricing is a combination of:

  • A fixed cost of $0.01 per request

  • Plus an additional $0.00025 per amino acid in the sequences (so $0.10 for a 400-AA sequence)

  • Plus an additional $0.0025 per ligand.
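
The pricing rules above can be turned into a quick estimate. This sketch assumes the per-amino-acid cost applies to the total number of residues across all chains and that every ligand copy is billed separately. Neither detail is spelled out on this page, and estimate_cost is a hypothetical helper.

```python
FIXED_COST = 0.01            # USD per request
COST_PER_RESIDUE = 0.00025   # USD per amino acid
COST_PER_LIGAND = 0.0025     # USD per ligand


def estimate_cost(sequence_lengths, n_ligands=0):
    """Estimate the price in USD of one request.

    sequence_lengths: iterable with one length per protein chain (assumed to
    be summed); n_ligands: number of ligand copies (assumed billed individually).
    """
    total_residues = sum(sequence_lengths)
    cost = (
        FIXED_COST
        + COST_PER_RESIDUE * total_residues
        + COST_PER_LIGAND * n_ligands
    )
    return round(cost, 6)


single_chain = estimate_cost([400])                 # one 400-AA protein
with_ligands = estimate_cost([300, 300], n_ligands=4)  # hypothetical dimer
```

For the 400-AA example, the fixed $0.01 is added to the $0.10 sequence cost, giving $0.11 for the request.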
