Boltz is an open-source model from the MIT Jameel Clinic, fully available for commercial use, designed to accurately model complex biomolecular interactions. It can predict a structure from a single protein sequence, or handle more complex problems involving multimeric proteins and ligands.
First, initialize the Ginkgo AI Python client. You will need a GINKGOAI_API_KEY. To get one, go to models.ginkgobioworks.ai and create a free account. Copy your API key, which should be visible once you have logged in.
from ginkgo_ai_client import GinkgoAIClient, BoltzStructurePredictionQuery

# If the api_key field is omitted, it will be read from the env GINKGOAI_API_KEY
client = GinkgoAIClient(api_key="xxxxx-xxxx-xx-xxxxx-xxx")
Structure prediction with a single protein sequence
Let's ask Ginkgo's Boltz server for the structure of the GFP protein!
The response contains a link to a structure file which can be downloaded either as CIF or PDB:
The response also contains confidence data. Here a confidence score of 0.95, close to the maximum of 1, indicates a high confidence in the result.
Predictions with multimers and ligands
For more complex problems with multiple protein chains and ligands, it is best to start from a YAML Boltz input file (see the Boltz instructions and examples for more details).
In this example (in full here), we start from a typical Boltz YAML input file defining a dimer (a protein with two chains A and B sharing an identical sequence), and ligands.
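As a rough illustration of what such an input file can look like, the sketch below writes a small YAML file describing a dimer with two kinds of ligands. It follows the upstream Boltz input schema as commonly documented (a `version` field and a `sequences` list of `protein` and `ligand` entries). The sequence, the ligand choices, and the file name are placeholders chosen for illustration, not the contents of the example referenced above.

```python
import tempfile
from pathlib import Path

# Placeholder sequence for illustration only; use your real protein sequence.
SEQUENCE = "MVTPEGNVSLVDESLLVGVTDEDRAVRSAH"

# Assumed layout, based on the upstream Boltz YAML input schema.
boltz_input = f"""\
version: 1
sequences:
  - protein:
      id: [A, B]          # two chains sharing the same sequence (a dimer)
      sequence: {SEQUENCE}
  - ligand:
      id: [C, D]          # two copies of a ligand given by its CCD code
      ccd: SAH
  - ligand:
      id: [E, F]          # two copies of a ligand given as a SMILES string
      smiles: 'N[C@@H](Cc1ccc(O)cc1)C(=O)O'
"""

# Hypothetical file name, written to the system temp directory.
yaml_path = Path(tempfile.gettempdir()) / "dimer_with_ligands.yaml"
yaml_path.write_text(boltz_input)
print(yaml_path.read_text())
```

A file like this is what from_yaml_file expects as its argument.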
We load the file with from_yaml_file and submit it to our Boltz server:
And voilà:
A dimer with two identical chains (left and right) and identical ligands docked in each chain
Handling larger batches
For large sets of queries, it is better to use an iterator, managed via send_requests_by_batches.
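To make the idea of batching over an iterator concrete, here is a minimal, generic sketch of how a lazy iterator can be consumed in fixed-size batches. This is not the client's implementation of send_requests_by_batches. The helper `iter_batches` and the stand-in queries are hypothetical, and the sketch only shows why a generator keeps memory use low: items are built only when a batch asks for them.

```python
from itertools import islice


def iter_batches(items, batch_size):
    """Yield lists of up to `batch_size` items, pulling lazily from `items`."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Stand-in for a generator of queries: each item is created only on demand.
fake_queries = (f"query_{i}" for i in range(25))
batch_sizes = [len(batch) for batch in iter_batches(fake_queries, batch_size=10)]
print(batch_sizes)
```

With 25 items and a batch size of 10, the last batch is simply shorter than the others.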
Processing a FASTA file with many protein sequences
If you have a FASTA file with many protein sequences, you might define an iterator that builds a query for each protein sequence in the file:
Then the queries are generated and sent in batches via send_requests_by_batches, and the results are processed as they become available:
Processing a folder with multiple files
If your input is more complex, with multiple chains and ligands, and you have a folder with different YAML Boltz input files, here is how you would use send_requests_by_batches:
Throughput and Pricing
The Boltz model can complete predictions on our servers in as little as 20 seconds for short proteins, and in up to about 4 minutes for more complex problems with multiple chains and ligands.
We currently only accept protein sequences under 1000 amino acids.
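Given this limit, it can help to screen sequences before building queries. The sketch below is one way to do that in plain Python. The helper `split_by_length` is hypothetical and not part of the Ginkgo AI client, and it assumes "under 1000" means strictly fewer than 1000 residues.

```python
MAX_LENGTH = 1000  # assumed exclusive upper bound ("under 1000 amino acids")


def split_by_length(named_sequences, max_length=MAX_LENGTH):
    """Split a {name: sequence} dict into (accepted, too_long) dicts."""
    accepted, too_long = {}, {}
    for name, sequence in named_sequences.items():
        target = accepted if len(sequence) < max_length else too_long
        target[name] = sequence
    return accepted, too_long


# Toy sequences of 250, 999 and 1000 residues.
proteins = {"short": "M" * 250, "edge": "M" * 999, "long": "M" * 1000}
accepted, too_long = split_by_length(proteins)
print("accepted:", sorted(accepted))
print("skipped:", sorted(too_long))
```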
The current pricing is a combination of:
A fixed cost of $0.01 per request
Plus an additional $0.00025 per amino acid in the sequences (so $0.10 for a 400-AA sequence, or $0.11 for the request once the fixed cost is included)
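The pricing rule above can be sketched as a small helper for budgeting before submitting requests. The function `estimate_boltz_cost` is hypothetical and not part of the Ginkgo AI client. The constants are the prices quoted above, and the sketch assumes a request is billed on the combined length of all its sequences.

```python
from decimal import Decimal

FIXED_COST_PER_REQUEST = Decimal("0.01")   # USD, as quoted above
COST_PER_AMINO_ACID = Decimal("0.00025")   # USD, as quoted above


def estimate_boltz_cost(sequences):
    """Estimate the USD cost of one request from its protein sequences.

    Assumption: billing uses the total number of amino acids across
    every sequence in the request.
    """
    total_residues = sum(len(seq) for seq in sequences)
    return FIXED_COST_PER_REQUEST + COST_PER_AMINO_ACID * total_residues


single = estimate_boltz_cost(["M" * 400])  # one 400-AA sequence
print(f"400-AA request: ${single:.2f}")
```

Decimal is used rather than float so that small per-residue prices add up without rounding drift.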
from Bio import SeqIO

# Lazily build one query per FASTA record; nothing is sent until iteration
queries_iterator = (
    BoltzStructurePredictionQuery.from_protein_sequence(
        sequence=str(record.seq),
        query_name=record.id,
    )
    for record in SeqIO.parse("my_proteins.fa", format="fasta")
)

# Send the queries in batches and save each structure as results arrive
for batch_result in client.send_requests_by_batches(queries_iterator, batch_size=10):
    for result in batch_result:
        result.download_structure(f"{result.query_name}.pdb")

from pathlib import Path

# `folder` is the path to the directory containing the YAML Boltz input files
queries_iterator = (
    BoltzStructurePredictionQuery.from_yaml_file(f)
    for f in Path(folder).glob("*.yaml")
)

for batch_result in client.send_requests_by_batches(queries_iterator, batch_size=10):
    for result in batch_result:
        # the query name is the name of the original yaml file (without extension)
        result.download_structure(f"{result.query_name}.pdb")