Direct Prediction of Gene Expression with Promoter-0
Introduction
Promoter-0 enables direct prediction of promoter activity for synthetic gene expression cassettes. Our approach builds on the Borzoi model. This page shows how to run inference with Promoter-0 through our ginkgo-ai-client.
To learn more, see:
Using promoter-0 to predict promoter activity, a Colab notebook providing detailed usage examples
Our blog post for technical details and benchmark results
Documentation for getting up and running with the ginkgo-ai-client
Usage
First, install ginkgo-ai-client:
pip install ginkgo-ai-client
Next, initialize the Ginkgo AI Python client. This requires a GINKGOAI_API_KEY. To get one, go to models.ginkgobioworks.ai and create a free account, then copy the API key, which should be visible once you have logged in.
from ginkgo_ai_client import GinkgoAIClient, PromoterActivityQuery
# If the api_key field is omitted, it will be read from the env GINKGOAI_API_KEY
client = GinkgoAIClient(api_key = "xxxxx-xxxx-xx-xxxxx-xxx")
Then, build a query, which typically has the following inputs:
promoter_sequence: the DNA sequence of the promoter region being evaluated.
orf_sequence: the payload sequence, including the 5' and 3' UTRs. We include the complete expression cassette because the Borzoi model predicts gene regulation at the genomic level; the promoter's activity depends on whether the promoter sequence is connected to the other DNA sequence elements needed for transcription (i.e., UTRs and coding sequence).
source: how activity is calculated, depending on the track type:
    Use binding for DNase-seq or ATAC-seq tracks, where activity is calculated from the promoter sequence region. This method is based on DNA accessibility.
    Use expression for RNA-seq tracks, where activity is calculated from the transcribed region. This method is based on transcriptional activity.
tissue_of_interest: the tissues (muscle, brain, etc.) to predict for. More precisely, it is a dictionary mapping tissue names to their associated genomic tracks in Borzoi. See the section Selecting tissue track names on this page for more information.
query = PromoterActivityQuery(
    orf_sequence="tgccagccatctgttgtttgcccctcccccgtgccttcctt",
    promoter_sequence="GTCCCACTGATGAACTGTGCTGCCACAGTAAATGTAGCCA",
    source="binding",
    tissue_of_interest={"liver": ["ENCFF068ZBX", "ENCFF462ZLK", "ENCFF775FBE"]},
)
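Since both sequence inputs are plain DNA strings, it can be worth checking them locally before spending a query. The sketch below is a minimal pre-check, not part of ginkgo-ai-client: `check_dna` is a hypothetical helper, and the accepted alphabet (A, C, G, T, N in either case) is an assumption, not a documented server rule.

```python
# Hypothetical local pre-check; the allowed alphabet is an assumption.
VALID_BASES = set("ACGTN")

def check_dna(sequence, label="sequence"):
    """Return the sequence unchanged if it only contains A, C, G, T or N (any case)."""
    if not sequence:
        raise ValueError(f"{label} is empty")
    # Compare case-insensitively, since the examples mix upper and lower case.
    unexpected = sorted(set(sequence.upper()) - VALID_BASES)
    if unexpected:
        raise ValueError(f"{label} contains unexpected characters: {unexpected}")
    return sequence

promoter = check_dna("GTCCCACTGATGAACTGTGCTGCCACAGTAAATGTAGCCA", "promoter_sequence")
orf = check_dna("tgccagccatctgttgtttgcccctcccccgtgccttcctt", "orf_sequence")
```

The checked strings can then be passed as promoter_sequence and orf_sequence when building the query.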
Finally, we send the request to the server. The output is a dictionary of the form {tissue_name: score}, where each key is a tissue name and each value is the predicted activity level for that tissue.
response = client.send_request(query)
response.activity_by_tissue
>>> {"liver": 5.078294628815041}
Evaluating large sets of promoters
If you have a large number of promoters to evaluate (say, a thousand), you can send their sequences in batches using client.send_requests_by_batches.
In the particular scenario where all promoter sequences are in a FASTA file, in the form of
>promoter_1
ATCGCATCGCACG...
>promoter_2
GCTACACACCAGT...
and all evaluations use the same ORF sequence and tracks, you can generate the queries as follows:
queries = PromoterActivityQuery.iter_with_promoter_from_fasta(
    fasta_path="promoter_sequences.fasta",
    orf_sequence=orf_sequence,
    source="expression",
    tissue_of_interest={
        "heart": ["CNhs10608+", "CNhs10612+"],
        "liver": ["CNhs10608+", "CNhs10612+"],
    },
)
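If your promoter sequences live in a Python dictionary rather than on disk, a small helper can write them in the two-line-per-record FASTA layout shown above. This is a minimal sketch using only the standard library; `write_promoter_fasta` is a hypothetical name and not part of ginkgo-ai-client.

```python
from pathlib import Path

def write_promoter_fasta(promoters, fasta_path):
    """Write {name: sequence} pairs as a FASTA file and return the record count."""
    lines = []
    for name, sequence in promoters.items():
        # A FASTA header is a single token here; reject empty names or whitespace.
        if not name or any(ch.isspace() for ch in name):
            raise ValueError(f"invalid FASTA record name: {name!r}")
        lines.append(f">{name}")
        lines.append(sequence.strip())
    Path(fasta_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(promoters)
```

Calling write_promoter_fasta(my_promoters, "promoter_sequences.fasta") would then produce a file whose path can be passed as fasta_path.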
Then, we can send these queries in batches (say, 10 sequences at once) and process the results as they become available. In the background, several batches are processed at the same time, making the most of the parallelization between our servers and your script:
for batch_result in client.send_requests_by_batches(queries, batch_size=10):
    for query_result in batch_result:
        # Add entries to the output file
        query_result.write_to_jsonl("promoter_activity.jsonl")
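Once the run has finished, the JSONL file can be read back with the standard library. The sketch below assumes only that each non-empty line holds one standalone JSON object, which is what the JSONL format implies; the field names inside each record depend on the client and are not assumed here. `read_jsonl` is a hypothetical helper.

```python
import json

def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, e.g. a trailing newline
                yield json.loads(line)
```

For example, records = list(read_jsonl("promoter_activity.jsonl")) loads every result into memory, while iterating over the generator directly keeps memory use flat for large runs.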
Selecting tissue track names
There are over 9000 tissue tracks to choose from (ENCFF068ZBX, ENCFF008DNO, etc.) and it is important to carefully choose the tracks that will be used for promoter evaluation, so they are relevant for the conditions you are interested in.
To help with track selection, we provide PromoterActivityQuery.get_tissue_track_dataframe(), which returns a dataframe describing the relevant tracks:
heart_tracks_df = PromoterActivityQuery.get_tissue_track_dataframe(
    tissue="heart",  # optionally provide tissue of interest
    assay="DNASE",  # optionally provide assay of interest
)
heart_tracks_df[["name", "description"]]
name         description
ENCFF008DNO  DNASE:heart female embryo (117 days)
ENCFF509TTX  DNASE:heart female embryo (110 days)
ENCFF064YHT  DNASE:heart male child (3 years)
You can then construct the tissue_of_interest parameter:
tissue_of_interest = {"heart": heart_tracks_df["name"].to_list()}
Pricing
Each query to the server costs a fixed price of $0.005. A query takes roughly 1 to 2 seconds on a given server; queries run in parallel, so throughput is limited by the number of available servers.
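To budget a large run, the figures above can be turned into a back-of-the-envelope estimate. In this sketch, `estimate_run` and its `parallel_servers` argument are hypothetical: the number of servers actually available to you is not documented here, so treat the time range as a rough guide under that assumption.

```python
import math

PRICE_PER_QUERY_USD = 0.005      # fixed price per query, from the pricing above
SECONDS_PER_QUERY = (1.0, 2.0)   # rough per-query duration range, from the pricing above

def estimate_run(n_queries, parallel_servers=1):
    """Return (cost_usd, (min_seconds, max_seconds)) for a run of n_queries.

    parallel_servers is an assumed number of queries processed at once.
    """
    if n_queries < 0 or parallel_servers < 1:
        raise ValueError("n_queries must be >= 0 and parallel_servers >= 1")
    cost = n_queries * PRICE_PER_QUERY_USD
    # Queries are handled in waves of at most parallel_servers at a time.
    waves = math.ceil(n_queries / parallel_servers)
    low, high = SECONDS_PER_QUERY
    return cost, (waves * low, waves * high)

cost, (fastest, slowest) = estimate_run(1000, parallel_servers=10)
print(f"${cost:.2f}, roughly {fastest:.0f} to {slowest:.0f} seconds")
```

Under these assumptions, a thousand queries cost about $5, and the elapsed time shrinks in proportion to how many servers are available.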