Direct Prediction of Gene Expression with Promoter-0
Last updated
Promoter-0 enables direct prediction of promoter activity for synthetic gene expression cassettes. Our approach builds on the Borzoi model. This page demonstrates how to use our ginkgo-ai-client to run Promoter-0 inference and generate predictions for synthetic expression cassettes.
To learn more, see:
Using promoter-0 to predict promoter activity, a Colab notebook providing detailed usage examples
Our blog post for technical details and benchmark results
Documentation for getting up and running with the ginkgo-ai-client
To initialize the Ginkgo AI Python client, you will need a GINKGOAI_API_KEY. To get your API key, go to models.ginkgobioworks.ai and create a free account. Your API key should be visible once you have logged in; copy it from there.
You will also need to install ginkgo-ai-client before building queries.
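Before creating the client, it can help to confirm that the key is visible to your script. The sketch below assumes the key has been exported as an environment variable named GINKGOAI_API_KEY (the name used above). The helper get_api_key is hypothetical and is not part of ginkgo-ai-client.

```python
import os


def get_api_key(env_var: str = "GINKGOAI_API_KEY") -> str:
    """Return the API key stored in the environment, or fail with a clear message.

    Hypothetical helper for illustration; not part of ginkgo-ai-client.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set. Copy your key from models.ginkgobioworks.ai "
            "and export it before running this script."
        )
    return key
```

Failing early with an explicit message tends to be easier to debug than an authentication error returned later by the server.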
Then, build a query, which typically has the following inputs:
promoter_sequence: the DNA sequence of the promoter region being evaluated.
orf_sequence: the payload sequence, including the 5' and 3' UTRs. We include the complete expression cassette because the Borzoi model predicts gene regulation at the genomic level; the promoter's activity depends on the other DNA sequence elements it is connected to for transcription (i.e., UTRs and coding sequence).
source: how to calculate activity, based on track type:
    Use binding for DNase-seq or ATAC-seq tracks, where activity is calculated from promoter sequence regions. This method is based on DNA accessibility.
    Use expression for RNA-seq tracks, where activity is calculated from transcribed regions. This method is based on transcriptional activity.
tissue_of_interest: the tissues (muscle, brain, etc.) to predict for. More precisely, it is a dictionary mapping tissue names to their associated genomic tracks in Borzoi. See the section Selecting Tissue Labels on this page for more information.
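The four inputs above can be checked locally before anything is sent. The sketch below is a plain-Python illustration, not the ginkgo-ai-client query class: build_query_inputs is a hypothetical helper, the sequences are short placeholders rather than real cassettes, and representing each tissue's tracks as a list of track IDs is an assumption.

```python
VALID_SOURCES = {"binding", "expression"}
DNA_ALPHABET = set("ACGTN")


def build_query_inputs(promoter_sequence, orf_sequence, source, tissue_of_interest):
    """Validate the four inputs and return them as a plain dictionary.

    Hypothetical helper for illustration; not part of ginkgo-ai-client.
    """
    for name, seq in (("promoter_sequence", promoter_sequence),
                      ("orf_sequence", orf_sequence)):
        # Reject empty sequences and characters outside A, C, G, T, N.
        if not seq or not set(seq.upper()) <= DNA_ALPHABET:
            raise ValueError(f"{name} must be a non-empty DNA sequence")
    if source not in VALID_SOURCES:
        raise ValueError(f"source must be one of {sorted(VALID_SOURCES)}")
    for tissue, tracks in tissue_of_interest.items():
        if not tracks:
            raise ValueError(f"no tracks listed for tissue {tissue!r}")
    return {
        "promoter_sequence": promoter_sequence,
        "orf_sequence": orf_sequence,
        "source": source,
        "tissue_of_interest": tissue_of_interest,
    }


# Placeholder sequences; DNase-seq tracks pair with source="binding".
inputs = build_query_inputs(
    promoter_sequence="TATAAAAGGCGCGCC",
    orf_sequence="GCCACCATGGTGAGCTAAAATAAA",
    source="binding",
    tissue_of_interest={"heart": ["ENCFF008DNO", "ENCFF509TTX"]},
)
```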
Finally, we send the request to the server. The output is a dictionary of the form {tissue_name: score}, where each key is a tissue name and each value is the predicted activity level for that tissue.
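Because the result is an ordinary {tissue_name: score} dictionary, it can be handled with standard Python. The sketch below ranks tissues by predicted activity; the scores are made-up values for illustration only, and rank_tissues is a hypothetical helper.

```python
def rank_tissues(prediction):
    """Sort a {tissue_name: score} mapping from highest to lowest score."""
    return sorted(prediction.items(), key=lambda item: item[1], reverse=True)


# Illustrative values only, not real Promoter-0 output.
example_prediction = {"heart": 0.2, "liver": 0.9, "brain": 0.5}

for tissue, score in rank_tissues(example_prediction):
    print(f"{tissue}: {score}")
```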
If you have a large number of promoters to evaluate (say, a thousand), you can send their sequences in batches using client.send_requests_by_batches.
In the particular scenario where all promoter sequences are stored in a single FASTA file, and all evaluations use the same ORF sequence and tracks, the queries can be generated directly from that file.
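The FASTA scenario can be sketched with the standard library alone. This is an illustration of the idea, not the client's own FASTA helper: read_fasta and queries_from_fasta are hypothetical names, each query is represented as a plain dictionary, and the record name is assumed to be the first word of the header line.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        elif name is not None:
            # Sequence lines of one record may wrap over several lines.
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def queries_from_fasta(text, orf_sequence, source, tissue_of_interest):
    """Build one query dictionary per promoter, sharing ORF, source, and tracks."""
    return [
        {
            "promoter_name": name,
            "promoter_sequence": sequence,
            "orf_sequence": orf_sequence,
            "source": source,
            "tissue_of_interest": tissue_of_interest,
        }
        for name, sequence in read_fasta(text)
    ]
```

Keeping the promoter name alongside each query makes it straightforward to match results back to the FASTA records afterwards.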
We can then send these queries in batches (say, 50 sequences at a time) and process the results as they become available. In the background, several batches are processed at the same time, making the most of the parallelization between our servers and your script.
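To make the batching pattern concrete, here is a generic sketch built on Python's concurrent.futures. It is not how client.send_requests_by_batches is implemented, and it does not contact any server: fake_send_batch is a stand-in that returns dummy scores, and the batch size and worker count are arbitrary assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def chunk(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def fake_send_batch(batch):
    """Stand-in for a network call; the 'score' is just the sequence length."""
    return [{"promoter_sequence": seq, "score": float(len(seq))} for seq in batch]


def process_in_batches(sequences, batch_size=50, max_workers=4,
                       send_batch=fake_send_batch):
    """Submit all batches, then collect results as each batch completes."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(send_batch, b) for b in chunk(sequences, batch_size)]
        for future in as_completed(futures):
            # Batches may finish out of order; handle each one on arrival.
            results.extend(future.result())
    return results
```

Since batches can complete in any order, each result here carries its own promoter sequence so it can be matched back to its input.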
There are over 9000 tissue tracks to choose from (ENCFF068ZBX, ENCFF008DNO, etc.) and it is important to carefully choose the tracks that will be used for promoter evaluation, so they are relevant for the conditions you are interested in.
To help with track selection, we provide get_tracks_dataframe(), which returns a dataframe describing the relevant tracks, for example:
ENCFF008DNO    DNASE:heart female embryo (117 days)
ENCFF509TTX    DNASE:heart female embryo (110 days)
ENCFF064YHT    DNASE:heart male child (3 years)
You can then construct the tissue_of_interest parameter:
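As a sketch of that construction, the three tracks listed above can be grouped into a tissue_of_interest dictionary. The rows are written here as plain dictionaries standing in for dataframe rows; the column names track and description, the "ASSAY:sample" description format, and the select_tracks helper are assumptions made for illustration.

```python
# Rows copied from the track listing above; column names are hypothetical.
TRACK_ROWS = [
    {"track": "ENCFF008DNO", "description": "DNASE:heart female embryo (117 days)"},
    {"track": "ENCFF509TTX", "description": "DNASE:heart female embryo (110 days)"},
    {"track": "ENCFF064YHT", "description": "DNASE:heart male child (3 years)"},
]


def select_tracks(rows, assay, keyword):
    """Return track IDs whose description matches the assay and contains keyword."""
    selected = []
    for row in rows:
        # Descriptions are assumed to look like "ASSAY:sample details".
        row_assay, _, sample = row["description"].partition(":")
        if row_assay == assay and keyword in sample:
            selected.append(row["track"])
    return selected


# Tissue names are free-form labels; each maps to a list of track IDs.
tissue_of_interest = {
    "heart_embryo": select_tracks(TRACK_ROWS, "DNASE", "embryo"),
    "heart_child": select_tracks(TRACK_ROWS, "DNASE", "child"),
}
```

Since these are DNase-seq tracks, they would pair with source set to binding, as described in the query inputs.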
Each query to the server costs a fixed $0.005. A query takes roughly 1 to 2 seconds on a given server; queries run in parallel, with throughput limited by the number of available servers.
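These figures make it easy to budget a run in advance: at $0.005 per query, a thousand queries come to about $5. The sketch below turns the numbers above into a rough estimate. The number of servers working in parallel is not stated on this page, so n_parallel_servers is an assumption you supply, and the timing ignores network and queueing overhead.

```python
import math

PRICE_PER_QUERY_USD = 0.005      # fixed price per query, from the text above
SECONDS_PER_QUERY = (1.0, 2.0)   # rough per-query range on a given server


def estimate_cost_usd(n_queries):
    """Total cost of n_queries at the fixed per-query price."""
    return n_queries * PRICE_PER_QUERY_USD


def estimate_wall_time_seconds(n_queries, n_parallel_servers):
    """Rough (fastest, slowest) wall time, assuming queries are spread evenly."""
    if n_parallel_servers < 1:
        raise ValueError("n_parallel_servers must be at least 1")
    # Each server handles its share of queries one after another.
    rounds = math.ceil(n_queries / n_parallel_servers)
    fastest, slowest = SECONDS_PER_QUERY
    return rounds * fastest, rounds * slowest
```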