Direct Prediction of Gene Expression with Promoter-0

Introduction

Promoter-0 enables direct prediction of promoter activity for synthetic gene expression cassettes. Our approach builds on the Borzoi model. This page demonstrates how to use our ginkgo-ai-client to run inference with Promoter-0 to generate predictions for synthetic expression cassettes.

To learn more, see:

Using promoter-0 to predict promoter activity, a Colab notebook providing detailed usage examples
Our blog post for technical details and benchmark results
Documentation for getting up and running with the ginkgo-ai-client

Usage

To use the Ginkgo AI Python client, you first need an API key (GINKGOAI_API_KEY). Go to models.ginkgobioworks.ai and create a free account; your API key is visible once you have logged in.

Install ginkgo-ai-client, then initialize the client with your key:

pip install ginkgo-ai-client

from ginkgo_ai_client import GinkgoAIClient, PromoterActivityQuery

# If api_key is omitted, it is read from the GINKGOAI_API_KEY environment variable
client = GinkgoAIClient(api_key="xxxxx-xxxx-xx-xxxxx-xxx")
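If you rely on the environment variable rather than passing api_key explicitly, it can be useful to fail early with a clear message when the variable is missing. The following is a minimal sketch using only the standard library; get_api_key is a hypothetical helper, not part of ginkgo-ai-client.

```python
import os


def get_api_key(env=os.environ):
    """Return the API key from GINKGOAI_API_KEY, or raise if it is not set.

    Hypothetical helper: `env` defaults to the process environment and can be
    replaced with a plain dict for testing.
    """
    key = env.get("GINKGOAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GINKGOAI_API_KEY is not set; create an account at "
            "models.ginkgobioworks.ai and export your key first."
        )
    return key
```

The returned value can then be passed as api_key when constructing the client.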

Then, build a query, which typically has the following inputs:

promoter_sequence: specifies the DNA sequence of the promoter region being evaluated.
orf_sequence: specifies the payload sequence, including the 5' and 3' UTRs. We include the complete expression cassette because the Borzoi model predicts gene regulation at the genomic level; the promoter's activity depends on whether the promoter sequence is connected to other DNA sequence elements for transcription (i.e., UTRs and coding sequence).
source: specifies how to calculate activity, based on track type:
- Use binding for DNase-seq or ATAC-seq tracks, where activity is calculated from promoter sequence regions. This method is based on DNA accessibility.
- Use expression for RNA-seq tracks, where activity is calculated from transcribed regions. This method is based on transcriptional activity.
tissue_of_interest: specifies which tissues (muscle, brain, etc.) to predict activity for. More precisely, it is a dictionary mapping tissue names to their associated genomic tracks in Borzoi. See the section Selecting tissue track names on this page for more information.

query = PromoterActivityQuery(
    orf_sequence="tgccagccatctgttgtttgcccctcccccgtgccttcctt",
    promoter_sequence="GTCCCACTGATGAACTGTGCTGCCACAGTAAATGTAGCCA",
    source="binding",
    tissue_of_interest={"liver": ["ENCFF068ZBX", "ENCFF462ZLK", "ENCFF775FBE"]},
)
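Before sending a query, it can help to sanity-check the inputs locally. The sketch below is not part of ginkgo-ai-client: check_query_inputs is a hypothetical helper, and the rules it applies (sequences limited to A, C, G, and T in either case; source limited to the two values described above; tissue_of_interest as a dictionary of non-empty track lists) are assumptions drawn from the parameter descriptions, not the server's actual validation.

```python
VALID_SOURCES = {"binding", "expression"}  # the two values described above
DNA_ALPHABET = set("ACGT")  # assumption: plain, unambiguous DNA only


def check_query_inputs(promoter_sequence, orf_sequence, source, tissue_of_interest):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    for label, seq in [("promoter_sequence", promoter_sequence),
                       ("orf_sequence", orf_sequence)]:
        if not seq:
            problems.append(f"{label} is empty")
            continue
        unexpected = set(seq.upper()) - DNA_ALPHABET
        if unexpected:
            bad = "".join(sorted(unexpected))
            problems.append(f"{label} contains non-ACGT characters: {bad}")
    if source not in VALID_SOURCES:
        problems.append(f"source must be one of {sorted(VALID_SOURCES)}, got {source!r}")
    if not isinstance(tissue_of_interest, dict) or not tissue_of_interest:
        problems.append("tissue_of_interest must be a non-empty dict")
    else:
        for tissue, tracks in tissue_of_interest.items():
            if not tracks:
                problems.append(f"no tracks listed for tissue {tissue!r}")
    return problems
```

An empty list means none of these checks failed; it does not guarantee that the server will accept the query.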

Finally, we send the request to the server. The output is a dictionary of the form {tissue_name: score}, where each key is a tissue name and each value is the predicted activity level for that tissue.

response = client.send_request(query)
response.activity_by_tissue
>>> {"liver": 5.078294628815041}
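Because the result is a plain dictionary, ordinary Python tools apply to it. As a small sketch, the hypothetical helper rank_tissues below lists tissues from highest to lowest predicted score when a query covers several tissues. The "heart" value is made up for illustration.

```python
def rank_tissues(activity_by_tissue):
    """Return (tissue, score) pairs sorted from highest to lowest score."""
    return sorted(activity_by_tissue.items(), key=lambda item: item[1], reverse=True)


# Illustrative values only: "liver" is the score shown above, "heart" is made up.
scores = {"heart": 1.25, "liver": 5.078294628815041}
for tissue, score in rank_tissues(scores):
    print(f"{tissue}: {score:.3f}")
```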

Evaluating large sets of promoters

If you have a large number of promoters to evaluate (say, a thousand), you can send their sequences in batches using client.send_requests_by_batches.

Suppose all promoter sequences are stored in a FASTA file of the form:

>promoter_1
ATCGCATCGCACG...

>promoter_2
GCTACACACCAGT...

If all evaluations use the same ORF sequence and tracks, you can generate the queries as follows:

queries = PromoterActivityQuery.iter_with_promoter_from_fasta(
    fasta_path="promoter_sequences.fasta",
    orf_sequence=orf_sequence,  # the shared ORF sequence (with UTRs), defined beforehand
    source="expression",
    tissue_of_interest={
        "heart": ["CNhs10608+", "CNhs10612+"],
        "liver": ["CNhs10608+", "CNhs10612+"],
    },
)
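If your promoter sequences live in memory rather than in a file, you can first write them out in the FASTA layout shown above. This is a minimal standard-library sketch; write_promoters_fasta is a hypothetical helper, and the single-token restriction on record names is a conservative assumption so that names survive FASTA parsing.

```python
from pathlib import Path


def write_promoters_fasta(promoters, fasta_path):
    """Write {name: sequence} pairs as FASTA records and return the record count."""
    records = []
    for name, sequence in promoters.items():
        # Keep names to a single token: no empty names, no whitespace.
        if not name or any(ch.isspace() for ch in name):
            raise ValueError(f"invalid FASTA record name: {name!r}")
        records.append(f">{name}\n{sequence}\n")
    Path(fasta_path).write_text("".join(records))
    return len(records)
```

The resulting file path can then be passed as fasta_path when generating the queries.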

Then, we can send these queries in batches (say, 10 sequences at a time) and process the results as they become available. In the background, several batches are processed at the same time, making the most of the parallelization between our servers and your script:

for batch_result in client.send_requests_by_batches(queries, batch_size=10):
    for query_result in batch_result:
        # Add entries to the output file
        query_result.write_to_jsonl("promoter_activity.jsonl")
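Once the run finishes, the output file can be loaded back for analysis. The sketch below assumes the file follows the usual JSON Lines convention of one JSON object per line. The field names inside each record are not described on this page, so the hypothetical helper read_jsonl simply returns the parsed objects as they are.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Parse a JSON Lines file into a list of Python objects, skipping blank lines."""
    records = []
    with Path(path).open() as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as error:
                raise ValueError(f"{path}, line {line_number}: invalid JSON") from error
    return records
```

For example, read_jsonl("promoter_activity.jsonl") would return one entry per line written by the loop above.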

Selecting tissue track names

There are over 9000 tissue tracks to choose from (ENCFF068ZBX, ENCFF008DNO, etc.), so it is important to choose tracks that are relevant to the conditions you are interested in.

To help with track selection, we provide PromoterActivityQuery.get_tissue_track_dataframe(), which returns a dataframe describing the relevant tracks:

heart_tracks_df = PromoterActivityQuery.get_tissue_track_dataframe(
    tissue="heart", # optionally provide tissue of interest
    assay="DNASE" # optionally provide assay of interest
)
heart_tracks_df[["name", "description"]]

name          description
ENCFF008DNO   DNASE:heart female embryo (117 days)
ENCFF509TTX   DNASE:heart female embryo (110 days)
ENCFF064YHT   DNASE:heart male child (3 years)

You can then construct the tissue_of_interest parameter:

tissue_of_interest = {"heart": heart_tracks_df["name"].to_list()}
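Track descriptions can mix very different samples (embryo and child, in the rows listed earlier), so you may want to narrow the selection before building tissue_of_interest. The sketch below works on plain dictionaries, such as those produced by heart_tracks_df.to_dict("records") in pandas. select_track_names is a hypothetical helper, and matching keywords by case-insensitive substring is an assumption about how you want to filter.

```python
def select_track_names(rows, include=(), exclude=()):
    """Return track names whose description contains every `include` keyword
    and none of the `exclude` keywords (case-insensitive substring match)."""
    names = []
    for row in rows:
        description = row["description"].lower()
        has_all = all(word.lower() in description for word in include)
        has_excluded = any(word.lower() in description for word in exclude)
        if has_all and not has_excluded:
            names.append(row["name"])
    return names


# Rows copied from the heart track listing earlier in this section.
rows = [
    {"name": "ENCFF008DNO", "description": "DNASE:heart female embryo (117 days)"},
    {"name": "ENCFF509TTX", "description": "DNASE:heart female embryo (110 days)"},
    {"name": "ENCFF064YHT", "description": "DNASE:heart male child (3 years)"},
]
tissue_of_interest = {"heart": select_track_names(rows, exclude=["embryo"])}
print(tissue_of_interest)  # {'heart': ['ENCFF064YHT']}
```

Note that substring matching is coarse: the keyword "male" also matches "female", so check the selected names before using them.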

Pricing

Each query to the server costs a fixed price of $0.005. Each query takes roughly 1 to 2 seconds on a given server; queries are run in parallel, with throughput limited by the number of available servers.
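To budget a large run, the figures above translate into simple arithmetic: for example, 1,000 queries cost 1,000 × $0.005 = $5.00. The sketch below wraps this in a hypothetical helper, estimate_run. The n_parallel_servers parameter is an assumption, since the number of servers available at any given time is not stated here, so treat the time range as a rough guide only.

```python
import math

PRICE_PER_QUERY_USD = 0.005  # fixed price per query, from the pricing above
SECONDS_PER_QUERY = (1.0, 2.0)  # rough per-query time range on one server


def estimate_run(n_queries, n_parallel_servers=1):
    """Return (cost_usd, min_seconds, max_seconds) for a set of queries.

    n_parallel_servers is an assumed value: the actual number of servers
    available to a run is not documented here.
    """
    cost = n_queries * PRICE_PER_QUERY_USD
    # Each server handles at most this many queries, one after another.
    rounds = math.ceil(n_queries / n_parallel_servers)
    return cost, rounds * SECONDS_PER_QUERY[0], rounds * SECONDS_PER_QUERY[1]


cost, fastest, slowest = estimate_run(1000, n_parallel_servers=4)
print(f"${cost:.2f}, between {fastest:.0f} and {slowest:.0f} seconds")
```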

Last updated 6 months ago