Direct Prediction of Gene Expression with Promoter-0

Last updated 5 months ago

Introduction

Promoter-0 enables direct prediction of promoter activity for synthetic gene expression cassettes. Our approach builds on the Borzoi model. This page demonstrates how to use our ginkgo-ai-client to run inference with Promoter-0 and generate predictions for synthetic expression cassettes.

Figure 1. Predictions of 19 commonly used promoters' activities in 4 cell lines using Promoter-0.

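To learn more, see:

  • Using promoter-0 to predict promoter activity, a Colab notebook providing detailed usage examples

  • Our blog post for technical details and benchmark results

  • Documentation for getting up and running with the ginkgo-ai-client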

Usage

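First, install ginkgo-ai-client:

pip install ginkgo-ai-client

Next, initialize the Ginkgo AI Python client. You will need a GINKGOAI_API_KEY. To get your API key, go to models.ginkgobioworks.ai and create a free account. Copy your API key, which should be visible once you have logged in.

from ginkgo_ai_client import GinkgoAIClient, PromoterActivityQuery

# If the api_key argument is omitted, it is read from the GINKGOAI_API_KEY environment variable
client = GinkgoAIClient(api_key="xxxxx-xxxx-xx-xxxxx-xxx")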

Then, build a query, which typically has the following inputs:

  • promoter_sequence: specifying the DNA sequence of the promoter region being evaluated.

  • orf_sequence: the payload sequence, including the 5' and 3' UTRs. We include the complete expression cassette because the Borzoi model predicts gene regulation at the genomic level: the promoter's activity depends on the other DNA sequence elements it is connected to for transcription (i.e., the UTRs and the coding sequence).

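  • source: specifies how activity is calculated, based on the track type:

    • Use binding for DNase-seq or ATAC-seq tracks, where activity is calculated from the promoter sequence region. This method is based on DNA accessibility.

    • Use expression for RNA-seq tracks, where activity is calculated from the transcribed region. This method is based on transcriptional activity.

  • tissue_of_interest: specifies which tissues (muscle, brain, etc.) the prediction should cover. More precisely, it is a dictionary mapping tissue names to their associated genomic tracks in Borzoi. See the Selecting tissue track names section on this page for more information.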
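As a minimal sketch of how these inputs fit together, the helper below assembles and checks the query arguments before any request is made. The names check_dna and build_query_kwargs are hypothetical and not part of ginkgo-ai-client. The ACGT-only check and the 5' UTR + coding sequence + 3' UTR concatenation order are assumptions made for illustration; this page does not state which characters the server accepts.

```python
VALID_SOURCES = {"binding", "expression"}  # the two source values described above
DNA_ALPHABET = set("ACGT")  # assumption: only unambiguous bases are allowed


def check_dna(name, sequence):
    """Return the stripped sequence, or raise if it is empty or not plain DNA."""
    stripped = sequence.strip()
    if not stripped:
        raise ValueError(f"{name} is empty")
    bad = set(stripped.upper()) - DNA_ALPHABET
    if bad:
        raise ValueError(f"{name} contains non-ACGT characters: {sorted(bad)}")
    return stripped


def build_query_kwargs(promoter, five_utr, cds, three_utr, source, tissue_of_interest):
    """Hypothetical helper: validate inputs and build the query keyword arguments."""
    if source not in VALID_SOURCES:
        raise ValueError(f"source must be one of {sorted(VALID_SOURCES)}")
    if not tissue_of_interest or any(not tracks for tracks in tissue_of_interest.values()):
        raise ValueError("each tissue needs at least one track name")
    # Assumed layout of the payload: 5' UTR, then coding sequence, then 3' UTR.
    orf_sequence = (
        check_dna("five_utr", five_utr)
        + check_dna("cds", cds)
        + check_dna("three_utr", three_utr)
    )
    return {
        "promoter_sequence": check_dna("promoter", promoter),
        "orf_sequence": orf_sequence,
        "source": source,
        "tissue_of_interest": tissue_of_interest,
    }


# Toy sequences for illustration only.
example_kwargs = build_query_kwargs(
    promoter="GTCCCACTGATG",
    five_utr="aacc",
    cds="ATGGCCTAA",
    three_utr="ttgg",
    source="binding",
    tissue_of_interest={"liver": ["ENCFF068ZBX"]},
)
```

The resulting dictionary uses the same parameter names as the query inputs listed above, so it could be unpacked as PromoterActivityQuery(**example_kwargs).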

query = PromoterActivityQuery(
    orf_sequence="tgccagccatctgttgtttgcccctcccccgtgccttcctt",
    promoter_sequence="GTCCCACTGATGAACTGTGCTGCCACAGTAAATGTAGCCA",
    source="binding",
    tissue_of_interest={"liver": ["ENCFF068ZBX", "ENCFF462ZLK", "ENCFF775FBE"]},
)

Finally, we send the request to the server. The response's activity_by_tissue attribute is a dictionary of the form {tissue_name: score}, where each key is a tissue name and each value is the predicted activity level for that tissue.

response = client.send_request(query)
response.activity_by_tissue
# {"liver": 5.078294628815041}
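Once several promoters have been scored, the {tissue_name: score} dictionaries can be compared directly. The sketch below ranks promoters within one tissue. The function rank_promoters is a hypothetical helper, and the scores are made-up numbers for illustration, not model output.

```python
def rank_promoters(results, tissue):
    """Rank promoters by predicted activity in one tissue, highest first.

    `results` maps a promoter name to a {tissue_name: score} dictionary,
    i.e. the shape of activity_by_tissue described above.
    """
    scored = [
        (name, scores[tissue])
        for name, scores in results.items()
        if tissue in scores  # skip promoters that were not scored for this tissue
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up scores for illustration only.
example_results = {
    "promoter_1": {"liver": 5.1, "heart": 0.9},
    "promoter_2": {"liver": 2.3, "heart": 4.0},
    "promoter_3": {"heart": 1.5},
}

liver_ranking = rank_promoters(example_results, "liver")
heart_ranking = rank_promoters(example_results, "heart")
```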

Evaluating large sets of promoters

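If you have a large number of promoters to evaluate (say, a thousand), you can send them to the server in batches using client.send_requests_by_batches.

In the particular scenario where all promoter sequences are in a FASTA file of the form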

>promoter_1
ATCGCATCGCACG...

>promoter_2
GCTACACACCAGT...

and all evaluations use the same ORF sequence and tracks, you can generate the queries as follows:

queries = PromoterActivityQuery.iter_with_promoter_from_fasta(
    fasta_path="promoter_sequences.fasta",
    orf_sequence=orf_sequence,
    source="expression",
    tissue_of_interest={
        "heart": ["CNhs10608+", "CNhs10612+"],
        "liver": ["CNhs10608+", "CNhs10612+"],
    },
)
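Before generating a large number of queries, it can help to sanity-check the FASTA input locally. The sketch below is a small standalone parser, not the one used by ginkgo-ai-client; read_fasta is a hypothetical helper, and rejecting duplicate record names is an assumption about what you would want.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a {record_name: sequence} dictionary."""
    parts_by_name = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines between records are ignored
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("FASTA header with no record name")
            current = header[0]  # keep the name, drop any trailing description
            if current in parts_by_name:
                raise ValueError(f"duplicate record name: {current}")
            parts_by_name[current] = []
        else:
            if current is None:
                raise ValueError("sequence data found before the first header")
            parts_by_name[current].append(line)
    # Sequences may be wrapped over several lines, so join the pieces.
    return {name: "".join(parts) for name, parts in parts_by_name.items()}


# Toy records for illustration only.
example_lines = [
    ">promoter_1",
    "ATCGCATCG",
    "CACG",
    "",
    ">promoter_2 optional description",
    "GCTACACACCAGT",
]
promoters = read_fasta(example_lines)
```

An open file handle can be passed in place of the list, since iterating over a text file yields its lines.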

Then, we can send these queries in batches (say, 10 queries at a time) and process the results as they become available. In the background, several batches are processed concurrently, which makes the most of the parallelism between our servers and your script:

for batch_result in client.send_requests_by_batches(queries, batch_size=10):
    for query_result in batch_result:
        # Add entries to the output file
        query_result.write_to_jsonl("promoter_activity.jsonl")

Selecting tissue track names

There are over 9000 tissue tracks to choose from (ENCFF068ZBX, ENCFF008DNO, etc.) and it is important to carefully choose the tracks that will be used for promoter evaluation, so they are relevant for the conditions you are interested in.

To help with track selection, we provide PromoterActivityQuery.get_tissue_track_dataframe(), which returns a dataframe describing the relevant tracks:

heart_tracks_df = PromoterActivityQuery.get_tissue_track_dataframe(
    tissue="heart", # optionally provide tissue of interest
    assay="DNASE" # optionally provide assay of interest
)
heart_tracks_df[["name", "description"]]

name         description
ENCFF008DNO  DNASE:heart female embryo (117 days)
ENCFF509TTX  DNASE:heart female embryo (110 days)
ENCFF064YHT  DNASE:heart male child (3 years)

You can then construct the tissue_of_interest parameter:

tissue_of_interest = {"heart": heart_tracks_df["name"].to_list()}
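If only some of the listed tracks match your conditions, you can filter on the description before building tissue_of_interest. The sketch below works on plain dictionaries holding the three rows shown above, so that it runs without pandas; select_tracks is a hypothetical helper. With the real dataframe, heart_tracks_df.to_dict("records") should give rows of this shape.

```python
# The three example rows from the table above.
track_rows = [
    {"name": "ENCFF008DNO", "description": "DNASE:heart female embryo (117 days)"},
    {"name": "ENCFF509TTX", "description": "DNASE:heart female embryo (110 days)"},
    {"name": "ENCFF064YHT", "description": "DNASE:heart male child (3 years)"},
]


def select_tracks(rows, keyword):
    """Return the names of tracks whose description contains `keyword`."""
    needle = keyword.lower()
    return [row["name"] for row in rows if needle in row["description"].lower()]


embryo_tracks = select_tracks(track_rows, "embryo")
embryo_tissue_of_interest = {"heart": embryo_tracks}
```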

Pricing

Each query to the server costs a fixed price of $0.005. A query takes roughly 1 to 2 seconds on a given server. Queries run in parallel, with throughput limited by the number of available servers.
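As a rough budgeting aid, the sketch below turns these figures into a cost and time estimate. The function estimate_run is a hypothetical helper. The number of servers working in parallel is not stated on this page, so parallel_servers is an assumption you would need to supply, and the time it returns is only a ballpark that ignores network and queueing overhead.

```python
import math

PRICE_PER_QUERY_USD = 0.005  # fixed price per query, as stated above


def estimate_run(n_queries, seconds_per_query=1.5, parallel_servers=1):
    """Return (cost in USD, rough wall-clock seconds) for a batch of queries."""
    if n_queries < 0:
        raise ValueError("n_queries must be non-negative")
    if parallel_servers < 1:
        raise ValueError("parallel_servers must be at least 1")
    cost = n_queries * PRICE_PER_QUERY_USD
    # Each "round" runs one query on every server at the same time.
    rounds = math.ceil(n_queries / parallel_servers)
    return cost, rounds * seconds_per_query


# Example: 1000 promoters, 2 s per query, 4 servers assumed to be available.
cost_usd, wall_seconds = estimate_run(1000, seconds_per_query=2.0, parallel_servers=4)
print(f"${cost_usd:.2f}, about {wall_seconds:.0f} s")
```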
