Direct Prediction of Gene Expression with Promoter-0

Introduction

Promoter-0 enables direct prediction of promoter activity for synthetic gene expression cassettes. Our approach builds on the Borzoi model. This page shows how to use the ginkgo-ai-client to run inference with Promoter-0 and generate predictions for synthetic expression cassettes.

Figure 1. Predictions of 19 commonly used promoters' activities in 4 cell lines using Promoter-0.


Usage

To use the Ginkgo AI Python client, you first need a GINKGOAI_API_KEY. To get one, go to models.ginkgobioworks.ai and create a free account. Your API key should be visible once you have logged in; copy it.
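As a minimal sketch of handling the key, the helper below reads it from an environment variable and fails with a clear message if it is missing. It assumes the key is exposed under the name GINKGOAI_API_KEY; the helper load_api_key is hypothetical and not part of ginkgo-ai-client.

```python
import os


def load_api_key(env=os.environ, var_name="GINKGOAI_API_KEY"):
    """Return the API key stored under `var_name`, or raise a clear error.

    Assumption: the key is made available as an environment variable.
    This helper is illustrative and not part of ginkgo-ai-client.
    """
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; copy your key from "
            "models.ginkgobioworks.ai after logging in."
        )
    return key
```

Passing a plain dictionary as env makes the helper easy to check without touching the real environment.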

With the key in hand, install the ginkgo-ai-client package and initialize the client.

Then, build a query, which typically has the following inputs:

  • promoter_sequence: the DNA sequence of the promoter region being evaluated.

  • orf_sequence: the payload sequence, including the 5' and 3' UTRs. We include the complete expression cassette because the Borzoi model predicts gene regulation at the genomic level; the promoter's activity depends on the other DNA sequence elements it is connected to for transcription (i.e., UTRs and coding sequence).

  • source: Specifies how to calculate activity based on track type:

    • Use binding for DNase-seq or ATAC-seq tracks, where activity was calculated from promoter sequence regions. This method is based on information from DNA accessibility.

    • Use expression for RNA-seq tracks, where activity was calculated from transcribed regions. This method is based on information from transcriptional activity.

  • tissue_of_interest: the tissues (muscle, brain, etc.) we are interested in for the prediction. More precisely, it is a dictionary mapping tissue names to their associated genomic tracks in Borzoi. See the section Selecting tissue track names on this page for more information.
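The four inputs above can be sanity-checked before any request is sent. The sketch below is a hypothetical helper, not the client's own query class: it gathers the inputs into a plain dictionary and rejects obvious mistakes. The accepted alphabet (A, C, G, T, N) is an assumption, and the sequences are toy placeholders.

```python
VALID_SOURCES = {"binding", "expression"}  # the two values named above
DNA_ALPHABET = set("ACGTN")  # assumption: N is tolerated for unknown bases


def build_query_inputs(promoter_sequence, orf_sequence, source, tissue_of_interest):
    """Validate the query inputs and return them as a plain dict.

    Hypothetical helper for illustration; the real client builds its
    queries with its own classes.
    """
    cleaned = {}
    for label, seq in (("promoter_sequence", promoter_sequence),
                       ("orf_sequence", orf_sequence)):
        seq = seq.upper()
        if not seq or set(seq) - DNA_ALPHABET:
            raise ValueError(f"{label} must be a non-empty DNA sequence")
        cleaned[label] = seq
    if source not in VALID_SOURCES:
        raise ValueError(f"source must be one of {sorted(VALID_SOURCES)}")
    if not tissue_of_interest or not all(tissue_of_interest.values()):
        raise ValueError("each tissue needs at least one track name")
    cleaned["source"] = source
    cleaned["tissue_of_interest"] = {
        tissue: list(tracks) for tissue, tracks in tissue_of_interest.items()
    }
    return cleaned


# Toy sequences; ENCFF008DNO is a DNase track, hence source="binding".
inputs = build_query_inputs(
    promoter_sequence="ttgacaattaatcatcggctcgtataatg",
    orf_sequence="ATGAAATAG",
    source="binding",
    tissue_of_interest={"heart": ["ENCFF008DNO"]},
)
```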

Finally, we send the request to the server. The output is a dictionary of the form {tissue_name: score}, where each key is a tissue name and each value is the predicted activity level for that tissue.
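Because the result is a plain {tissue_name: score} dictionary, it can be post-processed with ordinary Python. The sketch below ranks tissues from highest to lowest predicted activity. The scores are made-up placeholders, not model output.

```python
def rank_tissues(result):
    """Return (tissue, score) pairs sorted from highest to lowest score."""
    return sorted(result.items(), key=lambda item: item[1], reverse=True)


# Made-up scores, for illustration only.
example_result = {"liver": 0.17, "heart": 0.42, "brain": 0.08}
ranked = rank_tissues(example_result)
top_tissue, top_score = ranked[0]
```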

Evaluating large sets of promoters

If you have a large number of promoters to evaluate, say a thousand, you can send their sequences in batches using client.send_requests_by_batches.

In the particular scenario where all promoter sequences are in a FASTA file (each record being a header line starting with > followed by the sequence), and all evaluations use the same ORF sequence and tracks, the queries can be generated directly from the file's records.
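As a rough sketch of this scenario, the code below parses FASTA records with the standard library and pairs each promoter with the shared ORF sequence, source, and tracks. The functions read_fasta and queries_from_fasta are hypothetical helpers that produce plain dictionaries; the client may offer its own way to build queries from a FASTA file.

```python
def read_fasta(lines):
    """Parse FASTA lines into {record_name: sequence}, joining wrapped lines."""
    records = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("FASTA header without a record name")
            name = header[0]  # keep only the identifier before any whitespace
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[name].append(line)
    return {record: "".join(parts) for record, parts in records.items()}


def queries_from_fasta(lines, orf_sequence, source, tissue_of_interest):
    """Build one query dict per promoter, sharing ORF, source, and tracks."""
    return [
        {
            "name": name,
            "promoter_sequence": sequence,
            "orf_sequence": orf_sequence,
            "source": source,
            "tissue_of_interest": tissue_of_interest,
        }
        for name, sequence in read_fasta(lines).items()
    ]


# Toy records; with a real file, pass an open file handle instead of a list.
fasta_lines = [">promoter_1 toy example", "ACGTAC", "GTTA", "", ">promoter_2", "TTGACA"]
queries = queries_from_fasta(
    fasta_lines,
    orf_sequence="ATGAAATAG",
    source="binding",
    tissue_of_interest={"heart": ["ENCFF008DNO"]},
)
```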

Then, we can send these queries in batches, say 50 sequences at a time, and process the results as they become available. In the background, several batches are processed at the same time, making the most of the parallelization between our servers and your script.
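To illustrate the general pattern (not how client.send_requests_by_batches is implemented), the sketch below splits a list of queries into batches of 50, submits several batches concurrently, and yields each batch's results as soon as it finishes. The send_batch callable is a hypothetical stand-in for whatever actually contacts the server, and max_workers is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def process_by_batches(queries, send_batch, batch_size=50, max_workers=4):
    """Yield (batch_index, results) pairs in completion order.

    `send_batch` is a hypothetical callable taking a list of queries and
    returning a list of results.
    """
    batches = list(iter_batches(queries, batch_size))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(send_batch, batch): index
            for index, batch in enumerate(batches)
        }
        for future in as_completed(futures):
            yield futures[future], future.result()


def fake_send(batch):
    """Stand-in for a server call: echoes each query with a placeholder score."""
    return [{"query": query, "score": None} for query in batch]


# 120 toy queries are split into batches of 50, 50, and 20.
results = dict(process_by_batches(list(range(120)), fake_send))
```

Batches may complete out of order, so the batch index is returned alongside the results to map them back to the original queries.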

Selecting tissue track names

There are over 9000 tissue tracks to choose from (ENCFF068ZBX, ENCFF008DNO, etc.) and it is important to carefully choose the tracks that will be used for promoter evaluation, so they are relevant for the conditions you are interested in.

To help with track selection, we provide .get_tracks_dataframe(), which returns a dataframe describing the relevant tracks, for example:

name         description
ENCFF008DNO  DNASE:heart female embryo (117 days)
ENCFF509TTX  DNASE:heart female embryo (110 days)
ENCFF064YHT  DNASE:heart male child (3 years)

You can then construct the tissue_of_interest parameter from the names of the tracks you selected, grouped by tissue.
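A minimal sketch of that construction follows, using the three example rows above as plain dictionaries so that it runs without a dataframe library. The helper select_tracks is hypothetical, and the "ASSAY:description" layout is inferred from those three rows only.

```python
# The three example rows from the table above, as plain dictionaries.
tracks = [
    {"name": "ENCFF008DNO", "description": "DNASE:heart female embryo (117 days)"},
    {"name": "ENCFF509TTX", "description": "DNASE:heart female embryo (110 days)"},
    {"name": "ENCFF064YHT", "description": "DNASE:heart male child (3 years)"},
]


def select_tracks(rows, assay, keyword):
    """Return names of tracks whose description starts with `assay:`
    and contains `keyword` (case-insensitive substring match)."""
    prefix = assay.upper() + ":"
    return [
        row["name"]
        for row in rows
        if row["description"].upper().startswith(prefix)
        and keyword.lower() in row["description"].lower()
    ]


# Map each tissue name to the list of track names selected for it.
tissue_of_interest = {"heart": select_tracks(tracks, "DNASE", "heart")}
```

Substring matching is crude, so it is worth reading the selected descriptions before using them, since a keyword can appear in unrelated track descriptions.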

Pricing

Each query to the server costs a fixed price of $0.005. Each query takes roughly 1 to 2 seconds on a given server; queries run in parallel, with throughput limited by the number of available servers.
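These figures allow a quick back-of-the-envelope estimate before launching a large run. In the sketch below, the price and the 1 to 2 second range come from the text above, while the number of servers working in parallel is a hypothetical parameter, since it is not stated here.

```python
from decimal import Decimal

PRICE_PER_QUERY = Decimal("0.005")  # dollars per query, from the pricing above
SECONDS_PER_QUERY = (1, 2)  # rough lower and upper bounds on one server


def estimate_cost(n_queries):
    """Total price in dollars for `n_queries` queries."""
    return PRICE_PER_QUERY * n_queries


def estimate_runtime_seconds(n_queries, n_parallel_servers):
    """Rough (low, high) wall-clock estimate in seconds.

    `n_parallel_servers` is a hypothetical value; actual availability varies.
    """
    rounds = -(-n_queries // n_parallel_servers)  # ceiling division
    low, high = SECONDS_PER_QUERY
    return rounds * low, rounds * high


cost = estimate_cost(1000)
runtime = estimate_runtime_seconds(1000, n_parallel_servers=10)
```

Under these assumptions, a thousand queries cost 5 dollars, and with ten servers in parallel they would take on the order of 100 to 200 seconds.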
