mRNA discrete diffusion (mDD)

mrna-foundation: Ginkgo's discrete diffusion model for mRNA design

Model Overview

This mRNA discrete diffusion model (mrna-foundation) supports sequence-based generation and mask filling of 3' UTRs and 5' UTRs. When conditioned on a species of interest, it also translates input protein sequences into codon sequences. Together, these capabilities allow unconditional generation of an entire mRNA: UTR design plus codon optimization for a species of interest.

Overview: Designing 3' and 5' UTRs

The model generates 3' and 5' UTRs iteratively. It can start from scratch, given only the length of the sequence to generate, or from a user-defined template whose blank (masked) positions it fills in.

The main input for generating UTRs is a template containing any number of mask tokens (<mask>).
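As a minimal sketch of how such templates can be built, the helpers below use plain string operations. The function names (fully_masked, mask_positions, count_masks) are hypothetical and are not part of the Ginkgo client; only the <mask> token comes from the description above.

```python
MASK = "<mask>"  # mask token described above


def fully_masked(length: int) -> str:
    """Template for generating a UTR from scratch, given only its length."""
    if length < 0:
        raise ValueError("length must be non-negative")
    return MASK * length


def mask_positions(sequence: str, positions) -> str:
    """Replace the bases at the given 0-based positions with mask tokens."""
    to_mask = set(positions)
    return "".join(
        MASK if i in to_mask else base for i, base in enumerate(sequence)
    )


def count_masks(template: str) -> int:
    """Number of positions the model would have to fill in."""
    return template.count(MASK)


# Hypothetical usage: a 20-position blank template and a partially masked one.
blank_template = fully_masked(20)
partial_template = mask_positions("AAATTTTGGGCCCT", [3, 12, 13])
```

A partially masked template keeps the fixed bases as constraints, while a fully masked one only fixes the length.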

Additional parameters control how UTRs are generated: the decoding (sampling) strategy, the number of generated samples, and the number of positions to unmask at each step. Unmasking more positions per step can speed up sampling, because it reduces the number of passes through the model. The value can be one of [1, 2, 4]; values above 4 tend to reduce the quality of the generated samples.

Decoding strategies:
    * `random`: Positions to unmask are ordered randomly.
    * `max_prob`: Positions to unmask are ordered by the highest per-position probability.
    * `entropy`: Positions to unmask are ordered by the lowest per-position entropy.
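
The three orderings, and the effect of the per-step unmasking count on the number of model passes, can be illustrated with a small sketch. This is not the model's implementation: the function names and the toy probability tables are assumptions made for illustration, and the real model computes its per-position distributions internally.

```python
import math
import random


def entropy(dist):
    """Shannon entropy (in nats) of one per-position token distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def order_positions(probs, strategy, rng=None):
    """Order masked positions for unmasking.

    probs maps each masked position to its predicted token distribution
    (a hypothetical stand-in for the model's output).
    """
    positions = list(probs)
    if strategy == "random":
        (rng or random).shuffle(positions)
        return positions
    if strategy == "max_prob":
        # Most confident positions (highest top probability) first.
        return sorted(positions, key=lambda i: max(probs[i]), reverse=True)
    if strategy == "entropy":
        # Least uncertain positions (lowest entropy) first.
        return sorted(positions, key=lambda i: entropy(probs[i]))
    raise ValueError(f"unknown strategy: {strategy}")


def passes_needed(num_masked, unmaskings_per_step):
    """Model passes required if a fixed number of positions is unmasked per pass."""
    return math.ceil(num_masked / unmaskings_per_step)


# Toy distributions over four tokens for three masked positions.
toy_probs = {
    0: [0.25, 0.25, 0.25, 0.25],
    1: [0.97, 0.01, 0.01, 0.01],
    2: [0.50, 0.50, 0.00, 0.00],
}
entropy_order = order_positions(toy_probs, "entropy")
max_prob_order = order_positions(toy_probs, "max_prob")
```

Under these assumptions, a template with 20 masked positions needs 20 passes when unmasking one position per step and 5 passes when unmasking four.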

Example Generation

Prerequisites

To use Ginkgo models, first install Ginkgo's AI model API client.

pip install ginkgo-ai-client

Register at https://models.ginkgobioworks.ai/ to get credits and an API key (of the form xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx). Store the API key in the GINKGOAI_API_KEY environment variable.
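
Before sending queries, it can help to confirm that the key is actually visible to the Python process. The check below is a minimal sketch using only the standard library; the helper name get_api_key is hypothetical and is not part of the Ginkgo client.

```python
import os


def get_api_key(env=os.environ):
    """Return the API key from GINKGOAI_API_KEY, or fail with a clear message."""
    key = env.get("GINKGOAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GINKGOAI_API_KEY is not set; register to obtain a key and "
            "export it before creating the client."
        )
    return key


# Hypothetical usage, at the top of a script:
#     get_api_key()  # raises RuntimeError early if the key is missing
```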

Example

The code below demonstrates how to partially and fully mask UTRs for unconditional generation of mRNA sequences.

Note that a "-" character is appended to the protein sequence to indicate its end.


from ginkgo_ai_client import GinkgoAIClient
from ginkgo_ai_client.queries import RNADiffusionMaskedQuery

# The API key is expected in the GINKGOAI_API_KEY environment variable.
client = GinkgoAIClient()

three_utr = "<mask>" * 20  # fully masked 3' UTR of length 20
five_utr = "AAA<mask>TTTGGGCC<mask><mask>"  # partially masked 5' UTR
protein_sequence = "MAKS-"  # '-' denotes end of protein sequence
species = "HOMO_SAPIENS"

query = RNADiffusionMaskedQuery(
    three_utr=three_utr,
    five_utr=five_utr,
    protein_sequence=protein_sequence,
    species=species,
    model="mrna-foundation",
    temperature=1.0,
    decoding_order_strategy="entropy",
    unmaskings_per_step=4,  # one of 1, 2, 4; higher values tend to reduce quality
    num_samples=1,
)

response = client.send_request(query)
samples = response.samples

Output

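The output is a list of dictionaries containing generated codon sequences, UTRs, and input parameters. The example below illustrates the format; its values do not correspond exactly to the query above (for instance, the 3' UTR length and the decoding strategy differ).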

[{
    'species': 'HOMO_SAPIENS',
    'five_utr': 'AAATTTTGGGCCCT',
    'three_utr': 'AAACTTTGGGCCGA',
    'num_samples': 1,
    'temperature': 1,
    'num_to_decode_per_step': 4,
    'decoding_order_strategy': 'max_prob',
    'codon_sequence': 'ATGGCAAAGAGCTGA',
    'protein_sequence': 'MAKS-'
}]
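
A returned sample can be sanity-checked and assembled into a full mRNA with a few lines of standard Python. The sketch below assumes that each sample behaves like a dictionary with the keys shown above, that the standard genetic code applies, and that the "-" end marker corresponds to a stop codon. The helper names translate and assemble_mrna are hypothetical and are not part of the Ginkgo client.

```python
# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(codon_sequence: str) -> str:
    """Translate a coding sequence, writing stop codons as '-' (assumed convention)."""
    seq = codon_sequence.upper().replace("U", "T")
    if len(seq) % 3 != 0:
        raise ValueError("codon sequence length must be a multiple of 3")
    protein = "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))
    return protein.replace("*", "-")


def assemble_mrna(sample: dict) -> str:
    """Concatenate 5' UTR, coding sequence, and 3' UTR into one sequence."""
    return sample["five_utr"] + sample["codon_sequence"] + sample["three_utr"]


# The example output shown above, reduced to the sequence fields.
sample = {
    "five_utr": "AAATTTTGGGCCCT",
    "three_utr": "AAACTTTGGGCCGA",
    "codon_sequence": "ATGGCAAAGAGCTGA",
    "protein_sequence": "MAKS-",
}
round_trip_ok = translate(sample["codon_sequence"]) == sample["protein_sequence"]
full_mrna = assemble_mrna(sample)
```

Checking that the codon sequence translates back to the input protein is an inexpensive way to catch mismatches before using a design downstream.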

Viewing available species

mrna-foundation was trained on hundreds of species. To list the species that can be conditioned on during generation:

from ginkgo_ai_client.queries import RNADiffusionMaskedQuery

species = RNADiffusionMaskedQuery.get_species_dataframe()
print(species)
