mRNA discrete diffusion (mDD)
mrna-foundation: Ginkgo's discrete diffusion model for mRNA design
Model Overview
The mRNA discrete diffusion model (mrna-foundation) supports sequence-based generation and mask-filling of 3' UTRs and 5' UTRs using discrete diffusion. It also translates input protein sequences into codon sequences when conditioned on a species of interest. Together, these capabilities allow unconditional generation of an entire mRNA: UTR design plus codon optimization for a species of interest.
Overview: Designing 3' and 5' UTRs
The model generates 3' and 5' UTRs iteratively. It can start from scratch, given only the length of the sequence to generate, or from a user-defined template, in which case it fills in the blank positions.
The main input for generating UTRs is a template containing any number of mask tokens (<mask>).
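As a minimal sketch of how such templates might be built, the helpers below produce a fully masked template of a given length and a partially masked one. The function names are hypothetical, and the sketch assumes one <mask> token per nucleotide position with tokens concatenated without separators; the nucleotide letters are illustrative only.

```python
MASK = "<mask>"  # mask token named in the text above


def fully_masked_template(length: int) -> str:
    """Return a template of `length` mask tokens (generation from scratch)."""
    if length <= 0:
        raise ValueError("length must be positive")
    return MASK * length


def mask_region(sequence: str, start: int, end: int) -> str:
    """Replace sequence[start:end] with one mask token per position.

    Assumption: each masked nucleotide is represented by its own token.
    """
    if not 0 <= start <= end <= len(sequence):
        raise ValueError("start/end must satisfy 0 <= start <= end <= len(sequence)")
    return sequence[:start] + MASK * (end - start) + sequence[end:]


def count_masks(template: str) -> int:
    """Count how many positions the model would be asked to fill."""
    return template.count(MASK)


if __name__ == "__main__":
    print(fully_masked_template(3))
    print(mask_region("AAGGCCTT", 2, 5))
```

A fully masked template corresponds to generating a UTR from length alone, while a partially masked one keeps the fixed positions and leaves the rest for the model to fill.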
Additional parameters control how UTRs are generated: the sampling (decoding) strategy, the number of samples to generate, and the number of positions to unmask at each step. Unmasking more positions per step can speed up sampling, because generation then needs fewer passes through the model. This value can be one of [1, 2, 4]. Any value above 4 tends to reduce the quality of the generated samples. Decoding strategy is one of the following:
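To illustrate why the number unmasked per step affects sampling speed, the sketch below estimates the number of model passes for a template. It is a rough calculation under stated assumptions, not the service's actual scheduler: it assumes one forward pass per step and that every step unmasks exactly the configured number of positions until fewer remain. The function name is hypothetical.

```python
import math

# Values the text above lists as valid for the number unmasked per step.
ALLOWED_UNMASK_PER_STEP = (1, 2, 4)


def estimated_model_passes(num_masked: int, unmask_per_step: int) -> int:
    """Estimate forward passes needed to fill `num_masked` positions.

    Assumption: one pass per step, `unmask_per_step` positions filled per step.
    """
    if unmask_per_step not in ALLOWED_UNMASK_PER_STEP:
        raise ValueError(f"unmask_per_step must be one of {ALLOWED_UNMASK_PER_STEP}")
    if num_masked < 0:
        raise ValueError("num_masked must be non-negative")
    return math.ceil(num_masked / unmask_per_step)


if __name__ == "__main__":
    for k in ALLOWED_UNMASK_PER_STEP:
        print(k, estimated_model_passes(50, k))
```

Under these assumptions, a 50-position template needs 50 passes at 1 per step, 25 at 2 per step, and 13 at 4 per step.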
Example Generation
Prerequisites
To use Ginkgo models, first install Ginkgo's AI model API client.
Register at https://models.ginkgobioworks.ai/ to get credits and an API key (of the form xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx). Store the API key in the GINKGOAI_API_KEY environment variable.
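Before running any example, it can help to confirm that the key is actually visible to the Python process. The check below is a small sketch using only the standard library; the helper name get_api_key is hypothetical and is not part of the Ginkgo client.

```python
import os
from typing import Mapping, Optional

API_KEY_VAR = "GINKGOAI_API_KEY"  # variable name given in the text above


def get_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key from the environment, or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get(API_KEY_VAR, "").strip()
    if not key:
        raise RuntimeError(
            f"{API_KEY_VAR} is not set; register for a key and export it first"
        )
    return key


if __name__ == "__main__":
    try:
        get_api_key()
        print(f"{API_KEY_VAR} is set")
    except RuntimeError as err:
        print(err)
```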
Example
The code below demonstrates how to partially and fully mask UTRs for unconditional generation of mRNA sequences.
Note that a "-" character is appended to the protein sequence to indicate its end.
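The end-of-sequence convention can be applied with a small helper such as the sketch below. The function name is hypothetical, and rejecting a "-" that appears anywhere other than the final position is an assumption made here for safety, not a documented rule of the model.

```python
END_OF_PROTEIN = "-"  # end-of-sequence marker described in the text above


def with_end_marker(protein: str) -> str:
    """Return `protein` with a single trailing "-" marking its end."""
    protein = protein.strip()
    body = protein[:-1] if protein.endswith(END_OF_PROTEIN) else protein
    if not body:
        raise ValueError("protein sequence is empty")
    if END_OF_PROTEIN in body:
        # Assumption: the marker is only meaningful at the very end.
        raise ValueError('"-" may only appear at the end of the sequence')
    return body + END_OF_PROTEIN


if __name__ == "__main__":
    print(with_end_marker("MKTAYIAK"))
```

The helper is idempotent, so it is safe to call on a sequence that already ends with the marker.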
Output
The output is a list of dictionaries containing generated codon sequences, UTRs, and input parameters.
Viewing available species
mrna-foundation is trained on sequences from hundreds of species. To list the species that generation can be conditioned on: