Ginkgo Docs

ginkgo-maskedlm-3utr

3' UTR masked language model


Last updated 4 months ago

Model Overview

ginkgo-maskedlm-3utr is a 44M-parameter large language model (LLM) trained to predict masked nucleotides in genomic 3' UTRs. The model was trained on a curated dataset of 3' UTRs from 125 mammalian species. When used in conjunction with a supervised model, we showed that our 3' UTR LLM selects 3' UTR mutations that increase mRNA stability more effectively than random selection of mutations. See our publication for more details on this work.

Input

This model accepts uppercase nucleotide characters (A, T, G, C) as well as the special tokens <unk>, <pad>, <cls>, <mask>, and <eos>. All sequences are limited to 1000 nucleotides.
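The input rules can be checked locally before a sequence is submitted. The following is a minimal sketch, not part of any Ginkgo library: the function name `tokenize` is hypothetical, and it assumes that each special token counts as one position toward the 1000-position limit, which this page does not specify.

```python
import re

# Special tokens listed in the Input section.
SPECIAL_TOKENS = ("<unk>", "<pad>", "<cls>", "<mask>", "<eos>")

# Assumption: the limit counts every nucleotide or special token as one position.
MAX_LENGTH = 1000

# Try the special tokens first, then a single uppercase nucleotide.
_TOKEN_PATTERN = re.compile(
    "|".join(re.escape(tok) for tok in SPECIAL_TOKENS) + "|[ATGC]"
)


def tokenize(sequence: str) -> list[str]:
    """Split a sequence into tokens, raising ValueError on unsupported input."""
    tokens = []
    pos = 0
    while pos < len(sequence):
        match = _TOKEN_PATTERN.match(sequence, pos)
        if match is None:
            raise ValueError(
                f"Unsupported character {sequence[pos]!r} at position {pos}"
            )
        tokens.append(match.group())
        pos = match.end()
    if len(tokens) > MAX_LENGTH:
        raise ValueError(f"Sequence has {len(tokens)} tokens; limit is {MAX_LENGTH}")
    return tokens
```

Under these assumptions, lowercase input and RNA-style `U` characters are rejected, so a sequence may need uppercasing or U-to-T conversion first.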

Output

For the embedding task, we return a mean-pooled embedding representation of the nucleotide sequence.
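Mean pooling collapses one vector per token into a single fixed-length vector by averaging over positions. The sketch below illustrates the general technique only. The function name `mean_pool` and the `attention_mask` argument are hypothetical, and whether the served model excludes padding or special tokens from the average is not stated on this page.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average per-token vectors over the positions where attention_mask is 1.

    token_embeddings: list of equal-length lists of floats, one per token.
    attention_mask: list of 0/1 flags, one per token (0 = ignore, e.g. padding).
    """
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("No positions selected for pooling")
    dim = len(kept[0])
    # Average each embedding dimension across the kept positions.
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]
```

The result has the same dimensionality as a single token embedding regardless of sequence length, which is what makes it convenient as input to a downstream supervised model.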

For the masked modeling task, we return the completed sequence, with each <mask> token replaced by the model's prediction.
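To make the shape of that output concrete, the sketch below substitutes one nucleotide for each `<mask>` token, in order. It is an illustration only: `fill_masks` is a hypothetical helper, and the predictions here are supplied by hand rather than produced by the model.

```python
MASK = "<mask>"


def fill_masks(sequence: str, predictions: list[str]) -> str:
    """Replace each <mask> in sequence with the matching entry of predictions."""
    parts = sequence.split(MASK)
    n_masks = len(parts) - 1
    if n_masks != len(predictions):
        raise ValueError(
            f"Expected {n_masks} predictions, got {len(predictions)}"
        )
    # Interleave the unmasked stretches with the predicted nucleotides.
    completed = [parts[0]]
    for nucleotide, rest in zip(predictions, parts[1:]):
        completed.append(nucleotide)
        completed.append(rest)
    return "".join(completed)
```

For example, `fill_masks("AT<mask>GC<mask>", ["C", "A"])` yields `"ATCGCA"`: the unmasked positions are unchanged and each mask becomes a single nucleotide.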

Colab Examples

  • Using our 3' UTR LLM to select mutations in the 3' UTR

  • ML-driven design of 3' UTRs for mRNA stability
  • Masking example