MODELS

This page contains a description of all of our models, with links to quick start Google Colab notebooks that allow you to run our models without any setup.

Ginkgo-AA-0-650M

Model Overview

This is a 650M parameter protein language model, trained to do both masked language modeling and embedding calculation. This model was trained on a proprietary dataset consisting of a large internal Ginkgo sequence database, along with publicly available sequences from UniRef. You can see more information about this model in our technical post on the Ginkgo blog.

Input

The input uses the ESMTokenizer. This tokenizer accepts uppercase characters representing amino acids as well as the special tokens <unk>, <pad>, <cls>, <mask>, and <eos>. Please refer to the ESMTokenizer docs for more information on the inputs. Additionally, the playground allows newlines in the input, so you can copy and paste a protein from a FASTA file; newlines are stripped and the input is treated as one continuous sequence. All sequences are limited to 1000 amino acids.
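
As a rough sketch of this preprocessing, the snippet below strips newlines from a pasted FASTA body, enforces the 1000 amino acid limit, and tokenizes with the ESM tokenizer from Hugging Face transformers. The public ESM2 checkpoint name is used here only to load the shared ESM vocabulary; it is not the Ginkgo model itself.

from transformers import EsmTokenizer

# Checkpoint name used only to fetch the ESM vocabulary for illustration.
tokenizer = EsmTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")

fasta_body = "MKTAYIAKQR\nQISFVKSHFS\nRQLEERLGLI"  # sequence lines pasted from a FASTA file
sequence = fasta_body.replace("\n", "")            # newlines are stripped into one continuous sequence

if len(sequence) > 1000:
    raise ValueError("Sequences are limited to 1000 amino acids")

encoded = tokenizer(sequence)
print(len(encoded["input_ids"]))                   # len(sequence) + 2, including <cls> and <eos>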

Output

For the embedding task, we return a mean-pooled embedding representation of the amino acid sequence.

For the masked modeling task, we return the completed sequence with mask tokens replaced with predictions.

Colab Examples

ESM2-650M

Model Overview

This is a 650M parameter protein language model, trained to do both masked language modeling and embedding calculation. This uses the publicly available weights trained by the ESM2 team at Meta.

Input

The input uses the ESMTokenizer. This tokenizer accepts uppercase characters representing amino acids as well as the special tokens <unk>, <pad>, <cls>, <mask>, and <eos>. Please refer to the ESMTokenizer docs for more information on the inputs. Additionally, the playground allows newlines in the input, so you can copy and paste a protein from a FASTA file; newlines are stripped and the input is treated as one continuous sequence. All sequences are limited to 1000 amino acids.

Output

For the embedding task, we return a mean-pooled embedding representation of the amino acid sequence.

For the masked modeling task, we return the completed sequence with mask tokens replaced with predictions.
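
Because this model uses the public ESM2 weights, the two outputs described above can be reproduced locally with Hugging Face transformers. The sketch below computes one common mean-pooled embedding (averaging the final hidden states over all non-padded positions) and fills masked positions with the top prediction; the exact pooling used by the hosted model may differ slightly.

import torch
from transformers import EsmTokenizer, EsmModel, EsmForMaskedLM

checkpoint = "facebook/esm2_t33_650M_UR50D"
tokenizer = EsmTokenizer.from_pretrained(checkpoint)

# Embedding task: mean-pool the final hidden states over the sequence positions.
embedder = EsmModel.from_pretrained(checkpoint).eval()
inputs = tokenizer("MKTAYIAKQRQISFVKSHFSRQLEERLGLI", return_tensors="pt")
with torch.no_grad():
    hidden = embedder(**inputs).last_hidden_state            # (1, length + 2, 1280)
mask = inputs["attention_mask"].unsqueeze(-1)
embedding = (hidden * mask).sum(1) / mask.sum(1)              # (1, 1280) mean-pooled embedding

# Masked modeling task: replace each <mask> token with the model's top prediction.
mlm = EsmForMaskedLM.from_pretrained(checkpoint).eval()
masked_inputs = tokenizer("MKTAYIAKQRQ<mask>SFVKSHFSRQLEERLGLI", return_tensors="pt")
with torch.no_grad():
    logits = mlm(**masked_inputs).logits
ids = masked_inputs["input_ids"].clone()
mask_positions = ids == tokenizer.mask_token_id
ids[mask_positions] = logits.argmax(-1)[mask_positions]
tokens = tokenizer.convert_ids_to_tokens(ids[0].tolist())
completed = "".join(t for t in tokens if t not in tokenizer.all_special_tokens)
print(embedding.shape, completed)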

Colab Examples

Promoter activity inference using Promoter-0 with Borzoi

Model Overview

This model predicts promoter activity in specific tissues from provided genetic sequences and tissue-specific tracks. For detailed usage of the model, please refer to the Promoter-0 documentation.

3' UTR masked language model

Model Overview

ginkgo-maskedlm-3utr is a 44M parameter large language model (LLM) trained to perform masked language modeling on genomic 3’ UTRs. This model was trained on a curated dataset consisting of 3’ UTRs from 125 mammalian species. When used in conjunction with a supervised model, we showed that our 3' UTR LLM can be used to select mutations in the 3' UTR that increase mRNA stability more effectively than randomly selected mutations. See our publication ML-driven design of 3’ UTRs for mRNA Stability for more details on this work.

Input

This tokenizer accepts uppercase nucleotide characters (A, T, G, C) as well as the special tokens <unk>, <pad>, <cls>, <mask>, and <eos>. All sequences are limited to 1000 nucleotides.

Output

For the embedding task, we return a mean-pooled embedding representation of the nucleotide sequence.

For the masked modeling task, we return the completed sequence with mask tokens replaced with predictions.
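
As a rough illustration of how a masked nucleotide language model can guide mutation selection, the sketch below masks one 3' UTR position and ranks the four bases by model probability. The model identifier is a placeholder, a character-level nucleotide tokenizer (as with the protein models above) is assumed, and the supervised stability model used in the publication is not shown.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL_ID = "path/to/ginkgo-maskedlm-3utr"   # placeholder: substitute the actual checkpoint or client call

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID).eval()

def score_bases_at(utr: str, position: int) -> dict:
    """Mask one 3' UTR position and return the model's probability for each base there."""
    masked = utr[:position] + tokenizer.mask_token + utr[position + 1:]
    inputs = tokenizer(masked, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].item()
    probs = logits[0, mask_index].softmax(-1)
    return {base: probs[tokenizer.convert_tokens_to_ids(base)].item() for base in "ATGC"}

# Rank candidate substitutions at one position; in practice this would be combined with
# the supervised stability model described in the publication (not shown here).
print(score_bases_at("ATGCTTTAAGCCGGTTACA", position=7))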

Colab Examples

Microbial DNA Design with Diffusion

Model Overview

Our first long-context DNA diffusion model focuses on microbial DNA design. The autoregressive diffusion model operates by iteratively refining DNA sequences based on the model's understanding of DNA regulatory logic in microbial genomes. Starting from a user-provided sequence template, it gradually transforms the unspecified nucleotide characters into a coherent and functional DNA sequence. At each step, the model uses the surrounding DNA context (up to 30 kb) to predict the best nucleotide choices for the masked positions specified by the user.

Input

The model requires a sequence template consisting of IUPAC nucleotide codes, up to 30,000 characters in length. Degenerate nucleotides are supported (e.g., N indicating any of A, T, G, or C; K indicating either G or T). Optionally, users may customize generation by specifying a sampling temperature between 0.0 and 1.0, a decoding strategy, and the number of positions to unmask per step (unmaskings_per_step).

Decoding strategies:
    * `random`: Positions to unmask are ordered randomly.
    * `max_prob`: Positions to unmask are ordered by the highest per-position probability.
    * `entropy`: Positions to unmask are ordered by the lowest per-position entropy.

Example

{
  "sequence": "AAAATGKRYGCTAAATAGTTRNNNNNNNNN",
  "temperature": 0.5,
  "decoding_order_strategy": "entropy",
  "unmaskings_per_step": 1
}
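
A minimal sketch of assembling and validating this request body in Python is shown below. The IUPAC alphabet, 30,000-character limit, and decoding strategies come from the description above; how the request is submitted (endpoint, authentication, client library) is not shown and will depend on your setup.

import json

# IUPAC nucleotide codes accepted in the sequence template.
IUPAC = set("ACGTRYSWKMBDHVN")

def build_request(template: str, temperature: float = 0.5,
                  strategy: str = "entropy", unmaskings_per_step: int = 1) -> str:
    """Validate a template and serialize it into the request body shown above."""
    template = template.upper()
    bad = set(template) - IUPAC
    if bad:
        raise ValueError(f"Non-IUPAC characters in template: {sorted(bad)}")
    if len(template) > 30_000:
        raise ValueError("Template exceeds the 30,000-character limit")
    if strategy not in {"random", "max_prob", "entropy"}:
        raise ValueError(f"Unknown decoding strategy: {strategy}")
    return json.dumps({
        "sequence": template,
        "temperature": temperature,
        "decoding_order_strategy": strategy,
        "unmaskings_per_step": unmaskings_per_step,
    })

print(build_request("AAAATGKRYGCTAAATAGTTRNNNNNNNNN"))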

Output

The output is a dictionary with the key "sequence", containing the nucleotide sequence generated by diffusion sampling.

Example

{
    "sequence": ["aaaatgggtgctaaatagttaatttttatg"]
}

Antibody Discrete Diffusion

Model Overview

The Antibody Discrete Diffusion model performs sequence-based generation or mask-filling using discrete diffusion, trained from an 8M parameter ESM model. It iteratively generates antibody variable heavy or light chains, either from scratch (given only the length of sequence to generate) or from a user-defined template whose masked positions it fills in.

Input

The main input is a sequence with any number of mask tokens (<mask>).

Additionally, there are parameters for users to modify the generation: a decoding strategy, a sampling temperature, and the number of positions to unmask at each step. Unmasking more positions per step can speed up sampling by reducing the number of passes through the model when generating; the value must be one of 1, 2, or 4, as values above 4 tend to reduce the quality of generated samples. Temperature values range from 0.0 to 1.0.

Decoding strategies:
    * `random`: Positions to unmask are ordered randomly.
    * `max_prob`: Positions to unmask are ordered by the highest per-position probability.
    * `entropy`: Positions to unmask are ordered by the lowest per-position entropy.

Example

Full Generation

{
  "sequence": "<mask>" * 128,
  "temperature": 0.75,
  "decoding_order_strategy": "entropy",
  "unmaskings_per_step": 4
}

Sequence

{
  "sequence": "EVQLLESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSAISGS<mask><mask><mask><mask><mask><mask>SVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCAKDRDYYGSGSYYLPFDYWGQGTLVTVSS",
  "temperature": 0.75,
  "decoding_order_strategy": "entropy",
  "unmaskings_per_step": 1
}
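
In the full-generation example above, "<mask>" * 128 is Python-style shorthand for 128 repeated <mask> tokens. The sketch below builds both request bodies in Python; submission details (endpoint, authentication, client library) are omitted.

import json

# Full generation: 128 repeated <mask> tokens, i.e. the "<mask>" * 128 shorthand written out.
full_generation = {
    "sequence": "<mask>" * 128,
    "temperature": 0.75,
    "decoding_order_strategy": "entropy",
    "unmaskings_per_step": 4,   # must be 1, 2, or 4; larger values reduce sample quality
}

# Template fill-in: only the six <mask> positions in the heavy-chain template are generated.
template = (
    "EVQLLESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSAISGS"
    + "<mask>" * 6
    + "SVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCAKDRDYYGSGSYYLPFDYWGQGTLVTVSS"
)
fill_in = {
    "sequence": template,
    "temperature": 0.75,
    "decoding_order_strategy": "entropy",
    "unmaskings_per_step": 1,
}

print(json.dumps(full_generation))
print(json.dumps(fill_in))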

Output

The output is a dictionary with the key "sequence", containing the single generated amino acid sequence.

Example

{
    "sequence": ["EVQLLESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSAISGSGGSTYYADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCAKDRDYYGSGSYYLPFDYWGQGTLVTVSS"]
}

Colab Examples
