ESM2-650M
Model Overview
This is a 650M-parameter protein language model that supports both masked language modeling and embedding calculation. It uses the publicly available weights trained by the ESM2 team at Meta.
Input
The input uses the ESMTokenizer. This tokenizer accepts uppercase characters representing amino acids as well as the special tokens <unk>, <pad>, <cls>, <mask>, and <eos>. Please refer to the ESMTokenizer docs for more information on the inputs. Additionally, the playground allows newlines in the input, so you can copy and paste a protein from a FASTA file. We strip newlines, and the input is treated as one continuous sequence. All sequences are limited to 1000 amino acids.
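As a minimal sketch of the equivalent local preprocessing, the snippet below strips newlines, enforces the length limit, and tokenizes with the open Hugging Face checkpoint facebook/esm2_t33_650M_UR50D. The checkpoint name and client-side preprocessing are assumptions for illustration; the hosted service handles tokenization server-side.

```python
# Sketch of local input preparation (assumed checkpoint; the hosted
# service tokenizes server-side).
from transformers import AutoTokenizer

MAX_RESIDUES = 1000  # hosted limit described above

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")

# A FASTA record body often wraps lines; newlines are stripped so the
# input is treated as one continuous sequence.
raw = "MKTAYIAKQR\nQISFVKSHFS\nRQLEERLGLI"
sequence = raw.replace("\n", "")
assert len(sequence) <= MAX_RESIDUES

encoded = tokenizer(sequence, return_tensors="pt")
# The tokenizer brackets the sequence with <cls> ... <eos>.
print(encoded["input_ids"].shape)
```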
Output
For the embedding task, we return a mean-pooled embedding representation of the amino acid sequence.
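The following is a hedged sketch of what mean pooling means here, computed locally from the open ESM2-650M weights. The checkpoint name and the masked-average pooling are assumptions about an equivalent local computation, not the service's exact implementation.

```python
import torch
from transformers import AutoTokenizer, EsmModel

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = EsmModel.from_pretrained("facebook/esm2_t33_650M_UR50D")
model.eval()

inputs = tokenizer("MKTAYIAKQRQISFVKSHFSRQLEERLGLI", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, tokens, 1280)

# Average over token positions, excluding padding, to get a single
# fixed-size vector per sequence.
mask = inputs["attention_mask"].unsqueeze(-1)
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (1, 1280)
```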
For the masked modeling task, we return the completed sequence with mask tokens replaced with predictions.
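For illustration, here is a sketch of the same mask-fill computation with the open weights, again assuming the facebook/esm2_t33_650M_UR50D checkpoint. The greedy argmax fill is one reasonable decoding choice, not necessarily the one used by the service.

```python
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = EsmForMaskedLM.from_pretrained("facebook/esm2_t33_650M_UR50D")
model.eval()

masked = "MKTAYIAKQR<mask>ISFVKSHFSRQLEERLGLI"
inputs = tokenizer(masked, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Replace each <mask> position with its highest-probability token.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)
predicted_ids = inputs["input_ids"].clone()
predicted_ids[mask_positions] = logits[mask_positions].argmax(dim=-1)

# Rebuild the completed sequence, dropping <cls>/<eos> and padding.
tokens = tokenizer.convert_ids_to_tokens(predicted_ids[0])
completed = "".join(t for t in tokens if t not in tokenizer.all_special_tokens)
print(completed)
```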
Colab Examples
The PETase Mask Fill Challenge: a simple protein engineering task demonstrating the use of Ginkgo's protein sequence model API.