ESM2-650M
Model Overview
This is a 650M-parameter protein language model that supports both masked language modeling and embedding calculation. It uses the publicly available weights trained by the ESM2 team at Meta.
Input
The input uses the ESMTokenizer. This tokenizer accepts uppercase characters representing amino acids as well as the special tokens <unk>, <pad>, <cls>, <mask>, and <eos>. Please refer to the ESMTokenizer docs for more information on the inputs. Additionally, the playground allows newlines in the input, so you can copy and paste a protein from a FASTA file. We strip newlines, and the input is treated as one continuous sequence. All sequences are limited to 1000 amino acids.
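As a minimal sketch of the equivalent local preprocessing, the snippet below strips newlines, enforces the length limit, and tokenizes with the open Hugging Face checkpoint facebook/esm2_t33_650M_UR50D. The checkpoint name and client-side preprocessing are assumptions for illustration; the hosted service handles tokenization server-side.

```python
# Sketch of local input preparation (assumed checkpoint; the hosted
# service tokenizes server-side).
from transformers import AutoTokenizer

MAX_RESIDUES = 1000  # hosted limit described above

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")

# A FASTA record body often wraps lines; newlines are stripped so the
# input is treated as one continuous sequence.
raw = "MKTAYIAKQR\nQISFVKSHFS\nRQLEERLGLI"
sequence = raw.replace("\n", "")
assert len(sequence) <= MAX_RESIDUES

encoded = tokenizer(sequence, return_tensors="pt")
# The tokenizer brackets the sequence with <cls> ... <eos>.
print(encoded["input_ids"].shape)
```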
Output
For the embedding task, we return a mean-pooled embedding representation of the amino acid sequence.
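The following is a hedged sketch of what mean pooling means here, computed locally from the open ESM2-650M weights. The checkpoint name and the masked-average pooling are assumptions about an equivalent local computation, not the service's exact implementation.

```python
import torch
from transformers import AutoTokenizer, EsmModel

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = EsmModel.from_pretrained("facebook/esm2_t33_650M_UR50D")
model.eval()

inputs = tokenizer("MKTAYIAKQRQISFVKSHFSRQLEERLGLI", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, tokens, 1280)

# Average over token positions, excluding padding, to get a single
# fixed-size vector per sequence.
mask = inputs["attention_mask"].unsqueeze(-1)
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (1, 1280)
```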
For the masked modeling task, we return the completed sequence with mask tokens replaced with predictions.
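For illustration, here is a sketch of the same mask-fill computation with the open weights, again assuming the facebook/esm2_t33_650M_UR50D checkpoint. The greedy argmax fill is one reasonable decoding choice, not necessarily the one used by the service.

```python
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = EsmForMaskedLM.from_pretrained("facebook/esm2_t33_650M_UR50D")
model.eval()

masked = "MKTAYIAKQR<mask>ISFVKSHFSRQLEERLGLI"
inputs = tokenizer(masked, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Replace each <mask> position with its highest-probability token.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)
predicted_ids = inputs["input_ids"].clone()
predicted_ids[mask_positions] = logits[mask_positions].argmax(dim=-1)

# Rebuild the completed sequence, dropping <cls>/<eos> and padding.
tokens = tokenizer.convert_ids_to_tokens(predicted_ids[0])
completed = "".join(t for t in tokens if t not in tokenizer.all_special_tokens)
print(completed)
```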
Colab Examples
The PETase Mask Fill Challenge: a simple protein engineering task demonstrating the use of Ginkgo's protein sequence model API.