Ginkgo-AA-0-650M

Ginkgo's proprietary protein language model (Ginkgo-AA-0-650M)

Model Overview

This is a 650M parameter protein language model, trained for both masked language modeling and embedding generation. The model was trained on a proprietary dataset combining a large internal Ginkgo sequence database with publicly available sequences from UniRef. More information about this model is available in our technical post on the Ginkgo blog.

Input

The input uses the ESMTokenizer. This tokenizer accepts uppercase characters representing amino acids as well as the special tokens <unk>, <pad>, <cls>, <mask>, and <eos>. Please refer to the ESMTokenizer docs for more information on the inputs. Additionally, the playground accepts newlines in the input, so you can copy and paste a protein directly from a FASTA file; newlines are stripped, and the input is treated as a single continuous sequence. All sequences are limited to 1000 amino acids.
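As a rough illustration, preparing an input with an ESM-family tokenizer looks like the sketch below. It is a minimal sketch, assuming the Hugging Face transformers library; the public facebook/esm2_t33_650M_UR50D checkpoint is used purely as a stand-in, since Ginkgo-AA-0-650M is not distributed as an open checkpoint. The newline stripping and length check mirror the preprocessing described above.

```python
# A minimal sketch, assuming the Hugging Face transformers ESM tokenizer;
# "facebook/esm2_t33_650M_UR50D" is a public stand-in for illustration
# only, not Ginkgo-AA-0-650M itself.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")

# Newlines (e.g. from a pasted FASTA record) are stripped, so the input
# is treated as a single continuous sequence.
raw_input = "MKTAYIAKQR\nQISFVKSHFSRQ"
sequence = raw_input.replace("\n", "")
assert len(sequence) <= 1000, "sequences are limited to 1000 amino acids"

encoded = tokenizer(sequence, return_tensors="pt")
# input_ids wrap the per-residue tokens with <cls> and <eos>.
print(encoded["input_ids"].shape)
```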

Output

For the embedding task, we return a mean-pooled embedding of the amino acid sequence.
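To show what mean pooling over a protein sequence involves, the sketch below averages per-residue hidden states into a single fixed-size vector. The public ESM-2 650M checkpoint again stands in for Ginkgo-AA-0-650M, and the pooling details (such as whether special tokens are included) are assumptions, not a description of Ginkgo's production implementation.

```python
# A hedged sketch of mean-pooled sequence embeddings; the checkpoint and
# the exact pooling details are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = AutoModel.from_pretrained("facebook/esm2_t33_650M_UR50D")

inputs = tokenizer("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_dim)

# Average over sequence positions, using the attention mask to skip padding.
mask = inputs["attention_mask"].unsqueeze(-1)
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)  # (1, 1280) for a 650M ESM-style model
```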

For the masked modeling task, we return the completed sequence with the mask tokens replaced by the model's predictions.
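The sketch below shows one way such mask filling can work with an ESM-style masked language model: each <mask> position is replaced by the highest-probability amino acid. The checkpoint and the greedy argmax decoding are illustrative assumptions; Ginkgo's service may decode differently.

```python
# An illustrative sketch of mask filling; the public ESM-2 checkpoint and
# greedy argmax decoding are assumptions for demonstration purposes.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = AutoModelForMaskedLM.from_pretrained("facebook/esm2_t33_650M_UR50D")

inputs = tokenizer("MKTAYIAKQR<mask>ISFVKSHFSRQ", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Replace each <mask> token with its highest-probability prediction.
ids = inputs["input_ids"][0].clone()
mask_positions = ids == tokenizer.mask_token_id
ids[mask_positions] = logits[0, mask_positions].argmax(dim=-1)

tokens = tokenizer.convert_ids_to_tokens(ids.tolist())
print("".join(t for t in tokens if t not in tokenizer.all_special_tokens))
```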

Colab Examples
