lcdna

Microbial DNA Design with Diffusion

Model Overview

Our first long-context DNA diffusion model focuses on microbial DNA design. The autoregressive diffusion model iteratively refines DNA sequences based on its understanding of DNA regulatory logic in microbial genomes. Starting from a user-provided sequence template, it gradually transforms the unspecified nucleotide characters into a coherent and functional DNA sequence. At each step, the model uses the surrounding DNA context (up to 30 kb) to predict the best nucleotide choices for the masked positions specified by the user.

Input

The model requires a sequence template of up to 30,000 IUPAC nucleotide characters. Degenerate nucleotides (e.g. N for any of A, T, G, or C; K for either G or T) mark the positions the model fills in. Optionally, users may customize generation by specifying a sampling temperature between 0.0 and 1.0 and a decoding strategy.
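To make the template rules concrete, here is a minimal sketch of a client-side check, written in Python. It uses the standard IUPAC nucleotide codes and the 30,000-character limit stated above. The function names `validate_template` and `degenerate_positions` are hypothetical helpers, not part of the model's API, and whether the service accepts every IUPAC code in the table is an assumption.

```python
# Standard IUPAC nucleotide codes mapped to the bases each one allows.
# Assumption: the service accepts this full set; the documentation only
# shows N, K, R and Y explicitly.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

MAX_TEMPLATE_LENGTH = 30_000  # limit stated in the Input section


def validate_template(template: str) -> None:
    """Raise ValueError if the template is empty, too long, or not IUPAC."""
    if not template:
        raise ValueError("template is empty")
    if len(template) > MAX_TEMPLATE_LENGTH:
        raise ValueError(
            f"template has {len(template)} characters; "
            f"the limit is {MAX_TEMPLATE_LENGTH}"
        )
    bad = sorted(set(template.upper()) - set(IUPAC))
    if bad:
        raise ValueError(f"non-IUPAC characters in template: {bad}")


def degenerate_positions(template: str) -> list[int]:
    """Return 0-based indices of positions that allow more than one base."""
    return [i for i, ch in enumerate(template.upper()) if len(IUPAC[ch]) > 1]
```

For example, `degenerate_positions("ATGKN")` returns `[3, 4]`, the two positions left open for the model.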

Decoding strategies:
    * `random`: Positions are unmasked in random order.
    * `max_prob`: Positions are unmasked in descending order of their highest per-position probability (most confident first).
    * `entropy`: Positions are unmasked in ascending order of per-position entropy (least uncertain first).
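The three orderings can be illustrated with a toy unmasking loop. This is a sketch under stated assumptions, not the model's implementation: `toy_predict` is a hypothetical stand-in for the real network, the temperature scheme (sampling from p^(1/T), greedy at 0) is an assumed convention, and `unmaskings_per_step` is interpreted here as the number of positions resolved per iteration.

```python
import math
import random

# Subset of the IUPAC codes, enough for this illustration.
ALLOWED = {"A": "A", "C": "C", "G": "G", "T": "T",
           "R": "AG", "Y": "CT", "K": "GT", "N": "ACGT"}


def toy_predict(seq: list[str], pos: int) -> dict[str, float]:
    """Hypothetical stand-in for the model: a distribution over the bases
    allowed at `pos`, mildly favouring the base immediately to the left."""
    options = ALLOWED[seq[pos]]
    left = seq[pos - 1] if pos > 0 else ""
    weights = {b: (3.0 if b == left else 1.0) for b in options}
    total = sum(weights.values())
    return {b: w / total for b, w in weights.items()}


def entropy(dist: dict[str, float]) -> float:
    """Shannon entropy (in nats) of a per-position distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def pick_base(dist: dict[str, float], temperature: float, rng: random.Random) -> str:
    """Greedy at temperature 0; otherwise sample from p**(1/T) (assumed scheme)."""
    if temperature == 0:
        return max(dist, key=dist.get)
    bases = list(dist)
    scaled = [dist[b] ** (1.0 / temperature) for b in bases]
    return rng.choices(bases, weights=scaled, k=1)[0]


def unmask(template: str, strategy: str = "entropy", temperature: float = 0.5,
           unmaskings_per_step: int = 1, seed: int = 0) -> str:
    rng = random.Random(seed)
    seq = list(template.upper())
    masked = [i for i, ch in enumerate(seq) if len(ALLOWED[ch]) > 1]
    while masked:
        # Re-predict every still-masked position from the current context.
        dists = {i: toy_predict(seq, i) for i in masked}
        if strategy == "random":
            rng.shuffle(masked)
        elif strategy == "max_prob":
            masked.sort(key=lambda i: max(dists[i].values()), reverse=True)
        elif strategy == "entropy":
            masked.sort(key=lambda i: entropy(dists[i]))
        else:
            raise ValueError(f"unknown strategy: {strategy!r}")
        # Resolve the first few positions in the chosen order, then repeat.
        for i in masked[:unmaskings_per_step]:
            seq[i] = pick_base(dists[i], temperature, rng)
        masked = masked[unmaskings_per_step:]
    return "".join(seq)
```

Because predictions are recomputed after every step, positions resolved early become context for later ones, which is why the unmasking order can change the result.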

Example

{
  "sequence": "AAAATGKRYGCTAAATAGTTRNNNNNNNNN",
  "temperature": 0.5,
  "decoding_order_strategy": "entropy",
  "unmaskings_per_step": 1
}
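A request body like the one above can be assembled and checked before sending. The sketch below is a hypothetical helper, not an official client: the field names come from the example, the temperature range and strategy names come from the Input section, and the check that `unmaskings_per_step` is a positive integer is an assumption, since its valid range is not documented here. Optional fields are omitted when not given, so no default values are invented.

```python
import json

STRATEGIES = {"random", "max_prob", "entropy"}  # names from the Input section


def build_request(sequence, temperature=None,
                  decoding_order_strategy=None, unmaskings_per_step=None):
    """Return a JSON request body; raise ValueError on out-of-range options."""
    payload = {"sequence": sequence}
    if temperature is not None:
        if not 0.0 <= temperature <= 1.0:
            raise ValueError("temperature must be between 0.0 and 1.0")
        payload["temperature"] = temperature
    if decoding_order_strategy is not None:
        if decoding_order_strategy not in STRATEGIES:
            raise ValueError(f"unknown strategy: {decoding_order_strategy!r}")
        payload["decoding_order_strategy"] = decoding_order_strategy
    if unmaskings_per_step is not None:
        # Assumption: a positive integer; the valid range is not documented.
        if not isinstance(unmaskings_per_step, int) or unmaskings_per_step < 1:
            raise ValueError("unmaskings_per_step must be a positive integer")
        payload["unmaskings_per_step"] = unmaskings_per_step
    return json.dumps(payload, indent=2)


print(build_request("AAAATGKRYGCTAAATAGTTRNNNNNNNNN",
                    temperature=0.5,
                    decoding_order_strategy="entropy",
                    unmaskings_per_step=1))
```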

Output

The output is a dictionary with the key "sequence", whose value is a list containing the nucleotide sequence generated by diffusion sampling.

Example

{
    "sequence": ["aaaatgggtgctaaatagttaatttttatg"]
}
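One way to sanity-check a response is to confirm that the generated sequence respects every constraint in the template. The sketch below assumes the response shape shown in the example (a list holding one lowercase string) and uses the standard IUPAC codes; `matches_template` is a hypothetical helper.

```python
# Standard IUPAC nucleotide codes mapped to the bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches_template(template: str, generated: str) -> bool:
    """True if `generated` has the template's length and every base is
    one of those allowed by the template character at the same position."""
    if len(template) != len(generated):
        return False
    return all(
        g in IUPAC.get(t, "")
        for t, g in zip(template.upper(), generated.upper())
    )


template = "AAAATGKRYGCTAAATAGTTRNNNNNNNNN"
response = {"sequence": ["aaaatgggtgctaaatagttaatttttatg"]}

# Assumption: the list holds a single sequence, as in the example output.
generated = response["sequence"][0]
print(matches_template(template, generated))  # True
```

In the example pair, K resolves to G, the two R positions to G and A, Y to T, and the nine N positions to ATTTTTATG, so every fixed base is preserved and every degenerate code is honoured.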
