GENA-LM is a family of Open-Source Foundational Models for Long DNA Sequences.
GENA-LM models are transformer masked language models trained on human DNA sequence.
Key features of our GENA-LM models:
- BPE tokenization instead of k-mers (DNABERT, Nucleotide Transformer)
- max input sequence size ranges from 4.5k to 36k bp, compared to 512bp in DNABERT and 1000bp in Nucleotide Transformer
- pre-training on the latest T2T human genome assembly vs GRCh38/hg38
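To make the first bullet concrete, here is a toy, self-contained sketch of how BPE differs from fixed-length k-mers. It is not the GENA-LM tokenizer: the function names (`kmer_tokenize`, `learn_bpe_merges`, `apply_merge`) are hypothetical, the k-mers are cut without overlap for simplicity, and the merge rules are learned from a single short string rather than from a genome.

```python
from collections import Counter


def kmer_tokenize(seq, k=3):
    """Cut a sequence into fixed-length, non-overlapping k-mers."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def apply_merge(tokens, pair):
    """Replace every adjacent occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe_merges(seq, num_merges):
    """Toy BPE: start from single nucleotides, repeatedly merge the most frequent pair."""
    tokens = list(seq)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        merges.append(best)
        tokens = apply_merge(tokens, best)
    return merges, tokens


seq = "ACGTACGTACGT"
print(kmer_tokenize(seq, k=3))   # every token has exactly 3 nucleotides
merges, bpe_tokens = learn_bpe_merges(seq, num_merges=3)
print(merges)                    # learned merge rules, in order
print(bpe_tokens)                # token length depends on how often a pattern repeats
```

With BPE, frequent patterns collapse into longer tokens, so the same token budget can cover more base pairs than a fixed k would allow.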
Model | Architecture | Max SeqLen, tokens (bp) | Params | Tokenizer data | Training data |
---|---|---|---|---|---|
bert-base | BERT-12L | 512(4500) | 110M | T2T split v1 | T2T split v1 |
bert-base-t2t | BERT-12L | 512(4500) | 110M | T2T+1000G SNPs+Multispecies | T2T+1000G SNPs |
bert-base-lastln-t2t | BERT-12L | 512(4500) | 110M | T2T+1000G SNPs+Multispecies | T2T+1000G SNPs |
bert-base-t2t-multi | BERT-12L | 512(4500) | 110M | T2T+1000G SNPs+Multispecies | T2T+1000G SNPs+Multispecies |
bert-large-t2t | BERT-24L | 512(4500) | 336M | T2T+1000G SNPs+Multispecies | T2T+1000G SNPs |
bigbird-base-sparse | BERT-12L, DeepSpeed Sparse Ops, RoPE | 4096(36000) | 110M | T2T split v1 | T2T split v1 |
bigbird-base-sparse-t2t | BERT-12L, DeepSpeed Sparse Ops, RoPE | 4096(36000) | 110M | T2T+1000G SNPs+Multispecies | T2T+1000G SNPs |
bigbird-base-t2t | BERT-12L, HF BigBird | 4096(36000) | 110M | T2T+1000G SNPs+Multispecies | T2T+1000G SNPs |
T2T split v1 refers to preliminary models with a non-augmented T2T human genome assembly split. BERT-based models employ Pre-Layer Normalization and lastln explicitly denotes that layer normalization is also applied to the final layer. RoPE indicates the use of rotary position embeddings in place of BERT-like absolute positional embeddings.
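The "tokens (bp)" column implies a rough base-pair budget per token. The sketch below derives that ratio from the table values and uses it to cut a long sequence into windows that should fit a given model. The helper names are hypothetical, and the budget is only approximate: the real token count of a window depends on the BPE tokenizer and on the sequence itself.

```python
# (max tokens, approximate max bp) as listed in the model table
MAX_INPUT = {
    "bert": (512, 4500),
    "bigbird": (4096, 36000),
}


def bp_per_token(max_tokens, max_bp):
    """Average number of base pairs covered by one token, per the table."""
    return max_bp / max_tokens


def split_into_windows(seq, max_bp):
    """Cut a sequence into consecutive, non-overlapping windows of at most max_bp."""
    if max_bp <= 0:
        raise ValueError("max_bp must be positive")
    return [seq[i:i + max_bp] for i in range(0, len(seq), max_bp)]


for name, (tokens, bp) in MAX_INPUT.items():
    print(name, round(bp_per_token(tokens, bp), 2))

# A 10 kb toy sequence split for a 512-token BERT model
windows = split_into_windows("ACGT" * 2500, MAX_INPUT["bert"][1])
print([len(w) for w in windows])
```

Because the ratio is an average, a window near the limit can still exceed the token budget; truncating at tokenization time remains a sensible safeguard.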
For our first models (gena-lm-bert-base and gena-lm-bigbird-base-sparse) we hold out human chromosomes 22 and Y (CP068256.2 and CP086569.2) as the test dataset for the masked language modeling task. For all other models, which have the suffix "t2t" in their names, we hold out human chromosomes 7 and 10 (CP068271.2 and CP068268.2). All remaining data was used for training. Human-only models were trained on the pre-processed Human T2T v2 genome assembly and its 1000-genome SNP augmentations, for a total of ≈ 480 × 10^9 base pairs. Multispecies models were trained on the human-only and multispecies data, for a total of ≈ 1072 × 10^9 base pairs.
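The hold-out scheme can be sketched as a split of FASTA records by accession. This is a minimal illustration, not the repository's preprocessing code: `read_fasta` and `split_by_accession` are hypothetical helpers, and the sketch assumes the accession is the first whitespace-delimited field of each FASTA header.

```python
# Held-out accessions as listed in the text
HOLDOUT_FIRST_MODELS = {"CP068256.2", "CP086569.2"}  # chromosomes 22 and Y
HOLDOUT_T2T_MODELS = {"CP068271.2", "CP068268.2"}    # chromosomes 7 and 10


def read_fasta(lines):
    """Yield (accession, sequence) pairs from an iterable of FASTA lines."""
    accession, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if accession is not None:
                yield accession, "".join(chunks)
            # Assumption: the accession is the first field of the header
            accession, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if accession is not None:
        yield accession, "".join(chunks)


def split_by_accession(records, holdout):
    """Route each record to the test set if its accession is held out, else to train."""
    train, test = {}, {}
    for accession, sequence in records:
        (test if accession in holdout else train)[accession] = sequence
    return train, test


# Toy input; TOY000001.1 is a made-up accession used only for illustration
toy_fasta = [
    ">CP068271.2 toy record",
    "ACGT",
    "ACGT",
    ">TOY000001.1 toy record",
    "GGCC",
]
train, test = split_by_accession(read_fasta(toy_fasta), HOLDOUT_T2T_MODELS)
print(sorted(train), sorted(test))
```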
Model | Task | Task seq len | Metric | HF branch name |
---|---|---|---|---|
gena-lm-bert-base-t2t | promoters | 300bp | 74.56 ± 0.36 F1 | promoters_300_run_1 |
gena-lm-bert-large-t2t | promoters | 300bp | 76.44 ± 0.16 F1 | promoters_300_run_1 |
gena-lm-bert-large-t2t | promoters | 2000bp | 93.70 ± 0.44 F1 | promoters_2000_run_1 |
gena-lm-bert-base-t2t | splice site | 15000bp | 92.63 ± 0.09 PR AUC | spliceai_run_1 |
gena-lm-bert-large-t2t | splice site | 15000bp | 93.59 ± 0.11 PR AUC | spliceai_run_1 |
To load a model fine-tuned on a downstream task, replace model_name and branch_name with values from the table. The metrics in the table are averaged over multiple runs, so the values for an individual checkpoint may differ from those reported here.
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained(f'AIRI-Institute/{model_name}')
model = AutoModel.from_pretrained(f'AIRI-Institute/{model_name}', revision=branch_name, trust_remote_code=True)
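One way to fill in model_name and branch_name is a small lookup built from the table. The sketch below only assembles the arguments and does not download anything; `FINETUNED` and `checkpoint_args` are hypothetical names, and the entries simply mirror the table rows.

```python
# (task, task sequence length in bp) -> {model_name: branch_name}, copied from the table
FINETUNED = {
    ("promoters", 300): {
        "gena-lm-bert-base-t2t": "promoters_300_run_1",
        "gena-lm-bert-large-t2t": "promoters_300_run_1",
    },
    ("promoters", 2000): {
        "gena-lm-bert-large-t2t": "promoters_2000_run_1",
    },
    ("splice site", 15000): {
        "gena-lm-bert-base-t2t": "spliceai_run_1",
        "gena-lm-bert-large-t2t": "spliceai_run_1",
    },
}


def checkpoint_args(task, seq_len, model_name):
    """Return (repo_id, kwargs) suitable for AutoModel.from_pretrained(repo_id, **kwargs)."""
    try:
        branch_name = FINETUNED[(task, seq_len)][model_name]
    except KeyError:
        raise KeyError(f"no fine-tuned checkpoint listed for {model_name} on {task} ({seq_len} bp)")
    repo_id = f"AIRI-Institute/{model_name}"
    return repo_id, {"revision": branch_name, "trust_remote_code": True}


repo_id, kwargs = checkpoint_args("splice site", 15000, "gena-lm-bert-large-t2t")
print(repo_id, kwargs)
# The result would then be passed on as: AutoModel.from_pretrained(repo_id, **kwargs)
```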
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t')
model = AutoModel.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t', trust_remote_code=True)
Get model class from GENA-LM repository:
git clone https://github.com/AIRI-Institute/GENA_LM.git
from GENA_LM.src.gena_lm.modeling_bert import BertForSequenceClassification
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t')
model = BertForSequenceClassification.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t')
Alternatively, download modeling_bert.py and place it next to your code.
OR you can get model class from HuggingFace AutoModel:
from transformers import AutoTokenizer, AutoModel
model = AutoModel.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t', trust_remote_code=True)
gena_module_name = model.__class__.__module__
print(gena_module_name)
import importlib
# available class names:
# - BertModel, BertForPreTraining, BertForMaskedLM, BertForNextSentencePrediction,
# - BertForSequenceClassification, BertForMultipleChoice, BertForTokenClassification,
# - BertForQuestionAnswering
# check https://huggingface.co/docs/transformers/model_doc/bert
cls = getattr(importlib.import_module(gena_module_name), 'BertForSequenceClassification')
print(cls)
model = cls.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t', num_labels=2)
The GENA-LM bigbird-base-t2t model uses the HuggingFace BigBird implementation, so the default classes from the Transformers library can be used:
from transformers import AutoTokenizer, BigBirdForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-t2t')
model = BigBirdForSequenceClassification.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-t2t')
- Sequence classification with GENA-LM and Huggingface Transformers
- Explore GENA-LM model fine-tuned on Enformer dataset for gene expression
@article {GENA_LM,
author = {Veniamin Fishman and Yuri Kuratov and Maxim Petrov and Aleksei Shmelev and Denis Shepelin and Nikolay Chekanov and Olga Kardymon and Mikhail Burtsev},
title = {GENA-LM: A Family of Open-Source Foundational Models for Long DNA Sequences},
elocation-id = {2023.06.12.544594},
year = {2023},
doi = {10.1101/2023.06.12.544594},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2023/06/13/2023.06.12.544594},
eprint = {https://www.biorxiv.org/content/early/2023/06/13/2023.06.12.544594.full.pdf},
journal = {bioRxiv}
}
Downstream tasks for model evaluation encompass the prediction of promoter and enhancer activity, splicing sites, chromatin profiles, and polyadenylation site strength.
Check the downstream_tasks folder for the code and data preprocessing scripts we used:
- Promoters prediction
- Splice site prediction (SpliceAI)
- Drosophila enhancers prediction (DeepSTARR)
- Chromatin profiling (DeepSea)
- Polyadenylation sites prediction (APARENT)
To download the human genome, run the following script:
./download_data.sh human
For preprocessing, execute the following script:
python src/gena_lm/genome_tools/create_corpus.py --input_file data/ncbi_dataset/data/GCA_009914755.4/GCA_009914755.4_T2T-CHM13v2.0_genomic.fna --output_dir data/processed/human/
For models with sparse attention (gena-lm-bigbird-base-sparse, gena-lm-bigbird-base-sparse-t2t), FP16 support and DeepSpeed are needed.
Install APEX https://github.com/NVIDIA/apex#quick-start
git clone https://github.com/NVIDIA/apex
cd apex
# most recent commits may fail to build
git checkout 2386a912164b0c5cfcd8be7a2b890fbac5607c82
# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key...
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
# otherwise
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./
DeepSpeed must be installed to work with the SparseAttention versions of the language models. DeepSpeed Sparse Attention supports only GPUs with compute capability >= 7 (V100, T4, A100) and CUDA 10.1, 10.2, 11.0, or 11.1, and runs only in FP16 mode (as of DeepSpeed 0.6.0).
PyTorch>=1.7.1,<=1.10.1 wheels with CUDA 10.2/11.0/11.1 from pytorch.org can be used. However, using Sparse Ops with CUDA 11.1 PyTorch wheels would require CUDA 11.3/11.4 to be installed on the system. Sparse Ops could also be used with PyTorch==1.12.1 CUDA 11.3 wheels, but running DeepSpeed Sparse Ops tests would require modifying them as they check for Torch CUDA version <=11.1. DeepSpeed fork for Triton 1.1.1 already has updated tests.
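The requirements stated for DeepSpeed 0.6.0 can be written down as a quick pre-install check. This is a sketch under the constraints listed above only; `sparse_attention_issues` is a hypothetical helper, not part of DeepSpeed, and it deliberately ignores the workarounds for newer CUDA versions mentioned above.

```python
# Constraints as stated in the text for DeepSpeed 0.6.0
MIN_COMPUTE_CAPABILITY = (7, 0)
SUPPORTED_CUDA = {(10, 1), (10, 2), (11, 0), (11, 1)}


def sparse_attention_issues(compute_capability, cuda_version, fp16):
    """Return a list of human-readable problems; an empty list means no known blocker.

    compute_capability and cuda_version are (major, minor) tuples, for example
    what torch.cuda.get_device_capability() returns for the GPU.
    """
    issues = []
    if tuple(compute_capability) < MIN_COMPUTE_CAPABILITY:
        issues.append(f"GPU compute capability {compute_capability} is below 7.0")
    if tuple(cuda_version) not in SUPPORTED_CUDA:
        issues.append(f"CUDA {cuda_version} is not among the listed versions")
    if not fp16:
        issues.append("sparse attention runs only in FP16 mode")
    return issues


print(sparse_attention_issues((8, 0), (11, 1), fp16=True))   # A100-like setup
print(sparse_attention_issues((6, 1), (11, 3), fp16=False))  # several problems reported
```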
Triton 1.0.0 and 1.1.1 require Python<=3.9.
pip install triton==1.0.0
DS_BUILD_SPARSE_ATTN=1 pip install deepspeed==0.6.0 --global-option="build_ext" --global-option="-j8" --no-cache
and check installation with
ds_report
Triton 1.1.1 brings a 2x speed-up to sparse operations on A100, but DeepSpeed (0.6.5) currently supports only Triton 1.0.0. A DeepSpeed fork with Triton 1.1.1 support can be used when this speed-up is needed:
pip install triton==1.1.1
git clone https://github.com/yurakuratov/DeepSpeed.git
cd DeepSpeed
DS_BUILD_SPARSE_ATTN=1 pip install -e . --global-option="build_ext" --global-option="-j8" --no-cache
and run sparse ops tests with
cd tests/unit
pytest -v test_sparse_attention.py
We use the Trainer and multi-GPU training from the lm-experiments-tools repository as the basis for our fine-tuning scripts. However, you can use the HF Transformers Trainer, PyTorch Lightning, or Accelerate and PyTorch with custom training loops instead.
Install lm-experiments-tools according to https://github.com/yurakuratov/t5-experiments#install-only-lm_experiments_tools:
git clone https://github.com/yurakuratov/t5-experiments
cd t5-experiments
pip install -e .