DNABERT-2: Efficient foundation model and benchmark for multi-species genome

Zhou, Zhihan, et al. "Dnabert-2: Efficient foundation model and benchmark for multi-species genome." arXiv preprint arXiv:2306.15006 (2023).

Background

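Tokenization is the earliest step in language modeling and one of the most critical: it strongly affects both the efficiency and the performance of the model.
A DNA sequence is composed of four bases: A, T, C, and G.
Most existing genome language models (Ji et al., 2021; Dalla-Torre et al., 2023) use k-mer tokenization.
- With single characters alone (a vocabulary of 4: A, C, G, T), it is hard to find motifs with long-range dependencies, and structural patterns are learned poorly.
- k-mer tokenization treats each contiguous genome segment of length k as one token.
- Overlapped k-mer
  - Sliding window with window size k and stride t: adjacent tokens share k - t bases.
  - In masked LM this causes information leakage (a masked token can be read off its neighbors) and computational inefficiency.
- Non-overlapped k-mer
  - k = stride, so adjacent tokens share no bases.
  - Shift sensitivity: shifting the sequence by just 1 bp changes every token, so essentially the same sequence can be tokenized into a completely different "sentence".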
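
To make the two failure modes concrete, here is a minimal Python sketch. The helper `kmer_tokenize` and the toy sequence are hypothetical and chosen only for illustration; they are not code from the paper. It shows that with overlapped 3-mers a masked token can be rebuilt from its two neighbors, and that with non-overlapped 3-mers a 1 bp shift leaves no token in common.

```python
def kmer_tokenize(seq, k, stride):
    """Slide a window of size k over seq, advancing by `stride` bases."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


seq = "ATGCGTACCTGA"  # toy sequence (assumption, for illustration only)
k = 3

# Overlapped k-mer: stride t = 1, so adjacent tokens share k - t = 2 bases.
overlapped = kmer_tokenize(seq, k, stride=1)

# Information leakage: rebuild a "masked" token from its two neighbors.
masked_idx = 4
left = overlapped[masked_idx - 1]
right = overlapped[masked_idx + 1]
# The left neighbor supplies the first k - 1 bases, the right one the last base.
recovered = left[1:] + right[k - 2]

# Non-overlapped k-mer: stride = k, so tokens share no bases.
non_overlapped = kmer_tokenize(seq, k, stride=k)

# Shift sensitivity: drop the first base and tokenize again.
shifted = kmer_tokenize(seq[1:], k, stride=k)
shared_tokens = set(non_overlapped) & set(shifted)

print("overlapped:    ", overlapped)
print("mask recovered:", recovered == overlapped[masked_idx])
print("non-overlapped:", non_overlapped)
print("after 1bp shift:", shifted)
print("tokens in common:", shared_tokens)
```

The overlapped list is also several times longer than the non-overlapped one for the same sequence, which is the computational inefficiency mentioned above.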

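Proposal: Byte-Pair Encoding (BPE)

Iteratively merge frequent pairs of nucleotides and genome segments,
forming a vocabulary of variable-length tokens.
Vocabulary of 2^12 (4096) tokens.
Prevents information leakage and shortens the tokenized sequence by roughly 5x.
Variable-length MLM
- Similar to T5-style "replace spans of text".
- The model has to infer two things at once: how many bases the masked token contains and which bases they are.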
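
A small Python sketch of why masking variable-length tokens resembles span replacement. The token segmentation below is made up for illustration (it is not taken from the DNABERT-2 vocabulary), and `mask_tokens` is a hypothetical helper.

```python
MASK = "[MASK]"


def mask_tokens(tokens, positions):
    """Replace the tokens at `positions` with MASK and return what was hidden."""
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    hidden = {i: tokens[i] for i in sorted(positions)}
    return masked, hidden


# Hypothetical BPE segmentation of a DNA sequence (assumption).
bpe_tokens = ["TA", "CGTA", "G", "ATTCGA", "CC"]

masked, hidden = mask_tokens(bpe_tokens, {1, 3})
hidden_lengths = [len(tok) for tok in hidden.values()]

print(masked)          # each [MASK] stands for a span of unknown length
print(hidden_lengths)  # number of bases hidden behind each mask
```

With fixed k-mers every mask hides exactly k bases. Here each mask hides a different number of bases, so the span length is part of what has to be predicted.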

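Initial step (iteration 0): initialize the vocabulary with every unique character in the corpus (A, T, C, G).
At each iteration, take the most frequent pair of characters/strings in the corpus (e.g., 'TA' at iteration 1), treat it as a new "word", and add it to the vocabulary.
Then update the corpus by replacing every occurrence of that pair with the new token.
Repeat until the target vocabulary size is reached.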
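
The loop above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the paper's training code: the function names `merge_pair` and `train_bpe`, the three-sequence corpus, and the tie-breaking rule (ties go to the lexicographically larger pair) are all choices made here for the example.

```python
from collections import Counter


def merge_pair(seq, left, right):
    """Replace every adjacent (left, right) pair in seq with the merged token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
            merged.append(left + right)
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def train_bpe(corpus, vocab_size):
    """Toy BPE: returns (vocab, merges, tokenized corpus)."""
    # Iteration 0: the vocabulary is the set of unique characters.
    sequences = [list(s) for s in corpus]
    vocab = sorted({ch for seq in sequences for ch in seq})
    merges = []
    while len(vocab) < vocab_size:
        # Count adjacent token pairs across the whole corpus.
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break  # nothing left to merge
        # Most frequent pair; ties broken by the pair itself (assumption).
        (left, right), _ = max(pair_counts.items(), key=lambda kv: (kv[1], kv[0]))
        merges.append((left, right))
        if left + right not in vocab:
            vocab.append(left + right)
        # Update the corpus with the new token.
        sequences = [merge_pair(seq, left, right) for seq in sequences]
    return vocab, merges, sequences


corpus = ["TATACG", "TACGTA", "CGTATA"]  # toy corpus (assumption)
vocab, merges, encoded = train_bpe(corpus, vocab_size=6)

print(vocab)
print(merges)
print(encoded)
```

On this toy corpus the first merge is ('T', 'A'), matching the 'TA' example above. A real tokenizer would be trained on a large multi-species corpus up to the target vocabulary size with a library rather than a loop like this.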

A general-purpose DNA tokenizer that covers multiple species!

SentencePiece BPE

