ECOLE: Learning to call copy number variants on whole exome sequencing data

Notes taken as a reference for how I might study whether the tool-merge algorithm I use actually performs well.

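Accurately detecting CNV regions in WES data has been a long-standing interest, and algorithms have been developed and improved continuously.
Even so, they still show low precision and recall on expert-curated gold-standard data.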

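ECOLE is built on a Transformer architecture:
1. It uses high-confidence calls generated from WGS samples.
2. It uses a small set of expert calls via transfer learning.
3. Its performance is evaluated on expert-labeled data.
With fine-tuning, it can also be applied to tumor samples without a control sample.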

1000 Genomes data was used to evaluate ECOLE's performance.
ECOLE was compared against SOTA germline CNV callers (FREEC, CNVkit, XHMM, CONIFER, CODEX2, GATK) separately for the DEL, DUP, and no-call groups. CNVnator results were used as the semi-ground truth CNV calls (Fig 2).
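
A per-group comparison like the one above can be sketched as follows. This is only a minimal illustration, assuming each exon carries exactly one label (DEL, DUP, or NO-CALL) in both the truth set and the caller output; the function name and the toy labels are hypothetical and are not taken from the ECOLE code.

```python
from collections import Counter

LABELS = ("DEL", "DUP", "NO-CALL")

def per_class_precision_recall(truth, pred):
    """Return {label: (precision, recall)} for equal-length per-exon label lists.

    Assumption: one label per exon, compared position by position.
    """
    if len(truth) != len(pred):
        raise ValueError("truth and pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(truth, pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # the predicted label gains a false positive
            fn[t] += 1  # the true label gains a false negative
    result = {}
    for label in LABELS:
        n_pred = tp[label] + fp[label]
        n_true = tp[label] + fn[label]
        precision = tp[label] / n_pred if n_pred else float("nan")
        recall = tp[label] / n_true if n_true else float("nan")
        result[label] = (precision, recall)
    return result

# Toy example with made-up labels for six exons.
truth = ["DEL", "DEL", "DUP", "NO-CALL", "NO-CALL", "NO-CALL"]
pred = ["DEL", "NO-CALL", "DUP", "NO-CALL", "DUP", "NO-CALL"]
metrics = per_class_precision_recall(truth, pred)
for label, (precision, recall) in metrics.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```

Reporting the three groups separately keeps the very large no-call class from hiding weak DEL or DUP performance.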

CNLearn results were also included in the comparison:
- CNLearn which is a random forest-based method that creates an ensemble of four WES-based callers (See Section 4.2 for details)
- CNLearn v1, which learns to aggregate the calls of other WES-based germline CNV callers (CANOES, XHMM, CONIFER and CLAMMS).
Because ECOLE was trained on Illumina HiSeq 2000 and Illumina Genome Analyzer II data, its performance was also evaluated on data from other platforms: BGISEQ 500, HiSeq 4000, NovaSeq 6000, and MGISEQ 2000, all for the NA12878 sample.
Performance was also tested (Fig 3) on the human expert-curated call set of Chaisson et al., which covers nine 1KG samples.
- This is a human expert-curated, consensus call set that relies on the results of 15 WGS-based CNV callers compared against structural variations generated using PacBio with single base pair breakpoint resolution.

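-> The 1KG dataset is what is used for performance evaluation

-> CNVnator is used as the ground truth.. (huh?)

-> Evaluation separates calls by type but not by fragment, and there does not seem to be an RO (reciprocal overlap) value

-> Let's use Chaisson et al. too, and look into how they curated the results of the 15 callers

-> And every caller performs poorly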
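
For my own evaluation, the RO (reciprocal overlap) criterion mentioned in the notes above could be sketched as below. This is a rough sketch under assumptions: intervals are half-open `[start, end)`, the 0.5 threshold is the commonly used 50% convention rather than anything stated in the ECOLE paper, and the function names and tuple layout are hypothetical.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Return the reciprocal overlap of two half-open intervals [start, end).

    The value is min(overlap / len_a, overlap / len_b), or 0.0 if they do not overlap.
    """
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a_end - a_start), overlap / (b_end - b_start))

def is_match(call, truth, min_ro=0.5):
    """Decide whether a call matches a truth CNV.

    call, truth: (chrom, start, end, cnv_type) tuples (hypothetical layout).
    min_ro: assumed threshold; 0.5 is a common convention, not a value from the paper.
    """
    same_chrom = call[0] == truth[0]
    same_type = call[3] == truth[3]
    ro = reciprocal_overlap(call[1], call[2], truth[1], truth[2])
    return same_chrom and same_type and ro >= min_ro

# Toy example with made-up coordinates.
call = ("chr1", 100, 200, "DEL")
truth = ("chr1", 150, 250, "DEL")
print(reciprocal_overlap(100, 200, 150, 250))
print(is_match(call, truth))
```

Requiring the overlap to be large relative to both intervals stops one very long call from being counted as a match for many short truth CNVs.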
