간단리뷰 Day10. Boosting GPT models for genomics analysis: generating trusted genetic variant annotations and interpretations through RAG and Fine-tuning

Boosting GPT models for genomics analysis: generating trusted genetic variant annotations and interpretations through RAG and Fine-tuning

https://academic.oup.com/bioinformaticsadvances/article/5/1/vbaf019/8002096

2025, Bioinformatics Advances, 7 citation

RAG + GPT 이용해 변이 주석 데이터 개선/생성의 새로운 방향성을 제안하는 논문.

RAG를 어떻게 활용할 수 있는지 확인할 수 있었음.

1. Problem

유전체학에서 LLM 모델의 성능은 아직이다.
질병 관련 변이를 해석하고 우선순위를 정하는데 변이 주석 데이터는 필수적이다.
그래서 이 주석 데이터를 효과적으로 학습시켜보겠다.
- RAG를 접목함으로써 데이터의 양, 정확도, 비용 효율석 면에서 fine tuning 보다 우수하다는 것이 novelty.

2. Related Work

fine tuning: 사전 학습 모델에 작은 규모의 특정 도메인 데이터 세트를 학습시켜, 그 도메인에서 잘 동작하도록 가중치를 조정하는 학습 프로세스.
RAG: 미세 조정하지 않음. 관련 외부 정보를 끌어와 답변의 정확도와 관련성을 향상시키는 방법
- 작동 원리: 입력이 주어지면, 검색기가 관련 문서를 검색하고, 이 문서(와 입력 모두)들을 생성기에 입력으로 제공합니다. 생성기는 이 문서들을 참고하여 보다 정확하고 풍부한 응답을 생성합니다.(관련 논문 및 설명: https://dwin.tistory.com/172)
- 출처: https://aws.amazon.com/ko/what-is/retrieval-augmented-generation/
- RAG는 기존의 LLM의 문제를 해결하기위한 접근 방식으로써, LLM을 리디렉션하여 신뢰할 수 있는 사전 결정된 지식 출처에서 관련 정보를 검색. 제어된 텍스트를 출력하도록 유도.
  - LLM의 문제: 허위 정보 제공, 오래되거나 일반적인 정보 제공, 신뢰할 수 없는 출처로부터 응답 생성, 용어 혼동

3. Idea

- retrieval-augmented generation (RAG)을 사용해
  유전자 변이 주석 데이터를 LLM에 주입 -> 유전체학에서의 성능을 개선시킴

4. Materials & Methods

curated DB: ClinVar, gnomAD, GWAS catalog, PharmGKB / predicted DB: SnpEff, VEP

5. Evaluation & Findings

염색체 위치 및 SNP ID가 주어졌을때 유전자 및 질환 예측에서, 기본 GPT모델은 제한적이다.

* Azure AI Search : 사용자 정의한 DB에 쉽게 RAG 할 수있도록 돕는 플랫폼. mcirosoft azure

사용자 쿼리 -> Azure 시스템은

- 입력 토큰화
- 미리 작성된 검색 인덱스 내에서 관련 정보를 검색
- 검색 결가와 원래 사용자 쿼리와 함께 프롬프트에 통합
- GPT에 전송

https://learn.microsoft.com/ko-kr/azure/databricks/generative-ai/retrieval-augmented-generation

-평가: 이름이 정확도는 자카드 유사도로(0.8 ) / clinvar 에서 잘 연구된 상위 10개의 유전자를 test로 선별

그 결과 입력(염색체 위치)에 대한 모든 유전자 주석 정확도 100% 달성

including gene, condition, IDs, allele frequencies, molecular consequences, etc

( ...? 당연함. 그냥 저장된 데이터 잘 가져온다. 쿼리 - 저장정보 이런 매핑과정에서 문제가 없었다. 이정도만 해석하고 넘어가면될듯?)( GPT가 잘 이해한다. 라고 말하긴 어려움.)

이런 데이터를 주입을 하니 높은 정확도(fig 2d)를 보였다.

(당연함, 이미 gpt는 summary 분야에서 성능이 검증되었음. 이걸 가져와서 요약하는 것 뿐인데?)

(흥미로운 것은 fine tuning 보다 더 좋다는 것... 그러면 단어 임베딩 상으로는 차이가 없는 거 아닌지?!)

활용 방법: chatbot으로 활용한다는 것 같음, 그냥 검색 엔진 강화같은 느낌임

환자의 증상과 유전자 검사를 통해 보고된 변이 목록을 바탕으로 질병을 추론 및 원인 변이 식별이 가능하다.

-> 라고 주장하기에는 실험 근거가 부족했음.

6. Take away

figure1을 참고했을때, GPT 모델이 well-known gene과 아닌 유전자간의 정확도가 큰 것이 신기했음.
그러나 task 정의가 너무 심플했음!
MS Azure AI을 사용했고,저자가 MS 관계자이니까...플랫폼 논문, gene 데이터 진입을 위한 파일럿 논문으로 생각됨.

저작자표시 (새창열림)

'Paper' 카테고리의 다른 글

간단리뷰 Day11. Prefix-Tuning: Optimizing Continuous Prompts for Generation (0)	2025.11.05
간단논문 Day9. GenePT: A Simple But Effective Foundation Modelfor Genes and Cells Built From ChatGPT (0)	2025.11.04
읽어볼 논문 (0)	2025.10.27
간단리뷰 Day 8. A Novel Balanced-Lethal Host-Vector System Based on glmS (0)	2025.10.22
간단리뷰 Day4. Principles and methods for transferring polygenic risk scores across global populations (0)	2025.10.12

Bioinfomatics

간단리뷰 Day10. Boosting GPT models for genomics analysis: generating trusted genetic variant annotations and interpretations through RAG and Fine-tuning

Boosting GPT models for genomics analysis: generating trusted genetic variant annotations and interpretations through RAG and Fine-tuning

1. Problem

2. Related Work

3. Idea

4. Materials & Methods

5. Evaluation & Findings

6. Take away

'Paper' 카테고리의 다른 글

티스토리툴바

간단리뷰 Day10. Boosting GPT models for genomics analysis: generating trusted genetic variant annotations and interpretations through RAG and Fine-tuning

Boosting GPT models for genomics analysis: generating trusted genetic variant annotations and interpretations through RAG and Fine-tuning

1. Problem

2. Related Work

3. Idea

4. Materials & Methods

5. Evaluation & Findings

6. Take away

'Paper' 카테고리의 다른 글

'Paper' Related Articles

티스토리툴바