Introduction to the PGS Catalog

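A polygenic score (PGS) is typically composed of hundreds to millions of genetic variants (usually SNPs). They are combined as a weighted sum of allele dosages, each multiplied by the corresponding effect size estimated in a relevant genome-wide association study (GWAS).
When the score is used to predict a specific phenotype such as a disease, it is also called a genetic score, genomic risk score (GRS), or polygenic risk score (PRS).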

Downloads & Access

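The PGS Catalog is a database of published PGS. Each PGS is consistently annotated with metadata.
This includes the scoring file (variants, effect alleles/weights), a description of how the PGS was developed and applied, and evaluations of its predictive performance.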

<< Introduction to the actual data

pgsc_calc makes it easy to calculate PGS from the scoring files published in the PGS Catalog.
https://pgsc-calc.readthedocs.io/en/latest/

pgsc_calc: a reproducible workflow to calculate polygenic scores

How PGS Catalog scores are built

1. Foundation: GWAS summary statistics

A GWAS estimates an effect size (β or OR) for each SNP:

phenotype ~ SNP_dosage + covariates

→ a table of millions of SNPs × β values = summary statistics
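The per-SNP regression above can be sketched in a few lines. This is a minimal simulation, not a real GWAS: the SNP names, sample size, allele frequency, and true effect sizes are all made up for illustration, covariates are omitted, and a real analysis would use dedicated software.

```python
import random


def ols_slope(x, y):
    """Least-squares slope of y on x (single predictor plus intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    return cov_xy / var_x


rng = random.Random(0)
n_samples = 2000
# Hypothetical SNPs with assumed true effects (illustration only).
true_betas = {"snp1": 0.30, "snp2": 0.00, "snp3": -0.20}

# Dosage = number of effect alleles (0, 1, 2), drawn at allele frequency 0.5.
dosages = {
    snp: [int(rng.random() < 0.5) + int(rng.random() < 0.5) for _ in range(n_samples)]
    for snp in true_betas
}
# Phenotype = sum of genetic effects + random noise (no covariates in this sketch).
phenotype = [
    sum(beta * dosages[snp][i] for snp, beta in true_betas.items()) + rng.gauss(0, 1)
    for i in range(n_samples)
]

# One regression per SNP gives one row of the summary statistics table.
summary_stats = {snp: ols_slope(dosages[snp], phenotype) for snp in true_betas}
for snp, beta_hat in summary_stats.items():
    print(f"{snp}\ttrue={true_betas[snp]:+.2f}\testimated={beta_hat:+.3f}")
```

Each estimated β should land near the simulated true value, with some sampling noise.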

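2. Why raw GWAS β values should not be used as-is

The LD (linkage disequilibrium) problem: a single true causal variant is correlated with tens to hundreds of nearby SNPs, so summing over all of them counts the same signal repeatedly.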

causal SNP ── r²=0.9 ── tag SNP A
           ── r²=0.8 ── tag SNP B
           ── r²=0.7 ── tag SNP C

Summing β_A, β_B, and β_C adds the same effect three times.
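A small simulation makes the double counting concrete. This is a sketch under simplifying assumptions: one causal SNP, three tag SNPs whose alleles copy the causal allele with probability sqrt(r²) per haplotype (which gives the stated r²), allele frequency 0.5, and a made-up causal effect of 0.5.

```python
import random


def ols_slope(x, y):
    """Least-squares slope of y on x (single predictor plus intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    return cov_xy / var_x


rng = random.Random(1)
n_samples = 5000
beta_causal = 0.5  # assumed true effect (illustration only)
tag_r2 = {"tagA": 0.9, "tagB": 0.8, "tagC": 0.7}


def draw_haplotype():
    """One haplotype: tag alleles copy the causal allele with probability sqrt(r2)."""
    causal = int(rng.random() < 0.5)
    alleles = {"causal": causal}
    for tag, r2 in tag_r2.items():
        copies_causal = rng.random() < r2 ** 0.5
        alleles[tag] = causal if copies_causal else int(rng.random() < 0.5)
    return alleles


snps = ["causal"] + list(tag_r2)
dosages = {snp: [] for snp in snps}
for _ in range(n_samples):
    hap1, hap2 = draw_haplotype(), draw_haplotype()
    for snp in snps:
        dosages[snp].append(hap1[snp] + hap2[snp])

# Only the causal SNP affects the phenotype.
phenotype = [beta_causal * d + rng.gauss(0, 1) for d in dosages["causal"]]

# Marginal (one SNP at a time) GWAS betas: the tags pick up the causal signal.
marginal_beta = {snp: ols_slope(dosages[snp], phenotype) for snp in snps}

# Naive score: sum every marginal beta x dosage, ignoring LD.
true_value = [beta_causal * d for d in dosages["causal"]]
naive_score = [
    sum(marginal_beta[s] * dosages[s][i] for s in snps) for i in range(n_samples)
]
inflation = ols_slope(true_value, naive_score)

print({snp: round(b, 3) for snp, b in marginal_beta.items()})
print("naive score per unit of true genetic effect:", round(inflation, 2))
```

Under this model the tags have non-zero marginal β even though they have no effect of their own, and the naive score overstates the true genetic effect by a factor that is, in expectation, about 1 + 0.9 + 0.8 + 0.7 = 3.4.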

3. Methods for computing PGS weights

Method	Principle	Example PGS

P+T (Pruning+Thresholding)	SNPs selected by p-value threshold after LD pruning; GWAS β used as-is	PGS002280 (83 SNPs)
LDpred2	Bayesian shrinkage; β re-estimated by modeling the LD structure	PGS004034
PRS-CS	Continuous shrinkage prior; β shrunk toward 0	PGS002753
DBSLMM	β re-estimated with a mixed model	PGS003992

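4. PGS calculation formula

PRS_i = Σ_j (w_j × dosage_ij)

w_j  = weight from the PGS Catalog (differs by method)
dosage_ij = number of copies of the effect allele of SNP j carried by sample i (0, 1, 2)

P+T: w_j ≈ GWAS β_j (selected SNPs only)
LDpred2/PRS-CS: w_j = shrunk β_j (LD-adjusted; typically smaller in magnitude than the GWAS β)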
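The formula is just a weighted sum, as the following sketch shows. The variant IDs, weights, and genotypes are invented for illustration, and skipping variants that are missing from a sample is a simplification chosen here, not necessarily how pgsc_calc treats missing genotypes.

```python
def polygenic_score(weights, dosages):
    """Weighted sum of effect-allele dosages; variants absent from the sample are skipped."""
    return sum(w * dosages[variant] for variant, w in weights.items() if variant in dosages)


# Hypothetical scoring file: variant -> weight w_j.
weights = {"rs0001": 0.12, "rs0002": -0.05, "rs0003": 0.30}

# Hypothetical genotypes: sample -> {variant: effect-allele dosage (0, 1, 2)}.
samples = {
    "sample1": {"rs0001": 2, "rs0002": 1, "rs0003": 0},
    "sample2": {"rs0001": 0, "rs0002": 0, "rs0003": 2},
    "sample3": {"rs0001": 1, "rs0002": 2},  # rs0003 not genotyped
}

scores = {name: polygenic_score(weights, dosages) for name, dosages in samples.items()}
for name, score in scores.items():
    print(f"{name}\t{score:.2f}")
```

For sample1 the sum is 0.12 × 2 + (−0.05) × 1 + 0.30 × 0 = 0.19.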

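5. PGS002280 (83 SNPs) in detail

Bellenguez 2022 GWAS (111,326 clinically diagnosed or proxy AD cases) → genome-wide significant SNPs → LD clumping → 83 independent loci selected → the GWAS β of each locus used as its weight.

In other words, P+T keeps weights closest to the GWAS β values but selects only LD-independent SNPs.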
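The "significant SNPs → LD clumping → GWAS β as weight" chain can be sketched as a greedy procedure. This is a generic illustration, not the procedure actually used for PGS002280: the SNP names, p-values, β values, and r² values are made up, the thresholds (p < 5e-8, r² > 0.1) are assumptions, and the physical distance window that real clumping tools apply is left out.

```python
def clump(pvalues, pairwise_r2, p_threshold=5e-8, r2_threshold=0.1):
    """Greedy clumping: walk SNPs from most to least significant and keep a SNP
    only if it is not in LD (r2 above threshold) with an already kept index SNP."""
    candidates = sorted(
        (snp for snp, p in pvalues.items() if p < p_threshold), key=pvalues.get
    )
    index_snps = []
    for snp in candidates:
        in_ld = any(
            pairwise_r2.get(frozenset((snp, kept)), 0.0) > r2_threshold
            for kept in index_snps
        )
        if not in_ld:
            index_snps.append(snp)
    return index_snps


# Hypothetical summary statistics.
pvalues = {"s1": 1e-20, "s2": 1e-12, "s3": 1e-9, "s4": 1e-3, "s5": 1e-10}
gwas_beta = {"s1": 0.25, "s2": 0.22, "s3": 0.10, "s4": 0.01, "s5": 0.11}
# Hypothetical pairwise LD; pairs not listed are treated as r2 = 0.
pairwise_r2 = {
    frozenset(("s1", "s2")): 0.9,
    frozenset(("s3", "s5")): 0.8,
    frozenset(("s1", "s3")): 0.02,
}

index_snps = clump(pvalues, pairwise_r2)
# P+T weights: the GWAS beta of each retained index SNP, unchanged.
weights = {snp: gwas_beta[snp] for snp in index_snps}
print(weights)
```

Here s2 is dropped because it tags s1, s3 is dropped because it tags the more significant s5, and s4 never passes the p-value threshold, leaving s1 and s5 as the two independent loci.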

Workflow summary

1. Installation

Install Java. The Java version installed here is 11, so Nextflow 23.10.0 is installed (see https://dokk.org/documentation/nextflow/v23.10.0/getstarted).

For installing Java and for installing Nextflow via conda, see http://docs.seqera.io/nextflow/install

wget https://github.com/nextflow-io/nextflow/releases/download/v23.10.0/nextflow
chmod +x nextflow
./nextflow -version

# Success if the output shows: N E X T F L O W  ~  version 23.10.0

# Move the binary to a directory on PATH
mkdir -p $HOME/.local/bin/
mv nextflow $HOME/.local/bin/

# Check that it is installed correctly
nextflow info

2. Running pgsc_calc with Nextflow

nextflow run pgscatalog/pgsc_calc -profile test,docker

# The run is fine if the output ends as follows (Pipeline completed successfully)
-[pgscatalog/pgsc_calc] Pipeline completed successfully-
Completed at: 20-Apr-2026 11:49:16
Duration    : 10m 55s
CPU hours   : (a few seconds)
Succeeded   : 8

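Run the pipeline on the test data (meaningless data) first to check that it finishes without errors.

Instead of docker, singularity or conda can be given in the same position of -profile.

Docker could not be used here because of missing execution permissions, so conda was used instead.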

vi environment.yml

name: pgscatalog-utils
channels:
  - conda-forge
  - bioconda
  - nodefaults
dependencies:
  - pgscatalog.core=1.0.0
  - pgscatalog.match=0.4.0
  - pgscatalog.calc=0.3.1=pyhdfd78af_1

conda search -c conda-forge -c bioconda pgscatalog.calc=0.3.1 --info

Check the pgscatalog.core constraint in the dependency list -> confirmed that calc=0.3.1=pyhdfd78af_1 is compatible with pgscatalog.core=1.0.0.
After editing the YAML, run rm -rf ~/work/conda/pgscatalog-utils-* and then rerun (nextflow run pgscatalog/pgsc_calc -profile test,conda)

3. Writing the samplesheet

sampleset
path_prefix
chrom
format

Example:

sampleset,path_prefix,chrom,format
biobank,./260326_biobank_all/biobank_data/c1_merged,1,vcf
biobank,./260326_biobank_all/biobank_data/c2_merged,2,vcf
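When the data are split by chromosome, the samplesheet rows can be generated rather than typed by hand. A minimal sketch, assuming that all autosomes 1 to 22 follow the same c{N}_merged naming as the two rows above (only chromosomes 1 and 2 are shown in this post, so the rest is an assumption):

```python
import csv
import io


def build_samplesheet(sampleset, prefix_template, chromosomes, file_format="vcf"):
    """Return samplesheet CSV text with one row per chromosome."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sampleset", "path_prefix", "chrom", "format"])
    for chrom in chromosomes:
        writer.writerow(
            [sampleset, prefix_template.format(chrom=chrom), chrom, file_format]
        )
    return buffer.getvalue()


# Assumed naming pattern: c1_merged ... c22_merged under the same directory.
sheet = build_samplesheet(
    "biobank",
    "./260326_biobank_all/biobank_data/c{chrom}_merged",
    range(1, 23),
)
print(sheet, end="")
```

To save the result, write the returned text to samplesheet.csv, for example with open("samplesheet.csv", "w").write(sheet).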

4. Running

nextflow run pgscatalog/pgsc_calc \
    -profile conda \
    --input samplesheet.csv --target_build GRCh37 \
    --pgs_id PGS002280 \
    --run_ancestry pgsc_HGDP+1kGP_v1.tar.zst

# --pgs_id [PGS ID]

Uses the polygenic score(s) with the given ID(s) from the PGS Catalog.

--pgs_id PGS001229 # one score
--pgs_id PGS001229,PGS001405 # many scores separated by , (no spaces)

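Searching for AD-related PGS under MONDO_0004975 returns 53 entries.

The entries range from as few as 6 variants to as many as 1,136,212 variants.

The ancestry of the samples used in each GWAS is also shown.

For now, a single PGS was selected using the following criteria: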

PGS002280 (Bellenguez 2022, Nature Genetics)

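Largest AD GWAS at the time of writing (111,326 cases)
European-ancestry based
83 variants → fast to compute, and almost all of them are present in UK Biobank
The most frequently cited AD PRS baseline

More variants is not automatically better: a larger score carries more noise and is harder to compute.

# --run_ancestry [DB file name]

Enables genetic ancestry similarity calculation and PGS normalization.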

wget https://ftp.ebi.ac.uk/pub/databases/spot/pgs/resources/pgsc_HGDP+1kGP_v1.tar.zst

This database combines the 1000 Genomes and Human Genome Diversity Project reference panels and is the recommended default panel.

You may prefer to use 1000 Genomes only:

wget https://ftp.ebi.ac.uk/pub/databases/spot/pgs/resources/pgsc_1000G_v1.tar.zst

Why ancestry adjustment is needed

https://pgsc-calc.readthedocs.io/en/latest/explanation/geneticancestry.html

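The mean and variance of the PRS (PGS) distribution differ between ancestry groups. If a PRS built from a European GWAS is used as-is, the scores shift systematically between groups (for example, higher for Europeans and lower for East Asians). In a case/control comparison this shift acts as confounding.

How --run_ancestry works (step by step)

Step 1: Build the reference PCA

A PCA is built on the reference panel (the database passed to --run_ancestry), and the target samples are projected into that genetic ancestry space.

Naive projection suffers from shrinkage bias, so the projection uses FRAPOSA's OADP (Online Augmentation, Decomposition, and Procrustes) method, which corrects for that bias.

Step 2: Population similarity classification

A RandomForest classifier is trained on the PCA loadings of the reference panel (10 PCs by default) and assigns each target sample to the reference population (EUR/EAS/AFR, etc.) it most resembles. Mahalanobis distance can be used as an alternative.

Step 3: PGS adjustment (two methods)

Empirical method: compute a Z-score from the mean and standard deviation of the PGS distribution within the most similar population.
Continuous (PCA-based) method: use the PC values as covariates and adjust the PGS by regression.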
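The two adjustments can be illustrated on toy numbers. This is a simplified sketch, not pgsc_calc's implementation: the reference scores and PC values are invented, only one PC is used instead of the default 10, and the continuous method is reduced to taking the residual from a straight-line fit in the reference panel.

```python
import statistics


def empirical_z(score, reference_scores):
    """Z-score of a target PGS against the PGS distribution of its most similar population."""
    return (score - statistics.mean(reference_scores)) / statistics.stdev(reference_scores)


def fit_line(x, y):
    """Least-squares slope and intercept of y on x."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sum(
        (a - mean_x) ** 2 for a in x
    )
    return slope, mean_y - slope * mean_x


def pc_adjusted(score, pc1, reference_pc1, reference_scores):
    """Residual PGS after removing the part predicted by PC1 in the reference panel."""
    slope, intercept = fit_line(reference_pc1, reference_scores)
    return score - (intercept + slope * pc1)


# Hypothetical reference PGS distributions per population.
reference = {"EUR": [1.0, 2.0, 3.0], "EAS": [0.0, 1.0, 2.0]}

# Two target samples with different raw scores but the same position within their group.
z_eur = empirical_z(3.0, reference["EUR"])
z_eas = empirical_z(2.0, reference["EAS"])
print(z_eur, z_eas)

# Continuous method: reference PGS drifts with PC1, so the drift is regressed out.
reference_pc1 = [-1.0, 0.0, 1.0, 2.0]
reference_pgs = [0.0, 1.0, 2.0, 3.0]
residual = pc_adjusted(2.5, 1.0, reference_pc1, reference_pgs)
print(residual)
```

The raw scores 3.0 (EUR) and 2.0 (EAS) both become a Z-score of 1.0, which is the point of the empirical method: samples are compared to their own group rather than across groups. The continuous method avoids hard group labels, which helps for samples that fall between reference populations.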
