2020.09.06
Population genetics: detecting selection sites
群体遗传中检测选择信号/位点

本文主要介绍hapFLK(FLK统计量)、LEA软件包(LFMM模型)和PCADAPT软件如何检测选择信号

总体的思想就是通过这几款软件、不同的算法计算出每一个位点的z-score，然后对z-score进行整合、筛选位点

FST的计算方法

##对每一个SNP变异位点进行计算
vcftools--vcf test.vcf--weir-fst-pop1_population.txt--weir-fst-pop2_population.txt--outp_1_2—single
##按照区域来计算
vcftools--vcf test.vcf--weir-fst-pop1_population.txt--weir-fst-pop2_population.txt--outp_1_2_bin--fst-window-size500000--fst-window-step50000
# test.vcf是SNP calling 过滤后生成的vcf 文件；
# p_1_2_3 生成结果的prefix
# 1_population.txt是一个文件包含同一个群体中所有个体，一般每行一个个体。个体名字要和vcf的名字对应。
# 2_population.txt 包含了群体二中所有个体。
#计算的窗口是500kb，而步长是50kb （根据你的需其可以作出调整）。我们也可以只计算每个点的Fst，去掉参数（--fst-window-size 500000 --fst-window-step 50000）即可。

vcftools --vcf final_Taxodium_noMxg308.recode.vcf \
--weir-fst-pop Cs.txt \
--weir-fst-pop Dfs.txt \
--weir-fst-pop Lys.txt \
--weir-fst-pop Mxg.txt \
--weir-fst-pop Zss.txt

hapFLK

FLK和Fst同类，都是评估群体分化过程中的选择效应的。 FLK的分析的介绍和方法： http://membres-timc.imag.fr/Michael.Blum/summer_school_2013/Main_talks/Servin_SSMPG2013.pdf

Fst的公式为: Fst=(Ht-Hs)/Ht 其中，Ht为种群预期杂合度（假设这个种群中不存在亚群，达到哈温平衡的杂合度），Hs为亚群预期杂合度（假设这个种群中每一个亚群内部独立自由交配，相互之间无交配，达到哈温平衡后，几个亚群杂合度的加权平均值）

因此Fst是从这两种假设之间的“相似度”出发的统计量，而hapFLK则是从“差异”出发的统计量。开发者认为FLK比Fst更能够表现出群体之间的差异性。

hapFLK安装

## install
sudo pip install hapflk
## upgrade
sudo pip install hapflk --upgrade

或直接下载

下载链接：https://files.pythonhosted.org/packages/51/c4/ed7b823aa136185d750d60f53268cfb6804dcc87183dbe216135e63aa583/hapflk-1.4.tar.gz

网址：

https://pypi.org/project/hapflk/#files

https://forge-dga.jouy.inra.fr/projects/hapflk

hapFLK运行

hapFLK的输入文件是PLINK文件，包括ped/map，或者bed/bim/fam。文件的第一行必须是种群的名字（FID）。

hapflk --bfile NorthernSheep --outgroup Soay 
# 基本使用方法

具体使用方法 hapflk -h

usage: hapflk  [-h] [--ped FILE] [--map FILE] [--file PREFIX] [--bfile PREFIX]
               [--miss_geno C] [--miss_pheno C] [--miss_parent C]
               [--miss_sex C] [--chr C] [--from x] [--to x] [--other_map]
               [-p PREFIX] [--ncpu N] [--eigen] [--kinship FILE]
               [--reynolds-snps L] [--outgroup POP] [--keep-outgroup] [-K K]
               [--nfit NFIT] [--phased]

optional arguments:
  -h, --help            show this help message and exit
  -p PREFIX, --prefix PREFIX
                        prefix for output files (default: hapflk)
  --ncpu N              Use N processors when possible (default: 1)
  --eigen               Perform eigen decomposition of tests (default: False)
  --reynolds            Force writing down Reynolds distances (default: False)
  --annot               Shortcut for --eigen --reynolds --kfrq (default:

Input Files:
  Available formats

  --ped FILE            PED file (default: None)
  --map FILE            MAP file (default: None)
  --file PREFIX         PLINK file prefix (ped,map) (default: None)
  --bfile PREFIX        PLINK bfile prefix (bim,fam,bed) (default: None)

Data format options:

  --miss_geno C         Missing Genotype Code (default: 0)
  --miss_pheno C        Missing Phenotype Code (default: -9)
  --miss_parent C       Missing Parent Code (default: 0)
  --miss_sex C          Missing Sex Code (default: 0)

SNP selection:
  Filter SNP by genome position

  --chr C               Select chromosome C (default: None)
  --from x              Select SNPs with position > x (in bp/cM) Warning :
                        does not work with BED files (default: 0)
  --to x                Select SNPs with position < x (in bp/cM) Warning :
                        does not work with BED files (default: inf)
  --other_map           Use alternative map (genetic/RH) (default: False)

Population kinship :
  Set parameters for getting the population kinship matrix

  --kinship FILE        Read population kinship from file (if None, kinship is
                        estimated) (default: None)
  --reynolds-snps L     Number of SNPs to use to estimate Reynolds distances
                        (default: 10000)
  --outgroup POP        Use population POP as outgroup for tree rooting (if
                        None, use midpoint rooting) (default: None)
  --keep-outgroup       Keep outgroup in population set (default: False)

hapFLK and LD model:
  Switch on hapFLK calculations and set parameters of the LD model

  -K K                  Set the number of clusters to K. hapFLK calculations
                        switched off if K<0 (default: -1)
  --nfit NFIT           Set the number of model fit to use (default: 20)
  --phased, --inbred    Haplotype data provided (default: False)
  --kfrq                Write Cluster frequencies (Big files) (default: False)

LFMMs模型(latent factor mixed models)/LEA package

LEA软件官网：http://membres-timc.imag.fr/Olivier.Francois/LEA/tutorial.htm

tutorial：http://membres-timc.imag.fr/Olivier.Francois/LEA/files/LEA_1.html

library(LEA)

LEA包主要有两个重要的function：snmf() 和 lfmm() 需要两种格式的输入文件第一是基因型文件，如官方提供的genotype.lfmm文件

0 0 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 2 2 0 0 0 2 0 0 0 0 0 0 0 0 1 1 2 0 2 0 0 0 0 0 2 2 0 0 0 0 2 0 0 0 0 0 1 0 0 0 0 0 2 0 0 0 0 0 0 2 0
1 0 1 0 0 2 0 0 0 0 0 0 0 0 0 2 1 0 1 2 0 2 0 0 0 1 0 0 1 0 0 0 1 1 2 0 2 0 0 0 0 0 2 2 0 1 0 0 2 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 1 2 0
1 0 1 0 0 2 0 0 0 0 0 0 0 0 0 2 1 0 1 2 0 2 0 0 0 1 0 0 1 0 0 0 1 1 2 0 2 0 0 0 0 0 2 2 0 1 0 0 2 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 1 2 0
...

每一行代表一个个体，012代表基因型(SNPs)

第二是环境因子向量文件，如 gradient.env，有多少个体就有多少行

-7.86345637580724
-0.77898414155103
-1.63128052555653
2.06376301313332
0.80270905131277
1.84265215979298
10.249483491269
9.69112362600386
11.8551811248261

没有环境变量的，可以设置为0和1。如1代表高原，0代表平原。

snmf()

snmf()类似于STRUCTURE，都是群体结构分析，但snmf()运行速度更快

genotype = lfmm2geno("genotype.lfmm")
# 将lfmm格式转换为geno格式
obj.snmf = snmf(genotype, K = 1:12, entropy = T, ploidy = 2, project="new")
# 运行程序，K设置1到12，每个跑一边
plot(obj.snmf)
#绘制熵图，熵最低的K是最合理、有序度最高的
barplot(t(Q(obj.snmf, K = 6)), col = 1:6)
# 绘制structure barplot（根据上一步已知k=6是最合适的）

lfmm()

lfmm模型用于探究潜在的环境因子对基因型的影响

obj.lfmm = lfmm("genotype.lfmm", "gradient.env", K = 6, rep = 5, project="new")

K值根据snmf结果来设置，在本例中是6。rep是运算重复次数，越多越好，可以设为10。

#Record z-scores from the 5 runs in the zs matrix
#整合多次运行的结果，以zscore的形式，取中位数
zs = z.scores(obj.lfmm, K = 6)

#Combine z-scores using the median
zs.median = apply(zs, MARGIN = 1, median)

#Compute the GIF
lambda = median(zs.median^2)/qchisq(0.5, df = 1)
lambda

# compute adjusted p-values from the combined z-scores
adj.p.values = pchisq(zs.median^2/lambda, df = 1, lower = FALSE)

#histogram of p-values
hist(adj.p.values, col = "red")

绘制出的柱状图要尽量平整，而在接近0点位置尽量陡峭而高，如下图所示。否则需要手动调整lambda值，例如调到0.55。lambda是基因组膨胀系数，小于1是最好的，大于1说明存在低质量位点。

接下来需要对loci进行筛选，找到真正差异的基因位点，有两种方法：Benjamini-Hochberg算法和直接计算Storey's q-values 。

1、用Benjamini-Hochberg算法来调整、筛选多样本的candidate loci

## FDR control: Benjamini-Hochberg at level q
## L = number of loci
L = 500
#fdr level q
q = 0.1
w = which(sort(adj.p.values) < q * (1:L)/L)
candidates.bh = order(adj.p.values)[w]

如此，candidate loci被储存在candidates.bh文件中实际科研中，FDR level可以设置为q=0.01

2、直接用qvalue包计算q-value

## FDR control: Storey's q-values 
library(qvalue)
plot(qvalue(adj.p.values))
candidates.qv = which(qvalue(adj.p.values, fdr = .1)$signif)

q-value即通过FDR矫正后的p-value，可参见https://www.jianshu.com/p/9e97e9b351fd 最后，将candidates.bh 和candidates.qv 中的loci合并，就是筛选得到的差异位点。

PCADAPT软件(Principal Component Analysis for outlier detection)

此软件基于贝叶斯因子模型

官网：https://cran.r-project.org/web/packages/pcadapt/index.html 输入文件为bed或者matrix，都可以

pcadapt(
      input,
      K = 2,
      method = c("mahalanobis", "componentwise"),
      min.maf = 0.05,
      ploidy = 2,
      LD.clumping = NULL, pca.only = FALSE,
      tol = 1e-04 )

实际研究中，我设置为k=1，min.maf=0.05

mahalanobis (default): the robust Mahalanobis distance is computed for each genetic marker using a robust estimate of both mean and covariance matrix between the K vectors of z-scores.

componentwise: returns a matrix of z-scores.

如果要把bed转化为matrix：bed2matrix，但没必要

bed2matrix(bedfile, n = NULL, p = NULL)

#bedfile:  Path to a bed file.
#n: Number of samples. Default reads it from corresponding fam file. 
#p: Number of SNPs. Default reads it from corresponding bim file.

最后一步

整合三个软件跑出来的z-score，计算calibrated P value，然后用Benjamini–Hochberg FDR矫正，生存loci列表。

2020.09.06 Population genetics: detecting selection sites 群体遗传中检测选择信号/位点