Input Data¶
Here, we briefly discuss different input files required by PRSice:
Base Dataset¶
Base (i.e. GWAS) data must be provided as a whitespace delimited file containing association analysis results for SNPs on the base phenotype. PRSice has no problem reading in a gzipped base file.
PLINK output is used, then please make sure there is a column for the effective allele (A1) and specify it with A1
option.
On the other hand, if the base data follows other formats, then the column headers can be provided using the chr
, A1
, A2
, stat
, snp
, bp
, se
, pvalue
options
Important
PRSice requires the base file to contain information of the effective allele (A1), effect size estimates (stat), pvalue for association (pvalue), and the SNP ID (snp).
If the input file does not contain a column header, the column can be specified using their index (start counting from 0) with the index
flag.
For example, with the following input format:
SNP  CHR  BP  A1  A2  OR  SE  P 

rs3094315  1  752566  A  G  0.9912  0.0229  0.7009 
rs3131972  1  752721  A  G  1.007  0.0228  0.769 
rs3131971  1  752894  T  C  1.003  0.0232  0.8962 
the parameters can either be
snp SNP chr CHR bp BP A1 A1 A2 A2 stat OR se SE pvalue P
or
snp 0 chr 1 bp 2 A1 3 A2 4 stat 5 se 6 pvalue 7 index
Strand flips are automatically detected and accounted for.
If an imputation info score or the minor allele frequencies (MAF) are also included in the file,
infobase <Info Name>,<Info Threshold>
and mafbase <MAF Name>,<MAF Threshold>
can be used to filter
SNPs based on their INFO score and MAF respectively.
For binary trait base file, SNPs can be filtered according to the MAF in case and control separately using
mafbase <Name for Control>,<Threshold for Control>:<Name for Case>,<Threshold for Case>
By default, PRSice will look for the following column names automatically from the base file header if index
was not provided:
CHR, BP, A1, A2, SNP, P, INFO, SE (case sensitive) and OR / BETA (case insensitive)
nodefault
can be used to disable all the defaults of PRSice.
Note
PRSice will ignore any columns that were not found in the base file (e.g. IfA2 B
is specified but none of the column header is B, then PRSice will treat it as if no A2 information is presented)
Target Dataset¶
Currently two different target file format is supported by PRSice:
PLINK Binary¶
A target dataset in PLINK binary format must consist of three files: .bed, .bim, and a .fam file  where bed contains the compressed genotype data, bim contains the SNP information and fam contains the family information. Currently only SNP major PLINK format are supported (default output of the latest PLINK program).
The .bed and .bim file must have the same prefix.
If the .fam file follow a different prefix from the .bed and bim file, it can be specified using
target <bim bed prefix>,<fam file>
Warning
The fam file MUST contains the correct number of samples or PRSice will crash
Note
9 will NOT be considered as missing for quantitative traits
If the binary file is separated into individual chromosomes, then an # can be used to specify the location of the chromosome number in the file name. PRSice will automatically substitute # with 122
i.e. If the files are chr1.
target chr#
Note
Chromosome number substitution will not be performed on the external fam file as the fam file should be the same for all chromosomes.
Alternatively, if your PLINK files do not have a unified prefix, you can use targetlist
to provide a file containing all
prefix to PRSice.
BGEN¶
PRSice currently support BGEN v1.1 and v1.2. To specify a BGEN file, simply add the type bgen
or ldtype bgen
to the PRSice command
As BGEN does not store the phenotype information and sometime not even the sample ID, you must provide
a phenotype file (phenofile
) for PRSice to run. Alternatively, the sample file can be provided using
Note
The sample file is required even if noregress
is set as the sample ID is required for output. This requirement might be losen in future versions
With BGEN input, a number of PRSice options become effective:

hard
: Normally, with BGEN format, PRS is calculated using the dosage information. But hardthresholding can be performed by using thehard
option. SNPs will then coded as the genotype (0,1 or 2) with probability larger than threshold specified byhardthres
. If no such genotype is presented, the SNP will be coded as missing 
hardthres
: The genotype probability threshold. SNPs with no genotype having a probability larger than this threshold will be treated as missing
To perform clumping on BGEN file, we need to repeatly decompress the genotype dosage and convert them into PLINK binary format.
Therefore, to speed up the clumping process, you can allow PRSice to generate a large intermediate file, containing the hard
coded genotypes in PLINK binary format by using the allowinter
option.
Phenotype files¶
An external phenotype file can be provided to PRSice using the phenofile
parameter.
This must be a tab / space delimited file and missing data must be represented by either NA
or 9
(only for binary traits).
The first two column of the phenotype file should be the FID and the IID, or when
ignorefid
is set, the first column should be the IID.
The rest of the columns can be the phenotype(s).
To specify a trait within the phenotype file, the column name for the trait can be specified using
phenocol
, providing that the phenotype file contains a header.
Multiple column name can be provided via a comma separated
list: e.g. phenocol A,B,C,D
. Trait(s) not found within the phenotype file will
be automatically skipped.
Important
The column name(s) should not contain space nor comma
Note
When more than one traits are provided, the column name will be appended to the output prefix.
LD reference¶
When the target sample is small (e.g. < 500 samples), an external reference panel can be used to improve the LD estimation for clumping.
The LD reference follows the same notion as the target dataset. Simply use
ld <LD refernce>
ldtype
to specify the format
When a LD reference file is not provided and noclump
is not specified, the target file
will be used as the LD reference panel
Important
Any parameters with the ld
prefix will only work on the file specified by the ld
parameter.
That is, if a LD reference file is not provided, none of the ld*
options will be used.
If a different set of filtering is to be perforemd on the target file when performing LD calculation, it must
be provided separately to the ld
parameter
e.g. target <target> ld <target> keep <samples for PRS> ldkeep <samples for LD>
Note
BGEN file will always be hard coded when used to estimate the LD
Clumping¶
By default, PRSice will perform Clumping to remove SNPs that are in LD with each other.
Similar to PLINK, the r^{2} values computed by PRSice are based on maximum likelihood haplotype frequency estimates.
The pearson correlation r^{2} calculation is also supported and can be specified by using pearson
.
Both cases and controls are included in the LD calculation.
Alternatively, a combination of ld
and ldkeep
/ldremove
can be used to restrict LD calculation in control samples.
Clumping parameters can be changed by using the clumpkb
, clumpr2
and clumpp
option.
Clumping can be disabled using noclump
PRS calculation¶
PRSice allow different genetic models to be specified (e.g. add, dom, het, rec), and the polygenic score of each of those are calculated differently
Assuming \(S\) is the summary statistic for the effective allele and \(G\) is the number of the effective allele observed, then the main difference between the models is how the genotypes are coded:
For additive model (add)
For dominant model (with respect to the effective allele of the base file)
For recessive model (with respect to the effective allele of the base file)
For heterozygous model
Then depending on the score
option, the PRS is calculated as (assuming \(M_j\) is the number of Alleles included in the PRS of the \(j^{th}\) individual)
score avg
(default):
$$
PRS_j = \sum_i{\frac{S_i\times G_{ij}}{M_j}}
$$
score sum
:
$$
PRS_j = \sum_i{S_i\times G_{ij}}
$$
score std
:
$$
PRS_j = \frac{\sum_i({S_i\times G_{ij}})  \text{Mean}(PRS)}{\text{SD}(PRS)}
$$
Sometimes, sample can have missing genotype.
The missing
option is used to determine how
PRSice handle the missingness.
When not specified, the Minor Allele Frequency (MAF) in the target sample will be used as the genotype as the sample with missing genotype.
If missing NO_MEAN_IMPUTATION
is set, the SNP for the missing
samples will be excluded.
Alternatively, if missing CENTER
is set, all PRS calculated will be minused by
the MAF of the SNP (therefore, missing samples will have PRS of 0).
Empirical Pvalue calculation¶
All approaches to PRS calculation involve parameter optimisation in generating the final prediction model, and are thus vulnerable to overfitting. There are a few methods to account for the overfitting:
 Evaluate performance in an independent validation sample
 Cross validation
 Calculate an empirical Pvalue
In, PRSice2, we have implemented a permutation analysis to calculate an empirical Pvalue.
Permutation Procedure¶
To calculate the empirical Pvalue, PRSice2 perform the following
 Perform standard PRSice analysis
 Obtain the pvalue of association of the best pvalue threshold (\(P_o\))
 Randomly shuffle the phenotype and repeat the PRSice analysis
 Obtain the pvalue of association of the best pvalue threshold under the null (\(P_{null}\))
 Repeat step2 \(N=10,000\) times (for
perm 10000
)  The empirical pvalue can then be calculated as
where \(I(.)\) is the indicator function.
Warning
While the empirical pvalue for association will be controlled for Type 1 error, the observed phenotypic variance explained, R^{2}, remains unadjusted and is affected by overfitting. Therefore, it is imperative to perform outofsamp,le prediction, or crossvalidation to evaluate the predictive accuracy of PRS.
Computation Algorithm¶
In reality, due to the time consuming nature of permutation analysis, PRSice2 exploit certain property of random number generation to speed up the permutation analysis.
To generate random numbers, a random seed is required. When the same seed is provided, the same sequence of random number will always be generated. PRSice2 exploit this property, such that the permutation analysis is performed as follow
 Generate the random seed or use the user provided random seed to \(S\)
 For each pvalue threshold
 Calculate the observed pvalue
 Seed the random number generator with \(S\)
 For Quantitative trait, (and binary trait, unless
logitperm
is set), decompose the matrix of the independent variables (\(Intercept+PRS+Covariates\))  Generate N copies of random phenotypes via random shuffling.
 Calculate the pvalue association for each null phenotype
 For each permutation, check if the current null pvalue is the most significant. Replace the previous "best" pvalue if the current null pvalue is more significant
 Calculate the empirical pvalue once all pvalue thresholds have been processed
As we reseed the random number generator for each pvalue threshold, we ensure the random phenotypes generated in each pvalue thresholds to be identical, therefore allowing us to reuse the calculated PRS, and also reuse the decomosed matrix, which leads to significant speed up of the permutation process.
Output of Results¶
Bar Plot¶
Note
Hereon, [Name] is assumed to be the output prefix specified using out
and [date] is the date when the analysis was performed.
PRSice will always generate a bar plot displaying the model fit of the PRS at Pvalue threshold as indicated by barlevels
The plot will be named as [Name]_BARPLOT_[date].png . An example bar plot:
High Resolution Plot¶
If fastscore
is not specified, a highresolution plot named [Name]_HIGHRES_PLOT_[date].png will be generated.
This plot present the model fit of PRS calculated at all Pvalue thresholds.
Important
The model fit is defined as the \(R^2\) of the Full model  the \(R^2\) of the Null model
For example, if Sex is a covariate in the PRSice calculation, then model fit = \(R^2\) of \(Pheno\sim PRS+Sex\)  \(R^2\) of \(Pheno\sim Sex\)
A green line connects points showing the model fit at the broad Pvalue thresholds used in the corresponding bar plot are also added. An example highresolution plot:
Quantile Plots¶
If quantile [number of quantile]
is specified, a quantile plot named [Name]_QUANTILE_PLOT_[date].png will be generated.
The quantile plot provide an illustration of the effect of increasing PRS on predicted risk of phenotype.
An example quantile plot:
Specifically, the quantile plot is generated by the following steps
 Regress out the covariate from the phenotype (if any) and use the standardized residual to represent the phenotype
 Distribute samples into user specified number of quantiles
 Treat the quantiles as a factor, where the
quantref
is the base factor  Perform regression with \(Pheno \sim Quantile\)
 Set the reference quantile to have coefficient of 1 (if binary) or 0 (otherwise)
The point of each quantile is their OR (if binary) or coefficient (otherwise) from the regression analysis
A text file [Name]_QUANTILE\_[date].txt is also produced, which provides all the data used for the plotting.
Moreover, uneven distribution of quantiles can be specified using the quantbreak
function, which will generate
the strata plot.
For example, to replicate the quantile break from Natarajan et al (2015):
Percentile of PRS, %  All studies in iCOGS excluding pKARMA OR (95% CI) 
pKARMA only OR (95% CI) 

<1  0.29 (0.23 to 0.37)  0.48 (0.28 to 0.83) 
>1–5  0.42 (0.37 to 0.47)  0.48 (0.36 to 0.63) 
5–10  0.55 (0.50 to 0.61)  0.58 (0.45 to 0.74) 
10–20  0.65 (0.60 to 0.70)  0.68 (0.57 to 0.81) 
20–40  0.80 (0.76 to 0.85)  0.81 (0.71 to 0.94) 
40–60  1 (referent)  1 (referent) 
60–80  1.18 (1.12 to 1.24)  1.35 (1.19 to 1.54) 
80–90  1.48 (1.39 to 1.57)  1.56 (1.34 to 1.82) 
90–95  1.69 (1.56 to 1.82)  2.05 (1.70 to 2.47) 
95–99  2.20 (2.03 to 2.38)  2.12 (1.73 to 2.59) 
>99  2.81 (2.43 to 3.24)  3.06 (2.16 to 4.34) 
The following command can be added to PRSice command:
quantile 100 \ quantbreak 1,5,10,20,40,60,80,90,95,99,100 \ quantref 60
Specifically, quantbreak
indicates the upper bound of each group and quantref
specify the upper bound of the reference quantiles
Note
The quantile boundaries are nonoverlapping, with the inclusive upper bound and exclusive lower bound
PRS modelfit¶
A file containing the PRS model fit across thresholds is named [Name].prsice; this is stored as
Set, Threshold, \(R^2\), Pvalue, Coefficient, Standard Deviation and Number of SNPs at this threshold
Important
\(R^2\) reported in the prsice file is the \(R^2\) of the Full model  the \(R^2\) of the Null model
Scores for each individual¶
A file containing PRS for each individual at the bestfit PRS named
[Name].best is provide. This file has the format of:
FID,IID,In_Regression, PRS at best threshold of first set, PRS at best threshold of second set, ...
Where the In_Regression column indicate whether the sample is included in the regression model performed by PRSice.
If allscore
option is used, a file named [Name].all.score is also generated
This file has the format of
FID, IID, PRS for first set at first threshold, PRS for first set at second threshold, ...
If allscore
is used, the PRS for each individual at all threshold and all sets will be given.
In the event where the target sample size is large and a lot of threshold are tested, this file can be large.
Summary Information¶
Information of the best model fit of each phenotype and gene set is stored in [Name].summary. The summary file contain the following fields:
 Phenotype  Name of Phenotype
 Set  Name of Gene Set
 Threshold  Best Pvalue Threshold
 PRS.R2  Variance explained by the PRS. If prevalence is provided, this will be adjusted for ascertainment
 Full.R2  Variance explained by the full model (including the covariates). If prevalence is provided, this will be adjusted for ascertainment
 Null.R2  Variance explained by the covariates. If prevalence is provided, this will be adjusted for ascertainment
 Prevalence  Population prevalence as indicated by the user. "" if not provided.
 Coefficient  Regression coefficient of the model. Can provide insight of the direction of effect.
 P  P value of the model fit
 Num_SNP  Number of SNPs included in the model
 EmpiricalP  Only provided if permutation is performed. This is the empirical pvalue and should account for multiple testing and overfitting
Only one summary file will be generated for each PRSice run (disregarding the number of target phenotype used)
Log File¶
To allow for easy replication, a log file named [Name].log is generated for each PRSice run, which contain the all the commands used for the analysis and information regarding filtering, field selected etc.
This also allow easy identification of problems and should always be included in the bug report.