QC_GWAS {QCGWAS} | R Documentation |
Automated Quality Control of GWAS Results files
Description
QC_GWAS
runs a full quality control (QC) over a single
GWAS results file. It removes missing and invalid data, checks
the alleles and allele frequency with a reference, tests the
reported p-value against both calculated and expected values,
creates QQ and Manhattan plots and reports the distribution of
the quality-parameters within the dataset, as well as various
QC statistics.
QC_series
does the same thing for multiple GWAS results
files. It's mainly a wrapper that passes individual files to
QC_GWAS
, but it has a few extra features, such as
making a checklist of important QC stats
and creating several graphs to compare the QC'ed files.
Although the number of arguments in QC_GWAS
may seem
overwhelming, only three of them are required to run a basic
QC. The name of the file to be QC'ed should be passed to the
filename
argument; the directory of said file to the
dir_data
argument; and a header-translation table to
the header_translations
argument.
For a quick introduction to QCGWAS, read the quick
reference guide that can be found in "R\library\QCGWAS\doc".
Usage
QC_GWAS(filename,
filename_output = paste0("QC_", filename),
dir_data = getwd(),
dir_output = paste(dir_data, "QCGWASed", sep = "/"),
dir_references = dir_data,
header_translations,
column_separators = c("\t", " ", "", ",", ";"),
nrows = -1, nrows_test = 1000,
header = TRUE, comment.char = "",
na.strings = c("NA", "nan", "NaN", "."),
imputed_T = c("1", "TRUE", "T"),
imputed_F = c("0", "FALSE", "F"),
imputed_NA = c(NA, "-"),
save_final_dataset = TRUE,
gzip_final_dataset = TRUE, order_columns = FALSE,
spreadsheet_friendly_log = FALSE,
out_header = "standard",
out_quote = FALSE, out_sep = "\t", out_eol = "\n",
out_na = "NA", out_dec = ".", out_qmethod = "escape",
out_rownames = FALSE, out_colnames = TRUE,
return_HQ_effectsizes = FALSE,
remove_X = FALSE, remove_Y = FALSE,
remove_XY = remove_Y, remove_M = FALSE,
calculate_missing_p = FALSE,
make_plots = TRUE, only_plot_if_threshold = TRUE,
threshold_allele_freq_correlation = 0.95,
threshold_p_correlation = 0.99,
plot_intensity = FALSE,
plot_histograms = make_plots, plot_QQ = make_plots,
plot_QQ_bands = TRUE, plot_Manhattan = make_plots,
plot_cutoff_p = 0.05,
allele_ref_std, allele_name_std,
allele_ref_alt, allele_name_alt,
update_alt = FALSE, update_savename,
update_as_rdata = FALSE, backup_alt = FALSE,
remove_mismatches = TRUE,
remove_mismatches_std = remove_mismatches,
remove_mismatches_alt = remove_mismatches,
threshold_diffEAF = 0.15, remove_diffEAF = FALSE,
remove_diffEAF_std = remove_diffEAF,
remove_diffEAF_alt = remove_diffEAF,
check_ambiguous_alleles = FALSE,
use_threshold = 0.1,
useFRQ_threshold = use_threshold,
useHWE_threshold = use_threshold,
useCal_threshold = use_threshold,
useImp_threshold = use_threshold,
useMan_threshold = use_threshold,
HQfilter_FRQ = 0.01, HQfilter_HWE = 10^-6,
HQfilter_cal = 0.95, HQfilter_imp = 0.3,
QQfilter_FRQ = c(NA, 0.01, 0.05),
QQfilter_HWE = c(NA, 10^-6, 10^-4),
QQfilter_cal = c(NA, 0.95, 0.99),
QQfilter_imp = c(NA, 0.3, 0.5, 0.8),
NAfilter = TRUE,
NAfilter_FRQ = NAfilter, NAfilter_HWE = NAfilter,
NAfilter_cal = NAfilter, NAfilter_imp = NAfilter,
ignore_impstatus = FALSE,
minimal_impQ_value = -0.5, maximal_impQ_value = 1.5,
logI = 1L, logN = 1L, ...)
QC_series(data_files, datafile_tag, output_filenames,
dir_data = getwd(),
dir_output = paste(dir_data, "QCGWASed", sep = "/"),
dir_references = dir_data,
header_translations, out_header = "standard",
allele_ref_std, allele_name_std,
allele_ref_alt, allele_name_alt,
update_alt = FALSE, update_savename,
update_as_rdata = FALSE, backup_alt = FALSE,
plot_effectsizes = TRUE, lim_effectsizes = NULL,
plot_SE = TRUE, label_SE = TRUE,
plot_SK = TRUE, label_SK = "outliers",
save_filtersettings = FALSE, ...)
Arguments
filename , data_files , datafile_tag |
|
filename_output , output_filenames |
respectively the filename or names of the output of the QC.
This should not include an extension, since the QC will
automatically add one. The default is to use the input
filename with |
dir_data , dir_output , dir_references |
character strings
specifying the directory dress of the folders for, respectively,
the input file(s), the output file(s) and the auxillary files
(header-translation tables and allele references). Note that
R uses forward slash (/) where Windows uses backslash (\).
If |
header_translations |
Translation table for the column
names of the input file. Alternatively, the name of a file
in |
column_separators |
character string or vector; specifies
the values used as column delimitator in the GWAS file. The
argument is passed to |
nrows_test |
integer; the number of rows used for
"trial-loading". Before loading the entire dataset, the
function |
nrows , header , comment.char |
arguments passed to
|
na.strings |
character vector describing the values that
represent missing data in the dataset. Passed to
|
imputed_T , imputed_F , imputed_NA |
character vectors;
passed to |
save_final_dataset |
logical; should the post-QC dataset be saved? |
gzip_final_dataset |
logical; should the post-QC dataset be compressed? |
order_columns |
logical; should the post-QC dataset use the default column order? |
spreadsheet_friendly_log |
logical; if |
out_header |
Translation table for the column names of
the output file. This argument is the opposite of
|
out_quote , out_sep , out_eol , out_na , out_dec , out_qmethod , out_rownames , out_colnames |
arguments passed to
|
return_HQ_effectsizes |
logical; return a vector of (max.
1000) high-quality effect-sizes? (In |
remove_X , remove_Y , remove_XY , remove_M |
logical; respectively whether X-chromosome, Y-chromosome, pseudo-autosomal and mitochondrial SNPs are removed from the dataset. |
calculate_missing_p |
logical; should the QC calculate missing/invalid p-values in the dataset? |
make_plots |
logical; should the QC generate and save
QQ plots, a Manhattan plot, histograms of data distribution
and scatter plots of correlation in |
only_plot_if_threshold |
logical; should the scatterplots only be made if the correlation is below a threshold value? |
threshold_allele_freq_correlation , threshold_p_correlation |
numeric; thresholds for reporting and plotting the correlation between respectively the allele frequency of the dataset and the reference, and the calculated and reported p-values. |
plot_intensity |
logical; if |
plot_histograms |
logical; should histograms of the effect sizes, standard errors, allele frequencies, HWE p-values, callrates and imputation quality be made? |
plot_QQ , plot_Manhattan |
logical; should QQ and Manhattan plots be made? |
plot_QQ_bands |
logical; include probability bands in the QQ plot? |
plot_cutoff_p |
numeric; significance threshold for
inclusion in the QQ and Manhattan plots. The default value
( |
allele_ref_std , allele_ref_alt |
the standard and alternative
allele-reference tables. Alternatively, the name of a file
in |
allele_name_std , allele_name_alt |
character strings; these name the standard and alternative allele reference in the output. If no values are given, the function will use the reference's filename (if specified) or a default name. |
update_alt |
logical; if the function encounters SNPs not included in either the standard or alternative reference, should these be added to the alternative reference file? If no alternative reference was specified, this creates one. |
update_savename |
character string; the filename for
saving the updated alternative reference, without
extension. If |
update_as_rdata |
logical; should the updated alternative
allele reference be saved as |
backup_alt |
logical; if the alternative allele reference is updated, should a back-up be made of the original reference file? |
remove_mismatches , remove_mismatches_std , remove_mismatches_alt |
logical; should SNPs with mismatching alleles be removed
from the dataset? |
threshold_diffEAF |
Numeric; the threshold for the difference between reported and reference allele-frequency. SNPs for which the difference exceeds the threshold are counted and (optionally) removed. |
remove_diffEAF , remove_diffEAF_std , remove_diffEAF_alt |
Logical; should SNPs that exceed the |
check_ambiguous_alleles |
logical; check for SNPs with strand-independent allele-configurations (i.e. A/T and C/G SNPs)? |
use_threshold |
numeric; threshold value. The relative or absolute number of
usable values required for a variable to be used in the QC.
These arguments prevent the QC from applying filters to variables
with no data. If a variable has less non-missing, non-invalid
values than specified in the threshold, it will be ignored;
i.e. no filter will be applied to it and no plots will be made.
Values |
useFRQ_threshold , useHWE_threshold , useCal_threshold , useImp_threshold , useMan_threshold |
numeric; variable-specific thresholds for allele frequency, HWE p-value, callrate, imputation quality and Manhattan plot (i.e. chromosome & position values) respectively. |
HQfilter_FRQ , HQfilter_HWE , HQfilter_cal , HQfilter_imp |
numeric; threshold values for the high-quality SNP filter.
SNPs that do not meet or exceed all four thresholds will be
excluded from several QC tests. The filters are for allele
frequency, HWE p-value, callrate & imputation quality,
respectively, and are processed by |
QQfilter_FRQ , QQfilter_HWE , QQfilter_cal , QQfilter_imp |
numeric vector; threshold values for the QQ plot filters.
SNPs that do not meet or exceed the value will be excluded
from the QQ plot. Up to five values can be specified per
filter. The filters are for allele-frequency, HWE p-value,
callrate & imputation quality respectively, and are
processed by |
NAfilter , NAfilter_FRQ , NAfilter_HWE , NAfilter_cal , NAfilter_imp |
logical; should the high-quality and QQ filters exclude
( |
ignore_impstatus |
logical; if |
minimal_impQ_value , maximal_impQ_value |
numeric; the minimal and maximal possible (i.e. non-invalid) imputation quality values. |
logI , logN |
progress indicators used by |
plot_effectsizes , plot_SE , plot_SK |
logical; additional
plot options for |
lim_effectsizes |
specifies the y-axis range of the effect-size distribution plot. |
label_SE |
logical; should the datapoints in the precision plot be labeled? |
label_SK |
character string; determines whether the
datapoints in the skewness vs. kurtosis plot are labeled.
Options are |
save_filtersettings |
logical; saves the filtersettings
used by the high-quality filter to a file
'Check_filtersettings.txt' in the output directory. If
a file of that name already exists, the settings are added
to the end (i.e. it updates rather than overwrites existing
files). The file can be used as ini file by
|
... |
in |
Details
The full quality-control carried out by QC_GWAS
consists
of 5 phases. The function takes a single dataset (or, rather,
the location and filename of a single dataset) and runs it
through the following phases:
1: Importing the dataset
2: Checking data integrity
3: Checking alleles
4: Generating QC statistics and graphs
5: Saving the output
Phase 1: importing the dataset
GWAS results files come in a variety of formats, so
QC_GWAS
is flexible about loading data. It
uses an autoloader function (load_GWAS
) to
unpack .zip
or .gz
files and determine the
column-separator used in the file. See the section
'Requirement for the input dataset' for more information.
Next, the function attempts to translate the dataset's column
names (the header) to standard names, so that it knows what
type of data a column contains. This is done by comparing
the column names to a translation table (specified in the
header_translations
argument). See
translate_header
for more information.
Note that only the SNP ID, alleles, effect-size and standard
error columns are required. The absence of other standard
columns (chromosome, position, strand, allele frequency, HWE
p-value, callrate, imputation quality, imputation status and
used for imputation) will not cause the QC to abort.
Instead, a warning is printed on screen and in the log file,
and a dummy column filled with NA
values is added to
the dataset.
It is therefor important to check the log file: if a standard
column is present but not identified (because it is missing or
misspelled in the translation table) the QC will continue,
but is unable to check/use the data inside. The unidentified
column will be reported in the columns_unidentified
value of the QC_GWAS
return or in the
"QC_checklist.txt" file generated by QC_series
.
Phase 2: checking data integrity
The purpose of phase 2 is to ensure that the dataset can be QC'ed: that that all SNPs have the required data and that all columns contain only valid (or missing) values.
The first step is to remove SNPs that won't be used: monomorphic
SNPs and (if specified by the arguments) allosomal,
pseudo-autosomal and mitochondrial SNPs. The function considers
SNPs monomorphic if they have a missing or invalid other
(non-effect) allele, identical alleles or an allele-frequency
of 1
or 0
.
The second step is to check the imputation status column with
the function convert_impstatus
. See the section
'Requirement for the input dataset' for more information.
Imputation status is one of the most important variables in
the dataset: if unknown, the HWE p-value, callrate and imputation
quality won't be used (unless ignore_impstatus
is
TRUE
), as the function cannot determine which
SNPs are imputed and which are not. For this reason, if
convert_impstatus
is unable to translate any character
string in the column, the QC will abort.
The third step carries out three tests for all other standard variables:
Does the column contain the correct type of data (integer, numeric or character)?
How many values are missing (
NA
)?How many values are invalid (= impossible)?
The exact nature of the three tests differs per variable: see the documentation file in "R\library\QCGWAS\doc" for more detail.
The presence of the wrong data-type will cause the QC to abort. Wrong data-type indicates either a problem in the file itself, or with the way it was imported (in which case it is most likely due to a mistranslated header).
The final step is the removal of the invalid values and of
unusable SNPs. The variables MARKER, EFFECT_ALLELE, OTHER_ALLELE,
EFFECT and STDERR are considered crucial. SNPs with missing or
invalid values in any of these variables are removed the dataset.
Missing values in the other variables are ignored, while
invalid values are set to NA
.
Phase 3: checking alleles
This phase has three functions:
To check if the correct alleles are reported for each SNP
To check if the allele-frequency is reported for the correct (effect) allele
To ensure that SNPs are aligned to the positive strand and use the same effect-allele in all post-QC datasets
This is achieved by comparing the data to a reference, using
the function match_alleles
. First, all SNPs are
switched to the positive strand (the alleles are converted to
their opposing base and the strand-value is set to "+"
).
If there are SNPs whose allele pair doesn't match the
reference, match_alleles
assumes the information in
the strand column is absent or incorrect, and will also
switch those SNPs to the other (presumably positive) strand.
This step is referred to as strand-switching in QC output, and
is independent from the negative-strand SNP conversion. It is
therefor possible that a SNP is switched twice: once because
the strand-column indicates it is on the negative strand, and
twice because of a mismatch. This is referred to as double
strand-switch in the output, and indicates either the wrong
value in the strand column, or a mismatch with the reference.
In the latter case, it will most likely be picked up in the
next step.
If the strand-switch does not fix the mismatch, the SNPs
are counted in the QC output as mismatches. Depending on the
remove_mismatches
arguments, the SNPs will either be
removed from the dataset, or left in but excluded from the
further tests of the allele-matching.
Next, match_alleles
"flips" SNPs so that their effect
allele matches the reference minor allele. This ensures
that a SNP will have the same effect allele in all post-QC
datasets.
match_alleles
also counts the number of SNPs with a
strand-independent allele configuration (A/T or C/G; these are
designated as "ambiguous SNPs"), and the subset of those with
an allele-frequency that is substantially different from the
reference ("suspect SNPs"). If a substantial proportion of
ambiguous SNPs is suspect, it indicates that the strand
information is incorrect. In our experience, a regular, 2.5M
SNP dataset usually consists of 15% ambiguous SNPs, of which
a few dozen will be suspect.
The function also counts the number of SNPs whose allele
frequency differs from the reference by more than a set amount
(threshold_difEAF
). If the relevant remove_diffEAF
argument is TRUE
, these SNP will be excluded from the
dataset after the allele-matching is finished.
The final step is to correlate the reported allele-frequency
against the reference. If allele-frequency is reported for
the correct (effect) allele, the correlation should be close
to 1
. If the outcome is close to -1
, the reported
frequency is that of the other allele. Depending on the plot settings,
a scatter plot of reported vs. reference frequency is made
for all SNPs, and for the subsets of ambiguous and non-ambiguous
SNPs.
The standard and alternative allele reference
The above steps describe what happens when the dataset is compared to a single reference. However, we found that many GWAS datasets contain SNPs not present in our standard HapMap reference, so we added a second, flexible reference that can be updated with any unknown SNPs the QC encounters.
SNPs that are not found in either reference are converted
to the positive strand, and "flipped" if their allele frequency
is > 0.50. If update_alt
is TRUE
, these SNPs are
then added to the alternative
reference and saved under the name update_savename
.
There are a few caveats to this system: see the section
'Updating the alternative reference' for details.
Phase 4: generating QC statistics and graphs
At this stage, no further changes will be made to the dataset (except, optionally, to recalculate missing p-values). The function will now start to calculate the QC statistics and generate the important graphs. These are:
Create histograms of variable distribution (optional)
Check p-values by correlating them to a p calculated from the effect-size and standard-error (via the
check_P
function).Recalculate missing/invalid p-values (optional)
Calculate QC statistics
Create QQ and Manhattan plots (optional, see
QC_plots
function for more information).
Phase 5: saving the output
A series of tables is added to the bottom of
the log file, reporting the QC statistics and the data
distribution. If save_final_dataset
is TRUE
, the
post-QC data will be exported as a .txt file. The column names
and format of that file can be specified by the out arguments.
Value
The most important output of the QC is the log file. See the section 'QC output files' for more details. This section only describes the function return within R.
QC_series
returns a single, invisible, logical value,
indicating whether the alternative allele-reference has
been updated.
QC_GWAS
returns an object of class 'list'. If the QC
was not successful, this list contains only five of the following
components (QC_successful
, filename_input
,
filename_output
, all_ref_changed
, effectsize_return
).
If it was, it will contain all of them:
QC_successful |
logical; indicates whether the QC was
completed. If |
filename_input , filename_output |
the filenames of the dataset pre- and post-QC respectively. |
sample_size , sample_size_HQ |
the highest reported sample size for all SNPs and high-quality SNPs only, respectively. |
lambda , lambda_geno , lambda_imp |
the lambda values of all, genotyped and imputed SNPs, respectively. |
SNP_N_input |
the number of SNPs in the original dataset. |
SNP_N_input_monomorphic |
the number of SNPs removed because they are monomorphic. |
SNP_N_input_monomorphic_identic_alleles |
the subset of
above that had identical alleles, but allele-frequencies
that were not |
SNP_N_input_chr |
the number of SNPs removed because they
were X-chromosomal, Y-chromosomal, pseudo-autosomal or
mitochondrial (depends on the remove-arguments). If all
remove arguments were set to |
SNP_N_preQC |
the number of SNPs that entered phase 2b (i.e. after removal of the monomorphic and excluded-chromosome SNPs). |
SNP_N_preQC_unusable |
the number of SNPs removed in phase 2d, due to missing or invalid crucial variables. |
SNP_N_preQC_invalid |
the number of SNPs with invalid, non-crucial values in phase 2d. |
SNP_N_preQC_min |
the number of negative-strand SNPs in phase 2d. |
SNP_N_midQC |
the number of SNPs in the dataset during allele-matching (phase 3). |
SNP_N_midQC_min |
the number of negative strand SNPs in phase 3. |
SNP_N_midQC_min_std , SNP_N_midQC_min_alt , SNP_N_midQC_min_new |
the number of negative strand SNPs matched against, respectively, the standard allele reference, the alternative allele reference or neither. |
SNP_N_midQC_strandswitch_std , SNP_N_midQC_strandswitch_alt |
the number of SNPs that were strand-switched because of a mismatch with the standard and alternative allele reference, respectively. |
SNP_N_midQC_strandswitch_std_min , SNP_N_midQC_strandswitch_alt_min |
the subset of previous that were negative-strand SNPs. NOTE: at this point in the QC, negative-strand SNPs have already been converted to the positive strand, i.e. they should not appear in this category. If they do, there is a problem with the reported strand, or with the reference table. |
SNP_N_midQC_mismatch |
the number of SNPs that were still mismatching after the strand-switch. |
SNP_N_midQC_mismatch_std , SNP_N_midQC_mismatch_alt |
subset of previous that were matched with the standard and alternative allele reference, respectively. |
SNP_N_midQC_mismatch_std_min , SNP_N_midQC_mismatch_alt_min |
subset of previous that are negative-strand SNPs. |
SNP_N_midQC_flip_std , SNP_N_midQC_flip_alt , SNP_N_midQC_flip_new |
Number of SNPs that were flipped (had their alleles reversed) when matched against, respectively, the standard allele reference, the alternative allele reference or neither. |
SNP_N_midQC_ambiguous |
the number of ambiguous SNPs |
SNP_N_midQC_ambiguous_std , SNP_N_midQC_ambiguous_alt , SNP_N_midQC_ambiguous_new |
subset of ambiguous SNPs matched against, respectively, the standard allele reference, the alternative allele reference or neither. |
SNP_N_midQC_suspect |
the subset of ambiguous SNPs whose allele frequencies differ strongly from those in the reference. |
SNP_N_midQC_suspect_std , SNP_N_midQC_suspect_alt |
the subsets of previous matched against the standard and alternative allele reference, respectively. |
SNP_N_midQC_diffEAF |
the number of SNPs whose allele frequency differs strongly from the reference. |
SNP_N_midQC_diffEAF_std , SNP_N_midQC_diffEAF_alt |
subset of previous that were matched with the standard and alternative allele reference, respectively. |
SNP_N_postQC |
the number of SNPs in the final dataset. |
SNP_N_postQC_geno , SNP_N_postQC_imp |
the number of genotyped and imputed SNPs in the final dataset. |
SNP_N_postQC_invalid |
the number of SNPs with invalid
values remaining in the final dataset. Note: any invalid
values have already been changed to |
SNP_N_postQC_min |
the number of negative-strand SNPs in the final dataset. Note: all SNPs have been switched to the positive strand at this point. This merely counts how many of those SNPs are still in the dataset. |
SNP_N_postQC_HQ |
the number of high-quality SNPs in the final dataset. |
fixed_HWE , fixed_callrate , fixed_sampleN , fixed_impQ |
logical or character string; are HWE p-values, callrates,
sample-size and imputation quality values identical for all
relevant SNPs? If |
effect_25 , effect_median , effect_75 |
the quartile values of the effect-size distribution. |
effect_mean |
the mean of the effect-size distribution. |
SE_median , SE_median_HQ |
the median standard error of all SNPs and high-quality SNPs only, respectively. |
skewness , kurtosis |
the skewness and kurtosis value of the dataset. |
skewness_HQ , kurtosis_HQ |
the skewness and kurtosis value for high-quality SNPs only. |
all_ref_std_name , all_ref_alt_name |
the names used for the standard and alternative allele-references in the output. |
all_MAF_std_r , all_MAF_alt_r |
allele-frequency correlation with the standard and alternative allele references. |
all_ambiguous_MAF_std_r , all_ambiguous_MAF_alt_r |
allele-frequency correlations for the subset of ambiguous SNPs in the standard and alternative allele references, respectively. |
all_non_ambig_MAF_std_r , all_non_ambig_MAF_alt_r |
allele-frequency correlations for the subset of non-ambiguous SNPs in the standard and alternative allele references, respectively. |
all_ref_changed |
logical; has an updated alternative allele reference been saved? |
effectsize_return |
logical; is a vector of high-quality
effect-sizes returned in |
effectsizes_HQ |
if |
pvalue_r |
the correlation between reported and calculated p-values. |
visschers_stat , visschers_stat_HQ |
the Visscher's statistic for all SNPs and high-quality SNPs only, respectively. |
columns_std_missing |
the names of any missing, standard
columns: if no columns are missing, this returns |
columns_std_empty |
the names of any empty, standard
columns: if no columns are empty, this returns |
columns_unidentified |
the names of any unidentified
columns in the input dataset. If none, this returns |
outcome_useFRQ , outcome_useHWE , outcome_useCal , outcome_useImp , outcome_useMan |
logical; indicates whether the variable passed the threshold test. |
... |
the remaining 'setting' components return the the actual filter settings used in the QC (i.e. taking into account whether the variables passed the threshold test). |
QC output files
The results of the QC are reported in a variety of .txt and
.png files saved in dir_output
. The files use the same
output name as the dataset, with an extension to indicate
their contents (i.e. '_log.txt', '_graph_QQ.png'). The .txt
files are tab-delimited and are best viewed in a spreadsheet
program. The most important one is the log file. This file
summarizes the results of the QC and the data inside the file.
The log file
The top of the file is table of log entries, reporting any (potential) problems encountered during the QC. Some of these are just routine updates; the removal of SNPs with missing data, for example. However, do check the other entries. These report important but non-fatal problems, relating to crucial missing data or invalid data. In such a case, and provided the QC did not abort, the affected SNPs will have been exported to a .txt file before being excluded, so the user can inspect them without having to reload the entire dataset. The .txt's are:
[filename_output]_SNPs_invalid_allele2.txt
[filename_output]_SNPs_duplicates.txt
[filename_output]_SNPs_removed.txt
[filename_output]_SNPs_improbable_values.txt
(Note: the names of the files are slightly confusing: the "SNPs_removed" file contains all SNPs removed in phase 2d. This does not include monomorphic SNPs, or SNPs from excluded chromosomes, as these are removed in phase 2a. Also, the "SNPs_improbable_values" file does not include SNPs with invalid values for crucial variables, as these are already in the "SNPs_removed" file.)
Another important but non-fatal problem is missing columns.
QC_GWAS
uses a translation table to determine the
contents of a column. If the translation table is incomplete,
or contains a typo, the function will be unable to translate
(and therefor use) a column. If this involves, say, callrate,
it merely means the function cannot apply the callrate filters,
but the absence of p-values or imputation status will disable
many features of the QC. If you know that a data column is
present, yet the log reports it missing, check the translation
table. The QC_series
checklist output and the
columns_unidentified
value of the QC_GWAS
return
report the names of any unidentified columns in the dataset.
If the QC aborts, the log file should give some indication why this occurred. However, if it was successful, there will be several other tables in the log file.
The second table reports the number of SNPs in the dataset at various stages of the QC; as well as how many (and for what reason) SNPs were removed.
The third table gives an overview of the data itself. It reports how many values were missing and invalid per variable in both the pre- and post-QC datasets, as well as the quartile and mean values of the post-QC data. A few notes on the nomenclature: invalid values will have been removed (for crucial variables) or set to missing (for non-crucial variables) in stage 2d. The post-QC 'invalid' column merely records how many of these SNPs remain in the dataset. 'Unusable' values are the missing and invalid values combined (shown as percentage of the total data). Finally, pre-QC refers to the dataset during stage 2b-c, but after monomorphic SNPs and SNPs from excluded chromosomes (stage 2a) have been removed.
The fourth table reports on the allele-matching in phase 3.
The concepts are explained in 'details' and the
match_alleles
function; here we just mention
what the user needs to pay attention to. Strand-switching
counts SNPs whose strand was switched because it mismatched
with the reference. As many cohorts do not add strand data
(or set every SNP to "+"
), the presence of such SNPs is
not a red flag by itself. However, if there are mismatching
SNPs (the subset of strand-switched SNPs that could not be
fixed), there is a problem with the allele data (or, possibly,
a triallelic SNP). Check the
[filename_output]_SNPs_mismatches-[ref-name].txt
file
to see the affected SNPs.
Another red flag is if there are strand-switched SNPs in a
dataset that also contains negative-strand SNPs (i.e.
the cohort included real strand data, rather just setting it
to "+"
for all SNPs). Negative-strand SNPs are converted
to the positive strand beforehand, so they should not appear in
this step (if they do, they are counted in the "double
strand-switch" entry, but that is of minor importance). The real
problem is that if a cohort includes negative-strand SNPs (i.e.
real strand data), and there are still strand-switches, the strand
data must be incorrect. Whether the strand-switches and the
negative-strand SNPs overlap is unimportant.
The third possible problem is when a large proportion of ambiguous SNPs is suspect: it indicates that they are on the wrong strand.
Finally, a large number of SNPs with a deviating allele frequency indicates either that the frequency is reported for the wrong allele (see below) or that the dataset population does not match that of the reference.
The fifth table reports the QC outcome statistics. The p-value
and allele-frequency correlations should be close to 1
.
An allele-frequency correlation of -1
means that the
frequency was reported for the wrong (non-effect) allele. As
for the p-value correlaction: in a
typical GWAS dataset, the expected and observed p-values
should correlate perfectly. If this isn't the case, it means
either that a column was misidentified when loading the dataset
or that the wrong values were used when generating the data.
The sixth table reports how many SNPs were removed by the various QQ plot filters.
The seventh table reports the chromosomes and alleles present in the final dataset.
The eighth table counts invalid values in the pre-
and post-QC files for several variables. 'Extreme p' is a
value that is only used when calculate_missing_p
is
TRUE
. Any newly-calculated p-values that are
< 1E-300
will be set to 1E-300
, in order to
exclude any values of 0
(1E-300
is close to the
smallest numeric value that R can handle safely).
The final four tables contain the settings of the QC.
The QC_series output
QC_series
saves a checklist, showing the most
important QC stats of the various files side by side, and
(depending on the plot-arguments) several graphs comparing
effect-size distribution, precision and skewness vs. kurtosis
of the QC'ed files.
Output header
The standard-format values used by out_header
are:
-
"standard"
retains the default column names used byQC_GWAS
. -
"original"
restores the column names used in the input file. -
"old"
uses the default column names of pre-v1.0b versions ofQCGWAS
. -
"GWAMA"
,"PLINK"
,"META"
and"GenABEL"
set the column names to those use by the respective programs. Note that META's alleleB corresponds to EFFECT_ALL inQC_GWAS
.
Requirement for the input dataset
QC_GWAS
will automatically
unpack .gz
and .zip
files, provided the filename
includes the extension of the packed file. For example, if the
data file is named "data1.csv"
, the zip file should be
"data1.csv.zip"
. If it is named "data1.zip"
,
QC_GWAS
won't be able to "call" the file inside.
QC_GWAS
is flexible when it comes to file-format. By
default, it can open datasets with a variety of column separators
and NA values (the user can specify these via the
column_separators
and na.strings
arguments).
Read the documentation of the auto-loader function
load_GWAS
for more information.
Chromosome values can be coded as numeric or character:
values of "X"
, "Y"
, "XY"
and
"M"
will automatically be converted to 23
,
24
, 25
and 26
, respectively.
By default, imputation status is coded as integers 0
(genotyped) and 1
(imputed). As of version 1.0-4,
imputation status will always be translated using the
imputed_T
, imputed_F
and imputed_NA
arguments. This means that the user must specify values for
these arguments, even when the dataset already uses the
standard format.
Because of the importance of imputation status, if the function
is unable to translate character values, the QC will abort.
The minimal and maximal valid imputation quality values are
determined by the minimal_impQ_value
and
maximal_impQ_value
arguments.
Standard errors and p-values of 0
are considered invalid and
removed in phase 2d, while values of -1
will be set to
NA
. Effect sizes of -1
are accepted, unless the
standard error and/or the p value are also -1
, in which
case it is also set to NA
.
Filter arguments
QC_GWAS
has three sets arguments relating to filters:
the arguments for the HQ (high-quality), QQ (plot) and NA
(missing values) filters. The HQ and QQ arguments work
mostly in the same way, but there are a few key differences.
The high-quality arguments accept single, numeric values that determine the minimal values of allele-frequency (FRQ), HWE p-value (HWE), callrate (cal) and imputation quality (imp) for a SNP to be of high-quality. The high-quality filter is used for the effect-size boxplot and the Manhattan plot, as well as several QC stats.
The QQ arguments accept a vector of max. 5 numeric values that are applied sequentially as filters in the QQ plots.
Both of these use the NA filter argument(s) to determine whether to exclude or ignore missing values.
Neither filter is used to remove SNPs; they merely
exclude them from several QC tests. Both HQ and QQ filter
criteria are only applied if the variable passed the threshold
test, i.e. if there are sufficient non-missing, non-invalid
values for the filter to be applied (see the use_threshold
argument for details). It's pointless to filter an empty column.
If ignore_impstatus
is FALSE
(default), the
imputation-quality criterion is only applied to
imputed SNPs, and the HWE p-value and callrate criteria only
to genotyped SNPs. If TRUE
, the filters are applied to
all SNPs, regardless of the imputation status.
The allele-frequency filters are two-sided: when set to value x, SNPs with frequency < x or > 1 - x are excluded.
To filter missing values only, set the filter to NA
and
the corresponding NA filter argument to TRUE
. To disable
entirely, set to NULL
(this means the NA-filter setting is
ignored as well).
The differences between the HQ and QQ filters are:
The HQ filter arguments accept a single value, the QQ filters can accept up to 5.
The HQ filter is a single filter: a SNP needs to meet all relevant criteria to be considered high-quality. The QQ filter values are applied separately.
The QQ filter has an additional feature: if passed a value
x > 1
, it will calculate a filter value of
x / sample size
. This is to allow size-based filtering
of allele frequencies. Note that this filter uses the sample
size listed for that specific SNP. If the sample size is
missing, the relevant NA-filter setting is used to determine
whether it should be excluded.
Updating the alternative reference
There are two drawbacks to the way the function updates the alternative reference file. One is a technical issue, but the other can affect the QC of subsequent files.
Firstly, the argument update_alt
has a slightly misleading
name: the alternative-reference file is updated, but
the reference inside the R memory is not. If the user wants to
do further QCs with the updated reference, (s)he will have to
manually reload the updated file into R.
This is caused by the way R handles data-alterations that occur
inside a function. Any changes made to data last only for the
duration of the function. Once the function terminates, the memory
reverts to its original state. In other words: the allele reference
is updated inside QC_GWAS
, but goes back to the pre-QC
state once the QC is finished. QC_series
deals with this by
automatically reloading the reference file whenever it is
updated, but, again, once the function terminates it will
revert to its original state.
The second drawback is that the content of the alternative reference is arbitrary, depending on which file an unknown SNP is encountered in first. For example, suppose that SNP rs31 has alleles A and G, an allele frequency that varies around 0.5, and does not appear in the standard reference. When it is added to the alternative reference, the allele listed as the minor one depends entirely on the allele frequency in the first file it is encountered in.
More seriously, if the information in the first file is
incorrect, the SNP may be strand-switched or excluded
in subsequent files because it does not match the (incorrect)
reference. This is another reason why it is important to check
the log files: if there is a problem with a datafile's
strand, alleles or allele-frequency, and the alternative
reference was updated, the incorrect data may have been added
to the reference. If so, one should go back to a previous
reference file. The argument backup_alt
is useful for
this, though note that QC_series
only does this the
first time the reference is updated.
Also, if one wants to QC a large number of files for a meta-GWAS,
one should use the same alternative allele reference file (and
let QC_GWAS
update it) for every file, otherwise it is
possible that rs13 may have a different effect alleles in some
post-QC files.
See Also
For the plots created by QC_series
:
plot_distribution
,
plot_precision
and plot_skewness
.
For loading and preparing a GWAS dataset:
load_GWAS
, translate_header
,
convert_impstatus
.
For carrying out separate steps of QC_GWAS
:
match_alleles
, check_P
,
QC_plots
.
Examples
## For instructions on how to run QC_GWAS and QC_series
## check the quick start guide in /R/library/QCGWAS/doc