Marvin N. Wright*, Damian Gola*, Andreas Ziegler

Institut für Medizinische Biometrie und Statistik, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Universität zu Lübeck, Lübeck, Germany

* Equal contribution

The final publication is available at link.springer.com via https://doi.org/10.1007/978-1-4939-7274-6_30.

Abstract

The advancement of high throughput sequencing technologies enables sequencing of human genomes at steadily decreasing costs and increasing quality. Before variants can be analyzed, e.g., in association studies, the raw data obtained from the sequencer need to be preprocessed. These preprocessing steps include the removal of adapters, duplicates and contaminations, alignment to a reference genome and the post-processing of the alignment. All later steps, such as variant discovery, rely on high data quality and proper preprocessing, emphasizing the great importance of quality control. This chapter presents a workflow for preprocessing Illumina HiSeq X sequencing data. Code snippets are provided for illustrating all necessary steps, along with a brief description of the tools and underlying methods.

Keywords: Whole genome sequencing, Sequencing, High throughput sequencing, Illumina, HiSeq X, HTS, NGS, Quality control, Preprocessing, Alignment, Mapping

1 Introduction

High throughput sequencing (HTS) has expanded rapidly during the last decade, and great progress has been made in terms of sequencing speed, read length and reduction in per-base cost. The main technology competitors in this field are Illumina (1), Life Technologies (2) and Roche (3). Detailed descriptions of their HTS platforms are given by Metzker (4) and Liu et al. (5); the strengths and weaknesses of the platforms are reviewed by Van Dijk et al. (6).

The main principle of HTS technology is sequencing by synthesis, which is the same for most platforms. The first step is to prepare samples for sequencing by creating libraries. This is done by fragmenting the cDNA into small single strand fragments, called ssDNA templates, and ligating adapters to the 5’ and 3’ ends. Second, libraries are arranged into clusters on a solid surface. Third, complementary bases are built up. The complementary bases sequentially emit fluorescence signals, and the process of this synthesis step is captured in a series of images. The actual nucleotide information is inferred from the fluorescence intensity data for each cluster by using base-calling algorithms. This yields so-called reads, which can be processed by downstream algorithms. Finally, a measure of uncertainty or quality is assigned to each called base.

Further refinements of the basic HTS technology, such as paired end sequencing and multiplexing, can increase precision and throughput. Paired end sequencing enables sequencing from both ends of the ssDNA templates, resulting in paired end reads (7). By this means, twice the number of reads are produced, enabling more precise alignment to the reference genome and the ability to detect insertions and deletions (8). Consequently, paired end sequencing is used commonly. Multiplexing enables sequencing libraries from multiple samples simultaneously by adding an additional barcode sequence to the adapters (7, 9). The reads thus obtained can be sorted by these barcodes to get read data for each sample individually.

The major drawback of current HTS technology is that any information about the region or position on the genome of the generated reads is lost. Therefore, for each read, the most likely position on a reference genome must be found. This process is called alignment, which is the central element of HTS data preprocessing.

Illumina was the first company to break the $1000 barrier for human whole genome sequencing with the HiSeq X platform in 2014, thus achieving the goal given out by the National Human Genome Research Institute. This was realized by raising the throughput using patterned flow cells as solid surface, providing an extremely high cluster density, high occupancy and monoclonality within each well, a fast camera system, and new polymerase enzymes which result in faster reactions and thus shorter runtimes (10).

This chapter describes the entire pipeline from raw sequence data to data ready for variant calling for Illumina’s HiSeq X platform. At the core, the raw reads produced by this sequencing platform need to be aligned to the human reference genome. In addition to the alignment, several quality control and processing steps utilizing different tools are necessary. The workflow presented in this chapter is illustrated in Figure 1.1. Generally, it follows the workflow presented by DePristo et al. (11) and Van der Auwera et al. (12). In the remainder of the introduction, the steps highlighted in Figure 1.1 and the underlying rationales are outlined. In Section 2, the tools utilized are described in detail, and code snippets are given to illustrate their application.

Figure 1.1: Schematic workflow of preprocessing steps for whole genome sequences from the Illumina HiSeq X platform.

1.1 Remove PhiX contamination and adapters

We recommend performing a quality check of the raw sequencing data prior to any further analyses (see Subheading 2.2.1). At this stage, the aim is to reveal and resolve DNA contamination. Common sources of contamination with foreign DNA are biomaterial from other samples or external biomaterial, e.g. from laboratory staff, the adapters used for sequencing the short reads, or libraries from genomes of other species added intentionally for quality control. The handling of the latter two sources is described below, whereas the first two are covered in Subheading 2.2.

Illumina offers PhiX libraries from the PhiX bacteriophage genome to provide a quality control for cluster generation and sequencing as well as a calibration control during sequencing runs. This technique is recommended for samples with low diversity. However, since most mammals have high diversity, i.e., 40%–60% GC content for whole-genome sequencing, a PhiX control is not required. If PhiX libraries are used, they are either spiked in the same lane of samples in low concentration (~1%), or they occupy a separate lane in high concentration (up to 40% for low diversity samples) (13). These libraries should be removed to avoid integration into the target genome. However, Mukherjee et al. (14) noted that many published genomes contain PhiX contamination, which might be an indication that some researchers are not aware of these PhiX libraries.

As described above, adapters are a core element of Illumina’s HTS technologies. These oligos are ligated to the 5’ and 3’ end of each DNA fragment during library preparation and are the necessary elements for immobilization on a solid surface and sequencing (7). These oligo sequences are not part of the target genome and should not be considered for alignment. As the sequences of the adapters are known, they can easily be marked or removed from the raw reads.

1.2 Align to reference genome

The aim in aligning a read to the reference genome is to find the chromosomal position the read is most likely coming from. If only perfect matching positions are sought for a reference sequence, no variation can be found. The alignment thus need to allow for non-perfect matches, which increases algorithmic and computational complexity. Furthermore, the human genome is highly repetitive, making unambiguous alignment impossible.

In the workflow presented here, alignment is done with the BWA-MEM (15) algorithm. BWA-MEM is part of the Burrows-Wheeler aligner (BWA) package, which includes three alignment algorithms based on the Burrows-Wheeler transformation (BWT) (16). The general idea of all Burrows-Wheeler aligners is to collapse repetitions of sequences in the reference genome to find matching substrings in linear time.

After the alignment of reads to the reference genome, duplicate reads may be present, the alignment should be further refined and base call quality scores supplied by the sequencer should be recalibrated.

1.3 Remove duplicates

Generally, duplicates are several reads of the same original DNA fragment. There are two sources for duplication, PCR duplicates and optical duplicates. PCR duplicates cannot be avoided if PCR amplification is used for library preparation. Optical duplicates arise if sequences from a single amplification cluster are incorrectly detected to be from multiple adjacent clusters by the optical sensor of the sequencer.

Duplicate reads can propagate amplification errors and might violate independence assumptions in subsequent variant calling or statistical testing. Therefore, only the read with the highest sequencing quality of all duplicate reads should be kept. A duplicate sequence is simple to detect because it aligns to the same starting position of the reference. However, due to the repetitive nature of the human genome, this could also be the case for non-duplicate reads, and false positives might arise. If PCR-free library preparation is used, the removal of duplicates might not be necessary. Given the computational effort of duplicate removal and its unclear benefit, there is thus an ongoing discussion whether duplicates should be removed at all (17, 18). The current standard approach is to remove duplicates, at least to remove optical duplicates.

1.4 Realign around insertions or deletions

Insertions or deletions (indels) can lead to incorrect alignments, in particular if indels are close to the ends of the reads. This can impact the recalibration of base quality scores and can lead to incorrectly called SNPs in the variant calling. To improve the alignment around indels, a local realignment can be done, which jointly considers all reads covering the region and includes known indels from external information, such as the 1000 Genomes (19).

There is current discussion whether realignment is still necessary if a variant caller is used that completely reassembles regions showing variation (20). In the GATK workflow, such a variant caller is used, and consequently the realignment step is removed from that workflow (21). However, realignment is still required if another variant caller is used, or for low quality data to improve the base quality score recalibration.

1.5 Recalibrate base quality scores

Variant calling algorithms rely heavily on accurate per-base estimates of errors emitted by sequencing machines. For the Illumina HiSeq X platform, insertion and deletion errors are rare. The overall base miscall error rate is around 1% (22). The main complication resulting in base miscall errors on the Illumina platform is the asynchronous synthesis process between different ssDNA templates in the same cluster. This results in less accurate base-calling in later cycles as asynchrony increases. Usually, base call uncertainty or quality is measured as a score on the Phred-scale (23): $Q_{\text{Phred}}=-10 \log_{10}P\left(\text{error}\right)$. The Phred score is used during variant calling to weigh up the evidence for or against a variant allele existing at a particular site. Raw quality scores for each base call are provided by Illumina directly in the raw data, but these are based on noise estimates from image analysis only. They disregard any other covariates, such as the cycle number or the dinucleotide context. The raw scores should thus be recalibrated prior to variant calling by algorithms which take into account these covariates. This usually improves the base miscall error accuracy by 5%–30% (22).

2 Methods

2.1 Preprocessing

In the workflow described in this chapter, we make use of four different command-line tools. Reads are aligned with the bwa software (15, 24). The Genome Analysis Toolkit (GATK) (25, 26) is employed for advanced post-align processing. For file manipulation samtools (27, 28) and Picard (29) are applied. The versions of the tools used in the workflow are given in Note 1. Detailed overviews of different software tools are given elsewhere (30–34). Since this is an area of active research, the tools utilized are still in further development, new methods are proposed frequently, and no common standard has been stablished so far.

A Java environment is required for GATK and Picard. Generally, all tools in the workflow are capable of multithreading (see Note 2 for details). However, their availability and effectiveness depends on the underlying algorithms. In most tools memory usage increases with multithreading. The memory for the Java virtual machine and the number of threads are set in Listing 2.1. These will be used in subsequent processing steps in the presented workflow.

Listing 2.1 Define computational resources.

threads=4
java_mem=12g

2.1.1 Prepare resources

This workflow relies on different resources which are available online. Reference genomes for Homo Sapiens and PhiX are provided by Illumina (35). The GATK Resource Bundle (36) also contains a human reference genome as well as all other resources needed here. In Listing 2.2, the paths to these resources are defined for the further workflow.

Listing 2.2 Initialization of file paths.

reference_dir="<Your path to reference genomes>"
reference_file_human=$reference_dir"/human_g1k_v37.fasta"
reference_file_phix=$reference_dir"/phix.fa"

gatk_resources_dir="<Your path to GATK resources>"
known_sites_file_gold=$gatk_resources_dir"/Mills_and_1000G_gold_standard.indels.b37.vcf"
known_sites_file_1000g=$gatk_resources_dir"/1000G_phase1.indels.b37.vcf"
known_sites_file_dbsnp=$gatk_resources_dir"/dbsnp_138.b37.vcf"
known_sites_file_hapmap=$gatk_resources_dir"/hapmap_3.3.b37.vcf"
known_sites_file_omni=$gatk_resources_dir"/1000G_omni2.5.b37.vcf"

Usually, the reference genomes come along with index and dictionary files. If these are not provided, they should be created as shown in Listing 2.3.

Listing 2.3 Create index files for the reference genomes.

bwa index -a bwtsw $reference_file_human
bwa index -a bwtsw $reference_file_phix

samtools faidx $reference_file_human

java -jar picard.jar CreateSequenceDictionary \
  REFERENCE=$reference_file_human

The standard formats for representing biological sequence data are fastq and fasta. Raw read data from the Illumina HiSeq X platform normally will be shipped as files in the fastq format. Example data is provided by Illumina (see Note 3). In this workflow, the reads are assumed to be provided in the files <SID>_<FID>_<Lx>_R1.fastq.gz, <SID>_<FID>_<Lx>_R2.fastq.gz per sample <SID> and lane <Lx> on flowcell with barcode <FID>. This is important to correctly group the reads for quality control purposes. If only one file per sample is provided, it has to be split by flowcell ID and lane number (see Listing 3.1 in Note 4).

2.1.2 Define read groups

A read group contains all reads created by a single run of the sequencer. Usually, a read group is defined by the sample, library, flowcell and lane. In a bam file, this information is saved in one tabular delimited line containing several fields. A minimum set of fields for processing with the GATK toolkit are:

Read group identifier (ID): Composed of flowcell barcode and lane number. Used to detect possible batch effects in later steps.
Platform Unit (PU): Composed of flowcell barcode, lane number and sample ID. Takes precedence over ID in GATK.
Sample (SM): Sample ID.
Platform (PL): Platform producing the read. Used in base recalibration.
Library identifier (LB): DNA preparation library. Used to identify PCR amplification duplicates.

The read group fields are defined in Listing 2.4. If library information is not available and only one library is used per sample, the library identifier can be chosen arbitrarily per sample. The flowcell barcode can be extracted from the file name or the read headers in a fastq file.

Listing 2.4 Define read group for library <LIB> of sample <SID> on lane <Lx> of flowcell <FID>.

rg_id=<FID>.<Lx>
rg_sm=<SID>
rg_lb=<LIB>
rg_pu=<FID>.<Lx>.<SID>
rg_pl="ILLUMINA"
read_group="@RG\tID:$rg_id\tSM:$rg_sm\tLB:$rg_lb\tPU:$rg_pu\tPL:$rg_pl"

The read groups have to be set during first creation of a bam file. If the step to remove PhiX contamination is included in the pipeline, the read groups should be set with bwa mem (shown in Listing 2.5). If not, they are set during conversion from fastq to bam with Picard (shown in Listing 2.8).

2.1.3 Remove PhiX contamination

Three steps are necessary to remove PhiX contamination from raw data:

Align reads to PhiX genome (Listing 2.5)
Extract unaligned reads (Listing 2.6)
Revert to unaligned bam (ubam) file (Listing 2.7))

The first step utilises bwa mem to align reads to the PhiX genome. This leaves reads which cannot be aligned to the PhiX genome in an unaligned state, indicated by a flag (Listing 2.5). Reads with this flag set can be extracted using samtools (Listing 2.6). Nevertheless, alignment information is added to these reads by bwa mem. To remove this alignment information and revert to an unaligned bam file, the Picard program RevertSam is used as shown in Listing 2.7.

Listing 2.5 Align raw reads to PhiX genome. Run for each lane <Lx> on flowcell <FID> of each sample <SID> separately to correctly set the read group information. Set -M for Picard compatibility.

bwa mem -t $threads -M -R $read_group $reference_file_phix \
  <SID>_<FID>_<Lx>_R1.fastq.gz <SID>_<FID>_<Lx>_R2.fastq.gz > <SID>_<FID>_<Lx>.phix_aligned.sam

Listing 2.6 Extract reads not aligned to the PhiX genome. Run for each lane <Lx> on flowcell <FID> of each sample <SID>. Set -b for bam output, -f 4 to extract unaligned reads.

samtools view -b -f 4 <SID>_<FID>_<Lx>.phix_aligned.sam > \
  <SID>_<FID>_<Lx>.phix_removed.bam

Listing 2.7 Revert the bam file with reads not aligned to the PhiX genome to an unaligned ubam file. Run for each lane <Lx> on flowcell <FID> of each sample <SID>. Set SANITIZE=true to remove paired reads with missing mates. Clear all alignment tags.

java -Xmx$java_mem -jar picard.jar RevertSam \
  INPUT=<SID>_<FID>_<Lx>.phix_removed.bam \
  OUTPUT=<SID>_<FID>_<Lx>.ubam \
  SANITIZE=true \
  ATTRIBUTE_TO_CLEAR=NM,UQ,PG,MD,MQ,SA,MC,AS,XS,XT

2.1.4 Mark Illumina adapters

Illumina adapter sequences are marked by setting the adapter-trimming tag (XT) in a sam/bam file. There are no such tags in a fastq file, thus the raw reads must be converted first (Listing 2.8). If PhiX sequences have been removed in the previous step, the reads are already converted to the bam format and this conversion is not necessary.

The Picard program MarkIlluminaAdapters is used to mark the Illumina adapter sequences (Listing 2.9). For alignment, described in the next section, the reads must be converted back into the fastq format using the Picard tool SamToFastq (Listing 2.10). In this step either the base qualities of the adapter sequences are modified, the base calls are set to missing, or the base calls are trimmed. It is common to set the base quality to a low value to keep the base call information for all reads.

Listing 2.8 Convert the raw reads from fastq format to unaligned ubam format

java -Xmx$java_mem -jar picard.jar FastqToSam \
  FASTQ=<SID>_<FID>_<Lx>_R1.fastq.gz \
  FASTQ2=<SID>_<FID>_<Lx>_R2.fastq.gz \
  OUTPUT=<SID>_<FID>_<Lx>.ubam \
  READ_GROUP_NAME=$rg_id \
  SAMPLE_NAME=$rg_sm \
  LIBRARY_NAME=$rg_lb \
  PLATFORM_UNIT=$rg_pu \
  PLATFORM=$rg_pl

Listing 2.9 Mark Illumina adapters in an unaligned ubam file.

java -Xmx$java_mem -jar picard.jar MarkIlluminaAdapters \
  INPUT=<SID>_<FID>_<Lx>.ubam \
  OUTPUT=<SID>_<FID>_<Lx>.adaptersmarked.ubam \
  METRICS=ID.adaptersmarked.txt

Listing 2.10 Convert back to a fastq file. Set the base quality of base calls from marked Illumina adapters to 2. Set INTERLEAVE=true for paired end reads.

java -Xmx$java_mem -jar picard.jar SamToFastq \
  INPUT=<SID>_<FID>_<Lx>.adaptersmarked.ubam \
  FASTQ=<SID>_<FID>_<Lx>.adaptersmarked.fastq \
  CLIPPING_ATTRIBUTE=XT \
  CLIPPING_ACTION=2 \
  INTERLEAVE=true

2.1.5 Align to reference genome

The tool bwa mem is utilized to align the reads to the reference genome (Listing 2.11). During conversion to the fastq format, original read information, meta data and the read groups are lost. To preserve this information, the sam file created by the alignment is merged with the unaligned ubam file created previously using the Picard program MergeBamAlignment (Listing 2.12).

Listing 2.11 Align to reference genome. Set -p for paired end reads in one input file, -M for Picard compatibility.

bwa mem -t $threads -p -M $reference_file_human \
  <SID>_<FID>_<Lx>.adaptersmarked.fastq > <SID>_<FID>_<Lx>.adaptersmarked.aligned.sam

Listing 2.12 Merge aligned sam file with unaligned ubam file. Set MAX_INSERTIONS_OR_DELETIONS=-1 to not restrict the number of indels. Set PRIMARY_ALIGNMENT_STRATEGY=MostDistant to select the alignment with the largest insert size, if multiple possible alignments are found. Set ATTRIBUTES_TO_RETAIN=XS to retain alignment scores.

java -Xmx$java_mem -jar picard.jar MergeBamAlignment \
  ALIGNED_BAM=<SID>_<FID>_<Lx>.adaptersmarked.aligned.sam \
  UNMAPPED_BAM=<SID>_<FID>_<Lx>.ubam \
  OUTPUT=<SID>_<FID>_<Lx>.adaptersmarked.aligned.merged.bam \
  REFERENCE_SEQUENCE=$reference_file_human \
  MAX_INSERTIONS_OR_DELETIONS=-1 \
  PRIMARY_ALIGNMENT_STRATEGY=MostDistant \
  ATTRIBUTES_TO_RETAIN=XS

2.1.6 Mark duplicates

Duplicates are marked using the Picard program MarkDuplicates, as shown in Listing 2.13). This step includes merging the reads from one sample, which are run on different lanes and possibly on different flowcells. By default, duplicates are marked by setting the respective flag in the bam file. By setting the option REMOVE_DUPLICATES to true, the duplicates can be entirely removed instead.

Listing 2.13 Mark duplicates for sample <SID>. Use files of all flowcells and lanes of the sample as input. Set CREATE_INDEX=true to create an index for the resulting bam file.

java -Xmx$java_mem -jar picard.jar MarkDuplicates \
  INPUT=<SID>_<FID>_<L1>.adaptersmarked.aligned.merged.bam \
  INPUT=<SID>_<FID>_<L2>.adaptersmarked.aligned.merged.bam \
  ...
  INPUT=<SID>_<FID>_<Ln>.adaptersmarked.aligned.merged.bam \
  OUTPUT=<SID>.adaptersmarked.aligned.merged.deduped.bam \
  METRICS_FILE=<SID>.metrics.txt \
  CREATE_INDEX=true

2.1.7 Realign around insertions or deletions

Realignment is divided into two steps. First, the regions requiring realignment are found with the GATK program RealignerTargetCreator (Listing 2.14). Second, the realignment is performed using the GATK program IndelRealigner (Listing 2.15). In both steps, known indel locations from the GATK resource bundle (36) are used.

Listing 2.14 Find regions to realign for sample <SID>.

java -Xmx$java_mem -jar gatk.jar -T RealignerTargetCreator \
  -nt $threads \
  -R $reference_file_human \
  -I <SID>.adaptersmarked.aligned.merged.deduped.bam \
  -known $known_sites_file_gold \
  -known $known_sites_file_1000g \
  -o <SID>.realignment_targets.list

Listing 2.15 Perform realignment for sample <SID>.

java -Xmx$java_mem -jar gatk.jar -T IndelRealigner \
  -R $reference_file_human \
  -I <SID>.adaptersmarked.aligned.merged.deduped.bam \
  -targetIntervals <SID>.realignment_targets.list \ 
  -known $known_sites_file_gold \
  -known $known_sites_file_1000g \
  -o <SID>.adaptersmarked.aligned.merged.deduped.realigned.bam

2.1.8 Recalibrate base quality scores

The two steps of base quality score recalibration with GATK are a) to build a model of covariation based on the data and a set of known non-polymorphic sites (Listing 2.16) and b) to adjust the base quality scores in the data based on the model (Listing 2.17). The GATK resource bundle (36) is used to account for known variations.

Listing 2.16 Build model of covariation for sample <SID>.

java -Xmx$java_mem -jar gatk.jar -T BaseRecalibrator \
  -nct $threads \
  -R $reference_file_human \
  -I <SID>.adaptersmarked.aligned.merged.deduped.realigned.bam \
  -knownSites $known_sites_file_dbsnp \
  -knownSites $known_sites_file_gold \
  -knownSites $known_sites_file_1000g \
  -o <SID>.recalibration.table

Listing 2.17 Apply the recalibration model to the alignment of sample <SID>.

java -Xmx$java_mem -jar gatk.jar -T PrintReads \
  -nct $threads \
  -R $reference_file_human \
  -I <SID>.adaptersmarked.aligned.merged.deduped.realigned.bam \
  -BQSR <SID>.recalibration.table \
  -o <SID>.adaptersmarked.aligned.merged.deduped.realigned.recalibrated.bam

2.2 Quality control

Several steps are performed to increase data quality. For example, adapter sequences, contaminations and duplicates are marked or removed as outlined in Figure 1.1. To review the success of these steps, several quality controls should be performed. FastQC (37) is a simple and reliable tool for this purpose. It generates a quality control report for each single file. Subheading 2.2.1 further describes how these reports are generated and interpreted. Multiple reports can be aggregated with MultiQC (38).

Generally, FastQC can be applied to any sam/bam or fastq file occurring in this workflow. However, at three milestones in this workflow we advise checking the quality of the data. The first check should be run on the raw reads before any other step to inspect the overall data quality and to decide which following steps are necessary. After alignment and de-duplication, the second check should be performed to review whether adapter sequences, possible contaminations and duplicates have been marked or removed successfully and a reasonable number of reads has been aligned. Finally, the success of the complete preprocessing procedure should be evaluated with a third FastQC run on the final bam file.

A high depth of coverage is required to detect variants reliably and is a major cost factor in sequencing studies. Generally, the average coverage per genome is contracted in advance with the sequencing provider. The actual depth of coverage is thus not only a standard measure for quality control, but it should also be compared with the targeted sequence depth. A naïve estimate is to simply divide the number of sequenced bases by the genome size. This estimate does not, however, account for low quality, duplicates or unaligned reads. More accurate and detailed coverage statistics are generated by the GATK program DepthOfCoverage, as described in Subheading 2.2.2. For accurate measurement, coverage depth should be computed after realignment and recalibration on the final bam file.

The two most challenging quality issues are contaminations with human DNA and sample swaps which are almost impossible to detect reliably without any additional genotype information. Both issues can be addressed with the tool verifyBamID (39), as described in Subection 2.2.3. The tool requires a de-duplicated and recalibrated bam file and therefore should be run at the end of the workflow.

2.2.1 FastQC

One of the most frequently used software tools for quality control of HTS data is FastQC (37). It computes multiple metrics for datasets in fastq and sam/bam formats, among others. The results are presented as html documents, organized in several modules. A green, yellow or red icon indicates no, minor or severe problems in the respective module.

In the Basic Statistics module, the file encoding, number of reads, read length and other basic properties are reported. For the Illumina HiSeq X platform an encoding of “Sanger / Illumina 1.9” and a read length of 150bp is expected.

The simplest and yet most important metric is about base quality scores. This provides a good impression of the overall sequencing quality. As such, the distribution of the quality of each read position, i.e. base, over all reads in the file is computed. The results are presented in the module Per Base Sequence Quality as a sequence of box plots, one for each position in the read. As the quality of a read often degrades towards its end, the box plots will show more variation and the median quality will drop with higher read positions. In addition to the box plots, the mean base quality at each read position is indicated by a blue line. By default, FastQC will issue a minor problem if the lower quartile at any base position is below 10 or if the median for any base position is below 25. A severe problem is issued if the lower quartile at any read position is below 5 or if the median for any read position is below 20 (37).

The second metric dealing with base quality scores is presented in the module Per Sequence Quality Scores. Here, the average quality of each read is displayed as a cumulative plot. By inspecting this metric, it is possible to identify a subset of reads with universally poor quality. This can emerge from poor image quality at the edge of the field of view. However, these should represent only a small percentage of the total sequences, thus FastQC will indicate a minor problem if the most frequently observed mean quality is below 27, which corresponds to a 0.2% error rate. A severe problem is issued if the observed mean quality is below 20, which corresponds to a 1% error proportion.

Another important metric is the distribution of base content at each read position. Under the assumption that the data comprise a random sample from the sequence space, the contribution of each base at each read position should be identical, resulting in straight lines in the module Per Base Sequence Content. In real data, the first few bases often show some deviation from this assumption. This could be due to the adapters ligated to the ssDNA templates during library preparation. A minor problem is indicated by FastQC if the difference between A and T or G and C is greater than 10% at any position. A severe problem is present if this difference is above 20%.

Similarly, there is a metric that measures the GC content across the whole length of each read. The average GC content and the standard deviation of GC content are used to estimate a normal distribution. The actual distribution of GC content should match this reference distribution as close as possible. Deviations from the theoretical distribution, in particular peaks, indicate possibly contaminated libraries that contain sequences with GC contents that differ from the actual sample, and this may include adapters or sequences from other species.

Deviations from these theoretical assumptions can result in overrepresented sequences, indicating that the library is contaminated, for example, with sequences from the PhiX genome. If FastQC detects any such sequences, i.e., sequences that make up more than 0.1% of the total number of sequences, they are listed in the module Overrepresented Sequences together with possible matches in a database of common contaminants, e.g., Illumina adapters. In the case of no match, it is possible to search for the sequences found using BLAST (40) to get an idea of their origin. Note that commonly used adapters for library generation are also shown in the module Adapter Content. In the case of long reads with poor base quality, the overrepresented sequences might not match exactly. Here, it is more robust to look for overrepresented 7-mers, which is done by the metric Overrepresented Kmers.

FastQC calculates further metrics that are described in detail in the manual (37). Readers should note that FastQC only indicates that there might be issues with the data. Similarly, even perfect FastQC results do not guarantee high quality data.

In this workflow, FastQC is run at three milestones as described above. First it is applied to the raw reads (Listing 2.18), second after alignment and de-duplication (Listing 2.19) and third on the final bam file (Listing 2.20).

Listing 2.18 Run FastQC on raw reads provided by the sequencer.

fastqc --threads $threads \
  --outdir <output_dir> \
  --format fastq \
  <SID>_<FID>_<Lx>_R1.fastq.gz <SID>_<FID>_<Lx>_R2.fastq.gz

Listing 2.19 Run FastQC on aligned reads without duplicates.

fastqc --threads $threads \
  --outdir <output_dir> \
  --format bam_mapped \
  <SID>.adaptersmarked.aligned.merged.deduped.bam

Listing 2.20 Run FastQC on final bam file.

fastqc --threads $threads \
  --outdir <output_dir> \
  --format bam_mapped \
  <SID>.adaptersmarked.aligned.merged.deduped.realigned.recalibrated.bam

2.2.2 Analyze coverage depth

Coverage depth can be computed using the GATK program DepthOfCoverage. A first step is to compute the average coverage depth across the whole genome for every sample (Listing 2.21). To generate more detailed coverage statistics, coverage can be computed for each base by removing the flag --omitDepthOutputAtEachBase. Be aware that this is computationally intensive and generates large output files. To compute the coverage depth for a list of genes, the flag --omitIntervalStatistics has to be replaced by --calculateCoverageOverGenes <gene.list>. Here, <gene.list> is a Reference Sequence (RefSeq) (41) file.

Listing 2.21 Compute coverage depth statistics. To generate more detailed coverage statistics, remove the --omitDepthOutputAtEachBase or --omitIntervalStatistics flag.

java -Xmx$java_mem -jar gatk.jar -T DepthOfCoverage \
  -nt $threads \
  -R $reference_file_human \
  -I <SID>.adaptersmarked.aligned.merged.deduped.realigned.recalibrated.bam \
  -o <SID> \
  --omitIntervalStatistics \
  --omitDepthOutputAtEachBase

DepthOfCoverage generates several output files. The average coverage depth across the genome is written to the file <SID>.sample_summary. A coverage histogram aggregated over all bases is written to <SID>.sample_statistics. If intervals or gene statistics are computed, summaries and histograms are additionally generated for these.

In the case of unexpected low coverage depth for a sample, the number of reads remaining after each preprocessing step can be compared. These are included in the reports generated by FastQC. If the number of reads drops considerably in one step, this could indicate a problem with the data or the preprocessing step.

2.2.3 Detect sample swaps and contamination

Detecting sample swaps and human DNA contamination is a critical quality control step in any HTS study. Ideally, additional genotype information from other sources, such as microarrays, is available. In this case, the tool verifyBamID (39) can be used to test whether the sequencing data for a sample matches the additional genotype information (Listing 2.22). If no additional genotype information is available, verifyBamID can still detect contamination due to mixtures of samples.

By default, verifyBamID only compares samples with matching IDs. In this mode, it checks whether the sequencing information matches the genotyping information for a sample. Alternatively, the option --best can be used to find the best-matching genotyping sample for the given sequencing sample. Since the --best option is computationally intensive, an alternative approach for large sample sizes is to run verifyBamID in two rounds. In the first round, it is run without the --best option to identify non-matching samples. The best matching sample from the genotyping data can then be found in a second run on the identified samples only, including the --best option.

Listing 2.22 Detect sample swaps and contamination. By using the option --best, the best matching sample from the additional genotype data is identified. If no additional genotype information is available, remove the --vcf and --best options.

verifyBamID --verbose \
  --vcf additional_genotypes.vcf \
  --bam <SID>.adaptersmarked.aligned.merged.deduped.realigned.recalibrated.bam \
  --out <SID> \
  --best

The most important output files of verifyBamID are <SID>.selfSM and <SID>.bestSM, where the latter is only available if the --best option is set. Additionally, a file containing coverage statistics is included, and all files are also available per read group. In <SID>.selfSM, the column FREEMIX is an estimate of contamination based on the sequencing data only, approximately measuring the proportion of bases differing from the reference. Accordingly, the column CHIPMIX estimates concordance based on sequencing and genotyping data.

The recommendation by the authors of verifyBamID is to examine samples with FREEMIX > 0.02 or CHIPMIX > 0.02 in detail (42). A large FREEMIX value indicates contamination, while a FREEMIX value of approximately 0 and CHIPMIX of 1 in <SID>.selfSM indicates that a sample is possibly swapped, but not contaminated. If the columns SEQ_ID and CHIP_ID do not match in <SID>.bestSM and the CHIPMIX value in this file is approximately 0, this indicates these two samples might be swapped.

3 Notes

Software versions. This workflow is based on
- bwa 0.7.15
- Picard 2.8.1
- GATK 3.7
- samtools 1.3.1
Parallel computing. In this workflow, the preprocessing for single sample files is described. In addition to using the multithreading capabilities of the tools, several samples can be processed in parallel. On a local machine this can be done, for example, by simple shell scripts. For large sample sizes, we advise using a high performance computing (HPC) cluster or a cloud based computing environment. However, the latter could be restricted by data protection requirements or file transfer issues.
Example data. Illumina provides example sequencing data from the HiSeq X platform on BaseSpace (43). A user account is required to access the data. Currently, all available data is based on replicate sequencing runs of the same sample.
Split raw reads by flowcell and lane.

Listing 3.1 Split fastq file of sample <SID> and paired end reads <Rx> by flowcell ID and lane number.

awk -F: '{flowcell=$3; lane=$4 ; print > "<SID>_"flowcell"_L"lane"_<Rx>.fastq" ; \
  for (i=1; i <= 3; i++) {getline ; print > "<SID>_"flowcell"_L"lane"_<Rx>.fastq"}}' \
  < <SID>_<Rx>.fastq

Acknowledgments

The work presented in this chapter was supported by the German Centre for Cardiovascular Research (DZHK; Deutsches Zentrum für Herz-Kreislauf-Forschung) and the DZHK OMICs Resource Project (grant: 81X1700104).

References

1. Bentley DR et al (2008) Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456:53–59. doi: 10.1038/nature07517

2. McKernan KJ et al (2009) Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation. Genome Res 19:1527–1541. doi: 10.1101/gr.091868.109

3. Margulies M et al (2005) Genome sequencing in microfabricated high-density picolitre reactors. Nature 437:376–380. doi: 10.1038/nature03959

4. Metzker ML (2010) Sequencing technologies — the next generation. Nat Rev Genet 11:31–46. doi: 10.1038/nrg2626

5. Liu L et al (2012) Comparison of next-generation sequencing systems. J Biomed Biotechnol 2012:1–11. doi: 10.1155/2012/251364

6. Van Dijk EL et al (2014) Ten years of next-generation sequencing technology. Trends Genet 30:1–9. doi: 10.1016/j.tig.2014.07.001

7. Illumina Inc. (2016) An introduction to next-generation sequencing technology. http://www.illumina.com/technology/next-generation-sequencing.html. Accessed 16 Jan 2017

8. Nakazato T, Ohta T, Bono H (2013) Experimental design-based functional mining and characterization of high-throughput sequencing data in the sequence read archive. PLoS One 8:e77910. doi: 10.1371/journal.pone.0077910

9. Illumina Inc. (2016) Indexed sequencing guide. http://support.illumina.com/content/dam/illumina-support/documents/documentation/system_documentation/miseq/indexed-sequencing-overview-guide-15057455-02.pdf. Accessed 16 Jan 2017

10. Illumina Inc. (2015) HiSeq X series of sequencing systems. http://www.illumina.com/documents/products/datasheets/datasheet-hiseq-x-ten.pdf. Accessed 16 Jan 2017

11. DePristo MA et al (2011) A framework for variation discovery and genotyping using next-generation dna sequencing data. Nat Genet 43:491–498. doi: 10.1038/ng.806

12. Van der Auwera GA et al (2013) From FastQ data to high-confidence variant calls: The genome analysis toolkit best practices pipeline. Curr Protoc Bioinforma 11:11.10.1–11.10.33. doi: 10.1002/0471250953.bi1110s43

13. Illumina Inc. (2012) Using a PhiX control for HiSeq sequencing runs. http://support.illumina.com/content/dam/illumina-marketing/documents/products/technotes/hiseq-phix-control-v3-technical-note.pdf. Accessed 16 Jan 2017

14. Mukherjee S et al (2015) Large-scale contamination of microbial isolate genomes by Illumina PhiX control. Stand Genomic Sci 10:18. doi: 10.1186/1944-3277-10-18

15. Li H (2013) Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997 http://arxiv.org/abs/1303.3997.

16. Burrows M, Wheeler DJ. (1994) A block-sorting lossless data compression algorithm. http://www.hpl.hp.com/techreports/Compaq-DEC/SRC-RR-124.html. Accessed 31 Jan 2017

17. Ebbert MTW et al (2016) Evaluating the necessity of PCR duplicate removal from next-generation sequencing data and a comparison of approaches. BMC Bioinformatics 17(Suppl. 7):239.

18. Dozmorov MG et al (2015) Detrimental effects of duplicate reads and low complexity regions on rna-and chip-seq data. BMC Bioinformatics 16(Suppl. 13):S10. doi: 10.1186/1471-2105-16-S13-S10

19. The 1000 Genomes Consortium (2015) A global reference for human genetic variation. Nature 526:68–74. doi: 10.1038/nature15393

20. Lee SH. Changing workflows around calling SNPs and indels. http://gatkforums.broadinstitute.org/gatk/discussion/7847. Accessed 11 Jan 2017

21. Van der Auwera G. Version highlights for GATK version 3.6. https://software.broadinstitute.org/gatk/blog?id=7712. Accessed 11 Jan 2017

22. Nielsen R et al (2011) Genotype and snp calling from next-generation sequencing data. Nat Rev Genet 12:443–451. doi: 10.1038/nrg2986

23. Ewing B, Green P (1998) Base-calling of automated sequencer traces using Phred. II. Error probabilities. Genome Res 8:186–194. doi: 10.1101/gr.8.3.186

24. Li H. Burrow-wheeler aligner for pairwise alignment between DNA sequences. https://github.com/lh3/bwa. Accessed 12 Jan 2017

25. McKenna A et al (2010) The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20:1297–1303. doi: 10.1101/gr.107524.110

26. Broad Institute. Genome analysis toolkit. https://software.broadinstitute.org/gatk/. Accessed 11 Jan 2017

27. Li H et al (2009) The sequence alignment/map format and samtools. Bioinformatics 25:2078–2079. doi: 10.1093/bioinformatics/btp352

28. Andrews S. Tools for manipulating next-generation sequencing data. https://github.com/samtools/samtools. Accessed 12 Jan 2017

29. Broad Institute. Picard. https://broadinstitute.github.io/picard/. Accessed 4 Jan 2017

30. Dalca AV, Brudno M (2010) Genome variation discovery with high-throughput sequencing data. Brief Bioinform 11:3–14. doi: 10.1093/bib/bbp058

31. Magi A et al (2010) Bioinformatics for next generation sequencing data. Genes (Basel) 1:294–307. doi: 10.3390/genes1020294

32. Altmann A et al (2012) A beginners guide to SNP calling from high-throughput DNA-sequencing data. Hum Genet 131:1541–1554. doi: 10.1007/s00439-012-1213-z

33. Fonseca NA et al (2012) Tools for mapping high-throughput sequencing data. Bioinformatics 28:3169–3177. doi: 10.1093/bioinformatics/bts605

34. Bao R et al (2014) Review of current methods, applications, and data management for the bioinformatics analysis of whole exome sequencing. Cancer Inform 13(Suppl. 2):67–82. doi: 10.4137/CIN.S13779

35. Illumina Inc. iGenomes. http://support.illumina.com/sequencing/sequencing_software/igenome.html. Accessed 11 Jan 2017

36. Van der Auwera G. GATK Resource Bundle. http://gatkforums.broadinstitute.org/gatk/discussion/1213/whats-in-the-resource-bundle-and-how-can-i-get-it. Accessed 11 Jan 2017

37. Andrews S. FastQC. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/. Accessed 19 Dec 2016

38. Ewels P et al (2016) MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics 32:3047–3048.

39. Jun G et al (2012) Detecting and estimating contamination of human dna samples in sequencing and array-based genotype data. Am J Hum Genet 91:839–848.

40. Boratyn GM et al (2013) BLAST: A more efficient report with usability improvements. Nucleic Acids Res 41:W29–W33. doi: 10.1093/nar/gkt282

41. Pruitt KD et al (2014) RefSeq: An update on mammalian reference sequences. Nucleic Acids Res 42:D756–D763. doi: 10.1093/nar/gkt1114

42. Kang HM. Genome analysis wiki. http://genome.sph.umich.edu/wiki/VerifyBamID. Accessed 12 Jan 2017

43. Illumina Inc. BaseSpace. https://basespace.illumina.com/. Accessed 5 Jan 2017

Preprocessing and quality control for whole genome sequences from the Illumina HiSeq X platform