Marvin N. Wright*, Damian Gola*, Andreas Ziegler

Institut für Medizinische Biometrie und Statistik, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Universität zu Lübeck, Lübeck, Germany

* Equal contribution

The final publication is available at via


The advancement of high throughput sequencing technologies enables sequencing of human genomes at steadily decreasing costs and increasing quality. Before variants can be analyzed, e.g., in association studies, the raw data obtained from the sequencer need to be preprocessed. These preprocessing steps include the removal of adapters, duplicates and contamination, alignment to a reference genome and post-processing of the alignment. All later steps, such as variant discovery, rely on high data quality and proper preprocessing, which makes quality control essential. This chapter presents a workflow for preprocessing Illumina HiSeq X sequencing data. Code snippets illustrate all necessary steps, along with a brief description of the tools and underlying methods.

Keywords: Whole genome sequencing, Sequencing, High throughput sequencing, Illumina, HiSeq X, HTS, NGS, Quality control, Preprocessing, Alignment, Mapping

1 Introduction

High throughput sequencing (HTS) has expanded rapidly during the last decade, and great progress has been made in terms of sequencing speed, read length and reduction in per-base cost. The main technology competitors in this field are Illumina (1), Life Technologies (2) and Roche (3). Detailed descriptions of their HTS platforms are given by Metzker (4) and Liu et al. (5); the strengths and weaknesses of the platforms are reviewed by Van Dijk et al. (6).

The main principle of HTS technology is sequencing by synthesis, which is the same for most platforms. The first step is to prepare samples for sequencing by creating libraries. This is done by fragmenting the DNA into small single-stranded fragments, called ssDNA templates, and ligating adapters to the 5’ and 3’ ends. Second, libraries are arranged into clusters on a solid surface. Third, the complementary strand is synthesized base by base. Each incorporated base emits a fluorescence signal, and the synthesis process is captured in a series of images. Base-calling algorithms then infer the nucleotide sequence of each cluster from the fluorescence intensity data. This yields so-called reads, which can be processed by downstream algorithms. Finally, a measure of uncertainty or quality is assigned to each called base.
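The per-base quality measure can be made concrete with a small sketch. It assumes the widely used Phred scale, on which a score Q corresponds to an error probability p = 10^(-Q/10), and an ASCII encoding with offset 33. Both are assumptions here and should be checked against the format of the data at hand; the function names are purely illustrative.

```python
from typing import List


def decode_quality_string(qual: str, offset: int = 33) -> List[int]:
    """Decode an ASCII-encoded quality string into integer scores.

    The offset of 33 is an assumption (a common encoding); other
    encodings use a different offset.
    """
    return [ord(char) - offset for char in qual]


def phred_to_error_prob(q: int) -> float:
    """Convert a Phred-scaled score Q to an error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


# Hypothetical quality string for a five-base read.
scores = decode_quality_string("II5+!")
probs = [phred_to_error_prob(q) for q in scores]

for q, p in zip(scores, probs):
    print(f"Q={q:2d}  error probability={p:.4g}")
```

On this scale, a score of 20 corresponds to an expected error rate of 1 in 100 base calls and a score of 40 to 1 in 10,000.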

Further refinements of the basic HTS technology, such as paired end sequencing and multiplexing, can increase precision and throughput. Paired end sequencing enables sequencing from both ends of the ssDNA templates, resulting in paired end reads (7). By this means, twice the number of reads is produced, which enables more precise alignment to the reference genome and the detection of insertions and deletions (8). Consequently, paired end sequencing is commonly used. Multiplexing enables sequencing libraries from multiple samples simultaneously by adding a barcode sequence to the adapters (7, 9). The reads thus obtained can be sorted by these barcodes to obtain the read data for each sample individually.
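The sorting of reads by barcode can be sketched as follows. This is a toy illustration only: it assumes that the barcode is the first few bases of each read and must match exactly, whereas in practice barcodes are typically read separately and demultiplexing is handled by dedicated software. The barcodes, sample names, and reads are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def demultiplex(reads: List[str],
                barcode_to_sample: Dict[str, str],
                barcode_len: int) -> Dict[str, List[str]]:
    """Group reads by their leading barcode (exact match only).

    Reads with an unknown barcode are collected under 'undetermined'.
    The barcode itself is removed from the returned sequence.
    """
    by_sample = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        sample = barcode_to_sample.get(barcode, "undetermined")
        by_sample[sample].append(insert)
    return dict(by_sample)


# Hypothetical barcodes and reads.
barcodes = {"ACGT": "sample_A", "TTAG": "sample_B"}
reads = ["ACGTGGCATT", "TTAGCCGTAA", "ACGTTTTGCA", "GGGGACGTAC"]

result = demultiplex(reads, barcodes, barcode_len=4)
for sample, seqs in result.items():
    print(sample, seqs)
```

A production tool would additionally tolerate sequencing errors in the barcode, for example by allowing a limited number of mismatches.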

The major drawback of current HTS technology is that all information about the genomic region or position from which a read originates is lost. Therefore, for each read, the most likely position on a reference genome must be found. This process is called alignment, which is the central element of HTS data preprocessing.
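The idea of finding the most likely position of a read can be illustrated with a deliberately naive sketch that scans every position of a short reference and keeps the one with the fewest mismatches. This is not how production aligners work: they use indexed data structures to handle genome-sized references and also allow for insertions and deletions. The sequences and function names below are hypothetical.

```python
from typing import Tuple


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_position(read: str, reference: str) -> Tuple[int, int]:
    """Return (position, mismatches) of the best ungapped placement.

    Scans all windows of the reference; ties are resolved in favor of
    the leftmost position. Returns (-1, len(read) + 1) if the read is
    longer than the reference.
    """
    best_pos, best_mm = -1, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        mm = hamming(read, reference[pos:pos + len(read)])
        if mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos, best_mm


# Toy reference and a read that differs from it by one base.
reference = "ACGTACGGTCAGTTACG"
read = "GGTCTG"

position, mismatches = best_position(read, reference)
print(f"best position: {position}, mismatches: {mismatches}")
```

The exhaustive scan costs time proportional to the reference length for every read, which is why it serves only as an illustration of the problem.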

Illumina was the first company to break the $1000 barrier for human whole genome sequencing with the HiSeq X platform in 2014, thus achieving the goal set by the National Human Genome Research Institute. This was realized by raising the throughput using patterned flow cells as the solid surface, which provide an extremely high cluster density, high occupancy and monoclonality within each well, together with a fast camera system and new polymerase enzymes that enable faster reactions and thus shorter runtimes (10).

This chapter describes the entire pipeline from raw sequence data to data ready for variant calling for Illumina’s HiSeq X platform. At the core, the raw reads produced by this sequencing platform need to be aligned to the human reference genome. In addition to the alignment, several quality control and processing steps utilizing different tools are necessary. The workflow presented in this chapter is illustrated in Figure 1.1. Generally, it follows the workflow presented by DePristo et al. (11) and Van der Auwera et al. (12). In the remainder of the introduction, the steps highlighted in Figure 1.1 and the underlying rationales are outlined. In Section 2, the tools utilized are described in detail, and code snippets are given to illustrate their application.