Brief Description and Continuing Discussion:

Assessing Sequence Data Quality (led by Dawei Lin and Simon Andrews)

Before committing significant time and resources to interpret the data pouring from high throughput sequencers it is important to check the quality and usability of the raw data. Knowing about potential problems in your data can help you to either correct for these before going on to do further analysis, or can take them into account when interpreting results you later derive.

As core facilities we are exposed to a wider variety of data than any individual research group and may have a better feel for how to spot potential problems, hence we are in a favorable position to develop best practices and measures to ensure the analysis and the results are meaningful.

In this session we will aim to look at the way people are currently assessing their data quality. We will look at:

The type of problems which occur during sequencing experiments
The metrics for quality control
The cutoffs to identify or filter poor data
False positives sources
Which software packages are being used

Preliminary Information

To try to get the discussion started the information below sets out some existing tests and software packages which are already in use so we have a basis to start from.

The type of problems which occur during sequencing experiments

Poor and failed sequencing experiment
- Quality which drops off evenly over the course of a run
- Poor quality which affects only a subset of sequences
- Poor quality which affects a run
- Overplated samples
- Underplated samples
Sample contamination
- Primers / Adapters
- Repeats / Low Complexity
- Ribosomal RNA
- Other samples
Base call bias
- Bias affecting the whole run
- Bias affecting certain base positions
Switched samples or wrong labeling

The metrics for quality control

Quality plots
- Per base
- Per sequence
- Intensity / Focus measures
Composition plots
- Per base composition
- GC content
- GC profile
Contaminant identification
- Overrepresented sequences
- Overrepresented k-mers
- Duplicate levels
Mapping quality
- Overall error rate based on Phi-X alignment
- Genomic distribution
- Reference genome selection
  - Version
  - EST/Unigene

False positives sources

Quality plots
- Biased sequence in illumina libraries
- Biased in different sample preparation protocols
Composition plots
- Bisulphite conversion
- Genomes with extreme GC content
Per base composition
- Bar codes
- Restriction sites
- ChIP-Seq
- RNA-Seq primers
Contaminants
- Highly enriched sequences
- Repeats
Genomic distribution
- haplotypes
- Sex chromosomes
- Mitochondria
- Biased ChIP

Which software packages are available

Examples

Look away now if you are of a nervous disposition. Below are a few examples of strange things which have been seen in real runs and a description of the cause, when known. It's worth noting that in nearly all cases usable data was able to be salvaged from these runs, after varying degrees of filtering, so bad QC is not a death sentence, just an invitation to work harder.

Poor Quality

This plot shows a common problem, especially with longer runs, which is that as the run progresses the quality of the calls gradually drops. Newer chemistry has improved this somewhat, but people are now doing even longer reads than ever. We'd normally say that when the median quality is dropping to a phred score of ~20 that we'd consider trimming the sequence at that point, since sequence with lower quality than that is likely to cause more problems than it fixes.

This plot shows a different kind of quality problem where the run suddenly switches from having very good quality to very poor quality. In this case there was a leak in the machine which covered the flowcell in salt about halfway through the run. Trimming the sequence actually meant that we recovered virtually a full run of usable data (since this was a ChIP-Seq run where the sequence was only used for mapping).

Biased Sequence

This plot actually shows two separate sources of bias. The first few bases show an extreme bias caused by the library having a restriction site on the front. The rest of the library shows a lower (but still very high) level of bias which comes from a single sequence which makes up ~20% of the library.

A lot of groups have found that RNA-Seq libraries created with Illumina kits show this odd bias in the first ~10 bases of the run. This seems to be due to the 'random' primers which are used in the library generation, which may not be quite as random as you'd hope. We've not removed this biased sequence and the results seem to be OK.

This shows an unusual experiment where the library was bisulphite converted, so the separation of G from C and A from T is expected. However you can see that as the run progresses there is an overall shift in the sequence composition. This correlated with a loss of sequencing quality so the suspicion is that miscalls get made with a more even sequence bias than bisulphite converted libraries. Trimming the sequences fixed this problem, but if this hadn't been done it would have had a dramatic effect on the methylation calls which were made.

Duplicated Sequence

This plot shows the duplication levels for sequences in a library which had been over amplified. In a normal, diverse library you should see most sequences occurring only once. In this library you can see that the unique sequences make up only a small proportion of the library, with low level duplication accounting for most of the rest. Removing all duplicated reads from this library meant that usable data was able to be salvaged.

Wrong library

The above plot shows a GC profile of all of the sequences in a library. In this case the library was supposed to be human, which would produce a library with a median GC content of ~42%, however you can see that the median GC content for this sample is 44-45%. This may seem like a small shift, but GC profiles are remarkably stable and even a minor deviation indicates that there is a problem in the library. In this case the sequence actually came from a bacterium.

ribosomal RNA (rRNA) contamination

The Problem: Only 10.81% of reads mapped to the transcriptome. This is in contrast to 63.44% of reads from a different tissue sample from the same non-model organism, mapping to the same transcriptome.

Sequencing experiment metrics

The run produced 27,032,861 SE40 sequences. % PF Clusters was 72.37 +/- 1.85. Quality trimming did not indicate any problem: 98.74% of the reads were of good quality (at least 20 bases of Sanger q20), and 25,342,800 reads (93.75%) were full length (40 bases).

Read Mapping: The reads were mapped to a transcriptome set of cDNA sequences (fasta file) provided by the researcher. Mapping was performed using bwa aln and samse (default parameters) and pileup generated using samtools (default parameters).

Steps Taken to Understand/Resolve the Problem:
- The bwa and samtools mapping and counting steps were re-run to ensure that no computational error had taken place. This produced the same results.
- The reads were then mapped to the draft genome of this non-model organism. 52.82% of the reads mapped to the genome. This indicated that a large percentage of the reads were from this organism (rather than contamination from another organism), but that the sequences were not represented in the transcriptome (cDNA fasta file)
- The reads were then assembled. The assembly produced 147 contigs with coverage data. The consensus sequences for the 10 contigs with highest coverage were blasted (blastn) against the NCBI nt database. All hits were to ribosomal RNA.

Experimental troubles before sequencing: This problem involved one lane of Illumina GAII sequence. Per the DNA Sequencing Core (Charlie Nicolet), the original total RNA sample was of very low concentration (very dilute). The first attempt at library construction failed, and the second attempt succeeded, but was difficult to complete.

Illimina adaptor contamination

Sometime the sequences from Illumina contain partial Illumina adapters' sequences. If not trimmed, they may affect significantly the assembly and mapping outcomes .

Adapter Sequences: 5' AATGATACGGCGACCACCGA GATCTACACTCTTTCCCTAC ACGACGCTCTTCCGATCT (N) AGATCGGAAGAGCGGTTCAG CAGGAATGCCGAGACCGATC TCGTATGCCGTCTTCTGCTT G 3' 3' TTACTATGCCGCTGGTGGCT CTAGATGTGAGAAAGGGATG TGCTGCGAGAAGGCTAGA (N) TCTAGCCTTCTCGCCAAGTC GTCCTTACGGCTCTGGCTAG AGCATACGGCAGAAGACGAA C 5'

GGCxG motif

GGCxG or even GGC motif in the 5' to 3' direction of reads. Simply put: at some places, base calling after a GGCxG or GGC motif is particularly error prone, the number of reads without errors declines markedly. Repeated GGC motifs worsen the situation.

Sequencing bias towards lower GC

Transcript of Minutes

coming soon

9th Discussion-28 October 2010

Contents