9th Discussion - 28 October 2010
Brief Description and Continuing Discussion (please edit in line):
Assessing Sequence Data Quality (led by Dawei Lin and Simon Andrews)
Before analysing data from high-throughput sequencers, it is important to check the quality of the raw data. Knowing about potential problems in your data lets you either correct for them before any analysis, or take them into account when interpreting the results you later derive.
Knowing which measures to take from your data, and how to interpret them, can help enormously. As core facilities we are exposed to a wider variety of data than any individual research group, so we may have a better feel for how to spot potential problems.
In this session we aim to look at how people are currently assessing their data quality. We will look at:
- The types of errors which occur
- The best metrics to calculate
- The cutoffs to identify poor data
- False positives coming from different types of experiment
- Which software packages are being used
Preliminary Information
To get the discussion started, the information below sets out some existing tests and software packages which are already in use, so that we have a basis to start from.
What types of problems occur
- Poor sequence quality
  - Quality which drops off evenly over the course of a run
  - Poor quality which affects only a subset of sequences
  - Poor quality which suddenly affects a run
- Contamination
  - Primers / Adapters
  - Repeats / Low Complexity
  - Other samples
- Base call bias
  - Bias affecting the whole run
  - Bias affecting certain base positions
- Switched samples
What metrics could we calculate
- Quality plots
  - Per base
  - Per sequence
  - Intensity / Focus measures
- Composition plots
  - Per base composition
  - GC content
  - GC profile
- Contaminant identification
  - Overrepresented sequences
  - Duplicate levels
- Mapping quality
- Genomic distribution
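As a starting point for discussion, a few of the metrics above can be sketched in a handful of lines of Python. This is only an illustrative sketch, not the method used by any particular package. It assumes simple four-line FASTQ records and Phred+33 quality encoding; some Illumina pipelines of this era use an offset of 64, so the offset is left as a parameter. All function names and the example reads are made up for illustration.

```python
from collections import Counter
from statistics import mean


def parse_fastq(lines):
    """Yield (sequence, quality_string) pairs from FASTQ lines.

    Assumes exactly four lines per record (no wrapped sequences).
    """
    lines = [ln.rstrip("\n") for ln in lines]
    for i in range(0, len(lines) - 3, 4):
        yield lines[i + 1], lines[i + 3]


def per_base_mean_quality(records, offset=33):
    """Mean Phred score at each read position (offset is an assumption)."""
    by_position = {}
    for _, qual in records:
        for pos, ch in enumerate(qual):
            by_position.setdefault(pos, []).append(ord(ch) - offset)
    return [mean(by_position[p]) for p in sorted(by_position)]


def gc_fraction(seq):
    """Fraction of called bases (A, C, G, T) in a read that are G or C."""
    seq = seq.upper()
    called = sum(seq.count(b) for b in "ACGT")
    return (seq.count("G") + seq.count("C")) / called if called else 0.0


def duplicate_fraction(seqs):
    """Fraction of reads that exactly repeat a sequence already seen."""
    if not seqs:
        return 0.0
    return 1 - len(Counter(seqs)) / len(seqs)


# Made-up example data: three short reads, two of them identical.
example = [
    "@r1", "ACGT", "+", "IIII",
    "@r2", "ACGT", "+", "II5!",
    "@r3", "GGCC", "+", "I5!!",
]
records = list(parse_fastq(example))
quality_profile = per_base_mean_quality(records)
gc_values = [gc_fraction(seq) for seq, _ in records]
dup_level = duplicate_fraction([seq for seq, _ in records])
```

In the example the mean quality falls from position to position, which is the "quality drops off over the course of a run" pattern listed earlier. Real data would need a streaming parser rather than holding every score in memory.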
What are sources of false positives
- Quality plots
  - Biased sequence in Illumina libraries
- Composition plots
  - Bisulphite conversion
  - Genomes with extreme GC content
- Per base composition
  - Bar codes
  - Restriction sites
  - ChIP-Seq
  - RNA-Seq primers
- Contaminants
  - Highly enriched sequences
  - Repeats
- Genomic distribution
  - Sex chromosomes
  - Mitochondria
  - Biased ChIP
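To illustrate one of these false positives, the sketch below shows how bar codes alone can trigger a per base composition warning. The reads, the 3-base bar code "ACG", the function names and the 0.9 threshold are all hypothetical values chosen for illustration, not cutoffs taken from any existing tool.

```python
from collections import Counter


def per_base_composition(seqs):
    """For each read position, return the fraction of A, C, G and T seen there."""
    length = max(len(s) for s in seqs)
    profile = []
    for pos in range(length):
        counts = Counter(s[pos] for s in seqs if len(s) > pos)
        total = sum(counts.values())
        profile.append({base: counts[base] / total for base in "ACGT"})
    return profile


def biased_positions(profile, max_fraction=0.9):
    """Positions where a single base reaches max_fraction (arbitrary cutoff)."""
    return [
        pos for pos, fractions in enumerate(profile)
        if max(fractions.values()) >= max_fraction
    ]


# Hypothetical reads: the same made-up bar code followed by varied inserts.
reads = ["ACG" + insert for insert in ("TTGA", "GACT", "CAGT", "AGTC")]
profile = per_base_composition(reads)
flagged = biased_positions(profile)
```

Here only the first three positions are flagged, and they are exactly the bar code bases. The library itself is fine, so a composition check needs to know about the experimental design before a warning like this is treated as a real problem.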
Which software packages are available
Transcript of Minutes
coming soon