Brief Description and Continuing Discussion (please edit in line):

Assessing Sequence Data Quality (led by Dawei Lin and Simon Andrews)

Before analysing data from high-throughput sequencers, it is important to check the quality of the raw data. Knowing about potential problems in your data lets you either correct for them before analysis or take them into account when interpreting the results you later derive.

Knowing which measures to take from your data, and how to interpret them, can help enormously. As core facilities we are exposed to a wider variety of data than any individual research group, and so may have a better feel for how to spot potential problems.

In this session we aim to look at how people are currently assessing their data quality. We will look at:

The types of error that occur
The best metrics to calculate
The cutoffs to identify poor data
False positives coming from different types of experiment
Which software packages are being used

Preliminary Information

To get the discussion started, the information below sets out some existing tests and software packages that are already in use, giving us a basis to start from.

What types of problem occur

Poor sequence quality
- Quality which drops off evenly over the course of a run
- Poor quality which affects only a subset of sequences
- Poor quality which suddenly affects a run
Contamination
- Primers / Adapters
- Repeats / Low Complexity
- Other samples
Base call bias
- Bias affecting the whole run
- Bias affecting certain base positions
Switched samples
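
As a starting point for discussion, primer / adapter contamination from the list above can be estimated very roughly by counting reads that contain the start of a known adapter. This is only a minimal sketch: the function name, the `min_overlap` seed length, the example reads and the adapter string are all illustrative assumptions, and exact substring matching will miss adapters containing sequencing errors or truncated at the read end.

```python
def adapter_fraction(reads, adapter, min_overlap=8):
    """Return the fraction of reads containing the first `min_overlap`
    bases of `adapter` (exact match only; an illustrative simplification)."""
    if not reads:
        return 0.0
    seed = adapter[:min_overlap]
    hits = sum(1 for read in reads if seed in read)
    return hits / len(reads)


# Made-up reads; the adapter string is an example value, not a recommendation.
example_reads = [
    "ACGTACGTAGATCGGAAGAGC",
    "TTTTGGGGCCCCAAAA",
    "GGAGATCGGAAGAGCACAC",
    "ACGTACGTACGT",
]
frac = adapter_fraction(example_reads, "AGATCGGAAGAGC")
print(f"Reads with adapter seed: {frac:.0%}")
```

What fraction should count as a problem is exactly the kind of cutoff this session is meant to discuss, so none is suggested here.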

What metrics could we calculate

Quality plots
- Per base
- Per sequence
- Intensity / Focus measures
Composition plots
- Per base composition
- GC content
- GC profile
Contaminant identification
- Overrepresented sequences
- Duplicate levels
Mapping quality
- Genomic distribution
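
To make two of the metrics above concrete, here is a minimal sketch of a per base quality profile and a per sequence GC content calculation. It assumes quality strings encoded with an ASCII offset of 33; some pipelines use an offset of 64 instead, so `offset` is a parameter you would need to set for your own data. The function names are hypothetical and not taken from any existing package.

```python
def phred_scores(quality_string, offset=33):
    """Decode an ASCII-encoded quality string into integer Phred scores.
    The offset of 33 is an assumption; adjust it to match your pipeline."""
    return [ord(char) - offset for char in quality_string]


def per_base_mean_quality(quality_strings, offset=33):
    """Mean Phred score at each read position, tolerating reads of
    different lengths (shorter reads do not contribute to later positions)."""
    totals, counts = [], []
    for quality_string in quality_strings:
        for pos, score in enumerate(phred_scores(quality_string, offset)):
            if pos == len(totals):
                totals.append(0)
                counts.append(0)
            totals[pos] += score
            counts[pos] += 1
    return [total / count for total, count in zip(totals, counts)]


def gc_content(sequence):
    """Fraction of G and C bases in a single sequence."""
    if not sequence:
        return 0.0
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


# Two made-up quality strings: 'I' decodes to 40 and '5' to 20 at offset 33.
means = per_base_mean_quality(["IIII", "II55"])
print(means)
print(gc_content("ACGT"))
```

A profile whose mean drops steadily along the read corresponds to the "drops off evenly" case listed earlier; telling that apart from a sudden failure would need the per-position distribution as well as the mean.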

What are sources of false positives

Quality plots
- Biased sequence in Illumina libraries
Composition plots
- Bisulphite conversion
- Genomes with extreme GC content
Per base composition
- Bar codes
- Restriction sites
- ChIP-Seq
- RNA-Seq primers
Contaminants
- Highly enriched sequences
- Repeats
Genomic distribution
- Sex chromosomes
- Mitochondria
- Biased ChIP
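
The bar code case above is easy to illustrate: if every read starts with the same bar code, a per base composition plot will show extreme bias at the first few positions even though nothing is wrong with the run. The sketch below uses made-up reads that share a hypothetical 4-base bar code `ACGT`; the function name is also an assumption.

```python
from collections import Counter


def per_base_composition(reads):
    """Fraction of A, C, G and T at each read position.
    Other symbols (such as N) count towards the total, so the four
    fractions can sum to less than 1 at positions containing them."""
    length = max((len(read) for read in reads), default=0)
    composition = []
    for pos in range(length):
        counts = Counter(read[pos] for read in reads if len(read) > pos)
        total = sum(counts.values())
        composition.append({base: counts[base] / total for base in "ACGT"})
    return composition


# Made-up reads: a shared bar code (ACGT) followed by varied sequence.
barcoded_reads = ["ACGTTGCA", "ACGTGATC", "ACGTCCAG", "ACGTATGT"]
comp = per_base_composition(barcoded_reads)
print(comp[0])  # position 1 is entirely A because of the bar code
print(comp[4])  # first position after the bar code is evenly mixed
```

Any automated cutoff on per base composition would flag positions 1 to 4 here, which is why the library type needs to be known before a warning is treated as a real problem.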

Which software packages are available

Transcript of Minutes

coming soon

9th Discussion-28 October 2010
