9th Discussion-28 October 2010

From BioWiki
Revision as of 04:18, 13 October 2010 by Simon andrews

Brief Description and Continuing Discussion (please edit in line):

Assessing Sequence Data Quality (led by Dawei Lin and Simon Andrews)

Before going on to analyse data coming from high-throughput sequencers, it is important to check the quality of the raw data. Knowing about potential problems in your data lets you either correct for them before doing any analysis, or take them into account when interpreting the results you later derive.

Knowing which measures to take from your data, and how to interpret them, can help enormously. As core facilities, we are exposed to a wider variety of data than any individual research group, and so may have a better feel for how to spot potential problems.

In this session we will look at how people are currently assessing their data quality. We will cover:

  • The types of error which occur
  • The best metrics to calculate
  • The cutoffs to identify poor data
  • False positives coming from different types of experiment
  • Which software packages are being used

Preliminary Information

To get the discussion started, the information below sets out some tests and software packages which are already in use, so that we have a basis to start from.

What type of problems occur

  • Poor sequence quality
    • Quality which drops off evenly over the course of a run
    • Poor quality which affects only a subset of sequences
    • Poor quality which suddenly affects a run
  • Contamination
    • Primers / Adapters
    • Repeats / Low Complexity
    • Other samples
  • Base call bias
    • Bias affecting the whole run
    • Bias affecting certain base positions
  • Switched samples
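
As one illustration of the contamination category above, the following minimal sketch estimates the fraction of reads that contain an adapter sequence, either in full or as a partial match at the 3' end of the read. The function name, the `min_overlap` parameter and the example adapter string are assumptions made for this sketch and are not taken from any particular package. It uses exact matching only, so it will miss adapters containing sequencing errors.

```python
def adapter_fraction(reads, adapter, min_overlap=8):
    """Return the fraction of reads showing exact adapter contamination.

    A read counts as contaminated if it contains the whole adapter, or if it
    ends with a prefix of the adapter at least `min_overlap` bases long.
    `min_overlap` is an illustrative default, not a recommended cutoff.
    """
    if not reads:
        return 0.0

    hits = 0
    for read in reads:
        if adapter in read:
            hits += 1
            continue
        # Look for a partial adapter running off the 3' end of the read,
        # trying the longest possible prefix first.
        longest = min(len(adapter) - 1, len(read))
        for k in range(longest, min_overlap - 1, -1):
            if read.endswith(adapter[:k]):
                hits += 1
                break
    return hits / len(reads)


# Example adapter string, used here purely as an illustration.
EXAMPLE_ADAPTER = "AGATCGGAAGAGC"

example_reads = [
    "ACGTACGTAGATCGGAAGAGCTT",   # full adapter inside the read
    "ACGTACGTACGTAGATCGGAAG",    # read ends with the first 10 adapter bases
    "ACGTACGTACGTACGTACGT",      # clean
    "TTTTGGGGCCCCAAAA",          # clean
]
example_fraction = adapter_fraction(example_reads, EXAMPLE_ADAPTER)
```

A high value would suggest a library with short inserts or adapter dimers. What counts as "high" is one of the cutoffs this session is meant to discuss.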

What metrics could we calculate

  • Quality plots
    • Per base
    • Per sequence
    • Intensity / Focus measures
  • Composition plots
    • Per base composition
    • GC content
    • GC profile
  • Contaminant identification
    • Overrepresented sequences
    • Duplicate levels
  • Mapping quality
    • Genomic distribution
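
Several of the metrics listed above are simple to compute directly from the reads. The sketch below shows one possible way to calculate per-base mean quality, per-sequence mean quality and GC content. It assumes quality strings are Phred-encoded with an ASCII offset of 33, which should be checked for each data set, since some pipelines use an offset of 64. All function names are hypothetical.

```python
def phred_scores(quality_string, offset=33):
    """Decode a quality string into a list of Phred scores.

    The default offset of 33 is an assumption; pass offset=64 for data
    encoded that way.
    """
    return [ord(char) - offset for char in quality_string]


def per_base_mean_quality(quality_strings, offset=33):
    """Mean Phred score at each read position (reads may differ in length)."""
    totals, counts = [], []
    for quality_string in quality_strings:
        for i, score in enumerate(phred_scores(quality_string, offset)):
            if i == len(totals):
                # First read long enough to reach this position.
                totals.append(0)
                counts.append(0)
            totals[i] += score
            counts[i] += 1
    return [total / count for total, count in zip(totals, counts)]


def per_sequence_mean_quality(quality_string, offset=33):
    """Mean Phred score across a single read."""
    scores = phred_scores(quality_string, offset)
    return sum(scores) / len(scores) if scores else 0.0


def gc_content(sequence):
    """Fraction of bases in a sequence that are G or C."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


# 'I' decodes to Phred 40 and '5' to Phred 20 with offset 33.
example_profile = per_base_mean_quality(["IIII", "II5"])
```

Plotting the per-base profile against position would show a gradual or sudden drop in quality over the course of a run, while the distribution of per-sequence means would show whether only a subset of reads is affected.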

What are sources of false positives

  • Quality plots
    • Biased sequence in Illumina libraries
  • Composition plots
    • Bisulphite conversion
    • Genomes with extreme GC content
  • Per base composition
    • Bar codes
    • Restriction sites
    • ChIP-Seq
    • RNA-Seq primers
  • Contaminants
    • Highly enriched sequences
    • Repeats
  • Genomic distribution
    • Sex chromosomes
    • Mitochondria
    • Biased ChIP
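
To make the per-base composition case concrete, the sketch below computes base fractions at each read position and flags positions dominated by a single base. The `max_fraction` threshold and the function names are assumptions chosen for illustration only. In the example, every read starts with the same four-base bar code, so the first four positions are flagged even though nothing is wrong with the library.

```python
from collections import Counter


def per_base_composition(reads):
    """Return, for each read position, a dict mapping base to its fraction."""
    counters = []
    for read in reads:
        for i, base in enumerate(read.upper()):
            if i == len(counters):
                # First read long enough to reach this position.
                counters.append(Counter())
            counters[i][base] += 1
    return [
        {base: n / sum(counter.values()) for base, n in counter.items()}
        for counter in counters
    ]


def biased_positions(composition, max_fraction=0.6):
    """Positions where one base exceeds `max_fraction` of all calls.

    The 0.6 default is an arbitrary illustrative value, not a recommendation.
    """
    return [
        i for i, fractions in enumerate(composition)
        if max(fractions.values()) > max_fraction
    ]


# Hypothetical reads: a shared bar code "ACGT" followed by varied inserts.
barcoded_reads = ["ACGT" + insert for insert in ("AACC", "CCGG", "GGTT", "TTAA")]
flagged = biased_positions(per_base_composition(barcoded_reads))
```

A check of this kind cannot tell a bar code or restriction site apart from a genuine base call bias, so the flagged positions need to be interpreted with knowledge of how the library was made.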

Which software packages are available


Transcript of Minutes

coming soon