20 July 2008 BoF on next-gen sequencing
Conference Notes, 16th International Conference “Intelligent Systems for Molecular Biology” (ISMB); 19-23 July 2008, Toronto (Canada)
Conference website with abstracts, links to related materials etc.
Introduction:
The ISMB series of annual conferences, which is organized by the International Society for Computational Biology (ISCB), provides a comprehensive overview of current research in bioinformatics and computational biology. The meeting attracts thousands of scientists from all over the world, from a variety of disciplines (such as Computer Science, Molecular Biology, Biophysics, Theoretical Biology, Medicine). As a platform for interdisciplinary discussions this series of conferences provides unique opportunities to discuss ideas and challenges related to the computational analysis of biological data. This includes classic areas of bioinformatics such as sequence analysis, protein modeling and the analysis of genome-wide datasets, but also many other related topics that may be applicable to biological problems.
Apart from various keynote lectures, there are many parallel sessions, poster sessions, Birds-of-a-feather (BoF) sessions (where people with similar interests get together informally), and networking opportunities.
Selected contributions (talks, posters, etc.):
One of the major themes at the event was next generation sequencing (next gen sequencing). Many institutions have already set up their own core facilities, mostly using the Solexa technology from Illumina. Many considered ABI’s SOLiD system to be more experimental than Solexa. In addition, Roche’s 454 technology is widely used and is often considered complementary to the Solexa technology. The most mature applications include resequencing and SNP detection, ChIP-seq (chromatin immunoprecipitation followed by sequencing, e.g. for finding transcription factor binding sites and for epigenomics), and the sequencing of new genomes. A variety of new applications are still at a more experimental stage but seem to be progressing fast. Overall, this area has taken off quite rapidly.
- In the BoF session on next gen sequencing, organized by the Bioinfo-Core community (coordinator: David Sexton, Vanderbilt University), several speakers shared their experience of the factors that were critical in setting up their next gen sequencing capabilities, with a focus on computer hardware, personnel and computational infrastructure (e.g. the Solexa LIMS developed by Dawei Lin). This helped those who are about to set up similar facilities to plan realistically in terms of hiring (the consensus was 1-2 FTEs even for small setups, for sysadmin and analysis), budgeting and computational infrastructure (hardware and software). As the data generated can be very large and difficult for biologists to handle, most institutions provide the sequence reads and quality scores as results, but not all the primary imaging data (the cost of storage and backup can be higher than that of repeating the experiment, unless the samples used are exceptionally precious). The NCBI has developed a trace archive that is already quite useful.
- In his keynote talk, David Jaffe from the Broad Institute of MIT and Harvard (one of the leading institutions in this area) gave an overview of current applications of next gen sequencing technologies (see above). He presented results from their recent stem cell differentiation epigenomics paper (Meissner’08 in Nature), advertised their ALLPATHS algorithm for de novo assembly of short reads (Butler’08) and introduced their approach to SNP detection (Brockman’08). His lab was involved in many validation experiments to assess the use of next gen sequencing approaches for particular applications. For the assembly of new genomes, they are currently testing an approach that combines the complementary Solexa and 454 technologies.
- Before the main conference, there was a Special Interest Group pre-meeting focused on next generation sequencing. The abstracts are available online.
- Posters: many approaches to the problem of efficiently aligning short sequence reads to genomes were presented. For many users, the ELAND tool that comes with Solexa machines seems to be sufficient for most purposes. By next year we may be in a better position to compare the alternative tools that were first presented at ISMB 2008 (some of them have not been released yet).
- SHRiMP: version 1.1 was just released
- SLIPPER (Bing Ren, Terry Gaasterland): "Combining Illumina’s Genome Analyzer Pipeline with SLIPPER, our iterative mapping pipeline, increases the proportion of mapped sequences significantly over using the Illumina software alone. A proof of principle analysis of eight lanes mapped an additional 5.33% of reads to the human genome."
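To illustrate the short-read alignment problem that these posters address, here is a minimal Python sketch of seed-based mapping with a k-mer index. It is not the algorithm used by ELAND, SHRiMP or SLIPPER; the function names `build_index` and `map_read`, the seed length and the mismatch threshold are all assumptions made for this example, and only the forward strand is searched.

```python
from collections import defaultdict


def build_index(genome, k):
    """Map every k-mer in the genome to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def map_read(read, genome, index, k, max_mismatches=2):
    """Return sorted (start, mismatches) hits for a read on the forward strand.

    The read is cut into non-overlapping seeds of length k. If the read has
    at least max_mismatches + 1 such seeds, any placement with at most
    max_mismatches mismatches must contain one exactly matching seed
    (pigeonhole principle), so looking up the seeds finds all such placements.
    """
    hits = {}
    for offset in range(0, len(read) - k + 1, k):
        seed = read[offset:offset + k]
        for seed_pos in index.get(seed, []):
            start = seed_pos - offset
            # Skip placements that run off the genome or were already accepted.
            if start < 0 or start + len(read) > len(genome) or start in hits:
                continue
            window = genome[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits[start] = mismatches
    return sorted(hits.items())
```

For a toy genome and 12-base reads with `k=4`, each read yields three seeds, so up to two mismatches are tolerated. Real tools add reverse-strand search, quality scores, gapped alignment and far more compact indexes.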
Exon arrays were more controversial. While some groups suggested that filtering out the least useful probe sets may do the trick (Peter Munson, NIH), others decided not to push this technology in their respective institutions after some initial experiments. Many of the critics found the high validation rates for splicing-related events reported in recent publications (in the range of 40-70%) unrealistic in light of their own experience with the technology, and argued that issues like constraints on designing probes for short exons will be hard to overcome. On the other hand, many agreed that carefully selecting useful probe sets will help to get more reliable gene-level readouts, as the probes are better distributed across different exons of the same gene, especially in comparison with earlier chip designs that focused almost entirely on 3’UTR probes. One group reported that the agreement between protein-level expression readouts (using iTRAQ mass spectrometry) and transcript-level expression readouts improved if the measured peptides were matched with probes covering the respective exons (Crispin Miller et al). This would indicate that some of the previously reported inconsistencies between transcript- and protein-level expression data may be due to the complexity of transcription, including phenomena such as alternative splicing, alternative promoters and alternative polyA usage. Therefore, matching probes to the right exon, or even to the right region within an exon (e.g. in long 3’UTR exons), has to be considered carefully when designing experiments.
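The idea of matching measured peptides to the probe sets that cover the same exon can be sketched as a simple genomic interval overlap. This is only an illustration under stated assumptions, not the method used by the group mentioned above: the `Interval` type, the function names and the coordinate convention (0-based, half-open) are hypothetical, and a peptide spanning an exon junction would have to be represented as several intervals.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """A named genomic interval (hypothetical representation for this sketch)."""
    name: str
    chrom: str
    strand: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def overlaps(a, b):
    """True if both intervals lie on the same chromosome and strand and share a base."""
    return (a.chrom == b.chrom and a.strand == b.strand
            and a.start < b.end and b.start < a.end)


def match_peptides_to_probesets(peptides, probesets):
    """For each peptide, list the probe sets whose coordinates overlap it."""
    return {pep.name: [ps.name for ps in probesets if overlaps(pep, ps)]
            for pep in peptides}
```

Peptides without an overlapping probe set get an empty list, which makes it easy to see where only a gene-level comparison is possible. For genome-scale data, an interval tree or sorted sweep would replace the quadratic scan used here.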
The biological interpretation of data coming from genome-wide association studies (GWAS) was another theme. A special session organized by Thomas Hudson focused on this topic. Current challenges include the validation of loci, the lack of protein-coding genes in the center of the association signals, and the understanding of associations with complex diseases. It was also noted that current data explain only up to 10% of the genetic contributions to disease risk. One of the best studied diseases with this emerging technology is Crohn’s disease, for which a range of loci have been reported in recent years. In this particular case, there was relatively good progress in the interpretation of the GWAS data in terms of pathways, which led to the discovery of autophagy as an important mechanism (Rioux’07). David Balding explained the importance of dog genetics studies for understanding disease, and advocated the use of kinship information in association studies (see also his recent Bioinformatics paper).
Mark Reimers presented a useful and comprehensive overview of various computational approaches to microarray data analysis. He makes many of his materials available, including slides and handouts (“An opinionated guide to microarray data analysis”).
Other tools and databases:
- VisANT from BU (Charles DeLisi’s lab): Java tool for pathway/network visualization, with integrated interaction data for many species (78000 for human); looks like a promising project
- SCI-PHY takes as input an alignment, calculates a tree and predicts key residues for defining subfamilies (can be important for assigning function in large protein families)
- Cytoscape: many useful new plugins were presented in the special session, including interactions derived from text mining and 3D structure related tools that can help to understand interactions better. Unfortunately, some of them are not available in the plugin manager. This is clearly one of the most promising community efforts.
- dbGaP: one of the largest collections of SNP studies, at NCBI
“Bioinfo-Core” community: initiated by Fran Lewitter from the Whitehead Institute, this network of the heads of bioinformatics core facilities meets regularly at ISMB conferences. They share their experience with bioinformatics tools, related infrastructure and strategies for giving biologists and medical researchers access to bioinformatics know-how. In addition to the BoFs at the ISMB, this group organizes teleconferences throughout the year and discusses current topics on a mailing list. Recently, this group has become more active, partly because of the challenges related to next gen sequencing and other data-intensive technologies. If you are the head of a bioinformatics facility and interested in exchanging experiences, check out their mailing list. In addition to the BoF on next gen sequencing described above, a BoF on best practices in running a bioinformatics core was also held at ISMB 2008.
Keynote talk by Claire Fraser on metagenomics (genomics-enabled study of microbial communities): diseases have been found to be associated with shifts in the microbiome. Therefore, a Human Microbiome Project has been proposed (Turnbaugh’07). Recent studies in the Amish population (relatively homogeneous lifestyle, good family trees) were very revealing.
- need to understand the role of disordered regions in interactions, esp. in signaling (Kim’08); take into account the role of 3D structure in understanding networks, interfaces and hubs (Kim’06), e.g. which interactions are simultaneously possible
- they are connecting interactomes with variation data (SNPs, CNVs), using Ensembl
- proteins at the periphery of the network are under different selection pressures than those at the center (hypothesis: they interact more with the extracellular environment), see Kim’07
- TopNet: comparison of subnetworks (topology) (Yu’04)
Curtis Huttenhower: function prediction for yeast genes based on integrated information from microarrays, genetic interactions and physical interactions (currently in the process of applying the same approach to other organisms, incl. human)