ISMB 2010: Workshop on Managing large data sets in core facilities

Organizer: Simon Andrews, Babraham Institute, Cambridge, UK

This in-person session was held as a workshop during the annual ISMB meeting in Boston, MA, USA (introduction to the workshop: Media:Richter_intro_ismb10.pdf). The presentations are listed below.

Description:

This workshop discusses issues confronting core facilities in data collection and handling, including which data should be stored.

Presenters:

1.) Simon Andrews, Introduction
- Media:babraham_andrews_Intro2managingdata.pdf
2.) Hemant Kelkar, U of North Carolina, USA File:Hemant Kelkar ISMB2010.pdf
- Media:unc_kelkar_NGS_biofxchallenges.pdf
3.) Mario Caccamo, The Genome Analysis Centre, Norwich, UK
- Media:ismb10_caccamo_pipeline.pdf

Blogs of the Session

Summary

Hemant gave a comprehensive introduction to the problems likely to be encountered in a core facility. He showed how data volumes have increased enormously even since the introduction of high-throughput sequencing machines, and how new systems such as the HiSeq are set to increase storage requirements yet again.

He went on to show that, rather than thinking only about the traditional storage problem, we should also focus on the problem of moving this volume of data around. Most networks are not designed with this level of throughput in mind, and a high-throughput sequencing facility can have a huge impact on both internal and external networks.
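
The data-movement point can be made concrete with a rough back-of-the-envelope estimate. The sketch below is not taken from the talk: the function name, the data set size, the link speeds, and the 70% efficiency factor are all hypothetical values chosen for illustration, and real transfers also depend on protocol overhead, disk speed, and competing traffic.

```python
def transfer_hours(data_terabytes: float, link_gbps: float,
                   efficiency: float = 0.7) -> float:
    """Estimate the hours needed to move a data set over a network link.

    data_terabytes: size of the data set in decimal terabytes (1 TB = 1e12 bytes).
    link_gbps: nominal link speed in gigabits per second.
    efficiency: assumed fraction of the nominal speed actually achieved
                (hypothetical default, not a measured value).
    """
    if data_terabytes < 0 or link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("size must be >= 0, speed > 0, efficiency in (0, 1]")
    total_bits = data_terabytes * 1e12 * 8
    usable_bits_per_second = link_gbps * 1e9 * efficiency
    return total_bits / usable_bits_per_second / 3600


if __name__ == "__main__":
    # Hypothetical example: a 5 TB batch of sequencing output
    # moved over links of different nominal speeds.
    for gbps in (0.1, 1.0, 10.0):
        hours = transfer_hours(5, gbps)
        print(f"{gbps:>5} Gbit/s link: about {hours:.1f} hours")
```

Even under these optimistic assumptions, a slower shared link turns a multi-terabyte transfer into a job measured in days, which is why network capacity deserves as much planning as storage capacity.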

He also discussed the hardware and infrastructure he thought necessary to handle this data. This ranged from the need for machines to move to 64-bit operating systems, so that they can use large amounts of RAM, to the need for new backup regimes where traditional solutions cannot cope. Finally, he discussed the need for a LIMS to keep track of all of the data.

Mario then provided a case study describing the process he has undertaken at TGAC in establishing a new genome analysis centre from scratch. He showed the architecture of the LIMS that will sit at the core of the centre and described how it was designed to deal with multiple sequencing platforms and to integrate with public repositories for both sequences and annotations. He emphasised the need to integrate closely with public repositories, since automating the movement of data from an internal system to a public system will increasingly become a challenge for many centres.

The MISO system, which has been developed at TGAC, will be released publicly and was also presented as a poster at the ISMB conference (U46).

