ISMB 2010: Workshop on Managing large data sets in core facilities

From BioWiki
Revision as of 14:26, 12 October 2011 by Lewitter

Organizer: Simon Andrews, Babraham Institute, Cambridge, UK

This in-person session was held as a workshop during the annual ISMB 2010 meeting in Boston, MA, USA (Media:Richter_intro_ismb10.pdf). See also Workshop on Analysis of Large Datasets.

Please find presentations below.


This session discusses the issues core facilities face in collecting and handling data, and in deciding what data should be stored.


Blogs of the Session


Hemant gave a comprehensive introduction to the problems likely to be encountered in a core facility. He showed how data volumes have increased enormously, even since the introduction of high-throughput sequencing machines, and how new systems such as the HiSeq are set to increase storage requirements yet again.

He went on to show that, rather than thinking only about the traditional storage problem, we should also focus on the problem of moving this volume of data around. Most networks are not designed with this level of throughput in mind, and a high-throughput sequencing facility can have a huge impact on both internal and external networks.
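The scale of the data-movement problem can be illustrated with a rough back-of-the-envelope calculation. The sketch below is not from the talk: the function name `transfer_hours`, the link speeds and the 70% efficiency factor are all assumptions chosen for illustration, and real throughput depends heavily on the network, the protocol and competing traffic.

```python
def transfer_hours(data_terabytes: float, link_gbit_per_s: float,
                   efficiency: float = 0.7) -> float:
    """Estimate the hours needed to move a data set over a network link.

    data_terabytes  -- data volume in decimal terabytes (1 TB = 1e12 bytes)
    link_gbit_per_s -- nominal link speed in gigabits per second
    efficiency      -- assumed fraction of the nominal speed actually achieved
                       (hypothetical default; measure your own network)
    """
    if data_terabytes < 0:
        raise ValueError("data_terabytes must not be negative")
    if link_gbit_per_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")

    total_bits = data_terabytes * 1e12 * 8
    usable_bits_per_second = link_gbit_per_s * 1e9 * efficiency
    return total_bits / usable_bits_per_second / 3600


if __name__ == "__main__":
    # Illustrative link speeds only (100 Mbit/s, 1 Gbit/s, 10 Gbit/s).
    for link in (0.1, 1.0, 10.0):
        hours = transfer_hours(1.0, link)
        print(f"{link:>5} Gbit/s: {hours:.1f} hours per terabyte")
```

Running the estimate for a facility's own data volumes and link speeds gives a quick sense of whether a transfer finishes in hours or days, which is the kind of planning the talk argued for.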

He also discussed the hardware and infrastructure he thought were necessary to handle this data. This ranged from the need for machines to move to 64-bit operating systems, giving them the flexibility to use large amounts of RAM, to the need for new backup regimes where traditional solutions can't cope. Finally, he discussed the need for a LIMS to keep track of all of the data.

Mario then provided a case study where he went through the process he has undertaken at TGAC in establishing a new genome analysis centre from scratch. He showed the architecture of the LIMS system which will sit at the core of the centre and described how this was designed to deal with multiple sequencing platforms, and to integrate with public repositories for both sequences and annotations. He emphasised the need to integrate closely with public repositories since automating the movement of data from an internal system to a public system will increasingly become a challenge for many centres.

The MISO system, which has been developed at TGAC, will be released publicly and was also presented as a poster at the ISMB conference (U46).

Search term in web browser: bioinfo-core Email - [1] Wiki - [2]