12th Discussion-21 May 2012

From BioWiki

Brief Description and Continuing Discussion:

This discussion will have one main topic for discussion, followed by an open floor where people can raise other topics of interest.

Topic 1: Managing Storage in a Core Facility (Led by Simon Andrews)

With ever larger datasets being produced, the management of data storage in a core facility is a constant concern for many people. This topic will aim to cover all aspects of storage, from the physical systems people use to how data is managed, shared with users and charged for.

Preliminary Information

Questions which would be relevant to this topic might include:

  • Are core facilities the right people to be managing storage?
  • What systems are people using to store and back up their data?
  • What data are people storing?
  • How are you managing your storage? How do you ensure you're only storing what you need and making sure data which is no longer useful is deleted?
  • How do you make stored data available to users? What access controls do you require? How do you allow users to share data both internally and externally?
  • Do you charge for storage, and if so how?
  • How do you integrate local storage with large public datasets?

Topic 2: Open Floor

After the main discussion there will be an opportunity to raise other topics which might be of interest to the community. These can be suggested in advance below, or raised on the call itself.

Suggested Topics

  • Add topics here...

Transcript of Minutes

People Present

  • Deanne Taylor
  • Charlie Whittaker
  • Simon Andrews
  • Brent Richter
  • Steven Turner
  • Thomas Manke
  • David Sexton
  • Fran Lewitter
  • Matthew Trunnel
  • Hemant Kellar
  • Matt Eldridge
  • Madeline Gogol
  • Stuart Levine
  • David Lapointe

Topic 1 Managing Storage in a Core Facility

Charlie Whittaker kindly provided some information in advance of the call about his facility to help answer some of the questions posed.

The call started with a general question about whether people were managing their own storage or deferring this to other groups.

[FL] has storage which is managed by a separate IT group, and having this taken out of the hands of the bioinformaticians works well.

[DS] A separate IT group manages the storage but the bioinformatics core is heavily involved in the management of the storage.

[CW] Storage for informatics at MIT is self-managed.

[TM] Storage is self-managed without the aid of the IT group.

[HK] At UNC they started out by managing their own storage but have moved to doing this in collaboration with an IT group.

[ST] Storage is managed by the university.

[MT] There is split responsibility - the storage infrastructure is managed by a core IT group, but the data itself is managed by the facility.

A general question was asked about how long data was being stored.

[SA] said that all of their primary data had to be stored for 10 years since this was stipulated in the terms of the grants which funded much of their work. Raw data was poorly defined and they used to assume this meant intensity files from sequencers, but were moving to fastq files as the storage they had couldn't cope with the increasing volumes of lower level data.

[DS] stores all data online for 6 months then puts it down to tape for longer term storage. This can get complicated as changes in tape format make this difficult to use for very long term storage.
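As an illustration of the "online for 6 months, then tape" policy described above, here is a minimal sketch that lists files old enough to be moved to an archive tier. It is not the system used by any facility on the call. The function name, the use of file modification time as the age criterion and the 183-day threshold are all assumptions made for the example.

```python
import os
import time

# Assumed threshold: roughly six months, expressed in seconds.
SIX_MONTHS_SECONDS = 183 * 24 * 3600


def files_due_for_archive(root, now=None, max_age_seconds=SIX_MONTHS_SECONDS):
    """Return a sorted list of files under root not modified within max_age_seconds.

    Modification time is used as a stand-in for "time since the data was
    produced"; a real facility would more likely take this from its LIMS.
    """
    now = time.time() if now is None else now
    due = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if now - os.path.getmtime(path) > max_age_seconds:
                due.append(path)
    return sorted(due)
```

The function only reports candidates. Copying to tape and removing the online copy are left to whatever archive tooling is in place.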

[FL] has a shared server for scientists where they can access and modify their data. They don't back this up at all and scientists must copy it themselves if they want to back up. All data is wiped 3 months after the end of a project.

[HK] has a patron model with shared space on a common infrastructure. People have the option of buying storage within the infrastructure. Duplication is a short term problem. Archival storage (tape backup) is done for 3 years by default. In effect the tape backup storage may end up being indefinite.

[BR] Partners has 3 levels of storage. IT have a managed NetApp system but this is very expensive and is too much for most scientists.

For clusters they have storage based on IBRIX, which allows for different storage tiers and can be accessed by NFS or Samba. Finally, they use Compellent storage for working space. This allows for granular authorisation and works well.

Partners provide some storage attached to the HPC cluster but this is temporary storage space and is quickly deleted. Users have to pay for longer term storage. They have moved away from tape for archiving as it's too expensive and changes too often. They are now using a home-built system based on commodity hardware. This is a series of locally built storage units linked together using Gluster. They used to have Sun Thumpers as the underlying storage but are moving away from these. The transition hasn't been simple as they were using ZFS under Gluster, which wasn't officially supported. They have 800TB under RAID6 and have found this to be manageable.

[MT] At the Broad they don't believe in long term storage within the facility. The group charges for 3 years' worth of storage as part of the sequencing charge. They make every effort to make the data available directly to users rather than having them copy it to other storage systems. This keeps the bulk of the storage on the common central system, and the users create derived files, such as variant files, which are much smaller. After 3 years the raw data is deposited into NCBI and can be removed from the local storage. All storage is billed as a service and measured in terabyte-years.
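To make the terabyte-year unit concrete, the following is a minimal sketch of how such a charge could be computed: size held multiplied by the time it is held. The function names and the rate parameter are hypothetical, as no actual rates or billing code were discussed on the call.

```python
from datetime import date


def terabyte_years(size_tb, start, end):
    """Storage consumed in terabyte-years: size (TB) times duration (years)."""
    if end < start:
        raise ValueError("end date is before start date")
    # 365.25 days per year is an approximation that averages over leap years.
    years = (end - start).days / 365.25
    return size_tb * years


def storage_charge(size_tb, start, end, rate_per_tb_year):
    """Charge for holding size_tb between two dates at a hypothetical rate."""
    return terabyte_years(size_tb, start, end) * rate_per_tb_year
```

For example, 4 TB held from 1 January 2010 to 1 January 2013 comes to roughly 12 terabyte-years, which would then be multiplied by whatever rate the facility sets.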

Access to data is managed by a custom metadata management system which integrates with their primary LIMS. This then serves as the interface to the data, which can be accessed through a web interface, web services or a FUSE-based filesystem mount. The FUSE mount is present on all compute nodes on their cluster and provides managed access to all data. The service picks up the access control rules from the data management system and provides a custom view of the run data to each user. This scales much better than using traditional file system permissions. Normal permissions aren't flexible enough to handle the complex rules which might be present (e.g. the limit of 16 group memberships per user on NFS). The FUSE overlay does entail a performance hit but is practical. The underlying technology is simply an NFS mount to an Isilon cluster. The FUSE system is effectively creating a series of links into the main data store.
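The idea of a per-user view made of links into the main data store can be sketched as follows. This is not the Broad's FUSE implementation: it builds static symbolic links on disk rather than a dynamic filesystem, and the function name and the structure of the access rules (run name mapped to a set of permitted users) are assumptions for illustration.

```python
import os


def build_user_view(data_root, view_root, user, access_rules):
    """Populate view_root/<user>/ with one symlink per run the user may read.

    access_rules maps a run directory name to the set of users allowed to
    see it. This is a hypothetical structure; a real system would query
    the data management system instead.
    Returns the sorted names now visible in the user's view.
    """
    user_dir = os.path.join(view_root, user)
    os.makedirs(user_dir, exist_ok=True)
    for run, allowed in access_rules.items():
        target = os.path.join(data_root, run)
        link = os.path.join(user_dir, run)
        # Link only runs that exist, are permitted, and are not already linked.
        if user in allowed and os.path.isdir(target) and not os.path.lexists(link):
            os.symlink(target, link)
    return sorted(os.listdir(user_dir))
```

One limitation of this static approach is that links are never removed when access is revoked, which is part of why a dynamic overlay that consults the rules at access time is attractive.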

Same system is used to share data with collaborators - usually they are given access to the analysis cluster so can access the data the same way local researchers do.

Also looking at integrating iRODS into this system. This is a descendant of the Storage Resource Broker (SRB) system. It is a metadata management system which is based on a key-value store, but also provides a series of tools to move and manage the data. It uses micro-services to do things like delete, copy or change the data and you can use these to construct complex business rules which are based on the metadata stored in the system. It's not likely to be useful as the main store of experimental metadata due to the relatively unstructured nature of the data, but it might make sense to copy metadata from a structured LIMS into iRODS.
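The general pattern of driving data management actions from key-value metadata can be sketched in plain Python as below. This does not use the iRODS rule language or any iRODS API. The function name, the metadata keys and the action names are all hypothetical.

```python
def select_actions(objects, rules):
    """Return (path, action) pairs for objects whose metadata matches a rule.

    objects: dict mapping a data path to its key-value metadata dict.
    rules:   list of (condition, action) pairs, where condition is a dict of
             key-value pairs that must all be present in the metadata.
             The first matching rule wins for each object.
    """
    actions = []
    for path, meta in sorted(objects.items()):
        for condition, action in rules:
            if all(meta.get(key) == value for key, value in condition.items()):
                actions.append((path, action))
                break
    return actions


# Hypothetical metadata, as might be copied in from a structured LIMS.
example_objects = {
    "/runs/001.bam": {"project": "P1", "status": "deposited"},
    "/runs/002.bam": {"project": "P1", "status": "active"},
}
# Hypothetical rule: once data is deposited publicly, drop the local copy.
example_rules = [({"status": "deposited"}, "delete_local_copy")]
```

Here `select_actions(example_objects, example_rules)` selects only the deposited run for local deletion. In a system like the one described, the action would be carried out by a micro-service rather than returned as a string.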

Since storage is now becoming the limiting part of many pipelines, some vendors are looking at integrating iRODS into storage controllers so data processing can happen closer to the disk. This would allow, for instance, the conversion of BAM files to FastQ on the fly in the storage controller.

iRODS also potentially allows you to decouple the physical storage from the presentation to the user so you can move things around underneath without the user being aware.

[??] said they were using FileTek for the same purpose. This has the benefit that you can use tape as a storage level, so you can use it for backup as well.

[DS] Asked if people were separating the long term storage from the area which was written directly by their sequencers.

[SA] Said that they'd had bad experiences using the local storage supplied with their pipeline server so they now had a dedicated Nexsan storage system for the sequencers to write to and for initial data processing. Once this had been completed then data was moved to a separate system where it was accessible by the users.

[DS] Said they were putting together a 200TB BlueArc system for the instruments to write to.

[MT] Said their sequencing instruments were writing to one Isilon cluster, then primary analysis was writing to another cluster, and finally tertiary analysis was writing to yet another Isilon cluster.

[??] Said that they were using the Isilon SyncIQ system to replicate data from the initial storage to the final storage system.

[??] Said that their data used to go to Thumper systems but now goes direct to Isilon storage and is copied from there to its final destination.

[SA] Asked if anyone had looked at using data compression in their storage solutions? Particularly interested to know if anyone had tried out CRAM?

[DS] Said they'd tried CRAM at Baylor. It worked but the tools provided were not really working well enough yet.

[??] Said they were looking at using reduced representation compression for 1000 Genomes data. Since this is lossy there were concerns that you need to define in advance what questions you want to ask. They are looking at keeping only the neighbourhood around particularly variable regions. If you do this you can get 2 orders of magnitude compression, so there's a big payoff if you adopt this.

[??] Said that Compellent have inline compression on their storage systems. Dell purchased Ocarina, which also does something like this. There were concerns that this uses a proprietary compression algorithm, so you're then completely reliant on this system for access to your data.

[??] asked if anyone was trying to manage storage for ISO9001 high sensitivity clinical data.

[DS] said they stored clinical data but they don't attach any patient information to the data so don't have to deal with the regulatory requirements. De-identified data is much easier to deal with.

[??] said that there was some concern that sequence data may fall under HIPAA. If that happens then we're all sunk!

[MT] asked if anyone was looking at encrypting their main storage. If so, how do you encrypt 800TB? Is there even any point on a system where, even if you were given all the bits of hardware from a storage system, you'd have great difficulty actually reconstructing the data?

[BR] said that in their systems they only worry about encrypting mobile data, i.e. tapes, USB devices or replacement disks. For failed disks they had an agreement with the supplier (HP) that failed disks would either be wiped or destroyed on site.