12th Discussion-21 May 2012 Charlie Whittaker

The Bioinformatics and Computing Core at the Koch Institute (KI) at MIT has a 500 TB Isilon cluster consisting of seven 36NL nodes and three 108NL nodes. Each lab in the KI has a share on the device and every member of each lab can have a directory in that share. Authentication is done with Active Directory using MIT credentials. The cluster also serves the KI administration and the core facilities, including two HiSeq 2000 and two GAIIx sequencers, and it is the primary storage for accounts on our Linux cluster.

From a hardware point of view, everything is working well. There have been issues with permissions and authentication, but these have been resolved for the most part. Most importantly, both major and minor hardware problems have so far been addressed in a rapid and effective manner. From a performance point of view, the 108NL units are better than the 36NLs; since adding the three 108NL nodes I’ve noticed a major performance boost, presumably due to the increased processing capacity of these larger nodes.
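
As a rough sketch of the layout described above, a lab share and its member directories might look like the following; the device path, lab, and user names are hypothetical:
 /ki-isilon/
   smith_lab/        one share per KI lab, access controlled via an Active Directory group
     jdoe/           per-member directory inside the lab share
     asmith/
   jones_lab/
     ...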

  • Are core facilities the right people to be managing storage?
At the KI, the Bioinformatics and Computing Core facility is the only
real centralized option for managing storage because, other than us and our
support consultants, we don’t have a formal IT infrastructure.
  • What systems are people using to store and back up their data?
We use MIT’s TSM service: three different hosts back up three different parts
of the file system on a daily basis, although the backup does not finish every
day. It costs $65 per month per 10 TB. We also enable snapshots for labs that
request that function.
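As a rough illustration of this arrangement, the Python sketch below splits the file system across three backup hosts and estimates the monthly cost; the host names and paths are hypothetical, the $65 per 10 TB per month rate is the one quoted above, and billing in whole 10 TB increments is an assumption:
 import math

 # Hypothetical mapping of TSM client hosts to the parts of the file system
 # each one backs up nightly (real host names and paths will differ).
 BACKUP_DOMAINS = {
     "tsm-client-1": ["/ki-isilon/labs_a_to_h"],
     "tsm-client-2": ["/ki-isilon/labs_i_to_z"],
     "tsm-client-3": ["/ki-isilon/cores", "/ki-isilon/admin"],
 }

 RATE_PER_10_TB = 65  # dollars per month per 10 TB (rate quoted above)

 def monthly_backup_cost(total_tb):
     """Estimate the monthly TSM charge, assuming billing in 10 TB increments."""
     return math.ceil(total_tb / 10) * RATE_PER_10_TB

 print(monthly_backup_cost(500))  # a full 500 TB cluster would be $3250/month
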
  • What data are people storing?
All types of data; there are no restrictions.
The bulk of the storage is NGS related.
  • How are you managing your storage? How do you ensure you're only storing what you need and that data which is no longer useful is deleted?
This is one of the most interesting parts of this topic for me. Our
storage is essentially unmanaged right now, other than that we are attempting
to make labs pay for their own decisions. One place where this breaks down
relates to the work our core does on behalf of the labs. On some level, our
core employees are responsible for the data they produce, and there is a
reluctance to turn those data, and the decisions about what to keep, over to
the labs.
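One low-effort way to start managing this is to flag directories that have not been touched in a long time and hand the keep-or-delete decision back to the lab. A minimal Python sketch, with an arbitrary share path and age threshold:
 import os
 import time

 SHARE = "/ki-isilon/smith_lab"   # hypothetical lab share
 MAX_AGE_DAYS = 365               # flag anything untouched for a year

 cutoff = time.time() - MAX_AGE_DAYS * 86400

 for entry in sorted(os.listdir(SHARE)):
     path = os.path.join(SHARE, entry)
     if not os.path.isdir(path):
         continue
     # Find the newest modification time of any file under this directory.
     newest = 0
     for root, dirs, files in os.walk(path):
         for name in files:
             try:
                 newest = max(newest, os.path.getmtime(os.path.join(root, name)))
             except OSError:
                 pass  # file vanished or is unreadable; skip it
     if newest and newest < cutoff:
         age_days = int((time.time() - newest) / 86400)
         print("%s: newest file is %d days old, candidate for review" % (entry, age_days))
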
  • How do you make stored data available to users? What access controls do you require? How do you allow users to share data both internally and externally?
There are various drop boxes for sharing inside labs and within the KI.
For external access, parts of the device are made available on the web,
with or without htaccess controls.
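For the web-shared directories, the htaccess controls just mentioned amount to something like the following, where the password file path is hypothetical and the web server must be set up to allow these overrides:
 AuthType Basic
 AuthName "KI lab data"
 AuthUserFile /var/www/auth/.htpasswd
 Require valid-user
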
  • Do you charge for storage, and if so how?
This is a second part of the topic that is very interesting to me.
Each lab gets 1 TB free; every additional TB costs $50 per month, with
accounting done at the start of each month. So if a lab has 2.05 TB on the
first of the month, they will get a bill for $120, and the price is the same
for 2.99 TB. We arrived at this rate based on covering administration,
maintenance and backup costs. I don’t understand the details very well, but
there are various purchasing rules that make it hard to pay for the hardware
itself using a chargeback scheme like this.
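A minimal Python sketch of the charging rule as described, assuming usage beyond the free 1 TB is rounded up to whole TB when accounting runs on the first of the month (the exact rounding and any additional line items on a bill are not spelled out here):
 import math

 FREE_TB = 1        # each lab's free allocation
 RATE_PER_TB = 50   # dollars per additional TB per month

 def monthly_storage_charge(usage_tb):
     """Charge for one lab, based on usage measured on the first of the month."""
     billable_tb = max(0, math.ceil(usage_tb - FREE_TB))
     return billable_tb * RATE_PER_TB

 print(monthly_storage_charge(5.2))  # 5 billable TB -> $250
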
  • How do you integrate local storage with large public datasets?
We attempt to maintain a centralized collection of indexed genomes and
annotations, but there are various maintenance and versioning issues
associated with this that sometimes lead people to create their own copies.
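One way to reduce the versioning confusion is a fixed, self-describing layout for the central collection; the builds and file names below are only illustrative:
 genomes/
   mm9/
     ucsc_2007-07/
       genome.fa
       genes.gtf
       bowtie_index/
       README.txt      source URLs, download date, index build commands
   hg19/
     ...
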
  • Other point
I’m also interested in annotation and documentation methods. If we are going
to store all these data, it would be great to know what everything is. How can
we promote more effective documentation? Is it possible to retroactively
document older data?
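As a starting point, one option is a short, structured README required alongside every project directory; a hypothetical template:
 README.txt
   Project:        short title and a one-line description
   Lab / PI:
   Contact:        person responsible for these data
   Date created:
   Instrument/run: e.g. HiSeq 2000 flow cell or run ID, if applicable
   Processing:     pipelines, genome build, software versions
   Keep until:     review or deletion date
The same template could be filled in retroactively for older directories, even if only a few of the fields can still be recovered.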