ISMB 2016: BioinfoCoreWorkshop

From BioWiki
Revision as of 14:06, 12 July 2016 by Bgrichter (talk | contribs)

We are holding a Bioinfo-core workshop at the 2016 ISMB meeting in Orlando, Florida. We have been given a half-day workshop track slot in the program on Monday, July 11th from 2:00-4:30 PM.

Workshop Structure

The workshop is split into 4 sessions of ~30 mins each with a required break between the first and second half of the meeting (3-3:30).

  • The first slot will have two 15-minute talks on the topic of Big Data, followed by a 30-minute panel discussion. After the break we will have two 15-minute talks about Big Compute, followed by a 30-minute panel discussion.

Workshop topics

The workshop will address "The practical experience of big data and big compute". Members of core facilities will share their experience and insights via presentation and panel discussion.

Big data

Speaker: Yury Bukhman, Great Lakes Bioenergy Research Center Time: 2:00 pm – 2:15 pm

Presentation Overview:

The Computational Biology Core of the Great Lakes Bioenergy Research Center supports mostly academic labs at the University of Wisconsin, Michigan State University and other universities. With a variety of experiment types, they are challenged to manage and analyze disparate data and metadata in a diverse academic environment. Details of these data challenges and solutions will be discussed.


Speaker: Alberto Riva, University of Florida Time: 2:15 pm – 2:30 pm

Presentation Overview:

The Bioinformatics Core of the ICBR provides bioinformatics services to the large and diverse scientific community of the University of Florida. Routine handling of projects covering a vast spectrum of biological and biomedical research requires a flexible and powerful data infrastructure. Implementation details of a software development environment (Actor) for reliable, reusable, reproducible analysis pipelines will be discussed, as well as insights on managing big data projects in a core setting.


Big Data Panel Time: 2:30 pm – 3:00 pm

Moderator: Madelaine Gogol, Stowers Institute for Medical Research

  • Panel Speaker: Yury Bukhman, Great Lakes Bioenergy Research Center
  • Panel Speaker: Alberto Riva, University of Florida
  • Panel Speaker: Hua Li, Stowers Institute for Medical Research
  • Panel Speaker: Jyothi Thimmapuram, Purdue University

The presenters, panelists, and attendees will explore practical experience with “big data” as well as use of public datasets in a panel discussion. Topics may include accuracy of annotation, trust of data, raw versus processed, data validation, and QC.

Big Compute

Speaker: Sergi Sayols Puig, Institute of Molecular Biology Mainz Time: 3:30 pm – 3:45 pm

Presentation Overview:

With a variety of computing infrastructures available, building robust, transferable pipelines can increase utilization of compute resources. NGS analysis pipelines implemented as Docker containers and deployed on a variety of compute platforms (cluster, supercomputer, or workstation) will be discussed.

Speaker: Jingzhi Zhu, The Koch Institute at MIT Time: 3:45 pm – 4:00 pm

Experiences transitioning a Bioinformatics core from a local to a cloud-based compute solution will be discussed, including the motivation, performance, cost, and issues with deploying bioinformatics pipelines to Amazon EC2 instances.


Big Compute Panel Time: 4:00 pm – 4:30 pm

Moderator: Brent Richter, Partners HealthCare

  • Panel Speaker: Sergi Sayols Puig, Institute of Molecular Biology Mainz
  • Panel Speaker: Jingzhi Zhu, The Koch Institute at MIT
  • Panel Speaker: Sara Grimm, NIEHS

The presenters, panelists, and attendees will discuss how people manage to stay on top of compute requirements for their own sites in a panel discussion. Major hurdles to overcome and the compromises needed for success will be discussed. We may also touch on experiences with containers and portable computing.

We will have a bioinfo-core dinner the night of the workshop, Monday, at 6:30 PM. The dinner will be at Garden Grove, a restaurant in the Swan Hotel.

Discussion with notes

Big Data

Yury Bukhman: The GLBRC consortium consists mainly of people at the University of Wisconsin and a small group at Michigan State University. The consortium is mainly involved in agriculture and sustainability, with a very practical aim: developing biofuels and biochemicals. All groups in the consortium are mandated to have a data management plan that is reviewed by the bioinformatics core on a yearly basis. This provides a consulting opportunity for bioinformatics planning and research IT, covering both on-premises resources and cloud services. The core has developed a metadata database called GLOW.

Alberto Riva: The bioinformatics core facility sits within a closely organized group of core facilities that can be used for life and health science. Additionally, they have access to large, shared IT resources with segments set up for their specific use cases (a private area of the large 10,000-core cluster, for example).

What fraction of the University of Florida system uses the core and how is work paid for?

       It's fee-for-service, but moving to a model where work is charged or allocated by level of effort for a resource, with longer-term projects handled through full resource allocation to a grant.

Regarding the percentage of the university using the core, there is no good measure. Not everyone knows about the core; they do some outreach, but overall usage is hard to quantify.

Discussion of overall cost, including data analysis and storage. One view was that storage is getting cheaper; however, the data itself is still a problem, because data is growing faster than storage is getting cheaper. HMS, for example, has hired a data manager who works solely with people to put their data in the appropriate places: cheap archive storage vs. more expensive online high-performance storage.
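The point about data growth outpacing falling storage prices can be illustrated with a little arithmetic. The sketch below uses purely hypothetical rates (data volume growing 60% per year, price per terabyte falling 20% per year, starting from 100 TB at $100 per TB); none of these numbers come from the discussion, and the function name is made up for illustration.

```python
def yearly_storage_cost(start_tb, start_price_per_tb, data_growth, price_decline, years):
    """Return the storage bill for each year, year 0 first.

    All inputs are hypothetical illustration values, not workshop figures.
    """
    costs = []
    volume = start_tb
    price = start_price_per_tb
    for _ in range(years + 1):
        costs.append(volume * price)
        volume *= 1 + data_growth      # data keeps accumulating
        price *= 1 - price_decline     # storage gets cheaper per TB
    return costs


if __name__ == "__main__":
    for year, cost in enumerate(yearly_storage_cost(100, 100.0, 0.60, 0.20, 5)):
        print(f"year {year}: ${cost:,.0f}")
```

Under these assumed rates the bill is multiplied by 1.6 × 0.8 = 1.28 each year, so the total keeps rising even though the unit price falls. The total only stays flat when the growth and decline factors multiply to 1.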

At Purdue, there is not a single large dataset but thousands of small datasets. The Purdue core works with users who have varying levels of analytic and IT knowledge. They find that they have to spend time working on datasets to adapt, format, and clean them for analysis, as well as to understand the experimental parameters. Not everyone knows what goes on behind the scenes in the core when it performs this work. Users expect the work to be quick, but without prior involvement in developing the experiment, it takes days to get a dataset to a state where it can be run through the analysis. Educating students and users about the data, the dataset, and the analysis is important.

Collecting metadata for small and large datasets is a big problem, particularly if one wants to combine data across experiments or reuse it in the future. Metadata is required to compare different datasets. Additionally, when new data is submitted to public repositories, the repositories require a long list of metadata. GLBRC maintains a spreadsheet, focused specifically on metadata, that investigators are required to fill out. This forces investigators to think about the metadata.
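A required-metadata spreadsheet of this kind lends itself to an automated completeness check. The following is a minimal sketch, not GLBRC's actual tooling: the column names (`sample_id`, `organism`, `experiment_type`, `collection_date`), the CSV layout, and the function name are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical required columns; a real core would define its own list.
REQUIRED_FIELDS = ["sample_id", "organism", "experiment_type", "collection_date"]


def find_missing_metadata(csv_text, required=REQUIRED_FIELDS):
    """Return (row_number, field) pairs for empty or absent required fields.

    Row numbers are 1-based and count data rows only (the header is skipped).
    """
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        for field in required:
            value = row.get(field)
            # A missing column yields None; a blank cell yields "".
            if value is None or not value.strip():
                problems.append((row_number, field))
    return problems


if __name__ == "__main__":
    sheet = (
        "sample_id,organism,experiment_type,collection_date\n"
        "S1,Zea mays,RNA-seq,2016-05-01\n"
        "S2,,RNA-seq,\n"
    )
    for row_number, field in find_missing_metadata(sheet):
        print(f"row {row_number}: missing {field}")
```

Running a check like this when the spreadsheet is submitted, rather than at analysis time, would surface gaps while the investigator still remembers the experimental details.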

The biggest challenge for Riva is educating users on how to generate the data: you may have all the big data you want, but if the experiment is not designed properly, there's quite a lot of cruft.

The evolving technology in big data, NGS, life science is really an evolution in what "big" means. We've always dealt with challenging datasets but "big data" involves additional or more challenging work on the actual analysis and management processes--elaborate.

A big problem is the complexity of the projects, but a larger one is working with faculty who don't have a lot of money.
Most cores are willing to devote part of their time, pro bono, to generate results for grant submission. The investigator will then include the data, along with the cost and effort of the analysis services, in the grant.

How do you deal with privacy and security of the data? When thinking about a pipeline, do you take into account what's public vs. private?

 Purdue: they download all data into their local environment.
 Florida: they have the largest southern Florida health center, which works with patient data. To comply with regulations, the research computing group has created a secure area on their cluster for working with this data. It's walled off from external and internal access, i.e., access is controlled.

Bottom line, last thought: a core and the personnel within it have to be adaptable in order to understand what is brought to them. No two experiments are alike, and needs continuously change. Trends, technology capabilities, and tools change. Cores need to remain flexible and adapt their pipelines, processes, and people.