ISMB 2016: BioinfoCoreWorkshop


We are holding a Bioinfo-core workshop at the 2016 ISMB meeting in Orlando, Florida. We have been given a half-day workshop track slot in the program on Monday, July 11th from 2:00-4:30 PM.

Workshop Structure

The workshop is split into four sessions of ~30 minutes each, with a required break between the first and second halves of the meeting (3:00-3:30 PM).

  • The first slot will have two 15-minute talks on the topic of Big Data, followed by a 30-minute panel discussion.
  • After the break we will have two 15-minute talks about Big Compute, followed by a 30-minute panel discussion.

Workshop topics

The workshop will address "The practical experience of big data and big compute". Members of core facilities will share their experience and insights via presentation and panel discussion.

Big data

Speaker: Yury Bukhman, Great Lakes Bioenergy Research Center Time: 2:00 pm – 2:15 pm

Presentation Overview:

The Computational Biology Core of the Great Lakes Bioenergy Research Center supports mostly academic labs at the University of Wisconsin, Michigan State University and other universities. With a variety of experiment types, they are challenged to manage and analyze disparate data and metadata in a diverse academic environment. Details of these data challenges and solutions will be discussed.

File:Yury.pdf

Speaker: Alberto Riva, University of Florida Time: 2:15 pm – 2:30 pm

Presentation Overview

The Bioinformatics Core of the ICBR provides bioinformatics services to the large and diverse scientific community of the University of Florida. Routine handling of projects covering a vast spectrum of biological and biomedical research requires a flexible and powerful data infrastructure. Implementation details of a software development environment (Actor) for reliable, reusable, reproducible analysis pipelines will be discussed, as well as insights on managing big data projects in a core setting.

File:Riva-WK03-ISMB16.pdf
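
As a rough, generic illustration of what "reliable, reusable, reproducible" pipeline steps involve in practice -- this is not Actor's API, just a minimal Python sketch with hypothetical tool names and file paths -- one can record the exact command and input checksums alongside every output:

# Generic illustration of a reproducible pipeline step -- NOT the Actor API.
# Assumptions: the tool is on PATH; all file names below are hypothetical.
import hashlib, json, subprocess, time
from pathlib import Path

def sha256(path):
    """Checksum an input file so the run record pins the exact inputs used."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def run_step(name, cmd, inputs, outdir="results"):
    """Run one pipeline step and write a JSON provenance record next to its outputs."""
    Path(outdir).mkdir(parents=True, exist_ok=True)
    record = {
        "step": name,
        "command": cmd,
        "inputs": {str(p): sha256(p) for p in inputs},
        "started": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }
    subprocess.run(cmd, check=True)   # fail loudly if the tool fails
    record["finished"] = time.strftime("%Y-%m-%dT%H:%M:%S")
    with open(Path(outdir) / f"{name}.provenance.json", "w") as fh:
        json.dump(record, fh, indent=2)

# Hypothetical usage: align reads while keeping a provenance record.
# run_step("align", ["bwa", "mem", "ref.fa", "sample_R1.fastq.gz"],
#          inputs=["ref.fa", "sample_R1.fastq.gz"])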


Big Data Panel Time: 2:30 pm – 3:00 pm

Moderator: Madelaine Gogol, Stowers Institute for Medical Research

  • Panel Speaker: Yury Bukhman, Great Lakes Bioenergy Research Center
  • Panel Speaker: Alberto Riva, University of Florida
  • Panel Speaker: Hua Li, Stowers Institute for Medical Research
  • Panel Speaker: Jyothi Thimmapuram, Purdue University

The presenters, panelists, and attendees will explore practical experience with “big data” as well as use of public datasets in a panel discussion. Topics may include accuracy of annotation, trust of data, raw versus processed, data validation, and QC.

Big Compute

Speaker: Sergi Sayols Puig, Institute of Molecular Biology Mainz Time: 3:30 pm – 3:45 pm

Presentation Overview

With a variety of computing infrastructures available, building robust, transferable pipelines can increase utilization of compute resources. NGS analysis pipelines implemented as Docker containers and deployed on a variety of compute platforms (cluster, supercomputer, or workstation) will be discussed.

File:Sergi updated.pdf
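
As a minimal sketch of the portability idea in the abstract above -- the same containerized step runs unchanged on a workstation or inside a cluster job -- the following Python wrapper assumes Docker is installed; the image name and input file are hypothetical:

# Run one containerized NGS step; "ngs-tools:1.0" is a hypothetical image name.
import subprocess
from pathlib import Path

def run_in_container(image, tool_cmd, workdir):
    """Run a command inside a container, mounting the project directory read/write.
    The same call works on a workstation or inside a cluster job script."""
    workdir = Path(workdir).resolve()
    docker_cmd = [
        "docker", "run", "--rm",
        "-v", f"{workdir}:/data",   # mount the project data into the container
        "-w", "/data",              # run the tool from the mounted directory
        image,
    ] + tool_cmd
    subprocess.run(docker_cmd, check=True)

# Hypothetical usage: FastQC on one sample; on a cluster this call would sit
# inside the job script handed to the scheduler.
# run_in_container("ngs-tools:1.0", ["fastqc", "sample_R1.fastq.gz"], "project/")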

Speaker: Jingzhi Zhu, The Koch Institute at MIT Time: 3:45 pm – 4:00 pm

Experiences transitioning a Bioinformatics core from a local to a cloud-based compute solution will be discussed, including the motivation, performance, cost, and issues with deploying bioinformatics pipelines to Amazon EC2 instances.

File:Jingzhi.pdf
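
For a flavor of what deploying pipelines to Amazon EC2 can look like programmatically, here is a rough boto3 sketch, assuming AWS credentials are already configured; the AMI ID, instance type, key pair, and region are placeholders, not details from the talk:

# Launch a compute node for a pipeline run with boto3 (placeholder values only).
import boto3

def launch_pipeline_node(ami="ami-0123456789abcdef0", instance_type="r4.2xlarge"):
    """Start one EC2 instance and return its instance ID."""
    ec2 = boto3.client("ec2", region_name="us-east-1")
    response = ec2.run_instances(
        ImageId=ami,                 # placeholder AMI with the pipeline pre-installed
        InstanceType=instance_type,  # placeholder memory-heavy node type
        MinCount=1,
        MaxCount=1,
    )
    return response["Instances"][0]["InstanceId"]

# Hypothetical usage; terminate the node once results are safely copied to S3:
# instance_id = launch_pipeline_node()
# boto3.client("ec2", region_name="us-east-1").terminate_instances(InstanceIds=[instance_id])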


Big Compute Panel Time: 4:00 pm – 4:30 pm

Moderator: Brent Richter, Partners HealthCare

  • Panel Speaker: Sergi Sayols Puig, Institute of Molecular Biology Mainz
  • Panel Speaker: Jingzhi Zhu, The Koch Institute at MIT
  • Panel Speaker: Sara Grimm, NIEHS

The presenters, panelists, and attendees will discuss how people manage to stay on top of compute requirements for their own sites in a panel discussion. Major hurdles to overcome and the compromises needed for success will be discussed. We may also touch on experiences with containers and portable computing.

We will have a bioinfo-core dinner the night of the workshop, Monday, at 6:30 PM. The dinner will be at Garden Grove, a restaurant in the Swan Hotel.

Discussion with notes

Big Data

Yury Bukhman. The Great Lakes Bioenergy Research Center (GLBRC) is based at the University of Wisconsin-Madison and Michigan State University. It also includes a few groups at other institutions. The informatics core consists of a larger group at UW and a smaller one at MSU. The Center is involved in all aspects of biofuels research, including agriculture and sustainability, biomass processing, and microbial fermentation of biomass-derived hydrolysates to produce fuels and chemicals. The goal is to develop basic science that will enable economical and sustainable production of biofuels and industrial biochemicals in the future.

All groups in the Center are mandated to have a data management plan that's reviewed by the informatics core on a yearly basis. This provides a consulting opportunity for bioinformatics planning and research IT: both to assess the needs of the Center and to suggest existing services to group leads.

The informatics core has implemented a suite of solutions including shared file servers, a SharePoint site, LIMS systems, an omics metadata database called GLOW, and a genome suite called GxSeq. The major LIMS solution at GLBRC is based on the commercial StarLIMS platform. GLOW records metadata on files produced by sequencing centers and by downstream bioinformatics analyses. It also documents bioinformatics workflows. GxSeq was developed by the MSU Group and includes a genome browser and an expression data viewer.

Alberto Riva: The bioinformatics core facility sits within a closely organized group of core facilities serving the life and health sciences. Additionally, they have access to large, shared IT resources with segments set up for their specific use cases (a private area of the large 10,000-core cluster, for example).

What fraction of the University of Florida system uses the core and how is work paid for?

The core is currently fee-for-service, but is moving to a model where work is charged and allocated by level of effort, with longer-term projects covered through full resource allocation to a grant. Regarding the percentage of the university using the core, there is no good measure. Not everyone knows about the core; they do some outreach, but overall it's hard to quantify.

Discussion of overall cost, including data analysis and storage.

One view was that storage is getting cheaper; however, the data itself is still a problem: data is growing faster than storage is getting cheaper. HMS, for example, has hired a data manager who works solely with people to put their data in the appropriate places--cheap archive storage vs. more expensive online high-performance storage.

At Purdue, there is not one single big dataset, but thousands of small datasets. The Purdue core works with users who have varying levels of analytic and IT knowledge. They find that they have to spend time adapting, formatting, and cleaning datasets for analysis, as well as understanding the experimental parameters. Not everyone knows what goes on inside and behind the scenes of the core in performing this work. Users expect the work to be quick, but without prior involvement in developing the experiment, it takes days to get a dataset to a state where it can be run through the analysis. Educating students and users about the data, the dataset, and the analysis is important.

GLBRC faces similar challenges. Its GLOW metadata database attempts to encourage consistent recording of metadata that may be used in bioinformatics analyses. It also makes it easier for researchers to find relevant datasets later.

Collecting metadata for small and large datasets is a big problem, particularly if one wants to combine data across experiments or reuse it in the future; metadata is required to compare different datasets. Additionally, when submitting new data to public repositories, the repositories require a long list of metadata. GLBRC maintains LIMS systems and a metadata spreadsheet that must be filled out to submit datasets to its GLOW omics metadata database. The latter is similar to submitting to GEO and other public repositories, and it encourages investigators to record their metadata at an early stage.
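
As a purely hypothetical illustration of the kind of per-sample record such a spreadsheet or metadata database might capture (the field names and values below are illustrative placeholders, not GLOW's or GEO's actual schema):

# Hypothetical per-sample metadata worth capturing at submission time;
# every field name and value here is an illustrative placeholder.
sample_metadata = {
    "sample_id": "SAMPLE_042",
    "organism": "Saccharomyces cerevisiae",
    "condition": "hydrolysate, 24 h",
    "library_type": "RNA-seq, stranded",
    "sequencing_center": "placeholder sequencing center",
    "run_date": "2016-05-10",
    "raw_file": "SAMPLE_042_R1.fastq.gz",
    "raw_file_checksum": "placeholder-md5-or-sha256",
}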

The biggest challenge for Riva is educating users on how to generate the data--you may have all the big data you want, but if the experiment is not designed properly, there's quite a lot of cruft.

The evolving technology in big data, NGS, and life science is really an evolution in what "big" means. We've always dealt with challenging datasets, but "big data" involves additional, more challenging work on the actual analysis and management processes. The biggest problem is the complexity of the projects, but a larger problem is working with faculty who don't have a lot of money.

Most cores are willing to devote part of their time, pro bono, to generate results for a grant submission. The investigator will then include the data and the cost/effort for the analysis services in the grant.

How do you deal with privacy and security of the data? When thinking about a pipeline, do you take into account what's public vs. private?

Purdue: they download all data into their local environment. Florida: they have the largest southern Florida health center, which works with patient data. To comply with regulations, the research computing group has created a secure area of their cluster for working with this data. It's walled off from external and internal access--i.e., controlled access.

Bottom line, last thought: a core and the personnel within it have to be adaptable in order to understand what is brought to them. No two experiments are alike, and needs continuously change. Trends, technology capabilities, and tools change. Cores need to remain flexible and adapt pipelines, processes, and people.

Big Compute

Do you keep the analysis results in the cloud or download them to local storage? J: For the pilot, they synced data to S3, then downloaded the data back to local storage. Sergi: They use a private cluster; the data is there.

Do you pass costs on to users? J: That's the eventual plan, to pass costs. They might not LIKE it... Haven't done it yet, as this was just a pilot.

On the cloud, you can just use the storage you need.

Is data transfer feasible without a big AWS pipe like MIT has? J: Works well for MIT...

How do you schedule the data to be processed in AWS? J: Just use SGE - the same as a local cluster. Make your cluster size 1000 (or whatever) and run your jobs simultaneously. After the test is finished, move results to S3.
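
A minimal Python sketch of that pattern, assuming an SGE cluster (local or cloud) and the AWS CLI; the job script, sample names, and bucket are placeholders:

# Submit per-sample jobs through SGE exactly as on a local cluster,
# then sync finished results up to S3. All names below are placeholders.
import subprocess

def submit_jobs(samples, job_script="align.sh"):
    """Submit one SGE job per sample via qsub."""
    for s in samples:
        subprocess.run(["qsub", "-N", f"align_{s}", job_script, s], check=True)

def push_results_to_s3(local_dir="results/", bucket="s3://my-bucket/results/"):
    """Copy finished results to S3 using the AWS CLI."""
    subprocess.run(["aws", "s3", "sync", local_dir, bucket], check=True)

# Hypothetical usage, once the cluster is up:
# submit_jobs(["sampleA", "sampleB", "sampleC"])
# ...wait for the jobs to finish, then:
# push_results_to_s3()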

How many users do you support? Answers from all were roughly in the range of 50.

Do your users themselves use Docker or Amazon? Sergi: No. They will sometimes share pipelines with users, but permissions issues crop up... Sara: Users are not sophisticated; they might have trouble using a command line. J: I set up scripts for them sometimes, no AWS yet. Launching things on AWS is kind of difficult.

How do you address training issues with these new tools? Different solutions for different users. Naive users are not going to use these tools yet; more advanced users might start. It depends on them. Sergi: We send them to university training for Unix.

Did you try the pilot with different kinds of nodes? How was error handling? J: Yes, tried different nodes. Error handling is no different locally versus in the cloud.

In a distributed Docker system, how are errors handled? Sergi: Inside or outside a container, errors are really the same. It's an abstraction layer.

How are you becoming more integrated with the technology group, or bringing scientists together with the technology? Many don't care about the technology. They want their raw data and their results or final results. Some groups have their own bioinformaticians - cores work more closely with them, provide code, etc. They are a link to technology groups. Many cores have courses for users to learn to do things themselves.

When using the cloud, don't you still assume you will have to store input/output data locally? Someone will download the data; the end user can download data to analyze. Sergi: We generate tons of data and don't want to move it to Amazon... J: We think there will be a hybrid model. For storage, 1 terabyte is $90 per year, but that includes maintenance. There are different options - S3, Glacier ($70 per TB per year). There will be a hybrid solution for cloud and local compute and storage. The cloud can expand the possibilities of what we can do.
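
A quick worked comparison using the per-terabyte figures quoted above (these are the numbers from the discussion; the archive size is hypothetical, real prices change, and transfer/retrieval fees are ignored):

# Compare the two per-terabyte figures quoted in the discussion:
# one option at ~$90/TB/year (including maintenance) vs. Glacier at ~$70/TB/year.
archive_tb = 50                    # hypothetical archive size in TB

cost_option_90 = archive_tb * 90   # 50 TB * $90/TB/year = $4500/year
cost_glacier = archive_tb * 70     # 50 TB * $70/TB/year = $3500/year

print(f"Option at $90/TB/year:  ${cost_option_90}/year")
print(f"Glacier at $70/TB/year: ${cost_glacier}/year")
print(f"Difference:             ${cost_option_90 - cost_glacier}/year")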

bioinfo-core Dinner

Some members met for dinner at Garden Grove... Including a new honorary member...

File:Goofy small.jpg