ISMB 2016: BioinfoCoreWorkshop

From BioWiki
 
== Big data ==
  
This could cover a number of topics, but it would be interesting to hear about people's experiences working with some of the large publicly available datasets. Many of us must have made extensive re-use of publicly deposited data, and hearing how this has worked out in practice would be valuable. We could look at things such as:
  
* How good and accurate was the annotation? How much did you use or trust it?
  
* What level of data did you take? Do you always go back to raw data, or work with processed data? What are the trade-offs here?
  
* What type of validation, sanity checking and QC do you do? Any horror stories?
  
* Have people compared across different studies, labs or platforms, and how did this work out?
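One lightweight form of the sanity checking asked about above is simply verifying that downloaded files match the checksums published alongside a public dataset before any analysis starts. A minimal sketch in Python (the manifest format and file names here are hypothetical, not tied to any particular repository):

```python
import hashlib
from pathlib import Path

def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 checksum of a file, reading it in chunks
    so large downloads don't have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_downloads(manifest):
    """Check each 'checksum  filename' line of a manifest against the
    files on disk; return (filename, status) pairs for a QC report."""
    results = []
    for line in Path(manifest).read_text().splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        expected, name = parts
        if not Path(name).exists():
            results.append((name, "missing"))
        elif md5sum(name) == expected:
            results.append((name, "ok"))
        else:
            results.append((name, "checksum mismatch"))
    return results
```

A mismatch at this stage is far cheaper to catch than a subtly truncated file discovered halfway through an analysis.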
  
It would be particularly interesting to hear from groups who have had to work with access-controlled data and the extra layers of bureaucracy and technical difficulty this involves. Practical advice and experience for others who are just getting involved in this would, I'm sure, go down well.
  
If there is anyone who could come at this from the other direction, looking at the generation and management of large datasets and how these ultimately integrate into the big public repositories, that would also be interesting. I'm sure the process of data curation and submission on a large scale is fraught with all kinds of problems and compromises which would be worth discussing.
  
 

Revision as of 08:44, 12 May 2016

We are holding a Bioinfo-core workshop at the 2016 ISMB meeting in Orlando, Florida. We have been given a half-day workshop track slot in the program.

Workshop Structure

The workshop is split into four sessions of ~30 minutes each, with a required break between the first and second halves of the meeting.

  • The first half will have two 15-minute talks on the topic of Big Data, followed by a 30-minute panel discussion. After the break we will have two 15-minute talks about Big Compute, followed by a 30-minute panel discussion.

Workshop topics

The workshop will address "The practical experience of big data and big compute". Members of core facilities will share their experience and insights via presentations and panel discussions.

Big data

Speaker: Yury Bukhman, Great Lakes Bioenergy Research Center

Time: 2:00 pm – 2:15 pm

Presentation Overview:

The Computational Biology Core of the Great Lakes Bioenergy Research Center supports mostly academic labs at the University of Wisconsin, Michigan State University and other universities. With a variety of experiment types, they are challenged to manage and analyze disparate data and metadata in a diverse academic environment. Details of these data challenges and solutions will be discussed.

Speaker: Alberto Riva, University of Florida

Time: 2:15 pm – 2:30 pm

Presentation Overview:

The Bioinformatics Core of the ICBR provides bioinformatics services to the large and diverse scientific community of the University of Florida. Routine handling of projects covering a vast spectrum of biological and biomedical research requires a flexible and powerful data infrastructure. Implementation details of a software development environment (Actor) for reliable, reusable, reproducible analysis pipelines will be discussed, as well as insights on managing big data projects in a core setting.

Big Data Panel

Time: 2:30 pm – 3:00 pm

Moderator: Madelaine Gogol, Stowers Institute for Medical Research

Panel Speaker: Yury Bukhman, Great Lakes Bioenergy Research Center

Panel Speaker: Alberto Riva, University of Florida

Panel Speaker: Hua Li, Stowers Institute for Medical Research

Panel Speaker: Jyothi Thimmapuram, Purdue University

The presenters, panelists, and attendees will explore practical experience with “big data” as well as use of public datasets in a panel discussion. Topics may include accuracy of annotation, trust of data, raw versus processed, data validation, and QC.

Big Compute

We have touched on this type of topic before, but as the computational requirements of many projects increase, it would be really interesting to hear case studies of how people stay on top of the compute requirements at their own sites. Discussing the major hurdles to overcome, and the compromises needed to keep as many people happy as much of the time as possible, would also be valuable.

Another aspect of this might be the adaptations needed for a more portable approach to computing. There have been many suggestions that for future large datasets it may be necessary to bring the compute to the data rather than the other way around. For this to work we would need ways to develop independent pipelines and containerise them in systems such as Docker, Galaxy or VMs. We've probably all heard the sales pitches for these technologies, but many of us must have some practical experience here, so real-world examples of how well or poorly this actually works out would be great.
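As a concrete illustration of the containerisation idea, a single pipeline step can be packaged as a Docker image so that the step, rather than the data, is what gets shipped around. This is only a sketch; the tool choices and the `align.sh` script are hypothetical, not a recommendation:

```dockerfile
# Hypothetical image wrapping one pipeline step (an alignment run).
# The image carries its own tools, so the same step behaves identically
# on a laptop, a local cluster, or a machine sitting next to a remote dataset.
FROM ubuntu:16.04

# Install the step's tools and dependencies inside the image
RUN apt-get update && apt-get install -y bowtie2 samtools && \
    rm -rf /var/lib/apt/lists/*

# Copy in the script that runs this one step of the pipeline
COPY align.sh /usr/local/bin/align.sh

# The container does one well-defined job and nothing else
ENTRYPOINT ["/usr/local/bin/align.sh"]
```

The data directory is then mounted at run time (for example `docker run -v /path/to/data:/data step-image`), so the image itself stays independent of where the data actually lives.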