ISMB 2016: BioinfoCoreWorkshop
We are again proposing to hold a Bioinfo-core workshop at the 2016 ISMB meeting in Florida. We have been given a half-day workshop track slot in the programme (similar to the last couple of years) and need to organise the content to fill this.
The structure and content suggestions below are NOT set in stone - they are simply intended to provide a framework for subsequent discussion and organisation.
Workshop Structure
The workshop is nominally split into 4 sessions of ~30 mins each in the programme. We can merge or split these as we see fit, but there is a mandatory break between the first and second half of the meeting as this is a break in the overall programme.
In previous years the structure of the workshop has been a set of moderated discussions, preceded by very short introductory talks. These have been successful and we want to maintain the group discussion aspect of the workshop, but also make room for slightly longer form talks so that people can give more detailed descriptions of their views and experience.
The suggested format of the workshop is therefore that it be split into two:
- The first 2 slots would be for individual speakers to present slightly longer form talks to share their experience. The number of talks is somewhat flexible, but it would make sense for it to be either 2 x 30 mins or 4 x 15 mins.
- After the break, the second 2 slots would be a moderated, panel-led discussion where the topic could be opened out for wider discussion with the group.
Workshop topic
The suggested topic for the workshop is "The practical experience of big data and big compute". This is obviously a very broad topic and gives us substantial scope for flexibility, but is also something which is relevant now, and a topic where members of core facilities undoubtedly have a lot of experience to share.
The big areas which it would be nice to cover would be:
Big data
This could cover a number of topics, but it would be interesting to hear of people's experiences working with some of the large publicly available datasets. Lots of us must have made extensive re-use of publicly deposited datasets, and it would be interesting to hear how this has worked out practically. We could look at things such as:
- How good and accurate was the annotation? How much did you use or trust this?
- What level of data did you take - do you always go back to raw data, or work with processed data? What are the trade-offs here?
- What type of validation, sanity checking and QC do you do? Any horror stories?
- Have people compared across different studies / labs / platforms, and how did this work out?
It would be particularly interesting to hear from groups who have had to work with access-controlled data and the extra level of bureaucracy and technical problems this involves. Some practical advice and experience for others who might just be getting involved in this would, I'm sure, go down well.
If there is anyone who could come at this the other way - looking at the generation and management of large datasets and how these ultimately integrate into the big public repositories, this would also be interesting. I'm sure the process of data curation and submission on a large scale is fraught with all kinds of problems and compromises which would be worth discussing.
Big Compute
Again, there are a number of ways we could come at this. We have mentioned this type of topic before, but as the computational requirements for many projects increase, it would be really interesting to hear some case studies of how people manage to stay on top of the compute requirements for their own sites. It would also be interesting to discuss the major hurdles to overcome and the compromises that need to be made to keep as many people happy for as much of the time as possible.
Another aspect of this might be the adaptations needed to have a more portable approach to computing. There have been a lot of suggestions that for future large datasets it might be necessary to bring the compute to the data rather than the other way around. For this to work we would need ways to develop independent pipelines and package them in systems such as Docker, Galaxy or VMs. We've probably all heard the sales pitches for these technologies, but there are probably many of us with some practical experience here, so having some real-world examples of how well or poorly this actually works out would be great.