19th Discussion-17 Oct 2017

ISMB workshop recap

Brief recap of the workshop in Prague by members who attended.

The first session was on the business side of running a core. There were two presenters: Russell Hamilton spoke on the first year of setting up a new core, and Annette McGrath on managing people within a core.

Russell Hamilton covered the issues of getting storage and computational resources, selecting pipelining tools, and building the team. Who is the best first hire, and how do you grow the facility? What level of training do you offer, and how do you balance time spent training with time spent doing the analysis itself? How does cost recovery work - can your costs be built into grants prior to analysis?

Annette McGrath talked about managing both the people within the core and the clients, especially their expectations. What deliverables does the core provide, and how do you make sure staff have appropriate training and computational resources? Instill a sense of pride and ownership in core members by giving them projects that stretch their abilities and allowing people to become experts in a particular area. The manager should be a cheerleader both for the people in the facility and for the facility within the wider institution. How do you manage PhD students? Are researchers jointly shared between the core and labs, or are bioinformaticians embedded in labs? Try to build bridges between the core and the wider research institution. How do you advance people and develop a career progression?

A little discussion on training ensued. Brent Richter mentioned they cover the basics - R, unix, Jupyter - letting users do as much as they can by themselves so the core can focus on more complex tasks. Thomas said they train on Galaxy and R 3-4 times a year. Hemant said they offer two formats - three full-day sessions, or two-hour sessions spread over five weeks - covering unix/NGS, QC/trim/align, or RNA-seq/ChIP-seq/ATAC-seq, running 2-3 times a year; they make people apply, to keep it competitive. Alberto mentioned attending the bioinformatics education workshop - some cores don't do training because they have a dedicated program at their institute, while others have to. Who supports these workshops? Teaching benefits the community AND the core, because it forces the core to keep itself up to date with tools. Online resources for training are available - e.g. GOBLET - though Alastair finds public training materials don't always have a narrative to follow. Hemant noted the initial investment in course development is the largest time requirement, but revising it later is easier. They record their sessions and make them available after the fact.

How do we make sure we are learning new tools and keeping up with new technologies? Conferences, journal clubs, and embedded bioinformaticians were all mentioned, but there is not much time to properly benchmark new tools, so there is sometimes a tendency to stick with the status quo. Comparative papers can be useful if there are clear front-runners. Many end users are less experienced, don't tend to ask why a particular tool was used, and just take a report as-is. Many packages have so many options that, once an analyst has learned all the flags, switching to a new tool is a big investment.

The second ISMB session was about reproducibility. Alberto mentioned that there are two aspects: internally being able to reproduce something, and providing information to the external community about an analysis. Can we repeat this in 6-12 months and get consistent (if not identical) results? Funding agencies and journals will increasingly require it.
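
Not something presented on the call, but as a minimal sketch of the "internal" side of this - recording enough provenance alongside a report that an analysis can be rerun, or at least explained, months later - a core might drop something like the following into a pipeline. The output file name, parameters, and package list are placeholders; the exact scheme would vary per core.

  import json
  import platform
  import sys
  from datetime import datetime, timezone
  from importlib import metadata

  def write_provenance(params, packages, out_path="provenance.json"):
      """Record parameters, interpreter/package versions, and a timestamp so an
      analysis can be rerun (or at least explained) 6-12 months later."""
      record = {
          "timestamp": datetime.now(timezone.utc).isoformat(),
          "python": sys.version,
          "platform": platform.platform(),
          "parameters": params,
          "packages": {},
      }
      for pkg in packages:
          try:
              record["packages"][pkg] = metadata.version(pkg)
          except metadata.PackageNotFoundError:
              record["packages"][pkg] = "not installed"
      with open(out_path, "w") as fh:
          json.dump(record, fh, indent=2)
      return record

  # Hypothetical parameters for an RNA-seq run; a real pipeline would pass its own.
  write_provenance({"aligner": "STAR", "genome": "GRCh38"}, ["numpy", "pandas"])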

Phil Ewels, the MultiQC developer, talked about his modular approach: new modules can be developed, and MultiQC is easy to use. It is a more generic platform for QC. Leonard Ovitz discussed their internal QC tools, which are really tailored to their environment.
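
For context, MultiQC is normally run as a single command over a directory of tool outputs at the end of a pipeline. A minimal sketch, assuming MultiQC is installed and using a placeholder results directory, might look like:

  import subprocess

  def run_multiqc(results_dir, report_dir="multiqc_report"):
      """Aggregate FastQC, alignment, and other tool logs under results_dir into
      a single HTML report. Assumes the `multiqc` command is on the PATH."""
      subprocess.run(
          ["multiqc", results_dir, "--outdir", report_dir, "--force"],
          check=True,
      )

  run_multiqc("results/")  # hypothetical pipeline output directory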

Technical aspects of reproducibility were discussed - how long do you keep data, and who owns it? What priority does reproducibility get? Do users pay for it? It is essential for the reputation of the core, but some users need to be educated on why it is important. Sometimes building a proper pipeline takes longer than a quick hack. Some use conda; some are not allowed to use Docker. Matt suggested converting Docker images to Singularity if Docker is not an option (see the sketch below). Hemant said Google Compute does allow Docker; Matt said they had done that, but it's awkward as they don't have a cloud budget.
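
As a rough sketch of the Docker-to-Singularity route Matt mentioned (assuming Singularity 3.x is available on the cluster; the image name below is a placeholder, not one anyone named on the call):

  import subprocess

  def docker_to_singularity(docker_image, sif_path):
      """Build a Singularity image file (SIF) from an existing Docker image, for
      clusters where Docker itself is not allowed. Needs the `singularity` CLI."""
      subprocess.run(
          ["singularity", "build", sif_path, f"docker://{docker_image}"],
          check=True,
      )

  # Placeholder image name - substitute whatever container your pipeline uses.
  docker_to_singularity("your-registry/your-pipeline:1.0", "pipeline.sif")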

Deliverables and returning results to end users

Alastair kicked off the discussion about deliverables. It sounds like many groups are increasingly using R notebooks, Shiny apps, and interactive tools to return results. Rmarkdown is nice for training - press play on a code block. Shiny apps in use include substitutes for Excel with linked tables, plots, and documentation, and ways to let users tweak a figure themselves to avoid the analyst making 400 versions of the same figure for publication. Matt's group had one that supplemented a pipeline with a few options to delve into a gene-centric view, for instance. Apps are very specific to a given data type. They have been asked to write portals to large data resources, but this could spiral out of control if it becomes a service; not every core may have the resources for that.
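
None of the apps mentioned above were shown in detail. As a Python analogue of the "let the user tweak the figure themselves" idea - using matplotlib's built-in Slider widget rather than Shiny, with random data standing in for real results - something like this illustrates the pattern:

  import numpy as np
  import matplotlib.pyplot as plt
  from matplotlib.widgets import Slider

  # Stand-in data (random), shaped like a volcano plot: log2 fold change vs -log10 p.
  rng = np.random.default_rng(0)
  log2fc = rng.normal(0, 2, 500)
  neglogp = np.abs(rng.normal(2, 1, 500))

  fig, ax = plt.subplots()
  plt.subplots_adjust(bottom=0.25)  # leave room for the slider
  points = ax.scatter(log2fc, neglogp, c=(np.abs(log2fc) > 1).astype(float), cmap="coolwarm")
  ax.set_xlabel("log2 fold change")
  ax.set_ylabel("-log10 p-value")

  slider_ax = fig.add_axes([0.2, 0.1, 0.6, 0.03])
  cutoff = Slider(slider_ax, "FC cutoff", 0.0, 4.0, valinit=1.0)

  def update(val):
      # Recolour points as the user moves the threshold, instead of asking the
      # analyst for yet another version of the figure.
      points.set_array((np.abs(log2fc) > cutoff.val).astype(float))
      fig.canvas.draw_idle()

  cutoff.on_changed(update)
  plt.show()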

Sabbaticals (brief discussion)

Sabbaticals were just touched on at the end. The problem raised is that many cores are stretched as it is and may not be willing to give up anyone. We may revisit this topic next time.

Anyone with suggestions for the next call should feel free to post them to the mailing list.