ISMB 2018: BioinfoCoreWorkshop

Workshop Overview

The bioinfo-core workshop is scheduled for Saturday, July 7, 2018, from 2:00-4:00 pm in Columbus EF at the Hyatt Regency in Chicago.

The bioinformatics core workshop is organized by practitioners and managers of core facilities for all members of core facilities, including scientists, engineers, analysts, and operations and management staff. In this 15th year of bringing the core community together at ISMB, we will explore three topics relevant to bioinformatics core facilities in depth, through lightning talks that broadly survey each area, followed by small-group break-out discussions whose insights are brought back to the full audience for further discussion and knowledge sharing.

Organizers:

  • Madelaine Gogol, Stowers Institute, United States
  • Hemant Kelkar, University of North Carolina, United States
  • Alastair Kerr, University of Edinburgh, United Kingdom
  • Brent Richter, Partners HealthCare of Massachusetts General and Brigham and Women’s Hospitals, United States
  • Alberto Riva, University of Florida, United States

Part A: Strategies for Hiring, Recruiting, and Interviewing new bioinformaticians

Methods to find, interview, and hire highly successful staff and bioinformaticians for a core facility. Speakers will share experiences and challenges, including finding and hiring people, interview techniques and questions, and best practices for recruiting candidates.

Part B: Containerization, Clouds, and Workflows

Topics to be covered include cloud infrastructure recommendations and limitations, key datasets of value hosted in the cloud, containerization technology that works and workflow tool development and results.

Part C: When good experiments go bad: Negotiating experiment quality failures

A non-exhaustive survey of methods and successes in detecting failures and exploring guidelines for terminating bad projects.

Part D: Small group discussion

During this longer session, audience members will divide into groups based on their own interests. Each group will distill its main takeaway points and bring them back to the full audience for knowledge sharing and further discussion. Topics may include all of the previous presentation areas as well as other areas of interest to running or working within a bioinformatics core facility, such as single-cell analysis or long-read analysis.

Time Title Authors
2:00 PM - 2:08 PM Bioinformatics Core Staffing (slides) Sara Grimm, NIEHS, United States
2:08 PM - 2:16 PM Characteristics of a highly successful candidate and how to find them Brent Richter, Partners HealthCare, United States
2:16 PM - 2:24 PM nf-core: community-driven best-practice Nextflow pipelines (slides) Alexander Peltzer, Quantitative Biology Center, Tübingen, Germany
2:24 PM - 2:32 PM Data Science in the 21st Century: Streaming Public Data into Containerized Workflows (slides) Ben Busby, NCBI, United States
2:32 PM - 2:40 PM Shesmu - An analysis orchestration system designed for FAIR standards and the GA4GH cloud ecosystem (slides) Lars Jorgensen, OICR, Canada
2:40 PM - 2:48 PM A (Fire)Cloud-Based DNA Methylation Data Preprocessing and Quality Control Platform (slides) Divy Kangeyan, Harvard University, United States
2:48 PM - 2:56 PM Usability of Marginal Data (slides) Jyothi Thimmapuram, Purdue University, United States
2:56 PM - 3:04 PM Experimental Failures (slides) Krishna Karuturi, The Jackson Laboratory, United States
3:04 PM - 3:20 PM Small Group Discussions
3:20 PM - 4:00 PM Groups report the insights from their small-group discussions back to the full audience


Notes (feel free to contribute or modify)

Sara Grimm: Bioinformatics Core Staffing. They support 60 labs using an embedded support model: staff are assigned to a particular lab. Mentioned needing soft skills to communicate and manage expectations. Hiring is done by a contracting agency, so they have little control, and there is local competition. They get applicants from the life sciences or sometimes mid-career from IT. They want someone comfortable at the command line with at least one programming language and conversant in basic biology. They include a scientist on the interview panel, and make sure being in a core is a good fit with the candidate's career goals.

Brent Richter: Showed a (complex) org chart. Ideally you want someone who will stay for 2 or more years. Recommended keeping the job description fresh and detailed; Google similar positions and see what those descriptions are like. Offer learning opportunities and define responsibilities clearly; don't be overly general. What bigger areas can the position grow into, who will they report to, how will their career goals be supported? This is a good chance to clarify the scope of the position. During the first 90 days, give immediate feedback/praise/criticism, and check in for 5 minutes weekly: do they need anything? The yearly review is an opportunity to re-recruit high performers, or to provide constructive criticism if someone is struggling.

Alex Peltzer: nf-core. Diverse, big, error-prone data; large-scale projects that integrate old with new data. Nextflow allows fast prototyping, task composition, parallelization, and containerization. http://nf-co.re collects pipelines (Nextflow, MIT license, Docker bundled) with continuous-integration testing and stable release tags. A cookiecutter skeleton is available for new pipelines, plus a Gitter channel.

Ben Busby: WE CAN SAVE COMPBIO by submitting data FOR biologists. Lots of free cloud out there; just call it 'education'. Docker vs. Singularity: Singularity runs with the user's own permissions (doesn't require root), which may be more comfortable for IT. An antibiotic-resistance pipeline simple enough for college juniors; a prokaryotic genome pipeline; a Nanopore pipeline simple enough for high schoolers. ATACflow, a Jupyter notebook where you just "press the triangles". Mentioned Google Colaboratory.

Lars Jorgensen: Shesmu. They get the samples nobody else wants to sequence, so there is no "standard" pipeline. Niassa (a SeqWare fork) uses "deciders", but the infrastructure is troublesome: hard to write and debug, with large memory requirements. Shesmu is a decider server; "olives" determine what actions to take. It is stateless, so it recovers nicely if the server dies, and olives file Jira tickets if something is missing. Good: unified interface. Bad: they are now maintaining a compiler. oicr-gsi/shesmu.

Divy Kangeyan: FireCloud for scalable genomic analysis. They need something scalable and reproducible, with access to public data and best practices; mainly applied to methylation data so far. R and scmeth; WDL glues the tools together. Lots of QC: read coverage, CpG coverage, CpG density, M-bias plots.
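As an aside, one of the QC metrics mentioned above, CpG density, is simple to compute. A minimal illustrative sketch (not taken from scmeth or the FireCloud platform) of per-window CpG density over a sequence:

```python
def cpg_density(seq, window=100):
    """Fraction of possible CpG dinucleotides observed in each
    non-overlapping window of the sequence (max possible = window/2)."""
    seq = seq.upper()
    densities = []
    for start in range(0, len(seq) - window + 1, window):
        w = seq[start:start + window]
        # Count CpG dinucleotides (a "CG" at any position in the window).
        cpg = sum(1 for i in range(len(w) - 1) if w[i:i + 2] == "CG")
        densities.append(cpg / (window / 2))
    return densities
```

Real methylation QC would work over a reference genome per fixed genomic bin, but the idea is the same.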

Jyothi Thimmapuram: Usability of marginal data, i.e., data close to the lower limit of qualification, barely exceeding the minimum requirements. How do you use it? Why did it fail? Experimental design failures: insufficient replicates, wrong type of reads, too few reads. Contamination: sample mix-ups, contamination during sample processing or library prep, mistakes in protocol, or sequencing machine failures. Plant DNA can confound studies of bacterial endophytes. A WT/mutant experiment where the SNPs are the same in ALL samples, hmmm. Repurpose the data if possible; you might still be able to address some questions or give the lab some useful information, e.g., a transcriptome assembly instead of differential RNA-seq. You can still learn SOMEthing. There can also be data analysis failures, which are often fixable: wrong reference genome, wrong analysis methods, how missing data was handled. And data interpretation failures, such as not doing multiple hypothesis testing.
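On the multiple hypothesis testing point: a hedged, minimal sketch of the standard Benjamini-Hochberg FDR correction (not any particular core's code; in practice one would use `statsmodels` or R's `p.adjust`):

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values (q-values),
    in the same order as the input list."""
    n = len(pvals)
    # Indices of p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    prev = 1.0
    # Walk from the largest rank down, enforcing monotonicity:
    # q_i = min(q_{i+1}, p_i * n / rank_i).
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        q = min(prev, pvals[i] * n / rank)
        adjusted[i] = q
        prev = q
    return adjusted
```

For example, `benjamini_hochberg([0.001, 0.02, 0.5])` yields approximately `[0.003, 0.03, 0.5]`.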

Krishna Karuturi: Not really a *fun* topic, but an important, sticky one. They have 100 labs. Prevention is better than cure: experimental failures affect relationships with labs, and timing. What we really NEED are superheroes, but... They do a multi-point QC inspection. When an experiment fails, do a design review and figure out why, or where they could have caught it, then decide whether it's a drop/no-drop situation. In the case of confounding batch effects, if the biological effect being tested is much larger than the batch effect, perhaps proceed with caution. Limit "free" time for projects, or they will lag and drag on.

Hiring/Interviewing small group: "Lock them in a room" with a competency test, something they would have to do on the job. Give them 40 minutes for a task that might take 25 minutes. Examples were fixing a broken script (multiple languages available) or writing an email (maybe that's for a different type of job, but something like that). Would you want to go on a camping trip with them? Coffee or lunch with the group can be a good enticement: it gives them a feel for the group, and gives you a feel for how they interact. Send an RNA-seq analysis ahead of time, have them run it, and present the results at the interview. Most people hadn't had any formal recruitment training. Entry-level hiring is easier, but those hires may only stay 2-3 years. Is there a track to PI level? Some places have that. Recruits should have a presence on GitHub. Have them provide examples of how they solved a problem and search for evidence of their autonomy and ability to teach themselves new things. Hiring is part vetting and part seduction: how do you make the job attractive and sell it? Work for the common good? Cross-training available within the group. Put a link to your group website in the job description, and then ask in the interview if they've been to your website. Bad sign if they haven't.


Workflow small group

Cromwell: The Broad's Java-based pipeline manager for their WDL ("Widdle") pipeline language. CWL is available as an option in the languages section of the configuration.

cwl-airflow: A CWL pipeline manager built on Apache Airflow, which was originally developed by Airbnb. Python-based, but it seems to require several out-of-date packages, so a virtualenv is recommended. Appears to wrap cwl-runner and cwltool.

Nextflow: Groovy-based language. There appears to be a prototype CWL-to-Nextflow converter: https://www.nextflow.io/blog/2017/nextflow-and-cwl.html. The converted pipelines seem to use an S3 bucket, so an AWS account and the AWS command-line tools are required.

nf-core: A community effort to collect curated Nextflow pipelines. As of August 9th, 3 released pipelines and 7 in development.


Nextflow supports Docker and Singularity container technologies.

This, along with integration with the GitHub code-sharing platform, allows you to write self-contained pipelines, manage versions, and rapidly reproduce any former configuration. It provides out-of-the-box executors for the SGE, LSF, SLURM, PBS, and HTCondor batch schedulers, and for the Kubernetes and Amazon AWS cloud platforms. Nextflow is based on the dataflow programming model, which greatly simplifies writing complex distributed pipelines: parallelisation is implicitly defined by each process's input and output declarations, so the resulting applications are inherently parallel and can scale up or scale out transparently, without having to adapt to a specific platform architecture. All intermediate results produced during pipeline execution are automatically tracked. Nextflow is stream-oriented, extending the Unix pipes model with a fluent DSL that lets you handle complex stream interactions easily.
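To make the implicit parallelism concrete, here is a minimal hypothetical Nextflow script in the 2018-era syntax (the file glob and process name are invented for illustration): each file emitted by the channel becomes an independent, parallel task, with no explicit parallel code.

```nextflow
// Hypothetical example: count lines in each FASTQ file,
// one parallel task per file, driven only by the channel declarations.
Channel.fromPath('data/*.fastq').set { reads_ch }

process countLines {
    input:
    file reads from reads_ch

    output:
    stdout into counts_ch

    """
    wc -l < ${reads}
    """
}

counts_ch.subscribe { println it }
```

Swapping the local executor for SGE, SLURM, or AWS is a configuration change, not a script change.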


This group is interested in a communication mechanism, maybe a Slack channel or maybe part of Biostars?

Experimental failures small group: Have evidence before pointing fingers... Some groups GoPro-film a sample prep, can you imagine. Meet jointly with PIs. Force people to fill out a metadata spreadsheet or form before the project? Earlier involvement of analysts is better. Some places have a conflict-resolution mechanism between groups.

Ideas for group operations to follow up on:

Cloud/workflow discussion again?

More time? Ask for 3 hours next time. One comment was that 8 minutes per talk is too short, but some people liked it.

4-5 small groups might be better.

I'd love to have some topics we decide, but then allow people to submit posters and select talks from the submitted posters. It sounds like we can get in on a monetary poster prize via ISCB. We could also seek our own sponsors and use the money for travel fellowships; ISCB will be our bank and hold funds in escrow.

Do we need to alter our communication strategy? A slack channel was requested by a few people. ISCB may give us access to Zoom for conf calls.

We will try to get the details on who checks the bioinfo-core box when registering... Right now (supposedly) ISCB will add those people to our mailing list, but we have not confirmed that this is happening. It would be good to verify.

Regarding workflows (edit: Alastair), my issues are:

Most tutorials online are geared toward a single user setting up the workflow software in a user account; not many are sysadmin-friendly.

   • How to configure a pipeline for all users
   • How/where is it best to store pipelines and make them accessible to users
   • How bad are the Docker vulnerabilities?
   • How to configure Dockerised pipelines properly
   • How to convert Docker pipelines to Singularity (with and without AWS)
   • Easiest pipelines to create from scratch
   • Resources per pipeline language? Such as YAML files for each tool?
   • Running the pipeline manager: once per user? How to enforce port usage? If run once per server, how to expose only each individual user's outputs?