ISMB 2015: BioinfoCoreWorkshop



Our proposal for this year is to have a single unifying topic for the entire session: "The evolving relationship between core facilities and researchers". This will be divided into sub-topics, but each of these will look at how the role of core facilities is changing in response to the increased prevalence of bioinformatics knowledge within the wider research community, and the introduction of many more dedicated bioinformaticians into research groups. We hope that this discussion will attract people from both the core facility and research group sides, as this changing relationship will affect both parties.

The more detailed structure will be broken down into four somewhat overlapping areas. Each of these will be introduced by a different speaker with a short presentation and will be followed by a moderated group discussion.

Topic 1: The role of core facilities when everyone is a bioinformatician

  • Speaker : Davide Cittaro
  • Moderators : Simon Andrews and Matthew Eldridge

Whilst bioinformatics used to be the preserve of dedicated bioinformaticians a modern research group will now often have a significant amount of bioinformatics expertise within its staff. This will range from wet lab biologists who would like to be responsible for the analysis of their own data to dedicated embedded bioinformaticians who can find enough work from the output of a single group to fill their time. In this environment the role played by core facilities must necessarily change. Their traditional role as the analysis hub for a set of research groups must give way to a broader view of how they can help to support the more diverse range of informatics activities happening within a research institution. This session will look at the ways in which different core facilities have adapted to these changes and try to look at how their role will change further in future.

Topic 2: Bioinformatics core facilities as service providers

  • Speaker : Sven Nahnsen
  • Moderators : Simon Andrews and Matthew Eldridge

One of the growing roles for core facilities is to act as central service providers for routine large scale analyses or data stores. In this session we will look at how much it is possible to automate routine analyses and how much of a standard analysis pipeline can be treated in this way. We will aim to go further though and explore how core facilities can remain relevant and stay on top of the latest developments rather than being constrained as high volume service providers.

Topic 3: Maintaining a publicly used analysis infrastructure

  • Speaker : Madelaine Gogol
  • Moderators : David Sexton and Brent Richter

When a large proportion of the research staff in an institution want to be able to undertake bioinformatics analyses on large data sets, it makes sense to have a centralised computing resource on which to run them, and the management of these resources is generally falling into the hands of bioinformatics core facilities. We will look here at the ways in which different sites have chosen to make their compute infrastructure more widely available, and how they have tackled the problems which this has thrown up.

Topic 4: The business of core services

  • Speaker : Jim Cavalcoli
  • Moderators : David Sexton and Brent Richter

Many core facilities operate on a cost recovery basis, and the most common method of recharging has been based on the number of hours of analyst time spent working on specific projects. In an era where the core facility is less visible as a front line analysis service and spends more of its time maintaining infrastructure and services, how do cores continue to recoup their costs? We will look at the different funding models which are being used and will discuss the fairest and least burdensome ways of recharging and how to communicate these costs to end users.

Discussion (unordered notes on session)

Cittaro: Core services: Reward bioinformaticians (Nature 520, 151-152): Cores do real science, but the core rents out the bioinformatician. In Cittaro's core, some of the people are dedicated to a specific PI; these folks are usually split 80/20 or 70/30 between PI work and core work. It is beneficial for people to be members of a core, which generates a critical mass of knowledge and diversity, rather than working alone. How does authorship work? Cittaro's core participates in scientific design and, even if they charge for the collaboration, they obtain authorship and use this as one metric of performance. As another metric, they also run surveys to get feedback.

In Eldridge's institution they are seeing big changes, with bioinformaticians moving into wet labs and wet lab scientists learning more techniques. They are finding that they are doing more training. Andrews: use the core as a meeting place to share knowledge. How do you deal with the fact that bioinformatics is so broad: proteomics, genetics, etc.? Cittaro's core only deals with NGS given the focus of the institution. It also tries to find individuals who have complementary experience (metabolomics for example) and train them in the specifics of NGS.

     Some cores try to understand external groups' expertise and direct new users' questions to those groups: biostats questions to the biostats group, for example.

Students: There are different models: PhD students who spend a part of their time in the core to learn, others who hold both a core and a faculty appointment and maintain a K award for training, etc. Sometimes this works, but supervisors are mostly outside of the core. One core has hired a dedicated person who supervises/manages PhD students, funded by the institute. Interesting model: the core facility is seen as a center of bioinformatics expertise that can train students; professors "attach" students to the core and have argued for the institution to fund a dedicated supervisor. Stowers has an exchange program with the University of Oregon Master's program, which sends students for 9 - 12 months. Core staff need to understand that it's a training situation, not that the student drives Core personnel. Internships are problematic: ramp up is needed to make a 3 month intern productive.

There is a range of core sizes. Demand, no matter how large the core, is still an issue.

Data integration: internal cluster and virtualization. This field should be differentiated into 2 areas, data processing and data analytics. Processing and data management can be standardized, but analytics is custom. Data integration is a critical area moving forward. In the university environment there is no centralized mechanism for data management, metadata, etc.; all investigators have their own. From the industry perspective, they try to enforce standards and supply tools and SOPs to do so. The tranSMART initiative is gaining adoption in much of industry and at the institutional core level.

Nahnsen: launch a discussion into how much service should be done by a core facility vs. pure service. One has a business model of some kind: market, scale out/up, grow the team. But in reality a core, which is embedded in an academic environment, has to be flexible and does not operate at a mature or production scale. Need to keep an understanding of the forefront of the needs and provide democratic support, both to those with funds and those without. The core can be involved in many aspects of the research: design, processing, data management, analysis and interpretation. Are bioinformatics core facilities only service providers, or should they be? YES: expert services are a product. NO: fundamental research, and need to be at the forefront. Or BOTH: automate as much as possible and provide a service, and also work on "discovery" of new techniques.

  What can be standardized? It's difficult. Cittaro is ISO certified for standards, but documents changes to the standard processing pipeline.
  Can someone "sell" scientific contributions, and how should authorship be dealt with? There are differences: for a standard service, expect an acknowledgment. If a core provides experimental design, analysis development, etc., then expect authorship. The argument that they pay for the person/service and therefore do not need to put a core person on a paper does not hold water, because that same PI pays the salary of their grad student and they are on the paper.
      The important point is that the expectation has to be set up front and can vary over the course of the project.

Is a one person facility sustainable? Yes. It all depends upon scheduling and turning away excess demand.

The more time a group spends pushing and developing standardized services, the less time it has to keep up with leading edge techniques and development, which may be pushed out to the individual research groups. How does one balance the two?

  The majority of the work of Cittaro's core is not the standardized things; it's more of the custom work.
  Where do you put your effort? Sensing the trends and directions is important. Sink effort into the major trends.
  RNA-seq is a generalized pipeline. Most other analyses are custom.
  How does a core participate in institutional strategy regarding future needs? Or how to predict where analysis will come from? Easier to stay connected in smaller institutions.

Notes from second session (topics 3 and 4)

Madelaine Gogol - maintaining a compute infrastructure.

Getting an increasing influx of less experienced naive users who don’t know how to monitor the load they’re putting on a compute infrastructure. Can be a worry for the core.

Most Stowers servers/clusters are open to everyone - some are dedicated to particular groups which have a specific requirement for high throughput and availability. Some have a training requirement. For the rest only informal rules (don’t use more than half of the cores / memory on a server). They can ask for more.
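As a rough illustration of the informal half-of-the-server rule above, a user could check a planned job before launching it. This is only a sketch: the function name `within_fair_share` and its parameters are hypothetical and do not describe any tool used at Stowers.

```python
import os


def within_fair_share(requested_cores, total_cores=None, max_fraction=0.5):
    """Return True if a job asks for no more than max_fraction of the cores.

    total_cores defaults to the number of cores on the current machine.
    The 0.5 default mirrors the informal "half a server" rule (an assumption
    about how such a rule might be encoded).
    """
    if total_cores is None:
        total_cores = os.cpu_count() or 1
    if requested_cores < 1:
        raise ValueError("requested_cores must be at least 1")
    return requested_cores <= total_cores * max_fraction
```

On a 16 core server this accepts a job using 8 cores and rejects one using 9. A similar check could be written for memory.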

Teach people about htop etc.

Informal policing by email.

Have problems with poorly designed programs which take up as many resources as they can. Need to monitor to see load, but not simple to translate into the amount of pain felt by analysts.

Have mailing list where server / cluster problems can be discussed and use that to get a feel for when more resources are required.

Other things core handles:

  • Software installations (including multiple versions)
  • Public sequencing data caches
  • Galaxy
  • RStudio server

Galaxy and RStudio avoid direct server usage while still allowing biologists to do some analysis.

Try to make people community minded.

Tracking versions of genomes, software, data etc.


Lots of groups have handed over hardware to another group but some still maintain their own. Cloud is also an option.

Many people face a problem where people want to continue an old analysis and need to find a way to freeze an analysis and pass it over in a structured fashion.

Documentation is the key, for pipelines you need checkpointing and log files to record exactly what you did so you can restart it.
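A minimal sketch of what checkpointing plus a log file might look like is given below. It assumes a pipeline can be expressed as an ordered list of named steps; the function `run_pipeline`, the JSON checkpoint format and the log line layout are all illustrative choices, not something described in the session.

```python
import json
from datetime import datetime
from pathlib import Path


def run_pipeline(steps, checkpoint_path, log_path):
    """Run (name, function) steps in order, skipping steps already completed.

    Completed step names are stored in a JSON checkpoint file, and every
    action is appended to a log file so the run can be reconstructed later.
    Returns the names of the steps executed during this call.
    """
    checkpoint = Path(checkpoint_path)
    done = json.loads(checkpoint.read_text()) if checkpoint.exists() else []
    executed = []
    with open(log_path, "a") as log:
        for name, func in steps:
            if name in done:
                log.write(f"{datetime.now().isoformat()} SKIP {name}\n")
                continue
            log.write(f"{datetime.now().isoformat()} START {name}\n")
            func()
            done.append(name)
            # Save progress after every step so a crash loses at most one step.
            checkpoint.write_text(json.dumps(done))
            log.write(f"{datetime.now().isoformat()} DONE {name}\n")
            executed.append(name)
    return executed
```

If a step raises an exception, the steps before it stay recorded, so calling `run_pipeline` again with the same checkpoint file resumes from the failed step.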

Extension of existing analysis is more difficult. You can put outputs up front, but don’t just allow follow-on questions without limit.

Can use interactive visualisation and analysis programs to pass results back to users. These could be client side programs like SeqMonk, or server side like Shiny. Even having better documentation with systems like knitr can make it easy to pick up and continue or do minor variants of an analysis.

What controls do you have for new users on a cluster?

For new users can do miniature workshops before users are allowed to use the cluster.

Stowers does a training course once a year which isn’t required but is recommended. For queued clusters they do have informal training upon login creation at the admin's desk.

Unix introduction courses are fairly common. Most people on the cluster start out with easy and lightweight jobs. Most early big jobs end up being errors.

Have worked with informatics group to try to get an environment where most (80%) of the packages they need are already present to make things easy to start with. Because they’re tied to a VM they can’t mess anything up.

Storage is also an issue. Scratch is free, but they pay per annum for everything else.

There are often experienced users who don’t need much training. They can’t install software but can use what they need. Workspaces are deleted after 60 days. Also have pipelines set up so users can use these very easily from the web interface.

Can be a problem convincing IT that more resources are required. IT see that clusters are empty at weekends so they assume that they are still OK.

Often competing with other groups with their own

Internal clusters can be useful. It’s painful to use many generic clusters as they won’t have the right packages installed, and will have stringent resource limits so that jobs can fail for obscure reasons. Having a dedicated bioinformatics cluster with all of the right software installed and with permissive policies for resources so that common jobs work cleanly is a big win. Using pipelining systems such as Cluster Flow, which take care of resource allocation, job submission and tracking, makes things really easy for inexperienced users.

Very few people have taken the plunge to move to cloud services - lots have played, none have stuck.

Jim Cavalcoli - The Business of core facilities

Need to look at what your niche is, are there competitors and do you want to compete with them.

Identify where you’re going to get the support you need to pay for salaries and hardware. Having a supportive facility is useful, and full cost recovery is still not very viable for a lot of academic services.

Need to try to gauge what the level of satisfaction with your service has been.

Can define standard services, aligning reads, calling variants etc and need to think about how much time you put in to make a robust pipeline vs doing it ad hoc. Great if you can put together a per-sample cost.

Scientists don’t like to hear that a project takes X hours, but they’re fine with $X per sample since that’s a metric they’re used to. Bioinformatics isn’t a per-sample service, but if you can package it that way then it can be easier to charge for.

Has spent a lot of time working out how to estimate cost/time for novel projects. High proportion of projects fall into this category and useful to split the costing into the standard / non-standard parts so that it’s clear where the costs are going. Have an agile process where they do a meeting to estimate costs. Has been working really well.
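The split between standard and non-standard costing could be sketched as below. The function name, the parameters and all rates are hypothetical; the notes do not record how any core actually computes its quotes.

```python
def estimate_project_cost(n_samples, per_sample_rate, custom_hours, hourly_rate):
    """Split a quote into a per-sample (standard) part and an hourly (custom) part.

    All rates are placeholders supplied by the caller; nothing here reflects
    real pricing.
    """
    if n_samples < 0 or custom_hours < 0:
        raise ValueError("n_samples and custom_hours must not be negative")
    standard = n_samples * per_sample_rate
    custom = custom_hours * hourly_rate
    return {"standard": standard, "custom": custom, "total": standard + custom}
```

For example, `estimate_project_cost(12, 50.0, 10, 80.0)` gives 600.0 for the standard part, 800.0 for the custom part and 1400.0 in total, which makes it clear to the researcher where the costs are going.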

Last year's estimates on time were off by 100%. People anticipate the best scenario and changes suck up time.

Commercial analysis services are technically competitors but in effect their services are limited and specific so they’re not generally suitable. Researchers with their own bioinformaticians are more realistic ‘competition’.

Most of our costs are salaries. Getting and retaining quality people is difficult. Costs for hardware are minimal by comparison. There’s a lot of cost to develop pipelines for which there’s no immediate project to bill. Need to work out how this fits with the finance models of your institute - amortisation would be ideal but may not be practical.

Bioinformaticians don’t like clock punching. Have forced the issue just to show where the time is going. Difficult balancing act.

Things to deliver:

  • Useful reports which the scientists can interpret themselves.
  • Regular communication
  • Good estimates and reasonable turn around time.
  • Reproducibility within a reasonable time.
  • Good project and time tracking.
  • Validation of methods and tools

How do we measure success?

We should have some metric of success and be tracking it.

  • Revenue is an easy thing to track.
  • How many projects / samples.
  • How many papers.
  • How many grants did you contribute to.

Our business models should resemble consultancy services, not sequencing cores or instrument

Do you try to train people to not need you any more?

Is custom service the long term product rather than the routine analysis?


It’s very unlikely that you’ll lose business by training. Number of people trained should be part of the metrics of success. This should be a vital part of our activity.

How do you make people consult with you? Lots of people come too late with unrealistic data or experiment design. For some cores the sequencing facility won’t take samples unless they have a plan for analysis already - works if you have a sufficiently close relationship with other facilities.

You should have discussions before a project starts, so realistic expectations are set up from the beginning. Having some kind of formalised experimental design process is really useful if you can enforce it.

Scientists don’t understand where the effort goes and what it actually costs. Training can really help people - they get a better appreciation for what actually goes on so they see what the core actually does.

Moving to more standardised service risks losing people who aren’t going to stay engaged with the facility if they’re not being challenged with new types of data or analysis.

Good if you can have a mix of staff, but don’t split by standard / novel analysis. Make sure people rotate around and share experience within the group. Find out who is getting bored. Tailor it to the preferences of the staff.

What do you use to track hours? Trying to track exact time on a specific project doesn’t work. Instead use a project system which estimates time with deadlines, and then report when each task is done. Fine tracking causes problems.

Have tried a program called ClickTime which tracks time exactly; not quite start/stop, but much closer. Can compare to the expected time to see how far off you are. Also use things like Basecamp and Jira.
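Comparing tracked time with the estimate can be as simple as a percentage error per project. The sketch below assumes hours are exported from whatever tracking tool is in use as (name, estimated, actual) records; the function name and record layout are hypothetical.

```python
def estimate_errors(projects):
    """Return the percentage error of each time estimate, keyed by project name.

    projects is an iterable of (name, estimated_hours, actual_hours).
    A positive value means the project took longer than estimated.
    """
    report = {}
    for name, estimated, actual in projects:
        if estimated <= 0:
            raise ValueError(f"estimated hours for {name!r} must be positive")
        report[name] = round((actual - estimated) / estimated * 100.0, 1)
    return report
```

A project estimated at 20 hours that takes 40 comes out at 100.0, i.e. off by 100%, while one estimated at 10 hours that takes 9 comes out at -10.0.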

Is there something we could do as a community to try to improve estimates or is it too core specific? There is some commonality and could be useful to compare. In many ways it doesn’t matter if there is variation - it will be different for everyone.

Having a dedicated project manager is unusual within cores but can be useful if you have a sufficiently large core (probably 8+ people). Gets quite addictive when you have it.

How do you deal with people asking for work for free? Could you do the service as an investment - take IP in the data and get part of future grants or IP?

Groups do often do work for free. Depends on the relationship you have with the PI. Need to set a limit on what you can afford to lose. Doesn’t get to be a priority ever. Can work with the institute to try to jointly get funding.

Can do work for free as long as it relates to a software project we’ve released. It pays back in the feedback received.