ISMB 2024: bioinfo-core Workshop
The bioinfo-core COSI brought together managers and staff working in bioinformatics core facilities around the world. In our first-ever full-day session at ISMB 2024 (Montreal), we had a mix of presentations, panel discussions, and breakout groups.
Talks:
• Swapnil Sawant from Phoenix Bioinformatics spoke about the comprehensive modernization of TAIR (The Arabidopsis Information Resource), a website and database with 600k users globally. They started with legacy technology more than 20 years old, where changes took a long time and maintenance costs were high. By keeping the interface largely the same while moving to a new technology platform, they were able to improve performance substantially for users and make the system easier and cheaper to maintain.
• Francesco Lescai gave an excellent introduction to and overview of Nextflow: the problems it helps solve for cores, a feel for what it looks like in practice, and the nf-core community.
• Nikhil Kumar offered another view of Nextflow as it is used at Memorial Sloan Kettering Cancer Center, where they maintain shareable Nextflow modules within their center; he described how they set this up and how they use it to good effect.
• Dena Leshkowitz spoke about UTAP2, a popular and user-friendly pipeline that allows non-computational users to easily perform transcriptomic and epigenomic data processing and analysis.
• Grace Pigeau discussed the massive amount of data being generated at OICR and some of their approaches and strategies for managing, storing, or removing this data once it has been processed and the results generated.
• In a rather memorable and unusual rhyming talk (thanks, AI), George Bell discussed the command-line, Linux-style scripts and tools they write so that end users can perform different types of downstream analysis in R and similar environments. Users who might not want to bother learning a whole language are still willing to run a script on the command line (a hypothetical sketch of this pattern appears after the list).
• Patricia Carvajal Lopez spoke about the bioinformatics core facility competency framework, an effort to better define the role of a bioinformatics core facility scientist at three different levels, plus a managerial level, using competencies and knowledge, skills, and attributes.
• Michael Laszloffy showed and discussed Dimsum, a dashboard for quality control, project tracking, turnaround time reporting, and more. It allows OICR to keep better track of the status of all their projects, stay on top of turnaround times and progress, set priorities, and communicate better with end users.
• Aliye Hashemi presented on protein classification using Delaunay tessellation, a way of representing a protein as points in 3D space. That representation is then fed into a neural network to classify proteins (a minimal sketch of the general idea follows below).
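For the Delaunay tessellation idea, the general technique can be sketched in a few lines. This is only a minimal illustration, not Aliye Hashemi's actual method: the random coordinates stand in for C-alpha positions parsed from a structure file, and the edge-length histogram is an assumed example of the kind of fixed-length feature vector one might feed to a neural network.

```python
"""Minimal sketch of the general Delaunay-tessellation idea for proteins.

Not the speaker's actual method; the feature choice and toy inputs below
are assumptions made purely to illustrate the concept.
"""
import numpy as np
from scipy.spatial import Delaunay

# Stand-in for C-alpha coordinates parsed from a PDB file (one 3D point per residue).
rng = np.random.default_rng(0)
ca_coords = rng.random((120, 3)) * 50.0

# Tessellate the point cloud: each residue becomes a vertex, and space is
# partitioned into non-overlapping tetrahedra whose corners are residues.
tess = Delaunay(ca_coords)
tetrahedra = tess.simplices  # shape (n_tetrahedra, 4): residue indices per tetrahedron

# One simple fixed-length representation: a histogram of tetrahedron edge lengths,
# which captures local packing geometry and could be fed to a classifier.
edges = set()
for a, b, c, d in tetrahedra:
    for i, j in [(a, b), (a, c), (a, d), (b, c), (b, d), (c, d)]:
        edges.add((min(i, j), max(i, j)))
edge_lengths = [np.linalg.norm(ca_coords[i] - ca_coords[j]) for i, j in edges]
features, _ = np.histogram(edge_lengths, bins=20)

print(features)  # fixed-length vector -> input to a neural network
```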
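And as a hypothetical illustration of the command-line wrapper pattern from George Bell's talk: the script below is not one of their actual tools. The script name, arguments, and the run_de_analysis.R helper are invented purely to show how a core might let users run a canned R analysis without writing any R themselves.

```python
#!/usr/bin/env python3
"""Hypothetical command-line wrapper around a downstream R analysis.

All names (arguments, output directory, R script) are invented for illustration.
"""
import argparse
import subprocess
import sys

def main():
    parser = argparse.ArgumentParser(
        description="Run a differential-expression analysis without writing any R."
    )
    parser.add_argument("counts", help="Tab-delimited gene counts matrix")
    parser.add_argument("design", help="Sample-to-condition design table")
    parser.add_argument("-o", "--outdir", default="de_results",
                        help="Directory for result tables and plots")
    args = parser.parse_args()

    # The wrapper simply forwards its arguments to an R script maintained by the core,
    # so end users only ever interact with a single command-line call.
    cmd = ["Rscript", "run_de_analysis.R", args.counts, args.design, args.outdir]
    result = subprocess.run(cmd)
    sys.exit(result.returncode)

if __name__ == "__main__":
    main()
```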
Panels:
• AI/LLMs in cores: what are we doing now?
  o Panelists: Dexter Pratt, Nancy Li, Michelle Brazas, Dukka KC
  o This was a really excellent and wide-ranging discussion covering everything from training users on how to use AI and LLMs, using generative AI to help develop training materials, incorporating AI chat agents into web interfaces, and setting up infrastructure so that users can access and query various models on their own without too much hassle, to open-source vs. commercial models. We all wrote down many notes to take home and explore further.
• New technologies in cores
  o Panelists: Madelaine Gogol, Lorena Pantano Rubino
  o How do cores find the time and resources to tackle new technologies? How do they avoid the situation where the first person to request a new type of project has to pay for all the development time? Perhaps more questions were raised than answered. One suggestion was that the first pilot project gets a discount on the sequencing and then becomes a collaborative project resulting in a publication. There was also some discussion of at what point you treat a project as a one-off and at what point it starts becoming part of a pipeline.
Breakout groups:
We broke into three breakout groups based on what the people in the room were interested in discussing.
• Pipelines – Nextflow was quite popular, but the group discussed how the transition to a workflow tool does require some time and effort. You also have to evaluate how the new approach compares to what you have done before.
• Cost recovery models and management – basically, it's challenging. Some cores use a combination of techniques or approaches. Some charge by the hour, but that's not really fair to clients who are the first to try a new type of project, and different analysts may take different amounts of time on the same type of project. There is also the fixed-cost model for a particular type of project, but there is a risk of estimating the cost wrong. Some people estimate a minimum number of hours required, and if the project goes over, an additional discussion with the collaborator is needed.
• Managing big data – What data actually DOES need to be saved, and what data needs to be made FAIR? What data can you upload to repositories for longer-term storage? In some groups, every project has a data management plan, but compliance with that plan is not always there. iRODS was mentioned as a solution, but for one group it wasn't an option for other reasons, so they went with a second, more DIY choice. Starfish was mentioned as a commercial solution. Some places have a data steward, a role that helps guide people through what metadata they need, what they should store where (locally or in the cloud), for how long, etc.