BioWiki - User contributions: Alastair.kerr (retrieved 2024-03-28)

ISMB 2019: BioinfoCoreWorkshop - revision of 2019-07-24 by Alastair.kerr
<hr />
= Dinner information =<br />
<br />
We will be leaving at 7:30 pm on Wednesday from the reception hall (near the ribbon stand), or see you at the restaurant. <br />
<br />
Details:<br />
Veranda Pellicanò<br />
Birskopfweglein 7, 4052 Basel, Switzerland<br />
+41 61 311 55 01<br />
<br />
https://maps.app.goo.gl/gwWciAvukf7ARhgo9<br />
<br />
<br />
=Workshop Overview=<br />
<br />
The bioinfo-core workshop is scheduled for Monday, July 22, 2019, from 10:15 am to 12:40 pm at the Congress Center in Basel.<br />
<br />
The bioinformatics core workshop is run by practitioners and managers of core facilities for all members of core facilities, including scientists, engineers, analysts, operations and management staff. In this 16th year of bringing the core community together at ISMB, we will explore topics relevant to bioinformatics core facilities through lightning talks and demos, followed by small-group breakout discussions with insights brought back to the full audience for further discussion and knowledge sharing.<br />
<br />
Organizers:<br />
<br />
* Madelaine Gogol, Stowers Institute, United States<br />
* Hemant Kelkar, University of North Carolina, United States<br />
* Alastair Kerr, CRUK-MI, University of Manchester, United Kingdom<br />
* Brent Richter, Partners HealthCare of Massachusetts General and Brigham and Women’s Hospitals, United States<br />
* Alberto Riva, University of Florida, United States<br />
<br />
Social Events:<br />
<br />
* ISCB Markthalle event, Tuesday, July 23rd, 8pm (look for bioinfo-core signs)<br />
* Wednesday night dinner, Veranda Pellicanò, 8pm (meet outside congress center at 7:30pm to walk over), email mcm@stowers.org to RSVP<br />
<br />
Additional related opportunity:<br />
* [http://www.aebc2.eu/ AEBC2 Workshop] - Friday, July 26th.<br />
<br />
==Part A: Technologies and Analytical Methods==<br />
<br />
Machine Learning, AI, single cell RNA-seq analysis, and conda/bioconda.<br />
<br />
==Part B: Communication and Training==<br />
<br />
Communication and project management tools and training offered by cores.<br />
<br />
==Part C: Small group discussion==<br />
<br />
During this hour-long session, audience members will divide into groups based on their own interests. Groups will come up with their main takeaway points and bring them back to the main audience for knowledge sharing and further discussion. Topics may include all previous presentation areas as well as other areas of interest to running or working within a bioinformatics core facility.<br />
<br />
==Part D: Pipeline Demo==<br />
<br />
Demo of Nextflow<br />
<br />
==Schedule==<br />
<br />
{|class="wikitable"<br />
|-<br />
|Time<br />
|Title<br />
|Authors<br />
|-<br />
|10:20 - 10:30 AM<br />
|Transitioning bioinformatics core to support biomedical AI/ML research - lessons learned<br />
|Yang Fann, NIH, United States<br />
|-<br />
|10:30 - 10:40 AM<br />
|Supporting single cell RNA-seq analysis: A Core's Perspective<br />
|Shannan Ho Sui, Harvard School of Public Health, United States<br />
|-<br />
|10:40 - 10:50 AM<br />
|Conda and Bioconda, the best thing since sliced bread<br />
|Devon Ryan, Max Planck Institute, Germany<br />
|-<br />
|10:50 - 11:00 AM<br />
|Improving project management and tracking with Asana and Toggl<br />
|Sara Brin Rosenthal, UCSD, United States<br />
|-<br />
|11:00 - 11:10 AM<br />
|Bioinformatics training (in the context of a core)<br />
|Radhika Khetani, Harvard School of Public Health, United States<br />
|-<br />
|11:10 - 11:20 AM<br />
|Development of bioinformatics workshop by a core facility<br />
|Alberto Riva, University of Florida, United States<br />
|-<br />
|11:20 - 11:55 AM<br />
|Small Group Discussions<br />
|<br />
|-<br />
|11:55 AM - 12:20 PM<br />
|Small Group Reports<br />
|<br />
|-<br />
|12:20 PM - 12:35 PM<br />
|nf-core - A community effort to collect a curated set of pipelines built using Nextflow (https://nf-co.re/).<br />
|Harshil Patel, The Francis Crick Institute, United Kingdom<br />
|-<br />
|}<br />
<br />
<br />
== Workshop Discussion ==<br />
===Attendance===<br />
175 people total attended over the 2.5 hours (over room capacity); 55 people participated for the full 2.5 hours, including the breakout sessions and discussions; 75 people stayed for the final Nextflow demo.<br />
<br />
*Transitioning bioinformatics core to support biomedical AI/ML research - lessons learned<br />
**Large, diverse datasets from multiple sources both private and public from around the world.<br />
*Supporting single cell RNA-seq analysis: A Core's Perspective<br />
**Single-cell demand has been growing over the last 5 years, and data analysis is becoming the bottleneck. They are taking a community-based approach, collaborating with other HSPH teams and other schools (HMS) to tackle the problem: sequencing core (de-multiplexing), labs (iterative; requires research input--e.g. is cell cycling part of the signal, or is it mitochondrial content?), training, etc. <br />
**Built out the bcbio Python toolkit, which has 62 international contributors.<br />
**Settled on the Seurat suite of tools, but also use many others such as MultiCCA.<br />
*Conda and Bioconda, the best thing since sliced bread<br />
**Installing software: they get asked for help installing all kinds of software, particularly tools that carry many dependencies.<br />
**With Conda, root access is never needed, and dependencies are handled for you.<br />
**Free, and you can add your own packages.<br />
**"module load" activates a conda environment behind the scenes for their users (see the conda wrapper sketch after this list).<br />
**Bioconductor packages are in Bioconda; for every package they also build Singularity and Docker containers (BioContainers).<br />
**Containers are hosted on Quay (CoreOS).<br />
**1,700 packages were upgraded over a week, behind the scenes, during a Bioconductor upgrade.<br />
**Bioconda has 700+ contributors: release your tools using Bioconda.<br />
*Improving project management and tracking with Asana and Toggl<br />
**Fee-for-service center with up to 324 projects over the last 4 years.<br />
**Need to track projects intra-team: transition them from one team member to another as the project cycles through the experts.<br />
**Analysis can be punctuated by long idle periods while the investigator writes papers and grants; the team sometimes needs to pick up the history a year later.<br />
**Asana: they have defined a workflow within Asana that includes intake, waiting periods, in-progress states, close-out, and billing.<br />
**Archive data to S3.<br />
**Implemented Toggl to track time on each project and subtask; it integrates with Asana for the project-management components.<br />
**Allows giving people better estimates; they have found that, in general, they underestimate work (see the estimate sketch after this list).<br />
*Bioinformatics training (in the context of a core)<br />
**Funders provide FTEs dedicated to training (Harvard Catalyst, HMS).<br />
**Interplay between training and consulting: the surge in single-cell analysis highlights the need for training in this technology.<br />
**2/3 of time is spent on training, the remainder on consulting and understanding best practices.<br />
**Partner with faculty on teaching for credit--e.g. an R component for their course. <br />
**10:1 student-to-instructor ratios, 25 per class. Use local resources such as their HPC system. Publish materials on GitHub.<br />
*Development of bioinformatics workshop by a core facility<br />
**Being asked to provide practical bioinformatics training.<br />
**Challenges: a large and diverse audience, which makes it hard to develop a suitable curriculum; limited to 8 one-hour sessions; need to find a source of support.<br />
**Partnered with the cancer center for admin support, the library for a 5-seat lab, faculty for some lectures, and research computing for the HiPerGator cluster with a dedicated allocation of cores.<br />
**Successful: filled 50 spots in just a few days, with over half attending all lectures. Lectures were video-recorded and are publicly available.<br />
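The "module load activates a conda environment behind the scenes" pattern above can be approximated with a thin launcher script. A minimal sketch, assuming conda >= 4.6 (which provides "conda run"); the environment and tool names are hypothetical examples, not anything from the talk:<br />
<pre>
#!/usr/bin/env python3
"""Thin launcher that runs a tool inside a named conda environment,
mimicking a 'module load' style wrapper."""
import subprocess
import sys

# Hypothetical tool-to-environment mapping maintained by the core.
ENV_FOR_TOOL = {
    "samtools": "align-tools",
    "multiqc": "qc-tools",
}

def run_in_env(tool, args):
    env = ENV_FOR_TOOL.get(tool)
    if env is None:
        sys.exit("no conda environment registered for %r" % tool)
    # 'conda run -n <env> <cmd>' executes the command inside that env.
    return subprocess.call(["conda", "run", "-n", env, tool] + list(args))

if __name__ == "__main__":
    sys.exit(run_in_env(sys.argv[1], sys.argv[2:]))
</pre>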
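On the observation that analysts generally underestimate work: time-tracking exports make that bias measurable. A minimal pandas sketch with made-up numbers (the column names are illustrative, not Toggl's actual export format):<br />
<pre>
import pandas as pd

# Made-up example data: estimated vs. tracked hours per project.
df = pd.DataFrame({
    "project":   ["rnaseq-A", "chipseq-B", "scrna-C"],
    "estimated": [20.0, 35.0, 60.0],
    "actual":    [31.5, 38.0, 92.0],  # hours logged in the tracker
})

# Percent overrun per project; a positive median suggests padding
# future quotes accordingly.
df["overrun_pct"] = 100 * (df["actual"] - df["estimated"]) / df["estimated"]
print(df)
print("median overrun: %.0f%%" % df["overrun_pct"].median())
</pre>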
<br />
==Breakout sessions==<br />
*Training<br />
**Chunk out training, repackage it, and create efficiency.<br />
**Sign-ups: under-subscription vs. over-subscription; charging puts some skin in the game.<br />
**Access to compute.<br />
**Google and AWS use, and cost-effectiveness; Jupyter notebooks are particularly cheap to use.<br />
*Single Cell<br />
**Help people help themselves.<br />
**Shiny apps.<br />
**What lets you know it worked properly? Primer dimers and Cell Ranger metrics were mentioned, but the '''Seurat''' R package is the main thing that came out of it.<br />
**Need to agree on a standard set of QC thresholds (see the QC sketch after this list).<br />
*Project Management<br />
**From Excel to Google Docs.<br />
**Asana, Trello, Jira.<br />
**Time tracking with Toggl and Harvest (apps on phone, laptop, etc.).<br />
**Wants: Confluence to integrate project management together with documentation?<br />
**fees help manage demand and help finance pipeline development<br />
*Conda/bioconda reproducibility<br />
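For the "standard set of thresholds" point above, a minimal QC sketch using scanpy (the Python counterpart to the Seurat workflow mentioned in the notes); the cutoffs shown are common starting points, not agreed standards, and the input path is illustrative:<br />
<pre>
import scanpy as sc

adata = sc.read_10x_mtx("filtered_feature_bc_matrix/")  # illustrative path

# Flag mitochondrial genes (human "MT-" naming convention).
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)

# Illustrative thresholds -- tune per tissue, chemistry, and question.
sc.pp.filter_cells(adata, min_genes=200)       # drop near-empty droplets
sc.pp.filter_genes(adata, min_cells=3)         # drop barely-detected genes
adata = adata[adata.obs["pct_counts_mt"] < 10].copy()  # drop stressed/dying cells
print(adata)
</pre>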
<br />
==Demos==<br />
*Nextflow<br />
**Manages reproducibility; integrates with many other schedulers (see the launch sketch after this list). <br />
**uses Conda<br />
**AWS iGenomes<br />
**git repo at nf-core/configs and test datasets at nf-core/test-datasets
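As an illustration of the reproducibility points from the demo, the sketch below launches a released nf-core pipeline from Python with a pinned revision and an explicit profile; the pipeline name and version tag are examples only (see https://nf-co.re/ for current releases):<br />
<pre>
import subprocess

def run_nf_core(pipeline, revision, profiles):
    """Launch an nf-core pipeline with a pinned release tag."""
    cmd = [
        "nextflow", "run", "nf-core/%s" % pipeline,
        "-r", revision,        # pin an exact release for reproducibility
        "-profile", profiles,  # e.g. "test,docker" or "test,singularity"
    ]
    return subprocess.call(cmd)

# Example invocation (pipeline and version are illustrative):
run_nf_core("rnaseq", "1.4.2", "test,docker")
</pre>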

Main Page - revision of 2019-03-07 by Alastair.kerr
<hr />
'''Welcome to the bioinfo-core's wiki!''' <br />
<br><br />
<br><br />
*'''[http://lists.open-bio.org/mailman/listinfo/bioinfo-core Sign up to the listserv and participate in the discussion!]'''<br />
*'''[[BioWiki:Community_portal | Add your core to the wiki]]'''<br />
<br><br />
<br><br />
We thank [http://www.iscb.org ISCB] for hosting and maintaining this wiki.<br />
<br><br />
<br><br />
===Newest Content===<br />
* [[21th_Discussion-6_Nov_2018 | 21st Discussion - Nextflow and nf-core demo]]<br />
* [[ISMB_2018:_BioinfoCoreWorkshop | ISMB 2018 Workshop]]<br />
* [[20th_Discussion-6_Feb_2018 | 20th Discussion - Training, Nanopore, and New Technology]]<br />
* [[19th_Discussion-17_Oct_2017 | 19th Discussion - Workshop recap, Deliverables, & Sabbaticals]]<br />
* [[ISMB_2017:_BioinfoCoreWorkshop | ISMB 2017 Workshop]]<br />
* [[ISMB_2016:_BioinfoCoreWorkshop | ISMB 2016 Workshop]]<br />
* [[18th_Discussion-16_Oct_2015 | 18th Discussion - ISMB2015 follow up]]<br />
* [[Interesting NGS failures]]<br />
* [[ISMB_2015:_BioinfoCoreWorkshop|ISMB 2015 Workshop - The evolving relationship between core facilities and researchers]]<br />
* [[17th_Discussion-27_Feb_2015 | 17th Discussion - Best practices for bioinformatics training]]<br />
* [[ISMB_2014:_InfrastructureForNewCores|16th Discussion - ISMB2014 follow up: Infrastructure for new Cores]]<br />
* [[ISMB_2014:_BioinfoCoreWorkshopWriteUp|ISMB 2014 Workshop Write Up]]<br />
* [[15th_Discussion-24_Feb_2014 | 15th Discussion - The biologist is the analyst]]<br />
* [[ISCB_COSI_Proposal | Proposal to make bioinfo-core an ISCB community of special interest]]<br />
* [[ISMB_2014:_BioinfoCoreWorkshop|ISMB 2014 Workshop Proposal]]<br />
* [[14th_Discussion-7_November_2013| 14th Discussion - Evaluating software]]<br />
* [[ISMB_2013:_BioinfoCoreWorkshop|ISMB 2013 Workshop]]<br />
* [[13th_Discussion-5_November_2012| 13th Discussion - Embedded bioinformaticians and Integrative analysis]]<br />
* [[ISMB_2012:_Workshop_Proposal|ISMB 2012 Workshop]]<br />
* [[12th_Discussion-21_May_2012|12th Discussion - Managing Storage in a Core Facility]]<br />
* [[11th_Discussion-7_November_2011|11th Discussion - Measuring the output of a Core and Tracking Software Versions]]<br />
* ISMB 2012: Bioinfo Core Workshop - Long Beach CA - July 16, 2012 [http://www.iscb.org/ismb2012-program/ismb2012-workshops#w3 ISMB Workshop]<br />
* [[ISMB 2011: Workshop on Analysis Pipelines for High Throughput Sequencing]]<br />
* [[ISMB 2011: Workshop on Practical Aspects of Running a Core Facility]]<br />
* [[ISMB 2011 Workshop Call]]<br />
* ISMB 2010 Workshop [[Call Minutes]] page<br />
* Include [http://twitter.com/#search?q=%23BioInfoCore #BioInfoCore] in your [http://twitter.com/ tweets] for the Core community. <br />
* Numerous new additions to the community portal<br />
<br />
*[[BioWiki:Community_portal | Community Portal]]<br />
<br />
= Introduction =<br />
Bioinfo-core is a worldwide body of people who manage or staff bioinformatics facilities within organizations of all types, including academia, academic medical centers, medical schools, biotechs, and pharmas. Through this wiki and our online [http://lists.open-bio.org/mailman/listinfo/bioinfo-core discussion lists] we discuss many topics that challenge bioinformatics cores worldwide: from IT, new instrumentation, staffing and training bioinformaticians, tools, and software, to services for biologists and MDs.<br />
<BR><BR><br />
We hold several events throughout the year, including quarterly conference calls (with published [[Call Minutes]]) and a yearly set of informal presentations and dinners at the annual meeting, Intelligent Systems in Molecular Biology ([http://www.iscb.org/iscb-conferences ISMB]), the official conference of [http://www.iscb.org/ ISCB].<br />
<br><br><br />
Please browse, add and participate in the wiki and the discussion lists. To edit the wiki, create a New Account and then edit the [[BioWiki:Community_portal | Community Portal]] to add a link for your core facility and its description.<br />
<br />
= Wiki page links =<br />
*[[Call Minutes]]: Annual meetings at ISMB with presentations; detailed minutes from quarterly conference calls on selected and pertinent topics. <br />
*[[BioWiki:Community_portal | Community Portal]]: list your organization!<br />
*[[Ongoing Discussions]]: discussion forums including lists of software, tools, etc.<br />
*[[Special:Categories]]: find pages using categories such as Tools, Presentations, NextGenSequencing, Meetings etc.<br />
<br />
=Bioinfo-core Member Publications relevant to core facilities=<br />
*[http://collections.plos.org/ploscompbiol/corefacilities.php PLoS Computational Biology Journal--CORE facilities: editorial and perspectives]<br />
*[http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1000372 The Need for Centralization of Computational Biology Resources] Lewitter F, Rebhan M, Richter B, Sexton DP<br />
*[http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1000369 Managing and Analyzing Next-Generation Sequence Data] Richter BG, Sexton DP<br />
*[http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000368 Establishing a Successful Bioinformatics Core Facility Team] Lewitter F, Rebhan M

21th Discussion-6 Nov 2018 - revision of 2018-11-13 by Alastair.kerr
<hr />
Bioinfo-Core teleconference: Alex Peltzer demo on Nextflow and nf-core. <br />
<br />
UPDATE: Here are the [https://t.co/FecFUcOzDk slides from the talk]<br />
and the [http://bifx-core.bio.ed.ac.uk/Bioinf-Core/zoom_1.mp4 full video]<br />
<br />
<br />
<br />
After the positive feedback from the ISMB workshop, Alex Peltzer has agreed to run a demo for bioinfo-core on Nextflow and nf-core. This will be on November 6th, 3 PM GMT. Details on how this will occur are being finalized, but I would invite everyone to submit any questions ahead of the demo so we can all get the most out of it. <br />
<br />
Please add your questions below:<br />
<br />
* How to configure a pipeline for all users? <br />
* How/where is the best location to store pipelines and make them accessible for users?<br />
* How bad are the docker vulnerabilities? <br />
* How to configure Dockerized pipelines properly?<br />
* How to convert Docker pipelines to Singularity (with and without AWS)? (See the conversion sketch after this list.)<br />
* Easiest pipelines to create from scratch<br />
** What do you mean by that? As in, how would I start writing a pipeline, and what advice would I give to a beginner on how to do that?<br />
*** Are there tools, or existing configuration files, that would make the job easier? <br />
*** Any good GUIs such as can be found in Galaxy?<br />
** Resources per pipeline language? Such as YAML files for each tool?<br />
*** Yes<br />
*** Not sure what you mean by this - do you mean how we define dependencies per pipeline? <br />
* Running the pipeline manager: Once per user? How to enforce port usage? If once per server, how to use only individual users' outputs?
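On the Docker-to-Singularity question above: Singularity can build an image directly from a Docker registry without a local Docker daemon. A minimal sketch; the container reference is an illustrative BioContainers example, not something from the demo:<br />
<pre>
import subprocess

def docker_to_sif(docker_ref, out_file):
    """Pull a Docker image from a registry and convert it to a
    Singularity image file."""
    return subprocess.call(
        ["singularity", "pull", out_file, "docker://" + docker_ref]
    )

# Example: a BioContainers image (tag is illustrative).
docker_to_sif("quay.io/biocontainers/samtools:1.9--h8571acd_11",
              "samtools-1.9.sif")
</pre>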

ISMB 2018: BioinfoCoreWorkshop - revision of 2018-09-17 by Alastair.kerr
<hr />
=Workshop Overview=<br />
<br />
The bioinfo-core workshop is scheduled for Saturday, July 7, 2018, from 2:00-4:00 pm in Columbus EF at the Hyatt Regency in Chicago.<br />
<br />
The bioinformatics core workshop is run by practitioners and managers of core facilities for all members of core facilities, including scientists, engineers, analysts, operations and management staff. In this 15th year of bringing the core community together at ISMB, we will explore in depth three topics relevant to bioinformatics core facilities through lightning talks that broadly cover each area, followed by small-group breakout discussions with insights brought back to the full audience for further discussion and knowledge sharing.<br />
<br />
Organizers:<br />
<br />
* Madelaine Gogol, Stowers Institute, United States<br />
* Hemant Kelkar, University of North Carolina, United States<br />
* Alastair Kerr, University of Edinburgh, United Kingdom<br />
* Brent Richter, Partners HealthCare of Massachusetts General and Brigham and Women’s Hospitals, United States<br />
* Alberto Riva, University of Florida, United States<br />
<br />
==Part A: Strategies for Hiring, Recruiting, and Interviewing new bioinformaticians==<br />
<br />
Methods to find, interview, and hire highly successful staff and bioinformaticians for a core facility. Speakers will present their experience and challenges, including finding and hiring people, interview techniques and questions, and best practices for recruiting candidates.<br />
<br />
==Part B: Containerization, Clouds, and Workflows==<br />
<br />
Topics to be covered include cloud infrastructure recommendations and limitations, key datasets of value hosted in the cloud, containerization technology that works and workflow tool development and results.<br />
<br />
==Part C: When good experiments go bad: Negotiating experiment quality failures==<br />
<br />
A non-exhaustive survey of methods and successes in detecting failures and exploring guidelines for terminating bad projects.<br />
<br />
==Part D: Small group discussion==<br />
During this longer session, audience members will divide into groups based on their own interests. Groups will come up with their main takeaway points and bring them back to the main audience for knowledge sharing and further discussion. Topics may include all previous presentation areas as well as other areas of interest to running or working within a bioinformatics core facility, such as single-cell analysis or long-read analysis.<br />
<br />
{|class="wikitable"<br />
|-<br />
|Time<br />
|Title<br />
|Authors<br />
|-<br />
|2:00 PM - 2:08 PM<br />
|Bioinformatics Core Staffing ([http://bioinfo-core.org/index.php/File:ISMB_Grimm.pptx slides])<br />
|Sara Grimm, NIEHS, United States<br />
|-<br />
|2:08 PM - 2:16 PM<br />
|Characteristics of a highly successful candidate and how to find them<br />
|Brent Richter, Partners HealthCare, United States<br />
|-<br />
|2:16 PM - 2:24 PM<br />
|nf-core: community-driven best-practice Nextflow pipelines ([https://slides.com/apeltzer/ismb2018-nfcore#/ slides])<br />
|Alexander Peltzer, Quantitative Biology Center, Tübingen, Germany<br />
|-<br />
|2:24 PM - 2:32 PM<br />
|Data Science in the 21st Century: Streaming Public Data into Containerized Workflows ([http://bioinfo-core.org/index.php/File:Containers_Workflows_Lightning_v2.pptx slides])<br />
|Ben Busby, NCBI, United States<br />
|-<br />
|2:32 PM - 2:40 PM<br />
|Shesmu - An analysis orchestration system designed for FAIR standards and the GA4GH cloud ecosystem ([http://bioinfo-core.org/index.php/File:ljorgensen_ISMB_2018_biocore.pptx slides])<br />
|Lars Jorgensen, OICR, Canada<br />
|-<br />
|2:40 PM - 2:48 PM<br />
|A (Fire)Cloud-Based DNA Methylation Data Preprocessing and Quality Control Platform ([http://bioinfo-core.org/index.php/File:ISMB_LighteningTalk.pptx slides])<br />
|Divy Kangeyan, Harvard University, United States<br />
|-<br />
|2:48 PM - 2:56 PM<br />
|Usability of Marginal Data ([http://bioinfo-core.org/index.php/File:Thimmapuram_ISMB2018.pptx slides])<br />
|Jyothi Thimmapuram, Purdue University, United States<br />
|-<br />
|2:56 PM - 3:04 PM<br />
|Experimental Failures ([http://bioinfo-core.org/index.php/File:Termination_of_Bad_Projects_-_Experimental_Failures1.pdf slides])<br />
|Krishna Karuturi, The Jackson Laboratory, United States<br />
|-<br />
|3:04 PM - 3:20 PM<br />
|Small Group Discussions<br />
|<br />
|-<br />
|3:20 PM - 4:00 PM<br />
|Report to all present the insights obtained within the small group discussions<br />
|<br />
|}<br />
<br />
<br />
Notes (feel free to contribute or modify)<br />
<br />
Sara Grimm: Bioinformatics Core Staffing<br />
They support 60 labs with an embedded support model: staff are assigned to a particular lab. She mentioned needing soft skills to communicate and manage expectations. Hiring is done by a contracting agency, so they have little control, and there is local competition. They get applicants from the life sciences, or sometimes from IT mid-career. They want someone comfortable at the command line with at least one programming language, and conversant in basic biology. They include a scientist on the interview panel, and make sure being in a core is a good fit with the candidate's career goals.<br />
<br />
Brent Richter: <br />
Showed a (complex) org chart. Ideally, you want someone who will stay for 2 or more years. He recommended keeping the job description fresh and detailed: Google similar positions and see what the descriptions are like. Offer learning opportunities and define responsibilities clearly; don't be overly general. What bigger areas can the position grow into, who will they report to, how will their career goals be supported? This is a good chance to clarify the scope of the position. The first 90 days offer immediate feedback/praise/criticism. Check in for 5 minutes weekly - do they need anything? The yearly review is an opportunity to re-recruit high performers, or to provide constructive criticism if someone is struggling.<br />
<br />
Alex Peltzer: NF-core<br />
Diverse, big, error-prone data. Large-scale projects that integrate old with new data. Nextflow allows fast prototyping, task composition, parallelization, and containerization. http://nf-co.re collects the pipelines (Nextflow, MIT license, Docker bundled). Continuous-integration testing, stable release tags. A cookiecutter skeleton is available for new pipelines, plus a gitter channel.<br />
<br />
Ben Busby:<br />
WE CAN SAVE COMPBIO by submitting data FOR biologists. Lots of free cloud out there; just call it 'education'. Docker vs. Singularity: Singularity runs with the user's own permissions (doesn't require root), which may be more comfortable for IT. An antibiotic-resistance pipeline simple enough for college juniors. A prokaryotic genome pipeline. Nanopore simple enough for high schoolers. ATACflow, a Jupyter notebook where you just "press the triangles". Mentioned Google Colaboratory.<br />
<br />
Lars Jorgensen:<br />
Shesmu. They get the samples nobody else wants to sequence, so there is no "standard" pipeline. Niassa (a SeqWare fork). "Deciders", but the infrastructure is troublesome: hard to write and debug, with large memory requirements. Shesmu is a decider server; "olives" determine what actions to take. Stateless, so it recovers nicely if the server dies. Olives create Jira tickets if something is missing. Good: a unified interface. Bad: they are now maintaining a compiler. oicr-gsi/shesmu.<br />
<br />
Divy Kangeyan<br />
FireCloud - scalable genomic analysis. Need something scalable and reproducible, with access to public data and best practices. Mainly applied to methylation data so far. R and scmeth; WDL glues the tools together. Lots of QC: read coverage, CpG coverage, CpG density, M-bias plots.<br />
<br />
Jyothi Thimmapuram<br />
Data that is close to the lower limit of qualification, barely exceeding the minimum requirements. How do you use it? Why did it fail? Experimental design failures - insufficient replicates, wrong type of reads, too few reads. Contamination - sample mix-ups, contamination during sample processing or library prep. Mistakes in protocol, or sequencer failures. Plant DNA can confound studies of bacterial endophytes. WT/mutant experiment, but SNPs the same in ALL samples, hmmm. Repurpose the data if possible. You might still be able to address some questions or give them some useful info. Transcriptome assembly instead of RNA-seq. You can still learn SOMEthing. There can also be data analysis failures - often fixable - wrong reference genome, wrong analysis methods, how you dealt with missing data. Data interpretation failures, e.g. not correcting for multiple hypothesis testing.<br />
<br />
Krishna Karuturi<br />
Not really a *fun* topic, but an important, sticky one. They have 100 labs. Prevention is better than cure. Experimental failures affect relationships with labs, and timing. What we really NEED are superheroes, but... They do a multi-point QC inspection. When an experiment fails, do a design review and figure out why, or where they could have caught it. Need to decide if it's a drop/no-drop situation. In the case of confounding batch effects, if the biological effect being tested is much larger than the batch effect, perhaps proceed with caution. Limit "free" time for projects, or they will lag and drag on.<br />
<br />
Hiring/Interviewing small group:<br />
"Lock them in a room" with a competency test, something they would have to do on the job. Give them 40 minutes for a task that might take 25 minutes. Examples were fixing a broken script (multiple languages available) or writing an email (maybe that's for a different type of job, but something like that). Would you want to go on a camping trip with them? Coffee or lunch with the group can be a good enticement: it gives them a feel for the group and gives you a feel for how they interact. Send an RNA-seq dataset ahead of time, have them run an analysis, and present the results at the interview. Most people hadn't had any formal recruitment training. Entry-level hiring is easier, but they may only stay 2-3 years. Is there a track to PI level? Some places have that. Recruits should have a presence on GitHub. Have them provide examples of how they solved a problem and look for evidence of their autonomy and ability to teach themselves new things. Hiring is part vetting and part seduction. How do you make the job attractive; sell it. Work for the common good? Cross-training available within the group. Put a link to your group website in the job description, and then ask in the interview if they've been to your website. Bad sign if they haven't.<br />
<br />
<br />
'''Workflow small group'''<br />
<br />
[http://cromwell.readthedocs.io/en/develop/ Cromwell]<br />
The Broad’s Java-based pipeline manager for their WDL (“Widdle”) pipeline language.<br />
CWL is available as an option in the languages section of the configuration. <br />
<br />
[https://github.com/Barski-lab/cwl-airflow cwl-airflow]<br />
A CWL pipeline manager built on Apache’s Airflow, originally developed by Airbnb.<br />
Python-based but seems to require several out-of-date packages; a virtualenv is recommended.<br />
Appears to wrap cwl-runner and cwltool.<br />
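To make this concrete, a minimal sketch (not from the discussion) of driving a CWL step from an Airflow DAG; it assumes Apache Airflow 2.x with cwltool on the PATH, and the workflow/job file names are placeholders:<br />
<pre>
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# One-task DAG that shells out to cwltool; a real cwl-airflow setup
# generates tasks from the CWL graph instead of hand-writing them.
with DAG(
    dag_id="cwl_example",
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,  # trigger manually rather than on a schedule
    catchup=False,
) as dag:
    run_cwl_step = BashOperator(
        task_id="run_cwl_step",
        bash_command="cwltool workflow.cwl job.yml",  # placeholder files
    )
</pre>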
<br />
[https://www.nextflow.io Nextflow]<br />
Groovy-based language.<br />
There appears to be a prototype CWL-to-Nextflow converter:<br />
https://www.nextflow.io/blog/2017/nextflow-and-cwl.html<br />
Pipelines seem to use an S3 bucket; an AWS account and the AWS command line tools are therefore required.<br />
<br />
[http://nf-co.re/ nf-core] : A community effort to collect curated Nextflow pipelines. As of August 9th, 3 released pipelines, 7 in development<br />
<br />
<br />
'''Nextflow''' supports Docker and Singularity container technologies.<br />
<br />
This, along with integration with the GitHub code-sharing platform, allows you to write self-contained pipelines, manage versions and rapidly reproduce any former configuration.<br />
It provides out-of-the-box executors for the SGE, LSF, SLURM, PBS and HTCondor batch schedulers and for the Kubernetes and Amazon AWS cloud platforms.<br />
Nextflow is based on the dataflow programming model, which greatly simplifies writing complex distributed pipelines.<br />
Parallelisation is implicitly defined by the processes' input and output declarations. The resulting applications are inherently parallel and can scale up or scale out, transparently, without having to adapt to a specific platform architecture.<br />
All the intermediate results produced during the pipeline execution are automatically tracked.<br />
Stream-oriented: Nextflow extends the Unix pipes model with a fluent DSL, allowing you to handle complex stream interactions easily.<br />
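To make the dataflow idea concrete, here is a toy illustration (in Python, since this page has no Nextflow code; the per-sample function is a stand-in): each "process" is a function over one input item, and parallelism falls out of declaring one task per item on the input channel.<br />
<pre>
from concurrent.futures import ProcessPoolExecutor

def fastqc(sample):
    # Stand-in for running a real per-sample tool
    return f"qc report for {sample}"

if __name__ == "__main__":
    samples = ["s1.fastq", "s2.fastq", "s3.fastq"]  # the input "channel"
    with ProcessPoolExecutor() as pool:
        # The runtime fans the function out over the channel's items,
        # which is roughly what Nextflow does for a process declaration.
        reports = list(pool.map(fastqc, samples))
    print(reports)
</pre>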
<br />
<br />
This group is interested in a communication mechanism, maybe a Slack channel or maybe part of Biostars?<br />
<br />
Experimental failures small group:<br />
Have evidence before pointing fingers... Some groups film sample prep with a GoPro, can you imagine. Meet jointly with PIs. Force people to fill out a metadata spreadsheet or form before the project? Earlier involvement of analysts is better. Some places have a conflict resolution mechanism between groups.<br />
<br />
Ideas for group operations to follow up on:<br />
<br />
Cloud/workflow discussion again?<br />
<br />
More time? Ask for 3 hours next time. One comment was that 8 minutes per talk is too short, but some people liked it.<br />
<br />
4-5 small groups might be better.<br />
<br />
I'd love to have some topics we decide, but then allow people to submit posters and select talks from the submitted posters. Sounds like we can get in on a monetary poster prize via ISCB. We could also seek our own sponsors and use the money for travel fellowships. ISCB will be our bank and hold funds in escrow.<br />
<br />
Do we need to alter our communication strategy? A Slack channel was requested by a few people. ISCB may give us access to Zoom for conference calls.<br />
<br />
We will try to get the details on who checks the bioinfo-core box when registering... Right now (supposedly) ISCB will add those people to our mailing list, but we have not confirmed that this is happening. It would be good to verify.<br />
<br />
Regarding workflows (edit Alastair) my issues are:<br />
<br />
Most tutorials online are geared around a single user setting up the workflow software in a user account; not many are sysadmin-friendly.<br />
<br />
* How to configure a pipeline for all users<br />
* How/where is it best to store pipelines and make them accessible to users?<br />
* How bad are the Docker vulnerabilities?<br />
* How to configure Dockerised pipelines properly<br />
* How to convert Docker pipelines to Singularity (with and without AWS)<br />
* Which pipelines are easiest to create from scratch?<br />
* Resources per pipeline language? Such as YAML files for each tool?<br />
* Running the pipeline manager: once per user? How to enforce port usage? If run once per server, how to expose only individual users' outputs?</div>Alastair.kerrhttp://bioinfo-core.org/index.php?title=ISMB_2014:_InfrastructureForNewCores&diff=10121ISMB 2014: InfrastructureForNewCores2014-11-17T10:18:54Z<p>Alastair.kerr: /* Software */</p>
<hr />
<div>== Introduction ==<br />
This page was created in response to a suggestion made at the [[ISMB_2014:_BioinfoCoreWorkshopWriteUp | 2014 ISMB workshop]]. One of the discussions there centered around configuring resources for a core facility and the merits of commercial packages. As an extension of this discussion, some people suggested it would be useful to catalogue the core pieces of infrastructure people have within their cores and, more specifically, to suggest which pieces of hardware and software the group considers useful when setting up a new core, with specific package recommendations for each function.<br />
<br />
* Hardware: buy or rent (AWS)?<br />
** Assume: ~20% usage for own hardware (my own average for the past year)<br />
** Assume: equivalent pricing on S3 vs institutional storage ($0.03/GB/mo); see the worked example after this list.<br />
** who does the sysadmin stuff for your own hardware<br />
* Software<br />
** "Pathway Analysis" software. IPA is ~$6,000/year, or ~$20,000/year for concurrent license. Cheaper/free alternatives?<br />
** "Plasmid drawing / in silico cloning" tools commercial (CLCBio/VectorNTI) or free, open source solutions?<br />
** "Sample tracking" and LIMS<br />
** "workflow managers"<br />
* Time tracking: spreadsheet (easy for startup, hard to scale) vs enterprise solution (more cumbersome initially (?), scales better)<br />
* Hire personnel<br />
** core people vs. embedded bioinformatician<br />
* Interaction with existing groups<br />
** other core facilities (functional genomics, mass spec, etc)<br />
** computational / statistical Biology research groups<br />
* Setting up a culture of collaboration with the wet lab research groups<br />
** make sure they talk to you before they start the experiment<br />
** co-authorship vs acknowledgement<br />
* Teaching<br />
** offering training courses<br />
** co-supervision of students<br />
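As a quick worked example for the buy-vs-rent bullet above (illustrative numbers only, using the $0.03/GB/month and ~20% utilisation assumptions from the list):<br />
<pre>
# Annual cost of 100 TB at $0.03/GB/month, and the effect of low
# utilisation on owned hardware. All numbers are illustrative.
tb = 100
price_per_gb_month = 0.03

cloud_per_year = tb * 1000 * price_per_gb_month * 12
print(f"Cloud storage: ${cloud_per_year:,.0f}/year for {tb} TB")

# If owned hardware is busy only 20% of the time, each used hour
# effectively costs 5x the raw rate, which is one argument for
# renting bursty capacity rather than buying for peak load.
utilisation = 0.20
print(f"Effective cost multiplier at {utilisation:.0%} usage: {1 / utilisation:.0f}x")
</pre>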
<br />
== Notes from the call 14-Nov-2014 ==<br />
===Hardware===<br />
Need to be careful when buying hardware - make sure that you have the infrastructure in place to host the servers you buy. Putting a suitable machine room in place, with cooling and power, is a significant undertaking in itself.<br />
<br />
Make sure you match the hardware to the tasks. Clusters are not generic: a lot of modelling or chemistry work needs fast CPUs with low memory/storage, but bioinformatics often needs high storage and IO. Work out what sort of jobs you’re going to do and buy appropriately.<br />
<br />
Need to have a lot of storage which might be the biggest problem. Need to be sure that you’re confident in your admin skills if you’re going to take this on yourself.<br />
<br />
Might be possible to re-purpose existing hardware if you have something available, so find out what hardware might be available in your institution. You might find there are groups with infrastructure already in place which you could tap into; this can get you up and running quickly, and from there you can decide whether to create a joint system or whether you will ultimately need to go it alone.<br />
<br />
You can do a lot with a single large SMP box. Custom distributions like BioLinux can help. If you look to scale up then get advice from a group or company who have done this many times before. Companies like BioTeam can help to put together a design and will think of things you’ve not considered.<br />
<br />
Once you have significant infrastructure you really need to look towards having a full-time sysadmin. Maintaining the hardware, backups, storage, software and data pipelines is a huge task. Ideally you can involve the central core IT services at your institution to allow your people to focus on analysis rather than systems administration.<br />
<br />
Once you start with a multi-user cluster then managing the software, hardware, queues etc becomes a full-time job, and if you don’t have this then you will be continually falling behind on security patches and upgrades. A nice fallback is to set up a consultancy agreement with an individual or company, which allows a smooth transition to having a permanent position within your group.<br />
<br />
Pretty much no one has gone with using cloud services. Although this seems attractive, the practical problems of maintaining an instance to your specification make it expensive and difficult. Might be worth looking at OpenStack as an alternative.<br />
<br />
===Managing interactions with existing groups===<br />
There are often people already working in computational biology or statistics groups, and it’s important to establish good relationships with these people at the start. Make sure people are aware that your group is being started and what its purpose is. Try to talk to all interested parties up front and be aware of any political problems which might exist. Try to confront issues up front and don’t wait for them to fester. Try to collaborate with groups rather than compete with them - in the end there’s always more than enough work to go around.<br />
<br />
Try to be the place that people come to get pointed to other experts. Don’t try to do everything yourself but be quick to forward people to other groups when you know they have more specific experience than your group does. This will provide a better service and will not alienate people.<br />
<br />
===Your first hire===<br />
Don’t rush into it. When there are only two of you, you will need to be completely confident in both the skill set and the personal qualities of the first person you hire. Ideally try to find someone you know already or someone who comes recommended by someone you trust.<br />
<br />
You generally want someone with good problem-solving skills. When you’re a small core everyone needs to be a jack of all trades, so don’t focus on their existing skills but try to see how well they’re likely to pick up new areas, since things will quickly change.<br />
<br />
Build personal relationships with the informaticians who are already in the institution.<br />
<br />
===Time tracking===<br />
Often a good idea to have some sort of tracking in place from an early stage so you can justify the time you are spending. Even if you don’t have to charge for the work you’re doing, it can still be useful to know where your time is going.<br />
<br />
Could do something as simple as a spreadsheet: projects, who is working on them, etc.<br />
<br />
There are also lots of project management systems such as [http://www.redmine.org/ Redmine] which can do the same sort of thing. [http://www.atlassian.com/software/jira Atlassian Jira] can also be useful in this area, and ClickTime is an online system which can do some of this. Most of them have some kind of time-tracking capability. This really pays off in the long term when you can collate statistics. You can extend these from project tracking to a help desk or other systems.<br />
<br />
Don’t use this as a barrier to people. Always make it easy for people to come and talk to you, and don’t track or bill this time. You want to make yourself as useful as possible and make your group the place that people think of first. Don’t make them make appointments - try to have an open door policy.<br />
<br />
Even if you have big projects these systems still work well and you can link tasks together.<br />
<br />
The really important thing is to be able to keep track of the work you’ve done. All of your data is electronic so you need a way to be able to store data / scripts / notes etc.<br />
<br />
When your project list grows these systems are also useful to be able to flag up problems in your workflow when jobs have waited too long or have gone on longer than expected.<br />
<br />
===Software===<br />
What software do you need?<br />
<br />
[http://galaxyproject.org/ Galaxy] is a good place to start: an easy way to offer both datasets and tools to people, and useful for teaching.<br />
<br />
You should have a revision control system to keep track of all the scripts and software you write. It can also make it easy to share code. If you use something like git then it’s easy to use a public git repository or keep things private depending on your needs. There is [http://github.com/blog/1840-improving-github-for-science GitHub for science] and there are also projects like [http://gitorious.org/ Gitorious] if you want to host this yourself.<br />
<br />
You need some way to share data. Set up an FTP server or similar.<br />
<br />
Using simple Apache web directories can be useful for hosting bigWig and BED files for use in genome browsers.<br />
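A quick way to test the idea locally (a sketch only, not a production setup: genome browsers need HTTP byte-range support for bigWig files, which Apache provides but Python's built-in server does not, so this is only suitable for small BED files or smoke tests):<br />
<pre>
import functools
import http.server
import socketserver

TRACK_DIR = "."  # point this at your directory of BED/bigWig files
PORT = 8000

# Serve the directory read-only over HTTP; Apache with an auto-indexed
# directory is the equivalent production setup.
handler = functools.partial(http.server.SimpleHTTPRequestHandler,
                            directory=TRACK_DIR)
with socketserver.TCPServer(("", PORT), handler) as httpd:
    print(f"Serving {TRACK_DIR} at http://localhost:{PORT}/")
    httpd.serve_forever()
</pre>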
<br />
Can use Dropbox for small files. Can set up ownCloud for larger, more manageable storage.<br />
<br />
Have some way to share notes and experience. Evernote works well, but you could use a local wiki or a system such as Confluence. The main thing is to not lose track of your hard-won experience.<br />
<br />
For sending data to users you can use [http://shiny.rstudio.com/ Shiny], which is a nice way to share tabular data and interactive graphs. Any Shiny app can be set up to read file directories or SQL databases for its source data and to allow the user to download selected data or graphs via simple controls in the user interface.</div>Alastair.kerr
http://bioinfo-core.org/index.php?title=ISMB_2014:_BioinfoCoreWorkshopWriteUp&diff=10107ISMB 2014: BioinfoCoreWorkshopWriteUp2014-07-22T15:37:00Z<p>Alastair.kerr: /* Topic 1: Core on a Budget vs Enterprise */</p>
<hr />
<div>== Introduction ==<br />
The annual bioinfo-core workshop ran successfully at the 2014 ISMB conference. We had a good attendance for the meeting despite the workshop clashing exactly with the world cup final and we're very grateful for everyone who chose to come along.<br />
<br />
We changed the format of the workshop slightly from previous years. In the past we had always had two sets of presentations followed by a moderated group discussion. This year we had only one formal session with presentations, with the second half of the workshop being taken up with a larger group discussion covering a number of topics which were collated from suggestions taken from the bioinfo-core mailing list.<br />
<br />
The group discussions were very lively and we had a large number of people contributing to them. For those who weren't able to attend we will try to summarise some of the main points of the discussions below. As these were all active discussion sessions we don't have a great number of notes to work from, so if anyone remembers other points please fill in anything which has been missed.<br />
<br />
==Introduction to the Workshop by Brent Richter==<br />
Earlier in the conference Brent had already presented to a special session which had introduced bioinfo-core as one of the new ISCB COSIs (communities of special interest). He was able to summarise the rationale for having bioinfo-core as a group and talk about the activities the group performs. Hopefully the increased exposure the group receives through becoming a COSI will help bring us to the attention of some people who might not have known about us before.<br />
<br />
<br />
== Topic 1: Core on a Budget vs Enterprise ==<br />
Moderated by Matt Eldridge and David Sexton<br />
<br />
Speakers:<br />
<br />
* Alastair Kerr - Edinburgh Uni [[media:AK-ismb-coreshop.pdf|Slides]]<br />
* Mike Poidinger - A*Star [[media:Mpoidinger_ismb2014.pdf|Slides]]<br />
<br />
The purpose of this session was to look at whether it is possible to run a core facility on a limited budget, and to explore what becomes possible when you have a larger amount of money to spend.<br />
<br />
Alastair Kerr led the session talking about his [http://bifx1.bio.ed.ac.uk/ small core (2 people) at Edinburgh University]. His core is almost completely self-reliant and has to cover all of the hardware, storage, software and analysis infrastructure required for the full range of users he supports. Alastair described how his infrastructure is built on a number of key open source components: from [http://en.wikipedia.org/wiki/ZFS ZFS] based storage systems which provide 0.5PB of storage for a fraction of the cost of commercial systems, to pipelining and workflow systems built within [http://galaxyproject.org/ Galaxy], to user-friendly analysis scripts provided to users through the [http://shiny.rstudio.com/ R Shiny system].<br />
<br />
Alastair described how he actively avoids the use of commercial software within his group and described occasions in the past where their adoption of initially useful commercial packages had ultimately had negative impacts when the software later changed its licensing fees or became unsupported. The only commercial package they still have is [http://www.dnastar.com/t-allproducts.aspx Lasergene] for basic molecular biology manipulations and this is mostly for historical reasons and for the lack of a suitable open alternative. <br />
<br />
A key benefit with this choice of open source infrastructure is that the process of data analysis, from raw data to paper figures, can be shared and used by anyone. Alastair's group facilitates this process by ensuring that any scripts developed in-house are shared in either [https://toolshed.g2.bx.psu.edu/ Galaxy Toolshed] or [http://github.com github].<br />
<br />
Mike Poidinger then went on to present the contrary case. His group is very well funded by his [http://www.a-star.edu.sg/ supporting institution] and is somewhat larger than the Edinburgh group, with 9 members. Mike's initial contention was that it should be a requirement when setting up a core that sufficient funding be provided, and that it would be reasonable to refuse to head up a core where suitable funding for an appropriate infrastructure was not forthcoming. Mike stressed that open source software played a major role in the operation of his core, with much of the analysis of data being provided by these types of packages, which are generally much more agile and able than their commercial equivalents. However, he made a strong case for two particular pieces of commercial support software which now form a key part of his infrastructure - [http://accelrys.com/products/pipeline-pilot/ Pipeline Pilot] and [http://spotfire.tibco.com/ Spotfire].<br />
<br />
Mike's contention was that whilst open source packages are very good at performing individual analyses, they can be difficult to work with due to the difficulty in collating and understanding the wide variety of output files and formats they generate. His group uses [http://accelrys.com/products/pipeline-pilot/ pipeline pilot] to abstract away a lot of the 'dirty' parts of an analysis so that they can leave the commercial system to store and retrieve appropriate data and to handle the format conversions required to pass data through several parts of a larger pipeline. Having this type of collation system in place means that all of the analysis can be done in the form of pipelines and a complete record of all analyses is preserved and can be reproduced or reused very easily. <br />
<br />
The other package heavily used within his group is [http://spotfire.tibco.com/ Spotfire]. This is a data presentation and integration package which makes it easy for users to explore the quantitative data coming out of the end of an analysis pipeline. It competes with simple solutions such as Excel, or more complex analyses and visualisations in R, but provides a friendly and powerful interface to the data. Mike's team have linked these packages to other tools such as the MoinMoin wiki to provide a combined system which keeps a complete record of analyses, presents it back to the original users in a friendly way and provides an interface through which they can themselves manipulate and explore the data further.<br />
<br />
Overall it was Mike's contention that the use of these commercial products within his group added around 20% to the efficiency of his staff, and also allowed new members to get started much more quickly. The cost of the licensing for these packages was therefore outweighed by the efficiency improvements which his group gained from their use.<br />
<br />
=== Discussion ===<br />
There were some questions about the talks which had been presented. There was some lively discussion about the financial benefits of using commercial software, with some people arguing that the amount of money spent on a big commercial system would fund an additional FTE, and that this would be a more productive use of the funding. Whilst a consensus was not really reached on this point, it seems that the merits of this, and possibly of several other commercial/open decisions, depend on the scale of your group. Smaller cores are more able to support their own infrastructure, but as the size of the group or the community expands the support of infrastructure becomes more of a burden. At this point getting commercial support for storage, pipelining or data management becomes more attractive and allows the core to focus on the science rather than the specifics of the platforms being used.<br />
<br />
A suggestion which came out of this discussion was that bioinfo-core could try to collate some ideas about what infrastructure would be useful to put in place when establishing a new core. The idea would be that we could generate a basic check list of the types of components you would want and give some options for available solutions for each area and add comments about the merits of each. To this end we've set up a [[ISMB_2014:_InfrastructureForNewCores | basic template page]] which we can expand after further discussion on the list.<br />
<br />
Much of the subsequent discussion for the session focussed on whether there were individual or groups of commercial packages for which there isn't a suitable free and open alternative. The major area which came up was packages providing functional annotation, with the main contenders being [http://www.ingenuity.com/products/ipa Ingenuity IPA] and [http://lsresearch.thomsonreuters.com/ GeneGo MetaCore]. Several sites are paying for these types of packages, and the consensus was that what you're paying for isn't the software but rather the expanded set of gene lists and connections which have been manually mined from the literature by these companies.<br />
<br />
These types of system are generally liked by users as they provide an easy way into the biology of an analysis result. They offer some advantages over equivalent open source products, but their major open competitors such as [http://david.abcc.ncifcrf.gov/ DAVID], [http://cbl-gorilla.cs.technion.ac.il/ GOrilla] and [http://www.genemania.org/ GeneMania] are also very good and well used.<br />
<br />
There was a general opinion that the costs and licensing terms for the commercial annotation packages were quite severe. This was especially the case for IPA, where some sites had started to do cost recovery for the licence and found that many of the previous users weren't prepared to pay the costs. MetaCore licensing was more flexible, with the ability to buy licences for a given number of simultaneous users, which fitted better with many people's use cases.<br />
<br />
Comments were also made about the utility of these systems. There was some concern that although these systems are popular they may not be all that biologically informative. Some groups had found that people tended to pick and choose hits from a functional analysis in the same way that they picked from gene hit lists, to try to reinforce an idea they already had rather than to formulate novel hypotheses.<br />
<br />
Another case for commercial packages was made for cases where you want to quickly enter into a new area of science and you don't have the resources available to build up an in-house platform for open tools. This can often happen if there is an important but likely transient interest in a new area of science. The example cited was the use of [https://www.dnanexus.com/ DNA Nexus] for variant calling, which may not be the absolute best in class, but is likely good enough for new users and is a well researched and validated platform. Setting this up takes minimal time and effort and can provide a cost effective solution for cores without the time or experience to develop a more tailored pipeline.<br />
<br />
== Session 2: Community suggested open discussions ==<br />
Moderated by Simon Andrews and Brent Richter<br />
<br />
For this session the workshop organisers had put out a request on the mailing list for topics the group would like to discuss. There were a large number of responses which were then collated to pull out the most common topics for the discussion session. Other suggestions will either be put back to the list or will be used as part of one of the forthcoming conference calls.<br />
<br />
There were 3 major areas which were selected for coverage within the session:<br />
<br />
* Using pipelines within a core<br />
* Managing workloads<br />
* Funding your core<br />
<br />
=== Using pipelines within a core ===<br />
<br />
The motivation for this topic was to see how many cores had already introduced automation and to look at the factors which influenced their choice of pipelining system. We had already heard from a group which was heavily invested in the commercial Pipeline Pilot system and one which had used Galaxy to construct workflows. We then heard from a couple of other groups - one had started from the [http://www.broadinstitute.org/gatk/guide/topic?name=queue GATK Queue] system but had found that this wasn't directly usable on their infrastructure. Another group had developed a new pipelining system, [http://www.bioinformatics.babraham.ac.uk/projects/clusterflow/ ClusterFlow], since they found that none of the available systems fitted well with their existing infrastructure and that it was as easy to develop their own system from scratch as to tailor an existing one to their needs.<br />
<br />
Several people said that they had developed pipelines without using a traditional pipelining system. Rather, they had simply produced specific scripts which ran an analysis end to end. Sometimes this was split into several modular scripts chained together, but without the benefits in parallelisation and scheduling which a formal system could have provided. These scripted pipelines are common because they develop naturally out of individual analyses, but there was some concern about how well they would scale in the future.<br />
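<br />
To make the distinction concrete, below is a minimal sketch of such a scripted pipeline - a hypothetical Python example (the tool names, arguments and file names are purely illustrative, not taken from any talk at the session) showing steps chained end to end, with none of the scheduling or parallelisation a formal system would add:<br />
<pre>
#!/usr/bin/env python
# Hypothetical hand-rolled pipeline: each step is one external command,
# run serially. Tool names and arguments are placeholders.
# Missing compared with a formal pipelining system: parallelisation,
# scheduler integration, restart-after-failure and settings tracking.
# usage: pipeline.py <sample-basename>
import subprocess
import sys

def run(cmd):
    print("Running: " + " ".join(cmd))
    if subprocess.call(cmd) != 0:
        sys.exit("Step failed: " + " ".join(cmd))

sample = sys.argv[1]                                         # e.g. "sample1"
run(["fastq_qc", sample + ".fastq"])                         # step 1: QC
run(["aligner", "-o", sample + ".bam", sample + ".fastq"])   # step 2: align
run(["peak_caller", "-o", sample + ".bed", sample + ".bam"]) # step 3: call
</pre>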
<br />
In practical terms, the factors which had influenced or limited the adoption of pipelines included:<br />
<br />
* The ability to configure the settings used for steps within a pipeline. There were competing philosophies here - some people preferred steps which were very static once set up, so that it was easier to maintain consistent operations; others wanted the ability to tweak all settings easily, for maximum flexibility within the pipeline. A lot of people had found that the overhead of writing suitable wrappers for individual programs within a pipeline was quite high, especially if all of a program's options needed to be encoded in the wrapper to allow them to be changed.<br />
<br />
* Fit with existing infrastructure. Many people implementing pipelines will not be building a new system from scratch but will need the pipeline to integrate with an existing cluster, so it must support the scheduling system in use, the software management system, the nature of the filesystem and various other local factors.<br />
<br />
* Recording and reproducibility. There is quite a lot of variability in how the results and settings for pipelines are recorded and what information is retained. Some groups need to be able to easily query and collate results across a large set of pipeline runs, and how easy this is depends heavily on how the data is recorded (see the sketch below).<br />
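<br />
As a concrete illustration of the recording point above, even a scripted pipeline can write a machine-readable manifest of every run. This is a minimal, hypothetical sketch (the field names and file layout are our own invention, not any particular system's format) of the kind of record that makes later querying and collation straightforward:<br />
<pre>
import json
import time

def record_run(manifest_path, sample, steps):
    """Append a machine-readable record of one pipeline run.

    'steps' is a list of (name, command, settings) tuples; keeping the
    exact settings per step is what allows runs to be reproduced and
    results to be collated across a large set of pipelines later.
    """
    record = {
        "sample": sample,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "steps": [{"name": n, "command": c, "settings": s}
                  for n, c, s in steps],
    }
    with open(manifest_path, "a") as fh:
        fh.write(json.dumps(record) + "\n")  # one JSON object per run

# Hypothetical usage - the tool, genome build and settings are examples:
record_run("runs.jsonl", "sample1",
           [("align", "aligner -x hg19 sample1.fastq", {"index": "hg19"})])
</pre>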
<br />
=== Managing workloads ===<br />
<br />
=== Funding your core ===</div>Alastair.kerrhttp://bioinfo-core.org/index.php?title=File:AK-ismb-coreshop.pdf&diff=10103File:AK-ismb-coreshop.pdf2014-07-22T09:33:10Z<p>Alastair.kerr: Slides for Alastair Kerr's talk at the workshop</p>
<hr />
<div>Slides for Alastair Kerr's talk at the workshop</div>Alastair.kerrhttp://bioinfo-core.org/index.php?title=Galaxy_Experiences&diff=9417Galaxy Experiences2011-07-26T10:56:48Z<p>Alastair.kerr: /* Advice for Initial Setup */</p>
<hr />
<div>== If adding to the list, please add your institution here and flag your comment ==<br />
* Default (not flagged): Bioinformatics Core, Wellcome Trust Centre for Cell Biology, Edinburgh, UK.<br />
** Galaxy was installed in our centre in 2007 and the first production server was rolled out in April 2008<br />
<br />
== Hardware on which it is installed ==<br />
* Main server: 2 x 6-core CPUs (24 logical cores with hyperthreading), 64GB RAM<br />
** Two instances running on different ports, one for testing and the other for production<br />
* Cluster: Under development<br />
* Various desktop machines for the development of new tools<br />
* May eventually add a cloud instance <br />
<br />
== Key uses in the core facility ==<br />
* Rapid prototyping: the ability to add a tool that is still under development and push the optimisation back to the users, by allowing them to play around with parameters and different data sets.<br />
* Generic workflows: Publish workflows for common tasks that anyone can import<br />
* Galaxy pages: Create tutorials and training materials with embedded Galaxy objects<br />
* Data Sharing: use of Galaxy's libraries to store and share data with users. As there is a concept of 'groups', we can share data with specific labs and projects. We have implemented specific file directories for each group so that command-line users can place their data there for easy upload to Galaxy's libraries without any data duplication <br />
<br />
== Additional Benefits ==<br />
* NGS-centric: many tools come with Galaxy wrappers<br />
* Requiring metadata on data type (and optionally genome build) forces good data practices <br />
* Any command line tool can be added fairly quickly: from a few minutes for a simple XML wrapper to a morning for a more complicated interface. <br />
<br />
== Unresolved Issues ==<br />
* Login via Apache: <br />
** At the moment, if authentication comes from Apache, Galaxy assumes that the user has permission to use Galaxy and will set up an account with the email given by Apache. This is why our group has not yet implemented it on our university cluster (see the config example below). <br />
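<br />
For reference, external authentication is controlled by a couple of options in Galaxy's main config file - the key names below are those used in Galaxy distributions of this era (check the .sample config shipped with your own install), and the mail domain is a placeholder:<br />
<pre>
# In Galaxy's main config file (universe_wsgi.ini at the time):
# trust the REMOTE_USER variable set by the Apache front end ...
use_remote_user = True
# ... and append this domain to the username Apache supplies to form
# the account email. Galaxy then auto-creates the account, which is
# exactly the behaviour described above.
remote_user_maildomain = example.ac.uk
</pre>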
<br />
== Advice for Initial Setup ==<br />
* Database for logging jobs: Galaxy can use SQLite [default], MySQL or PostgreSQL (see the example after this list). <br />
** SQLite will start to break as the load on the server increases<br />
** MySQL support lacks many of the reporting features<br />
** PostgreSQL is fully supported (and is used on the main Galaxy site), so I would recommend setting it up from the get-go, as migrating data between database backends is non-trivial<br />
* Turn off debugging on the production server, otherwise the paster.log file can become very large. <br />
* Do not run the Galaxy process as root: all jobs run by Galaxy are executed as the user that launched the process, and having all jobs run as root is unsafe. We create a dedicated galaxy user account, run the process as that user and have all files owned by that user.<br />
* Genome data: Galaxy has scripts available to download these. We only download the genomes relevant to our users and create new chain files and 2bit files for custom genomes.<br />
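<br />
As an example of the PostgreSQL recommendation above, switching database backends is a one-line change in Galaxy's main config file - universe_wsgi.ini in distributions of this era; the user, password and database name below are placeholders:<br />
<pre>
# universe_wsgi.ini - SQLAlchemy-style connection string telling Galaxy
# to log jobs and metadata to PostgreSQL instead of the SQLite default.
# Credentials and database name are placeholders.
database_connection = postgresql://galaxy_user:secret@localhost:5432/galaxy
</pre>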
<br />
== Data Clean Up ==<br />
* Data is not automatically deleted when the user deletes files from their history. Scripts are available to purge this data: run them from cron (see the example after this list)<br />
* There is an optimal order in which to execute these scripts; refer to the Galaxy wiki<br />
* Problem with users not deleting files: it is not trivial to link files in the data store back to individual users<br />
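<br />
For illustration, a nightly cron setup along these lines - the script names are those shipped in Galaxy's scripts/cleanup_datasets/ directory in distributions of this era, and the ordering follows the wiki's recommendation, but verify both against your own install before relying on this:<br />
<pre>
# Nightly Galaxy cleanup (crontab of the galaxy user). Histories must
# be deleted and purged before their datasets can be purged, hence the
# ordering. Paths and run times are placeholders.
0 2 * * *  cd /path/to/galaxy && sh scripts/cleanup_datasets/delete_userless_histories.sh
30 2 * * * cd /path/to/galaxy && sh scripts/cleanup_datasets/purge_histories.sh
0 3 * * *  cd /path/to/galaxy && sh scripts/cleanup_datasets/purge_datasets.sh
</pre>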
<br />
== Updating Galaxy ==<br />
* Fetch Galaxy updates from the Mercurial repository. Learn the Mercurial commands and how to merge/fork if implementing your own local changes to the Galaxy code.<br />
* Use the diff command on the .sample files to view changes to available tools, datatypes, environment parameters etc. after each update<br />
* The Galaxy Tool Shed contains repositories of 3rd-party tools to download and add to local instances<br />
* Read through the Galaxy wiki, particularly the Deploy Galaxy pages. <br />
* Add your own datatypes, external data sources and export links</div>Alastair.kerrhttp://bioinfo-core.org/index.php?title=Galaxy_Experiences&diff=9416Galaxy Experiences2011-07-26T10:45:56Z<p>Alastair.kerr: </p>
<hr />
<div>== If adding to the list, please add your institution here and flag your comment ==<br />
* Default (not flagged): Bioinformatics Core, Wellcome Trust Centre for Cell Biology, Edinburgh, UK.<br />
** Installed in our centre in 2007 and the 1st production server was rolled out in April 2008<br />
<br />
== Hardware on which it is installed ==<br />
* Main server: 2x6 core (=24 logical core) 64GB RAM<br />
** Two instances running on different ports, one for testing and the other for production<br />
* Cluster: Under development<br />
* Various desktop machines for the development of new tools<br />
* May eventually add a cloud instance <br />
<br />
== Key uses in the core facility ==<br />
* Rapid prototyping: ability to add a tool that is under development and push back the optimisation to the user by allowing the user to play around with parameters and different data sets.<br />
* Generic workflows: Publish workflows for common tasks that anyone can import<br />
* Galaxy pages: Create tutorials and training materials with embedded Galaxy objects<br />
* Data Sharing: Use of galaxy's libraries to store and share data with users. As there is a concept of 'groups' we can share data with specific labs and projects. We have implemented specific file directories for each group so that command line users can place their data there for easy upload to Galaxy's libraries without any data duplication <br />
<br />
== Additional Benefits ==<br />
* NGS centric: many tools come with galaxy wrappers<br />
* Metadata on genome build (optional) and data type forces good data practices <br />
* Any command line tool can be added fairly quickly: a few min for a simple XML wrapper to a morning for a more complicated interface. <br />
<br />
== Unresolved Issues ==<br />
* Login via Apache: <br />
** At the moment if authentication comes from apache, galaxy assumes that the user has permission to use galaxy and will set up an account with the email given my apache. This is why our group has not yet implemented it on our university cluster. <br />
<br />
== Advice for Initial Setup ==<br />
* Database for logging jobs: can use sqlite [default], mysql and postgres. <br />
** Sqlite will start to break as the load on the server increases<br />
** mysql support lacks many of the reporting features<br />
** postgres is fully supported (and is used on the main galaxy site) and hence I would recommend setting it up from the get-go as transferring data between schemas is non-trivial<br />
* Do not run the galaxy process as root as all jobs run by galaxy will be run by the user that launched the process. Having all jobs run as root is unsafe. We create a galaxy user account and run the process as that user and have all files owned by that user.<br />
* Genome data: galaxy should have script available to download these. We only download the genomes relevant to our users and create new chain files and 2bit files for custom genomes.<br />
<br />
== Data Clean Up ==<br />
* Data is not automatically deleted when the user deletes files from their history. Scripts are available to purge this data: use them in cron<br />
* There is an optimal order in which to execute these scripts, refer to the wiki<br />
* Problem with users not deleting files: not trivial to link fields in the data store to individual users<br />
<br />
== Updating Galaxy ==<br />
* Fetch galaxy updates from a mercurial repository. Learn mercurial commands and how to merge/fork if implementing your own local changes to Galaxy code.<br />
* Use diff command on .sample files to view changes to available tools, datatypes, environment parameters etc after each update<br />
* Galaxy Tool Shed contains repositories of 3rd party tools to download and add to local instances<br />
* Read through the Galaxy wiki, particularly the Deploy Galaxy pages. <br />
* Add your own datatypes, external data sources and export links</div>Alastair.kerrhttp://bioinfo-core.org/index.php?title=Galaxy_Experiences&diff=9413Galaxy Experiences2011-07-26T09:31:44Z<p>Alastair.kerr: /* Information based on experiences at the Bioinformatics Core, Wellcome Trust Centre for Cell Biology, Edinburgh, UK. */</p>
<hr />
<div>== If adding to the list, please add your institution here and flag your comment ==<br />
* Default (not flagged): Bioinformatics Core, Wellcome Trust Centre for Cell Biology, Edinburgh, UK.<br />
** Installed in our centre in 2007 and the 1st production server was rolled out in April 2008<br />
<br />
== Hardware on which it is installed ==<br />
* Main server: 2x6 core (=24 logical core) 64GB RAM<br />
** Two instances running on different ports, one for testing and the other for production<br />
* Cluster: Under development<br />
* Various desktop machines for the development of new tools<br />
* May eventually add a cloud instance <br />
<br />
== Key uses in the core facility ==<br />
* Rapid prototyping: ability to add a tool that is under development and push back the optimisation to the user by allowing the user to play around with parameters and different data sets.<br />
* Generic workflows: Publish workflows for common tasks that anyone can import<br />
* Galaxy pages: Create tutorials and training materials with embedded Galaxy objects<br />
* Data Sharing: Use of galaxy's libraries to store and share data with users. As there is a concept of 'groups' we can share data with specific labs and projects. We have implemented specific file directories for each group so that command line users can place there data there for easy upload to Galaxy's libraries without any data duplication <br />
<br />
== Additional Benefits ==<br />
* NGS-centric: many tools come with Galaxy wrappers<br />
* Metadata on genome build (optional) and on data type force good data practices <br />
* Any command-line tool can be added fairly quickly: a few minutes for a simple XML wrapper, up to a morning for a more complicated interface (a minimal example follows this list). <br />
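For a sense of scale, a minimal sketch of such an XML wrapper (the tool id, labels, and wrapped command are invented for illustration):<br />
<pre>
<tool id="word_count" name="Word count">
  <description>count lines, words and characters in a text file</description>
  <command>wc "$input" > "$output"</command>
  <inputs>
    <param name="input" type="data" format="txt" label="Input file"/>
  </inputs>
  <outputs>
    <data name="output" format="txt"/>
  </outputs>
</tool>
</pre><br />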
<br />
== Unresolved Issues ==<br />
* Login via Apache: <br />
** At the moment, if authentication comes from Apache, Galaxy assumes the user has permission to use Galaxy and creates an account with the email supplied by Apache. This is why our group has not yet enabled it on our university cluster (the settings involved are sketched below). <br />
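For reference, this behaviour is governed by the remote-user settings in universe_wsgi.ini; a sketch (the mail domain is an example value):<br />
<pre>
# universe_wsgi.ini
use_remote_user = True                  # trust the REMOTE_USER header set by Apache
remote_user_maildomain = example.ac.uk  # appended to the username to form the account email
</pre><br />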
<br />
== Advice for Initial Setup ==<br />
* Database for logging jobs: Galaxy can use SQLite (the default), MySQL, or PostgreSQL. <br />
** SQLite will start to break as the load on the server increases<br />
** MySQL is supported but lacks many of the reporting features<br />
** PostgreSQL is fully supported (it backs the main public Galaxy site), so I recommend setting it up from the start: migrating data between database backends later is non-trivial<br />
* Do not run the Galaxy process as root: every job Galaxy launches runs as the user that started the process, and having all jobs run as root is unsafe. We created a dedicated galaxy account, run the process as that user, and have that user own all Galaxy files.<br />
* Genome data: Galaxy provides scripts to download these. We only download the genomes relevant to our users, and we build new chain files and 2bit files for custom genomes.<br />
<br />
== Data Clean Up ==<br />
* Data is not automatically deleted when a user deletes files from their history. Scripts are available to purge this data: run them from cron<br />
* The scripts must be run in a specific order; refer to the wiki<br />
* Problem with users not deleting files: it is not trivial to map files in the data store back to individual users<br />
<br />
== Updating Galaxy ==<br />
* Fetch Galaxy updates from the Mercurial repository. Learn the Mercurial commands, and how to merge or fork, if you maintain local changes to the Galaxy code.<br />
* Use diff on the .sample files after each update to see changes to available tools, datatypes, environment parameters, etc.<br />
* The Galaxy Tool Shed contains repositories of third-party tools to download and add to local instances<br />
* Read through the Galaxy wiki, particularly the Deploy Galaxy pages. <br />
* Add your own datatypes, external data sources and export links</div>Alastair.kerrhttp://bioinfo-core.org/index.php?title=List_of_Software&diff=9411List of Software2011-07-26T09:15:30Z<p>Alastair.kerr: </p>
<hr />
<div>Please enter information about your core facility on the [[BioWiki:Community Portal]]. <br />
<br />
[[Category:Tools]]<br />
<br />
<br><br />
<hr><br />
<h2>Tools popular with biologists</h2><br />
<br>This is a list of the tools that were mentioned in our discussion on "bioinformatics for the biologists" on Feb 4th, 2008.<br />
<br><br><br />
<h3>Microarray data retrieval and basic analysis</h3><br />
<ul><br />
<li>[http://base.thep.lu.se BASE] from Lund University, Sweden (basic analysis and database)<br />
<li>[http://www.tm4.org/ Multiple Experiment Viewer, from the Quackenbush group at Harvard University]<br />
<li>[http://bioinformatics.skcc.org/webarray/ WebArray: web interface, based on Bioconductor and MySQL]<br />
<li>[https://carmaweb.genome.tugraz.at/carma CarmaWeb]<br />
<li>[http://www.bioinformatics.bbsrc.ac.uk/projects/chipmonk/ ChipMonk] for ChIP-on-chip data, e.g. from Nimblegen<br />
<li>[http://biosun1.harvard.edu/complab/dchip/ DChip]: basic analysis (e.g. getting a list of differentially expressed genes)<br />
<li>[http://bioinfo.bgu.ac.il/bsu/microarrays/links/ A good collection of links] covering multiple applications<br />
</ul><br />
<h3>Structure visualization and basic analysis</h3><br />
<ul><br />
<li>[http://pymol.sourceforge.net/ PyMOL]<br />
<li>AutoDock for docking<br />
<li>DiscoveryStudio (?)<br />
<li>[http://kinemage.biochem.duke.edu Kinemage] for 3D macromolecule analysis, from the Richardson Lab at Duke<br />
<li>[http://swissmodel.expasy.org/spdbv/ DeepView] for structure analysis and modeling, from the Swiss Institute of Bioinformatics and GlaxoSmithKline R&D<br />
<li>[http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml Cn3D], a structure viewer from NCBI<br />
</ul><br />
<h3>Visualizing molecular interaction networks and integrating these interactions with gene expression profiles and other state data</h3><br />
<ul><br />
<li>[http://cytoscape.org/ Cytoscape]<br />
<li>[http://reactome.org/ Reactome] including [http://www.reactome.org/cgi-bin/skypainter2?DB=gk_current Skypainter] and the latest biomart interface [http://brie8.cshl.org/cgi-bin/mart Reactome Mart]<br />
<li>[http://www.bioinformatics.ed.ac.uk/epe/ EPE] Edinburgh Pathway Editor<br />
</ul><br />
<h3>Pipeline Tools</h3><br />
<ul><br />
<li>[http://taverna.sourceforge.net/ Taverna] Best used in conjunction with in-house web services to ensure reliability. Most command-line applications can be converted to a web service using [http://www.ebi.ac.uk/soaplab/ Soaplab]. Version 2 of Taverna should handle large files much better.<br />
<li>[http://galaxy.psu.edu/ Galaxy] This package is from Penn State: [[Galaxy Experiences|Experiences in a Core Facility]]<br />
<li> [http://www.biomedcentral.com/1471-2105/8/208 Alternatives:] A paper highlighting Seahawk (a [http://biomoby.org/ BioMoby] client) that also mentions Taverna and Remora <br />
</ul><br />
<br />
<hr><br />
<h3>Looking for collaborations in Rosetta Resolver training</h3><br />
The National Institute of Biotechnology is experimenting with Rosetta Resolver as part of a national test-bed for bioinformatics software. As part of this collaboration, we are trying to develop a training strategy that would let researchers, in both industry and academia, assess the value of Resolver for their research with minimal training.<p><br />
We are looking for someone interested in jointly developing training material for people who understand at least the basics of microarray analysis and wish to get started with Resolver quickly.<p><br />
Eitan Rubin, [mailto:erubin@bgu.ac.il erubin@bgu.ac.il]<br />
<hr><br />
<br />
<hr><br />
<center>Return to [http://www.bioinfo-core.org Bioinfo-core wiki]</center><br />
<br />
<hr></div>Alastair.kerr