Galaxy Experiences

From BioWiki
Revision as of 02:45, 26 July 2011 by Alastair.kerr (talk | contribs)

If adding to the list, please add your institution here and flag your comment

  • Default (not flagged): Bioinformatics Core, Wellcome Trust Centre for Cell Biology, Edinburgh, UK.
    • Installed in our centre in 2007; the first production server was rolled out in April 2008

Hardware on which it is installed

  • Main server: 2 x 6-core CPUs (24 logical cores), 64 GB RAM
    • Two instances running on different ports, one for testing and the other for production
  • Cluster: Under development
  • Various desktop machines for the development of new tools
  • May eventually add a cloud instance

Key uses in the core facility

  • Rapid prototyping: a tool that is still under development can be added to Galaxy, and its optimisation handed over to the users, who can experiment with parameters and different data sets themselves.
  • Generic workflows: Publish workflows for common tasks that anyone can import
  • Galaxy pages: Create tutorials and training materials with embedded Galaxy objects
  • Data sharing: We use Galaxy's libraries to store and share data with users. Because Galaxy has a concept of 'groups', we can share data with specific labs and projects. We have set up a dedicated file directory for each group, so that command line users can place their data there for easy upload to Galaxy's libraries without any data duplication.

Additional Benefits

  • NGS-centric: many tools come with Galaxy wrappers
  • Metadata on genome build (optional) and data type enforces good data practices
  • Any command line tool can be added fairly quickly: from a few minutes for a simple XML wrapper to a morning for a more complicated interface.
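
To give a feel for how small a simple wrapper is, here is a rough Python sketch that builds one. The element names (tool, command, inputs, param, outputs, data) follow the Galaxy tool format as we understand it; the tool id, labels and the sort command are made up for illustration, so check the Galaxy wiki for the authoritative syntax.

```python
import xml.etree.ElementTree as ET


def build_wrapper(tool_id, name, command):
    """Return a minimal Galaxy-style tool wrapper as an XML string."""
    tool = ET.Element("tool", id=tool_id, name=name)
    ET.SubElement(tool, "description").text = "example wrapper"
    # The command line to run; $input and $output refer to the
    # parameter names declared in the inputs/outputs sections below.
    ET.SubElement(tool, "command").text = command
    inputs = ET.SubElement(tool, "inputs")
    ET.SubElement(inputs, "param", name="input", type="data",
                  format="tabular", label="Input file")
    outputs = ET.SubElement(tool, "outputs")
    ET.SubElement(outputs, "data", name="output", format="tabular")
    ET.SubElement(tool, "help").text = "Sorts the input file."
    return ET.tostring(tool, encoding="unicode")


# Hypothetical tool: wrap the Unix sort command.
wrapper_xml = build_wrapper("sort_example", "Sort lines",
                            "sort $input > $output")
print(wrapper_xml)
```

In practice most people write the XML by hand; the point is only that a simple tool needs a command line, its inputs and its outputs.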

Unresolved Issues

  • Login via Apache:
    • At the moment, if authentication comes from Apache, Galaxy assumes that the user has permission to use Galaxy and will set up an account with the email address given by Apache. This is why our group has not yet implemented it on our university cluster.

Advice for Initial Setup

  • Database for logging jobs: Galaxy can use SQLite (the default), MySQL or PostgreSQL.
    • SQLite will start to break as the load on the server increases
    • MySQL support lacks many of the reporting features
    • PostgreSQL is fully supported (and is used on the main Galaxy site), so I would recommend setting it up from the start, as transferring data between database systems later is non-trivial
  • Do not run the Galaxy process as root: all jobs run by Galaxy are run as the user that launched the process, and having all jobs run as root is unsafe. We created a dedicated galaxy user account, run the process as that user, and have all files owned by that user.
  • Genome data: Galaxy should have scripts available to download these. We only download the genomes relevant to our users, and create new chain files and 2bit files for custom genomes.
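
The two main points above (no root, no SQLite or MySQL in production) can be turned into a small check to run before launching. This is only a sketch: the function name is ours, and the connection strings are assumed to be URL-style (scheme://...) with made-up values, not settings copied from a real configuration.

```python
import os
from urllib.parse import urlparse


def launch_warnings(euid, database_url):
    """Return a list of warnings for a planned Galaxy launch."""
    warnings = []
    if euid == 0:
        warnings.append("running as root: every job would also run as root")
    # Only the URL scheme is inspected, e.g. 'sqlite', 'mysql', 'postgres'.
    scheme = urlparse(database_url).scheme.lower()
    if scheme.startswith("sqlite"):
        warnings.append("SQLite database: expect problems as load increases")
    elif scheme.startswith("mysql"):
        warnings.append("MySQL database: some reporting features are missing")
    return warnings


if __name__ == "__main__":
    # Hypothetical connection URL, for illustration only.
    example_url = "sqlite:///./database/universe.sqlite"
    for message in launch_warnings(os.geteuid(), example_url):
        print("WARNING:", message)
```

Passing the user id in as an argument, rather than reading it inside the function, makes the check easy to try out without actually being root.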

Data Clean Up

  • Data is not automatically deleted from disk when a user deletes files from their history. Scripts are available to purge this data: run them from cron
  • There is an optimal order in which to execute these scripts; refer to the wiki
  • Users who never delete their files are a problem: it is not trivial to link files in the data store to individual users
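
Because the order of the purge scripts matters, one option is to call a single driver from cron that runs them in sequence and stops as soon as one fails. A minimal sketch follows; the steps listed are placeholders, not the real script names, so substitute the actual scripts in the order given on the wiki.

```python
import subprocess
import sys


def run_in_order(commands):
    """Run each command in sequence; stop at the first failure.

    Returns the list of commands that completed successfully.
    """
    completed = []
    for command in commands:
        result = subprocess.run(command)
        if result.returncode != 0:
            print("stopped: %r exited with %d" % (command, result.returncode),
                  file=sys.stderr)
            break
        completed.append(command)
    return completed


# Placeholder steps; replace with the real purge scripts, in the wiki's order.
CLEANUP_STEPS = [
    [sys.executable, "-c", "print('step 1: placeholder')"],
    [sys.executable, "-c", "print('step 2: placeholder')"],
]

if __name__ == "__main__":
    run_in_order(CLEANUP_STEPS)
```

Stopping at the first failure avoids running a later purge step against a state that an earlier step was supposed to prepare.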

Updating Galaxy

  • Fetch Galaxy updates from the Mercurial repository. Learn the Mercurial commands, and how to merge/fork if you maintain your own local changes to the Galaxy code.
  • Use the diff command on the .sample files after each update to see changes to the available tools, datatypes, environment parameters etc.
  • The Galaxy Tool Shed contains repositories of 3rd party tools to download and add to local instances
  • Read through the Galaxy wiki, particularly the Deploy Galaxy pages.
  • Add your own datatypes, external data sources and export links
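
The .sample comparison is normally done with the diff command itself, but the same idea can be scripted, for example to check several config files after an update. The sketch below uses Python's difflib; the tool_conf.xml contents are invented for illustration, and in real use the two strings would be read from the local file and its .sample counterpart.

```python
import difflib


def sample_changes(local_text, sample_text, name):
    """Return unified-diff lines between a local config and its .sample."""
    return list(difflib.unified_diff(
        local_text.splitlines(),
        sample_text.splitlines(),
        fromfile=name,
        tofile=name + ".sample",
        lineterm="",
    ))


# Invented file contents: the updated sample lists one extra tool.
local = "<toolbox>\n  <tool file=\"a.xml\"/>\n</toolbox>"
sample = ("<toolbox>\n  <tool file=\"a.xml\"/>\n"
          "  <tool file=\"b.xml\"/>\n</toolbox>")

for line in sample_changes(local, sample, "tool_conf.xml"):
    print(line)
```

Lines starting with + are present in the new .sample file but missing from the local copy, which is usually what needs reviewing after an update.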