10th Discussion - 20 January 2011

Brief Description and Continuing Discussion:

Topic 1: Storing and Analyzing data in the cloud (Led by Dawei Lin and Brent Richter)

Presentation to complement discussion: Media:BioCoreDiscussion_Cloud.pdf

Public, proprietary, and fee-for-service offerings in cloud computing are growing within the bioinformatics community, from collaboration tools (Google Docs) and hosting companies (Amazon, Univa) to bioinformatics tools and databases (Geospiza, Ensembl).

We will explore some of these as an introduction to the current state of cloud computing in bioinformatics and then develop an understanding of current utility for core facilities.

Preliminary Information

Questions we hope participants will consider and discuss:

  • What tools/services are being used by participants
  • Are there advantages to using the cloud, either for internal use or for sharing tools
  • Experiences with vendors who are providing analysis tools and databases in the cloud (Life Technologies, Bioscope)
  • What resources and time are needed for developing a cloud computing tool

Topic 2: Publishing the work of Core Facilities (Led by Fran Lewitter and Simon Andrews)

The primary objective of most core facilities is the effective processing of data for research groups rather than undertaking their own novel research. Where members of a core facility appear on a paper, it is usually behind members of the research group they were assisting.

However, during the course of operating a core facility, many of us will have generated software, methods or results which would be of use to the wider community, and it would be beneficial to everyone if this kind of work could be published.

This section of the call will look at the feasibility of publishing the work of a core facility: how people handle this at the moment, and whether there are steps we can take to encourage it in the future.

Preliminary Information

Questions which we hope to address would include (please feel free to add more):

  • Do people publish work they have done in the course of operating a core facility
  • Are there suitable journals for publishing this kind of work
  • How should time spent on preparing publications be funded
  • Are there areas of core facility work for which papers are not published, but where these sorts of papers would actually be useful
  • How do groups handle joint publications between research groups and the core facility

Transcript of Minutes

People Present

  • Jason Lee
  • Fran Lewitter - Whitehead
  • George Bell - Whitehead
  • Dawei Lin - UCDavis
  • Jose ?? - UCDavis
  • Simon Andrews - Babraham
  • Laxman Iyer - Tufts
  • Matt Eldridge - CRUK
  • Sarah Deer - Max Planck
  •  ?? - Indiana
  • Stuart Levine - MIT
  • Charlie Whittaker - MIT
  •  ?? - MGH
  • Brent Richter - Partners
  • Cory Johnson - NIH

Cloud Computing

This topic was covered on a call a couple of years back, but the tools were poorly developed at that stage, so we decided to revisit it.

[Introductory talk by Dawei]

UCDavis have been using the cloud for more than a year. They ran a workshop in conjunction with Amazon in Seattle and also held a conference on cloud computing in Santa Clara, finishing with a 2-day workshop.

The cloud environment will be a combination of public (commercial) and private (in-house) infrastructure.

The cloud is based around a generalised OS. Storage and hardware are all generic and can be expanded as required.

Scalability is the big advantage. It never used to be possible to suddenly be able to use >100 CPUs.

Big public datasets now exist in the cloud. This means there is no need to download them to use them - you can just do your analysis in the cloud. This becomes more important as datasets get bigger.

Cloud is now affordable. 3 months work on Amazon cost UCD $13. Makes it great for prototyping. It also means that next-gen sequencing data can be analysed without having to make a large initial outlay for hardware.

Cloud has proved to be more reliable than local resources.

Some companies are now setting up business in the cloud and providing services. These are already useful for some applications. More cloud-specific bioinformatics tools are being written. With these you can just download the analysis summary at the end without ever having to pull down the raw data.

Some data producers are depositing new data directly into the cloud. You can then choose to analyse it there, or to download it locally by pulling it from Amazon.

At this point the discussion opened up for questions

[Question] Is anyone actually using these resources yet?

A long silence suggested that no one is!

[Question] Are you pushing the use of the cloud towards end users, or merely for internal use?

[Answers] Most participants on the courses run so far have been from biology labs with no computing infrastructure. It allows them to run analyses in the cloud which would otherwise not be possible.

More and more sequence analysis is being delivered in the cloud. Users don't then need to upload the data there since it's already present.

Partners have many groups using cloud resources. They try to get the best resources for each group's needs. The cloud is mostly used for development, but systems are moved in house once live data is analysed. Some people are also pulling datasets from Amazon.

[Question] Can you only pay for Amazon using a credit card? Not everyone has one available to them.

[Answer] Amazon say that you can now pay via a purchase order.

[Question] Do you need an Amazon account to download the 1000 genomes data pushed into the cloud?

[Answer] Don't know first hand - but you should be able to download without an account. You'd just need an account if you wanted to analyse it in the cloud without downloading.

[Question] Could a core facility create an account and then let other people use it?

[Answer] This is currently being done. The users are charged for data upload and for custom machine images (which cost $1 per month for 11GB space).

Amazon is working to push towards the academic sector, so they are likely to develop a charging mechanism which would operate through a university.

[Question] For large datasets can you send physical media to Amazon, and what does the storage cost?

[Answer] The biggest bottleneck for any analysis is the lack of a large data pipe to Amazon. This is only a problem when initially uploading data. For large datasets you can send them physical media. Storage costs 10 cents per GB per month. You also pay for transfer in and out of the cloud - they do this to encourage you to leave your data there. Transfer between services within the cloud is free.
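To make the figures in this answer concrete, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the storage rate is the 10 cents per GB per month quoted on the call, while the function names, the example dataset sizes and the 100 Mbit/s link speed are assumptions made up for this sketch. Transfer charges are left out because no rate was given.

```python
# Storage rate quoted in the minutes (January 2011); current pricing will differ.
STORAGE_USD_PER_GB_MONTH = 0.10


def storage_cost_usd(size_gb, months=1, rate=STORAGE_USD_PER_GB_MONTH):
    """Estimated cost of keeping size_gb in cloud storage for a number of months."""
    if size_gb < 0 or months < 0:
        raise ValueError("size_gb and months must be non-negative")
    return size_gb * rate * months


def upload_hours(size_gb, link_mbit_per_s):
    """Rough upload time over a network link, ignoring protocol overhead.

    Uses decimal units: 1 GB = 8000 megabits.
    """
    if link_mbit_per_s <= 0:
        raise ValueError("link_mbit_per_s must be positive")
    seconds = size_gb * 8000 / link_mbit_per_s
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical example: a 500 GB dataset stored for a year,
    # and a 1000 GB dataset uploaded over an assumed 100 Mbit/s link.
    print(f"500 GB for 12 months: ${storage_cost_usd(500, months=12):.2f}")
    print(f"1000 GB at 100 Mbit/s: {upload_hours(1000, 100):.1f} hours")
```

Under these assumptions, a 1000 GB upload would occupy a 100 Mbit/s link for roughly 22 hours, which is the kind of delay that makes shipping physical media worth considering.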

As a footnote to this discussion, mention was made of VCL (Virtual Computing Lab). This is an environment which allows you to set up a cloud-like infrastructure in house. It is available from vcl.ncsu.edu.

Publishing the work of core facilities

The publication topic was suggested since many core facilities are not required or even encouraged to publish their work as part of their remit. This topic aimed to find out if people were publishing and to talk about the practical problems they may face when doing this - also to discuss whether we should be encouraging more core facilities to share data and methods which might be useful to other facilities.

[Question] Are people publishing their work, and are they required or encouraged to do so?

[Answer] At Partners they are not encouraged to publish, but they can do it if there's time. More commonly, case studies run by the group are published, but these tend to be written by someone else.

UCDavis DNA technology core director's job description says there's no requirement to publish, but applicants' previous publication records are reviewed when appointments are made. There therefore ends up being an implicit requirement to publish.

Whitehead have published but mostly through collaborators. Also occasionally do reviews which may cover some core work.

[Question] How do people handle costs associated with publishing, either direct (publication charges) or indirect (billable time spent writing papers)?

[Answers] Publication charges are a delicate issue. Normally people end up writing papers in their own time since there is no official funding.

In some cases it is possible to get the parent institution to pay for this, since ultimately getting more publications is good for them.

[Question] Which journals would people suggest as the most likely to accept papers covering the work of core facilities?

[Answers] Bioinformatics, PLoS One, NAR web issue.

It was pointed out that in other fields there are specialised trade journals that cater to similar groups as core facilities, where topics of interest (for example bioinformatics software usability) are not considered suitable for conventional journals in the field. Institutions such as the IEEE set these up and might consider it for core facilities.

It was also suggested that for unconventional results even something as simple as the core wiki might be a good place for a short write-up of a topic which might be of interest to other facilities. There was some concern about the lack of peer review, but the suggestion was that this sort of thing would operate more like PLoS One, where there is superficial review followed by ongoing comment on the article, which ends up determining its relevance.

Other communities have undertaken similar projects before. The worm community publish their own newsletter, which has helped to disseminate information and has also provided exposure of the community to other areas of science.

One big advantage to allowing small simple articles in a lightly reviewed format is that they are at least then able to be cited. Some people have had problems suggesting the use of unpublished software or methods when writing grants. Some grant bodies don't like the use of a URL as a citation, and a specialised journal would allow a proper citation.

Mention was made of the PLoS CB core facility collection. This is a collection of articles from the various PLoS journals which are relevant to core facilities. (The initial collection comprises papers that emerged from a previous ISMB Bioinformatics core workshop.) This is apparently an area which PLoS CB is keen to expand, and they would encourage core facility software submissions to PLoS One. Such articles could be linked into the PLoS CB core facilities collection. If people have suggestions for papers that might be suitable to publish in the core facilities collection, please contact the editors.

In summary, many people felt that the community would benefit from increasing the number of publications which came out of core facilities, and that it would be good if core facilities were encouraged and supported in doing this.