The Omics platform being built through the RDS Omics project is made up of three interconnected components:
1. The microbial-GVL, a tailored version of the GVL (the “Genomics Virtual Laboratory”, a cloud-based suite of pre-configured genomics analysis tools and workflow platforms running on a scalable Slurm cluster), pre-populated with a suite of tools specifically for microbial -omics analysis.
Characteristics defining the microbial-GVL include:
- Highly scalable computationally
- Tool- and compute-centric (as opposed to data-centric)
- Highly capable in terms of available tools and analysis options
- Controlled, in that considerable effort goes into pre-configuring the GVL workbench
- Reproducible and workflow-oriented
- Flexible, in that several options for data analysis platforms are offered: workflow GUI (Galaxy); statistical GUI (RStudio); programmatic GUI (Jupyter Python GUI); command line. Tools on the microbial-GVL are available through all platform interfaces.
2. The Data Management Platform (or “DMP”, constructed with the DaRIS/Mediaflux platform), which will hold the multi-omics datasets (genomic, transcriptomic, proteomic, metabolomic). Using a specialised web-based portal, a user will be able to query the data and then send the query results to GenomeSpace (see component 3 below).
Several characteristics define the DMP:
- A structured queryable data model to organise the data
- A structured set of metadata, compatible with international standards for holding -omic datasets and submitting them to international repositories (see https://www.embl-abr.org.au/sepsis-project/)
- A highly curated reference dataset. This will initially include the SEPSIS raw data and, subsequently, derived secondary and tertiary data outputs, including assembled genomes, lists of metabolites and peptides from the metabolomics and proteomics streams, and transcript count summaries from differential gene expression analyses.
- Role-based access. Authenticated users (AAF supported) will gain access to the data according to their role (only project staff will have write access). The DMP is not suitable or intended for general read-write data storage, but rather for exposing high-value structured reference datasets.
- Dataset-centric: the DMP is not designed as a user environment, but as a repository
- A specialised web-based portal for querying the data and then despatching the results to end points
- Metadata that can be customised for other projects
- A customisable choice of queryable metadata
- The DMP can interoperate in many ways:
  - Send data to GenomeSpace
  - Send data to other end points (e.g. an scp server, user’s desktop computer)
  - Present the results of queries natively via protocols such as SMB (future)
In essence, the DMP fulfils the role of a curated national omics data repository: highly reliable, highly structured, highly curated and designed for high-value reference datasets.
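As an illustration only, the kind of structured metadata query described above can be sketched in Python. The `Dataset` class, its fields (`organism`, `omics_type`, `file_format`, `acquired`) and the `query` function are hypothetical names chosen for this sketch, and the records are made up; none of this is the DaRIS/Mediaflux data model or API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Dataset:
    """Hypothetical metadata record for one file held in the repository."""
    name: str
    organism: str
    omics_type: str   # e.g. "genomic", "transcriptomic", "proteomic", "metabolomic"
    file_format: str  # e.g. "fastq"
    acquired: date


def query(records, organism, omics_type, file_format, start, end):
    """Return records matching every metadata field and acquired within [start, end]."""
    return [
        r for r in records
        if r.organism == organism
        and r.omics_type == omics_type
        and r.file_format == file_format
        and start <= r.acquired <= end
    ]


# Made-up records, for illustration only.
records = [
    Dataset("sample_a.fastq", "Streptococcus pyogenes", "genomic", "fastq", date(2016, 3, 14)),
    Dataset("sample_b.fastq", "Streptococcus pyogenes", "genomic", "fastq", date(2015, 11, 2)),
    Dataset("sample_c.mzML", "Streptococcus pyogenes", "proteomic", "mzML", date(2016, 4, 1)),
]

hits = query(records, "Streptococcus pyogenes", "genomic", "fastq",
             date(2016, 1, 1), date(2016, 6, 30))
print([r.name for r in hits])
```

The point of the sketch is that a query is expressed entirely in terms of metadata fields, which is why the choice and structure of the metadata matter so much for a repository of this kind.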
3. GenomeSpace, the data-centric middleware that enables researchers and general users to access their omics data from the platform through a web-based file system interface and analyse it using a number of different tools, including (but not limited to) the microbial-GVL. GenomeSpace can be considered a true data-centric ‘virtual laboratory’ that will allow general users to store and access their data and data analyses on cloud resources, linking seamlessly to the microbial-GVL analysis environment, in addition to other tools.
GenomeSpace was developed at the Broad Institute of MIT and Harvard and is used extensively around the world in research and industry. One of our contributions from the RDS Omics project has been to extend GenomeSpace so that RDS Swift storage can be used, a feature that has been requested for GenomeSpace for some time.
In some ways GenomeSpace can be considered analogous to Google Drive, in that it provides a web-accessible file system view of cloud-based data, to which a variety of tools can be applied. Files and sets of files can be uploaded to GenomeSpace or imported from a number of sources. The DMP is one of those sources.
GenomeSpace has several defining characteristics:
- Highly flexible. Different data sources can be connected; files can be uploaded through a simple web interface and manipulated via simple file system functions.
- Highly accessible. GenomeSpace is web-based middleware that exposes a user’s research data in the context of a palette of analysis and visualisation tools.
- User-centric. Users manage their own data and construct their own virtual analysis environment. The screenshot below depicts a typical GenomeSpace interface, in which the user’s data is visible through a simple file system interface and a suite of tools (including the GVL) appears as options across the top. Users can send data to and from tools, and upload data to and download data from their data portal. GenomeSpace does not store data, but rather exposes cloud data stores through a web interface and brokers direct data transfers between analysis tools/platforms and the cloud data store.
The Bioplatforms Australia Antibiotic Resistant Pathogens Initiative reference data, made available through the DMP, should be considered only one of a number of potential data sources for analysis, and the microbial-GVL only one of a number of potential tools for analysis. GenomeSpace middleware precisely fits this paradigm.
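The brokering idea can be made concrete with a minimal Python sketch. Everything here is an assumption made for illustration: the `Broker` class, its `mount`, `register_tool` and `send` methods, the store and tool names, and the placeholder URLs are hypothetical and are not the GenomeSpace API. The sketch only shows the shape of the design: the middleware keeps references to data stores and tools, and a transfer is described as a direct hand-off from store to tool, with no file contents held by the broker itself.

```python
from dataclasses import dataclass, field


@dataclass
class Broker:
    """Hypothetical middleware: holds references to stores and tools, never file contents."""
    stores: dict = field(default_factory=dict)  # store name -> base URL
    tools: set = field(default_factory=set)     # registered tool names

    def mount(self, name, base_url):
        """Connect a data source by recording where it lives."""
        self.stores[name] = base_url.rstrip("/")

    def register_tool(self, name):
        """Make an analysis or visualisation tool available as a target."""
        self.tools.add(name)

    def send(self, store, path, tool):
        """Describe a direct store-to-tool transfer; no data passes through the broker."""
        if store not in self.stores:
            raise KeyError(f"unknown data store: {store}")
        if tool not in self.tools:
            raise KeyError(f"unknown tool: {tool}")
        return {"source": f"{self.stores[store]}/{path.lstrip('/')}", "target": tool}


broker = Broker()
broker.mount("dmp", "https://dmp.example.org/export/")     # placeholder URL
broker.mount("swift", "https://swift.example.org/bucket")  # placeholder URL
broker.register_tool("microbial-gvl")
broker.register_tool("other-tool")

transfer = broker.send("dmp", "/query-42/sample_a.fastq", "microbial-gvl")
print(transfer)
```

Because stores and tools are registered independently, adding a new data source or a new analysis tool does not require changing the other side, which is the property the paragraph above relies on.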
As and when other Omics datasets are ingested, a data model and associated metadata will need to be defined to suit each dataset. Once the metadata and data are in place, all the existing tools and workflows within the microbial-GVL can be used.
A high-level view of how the three components (DMP, GenomeSpace, GVL) are interconnected in terms of data flows is shown in Figure 1 below. From a user’s point of view, the three are used together in a simple workflow, as follows:
A – A biologist logs into the DMP and queries for a type of data in the platform. For example, they might request all genomics fastq files acquired in a specific time frame for the bacterium Streptococcus pyogenes.
B – The DMP then sends the resultant datasets to the user’s GenomeSpace portal. The flexibility of GenomeSpace allows the user to import other data from other sources into GenomeSpace if they wish.
C – The user then analyses the data using the microbial-GVL. Alternatively, as GenomeSpace is integrated with many other widely used analysis and visualisation tools, users can also choose to analyse or visualise their data outside the microbial-GVL.
D – Results of the analysis are sent back to the GenomeSpace portal, where they can be shared or published. Analyses can be recorded as workflows for the microbial-GVL and saved as publishable, reproducible entities in GenomeSpace.
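Steps A to D can be read as a small data-flow pipeline. The following Python sketch is a toy model of that flow under stated assumptions: the function names (`query_dmp`, `send_to_genomespace`, `analyse`, `return_results`), the catalogue layout and the file names are all hypothetical, and the "analysis" step is a stand-in that only names an output file. It does not call the DMP, GenomeSpace or the microbial-GVL.

```python
def query_dmp(catalogue, organism, file_format):
    """Step A: select dataset names from the repository by metadata."""
    return [name for name, meta in catalogue.items()
            if meta["organism"] == organism and meta["format"] == file_format]


def send_to_genomespace(workspace, dataset_names):
    """Step B: make the queried datasets visible in the user's workspace."""
    workspace.extend(dataset_names)
    return workspace


def analyse(inputs, tool):
    """Step C: stand-in for running a tool; it only names one output per input."""
    return [f"{name}.{tool}.result" for name in inputs]


def return_results(workspace, results):
    """Step D: send results back to the workspace for sharing or publishing."""
    workspace.extend(results)
    return workspace


# Made-up catalogue, for illustration only.
catalogue = {
    "sample_a.fastq": {"organism": "Streptococcus pyogenes", "format": "fastq"},
    "sample_b.mzML": {"organism": "Streptococcus pyogenes", "format": "mzML"},
}

workspace = ["my_own_upload.fastq"]  # data the user imported from another source
selected = query_dmp(catalogue, "Streptococcus pyogenes", "fastq")
send_to_genomespace(workspace, selected)
results = analyse(selected, tool="assembly")
return_results(workspace, results)
print(workspace)
```

In this model the workspace ends up holding the user's own upload, the dataset selected from the repository and the analysis output side by side, mirroring the way results return to the GenomeSpace portal in step D.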
These three components have been developed throughout 2016. The Omics platform will have full end-to-end functionality (steps A-D) by mid-September. Some pre-pilot data has already been loaded and, as pilot data is generated, it will be ingested into the platform throughout Q3.
End-to-end demonstrations will be scheduled from early October for iterative feedback and continuous improvement of the platform. If you are interested in scheduling a one-on-one demonstration, please contact us at firstname.lastname@example.org