A Biologist's Guide to Working with Multi-Omics Datasets

A lab scientist can easily be worn out interpreting results from a single OMICS dataset. Imagine the complexity of working with multiple, varied types of OMICS experiments. Take a scientist who has run a time-series experiment and captured the transcriptome, epigenome and proteome at various time points. The data involve RNA-Seq, microRNA-Seq, SILAC and perhaps Co-IP followed by mass spectrometry. Such situations demand extensive data analytics and an understanding of integration methods. One option is to hire a dedicated bioinformatician; however, most bioinformaticians are wired around numbers, algorithms, programming and databases, and the underlying biology of the system is often the last thing on their minds. Here we describe 6 easy-to-use software tools that a biologist can learn with minimal effort and that will help manage multi-OMICS research and data integration. You will still need a bioinformatician to process the raw data.

OMICS Data Analysis using Excel

It is no surprise that Excel, Microsoft's spreadsheet program, is a useful tool for any type of OMICS dataset. Raw data processing is specific to the technology or the application area; however, at some point you will be viewing your data in Excel or a similar spreadsheet.
The data will eventually end up as a matrix of rows and columns. The rows are normally the identifiers and the columns the samples. An identifier can be a transcript ID, gene ID, SNP ID or protein ID. Extra columns can hold more information about individual transcripts, genes, proteins or SNPs. Similarly, each sample can carry more information, e.g. clinical information.
Scientists commonly use Excel to view, save and share data, do basic calculations and create graphs. Few realise how much data processing and mining can be done with some of Excel's smarter features. Excel supports multiple sheets, so all of your data can sit in a single Excel file.

Calculating Fold Change Using Excel

Fold change is the most widely used metric for estimating relative expression. It is simply the ratio Treatment / Control. To calculate fold change, first calculate the average of the replicates for both treatment and control. Type =AVERAGE( and select the replicate cells for the first row, then close the bracket. Then drag (or double-click) the small fill handle that appears when you point your mouse at the lower right corner of the cell. This will calculate the average for all rows in the sheet. Calculate the average for the controls in the same way, and then calculate Fold Change = Average Treatment / Average Control. For fold changes < 1, make another column and type = -1 / Fold Change. This expresses cases where control expression is higher than treatment on the same scale as up-regulation: a ratio of 0.5 becomes -2, i.e. a 2-fold decrease.
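For readers who prefer a script to spreadsheet formulas, the same arithmetic can be sketched in Python. This is only a minimal illustration: the gene names and replicate values are made up, and the function names are our own, not part of any package.

```python
from statistics import mean


def fold_change(treatment, control):
    """Ratio of the mean treatment value to the mean control value."""
    return mean(treatment) / mean(control)


def signed_fold_change(fc):
    """Report ratios below 1 as -1 / ratio, so a halving reads as -2."""
    return fc if fc >= 1 else -1 / fc


# Hypothetical replicate values for two genes (illustration only).
expression = {
    "geneA": {"treatment": [20.0, 22.0, 18.0], "control": [10.0, 11.0, 9.0]},
    "geneB": {"treatment": [5.0, 4.0, 6.0], "control": [10.0, 9.0, 11.0]},
}

results = {}
for gene, groups in expression.items():
    fc = fold_change(groups["treatment"], groups["control"])
    results[gene] = (fc, signed_fold_change(fc))

for gene, (fc, signed) in results.items():
    print(gene, fc, signed)
```

Here geneA doubles under treatment and geneB halves, so the signed values come out as 2 and -2 respectively.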

Calculating p-value

If your data are normally distributed, a t-test can be used to identify differences in expression. If they are not, you can use a different test or transform your data towards a normal distribution. We will discuss this in a separate blog. For now we will assume that your data are normally distributed.
To calculate a p-value using a t-test, type =T.TEST(select control samples, select treatment samples, 2, 2). The first 2 indicates a two-tailed test and the second 2 indicates two samples of equal variance. Change the last 2 to 1 if your samples are paired (see figure below). Stay subscribed to our blogs for a dedicated article on normality, different types of significance testing and working with the False Discovery Rate. A p-value <= 0.05 signifies that the gene/protein/miRNA is significantly differentially expressed between the two groups. Use Data -> Sort to sort the table so that the significant genes are grouped together (marked in red in the figure below).
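The same test can be run outside Excel. The sketch below assumes the SciPy library is installed and uses made-up replicate values; `ttest_ind` with `equal_var=True` corresponds to the "2, 2" settings described above, and `ttest_rel` to the paired variant.

```python
from scipy import stats

# Hypothetical replicate values for one gene (illustration only).
control = [10.1, 9.8, 10.3]
treatment = [20.2, 19.7, 20.5]

# Two-tailed, two-sample t-test assuming equal variance
# (the Excel equivalent is tails = 2, type = 2).
unpaired = stats.ttest_ind(control, treatment, equal_var=True)

# Paired t-test, for when each control is matched to one treatment sample
# (the Excel equivalent is tails = 2, type = 1).
paired = stats.ttest_rel(control, treatment)

# Apply the 0.05 cut-off used in this article.
is_significant = bool(unpaired.pvalue <= 0.05)

print("unpaired p-value:", unpaired.pvalue)
print("paired p-value:", paired.pvalue)
print("significant at 0.05:", is_significant)
```

In a real analysis this would be repeated for every row of the matrix, which is also where the False Discovery Rate correction mentioned above becomes important.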

Bring Your Multi-Omics Experiments to a Common Identifier Using DAVID and EASE

This is the most powerful tool if you are working with multi-omics datasets. Different omics experiments produce different identifiers. For example, microarrays give Affymetrix IDs, NGS gives Ensembl IDs or Entrez IDs, miRNA profiling gives miRNA-specific IDs, and quantitative proteomics gives UniProt IDs. The big question at this stage is how to compare the different OMICS experiments. The only option is to bring all the data to a common identifier. This is easier said than done: it involves a lot of information and needs to be done with care. DAVID and EASE is an excellent, easy-to-use tool that any biologist can use to convert one set of identifiers to another. You just need to supply the list, tell the software which identifier it is and which identifier you want to convert it to. DAVID and EASE can be accessed at https://david.ncifcrf.gov/
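Once you have a conversion table, applying it is straightforward. The sketch below is a minimal illustration of that step only; it does not call DAVID, and the identifiers, the table and the function names are all hypothetical. It also shows two things that need care in practice: several source IDs can map to the same gene, and some IDs will not map at all.

```python
from collections import defaultdict

# Hypothetical conversion table, as might be exported from an ID
# conversion tool: (source identifier, common gene identifier).
conversion_rows = [
    ("probe_001", "GENE_A"),
    ("probe_002", "GENE_B"),
    ("probe_003", "GENE_B"),  # two probes mapping to the same gene
    ("prot_X1", "GENE_A"),
    ("prot_X2", "GENE_C"),
]


def build_lookup(rows):
    """Map each source identifier to the set of common identifiers."""
    lookup = defaultdict(set)
    for source_id, common_id in rows:
        lookup[source_id].add(common_id)
    return lookup


def to_common_ids(ids, lookup):
    """Return (set of common identifiers, list of unmapped source IDs)."""
    mapped, unmapped = set(), []
    for source_id in ids:
        if source_id in lookup:
            mapped |= lookup[source_id]
        else:
            unmapped.append(source_id)
    return mapped, unmapped


lookup = build_lookup(conversion_rows)
transcript_genes, transcript_unmapped = to_common_ids(
    ["probe_001", "probe_002", "probe_003", "probe_999"], lookup
)
protein_genes, protein_unmapped = to_common_ids(["prot_X1", "prot_X2"], lookup)

print(sorted(transcript_genes), transcript_unmapped)
print(sorted(protein_genes), protein_unmapped)
```

Keeping the list of unmapped identifiers, rather than silently dropping them, makes it easier to see how much of each experiment is lost in the conversion.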

Gene Ontology and Pathway Analysis Using Panther or DAVID and EASE

As a biologist, your biggest interest is in finding the pathways and networks that differ between the two groups. There is a large number of software tools for Gene Ontology, pathway and network analysis. Nearly all use the concept of enrichment in the differentially expressed genes or proteins compared to a background. The background is either all annotated genes/proteins or all those represented in your omics experiment.
With most software, all you need to do is input your list of differentially expressed genes/proteins, and the software takes care of the rest. DAVID and EASE does a good deal of downstream analysis, but if you find it confusing to use and interpret, use Panther. Panther is one of the simplest applications for Gene Ontology and pathway analysis. It also presents your classification as pie charts (Fig 1 to the right) and bar charts (Fig 2 to the right), which can be quite informative. Panther can be accessed online at http://www.pantherdb.org/
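To make the idea of enrichment against a background concrete, here is a minimal sketch using a one-sided hypergeometric test, a common way to express it. This illustrates the general concept only; the exact statistic used by DAVID and EASE or by Panther may differ, and the numbers are invented.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """Probability of seeing at least k pathway genes in a list of n genes,
    when K of the N background genes belong to the pathway."""
    upper = min(n, K)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


# Hypothetical numbers: a background of 20 genes, 5 of them in the pathway;
# 4 differentially expressed genes, 3 of which are in the pathway.
p = enrichment_p_value(k=3, n=4, K=5, N=20)
print(p)
```

Note how the result depends on N: choosing all annotated genes versus only the genes measured in your experiment as the background changes the p-value, which is why the choice of background matters.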

Finding the common or specific elements using Venny

At this point you will probably have a huge amount of information in the form of gene lists and lists of ontologies, pathways and networks. You will want to know which genes/proteins are common to 2 or more groups. Venny is one of the simplest tools for identifying common genes/proteins among two or more lists. You can also identify which genes/proteins are specific to a particular list and do not appear in another. If you click on a number in the Venn diagram you will get the list of those exclusive/shared genes. The figure on the right was created using Venny. Venny can be accessed at http://bioinfogp.cnb.csic.es/tools/venny/
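What a Venn diagram reports can also be reproduced with basic set operations. The following is a small sketch with made-up gene lists, not a replacement for Venny's interactive diagram.

```python
# Hypothetical lists of differentially expressed genes/proteins,
# already converted to a common identifier.
transcript_hits = {"GENE_A", "GENE_B", "GENE_C"}
protein_hits = {"GENE_B", "GENE_C", "GENE_D"}

common = transcript_hits & protein_hits           # in both lists
only_transcript = transcript_hits - protein_hits  # specific to transcriptomics
only_protein = protein_hits - transcript_hits     # specific to proteomics

print("common:", sorted(common))
print("transcript only:", sorted(only_transcript))
print("protein only:", sorted(only_protein))
```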

Integrating the Regulome to the Genome, Transcriptome and Proteome

While it is relatively easy to integrate information obtained from transcriptomics and proteomics studies, it gets quite complex when we try to integrate regulomics and genomics. When clinical information is involved, the whole exercise gets more complex and more difficult to mine. Let’s assume that we have information on differentially expressed miRNAs, variation information from genomic studies or active binding sites from a ChIP-Seq experiment, and we want to integrate some or all of this with transcriptomics and proteomics studies. The challenge lies in linking these regulome elements to the transcriptomics and/or proteomics experiments. A few approaches can be used:
1) Identifying how strong the link is between a given variation/miRNA/transcription factor and the transcriptome or proteome.
2) Identifying over-represented variations/miRNAs/transcription factors from the transcriptomics and proteomics studies and matching the pattern with the regulomics studies.
Such analysis gets incredibly complex, so we won’t go into detail here and will leave it for a dedicated blog on these approaches.
By this point you have probably realised that you are in the middle of a data explosion and still far from the biology that you are desperately looking for.

Omics Data Management is the Key. Easily Manage Your Data with Microsoft Access

Big multinational companies will have solutions for you. They would be custom solutions that may cost you a million dollars and still may not deliver real-time analytics. So what is the alternative?
There is a piece of software that may already be on your computer and can do much of what expensive software can do for you. It is part of the Microsoft Office suite and is known as Microsoft Access. You have probably never even thought about what it does. Basically, Microsoft Access is a powerful database system that is easy enough to use to be within the reach of a smart biologist. The details of how it can help will be covered in separate blogs, but here we give you a glimpse of how you can store all your omics datasets and develop links among various omics experiments. Once the links are in place, you can start creating queries and reports on how individual elements regulate or link up with each other.
Let’s take a very simple example. You have performed a transcriptomics experiment using microarrays and a proteomics experiment using 2D-DIGE. At some point in the analysis you will have a list of differentially expressed transcripts/genes and a list of differentially expressed proteins. Now you are interested in mapping the transcriptome to the proteome.
Open Access and create a database by choosing Blank Database. In the right panel provide the name and directory and click Create (Fig 1 on the right). Next, import your gene expression data and proteomics data as Excel files (External Data > Excel), using the defaults (Fig 2 on the right). You will also need an Excel file with the corresponding links between gene expression IDs and proteomics IDs. This can be obtained using DAVID and EASE, as discussed earlier. Import this link file in the same manner as the transcriptomics and proteomics data.
Now you will have to link the 3 tables. Go to Create > Query Design, add the 3 tables and link the common identifiers as shown in the picture. Select * in each table; * indicates that you want all the information (Fig 3 on the right). Close the window and you will be asked to save the query. Save it. It will appear in the left panel, and you can export the data in Excel format (Fig 4 on the right). You now have the information from the transcriptomics and proteomics experiments in a single table, ready to be analysed. You can bring other omics datasets into the database, create links and develop smart queries.
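Behind the Query Design window, linking 3 tables on common identifiers is a database join. The sketch below shows the same idea using SQLite from Python as a stand-in for Access, simply because it needs no installation. The table names, column names and values are all hypothetical.

```python
import sqlite3

# In-memory database standing in for the Access file.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE transcripts (probe_id TEXT, fold_change REAL);
    CREATE TABLE proteins (protein_id TEXT, fold_change REAL);
    CREATE TABLE links (probe_id TEXT, protein_id TEXT);
    """
)

# Hypothetical data: one transcript has a matching protein, one does not.
conn.executemany(
    "INSERT INTO transcripts VALUES (?, ?)",
    [("probe_001", 2.5), ("probe_002", -1.8)],
)
conn.executemany(
    "INSERT INTO proteins VALUES (?, ?)",
    [("prot_X1", 1.9), ("prot_X3", 3.1)],
)
conn.executemany(
    "INSERT INTO links VALUES (?, ?)",
    [("probe_001", "prot_X1"), ("probe_002", "prot_X2")],
)

# Join the three tables through the link table.
rows = conn.execute(
    """
    SELECT t.probe_id, t.fold_change, p.protein_id, p.fold_change
    FROM transcripts AS t
    JOIN links AS l ON l.probe_id = t.probe_id
    JOIN proteins AS p ON p.protein_id = l.protein_id
    ORDER BY t.probe_id
    """
).fetchall()

for row in rows:
    print(row)
conn.close()
```

Only rows with a match in all 3 tables are returned. Here probe_002 drops out because its linked protein was not in the proteomics list, which is worth remembering when a joined table turns out shorter than expected.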
Most important is how you manage your data. People join and leave your lab all the time. Unless you have a strong data management policy, your data and analytics will be lost over time. If data management is important to you, email us at info@seqome.com

Piecing Together the Puzzle Using Cytoscape

Before we can start using Cytoscape to integrate multi-omics datasets and work out the relevant networks and information flow in a biological system, we need to understand some basics.
A network is a set of interconnections between molecules in the system, defined as nodes and edges. Nodes are your genes or proteins, and an edge is the connection between two nodes. To understand cellular dynamics, the first step is to import or create a network. This defines the relationships among the various molecules in the system. From this point, the analysis depends on what you are looking for. The core Cytoscape is supported by hundreds of plugins for different types of analysis. While some plugins help you overlay your omics datasets, others help you find Gene Ontology terms, pathways and relevant networks. In newer versions of Cytoscape, plugins are called Apps. Cytoscape can be downloaded from http://www.cytoscape.org/
Fig 1 below shows what you see when you launch Cytoscape; you can create your own network or load an existing one. Fig 2 below shows the complete mouse network. You can zoom in, map your omics dataset and study your system in great detail.
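To make nodes and edges less abstract, here is a minimal sketch of how a network can be represented outside Cytoscape: a list of edges turned into a lookup of neighbours for each node. The gene names and interactions are invented for illustration.

```python
from collections import defaultdict

# Hypothetical interactions (edges) between molecules (nodes).
edges = [
    ("GENE_A", "GENE_B"),
    ("GENE_B", "GENE_C"),
    ("GENE_C", "GENE_D"),
    ("GENE_B", "GENE_D"),
]


def build_network(edge_list):
    """Return a dict mapping every node to the set of its neighbours."""
    network = defaultdict(set)
    for a, b in edge_list:
        network[a].add(b)
        network[b].add(a)
    return dict(network)


network = build_network(edges)

# The degree of a node is the number of edges touching it.
degree = {node: len(neighbours) for node, neighbours in network.items()}

for node in sorted(network):
    print(node, sorted(network[node]), degree[node])
```

A simple two-column table of interacting pairs like the `edges` list is all the structure a network needs to begin with.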

The MiMi plugin from http://mimiplugin.ncibi.org/ is a good plugin to start with for understanding the relations among your genes/proteins. All you need is your list of interesting genes/proteins; you then tell the plugin to find the relations among the molecules, whether direct or indirect. Once you have the network you can start changing the layout, reviewing the interactions, changing shapes and overlaying expression or other parameters. There are hundreds of things you can do with Cytoscape, and as of today there are 140 Apps/plugins that extend its basic features. The 3 figures below show the 3 steps for getting your network from a list of genes/proteins. The first step (Fig 1 below) is to launch the plugin. The second step (Fig 2 below) is to provide the list of genes/proteins and select the options. Start with query + nearest genes; it is good to try out all 4 options. The last figure is a sample output of the MiMi plugin. At this point you can overlay your expression data, color-code nodes or map other parameters onto the network to understand the interrelations among the nodes.
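The idea behind a "query + nearest genes" search and an expression overlay can be sketched in a few lines. This is our own simplified reading of the concept, not the MiMi plugin's actual algorithm: it keeps every interaction that touches a query gene and then assigns each node a color from its fold change. All names, interactions and values are hypothetical.

```python
# Hypothetical interaction list and query (illustration only).
edges = [
    ("GENE_A", "GENE_B"),
    ("GENE_B", "GENE_C"),
    ("GENE_C", "GENE_D"),
    ("GENE_D", "GENE_E"),
]
query = ["GENE_A"]

# Hypothetical signed fold changes to overlay on the nodes.
fold_changes = {"GENE_A": 2.4, "GENE_B": -1.7}


def query_plus_nearest(edge_list, query_genes):
    """Keep edges that touch a query gene, plus the nodes on those edges."""
    wanted = set(query_genes)
    kept_edges = [(a, b) for a, b in edge_list if a in wanted or b in wanted]
    nodes = set(wanted)
    for a, b in kept_edges:
        nodes.update((a, b))
    return nodes, kept_edges


def node_color(fold_change):
    """Red for up-regulated, blue for down-regulated, grey if unknown."""
    if fold_change is None:
        return "grey"
    return "red" if fold_change > 0 else "blue"


nodes, kept_edges = query_plus_nearest(edges, query)
colors = {node: node_color(fold_changes.get(node)) for node in nodes}

print(sorted(nodes), kept_edges)
print(colors)
```

Widening the search by one more step of neighbours would pull in GENE_C as well, which is the kind of difference you see when trying the other search options.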
