How to Evaluate and Use Human and Mouse mRNA Data Sets (e.g. GTEx)
In this HOW TO we evaluate the new human GTEx data using a single "pro band" gene called SIRT1. There are about 48 GTEx data sets in GeneNetwork (May 2014)
These data sets are RNA-seq RPKM data generated at the Broad Institute as part of a large NIH program (http://www.gtexportal.org/home/). All of the data are from a single large cohort of normal human subjects. There will eventually be hundreds of samples for each tissue, organ, or cell type, but at present numbers of cases per tissue can be low (e.g., n = 5 for liver). These are new data of unproven quality, but let's see what kind of data we get using GTEx and a brain region. Here are the steps along with images for the analysis of the GTEx amygdala data.
1. First we need to find the data in GeneNetwork (GN). There are many choices, including a number of brain regions (amygdala, caudate, cerebellum...). GN does not tell us how many samples are in each data set, but you can look it up quite easily. In the case of Adrenal Gland mRNA there are only 12 RNA-seq data sets. I would not bother with RNA-seq data (except for confirmatory analysis) when n < 20 cases.
For the amygdala, the sample size is now 23. This is marginalbut let's evaluate the data quality anyway. Type in SIRT1 in the search field.
You will see that SIRT1 expression in human amygdala is low—2 to 3 units. This is on a log2 RPKM scale, and the SIRT1 values are all between 2.3 and 3.2. This is very low value***, but perhaps still useable. The distribution shown below is certainly interesting . In this type of "normal" plot, if values are normally distributed across the subjects then the points will be arrayed on a straight line oriented from bottom-left to top-right. In this case the points are obviously not on a straight "normal" line. In fact there is something odd going on in this distribution—an unexpected break at about 2.8 units. Ten individuals are in a high group, the remaining 13 are in a low group. This is either an artifact (e.g., a batch effect), a sampling error, or a genuine biological effect. Let's hope for the latter.
*** In GN if the average log2 RPKM expression is less than 2, then we discard all of the values for that gene/transcript across all subjects. In this amygdala data, 40665 of 52776 gene/transcript entries were discarded in this way. What is a value of 2? ANSWER: When we download the GTEx RPKM values we add 1 to every value and then convert these counts to a log2 scale (1, 2, 4, 8, `6, 32, 64, 128, etc). Thus an original RPKM value of 0 is converted to 1, and log2 of 1 is 0. An RPKM value of 1 is converted to 2 and log2 of 2 is 1. An original RPKM value of 3 is converted to 4 and log2 of 4 is 2. In summary, if the average RPKM of the original data is less than 3 across all subjects, then we list the data with the value of 0 in GN. If for some reason you want to see the raw log2 values, then select the RawLog2 versions of the data. But beware, since the variation you see at this level will be very noisy.
How can we evaluate the quality of these SIRT1 data and the curious distribution shown above. A simple way to do this is to evaluate te covariates of SIRT1 expression in amygdala (or other tissues) to see if they "make sense". Use the Calculate Correlations tools shown below.
Note that I have selected the Literature r tab (one of three tabs) and requested that GN return a list of the top 2000 literature covariates. This procedure will generate a fixed list of genes/transcripts that are associated with SIRT1 based strictly on a statistical analysis of word co-occurennce in PubMed abstracts (see work of Ramin Homayouni and Michael Berry for more detail, PMID: 21533142l). Every search using SIRT1 in any tissue or organ of human, mouse, or rat, will generate the same list.
Why is this useful? Because, you can easily compare the literature correlations with the corresponding mRNA-based sample correlations to assess whether or not expression data are in line with expectation (scientific "priors"). Here is the output for SIRT1:
The three rows that I have highlighted in yellow have high sample correlations and high literature correlations (Lit Corr). For the gene in the first highlighted row (RRP8) there is already a known link to SIRT1 that you can find in OMIM (http://omim.org/entry/615818).
By the way, as shown in the figure below, GeneNetwork provides OMIM links to most genes in the Trait Data page.
Based on these literature correlations, the SIRT1 data appear to be of some value and it therefore makes sense to explore SIRT1 function in amygdala (and probably other tissues) of humans using GTEx.
The next step is to compute the top 2000 covariates of SIRT1 in amygdala using the default Sample r method. Go ahead and do this yourself.
Correlations range from 0.888 to –0.531 in this list of top 2000 covariates. You can resort these table (click on the column heads) and you can also export the table. Note also that there is a More Options button.
Let's evaluate whether these covariates make sense by performing a Gene Ontology (GO) enrichment analysis of this set (or a subset) using WebGestalt. Click on the Gene Set function at the top of the page.
If you run the GO using the top 500 to 2000—unfiltered by polarity of correlation—then you will get this result page (you may need to export into a drawing program to see the categories.
The GO analysis is done by a web service at Vanderbilt University run by Bing Zhang and team ( PMID: 15980575, PMID: 23703215).
I'm not sure if this GO analysis above makes much biological sense, but it certainly does not flag the mitochondrion. This initial GO analysis is dominated by the positive covariates, and positive covariates will include positive signal from confounders in the data. Odd as it may seem the negative covariates are often more useful.
To separately evaluate all of the negative covariates you have to open up More Options and just select the negative value subset. Enter the values as shown below and then click on the Select Traits button. Now repeat the Gene Set GO analysis.
At the bottom right you will see the category "mitochondrion" that includes this list of 96 mitochondrial genes. That is serious enrichment with a p value is about 1E-5. If you click on the red box that is labeled "Mitochondrion" then WebGestalt will list out all 96 of these genes and their negative r values. Here is part of the table of covariates.
The important thing is that you should be able to replicate this analysis easily yourself and then extend to other tissues.
Try the same analysis with other tissues, the hypothalamus for example. Among the top 2000, there should be 158 negative covariates of SIRT1. Statistics are even better than for the amygdala.
CONCLUSION: You now have an good idea of how to exploit covariation among mRNA microtraits to both evaluate data quality and to find potentially interesting biological associations—in this case between SIRT1 and genes critical in mitochondrial function. Once you have a better feel for the data, you can try other genes and other tissues. Can you find novel functional or molecular companions of SIRT1?