Bioinformatics Code-Along
Thursday July 27 at 4:00 PM PST/7:00 PM EST
Zoom link: Launch Meeting - Zoom
Objective
The objective of the TranscriptomicsQC starter project is to provide learners with a robust hands-on experience in performing quality control (QC) of transcriptomics data. This project focuses on microarray data quality control for beginners, with the option to include RNA-seq quality control for advanced users.
Through the step-by-step project activities, learners will develop a strong proficiency in evaluating, correcting errors, and filtering probes/reads, ensuring the quality of the data. This project aims to provide students with a comprehensive foundation in bioinformatics, emphasizing the critical aspect of quality control. By the end of the project, learners will be empowered to start independently exploring and analyzing genomic data using the powerful tools of R and Bioconductor.
If you are totally new to Bioinformatics, please start with the GEOQuest Combo Starter Project
Learning Outcomes
Upon completion of the TranscriptomicsQC starter project, you will have learned:
1. Importance of quality control in transcriptomics data analysis
Understand the significance of quality control in ensuring the reliability and accuracy of transcriptomics data. And recognize the impact of data quality on downstream analyses and interpretation.
2. Usage of R and Bioconductor for genomic data analysis
You will gain practical experience in utilizing R, a popular programming language for statistical computing, and Bioconductor, a widely-used collection of R packages for genomic data analysis.
3. Proficiency in data pre-processing and QC
- Develop hands-on experience in using Bioconductor packages and data visualization tools for data pre-processing and quality control.
- Acquire skills in performing data pre-processing steps such as background correction and normalization.
- Learn to generate informative visualizations, including PCA plots, boxplots, and RLE plots, to assess data quality.
By completing the TranscriptomicsQC project, you will develop a strong foundation in transcriptomics data quality control, enabling you to ensure the reliability and integrity of your data, make informed decisions in subsequent analyses, and replicate the essential workflow using R and Bioconductor.
Steps and Tasks
Note: If you are joining from the GEOQuest Combo Starter Project, please skip to Step 5.
1. Download Expression Data from GEO
- Download the expression data (
CEL
files) for the dataset GSE32323. [NOTE: For GSE32323, specifically, only download the files with names of the format*chip_array_C#*
. The other files are for cell lines, not actual human samples.]
Once you are comfortable with the flow, you can download all the datasets mentioned in the paper. The data is available as a TAR file containing multiple CEL files. Windows users may require software such as WinRAR to extract the files.
2. Acquire and Understand Metadata
- Metadata provides additional information about each sample, such as the age of the patient from whom the sample came, tissue or cell type of sample origin, disease status of the patient, etc. It is used in limma analysis, a statistical method for differential gene expression (DGE) analysis.
- To study the metadata, download the ‘Series Matix Files’, unzip, extract the metadata by copying onto a spreadsheet, store as a .csv file*
If you are having trouble with this step, you can find an appropriate GSE32323 metadata file here. Feel free to use this file in the next steps or use as reference to better understand how to create your own metadata file.
3. Install Required Packages
- Install and familiarize yourself with the following R packages by reading their documentation to understand their functions and use cases:
- Use
BiocManager::install("package_name")
to install: - Use
install.packages("package_name")
- Use
4. Load Data into R
- Once the necessary packages are installed, load the data into your R environment using the appropriate functions, such as
ReadAffy()
for ‘.CEL’ files andread.csv()
for metadata files.
5. Quality Control and Data Pre-processing
Replicate the analysis workflow performed in GEO2R using R and Bioconductor. This involves writing R code to perform the following steps:
- Conduct quality control (QC) using Bioconductor packages and data visualization
- Inter-array: quality control between arrays
- arrayQualityMetrics – distance between arrays (correlation heatmap), PCA, boxplots, density plots, RLE, NUSE
- “by hand”
- PCA
- Boxplots
- Relative Log Expression (RLE)
- Normalized Unscaled Standard Error (NUSE)
- Intra-array: quality control within arrays
- arrayQualityMetrics – variance mean dependence, RNA degradation plot, MA plots, perfect matches and mismatches (PM, MM)
- Inter-array: quality control between arrays
- Background correction and normalization
- Create data visualizations of quality control, background correction, and normalization.
- You should create at least 1 visualization for each of the methods below:
- PCA
- Boxplot (affy)
- RLE (affyPLM)
- NUSE (affyPLM)
- Create PCA plot before and after implementing background correction and normalization
- You should create at least 1 visualization for each of the methods below:
- Identify outlying samples using analysis from previous tasks
Background Information for Step 5
What is QC and why is it done?
The purpose of QC is to check whether the data and any conclusions drawn from it can be considered reliable. It also helps us detect outliers and improve the signal-to-noise ratio. If our QC shows that there are many outliers in our data (i.e., a low signal-to-noise ratio), the data may be too variable to find any meaningful results.
What is background correction and normalization?
Background correction removes signal from nonspecific hybridization (i.e., signal emitted by things other than a sample hybridizing to a probe). Normalization corrects for systematic biases due to environment variations such as different absorption rates, spatial heterogeneity in a microarray chip, etc. These two processes are sometimes performed separately, but the packages we’re using perform these two methods in one function. There are 3 different methods of background correction and therefore 3 possible functions you can use:
- MAS 5.0 (Affymetrix’s algorithm – Microarray Suite)
- Robust Multiarray Averaging (RMA)
- GeneChip RMA (GCRMA) – page 4-6
Visualizations
Boxplots
Boxplots are used to visualize the distribution of data. For transcriptomics, we want to visualize the distribution of probe intensities for each sample. So, this figure will contain multiple boxplots, one for each sample. Using this visualization, we can see whether any of the samples have a drastically different distribution than others. If a sample’s probe intensity distribution is substantially different from others, it may be an outlier.
- You can use the
boxplot
function to create a boxplot in R
Principle Component Analysis (PCA)
PCA is often used as a dimensionality reduction calculation. I’ve always found visualizing the process of PCA much more helpful than reading a written explanation. I highly recommend you check out the StatQuest videos listed in Resources.
Here, “dimension” refers to the number of attributes in our data (i.e., the probes). There are thousands of probes in our data and it would be difficult (mostly impossible) to visualize thousands of axes on a single plot. But we want to visualize the variation between samples stratified by probe. By using PCA, we can reduce the dimensions necessary to explain the variation and can reasonably visualize the data. The rationale behind PCA has its roots in matrix algebra. If you are new to matrix algebra, but would like to learn more, check out Introduction to Matrix Algebra and Reduced Row Echelon Form
- Use the
prcomp()
function withscale=FALSE
andcenter=FALSE
- Plot PC1 vs. PC2
Correlation Heatmaps
Heatmaps add meaningful colors to data visualizations based on data values. Correlation is a measure of how similar 2 data points are and values range from -1 to 1.
- A value closer to 0 indicates that data are dissimilar
- A value closer to 1 indicates that data are very similar and vary in the same direction (i.e., positive correlation)
- A value closer to -1 indicates that data are very similar and vary in the opposite direction (i.e., negative correlation)
Hierarchical Clustering (i.e., what is the branching structure in the margins of pheatmap()
output?)
If you’re curious about the branching structure on the top and left side of your heatmap, I suggest you look into hierarchical clustering. Hierarchical clustering is an unsupervised learning technique that tries to cluster similar things together with no prior knowledge using dis/similarity metrics, such as correlation. The branching structure can be interpreted much like a family tree (or a phylogeny tree). Samples connected by “lower” branches are more similar than those with further branches. You only need a basic understanding of algebra to understand most of the dis/similarity metric calculations!
- Use
cor()
to calculate correlation between samples - Use
factor()
to group samples into Normal and Cancer - Use
pheatmap
or ggplot2 to plot a heatmap
How to identify outliers
Once we have visualizations, we have to figure out what they mean and determine which samples will be considered outliers, if any. These outliers should be removed before continuing analysis. There may not be an initial consensus on which samples are outliers. This determination can be subjective especially for edge cases.
STEM-Away Evaluation for Microcredentials and Virtual-Internship Eligibility
Automatic Evaluation:
a) Submit list of outlier(s) identified
Manual Evaluation:
a) Write a brief report summarizing the analysis process and key findings. Include the steps performed, the rationale behind each step, and the interpretation of the results. Your report should demonstrate a clear understanding of the analysis workflow and your ability to interpret the results accurately.
b) Participate in a live presentation or discussion where you explain the analysis process and present your findings. You will be evaluated based on the clarity of your explanations, your ability to answer questions about the analysis, and your overall understanding of the project.
The combination of automatic and manual evaluation methods will provide a comprehensive assessment of your skills in running the R script code from GEO2R and understanding the analysis process.
You have the flexibility to choose any of the evaluation methods mentioned above, either the automatic evaluation by submitting the output file or the manual evaluation by submitting a report or participating in a live presentation. This allows you to showcase your skills and understanding of the project in a way that best suits your learning style and preferences.
Resources and Learning Materials
Here are some suggested resources that will aid you in this project:
Quality Control
- ArrayQualityMetrics: A Bioconductor package used to provide quality assessment results of microarray data.
- RLE In-Depth Explanation – page 15
- NUSE In-Depth Explanation – page 14
Data Visualization
Boxplots
- Boxplots, Clearly Explained (StatQuest)
- Quantiles and Percentiles, Clearly Explained!!! (StatQuest)
- What is a statistical distribution? (StatQuest)
- The Normal Distribution, Clearly Explained!!! (StatQuest)
PCA (Principal Component Analysis)
- PCA (Wikipedia)
- PCA Classic Example - Iris Data
- PCA main ideas in only 5 minutes (StatQuest)
- PCA, Step-by-Step (StatQuest)
- PCA - Practical Tips (StatQuest)
- PCA in R (StatQuest)
Heatmaps
- Drawing and Interpreting Heatmaps (StatQuest)
- Pearson’s Correlation, Clearly Explained (StatQuest)
- Hierarchical Clustering (Wikipedia)
- Understanding the Concept of Hierarchical Clustering
- Hierarchical Clustering (StatQuest)
Appendix
Feeling R stuck? Check out these helpful tips!
-
Visualization Tips
- By default
pheatmap
implements a blue-to-red palette, but you can change this if you want! Since we are looking at correlation and the values on the two extremes are meaningful, we want a “diverging” color palette (blue-to-red is also a diverging palette). - For outlier detection, we are interested in samples that are dissimilar. But samples that are dissimilar have correlation close to 0, between -1 and 1. Most diverging palettes color the middle values as white, grey, or something less visually striking. If you want to make the outliers more noticeable, use a dissimilarity metric, such as
1-correlation
. - For a visual explanation, check out StatQuest’s video on Drawing and Interpreting Heatmaps
- By default
-
Some methods require the data in different formats, so pay attention to the documentation. For example,
arrayQualityMetrics()
function is looking for an AffyBatch object; theprcomp()
function is looking for a data frame. You can determine an objects type by looking for the variable in the “Environment” tab in RStudio or by using the functionclass(object_name)
-
If you have a plot in the plot window that is squished or doesn’t look right, you can adjust the window size by dragging the edges. Or you can export the plot and set the dimensions of the output to larger values. This will create a PNG or PDF in your files that you can then open on your computer.
-
arrayQualityMetrics()
typically takes a long time to run and uses a lot of memory. If you are on windows and get an error message that mentions “can’t allocate vector of size…”, try increasing the memory you’re allocating to Rsudio using the functionmemory.limit()
. Note: you cannot allocate more memory than the amount of RAM on your computer. -
If you’re stuck on interpreting you QC visualizations
- Try googling the name of the plot or package.
- The
arrayQualityMetrics()
HTML output includes a little explanation of each plot and its interpretation. - Feel free to ask questions!
-
RMA correction runs the fastest with the lease memory requirements, so if you have a slow computer or not a lot of RAM, I recommend you use this method.