Transcriptomics Data Quality Control: Microarray
Objective
The primary objective of this project is to provide a robust, hands-on experience in performing quality control (QC) of transcriptomics data, specifically focusing on microarray data. You will learn how to evaluate, correct errors, and filter probes/reads to ensure the quality of the data using R and Bioconductor packages.
This project is classified as exploratory level because it utilizes well-established microarray technology with a straightforward, unified workflow. Working with CEL files of probe-level intensities (rather than raw sequence data) and using integrated R packages that handle much of the underlying complexity internally makes this an accessible starting point for bioinformatics analysis. The preprocessing steps (background correction and normalization) are relatively straightforward, and fewer technical decisions are required compared to more complex genomic analyses.
While microarray technology has largely been superseded by RNA-Seq for transcriptomics studies, the fundamental concepts and analytical skills you’ll develop here—including data quality assessment, normalization techniques, and statistical analysis—form a strong foundation that directly transfers to RNA-Seq analysis. The systematic approach to data quality control and preprocessing you’ll learn is valuable across all types of genomic data analysis.
This project is further expanded here: Bioinformatics Workflow - STEM-Away®
Learning Outcomes
By completing this project, you will:
- Understand the importance of quality control in transcriptomics data analysis:
- Recognize the impact of data quality on downstream analyses and interpretations.
- Appreciate how QC enhances the reliability and accuracy of transcriptomics data.
- Gain proficiency in using R and Bioconductor for genomic data analysis:
- Learn to install and utilize essential R packages for bioinformatics.
- Develop skills in data manipulation and visualization in R.
- Acquire expertise in data pre-processing and quality control techniques:
- Perform data pre-processing steps such as background correction and normalization.
- Generate informative visualizations (e.g., PCA plots, boxplots, RLE plots) to assess data quality.
- Identify and remove outlier samples to improve data integrity.
Prerequisites and Theoretical Foundations
1. Basic Knowledge of R Programming
- Data Structures: Vectors, data frames, lists.
- Control Flow: If-else statements, loops, functions.
- Data Manipulation: Reading and writing data, subsetting, merging.
R code examples:
# Basic data structures
# (names chosen so they do not mask base R's vector() and list() functions)
values <- c(1, 2, 3)
samples <- data.frame(sample = c('A', 'B', 'C'), value = c(10, 20, 30))
sample_info <- list(name = 'Sample1', values = values)
# Control flow
for (i in 1:5) {
  print(i)
}
# Functions
add_numbers <- function(x, y) {
  return(x + y)
}
result <- add_numbers(5, 3)
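The data manipulation skills listed above (reading and writing, subsetting, merging) can be practiced with a few lines of base R. This is only a minimal sketch: the two small tables, their column names, and the temporary file are invented for illustration.

```r
# Hypothetical expression values and sample annotations (illustration only)
expr <- data.frame(sample = c("A", "B", "C"), value = c(10, 20, 30))
annot <- data.frame(sample = c("A", "B", "C"),
                    condition = c("tumor", "normal", "tumor"))

# Writing and reading: round-trip the table through a temporary CSV file
csv_path <- tempfile(fileext = ".csv")
write.csv(expr, csv_path, row.names = FALSE)
expr_in <- read.csv(csv_path, stringsAsFactors = FALSE)

# Merging: join the two tables on the shared "sample" column
merged <- merge(expr_in, annot, by = "sample")

# Subsetting: keep only the tumor samples with a value above 15
tumor_high <- merged[merged$condition == "tumor" & merged$value > 15, ]
print(tumor_high)
```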
2. Understanding of Microarray Technology
- Microarrays: Tools for measuring gene expression levels across thousands of genes simultaneously.
- CEL Files: Raw data files generated from Affymetrix microarray experiments.
- Probes and Probe Sets: Short DNA sequences used to detect gene expression.
Microarray concepts:
- Probes: Single-stranded DNA sequences designed to hybridize to specific RNA transcripts.
- Probe Sets: Groups of probes targeting the same gene.
- Hybridization: The process where DNA/RNA strands pair with complementary sequences.
3. Fundamentals of Quality Control in Transcriptomics
- Importance of QC:
- Ensures data reliability and accuracy.
- Identifies technical artifacts and outliers.
- Background Correction and Normalization:
- Corrects for non-specific signals and systematic biases.
- Data Visualization Techniques:
- Boxplots: Assess distribution of probe intensities.
- Principal Component Analysis (PCA): Reduce dimensionality and visualize sample variation.
- Relative Log Expression (RLE) and Normalized Unscaled Standard Error (NUSE): Assess data quality across arrays.
QC concepts:
- Background Correction: Removes background noise from microarray data.
- Normalization: Adjusts data to account for technical variability.
- Outliers: Samples that deviate significantly from others and may distort analysis.
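To make the idea of normalization more concrete, here is a toy sketch of quantile normalization, the normalization step that RMA applies across arrays. It is a simplified illustration in base R, not the code the affy package runs; the matrix values and the function name quantile_normalize are made up for this example.

```r
# Toy intensity matrix: rows are probes, columns are arrays (values are made up)
x <- matrix(c(5, 2, 3, 4,
              4, 1, 6, 2,
              3, 4, 6, 8),
            nrow = 4,
            dimnames = list(paste0("probe", 1:4),
                            c("array1", "array2", "array3")))

quantile_normalize <- function(x) {
  # Rank of each value within its own array (ties broken by order of appearance)
  ranks <- apply(x, 2, rank, ties.method = "first")
  # Reference distribution: mean of the k-th smallest value across arrays
  reference <- rowMeans(apply(x, 2, sort))
  # Replace each value by the reference value that has the same rank
  normalized <- apply(ranks, 2, function(r) reference[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

x_norm <- quantile_normalize(x)
print(x_norm)

# After normalization every array has the same set of values,
# so the boxplots of all arrays line up
boxplot(x_norm, main = "Toy data after quantile normalization")
```

After this step the arrays differ only in which probe carries which value, which is why boxplots of normalized data look nearly identical across samples.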
Skills Gained
- Data Acquisition and Management:
- Downloading and handling large genomic datasets.
- Managing metadata associated with biological samples.
- Proficiency with R and Bioconductor Packages:
- Utilizing packages like affy, affyPLM, arrayQualityMetrics.
- Performing data preprocessing and QC steps.
- Data Visualization and Interpretation:
- Creating and interpreting boxplots, PCA plots, heatmaps, RLE, and NUSE plots.
- Identifying outliers and assessing data quality visually.
- Critical Thinking in Bioinformatics:
- Understanding the impact of QC on downstream analyses.
- Making informed decisions based on QC results.
Tools Required
- Programming Language: R (version 4.0 or higher recommended)
- Integrated Development Environment (IDE):
- RStudio: Provides a user-friendly interface for R programming.
- Bioconductor Packages:
- affy: For handling Affymetrix microarray data.
- affyPLM: For fitting probe-level models.
- arrayQualityMetrics: For comprehensive quality assessment.
- CRAN Packages:
- ggplot2: Data visualization.
- pheatmap: Creating heatmaps.
- Data:
- GSE32323 Dataset: Available from the Gene Expression Omnibus (GEO).
Steps and Tasks
Note: If you are joining from the GEOQuest Combo Starter Project, please skip to Step 5.
Step 1: Download Expression Data from GEO
Tasks:
- Access GSE32323 on GEO:
- Visit the GEO accession link: GSE32323.
- Download CEL Files:
- Download the CEL files with names in the format *chip_array_C#*.
- Exclude files related to cell lines; focus on actual human samples.
- Extract the TAR File:
- Use extraction software (e.g., WinRAR for Windows) to unpack the TAR file and obtain the CEL files.
Implementation Details
- Downloading Files:
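- On the GEO page, look for the “Supplementary file” table near the bottom of the page (the “Download family” section holds the SOFT, MINiML, and Series Matrix files instead).
- Download the TAR archive listed there, which contains the CEL files.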
- Extracting Files:
- After downloading, right-click on the TAR file and select “Extract Here” or use appropriate software.
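If you prefer to stay inside R, the extraction can also be done with untar() from the utils package. The sketch below is one possible way to do it, under some assumptions: the function name extract_cel_files, its arguments, and the archive name in the example call are all hypothetical, and the pattern "chip_array_C" is taken from the file-name format mentioned in the tasks.

```r
# Unpack a GEO archive and return the paths of the CEL files it contains.
# `tar_path`, `out_dir`, and `keep_pattern` are hypothetical names for this sketch.
extract_cel_files <- function(tar_path, out_dir, keep_pattern = "chip_array_C") {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  untar(tar_path, exdir = out_dir)
  # GEO usually ships CEL files gzipped, so accept both .CEL and .CEL.gz
  cel_files <- list.files(out_dir, pattern = "\\.CEL(\\.gz)?$",
                          ignore.case = TRUE, full.names = TRUE)
  # Keep only the files whose names contain the requested pattern
  cel_files[grepl(keep_pattern, basename(cel_files), fixed = TRUE)]
}

# Example call (the archive name is an assumption; use the file you downloaded):
# cel_files <- extract_cel_files("GSE32323_RAW.tar", "GSE32323_CEL")
# length(cel_files)
```

Filtering by file name is only a convenience; it is still worth checking the resulting list against the sample descriptions on GEO.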
Step 2: Acquire and Understand Metadata
Tasks:
- Download Series Matrix File:
- Provides metadata for the samples, including patient information and experimental conditions.
- Process the Metadata:
- Extract relevant information into a spreadsheet.
- Save the metadata as a .csv file for easy loading into R.
Implementation Details
- Accessing Metadata:
- On the GSE32323 page, find the “Series Matrix File(s)” link.
- Creating the Metadata File:
- Open the TXT file in a text editor or spreadsheet software.
- Organize the data into columns (e.g., sample ID, condition, age).
- Save the file as metadata.csv.
- Alternative:
- If encountering difficulties, use the provided metadata file: GSE32323_metadata.csv.
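As an alternative to copying values by hand, the sample rows of a series matrix file can be read with base R. In that file, per-sample fields sit on tab-separated lines that start with "!Sample_". The following is a rough sketch, assuming you only need the accession and title; the function name parse_series_matrix and the output column names SampleID and Title are choices made for this example, and the demo file content is invented.

```r
# Pull selected "!Sample_" rows out of a GEO series matrix file and return
# them as a data frame with one row per sample.
parse_series_matrix <- function(path) {
  lines <- readLines(path)
  get_field <- function(field) {
    hit <- grep(paste0("^!", field, "\t"), lines, value = TRUE)[1]
    if (is.na(hit)) stop("Field not found: ", field)
    # Drop the field label, then strip the surrounding quotes from each value
    values <- strsplit(hit, "\t", fixed = TRUE)[[1]][-1]
    gsub('"', "", values, fixed = TRUE)
  }
  data.frame(SampleID = get_field("Sample_geo_accession"),
             Title = get_field("Sample_title"),
             stringsAsFactors = FALSE)
}

# Demo on a tiny made-up file that mimics the series matrix layout
demo_path <- tempfile(fileext = ".txt")
writeLines(c('!Series_title\t"Toy series"',
             '!Sample_title\t"normal 1"\t"tumor 1"',
             '!Sample_geo_accession\t"GSM0000001"\t"GSM0000002"'),
           demo_path)
demo_metadata <- parse_series_matrix(demo_path)
print(demo_metadata)

# To save the result for later steps:
# write.csv(demo_metadata, "metadata.csv", row.names = FALSE)
```

Other fields, such as the "!Sample_characteristics_ch1" lines, can be added the same way, but a field that appears on several lines needs extra handling because this sketch reads only the first match.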
Step 3: Install Required Packages
Tasks:
- Install Bioconductor Packages:
- affy, affyPLM, arrayQualityMetrics.
- Install CRAN Packages:
- ggplot2, pheatmap.
- Load the Packages into R.
Implementation:
# Install Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install(c("affy", "affyPLM", "arrayQualityMetrics"))
# Install CRAN packages
install.packages(c("ggplot2", "pheatmap"))
# Load the packages
library(affy)
library(affyPLM)
library(arrayQualityMetrics)
library(ggplot2)
library(pheatmap)
Explanation
- BiocManager: Recommended for installing Bioconductor packages.
- Package Documentation: Refer to the documentation for each package to understand their functions and usage.
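Re-running the installation commands every time can be slow. One possible refinement, shown here as a sketch, is to check which packages are still missing and install only those. The helper name missing_packages is made up for this example.

```r
# Return the packages from `pkgs` that cannot be loaded, i.e. are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# "stats" and "utils" ship with R, so nothing should be reported as missing
missing_packages(c("stats", "utils"))

# Possible use with the packages from this step (left commented out so that
# running the sketch does not trigger any installation):
# bioc_needed <- missing_packages(c("affy", "affyPLM", "arrayQualityMetrics"))
# if (length(bioc_needed) > 0) BiocManager::install(bioc_needed)
# cran_needed <- missing_packages(c("ggplot2", "pheatmap"))
# if (length(cran_needed) > 0) install.packages(cran_needed)
```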
Step 4: Load Data into R
Tasks:
- Set Working Directory:
- Set the R working directory to the folder containing the CEL files.
- Read the CEL Files:
- Use the ReadAffy() function from the affy package.
- Load the Metadata:
- Use read.csv() to load the metadata CSV file.
Implementation:
# Set working directory
setwd("path/to/CEL/files")
# Read CEL files
affy_data <- ReadAffy()
# Load metadata
metadata <- read.csv("metadata.csv", stringsAsFactors = FALSE)
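# Make sure there is one metadata row per array before renaming
# (this assumes the rows of metadata.csv are in the same order as the CEL files)
stopifnot(nrow(metadata) == length(sampleNames(affy_data)))
sampleNames(affy_data) <- metadata$SampleID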
Explanation
- ReadAffy(): Reads all CEL files in the working directory.
- Sample Names: It’s important that the sample names in affy_data match those in metadata.
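Because ReadAffy() names the samples after the CEL file names, one way to guard against a mismatch is to reorder the metadata by the GSM accession embedded in each file name instead of relying on row order. The sketch below assumes that every file name starts with its GSM accession and that the metadata has a SampleID column holding those accessions; the function name align_metadata and the example values are hypothetical.

```r
# Reorder metadata rows so they follow the order of the CEL files.
# Assumes each file name starts with its GSM accession (e.g. "GSM123_x.CEL.gz")
# and that metadata$SampleID holds those accessions.
align_metadata <- function(cel_names, metadata) {
  gsm_ids <- sub("^(GSM[0-9]+).*$", "\\1", basename(cel_names))
  idx <- match(gsm_ids, metadata$SampleID)
  if (anyNA(idx)) {
    stop("No metadata row for: ", paste(cel_names[is.na(idx)], collapse = ", "))
  }
  metadata[idx, , drop = FALSE]
}

# Made-up example: the metadata rows are in a different order than the files
cel_names <- c("GSM0000002_chip_array_C2.CEL.gz", "GSM0000001_chip_array_C1.CEL.gz")
toy_metadata <- data.frame(SampleID = c("GSM0000001", "GSM0000002"),
                           condition = c("normal", "tumor"),
                           stringsAsFactors = FALSE)
aligned <- align_metadata(cel_names, toy_metadata)
print(aligned)

# With real data this could be used as:
# metadata <- align_metadata(sampleNames(affy_data), metadata)
```

The function stops with an error when a file has no matching row, which is usually preferable to silently attaching the wrong annotation to an array.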
Step 5: Quality Control and Data Pre-processing
Tasks:
- Perform Inter-Array Quality Control:
- Boxplots: Assess the distribution of probe intensities across samples.
- PCA Analysis: Visualize sample clustering and identify outliers.
- Correlation Heatmaps: Examine similarities between samples.
- RLE and NUSE Plots: Assess data quality using affyPLM.
- Perform Intra-Array Quality Control:
- RNA Degradation Plots: Check for RNA integrity issues.
- MA Plots: Visualize intensity-dependent biases.
- Background Correction and Normalization:
- Use methods like RMA, MAS5, or GCRMA.
- Visualize Pre- and Post-Normalization Data:
- Generate plots to compare before and after normalization.
- Identify and Remove Outliers:
- Based on QC results, determine if any samples should be excluded.
Implementation:
# Inter-array QC - Boxplots before normalization
boxplot(affy_data, main = "Boxplot of Raw Intensities", col = "lightblue")
# PCA before normalization (log2 scale, so the brightest probes do not dominate)
exprs_raw <- log2(exprs(affy_data))
pca_raw <- prcomp(t(exprs_raw))
plot(pca_raw$x[,1], pca_raw$x[,2], col = "blue", pch = 16,
     xlab = "PC1", ylab = "PC2", main = "PCA of Raw Data")
# Background correction and normalization using RMA
norm_data <- rma(affy_data)
# Boxplots after normalization
boxplot(norm_data, main = "Boxplot of Normalized Intensities", col = "lightgreen")
# PCA after normalization
exprs_norm <- exprs(norm_data)
pca_norm <- prcomp(t(exprs_norm))
plot(pca_norm$x[,1], pca_norm$x[,2], col = "green", pch = 16,
xlab = "PC1", ylab = "PC2", main = "PCA of Normalized Data")
# Generate RLE and NUSE plots
# Fit probe-level models
plm_fit <- fitPLM(affy_data)
# RLE Plot
RLE(plm_fit, main = "Relative Log Expression (RLE) Plot")
# NUSE Plot
NUSE(plm_fit, main = "Normalized Unscaled Standard Error (NUSE) Plot")
# Correlation Heatmap
cor_matrix <- cor(exprs_norm)
pheatmap(cor_matrix, main = "Correlation Heatmap of Normalized Data")
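# Identify outliers based on plots
# For example, remove samples with high NUSE values
# type = "values" returns the NUSE matrix; the default call only draws the plot
nuse_medians <- apply(NUSE(plm_fit, type = "values"), 2, median, na.rm = TRUE)
outlier_samples <- names(which(nuse_medians > 1.1))
norm_data_filtered <- norm_data[, !(sampleNames(norm_data) %in% outlier_samples)]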
Explanation
- Boxplots: Compare the distribution of intensities before and after normalization.
- PCA Plots: Visualize how samples cluster and whether normalization improves clustering.
- RMA Normalization: Robust Multi-array Average method for background correction and normalization.
- RLE and NUSE: Diagnostic plots from affyPLM to assess data quality.
- Outlier Detection: Samples with median NUSE values significantly higher than 1 indicate poor quality.
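The tasks for this step also list MA plots, which the implementation above does not show. The affy package provides its own MAplot() function for AffyBatch objects; to show what such a plot contains, here is a base R sketch that computes M and A for one array against a pseudo-median reference array. The function name ma_values and the simulated data are assumptions made for this example, and the input is expected to be on the log2 scale.

```r
# Compute M (log ratio) and A (mean log intensity) for one array against a
# pseudo-median reference array built from all arrays.
ma_values <- function(log_expr, array) {
  reference <- apply(log_expr, 1, median)
  list(M = log_expr[, array] - reference,
       A = (log_expr[, array] + reference) / 2)
}

# Simulated log2 intensities for three arrays (illustration only)
set.seed(1)
toy <- matrix(rnorm(300, mean = 8), nrow = 100,
              dimnames = list(NULL, c("s1", "s2", "s3")))
toy[, "s3"] <- toy[, "s3"] + 1  # simulate an array that is brighter overall

ma <- ma_values(toy, "s3")
plot(ma$A, ma$M, pch = 16, xlab = "A", ylab = "M",
     main = "MA plot: s3 vs. median array")
abline(h = 0, col = "red")
```

For a well-behaved array the points scatter around M = 0 across the whole range of A; a cloud that sits above or below the line, or bends with intensity, points to an intensity-dependent bias.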
Background Information for Step 5
What is QC and Why is it Done?
Quality Control ensures that the data and conclusions drawn are reliable. It helps detect outliers and improve the signal-to-noise ratio. High variability or poor-quality data can lead to incorrect interpretations.
What is Background Correction and Normalization?
- Background Correction: Removes non-specific signals to reduce noise.
- Normalization: Adjusts data to correct for systematic biases (e.g., variations in sample processing).
Methods:
- MAS5.0: Affymetrix’s Microarray Suite algorithm.
- RMA: Robust Multi-array Average; widely used for normalization.
- GCRMA: An extension of RMA that considers GC content.
Visualizations
- Boxplots: Show the distribution of probe intensities for each sample; identify samples with different distributions.
- PCA: Reduces dimensionality to visualize sample variability in 2D or 3D plots.
- Correlation Heatmaps: Display pairwise correlation coefficients between samples; clusters similar samples together.
- RLE and NUSE: Assess the quality of arrays; deviations indicate potential issues.
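To see what an RLE plot summarizes, the calculation can be sketched in base R on any log-scale expression matrix: for each probe set, subtract its median across arrays, then look at the distribution of the resulting values per array. This is a simplified illustration on simulated data, not the affyPLM implementation, which works from the fitted probe-level model; the function name rle_matrix is made up.

```r
# Relative log expression: subtract each probe set's median across arrays
rle_matrix <- function(log_expr) {
  sweep(log_expr, 1, apply(log_expr, 1, median), FUN = "-")
}

# Simulated log2 expression for five arrays (illustration only)
set.seed(42)
toy <- matrix(rnorm(1000, mean = 8), nrow = 200,
              dimnames = list(NULL, paste0("s", 1:5)))
toy[, "s2"] <- toy[, "s2"] + 2  # simulate one array with a global shift

rle_vals <- rle_matrix(toy)
boxplot(rle_vals, main = "RLE of simulated data", ylab = "Relative log expression")
abline(h = 0, col = "red")

# Per-array medians: good arrays sit close to 0
rle_medians <- apply(rle_vals, 2, median)
print(round(rle_medians, 2))
```

An array whose box is shifted away from zero or is much wider than the others is a candidate for closer inspection.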
How to Identify Outliers
- Visual Inspection: Look for samples that deviate significantly in plots.
- Statistical Measures: Use metrics from QC plots (e.g., NUSE > 1.1).
- Consistency: Outliers should be consistently identified across multiple QC methods.
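The consistency criterion can be turned into a simple voting rule: flag a sample only when several QC metrics agree. The sketch below uses invented QC values for four samples. The 1.1 NUSE cutoff comes from the text above, while the RLE and correlation cutoffs, the function name consensus_outliers, and the min_votes parameter are assumptions for illustration.

```r
# Flag a sample as an outlier only if at least `min_votes` QC checks agree.
# `qc_flags` is a logical matrix: rows are samples, columns are QC checks.
consensus_outliers <- function(qc_flags, min_votes = 2) {
  votes <- rowSums(qc_flags)
  names(votes)[votes >= min_votes]
}

# Hypothetical QC summaries for four samples (values invented for illustration)
median_nuse <- c(s1 = 1.00, s2 = 1.15, s3 = 1.02, s4 = 1.12)
median_rle  <- c(s1 = 0.01, s2 = 0.30, s3 = -0.02, s4 = 0.03)
mean_cor    <- c(s1 = 0.97, s2 = 0.85, s3 = 0.96, s4 = 0.95)

qc_flags <- cbind(nuse = median_nuse > 1.1,       # cutoff from the text above
                  rle  = abs(median_rle) > 0.1,   # assumed cutoff
                  cor  = mean_cor < 0.9)          # assumed cutoff
print(qc_flags)

outliers <- consensus_outliers(qc_flags)
print(outliers)
```

Here the sample flagged by a single metric is kept, while the one flagged by all three is reported; how many votes to require is a judgment call that depends on the dataset.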
Code Snippets
1. Setting Up the Environment
# Install and load required packages
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install(c("affy", "affyPLM", "arrayQualityMetrics"))
install.packages(c("ggplot2", "pheatmap"))
library(affy)
library(affyPLM)
library(arrayQualityMetrics)
library(ggplot2)
library(pheatmap)
2. Loading the Data
# Set working directory to the location of CEL files
setwd("path/to/CEL/files")
# Read CEL files
affy_data <- ReadAffy()
# Load metadata
metadata <- read.csv("metadata.csv", stringsAsFactors = FALSE)
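# Assign sample names after checking there is one metadata row per array
# (this assumes the rows of metadata.csv are in the same order as the CEL files)
stopifnot(nrow(metadata) == length(sampleNames(affy_data)))
sampleNames(affy_data) <- metadata$SampleID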
3. Performing Quality Control
# Boxplot of raw data
boxplot(affy_data, main = "Boxplot of Raw Intensities", col = "lightblue")
# PCA of raw data (log2 scale, so the brightest probes do not dominate)
exprs_raw <- log2(exprs(affy_data))
pca_raw <- prcomp(t(exprs_raw))
plot(pca_raw$x[,1], pca_raw$x[,2], col = "blue", pch = 16,
     xlab = "PC1", ylab = "PC2", main = "PCA of Raw Data")
# Fit probe-level models
plm_fit <- fitPLM(affy_data)
# RLE plot
RLE(plm_fit, main = "Relative Log Expression (RLE) Plot")
# NUSE plot
NUSE(plm_fit, main = "Normalized Unscaled Standard Error (NUSE) Plot")
4. Background Correction and Normalization
# Normalize data using RMA
norm_data <- rma(affy_data)
# Boxplot of normalized data
boxplot(norm_data, main = "Boxplot of Normalized Intensities", col = "lightgreen")
# PCA of normalized data
exprs_norm <- exprs(norm_data)
pca_norm <- prcomp(t(exprs_norm))
plot(pca_norm$x[,1], pca_norm$x[,2], col = "green", pch = 16,
xlab = "PC1", ylab = "PC2", main = "PCA of Normalized Data")
5. Correlation Heatmap
# Compute correlation matrix
cor_matrix <- cor(exprs_norm)
# Create heatmap
pheatmap(cor_matrix, main = "Correlation Heatmap of Normalized Data",
clustering_distance_rows = "correlation",
clustering_distance_cols = "correlation")
6. Identifying Outliers
# Median NUSE per array (type = "values" returns the NUSE matrix instead of drawing the plot)
nuse_values <- apply(NUSE(plm_fit, type = "values"), 2, median, na.rm = TRUE)
# Identify samples with median NUSE > 1.1
outlier_samples <- names(nuse_values[nuse_values > 1.1])
print("Outlier samples based on NUSE:")
print(outlier_samples)
# Remove outliers from normalized data
norm_data_filtered <- norm_data[, !(sampleNames(norm_data) %in% outlier_samples)]
# Verify removal
sampleNames(norm_data_filtered)
Conclusion
In this project, you have:
- Developed proficiency in performing quality control on microarray data using R and Bioconductor packages.
- Learned to load and preprocess genomic data, including background correction and normalization.
- Generated and interpreted various data visualizations (boxplots, PCA plots, heatmaps, RLE, and NUSE plots) to assess data quality.
- Identified and removed outlier samples, enhancing the reliability of downstream analyses.
- Gained a solid foundation in bioinformatics, equipping you to explore and analyze genomic data independently.
Next Steps:
- Advanced Analysis:
- Perform differential gene expression analysis using packages like limma.
- Explore RNA-seq data QC and analysis for a deeper understanding.
- Data Integration:
- Integrate metadata into your analyses to study the effects of experimental conditions.
- Further Learning:
- Delve into statistical methods and machine learning techniques applied to genomics.
Encouragement for Further Exploration:
Quality control is a critical step in any genomic data analysis pipeline. By mastering these techniques, you ensure the integrity of your research and set a strong foundation for more advanced bioinformatics analyses. Continue to build on this knowledge, and consider contributing to open-source projects or collaborating with peers to enhance your skills.
Happy coding and learning!
Resources and Learning Materials
Here are some suggested resources that will aid you in this project:
Quality Control
- ArrayQualityMetrics: A Bioconductor package used to provide quality assessment results of microarray data.
- RLE In-Depth Explanation – page 15
- NUSE In-Depth Explanation – page 14
Data Visualization
Boxplots
- Boxplots, Clearly Explained (StatQuest)
- Quantiles and Percentiles, Clearly Explained!!! (StatQuest)
- What is a statistical distribution? (StatQuest)
- The Normal Distribution, Clearly Explained!!! (StatQuest)
PCA (Principal Component Analysis)
- PCA (Wikipedia)
- PCA Classic Example - Iris Data
- PCA main ideas in only 5 minutes (StatQuest)
- PCA, Step-by-Step (StatQuest)
- PCA - Practical Tips (StatQuest)
- PCA in R (StatQuest)
Heatmaps
- Drawing and Interpreting Heatmaps (StatQuest)
- Pearson’s Correlation, Clearly Explained (StatQuest)
- Hierarchical Clustering (Wikipedia)
- Understanding the Concept of Hierarchical Clustering
- Hierarchical Clustering (StatQuest)
Appendix
Feeling R stuck? Check out these helpful tips!
- Visualization Tips
- By default pheatmap implements a blue-to-red palette, but you can change this if you want! Since we are looking at correlation and the values on the two extremes are meaningful, we want a “diverging” color palette (blue-to-red is also a diverging palette).
- For outlier detection, we are interested in samples that are dissimilar. Dissimilar samples have correlations close to 0, in the middle of the -1 to 1 range, and most diverging palettes color the middle values as white, grey, or something less visually striking. If you want to make the outliers more noticeable, use a dissimilarity metric, such as 1-correlation.
- For a visual explanation, check out StatQuest’s video on Drawing and Interpreting Heatmaps.
- Some methods require the data in different formats, so pay attention to the documentation. For example, the arrayQualityMetrics() function is looking for an object such as an AffyBatch or ExpressionSet; the prcomp() function is looking for a numeric matrix or data frame. You can determine an object’s type by looking for the variable in the “Environment” tab in RStudio or by using the function class(object_name).
- If you have a plot in the plot window that is squished or doesn’t look right, you can adjust the window size by dragging the edges. Or you can export the plot and set the dimensions of the output to larger values. This will create a PNG or PDF in your files that you can then open on your computer.
- arrayQualityMetrics() typically takes a long time to run and uses a lot of memory. If you are on Windows and get an error message that mentions “can’t allocate vector of size…”, try increasing the memory you’re allocating to RStudio using the function memory.limit(). Note: you cannot allocate more memory than the amount of RAM on your computer. Also note that from R 4.2 onward memory.limit() no longer has any effect, so this tip applies only to older R versions.
- If you’re stuck on interpreting your QC visualizations:
- Try googling the name of the plot or package.
- The arrayQualityMetrics() HTML output includes a little explanation of each plot and its interpretation.
- Feel free to ask questions!
- RMA correction runs the fastest with the least memory requirements, so if you have a slow computer or not a lot of RAM, I recommend you use this method.
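
The 1-correlation tip from the visualization notes above can be tried on simulated data before applying it to the real arrays. The sketch below uses only base R (heatmap() instead of pheatmap) so it runs without extra packages; the simulated matrix and all variable names are invented for illustration.

```r
set.seed(7)
# Simulated log2 expression: 200 genes x 5 samples sharing a common gene profile
profile <- rnorm(200, mean = 8, sd = 2)
expr <- sapply(1:5, function(i) profile + rnorm(200, sd = 0.3))
colnames(expr) <- paste0("s", 1:5)
expr[, "s5"] <- rnorm(200, mean = 8, sd = 2)  # s5 ignores the shared profile

# Dissimilarity: 0 for identical samples, larger for less similar ones
dissimilarity <- 1 - cor(expr)

# Mean dissimilarity of each sample to the others (the diagonal is 0)
mean_dissim <- rowSums(dissimilarity) / (ncol(expr) - 1)
print(round(mean_dissim, 2))

# On this scale the dissimilar sample gets the extreme colors
heatmap(dissimilarity, symm = TRUE, scale = "none",
        main = "1 - correlation (simulated data)")
```

With the real data, the same matrix could be computed from exprs_norm and passed to pheatmap() instead of heatmap().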