🟡 RNA-Seq Data Quality Control and Preprocessing

stemaway · October 30, 2024, 4:15pm

RNA-Seq Data Quality Control and Preprocessing

Objective

The primary objective of this project is to provide a robust, hands-on experience in performing quality control (QC) and preprocessing of RNA-Seq data. You will learn how to evaluate, correct errors, and filter reads to ensure the quality of RNA-Seq data using R and Bioconductor packages. This project focuses on RNA-Seq data, which has become the standard for transcriptomics studies due to its high throughput and sensitivity.

By the end of this project, you will have a comprehensive foundation in RNA-Seq data analysis, emphasizing the critical aspect of quality control. You will be empowered to independently explore and analyze RNA-Seq data, ensuring the reliability and integrity of your findings.

Learning Outcomes

By completing this project, you will:

Understand the importance of quality control in RNA-Seq data analysis:
- Recognize the impact of data quality on downstream analyses and interpretations.
- Appreciate how QC enhances the reliability and accuracy of RNA-Seq data.
Gain proficiency in using R and Bioconductor for RNA-Seq data analysis:
- Learn to install and utilize essential R packages for bioinformatics.
- Develop skills in data manipulation and visualization in R.
Acquire expertise in RNA-Seq data preprocessing and quality control techniques:
- Perform quality assessment using tools like FastQC and MultiQC.
- Learn to trim and filter raw sequencing reads.
- Align RNA-Seq reads to a reference genome using STAR or HISAT2.
- Generate count matrices for gene expression analysis.
Generate informative visualizations:
- Create plots to assess data quality (e.g., per-base sequence quality, GC content).
- Visualize alignment statistics and gene coverage.

Prerequisites and Theoretical Foundations

1. System and Software Requirements

Computational Resources:

RAM: Minimum 16GB (32GB recommended)
Storage: 50GB free space per 10 samples (paired-end)
CPU: Multi-core processor (4+ cores recommended)
Operating System: Unix-like environment (Linux/Mac) preferred

Required Software Environment

R: Version 4.0 or higher
Command-line proficiency: Basic Unix commands
Package managers: Conda or Bioconda recommended
Container systems: Basic Docker/Singularity knowledge helpful

2. Basic Knowledge of R Programming

Essential knowledge of R data structures and control flow required.

Core R Concepts and Examples

# Data Structures
vectors <- c(1, 2, 3)                               # Atomic vectors
matrices <- matrix(1:9, nrow=3)                     # 2D structures
data_frames <- data.frame(ID=1:3, Value=c("A","B","C")) # Tabular data
lists <- list(nums=1:3, chars=c("A","B"))          # Heterogeneous data

# Control Flow Examples
if (condition) {
    # action
} else {
    # alternative action
}

for (item in collection) {
    # repeated action
}

Essential R Packages

Data manipulation: tidyverse, data.table
Bioconductor core: BiocParallel, GenomicRanges
Visualization: ggplot2, ComplexHeatmap
RNA-seq specific: DESeq2, edgeR, tximport

3. Understanding of RNA-Seq Technology

Key concepts in RNA sequencing technology and experimental design.

Library Preparation and Sequencing

Library Preparation:

Poly-A selection vs Ribosomal depletion
Stranded vs unstranded protocols
UMIs (Unique Molecular Identifiers)
Library complexity and PCR duplication

Sequencing Technology:

Short-read sequencing (Illumina)
Read lengths and paired-end sequencing
Sequencing depth requirements
Quality scores (Phred scale)

Sequencing Depth Guidelines

Recommended coverage per sample type:

mRNA profiling: 20-30M reads
lncRNA analysis: 30-50M reads
Novel transcript discovery: 50-100M reads
Single-cell RNA-seq: 50K-100K reads per cell

Factors affecting depth requirements:

Genome size
Transcript complexity
Expression level of interest
Biological variation

File Formats

# FASTQ Format Example
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

# SAM/BAM Format Example
READ_ID  FLAG  CHROM  POS  MAPQ  CIGAR  RNEXT  PNEXT  TLEN  SEQ  QUAL

4. Quality Control Fundamentals

Understanding basic quality metrics and their interpretation is essential.

Quality Metrics and Thresholds

Minimum QC thresholds:

Base quality: >80% bases above Q30
Alignment rate: >75%
Unique alignment: >60%
Gene detection: >12,000 genes (human)
Ribosomal content: <10%
Duplication rate: <60%

Quality Score Interpretation:

Phred score = -10 * log10(error probability)
Q30 = 99.9% base call accuracy
Q20 = 99% base call accuracy

Common Quality Issues

Base quality degradation
Adapter contamination
PCR duplicates
GC bias
rRNA contamination
Coverage biases

5. Experimental Design

Design Guidelines

Sample Requirements:

Minimum 3 biological replicates per condition
Balanced design across conditions
Random assignment to sequencing lanes
Inclusion of technical controls

Power Analysis Considerations:

Effect size of interest
Biological variability
Sequencing depth
Multiple testing burden
Cost constraints

6. Statistical Concepts

Key Statistical Methods

Negative binomial distribution
Dispersion estimation
Multiple testing correction
Effect size estimation

Example Calculations:

# Differential Expression Testing
res <- results(dds, alpha=0.05)
res <- res[order(res$padj),]

# Multiple Testing Correction
p.adjust(pvalues, method="BH")

7. Reproducibility Practices

Version Control and Documentation

Version Control:

# Git basics for analysis tracking
git init
git add analysis_script.R
git commit -m "Initial analysis script"

Documentation Standards:

README files
Analysis logs
Parameter documentation
Quality report generation

Steps and Tasks

System Requirements Quick Reference

Minimum 16GB RAM (32GB recommended for large datasets)

50GB free storage space per 10 samples (paired-end)

Multiple CPU cores for parallel processing

Unix-like environment (Linux/Mac) recommended for command-line tools

Step 1: Download RNA-Seq Data from Public Repositories

System Requirements:

Storage: ~10GB per paired-end sample
Internet: High-speed connection recommended
Time Estimate: 1-4 hours depending on dataset size

Dataset Download:

For this project, we recommend using a small RNA-Seq dataset to minimize computational requirements.

Dataset Options:

Yeast RNA-Seq Data
- Accession Number: SRR453566
- Description: RNA-Seq data for Saccharomyces cerevisiae, suitable for testing and small-scale analysis.
E. coli RNA-Seq Data
- Accession Number: SRR577902
- Description: RNA-Seq data from Escherichia coli, minimal computational resources required.

How to Download Data:

# Install SRA Toolkit
wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-ubuntu64.tar.gz
tar -xzf sratoolkit.current-ubuntu64.tar.gz
export PATH=$PATH:$PWD/sratoolkit.current-ubuntu64/bin

# Download and convert to FASTQ
fastq-dump --split-files --gzip SRR453566

Note: Replace SRR453566 with the accession number of your chosen dataset.

Step 2: Quality Assessment (FastQC/MultiQC)

System Requirements:

RAM: 2GB per sample
CPU: Can utilize multiple cores
Time: ~5-10 minutes per sample

Command Line:

# Create output directories
mkdir -p qc/fastqc qc/multiqc

# Run FastQC
fastqc SRR453566_1.fastq.gz SRR453566_2.fastq.gz -o qc/fastqc

# Aggregate reports with MultiQC
multiqc qc/fastqc -o qc/multiqc

Note: Adjust file names according to your downloaded data.

Step 3: Read Trimming and Filtering

System Requirements:

RAM: 4GB per sample
Time: 15-30 minutes per sample

Command Line:

# Trimming with Trimmomatic
trimmomatic PE \
  -threads 4 \
  SRR453566_1.fastq.gz SRR453566_2.fastq.gz \
  SRR453566_1.trimmed.fastq.gz SRR453566_1.unpaired.fastq.gz \
  SRR453566_2.trimmed.fastq.gz SRR453566_2.unpaired.fastq.gz \
  ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 \
  LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36

Step 4: Post-trim QC and Read Recovery Assessment

R Environment:

# Load libraries
library(ShortRead)
library(ggplot2)

# Analyze read recovery
original_counts <- countFastq("SRR453566_1.fastq.gz")
trimmed_counts <- countFastq("SRR453566_1.trimmed.fastq.gz")

recovery_rate <- (trimmed_counts / original_counts) * 100

# Plot recovery rate
ggplot(data.frame(Sample="SRR453566", Rate=recovery_rate), aes(x=Sample, y=Rate)) +
  geom_bar(stat="identity") +
  theme_minimal() +
  labs(title="Read Recovery After Trimming", y="Recovery Rate (%)")

Step 5: Genome Indexing and Alignment

System Requirements:

RAM: 16GB for small genomes (e.g., yeast)
Time: 1-2 hours

Command Line:

# Index the genome (e.g., yeast genome)
STAR --runMode genomeGenerate \
     --runThreadN 4 \
     --genomeDir genome_index \
     --genomeFastaFiles yeast_genome.fa \
     --sjdbGTFfile yeast_annotations.gtf \
     --sjdbOverhang 99

# Align reads
STAR --runThreadN 4 \
     --genomeDir genome_index \
     --readFilesIn SRR453566_1.trimmed.fastq.gz SRR453566_2.trimmed.fastq.gz \
     --readFilesCommand zcat \
     --outFileNamePrefix aligned/SRR453566_ \
     --outSAMtype BAM SortedByCoordinate

Step 6: Generate Count Matrix

System Requirements:

RAM: 4GB
Time: 30 minutes

Command Line:

# Create count matrix with featureCounts
featureCounts -T 4 \
    -p -B -C \
    -t exon \
    -g gene_id \
    -a yeast_annotations.gtf \
    -o counts/raw_counts.txt \
    aligned/SRR453566_Aligned.sortedByCoord.out.bam

R Environment:

# Read and format counts
counts <- read.table("counts/raw_counts.txt", header=TRUE, row.names=1, skip=1)
counts <- counts[,6:ncol(counts)]  # Remove metadata columns

# Create DESeq2 object
library(DESeq2)
coldata <- data.frame(condition=factor("control"), row.names=colnames(counts))
dds <- DESeqDataSetFromMatrix(countData=counts, colData=coldata, design=~condition)

Step 7: Quality Assessment of Count Data

R Environment:

# Normalize counts
vsd <- vst(dds, blind=TRUE)

# PCA plot
plotPCA(vsd, intgroup="condition")

Step 8: Outlier Detection and Handling

R Environment:

# Calculate sample distances
sampleDists <- dist(t(assay(vsd)))
heatmap(as.matrix(sampleDists), symm=TRUE)

Step 9: Batch Effect Assessment

R Environment:

# Since we have only one condition, batch effect assessment is minimal
# Include batch effect correction if metadata is available

Step 10: Final Quality Report Generation

R Environment:

# Generate QC report using R Markdown
library(rmarkdown)
render("qc_report.Rmd", output_file="final_qc_report.html")

Best Practices Summary

Always check computational requirements before starting each step
Use parallel processing when available
Monitor memory usage during intensive steps
Keep intermediate files until pipeline is complete
Document parameter choices and their rationale
Generate comprehensive QC reports for reproducibility

Conclusion

In this project, you have:

Developed proficiency in performing quality control on RNA-Seq data using tools like FastQC and MultiQC.
Learned to preprocess RNA-Seq data, including trimming, filtering, and aligning reads to a reference genome.
Generated and interpreted various data visualizations to assess data quality and sample relationships.
Identified and addressed potential issues, such as low-quality reads and outlier samples, enhancing the reliability of downstream analyses.
Gained a solid foundation in RNA-Seq data analysis, equipping you to explore and analyze transcriptomics data independently.

Next Steps:

Differential Expression Analysis:
- Use DESeq2 or edgeR to identify differentially expressed genes between conditions.
Functional Analysis:
- Perform gene ontology (GO) enrichment or pathway analysis to interpret biological significance.
Data Integration:
- Integrate other data types, such as miRNA or proteomics data, for a multi-omics approach.
Further Learning:
- Explore advanced topics like alternative splicing analysis, single-cell RNA-Seq, or long-read sequencing data.

Thank you!