🟡 RNA-Seq Data Quality Control and Preprocessing

RNA-Seq Data Quality Control and Preprocessing

Objective

The primary objective of this project is to provide a robust, hands-on experience in performing quality control (QC) and preprocessing of RNA-Seq data. You will learn how to evaluate, correct errors, and filter reads to ensure the quality of RNA-Seq data using R and Bioconductor packages. This project focuses on RNA-Seq data, which has become the standard for transcriptomics studies due to its high throughput and sensitivity.

By the end of this project, you will have a comprehensive foundation in RNA-Seq data analysis, emphasizing the critical aspect of quality control. You will be empowered to independently explore and analyze RNA-Seq data, ensuring the reliability and integrity of your findings.


Learning Outcomes

By completing this project, you will:

  • Understand the importance of quality control in RNA-Seq data analysis:

    • Recognize the impact of data quality on downstream analyses and interpretations.
    • Appreciate how QC enhances the reliability and accuracy of RNA-Seq data.
  • Gain proficiency in using R and Bioconductor for RNA-Seq data analysis:

    • Learn to install and utilize essential R packages for bioinformatics.
    • Develop skills in data manipulation and visualization in R.
  • Acquire expertise in RNA-Seq data preprocessing and quality control techniques:

    • Perform quality assessment using tools like FastQC and MultiQC.
    • Learn to trim and filter raw sequencing reads.
    • Align RNA-Seq reads to a reference genome using STAR or HISAT2.
    • Generate count matrices for gene expression analysis.
  • Generate informative visualizations:

    • Create plots to assess data quality (e.g., per-base sequence quality, GC content).
    • Visualize alignment statistics and gene coverage.

Prerequisites and Theoretical Foundations

1. System and Software Requirements

Computational Resources:

  • RAM: Minimum 16GB (32GB recommended)
  • Storage: 50GB free space per 10 samples (paired-end)
  • CPU: Multi-core processor (4+ cores recommended)
  • Operating System: Unix-like environment (Linux/Mac) preferred
Required Software Environment
  • R: Version 4.0 or higher
  • Command-line proficiency: Basic Unix commands
  • Package managers: Conda or Bioconda recommended
  • Container systems: Basic Docker/Singularity knowledge helpful

2. Basic Knowledge of R Programming

Essential knowledge of R data structures and control flow required.

Core R Concepts and Examples
# Data Structures
vectors <- c(1, 2, 3)                               # Atomic vectors
matrices <- matrix(1:9, nrow=3)                     # 2D structures
data_frames <- data.frame(ID=1:3, Value=c("A","B","C")) # Tabular data
lists <- list(nums=1:3, chars=c("A","B"))          # Heterogeneous data

# Control Flow Examples
if (condition) {
    # action
} else {
    # alternative action
}

for (item in collection) {
    # repeated action
}
Essential R Packages
  • Data manipulation: tidyverse, data.table
  • Bioconductor core: BiocParallel, GenomicRanges
  • Visualization: ggplot2, ComplexHeatmap
  • RNA-seq specific: DESeq2, edgeR, tximport

3. Understanding of RNA-Seq Technology

Key concepts in RNA sequencing technology and experimental design.

Library Preparation and Sequencing

Library Preparation:

  • Poly-A selection vs Ribosomal depletion
  • Stranded vs unstranded protocols
  • UMIs (Unique Molecular Identifiers)
  • Library complexity and PCR duplication

Sequencing Technology:

  • Short-read sequencing (Illumina)
  • Read lengths and paired-end sequencing
  • Sequencing depth requirements
  • Quality scores (Phred scale)
Sequencing Depth Guidelines

Recommended coverage per sample type:

  • mRNA profiling: 20-30M reads
  • lncRNA analysis: 30-50M reads
  • Novel transcript discovery: 50-100M reads
  • Single-cell RNA-seq: 50K-100K reads per cell

Factors affecting depth requirements:

  • Genome size
  • Transcript complexity
  • Expression level of interest
  • Biological variation
File Formats
# FASTQ Format Example
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

# SAM/BAM Format Example
READ_ID  FLAG  CHROM  POS  MAPQ  CIGAR  RNEXT  PNEXT  TLEN  SEQ  QUAL

4. Quality Control Fundamentals

Understanding basic quality metrics and their interpretation is essential.

Quality Metrics and Thresholds

Minimum QC thresholds:

  • Base quality: >80% bases above Q30
  • Alignment rate: >75%
  • Unique alignment: >60%
  • Gene detection: >12,000 genes (human)
  • Ribosomal content: <10%
  • Duplication rate: <60%

Quality Score Interpretation:

  • Phred score = -10 * log10(error probability)
  • Q30 = 99.9% base call accuracy
  • Q20 = 99% base call accuracy
Common Quality Issues
  • Base quality degradation
  • Adapter contamination
  • PCR duplicates
  • GC bias
  • rRNA contamination
  • Coverage biases

5. Experimental Design

Design Guidelines

Sample Requirements:

  • Minimum 3 biological replicates per condition
  • Balanced design across conditions
  • Random assignment to sequencing lanes
  • Inclusion of technical controls

Power Analysis Considerations:

  • Effect size of interest
  • Biological variability
  • Sequencing depth
  • Multiple testing burden
  • Cost constraints

6. Statistical Concepts

Key Statistical Methods
  • Negative binomial distribution
  • Dispersion estimation
  • Multiple testing correction
  • Effect size estimation

Example Calculations:

# Differential Expression Testing
res <- results(dds, alpha=0.05)
res <- res[order(res$padj),]

# Multiple Testing Correction
p.adjust(pvalues, method="BH")

7. Reproducibility Practices

Version Control and Documentation

Version Control:

# Git basics for analysis tracking
git init
git add analysis_script.R
git commit -m "Initial analysis script"

Documentation Standards:

  • README files
  • Analysis logs
  • Parameter documentation
  • Quality report generation

Steps and Tasks

:computer: System Requirements Quick Reference

  • Minimum 16GB RAM (32GB recommended for large datasets)
  • 50GB free storage space per 10 samples (paired-end)
  • Multiple CPU cores for parallel processing
  • Unix-like environment (Linux/Mac) recommended for command-line tools

Step 1: Download RNA-Seq Data from Public Repositories

System Requirements:

  • Storage: ~10GB per paired-end sample
  • Internet: High-speed connection recommended
  • Time Estimate: 1-4 hours depending on dataset size

Dataset Download:

For this project, we recommend using a small RNA-Seq dataset to minimize computational requirements.

Dataset Options:

  1. Yeast RNA-Seq Data

    • Accession Number: SRR453566
    • Description: RNA-Seq data for Saccharomyces cerevisiae, suitable for testing and small-scale analysis.
  2. E. coli RNA-Seq Data

    • Accession Number: SRR577902
    • Description: RNA-Seq data from Escherichia coli, minimal computational resources required.

How to Download Data:

# Install SRA Toolkit
wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-ubuntu64.tar.gz
tar -xzf sratoolkit.current-ubuntu64.tar.gz
export PATH=$PATH:$PWD/sratoolkit.current-ubuntu64/bin

# Download and convert to FASTQ
fastq-dump --split-files --gzip SRR453566

Note: Replace SRR453566 with the accession number of your chosen dataset.


Step 2: Quality Assessment (FastQC/MultiQC)

System Requirements:

  • RAM: 2GB per sample
  • CPU: Can utilize multiple cores
  • Time: ~5-10 minutes per sample

Command Line:

# Create output directories
mkdir -p qc/fastqc qc/multiqc

# Run FastQC
fastqc SRR453566_1.fastq.gz SRR453566_2.fastq.gz -o qc/fastqc

# Aggregate reports with MultiQC
multiqc qc/fastqc -o qc/multiqc

Note: Adjust file names according to your downloaded data.


Step 3: Read Trimming and Filtering

System Requirements:

  • RAM: 4GB per sample
  • Time: 15-30 minutes per sample

Command Line:

# Trimming with Trimmomatic
trimmomatic PE \
  -threads 4 \
  SRR453566_1.fastq.gz SRR453566_2.fastq.gz \
  SRR453566_1.trimmed.fastq.gz SRR453566_1.unpaired.fastq.gz \
  SRR453566_2.trimmed.fastq.gz SRR453566_2.unpaired.fastq.gz \
  ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 \
  LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36

Step 4: Post-trim QC and Read Recovery Assessment

R Environment:

# Load libraries
library(ShortRead)
library(ggplot2)

# Analyze read recovery
original_counts <- countFastq("SRR453566_1.fastq.gz")
trimmed_counts <- countFastq("SRR453566_1.trimmed.fastq.gz")

recovery_rate <- (trimmed_counts / original_counts) * 100

# Plot recovery rate
ggplot(data.frame(Sample="SRR453566", Rate=recovery_rate), aes(x=Sample, y=Rate)) +
  geom_bar(stat="identity") +
  theme_minimal() +
  labs(title="Read Recovery After Trimming", y="Recovery Rate (%)")

Step 5: Genome Indexing and Alignment

System Requirements:

  • RAM: 16GB for small genomes (e.g., yeast)
  • Time: 1-2 hours

Command Line:

# Index the genome (e.g., yeast genome)
STAR --runMode genomeGenerate \
     --runThreadN 4 \
     --genomeDir genome_index \
     --genomeFastaFiles yeast_genome.fa \
     --sjdbGTFfile yeast_annotations.gtf \
     --sjdbOverhang 99

# Align reads
STAR --runThreadN 4 \
     --genomeDir genome_index \
     --readFilesIn SRR453566_1.trimmed.fastq.gz SRR453566_2.trimmed.fastq.gz \
     --readFilesCommand zcat \
     --outFileNamePrefix aligned/SRR453566_ \
     --outSAMtype BAM SortedByCoordinate

Step 6: Generate Count Matrix

System Requirements:

  • RAM: 4GB
  • Time: 30 minutes

Command Line:

# Create count matrix with featureCounts
featureCounts -T 4 \
    -p -B -C \
    -t exon \
    -g gene_id \
    -a yeast_annotations.gtf \
    -o counts/raw_counts.txt \
    aligned/SRR453566_Aligned.sortedByCoord.out.bam

R Environment:

# Read and format counts
counts <- read.table("counts/raw_counts.txt", header=TRUE, row.names=1, skip=1)
counts <- counts[,6:ncol(counts)]  # Remove metadata columns

# Create DESeq2 object
library(DESeq2)
coldata <- data.frame(condition=factor("control"), row.names=colnames(counts))
dds <- DESeqDataSetFromMatrix(countData=counts, colData=coldata, design=~condition)

Step 7: Quality Assessment of Count Data

R Environment:

# Normalize counts
vsd <- vst(dds, blind=TRUE)

# PCA plot
plotPCA(vsd, intgroup="condition")

Step 8: Outlier Detection and Handling

R Environment:

# Calculate sample distances
sampleDists <- dist(t(assay(vsd)))
heatmap(as.matrix(sampleDists), symm=TRUE)

Step 9: Batch Effect Assessment

R Environment:

# Since we have only one condition, batch effect assessment is minimal
# Include batch effect correction if metadata is available

Step 10: Final Quality Report Generation

R Environment:

# Generate QC report using R Markdown
library(rmarkdown)
render("qc_report.Rmd", output_file="final_qc_report.html")

:bulb: Best Practices Summary

  • Always check computational requirements before starting each step
  • Use parallel processing when available
  • Monitor memory usage during intensive steps
  • Keep intermediate files until pipeline is complete
  • Document parameter choices and their rationale
  • Generate comprehensive QC reports for reproducibility

Conclusion

In this project, you have:

  • Developed proficiency in performing quality control on RNA-Seq data using tools like FastQC and MultiQC.
  • Learned to preprocess RNA-Seq data, including trimming, filtering, and aligning reads to a reference genome.
  • Generated and interpreted various data visualizations to assess data quality and sample relationships.
  • Identified and addressed potential issues, such as low-quality reads and outlier samples, enhancing the reliability of downstream analyses.
  • Gained a solid foundation in RNA-Seq data analysis, equipping you to explore and analyze transcriptomics data independently.

Next Steps:

  • Differential Expression Analysis:

    • Use DESeq2 or edgeR to identify differentially expressed genes between conditions.
  • Functional Analysis:

    • Perform gene ontology (GO) enrichment or pathway analysis to interpret biological significance.
  • Data Integration:

    • Integrate other data types, such as miRNA or proteomics data, for a multi-omics approach.
  • Further Learning:

    • Explore advanced topics like alternative splicing analysis, single-cell RNA-Seq, or long-read sequencing data.

Thank you!