Code Along for Integrating Gene and Protein Networks for Disease Biomarker Discovery

Step 1: Data Acquisition and Preprocessing

To acquire and preprocess the gene expression data, you can use the GEOquery and limma packages in R. Here’s an example:

library(GEOquery)
library(limma)

# Download gene expression data from GEO
gse <- getGEO("GSE12345", GSEMatrix = TRUE)  # Replace "GSE12345" with the actual GEO accession number
data <- exprs(gse[[1]])  # Assuming you only have one matrix of gene expression data

# Preprocess the gene expression data
# Remove probes with low expression
data <- data[rowSums(data) > threshold, ]
# Normalize the expression values
data <- normalize(data, method = "quantile")
# Select relevant samples or conditions
data <- data[, conditions_of_interest]

In this code snippet, we first load the GEOquery package to download the gene expression data from the Gene Expression Omnibus (GEO) database. We use the getGEO function to retrieve the data based on the GEO accession number. Next, we extract the expression values from the downloaded data using the exprs function.

After acquiring the gene expression data, we preprocess it by removing probes with low expression, normalizing the expression values, and selecting relevant samples or conditions. In the code snippet, we remove probes with expression values below a certain threshold using the rowSums function. Then, we normalize the expression values using the normalize function with the “quantile” method. Finally, we select the samples or conditions of interest by subsetting the data based on the desired columns.

Step 2: Gene and Protein Interaction Network Construction

To construct the gene and protein interaction networks, you can use the WGCNA and STRINGdb packages in R. Here’s an example:

library(WGCNA)
library(STRINGdb)

# Construct a gene co-expression network using WGCNA
gene_expr <- as.data.frame(data)
gene_network <- blockwiseModules(gene_expr, power = 6, TOMType = "unsigned",
                                 minModuleSize = 30, reassignThreshold = 0,
                                 mergeCutHeight = 0.25, numericLabels = TRUE,
                                 pamRespectsDendro = FALSE, saveTOMs = TRUE,
                                 saveTOMFileBase = "TOM", verbose = 0)

# Build a protein interaction network using STRING
gene_symbols <- rownames(data)  # Assuming the row names are gene symbols
ensembl_ids <- mapIds(org.Hs.eg.db, gene_symbols, "SYMBOL", "ENSEMBL")
entrez_ids <- mapIds(org.Hs.eg.db, ensembl_ids, "ENSEMBL", "ENTREZID")

string_db <- STRINGdb$new(version = "11", species = 9606, score_threshold = 0.7)
ppi <- string_db$networkFromGene(gene = entrez_ids)

In this code snippet, we first load the WGCNA package to construct a gene co-expression network. We convert the gene expression data to a data frame using the as.data.frame function. Then, we use the blockwiseModules function to perform the network construction. This function calculates the pairwise correlations between genes, constructs a hierarchical clustering tree, and assigns genes to modules based on their co-expression patterns.

Next, we load the STRINGdb package to build a protein interaction network. We convert the gene symbols to Entrez IDs using the mapIds function from the org.Hs.eg.db package. Then, we create a STRINGdb object and query the STRING database for protein-protein interactions using the networkFromGene function. The networkFromGene function takes a vector of Entrez IDs as input and retrieves the corresponding protein-protein interactions.

Step 3: Network Integration

To integrate the gene and protein interaction networks, you can use the igraph package in R. Here’s an example:

library(igraph)

# Convert the gene and protein networks to igraph objects
gene_network <- graph_from_adjacency_matrix(as.matrix(gene_network$TOM))
protein_network <- graph_from_data_frame(ppi, directed = FALSE)

# Map the gene nodes to their corresponding proteins
gene_nodes <- V(gene_network)$name
protein_nodes <- V(protein_network)$name
mapped_nodes <- match(gene_nodes, protein_nodes, nomatch = 0)

# Combine the networks by connecting the gene nodes to their corresponding proteins
integrated_network <- gene_network + protein_network
integrated_network <- add_edges(integrated_network, cbind(gene_nodes, mapped_nodes))

# Assign appropriate weights or confidence scores to the edges
edge_weights <- ...  # Replace '...' with your method of assigning weights
E(integrated_network)$weight <- edge_weights

In this code snippet, we first load the igraph package to work with network objects. We convert the gene network to an igraph object using the graph_from_adjacency_matrix function, passing the adjacency matrix of the gene network as input. Similarly, we convert the protein network to an igraph object using the graph_from_data_frame function, passing the protein-protein interaction data frame as input.

Next, we map the gene nodes in the gene network to their corresponding proteins in the protein network. We extract the names of the gene and protein nodes using the V function, and then use the match function to find the indices of the matching protein nodes. We set the nomatch argument to 0 to indicate that unmatched gene nodes should not be mapped to any protein nodes.

After mapping the gene nodes to their corresponding proteins, we combine the gene and protein networks by adding the protein network to the gene network using the + operator. We use the add_edges function to connect the gene nodes to their corresponding proteins. Finally, we assign appropriate weights or confidence scores to the edges in the integrated network based on your specific method.

Step 4: Network Analysis and Biomarker Identification

To analyze the integrated gene and protein network and identify potential biomarkers, you can use network analysis techniques and network-based algorithms. Here’s an example using the igraph and WGCNA packages:

library(igraph)
library(WGCNA)

# Convert the integrated network to an igraph object
integrated_network <- graph_from_adjacency_matrix(as.matrix(integrated_network))

# Calculate node degree and betweenness centrality as network measures
node_degree <- degree(integrated_network, mode = "all")
node_betweenness <- betweenness(integrated_network, normalized = TRUE)

# Perform module detection using WGCNA
gene_expr <- as.data.frame(data)
gene_network <- blockwiseModules(gene_expr, power = 6, TOMType = "unsigned",
                                 minModuleSize = 30, reassignThreshold = 0,
                                 mergeCutHeight = 0.25, numericLabels = TRUE,
                                 pamRespectsDendro = FALSE, saveTOMs = TRUE,
                                 saveTOMFileBase = "TOM", verbose = 0)
gene_module_assignments <- gene_network$colors

# Combine the network measures and module assignments into a single data frame
biomarker_candidates <- data.frame(node_degree, node_betweenness, gene_module_assignments)

# Select the most promising biomarker candidates based on the network measures and module assignments
selected_biomarkers <- subset(biomarker_candidates, node_degree > threshold1 &
                              node_betweenness > threshold2 &
                              gene_module_assignments == module_of_interest)

In this code snippet, we first load the igraph and WGCNA packages. We convert the integrated network to an igraph object using the graph_from_adjacency_matrix function, passing the adjacency matrix of the integrated network as input.

Next, we calculate the node degree and betweenness centrality as network measures using the degree and betweenness functions, respectively. The degree function calculates the number of edges connected to each node, while the betweenness function measures the number of shortest paths passing through each node.

Then, we perform module detection using the blockwiseModules function from the WGCNA package. This function identifies modules of highly interconnected genes based on their co-expression patterns. We convert the gene expression data to a data frame using the as.data.frame function, and then pass it to the blockwiseModules function along with other parameters.

After performing module detection, we combine the network measures (node degree and betweenness centrality) and module assignments into a single data frame called biomarker_candidates. This data frame contains information about the network measures and module assignments for each node in the integrated network.

Finally, we select the most promising biomarker candidates based on the network measures and module assignments. In the code snippet, we use the subset function to filter the biomarker_candidates data frame based on specific criteria, such as a minimum node degree (node_degree > threshold1), a minimum betweenness centrality (node_betweenness > threshold2), and a specific module assignment (gene_module_assignments == module_of_interest).

Step 5: Functional and Pathway Analysis

To interpret the biological significance of the identified biomarkers, you can perform functional and pathway analysis. Here’s an example using the clusterProfiler and org.Hs.eg.db packages:

library(clusterProfiler)
library(org.Hs.eg.db)

# Convert the selected biomarkers to Entrez IDs
ensembl_ids <- mapIds(org.Hs.eg.db, selected_biomarkers, "SYMBOL", "ENSEMBL")
entrez_ids <- mapIds(org.Hs.eg.db, ensembl_ids, "ENSEMBL", "ENTREZID")

# Perform Gene Ontology (GO) enrichment analysis
go_results <- enrichGO(entrez_ids, OrgDb = org.Hs.eg.db, keyType = "ENTREZID",
                       ont = "BP", pvalueCutoff = 0.05, qvalueCutoff = 0.1,
                       readable = TRUE)

# Perform pathway enrichment analysis using the Kyoto Encyclopedia of Genes and Genomes (KEGG) database
kegg_results <- enrichKEGG(entrez_ids, OrgDb = org.Hs.eg.db, keyType = "ENTREZID",
                          pvalueCutoff = 0.05, qvalueCutoff = 0.1, readable = TRUE)

In this code snippet, we first load the clusterProfiler and org.Hs.eg.db packages. We convert the selected biomarkers to Entrez IDs using the mapIds function from the org.Hs.eg.db package. This function maps gene symbols to Entrez IDs based on the specified key type.

Next, we perform Gene Ontology (GO) enrichment analysis using the enrichGO function from the clusterProfiler package. We pass the Entrez IDs, the org.Hs.eg.db object as the reference database, the key type as “ENTREZID”, and other parameters such as the ontology (“BP” for biological process), p-value cutoff, and q-value cutoff. The enrichGO function returns a data frame containing the enriched GO terms and their associated statistics.

Similarly, we perform pathway enrichment analysis using the enrichKEGG function from the clusterProfiler package. We pass the Entrez IDs, the org.Hs.eg.db object as the reference database, the key type as “ENTREZID”, and other parameters such as the p-value cutoff and q-value cutoff. The enrichKEGG function returns a data frame containing the enriched KEGG pathways and their associated statistics.

Step 6: Results Interpretation and Conclusion

To interpret the results of the functional and pathway analysis and formulate a conclusion, you can examine the enriched GO terms and KEGG pathways and discuss their relevance to the disease of interest. Here’s an example:

# Visualize the enriched GO terms and KEGG pathways
dotplot(go_results, showCategory = 10)
dotplot(kegg_results, showCategory = 10)

# Formulate a conclusion based on the results

In this code snippet, we use the dotplot function from the clusterProfiler package to visualize the enriched GO terms and KEGG pathways. We pass the enrichment analysis results (go_results and kegg_results) as input to the dotplot function, and specify the number of categories to show (showCategory = 10) to limit the number of terms/pathways displayed.

After visualizing the results, you can formulate a conclusion based on the findings. Discuss the enriched GO terms and KEGG pathways that are most relevant to the disease of interest, and highlight their potential implications in understanding the underlying biological mechanisms. Additionally, you can suggest future directions for the research, such as experimental validation of the identified biomarkers or further exploration of specific pathways or functions.

Evaluation:

To evaluate the success of this project, you can consider the following criteria:

  • The ability to correctly acquire, preprocess, and integrate the gene expression and protein-protein interaction data.
  • The quality of the constructed gene and protein interaction networks, including the appropriateness of the network integration approach.
  • The application of effective network analysis techniques and biomarker identification methods.
  • The biological relevance and interpretability of the identified biomarkers, as demonstrated by the functional and pathway analysis.
  • The clarity and insightfulness of the results interpretation and the formulation of a meaningful conclusion.

Resources and Learning Materials:

Here are some additional resources and learning materials that can help you further understand the concepts and techniques used in this project:

  1. R for statistical computing: https://www.r-project.org/
  2. Bioconductor for genomics and proteomics data analysis in R: https://www.bioconductor.org/
  3. Gene Expression Omnibus (GEO): https://www.ncbi.nlm.nih.gov/geo/
  4. Human Protein Reference Database (HPRD): http://www.hprd.org/
  5. Search Tool for the Retrieval of Interacting Genes/Proteins (STRING): https://string-db.org/
  6. limma package for gene expression data analysis in R: https://bioconductor.org/packages/release/bioc/html/limma.html
  7. WGCNA package for weighted gene co-expression network analysis in R: https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/
  8. igraph package for network analysis in R: https://igraph.org/r/
  9. clusterProfiler package for functional enrichment analysis in R: https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html
  10. org.Hs.eg.db package for annotation of gene identifiers in R: https://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html

I hope this helps you in understanding the steps and tasks involved in this project. If you have any further questions, feel free to ask!