🔸 Code Along for A-DE-1: Biomedical Knowledge Graph

The following code snippets walk through setting up your environment, preparing the data, constructing the knowledge graph, running an initial network analysis, and working through an example application.

1. Environment Setup

Before you start, ensure all necessary libraries are installed to handle the dataset and perform graph-based operations. This setup includes libraries for data manipulation, graph construction, and visualization.

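# Install NetworkX for graph operations, Pandas for data manipulation, and iGraph for advanced graph analysis
# (the PyPI package is now published as "igraph"; "python-igraph" is the older, deprecated name)
!pip install networkx pandas igraph

# Additional visualization libraries (Plotly is used for the interactive plot in section 8)
!pip install matplotlib seaborn plotly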

2. Loading and Preparing the Data

Load the dataset using Pandas, a powerful tool for data analysis. This step includes preliminary data checks such as viewing the top rows of the dataset and summary statistics to understand the nature of the data.

import pandas as pd

# Load the dataset
data = pd.read_csv('protein_interactions.csv')

# Display the first few rows of the dataframe
print(data.head())

# Get a summary of the dataframe
print(data.describe())
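
Interaction tables often contain missing identifiers, self-interactions, and the same pair listed in both directions. The sketch below shows one possible clean-up pass before building the graph. It assumes the two columns are named protein1 and protein2 (as in the next section); the toy rows and the helper name clean_edge_list are made up for illustration.

```python
import pandas as pd

# Toy stand-in for protein_interactions.csv; the protein names are invented.
raw = pd.DataFrame({
    "protein1": ["P1", "P2", "P2", "P3", None, "P4"],
    "protein2": ["P2", "P1", "P3", "P3", "P5", "P5"],
})

def clean_edge_list(df, source="protein1", target="protein2"):
    """Drop incomplete rows, self-interactions, and duplicate undirected pairs."""
    # Remove rows where either endpoint is missing
    df = df.dropna(subset=[source, target])
    # Remove self-interactions such as (P3, P3)
    df = df[df[source] != df[target]]
    # Sort each pair so that (A, B) and (B, A) get the same key
    pair_key = [tuple(sorted(pair)) for pair in zip(df[source], df[target])]
    df = df.assign(pair_key=pair_key).drop_duplicates("pair_key")
    return df.drop(columns="pair_key").reset_index(drop=True)

clean = clean_edge_list(raw)
print(clean)
```

Whether self-interactions should be dropped depends on the analysis; homodimers are biologically meaningful, so treat that filter as optional.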

3. Constructing the Knowledge Graph

Use NetworkX to construct the knowledge graph from the interaction data. This involves creating a graph object and adding edges from the dataset, representing protein interactions.

import networkx as nx

# Create a graph from the Pandas dataframe
G = nx.from_pandas_edgelist(data, source='protein1', target='protein2')

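# Display basic information about the graph
# (nx.info() was removed in NetworkX 3.0; printing the graph gives the same summary)
print(G)
print("Nodes:", G.number_of_nodes(), "Edges:", G.number_of_edges())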
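
If the interaction file also carries a per-edge value, it can be attached to the graph while it is built. The following is a minimal sketch on a toy table; the score column and the 0.7 threshold are assumptions for illustration and are not part of the dataset described above.

```python
import networkx as nx
import pandas as pd

# Toy edge list; "score" is a hypothetical confidence column
edges = pd.DataFrame({
    "protein1": ["A", "A", "B", "D"],
    "protein2": ["B", "C", "C", "E"],
    "score": [0.9, 0.4, 0.7, 0.8],
})

# edge_attr copies the named column onto each edge as an attribute
toy_graph = nx.from_pandas_edgelist(
    edges, source="protein1", target="protein2", edge_attr="score"
)
print(toy_graph.number_of_nodes(), toy_graph.number_of_edges())
print(toy_graph["A"]["B"]["score"])

# Keep only the interactions at or above an arbitrary threshold
strong = nx.Graph()
strong.add_edges_from(
    (u, v, attrs)
    for u, v, attrs in toy_graph.edges(data=True)
    if attrs["score"] >= 0.7
)
print(sorted(strong.edges()))
```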

4. Basic Graph Analysis

Perform basic graph analysis to understand the structure and key metrics of the graph. This includes calculating the number of connected components and visualizing the network.

# Calculate the number of connected components
connected_components = nx.number_connected_components(G)
print("Number of connected components:", connected_components)

# Visualize the graph
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))
nx.draw(G, node_color='lightblue', with_labels=True, node_size=500, font_size=10)
plt.title('Protein-Protein Interaction Graph')
plt.show()
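
Beyond counting components, it is often useful to look at their sizes and to continue the analysis on the largest one. The sketch below does this on a small made-up graph rather than on the real dataset.

```python
import networkx as nx

# Toy graph: a triangle, a separate pair, and one isolated node
toy = nx.Graph()
toy.add_edges_from([("A", "B"), ("B", "C"), ("C", "A"), ("D", "E")])
toy.add_node("F")

# connected_components yields one set of nodes per component
components = sorted(nx.connected_components(toy), key=len, reverse=True)
sizes = [len(component) for component in components]
print("Component sizes:", sizes)

# Copy the largest component into its own graph for further analysis
largest = toy.subgraph(components[0]).copy()
print("Nodes in largest component:", sorted(largest.nodes()))
print("Density of largest component:", nx.density(largest))
```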

5. Advanced Network Analysis

Implement more complex network analysis techniques such as calculating centrality measures to identify key nodes within the network.

# Calculate degree centrality of the graph
centrality = nx.degree_centrality(G)
sorted_centrality = sorted(centrality.items(), key=lambda x: x[1], reverse=True)

# Display the top 5 nodes with the highest centrality
print("Top 5 nodes by degree centrality:", sorted_centrality[:5])
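
Degree centrality only counts direct neighbors, so it can miss nodes that connect otherwise separate parts of the network. As an illustration, the sketch below compares it with betweenness centrality on a made-up graph in which a node named X bridges two triangles.

```python
import networkx as nx

# Two triangles joined only through the bridge node "X"
toy = nx.Graph()
toy.add_edges_from([
    ("A", "B"), ("B", "C"), ("C", "A"),
    ("D", "E"), ("E", "F"), ("F", "D"),
    ("C", "X"), ("X", "D"),
])

degree = nx.degree_centrality(toy)
betweenness = nx.betweenness_centrality(toy)

# C and D have the most neighbors, but X lies on every path between the triangles
top_by_degree = sorted(degree, key=degree.get, reverse=True)[:2]
top_by_betweenness = max(betweenness, key=betweenness.get)

print("Top 2 by degree:", top_by_degree)
print("Top by betweenness:", top_by_betweenness)
```

Betweenness is considerably more expensive to compute than degree centrality, so on large interaction networks the k parameter of nx.betweenness_centrality can be used to approximate it from a sample of nodes.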

6. Using iGraph for Advanced Analysis

For those interested in exploring more sophisticated analyses, convert the NetworkX graph to an iGraph object. iGraph provides powerful tools for community detection and network motifs.

from igraph import Graph

# Convert NetworkX graph to iGraph
ig = Graph.TupleList(G.edges(), directed=False)

# Perform community detection with the multilevel (Louvain) method
community = ig.community_multilevel()
print("Number of communities:", len(community))
print("Modularity of found partitions:", community.modularity)

7. Applications

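Combine the interaction graph with disease association data to extract one subgraph per disease. The association file is expected to have a 'disease' column and a 'protein_id' column whose values match the node names in the graph.

# Load disease association data
disease_associations = pd.read_csv('disease_associations.csv')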

# Create subgraphs for specific diseases
disease_pathways = {}
for disease in disease_associations['disease'].unique():
    associated_proteins = disease_associations[disease_associations['disease'] == disease]['protein_id']
    subgraph = G.subgraph(associated_proteins)
    disease_pathways[disease] = subgraph

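# Analyze a specific disease pathway
# G is undirected, so use connected_components; strongly_connected_components
# is defined only for directed graphs and raises NetworkXNotImplemented here.
# The dictionary key must match the spelling used in the 'disease' column exactly.
cc_disease = list(nx.connected_components(disease_pathways["Alzheimer's Disease"]))
print("Connected components in the disease subgraph:", len(cc_disease))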
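
To make the subgraph step concrete, the sketch below runs it on a small invented network and association table. The disease labels, protein names, and the helper disease_modules are all hypothetical; only the column names disease and protein_id follow the code above.

```python
import networkx as nx
import pandas as pd

# Toy interaction graph and association table (all names are invented)
ppi = nx.Graph([("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P5", "P6")])
assoc = pd.DataFrame({
    "disease": ["D1", "D1", "D1", "D2", "D2"],
    "protein_id": ["P1", "P2", "P4", "P5", "P9"],
})

def disease_modules(graph, associations):
    """Map each disease to the connected components of its induced subgraph."""
    modules = {}
    for disease, group in associations.groupby("disease"):
        # Ignore proteins that do not appear in the interaction graph (e.g. P9)
        proteins = set(group["protein_id"]) & set(graph.nodes())
        sub = graph.subgraph(proteins)
        components = (sorted(c) for c in nx.connected_components(sub))
        modules[disease] = sorted(components, key=len, reverse=True)
    return modules

modules = disease_modules(ppi, assoc)
for disease, components in modules.items():
    print(disease, components)
```

Note that an induced subgraph keeps only edges between the listed proteins, so two disease proteins linked through an unlisted intermediate (P2 and P4 via P3 here) end up in separate components.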

8. Visualization Techniques

Build an interactive view of the network with Plotly. The graph created above stores no node coordinates, so a layout is computed first; nodes are colored by their number of connections.

import plotly.graph_objs as go

# Compute 2D coordinates for every node (the seed keeps the layout reproducible)
pos = nx.spring_layout(G, seed=42)

# Collect edge coordinates; None separates one line segment from the next
edge_x, edge_y = [], []
for u, v in G.edges():
    x0, y0 = pos[u]
    x1, y1 = pos[v]
    edge_x += [x0, x1, None]
    edge_y += [y0, y1, None]

# Collect node coordinates, degrees (used for coloring), and hover labels
node_x, node_y, node_degree, node_text = [], [], [], []
for node in G.nodes():
    x, y = pos[node]
    node_x.append(x)
    node_y.append(y)
    node_degree.append(G.degree(node))
    node_text.append(f"Protein: {node}")

# Create a trace for the edges
edge_trace = go.Scatter(
    x=edge_x,
    y=edge_y,
    mode='lines',
    line=dict(color='rgb(125,125,125)', width=1),
    hoverinfo='none'
)

# Create a trace for the nodes
# (trace properties are read-only tuples, so the data is passed in at construction)
node_trace = go.Scatter(
    x=node_x,
    y=node_y,
    mode='markers',
    text=node_text,
    hoverinfo='text',
    marker=dict(
        showscale=True,
        colorscale='Viridis',
        reversescale=True,
        color=node_degree,
        size=10,
        colorbar=dict(
            thickness=15,
            title=dict(text='Node Connections', side='right'),
            xanchor='left'
        ),
        line=dict(width=2)
    )
)

fig = go.Figure(data=[edge_trace, node_trace],
                layout=go.Layout(
                    title=dict(text='Protein-Protein Interaction Network', font=dict(size=16)),
                    showlegend=False,
                    hovermode='closest',
                    margin=dict(b=20, l=5, r=5, t=40),
                    xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                    yaxis=dict(showgrid=False, zeroline=False, showticklabels=False)
                )
)

fig.show()