Firstly, a big thank you to @ayahashim16 and @Ahmedz for sharing their findings.
Key Points to Address:
1. Issue with Edge Construction:
- One issue we’ve identified is how positive and negative edges were constructed in the code snippets we previously shared. The examples drew edges across the entire network, but to build these edges correctly we need to constrain them to the relevant subsets of the network (see details in previous replies).
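For illustration, here is a minimal sketch of constraining both edge sets to a node subset. The function name, the subset_nodes argument, and the 1:1 negative-to-positive ratio are illustrative choices, not the exact code from the earlier replies:

import random

def build_edge_sets(G, subset_nodes, seed=42):
    """Sketch: positive/negative edges restricted to a relevant node subset."""
    rng = random.Random(seed)
    subset = set(subset_nodes) & set(G.nodes())

    # Positive edges: known interactions where BOTH endpoints are in the subset
    positive_edges = [(u, v) for u, v in G.edges() if u in subset and v in subset]

    # Negative edges: unconnected pairs sampled from the SAME subset (1:1 ratio here)
    nodes = sorted(subset)
    negative_edges = set()
    while len(negative_edges) < len(positive_edges):
        u, v = rng.sample(nodes, 2)
        if not G.has_edge(u, v):
            negative_edges.add((u, v))

    return positive_edges, list(negative_edges)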
2. Switch from Association Scores to Network Properties:
- Using non-zero association scores for negative edges gives the model too much information (as identified by Aya), so it is time to move to network-based features for building your edge representations. Here’s a code snippet that demonstrates a simple way to generate these features:
import numpy as np

def simple_node_embedding(G, dim=64):
    """Create simple per-node feature vectors from basic network properties."""
    embeddings = {}
    for node in G.nodes():
        degree = G.degree(node)
        neighbor_degrees = [G.degree(n) for n in G.neighbors(node)]
        avg_neighbor_degree = np.mean(neighbor_degrees) if neighbor_degrees else 0

        # First two dimensions carry structural information; the rest are random
        embedding = np.zeros(dim)
        embedding[0] = degree
        embedding[1] = avg_neighbor_degree
        embedding[2:] = np.random.randn(dim - 2)

        # Normalize to unit length so features are on a comparable scale
        embeddings[node] = embedding / np.linalg.norm(embedding)
    return embeddings
- More advanced methods like Node2Vec can yield better embeddings, but they require substantial RAM and might only be feasible on a paid tier of Colab or another cloud platform.
- By using network properties (like node degrees or embedding vectors), we’re leveraging the structure of known interactions to predict unknown ones, which is more generalizable and biologically meaningful.
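To connect these node-level features to the edge-prediction task, here is a minimal sketch that assumes the simple_node_embedding output above; the Hadamard (element-wise) product is one common way to combine two node vectors, and concatenation or averaging would also work:

import numpy as np

def edge_features(pairs, embeddings):
    """Turn each (node_u, node_v) pair into a single edge feature vector."""
    # Hadamard (element-wise) product of the two node embeddings
    return np.array([embeddings[u] * embeddings[v] for u, v in pairs])

# Example: build X and y from positive/negative edge lists
# emb = simple_node_embedding(G)
# X = np.vstack([edge_features(positive_edges, emb), edge_features(negative_edges, emb)])
# y = np.array([1] * len(positive_edges) + [0] * len(negative_edges))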
3. Play with PPI Data:
- Experiment with the PPI (Protein-Protein Interaction) data and use as much of it as your computational resources can handle. The more data you include, the richer the network will be and the more informative your models will become. Here’s a snippet to help you manage large datasets:
import pandas as pd

# Cap the number of PPI interactions the network is built from
MAX_PPI = 100000

if len(ppi_ot) > MAX_PPI:
    # Too many OT-related interactions: downsample to the cap
    ppi_filtered = ppi_ot.sample(n=MAX_PPI, random_state=42)
else:
    # Otherwise, top up with interactions that do not involve OT genes
    additional_needed = MAX_PPI - len(ppi_ot)
    ppi_non_ot = ppi_df[~(ppi_df['GeneName1'].isin(ot_genes) | ppi_df['GeneName2'].isin(ot_genes))]
    additional_sample = ppi_non_ot.sample(n=min(additional_needed, len(ppi_non_ot)), random_state=42)
    ppi_filtered = pd.concat([ppi_ot, additional_sample])
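From the filtered interaction table, the graph itself can be built with networkx. A short sketch, assuming the GeneName1/GeneName2 columns shown above:

import networkx as nx

# Build an undirected PPI graph from the filtered interaction table
G = nx.from_pandas_edgelist(ppi_filtered, source='GeneName1', target='GeneName2')
print(f"{G.number_of_nodes()} proteins, {G.number_of_edges()} interactions")

# This graph can then be passed to simple_node_embedding() from point 2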
4. Model Selection and Tuning:
- When trying out different models, pay attention to the performance metrics. If a model is overfitting or if its performance metrics are too perfect (all metrics = 1), you’ll want to make the model “weaker” by reducing its complexity. For example, in Random Forest, you can reduce the number of trees, limit the depth of each tree, or increase the minimum number of samples required to split a node. Here’s an example:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=50,        # Fewer trees
    max_depth=10,           # Limit depth
    min_samples_split=10,   # Larger splits
    min_samples_leaf=5,     # Larger leaf nodes
    random_state=42
)
rf.fit(X_train, y_train)
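If you prefer searching for these settings rather than fixing them by hand, a small grid search is one option. This is only a sketch; the parameter grid is an illustrative starting point, not a recommended configuration:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [5, 10, None],
    'min_samples_leaf': [1, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring='f1', n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)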
5. Training:
Use a separate test set and cross-validation to further help prevent overfitting (cross-validation assesses how well your model generalizes across different subsets of the data):
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report

# Hold out a separate test set
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

def cross_validate_and_evaluate(model, X_train_val, y_train_val, X_test, y_test, model_name, cv=5):
    # Perform cross-validation on the training/validation portion
    cv_scores = cross_val_score(model, X_train_val, y_train_val, cv=cv, scoring='f1')
    print(f"\n{model_name} Cross-Validation Results:")
    print(f"Mean F1-score: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")

    # Train on the full training set and evaluate on the held-out test set
    model.fit(X_train_val, y_train_val)
    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred))
    return model
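A usage sketch for the four models discussed below; the hyperparameters here are defaults/placeholders, not the exact settings behind the reported numbers:

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "SVM": SVC(random_state=42),
}
for name, model in models.items():
    cross_validate_and_evaluate(model, X_train_val, y_train_val, X_test, y_test, name)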
Finally, here are some results we generated:
Play around with the model parameters and the amount of data and see what you can get!!
And this is a rough analysis:
Model Performance:
a) Logistic Regression:
- Consistent performance between cross-validation (CV) and test set.
- F1-score: 0.8049 (CV) vs 0.8088 (Test)
- Good balance of precision and recall, but room for improvement
b) Random Forest:
- High performance, but potential overfitting.
- F1-score: 0.9752 (CV) vs 0.9668 (Test)
- The small drop in test set performance suggests it’s generalizing well, despite high scores.
c) Gradient Boosting:
- Best performing model with consistent results.
- F1-score: 0.9914 (CV) vs 0.9916 (Test)
- Excellent balance of precision and recall.
d) SVM:
- Solid performance, consistent between CV and test set.
- F1-score: 0.8476 (CV) vs 0.8437 (Test)
- Good balance, but not as high as ensemble methods.
@Updates-2024 @Preinternship