In this module, we worked out how to parse all of the papers we need from the database. We learned how to use the PubMed parser and how to leverage Dask to process the papers in parallel. We also learned how to use the Stanford parser to break the text down further into sentences and words.
Achievements: Processed a small subset of papers using the PubMed parser, and parsed sentences using the Stanford parser.
Goals: Apply knowledge of Dask to process the entire dataset of PubMed papers.
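As a rough illustration of the parallel parsing pattern described in this entry, the sketch below maps a parsing function over a collection of documents with a Dask bag. The `parse_paper` function is a hypothetical stand-in for the PubMed parser call (it only pulls a title and abstract out of a toy XML string). The element names, `parse_all`, and the sample documents are assumptions made for the example, and the code falls back to a sequential loop if Dask is not installed.

```python
import xml.etree.ElementTree as ET

try:
    import dask.bag as db
except ImportError:  # Dask not available: run sequentially instead
    db = None


def parse_paper(xml_text):
    """Hypothetical stand-in for a PubMed parser call on one document."""
    root = ET.fromstring(xml_text)
    return {
        "title": root.findtext(".//ArticleTitle", default=""),
        "abstract": root.findtext(".//AbstractText", default=""),
    }


def parse_all(xml_docs, npartitions=2):
    """Parse every document, in parallel when Dask is installed."""
    if db is None:
        return [parse_paper(doc) for doc in xml_docs]
    bag = db.from_sequence(xml_docs, npartitions=npartitions)
    # The threaded scheduler keeps the example light; a real run over the
    # full dataset would more likely use processes or a distributed cluster.
    return bag.map(parse_paper).compute(scheduler="threads")


# Toy input, invented for illustration only.
sample_docs = [
    "<Article><ArticleTitle>Paper A</ArticleTitle>"
    "<Abstract><AbstractText>First abstract.</AbstractText></Abstract></Article>",
    "<Article><ArticleTitle>Paper B</ArticleTitle>"
    "<Abstract><AbstractText>Second abstract.</AbstractText></Abstract></Article>",
]
records = parse_all(sample_docs)
```

Scaling from the small subset to the entire dataset would mostly mean building the bag from file paths rather than in-memory strings and choosing a partition count that suits the available workers.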
We gained a deeper understanding of the Stanford parser and the mechanism by which it operates. We learned that the neural network's task is to choose the optimal transition (push to stack, left arc, or right arc) given the current state of the system (stack, buffer, and set of dependency arcs).
Achievements: Completed the text implementation on the Stanford parser.
Goals: Start parsing Medline abstracts and form dependency paths that we can use as input to the EBC algorithm.
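The transition system described in this entry can be sketched in a few lines. This is a minimal arc-standard example, not the Stanford parser's code: the sequence of transitions is supplied by hand, whereas in the real parser the neural network predicts each transition from the current state. The function names and the example sentence are assumptions for illustration, and words are used directly as identifiers, so repeated words would need indices instead.

```python
def apply_transition(stack, buffer, arcs, transition):
    """Apply one transition to the (stack, buffer, arcs) state in place."""
    if transition == "SHIFT":
        # Push the next word from the buffer onto the stack.
        stack.append(buffer.pop(0))
    elif transition == "LEFT-ARC":
        # The second item on the stack becomes a dependent of the top item.
        dependent = stack.pop(-2)
        arcs.add((stack[-1], dependent))
    elif transition == "RIGHT-ARC":
        # The top item becomes a dependent of the item below it.
        dependent = stack.pop()
        arcs.add((stack[-1], dependent))
    else:
        raise ValueError(f"unknown transition: {transition}")


def parse(words, transitions):
    """Run a fixed transition sequence and return the (head, dependent) arcs."""
    stack, buffer, arcs = ["ROOT"], list(words), set()
    for transition in transitions:
        apply_transition(stack, buffer, arcs, transition)
    return arcs


# Hand-written transition sequence for a three-word example sentence.
arcs = parse(
    ["aspirin", "inhibits", "COX1"],
    ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC", "RIGHT-ARC"],
)
```

The resulting arcs are what a dependency path between two entities (here the drug and the gene) would be read from.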
Learned more about the Ensemble Biclustering for Classification (EBC) algorithm. This algorithm is used to determine which drug-gene pairs, and which dependency paths, reflect the same biological mechanisms. Read about Information-Theoretic Co-Clustering (ITCC), which is the backbone of EBC. Implemented the EBC algorithm in Python.
Achievements: Developed an intuitive understanding of EBC and implemented it.
Goals: Obtain cluster assignments for drug-gene pairs and dependency paths.
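To make the idea behind ITCC more concrete: it looks for row and column clusterings that lose as little mutual information as possible when the original matrix is collapsed into a cluster-by-cluster matrix. The sketch below only evaluates that loss for a given pair of clusterings; it does not perform the iterative optimization, and it is not the project's implementation. The toy count matrix (rows standing in for drug-gene pairs, columns for dependency paths) and the function names are assumptions for illustration.

```python
import math


def mutual_information(counts):
    """Mutual information (in bits) of the joint distribution given by counts."""
    total = sum(sum(row) for row in counts)
    row_p = [sum(row) / total for row in counts]
    col_p = [sum(col) / total for col in zip(*counts)]
    mi = 0.0
    for i, row in enumerate(counts):
        for j, value in enumerate(row):
            p = value / total
            if p > 0:
                mi += p * math.log2(p / (row_p[i] * col_p[j]))
    return mi


def collapse(counts, row_clusters, col_clusters):
    """Sum counts into a (row cluster) x (column cluster) matrix."""
    n_row = max(row_clusters) + 1
    n_col = max(col_clusters) + 1
    out = [[0.0] * n_col for _ in range(n_row)]
    for i, row in enumerate(counts):
        for j, value in enumerate(row):
            out[row_clusters[i]][col_clusters[j]] += value
    return out


def information_loss(counts, row_clusters, col_clusters):
    """Mutual information lost by replacing rows and columns with clusters."""
    clustered = collapse(counts, row_clusters, col_clusters)
    return mutual_information(counts) - mutual_information(clustered)


# Toy block-structured matrix: two groups of rows, two groups of columns.
toy = [
    [2, 2, 0, 0],
    [2, 2, 0, 0],
    [0, 0, 2, 2],
    [0, 0, 2, 2],
]
good_loss = information_loss(toy, [0, 0, 1, 1], [0, 0, 1, 1])
bad_loss = information_loss(toy, [0, 1, 0, 1], [0, 0, 1, 1])
```

A clustering that respects the block structure loses no information, while one that mixes the row groups loses all of it, which is the signal the co-clustering search exploits.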
Continued implementing the EBC algorithm, focusing on the supervised step since the unsupervised step had already been implemented. Obtained known drug-gene relationships from DrugBank and constructed seed sets and test sets.
Achievements: Implemented the supervised step of EBC and generated the final co-occurrence scores.
Goals: Use this co-occurrence matrix together with DrugBank to check whether our clustering matches the ground-truth relationships.
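One way to picture the supervised step is sketched below, under the assumption that it works from repeated clustering runs: count how often each pair of items lands in the same cluster, then score each test item by its total co-occurrence with the seed set and rank the test items by that score. The data layout, function names, and placeholder item labels are all hypothetical; this is a simplified sketch, not the project's implementation.

```python
from itertools import combinations


def cooccurrence_counts(runs):
    """Count, over all runs, how often two items share a cluster.

    Each run is a dict mapping an item label to its cluster id.
    Keys of the result are (a, b) tuples with a < b.
    """
    counts = {}
    for assignment in runs:
        for a, b in combinations(sorted(assignment), 2):
            if assignment[a] == assignment[b]:
                counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts


def seed_score(item, seeds, counts):
    """Total co-occurrence between one test item and every seed item."""
    return sum(
        counts.get(tuple(sorted((item, seed))), 0)
        for seed in seeds
        if seed != item
    )


def rank_test_items(test_items, seeds, counts):
    """Order test items from most to least similar to the seed set."""
    return sorted(
        test_items,
        key=lambda item: seed_score(item, seeds, counts),
        reverse=True,
    )


# Three toy clustering runs over placeholder drug-gene pair labels.
runs = [
    {"pair1": 0, "pair2": 0, "pair3": 1},
    {"pair1": 1, "pair2": 1, "pair3": 1},
    {"pair1": 0, "pair2": 1, "pair3": 1},
]
counts = cooccurrence_counts(runs)
ranking = rank_test_items(["pair3", "pair2"], seeds=["pair1"], counts=counts)
```

Comparing such a ranking against held-out known relationships is then a matter of checking whether the true members of the test set appear near the top.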
Created a dendrogram from the co-occurrence matrix rankings. Worked on documenting the project and prepared the final presentation.