Huiwen.goh - Machine Learning Pathway

Self-Assessment

Things I Have Learned
Technical: Building on what I learned in the last session, I learned much more about multi-label classification and how to implement a multi-label deep learning model. I also read up on annotation tools to understand how they are used, and implemented a similar interface in our program. Lastly, I learned how active learning loops can be used to continuously retrain a model, and implemented one in our program.

Tools: I went deeper into building deep learning models with Keras and tuning their hyperparameters to improve performance. I also gained more experience with tools that I first learned during the June session (e.g. transformers, sklearn).

Soft Skills: I learned how to work with team members who sometimes have conflicting schedules, and was able to find time to set up meetings and collaborate on the project. It was also good practice in staying disciplined while working on my part of the project, so that I completed my tasks within the specified time.

Three Achievements:

  1. I scraped data from the Stack Exchange forums, collecting the posts with tags relating to machine-learning. The posts were then preprocessed and vectorized using BERT and TF-IDF embeddings. The tags were ranked by frequency and only the 25 most frequent were used in the model; these were then binarized to feed into the deep learning model.

  2. I implemented a Keras deep learning model that predicts the 10 most probable tags for each given post. The model had a testing accuracy of approximately 72%.

  3. I added an active learning and tag annotation component to the prediction model. Users can enter the tags they consider correct for a post, and those posts and user-determined tags are added to the training set to retrain the model for better performance.
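
The tag handling described in the achievements above (keep the most frequent tags, binarize them, then report the highest-scoring tags) can be sketched in plain Python. This is only an illustration on toy data, not the project's actual code: the function names top_tags, binarize and top_k_tags are hypothetical, and the probabilities are made up rather than produced by the Keras model.

```python
from collections import Counter


def top_tags(post_tags, n):
    """Return the n most frequent tags across all posts, most frequent first."""
    counts = Counter(tag for tags in post_tags for tag in tags)
    return [tag for tag, _ in counts.most_common(n)]


def binarize(post_tags, vocabulary):
    """Encode each post's tags as a 0/1 vector aligned with `vocabulary`."""
    index = {tag: i for i, tag in enumerate(vocabulary)}
    rows = []
    for tags in post_tags:
        row = [0] * len(vocabulary)
        for tag in tags:
            if tag in index:  # tags outside the vocabulary are dropped
                row[index[tag]] = 1
        rows.append(row)
    return rows


def top_k_tags(probabilities, vocabulary, k):
    """Return the k tags with the highest predicted probability."""
    ranked = sorted(zip(probabilities, vocabulary), key=lambda p: p[0], reverse=True)
    return [tag for _, tag in ranked[:k]]


# Toy data (assumed); the report kept the 25 most frequent tags
# and predicted the 10 most probable tags per post.
posts = [
    ["neural-network", "keras"],
    ["keras", "classification"],
    ["keras", "neural-network", "rare-tag"],
]
vocab = top_tags(posts, n=2)
labels = binarize(posts, vocab)

# Made-up per-tag probabilities standing in for one model output row.
suggested = top_k_tags([0.2, 0.9], vocab, k=1)
```

With real data, n would be 25 and k would be 10. In practice sklearn's MultiLabelBinarizer does the same encoding as the binarize function here.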

Meetings/ Training Attended (Including Social Team Events)
I attended most team meetings and sub-team meetings to work on specific parts of the project (the only exception was when I was out of the city for a weekend).

Goals for Upcoming Week
Right now many of our implementations are in separate script files (text vectorization, deep learning model, active learning component, etc.). Our goal for this weekend is to combine these separate files into one integrated program that takes a single input file and can continuously predict tags and learn from user input.

Tasks Done

  • Vectorized the post data using BERT and TF-IDF
    I used DistilBERT embeddings, and for TF-IDF I built three feature sets from the top 5000, 7500 and 10000 words to use in model training

  • Implemented the Keras deep learning model and evaluated the results
    I tried a few deep learning models and evaluated accuracy and recall on the testing data set at each prediction threshold. The TF-IDF embeddings consistently performed better than the BERT embeddings, and the number of top TF-IDF features chosen did not make a significant difference

  • Implemented an active learning loop for the model
    An active learning loop was implemented such that every time there were 10 new data points (tags entered by the user), we would re-vectorize the text using TF-IDF embeddings and feed it into the deep learning model again to retrain it and continue predicting tags
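
The retrain trigger in the active learning loop above can be sketched as a small buffer that counts new annotations and calls a retraining function every 10 of them. This is a minimal sketch under assumptions: the class name ActiveLearningBuffer and the retrain_fn callback are hypothetical, and the stand-in callback only records how many examples it received. In the real program the callback would re-vectorize the text with TF-IDF and refit the Keras model.

```python
class ActiveLearningBuffer:
    """Collects user-annotated posts and retrains once enough have arrived."""

    def __init__(self, retrain_fn, batch_size=10):
        self.retrain_fn = retrain_fn  # hypothetical callback: (texts, tags) -> None
        self.batch_size = batch_size
        self.train_texts = []
        self.train_tags = []
        self.pending = 0  # annotations received since the last retrain
        self.retrain_count = 0

    def add_annotation(self, text, tags):
        """Store one annotated post; return True if it triggered a retrain."""
        self.train_texts.append(text)
        self.train_tags.append(list(tags))
        self.pending += 1
        if self.pending < self.batch_size:
            return False
        # Retrain on everything collected so far, then reset the counter.
        self.retrain_fn(self.train_texts, self.train_tags)
        self.pending = 0
        self.retrain_count += 1
        return True


# Stand-in for the real retraining step (assumed): just log the training set size.
calls = []


def fake_retrain(texts, tags):
    calls.append(len(texts))


buffer = ActiveLearningBuffer(fake_retrain, batch_size=10)
triggered = [buffer.add_annotation(f"post {i}", ["keras"]) for i in range(25)]
```

Passing the retraining step in as a callback keeps the trigger logic separate from vectorization and model code, which may help when the separate scripts are merged into one program.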