Kjean - Machine Learning Pathway

Week: July 26 - August 1, 2020

Overview of Things Learned:

Over the past week, I learned about different libraries and frameworks for web scraping, including BeautifulSoup and Scrapy. I initially thought that web scraping could only be done by going through HTML tags, but then realized that JSON requests were a much neater way to mine data from websites. In terms of tools, I became familiar with Google Colab for interactive Python notebooks. Although I personally use Jupyter Notebook, Colab proved to be an excellent tool for team collaboration.

Achievement Highlights

  1. Scraping and storing data from the Webflow Community Forum
  2. Pre-processing text data by removing stop words and punctuation
  3. Familiarity with Jupyter Notebook and Google Colab

Meetings Attended:

  • Web Scraping Check-in (July 29)
  • Web Scraping Presentations and Preprocessing (July 31)

Goals for the upcoming week:

  • Complete/refine data pre-processing
  • Complete TF-IDF

Tasks done:

Web scraping using Scrapy:

  • I had trouble getting data from the HTML tags, so I reached out to my teammates, who recommended using JSON requests rather than raw HTML (a minimal sketch follows). #teamwork
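
Here is a minimal sketch of that approach, assuming the forum is Discourse-based and serves JSON when .json is appended to a listing URL (the endpoint and field names below are illustrative, not my exact spider):

```python
import json

import scrapy

class ForumSpider(scrapy.Spider):
    """Scrape topic titles from a Discourse-style forum's JSON endpoint."""
    name = "forum"
    # Illustrative endpoint: Discourse forums typically expose listings as JSON
    start_urls = ["https://discourse.webflow.com/latest.json"]

    def parse(self, response):
        data = json.loads(response.text)
        for topic in data["topic_list"]["topics"]:
            yield {"id": topic["id"], "title": topic["title"]}
```

Running it with `scrapy runspider forum_spider.py -o topics.json` dumps the results to a file.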

Data pre-processing:

  • Due to the nature of my forum, some of the text data that I pre-processed includes HTML links, from which I was unable to extract keywords with simple pre-processing functions. I'm still working on improving my pre-processing techniques (see the sketch below).
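
Here is a minimal sketch of this kind of clean-up, assuming NLTK's English stop-word list (the function name and regexes are mine, not the exact notebook code):

```python
import re
import string

from nltk.corpus import stopwords  # requires nltk.download("stopwords") once

STOP_WORDS = set(stopwords.words("english"))

def preprocess(text: str) -> str:
    """Lower-case, strip links/HTML/punctuation, and drop stop words."""
    text = re.sub(r"https?://\S+", " ", text)   # remove bare links
    text = re.sub(r"<[^>]+>", " ", text)        # remove leftover HTML tags
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [w for w in text.lower().split() if w not in STOP_WORDS]
    return " ".join(tokens)
```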

Week: August 2 - 8, 2020

Overview of Things Learned:

Technical: Data cleaning, data pre-processing, TF-IDF

Tools: Regex, Scikit-Learn

Achievement Highlights

  1. Refined pre-processing methods on the dataset
  2. Applied TF-IDF with Scikit-Learn

Meetings Attended:

  • 8/5 - Pre-processing check-in
  • 8/7 - Pre-processing and TF-IDF presentation

Goals for the upcoming week:

  • Learn about BERT
  • Apply BERT to the pre-processed dataset

Tasks done:

  • Data pre-processing refinement: Used regex to remove special characters that could not be removed with Python's string module. A comparison of the raw and processed datasets can be found here.
  • TF-IDF processing: Successfully applied TF-IDF with Scikit-Learn on the scraped dataset and obtained a matrix of weighted term vectors (a minimal sketch follows this list).
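
Here is a minimal sketch of the TF-IDF step with Scikit-Learn; `processed_posts` is a placeholder for the cleaned text column:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

processed_posts = ["first cleaned post", "second cleaned post"]  # placeholder data

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(processed_posts)  # sparse docs-by-terms matrix
print(tfidf_matrix.shape)                  # (n_posts, n_unique_terms)
print(vectorizer.get_feature_names_out())  # the vocabulary behind the columns
```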

Week: August 9 - 15, 2020

Overview of Things Learned:

Technical: Data processing, Machine learning

Tools: Regex, Scikit-Learn, DistilBERT, Torch

Achievement Highlights

  1. Applied more processing functions on my dataset to remove noise and get it ready for BERT
  2. Learned about BERT and its advantages over other types of transformers

Meetings Attended:

  • 8/10 - Present Preprocessing and TF-IDF

Goals for the upcoming week:

  • Learn more theory about BERT
  • Implement BERT to classify text from multiple forums

Tasks done:

Data processing refinement:

  • I initially pre-processed my dataset for the TF-IDF notebook by removing not only HTML tags and punctuation, but also the stop words. I then realized that this strategy was not optimal for BERT, because sentence context matters to the BERT model. So, I re-processed my dataset accordingly (a sketch follows).
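
Here is a minimal sketch of the lighter clean-up this implies, keeping stop words and punctuation so BERT still sees full sentences (names are mine):

```python
import re

def clean_for_bert(text: str) -> str:
    """Strip markup noise but keep stop words and punctuation for BERT."""
    text = re.sub(r"<[^>]+>", " ", text)       # remove HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # remove bare links
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace
```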

Learning about BERT:

  • I hadn't had exposure to BERT before, so it was quite insightful to learn the technical aspects of the tool and its applications by machine learning engineers. #learn #growthmindset

Week: August 16 - 22, 2020

Overview of Things Learned:

Technical: Machine learning, BERT

Tools: Torch, Transformers, DistilBERT

Achievement Highlights

  1. Learning more about BERT
  2. Sentence embedding of posts from 5 different forums

Meetings Attended:

  • 8/17 - BERT Implementation Check In #2
  • 8/20 - BERT meeting
  • 8/21 - BERT presentation

Goals for the upcoming week:

  • Complete implementation of BERT (i.e. train the classification model)
  • Deploy a Docker container with a simple model on AWS

Tasks done:

Data preparation: I merged the posts from 5 different forums, dropped the rows with missing values, and encoded the labels prior to implementing BERT.
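
Here is a minimal sketch of that preparation with pandas and Scikit-Learn; the file and column names are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical filenames for the five scraped forum datasets
frames = [pd.read_csv(f"forum_{i}.csv") for i in range(5)]
df = pd.concat(frames, ignore_index=True)
df = df.dropna(subset=["content"])                       # drop rows without post text
df["label"] = LabelEncoder().fit_transform(df["forum"])  # encode forum names as integers
```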

Sentence embedding: I embedded the sentences from the posts’ content using the DistilBERT model; the maximum sentence length in my dataset was 128.
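
Here is a minimal sketch of the embedding step with the Transformers library, assuming the distilbert-base-uncased checkpoint and taking the hidden state at the [CLS] position as each post's vector:

```python
import torch
from transformers import DistilBertModel, DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

posts = ["example forum post", "another post"]  # placeholder data
enc = tokenizer(posts, padding="max_length", truncation=True,
                max_length=128, return_tensors="pt")  # 128 = max length in my dataset
with torch.no_grad():
    out = model(**enc)
embeddings = out.last_hidden_state[:, 0, :]  # one 768-dim vector per post
```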

Week: August 23 - 29, 2020

Overview of Things Learned:

Technical: Machine learning, BERT

Tools: Torch, Transformers, DistilBERT

Soft skills: communication, collaboration, teamwork

Achievement Highlights

  1. Completion of BERT model
  2. Overview of BERT in the final presentation

Meetings Attended:

  • 8/24 - Team Meeting
  • 8/26 - Practice Presentation Team 4
  • 8/27 - Mock presentation Team 4
  • 8/27 - Final presentation Team 4

Goals for the upcoming week:

  • Attempt to deploy a Docker container with a simple model on AWS
  • Expand my Machine Learning knowledge and experience!

Tasks done:

Classification model: I completed my BERT classification model with an accuracy of 90%. To reduce computation time, I used a dataset of 34,452 posts for training, of which 3,829 were selected as validation samples.
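
The classifier head isn't spelled out above; a common pattern, and a minimal sketch under that assumption, is logistic regression on top of the DistilBERT embeddings:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# `embeddings` (torch tensor) and `df["label"]` come from the earlier steps
X = embeddings.numpy()
y = df["label"].to_numpy()
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.11, random_state=42)  # roughly the 3,829-of-34,452 split above

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("validation accuracy:", clf.score(X_val, y_val))
```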

Final presentation: During our final presentation, I broadly explained how BERT works, and its benefits over other types of NLP techniques such as TF-IDF and unidirectional transformers.

FINAL SELF-ASSESSMENT: Brief Summary of Things Learned

  1. Web Scraping: I kicked off the internship by learning about different Python frameworks for web scraping, such as Selenium, BeautifulSoup and Scrapy, all of which had a surprising number of built-in features. I learned to acquire data from JSON requests rather than raw HTML.
  2. Pre-Processing: I learned about different pre-processing techniques and libraries, such as Python's string module and regular expressions, to remove unwanted characters and prepare the scraped dataset for NLP models.
  3. TF-IDF: Learned about the theory behind TF-IDF and its simplicity and speed for classification models. I used the Scikit-Learn library to create a TF-IDF matrix from my dataset.
  4. BERT: Learned about BERT and its benefits over other NLP techniques such as TF-IDF and unidirectional transformers. After much debugging, I was able to implement a classification model with BERT.
  5. Deployment: Learned the benefits of using AWS, Flask and Docker to deploy machine learning models for production (a minimal illustrative sketch follows).
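
To illustrate the deployment idea in item 5 (a sketch of the concept, not something I deployed), here is a minimal Flask app that would serve a pickled classifier; the model path and request fields are hypothetical:

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)
with open("model.pkl", "rb") as f:  # hypothetical pickled classifier
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]  # expects a list of feature vectors
    return jsonify(predictions=model.predict(features).tolist())

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)  # bind to all interfaces for Docker
```

Dockerizing this would then amount to copying the app and model into an image, installing Flask and Scikit-Learn, and exposing port 5000.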