This past week, I learned about different libraries and frameworks for web scraping, including BeautifulSoup and Scrapy. I initially thought that web scraping could only be done by going through HTML tags, but then realized that JSON requests were a much neater way to mine data from websites. In terms of tools, I became familiar with Google Colab for interactive Python notebooks. Although I personally use Jupyter Notebook, Colab proved to be an excellent tool for team collaboration.
Achievement Highlights
Scraping and storing data from the Webflow Community Forum
Pre-processing text data by removing stop words and punctuation
Familiarity with Jupyter notebook and Google Colab
Meetings Attended:
Web Scraping Check-in (July 29)
Web Scraping Presentations and Preprocessing (July 31)
Goals for the upcoming week:
Complete/refine data pre-processing
Complete TF-IDF
Tasks done:
Web scraping using Scrapy:
I had trouble getting data from the HTML tags, so I reached out to my teammates, who recommended using JSON requests rather than parsing HTML. #teamwork
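As a sketch of what that switch looks like: many Discourse-based forums (the Webflow Community Forum among them) return JSON when `.json` is appended to a topic URL. The payload below is an illustrative, hand-written example of that shape, and `extract_posts` is a hypothetical helper, not code from my project.

```python
import json

# Illustrative example of a Discourse-style JSON payload; the field
# names ("post_stream", "posts", "cooked") mirror the real API shape,
# but this sample data is made up.
sample_response = """
{
  "post_stream": {
    "posts": [
      {"id": 1, "cooked": "<p>How do I center a div?</p>"},
      {"id": 2, "cooked": "<p>Use flexbox with justify-content.</p>"}
    ]
  }
}
"""

def extract_posts(raw_json):
    """Pull the post bodies out of a parsed forum response."""
    data = json.loads(raw_json)
    return [post["cooked"] for post in data["post_stream"]["posts"]]

posts = extract_posts(sample_response)
print(posts[0])  # <p>How do I center a div?</p>
```

Compared with walking nested HTML tags, indexing into a parsed JSON dictionary is both shorter and far less fragile when the page layout changes.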
Data pre-processing:
Due to the nature of my forum, some of the text data that I pre-processed includes HTML links, from which I was unable to extract keywords with simple pre-processing functions. I'm still working on improving my pre-processing techniques.
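One way to handle those links is to strip URLs and tags with regex before the usual stop-word and punctuation removal. The sketch below assumes a tiny hard-coded stop-word set for illustration; a real pipeline would use a fuller list such as NLTK's.

```python
import re

# Tiny illustrative stop-word set; NLTK or spaCy provide complete lists.
STOP_WORDS = {"the", "a", "an", "to", "of", "is", "and", "i"}

def clean_post(text):
    """Strip URLs and HTML tags, then remove punctuation and stop words."""
    text = re.sub(r"https?://\S+", " ", text)  # drop raw links first
    text = re.sub(r"<[^>]+>", " ", text)       # drop leftover HTML tags
    text = re.sub(r"[^\w\s]", " ", text)       # drop punctuation
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return " ".join(tokens)

print(clean_post("<p>Check the docs at https://example.com!</p>"))
# check docs at
```

Removing the URLs before tokenizing matters: otherwise fragments like "https" and "com" survive as noise tokens.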
Technical: Data cleaning, data pre-processing, TF-IDF
Tools: Regex, Scikit-Learn
Achievement Highlights
Refined pre-processing methods on the dataset
Applied TF-IDF with Scikit-Learn
Meetings Attended:
8/5 - Pre-processing check-in
8/7 - Pre-processing and TF-IDF presentation
Goals for the upcoming week:
Learn about BERT
Apply BERT to the pre-processed dataset
Tasks done:
Data pre-processing refinement: Used regex to remove special characters that could not be removed with the String library. A comparison of the raw and processed datasets can be found here.
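A hedged sketch of that refinement: `str.replace` handles one character at a time, whereas a single regex character class can sweep out every non-alphanumeric symbol at once. The function name below is my own for illustration.

```python
import re

def remove_special_chars(text):
    """Keep only ASCII letters, digits and whitespace. A regex class
    catches characters (dashes, ellipses, symbols) that would each
    need a separate str.replace call."""
    return re.sub(r"[^A-Za-z0-9\s]", "", text)

print(remove_special_chars("TF–IDF rocks… 100%"))  # TFIDF rocks 100
```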
TF-IDF Processing: Successfully applied TF-IDF with Scikit-Learn on the scraped dataset and obtained a vector matrix with weighted values.
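To make the weighting concrete, here is a from-scratch sketch of the textbook TF-IDF computation. Note that Scikit-Learn's `TfidfVectorizer` additionally smooths the idf term and L2-normalizes each row, so its numbers differ slightly from this plain version.

```python
import math
from collections import Counter

def tfidf(docs):
    """Textbook TF-IDF: tf = term count / doc length,
    idf = log(N / number of docs containing the term)."""
    n = len(docs)
    tokenised = [doc.lower().split() for doc in docs]
    df = Counter()                       # document frequency per term
    for tokens in tokenised:
        df.update(set(tokens))
    matrix = []
    for tokens in tokenised:
        tf = Counter(tokens)
        matrix.append({term: (count / len(tokens)) * math.log(n / df[term])
                       for term, count in tf.items()})
    return matrix

weights = tfidf(["webflow forum post", "webflow css question", "css grid question"])
# "webflow" appears in 2 of 3 docs, so it is weighted lower than "forum",
# which is unique to its document.
```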
Applied more processing functions on my dataset to remove noise and get it ready for BERT
Learned about BERT and its advantages over other types of transformers
Meetings Attended:
8/10 - Presented Pre-processing and TF-IDF
Goals for the upcoming week:
Learn more theory about BERT
Implement BERT to classify text from multiple forums
Tasks done:
Data processing refinement:
I initially pre-processed my dataset for the TF-IDF notebook by removing not only HTML tags and punctuation, but also the stop words. I then realized that this strategy was not optimal for BERT, because sentence context is important to the BERT model. So, I re-processed my dataset accordingly.
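A minimal sketch of that lighter re-processing, assuming the same kind of forum text as before: strip markup and links, but leave stop words and punctuation in place so BERT sees intact sentences.

```python
import re

def clean_for_bert(text):
    """Remove HTML tags and links only; keep stop words and
    punctuation, since BERT relies on full sentence context."""
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()  # collapse extra spaces

print(clean_for_bert("<p>I am not able to publish the site.</p>"))
# I am not able to publish the site.
```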
Learning about BERT:
I hadn’t had any exposure to BERT before, so it was quite insightful to learn the technical aspects of the tool and how machine learning engineers apply it. #learn #growthmindset
Attempt to deploy a Docker container with a simple model on AWS
Expand my Machine Learning knowledge and experience!
Tasks done:
Classification model: I completed my BERT classification model with an accuracy of 90%. To reduce computation time, I used a dataset of 34,452 posts for training, of which 3,829 were selected as validation samples.
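The split described above (3,829 validation samples held out of 34,452 posts, roughly 11%) can be sketched with a simple seeded random hold-out; this is an illustration of the idea, not my actual training code, which also handled tokenization and labels.

```python
import random

def split_dataset(posts, val_size, seed=42):
    """Randomly hold out val_size samples for validation."""
    rng = random.Random(seed)           # fixed seed for reproducibility
    indices = list(range(len(posts)))
    rng.shuffle(indices)
    val_idx = set(indices[:val_size])
    train = [p for i, p in enumerate(posts) if i not in val_idx]
    val = [p for i, p in enumerate(posts) if i in val_idx]
    return train, val

posts = [f"post {i}" for i in range(34452)]
train, val = split_dataset(posts, 3829)
print(len(train), len(val))  # 30623 3829
```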
Final presentation: During our final presentation, I broadly explained how BERT works and its benefits over other NLP techniques such as TF-IDF and unidirectional transformers.
FINAL SELF-ASSESSMENT: Brief Summary of Things Learned
Web Scraping: I kicked off the internship by learning about different Python frameworks for web scraping, such as Selenium, BeautifulSoup and Scrapy, which all had a surprising number of built-in features. I learned to acquire data from JSON requests rather than raw HTML.
Pre-Processing: I learned about different pre-processing techniques and libraries, such as Python's string module and regex, to remove unwanted characters and prepare the scraped dataset for NLP models.
TF-IDF: Learned about the theory behind TF-IDF and its simplicity and speed for classification models. I used the Scikit-Learn library to create a TF-IDF matrix from my dataset.
BERT : Learned about BERT and its benefits over other NLP techniques such as TF-IDF and unidirectional transformers. After much debugging, I was able to implement a classification model with BERT.
Deployment: Learned the benefits of using AWS, Flask and Docker to deploy machine learning models in production.