Week: 7/27
Overview of Things Learned:
- Technical Area: I learned to scrape data from a Discourse forum using BeautifulSoup and Selenium, and I stored the scraped data with Pandas and the csv module. I then preprocessed the collected data using the string and re modules for string manipulation. I also learned that the string module already provides predefined constants for punctuation and whitespace (string.punctuation and string.whitespace).
- Tools: I used Jupyter notebooks to write my code. I also used the STEM-Away forum and Slack channel for regular updates from the mentors.
- Soft Skills: This week called for effective communication with the project leads and technical leads, regular attendance at meetings, and punctual submission of work by the given deadlines.
Achievement Highlights
- I had previously used only BeautifulSoup4 to scrape data, but I ran into a new issue while scraping the Codecademy Discourse forum, which I solved with Selenium. In the process I learned a new module for web scraping.
- I successfully completed the pre-processing task of removing the unwanted characters that come with data scraped from web pages.
- I already had some experience working with dataframes in pandas, but this week's scraping and pre-processing tasks helped me revise some concepts and gave me more exposure to using dataframes to store real-world data.
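
The switch from BeautifulSoup alone to Selenium can be sketched roughly as follows. This is a minimal sketch, assuming the issue was that the forum loads more topics as the page scrolls, so a plain HTML fetch only sees the first batch. The function names, the `a.title` selector and the scroll timings are illustrative assumptions, not the exact code from the notebook.

```python
import time

from bs4 import BeautifulSoup


def load_full_page(url, pause=2.0, max_scrolls=50):
    """Scroll a lazily loaded page to the bottom and return its HTML."""
    # Imported here so the parsing helper below works without a browser.
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumes Chrome and a matching driver are installed
    try:
        driver.get(url)
        last_height = driver.execute_script("return document.body.scrollHeight")
        for _ in range(max_scrolls):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(pause)  # give the page time to load the next batch
            new_height = driver.execute_script("return document.body.scrollHeight")
            if new_height == last_height:  # nothing new was loaded, so stop
                break
            last_height = new_height
        return driver.page_source
    finally:
        driver.quit()


def extract_topics(html):
    """Return (title, link) pairs for every topic link found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    # "a.title" is an assumed selector; check the real page markup first.
    return [(a.get_text(strip=True), a.get("href")) for a in soup.select("a.title")]
```

Calling `extract_topics(load_full_page(category_url))` with a hypothetical `category_url` would give rows that can be put into a pandas DataFrame and written out with `to_csv`.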
Meetings attended
- Kick-off meeting for Team4 (7/20)
- Web Scraping check-in (7/29)
- Web-scraping presentations and preprocessing (7/31)
Goals for the Upcoming Week
We were introduced to TF-IDF and count vectorizers towards the end of Week 1, so I hope to learn more about them this week and to help fellow team members with any issues they might still be facing in the Week 1 tasks.
Tasks Done
- Web Scraping: Used BeautifulSoup4 and Selenium to scrape 3 categories (Help, Community and Projects) of the Codecademy Discourse forum and saved each category's data in a separate CSV file. Initially I could not load the complete page using bs4 alone, but using Selenium solved this.
- Pre-processing: Cleaned the data scraped from the forum threads by removing unwanted punctuation and whitespace using the string and pandas modules.
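
The cleaning step can be illustrated with a short sketch built on the predefined constants in the string module. The helper name `clean_text` is hypothetical, and this version only handles ASCII punctuation.

```python
import re
import string

# Translation table that deletes every character in string.punctuation.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

# Matches one or more characters from string.whitespace (space, tab, newline, ...).
_WHITESPACE_RUN = re.compile("[" + re.escape(string.whitespace) + "]+")


def clean_text(text):
    """Remove ASCII punctuation and collapse runs of whitespace to one space."""
    no_punct = text.translate(_PUNCT_TABLE)
    return _WHITESPACE_RUN.sub(" ", no_punct).strip()
```

With the posts held in a pandas column (say a hypothetical `df["text"]`), the same helper can be applied with `df["text"].apply(clean_text)`. Note that deleting punctuation also joins contractions, so "How's" becomes "Hows".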
Week: 8/3
Overview of Things Learned:
- Technical Area: Pre-processing, TF-IDF
- Tools: ScikitLearn, nltk, re
- Soft Skills: Teamwork, Consistency
Achievement Highlights
- Used data pre-processing techniques like removing stop words and special symbols like emojis to obtain clean data.
- Used Scikit-learn to construct a TF-IDF matrix for the cleaned data.
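
As a rough illustration of the TF-IDF step, the sketch below builds a small matrix with scikit-learn's `TfidfVectorizer`. The three posts are made-up stand-ins for the cleaned forum data, and the built-in English stop-word list is an assumed choice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for cleaned forum posts (not the real scraped data).
posts = [
    "python loop not working",
    "help with python list",
    "share your project ideas",
]

# stop_words="english" drops common words such as "with" and "your".
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(posts)  # sparse matrix with one row per post

vocab = vectorizer.vocabulary_  # maps each kept term to its column index
print(tfidf.shape)
```

Because "python" appears in two posts and "loop" in only one, "loop" gets the higher weight in the first row, which is the behaviour TF-IDF is meant to provide.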
Meetings attended
- 8/7 - Presenting Preprocessing and TF-IDF
Goals for the Upcoming Week
- Get a better understanding of BERT and complete the assigned task.
- Attend meetings.
Tasks Done
- TF-IDF processing: The one thing I had not done in the earlier pre-processing was removing the emojis present in the data. Once I found a regex-based solution for that, the rest of the TF-IDF task was straightforward.
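
A regex-based emoji filter along these lines is one way to do it. This is a sketch only: the Unicode ranges below are an assumed, non-exhaustive selection, and `remove_emojis` is a hypothetical helper name.

```python
import re

# Assumed set of ranges covering common emoji; it is not exhaustive.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicator letters used in flags
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\U0000FE0F"             # variation selector that follows some emoji
    "\U0000200D"             # zero-width joiner used in emoji sequences
    "]+"
)


def remove_emojis(text):
    """Delete emoji characters and tidy up any doubled spaces left behind."""
    without = EMOJI_PATTERN.sub("", text)
    return re.sub(r" {2,}", " ", without).strip()
```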
Week: 8/10
Overview of Things Learned:
- Technical Area: data processing, BERT model training
- Tools: torch, transformers
- Soft Skills: communication, teamwork
Achievement Highlights
- Figured out how to implement the model as a category classifier instead of a forum classifier, because we only had data from one forum that was divided into multiple categories.
Meetings attended
- 8/10 - Present Pre-Processing and TF-IDF
- 8/12 - Check-in about implementing the BERT Model
Goals for the Upcoming Week
- Make a forum classifier
- Assist any team members
Tasks Done
- BERT notebook: Since the notebook used multiple forums to train the model, I had some issues adapting it to categories. It was not easy, but I was finally able to do it. The task included further cleaning of the pre-processed data so that completely clean data was used to train the model. After completing this step, I trained the model successfully.
Week: 8/17
Overview of Things Learned:
Achievement Highlights
- Implemented BERT to make a forum classifier.
- Learned about deployment: its use, importance and the process.
- Shared a few troubleshooting tips for other team members to help them overcome some common problems that we all faced.
Meetings attended
- 8/17 - BERT Implementation Check In #2
- 8/20 - BERT Meeting
- 8/21 - BERT Presentations
Goals for the Upcoming Week
- Learn to deploy the model using AWS/Docker.
Tasks Done
- BERT notebook: Also used the data scraped by other team members to get a diverse dataset with posts from many different forums. I then used this data to train the BERT model while learning the nuances of how it works, including sentence embeddings, attention masks and other such concepts. I was finally able to train the model successfully and obtained about 87% accuracy on both the training and validation datasets.
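
To make the idea of an attention mask concrete, here is a library-free sketch of padding token-id sequences to a fixed length. The token ids in the usage example are only illustrative, and `pad_and_mask` is a hypothetical helper; in practice the tokenizer from the transformers library produces both outputs.

```python
def pad_and_mask(sequences, max_len, pad_id=0):
    """Pad or truncate token-id lists to max_len and build attention masks."""
    input_ids, attention_masks = [], []
    for seq in sequences:
        # Naive truncation; a real tokenizer would keep the closing [SEP] token.
        seq = seq[:max_len]
        padding = [pad_id] * (max_len - len(seq))
        input_ids.append(seq + padding)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_masks.append([1] * len(seq) + [0] * len(padding))
    return input_ids, attention_masks
```

For example, `pad_and_mask([[101, 7592, 102]], max_len=5)` gives the ids `[[101, 7592, 102, 0, 0]]` and the mask `[[1, 1, 1, 0, 0]]`, so the model attends only to the first three positions.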
Week: 8/24
Overview of Things Learned:
Achievement Highlights
- Created a web app in Flask for the sample machine learning model provided in the shared resources.
- Attempted to deploy the BERT model that I trained for this project.
Meetings attended
- 8/24 - Team Meeting
- 8/26 - Practice Presentation Team 4
- 8/27 - Mock presentation Team 4
- 8/27 - Final presentation Team 4
Goals for the Upcoming Weeks
- Try to expand on the knowledge and experience gained during this internship and make more projects using NLP.
Tasks Done
- Deployment: Built a Flask app that served category predictions from a model trained on the Iris dataset, then deployed it using a Dockerfile and AWS.
- Final Presentation: Contributed to the team presentation, which explained the model to the Industry Mentor.
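
The Flask app from the deployment task might look something like the sketch below. The `/predict` route, the JSON shape and the choice of `LogisticRegression` are all assumptions made for illustration; the sample model in the shared resources may have been different.

```python
from flask import Flask, jsonify, request
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
# Train a small model at start-up; a real app would more likely load a saved model.
model = LogisticRegression(max_iter=1000).fit(iris.data, iris.target)

app = Flask(__name__)


@app.route("/predict", methods=["POST"])
def predict():
    """Expect JSON like {"features": [5.1, 3.5, 1.4, 0.2]}."""
    payload = request.get_json(silent=True) or {}
    features = payload.get("features")
    valid = (
        isinstance(features, list)
        and len(features) == 4
        and all(isinstance(x, (int, float)) for x in features)
    )
    if not valid:
        return jsonify(error="expected 4 numeric features"), 400
    label = int(model.predict([features])[0])
    return jsonify(species=str(iris.target_names[label]))
```

Saved as `app.py`, this can be started locally with `flask run`, and a Dockerfile that installs the dependencies and runs the same command is one way to package it for AWS.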
FINAL SELF-ASSESSMENT: A Plethora of New Things Learned
- Web Scraping: It started with web scraping, where I learned multiple ways of scraping data using Scrapy, BeautifulSoup and json, and how to deal with the errors we might run into.
- Pre-Processing: This was more of a revision exercise for me. I used the Pandas library to clean the scraped data.
- TF-IDF: This was a new concept for me. It was interesting to learn about its advantages and disadvantages. We applied TF-IDF to our clean data to obtain features in the form of a vectorized matrix.
- BERT: We then shifted to BERT to overcome the shortcomings of TF-IDF. This was the topic I learned the most about. I ran into quite a few errors along the way but was finally able to train the model and build a decent forum classifier.
- Deployment: Towards the end, we were asked to attempt deploying the trained model. In the process, I learned how to use Flask, Docker and AWS to deploy a simple machine learning application.