Chanh_Bui - Machine Learning Pathway

Final Self-Assessment
Overview of things learned:

  • Technical Area: Over the course of the five weeks, I learned how to use Beautiful Soup to scrape data from multiple forums and store it in a file. I also learned how to pre-process the data that I collected. I learned about the BERT model and how it works, and then how to use the transformers library to train a BERT model for a classification task.
  • Tools: We learned how to use GitHub, Asana, and Slack for communication and collaboration.
  • Soft skills: I learned how to communicate with my teammates, help them with their difficulties, and reach out to teammates and leads whenever I have questions. I also learned how to manage my time.
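The Beautiful Soup workflow mentioned above can be sketched as follows. This is a minimal, illustrative example: the HTML snippet and the CSS classes are made up, and in practice the markup would come from `requests.get(url).text` or a Selenium `page_source`.

```python
from bs4 import BeautifulSoup

# Illustrative HTML standing in for a fetched forum page.
html = """
<div class="topic-list">
  <div class="topic"><a class="title">How to fine-tune BERT?</a></div>
  <div class="topic"><a class="title">Selenium scrolling tips</a></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Select every post-title link and extract its visible text.
titles = [a.get_text(strip=True) for a in soup.select("div.topic a.title")]
print(titles)  # ['How to fine-tune BERT?', 'Selenium scrolling tips']
```

Each extracted record can then be appended to a list of rows and written out to a file or a pandas data frame.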

Three achievements I had so far:

  • Our team has successfully scraped more than 5,000 posts from a Discourse website and pre-processed them.
  • I understood how BERT models work and learned more about NLP.
  • I successfully trained a BERT model to classify forum posts into categories.

List of meetings I have joined so far:

  • All the team meetings, including the general team meetings, web scraping workshop meetings, and data processing meetings.
  • Git Webinar
  • STEMCast Webinar
  • Python Webinar
  • NLP Webinar 3
  • Q and A Webinar

Goals of the upcoming weeks:

  • Deploy the model

Task Done:

  • Web Scraping: Scraped data from 5,000 posts using Selenium and stored the data in a data frame.
  • Data Cleaning: Cleaned the data with different methods using the nltk module.
  • Calculate TF-IDF: Calculated TF-IDF statistics using the sklearn module.
  • Worked as a team on the web scraping and data pre-processing tasks.
  • Learned about NLP and BERT from the transformers library documentation.
  • BERT: Trained a BERT model to classify forum posts into categories.
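The BERT classification task above can be sketched with the transformers library. To keep the example self-contained, this uses a tiny randomly-initialized BERT rather than downloading pretrained weights; a real run would call `BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=...)` and fine-tune on the labeled forum posts, and the choice of 5 categories here is only an assumption for illustration.

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

# Tiny config so the model builds instantly with random weights.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=5,  # one output logit per forum category (assumed number)
)
model = BertForSequenceClassification(config)
model.eval()

# Stand-in token ids; in practice a BertTokenizer produces these from post text.
input_ids = torch.tensor([[2, 17, 45, 3]])
with torch.no_grad():
    logits = model(input_ids).logits  # shape: (batch_size, num_labels)

predicted_category = logits.argmax(dim=-1).item()
print(logits.shape)  # torch.Size([1, 5])
```

Fine-tuning then amounts to passing `labels=` to the forward call (which returns a cross-entropy loss) and stepping a PyTorch optimizer over batches of tokenized posts.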

Problems Faced and How I Solved Them:

  • I had problems understanding how to use BERT in the transformers library. Reading the library's documentation helped me understand a lot.
  • My first BERT model's accuracy was not very high. When I decreased the number of classes, the accuracy went up, so I think the model needs more fine-tuning.

Self-Assessment Week 4:

Overview of things learned:

  • Technical Area: I learned about the transformers library and how to use it with BERT, including in-depth usage of BERT for tasks such as sequence classification. I understood how BERT can be applied to the data we collected to build a model that classifies any given post into a specific category. I also learned about PyTorch, which is necessary for using the BERT model with the transformers library to classify sequences.
  • Tools: We are still using GitHub, Asana, and Slack for communication and collaboration.
  • Soft skills: I further developed my ability to communicate with people and manage my time, and I learned how to help other team members with their difficulties.

Three achievements I had so far:

  • Our team has successfully scraped more than 5,000 posts from a Discourse website and pre-processed them.
  • I understood how BERT models work and learned more about NLP.
  • I learned how to use the BERT model through the transformers library and applied it to a small data set.

List of meetings I have joined so far:

  • All the team meetings, including the general team meetings, web scraping workshop meetings, and data processing meetings.
  • Git Webinar
  • STEMCast Webinar
  • Python Webinar
  • NLP Webinar 3
  • Q and A Webinar

Goals of the upcoming week:

  • Finish using the BERT model for our classification task.

Task Done:

  • Web Scraping: Scraped data from 5,000 posts using Selenium and stored the data in a data frame.
  • Data Cleaning: Cleaned the data with different methods using the nltk module.
  • Calculate TF-IDF: Calculated TF-IDF statistics using the sklearn module.
  • Worked as a team on the web scraping and data pre-processing tasks.
  • Learned about NLP and BERT from the transformers library documentation.
  • BERT: Used the BERT model on a small data set.

Problems Faced and How I Solved Them:

  • I had problems understanding how to use BERT in the transformers library. Reading the library's documentation helped me understand a lot.

Self-Assessment Week 3:

Overview of things learned:

  • Technical Area: I learned more about scraping and pre-processing this week, and I also learned how to use the pandas module. Most importantly, I learned about NLP, and specifically the BERT model. From the NLP webinars and team meetings, I was able to understand how NLP models and BERT work under the hood. I also picked up more problem-solving skills while working on our team tasks.
  • Tools: This week, I started using GitHub to collaborate with my teammates on the scraping and pre-processing tasks. I also use Asana and Slack to communicate with my team and divide tasks.
  • Soft skills: I learned to communicate with my teammates and efficiently divide tasks among all team members. I also improved my time management skills to finish my tasks on time.

Three achievements I had so far:

  • Our team has successfully scraped more than 5,000 posts from a Discourse website.
  • We have successfully pre-processed the data by removing tags and stopwords, tokenizing, and lemmatizing.
  • I learned how to use the BERT model via the transformers library.
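The pre-processing steps above (tag removal, lowercasing, tokenization, stopword removal) can be sketched in plain Python. This is a simplified illustration: the team used the nltk module, whose `stopwords` corpus and `WordNetLemmatizer` would replace the tiny hand-written stopword set here and add the lemmatization step.

```python
import re

# Small illustrative stopword subset; nltk.corpus.stopwords provides a full list.
STOPWORDS = {"the", "a", "an", "is", "to", "and", "of", "in", "how", "i"}

def preprocess(post_html: str) -> list[str]:
    """Strip HTML tags, lowercase, tokenize on letters, and drop stopwords."""
    text = re.sub(r"<[^>]+>", " ", post_html)      # remove tags
    tokens = re.findall(r"[a-z']+", text.lower())  # lowercase + tokenize
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("<p>How to train a BERT model in Python</p>"))
# ['train', 'bert', 'model', 'python']
```

The resulting token lists are what get joined back into cleaned strings for the TF-IDF and BERT steps.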

List of meetings I have joined so far:

  • All the team meetings, including the general team meetings, web scraping workshop meetings, and data processing meetings.
  • Git Webinar
  • STEMCast Webinar
  • Python Webinar
  • NLP Webinar 3
  • Q and A Webinar

Goals of the upcoming week:

  • Apply the BERT model to the data our team collected.

Task Done:

  • Web Scraping: Scraped data from 5,000 posts using Selenium and stored the data in a data frame.
  • Data Cleaning: Cleaned the data with different methods using the nltk module.
  • Calculate TF-IDF: Calculated TF-IDF statistics using the sklearn module.
  • Worked as a team on the web scraping and data pre-processing tasks.

Problems Faced and How I Solved Them:

  • I had a problem where the web driver moved on before the page finished loading, so my code was not actually scraping any data. I had to add a line of code to make the driver wait so the data could be scraped.
  • I also did not know how to get more posts, since the main page of the website only lists the latest 30 posts. I read the Selenium documentation and learned how to scroll a webpage with Selenium.
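Both fixes above can be sketched together: an explicit wait so scraping does not start too early, and a scroll loop that keeps loading posts until the page height stops growing. The URL and the CSS selector are illustrative assumptions, and the Selenium demo is wrapped in a function that is defined but not called, since it needs a local Chrome driver.

```python
import time

def scroll_to_load_all(driver, pause=1.0, max_rounds=50):
    """Scroll to the bottom until the page height stops growing, so an
    infinite-scroll page loads more than its first batch of posts."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the newly requested posts time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # no new content appeared; we have reached the end
        last_height = new_height

def scrape_forum(url="https://forum.example.com/latest"):
    """Illustrative driver setup; requires Chrome + chromedriver to run."""
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()
    driver.get(url)
    # Explicit wait: do not start scraping before the first posts appear.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "tr.topic-list-item"))
    )
    scroll_to_load_all(driver, pause=0.5)
    return driver.find_elements(By.CSS_SELECTOR, "tr.topic-list-item")
```

The scroll helper only uses `execute_script`, so it works with any Selenium driver object.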

Self-Assessment Week 1/2:

Overview of things learned:

  • Technical Area: I learned how to do web scraping and how to clean the data in Python using new modules such as Selenium, Beautiful Soup, and nltk. I also learned about TF-IDF, a statistical measure of a word’s importance in a corpus.
  • Tools: I learned how to use Asana and Slack for communication and team management. We have not done much with these two tools yet, but I can see their importance for our project. I also learned about GitHub and VS Code, which will be extremely helpful for me when working on the project.
  • Soft skills: I learned how to communicate with people effectively on different platforms. I also learned how to manage my time effectively to learn more materials related to this project.

Three achievements I had so far:

  • I successfully scraped data from a Discourse website by targeting a specific tag and class.
  • I successfully cleaned the scraped data with lemmatization, stopword removal, lowercasing, and tokenization using the nltk module.
  • I successfully calculated the TF-IDF of the collected data using the sklearn module.

List of meetings I have joined so far:

  • All the team meetings, including the general team meetings, web scraping workshop meetings, and data processing meetings.
  • Git Webinar
  • STEMCast Webinar

Goals of the upcoming week:

  • Learn to use more modules to clean and process data. Ideally, I will also start on the machine learning part of the project and contribute to it.

Task Done:

  • Web Scraping: Scraped data from a website using Selenium and stored the data.
  • Data Cleaning: Cleaned the data with different methods using the nltk module.
  • Calculate TF-IDF: Calculated TF-IDF statistics using the sklearn module.

Problems Faced and How I Solved Them:

  • I did not know which module to use for data cleaning and processing. The workshop really helped me learn what to use and how to use the module.