- Web Scraping with Beautiful Soup
- Data Cleaning
Three achievement highlights
- Went through the scraping tutorial, understood it and executed it.
- Created a web scraping script to collect data from a website.
- Performed data cleaning and EDA to output a clean CSV file.
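The scraping script itself is not shown in this report, but the workflow above can be sketched roughly as follows. This is a minimal illustration assuming a hypothetical URL and page markup (`<article>` tags with `<h2>`/`<p>` children); the real site and selectors would differ.

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical target URL -- the actual site scraped is not named in the report.
URL = "https://example.com/articles"

def scrape_articles(url):
    """Fetch a page and extract article titles and summaries."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    rows = []
    # Assumed markup: each article sits in an <article> tag with <h2> and <p>.
    for article in soup.find_all("article"):
        title = article.find("h2")
        summary = article.find("p")
        rows.append({
            "title": title.get_text(strip=True) if title else None,
            "summary": summary.get_text(strip=True) if summary else None,
        })
    return pd.DataFrame(rows)

def clean(df):
    """Basic cleaning before writing the CSV: drop empty rows and duplicates."""
    df = df.dropna(subset=["title"]).drop_duplicates()
    df["title"] = df["title"].str.strip()
    return df

# clean(scrape_articles(URL)).to_csv("articles_clean.csv", index=False)
```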
List of meeting/training sessions attended
- STEMCast: Overview of ML and project
- Git Webinar
- Weekly Team meeting 06/8
Goals for upcoming weeks
- Study text pre-processing, NLP, and text classification.
- Explore data visualization.
Technical: BERT basics, TF-IDF analysis
- Obtain a basic understanding of encoders in BERT
- Understand the implementation of TF-IDF
- Calculate Term Frequency and Inverse Document Frequency for various article categories
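The TF-IDF calculation described above can be sketched directly from the standard definitions (TF = term count / document length, IDF = log(N / document frequency)). This is a toy pure-Python version for clarity; in practice a library such as scikit-learn's `TfidfVectorizer` would likely be used, and its smoothing variant differs slightly from this formula.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF scores for a list of tokenized documents.
    TF = count(term in doc) / len(doc);  IDF = log(N / df(term))."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return scores

# Toy "article categories" as tokenized documents (illustrative only).
docs = [
    ["sports", "game", "team"],
    ["sports", "score"],
    ["election", "vote", "team"],
]
scores = tf_idf(docs)
```

A term that appears in every document gets an IDF of log(1) = 0, so it contributes nothing to distinguishing categories, which is exactly the behaviour wanted here.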
Week 4 Team Meeting (Monday)
- Implement a full-scale BERT model to outperform Word2Vec
- Run tests to see which categories are best for classifying articles
Tasks completed this week included computing TF-IDF for the scraped data. I first went through various sources to understand how to implement TF-IDF and what output format we need from it.
I also did research to gain an understanding of BERT, going through multiple blogs and YouTube videos to grasp the concept, and presented my findings to the group. Currently I am working on understanding the implementation of BERT so that I can use it in our project.
Technical: BERT, Training and Validation
Tools: Model training, BertTokenizer
- Implement a BERT Model.
- Gained valuable insights from the accuracy table and understood what modifications are needed.
- Refine the BERT model to improve its prediction accuracy
- Implement one-hot encoding to ensure the dataset is not biased, then run the BERT model again
After facing some problems executing BERT, I could finally implement the BERT model, analyse the results, and understand where I need to improve. I realised that the dataset is biased, so the current accuracy figures cannot be considered reliable, and I went through resources to study how to remove this bias. I now look forward to applying one-hot encoding and running BERT on the new dataset.
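The two steps described here, checking the class distribution that makes raw accuracy misleading and one-hot encoding the labels, can be sketched with pandas. The dataframe and column names below are hypothetical; the report does not show the actual dataset schema.

```python
import pandas as pd

# Hypothetical dataframe of scraped articles -- columns are illustrative.
df = pd.DataFrame({
    "text": ["match report", "budget vote", "transfer news", "poll results"],
    "category": ["sports", "politics", "sports", "politics"],
})

# Inspect the class distribution first: a skewed split is the "bias"
# that makes a high raw accuracy untrustworthy.
counts = df["category"].value_counts()

# One-hot encode the category labels into indicator columns.
one_hot = pd.get_dummies(df["category"], prefix="cat")
df_encoded = pd.concat([df, one_hot], axis=1)
```

Note that one-hot encoding by itself only changes the label representation; an imbalanced class split would additionally need resampling or class weights during training.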