Week: 7/27
Overview of Things Learned:
- Technical Area: Web scraping, Data cleaning, Data preprocessing
- Tools: BeautifulSoup, Selenium, Scrapy, Requests, Pandas, Regex, Google Colab
Achievement Highlights
- Scraped data from over 30,000 posts
- Scraped JSON data using Scrapy
- Learned about data cleaning and preprocessing
Meetings attended
- 7/23 - ML team kick-off meeting
- 7/31 - Web Scraping and Preprocessing Presentations
Goals for the Upcoming Week
- Complete preprocessing of the scraped data
- Learn about TF-IDF
Tasks Done
- Web scraping: Built a web crawler with Scrapy, used it to scrape JSON data, and stored the results in a CSV file. Loaded the CSV into a pandas DataFrame and cleaned the data (a rough sketch of the crawler follows below).
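A minimal sketch of what such a Scrapy spider could look like, assuming a paginated JSON endpoint; the URL, the "posts"/"next" keys, and the field names (title, body, author) are placeholders rather than the actual source.

```python
import scrapy

class PostsSpider(scrapy.Spider):
    """Spider that reads a JSON API and yields one item per post."""
    name = "posts"
    # Hypothetical paginated JSON endpoint (placeholder URL)
    start_urls = ["https://example.com/api/posts?page=1"]

    def parse(self, response):
        data = response.json()                     # parse the JSON payload
        for post in data.get("posts", []):
            yield {                                # each dict becomes one CSV row
                "title": post.get("title"),
                "body": post.get("body"),
                "author": post.get("author"),
            }
        next_page = data.get("next")               # follow pagination if present
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Running the spider with `scrapy crawl posts -O posts.csv` writes the yielded items to a CSV file, which can then be loaded with `pandas.read_csv("posts.csv")` for cleaning.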
Week: 8/3
Overview of Things Learned:
- Technical Area: Data preprocessing, TF-IDF
- Tools: BeautifulSoup, Requests, Pandas, Regex, scikit-learn, NLTK
Achievement Highlights
- Preprocessed the scraped data using the markdown library and regex
- Learned about TF-IDF
Meetings attended
- 7/27 - Introduction to Web Scraping and this Week’s Deliverables
- 7/29 - Web Scraping Check-In Meeting and Intro to Preprocessing
- 7/31 - Web Scraping and Preprocessing Presentations
Goals for the Upcoming Week
- Complete TF-IDF
- Learn about BERT
Tasks Done
- Preprocessing: Removed HTML code, symbols, and repeated spaces from the text in the scraped data. Kept only the useful rows by dropping entries with null values (see the sketch below).
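A rough sketch of these cleaning steps, assuming the scraped text sits in a "body" column of a posts.csv export; the file and column names are placeholders.

```python
import re
import pandas as pd

def clean_text(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)          # strip HTML tags
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)   # remove symbols/punctuation
    text = re.sub(r"\s+", " ", text)              # collapse multiple spaces
    return text.strip()

df = pd.read_csv("posts.csv")
df = df.dropna(subset=["body"])                   # omit rows with null text
df["body"] = df["body"].astype(str).map(clean_text)
df.to_csv("posts_clean.csv", index=False)
```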
Week: 8/7
Overview of Things Learned:
- Technical Area: TF-IDF
- Tools: scikit-learn, PyTorch, the Transformers library
Achievement Highlights
- Removed stop words from the preprocessed data
- Used the TfidfVectorizer in scikit-learn to create the TF-IDF matrix
Meetings attended
- 8/12 - Check-in about implementing the BERT Model
- 8/14 - Present BERT Model implementations
Goals for the Upcoming Week
- Complete BERT model implementation
Tasks Done
- TF-IDF: Created the TF-IDF matrix for the post text after removing values that were not useful, including stop words (see the sketch after this list).
- BERT: Filtered noise out of the dataset, including emojis and hyperlinks.
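A minimal sketch of the TF-IDF and noise-filtering steps, assuming the cleaned text lives in a "body" column of posts_clean.csv; the file and column names, the ASCII-based emoji stripping, and the use of scikit-learn's built-in English stop-word list are illustrative assumptions (NLTK's stop-word list could be used instead).

```python
import re
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def strip_noise(text: str) -> str:
    text = re.sub(r"https?://\S+", " ", text)               # drop hyperlinks
    text = text.encode("ascii", errors="ignore").decode()   # crude emoji / non-ASCII removal
    return re.sub(r"\s+", " ", text).strip()

df = pd.read_csv("posts_clean.csv").dropna(subset=["body"])
docs = df["body"].astype(str).map(strip_noise)

# stop_words="english" applies scikit-learn's built-in stop-word list;
# nltk.corpus.stopwords.words("english") could be passed here instead.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(docs)               # sparse (n_posts x n_terms) matrix
print(tfidf_matrix.shape)
```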