Week: 7/27
Overview of Things Learned:
Achievement Highlights
- Scraped data from over 41,000 posts (The Commons forum)
- Utilized both raw HTML and JSON data collection
- Learned about cleaning data
- Learned debugging skills
Meetings attended
- 7/29 - Web Scraping Check-In Meeting and Intro to Preprocessing
- 7/31 - Web Scraping and Preprocessing Presentations
Goals for the Upcoming Week
- Continue learning about BERT and TF-IDF
Tasks Done
- Web scraping: Built a web crawler using Scrapy and used it to scrape data from JSON files. I originally had a raw HTML scraper using Selenium and BeautifulSoup, but preloading pages and then running the scraper took too much time. Consequently, I switched to Scrapy, as it is quicker and more efficient.
- Pre-processing: The crawler took a few hours to finish. Once all the data was collected, I stored it directly in a .csv file on my computer using pandas (pd), cleaned it up, and pushed the .csv file to the GitHub repo.
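The JSON-to-.csv step described above can be sketched roughly as follows. This is a minimal illustration, not the crawler from this log: the payload, the field names (`id`, `author`, `body`), and the helper names are all hypothetical, and the standard-library `csv` module stands in for pandas so the example has no dependencies.

```python
import csv
import io
import json

# Hypothetical JSON payload; the real forum's field names are not given in this log.
SAMPLE_JSON = """
{"posts": [
  {"id": 1, "author": "alice", "body": "First post"},
  {"id": 2, "author": "bob", "body": "Second post, with a comma"}
]}
"""

FIELDS = ["id", "author", "body"]  # assumed column names


def posts_to_rows(payload: str) -> list[dict]:
    """Parse a JSON string and keep only the assumed fields of each post."""
    data = json.loads(payload)
    return [{field: post.get(field, "") for field in FIELDS} for post in data["posts"]]


def rows_to_csv(rows: list[dict]) -> str:
    """Serialize rows to CSV text; the csv module quotes fields containing commas."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = posts_to_rows(SAMPLE_JSON)
csv_text = rows_to_csv(rows)
```

Reading structured JSON avoids rendering and parsing each HTML page, which is likely part of why the JSON route was faster than the Selenium approach.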
Week: 8/3
Overview of Things Learned:
Achievement Highlights
- Polished skills on how to find and remove certain HTML tags, whitespace, and punctuation through different libraries
- Obtained accurate results and stored that data in a .csv file
- Created a matrix with weighted values from TF-IDF
Meetings attended
- 8/5 - Preprocessing Check-In
- 8/7 - Presenting Preprocessing and TF-IDF
Goals for the Upcoming Week
- Polish TF-IDF
- Work on BERT
Tasks Done
- Pre-processing: My previous code had a lot of errors. I fixed them and added a space between words.
- TF-IDF: Completed the weighted TF-IDF matrix with scikit-learn.
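The cleaning and weighting steps above can be sketched in plain Python as follows. This is an assumption-laden illustration rather than the code used in this log: `clean_post` and `tfidf` are hypothetical helpers, the tokenizer is a simple whitespace split, and HTML entities are not handled. The weighting is meant to follow the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, with L2-normalized rows, which is the formula scikit-learn documents for its default TF-IDF settings.

```python
import math
import re
import string
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")
PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def clean_post(raw_html: str) -> str:
    """Strip tags and punctuation, lowercase, and collapse whitespace."""
    # Replace each tag with a space so "one</p><p>two" does not become "onetwo".
    text = TAG_RE.sub(" ", raw_html)
    text = text.translate(PUNCT_TABLE).lower()
    return " ".join(text.split())


def tfidf(docs: list[str]) -> tuple[list[str], list[list[float]]]:
    """Return (vocabulary, matrix) with smoothed IDF and L2-normalized rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        matrix.append([v / norm for v in row])
    return vocab, matrix


posts = ["<p>Hello,</p><p>world!</p>", "<div>Hello again</div>"]
cleaned = [clean_post(p) for p in posts]
vocab, matrix = tfidf(cleaned)
```

In this toy input, "hello" appears in both posts and so receives a lower weight than "world" or "again", which each appear in only one.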
Week: 8/10
Overview of Things Learned:
Achievement Highlights
- Learned about BERT, how it works, and its advantages
- Began training BERT model with data gathered from The Commons Forums
Goals for the Upcoming Week
- Learn more about BERT
- Continue working on the model and fix any mistakes
Tasks Done
- Data processing: I reprocessed my data from The Commons to work better with BERT and also reprocessed to reduce noise from the posts.
- BERT: I created a simple BERT model and tested small data sets. I am working on fine-tuning the model to reduce the percentage of error.
Week: 8/17
Overview of Things Learned:
Achievement Highlights
- Learned more about BERT with teammates
- Troubleshot and fixed a few problems, which made the BERT model more accurate
Goals for the Upcoming Week
- Complete BERT
- Work on AWS and Docker
Tasks Done
- Finished training BERT model and learned a lot about how it worked
Week: 8/24
Overview of Things Learned:
Achievement Highlights
- Tried to deploy the BERT model I built using AWS/Docker
- Examined performance of model
Goals for the Upcoming Week
- Continue to learn more about Machine Learning and programming in general after the internship!
Tasks Done
- Finished BERT and deployment using AWS/Docker
FINAL SELF-ASSESSMENT: Brief Summary of Things Learned
- Web Scraping: Learned how to gather data from forums using JSON files, Selenium, BeautifulSoup, and requests.
- Pre-Processing: Learned how to process large data sets and remove HTML tags, symbols, and other unwanted information using Python libraries.
- TF-IDF: Learned how to use scikit-learn and SciPy to create a weighted TF-IDF vector matrix using the processed data scraped from forums.
- BERT: I had a lot of problems as I was new to this but I managed to fix most of them. I learned a lot about BERT and PyTorch and managed to finish the model.
- Deployment: I learned how to use Docker, AWS, and Flask to deploy a simple machine learning application.