Technical: I have had a good review of Python so far. I have learned how to use the BeautifulSoup and Pandas libraries.
Tools: I now have an understanding of Git and of uploading files to GitHub. I have been using Visual Studio Code for the first time to edit and debug my Python files and to upload them to GitHub. I have also become familiar with Jupyter Notebook.
- Successfully cloning a GitHub repository and uploading a Python file back to GitHub
- Accessing the article links that a web page leads to (basic web crawling)
- Accessing relevant data from the HTML of a web page (e.g., author, title)
Week 1 Team Meeting (1.5 Hours)
Week 2 Team Meeting (1 Hour)
Week 3 Team Meeting (1 Hour)
Intermediate Python Training Session (1.5 Hours)
Git Webinar (3 Hours)
- Make a word cloud using the articles from community.smartthings.com as data analysis practice
- Implement a parts of speech tagging script on our practice data to help with text classification
- Use the bag of words method to find the inverse document frequency of each word in our practice data and calculate word vectors
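A minimal sketch of the bag-of-words and inverse document frequency idea, using only the standard library. The tokenizer, the formula idf = log(N / df), and the sample sentences are assumptions made for illustration; they are not the team's actual script or data.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return text.lower().split()


def inverse_document_frequency(documents):
    """Return {word: idf} with idf = log(N / df), where df is the number
    of documents that contain the word at least once."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # set() so each word is counted once per document
        doc_freq.update(set(tokenize(doc)))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def tfidf_vector(document, idf):
    """Return {word: tf * idf} for one document, with tf as relative frequency."""
    tokens = tokenize(document)
    counts = Counter(tokens)
    total = len(tokens)
    return {word: (count / total) * idf.get(word, 0.0)
            for word, count in counts.items()}


# Hypothetical practice data, not taken from the scraped articles.
docs = [
    "hub offline after update",
    "sensor offline again",
    "hub update released",
]
idf = inverse_document_frequency(docs)
vector = tfidf_vector(docs[0], idf)
```

Words that appear in fewer documents (such as "after" here) receive a higher weight than words shared across documents (such as "hub"), which is what makes the resulting vectors useful for telling articles apart.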
This week, I performed web scraping and data cleaning on the community.smartthings.com site. One hurdle I faced was accessing the rest of a web page that uses infinite scrolling. The project leads helped by giving me the URL format of each “page”, which let me step through the infinite-scrolling page one request at a time.
Using our scripts, the team leads compiled about 3000 test cases to train our recommendation system later. I also faced a hurdle with uploading my script to GitHub, as I had not used it before. However, with some internet research and persistence, I successfully uploaded a Python file and a Jupyter Notebook file.
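The paging approach can be sketched roughly as follows. The base URL, the `page` query parameter, and the page range are placeholders; the real URL format came from the project leads and is not reproduced here.

```python
from urllib.parse import urlencode


def page_urls(base_url, first_page, last_page, page_param="page"):
    """Yield one URL per "page" of an infinite-scroll listing.

    base_url and page_param are hypothetical; substitute the format
    the site actually uses.
    """
    for page in range(first_page, last_page + 1):
        yield f"{base_url}?{urlencode({page_param: page})}"


# Placeholder values for illustration only.
urls = list(page_urls("https://example.com/latest", 0, 2))
```

Each generated URL would then be fetched and parsed (for example with BeautifulSoup) in the same way as a single ordinary page.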
Note: I would like to become a participant for this project.
Technical: BERT basics, TF-IDF analysis, Word embeddings
- Obtain a basic understanding of encoders in BERT
- Calculate Term Frequency and Inverse Document Frequency for various article categories
- Create Word2Vec embeddings that successfully show relationships between words
NLP Webinar: Part 1 (1 hour)
NLP Webinar: Part 2 (1 hour)
Week 4 Team Meeting (1.5 hours)
- Implement a full-scale BERT model to surpass the competency of Word2Vec
- Run tests to see which categories are best for classifying articles
This week, I performed two pre-processing techniques using the data our team collected from web scraping. The Term Frequency/Inverse Document Frequency technique was simpler, and because the entire team attempted this, I wanted to try word embeddings as well. I found the idea of representing words as vectors really interesting, and I had a lot of fun implementing this. We presented our findings during our weekly meeting, and I made some visual graphs that represented the similarity between words based on their embeddings.
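Similarity between word embeddings is commonly measured with cosine similarity. The sketch below assumes that measure and uses made-up three-dimensional vectors; real Word2Vec embeddings have many more dimensions, and the numbers here are not from our data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(word, embeddings, top_n=3):
    """Return the top_n (word, similarity) pairs closest to `word`."""
    target = embeddings[word]
    scores = [
        (other, cosine_similarity(target, vec))
        for other, vec in embeddings.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical toy vectors, chosen only to illustrate the calculation.
toy_embeddings = {
    "hub": [0.9, 0.1, 0.0],
    "bridge": [0.8, 0.2, 0.1],
    "battery": [0.0, 0.1, 0.9],
}
neighbours = most_similar("hub", toy_embeddings, top_n=2)
```

With these toy vectors, "bridge" ranks closer to "hub" than "battery" does, which is the kind of relationship the graphs were meant to show.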
I also did research and attended the NLP webinars to gain an understanding of BERT. It is a rather complex technique, but the webinars were a helpful introduction. I became familiar with the concept of encoding, which is important for BERT, and I presented that as well.
Technical: BERT, Training and Validation
Tools: BertForSequenceClassification, BertTokenizer
- Implement a Sequence Classification BERT Model
- Train three different versions of a BERT model on 3000 articles
- Receive understandable data from the validation set
Week 5 Team Meeting (2 hours)
- Refine the BERT model to improve its prediction accuracy
- Put together a final presentation with the team to reflect on our progress over five weeks
This week, I implemented three different versions of a BERT model in Python. All of them used sequence classification, but I used different categorization techniques to see if one was consistently more productive. Two versions binned the articles into four distinct categories, and one version used the original tag of each article (32 categories). None of these techniques, even when trained on 3000 articles, produced particularly interesting results: the model tended to sort the entire validation set into one or two categories.
During the week 5 meeting, I presented my findings to the group. Some other team members had a similar issue, where the model sorted articles into only one or two categories. I hope that this week we can help the model categorize more productively.
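One way to check whether a classifier has collapsed onto one or two categories is to count its predictions and compare its accuracy with a majority-class baseline. This is a sketch only: the function name and the label lists are hypothetical, and it works on plain lists of label IDs rather than on output from BertForSequenceClassification.

```python
from collections import Counter


def prediction_summary(true_labels, predicted_labels):
    """Summarize how predictions are spread across categories.

    Returns the count per predicted category, the number of distinct
    predicted categories, the accuracy, and the accuracy that would be
    reached by always guessing the most common true label.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")

    total = len(true_labels)
    predicted_counts = Counter(predicted_labels)
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    majority_count = Counter(true_labels).most_common(1)[0][1]

    return {
        "predicted_counts": dict(predicted_counts),
        "distinct_predicted": len(predicted_counts),
        "accuracy": correct / total,
        "majority_baseline": majority_count / total,
    }


# Hypothetical validation labels for a four-category setup.
true = [0, 1, 2, 3, 0, 1, 0, 2]
pred = [0, 0, 0, 0, 0, 0, 0, 1]
summary = prediction_summary(true, pred)
```

If the accuracy is no better than the majority baseline and only a few distinct categories are ever predicted, the model is probably not learning much beyond the most frequent label.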