Alexc - Machine Learning Pathway

What I Learned:
Technical Area: Web Scraping with Python, Data Cleaning, EDA, Git Commands
Tools: Jupyter Notebook, GitHub, BeautifulSoup, pandas
Soft Skills: Teamwork, Communication, Time Management, Research

Achievement Highlights:

  1. Completed my first web scrape following a provided resource
  2. Learned how to use Jupyter Notebook and other tools to perform data cleaning and EDA
  3. Performed my first Git push to a collaborative repository

Meetings/Training Attended:

  1. All team meetings
  2. Viewed ML Overview and Data Mining StemCasts
  3. Git Webinar

Goals for upcoming week:
Research and complete text analysis, then communicate with my group to provide an update to the rest of the team.

Detailed Statements of Tasks Done:

  1. I used experience from my first web scrape, along with resource links provided by the leads, to gather data from a SmartThings post using Jupyter Notebook. I managed to obtain the usernames and messages of everyone who replied to the post without too much trouble (a sketch of this step appears after this list).

  2. I cleaned the data I obtained in the first task by removing HTML, punctuation, and stop words while keeping URLs intact. I was able to accomplish this thanks to more resources provided by my team lead and by cooperating with another teammate (see the second sketch below).

  3. Finally, for EDA I created a bar graph of how frequently each person replied to the post from the previous tasks, as well as another graph of the most used words in the post. Once again, I was able to overcome any struggles thanks to resources provided by my leads (see the third sketch below).
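
To make the scraping step concrete, here is a minimal sketch of the approach from task 1. The URL and the CSS selectors are placeholders, not the actual SmartThings markup, so they would need to be adjusted to the real thread:

```python
# Minimal scraping sketch: pull usernames and message bodies from a forum
# thread. The URL and selectors below are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

URL = "https://community.smartthings.com/t/example-thread"  # placeholder

response = requests.get(URL)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

replies = []
for post in soup.find_all("div", class_="post"):  # selector is an assumption
    username = post.find("span", class_="username")
    message = post.find("div", class_="cooked")
    if username and message:
        replies.append({"username": username.get_text(strip=True),
                        "message": message.get_text(" ", strip=True)})

print(f"Scraped {len(replies)} replies")
```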
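
Task 2's cleaning pass could look roughly like the following. The NLTK stop-word list is an assumption (the report does not say which list was used); the key trick is setting URLs aside before stripping punctuation so they survive intact:

```python
# Cleaning sketch: strip HTML and punctuation and drop stop words,
# pulling URLs out first so they are kept intact.
import re
import string
from bs4 import BeautifulSoup
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

STOP_WORDS = set(stopwords.words("english"))
URL_PATTERN = re.compile(r"https?://\S+")

def clean_message(raw_html):
    text = BeautifulSoup(raw_html, "html.parser").get_text(" ")
    urls = URL_PATTERN.findall(text)          # set URLs aside first
    text = URL_PATTERN.sub(" ", text)
    text = text.translate(str.maketrans("", "", string.punctuation)).lower()
    tokens = [w for w in text.split() if w not in STOP_WORDS]
    return tokens + urls                      # reattach the untouched URLs
```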
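
And a rough sketch of the two EDA charts from task 3, reusing the hypothetical `replies` and `clean_message` names from the sketches above:

```python
# EDA sketch: bar charts of replies per user and of the most frequent words.
from collections import Counter
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(replies)  # "replies" from the scraping sketch above

# How often each person replied in the post
df["username"].value_counts().plot(kind="bar", title="Replies per user")
plt.tight_layout()
plt.show()

# Most used words across all cleaned messages
word_counts = Counter(w for msg in df["message"] for w in clean_message(msg))
pd.Series(dict(word_counts.most_common(15))).plot(kind="bar", title="Top words")
plt.tight_layout()
plt.show()
```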

Additionally, I would like to upgrade from an observer to a participant.

What I Learned:
Technical Area: Text Preprocessing and Topic Modeling
Tools: LDA (from the Gensim package) and pyLDAvis
Soft Skills: Teamwork, Communication, Flexibility

Achievement Highlights:

  1. Created a TF-IDF Matrix
  2. Created a Visual Representation of my Topic Modeling using pyLDAvis
  3. Worked with a group to present our newly gained knowledge

Meetings/Training Attended:

  1. All Team Meetings
  2. BERT Lecture by leads

Goals for upcoming week:

  1. Use BERT to get the context of our team’s data
  2. Evaluate different data processing models to identify the best one for our data
  3. Communicate with my group to develop a presentation that will inform the rest of the team of our achievements for the week

Detailed Statements of Tasks Done:

  1. I used resources provided by my leads, as well as some research of my own, to create a TF-IDF matrix for the data my team prepared during the first few weeks of the internship. I first tokenized the data into lists, then used for loops to calculate the frequency of each term in its respective post as well as its total frequency across all the posts we scraped (see the first sketch after this list).

  2. Since most people chose to create a TF-IDF matrix, I decided to try some topic modeling. After some research, I settled on a model called Latent Dirichlet Allocation from the Gensim package to develop topics out of the words in the posts. I then used pyLDAvis, a companion visualization library, to create an interactive visual representation of my newly created topics (see the second sketch below).

  3. Finally, I worked on creating a presentation with my group within our team. I decided to share my work on topic modeling, since only one other member had done any. I pasted my work into our slideshow and presented my week's progress to the rest of the team.
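
A loop-based TF-IDF along the lines of task 1 might look like this; the three-document corpus is toy stand-in data, not the team's actual posts:

```python
# Hand-rolled TF-IDF sketch matching the loop-based approach described above.
# "documents" stands in for the cleaned, tokenized posts from earlier weeks.
import math
from collections import Counter

documents = [["smart", "bulb", "flicker"],
             ["hub", "offline", "bulb"],
             ["hub", "firmware", "update"]]  # toy stand-in data

n_docs = len(documents)
doc_freq = Counter()
for tokens in documents:
    doc_freq.update(set(tokens))          # count docs containing each term

tfidf = []
for tokens in documents:
    counts = Counter(tokens)
    # TF = term count / doc length; IDF = log(N / docs containing the term)
    row = {term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
           for term, count in counts.items()}
    tfidf.append(row)
```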
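
For task 2, a minimal Gensim LDA plus pyLDAvis sketch, reusing the toy `documents` from the sketch above. The topic count and training parameters are assumptions, and note that in recent pyLDAvis releases the Gensim adapter lives in `pyLDAvis.gensim_models` (older versions used `pyLDAvis.gensim`):

```python
# Topic-modeling sketch: Gensim LDA plus an interactive pyLDAvis view.
from gensim import corpora
from gensim.models import LdaModel
import pyLDAvis
import pyLDAvis.gensim_models

dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(tokens) for tokens in documents]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=3,
               passes=10, random_state=42)  # num_topics is an assumption

vis = pyLDAvis.gensim_models.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(vis, "lda_topics.html")  # interactive topic map
```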

What I Learned:
Technical Area: BERT, RandomForestClassifier
Tools: BertModel, BertTokenizer, Text Classifiers

Achievement Highlights:

  1. Researched and learned about BERT
  2. Trained on 3,000 posts by implementing BERT
  3. Predicted tags using text classifiers

Meetings/Training Attended:

  1. Team Meeting

Goals for upcoming week:

  1. Refine code to improve accuracy of text classifiers
  2. Start a personal project using what I learned from this internship

Detailed Statements of Tasks Done:

  1. I tokenized the text of posts from certain columns of a data frame and tried to use BERT to train on the data. However, I ran into a problem: some of the posts were longer than 512 tokens, which exceeds BERT's maximum input length. After talking and working with another teammate, I was able to split the larger posts into digestible chunks and then utilize BERT (see the first sketch below).

  2. After turning the text into numerical values, I used a RandomForestClassifier to create predictions for the tags that applied to the posts I trained on. This process took 4-5 hours and unfortunately ended up with around 34% accuracy (see the second sketch below).

  3. During our next team meeting, I shared the tools I used and the results of my methods. There I learned that I was not the only one who struggled to achieve an ideal accuracy.
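
A sketch of the chunking workaround from task 1, using Hugging Face's `transformers`. The model name and the average-over-chunks pooling are assumptions about details the report leaves open:

```python
# Chunking sketch: split long posts into <=512-token windows, embed each
# chunk with BERT, and average the chunk vectors into one post embedding.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def embed_post(text, max_len=512):
    ids = tokenizer.encode(text, add_special_tokens=False)
    window = max_len - 2                      # leave room for [CLS] and [SEP]
    chunks = [ids[i:i + window] for i in range(0, len(ids), window)]
    vectors = []
    with torch.no_grad():
        for chunk in chunks:
            input_ids = torch.tensor(
                [tokenizer.build_inputs_with_special_tokens(chunk)])
            out = model(input_ids)
            vectors.append(out.last_hidden_state[:, 0, :])  # [CLS] vector
    return torch.cat(vectors).mean(dim=0)    # average over the chunks
```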
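
And a sketch of task 2's classification step. Here `posts` and `tags` are hypothetical stand-ins for the team's data, and tags are treated as single-label for simplicity (the real posts may carry multiple tags):

```python
# Classification sketch: BERT post embeddings as features for a random forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# X: one embedding per post (via embed_post above); y: each post's tag.
# "posts" and "tags" are assumed to exist as parallel lists.
X = np.stack([embed_post(text).numpy() for text in posts])
y = np.array(tags)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```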