Mariamihaila - Machine Learning (Level 1) Pathway

Technical Area:

  • Learned about the different types of recommender systems and collaborative vs. content-based filtering
  • Learned three similarity measures: cosine similarity, the dot product, and Euclidean distance (see the sketch after this list).
  • Followed the tutorial to build a movie recommender system
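
A minimal NumPy sketch of the three similarity measures, using two made-up vectors:

```python
import numpy as np

# Two made-up embedding vectors for illustration
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 4.0])

# Dot product: large when the vectors point the same way AND are long
dot = np.dot(a, b)

# Cosine similarity: dot product normalized by length, always in [-1, 1]
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))

# Euclidean distance: straight-line distance between the vector tips
euclidean = np.linalg.norm(a - b)

print(dot, cosine, euclidean)  # 16.0  ~0.933  ~1.732
```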

From the NLP Basics Parts 1-3 webinars and some additional research on my own, I learned about:

  • The challenges of NLP, and the many ways subtle inflections or subtext can change a sentence's meaning.
  • Cox proportional hazards models, how to express them as hazard functions, and how to interpret hazard ratios
  • Some older language models and their limitations
    • One-hot encoding limitations: the columns of the resulting matrix are mutually orthogonal (every dot product is 0), so the vectors carry no similarity information
    • word2vec and GloVe limitations: a word gets the same vector regardless of context.
  • Vanilla neural network, recurrent neural network (RNN), and LSTM basics
  • Attention:
    • I learned the basic architecture of a transformer
    • I studied the linear algebra steps of the core attention model: compute compatibility between the query and the keys via a dot product, then normalize the scores through a softmax function to get the attention weights.
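
A small NumPy sketch of that attention step, with toy matrices; the scaling by the square root of the key dimension follows the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

# Toy inputs: 2 queries, 3 keys/values, embedding dimension 4
Q = np.random.randn(2, 4)
K = np.random.randn(3, 4)
V = np.random.randn(3, 4)

# 1. Compatibility between each query and each key via dot products
scores = Q @ K.T / np.sqrt(K.shape[-1])  # scaled as in the transformer paper

# 2. Normalize each row with softmax to get the attention weights
weights = softmax(scores, axis=-1)       # each row sums to 1

# 3. Output: attention-weighted sum of the values
output = weights @ V                     # shape (2, 4)
print(weights.sum(axis=-1))              # [1. 1.]
```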

Tools:

  • I studied the BeautifulSoup documentation and learned the fundamentals of parsing HTML text, what the important objects are, and how to use them (a short example follows this list).
  • I learned how to use Scrapy to build a simple web crawler
  • I studied the Selenium and Webdriver documentation
  • Set up Colab
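
A short example of those BeautifulSoup fundamentals, using a made-up HTML snippet:

```python
from bs4 import BeautifulSoup

# Made-up HTML snippet to parse
html = """
<html><body>
  <h1 class="title">Hello</h1>
  <a href="https://example.com">a link</a>
  <p>first</p><p>second</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Tag objects: access by name, read attributes and text
print(soup.h1.text)    # Hello
print(soup.a["href"])  # https://example.com

# find / find_all: search by tag name and attributes
for p in soup.find_all("p"):
    print(p.get_text())
print(soup.find("h1", class_="title"))
```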

Soft Skills:

  • I improved my ability to read technical papers, and to filter out the relevant information, by working through the “Attention Is All You Need” paper
  • I learned that it is sometimes better to focus on the big picture at first, before getting stuck in the nitty-gritty details. For example, it is best to first understand what the softmax function does before delving into the formula itself.
  • I’m still learning how to organize my time between studying theoretical concepts and applications.

Achievements

  • I now have a basic understanding of the fundamentals of NLP - what it is, the challenges, the solutions, and some of the linear algebra behind attention models.
  • I followed two tutorials: the movie recommendation system and the web crawler
  • I am getting better at using BeautifulSoup and Scrapy

Tasks Completed:

  • Installed all the necessary libraries
  • Watched the first four STEMCasts and all of the NLP Basics series
  • Followed the GitHub tutorial
  • Worked on training a sentiment analysis ML model (I still have a few bugs here and there, but nothing too serious).

Module 2

Technical Area:

  • Learned how to navigate a website's source code using Inspect Element, and how to identify different HTML tags
  • Learned how to scrape a website using Selenium, parse the HTML text with BeautifulSoup, and compile all of the data into a CSV file (sketched after this list)
  • Learned the basics of exploratory data analysis (EDA) using pandas, including basic feature extraction and text preprocessing.
  • Followed a PyTorch tutorial to learn the fundamentals and trained a neural network to classify images from a fashion dataset (condensed sketch below)
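
A condensed sketch of the scraping workflow; the URL, tag names, and classes are placeholders, and it assumes Selenium 4 with a Chrome driver available:

```python
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

# Assumes Selenium 4 with a Chrome driver on the PATH
driver = webdriver.Chrome()
driver.get("https://example.com/forum")  # placeholder URL
html = driver.page_source                # rendered HTML, after any JavaScript
driver.quit()

# Parse the rendered page; the tag and class names are hypothetical
soup = BeautifulSoup(html, "html.parser")
rows = []
for topic in soup.find_all("div", class_="topic"):
    rows.append({
        "title": topic.find("a").get_text(strip=True),
        "replies": topic.find("span", class_="replies").get_text(strip=True),
    })

# Compile everything into a CSV file with pandas
pd.DataFrame(rows).to_csv("forum_data.csv", index=False)
```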
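
And a condensed sketch of the PyTorch training loop, assuming the FashionMNIST dataset from the standard quickstart tutorial and illustrative hyperparameters:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

# FashionMNIST training data, downloaded on the first run
train_data = datasets.FashionMNIST(root="data", train=True, download=True,
                                   transform=ToTensor())
loader = DataLoader(train_data, batch_size=64, shuffle=True)

# A small fully connected classifier: 28x28 image -> 10 clothing classes
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 512), nn.ReLU(),
    nn.Linear(512, 10),
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for X, y in loader:      # one pass over the training data
    pred = model(X)
    loss = loss_fn(pred, y)
    optimizer.zero_grad()
    loss.backward()      # backpropagation
    optimizer.step()
```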

Tools:

  • Selenium + WebDriver
  • BeautifulSoup
  • pandas
  • PyTorch
  • GitHub
  • Colab: fixed an issue where none of my libraries would install by connecting Colab to a Jupyter notebook on my local machine, giving it access to my local file system

Soft Skills

  • Set up my LinkedIn account
  • Collaborated with group members on Discord and Trello to choose a forum to scrape (we chose the Tapas forum!)
  • Gained confidence and patience when using tools that I was initially unfamiliar with.

Achievements

  • Successfully created a branch and pushed my scraping and EDA code to the team GitHub using Git commands from the terminal.
  • After going through the tutorial for scraping the flowester forum, I was able to write my own scraper class for the Tapas forum - even though its source code was organized differently.
  • I found workarounds for problems, such as connecting Colab to my local machine

Tasks Completed

  • Scraped the Tapas forum (title, category_name, leading post, post replies, date and time created), and saved the data into a CSV file.
  • Did basic feature extraction and text preprocessing for the announcements category (although the same code would work for all the categories)
  • Pushed my scraper class and EDA code to GitHub

Week 3

Technical Area

  • Gained proficiency in web scraping using BeautifulSoup and Selenium

  • Studied the math behind TF-IDF, then implemented a short TF-IDF program using the NumPy library (first sketch after this list). After growing confident in the reasoning behind it, I switched over to the pre-built TfidfVectorizer from the scikit-learn library to calculate the TF-IDF scores for the forum data.

  • Learned about word embeddings, and how word vectors are calculated. I then converted some sample words into vectors using word2vec and practiced calculating the cosine similarity between them (second sketch below).

  • Practiced visualizing data via bigrams, trigrams, and word clouds (third sketch below).
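
First sketch - a minimal NumPy TF-IDF using the plain textbook formulas (TfidfVectorizer's defaults differ slightly, adding smoothing and L2 normalization):

```python
import numpy as np

# Toy corpus of tokenized documents
docs = [["the", "cat", "sat"],
        ["the", "dog", "sat"],
        ["the", "dog", "barked"]]

vocab = sorted({w for d in docs for w in d})
N = len(docs)

# Term frequency: count of term t in doc d, divided by doc length
tf = np.array([[d.count(t) / len(d) for t in vocab] for d in docs])

# Inverse document frequency: log(N / number of docs containing t)
df = np.array([sum(t in d for d in docs) for t in vocab])
idf = np.log(N / df)

tfidf = tf * idf
print(dict(zip(vocab, tfidf[0].round(3))))

# The pre-built equivalent (different defaults, so the numbers differ):
# from sklearn.feature_extraction.text import TfidfVectorizer
# X = TfidfVectorizer().fit_transform(["the cat sat", "the dog sat", "the dog barked"])
```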
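
Second sketch - the word2vec cosine-similarity exercise, assuming gensim's Word2Vec and a toy corpus:

```python
import numpy as np
from gensim.models import Word2Vec  # assuming gensim; a real corpus would be much larger

sentences = [["dog", "barks"], ["cat", "meows"], ["dog", "runs"], ["cat", "sleeps"]]
model = Word2Vec(sentences, vector_size=50, min_count=1, seed=1)

v1, v2 = model.wv["dog"], model.wv["cat"]

# Cosine similarity by hand...
cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# ...matches gensim's built-in similarity
print(cos, model.wv.similarity("dog", "cat"))
```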
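
Third sketch - the visualization step, assuming the nltk and wordcloud packages and placeholder text:

```python
import matplotlib.pyplot as plt
from collections import Counter
from nltk.util import ngrams
from wordcloud import WordCloud

text = "the dog barked at the dog next door"  # placeholder text
tokens = text.split()

# Most common bigrams (pairs of adjacent words)
print(Counter(ngrams(tokens, 2)).most_common(3))

# Word cloud: word size reflects frequency
cloud = WordCloud(background_color="white").generate(text)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```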

Tools

  • git & github: I made a private github repository to practice git commands from the terminal - how to create a branch, how to create a folder, and how to push code and submit a pull request.

  • scikit-learn - since this library was incompatible with the Apple M1 chip because of the new ARM64 architecture, I installed a version of conda called Miniforge and made a new virtual environment to install it successfully.

  • Colab - I learned how to connect to different virtual environments on my local machine

  • pandas - grew familiar with pd.Series objects and their associated functions

  • nltk

  • TextBlob

  • Word Cloud

  • Selenium

  • Beautiful Soup

  • matplotlib

  • Seaborn

Soft Skills

  • I started thoroughly documenting my code and writing tutorials inside Colab notebooks so that I can reference them later.

  • I participated more in the meetings, getting help with my questions related to scraping and TF-IDF

Achievements

  • I can see myself making a lot of progress - especially with Python, installing and managing libraries, and GitHub.

  • I can now quickly adapt to a new library’s features by searching through the code documentation and testing things out myself.

  • I finished writing my scraper class and stored the scraped data in a CSV file with 9,053 rows.

Tasks Completed

  • Added the ability to scrape num views and num replies to my scraper class

  • Stored scraped data for up to 1,000 topics per category in the Tapas forum into a CSV file.

  • Pushed the CSV file and the updated scraper class to the team GitHub

Week 4

Technical Area

  • Studied the different methods for numeric representation of text, and compared and contrasted their strengths and potential shortcomings.
  • Reviewed cosine similarity and its applications in machine learning. The cosine similarity between two word vectors measures the angle between them (rather than the Euclidean distance), which tells you how similar the two words are independent of vector length.
  • Learned more regular expressions in order to remove any links, HTML tags, or images from my data (see the sketch after this list).
  • Read about some different machine learning classification models
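
A sketch of that regex cleaning step, with illustrative patterns:

```python
import re

text = 'Check <a href="https://example.com">this</a> out <img src="pic.png"/> https://example.com/page'

# Strip HTML tags (including <img> image tags), then bare links
text = re.sub(r"<[^>]+>", " ", text)           # any HTML tag
text = re.sub(r"http\S+|www\.\S+", " ", text)  # URLs
text = re.sub(r"\s+", " ", text).strip()       # collapse leftover whitespace

print(text)  # Check this out
```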

Tools

  • Pandas
  • Colab
  • Git/Github
  • NumPy
  • scikit-learn

Soft Skills

  • Maintained the good habit of thorough code documentation and writing mini tutorials for myself.
  • Made sure that I understood a concept before implementing it in code.
  • Started teaching my nine-year-old sister about neural networks, and we're thinking of writing a picture book about AI.

Achievements

  • Cleaned and organized my EDA, added more visualizations, and pushed the updated code to the team GitHub.
  • Thoroughly preprocessed my data.

Tasks Completed

  • Data cleaning - removed all HTML, links, and images from the forum data
  • Converted the forum data into a vector representation using the Bag of Words model (sketched below)
  • Computed the cosine similarity matrix
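
A condensed sketch of these last two steps, using scikit-learn's CountVectorizer and placeholder posts in place of the cleaned forum data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder posts standing in for the cleaned forum data
posts = ["new comic announcement", "comic update schedule", "site maintenance tonight"]

# Bag of Words: each row counts how often each vocabulary word appears
bow = CountVectorizer().fit_transform(posts)

# Pairwise cosine similarity between every pair of posts
sim = cosine_similarity(bow)
print(sim.round(2))  # 3x3 matrix; the diagonal is 1.0
```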