Mariamihaila - Machine Learning (Level 1) Pathway

Technical Area:

  • Learned about the different types of recommender systems and collaborative vs. content-based filtering
  • Learned three similarity measures: cosine similarity, the dot product, and Euclidean distance (see the sketch after this list).
  • Followed the tutorial to build a movie recommender system
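
A minimal NumPy sketch of the three similarity measures, using two made-up vectors:

```python
import numpy as np

# Two made-up embedding vectors for illustration
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 4.0])

# Dot product: large when the vectors point the same way AND are long
dot = np.dot(a, b)

# Cosine similarity: dot product normalized by length, always in [-1, 1]
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))

# Euclidean distance: straight-line distance between the vector tips
euclidean = np.linalg.norm(a - b)

print(dot, cosine, euclidean)  # 16.0  ~0.933  ~1.732
```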

From the NLP Basics Parts 1-3 webinars and some additional research on my own, I learned about:

  • The challenges of NLP, and the many ways subtle inflections or subtext can change a sentence's meaning.
  • Cox proportional hazards models, how to express them as hazard functions, and how to interpret hazard ratios
  • Some older language models and their limitations
    • One-hot encoding limitations: the columns of the resulting matrix are mutually orthogonal (every dot product is 0), so the vectors carry no similarity information
    • word2vec and GloVe limitations: a word gets the same vector regardless of context.
  • Vanilla neural network, recurrent neural network (RNN), and LSTM basics
  • Attention:
    • I learned the basic architecture of a transformer
    • I studied the linear algebra steps of the core attention model: compute compatibility between the query and the keys via a dot product, then normalize the scores through a softmax function to get the attention weights.
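
A small NumPy sketch of that attention step, with toy matrices; the scaling by the square root of the key dimension follows the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

# Toy inputs: 2 queries, 3 keys/values, embedding dimension 4
Q = np.random.randn(2, 4)
K = np.random.randn(3, 4)
V = np.random.randn(3, 4)

# 1. Compatibility between each query and each key via dot products
scores = Q @ K.T / np.sqrt(K.shape[-1])  # scaled as in the transformer paper

# 2. Normalize each row with softmax to get the attention weights
weights = softmax(scores, axis=-1)       # each row sums to 1

# 3. Output: attention-weighted sum of the values
output = weights @ V                     # shape (2, 4)
print(weights.sum(axis=-1))              # [1. 1.]
```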

Tools:

  • I studied the BeautifulSoup documentation and learned the fundamentals of parsing HTML text, what the important objects are, and how to use them (a short example follows this list).
  • I learned how to use Scrapy to build a simple web crawler
  • I studied the Selenium and Webdriver documentation
  • Set up Colab
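
A short example of those BeautifulSoup fundamentals, using a made-up HTML snippet:

```python
from bs4 import BeautifulSoup

# Made-up HTML snippet to parse
html = """
<html><body>
  <h1 class="title">Hello</h1>
  <a href="https://example.com">a link</a>
  <p>first</p><p>second</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Tag objects: access by name, read attributes and text
print(soup.h1.text)    # Hello
print(soup.a["href"])  # https://example.com

# find / find_all: search by tag name and attributes
for p in soup.find_all("p"):
    print(p.get_text())
print(soup.find("h1", class_="title"))
```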

Soft Skills:

  • I improved my ability to read technical papers, and to filter out the relevant information, by working through the “Attention Is All You Need” paper
  • I learned that it is sometimes better to focus on the big picture at first, before getting stuck in the nitty-gritty details. For example, it is best to first understand what the softmax function does before delving into the formula itself.
  • I’m still learning how to organize my time between studying theoretical concepts and applications.

Achievements

  • I now have a basic understanding of the fundamentals of NLP - what it is, the challenges, the solutions, and some of the linear algebra behind attention models.
  • I followed two tutorials: the movie recommendation system and the web crawler
  • I am getting better at using BeautifulSoup and Scrapy

Tasks Completed:

  • Installed all the necessary libraries
  • Watched the first four STEMCasts and all of the NLP Basics series
  • Followed the GitHub tutorial
  • Worked on training a sentiment analysis ML model (I still have a few bugs here and there, but nothing too serious).

Module 2

Technical Area:

  • Learned how to navigate a website's source code using Inspect Element, and how to identify different HTML tags
  • Learned how to scrape a website using Selenium, parse the HTML text with BeautifulSoup, and compile all of the data into a CSV file (sketched after this list)
  • Learned the basics of exploratory data analysis (EDA) using pandas, including basic feature extraction and text preprocessing.
  • Followed a PyTorch tutorial to learn the fundamentals and trained a neural network to classify images from a fashion dataset (condensed sketch below)
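
A condensed sketch of the scraping workflow; the URL, tag names, and classes are placeholders, and it assumes Selenium 4 with a Chrome driver available:

```python
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

# Assumes Selenium 4 with a Chrome driver on the PATH
driver = webdriver.Chrome()
driver.get("https://example.com/forum")  # placeholder URL
html = driver.page_source                # rendered HTML, after any JavaScript
driver.quit()

# Parse the rendered page; the tag and class names are hypothetical
soup = BeautifulSoup(html, "html.parser")
rows = []
for topic in soup.find_all("div", class_="topic"):
    rows.append({
        "title": topic.find("a").get_text(strip=True),
        "replies": topic.find("span", class_="replies").get_text(strip=True),
    })

# Compile everything into a CSV file with pandas
pd.DataFrame(rows).to_csv("forum_data.csv", index=False)
```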
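
And a condensed sketch of the PyTorch training loop, assuming the FashionMNIST dataset from the standard quickstart tutorial and illustrative hyperparameters:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

# FashionMNIST training data, downloaded on the first run
train_data = datasets.FashionMNIST(root="data", train=True, download=True,
                                   transform=ToTensor())
loader = DataLoader(train_data, batch_size=64, shuffle=True)

# A small fully connected classifier: 28x28 image -> 10 clothing classes
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 512), nn.ReLU(),
    nn.Linear(512, 10),
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for X, y in loader:      # one pass over the training data
    pred = model(X)
    loss = loss_fn(pred, y)
    optimizer.zero_grad()
    loss.backward()      # backpropagation
    optimizer.step()
```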

Tools:

  • Selenium + WebDriver
  • BeautifulSoup
  • pandas
  • PyTorch
  • GitHub
  • Colab: fixed an issue where none of my libraries would install by connecting Colab to a Jupyter notebook on my local machine, giving it access to my local file system

Soft Skills

  • Set up my LinkedIn account
  • Collaborated with group members on Discord and Trello to choose a forum to scrape (we chose the Tapas forum!)
  • Gained confidence and patience when using tools that I was initially unfamiliar with.

Achievements

  • Successfully created a branch and pushed my scraping and EDA code to the team GitHub using Git commands from the terminal.
  • After going through the tutorial for scraping the flowester forum, I was able to write my own scraper class for the Tapas forum - even though its source code was organized differently.
  • I found workarounds for problems, such as connecting Colab to my local machine

Tasks Completed

  • Scraped the Tapas forum (title, category_name, leading post, post replies, date and time created), and saved the data into a CSV file.
  • Did basic feature extraction and text preprocessing for the announcements category (although the same code would work for all the categories)
  • Pushed my scraper class and EDA code to GitHub

Week 3

Technical Area

  • Gained proficiency in web scraping using BeautifulSoup and Selenium

  • Studied the math behind TF-IDF, then implemented a short TF-IDF program using the NumPy library (first sketch after this list). After growing confident in the reasoning behind it, I switched over to the pre-built TfidfVectorizer from the scikit-learn library to calculate the TF-IDF scores for the forum data.

  • Learned about word embeddings, and how word vectors are calculated. I then converted some sample words into vectors using word2vec and practiced calculating the cosine similarity between them (second sketch below).

  • Practiced visualizing data via bigrams, trigrams, and word clouds (third sketch below).
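
First sketch - a minimal NumPy TF-IDF using the plain textbook formulas (TfidfVectorizer's defaults differ slightly, adding smoothing and L2 normalization):

```python
import numpy as np

# Toy corpus of tokenized documents
docs = [["the", "cat", "sat"],
        ["the", "dog", "sat"],
        ["the", "dog", "barked"]]

vocab = sorted({w for d in docs for w in d})
N = len(docs)

# Term frequency: count of term t in doc d, divided by doc length
tf = np.array([[d.count(t) / len(d) for t in vocab] for d in docs])

# Inverse document frequency: log(N / number of docs containing t)
df = np.array([sum(t in d for d in docs) for t in vocab])
idf = np.log(N / df)

tfidf = tf * idf
print(dict(zip(vocab, tfidf[0].round(3))))

# The pre-built equivalent (different defaults, so the numbers differ):
# from sklearn.feature_extraction.text import TfidfVectorizer
# X = TfidfVectorizer().fit_transform(["the cat sat", "the dog sat", "the dog barked"])
```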
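
Second sketch - the word2vec cosine-similarity exercise, assuming gensim's Word2Vec and a toy corpus:

```python
import numpy as np
from gensim.models import Word2Vec  # assuming gensim; a real corpus would be much larger

sentences = [["dog", "barks"], ["cat", "meows"], ["dog", "runs"], ["cat", "sleeps"]]
model = Word2Vec(sentences, vector_size=50, min_count=1, seed=1)

v1, v2 = model.wv["dog"], model.wv["cat"]

# Cosine similarity by hand...
cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# ...matches gensim's built-in similarity
print(cos, model.wv.similarity("dog", "cat"))
```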
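
Third sketch - the visualization step, assuming the nltk and wordcloud packages and placeholder text:

```python
import matplotlib.pyplot as plt
from collections import Counter
from nltk.util import ngrams
from wordcloud import WordCloud

text = "the dog barked at the dog next door"  # placeholder text
tokens = text.split()

# Most common bigrams (pairs of adjacent words)
print(Counter(ngrams(tokens, 2)).most_common(3))

# Word cloud: word size reflects frequency
cloud = WordCloud(background_color="white").generate(text)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```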

Tools

  • git & github: I made a private github repository to practice git commands from the terminal - how to create a branch, how to create a folder, and how to push code and submit a pull request.

  • scikit-learn - since this library was incompatible with the Apple M1 chip because of the new ARM64 architecture, I installed a version of conda called Miniforge and made a new virtual environment to install it successfully.

  • Colab - I learned how to connect to different virtual environments on my local machine

  • pandas - grew familiar with pd.Series objects and their associated functions

  • nltk

  • TextBlob

  • Word Cloud

  • Selenium

  • Beautiful Soup

  • matplotlib

  • Seaborn

Soft Skills

  • I started thoroughly documenting my code and writing tutorials inside Colab notebooks so that I can reference them later.

  • I participated more in the meetings, getting help with my questions related to scraping and TF-IDF

Achievements

  • I can see myself making a lot of progress - especially with Python, installing and managing libraries, and GitHub.

  • I can now quickly adapt to a new library’s features by searching through the code documentation and testing things out myself.

  • I finished writing my scraper class and stored the scraped data in a CSV file with 9,053 rows.

Tasks Completed

  • Added the ability to scrape num views and num replies to my scraper class

  • Stored scraped data for up to 1,000 topics per category in the Tapas forum into a CSV file.

  • Pushed the CSV file and the updated scraper class to the team GitHub

Week 4

Technical Area

  • Studied the different methods for numeric representation of text, and compared and contrasted their strengths and potential shortcomings.
  • Reviewed cosine similarity and its applications in machine learning. The cosine similarity between two word vectors measures the angle between them (rather than the Euclidean distance), which tells you how similar the two words are independent of vector length.
  • Learned more regular expressions in order to remove any links, HTML tags, or images from my data (see the sketch after this list).
  • Read about some different machine learning classification models
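
A sketch of that regex cleaning step, with illustrative patterns:

```python
import re

text = 'Check <a href="https://example.com">this</a> out <img src="pic.png"/> https://example.com/page'

# Strip HTML tags (including <img> image tags), then bare links
text = re.sub(r"<[^>]+>", " ", text)           # any HTML tag
text = re.sub(r"http\S+|www\.\S+", " ", text)  # URLs
text = re.sub(r"\s+", " ", text).strip()       # collapse leftover whitespace

print(text)  # Check this out
```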

Tools

  • Pandas
  • Colab
  • Git/Github
  • NumPy
  • scikit-learn

Soft Skills

  • Maintained the good habit of thorough code documentation and writing mini tutorials for myself.
  • Made sure that I understood a concept before implementing it in code.
  • Started teaching my nine-year-old sister about neural networks, and we're thinking of writing a picture book about AI.

Achievements

  • Cleaned and organized my EDA, added more visualizations, and pushed the updated code to the team GitHub.
  • Thoroughly preprocessed my data.

Tasks Completed

  • Data cleaning - removed all HTML, links, and images from the forum data
  • Converted the forum data into a vector representation using the Bag of Words model (sketched below)
  • Computed the cosine similarity matrix
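
A condensed sketch of these last two steps, using scikit-learn's CountVectorizer and placeholder posts in place of the cleaned forum data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder posts standing in for the cleaned forum data
posts = ["new comic announcement", "comic update schedule", "site maintenance tonight"]

# Bag of Words: each row counts how often each vocabulary word appears
bow = CountVectorizer().fit_transform(posts)

# Pairwise cosine similarity between every pair of posts
sim = cosine_similarity(bow)
print(sim.round(2))  # 3x3 matrix; the diagonal is 1.0
```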