Independently examined simple and multivariable regression, weights, bias, MSE, and gradient descent
Examined the weaknesses of one-hot encoding, regex, N-gram counting, and pretrained word vectors, as well as the implications of RNNs and LSTMs in the transition to attention-oriented systems
Strengthened understanding of Machine Learning fundamentals (through the webinars and NLP Basics series), such as deep learning, similarity, and attention models
Examined some of the applications of linear algebra and calculus in machine learning.
Studied and practiced web scraping, consulting the documentation of the specific libraries used
Studied how PyTorch stores embeddings as matrices through torch.nn.Embedding, and its application to N-gram language modeling and CBOW
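The regression fundamentals above (weights, bias, MSE, gradient descent) can be sketched in a few lines of NumPy. This is a minimal illustration, not project code; the data and hyperparameters are made up for the example.

```python
import numpy as np

# Fit y = w*x + b by minimizing MSE with batch gradient descent.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 0.5 + rng.normal(0, 0.05, size=100)  # true w=3.0, b=0.5

w, b = 0.0, 0.0  # weight and bias, initialized at zero
lr = 0.1         # learning rate
for _ in range(500):
    err = (w * x + b) - y
    # Gradients of MSE = mean(err**2) with respect to w and b
    grad_w = 2 * np.mean(err * x)
    grad_b = 2 * np.mean(err)
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # recovers values near w=3.0, b=0.5
```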
Tools
VS Code programming environment
BeautifulSoup4/Selenium/raw urllib for web scraping in Python
Git and GitHub Desktop for version control and organizing programs into repositories
Google Colab / Jupyter Notebooks for data visualization
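The BeautifulSoup4 parsing step above can be sketched as follows. This runs on an inline HTML snippet rather than a live page, and the `topic`/`title` class names are hypothetical; real selectors depend on the target site's markup.

```python
from bs4 import BeautifulSoup

# Inline stand-in for a fetched page (urllib/Selenium would supply this).
html = """
<html><body>
  <div class="topic"><a class="title" href="/t/1">Engine noise</a></div>
  <div class="topic"><a class="title" href="/t/2">Brake pads</a></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# CSS selectors pull out the hypothetical topic links.
titles = [a.get_text() for a in soup.select("a.title")]
links = [a["href"] for a in soup.select("a.title")]
print(titles)  # ['Engine noise', 'Brake pads']
```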
Soft Skills
Coordinated group progress across the STEM-Away forum, Slack, Trello, WhatsApp, and Google Forms, strengthening communication skills
Developed logistical organization skills by categorizing announcements and changes
Read official documentation and public discussion boards in order to effectively debug programs
Examined the ethics of web scraping and the robots.txt files of different webpages in order to understand what can and cannot be scraped
Achievements
Established Discord server to augment Slack communication with more intuitive voice channel accessibility
Integrated Discord with team communication and consolidated member feedback through Google Form check-in
Studied the webinars, NLP Basics, and other materials to solidify understanding of Machine Learning
Installed required packages via the terminal and from source pages
Scraped Discourse forums with Python
Practiced implementation of Git/GitHub
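On the Discourse-scraping achievement: Discourse forums generally expose JSON endpoints (e.g. /latest.json, /t/&lt;topic_id&gt;.json) alongside their HTML pages, which simplifies scraping considerably. Below is a sketch of parsing such a response; the payload is a trimmed, hypothetical example of the /latest.json shape, not a real capture.

```python
import json

# Hypothetical, trimmed example of a Discourse /latest.json response body.
payload = json.loads("""
{
  "topic_list": {
    "topics": [
      {"id": 1, "title": "Engine noise", "posts_count": 4},
      {"id": 2, "title": "Brake pads", "posts_count": 2}
    ]
  }
}
""")

# Extract topic titles from the nested structure.
topics = payload["topic_list"]["topics"]
titles = [t["title"] for t in topics]
print(titles)  # ['Engine noise', 'Brake pads']
```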
Outcome
Implemented a more intuitive workspace for team members, established a clear picture of each member's progress, and gained critical insight into machine learning theory and application
Developed project management skills through logistical and technical means by thoroughly addressing team member questions on Slack/WhatsApp and directing members to appropriate resources
Ran four different classification models on the data: Naive Bayes, Decision Tree, Linear Support Vector Machine, and Logistic Regression
Refined combinations of data-cleaning methodologies for the combined CSV of scraped data (testing five different strategies, e.g. lowercasing + removal of special symbols)
Tested different feature selections and observed their respective effects on model accuracy → moved forward with the second feature-selection strategy (author + topic title + leading comment + other comments + tags), as it produced the highest overall accuracy (with Logistic Regression performing best in all cases)
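The model comparison above can be sketched with scikit-learn. This is an illustrative skeleton on a tiny made-up dataset, not the project pipeline: it applies one of the cleaning strategies (lowercasing + removal of special symbols) and then fits the same four model types.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

def clean(text):
    """One cleaning strategy: lowercase, then strip special symbols."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())

# Toy documents and labels standing in for the scraped forum data.
docs = ["My ENGINE rattles!!", "Brake pads squeal?", "Oil change overdue...",
        "New tires installed :)", "Transmission slipping!", "Wiper blades replaced"]
labels = ["repair", "repair", "maintenance",
          "maintenance", "repair", "maintenance"]

X = CountVectorizer().fit_transform(clean(d) for d in docs)

models = {
    "naive_bayes": MultinomialNB(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "linear_svm": LinearSVC(),
    "logistic_regression": LogisticRegression(),
}
# Training accuracy only, for brevity; real evaluation needs held-out data.
scores = {name: m.fit(X, labels).score(X, labels) for name, m in models.items()}
print(scores)
```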
Tools
VS Code
pandas, NumPy, NLTK, scikit-learn, etc.
Git + GitHub
Jupyter Notebook (individual code cell testing with .ipynb files in VSCode)
Soft Skills
Monitored and managed group across Slack, Trello, and Google Forms for presentation preparation / technical troubleshooting
Achievements
Trained basic ML models and recorded a variance of accuracy depending on data cleaning methodology and model type
Determined the most appropriate model/strategy for moving forward with the project
Tasks Completed
Implemented four separate classification models on the cartalk dataset
Identified the best data-cleaning method for optimizing model accuracy
Calculated F1 score, recall, and precision
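The F1 / recall / precision calculation above maps directly onto sklearn.metrics. A minimal sketch with made-up labels:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical binary true labels and model predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(precision, recall, f1)  # 0.75 0.75 0.75 for this toy data
```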
Outcome
Streamlined project results and prepared data for further analysis
Augmented the initial four classifiers with an additional three → Random Forest, XGBoost, and LightGBM
Tested and recorded accuracy for the new models under each data-cleaning strategy / feature selection, with an emphasis on feature selection 2
Examined the implications/implementation of the BERT ML model
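Of the three added models, Random Forest can be sketched with scikit-learn alone; xgboost.XGBClassifier and lightgbm.LGBMClassifier follow the same fit/predict/score interface, so the snippet below extends to them directly (they are omitted here to keep the example dependency-free). Synthetic data stands in for the forum features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the project's feature matrix.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Same pattern would apply to XGBClassifier / LGBMClassifier.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_tr, y_tr)
acc = model.score(X_te, y_te)  # held-out accuracy
print(acc)
```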
Tools
VS Code
pandas, NumPy, NLTK, scikit-learn, etc.
Homebrew → libomp (OpenMP runtime required by XGBoost/LightGBM on macOS)
Git + GitHub
Jupyter Notebook
Soft Skills
Strengthened project coordination abilities across platforms like Google Slides for presentation purposes
Helped troubleshoot minor technical difficulties / library implementation issues
Strengthened skills for independently studying new concepts
Achievements
Studied and implemented more complex classification models; developed tentative methodologies for optimizing model runtime in a Jupyter coding environment
Familiarized myself with the concept of BERT
Studied implementations of the XLNet, XLM, RoBERTa, and DistilBERT models
Tasks Completed
Added an additional three classification models to the NLP project → observed their implications through analysis of their respective accuracies
Identified and resolved runtime inefficiencies introduced by the increased complexity of the new models
Outcome
Gained important insight into how the separate approaches (model / cleaning strategy / feature selection) differ with regard to accuracy and data analysis
Recognized the importance of balanced data as opposed to imbalanced data
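The balanced-vs-imbalanced point can be made concrete with a tiny pure-Python example: on imbalanced labels, a baseline that always predicts the majority class looks accurate while never finding the minority class. The numbers are illustrative.

```python
# 95% majority class (0), 5% minority class (1).
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # "always predict majority" baseline

# Overall accuracy looks strong despite the model being useless.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# Recall on the minority class exposes the failure.
minority_recall = sum(t == p == 1 for t, p in zip(y_true, y_pred)) / 5

print(accuracy, minority_recall)  # 0.95 accuracy, 0.0 minority recall
```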