Mutaz_ahmed - Machine Learning Pathway

Weeks 1 + 2
Concise Overview of Things Learned:

Technical Area:

  • Count Vectorizer
  • TF-IDF
  • Introduction to LSTM and RNN Models

Tools:

  • sklearn API
  • PyTorch

Soft Skills:

  • Presentation over Digital Medium

Three Achievement Highlights:

  1. Used CountVectorizer and Logistic Regression on the Amazon dataset (sketched below).
  2. Built a TF-IDF & Logistic Regression model for the Amazon dataset, which yielded an accuracy of 74%.
  3. Presented the CountVectorizer model to my team and presented Team Amazon's findings across all the models we ran.
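
Below is a minimal sketch of the CountVectorizer + Logistic Regression setup from highlight 1; the file name and the "review_text"/"label" column names are placeholders assumed for illustration, not the actual Amazon dataset schema.

```python
# Minimal sketch: bag-of-words counts + logistic regression (placeholder path and columns).
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("amazon_reviews.csv")  # placeholder path, not the real file
X_train, X_test, y_train, y_test = train_test_split(
    df["review_text"], df["label"], test_size=0.2, random_state=42
)

vectorizer = CountVectorizer(stop_words="english")  # raw term counts per document
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_vec, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test_vec)))
```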

Meetings Attended:

  • Updates on Webscraping 6/8
  • Data Cleaning + Last-Minute Webscraping Problems 6/9
  • Review and Refine our Data for ML 6/12
  • Embedding Techniques + ML Algorithms 6/16
  • Embedding Techniques + ML Algorithms 6/18
  • Week 5 Planning + Team Building 6/19
  • Presentation of Week 3 Teams' Work 6/22
  • Discuss Data Merging and BERT vs. TF-IDF 6/24

Tasks Completed:

  • I learned about the CountVectorizer bag-of-words representation and built some initial models to see how it works.
  • I then created a presentation and walked my team through what I learned, most notably that CountVectorizer is essentially a simpler version of TF-IDF (raw term counts without the inverse-document-frequency weighting), and that TF-IDF should generally be preferred over it.
  • One hurdle I faced this week was a heavy load of coursework, as one of my summer classes was ending. I was originally scheduled to present my word embedding work on Monday 6/15, but thankfully Rohit and Sara allowed me to present on 6/16 instead.
  • I was also able to research and learn more about LSTM and BERT models, which work differently from the models we have discussed so far (a minimal LSTM sketch follows this list). I look forward to working with these in the next portion of the project.
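
As a companion to the LSTM reading above, here is a minimal PyTorch skeleton of an LSTM text classifier; the vocabulary size, dimensions, and number of classes are arbitrary assumptions for illustration, not values from our project.

```python
# Minimal LSTM text-classifier skeleton in PyTorch (all sizes are placeholders).
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size=10_000, embed_dim=100, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)   # hidden: (1, batch, hidden_dim)
        return self.fc(hidden[-1])             # class logits from the final hidden state

# Example forward pass on a dummy batch of 4 sequences of length 20
model = LSTMClassifier()
dummy_batch = torch.randint(0, 10_000, (4, 20))
print(model(dummy_batch).shape)  # torch.Size([4, 2])
```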

Goals for Next Week:

  • Work on the TF-IDF and Logistic Regression model on the combined dataset from the Flowster and Amazon teams.
  • Also play around with BERT models to learn how this method differs from the previous ones; a rough sketch of what that might look like follows this list.
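
As a rough starting point for the BERT goal above, the sketch below shows one way BERT sentence embeddings could stand in for TF-IDF vectors ahead of a Logistic Regression classifier; it assumes the Hugging Face transformers library and the bert-base-uncased checkpoint, neither of which is part of the toolset listed in this report, and the example texts are made up.

```python
# Sketch (assumed approach): dense BERT [CLS] embeddings instead of sparse TF-IDF vectors.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    """Return one fixed-length [CLS] vector per input text."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state[:, 0, :].numpy()

# Placeholder examples only; the real data would come from the merged dataset.
train_texts = ["great product, works as advertised", "arrived broken and support never replied"]
train_labels = [1, 0]
clf = LogisticRegression(max_iter=1000).fit(embed(train_texts), train_labels)
```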

Concise Overview of Things Learned:

Technical Area:

  • TF-IDF
  • Logistic Regression
  • Hyperparameter Tuning

Tools:

  • sklearn API
  • GridSearchCV

Soft Skills:

  • Presentation over Digital Medium

Three Achievement Highlights:

  1. Built a TF-IDF & Logistic Regression model for the non-augmented merged dataset.
  2. Built a TF-IDF & Logistic Regression model for the augmented merged dataset.
  3. Tuned the C and max_iter hyperparameters using GridSearchCV from scikit-learn (see the sketch after this list).
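
The sketch below shows roughly how the TF-IDF + Logistic Regression pipeline and the GridSearchCV tuning over C and max_iter fit together; the grid values and variable names are illustrative assumptions, not the settings actually used on the merged dataset.

```python
# Sketch: TF-IDF + Logistic Regression pipeline tuned with GridSearchCV (illustrative grid).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression()),
])

param_grid = {
    "clf__C": [0.01, 0.1, 1, 10],        # regularization strength
    "clf__max_iter": [100, 500, 1000],   # solver iteration cap
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
# search.fit(train_texts, train_labels)  # train_texts/train_labels: placeholders for the merged dataset
# print(search.best_params_, search.best_score_)
```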

Meetings Attended:

  • Training the models results 6/29
  • Updates with Evaluating the Models 7/01
  • Preparing the Results Presentation 7/06

Tasks Completed:

  • Built TF-IDF & Logistic Regression classifiers that ran on the merged dataset and the augmented dataset.
  • For both models, I tuned the C and max_iter parameters of logistic regression to try to optimize the results.
  • I also contributed to the final presentation with the parts I completed for the project.
  • One hurdle I overcame was that hyperparameter tuning took a very long time to run; depending on how much I varied the parameters, training could take hours. The Google Colab runtime kept disconnecting due to inactivity before the model finished training. To solve this, I found a script online that, when pasted into the browser console, imitates a user clicking on the page every couple of minutes, which allowed the program to keep running.