Vishnupriya_Kanuri - Machine Learning (Level 1) Pathway

Technical Areas worked on :

  1. Started scraping Tapas forum using Beautiful Soup. Installed Selenium and the webdriver for further scraping.
  2. Refreshed workflow knowledge of Git & Github environments.
  3. Worked with Trello to streamline project flow for the team.
  4. Set up the Discord server for team communication.
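As an illustration of the Beautiful Soup side of the scraping, here is a minimal sketch that parses a hard-coded snippet shaped like a forum topic list; the tag and class names are hypothetical, not Tapas's actual markup:

```python
from bs4 import BeautifulSoup

# Hard-coded stand-in for a forum topic-list page;
# the classes "topic", "title" and "category" are invented for illustration.
html = """
<div class="topic-list">
  <div class="topic"><a class="title" href="/t/1">Best meme threads</a>
    <span class="category">Off Topic</span></div>
  <div class="topic"><a class="title" href="/t/2">Art tips</a>
    <span class="category">Creative</span></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
topics = [
    {
        "title": t.select_one("a.title").get_text(strip=True),
        "url": t.select_one("a.title")["href"],
        "category": t.select_one("span.category").get_text(strip=True),
    }
    for t in soup.select("div.topic")
]
print(topics)
```

On the real forum the dynamically loaded pages are the part that needs Selenium; Beautiful Soup then handles the parsing the same way.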

Tools

  • Jupyter Notebook
  • Trello
  • Discord
  • Git & Github

Soft Skills

  1. Working alongside other leads to develop a streamlined project workflow for the team (Team management & leadership)
  2. Time management skills (Making sure the time set for meetings is convenient for people from all time-zones)
  3. Collaboration and Team-work (Having ice-breakers each session to ensure a comfortable, collaborative environment and scrum meetings every day to clarify doubts)

Three achievement highlights:

  1. Managing to work with a team largely based in different time zones.
  2. Ran into an issue with Selenium that took almost half a day to troubleshoot but eventually got it resolved.
  3. Forming a rapport with the team and co-leads to ensure smooth progress of the project.

Machine Learning - Level 1 Week 2 Self Assessment

Technical Areas worked on :

  1. Scraped the Tapas forum using Selenium: collected the titles, comments, categories and dates of all topics containing the word ‘meme’.
  2. Converted the data into a pandas dataframe and exported it to CSV format.
  3. Used third-party software (NVivo) to generate a word cloud.
  4. Performed elementary EDA on the data. Referred to the EDA tutorial on the STEM-Away forum; calculated the number of words, number of characters, average word length and number of stopwords, and performed basic pre-processing such as removing punctuation and stopwords, stemming and tokenization.
  5. Additionally, pushed test code to the GitHub repo.
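The basic EDA counts above can be sketched with pandas; the column name and sample titles are placeholders, and scikit-learn's built-in English stopword list stands in for the nltk one:

```python
import string
import pandas as pd
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Toy stand-in for the scraped titles; "title" is an assumed column name.
df = pd.DataFrame({"title": ["The best meme thread ever", "Share your favourite meme!"]})

# Basic counts, as in the EDA tutorial: words, characters, average word length.
df["n_words"] = df["title"].str.split().str.len()
df["n_chars"] = df["title"].str.len()
df["avg_word_len"] = df["title"].apply(
    lambda t: sum(len(w) for w in t.split()) / len(t.split())
)

# Simple pre-processing: lowercase, strip punctuation, drop stopwords.
def clean(text):
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in text.split() if w not in ENGLISH_STOP_WORDS)

df["clean_title"] = df["title"].apply(clean)
print(df[["n_words", "n_chars", "clean_title"]])
```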

Tools/Libraries

  1. Jupyter Notebook
  2. Python
  3. Git/Github
  4. nltk and sklearn libraries
  5. Selenium and Beautiful Soup
  6. Trello
  7. NVivo (third-party qualitative research software)
  8. Pandas

Soft Skills

  1. Team Management : Set up the agenda for the week, arranged meetings with fellow leads and team-mates and made sure that everyone was on the same page with respect to deliverables.
  2. Collaboration and Communication : Fostered a healthy rapport with fellow leads and worked alongside them to determine best practices for better team output. We are in constant communication via the leads' Discord channel to address immediate doubts and clarifications.
  3. Task Management : Divided the team into two sub-groups on the basis of time zones for greater efficiency and teamwork.
  4. Tech Help : Organized ten-minute scrum meetings every day for any tech assistance required by team members, and gave a quick overview of library installation in one of the scrums.
  5. Time Management : Made sure that meeting times were convenient for everyone.

Three achievement highlights

  1. Debugging : Working with Selenium proved a little tough, but I eventually worked around the issues. The XPaths kept changing every time I loaded a fresh instance of the forum, so I had to find more robust methods to make the code reliable.
  2. Steep learning curve : The week was filled with fast-paced learning, from working with Selenium to figuring out EDA.
  3. Co-moderating the team sessions : Leading certain sections of team meetings by explaining the week's deliverables and clarifying doubts has been wonderful in terms of exploring my personal leadership style.

Goals for the week

  1. Completing advanced text processing and visualisations (n-grams, TF-IDF, word embeddings and word clouds using Python).
  2. Delving deeper into NLP resources.
  3. Making sure that the team has completed module 2 deliverables by the end of the week and has module 3 requisites prepared.

(A word cloud of my scraped data, generated via NVivo before pre-processing and cleaning the data)

Machine Learning - Level 1 Week 3 Self Assessment

Technical Areas worked on :

  1. Re-worked my scraper to scrape ~13,000 titles, replies and views, category-wise from the Tapas forum.
  2. Stored the data in 11 separate csv files and eventually combined them into one csv file.
  3. Performed EDA tasks on my data as follows:
     a. Calculated the frequency of the rarest and most common words in the data and plotted them.
     b. Calculated the word length and the average word length of each title.
     c. Removed stop-words and converted the data to lowercase.
     d. Plotted unigrams, bigrams and trigrams of all titles with more than 100 views.
     e. Performed sentiment analysis on the data.
     f. Obtained a sparse matrix using the TF-IDF method.
     g. Ran bag-of-words code on the data.

Tools/Libraries :

  1. Jupyter Notebook
  2. Python
  3. Git/Github
  4. nltk and sklearn libraries
  5. Selenium
  6. Pandas/Numpy
  7. Trello

Soft Skills :

  1. Team Management : Set the agenda for the week, met co-leads, held team-meetings and prepared a deliverables document for Module 3 Week 1 for the team to follow.
  2. Help and collaboration : Helped resolve errors/divided up the work/provided resources for team-mates to follow via scrum-meetings and discord channels.

Three achievement highlights :

  1. Figured out an efficient way to scrape the forum using just Selenium and tweaked the code to yield 10k+ data samples.
  2. Figured out which EDA methods worked and which didn't in the process. For example, spelling correction and stemming distorted my data quite a bit, so I decided not to employ those functions in my EDA.
  3. Helped streamline team progress by preparing a clear deliverables document for every team-member to follow. The document lays out all the tasks to be finished and all the module 3 requisites to be prepared by the end of this week.

Goals for the week

  1. Delve more into the intuition of advanced EDA concepts such as word embeddings and bag of words.
  2. Read up on cosine similarities.
  3. Watch STEM-Away tutorials on recommendation systems.
  4. Prepare a progress deck covering work up to Module 2 with the team to present to the mentors, and start discussing the ML pipeline for our recommendation system.

(A snapshot of the n-grams plotted for the scraped data)

Machine Learning - Level 1 Week 4 Self Assessment

Technical Areas worked on :

  1. Understood the intuition behind word embeddings and different distance metrics and ran TF-IDF and cosine similarity on my pre-processed data.
  2. Ran a BERT-based classifier on my data and managed to classify titles into their appropriate categories.
  3. Read several articles on recommender systems (content-based and collaborative-filtering).
  4. Installed and explored PyTorch and TensorFlow libraries.
  5. Pushed my module 2 code, csv files and visualisations to the team repo.
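A minimal sketch of TF-IDF plus cosine similarity on a toy corpus (the titles below are placeholders, not the scraped data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for the pre-processed titles.
titles = [
    "funny cat meme collection",
    "best cat meme thread",
    "watercolor painting tutorial",
]

tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(titles)

# Pairwise cosine similarities; entry [i, j] compares titles i and j.
sim = cosine_similarity(matrix)
print(sim.round(2))
```

This is also the core of a content-based recommender: for a given title, recommend the titles in its row with the highest similarity scores.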

Tools/Libraries

  1. Jupyter Notebook
  2. Python
  3. Scikit-Learn
  4. PyTorch & TensorFlow
  5. Pandas/Numpy
  6. Trello
  7. Git/Github

Soft Skills :

  1. Set up the agenda for the team meeting and brainstormed the deliverables and roadmap with co-leads.
  2. Attended scrum meetings to resolve and troubleshoot queries; explained a general overview of Module 3 expectations to the team and prepared a document for the same, complete with resources.

Three achievement highlights :

  1. Wrapped my head around the concepts of word embeddings and cosine similarity, which seemed a bit complex at first.
  2. Figured out a general roadmap of an ML pipeline for a recommender system and the concepts/libraries that need to be understood and deployed.
  3. Attended office hours with Sara to figure out team logistics.
  4. Attended an ML session 1 presentation to get insights into how the final product is to be delivered.

Goals for the week :

  1. Play around with some additional visualizations (Word embeddings using TSNE)
  2. Draft a problem statement with co-leads to work towards.
  3. Draft a clear road-map with co-leads on how the work is to be divided moving forward.
  4. Explore more python libraries and associated concepts and start working on the recommender system in accordance with the problem statement alongside the team.

(Snapshot of the BERT based classifier)

Machine Learning - Level 1 Week 5 Self Assessment

Technical Areas worked on :

  1. Trained different combinations of word embeddings and classifiers on the data to figure out the best-performing one (as performed on the CSV file selected as the team CSV):

     a. TF-IDF + Naive Bayes
     b. TF-IDF + SVM
     c. Word2Vec + Naive Bayes
     d. Word2Vec + SVM
     e. TF-IDF + Logistic Regression
     f. TF-IDF + SVM (with hyperparameter tuning)
     g. TF-IDF + Random Forest (with hyperparameter tuning)
     h. TF-IDF + Bagging Trees (with hyperparameter tuning)
     i. TF-IDF + XGBoost (with hyperparameter tuning)
     j. Additionally, ran RoBERTa on the data I personally scraped.

     The combination of TF-IDF and SVM with hyperparameter tuning yielded the best results.

  2. Gleaned better insight into BERT and the BERT family of classifiers (RoBERTa, XLNet, DistilBERT) and their architecture via several articles.
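The TF-IDF + SVM combination with GridSearchCV can be sketched on toy data like this (the texts, labels and parameter grid are illustrative, not the actual team data or settings):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Toy labelled titles for two hypothetical forum categories.
texts = [
    "new meme thread", "post your meme here", "meme contest entries", "daily meme dump",
    "line art tips", "shading tutorial help", "digital art critique", "sketchbook art share",
]
labels = ["memes", "memes", "memes", "memes", "art", "art", "art", "art"]

# Pipeline keeps vectorization inside the cross-validation folds.
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("svm", SVC())])
grid = GridSearchCV(
    pipe,
    param_grid={"svm__C": [0.1, 1, 10], "svm__kernel": ["linear", "rbf"]},
    cv=2,
)
grid.fit(texts, labels)
print(grid.best_params_, grid.best_score_)
```

Swapping `SVC` for `MultinomialNB`, `LogisticRegression` or `RandomForestClassifier` gives the other combinations from the list above.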

Tools/Libraries

  1. Jupyter Notebook
  2. nltk
  3. Scikit-Learn
  4. Pandas/Numpy
  5. Trello
  6. Git/Github
  7. Gensim
  8. GridSearchCV

Soft Skills :

  1. Met the co-leads multiple times over the week and finalized: a. the problem statement for the recommender system; b. sub-teams for classifiers and the recommendation system to improve team efficiency; c. one common, most comprehensive CSV file for the whole team to work on.
  2. Explained the week's deliverables during the team meeting, stressed the importance of amping up inter-team communication and made a comprehensive document of deliverables to be completed for the week (along with resources).
  3. Attended office hours with Anubhav to clarify doubts with respect to the ML pipeline and the problem statement.

Three achievement highlights

  1. Read about and understood the bidirectional nature of the BERT classifier and gained deeper insight into its architecture.
  2. Ran a bevy of classification models and learnt how to implement GridSearch CV for hyperparameter tuning.
  3. Clarified team members’ questions / doubts via Discord and Scrum.

Goals for the week :

  1. Explore more classification models/ensemble models and pipelines for the data, and focus on improving accuracy.
  2. Explore deep learning frameworks/neural networks for classification of the data
  3. Run more sophisticated word embeddings (such as TF-IDF + BERT)
  4. In tandem, start working on building the recommender system.
  5. Run confusion matrices after having trained the data to gain an overall picture of model performance (and not just accuracy in silos)
  6. Trace team progress and work towards integrating everyone’s contributions into one complete team deliverable.
  7. Some models proved to be very computationally expensive (BERT, XGBoost); figure out some work-arounds for that.
  8. Towards the end of the week, focus on how to deploy the web app for our deliverable.

(Snapshot of TF-IDF + SVM hyperparameter tuning, which yielded an accuracy of 96%)

Machine Learning - Level 1 Week 6 Self Assessment

Technical Areas worked on :

  1. Trained advanced classification models on the data: BERT, XLNet, DistilBERT and Electra.
  2. Generated a confusion matrix for each of the previously trained classification models.
  3. Attempted to bump up the accuracy metrics via hyperparameter tuning.
  4. Made a basic recommender system.
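Generating a confusion matrix for a trained classifier can be sketched with scikit-learn; the true labels and predictions below are made up for illustration:

```python
from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical true vs. predicted categories from one of the classifiers.
y_true = ["memes", "memes", "art", "art", "comics", "comics"]
y_pred = ["memes", "art", "art", "art", "comics", "memes"]

labels = ["art", "comics", "memes"]
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)  # rows = true class, columns = predicted class
print(classification_report(y_true, y_pred, labels=labels))
```

Unlike a single accuracy number, the off-diagonal cells show exactly which categories get confused with which, and the report adds per-class precision, recall and F1.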

Tools/Libraries

  1. Google Colab/Jupyter Notebook
  2. SimpleTransformers
  3. Pandas/Numpy
  4. Tokenizers (HuggingFace)
  5. GridSearchCV
  6. Trello
  7. Gensim

Soft skills :

  1. Finalized the deliverable to be deployed on the app alongside the leads and the team. (Best performing classifier).
  2. Helped divide work among the leads and team to catalyze team progress.
  3. Set an internal deadline for the team to finish their deliverables so that we can progress to the final stage of app development.

Three achievement highlights :

  1. Achieved ~95% accuracy (and comparable precision, recall and F1 metrics) for my TF-IDF + SVM model using hyperparameter tuning. My highest Simple Transformers accuracy, however, is 67% (with Electra).
  2. Figured out a work-around for faster deployment of BERT models (using the GPU option on Google Colab).
  3. Helped streamline team progress.

Goals for the week

  1. Try and push my accuracy past 70% for my BERT models.
  2. Start working and collaborating with my team members on the app and app deployment.
  3. Start working on the team progress deck.


(Confusion matrix for my highest accuracy classifier)

(Metrics for my highest accuracy classifier)

(Classification example using Electra)

Machine Learning - Level 1 Week 7 Self Assessment

Technical Areas Worked on :

  1. Iterated over different hyperparameter tuning techniques to bump up accuracy for all the Simple Transformers models: a. GridSearchCV b. Bayesian Optimization c. Population Based Training
  2. Provided the code file of my highest-performing model (XLNet) to the co-lead overseeing the deployment of the web app, to run a demo version of the web-app-based classifier.
  3. Understood the web deployment pipeline and ran the Flask code for my model to create a locally deployed web app (but ran into a few errors with my tar files).
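A minimal sketch of the Flask side of local deployment, with a stub standing in for the trained XLNet model (the route and response format are assumptions, not the team's actual app):

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

# Stub standing in for the trained XLNet model; the real app would load
# the saved model files (e.g. from the tar archive) at startup instead.
def classify(title):
    return "memes" if "meme" in title.lower() else "other"

@app.route("/predict", methods=["POST"])
def predict():
    title = request.get_json()["title"]
    return jsonify({"title": title, "category": classify(title)})

# For a locally deployed app: app.run(port=5000)
```

A client would then POST JSON like `{"title": "..."}` to `/predict` and get the predicted category back.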

Tools/Libraries

  1. Google Colab/Kaggle Notebooks
  2. Flask
  3. Pandas/Numpy
  4. Tokenizers
  5. ray[tune]
  6. Simple Transformers

Soft skills :

  1. Worked on the final deck alongside the team.
  2. Met with the team on four consecutive days to fine-tune the process and wrap up the deliverables.

Three achievement highlights :

  1. Exhausted my GPU resources on Google Colab but pivoted to Kaggle notebooks as a work-around.
  2. Kept at hyperparameter tuning until I got a satisfactory bump in accuracy (though I faced a couple of errors with the PBT tuning).
  3. Finished portions of the deck and successfully wrapped up the deliverables alongside the team.

Goals for the week

  1. Present our final deck to mentors and seek constructive feedback to improve for future.
  2. Exchange LinkedIn/social media handles with the team to stay in touch.

(XLNet metrics - my highest-performing Simple Transformers model)