Melissa - Machine Learning Pathway

OVERVIEW

Technical Area

I learned to use the Python library Scrapy to retrieve and parse data from 221 pages of the Codecombat Forum. While working with forum data and infinite scrolling, I became comfortable with CSS Locator syntax and with chaining CSS Locators together with XPath. My partner initially used the Selenium framework and the BeautifulSoup library, and I became familiar with those as well.
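To give a flavor of that chaining, here is a minimal sketch of the pattern as it would look in a Scrapy shell. The CSS classes are illustrative stand-ins, not the forum's exact markup:

```python
# Run inside `scrapy shell <topic_url>`; `response` is provided by the shell.
# The class names below are hypothetical stand-ins for the forum's markup.
posts = response.css("div.topic-post")  # CSS Locator for the post containers
for post in posts:
    # chain into XPath to pull out the text nodes of each post body
    fragments = post.xpath(".//div[@class='cooked']//text()").getall()
    print(" ".join(fragment.strip() for fragment in fragments))
```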

Ultimately we learned that the simplest approach is sometimes best: I learned to use API calls to handle the infinite scrolling across multiple pages when working with a forum.
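Here is a rough sketch of that API-call approach, assuming the forum is a standard Discourse instance whose infinite scroll is backed by a paged JSON endpoint. The base URL and page limit are illustrative:

```python
# Page through a Discourse forum's JSON API instead of fighting infinite scroll.
# BASE is an assumed Discourse instance URL; adjust it for the actual forum.
import requests

BASE = "https://discourse.codecombat.com"

def fetch_topics(max_pages=5):
    topics = []
    for page in range(max_pages):
        # /latest.json serves the same data the infinite-scroll UI loads
        resp = requests.get(f"{BASE}/latest.json", params={"page": page})
        resp.raise_for_status()
        batch = resp.json()["topic_list"]["topics"]
        if not batch:  # no more pages
            break
        topics.extend(batch)
    return topics

print(len(fetch_topics()))
```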

I used the Pandas library, Jupyter Notebook, and regular expressions to clean and analyze 40,000 lines of data.

Tools:

Python, Pandas, Jupyter Notebook, Scrapy, Git, Colab, PyCharm, Regular Expressions

Soft Skills:

Coordinated with team members across different time zones. Helped communicate the lessons my partner and I learned to multiple team members.

Achievement highlights:

  1. Used Pandas and Jupyter Notebook for the first time, and cleaned over 40,000 lines of data using a combination of Pandas and Regex.
  2. Created my first webcrawler using the Python library Scrapy and scraped 221 pages of a forum.
  3. Created my first tokens using the pretrained DistilBERT tokenizer.

List of meetings/ training attended including social team events:

Team Meetings

  • All team meetings for Group 8
  • Small group session with Charlene and Pratik for BERT training
  • Two team meetings with Priya and Trang
  • One team meeting with Trang
  • Small group session with Charlene and team for BERT/project questions.

Python Training
hosted by Gorbal and Shreyas Pasumarthi

  • Session 1 - Advanced Level (Intro to handling the dataset)
  • Session 2 - Beginner Level (Learning the basics of Python to hopefully bring you up to an advanced level)
  • Session 3 - Advanced Level (Answering any questions, etc.)

Git Training

  • Git webinar for all ML teams (Hosted by industry mentor)

Navigating STEM-Away 101

  • I hosted STEM-Away website navigation training with Shreyas Pasumarthi

Data-mining

  • Technical Deep Dive - Data Mining hosted by Maleeha Imran
  • STEMCasts - Overview of ML and project hosted by Kunal Singh
  • Technical Deep Dive - Recommendation Models hosted by Sara EL-ATEIF

Goals for the upcoming week:

Learn more about scikit-learn and PyTorch. Use DistilBERT to get a working forum classification model up and running.

Detailed statement of tasks done:

Webscraping

Created my first webcrawler using the Python library Scrapy. I went down a rabbit hole and over-complicated things by learning all about chaining CSS Locators with XPath. Maleeha's resources helped my partner rein me in and focus on the task at hand. I came across errors while scraping and noticed that my crawler was not going as fast as my partner's. Thanks to my partner Trang and to Charlene, I became more familiar with those errors, learned which installs helped my computer work better with Python, and became more comfortable working in a Colab notebook.
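A stripped-down sketch of the kind of spider this turned into; the spider name, start URL, and selectors are illustrative rather than my exact code:

```python
# Minimal Scrapy spider: follow topic links from a listing page and yield
# one item per post body. Selectors and URL are hypothetical stand-ins.
import scrapy

class ForumSpider(scrapy.Spider):
    name = "forum"
    start_urls = ["https://discourse.codecombat.com/latest"]

    def parse(self, response):
        # follow each topic link found on the listing page
        for href in response.css("a.title::attr(href)").getall():
            yield response.follow(href, callback=self.parse_topic)

    def parse_topic(self, response):
        # yield the visible text of each post on the topic page
        for post in response.css("div.cooked"):
            yield {"text": " ".join(post.css("::text").getall()).strip()}
```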

Cleaning Data

Used the Pandas library, Jupyter Notebook, and regular expressions to clean and analyze 100,000 lines of data. I learned that we only needed the posts, and reformatted the data down to approximately 40,000 lines to meet the project requirements. The Advanced Level training session (Intro to handling the dataset) helped me realize how easy it was to get going with Pandas and Jupyter Notebook. Thanks to my partner's code review, I learned to follow best practices for my code that are widely accepted in the Python community.
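The cleaning pass looked roughly like the sketch below; the file names, column name, and regex patterns are illustrative, not my exact code:

```python
# Keep only the post text and normalize it with regular expressions.
import re
import pandas as pd

df = pd.read_csv("scraped_posts.csv")   # assumed input from the crawler
df = df[["post"]].dropna()              # we only needed the posts

def clean(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)   # strip leftover HTML tags
    text = re.sub(r"http\S+", " ", text)   # drop URLs
    text = re.sub(r"\s+", " ", text)       # collapse whitespace
    return text.strip()

df["post"] = df["post"].apply(clean)
df = df[df["post"].str.len() > 0]       # drop now-empty rows
df.to_csv("cleaned_posts.csv", index=False)
```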

Understanding BERT

I am testing out DistilBERT and have begun tokenizing the data. I am now learning about padding and masking. I believe the major challenge I will face this week is learning more about scikit-learn and PyTorch.
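A minimal sketch of that tokenization step, using the Hugging Face DistilBERT tokenizer; the sample posts are made up, and the max length is an assumption:

```python
# Padding fills short sequences to a common length; the attention mask tells
# the model which positions are real tokens (1) versus padding (0).
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

posts = ["How do I beat this level?", "My hero won't move."]
encoded = tokenizer(posts, padding=True, truncation=True,
                    max_length=128, return_tensors="pt")

print(encoded["input_ids"].shape)    # (2, padded_length)
print(encoded["attention_mask"][0])  # 1 = real token, 0 = padding
```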


Hi all! I am putting the finishing touches on my project and cleaning it up. I will have an update on my 3-week extension with a project link this week. :smiley: I redid my model using 7 of the Discourse forums and worked with Flask/Docker/AWS/Postman. I just need to clean up the look and test a bit more.

I put some screenshots from my project below, and will follow up with a more detailed self-assessment and a link to the project.

[Screenshots: prediction results for the folksy, bnz, airline, huel, quickfile, schizophrenia, hopscotch, and codecombat forums, plus action shots.]


End of internship assessment! #machinelearning-summer2020

Overview:

I have worked on projects using the MERN (MongoDB, Express, React, Node) stack. Last year I made the decision to go back to college to finish my Bachelor's degree in Computer Science, and I have enjoyed every moment of it! I discovered STEM-Away through Stephanie Holt in a women's STEM Leadership Collective Facebook group. Though I had only recently learned Java and SQL, Stephanie Holt and Debaleena Das encouraged me to push myself beyond what I already knew.

I was placed on Team BERTinator, led by Charlene. For week one, she picked DistilBERT as our team's transformer model and broke us into small groups to learn about NLP transformers and to create web crawlers, using Scrapy to scrape data from our chosen public Discourse forum. Over the following four weeks, I learned methods for web scraping, data cleaning with Pandas, and exploratory data analysis. I explored data wrangling (tokenization and preprocessing) and different approaches to text classification using machine learning models, and I trained and tested on the dataset to evaluate my model, following the general pattern sketched below. Everyone on our team ended up with a working model trained on the forum data we had scraped.
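That train/test/evaluate loop can be sketched with a classical baseline; the TF-IDF plus logistic regression model below is a stand-in (our actual models used DistilBERT), and the file and column names are assumptions:

```python
# Generic pattern: split the labeled posts, fit a classifier, report metrics.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("cleaned_posts.csv")   # assumes 'post' and 'label' columns
X_train, X_test, y_train, y_test = train_test_split(
    df["post"], df["label"], test_size=0.2, random_state=42)

vec = TfidfVectorizer(max_features=20000)
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(X_train), y_train)

print(classification_report(y_test, clf.predict(vec.transform(X_test))))
```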

For the second part of my internship I was given the opportunity to move to the Bioinformatics sector as a lead, or to expand on my current project using Flask, Docker, and AWS. I decided to stay on for the expansion. I had only dabbled in AWS in the past, and I was excited to learn more about it and about Docker. I decided to challenge myself to redo my recommender model and use even more scraped forum data from my session-one teammates. I used ktrain as a wrapper around DistilBERT and its supporting libraries to cut down the amount of code I had to write. I learned the hard way why it's best to use Google Colab to train your model!
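The ktrain workflow looked roughly like this; the dataset path, column names, and hyperparameters are illustrative rather than my exact settings:

```python
# Fine-tune DistilBERT on the combined forum posts with ktrain's wrapper.
import pandas as pd
from sklearn.model_selection import train_test_split
import ktrain
from ktrain import text

df = pd.read_csv("all_forum_posts.csv")  # assumed: one post per row, labeled by forum
x_train, x_test, y_train, y_test = train_test_split(
    df["post"].tolist(), df["forum"].tolist(), test_size=0.1, random_state=42)

t = text.Transformer("distilbert-base-uncased", maxlen=256,
                     class_names=sorted(set(df["forum"])))
trn = t.preprocess_train(x_train, y_train)
val = t.preprocess_test(x_test, y_test)

learner = ktrain.get_learner(t.get_classifier(), train_data=trn,
                             val_data=val, batch_size=16)
learner.fit_onecycle(5e-5, 3)  # one-cycle policy, 3 epochs

predictor = ktrain.get_predictor(learner.model, preproc=t)
predictor.save("distilbert_forum_model")  # reloaded later by the serving app
```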

I then learned how to use Flask on both the front and back end, and deployed my model in Flask on my local computer. I then dived into the world of Docker and learned how to create my own Docker images and to tag and push my code to Docker Hub. I learned how useful containers are, and am eager to take this knowledge to the next level in the future with Kubernetes and pods. My next step was to deploy my model onto AWS. I chose AWS EC2 since my project was too big to deploy serverless via AWS Lambda. I used Forklift as my file manager to upload and work with my files on EC2, and deployed my project on an EC2 Ubuntu instance. I decided to listen to that insistent warning in Flask (if you've worked in Flask, you know the one) - and to the hopes of my team lead from the very beginning of this internship - and deployed my model into production using uWSGI and NGINX.
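A bare-bones sketch of the Flask wrapper around the saved predictor; the model path and route name are assumptions. In production this ran behind uWSGI and NGINX rather than Flask's built-in development server:

```python
# Serve the saved ktrain predictor behind a simple JSON endpoint.
import ktrain
from flask import Flask, jsonify, request

app = Flask(__name__)
predictor = ktrain.load_predictor("distilbert_forum_model")  # assumed path

@app.route("/predict", methods=["POST"])
def predict():
    post_text = request.get_json(force=True).get("text", "")
    return jsonify({"forum": predictor.predict(post_text)})

if __name__ == "__main__":
    # development server only; the warning Flask prints here is exactly why
    # the deployed version sits behind uWSGI and NGINX
    app.run(host="0.0.0.0", port=5000)
```

With the app running, a Postman (or curl) POST to /predict with a JSON body like {"text": "How do I beat this level?"} returns the predicted forum.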

Tools I used during this internship:

  • Python - NumPy, NLTK, scikit-learn, Pandas, Keras, PyTorch, TensorFlow, Scrapy, Seaborn, Flask
  • DistilBERT
  • PyCharm
  • Forklift
  • Docker
  • AWS (EC2, S3, Elastic IP)
  • uWSGI
  • NGINX
  • Ubuntu
  • Slack
  • Google Colab
  • Hugging Face Transformers
  • ktrain
  • Jupyter Notebook
  • Postman
  • Regex

From where I started off:

  • I had only the most basic Python skills at the start of this project. I knew nothing about machine learning or the Python tools for it, so I had been looking for a Java-based internship. But STEM-Away encouraged me to take on challenges I had never faced before. The experience was WILD! There was so much I had to learn about machine learning, cloud services, data cleaning, modeling, Docker…I had briefly touched on AWS but didn't know how much I didn't know. This internship has kept me on my toes the entire time. It's been challenging and wonderful all at once.

Some of the skills I learned or improved on:

I learned the importance of documenting my bugs and my solutions for them. I came across a lot of bugs, especially when working with the TensorFlow library on AWS and with the specific library versions needed to work with ktrain.

Where I am post Machine Learning Internship:

Technical Skills

  • Data-mining
  • Data-cleaning
  • Transformer Neural Networks: Using the DistilBERT model
  • ktrain
  • Flask
  • Docker
  • AWS (EC2, S3, Elastic IP)
  • Python

Soft Skills

  • Global networking
  • Research
  • Self-teaching
  • Determination
  • Reading documentation

Challenges I faced during this internship:

  • Learning an entirely new programming language (Python) quickly.
  • I could not use AWS Lambda free tier for my project as it became bigger.
  • I had never worked in Machine Learning before. I didn't know what a web scraper was or how a DistilBERT model worked.
  • When Debaleena said to use Docker, Flask, and AWS, I didn’t realize how intensive this could be to learn. As I was the only person in my group who had signed up for the expansion project, I often found myself working out some very tricky bugs on my own.

How I overcame these challenges:

  • I put in the extra hours to read documentation, learn, and put the pieces together. I documented my bugs and wrote code snippets to follow up on.
  • During the first half of the internship, I read all of the documentation and articles, attended all of the meetings/webinars, and used the examples provided by Charlene and Maleeha, which helped me put together my first model.
  • I used AWS EC2 along with Forklift for my file management for my larger model and files.
  • During the second half of the internship, I rapidly adapted to working alone with bugs and became a much more self-sufficient programmer in the process.

Achievement Highlights:

  • Deployed a working machine learning model in production using uWSGI and NGINX. (See here!! http://3.129.123.13/)
  • Hosted a webinar on how to navigate the STEM-Away site.
  • Promoted to STEM-Away Ambassador.
  • Scraped the data from the Codecombat Forum in the required format.
  • Cleaned thousands of lines of data to meet DistilBERT training standards.
  • Learned how to use Python and its tools and libraries quickly.
  • I explored different approaches for text classification using machine learning classification models, as well as trained and tested the dataset to evaluate my model.
  • Learned how to create images, containers, and tags in Docker.
  • Learned my way around AWS.
  • Deployed a working machine learning model in Flask.
  • Deployed a working machine learning model in Docker.
  • Deployed a working machine learning model on AWS EC2.
  • Helped multiple fellow STEM-Away interns with their cloud engineering and web scraping questions.
  • Learned things as they came, under pressure and on the fly.
  • Made some amazing friends from around the world: Vrinda, Priya, Sara, Neda, Phillip, Pratik, Geoffrey, Gorbal, Khanh, and more! :slight_smile:

I want to thank Debaleena, Stephanie, Charlene, Maleeha, Priya (and everyone I listed above too :slight_smile:), and everyone at STEM-Away. I never would have thought I could accomplish all that I did. This experience has definitely been life-changing for me.
