Week: 7/27
Overview of Things Learned:
- Technical Area: Web scraping, Data cleaning, Data preprocessing
- Tools: BeautifulSoup, Selenium, Scrapy, Requests, Pandas, Regex, Google Colab
Achievement Highlights
- Scraped data from over 30,000 posts
- Scraped JSON data using Scrapy
- Learned about data cleaning and preprocessing
Meetings attended
- 7/23 - ML team kick-off meeting
- 7/31 - Web Scraping and Preprocessing Presentations
Goals for the Upcoming Week
- Complete preprocessing of the scraped data
- Learn about TF-IDF
Tasks Done
- Web scraping: Built a web crawler with Scrapy, used it to scrape JSON data, and stored the results in a CSV file. Loaded the CSV into a pandas DataFrame and cleaned the data (a rough sketch of the crawler follows below).
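A minimal sketch of what such a Scrapy spider could look like, assuming a paginated JSON endpoint; the URL, the "posts"/"next" keys, and the field names (title, body, author) are placeholders rather than the actual source.

```python
import scrapy

class PostsSpider(scrapy.Spider):
    """Spider that reads a JSON API and yields one item per post."""
    name = "posts"
    # Hypothetical paginated JSON endpoint (placeholder URL)
    start_urls = ["https://example.com/api/posts?page=1"]

    def parse(self, response):
        data = response.json()                     # parse the JSON payload
        for post in data.get("posts", []):
            yield {                                # each dict becomes one CSV row
                "title": post.get("title"),
                "body": post.get("body"),
                "author": post.get("author"),
            }
        next_page = data.get("next")               # follow pagination if present
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Running the spider with `scrapy crawl posts -O posts.csv` writes the yielded items to a CSV file, which can then be loaded with `pandas.read_csv("posts.csv")` for cleaning.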
Week: 8/3
Overview of Things Learned:
- Technical Area: Data preprocessing, TF-IDF
- Tools: BeautifulSoup, Requests, Pandas, Regex, scikit-learn, NLTK
Achievement Highlights
- Preprocessed the scraped data using the markdown library and regex
- Learned about TF-IDF
Meetings attended
- 7/27 - Introduction to Web Scraping and this Week’s Deliverables
- 7/29 - Web Scraping Check-In Meeting and Intro to Preprocessing
- 7/31 - Web Scraping and Preprocessing Presentations
Goals for the Upcoming Week
- Complete TF-IDF
- Learn about BERT
Tasks Done
- Preprocessing: Removed HTML code, symbols, and repeated spaces from the text in the scraped data. Kept only the useful rows by dropping entries with null values (see the sketch below).
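A rough sketch of these cleaning steps, assuming the scraped text sits in a "body" column of a posts.csv export; the file and column names are placeholders.

```python
import re
import pandas as pd

def clean_text(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)          # strip HTML tags
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)   # remove symbols/punctuation
    text = re.sub(r"\s+", " ", text)              # collapse multiple spaces
    return text.strip()

df = pd.read_csv("posts.csv")
df = df.dropna(subset=["body"])                   # omit rows with null text
df["body"] = df["body"].astype(str).map(clean_text)
df.to_csv("posts_clean.csv", index=False)
```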
Week: 8/7
Overview of Things Learned:
- Technical Area: TF-IDF
- Tools: scikit-learn, PyTorch, the Transformers library
Achievement Highlights
- Removed stop words from the preprocessed data
- Used the TfidfVectorizer in scikit-learn to create the TF-IDF matrix
Meetings attended
- 8/12 - Check-in about implementing the BERT Model
- 8/14 - Present BERT Model implementations
Goals for the Upcoming Week
- Complete BERT model implementation
Tasks Done
- TF-IDF: Created the TF-IDF matrix for the post text after removing values that were not useful, including stop words (see the sketch after this list).
- BERT: Filtered noise out of the dataset, including emojis and hyperlinks.
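A minimal sketch of the TF-IDF and noise-filtering steps, assuming the cleaned text lives in a "body" column of posts_clean.csv; the file and column names, the ASCII-based emoji stripping, and the use of scikit-learn's built-in English stop-word list are illustrative assumptions (NLTK's stop-word list could be used instead).

```python
import re
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def strip_noise(text: str) -> str:
    text = re.sub(r"https?://\S+", " ", text)               # drop hyperlinks
    text = text.encode("ascii", errors="ignore").decode()   # crude emoji / non-ASCII removal
    return re.sub(r"\s+", " ", text).strip()

df = pd.read_csv("posts_clean.csv").dropna(subset=["body"])
docs = df["body"].astype(str).map(strip_noise)

# stop_words="english" applies scikit-learn's built-in stop-word list;
# nltk.corpus.stopwords.words("english") could be passed here instead.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(docs)               # sparse (n_posts x n_terms) matrix
print(tfidf_matrix.shape)
```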