Jianglu0725 - Machine Learning (Level 1) Pathway

Things Learned

  1. Technical Area

    → Machine learning basics:

    1. Difference between programming and machine learning

    2. Defining ML problems: classification vs. regression

    3. Types of learning: supervised vs. unsupervised learning

    4. Data filtering methods: content-based filtering vs. collaborative filtering

    5. Methods for calculating similarity: cosine similarity, dot product, Euclidean distance

    6. A useful model for data filtering: K-nearest neighbors

    7. Evaluation methods for different outputs: MSE (regression) and confusion matrix (classification)
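
To make the three similarity measures in the list above concrete, here is a minimal sketch using only the Python standard library. The item vectors are made-up numbers for illustration, not data from the project.

```python
import math

def dot(u, v):
    """Dot product: grows with both directional agreement and vector length."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; vector length is ignored."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def euclidean_distance(u, v):
    """Straight-line distance; a smaller value means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical item feature vectors (illustrative values only).
item_a = [1.0, 2.0, 0.0]
item_b = [2.0, 4.0, 0.0]  # same direction as item_a, twice as long
item_c = [0.0, 0.0, 3.0]  # orthogonal to item_a

print("dot(a, b)       =", dot(item_a, item_b))
print("cosine(a, b)    =", cosine_similarity(item_a, item_b))
print("euclidean(a, b) =", euclidean_distance(item_a, item_b))
print("cosine(a, c)    =", cosine_similarity(item_a, item_c))
```

item_a and item_b point in the same direction, so their cosine similarity is at its maximum even though their Euclidean distance is not zero. That difference is the main reason to pick one measure over another.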

    → NLP

    1. Pros and cons of vanilla neural networks, RNNs and LSTMs

    2. Modern NLP: attention mechanisms and the transformer (Key, Query, Value)

      • Basic Process:

        → Encoder: turns a sentence into vectors

        • BERT: features, how it is trained, components

        → Decoder: takes the vectors and weights and produces the translation

        • GPT-2
      • Multi-head attention:

        • Core attention model
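
The core attention step with Key, Query and Value can be sketched as follows. This is a minimal single-head version in plain Python, assuming the query, key and value vectors are already given. In a real transformer they come from learned projection matrices, and multi-head attention runs several such heads in parallel. Both are left out here, and the numbers are made up.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Output is the weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy example: the query matches the first key more closely than the second.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
queries = [[1.0, 0.0]]

result = attention(queries, keys, values)
print(result)
```

Because the query is closer to the first key, the output leans towards the first value vector while still mixing in some of the second.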
  2. Tool

    → Preparing the working environment:

    Beautiful Soup, Selenium, GitHub, VS Code

    → Have a basic idea of:

    PyTorch, word embeddings, BERT, fine-tuning BERT for classification, multi-label classification, Simple Transformers

  3. Soft Skills

    → Managing a group

    Keep members participating by answering their questions promptly and efficiently

    Track their progress with Trello checkboxes

    When members have a scheduling conflict with a meeting, recording the meeting is always a good choice

    → Time management

    Listing out my personal tasks for each week. I also found that Trello can be used for personal task and time management, which is great

    → Problem Solving

    STEM-Away provides a lot of useful information and guides for solving recommendation-system ML problems. When I run into a problem, learning from others, videos and web resources is always a good approach

Three achievement highlights

  1. Gathering all my team members

    • Creating Slack and WhatsApp groups
    • Creating Trello tasks for members to track their weekly progress
    • Holding the first welcome meeting with members and answering their questions
  2. Successfully scraping data from the Discourse Hub website

  3. Getting hands-on practice with NLP

Tasks Completed Process

  1. Gathering members

    • I posted a team setup thread on the forum and collected useful information there for my members
    • I looked up every member's username and sent them Slack and Trello invitations
    • After that I sent WhatsApp invitations, both by using the numbers members gave in the post and by sharing the WhatsApp group QR code on Slack

    Result

    • This worked very well: only one person could not be reached and one person moved to the next session
    • Members are very active on WhatsApp and Slack. WhatsApp is good for quick conversations and Slack is good for sharing technical information
  2. Scraping data from Discourse Hub

    • I followed the steps in the video to scrape the data

    Result

    • The tools were very useful and I successfully got the data.

Week 2 Self-assessments

Things Learned

  1. Technical Area

    → Data scraping:

    1. Learnt how to use Selenium and ChromeDriver

    2. Used Chrome's Inspect tool to examine the webpage and find useful information in the HTML code

    3. Learnt how to deal with large amounts of data in the scraping process
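
Once the page source is available (for example from Selenium), the elements spotted with the Inspect tool can be pulled out by tag and class. The sketch below swaps in the standard-library `html.parser` for Beautiful Soup so that it runs without extra packages. The HTML snippet and the `title` class name are hypothetical and are not the real markup of any forum.

```python
from html.parser import HTMLParser

class TopicLinkParser(HTMLParser):
    """Collects (href, text) for every <a> element whose class includes 'title'."""

    def __init__(self):
        super().__init__()
        self.topics = []
        self._current_href = None
        self._current_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "title" in classes:
            self._current_href = attrs.get("href")
            self._current_text = []

    def handle_data(self, data):
        # Only record text while inside a matching link.
        if self._current_href is not None:
            self._current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._current_text).strip()
            self.topics.append((self._current_href, text))
            self._current_href = None

# Hypothetical page source, standing in for what a browser driver would return.
SAMPLE_HTML = """
<table>
  <tr><td><a class="title" href="/t/example-topic-one/1">Example topic one</a></td></tr>
  <tr><td><a class="other" href="/about">About</a></td></tr>
  <tr><td><a class="title raw-link" href="/t/example-topic-two/2">Example topic two</a></td></tr>
</table>
"""

parser = TopicLinkParser()
parser.feed(SAMPLE_HTML)
topics = parser.topics
print(topics)
```

For a large site, feeding one page at a time and writing the results out after each page keeps memory use small.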

  2. Tool

    → Scraping tool:

    Beautiful Soup, Selenium, VS Code, pandas

  3. Soft Skills

    → Managing a group

    Learned how to get feedback and check progress with members.

Three achievement highlights

  1. Successfully scraped data from https://community.cartalk.com
  2. Helped members with their technical problems
  3. Members got more involved in the team and were more active in taking responsibilities.

Tasks completed process

  1. Scraping data
  • I took a look at the sample code.
  • I ran the sample code and then ran each function separately to get a closer look at what it does.
  • I applied a similar method to scrape the website we chose. The website contains so much information that the code takes a long time to run, so I tried to optimize the code and tested it on different categories.

Result

  • I successfully scraped the data from the website.
  2. Team Management

    • I talked to members both in the group and privately.
    • I also scheduled private meetings with them to try to help them out.

    Result

    • I successfully helped them with the code
    • I also received a lot of feedback from them.

Week 3 Self-assessments

Things Learned

  1. Technical Area

    → Exploratory Data analysis

    1. Do basic data cleaning, get a closer look at data in different categories
    2. Advanced text processing
  2. Tool

    → Exploratory data analysis:

    nltk, sklearn, seaborn, matplotlib

  3. Soft Skills

    → Managing a group

    Tried to monitor each member's progress through Trello

Three achievement highlights

  1. Made some findings after analyzing the data
  2. Helped members with their technical problems
  3. Learnt several methods for data analysis

Tasks completed process

  1. Data cleaning

    • Removed punctuation
    • Removed stop-words
    • Removed common and rare words
    • Used NLTK for stemming and checked spelling

    Result

    • Got the information needed for the next step
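
The cleaning steps listed above (punctuation, stop-words, common and rare words) can be sketched in plain Python as follows. The stop-word set here is a tiny stand-in; NLTK ships a much fuller list. The example posts and the thresholds `n_common` and `min_count` are made up for illustration.

```python
import string
from collections import Counter

# Tiny stand-in stop-word list (assumption; a real run would use a fuller one).
STOP_WORDS = {"the", "a", "is", "my", "and", "to", "it", "when", "i"}

def tokenize(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop-words."""
    table = str.maketrans("", "", string.punctuation)
    words = text.lower().translate(table).split()
    return [w for w in words if w not in STOP_WORDS]

def drop_common_and_rare(docs, n_common=1, min_count=2):
    """Remove the n_common most frequent words and words seen fewer than min_count times."""
    counts = Counter(w for doc in docs for w in doc)
    common = {w for w, _ in counts.most_common(n_common)}
    return [[w for w in doc if w not in common and counts[w] >= min_count]
            for doc in docs]

# Hypothetical forum posts.
posts = [
    "My car engine is making a noise!",
    "The engine noise stops when I brake.",
    "Car brake pads and the engine, again...",
]
tokens = [tokenize(p) for p in posts]
cleaned = drop_common_and_rare(tokens)
print(cleaned)
```

Here "engine" appears in every post, so it is dropped as too common, while words that occur only once are dropped as too rare.
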
  2. Data analysis

    • Vectorizing sentences with TF-IDF
    • Extracting 2-gram and 3-gram data
    • Creating a word cloud

    Result

    • Created some data sheets showing the most important information
    • Created a picture showing the most important words
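
For reference, TF-IDF weighting and n-gram extraction can be sketched in a few lines of plain Python. This uses the textbook form, term frequency times log(N / document frequency). It is not the exact formula used by sklearn's `TfidfVectorizer`, which by default smooths the IDF and normalizes each vector, so the numbers will differ. The token lists are made up.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All runs of n consecutive tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(docs):
    """Per-document dict of term -> tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (c / len(doc)) * math.log(n_docs / df[t])
                        for t, c in tf.items()})
    return weights

# Hypothetical tokenized posts.
docs = [
    ["engine", "noise", "brake"],
    ["engine", "oil", "leak"],
    ["engine", "noise", "idle"],
]
weights = tf_idf(docs)
bigrams = ngrams(docs[0], 2)
trigrams = ngrams(docs[0], 3)

print(weights[0])
print(bigrams)
print(trigrams)
```

A word that appears in every document ("engine") gets weight 0, while a word unique to one document ("brake") gets the highest weight, which is why TF-IDF helps surface the words that distinguish one post from the others.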

Week 4 Self-assessments

This week we did our project presentation and Colin and Sara gave us a lot of advice. The main task this week was to improve the previous work as the previous part was a bit lacking.

Things Learned

  1. Technical Area

    → Machine Learning:

    1. At the beginning of any ML project, a pipeline chart or a similar tool should be used to understand the purpose and requirements of the ML task. It is important to know what the inputs are and what outputs the model is expected to produce.

    → Data scraping:

    1. Too little data for ML can cause bias in the subsequent analysis. If the data is extracted only for a period of time, it only reflects the correlation of the data for that period of time, resulting in an incomplete analysis.
  2. Tool

    → Scraping tool:

    Beautiful Soup, Selenium, VS Code, pandas

    → Exploratory data analysis:

    nltk, sklearn, seaborn, matplotlib

  3. Soft Skills

    → Managing a group

    1. A short presentation, such as a 5-minute one, is more effective than having a single person talk through everything.
    2. When assigning tasks to members, I need to divide the work very carefully so that each member understands exactly what they are going to do. Setting a deadline within the week makes it more likely that the project is completed in the limited time available.

Three achievement highlights

  1. Successfully scraped data from https://community.cartalk.com. This time we got around 6000 data entries
  2. Almost every member was involved
  3. We basically completed part of the data analysis

Tasks completed process

  1. Scraping data
  • I divided the whole team into 4 groups, and each group scraped the data for 1-2 categories. I crawled the largest and the smallest category myself.

  • Each team was given 2-3 days to capture the data

  • I applied a similar method to scrape the website we chose. The website contains so much information that the code takes a long time to run, so I tried to optimize the code and tested it on different categories.

  • I opened a separate Trello Week 4 board to monitor the progress of each group

    Result

  • We crawled almost 6000 more data entries than before

  2. Team Management

    • Set task deadlines on Trello and asked about each group's crawling progress at the deadline.

    Result

    • Our group has become significantly more efficient

Week 5 Self-assessments

Things Learned

  1. Basic Recommender & Simple Classifiers

    → Data Modeling:

    1. Naive Bayes Classifier for Multinomial Models
    2. Decision Tree
    3. Linear support vector machine
    4. Logistic Regression
    5. Random Forest
    6. XGBoost
  2. Tool

    → Basic Modeling:

    sklearn
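
To show what the first model in the list actually computes, here is a from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing. It is not how sklearn implements it and not the project code. The class name, training posts and category labels are all made up for illustration.

```python
import math
from collections import Counter, defaultdict

class TinyMultinomialNB:
    """Multinomial Naive Bayes over token lists, with add-one smoothing."""

    def fit(self, docs, labels):
        self.vocab = {w for doc in docs for w in doc}
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc)
        self.n_docs = len(docs)
        return self

    def predict(self, doc):
        best_label, best_score = None, -math.inf
        for label, n_label in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(n_label / self.n_docs)  # log prior
            for w in doc:
                if w in self.vocab:  # words never seen in training are skipped
                    p = (self.word_counts[label][w] + 1) / (total + len(self.vocab))
                    score += math.log(p)  # log likelihood of this word
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical tokenized posts and their forum categories.
train_docs = [
    ["brake", "pads", "worn"],
    ["brake", "fluid", "leak"],
    ["engine", "oil", "leak"],
    ["engine", "noise", "idle"],
]
train_labels = ["brakes", "brakes", "engine", "engine"]

model = TinyMultinomialNB().fit(train_docs, train_labels)
print(model.predict(["brake", "pads"]))
print(model.predict(["engine", "oil"]))
```

The smoothing term keeps a single unseen word from driving a class probability to zero, which matters a lot with a small training set.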

Three Achievement highlights

  1. Further data analysis produced new findings
  2. Experimented with different data cleaning methods to improve model accuracy
  3. Found the most suitable model for this data among multiple models

Tasks completed process

  1. Data Analysis
  • The word cloud shows the words that appear in each group; I deleted them and observed how the accuracy changed
  • There are a lot of links to images in the data, and these can affect our system
  2. Data cleaning
  • In addition to the basic data cleaning, I tested extra data processing steps based on the two findings above
  3. Data Modeling
  • Tested the accuracy of different models under different data cleaning scenarios

    Result

  • Found that logistic regression performed best
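
One of the extra cleaning steps, removing image links before modeling, can be sketched with a regular expression. The pattern below is a rough assumption about what the links look like (an http or https URL ending in a common image extension) and the example post is made up. Real data may need a broader pattern.

```python
import re

# Rough pattern: http(s) URL ending in a common image extension (assumption).
IMAGE_URL = re.compile(r"https?://\S+\.(?:png|jpe?g|gif)\b", re.IGNORECASE)

def strip_image_links(text):
    """Remove image URLs and collapse the whitespace they leave behind."""
    without_links = IMAGE_URL.sub(" ", text)
    return " ".join(without_links.split())

# Hypothetical post text.
post = "Here is the dashboard light https://example.com/uploads/light.JPG any ideas?"
print(strip_image_links(post))
```

Running each model once with and once without this step gives a direct comparison of how much the image links affect accuracy.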