Web Scraper: Extracting and Analyzing COVID-19 Data

Objective: The objective of this project is to build a powerful web scraper using Python that can extract data from a website. The data we will be targeting is related to the COVID-19 pandemic, specifically the number of cases, deaths, and recoveries in different countries. Once the data is extracted, we will store it in a structured format and perform some basic analysis on it. This project will not only enhance your web scraping skills but also provide you with hands-on experience in data extraction and manipulation.

Learning Outcomes: By completing this project, you will learn how to:

  • Set up a Python environment for web scraping.
  • Use popular libraries like BeautifulSoup and requests for web scraping.
  • Inspect and understand the structure of a website.
  • Extract data from a website using various scraping techniques.
  • Store the extracted data in a structured format.
  • Perform basic data analysis on the extracted data.

Steps and Tasks:

  1. Set up your Python environment

    • Install Python on your system if you haven’t already.
    • Set up a virtual environment for this project.
    • Install the necessary libraries, including BeautifulSoup and requests.
  2. Understand the target website

    • Visit the website we will be scraping: COVID - Coronavirus Statistics - Worldometer (https://www.worldometers.info/coronavirus/).
    • Take a look at the data we want to extract: the total cases, deaths, and recoveries for each country.
    • Inspect the HTML structure of the website to identify the elements that contain the data we need. You can right-click on the webpage and select “Inspect” to view the HTML code.
  3. Fetch the HTML content

    • Write a Python script that uses the requests library to fetch the HTML content of the target website.
    • Parse the HTML content using BeautifulSoup to make it easier to navigate and extract data from.

    Code Snippet:

    import requests
    from bs4 import BeautifulSoup
    
    url = "https://www.worldometers.info/coronavirus/"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
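
    Two defensive touches are worth considering here: some sites reject the default User-Agent that requests sends, and a failed download otherwise goes unnoticed until parsing fails later. A minimal sketch (the exact header string is an illustrative choice, not something Worldometer is known to require):

    ```python
    import requests

    # A browser-like User-Agent; the exact string here is illustrative
    HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; covid-scraper-tutorial)"}

    def fetch_html(url, timeout=10):
        """Download a page, raising requests.HTTPError on 4xx/5xx responses."""
        response = requests.get(url, headers=HEADERS, timeout=timeout)
        response.raise_for_status()  # fail loudly instead of parsing an error page
        return response.content
    ```

    You can then write `soup = BeautifulSoup(fetch_html(url), 'html.parser')` and continue as before.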
    
  4. Extract the data

    • Use BeautifulSoup to find the HTML elements that contain the data we want to extract.
    • Extract the country names, total cases, total deaths, and total recoveries using appropriate BeautifulSoup methods like find_all and get_text.

    Code Snippet:

    table = soup.find('table', {'id': 'main_table_countries_today'})
    rows = table.find_all('tr')
    
    for row in rows:
        data = row.find_all('td')
        # Skip header rows and any row without a full set of cells. The
        # column positions below (0 is the rank, 1 the country) match the
        # layout at the time of writing; confirm them in your browser.
        if len(data) > 6:
            country = data[1].get_text().strip()
            cases = data[2].get_text().strip()
            deaths = data[4].get_text().strip()
            recoveries = data[6].get_text().strip()
    
            # Print or store the extracted data
            print(f"Country: {country}")
            print(f"Total Cases: {cases}")
            print(f"Total Deaths: {deaths}")
            print(f"Total Recoveries: {recoveries}")
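
    To check the row-parsing logic without hitting the live site, you can first run it against a small hand-written table. The markup below is a made-up stand-in (the real table has many more columns, such as rank changes and new cases):

    ```python
    from bs4 import BeautifulSoup

    # Hypothetical miniature stand-in for the live table
    html = """
    <table id="main_table_countries_today">
      <tr><th>#</th><th>Country</th><th>Total Cases</th></tr>
      <tr><td>1</td><td>Freedonia</td><td>1,234</td></tr>
    </table>
    """

    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', {'id': 'main_table_countries_today'})

    for row in table.find_all('tr'):
        cells = row.find_all('td')
        if cells:  # the header row holds <th> cells, so find_all('td') is empty
            country = cells[1].get_text().strip()
            cases = int(cells[2].get_text().strip().replace(',', ''))
            print(country, cases)  # → Freedonia 1234
    ```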
    
  5. Store the data in a structured format

    • Create lists or dictionaries to store the extracted data in a structured format.
    • Consider using a dictionary where the country name is the key and another dictionary containing the cases, deaths, and recoveries as the value.

    Code Snippet:

    def parse_count(text):
        """Convert a cell like '1,234' to an int; treat blanks or 'N/A' as 0."""
        text = text.strip().replace(',', '')
        return int(text) if text.isdigit() else 0
    
    data = {}
    for row in rows:
        cells = row.find_all('td')
        if len(cells) > 6:
            country = cells[1].get_text().strip()
    
            # Store the data in a dictionary, keyed by country
            data[country] = {
                'cases': parse_count(cells[2].get_text()),
                'deaths': parse_count(cells[4].get_text()),
                'recoveries': parse_count(cells[6].get_text())
            }
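
    A dictionary is convenient in memory, but you will often want the results on disk too. One standard-library option is writing the same structure to a CSV file (the sample values and the filename covid_stats.csv are arbitrary; in your script, use the data dictionary you just built):

    ```python
    import csv

    # Hypothetical sample in the same shape the scraper produces;
    # in your script, use the `data` dict built from the scraped rows
    data = {
        'Freedonia': {'cases': 1234, 'deaths': 56, 'recoveries': 789},
        'Sylvania': {'cases': 987, 'deaths': 21, 'recoveries': 650},
    }

    with open('covid_stats.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['country', 'cases', 'deaths', 'recoveries'])
        for country, stats in data.items():
            writer.writerow([country, stats['cases'], stats['deaths'],
                             stats['recoveries']])
    ```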
    
  6. Perform basic data analysis

    • Calculate and print the total number of cases, deaths, and recoveries worldwide.
    • Find the country with the highest number of cases, deaths, and recoveries.
    • Calculate the percentage of cases, deaths, and recoveries for each country relative to the worldwide total.

    Code Snippet:

    total_cases = 0
    total_deaths = 0
    total_recoveries = 0
    
    # Note: if you scraped every row, the table may also contain continent
    # and "Total:" rows; exclude those to avoid double counting.
    for country, stats in data.items():
        # str()/replace() keeps this working whether the values were stored
        # as ints or as comma-formatted strings like '1,234'
        total_cases += int(str(stats['cases']).replace(',', '') or 0)
        total_deaths += int(str(stats['deaths']).replace(',', '') or 0)
        total_recoveries += int(str(stats['recoveries']).replace(',', '') or 0)
    
    # Print the totals
    print(f"Total Cases: {total_cases}")
    print(f"Total Deaths: {total_deaths}")
    print(f"Total Recoveries: {total_recoveries}")
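
    This step also asks for the hardest-hit country and per-country percentages, which the snippet above does not cover. Both fall out of max() with a key function and a little arithmetic; a self-contained sketch on made-up numbers (swap in the real data dictionary in your script):

    ```python
    # Hypothetical sample; in your script, use the `data` dict built earlier
    data = {
        'Freedonia': {'cases': 1234, 'deaths': 56, 'recoveries': 789},
        'Sylvania': {'cases': 987, 'deaths': 21, 'recoveries': 650},
    }

    total_cases = sum(stats['cases'] for stats in data.values())

    # Country with the most cases: compare entries by their 'cases' value
    worst = max(data, key=lambda country: data[country]['cases'])
    print(f"Most cases: {worst}")  # → Most cases: Freedonia

    # Each country's share of the worldwide total
    for country, stats in data.items():
        share = stats['cases'] / total_cases * 100
        print(f"{country}: {share:.1f}% of all cases")
    ```

    The same pattern, with a different key function, finds the country with the most deaths or recoveries.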
    
  7. Visualize the data

    • Install a data visualization library like Matplotlib.
    • Use the data you have extracted to create visualizations, such as a bar chart showing the number of cases, deaths, and recoveries for each country.

    Code Snippet:

    import matplotlib.pyplot as plt
    
    # Create lists for visualization
    countries = []
    cases = []
    deaths = []
    recoveries = []
    
    for country, stats in data.items():
        countries.append(country)
        # str()/replace() keeps this working whether the values were stored
        # as ints or as comma-formatted strings
        cases.append(int(str(stats['cases']).replace(',', '') or 0))
        deaths.append(int(str(stats['deaths']).replace(',', '') or 0))
        recoveries.append(int(str(stats['recoveries']).replace(',', '') or 0))
    
    # Overlay three bar series, tallest first, so the shorter
    # recoveries and deaths bars stay visible in front
    plt.bar(countries, cases, color='blue', label='Cases')
    plt.bar(countries, recoveries, color='green', label='Recoveries')
    plt.bar(countries, deaths, color='red', label='Deaths')
    
    plt.xlabel('Country')
    plt.ylabel('Count')
    plt.title('COVID-19 Statistics by Country')
    plt.legend()
    
    plt.xticks(rotation=90)
    plt.tight_layout()
    
    # Display the chart
    plt.show()
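
    One practical wrinkle: the table has roughly two hundred rows, and two hundred x-axis labels are illegible. A common fix is to plot only the top N countries, which is plain-Python sorting before the plotting code runs (the sample numbers are made up; in your script, sort the full data dictionary):

    ```python
    # Hypothetical sample; in your script, use the full `data` dict
    data = {
        'Freedonia': {'cases': 1234, 'deaths': 56, 'recoveries': 789},
        'Sylvania': {'cases': 987, 'deaths': 21, 'recoveries': 650},
        'Ruritania': {'cases': 2500, 'deaths': 90, 'recoveries': 2000},
    }

    # Sort countries by case count, largest first, and keep the top 2
    top = sorted(data.items(), key=lambda item: item[1]['cases'],
                 reverse=True)[:2]

    countries = [country for country, stats in top]
    cases = [stats['cases'] for country, stats in top]
    print(countries)  # → ['Ruritania', 'Freedonia']
    ```

    Feed these shortened lists to plt.bar in place of the full ones (with, say, `[:10]` for the ten hardest-hit countries).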
    

Evaluation:

  • Your web scraper should be able to fetch the HTML content from the target website successfully.
  • You should accurately extract the country names, total cases, deaths, and recoveries from the HTML.
  • The extracted data should be stored in a structured format, such as a dictionary.
  • Your data analysis should include calculating the total number of cases, deaths, and recoveries worldwide, finding the country with the highest number of cases, deaths, and recoveries, and calculating the percentage of cases, deaths, and recoveries for each country relative to the worldwide total.
  • You should create a visualization of the data, such as a bar chart showing the number of cases, deaths, and recoveries for each country.

Resources and Learning Materials:

Need a little extra help? Sure! Let’s break down the code snippets and go through each step in detail:

Step 1: Set up your Python environment

To get started, you’ll need to install Python on your system if you haven’t already. You can download it from the official Python website: https://www.python.org/

After that, set up a virtual environment for this project. A virtual environment allows you to isolate your project and its dependencies from your system’s Python environment. You can create a virtual environment using the following command:

python -m venv myenv

This will create a new virtual environment named ‘myenv’. You can activate the virtual environment using the appropriate command for your operating system:

  • Windows: myenv\Scripts\activate
  • macOS/Linux: source myenv/bin/activate

Next, you’ll need to install the necessary libraries, including BeautifulSoup and requests. You can use the pip package manager that comes with Python to install these libraries:

pip install beautifulsoup4 requests

Step 2: Understand the target website

The website we’ll be scraping is Worldometer’s COVID-19 page: https://www.worldometers.info/coronavirus/

We’re interested in extracting the total cases, deaths, and recoveries for each country.

Step 3: Fetch the HTML content

The first thing we need to do in our Python script is fetch the HTML content of the website. We can use the requests library to make a GET request to the website and get the HTML content. We’ll also use the BeautifulSoup library to parse the HTML and make it easier to work with.

import requests
from bs4 import BeautifulSoup

url = "https://www.worldometers.info/coronavirus/"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

In this code, we import the necessary libraries, then define the URL of the website we want to scrape. We make a GET request to the URL using requests.get() and store the response in the ‘response’ variable. We then create a BeautifulSoup object called ‘soup’ and pass in the ‘response.content’ and ‘html.parser’ as arguments. This allows BeautifulSoup to parse the HTML content.

Step 4: Extract the data

Now that we have the HTML content of the website and have parsed it using BeautifulSoup, we can start extracting the data we’re interested in. In this case, we want to extract the country names, total cases, total deaths, and total recoveries.

table = soup.find('table', {'id': 'main_table_countries_today'})
rows = table.find_all('tr')

for row in rows:
    data = row.find_all('td')
    # Skip header rows and any row without a full set of cells. The
    # column positions below (0 is the rank, 1 the country) match the
    # layout at the time of writing; confirm them in your browser.
    if len(data) > 6:
        country = data[1].get_text().strip()
        cases = data[2].get_text().strip()
        deaths = data[4].get_text().strip()
        recoveries = data[6].get_text().strip()

        # Print or store the extracted data
        print(f"Country: {country}")
        print(f"Total Cases: {cases}")
        print(f"Total Deaths: {deaths}")
        print(f"Total Recoveries: {recoveries}")

In this code, we use the ‘find’ method of the BeautifulSoup object to locate the table that contains the data we want. We inspect the HTML of the website and find that the table has an ‘id’ attribute of ‘main_table_countries_today’, so we pass that as an argument to the ‘find’ method.

Once we have the table, we use the ‘find_all’ method to get a list of all the rows in the table. We iterate over each row and use the ‘find_all’ method again to get a list of all the cells in the row.

We check that each row has a full set of cells, which skips the header row and any malformed rows. For each data row, we extract the country name, total cases, total deaths, and total recoveries by accessing the appropriate index in the ‘data’ list and using the ‘get_text’ method to get the text content of the cell. We also use the ‘strip’ method to remove any leading or trailing whitespace.

Finally, we print the extracted data. You can also store it in a data structure of your choice for further analysis.
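
As a quick illustration of what ‘get_text’ and ‘strip’ actually do, here is a throwaway fragment (not taken from the live page):

```python
from bs4 import BeautifulSoup

# A single cell with stray whitespace around its text
cell = BeautifulSoup('<td>\n  1,234  </td>', 'html.parser').td

print(repr(cell.get_text()))          # the raw text keeps the whitespace
print(repr(cell.get_text().strip()))  # → '1,234'
```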

Step 5: Store the data in a structured format

To store the extracted data in a structured format, we can use a dictionary. We’ll create a dictionary called ‘data’ where the country name will be the key and another dictionary containing the cases, deaths, and recoveries as the value.

def parse_count(text):
    """Convert a cell like '1,234' to an int; treat blanks or 'N/A' as 0."""
    text = text.strip().replace(',', '')
    return int(text) if text.isdigit() else 0

data = {}

for row in rows:
    cells = row.find_all('td')
    if len(cells) > 6:
        country = cells[1].get_text().strip()

        # Store the data in a dictionary, keyed by country
        data[country] = {
            'cases': parse_count(cells[2].get_text()),
            'deaths': parse_count(cells[4].get_text()),
            'recoveries': parse_count(cells[6].get_text())
        }

In this code, we initialize an empty dictionary called ‘data’. We then iterate over each row as before and extract the data. Instead of printing the data, we now store it in the ‘data’ dictionary, using the country name as the key and a dictionary containing the cases, deaths, and recoveries as the value. The small ‘parse_count’ helper converts comma-formatted strings like ‘1,234’ into integers, which makes the analysis in the next step straightforward.

Step 6: Perform basic data analysis

For the basic data analysis, we’ll calculate the total number of cases, deaths, and recoveries worldwide, find the country with the highest number of cases, deaths, and recoveries, and calculate the percentage of cases, deaths, and recoveries for each country relative to the worldwide total.

total_cases = 0
total_deaths = 0
total_recoveries = 0

for country, stats in data.items():
    # str()/replace() keeps this working whether the values were stored
    # as ints or as comma-formatted strings like '1,234'
    total_cases += int(str(stats['cases']).replace(',', '') or 0)
    total_deaths += int(str(stats['deaths']).replace(',', '') or 0)
    total_recoveries += int(str(stats['recoveries']).replace(',', '') or 0)

# Print the totals
print(f"Total Cases: {total_cases}")
print(f"Total Deaths: {total_deaths}")
print(f"Total Recoveries: {total_recoveries}")

In this code, we initialize three variables to keep track of the total cases, deaths, and recoveries worldwide. We then iterate over each country in the ‘data’ dictionary and read its cases, deaths, and recoveries from the nested ‘stats’ dictionary.

We add each country’s numbers to the respective running totals and, finally, print the totals. If your scrape also captured continent or “Total:” rows, exclude them here to avoid double counting.

Step 7: Visualize the data

For data visualization, we can create a bar chart using a library like Matplotlib. We’ll install Matplotlib first using the following command:

pip install matplotlib

With Matplotlib installed, we can import it and build the chart:

import matplotlib.pyplot as plt

# Create lists for visualization
countries = []
cases = []
deaths = []
recoveries = []

for country, stats in data.items():
    countries.append(country)
    # str()/replace() keeps this working whether the values were stored
    # as ints or as comma-formatted strings
    cases.append(int(str(stats['cases']).replace(',', '') or 0))
    deaths.append(int(str(stats['deaths']).replace(',', '') or 0))
    recoveries.append(int(str(stats['recoveries']).replace(',', '') or 0))

# Overlay three bar series, tallest first, so the shorter
# recoveries and deaths bars stay visible in front
plt.bar(countries, cases, color='blue', label='Cases')
plt.bar(countries, recoveries, color='green', label='Recoveries')
plt.bar(countries, deaths, color='red', label='Deaths')

plt.xlabel('Country')
plt.ylabel('Count')
plt.title('COVID-19 Statistics by Country')
plt.legend()

plt.xticks(rotation=90)
plt.tight_layout()

# Display the chart
plt.show()

In this code, we import the pyplot module of Matplotlib. We then create four empty lists: ‘countries’, ‘cases’, ‘deaths’, and ‘recoveries’. We iterate over each country in the ‘data’ dictionary and append the country name to the ‘countries’ list and the corresponding numeric cases, deaths, and recoveries to their respective lists.

We then use the ‘bar’ function to draw three overlaid bar series, passing in the ‘countries’ list as the x-axis and the ‘cases’, ‘recoveries’, and ‘deaths’ lists as the bar heights, with colors and labels for each series. Drawing the tallest series first keeps the shorter bars visible in front of it.

We also add labels to the x-axis and y-axis, a title to the chart, and a legend. We rotate the x-axis labels by 90 degrees to make them more readable. Finally, we use the ‘show’ function to display the chart. With every country on the x-axis the chart will be crowded, so consider plotting only the ten hardest-hit countries.
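
Overlaying the series only works because each country’s deaths and recoveries are smaller than its cases. If you would rather show the three series side by side, shift each one by a bar width. A minimal sketch on made-up numbers (the Agg backend and savefig let it run without a display; in your interactive script, keep plt.show()):

```python
import matplotlib
matplotlib.use('Agg')  # draw off-screen so the snippet also runs headless
import matplotlib.pyplot as plt

# Hypothetical sample values
countries = ['Freedonia', 'Sylvania']
cases = [1234, 987]
deaths = [56, 21]
recoveries = [789, 650]

x = range(len(countries))
width = 0.25  # each of the three bars takes a quarter of the slot

fig, ax = plt.subplots()
# Offset each series so the bars sit next to each other, not on top
ax.bar([i - width for i in x], cases, width, color='blue', label='Cases')
ax.bar(list(x), recoveries, width, color='green', label='Recoveries')
ax.bar([i + width for i in x], deaths, width, color='red', label='Deaths')

ax.set_xticks(list(x))
ax.set_xticklabels(countries)
ax.legend()
fig.savefig('covid_grouped.png')
```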

That’s it! You’ve now built a web scraper in Python using the BeautifulSoup and requests libraries. The scraper fetches the HTML content of a website, extracts data from it, stores the data in a structured format, performs basic data analysis, and visualizes the data.

Remember that the code snippets above are reference pieces for the individual steps; combine them into a single complete script and run that.

I hope this helps! Let me know if you have any further questions.

@joy.b has been assigned as the mentor.