Analyzing and Visualizing COVID-19 Data using R

Objective: Use R to analyze and visualize COVID-19 data. By working on this project, you will gain hands-on experience in data analysis, statistical modeling, and data visualization using R. You will also develop a deeper understanding of the COVID-19 pandemic and its impact through the lens of data.

Learning Outcomes:

  1. Proficiency in R programming for data analysis, statistical modeling, and data visualization.
  2. Understanding of data preprocessing techniques, including data cleaning and transformation.
  3. Knowledge of statistical concepts and their application in modeling and analyzing real-world data.
  4. Ability to interpret and communicate the findings from data analysis and visualization.
  5. Familiarity with the COVID-19 pandemic and its data sources.

Steps and Tasks:

  1. Gather COVID-19 Data

    • Access reliable sources of COVID-19 data, such as the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University.
    • Download the dataset in CSV format and save it to your local machine.
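
Before writing the import code, it helps to know the layout of the file you downloaded. The sketch below assumes a time-series file in wide format, with one column per date (for example "1/22/20") next to "Province/State", "Country/Region", "Lat", and "Long". The toy values and the names `wide` and `long` are made up for illustration; check names() on your own file before relying on this layout.

```r
library(dplyr)
library(tidyr)

# Tiny stand-in for a wide-format file (values are invented)
wide <- tibble::tibble(
  `Province/State` = c(NA, NA),
  `Country/Region` = c("US", "India"),
  Lat  = c(40.0, 20.6),
  Long = c(-100.0, 79.0),
  `1/22/20` = c(1, 0),
  `1/23/20` = c(1, 0),
  `1/24/20` = c(2, 0)
)

# Reshape to one row per country and date
long <- wide %>%
  select(-any_of(c("Province/State", "Lat", "Long"))) %>%
  rename(Country = `Country/Region`) %>%
  pivot_longer(-Country, names_to = "Date", values_to = "Confirmed") %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%y"))
```

A long table like this, with Country, Date, and Confirmed columns, is the shape the later plotting and time series steps expect.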
  2. Load Required Libraries

    • Install and load the necessary R libraries for this project, including tidyverse, ggplot2, dplyr, lubridate, and forecast.

    Code Snippet:

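    # Install required libraries (only needed once per machine)
    install.packages(c("tidyverse", "ggplot2", "dplyr", "lubridate", "forecast"))
    # Load libraries
    library(tidyverse)  # also attaches ggplot2, dplyr, and readr
    library(lubridate)
    library(forecast)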
  3. Import and Preprocess the Data

    • Use the read_csv() function from the readr package (loaded as part of the tidyverse) to import the COVID-19 data into R.
    • Perform data preprocessing tasks, such as removing unnecessary columns, renaming columns, and converting data types.
    • Handle missing values and outliers appropriately.

    Code Snippet:

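    # Import the COVID-19 data
    covid_data <- read_csv("path/to/covid_data.csv")
    # Data preprocessing (column names vary by file; check names(covid_data) first)
    covid_data <- covid_data %>%
      select(-any_of(c("Province/State", "Lat", "Long"))) %>%
      rename(Date = "ObservationDate", Country = "Country/Region") %>%
      # Adjust the date format string to match your file
      mutate(Date = as.Date(Date, format = "%Y-%m-%d"), Confirmed = as.numeric(Confirmed)) %>%
      # Collapse province-level rows into one row per country and date
      group_by(Country, Date) %>%
      summarise(Confirmed = sum(Confirmed, na.rm = TRUE), .groups = "drop")
    # View the preprocessed data
    head(covid_data)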
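
The snippet above does not cover the missing-value and outlier task. One possible approach, sketched here on made-up numbers, is to carry the last reported value forward across gaps and to flag any day where a cumulative count drops below an earlier value. The names `cases`, `cleaned`, and `Suspect` are hypothetical, and forward filling is only one of several reasonable choices.

```r
library(dplyr)

# Invented cumulative counts with one gap (NA) and one implausible drop (3)
cases <- tibble::tibble(
  Date = as.Date("2020-03-01") + 0:5,
  Confirmed = c(10, 12, NA, 18, 3, 25)
)

cleaned <- cases %>%
  arrange(Date) %>%
  # Fill each gap with the most recent reported value
  tidyr::fill(Confirmed, .direction = "down") %>%
  # A cumulative count should never decrease; flag rows where it does
  mutate(Suspect = Confirmed < cummax(Confirmed))
```

Flagging rather than deleting suspect rows keeps the decision visible, so you can explain in your write-up how each one was handled.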
  4. Perform Exploratory Data Analysis (EDA)

    • Conduct EDA to gain insights into the COVID-19 data.
    • Summarize the data using descriptive statistics.
    • Visualize the data using plots and graphs.

    Code Snippet:

    # Summary statistics
    summary(covid_data)
    # Visualizations
    ggplot(data = covid_data, aes(x = Date, y = Confirmed, color = Country)) +
      geom_line() +
      labs(title = "COVID-19 Confirmed Cases Over Time",
           x = "Date",
           y = "Confirmed Cases",
           color = "Country")
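
Descriptive statistics are often more informative per country than for the whole table. The sketch below, on invented data, computes the latest cumulative total per country and derives daily new cases as the difference between consecutive cumulative counts. The names `toy`, `totals`, and `daily` are hypothetical, and treating the first day's count as that day's new cases is an assumption.

```r
library(dplyr)

# Invented cumulative counts for two countries
toy <- tibble::tibble(
  Country = rep(c("A", "B"), each = 4),
  Date = rep(as.Date("2020-03-01") + 0:3, times = 2),
  Confirmed = c(1, 3, 6, 10, 0, 2, 2, 5)
)

# Latest cumulative total and number of reporting days per country
totals <- toy %>%
  group_by(Country) %>%
  summarise(Latest = max(Confirmed), Days = n(), .groups = "drop")

# Daily new cases: difference between consecutive cumulative counts
daily <- toy %>%
  group_by(Country) %>%
  arrange(Date, .by_group = TRUE) %>%
  mutate(New = Confirmed - dplyr::lag(Confirmed, default = 0)) %>%
  ungroup()
```

Plotting the New column instead of Confirmed usually makes waves and turning points easier to see than the cumulative curve does.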
  5. Build a Time Series Forecasting Model

    • Create a time series dataset for a specific country using the ts() function.
    • Split the data into training and testing sets.
    • Build a forecasting model using the ARIMA (AutoRegressive Integrated Moving Average) method.
    • Evaluate the model’s performance using metrics such as Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).

    Code Snippet:

    # Time series analysis for a specific country (e.g., United States)
    us_data <- covid_data %>% filter(Country == "US") %>% arrange(Date) %>% select(Date, Confirmed)
    # Daily observations with a weekly seasonal period (frequency = 7)
    us_ts <- ts(us_data$Confirmed, frequency = 7)
    # Split the data into training and testing sets (last 30 days held out)
    n_test <- 30
    us_train <- subset(us_ts, end = length(us_ts) - n_test)
    us_test <- subset(us_ts, start = length(us_ts) - n_test + 1)
    # Build an ARIMA model
    arima_model <- auto.arima(us_train)
    # Forecast as many steps as there are test observations
    forecast <- forecast(arima_model, h = n_test)
    # Evaluate the model (the "Test set" row reports MAE and RMSE)
    accuracy(forecast, us_test)
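
To see what the MAE and RMSE figures mean, it can help to compute them by hand once. The sketch below uses invented actual and predicted values; MAE is the mean of the absolute errors and RMSE is the square root of the mean squared error.

```r
# Invented values for illustration
actual    <- c(100, 110, 125, 140)
predicted <- c( 98, 113, 120, 146)

errors <- actual - predicted       # 2, -3, 5, -6
mae    <- mean(abs(errors))        # (2 + 3 + 5 + 6) / 4 = 4
rmse   <- sqrt(mean(errors^2))     # sqrt((4 + 9 + 25 + 36) / 4) = sqrt(18.5)
```

RMSE is never smaller than MAE, and the gap between them grows when a few errors are much larger than the rest.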
  6. Visualize the Forecasted Data

    • Plot the original time series data along with the forecasted values.
    • Add prediction intervals (the shaded bands around the forecast) to the plot.
    • Customize the visualization for better clarity.

    Code Snippet:

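    # Visualize the forecasted data
    # Build plain data frames from the ts and forecast objects for ggplot2
    observed_df <- data.frame(Index = as.numeric(time(us_ts)), Confirmed = as.numeric(us_ts))
    forecast_df <- data.frame(Index = as.numeric(time(forecast$mean)),
                              Point.Forecast = as.numeric(forecast$mean),
                              Lo.80 = as.numeric(forecast$lower[, 1]),  # 80% is the first default level
                              Hi.80 = as.numeric(forecast$upper[, 1]))
    ggplot() +
      geom_line(data = observed_df, aes(x = Index, y = Confirmed, color = "Observed")) +
      geom_line(data = forecast_df, aes(x = Index, y = Point.Forecast, color = "Forecast")) +
      geom_ribbon(data = forecast_df, aes(x = Index, ymin = Lo.80, ymax = Hi.80), fill = "blue", alpha = 0.2) +
      labs(title = "Forecast of COVID-19 Confirmed Cases in the US",
           x = "Time index (one unit = 7 days)",
           y = "Confirmed Cases",
           color = "") +
      scale_color_manual(values = c("Observed" = "black", "Forecast" = "red"))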
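
The shaded band in a forecast plot comes from the point forecast plus or minus a multiple of its standard error. The sketch below shows that arithmetic for an 80% band using base R's arima() and predict() on a simulated series. This is a stand-in chosen so the example runs without extra packages, not the forecast package's own code path; the series and the model order are arbitrary assumptions.

```r
set.seed(1)
# Simulated random-walk series standing in for real case counts
y <- ts(cumsum(rnorm(60)), frequency = 7)

# Arbitrary small model for illustration
fit  <- arima(y, order = c(1, 1, 0))
pred <- predict(fit, n.ahead = 7)   # returns $pred (point forecast) and $se

z <- qnorm(0.9)                     # about 1.28 for a two-sided 80% band
bands <- data.frame(
  Point = as.numeric(pred$pred),
  Lo.80 = as.numeric(pred$pred - z * pred$se),
  Hi.80 = as.numeric(pred$pred + z * pred$se)
)
```

Because the standard error grows with the forecast horizon, the band widens the further ahead you predict, which is worth pointing out when you interpret the plot.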
  7. Create Interactive Visualizations

    • Use the plotly library to create interactive visualizations of the COVID-19 data.
    • Customize the interactive plots with labels, titles, and color schemes.

    Code Snippet:

    # Interactive visualization of COVID-19 data
    # plotly is not in the install list from step 2: install.packages("plotly")
    library(plotly)
    covid_data %>%
      filter(Country %in% c("US", "India", "Brazil", "Russia", "South Africa")) %>%
      plot_ly(x = ~Date, y = ~Confirmed, color = ~Country, type = "scatter", mode = "lines",
              colors = c("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd")) %>%
      layout(title = "COVID-19 Confirmed Cases Over Time",
             xaxis = list(title = "Date"),
             yaxis = list(title = "Confirmed Cases"))
  8. Draw Insights and Conclusions

    • Interpret the findings from your data analysis and visualization.
    • Draw meaningful insights about the COVID-19 pandemic and its impact.
    • Summarize your conclusions in a clear and concise manner.


Evaluation:

Your project will be evaluated based on the following criteria:

  • Correctness and efficiency of the R code.
  • Quality of data preprocessing, including handling missing values and outliers.
  • Insightfulness of the data analysis and visualization.
  • Accuracy and interpretation of the time series forecasting model.
  • Clarity and coherence of the insights and conclusions.

Resources and Learning Materials:

Need a little extra help?

  1. For data preprocessing, if you encounter an error like “Error: Can’t subset columns that don’t exist”, it means that the columns you are trying to select or filter do not exist in your dataset. You can use the names() function to check the column names in your dataset and make sure they match the ones you are trying to select or filter.
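  2. When visualizing the COVID-19 data using ggplot2, if you get a warning message like “Removed X rows containing missing values”, it means that there are missing values in your data and ggplot2 is automatically dropping those rows from the plot. You can use the na.omit() function to remove missing values before plotting, but note that it drops every row with a missing value in any column. To drop rows only where a specific column is missing, use drop_na() from tidyr, for example drop_na(Confirmed).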
  3. If you are new to time series forecasting and the ARIMA model, it can be challenging to select the appropriate parameters for the model. You can use the auto.arima() function from the forecast package to automatically select the best ARIMA model based on the data.
  4. When evaluating the performance of your time series forecasting model, if you get an error like “Error in accuracy.default(forecast, us_test) : First argument should be a forecast object or a time series”, it means that the accuracy() function expects a forecast object or a time series as the first argument. Make sure you have correctly defined your forecast object and your test data as time series using the ts() function.
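
For the first tip, a quick way to find out which columns are actually missing is to compare the names you want against names() of the data frame. The sketch below uses a hypothetical one-row data frame `df` in place of your imported file.

```r
# Hypothetical data frame standing in for the imported file
df <- data.frame(`Country/Region` = "US", Confirmed = 1, check.names = FALSE)

wanted  <- c("Province/State", "Country/Region", "Lat", "Long")
missing <- setdiff(wanted, names(df))    # requested but absent from df
present <- intersect(wanted, names(df))  # safe to select or drop
```

If `missing` is not empty, either correct the spelling to match the file or wrap the names in any_of() inside select() so absent columns are skipped instead of raising an error.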

@joy.b has been assigned as the mentor.