Data Science: Data Visualization

Module 1: Drug Poisoning Mortality by State

Ayush Noori | EduSTEM Data Science


Welcome to Module 1 of EduSTEM Data Science! This course is designed for motivated middle school and high school students who are curious to learn about diverse contemporary challenges through the lens of data science.

Module 1 will give you a hands-on introduction to data visualization in R. We will use the popular ggplot2 package to create an animated bar plot representation of the drug poisoning epidemic over time in the United States. Please refer to the background reading of this module to learn more about unintentional drug overdose and the opioid epidemic.

This module uses data from the National Center for Health Statistics (NCHS), available here.

This dataset describes drug poisoning deaths at the U.S. and state level by selected demographic characteristics and includes age-adjusted death rates for drug poisoning. Deaths are classified using the International Classification of Diseases, Tenth Revision (ICD–10).

Drug-poisoning deaths are defined as those with one of the following ICD–10 underlying cause-of-death codes:

  • X40–X44 (unintentional),
  • X60–X64 (suicide),
  • X85 (homicide), or
  • Y10–Y14 (undetermined intent).
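As a rough illustration of the code ranges listed above, the following sketch maps an ICD–10 code to its intent category. The helper name `classify_intent` and the lookup table are hypothetical, written only for this example; they are not part of the NCHS dataset or of any package.

```r
# Hypothetical lookup table built from the four code ranges listed above
intent_codes <- list(
  unintentional = sprintf("X%d", 40:44),
  suicide       = sprintf("X%d", 60:64),
  homicide      = "X85",
  undetermined  = sprintf("Y%d", 10:14)
)

# Hypothetical helper: returns the intent category for each code,
# or NA when a code falls outside the drug-poisoning ranges
classify_intent <- function(code) {
  category <- toupper(substr(code, 1, 3))  # keep the 3-character category, e.g. "X42"
  lookup <- setNames(
    rep(names(intent_codes), lengths(intent_codes)),
    unlist(intent_codes, use.names = FALSE)
  )
  unname(lookup[category])
}

classify_intent(c("X42", "X61", "X85", "Y12", "R99"))
```

The last code, R99, is not a drug-poisoning code, so the helper returns NA for it.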

    Estimates are based on the National Vital Statistics System multiple cause-of-death mortality files (1). Age-adjusted death rates (deaths per 100,000 U.S. standard population for 2000) are calculated using the direct method. Populations used for computing death rates for 2011–2016 are postcensal estimates based on the 2010 U.S. census. Rates for census years are based on populations enumerated in the corresponding censuses. Rates for noncensus years before 2010 are revised using updated intercensal population estimates and may differ from rates previously published. Death rates for some states and years may be low due to a high number of unresolved pending cases or misclassification of ICD–10 codes for unintentional poisoning as R99, “Other ill-defined and unspecified causes of mortality” (2). For example, this issue is known to affect New Jersey in 2009 and West Virginia in 2005 and 2009 but also may affect other years and other states. Drug poisoning death rates may be underestimated in those instances.
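To make the direct method more concrete, here is a minimal sketch of the arithmetic with three age groups. Every number below is hypothetical and chosen only to show the calculation; the real rates use the 2000 U.S. standard population with more age groups.

```r
# All values are hypothetical, for illustration only
deaths     <- c(10, 40, 30)              # deaths in each age group for one state and year
population <- c(200000, 300000, 100000)  # state population in each age group
std_share  <- c(0.30, 0.50, 0.20)        # share of the standard population in each age group

# Age-specific death rates per 100,000 people
age_specific_rate <- deaths / population * 100000

# Direct method: weight each age-specific rate by the standard population share
age_adjusted_rate <- sum(std_share * age_specific_rate)

# Crude rate for comparison (ignores the age structure)
crude_rate <- sum(deaths) / sum(population) * 100000

round(c(crude = crude_rate, age_adjusted = age_adjusted_rate), 2)
```

Because every state is weighted by the same standard population, age-adjusted rates can be compared across states with different age structures.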

    References:

    1. National Center for Health Statistics. National Vital Statistics System: Mortality data. Available from: http://www.cdc.gov/nchs/deaths.htm.
    2. CDC. CDC Wonder: Underlying cause of death 1999–2016. Available from: Underlying Cause of Death 1999-2020.

    Background Reading

    1. Module 1 Background #1.pdf (287.4 KB)
    2. Module 1 Background #2.pdf (724.3 KB)
    3. Module 1 Background #3.pdf (627.5 KB)

    Publication from ncbi.nlm.nih.gov

    Worldwide Prevalence and Trends in Unintentional Drug Overdose: A Systematic Review of the Literature.

    Setup

    Load the requisite libraries.

    if(!require(data.table)){
      install.packages("data.table")
      library(data.table)
    }
    
    if(!require(ggplot2)){
      install.packages("ggplot2")
      library(ggplot2)
    }
    
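    if(!require(gganimate)){
      install.packages("gganimate")
      library(gganimate)
    }
    
    # gifski provides the GIF renderer used by animate() later in this module
    if(!require(gifski)){
      install.packages("gifski")
      library(gifski)
    }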
    
    if(!require(dplyr)){
      install.packages("dplyr")
      library(dplyr)
    }
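If you would like to check which packages are missing before installing anything, a small sketch like the one below may help. The variable names are hypothetical; `requireNamespace()` is a base R function that reports whether a package is available without attaching it.

```r
# Packages used in this module (gifski is needed later by gifski_renderer())
pkgs <- c("data.table", "ggplot2", "gganimate", "dplyr", "gifski")

# TRUE for each package that is already installed
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All packages are installed.")
}
```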
    

    Set the working directory to where you have saved the NCHS data.

    dir = "<insert appropriate directory address here>"
    setwd(dir)
    

    Read and Process Data

    Read the data into the RStudio workspace. Then, select the needed columns from the dataset and create a rank column to facilitate plot animation.

    dat = fread("Module 1 Data.csv", header = TRUE)
    
    # select the needed columns from the dataset
    
    dat = dat[, c("State", "Year", "Age-adjusted Rate")]
    colnames(dat) = c("State", "Year", "Rate")
    
    # create a rank column which will allow plot animation
    
    dat = dat %>%
      group_by(Year) %>%
      # the * 1 makes it possible to have non-integer ranks while sliding
      mutate(rank = min_rank(-Rate) * 1) %>%
      ungroup()
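To see what the rank column contains, here is a base R sketch on a tiny made-up table. The values are hypothetical and not taken from the NCHS file. `rank(..., ties.method = "min")` gives the same result as `min_rank()` from dplyr, so the state with the highest rate in each year gets rank 1.

```r
# Toy data (hypothetical values)
toy <- data.frame(
  State = c("A", "B", "C", "A", "B", "C"),
  Year  = c(2000, 2000, 2000, 2001, 2001, 2001),
  Rate  = c(5.1, 9.3, 7.0, 8.2, 6.4, 8.2)
)

# Rank the negated rates within each year, so the largest rate is ranked first
toy$rank <- ave(-toy$Rate, toy$Year,
                FUN = function(x) rank(x, ties.method = "min"))
toy
```

In 2001, states A and C have the same rate and therefore share rank 1. In the animated plot, tied states are drawn at the same position, so their bars overlap.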
    

    Build Static Plots

    Build the static plots with the popular ggplot2 package, which produces polished graphics but uses a syntax quite different from base R graphics.

    ggplot2 works with dataframes. Here, we supply the dat object as a dataframe to the ggplot() function. Aesthetic mappings from the source dataset, including the X and Y axes, are specified inside the aes() function.

    The layers in ggplot2 are called geoms. Once the dataframe is specified and base setup is completed, you can append the geoms one on top of the other by calling their respective functions. Here, we use geom_tile() to create the bar plot and geom_text() to create the data labels along the y-axis. The documentation has an extensive list of available geoms.
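Before adding any animation, it can help to see the two geoms layered in a single static plot. The sketch below uses a tiny made-up table for one year (hypothetical values, not from the NCHS file) and a simplified version of the module's plot.

```r
library(ggplot2)

# Toy data for a single year (hypothetical values)
toy <- data.frame(
  State = c("A", "B", "C"),
  Rate  = c(9.3, 7.0, 5.1),
  rank  = c(1, 2, 3)
)

p_static <- ggplot(toy, aes(x = rank, fill = State)) +
  # layer 1: one tile per state, centered at Rate / 2 so the bar spans 0 to Rate
  geom_tile(aes(y = Rate / 2, height = Rate, width = 0.9), alpha = 0.8) +
  # layer 2: a text label at the base of each bar
  geom_text(aes(y = 0, label = paste0(State, ": ", Rate, " ")), hjust = 1) +
  coord_flip(clip = "off") +  # horizontal bars; labels may draw outside the panel
  scale_x_reverse() +         # put rank 1 at the top
  theme(legend.position = "none",
        plot.margin = margin(1, 1, 1, 3, "cm"))

print(p_static)
```

Each `+` adds one layer or setting on top of the previous ones, which is why the order of the geoms matters when layers overlap.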

    Finally, the key function here is transition_states(), which stitches the individual static plots together by year so that the plot can be animated.

    p = ggplot(dat, aes(rank, group = State,
                        fill = as.factor(State), color = as.factor(State))) +
      geom_tile(aes(y = Rate/2,
                    height = Rate,
                    width = 0.9), alpha = 0.8, color = NA) +
      
      # text labels along y-axis after coordinates are flipped (requires clip = "off" in coord_*)
      geom_text(aes(y = 0, label = paste0(State, ": ", Rate, "  ")),
                vjust = 0.2, hjust = 1) +
      
      coord_flip(clip = "off", expand = FALSE) +
      scale_y_continuous(labels = scales::comma) +
      scale_x_reverse() +
      guides(color = "none", fill = "none") +  # hide legends (FALSE is deprecated in recent ggplot2)
      
      labs(title='{closest_state}', x = "", y = "Drug Poisoning Mortality by State") +
      theme(plot.title = element_text(hjust = 0, size = 22),
            axis.ticks.y = element_blank(),  # these relate to the axes post-flip
            axis.text.y  = element_blank(),  # these relate to the axes post-flip
            plot.margin = margin(1,1,1,4, "cm")) +
      
      # the transition_states() function stitches all the individual plots together by year
      
      transition_states(Year, transition_length = 4, state_length = 1) +
      ease_aes('cubic-in-out')
    

    Create Animated Plot

    Finally, animate the plot! The rendered GIF is saved in the working directory under the file name passed to gifski_renderer().

    animate(p, fps = 25, duration = 30, width = 800, height = 2000, renderer = gifski_renderer("Drug Poisoning Mortality by State.gif"))
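As a rough guide to what these settings mean, the sketch below works out how many frames the call above asks for. It assumes that the frame count is fps multiplied by duration, and that frames are divided between transitions and pauses in proportion to transition_length and state_length; treat both as approximations rather than exact guarantees.

```r
fps      <- 25  # frames per second, as passed to animate()
duration <- 30  # length of the animation in seconds, as passed to animate()

# Approximate number of frames that must be rendered
total_frames <- fps * duration

# Relative lengths passed to transition_states()
transition_length <- 4
state_length      <- 1

# Approximate share of the animation spent moving between years
share_transition <- transition_length / (transition_length + state_length)

c(total_frames = total_frames, share_transition = share_transition)
```

Rendering this many tall frames can take several minutes. While experimenting, smaller values of fps or duration will give a faster preview.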
    



    With these courses, we hope to further our mission to make high-quality STEMX education accessible for all. For questions or support, please feel free to reach out to me at anooristudent@outlook.com.

    Best Regards,

    Ayush Noori

    EduSTEM Boston Chapter Founder


    Resources:

    1. R and RStudio Desktop

    R is a free programming language and software environment for statistical computing, bioinformatics, and data visualization. RStudio is the associated free integrated development environment. Please download them via the instructions here to complete the course activities.

    2. Stack Overflow

    Stack Overflow is a question and answer site for programmers, and hosts a wide variety of answers to common R questions. It is an indispensable resource for the nascent R programmer. You can also refer to the RStudio Community.

