Exploring COVID-19 Data
As the COVID-19 virus wreaks havoc across our world, many of us are sitting in our homes trying to figure out if there’s something we can do to help. Many of those with medical skills have already responded to calls for help, and some of us with data science experience have been looking for ways to contribute. I am pretty late to the game on this, but I still wanted to start from scratch. Part of my interest in starting from scratch is that I’ve found you don’t fully understand a problem until you really dive in and play with the data. This project gave me the opportunity to do that. If you want to see some work from folks who are ahead of me, I recommend this blog post.
Data
The first of the two data sources that I’ve set up in my workflows is the data from Johns Hopkins University (I pull from their Github repository found here). The Johns Hopkins University repository has been one of the primary sources that most media outlets use for their official numbers. The JHU data repository is updated daily and feeds a dashboard found here.
The second source covers USA county-level data. For about a week during the crisis, JHU’s county data wasn’t as clean and easy to parse, so for county-level analysis I’ve turned to this site instead. We’ll analyze both sources below:
Analyzing National, State, and County Trends
Let’s start by computing some of the common trend graphics that are seen in many news articles. For example, at the time that I’m writing this, the New York Times trend analysis can be found at this link.
For my analysis, I will be using the R programming language. Similar
analysis could be done in Python using the Pandas
package.
World Trends
We’ll start by analyzing some world trends. To do this we’ll use the JHU
COVID-19 data found on Github. To streamline this process, I’ve cloned
the Github repository and will run a git pull
in the terminal before
running the code below. This will ensure that I have the most updated
data.
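As a side note, the pull can also be scripted from R so the data refreshes every time the analysis runs. This is just a sketch: the clone location (repo_path) is my own placeholder, so point it at wherever you keep the repository.
# Refresh the local clone of the JHU repository before reading the data.
# repo_path is a placeholder; adjust it to your own clone location.
repo_path <- '~/COVID-19'
if (!dir.exists(repo_path)) {
  system(paste('git clone https://github.com/CSSEGISandData/COVID-19.git', repo_path))
} else {
  system(paste('git -C', repo_path, 'pull'))
}
Alternatively, readr::read_csv can read the time-series CSV straight from the raw Github URL, though a local clone keeps each run reproducible even when you’re offline.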
Now that we have the data, let’s read it in and look at the format.
require(readr)
require(knitr)
df <- read_csv('/Users/dbeskow/Dropbox/CMU/virus/COVID-19/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
kable(head(df[,1:10]))
Province/State | Country/Region | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 |
---|---|---|---|---|---|---|---|---|---|
NA | Afghanistan | 33.0000 | 65.0000 | 0 | 0 | 0 | 0 | 0 | 0 |
NA | Albania | 41.1533 | 20.1683 | 0 | 0 | 0 | 0 | 0 | 0 |
NA | Algeria | 28.0339 | 1.6596 | 0 | 0 | 0 | 0 | 0 | 0 |
NA | Andorra | 42.5063 | 1.5218 | 0 | 0 | 0 | 0 | 0 | 0 |
NA | Angola | -11.2027 | 17.8739 | 0 | 0 | 0 | 0 | 0 | 0 |
NA | Antigua and Barbuda | 17.0608 | -61.7964 | 0 | 0 | 0 | 0 | 0 | 0 |
We can see that the data is in wide format. In other
words, each day the data curators add another column of data
with the new statistics for each geographic region. This format makes it
easy to update the data, but hard to analyze in R. Let’s convert it to
long format, and convert the dates from character type to date
type with the lubridate
package:
require(dplyr)
require(tidyr)
require(lubridate)
df_long <- gather(df, date, count, 5:ncol(df))
names(df_long) <- c('state', 'country', 'lat', 'lon', 'date', 'count')
df_long$date <- mdy(df_long$date)
kable(head(df_long))
state | country | lat | lon | date | count |
---|---|---|---|---|---|
NA | Afghanistan | 33.0000 | 65.0000 | 2020-01-22 | 0 |
NA | Albania | 41.1533 | 20.1683 | 2020-01-22 | 0 |
NA | Algeria | 28.0339 | 1.6596 | 2020-01-22 | 0 |
NA | Andorra | 42.5063 | 1.5218 | 2020-01-22 | 0 |
NA | Angola | -11.2027 | 17.8739 | 2020-01-22 | 0 |
NA | Antigua and Barbuda | 17.0608 | -61.7964 | 2020-01-22 | 0 |
This is much easier to work with.
When we look at the state column, we see that some nations are broken out by state or province (China, Australia, and others), while other nations are reported only at the national level. In order to compare at the national level, we need to take the countries that are broken out by state and aggregate their data up to the national level. I demonstrate this below with China:
# Separate the data
china <- df_long %>% filter(country == 'China')
# Aggregate to national level
china <- china %>%
group_by(country,date) %>%
summarise(count = sum(count)) %>%
mutate(state = as.character(NA), lat = as.numeric(NA), lon = as.numeric(NA))
# Combine back together
df_long <- df_long %>% bind_rows(china)
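The same pattern generalizes to the other countries that are broken out by state (Australia, Canada, and so on). As a sketch, and not something the rest of this post depends on, you could aggregate every country to the national level in one pass. Note that this should be run on the original df_long, before the China rows above are appended, to avoid double-counting China.
# A more general version (sketch only): one national row per country per date.
# For countries already reported at the national level this is a no-op, since
# summing a single row returns the same count.
national <- df_long %>%
  group_by(country, date) %>%
  summarise(count = sum(count)) %>%
  ungroup() %>%
  mutate(state = NA_character_, lat = NA_real_, lon = NA_real_)
The post continues with the China-only aggregation above.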
Now that we have data at the national level, we can move on to visualization.
For our first visualization, let’s create the common trend plot that is found on most news sites. This plot shows the top countries in the world, with counts on a log scale.
require(ggplot2)
require(ggpubr)
s <- df_long %>% group_by(country) %>% summarise(total = sum(count)) %>% top_n(10)
world <- df_long %>% filter(country %in% s$country) %>% filter(count > 100) %>%
filter(is.na(state)) #%>% filter(date > ymd('20200228'))
ggline(world, x = "date", y = "count",
color = 'country',
linetype = 'country',
yscale = 'log10',
plot_type = 'l',
show.line.label = TRUE,
repel = TRUE,
legend = 'none')
As you can see, it’s a little hard to compare the countries because they start at different times. If you look at a similar plot by the New York Times or other media, you’ll see that they use days since the nth case on the horizontal axis. Let’s see if we can replicate this. As shown above, we’ve already filtered to counts over 100 in our data, so we just need to build a new x-axis (days since 100th case). I show how to do this below:
library(ggrepel)
min_date <- world %>% group_by(country) %>% summarize(min_date = min(date))
world_adj <- world %>% left_join(min_date) %>% mutate(days_since_100 = date - min_date)
ggline(world_adj, x = "days_since_100", y = "count", color = 'country',
linetype = 'country',
yscale = 'log10',
plot_type = 'l',
show.line.label = TRUE,
repel = TRUE)
That looks much better.
Our plot, like many of the plots the news agencies are using, has a logarithmic scale on the y-axis. This is helpful for a number of reasons, especially because it renders exponential growth as a straight line. For example, some news agencies add reference lines to show the slope at which cases double every 2 days, 3 days, 4 days, and so on.
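The code in this post doesn’t draw those reference lines, but here is a rough sketch of how they could be added, using plain ggplot2 rather than ggpubr and building on the world_adj data frame from the previous plot (world_plot and doubling are names I’ve made up). Starting from 100 cases, a count that doubles every d days follows 100 * 2^(days / d):
# Build doubling-time reference lines and layer them onto the trend plot.
world_plot <- world_adj %>% mutate(days = as.numeric(days_since_100))
doubling <- expand.grid(days = 0:30, doubling_time = c(2, 3, 4)) %>%
  mutate(count = 100 * 2^(days / doubling_time))

ggplot(world_plot, aes(days, count, color = country)) +
  geom_line() +
  geom_line(data = doubling, aes(days, count, group = doubling_time),
            inherit.aes = FALSE, linetype = 'dashed', color = 'grey50') +
  scale_y_log10() +
  labs(x = 'Days since 100th case', y = 'Confirmed cases',
       title = 'Dashed lines: doubling every 2, 3, and 4 days')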
The problem with the log scale is that it can mask the true nature of the exponential growth. Below I recreate my first plot, but on a linear scale instead of the log scale.
ggline(world, x = "date", y = "count",
color = 'country',
linetype = 'country',
plot_type = 'l',
show.line.label = TRUE,
repel = TRUE,
legend = 'none')
State Level Analysis
Let’s start by getting some data. The code below will download the data found at www.usafacts.org.
download.file('https://usafactsstatic.blob.core.windows.net/public/data/covid-19/covid_confirmed_usafacts.csv', 'covid_confirmed_usafacts.csv')
Now that we have the data, let’s read it in and look at the format:
covid <- read_csv('covid_confirmed_usafacts.csv', col_types = cols(countyFIPS = 'c'))
kable(head(covid[,1:10]))
countyFIPS | County Name | State | stateFIPS | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 |
---|---|---|---|---|---|---|---|---|---|
0 | Statewide Unallocated | AL | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
1001 | Autauga County | AL | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
1003 | Baldwin County | AL | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
1005 | Barbour County | AL | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
1007 | Bibb County | AL | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
1009 | Blount County | AL | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
Like the JHU data, this is in wide format and must be converted to long format.
covid_long <- gather(covid, date, count, 5:ncol(covid))
names(covid_long) <- c('fips','county','state','state_fips','date','count')
covid_long$date <- mdy(covid_long$date)
# head(covid_long)
This is much easier to work with.
Now let’s look at the top 10 states, with side-by-side plots on a linear scale and a log scale.
require(gridExtra)
s <- covid_long %>% group_by(state) %>% summarise(total = max(count)) %>% top_n(10)
state_long <- covid_long %>%
group_by(state,date) %>%
summarise(count = sum(count)) %>%
filter(state %in% s$state) %>%
filter(count > 10)
p1 <- ggline(state_long , x = "date", y = "count", color = 'state',
yscale = 'log10',
plot_type = 'l',
show.line.label = TRUE,
legend = 'none',
repel = TRUE) + ggtitle('Cases by State (Log Scale)')
p2 <- ggline(state_long , x = "date", y = "count", color = 'state',
plot_type = 'l',
show.line.label = TRUE,
legend = 'none',
repel = TRUE) + ggtitle('Cases by State (Normal Scale)')
ggarrange(p1, p2 , nrow = 1, ncol = 2, widths = c(6,6), heights = c(4))
Well, that’s enough for today. In my next blog post I’ll try to plot the county-level data as a choropleth map.
Those with experience modeling epidemics have been crucial in helping policy makers. One of the more influential studies was performed by Imperial College London and can be found here. Many of these models use versions of the SIR (Susceptible, Infected, Recovered) model, which estimates the number of infected persons at some time in the future. I’ve studied system dynamics as well as agent-based SIR models during my time here at CMU; however, I’ve used them to model the spread of disinformation rather than a virus (see our agent-based paper here).
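To make the idea concrete, here is a minimal SIR sketch solved with the deSolve package. The parameter values are purely illustrative and are not fitted to COVID-19 data; serious forecasts estimate these parameters from observed data.
# Classic SIR model: dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I, dR/dt = gamma*I
require(deSolve)
sir <- function(time, state, parms) {
  with(as.list(c(state, parms)), {
    dS <- -beta * S * I / N
    dI <-  beta * S * I / N - gamma * I
    dR <-  gamma * I
    list(c(dS, dI, dR))
  })
}
N     <- 1e6                               # population size (illustrative)
init  <- c(S = N - 1, I = 1, R = 0)        # start with a single infection
parms <- c(beta = 0.4, gamma = 0.1, N = N) # made-up transmission and recovery rates
out <- as.data.frame(ode(y = init, times = 0:180, func = sir, parms = parms))
head(out)
Plotting out$I over time gives the familiar epidemic curve; the hard modeling work is in estimating beta and gamma (and the effect of interventions) from real data.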