Exploring COVID-19 Data

As the COVID-19 virus wreaks havoc across our world, many of us are sitting at home trying to figure out whether there’s something we can do to help. Many of those with medical skills have already responded to calls for help, and some of us with experience in data science have tried to figure out how we can contribute. I am pretty late to the game on this, but I still wanted to start from scratch. Part of my interest in starting from scratch is that I’ve found you don’t fully understand a problem until you really dive in and play with the data, and this project gave me the opportunity to do that. If you want to look at some folks who are ahead of me, I recommend this blog post.


I’ve set up two data sources in my workflows. The first is the data from Johns Hopkins University (I pull from their GitHub repository found here). The Johns Hopkins University repository has been one of the primary sources that most media outlets use for their official numbers. The JHU data repository is updated daily and feeds a dashboard found here.

For about a week during the pandemic crisis, JHU’s county data wasn’t as clean and easy to parse. For that reason, I’ve turned to this site for USA county level data. We’ll analyze this source below:

Let’s start by computing some of the common trend graphics that are seen in many news articles. For example, at the time that I’m writing this, the New York Times trend analysis can be found at this link.

For my analysis, I will be using the R programming language. Similar analysis could be done in Python using the Pandas package.

We’ll start by analyzing some world trends. To do this we’ll use the JHU COVID-19 data found on Github. To streamline this process, I’ve cloned the Github repository and will run a git pull in the terminal before running the code below. This will ensure that I have the most updated data.

Now that we have the data, let’s read it in and look at the format.

library(tidyverse)   # readr, dplyr, and tidyr
library(lubridate)   # date parsing with mdy()
library(ggpubr)      # ggline() and ggarrange()

df <- read_csv('/Users/dbeskow/Dropbox/CMU/virus/COVID-19/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20
NA Afghanistan 33.0000 65.0000 0 0 0 0 0 0
NA Albania 41.1533 20.1683 0 0 0 0 0 0
NA Algeria 28.0339 1.6596 0 0 0 0 0 0
NA Andorra 42.5063 1.5218 0 0 0 0 0 0
NA Angola -11.2027 17.8739 0 0 0 0 0 0
NA Antigua and Barbuda 17.0608 -61.7964 0 0 0 0 0 0

We see that this data is in wide format. In other words, each day the data curators add another column with the new statistics for each geographic region. This format makes the data easy to update but hard to analyze in R. Let’s convert it to long format, and also convert the dates from character type to date type with the lubridate package:

df_long <- gather(df, date, count, 5:ncol(df))
names(df_long) <- c('state', 'country', 'lat', 'lon', 'date', 'count')
df_long$date <- mdy(df_long$date)
state country lat lon date count
NA Afghanistan 33.0000 65.0000 2020-01-22 0
NA Albania 41.1533 20.1683 2020-01-22 0
NA Algeria 28.0339 1.6596 2020-01-22 0
NA Andorra 42.5063 1.5218 2020-01-22 0
NA Angola -11.2027 17.8739 2020-01-22 0
NA Antigua and Barbuda 17.0608 -61.7964 2020-01-22 0

This is much easier to work with.
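For readers following along in pandas (mentioned above as the Python alternative), the same wide-to-long reshape is done with melt(). The two-row frame below is a made-up stand-in for the JHU file, just to show the mechanics:

```python
import pandas as pd

# A tiny frame mimicking the JHU wide format (values are invented)
df = pd.DataFrame({
    'Province/State': [None, None],
    'Country/Region': ['Afghanistan', 'Albania'],
    'Lat': [33.0, 41.1533],
    'Long': [65.0, 20.1683],
    '1/22/20': [0, 0],
    '1/23/20': [0, 1],
})

# melt() is pandas' analogue of tidyr's gather(): the id columns stay
# fixed while the date columns collapse into one date/count pair
df_long = df.melt(id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
                  var_name='date', value_name='count')
df_long.columns = ['state', 'country', 'lat', 'lon', 'date', 'count']
df_long['date'] = pd.to_datetime(df_long['date'], format='%m/%d/%y')
print(df_long.shape)  # (4, 6)
```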

When we look at the state column, we see that some nations are broken out by state or province (China, Australia, and others), while other nations are reported at the national level. In order to compare at the national level, we need to take the countries that are reported by state and aggregate their data up to the national level. I demonstrate this for China below.

# Separate the state-level data for China
# (the same steps apply to Australia, Canada, and other nations reported by state)
china <- df_long %>% filter(country == 'China')

# Aggregate to national level
china <- china %>%
  group_by(country,date) %>%
  summarise(count = sum(count)) %>%
  mutate(state = as.character(NA), lat = as.numeric(NA), lon = as.numeric(NA))

# Combine back together
df_long <- df_long %>% bind_rows(china)
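The same aggregation step in pandas is a groupby() and sum(). The rows below are invented (the province counts are not real data) and exist only to illustrate the mechanics:

```python
import pandas as pd

# Made-up long-format rows: China reported by province, Italy nationally
df_long = pd.DataFrame({
    'state':   ['Hubei', 'Beijing', 'Hubei', 'Beijing', None, None],
    'country': ['China', 'China', 'China', 'China', 'Italy', 'Italy'],
    'date':    pd.to_datetime(['2020-01-22', '2020-01-22',
                               '2020-01-23', '2020-01-23',
                               '2020-01-22', '2020-01-23']),
    'count':   [444, 14, 444, 22, 0, 0],
})

# Sum the provincial rows up to one national row per country and day
china = (df_long[df_long['country'] == 'China']
         .groupby(['country', 'date'], as_index=False)['count'].sum())
china['state'] = None

# Append the national rows; state is missing only on national-level
# records, which is how they can be filtered out later
df_long = pd.concat([df_long, china], ignore_index=True)
```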

Now that we have data at the national level, we can move on to visualization.

For our first visualization, let’s create the common trend chart that is found on most news sites. This chart shows the countries with the most cases and plots their counts on a log scale.

# Counts are cumulative, so max(count) gives each country's total to date
s <- df_long %>% group_by(country) %>% summarise(total = max(count)) %>% top_n(10)
world <- df_long %>% filter(country %in% s$country) %>% filter(count > 100) %>%
  filter(is.na(state)) #%>% filter(date > ymd('20200228'))
ggline(world, x = "date", y = "count",
       color = 'country',
       linetype = 'country',
       yscale = 'log10',
       plot_type = 'l',
       show.line.label = TRUE,
       repel = TRUE,
       legend = 'none')

As you can see, it’s a little hard to compare countries because their outbreaks start at different times. If you look at a similar plot by the New York Times, as well as other media, you’ll see that they use days since the nth case. Let’s see if we can replicate this. As you can see above, we’ve only included counts over 100 in our data, so we just need to develop a new x-axis (days since the 100th case). I show how to do this below:

min_date <- world %>% group_by(country) %>% summarize(min_date = min(date))
world_adj <- world %>% left_join(min_date) %>% mutate(days_since_100 = as.numeric(date - min_date))
ggline(world_adj, x = "days_since_100", y = "count", color = 'country',
       linetype = 'country',
       yscale = 'log10',
       plot_type = 'l',
       show.line.label = TRUE,
       repel = TRUE)

That looks much better.
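In pandas, the days-since-100th-case axis can be built the same way: find each country’s first date in the already-filtered data and subtract it. The numbers below are invented for illustration:

```python
import pandas as pd

# Toy data already filtered to counts above 100 (numbers invented)
world = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B'],
    'date': pd.to_datetime(['2020-03-01', '2020-03-02', '2020-03-03',
                            '2020-03-05', '2020-03-06']),
    'count': [120, 250, 510, 130, 300],
})

# Each country's first date in the filtered data is its day of the
# 100th case; transform('min') broadcasts that date back to every row
min_date = world.groupby('country')['date'].transform('min')
world['days_since_100'] = (world['date'] - min_date).dt.days
print(world['days_since_100'].tolist())  # [0, 1, 2, 0, 1]
```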

Now our plot, like many of the plots that news agencies are using, uses a logarithmic scale for the y-axis. This is helpful for a number of reasons, especially because it renders exponential growth as a straight line. For example, some news agencies add reference lines showing the slope at which cases double every 2 days, 3 days, 4 days, etc.
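To make the doubling-time idea concrete, here is a quick numeric check in Python with made-up numbers: a series that doubles every 3 days rises by a constant log10(2) per doubling, which is exactly why it plots as a straight line on a log axis.

```python
import math

# Hypothetical series: cases doubling every 3 days, starting from 100
doubling_days = 3
cases = [100 * 2 ** (t / doubling_days) for t in range(0, 13, 3)]

# On a log10 scale each doubling adds a constant log10(2) ~= 0.301, so
# the curve is a straight line with slope log10(2) / doubling_days
log_cases = [math.log10(c) for c in cases]
diffs = [round(b - a, 6) for a, b in zip(log_cases, log_cases[1:])]
print(cases)  # [100.0, 200.0, 400.0, 800.0, 1600.0]
print(diffs)  # constant steps of about 0.30103
```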

The problem with the log scale is that it can mask the true explosiveness of exponential growth. Below I recreate my first plot, but on a linear scale instead of a log scale.

ggline(world, x = "date", y = "count",
       color = 'country',
       linetype = 'country',
       plot_type = 'l',
       show.line.label = TRUE,
       repel = TRUE,
       legend = 'none')

State Level Analysis

Let’s start by getting some data. The code below will download the data found at www.usafacts.org.

download.file('https://usafactsstatic.blob.core.windows.net/public/data/covid-19/covid_confirmed_usafacts.csv', 'covid_confirmed_usafacts.csv')

Now that we have the data, let’s read it in and look at the format:

covid <- read_csv('covid_confirmed_usafacts.csv', col_types = cols(countyFIPS = 'c'))
countyFIPS County Name State stateFIPS 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20
0 Statewide Unallocated AL 1 0 0 0 0 0 0
1001 Autauga County AL 1 0 0 0 0 0 0
1003 Baldwin County AL 1 0 0 0 0 0 0
1005 Barbour County AL 1 0 0 0 0 0 0
1007 Bibb County AL 1 0 0 0 0 0 0
1009 Blount County AL 1 0 0 0 0 0 0

Like the JHU data, this is in wide format and must be converted to long format.

covid_long <- gather(covid, date, count, 5:ncol(covid))
names(covid_long) <- c('fips','county','state','state_fips','date','count')
covid_long$date <- mdy(covid_long$date)
# head(covid_long)

This is much easier to work with.

Now let’s look at the top 10 states, with side-by-side plots on a linear axis and a log-scale axis.

s <- covid_long %>% group_by(state) %>% summarise(total = max(count)) %>% top_n(10)

state_long <- covid_long %>%
  group_by(state,date) %>%
  summarise(count = sum(count)) %>%
  filter(state %in% s$state) %>%
  filter(count > 10)
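For comparison, here is the equivalent ranking step in pandas, on invented state counts; nlargest() plays the role of dplyr’s top_n():

```python
import pandas as pd

# Invented cumulative counts per state and day
covid_long = pd.DataFrame({
    'state': ['NY', 'NY', 'WA', 'WA', 'CA', 'CA'],
    'date': pd.to_datetime(['2020-03-20', '2020-03-21'] * 3),
    'count': [8000, 12000, 1000, 1200, 900, 1100],
})

# Because counts are cumulative, the max of each state's series is its
# total to date; nlargest() then picks the leading states
totals = covid_long.groupby('state')['count'].max()
top = totals.nlargest(2).index.tolist()
print(top)  # ['NY', 'WA']
```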

p1 <- ggline(state_long , x = "date", y = "count", color = 'state',
       yscale = 'log10',  
       plot_type = 'l',
       show.line.label = TRUE,  
       legend = 'none',
       repel = TRUE) + ggtitle('Cases by State (Log Scale)')

p2 <- ggline(state_long , x = "date", y = "count", color = 'state',       
       plot_type = 'l',
       show.line.label = TRUE,
       legend = 'none',
       repel = TRUE) + ggtitle('Cases by State (Normal Scale)')

ggarrange(p1, p2 , nrow = 1, ncol = 2, widths = c(6,6), heights = c(4))

Well, that’s enough for today. In my next blog I’ll try to plot the county level data in a choropleth plot.

Those with experience modeling epidemics have been crucial in helping policy makers. One of the more influential studies among these was performed by Imperial College London and can be found here. Many of these models use versions of the SIR (Susceptible, Infected, Recovered) model, which computes the estimated number of infected persons at some time in the future. I’ve studied System Dynamics as well as Agent Based SIR models during my time here at CMU. However, I’ve used them to map disinformation rather than a virus (see our agent based paper here).
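To give a flavor of what an SIR model computes, here is a minimal discrete-time sketch in Python. The parameter values (beta, gamma, population size) are illustrative placeholders, not values fitted to COVID-19:

```python
# Minimal discrete-time SIR sketch: beta is the transmission rate,
# gamma the recovery rate, and S + I + R stays fixed at population N.
# Parameter values are illustrative only.
def sir(N=1000.0, I0=1.0, beta=0.3, gamma=0.1, days=160):
    S, I, R = N - I0, I0, 0.0
    history = []
    for _ in range(days):
        new_infections = beta * S * I / N  # contacts between S and I
        new_recoveries = gamma * I         # infected who recover today
        S -= new_infections
        I += new_infections - new_recoveries
        R += new_recoveries
        history.append(I)
    return history

infected = sir()
peak_day = max(range(len(infected)), key=lambda d: infected[d])
```

The infected curve rises roughly exponentially at first (which is why the log-scale plots above look linear early on), peaks as the susceptible pool depletes, and then declines.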

Written on April 15, 2020