The covidcast package provides some simple utilities for exploring the correlations between two signals, over space or time, which may be helpful for simple analyses and explorations of data.
For these examples, we’ll load confirmed case and death rates per 100,000 population, and restrict our analysis to counties with at least 500 total cases by August 15th.
library(covidcast)
library(dplyr)
<- "2020-03-01"
start_day <- "2020-08-15"
end_day
<- covidcast_signal(
iprop data_source = "jhu-csse",
signal = "confirmed_7dav_incidence_prop",
start_day = start_day, end_day = end_day)
<- covidcast_signal(
dprop data_source = "jhu-csse",
signal = "deaths_7dav_incidence_prop",
start_day = start_day, end_day = end_day)
# Restrict attention to "active" counties with at least 500 total cases
<- 500
case_num <- covidcast_signal(
geo_values data_source = "jhu-csse",
signal = "confirmed_cumulative_num",
start_day = end_day, end_day = end_day) %>%
filter(value >= case_num) %>%
pull(geo_value)
<- iprop %>% filter(geo_value %in% geo_values)
cases <- dprop %>% filter(geo_value %in% geo_values) deaths
The covidcast_cor()
function is your primary way to calculate correlations. The first option we have is to “slice by time”: this calculates, for each time, correlation between the signals over all geographic locations. This is obtained by setting by = "time_value"
:
library(ggplot2)
# Compute correlation per time, over all counties
<- covidcast_cor(cases, deaths, by = "time_value")
df_cor
# Plot the correlation time series
ggplot(df_cor, aes(x = time_value, y = value)) + geom_line() +
labs(title = "Correlation between case and death rates",
subtitle = sprintf("Per day, over counties with at least %i cases",
case_num),x = "Date", y = "Correlation")
The above plot addresses the question: “on any given day, are case and death rates linearly associated, over US counties?”. We might be interested in broadening this question, instead asking: “on any given day, do higher case rates tend to associate with higher death rates?”, removing the dependence on a linear relationship. The latter can be addressed using Spearman correlation, accomplished by setting method = "spearman"
in the call to covidcast_cor()
. Spearman correlation is highly robust and invariant to monotone transformations (it doesn’t rely on any particular functional form for the dependence between two variables).
We might also be interested in how case rates associate with death rates in the future. Using the dt_x
parameter in covidcast_cor()
, we can lag case rates back any number of days we want, before calculating correlations.
Here we set dt_x = -10
. This means that cases
will be lagged by 10 days, meaning that cases on June 1st will be correlated with deaths on June 11th. (It might help to think of it this way: deaths on a certain day will be correlated with cases with an offset of -10 days.)
# Use Spearman correlation, with case rates and 10-day lagged case rates
<- covidcast_cor(cases, deaths, by = "time_value",
df_cor1 method = "spearman")
<- covidcast_cor(cases, deaths, by = "time_value", dt_x = -10,
df_cor2 method = "spearman")
# Stack rowwise into one data frame, then plot time series
<- rbind(df_cor1, df_cor2)
df_cor $dt <- as.factor(c(rep(0, nrow(df_cor1)), rep(-10, nrow(df_cor2))))
df_corggplot(df_cor, aes(x = time_value, y = value)) +
geom_line(aes(color = dt)) +
labs(title = "Correlation between case and death rates",
subtitle = sprintf("Per day, over counties with at least %i cases",
case_num),x = "Date", y = "Correlation") +
theme(legend.position = "bottom")
We can see that, for the most part, the Spearman measure has bolstered the correlations; and generally, lagging the case rates back by 10 days improves correlations, confirming case rates are better correlated with death rates 10 days from now.
The second option we have is to “slice by location”: this calculates, for each geographic location, correlation between the time series of two signals. This is obtained by setting by = "geo_value"
. We’ll again look at correlations both for observations at the same time and for 10-day lagged case rates:
# Compute correlation per county, over all times
<- covidcast_cor(cases, deaths, by = "geo_value")
df_cor1 <- covidcast_cor(cases, deaths, by = "geo_value", dt_x = -10)
df_cor2
# Stack rowwise into one data frame, then plot densities
<- rbind(df_cor1, df_cor2)
df_cor $dt <- as.factor(c(rep(0, nrow(df_cor1)), rep(-10, nrow(df_cor2))))
df_corggplot(df_cor, aes(value)) +
geom_density(aes(color = dt, fill = dt), alpha = 0.5) +
labs(title = "Correlation between case and death rates",
subtitle = "Computed separately for each county, over all times",
x = "Date", y = "Density") +
theme(legend.position = "bottom")
Using some tricks, we can attach the necessary properties to the data frame so we can plot these correlations in space as a choropleth map, using plot.covidcast_signal()
:
# Turn the data into a covidcast_signal data frame so it can be plotted
$time_value <- start_day
df_cor2<- as.covidcast_signal(df_cor2,
df_cor2 geo_type = "county",
signal = "correlations")
# Plot choropleth maps, using the covidcast plotting functionality
plot(df_cor2, title = "Correlations between 10-day lagged case and death rates",
range = c(-1, 1), choro_col = c("orange", "lightblue", "purple"))
You could also imagine trying to move the signals with various lags to see at what lag one signal is most correlated with the other. A simple way to achieve this:
# Loop over values for dt, and compute correlations per county
<- -(0:15)
dt_vec <- vector("list", length(dt_vec))
df_list for (i in seq_along(dt_vec)) {
<- covidcast_cor(cases, deaths, dt_x = dt_vec[i],
df_list[[i]] by = "geo_value")
$dt <- dt_vec[i]
df_list[[i]]
}
# Stack into one big data frame, and then plot the median correlation by dt
<- do.call(rbind, df_list)
df %>%
df group_by(dt) %>%
summarize(median = median(value, na.rm = TRUE), .groups = "drop_last") %>%
ggplot(aes(x = dt, y = median)) +
geom_line() +
geom_point() +
labs(title = "Median correlation between case and death rates",
x = "dt", y = "Correlation") +
theme(legend.position = "bottom", legend.title = element_blank())
We can see that the median correlation between case and death rates (where the correlations come from slicing by location) is maximized when we lag the case incidence rates back 8 days in time.