Aggregating data spatially, and making glyph maps

Global Historical Climatology Network (GHCN) provides daily climate measures from stations across the world. prcp_aus extracts daily precipitation and minimum temperature from GHCN for 639 stations in Australia from 2016 to 2020. This is where these stations locate in an Australia map:

state_map <- st_simplify(ozmaps::abs_ste, dTolerance = 4000)
ggplot() + 
  geom_sf(data = state_map, inherit.aes = FALSE,
          color = "grey80", alpha = 0.4, linetype = 3) +
  geom_point(data = prcp_aus , aes(x = long, y = lat)) + 
  theme_void()

This is a lot of stations to look at for one climate variable and they can’t all fit into a glyph map. What we can do is to group stations into clusters and look at the aggregated series in the glyph map. In this vignette, I will introduce how to perform aggregation using a hierarchical data structure in cubble.

First, let’s summarise the daily data into weekly sum for each station:

prcp_aus
#> # cubble:   id [663]: nested form
#> # bbox:     [113.53, -43.66, 153.64, -10.05]
#> # temporal: wk [dbl], prcp [dbl]
#>    id           long   lat  elev name                wmo_id ts               
#>    <chr>       <dbl> <dbl> <dbl> <chr>                <dbl> <list>           
#>  1 ASN00001006  128. -15.5   3.8 wyndham aero         95214 <tibble [53 × 2]>
#>  2 ASN00001007  126. -13.8   6   troughton island     94102 <tibble [53 × 2]>
#>  3 ASN00001013  128. -15.5  11   wyndham              94214 <tibble [53 × 2]>
#>  4 ASN00001018  126. -16.4 546   mount elizabeth      94211 <tibble [53 × 2]>
#>  5 ASN00001019  127. -14.3  23   kalumburu            94100 <tibble [53 × 2]>
#>  6 ASN00001020  126. -14.1  51   truscott             95101 <tibble [53 × 2]>
#>  7 ASN00001025  126. -15.4 385   doongan              94215 <tibble [53 × 2]>
#>  8 ASN00002012  128. -18.2 422   halls creek airport  94212 <tibble [53 × 2]>
#>  9 ASN00002032  128. -17.0 203   warmun               94213 <tibble [53 × 2]>
#> 10 ASN00002056  129. -15.8  44   kununurra aero       94216 <tibble [53 × 2]>
#> # … with 653 more rows

First we need to assign each station a cluster number. Here we use a simple kmean algorithm based on the distance matrix to create 20 clusters. This creates station_nested as a station level nested cubble with a cluster column indicating the group each station belongs to.

station_nested
#> # cubble:   id [663]: nested form
#> # bbox:     [113.53, -43.66, 153.64, -10.05]
#> # temporal: wk [dbl], prcp [dbl]
#>    id           long   lat  elev name                wmo_id ts       cluster
#>    <chr>       <dbl> <dbl> <dbl> <chr>                <dbl> <list>     <int>
#>  1 ASN00001006  128. -15.5   3.8 wyndham aero         95214 <tibble>      15
#>  2 ASN00001007  126. -13.8   6   troughton island     94102 <tibble>      15
#>  3 ASN00001013  128. -15.5  11   wyndham              94214 <tibble>      15
#>  4 ASN00001018  126. -16.4 546   mount elizabeth      94211 <tibble>      15
#>  5 ASN00001019  127. -14.3  23   kalumburu            94100 <tibble>      15
#>  6 ASN00001020  126. -14.1  51   truscott             95101 <tibble>      15
#>  7 ASN00001025  126. -15.4 385   doongan              94215 <tibble>      15
#>  8 ASN00002012  128. -18.2 422   halls creek airport  94212 <tibble>      15
#>  9 ASN00002032  128. -17.0 203   warmun               94213 <tibble>      15
#> 10 ASN00002056  129. -15.8  44   kununurra aero       94216 <tibble>      15
#> # … with 653 more rows

To create a group level cubble, use switch_key() with the new key variable, cluster:

cluster_nested <- station_nested %>%  switch_key(cluster)
cluster_nested %>%  head(5)
#> # cubble:   cluster [5]: nested form
#> # bbox:     [118.84, -43.66, 152.87, -10.05]
#> # temporal: id [chr], wk [dbl], prcp [dbl]
#>   cluster .val              ts                  
#>     <int> <list>            <list>              
#> 1       1 <tibble [18 × 6]> <tibble [954 × 3]>  
#> 2       2 <tibble [51 × 6]> <tibble [2,677 × 3]>
#> 3       3 <tibble [19 × 6]> <tibble [1,007 × 3]>
#> 4       4 <tibble [6 × 6]>  <tibble [318 × 3]>  
#> 5       5 <tibble [63 × 6]> <tibble [3,339 × 3]>

The resulted cluster_nested now has cluster as the key and all the station level time invariant variables are nested inside .val. Currently, there is no cluster level time invariant variables and we can add the centroid of each cluster by get_centroid():

(cluster_nested <- cluster_nested %>%  get_centroid())
#> # cubble:   cluster [20]: nested form
#> # bbox:     [113.53, -43.66, 153.64, -10.05]
#> # temporal: id [chr], wk [dbl], prcp [dbl]
#>    cluster .val              ts                   hull     cent_long cent_lat
#>      <int> <list>            <list>               <list>       <dbl>    <dbl>
#>  1       1 <tibble [18 × 6]> <tibble [954 × 3]>   <tibble>      141.    -22.9
#>  2       2 <tibble [51 × 6]> <tibble [2,677 × 3]> <tibble>      147.    -41.8
#>  3       3 <tibble [19 × 6]> <tibble [1,007 × 3]> <tibble>      121.    -30.6
#>  4       4 <tibble [6 × 6]>  <tibble [318 × 3]>   <tibble>      143.    -11.9
#>  5       5 <tibble [63 × 6]> <tibble [3,339 × 3]> <tibble>      150.    -32.3
#>  6       6 <tibble [56 × 6]> <tibble [2,951 × 3]> <tibble>      151.    -28.2
#>  7       7 <tibble [21 × 6]> <tibble [1,113 × 3]> <tibble>      145.    -29.5
#>  8       8 <tibble [61 × 6]> <tibble [3,233 × 3]> <tibble>      117.    -31.2
#>  9       9 <tibble [58 × 6]> <tibble [3,074 × 3]> <tibble>      148.    -35.3
#> 10      10 <tibble [55 × 6]> <tibble [2,915 × 3]> <tibble>      145.    -37.8
#> 11      11 <tibble [37 × 6]> <tibble [1,961 × 3]> <tibble>      142.    -36.1
#> 12      12 <tibble [29 × 6]> <tibble [1,537 × 3]> <tibble>      149.    -23.6
#> 13      13 <tibble [16 × 6]> <tibble [848 × 3]>   <tibble>      130.    -27.3
#> 14      14 <tibble [26 × 6]> <tibble [1,378 × 3]> <tibble>      143.    -17.7
#> 15      15 <tibble [19 × 6]> <tibble [1,007 × 3]> <tibble>      125.    -17.7
#> 16      16 <tibble [48 × 6]> <tibble [2,544 × 3]> <tibble>      139.    -33.9
#> 17      17 <tibble [20 × 6]> <tibble [1,060 × 3]> <tibble>      116.    -23.7
#> 18      18 <tibble [17 × 6]> <tibble [901 × 3]>   <tibble>      134.    -17.3
#> 19      19 <tibble [24 × 6]> <tibble [1,272 × 3]> <tibble>      132.    -13.0
#> 20      20 <tibble [19 × 6]> <tibble [1,007 × 3]> <tibble>      137.    -30.3

You can also use face_temporal() to get the cluster level long cubble:

(cluster_long <- cluster_nested %>%  face_temporal(ts))
#> # cubble:  wk, cluster [20]: long form
#> # bbox:    [113.53, -43.66, 153.64, -10.05]
#> # spatial: id [chr], long [dbl], lat [dbl], elev [dbl], name [chr], wmo_id
#> #   [dbl], hull [list], cent_long [dbl], cent_lat [dbl]
#>    cluster id             wk  prcp
#>      <int> <chr>       <dbl> <dbl>
#>  1       1 ASN00015602     1   256
#>  2       1 ASN00015602     2  1554
#>  3       1 ASN00015602     3     8
#>  4       1 ASN00015602     4   184
#>  5       1 ASN00015602     5  2022
#>  6       1 ASN00015602     6    80
#>  7       1 ASN00015602     7    44
#>  8       1 ASN00015602     8    28
#>  9       1 ASN00015602     9   102
#> 10       1 ASN00015602    10  1658
#> # … with 35,086 more rows

Now we should have access to both station and cluster level in the nested and long form. Let’s summarise them within a diagram:

With these data, we can make a glyph map to understand the precipitation pattern in Australia:

Or to inspect the station membership of each cluster:

ggplot() + 
  geom_sf(data = state_map, inherit.aes = FALSE,
          color = "grey80", alpha = 0.4, linetype = 3) +
  geom_point(data = station_nested, aes(x = long, y = lat), size = 0.5) +
  ggforce::geom_mark_hull(
    data = cluster_nested %>% tidyr::unnest(hull),
    expand = 0, radius = 0,
    aes(x = long, y = lat, group = cluster)) + 
  theme_void()