Most StatCan tables are small and can easily be processed in memory. However,
some tables are so large that this is not a feasible strategy. Table
43-10-0024 is one such example and comes with a CSV file that is several
gigabytes in size. In cases like this it is more useful to store and access
the data as an SQLite database using the get_cansim_sqlite function instead
of the usual get_cansim. In these circumstances it is also useful to cache
the data for longer than just the current R session, and the data_cache
option allows us to specify a permanent location. It defaults to
getOption("cansim.cache_path"), and if this option is not set the data will
only be cached for the duration of the current session.
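To make the cache permanent we can set this option ourselves, for example in
our .Rprofile. A minimal sketch, where the directory is just a placeholder to
adapt to your own setup:
# cache SQLite databases permanently in this directory (path is an example)
options(cansim.cache_path = "~/cansim_cache")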
For this vignette we use the (rather small) motor vehicle sales data as an example.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(ggplot2)
library(cansim)
One main difference from the get_cansim function is that get_cansim_sqlite
does not return the data itself but only a connection to the database. This
allows us to filter the data before fetching it into memory. The first time
this is called it will fetch the data, parse it into an SQLite database, and
index the main columns. This can take a significant amount of time, depending
on the size of the table. Any subsequent call will simply open a database
connection to the cached database.
connection <- get_cansim_sqlite("20-10-0001")
#> Accessing CANSIM NDM product 20-10-0001 from Statistics Canada
#> Parsing data
#> Indexing GEO
#> Indexing Vehicle type
#> Indexing Origin of manufacture
#> Indexing Sales
#> Indexing Seasonal adjustment
#> Indexing REF_DATE
#> Indexing DGUID
#> Indexing GeoUID
We can inspect the data by looking at the first few rows to get a general idea of what the data looks like.
head(connection)
#> # Source: SQL [6 x 19]
#> # Database: sqlite 3.39.1 [/private/var/folders/z4/gcjq2cd93p3bs5bgp8j2vv240000gp/T/Rtmpcd5gVh/cansim_20100001_eng/20100001-eng.sqlite]
#> REF_D…¹ GEO DGUID Vehic…² Origi…³ Sales Seaso…⁴ UOM UOM_ID SCALA…⁵ SCALA…⁶
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1946-01 Cana… 2016… Total,… Total,… Units Unadju… Units 300 "units… 0
#> 2 1946-01 Cana… 2016… Total,… Total,… Doll… Unadju… Doll… 81 "thous… 3
#> 3 1946-01 Cana… 2016… Passen… Total,… Units Unadju… Units 300 "units… 0
#> 4 1946-01 Cana… 2016… Passen… Total,… Units Season… Units 300 "units… 0
#> 5 1946-01 Cana… 2016… Passen… Total,… Doll… Unadju… Doll… 81 "thous… 3
#> 6 1946-01 Cana… 2016… Trucks Total,… Units Unadju… Units 300 "units… 0
#> # … with 8 more variables: VECTOR <chr>, COORDINATE <chr>, VALUE <dbl>,
#> # STATUS <chr>, SYMBOL <chr>, TERMINATED <chr>, DECIMALS <chr>, GeoUID <chr>,
#> # and abbreviated variable names ¹REF_DATE, ²`Vehicle type`,
#> # ³`Origin of manufacture`, ⁴`Seasonal adjustment`, ⁵SCALAR_FACTOR,
#> # ⁶SCALAR_ID
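Since the connection is a lazy database table, dplyr verbs are translated to
SQL and executed inside SQLite. As a quick sketch (output not shown here), we
can for example count the rows without pulling the table into memory:
# row count is computed inside SQLite; only the result comes back to R
connection %>% tally()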
To make good use of the data we will have to look at the metadata and
inspect the member columns and variables available. The metadata will be
available in the R session after the database connection has been opened.
Trying to access the metadata before the connection has been opened will
cause the package to attempt to download the data via the get_cansim call,
which is exactly what the get_cansim_sqlite function is trying to avoid.
get_cansim_table_overview("20-10-0001")
#> New motor vehicle sales
#> CANSIM Table 20-10-0001
#> Start Reference Period: 1946-01-01, End Reference Period: 2022-11-01, Frequency: Monthly
#>
#> Column Geography (11)
#> Newfoundland and Labrador, Prince Edward Island, Nova Scotia, New Brunswick, Quebec, Ontario, Manitoba, Saskatchewan, Alberta, British Columbia and the Territories, ...
#>
#> Column Vehicle type (3)
#> Passenger cars, Trucks, Total, new motor vehicles
#>
#> Column Origin of manufacture (5)
#> North America, Total, overseas, Japan, Other countries, Total, country of manufacture
#>
#> Column Sales (2)
#> Units, Dollars
#>
#> Column Seasonal adjustment (2)
#> Unadjusted, Seasonally adjusted
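If we prefer, we can also query the distinct values of a member column
straight from the database. This is just a sketch using plain dplyr on the
lazy table rather than the table overview above:
# SELECT DISTINCT is executed in SQLite
connection %>% distinct(`Vehicle type`)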
This gives us an understanding of the available variables. For the purpose
of this vignette we are interested in the breakdown of sales units by
Vehicle type in Canada overall. The data is stored in the database in its
raw form; the only processing done is that it is augmented by the GeoUID. In
order to work with it we need to collect the data. If the only operation
performed is filtering, and no selection or renaming of columns was done
before accessing the data, we can utilize the custom collect_and_normalize
function to collect the data and normalize it at the same time, so that it
will appear the same way as if we had used the get_cansim function. This
will add the category and hierarchy metadata and the normalized value
column.
data <- connection %>%
  filter(GEO=="Canada",
         `Seasonal adjustment`=="Unadjusted",
         Sales=="Units",
         `Origin of manufacture`=="Total, country of manufacture",
         `Vehicle type` %in% c("Passenger cars","Trucks")) %>%
  collect_and_normalize()

data %>% head()
#> # A tibble: 6 × 30
#> REF_DATE GEO DGUID UOM UOM_ID SCALA…¹ SCALA…² VECTOR COORD…³ VALUE STATUS
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
#> 1 1946-01 Canada 2016… Units 300 "units… 0 v4216… 1.2.1.… 1102 <NA>
#> 2 1946-01 Canada 2016… Units 300 "units… 0 v4216… 1.3.1.… 1654 <NA>
#> 3 1946-02 Canada 2016… Units 300 "units… 0 v4216… 1.2.1.… 1609 <NA>
#> 4 1946-02 Canada 2016… Units 300 "units… 0 v4216… 1.3.1.… 2154 <NA>
#> 5 1946-03 Canada 2016… Units 300 "units… 0 v4216… 1.2.1.… 4734 <NA>
#> 6 1946-03 Canada 2016… Units 300 "units… 0 v4216… 1.3.1.… 2466 <NA>
#> # … with 19 more variables: SYMBOL <chr>, TERMINATED <chr>, DECIMALS <chr>,
#> # GeoUID <chr>, val_norm <dbl>, Date <date>, `Hierarchy for GEO` <chr>,
#> # `Classification Code for Vehicle type` <chr>,
#> # `Hierarchy for Vehicle type` <chr>,
#> # `Classification Code for Origin of manufacture` <chr>,
#> # `Hierarchy for Origin of manufacture` <chr>,
#> # `Classification Code for Sales` <chr>, `Hierarchy for Sales` <chr>, …
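As an aside, if we had selected or renamed columns before fetching, the
condition above would not hold and collect_and_normalize could not add the
metadata back in; in that case we would fall back to a plain dplyr collect().
A hypothetical sketch:
# plain collect() returns the raw filtered columns without the cansim metadata
raw_data <- connection %>%
  select(REF_DATE, `Vehicle type`, VALUE) %>%
  collect()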
Given the data we can further filter the date range and plot it.
data %>%
  filter(Date>=as.Date("1990-01-01")) %>%
  ggplot(aes(x=Date,y=val_norm,color=`Vehicle type`)) +
  geom_smooth(span=0.2,method = 'loess', formula = y ~ x) +
  theme(legend.position="bottom") +
  scale_y_continuous(labels = function(d)scales::comma(d,scale=10^-3,suffix="k")) +
  labs(title="Canada new motor vehicle sales",caption="StatCan Table 20-10-0001",
       x=NULL,y="Number of units")
When we don’t need the database connection any more we should remember to close it. This frees up resources and keeps the database handlers happy.
disconnect_cansim_sqlite(connection)
Since we now have the option of a more permanent cache, we should take care
to manage that space properly. The list_cansim_sqlite_cached_tables function
gives us an overview of the cached data we have.
list_cansim_sqlite_cached_tables()
#> # A tibble: 1 × 8
#> cansimTa…¹ langu…² timeCached sqlit…³ title path timeReleased
#> <chr> <chr> <dttm> <chr> <chr> <chr> <dttm>
#> 1 20-10-0001 eng 2023-01-19 22:04:51 45.5 Mb New … cans… 2023-01-19 13:30:00
#> # … with 1 more variable: upToDate <lgl>, and abbreviated variable names
#> # ¹cansimTableNumber, ²language, ³sqliteSize
Cached data won’t update automatically; we will have to pass the
refresh=TRUE option to refresh it manually when we want a fresh data pull
from StatCan.
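For example, to force a fresh pull and rebuild of the SQLite database for
this table we could re-open the connection with that option (a sketch of the
call):
# forces a re-download even if a cached version exists
connection <- get_cansim_sqlite("20-10-0001", refresh=TRUE)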
If we want to free up disk space we can remove a cached table.
remove_cansim_sqlite_cached_table("20-10-0001")
#> Removing cached data for 20-10-0001 (eng)
#> NULL