stats19 enables access to and processing of Great Britain’s official road traffic casualty database, STATS19. A description of variables in the database can be found in a document provided by the UK’s Department for Transport (DfT). The datasets are collectively called STATS19 after the form used to report them, which can be found here. This vignette focuses on how to use the stats19 package to work with STATS19 data.
Note: The Department for Transport refers to “accidents”, but “crashes” is a more appropriate term, as emphasised in the “crash not accident” arguments of road safety advocacy groups such as RoadPeace. We use the term “accidents” only in reference to nomenclature within the data as provided.
The package can be installed from CRAN, or the development version from GitHub, and loaded as follows:
# from CRAN
install.packages("stats19")
# or install the latest development version (discouraged) using:
remotes::install_github("ITSLeeds/stats19")
library(stats19)
#> Data provided under OGL v3.0. Cite the source and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
The easiest way to get STATS19 data is with get_stats19(). This function takes two main arguments, year and type. The year can be any year from 1979 up to the most recently published year, which is typically the current year minus one or two because of the delay in publishing STATS19 statistics. The type can be one of accidents, casualties and vehicles, described below. get_stats19() performs three jobs, corresponding to three main types of functions:
Download: A dl_stats19() function accepts year, type and file_name arguments to make it easy to find the right file to download only.
Read: STATS19 data is provided in a particular format that benefits from being read-in with pre-specified column types. This is taken care of with read_*() functions providing access to the 3 main tables in STATS19 data:
read_accidents() reads-in the crash data (which has one row per incident)
read_casualties() reads-in the casualty data (which has one row per person injured or killed)
read_vehicles() reads-in the vehicles table, which contains information on the vehicles involved in the crashes (and has one row per vehicle)
Format: There are corresponding format_*() functions for each of the read_*() functions. These have been exported for convenience. Because the two sets of functions are closely related, the read_*() functions also have a format parameter which, when TRUE (the default), adds labels to the tables. The raw data provided by the DfT contains only integers. Running read_*(..., format = TRUE) converts these integer values to the corresponding character variables for each of the three tables. For example, read_accidents(format = TRUE) converts values in the accident_severity column from 1, 2 and 3 to Fatal, Serious and Slight using the format_accidents() function. To read-in raw data without formatting, set format = FALSE.
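To illustrate why pre-specified column types matter, the following base R sketch reads a tiny CSV snippet twice. It does not use stats19 itself: the CSV text is a made-up stand-in for a raw STATS19 file (the identifiers are borrowed from output shown in this vignette, the severity codes are placeholders), and read.csv() stands in for whatever reader the package uses internally.

```r
# Toy CSV text standing in for a raw STATS19 file (placeholder rows)
csv_text = "accident_index,accident_reference,accident_severity
2017010001708,010001708,1
2017010009342,010009342,3"

# Default type guessing treats the reference as a number, dropping the leading zero
guessed = read.csv(text = csv_text)

# Pre-specifying column types keeps identifiers as character strings
typed = read.csv(
  text = csv_text,
  colClasses = c(
    accident_index = "character",
    accident_reference = "character",
    accident_severity = "integer"
  )
)

class(guessed$accident_reference)
class(typed$accident_reference)
typed$accident_reference
```

With explicit types, identifiers such as accident_reference survive as text and can be used safely as join keys later on.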
Multiple functions (read_* and format_*) are needed for each step because of the structure of STATS19 data, which are divided into 3 tables: accidents, casualties and vehicles.
Data files containing multiple years worth of data can be downloaded. Datasets since 1979 are broadly consistent, meaning that STATS19 data represents a rich historic geographic record of road casualties at a national level, as stated in the DfT’s road casualties report in 2017:
The current set of definitions and detail of information goes back to 1979, providing a long period for comparison.
stats19 enables download of raw STATS19 data with dl_* functions. The following code chunk, for example, downloads the file containing STATS19 accident data from 2017:
dl_stats19(year = 2017, type = "accident", ask = FALSE)
#> Files identified: dft-road-casualty-statistics-accident-2017.csv
#> https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-accident-2017.csv
#> Data already exists in data_dir, not downloading
#> Data saved at /data/stats19/dft-road-casualty-statistics-accident-2017.csv
Note that in the previous command, ask = FALSE, meaning that the download proceeds without asking for confirmation. By default you are asked to confirm before downloading large files. Currently, these files are downloaded to a default location of tempdir(), which is a platform-independent “safe” but temporary location to download the data in. Once downloaded, they are unzipped under original DfT file names. The dl_stats19() function prints out the location and final file name(s) of unzipped file(s) as shown above.
dl_stats19() takes three parameters. Supplying a file_name is interpreted to mean that the user is aware of what to download and the other two parameters will be ignored. You can also use year and type to “search” through the file names, which are stored in a lazy-loaded dataset called stats19::file_names.
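The idea of “searching” file names by year and type can be sketched in base R as follows. This is only an illustration: available_files is a hand-written stand-in for the names in stats19::file_names, and find_files() is a hypothetical helper, not a function from the package.

```r
# Hypothetical stand-in for names(stats19::file_names); only a few entries
available_files = c(
  "dft-road-casualty-statistics-casualty-2017.csv",
  "dft-road-casualty-statistics-vehicle-2017.csv",
  "dft-road-casualty-statistics-accident-2017.csv",
  "DigitalBreathTestData2013.zip"
)

# Hypothetical helper: keep the names that match every search term supplied
find_files = function(files, year = NULL, type = NULL) {
  keep = rep(TRUE, length(files))
  if (!is.null(year)) keep = keep & grepl(as.character(year), files, fixed = TRUE)
  if (!is.null(type)) keep = keep & grepl(type, files, ignore.case = TRUE)
  files[keep]
}

find_files(available_files, year = 2017)
find_files(available_files, year = 2017, type = "accident")
```

Supplying only a year leaves several candidates, which is the situation in which an interactive choice becomes useful.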
You can find out the names of files that can be downloaded with names(stats19::file_names), an example of which is shown below:
stats19::file_names$DigitalBreathTestData2013.zip
#> [1] "DigitalBreathTestData2013.zip"
To see how file_names was created, see ?file_names. Data files from other years can be selected interactively. Just providing a year, for example, presents the user with multiple options (from file_names), illustrated below:
dl_stats19(year = 2017)
Multiple matches. Which do you want to download?
1: dft-road-casualty-statistics-casualty-2017.csv
2: dft-road-casualty-statistics-vehicle-2017.csv
3: dft-road-casualty-statistics-accident-2017.csv
Selection:
Enter an item from the menu, or 0 to exit
When R is running interactively, you can select which of the 3 matching files to download: those relating to vehicles, casualties or accidents in 2017.
In a similar approach to the download section before, we can read downloaded files by supplying the data_dir location of the file and the filename to read. The code below will download the dft-road-casualty-statistics-accident-2017.csv file from the DfT servers (unless it already exists in data_dir) and read its content. Files are saved by default in tempdir(), but this can be overridden to ensure permanent storage in a user-defined location.
crashes_2017_raw = get_stats19(year = 2017, type = "acc", format = FALSE)
#> Files identified: dft-road-casualty-statistics-accident-2017.csv
#> https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-accident-2017.csv
#> Data already exists in data_dir, not downloading
#> Data saved at /data/stats19/dft-road-casualty-statistics-accident-2017.csv
#> Reading in:
#> /data/stats19/dft-road-casualty-statistics-accident-2017.csv
#> Rows: 129982 Columns: 36
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (8): accident_index, accident_reference, longitude, latitude, date, lo...
#> dbl (27): accident_year, location_easting_osgr, location_northing_osgr, pol...
#> time (1): time
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
stats19 imports data with readr::read_csv(), which results in a ‘tibble’ object: a data frame with more user-friendly printing and a few other features.
class(crashes_2017_raw)
#> [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
dim(crashes_2017_raw)
#> [1] 129982 36
There are three read_*() functions, corresponding to the three different classes of data provided by the DfT:
1. read_accidents()
2. read_casualties()
3. read_vehicles()
In all cases, the default read_*(format = TRUE) returns the data in formatted form, as described above. Data can also be imported in the form directly provided by the DfT by passing format = FALSE, and then subsequently formatted with additional format_*() functions, as described in a final section of this vignette. Each of these read_*() functions is now described in more detail.
After raw data files have been downloaded as described in the previous section, they can then be read-in as follows:
crashes_2017_raw = read_accidents(year = 2017, format = FALSE)
#> Reading in:
#> /data/stats19/dft-road-casualty-statistics-accident-2017.csv
#> Rows: 129982 Columns: 36
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (8): accident_index, accident_reference, longitude, latitude, date, lo...
#> dbl (27): accident_year, location_easting_osgr, location_northing_osgr, pol...
#> time (1): time
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
crashes_2017 = format_accidents(crashes_2017_raw)
#> date and time columns present, creating formatted datetime column
nrow(crashes_2017_raw)
#> [1] 129982
ncol(crashes_2017_raw)
#> [1] 36
nrow(crashes_2017)
#> [1] 129982
ncol(crashes_2017)
#> [1] 37
What just happened? We read-in data on all road crashes recorded by the police in 2017 across Great Britain. The raw dataset contains 36 columns (variables) for 129,982 crashes. This work was done by read_accidents(format = FALSE), which imported the “raw” STATS19 data without cleaning messy column names or re-categorising the outputs. The format_accidents() function automates the process of matching column names with variable names and labels in a .xls file provided by the DfT. This means crashes_2017 is much more usable than crashes_2017_raw, as shown below for some key variables in the raw and formatted datasets:
crashes_2017_raw[c(7, 18, 23, 25)]
#> # A tibble: 129,982 × 4
#> latitude first_road_class junction_control second_road_number
#> <chr> <dbl> <dbl> <dbl>
#> 1 51.650061 3 -1 -1
#> 2 51.522425 3 4 0
#> 3 51.514096 3 4 0
#> 4 51.624832 3 4 154
#> 5 51.573408 3 2 10
#> 6 51.438762 6 -1 -1
#> 7 51.525305 3 4 0
#> 8 51.522 3 -1 -1
#> 9 51.621219 3 2 5109
#> 10 51.489732 3 4 0
#> # … with 129,972 more rows
crashes_2017[c(7, 18, 23, 25)]
#> # A tibble: 129,982 × 4
#> latitude first_road_class junction_control second_road_number
#> <chr> <chr> <chr> <chr>
#> 1 51.650061 A Data missing or out of range Unknown
#> 2 51.522425 A Give way or uncontrolled first_road_class is …
#> 3 51.514096 A Give way or uncontrolled first_road_class is …
#> 4 51.624832 A Give way or uncontrolled <NA>
#> 5 51.573408 A Auto traffic signal <NA>
#> 6 51.438762 Unclassified Data missing or out of range Unknown
#> 7 51.525305 A Give way or uncontrolled first_road_class is …
#> 8 51.522 A Data missing or out of range Unknown
#> 9 51.621219 A Auto traffic signal <NA>
#> 10 51.489732 A Give way or uncontrolled first_road_class is …
#> # … with 129,972 more rows
By default, format = TRUE, meaning that the two stages of read_accidents(format = FALSE) and format_accidents() yield the same result as read_accidents(format = TRUE). For the full list of columns, run names(crashes_2017).
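The relationship between the one-step and two-step routes can be sketched with a pair of toy functions. Both read_toy() and format_toy() are hypothetical stand-ins written for this illustration, not the package's code; the code-to-label mapping (3 to A, 6 to Unclassified) mirrors the first_road_class values shown in the output above.

```r
# Hypothetical stand-in for a format_*() function: replace codes with labels
format_toy = function(x) {
  labels = c("3" = "A", "6" = "Unclassified")
  x$first_road_class = unname(labels[as.character(x$first_road_class)])
  x
}

# Hypothetical stand-in for a read_*() function with a format switch
read_toy = function(format = TRUE) {
  raw = data.frame(first_road_class = c(3, 6, 3))
  if (format) format_toy(raw) else raw
}

one_step = read_toy(format = TRUE)
two_step = format_toy(read_toy(format = FALSE))
identical(one_step, two_step)
```

Keeping the formatting step as a separate exported function means the raw codes remain available for anyone who needs to check them against the official documentation.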
Note: As indicated above, the term “accidents” is only used as directly provided by the DfT; “crashes” is a more appropriate term, hence we call our resultant datasets crashes_*.
It is also possible to import the “raw” data as provided by the DfT. A .xls file provided by the DfT defines the column names for the datasets provided. The packaged datasets stats19_variables and stats19_schema provide summary information about the contents of this data guide. These contain the full variable names in the guide (stats19_variables) and a complete lookup table relating the integer values in the .csv files provided by the DfT to their labels (stats19_schema). The first rows of each dataset are shown below:
stats19_variables
#> # A tibble: 98 × 5
#> # Groups: table [4]
#> table variable note colum…¹ type
#> <chr> <chr> <chr> <chr> <chr>
#> 1 Accident accident_index unique va… accide… char…
#> 2 Accident accident_index unique va… accide… char…
#> 3 Accident accident_index unique va… accide… char…
#> 4 Accident accident_reference In year i… accide… char…
#> 5 Accident accident_severity <NA> accide… char…
#> 6 Accident accident_year <NA> accide… nume…
#> 7 Accident carriageway_hazards <NA> carria… char…
#> 8 Accident date <NA> date char…
#> 9 Accident day_of_week <NA> day_of… char…
#> 10 Accident did_police_officer_attend_scene_of_accident <NA> did_po… char…
#> # … with 88 more rows, and abbreviated variable name ¹column_name
stats19_schema
#> # A tibble: 914 × 7
#> table variable code label note variable_format…¹ type
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Accident police_force 1 Metropolitan Police <NA> police_force char…
#> 2 Accident police_force 3 Cumbria <NA> police_force char…
#> 3 Accident police_force 4 Lancashire <NA> police_force char…
#> 4 Accident police_force 5 Merseyside <NA> police_force char…
#> 5 Accident police_force 6 Greater Manchester <NA> police_force char…
#> 6 Accident police_force 7 Cheshire <NA> police_force char…
#> 7 Accident police_force 10 Northumbria <NA> police_force char…
#> 8 Accident police_force 11 Durham <NA> police_force char…
#> 9 Accident police_force 12 North Yorkshire <NA> police_force char…
#> 10 Accident police_force 13 West Yorkshire <NA> police_force char…
#> # … with 904 more rows, and abbreviated variable name ¹variable_formatted
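To show how a lookup table with this layout can turn codes into labels, here is a minimal base R sketch. The schema object below is a three-row toy copy of the police_force entries printed above, the raw codes are made up, and the match()-based lookup is one possible approach rather than a description of the package's internals.

```r
# Toy lookup table with the same variable/code/label layout (subset only)
schema = data.frame(
  variable = "police_force",
  code = c("1", "3", "4"),
  label = c("Metropolitan Police", "Cumbria", "Lancashire")
)

# Made-up raw values; 99 has no entry in the toy schema
raw_codes = c(4, 1, 1, 99)

# Match each raw code against the schema and pull out the label
lookup = schema[schema$variable == "police_force", ]
labelled = lookup$label[match(as.character(raw_codes), lookup$code)]
labelled
```

Codes without a schema entry come back as NA, which makes unexpected values easy to spot.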
The code that generated these small datasets can be found in their help pages (accessed with ?stats19_variables and ?stats19_schema respectively). stats19_schema is used internally to automate the process of formatting the downloaded .csv files. Column names are formatted by the function format_column_names(), as illustrated below:
format_column_names(stats19_variables$variable[1:3])
#> [1] "accident_index" "accident_index" "accident_index"
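As a rough illustration of this kind of column-name cleaning, the sketch below converts older-style names to lower-case snake case. clean_names() is a hypothetical helper written for this example; the rules it applies are assumptions and may differ from those in format_column_names().

```r
# Hypothetical cleaner: lower-case names and standardise separators
clean_names = function(x) {
  x = tolower(x)
  x = gsub("[ -]+", "_", x)   # spaces and hyphens become underscores
  gsub("[()?]", "", x)        # drop brackets and question marks
}

clean_names(c("Accident_Severity", "Pedestrian_Crossing-Human_Control"))
```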
Previous approaches to formatting STATS19 data involved hard-coding results. This more automated approach to data cleaning is more consistent and fail-safe. The three functions format_accidents(), format_vehicles() and format_casualties() do the data formatting on the respective data frames, as illustrated below:
crashes_2017 = format_accidents(crashes_2017_raw)
#> date and time columns present, creating formatted datetime column
# vehicle data for 2017
dl_stats19(year = 2017, type = "vehicle", ask = FALSE)
#> Files identified: dft-road-casualty-statistics-vehicle-2017.csv
#> https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-vehicle-2017.csv
#> Data already exists in data_dir, not downloading
#> Data saved at /data/stats19/dft-road-casualty-statistics-vehicle-2017.csv
vehicles_2017_raw = read_vehicles(year = 2017)
#> Rows: 238926 Columns: 27
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): accident_index, accident_reference
#> dbl (25): accident_year, vehicle_reference, vehicle_type, towing_and_articul...
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
vehicles_2017 = format_vehicles(vehicles_2017_raw)
# casualties data for 2017
dl_stats19(year = 2017, type = "casualty", ask = FALSE)
#> Files identified: dft-road-casualty-statistics-casualty-2017.csv
#>
#> https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-casualty-2017.csv
#> Data already exists in data_dir, not downloading
#> Data saved at /data/stats19/dft-road-casualty-statistics-casualty-2017.csv
casualties_2017 = read_casualties(year = 2017)
#> Rows: 170993 Columns: 18
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): accident_index, accident_reference
#> dbl (16): accident_year, vehicle_reference, casualty_reference, casualty_cla...
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The package automates this two-step read_* and format_* process: in all cases the default is data_year = read_*(year, format = TRUE), so read_* functions return formatted data unless told otherwise. The two-step process may nevertheless be important for reference to the official nomenclature and values as provided by the DfT.
A summary of the outputs for each of the three tables is shown below.
summarise_stats19 = function(x) {
  data.frame(row.names = 1:length(x),
    name = substr(names(x), 1, 19),
    class = sapply(x, function(v) class(v)[1]),
    n_unique = sapply(x, function(v) length(unique(v))),
    first_label = sapply(x, function(v) substr(unique(v)[1], 1, 16)),
    most_common_value = sapply(x, function(v)
      substr(names(sort(table(v), decreasing = TRUE)[1]), 1, 16)[1])
  )
}
knitr::kable(summarise_stats19(crashes_2017),
             caption = "Summary of formatted crash data.")
name | class | n_unique | first_label | most_common_value |
---|---|---|---|---|
accident_index | character | 129982 | 2017010001708 | 2017010001708 |
accident_year | numeric | 1 | 2017 | 2017 |
accident_reference | character | 129982 | 010001708 | 010001708 |
location_easting_os | numeric | 89265 | 532920 | 533650 |
location_northing_o | numeric | 91209 | 196330 | 181170 |
longitude | character | 124619 | -0.080107 | NULL |
latitude | character | 123292 | 51.650061 | NULL |
police_force | character | 51 | Metropolitan Pol | Metropolitan Pol |
accident_severity | character | 3 | Fatal | Slight |
number_of_vehicles | numeric | 15 | 2 | 2 |
number_of_casualtie | numeric | 20 | 3 | 1 |
date | Date | 365 | 2017-08-05 | 2017-12-01 |
day_of_week | character | 7 | Saturday | Friday |
time | hms | 1439 | 03:12:00 | 17:00:00 |
local_authority_dis | character | 380 | Enfield | Birmingham |
local_authority_ons | character | 381 | E09000010 | E08000025 |
local_authority_hig | character | 208 | E09000010 | E10000016 |
first_road_class | character | 6 | A | A |
first_road_number | character | 2 | NA | first_road_class |
road_type | character | 6 | Single carriagew | Single carriagew |
speed_limit | numeric | 6 | 30 | 30 |
junction_detail | character | 11 | Not at junction | Not at junction |
junction_control | character | 6 | Data missing or | Give way or unco |
second_road_class | character | 7 | NA | Unclassified |
second_road_number | character | 3 | Unknown | first_road_class |
pedestrian_crossing | character | 5 | None within 50 m | None within 50 m |
pedestrian_crossing | character | 8 | No physical cros | No physical cros |
light_conditions | character | 6 | Darkness - light | Daylight |
weather_conditions | character | 10 | Fine no high win | Fine no high win |
road_surface_condit | character | 7 | Dry | Dry |
special_conditions_ | character | 10 | None | None |
carriageway_hazards | character | 8 | None | None |
urban_or_rural_area | character | 3 | Urban | Urban |
did_police_officer_ | character | 3 | Yes | Yes |
trunk_road_flag | character | 3 | Non-trunk | Non-trunk |
lsoa_of_accident_lo | character | 28286 | E01001450 | -1 |
datetime | POSIXct | 89676 | 2017-08-05 03:12 | 2017-05-16 17:00 |
knitr::kable(summarise_stats19(vehicles_2017),
             caption = "Summary of formatted vehicles data.")
name | class | n_unique | first_label | most_common_value |
---|---|---|---|---|
accident_index | character | 129982 | 2017010001708 | 2017500194936 |
accident_year | numeric | 1 | 2017 | 2017 |
accident_reference | character | 129982 | 010001708 | 500194936 |
vehicle_reference | numeric | 24 | 1 | 1 |
vehicle_type | character | 1 | NA | NA |
towing_and_articula | character | 1 | NA | NA |
vehicle_manoeuvre | character | 1 | NA | NA |
vehicle_direction_f | character | 1 | NA | NA |
vehicle_direction_t | character | 1 | NA | NA |
vehicle_location_re | character | 1 | NA | NA |
junction_location | character | 1 | NA | NA |
skidding_and_overtu | character | 1 | NA | NA |
hit_object_in_carri | character | 1 | NA | NA |
vehicle_leaving_car | character | 1 | NA | NA |
hit_object_off_carr | character | 1 | NA | NA |
first_point_of_impa | character | 1 | NA | NA |
vehicle_left_hand_d | character | 1 | NA | NA |
journey_purpose_of_ | character | 1 | NA | NA |
sex_of_driver | character | 1 | NA | NA |
age_of_driver | character | 1 | NA | NA |
age_band_of_driver | character | 1 | NA | NA |
engine_capacity_cc | character | 1 | NA | NA |
propulsion_code | character | 1 | NA | NA |
age_of_vehicle | numeric | 72 | 1 | -1 |
generic_make_model | character | 1 | NA | NA |
driver_imd_decile | character | 1 | NA | NA |
driver_home_area_ty | character | 1 | NA | NA |
knitr::kable(summarise_stats19(casualties_2017),
             caption = "Summary of formatted casualty data.")
name | class | n_unique | first_label | most_common_value |
---|---|---|---|---|
accident_index | character | 129982 | 2017010001708 | 201797NC00502 |
accident_year | numeric | 1 | 2017 | 2017 |
accident_reference | character | 129982 | 010001708 | 97NC00502 |
vehicle_reference | numeric | 15 | 1 | 1 |
casualty_reference | numeric | 43 | 1 | 1 |
casualty_class | character | 3 | Passenger | Driver or rider |
sex_of_casualty | character | 3 | Female | Male |
age_of_casualty | character | 2 | NA | Data missing or |
age_band_of_casualt | character | 12 | 16 - 20 | 26 - 35 |
casualty_severity | character | 3 | Slight | Slight |
pedestrian_location | character | 11 | Not a Pedestrian | Not a Pedestrian |
pedestrian_movement | character | 10 | Not a Pedestrian | Not a Pedestrian |
car_passenger | character | 5 | Front seat passe | Not car passenge |
bus_or_coach_passen | character | 6 | Not a bus or coa | Not a bus or coa |
pedestrian_road_mai | character | 5 | No / Not applica | No / Not applica |
casualty_type | character | 22 | Car occupant | Car occupant |
casualty_home_area_ | character | 4 | Urban area | Urban area |
casualty_imd_decile | character | 11 | More deprived 10 | Data missing or |
For testing and other purposes, a sample from the accidents table is provided in the package. A few columns from the three-row sample are shown below:
Accident_Severity | Speed_limit | Pedestrian_Crossing-Human_Control | Light_Conditions |
---|---|---|---|
2 | 30 | 0 | 1 |
2 | 30 | 0 | 1 |
2 | 60 | 0 | 1 |
As with crashes_2017, casualty data for 2017 can be downloaded, read-in and formatted as follows:
dl_stats19(year = 2017, type = "casualty", ask = FALSE)
#> Files identified: dft-road-casualty-statistics-casualty-2017.csv
#> https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-casualty-2017.csv
#> Data already exists in data_dir, not downloading
#> Data saved at /data/stats19/dft-road-casualty-statistics-casualty-2017.csv
casualties_2017 = read_casualties(year = 2017)
#> Rows: 170993 Columns: 18
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): accident_index, accident_reference
#> dbl (16): accident_year, vehicle_reference, casualty_reference, casualty_cla...
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nrow(casualties_2017)
#> [1] 170993
ncol(casualties_2017)
#> [1] 18
The results show that there were 170,993 casualties reported by the police in the STATS19 dataset in 2017, and 18 columns (variables). Values for a sample of these columns are shown below:
casualties_2017[c(4, 5, 6, 14)]
#> # A tibble: 170,993 × 4
#> vehicle_reference casualty_reference casualty_class bus_or_coach_passenger
#> <dbl> <dbl> <chr> <chr>
#> 1 1 1 Passenger Not a bus or coach pass…
#> 2 2 2 Driver or rider Not a bus or coach pass…
#> 3 2 3 Passenger Not a bus or coach pass…
#> 4 1 1 Passenger Not a bus or coach pass…
#> 5 3 1 Driver or rider Not a bus or coach pass…
#> 6 1 1 Passenger Not a bus or coach pass…
#> 7 1 1 Pedestrian Not a bus or coach pass…
#> 8 2 1 Driver or rider Not a bus or coach pass…
#> 9 1 1 Driver or rider Not a bus or coach pass…
#> 10 2 2 Driver or rider Not a bus or coach pass…
#> # … with 170,983 more rows
The full list of column names in the casualties dataset is:
names(casualties_2017)
#> [1] "accident_index" "accident_year"
#> [3] "accident_reference" "vehicle_reference"
#> [5] "casualty_reference" "casualty_class"
#> [7] "sex_of_casualty" "age_of_casualty"
#> [9] "age_band_of_casualty" "casualty_severity"
#> [11] "pedestrian_location" "pedestrian_movement"
#> [13] "car_passenger" "bus_or_coach_passenger"
#> [15] "pedestrian_road_maintenance_worker" "casualty_type"
#> [17] "casualty_home_area_type" "casualty_imd_decile"
Data for vehicles involved in crashes in 2017 can be downloaded, read-in and formatted as follows:
dl_stats19(year = 2017, type = "vehicle", ask = FALSE)
#> Files identified: dft-road-casualty-statistics-vehicle-2017.csv
#> https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-vehicle-2017.csv
#> Data already exists in data_dir, not downloading
#> Data saved at /data/stats19/dft-road-casualty-statistics-vehicle-2017.csv
vehicles_2017 = read_vehicles(year = 2017)
#> Rows: 238926 Columns: 27
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): accident_index, accident_reference
#> dbl (25): accident_year, vehicle_reference, vehicle_type, towing_and_articul...
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nrow(vehicles_2017)
#> [1] 238926
ncol(vehicles_2017)
#> [1] 27
The results show that there were 238,926 vehicles involved in crashes reported by the police in the STATS19 dataset in 2017, with 27 columns (variables). Values for a sample of these columns are shown below:
vehicles_2017[c(3, 14:16)]
#> # A tibble: 238,926 × 4
#> accident_reference vehicle_leaving_carriageway hit_object_off_carri…¹ first…²
#> <chr> <chr> <chr> <chr>
#> 1 010001708 Did not leave carriageway None Front
#> 2 010001708 Did not leave carriageway None Back
#> 3 010009342 Did not leave carriageway None Back
#> 4 010009342 Did not leave carriageway None Front
#> 5 010009344 Did not leave carriageway None Front
#> 6 010009344 Did not leave carriageway None Front
#> 7 010009344 Did not leave carriageway None Front
#> 8 010009348 Did not leave carriageway None Front
#> 9 010009348 Did not leave carriageway None Offside
#> 10 010009350 Did not leave carriageway None Offside
#> # … with 238,916 more rows, and abbreviated variable names
#> # ¹hit_object_off_carriageway, ²first_point_of_impact
The full list of column names in the vehicles dataset is:
names(vehicles_2017)
#> [1] "accident_index" "accident_year"
#> [3] "accident_reference" "vehicle_reference"
#> [5] "vehicle_type" "towing_and_articulation"
#> [7] "vehicle_manoeuvre" "vehicle_direction_from"
#> [9] "vehicle_direction_to" "vehicle_location_restricted_lane"
#> [11] "junction_location" "skidding_and_overturning"
#> [13] "hit_object_in_carriageway" "vehicle_leaving_carriageway"
#> [15] "hit_object_off_carriageway" "first_point_of_impact"
#> [17] "vehicle_left_hand_drive" "journey_purpose_of_driver"
#> [19] "sex_of_driver" "age_of_driver"
#> [21] "age_band_of_driver" "engine_capacity_cc"
#> [23] "propulsion_code" "age_of_vehicle"
#> [25] "generic_make_model" "driver_imd_decile"
#> [27] "driver_home_area_type"
An important feature of STATS19 data is that the “accidents” table contains geographic coordinates. These are provided at ~10m resolution in the UK’s official coordinate reference system (the Ordnance Survey National Grid, EPSG code 27700). stats19 converts the non-geographic tables created by format_accidents() into the geographic data form of the sf package with the function format_sf() as follows:
crashes_sf = format_sf(crashes_2017)
#> 19 rows removed with no coordinates
The note arises because NA values are not permitted in sf coordinates, and so rows containing no coordinates are automatically removed. Having the data in a standard geographic form allows various geographic operations to be performed on it. Spatial operations, such as spatial subsetting and spatial aggregation, can be performed to show the relationship between STATS19 data and other geographic objects, such as roads, schools and administrative zones.
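The removal of rows without coordinates can be sketched in base R as below. The data frame is made up for illustration, and the filtering shown is an assumption about the general approach rather than a copy of what format_sf() does internally.

```r
# Made-up crash records; the second row has no coordinates
crashes = data.frame(
  accident_index = c("a1", "a2", "a3"),
  location_easting_osgr = c(532920, NA, 533650),
  location_northing_osgr = c(196330, NA, 181170)
)

# Keep only rows where both coordinate columns are present
coord_cols = c("location_easting_osgr", "location_northing_osgr")
has_coords = complete.cases(crashes[, coord_cols])
message(sum(!has_coords), " rows removed with no coordinates")
crashes_with_coords = crashes[has_coords, ]
nrow(crashes_with_coords)
```

A data frame filtered this way could then be passed to sf::st_as_sf() with its coords and crs arguments (for example crs = 27700) to create point geometries.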
An example of an administrative zone dataset of relevance to STATS19 data is the boundaries of police forces in England, which is provided in the packaged dataset police_boundaries. The following code chunk demonstrates the kind of spatial operations that can be performed on geographic STATS19 data, by counting and plotting the number of fatalities per police force:
library(sf)
library(dplyr)
crashes_sf %>%
  filter(accident_severity == "Fatal") %>%
  select(n_fatalities = accident_index) %>%
  aggregate(by = police_boundaries, FUN = length) %>%
  plot()
#> old-style crs object detected; please recreate object with a recent sf::st_crs()
Of course, one should not draw conclusions from such analyses without care. In this case, denominators are needed to infer anything about road safety in any of the police regions. After suitable denominators have been included, performance metrics such as ‘health risk’ (fatalities per 100,000 people), ‘traffic risk’ (fatalities per billion km, f/bkm) and ‘exposure risk’ (fatalities per million hours, f/mh) can be calculated (Feleke et al. 2018; Elvik et al. 2009).
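The arithmetic behind two of these metrics is simple once denominators are available, as the sketch below shows. All figures are entirely made up for two hypothetical regions; none of them come from STATS19 or any official source.

```r
# Entirely made-up figures for two hypothetical regions
regions = data.frame(
  region = c("Region A", "Region B"),
  fatalities = c(30, 45),
  population = c(1500000, 900000),
  distance_bkm = c(12, 6)   # billion km travelled per year
)

# 'Health risk': fatalities per 100,000 people
regions$health_risk = regions$fatalities / regions$population * 100000
# 'Traffic risk': fatalities per billion km travelled
regions$traffic_risk = regions$fatalities / regions$distance_bkm
regions
```

Raw counts and rates can rank regions differently, which is why counts alone should not be used to compare road safety between police force areas.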
The following code chunk, for example, returns all crashes within the jurisdiction of West Yorkshire Police:
west_yorkshire =
  police_boundaries[police_boundaries$pfa16nm == "West Yorkshire", ]
#> old-style crs object detected; please recreate object with a recent sf::st_crs()
crashes_wy = crashes_sf[west_yorkshire, ]
nrow(crashes_sf)
#> [1] 129963
nrow(crashes_wy)
#> [1] 4371
This subsetting has selected the 4,371 crashes which occurred in West Yorkshire.
The three main tables we have just read-in can be joined by shared key variables. This is demonstrated in the code chunk below, which subsets all casualties that took place in West Yorkshire, and counts the number of casualties by type for each crash:
library(tidyr)
library(dplyr)
sel = casualties_2017$accident_index %in% crashes_wy$accident_index
casualties_wy = casualties_2017[sel, ]
cas_types = casualties_wy %>%
  select(accident_index, casualty_type) %>%
  group_by(accident_index) %>%
  summarise(
    Total = n(),
    walking = sum(casualty_type == "Pedestrian"),
    cycling = sum(casualty_type == "Cyclist"),
    passenger = sum(casualty_type == "Car occupant")
  )
cj = left_join(crashes_wy, cas_types)
What just happened? We found the subset of casualties that took place in West Yorkshire with reference to the accident_index variable. Then we used the dplyr function summarise() to find the number of people who were in a car, cycling, and walking when they were injured. This new casualty dataset is joined onto the crashes_wy dataset. The result is a spatial (sf) data frame of crashes in West Yorkshire, with columns counting how many road users of different types were hurt. The joined data has additional variables:
base::setdiff(names(cj), names(crashes_wy))
#> [1] "Total" "walking" "cycling" "passenger"
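The same count-then-join pattern can be sketched without any packages, which may help when checking the logic on a small example. The two data frames below are made up, and base R's aggregate() and merge() are used in place of the dplyr functions above.

```r
# Made-up crashes and casualties sharing the accident_index key
crashes = data.frame(accident_index = c("a1", "a2", "a3"))
casualties = data.frame(
  accident_index = c("a1", "a1", "a2"),
  casualty_type = c("Pedestrian", "Car occupant", "Cyclist")
)

# One indicator column per quantity to be counted
casualties$Total = 1
casualties$walking = as.integer(casualties$casualty_type == "Pedestrian")
casualties$cycling = as.integer(casualties$casualty_type == "Cyclist")

# Sum the indicators within each crash
cas_types = aggregate(
  cbind(Total, walking, cycling) ~ accident_index,
  data = casualties, FUN = sum
)

# Left join: keep every crash, even those with no matching casualty rows
cj_toy = merge(crashes, cas_types, by = "accident_index", all.x = TRUE)
cj_toy
```

Crashes with no matching casualty rows keep NA in the count columns, so it is worth checking for missing values before plotting or summing.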
As a simple spatial plot, we can map all the crashes that happened in West Yorkshire in 2017, with the colour related to the speed limit and the point size related to the total number of people hurt in each crash. Placing this plot next to a map of West Yorkshire provides context:
plot(
  cj[cj$cycling > 0, "speed_limit", ],
  cex = cj$Total[cj$cycling > 0] / 3,
  main = "Speed limit (cycling)"
)
plot(
  cj[cj$passenger > 0, "speed_limit", ],
  cex = cj$Total[cj$passenger > 0] / 3,
  main = "Speed limit (passenger)"
)
The spatial distribution of crashes in West Yorkshire clearly relates to the region’s geography. Car crashes tend to happen on fast roads, including busy motorways, displayed in yellow above. Cycling is an urban activity, and most bike crashes are found near Leeds city centre, which has a comparatively high level of cycling (compared with the low baseline of 3%). This can be seen by comparing the previous map with an overview of the area, from an academic paper on the social, spatial and temporal distribution of bike crashes (Lovelace, Roberts, and Kellar 2016):
In addition to the Total number of people hurt/killed, cj contains a column for each type of casualty (cyclist, car occupant, etc.), and a number corresponding to the number of each type hurt in each crash. It also contains the geometry column from crashes_sf. In other words, joins allow the casualties and vehicles tables to be geo-referenced. We can then explore the spatial distribution of different casualty types. The following figure, for example, shows the spatial distribution of pedestrians and car passengers hurt in car crashes across West Yorkshire in 2017:
library(ggplot2)
crashes_types = cj %>%
  filter(accident_severity != "Slight") %>%
  mutate(type = case_when(
    walking > 0 ~ "Walking",
    cycling > 0 ~ "Cycling",
    passenger > 0 ~ "Passenger",
    TRUE ~ "Other"
  ))
table(crashes_types$speed_limit)
#>
#> 20 30 40 50 60 70
#> 31 573 85 22 35 35
ggplot(crashes_types, aes(size = Total, colour = speed_limit)) +
  geom_sf(show.legend = "point", alpha = 0.3) +
  facet_grid(vars(type), vars(accident_severity)) +
  scale_size(
    breaks = c(1:3, 12),
    labels = c(1:2, "3+", 12)
  ) +
  scale_color_gradientn(colours = c("blue", "yellow", "red")) +
  theme(axis.text = element_blank(), axis.ticks = element_blank())
Spatial distribution of serious and fatal crashes in West Yorkshire, for cycling, walking, being a car passenger and other modes of travel. Colour is related to the speed limit where the crash happened (red is faster) and size is proportional to the total number of people hurt in each crash (legend not shown).
It is clear that different types of road users tend to get hurt in different places. Car occupant casualties (labelled ‘passengers’ in the map above), for example, are comparatively common on the outskirts of cities such as Leeds, where speed limits tend to be higher and where there are comparatively higher volumes of motor traffic. Casualties to people on foot tend to happen in the city centres. That is not to say that city centres are more dangerous per unit distance walked (risk is typically expressed as casualties per billion kilometres, bkm): there is simply more walking in city centres, and a denominator is needed to estimate risk.
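To illustrate why the denominator matters, the following sketch computes casualties per bkm for two areas. All numbers are invented purely for illustration (they are not taken from STATS19 or from any travel survey), and the exposure data frame and its column names are hypothetical:

```r
# Hypothetical numbers for illustration only (not from STATS19 or any survey)
exposure = data.frame(
  area = c("City centre", "Outskirts"),
  casualties = c(120, 30),   # pedestrians hurt per year
  km_walked = c(60e6, 10e6)  # total distance walked per year, in km
)

# Risk expressed as casualties per billion km walked (bkm)
exposure$casualties_per_bkm = exposure$casualties / (exposure$km_walked / 1e9)
exposure
```

In this made-up case the city centre has four times as many casualties but a lower rate per bkm (around 2,000 compared with around 3,000), because far more walking happens there.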
To drill down further, we can find the spatial distribution of all pedestrian casualties, broken down by seriousness of casualty and light conditions. This can be done with tidyverse functions as follows:
table(cj$light_conditions)
#>
#> Darkness - lighting unknown Darkness - lights lit
#> 864 1051
#> Darkness - lights unlit Darkness - no lighting
#> 11 88
#> Daylight
#> 2357
cj %>%
  filter(walking > 0) %>%
  mutate(light = case_when(
    light_conditions == "Daylight" ~ "Daylight",
    light_conditions == "Darkness - lights lit" ~ "Lit",
    TRUE ~ "Other/Unlit"
  )) %>%
  ggplot(aes(colour = speed_limit)) +
  geom_sf() +
  facet_grid(vars(light), vars(accident_severity)) +
  scale_color_continuous(low = "blue", high = "red") +
  theme(axis.text = element_blank(), axis.ticks = element_blank())
We can also explore seasonal and daily trends in crashes by aggregating crashes by day of the year:
crashes_dates = cj %>%
  st_set_geometry(NULL) %>%
  group_by(date) %>%
  summarise(
    walking = sum(walking),
    cycling = sum(cycling),
    passenger = sum(passenger)
  ) %>%
  gather(mode, casualties, -date)
ggplot(crashes_dates, aes(date, casualties)) +
geom_smooth(aes(colour = mode), method = "loess") +
ylab("Casualties per day")
#> `geom_smooth()` using formula 'y ~ x'
Different types of crashes also tend to happen at different times of day. This is illustrated in the plot below, which shows the times of day when people who were travelling by different modes were most commonly injured.
library(stringr)
crash_times = cj %>%
  st_set_geometry(NULL) %>%
  group_by(hour = as.numeric(str_sub(time, 1, 2))) %>%
  summarise(
    walking = sum(walking),
    cycling = sum(cycling),
    passenger = sum(passenger)
  ) %>%
  gather(mode, casualties, -hour)
ggplot(crash_times, aes(hour, casualties)) +
geom_line(aes(colour = mode))
Note that bike crashes tend to have distinct morning and afternoon peaks, in line with previous research (Lovelace, Roberts, and Kellar 2016). A disproportionate number of car crashes appear to happen in the afternoon.
There is much potential to extend the package beyond downloading, reading and formatting STATS19 data. The greatest potential is to provide functions that will help with analysis of STATS19 data, to help with road safety research. Much academic research has been done using the data, a few examples of which are highlighted below to demonstrate the wide potential for further work.
The broader point is that the stats19 package could help road safety research, by making open access data on road crashes more accessible to researchers worldwide. By easing the data download and cleaning stages of research, it could also encourage reproducible analysis in the field.
There is great potential to add value to and gain insight from the data by joining the datasets with open data, for example from the Consumer Data Research Centre (CDRC, which funded this research), OpenStreetMap and the UK’s Ordnance Survey. If you have any suggestions on priorities for these future directions of (hopefully safe) travel, please get in touch at github.com/ITSLeeds/stats19/issues.