library(dataset)
In the R language, datasets are usually contained in a data.frame() object, or in one of its modernized versions. For example, tibble::tibble() and data.table::data.table() both inherit from the base data.frame().
The base data.frame() constructor, like most base R types, is very flexible. It allows the use of any kind of metadata attached to the object.
foo <- data.frame(x = c(1, 2), y = c(3, 4))
attr(foo, "Title") <- "My Foo Object"
attributes(foo)
#> $names
#> [1] "x" "y"
#>
#> $class
#> [1] "data.frame"
#>
#> $row.names
#> [1] 1 2
#>
#> $Title
#> [1] "My Foo Object"
For reproducible research, publication, or linking resources on the
web, the standardization of metadata is critically important. The
dataset() class is a modernized data.frame that has standardized
attributes.
dataset_title(iris_dataset)
#> $Title
#> [1] "Iris Dataset"
publisher(iris_dataset)
#> [1] "American Iris Society"
According to the RDF Data Cube Vocabulary, a DataSet is a collection of statistical data that corresponds to a defined structure. The data in a data set can be roughly described as belonging to one of the following kinds:
Observations
: These are the measured values, and the cells of a data frame object in R.

Organizational structure
: To locate an observation within the hypercube, one has at least to know the value of each dimension at which the observation is located, so these values must be specified for each observation. Datasets can have additional organizational structure in the form of slices as described in section 7.2.

Structural metadata
: Metadata to interpret the data. What is the unit of measurement? Is it a normal value or a series break? Is the value measured or estimated? These metadata are provided as attributes and can be attached to individual observations, or to higher levels.

Reference metadata
: Metadata that describes the dataset as a whole, such as categorization of the dataset, its publisher, or an endpoint where it can be accessed.

Information | dataset |
---|---|
dimensions | first column section of the dataset |
measurements | second column section of the dataset |
attributes | third column section of the dataset |
reference | attributes of the R object |
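The column layout in the table above can be sketched with base R alone. The toy data frame, its values, and the three name vectors below are assumptions made for illustration; they are not part of the dataset package.

```r
# Toy data frame; all values are invented for illustration.
toy <- data.frame(
  values = c(10.5, 11.2),
  unit   = c("EUR_HAB", "EUR_HAB"),
  geo    = c("AT", "BE"),
  time   = as.Date(c("2021-01-01", "2021-01-01")),
  stringsAsFactors = FALSE
)

# Hypothetical role assignment (not the dataset package API).
dimension_cols <- c("geo", "time")
measure_cols   <- "values"
attribute_cols <- "unit"

# Reorder columns: dimensions first, then measurements, then attributes.
toy <- toy[, c(dimension_cols, measure_cols, attribute_cols)]

# Reference metadata describes the whole object, so it goes into attributes.
attr(toy, "Title") <- "Toy R&D expenditure table"

names(toy)
```

The reference metadata is set after the reordering, so the sketch does not rely on how base R subsetting treats custom attributes.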
rd_e_gerdtot <- eurostat::get_eurostat('rd_e_gerdtot')
head(rd_e_gerdtot)
#> # A tibble: 6 × 5
#> sectperf unit geo time values
#> <chr> <chr> <chr> <date> <dbl>
#> 1 BES EUR_HAB AT 2021-01-01 1008.
#> 2 BES EUR_HAB BE 2021-01-01 1054.
#> 3 BES EUR_HAB BG 2021-01-01 52.3
#> 4 BES EUR_HAB CY 2021-01-01 109.
#> 5 BES EUR_HAB CZ 2021-01-01 279
#> 6 BES EUR_HAB DE 2021-01-01 904.
Dimensions are usually needed in data analysis, because they are used
for subsetting (slicing) the dataset. They typically contain
information about the reference time period and geographical area.
In a dataset that has homogeneous dimensions (all data relate to the year 2022 and to the area of the United States) you could move the dimensions into the attributes of the R object, or simply omit them. However, dimensions are critically important for filtering out the observations (measurements) that you want to work with, or for correctly joining (integrating) datasets. If you want to create a composite indicator from two datasets that relate to the United States and the year 2022, you do not want to accidentally match measurements about 2021 or Canada.
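A minimal base R sketch of such a join follows. The two toy tables and all of their figures are invented for illustration, and merge() stands in for whatever join function you prefer.

```r
# Two toy datasets; all figures are invented for illustration.
gdp <- data.frame(
  geo  = c("US", "US", "CA"),
  time = c(2021, 2022, 2022),
  gdp  = c(100, 110, 50)
)
population <- data.frame(
  geo  = c("US", "CA"),
  time = c(2022, 2022),
  population = c(10, 4)
)

# Joining on both dimensions matches only rows with the same area and year.
joined <- merge(gdp, population, by = c("geo", "time"))

# The composite indicator is computed on correctly matched observations.
joined$gdp_per_capita <- joined$gdp / joined$population
joined
```

Because both geo and time are join keys, the 2021 row of the first table finds no partner and is dropped instead of being matched to a 2022 value.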
dimensions(rd_e_gerdtot) <- c("geo", "time", "sectperf")
The measurements are the actual observed values. In a long-form tidy dataset you usually have only one measurement column, such as ‘value’.
measures(rd_e_gerdtot) <- "value"
Attributes are similar to dimensions, but they can be fully static and constant in a dataset. You may have measurements for the same reference area and time available in both kilograms and tons in the same dataset, in which case you will likely filter for the correct unit of measure when you do analytical work or join (integrate) the data.
If your measurement unit is always centimeters (like in the iris dataset), it is tempting to treat this as a dataset-wide constant (and therefore move it to the attributes of the data frame R object), but we do not recommend this approach. Imagine that you want to join this dataset with some other data that is measured in millimeters or inches, or a dataset that has values in both millimeters and centimeters. To correctly match your data you will be filtering on attributes, too.
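The following base R sketch illustrates filtering on a unit attribute before a join. The tables, the unit codes "MM" and "CM", and all values are assumptions made for this example.

```r
# Toy measurements; values and unit codes are invented for illustration.
lengths_a <- data.frame(
  id    = c(1, 2),
  value = c(51, 49),
  unit  = c("MM", "MM")
)
lengths_b <- data.frame(
  id    = c(1, 1, 2, 2),
  value = c(5.0, 50, 4.7, 47),
  unit  = c("CM", "MM", "CM", "MM")
)

# Filter the second dataset to the unit used by the first before joining.
lengths_b_mm <- lengths_b[lengths_b$unit == "MM", ]

# Joining on the unit column as well guards against mixing units.
comparison <- merge(lengths_a, lengths_b_mm,
                    by = c("id", "unit"),
                    suffixes = c("_a", "_b"))
comparison
```

This only works because the unit is kept in a column; had it been moved to the object-level attributes, the join could not use it as a key.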
Attributes that may vary across observations (rows) should remain in
the dataset in the datacube model. To avoid confusion with the base R
attributes() function, we named the function that sets the attributes
within a dataset attributes_measures().
attributes_measures(rd_e_gerdtot) <- "unit"
datacite(rd_e_gerdtot)
#> $names
#> [1] "sectperf" "unit" "geo" "time" "values"
#>
#> $dimensions
#> names class
#> sectperf sectperf character
#> geo geo character
#> time time Date
#> isDefinedBy
#> sectperf https://purl.org/linked-data/cube|https://raw.githubusercontent.com/UKGovLD/publishing-statistical-data/master/specs/src/main/vocab/sdmx-attribute.ttl
#> geo https://purl.org/linked-data/cube|https://raw.githubusercontent.com/UKGovLD/publishing-statistical-data/master/specs/src/main/vocab/sdmx-attribute.ttl
#> time https://purl.org/linked-data/cube|https://raw.githubusercontent.com/UKGovLD/publishing-statistical-data/master/specs/src/main/vocab/sdmx-attribute.ttl
#> codeList
#> sectperf not yet defined
#> geo not yet defined
#> time not yet defined
#>
#> $measures
#> [1] names class isDefinedBy codeListe
#> <0 rows> (or 0-length row.names)
#>
#> $attributes
#> names class
#> unit unit character
#> isDefinedBy
#> unit https://purl.org/linked-data/cube|https://raw.githubusercontent.com/UKGovLD/publishing-statistical-data/master/specs/src/main/vocab/sdmx-attribute.ttl
#> codeListe
#> unit not yet defined
Our dataset R package aims to increase the Findability, Accessibility,
Interoperability, and Reuse of digital assets, particularly datacubes
and datasets used in statistics and data analysis. The FAIR principles
“…emphasize machine-actionability (i.e., the capacity of computational
systems to find, access, interoperate, and reuse data with none or
minimal human intervention) because humans increasingly rely on
computational support to deal with data as a result of the increase in
volume, complexity, and creation speed of data.”
This is the role of the reference metadata in the RDF Data Cube
Vocabulary and the SDMX data cube model. We generally keep the
reference metadata as attributes() of the R object, because they do not
relate to the rows (observations) of the data, but to the entire set of
data. However, omitting all reference metadata from the columns is not
a good practice if you want your data to be used in a knowledge graph
(or the semantic web, or Web 3.0).
toc <- eurostat::get_eurostat_toc()
rd_e_gerdtot_reference <- toc[which(toc$code == "rd_e_gerdtot"),]
datacite_add(rd_e_gerdtot,
  Title = 'GERD by sector of performance',
  Creator = person("Daniel", "Antal"),
  Identifier = 'eurostat_rd_e_gerdtot',
  Publisher = 'Eurostat',
  PublicationYear = substr(rd_e_gerdtot_reference$`last update of data`, 7,11),
  Subject = subject_create("Research",
                           subjectScheme = "LC Subject Headings",
                           schemeURI = "http://id.loc.gov/authorities/subjects",
                           valueURI = "http://id.loc.gov/authorities/subjects/sh85113021"),
  Language = "English")
#> # A tibble: 36,738 × 5
#> sectperf unit geo time values
#> <chr> <chr> <chr> <date> <dbl>
#> 1 BES EUR_HAB AT 2021-01-01 1008.
#> 2 BES EUR_HAB BE 2021-01-01 1054.
#> 3 BES EUR_HAB BG 2021-01-01 52.3
#> 4 BES EUR_HAB CY 2021-01-01 109.
#> 5 BES EUR_HAB CZ 2021-01-01 279
#> 6 BES EUR_HAB DE 2021-01-01 904.
#> 7 BES EUR_HAB DK 2021-01-01 1008.
#> 8 BES EUR_HAB EA19 2021-01-01 544.
#> 9 BES EUR_HAB EE 2021-01-01 231.
#> 10 BES EUR_HAB EL 2021-01-01 117.
#> # … with 36,728 more rows
datacite(rd_e_gerdtot)
#> $names
#> [1] "sectperf" "unit" "geo" "time" "values"
#>
#> $dimensions
#> names class
#> sectperf sectperf character
#> geo geo character
#> time time Date
#> isDefinedBy
#> sectperf https://purl.org/linked-data/cube|https://raw.githubusercontent.com/UKGovLD/publishing-statistical-data/master/specs/src/main/vocab/sdmx-attribute.ttl
#> geo https://purl.org/linked-data/cube|https://raw.githubusercontent.com/UKGovLD/publishing-statistical-data/master/specs/src/main/vocab/sdmx-attribute.ttl
#> time https://purl.org/linked-data/cube|https://raw.githubusercontent.com/UKGovLD/publishing-statistical-data/master/specs/src/main/vocab/sdmx-attribute.ttl
#> codeList
#> sectperf not yet defined
#> geo not yet defined
#> time not yet defined
#>
#> $measures
#> [1] names class isDefinedBy codeListe
#> <0 rows> (or 0-length row.names)
#>
#> $attributes
#> names class
#> unit unit character
#> isDefinedBy
#> unit https://purl.org/linked-data/cube|https://raw.githubusercontent.com/UKGovLD/publishing-statistical-data/master/specs/src/main/vocab/sdmx-attribute.ttl
#> codeListe
#> unit not yet defined
Following the datacube model, our datasets are data frames with
clearly defined dimensions (time, geo, sex), measurements (value), and
attributes (unit, freq, status). In this example, all dimensions and
values follow the SDMX attribute definition, i.e. they have a
standardized, natural language independent codelist. (To use these
codelists, use the statcodelist data package.)
R objects inherited from the base data.frame() have row (observation)
identifiers as row.names() attributes. This works well if you work with
a single data frame, but this approach is not sufficient to identify
observations if you work with several data frames and you want to
organize them into a database, join them into new tables, or make them
available on a knowledge graph.
When joining data tables or working in a relational database, you need unique identifiers for each unique observation unit in your system. If you want to broaden the usability of your data to the entire semantic web, and use it as linked data, you need a truly unique identifier (URI) for each observation.
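Whether a set of columns identifies every observation uniquely can be checked with base R. The sketch below assumes a toy table in which geo and time are the dimensions; the duplicated key in the last row is deliberate.

```r
# Toy observations; the last row accidentally repeats the first key.
obs <- data.frame(
  geo   = c("AT", "BE", "AT"),
  time  = c(2021, 2021, 2021),
  value = c(1.0, 2.0, 1.5)
)

dimension_cols <- c("geo", "time")  # assumed dimension columns

# anyDuplicated() returns 0 when every key combination is unique,
# otherwise the index of the first duplicated row.
first_duplicate <- anyDuplicated(obs[, dimension_cols])
first_duplicate

# Keep only the first occurrence of each key for this illustration.
obs_unique <- obs[!duplicated(obs[, dimension_cols]), ]
```

In real data a duplicated key usually signals a missing dimension or attribute column rather than a row that can simply be dropped.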
We recommend the use of an explicit row identifier. The popular
modern R data frames, tibble::tibble() and data.table::data.table(),
use row identifiers.
One of the advantages of using an explicit row identifier is that it can form the root for minting URIs for the entire dataset by collapsing the dimensions and attributes into a concatenated string starting with the row identifier. This will make your dataset ready to be used in triples, a strict, tidy, three-column long-form dataset used in linked open data applications. As mentioned earlier, in homogeneous (or homogeneously subsetted) datasets, you could move the dimensions and the attributes out from the data frame cells into the descriptive attributes. However, if you want to work with linked data, you must have all structural information present in the data cells, because this makes it possible to link data from different publishers without a utopian, global database map.
In the following example, we concatenate the rowid and the time, geo
and sex dimensions into a single URI. We can do this, because in a
well-organized dataset the combination of dimensions is unique
(otherwise we would simply be duplicating an observation). However,
adding the attributes to the URI would be superfluous, because their
combination is not unique in the observations.
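A minimal base R sketch of this concatenation is shown here. The toy data frame, the underscore separator, and the example.com namespace are assumptions made for illustration and are not produced by the dataset package.

```r
# Toy data frame with an explicit row identifier; values are invented.
toy <- data.frame(
  rowid = c("obs1", "obs2"),
  time  = c("2021", "2021"),
  geo   = c("AT", "BE"),
  sex   = c("F", "F"),
  value = c(10, 12),
  unit  = c("NR", "NR"),
  stringsAsFactors = FALSE
)

# Hypothetical namespace; replace it with one that you control.
base_uri <- "https://example.com/dataset/toy#"

# Concatenate the row identifier and the dimensions; attributes are left out.
toy$uri <- paste0(
  base_uri,
  paste(toy$rowid, toy$time, toy$geo, toy$sex, sep = "_")
)
toy$uri
```

The unit column is kept as a data cell but does not enter the identifier, in line with the reasoning above.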
The From dataset To RDF vignette article shows you how you can organize your data into strict, tidy, three-column triples that can be serialized into RDF data.
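As a rough base R illustration of the three-column shape only, and not of the package's own conversion, a wide table can be stacked into subject, predicate and object columns. The subject strings and values below are invented.

```r
# Toy data frame with one subject identifier per observation (invented values).
toy <- data.frame(
  subject = c("obs1", "obs2"),
  geo     = c("AT", "BE"),
  value   = c("10", "12"),
  stringsAsFactors = FALSE
)

# Every non-subject column becomes a predicate; every cell becomes an object.
predicates <- setdiff(names(toy), "subject")
triples <- do.call(rbind, lapply(predicates, function(p) {
  data.frame(
    subject   = toy$subject,
    predicate = p,
    object    = as.character(toy[[p]]),
    stringsAsFactors = FALSE
  )
}))
triples
```

In a real serialization the subjects and predicates would be URIs and the objects typed literals or URIs; plain strings are used here to keep the sketch short.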