The primary aim of dataset is to create well-referenced, well-described, interoperable datasets from data.frames, tibbles or data.tables that translate well into the W3C DataSet definition within the Data Cube Vocabulary in a reproducible manner. The data cube model itself originates in the Statistical Data and Metadata eXchange, and it is almost fully harmonized with the Resource Description Framework (RDF), the standard model for data interchange on the web [1].
Mapping R objects into these models has numerous advantages.
Our package functions work with any structured R object (data.frame, data.table, tibble, or well-structured lists like json). However, the best functionality is achieved with the dataset S3 class (see The dataset S3 Class), which inherits from data.frame.
You can install the development version of dataset from GitHub:
remotes::install_github('dataobservatory-eu/dataset')
or install from CRAN:
install.packages('dataset')
The dataset() constructor creates a dataset from a data.frame or similar object.
library(dataset)
#>
#> Attaching package: 'dataset'
#> The following object is masked from 'package:base':
#>
#> as.data.frame
my_iris_dataset <- dataset(
  x = iris,
  Dimensions = NULL,
  Measures = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
  Attributes = "Species",
  Title = "Iris Dataset",
  Issued = 1936
)
is.dataset(my_iris_dataset)
#> [1] TRUE
Then you add the metadata:
my_iris_dataset <- dublincore_add(
  x = my_iris_dataset,
  Creator = person("Edgar", "Anderson", role = "aut"),
  Publisher = "American Iris Society",
  Source = "https://doi.org/10.1111/j.1469-1809.1936.tb02137.x",
  Date = 1935,
  Language = "en"
)
print(my_iris_dataset)
#> Iris Dataset by Edgar Anderson
#> Published by American Iris Society
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5.0 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#>
#> ... 140 further observations.
#> Source:https://doi.org/10.1111/j.1469-1809.1936.tb02137.x.
summary(my_iris_dataset)
#> Iris Dataset by Edgar Anderson
#> Published by American Iris Society
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
#> 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
#> Median :5.800 Median :3.000 Median :4.350 Median :1.300
#> Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
#> 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
#> Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
#> Species
#> setosa :50
#> versicolor:50
#> virginica :50
#>
#>
#>
#> Source:https://doi.org/10.1111/j.1469-1809.1936.tb02137.x.
metadata <- dublincore(x = my_iris_dataset)
metadata
#> Title: Iris Dataset
#> Publiser: American Iris Society | Source: https://doi.org/10.1111/j.1469-1809.1936.tb02137.x | Date: 1936 | Language: eng | Identifier: | Rights: | Description: |
#> names: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> - dimensions: <none>
#> - measures: Sepal.Length (numeric) Sepal.Width (numeric) Petal.Length (numeric) Petal.Width (numeric)
#> - attributes: Species (factor)
Beware that the metadata variable is more structured than the printed version.
str(metadata)
#> List of 11
#> $ names : chr [1:5] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" ...
#> $ dimensions:'data.frame': 0 obs. of 4 variables:
#> ..$ names : chr(0)
#> ..$ class : chr(0)
#> ..$ isDefinedBy: chr(0)
#> ..$ codeList : chr(0)
#> $ measures :'data.frame': 4 obs. of 4 variables:
#> ..$ names : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
#> ..$ class : chr [1:4] "numeric" "numeric" "numeric" "numeric"
#> ..$ isDefinedBy: chr [1:4] "https://purl.org/linked-data/cube" "https://purl.org/linked-data/cube" "https://purl.org/linked-data/cube" "https://purl.org/linked-data/cube"
#> ..$ codeListe : chr [1:4] "not yet defined" "not yet defined" "not yet defined" "not yet defined"
#> $ attributes:'data.frame': 1 obs. of 4 variables:
#> ..$ names : chr "Species"
#> ..$ class : chr "factor"
#> ..$ isDefinedBy: chr "https://purl.org/linked-data/cube|https://raw.githubusercontent.com/UKGovLD/publishing-statistical-data/master/"| __truncated__
#> ..$ codeListe : chr "not yet defined"
#> $ Type :List of 2
#> ..$ resourceType : chr "DCMITYPE:Dataset"
#> ..$ resourceTypeGeneral: chr "Dataset"
#> $ Title :List of 1
#> ..$ Title: chr "Iris Dataset"
#> $ Date : num 1936
#> $ Creator :Class 'person' hidden list of 1
#> ..$ :List of 5
#> .. ..$ given : chr "Edgar"
#> .. ..$ family : chr "Anderson"
#> .. ..$ role : chr "aut"
#> .. ..$ email : NULL
#> .. ..$ comment: NULL
#> $ Source : chr "https://doi.org/10.1111/j.1469-1809.1936.tb02137.x"
#> $ Publisher : chr "American Iris Society"
#> $ Language : chr "eng"
This package is in an early development phase. The current dataset S3 class inherits from the base R data.frame. Later versions may change to the modern tibble, which carries a larger dependency footprint but is easier to work with. Easy interoperability with the data.table package remains a top development priority.
According to the RDF Data Cube Vocabulary, a DataSet is a collection of statistical data that corresponds to a defined structure. The data in a data set can be roughly described as belonging to one of the following kinds:
Observations: these are the measured values, and the cells of a data frame object in R.
Organizational structure: To locate an observation within the hypercube, one has at least to know the value of each dimension at which the observation is located, so these values must be specified for each observation. Datasets can have additional organizational structure in the form of slices as described in section 7.2.
Structural metadata: Metadata to interpret the data. What is the unit of measurement? Is it a normal value or a series break? Is the value measured or estimated? These metadata are provided as attributes and can be attached to individual observations, or to higher levels.
Reference metadata: Metadata that describes the dataset as a whole, such as categorization of the dataset, its publisher, or an endpoint where it can be accessed.

Information | dataset |
---|---|
dimensions | first column section of the dataset |
measurements | second column section of the dataset |
attributes | third column section of the dataset |
reference | attributes of the R object |
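The mapping in the table can be sketched in base R. This is only a minimal illustration under stated assumptions, not the package's implementation: the attribute names `Title`, `Publisher` and `sections`, and the `column_sections` list, are hypothetical and were chosen only to show column sections and reference metadata stored as attributes of the object.

```r
# Hypothetical illustration in base R; not the internals of the dataset package.

# Column sections: which columns play which role (iris has no dimensions).
column_sections <- list(
  dimensions = character(0),
  measures   = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
  attributes = "Species"
)

# Order the columns as dimensions, then measures, then attributes.
df <- iris[, unlist(column_sections, use.names = FALSE)]

# Reference metadata describes the dataset as a whole, so it is attached
# to the R object itself rather than to any single column.
attr(df, "Title")     <- "Iris Dataset"
attr(df, "Publisher") <- "American Iris Society"
attr(df, "sections")  <- column_sections

# Read the reference metadata and the column roles back from the object.
attr(df, "Title")
attr(df, "sections")$measures
```

Storing the reference metadata as attributes keeps the object usable as an ordinary data.frame, although some base R operations may drop custom attributes.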
Our dataset class follows the organizational model of the datacube, which is used by the Statistical Data and Metadata eXchange (SDMX) and is also described in a non-normative manner by the RDF Data Cube Vocabulary. While the SDMX standards predate the Resource Description Framework (RDF) for the semantic web, the two are already harmonized to a great extent, which enables users and data publishers to create machine-to-machine connections among statistical data. Our goal is to create a modern data frame object in R with utilities that allow the R user to benefit from synchronizing data with semantic web applications, including statistical resources, libraries, or open science repositories.
The vignette The dataset S3 Class explains in more detail our interpretation of the datacube model, and some considerations and dilemmas that we are facing in the further development of this early-stage package.
Our datasets:
Contain Dublin Core or DataCite (or both) metadata that makes them findable and more easily accessible via online libraries. See the vignette article Datasets With FAIR Metadata.
Have dimensions that can be easily and unambiguously reduced to triples for RDF applications; they can be easily serialized to, or synchronized with, semantic web applications. See the vignette article From dataset To RDF.
Contain processing metadata that greatly enhances the reproducibility of the results and the reviewability of the contents of the dataset, including metadata defined by the DDI Alliance, which is particularly helpful for data that has not yet been processed.
Follow the datacube model of the Statistical Data and Metadata eXchange, which allows easy refreshing with new data from the source of the analytical work and is particularly useful for datasets containing the results of statistical operations in R.
Export correctly with FAIR metadata to the most used file formats, and can be published straightforwardly to open science repositories with correct bibliographical and use metadata. See Export And Publish a dataset.
Can use the dataspice package programmatically to publish dataset documentation.
Are relatively lightweight in dependencies and work easily with data.frame, tibble or data.table R objects.
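To illustrate what reducing a dataset to triples means, the following base R sketch turns each cell of a small data frame into a subject-predicate-object statement. It is an illustration under stated assumptions and not the package's export method: the helper name `to_triples` and the `obs_` subject identifiers are hypothetical, and a real RDF serialization would use URIs defined by a vocabulary instead of plain strings.

```r
# Hypothetical helper in base R; not a function of the dataset package.
to_triples <- function(df) {
  # One subject per observation (row); the "obs_" prefix is an assumption.
  subjects <- paste0("obs_", seq_len(nrow(df)))
  # One block of statements per column: the column name is the predicate,
  # the cell value (as text) is the object.
  triples <- lapply(names(df), function(column) {
    data.frame(
      subject   = subjects,
      predicate = column,
      object    = as.character(df[[column]]),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, triples)
}

small <- head(iris[, c("Sepal.Length", "Species")], 2)
triples <- to_triples(small)
triples
```

Each row of the result is one statement, so a data frame with n rows and k columns yields n × k triples.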
Please note that the dataset package is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
Furthermore, the rOpenSci Community Contributing Guide (a guide to help people find ways to contribute to rOpenSci) is also applicable, because dataset is under software review for potential inclusion in rOpenSci.
[1] RDF Data Cube Vocabulary, W3C Recommendation 16 January 2014, https://www.w3.org/TR/vocab-data-cube/; Introduction to SDMX data modeling, https://www.unescap.org/sites/default/files/Session_4_SDMX_Data_Modeling_%20Intro_UNSD_WS_National_SDG_10-13Sep2019.pdf