library(dataset)
According to the RDF Data Cube Vocabulary, a DataSet is a collection of statistical data that corresponds to a defined structure. The data in a data set can be roughly described as belonging to one of the following kinds:
Observations
: These are the measured values, and the cells of a data frame object in R.

Organizational structure
: To locate an observation within the hypercube, one has at least to know the value of each dimension at which the observation is located, so these values must be specified for each observation. Datasets can have additional organizational structure in the form of slices as described in section 7.2.

Structural metadata
: Metadata to interpret the data. What is the unit of measurement? Is it a normal value or a series break? Is the value measured or estimated? These metadata are provided as attributes and can be attached to individual observations, or to higher levels.

Reference metadata
: Metadata that describes the dataset as a whole, such as categorization of the dataset, its publisher, or an endpoint where it can be accessed.

Information | dataset |
---|---|
dimensions | first column section of the dataset |
measurements | second column section of the dataset |
attributes | third column section of the dataset |
reference | attributes of the R object |
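The mapping in the table can be illustrated with plain base R, without relying on the dataset package itself. The sketch below is only an assumption about how such a layout could look: the column names (`geo`, `year`, `value`, `unit`), the attribute names (`Title`, `Publisher`) and the `roles` list are hypothetical and are not the package's internal representation.

```r
# A hypothetical data frame laid out in the three column sections of the table.
toy <- data.frame(
  geo   = c("NL", "NL", "BE", "BE"),   # dimension
  year  = c(2020, 2021, 2020, 2021),   # dimension
  value = c(10.5, 11.2, 8.1, 8.4),     # measurement
  unit  = "million EUR",               # attribute attached to each observation
  stringsAsFactors = FALSE
)

# Reference metadata describes the object as a whole, so it is stored
# as attributes of the R object rather than in columns.
attr(toy, "Title")     <- "Toy example"
attr(toy, "Publisher") <- "Example publisher"

# Column roles recorded as a plain named list (illustrative only).
roles <- list(
  dimensions   = c("geo", "year"),
  measurements = "value",
  attributes   = "unit"
)

str(toy[roles$dimensions])
attr(toy, "Title")
```

Keeping the reference metadata in attributes leaves the columns free for observations and their structural metadata.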
Our dataset R package aims to increase the
Findability, Accessibility, Interoperability, and
Reuse of digital assets, particularly datacubes and datasets
used in statistics and data analysis. The FAIR principles
“…emphasize machine-actionability (i.e., the capacity of computational
systems to find, access, interoperate, and reuse data with none or
minimal human intervention) because humans increasingly rely on
computational support to deal with data as a result of the increase in
volume, complexity, and creation speed of data.”
The dataset package adds metadata to R objects for full
compatibility with the RDF Data Cube Vocabulary (which includes the
DataSet definition), which is harmonized with the Statistical Data and
Metadata eXchange (SDMX). This is necessary to correctly import data from
statistical sources or from the semantic web, and to harmonize the
processed results with such services.
The RDF Data Cube Vocabulary itself uses some core elements of the Dublin
Core Metadata Elements 1.1, which is a library metadata standard to
archive, store, and find information. The dataset R package
goes beyond the DataSet definition and allows you to add all the Dublin
Core Metadata Elements 1.1, as well as the DataCite
Mandatory Properties 4.4 and the DataCite
Recommended and Optional Properties 4.3.
The Dublin Core has been an ISO standard (ISO 15836:2009) since
February 2009 [ISO15836]. Currently the 15 “core” elements are part of the
larger DCMI Metadata Terms. The DataCite mandatory elements are
narrower than the Dublin Core elements, and the full DataCite is broader.
The two standards are well harmonized, with little difference between them.
We prefer DataCite, because it is a more modern standard that is better suited for the documentation of software and data products. It is also the preferred choice of the EU open repositories that are used to deposit publicly funded research results, including datasets.
iris_ds <- dataset(x = iris,
                   Dimensions = NULL,
                   Measures = c("Sepal.Length", "Sepal.Width",
                                "Petal.Length", "Petal.Width"),
                   Attributes = "Species",
                   Title = "Iris Dataset",
                   Label = "The famous iris dataset used in R examples")
dublincore(iris_ds)
#> Title: Iris Dataset
#> Publiser: | Source: | Date: 19340 | Language: | Identifier: | Rights: | Description: NA |
#> names: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> - dimensions: <none>
#> - measures: Sepal.Length (numeric) Sepal.Width (numeric) Petal.Length (numeric) Petal.Width (numeric)
#> - attributes: Species (factor)
The dataset() constructor adds W3C/SDMX compatible
structural metadata (it declares the measured values and the attributes of
the observations) and some Dublin Core data for findability (these are
properties that are the same in DataCite).
You can add more descriptive metadata to further support discovery and
interoperability. You do not need the dataset() class; any
data.frame, i.e. an object created with data.frame(), tibble::tibble()
or data.table::data.table(), will do. The dataset constructor does not
alter the data or the data structure of a data frame or an object
inherited from data frame; it only adds standardized metadata to it.
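The idea that metadata can be attached without touching the observations can be checked with base R attributes. This is a minimal sketch under stated assumptions: it does not call the dataset package, and the attribute names `Title` and `Language` are chosen only for illustration.

```r
# Attach descriptive metadata to a copy of iris using base R attributes.
# The attribute names are illustrative, not the package's internal layout.
iris_meta <- iris
attr(iris_meta, "Title")    <- "Iris Dataset"
attr(iris_meta, "Language") <- "en"

# Compare the two objects column by column: the observations are untouched.
same_data <- all(mapply(identical, iris, iris_meta))
same_data

# The class and the dimensions are unchanged as well.
identical(class(iris), class(iris_meta))
identical(dim(iris), dim(iris_meta))

# The metadata travels with the object.
attr(iris_meta, "Title")
```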
iris_ds <- datacite_add(iris,
                        Title = "Iris Dataset",
                        Creator = person(family = "Anderson", given = "Edgar", role = "aut"),
                        Identifier = "https://doi.org/10.1111/j.1469-1809.1936.tb02137.x",
                        Publisher = "American Iris Society",
                        PublicationYear = 1935,
                        Geolocation = "US",
                        Language = "en",
                        Version = "1.0")
datacite(iris_ds)
#> $names
#> [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
#>
#> $Title
#> $Title$Title
#> [1] "Iris Dataset"
#>
#>
#> $Creator
#> [1] "Edgar Anderson [aut]"
#>
#> $Identifier
#> [1] "https://doi.org/10.1111/j.1469-1809.1936.tb02137.x"
#>
#> $Publisher
#> [1] "American Iris Society"
#>
#> $Issued
#> [1] 1935
#>
#> $publication_year
#> [1] 1935
#>
#> $Type
#> $Type$resourceType
#> [1] "Dataset"
#>
#> $Type$resourceTypeGeneral
#> [1] "Dataset"
#>
#>
#> $Description
#> [1] NA
#>
#> $Geolocation
#> [1] "US"
#>
#> $Language
#> [1] "eng"
#>
#> $Version
#> [1] "1.0"
#>
#> $Rights
#> [1] NA
#>
#> $Size
#> [1] "11.68 kB [11.41 KiB]"
library(data.table)
datacite_add(x = data.table::data.table(iris),
Title = "Iris Dataset",
Creator = person(family ="Anderson", given ="Edgar", role = "aut"))
Currently, the DataCite properties (all the mandatory ones, and those
optional ones that were filled in) can be seen along with the standard R
metadata. As a property of the dataset class, which follows the
W3C and SDMX standards, and of survey, which will follow the DDI
standards, the history of the dataset from its creation
(import) will be recorded as attribute metadata.
dct_iris <- dublincore_add(
  x = iris,
  Title = "Iris Dataset",
  Creator = person("Edgar", "Anderson", role = "aut"),
  Publisher = "American Iris Society",
  Identifier = "https://doi.org/10.1111/j.1469-1809.1936.tb02137.x",
  Date = 1935,
  Language = "en"
)
dublincore(dct_iris)
#> Title: Iris Dataset
#> Publiser: American Iris Society | Source: | Date: | Language: eng | Identifier: https://doi.org/10.1111/j.1469-1809.1936.tb02137.x | Rights: | Description: NA |
#> names: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> - dimensions: <none>
#> - measures: <none>
#> - attributes: <none>
If you work with R, you are almost certainly familiar with the iris dataset. Calling ?iris will provide you with some information about this dataset, which is often used in tutorials. But how do you make sure that you do not forget its important properties?
The function datacite_add() adds at least the mandatory properties of
the DataCite Metadata Schema 4.3, a
list of core metadata properties chosen for an accurate and consistent
identification of a resource for citation and retrieval purposes.
DataCite is largely interoperable with the other similar international
standard, the Dublin Core. We
will later add a similar dublincore function; however, the
practical differences are so small that adjustments, if needed, can
easily be made by hand.
iris_dataset <- datacite_add(
  x = iris,
  Title = "Iris Dataset",
  Creator = person("Anderson", "Edgar", role = "aut"),
  Publisher = "American Iris Society",
  Identifier = "https://doi.org/10.1111/j.1469-1809.1936.tb02137.x",
  PublicationYear = 1935,
  Description = "This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.",
  Language = "en")
datacite(iris_dataset)
#> $names
#> [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
#>
#> $Title
#> $Title$Title
#> [1] "Iris Dataset"
#>
#>
#> $Creator
#> [1] "Anderson Edgar [aut]"
#>
#> $Identifier
#> [1] "https://doi.org/10.1111/j.1469-1809.1936.tb02137.x"
#>
#> $Publisher
#> [1] "American Iris Society"
#>
#> $Issued
#> [1] 1935
#>
#> $publication_year
#> [1] 1935
#>
#> $Type
#> $Type$resourceType
#> [1] "Dataset"
#>
#> $Type$resourceTypeGeneral
#> [1] "Dataset"
#>
#>
#> $Description
#> [1] "This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica."
#>
#> $Geolocation
#> [1] NA
#>
#> $Language
#> [1] "eng"
#>
#> $Rights
#> [1] NA
#>
#> $Size
#> [1] "11.73 kB [11.45 KiB]"
The x parameter can be any well-structured R object that
meets the definition of a dataset: a data.frame, or an inherited class
of it (data.table, tibble);
or a well-structured list (for example, a json object).
iris_bibentry <- bibentry_dataset(iris_dataset)
toBibtex(iris_bibentry)
#> @Misc{,
#> title = {Iris Dataset},
#> author = {{Edgar} and {Anderson}},
#> publisher = {American Iris Society},
#> size = {11.73 kB [11.45 KiB]},
#> year = {1935},
#> }
print(iris_bibentry, style = "html")
Edgar, Anderson (1935). “Iris Dataset.”
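A citation record of this kind can also be assembled by hand with base R's `bibentry()` from the utils package, which may help to see which metadata fields end up in the BibTeX entry. This sketch does not use `bibentry_dataset()`, and the key `anderson1935` is a made-up label for the example. It also names the arguments of `person()`, because without names the first positional argument is the given name and the second is the family name.

```r
# Build the author with named arguments to avoid mixing up
# the given name and the family name.
author <- person(given = "Edgar", family = "Anderson", role = "aut")

# A minimal citation assembled by hand with utils::bibentry().
# The key "anderson1935" is a hypothetical label for this example.
manual_entry <- bibentry(
  bibtype   = "Misc",
  key       = "anderson1935",
  title     = "Iris Dataset",
  author    = author,
  publisher = "American Iris Society",
  year      = "1935"
)

# Render the entry as BibTeX and show how the name is read.
toBibtex(manual_entry)
format(author, include = c("given", "family"))
```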
The interoperability and reusability of datasets are greatly enhanced if they follow a standardized and practical format. Our datasets follow the tidy data principles^[Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), 1–23. https://doi.org/10.18637/jss.v059.i10] and are interoperable with the W3C RDF Data Cube Vocabulary^[RDF Data Cube Vocabulary, W3C Recommendation 16 January 2014 https://www.w3.org/TR/vocab-data-cube/#metadata] (semantic web) and SDMX (statistical) dataset definitions.
Both W3C and SDMX use a more complex object, the datacube, in their descriptions. The dataset is a reused datacube. To adhere to the tidy data principles and to make the dataset easy to use in reproducible research workflows, we further reduced our subjective definition of the dataset.
1. The dataset constructor first subsets the dataset for the obs_id observation identifier, and if it is missing, it creates one.
2. Next come the dimensions, such as a geographic concept or a time concept. The iris dataset does not have these variables, so we do not select anything.
3. Then come the measurements. In case only one measurement is present, we have a long-form dataset that can be easily serialized into an RDF object, for example.
4. Last come the attributes that are unlikely to be used for statistical aggregation (unlike the dimensions) and which are not measured values.

petal_length <- dataset(subset(iris,
                               select = c("Petal.Length", "Species")),
                        Dimensions = NULL,
                        Measures = "Petal.Length",
                        Attributes = "Species")
petal_width <- dataset(subset(iris,
                              select = c("Petal.Width", "Species")),
                       Dimensions = NULL,
                       Measures = "Petal.Width",
                       Attributes = "Species")
library(dplyr)
petal_length %>%
  left_join(petal_width, by = c("Species")) %>%
  sample_n(10)
#> Untitled
#> Petal.Length Species Petal.Width
#> 1 4.5 versicolor 1.3
#> 2 1.5 setosa 0.2
#> 3 1.0 setosa 0.3
#> 4 4.9 virginica 2.2
#> 5 1.5 setosa 0.2
#> 6 5.6 virginica 2.1
#> 7 4.8 versicolor 1.4
#> 8 6.7 virginica 1.8
#> 9 1.5 setosa 0.2
#> 10 4.7 versicolor 1.5
The obvious motivation for this format is that the datasets can be easily integrated, joined, and combined, because they are tidy.
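Joining on Species alone pairs every petal length with every petal width of the same species, so the result has many more rows than iris. When an observation identifier is available, the join can be made one-to-one. The following base R sketch assumes a hypothetical `obs_id` column created by hand with `seq_len()`; it does not rely on the identifier that the dataset() constructor creates, and the object names ending in `_df` are chosen only for this example.

```r
# Add a hand-made observation identifier to two single-measure subsets of iris.
petal_length_df <- data.frame(obs_id = seq_len(nrow(iris)),
                              Species = iris$Species,
                              Petal.Length = iris$Petal.Length)
petal_width_df  <- data.frame(obs_id = seq_len(nrow(iris)),
                              Petal.Width = iris$Petal.Width)

# A one-to-one join on the identifier gives back one row per flower.
petals <- merge(petal_length_df, petal_width_df, by = "obs_id")

nrow(petals) == nrow(iris)
head(petals, 3)
```

Keeping the identifier in every subset makes it possible to recombine single-measure datasets without duplicating observations.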