Datasets With FAIR Metadata

According to the RDF Data Cube Vocabulary, a DataSet is a collection of statistical data that corresponds to a defined structure. The data in a data set can be roughly described as belonging to one of the following kinds:

Information	dataset
dimensions	first column section of the dataset
measurements	second column section of the dataset
attributes	third column section of the dataset
reference	attributes of the R object
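
As a rough illustration of this layout, the sketch below uses only base R, not the dataset package's own API. It arranges the columns of iris into the three column sections and records the roles as an attribute of the R object. The attribute name `roles` is made up for this example.

```r
# Base R sketch (hypothetical attribute name "roles"): group the columns
# of a data frame by the role they play in the dataset.
roles <- list(
  dimensions = character(0),                      # iris has no time or geo columns
  measures   = c("Sepal.Length", "Sepal.Width",
                 "Petal.Length", "Petal.Width"),
  attributes = "Species"
)

# Order the columns so that each role forms one contiguous column section.
x <- iris[, unlist(roles, use.names = FALSE)]

# Keep the role description as reference metadata on the R object itself.
attr(x, "roles") <- roles

names(x)
attr(x, "roles")$measures
```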

FAIR metadata

Our dataset R package aims to increase the Findability, Accessibility, Interoperability, and Reuse of digital assets, particularly datacubes and datasets used in statistics and data analysis. The FAIR principles “…emphasize machine-actionability (i.e., the capacity of computational systems to find, access, interoperate, and reuse data with none or minimal human intervention) because humans increasingly rely on computational support to deal with data as a result of the increase in volume, complexity, and creation speed of data.”

The dataset package adds metadata to R objects for full compatibility with the RDF Data Cube Vocabulary (which includes the DataSet definition), which is harmonized with the Statistical Data and Metadata eXchange (SDMX). This is necessary to correctly import data from statistical sources or from the semantic web, and to harmonize the processed results with such services.

The RDF Data Cube Vocabulary itself uses some core elements of the Dublin Core Metadata Elements 1.1, a library metadata standard for archiving, storing, and finding information. The dataset R package goes beyond the DataSet definition and lets you add all the Dublin Core Metadata Elements 1.1, as well as the DataCite Mandatory Properties 4.4 and the DataCite Recommended and Optional Properties 4.3.

Dublin Core has been an ISO standard (ISO 15836:2009) since February 2009 [ISO15836]. The 15 “core” elements are currently part of the larger DCMI Metadata Terms. The DataCite mandatory elements are narrower than the Dublin Core elements, and the full DataCite schema is broader. The two standards are well harmonized, with little difference between them.

We give preference to DataCite because it is a more modern standard that is better suited to documenting software and data products. It is also the preferred choice of the EU open repositories used to deposit publicly funded research results, including datasets.

iris_ds <- dataset(x=iris, 
                   Dimensions=NULL, 
                   Measures=c("Sepal.Length", "Sepal.Width", 
                              "Petal.Length", "Petal.Width"),
                   Attributes = "Species", 
                   Title = "Iris Dataset", 
                   Label = "The famous iris dataset used in R examples")

dublincore(iris_ds)
#> Title: Iris Dataset 
#> Publiser:   | Source:   | Date:  19340  | Language:   | Identifier:   | Rights:   | Description:  NA  | 
#> names:  Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species 
#> - dimensions: <none>
#> - measures: Sepal.Length (numeric)  Sepal.Width (numeric)  Petal.Length (numeric)  Petal.Width (numeric)  
#> - attributes: Species (factor)

The dataset() constructor adds W3C/SDMX-compatible structural metadata (it declares the measured values and the attributes of the observations) and some Dublin Core metadata for findability (these properties are the same in DataCite).

You can add more descriptive metadata to further support discovery and interoperability. You do not need the dataset class for this; any data frame will do, whether it was created with data.frame(), tibble::tibble() or data.table::data.table(). The dataset constructor does not alter the data or the data structure of a data frame or of an object inherited from data frame; it only adds standardized metadata to it.

iris_ds <- datacite_add(iris,
                        Title = "Iris Dataset",
                        Creator = person(family ="Anderson", given ="Edgar", role = "aut"),
                        Identifier = "https://doi.org/10.1111/j.1469-1809.1936.tb02137.x",
                        Publisher= "American Iris Society",
                        PublicationYear = 1935,
                        Geolocation = "US",
                        Language = "en", 
                        Version = "1.0")

datacite(iris_ds)
#> $names
#> [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     
#> 
#> $Title
#> $Title$Title
#> [1] "Iris Dataset"
#> 
#> 
#> $Creator
#> [1] "Edgar Anderson [aut]"
#> 
#> $Identifier
#> [1] "https://doi.org/10.1111/j.1469-1809.1936.tb02137.x"
#> 
#> $Publisher
#> [1] "American Iris Society"
#> 
#> $Issued
#> [1] 1935
#> 
#> $publication_year
#> [1] 1935
#> 
#> $Type
#> $Type$resourceType
#> [1] "Dataset"
#> 
#> $Type$resourceTypeGeneral
#> [1] "Dataset"
#> 
#> 
#> $Description
#> [1] NA
#> 
#> $Geolocation
#> [1] "US"
#> 
#> $Language
#> [1] "eng"
#> 
#> $Version
#> [1] "1.0"
#> 
#> $Rights
#> [1] NA
#> 
#> $Size
#> [1] "11.68 kB [11.41 KiB]"

library(data.table)

datacite_add(x = data.table::data.table(iris), 
             Title = "Iris Dataset", 
             Creator =  person(family ="Anderson", given ="Edgar", role = "aut"))
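
The statement that only metadata is added, while the data stays as it was, can be checked with base R. The sketch below does not call the dataset package. `add_title()` is a hypothetical helper that mimics the idea by attaching a single attribute.

```r
# Hypothetical helper, not part of the dataset package: attach one
# descriptive attribute to any data frame-like object.
add_title <- function(x, title) {
  attr(x, "Title") <- title
  x
}

iris_titled <- add_title(iris, "Iris Dataset")

# The metadata is stored next to the data ...
attr(iris_titled, "Title")

# ... and removing it again gives back the original object, so the
# observations and the column structure were never touched.
stripped <- iris_titled
attr(stripped, "Title") <- NULL
identical(stripped, iris)
```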

Currently, the DataCite properties (all the mandatory ones, and whichever optional ones were filled in) can be seen along with the standard R metadata. As a property of the dataset class, which follows W3C and SDMX standards, and of survey, which will follow DDI standards, the history of the dataset from its creation (import) will be recorded as attribute metadata.

dct_iris <- dublincore_add(
  x = iris,
  Title = "Iris Dataset",
  Creator = person("Edgar", "Anderson", role = "aut"),
  Publisher = "American Iris Society",
  Identifier = "https://doi.org/10.1111/j.1469-1809.1936.tb02137.x",
  Date = 1935,
  Language = "en"
)

dublincore(dct_iris)
#> Title: Iris Dataset 
#> Publiser:  American Iris Society  | Source:   | Date:   | Language:  eng  | Identifier:  https://doi.org/10.1111/j.1469-1809.1936.tb02137.x  | Rights:   | Description:  NA  | 
#> names:  Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species 
#> - dimensions: <none>
#> - measures: <none>
#> - attributes: <none>

FAir: Findable & Accessible Datasets

If you work with R, you are almost certainly familiar with the iris dataset. Calling ?iris gives you some information about this dataset, which is often used in tutorials. But how do you make sure that you do not forget its important properties?

The datacite_add() function adds at least the mandatory properties of the DataCite Metadata Schema 4.3, a list of core metadata properties chosen for the accurate and consistent identification of a resource for citation and retrieval purposes. DataCite is largely interoperable with the other similar international standard, Dublin Core. The dublincore_add() function covers Dublin Core in a similar way; the practical differences are so small that adjustments, if needed, can easily be made by hand.

iris_dataset <- datacite_add(
  x = iris,
  Title = "Iris Dataset",
  Creator = person("Anderson", "Edgar", role = "aut"),
  Publisher= "American Iris Society",
  Identifier = "https://doi.org/10.1111/j.1469-1809.1936.tb02137.x",
  PublicationYear = 1935,
  Description = "This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.",
  Language = "en")

datacite(iris_dataset)
#> $names
#> [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     
#> 
#> $Title
#> $Title$Title
#> [1] "Iris Dataset"
#> 
#> 
#> $Creator
#> [1] "Anderson Edgar [aut]"
#> 
#> $Identifier
#> [1] "https://doi.org/10.1111/j.1469-1809.1936.tb02137.x"
#> 
#> $Publisher
#> [1] "American Iris Society"
#> 
#> $Issued
#> [1] 1935
#> 
#> $publication_year
#> [1] 1935
#> 
#> $Type
#> $Type$resourceType
#> [1] "Dataset"
#> 
#> $Type$resourceTypeGeneral
#> [1] "Dataset"
#> 
#> 
#> $Description
#> [1] "This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica."
#> 
#> $Geolocation
#> [1] NA
#> 
#> $Language
#> [1] "eng"
#> 
#> $Rights
#> [1] NA
#> 
#> $Size
#> [1] "11.73 kB [11.45 KiB]"

The x parameter can be any well-structured R object that meets the definition of a dataset: a data.frame or a class that inherits from it (data.table, tibble), or a well-structured list (for example, a JSON object).

Managing citations and bibliographies

iris_bibentry <- bibentry_dataset(iris_dataset)
toBibtex(iris_bibentry)
#> @Misc{,
#>   title = {Iris Dataset},
#>   author = {{Edgar} and {Anderson}},
#>   publisher = {American Iris Society},
#>   size = {11.73 kB [11.45 KiB]},
#>   year = {1935},
#> }

print(iris_bibentry, style="html")

Edgar, Anderson (1935). “Iris Dataset.”
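
For comparison, a similar entry can be built by hand with the bibentry() and person() functions from base R's utils package. This is only a sketch of what a citation record for the dataset might contain, not the output of bibentry_dataset(). The author is given with named `given` and `family` arguments so that the two name parts cannot be swapped.

```r
library(utils)

# A hand-written citation record for the dataset (illustrative values
# taken from the metadata used earlier in this article).
iris_citation <- bibentry(
  bibtype   = "Misc",
  title     = "Iris Dataset",
  author    = person(given = "Edgar", family = "Anderson", role = "aut"),
  publisher = "American Iris Society",
  year      = "1935"
)

# BibTeX for reference managers, plain text for reading.
toBibtex(iris_citation)
format(iris_citation, style = "text")
```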

faIR: Interoperable & Reusable Datasets

The interoperability and reusability of datasets is greatly enhanced if they follow a standardized and practical format. Our datasets follow the tidy data principles^[Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), 1–23. https://doi.org/10.18637/jss.v059.i10] and are interoperable with the W3C RDF Data Cube Vocabulary^[RDF Data Cube Vocabulary, W3C Recommendation 16 January 2014 https://www.w3.org/TR/vocab-data-cube/#metadata] (semantic web) and SDMX (statistical) dataset definitions.

Both W3C and SDMX use a more complex object, the datacube, in their descriptions. The dataset is a reused datacube. To adhere to tidy data principles and to keep it easy to use in reproducible research workflows, we further reduced our subjective definition of the dataset.

The dataset constructor first subsets the dataset for the obs_id observation identifier, and if it is missing, it creates one.
Then it selects the dimensions, such as geographic concept or time concept. The iris dataset does not have these variables, so we do not select anything.
Next we select the measurements. In case only one measurement is present, we have a long-form dataset that can be easily serialized into an RDF object, for example.
Next we select any attributes that are unlikely to be used for statistical aggregation (unlike the dimensions) and which are not measured values.
We can pass on further optional dataset attributes. These attributes do not correspond to a single observation, but rather to the entire dataset.
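
The steps above can be sketched in base R as follows. `sketch_dataset()` is a hypothetical stand-in written for this illustration. It is not the package's dataset() constructor, and its argument names and the `Title` attribute are assumptions.

```r
# Hypothetical stand-in for the constructor steps; not the package's code.
sketch_dataset <- function(x, dimensions = NULL, measures = NULL,
                           attributes = NULL, title = "Untitled") {
  # Step 1: keep an existing obs_id column, or create one from the row order.
  if (!"obs_id" %in% names(x)) {
    x <- cbind(obs_id = seq_len(nrow(x)), x)
  }

  # Steps 2 to 4: select dimensions, then measures, then attributes.
  # NULL roles simply contribute no columns.
  keep <- c("obs_id", dimensions, measures, attributes)
  x <- x[, keep, drop = FALSE]

  # Step 5: dataset-level metadata describes the whole object,
  # not any single observation.
  attr(x, "Title") <- title
  x
}

# iris has no dimension columns, so only a measure and an attribute are selected.
petal_length_sketch <- sketch_dataset(
  iris,
  measures   = "Petal.Length",
  attributes = "Species",
  title      = "Petal length of iris flowers"
)

head(petal_length_sketch, 3)
```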

petal_length <- dataset(subset(iris, 
                               select = c("Petal.Length", "Species")), 
                        Dimensions = NULL, 
                        Measures   = "Petal.Length", 
                        Attributes = "Species")

petal_width <- dataset(subset(iris, 
                              select = c("Petal.Width", "Species")), 
                       Dimensions = NULL, 
                       Measures   = "Petal.Width", 
                       Attributes = "Species")

library(dplyr)

petal_length %>%
  left_join (petal_width, by = c("Species")) %>%
  sample_n(10)
#> Untitled
#>    Petal.Length    Species Petal.Width
#> 1           4.5 versicolor         1.3
#> 2           1.5     setosa         0.2
#> 3           1.0     setosa         0.3
#> 4           4.9  virginica         2.2
#> 5           1.5     setosa         0.2
#> 6           5.6  virginica         2.1
#> 7           4.8 versicolor         1.4
#> 8           6.7  virginica         1.8
#> 9           1.5     setosa         0.2
#> 10          4.7 versicolor         1.5

The obvious motivation of this format is that the datasets can be easily integrated, joined, combined, because they are tidy.
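
Because Species is not unique per row, a join on Species alone is many-to-many. When a one-to-one match between observations is wanted, the obs_id observation identifier described earlier can serve as the key. A minimal base R sketch, assuming both long-form tables carry such an obs_id column (the tables are built by hand here rather than with the package):

```r
# Two long-form tables that share an observation identifier.
petal_length_long <- data.frame(
  obs_id       = seq_len(nrow(iris)),
  Petal.Length = iris$Petal.Length,
  Species      = iris$Species
)

petal_width_long <- data.frame(
  obs_id      = seq_len(nrow(iris)),
  Petal.Width = iris$Petal.Width
)

# Joining on the identifier keeps exactly one row per observation.
petals <- merge(petal_length_long, petal_width_long, by = "obs_id")

nrow(petals)
head(petals, 3)
```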