Working with manydata

James Hollway and Henrique Sposito

2022-11-17

library(manydata)
## Registered S3 method overwritten by 'geojsonlint':
##   method         from 
##   print.location dplyr
## Registered S3 method overwritten by 'ggforce':
##   method           from 
##   scale_type.units units

Discovering data

The first thing users of the package will want to do is to identify datasets that might contribute to their research goals. Since some of these data packages are too big for CRAN, we expect that their developers will instead choose to make their packages available on GitHub. To make it easier to identify all packages in the many packages universe, we have developed the get_packages() function. The function lists the many packages available and allow users to download them.

get_packages()

Understanding data

Packages in the many packages universe have the advantage to facilitate comparison and analysis of multiple datasets in a specific domain of global governance. This is possible with a particular coding system which follows the same principles across the different packages.

In {manystates} for example, all datasets from the states database contain variables named Begand End which represent the beginning and ending date of an episode of state sovereignty.

In {manyenviron}, the agreements database also have the Beg and End variables but those are attributed to treaties (signature and term dates). For the memberships database, Beg and End represent when a relationship between states and an agreement starts (either signature, ratification or entry into force) and ends (either withdrawal or term).

This specific variable name allows the comparison across the datasets which have different sources but same informations. It enables to point out the recurrence, difference or absence of observations between the datasets and extract more robust data when researching on a particular governance domain.

Loading data

Let us say that we wish to download the {manystates} package, which offers a set of datasets related to state actors in global governance. We can download and install the latest release version of the {manystates} package using the same function as before, only specifying which package we want to ‘get’, ‘get_packages(“manystates”)’.

For now, let’s work with the Roman Emperors database included in manydata. We can get a quick summary of the datasets included in this package with the following command:

data(package = "manydata")
data(emperors, package = "manydata")
emperors
## $wikipedia
## # A tibble: 68 × 15
##    ID        Beg    End   FullN…¹ Birth Death CityB…² Provi…³ Rise  Cause Killer
##    <chr>     <mdat> <mda> <chr>   <chr> <chr> <chr>   <chr>   <chr> <chr> <chr> 
##  1 Augustus  -0026… 0014… IMPERA… 0062… 0014… Rome    Italia  Birt… Assa… Wife  
##  2 Tiberius  0014-… 0037… TIBERI… 0041… 0037… Rome    Italia  Birt… Assa… Other…
##  3 Caligula  0037-… 0041… GAIVS … 0012… 0041… Antitum Italia  Birt… Assa… Senate
##  4 Claudius  0041-… 0054… TIBERI… 0009… 0054… Lugdun… Gallia… Birt… Assa… Wife  
##  5 Nero      0054-… 0068… NERO C… 0037… 0068… Antitum Italia  Birt… Suic… Senate
##  6 Galba     0068-… 0069… SERVIV… 0002… 0069… Terrac… Italia  Seiz… Assa… Other…
##  7 Otho      0069-… 0069… MARCVS… 0032… 0069… Terent… Italia  Appo… Suic… Other…
##  8 Vitellius 0069-… 0069… AVLVS … 0015… 0069… Rome    Italia  Seiz… Assa… Other…
##  9 Vespasian 0069-… 0079… TITVS … 0009… 0079… Falacr… Italia  Seiz… Natu… Disea…
## 10 Titus     0079-… 0081… TITVS … 0039… 0081… Rome    Italia  Birt… Natu… Disea…
## # … with 58 more rows, 4 more variables: Dynasty <chr>, Era <chr>, Notes <chr>,
## #   Verif <chr>, and abbreviated variable names ¹​FullName, ²​CityBirth,
## #   ³​ProvinceBirth
## 
## $UNRV
## # A tibble: 99 × 7
##    ID               Beg     End     Birth Death FullName                 Dynasty
##    <chr>            <mdate> <mdate> <chr> <chr> <chr>                    <chr>  
##  1 Augustus         -0027   -0014   63 BC 14    Gaius Julius Caesar Oct… Julio-…
##  2 Tiberius         -0014   0037    42 BC 37    Tiberius Claudius Nero … Julio-…
##  3 Gaius (Caligula) 0037    0041    12    14    Gaius Caesar Germanicus… Julio-…
##  4 Claudius         0041    0054    10 BC 41    Tiberius Claudius Nero … Julio-…
##  5 Nero             0054    0068    37    68    Claudius Nero Caesar (b… Julio-…
##  6 Galba            0068    0069    3 BC  69    Servius Sulpicius Galba… Year o…
##  7 Otho             0069    0069    32    69    Marcus Salvius Otho / I… Year o…
##  8 Vitellius        0069    0069    15    69    Aulus Vitellius / Aulus… Year o…
##  9 Vespasian        0069    0079    9     79    Titus Flavius Vespasian… Year o…
## 10 Titus            0079    0081    39    79    Titus Flavius Vespasian… Flavian
## # … with 89 more rows
## 
## $britannica
## # A tibble: 87 × 3
##    ID              Beg     End    
##    <chr>           <mdate> <mdate>
##  1 Augustus        -0031   0014   
##  2 Tiberius        0014    0037   
##  3 Caligula        0037    0041   
##  4 Claudius        0041    0054   
##  5 Nero            0054    0068   
##  6 Galba           0068    0069   
##  7 Otho            0069-01 0069-04
##  8 Aulus Vitellius 0069-07 0069-12
##  9 Vespasian       0069    0079   
## 10 Titus           0079    0081   
## # … with 77 more rows

We can see that there are three named datasets relating to emperors here: wikipedia (dataset assembled from Wikipedia pages), UNVR (United Nations of Roman Vitrix), and britannica (Britannica Encyclopedia List of Roman Emperors). Each of these datasets has their advantages and so we may wish to understand their differences, summarise variables across them, and perhaps also rerun models across them.

To retrieve an individual dataset from this database, we can use the pluck() function.

wikipedia <- pluck(emperors, "wikipedia")

However, the real value of the various ‘many packages’ is that multiple datasets relating to the same phenomenon are presented together.

Comparing data

First of all, we want to understand what the differences between the datasets in a database. One important way to understand the relationship between these datasets is to understand what their relative advantages and disadvantages are. For example, one dataset may be long (has many observations) while another is shorter but wider (has more variables). One might include details further back in history while the other is more recent, but include more missing data or less precise data (i.e. coded at a less granular level) than another with a more restrictive. Or one might appear complete yet offer less information on where the original data points were sourced or how certain variables were coded, while another provides an extensive and transparent codebook that facilitates replication.

Using data_source() and data_contrast()

We can bring up the database level documentation using: ?emperors. This informs users on the datasets present in the database as well as the variables in the various datasets. Though, if we want a more detailed summary of the various levels of data and sources, we can use data_source() and data_contrast().

The data_source() function displays bibliographic references for the datasets within a database.

data_source(pkg = "manydata", database = NULL, dataset = NULL)
## Component 1 :
##            Reference                                                                                                        
## wikipedia  "(????). "List_of_Roman_emperors." <URL:tps://en.wikipedia.org/wiki/List_of_Roman_emperors>. Accesse" [truncated]
## UNRV       "(????). "Roman Emperor list." <URL:https://www.unrv.com/government/emperor.php>. Accessed: 2021-07-" [truncated]
## britannica "(????). "List of Roman emperors." <URL:https://www.britannica.com/topic/list-of-Roman-emperors-2043" [truncated]

The data_contrast() function returns a data frame with the key metadata of each level of data objects (many package, database, and dataset). This metadata includes the following elements:

data_contrast(pkg = "manydata", database = NULL, dataset = NULL)
## emperors :
##            Unique ID Missing Data Rows Columns         Beg         End
## wikipedia         68        9.9 %   68      15 -0026-01-16  0014-08-19
## UNRV              98       6.06 %   99       7 -0027-01-01 -0014-12-31
## britannica        87          0 %   87       3 -0031-01-01  0014-12-31
##                                                                        URL
## wikipedia                           https://github.com/zonination/emperors
## UNRV                           https://www.unrv.com/government/emperor.php
## britannica https://www.britannica.com/topic/list-of-Roman-emperors-2043294

An example of inference sensitivity to data sources

Next we may be interested in whether any relationships we are interested in or inferences we want to draw are sensitive to which data we use. That is, we are interested in the robustness of any results to different data specifications.

We can start by exploring whether our conclusion about when emperors began their reign would differ depending on which dataset we use. We can use the purrr::map() function used above, but this time pass it the mean() function and tell it to operate on just the “Beg” variable, which represents when emperors began their reign (removing any NAs). Since manydata datasets are always ordered by “Beg” (and then “ID”), we can remove any subsequent (duplicated) entries by ID to concentrate on first appearances.

library(dplyr)
emperors %>% 
  purrr::map(function(x){
    x %>% dplyr::filter(!duplicated(ID)) %>%
      dplyr::summarise(mean(Beg, na.rm = TRUE))
  })
## $wikipedia
## # A tibble: 68 × 1
##    `mean(Beg, na.rm = TRUE)`
##    <chr>                    
##  1 -0026-01-16              
##  2 0014-09-18               
##  3 0037-03-18               
##  4 0041-01-25               
##  5 0054-10-13               
##  6 0068-06-08               
##  7 0069-01-15               
##  8 0069-04-17               
##  9 0069-12-21               
## 10 0079-06-24               
## # … with 58 more rows
## 
## $UNRV
## # A tibble: 98 × 1
##    `mean(Beg, na.rm = TRUE)`
##    <chr>                    
##  1 -0027-07-02              
##  2 -0014-07-02              
##  3 0037-07-02               
##  4 0041-07-02               
##  5 0054-07-02               
##  6 0068-07-01               
##  7 0069-07-02               
##  8 0069-07-02               
##  9 0069-07-02               
## 10 0079-07-02               
## # … with 88 more rows
## 
## $britannica
## # A tibble: 87 × 1
##    `mean(Beg, na.rm = TRUE)`
##    <chr>                    
##  1 -0031-07-02              
##  2 0014-07-02               
##  3 0037-07-02               
##  4 0041-07-02               
##  5 0054-07-02               
##  6 0068-07-01               
##  7 0069-01-16               
##  8 0069-07-16               
##  9 0069-07-02               
## 10 0079-07-02               
## # … with 77 more rows

Consolidating data

Now that we have compared the data and looked at some of the different inferences drawn, let us examine how to select and consolidate databases.

The consolidate() function facilitates consolidating a set of datasets, or a database, from a ‘many’ package into a single dataset with some combination of the rows and columns. The function includes separate arguments for rows and columns, as well as for how to resolve conflicts in observations across datasets. The key argument indicates the column to collapse datasets by. This provides users with considerable flexibility in how they combine data.

For example, users may wish to see units and variables coded in “any” dataset (i.e. units or variables present in at least one of the datasets in the database) or units and variables coded in “every” dataset (i.e. units or variables present in all of the datasets in the database).

consolidate(database = emperors, rows = "any", cols = "any",
            resolve = "coalesce", key = "ID")
## There were 116 matched observations by ID variable across datasets in database.
## ℹ Resolving conflicts...
## ℹ Coalescing compatible rows...
## # A tibble: 138 × 15
##    ID           CityB…¹ Provi…² Rise  Cause Killer Era   Notes Verif Birth Death
##    <chr>        <chr>   <chr>   <chr> <chr> <chr>  <chr> <chr> <chr> <chr> <chr>
##  1 Aemilian     <NA>    Africa  Appo… Assa… Other… Prin… birt… <NA>  0207… 0253…
##  2 Allectus     <NA>    <NA>    <NA>  <NA>  <NA>   <NA>  <NA>  <NA>  ?     297  
##  3 Anastasius   <NA>    <NA>    <NA>  <NA>  <NA>   <NA>  <NA>  <NA>  430   518  
##  4 Anthemius    <NA>    <NA>    <NA>  <NA>  <NA>   <NA>  <NA>  <NA>  420   472  
##  5 Antoninus P… <NA>    <NA>    <NA>  <NA>  <NA>   <NA>  <NA>  <NA>  86    161  
##  6 Antonius Pi… Lanuvi… Italia  Birt… Natu… Disea… Prin… <NA>  <NA>  0086… 0161…
##  7 Arcadius     <NA>    <NA>    <NA>  <NA>  <NA>   <NA>  <NA>  <NA>  377   408  
##  8 Augustus     Rome    Italia  Birt… Assa… Wife   Prin… birt… Redd… 0062… 0014…
##  9 Aulus Vitel… <NA>    <NA>    <NA>  <NA>  <NA>   <NA>  <NA>  <NA>  <NA>  <NA> 
## 10 Aurelian     Sirmium Pannon… Appo… Assa… Praet… Prin… <NA>  <NA>  0214… 0275…
## # … with 128 more rows, 4 more variables: FullName <chr>, Dynasty <chr>,
## #   Beg <mdate>, End <mdate>, and abbreviated variable names ¹​CityBirth,
## #   ²​ProvinceBirth
consolidate(database = emperors, rows = "every", cols = "every",
            resolve = "coalesce", key = "ID")
## There were 116 matched observations by ID variable across datasets in database.
## ℹ Resolving conflicts...
## ℹ Coalescing compatible rows...
## # A tibble: 41 × 3
##    ID             Beg         End        
##    <chr>          <mdate>     <mdate>    
##  1 Aemilian       0253-08-15~ 0253-10-15~
##  2 Augustus       -0026-01-16 0014-08-19 
##  3 Aurelian       0270-09-15  0275-09-15 
##  4 Balbinus       0238-04-22  0238-07-29 
##  5 Caracalla      0198        0217-04-08 
##  6 Carinus        0283-08-01~ 0285-08-01~
##  7 Carus          0282-10-01~ 0283-08-01~
##  8 Claudius       0041-01-25  0054-10-13 
##  9 Commodus       0177        0192-12-31 
## 10 Constantine II 0337-05-22  0340-01-01 
## # … with 31 more rows

Users can also choose how they want to resolve conflicts between observations in consolidate() with several ‘resolve’ methods:

consolidate(database = emperors, rows = "any", cols = "every", resolve = "max", key = "ID")
## There were 116 matched observations by ID variable across datasets in database.
## ℹ Resolving conflicts...
## ℹ Coalescing compatible rows...
## # A tibble: 138 × 3
##    ID              Beg        End       
##    <chr>           <chr>      <chr>     
##  1 Aemilian        0253-12-31 0253-12-31
##  2 Allectus        0293       0297      
##  3 Anastasius      0491       0518      
##  4 Anthemius       0467       0472      
##  5 Antoninus Pius  0138       0161      
##  6 Antonius Pius   0138-07-10 0161-03-07
##  7 Arcadius        0395       0408      
##  8 Augustus        -031-12-31 0014-12-31
##  9 Aulus Vitellius 0069-07    0069-12   
## 10 Aurelian        0270-12-31 0275-12-31
## # … with 128 more rows
consolidate(database = emperors, rows = "every", cols = "any", resolve = "min", key = "ID")
## There were 116 matched observations by ID variable across datasets in database.
## ℹ Resolving conflicts...
## ℹ Coalescing compatible rows...
## # A tibble: 41 × 15
##    ID           CityB…¹ Provi…² Rise  Cause Killer Era   Notes Verif Birth Death
##    <chr>        <chr>   <chr>   <chr> <chr> <chr>  <chr> <chr> <chr> <chr> <chr>
##  1 Aemilian     <NA>    Africa  Appo… Assa… Other… Prin… birt… <NA>  0207… 0253…
##  2 Augustus     Rome    Italia  Birt… Assa… Wife   Prin… birt… Redd… 0062… 0014…
##  3 Aurelian     Sirmium Pannon… Appo… Assa… Praet… Prin… <NA>  <NA>  0214… 0275…
##  4 Balbinus     <NA>    Unknown Appo… Assa… Praet… Prin… birt… <NA>  0178… 0238…
##  5 Caracalla    Lugdun… Gallia… Birt… Assa… Other… Prin… reig… <NA>  0188… 0217…
##  6 Carinus      <NA>    Unknown Birt… Died… Oppos… Prin… deat… <NA>  ?     0285…
##  7 Carus        Narbo   Gallia… Seiz… Natu… Light… Prin… birt… <NA>  0230… 0283…
##  8 Claudius     Lugdun… Gallia… Birt… Assa… Wife   Prin… birt… Redd… 0009… 0054…
##  9 Commodus     Lanuvi… Italia  Birt… Assa… Praet… Prin… reig… <NA>  0161… 0192…
## 10 Constantine… Arelate Gallia… Birt… Exec… Other… Domi… birt… <NA>  0316… 0340…
## # … with 31 more rows, 4 more variables: FullName <chr>, Dynasty <chr>,
## #   Beg <chr>, End <chr>, and abbreviated variable names ¹​CityBirth,
## #   ²​ProvinceBirth
consolidate(database = emperors, rows = "every", cols = "every", resolve = "mean", key = "ID")
## There were 116 matched observations by ID variable across datasets in database.
## ℹ Resolving conflicts...
## ℹ Coalescing compatible rows...
## # A tibble: 41 × 3
##    ID             Beg         End        
##    <chr>          <chr>       <chr>      
##  1 Aemilian       0253-08-15~ 0253-10-15~
##  2 Augustus       -0026-01-16 0014-08-19 
##  3 Aurelian       0270-09-15  0275-09-15 
##  4 Balbinus       0238-04-22  0238-07-29 
##  5 Caracalla      0198        0217-04-08 
##  6 Carinus        0283-08-01~ 0285-08-01~
##  7 Carus          0282-10-01~ 0283-08-01~
##  8 Claudius       0041-01-25  0054-10-13 
##  9 Commodus       0177        0192-12-31 
## 10 Constantine II 0337-05-22  0340-01-01 
## # … with 31 more rows
consolidate(database = emperors, rows = "any", cols = "any", resolve = "median", key = "ID")
## There were 116 matched observations by ID variable across datasets in database.
## ℹ Resolving conflicts...
## ℹ Coalescing compatible rows...
## # A tibble: 138 × 15
##    ID           CityB…¹ Provi…² Rise  Cause Killer Era   Notes Verif Birth Death
##    <chr>        <chr>   <chr>   <chr> <chr> <chr>  <chr> <chr> <chr> <chr> <chr>
##  1 Aemilian     <NA>    Africa  Appo… Assa… Other… Prin… birt… <NA>  0207… 0253…
##  2 Allectus     <NA>    <NA>    <NA>  <NA>  <NA>   <NA>  <NA>  <NA>  ?     297  
##  3 Anastasius   <NA>    <NA>    <NA>  <NA>  <NA>   <NA>  <NA>  <NA>  430   518  
##  4 Anthemius    <NA>    <NA>    <NA>  <NA>  <NA>   <NA>  <NA>  <NA>  420   472  
##  5 Antoninus P… <NA>    <NA>    <NA>  <NA>  <NA>   <NA>  <NA>  <NA>  86    161  
##  6 Antonius Pi… Lanuvi… Italia  Birt… Natu… Disea… Prin… <NA>  <NA>  0086… 0161…
##  7 Arcadius     <NA>    <NA>    <NA>  <NA>  <NA>   <NA>  <NA>  <NA>  377   408  
##  8 Augustus     Rome    Italia  Birt… Assa… Wife   Prin… birt… Redd… 0062… 0014…
##  9 Aulus Vitel… <NA>    <NA>    <NA>  <NA>  <NA>   <NA>  <NA>  <NA>  <NA>  <NA> 
## 10 Aurelian     Sirmium Pannon… Appo… Assa… Praet… Prin… <NA>  <NA>  0214… 0275…
## # … with 128 more rows, 4 more variables: FullName <chr>, Dynasty <chr>,
## #   Beg <chr>, End <chr>, and abbreviated variable names ¹​CityBirth,
## #   ²​ProvinceBirth
consolidate(database = emperors, rows = "every", cols = "every", resolve = "random", key = "ID")
## There were 116 matched observations by ID variable across datasets in database.
## ℹ Resolving conflicts...
## ℹ Coalescing compatible rows...
## # A tibble: 41 × 3
##    ID             Beg        End       
##    <chr>          <chr>      <chr>     
##  1 Aemilian       0253-08-15 0253-12-31
##  2 Augustus       -031-12-31 0014-12-31
##  3 Aurelian       0270-12-31 0275-09-15
##  4 Balbinus       0238-04-22 0238-07-29
##  5 Caracalla      0198-12-31 0217-04-08
##  6 Carinus        0283-12-31 0285-12-31
##  7 Carus          0282-12-31 0283-08-01
##  8 Claudius       0041-12-31 0054-12-31
##  9 Commodus       0180-12-31 0192-12-31
## 10 Constantine II 0337-05-22 0340-12-31
## # … with 31 more rows

Users can even specify how conflicts for different variables should be ‘resolved’:

consolidate(database = emperors, rows = "any", cols = "every", resolve = c(Beg = "min", End = "max"), key = "ID")
## There were 116 matched observations by ID variable across datasets in database.
## ℹ Resolving conflicts...
## ℹ Coalescing compatible rows...
## # A tibble: 138 × 3
##    ID              Beg        End       
##    <chr>           <chr>      <chr>     
##  1 Aemilian        0253-01-01 0253-12-31
##  2 Allectus        0293       0297      
##  3 Anastasius      0491       0518      
##  4 Anthemius       0467       0472      
##  5 Antoninus Pius  0138       0161      
##  6 Antonius Pius   0138-07-10 0161-03-07
##  7 Arcadius        0395       0408      
##  8 Augustus        -026-01-16 0014-12-31
##  9 Aulus Vitellius 0069-07    0069-12   
## 10 Aurelian        0270-01-01 0275-12-31
## # … with 128 more rows

Alternatively, users can “favour” a dataset in a database over others:

consolidate(database = favour(emperors, "UNRV"), rows = "every", cols = "any", resolve = "coalesce", key = "ID")
## There were 116 matched observations by ID variable across datasets in database.
## ℹ Resolving conflicts...
## ℹ Coalescing compatible rows...
## # A tibble: 41 × 15
##    ID       FullN…¹ Birth Death CityB…² Provi…³ Rise  Cause Killer Dynasty Era  
##    <chr>    <chr>   <chr> <chr> <chr>   <chr>   <chr> <chr> <chr>  <chr>   <chr>
##  1 Aemilian "Marcu… 207?  253   <NA>    Africa  Appo… Assa… Other… Gordian Prin…
##  2 Augustus "Gaius… 63 BC 14    Rome    Italia  Birt… Assa… Wife   Julio-… Prin…
##  3 Aurelian "Luciu… 214   275   Sirmium Pannon… Appo… Assa… Praet… Gordian Prin…
##  4 Balbinus "Decim… 170?  238   <NA>    Unknown Appo… Assa… Praet… Gordian Prin…
##  5 Caracal… "born … 188   217   Lugdun… Gallia… Birt… Assa… Other… Severan Prin…
##  6 Carinus  "Marcu… ?     285   <NA>    Unknown Birt… Died… Oppos… co-emp… Prin…
##  7 Carus    "Marcu… 230?  283   Narbo   Gallia… Seiz… Natu… Light… .       Prin…
##  8 Claudius "Tiber… 10 BC 41    Lugdun… Gallia… Birt… Assa… Wife   Julio-… Prin…
##  9 Commodus "Marcu… 161   192   Lanuvi… Italia  Birt… Assa… Praet… Adopti… Prin…
## 10 Constan… "Flavi… 317   340   Arelate Gallia… Birt… Exec… Other… House … Domi…
## # … with 31 more rows, 4 more variables: Notes <chr>, Verif <chr>, Beg <mdate>,
## #   End <mdate>, and abbreviated variable names ¹​FullName, ²​CityBirth,
## #   ³​ProvinceBirth

Users can, even, declare multiple key ID columns to consolidate a database or multiple datasets:

consolidate(database = emperors, rows = "any", cols = "any", resolve = c(Death = "max", Cause = "coalesce"),
            key = c("ID", "Beg"))
## ℹ Resolving conflicts...
## ℹ Coalescing compatible rows...
## # A tibble: 202 × 4
##    ID             Beg         Cause          Death      
##    <chr>          <mdate>     <chr>          <chr>      
##  1 Aemilian       0253        <NA>           253        
##  2 Aemilian       0253-08-15~ Assassination  0253-10-15~
##  3 Allectus       0293        <NA>           297        
##  4 Anastasius     0491        <NA>           518        
##  5 Anthemius      0467        <NA>           472        
##  6 Antoninus Pius 0138        <NA>           161        
##  7 Antonius Pius  0138-07-10  Natural Causes 0161-03-07 
##  8 Arcadius       0383        <NA>           <NA>       
##  9 Arcadius       0395        <NA>           408        
## 10 Augustus       -0026-01-16 Assassination  0014-08-19 
## # … with 192 more rows