As demonstrated in
vignette("mudata2", package = "mudata2")
, mudata objects
are easy to use and have a quick data-to-analysis time. In contrast,
getting data into the format takes a little more time, and requires some
familiarity with dplyr and tidyr. This
process is essentially the data cleaning step, except that instead of
discarding all the information that you don’t need (or won’t fit in the
output data structure), you can keep almost everything, possibly adding
some documentation that didn’t previously exist. This is a front-end
investment of time that will make subsequent users of the data better
informed about how and why the data were collected in the first
place.
(Mostly) universal data (mudata) objects are created using the
mudata()
function, which at minimum takes a data
frame/tibble with one row per measurement. As an example, I’ll use the
data table from the ns_climate
dataset:
library(mudata2)
ns_climate %>% tbl_data()
## # A tibble: 115,541 × 7
## dataset location param date value flag flag_…¹
## <chr> <chr> <chr> <date> <dbl> <chr> <chr>
## 1 ecclimate_monthly SABLE ISLAND 6454 mean_max_… 1897-01-01 NA M Missing
## 2 ecclimate_monthly SABLE ISLAND 6454 mean_max_… 1897-02-01 NA M Missing
## 3 ecclimate_monthly SABLE ISLAND 6454 mean_max_… 1897-03-01 NA M Missing
## 4 ecclimate_monthly SABLE ISLAND 6454 mean_max_… 1897-04-01 NA M Missing
## 5 ecclimate_monthly SABLE ISLAND 6454 mean_max_… 1897-05-01 NA M Missing
## 6 ecclimate_monthly SABLE ISLAND 6454 mean_max_… 1897-06-01 NA M Missing
## 7 ecclimate_monthly SABLE ISLAND 6454 mean_max_… 1897-07-01 NA M Missing
## 8 ecclimate_monthly SABLE ISLAND 6454 mean_max_… 1897-08-01 NA M Missing
## 9 ecclimate_monthly SABLE ISLAND 6454 mean_max_… 1897-09-01 NA M Missing
## 10 ecclimate_monthly SABLE ISLAND 6454 mean_max_… 1897-10-01 12.2 <NA> <NA>
## # … with 115,531 more rows, and abbreviated variable name ¹flag_text
At minimum the data table must contain the columns param
and value
. The param
column contains the
identifier of the measured parameter (a character vector), and the
value
column contains the value of the measurement (there
is no restriction on what type this is except that it has to be the same
type for all parameters; see below for ways around this). To represent
measurements at more than one location, you can include a location
column with location identifiers (a character vector). To represent
measurements at more than one point in time, you can include a column
between param
and value
specifying at what
time the measurement was taken. To the right of the value
column, you can include any columns needed to add context to
value
(I typically use this for uncertainty, detection
limits, and comments on a particular measurement).
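For concreteness, here is a minimal sketch of a valid data table (the names and values below are made up for illustration):

# a hypothetical minimal data table: param and value are required;
# location and a time/axis column between param and value are optional
minimal_data <- tibble::tibble(
  location = c("SITE-1", "SITE-1", "SITE-2", "SITE-2"),
  param = c("temp", "ph", "temp", "ph"),
  date = as.Date("2024-06-01"),
  value = c(4.5, 7.1, 3.9, 6.8)
)
mudata(minimal_data)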
In the context of ns_climate
, the location
column contains station names like “SABLE ISLAND”, the
param
column contains measurement names like
“mean_max_temp”, and the point in time the measurement was taken is
included in the date
column. To the right of the
value
column, there are two columns that add extra “flag”
information provided by Environment Canada. These flags are included in Environment Canada climate downloads but are often discarded, because in the standard wide format they arrive as 12 paired columns that are a bit unwieldy.
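If you want a quick sense of how much flag information is actually present before deciding whether to keep it, one way (a sketch using dplyr, which is loaded later as part of the tidyverse) is to count the flagged rows:

# count flagged measurements in ns_climate by flag code
library(dplyr)
ns_climate %>%
  tbl_data() %>%
  filter(!is.na(flag)) %>%
  count(flag, flag_text, sort = TRUE)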
In general, the steps to create a mudata object are as follows (a compact sketch appears after the list):

1. Prepare a parameter-long, one-row-per-measurement data table and pass it to mudata().
2. Add location, parameter, and dataset metadata using update_locations(), update_params(), and update_datasets().
3. Call update_columns_table() to include the metadata columns you just added in the columns table.
4. Add column metadata using update_columns().
5. Write the object to disk using write_mudata().
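A compact sketch of this workflow, using hypothetical object and file names, looks something like this (each step is covered in detail below):

my_data %>%                          # one row per measurement
  mudata() %>%                       # create the object
  update_locations("SITE-1", latitude = 45.0) %>%  # document locations/params/datasets
  update_columns_table() %>%         # sync the columns table with the new metadata columns
  update_columns("latitude", description = "Latitude (degrees north)") %>%
  write_mudata("my_data.mudata.json")  # write to disk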
As an example, I’m going to use a small subset of the sediment
chemistry data that I work with on a regular basis. Instead of being
aligned along the “time” or “date” axis, these data are aligned along
the “depth” axis, or in other words, the columns that identify each
measurement are location
(the sediment sample ID),
param
(the chemical that was measured), and
depth
(the position in the sediment sample). This dataset
is included in the package as pocmaj
and
pocmajsum
.
I’ll use the tidyverse for data wrangling, and the
pocmaj
and pocmajsum
datasets to illustrate
how to get from common data formats to the parameter-long,
one-row-per-measurement data needed by the mudata()
function.
library(tidyverse)
data("pocmaj")
data("pocmajsum")
Parameter-wide, summarised data is probably the most common form of data. If you’ve gotten this far, there is a good chance that you have data like this hanging around somewhere:
pocmajwide <- pocmajsum %>%
  select(core, depth, Ca, V, Ti)
core | depth | Ca | V | Ti |
---|---|---|---|---|
MAJ-1 | 0 | 1885 | 78 | 2370 |
MAJ-1 | 1 | 1418 | 70 | 2409 |
MAJ-1 | 2 | 1550 | 70 | 2376 |
MAJ-1 | 3 | 1448 | 64 | 2485 |
MAJ-1 | 4 | 1247 | 57 | 2414 |
MAJ-1 | 5 | 1412 | 81 | 1897 |
POC-2 | 0 | 1622 | 33 | 2038 |
POC-2 | 1 | 1488 | 36 | 2016 |
POC-2 | 2 | 2416 | 79 | 3270 |
POC-2 | 3 | 2253 | 79 | 3197 |
POC-2 | 4 | 2372 | 87 | 3536 |
POC-2 | 5 | 2635 | 87 | 3890 |
This is a small subset of paleolimnological data for two sediment
cores near Halifax, Nova Scotia. The data is a multi-parameter
spatiotemporal dataset because it contains multiple parameters (calcium,
titanium, and vanadium concentrations) measured along a common axis
(depth in the sediment core) at discrete locations (cores named MAJ-1
and POC-2). Currently, our columns are not named properly: for the
mudata format the terminology is ‘location’ not ‘core’. The
rename()
function is the easiest way to do this.
pocmajwide <- pocmajwide %>%
  rename(location = core)
Finally, we need to get the data into a parameter-long format, with a
column named param
and our actual values in a single column
called value
. This can be done using the
gather()
function.
pocmajlong <- pocmajwide %>%
  gather(Ca, Ti, V, key = "param", value = "value")
The (first six rows of the) data now look like this:
location | depth | param | value |
---|---|---|---|
MAJ-1 | 0 | Ca | 1885 |
MAJ-1 | 1 | Ca | 1418 |
MAJ-1 | 2 | Ca | 1550 |
MAJ-1 | 3 | Ca | 1448 |
MAJ-1 | 4 | Ca | 1247 |
MAJ-1 | 5 | Ca | 1412 |
The last important thing to consider is the axis on which the data
are aligned. This sounds complicated but isn’t: these axes are the same
axes you might use to plot the data, in this case depth
.
The mudata()
constructor needs to know which column this
is, either by explicitly passing x_columns = "depth"
or by
placing the column between “param” and “value”. In most cases (like this one) it can be guessed; you’ll see a message telling you which columns were chosen.
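If you prefer to be explicit, or if the guess would be wrong for your data, you can name the axis column yourself; a sketch:

# explicitly declare the axis column instead of letting mudata() guess it
mudata(pocmajlong, x_columns = "depth")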
Now the data is ready to be put into the mudata()
constructor. If it isn’t, the constructor will throw an error telling
you how to fix the data.
md <- mudata(pocmajlong)
## Guessing x columns: depth
md
## A mudata object aligned along "depth"
## distinct_datasets(): "default"
## distinct_locations(): "MAJ-1", "POC-2"
## distinct_params(): "Ca", "Ti", "V"
## src_tbls(): "data", "locations" ... and 3 more
##
## tbl_data() %>% head():
## # A tibble: 6 × 5
## dataset location param depth value
## <chr> <chr> <chr> <int> <dbl>
## 1 default MAJ-1 Ca 0 1885.
## 2 default MAJ-1 Ca 1 1418
## 3 default MAJ-1 Ca 2 1550
## 4 default MAJ-1 Ca 3 1448
## 5 default MAJ-1 Ca 4 1247
## 6 default MAJ-1 Ca 5 1412.
Data is often output in a format similar to the one above, but with uncertainty information in paired columns. Data from an ICP-MS, for example, is often in this format, with a +/- column next to each concentration column. One of the advantages of a long format is the ability to include this information in a way that makes plotting with error bars easier. The pocmajsum
dataset is a version of the dataset
described above, but with standard deviation values in paired columns
with the value itself.
pocmajsum
core | depth | Ca | Ca_sd | Ti | Ti_sd | V | V_sd |
---|---|---|---|---|---|---|---|
MAJ-1 | 0 | 1885 | 452 | 2370 | 401 | 78 | 9 |
MAJ-1 | 1 | 1418 | NA | 2409 | NA | 70 | NA |
MAJ-1 | 2 | 1550 | NA | 2376 | NA | 70 | NA |
MAJ-1 | 3 | 1448 | NA | 2485 | NA | 64 | NA |
MAJ-1 | 4 | 1247 | NA | 2414 | NA | 57 | NA |
MAJ-1 | 5 | 1412 | 126 | 1897 | 81 | 81 | 12 |
POC-2 | 0 | 1622 | 509 | 2038 | 608 | 33 | 5 |
POC-2 | 1 | 1488 | NA | 2016 | NA | 36 | NA |
POC-2 | 2 | 2416 | NA | 3270 | NA | 79 | NA |
POC-2 | 3 | 2253 | NA | 3197 | NA | 79 | NA |
POC-2 | 4 | 2372 | NA | 3536 | NA | 87 | NA |
POC-2 | 5 | 2635 | 143 | 3890 | 45 | 87 | 8 |
As above, we need to rename the core
column to
location
using the rename()
function.
pocmajwide <- pocmajsum %>%
  rename(location = core)
Then (also as above), we need to gather()
the data to
get it into long form. Because we have paired columns, this is handled by a different function from the mudata2 package called parallel_gather()
.
pocmajlong <- parallel_gather(
  pocmajwide,
  key = "param",
  value = c(Ca, Ti, V),
  sd = c(Ca_sd, Ti_sd, V_sd)
)
location | depth | param | value | sd |
---|---|---|---|---|
MAJ-1 | 0 | Ca | 1885 | 452 |
MAJ-1 | 1 | Ca | 1418 | NA |
MAJ-1 | 2 | Ca | 1550 | NA |
MAJ-1 | 3 | Ca | 1448 | NA |
MAJ-1 | 4 | Ca | 1247 | NA |
MAJ-1 | 5 | Ca | 1412 | 126 |
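As noted above, having the uncertainty in its own column makes error-bar plotting straightforward. A minimal sketch with ggplot2 (loaded as part of the tidyverse); the aesthetics here are illustrative, and rows with missing sd values are simply dropped from the error bars:

# value +/- sd as horizontal error bars, one panel per parameter
ggplot(pocmajlong, aes(x = value, y = depth, colour = location)) +
  geom_errorbarh(aes(xmin = value - sd, xmax = value + sd), height = 0.2) +
  geom_point() +
  scale_y_reverse() +
  facet_wrap(vars(param), scales = "free_x")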
The data is now ready to be fed to the mudata()
constructor:
md <- mudata(pocmajlong)
## Guessing x columns: depth
md
## A mudata object aligned along "depth"
## distinct_datasets(): "default"
## distinct_locations(): "MAJ-1", "POC-2"
## distinct_params(): "Ca", "Ti", "V"
## src_tbls(): "data", "locations" ... and 3 more
##
## tbl_data() %>% head():
## # A tibble: 6 × 6
## dataset location param depth value sd
## <chr> <chr> <chr> <int> <dbl> <dbl>
## 1 default MAJ-1 Ca 0 1885. 452.
## 2 default MAJ-1 Ca 1 1418 NA
## 3 default MAJ-1 Ca 2 1550 NA
## 4 default MAJ-1 Ca 3 1448 NA
## 5 default MAJ-1 Ca 4 1247 NA
## 6 default MAJ-1 Ca 5 1412. 126.
When mudata objects are created using only the data table, the
package creates the necessary tables for parameter, location, and
dataset metadata (if you have these tables prepared already, you can
pass them as the arguments locations
, params
,
and datasets
). These tables provide a place to put
metadata, but doesn’t create any by default. This data is usually needed
later, and including it in the object at the point of creation avoids
others or future you from scratching their (your) heads with the
question “where did core POC-2 come from anyway…”. To do this,
you can update the tables using update_params()
,
update_locations()
, and update_datasets()
. The
first argument of these functions is a vector of identifiers to update
(or all of them if not specified), followed by key/value pairs.
# default parameter table
md %>%
  tbl_params()
## # A tibble: 3 × 2
## dataset param
## <chr> <chr>
## 1 default Ca
## 2 default Ti
## 3 default V
# parameter table with metadata
md %>%
  update_params(method = "Portable XRF Spectrometer (Olympus X-50)") %>%
  tbl_params()
## # A tibble: 3 × 3
## dataset param method
## <chr> <chr> <chr>
## 1 default Ca Portable XRF Spectrometer (Olympus X-50)
## 2 default Ti Portable XRF Spectrometer (Olympus X-50)
## 3 default V Portable XRF Spectrometer (Olympus X-50)
# default location table
md %>%
  tbl_locations()
## # A tibble: 2 × 2
## dataset location
## <chr> <chr>
## 1 default MAJ-1
## 2 default POC-2
# location table with metadata
md %>%
  update_locations(
    "MAJ-1",
    latitude = -64.298, longitude = 44.819, lake = "Lake Major"
  ) %>%
  update_locations(
    "POC-2",
    latitude = -65.985, longitude = 44.913, lake = "Pockwock Lake"
  ) %>%
  tbl_locations()
## # A tibble: 2 × 5
## dataset location latitude longitude lake
## <chr> <chr> <dbl> <dbl> <chr>
## 1 default MAJ-1 -64.3 44.8 Lake Major
## 2 default POC-2 -66.0 44.9 Pockwock Lake
The concept of a “dataset” is intended to refer to the source of the data, but could be anything that applies to the data, params, and locations labelled with that dataset. In this case it would make sense to note that the source of the data is the mudata2 package. The
default name is “default”, which you can change in the
mudata()
function by passing dataset_id
or by
using rename_datasets()
.
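For example, a sketch of giving the dataset a more descriptive identifier at creation time (the identifier here is made up):

# set the dataset identifier when creating the object instead of using "default"
md_named <- mudata(pocmajlong, dataset_id = "poc_maj_xrf")
md_named %>% distinct_datasets()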
# default datasets table
md %>%
  tbl_datasets()
## # A tibble: 1 × 1
## dataset
## <chr>
## 1 default
# datasets table with metadata
md %>%
  update_datasets(source = "R package mudata2") %>%
  tbl_datasets()
## # A tibble: 1 × 2
## dataset source
## <chr> <chr>
## 1 default R package mudata2
Altogether, the param/location/dataset documentation looks like this:
md_doc <- md %>%
  update_params(method = "Portable XRF Spectrometer (Olympus X-50)") %>%
  update_locations(
    "MAJ-1",
    latitude = -63.486, longitude = 44.732, lake = "Lake Major"
  ) %>%
  update_locations(
    "POC-2",
    latitude = -63.839, longitude = 44.794, lake = "Pockwock Lake"
  ) %>%
  update_datasets(source = "R package mudata2")
The mudata()
constructor automatically generates a
barebones columns table (tbl_columns()
), but since creating the object we have added new columns that need
documentation. Thus, before documenting columns using
update_columns()
, it is necessary to call
update_columns_table()
to synchronize the columns table
with the object.
md_doc <- md_doc %>%
  update_columns_table()
Then, you can use update_columns()
to add information
about various columns to the object.
# default columns table
md_doc %>%
  tbl_columns()
## # A tibble: 16 × 4
## dataset table column type
## <chr> <chr> <chr> <chr>
## 1 default data dataset character
## 2 default data location character
## 3 default data param character
## 4 default data depth integer
## 5 default data value double
## 6 default data sd double
## 7 default locations dataset character
## 8 default locations location character
## 9 default locations latitude double
## 10 default locations longitude double
## 11 default locations lake character
## 12 default params dataset character
## 13 default params param character
## 14 default params method character
## 15 default datasets dataset character
## 16 default datasets source character
# columns with metadata
md_doc %>%
  update_columns("depth", description = "Depth in sediment core (cm)") %>%
  update_columns("sd", description = "Standard deviation uncertainty of n=3 values") %>%
  tbl_columns() %>%
  select(dataset, table, column, description, type)
## # A tibble: 16 × 5
## dataset table column description type
## <chr> <chr> <chr> <chr> <chr>
## 1 default data dataset <NA> char…
## 2 default data location <NA> char…
## 3 default data param <NA> char…
## 4 default data depth Depth in sediment core (cm) inte…
## 5 default data value <NA> doub…
## 6 default data sd Standard deviation uncertainty of n=3 valu… doub…
## 7 default locations dataset <NA> char…
## 8 default locations location <NA> char…
## 9 default locations latitude <NA> doub…
## 10 default locations longitude <NA> doub…
## 11 default locations lake <NA> char…
## 12 default params dataset <NA> char…
## 13 default params param <NA> char…
## 14 default params method <NA> char…
## 15 default datasets dataset <NA> char…
## 16 default datasets source <NA> char…
You’ll notice there’s a type
column that is also
automatically generated, which I suggest you don’t modify directly (it will get overwritten by default before you write the object to disk). If a column has the wrong type, fix the type in the underlying data (for example, with dplyr’s mutate()) and then run update_columns_table() again. From the top, the
documentation looks like this:
md_doc <- md %>%
  update_params(method = "Portable XRF Spectrometer (Olympus X-50)") %>%
  update_locations(
    "MAJ-1",
    latitude = -63.486, longitude = 44.732, lake = "Lake Major"
  ) %>%
  update_locations(
    "POC-2",
    latitude = -63.839, longitude = 44.794, lake = "Pockwock Lake"
  ) %>%
  update_datasets(source = "R package mudata2") %>%
  update_columns_table() %>%
  update_columns("depth", description = "Depth in sediment core (cm)") %>%
  update_columns("sd", description = "Standard deviation uncertainty of n=3 values")
There are three possible formats to which mudata objects can be written: a directory of CSV files (one per table), a ZIP archive of that directory, and a JSON encoding of the tables. You can write any of them
using write_mudata()
with a filename
of the
appropriate extension:
# write to directory
write_mudata(md_doc, "poc_maj.mudata")
# write to ZIP
write_mudata(md_doc, "poc_maj.mudata.zip")
# write to JSON
write_mudata(md_doc, "poc_maj.mudata.json")
Then, you can read the file/directory using
read_mudata()
:
# read from directory
read_mudata("poc_maj.mudata")
# read from ZIP
read_mudata("poc_maj.mudata.zip")
# read from JSON
read_mudata("poc_maj.mudata.json")
The convention of using “.mudata.*” isn’t necessary, but seems like a good idea to point potential data users in the direction of this package.
That is most of what there is to creating mudata objects. For more
reading, I suggest looking at the documentation for
mudata()
, update_locations()
,
mudata_prepare_column()
, and
read_mudata()
.