REDCap is electronic data capture software that is widely used in the academic research community. The REDCapR package streamlines calls to the REDCap API from an R environment. One of REDCapR’s main uses is to import records from a REDCap project. This works well for simple projects, but it becomes unwieldy when complex databases with longitudinal structure and/or repeating instruments are involved.
The REDCapTidieR package aims to make the life of analysts who deal with complex REDCap databases easier. It builds upon REDCapR to make its output tidier. Instead of one large data frame that contains all the data from your project, you get to work with a set of tidy tibbles, one for each REDCap instrument.
Let’s look at a REDCap database that has information about some 734 superheroes, derived from the Superhero Database.
Here is a screenshot of the REDCap Record Status Dashboard of this database. It has two instruments: Heroes Information, which captures “demographic” data about each individual superhero such as their name, gender, and alignment (good or evil), and Super Hero Powers, which captures each of the superpowers that a specific superhero possesses.
REDCap Record Status Dashboard for the Superhero database
To import data from REDCap, use the read_redcap() function. read_redcap() requires a REDCap database URI and a REDCap API token. You need to have API access to the REDCap database to use REDCapTidieR. REDCapTidieR does not work with files exported from REDCap. We use it here to import data from the Superheroes database. You can see that it returns a tibble named superheroes. We use rmarkdown::paged_table() so you can explore this tibble.
library(REDCapTidieR)
superheroes <- read_redcap(redcap_uri, token)

superheroes |>
  rmarkdown::paged_table()
You can see that the tibble that read_redcap() returned has only two rows. This may be surprising because you might expect more rows from a database with 734 superheroes. read_redcap() returns data in a special object that we call the supertibble. The supertibble contains, among other things, tibbles with the data and metadata derived from each instrument. We call these the data tibbles and metadata tibbles.
Each row of the supertibble corresponds to one REDCap instrument. The redcap_form_name and redcap_form_label columns identify which instrument the row relates to. The redcap_data column contains the data tibbles. The redcap_metadata column contains the metadata tibbles. Additional columns contain useful information about the data tibble, such as row and column counts, size in memory, and the percentage of missing values in the data.
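The list-column layout described above can be illustrated with a small hand-built tibble. This is only a sketch of the idea, not REDCapTidieR's implementation: the object toy_supertibble and every value in it are invented for illustration, and a real supertibble carries more columns (metadata tibbles, row counts, and so on).

```r
library(tibble)

# Hypothetical stand-in for a supertibble: one row per instrument,
# with each instrument's data stored as a tibble in a list column.
toy_supertibble <- tibble(
  redcap_form_name = c("heroes_information", "super_hero_powers"),
  redcap_data = list(
    tibble(record_id = c(1, 2), name = c("Hero A", "Hero B")),
    tibble(
      record_id = c(1, 1, 2),
      redcap_repeat_instance = c(1, 2, 1),
      power = c("Flight", "Strength", "Stealth")
    )
  )
)

# Two instruments give two rows, however many records each instrument holds
nrow(toy_supertibble)

# Row counts of the nested data tibbles
vapply(toy_supertibble$redcap_data, nrow, integer(1))

# Pull out one nested tibble by instrument name
toy_supertibble$redcap_data[[
  which(toy_supertibble$redcap_form_name == "super_hero_powers")
]]
```

Seen this way, the two-row result of read_redcap() is less surprising: the records live inside the nested tibbles rather than in the rows of the outer table.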
We designed the supertibble so you can explore it with the RStudio Data Viewer. You can click on the table icon in the Environment tab to view the supertibble in the Data Viewer. At a glance you see an overview of the instruments in the REDCap project.
Data Viewer showing the superheroes supertibble
You can drill down into individual tables in the redcap_data and redcap_metadata columns. Note that in the heroes_information data tibble, each row represents a superhero, identified by their record_id.
Data Viewer showing the heroes_information data tibble
In the super_hero_powers data tibble, each row represents a superpower of a specific hero. Each row is identified by the combination of record_id and redcap_repeat_instance. This difference in granularity is because in REDCap super_hero_powers was designated to be a repeating instrument whereas heroes_information was designated as a nonrepeating instrument.
Data Viewer showing the super_hero_powers data tibble
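The difference in granularity between a nonrepeating and a repeating instrument can be checked with a few lines of base R. The two data frames below are invented toy data that only mimic the key columns named above (the power column and all values are assumptions), so treat this as a sketch rather than output from the Superheroes database.

```r
# Toy nonrepeating instrument: one row per record
heroes <- data.frame(
  record_id = c(1, 2),
  name = c("Hero A", "Hero B")
)

# Toy repeating instrument: several rows per record
powers <- data.frame(
  record_id = c(1, 1, 2),
  redcap_repeat_instance = c(1, 2, 1),
  power = c("Flight", "Strength", "Stealth")
)

# record_id alone identifies a row in the nonrepeating table ...
anyDuplicated(heroes["record_id"]) == 0

# ... but not in the repeating table
anyDuplicated(powers["record_id"]) > 0

# The pair (record_id, redcap_repeat_instance) is unique there
anyDuplicated(powers[c("record_id", "redcap_repeat_instance")]) == 0

# Joining on record_id yields one row per hero-power combination
hero_powers <- merge(heroes, powers, by = "record_id")
nrow(hero_powers)
```

Keeping the two granularities in separate tables avoids the repeated or empty cells that appear when both are forced into a single data frame.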
You can also explore the metadata tibbles in the redcap_metadata column to find out about field labels, field types, and other field attributes.
Data Viewer showing the heroes_information metadata tibble
REDCapTidieR provides three different functions to extract data tibbles from a supertibble.
The bind_tibbles() function takes a supertibble and binds its data tibbles directly into the global environment. When you use bind_tibbles() while working interactively in the RStudio IDE, you will see data tibbles appear in the Environment pane.
Demonstration of the bind_tibbles() function
By default, bind_tibbles() extracts all data tibbles from the supertibble. With the tbls argument you can specify a subset of data tibbles that should be extracted. With the environment argument you can supply your own environment object to which the tibbles will be bound.
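Binding a chosen subset of tables into a custom environment can be pictured with base R alone. The sketch below uses list2env() on an invented named list as an analogy; it makes no claim about how bind_tibbles() works internally, and toy_tibbles and my_env are hypothetical names.

```r
# Hypothetical named list standing in for the data tibbles of a supertibble
toy_tibbles <- list(
  heroes_information = data.frame(record_id = c(1, 2)),
  super_hero_powers = data.frame(record_id = c(1, 1, 2))
)

# A fresh environment keeps the global environment uncluttered
my_env <- new.env()

# Bind only a chosen subset of the tables into that environment
list2env(toy_tibbles["heroes_information"], envir = my_env)

# Only the selected table is now visible in my_env
ls(my_env)
exists("super_hero_powers", envir = my_env, inherits = FALSE)
```

Supplying your own environment is useful in scripts and functions where writing into the global environment would be an unwanted side effect.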
The extract_tibbles() function takes a supertibble and returns a named list of data tibbles. The default is to extract all data tibbles. We use str() here to show the structure of the list returned by extract_tibbles().
superheroes_list <- superheroes |>
  extract_tibbles()

superheroes_list |>
  str(max.level = 1)
#> List of 2
#> $ heroes_information: tibble [734 × 12] (S3: tbl_df/tbl/data.frame)
#> $ super_hero_powers : tibble [5,966 × 4] (S3: tbl_df/tbl/data.frame)
You can use tidyselect selectors to select specific data tibbles.
superheroes |>
  extract_tibbles(ends_with("powers")) |>
  str(max.level = 1)
#> List of 1
#> $ super_hero_powers: tibble [5,966 × 4] (S3: tbl_df/tbl/data.frame)
The extract_tibble() function takes a supertibble and returns a single data tibble.
superheroes |>
  extract_tibble("heroes_information") |>
  rmarkdown::paged_table()
You might wonder if it’s memory efficient to have both the supertibble and the extracted tibbles in your environment. Because of R’s copy-on-modify behavior, extracted data tibbles actually use very little additional memory. To demonstrate this, here we check the size of the superheroes supertibble:
lobstr::obj_size(superheroes)
#> 313.50 kB
If we bind the data tibbles into the environment and then check the combined size of the supertibble and the two data tibbles we get the following:
superheroes |>
  bind_tibbles()

lobstr::obj_size(superheroes, heroes_information, super_hero_powers)
#> 313.50 kB
The same is true if we use the extract_tibble() or extract_tibbles() functions:
a <- superheroes |> extract_tibble("heroes_information")
b <- superheroes |> extract_tibbles()

lobstr::obj_size(superheroes, a, b)
#> 313.69 kB
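The same copy-on-modify effect can be reproduced without a REDCap connection. The sketch below uses an invented data frame nested in a list as a stand-in for a data tibble inside a supertibble; the object names are hypothetical, and exact sizes will vary by platform, so none are shown.

```r
library(lobstr)

big <- data.frame(x = runif(1e5))
container <- list(big = big)  # stands in for a data tibble nested in a supertibble
extracted <- container$big    # "extracting" adds a new name, not a copy

size_container <- obj_size(container)
size_shared <- obj_size(container, extracted)  # shared memory is counted once

# Modifying the extracted object forces R to copy the modified column
extracted$x[1] <- 0
size_after <- obj_size(container, extracted)

# Before modification the extra name costs nothing; afterwards it does
as.numeric(size_shared) == as.numeric(size_container)
as.numeric(size_after) > as.numeric(size_shared)
```

The memory cost therefore appears only once you start modifying an extracted tibble, and then only for the parts that change.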
REDCapTidieR integrates with the labelled package to allow you to attach labels to variables in the supertibble. Variable labels can make data exploration easier. An increasing number of R packages support labelled data, including ggplot2 (via ggeasy) and gtsummary. The RStudio Data Viewer shows variable labels below variable names.
Data Viewer showing part of a labelled supertibble
The make_labelled() function takes a supertibble and returns a supertibble with variable labels applied to the variables of the supertibble as well as to the variables of all data and metadata tibbles in the redcap_data and redcap_metadata columns of the supertibble.
You can use the labelled::look_for() function to explore the variable labels of a tibble.
superheroes |>
  make_labelled() |>
  bind_tibbles()

labelled::look_for(heroes_information)
#>  pos variable             label                        col_type values
#>  1   record_id            Record ID                    dbl
#>  2   name                 Hero name:                   chr
#>  3   gender               Gender                       chr
#>  4   eye_color            Eye color                    chr
#>  5   race                 Race                         chr
#>  6   hair_color           Hair color                   chr
#>  7   height               Height                       dbl
#>  8   weight               Weight                       dbl
#>  9   publisher            Publisher                    chr
#>  10  skin_color           Skin Color                   chr
#>  11  alignment            Alignment                    chr
#>  12  form_status_complete REDCap Instrument Completed? fct      Incomplete
#>                                                                 Unverified
#>                                                                 Complete
Where did these labels come from? These labels are actually the REDCap field labels that prompt data entry in the REDCap instrument! REDCapTidieR places them into the field_label variable of the instrument’s metadata tibble. Below you can see that the field labels of the REDCap instrument for heroes_information are the same as the labels above.
REDCap data entry view of the heroes_information instrument
The label for name doesn’t look quite right. Let’s remove that trailing colon. The make_labelled() function has a format_labels argument that you can use to preprocess labels before applying them to the variables.
superheroes |>
  make_labelled(format_labels = ~ gsub(":", "", .)) |>
  bind_tibbles()

labelled::look_for(heroes_information, "hero")
#>  pos variable label     col_type values
#>  2   name     Hero name chr
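Note that gsub(":", "", x) removes every colon in a label, not only a trailing one. For labels that may contain colons elsewhere, an end-anchored pattern is a safer choice. The comparison below is a base R sketch; the second label is invented to show the difference.

```r
labels <- c("Hero name:", "Ratio 1:2:")  # second label is invented

# Removes every colon, including the one inside "1:2"
all_removed <- gsub(":", "", labels)

# Removes only a colon at the very end of the string
trailing_removed <- sub(":$", "", labels)

all_removed
trailing_removed
```

For the name field both approaches give the same result, because its label contains a single colon at the end.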
Removing trailing colon characters is a fairly common operation, so REDCapTidieR provides a format helper function that you can pass to the format_labels argument:
fmt_strip_trailing_colon("Hero name:")
#> [1] "Hero name"
To find out about other helpers included with REDCapTidieR, see ?`format-helpers`.
The format_labels argument will also accept multiple functions in a vector or list. You can use any function that takes a character vector and returns a modified character vector. make_labelled() will process the variable labels using these functions in the order they are supplied. In the following example, we remove the trailing colon with fmt_strip_trailing_colon() and then make the labels lower case with base::tolower().
superheroes |>
  make_labelled(
    format_labels = c(
      fmt_strip_trailing_colon,
      base::tolower
    )
  ) |>
  bind_tibbles()

labelled::look_for(heroes_information)
#>  pos variable             label                        col_type values
#>  1   record_id            record id                    dbl
#>  2   name                 hero name                    chr
#>  3   gender               gender                       chr
#>  4   eye_color            eye color                    chr
#>  5   race                 race                         chr
#>  6   hair_color           hair color                   chr
#>  7   height               height                       dbl
#>  8   weight               weight                       dbl
#>  9   publisher            publisher                    chr
#>  10  skin_color           skin color                   chr
#>  11  alignment            alignment                    chr
#>  12  form_status_complete redcap instrument completed? fct      Incomplete
#>                                                                 Unverified
#>                                                                 Complete
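Applying several formatting functions in the order supplied amounts to function composition, which can be sketched in base R with Reduce(). This is an illustration of the idea only, not necessarily how make_labelled() does it; strip_colon is a hypothetical stand-in for fmt_strip_trailing_colon(), and the labels are copied from the output above.

```r
# Hypothetical stand-in for fmt_strip_trailing_colon()
strip_colon <- function(x) sub(":$", "", x)

# c() on functions produces a list of functions, kept in the order given
formatters <- c(strip_colon, base::tolower)

labels <- c("Record ID", "Hero name:", "Skin Color")

# Feed the labels through each formatter in turn
formatted <- Reduce(function(acc, f) f(acc), formatters, labels)
formatted
```

Order matters when one formatter depends on the output of another, so it is worth listing the functions in the sequence you would apply them by hand.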