In the Getting Started vignette, we stated that importing records from complex REDCap projects via the REDCap API can be ugly. In this section we will describe what we mean by this in more precise language.
The REDCapR::redcap_read_oneshot()
function pulls data
from a REDCap project via the REDCap
API and provides it to the R environment without performing much
processing. It returns a list with an element named data
that contains all of the data of the project as a single data frame.
Below, we use this function to import data from the Superhero
REDCap database, which contains a nonrepeating
instrument heroes_information
and a repeating instrument
super_hero_powers
. We use
rmarkdown::paged_table()
to allow you to explore this large
data frame in the browser.
<- "123456789ABCDEF123456789ABCDEF04"
superheroes_token <- "https://my.institution.edu/redcap/api/" redcap_uri
<- REDCapR::redcap_read_oneshot(redcap_uri, superheroes_token)$data
superheroes_ugly
|>
superheroes_ugly ::paged_table() rmarkdown
This data structure is called a block matrix. It’s what happens when REDCap mashes the contents of a project that has both repeating and nonrepeating instruments into a single table.
While it may seem a good idea to have everything in one data frame, there are significant downsides, including:
It’s unwieldy! Although there are only 734 superheroes in the data set, there are 6,700 rows. Every transformation first requires whittling down a huge data set.
There are a lot of NA
values. Many of these
NA
values don’t represent fields left blank during data entry but
instead are an artifact of how the table is generated.
Metadata that would be helpful is missing. For example, it’s not possible to determine which columns are associated with a specific instrument for nonrepeating instruments.
The meaning of a row in the data set is inconsistent. Some rows represent observations on a per-record level (i.e. those derived from nonrepeating instruments) and others represent observations on a per-repeat-instance level (i.e. those derived from repeating instruments). This technically violates the definition of Tidy Data because multiple types of observational units are stored in the same table.
REDCapTidieR breaks the block matrix up into data tibbles, one for each instrument. REDCapTidieR then collects the data tibbles as a list column in the supertibble.
These data tibbles are tidier than the block matrix because one instrument can only be repeating or nonrepeating, and therefore each data tibble can only have one type of observational unit stored inside of it. Each data tibble has consistent granularity
Let’s take a look at the data tibbles derived from the Superhero
database and contrast with the block matrix. Consider
heroes_information
, which contains data from a
nonrepeating instrument. This tibble is much smaller
than the block matrix at 734 rows by 12 columns, and there are no
NA
s. The data tibble’s name heroes_information
is descriptive. REDCapTidieR knows the names of all instruments because
in addition to querying the project’s data it also queries the project’s
metadata. Each entry has the same granularity, one
superhero, identified by its record_id
. A
number of data columns follow,
starting with name
, gender
,
eye_color
, race
, etc. These columns are in the
same order as on the REDCap instrument. The last column of a
REDCapTidieR data tibble is always form_status_complete
which indicates whether the instrument has been marked as completed.
library(REDCapTidieR)
read_redcap(redcap_uri, superheroes_token) |>
bind_tibbles()
|>
heroes_information ::paged_table() rmarkdown
Now consider the super_hero_powers
table, which contains
data from a repeating instrument. In addition to
record_id
there is a redcap_repeat_instance
column. The granularity of each row is one superpower per
superhero, identified by its record_id
and
redcap_repeat_instance
. Because these two columns uniquely
identify a row in the data tibble, we call them identifier
columns. In the REDCapTidieR supertibble, identifier
columns always come first, followed by [data columns]
and finally the form_status_complete
column.
|>
super_hero_powers ::paged_table() rmarkdown
REDCap databases support two main mechanisms to allow collecting the same data multiple times: repeating instruments and longitudinal projects.
The granularity of each table (i.e. the observational unit that a single row represents) depends on the structure of the project (classic, longitudinal, longitudinal with multiple arms) as well as the structure of the instrument (repeating or nonrepeating). REDCapTidieR generates identifier columns according to the following table:
Classic | Longitudinal, one arm | Longitudinal, multi-arm | |
---|---|---|---|
Nonrepeating | record_id |
record_id +redcap_event |
record_id +redcap_event +redcap_arm |
Repeating | record_id +redcap_repeat_instance |
record_id +redcap_repeat_instance +redcap_event |
record_id +redcap_repeat_instance +redcap_event +redcap_arm |
Note: Taken in combination, the identifier columns of any REDCapTidieR tibble are guaranteed to be unique and not
NA
and can therefore be used as composite primary key. This makes it easy to join REDCapTidieR tibbles to each other!
By default, REDCap names the record ID field record_id
,
but this can be changed, and REDCapTidieR is smart enough to handle
this. For example, if the record ID field was renamed to
subject_id
then the record ID column of each data tibble
would be subject_id
.
Let’s look at a REDCap database you might use to capture data for a
clinical trial. This is a longitudinal database that assesses some data
(e.g. demographics) just once at enrollment, and other data
(e.g. physical exam or labs) multiple times at pre-defined study visits.
We use dplyr::select()
here to highlight instrument names
and whether or not each instrument is repeating. This database has six
nonrepeating and three repeating instruments.
<- "123456789ABCDEF123456789ABCDEF06" longitudinal_token
library(REDCapTidieR)
<- read_redcap(redcap_uri, longitudinal_token)
longitudinal
|>
longitudinal ::select(redcap_form_name, redcap_form_label, structure) |>
dplyr::paged_table() rmarkdown
The demographics
instrument is nonrepeating. The
granularity is one row per study subject per event, but since it is only designated
for a single event, it is really one row per study subject. The
redcap_event
column identifies the name of the event with
which the instrument is associated: screening__enrollm
(Screening and Enrollment).
|>
longitudinal bind_tibbles()
|>
demographics ::paged_table() rmarkdown
The chemistry
instrument is nonrepeating as well.
However, it is designated for multiple events because chemistry labs are
drawn at multiple study visits. The redcap_event
column
shows that we have chemistry
data from four different
events. The granularity is one row per study subject per event. Each of
the three subjects has multiple rows. Each row is identified by the
combination of subject_id
and
redcap_event
.
|>
chemistry ::paged_table() rmarkdown
The adverse_events
is repeating. Since
adverse events aren’t tied to specific study visits, and a patient can
have any number of different adverse events at any time, it makes sense
to designate this instrument to a single special event
adverse_event
, which is what we’ve done here. The
granularity of this table is one row per study subject per repeat
instance per event. However since it is only designated for a single
event, it is really one row per study subject per repeat instance. The
first subject has three adverse events listed, and the second subject
has two.
|>
adverse_events ::paged_table() rmarkdown
It is possible to have a repeating instrument designated to multiple events, however this is an uncommon pattern. REDCapTidieR supports this scenario as well.
REDCapTidieR does not allow you to have the same instrument designated both as a repeating and as a nonrepeating instrument in different events. It will throw an error if it detects this.
At this time, REDCapTidieR does not support repeating events. Feel free to submit a feature request if this is something you need!
REDCapTidieR supports projects with multiple arms. If you have a
project with multiple arms, there will be an additional column
redcap_arm
to identify the arm that the row is associated
with.
REDCapTidieR performs a number of opinionated transformations on categorical fields to streamline exploring and working with them.
Fields that are of type
yesno
or truefalse
are converted to
logical variables.
$adverse_event_serious |>
adverse_events::glimpse()
dplyr#> logi [1:5] FALSE FALSE FALSE FALSE FALSE
Fields of single-answer categorical type (dropdown
,
radio
) are converted to factor variables by
default. To understand how REDCapTidieR constructs this factor, consider
the Field Editor for the adverse_event_grade
variable which
is a single-answer radio
field. There are five choices for this field. Choices have
raw values and choice labels,
separated by a comma. Here, the raw values are 1
,
2
, 3
, 4
, and 5
, and
their associated choice labels are Grade 1
,
Grade 2
, Grade 3
, Grade 4
, and
Grade 5
. We find that choice labels are generally more
useful and intuitive to work with than raw values. For this reason,
REDCapTidieR derives the labels of the factor from the choice label. The
order of the factor levels is the same as the order of choices in
REDCap.
REDCap field editor showing the choices of a single-answer field
This makes it possible to discover all of the different choices for a specific field even if not all of them are represented in the data. For example, even though we have only Grade 1 and 2 adverse events in the data, we can see that there are 5 possible grades.
$adverse_event_grade
adverse_events#> [1] Grade 2 Grade 1 Grade 1 Grade 2 Grade 1
#> Levels: Grade 1 Grade 2 Grade 3 Grade 4 Grade 5
Adverse event grades are intrinsically ordered. When a multiple choice field has intrinsically ordered choices, those choices are usually presented in the proper order. Since REDCapTidieR preserves the order of choices from REDCap, you can convert intrinsically ordered variables to ordered factors.
<- adverse_events$adverse_event_grade
ae_grade
|>
ae_grade factor(ordered = TRUE, levels = levels(ae_grade))
#> [1] Grade 2 Grade 1 Grade 1 Grade 2 Grade 1
#> Levels: Grade 1 < Grade 2 < Grade 3 < Grade 4 < Grade 5
Fields of the multi-answer checkbox
type are expanded so
that each choice is represented as one logical
variable. Consider the
adverse_event_relationship_other
field. This field has
eight choices, and their raw values (i.e. the part before the comma) are
apheresis
, ld_chemo
,
soc_treatment
, underlying_disease
,
other_research_proc
, con_meds
, and
other
.
REDCap field editor showing the choices of a multi-answer field
We recommend that for multi-answer checkbox
fields you
make the raw value a human readable word as we do in the above example,
and not serial number which is with REDCap will provide by default.
REDCapTidieR will construct the name of each variable from the name of
the field, followed by three underscores, followed by the raw value of
the choice defined in REDCap. Here are the variables REDCapTidieR
generates from the adverse_event_relationship_other
field:
|>
adverse_events ::select(dplyr::starts_with("adverse_event_relationship_other___")) |>
dplyr::glimpse()
dplyr#> Rows: 5
#> Columns: 8
#> $ adverse_event_relationship_other___apheresis <lgl> FALSE, FALSE, F…
#> $ adverse_event_relationship_other___ld_chemo <lgl> FALSE, FALSE, F…
#> $ adverse_event_relationship_other___soc_treatment <lgl> FALSE, FALSE, F…
#> $ adverse_event_relationship_other___underlying_disease <lgl> TRUE, FALSE, FA…
#> $ adverse_event_relationship_other___other_research_proc <lgl> FALSE, FALSE, F…
#> $ adverse_event_relationship_other___con_meds <lgl> TRUE, FALSE, TR…
#> $ adverse_event_relationship_other___other <lgl> FALSE, FALSE, F…
#> $ adverse_event_relationship_other___none <lgl> FALSE, TRUE, FA…
Instruments that are used to generate REDCap Surveys generate additional data columns:
redcap_survey_timestamp
: the time at which the
survey was competed
redcap_survey_identifier
: the participant
identifier. This will be NA
if the Participant Identifier
feature in REDCap is disabled.
<- read_redcap(redcap_uri, survey_token) |>
survey extract_tibble("survey")
|>
survey ::glimpse()
dplyr#> Rows: 4
#> Columns: 9
#> $ record_id <dbl> 1, 2, 3, 4
#> $ survey_yesno <lgl> TRUE, FALSE, NA, NA
#> $ survey_radio <fct> Choice 1, Choice 2, NA, NA
#> $ survey_checkbox___one <lgl> FALSE, FALSE, FALSE, FALSE
#> $ survey_checkbox___two <lgl> TRUE, TRUE, FALSE, FALSE
#> $ survey_checkbox___three <lgl> TRUE, TRUE, FALSE, FALSE
#> $ redcap_survey_identifier <lgl> NA, NA, NA, NA
#> $ redcap_survey_timestamp <dttm> 2022-11-09 10:33:35, NA, NA, NA
#> $ form_status_complete <fct> Complete, Incomplete, Incomplete, Incomplete
In summary, here are the rules by which REDCapTidieR constructs data tibbles:
form_status_complete
.