Diving Deeper: Understanding REDCapTidieR Data Tibbles

The Block Matrix

In the Getting Started vignette, we stated that importing records from complex REDCap projects via the REDCap API can be ugly. In this section we will describe what we mean by this in more precise language.

The REDCapR::redcap_read_oneshot() function pulls data from a REDCap project via the REDCap API and provides it to the R environment without performing much processing. It returns a list with an element named data that contains all of the data of the project as a single data frame.

Below, we use this function to import data from the Superhero REDCap database, which contains a nonrepeating instrument heroes_information and a repeating instrument super_hero_powers. We use rmarkdown::paged_table() to allow you to explore this large data frame in the browser.

superheroes_token <- "123456789ABCDEF123456789ABCDEF04"
redcap_uri <- "https://my.institution.edu/redcap/api/"
superheroes_ugly <- REDCapR::redcap_read_oneshot(redcap_uri, superheroes_token)$data

superheroes_ugly |>
  rmarkdown::paged_table()

This data structure is called a block matrix. It’s what happens when REDCap mashes the contents of a project that has both repeating and nonrepeating instruments into a single table.

While it may seem a good idea to have everything in one data frame, there are significant downsides, including:

Structure of REDCapTidieR Data Tibbles

REDCapTidieR breaks the block matrix up into data tibbles, one for each instrument. REDCapTidieR then collects the data tibbles as a list column in the supertibble.

These data tibbles are tidier than the block matrix because one instrument can only be repeating or nonrepeating, and therefore each data tibble can only have one type of observational unit stored inside of it. Each data tibble has consistent granularity

Let’s take a look at the data tibbles derived from the Superhero database and contrast with the block matrix. Consider heroes_information, which contains data from a nonrepeating instrument. This tibble is much smaller than the block matrix at 734 rows by 12 columns, and there are no NAs. The data tibble’s name heroes_information is descriptive. REDCapTidieR knows the names of all instruments because in addition to querying the project’s data it also queries the project’s metadata. Each entry has the same granularity, one superhero, identified by its record_id. A number of data columns follow, starting with name, gender, eye_color, race, etc. These columns are in the same order as on the REDCap instrument. The last column of a REDCapTidieR data tibble is always form_status_complete which indicates whether the instrument has been marked as completed.

library(REDCapTidieR)

read_redcap(redcap_uri, superheroes_token) |>
  bind_tibbles()

heroes_information |>
  rmarkdown::paged_table()

Now consider the super_hero_powers table, which contains data from a repeating instrument. In addition to record_id there is a redcap_repeat_instance column. The granularity of each row is one superpower per superhero, identified by its record_id and redcap_repeat_instance. Because these two columns uniquely identify a row in the data tibble, we call them identifier columns. In the REDCapTidieR supertibble, identifier columns always come first, followed by [data columns] and finally the form_status_complete column.

super_hero_powers |>
  rmarkdown::paged_table()

Longitudinal REDCap projects

REDCap databases support two main mechanisms to allow collecting the same data multiple times: repeating instruments and longitudinal projects.

The granularity of each table (i.e. the observational unit that a single row represents) depends on the structure of the project (classic, longitudinal, longitudinal with multiple arms) as well as the structure of the instrument (repeating or nonrepeating). REDCapTidieR generates identifier columns according to the following table:

Classic Longitudinal, one arm Longitudinal, multi-arm
Nonrepeating record_id record_id +
redcap_event
record_id +
redcap_event +
redcap_arm
Repeating record_id +
redcap_repeat_instance
record_id +
redcap_repeat_instance +
redcap_event
record_id +
redcap_repeat_instance +
redcap_event +
redcap_arm

Note: Taken in combination, the identifier columns of any REDCapTidieR tibble are guaranteed to be unique and not NA and can therefore be used as composite primary key. This makes it easy to join REDCapTidieR tibbles to each other!

By default, REDCap names the record ID field record_id, but this can be changed, and REDCapTidieR is smart enough to handle this. For example, if the record ID field was renamed to subject_id then the record ID column of each data tibble would be subject_id.

Let’s look at a REDCap database you might use to capture data for a clinical trial. This is a longitudinal database that assesses some data (e.g. demographics) just once at enrollment, and other data (e.g. physical exam or labs) multiple times at pre-defined study visits. We use dplyr::select() here to highlight instrument names and whether or not each instrument is repeating. This database has six nonrepeating and three repeating instruments.

longitudinal_token <- "123456789ABCDEF123456789ABCDEF06"
library(REDCapTidieR)

longitudinal <- read_redcap(redcap_uri, longitudinal_token)

longitudinal |>
  dplyr::select(redcap_form_name, redcap_form_label, structure) |>
  rmarkdown::paged_table()

The demographics instrument is nonrepeating. The granularity is one row per study subject per event, but since it is only designated for a single event, it is really one row per study subject. The redcap_event column identifies the name of the event with which the instrument is associated: screening__enrollm (Screening and Enrollment).

longitudinal |>
  bind_tibbles()

demographics |>
  rmarkdown::paged_table()

The chemistry instrument is nonrepeating as well. However, it is designated for multiple events because chemistry labs are drawn at multiple study visits. The redcap_event column shows that we have chemistry data from four different events. The granularity is one row per study subject per event. Each of the three subjects has multiple rows. Each row is identified by the combination of subject_id and redcap_event.

chemistry |>
  rmarkdown::paged_table()

The adverse_events is repeating. Since adverse events aren’t tied to specific study visits, and a patient can have any number of different adverse events at any time, it makes sense to designate this instrument to a single special event adverse_event, which is what we’ve done here. The granularity of this table is one row per study subject per repeat instance per event. However since it is only designated for a single event, it is really one row per study subject per repeat instance. The first subject has three adverse events listed, and the second subject has two.

adverse_events |>
  rmarkdown::paged_table()

It is possible to have a repeating instrument designated to multiple events, however this is an uncommon pattern. REDCapTidieR supports this scenario as well.

REDCapTidieR does not allow you to have the same instrument designated both as a repeating and as a nonrepeating instrument in different events. It will throw an error if it detects this.

At this time, REDCapTidieR does not support repeating events. Feel free to submit a feature request if this is something you need!

REDCapTidieR supports projects with multiple arms. If you have a project with multiple arms, there will be an additional column redcap_arm to identify the arm that the row is associated with.

Categorical variables

REDCapTidieR performs a number of opinionated transformations on categorical fields to streamline exploring and working with them.

Fields that are of type yesno or truefalse are converted to logical variables.

adverse_events$adverse_event_serious |>
  dplyr::glimpse()
#>  logi [1:5] FALSE FALSE FALSE FALSE FALSE

Fields of single-answer categorical type (dropdown, radio) are converted to factor variables by default. To understand how REDCapTidieR constructs this factor, consider the Field Editor for the adverse_event_grade variable which is a single-answer radio field. There are five choices for this field. Choices have raw values and choice labels, separated by a comma. Here, the raw values are 1, 2, 3, 4, and 5, and their associated choice labels are Grade 1, Grade 2, Grade 3, Grade 4, and Grade 5. We find that choice labels are generally more useful and intuitive to work with than raw values. For this reason, REDCapTidieR derives the labels of the factor from the choice label. The order of the factor levels is the same as the order of choices in REDCap.

REDCap field editor showing the choices of a single-answer field

This makes it possible to discover all of the different choices for a specific field even if not all of them are represented in the data. For example, even though we have only Grade 1 and 2 adverse events in the data, we can see that there are 5 possible grades.

adverse_events$adverse_event_grade
#> [1] Grade 2 Grade 1 Grade 1 Grade 2 Grade 1
#> Levels: Grade 1 Grade 2 Grade 3 Grade 4 Grade 5

Adverse event grades are intrinsically ordered. When a multiple choice field has intrinsically ordered choices, those choices are usually presented in the proper order. Since REDCapTidieR preserves the order of choices from REDCap, you can convert intrinsically ordered variables to ordered factors.

ae_grade <- adverse_events$adverse_event_grade

ae_grade |>
  factor(ordered = TRUE, levels = levels(ae_grade))
#> [1] Grade 2 Grade 1 Grade 1 Grade 2 Grade 1
#> Levels: Grade 1 < Grade 2 < Grade 3 < Grade 4 < Grade 5

Fields of the multi-answer checkbox type are expanded so that each choice is represented as one logical variable. Consider the adverse_event_relationship_other field. This field has eight choices, and their raw values (i.e. the part before the comma) are apheresis, ld_chemo, soc_treatment, underlying_disease, other_research_proc, con_meds, and other.

REDCap field editor showing the choices of a multi-answer field

We recommend that for multi-answer checkbox fields you make the raw value a human readable word as we do in the above example, and not serial number which is with REDCap will provide by default. REDCapTidieR will construct the name of each variable from the name of the field, followed by three underscores, followed by the raw value of the choice defined in REDCap. Here are the variables REDCapTidieR generates from the adverse_event_relationship_other field:

adverse_events |>
  dplyr::select(dplyr::starts_with("adverse_event_relationship_other___")) |>
  dplyr::glimpse()
#> Rows: 5
#> Columns: 8
#> $ adverse_event_relationship_other___apheresis           <lgl> FALSE, FALSE, F…
#> $ adverse_event_relationship_other___ld_chemo            <lgl> FALSE, FALSE, F…
#> $ adverse_event_relationship_other___soc_treatment       <lgl> FALSE, FALSE, F…
#> $ adverse_event_relationship_other___underlying_disease  <lgl> TRUE, FALSE, FA…
#> $ adverse_event_relationship_other___other_research_proc <lgl> FALSE, FALSE, F…
#> $ adverse_event_relationship_other___con_meds            <lgl> TRUE, FALSE, TR…
#> $ adverse_event_relationship_other___other               <lgl> FALSE, FALSE, F…
#> $ adverse_event_relationship_other___none                <lgl> FALSE, TRUE, FA…

Surveys

Instruments that are used to generate REDCap Surveys generate additional data columns:

survey <- read_redcap(redcap_uri, survey_token) |>
  extract_tibble("survey")

survey |>
  dplyr::glimpse()
#> Rows: 4
#> Columns: 9
#> $ record_id                <dbl> 1, 2, 3, 4
#> $ survey_yesno             <lgl> TRUE, FALSE, NA, NA
#> $ survey_radio             <fct> Choice 1, Choice 2, NA, NA
#> $ survey_checkbox___one    <lgl> FALSE, FALSE, FALSE, FALSE
#> $ survey_checkbox___two    <lgl> TRUE, TRUE, FALSE, FALSE
#> $ survey_checkbox___three  <lgl> TRUE, TRUE, FALSE, FALSE
#> $ redcap_survey_identifier <lgl> NA, NA, NA, NA
#> $ redcap_survey_timestamp  <dttm> 2022-11-09 10:33:35, NA, NA, NA
#> $ form_status_complete     <fct> Complete, Incomplete, Incomplete, Incomplete

Summary

In summary, here are the rules by which REDCapTidieR constructs data tibbles:

  1. One data tibble is built for each instrument.
  2. The first column is always the record ID column which is derived from the record ID field.
  3. Additional identifier columns may follow the record ID column, depending on the structure of the instrument and the structure of the project (see the table above).
  4. After the identifier columns, data columns appear in the same order as in the REDCap instrument. True/false type fields are converted to logical columns. Single-answer categorical fields are converted to factor columns. Multi-answer checkbox fields are expanded to a set of logical columns, one for each choice.
  5. If the instrument is used as a survey, the survey timestamp and identifier columns will appear next.
  6. The final column is always form_status_complete.