REDCapDM



1 Introduction

The REDCapDM package first lets us read data, either exported directly from REDCap or retrieved through an API connection. Next, it lets us create reports of queries, such as outliers or missing values, and track them. Finally, it lets us preprocess previously downloaded data.



2 Functions

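Functions included in the package that are used in the examples below:

- redcap_data: read data exported from REDCap or retrieved through the API
- rd_transform: preprocess the raw data
- rd_rlogic: translate REDCap logic into logic that can be evaluated in R
- rd_insert_na: set variables to missing when a given logic is fulfilled
- rd_query: create reports of queries on variables
- rd_event: create reports of queries on events
- check_queries: compare a previous set of queries with a new one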



3 Examples

3.1 redcap_data

Exported data from REDCap:

datos <- redcap_data(data_path="C:/Users/username/example.r",
                     dic_path="C:/Users/username/example_dictionary.csv")

Data retrieved through the API:

datos_api <- redcap_data(uri ="https://redcap.idibell.cat/api/",
                         token = "55E5C3D1E83213ADA2182A4BFDEA")
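
A token written directly in a script is easy to leak when the script is shared. As a minimal sketch of one alternative, the token can be read from an environment variable. The variable name REDCAP_API_TOKEN and the helper get_redcap_token() are hypothetical and are not part of REDCapDM.

```r
# Hypothetical helper (not part of REDCapDM): read the API token from an
# environment variable instead of writing it in the script.
get_redcap_token <- function(var = "REDCAP_API_TOKEN") {
  token <- Sys.getenv(var, unset = "")
  if (!nzchar(token)) {
    stop("Environment variable '", var, "' is not set", call. = FALSE)
  }
  token
}

# Demonstration with a dummy value; in practice the variable would be set
# outside the script, for example in ~/.Renviron.
Sys.setenv(REDCAP_API_TOKEN = "DUMMY-TOKEN")
token <- get_redcap_token()
Sys.unsetenv("REDCAP_API_TOKEN")
```

The value could then be passed as token = token in a call like the one above.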

For the examples, we will use the COVICAN data set, which is included in the package. For more information, see ?covican.

datos_redcap <- covican
List of 2
 $ data      :'data.frame': 342 obs. of  56 variables:
 $ dictionary:'data.frame': 21 obs. of  8 variables:

datos_redcap$data:

   record_id        redcap_event_name redcap_data_access_group
1      100-6      initial_visit_arm_1              hospital_11
2      100-6 follow_up_visit_da_arm_1              hospital_11
21    100-13      initial_visit_arm_1              hospital_11
22    100-13 follow_up_visit_da_arm_1              hospital_11
34    100-16      initial_visit_arm_1              hospital_11
35    100-16 follow_up_visit_da_arm_1              hospital_11

datos_redcap$dictionary:

            field_name                   form_name field_type
1            record_id inclusionexclusion_criteria       text
2                inc_1 inclusionexclusion_criteria      radio
3                inc_2 inclusionexclusion_criteria      radio
4                inc_3 inclusionexclusion_criteria      radio
6                exc_1 inclusionexclusion_criteria      radio
10 screening_fail_crit inclusionexclusion_criteria       calc

3.2 rd_transform

3.2.1 Raw data transformation

datos <- rd_transform(
  data = datos_redcap$data, 
  dic = datos_redcap$dictionary
)
data <- datos$data

#To print the results
datos$results
1. Recalculating calculated fields and saving them as '[field_name]_recalc'

| Total calculated fields | Non-transcribed fields | Recalculated different fields |
|:-----------------------:|:----------------------:|:-----------------------------:|
|            2            |         0 (0%)         |            1 (50%)            |


|     field_name      | Transcribed? | Is equal? |
|:-------------------:|:------------:|:---------:|
|         age         |     Yes      |   FALSE   |
| screening_fail_crit |     Yes      |   TRUE    |

2. Transforming checkboxes: changing their values to No/Yes, their names to the names of its options and transforming missing values of those checkboxes having question doors specified in the branching logic

Table: Checkbox variables advisable to be reviewed

| Variables without any branching logic |
|:-------------------------------------:|
|        type_underlying_disease        |

3. Replacing original variables for their factor version
4. Deleting variables that contain some patterns

As you can see, there are 4 steps in the transformation:

  1. Recalculate the autocalculated fields. There were two autocalculated fields; as the summary shows, both could be recalculated, and one of them has changed:

    data %>%
      dplyr::select(d_birth, d_ingreso, age, age_recalc) %>%
      dplyr::filter(age != age_recalc)
         d_birth  d_ingreso age age_recalc
    1 1945-04-16 2020-04-16  74         75

  2. Transform the checkboxes: rename each checkbox variable after its option and relabel its values as No/Yes. Also, if the checkbox has a gatekeeper question in its branching logic, its values are set to missing wherever that logic is not met, as the tables below show:

    #Checkbox without gatekeeper
    
    table(data$type_underlying_disease_haematological_cancer)
    
     No Yes 
    103  87 
    #Checkbox with gatekeeper: [type_underlying_disease(0)]='1'
    
    #In the original data set:
    table(datos_redcap$data$type_underlying_disease___0, datos_redcap$data$underlying_disease_hemato___1)
       
          0   1
      0 103   0
      1  84   3
    #In the transformed data set:
    table(data$type_underlying_disease_haematological_cancer, data$underlying_disease_hemato_acute_myeloid_leukemia)
         
          No Yes
      No   0   0
      Yes 84   3

  3. Replace the original variables with their factor version, except 'redcap_event_name' and 'redcap_data_access_group', which keep both versions, and the checkboxes, which have already been transformed:

    str(data$dm)
     Factor w/ 2 levels "No","Yes": 1 NA 2 NA 2 NA 1 NA 2 NA ...

  4. Delete the variables that contain a pattern. By default, the pattern is '_complete'. In this case there was no '_complete' variable initially.
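
To make the checkbox step concrete, here is a minimal base R sketch of the idea on invented toy data. It is not the code that rd_transform runs; the columns gate___0 and child___1 are hypothetical stand-ins for a gatekeeper checkbox and a checkbox whose branching logic depends on it.

```r
# Toy raw export: an unticked checkbox is stored as 0, even for records
# where the question was never shown.
raw <- data.frame(
  gate___0  = c(1, 1, 0, 0),  # gatekeeper option ticked (1) or not (0)
  child___1 = c(1, 0, 0, 0)   # dependent checkbox option
)

# Relabel 0/1 as No/Yes
child <- factor(raw$child___1, levels = c(0, 1), labels = c("No", "Yes"))

# Where the gatekeeper is not ticked, the question was not asked,
# so "No" is replaced by a missing value.
child[raw$gate___0 != 1] <- NA

table(child, useNA = "ifany")
```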

3.2.2 Data transformation and classification by event

To transform the data by event, we need to provide the path to the file that maps events to forms. Here we use the path where we saved this file after downloading it from the COVICAN project.

datos <- rd_transform(
             data = datos_redcap$data, 
             dic = datos_redcap$dictionary,
             event_path = "files/COVICAN_instruments.csv",
             final_format = "by_event"
             )

#To print the results
datos$results
1. Recalculating calculated fields and saving them as '[field_name]_recalc'

| Total calculated fields | Non-transcribed fields | Recalculated different fields |
|:-----------------------:|:----------------------:|:-----------------------------:|
|            2            |         0 (0%)         |            1 (50%)            |


|     field_name      | Transcribed? | Is equal? |
|:-------------------:|:------------:|:---------:|
|         age         |     Yes      |   FALSE   |
| screening_fail_crit |     Yes      |   TRUE    |

2. Transforming checkboxes: changing their values to No/Yes, their names to the names of its options and transforming missing values of those checkboxes having question doors specified in the branching logic

Table: Checkbox variables advisable to be reviewed

| Variables without any branching logic |
|:-------------------------------------:|
|        type_underlying_disease        |

3. Replacing original variables for their factor version
4. Deleting variables that contain some patterns
5. Erasing variables from forms that are not linked to any event
6. Final arrangment of the data by event

Two steps have now been added: variables from forms that are not linked to any event are removed, and the preprocessed data are split by event, so that the function returns:

datos$data
# A tibble: 2 × 3
  events                   vars       df             
  <chr>                    <list>     <list>         
1 initial_visit_arm_1      <chr [25]> <df [190 × 25]>
2 follow_up_visit_da_arm_1 <chr [8]>  <df [152 × 8]> 
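
Each row of this tibble holds the data frame of one event in the list column df. Below is a minimal sketch of how one such data frame can be pulled out by event name. It uses a toy base R stand-in with the same events and df columns; the values are invented, and a plain data.frame replaces the tibble so that no extra package is needed.

```r
# Toy stand-in for the nested output (invented values)
nested <- data.frame(
  events = c("initial_visit_arm_1", "follow_up_visit_da_arm_1"),
  stringsAsFactors = FALSE
)
nested$df <- I(list(
  data.frame(record_id = c("A", "B"), age = c(70, 58)),
  data.frame(record_id = "A", potassium = 4.2)
))

# Select the data frame that belongs to one event
initial <- nested$df[[which(nested$events == "initial_visit_arm_1")]]
initial
```

With the real output, the same indexing should apply to datos$data$df.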

3.2.3 Data transformation and classification by form

To transform the data by form, we must also provide the path to the file that maps events to forms:

datos <- rd_transform(data = datos_redcap$data, 
             dic = datos_redcap$dictionary,
             event_path = "files/COVICAN_instruments.csv",
             final_format = "by_form"
             )

#To print the results
datos$results
1. Recalculating calculated fields and saving them as '[field_name]_recalc'

| Total calculated fields | Non-transcribed fields | Recalculated different fields |
|:-----------------------:|:----------------------:|:-----------------------------:|
|            2            |         0 (0%)         |            1 (50%)            |


|     field_name      | Transcribed? | Is equal? |
|:-------------------:|:------------:|:---------:|
|         age         |     Yes      |   FALSE   |
| screening_fail_crit |     Yes      |   TRUE    |

2. Transforming checkboxes: changing their values to No/Yes, their names to the names of its options and transforming missing values of those checkboxes having question doors specified in the branching logic

Table: Checkbox variables advisable to be reviewed

| Variables without any branching logic |
|:-------------------------------------:|
|        type_underlying_disease        |

3. Replacing original variables for their factor version
4. Deleting variables that contain some patterns
5. Erasing variables from forms that are not linked to any event
6. Final arrangment of the data by form

As before, a final step is added, which splits the preprocessed data by form, so that the function returns:

datos$data
# A tibble: 6 × 4
  form                        events    vars       df             
  <chr>                       <list>    <list>     <list>         
1 inclusionexclusion_criteria <chr [1]> <chr [11]> <df [190 × 11]>
2 demographics                <chr [1]> <chr [9]>  <df [190 × 9]> 
3 comorbidities               <chr [1]> <chr [10]> <df [190 × 10]>
4 vital_signs                 <chr [2]> <chr [7]>  <df [177 × 7]> 
5 laboratory_findings         <chr [2]> <chr [7]>  <df [177 × 7]> 
6 microbiological_studies     <chr [1]> <chr [6]>  <df [190 × 6]> 

3.2.4 Additional arguments

checkbox_labels: specifies the category labels used for the checkbox variables. The default is No/Yes; here we change it to N/Y.

datos <- rd_transform(
  data = datos_redcap$data, 
  dic = datos_redcap$dictionary,
  checkbox_labels = c("N", "Y")
)

data <- datos$data

We can see that the categories of the checkboxes have changed, for example:

table(data$type_underlying_disease_haematological_cancer)

  N   Y 
103  87 

exclude_to_factor: specifies the name of a variable that should not be transformed to a factor. For example, to keep the numeric version of the variable dm:

datos <- rd_transform(
  data = datos_redcap$data, 
  dic = datos_redcap$dictionary,
  exclude_to_factor = "dm"
)

data <- datos$data

table(data$dm)

  0   1 
140  45 

keep_labels: logical argument indicating whether to keep the variable labels that come from REDCap:

datos <- rd_transform(
  data = datos_redcap$data, 
  dic = datos_redcap$dictionary,
  keep_labels = TRUE
)

data <- datos$data

str(data[,1:5])
'data.frame':   342 obs. of  5 variables:
 $ record_id                      : 'labelled' chr  "100-13" "100-13" "100-16" "100-16" ...
  ..- attr(*, "label")= Named chr ""
  .. ..- attr(*, "names")= chr "record_id"
 $ redcap_event_name              : 'labelled' chr  "initial_visit_arm_1" "follow_up_visit_da_arm_1" "initial_visit_arm_1" "follow_up_visit_da_arm_1" ...
  ..- attr(*, "label")= Named chr "Event Name"
  .. ..- attr(*, "names")= chr "redcap_event_name"
 $ redcap_event_name.factor       : Factor w/ 5 levels "Initial visit",..: 1 2 1 2 1 2 1 2 1 2 ...
  ..- attr(*, "label")= Named chr ""
  .. ..- attr(*, "names")= chr "redcap_data_access_group"
 $ redcap_data_access_group       : 'labelled' chr  "hospital_11" "hospital_11" "hospital_11" "hospital_11" ...
  ..- attr(*, "label")= Named chr "Patients older than 18 years"
  .. ..- attr(*, "names")= chr "inc_1"
 $ redcap_data_access_group.factor: Factor w/ 26 levels "Hospital 1","Hospital 2",..: 11 11 11 11 11 11 11 11 11 11 ...
  ..- attr(*, "label")= Named chr "Cancer patients"
  .. ..- attr(*, "names")= chr "inc_2"

delete_vars: specifies strings; any variable whose name contains one of them is removed from the data set. By default, all variables containing '_complete' are removed. In this case we did not have any '_complete' variables. Suppose, for example, that we want to remove all the inclusion and exclusion criteria:

datos <- rd_transform(
  data = datos_redcap$data, 
  dic = datos_redcap$dictionary,
  delete_vars = c("inc_", "exc_")
)

data <- datos$data

names(data)
 [1] "record_id"                                              
 [2] "redcap_event_name"                                      
 [3] "redcap_event_name.factor"                               
 [4] "redcap_data_access_group"                               
 [5] "redcap_data_access_group.factor"                        
 [6] "screening_fail_crit"                                    
 [7] "screening_fail_crit_recalc"                             
 [8] "d_birth"                                                
 [9] "d_ingreso"                                              
[10] "age"                                                    
[11] "age_recalc"                                             
[12] "dm"                                                     
[13] "type_dm"                                                
[14] "copd"                                                   
[15] "fio2_aportado"                                          
[16] "analitica_disponible"                                   
[17] "potassium"                                              
[18] "resp_freq"                                              
[19] "hemato_neo"                                             
[20] "leukemia"                                               
[21] "type_underlying_disease_haematological_cancer"          
[22] "type_underlying_disease_solid_tumour"                   
[23] "underlying_disease_hemato_acute_myeloid_leukemia"       
[24] "underlying_disease_hemato_myelodysplastic_syndrome"     
[25] "underlying_disease_hemato_chronic_myeloid_leukaemia"    
[26] "underlying_disease_hemato_acute_lymphoblastic_leukaemia"
[27] "underlying_disease_hemato_hodgkin_lymphoma"             
[28] "underlying_disease_hemato_non_hodgkin_lymphoma"         
[29] "underlying_disease_hemato_multiple_myeloma"             
[30] "underlying_disease_hemato_myelofibrosis"                
[31] "underlying_disease_hemato_aplastic_anaemia"             
[32] "urine_culture"                                          

which_event: in the transformation by event, specifies a single event to be returned. For example, if we only want the initial visit:

datos <- rd_transform(
             data = datos_redcap$data, 
             dic = datos_redcap$dictionary,
             event_path = "files/COVICAN_instruments.csv",
             final_format = "by_event",
             which_event = "initial_visit_arm_1"
             )

data <- datos$data

table(data$redcap_event_name)

initial_visit_arm_1 
                190 

which_form: in the transformation by form, specifies a single form to be returned. For example, if we only want the demographics form:

datos <- rd_transform(
             data = datos_redcap$data, 
             dic = datos_redcap$dictionary,
             event_path = "files/COVICAN_instruments.csv",
             final_format = "by_form",
             which_form = "demographics"
             )

data <- datos$data

names(data)
[1] "record_id"                       "redcap_event_name"              
[3] "redcap_data_access_group"        "redcap_event_name.factor"       
[5] "redcap_data_access_group.factor" "d_ingreso"                      
[7] "d_birth"                         "age"                            
[9] "age_recalc"                     

wide: logical argument indicating whether the data set returned by the transformation by form should be in wide format. We do this for the laboratory findings form:

datos <- rd_transform(
             data = datos_redcap$data, 
             dic = datos_redcap$dictionary,
             event_path = "files/COVICAN_instruments.csv",
             final_format = "by_form",
             which_form = "laboratory_findings",
             wide = TRUE
             )

data <- datos$data

head(data)
# A tibble: 6 × 5
  record_id analitica_disponible_1 analitica_disponible_2 potassium_1 potassiu…¹
  <chr>     <fct>                  <fct>                        <dbl>      <dbl>
1 100-13    Yes                    Yes                           3.66       4.1 
2 100-16    Yes                    No                            4.04      NA   
3 100-31    Yes                    <NA>                          4.58      NA   
4 100-34    Yes                    No                            3.48      NA   
5 100-36    Yes                    No                            4.09      NA   
6 100-52    Yes                    Yes                           3.7        7.15
# … with abbreviated variable name ¹​potassium_2
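
For readers unfamiliar with wide format, the following base R sketch reshapes a toy long data set (one row per record and visit) into one row per record, with suffixed columns similar to potassium_1 and potassium_2 above. The data and the visit column are invented, and this is not necessarily how rd_transform builds its output.

```r
# Toy long data: record "B" has no second visit
long <- data.frame(
  record_id = c("A", "A", "B"),
  visit     = c(1, 2, 1),
  potassium = c(3.9, 4.4, 4.1)
)

# One row per record; one potassium column per visit
wide <- reshape(long, idvar = "record_id", timevar = "visit",
                direction = "wide", sep = "_")
wide
```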

3.3 rd_rlogic

This function transforms REDCap logic into logic that can be evaluated in R. It is built into the recalculate function, but it can also be useful on its own. Let's see how it transforms the logic of the autocalculated fields.

#screening failure
rd_rlogic(logic = "if([exc_1]='1' or [inc_1]='0' or [inc_2]='0' or [inc_3]='0',1,0)",
          data = datos_redcap$data)
[1] "ifelse(data$exc_1=='1' | data$inc_1=='0' | data$inc_2=='0' | data$inc_3=='0',1,0)"
#age
rd_rlogic(logic = 'rounddown(datediff([d_birth],[d_ingreso],"y","dmy"),0)',
          data = datos_redcap$data)
[1] "floor(lubridate::time_length(lubridate::interval(data$d_birth,data$d_ingreso), 'year'))"
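
The returned strings refer to a data frame named data, so they can be evaluated with base R once such an object exists. Below is a minimal sketch on invented toy values. Whether the package evaluates the logic in this way internally is an assumption.

```r
# Toy data with the four fields used by the screening failure logic
data <- data.frame(
  exc_1 = c("0", "1", "0"),
  inc_1 = c("1", "1", "0"),
  inc_2 = c("1", "1", "1"),
  inc_3 = c("1", "1", "1")
)

# String in the form returned by rd_rlogic above
r_logic <- "ifelse(data$exc_1=='1' | data$inc_1=='0' | data$inc_2=='0' | data$inc_3=='0',1,0)"

# Parse the string into an R expression and evaluate it
screening_fail <- eval(parse(text = r_logic))
screening_fail
```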

3.4 rd_insert_na

This function sets variables to missing when a given logic is fulfilled. It is useful, for example, for checkboxes that do not have a gatekeeper. Here we set the checkbox without a gatekeeper (type_underlying_disease) to missing when age is less than 65 years.

datos <- rd_transform(
    data = datos_redcap$data, 
    dic = datos_redcap$dictionary
)

data <- datos$data

#Before inserting missings
table(data$type_underlying_disease_haematological_cancer)

 No Yes 
103  87 
data2 <- rd_insert_na(
  data = data,
  filter = rep("age < 65", 2),
  vars = grep("type_underlying_disease", names(data), value = TRUE)
)

#After inserting missings
table(data2$type_underlying_disease_haematological_cancer)

 No Yes 
 65  50 
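
In base R terms, inserting missings under a condition amounts to the following sketch on toy data. The values are invented, and disease is a hypothetical stand-in for one of the checkbox variables. Note that the call above appears to pass one filter per variable in vars, which is why the filter is repeated.

```r
# Toy data (invented values)
toy <- data.frame(
  age     = c(50, 70, 60, 80, NA),
  disease = factor(c("Yes", "No", "No", "Yes", "No"))
)

cond <- toy$age < 65            # NA where age itself is missing
toy$disease[which(cond)] <- NA  # which() skips rows where cond is NA

table(toy$disease, useNA = "ifany")
```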

3.5 rd_query

3.5.1 Output

| Identifier | DAG | Event | Instrument | Field | Repetition | Description | Query | Code |
|:----------:|:---:|:-----:|:----------:|:-----:|:----------:|:-----------:|:-----:|:----:|
| 100-58 | Hospital 11 | Initial visit | Comorbidities | copd | | Chronic pulmonary disease | The value is NA and it should not be missing | 100-58-1 |
| 102-113 | Hospital 24 | Initial visit | Demographics | age | | Age | The value is NA and it should not be missing | 102-113-1 |
| 105-11 | Hospital 5 | Initial visit | Comorbidities | copd | | Chronic pulmonary disease | The value is NA and it should not be missing | 105-11-1 |
| 105-11 | Hospital 5 | Initial visit | Demographics | age | | Age | The value is NA and it should not be missing | 105-11-2 |
| 105-56 | Hospital 5 | Initial visit | Comorbidities | copd | | Chronic pulmonary disease | The value is NA and it should not be missing | 105-56-1 |
| 105-56 | Hospital 5 | Initial visit | Demographics | age | | Age | The value is NA and it should not be missing | 105-56-2 |
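
Judging from the rows above, the Code column is the record identifier followed by a counter that restarts for each record. Below is a minimal base R sketch of how such codes could be built. This is an inference from the output, not the package's own code.

```r
# Toy query list with one row per query
queries <- data.frame(
  Identifier = c("100-58", "105-11", "105-11"),
  Field      = c("copd", "copd", "age")
)

# Counter that restarts within each identifier
counter <- ave(seq_along(queries$Identifier), queries$Identifier,
               FUN = seq_along)

queries$Code <- paste(queries$Identifier, counter, sep = "-")
queries
```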


3.5.2 Missings

Mandatory arguments:

example <- rd_query(variables = c("copd", "age"),
                         expression = c("%in%NA", "%in%NA"),
                         event = "initial_visit_arm_1",
                         dic = datos_redcap$dictionary,
                         data = datos_redcap$data)

# Printing results
example$results
Table: Report of queries

| Variables | Description | Total |
|:---------:|:-----------:|:-----:|
| copd | Chronic pulmonary disease | 6 |
| age | Age | 5 |


Optional arguments:

example <- rd_query(variables = c("age", "copd"),
                    variables_names = c("Age", "Chronic obstructive pulmonary disease"), # optional
                    expression = c("%in%NA", "%in%NA"),
                    query_name = c("Age is missing at baseline visit", "COPD"), # optional
                    instrument = c("Inclusión del paciente", "Inclusión"), # optional
                    event = "initial_visit_arm_1",
                    dic = datos_redcap$dictionary,
                    data = datos_redcap$data)


3.5.3 Missings in a branching logic

Without filter:

example <- rd_query(variables = c("age", "copd", "potassium"),
                          expression = c("%in%NA", "%in%NA", "%in%NA"),
                          event = "initial_visit_arm_1",
                          dic = datos_redcap$dictionary,
                          data=datos_redcap$data)
Warning: Some of the variables that were checked for missings present a branching logic. 
Check the results tab of output for more details (...$results).
# Printing results
example$results
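Table: Report of queries

| Variables | Description | Total | Branching logic |
|:---------:|:-----------:|:-----:|:---------------:|
| age | Age | 5 | - |
| copd | Chronic pulmonary disease | 6 | - |
| potassium | Potassium | 31 | [analitica_disponible]='1' |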


Applying filter:

example <- rd_query(variables = c("potassium"),
                         expression = c("%in%NA"),
                         event = "initial_visit_arm_1",
                         dic = datos_redcap$dictionary,
                         data = datos_redcap$data,
                         filter = c("analitica_disponible=='1'"))
Warning: Some of the variables that were checked for missings present a branching logic. 
Check the results tab of output for more details (...$results).
# Printing results
example$results
Table: Report of queries

| Variables | Description | Total | Branching logic |
|:---------:|:-----------:|:-----:|:---------------:|
| potassium | Potassium | 21 | [analitica_disponible]='1' |
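
The drop from 31 to 21 queries reflects that a missing potassium value only needs to be queried when the branching logic is met. Below is a toy base R sketch of the same idea, with invented values that are not taken from COVICAN.

```r
# Toy data: potassium can only be filled in when analitica_disponible is 1
toy <- data.frame(
  analitica_disponible = c(1, 1, 0, 0, NA),
  potassium            = c(4.1, NA, NA, NA, NA)
)

# Without filter: every missing value counts
n_all <- sum(is.na(toy$potassium))

# With filter: only missing values where the branching logic is met
n_filtered <- sum(is.na(toy$potassium) & toy$analitica_disponible %in% 1)

c(without_filter = n_all, with_filter = n_filtered)
```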


3.5.4 Expressions

Simple:

example <- rd_query(variables=c("age"),
                         expression=c(">20"),
                         event="initial_visit_arm_1",
                         dic=datos_redcap$dictionary,
                         data=datos_redcap$data)

# Printing results
example$results
Table: Report of queries

| Variables | Description | Total |
|:---------:|:-----------:|:-----:|
| age | Age | 185 |


Complex:

example <- rd_query(variables=c("age", "copd"),
                         expression=c("(>20 & <70) | %in%NA", "==1"),
                         event="initial_visit_arm_1",
                         dic=datos_redcap$dictionary,
                         data=datos_redcap$data)

# Printing results
example$results
Table: Report of queries

| Variables | Description | Total |
|:---------:|:-----------:|:-----:|
| age | Age | 108 |
| copd | Chronic pulmonary disease | 21 |
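
In these expressions the variable name is left implicit: ">20" stands for age > 20. One way to picture this is a helper that puts the variable name in front of every operator and then evaluates the result. The helper expand_expr() below is hypothetical and only handles the operators listed in its pattern. It is a sketch, not the parser that rd_query uses.

```r
# Hypothetical helper: prefix each comparison operator with the variable name
expand_expr <- function(var, expr) {
  gsub("(>=|<=|==|!=|>|<|%in%)", paste0(var, " \\1 "), expr)
}

full_expr <- expand_expr("age", "(>20 & <70) | %in%NA")
full_expr

# Evaluate the expanded expression on toy data (invented values)
toy <- data.frame(age = c(10, 30, NA, 80))
flagged <- eval(parse(text = full_expr), envir = toy)
flagged
```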


3.5.5 Special cases

Same expression for all variables:

example <- rd_query(variables = c("copd","age","dm"),
                         expression = c("%in%NA"),
                         event = "initial_visit_arm_1",
                         dic = datos_redcap$dictionary,
                         data = datos_redcap$data)
Warning: There are more variables than expressions, so the same expression was
applied to all variables
# Printing results
example$results
Table: Report of queries

| Variables | Description | Total |
|:---------:|:-----------:|:-----:|
| copd | Chronic pulmonary disease | 6 |
| age | Age | 5 |
| dm | Diabetes (treated with insulin or antidiabetic … | 5 |


Not defining an event:

example <- rd_query(variables = c("copd"),
                         expression = c("%in%NA"),
                         dic = datos_redcap$dictionary,
                         data = datos_redcap$data)
Warning: event = NA, but the dataset presents a variable that indicates the
presence of events, please specify the event.
# Printing results
example$results
Table: Report of queries

| Variables | Description | Total |
|:---------:|:-----------:|:-----:|
| copd | Chronic pulmonary disease | 158 |


3.5.6 Additional arguments

negate: negates the expression, so that the values that do not meet it are reported

example <- rd_query(variables = c("copd"),
                         expression = c("%in%NA"),
                         negate = TRUE,
                         event = "initial_visit_arm_1",
                         dic = datos_redcap$dictionary,
                         data = datos_redcap$data)

# Printing results
example$results
Table: Report of queries

| Variables | Description | Total |
|:---------:|:-----------:|:-----:|
| copd | Chronic pulmonary disease | 184 |


addTo: adds the new queries to those of a previous rd_query output

example2 <- rd_query(variables = c("age"),
                         expression = c("%in%NA"),
                         event = "initial_visit_arm_1",
                         dic = datos_redcap$dictionary,
                         data=datos_redcap$data,
                         addTo = example)

# Printing results
example2$results
Table: Report of queries

| Variables | Description | Total |
|:---------:|:-----------:|:-----:|
| copd | Chronic pulmonary disease | 184 |
| age | Age | 5 |


report_title: customize the queries table title

example <- rd_query(variables = c("copd", "age"),
                         expression = c("%in%NA", "<20"),
                         event = "initial_visit_arm_1",
                         dic = datos_redcap$dictionary,
                         data = datos_redcap$data,
                         report_title = "Missing COPD values in the baseline event")

# Printing results
example$results
Table: Missing COPD values in the baseline event

| Variables | Description | Total |
|:---------:|:-----------:|:-----:|
| copd | Chronic pulmonary disease | 6 |


report_zeros: choose whether variables with zero queries should be reported in the table

example <- rd_query(variables = c("copd", "age"),
                         expression = c("%in%NA", "<20"),
                         event = "initial_visit_arm_1",
                         dic = datos_redcap$dictionary,
                         data = datos_redcap$data,
                         report_zeros = TRUE)

# Printing results
example$results
Table: Report of queries

| Variables | Description | Total |
|:---------:|:-----------:|:-----:|
| copd | Chronic pulmonary disease | 6 |
| age | Age | 0 |


3.6 rd_event

Simple:

example <- rd_event(event = "follow_up_visit_da_arm_1",
                    dic = datos_redcap$dictionary,
                    data = datos_redcap$data)

# Print results
example$results
Table: Report of queries

| Events | Description | Total |
|:------:|:-----------:|:-----:|
| follow_up_visit_da_arm_1 | Follow up visit day 14+/-5d | 38 |


Filter:

example <- rd_event(event = "follow_up_visit_da_arm_1",
                    filter = "screening_fail_crit==0",
                    dic = datos_redcap$dictionary,
                    data = datos_redcap$data)

# Print results
example$results
Table: Report of queries

| Events | Description | Total |
|:------:|:-----------:|:-----:|
| follow_up_visit_da_arm_1 | Follow up visit day 14+/-5d | 34 |


Several events:

example <- rd_event(event = c("initial_visit_arm_1","follow_up_visit_da_arm_1"),
                    filter = "screening_fail_crit==0",
                    dic = datos_redcap$dictionary,
                    data = datos_redcap$data,
                    report_zeros = TRUE)

# Print results
example$results
Table: Report of queries

| Events | Description | Total |
|:------:|:-----------:|:-----:|
| follow_up_visit_da_arm_1 | Follow up visit day 14+/-5d | 34 |
| initial_visit_arm_1 | Initial visit | 0 |
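
Judging from the query text shown in section 3.7 ("The event ... is missing"), rd_event flags the records that have no row for the requested event. Below is a toy base R sketch of that check, with invented identifiers and shortened event names. It is an illustration, not the package's implementation.

```r
# Toy data: record "B" has no follow-up row
toy <- data.frame(
  record_id         = c("A", "A", "B", "C", "C"),
  redcap_event_name = c("initial", "follow_up", "initial",
                        "initial", "follow_up")
)

all_ids    <- unique(toy$record_id)
with_event <- unique(toy$record_id[toy$redcap_event_name == "follow_up"])

# Records for which the follow-up event is missing
missing_event <- setdiff(all_ids, with_event)
missing_event
```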

3.7 check_queries

check <- check_queries(old = example$queries, new = queries_nuevas)

# Print results
check$results
Table: Report of modifications

| State | Total |
|:-----:|:-----:|
| Modified | 18 |
| Unmodified | 16 |
| New | 1 |


Query control output:

| Identifier | DAG | Event | Instrument | Field | Repetition | Description | Query | Code | Modification |
|:----------:|:---:|:-----:|:----------:|:-----:|:----------:|:-----------:|:-----:|:----:|:------------:|
| 101-36 | Hospital 1 | follow_up_visit_da_arm_1 | | | | Follow up visit day 14+/-5d | The event 'Follow up visit day 14+/-5d' is missing. | 101-36-1 | Unmodified |
| 102-6 | Hospital 1 | follow_up_visit_da_arm_1 | | | | Follow up visit day 14+/-5d | The event 'Follow up visit day 14+/-5d' is missing. | 102-6-1 | Unmodified |
| 102-10 | Hospital 1 | follow_up_visit_da_arm_1 | | | | Follow up visit day 14+/-5d | The event 'Follow up visit day 14+/-5d' is missing. | 102-10-1 | Unmodified |
| 102-13 | Hospital 1 | follow_up_visit_da_arm_1 | | | | Follow up visit day 14+/-5d | The event 'Follow up visit day 14+/-5d' is missing. | 102-13-1 | Unmodified |
| 102-23 | Hospital 1 | follow_up_visit_da_arm_1 | | | | Follow up visit day 14+/-5d | The event 'Follow up visit day 14+/-5d' is missing. | 102-23-1 | Unmodified |
| 102-27 | Hospital 1 | follow_up_visit_da_arm_1 | | | | Follow up visit day 14+/-5d | The event 'Follow up visit day 14+/-5d' is missing. | 102-27-1 | Modified |
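
This document does not define the three states. One reading, consistent with the totals above (18 + 16 = 34 previous queries, plus 1 new one), is that a previous query is Unmodified if it still appears in the new set, Modified if it no longer appears, and New if it appears only in the new set. The base R sketch below implements that reading on toy data. Both the interpretation and the matching key (Identifier plus Field) are assumptions, not the documented behaviour of check_queries.

```r
# Toy query sets (invented values)
old <- data.frame(Identifier = c("A", "B", "C"),
                  Field      = c("age", "age", "copd"))
new <- data.frame(Identifier = c("B", "C", "D"),
                  Field      = c("age", "copd", "age"))

# Assumed matching key: identifier plus field
key <- function(q) paste(q$Identifier, q$Field, sep = "|")

# Previous queries: still present in the new set or not
old$Modification <- ifelse(key(old) %in% key(new), "Unmodified", "Modified")

# Queries that appear only in the new set
added <- new[!key(new) %in% key(old), , drop = FALSE]
added$Modification <- rep("New", nrow(added))

control <- rbind(old, added)
table(control$Modification)
```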