The REDCapDM package will first allow us to read both data exported directly from REDCap and through an API connection. Next, it will allow us to create reports of queries such as outliers or missing values, and to track them. Finally, we will be able to preprocess previously downloaded data.
Functions included in the package:
redcap_data: read data.
rd_query: identification of queries.
rd_event: identification of events that are ‘missing’.
check_queries: control queries.
rd_transform: data pre-processing.
rd_rlogic: transformation of REDCap logic into R logic.
rd_insert_na: manual entry of a ‘missing’ in specific variables using a filter.
Exported data from REDCap:
<- redcap_data(data_path="C:/Users/username/example.r",
datos dic_path="C:/Users/username/example_dictionary.csv")
Data using API:
<- redcap_data(uri ="https://redcap.idibell.cat/api/",
datos_api token = "55E5C3D1E83213ADA2182A4BFDEA")
For the examples we will use COVICAN dataset that is already included
in the package. For more information use ?covican
.
<- covican datos_redcap
List of 2
$ data :'data.frame': 342 obs. of 56 variables:
$ dictionary:'data.frame': 21 obs. of 8 variables:
datos_redcap$data
:
record_id redcap_event_name redcap_data_access_group
1 100-6 initial_visit_arm_1 hospital_11
2 100-6 follow_up_visit_da_arm_1 hospital_11
21 100-13 initial_visit_arm_1 hospital_11
22 100-13 follow_up_visit_da_arm_1 hospital_11
34 100-16 initial_visit_arm_1 hospital_11
35 100-16 follow_up_visit_da_arm_1 hospital_11
datos_redcap$dictionary
:
field_name form_name field_type
1 record_id inclusionexclusion_criteria text
2 inc_1 inclusionexclusion_criteria radio
3 inc_2 inclusionexclusion_criteria radio
4 inc_3 inclusionexclusion_criteria radio
6 exc_1 inclusionexclusion_criteria radio
10 screening_fail_crit inclusionexclusion_criteria calc
<- rd_transform(
datos data = datos_redcap$data,
dic = datos_redcap$dictionary
)<- datos$data
data
#To print the results
$results datos
1. Recalculating calculated fields and saving them as '[field_name]_recalc'
| Total calculated fields | Non-transcribed fields | Recalculated different fields |
|:-----------------------:|:----------------------:|:-----------------------------:|
| 2 | 0 (0%) | 1 (50%) |
| field_name | Transcribed? | Is equal? |
|:-------------------:|:------------:|:---------:|
| age | Yes | FALSE |
| screening_fail_crit | Yes | TRUE |
2. Transforming checkboxes: changing their values to No/Yes, their names to the names of its options and transforming missing values of those checkboxes having question doors specified in the branching logic
Table: Checkbox variables advisable to be reviewed
| Variables without any branching logic |
|:-------------------------------------:|
| type_underlying_disease |
3. Replacing original variables for their factor version
4. Deleting variables that contain some patterns
As you can see, there are 4 steps in the transformation:
There were two autocalculated fields, as we see in the summary we have been able to recalculate both of them and there is one that has changed:
%>%
data ::select(d_birth, d_ingreso, age, age_recalc) %>%
dplyr::filter(age != age_recalc) dplyr
d_birth d_ingreso age age_recalc
1 1945-04-16 2020-04-16 74 75
#Checkbox no gatekeeper
table(data$type_underlying_disease_haematological_cancer)
No Yes
103 87
#Checkbox with gatekeeper: [type_underlying_disease(0)]='1'
#In the original data set:
table(datos_redcap$data$type_underlying_disease___0, datos_redcap$data$underlying_disease_hemato___1)
0 1
0 103 0
1 84 3
#In the transformed data set:
table(data$type_underlying_disease_haematological_cancer, data$underlying_disease_hemato_acute_myeloid_leukemia)
No Yes
No 0 0
Yes 84 3
str(data$dm)
Factor w/ 2 levels "No","Yes": 1 NA 2 NA 2 NA 1 NA 2 NA ...
In order to perform the transformation by event we have to add the path where the file with the correspondence between events and forms is located. In this case, we will use the path where we have saved the file of this correspondence that we have downloaded from the COVICAN project.
<- rd_transform(
datos data = datos_redcap$data,
dic = datos_redcap$dictionary,
event_path = "files/COVICAN_instruments.csv",
final_format = "by_event"
)
#To print the results
$results datos
1. Recalculating calculated fields and saving them as '[field_name]_recalc'
| Total calculated fields | Non-transcribed fields | Recalculated different fields |
|:-----------------------:|:----------------------:|:-----------------------------:|
| 2 | 0 (0%) | 1 (50%) |
| field_name | Transcribed? | Is equal? |
|:-------------------:|:------------:|:---------:|
| age | Yes | FALSE |
| screening_fail_crit | Yes | TRUE |
2. Transforming checkboxes: changing their values to No/Yes, their names to the names of its options and transforming missing values of those checkboxes having question doors specified in the branching logic
Table: Checkbox variables advisable to be reviewed
| Variables without any branching logic |
|:-------------------------------------:|
| type_underlying_disease |
3. Replacing original variables for their factor version
4. Deleting variables that contain some patterns
5. Erasing variables from forms that are not linked to any event
6. Final arrangment of the data by event
Now a step in the transformation has been added, which is to split the preprocessed data for each of the events in the study so that the function returns
$data datos
# A tibble: 2 × 3
events vars df
<chr> <list> <list>
1 initial_visit_arm_1 <chr [25]> <df [190 × 25]>
2 follow_up_visit_da_arm_1 <chr [8]> <df [152 × 8]>
To perform the transformation by form we must also add the path of the file of the events and forms
<- rd_transform(data = datos_redcap$data,
datos dic = datos_redcap$dictionary,
event_path = "files/COVICAN_instruments.csv",
final_format = "by_form"
)
#To print the results
$results datos
1. Recalculating calculated fields and saving them as '[field_name]_recalc'
| Total calculated fields | Non-transcribed fields | Recalculated different fields |
|:-----------------------:|:----------------------:|:-----------------------------:|
| 2 | 0 (0%) | 1 (50%) |
| field_name | Transcribed? | Is equal? |
|:-------------------:|:------------:|:---------:|
| age | Yes | FALSE |
| screening_fail_crit | Yes | TRUE |
2. Transforming checkboxes: changing their values to No/Yes, their names to the names of its options and transforming missing values of those checkboxes having question doors specified in the branching logic
Table: Checkbox variables advisable to be reviewed
| Variables without any branching logic |
|:-------------------------------------:|
| type_underlying_disease |
3. Replacing original variables for their factor version
4. Deleting variables that contain some patterns
5. Erasing variables from forms that are not linked to any event
6. Final arrangment of the data by form
As before, a final step is added, which consists of splitting the preprocessed data for each form of the study so that the function returns
$data datos
# A tibble: 6 × 4
form events vars df
<chr> <list> <list> <list>
1 inclusionexclusion_criteria <chr [1]> <chr [11]> <df [190 × 11]>
2 demographics <chr [1]> <chr [9]> <df [190 × 9]>
3 comorbidities <chr [1]> <chr [10]> <df [190 × 10]>
4 vital_signs <chr [2]> <chr [7]> <df [177 × 7]>
5 laboratory_findings <chr [2]> <chr [7]> <df [177 × 7]>
6 microbiological_studies <chr [1]> <chr [6]> <df [190 × 6]>
checkbox_labels: specify the name of the categories that will have the checkbox variables. Default is No/Yes, we can change it to N/Y.
<- rd_transform(
datos data = datos_redcap$data,
dic = datos_redcap$dictionary,
checkbox_labels = c("N", "Y")
)
<- datos$data data
We see how the categories of the following checkboxes have changed, for example
table(data$type_underlying_disease_haematological_cancer)
N Y
103 87
exclude_to_factor: specify the name of a variable that we do not want to be transformed to a factor. For example, if we want the variable dm to keep its numeric version
<- rd_transform(
datos data = datos_redcap$data,
dic = datos_redcap$dictionary,
exclude_to_factor = "dm"
)
<- datos$data
data
table(data$dm)
0 1
140 45
keep_labels: logical argument to retain data set tags that are processed from REDCap
<- rd_transform(
datos data = datos_redcap$data,
dic = datos_redcap$dictionary,
keep_labels = TRUE
)
<- datos$data
data
str(data[,1:5])
'data.frame': 342 obs. of 5 variables:
$ record_id : 'labelled' chr "100-13" "100-13" "100-16" "100-16" ...
..- attr(*, "label")= Named chr ""
.. ..- attr(*, "names")= chr "record_id"
$ redcap_event_name : 'labelled' chr "initial_visit_arm_1" "follow_up_visit_da_arm_1" "initial_visit_arm_1" "follow_up_visit_da_arm_1" ...
..- attr(*, "label")= Named chr "Event Name"
.. ..- attr(*, "names")= chr "redcap_event_name"
$ redcap_event_name.factor : Factor w/ 5 levels "Initial visit",..: 1 2 1 2 1 2 1 2 1 2 ...
..- attr(*, "label")= Named chr ""
.. ..- attr(*, "names")= chr "redcap_data_access_group"
$ redcap_data_access_group : 'labelled' chr "hospital_11" "hospital_11" "hospital_11" "hospital_11" ...
..- attr(*, "label")= Named chr "Patients older than 18 years"
.. ..- attr(*, "names")= chr "inc_1"
$ redcap_data_access_group.factor: Factor w/ 26 levels "Hospital 1","Hospital 2",..: 11 11 11 11 11 11 11 11 11 11 ...
..- attr(*, "label")= Named chr "Cancer patients"
.. ..- attr(*, "names")= chr "inc_2"
delete_vars: specify strings whereby any variables containing them will be removed from the data set. By default all variables containing ’_complete’ are removed. In this case we did not have any complete variables. We want for example to remove all inclusion and exclusion criteria
<- rd_transform(
datos data = datos_redcap$data,
dic = datos_redcap$dictionary,
delete_vars = c("inc_", "exc_")
)
<- datos$data
data
names(data)
[1] "record_id"
[2] "redcap_event_name"
[3] "redcap_event_name.factor"
[4] "redcap_data_access_group"
[5] "redcap_data_access_group.factor"
[6] "screening_fail_crit"
[7] "screening_fail_crit_recalc"
[8] "d_birth"
[9] "d_ingreso"
[10] "age"
[11] "age_recalc"
[12] "dm"
[13] "type_dm"
[14] "copd"
[15] "fio2_aportado"
[16] "analitica_disponible"
[17] "potassium"
[18] "resp_freq"
[19] "hemato_neo"
[20] "leukemia"
[21] "type_underlying_disease_haematological_cancer"
[22] "type_underlying_disease_solid_tumour"
[23] "underlying_disease_hemato_acute_myeloid_leukemia"
[24] "underlying_disease_hemato_myelodysplastic_syndrome"
[25] "underlying_disease_hemato_chronic_myeloid_leukaemia"
[26] "underlying_disease_hemato_acute_lymphoblastic_leukaemia"
[27] "underlying_disease_hemato_hodgkin_lymphoma"
[28] "underlying_disease_hemato_non_hodgkin_lymphoma"
[29] "underlying_disease_hemato_multiple_myeloma"
[30] "underlying_disease_hemato_myelofibrosis"
[31] "underlying_disease_hemato_aplastic_anaemia"
[32] "urine_culture"
which_event: in the event-driven transformation, specify whether you only want a specific event to be returned. For example, we only want the first visit
<- rd_transform(
datos data = datos_redcap$data,
dic = datos_redcap$dictionary,
event_path = "files/COVICAN_instruments.csv",
final_format = "by_event",
which_event = "initial_visit_arm_1"
)
<- datos$data
data
table(data$redcap_event_name)
initial_visit_arm_1
190
which_form: in the transformation by form, specify if you only want a specific form to be returned. For example, we only want the form demographics
<- rd_transform(
datos data = datos_redcap$data,
dic = datos_redcap$dictionary,
event_path = "files/COVICAN_instruments.csv",
final_format = "by_form",
which_form = "demographics"
)
<- datos$data
data
names(data)
[1] "record_id" "redcap_event_name"
[3] "redcap_data_access_group" "redcap_event_name.factor"
[5] "redcap_data_access_group.factor" "d_ingreso"
[7] "d_birth" "age"
[9] "age_recalc"
wide: so that the data set that is returned in the transformation by form is in wide format or not. We do this for the laboratory findings form
<- rd_transform(
datos data = datos_redcap$data,
dic = datos_redcap$dictionary,
event_path = "files/COVICAN_instruments.csv",
final_format = "by_form",
which_form = "laboratory_findings",
wide = TRUE
)
<- datos$data
data
head(data)
# A tibble: 6 × 5
record_id analitica_disponible_1 analitica_disponible_2 potassium_1 potassiu…¹
<chr> <fct> <fct> <dbl> <dbl>
1 100-13 Yes Yes 3.66 4.1
2 100-16 Yes No 4.04 NA
3 100-31 Yes <NA> 4.58 NA
4 100-34 Yes No 3.48 NA
5 100-36 Yes No 4.09 NA
6 100-52 Yes Yes 3.7 7.15
# … with abbreviated variable name ¹potassium_2
It transforms the REDCap logic into logic that can be evaluated in R. This function is built into the recalculate function, but it may be useful to use it separately. Let’s see how it transforms the logics of the autocalculated field calculations.
#screening failure
rd_rlogic(logic = "if([exc_1]='1' or [inc_1]='0' or [inc_2]='0' or [inc_3]='0',1,0)",
data = datos_redcap$data)
[1] "ifelse(data$exc_1=='1' | data$inc_1=='0' | data$inc_2=='0' | data$inc_3=='0',1,0)"
#age
rd_rlogic(logic = 'rounddown(datediff([d_birth],[d_ingreso],"y","dmy"),0)',
data = datos_redcap$data)
[1] "floor(lubridate::time_length(lubridate::interval(data$d_birth,data$d_ingreso), 'year'))"
Function to set missing variables when a certain logic is fulfilled. Useful for example in the checkboxes that we do not have a gatekeeper. For example, we put the checkbox without gatekeeper (type_underlying_disease) to missing when the age is less than 65 years old.
<- rd_transform(
datos data = datos_redcap$data,
dic = datos_redcap$dictionary
)
<- datos$data
data
#Before inserting missings
table(data$type_underlying_disease_haematological_cancer)
No Yes
103 87
<- rd_insert_na(
data2 data = data,
filter = rep("age < 65", 2),
vars = grep("type_underlying_disease", names(data), value = TRUE)
)
#After inserting missings
table(data2$type_underlying_disease_haematological_cancer)
No Yes
65 50
Identifier | DAG | Event | Instrument | Field | Repetition | Description | Query | Code |
---|---|---|---|---|---|---|---|---|
100-58 | Hospital 11 | Initial visit | Comorbidities | copd |
|
Chronic pulmonary disease | The value is NA and it should not be missing | 100-58-1 |
102-113 | Hospital 24 | Initial visit | Demographics | age |
|
Age | The value is NA and it should not be missing | 102-113-1 |
105-11 | Hospital 5 | Initial visit | Comorbidities | copd |
|
Chronic pulmonary disease | The value is NA and it should not be missing | 105-11-1 |
105-11 | Hospital 5 | Initial visit | Demographics | age |
|
Age | The value is NA and it should not be missing | 105-11-2 |
105-56 | Hospital 5 | Initial visit | Comorbidities | copd |
|
Chronic pulmonary disease | The value is NA and it should not be missing | 105-56-1 |
105-56 | Hospital 5 | Initial visit | Demographics | age |
|
Age | The value is NA and it should not be missing | 105-56-2 |
Mandatory arguments:
<- rd_query(variables = c("copd", "age"),
example expression = c("%in%NA", "%in%NA"),
event = "initial_visit_arm_1",
dic = datos_redcap$dictionary,
data = datos_redcap$data)
# Printing results
$results example
Variables | Description | Total |
---|---|---|
copd | Chronic pulmonary disease | 6 |
age | Age | 5 |
Optional arguments:
<- rd_query(variables = c("age", "copd"),
examplevariables_names = c("Age", "Chronic obstructive pulmonary disease"),#### OPCIONAL
expression = c("%in%NA", "%in%NA"),
query_name = c("Age is missing at baseline visit", "COPD"), #### OPCIONAL
instrument = c("Inclusión del paciente","Inclusión"), #### OPCIONAL
event = "initial_visit_arm_1",
dic = datos_redcap$dictionary,
data = datos_redcap$data)
Sin filtro:
<- rd_query(variables = c("age", "copd", "potassium"),
example expression = c("%in%NA", "%in%NA", "%in%NA"),
event = "initial_visit_arm_1",
dic = datos_redcap$dictionary,
data=datos_redcap$data)
Warning: Some of the variables that were checked for missings present a branching logic.
Check the results tab of output for more details (...$results).
# Printing results
$results example
Variables | Description | Total | Branching logic |
---|---|---|---|
age | Age | 5 | - |
copd | Chronic pulmonary disease | 6 | - |
potassium | Potassium | 31 | [analitica_disponible]=‘1’ |
Applying filter:
<- rd_query(variables = c("potassium"),
example expression = c("%in%NA"),
event = "initial_visit_arm_1",
dic = datos_redcap$dictionary,
data = datos_redcap$data,
filter = c("analitica_disponible=='1'"))
Warning: Some of the variables that were checked for missings present a branching logic.
Check the results tab of output for more details (...$results).
# Printing results
$results example
Variables | Description | Total | Branching logic |
---|---|---|---|
potassium | Potassium | 21 | [analitica_disponible]=‘1’ |
Simple:
<- rd_query(variables=c("age"),
example expression=c(">20"),
event="initial_visit_arm_1",
dic=datos_redcap$dictionary,
data=datos_redcap$data)
# Printing results
$results example
Variables | Description | Total |
---|---|---|
age | Age | 185 |
Complex:
<- rd_query(variables=c("age", "copd"),
example expression=c("(>20 & <70) | %in%NA", "==1"),
event="initial_visit_arm_1",
dic=datos_redcap$dictionary,
data=datos_redcap$data)
# Printing results
$results example
Variables | Description | Total |
---|---|---|
age | Age | 108 |
copd | Chronic pulmonary disease | 21 |
Same expression for all variables:
<- rd_query(variables = c("copd","age","dm"),
example expression = c("%in%NA"),
event = "initial_visit_arm_1",
dic = datos_redcap$dictionary,
data = datos_redcap$data)
Warning: There are more variables than expressions, so the same expression was
applied to all variables
# Printing results
$results example
Variables | Description | Total |
---|---|---|
copd | Chronic pulmonary disease | 6 |
age | Age | 5 |
dm | Diabetes (treated with insulin or antidiabetic … | 5 |
Not defining an event:
<- rd_query(variables = c("copd"),
example expression = c("%in%NA"),
dic = datos_redcap$dictionary,
data = datos_redcap$data)
Warning: event = NA, but the dataset presents a variable that indicates the
presence of events, please specify the event.
# Printing results
$results example
Variables | Description | Total |
---|---|---|
copd | Chronic pulmonary disease | 158 |
negate: negation of expression used
<- rd_query(variables = c("copd"),
example expression = c("%in%NA"),
negate = TRUE,
event = "initial_visit_arm_1",
dic = datos_redcap$dictionary,
data = datos_redcap$data)
# Printing results
$results example
Variables | Description | Total |
---|---|---|
copd | Chronic pulmonary disease | 184 |
addTo: join queries to an existing data frame
<- rd_query(variables = c("age"),
example2 expression = c("%in%NA"),
event = "initial_visit_arm_1",
dic = datos_redcap$dictionary,
data=datos_redcap$data,
addTo = example)
# Printing results
$results example2
Variables | Description | Total |
---|---|---|
copd | Chronic pulmonary disease | 184 |
age | Age | 5 |
report_title: customize the queries table title
<- rd_query(variables = c("copd", "age"),
example expression = c("%in%NA", "<20"),
event = "initial_visit_arm_1",
dic = datos_redcap$dictionary,
data = datos_redcap$data,
report_title = "Missing COPD values in the baseline event")
# Printing results
$results example
Variables | Description | Total |
---|---|---|
copd | Chronic pulmonary disease | 6 |
report_zeros: choose whether variables with zero queries should be reported in the table
<- rd_query(variables = c("copd", "age"),
example expression = c("%in%NA", "<20"),
event = "initial_visit_arm_1",
dic = datos_redcap$dictionary,
data = datos_redcap$data,
report_zeros = TRUE)
# Printing results
$results example
Variables | Description | Total |
---|---|---|
copd | Chronic pulmonary disease | 6 |
age | Age | 0 |
Simple:
<- rd_event(event = "follow_up_visit_da_arm_1",
example dic = datos_redcap$dictionary,
data = datos_redcap$data)
# Print results
$results example
Events | Description | Total |
---|---|---|
follow_up_visit_da_arm_1 | Follow up visit day 14+/-5d | 38 |
Filter:
<- rd_event(event = "follow_up_visit_da_arm_1",
example filter = "screening_fail_crit==0",
dic = datos_redcap$dictionary,
data = datos_redcap$data)
# Print results
$results example
Events | Description | Total |
---|---|---|
follow_up_visit_da_arm_1 | Follow up visit day 14+/-5d | 34 |
Several events:
<- rd_event(event = c("initial_visit_arm_1","follow_up_visit_da_arm_1"),
example filter = "screening_fail_crit==0",
dic = datos_redcap$dictionary,
data = datos_redcap$data,
report_zeros = TRUE)
# Print results
$results example
Events | Description | Total |
---|---|---|
follow_up_visit_da_arm_1 | Follow up visit day 14+/-5d | 34 |
initial_visit_arm_1 | Initial visit | 0 |
<- check_queries(old = example$queries, new = queries_nuevas)
check
# Print results
$results check
State | Total |
---|---|
Modified | 18 |
Unmodified | 16 |
New | 1 |
Query control output:
Identifier | DAG | Event | Instrument | Field | Repetition | Description | Query | Code | Modification |
---|---|---|---|---|---|---|---|---|---|
101-36 | Hospital 1 | follow_up_visit_da_arm_1 |
|
|
|
Follow up visit day 14+/-5d | The event ‘Follow up visit day 14+/-5d’ is missing. | 101-36-1 | Unmodified |
102-6 | Hospital 1 | follow_up_visit_da_arm_1 |
|
|
|
Follow up visit day 14+/-5d | The event ‘Follow up visit day 14+/-5d’ is missing. | 102-6-1 | Unmodified |
102-10 | Hospital 1 | follow_up_visit_da_arm_1 |
|
|
|
Follow up visit day 14+/-5d | The event ‘Follow up visit day 14+/-5d’ is missing. | 102-10-1 | Unmodified |
102-13 | Hospital 1 | follow_up_visit_da_arm_1 |
|
|
|
Follow up visit day 14+/-5d | The event ‘Follow up visit day 14+/-5d’ is missing. | 102-13-1 | Unmodified |
102-23 | Hospital 1 | follow_up_visit_da_arm_1 |
|
|
|
Follow up visit day 14+/-5d | The event ‘Follow up visit day 14+/-5d’ is missing. | 102-23-1 | Unmodified |
102-27 | Hospital 1 | follow_up_visit_da_arm_1 |
|
|
|
Follow up visit day 14+/-5d | The event ‘Follow up visit day 14+/-5d’ is missing. | 102-27-1 | Modified |