The CDMConnector package allows us to work with CDM data in different locations consistently. A cdm_reference may point to tables in a database, files on disk, or tables loaded into R. This allows computation to take place wherever is most convenient.
Here we have a schematic of how CDMConnector can be used to create cdm_references to different locations.
To show how this can work (and slightly overcomplicating things to show different options), let's say we want to create a histogram of patients' age at diagnosis of tear of meniscus of knee (concept_id of “4035415”). We can start in the database and, after loading the required packages, subset the person table to include only those people in the condition_occurrence table with condition_concept_id “4035415”.
library(CDMConnector)
library(dplyr, warn.conflicts = FALSE)
library(ggplot2)
db <- DBI::dbConnect(duckdb::duckdb(), dbdir = eunomia_dir())
cdm <- cdm_from_con(db, cdm_tables = c("person", "condition_occurrence"))
# first filter to only those with condition_concept_id "4035415"
cdm$condition_occurrence %>% tally()
#> # Source: SQL [1 x 1]
#> # Database: DuckDB 0.6.1 [root@Darwin 21.6.0:R 4.2.2//var/folders/xx/01v98b6546ldnm1rg1_bvk000000gn/T//RtmpnQ23cR/hslnivhk]
#> n
#> <dbl>
#> 1 65332
cdm$condition_occurrence <- cdm$condition_occurrence %>%
  filter(condition_concept_id == "4035415") %>%
  select(person_id, condition_start_date)
cdm$condition_occurrence %>% tally()
#> # Source: SQL [1 x 1]
#> # Database: DuckDB 0.6.1 [root@Darwin 21.6.0:R 4.2.2//var/folders/xx/01v98b6546ldnm1rg1_bvk000000gn/T//RtmpnQ23cR/hslnivhk]
#> n
#> <dbl>
#> 1 83
# then left_join person table
cdm$person %>% tally()
#> # Source: SQL [1 x 1]
#> # Database: DuckDB 0.6.1 [root@Darwin 21.6.0:R 4.2.2//var/folders/xx/01v98b6546ldnm1rg1_bvk000000gn/T//RtmpnQ23cR/hslnivhk]
#> n
#> <dbl>
#> 1 2694
cdm$person <- cdm$condition_occurrence %>%
  select(person_id) %>%
  left_join(select(cdm$person, person_id, year_of_birth), by = "person_id")
cdm$person %>% tally()
#> # Source: SQL [1 x 1]
#> # Database: DuckDB 0.6.1 [root@Darwin 21.6.0:R 4.2.2//var/folders/xx/01v98b6546ldnm1rg1_bvk000000gn/T//RtmpnQ23cR/hslnivhk]
#> n
#> <dbl>
#> 1 83
We can save these tables to file
dOut <- tempfile()
dir.create(dOut)
CDMConnector::stow(cdm, dOut)
And now we can create a cdm_reference to the files
cdm_arrow <- cdm_from_files(dOut, as_data_frame = FALSE)
cdm_arrow$person %>%
  nrow()
#> [1] 83
cdm_arrow$condition_occurrence %>%
  nrow()
#> [1] 83
And create an age at diagnosis variable
cdm_arrow$result <- cdm_arrow$person %>%
  left_join(cdm_arrow$condition_occurrence, by = "person_id") %>%
  mutate(age_diag = year(condition_start_date) - year_of_birth)
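The age calculation above subtracts the year of birth from the calendar year of diagnosis, so it is an approximate age rather than an exact one. As a minimal sketch of the same arithmetic on an in-memory tibble, the example below uses two rows copied from the printed output further down and extracts the year with base R's `format()`. That is a stand-in chosen here to avoid extra dependencies, not the call used in the walkthrough. The name `toy` is hypothetical.

```r
library(dplyr, warn.conflicts = FALSE)

# Two rows taken from the printed result in this walkthrough
toy <- tibble(
  person_id = c(430, 458),
  year_of_birth = c(1931, 1972),
  condition_start_date = as.Date(c("1997-12-05", "2010-06-25"))
)

# Year of diagnosis minus year of birth (approximate age, ignores month and day)
toy <- toy %>%
  mutate(age_diag = as.integer(format(condition_start_date, "%Y")) - year_of_birth)

toy$age_diag
```

Because only `year_of_birth` is used, the value can differ by one year from the patient's true age on the diagnosis date.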
We can then bring in this result to R and make the histogram
result <- cdm_arrow$result %>%
  collect()
str(result)
#> tibble [85 × 4] (S3: tbl_df/tbl/data.frame)
#> $ person_id : num [1:85] 430 458 372 452 165 459 99 66 161 145 ...
#> $ year_of_birth : num [1:85] 1931 1972 1961 1962 1968 ...
#> $ condition_start_date: Date[1:85], format: "1997-12-05" "2010-06-25" ...
#> $ age_diag : num [1:85] 66 38 57 51 45 14 10 58 39 25 ...
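The collected result has 85 rows, although both tables that were joined had 83. One plausible explanation, which the output here does not confirm, is that a person with more than one diagnosis record appears more than once in both tables, so the join multiplies those rows. The sketch below reproduces the effect on hypothetical toy data (`person`, `condition_occurrence`, and `person_subset` are illustrative in-memory tibbles, not the CDM tables) and shows that `distinct()` before the join avoids it.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy data: person 2 has two diagnosis records
condition_occurrence <- tibble(
  person_id = c(1, 2, 2),
  condition_start_date = as.Date(c("2001-03-01", "2005-07-15", "2009-01-20"))
)
person <- tibble(
  person_id = c(1, 2, 3),
  year_of_birth = c(1960, 1975, 1980)
)

# Same pattern as the walkthrough: person_id values come from the
# condition table, so person 2 is carried over twice
person_subset <- condition_occurrence %>%
  select(person_id) %>%
  left_join(person, by = "person_id")

# Joining back: each of the two person-2 rows matches two condition rows
# (recent dplyr versions warn about a many-to-many relationship here)
duplicated_join <- person_subset %>%
  left_join(condition_occurrence, by = "person_id")

# Keeping one row per person first gives one row per diagnosis record
deduplicated_join <- person_subset %>%
  distinct() %>%
  left_join(condition_occurrence, by = "person_id")

nrow(duplicated_join)
nrow(deduplicated_join)
```

Whether duplicates should be kept depends on the question being asked: for age at each diagnosis, one row per diagnosis record is usually what is wanted.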
result %>%
  ggplot(aes(age_diag)) +
  geom_histogram()
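Called without arguments, `geom_histogram()` falls back to 30 bins and prints a message suggesting a better value. As a minimal sketch, the example below sets an explicit bin width on a small in-memory tibble built from the first ten `age_diag` values printed above. The name `ages` and the ten-year bin width are illustrative choices, not part of the walkthrough.

```r
library(ggplot2)

# First ten age_diag values from the printed result above
ages <- data.frame(age_diag = c(66, 38, 57, 51, 45, 14, 10, 58, 39, 25))

# Ten-year bins aligned to multiples of ten (illustrative choice)
p <- ggplot(ages, aes(age_diag)) +
  geom_histogram(binwidth = 10, boundary = 0) +
  labs(x = "Age at diagnosis (years)", y = "Number of records")

p
```

A fixed `binwidth` keeps the bins comparable if the plot is regenerated on a different subset of patients.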