Tplyr Metadata

Tplyr has a bit of a unique design, which might feel a bit weird as you get used to the package. The process flow of building a tplyr_table() object first, and then using build() to construct the data frame is different than programming in the tidyverse, or creating a ggplot. Why create the tplyr_table() object first? Why is the tplyr_table() object different than the resulting data frame?

The purpose of the tplyr_table() object is to let Tplyr do more than just summarize data. As you build the table, all of the metadata around the table being built is maintained - the target variables being summarized, the grouped variables by row and column, the filter conditions necessary applied to the table and each layer. As a user, you provide this information to create the summary. But what about after the results are produced? Summarizing data inevitably leads to new questions. Within clinical summaries, you may want to know which subjects experienced an adverse event, or why the lab summaries of a particular visit’s descriptive statistics are abnormal. Normally, you’d write a query to recreate the data that lead to that particular summary. Tplyr now allows you to immediately extract the input data or metadata that created an output result, thus providing traceability from the result back to the source.

Generating the Metadata

Consider the following example:

t <- tplyr_table(adsl, TRT01P, where = SAFFL == "Y") %>% 
  add_layer(
    group_count(RACE)
  ) %>% 
  add_layer(
    group_desc(AGE, where = EFFFL == "Y")
  )

dat <- t %>% build(metadata=TRUE)

kable(dat)
row_id row_label1 var1_Placebo var1_Xanomeline High Dose var1_Xanomeline Low Dose ord_layer_index ord_layer_1
c1_1 AMERICAN INDIAN OR ALASKA NATIVE 0 ( 0.0%) 1 ( 1.2%) 0 ( 0.0%) 1 1
c2_1 BLACK OR AFRICAN AMERICAN 8 ( 9.3%) 9 ( 10.7%) 6 ( 7.1%) 1 2
c3_1 WHITE 78 ( 90.7%) 74 ( 88.1%) 78 ( 92.9%) 1 3
d1_2 n 79 74 81 2 1
d2_2 Mean (SD) 75.9 ( 8.43) 75.4 ( 7.87) 77.8 ( 8.02) 2 2
d3_2 Median 76.0 75.5 78.0 2 3
d4_2 Q1, Q3 69.5, 81.0 70.2, 79.0 71.0, 82.0 2 4
d5_2 Min, Max 52, 88 56, 88 51, 88 2 5
d6_2 Missing 0 0 0 2 6

To trigger the creation of metadata, the build() function has a new argument metadata. By specifying TRUE, the underlying metadata within Tplyr are prepared in an extractable format. This is the only action a user needs to specify for this action to take place.

When the metadata argument is used, a new column will be produced in the output dataframe called row_id. The row_id variable provides a persistent reference to a row of interest, even if the output dataframe is sorted. If you review vignette("styled-table"), note that we expect a certain amount of post processing and styling of the built data frame from Tplyr, to let you use whatever other packages you prefer. As such, this reference ID is necessary.

Extracting The Input Source

So, let’s cut to the chase. The most likely way you would use this metadata is to pull out the source data that created a cell. For this, we’ve provided the function get_meta_subset(). The only information that you need is the row_id and column name of the result cell of interest. For example, looking at the result above, what if we want to know who the 8 subjects in the Placebo group who where Black or African American:

get_meta_subset(t, 'c2_1', 'var1_Placebo') %>% 
  kable()
USUBJID TRT01P SAFFL RACE
01-701-1203 Placebo Y BLACK OR AFRICAN AMERICAN
01-701-1363 Placebo Y BLACK OR AFRICAN AMERICAN
01-705-1282 Placebo Y BLACK OR AFRICAN AMERICAN
01-706-1041 Placebo Y BLACK OR AFRICAN AMERICAN
01-708-1286 Placebo Y BLACK OR AFRICAN AMERICAN
01-708-1296 Placebo Y BLACK OR AFRICAN AMERICAN
01-708-1378 Placebo Y BLACK OR AFRICAN AMERICAN
01-711-1036 Placebo Y BLACK OR AFRICAN AMERICAN

By using the row_id and column, the dataframe is pulled right out for us. Notice that USUBJID was included by default, even though Tplyr there’s no reference anywhere in the tplyr_table() to the variable USUBJID. This is because get_meta_subset() has an additional argument add_cols that allows you to specify additional columns you want included in the resulting dataframe, and has a default of USUBJID. So let’s say we want additionally include the variable SEX.

get_meta_subset(t, 'c2_1', 'var1_Placebo', add_cols = vars(USUBJID, SEX)) %>% 
  kable()
USUBJID SEX TRT01P SAFFL RACE
01-701-1203 F Placebo Y BLACK OR AFRICAN AMERICAN
01-701-1363 F Placebo Y BLACK OR AFRICAN AMERICAN
01-705-1282 F Placebo Y BLACK OR AFRICAN AMERICAN
01-706-1041 F Placebo Y BLACK OR AFRICAN AMERICAN
01-708-1286 F Placebo Y BLACK OR AFRICAN AMERICAN
01-708-1296 M Placebo Y BLACK OR AFRICAN AMERICAN
01-708-1378 M Placebo Y BLACK OR AFRICAN AMERICAN
01-711-1036 M Placebo Y BLACK OR AFRICAN AMERICAN

Variables should be provided using dplyr::vars(), just like the cols argument on tplyr_table() and the by arguments in each layer type.

As mentioned, the input source data can be extracted for any result cell created by Tplyr. So let’s say we want to know the subjects relevant for the descriptive statistics around age in the Xanomeline High Dose group:

get_meta_subset(t, 'd1_2', 'var1_Xanomeline High Dose') %>% 
  head(10) %>% 
  kable()
USUBJID TRT01P EFFFL SAFFL AGE
01-701-1028 Xanomeline High Dose Y Y 71
01-701-1034 Xanomeline High Dose Y Y 77
01-701-1133 Xanomeline High Dose Y Y 81
01-701-1146 Xanomeline High Dose Y Y 75
01-701-1148 Xanomeline High Dose Y Y 57
01-701-1180 Xanomeline High Dose Y Y 56
01-701-1181 Xanomeline High Dose Y Y 79
01-701-1239 Xanomeline High Dose Y Y 56
01-701-1275 Xanomeline High Dose Y Y 61
01-701-1287 Xanomeline High Dose Y Y 56

Note: Trimmed for space

Notice how the columns returned are different. First off, within the summary above, we pulled results from the descriptive statistics layer. The target variable for this layer was AGE, and as such AGE is returned in the resulting output. Additionally, a layer level where argument was used to subset to EFFFL == "Y", which leads to EFFFL being included in the output as well.

Extracting a Result Cell’s Metadata

To extract the dataframe in get_meta_subset(), the metadata of the result cell needs to first be extracted. This metadata can be directly accessed using the function get_meta_result(). Using the last example of get_meta_subset() above:

get_meta_result(t, 'd1_2', 'var1_Xanomeline High Dose')
#> tplyr_meta: 4 names, 3 filters
#> Names:
#>     TRT01P, EFFFL, SAFFL, AGE 
#> Filters:
#>     TRT01P == c("Xanomeline High Dose"), EFFFL == "Y", SAFFL == "Y"

The resulting output is a new object Tplyr called tplyr_meta(). This is a container of a relevent metadata for a specific result. The object itself is a list with two elements: names and filters.

The names element contains quosures for each variable relevant to a specific result. This will include the target variable, the by variables used on the layer, the cols variables used on the table, and all variables included in any filter condition relevant to create the result.

The filters element contains each filter condition (provided as calls) necessary to create a particular cell. This will include the table level where argument, the layer level where argument, the filter condition for the specific value of any by variable or cols variable necessary to create the cell, and similarly the filter for the treatment group of interest.

The results are provided this was so that they can be unpacked directly into dplyr syntax when necessary, which is exactly what happens in get_meta_subset(). For example:

m <- get_meta_result(t, 'd1_2', 'var1_Xanomeline High Dose')

adsl %>% 
  filter(!!!m$filters) %>% 
  select(!!!m$names) %>% 
  head(10) %>% 
  kable()
TRT01P EFFFL SAFFL AGE
Xanomeline High Dose Y Y 71
Xanomeline High Dose Y Y 77
Xanomeline High Dose Y Y 81
Xanomeline High Dose Y Y 75
Xanomeline High Dose Y Y 57
Xanomeline High Dose Y Y 56
Xanomeline High Dose Y Y 79
Xanomeline High Dose Y Y 56
Xanomeline High Dose Y Y 61
Xanomeline High Dose Y Y 56

Note: Trimmed for space

But - who says you can’t let your imagination run wild?

cat(c("adsl %>%\n",
  "   filter(\n      ",
  paste(purrr::map_chr(m$filters, ~ rlang::as_label(.)), collpase=",\n      "),
  ") %>%\n",
  paste("   select(", paste(purrr::map_chr(m$names, rlang::as_label), collapse=", "), ")", sep="")
))
adsl %>%
    filter(
       TRT01P == c("Xanomeline High Dose") ,
       EFFFL == "Y" ,
       SAFFL == "Y" ,
       ) %>%
    select(TRT01P, EFFFL, SAFFL, AGE)

So, What Does This Get Me?

So we get get metadata around a result cell, and we can get the exact results from a result cell. You just need a row ID and a column name. But - what does that get you? You can query your tables - and that’s great. But how do you use that.

The idea behind this is really to support Shiny. Consider this minimal application. Click any of the result cells within the table and see what happens.

Source code available here

That’s what this is all about. The persistent row_id and column selection enables you to use something like Shiny to automatically query a cell based on its position in a table. Using click events and a package like reactable, you can pick up the row and column selected and pass that information into get_meta_result(). Once you get the resulting data frame, it’s up to you what you do with it, and you have the world of Shiny at the tip of your fingers.