The explore package simplifies Exploratory Data Analysis (EDA). Get faster insights with less code!
There are three ways to use the package:
Interactive data exploration (univariate, bivariate, multivariate)
Generate an Automated Report with one line of code. The target can be binary, categorical or numeric.
Manual exploration using an easy-to-remember set of tidy functions. There are four main verbs: explore() to graphically explore a variable or table, describe() to describe a variable or table, explain_tree() to create a simple decision tree that explains a target, and report() to generate an automated report of all variables.
explore package on Github: https://github.com/rolkra/explore
As the explore functions fit well into the tidyverse, we load the dplyr package as well.
library(dplyr)
library(explore)
Explore your dataset (in this case the iris dataset) in one line of code:
explore(iris)
This launches a Shiny app in which you can inspect individual variables, explore their relationship to a target (binary / categorical / numerical), grow a decision tree or create a fully automated report of all variables with a few mouse clicks.
You can choose any variable as a target, as long as it is binary (0/1, FALSE/TRUE or “no”/“yes”), categorical or numeric.
Create a rich HTML report of all variables with one line of code:
# report of all variables
iris %>% report(output_file = "report.html", output_dir = tempdir())
Or you can simply add a target and create the report. In this case we use a binary target, but a categorical or numerical target would work as well.
# report of all variables and their relationship with a binary target
iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)
iris %>%
  report(output_file = "report.html",
         output_dir = tempdir(),
         target = is_versicolor)
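Before passing a derived target to report(), it can help to confirm that it really is binary and to see how balanced it is. A minimal base R sketch; the copy iris_bin and the names target_counts and target_rate are illustrative and not part of the explore package:

```r
# Work on a copy so the built-in iris data stays unchanged
iris_bin <- datasets::iris
iris_bin$is_versicolor <- ifelse(iris_bin$Species == "versicolor", 1, 0)

# Count observations per target value (should only show 0 and 1)
target_counts <- table(iris_bin$is_versicolor)
print(target_counts)

# Share of observations with target == 1
target_rate <- mean(iris_bin$is_versicolor)
print(target_rate)
```

With iris, one of three equally sized species is flagged, so about a third of the observations carry the target value 1.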
If you use a binary target, the parameter split = FALSE (or targetpct = TRUE) will give you a different view of the data.
Grow a decision tree with one line of code:
iris %>% explain_tree(target = Species)
You can grow a decision tree with a binary target too.
iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)
iris %>% select(-Species) %>% explain_tree(target = is_versicolor)
Or using a numerical target. The syntax stays the same.
iris %>% explain_tree(target = Sepal.Length)
You can control the growth of the tree using the parameters maxdepth, minsplit and cp.
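As a rough illustration of these three parameters, the call below asks for a shallower tree. The values are arbitrary, and the comments assume the parameters behave like the rpart.control() arguments of the same name, which is an assumption rather than something stated here:

```r
library(dplyr)
library(explore)

# Illustrative values only:
#   maxdepth = 2    limit the tree to two levels below the root
#   minsplit = 30   only try to split nodes with at least 30 observations
#   cp       = 0.01 require a minimum improvement before accepting a split
iris %>%
  explain_tree(target = Species, maxdepth = 2, minsplit = 30, cp = 0.01)
```

Starting with a small maxdepth and relaxing it step by step is one way to keep the tree readable.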
Explore your table with one line of code to see which type of variables it contains.
iris %>% explore_tbl()
You can also use describe_tbl() if you just need the main facts without visualisation.
iris %>% describe_tbl()
#> 150 observations with 6 variables
#> 0 observations containing missings (NA)
#> 0 variables containing missings (NA)
#> 0 variables with no variance
Explore a variable with one line of code. You don’t have to care if a variable is numerical or categorical.
iris %>% explore(Species)
iris %>% explore(Sepal.Length)
Explore a variable and its relationship with a binary target with one line of code. You don’t have to care if a variable is numerical or categorical.
iris %>% explore(Sepal.Length, target = is_versicolor)
Using split = FALSE will change the plot to %target:
iris %>% explore(Sepal.Length, target = is_versicolor, split = FALSE)
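To get a feeling for what a %target view expresses, the share of target == 1 can be computed per value range with base R. This is only a sketch: the four equal-width bins are an arbitrary choice, and the binning that explore() applies internally may differ.

```r
# Copy of iris with a binary target
iris_bin <- datasets::iris
iris_bin$is_versicolor <- ifelse(iris_bin$Species == "versicolor", 1, 0)

# Bin Sepal.Length into 4 equal-width intervals (bin count chosen arbitrarily)
bins <- cut(iris_bin$Sepal.Length, breaks = 4)

# Percentage of observations with target == 1 within each bin
pct_target <- tapply(iris_bin$is_versicolor, bins, mean) * 100
print(round(pct_target, 1))
```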
The target can have more than two levels:
iris %>% explore(Sepal.Length, target = Species)
Or the target can even be numeric:
iris %>% explore(Sepal.Length, target = Petal.Length)
iris %>%
  select(Sepal.Length, Sepal.Width) %>%
  explore_all()
iris %>%
  select(Sepal.Length, Sepal.Width, is_versicolor) %>%
  explore_all(target = is_versicolor)
iris %>%
  select(Sepal.Length, Sepal.Width, is_versicolor) %>%
  explore_all(target = is_versicolor, split = FALSE)
iris %>%
  select(Sepal.Length, Sepal.Width, Species) %>%
  explore_all(target = Species)
iris %>%
  select(Sepal.Length, Sepal.Width, Petal.Length) %>%
  explore_all(target = Petal.Length)
data(iris)
To use a high number of variables with explore_all() in an
R Markdown file, it is necessary to set a meaningful fig.width and
fig.height in the chunk options. The function total_fig_height() helps to
set fig.height automatically:
fig.height=total_fig_height(iris)
iris %>%
  explore_all()
If you use a target:
fig.height=total_fig_height(iris, var_name_target = "Species")
iris %>% explore_all(target = Species)
You can control total_fig_height() with the parameters ncols (number of columns of the plots) and size (height of one plot).
Explore correlation between two variables with one line of code:
iris %>% explore(Sepal.Length, Petal.Length)
You can add a target too:
iris %>% explore(Sepal.Length, Petal.Length, target = Species)
If you use explore to explore a variable and want to set lower and
upper limits for values, you can use the min_val and max_val
parameters. All values below min_val will be set to
min_val. All values above max_val will be set to max_val.
iris %>% explore(Sepal.Length, min_val = 4.5, max_val = 7)
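The limiting rule described above (values outside the limits are set to the nearest limit) can be reproduced in base R with pmin() and pmax(). This is a sketch of the rule only, not of how explore() implements it; sepal_clamped is an illustrative name:

```r
x <- datasets::iris$Sepal.Length
min_val <- 4.5
max_val <- 7

# Values above max_val become max_val, values below min_val become min_val
sepal_clamped <- pmax(pmin(x, max_val), min_val)

print(range(x))              # original range
print(range(sepal_clamped))  # range after applying the limits
```

The number of observations stays the same; only the extreme values are pulled in to the limits.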
explore uses auto-scale by default. To deactivate it, use
the parameter auto_scale = FALSE:
iris %>% explore(Sepal.Length, auto_scale = FALSE)
Describe your data in one line of code:
iris %>% describe()
#> # A tibble: 5 × 8
#> variable type na na_pct unique min mean max
#> <chr> <chr> <int> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 Sepal.Length dbl 0 0 35 4.3 5.84 7.9
#> 2 Sepal.Width dbl 0 0 23 2 3.06 4.4
#> 3 Petal.Length dbl 0 0 43 1 3.76 6.9
#> 4 Petal.Width dbl 0 0 22 0.1 1.2 2.5
#> 5 Species fct 0 0 3 NA NA NA
The result is a data frame, where each row is a variable of your
data. You can use filter from dplyr for quick checks:
# show all variables that contain less than 5 unique values
iris %>% describe() %>% filter(unique < 5)
#> # A tibble: 1 × 8
#> variable type na na_pct unique min mean max
#> <chr> <chr> <int> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 Species fct 0 0 3 NA NA NA
# show all variables that contain NA values
iris %>% describe() %>% filter(na > 0)
#> # A tibble: 0 × 8
#> # … with 8 variables: variable <chr>, type <chr>, na <int>, na_pct <dbl>,
#> # unique <int>, min <dbl>, mean <dbl>, max <dbl>
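The unique and na columns can be cross-checked with plain base R. The sketch below assumes that describe() counts distinct values and NA values in the straightforward way; unique_counts, na_counts and few_unique are illustrative names:

```r
iris_raw <- datasets::iris

# Number of distinct values per column
unique_counts <- vapply(iris_raw, function(x) length(unique(x)), integer(1))
print(unique_counts)

# Number of NA values per column
na_counts <- colSums(is.na(iris_raw))
print(na_counts)

# Columns with fewer than 5 distinct values
few_unique <- names(unique_counts)[unique_counts < 5]
print(few_unique)
```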
You can use describe for describing single variables too. You
don’t need to care if a variable is numerical or categorical. The output
is plain text.
# describe a categorical variable
iris %>% describe(Species)
#> variable = Species
#> type = factor
#> na = 0 of 150 (0%)
#> unique = 3
#> setosa = 50 (33.3%)
#> versicolor = 50 (33.3%)
#> virginica = 50 (33.3%)
# describe a numerical variable
iris %>% describe(Sepal.Length)
#> variable = Sepal.Length
#> type = double
#> na = 0 of 150 (0%)
#> unique = 35
#> min|max = 4.3 | 7.9
#> q05|q95 = 4.6 | 7.255
#> q25|q75 = 5.1 | 6.4
#> median = 5.8
#> mean = 5.843333
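The figures in this text output can be reproduced with base R, which is a quick way to see what each line means. The sketch assumes that describe() uses R's default quantile definition; the names x and q are illustrative:

```r
x <- datasets::iris$Sepal.Length

# 5%, 25%, 50% (median), 75% and 95% quantiles
q <- quantile(x, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))
print(q)

# Minimum, maximum and mean
print(range(x))
print(mean(x))
```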
Use one of the prepared datasets to explore:
# create dataset and describe it
data <- create_data_app(obs = 100)
describe(data)
#> # A tibble: 7 × 8
#> variable type na na_pct unique min mean max
#> <chr> <chr> <int> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 os chr 0 0 3 NA NA NA
#> 2 free int 0 0 2 0 0.62 1
#> 3 downloads int 0 0 99 255 6704. 18386
#> 4 rating dbl 0 0 5 1 3.44 5
#> 5 type chr 0 0 10 NA NA NA
#> 6 updates dbl 0 0 72 1 45.6 99
#> 7 screen_sizes dbl 0 0 5 1 2.61 5
# create dataset and describe it
data <- create_data_random(obs = 100, vars = 5)
describe(data)
#> # A tibble: 7 × 8
#> variable type na na_pct unique min mean max
#> <chr> <chr> <int> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 id int 0 0 100 1 50.5 100
#> 2 target_ind int 0 0 2 0 0.53 1
#> 3 var_1 int 0 0 61 1 51.4 99
#> 4 var_2 int 0 0 63 1 48.6 98
#> 5 var_3 int 0 0 62 1 49.2 100
#> 6 var_4 int 0 0 68 0 48.6 100
#> 7 var_5 int 0 0 64 2 51.9 99
You can build your own random dataset by using
create_data_empty() and the add_var_random_*() functions:
# create dataset and describe it
data <- create_data_empty(obs = 1000) %>%
  add_var_random_01("target") %>%
  add_var_random_dbl("age", min_val = 18, max_val = 80) %>%
  add_var_random_cat("gender",
                     cat = c("male", "female", "other"),
                     prob = c(0.4, 0.4, 0.2)) %>%
  add_var_random_starsign() %>%
  add_var_random_moon()
describe(data)
#> # A tibble: 5 × 8
#> variable type na na_pct unique min mean max
#> <chr> <chr> <int> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 target int 0 0 2 0 0.51 1
#> 2 age dbl 0 0 1000 18.1 48.9 80.0
#> 3 gender chr 0 0 3 NA NA NA
#> 4 random_starsign chr 0 0 12 NA NA NA
#> 5 random_moon chr 0 0 4 NA NA NA
data %>% select(random_starsign, random_moon) %>% explore_all()
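The idea behind a weighted categorical variable such as gender can be illustrated with base R's sample(). This is an analogue for intuition only, not the implementation used by add_var_random_cat(); the seed and variable names are arbitrary:

```r
set.seed(42)  # arbitrary seed, only to make the sketch reproducible
n <- 1000

# Draw n categories with the given probabilities
gender <- sample(c("male", "female", "other"),
                 size = n, replace = TRUE,
                 prob = c(0.4, 0.4, 0.2))

# Observed shares should be close to, but not exactly, 0.4 / 0.4 / 0.2
shares <- prop.table(table(gender))
print(round(shares, 3))
```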
Create a Data Dictionary of a dataset (Markdown file data_dict.md):
iris %>% data_dict_md(output_dir = tempdir())
Add a title and detailed descriptions, and change the default filename:
description <- data.frame(
  variable = c("Species"),
  description = c("Species of Iris flower"))
data_dict_md(iris,
title = "iris flower data set",
description = description,
output_file = "data_dict_iris.md",
output_dir = tempdir())
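The description data frame can hold one row per variable. The sketch below describes all five iris columns using the same data_dict_md() arguments as above; the description texts are illustrative wording, and the stopifnot() check is an extra safeguard added here rather than something the package requires:

```r
library(explore)

# One description per iris column (wording is illustrative)
description <- data.frame(
  variable = c("Sepal.Length", "Sepal.Width",
               "Petal.Length", "Petal.Width", "Species"),
  description = c("Sepal length in cm", "Sepal width in cm",
                  "Petal length in cm", "Petal width in cm",
                  "Species of Iris flower"))

# Guard against typos: every described variable must exist in the data
stopifnot(all(description$variable %in% names(iris)))

data_dict_md(iris,
             title = "iris flower data set",
             description = description,
             output_file = "data_dict_iris.md",
             output_dir = tempdir())
```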
To clean a variable you can use clean_var. With one line
of code you can rename a variable, replace NA values and set a minimum
and maximum for the value.
iris %>%
  clean_var(Sepal.Length,
            min_val = 4.5,
            max_val = 7.0,
            na = 5.8,
            name = "sepal_length") %>%
  describe()
#> # A tibble: 5 × 8
#> variable type na na_pct unique min mean max
#> <chr> <chr> <int> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 sepal_length dbl 0 0 26 4.5 5.81 7
#> 2 Sepal.Width dbl 0 0 23 2 3.06 4.4
#> 3 Petal.Length dbl 0 0 43 1 3.76 6.9
#> 4 Petal.Width dbl 0 0 22 0.1 1.2 2.5
#> 5 Species fct 0 0 3 NA NA NA
Create an R Markdown template to explore your own data. Set output_dir (an existing file may be overwritten):
create_notebook_explore(
output_dir = tempdir(),
output_file = "notebook-explore.Rmd")