Analysis on the dependency heaviness of R packages

The R packages under analysis were retrieved from CRAN/Bioconductor on <%=readLines(url("https://pkgndep.github.io/date.txt"))[1]%>. There are <%=n_cran%> packages from CRAN and <%=n_bioc%> packages from Bioconductor (bioc version 3.15).


Legends:

High heaviness: Packages with adjusted heaviness on child packages higher than <%=CUTOFF$adjusted_heaviness_on_children[2]%>.

Median heaviness: Packages with adjusted heaviness on child packages between <%=CUTOFF$adjusted_heaviness_on_children[1]%> and <%=CUTOFF$adjusted_heaviness_on_children[2]%>.

Reducible: Packages whose parent's heaviness could be reduced, i.e. only a limited number of functions are imported from the heaviest parent.

Columns: Heaviness from parent packages; Heaviness on child/downstream packages


The full table of dependency heaviness analysis can be obtained by df = pkgndep::all_pkg_stat_snapshot().

<%
reducible_str = ifelse(only_reducible, 'on', '')
exclude_children_str = ifelse(exclude_children, 'on', '')
if(exclude_children) {
    col.names = c(qq("Package"), "Repository",
        qq("Number of strong dependency packages"),
        qq("Number of all dependency packages"),
        qq("Number of parent packages"),
        qq("Max heaviness from parent packages"),
        qq("Max co-heaviness from parent packages"),
        qq("Heaviness on child packages"),
        qq("Number of child packages"),
        qq("Heaviness on indirect downstream packages (excluding children)"),
        qq("Number of indirect downstream packages (excluding children)"))
} else {
    col.names = c(qq("Package"), "Repository",
        qq("Number of strong dependency packages"),
        qq("Number of all dependency packages"),
        qq("Number of parent packages"),
        qq("Max heaviness from parent packages"),
        qq("Max co-heaviness from parent packages"),
        qq("Heaviness on child packages"),
        qq("Number of child packages"),
        qq("Heaviness on downstream packages"),
        qq("Number of downstream packages"))
}
%>
<%= as.character(knitr::kable(df2, format = "html", row.names = FALSE, escape = FALSE, table.attr = "id='dependency-table' class='table table-striped'", col.names = col.names, align = c("l", rep("r", ncol(df2) - 1)))) %>
<% if(package == "") { %>
<% if(order_by == "adjusted_heaviness_on_children") order_by = "" %>
records per page, showing <%=ind[1]%> to <%=ind[length(ind)]%> of <%=nrow(df)%> packages.
<%
nr = nrow(df)
if(nr > records_per_page) {
%>
<%= page_select(page, ceiling(nr/records_per_page), qq("order_by=@{order_by}&reducible=@{reducible_str}&exclude_children=@{exclude_children_str}")) %>
<% } %>
<% } %>

Dependency categories

For a package denoted as P, its direct dependency packages are listed in the Depends, Imports, LinkingTo, Suggests and Enhances fields of its DESCRIPTION file. We define the following dependency categories for package P:

  • Strong parent packages: Dependency packages listed in the Depends, Imports and LinkingTo fields of P (red box in the figure). They are also called the strong direct dependency packages of P. Strong parent packages must be installed when P is installed.
  • Weak parent packages: Dependency packages listed in the Suggests and Enhances fields of P (green box in the figure). They are optional when installing P.
  • Strong dependency packages: All dependency packages found by recursively looking for strong parent packages (categories A and B, as well as the packages in the red box in the figure). They are also called the upstream packages. Note that strong dependency packages include strong parent packages. Strong dependency packages must be installed when P is installed.
  • All dependency packages: All dependencies found by recursively looking for parent packages, where the weak parents are also included on the level of P only (package categories A, B, C and D, plus all packages in the red and green boxes in the figure). This simulates the case where the full functionality of P is required, i.e., it is the total number of strong dependency packages that P would require if all its weak parents became strong parents.
  • Child packages: Packages whose parents include P (category E in the figure). They are the packages whose dependencies P affects directly.
  • Downstream packages: All packages found by recursively looking for child packages (categories E and F in the figure). P is required for the installation of any of its downstream packages. Note that downstream packages include child packages.
  • Indirect downstream packages: Downstream packages excluding child packages (category F in the figure), i.e., those with a distance to P of at least 2 in the global dependency graph. These are the packages whose dependencies P affects indirectly. Note that in some current studies they are also called transitive packages.
<%= paste(readLines(system.file("website", "dependency_diagram.svg", package = "pkgndep")), collapse = "\n") %>
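The categories above can be illustrated with a minimal base-R sketch on a toy dependency graph. It is not the pkgndep implementation: the package names, the `strong_parents` list and the helper functions `reach()`, `upstream()`, `children()` and `downstream()` are all hypothetical, and only strong parents are modelled.

```r
# Toy graph: each element lists the strong parents (Depends/Imports/LinkingTo)
# of one package. All package names are hypothetical.
strong_parents <- list(
  P = c("A", "B"),
  A = "C",
  B = c("C", "D"),
  C = character(0),
  D = character(0),
  E = "P",
  G = "P",
  H = "E"
)

# Generic traversal: collect everything reachable from `start` via `step()`.
reach <- function(start, step) {
  seen <- character(0)
  queue <- step(start)
  while (length(queue) > 0) {
    p <- queue[1]
    queue <- queue[-1]
    if (!(p %in% seen)) {
      seen <- c(seen, p)
      queue <- c(queue, step(p))
    }
  }
  sort(seen)
}

# Strong dependency (upstream) packages: recursively follow strong parents.
upstream <- function(pkg, parents) {
  reach(pkg, function(p) parents[[p]])
}

# Child packages: packages that list `pkg` among their strong parents.
children <- function(pkg, parents) {
  is_child <- vapply(parents, function(x) pkg %in% x, logical(1))
  sort(names(parents)[is_child])
}

# Downstream packages: recursively follow child packages.
downstream <- function(pkg, parents) {
  reach(pkg, function(p) children(p, parents))
}

up  <- upstream("P", strong_parents)     # "A" "B" "C" "D"
ch  <- children("P", strong_parents)     # "E" "G"
ds  <- downstream("P", strong_parents)   # "E" "G" "H"
ids <- setdiff(ds, ch)                   # indirect downstream: "H"
```

In this toy graph, H depends on P only through E, so it is at distance 2 from P and counts as an indirect downstream package.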

Heaviness metrics

Various metrics for the heaviness are defined as follows:

  • Heaviness from a parent. If package A is a strong parent of P, the heaviness of A on P, denoted as $h$, is calculated as $h = n_1 - n_2$, where $n_1$ is the number of strong dependencies of P and $n_2$ is the number of strong dependencies of P after changing A from a strong parent to a weak parent, i.e., by moving A to Suggests of P. The heaviness thus measures the number of additional strong dependencies that A brings to P and that are not brought by any other parent. If package B is a weak parent of P, $n_2$ is defined as the number of strong dependencies of P after changing B to a strong parent of P, i.e., by moving B to Imports of P. In this scenario, the heaviness of the weak parent is calculated as $n_2 - n_1$.
  • Max heaviness from parents. Assume package P has $K_p$ parents. The max heaviness, denoted as $h_{max}$, is defined as $h_{max}=\underset{k\in\{1..K_p\}}{\max}h_{k}$, where $h_k$ is the heaviness of the $k$-th parent on P.
  • Heaviness on the child packages. Assume P has $K_c$ child packages and the $k$-th child is denoted as $A_k$. Denote the number of strong dependencies of $A_k$ as $n_{1k}$, and the number of strong dependencies of $A_k$ after changing P to a weak parent of $A_k$ as $n_{2k}$. The heaviness of P on its child packages, denoted as $h_c$, is calculated as $h_c=\frac{1}{K_c}\sum_{k=1}^{K_c}(n_{1k}-n_{2k})$. It measures the average number of additional dependencies that P brings to its child packages.
  • Heaviness on the downstream packages. The definition is similar to the heaviness on the child packages. Assume P has $K_d$ downstream packages and the $k$-th downstream package is denoted as $B_k$. Denote the number of strong dependencies of $B_k$ as $n_{1k}$, and the number of strong dependencies of $B_k$ after changing P to a weak parent of all of P's child packages as $n_{2k}$. The heaviness of P on its downstream packages, denoted as $h_d$, is calculated as $h_d=\frac{1}{K_d}\sum_{k=1}^{K_d}(n_{1k}-n_{2k})$.
  • Heaviness on the indirect downstream packages. The calculation is the same as for $h_d$, except that child packages are excluded from the downstream packages. Denote the heaviness as $h_{id}$ and the set of P's child packages as $S_c$. Then $h_{id}$ is defined as $h_{id}=\frac{1}{K_d-K_c}\sum_{k=1}^{K_d}(n_{1k}-n_{2k})\cdot I(B_k\notin S_c)$, where $K_c$ and $K_d$ are the numbers of child and downstream packages respectively, and $I()$ is an indicator function. $h_{id}$ is set to 0 if $K_c = K_d$, i.e., if P has no indirect downstream packages. $h_{id}$ measures the indirect contribution of P's dependencies to the ecosystem.
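The definitions above can be checked by hand on a small example. The following base-R sketch is an illustration under stated assumptions, not the pkgndep implementation: the toy graph, the package names and all helper functions are hypothetical, only strong parents are modelled, and "changing a parent to a weak parent" is simulated by deleting it from the strong-parent list.

```r
# Toy graph of strong parents; all package names are hypothetical.
parents <- list(
  P = c("A", "B"), A = "C", B = c("C", "D"),
  C = character(0), D = character(0),
  E = c("P", "C"), G = "P", H = "E"
)

# Collect everything reachable from `start` via `step()`.
reach <- function(start, step) {
  seen <- character(0)
  queue <- step(start)
  while (length(queue) > 0) {
    p <- queue[1]
    queue <- queue[-1]
    if (!(p %in% seen)) {
      seen <- c(seen, p)
      queue <- c(queue, step(p))
    }
  }
  seen
}
upstream <- function(pkg, g) reach(pkg, function(p) g[[p]])
children <- function(pkg, g) {
  names(g)[vapply(g, function(x) pkg %in% x, logical(1))]
}
downstream <- function(pkg, g) reach(pkg, function(p) children(p, g))

# h = n1 - n2: strong dependencies of `pkg` lost when `parent` becomes weak.
heaviness_from_parent <- function(pkg, parent, g) {
  n1 <- length(upstream(pkg, g))
  g[[pkg]] <- setdiff(g[[pkg]], parent)
  n1 - length(upstream(pkg, g))
}

# h_c: mean reduction over the children of `pkg`.
heaviness_on_children <- function(pkg, g) {
  ch <- children(pkg, g)
  if (length(ch) == 0) return(0)
  mean(vapply(ch, function(k) heaviness_from_parent(k, pkg, g), numeric(1)))
}

# h_d (or h_id when exclude_children = TRUE): `pkg` becomes a weak parent
# of all its children at once, then reductions are averaged.
heaviness_on_downstream <- function(pkg, g, exclude_children = FALSE) {
  ch <- children(pkg, g)
  target <- downstream(pkg, g)
  if (exclude_children) target <- setdiff(target, ch)
  if (length(target) == 0) return(0)
  reduced <- g
  for (k in ch) reduced[[k]] <- setdiff(reduced[[k]], pkg)
  mean(vapply(target, function(k) {
    length(upstream(k, g)) - length(upstream(k, reduced))
  }, numeric(1)))
}

h_A   <- heaviness_from_parent("P", "A", parents)   # 4 - 3 = 1
h_B   <- heaviness_from_parent("P", "B", parents)   # 4 - 2 = 2
h_max <- max(h_A, h_B)
h_c   <- heaviness_on_children("P", parents)        # (4 + 5) / 2
h_d   <- heaviness_on_downstream("P", parents)      # (4 + 5 + 4) / 3
h_id  <- heaviness_on_downstream("P", parents, exclude_children = TRUE)
```

Here A contributes only itself to P because C is also brought in by B, while B contributes itself and D, so B is the heaviest parent of P.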

Adjusted heaviness

When packages are grouped by $K$, which can be the number of parent, child or downstream packages depending on the type of heaviness metric, the distributions of heaviness values always have long tails, and the tails are especially long for small $K$. Thus, if packages are simply ranked by the original heaviness values, the top packages tend to be those with small $K$. In general, packages with small $K$ are of less interest because they have only a small impact on the ecosystem. To prioritize packages with a broader impact on the ecosystem, the original definitions of the various heaviness metrics are adjusted to decrease the weights of packages with smaller $K$. Please note that the designs of the adjusted heaviness metrics are empirical: the absolute values of adjusted heaviness have no meaning on their own and are only used for ranking packages. A detailed explanation of the various adjusted heaviness metrics can be found in the tab "Heaviness analysis".

Co-heaviness from parent pairs

The co-heaviness measures the number of additional dependency packages simultaneously brought by two parent packages. Let A and B be two parents of P. Denote $S_A$ as the set of reduced dependency packages when only A is changed to a weak parent of P, $S_B$ as the set of reduced dependency packages when only B is changed to a weak parent of P, and $S_{AB}$ as the set of reduced dependency packages when both A and B are changed to weak parents of P. The co-heaviness of A and B on P, denoted as $h_{co}$, is then defined as $h_{co} = \left|S_{AB}\setminus(S_A\cup S_B)\right|$, where $X \setminus Y$ is the set of elements in $X$ but not in $Y$, and $|X|$ is the number of elements in set $X$. The co-heaviness thus counts the packages that are removed only by the joint action of A and B.
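As a small worked example of this definition, the following base-R sketch computes $S_A$, $S_B$, $S_{AB}$ and $h_{co}$ on a toy graph. The graph, the package names and the helpers `upstream()`, `reduced_set()` and `co_heaviness()` are hypothetical and only illustrate the set arithmetic; they are not the pkgndep implementation.

```r
# Toy graph of strong parents; all package names are hypothetical.
# C is brought to P by both A and B, D only by B.
parents <- list(
  P = c("A", "B"),
  A = "C",
  B = c("C", "D"),
  C = character(0),
  D = character(0)
)

# Strong dependencies of `pkg`: recursively follow strong parents.
upstream <- function(pkg, g) {
  seen <- character(0)
  queue <- g[[pkg]]
  while (length(queue) > 0) {
    p <- queue[1]
    queue <- queue[-1]
    if (!(p %in% seen)) {
      seen <- c(seen, p)
      queue <- c(queue, g[[p]])
    }
  }
  seen
}

# Set of strong dependencies of `pkg` that disappear when the packages in
# `drop` are changed to weak parents (simulated by removing them).
reduced_set <- function(pkg, drop, g) {
  before <- upstream(pkg, g)
  g[[pkg]] <- setdiff(g[[pkg]], drop)
  sort(setdiff(before, upstream(pkg, g)))
}

# h_co = | S_AB \ (S_A union S_B) |
co_heaviness <- function(pkg, a, b, g) {
  s_a  <- reduced_set(pkg, a, g)
  s_b  <- reduced_set(pkg, b, g)
  s_ab <- reduced_set(pkg, c(a, b), g)
  length(setdiff(s_ab, union(s_a, s_b)))
}

s_a  <- reduced_set("P", "A", parents)           # "A"
s_b  <- reduced_set("P", "B", parents)           # "B" "D"
s_ab <- reduced_set("P", c("A", "B"), parents)   # "A" "B" "C" "D"
h_co <- co_heaviness("P", "A", "B", parents)     # only C remains: 1
```

Package C is not removed by weakening A alone or B alone, because the other parent still requires it; it disappears only when both are weakened, which is exactly what the co-heaviness counts.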
