<% df = load_pkg_stat_snapshot() %>

In this report, we study two questions on the package dependencies:

Heaviness from parent packages

For each package, we first look at the maximal heaviness from its parents. The following plots show the relation between the number of parents and the max heaviness from parents. On the border of the point cloud, the max heaviness from parents tends to drop as the number of parents increases. This is because when a package has more parents, the dependency packages brought in by each parent overlap more (i.e., dependencies from parent A overlap with the dependencies from parent B). Heaviness measures the number of unique dependencies that a single parent brings in, or in other words, the number of dependencies that are not brought in by any of the other parents. Thus, with more parents, the max heaviness from parents decreases.
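The set-based reading of heaviness described above can be illustrated with a small sketch. The parent names, the dependency sets, and the helper `heaviness_from_parent()` are hypothetical, and counting the parent itself among the packages it brings in is an assumption of this sketch, not something stated in this report.

```r
# Hypothetical toy data: for each parent of a package P, the set of packages
# that the parent brings in (assumed here to include the parent itself).
brought_by = list(
    A = c("A", "x1", "x2", "x3", "shared"),
    B = c("B", "shared"),
    C = c("C", "y1", "shared")
)

# Heaviness of one parent: packages it brings in that no other parent brings in.
heaviness_from_parent = function(parent, brought_by) {
    others = unlist(brought_by[setdiff(names(brought_by), parent)])
    length(setdiff(brought_by[[parent]], others))
}

h = sapply(names(brought_by), heaviness_from_parent, brought_by = brought_by)
h        # A = 4, B = 1, C = 2
max(h)   # max heaviness from parents: 4
```

In this toy case the package "shared" is brought in by all three parents, so it does not count toward the heaviness of any of them.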

In the plot, several packages lie far away from the cloud (highlighted in red and orange). These packages can be thought of as having extremely heavy parents compared to most of the others. To capture these packages with heavy parents, we define "adjusted max heaviness on parent packages" as follows.

For a package P, denote $h$ as the max heaviness from its parent packages. The adjusted heaviness is calculated as $h^{adj} = h \cdot a$, where $a$ is a zooming factor calculated as $a = (n+30)/n_{max}$. Here $n$ is the number of parents of package P and $n_{max}$ is the maximal number of parents over all packages (i.e., all CRAN/Bioconductor packages). The value of 30 is selected empirically to balance the zooming rate across different $n$.
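As a minimal sketch of this formula, the function below applies the zooming factor to a vector of parent counts. The function name `adjust_max_heaviness()` and the input values are hypothetical and chosen only for illustration.

```r
# h: max heaviness from parents; n: number of parents;
# n_max: maximal number of parents over all packages.
adjust_max_heaviness = function(h, n, n_max, offset = 30) {
    a = (n + offset) / n_max   # zooming factor
    h * a
}

# Hypothetical values: the same raw heaviness with different numbers of parents.
adjust_max_heaviness(h = 50, n = c(2, 20, 70), n_max = 100)
# approximately 16, 25, 50
```

With the same raw heaviness of 50, the package with 2 parents is scaled down to about a third of its value, while the package with 70 parents is left unchanged.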

The zooming factor $a$ reduces the heaviness more strongly for packages with a small number of parents. It thus flattens the point cloud into a more horizontal distribution, which makes it easy to set a cutoff for marking extreme points. The plot of adjusted heaviness versus number of parents can be seen by clicking the radio button "Adjusted max heaviness verse number of parent packages" below. We mark a package as having highly heavy parents if the adjusted heaviness is larger than <%=CUTOFF$adjusted_max_heaviness_from_parents[2]%>, and as having moderately heavy parents if the adjusted heaviness is between <%=CUTOFF$adjusted_max_heaviness_from_parents[1]%> and <%=CUTOFF$adjusted_max_heaviness_from_parents[2]%>. The packages with highly heavy parents are listed on the right of the following figure.

<%
# packages whose adjusted max heaviness from parents reaches the upper cutoff
df2 = df[df$adjusted_max_heaviness_from_parents >= CUTOFF$adjusted_max_heaviness_from_parents[2], , drop = FALSE]
top_pkgs = rownames(df2)[order(df2$adjusted_max_heaviness_from_parents, decreasing = TRUE)]

# lay out package names column by column in an HTML table with at least 12 rows
generate_top_pkgs_html = function(top_pkgs, caption = "Top packages") {
    nr = max(12, ceiling(length(top_pkgs)/3))
    n_col = ceiling(length(top_pkgs) / nr)
    html_tb = matrix("", nrow = nr, ncol = n_col)
    for(i in seq_along(top_pkgs)) {
        i_col = ceiling(i/nr - 1/(nr+1))
        i_row = (i-1) %% nr + 1
        html_tb[i_row, i_col] = qq("@{top_pkgs[i]}")
    }
    kable(html_tb, format = "html", row.names = FALSE, escape = FALSE, table.attr = "class='table'", caption = caption)
}
%>
<%=generate_top_pkgs_html(top_pkgs)%>

<%= img(paste0(env$figure_dir, "/plot-parent-max-heaviness.png"), style="height:500px")%>

A package may have more than one heavy parent, so we next look at the total heaviness from all parents of a package. Note that the heaviness from a parent measures the number of additional unique packages it brings in that are not brought in by any of the other parents. Therefore, the total heaviness from parents is the number of dependency packages that are brought in by exactly one parent package. Generally, for a package, the majority of its parents contribute only very small heaviness, while a few parents (mostly 1 to 3) contribute high heaviness. Thus, the "total heaviness from all parents" can be approximately treated as the "total heaviness from heavy parents".

Similarly, we define an "adjusted total heaviness from parents" to make the point distribution more horizontal. It is defined as:

$h^{adj} = h \cdot a$, where $a = \sqrt{n}/\sqrt{n_{max}}$. Note here $h$ is the total heaviness from parents for package P.
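The square-root zooming factor can be sketched in the same way. The function name `adjust_total_heaviness()` and the numbers are hypothetical.

```r
# h_total: total heaviness from parents; n: number of parents;
# n_max: maximal number of parents over all packages.
adjust_total_heaviness = function(h_total, n, n_max) {
    a = sqrt(n) / sqrt(n_max)   # zooming factor
    h_total * a
}

# Hypothetical values: the same total heaviness with 1, 25 and 100 parents.
adjust_total_heaviness(h_total = 60, n = c(1, 25, 100), n_max = 100)
# approximately 6, 30, 60
```

Note that with this factor a package with a single parent is scaled by $1/\sqrt{n_{max}}$, so the shrinkage for small $n$ depends on the size of $n_{max}$.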

The plot of adjusted heaviness versus number of parents can be seen by clicking the radio button "Adjusted total heaviness verse number of parent packages". We mark a package as having highly heavy parents if the adjusted heaviness is larger than <%=CUTOFF$adjusted_total_heaviness_from_parents[2]%>, and as having moderately heavy parents if the adjusted heaviness is between <%=CUTOFF$adjusted_total_heaviness_from_parents[1]%> and <%=CUTOFF$adjusted_total_heaviness_from_parents[2]%>.

<%
df2 = df[df$adjusted_total_heaviness_from_parents >= CUTOFF$adjusted_total_heaviness_from_parents[2], , drop = FALSE]
top_pkgs = rownames(df2)[order(df2$adjusted_total_heaviness_from_parents, decreasing = TRUE)]
%>
<%=generate_top_pkgs_html(top_pkgs)%>

<%= img(paste0(env$figure_dir, "/plot-parent-total-heaviness.png"), style="height:500px")%>

According to both figures in this section, CRAN packages have very similar trends for the max and total heaviness from parents. But Bioconductor packages in general have heavier parents, e.g., musicatk and singleCellTK.

Heaviness on child packages

Generally, the heaviness on child packages tends to decrease as the number of child packages increases, since it is averaged over all children. To highlight the packages that heavily affect large numbers of children, we adjust the original definition of heaviness. The original heaviness on child packages is defined as follows:

For a package P, assume it has $K$ child packages and denote the $k^{th}$ child as $A_k$. Denote $n_{1k}$ as the number of strong dependencies of package $A_k$ and $n_{2k}$ as the number of strong dependencies of $A_k$ if P is moved to its Suggests. The heaviness of P on its child packages, denoted as $h$, is calculated as $h = \frac{1}{K} \sum_k^K(n_{1k} - n_{2k})$, which is the average heaviness over all its child packages.

Since the original heaviness is divided by the number of children, a large $K$ can produce a small heaviness. We adjust the heaviness on child packages by adding a small constant $a$ to $K$, so that the heaviness decreases more strongly for small $K$ than for large $K$.

$h^{adj} = \frac{1}{K + a} \sum_k^K(n_{1k} - n_{2k})$
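A small sketch shows the effect of the constant $a$ on packages with few and with many children. The function name `heaviness_on_children()` and the dependency counts are hypothetical; setting `a = 0` gives the original heaviness.

```r
# n1[k]: number of strong dependencies of child k;
# n2[k]: the same number after moving P to the child's Suggests.
heaviness_on_children = function(n1, n2, a = 0) {
    sum(n1 - n2) / (length(n1) + a)
}

# Hypothetical packages: one with 2 children, one with 40 children.
# In both cases every child loses 30 dependencies when P is moved to Suggests.
n1_small = c(45, 50);    n2_small = n1_small - 30
n1_large = rep(60, 40);  n2_large = n1_large - 30

heaviness_on_children(n1_small, n2_small)           # 30
heaviness_on_children(n1_large, n2_large)           # 30
heaviness_on_children(n1_small, n2_small, a = 10)   # 60/12 = 5
heaviness_on_children(n1_large, n2_large, a = 10)   # 1200/50 = 24
```

Both packages have the same original heaviness, but after the adjustment the package that affects 40 children ranks well above the one that affects only 2.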

We empirically select 10 for $a$. Click the following title to see how $a$ is selected.

How is a selected?

It is easy to see that $a$ decreases $h$ more strongly for smaller $K$ than for larger $K$. To select an optimal value for $a$, we took $a$ as integers in the set {1, 2, …, 29, 30}. For a specific package indexed as $k$ and a value of $a$, we calculated the adjusted heaviness on its child packages, denoted as $h_{k,a}^{adj}$; the vector over all packages is denoted as $h_a^{adj}$. $a$ is selected as the value at which the ranking of the adjusted heaviness of all packages becomes stable. To measure the stability of the ranking of $h_a^{adj}$ compared to $h_{a-1}^{adj}$, we calculate the stability score $s_a$ as $s_a=\frac{1}{N}\sum_k^NI(|R_{k,a}-R_{k,a-1}|\le50)$, where $N$ is the total number of packages in the ecosystem, $R_{k,a}$ and $R_{k,a-1}$ are the ranks of package $k$’s adjusted heaviness in $h_a^{adj}$ and $h_{a-1}^{adj}$ respectively, and $I()$ is the indicator function.

$s_a$, or its general denotation $s$, measures the proportion of packages whose rank of adjusted heaviness changes by no more than 50 between two neighboring values of $a$ (50 is a small value compared to the total number of R packages in the ecosystem). When $s$ becomes stable with $a$, we can conclude that increasing $a$ further won't greatly change the ranking of the adjusted heaviness. Thus, based on the “knee” rule, $a$ was set to 10.
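The stability score can be sketched as follows. The function name `stability_score()`, the matrix layout (one row per package, one column per value of $a$) and the simulated data are assumptions made for illustration; they are not the data behind the figure in this section.

```r
# H: matrix of adjusted heaviness, one row per package,
#    one column per value of a (columns ordered a = 1, 2, ...).
stability_score = function(H, max_shift = 50) {
    # rank within each column; rank 1 is the highest adjusted heaviness
    R = apply(-H, 2, rank, ties.method = "average")
    s = rep(NA_real_, ncol(H))   # s is undefined for the first value of a
    for (j in seq_len(ncol(H))[-1]) {
        # proportion of packages whose rank moved by at most max_shift
        s[j] = mean(abs(R[, j] - R[, j - 1]) <= max_shift)
    }
    s
}

# Simulated (hypothetical) ecosystem of N packages.
set.seed(1)
N = 2000
K = sample(1:200, N, replace = TRUE)   # number of children per package
total = K * rexp(N, rate = 1/10)       # sum over children of n1 - n2
a_values = 1:30
H = sapply(a_values, function(a) total / (K + a))

s = stability_score(H)
plot(a_values, s, type = "b", xlab = "a", ylab = "stability score")
```

Ranking from highest to lowest is a choice of this sketch; without ties, the absolute rank difference is the same in either direction.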

<%= img(paste0(env$figure_dir, "/plot-select-a-adjusted-heaviness-children.png"), style="height:500px")%>


The plot of adjusted heaviness versus number of children can be seen by clicking the radio button "Adjusted heaviness verse number of child packages". We mark a package as having a highly heavy impact on its children if the adjusted heaviness is larger than <%=CUTOFF$adjusted_heaviness_on_children[2]%>, and as having a moderately heavy impact if the adjusted heaviness is between <%=CUTOFF$adjusted_heaviness_on_children[1]%> and <%=CUTOFF$adjusted_heaviness_on_children[2]%>.
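The two-cutoff labeling can be sketched with `cut()`. The cutoff values, package names and heaviness values below are hypothetical; the report itself reads the cutoffs from `CUTOFF$adjusted_heaviness_on_children`. Treating a value equal to a cutoff as belonging to the higher class is an assumption of this sketch.

```r
# Hypothetical cutoffs and adjusted heaviness values.
cutoff = c(15, 30)
h_adj = c(pkgA = 4.2, pkgB = 15, pkgC = 22.5, pkgD = 30, pkgE = 48)

# right = FALSE gives the intervals [-Inf, 15), [15, 30), [30, Inf)
label = cut(h_adj, breaks = c(-Inf, cutoff, Inf),
            labels = c("low", "moderate", "high"), right = FALSE)

data.frame(h_adj, label)
```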

<%
df2 = df[df$adjusted_heaviness_on_children >= CUTOFF$adjusted_heaviness_on_children[2], , drop = FALSE]
top_pkgs = rownames(df2)[order(df2$adjusted_heaviness_on_children, decreasing = TRUE)]
%>
<%=generate_top_pkgs_html(top_pkgs)%>

<%= img(paste0(env$figure_dir, "/plot-child-heaviness.png"), style="height:500px")%>

The analysis of heaviness on child packages is more useful for developers because it tells them the expected number of additional dependency packages that a new direct dependency would bring to their package. E.g., if they add lumi to the Imports of their package, the package will likely gain 111 extra dependency packages.

Heaviness on indirect downstream packages

We next look at the indirect effects on the dependencies of downstream packages. Note that here we only look at downstream packages excluding the child packages. A comparison with child packages included can be found in the next section of this report.

Similar to the heaviness on child packages, the heaviness on indirect downstream packages also decreases as the number of downstream packages increases. We therefore also define an "adjusted heaviness on indirect downstream packages". The original definition of heaviness on indirect downstream packages is as follows:

For a package P, assume it has $K$ downstream packages (child packages included) and denote the $k^{th}$ downstream package as $B_k$. Denote $n_{1k}$ as the number of strong dependencies of package $B_k$. Since P can affect its downstream packages in an indirect manner, we recalculate the global dependency relations for all packages after moving P to the Suggests of all its child packages. We then denote $n_{2k}$ as the number of strong dependencies of $B_k$ in the modified dependency graph. Next we denote $S_c$ as the set of child packages of P and $K_c$ as the number of its child packages, thus $K \geq K_c$. The heaviness of P on its indirect downstream packages (excluding child packages), denoted as $h$, is calculated as: $h = \frac{1}{K-K_c} \sum_{k}^K(n_{1k} - n_{2k}) \cdot I(B_{k} \notin S_c)$ where $I()$ is an indicator function. $h$ is set to 0 when $K = K_c$.

Then a small constant $a$ is added to $K - K_c$ to adjust the original heaviness:

$h^{adj} = \frac{1}{K-K_c + a} \sum_{k}^K(n_{1k} - n_{2k}) \cdot I(B_{k} \notin S_c)$
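The indicator term simply drops the child packages from the sum. A minimal sketch, with a hypothetical function name and made-up dependency counts:

```r
# n1[k], n2[k]: strong dependencies of downstream package k before and after
#               moving P to the Suggests of all its child packages.
# is_child[k]:  TRUE if downstream package k is a child of P.
heaviness_on_indirect_downstream = function(n1, n2, is_child, a = 0) {
    keep = !is_child
    K_indirect = sum(keep)               # K - K_c
    if (K_indirect == 0) return(0)       # no indirect downstream packages
    sum((n1 - n2)[keep]) / (K_indirect + a)
}

# Hypothetical package with 5 downstream packages, 2 of which are children.
n1 = c(40, 35, 50, 20, 25)
n2 = c(10,  5, 30, 15, 25)
is_child = c(TRUE, TRUE, FALSE, FALSE, FALSE)

h_raw = heaviness_on_indirect_downstream(n1, n2, is_child)          # 25/3
h_adj = heaviness_on_indirect_downstream(n1, n2, is_child, a = 6)   # 25/9
```

Only the three non-child packages contribute (20 + 5 + 0 = 25), so the large reductions seen for the two children do not enter the result.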

We empirically select 6 for $a$. Click the following title to see how $a$ is selected.

How is a calculated?

The method is the same as in the section Heaviness on child packages. A value of 6 is taken as the optimal value of $a$.

<%= img(paste0(env$figure_dir, "/plot-select-a-adjusted-heaviness-downstream-no-children.png"), style="height:500px")%>


The plot of adjusted heaviness versus number of downstream packages can be seen by clicking the radio button "Adjusted heaviness verse number of indirect downstream packages". We mark a package as having a highly heavy impact on its indirect downstream packages if the adjusted heaviness is larger than <%=CUTOFF$adjusted_heaviness_on_indirect_downstream[2]%>, and as having a moderately heavy impact if the adjusted heaviness is between <%=CUTOFF$adjusted_heaviness_on_indirect_downstream[1]%> and <%=CUTOFF$adjusted_heaviness_on_indirect_downstream[2]%>.

<%
df2 = df[df$adjusted_heaviness_on_indirect_downstream >= CUTOFF$adjusted_heaviness_on_indirect_downstream[2], , drop = FALSE]
top_pkgs = rownames(df2)[order(df2$adjusted_heaviness_on_indirect_downstream, decreasing = TRUE)]
%>
<%=generate_top_pkgs_html(top_pkgs)%>

<%= img(paste0(env$figure_dir, "/plot-downstream-no-children-heaviness.png"), style="height:500px")%>

The figure shows that CRAN packages have a greater effect on the dependencies of indirect downstream packages.

Why are child packages removed from downstream packages in the analysis?

Each of the following two plots visualizes the ranking of all packages based on their heaviness on child packages and on downstream packages. In each plot, the left and right panels contain the sorted heaviness on children and on downstream respectively. The middle panel contains lines connecting the same package in the two rankings; the two ends of a line are assigned the same color. A "Venn diagram" in the bottom panel shows the overlap between the top 500 packages with the highest heaviness on children and those with the highest heaviness on downstream.

The left plot shows that the top 500 packages with the highest heaviness on children almost all also have the highest heaviness on downstream (474 out of 500). The right plot shows that if only the indirect downstream packages are considered, the overlap with the packages with top heaviness on children is very small.
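The overlap counted in the "Venn diagram" is the size of the intersection of two top-n lists. A minimal sketch with a hypothetical helper `top_n_overlap()` and five made-up packages (using n = 3 instead of 500):

```r
# x, y: named numeric vectors of heaviness, with package names as names.
top_n_overlap = function(x, y, n) {
    top_x = names(sort(x, decreasing = TRUE))[seq_len(n)]
    top_y = names(sort(y, decreasing = TRUE))[seq_len(n)]
    length(intersect(top_x, top_y))
}

# Hypothetical heaviness values for five packages.
on_children   = c(p1 = 50, p2 = 40, p3 = 30, p4 = 20, p5 = 10)
on_downstream = c(p1 = 45, p2 = 42, p3 = 5,  p4 = 28, p5 = 1)
on_indirect   = c(p1 = 0,  p2 = 1,  p3 = 2,  p4 = 30, p5 = 25)

top_n_overlap(on_children, on_downstream, n = 3)   # 2 (p1 and p2)
top_n_overlap(on_children, on_indirect, n = 3)     # 1 (p3)
```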

<%= img(paste0(env$figure_dir, "/plot-compare-downstream-and-downstream2.png"), style="height:500px")%>

We think the reason for such a large overlap between the top packages with the highest heaviness on children and on downstream is that the downstream packages are mainly composed of child packages. To demonstrate this, for the 474 packages that are in both the top 500 list for heaviness on children and the top 500 list for heaviness on downstream, we plot the fraction of child packages among their downstream packages. The following plot clearly shows that for these top packages, the downstream packages are mostly child packages. For 76.4% of them, the downstream packages consist entirely of child packages, and for 91.1% of them, more than 60% of the downstream packages are child packages.

<%= img(paste0(env$figure_dir, "/plot-top-500-children-downstream-pct.png"), style="height:500px")%>