
Introduction

Topological data analysis is a relatively new area of data science which can compare and contrast data sets via non-linear global structure. The main tool of topological data analysis, persistent homology (Edelsbrunner, Letscher, and Zomorodian 2000; Zomorodian and Carlsson 2005), builds on techniques from the field of algebraic topology to describe shape features present in a data set (stored in a “persistence diagram”). Persistent homology has been used in a number of applications, including economics and neuroscience (see the Background section below for references).

For a broad introduction to the mathematical background and main tools of topological data analysis, see (Chazal and Michel 2017).

Distance and kernel functions are used to calculate differences and similarities respectively between objects in complicated spaces, and are the foundation of many machine learning algorithms which are used in traditional data science (Murphy 2012). Several papers have used distance and kernel computations tailored to persistence diagrams (Kerber, Morozov, and Nigmetov 2017; Le and Yamada 2018) in order to perform machine learning or inference tasks with groups of persistence diagrams, but to date no publicly available software in either R or python provides the functionality for these types of analyses.

Two main R packages exist for computing and comparing persistence diagrams – TDA (B. T. Fasy et al. 2021) and TDAstats (Wadhwa et al. 2019). While both packages have been used to carry out topological data analysis in R, neither package supports machine learning with groups of persistence diagrams because both lack a kernel function. TDAstats has an inference method; however, it is based on a non-standard distance function (making inferences drawn from this function unsubstantiated). TDA has a standard distance function but no inference function (and also stores persistence diagrams in a non-standard data structure, which requires additional preprocessing steps for further analysis).

In order to make the power of topological data analysis more readily available to data and machine learning practitioners, it would be very helpful to have a single software package that can

  1. deliver a fast engine for calculating persistence diagrams,
  2. convert persistence diagrams computed using the R packages TDA and TDAstats into a commonly-used data type for data analyses (a data frame),
  3. implement fast versions of both distance and kernel calculations for pairs of persistence diagrams,
  4. contribute tools for interpreting persistence diagrams, and
  5. provide parallelized methods for machine learning and inference for persistence diagrams.

In accomplishing these goals, TDApplied is a comprehensive and optimized package which permits researchers and data scientists to analyze multiple data sets using machine learning and inference via topological data analysis. This vignette documents the theory and application of TDApplied functions, on both simulated and neuroscience data.

Software Review

Topological data analysis has gained popularity over the past two decades since the original paper on persistent homology was published (Edelsbrunner, Letscher, and Zomorodian 2000), and two main R packages have been created for topological data analysis:

  1. The R package TDA (B. T. Fasy et al. 2021) is a wrapper for a number of C++ persistent homology packages, including Dionysus (Morozov 2017), GUDHI (Lacombe et al. 2019) and PHAT (Bauer, Kerber, and Reininghaus 2013), and provides both distance calculation capabilities between persistence diagrams and the ability to find representative data points of each topological feature in a persistence diagram (the points for each feature are called a “representative cycle”). Moreover, a method is provided for determining which topological features in a persistence diagram should be considered “real” from a statistical perspective.
  2. The R package TDAstats (Wadhwa et al. 2019) is a wrapper of the C++ persistent (co)homology engine Ripser (Bauer 2015; Silva and Vejdemo-Johansson 2011b), and provides an inference method for detecting differences between two groups of persistence diagrams (as in (Robinson and Turner 2017)). However, the distance calculation currently implemented in that package is not based on an accepted distance metric for persistence diagrams.

The paper (Somasundaram et al. 2021) compared the runtimes of the homology calculations in the R packages TDA and TDAstats, and concluded that certain persistent homology calculations are faster with TDAstats. The choice of which package to use may therefore vary by application, depending on whether speed is the priority or whether knowledge of which data points represent which topological feature is needed. Accordingly, TDApplied can accept as input persistence diagrams computed with either package in order to cover a wide variety of potential use-cases.

It is worthwhile to note that there are two other R packages for carrying out topological data analysis - TDAvec (Islambekov and Luchinsky 2022) (methods for vectorizing persistence diagrams) and TDAkit (You and Yu 2021) (clustering and dimension reduction methods for functional summaries of persistence diagrams, called persistence landscapes/silhouettes). Topological summaries like persistence landscapes have their own suite of statistical methods (Bubenik 2015; Bubenik and Dlotko 2017) due to their simpler underlying space compared to that of persistence diagrams (Robinson and Turner 2017). There are also several python packages dedicated to topological data analysis calculations, including scikit-TDA (Saul and Tralie 2019) (which computes persistence diagrams, and distances/kernels between them, using several python libraries), and giotto-tda (Tauzin et al. 2020) (which computes persistence diagrams and uses them to analyze time series data).

There are a number of shortcomings of available topological data analysis software. Firstly, in both python and R there is currently no package which allows for machine learning and inference of persistence diagrams, a limitation which greatly constrains the types of analyses that can be carried out. In R this is partially because there is no package for kernel calculations of persistence diagrams, but the very slow computation of distances between persistence diagrams in the TDA package also inhibits the practicality of distance-based inference procedures. Additionally, in R, the output of persistent homology calculations from the package TDA is a list with an element called “diagram” of class “diagram,” which is not compatible with data frame methods that form the basis for data analysis in R. On the other hand, TDAstats does not compute a published distance metric for persistence diagrams, making inferences drawn from its permutation_test function unclear. Overall, the non-standard data type returned by persistent homology calculations in TDA, the slow distance calculations in TDA and the non-published distance metric in TDAstats may be limiting the development of TDA applications in R.

Package TDApplied

The package TDApplied aims to accomplish the five goals outlined in the introduction. Firstly, the function diagram_to_df allows the conversion of the output of TDA persistent homology calculations to a data frame. Secondly, the functions diagram_distance and diagram_kernel allow for fast distance and kernel calculations respectively, and their counterparts distance_matrix and gram_matrix compute in parallel, for scalability, the (cross) distance and (cross) Gram matrices respectively. Thirdly, these distance and kernel calculations are used to perform machine learning and inference on persistence diagrams. Methods include

  1. dimension reduction with metric multidimensional scaling (MDS) and kernel principal components analysis (kpca),
  2. clustering with kernel k-means,
  3. regression and classification with kernel support-vector machines (ksvm), and
  4. inference with distance and kernel calculations looking for group differences and group independence, respectively.

The kernel machine learning methods implemented in TDApplied are wrappers of the flexible R package for kernel calculations kernlab (Karatzoglou et al. 2004), with some additional processing steps specific to persistence diagrams. Fourthly, a fast method for computing persistence diagrams via python is implemented in the PyH function. Finally, interpretative tools for persistence diagrams include the plot_diagram function for plotting persistence diagrams and the bootstrap_persistence_thresholds function for determining which topological features in a dataset should be considered “real.” In the subsequent sections we will describe these applications in more detail.

Background: Persistent Homology

The main tool of topological data analysis is called persistent homology (see (Edelsbrunner, Letscher, and Zomorodian 2000) for the introductory paper, and (Zomorodian and Carlsson 2005) for further computational details). Persistent homology has been applied in a variety of areas, including (but not limited to) economics (largely for the analysis of time series, for example see (Yen and Cheong 2021)) and neuroscience (see (al 2021) for a number of functional MRI applications).

Persistent homology starts with data points and a distance function. It assumes that these points were sampled from some kind of shape. This shape has certain features that exist at various scales, but sampling induces noise in these features. Persistent homology aims to describe certain mathematical features of this underlying shape by forming approximations to the shape at various distance scales. The mathematical features which are tracked are clusters (connected components), loops (circles/ellipses), voids (spheres), etc., and the “significance” of each feature is calculated (i.e. whether the feature is “real” or not). The homological dimensions of these features are 0, 1 and 2 respectively (higher-dimensional features can also be calculated). What is particularly interesting about these features is that they can tell us where our data is not, which is important information that other data analysis methods cannot provide.

The persistent homology algorithm proceeds in the following manner: first, if the input is a dataset and distance metric, then the distance matrix, storing the distance metric value of each pair of points in the dataset, is computed. Next, a parameter ϵ ≥ 0 is grown starting at 0, and at each ϵ value we compute a shape approximation of the dataset C_ϵ, called a simplicial complex (see (Edelsbrunner, Letscher, and Zomorodian 2000) or (Zomorodian and Carlsson 2005) for more details). We construct C_ϵ by connecting all pairs of points whose distance is at most ϵ. To encode higher-dimensional structure in these approximations, we also add a triangle between any triple of points which are all connected, a tetrahedron between any quadruple of points which are all connected, etc. Note that this process of forming a sequence of skeletal approximations is called a filtration, and other methods exist for forming the approximations (the one described here is the most commonly used, and is called the Vietoris-Rips complex).

At any given ϵ value, some topological features will exist in C_ϵ. As ϵ grows, the C_ϵ's are nested within each other, i.e. if ϵ_1 < ϵ_2 then every edge (triangle, tetrahedron, etc.) in C_{ϵ_1} will also be present in C_{ϵ_2}. Therefore, each topological feature will be “born” at some ϵ_birth value, and “die” at some ϵ_death value. Consider the example of a loop – a loop is “born” when the last connection around the circumference of the loop is made (at the ϵ value which is the largest distance between consecutive points around the loop), and the loop “dies” when the last connection across the loop's diameter is made, thereby filling in its hole.

Therefore, the output of persistent homology, a persistence diagram, has in each dimension one 2D point for each topological feature found in the filtration process, where the x-value of the point is the birth ϵ value and the y-value is the death ϵ value. This is why every point lies above the diagonal – features die after they are born! The difference between a point's y and x values, y − x, is called the “persistence” of the corresponding topological feature. Points which have high (large) persistence likely represent real topological features of the dataset, whereas points with low persistence likely represent topological noise.

A persistence diagram containing n topological features can be represented as a vector of length 2n. However, persistence diagrams can contain different numbers of features, and the ordering of the features is arbitrary. Therefore, there is no obvious vector representation of all persistence diagrams that can be used as the input of machine learning or statistical inference. Nevertheless, we can apply a number of these techniques to persistence diagrams provided we can quantify how near (similar) or far (distant) they are from each other, and describing suitable distance and similarity measures, with their accompanying analysis methods, is the content of the following section.

TDApplied Methods and Underlying Theory

In this section we will describe the various computational tools implemented in the package TDApplied to analyze persistence diagrams, both explaining the mathematics and providing functional examples. To run our examples we must start by loading the TDApplied package:

library("TDApplied")

Since TDApplied uses the output of TDA/TDAstats calculations (or its own persistent homology function, “PyH,” discussed later) as inputs to its functions, at least one of TDA or TDAstats should be installed (even if not attached) when using TDApplied. All examples, except those in the subsection on bootstrapping, will analyze simple diagrams which are random deviations of three persistence diagrams called D1, D2 and D3 (each with points in dimension 0 only), which we will plot with the TDApplied function plot_diagram:

D1 = data.frame(dimension = c(0),birth = c(2),death = c(3))
D2 = data.frame(dimension = c(0),birth = c(2,0),death = c(3.3,0.5))
D3 = data.frame(dimension = c(0),birth = c(0),death = c(0.5))
par(mfrow = c(1,3))
plot_diagram(D1,title = "D1",max_radius = 4,legend = F)
plot_diagram(D2,title = "D2",max_radius = 4,legend = F)
plot_diagram(D3,title = "D3",max_radius = 4,legend = F)

When desired, random Gaussian noise with a small variance will be added to the birth and death values of the points in these three diagrams (taking care to ensure the points always have appropriate birth and death values), which will be achieved with a function, generate_TDApplied_vignette_data, used only in this vignette.

Here is an example of making noisy copies of D1:
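(The vignette's own helper is not reproduced here; as a rough illustration only, a hypothetical stand-in with the same flavour could look like the following, where the function name and noise level are assumptions and not the package's or vignette's actual code.)

# hypothetical sketch of a noise-adding helper (not the vignette's actual generate_TDApplied_vignette_data)
# returns n1 + n2 + n3 diagrams - noisy copies of D1, D2 and D3 respectively
generate_noisy_copies <- function(n1,n2,n3,sd = 0.05){
  noisy <- function(D){
    # add Gaussian noise while keeping birth >= 0 and death >= birth
    D$birth <- pmax(D$birth + stats::rnorm(nrow(D),sd = sd),0)
    D$death <- pmax(D$death + stats::rnorm(nrow(D),sd = sd),D$birth)
    return(D)
  }
  return(c(lapply(seq_len(n1),function(i) noisy(D1)),
           lapply(seq_len(n2),function(i) noisy(D2)),
           lapply(seq_len(n3),function(i) noisy(D3))))
}

# three noisy copies of D1
g <- generate_noisy_copies(3,0,0)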

Distance Between Persistence Diagrams

As the persistence diagram is a descriptor of the underlying shape structure of a dataset, it can be useful to quantify the differences between pairs of persistence diagrams. There are several ways to compute distances between persistence diagrams in the same homological dimension (like dimension 0 for clusters, dimension 1 for loops, etc.). The two most common are called the 2-wasserstein and bottleneck distances (Kerber, Morozov, and Nigmetov 2017). These techniques find an optimal matching of the 2D points in their two input diagrams, and compute the cost of that optimal matching. A point from one diagram is allowed either to be paired (matched) with a point in the other diagram or with its own diagonal projection, i.e. the nearest point on the diagonal line y = x (matching a point to its diagonal projection essentially says that the feature is likely topological noise because it died very soon after it was born).

Allowing points to be paired with their diagonal projections both allows for matchings of persistence diagrams with different numbers of points (which is almost always the case in practice) and also formalizes the idea that some points in a persistence diagram represent noise. The “cost” value associated with a matching is given by either (i) the maximum of the infinity-norm distances between paired points, or (ii) the square root of the sum of squared infinity-norm distances between paired points. The cost of the optimal matching under cost (i) is called the bottleneck distance of persistence diagrams, and the cost of the optimal matching under cost (ii) is called the 2-wasserstein metric of persistence diagrams. Both distance metrics have been used in a number of applications, but the 2-wasserstein metric is able to find more fine-scale differences in persistence diagrams compared to the bottleneck distance. The problem of finding an optimal matching can be solved with the Hungarian algorithm, which is implemented in the R package clue (Hornik 2005).
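Written out under the definitions above (one common formulation; conventions vary slightly across references), with π ranging over matchings that may pair a point with its diagonal projection:

d_bottleneck(D1, D2) = inf_π max_x ‖x − π(x)‖_∞    and    d_W2(D1, D2) = inf_π ( Σ_x ‖x − π(x)‖_∞² )^(1/2)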

In the picture we can see that there is a “better” matching between D1 and D2 compared to D1 and D3, so the (wasserstein/bottleneck) distance value between D1 and D2 would be smaller than that of D1 and D3.

The wasserstein and bottleneck distances have been implemented in the TDApplied function diagram_distance. We can confirm that the distance between D1 and D2 is smaller than D1 and D3 for both distances:

# calculate 2-wasserstein distance between D1 and D2
diagram_distance(D1,D2,dim = 0,p = 2,distance = "wasserstein")
#> [1] 0.3905125

# calculate 2-wasserstein distance between D1 and D3
diagram_distance(D1,D3,dim = 0,p = 2,distance = "wasserstein")
#> [1] 0.559017

# calculate bottleneck distance between D1 and D2
diagram_distance(D1,D2,dim = 0,p = Inf,distance = "wasserstein")
#> [1] 0.3

# calculate bottleneck distance between D1 and D3
diagram_distance(D1,D3,dim = 0,p = Inf,distance = "wasserstein")
#> [1] 0.5

There is a generalization of the 2-wasserstein distance for any p ≥ 1, the p-wasserstein distance, which can also be computed using the diagram_distance function by varying the parameter p. For the computational details and a proof of correctness of the wasserstein and bottleneck distance calculations, see the appendix.
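For example, a 3-wasserstein distance between D1 and D2 could be computed as follows (output omitted here):

# calculate 3-wasserstein distance between D1 and D2 by setting p = 3
diagram_distance(D1,D2,dim = 0,p = 3,distance = "wasserstein")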

Another distance metric between persistence diagrams, which will be useful for kernel calculations, is called the Fisher information metric, d_FIM(D1, D2, σ) (details can be found in (Le and Yamada 2018)). The idea is to represent the two persistence diagrams as probability density functions, with a 2D Gaussian point mass centered at each point in both diagrams (including the diagonal projections of the points in the opposite diagram), all of variance σ² > 0, and calculate how much those distributions agree on their pdf value at each point in the plane (called their Fisher information metric).

Points in the rightmost plot which are close to white in color have the most similar pdf values in the two distributions, and would not contribute to a large distance value, however having more points with a red color would contribute to a larger distance value.

The diagram_distance function can also calculate the Fisher information metric between persistence diagrams:

# Fisher information metric calculation between D1 and D2 for sigma = 1
diagram_distance(D1,D2,dim = 0,distance = "fisher",sigma = 1)
#> [1] 0.02354779

# Fisher information metric calculation between D1 and D3 for sigma = 1
diagram_distance(D1,D3,dim = 0,distance = "fisher",sigma = 1)
#> [1] 0.08821907

Again, D1 and D2 are less different than D1 and D3 using the Fisher information metric.

Multidimensional Scaling of Persistence Diagrams

Dimension reduction is a task in machine learning which is commonly used for data visualization, removing noise in data, and decreasing the number of covariates in a model (which can be helpful in reducing overfitting). One common dimension reduction technique in machine learning is called multidimensional scaling (MDS) (Cox and Cox 2008). MDS takes as input an n by n distance (or dissimilarity) matrix D, computed from n points in a dataset, and outputs an embedding of those points into a Euclidean space of chosen dimension k which best preserves the inter-point distances. MDS is often used for visualizing data in exploratory analyses, and can be particularly useful when the input data points do not live in a shared Euclidean space (as is the case for persistence diagrams). Using the R function cmdscale from the package stats (R Core Team 2021), we can compute the optimal embedding of a set of persistence diagrams, with any of the three distance metrics, via the function diagram_mds. Here is an example of the diagram_mds function projecting nine persistence diagrams, three noisy copies sampled from each of D1, D2 and D3, into 2D space:

# create 9 diagrams based on D1, D2 and D3
g <- generate_TDApplied_vignette_data(3,3,3)

# calculate their 2D MDS embedding in dimension 0 with the bottleneck distance
mds <- diagram_mds(diagrams = g,dim = 0,p = Inf,k = 2,num_workers = 2)

# plot
par(mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE)
plot(mds[,1],mds[,2],xlab = "Embedding coordinate 1",ylab = "Embedding coordinate 2",
     main = "MDS plot",col = as.factor(rep(c("D1","D2","D3"),each = 3)),bty = "L")
legend("topright", inset=c(-0.2,0), 
       legend=levels(as.factor(c("D1","D2","D3"))), 
       pch=16, col=unique(as.factor(c("D1","D2","D3"))))

The MDS plot shows the clear separation between the three generating diagrams (D1, D2 and D3), and the embedded coordinates could be used for further downstream analyses.

Testing for Group Differences of Persistence Diagrams

One of the most important inference procedures in classical statistics is the analysis of variance (ANOVA), which can find differences in the means of groups of normally-distributed measurements (Casella, Berger, and Company 2002). Distributions of persistence diagrams and their means can be complicated (see (Turner 2013) and (Turner et al. 2014)). Therefore, a non-parametric permutation test has been proposed which can find differences between groups of persistence diagrams. Such a test was first proposed in (Robinson and Turner 2017), and some variations have been suggested in later publications. In (Robinson and Turner 2017), two groups of persistence diagrams are compared. The null hypothesis, H0, is that the diagrams from the two groups are generated from shapes with the same type and scale of topological features, i.e. they “come” from the same “shape.” The alternative hypothesis, HA, is that the underlying type or scale of the features differs between the two groups. In each dimension a p-value is computed, quantifying the evidence against H0 in that dimension. A measure of within-group distances (a “loss function”) is calculated for the two groups, and that measure is compared to a null distribution obtained by permuting the group labels.

This inference procedure is implemented in the permutation_test function, with several speedups and additional functionalities. Firstly, the loss function is computed in parallel for scalability since distance computations can be expensive. Secondly, we store distance calculations as we compute them because these calculations are often repeated. Additional functionality includes allowing for any number of groups (not just two) and allowing for a pairing between groups of the same size, as described in (Abdallah 2021). When a natural pairing exists between the groups (like if the groups represent persistence diagrams from the same subject of a study in different conditions) we can simulate a more realistic null distribution by restricting the way in which we permute group labels, achieving higher statistical power.

In order to demonstrate the utility of the permutation test we will detect differences between noisy copies of D1, D2, D3:

# permutation test between three diagrams
g1 <- generate_TDApplied_vignette_data(3,0,0)
g2 <- generate_TDApplied_vignette_data(0,3,0)
g3 <- generate_TDApplied_vignette_data(0,0,3)
perm_test <- permutation_test(g1,g2,g3,
                              num_workers = 2,
                              dims = c(0))
perm_test$p_values
#>          0 
#> 0.04761905

As expected, a difference was found (at the α=0.05 significance level) between the three groups.

The package TDAstats also has a function called permutation_test which is based on the same test procedure, however it uses an unpublished distance metric between persistence diagrams and does not use parallelization for scalability. As such, care must be taken if both TDApplied and TDAstats are attached in an R script to use the desired permutation_test function.

Finding Real Topological Features via Bootstrapping

While the focus of TDApplied is applied topological data analysis, and as such this vignette focuses on machinery for carrying out such analyses, a fundamental question in topological data analysis which must not be overlooked is how to tell if a point in a persistence diagram should be considered “real” (signal) or not (noise). This question has been addressed via a bootstrapping approach in (B. Fasy et al. 2014), which has been implemented in the TDA package. Due to the speed increase of TDApplied’s distance function (discussed in a later section), TDApplied has its own implementation of this bootstrap procedure.

The idea of the bootstrap procedure is as follows, where X is the input data set and α is the desired threshold for type 1 error:

  1. We first compute the diagram of X, D.
  2. Then we repeatedly sample, with replacement, the original data to obtain new datasets {X_1, …, X_n} and compute their persistence diagrams {D_1, …, D_n}.
  3. We calculate the bottleneck distance of each new diagram with the original, d(D_i, D), in each dimension separately.
  4. Finally we compute the 1 − α percentile of these values in each dimension.

These thresholds form a square-shaped “confidence interval” around each point in D. In particular, if t is the threshold found for dimension k then the confidence interval around a point (x, y) ∈ D (of dimension k) is the set of points {(x′, y′) : max(|x − x′|, |y − y′|) < t}. For example, if we sampled 50 points on the unit circle and calculated the bottleneck-threshold-based confidence interval around the single 1-dimensional point, we would get something like this:

In this setup, “real” points will be those whose (open) confidence intervals do not overlap the diagonal line, where birth and death are equal. Note that the persistence threshold is twice the bottleneck distance threshold calculated above: the (Euclidean) distance from a point to the bottom-right corner of its box is √2·t, and for the box to stay above the diagonal this must be no greater than the (Euclidean) distance from the point to its diagonal projection, which is (y − x)/√2. Therefore, for the point to be considered real we need √2·t ≤ (y − x)/√2, implying that the persistence of the point, y − x, must be no less than 2t, twice the bottleneck threshold.

The function bootstrap_persistence_thresholds can be used to determine these persistence thresholds. Here is an example for a circle dataset (which is not the same as the one from above), and the results can be plotted with the plot_diagram function:

# sample 50 points from the unit circle
circle <- TDA::circleUnif(n = 50)

# calculate the bootstrapped persistence thresholds using 2 cores
# and 30 bootstrap iterations
thresh <- bootstrap_persistence_thresholds(X = circle,FUN = "ripsDiag",
                                 maxdim = 1,thresh = 2,num_workers = 2,
                                 num_samples = 30)
diag <- thresh$diag

# plot original diagram and thresholded diagram side-by-side:
par(mfrow = c(1,2))

plot_diagram(diag,title = "Circle diagram")

plot_diagram(diag,title = "Circle diagram with thresholds",
             thresholds = thresh$thresholds)

The bootstrap procedure can be done in parallel or sequentially, depending on which function is specified to calculate the persistence diagrams. There are three possible such functions - TDAstats’ calculate_homology, TDA’s ripsDiag and TDApplied’s PyH (described in later sections). Based on our simulations, the order of these functions from fastest to slowest is PyH, calculate_homology and ripsDiag. Moreover, both ripsDiag and PyH allow for the calculation of representative (co)cycles (i.e. the approximate location in the data of each topological feature), whereas calculate_homology does not. Therefore, our recommendation is to pick the function according to the following criteria: if a user can use the PyH function then it should be used in all cases, except when the input data set is small, the machine has many available cores and the number of desired bootstrap iterations is large. Otherwise, use calculate_homology for speed, or ripsDiag if the other two functions are not available.

Kernels Between Persistence Diagrams

A kernel function is a special (positive semi-definite) symmetric similarity measure between objects in some complicated space which can be used to project data into a space suitable for machine learning (Murphy 2012). Some examples of machine learning techniques which can be “kernelized” when dealing with complicated data are k-means (kernel k-means), principal components analysis (kernel PCA), and support vector machines (SVM) which are inherently based on kernel calculations.

There have been, to date, four main kernels proposed for persistence diagrams. In TDApplied the persistence Fisher kernel (Le and Yamada 2018) has been implemented because of its practical advantages over the other kernels – smaller cross-validation SVM error on a number of test data sets and a faster method for cross validation. For information on the other three kernels see (Kusano, Fukumizu, and Hiraoka 2018), (Carrière, Cuturi, and Oudot 2017), and (Reininghaus et al. 2014).

The persistence Fisher kernel is computed directly from the Fisher information metric between two persistence diagrams: let σ > 0 be the parameter for d_FIM, and let t > 0. Then the persistence Fisher kernel is defined as k_PF(D1, D2) = exp(−t·d_FIM(D1, D2, σ)). Computing the persistence Fisher kernel can be achieved with the diagram_kernel function in TDApplied:

# calculate the kernel value between D1 and D2 with sigma = 2, t = 2
diagram_kernel(D1,D2,dim = 0,sigma = 2,t = 2)
#> [1] 0.9872455
# calculate the kernel value between D1 and D3 with sigma = 2, t = 2
diagram_kernel(D1,D3,dim = 0,sigma = 2,t = 2)
#> [1] 0.9707209

As before, D1 and D2 are more similar than D1 and D3.

Kernel k-means of Persistence Diagrams

Kernel k-means (Dhillon, Guan, and Kulis 2004) is a method which can find hidden groups in complex data, extending regular k-means clustering (Murphy 2012) via a kernel. A “kernel distance” is calculated between a persistence diagram and a cluster center using only the kernel function, and the algorithm converges like regular k-means. This algorithm is implemented in the function diagram_kkmeans as a wrapper of the kernlab function kkmeans. Moreover, a prediction function predict_diagram_kkmeans can be used to find the nearest cluster labels for a new set of diagrams. Here is an example clustering three groups of noisy copies from D1, D2 and D3:

# create noisy copies of D1, D2 and D3
g <- generate_TDApplied_vignette_data(3,3,3)
                              
# calculate kmeans clusters with centers = 3, and sigma = t = 2
clust <- diagram_kkmeans(diagrams = g,centers = 3,dim = 0,t = 2,sigma = 2,num_workers = 2)

# display cluster labels
clust$clustering@.Data
#> [1] 3 3 3 1 1 1 2 2 2

As we can see, the diagram_kkmeans function was able to correctly separate the three generating diagrams D1, D2 and D3 (the cluster labels are arbitrary and therefore may not be 1,1,1,2,2,2,3,3,3; however, they induce the correct partition).

If we wish to predict the cluster label for new persistence diagrams (computed via the largest kernel value to any cluster center), we can use the predict_diagram_kkmeans function as follows:

# create nine new diagrams
g_new <- generate_TDApplied_vignette_data(3,3,3)

# predict cluster labels
predict_diagram_kkmeans(new_diagrams = g_new,clustering = clust,num_workers = 2)
#> [1] 3 3 3 1 1 1 2 2 2

This function correctly predicted the cluster for each new diagram (assigning each diagram to the cluster label by D1, D2 or D3, depending on which diagram it was generated from).

Kernel Principal Components Analysis of Persistence Diagrams

PCA is another dimension reduction technique in machine learning, but can be preferable compared to MDS in certain situations because it allows for the projection of new data points onto an old embedding model (Murphy 2012). For example, this can be important if PCA is used as a pre-processing step in model fitting. Kernel PCA (kPCA) (Schölkopf, Smola, and Müller 1998) is an extension of regular PCA which uses a kernel to project complex data into a high-dimensional Euclidean space and then uses PCA to project that data into a low-dimensional space. The diagram_kpca method computes the kPCA embedding of a set of persistence diagrams, and the predict_diagram_kpca function can be used to project new diagrams using a pre-trained kPCA model. Here is an example using a group of noisy copies of D1, D2 and D3:

# create noisy copies of D1, D2 and D3
g <- generate_TDApplied_vignette_data(3,3,3)

# calculate their 2D PCA embedding with sigma = t = 2
pca <- diagram_kpca(diagrams = g,dim = 0,t = 2,sigma = 2,features = 2,num_workers = 2)

# plot
par(mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE)
plot(pca$pca@rotated[,1],pca$pca@rotated[,2],xlab = "Embedding coordinate 1",
     ylab = "Embedding coordinate 2",main = "PCA plot",
     col = as.factor(rep(c("D1","D2","D3"),each = 3)))
legend("topright",inset = c(-0.2,0), 
       legend=levels(as.factor(c("D1","D2","D3"))), pch=16, 
       col=unique(as.factor(c("D1","D2","D3"))))

The function was able to recognize the three groups, and the embedding coordinates can be used for further downstream analysis. However, an important advantage of kPCA over MDS is that in kPCA we can project new points onto an old embedding using the predict_diagram_kpca function:

# create nine new diagrams
g_new <- generate_TDApplied_vignette_data(3,3,3)

# project new diagrams onto old model
new_pca <- predict_diagram_kpca(new_diagrams = g_new,embedding = pca,num_workers = 2)

# plot
par(mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE)
plot(new_pca[,1],new_pca[,2],xlab = "Embedding coordinate 1",
     ylab = "Embedding coordinate 2",main = "PCA prediction plot",
     col = as.factor(rep(c("D1","D2","D3"),each = 3)))
legend("topright",inset = c(-0.2,0), 
       legend=levels(as.factor(c("D1","D2","D3"))), pch=16, 
       col=unique(as.factor(c("D1","D2","D3"))))

As we can see, the original three groups, and their approximate locations in 2D space, are preserved during prediction.

Kernel Support Vector Machines of Persistence Diagrams

SVMs (Murphy 2012) are one of the most popular machine learning techniques for regression and classification tasks. SVMs use a kernel function to project complex data into a high-dimensional space and then find a sparse set of training examples, called “support vectors,” which maximally linearly separate the outcome variable classes (or yield the highest explained variance in the case of regression).

SVMs have been implemented in the function diagram_ksvm, tailored for input datasets which contain pairs of persistence diagrams and their outcome variable labels. A prediction method is supplied called predict_diagram_ksvm which can be used to predict the label value of a set of new persistence diagrams given a pre-trained model. A parallelized implementation of cross-validation model-fitting is used based on the remarks in (Le and Yamada 2018) for scalability (which avoids needlessly recomputing persistence Fisher information metric values). Here is an example of fitting an SVM model on a list of persistence diagrams for a classification task (guessing whether the diagram comes from D1, D2 or D3):

# create thirty noisy copies of D1, D2 and D3
g <- generate_TDApplied_vignette_data(10,10,10)

# create response vector
y <- as.factor(rep(c("D1","D2","D3"),each = 10))

# fit model with cross validation
model_svm <- diagram_ksvm(diagrams = g,cv = 2,dim = c(0),
                          y = y,sigma = c(1,0.1),t = c(1,2),
                          num_workers = 2)

We can use the function predict_diagram_ksvm to predict new diagrams like so:

# create nine new diagrams
g_new <- generate_TDApplied_vignette_data(3,3,3)

# predict
predict_diagram_ksvm(new_diagrams = g_new,model = model_svm,num_workers = 2)
#> [1] D1 D1 D1 D2 D2 D2 D3 D3 D3
#> Levels: D1 D2 D3

As we can see, the best SVM model was able to separate the three diagrams. We can gain more information about the best model found during model fitting, and the cross-validation results, by accessing different list elements of model_svm.

Testing for Independence Between Two Groups of Paired Persistence Diagrams

An important question when presented with two groups of paired persistence diagrams is determining if the pairings are independent or not. A procedure was described in (Gretton et al. 2007) which can be used to answer this question using kernel computations, and importantly uses a parametric null distribution. The null hypothesis for this test is that the groups are independent, and the alternative hypothesis is that the groups are not independent. A test statistic called the Hilbert-Schmidt independence criterion is calculated, and its value is compared to a gamma distribution whose parameters can be estimated from the data.

This inference procedure has been implemented in the independence_test function, and returns the p-value of the test in each desired dimension of the diagrams (among other additional information). We would expect to find no dependence between noisy copies of D1, D2 and D3, since each copy is generated randomly:

# create 10 noisy copies of D1 and D2
g1 <- generate_TDApplied_vignette_data(10,0,0)
g2 <- generate_TDApplied_vignette_data(0,10,0)

# do independence test with sigma = t = 1
indep_test <- independence_test(g1,g2,dims = c(0),num_workers = 2)
indep_test$p_values
#>         0 
#> 0.1354271

The p-value of this test would not be significant at any typical significance threshold, reflecting the fact that there is no real (i.e. non-spurious) dependence between the two groups, as expected.

Fast Persistence Diagram Calculations Using Reticulate

While the focus of TDApplied is carrying out applied topological data analysis with persistence diagrams computed with the R packages TDA and TDAstats, TDApplied also provides its own persistent cohomology engine via python; persistent cohomology produces the exact same output as persistent homology, just much faster (Silva and Vejdemo-Johansson 2011b, 2011a). The ripser module in python is a wrapper for the same persistent cohomology engine as the R package TDAstats, which until now has provided the fastest persistence diagram calculation function in R (Somasundaram et al. 2021). However, it was observed by the authors that the python implementation seemed faster, even when called from R via the reticulate package. Therefore, we created the PyH function as a wrapper of the python ripser module for Vietoris-Rips filtration persistent cohomology.

PyH Requirements and Usage

There are three prerequisites that must be satisfied in order to use the PyH function:

  1. The reticulate package must be installed.
  2. Python must be installed and configured to work with reticulate.
  3. The ripser python module must be installed, via reticulate::py_install("ripser").

Note that the installation status of python is checked using the function reticulate::py_available(), which according to several online forums does not always behave as expected. If error messages about python not being installed occur when using TDApplied functions, then we recommend consulting online resources to ensure that the py_available function returns TRUE on your system. Due to the complicated dependencies required to use the PyH function, it is only an optional function in the package, and therefore the reticulate package is only listed as a suggested dependency of TDApplied.
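As a rough sketch (using only standard reticulate functions, not the package's internal checks), the three prerequisites could be verified by hand before importing the module:

# manually check the PyH prerequisites (a sketch; import_ripser performs similar checks internally)
prerequisites_ok <- requireNamespace("reticulate",quietly = TRUE) &&
  reticulate::py_available(initialize = TRUE) &&
  reticulate::py_module_available("ripser")
prerequisites_ok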

For an example use of PyH we will use a pre-loaded dataset called “circ,” which is stored as a data frame in this vignette.
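(The stored circ object is not reproduced here; as a hypothetical stand-in for readers following along without it, a comparable data frame could be generated as below. This is an assumption for illustration, not the vignette's actual data, so the outputs shown later correspond to the stored object rather than to this stand-in.)

# hypothetical stand-in for the vignette's "circ" data frame: 50 noisy points near the unit circle
set.seed(123)
theta <- stats::runif(50,min = 0,max = 2*pi)
circ <- data.frame(x = cos(theta) + stats::rnorm(50,sd = 0.05),
                   y = sin(theta) + stats::rnorm(50,sd = 0.05))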

We would then calculate the persistence diagram as follows:

# import the ripser module
ripser <- import_ripser()

# calculate the persistence diagram
diag <- PyH(X = circ,maxdim = 1,thresh = 1,ripser = ripser)

The ripser module is imported outside of the PyH function using the import_ripser function, because this operation can be slow, and if many persistence diagrams are to be computed then repeating the import each time would be prohibitively costly. Additionally, the import_ripser function checks the three prerequisites for PyH listed earlier in this section, to avoid repeating these checks. The import_ripser function is the same as reticulate::import("ripser"), but with the extra checks. In the TDApplied testing it was verified that the PyH output matches the TDAstats calculate_homology output for the same input data and parameters, with one small difference: in the python ripser calculation there is always one connected component (i.e. dimension 0 topological feature) with infinite death value, whereas this feature is omitted in the TDAstats calculation. The user can decide whether to omit this feature from the outputted diagram (the default) or to include it, by setting the ignore_infinite_cluster parameter to TRUE or FALSE respectively.
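For instance, assuming the parameter behaves as described above, recomputing the diagram while keeping the infinitely-persisting connected component would look like:

# recompute the diagram, keeping the infinite-death connected component
diag_with_inf <- PyH(X = circ,maxdim = 1,thresh = 1,ripser = ripser,
                     ignore_infinite_cluster = FALSE)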

Representative Cocycles

One of the advantages of the R package TDA over TDAstats is its ability to calculate representative cycles in the data, i.e. localizing the persistence diagram topological features in the input data set. This can permit deep analyses of the original data set by finding particular types of features spanned by certain subsets of data points. For example, a representative cycle of a 1-dimensional topological feature would be a set of edges between data points which lie along that feature (a loop). The PyH function can also find representative (co)cycles in its input data, which are returned if the calculate_representatives parameter is set to TRUE. In that case, the PyH function returns a list with a data frame called “diagram,” containing the persistence diagram, and a list called “representatives.” The representatives element has one element for each dimension in the persistence diagram, with one matrix/array for each point in the persistence diagram of that dimension (except for dimension 0, where the list is always empty). The matrix for a d-dimensional feature (1 for loops, 2 for voids, etc.) has d+1 columns, where row i contains the row indices in the data set of the data points in the i-th substructure in the representative (a substructure of a loop would be an edge, a substructure of a void would be a triangle, etc.). Here is an example where we calculate the representative cocycles of our circ dataset:

# ripser has already been imported, so calculate diagram with representatives
diag <- PyH(circ,maxdim = 1,thresh = 2,ripser = ripser,calculate_representatives = T)

# identify the loops in the diagram
diag$diagram[which(diag$diagram$dimension == 1),]
#>    dimension     birth    death
#> 50         1 0.6008424 1.738894
# show the representative for the loop, just the first five rows
diag$representatives[[2]][[1]][1:5,]
#>      [,1] [,2]
#> [1,]   10    7
#> [2,]   36    7
#> [3,]   10    4
#> [4,]   17    7
#> [5,]   32    7

The representative of the one loop contains the edges found to be present in the loop. We could iterate over the representative for the loop to find all the data points in that representative:

unique(c(diag$representatives[[2]][[1]][,1],diag$representatives[[2]][[1]][,2]))
#>  [1] 10 36 17 32 44 30 29 40 14 27 39 28 11 31  9 25  7 41  4 16  8 49 21 34  3
#> [26]  2

Since the circ dataset is two-dimensional, we can actually plot the loop representative according to the following process:

  1. Pick a threshold ϵ less than the death radius of the cocycle.
  2. Plot all edges between pairs of points in circ of distance no more than ϵ.
  3. Highlight all edges in the representative.

Since the death radius of the main cocycle is 1.738894, we can use the following code to plot the cocycle at the threshold value 1.7:

# plot the data and draw all edges between pairs of points at distance at most 1.7
plot(x = circ$x,y = circ$y,xlab = "x",ylab = "y",main = "circ with representative")
for(i in 1:nrow(circ))
{
  for(j in 1:nrow(circ))
  {
    pt1 <- circ[i,]
    pt2 <- circ[j,]
    if(sqrt((pt1[[1]] - pt2[[1]])^2 + (pt1[[2]] - pt2[[2]])^2) <= 1.7)
    {
      graphics::lines(x = c(pt1[[1]],pt2[[1]]),y = c(pt1[[2]],pt2[[2]]))
    }
  }
}

# highlight, in red, the edges belonging to the representative cocycle
for(i in 1:nrow(diag$representatives[[2]][[1]]))
{
  pt1 <- circ[diag$representatives[[2]][[1]][i,1],]
  pt2 <- circ[diag$representatives[[2]][[1]][i,2],]
  if(sqrt((pt1[[1]] - pt2[[1]])^2 + (pt1[[2]] - pt2[[2]])^2) <= 1.7)
  {
    graphics::lines(x = c(pt1[[1]],pt2[[1]]),y = c(pt1[[2]],pt2[[2]]),col = "red")
  }
}

This plot shows the main loop that was found via persistent cohomology, and the representative is a set of edges (in this 2D case) whose removal would destroy the loop. A more intuitive notion of a “representative loop” can be found with persistent homology, for instance using the TDA ripsDiag function with the cycleLocation parameter set to TRUE. Another example of the representative cocycle can be found using another threshold value, for instance the (rounded up) birth value of 0.6009:
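A sketch of how that plot could be produced is given below, wrapping the edge-drawing loops above into a hypothetical helper (not a package function) that takes the threshold as a parameter:

# hypothetical helper: re-draw circ and its representative cocycle at an arbitrary threshold
plot_representative <- function(threshold){
  plot(x = circ$x,y = circ$y,xlab = "x",ylab = "y",
       main = paste("circ with representative, threshold =",threshold))
  dists <- as.matrix(stats::dist(circ))
  # draw all edges between pairs of points at distance no more than the threshold
  for(i in 1:(nrow(circ) - 1))
  {
    for(j in (i + 1):nrow(circ))
    {
      if(dists[i,j] <= threshold)
      {
        graphics::lines(x = circ[c(i,j),1],y = circ[c(i,j),2])
      }
    }
  }
  # highlight, in red, the representative's edges which are present at this threshold
  rep_edges <- diag$representatives[[2]][[1]]
  for(i in 1:nrow(rep_edges))
  {
    if(dists[rep_edges[i,1],rep_edges[i,2]] <= threshold)
    {
      graphics::lines(x = circ[rep_edges[i,],1],y = circ[rep_edges[i,],2],col = "red")
    }
  }
}

plot_representative(0.6009)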

Since we know that the data points in the representatives help form a loop in the original data set, we could perform a further exploratory analysis of circ to explore the periodic nature of the feature. While in this example we know that a loop will be present, this type of analysis could help find hidden latent structure in data sets. In this way PyH is as flexible as the TDA persistent homology function ripsDiag, although representative cocycles are less intuitive than representative cycles.

Benchmarking TDApplied Against Other Packages

In order to properly situate TDApplied in the landscape of software for topological data analysis, we will need to compare the speed of its calculations to similar calculations from other packages. In the following sections we will benchmark

  1. Persistent (co)homology calculations with TDApplied’s PyH and TDAstats’ calculate_homology.
  2. Wasserstein distances between persistence diagrams with TDApplied’s diagram_distance and TDA’s wasserstein.
  3. Wasserstein distances between persistence diagrams with TDApplied’s diagram_distance and the persim python module’s wasserstein.

Another pair of functions that we could benchmark is TDA’s bootstrapDiagram function and TDApplied’s bootstrap_persistence_thresholds, since they use the same bootstrap procedure. However, since both functions are built on parallelized distance calculations, comparing the runtime of TDA’s wasserstein and TDApplied’s diagram_distance functions should provide the same conclusion of which implementation is faster and more scalable. Nevertheless, one major advantage of the bootstrap_persistence_thresholds function over the bootstrapDiagram function is that the latter uses the function mclapply for parallelization, which does not work on Windows machines, whereas the parallelization of the former function is more flexible and does work on Windows machines.

An important note is that the three distance functions (TDApplied’s diagram_distance, TDA’s wasserstein and persim’s wasserstein) often did not agree on the distance value of certain calculations. Without knowing the exact details of the persim and TDA functions, one possible explanation is that the other two packages use alternative definitions for the wasserstein/bottleneck distances. For example, the definitions of wasserstein distance in (Robinson and Turner 2017) and (Zomorodian and Carlsson 2005) are different. Nevertheless, the diagram_distance function has been tested against examples with distances worked out by hand (these can be found in the test folder of the package), and a proof of algorithm correctness is given in the appendix.

The script that was used to perform benchmarking (and plotting the results) is available in the exec directory of this package, using PyH in certain cases and thus requiring python. A simple error check is included for the installation of the reticulate package, but the script will throw an error if reticulate is not properly connected with python. In all cases, benchmarking followed a similar procedure, involving sampling data from simple shapes (unit circles, unit spheres and tori with inner tube radius 0.25 and major radius 0.75) with various numbers of rows, and performing 10 benchmarking iterations at each number of rows. The mean and standard deviation of run time for the two functions were then calculated at each number of rows. All benchmarking was carried out on a Windows 10 64-bit machine, with an Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz processor with 8 cores and 64GB of RAM.

The benchmarking results are displayed graphically in the following three subsections. On top of comparing raw run time of the various functions, we also compared the scalability of the functions by dividing the runtimes of the functions and regressing the quotients onto the number of points in the input shapes. Overall we found that TDApplied’s functions are faster and scale better than R counterparts, and scale similarly to python counterparts. These results indicate that TDApplied is a powerful and efficient tool for applied topological data analysis in R.
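As a rough sketch of that scaling comparison (the data frame and column names here are assumptions for illustration, not objects produced by the benchmarking script):

# ratio = competitor runtime divided by TDApplied runtime, n = number of sampled points
# (benchmark_results is a hypothetical data frame holding the benchmarking output)
fit <- stats::lm(ratio ~ n,data = benchmark_results)
summary(fit) # a non-significant slope suggests similar scaling; the intercept estimates a constant speedup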

Benchmarking PyH Against TDAstats’ calculate_homology Function

The long calculation time of persistence diagrams is likely a large contributing factor to the slow adoption of topological data analysis for applied data science. Much research has been carried out in order to speed up these calculations, but the current state-of-the-art is the persistent cohomology algorithm (Silva and Vejdemo-Johansson 2011a). In R, the TDAstats’ calculate_homology function is the fastest option for persistence diagram calculations (Somasundaram et al. 2021), being a wrapper for the ripser persistent cohomology engine. TDApplied’s PyH function is a wrapper for the same engine, so we benchmarked their run time on circles, spheres and tori. The results were as follows:

As we can see the run time of PyH was significantly less than that of calculate_homology. We then used a linear model to analyze how the two functions scale compared to each other for all three shapes, regressing the ratio of TDAstats’ runtime divided by TDApplied’s runtime onto the number of points in the data set. For each of the three datasets, the model intercept was highly significant (p<0.001) and the coefficient estimate for number of points was not. This suggests that the runtime of the two functions scale similarly, but there was a constant speed increase of PyH which differed by shape (about 3 times faster for circles, 7 times faster for tori and 8 times faster for spheres). Overall, if python is available to a TDApplied user then the PyH function can provide a very fast means to calculate persistence diagrams in R making analyses containing many large persistent homology calculations much more feasible.

Benchmarking the TDApplied diagram_distance and TDA wasserstein Functions

Computing wasserstein (or bottleneck) distances between persistence diagrams is a key feature of some of the main topological data analysis software packages in R and python. However, these calculations can be very expensive, rendering practical applications of topological data analysis nearly unfeasible. Since TDAstats has implemented an unconventional distance calculation, we will benchmark TDApplied’s diagram_distance function against the TDA wasserstein function on spheres and tori, calculating their distance in dimensions 0, 1 and 2 and recording the total time. The results were as follows:

A linear model, regressing the ratio of TDA’s runtime divided by TDApplied’s runtime onto the number of points in the data set, found a significant positive coefficient of the number of points. This suggests that diagram_distance scales better than wasserstein, and the model estimated a 95x speed up for 1000 data points. These results suggest that distance calculations with TDApplied are faster and more scalable, making the applications of statistics and machine learning with persistence diagrams more feasible in R. This is why the TDA distance calculation was not used in the TDApplied package.

Benchmarking the TDApplied diagram_distance Function against persim’s wasserstein Function

While the functionality of python packages for topological data analysis is out of scope for an R package, in order to fully situate TDApplied in the landscape of topological data analysis software we will benchmark the diagram_distance function against its counterpart from the scikit-TDA collection of libraries, namely the wasserstein function from the persim python module. The R package reticulate (Ushey, Allaire, and Tang 2022) was used to carry out this benchmarking, via installing, importing and using the persim module. This benchmarking procedure also used spheres and tori, calculating distances in dimensions 0, 1 and 2, and the results were as follows:

The persim wasserstein function ran significantly faster than TDApplied’s diagram_distance function. However, a linear model regressing the runtime ratio of TDApplied vs. persim onto the number of points in the shape finds evidence that the two functions scale similarly, since the estimated coefficient for the number of points was not significant but the intercept (approximately 15) was highly significant. Nevertheless, the raw speed increase in python could be the basis for a very fast python counterpart to the TDApplied package in the future.

Case Study: Human Connectome Project

The human brain is a complex dynamical system which encodes and processes its outside environment to drive behavioral decisions. A very popular technology for studying the function of the brain, i.e. how it encodes and processes its environment, is called functional MRI (fMRI). fMRI has the ability to image the whole brain, where the intensity of each voxel (i.e. 3D volumetric pixel) is a proxy measure of the “activity” of that small volumetric brain region, over multiple time points giving both spatial and temporal information about brain activity. A common method for analyzing fMRI data is computing correlation matrices, which can either be spatial (correlating the vectors of activity, called spatial activity patterns, at different time points) or temporal (correlating the time courses of brain regions, called temporal functional connectivity (FC)).

In fMRI studies, data is collected from subjects who are either performing some task (for instance repeatedly viewing pictures from various object categories (Kriegeskorte, Mur, and Bandettini 2008)) or relaxing (fixating at the center of a screen, called “resting state”). This allows researchers to make inferences about which brain regions are responsible for which types of tasks or processes, making it an important tool for studying the brain. Moreover, fMRI (and other similar technologies) have become a popular venue for topological data analysis.

Topological data analysis has become a powerful tool for analyzing fMRI data due to persistent homology’s ability to detect topological signal in fMRI data (Ellis et al. 2019; Kang, Xu, and Morozov 2021) and to model high-dimensional communication between multiple brain regions, all without needing to choose an arbitrary threshold for “significant” correlations in FC (Sizemore et al. 2019; Giusti, Ghrist, and Bassett 2016; Lee et al. 2011; Gracia-Tabuenca et al. 2020). In these kinds of studies, persistent homology (or closely related tools) has mainly been used to find the topology of temporal FC. However, such approaches can discard much of the temporal information in the data, and another topological data analysis tool called mapper (Singh, Mémoli, and Carlsson 2007) has instead been used to analyze spatial correlation matrices to find structure in fMRI spatial activity patterns (Saggar et al. 2018). Unlike persistent homology, however, mapper is not designed to find periodic structures in data, which in fMRI may encode essential periodic patterns in temporal brain activity. Therefore, we will use persistent homology and TDApplied to perform an exploratory topological analysis of the loops found in spatial correlation matrices in data from one of the most important sources of high-quality neuroimaging data - the Human Connectome Project (al. 2016).

The Human Connectome Project contains extensively preprocessed neuroimaging data from about 1200 subjects, allowing researchers to make powerful inferences about the function of the human brain. Of the data collected, fMRI was used to record brain activity over time while participants completed 9 tasks. The tasks were two resting-state scans, an emotion, gambling, language, working memory, relational, social and motor task, and for each task there were two scans (resulting in 18 scans per subject). The HCP fMRI data was also processed to convert voxels into surface nodes - points, on a mesh of the brain’s surface geometry, which are comparable across subjects. The resting state scans were longer than the other task scans, which normally necessitates the use of analysis methods which abstract out temporal information (like temporal FC). However, by using topology we can interrogate the geometry of the spatial activity patterns, which we call mental state topology, without losing potentially task-related temporal information. It was found that mental state topologies in the HCP dataset contained robust 0-dimensional topological features which correlated with various measures of personality and behavior (Anderson et al. 2018). On the other hand, temporal FC can act as a “fingerprint” for identifying subjects in the HCP dataset (Finn et al. 2015). So what role can loops of mental-state topology play in distinguishing between subjects and tasks in the HCP dataset?

To answer this question we analyzed the fMRI data of 100 HCP subjects with TDApplied by

  1. calculating time-point by time-point correlation dissimilarity matrices of each fMRI dataset for each subject,
  2. computing thresholded persistence diagrams of dimension 1 from those dissimilarity matrices (using the bootstrap method), and
  3. using machine learning and inference functions from TDApplied to analyze the resulting collection of diagrams (a rough code sketch of the first two steps follows this list).
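As a rough illustration of the first two steps, the following sketch computes the correlation dissimilarity matrix and a bootstrap-thresholded dimension 1 persistence diagram for a single (hypothetical) scan. The file name, argument names and return structure used for bootstrap_persistence_thresholds are assumptions for illustration; the exec script contains the exact calls used in the analysis.

library(TDApplied)

# fMRI scan stored as a (time points) x (surface nodes) matrix - file name is hypothetical
scan <- as.matrix(read.csv("subject1_rest1.csv"))

# Step 1: time-point by time-point correlation dissimilarity matrix
D <- 1 - cor(t(scan))

# Step 2: dimension 1 persistence diagram, thresholded with the bootstrap
# (argument names below are assumptions about the function's interface)
boot <- bootstrap_persistence_thresholds(X = D,
                                         maxdim = 1,
                                         thresh = max(D),
                                         distance_mat = TRUE,
                                         num_samples = 30,
                                         return_subsetted = TRUE,
                                         return_diag = TRUE)
diag_rest1 <- boot$subsetted_diag # only the loops deemed "real" by the bootstrap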

The script used to generate these results, as well as the subject IDs used, can be found in the exec directory of the package. Since the persistence Fisher kernel takes two parameters, t and σ, we chose a grid over both parameters to perform the independence tests. In the paper (Le and Yamada 2018), the σ values 0.001, 0.01, 0.1, 1, 10, 100 and 1000 were used for an SVM task. Since our input data had maximum value 1 (due to normalization) we used only the three smallest values, 0.001, 0.01 and 0.1. Also in that paper, values of t were selected by taking the 1, 2, 5, 10, 25 and 50 percentile values of the Fisher information metric between all of the diagrams (for each value of σ), and we used the same procedure to select values of t (after subsetting for only positive values). The parameters that resulted from this process are stored in the file kernel_parameters.csv in the vignette directory of the package.
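For reference, a sketch of this parameter grid construction is below, assuming diagrams is a list of the thresholded persistence diagrams and that diagram_distance can return the Fisher information metric via its distance and sigma arguments (the argument names are assumptions; in practice the pairwise computations were parallelized).

library(TDApplied)

sigmas <- c(0.001, 0.01, 0.1)
probs <- c(0.01, 0.02, 0.05, 0.10, 0.25, 0.50)
pairs <- combn(length(diagrams), 2)

kernel_parameters <- do.call(rbind, lapply(sigmas, function(s) {
  # Fisher information metric between all pairs of diagrams for this sigma
  fim <- apply(pairs, 2, function(ij) {
    diagram_distance(diagrams[[ij[1]]], diagrams[[ij[2]]],
                     dim = 1, distance = "fisher", sigma = s)
  })
  # t values are the chosen percentiles, keeping only positive values
  t_vals <- as.numeric(quantile(fim, probs = probs))
  data.frame(sigma = s, t = t_vals[t_vals > 0])
}))

# write.csv(kernel_parameters, "kernel_parameters.csv", row.names = FALSE)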

Visualizing mental state topologies

What does a mental state topology, or a collection of them, represent? Loops in a mental state topology are sets of spatial activity patterns which vary periodically along a single neurological dimension of activity. Collections of mental state topologies, like the combination of all persistence diagrams from the resting state 1 task across all 100 subjects, could encode both a characteristic number of loops and representative neurological dimensions of loops across the whole collection. In this section we visualize combined diagrams across subjects and tasks, and construct representative loops from all the resting state 1 mental state topologies.

We visualized the persistence diagrams of subjects and tasks by combining all the diagrams for that subject/task and plotting a probability distribution induced by adding a Gaussian point mass centered at each point (like in the Fisher information metric) with σ values 0.001, 0.01 and 0.1. Plots with σ=0.1 (and σ=0.01 to a lesser degree) were the least informative because those σ values were comparable to the scale of the data. Here are the distribution plots for three subjects with σ=0.001:

As we can see the distributions were very different between these three subjects. Next we visualized the nine tasks (again with σ=0.001):

Again we found that the tasks had visually different distributions, although the two resting state scans were fairly similar, as we would expect, with two nearby clusters of loops.
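A distribution plot of this kind can be sketched with a simple sum of Gaussian densities over the points of a combined diagram; the helper below is an illustration only (combined is assumed to be a data frame of dimension 1 points with birth and death columns), not the plotting code used for the figures.

plot_diagram_distribution <- function(combined, sigma = 0.001, n_grid = 100) {
  lims <- range(c(combined$birth, combined$death))
  xs <- seq(lims[1], lims[2], length.out = n_grid)
  ys <- xs
  # sum an isotropic Gaussian point mass centered at each diagram point
  dens <- outer(xs, ys, Vectorize(function(x, y) {
    sum(dnorm(x, mean = combined$birth, sd = sigma) *
        dnorm(y, mean = combined$death, sd = sigma))
  }))
  image(xs, ys, dens, xlab = "birth", ylab = "death",
        main = paste0("sigma = ", sigma))
  abline(a = 0, b = 1) # the diagonal birth = death
}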

But what do these mental state loops represent in terms of brain activity? As an illustration we considered the resting state 1 persistence diagram and used kmeans clustering to find two “basis” resting state 1 loops which described all of the loops well:
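A rough sketch of this clustering step is shown below, assuming rest1_combined is a (hypothetical) data frame of the pooled dimension 1 (birth, death) points across all subjects' resting state 1 diagrams; the actual selection procedure is in the exec script.

set.seed(123)
km <- kmeans(rest1_combined[, c("birth", "death")], centers = 2, nstart = 20)

# take the diagram point closest to each cluster center as a "basis" loop
basis_loops <- lapply(1:2, function(i) {
  pts <- rest1_combined[km$cluster == i, c("birth", "death")]
  d <- sqrt((pts$birth - km$centers[i, "birth"])^2 +
            (pts$death - km$centers[i, "death"])^2)
  pts[which.min(d), ]
})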

These two loops came from two subjects’ resting state 1 data, but note that topologically similar loops (in terms of birth and death values) may lie in different parts of the data (i.e. represent different loops of spatial activity patterns). We calculated the representative (co)cycles of these two loops, used cmdscale to project the representatives into 2D (needing two dimensions for the first loop and four dimensions for the second) and selected top, bottom, left and right points along the loops (in red):

Since the highlighted points from the plots each represent a spatial pattern of neural activity at some time point, we then plotted these spatial patterns around the circumference of two circles (however, note that the loops did not form perfect circles in the data). Note that when the two brain hemispheres (i.e. sides) are plotted side-by-side, the left hemisphere appears on the right and the right hemisphere on the left. Prior to plotting, the fMRI data was first normalized to have temporal mean 0 and standard deviation 1 for each surface node. The spatial patterns were then smoothed by averaging the spatial signal within a 5mm radius of each surface node, and the smoothed signal was thresholded at an absolute value of at least 0.75. We plotted the spatial patterns with boxes around areas which appeared to exhibit periodic behavior around the loops:
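A minimal sketch of this preprocessing, assuming scan is a (time points) x (surface nodes) matrix, coords is a (surface nodes) x 3 matrix of node coordinates in mm and t_idx is the time point being plotted (all hypothetical names), is:

normalized <- scale(scan) # temporal mean 0 and standard deviation 1 per node
pattern <- normalized[t_idx, ]

# smooth by averaging the signal of nodes within a 5mm radius of each node
smoothed <- vapply(seq_len(ncol(normalized)), function(v) {
  d <- sqrt(colSums((t(coords) - coords[v, ])^2))
  mean(pattern[d <= 5])
}, numeric(1))

# threshold the smoothed signal by absolute value at least 0.75
smoothed[abs(smoothed) < 0.75] <- 0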

Loop 1 exhibited more drastic neural activity (both high and low) compared to loop 2, which showed more localized areas of low activity. We compared the locations of periodic activity (in the boxes) to a well-known parcellation of the brain into 7 resting-state networks (Yeo et al. 2011):

The periodic behavior in loop 1 appeared in the activation of the left hemisphere dorsal attention network (green box, high activity on top and bottom, lower activity on the sides) and in the deactivation of the right hemisphere default mode and ventral attention networks (red box, high activity on the right, medium on the top and bottom, low on the left). The periodic behavior in loop 2 appeared in the deactivation of the right hemisphere default mode and ventral attention networks (pink box, low activity on top and bottom, higher activity on the sides). Therefore only three of the seven resting-state networks may be sufficient to describe the group-level resting-state data.

Developing techniques for determining what the circular dimension of such a loop represents in terms of neural activity, and determining if this dimension is correlated with time/study design (i.e. the temporal sequence of events in the task) for resting-state/task-based fMRI, is left to future work.

Comparing subjects and tasks with inference

Some major functions of healthy human brains are conserved across people, but there is still much inter-subject variability (different people can respond differently to viewing the same image). But are task-induced mental state topologies conserved across subjects or tasks? To answer this question we first analyzed distributional properties of the persistence diagrams across subjects and tasks, and then compared the diagrams between subjects and between tasks using TDApplied’s permutation and independence tests.

We first analyzed basic properties of the persistence diagrams - the number of loops and the mean persistence of loops, both across subjects and tasks. The overall distribution of the number of loops across all diagrams was:

The bootstrap thresholding was evidently very conservative, with about 73% of the diagrams containing no loops. This is not to say that those diagrams were topologically trivial - on the contrary, they may have contained non-trivial higher-dimensional topological structure, or task-correlated loops may exist at smaller (i.e. less persistent) scales. Assessing the validity of either explanation would require significantly more computational resources than were available at the time of writing, and we leave this to future work.

We may draw two important conclusions from this graphic. Firstly, the fact that many diagrams did not have any loops provides evidence that the topological structure that was found was not driven by physiological sources, such as heart rate and breathing, which are known to drive signal in fMRI data (Caballero-Gaudes and Reynolds 2017). Secondly, resting-state analyses are often concerned with the activity of (about) seven major “resting-state networks” found in (Yeo et al. 2011), and task analyses often study the activity of a similar number of task-specific networks. However, our loop analysis found that fewer loop dimensions (perhaps up to 2 loops) were needed to describe the HCP data, suggesting that high-dimensional topological features may be a useful low-dimensional representation of spatial fMRI data.

We then compared the distribution of the number of loops across subjects and tasks using paired t-tests (with two-sided alternative hypotheses and at the α=0.05 threshold for type 1 error). About 19% of the subject-subject pairs had statistically different numbers of loops, and about 22% of the task-task pairs had statistically different numbers of loops. Interestingly, most of the differences between tasks were found between resting state scans and task scans, with fewer loops in the resting state scans, suggesting that tasks may induce large-scale and periodic cycles of mental states which are used to accomplish the task.
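As an illustration, one such subject-subject comparison might look like the following, assuming n_loops is a (hypothetical) 100 x 18 matrix holding the number of loops in each subject's 18 thresholded diagrams, with columns ordered identically across subjects:

compare_loop_counts <- function(n_loops, i, j, alpha = 0.05) {
  test <- t.test(n_loops[i, ], n_loops[j, ],
                 paired = TRUE, alternative = "two.sided")
  test$p.value < alpha # TRUE if the pair has statistically different loop counts
}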

Next we performed the same analysis for the mean persistence of loops in the diagrams. About 12% of the subject-subject pairs had statistically different mean persistence values, and about 14% of the task-task pairs did, with the task differences mainly being lower persistence in resting state scans compared to task-based scans.

In order to better understand subject-subject and task-task differences we used TDApplied’s permutation test to compare subjects (matched by task) and tasks (matched by subject) with both the 2-wasserstein and bottleneck distances, again judging significance at the α=0.05 level. For the subjects, both the bottleneck and wasserstein distances found about 20% of the subject-subject pairs to have different topologies, with about 96% of the differences conserved between the two metrics. For the tasks, the bottleneck tests found 4 pairs of different tasks: both resting state scans with the emotion task, the first resting state scan with the social task, and the emotion task with the relational task. The wasserstein tests found 8 pairs of different tasks: the first resting state scan with the emotion, gambling, motor and relational tasks, the second resting state scan with the emotion, motor and working memory tasks, and the relational task with the working memory task. Therefore, the resting state scans were more different from the task scans than the task scans were from each other.
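A sketch of one such task-task comparison is below, assuming task_diagrams is a named list in which each element holds that task's diagrams ordered consistently by subject; the argument names passed to permutation_test and the structure of its return value are assumptions about its interface, and the bottleneck version would swap in the bottleneck distance.

library(TDApplied)

perm <- permutation_test(task_diagrams[["rest1"]],
                         task_diagrams[["emotion"]],
                         dims = c(1), # compare loops only
                         p = 2, # 2-wasserstein distance
                         paired = TRUE, # groups are matched by subject
                         iterations = 20)
perm$p_values # significance judged at the 0.05 level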

Finally, we analyzed whether any of the subject-subject or task-task pairs were independent (again at the α=0.05 significance level). Only about 18% of the subject-subject pairs were significantly non-independent - an interesting future question would be to determine whether these non-independent pairs were blood-related. On the other hand, over 86% of the task-task pairs were found to be significantly non-independent for at least half of the kernel parameter values. Therefore, task-induced mental state topology may be strongly related between tasks - i.e. computational strategies in different neural domains may be related. In order to compare the results of the two inference procedures, we removed kernel parameters which resulted in no non-independent subject-subject pairs, and then used a two-sided t-test to compare the percentage of those kernel values which found significant non-independence between subject-subject pairs that were significantly different in the permutation tests against pairs of subjects that were not significantly different in those tests (for both distance metrics). The results were very similar for both distance metrics, with a significantly lower percentage of non-independence in non-different subject-subject pairs.
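For one subject-subject pair and one kernel parameter combination, the independence test might be sketched as follows, assuming subject_diagrams is a list (indexed by subject) of lists of 18 diagrams ordered consistently by scan; the argument names passed to independence_test and the structure of its return value are assumptions about its interface.

library(TDApplied)

indep <- independence_test(subject_diagrams[[1]],
                           subject_diagrams[[2]],
                           dims = c(1),
                           sigma = 0.001,
                           t = kernel_parameters$t[1])
indep$p_values # small p-values suggest the two subjects' loop topologies covary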

Machine learning

Our inference analyses revealed some of the underlying topological differences of rest and task-induced neural function in the HCP dataset. However, what is the overall structure of the space of all the diagrams? In this section we will first cluster the diagrams to find distinct mental-state topological “modes,” and then visualize the clusters using 2D MDS computed with both the bottleneck and wasserstein metrics.

Using all of the kernel parameters deduced earlier, we performed kernel kmeans clustering on all the persistence diagrams over a range of values of k - 2,3,4,…,18,25,50,100. These values were chosen because there were 18 scans per subject and 100 subjects. The within-cluster sum of squared distances (withinss) was calculated for each value of k and for each kernel parameter combination, and the value of k which minimized the withinss (without resulting in empty clusters) was selected as optimal for each kernel parameter. The frequency of each optimal number of clusters was:
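A sketch of this selection for a single kernel parameter combination is below, assuming all_diagrams is the list of all diagrams; the diagram_kkmeans argument names and the structure of its return value are assumptions about its interface.

library(TDApplied)

ks <- c(2:18, 25, 50, 100)
withinss <- sapply(ks, function(k) {
  fit <- diagram_kkmeans(diagrams = all_diagrams, centers = k,
                         dim = 1, sigma = 0.001, t = 1)
  sum(fit$clustering@withinss) # assumes kernlab's withinss slot is exposed
})
best_k <- ks[which.min(withinss)] # optimal k for this kernel parameter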

From the graph it seemed that the optimal number of clusters was likely either 3 or 4. We therefore performed clustering with both values of k (3 and 4) for all kernel parameters, and calculated a membership percentage matrix for each value of k - recording, for each pair, the percentage of kernel parameters for which the pair was placed in the same cluster. When k was 3 we found that about 54% of the pairs were in the same cluster at least 90% of the time, and that 31% of the pairs were in the same cluster less than 5% of the time. Similarly for four clusters, 53% of the pairs were in the same cluster at least 90% of the time, and 34% of the pairs were in the same cluster less than 5% of the time. These results indicate that the clusterings for k being 3 and 4 were highly consistent across all kernel parameters. We then used the kernlab function kkmeans to cluster the two membership matrices, and the clustering could only be resolved for k = 3 (potentially because the membership matrix is not a true kernel matrix). We therefore visualized the group of diagrams in each of the three found clusters:
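The membership percentage matrix and final clustering can be sketched as follows for k = 3, assuming cluster_labels is a (hypothetical) matrix with one row per diagram and one column per kernel parameter combination, holding that clustering's labels (the membership matrix is treated as a kernel matrix even though, as noted above, it need not be one).

n <- nrow(cluster_labels)
membership <- matrix(0, n, n)
for (p in seq_len(ncol(cluster_labels))) {
  # add 1 where two diagrams share a cluster under kernel parameter p
  membership <- membership + outer(cluster_labels[, p], cluster_labels[, p], "==")
}
membership <- membership / ncol(cluster_labels)

library(kernlab)
final_clusters <- kkmeans(as.kernelMatrix(membership), centers = 3)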

The three clusters were quite distinct in terms of the topological information they likely represent. Cluster one mainly contained two loops, but sometimes a small or very large loop (with mean persistence in the cluster of about 0.0018), cluster two contained one loop of variable size (mean persistence in the cluster of about 0.0011), and cluster three contained one main loop and sometimes one additional small loop (this group only had 16 loops in it, with mean persistence of about 0.0005).

Next we analyzed the number of rows in the diagrams of each cluster. The results were as follows:

#> [1] "Cluster 1:"
#>   1   2   3   4   5   6   7 
#> 112  38  19   6   2   1   1
#> [1] "Cluster 2:"
#>   1   2 
#> 292   7
#> [1] "Cluster 3:"
#>    0    1 
#> 1306   16

This confirmed that cluster 1 had more loops than the other clusters on average, that cluster 2 mainly had one loop, and that cluster three contained only a few low-persistence loops (and all of the diagrams with no loops).

Next we analyzed the task composition of the clusters, and found the following percentages (rounded to two decimal places, with tasks ordered from most to least variance of the rounded percentages across clusters):

#> [1] "Cluster 1:"
#>      rest1      rest2    emotion   gambling     social      motor relational 
#>       0.20       0.17       0.05       0.07       0.07       0.09       0.11 
#>   language 
#>       0.12
#> [1] "Cluster 2:"
#>      rest1      rest2    emotion   gambling     social      motor relational 
#>       0.09       0.08       0.12       0.15       0.14       0.09       0.12 
#>   language 
#>       0.11
#> [1] "Cluster 3:"
#>      rest1      rest2    emotion   gambling     social      motor relational 
#>       0.10       0.11       0.12       0.11       0.11       0.12       0.10 
#>   language 
#>       0.11

Cluster 1 had much higher percentages of resting state scans compared to the other clusters. However, it also had lower percentages of emotion, gambling and social tasks than the other clusters. Clusters 2 and 3 were instead separated mainly by percentages of resting state 2 and motor scans (higher in cluster 3) and of gambling and social scans (higher in cluster 2).

We then calculated 2D MDS embeddings of all the diagrams using both the wasserstein and bottleneck distances. No external factor like task, subject ID, subject gender or subject age clearly separated the embedded points; however, the cluster labels separated them reasonably well:
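These embeddings can be sketched as below, assuming all_diagrams is the list of all diagrams and cluster_label is the vector of final cluster labels; the diagram_mds argument names (including p = Inf for the bottleneck distance) and the form of the returned embedding are assumptions about its interface.

library(TDApplied)

emb_wass <- diagram_mds(diagrams = all_diagrams, k = 2, dim = 1,
                        distance = "wasserstein", p = 2)
emb_bott <- diagram_mds(diagrams = all_diagrams, k = 2, dim = 1,
                        distance = "wasserstein", p = Inf) # bottleneck

plot(emb_wass, col = cluster_label, xlab = "MDS 1", ylab = "MDS 2",
     main = "2-wasserstein embedding colored by cluster")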

Behavioral measures may correlate with the mental state topologies, and may therefore explain some of the variance within the plots (like the 0-dimensional topological features in (Anderson et al. 2018)); this could be an interesting future analysis. Cluster 1 also seemed to vary along two dimensions in both embeddings whereas cluster 2 seemed to vary along one, and determining what factors could explain these dimensions would also be an interesting future analysis.

Conclusions of HCP Analysis

We performed an exploratory analysis of the loops in HCP mental state topologies, and found interesting differences across subjects and tasks. Visual and statistical differences between subjects (across tasks) were found in their loops alone, although higher-dimensional or more local-scale topology may need to be analyzed in order to identify subjects based on their mental state topologies. Inter-related loops emerged across tasks, and fewer loops than functional networks were needed to describe the fMRI data. Three main signatures of loop structure existed across all subjects and tasks, suggesting largely shared functional strategies across subjects and domains, and these signatures were low dimensional in their variation. Overall our analyses point towards the utility of high (above 0) dimensional topology for finding task-related and subject-specific patterns in fMRI spatial activity patterns.

Nevertheless, important future directions of this exploratory analysis include

  1. seeing if higher-dimensional topological features can improve subject or task discrimination,
  2. checking if low-persistence features can distinguish between subjects or tasks by decreasing how conservative the bootstrap procedure is,
  3. developing techniques for finding representative topological features when combining multiple diagrams (like in the combined resting state 1 diagram across all subjects),
  4. creating methods for finding the neurological dimension which varies periodically across a loop, and determining if temporal patterns exist in this dimension,
  5. determining if HCP subjects whose mental state topologies were non-independent are blood related, and
  6. correlating HCP behavioral measures with the MDS embedding coordinates of the mental state topologies to see if behavior can explain variance in mental state topologies.

Limitations of TDApplied

There are three main limitations of TDApplied which should be discussed for its own future improvement. The first limitation is that precomputed distance/Gram matrices cannot be used as input to the machine learning and inference functions. Each of these functions involves the calculation of distance/Gram matrices, which can be very computationally expensive, and if one is analyzing the same dataset with multiple TDApplied functions then the same distance/Gram matrices may have to be recomputed a number of times. This source of redundancy and inefficiency should be addressed in future versions of TDApplied. The second limitation is in the function diagram_ksvm - the only acceptable input is a dataset where the single training feature is a persistence diagram (one diagram for each training example). This may be too inflexible for some applications, where the training features may include several persistence diagrams, or a mix of persistence diagrams, numeric and categorical features. A future update to TDApplied should allow any number of training features, of various types, in the diagram_ksvm function via a weighted sum of kernels. The third limitation of TDApplied is the long runtime of its diagram_distance function compared to python’s persim wasserstein function. Even accounting for a potential runtime increase caused by the reticulate package, the persim wasserstein function was significantly faster than the TDApplied diagram_distance function. As such, a future version of TDApplied could attempt to speed up its distance calculations in order to be competitive with python implementations. While TDApplied has been created with flexibility and scalability at its core, there will always be room for adding more functionality and speedups.

Conclusions

The TDApplied package aims to bridge topological data analysis with researchers and data practitioners in the R community. Current topological data analysis packages in R and python do not provide the ability to carry out standard types of data analysis, namely statistical inference and machine learning, with persistence diagrams, greatly limiting research and industry interest in topological data analysis. TDApplied was built with performance in mind, with fast native-R implementations of distance calculations between persistence diagrams, parallelization at every possible point, and its own fast persistent homology wrapper. Topological data analysis is an exciting and powerful new field of data analysis, and with TDApplied anyone can access its power for meaningful and creative analyses of data.

Appendix

Even though the Hungarian algorithm can be used to solve the linear sum assignment problem (LSAP) - finding a minimal cost matching between two sets of points - some work needs to be done to properly apply the algorithm to calculate wasserstein or bottleneck distances. As an example we will consider the bottleneck distance, although the argument still holds with a simple change for the wasserstein distance (squaring matrix entries). Let Diag1 and Diag2 be two diagrams, with n1 and n2 points respectively, whose projections onto the diagonal are denoted by π(Diag1) and π(Diag2) respectively. Then take M to be the following (n1+n2)×(n1+n2) matrix:

M = \begin{bmatrix} d(Diag_1, Diag_2) & d(Diag_1, \pi(Diag_1)) \\ d(\pi(Diag_2), Diag_2) & 0 \end{bmatrix}

The rows of M correspond to the n1 points in Diag1 followed by the n2 projections π(Diag2), and the columns correspond to the n2 points in Diag2 followed by the n1 projections π(Diag1), with each block of M containing the pairwise distances between the corresponding sets of points. We claim that the solution of the LSAP problem on M has the same cost as the real bottleneck distance value between Diag1 and Diag2.

Firstly, we claim that a solution to the LSAP problem on M has a cost which is no less than the distance value. Let the distance value be s and suppose, to reach a contradiction, that there existed a matching m′ for the LSAP problem on M with cost s′ < s. Since projection points are matched together with cost 0 in M, let m contain all the matches in m′ which are not between two projection points. Then m would be a matching for the distance calculation with cost s′ < s, contradicting the minimality of s. Therefore, a solution to the LSAP problem on M has a cost which is no less than the distance value.

Next, we claim that a solution to the LSAP problem on M has a cost which is no greater than the distance value. Suppose, to reach a contradiction, that the solution to the LSAP problem on M had cost s′, which was larger than the real distance value s, and let m be a matching achieving the real distance value s. Since each point in either diagram is paired either with a point in the other diagram or with its own diagonal projection, there must be an equal number of unpaired projection points from both diagrams in m. Therefore, we can augment m to a matching m′ on M in which the unpaired projection points are arbitrarily paired up with cost 0. Thus, m′ has cost s < s′, contradicting the minimality of s′. Therefore, a solution to the LSAP problem on M has a cost which is no greater than the distance value.

Therefore, a solution to the LSAP problem for M has a cost which is both no less than and no greater than the bottleneck distance value, and hence the two values must be equal.
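To make the construction concrete, the following sketch builds M for the 2-wasserstein distance (squared entries) in a single homological dimension and solves the LSAP with the clue package's Hungarian algorithm solver (Hornik 2005). An L-infinity ground metric between points is assumed, both diagrams are assumed non-empty, and this is an illustration rather than the package's implementation.

library(clue)

linf <- function(a, b) max(abs(a - b)) # assumed ground metric between two points
proj <- function(p) rep(mean(p), 2) # projection onto the diagonal birth = death

wasserstein_lsap <- function(D1, D2) {
  pts1 <- as.matrix(D1[, c("birth", "death")])
  pts2 <- as.matrix(D2[, c("birth", "death")])
  n1 <- nrow(pts1); n2 <- nrow(pts2)
  # block (1,1): distances between the points of Diag1 and Diag2
  b11 <- outer(seq_len(n1), seq_len(n2),
               Vectorize(function(i, j) linf(pts1[i, ], pts2[j, ])))
  # block (1,2): distances from Diag1 points to projections of Diag1 points
  b12 <- outer(seq_len(n1), seq_len(n1),
               Vectorize(function(i, j) linf(pts1[i, ], proj(pts1[j, ]))))
  # block (2,1): distances from projections of Diag2 points to Diag2 points
  b21 <- outer(seq_len(n2), seq_len(n2),
               Vectorize(function(i, j) linf(proj(pts2[i, ]), pts2[j, ])))
  # block (2,2): matching two projections together costs nothing
  b22 <- matrix(0, nrow = n2, ncol = n1)
  M <- rbind(cbind(b11, b12), cbind(b21, b22))^2 # square entries for p = 2
  matching <- solve_LSAP(M, maximum = FALSE)
  sqrt(sum(M[cbind(seq_len(n1 + n2), as.integer(matching))]))
}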

References

Abdallah, Hassan et al. 2021. Github. https://github.com/hassan-abdallah/Statistical_Inference_PH_fMRI/blob/main/Abdallah_et_al_Statistical_Inference_PH_fMRI.pdf.
Salch, A., et al. 2021. “From Mathematics to Medicine: A Practical Primer on Topological Data Analysis (TDA) and the Development of Related Analytic Tools for the Functional Discovery of Latent Structure in fMRI Data.” PLoS ONE 16 (8).
Glasser, M., et al. 2016. “The Human Connectome Project’s Neuroimaging Approach.” Nature Neuroscience 19: 1175–87.
Anderson, Keri L., Jeffrey S. Anderson, Sourabh Palande, and Bei Wang. 2018. “Topological Data Analysis of Functional MRI Connectivity in Time and Space Domains.” In Connectomics in NeuroImaging, edited by Guorong Wu, Islem Rekik, Markus D. Schirmer, Ai Wern Chung, and Brent Munsell, 67–77. Cham: Springer International Publishing.
Bauer, Ulrich. 2015. Ripser. https://github.com/Ripser/ripser.
Bauer, Ulrich, Michael Kerber, and Jan Reininghaus. 2013. Persistent Homology Algorithm Toolbox. https://www.sciencedirect.com/science/article/pii/S0747717116300098.
Bubenik, Peter. 2015. “Statistical Topological Data Analysis Using Persistence Landscapes.” J. Mach. Learn. Res. 16 (1): 77–102.
Bubenik, Peter, and Pawel Dlotko. 2017. “A Persistence Landscapes Toolbox for Topological Statistics.” Journal of Symbolic Computation 78: 91–114. https://doi.org/https://doi.org/10.1016/j.jsc.2016.03.009.
Caballero-Gaudes, César, and Richard C. Reynolds. 2017. “Methods for Cleaning the BOLD fMRI Signal.” NeuroImage 154: 128–49. https://doi.org/https://doi.org/10.1016/j.neuroimage.2016.12.018.
Carlsson, Gunnar E., Tigran Ishkhanov, Vin de Silva, and Afra Zomorodian. 2007. “On the Local Behavior of Spaces of Natural Images.” International Journal of Computer Vision 76: 1–12.
Carrière, Mathieu, Marco Cuturi, and Steve Oudot. 2017. “Sliced Wasserstein Kernel for Persistence Diagrams.” In Proceedings of the 34th International Conference on Machine Learning, edited by Doina Precup and Yee Whye Teh, 70:664–73. Proceedings of Machine Learning Research. PMLR.
Casella, G., R. L. Berger, and Brooks/Cole Publishing Company. 2002. Statistical Inference. Duxbury Advanced Series in Statistics and Decision Sciences. Thomson Learning.
Chazal, Frédéric, and Bertrand Michel. 2017. “An Introduction to Topological Data Analysis: Fundamental and Practical Aspects for Data Scientists.” Frontiers in Artificial Intelligence 4 (October). https://doi.org/10.3389/frai.2021.667963.
Cox, Michael A. A., and Trevor F. Cox. 2008. “Multidimensional Scaling.” In Handbook of Data Visualization, 315–47. Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-540-33037-0_14.
Dhillon, Inderjit S., Yuqiang Guan, and Brian Kulis. 2004. “A Unified View of Kernel k-Means , Spectral Clustering and Graph Cuts.” In UTCS Technical Report.
Edelsbrunner, Herbert, David Letscher, and Afra Zomorodian. 2000. “Topological Persistence and Simplification.” Discrete & Computational Geometry 28: 511–33.
Ellis, Cameron, Michael Lesnick, Greg Henselman, Bryn Keller, and Jonathan Cohen. 2019. “Feasibility of Topological Data Analysis for Event-Related fMRI.” Network Neuroscience 3 (May): 1–12. https://doi.org/10.1162/netn_a_00095.
Fasy, Brittany T., Jisu Kim, Fabrizio Lecci, Clement Maria, David L. Millman, and Vincent Rouvreau. 2021. TDA: Statistical Tools for Topological Data Analysis. https://CRAN.R-project.org/package=TDA.
Fasy, Brittany, Fabrizio Lecci, Alessandro Rinaldo, Larry Wasserman, Sivaraman Balakrishnan, and Aarti Singh. 2014. “Confidence Sets for Persistence Diagrams.” The Annals of Statistics 42 (March): 2301–39.
Finn, Emily, Xilin Shen, Dustin Scheinost, Monica Rosenberg, Huang Jessica, Marvin Chun, Xenophon Papademetris, and Robert Constable. 2015. “Functional Connectome Fingerprinting: Identifying Individuals Using Patterns of Brain Connectivity.” Nature Neuroscience 18 (October). https://doi.org/10.1038/nn.4135.
Giusti, Chad, Robert Ghrist, and Danielle Bassett. 2016. “Two’s Company, Three (or More) Is a Simplex: Algebraic-Topological Tools for Understanding Higher-Order Structure in Neural Data.” Journal of Computational Neuroscience 41 (January). https://doi.org/10.1007/s10827-016-0608-6.
Gracia-Tabuenca, Zeus, Juan Carlos Dı́az-Patiño, Isaac Arelio, and Sarael Alcauter. 2020. “Topological Data Analysis Reveals Robust Alterations in the Whole-Brain and Frontal Lobe Functional Connectomes in Attention-Deficit/Hyperactivity Disorder.” Eneuro.
Gretton, Arthur, Kenji Fukumizu, Choon Teo, Le Song, Bernhard Schölkopf, and Alex Smola. 2007. “A Kernel Statistical Test of Independence.” In Advances in Neural Information Processing Systems, edited by J. Platt, D. Koller, Y. Singer, and S. Roweis. Vol. 20. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2007/file/d5cfead94f5350c12c322b5b664544c1-Paper.pdf.
Haim Meirom, Shaked, and Omer Bobrowski. 2022. “Unsupervised Geometric and Topological Approaches for Cross-Lingual Sentence Representation and Comparison.” In Proceedings of the 7th Workshop on Representation Learning for NLP, 173–83. Dublin, Ireland: Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.repl4nlp-1.18.
Hornik, Kurt. 2005. “A CLUE for CLUster Ensembles.” Journal of Statistical Software 14 (12). https://doi.org/10.18637/jss.v014.i12.
Lee, Hyekyoung, Moo Chung, Hyejin Kang, and Dong Lee. 2014. “Hole Detection in Metabolic Connectivity of Alzheimer’s Disease Using Kappa-Laplacian.” In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, 17:297–304. https://doi.org/10.1007/978-3-319-10443-0_38.
Islambekov, Umar, and Alexey Luchinsky. 2022. TDAvec: Vector Summaries of Persistence Diagrams. https://CRAN.R-project.org/package=TDAvec.
Kang, Louis, Boyan Xu, and Dmitriy Morozov. 2021. “Evaluating State Space Discovery by Persistent Cohomology in the Spatial Representation System.” Frontiers in Computational Neuroscience 15. https://doi.org/10.3389/fncom.2021.616748.
Karatzoglou, Alexandros, Alex Smola, Kurt Hornik, and Achim Zeileis. 2004. “Kernlab – an S4 Package for Kernel Methods in R.” Journal of Statistical Software 11 (9): 1–20. https://www.jstatsoft.org/v11/i09/.
Kerber, Michael, Dmitriy Morozov, and Arnur Nigmetov. 2017. “Geometry Helps to Compare Persistence Diagrams.” ACM Journal of Experimental Algorithmics 22 (September). https://doi.org/10.1145/3064175.
Kriegeskorte, Nikolaus, Marieke Mur, and Peter Bandettini. 2008. “Representational Similarity Analysis – Connecting the Branches of Systems Neuroscience.” Frontiers in Systems Neuroscience 2 (February): 4. https://doi.org/10.3389/neuro.06.004.2008.
Krishnapriyan, Aditi S. et al. 2021. “Machine Learning with Persistent Homology and Chemical Word Embeddings Improves Prediction Accuracy and Interpretability in Metal-Organic Frameworks.” Nature Scientific Report 11 (April).
Kusano, Genki, Kenji Fukumizu, and Yasuaki Hiraoka. 2018. “Kernel Method for Persistence Diagrams via Kernel Embedding and Weight Factor.” Journal of Machine Learning Research 18 (189): 1–41. https://jmlr.org/papers/v18/17-317.html.
Lacombe, Theo, Hind Montassif, Manuel Soriano-Trigueros, Gard Spreeman, and Masatoshi Takenouchi. 2019. The GUDHI Library Is a Generic Open Source c++ Library, with a Python Interface, for Topological Data Analysis (TDA) and Higher Dimensional Geometry Understanding. https://gudhi.inria.fr/.
Le, Tam, and Makoto Yamada. 2018. “Persistence Fisher Kernel: A Riemannian Manifold Kernel for Persistence Diagrams.” In Advances in Neural Information Processing Systems, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett. Vol. 31. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2018/file/959ab9a0695c467e7caf75431a872e5c-Paper.pdf.
Lee, Hyekyoung, Moo K. Chung, Hyejin Kang, Bung-Nyun Kim, and Dong Soo Lee. 2011. “Discriminative Persistent Homology of Brain Networks.” In 2011 IEEE International Symposium on Biomedical Imaging: From Nano to Macro, 841–44. https://doi.org/10.1109/ISBI.2011.5872535.
Morozov, Dimitriy. 2017. Dionysus Is a c++ Library for Computing Persistent Homology. https://mrzv.org/software/dionysus2/.
Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. MIT press.
R Core Team. 2021. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Reininghaus, Jan, Stefan Huber, Ulrich Bauer, and Roland Kwitt. 2014. “A Stable Multi-Scale Kernel for Topological Machine Learning,” December.
Robinson, Andrew, and Katharine Turner. 2017. “Hypothesis Testing for Topological Data Analysis.” Journal of Applied and Computational Topology 1.
Saggar, Manish, Olaf Sporns, Javier Gonzalez-Castillo, Peter A. Bandettini, Gunnar E. Carlsson, Gary H. Glover, and Allan L. Reiss. 2018. “Towards a New Approach to Reveal Dynamical Organization of the Brain Using Topological Data Analysis.” Nature Communications 9.
Saul, Nathaniel, and Chris Tralie. 2019. “Scikit-TDA: Topological Data Analysis for Python.” https://doi.org/10.5281/zenodo.2533369.
Schölkopf, Bernhard, Alexander Smola, and Klaus-Robert Müller. 1998. “Nonlinear Component Analysis as a Kernel Eigenvalue Problem.” Neural Computation 10: 1299–319.
de Silva, V., D. Morozov, and M. Vejdemo-Johansson. 2011a. “Dualities in Persistent (Co)homology.” Inverse Problems 27.
———. 2011b. “Persistent Cohomology and Circular Coordinates.” Discrete & Computational Geometry 45: 737–59.
Singh, Gurjeet, Facundo Mémoli, and Gunnar E. Carlsson. 2007. “Topological Methods for the Analysis of High Dimensional Data Sets and 3d Object Recognition.” In SPBG.
Sizemore, Ann E., Jennifer E. Phillips-Cremins, Robert Ghrist, and Danielle S. Bassett. 2019. The importance of the whole: Topological data analysis for the network neuroscientist.” Network Neuroscience 3 (3): 656–73. https://doi.org/10.1162/netn_a_00073.
Somasundaram, Eashwar V., Shael E. Brown, Adam Litzler, Jacob G. Scott, and Raoul R. Wadhwa. 2021. “Benchmarking R Packages for Calculation of Persistent Homology.” The R Journal 13: 184–93. https://doi.org/10.32614/RJ-2021-033.
Tauzin, Guillaume, Umberto Lupo, Lewis Tunstall, Julian Burella Pérez, Matteo Caorsi, Anibal Medina-Mardones, Alberto Dassatti, and Kathryn Hess. 2020. “Giotto-Tda: A Topological Data Analysis Toolkit for Machine Learning and Data Exploration.” https://arxiv.org/abs/2004.02551.
Turner, Katharine. 2013. “Means and Medians of Sets of Persistence Diagrams.” arXiv, July.
Turner, Katharine, Yuriy Mileyko, Sayan Mukherjee, and John Harer. 2014. “Frechet Means for Distributions of Persistence Diagrams.” Discrete & Computational Geometry 52 (1): 44–70.
Ushey, Kevin, JJ Allaire, and Yuan Tang. 2022. Reticulate: Interface to ’Python’. https://CRAN.R-project.org/package=reticulate.
Wadhwa, Raoul, Andrew Dhawan, Drew Williamson, and Jacob Scott. 2019. TDAstats: Pipeline for Topological Data Analysis. https://github.com/rrrlw/TDAstats.
Yen, Peter Tsung-Wen, and Siew Ann Cheong. 2021. “Using Topological Data Analysis (TDA) and Persistent Homology to Analyze the Stock Markets in Singapore and Taiwan.” Frontiers in Physics 9. https://doi.org/10.3389/fphy.2021.572216.
Yeo, B. T. Thomas, Fenna Krienen, Jorge Sepulcre, Mert Sabuncu, Danial Lashkari, Marisa Hollinshead, Joshua Roffman, et al. 2011. “The Organization of the Human Cerebral Cortex Estimated by Functional Correlation.” Journal of Neurophysiology 106 (June): 1125–65. https://doi.org/10.1152/jn.00338.2011.
You, Kisung, and Byeongsu Yu. 2021. TDAkit: Toolkit for Topological Data Analysis. https://CRAN.R-project.org/package=TDAkit.
Zomorodian, Afra, and Gunnar Carlsson. 2005. “Computing Persistent Homology.” Discrete and Computational Geometry 33 (February): 249–74. https://doi.org/10.1007/s00454-004-1146-y.