library(parallelpam)
This is a copy of the vignette of package jmatrix
(Domingo (2022a)). It is included here since jmatrix
is underlying this package and you will need to know how to prepare your data to be processed by parallelpam
. But you must NOT load package jmatrix
; all the functions detailed below are already included into parallelpam
(Domingo (2022b)) and are available just by loading it with library(parallelpam)
.
The package jmatrix
was originally conceived as a tool for other packages, namely parallelpam
and scellpam
(Domingo (2022c)) which needed to deal with very big matrices which might not fit in the memory of the computer, particularly if their elements are of type double
(in most modern machines, 8 bytes per element) whereas they could fit if they were matrices of other data types, in particular of floats (4 bytes per element).
Unfortunately, R is not a strongly typed language. Double is the default type in R and it is not easy to work with other data types. Trials like the package float (Schmidt (2022)) have been done, but to use them you have to coerce a matrix already loaded in R memory to a float matrix, and then you can delete it. But, what happens if you computer has not memory enough to hold the matrix in the first place?. This is the problem this package tries to address.
Our idea is to use the disk as temporarily storage of the matrix in a file with a internal binary format (jmatrix
format). This format has a header of 128 bytes with information like type of matrix (full, sparse or symmetric), data type of each element (char, short, int, long, float, double or long double), number of rows and columns and endianness; then comes the content as binary data (in sparse matrices zeros are not stored; in symmetric matrices only the lower-diagonal is stored) and finally the metadata (currently, names for rows/columns if needed and an optional comment).
Such files are created and loaded by functions written in C++ which are accessible from R
with Rcpp (Eddelbuettel and François (2011)). The file, once loaded, uses strictly the needed memory for its data type and can be processed by other C++ functions (like the PAM algorithm or any other numeric library written in C++) also from inside R
.
The matrix contained in a binary data file in jmatrix
format cannot be loaded directly in R
memory as a R
matrix (that would be impossible, anyway, since precisely this package is done for the cases in which such matrix would NOT fit into the available RAM). Nevertheless, limited access through some functions is provided to read one or more rows or one or more columns as R
vectors or matrices (obviously, coerced to double).
The package jmatrix
must not be considered as a final, finished software. Currently is mostly an instrumental solution to address our needs and we make available as a separate package just in case it could be useful for anyone else.
First of all, the package can show quite informative (but sometimes verbose) messages in the console. To turn on/off such messages you can use.
ParallelpamSetDebug(deb=TRUE,debjmat=TRUE)
#> Debugging for PAM algorithm set to ON.
#> Debugging for jmatrix inside parallelpam package set to ON.
As stated before, the binary matrix files should normally be created from C++ getting the data from an external source like a data file in a format used in bioinformatics or a .csv file. These files should be read by chunks. As an example, look at function CsvToJMat
in package scellpam
(Domingo (2022c)).
As a convenience and only for testing purposes (to be used in this vignette), we provide the function JWriteBin
to write a R matrix as a jmatrix
file.
# Create a 6x8 matrix of random values
<- matrix(runif(48),nrow=6)
Rf # Set row and column names for it
rownames(Rf) <- c("A","B","C","D","E","F")
colnames(Rf) <- c("a","b","c","d","e","f","g","h")
# Let's see the matrix
Rf#> a b c d e f g
#> A 0.33924533 0.97618404 0.94662717 0.4186692 0.07196084 0.52290163 0.20934166
#> B 0.87766287 0.56642703 0.06926637 0.6681069 0.75988625 0.96225313 0.15208650
#> C 0.01490482 0.71701710 0.71612113 0.6208715 0.59357862 0.75143741 0.01372476
#> D 0.30824005 0.01012763 0.83325282 0.5588573 0.63413266 0.35389050 0.98980407
#> E 0.23731471 0.96237737 0.01428945 0.1286168 0.06033504 0.08693494 0.19441580
#> F 0.72093239 0.16500909 0.84209971 0.7722799 0.40297448 0.33691653 0.33110851
#> h
#> A 0.6991301
#> B 0.3287793
#> C 0.8802877
#> D 0.8229819
#> E 0.1623377
#> F 0.7174612
# and write it as the binary file Rfullfloat.bin
JWriteBin(Rf,"Rfullfloat.bin",dtype="float",dmtype="full",
comment="Full matrix of floats")
#> The passed matrix has row names for the 6 rows and they will be used.
#> The passed matrix has column names for the 8 columns and they will be used.
#> Writing binary matrix Rfullfloat.bin of (6x8)
#> End of block of binary data at offset 320
#> Writing row names (6 strings written, from A to F).
#> Writing column names (8 strings written, from a to h).
#> Writing comment: Full matrix of floats
# Also, you can write it with double data type:
JWriteBin(Rf,"Rfulldouble.bin",dtype="double",dmtype="full",
comment="Full matrix of doubles")
#> The passed matrix has row names for the 6 rows and they will be used.
#> The passed matrix has column names for the 8 columns and they will be used.
#> Writing binary matrix Rfulldouble.bin of (6x8)
#> End of block of binary data at offset 512
#> Writing row names (6 strings written, from A to F).
#> Writing column names (8 strings written, from a to h).
#> Writing comment: Full matrix of doubles
To get information about the stored file the function JMatInfo
is provided. Of course, this funcion does not read the complete file in memory but just the header.
# Information about the float binary file
JMatInfo("Rfullfloat.bin")
#> File: Rfullfloat.bin
#> Matrix type: FullMatrix
#> Number of elements: 48
#> Data type: float
#> Endianness: little endian (same as this machine)
#> Number of rows: 6
#> Number of columns: 8
#> Metadata: Stored names of rows and columns.
#> Metadata comment: "Full matrix of floats"
# Same information about the double binary file
JMatInfo("Rfulldouble.bin")
#> File: Rfulldouble.bin
#> Matrix type: FullMatrix
#> Number of elements: 48
#> Data type: double
#> Endianness: little endian (same as this machine)
#> Number of rows: 6
#> Number of columns: 8
#> Metadata: Stored names of rows and columns.
#> Metadata comment: "Full matrix of doubles"
As stated before, no function is provided to read the whole matrix in memory which would contradict the philosophy of this package, but you can get rows or columns from a file.
# Reads row 1 into vector vf. Float values inside the file are
# promoted to double.
<-GetJRow("Rfullfloat.bin",1))
(vf#> a b c d e f g
#> 0.33924535 0.97618407 0.94662714 0.41866922 0.07196084 0.52290165 0.20934166
#> h
#> 0.69913006
Obviously, storage in float provokes a loosing of precision. We have observed this not to be relevant for PAM
(partitioning around medoids) algorihm but it can be important in other cases. It is the price to pay for halving the needed space.
# Checks the precision lost
max(abs(Rf[1,]-vf))
#> [1] 2.677552e-08
Nevertheless, storing as double obviously keeps the data intact.
<-GetJRow("Rfulldouble.bin",1)
vdmax(abs(Rf[1,]-vd))
#> [1] 0
Now, let us see examples of some functions to read rows or columns by number or by name, or to read several rows/columns as a R matrix. In all examples numbers for rows and columns are in R-convention (i.e. starting at 1)
# Read column number 3
<-GetJCol("Rfullfloat.bin",3))
(vf#> A B C D E F
#> 0.94662714 0.06926637 0.71612114 0.83325285 0.01428945 0.84209973
# Test precision
max(abs(Rf[,3]-vf))
#> [1] 2.630986e-08
# Read row with name C
<-GetJRowByName("Rfullfloat.bin","C"))
(vf#> a b c d e f g
#> 0.01490482 0.71701711 0.71612114 0.62087148 0.59357864 0.75143743 0.01372476
#> h
#> 0.88028771
# Read column with name c
<-GetJColByName("Rfullfloat.bin","c"))
(vf#> A B C D E F
#> 0.94662714 0.06926637 0.71612114 0.83325285 0.01428945 0.84209973
# Get the names of all rows or columns as vectors of R strings
<-GetJRowNames("Rfullfloat.bin"))
(rn#> [1] "A" "B" "C" "D" "E" "F"
<-GetJColNames("Rfullfloat.bin"))
(cn#> [1] "a" "b" "c" "d" "e" "f" "g" "h"
# Get the names of rows and columns simultaneosuly as a list of two elements
<-GetJNames("Rfullfloat.bin"))
(l#> $rownames
#> [1] "A" "B" "C" "D" "E" "F"
#>
#> $colnames
#> [1] "a" "b" "c" "d" "e" "f" "g" "h"
# Get several rows at once. The returned matrix has the rows in the
# same order as the passed list,
# and this list can contain even repeated values
<-GetJManyRows("Rfullfloat.bin",c(1,4)))
(vm#> a b c d e f g
#> A 0.3392453 0.97618407 0.9466271 0.4186692 0.07196084 0.5229017 0.2093417
#> D 0.3082401 0.01012763 0.8332528 0.5588573 0.63413268 0.3538905 0.9898041
#> h
#> A 0.6991301
#> D 0.8229819
# Of course, columns can be extrated equally
<-GetJManyCols("Rfulldouble.bin",c(1,4)))
(vc#> a d
#> A 0.33924533 0.4186692
#> B 0.87766287 0.6681069
#> C 0.01490482 0.6208715
#> D 0.30824005 0.5588573
#> E 0.23731471 0.1286168
#> F 0.72093239 0.7722799
# and similar functions are provided for extracting by names:
<-GetJManyRowsByNames("Rfulldouble.bin",c("A","D")))
(vm#> a b c d e f g
#> A 0.3392453 0.97618404 0.9466272 0.4186692 0.07196084 0.5229016 0.2093417
#> D 0.3082401 0.01012763 0.8332528 0.5588573 0.63413266 0.3538905 0.9898041
#> h
#> A 0.6991301
#> D 0.8229819
<-GetJManyColsByNames("Rfulldouble.bin",c("a","d")))
(vc#> a d
#> A 0.33924533 0.4186692
#> B 0.87766287 0.6681069
#> C 0.01490482 0.6208715
#> D 0.30824005 0.5588573
#> E 0.23731471 0.1286168
#> F 0.72093239 0.7722799
The package can manage and store sparse and symmetric matrices, too.
# Generation of a 6x8 sparse matrix
<- matrix(rep(0,48),nrow=6)
Rsp <- 0.1
sparsity <- round(48*sparsity)
nnz <- floor(47*runif(nnz))
where <- runif(nnz)
val for (i in 1:nnz)
{floor(where[i]/8)+1,(where[i]%%8)+1] <- val[i]
Rsp[
}rownames(Rsp) <- c("A","B","C","D","E","F")
colnames(Rsp) <- c("a","b","c","d","e","f","g","h")
# Let's see the matrix
Rsp#> a b c d e f g h
#> A 0 0.2002556 0.0000000 0 0 0.0000000 0.00000000 0
#> B 0 0.0000000 0.7706144 0 0 0.0000000 0.81779897 0
#> C 0 0.0000000 0.0000000 0 0 0.0000000 0.00000000 0
#> D 0 0.0000000 0.0000000 0 0 0.0000000 0.06233453 0
#> E 0 0.0000000 0.0000000 0 0 0.7992811 0.00000000 0
#> F 0 0.0000000 0.0000000 0 0 0.0000000 0.00000000 0
# Write the matrix as sparse with type float
JWriteBin(Rsp,"Rspafloat.bin",dtype="float",dmtype="sparse",
comment="Sparse matrix of floats")
#> The passed matrix has row names for the 6 rows and they will be used.
#> The passed matrix has column names for the 8 columns and they will be used.
#> Writing binary matrix Rspafloat.bin of (6x8)
#> End of block of binary data at offset 192
#> Writing row names (6 strings written, from A to F).
#> Writing column names (8 strings written, from a to h).
#> Writing comment: Sparse matrix of floats
Notice that the condition of being a sparse matrix and the storage space used can be known with the matrix info.
JMatInfo("Rspafloat.bin")
#> File: Rspafloat.bin
#> Matrix type: SparseMatrix
#> Number of elements: 48
#> Data type: float
#> Endianness: little endian (same as this machine)
#> Number of rows: 6
#> Number of columns: 8
#> Metadata: Stored names of rows and columns.
#> Metadata comment: "Sparse matrix of floats"
#> Binary data size: 64 bytes, which is 33.3333% of the full matrix size.
Be careful: trying to store as sparse a matrix which is not (it has not a majority of 0-entries) works, but produces a matrix larger than the corresponding full matrix.
With respect to symmetric matrices, JWriteBin
works the same way. Let us generate a \(7 \times 7\) symmetric matrix.
<- matrix(runif(49),nrow=7)
Rns <- 0.5*(Rns+t(Rns))
Rsym rownames(Rsym) <- c("A","B","C","D","E","F","G")
colnames(Rsym) <- c("a","b","c","d","e","f","g")
# Let's see the matrix
Rsym#> a b c d e f g
#> A 0.8227176 0.29330321 0.3482491 0.6386264 0.2765956 0.1685091 0.3271973
#> B 0.2933032 0.06133059 0.5774178 0.7507464 0.6253002 0.4415560 0.3447988
#> C 0.3482491 0.57741785 0.2743452 0.2635373 0.6126107 0.4003239 0.5947477
#> D 0.6386264 0.75074645 0.2635373 0.9880357 0.5322137 0.2894587 0.6353060
#> E 0.2765956 0.62530016 0.6126107 0.5322137 0.3314372 0.4821001 0.4061523
#> F 0.1685091 0.44155604 0.4003239 0.2894587 0.4821001 0.7589535 0.9155305
#> G 0.3271973 0.34479879 0.5947477 0.6353060 0.4061523 0.9155305 0.9657987
# Write the matrix as symmetric with type float
JWriteBin(Rsym,"Rsymfloat.bin",dtype="float",dmtype="symmetric",
comment="Symmetric matrix of floats")
#> The passed matrix has row names for the 7 rows and they will be used.
#> Writing binary matrix Rsymfloat.bin
#> End of block of binary data at offset 240
#> Writing row names (7 strings written, from A to G).
#> Writing comment: Symmetric matrix of floats
# Get the information
JMatInfo("Rsymfloat.bin")
#> File: Rsymfloat.bin
#> Matrix type: SymmetricMatrix
#> Number of elements: 49 (28 really stored)
#> Data type: float
#> Endianness: little endian (same as this machine)
#> Number of rows: 7
#> Number of columns: 7
#> Metadata: Stored only names of rows.
#> Metadata comment: "Symmetric matrix of floats"
Notice that if you store a R matrix which is NOT symmetric as a symmetric jmatrix
, only the lower triangular part (including the main diagonal) will be saved. The upper-triangular part will be lost.
The functions to read rows/colums stated before works equally independently of the matrix character (full, sparse or symmetric) so you can play with them using the Rspafloat.bin
and Rsymfloat.bin
file to check they work.
Finally, there is a function to get the lower-diagonal part of a symmetric matrix stored in a jmatrix
file as a R vector, ordered by columns (i.e.: column 1 from M(2,1) to M(n,1), followed by column 2 from M(3,2) to M(n,2) and so on, up to M(n,n-1) so d is a vector of \(n(n-1)/2\) components). This is to get the data of a dissimilarity/distance matrix in the format used by the Partitioning around medoids (PAM) algorithm implemented in the cluster
(Maechler et al. (2022)) package. The purpose is to compare such implementation with our own implementation of PAM. If you are interested, see the documentation of the parallelpam
(Domingo (2022b)) package.
<-GetSubdiag("Rsymfloat.bin"))
(d#> [1] 0.2933032 0.3482491 0.6386265 0.2765956 0.1685091 0.3271973 0.5774179
#> [8] 0.7507464 0.6253002 0.4415560 0.3447988 0.2635373 0.6126108 0.4003239
#> [15] 0.5947477 0.5322137 0.2894587 0.6353061 0.4821001 0.4061523 0.9155305