github.com-BlasBenito-distantia_-_2021-03-03_23-13-27
Item Preview
Share or Embed This Item
Flag this item for
- Publication date
- 2021-03-03
R package to compute dissimilarity between multivariate time series
distantia: an R package to compute the dissimilarity between
multivariate time-series
Summary
The package distantia allows to measure the dissimilarity betweenmultivariate ecological time-series (METS hereafter). The packageassumes that the target sequences are ordered along a given dimension,being depth and time the most common ones, but others such as latitudeor elevation are also possible. Furthermore, the target METS can beregular or irregular, and have their samples aligned (sameage/time/depth) or unaligned (different age/time/depth). The onlyrequirement is that the sequences must have at least two (but ideallymore) columns with the same name and units representing differentvariables relevant to the dynamics of an ecological system.
In this document I explain the logics behind the method, show how to useit, and demonstrate how the distantia package introduces usefultools to compare multivariate time-series. The topics covered in thisdocument are:
- Installation of the package.
- Comparing two irregular METS.
- Comparing multiple irregular METS.
- Comparing regular and aligned METS.
- Restricted permutation test to assess the significance ofdissimilarity scores
- Assessing the contribution of every variable to the dissimilaritybetween two METS.
- Partial matching: finding the section in a long METS more similar toa given shorter one.
- Sequence slotting: combining samples of two METS into a singlecomposite sequence.
- Tranferring an attribute from one METS to another: direct andinterpolated modes.
Installation
You can install the released version of distantia (currently v1.0.0)from CRAN with:
rinstall.packages("distantia")
And the development version (currently v1.0.1) fromGitHub with:
rlibrary(devtools)devtools::install_github("BlasBenito/distantia")
Loading the library, plus other helper libraries:
rlibrary(distantia)library(ggplot2)library(viridis)library(kableExtra)library(qgraph)library(tidyr)
Comparing two irregular METS
In this section I will use two example datasets based on the Abernethypollen core (Birks and Mathewes, 1978) to fully explain the logicalbackbone of the dissimilarity analyses implemented in distantia.
``` r
loading sequences
data(sequenceA)data(sequenceB)
showing first rows
kable(sequenceA[1:15, ], caption = "Sequence A")```
betula | pinus | corylu | junipe | empetr | gramin | cypera | artemi | rumex |
---|---|---|---|---|---|---|---|---|
79 | 271 | 36 | 0 | 4 | 7 | 25 | 0 | 0 |
113 | 320 | 42 | 0 | 4 | 3 | 11 | 0 | 0 |
51 | 420 | 39 | 0 | 2 | 1 | 12 | 0 | 0 |
130 | 470 | 6 | 0 | 0 | 2 | 4 | 0 | 0 |
31 | 450 | 6 | 0 | 3 | 2 | 3 | 0 | 0 |
59 | 425 | 12 | 0 | 0 | 2 | 3 | 0 | 0 |
78 | 386 | 29 | 2 | 0 | 0 | 2 | 0 | 0 |
71 | 397 | 52 | 2 | 0 | 6 | 3 | 0 | 0 |
140 | 310 | 50 | 2 | 0 | 4 | 3 | 0 | 0 |
150 | 323 | 34 | 2 | 0 | 11 | 2 | 0 | 0 |
175 | 317 | 37 | 2 | 0 | 11 | 3 | 0 | 0 |
181 | 345 | 28 | 3 | 0 | 7 | 3 | 0 | 0 |
153 | 285 | 36 | 2 | 0 | 8 | 3 | 0 | 1 |
214 | 315 | 54 | 2 | 1 | 13 | 5 | 0 | 0 |
200 | 210 | 41 | 6 | 0 | 10 | 4 | 0 | 0 |
rkable(sequenceB[1:15, ], caption = "Sequence B")
betula | pinus | corylu | junipe | gramin | cypera | artemi | rumex |
---|---|---|---|---|---|---|---|
19 | 175 | NA | 2 | 34 | 39 | 1 | 0 |
18 | 119 | 28 | 1 | 36 | 44 | 0 | 4 |
30 | 99 | 37 | 0 | 2 | 20 | 0 | 1 |
26 | 101 | 29 | 0 | 0 | 18 | 0 | 0 |
31 | 99 | 30 | 0 | 1 | 10 | 0 | 0 |
24 | 97 | 28 | 0 | 2 | 9 | 0 | 0 |
23 | 105 | 34 | 0 | 1 | 6 | 0 | 0 |
48 | 112 | 46 | 0 | 0 | 12 | 0 | 0 |
29 | 108 | 16 | 0 | 6 | 3 | 0 | 0 |
23 | 110 | 21 | 0 | 2 | 11 | 0 | 1 |
5 | 119 | 19 | 0 | 1 | 1 | 0 | 0 |
30 | 105 | NA | 0 | 9 | 7 | 0 | 0 |
22 | 116 | 17 | 0 | 1 | 7 | 0 | 0 |
24 | 115 | 20 | 0 | 2 | 4 | 0 | 0 |
26 | 119 | 23 | 0 | 4 | 0 | 0 | 0 |
Data preparation
Notice that sequenceB has a few NA values (that were introduced toserve as an example). The function prepareSequences gets them readyfor analysis by matching colum names and handling empty data. It allowsto merge two or more METS into a single dataframe ready for furtheranalyses. Note that, since the data represents pollen abundances, aHellinger transformation (square root of the relative proportions ofeach taxa, it balances the relative abundances of rare and dominanttaxa) is applied. This transformation balances the relative importanceof very abundant versus rare taxa. The function prepareSequenceswill generally be the starting point of any analysis performed with thedistantia package.
``` r
checking the function help-file.
help(prepareSequences)
preparing sequences
AB.sequences <- prepareSequences( sequence.A = sequenceA, sequence.A.name = "A", sequence.B = sequenceB, sequence.B.name = "B", merge.mode = "complete", if.empty.cases = "zero", transformation = "hellinger")
showing first rows of the transformed data
kable(AB.sequences[1:15, ], digits = 4, caption = "Sequences A and B ready for analysis.")```
id | betula | pinus | corylu | junipe | empetr | gramin | cypera | artemi | rumex |
---|---|---|---|---|---|---|---|---|---|
A | 0.4327 | 0.8014 | 0.2921 | 0.0002 | 0.0974 | 0.1288 | 0.2434 | 2e-04 | 0.0002 |
A | 0.4788 | 0.8057 | 0.2919 | 0.0001 | 0.0901 | 0.0780 | 0.1494 | 1e-04 | 0.0001 |
A | 0.3117 | 0.8944 | 0.2726 | 0.0001 | 0.0617 | 0.0436 | 0.1512 | 1e-04 | 0.0001 |
A | 0.4609 | 0.8763 | 0.0990 | 0.0001 | 0.0001 | 0.0572 | 0.0808 | 1e-04 | 0.0001 |
A | 0.2503 | 0.9535 | 0.1101 | 0.0001 | 0.0778 | 0.0636 | 0.0778 | 1e-04 | 0.0001 |
A | 0.3432 | 0.9210 | 0.1548 | 0.0001 | 0.0001 | 0.0632 | 0.0774 | 1e-04 | 0.0001 |
A | 0.3962 | 0.8813 | 0.2416 | 0.0634 | 0.0001 | 0.0001 | 0.0634 | 1e-04 | 0.0001 |
A | 0.3657 | 0.8647 | 0.3129 | 0.0614 | 0.0001 | 0.1063 | 0.0752 | 1e-04 | 0.0001 |
A | 0.5245 | 0.7804 | 0.3134 | 0.0627 | 0.0001 | 0.0886 | 0.0768 | 1e-04 | 0.0001 |
A | 0.5361 | 0.7866 | 0.2552 | 0.0619 | 0.0001 | 0.1452 | 0.0619 | 1e-04 | 0.0001 |
A | 0.5667 | 0.7627 | 0.2606 | 0.0606 | 0.0001 | 0.1421 | 0.0742 | 1e-04 | 0.0001 |
A | 0.5650 | 0.7800 | 0.2222 | 0.0727 | 0.0001 | 0.1111 | 0.0727 | 1e-04 | 0.0001 |
A | 0.5599 | 0.7642 | 0.2716 | 0.0640 | 0.0001 | 0.1280 | 0.0784 | 1e-04 | 0.0453 |
A | 0.5952 | 0.7222 | 0.2990 | 0.0575 | 0.0407 | 0.1467 | 0.0910 | 1e-04 | 0.0001 |
A | 0.6516 | 0.6677 | 0.2950 | 0.1129 | 0.0001 | 0.1457 | 0.0922 | 1e-04 | 0.0001 |
The computation of dissimilarity between the datasets A and B requiresseveralsteps.
1. Computation of a distance matrix among the samples of both sequences.
It is computed by the distanceMatrix function, which allows the userto select a distance metric (so far the ones implemented aremanhattan, euclidean, chi, and hellinger). The functionplotMatrix allows an easy visualization of the resulting distancematrix.
``` r
computing distance matrix
AB.distance.matrix <- distanceMatrix( sequences = AB.sequences, method = "euclidean")
plotting distance matrix
plotMatrix( distance.matrix = AB.distance.matrix, color.palette = "viridis", margins = rep(4,4))```
2. Computation of the least-cost path within the distance matrix.
This step uses a dynamic programming algorithm to find the least-costpath that connnects the cell 1,1 of the matrix (lower left in the imageabove) and the last cell of the matrix (opposite corner). This can bedone via in two different ways.
- an orthogonal search by moving either one step on the x axisor one step on the y axis at a time (see Equation 1).
Equation 1
- an orthogonal and diagonal search (a.k.a diagonal) whichincludes the above, plus a diagonal search.
Equation 2
 + \sum{i=1}^{m}\sum{j=1}^{n} min\left(\begin{array}{c}D(A{i}, B{j+1}), \\ D(A{i+1}, B{j} \\ D(A{i+1}, B{j+1}) \end{array}\right))")
Where:
and
are the number ofsamples of the multivariate time-series
and
.
and
are the indices ofthe samples in
and
being considered oneach step of the recursive algorithm.
is a function thatreturns the multivariate distance (i.e. Manhattan) between any givenpair of samples of
and
.
is a functionreturning the minimum distance of a subset of distances defined inthe neigborhood of the given indices
and
The equation returns, which is the double of the sum of distances that liewithin the least-cost path, and represent the distance between thesamples of A and B. The value of
is computed by using the functions leastCostMatrix,which computes the partial solutions to the least-cost problem,leastCostPath, which returns the best global solution, andleastCost function, which sums the distances of the least-cost pathand multiplies them by 2.
The code below performs these steps according to both equations
``` r
ORTHOGONAL SEARCH
computing least-cost matrix
AB.least.cost.matrix <- leastCostMatrix( distance.matrix = AB.distance.matrix, diagonal = FALSE)
extracting least-cost path
AB.least.cost.path <- leastCostPath( distance.matrix = AB.distance.matrix, least.cost.matrix = AB.least.cost.matrix, diagonal = FALSE )
DIAGONAL SEARCH
computing least-cost matrix
AB.least.cost.matrix.diag <- leastCostMatrix( distance.matrix = AB.distance.matrix, diagonal = TRUE)
extracting least-cost path
AB.least.cost.path.diag <- leastCostPath( distance.matrix = AB.distance.matrix, least.cost.matrix = AB.least.cost.matrix.diag, diagonal = TRUE )
plotting solutions
plotMatrix( distance.matrix = list( 'A|B' = AB.least.cost.matrix[[1]], 'A|B' = AB.least.cost.matrix.diag[[1]] ), least.cost.path = list( 'A|B' = AB.least.cost.path[[1]], 'A|B' = AB.least.cost.path.diag[[1]] ), color.palette = "viridis", margin = rep(4,4), plot.rows = 1, plot.columns = 2 )```
Computing from these solutions is straightforward with thefunction leastCost
``` r
orthogonal solution
AB.between <- leastCost( least.cost.path = AB.least.cost.path )
diagonal solution
AB.between.diag <- leastCost( least.cost.path = AB.least.cost.path.diag )```
Which returns a value for of 33.7206 for the orthogonal solution, and 22.7596 forthe diagonal one. Diagonal solutions always yield lower values for
than orthogonal ones.
Notice the straight vertical and horizontal lines that show up in someregions of the least cost paths shown in the figure above. These areblocks, and happen in dissimilar sections of the compared sequences.Also, an unbalanced number of rows in the compared sequences cangenerate long blocks. Blocks inflate the value of because the distance to a given sample is countedseveral times per block. This problem often leads to false negatives,that is, to the conclusion that two sequences are statisticallydifferent when actually they are not.
This package includes an algorithm to remove blocks from the least costpath, which offers more realistic values for. The function leastCostPathNoBlocks reads a leastcost path, and removes all blocks as follows.
``` r
ORTHOGONAL SOLUTION
removing blocks from least cost path
AB.least.cost.path.nb <- leastCostPathNoBlocks( least.cost.path = AB.least.cost.path )
computing AB.between again
AB.between.nb <- leastCost( least.cost.path = AB.least.cost.path.nb )
DIAGONAL SOLUTION
removing blocks
AB.least.cost.path.diag.nb <- leastCostPathNoBlocks( least.cost.path = AB.least.cost.path.diag )
diagonal solution without blocks
AB.between.diag.nb <- leastCost( least.cost.path = AB.least.cost.path.diag.nb )```
Which now yields 11.2975 for the orthogonal solution, and 16.8667 forthe diagonal one. Notice how now the diagonal solution has a highervalue, because by default, the diagonal method generates less blocks.That is why each measure of dissimilarity (orthogonal, diagonal,orthogonal no-blocks, and diagonal no-blocks) lies within a differentcomparative framework, and therefore, outputs from different methodsshould not be compared.
Hereafter only the diagonal no-blocks option will be considered in theexample cases, since it is the most general and safe solution of thefour mentioned above.
``` r
changing names of the selected solutions
AB.least.cost.path <- AB.least.cost.path.diag.nbAB.between <- AB.between.diag.nb
removing unneeded objects
rm(AB.between.diag, AB.between.diag.nb, AB.between.nb, AB.distance.matrix, AB.least.cost.matrix, AB.least.cost.matrix.diag, AB.least.cost.path.diag, AB.least.cost.path.diag.nb, AB.least.cost.path.nb, sequenceA, sequenceB)```
3. Autosum, or sum of the distances among adjacent samples on each sequence.
This step requires to compute the distances between adjacent samples ineach sequence and sum them, as shown in Equation 3.
Equation 3
This operation is performed by the autoSum function shown below.
``` rAB.within <- autoSum( sequences = AB.sequences, least.cost.path = AB.least.cost.path, method = "euclidean" )AB.within
> $A|B
> [1] 19.69168
```
4. Compute dissimilarity score
.
The dissimilarity measure was firstdescribed in the book “Numerical methods in Quaternary pollenanalysis”(Birks and Gordon, 1985). Psi is computed as shown in Equation 4a:
Equation 4a
This equation has a particularity. Imagine two identical sequences A andB, with three samples each. In this case, is computed as
Since the samples of each sequence with the same index are identical,this can be reduced to
which in turn equals as shown in Equation 4, yielding a
value of0.
This equality does not work in the same way when the least-cost pathsearch-method includes diagonals. When the sequenes are identical,diagonal methods yield an of 0, leading to a
equal to-1. To fix this shift, this package uses Equation 4b instead when
is selected, which adds 1 to the final solution.
Equation 4b
In any case, the psi function only requires the least-cost, and theautosum of both sequences to compute. Since weare working with a diagonal search, 1 has to be added to the finalsolution.
rAB.psi <- psi( least.cost = AB.between, autosum = AB.within )AB.psi[[1]] <- AB.psi[[1]] + 1
Which yields a psi equal to 1.7131. The output of psi is a list,that can be transformed to a dataframe or a matrix by using theformatPsi function.
``` r
to dataframe
AB.psi.dataframe <- formatPsi( psi.values = AB.psi, to = "dataframe")kable(AB.psi.dataframe, digits = 4)```
A | B | psi |
---|---|---|
A | B | 1.7131 |
workflowPsi: doing it all at once
All the steps required to compute psi, including the format optionsprovided by formatPsi are wrapped together in the functionworkflowPsi. It includes options to switch to a diagonal method, andto ignore blocks, as shown below.
``` r
checking the help file
help(workflowPsi)
computing psi for A and B
AB.psi <- workflowPsi( sequences = AB.sequences, grouping.column = "id", method = "euclidean", format = "list", diagonal = TRUE, ignore.blocks = TRUE)AB.psi
> $A|B
> [1] 1.713075
```
The function allows to exclude particular columns from the analysis(argument exclude.columns), select different distance metrics(argument method), use diagonals to find the least-cost path (argumentdiagonal), or measure psi by ignoring blocks in the least-cost path(argument ignore.blocks). Since we have observed several blocks in theleast-cost path, below we compute psi by ignoring them.
``` r
cleaning workspace
rm(list = ls())```
Comparing multiple irregular METS
The package can work seamlessly with any given number of sequences, aslong as there is memory enough available (but check the new functionworkflowPsiHP, it can work with up to 40k sequences, if you have acluster at hand, and a few years to waste). To do so, almost everyfunction uses the packages“doParallel” and“foreach”, that togetherallow to parallelize the execution of the distantia functions by usingall the processors in your machine but one.
The example dataset sequencesMIS contains 12 sections of the samesequence belonging to different marine isotopic stages identified by acolumn named “MIS”. MIS stages with odd numbers are generallyinterpreted as warm periods (interglacials), while the odd ones areinterpreted as cold periods (glacials). In any case, this interpretationis not important to illustrate this capability of the library.
rdata(sequencesMIS)kable(head(sequencesMIS, n=15), digits = 4, caption = "Header of the sequencesMIS dataset.")
MIS | Quercus | Betula | Pinus | Alnus | Tilia | Carpinus |
---|---|---|---|---|---|---|
MIS-1 | 55 | 1 | 5 | 3 | 4 | 5 |
MIS-1 | 86 | 21 | 35 | 8 | 0 | 10 |
MIS-1 | 120 | 15 | 8 | 1 | 0 | 1 |
MIS-1 | 138 | 16 | 12 | 6 | 1 | 3 |
MIS-1 | 130 | 12 | 17 | 2 | 1 | 1 |
MIS-1 | 128 | 0 | 6 | 4 | 2 | 2 |
MIS-1 | 140 | 0 | 19 | 9 | 4 | 0 |
MIS-1 | 113 | 0 | 15 | 12 | 2 | 5 |
MIS-1 | 98 | 0 | 27 | 2 | 2 | 0 |
MIS-1 | 92 | 1 | 16 | 7 | 3 | 0 |
MIS-1 | 73 | 3 | 22 | 3 | 0 | 0 |
MIS-1 | 91 | 1 | 21 | 3 | 7 | 0 |
MIS-1 | 148 | 1 | 22 | 1 | 4 | 0 |
MIS-1 | 148 | 0 | 1 | 7 | 13 | 0 |
MIS-1 | 149 | 1 | 2 | 5 | 4 | 0 |
``` runique(sequencesMIS$MIS)
> [1] "MIS-1" "MIS-2" "MIS-3" "MIS-4" "MIS-5" "MIS-6" "MIS-7"
> [8] "MIS-8" "MIS-9" "MIS-10" "MIS-11" "MIS-12"
```
The dataset is checked and prepared with prepareSequences.
rMIS.sequences <- prepareSequences( sequences = sequencesMIS, grouping.column = "MIS", if.empty.cases = "zero", transformation = "hellinger")
The dissimilarity measure psi can be computed for every combinationof sequences through the function workflowPsi shown below.
``` rMIS.psi <- workflowPsi( sequences = MIS.sequences, grouping.column = "MIS", method = "euclidean", diagonal = TRUE, ignore.blocks = TRUE)
there is also a "high-performance" (HP) version of this function with a much lower memory footprint. It uses the options method = "euclidean", diagonal = TRUE, and ignore.blocks = TRUE by default
MIS.psi <- workflowPsiHP( sequences = MIS.sequences, grouping.column = "MIS")
ordered with lower psi on top
kable(MIS.psi[order(MIS.psi$psi), ], digits = 4, caption = "Psi values between pairs of MIS periods.")```
A | B | psi | |
---|---|---|---|
24 | MIS-3 | MIS-6 | 0.8631 |
59 | MIS-8 | MIS-11 | 0.8643 |
65 | MIS-10 | MIS-12 | 0.9099 |
30 | MIS-3 | MIS-12 | 0.9359 |
61 | MIS-9 | MIS-10 | 0.9422 |
66 | MIS-11 | MIS-12 | 0.9816 |
40 | MIS-5 | MIS-7 | 0.9866 |
64 | MIS-10 | MIS-11 | 0.9931 |
62 | MIS-9 | MIS-11 | 1.0009 |
60 | MIS-8 | MIS-12 | 1.0079 |
43 | MIS-5 | MIS-10 | 1.0174 |
45 | MIS-5 | MIS-12 | 1.0242 |
28 | MIS-3 | MIS-10 | 1.0340 |
42 | MIS-5 | MIS-9 | 1.0392 |
56 | MIS-7 | MIS-12 | 1.0423 |
63 | MIS-9 | MIS-12 | 1.0531 |
53 | MIS-7 | MIS-9 | 1.0598 |
51 | MIS-6 | MIS-12 | 1.0736 |
26 | MIS-3 | MIS-8 | 1.0846 |
52 | MIS-7 | MIS-8 | 1.0970 |
21 | MIS-2 | MIS-12 | 1.1002 |
32 | MIS-4 | MIS-6 | 1.1020 |
58 | MIS-8 | MIS-10 | 1.1186 |
54 | MIS-7 | MIS-10 | 1.1209 |
13 | MIS-2 | MIS-4 | 1.1214 |
15 | MIS-2 | MIS-6 | 1.1273 |
12 | MIS-2 | MIS-3 | 1.1365 |
49 | MIS-6 | MIS-10 | 1.1670 |
27 | MIS-3 | MIS-9 | 1.1725 |
47 | MIS-6 | MIS-8 | 1.1929 |
57 | MIS-8 | MIS-9 | 1.1982 |
23 | MIS-3 | MIS-5 | 1.2468 |
22 | MIS-3 | MIS-4 | 1.2541 |
44 | MIS-5 | MIS-11 | 1.2656 |
29 | MIS-3 | MIS-11 | 1.2770 |
41 | MIS-5 | MIS-8 | 1.2890 |
19 | MIS-2 | MIS-10 | 1.2976 |
55 | MIS-7 | MIS-11 | 1.3568 |
48 | MIS-6 | MIS-9 | 1.3699 |
50 | MIS-6 | MIS-11 | 1.4274 |
25 | MIS-3 | MIS-7 | 1.4692 |
39 | MIS-5 | MIS-6 | 1.4884 |
38 | MIS-4 | MIS-12 | 1.5184 |
17 | MIS-2 | MIS-8 | 1.5455 |
34 | MIS-4 | MIS-8 | 1.5710 |
14 | MIS-2 | MIS-5 | 1.6316 |
46 | MIS-6 | MIS-7 | 1.6411 |
36 | MIS-4 | MIS-10 | 1.6487 |
18 | MIS-2 | MIS-9 | 1.6933 |
4 | MIS-1 | MIS-5 | 1.7265 |
6 | MIS-1 | MIS-7 | 1.7712 |
3 | MIS-1 | MIS-4 | 1.8732 |
20 | MIS-2 | MIS-11 | 1.9079 |
33 | MIS-4 | MIS-7 | 1.9633 |
11 | MIS-1 | MIS-12 | 2.0587 |
16 | MIS-2 | MIS-7 | 2.1501 |
8 | MIS-1 | MIS-9 | 2.2155 |
7 | MIS-1 | MIS-8 | 2.2918 |
1 | MIS-1 | MIS-2 | 2.2944 |
2 | MIS-1 | MIS-3 | 2.3167 |
5 | MIS-1 | MIS-6 | 2.3369 |
10 | MIS-1 | MIS-11 | 2.3639 |
37 | MIS-4 | MIS-11 | 2.3919 |
35 | MIS-4 | MIS-9 | 2.4482 |
31 | MIS-4 | MIS-5 | 2.4810 |
9 | MIS-1 | MIS-10 | 2.6995 |
A dataframe like this can be transformed into a matrix to be plotted asan adjacency network with the qgraph package.
``` r
psi values to matrix
MIS.psi.matrix <- formatPsi( psi.values = MIS.psi, to = "matrix")
dissimilariy to distance
MIS.distance <- 1/MIS.psi.matrix**4
plotting network
qgraph::qgraph( MIS.distance, layout='spring', vsize=5, labels = colnames(MIS.distance), colors = viridis::viridis(2, begin = 0.3, end = 0.8, alpha = 0.5, direction = -1) )```
Or as a matrix with ggplot2.
``` r
ordering factors to get a triangular matrix
MIS.psi$A <- factor(MIS.psi$A, levels=unique(sequencesMIS$MIS))MIS.psi$B <- factor(MIS.psi$B, levels=unique(sequencesMIS$MIS))
plotting matrix
ggplot(data=na.omit(MIS.psi), aes(x=A, y=B, size=psi, color=psi)) + geompoint() + viridis::scalecolor_viridis(direction = -1) + guides(size = FALSE)```
The dataframe of dissimilarities between pairs of sequences can be alsoused to analyze the drivers of dissimilarity. To do so, attributes suchas differences in time (when sequences represent different times) ordistance (when sequences represent different sites) between sequences,or differences between physical/climatic attributes between sequencessuch as topography or climate can be added to the table, so models suchas (were A, B, and C are these attributes) can befitted.
``` r
cleaning workspace
rm(list = ls())```
Comparing regular aligned METS
The package distantia is also useful to compare synchronic sequencesthat have the same number of samples. In this particular case, distancesto obtain are computed only between samples with the sametime/depth/order, and no distance matrix (nor least-cost analysis) isrequired. When the argument paired.samples in prepareSequences isset to TRUE, the function checks if the sequences have the same numberof rows, and, if time.column is provided, it selects the samples thathave valid time/depth columns for every sequence in the dataset.
Here we test these ideas with the climate dataset included in thelibrary. It represents simulated palaeoclimate over 200 ky. at foursites identified by the column sequenceId. Note that this time thetransformation applied is “scaled”, which uses the scale function ofR base to center and scale the data.
``` r
loading sample data
data(climate)
preparing sequences
climate <- prepareSequences( sequences = climate, grouping.column = "sequenceId", time.column = "time", paired.samples = TRUE, transformation = "scale" )```
In this case, the argument paired.samples of workflowPsi must beset to TRUE. Additionally, if the argument same.time is set to TRUE,the time/age of the samples is checked, and samples without the sametime/age are removed from the analysis.
``` r
computing psi
climate.psi <- workflowPsi( sequences = climate, grouping.column = "sequenceId", time.column = "time", method = "euclidean", paired.samples = TRUE, #this bit is important same.time = TRUE, #removes samples with unequal time format = "dataframe")
ordered with lower psi on top
kable(climate.psi[order(climate.psi$psi), ], digits = 4, row.names = FALSE, caption = "Psi values between pairs of sequences in the 'climate' dataset.")```
A | B | psi |
---|---|---|
2 | 4 | 3.4092 |
4 | 2 | 3.4092 |
1 | 3 | 3.5702 |
3 | 1 | 3.5702 |
3 | 4 | 4.1139 |
4 | 3 | 4.1139 |
1 | 2 | 4.2467 |
2 | 1 | 4.2467 |
2 | 3 | 4.6040 |
3 | 2 | 4.6040 |
1 | 4 | 4.8791 |
4 | 1 | 4.8791 |
``` r
cleaning workspace
rm(list = ls())```
Restricted permutation test to assess the significance of dissimilarity values
One question that may arise when comparing time series is “to whatextent are dissimilarity values a result of chance?”. Answering thisquestion requires to compare a given dissimilarity value with adistribution of dissimilarity values resulting from chance. However… howdo we simulate chance in a multivariate time-series? The natural answeris “permutation”. Since samples in a multivariate time-series areordered, randomly re-shuffling samples is out of the question, becausethat would destroy the structure of the data. A more gentler alternativeis to randomly switch single data-points (a case of a variable)independently by variable. This kind of permutation is named “restrictedpermutation”, and preserves global trends within the data, but changeslocal structure.
A restricted permutation test on psi values requires the followingsteps:
- Compute the real psi on two given sequences A and B.
- Repeat the following steps several times (99 to 999):
- For each case of each column of A and B, randomly apply one ofthese actions:
- Leave it as is.
- Replace it with the previous case.
- Replace it with the next case.
- Compute randomized psi between A and B and store the value.
- For each case of each column of A and B, randomly apply one ofthese actions:
- Add real psi to the pool of randomized psi.
- Compute the proportion of randomized psi that is equal or lowerthan real psi.
Such a proportion represents the probability of obtaining a value lowerthan real psi by chance.
Since the restricted permutation only happens at a local scale withineach column of each sequence, the probability values returned are veryconservative and shouldn’t be interpreted in the same way p-values areinterpreted.
The process described above has been implemented in theworkflowNullPsi function. We will apply it to three groups of thesequencesMIS dataset.
``` r
getting example data
data(sequencesMIS)
working with 3 groups (to make this fast)
sequencesMIS <- sequencesMIS[sequencesMIS$MIS %in% c("MIS-4", "MIS-5", "MIS-6"),]
preparing sequences
sequencesMIS <- prepareSequences( sequences = sequencesMIS, grouping.column = "MIS", transformation = "hellinger")```
The computation of the null psi values goes as follows:
``` rrandom.psi <- workflowNullPsi( sequences = sequencesMIS, grouping.column = "MIS", method = "euclidean", diagonal = TRUE, ignore.blocks = TRUE, repetitions = 99, #recommended value: 999 parallel.execution = TRUE)
there is also a high-performance version of this function with fewer options (diagonal = TRUE, ignore.blocks = TRUE, and method = "euclidean" are used by default)
random.psi <- workflowNullPsiHP( sequences = sequencesMIS, grouping.column = "MIS", repetitions = 99, #recommended value: 999 parallel.execution = TRUE)```
Note that the number of repetitions has been set to 9 in order tospeed-up execution. The actual number would ideally be 999.
The output is a list with two dataframes, psi and p.
The dataframe psi contains the real and random psi values. Thecolumn psi contains the dissimilarity between the sequences in thecolumns A and B. The columns r1 to r9 contain the psi valuesobtained from permutations of thesequences.
rkable(random.psi$psi, digits = 4, caption = "True and null psi values generated by workflowNullPsi.")
A | B | psi | r1 | r2 | r3 | r4 | r5 | r6 | r7 | r8 | r9 | r10 | r11 | r12 | r13 | r14 | r15 | r16 | r17 | r18 | r19 | r20 | r21 | r22 | r23 | r24 | r25 | r26 | r27 | r28 | r29 | r30 | r31 | r32 | r33 | r34 | r35 | r36 | r37 | r38 | r39 | r40 | r41 | r42 | r43 | r44 | r45 | r46 | r47 | r48 | r49 | r50 | r51 | r52 | r53 | r54 | r55 | r56 | r57 | r58 | r59 | r60 | r61 | r62 | r63 | r64 | r65 | r66 | r67 | r68 | r69 | r70 | r71 | r72 | r73 | r74 | r75 | r76 | r77 | r78 | r79 | r80 | r81 | r82 | r83 | r84 | r85 | r86 | r87 | r88 | r89 | r90 | r91 | r92 | r93 | r94 | r95 | r96 | r97 | r98 | r99 | r100 | r101 | r102 | r103 | r104 | r105 | r106 | r107 | r108 | r109 | r110 | r111 | r112 | r113 | r114 | r115 | r116 | r117 | r118 | r119 | r120 | r121 | r122 | r123 | r124 | r125 | r126 | r127 | r128 | r129 | r130 | r131 | r132 | r133 | r134 | r135 | r136 | r137 | r138 | r139 | r140 | r141 | r142 | r143 | r144 | r145 | r146 | r147 | r148 | r149 | r150 | r151 | r152 | r153 | r154 | r155 | r156 | r157 | r158 | r159 | r160 | r161 | r162 | r163 | r164 | r165 | r166 | r167 | r168 | r169 | r170 | r171 | r172 | r173 | r174 | r175 | r176 | r177 | r178 | r179 | r180 | r181 | r182 | r183 | r184 | r185 | r186 | r187 | r188 | r189 | r190 | r191 | r192 | r193 | r194 | r195 | r196 | r197 | r198 | r199 | r200 | r201 | r202 | r203 | r204 | r205 | r206 | r207 | r208 | r209 | r210 | r211 | r212 | r213 | r214 | r215 | r216 | r217 | r218 | r219 | r220 | r221 | r222 | r223 | r224 | r225 | r226 | r227 | r228 | r229 | r230 | r231 | r232 | r233 | r234 | r235 | r236 | r237 | r238 | r239 | r240 | r241 | r242 | r243 | r244 | r245 | r246 | r247 | r248 | r249 | r250 | r251 | r252 | r253 | r254 | r255 | r256 | r257 | r258 | r259 | r260 | r261 | r262 | r263 | r264 | r265 | r266 | r267 | r268 | r269 | r270 | r271 | r272 | r273 | r274 | r275 | r276 | r277 | r278 | r279 | r280 | r281 | r282 | r283 | r284 | r285 | r286 | r287 | r288 | r289 | r290 | r291 | r292 | r293 | r294 | r295 | r296 | r297 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MIS-4 | MIS-5 | 2.4810 | 0.0000 | 3.7536 | 1.3317 | 0.0000 | 3.8819 | 1.4707 | 0.0000 | 3.6452 | 1.2600 | 0.0000 | 3.5013 | 1.3834 | 0.0000 | 4.0277 | 1.3484 | 0.0000 | 2.9678 | 1.1679 | 0.0000 | 3.7815 | 1.3421 | 0.0000 | 3.8574 | 1.2646 | 0.0000 | 3.7788 | 1.2697 | 0.0000 | 3.7480 | 1.2656 | 0.0000 | 3.3589 | 1.3758 | 0.0000 | 2.6402 | 1.4237 | 0.0000 | 3.7716 | 1.1127 | 0.0000 | 3.7987 | 1.4647 | 0.0000 | 3.6555 | 1.4283 | 0.000 | 3.3450 | 1.1930 | 0.0000 | 2.8487 | 1.2650 | 0.0000 | 3.2751 | 0.9897 | 0.0000 | 3.7681 | 1.2368 | 0.0000 | 3.0029 | 1.1084 | 0.0000 | 2.9344 | 1.1265 | 0.0000 | 3.0970 | 1.3298 | 0.0000 | 3.2655 | 1.0583 | 0.0000 | 3.7342 | 1.2722 | 0.0000 | 3.9553 | 1.1021 | 0.0000 | 3.6368 | 1.1076 | 0.0000 | 3.5591 | 1.4114 | 0.0000 | 3.2966 | 1.3217 | 0.0000 | 3.5813 | 1.1319 | 0.0000 | 3.3199 | 0.983 | 0.0000 | 3.8344 | 1.1349 | 0.0000 | 3.9911 | 1.2949 | 0.0000 | 3.3264 | 1.2411 | 0.0000 | 3.8077 | 1.3534 | 0.0000 | 3.7525 | 1.2932 | 0.0000 | 3.3755 | 1.4088 | 0.0000 | 3.7383 | 1.3163 | 0.0000 | 3.5603 | 1.3440 | 0.0000 | 4.5097 | 1.6253 | 0.0000 | 3.0503 | 1.1434 | 0.0000 | 3.5231 | 1.3495 | 0.0000 | 3.5837 | 1.2140 | 0.0000 | 4.2384 | 1.2953 | 0.0000 | 3.4604 | 1.2267 | 0.0000 | 3.5569 | 1.3008 | 0.0000 | 3.8854 | 1.5219 | 0.0000 | 4.0438 | 1.1978 | 0.0000 | 3.6232 | 1.1538 | 0.0000 | 3.7029 | 1.7062 | 0.0000 | 4.0271 | 1.3001 | 0.0000 | 2.8608 | 1.3602 | 0.0000 | 3.8412 | 1.0693 | 0.0000 | 3.3452 | 1.3526 | 0.0000 | 3.7804 | 1.1964 | 0.0000 | 3.8922 | 1.2574 | 0.0000 | 3.4213 | 1.1074 | 0.0000 | 3.5955 | 1.2909 | 0.0000 | 3.1859 | 1.2943 | 0.0000 | 3.6091 | 1.2801 | 0.0000 | 2.9185 | 1.3562 | 0.0000 | 4.1324 | 1.1933 | 0.0000 | 3.8081 | 1.4057 | 0.0000 | 3.5450 | 1.4605 | 0.0000 | 3.7457 | 1.2695 | 0.0000 | 3.6136 | 1.3570 | 0.0000 | 3.8297 | 1.3051 | 0.0000 | 3.5123 | 1.0305 | 0.0000 | 3.4788 | 1.0452 | 0.0000 | 3.3500 | 1.5083 | 0.0000 | 3.4277 | 1.0918 | 0.0000 | 3.5416 | 1.1457 | 0.0000 | 3.8708 | 1.4005 | 0.0000 | 3.6071 | 1.2262 | 0.0000 | 3.5871 | 1.3329 | 0.0000 | 4.0342 | 1.4382 | 0.0000 | 3.6460 | 1.2434 | 0.0000 | 3.5562 | 1.2390 | 0.0000 | 3.5911 | 1.124 | 0.0000 | 3.7326 | 1.5393 | 0.0000 | 3.9109 | 1.4098 | 0.0000 | 3.4930 | 1.1832 | 0.0000 | 2.7677 | 1.2942 | 0.0000 | 3.6112 | 1.2785 | 0.0000 | 3.7562 | 1.1567 | 0.0000 | 3.4105 | 1.2620 | 0.0000 | 3.0288 | 1.0794 | 0.0000 | 4.3534 | 1.5268 | 0.0000 | 3.3925 | 1.1479 | 0.0000 | 3.4932 | 1.0770 | 0.0000 | 3.5992 | 1.4058 | 0.0000 | 3.7622 | 1.2329 | 0.0000 | 3.7002 | 1.2894 | 0.0000 | 3.5179 | 1.1918 | 0.0000 | 4.2240 | 1.3402 | 0.0000 | 3.5807 | 1.1993 | 0.0000 | 3.8250 | 1.2416 | 0.0000 | 3.7522 | 1.4727 | 0.0000 | 3.6046 | 1.2223 | 0.0000 | 3.8396 | 1.0294 |
MIS-4 | MIS-6 | 1.1020 | 3.7536 | 0.0000 | 1.7778 | 3.8819 | 0.0000 | 1.7145 | 3.6452 | 0.0000 | 1.4906 | 3.5013 | 0.0000 | 1.5997 | 4.0277 | 0.0000 | 1.9284 | 2.9678 | 0.0000 | 1.7680 | 3.7815 | 0.0000 | 1.7752 | 3.8574 | 0.0000 | 1.7149 | 3.7788 | 0.0000 | 1.6472 | 3.7480 | 0.0000 | 1.7108 | 3.3589 | 0.0000 | 1.7109 | 2.6402 | 0.0000 | 1.5961 | 3.7716 | 0.0000 | 1.6565 | 3.7987 | 0.0000 | 1.7153 | 3.6555 | 0.0000 | 1.5970 | 3.345 | 0.0000 | 1.6136 | 2.8487 | 0.0000 | 1.6701 | 3.2751 | 0.0000 | 1.4932 | 3.7681 | 0.0000 | 1.6126 | 3.0029 | 0.0000 | 1.9839 | 2.9344 | 0.0000 | 1.8604 | 3.0970 | 0.0000 | 1.5921 | 3.2655 | 0.0000 | 1.6640 | 3.7342 | 0.0000 | 1.5258 | 3.9553 | 0.0000 | 1.6979 | 3.6368 | 0.0000 | 1.5611 | 3.5591 | 0.0000 | 1.6552 | 3.2966 | 0.0000 | 1.6900 | 3.5813 | 0.0000 | 1.8737 | 3.3199 | 0.0000 | 1.717 | 3.8344 | 0.0000 | 1.5950 | 3.9911 | 0.0000 | 1.6099 | 3.3264 | 0.0000 | 1.5220 | 3.8077 | 0.0000 | 1.3983 | 3.7525 | 0.0000 | 1.3374 | 3.3755 | 0.0000 | 1.8604 | 3.7383 | 0.0000 | 1.8833 | 3.5603 | 0.0000 | 1.7699 | 4.5097 | 0.0000 | 1.6200 | 3.0503 | 0.0000 | 1.6997 | 3.5231 | 0.0000 | 1.7551 | 3.5837 | 0.0000 | 1.6986 | 4.2384 | 0.0000 | 1.5901 | 3.4604 | 0.0000 | 1.7288 | 3.5569 | 0.0000 | 1.5894 | 3.8854 | 0.0000 | 1.5422 | 4.0438 | 0.0000 | 1.7261 | 3.6232 | 0.0000 | 1.3016 | 3.7029 | 0.0000 | 1.6888 | 4.0271 | 0.0000 | 1.8111 | 2.8608 | 0.0000 | 1.9147 | 3.8412 | 0.0000 | 1.7930 | 3.3452 | 0.0000 | 1.8602 | 3.7804 | 0.0000 | 1.9937 | 3.8922 | 0.0000 | 1.7063 | 3.4213 | 0.0000 | 1.4280 | 3.5955 | 0.0000 | 1.7968 | 3.1859 | 0.0000 | 1.5233 | 3.6091 | 0.0000 | 1.3548 | 2.9185 | 0.0000 | 1.7555 | 4.1324 | 0.0000 | 1.6919 | 3.8081 | 0.0000 | 1.6254 | 3.5450 | 0.0000 | 1.6068 | 3.7457 | 0.0000 | 1.9194 | 3.6136 | 0.0000 | 1.2503 | 3.8297 | 0.0000 | 1.5950 | 3.5123 | 0.0000 | 1.3527 | 3.4788 | 0.0000 | 1.4569 | 3.3500 | 0.0000 | 1.9489 | 3.4277 | 0.0000 | 1.6999 | 3.5416 | 0.0000 | 1.5389 | 3.8708 | 0.0000 | 1.4900 | 3.6071 | 0.0000 | 1.8822 | 3.5871 | 0.0000 | 1.6469 | 4.0342 | 0.0000 | 1.8238 | 3.6460 | 0.0000 | 1.4199 | 3.5562 | 0.0000 | 1.9143 | 3.5911 | 0.0000 | 1.530 | 3.7326 | 0.0000 | 1.6699 | 3.9109 | 0.0000 | 1.8029 | 3.4930 | 0.0000 | 1.6022 | 2.7677 | 0.0000 | 1.6816 | 3.6112 | 0.0000 | 1.7398 | 3.7562 | 0.0000 | 1.3929 | 3.4105 | 0.0000 | 1.6506 | 3.0288 | 0.0000 | 1.5035 | 4.3534 | 0.0000 | 1.9079 | 3.3925 | 0.0000 | 1.7209 | 3.4932 | 0.0000 | 1.5058 | 3.5992 | 0.0000 | 1.7595 | 3.7622 | 0.0000 | 1.7663 | 3.7002 | 0.0000 | 1.7471 | 3.5179 | 0.0000 | 1.6484 | 4.2240 | 0.0000 | 1.6049 | 3.5807 | 0.0000 | 1.6514 | 3.8250 | 0.0000 | 1.8138 | 3.7522 | 0.0000 | 1.7080 | 3.6046 | 0.0000 | 1.2971 | 3.8396 | 0.0000 | 1.5657 |
MIS-5 | MIS-6 | 1.4884 | 1.3317 | 1.7778 | 0.0000 | 1.4707 | 1.7145 | 0.0000 | 1.2600 | 1.4906 | 0.0000 | 1.3834 | 1.5997 | 0.0000 | 1.3484 | 1.9284 | 0.0000 | 1.1679 | 1.7680 | 0.0000 | 1.3421 | 1.7752 | 0.0000 | 1.2646 | 1.7149 | 0.0000 | 1.2697 | 1.6472 | 0.0000 | 1.2656 | 1.7108 | 0.0000 | 1.3758 | 1.7109 | 0.0000 | 1.4237 | 1.5961 | 0.0000 | 1.1127 | 1.6565 | 0.0000 | 1.4647 | 1.7153 | 0.0000 | 1.4283 | 1.5970 | 0.0000 | 1.193 | 1.6136 | 0.0000 | 1.2650 | 1.6701 | 0.0000 | 0.9897 | 1.4932 | 0.0000 | 1.2368 | 1.6126 | 0.0000 | 1.1084 | 1.9839 | 0.0000 | 1.1265 | 1.8604 | 0.0000 | 1.3298 | 1.5921 | 0.0000 | 1.0583 | 1.6640 | 0.0000 | 1.2722 | 1.5258 | 0.0000 | 1.1021 | 1.6979 | 0.0000 | 1.1076 | 1.5611 | 0.0000 | 1.4114 | 1.6552 | 0.0000 | 1.3217 | 1.6900 | 0.0000 | 1.1319 | 1.8737 | 0.0000 | 0.9830 | 1.7170 | 0.000 | 1.1349 | 1.5950 | 0.0000 | 1.2949 | 1.6099 | 0.0000 | 1.2411 | 1.5220 | 0.0000 | 1.3534 | 1.3983 | 0.0000 | 1.2932 | 1.3374 | 0.0000 | 1.4088 | 1.8604 | 0.0000 | 1.3163 | 1.8833 | 0.0000 | 1.3440 | 1.7699 | 0.0000 | 1.6253 | 1.6200 | 0.0000 | 1.1434 | 1.6997 | 0.0000 | 1.3495 | 1.7551 | 0.0000 | 1.2140 | 1.6986 | 0.0000 | 1.2953 | 1.5901 | 0.0000 | 1.2267 | 1.7288 | 0.0000 | 1.3008 | 1.5894 | 0.0000 | 1.5219 | 1.5422 | 0.0000 | 1.1978 | 1.7261 | 0.0000 | 1.1538 | 1.3016 | 0.0000 | 1.7062 | 1.6888 | 0.0000 | 1.3001 | 1.8111 | 0.0000 | 1.3602 | 1.9147 | 0.0000 | 1.0693 | 1.7930 | 0.0000 | 1.3526 | 1.8602 | 0.0000 | 1.1964 | 1.9937 | 0.0000 | 1.2574 | 1.7063 | 0.0000 | 1.1074 | 1.4280 | 0.0000 | 1.2909 | 1.7968 | 0.0000 | 1.2943 | 1.5233 | 0.0000 | 1.2801 | 1.3548 | 0.0000 | 1.3562 | 1.7555 | 0.0000 | 1.1933 | 1.6919 | 0.0000 | 1.4057 | 1.6254 | 0.0000 | 1.4605 | 1.6068 | 0.0000 | 1.2695 | 1.9194 | 0.0000 | 1.3570 | 1.2503 | 0.0000 | 1.3051 | 1.5950 | 0.0000 | 1.0305 | 1.3527 | 0.0000 | 1.0452 | 1.4569 | 0.0000 | 1.5083 | 1.9489 | 0.0000 | 1.0918 | 1.6999 | 0.0000 | 1.1457 | 1.5389 | 0.0000 | 1.4005 | 1.4900 | 0.0000 | 1.2262 | 1.8822 | 0.0000 | 1.3329 | 1.6469 | 0.0000 | 1.4382 | 1.8238 | 0.0000 | 1.2434 | 1.4199 | 0.0000 | 1.2390 | 1.9143 | 0.0000 | 1.1240 | 1.5300 | 0.000 | 1.5393 | 1.6699 | 0.0000 | 1.4098 | 1.8029 | 0.0000 | 1.1832 | 1.6022 | 0.0000 | 1.2942 | 1.6816 | 0.0000 | 1.2785 | 1.7398 | 0.0000 | 1.1567 | 1.3929 | 0.0000 | 1.2620 | 1.6506 | 0.0000 | 1.0794 | 1.5035 | 0.0000 | 1.5268 | 1.9079 | 0.0000 | 1.1479 | 1.7209 | 0.0000 | 1.0770 | 1.5058 | 0.0000 | 1.4058 | 1.7595 | 0.0000 | 1.2329 | 1.7663 | 0.0000 | 1.2894 | 1.7471 | 0.0000 | 1.1918 | 1.6484 | 0.0000 | 1.3402 | 1.6049 | 0.0000 | 1.1993 | 1.6514 | 0.0000 | 1.2416 | 1.8138 | 0.0000 | 1.4727 | 1.7080 | 0.0000 | 1.2223 | 1.2971 | 0.0000 | 1.0294 | 1.5657 | 0.0000 |
The dataframe p contains the probability of obtaining the real psivalue by chance for each combination ofsequences.
rkable(random.psi$p, caption = "Probability of obtaining a given set of psi values by chance.")
A | B | p |
---|---|---|
MIS-4 | MIS-5 | 0.6677852 |
MIS-4 | MIS-6 | 0.3355705 |
MIS-5 | MIS-6 | 0.6845638 |
``` r
cleaning workspace
rm(list = ls())```
Assessing the contribution of a variable to the dissimilarity between two sequences
What variables are more important in explaining the dissimilaritybetween two sequences?, or in other words, what variables contributethe most to the dissimilarity between two sequences? One reasonableanswer is: the one that reduces dissimilarity the most when removed fromthe data. This section explains how to use the functionworkflowImportance follows such a principle to evaluate theimportance of given variables in explaining differences betweensequences.
First, we prepare the data. It is again sequencesMIS, but with onlythree groups selected (MIS 4 to 6) to simplify the analysis.
``` r
getting example data
data(sequencesMIS)
getting three groups only to simplify
sequencesMIS <- sequencesMIS[sequencesMIS$MIS %in% c("MIS-4", "MIS-5", "MIS-6"),]
preparing sequences
sequences <- prepareSequences( sequences = sequencesMIS, grouping.column = "MIS", merge.mode = "complete")```
The workflow function is pretty similar to the ones explained above.However, unlike the other functions in the package, that parallelizeacross the comparison of pairs of sequences, this one parallelizes thecomputation of psi on combinations of columns, removing one columneach time.
WARNING: the argument ‘exclude.columns’ of ‘workflowImportance’ doesnot work in version 1.0.0 (available in CRAN), but the bug is fixed inversion 1.0.1 (available in GitHub). If you are using 1.0.0, I recommendyou to subset ‘sequences’ so only the grouping column and the numericcolumns to be compared are available for the function.
``` rpsi.importance <- workflowImportance( sequences = sequencesMIS, grouping.column = "MIS", method = "euclidean", diagonal = TRUE, ignore.blocks = TRUE )
there is also a high performance version of this function, but with fewer options (it uses euclidean, diagonal, and ignores blocks by default)
psi.importance <- workflowImportanceHP( sequences = sequencesMIS, grouping.column = "MIS" )```
The output is a list with two slots named psi and psi.drop.
The dataframe psi contains psi values for each combination ofvariables (named in the coluns A and B) computed for all columns inthe column All variables, and one column per variable named Withoutvariable_name containing the psi value when that variable is removedfrom the comparedsequences.
rkable(psi.importance$psi, digits = 4, caption = "Psi values with all variables (column 'All variables'), and without one variable at a time.")
A | B | All variables | Without Carpinus | Without Tilia | Without Alnus | Without Pinus | Without Betula | Without Quercus |
---|---|---|---|---|---|---|---|---|
MIS-4 | MIS-5 | 4.3929 | 4.4018 | 4.4070 | 4.4087 | 5.9604 | 4.4240 | 0.9878 |
MIS-4 | MIS-6 | 1.0376 | 1.0374 | 1.0372 | 1.0303 | 1.6222 | 1.0311 | 0.9646 |
MIS-5 | MIS-6 | 1.7680 | 1.7684 | 1.7685 | 1.7695 | 1.1980 | 1.7873 | 1.0706 |
This table can be plotted as a bar plot as follows:
``` r
extracting object
psi.df <- psi.importance$psi
to long format
psi.df.long <- tidyr::gather(psi.df, variable, psi, 3:ncol(psi.df))
creating column with names of the sequences
psi.df.long$name <- paste(psi.df.long$A, psi.df.long$B, sep=" - ")
plot
ggplot(data=psi.df.long, aes(x=variable, y=psi, fill=psi)) + geombar(stat = "identity") + coordflip() + facetwrap("name") + scalefill_viridis(direction = -1) + ggtitle("Contribution of separated variables to dissimilarity.") + labs(fill = "Psi")```
The second table, named psi.drop describes the drop in psi values,in percentage, when the given variable is removed from the analysis.Large positive numbers indicate that dissimilarity drops (increase insimilarity) when the given variable is removed, confirming that thevariable is important to explain the dissimilarity between bothsequences. Negative values indicate an increase in dissimilarity betweenthe sequences when the variable is dropped.
In summary:
- High psi-drop value: variable contributes to dissimilarity.
- Low or negative psi-drop value: variable contributes tosimilarity.
rkable(psi.importance$psi.drop, caption = "Drop in psi, as percentage of the psi value obtained when using all variables. Positive values indicate that the sequences become more similar when the given variable is removed (contribution to dissimilarity), while negative values indicate that the sequences become more dissimilar when the variable is removed (contribution to similarity).")
A | B | Carpinus | Tilia | Alnus | Pinus | Betula | Quercus |
---|---|---|---|---|---|---|---|
MIS-4 | MIS-5 | \-0.20 | \-0.32 | \-0.36 | \-35.68 | \-0.71 | 77.51 |
MIS-4 | MIS-6 | 0.01 | 0.03 | 0.70 | \-56.35 | 0.62 | 7.03 |
MIS-5 | MIS-6 | \-0.02 | \-0.03 | \-0.09 | 32.24 | \-1.09 | 39.44 |
``` r
extracting object
psi.drop.df <- psi.importance$psi.drop
to long format
psi.drop.df.long <- tidyr::gather(psi.drop.df, variable, psi, 3:ncol(psi.drop.df))
creating column with names of the sequences
psi.drop.df.long$name <- paste(psi.drop.df.long$A, psi.drop.df.long$B, sep=" - ")
plot
ggplot(data=psi.drop.df.long, aes(x=variable, y=psi, fill=psi)) + geombar(stat = "identity") + coordflip() + geomhline(yintercept = 0, size = 0.3) + facetwrap("name") + scalefillviridis(direction = -1) + ggtitle("Drop in dissimilarity when variables are removed.") + ylab("Drop in dissimilarity (%)") + labs(fill = "Psi drop (%)")```
``` r
cleaning workspace
rm(list = ls())```
Partial matching: finding the section in a long sequence more similar to a given short sequence
In this scenario the user has one short and one long sequence, and thegoal is to find the section in the long sequence that better matches theshort one. To recreate this scenario we use the dataset sequencesMIS.The first 10 samples will serve as short sequence, and the first 40samples as long sequence. These small subsets are selected to speed-upthe execution time of this example.
``` r
loading the data
data(sequencesMIS)
removing grouping column
sequencesMIS$MIS <- NULL
subsetting to get the short sequence
MIS.short <- sequencesMIS[1:10, ]
subsetting to get the long sequence
MIS.long <- sequencesMIS[1:40, ]```
The sequences have to be prepared and transformed. For simplicity, thesequences are named short and long, and the grouping column is namedid, but the user can name them at will. Since the data representscommunity composition, a Hellinger transformation is applied.
rMIS.short.long <- prepareSequences( sequence.A = MIS.short, sequence.A.name = "short", sequence.B = MIS.long, sequence.B.name = "long", grouping.column = "id", transformation = "hellinger")
The function workflowPartialMatch shown below is going to subset thelong sequence in sizes between min.length and max.length. In theexample below this search space has the same size as MIS.short tospeed-up the execution of this example, but wider windows are possible.If left empty, the length of the segment in the long sequence to bematched will have the same number of samples as the short sequence. Inthe example below we look for segments of the same length, two samplesshorter, and two samples longer than the shorter sequence.
rMIS.psi <- workflowPartialMatch( sequences = MIS.short.long, grouping.column = "id", method = "euclidean", diagonal = TRUE, ignore.blocks = TRUE, min.length = nrow(MIS.short), max.length = nrow(MIS.short))
The function returns a dataframe with three columns: first.row (firstrow of the matched segment of the long sequence), last.row (last rowof the matched segment of the long sequence), and psi (ordered fromlower to higher). In this case, since the long sequence contains theshort sequence, the first row shows a perfectmatch.
rkable(MIS.psi[1:15, ], digits = 4, caption = "First and last row of a section of the long sequence along with the psi value obtained during the partial matching.")
first.row | last.row | psi |
---|---|---|
1 | 10 | 0.0000 |
3 | 12 | 0.2707 |
4 | 13 | 0.2815 |
2 | 11 | 0.3059 |
6 | 15 | 0.4343 |
5 | 14 | 0.5509 |
9 | 18 | 1.3055 |
10 | 19 | 1.3263 |
8 | 17 | 1.3844 |
7 | 16 | 1.3949 |
15 | 24 | 1.4428 |
12 | 21 | 1.4711 |
17 | 26 | 1.4959 |
11 | 20 | 1.5101 |
18 | 27 | 1.5902 |
Subsetting the long sequence to obtain the segment best matching withthe short sequence goes as follows.
``` r
indices of the best matching segment
best.match.indices <- MIS.psi[1, "first.row"]:MIS.psi[1, "last.row"]
subsetting by these indices
best.match <- MIS.long[best.match.indices, ]```
``` r
cleaning workspace
rm(list = ls())```
Sequence slotting: combining samples of two sequences into a single composite sequence
Under this scenario, the objective is to combine two sequences into asingle composite sequence. The basic assumption followed by thealgorithm building the composite sequence is most similar samplesshould go together, but respecting the original ordering of thesequences. Therefore, the output will contain the samples in bothsequences ordered in a way that minimizes the multivariate distancebetween consecutive samples. This scenario assumes that at least one ofthe sequences do not have a time/age/depth column, or that the values insuch a column are uncertain. In any case, time/age/depth is notconsidered as a factor in the generation of the composite sequence.
The example below uses the pollenGP dataset, which contains 200samples, with 40 pollen types each. To create a smalle case study, thecode below separates the first 20 samples of the sequence into twodifferent sequences with 10 randomly selected samples each. Even thoughthis scenario assumes that these sequences do not have depth or age,these columns will be kept so the result can be assessed. That is whythese columns are added to the exclude.columns argument. Also, notethat the argument transformation is set to “none”, so the output isnot transformed, and the outcome can be easily interpreted. This willgive more weight to the most abundant taxa, which will in fact guide theslotting.
``` r
loading the data
data(pollenGP)
getting first 20 samples
pollenGP <- pollenGP[1:20, ]
sampling indices
set.seed(10) #to get same result every timesampling.indices <- sort(sample(1:20, 10))
subsetting the sequence
A <- pollenGP[sampling.indices, ]B <- pollenGP[-sampling.indices, ]
preparing the sequences
AB <- prepareSequences( sequence.A = A, sequence.A.name = "A", sequence.B = B, sequence.B.name = "B", grouping.column = "id", exclude.columns = c("depth", "age"), transformation = "none")```
Once the sequences are prepared, the function workflowSlotting willallow to combine (slot) them. The function computes a distance matrixbetween the samples in both sequences according to the methodargument, computes the least-cost matrix, and generates the least-costpath. Note that it only uses an orthogonal method considering blocks,since this is the only option really suitable for this task.
rAB.combined <- workflowSlotting( sequences = AB, grouping.column = "id", time.column = "age", exclude.columns = "depth", method = "euclidean", plot = TRUE)
The function reads the least-cost path in order to find the combinationof samples of both sequences that minimizes dissimilarity, constrainedby the order of the samples on each sequence. The output dataframe has acolumn named original.index, which has the index of each sample in theoriginaldatasets.
rkable(AB.combined[1:15,1:10], digits = 4, caption = "Combination of sequences A and B.")
id | original.index | depth | age | Abies | Juniperus | Hedera | Plantago | Boraginaceae | Crassulaceae | |
---|---|---|---|---|---|---|---|---|---|---|
1 | A | 1 | 3 | 3.97 | 11243 | 0 | 5 | 12 | 21 | 95 |
11 | B | 1 | 1 | 3.92 | 11108 | 0 | 7 | 0 | 5 | 20 |
12 | B | 2 | 2 | 3.95 | 11189 | 0 | 3 | 3 | 15 | 47 |
13 | B | 3 | 4 | 4.00 | 11324 | 0 | 8 | 43 | 60 | 65 |
2 | A | 2 | 6 | 4.05 | 11459 | 0 | 20 | 73 | 94 | 1 |
14 | B | 4 | 5 | 4.02 | 11378 | 0 | 44 | 76 | 110 | 0 |
3 | A | 3 | 7 | 4.07 | 11514 | 0 | 20 | 80 | 100 | 0 |
4 | A | 4 | 8 | 4.10 | 11595 | 0 | 34 | 80 | 155 | 0 |
5 | A | 5 | 9 | 4.17 | 11784 | 0 | 22 | 44 | 131 | 0 |
15 | B | 5 | 13 | 4.27 | 12054 | 0 | 37 | 30 | 150 | 0 |
6 | A | 6 | 10 | 4.20 | 11865 | 0 | 35 | 30 | 112 | 0 |
7 | A | 7 | 11 | 4.22 | 11919 | 0 | 30 | 45 | 150 | 0 |
8 | A | 8 | 12 | 4.25 | 12000 | 0 | 44 | 35 | 150 | 0 |
9 | A | 9 | 15 | 4.32 | 12189 | 0 | 43 | 17 | 120 | 0 |
16 | B | 6 | 14 | 4.30 | 12135 | 0 | 50 | 10 | 120 | 0 |
Note that several samples show inverted ages with respect to theprevious samples. This is to be expected, since the slotting algorithmonly takes into account distance/dissimilarity between adjacent samplesto generate theordering.
``` r
cleaning workspace
rm(list = ls())```
Transferring an attribute from one sequence to another
This scenario assumes that the user has two METS, one of them with agiven attribute (age/time) that needs to be transferred to the othersequence by using similarity/dissimilarity (constrained by sample order)as a transfer criterion. This case is relatively common inpalaeoecology, when a given dataset is dated, and another taken at aclose location is not.
The code below prepares the data for the example. The sequencepollenGP is the reference sequence, and contains the column age.The sequence pollenX is the target sequence, without an agecolumn. We generate it by taking 40 random samples between the samples50 and 100 of pollenGP. The sequences are prepared withprepareSequences, as usual, with the identificators “GP” and “X”
``` r
loading sample dataset
data(pollenGP)
subset pollenGP to make a shorter dataset
pollenGP <- pollenGP[1:50, ]
generating a subset of pollenGP
set.seed(10)pollenX <- pollenGP[sort(sample(1:50, 40)), ]
we separate the age column
pollenX.age <- pollenX$age
and remove the age values from pollenX
pollenX$age <- NULLpollenX$depth <- NULL
removing some samples from pollenGP
so pollenX is not a perfect subset of pollenGP
pollenGP <- pollenGP[-sample(1:50, 10), ]
prepare sequences
GP.X <- prepareSequences( sequence.A = pollenGP, sequence.A.name = "GP", sequence.B = pollenX, sequence.B.name = "X", grouping.column = "id", time.column = "age", exclude.columns = "depth", transformation = "none")```
The transfer of “age” values from GP to X can be done in two ways,both constrained by sample order:
- Direct: each sample in X gets the age of its most similarsample in GP.
- Interpolated: each sample in X gets an age interpolated fromthe ages of the two most similar samples in GP. The interpolationis weighted by the similarity between the samples.
Direct transfer
A direct transfer of an attribute from the samples of one sequence tothe samples of another requires to compute a distance matrix betweensamples, the least-cost matrix and its least-cost path (both with theoption diagonal activated), and to parse the least-cost path file toassign attribute values. This is done by the functionworkflowTransfer with the option .
``` r
parameters
X.new <- workflowTransfer( sequences = GP.X, grouping.column = "id", time.column = "age", method = "euclidean", transfer.what = "age", transfer.from = "GP", transfer.to = "X", mode = "direct" )
kable(X.new[1:15, ], digits = 4)```
id | depth | age | Abies | Juniperus | Hedera | Plantago | Boraginaceae | Crassulaceae | Pinus | Ranunculaceae | Rhamnus | Caryophyllaceae | Dipsacaceae | Betula | Acer | Armeria | Tilia | Hippophae | Salix | Labiatae | Valeriana | Nymphaea | Umbelliferae | Sanguisorba\_minor | Plantago.lanceolata | Campanulaceae | Asteroideae | Gentiana | Fraxinus | Cichorioideae | Taxus | Rumex | Cedrus | Ranunculus.subgen..Batrachium | Cyperaceae | Corylus | Myriophyllum | Filipendula | Vitis | Rubiaceae | Polypodium | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
41 | X | 0 | 3.92 | 11108 | 0 | 7 | 0 | 5 | 20 | 0 | 13 | 0 | 2 | 1 | 0 | 2 | 41 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 60 | 2 | 0 | 0 |
42 | X | 0 | 4.00 | 11324 | 0 | 8 | 43 | 60 | 65 | 0 | 10 | 0 | 0 | 2 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 11 | 0 | 2 | 0 |
43 | X | 0 | 4.02 | 11378 | 0 | 44 | 76 | 110 | 0 | 0 | 0 | 0 | 0 | 2 | 11 | 0 | 0 | 0 | 0 | 3 | 1 | 3 | 0 | 0 | 5 | 0 | 0 | 1 | 0 | 6 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
44 | X | 0 | 4.05 | 11459 | 0 | 20 | 73 | 94 | 1 | 0 | 0 | 0 | 0 | 1 | 10 | 0 | 2 | 0 | 0 | 3 | 1 | 3 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
45 | X | 0 | 4.07 | 11514 | 0 | 20 | 80 | 100 | 0 | 0 | 0 | 0 | 0 | 10 | 4 | 0 | 0 | 1 | 0 | 0 | 0 | 3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 8 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 1 | 0 | 1 | 0 |
46 | X | 0 | 4.07 | 11595 | 0 | 34 | 80 | 155 | 0 | 0 | 0 | 2 | 0 | 2 | 13 | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
47 | X | 0 | 4.17 | 11784 | 0 | 22 | 44 | 131 | 0 | 0 | 0 | 0 | 0 | 1 | 13 | 0 | 0 | 0 | 0 | 5 | 0 | 5 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | 0 | 2 | 0 | 0 |
48 | X | 0 | 4.20 | 11865 | 0 | 35 | 30 | 112 | 0 | 0 | 0 | 0 | 0 | 4 | 8 | 0 | 0 | 0 | 2 | 2 | 1 | 2 | 0 | 0 | 7 | 0 | 1 | 0 | 0 | 10 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 2 | 1 | 0 |
49 | X | 0 | 4.22 | 11919 | 0 | 30 | 45 | 150 | 0 | 0 | 0 | 0 | 0 | 4 | 11 | 0 | 0 | 0 | 0 | 2 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 13 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 0 |
50 | X | 0 | 4.25 | 12000 | 0 | 44 | 35 | 150 | 0 | 0 | 0 | 0 | 0 | 2 | 8 | 0 | 0 | 0 | 0 | 3 | 1 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 6 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | 0 |
51 | X | 0 | 4.25 | 12054 | 0 | 37 | 30 | 150 | 0 | 0 | 1 | 0 | 0 | 6 | 10 | 0 | 0 | 0 | 0 | 7 | 2 | 2 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 7 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 4 | 0 | 0 |
52 | X | 0 | 4.32 | 12135 | 0 | 50 | 10 | 120 | 0 | 0 | 0 | 0 | 0 | 1 | 7 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 8 | 0 | 0 | 0 | 0 | 8 | 2 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
53 | X | 0 | 4.32 | 12189 | 0 | 43 | 17 | 120 | 0 | 0 | 0 | 0 | 0 | 2 | 15 | 0 | 0 | 1 | 1 | 2 | 0 | 2 | 0 | 0 | 5 | 1 | 0 | 0 | 0 | 6 | 2 | 0 | 2 | 0 | 1 | 0 | 2 | 1 | 0 | 0 | 0 | 0 |
54 | X | 0 | 4.40 | 12324 | 0 | 50 | 11 | 86 | 0 | 0 | 0 | 0 | 0 | 2 | 15 | 0 | 0 | 2 | 1 | 4 | 5 | 3 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 5 | 1 | 0 | 2 | 0 | 0 | 1 | 1 | 0 | 0 | 3 | 1 | 0 |
55 | X | 0 | 4.40 | 12405 | 0 | 51 | 6 | 70 | 0 | 0 | 0 | 0 | 0 | 1 | 16 | 0 | 0 | 4 | 1 | 2 | 4 | 2 | 0 | 0 | 5 | 2 | 0 | 0 | 0 | 1 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
The algorithm finds the most similar samples, and transfers attributevalues directly between them. This can result in duplicated attributevalues, as highlighted in the table above. The Pearson correlationbetween the original ages (stored in pollenX.age) and the assignedones is 0.9996, so it can be concluded that in spite of its simplicity,this algorithm yields accurate results.
Interpolated transfer
If we consider:
- Two samples
and
ofthe sequence
.
- Each one with a value the attribute
,
and
.
- One sample
of the sequence
.
- With an unknown attribute
.
- The multivariate distance
between the samples
and
.
- The multivariate distance
between the samples
and
.
- The weight
, computed as
- The weight
, computed as
The unknwon value iscomputed as:
The code below exemplifies the operation, using the samples 1 and 4 ofthe dataset pollenGP as and
, and thesample 3 as
.
``` r
loading data
data(pollenGP)
samples in A
Ai <- pollenGP[1, 3:ncol(pollenGP)]Aj <- pollenGP[4, 3:ncol(pollenGP)]
ages of the samples in A
Ati <- pollenGP[1, "age"]Atj <- pollenGP[4, "age"]
sample in B
Bk <- pollenGP[2, 3:ncol(pollenGP)]
computing distances between Bk, Ai, and Aj
DBkAi <- distance(Bk, Ai)DBkAj <- distance(Bk, Aj)
normalizing the distances to 1
wi <- DBkAi / (DBkAi + DBkAj)wj <- DBkAj / (DBkAi + DBkAj)
computing Btk
Btk <- wi * Ati + wj * Atj```
The table below shows the observed versus the predicted values for.
rtemp.df <- data.frame(Observed = pollenGP[3, "age"], Predicted = Btk)kable(t(temp.df), digits = 4, caption = "Observed versus predicted value in the interpolation of an age based on similarity between samples.")
Observed | 3.9700 |
Predicted | 3.9735 |
Below we create some example data, where a subset of pollenGP will bethe donor of age values, and another subset of it, named pollenX willbe the receiver of the age values.
``` r
loading sample dataset
data(pollenGP)
subset pollenGP to make a shorter dataset
pollenGP <- pollenGP[1:50, ]
generating a subset of pollenGP
set.seed(10)pollenX <- pollenGP[sort(sample(1:50, 40)), ]
we separate the age column
pollenX.age <- pollenX$age
and remove the age values from pollenX
pollenX$age <- NULLpollenX$depth <- NULL
removing some samples from pollenGP
so pollenX is not a perfect subset of pollenGP
pollenGP <- pollenGP[-sample(1:50, 10), ]
prepare sequences
GP.X <- prepareSequences( sequence.A = pollenGP, sequence.A.name = "GP", sequence.B = pollenX, sequence.B.name = "X", grouping.column = "id", time.column = "age", exclude.columns = "depth", transformation = "none")```
To transfer attributes from GP to X we use the workflowTransferfunction with the option mode = “interpolate”.
``` r
parameters
X.new <- workflowTransfer( sequences = GP.X, grouping.column = "id", time.column = "age", method = "euclidean", transfer.what = "age", transfer.from = "GP", transfer.to = "X", mode = "interpolated" )
kable(X.new[1:15, ], digits = 4, caption = "Result of the transference of an age attribute from one sequence to another. NA values are expected when predicted ages for a given sample yield a higher number than the age of the previous sample.") %>% row_spec(c(8, 13), bold = T)```
id | depth | age | Abies | Juniperus | Hedera | Plantago | Boraginaceae | Crassulaceae | Pinus | Ranunculaceae | Rhamnus | Caryophyllaceae | Dipsacaceae | Betula | Acer | Armeria | Tilia | Hippophae | Salix | Labiatae | Valeriana | Nymphaea | Umbelliferae | Sanguisorba\_minor | Plantago.lanceolata | Campanulaceae | Asteroideae | Gentiana | Fraxinus | Cichorioideae | Taxus | Rumex | Cedrus | Ranunculus.subgen..Batrachium | Cyperaceae | Corylus | Myriophyllum | Filipendula | Vitis | Rubiaceae | Polypodium | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
41 | X | 0 | 3.9497 | 11108 | 0 | 7 | 0 | 5 | 20 | 0 | 13 | 0 | 2 | 1 | 0 | 2 | 41 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 60 | 2 | 0 | 0 |
42 | X | 0 | 3.9711 | 11324 | 0 | 8 | 43 | 60 | 65 | 0 | 10 | 0 | 0 | 2 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 11 | 0 | 2 | 0 |
43 | X | 0 | 4.0009 | 11378 | 0 | 44 | 76 | 110 | 0 | 0 | 0 | 0 | 0 | 2 | 11 | 0 | 0 | 0 | 0 | 3 | 1 | 3 | 0 | 0 | 5 | 0 | 0 | 1 | 0 | 6 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
44 | X | 0 | 4.0219 | 11459 | 0 | 20 | 73 | 94 | 1 | 0 | 0 | 0 | 0 | 1 | 10 | 0 | 2 | 0 | 0 | 3 | 1 | 3 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
45 | X | 0 | 4.0522 | 11514 | 0 | 20 | 80 | 100 | 0 | 0 | 0 | 0 | 0 | 10 | 4 | 0 | 0 | 1 | 0 | 0 | 0 | 3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 8 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 1 | 0 | 1 | 0 |
46 | X | 0 | 4.1361 | 11595 | 0 | 34 | 80 | 155 | 0 | 0 | 0 | 2 | 0 | 2 | 13 | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
47 | X | 0 | 4.1972 | 11784 | 0 | 22 | 44 | 131 | 0 | 0 | 0 | 0 | 0 | 1 | 13 | 0 | 0 | 0 | 0 | 5 | 0 | 5 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | 0 | 2 | 0 | 0 |
48 | X | 0 | NA | 11865 | 0 | 35 | 30 | 112 | 0 | 0 | 0 | 0 | 0 | 4 | 8 | 0 | 0 | 0 | 2 | 2 | 1 | 2 | 0 | 0 | 7 | 0 | 1 | 0 | 0 | 10 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 2 | 1 | 0 |
49 | X | 0 | 4.2027 | 11919 | 0 | 30 | 45 | 150 | 0 | 0 | 0 | 0 | 0 | 4 | 11 | 0 | 0 | 0 | 0 | 2 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 13 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 0 |
50 | X | 0 | 4.2237 | 12000 | 0 | 44 | 35 | 150 | 0 | 0 | 0 | 0 | 0 | 2 | 8 | 0 | 0 | 0 | 0 | 3 | 1 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 6 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | 0 |
51 | X | 0 | 4.2999 | 12054 | 0 | 37 | 30 | 150 | 0 | 0 | 1 | 0 | 0 | 6 | 10 | 0 | 0 | 0 | 0 | 7 | 2 | 2 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 7 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 4 | 0 | 0 |
52 | X | 0 | NA | 12135 | 0 | 50 | 10 | 120 | 0 | 0 | 0 | 0 | 0 | 1 | 7 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 8 | 0 | 0 | 0 | 0 | 8 | 2 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
53 | X | 0 | NA | 12189 | 0 | 43 | 17 | 120 | 0 | 0 | 0 | 0 | 0 | 2 | 15 | 0 | 0 | 1 | 1 | 2 | 0 | 2 | 0 | 0 | 5 | 1 | 0 | 0 | 0 | 6 | 2 | 0 | 2 | 0 | 1 | 0 | 2 | 1 | 0 | 0 | 0 | 0 |
54 | X | 0 | 4.3501 | 12324 | 0 | 50 | 11 | 86 | 0 | 0 | 0 | 0 | 0 | 2 | 15 | 0 | 0 | 2 | 1 | 4 | 5 | 3 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 5 | 1 | 0 | 2 | 0 | 0 | 1 | 1 | 0 | 0 | 3 | 1 | 0 |
55 | X | 0 | 4.4155 | 12405 | 0 | 51 | 6 | 70 | 0 | 0 | 0 | 0 | 0 | 1 | 16 | 0 | 0 | 4 | 1 | 2 | 4 | 2 | 0 | 0 | 5 | 2 | 0 | 0 | 0 | 1 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 |
When interpolated values of the age column (transferred attribute viainterpolation) show the value NA, it means that the interpolationyielded an age lower than the previous one. This happens when the same and
are used to evaluatetwo or more different samples
, and the second
is more similar to
than the first one.These NA values can be removed with na.omit(), or interpolated withthe functionsimputeTS::na.interpolationorzoo::na.approx.
Without taking into account these NA values, the Pearson correlationof the interpolated ages with the real ones is 0.9985.
IMPORTANT: the interpretation of the interpolated ages requires acareful consideration. Please, don’t do it blindly, because thisalgorithm has its limitations. For example, significant hiatuses in thedata can introduce wild variations in interpolated ages.
``` r
cleaning workspace
rm(list = ls())```
To restore the repository download the bundle
wget https://archive.org/download/github.com-BlasBenito-distantia_-_2021-03-03_23-13-27/BlasBenito-distantia_-_2021-03-03_23-13-27.bundle
and run: git clone BlasBenito-distantia_-_2021-03-03_23-13-27.bundle
Source: https://github.com/BlasBenito/distantia
Uploader: BlasBenito
Upload date: 2021-03-03
- Addeddate
- 2021-07-06 16:44:36
- Identifier
- github.com-BlasBenito-distantia_-_2021-03-03_23-13-27
- Originalurl
-
https://github.com/BlasBenito/distantia
- Pushed_date
- 2021-03-03 23:13:27
- Scanner
- Internet Archive Python library 1.9.9
- Uploaded_with
- iagitup - v1.6.2
- Year
- 2021