RQ1: Performance on real data

Do any methods stand out from the rest in terms of performance on real datasets?

To answer this question, we measure the performance (in terms of both F1-score and execution time) of the selected motif discovery methods on our collection of real labeled time series.

Our evaluation is summarized in the table below (empty cells correspond to methods that crashed or reached the time-out defined in the previous section). We also present critical difference diagrams, with and without REFIT and SIGN, showing the average rank of each method over all datasets. The dark lines connect cliques of methods with broadly similar performance, as determined by pairwise Wilcoxon tests.
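As a rough illustration of what lies behind such a diagram, the sketch below computes average ranks and an exact two-sided Wilcoxon signed-rank p-value for a few methods. It is not the code used to produce our diagrams: the helper names (`midranks`, `average_ranks`, `wilcoxon_exact`) are hypothetical, the inputs are the dataset-level F1 values from the table for the six datasets on which every method completed, and no multiple-comparison correction is applied.

```python
from itertools import product

def midranks(values):
    """Ascending ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in order[i:j + 1]:
            ranks[k] = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        i = j + 1
    return ranks

def average_ranks(scores):
    """scores: {method: [score per dataset]}. Rank 1 is the best (highest) score."""
    methods = list(scores)
    n_datasets = len(scores[methods[0]])
    totals = dict.fromkeys(methods, 0.0)
    for d in range(n_datasets):
        for m, r in zip(methods, midranks([-scores[m][d] for m in methods])):
            totals[m] += r
    return {m: totals[m] / n_datasets for m in methods}

def wilcoxon_exact(x, y):
    """Exact two-sided Wilcoxon signed-rank p-value (zero differences dropped)."""
    # Rounding avoids spurious non-ties caused by floating-point noise.
    diffs = [round(a - b, 6) for a, b in zip(x, y)]
    diffs = [d for d in diffs if d != 0]
    ranks = midranks([abs(d) for d in diffs])
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = min(w_plus, total - w_plus)
    # Under the null hypothesis every sign pattern is equally likely.
    hits = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= observed:
            hits += 1
    return hits / 2 ** len(ranks)

# Dataset-level F1 values copied from the table, in the order
# arm-coda, mitdb, mitdb1, ptt-ppg, JIGSAWMaster, JIGSAWSlave.
f1 = {
    "STOMP":      [0.25, 0.50, 0.63, 0.49, 0.26, 0.25],
    "PEPA":       [0.29, 0.41, 0.46, 0.68, 0.18, 0.17],
    "SetFinder":  [0.20, 0.55, 0.77, 0.69, 0.23, 0.22],
    "GrammarViz": [0.01, 0.00, 0.00, 0.00, 0.10, 0.10],
}

ranks = average_ranks(f1)
p_value = wilcoxon_exact(f1["STOMP"], f1["GrammarViz"])
print(ranks)
print(p_value)
```

With only six paired observations, the smallest p-value this exact test can return is 2/64, so dataset-level means alone give the test little power. For larger samples, a library routine such as `scipy.stats.wilcoxon` would be the more usual choice.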

Summary of the results

| Dataset | Metric | STOMP | PanMP | LoCoMotif | LatentMotif | MDL-Clust | k-Motiflets | PEPA | VALMOD | SetFinder | A-PEPA | GrammarViz |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| arm-coda | fscore | 0.25 (0.15) | 0.22 (0.10) | 0.17 (0.17) | 0.27 (0.14) | 0.66 (0.25) | 0.03 (0.07) | 0.29 (0.14) | 0.29 (0.15) | 0.20 (0.05) | 0.29 (0.17) | 0.01 (0.02) |
| arm-coda | Exec. time | 0.5 (0.06) | 170 (63) | 18 (8) | 30 (9) | 555 (159) | 2 (0.3) | 2 (0.3) | 303 (80) | 1.5 (0.5) | 2 (0.3) | 0.3 (0.00) |
| mitdb | fscore | 0.50 (0.20) | 0.14 (0.22) | 0.12 (0.18) | 0.29 (0.24) | 0.33 (0.15) | 0.40 (0.37) | 0.41 (0.30) | 0.17 (0.23) | 0.55 (0.17) | 0.51 (0.19) | 0.00 (0.00) |
| mitdb | Exec. time | 2.9 (0.01) | 934 (600) | 1252 (3837) | 14 (8) | 4178 (1483) | 235 (98) | 11 (0.4) | 1762 (1273) | 14 (2.3) | 11 (0.4) | 0.41 (0.02) |
| mitdb1 | fscore | 0.63 (0.19) | 0.69 (0.26) | 0.29 (0.14) | 0.14 (0.14) | 0.18 (0.07) | 0.44 (0.37) | 0.46 (0.34) | 0.66 (0.25) | 0.77 (0.10) | 0.36 (0.20) | 0.00 (0.00) |
| mitdb1 | Exec. time | 3 (0.05) | 187 (105) | 76 (8) | 7 (1.5) | 1133 (254) | 60 (25) | 11 (0.5) | 156 (48) | 12 (1.2) | 10 (0.5) | 0.42 (0.02) |
| ptt-ppg | fscore | 0.49 (0.18) | 0.53 (0.23) | 0.38 (0.16) | 0.27 (0.17) | 0.18 (0.07) | 0.61 (0.26) | 0.68 (0.12) | 0.54 (0.23) | 0.69 (0.05) | 0.43 (0.16) | 0.00 (0.01) |
| ptt-ppg | Exec. time | 3 (0.6) | 270 (200) | 102 (17) | 8 (2.8) | 1261 (279) | 86 (41) | 11 (0.2) | 204 (86) | 23 (3) | 12 (1.4) | 0.4 (0.02) |
| JIGSAWMaster | fscore | 0.26 (0.10) | 0.10 (0.12) | 0.33 (0.10) | 0.26 (0.12) | 0.23 (0.08) | 0.13 (0.08) | 0.18 (0.09) | 0.17 (0.09) | 0.23 (0.04) | 0.20 (0.09) | 0.10 (0.05) |
| JIGSAWMaster | Exec. time | 0.9 (0.8) | 420 (520) | 318 (665) | 7 (6) | 2214 (2147) | 108 (106) | 4 (3) | 1208 (1038) | 5 (5) | 4 (3) | 0.31 (0.04) |
| JIGSAWSlave | fscore | 0.25 (0.12) | 0.05 (0.07) | 0.33 (0.12) | 0.24 (0.10) | 0.23 (0.06) | 0.15 (0.10) | 0.17 (0.08) | 0.20 (0.10) | 0.22 (0.05) | 0.18 (0.08) | 0.10 (0.06) |
| JIGSAWSlave | Exec. time | 0.87 (0.68) | 343 (300) | 189 (267) | 6 (4) | 2005 (1812) | 96 (83) | 4 (3) | 1453 (1459) | 4.7 (4) | 4 (2) | 0.31 (0.03) |

| Dataset | Metric | Values (non-empty cells only) |
| --- | --- | --- |
| REFIT | fscore | 0.00 (0.03), 0.03 (0.08), 0.00 (0.00), 0.14 (0.12), 0.16 (0.15), 0.00 (0.00) |
| REFIT | Exec. time | 500 (96), 230 (122), 15700 (9800), 1280 (100), 1310 (120), 63 (12) |
| SIGN | fscore | 0.06 (0.04), 0.14 (0.09), 0.18 (0.14), 0.17 (0.03), 0.20 (0.06), 0.10 (0.07) |
| SIGN | Exec. time | 300 (25), 50 (10), 15500 (3600), 900 (85), 900 (88), 5 (18) |

Critical difference diagram with REFIT and SIGN

crit_diag_w

Critical difference diagram without REFIT and SIGN

crit_diag_wo

RQ1 Conclusion

According to the critical difference diagrams, PEPA, A-PEPA, STOMP and SetFinder seem to have slightly better results on real data. However, the variation in the methods' performance across datasets shows the importance of asking precise questions about which time series characteristics influence the performance of the algorithms. In the following sections, we therefore use our synthetic generator to identify specific challenges.