\chapter{Sparse Representation of Multivariate Extremes}
\label{jmva}
%XXX TODO: detailed remark on classification in extreme regions (add classif in background?)
\begin{chapabstract}
This chapter provides the details underlying the introductory Section~\ref{resume:sec:JMVA}.
Capturing the dependence structure of multivariate extreme events is
a major concern in many fields involving the management of risks
stemming from multiple sources, \emph{e.g.}~portfolio monitoring, insurance, environmental risk management and anomaly detection.
One convenient (non-parametric) characterization of extreme dependence in the
framework of multivariate Extreme Value Theory (EVT) is the \textit{angular
measure}, which provides direct information about the probable
`directions' of extremes, that is, the relative contribution of each
feature/coordinate of the `largest' observations. Modeling the
angular measure in high dimensional problems is a major challenge for
the multivariate analysis of rare events.
The present chapter proposes a novel methodology aiming at
exhibiting a sparsity pattern within the dependence structure of extremes.
This is achieved by estimating the amount of mass spread by the angular measure on
representative sets of directions, corresponding to specific sub-cones of $\mathbb{R}_+^d$.
This dimension reduction technique paves the way towards scaling up existing multivariate EVT methods.
Beyond a non-asymptotic study providing a theoretical validity
framework for our method, we propose as a direct application a --first--
Anomaly Detection algorithm based on \textit{multivariate} EVT. This algorithm builds a sparse `normal profile' of extreme behaviors, to be confronted with new (possibly abnormal) extreme observations. Illustrative experimental results provide strong empirical evidence of the relevance of our approach.
\end{chapabstract}
Note: The material of this chapter is based on previous work under review, available in \cite{ARXIV16}. Parts of this work have been published in \cite{AISTAT16} and \cite{NIPSWORKSHOP15}.
\section{Introduction}
\label{jmva:sec:intro}
% An anomaly (or outlier) is `an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism' (\cite{Hawkins1980}).
% According to a further definition from Barnett and Lewis \cite{Barnett94}, this is an observation (or subset of observations) `which appears to be inconsistent with the remainder of that set of data'.
% These definitions give a hunch of what an outlying observation is, and motivate many anomaly detection methods whose approach is statistical (\cite{Eskin2000}, \cite{Desforges1998}, \cite{Barnett94}, \cite{Hawkins1980}), distance based (\cite{Knorr98}, \cite{Knorr2000}, \cite{Eskin2002geometric}), local-density based (\cite{Breunig2000LOF}, \cite{Breunig99LOF}, \cite{Tang2002enhancing}, \cite{Papadimitriou2003loci} ), spectral based (\cite{Shyu2003}, \cite{Wang2006}, \cite{Lee2013}) and others (\cite{Aggarwal2001}, \cite{Kriegel2008}, \cite{Liu2008}). The approach discussed in this paper lies at the intersection of the non-parametric statistical and the spectral one.
\subsection{Context: multivariate extreme values in large dimension}
Extreme Value Theory (EVT in abbreviated form) provides a theoretical basis for
modeling the tails of probability distributions. In many applied fields where
rare events may have a disastrous impact, such as
finance, insurance, climate, environmental risk management, network
monitoring (\cite{finkenstadt2003extreme,smith2003statistics}) or
anomaly detection (\cite{Clifton2011,Lee2008}), the information carried by extremes is crucial. In a multivariate context, the dependence structure
of the joint tail is of particular interest, as it gives access
\emph{e.g.} to probabilities of a joint excess above high thresholds or
to multivariate quantile regions. Also, the distributional structure of
extremes indicates which components of a multivariate quantity may be
simultaneously large while the others stay small, which is a valuable
piece of information for multi-factor risk assessment or
detection of anomalies among other --not abnormal-- extreme data.
In a multivariate `Peak-Over-Threshold' setting, realizations of a $d$-dimensional random vector $\mb Y = (Y_1,\ldots, Y_d)$ are observed and the goal pursued is to learn the conditional distribution of excesses, $\left[~ \mb Y ~|~ \|\mb Y\| \ge r ~ \right]$, above some large threshold $r>0$.
The dependence structure of such excesses is described via the
distribution of the `directions' formed by the most extreme
observations, the so-called \emph{angular measure}, hereafter denoted by $\Phi$. The latter
is defined on the positive orthant of the $d-1$ dimensional
hyper-sphere. To wit, for any region $A$ on the unit sphere (a set
of `directions'), after suitable standardization of the data (see
Section~\ref{jmva:sec:framework}),
$C \Phi(A) \simeq \PP(\|\mb Y\|^{-1} \mb Y \in A ~|~ \|\mb Y\| >r)$, where $C$ is a normalizing constant.
Some probability mass may be spread on any sub-sphere of dimension $k
< d$ (the $k$-faces of a hyper-cube when the infinity norm is used), which
complicates inference when $d$ is large. To fix ideas, the presence of $\Phi$-mass on a
sub-sphere of the type $\{\max_{1\leq i\leq k} x_i = 1 ~;~ x_i >0 \;(i\le k) ~;~ x_{k+1} = \ldots = x_d = 0\}$ indicates that the components $Y_1,\ldots,Y_k$ may
simultaneously be large, while the others are small.
An extensive exposition of this multivariate extreme setting may be found \eg~in \cite{Resnick1987},~\cite{BGTS04}.
%%%%%%%%%%%%%%%%%%%%%%%
Parametric or semi-parametric modeling and estimation of the structure of
multivariate extremes is relatively well documented in the statistical literature, see \emph{e.g.} \cite{coles1991modeling,fougeres2009models,cooley2010pairwise,sabourinNaveau2012} and the references therein. In a non-parametric setting, there is also an abundant literature concerning consistency and asymptotic normality of estimators of functionals characterizing the extreme dependence structure, \eg~extreme value copulas or the \emph{stable tail dependence function} (\stdf), see \cite{Segers12Bernoulli, Drees98, Embrechts2000, Einmahl2012, dHF06}.
In many applications, it is nevertheless more convenient to work with the angular measure itself, as the latter gives more direct information on the dependence structure and is able to reflect structural simplifying properties (\eg~sparsity as detailed below) which would not appear in copulas or in the \stdf.
However, non-parametric modeling of the angular measure faces major difficulties, stemming from the potentially complex structure of the latter, especially in a high dimensional setting.
Further, from a theoretical point of view, non-parametric estimation of the angular measure has only been studied in the two dimensional case, in \cite{Einmahl2001} and \cite{Einmahl2009}, in an asymptotic framework.
%%%%%%%%%%%%%%%%%%%%%%%
{Scaling up multivariate EVT} is a major challenge that one
faces when confronted with high-dimensional learning
tasks, since most multivariate extreme value models have been
designed to handle moderate dimensional problems (say, of dimensionality $d\le 10$). % that one faces when it comes to applying it in a
% machine learning context.
% In addition, their practical use is
% restricted to moderate dimensional problems (say, $d\le 10$),
For larger dimensions,
simplifying modeling choices are needed,
stipulating \emph{e.g.} that only some pre-defined subgroups of components
may be concomitantly extreme, or, on the contrary, that all of them
must be (see \emph{e.g.}
\cite{stephenson2009high} or \cite{sabourinNaveau2012}).
This curse of dimensionality can be explained, in the context of
extreme value analysis, by the relative scarcity of extreme data,
the computational
complexity of the estimation procedure and, in the parametric case, by
the fact that the dimension of the parameter space usually grows with
that of the sample space. This calls for dimensionality reduction devices
adapted to multivariate extreme values.
In a wide range of situations, one may expect the occurrence of two phenomena:
\noindent
\textbf{1-} Only a `small' number of groups of components may be concomitantly extreme, so that only a `small' number of hyper-cubes (those corresponding precisely to these subsets of indices) have non-zero mass (`small' is relative to the total number of groups $2^d$).
\noindent
\textbf{2-} Each of these groups contains a limited number of coordinates (compared to the original dimensionality), so that the corresponding hyper-cubes with non-zero mass have small dimension compared to $d$.
\noindent
The main purpose of this chapter is to introduce a data-driven
methodology for identifying such faces, so as to reduce the
dimensionality of the problem and thus to learn a sparse
representation of extreme behaviors.
In case hypothesis \textbf{2-} is not fulfilled, such a sparse `profile' can still be learned, but it loses the low-dimensional property of its supporting hyper-cubes.
One major issue is that real data generally do not concentrate on sub-spaces of zero Lebesgue measure. This is circumvented by setting to zero any coordinate smaller than a threshold $\epsilon>0$, so that the corresponding `angle' is assigned to a lower-dimensional face.
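To fix ideas, a minimal sketch of this thresholding step is given below (in Python; the function name, the array-based interface and the default value $\epsilon=0.1$ are illustrative choices of ours, not prescribed by the method). Each standardized observation whose infinity norm exceeds a radial threshold is assigned to the subset $\alpha$ of coordinates whose relative contribution exceeds $\epsilon$, \ie~to a lower-dimensional face.
\begin{verbatim}
import numpy as np

def faces_above_threshold(V, radius, eps=0.1):
    """Assign each extreme point to the face alpha = {j : v_j / ||v||_inf > eps}.

    V      : (n, d) array of standardized (non-negative) observations,
    radius : radial threshold defining the extreme region,
    eps    : tolerance below which an angular coordinate is set to zero.
    Returns a dict mapping each face (tuple of coordinates) to the number
    of extreme points assigned to it.
    """
    faces = {}
    norms = np.max(V, axis=1)                 # infinity norm, since V >= 0
    for v, r in zip(V, norms):
        if r < radius:
            continue                          # not an extreme point
        alpha = tuple(np.nonzero(v / r > eps)[0])
        faces[alpha] = faces.get(alpha, 0) + 1
    return faces
\end{verbatim}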
% restrictive structural assumptions have to be
% made
% Several multivariate models have been proposed, in a parametric
% context or in a semi-parametric one. However in practice, the applicability of
% these models is restricted to moderate dimensional problems (say $d\le
% 10$),
% From a theoretical point of view, non-parametric
% estimators of the \emph{stable tail dependence function} (another summary functional of the extremal
% dependence structure) have been proved to be asymptotically
% normal.
% consistency and asymptotic normality of non-parametric estimators of the so-called
% \emph{angular measure} (which rules the distribution of the
% `directions' of the most extreme observations and uniquely determines
% the dependence structure of extremes) has been established for the
% bivariate case only. In dimension greater than three,
% However, the latter does not give direct access to the angular
% measure, nor does it allow to exhibit sparsity in the latter,
The theoretical results stated in this chapter build on the work
of % are clearly stated in this
% work, in the continuity of
\cite{COLT15} exposed in Chapter~\ref{colt}, where non-asymptotic bounds related to the statistical performance of a non-parametric estimator of the \stdf, another functional measure of the dependence structure of extremes, are established.
However, even in the case of a sparse angular measure, the support of
the \stdf~would not be sparse, since the latter functional is an
integrated version of the former (see~\eqref{jmva:eq:integratePhiLambda},
Section~\ref{jmva:sec:framework}). Also,
%tools from statistical learning theory,
in many applications, it is more convenient to work with the angular
measure itself. Indeed, it provides
direct information about the
probable `directions' of extremes, that is, the relative contribution
of each component of the `largest' observations (where `large' may be
understood \emph{e.g.} in the sense of the infinity norm on the input
space). We emphasize again that estimating these `probable relative contributions' is a major concern in many fields
involving the management of risks from multiple sources.
To the best of our knowledge, non-parametric estimation of the angular
measure has only been treated in the two dimensional case, in
\cite{Einmahl2001} and \cite{Einmahl2009}, in an asymptotic
framework.
\noindent
\textbf{Main contributions.} The present contribution extends the non-asymptotic study derived in Chapter~\ref{colt} to the angular measure of extremes, restricted to a well-chosen representative class of sets, corresponding to lower-dimensional regions of the space. The objective is to learn a representation of the angular measure, rough enough to
control the variance in high dimension and accurate enough to gain information about the `probable directions' of extremes. This yields a --first-- non-parametric estimate of the angular measure in any dimension, restricted to a
class of sub-cones, with a non-asymptotic bound on the error.
The representation thus obtained is exploited to detect anomalies among extremes.
The proposed algorithm is based on \textit{dimensionality
reduction}. %, and can also be used as a preprocessing step to reduce the number of r.v.'s under consideration at extreme levels. % , as multivariate EVT is often inapplicable in a large dimensional context. XXX.
We believe that our method can also be used as a preprocessing stage, for dimensionality reduction purpose, before proceeding with a parametric or semi-parametric estimation which could benefit from the structural information issued in the first step. Such applications are beyond the scope of this work and will be the subject of further research.
\subsection{Application to Anomaly Detection}
% Anomaly Detection (AD in short, and depending of the application domain, outlier detection, novelty detection, deviation detection, exception mining) generally consists in assuming that the dataset under study contains a \textit{small} number of anomalies, generated by distribution models that \textit{differ} from that generating the vast majority of the data.
% % anomalies are a \textit{small} number of observations generated by \textit{different} models from the one generating the rest of the data
% %\sout{ -- the only difference in novelty detection is that the novel patterns are incorporated into the normal model after being detected. }.
% This formulation motivates many statistical AD methods, based on the underlying assumption that anomalies occur in low probability regions of the data generating process. Here and hereafter, the term `normal data' does not refer to Gaussian distributed data, but to \emph{not abnormal} ones, \ie~data belonging to the above mentioned majority.
% Classical parametric techniques, like those developed in \cite{Barnett94} or in \cite{Eskin2000}, assume that the normal data are generated by a distribution belonging to some specific, known in advance parametric model.
% The most popular non-parametric approaches include algorithms based on density (level set) estimation (see \textit{e.g.} \cite{Scholkopf2001}, \cite{Scott2006} or \cite{Breunig99LOF}), on dimensionality reduction (\textit{cf} \cite{Shyu2003}, \cite{Aggarwal2001}) or on decision trees (\cite{Liu2008}).
% % %approach is statistical, \
% % % {\red donner le theme de chaque papier cité}
% % \cite{Eskin2000},
% % \cite{Desforges1998}, \cite{Barnett94}, \cite{Hawkins1980}
% % , distance
% % based, \cite{Knorr98}, \cite{Knorr2000}, \cite{Eskin2002geometric},
% % local-density based \cite{Breunig2000LOF}, \cite{Breunig99LOF},
% % \cite{Tang2002enhancing}, \cite{Papadimitriou2003loci}, spectral based
% % \cite{Shyu2003}, \cite{Wang2006}, \cite{Lee2013} and others
% % \cite{Aggarwal2001}, \cite{Kriegel2008}, \cite{Liu2008}.
% One may refer to \cite{Hodge2004survey}, \cite{Chandola2009survey}, \cite{Patcha2007survey} and \cite{Markou2003survey} for excellent overviews of current research on Anomaly Detection, ad-hoc techniques being far too numerous to be listed here in an exhaustive manner.
The framework we develop in this chapter is non-parametric and lies at
the intersection of support estimation, density estimation and
dimensionality reduction: it consists in learning from training data
the support of a distribution, which can be decomposed into sub-cones,
each hopefully of low dimension, and to which some mass is assigned,
according to empirical versions of probability measures on
extreme regions.
%In addition, each subcone is (potentially) of low-dimension.
%\cite{hodge2004survey}:
%Anomalies arise because of human error, instrument error, natural deviations in populations, fraudulent behaviour, changes in behaviour of systems or faults in systems
% \textbf{Anomaly Ranking}. Most usual AD algorithms actually
% provide more than a predicted label for any new observation, abnormal vs. normal. Instead,
% they return a real valued function, termed a \textit{scoring function} sometimes, defining a preorder/ranking on the input space. Such a function permits to rank any observations according to their supposed `degree of abnormality' and thresholding it yields a decision rule that splits the input space into `normal' and `abnormal' regions.
% In various fields (\textit{e.g.} fleet management, monitoring of energy/transportation networks), when confronted with massive data, being able to rank observations according to their degree of abnormality may significantly improve operational processes and allow for a prioritization of actions to be taken, especially in situations where human expertise required to check each observation is time-consuming.
% From a machine learning perspective,
% % AD can be considered as a specific
% % classification task, where the usual assumption in supervised learning stipulating that the dataset contains structural
% % information regarding all classes breaks down, see
% % \cite{Roberts99}. This typically happens in the case of two highly
% % unbalanced classes: the normal class is expected to regroup a large
% % majority of the dataset, so that the very small number of points
% % representing the abnormal class does not allow to learn information
% % about this class. In a clustering based approach, it can be
% % interpreted as the presence of a single cluster, corresponding to the
% % normal data. The abnormal ones are too limited to share a commun
% % structure, \ie~to form a second cluster. Their only characteristic is
% % precisely to lie outside the normal cluster, namely to lack any
% % structure. Thus, common classification approaches may not be applied
% % as such, even in a supervised
% % context. % That is the reason why even in a supervised framework, common classification approaches cannot be applied.
% \textit{supervised} AD consists in training the algorithm on a
% labeled (normal/abnormal) dataset including both normal and abnormal
% observations. In the \textit{semi-supervised} context, only normal
% data are available for training. This is the case in applications where
% normal operations are known but intrusion/attacks/viruses are unknown
% and should be detected. In the \textit{unsupervised} setup, no
% assumption is made on the data which consist in unlabeled normal and
% abnormal instances. \sout{Sometimes, no training is performed.} In general,
% a method from the semi-supervised framework may apply to the
% unsupervised one, as soon as the number of anomalies is sufficiently
% weak to prevent the algorithm from fitting them when learning the normal
% behavior. Such a method should be robust to outlying
% observations.
% \textbf{Connection to Extreme Value Theory} (EVT), which \textit{forms representations for the tails of distributions}, can be seen in the following sense.
% In the unsupervised framework % (where the data set consists of a large number of normal data with a smaller unknown number of anomalies)
% As a matter of fact, `extreme' observations are often more susceptible to be anomalies than the others.
% % In the supervised or semi-supervised framework (that is, when the
% % algorithm is trained on normal data), %considering only normal observations% where a dataset of only normal observations is available
% These extremal points represent the outlying regions of the normal instances. % we want to separate from the abnormal ones.
% In other words, extremal observations are often at the \textit{border} between normal and abnormal regions and play a very special role in this context.
% %
% As extreme observations constitute few percents of the data, a classical AD algorithm would tend to classify them as abnormal: it is not
% worth the risk (in terms of ROC or precision-recall curve for instance) trying to be more accurate in such low probability regions, without adapted tools. New
% observations outside the `observed support' are then most often predicted as abnormal. However, false positives (\ie~false alarms) are very expensive in many applications (\eg~aircraft predictive maintenance). It is then of primal interest to develop tools increasing the precision on such extremal regions.
EVT has been intensively used in anomaly detection in the one-dimensional
situation, see for instance \cite{Roberts99}, \cite{Roberts2000},
\cite{Clifton2011}, \cite{Clifton2008}, \cite{Lee2008}.
% Anomaly detection then relies on tail analysis of the variable of interest and naturally involves EVT.
In the multivariate setup, however, there is --to the best of our
knowledge-- no anomaly detection method
relying on \textit{multivariate} EVT. Until now, the multidimensional case has only been tackled by means of extreme value statistics
based on univariate EVT. The major reason is
the difficulty of scaling up existing multivariate EVT models
with the dimensionality.
In the present contribution we bridge the gap between the practice of anomaly detection and multivariate EVT by proposing a method which is
able to learn a sparse `normal profile' of multivariate extremes and,
as such, may be implemented to improve the accuracy of any usual anomaly detection
algorithm.
%
%\subsection{Why using EVT in AD ?}
%
% Incidentally, one
% of the main feature of EVT is to allow extrapolation beyond the more
% extremal observations (`out-of-sample'), and thus inference on these regions.
% In this context, learning the structure of extremes allows to build a
% `normal profil' to be confronted with new extremal data.
% Nevertheless, if such a representation of extremes is learned in a
% unsupervised manner -namely from normal as from abnormal data-, it
% runs the risk of \textit{fitting abnormal observations}. It is
% therefore primordial to control the model complexity, especially in a
% non-parametric context such as multivariate EVT.
%
%
% %\subsection{Advantage of this method}
% % As \textbf{Anomaly Ranking} is a useful and necessary step to AD, t
% The algorithm proposed in this paper provides a scoring function which ranks
% extreme observations according to their supposed degree of abnormality. This method is complementary to other AD
% algorithms, insofar as two algorithms (that presented here, together with any
% other appropriate AD algorithm) may be trained. Then, the input space may be divided into two regions,
% extreme and non-extreme. Afterwards, a new
% observation in the central region (\emph{resp.} in the extremal
% region) would be classified as abnormal or
% not according to the scoring function issued by the generic algorithm (\emph{resp.} that presented
% here).
% % after training both algorithms on the whole
% % input space, the latter may be divided into two regions, extreme and
% % non-extreme, where % data can be divided into two datasets, extreme and non-extreme data, respectively used to learn a standard scoring function and an extreme one.
% %
% % Avoiding fitting anomalies as discussed above is a major concern in unsupervised AD.
% The scope of our algorithm concerns both semi-supervised and
% unsupervised problems. Undoubtedly, as it consists in learning a
% normal behavior in extremal regions, it is optimally efficient when
% trained on normal observations only. %a dataset of only normal
% %observations.
% However it also applies to unsupervised situations. % on a single unlabelized mixed dataset where training and detection are both peformed,
% Indeed, it involves a non-parametric but relatively coarse estimation
% scheme which prevents from overfitting normal data or fitting anomalies.
% % More precisely, the normal representation of extreme data identifies
% % % lies on finding
% % the groups of asymptotically dependent features, and new data are
% % scored, up to %%a weight and
% % a fonctionnal of the norm, according the
% % the degree of dependence between the observed large components.
% As a consequence, this method is robust to outliers and also applies when the training dataset contains a (small) proportion of anomalies.
% %
Experimental results show that this method significantly
improves the performance in extreme regions, since the risk
is taken not to uniformly predict the most extreme observations as abnormal, but to learn their dependence structure.
These improvements may typically be useful in applications where the cost of false positive errors (\ie~false alarms) is very high (\eg~predictive maintenance in aeronautics).
\bigskip
The structure of this chapter is as follows. The whys
and wherefores of multivariate EVT are explained in the following
Section~\ref{jmva:sec:framework}. A non-parametric estimator of the
subfaces' mass is introduced in Section~\ref{jmva:sec:estimation}, the
accuracy of which is
investigated by establishing % where consistency and
finite sample error bounds relying on {\sc VC} inequalities
tailored to low probability regions.
An application to anomaly detection is proposed in Section~\ref{jmva:sec:appliAD}, followed by a novel anomaly detection algorithm which relies on the above-mentioned non-parametric estimator.
% is described
% in Section~\ref{jmva:sec:algo} as a natural pathway to learn the extreme
% dependence structure. The rationale behind can be understood without knowledge of EVT,
% This section reveals that the algorithm
% %incidentally
% relies
% on estimating classical quantities in multivariate EVT.
Experiments on both simulated and real data are performed in Section~\ref{jmva:sec:experiments}. Technical details are deferred to the end of this chapter, Section~\ref{jmva:appendix_proof}.
\section{Multivariate EVT Framework and Problem Statement}
\label{jmva:sec:framework}
% Extreme Value Theory (\textsc{EVT}) develops models for learning the
% unusual rather than the usual. These models are widely used in fields
% involving risk management like finance, insurance, telecommunication
% or environmental sciences.
% One major application of \textsc{EVT} is to provide a reasonable
% assessment of the probability of occurrence of rare events.
%
%
Extreme Value Theory (\textsc{EVT}) develops models to provide a reasonable
assessment of the probability of occurrence of rare events. Such models are widely used in fields
involving risk management such as Finance, Insurance, Operations Research, Telecommunications
or Environmental Sciences. For clarity, we start off by recalling some key notions developed in Chapter~\ref{back:EVT} pertaining to (multivariate) \textsc{EVT}, which shall be involved in the formulation of the problem stated next and in its subsequent analysis.
%\subsection{Background on (multivariate) Extreme Value Theory}
First recall the primal assumption of multivariate extreme value theory.
For a $d$-dimensional \rv~$\mb X = (X^1,\; \ldots, \; X^d)$ with distribution $\mb F(\mb x):=\mathbb{P}(X^1 \le x_1, \ldots, X^d \le x_d)$, this assumption, namely that $\mb F$ belongs to the domain of attraction of some non-degenerate distribution function $\mb G$ (written $\mb F \in \textbf{DA} (\mb G)$), stipulates the existence of two sequences $\{\mb a_n, n \ge 1\}$ and $\{\mb b_n, n \ge 1\}$ in $\mathbb{R}^d$, the $\mb a_n$'s being positive, such that
\begin{align}
\label{jmva:intro:assumption2}
\lim_{n \to \infty} n ~\mathbb{P}\left( \frac{X^1 - b_n^1}{a_n^1} ~\ge~ x_1 \text{~or~} \ldots \text{~or~} \frac{X^d - b_n^d}{a_n^d} ~\ge~ x_d \right) = -\log \mb G(\mathbf{x})
\end{align}
for all continuity points $\mathbf{x} \in \mathbb{R}^d$ of $\mb G$.
Recall also that considering the standardized variables
$V^j =1/(1-F_j(X^j))$ and $\mathbf{V}=(V^1,\; \ldots,\; V^d)$, Assumption~\eqref{jmva:intro:assumption2} implies the existence of a limit measure $\mu$ on $ [0,\infty]^d\setminus\{\mb 0\}$ such that % convergence of
\begin{align}
\label{jmva:intro:regvar}
n~ \mathbb{P}\left( \frac{V^1 }{n} ~\ge~ v_1 \text{~or~} \cdots
\text{~or~} \frac{V^d }{n} ~\ge~ v_d \right) \xrightarrow[n\to\infty]{}\mu \left([\mb 0,\mb v]^c\right),
\end{align}
where $[\mb 0,\mathbf{v}]:=[0,\; v_1]\times \cdots \times
[0,\; v_d]$. The dependence structure of the limit $\mb G$ in (\ref{jmva:intro:assumption2}) can then be expressed by means of the so-termed \textit{exponent measure} $\mu$:
\begin{align}
- \log \mb G(\mathbf{x})= \mu\left( \left[ \mb 0, \left(\frac{-1}{\log G_1(x_1)}, \dots ,\frac{-1}{\log G_d(x_d)}\right)\right]^c\right). \nonumber
\end{align}
% The latter is finite on
% sets bounded away from $\mb 0$ and has the
% homogeneity property : $\mu(t\point) =
% t^{-1}\mu(\point)$. Observe in addition that, due to the standardization chosen (with
% `nearly' Pareto margins), the support of $\mu$ is included in $[\mb 0,\; \mathbf{1}]^c$.
The measure $\mu$ should be viewed, up to a normalizing factor, as
the asymptotic distribution of $\mb V$ in extreme regions. Also, for any Borel set $A$ bounded away from $\mb 0$ which is a continuity set of $\mu$, we have
\begin{align}
\label{jmva:eq:regularVariation}
t~ \mathbb{P}\left( \mb V \in t A\right) \xrightarrow[t\to\infty]{}\mu(A).
\end{align}
Using the homogeneity property $\mu(t\point) =
t^{-1}\mu(\point)$, $\mu$ can be decomposed into a radial component and an angular component $\Phi$, which are independent of each other.
For all $\mb v = (v_1,...,v_d) \in \mathbb{R}^d$, set
\begin{align}\label{jmva:eq:pseudoPolar_change}
\left\{ \begin{aligned}
R(\mb v)&:= \|\mb v\|_\infty ~=~ \max_{i=1}^d v_i, \\
\Theta (\mb v) &:= \left( \frac{v_1}{R(\mb v)},..., \frac{v_d}{R(\mb v)} \right)
\in S_\infty^{d-1},
\end{aligned}\right.
\end{align}
where $S_\infty^{d-1}$ is the positive orthant of the unit sphere in $\mathbb{R}^d$ for the infinity norm.
Define the \emph{spectral measure} (also called \emph{angular measure}) by $\Phi(B)= \mu (\{\mb v~:~R(\mb v)>1 ,
\Theta(\mb v) \in B \})$. Then, for every $B
\subset S_\infty^{d-1}$,
\begin{align}
\label{jmva:mu-phi}
\mu\{\mb v~:~R(\mb v)>z, \Theta(\mb v) \in B \} = z^{-1} \Phi (B)~.
\end{align}
In a nutshell, there
is a one-to-one correspondence between the exponent measure $\mu$ and the angular measure
$\Phi$; both of them can be used to characterize the asymptotic tail
dependence of the distribution $\mb F$ (as
soon as the margins $F_j$ are known), since % after each marginal had been standardized in $\mathbf{U}$ or $\mathbf{V}$.
\begin{align}\label{jmva:eq:integratePhiLambda}
\mu \big( [\mb 0,\mathbf{x}^{-1}]^c \big) = \int_{\boldsymbol{\theta} \in S_{\infty}^{d-1}} \max_j{\theta_j x_j} \;\ud \Phi(\boldsymbol{\theta}).
\end{align}
Recall that here and beyond, operators on vectors are understood component-wise, so that $\mb x^{-1}=(x_1^{-1},\ldots,x_d^{-1})$.
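As a standard illustration of~\eqref{jmva:eq:integratePhiLambda} in dimension $d=2$ (a textbook example, not specific to the present framework), recall that the Pareto standardization forces $\mu(\{\mb v: v_j > z\}) = z^{-1}$, hence $\int \theta_j \,\ud \Phi(\boldsymbol{\theta}) = 1$ for $j=1,2$. Perfect asymptotic dependence then corresponds to $\Phi = \delta_{(1,1)}$ and asymptotic independence to $\Phi = \delta_{(1,0)} + \delta_{(0,1)}$, yielding respectively
\begin{align}
\mu \big( [\mb 0,\mathbf{x}^{-1}]^c \big) = \max(x_1,\, x_2)
\quad\text{and}\quad
\mu \big( [\mb 0,\mathbf{x}^{-1}]^c \big) = x_1 + x_2. \nonumber
\end{align}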
The angular measure can be seen as the asymptotic conditional distribution of the
`angle' $\Theta$ given that the radius $R$ is large, up to the
normalizing constant $\Phi(S_\infty^{d-1})$. Indeed, dropping
the dependence on $\mb V$ for convenience, we have for any
\emph{continuity set} $A$ of $\Phi$,
\begin{align}
\label{jmva:eq:limitConditAngle}
\begin{aligned}
\PP(\Theta \in A ~|~R>r ) =
\frac{r \PP(\Theta \in A , R>r ) }{r\PP(R>r)}
\xrightarrow[r\to \infty]{} \frac{\Phi(A)}{\Phi(S_\infty^{d-1})} .
\end{aligned}
\end{align}
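For illustration, the following sketch (in Python; the helper name, interface and example region are our own assumptions, not part of the text) computes the pseudo-polar coordinates~\eqref{jmva:eq:pseudoPolar_change} of standardized points and the empirical analogue of the left-hand side of~\eqref{jmva:eq:limitConditAngle}, \ie~the fraction of `angles' falling in a region $A$ among the points with radius larger than $r$.
\begin{verbatim}
import numpy as np

def empirical_angular_fraction(V, r, in_A):
    """Empirical counterpart of P(Theta in A | R > r).

    V    : (n, d) array of (approximately) standardized points, V >= 0,
    r    : radial threshold (taken large),
    in_A : boolean function of an angle theta in S_inf^{d-1}.
    """
    R = np.max(V, axis=1)              # R(v) = ||v||_inf
    keep = R > r                       # extreme points only
    Theta = V[keep] / R[keep, None]    # Theta(v) = v / R(v)
    hits = np.array([in_A(theta) for theta in Theta])
    return hits.mean() if keep.any() else np.nan

# Example region: A = {theta : theta_1 > 1/2}
# frac = empirical_angular_fraction(V, r=100.0, in_A=lambda t: t[0] > 0.5)
\end{verbatim}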
\subsection{Statement of the Statistical Problem}\label{jmva:sec:decomposMu}
The focus of this work is on
the dependence structure in extreme regions of a %heavy-tailed
random vector $\mb X$ in a multivariate domain of attraction (see
(\ref{jmva:intro:assumption2})). This asymptotic dependence %\eqref{jmva:intro:regvar} in
is fully described by the exponent measure $\mu$, or
equivalently by the spectral measure $\Phi$. The goal
of this contribution is to infer a meaningful (possibly sparse) summary of the latter. %%investigate the problem of infering
As shall be seen below,
since the support of $\mu$ can be naturally partitioned in a specific
and interpretable manner, this boils down to accurately recovering the
mass spread on each element of the partition. In order to formulate
this approach rigorously, additional %assumptions and
definitions are required.
\medskip
\noindent{\bf Truncated cones}. For any non empty subset of features $\alpha\subset\{1,\; \ldots,\; d \}$, consider the truncated cone (see Fig.~\ref{jmva:fig:3Dcones})
\begin{align}
\label{jmva:cone}
\mathcal{C}_\alpha = \{\mb v \ge 0,~\|\mb v\|_\infty \ge 1,~ v_j > 0 ~\text{ for } j \in \alpha,~ v_j = 0 ~\text{ for } j \notin \alpha \}.
\end{align}
%As mentioned in the introduction, the theory provides very little
%information concerning the form of the limit $\mu$ (\emph{resp.}
%$\Phi$). In particular, for any $\alpha\subset\{1,\ldots d\}$, some
%$\mu$-mass may be present --or not-- on the truncated subcone of
%$\rset^d_+$
%\begin{align}
%\label{jmva:cone}
%\mathcal{C}_\alpha = \{v \ge 0,~\|v\|_\infty \ge 1,~ v_j > 0 ~\text{ for } j \in \alpha,~ v_j = 0 ~\text{ for } j \notin \alpha \}.
%\end{align}
The corresponding subset of the sphere is
\begin{align}
\Omega_{\alpha} = \{\mb x \in S_{\infty}^{d-1} : x_i > 0 \text{ for } i\in\alpha~,~ x_i = 0 \text{ for } i\notin \alpha \}
= S_{\infty}^{d-1}\cap {\mathcal{C}}_\alpha, \nonumber
\end{align}
and we clearly have $\mu(\mathcal{C}_\alpha) = \Phi(\Omega_\alpha)$ for any $\emptyset\neq \alpha \subset\{1,\; \ldots,\; d \}$.
% Up to appropriate standardization/thresholding, data lying in $ \mathcal{C}_\alpha$ correspond to observations where the components $X^i$ with $i\in \alpha$ are simultaneously large, while the others staying small.
The collection $\{\mathcal{C}_\alpha:\; \emptyset \neq
\alpha\subset \{1,\; \ldots,\; d \}\}$ forming a partition of the
truncated positive orthant $\mathbb{R}_+^{d}\setminus[\mb 0,\mb 1]$, one may naturally decompose the exponent measure as
\begin{align}\label{jmva:eq:decomp1}
\mu = \sum_{\emptyset \neq \alpha\subset\{1,\ldots ,d\}}
\mu_\alpha,
\end{align}
where each component $\mu_\alpha$ is concentrated on the
untruncated cone corresponding to ${\cal C}_\alpha$.
Similarly,
the $\Omega_\alpha $'s forming a partition of
$S_\infty^{d-1}$, we have
\begin{align}
\Phi ~=~ \sum_{\emptyset \neq \alpha\subset\{1,\ldots ,d\}} \Phi_\alpha ~, \nonumber
\end{align}
where $\Phi_\alpha$ denotes the restriction of $\Phi$ to % $S_{\infty}^{d-1} \cap
${\Omega}_\alpha$ for all $\emptyset\neq \alpha \subset\{1,\; \ldots,\; d \}$.
The fact that mass is spread on $\cone_\alpha$ indicates that conditioned upon
the event `$R(\mb V)$ is large' (\ie~an excess of a large radial threshold),
the components $V^j (j\in\alpha)$ may be simultaneously large while
the other $V^j$'s $(j\notin\alpha)$ are small, with positive
probability.
Each index subset $\alpha$ thus defines a specific direction in the tail region.
However, this interpretation should be handled with care,
since for $\alpha\neq\{ 1,\ldots,d\}$, if $\mu(\cone_\alpha)>0$,
then $\cone_\alpha$ is not a continuity set of $\mu$
(it has empty interior), nor is $\Omega_\alpha$ a continuity set
of $\Phi$. Thus, the quantity $t \PP(\mb V \in t \cone_\alpha)$ does not necessarily converge to
$\mu(\cone_\alpha)$ as $t\rightarrow +\infty$.
Actually, if $\mb F$ is continuous, we have $\PP(\mb V \in t \cone_\alpha) =0$
for any $t>0$. However, consider for $\epsilon \ge 0$ the {\it $\epsilon$-thickened rectangles }
\begin{align}
\label{jmva:eq:epsilon_Rectangle} %TODO : eq:epsilonCone ----> eq:epsilonSphere in the following (except the first one)
R_\alpha^\epsilon~=\{\mb v \ge 0,~\|\mb v\|_\infty \ge 1,~ v_j > \epsilon ~\text{ for } j \in \alpha,
~v_j \le \epsilon ~\text{ for } j \notin \alpha
\}.
\end{align}
Since, for fixed $\alpha$, the boundaries of the sets $R_\alpha^\epsilon$, $\epsilon>0$, are disjoint, only a countable number of them may be discontinuity sets of $\mu$. Hence, the threshold $\epsilon$ may be chosen arbitrarily small in such a way that
$R_\alpha^\epsilon$ is a continuity set of $\mu$.
The result stated below
shows that nonzero mass on $\cone_\alpha$ is the same as
nonzero mass on $R_\alpha^\epsilon$ for $\epsilon$ arbitrarily small.
\noindent
\begin{minipage}{0.5\linewidth}
\centering
\includegraphics[scale=0.2]{fig_source/cone-resized}
\captionof{figure}{Truncated cones in 3D}
\label{jmva:fig:3Dcones}
\end{minipage}\hfill
\begin{minipage}{0.5\linewidth}
\centering
\includegraphics[width=0.64\linewidth]{fig_source/representation2D}
\captionof{figure}{Truncated $\epsilon$-rectangles in 2D}
\label{jmva:2Dcones}
\end{minipage}
\begin{lemma}\label{jmva:lem:limit_muCalphaEps}
For any non empty index subset $\emptyset \neq \alpha\subset\{1,\ldots,d\}$, the exponent measure of
$\cone_\alpha$ is
\[
\mu(\cone_\alpha) = \lim_{\epsilon\to 0} \mu(R_\alpha^\epsilon).
\]
\end{lemma}
\begin{proof}
First consider the case $\alpha=\dd$. Then the $R_\alpha^\epsilon$'s form an increasing sequence of sets as $\epsilon$ decreases, and $\mathcal{C}_\alpha = R_\alpha^0 = \cup_{\epsilon>0, \epsilon \in \mathbb{Q}}~R_\alpha^\epsilon$. The result follows from the `continuity from below' property of the measure $\mu$.
Now, for $\epsilon\ge 0$ and $\alpha\subsetneq\{1,\; \ldots,\; d\}$, consider the sets
\begin{align}
O_\alpha^\epsilon & =\{ \mb x \in\rset_+^d~:~ \ninf{\mb x}\ge 1,~ \forall j \in \alpha: x_j > \epsilon \}, \nonumber \\
N_\alpha^\epsilon & =\{\mb x \in\rset_+^d~:~ \ninf{\mb x}\ge 1,~ \forall j \in \alpha: x_j > \epsilon, ~\exists j \notin\alpha: x_j > \epsilon \}, \nonumber
\end{align}
so that $N_\alpha^\epsilon \subset O_\alpha^\epsilon$ and $R_\alpha^\epsilon = O_\alpha^\epsilon \setminus N_\alpha^\epsilon$. Observe also that $\cone_\alpha = O_\alpha^0\setminus N_\alpha^0$. Thus, $\mu(R_\alpha^\epsilon) = \mu(O_\alpha^\epsilon) - \mu(N_\alpha^\epsilon)$, and $\mu(\cone_\alpha) = \mu(O_\alpha^0) - \mu(N_\alpha^0)$, so that it is sufficient to show that
\begin{align}
\mu(N_\alpha^0) = \lim_{\epsilon\to 0}\mu(N_\alpha^\epsilon) ,
\quad \text{and} \quad
\mu(O_\alpha^0) = \lim_{\epsilon\to 0}\mu(O_\alpha^\epsilon). \nonumber
\end{align}
Notice that the $N_\alpha^\epsilon$'s and the $O_\alpha^\epsilon$'s form two increasing sequences of sets (when $\epsilon$ decreases), and that $N_\alpha^0 = \bigcup_{\epsilon>0,\epsilon\in\mathbb{Q}} N_\alpha^\epsilon$, $O_\alpha^0 = \bigcup_{\epsilon>0,\epsilon\in\mathbb{Q}} O_\alpha^\epsilon$. The `continuity from below' property of $\mu$ thus yields the desired result.
\end{proof}
%Equipped with Lemma~\ref{jmva:lem:limit_muCalphaEps},
We may now make precise the above heuristic
interpretation of the quantities $\mu(\cone_\alpha)$: the vector
$\mathcal{M}=\{ \mu(\mathcal{C}_{\alpha}):\; \emptyset \neq
\alpha\subset\{1,\; \ldots,\; d \}\}$ asymptotically describes the
dependence structure of the extremal observations.
% Equipped with Lemma~\ref{jmva:lem:limit_muCalphaEps}, the meaning of the
% statement
% `$\mu(\cone_\alpha) > 0$' becomes clear.
%
% Indeed, since the
% boundaries of the sets $\Omega_\alpha^\epsilon$ (viewed as subsets of
% $\mathbb{S}_\infty^{d-1}$) are disjoint, only a countable number of them may be
% discontinuity sets of $\Phi$. Thus, by homogeneity, the
% number of the sets $\cone_\alpha^\epsilon$ which are discontinuity
% sets of $\mu$ is at most countable. Hence, the threshold $\epsilon$ may be chosen arbitrarily small (so that, by
% Lemma~\ref{jmva:lem:limit_muCalphaEps}, $\mu(\cone_\alpha^\epsilon)$ is
% arbitrarily close to $\mu(\cone_\alpha)$), in such a way that
% $\cone_\alpha^\epsilon$ is a continuity set of $\mu$,
Indeed, by
Lemma~\ref{jmva:lem:limit_muCalphaEps}, and the discussion above, $
\epsilon$ may be chosen such that $R_\alpha^\epsilon$ is a
continuity set of $\mu$, while $\mu(R_\alpha^\epsilon)$ is
arbitrarily close to $\mu(\cone_\alpha)$. Then, using the
characterization (\ref{jmva:eq:regularVariation}) of $\mu$,
the following asymptotic identity holds true:
\begin{align}
\lim_{t\to\infty} t \PP\left( \|\mb V\|_\infty\ge t, V^j> \epsilon t~~ (j \in\alpha), V^j \le \epsilon t~~ (j\notin\alpha)\right) &=\mu(R_\alpha^\epsilon) \\ \nonumber
&\simeq \mu(\cone_\alpha). \nonumber
\end{align}
\begin{remark}
\label{jmva:rk_approx_mu_n}
In terms of conditional probabilities, denoting $R = \|T(\mb X)\|$, where
$T$ is the standardization map $\mb X\mapsto \mb V$, we have
\begin{align}
\nonumber \PP(T(\mb X)\in r R_\alpha^\epsilon~|~ R>r) =
\frac{r \PP(\mb V\in r R_\alpha^\epsilon)}{ r\PP\left(\mb V \in r([\mb 0,\mathbf{1}]^c)\right)} \xrightarrow[r\to\infty]{}\frac{\mu(R_\alpha^\epsilon)}{\mu([\mb 0,\mathbf{1}]^c)},
\end{align}
as in~\eqref{jmva:eq:limitConditAngle}. In other terms,
\begin{align}
\PP\left( V^j> \epsilon r~~ (j \in\alpha), V^j \le \epsilon r~~ (j\notin\alpha) ~\big\vert~ \|\mb V\|_\infty\ge r \right) &\xrightarrow[r\to\infty]{} C \mu(R_\alpha^\epsilon) \\ \nonumber
&~~~~~\simeq C\mu(\cone_\alpha), \nonumber
\end{align}
where $C = 1/ \Phi(S_\infty^{d-1}) =1/\mu([\mb 0,\mathbf{1}]^c) $.
This clarifies the meaning of `large' and `small' in the heuristic
explanation given above.
\end{remark}
\noindent {\bf Problem statement.} % The goal of this paper is to estimate the dependence structure of the $d$-dimensional heavy-tailed r.v. $X$ in extreme
% regions from i.i.d. observations.
As explained above, our goal is to describe the dependence on extreme
regions by investigating the structure of $\mu$ (or, equivalently,
that of $\Phi$). % , which is determined
% by that of the spectral measure $\Phi$.
More precisely, the aim is twofold. First, recover a rough
approximation of the support of $\Phi$ based on the partition
$\{\Omega_\alpha, \alpha\subset\{1,\ldots,d\}, \alpha\neq
\emptyset\}$, that is, determine which $\Omega_\alpha$'s have
nonzero mass, or equivalently, which $\mu_\alpha$'s (\emph{resp.}
$\Phi_\alpha$'s) are nonzero. This support estimation is potentially
sparse (if a small number of $\Omega_\alpha$'s have non-zero mass) and
possibly low-dimensional (if the dimension of the sub-cones
$\Omega_\alpha$ with non-zero mass is low).
%a well chosen partition of the input space,
The second objective is to
investigate how the exponent measure $\mu$ spreads its mass on the
$\mathcal{C}_{\alpha}$'s, the theoretical quantity
$\mu(\mathcal{C}_{\alpha})$ indicating to which extent extreme
observations may occur in the `direction' $\alpha$ for $\emptyset
\neq \alpha \subset \{1,\; \ldots,\; d \}$.
% estimate the amount of angular mass
% $\mu(\cone_\alpha) = \Phi(\Omega_\alpha)$ on each non-void element of the
% partition.
These two goals are achieved using empirical versions of
the angular measure defined in
Section~\ref{jmva:sec:classicEstimators}, evaluated on the
$\epsilon$-thickened rectangles $R_\alpha^\epsilon$.
% This problem can be tackled by investigating how the exponent measure
% $\mu$ spreads its mass on the $\mathcal{C}_{\alpha}$'s, the
% theoretical quantity $\mu(\mathcal{C}_{\alpha})$ indicating to which
% extent extreme observations may occur in the `direction' $\alpha$ for
% $\emptyset \subsetneq \alpha \subset \{1,\; \ldots,\; d \}$.
Formally, we wish to recover the $(2^{d}-1)$-dimensional unknown
vector
\begin{align}
\label{jmva:eq:representation_M}
\mathcal{M}=\{ \mu(\mathcal{C}_{\alpha}):\; \emptyset \neq \alpha\subset\{1,\; \ldots,\; d \}\}
\end{align}
from $\mb X_1,\;
\ldots,\; \mb X_n\overset{i.i.d.}{\sim} \mb F$ and build an estimator
$\widehat{\mathcal{M}}$ such that
\begin{align}
\nonumber
\vert\vert \widehat{\mathcal{M}} -\mathcal{M}
\vert\vert_{\infty} \;=\; \sup_{\emptyset \neq \alpha \subset \{1,\; \ldots,\; d \}}\; \vert
\widehat{\mathcal{M}}(\alpha)- \mu(\mathcal{C}_{\alpha})\vert
\end{align}
is small with large probability. % As $\mu(\mathcal{C}_{\alpha})=\Phi(\Omega_{\alpha})$ for any $\alpha$ and
In view of Lemma~\ref{jmva:lem:limit_muCalphaEps}, (biased) estimates of
$\mathcal{M}$'s components are built from an empirical version of
the exponent measure, evaluated on the
$\epsilon$-thickened rectangles $R_\alpha^\epsilon$ (see Section~\ref{jmva:sec:classicEstimators} below). As a by-product, one obtains an estimate of the support of the limit measure $\mu$,
\begin{align}
\bigcup_{\alpha:\; \widehat{\mathcal{M}}(\alpha)>0 }\mathcal{C}_{\alpha}. \nonumber
\end{align}
The results stated in the next section are non-asymptotic and sharp bounds are given by means of {\sc VC} inequalities tailored to low probability regions.
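To fix ideas on the object being estimated, the sketch below (in Python; the counting-based estimate with a fixed radial threshold and tolerance is a simplified stand-in for the estimator detailed in Section~\ref{jmva:sec:estimation}, and all names are ours) stores an empirical analogue of the vector $\mathcal{M}$ as a dictionary indexed by the faces $\alpha$, and returns the corresponding support estimate $\bigcup_{\alpha:\, \widehat{\mathcal{M}}(\alpha)>0 }\mathcal{C}_{\alpha}$ as the list of faces with positive estimated mass.
\begin{verbatim}
import numpy as np

def estimate_M(V, radius, eps=0.1):
    """Simplified empirical analogue of the vector M = {mu(C_alpha)}.

    A point v with ||v||_inf >= radius is assigned to the face
    alpha = {j : v_j > eps * radius}, i.e. to the rectangle R_alpha^eps
    scaled by `radius`; it contributes a weight radius / n, mimicking
    the scaling t * P(V in t A) of the exponent measure.
    """
    n = len(V)
    M_hat = {}
    for v in V:
        if np.max(v) < radius:
            continue                       # not in the extreme region
        alpha = tuple(np.nonzero(v > eps * radius)[0])
        if alpha:
            M_hat[alpha] = M_hat.get(alpha, 0.0) + radius / n
    return M_hat

def support_estimate(M_hat):
    """Faces alpha with positive estimated mass: estimated support of mu."""
    return [alpha for alpha, mass in M_hat.items() if mass > 0]
\end{verbatim}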
%, which is determined
%by that of the spectral measure $\Phi$.
%More precisely, the aim is
%twofold: first, recover a rough approximation of the support of
%$\Phi$ based on the partition $\{\Omega_\alpha,
%\alpha\subset\{1,\ldots,d\}, \alpha\neq \emptyset\}$, that is,
%determine which $\Omega_\alpha$'s have nonzero mass, or equivalently,
%which $\mu_\alpha's$ (\emph{resp.} $\Phi_\alpha$'s) are nonzero;
%a well chosen partition of the input space,
%second, estimate the amount of angular mass
%$\mu(\cone_\alpha) = \Phi(\Omega_\alpha)$ on each non-void element of the
%partition. These two goals are achieved using empirical versions of
%the angular measure defined in
%section~\ref{jmva:sec:nonParamEstimators}, evaluated on the
%$\epsilon$-thickened cones $\cone_\alpha^\epsilon$. Non-asymptoitc upper bounds on
%the error are derived in section~\ref{jmva:sec:estimation} using VC
%inequalities adapted to low probability regions.
\subsection{Regularity Assumptions}\label{jmva:sec:RegularAssumptions} % and Further Notations}
Beyond the existence of the limit measure $\mu$ (\ie~multivariate regular variation of $\mathbf{V}$'s distribution, see~\eqref{jmva:intro:regvar}), and thus the existence of an angular measure $\Phi$ (see (\ref{jmva:mu-phi})),
three additional assumptions are made, which are natural when estimation of the support of a distribution is considered.
\begin{assumption}\label{jmva:hypo:continuous_margins}
The margins of $\mb X$ have continuous c.d.f.'s, namely $F_j$ is continuous for every $1 \le j \le d$.
\end{assumption}
\noindent
Assumption~\ref{jmva:hypo:continuous_margins}
is widely used in the context of non-parametric estimation of the dependence
structure (see \emph{e.g.} \cite{Einmahl2009}): it ensures that the transformed variables $V^j = (1 -
F_j(X^j))^{-1}$ (\emph{resp.} $U^j = 1 -F_j(X^j)$) have indeed a
standard Pareto distribution, $\P(V^j>x) = 1/x,~ x\ge 1$ (\emph{resp.}
the $U^j$'s are uniform variables).
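As an aside, the standardization invoked here admits a direct numerical counterpart. The sketch below (in Python) applies the map $X^j \mapsto 1/(1-F_j(X^j))$ when the margins are known; the rank-based fallback, dividing by $n+1$, is only an illustrative assumption of ours -- the empirical versions actually used are specified in Section~\ref{jmva:sec:classicEstimators}.
\begin{verbatim}
import numpy as np

def pareto_standardize(X, margins=None):
    """Map X^j to V^j = 1 / (1 - F_j(X^j)), standard Pareto when F_j is
    the true (continuous) margin.

    If `margins` (a list of callables F_j) is None, each F_j is replaced
    by a slightly shrunk empirical c.d.f. -- an illustrative assumption,
    not the estimator prescribed in the text.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    if margins is not None:
        F = np.column_stack([margins[j](X[:, j]) for j in range(d)])
    else:
        # rank-based empirical c.d.f., divided by (n + 1) to stay below 1
        ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1
        F = ranks / (n + 1.0)
    return 1.0 / (1.0 - F)
\end{verbatim}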
\bigskip
For any non-empty subset $\alpha$ of $\{1,\; \ldots,\;d\}$, one denotes by $\ud x_\alpha$ the Lebesgue measure on ${\cal C}_\alpha$ and writes $\ud x_\alpha = \ud x_{i_1}\ldots\ud x_{i_k}$ when $\alpha=\{i_1, \ldots , i_k\}$. For convenience, we also write $\ud x_{\alpha\setminus{i}}$ instead of $\ud x_{\alpha\setminus{\{i\}}}$.
% \begin{align*}
% \mathcal{C}_\alpha^\epsilon = \{ \|x\| \ge 1: &\frac{x_i}{\|x\|_\infty} > \epsilon \text{ for }
% i\in\alpha, \\
% &\frac{x_i}{\|x\|_\infty}\le\epsilon \text{ for } i\notin \alpha \quad \}~.
% \end{align*}
\begin{assumption}\label{jmva:hypo:continuousMu}
Each component $\mu_\alpha$ of~\eqref{jmva:eq:decomp1} is absolutely continuous w.r.t.
Lebesgue measure $\ud x_\alpha$ on ${\cal C}_\alpha$. %This implies that $\Phi_\alpha$ is absolutely continuous with respect to the Lebesgue measure on $\Omega_\alpha$. It is also assumed that, each of the induced densities is bounded.
\end{assumption}
\noindent
% For $\alpha\subset \{1,\ldots ,d\}$, $\alpha\neq
% \emptyset$, the subset of the sphere corresponding to the cones is
% \begin{align*}
% \Omega_{\alpha} & = \{x \in S_{\infty}^{d-1} : x_i > 0 \text{ for } i\in\alpha~,~ x_i = 0 \text{ for } i\notin \alpha \} \\
% & = S_{\infty}^{d-1}\cap {\mathcal{C}}_\alpha ~ .
% \end{align*}
% \noindent
% Thus, the $\Omega_\alpha $'s form a partition of $S_\infty^{d-1}$, and we have
% \begin{align*}
% \mu(\mathcal{C}_i) ~=~ \Phi(\Omega_i) \text{ ~~~and~~~ } \Phi ~=~ \sum_{\emptyset \subsetneq \alpha\subset\{1,\ldots ,d\}} \Phi_\alpha ~,
% \end{align*}
% \noindent
% where $\Phi_\alpha$ denotes the restriction of $\Phi$ to $S_{\infty}^{d-1} \cap {\Omega}_\alpha$.
Assumption~\ref{jmva:hypo:continuousMu} has a very convenient consequence
regarding $ \Phi$: the fact that the exponent measure $\mu$ spreads no mass on subsets of the form
$\{\mb x: \;\ninf{\mb x} \ge 1, x_{i_1} = \dotsb = x_{i_r} \neq 0 \}$ with $r \ge 2$,
implies that the spectral measure $\Phi$ spreads no mass on edges $\{\mb x: \;\ninf{\mb x} = 1, \; x_{i_1} = \dotsb = x_{i_r} =1 \}$ with $r \ge 2~.$
This is summarized by the following result.
\begin{lemma}\label{jmva:lem:continuousPhi}
Under Assumption~\ref{jmva:hypo:continuousMu}, the following assertions hold true.
\begin{itemize}
\item $ \Phi$ is concentrated on the (disjoint) edges
\begin{align}
\Omega_{\alpha,i_0} = \{\mb x: \; \ninf{\mb x} = 1,\; x_{i_0} = 1,\;\; & 0< x_i < 1 \;\text{ for } i \in \alpha \setminus \{i_0\},\\ \nonumber
& x_i=0 \;\text{ for } i\notin \alpha \;\} \nonumber
\end{align}
for $i_0\in\alpha$, $\emptyset \neq \alpha\subset\{1,\; \ldots,\; d \}$.
\item The restriction $\Phi_{\alpha,i_0}$ of $\Phi$ to
$\Omega_{\alpha,i_0}$ is absolutely continuous \wrt~the Lebesgue measure $\ud x_{\alpha\setminus{i_0}}$ on the cube's edges, whenever $|\alpha|\ge 2 $.
\end{itemize}
\end{lemma}
\begin{proof}
The first assertion straightforwardly results from the discussion above. Turning to the
second point, consider any measurable
set $D \subset \Omega_{\alpha,i_0}$ such that $\int_{D}\ud x_{\alpha \setminus i_0} = 0$. Then the
induced truncated cone $\tilde D = \{ \mb v:~ \|\mb v\|_\infty \ge
1, \mb v / \|\mb v\|_\infty \in D \}$ satisfies $\int_{\tilde D}\ud
x_{\alpha} = 0$ and is included in $\mathcal{C}_\alpha$. Thus, by virtue of
Assumption~\ref{jmva:hypo:continuousMu}, $\Phi_{\alpha,
i_0}(D)=\Phi_{\alpha}(D) = \mu_\alpha(\tilde D) = 0$.
\end{proof}
\noindent
It follows from Lemma~\ref{jmva:lem:continuousPhi} that the angular
measure $\Phi$ decomposes as $\Phi = \sum_{\alpha} \sum_{i_0\in\alpha}
\Phi_{\alpha,i_0}$ and that there exist densities $ \frac{\ud
\Phi_{\alpha,i_0}}{\ud x_{ \alpha \smallsetminus i_0}},~~
|\alpha|\ge 2,~i_0\in\alpha,$ such that for all $B \subset
\Omega_\alpha,~~ |\alpha| \ge 2$,
\begin{align}
\label{jmva:eq:decomposePhi}
\Phi(B)~=~ \Phi_\alpha(B)~=~ \sum_{i_0\in\alpha} \int_{B\cap \Omega_{
\alpha,i_0} } \frac{\ud \Phi_{\alpha,i_0}}{\ud x_{ \alpha
\smallsetminus i_0}}(x) \ud x_{\alpha\setminus i_0}.
\end{align}
% In view of equation~\eqref{jmva:eq:decomposePhi}, in particular,
% \begin{lemma}
% Each $\Phi_{\alpha,i_0}$ is continuous with respect to $\ud x_i$, for
% $i\in\alpha \setminus \{i_0\}$ .
% \end{lemma}
In order to formulate the next assumption, for $|\beta| \ge 2$, we set
\begin{align}
\label{jmva:eq:supDensity}
M_\beta = \sup_{i \in\beta} ~~ \sup_{x\in\Omega_{\beta,i}} ~~ \frac{\ud \Phi_{\beta, i}}{\ud x_{\beta \setminus i}}(x).
\end{align}
\begin{assumption}\label{jmva:hypo:abs_continuousPhi}({\sc Sparse Support})
The angular density is uniformly bounded on $S^{d-1}_\infty$ ($\forall |\beta| \ge 2,~M_\beta < \infty$), and there exists a constant $M>0$ such that $\sum_{|\beta| \ge 2} M_\beta < M$, where the sum is over subsets $\beta$ of $\{1,\ldots,d\}$ which contain at least two elements.
\end{assumption}
\begin{remark}
The constant $M$ is problem dependent. However, in the case where our representation $\mathcal{M}$ defined in \eqref{jmva:eq:representation_M} is the most informative about the angular measure, that is, when the density of $\Phi_\alpha$ is constant on $\Omega_\alpha$, one may take $M \le d$: indeed, in such a case,
$\sum_{|\beta| \ge 2} M_\beta \le \sum_{|\beta| \ge 2} M_\beta |\beta| = \sum_{|\beta| \ge 2} \Phi(\Omega_\beta) \le \sum_\beta \Phi(\Omega_\beta) \le \mu([\mb 0,\mb 1]^c)$.
% An order of magnitude for the value of $M$ is given as follows. Assuming that $\Phi$ has constant density on each sub-sphere $\Omega_\alpha$, then
% $\mu([\mb 0,\mb 1]^c) = \sum_\beta \Phi(\Omega_\beta) \ge \sum_{|\beta| \ge 2} \Phi(\Omega_\beta) = \sum_{|\beta| \ge 2} M_\beta |\beta| $.
The equality inside the last expression comes from the fact that the Lebesgue measure of a sub-sphere $\Omega_\alpha$ is $|\alpha|$, for $|\alpha| \ge 2$. Indeed, using the notations defined in Lemma~\ref{jmva:lem:continuousPhi}, $\Omega_\alpha = \bigsqcup_{i_0 \in \alpha}\Omega_{\alpha,i_0}$, each of the edges $\Omega_{\alpha,i_0}$ being a unit hypercube. % (Intuitively, in dimension 3, $\Omega_{\{1,2,3\}}$ is a cube whose 3 faces corresponding to the positive quadrant are unit squares).
Now, $\mu([\mb 0,\mb 1]^c) \le \mu(\{\mb v:~ \exists j,~ v_j > 1\}) \le d\, \mu(\{\mb v:~v_1 >1\}) \le d$.
% Noting that the Lebesgue measure of a sub-sphere $\Omega_\alpha$ is $|\alpha|$, if $\Phi$ is constant on each sub-sphere we have
% $\mu([\mb 0,\mb 1]^c) = \sum_\beta \Phi(\Omega_\beta) \ge \sum_{|\beta| \ge 2} \Phi(\Omega_\beta) = \sum_{|\beta| \ge 2} M_\beta |\beta| $
\noindent
Note that the summation $\sum_{|\beta| \ge 2} M_\beta |\beta|$ is smaller than $d$ despite the (potentially large) factors $|\beta|$. Considering $\sum_{|\beta| \ge 2} M_\beta$ is thus reasonable: in particular, $M$ will be small when only few $\Omega_\alpha$'s have non-zero $\Phi$-mass, namely when the representation vector $\mathcal{M}$ defined in \eqref{jmva:eq:representation_M} is sparse.
\end{remark}
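\noindent For illustration (a purely illustrative configuration), suppose $d=3$ and that $\Phi$ has constant density on each of the sub-spheres $\Omega_{\{1,2\}}$ and $\Omega_{\{1,2,3\}}$ and puts no mass on the other $\Omega_\alpha$'s. Writing $\varphi_{12}=\Phi(\Omega_{\{1,2\}})$ and $\varphi_{123}=\Phi(\Omega_{\{1,2,3\}})$, the Lebesgue measures of these two sub-spheres are $2$ and $3$ respectively, so that $M_{\{1,2\}}=\varphi_{12}/2$ and $M_{\{1,2,3\}}=\varphi_{123}/3$, whence
\begin{align*}
\sum_{|\beta|\ge 2} M_\beta ~=~ \frac{\varphi_{12}}{2}+\frac{\varphi_{123}}{3} ~\le~ \frac{1}{2}\big(\varphi_{12}+\varphi_{123}\big) ~\le~ \frac{1}{2}\,\mu([\mb 0,\mb 1]^c) ~\le~ \frac{d}{2}.
\end{align*}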
\noindent Assumption~\ref{jmva:hypo:abs_continuousPhi} is naturally involved in the derivation of upper bounds on the error made when approximating $\mu(\cone_\alpha)$ by the empirical counterpart of $\mu(R_\alpha^\epsilon)$.
The estimation error bound derived in Section~\ref{jmva:sec:estimation} depends on the sparsity constant $M$.
\section{A non-parametric estimator of the sub-cones' mass: definition and preliminary results}
\label{jmva:sec:estimation}
In this section, an estimator $\widehat{\mathcal{M}}(\alpha)$ of each of the sub-cones' mass
$\mu(\cone_\alpha)$, $\emptyset\neq\alpha\subset\dd$, is
proposed, based on observations $\mb X_1,\ldots, \mb X_n$, \iid~copies of $\mb X\sim \mb F$.
Bounds on the error $\vert\vert
\widehat{\mathcal{M}}-\mathcal{M}\vert\vert_{\infty}$ are
established. In the remainder of this chapter, we work under
Assumption~\ref{jmva:hypo:continuous_margins} (continuous margins, see
Section~\ref{jmva:sec:RegularAssumptions}).
Assumptions~\ref{jmva:hypo:continuousMu}~and~\ref{jmva:hypo:abs_continuousPhi}
are not necessary to prove a preliminary result on a class of
rectangles (Proposition~\ref{jmva:prop:g} and Corollary~\ref{jmva:cor:mu_n-mu}). However, they are required % to approximate cones with rectangles
% (Proposition~\ref{jmva:prop:mu_n-mu}) and
to bound the bias induced by the tolerance parameter
$\epsilon$ (in Lemma~\ref{jmva:lemma_simplex}, Proposition~\ref{jmva:prop_simplex} and in the main result, Theorem~\ref{jmva:thm-princ}).
\subsection{A natural empirical version of the exponent measure mu}
\label{jmva:sec:classicEstimators}
Since the marginal distributions $F_j$ are unknown, we classically consider
the empirical counterparts of the $\mb V_i$'s,
$\mb{\widehat V}_i = (\widehat V_i^1, \ldots,\widehat
V_i^d)$, $1\le i\le n$, as standardized variables obtained from a
rank transformation (instead of a probability integral
transformation),
\[\mb{\widehat V}_i = \left( ( 1- \widehat F_j
(X_i^j))^{-1}\right)_{1 \le j \le d}~, \]
where
$\widehat F_j (x) = (1/n) \sum_{i=1}^n \mathbf{1}_{\{X_i^j < x\}}$. The strict inequality in the definition of $\widehat F_j$ ensures that $\widehat F_j(X_i^j)\le (n-1)/n$, so that each $\widehat V_i^j$ is finite (at most equal to $n$).
%
We denote by $T$ (\emph{resp.} $\widehat T$) the standardization
(\textit{resp.} the empirical standardization),
\begin{align}
\label{jmva:def:transform}
T(\mb x) = \left( \frac{1}{1- F_j (x^j)}\right)_{1\leq j\leq d}
\text{~~and~~}
\widehat T(\mb x) = \left( \frac{1}{1- \widehat F_j(x^j)}\right)_{1\leq j\leq d}.
\end{align}
The empirical probability distribution of the rank-transformed data is then given by
\begin{align*}
\widehat{\mathbb{P}}_n=(1/n)\sum_{i=1}^n\delta_{\mb{\widehat{V}}_i}.
\end{align*}
Since for a $\mu$-continuity set $A$ bounded away from $0$, $t~ \mathbb{P}\left( \mb V \in t A\right) \to \mu(A)$ as $t \to \infty$, see~\eqref{jmva:eq:regularVariation}, a natural empirical version of $\mu$ is defined as
\begin{align}\label{jmva:mu_n}
\mu_n(A) ~=~ \frac{n}{k} \widehat{\mathbb{P}}_n (\frac{n}{k}A) ~=~ \frac{1}{k}\sum_{i=1}^n \mathbf{1}_{\{\mb{\widehat{V}}_i \in \frac{n}{k} A\}}~.
\end{align}
Here and throughout, we place ourselves in the asymptotic setting stipulating that $k = k(n) >0$ is such that $k \to \infty$ and $k = o(n)$ as $n \to \infty$.
The ratio $n/k$ plays the role of a large radial threshold.
Note that this estimator is commonly used in the field of
non-parametric estimation of the dependence structure, see \textit{e.g.}
\cite{Einmahl2009}.
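As an aside, the rank transformation~\eqref{jmva:def:transform} and the empirical measure~\eqref{jmva:mu_n} are straightforward to implement. The following sketch is purely illustrative (the NumPy-based implementation and all variable names are ours, not part of the formal procedure): it computes the $\widehat{\mb V}_i$'s and evaluates $\mu_n$ on a set $A$ supplied as a vectorized indicator function.
\begin{verbatim}
import numpy as np

def rank_transform(X):
    """V_hat_i^j = 1 / (1 - F_hat_j(X_i^j)) with
    F_hat_j(x) = (1/n) #{i : X_i^j < x}  (strict inequality)."""
    n, d = X.shape
    V_hat = np.empty((n, d))
    for j in range(d):
        # number of observations strictly smaller than each X_i^j
        F_hat = np.array([np.sum(X[:, j] < x) for x in X[:, j]]) / n
        V_hat[:, j] = 1.0 / (1.0 - F_hat)    # values range in [1, n]
    return V_hat

def mu_n(indicator_A, V_hat, k):
    """mu_n(A) = (1/k) #{i : V_hat_i in (n/k) A}."""
    n = V_hat.shape[0]
    # V_hat_i in (n/k)A  <=>  (k/n) V_hat_i in A
    return np.sum(indicator_A(V_hat * (k / n))) / k
\end{verbatim}
For instance, $\mu_n([\mb 0,\mb x]^{c})$ is obtained by passing \verb|indicator_A = lambda V: (V > x).any(axis=1)| for a fixed threshold vector \verb|x|.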
\subsection{Accounting for the non-asymptotic nature of data:
epsilon-thickening}
Since the cones $\mathcal{C}_\alpha$ have zero Lebesgue measure,
and since, under Assumption~\ref{jmva:hypo:continuous_margins}, the margins are
continuous, the cones are not likely to receive any empirical mass,
so that simply counting points in $\frac{n}{k}\mathcal{C}_\alpha$ is not an
option: with probability one, only
the largest-dimensional cone (the central one, corresponding to
$\alpha= \{1,\ldots,d\}$) will be hit.
%
In view of Subsection~\ref{jmva:sec:decomposMu} and
Lemma~\ref{jmva:lem:limit_muCalphaEps},
it is natural to introduce a
tolerance parameter $\epsilon>0$ and to approximate the asymptotic mass
of $\mathcal{C}_\alpha$ with the non-asymptotic mass of
$R_\alpha^\epsilon$. We thus define the non-parametric estimator $\hatmass(\alpha)$ of
$\mu(\cone_\alpha)$ as
\begin{align}
\label{jmva:heuristic_mu_n}
\hatmass(\alpha) = \mu_n(R_\alpha^\epsilon), \qquad
\emptyset\neq\alpha\subset\dd. %= \frac{n}{k} \mathbb{\hat P}_n \left ( \frac{n}{k} \mathcal{C}_\alpha^\epsilon \right),
\end{align}
%where $\mathbb{\hat P}_n=(1/n)\sum_{i=1}^n\delta_{\hat{V}_i}$ is the empirical probability distribution of the rank-transformed data.
Evaluating $\hatmass(\alpha)$ boils down (see~\eqref{jmva:mu_n})
to counting points in $(n/k)\,R_{\alpha}^{\epsilon}$, as illustrated in Figure~\ref{jmva:estimation_rect}. The estimate $\hatmass(\alpha)$ is thus a (deliberately
$\epsilon$-biased) natural estimator of $\Phi(\Omega_\alpha) = \mu(\mathcal{C}_\alpha)$.
\begin{figure}[!ht]
\centering
\includegraphics[width = 0.5\textwidth]{fig_source/representation2D_nk_rect.png}
\caption{Estimation procedure}
\label{jmva:estimation_rect}
\end{figure}
The coefficients $(\hatmass(\alpha))_{\alpha\subset\{1,\ldots,d\}}$ related to the cones $\mathcal{C}_\alpha$ constitute a summary representation of the dependence structure.
This representation is sparse as soon as the $\mu_n(R_\alpha^\epsilon)$ are positive for only a few groups of features $\alpha$ (compared with the total number of groups, or sub-cones, namely $2^d-1$). It is low-dimensional as soon as each of these groups $\alpha$ has small cardinality, or equivalently, as soon as the corresponding sub-cones are low-dimensional compared with $d$.
In fact, $\hatmass(\alpha)$ is (up to a normalizing constant) an empirical version of the conditional probability that $T(\mb X)$ belongs to the rectangle $ r R_\alpha^\epsilon$, given that $\|T(\mb X)\|$ exceeds a large threshold $r$. Indeed, as explained in Remark~\ref{jmva:rk_approx_mu_n},
\begin{align}\label{jmva:eq:interprete_mun_Pcondit}
\mathcal{M}(\alpha) = \lim_{r \to \infty} \mu([\mb 0,\mathbf{1}]^c)~~\mathbb{P}(T(\mb X)\in r R_\alpha^\epsilon ~|~ \|T(\mb X)\|\ge r) .
\end{align}
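In practice, evaluating all the non-zero coefficients $\hatmass(\alpha)$ amounts to a single pass over the data: each rank-transformed point that is extreme (sup-norm larger than $n/k$) is assigned to the subset $\alpha$ of coordinates exceeding the tolerance level. The sketch below is only illustrative and assumes thickened rectangles of the form $R_\alpha^\epsilon=\{\mb v\ge\mb 0:~ \|\mb v\|_\infty \ge 1,~ v_j > \epsilon \text{ for } j\in\alpha,~ v_j \le \epsilon \text{ for } j\notin\alpha\}$; if a different thickening is used, only the assignment rule changes, and the function and variable names are ours.
\begin{verbatim}
import numpy as np
from collections import defaultdict

def estimate_M(V_hat, k, eps):
    """M_hat(alpha) = mu_n(R_alpha^eps): count extreme points falling in
    the thickened rectangle (n/k) * R_alpha^eps, normalized by k."""
    n, d = V_hat.shape
    radius = n / k                       # large radial threshold n/k
    M_hat = defaultdict(float)
    for v in V_hat:
        if v.max() >= radius:            # extreme point: ||v||_inf >= n/k
            # coordinates exceeding the tolerance eps*(n/k) define alpha
            alpha = tuple(np.flatnonzero(v > eps * radius))
            M_hat[alpha] += 1.0 / k
    return dict(M_hat)                   # only subsets with non-zero mass
\end{verbatim}
Only the subsets $\alpha$ actually hit by the data need to be stored, so the cost is linear in $n$ and $d$, irrespective of the $2^d-1$ candidate sub-cones.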
The remainder of this section is devoted to obtaining non-asymptotic upper bounds on the error $\vert\vert \widehat{\mathcal{M}}-\mathcal{M}\vert\vert_{\infty}$.
The main result is stated in Theorem~\ref{jmva:thm-princ}.
First, notice that the error decomposes as the sum of a stochastic term and a bias term inherent to the $\epsilon$-thickening approach:
\begin{align}
\vert\vert \widehat{\mathcal{M}}-\mathcal{M}\vert\vert_{\infty} &~=~\max_{\alpha} |
\mu_n(R_\alpha^\epsilon)-\mu(\mathcal{C}_\alpha)|\nonumber
\\&~\le~ ~\max_\alpha |\mu-\mu_n|(R_\alpha^\epsilon) ~+~ \max_\alpha|\mu(R_\alpha^\epsilon)-\mu(\mathcal{C}_\alpha)|~.\label{jmva:error_decomp}
\end{align}
Here and in what follows, to keep the notation uncluttered, we simply write `$\alpha$' for `$\alpha$, a non-empty subset of $\{1,\; \ldots,\;d\}$'. The main steps of the argument leading to Theorem~\ref{jmva:thm-princ} are as follows. First, a uniform upper bound on the error $|\mu_n - \mu|$ restricted to a well-chosen VC class of rectangles is obtained (Subsection~\ref{jmva:sec:rectangles}), from which a uniform bound on $|\mu_n - \mu|(R_\alpha^\epsilon)$ is deduced (Subsection~\ref{jmva:sec:boundErrorEpsilonCones}). Finally, the regularity assumptions (Assumptions~\ref{jmva:hypo:continuousMu} and~\ref{jmva:hypo:abs_continuousPhi}) are used to bound the difference $|\mu(R_\alpha^\epsilon) - \mu(\cone_\alpha)|$ (Subsection~\ref{jmva:sec:boundMuEpsilonCones}).
\subsection{Preliminaries: uniform approximation over a VC-class of rectangles}
\label{jmva:sec:rectangles}
This subsection builds on the theory developed in Chapter~\ref{colt}, where a non-asymptotic bound is stated on the estimation of the stable tail dependence function (defined in~\eqref{back:stdf1}). % We prove here (Proposition~\ref{jmva:prop:g}) a generalized version of the result obtained by these authors.
The \stdf~$l$ is related to the class of sets of the form $[\mb 0, \mb v]^c$ (or $[\mb u, \boldsymbol{\infty}]^c$ depending on which standardization is used), and an equivalent definition is
\begin{align}
\label{jmva:stdf}
l(\mathbf{x}):= \lim_{t \to \infty} t \tilde F (t^{-1}\mathbf{x}) = \mu([\mb 0, \mb x ^{-1}]^c)
\end{align}
\noindent
with $\tilde F (\mathbf{x}) = (1-F) \big( (1-F_1)^\leftarrow(x_1),\ldots, (1-F_d)^\leftarrow(x_d) \big)$.
Here the notation
$(1-F_j)^\leftarrow(x_j)$ denotes the quantity $\sup\{y\,:\; 1-F_j(y) \ge x_j\}$. Recall that the marginally uniform variable $\mb U$ is defined by $U^j = 1-F_j(X^j)$ ($1\le j\le d$). Then in terms of standardized variables $U^j$,
\begin{align}
\label{jmva:def:tildeF}
\tilde F(\mb x) = \P\Big(\bigcup_{j=1}^d\{U^j< x_j\}\Big) = \P(\mb
U\in [\mb x, \boldsymbol{\infty}[^c) = \P(\mb V \in [\mb 0, \mb x^{-1}]^c).
\end{align}
A natural estimator of $l$ is its empirical version defined as
follows, see \cite{Huangphd}, \cite{Qi97}, \cite{Drees98}, \cite{Einmahl2006}, \cite{COLT15}:
\begin{align}\label{jmva:empir-Stdf}
l_n(\mathbf{x}) &= \frac{1}{k}~\sum_{i=1}^{n} \mathds{1}_{\{ X_i^1 \ge
X^1_{(n-\lfloor kx_1 \rfloor+1)} \text{~~or~~} \ldots \text{~~or~~}