A Modiﬁed Energy Statistic for Unsupervised Anomaly Detection

For prognostics in industrial applications, the degree of anomaly of a test point from a baseline cluster is estimated using a statistical distance metric. Among different statistical distance metrics, energy distance is an interesting concept based on Newton’s Law of Gravitation, promising simpler computation than classical distance metrics. In this paper, we review the state of the art formulations of energy distance and point out several reasons why they are not directly applicable to the anomaly-detection problem. Thereby, we propose a new energy-based metric called the P -statistic which addresses these issues, is applicable to anomaly detection and retains the computational simplicity of the energy distance .


INTRODUCTION
Prognostics is a critical requirement in many industrial fields today owing to the potential of cost savings and operational efficiency entitlement through elimination of unscheduled failures and shut-downs. One of the many important modules used in a typical Prognostics and Health Management (PHM) application is a health indicator module for the system under consideration which not only estimates a health metric but also tracks the evolution of this metric with time. The health indicator is estimated from a collection of sensor readings at different timestamps. Generally, one of many statistical distances of a test-point from a baseline cluster may be used as this health indicator. The data-points can be multidimensional resulting from multiple sensor outputs that can be tapped from the system under monitoring. This statistical distance or some function of it may be used as the health metric or the health indicator.
Time trending of the health metric as a function of historical and future forecast usage patterns give us a visibility into the Remaining Useful Life (RUL) which is the ultimate target of PHM. However, even before hitting the ultimate target of RUL, doing anomaly detection as an intermediate step has value. Anomaly is flagged when the distance metric exceeds a threshold and an alert is generated, thereby preventing sudden unplanned failures. Although as is captured in (Goldstein & Uchida, 2016), anomaly may be both supervised and unsupervised in nature depending on the availability of labelled ground-truth data, in this paper anomaly detection is considered to be the unsupervised version which is more convenient and more practical in many industrial systems.
Choice of the statistical distance formulation has an important impact on the effectiveness of RUL estimation. In literature, statistical distances are described to be of two types as follows.
1. Divergence measures: These estimate the distance (or, similarity) between probability distributions. Some common divergence measures are Kullback-Leibler divergence, Jensen-Shannon divergence and Hellinger distance.
2. Distance measures: These measures estimate the distance between a single point and a distribution by comparing it with a sample drawn from the distribution. The most well-known measure in this category is Mahalanobis distance (Mahalanobis, 1936). A few other distance measures are Bhattacharya distance (Bhattacharyya, 1943) and the energy distance.
For the industrial anomaly detection problem, it is mostly the second category which is more significant because most of the times, we end up comparing a single point with a baseline cluster.
In this paper, we are interested in the class of distances called energy distance. These are interesting because they are based on the notion of Newton's law of gravitational energy and considers statistical observations as celestial objects having gravitational pull between each other. A distance metric called E-statistic may be written to represent the energy distance between distributions.The E-statistic can be used to test the statistical hypothesis of equality of two distributions. This concept was proposed and developed in (Székely, Rizzo, et al., 2004;Székely & Rizzo, 2013;Szekely & Rizzo, 2017) where it was shown the E-statistic is more general and powerful than many classical statistics.
In this paper, we are interested in exploring it further because of its ability to work with Euclidean distances between data-points rather than with the data-points themselves. This ability leads to reduced computational complexity compared to the Mahalanobis distance and makes the E-statistic potentially advantageous for memory and computation challenged systems.
However, despite the aforementioned advantages, we found some problems with the E-statistic when applied to anomaly detection. These problems arise from the fact that the Estatistic has been developed primarily for comparing equality between two distributions whereas for anomaly detection we need to compare a single test point with a distribution (baseline). In this paper, we analyse these problems and propose a new metric called the modified energy distance (P-statistic) based on the notion of Newton's law which addresses these problems and is more suitable to be used in an anomaly detection problem.
Here, we would like to mention that detailed comparison of the proposed metric against all other classical distance metrics is a task we would attempt in future work. In this work, we focus on establishing the improvements over the E-statistic.
This paper is laid out as follows. In Section 2, we briefly review the E-statistic and mention the reasons why it is computationally simpler than other techniques. In Section 3, we explain the issues which make the E-statistic less suitable for anomaly detection and in Section 4, we propose the Pstatistic and through the use of synthetic data, show that it addresses these issues. In Sections 5.1 and 5.2, we derive a method for estimating probability from the P-statistic using a chosen parametric distribution. In Section 5.4, we show how the performance of this new metric compares in terms of discrimination performance against that of the Mahalanobis Distance, a classical multi-dimensional distance metric. In Section 6, we demonstrate a simple application of the P-statistic on real-life data. In Section 7, we show how the training time of the proposed metric compares against that of Mahalanobis Distance for an incremental baseline update approach. Finally, in Section 8, we summarize the observations.
2. REVIEW OF ENERGY DISTANCE AND ITS COMPUTA-TIONAL SIMPLICITY Let X and Y be two independent real-valued random variables with probability density functions f X and f Y respec-tively. Let X S = {X 1 , X 2 , ..., X n1 } be a random sample of size n 1 drawn from the density function f X . Similarly, Y S = {Y 1 , Y 2 , ..., Y n2 } is a random sample of size n 2 drawn from the density function f Y . The energy distance E(X, Y ) between the distributions f X and f Y has been defined in (Székely et al., 2004;Székely, 1989Székely, , 2002. It is a population statistic for the pair of random variables X and Y . Now, a sample statistic E n1n2 may be used as an estimate for the population statistic E(X, Y ). It may be calculated from the pair of samples (X S , Y S ) as defined in (Székely et al., 2004;Székely, 1989Székely, , 2002 and has the following form. (1) For a d-dimensional space, each observation X i and Y j is a d ⇥ 1 array. Mathematically, the entire sample may be represented as a d ⇥ n 1 matrix for X S and d ⇥ n 2 matrix for Y S . In industrial systems, usually after a baseline data-set is accumulated, future test-points are compared against it and the baseline itself is not frequently replaced except when there is a need to re-establish the baseline. Thus, although the baseline cluster varies depending on which time instant it was acquired from sensor readings and hence has a random nature, it is considered a constant in the anomaly analysis once it is captured and saved. Hence, in subsequent analysis, it is considered invariant. Going forward, in this paper, we will be referring to the sample X S as the baseline cluster (or baseline, for simplicity).
For the anomaly detection problem, we want to compare a single point against a baseline cluster having many members. Hence, in (1), we assume that Y S has sample-size 1 and contains only one member Y 1 . We discard the notations n 1 and n 2 as n 2 = 1. We represent n 1 by n going forward as the subscript in n is no longer needed. For sufficiently large sample size for X S , n 1 ⇡ n 1 + 1. Also, we use y in place of Y 1 in future analysis since the subscript is not crucial for clarity in representing a single member set Y S .
From (1), we write a simplified form E n for E n1n2 as Reference (Székely et al., 2004) mentions that for E n (X S , y) to be a valid distance metric, there must exist a c ↵ for every Here, Z is a random variable. A random sample z drawn from the distribution of Z is a function of the baseline X S and a random sample x drawn from the distribution f X defined earlier. It takes a form As mentioned in Section 1 and in (Székely et al., 2004;Székely & Rizzo, 2013;Szekely & Rizzo, 2017), the E-statistic is simpler and more general than classical distance metrics. While writing this paper, on comparing (1) with the classical distance metrics described in (Statistical distance, n.d.), the advantages which specifically interested us are 1. The Euclidean distances in (1) can be computed in parallel and hence the computation time of E-statistic does not scale with data size if implemented in a parallel manner.
2. Covariance matrices are not required in (1). For highdimensional data, this can take up a significant amount of computation and memory 3. Matrix inversion is not needed. In many classical methods, inversion of covariance matrices is required. Again, for high dimensional data, this can involve significantly heavy computation.

SOME GAPS IDENTIFIED IN E -STATISTIC
In this section, we analyse the general requirements from any anomaly detection metric and point out some weaknesses in the E-statistic which prevent it from satisfying some of these conditions. Going forward, these weaknesses form the basis of the proposed innovation in this paper.
3.1. Requirements from an Acceptable Metric for Anomaly Detection In anomaly detection, we usually obtain a single test sample y and compare it against the baseline (say X S ) which is a cluster of samples drawn from the distribution f X . Although there is an element of randomness in the baseline cluster based on when that data set was captured in time, but once they are collected for any machinery, they are usually not changed during the comparison or anomaly detection phase. Thus, for all practical purposes, they are same as a set of multi-dimensional real-valued points.
From the way anomaly detection is usually implemented in industrial systems, we intuitively desire the following conditions from any acceptable anomaly measure (say E ⇤ n ) which may be used in a practical anomaly detection system.
The desired characteristics with respect to cluster size and distance from centroid along with their mathematical expressions are as follows.
1. Relationship with cluster size (a) For the test-point y situated at a given distance from the centroid of the baseline cluster and outside the baseline, it should appear less anomalous if it is closer to the outer boundary of the baseline cluster.
Given the saved baseline X S , we define a random variable R such that R has a probability density function f R whose model parameters may be estimated from the sample X S .
Here, X S is the centroid of the baseline cluster which is nothing but the sample mean.
For the test-point y, let r = ||y X S || 2 . Since, f R is a function of the baseline cluster X S and from (5) the domain of R is the set of Euclidean distances of all possible samples of X from X S , the probability density of any point with distance r from X S may be expressed as f R (r, X S ).
This problem may now be cast in the form of a hypothesis test as follows.
Null hypothesis(H 0 ): Random samples drawn from f R will be more extreme than r. y is flagged as an anomaly with respect to X S if the null hypothesis is rejected.

Rejection criterion:
The null hypothesis is rejected if which is a pre-determined threshold.
Let there be a second cluster X 0 S ( ) which is formed by scaling the baseline with respect to its centroid such that its kth element X 0 S ( ) k may be written as (7) > 0. If > 1, the number of points, relative orientation of points and cluster shape are maintained unchanged between X S and X 0 S ( ) with the second cluster occupying a larger spatial volume. Hence, the test-point will appear closer to the outer boundary of X 0 S ( ) than to that of X S . In this scenario, if E ⇤ n is a faithful indicator of anomaly, it should appear to be less in the former case. This behaviour may be expressed mathematically as Now, we saw that a larger radius of baseline is supposed to make y appear less anomalous. Also, larger the anomaly, less should be the value of the integral in (6) since larger anomaly would mean less net probability of having points more extreme. Hence, (b) For an infinitesimally small baseline cluster, any testpoint would appear to have a very large anomaly, irrespective of the value of the Euclidean distance from the centroid. This is because the distance always appears large relative to the cluster size.
With the anomaly metric increasing asymptotically, we will have an asymptotic reduction of the integral value in (6).

Relationship with distance from cluster centroid
(a) If the test-point is further away from the centroid of the baseline cluster, it should appear more anomalous and hence, the anomaly metric should increase. Thus, if there are two test-points y 1 and y 2 , If the test-point is at the center, the anomaly metric should be close to zero. Hence, It should be noted that (7) and (11) indicate that the ideal anomaly metric should follow a monotonic behaviour with respect to the baseline size and also distance of the test-point from the baseline cluster. These ideal conditions hold good for data of any dimension.

Synthetic Dataset
In order to examine how E n performs with respect to the desired characteristics stated in Section 3.1, we consider a ddimensional random variable X which is distributed as a uniform ball. We sample from this distribution to create a synthetic data-set following the method described in (Harman & Lacko, 2010).
X can be written as Z 1 is sampled from a multi-variate uncorrelated standard normal distribution, Z 2 ⇠ U (0, 1) and r b is the desired radius of the ball.
For the purpose of this study, we restrict the dimension of X in (14) to 2 to keep the analysis and visualization simple. If the two dimensions of X are written as X (1) and X (2) , they may be expressed in a simplified form as a function of two random variables R and ✓ in the following way.
r b is the chosen radius of the example cluster. It may be shown from (15) that 8 (✓, R), the probability density function f ✓,R = 1/(⇡r 2 b ). Thus, the distribution is uniform in nature. We also consider a fixed test-observation with location ✓ = 0 and R = 1.

Shortcomings of E-statistic
We sweep r b in (14) from 0 to 5 thereby progressively getting a different-sized baseline and thereby changing in (7). This implies different probabilities of the fixed test-point belonging to this cluster. We then evaluate the simplified E n in (2). Figures 1 and 2 show the baselines and test-points for two different values of r b along with computed values of E n for both the cases. Figure 3 shows a continuous plot of how E n varies with r b .
From these plots, the E-statistic can be shown to violate the desired characteristics mentioned in Section 3.1 in the following manner.
1. Limiting condition behaviour: The E-statistic violates the limiting conditions in the following manners.
(a) It has been shown in Appendix A (59) that the metric E n proposed in (2) has the following property. This violates the condition required for the ideal metric E ⇤ n stated in (10). Thus, E n in its present form cannot function as the ideal metric.
From Figure 3, it is clear that when the baseline cluster becomes progressively smaller and close to 0, the variation of E n is not asymptotic to infinity but hits a finite value equal to the distance of the test-point from the baseline centroid.
(b) Appendix A (62) also proves the following.
From Figure 3, it is clear that when baseline cluster increases in radius, the value of E n does not always reduce tending towards 0. It goes through an in- Figure 3. Variation of E n with baseline cluster radius (r b ), given a fixed test-point.
flection point beyond which it increases instead of reducing and reaches significantly high values instead of very small ones. For the particular case in Figure 2, the E n value of 2.363 is not appropriate. It should be very close to zero. Also, the E n should be higher in Figure 3 compared to Figure 2. However, the opposite is actually observed.
(c) As stated above, in Section 3.1, E ⇤ n = 0 if and only if y = X c for a viable distance-metric for anomaly detection. But this condition is violated by the E-statistic in its present form because in (1), E n1n2 = 0 if and only if the distributions X and Y are identical i.e. X = Y . However the single test-point y considered here forms a single member distribution which cannot be identical to X under any circumstance. Figure 3, it is seen that the value of E n goes through an inflection point as r varies instead of varying monotonically as required in Section 3.1

Monotonic behaviour: From
The issues with E-statistic raised in this section have been further addressed and a modified algorithm suggested in Section 4. Since the target in this paper is anomaly detection, we have also extended the analysis to reach a closed form analytical expression for the probability density function (PDF) based on which the p-value of the test-point may be estimated.

MODIFIED ENERGY DISTANCE
The reason why energy distance is computationally simple is that it works with distributions of Euclidean distances whereas other distance metrics first fit a multivariate distribution and then work our probabilities from that. The E-statistic is one of the first energy-based metric proposed for anomaly detection. However, we saw in Section 3.3 that it has a few weaknesses.
In this section, we borrow the concept of working with Euclidean distances from a study of the E-statistic and propose an alternative and slightly modified metric based on them.
The potential energy of a configuration consisting of two particles having masses m 1 and m 2 respectively and having positions X 1 and X 2 respectively is the work that needs to be done to move one particle from infinity to its current position against the gravitational force exerted by the other particle already in place. It can be written as We define the incremental potential energy of y to be the work that needs to be done to move the test-point y from infinity to its present position against the collective force exerted by the baseline sample cluster X S . The incremental potential energy may be written as where C is a positive constant if we assume that each datapoint in the baseline sample cluster X S as well as the testpoint y represent particles of equal mass.
The net potential energy for assembling the baseline cluster X S may be written as Since from (18), potential energy is inversely proportional to distance, we may assume a modified energy distance E n between X S and y such that (19). (21) Also, the effective modified energy distance of the entire cluster X S with respect to itself may be written as Considering the fact that the degree of anomaly should factor in the total energy content of the baseline cluster itself, we define a potential energy statistic, the P-statistic based on the modified energy distance formulated in (21) as One concern with (23) might be numerical stability as y overlaps with any particular X k and the denominator in (23) goes to 1. In order to overcome this, we propose introducing a saturation term ⌧ as If the kth dimension of ⌧ be ⌧ k , ⌧ 1 = ✏ and ⌧ j = 0 8 j > 1.
Here, ✏ is an infinitesimal positive number on the real line. The closer ✏ is to 0, the better is the match between (23) and (24). A guideline may be to use The reader may note that this is a guideline and one may use smaller values of ✏ if they want. However, larger values of ✏ may not be advisable. We would like the reader to know that a rigorous evaluation of the impact of the choice of this value on the results needs to be conducted as future work but at present, we do not see this as a big risk to this method.
We now examine the limiting conditions or ! 0 and ! 1 which have been discussed with respect to E n in (16) and (17) respectively. Using the definition of as given in (7), Using the same data-set and sweep parameters as in Figure  3, Figure 4 plots P n vs. radius of baseline cluster for a fixed test-point. In it, we notice the asymptotic behaviour of P n near infinitesimal baseline radius i.e., = 0 as implied by (26). Figure 4. Semi-log plot for variation of P n with baseline cluster size, given a fixed test-point.
Since y is an n-dimensional point, P n is a function of the distance r and several other variables related to the other degrees of freedom. Thus, functionally, P n may be written as P n (X S , y) = P n (X S , r, D) where D represents the set of remaining degrees of freedom other than r. While D represents the radial directional vector in the high dimensional space with respect to X S , r indicates position along that vector.
We now want to study the impact of r on P n for any given D.
Assuming X S to be a fixed baseline against which test-points are compared, we can consider D and X S to be constants. Hence, we have a simplified functional relationship as follows. P n (r) = P n (X S , r, D).
Thus, if we consider the space as a high dimensional ball, we are studying the variation of P n for points at different radial distances along a given radial vector direction.
In Appendix B, we prove the following regarding the geometric properties for the function P n () .
Now, we examine the impact of increasing the spatial volume of the baseline cluster. Following the proposition in (7), we examine the impact of increasing .
From (23), using the definition of in (7), if all the dimensions of y are finite.
From (33), for large spatial volume of the baseline i.e., for large values of , P n (r) will be finite unlike in (17) and in Figure 3 where the E-statistic is seen to increase with increasing after a certain value and tends to infinity. Also, in Figure  4, we see that as the baseline cluster becomes progressively larger thereby indicating a reduction in anomaly level, we see a reduction of P n to a value of 0 instead of the inflection seen in Figure 3.
The variation of P n with distance of test-point from baseline centroid, for a few values of the baseline radius is shown in Figure 5. It serves as a demonstration of the fact that as Figure 5. Semi-log plot for variation of P n with distance from centroid, for different baseline sizes.
per (29), after a small initial value of r, P n increases monotonically with increase of distance of the test-point from the baseline centroid. This is consistent with the fact that farther the test point, the more anomalous it is.
We also observe from Figure 5 that the rate of increase of P n is less when the baseline cluster size is larger. This is because for any fixed test-point, it will appear less anomalous when the baseline cluster has a larger radius.
All of these observations are consistent with P n being a valid anomaly statistic. All the discrepancies observed in Figure  3 are addressed and P n satisfies the requirements stated in Section 3.1. The observations are further validated in Figures  6 and 7 where the P-statistic is used to evaluate the cases earlier analysed in Figures 3 and 2. They show a much lower P-statistic value for the case in Figure 7 where the test-point is very close to the centroid X.  The observations in this section establish that the variation of P-statistic with baseline size and test-point distance is in accordance with what is expected intuitively from an anomaly metric. However, to fully establish its usefulness as a valid anomaly metric, we should be able to compute a probability density function (PDF) from this metric.

COMPUTATION OF PROBABILITY
5.1. Existence of a Valid Probability Measure using P n Let r represent the distance of the test-point from the centroid of the baseline cluster. Also, in the space of real numbers, let R represent a single-dimensional random variable having a density function f R from which the r-values are assumed to be sampled. R has a range defined by Since R is a random variable, Z = P n (R) is also a random variable as the function of a random variable is itself a random variable. Let f Z be the density function for Z and z 2 Range(Z). In Appendix B, it has been shown that P n () is a continuous function. Also, from (29), (30) and (32), Thus, the domain and the co-domain of P n () are both [0, 1).
As mentioned in (Székely et al., 2004) and (3), for P n (r) to be a valid metric for anomaly detection, (Z) and (36) ↵ 2 (0, 1), ↵ being unique for every choice of c ↵ . (36) and (37) are characteristics necessary for the existence of the cumulative density function (CDF) corresponding to the probability density function f Z .
From (29) and (31), 9 r ⇤⇤ r ⇤ such that These relations are true because r and r ⇤⇤ are all positive numbers. Also, since probability is a non-negative quantity, it may be shown that We now, analyse the behaviour of P n (r) in two zones, i.e., c ↵  c max and c ↵ > c max . These are analysed as follows.
For any c ↵ 2 [0, c max ], From (29), P n (r) need not be strictly increasing or strictly decreasing in nature if r  r ⇤⇤ . Hence the set X ↵ may have more than one member as defined in (41). Since c ↵ < c max , (39) tells us that all members of the set X ↵ in (41) lie within the interval [0, r ⇤⇤ ].
Let (X, ⌃) and (Y, T ) be measurable spaces such that X is a set comprising of all samples of R between 0 and r ⇤⇤ .
X and Y are equipped with -algebras ⌃ and T respectively. Also, let P n be a function from X to Y . Since P n (r) is strictly increasing for r > r ⇤⇤ , max(X ↵ ) defined in (41) cannot be greater than r ⇤⇤ with c ↵  c max . Also, by definition r 0.
Hence, if we consider any open set of C consisting of values from [0, c max ], from (41), its inverse values P 1 n (C) will map to the region [0, r ⇤⇤ ]. None of it would lie outside of this. Mathematically, this may be written as Thus, P n (R) is a valid choice for a random variable. We can show that 8 c ↵ , 9 a unique values of ↵ 1 Also, from (29), P 1 n () is single-valued in for r > r ⇤⇤ and hence, From (43) and (44), for r 2 [0, r ⇤⇤ ], In (45), ↵ is uniquely defined for any chosen value of c ↵ , since ↵ 1 is unique as shown in (43). 2. Case 2: c ↵ > c max .
From (45) and (47), it is seen that (36) is validated over the entire range of the random variable Z = P n (R). Thus, P n () is a valid metric for anomaly detection.

Calculating p-value
In order to calculate the p-value, we need to first choose a suitable form for the PDF f (x) that best represents the data being analysed. Different parametric distributions may be chosen for the PDF f (x). Since P n (x) > 0 8 x from (32), we assume that it is sampled from a standard half-normal distribution with mean µ = 0 and standard deviation . We can write a standard half-normal variant of P n as p n = P n / and p n = |Z| where Z ⇠ N (0, 1).
Here, can be estimated as Here, it must be noted that since X S is defined as a set of random samples, X S {X i } represents the set subtraction operation resulting in the complementary set obtained by eliminating X i from X S .
The p-value for the test-point y is An efficient computation of p(y) would need a compact approximation for the error function erf(x). For this purpose, we use the Bürmann series expansion which converges quickly for real values of x (Schöpf & Supancic, 2014). As explained in (Dixon, 1901;Schöpf & Supancic, 2014), the error function may be approximated as From (50) and (51), p(y) may be approximated as where We henceforth use the compact expression of p(y) in (52) in all calculations going forward. Figure 8 shows how the p(y) varies with distance from centroid and also the baseline sizes using the same test-cases as in Figure 5. We see that for any given test-point, the baseline cluster having a bigger radius will imply higher p-value and thus less chance of rejecting the null hypothesis. This is because the larger cluster would make the test-point appear less anomalous. Figure 8. Semi-log plot for variation of p(y) calculated in (50) with distance from the baseline centroid, for different baseline sizes.

Algorithm Steps and Computation Requirement
We can now summarize the steps involved in the computation of p d (y) in Algorithm 1.

Comparison of P-statistic with Mahalanobis Distance
Since in this paper we are trying to formulate and establish a new statistical metric for anomaly detection, it would be interesting to see how it performs with respect to the Mahalanobis Distance (MD) which is one of the established classical techniques in this domain. The questions we are trying to answer at this stage are the following.
1. Is there a one-to-one correspondence between MD and p n ? Or, does every value of p n corresponds to one and only one value of MD ?
2. Is it possible to estimate MD from p n for different dimensions of the data-points and different physical sizes of the baseline clusters ? By answering this question, we are trying to figure out if tribal knowledge about what constitutes anomaly in terms of MD for a particular application can be translated to the domain of the P-statistic.
Let us represent the Mahalanobis Distance (MD) (Mahalanobis, 1936) of test-point y with respect to baseline cluster sampled from the random variable X as M (y, X). Thus, the d dimensions of y and X are represented along the rows and S is the covariance matrix of the cluster sampled from X. We now examine the relationship between p n and M (y, X) in an empirical fashion by varying different parameters of the synthetic data-set designed in (14).
Similar to Figure 8, we sweep the distance of test-point from baseline centroid and calculate M (y, X) and p n for each position. Finally, the MD values corresponding to each value of p n are plotted for comparison in Figure 9 for two different values of baseline radius. Also, we assume different values of dimension d in (14) and for each value of d, we define a baseline cluster X and test-point y as in (14). For each d and a given baseline radius, Figure 10 shows the relationship between p n and M (y, X). old in M (y, X) can be translated to a corresponding threshold in p n by utilizing the linearity of the curves in Figure  10. This enhances the confidence that an existing thresholdbased alerting strategy that depends on values of Mahalanobis Distance may be translated to an equivalent P-statistic-based one. Hence, the p n -value in (48) and the corresponding pvalue calculated in (50) can be used effectively for anomaly detection.

PERFORMANCE ON REAL-LIFE DATA
For verifying the performance of the P-statistic, we take the Breast Cancer Data-set available publicly in the UCI Machine Learning Repository (Dua & Graff, 2017). We use the version embedded inside Python's scikit-learn package and load it using the native python command. For details on how to load and use the data-set, please check scikit-learn documentation (Scikit-learn breast cancer data-set, n.d.).

Data Description
This data-set consists of features computed from a set of fine needle aspirate (FNA) images of breast masses from a set of patients along with the ground-truth of the diagnosis i.e. malignant or benign. The characteristics of cell nuclei in the image are described in this data-set. There are a total of 30 features per image with more detailed descriptions available in (Dua & Graff, 2017). The feature matrix is shaped 357⇥30 and in the label vector, 0 represents malignant and 1 represents benign.
In order to visualize this high dimensional data, we use tdistributed Stochastic Neighbourhood (tSNE) (Maaten & Hinton, 2008) to calculate a 2D embedding of the data so that it can be plotted as a 2D scatter plot and examined manually for proximity between data-points and clusters. Before applying tSNE, each of the feature columns are normalized to extend from 0 to 1. Figure 11 represents the two clusters of data available in this data-set. The subsequent analysis has now been performed on this 2-D embedded version of this data-set so that results can be visually related to the cluster neighbourhood behaviour. Figure 11. tSNE embedding of the malignant and benign clusters in the breast cancer data-set from UCI ML repository

Analysis with P-statistic
We now consider 40 % of the benign cluster as the training baseline and the remaining points in the benign and malignant groups as test set. We now use the steps summarized in Algorithm 1 to estimate the p-values for all the data-points in both benign and malignant clusters in the test set with respect to the training baseline. Figures 12 and 13 show the P-statistic and Mahalanobis Distance respectively, for all these points. From these, we see that the capability for discrimination between the baseline and anomalous data are similar for both the approaches.  In order to compare the discrimination capability between these two metrics analytically, we plot the true positive rate vs. false positive rate with varying detection thresholds (ROC curve) in Figure 14. When calculating this ROC curve the benign and malignant labels are considered negative and positive respectively. We can see that the curves are nearly over-lapping indicating that the P-statistic is as effective as the Mahalanobis Distance in classification tasks as visible from this data. Figure 14. ROC curve for P-statistic and Mahalanobis Distance 7. SOME OBSERVATIONS ON COMPUTATION TIME

Test Setup and Limitations in Field
Since one of the premises on which the choice of energybased distance was made was that of computation simplicity, we analyse the computation requirement for training and inference separately in this section for both P-statistic and MD. We notice from (24) and (54) that the bulk of the computation happens during training time when model parameters are being computed from the baseline cluster X S . However, the inference phase has less computation to perform and is likely to have similar computation time for both P-statistic and MD. We henceforth focus our attention on the training time.
Since many field deployed Data Acquisition (DAQ) systems like edge devices and protection relays are likely to be deployed without a live internet connection, having the training done in a cloud-based infrastructure and the trained model transferred to these devices may not be feasible. Hence, any algorithm that would reduce training computation requirement would be advantageous as these devices are usually challenged in terms of computation power.
It must be noted that the analysis in this section is conducted on a CPU-based windows machine without parallel processing. It has 16 GB RAM and an i5 processor from Intel. No GPU is available. This is a valid test scenario because fielddeployed industrial DAQ systems are not likely to have GPUs making massive parallel processing difficult to implement.

Computation Time Analysis
A practical industrial system cannot wait till the entire baseline of a desired size is accumulated before producing results as it might take a long time for an industrial machine to capture baseline samples from all possible operating conditions. After each new data-point is received, the baseline cluster parameters are expected to be computed incrementally based on the calculations of the previous stage and comparison run on the new baseline thus obtained.
The block of computation that needs to happen during baseline computation phase is all that which can be done on the baseline X S alone and without the test-point y being available.
From (24), the training computation for P-statistic calculation is limited to calculating the numerator N (say) as For the jth update step, N j may be written incrementally as For any such step, the computation time is written as T P .
Similarly, for computing the MD value in (54), the training computation block may be defined by the following components.
2. S 1 : Inverse of the covariance matrix for the input dataset.
From the above steps, only the centroid can be computed incrementally. Since it is not very straightforward to write the covariance matrix in an incremental fashion, we have to recompute the covariance matrix after each step. The inversion needs to be repeated any way. Let the computation time be T MD for each of these update steps.
We consider a baseline distribution as one defined in Figure  6. Only now, the dimension d is considered more than 2 and a variable. We assume that the data-points arrive serially and compare computation times T P and T MD for a single update step with a new data-point given an already existing baseline size. Figure 15 shows how these two compare against each other for varying baseline sizes and values of the dimension d. Figure 15. Time taken (ms) for a single incremental update step when using P-statistic and MD, plotted for different already existing baseline sizes For any baseline size along the x axis in Figure 15, the variability in CPU loading is accounted for by computing the times as an average of those calculated for 20 random choices of the baseline cluster from the data-set defined above. The CPU time has been calculated by using time ns() function from the time package in Python before and after each code block during execution and taking the difference to be the computation time for that segment.
For a given existing baseline size, the ratio of the two timings is now plotted against the dimension in Figure 16. Figure 16. Comparison of T P and T MD for different data dimensions for a given baseline size.
From Figures 15 and 16, we have the following observations 1. Time taken for incremental baseline computation for MD is more than that of the P-statistic after each update step.
2. MD calculation time scales up much more with data dimension than the P-statistic computation. Thus, the advantage offered by P-statistic is more pronounced for data of high dimensions.
This demonstrates the computation advantage which is claimed by energy-based distances over Mahalanobis Distance.

CONCLUSION
In this paper, we have reviewed the energy distance, a distance metric proposed in literature and shown to have simpler computation than most classical distance metric. We have shown that this metric possesses several shortcomings which prevent it from being applied in a practical anomaly detection application. We have proposed a modified metric called Pstatistic which works on distributions of Euclidean distances just like the energy distance. Using a synthetic data-set, we show that the above shortcomings are overcome by this modified metric. We have taken an example real-life data, the UCI breast-cancer data and shown that the P-statistic and the Mahalanobis Distance have similar performance as a discrimination metric, as shown by an almost overlapping ROC curve.
We have also demonstrated that computation times for Mahalanobis Distance calculation are higher than those for Pstatistic when computed in an incremental fashion with each new data arrival. Specially, we have seen that P-statistic of-fers pronounced advantage with data having high dimension. Thus, we conclude that it is feasible to use this metric for anomaly detection without losing discrimination performance, at the same time utilizing the simpler computation that energy distance based methods offer.
From (68), (69) and (70), since ↵ is a constant. Here N is assumed to be the number of elements in the cluster X S .
A few other points about the nature of P n (r) that need to be mentioned here are 8 r < 1, P n (r) < 1 since ↵ < 1 and s k < 1.