**Baycomp** is a library for Bayesian comparison of classifiers.

Functions compare two classifiers on one or on multiple data sets. They compute three probabilities: the probability that the first classifier has higher scores than the second, the probability that the differences are within the region of practical equivalence (rope), and the probability that the second classifier has higher scores. We will refer to these probabilities as `p_left`, `p_rope` and `p_right`. If the argument `rope` is omitted (or set to zero), functions return only `p_left` and `p_right`.

The region of practical equivalence (rope) is specified by the caller and should correspond to what is “equivalent” in practice; for instance, classification accuracies that differ by less than 0.5 may be called equivalent.
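To make the role of the rope concrete, here is a minimal sketch of how the three probabilities could be estimated from sampled differences. It is not baycomp's implementation: the helper `three_way_probs`, the sample values and the convention that a difference is the second score minus the first are assumptions made for illustration.

```python
def three_way_probs(differences, rope):
    """Estimate (p_left, p_rope, p_right) from sampled differences.

    Assumption for this sketch: each difference is
    (score of second classifier) - (score of first classifier),
    so negative values favour the first (left) classifier.
    """
    n = len(differences)
    if n == 0:
        raise ValueError("need at least one sampled difference")
    left = sum(d < -rope for d in differences)
    right = sum(d > rope for d in differences)
    within = n - left - right  # differences inside [-rope, rope]
    return left / n, within / n, right / n


# Hypothetical sampled differences, in percentage points of accuracy
samples = [-1.2, -0.3, 0.1, 0.4, 0.9, 2.0, -0.7, 0.0]
p_left, p_rope, p_right = three_way_probs(samples, rope=0.5)
```

Widening the rope moves mass from the two outer probabilities into `p_rope`, which is why the rope has to be chosen from domain knowledge rather than tuned to the data.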

Similarly, whether higher scores are better or worse depends upon the type of the score.

The library can also plot the posterior distributions.

The library can be used in three ways.

- Two shortcut functions can be used for comparison on single and on multiple data sets. If `nbc` and `j48` contain lists of average classification accuracies of the naive Bayesian classifier and J48 on a collection of data sets, we can call

  `>>> two_on_multiple(nbc, j48, rope=1)`
  `(0.23124, 0.00666, 0.7621)`

  (Actual results may differ due to Monte Carlo sampling.)

  With some additional arguments, the function can also plot the posterior distribution from which these probabilities came.

- Tests are packed into test classes. The above call is equivalent to

  `>>> SignedRankTest.probs(nbc, j48, rope=1)`
  `(0.23124, 0.00666, 0.7621)`

  and to get a plot, we call

  `>>> SignedRankTest.plot(nbc, j48, rope=1, names=("nbc", "j48"))`

  To switch to another test, use another class:

  `>>> SignTest.probs(nbc, j48, rope=1)`
  `(0.26508, 0.13274, 0.60218)`

- Finally, we can construct and query sampled posterior distributions.

  `>>> posterior = SignedRankTest(nbc, j48, rope=0.5)`
  `>>> posterior.probs()`
  `(0.23124, 0.00666, 0.7621)`
  `>>> posterior.plot(names=("nbc", "j48"))`

Install from PyPI:

```
pip install baycomp
```

User documentation is available on https://baycomp.readthedocs.io/.

A detailed description of the implemented methods is available in Time for a Change: a Tutorial for Comparing Multiple Classifiers Through Bayesian Analysis, Alessio Benavoli, Giorgio Corani, Janez Demšar, Marco Zaffalon. Journal of Machine Learning Research, 18 (2017) 1-36.

There is another and more elegant way to derive this inequality:

$$Cov(X,Y)^2\leq Var(X)Var(Y)$$

To do that, we introduce again our favorite subject, Alice. Let us summarize the problem again. Assume that there are two real variables $X,Y$ and that Alice only knows their first moments $\mu_x=E(X),\mu_y=E(Y)$ and second moments $E(X^2),E(Y^2)$ (in other words, she only knows their means and variances, since $Var(Z)=E(Z^2)-E(Z)^2=\sigma_z^2$).

Alice wants to compute $Cov(X,Y)$.

Since Alice does not know the joint probability distribution $P(X,Y)$ of $X,Y$ (she only knows the first two moments), she cannot compute $Cov(X,Y)$. However, she can compute bounds for $Cov(X,Y)$, or in other words, she can aim to solve the following problem

$$
\begin{array}{l}
~\max_{P} E[(X-\mu_x)(Y-\mu_y) ]=\int (X-\mu_x)(Y-\mu_y) dP(X,Y)\\
E[X]=\int X dP(X,Y)=\mu_x\\
E[Y]=\int Y dP(X,Y)=\mu_y\\
E[X^2]=\int X^2 dP(X,Y)=\mu_x^2+\sigma_x^2\\
E[Y^2]=\int Y^2 dP(X,Y)=\mu_y^2+\sigma_y^2\\
\end{array}
$$

This means she aims to find the maximum value of the expectation of $(X-\mu_x)(Y-\mu_y) $ among all the probability distributions that are compatible with her beliefs on $X,Y$ (the knowledge of the means and variances). She can similarly compute the minimum. This is the essence of Imprecise Probability.

To compute that, we first observe that

$$
\begin{bmatrix}
X-\mu_x\\
Y-\mu_y
\end{bmatrix}\begin{bmatrix}
X-\mu_x & Y-\mu_y\\
\end{bmatrix}=
\begin{bmatrix}
(X-\mu_x)^2 & (X-\mu_x)(Y-\mu_y)\\
(Y-\mu_y)(X-\mu_x) & (Y-\mu_y)^2\\
\end{bmatrix}\geq 0
$$

where $\geq$ means that the matrix is positive semi-definite. We know that the expectation operator preserves the sign and, therefore,

$$
E\left(
\begin{bmatrix}
(X-\mu_x)^2 & (X-\mu_x)(Y-\mu_y)\\
(Y-\mu_y)(X-\mu_x) & (Y-\mu_y)^2\\
\end{bmatrix}\right)\geq 0
$$

but, because of linearity of expectation, we have that

$$
\begin{aligned}
&E\left(\begin{bmatrix}
(X-\mu_x)^2 & (X-\mu_x)(Y-\mu_y)\\
(Y-\mu_y)(X-\mu_x) & (Y-\mu_y)^2\\
\end{bmatrix}\right)\\ &~\\
&=
\begin{bmatrix}
E[(X-\mu_x)^2] & E[(X-\mu_x)(Y-\mu_y)]\\
E[(Y-\mu_y)(X-\mu_x)] & E[(Y-\mu_y)^2]\\
\end{bmatrix}\geq 0.
\end{aligned}
$$

A symmetric $2\times 2$ matrix with non-negative diagonal entries is positive semi-definite if and only if its determinant is non-negative, that is,

$$
E[(X-\mu_x)(Y-\mu_y)]^2\leq E[(X-\mu_x)^2] E[(Y-\mu_y)^2]
$$

which is exactly

$$Cov(X,Y)^2\leq Var(X)Var(Y).$$
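The argument can be checked numerically. The sketch below builds the empirical counterpart of the matrix of second central moments for some hypothetical simulated data (the distribution and its parameters are assumptions chosen only for illustration) and verifies that its eigenvalues and determinant are non-negative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: Y depends on X plus independent noise.
x = rng.normal(loc=1.0, scale=2.0, size=10_000)
y = 0.5 * x + rng.normal(loc=-3.0, scale=1.0, size=10_000)

# Empirical version of E[(Z - mu)(Z - mu)^T] for Z = (X, Y).
centered = np.stack([x - x.mean(), y - y.mean()])
second_moment = centered @ centered.T / x.size

var_x, var_y = second_moment[0, 0], second_moment[1, 1]
cov_xy = second_moment[0, 1]

# Positive semi-definiteness: all eigenvalues are non-negative ...
eigenvalues = np.linalg.eigvalsh(second_moment)
# ... which for a 2x2 matrix amounts to a non-negative determinant.
determinant = var_x * var_y - cov_xy ** 2
```

Any other data set would do: the inequality holds for every empirical distribution, because the matrix is an average of outer products.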

Quantum mechanics: The Bayesian theory generalized to the space of Hermitian matrices. In: Physical Review A, 94, p. 042106, 2016.

I had a question from the audience about whether/how we can derive “Heisenberg inequality” as a consequence of our subjective (gambling) formulation of QM. This is not complicated since Heisenberg inequality is just the QM version of Covariance Inequality which states that for any two random variables $X$ and $Y$

$$Cov(X,Y)^2\leq Var(X)Var(Y)$$

Therefore, before deriving Heisenberg uncertainty, I will show how to derive the above inequality from a Bayesian (Imprecise probability) perspective.

To explain the inequality from a subjective point of view, we introduce our favorite subject, Alice. Let us assume that there are two real variables $X,Y$ and that Alice only knows their first moments $\mu_x=E(X),\mu_y=E(Y)$ and second moments $E(X^2),E(Y^2)$ (in other words, she only knows their means and variances, since $Var(Z)=E(Z^2)-E(Z)^2=\sigma_z^2$).

Assume Alice wants to compute $Cov(X,Y)$.

Since Alice does not know the joint probability distribution $P(X,Y)$ of $X,Y$ (she only knows the first two moments), she cannot compute $Cov(X,Y)$. However, she can compute bounds for $Cov(X,Y)$, or in other words, she can aim to solve the following problem

$$
\begin{array}{l}
~\max_{P} \int (X-\mu_x)(Y-\mu_y) dP(X,Y)\\
\int X dP(X,Y)=\mu_x\\
\int Y dP(X,Y)=\mu_y\\
\int X^2 dP(X,Y)=\mu_x^2+\sigma_x^2\\
\int Y^2 dP(X,Y)=\mu_y^2+\sigma_y^2\\
\end{array}
$$

This means she aims to find the maximum value of the expectation of $(X-\mu_x)(Y-\mu_y) $ among all the probability distributions that are compatible with her beliefs on $X,Y$ (the knowledge of the means and variances). She can similarly compute the minimum. This is the essence of Imprecise Probability.

To compute these bounds she first rewrites the above problem as

\begin{equation}
\label{eq:1}
\begin{array}{l}
\text{opt}_{P} C=\int (X-\mu_x)^2 +a^2(Y-\mu_y)^2 -2a (X-\mu_x)(Y-\mu_y) dP(X,Y)\\
\int X dP(X,Y)=\mu_x\\
\int Y dP(X,Y)=\mu_y\\
\int X^2 dP(X,Y)=\mu_x^2+\sigma_x^2\\
\int Y^2 dP(X,Y)=\mu_y^2+\sigma_y^2\\
\end{array}
\end{equation}

where $a$ is some scalar. Note that since $\int (X-\mu_x)^2dP(X,Y)$ and $\int (Y-\mu_y)^2dP(X,Y)$ are known (they are respectively equal to $\sigma_x^2$ and $\sigma_y^2$), adding these terms “does not change” the optimization problem (the $P$ that achieves the maximum is the same — they are just additive constants). If we assume that $a$ is positive, then the same is true for the $a$ that multiplies $(X-\mu_x)(Y-\mu_y)$ provided that $opt=\min$ (otherwise, this is true provided that $opt=\max$).

Now observe that

$$

C=(X-\mu_x)^2 +a^2(Y-\mu_y)^2 -2a (X-\mu_x)(Y-\mu_y) =[X-\mu_x-a(Y-\mu_y)]^2

$$

and, therefore, we can conclude that $\int (X-\mu_x)^2 +a^2(Y-\mu_y)^2 -2a (X-\mu_x)(Y-\mu_y) dP(X,Y)$ is always non-negative for every $a,~P$.

She has to solve the above constrained optimization problem. For the moment let us forget the constraints.

Let us assume that we can take $a$ as a function of $P$; then the unconstrained optimum can be obtained by computing the derivative of the objective function w.r.t. $a$ and solving

$$

\frac{d}{da}C=\int 2a(Y-\mu_y)^2 -2 (X-\mu_x)(Y-\mu_y) dP(X,Y)=0

$$

whose solution is $a=\frac{E[(X-\mu_x)(Y-\mu_y)]}{E[(Y-\mu_y)^2]}=\frac{Cov(X,Y)}{\sigma_y^2}$. Since the second derivative of $C$ is non-negative, this is a minimum.

If we choose $a$ in this way then we have that

\begin{equation}
\begin{array}{rcl}
0&\leq& \int (X-\mu_x)^2 + \Big(\frac{Cov(X,Y)}{\sigma_y^2}\Big)^2(Y-\mu_y)^2 -2\Big(\frac{Cov(X,Y)}{\sigma_y^2}\Big) (X-\mu_x)(Y-\mu_y) dP(X,Y)\\
&=&\sigma_x^2+\frac{Cov(X,Y)^2}{\sigma_y^2}-2\frac{Cov(X,Y)^2}{\sigma_y^2}\\
&=&\sigma_x^2-\frac{Cov(X,Y)^2}{\sigma_y^2}
\end{array}
\end{equation}

Since we allowed $a$ to depend on $P$, we cannot find a better minimum.

Hence, we can derive that

$$

Cov(X,Y)^2\leq \sigma_x^2\sigma_y^2

$$

Note that to obtain the above inequality we have chosen a $P(X,Y)$ that satisfies $\int (X-\mu_x)^2 dP(X,Y)=\sigma_x^2$

and $\int (Y-\mu_y)^2 dP(X,Y)=\sigma_y^2$ and, therefore, it satisfies the constraints. This ends the proof.
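The role of the scalar $a$ can be illustrated with a small numeric sketch. The second moments below are hypothetical values, and `quadratic_bound` is an illustrative helper that evaluates $E[(X-\mu_x-a(Y-\mu_y))^2]$ from variances and covariance only.

```python
def quadratic_bound(var_x, var_y, cov_xy, a):
    """E[(X - mu_x - a (Y - mu_y))^2] written in terms of second moments."""
    return var_x + a ** 2 * var_y - 2 * a * cov_xy


# Hypothetical second moments of some joint distribution P
var_x, var_y, cov_xy = 4.0, 1.0, 1.5

a_star = cov_xy / var_y  # stationary point of the objective in a
c_min = quadratic_bound(var_x, var_y, cov_xy, a_star)

# Any other value of a gives a larger (or equal) value of the objective.
others = [quadratic_bound(var_x, var_y, cov_xy, a) for a in (-1.0, 0.0, 1.0, 2.0, 3.0)]
```

At `a_star` the objective equals `var_x - cov_xy**2 / var_y`, and requiring this value to be non-negative is the covariance inequality.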

Here you can find the link to my Keynote talk:

error. The motivations for the axioms are not always clear and even to experts the basic axioms of QM often appear counter-intuitive. In a recent paper [1], we have shown that:

- It is possible to derive quantum mechanics from a single principle of self-consistency or, in other words, that QM laws of Nature are logically consistent;
- QM is just the Bayesian theory generalised to the complex Hilbert space.

To obtain these results we have generalised the theory of desirable gambles (TDG) to complex numbers. TDG was originally introduced by Williams, and later reconsidered by Walley, to justify in a subjective way a very general form of probability theory.

[**Theory of desirable gambles**]

In classical subjective, or Bayesian, probability, there is a well-established way to check whether the probability assignments of a certain subject, whom we call Alice, about the result of an uncertain experiment are valid, in the sense that they are self-consistent. The idea is to use these probability assignments to define odds (the inverses of probabilities) about the results of the experiment (e.g., Head or Tail in the case of a coin toss), and then show that there is no way to make Alice a sure loser in the related betting system, that is, to make her lose money no matter the outcome of the experiment. Historically this is also referred to as the impossibility to make a Dutch book or as the assessments being coherent; Alice in these conditions is regarded as a rational subject. De Finetti [3] showed that Kolmogorov’s probability axioms can be derived by imposing the principle of coherence alone on a subject’s odds about an uncertain experiment.

Williams and Walley [8, 7] have later shown that it is possible to justify probability in a simpler and more elegant way. Their approach is also more general than de Finetti’s, because coherence is defined purely as logical consistency without any explicit reference to probability (which is also what allows coherence to be generalised to other domains, such as quantum mechanics); the idea is to work in the dual space of gambles. To understand this framework, we consider an experiment whose outcome ω belongs to a certain space of possibilities Ω (e.g., Head or Tail). We can model Alice’s beliefs about ω by asking her whether *she accepts engaging in certain risky transactions*, called **gambles**, whose outcome depends on the actual outcome of the experiment. Mathematically, a gamble is a bounded real-valued function on Ω, g : Ω → ℝ, which is interpreted as an uncertain reward in a linear utility scale. If Alice accepts a gamble g, this means that she commits herself to receive g(ω) utiles (euros) if the outcome of the experiment eventually happens to be the event ω ∈ Ω. Since g(ω) can be negative, Alice can also lose utiles. Therefore Alice’s acceptability of a gamble depends on her knowledge about the experiment.

The set of gambles that Alice accepts (let us denote it by K) is called her set of desirable gambles. We say that a gamble g is positive if g ≠ 0 and g(ω) ≥ 0 for each ω ∈ Ω. We say that g is negative if g ≠ 0 and g(ω) ≤ 0 for each ω ∈ Ω. K is said to be coherent when it satisfies the following minimal requirements:

- D1: Any positive gamble g must be desirable for Alice (g ∈ K), given that it may increase Alice’s capital without ever decreasing it (accepting partial gain).
- D2: Any negative gamble g must not be desirable for Alice (g ∉ K), given that it may only decrease Alice’s capital without ever increasing it (avoiding partial loss).
- D3: If Alice finds g and h to be desirable (g, h ∈ K), then also λg + νh must be desirable for her (λg + νh ∈ K), for any 0 < λ, ν ∈ ℝ (linearity of the utility scale).

In spite of their simple character, these axioms alone define a very general theory of probability. *De Finetti’s (Bayesian) theory is the particular case obtained by additionally imposing some regularity (continuity) requirement and especially completeness, that is, the idea that a subject should always be capable of comparing options* [7, 8]. In this case, probability is derived from K via (mathematical) **duality**.
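For a finite Ω the requirements D1-D3 are easy to experiment with. The sketch below is only an illustration under stated assumptions: gambles are lists of rewards, and Alice's set K is modelled, hypothetically, as the positive gambles together with the gambles whose expected reward under a probability vector `probs` is strictly positive. All function names are invented for this example.

```python
def is_positive(gamble):
    """g != 0 and g(w) >= 0 for every outcome w."""
    return all(v >= 0 for v in gamble) and any(v > 0 for v in gamble)


def is_negative(gamble):
    """g != 0 and g(w) <= 0 for every outcome w."""
    return all(v <= 0 for v in gamble) and any(v < 0 for v in gamble)


def expected_reward(gamble, probs):
    return sum(g * p for g, p in zip(gamble, probs))


def is_desirable(gamble, probs):
    """Illustrative model of K: positive gambles, or positive expected reward."""
    return is_positive(gamble) or expected_reward(gamble, probs) > 0


def combine(g, h, lam, nu):
    """Positive linear combination lam*g + nu*h (lam, nu > 0), as in D3."""
    return [lam * a + nu * b for a, b in zip(g, h)]


# Coin toss with outcomes (Head, Tail); Alice believes Head has probability 0.6.
probs = [0.6, 0.4]
g = [1.0, -1.0]   # win 1 on Head, lose 1 on Tail
h = [-1.0, 2.0]   # lose 1 on Head, win 2 on Tail
```

With these numbers both `g` and `h` are desirable, so D3 requires every positive combination such as `combine(g, h, 2.0, 0.5)` to be desirable as well, while a negative gamble such as `[-1.0, 0.0]` must be rejected.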

[**QM**]

In [1] we have extended desirability to QM. To introduce this extension, we first have to define what is a gamble in a quantum experiment and how the payoff for the gamble is computed. To this end, we consider an experiment relative to an n-dimensional quantum system and two subjects: the gambler (Alice) and the bookmaker. The **n-dimensional quantum system** is prepared by the bookmaker in some quantum state. We assume that Alice has her personal knowledge about the experiment (possibly no knowledge at all).

1. The bookmaker announces that he will measure the quantum system along its n orthogonal directions and so the outcome of the measurement is an element of $\Omega = \{\omega_1,\dots,\omega_n\}$, with $\omega_i$ denoting the elementary event “detection along i”. Mathematically, it means that the quantum system is measured along its eigenvectors, i.e., the projectors $\Pi^* = \{\Pi_1^*,\dots,\Pi_n^*\}$, and $\omega_i$ is the event “indicated” by the i-th projector.
2. Before the experiment, Alice declares the set of gambles she is willing to accept. Mathematically, a gamble G on this experiment is an n×n complex Hermitian matrix; the space of all Hermitian n×n matrices is denoted by $\mathbb{C}_h^{n\times n}$.
3. By accepting a gamble G, Alice commits herself to receive $\gamma_i \in \mathbb{R}$ utiles if the outcome of the experiment eventually happens to be $\omega_i$. The value $\gamma_i$ is defined from G and $\Pi^*$ as follows: $\Pi_i^* G \Pi_i^* = \gamma_i \Pi_i^*$ for $i = 1,\dots,n$. It is a real number since G is Hermitian.
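The payoff rule in step 3 can be sketched numerically. The matrix `G` and the choice of the computational basis as measurement directions are hypothetical; the sketch only assumes rank-one orthogonal projectors, for which $\gamma_i$ can be read off as the trace of $\Pi_i^* G \Pi_i^*$.

```python
import numpy as np

# A Hermitian gamble on a 2-dimensional system (hypothetical numbers).
G = np.array([[1.0, 2.0 - 1.0j],
              [2.0 + 1.0j, -3.0]])
assert np.allclose(G, G.conj().T)  # G equals its conjugate transpose

# Measurement along the computational basis: Pi_i = |e_i><e_i|.
basis = np.eye(2, dtype=complex)
projectors = [np.outer(basis[:, i], basis[:, i].conj()) for i in range(2)]

payoffs = []
for P in projectors:
    # Pi_i G Pi_i = gamma_i Pi_i, so gamma_i = Tr(Pi_i G Pi_i) when Tr(Pi_i) = 1.
    gamma = np.trace(P @ G @ P)
    assert np.allclose(P @ G @ P, gamma * P)
    payoffs.append(gamma.real)  # the imaginary part vanishes for Hermitian G
```

In this basis the payoffs are simply the diagonal entries of `G`; measuring along a different orthonormal basis would give different payoffs for the same gamble.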

The subset of all positive semi-definite and non-zero (PSDNZ) matrices in $\mathbb{C}_h^{n\times n}$ constitutes the set of positive gambles, whereas the set of negative gambles is similarly given by all gambles $G \in \mathbb{C}_h^{n\times n}$ such that $G \lneq 0$. Alice examines the gambles in $\mathbb{C}_h^{n\times n}$ and comes up with the subset K of the gambles that she finds desirable. Alice’s rationality is then characterised by simply applying the coherence requirements D1-D3 to the space of Hermitian matrices:

- S1: Any PSDNZ (positive gamble) G must be desirable for Alice (G ∈ K), given that it may increase Alice’s utiles without ever decreasing them (accepting partial gain).
- S2: Any $G \lneq 0$ (negative gamble) must not be desirable for Alice (G ∉ K), given that it may only decrease Alice’s utiles without ever increasing them (avoiding partial loss).
- S3: If Alice finds G and H desirable (G, H ∈ K), then also λG + νH must be desirable for her (λG + νH ∈ K), for any 0 < λ, ν ∈ ℝ (linearity of the utility scale).
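Whether a Hermitian matrix is a positive or a negative gamble can be decided from its eigenvalues. The following sketch uses an illustrative helper, `classify_gamble`, with a numerical tolerance that is an assumption of this example rather than part of the theory.

```python
import numpy as np


def classify_gamble(G, tol=1e-12):
    """Return 'positive', 'negative', 'indefinite' or 'zero' for a Hermitian G.

    'positive' means positive semi-definite and non-zero (PSDNZ);
    'negative' means that -G is PSDNZ.
    """
    G = np.asarray(G, dtype=complex)
    if not np.allclose(G, G.conj().T):
        raise ValueError("a gamble must be a Hermitian matrix")
    eig = np.linalg.eigvalsh(G)  # real eigenvalues of a Hermitian matrix
    if np.all(np.abs(eig) <= tol):
        return "zero"
    if np.all(eig >= -tol):
        return "positive"
    if np.all(eig <= tol):
        return "negative"
    return "indefinite"
```

S1 and S2 settle the 'positive' and 'negative' cases for every rational Alice; only for 'indefinite' gambles does her answer depend on what she knows about the experiment.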

From a geometric point of view, a coherent set of desirable gambles K is a convex cone without its apex that contains all PSDNZ matrices (and thus it is disjoint from the set of all matrices such that $G \lneq 0$). We may also assume that K satisfies the following additional property:

- S4: If G ∈ K then either $G \gneq 0$ or G − εI ∈ K for some strictly positive real number ε (openness).

This property is not necessary for rationality, but it is technically convenient as it precisely isolates the kind of models we use in QM (as well as in classical probability) [1]. The openness condition (S4) has a gambling interpretation too: it means that we will only consider gambles that are strictly desirable for Alice; these are the gambles for which Alice expects gaining something, even an epsilon of utiles. For this reason, K is called set of strictly desirable gambles (SDG) in this case.

An SDG is said to be maximal if there is no larger SDG containing it. In [1, Theorem IV.4], we have shown that maximal SDGs and density matrices are one-to-one. The mapping between them is obtained through the standard inner product in $\mathbb{C}_h^{n\times n}$, i.e., $G\cdot R = \mathrm{Tr}(G^{\dagger}R)$ with $G,R\in\mathbb{C}_h^{n\times n}$, via a representation theorem [1, Theorem IV.4].

This result has several consequences. First, it provides a gambling interpretation of the first axiom of QM on density operators. Second, it shows that density operators are coherent, since the dual of ρ is a valid SDG. This also implies that QM is self-consistent: a gambler that uses QM to place bets on a quantum experiment cannot be made a partial (and, thus, sure) loser. Third, the first axiom of QM on $\mathbb{C}_h^{n\times n}$ is structurally and formally equivalent to Kolmogorov’s first and second axioms about probabilities on $\mathbb{R}^n$ [1, Sec. 2]. In fact, they can be both derived via duality from a coherent set of desirable gambles on $\mathbb{C}_h^{n\times n}$ and, respectively, $\mathbb{R}^n$. In [1] we have also derived Born’s rule and the other three axioms of QM as **a consequence of rational gambling on a quantum experiment and shown that measurement, partial tracing and tensor product are equivalent to the probabilistic notions of Bayes’ rule, marginalisation and independence.** Finally, as an additional consequence of the aforementioned representation result, in [2] we have shown that a subject who uses dispersion-free probabilities to accept gambles on a quantum experiment can always be made a sure loser: she loses utiles no matter the outcome of the experiment. We say that dispersion-free probabilities are incoherent, which means that they are logically inconsistent with the axioms of QM. Moreover, we have proved that it is possible to derive, through a much simpler proof, a stronger version of Gleason’s theorem that holds in any finite dimension (hence even for n = 2) and states that all coherent probability assignments in QM must be obtained as the trace of the product of a projector and a density operator.

A list of relevant bibliographic references, as well as a comparison between our approach and similar approaches like QBism [4] and Pitowsky’s quantum gambles [6], can be found in [1].

[1] Alessio Benavoli, Alessandro Facchini & Marco Zaffalon (2016): Quantum mechanics: The Bayesian theory generalized to the space of Hermitian matrices. Phys. Rev. A 94, p. 042106. Available at https://arxiv.org/pdf/1605.08177.pdf.

[2] Alessio Benavoli, Alessandro Facchini & Marco Zaffalon (2017): A Gleason-Type Theorem for Any Dimension Based on a Gambling Formulation of Quantum Mechanics. Foundations of Physics, pp. 1–12, doi:10.1007/s10701-017-0097-0. Available at https://arxiv.org/pdf/1606.03615.pdf.

[3] B. de Finetti (1937): La prévision: ses lois logiques, ses sources subjectives. Annales de l’Institut Henri Poincaré 7, pp. 1–68. English translation in [5].

[4] Christopher A. Fuchs & Ruediger Schack (2013): Quantum-Bayesian coherence. Reviews of Modern Physics 85(4), p. 1693.

[5] H. E. Kyburg Jr. & H. E. Smokler, editors (1964): Studies in Subjective Probability. Wiley, New York. Second edition (with new material) 1980.

[6] Itamar Pitowsky (2003): Betting on the outcomes of measurements: a Bayesian theory of quantum probability. Studies in History and Philosophy of Science Part B: Studies in History and Philosophy of Modern Physics 34(3), pp. 395–414.

[7] P. Walley (1991): Statistical Reasoning with Imprecise Probabilities. Chapman and Hall, New York.

[8] P. M. Williams (1975): Notes on conditional previsions. Technical Report, School of Mathematical and Physical Science, University of Sussex, UK.

In particular, the Bayesian correlated t-test makes inference about the mean difference of accuracy between two classifiers in the $i$-th dataset ($\mu_i$) by exploiting three pieces of information: the sample mean ($\bar{x}_i$), the variability of the data (sample standard deviation $\hat{\sigma}_i$) and the correlation due to the overlapping training set ($\rho$). This test can only be applied to a single dataset.

There is no direct NHST able to extend the above statistical comparison to multiple datasets, i.e., one that takes as inputs the $m$ runs of the $k$-fold cross-validation results for each dataset and returns as output a statistical decision about which classifier is better in all the datasets.

The usual NHST procedure that is employed for performing such analysis has two steps:

(1) compute the mean *difference* of accuracy for each dataset $\bar{x}_i$;

(2) perform a NHST to establish if the two classifiers have different performance or not based on these mean differences of accuracy.

This discards two pieces of information: the correlation $\rho$ and sample standard deviation $\hat{\sigma}_i$ in each dataset.

The standard deviation is informative about the accuracy of $\bar{x}_i$ as an estimator of $\mu_i$.

The standard deviation can largely vary across data sets, as a result of each data set having its own size and complexity.
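The information lost by the two-step procedure can be seen on a toy example. The numbers below are hypothetical cross-validation differences invented for illustration: both data sets have the same mean difference, so step (1) treats them identically, although the second mean is a much noisier estimate.

```python
import statistics

# Hypothetical differences of accuracy: one list per data set,
# one entry per fold of a (here tiny) cross-validation.
differences = {
    "dataset_a": [0.02, 0.01, 0.03, 0.02],
    "dataset_b": [0.10, -0.08, 0.06, 0.00],
}

summary = {}
for name, diffs in differences.items():
    mean_diff = statistics.mean(diffs)   # \bar{x}_i, kept by the two-step procedure
    std_diff = statistics.stdev(diffs)   # \hat{\sigma}_i, discarded by it
    summary[name] = (mean_diff, std_diff)
```

A method that only sees the first element of each pair in `summary` cannot distinguish the two data sets, which is the motivation for a model that uses the standard deviations as well.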

The aim of this section is to present an extension of the Bayesian correlated t-test that is able to make inference on multiple datasets and at the same time to account for all the available information (mean, standard deviation and correlation).

In Bayesian estimation, this can be obtained by defining a hierarchical model. Hierarchical models are among the most powerful and flexible tools in Bayesian analysis.

The code, this notebook and the dataset can be downloaded from our github repository.

Module `hierarchical` in `bayesiantests` compares the performance of two classifiers that have been assessed by *m*-runs of *k*-fold cross-validation on *q* datasets. It returns probabilities that, based on the measured performance, one model is better than another or vice versa or they are within the region of practical equivalence.

This notebook demonstrates the use of the module.

In [1]:

```
import numpy as np
scores = np.loadtxt('Data/diffNbcHnb.csv', delimiter=',')
names = ("HNB", "NBC")
print(scores)
```

To analyse this data, we will use the function hierarchical in the module bayesiantests that accepts the following arguments.

```
scores: a 2-d array of differences.
rope: the region of practical equivalence. We consider two classifiers equivalent if the difference in their
performance is smaller than rope.
rho: correlation due to cross-validation
names: the names of the two classifiers; if x is a vector of differences, positive values mean that the second
(right) model had a higher score.
```

The hierarchical function uses **STAN** through Python module **pystan**.

Function `hierarchical(scores,rope,rho, verbose, names=names)` computes the Bayesian hierarchical test and returns the probabilities that the difference (the score of the second classifier minus the score of the first) is negative, within rope or positive.

In [8]:

```
import bayesiantests as bt
rope=0.01 # we consider two classifiers equivalent when the difference of accuracy is less than 1%
rho=1/10 # we are performing 10 runs of 10-fold cross-validation
pleft, prope, pright=bt.hierarchical(scores,rope,rho)
```

The first value (`left`) is the probability that the difference of accuracies is negative (and, therefore, in favor of HNB). The third value (`right`) is the probability that the difference of accuracies is positive (and, therefore, in favor of NBC). The second is the probability that the two classifiers are practically equivalent, i.e., that the difference is within the rope.

In the above case, the HNB performs better than naive Bayes with a probability of 0.9965, and they are practically equivalent with a probability of 0.002. Therefore, we can conclude with high probability that HNB is better than NBC.

If we add arguments `verbose` and `names`, the function also prints out the probabilities.

In [9]:

```
pl, pe, pr=bt.hierarchical(scores,rope,rho, verbose=True, names=names)
```

The posterior distribution can be plotted out:

- using the function `hierarchical_MC(scores,rope,rho, names=names)` we generate the samples of the posterior;
- using the function `plot_posterior(samples,names=('C1', 'C2'))` we then plot the posterior in the probability simplex.

In [10]:

```
%matplotlib inline
import matplotlib.pyplot as plt
samples=bt.hierarchical_MC(scores,rope,rho, names=names)
#plt.rcParams['figure.facecolor'] = 'black'
fig = bt.plot_posterior(samples,names)
plt.savefig('triangle_hierarchical.png',facecolor="black")
plt.show()
```

It can be seen that the posterior mass is in the region in favor of HNB and so it confirms that the classifier is better than NBC. From the posterior we have also an idea of the magnitude of the uncertainty and the “stability” of our inference.
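Plotting in the probability simplex amounts to mapping each sampled triplet of probabilities to a point inside a triangle. The sketch below shows one such barycentric mapping; the corner layout and the helper `to_simplex_xy` are assumptions made for illustration and are not taken from `bayesiantests`.

```python
import math

# Triangle corners for (p_left, p_rope, p_right); this layout is an assumption.
CORNERS = {
    "left": (0.0, 0.0),
    "rope": (0.5, math.sqrt(3) / 2),
    "right": (1.0, 0.0),
}


def to_simplex_xy(p_left, p_rope, p_right):
    """Map a probability triplet to 2-D coordinates inside the triangle."""
    total = p_left + p_rope + p_right
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    weights = {"left": p_left, "rope": p_rope, "right": p_right}
    # Weighted average of the corners (barycentric coordinates).
    x = sum(w * CORNERS[k][0] for k, w in weights.items())
    y = sum(w * CORNERS[k][1] for k, w in weights.items())
    return x, y
```

A triplet concentrated on one outcome lands on the corresponding corner, so a cloud of posterior samples close to one corner indicates strong support for that outcome.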

The function `hierarchical` also allows testing the effect of the prior hyperparameters. We point to the last reference for a discussion about prior sensitivity.

    @ARTICLE{bayesiantests2016,
      author = {{Benavoli}, A. and {Corani}, G. and {Demsar}, J. and {Zaffalon}, M.},
      title = "{Time for a change: a tutorial for comparing multiple classifiers through Bayesian analysis}",
      journal = {ArXiv e-prints},
      archivePrefix = "arXiv",
      eprint = {1606.04316},
      url={https://arxiv.org/abs/1606.04316},
      year = 2016,
      month = jun
    }

    @article{corani2016unpub,
      title = { Statistical comparison of classifiers through Bayesian hierarchical modelling},
      author = {Corani, Giorgio and Benavoli, Alessio and Demsar, Janez and Mangili, Francesca and Zaffalon, Marco},
      url = {http://ipg.idsia.ch/preprints/corani2016b.pdf},
      year = {2016},
      date = {2016-01-01},
      institution = {technical report IDSIA},
      keywords = {},
      pubstate = {published},
      tppubtype = {article}
    }

The aim of this Special Issue is twofold. On one hand, it wishes to give a broad overview of popular models used in Bayesian nonparametrics (BNP), and of the related computational methods for implementation, through a tutorial paper. On the other hand, it focuses on theoretical advances and modern challenging applications of the BNP approach with special emphasis, among others, on: Bayesian asymptotics, Bayesian topic models, Bayesian linear and regression models, Bayesian semiparametric state space models, dictionary learning with application to image processing, sensitivity and robustness. This is the list of papers.

- Theory and Computations for the Dirichlet Process and Related Models: An Overview, by Alejandro Jara
- Frequentistic approximations to Bayesian prevision of exchangeable random elements, by Emanuele Dolera, Donato M. Cifarelli and Eugenio Regazzini
- A Dirichlet Process Functional Approach to Heteroscedastic-Consistent Covariance Estimation, by George Karabatsos
- Nonparametric Bayesian Topic Modelling with the Hierarchical Pitman-Yor Processes, by Kar Wai Lim, Wray Buntine, Changyou Chen and Lan Du
- Bayes linear kinematics in a dynamic survival model, by Kevin J. Wilson and Malcolm Farrow
- Nonparametric adaptive Bayesian regression using priors with tractable normalizing constants and under qualitative assumptions, by Khader Khadraoui
- Robust Identification of Highly Persistent Interest Rate Regimes, by Stefano Peluso, Antonietta Mira and Pietro Muliere
- Indian Buffet Process Dictionary Learning : algorithms and applications to image processing, by Hong-Phuong Dang and Pierre Chainais
- Bayesian Nonparametric System Reliability using Sets of Priors, by Gero Walter, Louis Aslett and Frank Coolen
- Robustness in Bayesian Nonparametrics, by Sudip Bose

We wish to thank the authors of this special issue for submitting interesting papers, and the reviewers for their time spent on reviewing the above manuscripts.

A very pleasant way to start the day. A joy to read. It is really very well written, maybe a bit too simple, but a captivating book. I recommend it.

The difference between two classifiers (algorithms) can be very small; however *there are no two classifiers whose accuracies are perfectly equivalent*.

In a null hypothesis significance test (NHST), the null hypothesis is that the classifiers are equal. However, the null hypothesis is practically always false!

By rejecting the null hypothesis NHST indicates that the null hypothesis is unlikely; **but this is known even before running the experiment**.

We will use the Python modules `signtest` and `signrank` in `bayesiantests` (see our GitHub repository).

Can we say anything about the probability that two classifiers are practically equivalent (e.g., *j48* is practically equivalent to *j48gr*)?

NHST cannot answer this question, while Bayesian analysis can.

We need to define the meaning of **practically equivalent**.

The rope depends:

- on the **metric** we use for comparing classifiers (accuracy, logloss etc.);
- on our **subjective** definition of practical equivalence (**domain specific**).

Accuracy is a number in $[0,1]$. For practical applications, it is sensible to define that two classifiers whose mean difference of accuracies is less than $1\%$ ($0.01$) are practically equivalent.

A difference of accuracy of $1\%$ is negligible in practice.

The interval $[-0.01,0.01]$ can thus be used to define a **region of practical equivalence** for classifiers.

See it in action.

The data is stored in `Data/accuracy_j48_j48gr.csv`. For simplicity, we will skip the header row and the column with data set names.

In [2]:

```
import numpy as np
scores = np.loadtxt('Data/accuracy_j48_j48gr.csv', delimiter=',', skiprows=1, usecols=(1, 2))
names = ("J48", "J48gr")
```

Function `signtest(x, rope, prior_strength=1, prior_place=ROPE, nsamples=50000, verbose=False, names=('C1', 'C2'))` computes the Bayesian sign test and returns the probabilities that the difference (the score of the second classifier minus the score of the first) is negative, within rope or positive.
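To give an idea of what such a test computes, the following is a minimal Monte Carlo sketch. It assumes a Dirichlet posterior over the three regions (left of the rope, inside it, right of it), with the prior pseudo-observation placed in the rope. The function `sign_test_sketch`, its `seed` argument and the small constant added to the counts are choices made for this illustration; the sketch is not presented as the module's code.

```python
import numpy as np


def sign_test_sketch(diff, rope, prior_strength=1.0, nsamples=50000, seed=0):
    """Monte Carlo sketch of a Bayesian sign test with the prior in the rope."""
    diff = np.asarray(diff, dtype=float)
    n_left = np.sum(diff < -rope)
    n_right = np.sum(diff > rope)
    n_rope = diff.size - n_left - n_right

    # Dirichlet parameters: counts plus a tiny constant to keep them positive,
    # plus the prior pseudo-observation in the rope.
    alpha = np.array([n_left, n_rope, n_right], dtype=float) + 1e-4
    alpha[1] += prior_strength

    rng = np.random.default_rng(seed)
    samples = rng.dirichlet(alpha, size=nsamples)

    # Fraction of samples in which each region has the largest probability.
    winners = np.argmax(samples, axis=1)
    return np.bincount(winners, minlength=3) / nsamples


# Hypothetical differences that all fall inside the rope
p_left, p_rope, p_right = sign_test_sketch(np.zeros(20), rope=0.01, nsamples=2000)
```

When every difference lies inside the rope, practically all sampled triplets have their largest component on the rope, so `p_rope` is close to 1.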

In [3]:

```
import bayesiantests as bt
left, within, right = bt.signtest(scores, rope=0.01,verbose=True,names=names)
```

The first value (`P(J48 > J48gr)`) is the probability that the first classifier (the left column of `x`) has a higher score than the second (or that the differences are negative, if `x` is given as a vector).

The second value (`P(rope)`) is the probability that they are practically equivalent.

The third value (`P(J48gr > J48)`) is equal to `1-P(J48 > J48gr)-P(rope)`.

The probability of the rope is equal to $1$ and, therefore, we can say that they are equivalent (for the given rope).

Decision tree grafting (**J48gr**) was developed to demonstrate that a preference for less complex trees (**J48**) does not serve to improve accuracy. The point is that **J48gr** shows a consistent (albeit small) improvement in accuracy over **J48**.

The advantage of having a rope is that we can test this hypothesis from a statistical point of view.

In [10]:

```
left, within, right = bt.signtest(scores, rope=0.001,verbose=True,names=names)
```

No: the difference is less than 0.001 with probability 0.99.

In [11]:

```
left, within, right = bt.signtest(scores, rope=0.0001,verbose=True,names=names)
```

In [12]:

```
left, within, right = bt.signtest(scores, rope=0.0001,prior_place=bt.RIGHT,verbose=True,names=names)
```

In [88]:

```
%matplotlib inline
import matplotlib.pyplot as plt

# Rope widths 0.001, 0.001/2, ..., 0.001/2**9
ropes = 0.001 / 2 ** np.arange(10)
left = np.zeros(10)
within = np.zeros(10)
right = np.zeros(10)
for i, rope in enumerate(ropes):
    # Run the sign test once per rope width and store the three probabilities
    left[i], within[i], right[i] = bt.signtest(scores, rope=rope, names=names)

plt.plot(ropes, within)
plt.plot(ropes, left)
plt.plot(ropes, right)
plt.legend(('rope', 'left', 'right'))
plt.xlabel('Rope width')
plt.ylabel('Probability')
```


In [13]:

```
left, within, right = bt.signrank(scores, rope=0.001,verbose=True,names=names)
```

In [14]:

```
left, within, right = bt.signrank(scores, rope=0.0001,verbose=True,names=names)
```


This post is about Bayesian nonparametric tests for comparing algorithms in ML. This time we discuss the Python module `signrank` in `bayesiantests` (see our GitHub repository). It computes the Bayesian equivalent of the Wilcoxon signed-rank test. It returns the probabilities that, based on the measured performance, one model is better than the other, or vice versa, or that they are within the region of practical equivalence.

This notebook demonstrates the use of the module. You can download the notebook from GitHub as well.

We will load the classification accuracies of NBC and AODE from the file `Data/accuracy_nbc_aode.csv`. For simplicity, we will skip the header row and the column with data set names.

In [11]:

```
import numpy as np
scores = np.loadtxt('Data/accuracy_nbc_aode.csv', delimiter=',', skiprows=1, usecols=(1, 2))
names = ("NBC", "AODE")
```

Functions in the module accept the following arguments.

- `x`: a 2-d array with scores of two models (each row corresponding to a data set) or a vector of differences.
- `rope`: the region of practical equivalence. We consider two classifiers equivalent if the difference in their performance is smaller than `rope`.
- `prior_strength`: the prior strength for the Dirichlet distribution. Default is 0.6.
- `prior_place`: the region into which the prior is placed. Default is `bayesiantests.ROPE`; the other options are `bayesiantests.LEFT` and `bayesiantests.RIGHT`.
- `nsamples`: the number of Monte Carlo samples used to approximate the posterior.
- `names`: the names of the two classifiers; if `x` is a vector of differences, positive values mean that the second (right) model had a higher score.
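The two accepted forms of `x` and the sign convention can be illustrated with a short sketch. The helper `to_differences` and the toy scores are hypothetical and only mirror the description above (difference = second column minus first column); they are not part of `bayesiantests`.

```python
import numpy as np

def to_differences(x):
    """Hypothetical helper: return differences (second minus first) for either input form."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 2:
        return x[:, 1] - x[:, 0]  # 2-d array: one row per data set
    return x                      # already a vector of differences

# Toy scores, invented for illustration only.
scores_demo = np.array([[0.80, 0.85],
                        [0.70, 0.65],
                        [0.90, 0.90]])
diff = to_differences(scores_demo)
rope = 0.01
n_left = int(np.sum(diff < -rope))          # first (left) model clearly better
n_rope = int(np.sum(np.abs(diff) <= rope))  # practically equivalent
n_right = int(np.sum(diff > rope))          # second (right) model clearly better
print(n_left, n_rope, n_right)  # prints: 1 1 1
```

Passing the 2-d array or the precomputed vector `diff` should therefore describe the same comparison.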

Function `signrank(x, rope, prior_strength=0.6, prior_place=ROPE, nsamples=50000, verbose=False, names=('C1', 'C2'))` computes the Bayesian signed-rank test and returns the probabilities that the difference (the score of the second classifier minus the score of the first) is negative, within the rope, or positive.

In [2]:

```
import bayesiantests as bt
left, within, right = bt.signrank(scores, rope=0.01)
print(left, within, right)
```

The first value (`left`) is the probability that the first classifier (the left column of `x`) has a higher score than the second (or that the differences are negative, if `x` is given as a vector).

In the above case, the right classifier (AODE) performs better than naive Bayes with a probability of 0.88, and they are practically equivalent with a probability of 0.12.

If we add arguments `verbose` and `names`, the function also prints out the probabilities.

In [12]:

```
left, within, right = bt.signrank(scores, rope=0.01, verbose=True, names=names)
```
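For intuition about what a test of this kind estimates, the following is a rough sketch of a Dirichlet-weighted signed-rank statistic in the spirit of the Dirichlet-process formulation. It is an assumption-laden illustration, not the library's implementation: the function name `signrank_mc_sketch`, the pseudo-observation placed at zero (prior in the rope), and the fixed seed are all hypothetical choices.

```python
import numpy as np

def signrank_mc_sketch(diff, rope, prior_strength=0.6, nsamples=2000, seed=0):
    """Hypothetical posterior sampler; returns an (nsamples, 3) array."""
    diff = np.asarray(diff, dtype=float)
    # One prior pseudo-observation at 0, followed by the observed differences.
    z = np.concatenate(([0.0], diff))
    alpha = np.concatenate(([prior_strength], np.ones(len(diff))))
    # Pairwise sums z_i + z_j, compared against twice the rope.
    pair = z[:, None] + z[None, :]
    a_left = (pair < -2 * rope).astype(float)
    a_right = (pair > 2 * rope).astype(float)
    rng = np.random.default_rng(seed)
    w = rng.dirichlet(alpha, size=nsamples)  # one weight vector per posterior sample
    # Quadratic forms w' A w give the mass left and right of the rope.
    left = np.einsum('si,ij,sj->s', w, a_left, w)
    right = np.einsum('si,ij,sj->s', w, a_right, w)
    within = 1.0 - left - right
    return np.column_stack((left, within, right))

# Toy differences, all clearly positive: nearly all mass ends up on the right.
demo = signrank_mc_sketch(np.full(20, 0.05), rope=0.01)
print(demo.mean(axis=0))
```

Each row holds the sampled mass to the left of, inside, and to the right of the rope, so the rows sum to one.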

The posterior distribution can be plotted out:

- using the function `signrank_MC(x, rope, prior_strength=0.6, prior_place=ROPE, nsamples=50000)` we generate the samples of the posterior
- using the function `plot_posterior(samples, names=('C1', 'C2'))` we then plot the posterior in the probability simplex

In [13]:

```
%matplotlib inline
import matplotlib.pyplot as plt
samples = bt.signrank_MC(scores, rope=0.01)
fig = bt.plot_posterior(samples,names)
plt.show()
```

To check the sensitivity to the prior, we first place the prior on the left …

In [19]:

```
samples = bt.signrank_MC(scores, rope=0.01, prior_strength=0.6, prior_place=bt.LEFT)
fig = bt.plot_posterior(samples,names)
plt.show()
```

… and on the right

In [20]:

```
samples = bt.signrank_MC(scores, rope=0.01, prior_strength=0.6, prior_place=bt.RIGHT)
fig = bt.plot_posterior(samples,names)
plt.show()
```

A prior with a strength of about `1` has negligible effect. Only a much stronger prior on the left would shift the probabilities toward NBC:

In [18]:

```
samples = bt.signrank_MC(scores, rope=0.01, prior_strength=6, prior_place=bt.LEFT)
fig = bt.plot_posterior(samples,names)
plt.show()
```
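A back-of-the-envelope calculation suggests why a weak prior barely matters. Assuming the prior enters as a single pseudo-observation with Dirichlet parameter `prior_strength` next to one unit parameter per data set, its expected weight is `prior_strength / (prior_strength + n)`. The function name and the number of data sets below are hypothetical.

```python
def expected_prior_weight(prior_strength, n_datasets):
    """Expected Dirichlet weight of the prior pseudo-observation (assumed model)."""
    return prior_strength / (prior_strength + n_datasets)

n_datasets = 50  # hypothetical number of data sets, for illustration only
for s in (0.6, 6):
    print(s, round(expected_prior_weight(s, n_datasets), 3))
```

Under these assumptions a tenfold stronger prior carries roughly ten times the weight, which is consistent with the shift seen only for the stronger prior.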

The function `signrank_MC(x, rope, prior_strength=0.6, prior_place=ROPE, nsamples=50000)` computes the posterior for the given input parameters. The result is returned as a 2-d array with `nsamples` rows and three columns representing the probabilities $p(-\infty, -\mathrm{rope})$, $p[-\mathrm{rope}, \mathrm{rope}]$ and $p(\mathrm{rope}, \infty)$. Call `signrank_MC` directly to obtain a sample of the posterior.
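Once such a sample is available, it can be summarized with plain NumPy. The sketch below uses a synthetic Dirichlet sample as a stand-in for the returned array, so the numbers are not results from the data above; counting how often each column is the largest is one common summary, and whether it matches exactly what `signrank` reports is an assumption here.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the (nsamples, 3) array of posterior samples; rows sum to one.
samples = rng.dirichlet([2.0, 5.0, 20.0], size=10000)

# Fraction of samples in which each region carries the largest mass.
winners = np.argmax(samples, axis=1)
p_left, p_rope, p_right = np.bincount(winners, minlength=3) / len(samples)

# 95% credible interval for the mass to the right of the rope.
lower, upper = np.percentile(samples[:, 2], [2.5, 97.5])
print(p_left, p_rope, p_right)
print(lower, upper)
```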

The posterior is plotted by `plot_simplex(points, names=('C1', 'C2'))`, where `points` is a sample returned by `signrank_MC`.
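Plotting in the probability simplex amounts to mapping each triple of probabilities to a point inside a triangle. A minimal sketch of that mapping with barycentric coordinates follows; the corner layout and the name `to_simplex_xy` are assumptions for illustration and need not match how the library draws its figure.

```python
import numpy as np

# Corners of an equilateral triangle; which region sits at which corner is assumed.
VERTICES = np.array([
    [0.0, 0.0],             # all mass on "left"
    [0.5, np.sqrt(3) / 2],  # all mass on "rope"
    [1.0, 0.0],             # all mass on "right"
])

def to_simplex_xy(points):
    """Map rows (p_left, p_rope, p_right) that sum to one to 2-d coordinates."""
    points = np.asarray(points, dtype=float)
    return points @ VERTICES

# The uniform triple lands on the centroid of the triangle.
print(to_simplex_xy([[1 / 3, 1 / 3, 1 / 3]]))
```

The resulting coordinates can then be passed to any scatter or density plot.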

    @ARTICLE{bayesiantests2016,
      author = {{Benavoli}, A. and {Corani}, G. and {Demsar}, J. and {Zaffalon}, M.},
      title = "{Time for a change: a tutorial for comparing multiple classifiers through Bayesian analysis}",
      journal = {ArXiv e-prints},
      archivePrefix = "arXiv",
      eprint = {1606.04316},
      url = {https://arxiv.org/abs/1606.04316},
      year = 2016,
      month = jun
    }

    @inproceedings{benavoli2014a,
      title = {A {B}ayesian {W}ilcoxon signed-rank test based on the {D}irichlet process},
      booktitle = {Proceedings of the 30th International Conference on Machine Learning ({ICML} 2014)},
      author = {Benavoli, A. and Mangili, F. and Corani, G. and Zaffalon, M. and Ruggeri, F.},
      pages = {1--9},
      year = {2014},
      url = {http://www.idsia.ch/~alessio/benavoli2014a.pdf}
    }