The following post discusses how to use a Bayesian hierarchical test (and also the Python module that implements it) to compare classifiers assessed via m-runs k-folds cross-validation. In the Bayesian correlated t-test and also in the frequentist correlated t-test, we can only analyze cross-validation results on a single dataset.
In particular, the Bayesian correlated t-test makes inference about the mean difference of accuracy between two classifiers in the $i$-th dataset ($\mu_i$) by exploiting three pieces of information: the sample mean ($\bar{x}_i$), the variability of the data (sample standard deviation $\hat{\sigma}_i$) and the correlation due to the overlapping training set ($\rho$). This test can only be applied to a single dataset.
There is no direct NHST able to extend the above statistical comparison to multiple datasets, i.e., that takes as inputs the $m$ runs of the $k$-fold cross-validation results for each dataset and returns as output a statistical decision
about which classifier is better in all the datasets.

The usual NHST procedure that is employed for performing such analysis has two steps:
(1) compute the mean \textit{difference} of accuracy for each dataset $\bar{x}_i$;
(2) perform a NHST to establish if the two classifiers have different performance or not based on these mean differences of accuracy.

This discards two pieces of information: the correlation $\rho$ and sample standard deviation $\hat{\sigma}_i$ in each dataset.
The standard deviation is informative about the accuracy of $\bar{x}_i$ as an estimator of $\mu_i$.
The standard deviation can largely vary across data sets, as a result of each data set having its own size and complexity.
The aim of this section is to present an extension of the Bayesian correlated t-test that is able
to make inference on multiple datasets and at the same time to account for all the available information
(mean, standard deviation and correlation).
In Bayesian estimation, this can be obtained by defining a hierarchical model. Hierarchical models are among the most powerful and flexible tools in Bayesian analysis.

The code, this notebook and the dataset can be downloaded from our github repository.

Bayesian Hierarchical Test

Module hierarchical in bayesiantests compares the performance of two classifiers that have been assessed by m-runs of k-fold cross-validation on q datasets. It returns probabilities that, based on the measured performance, one model is better than another or vice versa or they are within the region of practical equivalence.

This notebook demonstrates the use of the module.

We will load the 10-folds, 10-runs cross-validation results of the naive Bayesian classifier and HNB on 54 UCI datasets from the file Data/diffNbcHnb.csv. The file includes the differences of accuracies NBC-HNB in the $10 \times 10=100$ (columns) ripetions and 54 datasets (rows). A total of 5400 values.

In [1]:
import numpy as np
scores = np.loadtxt('Data/diffNbcHnb.csv', delimiter=',')
names = ("HNB", "NBC")
[[-0.02222222  0.01111111 -0.02222222 ..., -0.02222222 -0.04494382  0.        ]
 [ 0.         -0.08695652  0.04347826 ..., -0.09090909  0.         -0.09090909]
 [ 0.06451613  0.03225806  0.         ...,  0.03333333  0.03333333  0.1       ]
 [ 0.01428571  0.          0.         ...,  0.02857143  0.01428571  0.        ]
 [-0.02684564 -0.02684564  0.00671141 ..., -0.02027027  0.01351351
 [-0.18181818  0.          0.         ..., -0.2        -0.1         0.        ]]

To analyse this data, we will use the function hierarchical in the module bayesiantests that accepts the following arguments.

scores: a 2-d array of differences.
rope: the region of practical equivalence. We consider two classifiers equivalent if the difference in their      
      performance is smaller than rope.
rho: correlation due to cross-validation
names: the names of the two classifiers; if x is a vector of differences, positive values mean that the second 
       (right) model had a higher score.

The hierarchical function uses STAN through Python module pystan.

Summarizing probabilities

Function hierarchical(scores,rope,rho, verbose, names=names) computes the Bayesian hierarchical test and returns the probabilities that the difference (the score of the first classifier minus the score of the first) is negative, within rope or positive.

In [8]:
import bayesiantests as bt    
rope=0.01 #we consider two classifers equivalent when the difference of accuracy is less that 1%
rho=1/10 #we are performing 10 folds, 10 runs cross-validation
pleft, prope, pright=bt.hierarchical(scores,rope,rho)

The first value (left) is the probability that the the differences of accuracies is negative (and, therefore, in favor of HNB). The third value (right) is the probability that the the differences of accuracies are positive (and, therefore, in favor of NBC). The second is the probability of the two classifiers to be practically equivalent, i.e., the difference within the rope.

In the above case, the HNB performs better than naive Bayes with a probability of 0.9965, and they are practically equivalent with a probability of 0.002. Therefore, we can conclude with high probability that HNB is better than NBC.

If we add arguments verbose and names, the function also prints out the probabilities.

In [9]:
pl, pe, pr=bt.hierarchical(scores,rope,rho, verbose=True, names=names)
P(HNB > NBC) = 0.998, P(rope) = 0.002, P(NBC > HNB) = 0.0

The posterior distribution can be plotted out:

  1. using the function hierarchical_MC(scores,rope,rho, names=names) we generate the samples of the posterior
  2. using the function plot_posterior(samples,names=('C1', 'C2')) we then plot the posterior in the probability simplex
In [10]:
%matplotlib inline
import matplotlib.pyplot as plt

samples=bt.hierarchical_MC(scores,rope,rho, names=names)

#plt.rcParams['figure.facecolor'] = 'black'

fig = bt.plot_posterior(samples,names)
It can be seen that the posterior mass is in the region in favor of HNB and so it confirms that the classifier is better than NBC. From the posterior we have also an idea of the magnitude of the uncertainty and the “stability” of our inference.

Checking sensitivity to the prior

To functions hierarchical allow also to test the efect of the prior hyperparameters. We point to the last reference for a discussion about prior sensitivity.


author = {{Benavoli}, A. and {Corani}, G. and {Demsar}, J. and {Zaffalon}, M.},
title = "{Time for a change: a tutorial for comparing multiple classifiers through Bayesian analysis}",
journal = {ArXiv e-prints},
archivePrefix = "arXiv",
eprint = {1606.04316},
year = 2016,
month = jun

title = { Statistical comparison of classifiers through Bayesian hierarchical modelling},
author = {Corani, Giorgio and Benavoli, Alessio and Demsar, Janez and Mangili, Francesca and Zaffalon, Marco},
url = {},
year = {2016},
date = {2016-01-01},
institution = {technical report IDSIA},
keywords = {},
pubstate = {published},
tppubtype = {article}

Leave a Reply

Your email address will not be published. Required fields are marked *


This site uses Akismet to reduce spam. Learn how your comment data is processed.