Hypothesis testing in machine learning (for instance, to establish whether the performance of two algorithms is significantly different) is usually performed using null hypothesis significance tests (NHST). Yet the NHST methodology has well-known drawbacks. For instance, a statistically significant result does not necessarily imply practical significance. Moreover, NHST cannot verify the null hypothesis and thus cannot recognize equivalent classifiers. Most importantly, NHST does not answer the researcher's actual question: what is the probability of the null and of the alternative hypothesis, given the observed data? Bayesian hypothesis tests overcome these problems. They compute the posterior probability of the null and of the alternative hypothesis. This makes it possible to detect equivalent classifiers and to claim statistical significance only when it has a practical impact. We developed Bayesian counterparts of the tests most commonly adopted in machine learning, such as the correlated t-test and the signed-rank test. We have also implemented these tests for the most common platforms (R, Python, etc.); the implementations are available on GitHub.
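To make the idea of posterior probabilities for the null and alternative hypotheses concrete, here is a minimal Python sketch of a Bayesian correlated t-test on cross-validation results. It is an illustration under stated assumptions, not the released implementation: the function name `bayesian_correlated_ttest`, the default region of practical equivalence (`rope=0.01`), and the Monte Carlo approximation are all choices made for this example. It assumes the posterior of the mean difference is a Student t distribution with n - 1 degrees of freedom, centred on the sample mean, with squared scale (1/n + rho/(1 - rho)) times the sample variance, where rho is the correlation induced by overlapping training sets (taken here as 1/k for k-fold cross-validation).

```python
import math
import random
import statistics


def bayesian_correlated_ttest(diffs, rho, rope=0.01, n_samples=50_000, seed=0):
    """Approximate (p_left, p_rope, p_right) for the mean score difference.

    diffs : per-fold score differences (algorithm A minus algorithm B)
    rho   : assumed correlation between folds (e.g. 1/k for k-fold CV)
    rope  : half-width of the region of practical equivalence (assumption)
    """
    n = len(diffs)
    mean = statistics.fmean(diffs)
    var = statistics.variance(diffs)  # sample variance (n - 1 denominator)
    df = n - 1
    # Scale of the Student t posterior, inflated to account for correlation.
    scale = math.sqrt((1.0 / n + rho / (1.0 - rho)) * var)

    rng = random.Random(seed)
    left = inside = 0
    for _ in range(n_samples):
        # Student t draw: standard normal divided by sqrt(chi-square / df).
        z = rng.gauss(0.0, 1.0)
        chi2 = rng.gammavariate(df / 2.0, 2.0)
        mu = mean + scale * z / math.sqrt(chi2 / df)
        if mu < -rope:
            left += 1
        elif mu <= rope:
            inside += 1
    right = n_samples - left - inside
    return left / n_samples, inside / n_samples, right / n_samples


# Hypothetical accuracy differences from one run of 10-fold cross-validation.
diffs = [0.05, 0.06, 0.04, 0.05, 0.07, 0.05, 0.06, 0.04, 0.05, 0.06]
p_left, p_rope, p_right = bayesian_correlated_ttest(diffs, rho=1 / 10)
print(f"P(B better)={p_left:.3f}  P(equivalent)={p_rope:.3f}  P(A better)={p_right:.3f}")
```

The three returned numbers are the posterior probabilities that the second algorithm is practically better, that the two are practically equivalent, and that the first is practically better. A large probability inside the region of practical equivalence is exactly the kind of statement NHST cannot make. The sampling loop could be replaced by a closed-form Student t cumulative distribution function where one is available.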

See also our ECML tutorial, "Comparing competing algorithms: Bayesian versus frequentist hypothesis testing".