Hypothesis testing in machine learning, for instance to establish whether the performance of two algorithms is significantly different, is usually performed using null hypothesis significance tests (NHST). Yet the NHST methodology has well-known drawbacks. For instance, a claimed statistical significance does not necessarily imply practical significance. Moreover, NHST cannot verify the null hypothesis and thus cannot recognize equivalent classifiers. Most importantly, NHST does not answer the researcher's actual question: what is the probability of the null and of the alternative hypothesis, given the observed data? Bayesian hypothesis tests overcome these problems. They compute the posterior probability of the null and the alternative hypothesis. This makes it possible to detect equivalent classifiers and to claim statistical significance that has a practical impact. We developed Bayesian counterparts of the tests most commonly adopted in machine learning, such as the correlated t-test and the signed-rank test. We have also implemented these tests for the most common platforms (R, Python, etc.); the implementations are available on GitHub.
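To make the idea of posterior probabilities for the null and alternative hypotheses concrete, here is a minimal sketch of a Bayesian correlated t-test in plain Python. It is an illustration under stated assumptions, not the code of the released packages. It assumes that the posterior of the mean difference between two classifiers is a Student t distribution with n-1 degrees of freedom, centred on the sample mean, with its variance inflated by a correlation `rho` between cross-validation folds. The function names, the `rope` parameter (a hypothetical "region of practical equivalence" that plays the role of the null hypothesis), and the example data are all made up for illustration.

```python
import math
from statistics import mean, variance


def student_t_cdf(x, df, steps=2000):
    """CDF of a standard Student t distribution.

    Uses the substitution t = sqrt(df) * tan(theta), which turns the CDF
    into a bounded integral of cos(theta)**(df - 1), evaluated with
    Simpson's rule (steps must be even).
    """
    norm = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2))
    norm /= math.sqrt(math.pi)
    upper = math.atan(abs(x) / math.sqrt(df))
    h = upper / steps
    total = 1.0 + math.cos(upper) ** (df - 1)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * math.cos(i * h) ** (df - 1)
    area = norm * total * h / 3  # probability mass between 0 and |x|
    return 0.5 + area if x >= 0 else 0.5 - area


def bayesian_correlated_ttest(diffs, rho, rope):
    """Posterior probabilities that the mean difference lies below,
    inside, or above the interval [-rope, rope].

    diffs: per-fold performance differences (classifier A minus B).
    rho:   assumed correlation between folds, 0 <= rho < 1.
    rope:  half-width of the region of practical equivalence (assumption).
    """
    n = len(diffs)
    x_bar = mean(diffs)
    s2 = variance(diffs)  # sample variance with n - 1 denominator
    if s2 == 0:
        raise ValueError("differences must not all be identical")
    scale = math.sqrt((1 / n + rho / (1 - rho)) * s2)
    df = n - 1
    p_left = student_t_cdf((-rope - x_bar) / scale, df)
    p_right = 1 - student_t_cdf((rope - x_bar) / scale, df)
    p_rope = 1 - p_left - p_right
    return p_left, p_rope, p_right


# Hypothetical accuracy differences from one run of 10-fold cross-validation.
diffs = [0.02, 0.01, 0.03, -0.01, 0.02, 0.04, 0.00, 0.01, 0.03, 0.02]
p_left, p_rope, p_right = bayesian_correlated_ttest(diffs, rho=1 / 10, rope=0.01)
print(f"P(B better)={p_left:.3f}  P(equivalent)={p_rope:.3f}  P(A better)={p_right:.3f}")
```

The three numbers sum to one and can be read directly as the probability that one classifier is practically better, that the two are practically equivalent, or that the other is better. Setting `rho` to the fraction of data used for testing in each fold (here 1/10) is a common heuristic and is treated as an assumption in this sketch.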

See also our tutorial at ECML, "Comparing competing algorithms: Bayesian versus frequentist hypothesis testing".

- Time for a change: a tutorial for comparing multiple classifiers through Bayesian analysis. In: Journal of Machine Learning Research, 18 (77), pp. 1–36, 2017.
- Statistical comparison of classifiers through Bayesian hierarchical modelling. In: Machine Learning, pp. 1–21, 2017, ISSN: 1573-0565.
- A Bayesian approach for comparing cross-validated algorithms on multiple data sets. In: Machine Learning, 100 (2), pp. 285–304, 2015.
- A Bayesian nonparametric procedure for comparing algorithms. In: Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), pp. 1–9, 2015.
- A Bayesian Wilcoxon signed-rank test based on the Dirichlet process. In: Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pp. 1–9, 2014.