Hypothesis testing in machine learning, for instance to establish whether the performance of two algorithms differs significantly or to assess the (in)dependence between (mixed-type) variables, is usually performed using null hypothesis significance tests (NHST). Yet the NHST methodology has well-known drawbacks. First, statistical significance does not necessarily imply practical significance. Second, NHST cannot verify the null hypothesis and thus cannot recognize equivalent classifiers. Most importantly, NHST does not answer the researcher's actual question: what is the probability of the null and of the alternative hypothesis, given the observed data? Bayesian hypothesis tests overcome these problems. They compute the posterior probability of the null and of the alternative hypothesis, which makes it possible to detect equivalent classifiers and to claim only those significant differences that have a practical impact. We developed Bayesian counterparts of the tests most commonly adopted in machine learning, such as the correlated t-test and the signed-rank test, tests to detect (in)dependence, and many others.
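To make the idea of "posterior probability of the null and of the alternative" concrete, here is a minimal sketch of a Bayesian correlated t-test for cross-validated score differences between two classifiers. It is an illustration under stated assumptions, not the reference implementation: the function name `bayesian_correlated_ttest`, the region of practical equivalence (`rope`) of ±0.01, the correlation heuristic `rho = 1/k` for k-fold cross-validation, and the synthetic data are all choices made for this example. The posterior of the mean difference is taken to be a Student t distribution whose scale is inflated to account for the correlation between overlapping training sets.

```python
import numpy as np
from scipy import stats


def bayesian_correlated_ttest(diffs, rho, rope=0.01):
    """Posterior probabilities (left, rope, right) for the mean score difference.

    diffs: per-fold differences in score (classifier A minus classifier B).
    rho:   assumed correlation between folds, e.g. 1/k for k-fold CV.
    rope:  half-width of the region of practical equivalence (assumption).
    """
    diffs = np.asarray(diffs, dtype=float)
    n = diffs.size
    mean = diffs.mean()
    var = diffs.var(ddof=1)  # unbiased sample variance
    if var == 0.0:
        raise ValueError("all differences are identical; posterior is degenerate")

    # Student t posterior with n-1 degrees of freedom; the scale is
    # inflated by rho / (1 - rho) to account for correlated folds.
    scale = np.sqrt((1.0 / n + rho / (1.0 - rho)) * var)
    posterior = stats.t(df=n - 1, loc=mean, scale=scale)

    p_left = posterior.cdf(-rope)                        # B practically better
    p_rope = posterior.cdf(rope) - posterior.cdf(-rope)  # practically equivalent
    p_right = posterior.sf(rope)                         # A practically better
    return p_left, p_rope, p_right


# Synthetic example data (made up for illustration): 10 runs of 10-fold CV.
rng = np.random.default_rng(0)
diffs = rng.normal(loc=0.03, scale=0.02, size=100)
p_left, p_rope, p_right = bayesian_correlated_ttest(diffs, rho=1 / 10)
print(f"P(B better)={p_left:.3f}  P(equivalent)={p_rope:.3f}  P(A better)={p_right:.3f}")
```

Unlike a p-value, the three numbers can be read directly as answers to the researcher's question: a large `p_rope` supports practical equivalence, while a large `p_left` or `p_right` supports a difference that exceeds the chosen equivalence margin.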

See also our ECML tutorial, "Comparing competing algorithms: Bayesian versus frequentist hypothesis testing".

- Bayesian Kernelised Test of (In)dependence with Mixed-type Variables. The 8th IEEE International Conference on Data Science and Advanced Analytics (DSAA). Porto, Portugal, 2021.
- Bayesian Dependence Tests for Continuous, Binary and Mixed Continuous-Binary Variables. In: Entropy, 18 (9), pp. 1–24, 2016.
- Time for a change: a tutorial for comparing multiple classifiers through Bayesian analysis. In: Journal of Machine Learning Research, 18 (77), pp. 1–36, 2017.
- Statistical comparison of classifiers through Bayesian hierarchical modelling. In: Machine Learning, pp. 1–21, 2017, ISSN: 1573-0565.
- Joint Analysis of Multiple Algorithms and Performance Measures. In: New Generation Computing, 35 (1), pp. 69–86, 2016, ISSN: 1882-7055.
- A Bayesian approach for comparing cross-validated algorithms on multiple data sets. In: Machine Learning, 100 (2), pp. 285–304, 2015.
- A Bayesian nonparametric procedure for comparing algorithms. In: Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), pp. 1–9, 2015.
- A Bayesian Wilcoxon signed-rank test based on the Dirichlet process. In: Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pp. 1–9, 2014.