Bayesian optimization is an exciting idea in machine learning that has recently been used to achieve state-of-the-art results by modeling a performance metric (e.g. accuracy or mean squared error) as a function of model hyperparameters. Specifically, given a set of $n$ observations $\mathcal{D}=\left(\theta_i, y_i\right)_{i=1}^n$ (where $\theta_i$ denotes the $i$th configuration of hyperparameters and $y_i$ the corresponding metric value), Bayesian optimization constructs a posterior over functions using a Gaussian process, \begin{align} f \sim \text{GP}\left(\mu\left(\cdot\right), k\left(\cdot, \cdot\right)\right), \end{align} where $\mu\left(\cdot\right)$ and $k\left(\cdot, \cdot\right)$ are the prior mean and covariance functions, respectively. The Gaussian process posterior has a closed-form expression (refer to Rasmussen and Williams). In this post I want to describe how I think Bayesian optimization can be leveraged for the design of hypothesis tests.
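To make the surrogate-modeling step concrete, here is a minimal sketch. It uses scikit-learn's GaussianProcessRegressor purely for illustration (not any particular Bayesian optimization package), and the hyperparameter values and metric observations below are invented.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Hypothetical observations: theta_i is a one-dimensional hyperparameter
# (say, a log learning rate) and y_i is the corresponding validation metric.
theta = np.array([[-4.0], [-3.0], [-2.0], [-1.0], [0.0]])
y = np.array([0.61, 0.72, 0.81, 0.78, 0.55])

# GP prior with an RBF covariance function k(., .); the prior mean mu(.) is zero here.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-6)
gp.fit(theta, y)

# Closed-form posterior mean and standard deviation at new hyperparameter settings.
theta_new = np.linspace(-5.0, 1.0, 50).reshape(-1, 1)
mean, std = gp.predict(theta_new, return_std=True)
```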
Consider a likelihood ratio test: Given a null hypothesis set $\Theta_0\subset \Theta$, we wish to test the hypothesis $\theta\in\Theta_0$ versus the alternative $\theta\not\in\Theta_0$. The likelihood ratio statistic is \begin{align} \Lambda\left(\mathbf{X}\right) = \frac{\sup_{\theta\in\Theta_0} L\left(\theta;\mathbf{X}\right)}{\sup_{\theta\in\Theta} L\left(\theta;\mathbf{X}\right)}, \end{align} where $L\left(\theta;\mathbf{X}\right)$ is the likelihood function for the parameter $\theta$ given observational data $\mathbf{X}$. When $\left|\Theta_0\right|$ is small (and for many tests this cardinality equals one), controlling the Type-I error of the test is fairly straightforward: we need to identify a threshold $c$ such that the likelihood ratio statistic falls below $c$ with probability $\alpha$ under the null. For a size-$\alpha$ test, this probability must equal exactly $\alpha$; for a level-$\alpha$ test, it need only be bounded above by $\alpha$.
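As a toy illustration (my own example, not drawn from a reference), take $H_0: \mu = 0$ for Gaussian data with known unit variance, in which case $\Lambda\left(\mathbf{X}\right) = \exp\left(-n\bar{X}^2/2\right)$. The threshold $c$ can then be estimated as the empirical $\alpha$-quantile of $\Lambda$ under the null by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha = 30, 0.05

def lr_statistic(x):
    # Lambda(X) for H0: mu = 0 with known unit variance: the numerator fixes
    # mu = 0 while the denominator plugs in the MLE mu_hat = mean(x).
    return np.exp(-0.5 * len(x) * np.mean(x) ** 2)

# Monte Carlo draws of Lambda(X) under the null to estimate its alpha-quantile.
sims = np.array([lr_statistic(rng.normal(0.0, 1.0, n)) for _ in range(100_000)])
c = np.quantile(sims, alpha)  # reject H0 when Lambda(X) < c
print(f"estimated threshold c = {c:.4f}")
```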
In general, we will have to solve the following optimization problem in order to design a likelihood ratio test with a desired Type-I error rate:
\begin{align}
&\min_{\theta} ~ c\left(\theta\right) \\
\text{subject to:} ~~~~~~~~ & \theta\in\Theta_0, \\
&\text{Pr}_{\theta}\left[\Lambda\left(\mathbf{X}\right) < c\left(\theta\right)\right] = \alpha.
\end{align}
Here, $c$ is written as a function of $\theta$ to make explicit that the $\alpha$-quantile of $\Lambda\left(\mathbf{X}\right)$ depends on the true parameter $\theta$ under which the data are generated.
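Continuing the toy Gaussian setup, but now with the composite null $\Theta_0 = \left\{\mu \le 0\right\}$ so that $c\left(\theta\right)$ genuinely varies over the null set (here $\Lambda\left(\mathbf{X}\right) = \exp\left(-n\max\left(\bar{X}, 0\right)^2/2\right)$), this is how $c\left(\theta\right)$ can be treated as a black box: simulate data under a candidate $\theta\in\Theta_0$, compute $\Lambda$ for each draw, and take the empirical $\alpha$-quantile. Again, this is my own illustrative construction.

```python
import numpy as np

rng = np.random.default_rng(1)
n, alpha, n_sims = 30, 0.05, 20_000

def lr_statistic(x):
    # Composite null Theta_0 = {mu <= 0} with known unit variance:
    # the restricted MLE is min(mean(x), 0); the unrestricted MLE is mean(x).
    xbar = np.mean(x)
    mu0 = min(xbar, 0.0)
    return np.exp(-0.5 * len(x) * (xbar - mu0) ** 2)

def c_of_theta(theta):
    # Black-box evaluation of c(theta): the empirical alpha-quantile of Lambda(X)
    # when the data are generated under the null parameter theta.
    sims = np.array([lr_statistic(rng.normal(theta, 1.0, n)) for _ in range(n_sims)])
    return np.quantile(sims, alpha)

# Brute-force view of c(theta) over a grid of null parameters; the threshold
# shrinks as theta approaches the boundary of Theta_0 at zero.
for theta in [-1.0, -0.5, -0.1, 0.0]:
    print(f"theta = {theta:+.1f}, c(theta) = {c_of_theta(theta):.4f}")
```

Grid search is fine here because $\theta$ is one-dimensional, but each evaluation requires many simulations, which is exactly what motivates a sample-efficient optimizer.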
In general, this optimization problem cannot be solved analytically. How, then, might we go about identifying the worst-case parameter? One idea is to leverage the work of Gardner et al. on Bayesian optimization under constraints. In essence, we want to treat the $\alpha$-quantile of the likelihood ratio statistic as a black-box function that we can iteratively optimize using Bayesian optimization. I suspect there is good underlying structure in these likelihood ratio tests, such as smooth dependence of the quantile on $\theta$, that a Bayesian optimization routine could exploit. The constraint-learning idea comes into play when it is difficult to automatically generate points in the null hypothesis set: rather than wasting evaluations on infeasible points, one can explicitly model the null hypothesis set (or the feasible region, as one might say in the optimization domain).
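Here is a minimal sketch of that outer loop. To be clear about what is assumed: this is a simplified stand-in rather than Gardner et al.'s algorithm (it enumerates candidate points in $\Theta_0$ on a grid instead of learning the feasible region, and it uses plain expected improvement), it recycles the composite Gaussian black box from the previous sketch, and scikit-learn's GP is again standing in for a dedicated Bayesian optimization library.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(2)
n, alpha_level, n_sims = 30, 0.05, 20_000

def c_of_theta(theta):
    # Black-box alpha-quantile of Lambda(X) under the null parameter theta for the
    # composite Gaussian example above, sampling the sample mean xbar ~ N(theta, 1/n).
    xbar = rng.normal(theta, 1.0 / np.sqrt(n), n_sims)
    lam = np.exp(-0.5 * n * np.maximum(xbar, 0.0) ** 2)
    return np.quantile(lam, alpha_level)

# Candidate null parameters (Theta_0 = [-1, 0] on a grid); in Gardner et al.'s
# constrained setting, membership in Theta_0 would itself be modeled.
candidates = np.linspace(-1.0, 0.0, 101).reshape(-1, 1)

# A few initial evaluations of the black box.
theta_obs = np.array([[-1.0], [-0.5], [-0.05]])
c_obs = np.array([c_of_theta(t) for t in theta_obs.ravel()])

for _ in range(10):
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.3), alpha=1e-4,
                                  normalize_y=True)
    gp.fit(theta_obs, c_obs)

    # Expected improvement for minimizing c(theta) over the candidate grid.
    mean, std = gp.predict(candidates, return_std=True)
    best = c_obs.min()
    z = (best - mean) / np.maximum(std, 1e-12)
    ei = (best - mean) * norm.cdf(z) + std * norm.pdf(z)

    # Evaluate the black box at the most promising candidate and append it.
    theta_next = candidates[np.argmax(ei)]
    theta_obs = np.vstack([theta_obs, theta_next])
    c_obs = np.append(c_obs, c_of_theta(theta_next[0]))

print(f"worst-case null parameter ~ {theta_obs[np.argmin(c_obs), 0]:+.3f}, "
      f"threshold c ~ {c_obs.min():.4f}")
```

One caveat worth flagging: because $c\left(\theta\right)$ is estimated by simulation, each evaluation is noisy; the GP's jitter term absorbs some of this, but a noise-aware acquisition function would be the more principled choice.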
In a future post I’ll try to implement this strategy in some simple examples to see how it performs. I’ll use my Bayesian optimization framework Thor for this purpose.