How to use objective Bayesian inference to compare binomial proportions

Consider two independent binomial distributions with probabilities of successes p_1 and p_2. If we observe a_1 successes, b_1 failures from the first distribution and a_2 successes, b_2 failures from the second distribution, what can we say about the difference, p_1 - p_2?

As far as I know, binomial proportion differences like this were first studied by Laplace in 1778. Laplace observed that the ratio of boys-to-girls born in London was notably larger than the ratio of boys-to-girls born in Paris, and he sought to determine whether the difference was significant.

Table 1: Birth records for London and Paris. Ratio represents #boys/#girls [1, p. 59]

Using what would now be called Bayesian inference together with a uniform prior, Laplace computed the posterior probability that the birth ratio in London was less than the birth ratio in Paris as

where

Laplace concluded

❝

there is therefore odds of more than four hundred thousand against one that the births of boys are more facile in London than in Paris. Thus we can regard as a very probable thing that there exists, in the first of these two cities, a cause more than in the second, which facilitates the births of boys, and which depends either on the climate or on the nourishment of the mothers. [1, p. 61]

Pierre-Simon Laplace, Mémoire sur les probabilités

In this blog post, I want to revisit this problem from Laplace with the latest techniques from objective Bayesian inference ([2]).

Note: For an introduction to objective Bayesian inference, see An Introduction To Objective Bayesian Inference and [3].

Like Laplace, I’ll try to provide an answer for the case “when we have nothing given a prior” ([1, p. 26]). As we now know, though, the uniform prior leads to results that depend arbitrarily on the scale used to measure a parameter, and the uniform prior often doesn’t do a good job of representing lack of prior knowledge.

Suppose, for instance, that L(θ; y) is the likelihood function for a model and we use the prior π(θ) ∝1. Then the posterior probability that θ is between a and b is given by

where Z is some normalizing constant. Now suppose that φ is a monotonically increasing transformation onto the domain of θ and we reparameterize the model using the transformation θ = φ(u). Then a simple application of integration by substitution shows us that the corresponding posterior probability in u becomes

Thus the uniform prior in θ is equivalent to the prior

in u.

See [10] and [11] for more detailed criticisms of the uniform prior

Fortunately, reference analysis ([2, part 3]) provides us with a much better and less arbitrary way to build priors “when we have nothing given a prior”.

What does it mean for a prior to be noninformative?

Suppose we’re given a model with a single parameter θ together with a prior, π(θ). How might we test if the prior represents “knowing nothing”.

Let’s run a simple experiment. We’ll fix θ to a particular value, θ_true, and then randomly sample observations, y = (y_1, . . ., y_n)^T, from the distribution P( · |θ_true). Now, let’s compute the two-tailed credible set [θ_a, θ_b] that contains 95% of the probability mass of the posterior

Does [θ_a, θ_b] contain θ_true? Suppose we repeat the experiment many times and vary θ_true and the number of observations, n. If π is a good noninformative prior, then [θ_a, θ_b] will consistently contain θ_true approximately 95% of the time regardless of the value of θ_true or n.

This metric is called frequentist matching coverage and it is extensively used in the Bayesian literature to benchmark noninformative priors (See [2, ch. 5] and [3]). We can intuitively think of frequentist matching coverage as providing an answer to the question “How accurate are the Bayesian credible sets produced using a given prior?”.

How can we build noninformative priors?

In the case of a model with a single parameter, asymptotically optimal frequentist matching coverage performance is obtained with Jeffreys prior

where I(θ) denotes the Fisher information matrix ([4]).

When the model has more than one parameter, the situation becomes more complicated; and unfortunately the good properties of Jeffreys prior for the single parameter case do not carry over to the multiparameter case. Berger et al. explain

❝

We have seen three important examples — the normal model with both parameters unknown, the Neyman-Scott problem and the multinomial model — where the multiparameter form of the Jeffreys-rule prior fails to provide appropriate objective posteriors. We actually know of no multiparameter example in which the Jeffreys-rule prior has been verified to be satisfactory. In higher dimensions, the prior always seems to be either ‘too diffuse’ as in the normal examples, or ‘too concentrated’ as the multinomial example [2, p. 117]

James Berger, Jose Bernardo, and Dongchu Sun, Objective Bayesian Inference

And unlike in the single parameter case, finding a matching prior that optimizes matching coverage isn’t always doable, as James Berger et al note

❝

Matching priors and invariance priors in multiparameter problems are typically suitable, but they are often unavailable (e.g., the solutions to the system of differential equations may not be available, or the model may not have a suitable group invariance structure) [3].

James Berger, Jose Bernardo, and Dongchu Sun, Objective bayesian inference and its relationship to frequentism

However, reference analysis provides us with a general way to produce a prior that almost always performs well. Reference analysis can be thought of as a refinement of Jeffreys prior. For the single parameter models, it produces the same prior as Jeffreys rule (provided the Fisher information matrix exists); for multiparameter models it proceeds sequentially, building up a multiparameter prior by repeated application of the single parameter case.

Suppose, for example, that

is a likelihood function for a model of two parameters. We start by identifying the parameter of interest, say θ_1, and the nuisance parameter, θ_2. We then develop a prior

for the single parameter conditional likelihood function

Then we marginalize out θ_2 to get an integrated likelihood function of a single parameter

and we can now apply the single parameter version of reference analysis to L^I to produce a prior π_1(θ_1). The final priors is given by

If possible, this one-at-a-time algorithm is the best way to produce a reference prior; Berger et al note

❝

There is a more general reference prior method called the one-at-a-time reference prior algorithm. … This is, indeed, our recommended reference prior algorithm, since it is based on one-dimensional information matrices, which all perspectives suggest are optimal [3].

James Berger, Jose Bernardo, and Dongchu Sun, Objective bayesian inference and its relationship to frequentism

Suppose I(θ_1, θ_2) is the 2x2 Fisher information matrix. Put

Then the reference prior is given by

(Provided that π_2 is proper). See Proposition 0.2 of [3] and chapter 10 of [1].

Let’s return to the delta binomial model to build a noninformative prior.

Noninformative Priors for the Delta Binomial Model

Both Jeffreys rule prior and reference priors use the Fisher information matrix, so let’s derive that.

Let f denote the log likelihood for the model. We have'

Differentiating f gives us

where n_l = b_l + g_l and n_p = b_p + g_p.

Filling in the Fisher information matrix, we have

Note: Here we’ve applied these formulas for the mean and variance of a binomial distribution with parameter p: Mean= n p, Variance = n p (1-p)

Thus, we derive Jeffreys prior as

and, using the asymptotically normal approximation with θ as the parameter of interest, the reference prior as

where r = n_l / n_p.

Frequentist Matching Coverage Testing

Let’s run a simulation to test the frequentist matching coverage performance for Jeffreys prior, the reference prior, and Laplace’s original uniform prior, π_L.

We can use this simple algorithm to compute exact values for the frequentist coverage of a given prior:

With modern numerical algorithms, we can compute the probabilities deterministically and to a high level of accuracy.

I won’t go into the details of the numerical methods here, but see my blog post How to Efficiently Approximate a Function of One or More Variables and [9].

Using the Python package bbai to do the computations, here is what an implementation of the algorithm looks like

from bbai.model import DeltaBinomialModel
import math

alpha = 0.95
low = (1 - alpha) / 2.0
high = 1.0 - low

def coverage(prior, theta, x, n1, n2):
    model = DeltaBinomialModel(prior=prior)
    p1 = theta + x
    p2 = x
    theta = p1 - p2
    res = 0.0
    for k1 in range(n1+1):
        prob1 = math.comb(n1, k1) * p1 ** k1 * (1 - p1)**(n1 - k1)
        for k2 in range(n2+1):
            prob = prob1 * math.comb(n2, k2) * p2 ** k2 * (1 - p2)**(n2 - k2)
            model.fit(k1, n1-k1, k2, n2-k2)
            cdf = model.cdf(theta)
            if low <= cdf and cdf <= high:
                res += prob
    return res

And below are the coverage results for various values of θ, x, n_1, and n_2.

Table 2: Frequentist matching coverage results for the Laplace, Jeffreys, and reference prior and various parameters. Values closer to 0.95 indicate a better “noninformative” prior.

While for many parameters the priors give similar and good results, we can see that coverage for the uniform prior is poor when θ is close to the extremes ±1. Coverage for these values of θ improve for Jeffreys prior, and they are the best for the reference priors.

Source code for the full simulation: https://github.com/rnburn/bbai/blob/master/example/21-delta-binomial-coverage.ipynb

Visualizing the Priors and Posteriors

Let’s plot out the priors and their posteriors.

Here’s what Jeffreys prior and the reference prior look like:

Figure 1: Log PDF for Jeffreys prior and the reference prior with n1/n2=2.9.

We see that both priors distribute more probability mass towards the extremes of θ and x.

Now, let’s look at the priors once we marginalize out x

Figure 2: Marginal PDFs for Laplace’s uniform prior, Jeffreys prior, and the reference prior with n1/n2=2.9.

The differences between the three priors becomes clearer. Laplace’s prior approaches zero towards the extremes, θ=±1, explaining its poor coverage performance for those values. In contrast, Jeffreys prior approaches a constant towards the extremes and distributes the most probability mass towards θ=0; and the reference prior approaches ∞ towards the extremes, while being relatively flat elsewhere.

To visualize the posteriors, we’ll randomly sample values from a distribution with the parameters p_1=0.75, p_2=0.25 and n_1=n_2=1, 2, 3.

Figure 3: Posterior PDFs for various observations using Laplace’s prior, Jeffreys prior, and the reference prior.

London-Paris Birth Rates

Let’s turn back to the problem Laplace studied and see what answers we get with the other priors.

We can use the below snippet of code to compute the posterior probability that the birth rate in Paris is greater than the birth rate in London.

from bbai.model import DeltaBinomialModel

a1 = 737629
b1 = 698958
a2 = 251527
b2 = 241945

model = DeltaBinomialModel(prior='uniform')
model.fit(a1, b1, a2, b2)
print('uniform:', model.cdf(0.0))

model = DeltaBinomialModel(prior='jeffreys')
model.fit(a1, b1, a2, b2)
print('jeffreys:', model.cdf(0.0))

model = DeltaBinomialModel(prior='reference')
model.fit(a1, b1, a2, b2)
print('reference:', model.cdf(0.0))

Outputs

uniform: 2.7153683841791673e-06
jeffreys: 2.7155789605440635e-06
reference: 2.8874214109908736e-06

Note: For comparison the answer Laplace derived for the uniform prior was 1/410458 ≈ 2.435e-06 [1, p. 61].

As we might expect, for a large amount of data and a difference in probability close to zero, the choice between the three priors doesn’t have much impact on the final result.

Remember, the difference between the priors is largest when p1-p2 is close to ±1

To see how the answer might evolve as more and more data is observed, we can run a small simulation where we randomly sample from two binomial distributions with probabilities a1/(a1+b1) ≈ 0.5135 and a2/(a2+b2) ≈ 0.5097.

from bbai.model import DeltaBinomialModel
import numpy as np

a1 = 737629
b1 = 698958
a2 = 251527
b2 = 241945
p1 = a1 / (a1 + b1)
p2 = a2 / (a2 + b2)
priors = ['uniform', 'jeffreys', 'reference']

def run_sim(n):
    y1 = np.random.binomial(n, p1)
    y2 = np.random.binomial(n, p2)
    res = []
    for prior in priors:
        model = DeltaBinomialModel(prior=prior)
        model.fit(y1, n-y1, y2, n-y2)
        val = model.cdf(0.0)
        res.append(val)
    return res

np.random.seed(0)
ns = [1, 10, 100, 1e3, 1e4, 1e5, 1e6]
res = [
        ['Laplace'], ['Jeffreys'], ['Reference'],
]
for n in ns:
    col = run_sim(n)
    for i, val in enumerate(col):
        res[i].append(val)
print(res)

This produces the results

Table 3: Posterior probability that p1 < p2 for simulated data sets with different priors, p1_true = 0.5135, and p2_true = 0.5097.

Discussion

Q1: Why did Laplace use a uniform prior?

Laplace doesn’t provide much of an explanation. He writes

❝

When we have nothing given a priori on the possibility of an event, it is necessary to assume all the possibilities, from zero to unity, equally probable; thus, observation can alone instruct us on the ratio of the births of boys and of girls, we must, considering the thing only in itself and setting aside the events, to assume the law of possibility of the births of a boy or of a girl constant from zero to unity, and to start from this hypothesis into the different problems that we can propose on this object. [1, p. 26]

Pierre-Simon Laplace, Mémoire sur les probabilités

Q2: Is the uniform prior a reasonable choice?

I don’t think it’s a good choice if you truly want to represent “knowing nothing”. The coverage experiment shows that the uniform prior performs very poorly when |p1-p2| is close to 1.

Of course, it would be rather ridiculous for |p1-p2| to be close to 1 if the probabilities represent birth ratios. So, I think the uniform prior could be a reasonable subjective choice for Laplace’s problem.

Q3: Does reference analysis produce a unique prior?

Reference analysis can best be thought of as a framework to produce good noninformative priors. It doesn’t necessarily produce a single unique prior.

For multivariable problems, for example, the resulting prior might depend on the order of integration used or whether you use the one-at-a-time method or the asymptotically normal approximation, Proposition 0.2 of [3].

Q4: Will reference priors give the best frequentist matching performance?

Not always. For example, for the bivariate normal model, reference analysis will produce a prior with good frequentist matching performance; but other methods can produce priors with exact frequentist matching performance (although the exact priors have marginalization paradoxes).

See [5] for details.

Q5: Does reference analysis produce “objective” result?

It depends on what your notion of “objective” is.

In [6], James Berger identifies four levels of objectivity. Reference analysis clearly doesn’t satisfy the strongest level

❝

1. A major goal of statistics (indeed science) is to find a completely coherent objective Bayesian methodology for learning from data [6].

James Berger, The case for objective Bayesian analysis

That goal is probably out of reach for any statistical methodology; but it can satisfy a weaker level of objectivity,

❝

3. Objective Bayesian analysis is a convention we should adopt in scenarios in which a subjective analysis is not tenable.

James Berger, The case for objective Bayesian analysis

And I would argue that it has a much stronger case for objectivity that other methods such as P-values where the result can vary dramatically based on the hidden experimental intensions of the investigator (see [7] and [8]).

Conclusion

In this blog post, we saw how we can combine reference analysis with modern numerical methods to objectively answer a statistical inference problem and we saw how to experimentally verify that the result reasonably express the absence of prior knowledge. As we saw, reference analysis is quite versatile and provides a general-purpose framework for statistical inference that can be used as a replacement for P-values and confidence intervals.

References

[1]: Laplace, P. (1778). Mémoire sur les probabilités. Translated by Richard J. Pulskamp.

[2]: Berger, J., J. Bernardo, and D. Sun (2024). Objective Bayesian Inference. World Scientific.

[3]: Berger, J., J. Bernardo, and D. Sun (2022). Objective bayesian inference and its relationship to frequentism.

[4]: Welch, B. L. and H. W. Peers (1963). On formulae for confidence points based on integrals of weighted likelihoods. Journal of the Royal Statistical Society Series B-methodological 25, 318–329.

[5]: James O. Berger. Dongchu Sun. Objective priors for the bivariate normal model. Ann. Statist. 36 (2) 963–982, April 2008.

[6]: Berger, J. (2006). The case for objective Bayesian analysis. Bayesian Analysis 1(3), 385–402.

[7] Lindley, D. V., and L. D. Phillips. Inference for a Bernoulli Process (A Bayesian View). The American Statistician 30, no. 3 (1976): 112–19.

[8]: Berger, J. O. and D. A. Berry (1988). Statistical analysis and the illusion of objectivity. American Scientist 76(2), 159–165.

[9]: Trefethen Lloyd N. Approximation Theory and Approximation Practice, Extended Edition. USA: Society for Industrial and Applied Mathematics, 2019.

[10]: Fisher, R. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 222, 309–368.

[11]: Zabell, S. (1989). R. A. Fisher on the History of Inverse Probability. Statistical Science 4(3), 247–256.

[12]: Paintings Paris and London were both made in the 18th century and can be considered in the public domain, as they were published more than 95 years ago.