Three important properties of Bayesian inference

Ewan Dunbar and Tal Linzen
Laboratoire de Sciences Cognitives et Psycholinguistique, École Normale Supérieure

The interactive app is available here.

The slides from MFM are available here.

A fake experiment and a simple model

We imagine a very simple phonotactic learning experiment in which subjects are presented with example nonce words (‘training’), and we want to see how they generalize.

The simple version we start with is the setting of Linzen and O’Donnell (2015), where we assume that people form classes predicated on the initial segment. That is, they form some generalization about what the initial segment can be.

For our purposes (to avoid dealing with a more complex model), we assume they form a single class. Classes are predicated on a set of 21 binary features adapted from Bruce Hayes’ textbook, as shown below under Features.

The Bayesian model has two parts. First, the likelihood. This assigns uniform probability to all the segments in a class. For example, a class picking out voiceless stops, like [-continuant, -voice], would assign uniform probability to {p, t, k, tʃ}.

A class picking out [-continuant] would, in contrast, assign uniform probability to {p, t, k, tʃ}, plus the nasals {m, n, ŋ} and the voiced stops {b, d, g, dʒ}.
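As a minimal sketch (not the authors’ code), the likelihood can be written as a function that spreads probability uniformly over a class’s extension and assigns zero outside it. The extensions below are the two discussed above.

```python
def likelihood(segment, extension):
    """P(segment | class): uniform over the segments the class picks out."""
    if segment in extension:
        return 1 / len(extension)
    return 0.0

# Extensions for the two classes discussed above.
voiceless_stops = {"p", "t", "k", "tʃ"}                                   # [-continuant, -voice]
noncontinuants = voiceless_stops | {"m", "n", "ŋ", "b", "d", "g", "dʒ"}   # [-continuant]

likelihood("p", voiceless_stops)  # 1/4 = 0.25
likelihood("p", noncontinuants)   # 1/11 ≈ 0.09
```

Note that [p] is already more probable under the narrower class, which is the seed of the size-principle effect discussed at the end of this section.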

The prior distribution of a class is decomposed as:

\[p(class) = p(values|features)p(features)\]

That is, we pick a feature matrix, which has values (+/-) filled in for some particular features. Not all of the 21 features appear in every matrix. In fact, since each feature can be absent, +, or -, there are \(3^{21}\) logically possible class specifications, and all but \(2^{21}\) of them omit at least one feature.
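As a quick sanity check on this counting, assuming each of the 21 features is either absent or valued +/-:

```python
# Each feature is absent, +, or -: three options per feature.
total_specs = 3 ** 21        # all logically possible class specifications
fully_specified = 2 ** 21    # specifications where every feature has a value
underspecified = total_specs - fully_specified

total_specs       # 10460353203
fully_specified   # 2097152
```

So the fully specified matrices make up only about 0.02% of all specifications.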

The probability of choosing a set of features is \(p(features)\). We take each of the 21 features to be included in the set independently with probability 0.5. This implies that every set of features has the same probability: \(0.5^{21}\).

The probability of a valuation for a given set of features is \(p(values|features)\). We take each included feature to receive its value (+ or -) independently and uniformly, with probability 0.5 each. Features that are not included do not need values (we can say they receive a dummy value with probability 1).

Thus, for example, the probability of [-continuant, -voice] is \(0.5^{21}\times 0.5^{2}\).
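Under the decomposition above, the prior for any class is easy to compute: the feature-set term is the same \(0.5^{21}\) for every class, and the valuation term contributes one factor of 0.5 per included feature. A sketch (the dict-based class representation is my own convention, not the authors’):

```python
N_FEATURES = 21

def prior(class_spec):
    """p(class) = p(features) * p(values | features).

    class_spec: dict mapping feature name -> '+'/'-',
    e.g. {'continuant': '-', 'voice': '-'} for [-continuant, -voice].
    """
    p_features = 0.5 ** N_FEATURES     # identical for every subset of features
    p_values = 0.5 ** len(class_spec)  # one independent +/- choice per included feature
    return p_features * p_values

prior({"continuant": "-", "voice": "-"})  # 0.5**21 * 0.5**2 ≈ 1.19e-07
```

This makes explicit that the prior depends only on how many features a class specifies, not on which ones or on how many segments the class picks out.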


A preference for more restrictive classes arises in the learner because of the likelihood function. We see this reflected in the posterior probabilities, given the observation [p], of several different classes which all have equal prior probability but different-sized extensions.
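This size-principle effect can be sketched directly. Take two hypothetical two-feature classes that both contain [p] and therefore have equal prior; after observing [p], the posterior is proportional to prior times likelihood, so the class with the smaller extension wins. The extensions below are illustrative stand-ins, not taken from the paper.

```python
def posterior_over(classes, observation, prior_value):
    """Posterior over classes: normalize prior * likelihood for one observation."""
    scores = {
        name: (prior_value / len(ext) if observation in ext else 0.0)
        for name, ext in classes.items()
    }
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}

prior_value = 0.5 ** 21 * 0.5 ** 2  # any two-feature class has this prior
classes = {
    "narrow (4 segments)": {"p", "t", "k", "tʃ"},
    "broad (9 segments)": {"p", "t", "k", "tʃ", "f", "θ", "s", "ʃ", "h"},
}
post = posterior_over(classes, "p", prior_value)
# post["narrow (4 segments)"] ≈ 0.69; post["broad (9 segments)"] ≈ 0.31
```

With equal priors, the posterior ratio is just the inverse ratio of extension sizes (9:4 in favor of the narrow class), which is the restrictiveness preference the text describes.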