Building a computational model of the behaviour of human beings is the ambition behind “strong” or “cognitive” artificial intelligence. The typical approach is to start from the ground up, as Alan Turing proposed in 1950, attempting to model an adult by modelling learning:
Instead of trying to produce a programme to simulate the adult mind, why not rather try to produce one which simulates the child’s? If this were then subjected to an appropriate [experience] one would obtain the adult brain.
By applying machine learning techniques of the kind used in speech and language processing to increasingly natural types of large language corpora—standing in for the early experience of the child—my modelling work seeks to build systems that learn something about language, mostly focused on speech perception.
Human speech perception, during the first year of life, adapts to the sounds of the inventory of the native language (Werker and Tees 1984). How can we build computer models that, when exposed to recordings of speech in a language, learn to perceive speech sounds like humans do?
Early modelling work in this domain used simplified, artificial data sets (Guenther and Gjaja 1996; Vallabha et al. 2007). For example, in Dillon, Dunbar, and Idsardi 2013, we built a Bayesian model to learn vowel sounds. Rather than taking complete, continuous speech recordings as input, the model, like other models of that era, took individual acoustic measurements, specifically tuned for vowel sounds, made by hand after identifying the vowels in the speech stream—simplifying the problem greatly.
While this simplified data is not realistic for modelling real human speech processing, it is useful for explaining the nature of the problem. Infant learners, who do not have (much) consistent or accessible information about the meanings of words, bathe in a sea of sound, much like an adult listener exposed to an unfamiliar language. To make the first steps in cracking the speech code, infants likely rely on finding statistically coherent patterns (“distributional learning”: Maye, Werker, and Gerken 2002). The figure at right shows simple acoustic measures extracted from vowel sounds in Inuktitut. There are three clear clusters (upper left, upper right, bottom centre), which, happily, correspond to the three phonemic vowel sounds of Inuktitut ([i], [u], [a]).
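The clustering intuition can be made concrete with a toy example. The sketch below is only an illustration: the formant values are synthetic numbers invented here (not the Inuktitut measurements), the learner is told in advance that there are three categories, and plain k-means stands in for the Bayesian model. The helper names (`sq_dist`, `farthest_point_init`, `kmeans`) are hypothetical.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two (F1, F2) points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def farthest_point_init(points, k):
    """Deterministic seeding: start at points[0], then repeatedly add the
    point that is farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids

def kmeans(points, k, n_iter=10):
    """Plain k-means; returns (centroids, labels). No category labels are used."""
    centroids = farthest_point_init(points, k)
    labels = []
    for _ in range(n_iter):
        # Assignment step: each token goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its tokens.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = (sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members))
    return centroids, labels

# Synthetic (F1, F2) tokens in Hz: invented values, for illustration only.
rng = random.Random(0)
TRUE_MEANS = {"i": (300, 2300), "u": (320, 800), "a": (750, 1400)}
tokens, vowels = [], []
for vowel, (f1, f2) in TRUE_MEANS.items():
    for _ in range(50):
        tokens.append((rng.gauss(f1, 20), rng.gauss(f2, 40)))
        vowels.append(vowel)

centroids, labels = kmeans(tokens, k=3)
for f1, f2 in centroids:
    print(round(f1), round(f2))
```

With clusters this well separated, the learner recovers the three categories from the distribution of the measurements alone. Real speech is much less tidy, and the number of categories is not known in advance.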
It becomes possible to apply machine learning techniques to raw speech corpora, rather than collections of hand-extracted measurements, once we take advantage of standard automatic speech recognition (ASR) techniques, which are, in some ways, similar to human speech processing:
Importantly, however, the comparison quickly breaks down:
The spirit of Turing's famous benchmark for a “thinking” machine is that, if a machine can think like a human, then it should act indistinguishably from a human—notably, it should be indistinguishable from a human if we engage it in conversation. Whether our interest in copying humans is in order to make “thinking” interlocutors out of spoken dialogue systems like Siri or, rather, in truly reverse engineering the human (for example, modelling language acquisition), there need to be benchmarks to assess whether the machine succeeds.
But a test consisting of undirected conversation is not a particularly efficient or reliable way of collecting data about the humanlikeness of a system. Better would be to test specific aspects of human behaviour in a controlled fashion—in other words, to do targeted psychological experiments eliciting behaviour we know to be particularly informative. By sticking to very simple experimental paradigms, and by testing models that work directly from the signal up, we can use the exact same stimuli and the exact same experimental task on models as we would on humans—in fact, we can directly compare the two.
To test whether a speech perception model succeeds in reaching the adult listener state, we use standard phonetic discrimination tasks. In these tasks, we measure how good listeners are at telling different pairs of speech sounds apart. Our approach is to do this not on just any sounds, but on stimuli that we expect to be discriminated differently depending on what has been learned by the model.
Since we know that human speech perception is shaped by the native language of the listener, in Millet, Jurov, and Dunbar (2019), we ran a cross-linguistic ABX phone discrimination task on humans and machine-learned models. In an ABX discrimination task, participants hear two stimuli, A and B (for example, shig–shoog), followed by a probe, X, which is always of a kind with either A or B (for example, another instance of shig). Participants are not told the intended spelling of the stimuli, and, based only on what they hear, must match X to one of A or B, and their accuracy is scored. Higher or lower scores indicate better or worse discriminability between categories A and B (between the ih sound, phonetically transcribed [ɪ], and the oo sound, transcribed [ʊ], in this example). In our cross-linguistic task, A and B were uttered by French–English bilinguals, so that one of the two was always a “French non-word,” with a French vowel sound (such as chappe, transcribed [ʃap]) and the other was always an “English non-word,” with an English vowel sound—including English vowel sounds known to be particularly difficult for French listeners (such as shup, transcribed [ʃʌp]). Both French and English listeners were tested.
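Scoring such a task amounts to computing, for each contrast, the proportion of trials on which the listener matched X to the correct stimulus. A minimal sketch follows, assuming a hypothetical trial format (the field names `contrast`, `x_matches`, and `response`, and the contrast labels, are invented for illustration and are not the format of the actual data set):

```python
from collections import defaultdict

def accuracy_by_contrast(trials):
    """Proportion of correct ABX responses for each contrast.

    Each trial is a dict with:
      'contrast'  - a label for the A-B category pair (hypothetical labels)
      'x_matches' - 'A' or 'B': which stimulus X is of a kind with
      'response'  - 'A' or 'B': which stimulus the listener chose
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for trial in trials:
        totals[trial["contrast"]] += 1
        if trial["response"] == trial["x_matches"]:
            hits[trial["contrast"]] += 1
    return {contrast: hits[contrast] / totals[contrast] for contrast in totals}

# Made-up responses, for illustration only.
trials = [
    {"contrast": "ih-oo", "x_matches": "A", "response": "A"},
    {"contrast": "ih-oo", "x_matches": "B", "response": "A"},
    {"contrast": "uh-ah", "x_matches": "A", "response": "A"},
    {"contrast": "uh-ah", "x_matches": "B", "response": "B"},
]
scores = accuracy_by_contrast(trials)
print(scores)
```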
The figure at left shows a correlation across the two listener groups, with a large number of pairs of sounds that were easy for both groups (for example, [æ]–[ɛ]; pairs always stated in the order English–French), some sounds that were difficult for both groups (for example, [æ]–[a]), and, notably, two pairs of sounds with much better performance for English speakers than for French speakers: [ɪ]–[ɛ] and [ʌ]–[a].
Difficulties for these pairs make sense if the Perceptual Assimilation Model (PAM: Best 1994) is correct: sounds in a foreign language will be completely assimilated to (perceived as) instances of the most similar native-language category. [ɪ] (the vowel sound in English kit) and [ʌ] (the vowel sound in English strut) do not exist as distinct vowel sounds in French. Although they sound similar to the French [ɛ] and [a] sounds, English speakers will hear them as different, because they correspond to distinct vowel sounds in English: [ɪ] contrasts with a sound very much like French [ɛ] (Rick–wreck), and [ʌ] with a sound similar to French [a] (dud–dad). But in French, in each of these two cases, there is no distinction in the words of the language, and, so goes the reasoning of the Perceptual Assimilation Model, French speakers will perceive these pairs as being the same sound. PAM is not an implemented computational model, however. Whether we can build a quantitative model that actually predicts this similarity-based perceptual assimilation remains to be seen.
To learn about the sounds of a language by training on a database of raw speech recordings in that language, Chen et al. (2015) proposed to simply apply an off-the-shelf statistical clustering algorithm (Dirichlet process Gaussian mixture modelling: DPGMM) to the entire set of acoustic frames in a database of speech. The only processing that is done to the database is to convert the wav files into MFCC features (see above), in addition to some standard speech processing tricks for normalizing the representation for acoustic differences between speakers. These techniques do not require any transcripts. The model then learns to cluster together spectrally similar acoustic frames (the individual, minimal, 10-millisecond slices of spectral information into which the corpus is analyzed when it is converted into audio features). In this way, they obtained a good encoding of relevant linguistic information in the Zero Resource Speech Challenge 2015. Because of these good scores, recently, Schatz et al. (2019) proposed that infants may be using a technique like DPGMM to begin learning the sounds of the language they are hearing around them.
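One common way to use a trained mixture model of this kind is to represent each acoustic frame by its vector of posterior probabilities over the learned clusters. The sketch below shows only that final step, for a fixed mixture of diagonal-covariance Gaussians whose parameters are simply given. It does not perform Dirichlet process inference (which also learns the number of components), and the function names and toy parameter values are assumptions made for illustration.

```python
import math

def log_gauss_diag(frame, mean, var):
    """Log density of a diagonal-covariance Gaussian at one frame."""
    return -0.5 * sum(
        math.log(2 * math.pi * v) + (x - m) ** 2 / v
        for x, m, v in zip(frame, mean, var)
    )

def frame_posteriors(frame, weights, means, variances):
    """Posterior probability of each mixture component given one frame."""
    log_joint = [
        math.log(w) + log_gauss_diag(frame, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    # Subtract the maximum before exponentiating, for numerical stability.
    top = max(log_joint)
    unnorm = [math.exp(lj - top) for lj in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Toy two-component mixture over two-dimensional "frames" (invented values).
weights = [0.5, 0.5]
means = [(0.0, 0.0), (4.0, 4.0)]
variances = [(1.0, 1.0), (1.0, 1.0)]

print(frame_posteriors((0.0, 0.0), weights, means, variances))
print(frame_posteriors((2.0, 2.0), weights, means, variances))
```

A frame lying on one component's mean is assigned almost entirely to that component, while a frame halfway between the two is split evenly, which is the gradient notion of similarity that the clustering provides.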
To see if we could learn to perceive like adult English and French speakers using this very simple method, we trained a DPGMM on an English corpus, simulating English listeners, and on a French corpus, simulating French listeners. The trained models were used to map the experimental stimuli into a new perceptual space. Using distances between the A, B, and X stimuli in this space, predicted accuracies can be generated. In the figure at right, a “native language effect” is calculated for models (on the x-axis) and for humans (on the y-axis), by subtracting the English (models’ or speakers’) accuracy from the French accuracy. Each point shows a distinct experimental item (one A–B–X sequence as heard by participants); a weak but statistically robust relation is seen. However, in the figure at left, native language effects are averaged by vowel pair, and we see that the correlation between model and human is, in fact, merely item-specific. The model is not at all capturing the systematic differences for the two vowel pairs, [ɪ]–[ɛ] and [ʌ]–[a], representing the most salient differences between listener groups here.
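The logic of deriving a prediction from distances, and of comparing models with humans, can be sketched as follows. This is a simplified illustration rather than the evaluation code used in the study: each stimulus is reduced to a single fixed-length vector (real stimuli are sequences of frames), Euclidean distance is an arbitrary choice, and all function names are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def abx_score(a, b, x, dist=euclidean):
    """Score one A-B-X item in which X is of a kind with A.

    Returns 1.0 if the model places X closer to A than to B,
    0.0 if it places X closer to B, and 0.5 for a tie.
    """
    d_ax, d_bx = dist(a, x), dist(b, x)
    if d_ax < d_bx:
        return 1.0
    if d_ax > d_bx:
        return 0.0
    return 0.5

def native_language_effect(french_accuracy, english_accuracy):
    """French accuracy minus English accuracy, as in the text."""
    return french_accuracy - english_accuracy

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# One invented item: X lies close to A in the model's space.
print(abx_score(a=(0.0, 0.0), b=(1.0, 1.0), x=(0.1, 0.0)))
```

Given per-item effects for a model and for human listeners, `pearson(model_effects, human_effects)` gives the kind of item-level relation plotted in the figure. Averaging the effects by vowel pair before correlating corresponds to the second analysis.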
We have since run this experiment with a number of other trained models, including standard ASR models. Although these are trained using transcriptions inaccessible to the child, they do perceive by mapping to the discrete phonemic categories of the language via a gradient “similarity,” just like the basic intuition behind PAM. We have found no model yet which captures the L2 perceptual assimilation effect observed in this data while still maintaining a good correlation with humans in predicting the gradient, inter-item differences. We do not know why, and are currently testing alternative explanations. If we succeed in capturing data of this kind, we will have constructed the first quantitatively predictive model of (second-language) speech perception applicable directly to raw speech audio files. We will therefore have passed one mini Turing test.
We have also implemented other models, more sophisticated in their approach than the simple DPGMM approach. The model of Thiollière, Dunbar, Synnaeve, Versteegh, and Dupoux (2015) instantiated the lexical distributional learning hypothesis (Feldman et al. 2013; Yeung and Werker 2009): it tried to group together acoustic patterns in a way that would help it to learn words.
Beginning with unannotated recordings of spontaneous English speech, the model first discovered recurring acoustic patterns that might be instances of the same “word.” Then, a feed-forward neural network learned a novel “abstract” representation of speech audio optimized to predict which stretches of speech are the same word, and which different, according to this rough “proto-lexicon.”
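Deciding whether two stretches of speech might be the same “word” requires comparing sequences of different durations. Dynamic time warping (DTW), which is named in the title of the Thiollière et al. paper, is a standard tool for this. The sketch below is the generic textbook algorithm, not the pattern-discovery system used in that work, and the one-dimensional “frames” are a simplification for illustration.

```python
def dtw_distance(seq_a, seq_b, frame_dist):
    """Cost of the best monotonic alignment between two frame sequences.

    cost[i][j] holds the cheapest alignment of the first i frames of seq_a
    with the first j frames of seq_b.
    """
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of: stretching seq_b's frame, stretching
            # seq_a's frame, or advancing in both sequences.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def abs_diff(x, y):
    """Distance between two one-dimensional frames."""
    return abs(x - y)

# The second sequence is a slowed-down version of the first.
print(dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0], abs_diff))
```

Two renditions of the same pattern at different speaking rates align at low cost, whereas different patterns do not, which is what makes a threshold on this cost usable as a rough “same word” detector.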
Examining any pair of same-“word” stretches of signal, the system sought to find a non-linear transformation of the signal such that, if applied to both of the members of the pair, the result would be two minimally different (ideally identical) representations. The transformation was thus updated so as to make the discovered “same” pairs’ representations more similar, and “different” pairs’ representations less similar (a “siamese network”: Becker and Hinton 1992; Synnaeve et al. 2014: see figure at right).
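The training objective can be illustrated with a small sketch. The same transformation is applied to both members of a pair, and the loss rewards similar outputs for “same” pairs and dissimilar outputs for “different” pairs. The single tanh layer and the specific cosine-based loss below are illustrative assumptions, not necessarily the architecture or loss used in the paper, and the function names are hypothetical.

```python
import math

def embed(frame, weights):
    """Shared non-linear transformation: one tanh layer.

    `weights` is a list of rows; the same weights are used for both
    members of a pair, which is what makes the network "siamese".
    """
    return [math.tanh(sum(w * x for w, x in zip(row, frame))) for row in weights]

def cosine(u, v):
    """Cosine similarity (undefined if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pair_loss(u, v, same):
    """Low when "same" pairs are aligned and "different" pairs are orthogonal."""
    c = cosine(u, v)
    return (1.0 - c) / 2.0 if same else c ** 2

def siamese_loss(x1, x2, same, weights):
    """Loss for one pair of input frames passed through the shared network."""
    return pair_loss(embed(x1, weights), embed(x2, weights), same)

identity = [[1.0, 0.0], [0.0, 1.0]]
print(siamese_loss((1.0, 0.0), (1.0, 0.0), True, identity))
print(siamese_loss((1.0, 0.0), (0.0, 1.0), False, identity))
```

In training, the weights would be adjusted by gradient descent to reduce this loss summed over many discovered pairs. That optimisation step is omitted here.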
This model got the second-highest score in the Zero Resource Speech Challenge 2015. The big idea is to benefit from the fact that words repeat, and, because words are longer than individual speech sounds, pairs of different words are likely to be easier to identify in the signal than pairs of sounds. (See Riad et al., 2018, for further research on this model.)
A different approach follows ideas going back to the Motor Theory of speech perception (Liberman 1963), claiming that human perceptual representations are in fact grounded in articulation. In Chaabouni, Dunbar, Zeghidour, and Dupoux (2017), we trained a siamese neural network to learn a combined auditory–visual based representation, on the basis of a speech corpus of videos with the speakers’ lips highlighted. Including the lip videos in training led the model to learn more distinct representations for contrasts between [labial] and non-[labial] consonants than training based on audio alone. It also led to a more consistent representation of the feature [labial] (see Decoding neural network representations).
Building on this, in current work, we are beginning to use articulatory reconstruction models to yield articulation-based perceptual representations. These are models trained on articulatory corpora (in our case, speech paired with the concurrent movements of electromagnetic articulography, or EMA, coils) to infer the articulatory trajectories from the audio. In principle, once trained, such a model could do articulatory reconstruction for an arbitrary new speech signal. We have recently proposed a new method for assessing the quality of reconstruction in unknown data, which will permit us, in ongoing follow-up research, to develop phonetic acquisition models based on mapping to articulatory representations.
These are only two potential types of phonetic acquisition models. There are far too many hypothetical models and variants for any one research group to test. This is part of the motivation for the Zero Resource Speech Challenge, or ZeroSpeech (Versteegh et al., 2015; Dunbar et al., 2017; Dunbar et al., 2019). ZeroSpeech is a machine learning challenge—participants work on common data sets with the goal of training the best system on a task. Our participants must develop a system capable of learning about the phonetics of a language by exposure only to audio recordings. Submitted systems can learn any kind of representation, from numerical vector representations to “pseudo-textual” transcriptions in discovered phone units.
By setting up this challenge, we cast a wide net for systems that succeed at phonetic learning in a “realistic” setting. (On “realistic”: at a minimum, the interactive and emotional components of early experience are missing from our data sets. This is nevertheless comparatively “realistic,” as opposed to the annotated and textual data sets used to construct traditional ASR systems. Passive exposure to language is, however, the current gold standard input in language acquisition modelling, despite its obvious limitations.)
We have just begun testing a new set of experimental speech perception stimuli, drawn from the speech corpora used in the ZeroSpeech 2017 Challenge, on English and French speakers. Up to now, seventeen systems have been submitted to the ZeroSpeech 2017 Challenge. Submitters were required to publicly provide their models’ decodings of the complete test corpora, allowing us to immediately extract predictions from seventeen competing quantitative theories of phonetic learning and compare them directly with adult speech perception. In principle, any experimental speech discrimination data could be used as such a benchmark, including the data set currently being collected as part of the GEOMPHON project testing synchronic explanations of sound inventory typology.
Becker, S., and Hinton, G. (1992). Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature 355(6356): 161.
Benkí, J. R. (2001). Place of articulation and first formant transition pattern both affect perception of voicing in English. Journal of Phonetics 29:1–22.
Best, C. T. (1994). The emergence of native-language phonological influences in infants: A perceptual assimilation model. In The development of speech perception: The transition from speech sounds to spoken words, 167(224), 233–277.
Chaabouni, R., Dunbar, E., Zeghidour, N., and Dupoux, E. (2017). Learning weakly supervised multimodal phoneme embeddings. In Proc. INTERSPEECH 2017.
Chen, H., Leung, C. C., Xie, L., Ma, B., & Li, H. (2015). Parallel inference of Dirichlet process Gaussian mixture models for unsupervised acoustic modeling: A feasibility study. In Proc. INTERSPEECH 2015.
Chomsky, N. (1965). Aspects of the theory of syntax. Cambridge, MA: MIT Press.
Clements, G. N. (2003). Feature economy in sound systems. Phonology, 20(3), 287–333.
Dale, R., & Duran, N. D. (2011). The cognitive dynamics of negated sentence verification. Cognitive Science 35(5): 983–996.
Dillon, B., Dunbar, E., and Idsardi, W. (2013). A single-stage approach to learning phonological categories. Cognitive Science 37(2): 344–377.
Dillon, B., and Wagers, M. (2019). Approaching gradience in acceptability with the tools of signal detection theory. Ms.
Dunbar, E. (2019). Generative grammar, neural networks, and the implementational mapping problem: Response to Pater. Language 95(1): e87–e98.
Dunbar, E., Algayres, R., Karadayi, J., Bernard, M., Benjumea, M., Cao, X-N., Miskic, L., Dugrain, C., Ondel, L., Black, A. W., Besacier, L., Sakti, S., and Dupoux, E. (2019). The Zero Resource Speech Challenge 2019: TTS without T. INTERSPEECH 2019: 20th Annual Congress of the International Speech Communication Association.
Dunbar, E., Cao, X-N., Benjumea, J., Karadayi, J., Bernard, M., Besacier, L., Anguera, X., and Dupoux, E. (2017). The Zero-Resource Speech Challenge 2017. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).
Dunbar, E., and Dupoux, E. (2016). Geometric constraints on human speech sound inventories. Frontiers in Psychology: Language Sciences 7, article 1061.
Dunbar, E., Synnaeve, G., and Dupoux, E. (2015). Quantitative methods for comparing featural representations. In Proceedings of ICPhS.
Dyer, C., Kuncoro, A., Ballesteros, M., & Smith, N. (2016). Recurrent neural network grammars. Preprint arXiv:1602.07776.
Feldman, N., Griffiths, T., Goldwater, S., & Morgan, J. (2013). A role for the developing lexicon in phonetic category acquisition. Psychological Review 120(4): 751–778.
Fodor, J., & Pylyshyn, Z. (1988). Connectionism and cognitive architecture: A critical analysis. Cognition 28(1-2): 3–71.
Gelman, A. (2015). The connection between varying treatment effects and the crisis of unreplicable research: A Bayesian perspective. Journal of Management, 41, 632–643.
Guenther, F., and Gjaja, M. (1996). The perceptual magnet effect as an emergent property of neural map formation. The Journal of the Acoustical Society of America 100(2): 1111–1121.
Gulordava, K., Bojanowski, P., Grave, E., Linzen, T., and Baroni, M. (2018). Colorless green recurrent networks dream hierarchically. Preprint: arXiv:1803.11138.
Ioannidis, J. P. (2005). Why most published research findings are false. PLoS Med, 2(8), e124.
Le Godais, G., Linzen, T., and Dupoux, E. (2017). Comparing character-level neural language models using a lexical decision task. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers: 125–130.
Li, B., and Zen, H. (2016). Multi-Language Multi-Speaker Acoustic Modeling for LSTM-RNN based Statistical Parametric Speech Synthesis. Proc. INTERSPEECH 2016.
Liberman, A. M. (1963). A motor theory of speech perception. In Proceedings of the Speech Communication Seminar, Stockholm. Speech Transmission Lab.
Linzen, T., Dupoux, E., and Goldberg, Y. (2016). Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics 4: 521–535.
Mackie, S., & Mielke, J. (2011). Feature economy in natural, random, and synthetic inventories. In Clements, G. N., and Ridouane, R. Where Do Phonological Features Come From?: Cognitive, physical and developmental bases of distinctive speech categories, John Benjamins Publishing. 43–63.
Maldonado, M., Dunbar, E., and Chemla, E. (2019). Mouse tracking as a window into decision making. Behavior Research Methods 51(3): 1085–1101.
Marcus, G. (2001). The algebraic mind. Cambridge: MIT Press.
Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. Cambridge: MIT Press.
Maye, J., Werker, J., and Gerken, L. (2002). Infant sensitivity to distributional information can affect phonetic discrimination. Cognition 82: B101–B111.
McCoy, R. T., Linzen, T., Dunbar, E., and Smolensky, P. (2019). RNNs implicitly represent tensor product representations. ICLR (International Conference on Learning Representations).
Mikolov, T., Yih, W. T., and Zweig, G. (2013). Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 746–751.
Millet, J., Jurov, N., and Dunbar, E. (2019). Comparing unsupervised speech learning directly to human performance in speech perception. In Proceedings of the 41st Annual Meeting of the Cognitive Science Society (Cog Sci 2019).
Palangi, H., Smolensky, P., He, X., & Deng, L. (2018). Question-answering with grammatically-interpretable representations. In Thirty-Second AAAI Conference on Artificial Intelligence.
Pashler, H., & Wagenmakers, E. J. (2012). Editors’ introduction to the special section on replicability in psychological science: A crisis of confidence? Perspectives on Psychological Science, 7(6), 528–530.
Riad, R., Dancette, C., Karadayi, J., Zeghidour, N., Schatz, T., & Dupoux, E. (2018). Sampling strategies in Siamese Networks for unsupervised speech representation learning. Interspeech 2018.
Riochet, R., Castro, M. Y., Bernard, M., Lerer, A., Fergus, R., Izard, V., & Dupoux, E. (2018). IntPhys: A framework and benchmark for visual intuitive physics reasoning. Preprint: arXiv:1803.07616.
Schatz, T., Feldman, N., Goldwater, S., Cao, X-N., and Dupoux, E. (2019). Early phonetic learning without phonetic categories: Insights from machine learning. Preprint: https://psyarxiv.com/fc4wh/
Smolensky, P. 1990. Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial Intelligence, 46(1-2): 159–216.
Synnaeve, G., Schatz, T., and Dupoux, E. (2014). Phonetics embedding learning with side information. In 2014 IEEE Spoken Language Technology Workshop (SLT), 106–111.
Thiolliere, R., Dunbar, E., Synnaeve, G., Versteegh, M., and Dupoux, E. (2015). A hybrid dynamic time warping-deep neural network architecture for unsupervised acoustic modeling. Proc. INTERSPEECH 2015.
Turing, A. M. 1950. Computing machinery and intelligence. Mind 59(236): 433–460.
Vallabha, G., McClelland, J. L., Pons, F., Werker, J., and Amano, S. (2007). Unsupervised learning of vowel categories from infant-directed speech. Proceedings of the National Academy of Sciences 104(33): 13273–13278.
Versteegh, M., Thiolliere, R., Schatz, T., Cao, X. N., Anguera, X., Jansen, A., & Dupoux, E. (2015). The Zero Resource Speech Challenge 2015. Proc. INTERSPEECH 2015.
Warstadt, A., Singh, A., & Bowman, S. R. (2019). Neural Network Acceptability Judgments. Transactions of the Association for Computational Linguistics 7: 625–641.
Yeung, H., and Werker, J. (2009). Learning words’ sounds before learning how words sound: 9-month-olds use distinct objects as cues to categorize speech information. Cognition