Ewan Dunbar
ˈju ən ˈdʌn ˌbɑɹ
Speech · Phonology · Learning
Modelling speech perception and learning · Sound inventories · Low resource speech technology · Decoding neural network representations · Controllable speech synthesis · Open and replicable science

Current speech technology (automatic speech recognition, speech synthesis) depends completely on the availability of textually labelled data (speech recordings paired with textual or phonetic transcriptions), as well as on large amounts of text for training language models. This is a bubble waiting to burst. Today’s commercial spoken dialogue systems are concentrated on around fifty languages. These languages have a few things in common:

In the world’s thousands of low-resource languages and varieties, automatic speech recognition (ASR) and text-to-speech (TTS) are poor or non-existent. Text is the bottleneck.

Training speech technology without any text is the aim of the Zero Resource Speech Challenge (ZeroSpeech: www.zerospeech.com). The ultimate goal is to work incrementally towards an autonomously trained spoken dialogue system:

There have so far been three ZeroSpeech challenges. Track 1 asks participants to build a system that can discover some kind of text-like representation of its own from audio alone, rather than relying on the phonemic transcriptions typically provided as training input.
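
To make the task concrete, here is a deliberately naive sketch of what “discovering a text-like representation from audio only” could look like: cluster short-term spectral features with k-means and read the cluster indices off as a pseudo-transcription. The file name, the choice of MFCC features, and the 50-unit inventory are placeholder assumptions for illustration, not a challenge baseline.

```python
# A deliberately naive sketch: "discover" a discrete, text-like
# representation by clustering frame-level features of raw audio.
# "speech.wav", MFCC features, and the 50-unit inventory are
# illustrative choices, not the challenge baseline.
import librosa
import numpy as np
from sklearn.cluster import KMeans

audio, sr = librosa.load("speech.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13).T   # (n_frames, 13)

kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(mfcc)
units = kmeans.predict(mfcc)                               # one symbol per frame

# Collapse consecutive repeats to get something more "transcription-like".
pseudo_text = [int(u) for i, u in enumerate(units) if i == 0 or u != units[i - 1]]
print(pseudo_text[:20])
```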

For the purposes of cognitive modelling (see Modelling speech perception and learning), we would then like to probe this representation, to see whether it “behaves” as humans do when we inject it into simulated experimental tasks. For applied purposes, we would like to use the discovered representation as a pseudo-transcription and build on it to do standard natural language processing tasks:

In other words, we want participants to develop a system capable of learning a representation that codes the phonetics of a language through exposure to audio recordings alone. In the challenge, we evaluate “coding the phonetics” by assessing whether, after training on data from a language, the system is good at discriminating the lexically meaningful contrasts of that language when it recognizes new speech. For example, a perfect phonemic transcriber would pass the test: for a recording of the syllable [pa] it would output [pa], and for a recording of [ba] it would output [ba], a distinct object. Submitted systems need not be devices that learn to map to transcriptions; they can generate any kind of representation at all, including purely numerical ones, but they must make as few phoneme confusions as possible. The confusion rate expected from the relative acoustic distinctness of the phonemes alone, with no training at all, serves as the baseline to improve on.
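
The evaluation behind this is an ABX-style discrimination test: given a token A and a token B from two different categories (say, [pa] and [ba]) and a token X from the same category as A, the representation passes the trial if X is closer to A than to B. Below is a minimal sketch of one such trial, using dynamic time warping over frame-wise cosine distances; the distance functions and the way trials are aggregated in the actual challenge differ in detail.

```python
# Core of an ABX-style discrimination trial: does X land closer to A
# (same category) than to B (different category)? A, B, X are
# frame-by-feature arrays, e.g. the representations a system outputs.
import numpy as np

def cosine_dist(u, v):
    """Cosine distance between two feature frames."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

def dtw_distance(a, b):
    """Dynamic time warping over frame-wise cosine distances, normalised
    by path length so tokens of different durations compare fairly."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_dist(a[i - 1], b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m] / (n + m)

def abx_trial(A, B, X):
    """Return True if the representation gets this contrast right."""
    return dtw_distance(A, X) < dtw_distance(B, X)

# Toy example with random "representations" (in practice: model outputs
# for recordings of, e.g., [pa], [ba], and another [pa]).
rng = np.random.default_rng(0)
A, B, X = rng.normal(size=(12, 13)), rng.normal(size=(15, 13)), rng.normal(size=(11, 13))
print(abx_trial(A, B, X))
```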

Both the 2015 and 2017 challenges proposed a second sub-challenge (Track 2): unsupervised spoken term discovery, or word learning. The aim is to discover “words”, defined as recurring speech fragments, again without any textual labels and based only on raw speech. Systems take raw speech as input and output a list of speech fragments (timestamps referring to the original audio files) together with a discrete label indicating category membership, as in the sketch below.
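
Concretely, the output can be pictured as a set of hypothesised word classes, each containing fragments identified by file name and onset/offset times. The sketch below writes such a list in a simple class-file layout; the fragments are invented, and the exact submission format used by the challenge may differ in detail.

```python
# Illustration of spoken term discovery output: speech fragments
# (file name plus start/end times in seconds) grouped into discovered
# "word" classes. The fragments and labels here are made up.
discovered = {
    1: [("rec_014.wav", 3.21, 3.74), ("rec_102.wav", 0.88, 1.39)],
    2: [("rec_014.wav", 7.02, 7.35), ("rec_057.wav", 2.10, 2.46),
        ("rec_102.wav", 5.61, 5.95)],
}

# Write it out in a simple class-file layout: a header per class,
# one fragment per line, blank line between classes.
with open("discovered_terms.txt", "w") as out:
    for label, fragments in discovered.items():
        out.write(f"Class {label}\n")
        for filename, onset, offset in fragments:
            out.write(f"{filename} {onset:.2f} {offset:.2f}\n")
        out.write("\n")
```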

The 2019 edition adds a new task: TTS without T (text-to-speech without text). In addition to learning a better representation for speech (Track 1), participants must use this representation to do speech synthesis. Unlike in classical text-to-speech, the “text” is the transcription format invented by the learner (visit the leaderboard to hear audio samples).
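
A toy way to picture the synthesis half of the task: assign every spectrogram frame to a discrete unit (as in the clustering sketch above, but over spectrogram frames so the result can be inverted back to a waveform), then regenerate audio from the unit sequence alone by mapping each unit back to its average frame and inverting with Griffin-Lim. This is only an illustration under placeholder settings ("speech.wav", a 100-unit inventory); actual submissions used real speech synthesizers.

```python
# Toy "TTS without T": discretise spectrogram frames, then synthesise
# audio back from the discrete unit sequence alone. Illustrative only;
# "speech.wav" and the 100-unit inventory are placeholder choices.
import librosa
import numpy as np
import soundfile as sf
from sklearn.cluster import KMeans

audio, sr = librosa.load("speech.wav", sr=16000)
spec = np.abs(librosa.stft(audio, n_fft=512, hop_length=128)).T  # (frames, bins)

# The "invented transcription": one codebook index per frame.
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(spec)
units = kmeans.predict(spec)

# Synthesis from the units alone: each unit is replaced by its average
# spectral frame, and the spectrogram is inverted with Griffin-Lim.
reconstructed_spec = kmeans.cluster_centers_[units].T            # (bins, frames)
waveform = librosa.griffinlim(reconstructed_spec, hop_length=128, n_fft=512)
sf.write("resynthesis.wav", waveform, sr)
```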

An interesting result from the 2019 challenge is that representations which were more highly compressed with respect to the original audio (that is, which contained fewer bits of information) tended overall to yield poorer synthesis and poorer phonemic discriminability. There is one notable exception to this rule: the original, correct phonemic transcriptions yield good synthesis and good phonemic discriminability while being highly compressed representations of the original audio. Learning a phoneme-like, or text-like, representation that is both this compact and this useful still eludes us.
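
The degree of compression was quantified as a bitrate: roughly, the entropy of the unit distribution (bits per symbol) times the number of symbols per second. A back-of-the-envelope version of that computation, for a discrete unit sequence of the kind sketched above, would look like this (the official scoring differs in details such as how symbols are counted).

```python
# Rough bitrate of a discrete representation: entropy of the unit
# distribution (bits/symbol) times symbols per second. The official
# ZeroSpeech scoring differs in detail; this is the basic idea.
import numpy as np
from collections import Counter

def bitrate(units, duration_seconds):
    """units: sequence of discrete symbols; duration: total audio length in seconds."""
    counts = np.array(list(Counter(units).values()), dtype=float)
    probs = counts / counts.sum()
    entropy_bits = -np.sum(probs * np.log2(probs))        # bits per symbol
    symbols_per_second = len(units) / duration_seconds
    return entropy_bits * symbols_per_second              # bits per second

# Example: 2000 symbols drawn from a 50-unit inventory over 20 s of speech.
rng = np.random.default_rng(0)
example_units = rng.integers(0, 50, size=2000)
print(f"{bitrate(example_units, 20.0):.1f} bits/s")
```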

References

Becker, S., and Hinton, G. (1992). Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature 355(6356): 161.

Benkí, J. R. (2001). Place of articulation and first formant transition pattern both affect perception of voicing in English. Journal of Phonetics 29: 1–22.

Best, C. T. (1994). The emergence of native-language phonological influences in infants: A perceptual assimilation model. In The development of speech perception: The transition from speech sounds to spoken words, 167(224), 233–277.

Chaabouni, R., Dunbar, E., Zeghidour, N., and Dupoux, E. (2017). Learning weakly supervised multimodal phoneme embeddings. In Proc. INTERSPEECH 2017.

Chen, H., Leung, C. C., Xie, L., Ma, B., and Li, H. (2015). Parallel inference of Dirichlet process Gaussian mixture models for unsupervised acoustic modeling: A feasibility study. In Proc. INTERSPEECH 2015.

Chomsky, N. (1965). Aspects of the theory of syntax. Cambridge, MA: MIT Press.

Clements, G. N. (2003). Feature economy in sound systems. Phonology 20(3): 287–333.

Dale, R., and Duran, N. D. (2011). The cognitive dynamics of negated sentence verification. Cognitive Science 35(5): 983–996.

Dillon, B., Dunbar, E., and Idsardi, W. (2013). A single-stage approach to learning phonological categories. Cognitive Science 37(2): 344–377.

Dillon, B., and Wagers, M. (2019). Approaching gradience in acceptability with the tools of signal detection theory. Ms. https://people.umass.edu/bwdillon/publication/dillonwagers_inprep/dillonwagers_inprep.pdf

Dunbar, E. (2019). Generative grammar, neural networks, and the implementational mapping problem: Response to Pater. Language 95(1): e87–e98.

Dunbar, E., Algayres, R., Karadayi, J., Bernard, M., Benjumea, M., Cao, X.-N., Miskic, L., Dugrain, C., Ondel, L., Black, A. W., Besacier, L., Sakti, S., and Dupoux, E. (2019). The Zero Resource Speech Challenge 2019: TTS without T. In Proc. INTERSPEECH 2019.

Dunbar, E., Cao, X.-N., Benjumea, J., Karadayi, J., Bernard, M., Besacier, L., Anguera, X., and Dupoux, E. (2017). The Zero-Resource Speech Challenge 2017. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

Dunbar, E., and Dupoux, E. (2016). Geometric constraints on human speech sound inventories. Frontiers in Psychology: Language Sciences 7, article 1061.

Dunbar, E., Synnaeve, G., and Dupoux, E. (2015). Quantitative methods for comparing featural representations. In Proceedings of ICPhS.

Dyer, C., Kuncoro, A., Ballesteros, M., and Smith, N. (2016). Recurrent neural network grammars. Preprint: arXiv:1602.07776.

Feldman, N., Griffiths, T., Goldwater, S., and Morgan, J. (2013). A role for the developing lexicon in phonetic category acquisition. Psychological Review 120(4): 751–778.

Fodor, J., and Pylyshyn, Z. (1988). Connectionism and cognitive architecture: A critical analysis. Cognition 28(1–2): 3–71.

Gelman, A. (2015). The connection between varying treatment effects and the crisis of unreplicable research: A Bayesian perspective. Journal of Management 41: 632–643.

Guenther, F., and Gjaja, M. (1996). The perceptual magnet effect as an emergent property of neural map formation. The Journal of the Acoustical Society of America 100(2): 1111–1121.

Gulordava, K., Bojanowski, P., Grave, E., Linzen, T., and Baroni, M. (2018). Colorless green recurrent networks dream hierarchically. Preprint: arXiv:1803.11138.

Ioannidis, J. P. (2005). Why most published research findings are false. PLoS Medicine 2(8): e124.

Le Godais, G., Linzen, T., and Dupoux, E. (2017). Comparing character-level neural language models using a lexical decision task. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, 125–130.

Li, B., and Zen, H. (2016). Multi-language multi-speaker acoustic modeling for LSTM-RNN-based statistical parametric speech synthesis. In Proc. INTERSPEECH 2016.

Liberman, A. M. (1963). A motor theory of speech perception. In Proceedings of the Speech Communication Seminar, Stockholm. Speech Transmission Lab.

Linzen, T., Dupoux, E., and Goldberg, Y. (2016). Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics 4: 521–535.

Mackie, S., and Mielke, J. (2011). Feature economy in natural, random, and synthetic inventories. In Clements, G. N., and Ridouane, R. (eds.), Where Do Phonological Features Come From?: Cognitive, physical and developmental bases of distinctive speech categories. John Benjamins, 43–63.

Maldonado, M., Dunbar, E., and Chemla, E. (2019). Mouse tracking as a window into decision making. Behavior Research Methods 51(3): 1085–1101.

Marcus, G. (2001). The algebraic mind. Cambridge, MA: MIT Press.

Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. Cambridge, MA: MIT Press.

Maye, J., Werker, J., and Gerken, L. (2002). Infant sensitivity to distributional information can affect phonetic discrimination. Cognition 82: B101–B111.

McCoy, R. T., Linzen, T., Dunbar, E., and Smolensky, P. (2019). RNNs implicitly represent tensor product representations. ICLR (International Conference on Learning Representations).

Mikolov, T., Yih, W. T., and Zweig, G. (2013). Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 746–751.

Millet, J., Jurov, N., and Dunbar, E. (2019). Comparing unsupervised speech learning directly to human performance in speech perception. In Proceedings of the 41st Annual Meeting of the Cognitive Science Society (CogSci 2019).

Palangi, H., Smolensky, P., He, X., and Deng, L. (2018). Question-answering with grammatically-interpretable representations. In Thirty-Second AAAI Conference on Artificial Intelligence.

Pashler, H., and Wagenmakers, E. J. (2012). Editors’ introduction to the special section on replicability in psychological science: A crisis of confidence? Perspectives on Psychological Science 7(6): 528–530.

Riad, R., Dancette, C., Karadayi, J., Zeghidour, N., Schatz, T., and Dupoux, E. (2018). Sampling strategies in Siamese networks for unsupervised speech representation learning. In Proc. INTERSPEECH 2018.

Riochet, R., Castro, M. Y., Bernard, M., Lerer, A., Fergus, R., Izard, V., and Dupoux, E. (2018). IntPhys: A framework and benchmark for visual intuitive physics reasoning. Preprint: arXiv:1803.07616.

Schatz, T., Feldman, N., Goldwater, S., Cao, X.-N., and Dupoux, E. (2019). Early phonetic learning without phonetic categories: Insights from machine learning. Preprint: https://psyarxiv.com/fc4wh/

Smolensky, P. (1990). Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial Intelligence 46(1–2): 159–216.

Synnaeve, G., Schatz, T., and Dupoux, E. (2014). Phonetics embedding learning with side information. In 2014 IEEE Spoken Language Technology Workshop (SLT), 106–111.

Thiolliere, R., Dunbar, E., Synnaeve, G., Versteegh, M., and Dupoux, E. (2015). A hybrid dynamic time warping-deep neural network architecture for unsupervised acoustic modeling. In Proc. INTERSPEECH 2015.

Turing, A. M. (1950). Computing machinery and intelligence. Mind 59(236): 433–460.

Vallabha, G., McClelland, J. L., Pons, F., Werker, J., and Amano, S. (2007). Unsupervised learning of vowel categories from infant-directed speech. Proceedings of the National Academy of Sciences 104(33): 13273–13278.

Versteegh, M., Thiolliere, R., Schatz, T., Cao, X.-N., Anguera, X., Jansen, A., and Dupoux, E. (2015). The Zero Resource Speech Challenge 2015. In Proc. INTERSPEECH 2015.

Warstadt, A., Singh, A., and Bowman, S. R. (2019). Neural Network Acceptability Judgments. Transactions of the Association for Computational Linguistics 7: 625–641.

Yeung, H., and Werker, J. (2009). Learning words’ sounds before learning how words sound: 9-month-olds use distinct objects as cues to categorize speech information. Cognition 113(2): 234–243.