Ewan Dunbar
ˈju ən ˈdʌn ˌbɑɹ
Speech · Phonology · Learning
Modelling speech perception and learning · Sound inventories · Low resource speech technology · Decoding neural network representations · Controllable speech synthesis · Open and replicable science
If one hopes to achieve a full understanding of a system as complicated as a nervous system, a developing embryo, a set of metabolic pathways, a bottle of gas, or even a large computer program, then one must be prepared to contemplate different kinds of explanation at different levels of description. (Marr 1982, p. 20)

Although the internal representations of neural networks consist of inscrutable continuous activation states, this does not make them incompatible with structured symbolic representations of the kind typically used in linguistics and the other cognitive sciences (despite claims to the contrary: e.g., Fodor and Pylyshyn 1988; Marcus 2001). Precisely because they are extremely powerful models, neural networks can implement symbolic representations within their continuous activation states, a fact exploited in practice by neural network syntactic parsing models (e.g., Dyer et al. 2016). Were one to examine the internal states of any neural network, one would never see a tree, but this does not mean that there are none: the numerical activations are simply the wrong level of description at which to ask the question.

As I discuss in Dunbar 2019, I believe there is much work to be done using neural networks to implement processing and learning in linguistics, and to assess whether linguistic and psycholinguistic theories are adequate. But this cannot be done without tools for decoding what has been learned by neural networks, much like we attempt to decipher what neurons are doing in the brain.

Tensor product decomposition networks

Tensor product representations (TPRs: Smolensky 1990) provide a constructive proof of the claim that one can encode arbitrary symbolic structures in numerical vectors. The idea behind TPRs is very simple: a symbolic structure can be broken down into its basic elements, or roles, and the possible values those elements can take. Take a structure like a binary tree: it can be seen as having three roles (a root label, a left child tree, and a right child tree), each of which can be filled by different values, or fillers. If we associate each possible filler with a vector representation, we can set up a fixed (linear) mapping that binds each “raw” filler vector to its role, mapping it to a new vector representation specific to that role. The bound filler–role vectors are then summed to obtain a representation of the complete structure.
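
As a concrete sketch of the construction (a few lines of numpy with invented dimensionalities and random vectors, not any particular published implementation), each filler vector is bound to the vector for its role by an outer product, and the bindings are summed into a single matrix:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical filler vectors (one per symbol) and role vectors (one per position).
    fillers = {s: rng.normal(size=8) for s in "1234"}
    roles = {i: rng.normal(size=4) for i in range(4)}

    def tpr(sequence):
        """Bind each filler to its positional role by an outer product and sum the bindings."""
        return sum(np.outer(fillers[s], roles[i]) for i, s in enumerate(sequence))

    rep = tpr("1342")  # an 8 x 4 matrix encoding the whole sequence

    # If the role vectors are (near-)orthogonal, the filler occupying a given role can be
    # recovered, approximately, by multiplying the representation by that role vector.
    recovered = rep @ roles[2]  # a noisy estimate (up to scale) of fillers["4"]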

Previous research sought to force networks' representations to encode specific symbolic structures using TPRs (Palangi et al. 2018). In McCoy, Linzen, Dunbar, and Smolensky (2019), we sought to work backwards, to assess whether trained neural models learn (on their own) to implement specific symbolic representations. We introduced tensor product decomposition networks (TPDNs), a method for finding TPRs that approximate existing vector representations. A TPDN assesses evidence that a set of vectors implements a specific symbolic structure. If the TPDN is able to closely approximate the representations generated by a network, we conclude that the network implements the given symbolic structure.
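
Schematically (a minimal sketch assuming PyTorch, with invented class and parameter names, not the code released with the paper), a TPDN learns filler and role embeddings under a hypothesized role scheme, binds them with outer products, sums the bindings, and maps the result linearly onto the network's actual vectors; how closely this fit succeeds is the evidence for the hypothesized structure:

    import torch
    import torch.nn as nn

    class TPDN(nn.Module):
        # n_fillers / n_roles: number of symbols and of hypothesized roles;
        # d_f / d_r: embedding sizes; d_target: size of the vectors being approximated.
        def __init__(self, n_fillers, n_roles, d_f, d_r, d_target):
            super().__init__()
            self.filler_emb = nn.Embedding(n_fillers, d_f)
            self.role_emb = nn.Embedding(n_roles, d_r)
            self.out = nn.Linear(d_f * d_r, d_target)

        def forward(self, filler_ids, role_ids):
            # filler_ids, role_ids: (batch, seq_len); roles assigned by the hypothesized scheme
            f = self.filler_emb(filler_ids)             # (batch, seq, d_f)
            r = self.role_emb(role_ids)                 # (batch, seq, d_r)
            bound = torch.einsum("bsf,bsr->bfr", f, r)  # summed outer products
            return self.out(bound.flatten(1))           # linear map into the target vector space

    # Fitting (sketch): minimize the mean squared error between TPDN outputs and the
    # network's actual representations of the same sequences, e.g.
    # loss = nn.functional.mse_loss(tpdn(filler_ids, role_ids), encoder_vectors)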

We showed that recurrent networks trained on artificial tasks using fixed-length digit sequences learn to use (close approximations to) structured symbolic representational schemes. For example, a recurrent neural network trained to copy sequences (mapping 1–3–4–2 to 1–3–4–2, for example) will develop roles corresponding to positions with respect to both the beginning and end of the sequence (first element, last element, second element, second-last element, ...). See figure at left and the paper for more detail.
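
Concretely (an illustrative sketch of this kind of role scheme, not code from the paper), a “bidirectional” role for each element can be read off from its distance to both edges of the sequence:

    def bidirectional_roles(sequence):
        """Assign each element a role identified by (index from start, index from end)."""
        n = len(sequence)
        return [(i, n - 1 - i) for i in range(n)]

    print(bidirectional_roles([1, 3, 4, 2]))  # [(0, 3), (1, 2), (2, 1), (3, 0)]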

Not only is it possible in principle to encode structured symbolic representations in neural network states; networks in fact learn them spontaneously. We also showed, however, that several networks trained to encode real sentences show no evidence of using any of the symbolic structures we hypothesized they might. None of these pre-trained networks were actually particularly good at the rather difficult tasks they were trained on (one was the Skip-Thought architecture, which, given a sentence, must predict the preceding and following sentences), which may suggest that they simply encode a “good enough” representation consisting merely of the (unstructured) set of words in the sentence.

Related methods

Phonological features have long been central to phonological theory. Almost every theory of how human cognition handles the phonological processes of languages has relied heavily on some theory of the mental representation of sounds. Typically, these theories propose a finite, universal set of features which encode the differences and similarities between speech sounds. One enormously influential view proposed that features can take on only two values, + or − (Chomsky and Halle 1968). A representation of some consonants of English in a binary feature system is shown here.

Many different phonological feature theories have been proposed. They differ substantively, but what they have in common is that they are all conceived, at least in part, to explain tendencies across languages in what kinds of phonological processes exist. For example, English, like many languages, has a process of voicing assimilation that applies across morpheme boundaries. A straightforward demonstration of English voicing assimilation comes from the past tense, as formed with -ed. The words eased and tossed are both spelled with -ed, but in eased, where the suffix combines with ease (ending in the [+ voice] sound [z]), it is pronounced as the [+ voice] sound [d], while with toss (ending in the [- voice] sound [s]), it is pronounced as the [- voice] sound [t]. Similarly with the third person singular -s (and the plural -s, and the possessive ’s): with read, which ends in [d], it is pronounced [z], while with eat, which ends in [t], it is pronounced [s].

The proposition that there is a feature [voice] creates the possibility of a transformation in which [- voice] is changed to [+ voice] (or vice versa), and the same transformation therefore relates both [t] with [d] and [s] with [z]. A phonological feature theory is in part a theory of possible phonological transformations. This particular transformation exists under most feature theories.
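
To make the transformation concrete (a toy illustration in Python, with a hand-picked fragment of feature values rather than any full proposed system), the alternating obstruents can be written as feature bundles, and both alternations fall out of a single toggle of [voice]:

    # Toy fragment of a binary feature system (illustrative values only).
    FEATURES = {
        "t": {"voice": "-", "nasal": "-", "continuant": "-"},
        "d": {"voice": "+", "nasal": "-", "continuant": "-"},
        "s": {"voice": "-", "nasal": "-", "continuant": "+"},
        "z": {"voice": "+", "nasal": "-", "continuant": "+"},
    }

    def set_voice(segment, value):
        """Return the segment whose features match `segment`, except with [voice] set to `value`."""
        target = dict(FEATURES[segment], voice=value)
        return next(s for s, feats in FEATURES.items() if feats == target)

    # One and the same transformation relates both pairs:
    assert set_voice("t", "+") == "d"  # the t/d alternation in tossed vs. eased
    assert set_voice("s", "+") == "z"  # the s/z alternation in eats vs. reads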

Another thing phonological feature theories tend to have in common is that each feature is thought to have some kind of innate phonetic meaning, either auditory or articulatory or both (although these innate phonetic specifications are usually only partial, with phonetic acquisition fixing them to match the sounds of the language). The simple phonological transformations they make possible are ones conceived to have some kind of phonetic sense, like switching vocal fold activity on or off, not arbitrary ones, like changing [d] to [p] and [t] to [l].

However intuitive many phonological features may appear, research in phonetics has made it clear that what looks simple is not obviously so. Voicing, for example, manifests itself as an array of different acoustic cues, which are modulated by other features like place of articulation and manner, as well as by neighboring sounds (Benkí 2001). And, in articulation, many studies have found that the articulatory commands for a given phoneme or feature are themselves variable and context-dependent. Perturbation studies show, for example, that if one articulator is mechanically prevented from reaching its target, others can and do compensate (Fowler and Turvey 1980; Kelso et al. 1984; McFarland and Baum 1995).

In Dunbar, Synnaeve, and Dupoux (2015), we developed an analysis (closely related to tensor product representations and to the analogy method of Mikolov et al. 2013) to look at numerical vector representations of the phonetics of English consonants, to see whether common phonological transformations were also simple transformations in acoustic and articulatory space.

The idea can be illustrated graphically: on the left, a cube representing a binary phonological feature system is shown, with three features as axes; the edges highlighted with arrows represent a transformation of nasalization, under which [- nasal] sounds change to their corresponding [+ nasal] counterparts. On the right, a (hypothetical) two-dimensional numerical representation of sounds is shown.

The numerical representation can be said to approximate a nasalization transformation—and thus capture in a very clearly defined sense the feature [nasal]—because the same transformation (a simple translation in the vector space) relates [b] with [m], [d] with [n], and [g] with [ŋ], just the way that nasalization would work under a phonological feature theory. We examined six phonological transformations that would be defined as simple transformations under most phonological feature theories (nasalization, changes in voicing, changing plosives to fricatives, and changing the place of articulation in three different ways), and developed a method for scoring numerical representations on how well these featural transformations were captured (in the sense illustrated graphically above). In the paper, we found that the articulatory and the acoustic vectors captured certain of these featural transformations extremely well; follow-up research indicates that combining the articulatory and acoustic vectors together results in a nearly-perfect representation of all six of these featural transformations. Despite the complexity of phonetics, typical phonological features do a good job of capturing transformations that are phonetically natural.
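
The intuition behind the scoring can be illustrated with a simplified stand-in (this is not the exact measure from the paper; the random vectors and the use of mean pairwise cosine similarity here are purely for illustration): a transformation is well captured to the extent that the difference vectors of all the sound pairs it relates point in the same direction.

    import numpy as np
    from itertools import combinations

    def parallelism_score(vectors, pairs):
        """Mean cosine similarity between the difference vectors of all pairs
        related by the same hypothesized transformation (e.g., nasalization)."""
        diffs = [vectors[b] - vectors[a] for a, b in pairs]
        sims = [np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
                for u, v in combinations(diffs, 2)]
        return float(np.mean(sims))

    # Hypothetical phoneme vectors (in practice, acoustic or articulatory measurements).
    rng = np.random.default_rng(1)
    vecs = {p: rng.normal(size=2) for p in ["b", "m", "d", "n", "g", "ŋ"]}
    nasalization = [("b", "m"), ("d", "n"), ("g", "ŋ")]
    print(parallelism_score(vecs, nasalization))  # close to 1 only if the shift is a consistent translation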

We also made a preliminary assessment of the plausibility of certain proposals according to which there are, in fact, no innately phonetically meaningful phonological features (Mielke 2008; Hale and Reiss 2008). These proposals suggest that phonological processes (and other systematic, language-specific restrictions on how sounds combine) are learned simply as relations between individual sounds, and nothing more. According to these theories, the phoneme inventory of a language is not represented mentally in any phonological feature-like format that would highlight the phonological processes of the language by making them simple transformations. But, these theories claim, once phonological processes have been learned, we do have access to a phonological feature representation that encodes the processes in the language: the feature [voice], for example, would be learned as a consequence of learning voicing assimilation.

We constructed a system that used a feedforward neural network to represent the sequences of English phonemes in a large corpus of conversational speech (the Buckeye corpus: Pitt et al. 2005). The system started from an input in which phonemes were encoded with no internal structure, and it learned numerical vector representations optimized to encode sequences: that is, to predict which phonemes appeared immediately before and after a given phoneme. In this way, we expected that some common phonological processes of English would be captured. However, this representation did not capture any of the six phonological transformations.
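
In spirit (though not in implementation: the paper used a feedforward network, and the library, transcriptions, and parameters below are stand-ins), this is close to training skip-gram-style phoneme embeddings, in which each phoneme’s vector is optimized to predict only its immediate neighbours:

    from gensim.models import Word2Vec

    # Each "sentence" is a phoneme sequence from a phonemically transcribed corpus;
    # the toy transcriptions below are placeholders.
    phoneme_sequences = [
        ["i", "z", "d"],       # "eased"
        ["t", "o", "s", "t"],  # "tossed"
        ["r", "i", "d", "z"],  # "reads"
    ]

    # window=1: predict only the immediately preceding and following phonemes,
    # starting from vectors with no built-in internal structure.
    model = Word2Vec(phoneme_sequences, vector_size=16, window=1, sg=1, min_count=1)
    vector_for_d = model.wv["d"]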

In Chaabouni, Dunbar, Zeghidour, and Dupoux (2017), we used the same approach to test the representations of a neural network trained to learn a combined audio–visual representation from a corpus of speech videos with the speakers’ lips highlighted. Using this method, we established that including the lip videos in training led the model to learn more consistent representations of the feature [labial] than training on audio alone.

References

Becker, S., and Hinton, G. (1992). Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature 355(6356): 161.

Benkí, J. R. (2001). Place of articulation and first formant transition pattern both affect perception of voicing in English. Journal of Phonetics 29:1–22.

Best, C. T. (1994). The emergence of native-language phonological influences in infants: A perceptual assimilation model. In The development of speech perception: The transition from speech sounds to spoken words, 167(224), 233–277.

Chaabouni, R., Dunbar, E., Zeghidour, N., and Dupoux, E. (2017). Learning weakly supervised multimodal phoneme embeddings. In Proc. INTERSPEECH 2017.

Chen, H., Leung, C. C., Xie, L., Ma, B., & Li, H. (2015). Parallel inference of Dirichlet process Gaussian mixture models for unsupervised acoustic modeling: A feasibility study. In Proc. INTERSPEECH 2015.

Chomsky, N. (1965). Aspects of the theory of syntax. Cambridge, MA: MIT Press.

Chomsky, N. and Halle, M. (1968). The sound pattern of English. New York: Harper and Row.

Clements, G. N. (2003). Feature economy in sound systems. Phonology, 20(3), 287–333.

Dale, R., & Duran, N. D. (2011). The cognitive dynamics of negated sentence verification. Cognitive Science 35(5): 983–996.

Dillon, B., Dunbar, E., and Idsardi, W. (2013). A single-stage approach to learning phonological categories. Cognitive Science 37(2): 344–377.

Dillon, B., and Wagers, M. (2019). Approaching gradience in acceptability with the tools of signal detection theory. Ms.

https://people.umass.edu/bwdillon/publication/dillonwagers_inprep/dillonwagers_inprep.pdf

Dunbar, E. (2019). Generative grammar, neural networks, and the implementational mapping problem: Response to Pater. Language 95(1): e87–e98.

Dunbar, E., Algayres, R., Karadayi, J., Bernard, M., Benjumea, M., Cao, X-N., Miskic, L., Dugrain, C., Ondel, L., Black, A. W., Besacier, L., Sakti, S., and Dupoux, E. (2019). The Zero Resource Speech Challenge 2019: TTS without T. INTERSPEECH 2019: 20th Annual Conference of the International Speech Communication Association.

Dunbar, E., Cao, X-N., Benjumea, J., Karadayi, J., Bernard, M., Besacier, L., Anguera, X., and Dupoux, E. (2017). The Zero Resource Speech Challenge 2017. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

Dunbar, E., and Dupoux, E. (2016). Geometric constraints on human speech sound inventories. Frontiers in Psychology: Language Sciences 7, article 1061.

Dunbar, E., Synnaeve, G., and Dupoux, E. (2015). Quantitative methods for comparing featural representations. In Proceedings of ICPhS.

Dyer, C., Kuncoro, A., Ballesteros, M., & Smith, N. (2016). Recurrent neural network grammars. Preprint: arXiv:1602.07776.

Feldman, N., Griffiths, T., Goldwater, S., & Morgan, J. (2013). A role for the developing lexicon in phonetic category acquisition. Psychological Review 120(4): 751–778.

Fodor, J., & Pylyshyn, Z. (1988). Connectionism and cognitive architecture: A critical analysis. Cognition 28(1–2): 3–71.

Gelman, A. (2015). The connection between varying treatment effects and the crisis of unreplicable research: A Bayesian perspective. Journal of Management 41: 632–643.

Guenther, F., and Gjaja, M. (1996). The perceptual magnet effect as an emergent property of neural map formation. The Journal of the Acoustical Society of America 100(2): 1111–1121.

Gulordava, K., Bojanowski, P., Grave, E., Linzen, T., and Baroni, M. (2018). Colorless green recurrent networks dream hierarchically. Preprint: arXiv:1803.11138.

Ioannidis, J. P. (2005). Why most published research findings are false. PLoS Med 2(8): e124.

Le Godais, G., Linzen, T., and Dupoux, E. (2017). Comparing character-level neural language models using a lexical decision task. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers: 125–130.

Li, B., and Zen, H. (2016). Multi-Language Multi-Speaker Acoustic Modeling for LSTM-RNN Based Statistical Parametric Speech Synthesis. Proc. INTERSPEECH 2016.

Liberman, A. M. (1963). A motor theory of speech perception. In Proceedings of the Speech Communication Seminar, Stockholm. Speech Transmission Lab.

Linzen, T., Dupoux, E., and Goldberg, Y. (2016). Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics 4: 521–535.

Mackie, S., & Mielke, J. (2011). Feature economy in natural, random, and synthetic inventories. In Clements, G. N., and Ridouane, R. (eds.), Where Do Phonological Features Come From?: Cognitive, physical and developmental bases of distinctive speech categories. John Benjamins Publishing, 43–63.

Maldonado, M., Dunbar, E., and Chemla, E. (2019). Mouse tracking as a window into decision making. Behavior Research Methods 51(3): 1085–1101.

Marcus, G. (2001). The algebraic mind. Cambridge: MIT Press.

Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. Cambridge: MIT Press.

Maye, J., Werker, J., and Gerken, L. (2002). Infant sensitivity to distributional information can affect phonetic discrimination. Cognition 82: B101–B111.

McCoy, R. T., Linzen, T., Dunbar, E., and Smolensky, P. (2019). RNNs implicitly implement tensor product representations. ICLR (International Conference on Learning Representations).

Mikolov, T., Yih, W. T., and Zweig, G. (2013). Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 746–751.

Millet, J., Jurov, N., and Dunbar, E. (2019). Comparing unsupervised speech learning directly to human performance in speech perception. In Proceedings of the 41st Annual Meeting of the Cognitive Science Society (CogSci 2019).

Palangi, H., Smolensky, P., He, X., & Deng, L. (2018). Question-answering with grammatically-interpretable representations. In Thirty-Second AAAI Conference on Artificial Intelligence.

Pashler, H., & Wagenmakers, E. J. (2012). Editors’ introduction to the special section on replicability in psychological science: A crisis of confidence? Perspectives on Psychological Science 7(6): 528–530.

Riad, R., Dancette, C., Karadayi, J., Zeghidour, N., Schatz, T., & Dupoux, E. (2018). Sampling strategies in Siamese Networks for unsupervised speech representation learning. Proc. INTERSPEECH 2018.

Riochet, R., Castro, M. Y., Bernard, M., Lerer, A., Fergus, R., Izard, V., & Dupoux, E. (2018). IntPhys: A framework and benchmark for visual intuitive physics reasoning. Preprint: arXiv:1803.07616.

Schatz, T., Feldman, N., Goldwater, S., Cao, X-N., and Dupoux, E. (2019). Early phonetic learning without phonetic categories: Insights from machine learning. Preprint: https://psyarxiv.com/fc4wh/

Smolensky, P. (1990). Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial Intelligence 46(1–2): 159–216.

Synnaeve, G., Schatz, T., and Dupoux, E. (2014). Phonetics embedding learning with side information. In 2014 IEEE Spoken Language Technology Workshop (SLT), 106–111.

Thiolliere, R., Dunbar, E., Synnaeve, G., Versteegh, M., and Dupoux, E. (2015). A hybrid dynamic time warping-deep neural network architecture for unsupervised acoustic modeling. Proc. INTERSPEECH 2015.

Turing, A. M. (1950). Computing machinery and intelligence. Mind 59(236): 433–460.

Vallabha, G., McClelland, J. L., Pons, F., Werker, J., and Amano, S. (2007). Unsupervised learning of vowel categories from infant-directed speech. Proceedings of the National Academy of Sciences 104(33): 13273–13278.

Versteegh, M., Thiolliere, R., Schatz, T., Cao, X. N., Anguera, X., Jansen, A., & Dupoux, E. (2015). The Zero Resource Speech Challenge 2015. Proc. INTERSPEECH 2015.

Warstadt, A., Singh, A., & Bowman, S. R. (2019). Neural Network Acceptability Judgments. Transactions of the Association for Computational Linguistics 7: 625–641.

Yeung, H., and Werker, J. (2009). Learning words’ sounds before learning how words sound: 9-month-olds use distinct objects as cues to categorize speech information. Cognition 113(2): 234–243.