The replicability crisis refers to the large accumulation of results reported in the literature (in both behavioral science and artificial intelligence, among other fields) that have turned out to be false positives or otherwise unreproducible (Gelman 2015; Ioannidis 2005; Pashler & Wagenmakers 2012).
The prevalence of such results has been attributed to two kinds of widespread situations. In the first, researchers draw inappropriate conclusions from the data, knowingly or unknowingly. Common causes include inappropriate statistical tests; tests that lack the statistical power to detect the effect reliably; failure to report negative results from the same or similar studies; and developing the statistical analysis on the fly after the data are collected, which introduces hidden degrees of freedom through the analyst’s choices. In the second, the data, materials, experiment scripts, or analysis scripts turn out after publication to be non-existent, lost, inaccessible, or known to contain errors.
Many measures have been proposed to counteract the replicability crisis, for example:
Some or all of these practices are in place on almost all of the research projects described here. The experiments currently being prepared on the GEOMPHON project were in part intended as a methodological exercise in pre-validation of statistical analyses and full reproducibility.
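One simple way to check an analysis before any data are collected is to estimate its statistical power by simulation. The following is a minimal sketch only, not the procedure used on the GEOMPHON project: the function name `simulated_power`, the two-group design, the normally distributed data, and the large-sample threshold of 1.96 are all assumptions made for illustration.

```python
import numpy as np

def simulated_power(effect_size, n_per_group, n_sims=2000, seed=0):
    """Estimate the power of a two-group comparison by simulation.

    Hypothetical setup: both groups are normal with unit variance and
    their means differ by `effect_size` standard deviations. A simulated
    study counts as a detection when the absolute Welch t statistic
    exceeds 1.96 (large-sample approximation to a two-sided 5% test).
    """
    rng = np.random.default_rng(seed)
    a = rng.normal(0.0, 1.0, size=(n_sims, n_per_group))
    b = rng.normal(effect_size, 1.0, size=(n_sims, n_per_group))
    # Standard error of the difference in means, one value per simulated study
    se = np.sqrt(a.var(axis=1, ddof=1) / n_per_group
                 + b.var(axis=1, ddof=1) / n_per_group)
    t = (b.mean(axis=1) - a.mean(axis=1)) / se
    # Proportion of simulated studies in which the effect is detected
    return float(np.mean(np.abs(t) > 1.96))

if __name__ == "__main__":
    for n in (10, 30, 100):
        print(n, simulated_power(effect_size=0.5, n_per_group=n))
```

Running the same simulation with `effect_size=0.0` gives the false positive rate of the analysis, so a single script can check both properties before the experiment is run.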
Reliable and interpretable methodology, including a careful choice or construction of dependent measures, is a critical part of ensuring the reliability of the experimental results we obtain.
In mouse tracking, participants perform a task (typically a two-way forced choice) by clicking on buttons on the screen. Their mouse movements toward the buttons are tracked and analyzed in order to draw inferences about the cognitive processes underlying their decisions. For example, Dale and Duran (2011) tracked mouse trajectories as participants clicked on “True” or “False” in response to generic statements such as “Cars have wings” or “Cars have no wings.” Mouse trajectories for negated sentences tended to move first towards the incorrect response (see the inline figure from our replication: the upper left is the position of the incorrect answer button, and the upper right the position of the correct answer button).
Dale and Duran interpreted this as evidence for two-step processing of negation: truth conditions are calculated first for the positive version of the sentence and negated in a second step. However, the analysis of mouse tracking data is complicated by the fact that the trajectory extends over time and in two spatial dimensions, and must somehow be reduced to an informative measure (for example, the degree to which it is bowed towards the incorrect answer). The measures used in the literature are ad hoc and come with no guarantee they are actually measuring anything relevant to the unfolding decision process.
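As an illustration of the kind of reduction described above, the sketch below computes one simple summary of how far a trajectory bows away from the straight path between its first and last points (a maximum-deviation measure). It is a generic example, not the measure used in any particular study: the function name `max_deviation`, the representation of a trajectory as a sequence of (x, y) points, and the sign convention are assumptions.

```python
import numpy as np

def max_deviation(trajectory):
    """Largest signed perpendicular distance from the straight path.

    `trajectory` is a sequence of (x, y) points (hypothetical format).
    Positive values lie to the left of the direction of travel from the
    first point to the last, negative values to the right.
    """
    pts = np.asarray(trajectory, dtype=float)
    start, end = pts[0], pts[-1]
    direction = end - start
    length = np.linalg.norm(direction)
    if length == 0:
        raise ValueError("start and end points coincide")
    rel = pts - start
    # 2D cross product gives the signed distance of each point from the line
    signed = (direction[0] * rel[:, 1] - direction[1] * rel[:, 0]) / length
    # Return the deviation with the largest magnitude, keeping its sign
    return float(signed[np.argmax(np.abs(signed))])

if __name__ == "__main__":
    straight = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
    bowed = [(0.0, 0.0), (-0.5, 0.5), (1.0, 1.0)]
    print(max_deviation(straight), max_deviation(bowed))
```

A summary like this discards the timing of the movement and the shape of the curve, which is one reason to ask whether it tracks anything relevant to the decision process.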
In Maldonado, Dunbar, and Chemla (2019), we proposed a novel measure of “degree of trajectory deviation,” constructed in a grounded way. We first collected data in which participants’ mouse trajectories were known to deviate towards the wrong answer. Participants moved the mouse toward one of two buttons indicating the colour of a frame around the experiment window, but, on some trials, the frame switched colours in the middle of the trial, changing the correct answer from red to blue or vice versa (right). We then trained a simple classifier to combine the features of these trajectories relevant to detecting deviations into a single measure of “degree of deviatedness.”
This function can then be applied to any trajectory. We showed that our new measure worked at least as well as existing mouse tracking measures, and used it to analyse a replication of Dale and Duran’s (2011) study.
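The general idea of turning a classifier trained on known deviations into a reusable measure can be sketched as follows. This is an illustration under stated assumptions, not the published implementation: the text above specifies only “a simple classifier,” so Fisher’s linear discriminant is used here as one simple choice; the training data are synthetic; and the names `fit_deviation_measure` and `synthetic_x_positions`, as well as the use of horizontal positions at 20 time steps as features, are hypothetical.

```python
import numpy as np

def fit_deviation_measure(straight, deviated, ridge=1e-3):
    """Fit a Fisher linear discriminant and return a scoring function.

    Each row of `straight` and `deviated` is one trajectory, described by
    its horizontal position at fixed time steps (assumed feature format).
    Higher scores mean "more like the known-deviated trajectories".
    """
    straight = np.asarray(straight, dtype=float)
    deviated = np.asarray(deviated, dtype=float)
    mu_s, mu_d = straight.mean(axis=0), deviated.mean(axis=0)
    # Pooled within-class covariance, lightly regularised for stability
    centered = np.vstack([straight - mu_s, deviated - mu_d])
    cov = centered.T @ centered / (len(centered) - 2)
    cov += ridge * np.eye(cov.shape[0])
    weights = np.linalg.solve(cov, mu_d - mu_s)
    midpoint = (mu_s + mu_d) / 2.0

    def score(features):
        x = np.asarray(features, dtype=float)
        return (x - midpoint) @ weights

    return score

def synthetic_x_positions(n, bump, rng, n_steps=20):
    """Synthetic data: rightward movement with an optional leftward bow."""
    t = np.linspace(0.0, 1.0, n_steps)
    base = t - bump * np.sin(np.pi * t)
    return base + rng.normal(0.0, 0.1, size=(n, n_steps))

rng = np.random.default_rng(0)
measure = fit_deviation_measure(
    synthetic_x_positions(200, bump=0.0, rng=rng),
    synthetic_x_positions(200, bump=0.6, rng=rng),
)
test_straight = synthetic_x_positions(100, bump=0.0, rng=rng)
test_deviated = synthetic_x_positions(100, bump=0.6, rng=rng)
print(measure(test_straight).mean(), measure(test_deviated).mean())
```

Once fitted, `measure` can be applied to trajectories from a different task, provided they are described by the same features; the score is continuous, so it can serve as a dependent measure and not only as a binary classification.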
Becker, S., and Hinton, G. (1992). Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature 355(6356): 161.
Benkí, J. R. (2001). Place of articulation and first formant transition pattern both affect perception of voicing in English. Journal of Phonetics 29:1–22.
Best, C. T. (1994). The emergence of native-language phonological influences in infants: A perceptual assimilation model. In The development of speech perception: The transition from speech sounds to spoken words, 167(224), 233–277.
Chaabouni, R., Dunbar, E., Zeghidour, N., and Dupoux, E. (2017). Learning weakly supervised multimodal phoneme embeddings. In Proc. INTERSPEECH 2017.
Chen, H., Leung, C. C., Xie, L., Ma, B., & Li, H. (2015). Parallel inference of Dirichlet process Gaussian mixture models for unsupervised acoustic modeling: A feasibility study. In Proc. INTERSPEECH 2015.
Chomsky, N. (1965). Aspects of the theory of syntax. Cambridge, MA: MIT Press.
Chomsky, N. and Halle, M. (1968). The sound pattern of English. New York: Harper and Row.
Clements, G. N. (2003). Feature economy in sound systems. Phonology, 20(3), 287–333.
Dale, R., & Duran, N. D. (2011). The cognitive dynamics of negated sentence verification. Cognitive Science 35(5): 983–996.
Dillon, B., Dunbar, E., and Idsardi, W. (2013). A single-stage approach to learning phonological categories. Cognitive Science 37(2): 344–377.
Dillon, B., and Wagers, M. (2019). Approaching gradience in acceptability with the tools of signal detection theory. Ms.
Dunbar, E. (2019). Generative grammar, neural networks, and the implementational mapping problem: Response to Pater. Language 95(1): e87–e98.
Dunbar, E., Algayres, R., Karadayi, J., Bernard, M., Benjumea, M., Cao, X-N., Miskic, L., Dugrain, C., Ondel, L., Black, A. W., Besacier, L., Sakti, S., and Dupoux, E. (2019). The Zero Resource Speech Challenge 2019: TTS without T. INTERSPEECH 2019: 20th Annual Congress of the International Speech Communication Association.
Dunbar, E., Cao, X-N., Benjumea, J., Karadayi, J., Bernard, M., Besacier, L., Anguera, X., and Dupoux, E. (2017). The Zero-Resource Speech Challenge 2017. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).
Dunbar, E., and Dupoux, E. (2016). Geometric constraints on human speech sound inventories. Frontiers in Psychology: Language Sciences 7, article 1061.
Dunbar, E., Synnaeve, G., and Dupoux, E. (2015). Quantitative methods for comparing featural representations. In Proceedings of ICPhS.
Dyer, C., Kuncoro, A., Ballesteros, M., & Smith, N. (2016). Recurrent neural network grammars. Preprint: arXiv:1602.07776.
Feldman, N., Griffiths, T., Goldwater, S., & Morgan, J. (2013). A role for the developing lexicon in phonetic category acquisition. Psychological Review 120(4): 751–778.
Fodor, J., & Pylyshyn, Z. (1988). Connectionism and cognitive architecture: A critical analysis. Cognition 28(1-2): 3–71.
Gelman, A. (2015). The connection between varying treatment effects and the crisis of unreplicable research: A Bayesian perspective. Journal of Management, 41, 632–643.
Guenther, F., and Gjaja, M. (1996). The perceptual magnet effect as an emergent property of neural map formation. The Journal of the Acoustical Society of America 100(2): 1111–1121.
Gulordava, K., Bojanowski, P., Grave, E., Linzen, T., and Baroni, M. (2018). Colorless green recurrent networks dream hierarchically. Preprint: arXiv:1803.11138.
Ioannidis, J. P. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124.
Le Godais, G., Linzen, T., and Dupoux, E. (2017). Comparing character-level neural language models using a lexical decision task. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers: 125–130.
Li, B., and Zen, H. (2016). Multi-Language Multi-Speaker Acoustic Modeling for LSTM-RNN based Statistical Parametric Speech Synthesis. Proc. INTERSPEECH 2016.
Liberman, A. M. (1963). A motor theory of speech perception. In Proceedings of the Speech Communication Seminar, Stockholm. Speech Transmission Lab.
Linzen, T., Dupoux, E., and Goldberg, Y. (2016). Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics 4: 521–535.
Mackie, S., & Mielke, J. (2011). Feature economy in natural, random, and synthetic inventories. In Clements, G. N., and Ridouane, R. Where Do Phonological Features Come From?: Cognitive, physical and developmental bases of distinctive speech categories, John Benjamins Publishing. 43–63.
Maldonado, M., Dunbar, E., and Chemla, E. (2019). Mouse tracking as a window into decision making. Behavior Research Methods 51(3): 1085–1101.
Marcus, G. (2001). The algebraic mind. Cambridge: MIT Press.
Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. Cambridge: MIT Press.
Maye, J., Werker, J., and Gerken, L. (2002). Infant sensitivity to distributional information can affect phonetic discrimination. Cognition 82: B101–B111.
McCoy, R. T., Linzen, T., Dunbar, E., and Smolensky, P. (2019). RNNs implicitly represent tensor product representations. ICLR (International Conference on Learning Representations).
Mikolov, T., Yih, W. T., and Zweig, G. (2013). Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 746–751.
Millet, J., Jurov, N., and Dunbar, E. (2019). Comparing unsupervised speech learning directly to human performance in speech perception. In Proceedings of the 41st Annual Meeting of the Cognitive Science Society (Cog Sci 2019).
Palangi, H., Smolensky, P., He, X., & Deng, L. (2018). Question-answering with grammatically-interpretable representations. In Thirty-Second AAAI Conference on Artificial Intelligence.
Pashler, H., & Wagenmakers, E. J. (2012). Editors’ introduction to the special section on replicability in psychological science: A crisis of confidence? Perspectives on Psychological Science, 7(6), 528–530.
Riad, R., Dancette, C., Karadayi, J., Zeghidour, N., Schatz, T., & Dupoux, E. (2018). Sampling strategies in Siamese Networks for unsupervised speech representation learning. Interspeech 2018.
Riochet, R., Castro, M. Y., Bernard, M., Lerer, A., Fergus, R., Izard, V., & Dupoux, E. (2018). IntPhys: A framework and benchmark for visual intuitive physics reasoning. Preprint: arXiv:1803.07616.
Schatz, T., Feldman, N., Goldwater, S., Cao, X-N., and Dupoux, E. (2019). Early phonetic learning without phonetic categories: Insights from machine learning. Preprint: https://psyarxiv.com/fc4wh/
Smolensky, P. (1990). Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial Intelligence, 46(1-2): 159–216.
Synnaeve, G., Schatz, T., and Dupoux, E. (2014). Phonetics embedding learning with side information. In 2014 IEEE Spoken Language Technology Workshop (SLT), 106–111.
Thiolliere, R., Dunbar, E., Synnaeve, G., Versteegh, M., and Dupoux, E. (2015). A hybrid dynamic time warping-deep neural network architecture for unsupervised acoustic modeling. Proc. INTERSPEECH 2015.
Turing, A. M. (1950). Computing machinery and intelligence. Mind 59(236): 433–460.
Vallabha, G., McClelland, J. L., Pons, F., Werker, J., and Amano, S. (2007). Unsupervised learning of vowel categories from infant-directed speech. Proceedings of the National Academy of Sciences 104(33): 13273–13278.
Versteegh, M., Thiolliere, R., Schatz, T., Cao, X. N., Anguera, X., Jansen, A., & Dupoux, E. (2015). The Zero Resource Speech Challenge 2015. Proc. INTERSPEECH 2015.
Warstadt, A., Singh, A., & Bowman, S. R. (2019). Neural Network Acceptability Judgments. Transactions of the Association for Computational Linguistics 7: 625–641.
Yeung, H., and Werker, J. (2009). Learning words’ sounds before learning how words sound: 9-month-olds use distinct objects as cues to categorize speech information. Cognition