syn_bx_neuro:

A Deep Neural Network Trained on Congruent Audiovisual Speech Reports the McGurk Effect
In the McGurk effect, incongruent auditory and visual syllables are perceived as a third, illusory syllable. The prevailing explanation for the effect is that the illusory syllable is a consensus percept intermediate between otherwise incompatible auditory and visual representations. To test this idea, we turned to a deep neural network (DNN) known as AVHuBERT that transcribes audiovisual speech with high accuracy. Critically, AVHuBERT was trained only with congruent audiovisual speech, without exposure to McGurk stimuli or other incongruent speech. In the current study, when tested with congruent audiovisual "ba", "ga" and "da" syllables recorded from 8 different talkers, AVHuBERT transcribed them with near-perfect accuracy and showed a human-like pattern of highest accuracy for audiovisual speech, slightly lower accuracy for auditory-only speech, and low accuracy for visual-only speech. When presented with incongruent McGurk syllables (auditory "ba" paired with visual "ga"), AVHuBERT reported the McGurk fusion percept of "da" at a rate of 25%, many-fold greater than the rate for either the auditory or the visual component of the McGurk stimulus presented on its own. To examine the individual variability that is a hallmark of human perception of the McGurk effect, 100 variants of AVHuBERT were constructed. Like human observers, AVHuBERT variants were consistently accurate for congruent syllables but highly variable for McGurk syllables. Similarities between the responses of AVHuBERT and humans to congruent and incongruent audiovisual speech, including the McGurk effect, suggest that DNNs may be a useful tool for interrogating the perceptual and neural mechanisms of human audiovisual speech perception.