Speech perception is the intricate process by which the human auditory system extracts linguistic abstractions from the acoustic speech signal. Traditional models based on linear feature encoding have had limited success in explaining this process, whereas artificial neural networks, particularly deep neural network (DNN) models, have shown strong performance on speech recognition tasks.
A recent groundbreaking study utilized state-of-the-art DNN models to explore neural coding from the auditory nerve to the speech cortex, mapping the correlation between DNN representations and neural activity along the ascending auditory system.
The research unveiled several key findings:
- The hierarchy of speech representations learned by DNNs correlates well with the ascending auditory pathway, demonstrating an alignment of computational structure between the two.
- Unsupervised speech models performed on par with or even better than purely supervised or fine-tuned models, showcasing the DNNs’ ability to learn meaningful representations without explicit linguistic knowledge.
- Deeper DNN layers exhibited better correlation with neural activity in higher-order auditory cortex regions, aligning with the phonemic and syllabic structure of speech (a minimal layer-extraction sketch follows this list).
- DNN-based models revealed language-specific properties in cross-language speech perception, offering insight into language-specific coding in the superior temporal gyrus (STG).
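To make the layer-wise comparison concrete, the sketch below shows one way to pull per-layer representations from a pretrained self-supervised speech model. The checkpoint name, the synthetic waveform, and the use of the Hugging Face transformers API are illustrative assumptions, not the study's exact pipeline.

```python
# Illustrative only: model checkpoint and waveform are placeholder assumptions.
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

waveform = np.random.randn(16000 * 2).astype(np.float32)  # stand-in for 2 s of 16 kHz speech
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states holds one (1, T, 768) tensor per depth: the convolutional
# front end's output followed by each of the 12 transformer layers.
layer_features = [h.squeeze(0).numpy() for h in out.hidden_states]
print(len(layer_features), layer_features[0].shape)
```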
The study employed a neural encoding framework to systematically evaluate the similarity between the auditory pathway and DNN models with different architectures and training strategies. Importantly, it used a cross-linguistic paradigm, going beyond the constraints of a single language, to uncover both language-invariant and language-specific aspects of speech perception.
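A typical version of such an encoding framework regresses time-aligned model features onto neural responses and scores the fit on held-out data. The sketch below uses ridge regression with synthetic stand-ins for both the features and one electrode's response; the variable names and data shapes are assumptions for illustration.

```python
# Synthetic stand-ins: X plays the role of time-aligned DNN features,
# y the role of one electrode's response (e.g., high-gamma amplitude).
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
n_time, n_feat = 2000, 768
X = rng.standard_normal((n_time, n_feat))
y = X @ rng.standard_normal(n_feat) * 0.05 + rng.standard_normal(n_time)

split = int(0.8 * n_time)  # contiguous split to respect temporal structure
enc = RidgeCV(alphas=np.logspace(-2, 5, 15))
enc.fit(X[:split], y[:split])

# Encoding score: correlation between predicted and held-out responses.
r = np.corrcoef(enc.predict(X[split:]), y[split:])[0, 1]
print(f"encoding score r = {r:.3f}")
```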
The results challenged traditional cognitive and neural encoding models, exposing the limits of linear encoding in capturing higher-order speech information. DNNs, with their nonlinearity and dynamic temporal integration of phonological context, outperformed traditional models, especially in predicting responses in the nonprimary auditory cortex.
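For contrast, a classic linear baseline of the kind the DNN features outperformed can be sketched as a lagged-spectrogram (STRF-style) encoding model, where delayed copies of the stimulus spectrogram stand in for temporal context. The spectrogram and response below are again synthetic placeholders, not the study's data.

```python
# Synthetic stand-ins: S plays the role of a stimulus mel spectrogram,
# y the role of a neural response aligned to it.
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(1)
n_time, n_mel, n_lags = 2000, 40, 20
S = rng.standard_normal((n_time, n_mel))
y = rng.standard_normal(n_time)

# Stack delayed copies of the spectrogram so the linear model sees a short
# window of acoustic history at each timepoint (np.roll wraps at the edges,
# which is acceptable for a sketch).
lagged = np.concatenate([np.roll(S, k, axis=0) for k in range(n_lags)], axis=1)

split = int(0.8 * n_time)
strf = RidgeCV(alphas=np.logspace(-2, 5, 15)).fit(lagged[:split], y[:split])
r_linear = np.corrcoef(strf.predict(lagged[split:]), y[split:])[0, 1]
print(f"linear STRF baseline r = {r_linear:.3f}")
```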
The study also shed light on the computational attributes of the DNN models, indicating that different architectures better correlated with different parts of the auditory pathway: convolutional layers were best suited to the auditory periphery, while deeper transformer-encoder and LSTM layers better fit the speech cortex, demonstrating the DNNs’ ability to match different levels of processing.
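One way to quantify such layer-to-region correspondence is to fit an encoding model per layer and per electrode and record which depth predicts each site best, as sketched below, again with synthetic features and recordings standing in for real data.

```python
# Synthetic stand-ins: layer_feats mimics features from each DNN layer,
# Y mimics simultaneous recordings from several electrodes.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
n_time, n_layers, n_feat, n_elec = 1500, 12, 64, 8
layer_feats = rng.standard_normal((n_layers, n_time, n_feat))
Y = rng.standard_normal((n_time, n_elec))

split = int(0.8 * n_time)
scores = np.zeros((n_layers, n_elec))
for li, X in enumerate(layer_feats):
    for ei in range(n_elec):
        m = Ridge(alpha=10.0).fit(X[:split], Y[:split, ei])
        scores[li, ei] = np.corrcoef(m.predict(X[split:]), Y[split:, ei])[0, 1]

best_layer = scores.argmax(axis=0)  # preferred model depth per electrode
print(best_layer)
```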
The findings also bear on how the functions of the primary and nonprimary auditory cortical areas are interpreted. The study challenged the notion that the primary auditory cortex alone drives the advanced computations underlying speech processing, emphasizing the contribution of the entire auditory pathway to speech perception.
In conclusion, this research offers new insights into neural coding in the auditory cortex, demonstrating the potential of DNN models to mirror, and thereby deepen our understanding of, the human auditory system. The study’s approach opens avenues for data-driven computational models of sensory perception and underscores the importance of dynamic temporal integration and nonlinearity in modelling speech processing across the auditory pathway.
Journal Reference
Li, Y., Anumanchipalli, G. K., Mohamed, A., et al. Dissecting neural computations in the human auditory pathway using deep neural networks for speech. Nat. Neurosci. (2023). https://doi.org/10.1038/s41593-023-01468-4


