A unified acoustic-to-speech-to-language embedding space captures the neural basis of natural language processing in everyday conversations

Ariel Goldstein, Haocheng Wang, Leonard Niekerken, Mariano Schain, Zaid Zada, Bobbi Aubrey, Tom Sheffer, Samuel A. Nastase, Harshvardhan Gazula, Aditi Singh, Aditi Rao, Gina Choe, Catherine Kim, Werner Doyle, Daniel Friedman, Sasha Devore, Patricia Dugan, Avinatan Hassidim, Michael Brenner, Yossi Matias, Orrin Devinsky, Adeen Flinker, Uri Hasson
Nature Human Behaviour
Department of Psychology and the Princeton Neuroscience Institute, Princeton University, Princeton, NJ, USA

Overall Summary

Study Background and Main Findings

This study investigated how the human brain processes natural language during real-world conversations. Researchers recorded brain activity using electrocorticography (ECoG), a technique that involves placing electrodes directly on the brain's surface, while participants engaged in unscripted conversations with family, friends, and hospital staff. This approach provided a rich dataset of approximately 100 hours of continuous recordings, encompassing nearly half a million words. The key innovation was the use of a state-of-the-art, multimodal speech-to-text model called Whisper (developed by OpenAI) to analyze the audio recordings and their transcripts; the model's internal representations were then related to the recorded brain activity. Whisper is a deep learning model, meaning it's a complex algorithm that learns patterns from data, similar to how a brain learns. It's trained to process speech and convert it into text, and it does so by extracting different levels of linguistic information, from the raw sounds (acoustic features) to the recognized speech sounds (speech features) and finally to the meaning of the words (language features). These different levels of information are represented within the model as "embeddings," which are numerical vectors that capture different aspects of the language.
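To make the notion of extracting embeddings concrete, here is a minimal sketch using the Hugging Face transformers implementation of Whisper. The checkpoint, variable names, and teacher-forced transcript are illustrative assumptions; the authors' exact extraction pipeline is not reproduced here.

```python
# Sketch (not the authors' pipeline): pulling Whisper's encoder ("speech")
# and decoder ("language") embeddings via Hugging Face transformers.
import torch
from transformers import WhisperProcessor, WhisperModel

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperModel.from_pretrained("openai/whisper-tiny")

# `audio` is assumed: a 1-D float array holding one conversation snippet at 16 kHz.
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Feed the known transcript to the decoder (teacher forcing), so decoder states
# line up with the words actually spoken.
decoder_ids = processor.tokenizer("how are you today", return_tensors="pt").input_ids

with torch.no_grad():
    out = model(input_features=inputs.input_features,
                decoder_input_ids=decoder_ids,
                output_hidden_states=True)

speech_emb = out.encoder_last_hidden_state   # final encoder layer: speech features
language_emb = out.last_hidden_state         # final decoder layer: contextual word features
```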

The researchers then used a technique called "encoding models" to see how well these embeddings from Whisper could predict the brain activity they recorded. They found that the embeddings could predict brain activity with remarkable accuracy, with correlations between predicted and observed activity reaching roughly r = 0.40 at the best electrodes. Moreover, different types of embeddings were better at predicting activity in different brain regions. Speech embeddings, representing the sounds of speech, were more strongly related to activity in areas involved in hearing and producing speech, such as the superior temporal cortex and the precentral gyrus. Language embeddings, representing the meaning of words, were more strongly related to activity in higher-level language areas, such as the inferior frontal gyrus and the angular gyrus. This pattern aligns with the well-established understanding of how language is processed in the brain, with a hierarchical organization from lower-level sensory and motor areas to higher-level cognitive areas.
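The core of an encoding model is a cross-validated linear fit from embeddings to each electrode's word-aligned activity, scored by correlating predicted and observed signals on held-out words. The sketch below illustrates this logic on synthetic stand-in data; it is not the authors' code.

```python
# Illustrative encoding model: predict one electrode's word-aligned activity
# from 50-dimensional embeddings and score with Pearson's r.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n_words, n_dims = 5000, 50
X = rng.standard_normal((n_words, n_dims))   # stand-in for word embeddings
y = X @ rng.standard_normal(n_dims) + rng.standard_normal(n_words)  # stand-in neural signal

# 10-fold cross-validation: every word is predicted by a model that never saw it.
y_pred = cross_val_predict(Ridge(alpha=1.0), X, y, cv=10)
r = np.corrcoef(y, y_pred)[0, 1]
print(f"encoding correlation r = {r:.2f}")
```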

Furthermore, the study found that the Whisper model outperformed traditional linguistic models, which rely on symbolic representations of language (like parts of speech and grammatical rules). This suggests that deep learning models, which learn statistical patterns from vast amounts of data, may capture aspects of language processing that are not captured by traditional, rule-based approaches. The study also examined the timing of brain activity and found that the model could capture fine-grained temporal patterns during both speech production and comprehension. For example, during speech production, there was evidence of brain activity related to the upcoming word even before the word was spoken, suggesting that the brain plans the word in advance. During speech comprehension, brain activity unfolded sequentially, tracking the speech signal in the order it arrived.

The study concludes that unified computational models, like Whisper, offer a promising new framework for studying the neural basis of natural language processing. These models can capture the entire processing hierarchy, from acoustics to meaning, and provide a more comprehensive and naturalistic view of how the brain processes language in real-world situations.

Research Impact and Future Directions

This study provides compelling evidence for a strong correspondence between a unified computational model of language (OpenAI's Whisper) and the neural activity observed in the human brain during natural conversations. The researchers demonstrate that different levels of linguistic representation within the model – acoustic, speech, and language – map onto distinct brain regions, mirroring the known hierarchical organization of language processing in the cortex. The model's ability to predict neural activity with high accuracy, even outperforming traditional symbolic models, suggests that deep learning approaches offer a promising avenue for understanding the complex neural mechanisms underlying language.

The work makes a significant contribution by moving beyond highly controlled experimental settings and investigating language processing in real-world, unconstrained conversations. This ecological approach, combined with the advanced computational modeling, provides a more naturalistic and comprehensive view of how the brain processes language. However, it's crucial to acknowledge that the study's findings are based on correlations between model representations and brain activity. While these correlations are strong and statistically significant, they do not definitively prove that the brain uses the same representations or computational principles as the model. Further research is needed to explore the causal relationships and to determine the extent to which these findings generalize to the broader population, given the study's small sample size of patients with epilepsy.

Despite these limitations, the study represents a significant step forward in bridging the gap between computational linguistics and neuroscience. The findings open up exciting avenues for future research, including investigating the temporal dynamics of language processing in more detail, exploring the role of individual differences, and developing more refined computational models that can capture even finer-grained aspects of neural language processing. The potential applications of this research extend to clinical settings, where a better understanding of the neural basis of language could lead to improved diagnostic and therapeutic tools for language disorders.

Critical Analysis and Recommendations

Clear Summary of Key Findings (written-content)
The abstract effectively summarizes the key findings, highlighting the alignment between the model's processing hierarchy and the brain's cortical hierarchy for speech and language. This concise overview provides readers with a clear understanding of the study's main result, increasing accessibility and impact.
Section: Abstract
Missing Explicit Research Question (written-content)
The abstract does not explicitly state the research question at the beginning. Adding a clear statement of the research question (e.g., "This study investigates how the human brain processes natural language during everyday conversations...") would provide immediate context and improve the abstract's overall clarity and impact.
Section: Abstract
Critique of Traditional Approaches (written-content)
The introduction effectively establishes the limitations of traditional psycholinguistic approaches in capturing the complexities of real-world conversations. This critique sets the stage for the need for a new, unified computational framework, justifying the study's approach.
Section: Introduction
Missing Explicit Research Question (written-content)
The introduction lacks a concise statement of the specific research question being addressed. Adding a sentence like, "This study aims to investigate the neural mechanisms underlying natural language processing during real-world conversations..." would immediately orient the reader to the study's purpose.
Section: Introduction
Accurate Prediction of Neural Activity (written-content)
Whisper's embeddings accurately predicted neural activity during natural conversations (correlations ranging from 0.04 to 0.40, P < 0.01, FWER corrected, Fig. 2). This was demonstrated through encoding models that mapped the model's internal representations onto brain activity recorded via ECoG. This finding provides strong evidence for the alignment between the computational model and brain activity, suggesting that the model captures relevant aspects of neural language processing.
Section: Results
Hierarchical Organization of Encoding (written-content)
Speech embeddings better predicted activity in lower-level speech perception/production areas (superior temporal cortex, precentral gyrus), while language embeddings better predicted activity in higher-order language areas (inferior frontal gyrus, angular gyrus) (Fig. 3). This hierarchical organization was revealed through variance partitioning, quantifying the unique contribution of each embedding type. This finding supports the established understanding of a hierarchical organization of language processing in the brain, extending it to naturalistic conversational settings.
Section: Results
Missing Effect Sizes and Confidence Intervals (written-content)
The Results section lacks consistent reporting of effect sizes and confidence intervals alongside p-values. Including these measures (e.g., Cohen's d, Pearson's r, and their confidence intervals) would provide a more complete picture of the magnitude and reliability of the findings, allowing for a better assessment of practical significance (a bootstrap sketch for such an interval appears after this list).
Section: Results
Consideration of Different Interpretations (written-content)
The discussion thoughtfully considers different interpretations of the relationship between the model's internal representations and brain activity. It presents both a conservative view (the model learns the transformation between distinct codes) and a more speculative one (the model and brain share computational principles), offering a balanced perspective.
Section: Discussion
Missing Acknowledgment of Limitations (written-content)
The discussion does not explicitly acknowledge the study's limitations, such as the small sample size (N=4) and the specific patient population (individuals with epilepsy). Addressing these limitations would enhance the paper's credibility and provide a more nuanced interpretation of the results.
Section: Discussion
Comprehensive Preprocessing Pipeline (written-content)
The Methods section meticulously describes the preprocessing pipeline for both speech and ECoG recordings. This includes steps for de-identification, transcription, alignment, artifact mitigation, and signal processing, enhancing the reproducibility of the study.
Section: Methods
Incomplete Description of Manual Verification (written-content)
The Methods section does not fully detail the criteria used for manual verification and adjustment of word onset and offset times. Providing more detail on this manual correction process would enhance transparency and allow other researchers to replicate this crucial step.
Section: Methods
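
As referenced above, a confidence interval for an encoding correlation is straightforward to obtain by bootstrapping over words. The helper below is a generic sketch, assuming word-aligned vectors of observed and predicted activity; it is not taken from the paper.

```python
# Sketch: bootstrap 95% CI for an encoding correlation (resampling over words).
import numpy as np

def bootstrap_r_ci(y_true, y_pred, n_boot=2000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y_true)
    rs = np.empty(n_boot)
    for b in range(n_boot):
        i = rng.integers(0, n, size=n)            # resample words with replacement
        rs[b] = np.corrcoef(y_true[i], y_pred[i])[0, 1]
    return np.percentile(rs, [2.5, 97.5])
```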

Section Analysis

Abstract

Key Aspects

Strengths

Suggestions for Improvement

Introduction

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Fig. 1 | An ecological, dense-sampling paradigm for modelling neural activity...
Full Caption

Fig. 1 | An ecological, dense-sampling paradigm for modelling neural activity during real-world conversations.

First Reference in Text
The Whisper architecture incorporates a multilayer encoder network and a multilayer decoder network (Fig. 1): the encoder maps continuous acoustic inputs into a high-dimensional embedding space, capturing speech features which are transferred into a word-level decoder, effectively mapping them into contextual word embeddings (refs. 21–23).
Description
  • Overview of the experimental and computational approach: This figure outlines the overall experimental and computational approach used in the study. It shows how the researchers recorded brain activity (using electrocorticography, or ECoG, which is like a more detailed EEG that involves placing electrodes directly on the brain's surface) while people were having natural conversations. They simultaneously recorded the audio of these conversations. The audio and transcriptions of the conversations were then fed into a powerful computer model called "Whisper," which is a type of deep learning model. Deep learning models are complex algorithms that learn patterns from data, much like how a brain learns. Whisper is specifically designed to process speech. The figure shows that the researchers extracted different types of information, called "embeddings," from Whisper. These embeddings represent different aspects of the speech, from low-level acoustic features (the raw sounds) to higher-level linguistic information (the meaning of the words). They then used a mathematical technique called linear regression to see how well these embeddings could predict the brain activity they recorded. Linear regression, in simple terms, is like finding the best-fitting line through a set of data points, allowing you to predict one variable based on another.
  • Dense-sampling paradigm: The figure highlights a "dense-sampling paradigm." This refers to the continuous and extensive recording of neural activity (24/7) during real-life conversations. It contrasts with traditional experiments that often use short, controlled stimuli. This approach aims to capture the natural complexity of language processing in a more realistic setting. The diagram shows the timeline of conversations ('How are you today?' and 'I feel better...'), indicating periods of speech production (purple) and comprehension (green).
  • Types of embeddings extracted from the Whisper model: The figure shows three key types of "embeddings" extracted from the Whisper model: acoustic embeddings, speech embeddings, and language embeddings. Acoustic embeddings represent the raw auditory input to the model. Speech embeddings are taken from the final layer of Whisper's "encoder," which transforms the acoustic input into a representation of speech sounds. Language embeddings are taken from the "decoder," which converts the speech representation into a representation of the meaning of the words. The figure shows these as different layers, reflecting the hierarchical processing within the Whisper model. The dimensionality reduction using Principal Component Analysis (PCA) to 50 dimensions is mentioned. PCA is a technique to reduce the number of variables while retaining most of the original information. It's like summarizing a large dataset with a smaller set of key features.
  • Linear regression analysis: The figure visually represents the linear regression analysis. This analysis attempts to find a mathematical relationship between the embeddings (acoustic, speech, or language) and the recorded brain activity. It's depicted with equations showing how the embeddings (X) are multiplied by weights (β) to predict neural activity. The 'Beta weights' represent the strength of the relationship between each embedding dimension and the brain activity. The goal is to see how well the model's internal representations (the embeddings) can predict real brain activity during natural conversations. The figure includes a schematic of brain coverage, showing the locations of the electrodes in the four participants (S1-S4).
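In equation form, the regression depicted in the figure fits weights that map embeddings to neural activity. The exact estimator is not stated in the figure, so the ridge form below (which reduces to ordinary least squares at λ = 0) is an assumption:

```latex
\hat{y} = X\hat{\beta}, \qquad
\hat{\beta} = \arg\min_{\beta} \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2
            = (X^{\top}X + \lambda I)^{-1} X^{\top} y
```

Here X is the (number of words) × 50 matrix of PCA-reduced embeddings and y is one electrode's word-aligned activity.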
Scientific Validity
  • Overall methodological approach: The figure presents a valid and innovative approach to studying neural activity during natural conversations. The use of a dense-sampling paradigm and a powerful speech-to-text model (Whisper) is a significant strength. The application of linear regression to relate model embeddings to brain activity is a standard and appropriate method for this type of analysis.
  • Dimensionality reduction using PCA: The use of PCA for dimensionality reduction is justified, given the high dimensionality of the embeddings. However, it would be beneficial to provide more detail about the PCA procedure, such as the amount of variance explained by the 50 components (a quick check is sketched after this list).
  • Extraction of embeddings: The figure clearly outlines the process of extracting embeddings from different layers of the Whisper model. This is crucial for understanding the hierarchical nature of the analysis and the comparison of acoustic, speech, and language representations.
  • Visualization of Brain Coverage: The depiction of brain coverage is helpful, but a more detailed visualization, perhaps showing individual electrode locations, would be beneficial. It's also important to note that the coverage is limited to the left hemisphere, which should be explicitly stated in the figure legend.
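Where the embeddings are available, the variance retained by 50 components, as suggested above, is a one-line check. The sketch assumes `embeddings` is an n-words × d matrix of model states; it is not code from the paper.

```python
# Sketch: how much variance do 50 principal components retain?
from sklearn.decomposition import PCA

pca = PCA(n_components=50).fit(embeddings)   # `embeddings`: n_words x d (assumed)
reduced = pca.transform(embeddings)          # 50-dimensional encoding-model inputs
print(f"variance retained: {pca.explained_variance_ratio_.sum():.1%}")
```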
Communication
  • Clarity and organization of the visual representation: The figure effectively introduces the core components of the study's methodology, offering a clear visual representation of the data collection and analysis pipeline. The use of distinct colors and labels for different stages (Comprehension, Production, and different embedding types) enhances readability. However, the figure is quite complex and could benefit from a more streamlined layout to improve immediate comprehension, perhaps by separating the production and comprehension pipelines more distinctly.
  • Completeness of the figure legend: The figure legend is concise but could be expanded to provide a more detailed explanation of each component, especially the 'Encoder stack' and 'Decoder stack'. While the main text elaborates on these, a self-contained explanation within the figure caption would improve stand-alone understanding.

Results

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Fig. 2 | Acoustic, speech and language encoding model performance during speech...
Full Caption

Fig. 2 | Acoustic, speech and language encoding model performance during speech production and comprehension.

First Reference in Text
Whisper's acoustic, speech and language embeddings predicted neural activity with remarkable accuracy across conversations comprising hundreds of thousands of words during both speech production and comprehension for numerous electrodes in various regions of the cortical language network (Fig. 2).
Description
  • Overview of encoding model performance: This figure presents the results of how well the different types of information extracted from the "Whisper" model (acoustic, speech, and language) can predict brain activity during both speaking (production) and listening (comprehension). The researchers used a technique called "encoding models" to do this. Think of an encoding model as a way to translate between the language of the computer model (the embeddings) and the language of the brain (the neural activity). The better the translation, the better the model is at capturing what's happening in the brain.
  • Color-coded brain maps representing correlation (r): The figure shows brain maps, color-coded to represent the strength of the prediction (correlation, represented by 'r'). A correlation is a number between -1 and 1 that indicates how well two things are related. A correlation of 0 means no relationship, while 1 (or -1) means a perfect positive (or negative) relationship. Here, the colors represent the correlation between the predicted brain activity (based on the Whisper model's embeddings) and the actual recorded brain activity. The colors range from 0.04 (light yellow) to 0.40 (dark red), indicating varying degrees of positive correlation. The N values (N=64, N=274, etc.) indicate the number of electrodes included in each map.
  • Separate panels for production, comprehension, and embedding types: There are separate brain maps for speech production (when people were talking) and speech comprehension (when people were listening). Within each of these, there are maps for the acoustic embeddings (representing the raw sound), speech embeddings (representing the recognized speech sounds), and language embeddings (representing the meaning of the words). This allows us to see which type of information from the Whisper model best predicts brain activity in different brain areas during different tasks.
  • Statistical Significance: The figure shows results that are statistically significant. The statement 'P < 0.01, FWER' means that the probability of observing these results by chance is less than 1%, and this has been corrected for multiple comparisons using the Family-Wise Error Rate (FWER) method. FWER correction is a way to reduce the chances of getting false positives when you're doing many statistical tests at once (in this case, testing many electrodes).
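One standard way to obtain an FWER-corrected threshold across many electrodes is the max-statistic permutation approach sketched below. Whether the authors used exactly this scheme is an assumption; the sketch simply illustrates what a threshold like 'P < 0.01, FWER' controls.

```python
# Sketch: max-statistic permutation threshold for FWER control across electrodes.
import numpy as np

def fwer_threshold(y_true, y_pred, n_perm=1000, alpha=0.01, seed=0):
    """y_true, y_pred: (n_words, n_electrodes) observed and predicted activity."""
    rng = np.random.default_rng(seed)
    n_words, n_elec = y_true.shape
    max_null = np.empty(n_perm)
    for p in range(n_perm):
        perm = rng.permutation(n_words)       # break the word-to-word alignment
        null_rs = [np.corrcoef(y_true[perm, e], y_pred[:, e])[0, 1]
                   for e in range(n_elec)]
        max_null[p] = max(null_rs)            # keep the largest null correlation
    # An electrode survives if its observed r exceeds the (1 - alpha) quantile
    # of the maximum-across-electrodes null distribution.
    return np.quantile(max_null, 1 - alpha)
```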
Scientific Validity
  • Overall methodological approach: The figure presents compelling evidence for the alignment between the Whisper model's internal representations and neural activity during natural language processing. The use of a large dataset (hundreds of thousands of words) and multiple electrodes strengthens the generalizability of the findings.
  • Statistical significance: The use of a rigorous statistical threshold (P < 0.01, FWER corrected) provides confidence that the observed correlations are not due to chance.
  • Comparison of different embedding types: The presentation of results for different embedding types (acoustic, speech, and language) allows for a nuanced understanding of how different levels of linguistic information are encoded in the brain.
  • Correlation vs. Causation: While the figure shows impressive results, it's important to acknowledge that correlation does not equal causation. The observed correlations suggest an alignment between the model and the brain, but they do not prove that the brain uses the same representations as the model. Further investigation is needed to explore the causal relationship.
Communication
  • Clarity and organization of the visual representation: The figure effectively visualizes the encoding performance for three different embedding types (acoustic, speech, and language) across multiple brain regions. The use of color-coded brain maps allows for a quick comparison of performance across conditions (production and comprehension). However, the figure could benefit from a clearer indication of the scale for the correlation values (r). While the range is stated (0.04 - 0.40), adding tick marks or labels on the color bar would improve readability.
  • Completeness of the figure legend: The figure legend is concise but could be more informative. For example, explicitly stating that 'N' refers to the number of electrodes would be helpful. Also, clarifying the meaning of the 'P < 0.01, FWER' threshold in the legend would enhance stand-alone understanding.
  • Panel organization and layout: The use of separate panels for production and comprehension, and for each embedding type, makes it easy to compare the results across these different conditions. The layout is logical and well-organized.
Fig. 3 | Mixed selectivity for speech and language embeddings during speech...
Full Caption

Fig. 3 | Mixed selectivity for speech and language embeddings during speech production and comprehension.

First Reference in Text
We observed different selectivity patterns for speech and language embeddings, each accounting for different portions of the variance across different cortical areas (Fig. 3).
Description
  • Overall concept of mixed selectivity: This figure shows how well different types of information from the Whisper model – specifically, speech sounds (speech embeddings) and the meaning of words (language embeddings) – predict brain activity in different parts of the brain, and how this changes depending on whether someone is talking (production) or listening (comprehension). The main idea is to see which parts of the brain are more sensitive to the sounds of speech versus the meaning of the words.
  • Color-coded brain maps and unique variance explained: The figure uses color-coded brain maps. The color at each location on the brain represents which type of information (speech or language) is better at predicting brain activity in that area. Red means speech sounds are more important, blue means the meaning of the words is more important, and white means it's a mix of both. The colors show the percentage of "unique variance explained." Variance, in this context, is a measure of how much the brain activity changes over time. 'Unique variance explained' means how much of that change can be predicted by only one type of information (either speech or language), after taking into account any overlap between them.
  • Separate maps for production and comprehension: There are separate brain maps for when people are talking (speech production) and when they are listening (speech comprehension). This allows us to see if the patterns of brain activity are different for these two processes.
  • Individual electrode plots showing temporal dynamics: In addition to the brain maps, there are smaller graphs showing the correlation between predicted and actual brain activity over time (the x-axis is labeled 'Lag (s)', meaning time in seconds). These graphs are for specific electrodes in specific brain regions (like IFG, STG, etc.). The red line shows the correlation for speech embeddings (sounds), and the blue line shows the correlation for language embeddings (meaning). These graphs show how the relationship between the model's information and brain activity changes over a short period of time around when a word is spoken or heard.
  • Statistical threshold and FDR correction: The dotted horizontal line in each of the smaller graphs represents the statistical threshold. This means that any correlation above that line is considered statistically significant, meaning it's unlikely to have happened by chance. The text mentions that the threshold is q < 0.01, two-sided, FDR corrected. This means the probability of a false positive is less than 1%, and this has been adjusted for multiple comparisons using the False Discovery Rate (FDR) method.
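The Benjamini-Hochberg procedure behind an FDR threshold such as q < 0.01 can be applied in a single call; the p-values below are assumed to come from a per-electrode (or per-lag) significance test, and this is an illustration rather than the authors' code.

```python
# Sketch: Benjamini-Hochberg FDR correction over per-electrode p values.
from statsmodels.stats.multitest import multipletests

# `p_values`: one p value per electrode/lag from a permutation or parametric test (assumed)
reject, q_values, _, _ = multipletests(p_values, alpha=0.01, method="fdr_bh")
significant = reject   # boolean mask: which correlations clear q < 0.01
```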
Scientific Validity
  • Overall methodological approach: The figure presents a novel and insightful analysis of the differential roles of speech and language representations in the brain. The use of variance partitioning to quantify the unique contribution of each embedding type is a strong methodological approach (a sketch of the computation follows this list).
  • Comparison of production and comprehension: The inclusion of both production and comprehension data allows for a comparison of the neural substrates involved in these two fundamental aspects of language processing.
  • Presentation of group and individual data: The presentation of results at both the group level (brain maps) and the individual electrode level (plots) provides a comprehensive view of the data.
  • Statistical analysis: The statistical analysis appears to be rigorous, with appropriate correction for multiple comparisons (FDR correction).
  • Interpretation of selectivity: The figure focuses on selectivity, which is the relative importance of speech vs. language. It's important to note that even in areas showing strong selectivity for one type of information, the other type might still contribute to neural activity. The figure does not imply that these areas are exclusively involved in processing only one type of information.
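The unique-variance logic can be made concrete in a few lines: fit the joint model and each single-embedding model, then subtract. Variable names are assumptions, and the authors' exact cross-validation scheme may differ.

```python
# Sketch of variance partitioning: unique variance = R^2(full) - R^2(other alone).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

def cv_r2(X, y):
    pred = cross_val_predict(Ridge(alpha=1.0), X, y, cv=10)
    return np.corrcoef(y, pred)[0, 1] ** 2

# X_speech, X_language: word-aligned 50-d embeddings; y: one electrode (all assumed)
r2_full = cv_r2(np.hstack([X_speech, X_language]), y)
unique_speech = r2_full - cv_r2(X_language, y)     # variance only speech explains
unique_language = r2_full - cv_r2(X_speech, y)     # variance only language explains
```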
Communication
  • Overall organization and clarity: The figure presents a complex set of results, comparing encoding performance for speech and language embeddings during both production and comprehension across various brain regions. The use of separate brain maps for production and comprehension, colored according to the proportion of unique variance explained, is effective for visualizing the spatial distribution of selectivity. The inclusion of individual electrode plots with correlation over time adds another layer of detail. However, the sheer amount of information presented makes the figure somewhat overwhelming. The small size of the individual plots and the lack of clear visual separation between the production and comprehension sections make it challenging to quickly grasp the key findings.
  • Color scheme and representation of mixed selectivity: The color scheme used to represent the percentage of unique variance (ranging from red for speech to blue for language) is intuitive, but the addition of a 'mixed' category (white) adds complexity. It might be helpful to provide a more explicit explanation of what constitutes 'mixed' selectivity in the figure legend.
  • Readability of individual electrode plots: The individual electrode plots are useful for showing the temporal dynamics of encoding performance, but the x-axis labels ('Lag (s)') are small and could be more prominent. Adding tick marks or grid lines to the plots might also improve readability.
  • Use of abbreviations: The use of abbreviations for brain regions (e.g., STG, IFG, preCG) is standard practice, but including a key or expanding these abbreviations in the figure legend would make the figure more accessible to a broader audience.
Fig. 4 | Enhanced encoding for language embeddings fused with auditory speech...
Full Caption

Fig. 4 | Enhanced encoding for language embeddings fused with auditory speech features.

First Reference in Text
In testing both sets of embeddings, we observed that encoding performance for language embeddings was significantly higher when the language decoder received speech information from the encoder, during both production (Fig. 4a) and comprehension (Fig. 4b).
Description
  • Comparison of language embeddings with and without auditory input: This figure compares how well two different types of language information from the Whisper model predict brain activity. The first type ('Only text') is based solely on the written words of the conversation. The second type ('Text + audio') combines the written words with the actual sounds of the speech. The researchers are testing whether adding the sound information improves the prediction of brain activity.
  • Separate panels for production and comprehension: The figure shows results for both when people are talking (production, panel a) and when they are listening (comprehension, panel b). This allows us to see if the effect of adding sound information is different for these two processes.
  • Brain maps showing the difference in correlation: The brain maps show the difference in prediction accuracy between the two types of language information. The colors represent the 'Δ correlation,' which is the difference in correlation values between the 'Text + audio' model and the 'Only text' model. Warmer colors (closer to 0.050) mean the 'Text + audio' model is better, while cooler colors (closer to -0.050) mean the 'Only text' model is better. The 'N' values indicate the number of electrodes.
  • Line graphs showing correlation over time: The line graphs show the correlation values over time (the x-axis is 'Lag (s)', meaning time in seconds) for all electrodes ('All') and for electrodes in the inferior frontal gyrus ('IFG'). The blue line represents the 'Only text' model, and the pink line represents the 'Text + audio' model. These graphs show how the relationship between the model's information and brain activity changes over a short period of time around when a word is spoken or heard.
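The 'Lag (s)' axes in these plots come from re-fitting the encoding model at shifted time points around each word's onset. The schematic helper below shows the idea; all inputs, including the `encode` helper, are assumed (e.g., the cross-validated fit sketched for Fig. 1).

```python
# Schematic: encoding performance as a function of lag around word onset.
import numpy as np

def lag_profile(signal, sfreq, onsets, X, encode):
    """signal: one electrode's time series; sfreq: sampling rate (Hz);
    onsets: word-onset sample indices; X: word embeddings;
    encode(X, y) -> r: a cross-validated encoding fit (assumed)."""
    lags = np.arange(-2.0, 2.25, 0.25)             # seconds relative to onset
    rs = []
    for lag in lags:
        idx = onsets + int(round(lag * sfreq))     # shift every word by this lag
        rs.append(encode(X, signal[idx]))          # one encoding fit per lag
    return lags, np.array(rs)
```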
Scientific Validity
  • Overall methodological approach: The figure provides strong evidence that incorporating auditory speech features improves the encoding performance of language embeddings. This supports the idea that the brain integrates acoustic and linguistic information during both speech production and comprehension.
  • Comparison of production and comprehension: The use of separate analyses for production and comprehension allows for a comparison of the effects of auditory information in these two processes.
  • Presentation of results at different levels: The presentation of results at both the group level (brain maps) and for specific regions (IFG) provides a more detailed view of the data. However, providing similar plots for other regions (like STG) in supplementary materials could further strengthen the findings.
Communication
  • Overall organization and clarity: The figure is divided into two main sections (a and b), clearly separating the results for production and comprehension. The use of brain maps and line graphs effectively visualizes the comparison between language embeddings with and without auditory input. However, the brain maps are relatively small, and the color difference between 'Only text' and 'Text + audio' is subtle, making it somewhat difficult to distinguish between them. The line graphs, while informative, could benefit from more prominent axis labels and tick marks.
  • Caption descriptiveness: The caption is concise but could be more descriptive. It would be helpful to explicitly state what 'enhanced encoding' means in this context (i.e., higher correlation values).
  • Consistency and clarity of notation: The use of 'N' to represent the number of electrodes is consistent with previous figures, but a reminder in the legend would still be beneficial for stand-alone understanding.
Fig. 5 | Comparing speech and language embeddings to symbolic features.
First Reference in Text
Our findings indicate that speech and language embeddings extracted from the multimodal, deep acoustic-to-speech-to-language model outperform symbolic speech and language features (Fig. 5) in predicting neural activity during natural conversations.
Description
  • Comparison of embeddings and symbolic features: This figure compares two different ways of representing speech and language in the Whisper model: embeddings and symbolic features. Embeddings are the internal representations learned by the deep learning model, while symbolic features are traditional linguistic features like phonemes (speech sounds) and parts of speech (nouns, verbs, etc.). The figure shows how well each of these representations predicts brain activity during speech production and comprehension.
  • Panel (a): Speech embeddings vs. symbolic speech features: Panel (a) focuses on speech. It compares how well the speech embeddings (from the Whisper encoder) and symbolic speech features (like phonemes and articulation features) predict brain activity. The line graphs show the correlation between predicted and actual brain activity over time. The red line represents deep speech embeddings, and the orange line represents symbolic speech features.
  • Panel (b): Language embeddings vs. symbolic language features: Panel (b) focuses on language. It compares how well the language embeddings (from the Whisper decoder) and symbolic language features (like parts of speech and syntactic dependencies) predict brain activity. The blue line represents deep language embeddings, and the light blue line represents symbolic language features.
  • Brain maps showing unique variance explained: The brain maps show the percentage of unique variance explained by each type of representation (deep vs. symbolic). This means how much of the change in brain activity can be predicted by only that type of representation, after taking into account any overlap between them. The color coding for % unique variance explained in the brain maps indicates the relative importance of the representations.
  • Statistical significance: The dotted horizontal line in the line graphs represents a statistical threshold. Correlations above this line are considered statistically significant. The text indicates that red dots (in panel a) and blue dots (panel b) indicate a statistically significant difference in performance between the deep embeddings and the symbolic features.
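As a concrete illustration of what symbolic language features can look like, the sketch below one-hot encodes part-of-speech and dependency labels with spaCy. The paper's exact feature set is not reproduced, and the `en_core_web_sm` pipeline is an assumed stand-in.

```python
# Sketch: building symbolic language features (POS and dependency tags) per word.
import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")        # assumed English pipeline
doc = nlp("I feel better today")

pos_tags = sorted({t.pos_ for t in doc})  # tag vocabularies built from this text only
dep_tags = sorted({t.dep_ for t in doc})
features = np.zeros((len(doc), len(pos_tags) + len(dep_tags)))
for i, tok in enumerate(doc):
    features[i, pos_tags.index(tok.pos_)] = 1                   # one-hot POS
    features[i, len(pos_tags) + dep_tags.index(tok.dep_)] = 1   # one-hot dependency
# `features` would then enter the same encoding pipeline as the deep embeddings.
```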
Scientific Validity
  • Overall methodological approach: The figure provides strong evidence that deep embeddings outperform symbolic features in predicting neural activity during natural conversations. This supports the idea that deep learning models capture aspects of language processing that are not captured by traditional linguistic features.
  • Comparison of different conditions and representations: The use of separate analyses for production and comprehension, and for speech and language, allows for a detailed comparison of the different representations.
  • Analysis of different brain regions: The inclusion of results for different brain regions (All, STG, IFG) provides insights into the spatial distribution of encoding performance.
  • Statistical Analysis: The statistical analysis, including the use of FDR correction, appears to be appropriate.
  • Variance partitioning: The variance partitioning analysis is a strong method for quantifying the unique contribution of each representation.
Communication
  • Overall organization and clarity: The figure is divided into two main sections (a and b), clearly separating the comparison for speech and language embeddings. Within each section, results are presented for both production and comprehension, and for different brain regions (All, STG, IFG). The use of line graphs to show correlation over time is effective, and the color-coding (deep vs. symbolic) is consistent. However, the brain images showing unique variance explained are quite small and lack detailed anatomical labels, making it difficult to precisely identify the regions where differences are observed.
  • Caption descriptiveness: The caption is concise but could be more informative. It would be helpful to explicitly state what 'symbolic features' are being compared to the embeddings.
  • Readability of line graphs: The x-axis labels ('Lag (s)') on the line graphs are small and could be more prominent. Adding tick marks or grid lines might also improve readability.
  • Color scheme in brain maps: The color scheme used in the brain maps to represent % unique correlation could be confusing: it uses different colors for deep versus symbolic features, where one might expect a single-color gradient. Readers must rely on the in-figure description to interpret the maps correctly.
Fig. 6 | Representations of phonetic and lexical information in Whisper.