A unified acoustic-to-speech-to-language embedding space captures the neural basis of natural language processing in everyday conversations

Ariel Goldstein, Haocheng Wang, Leonard Niekerken, Mariano Schain, Zaid Zada, Bobbi Aubrey, Tom Sheffer, Samuel A. Nastase, Harshvardhan Gazula, Aditi Singh, Aditi Rao, Gina Choe, Catherine Kim, Werner Doyle, Daniel Friedman, Sasha Devore, Patricia Dugan, Avinatan Hassidim, Michael Brenner, Yossi Matias, Orrin Devinsky, Adeen Flinker, Uri Hasson
Nature Human Behaviour
Department of Psychology and the Princeton Neuroscience Institute, Princeton University, Princeton, NJ, USA

Overall Summary

Study Background and Main Findings

This study investigated how the human brain processes natural language during real-world conversations. Researchers recorded brain activity using electrocorticography (ECoG), a technique that involves placing electrodes directly on the brain's surface, while participants engaged in unscripted conversations with family, friends, and hospital staff. This approach provided a rich dataset of approximately 100 hours of continuous recordings, encompassing nearly half a million words. The key innovation was the use of a state-of-the-art, multimodal speech-to-text model called Whisper (developed by OpenAI) to analyze the audio recordings and their transcripts; the model's internal representations were then related to the recorded brain activity. Whisper is a deep learning model, meaning it's a complex algorithm that learns patterns from data, similar to how a brain learns. It's trained to process speech and convert it into text, and it does so by extracting different levels of linguistic information, from the raw sounds (acoustic features) to the recognized speech sounds (speech features) and finally to the meaning of the words (language features). These different levels of information are represented within the model as "embeddings," which are numerical vectors that capture different aspects of the language.
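To make the notion of extracting embeddings concrete, here is a minimal sketch using the Hugging Face transformers implementation of Whisper. The checkpoint, variable names, and teacher-forced transcript are illustrative assumptions; the authors' exact extraction pipeline is not reproduced here.

```python
# Sketch (not the authors' pipeline): pulling Whisper's encoder ("speech")
# and decoder ("language") embeddings via Hugging Face transformers.
import torch
from transformers import WhisperProcessor, WhisperModel

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperModel.from_pretrained("openai/whisper-tiny")

# `audio` is assumed: a 1-D float array holding one conversation snippet at 16 kHz.
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Feed the known transcript to the decoder (teacher forcing), so decoder states
# line up with the words actually spoken.
decoder_ids = processor.tokenizer("how are you today", return_tensors="pt").input_ids

with torch.no_grad():
    out = model(input_features=inputs.input_features,
                decoder_input_ids=decoder_ids,
                output_hidden_states=True)

speech_emb = out.encoder_last_hidden_state   # final encoder layer: speech features
language_emb = out.last_hidden_state         # final decoder layer: contextual word features
```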

The researchers then used a technique called "encoding models" to see how well these embeddings from Whisper could predict the brain activity they recorded. They found that the embeddings could predict brain activity with remarkable accuracy, with correlations between predicted and observed activity reaching roughly r = 0.40 at the best electrodes. Moreover, different types of embeddings were better at predicting activity in different brain regions. Speech embeddings, representing the sounds of speech, were more strongly related to activity in areas involved in hearing and producing speech, such as the superior temporal cortex and the precentral gyrus. Language embeddings, representing the meaning of words, were more strongly related to activity in higher-level language areas, such as the inferior frontal gyrus and the angular gyrus. This pattern aligns with the well-established understanding of how language is processed in the brain, with a hierarchical organization from lower-level sensory and motor areas to higher-level cognitive areas.
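The core of an encoding model is a cross-validated linear fit from embeddings to each electrode's word-aligned activity, scored by correlating predicted and observed signals on held-out words. The sketch below illustrates this logic on synthetic stand-in data; it is not the authors' code.

```python
# Illustrative encoding model: predict one electrode's word-aligned activity
# from 50-dimensional embeddings and score with Pearson's r.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n_words, n_dims = 5000, 50
X = rng.standard_normal((n_words, n_dims))   # stand-in for word embeddings
y = X @ rng.standard_normal(n_dims) + rng.standard_normal(n_words)  # stand-in neural signal

# 10-fold cross-validation: every word is predicted by a model that never saw it.
y_pred = cross_val_predict(Ridge(alpha=1.0), X, y, cv=10)
r = np.corrcoef(y, y_pred)[0, 1]
print(f"encoding correlation r = {r:.2f}")
```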

Furthermore, the study found that the Whisper model outperformed traditional linguistic models, which rely on symbolic representations of language (like parts of speech and grammatical rules). This suggests that deep learning models, which learn statistical patterns from vast amounts of data, may capture aspects of language processing that are not captured by traditional, rule-based approaches. The study also examined the timing of brain activity and found that the model could capture fine-grained temporal patterns during both speech production and comprehension. For example, during speech production, there was evidence of brain activity related to the upcoming word even before the word was spoken, suggesting that the brain plans the word in advance. During speech comprehension, brain activity unfolded sequentially, tracking the speech signal in the order it arrived.

The study concludes that unified computational models, like Whisper, offer a promising new framework for studying the neural basis of natural language processing. These models can capture the entire processing hierarchy, from acoustics to meaning, and provide a more comprehensive and naturalistic view of how the brain processes language in real-world situations.

Research Impact and Future Directions

This study provides compelling evidence for a strong correspondence between a unified computational model of language (OpenAI's Whisper) and the neural activity observed in the human brain during natural conversations. The researchers demonstrate that different levels of linguistic representation within the model – acoustic, speech, and language – map onto distinct brain regions, mirroring the known hierarchical organization of language processing in the cortex. The model's ability to predict neural activity with high accuracy, even outperforming traditional symbolic models, suggests that deep learning approaches offer a promising avenue for understanding the complex neural mechanisms underlying language.

The work makes a significant contribution by moving beyond highly controlled experimental settings and investigating language processing in real-world, unconstrained conversations. This ecological approach, combined with the advanced computational modeling, provides a more naturalistic and comprehensive view of how the brain processes language. However, it's crucial to acknowledge that the study's findings are based on correlations between model representations and brain activity. While these correlations are strong and statistically significant, they do not definitively prove that the brain uses the same representations or computational principles as the model. Further research is needed to explore the causal relationships and to determine the extent to which these findings generalize to the broader population, given the study's small sample size of patients with epilepsy.

Despite these limitations, the study represents a significant step forward in bridging the gap between computational linguistics and neuroscience. The findings open up exciting avenues for future research, including investigating the temporal dynamics of language processing in more detail, exploring the role of individual differences, and developing more refined computational models that can capture even finer-grained aspects of neural language processing. The potential applications of this research extend to clinical settings, where a better understanding of the neural basis of language could lead to improved diagnostic and therapeutic tools for language disorders.

Critical Analysis and Recommendations

Clear Summary of Key Findings (written-content)
The abstract effectively summarizes the key findings, highlighting the alignment between the model's processing hierarchy and the brain's cortical hierarchy for speech and language. This concise overview provides readers with a clear understanding of the study's main result, increasing accessibility and impact.
Section: Abstract
Missing Explicit Research Question (written-content)
The abstract does not explicitly state the research question at the beginning. Adding a clear statement of the research question (e.g., "This study investigates how the human brain processes natural language during everyday conversations...") would provide immediate context and improve the abstract's overall clarity and impact.
Section: Abstract
Critique of Traditional Approaches (written-content)
The introduction effectively establishes the limitations of traditional psycholinguistic approaches in capturing the complexities of real-world conversations. This critique sets the stage for the need for a new, unified computational framework, justifying the study's approach.
Section: Introduction
Missing Explicit Research Question (written-content)
The introduction lacks a concise statement of the specific research question being addressed. Adding a sentence like, "This study aims to investigate the neural mechanisms underlying natural language processing during real-world conversations..." would immediately orient the reader to the study's purpose.
Section: Introduction
Accurate Prediction of Neural Activity (written-content)
Whisper's embeddings accurately predicted neural activity during natural conversations (correlations ranging from 0.04 to 0.40, P < 0.01, FWER corrected, Fig. 2). This was demonstrated through encoding models that mapped the model's internal representations onto brain activity recorded via ECoG. This finding provides strong evidence for the alignment between the computational model and brain activity, suggesting that the model captures relevant aspects of neural language processing.
Section: Results
Hierarchical Organization of Encoding (written-content)
Speech embeddings better predicted activity in lower-level speech perception/production areas (superior temporal cortex, precentral gyrus), while language embeddings better predicted activity in higher-order language areas (inferior frontal gyrus, angular gyrus) (Fig. 3). This hierarchical organization was revealed through variance partitioning, quantifying the unique contribution of each embedding type. This finding supports the established understanding of a hierarchical organization of language processing in the brain, extending it to naturalistic conversational settings.
Section: Results
Missing Effect Sizes and Confidence Intervals (written-content)
The Results section lacks consistent reporting of effect sizes and confidence intervals alongside p-values. Including these measures (e.g., Cohen's d, Pearson's r, and their confidence intervals) would provide a more complete picture of the magnitude and reliability of the findings, allowing for a better assessment of practical significance (a bootstrap sketch for such an interval appears after this list).
Section: Results
Consideration of Different Interpretations (written-content)
The discussion thoughtfully considers different interpretations of the relationship between the model's internal representations and brain activity. It presents both a conservative view (the model learns the transformation between distinct codes) and a more speculative one (the model and brain share computational principles), offering a balanced perspective.
Section: Discussion
Missing Acknowledgment of Limitations (written-content)
The discussion does not explicitly acknowledge the study's limitations, such as the small sample size (N=4) and the specific patient population (individuals with epilepsy). Addressing these limitations would enhance the paper's credibility and provide a more nuanced interpretation of the results.
Section: Discussion
Comprehensive Preprocessing Pipeline (written-content)
The Methods section meticulously describes the preprocessing pipeline for both speech and ECoG recordings. This includes steps for de-identification, transcription, alignment, artifact mitigation, and signal processing, enhancing the reproducibility of the study.
Section: Methods
Incomplete Description of Manual Verification (written-content)
The Methods section does not fully detail the criteria used for manual verification and adjustment of word onset and offset times. Providing more detail on this manual correction process would enhance transparency and allow other researchers to replicate this crucial step.
Section: Methods
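
As referenced above, a confidence interval for an encoding correlation is straightforward to obtain by bootstrapping over words. The helper below is a generic sketch, assuming word-aligned vectors of observed and predicted activity; it is not taken from the paper.

```python
# Sketch: bootstrap 95% CI for an encoding correlation (resampling over words).
import numpy as np

def bootstrap_r_ci(y_true, y_pred, n_boot=2000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y_true)
    rs = np.empty(n_boot)
    for b in range(n_boot):
        i = rng.integers(0, n, size=n)            # resample words with replacement
        rs[b] = np.corrcoef(y_true[i], y_pred[i])[0, 1]
    return np.percentile(rs, [2.5, 97.5])
```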

Section Analysis

Abstract

Key Aspects

Strengths

Suggestions for Improvement

Introduction

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Fig. 1 | An ecological, dense-sampling paradigm for modelling neural activity...
Full Caption

Fig. 1 | An ecological, dense-sampling paradigm for modelling neural activity during real-world conversations.

First Reference in Text
The Whisper architecture incorporates a multilayer encoder network and a multilayer decoder network (Fig. 1): the encoder maps continuous acoustic inputs into a high-dimensional embedding space, capturing speech features which are transferred into a word-level decoder, effectively mapping them into contextual word embeddings (refs. 21–23).
Description
  • Overview of the experimental and computational approach: This figure outlines the overall experimental and computational approach used in the study. It shows how the researchers recorded brain activity (using electrocorticography, or ECoG, which is like a more detailed EEG that involves placing electrodes directly on the brain's surface) while people were having natural conversations. They simultaneously recorded the audio of these conversations. The audio and transcriptions of the conversations were then fed into a powerful computer model called "Whisper," which is a type of deep learning model. Deep learning models are complex algorithms that learn patterns from data, much like how a brain learns. Whisper is specifically designed to process speech. The figure shows that the researchers extracted different types of information, called "embeddings," from Whisper. These embeddings represent different aspects of the speech, from low-level acoustic features (the raw sounds) to higher-level linguistic information (the meaning of the words). They then used a mathematical technique called linear regression to see how well these embeddings could predict the brain activity they recorded. Linear regression, in simple terms, is like finding the best-fitting line through a set of data points, allowing you to predict one variable based on another.
  • Dense-sampling paradigm: The figure highlights a "dense-sampling paradigm." This refers to the continuous and extensive recording of neural activity (24/7) during real-life conversations. It contrasts with traditional experiments that often use short, controlled stimuli. This approach aims to capture the natural complexity of language processing in a more realistic setting. The diagram shows the timeline of conversations ('How are you today?' and 'I feel better...'), indicating periods of speech production (purple) and comprehension (green).
  • Types of embeddings extracted from the Whisper model: The figure shows three key types of "embeddings" extracted from the Whisper model: acoustic embeddings, speech embeddings, and language embeddings. Acoustic embeddings represent the raw auditory input to the model. Speech embeddings are taken from the final layer of Whisper's "encoder," which transforms the acoustic input into a representation of speech sounds. Language embeddings are taken from the "decoder," which converts the speech representation into a representation of the meaning of the words. The figure shows these as different layers, reflecting the hierarchical processing within the Whisper model. The dimensionality reduction using Principal Component Analysis (PCA) to 50 dimensions is mentioned. PCA is a technique to reduce the number of variables while retaining most of the original information. It's like summarizing a large dataset with a smaller set of key features.
  • Linear regression analysis: The figure visually represents the linear regression analysis. This analysis attempts to find a mathematical relationship between the embeddings (acoustic, speech, or language) and the recorded brain activity. It's depicted with equations showing how the embeddings (X) are multiplied by weights (β) to predict neural activity. The 'Beta weights' represent the strength of the relationship between each embedding dimension and the brain activity. The goal is to see how well the model's internal representations (the embeddings) can predict real brain activity during natural conversations. The figure includes a schematic of brain coverage, showing the locations of the electrodes in the four participants (S1-S4).
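In equation form, the regression depicted in the figure fits weights that map embeddings to neural activity. The exact estimator is not stated in the figure, so the ridge form below (which reduces to ordinary least squares at λ = 0) is an assumption:

```latex
\hat{y} = X\hat{\beta}, \qquad
\hat{\beta} = \arg\min_{\beta} \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2
            = (X^{\top}X + \lambda I)^{-1} X^{\top} y
```

Here X is the (number of words) × 50 matrix of PCA-reduced embeddings and y is one electrode's word-aligned activity.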
Scientific Validity
  • Overall methodological approach: The figure presents a valid and innovative approach to studying neural activity during natural conversations. The use of a dense-sampling paradigm and a powerful speech-to-text model (Whisper) is a significant strength. The application of linear regression to relate model embeddings to brain activity is a standard and appropriate method for this type of analysis.
  • Dimensionality reduction using PCA: The use of PCA for dimensionality reduction is justified, given the high dimensionality of the embeddings. However, it would be beneficial to provide more detail about the PCA procedure, such as the amount of variance explained by the 50 components (a quick check is sketched after this list).
  • Extraction of embeddings: The figure clearly outlines the process of extracting embeddings from different layers of the Whisper model. This is crucial for understanding the hierarchical nature of the analysis and the comparison of acoustic, speech, and language representations.
  • Visualization of Brain Coverage: The depiction of brain coverage is helpful, but a more detailed visualization, perhaps showing individual electrode locations, would be beneficial. It's also important to note that the coverage is limited to the left hemisphere, which should be explicitly stated in the figure legend.
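Where the embeddings are available, the variance retained by 50 components, as suggested above, is a one-line check. The sketch assumes `embeddings` is an n-words × d matrix of model states; it is not code from the paper.

```python
# Sketch: how much variance do 50 principal components retain?
from sklearn.decomposition import PCA

pca = PCA(n_components=50).fit(embeddings)   # `embeddings`: n_words x d (assumed)
reduced = pca.transform(embeddings)          # 50-dimensional encoding-model inputs
print(f"variance retained: {pca.explained_variance_ratio_.sum():.1%}")
```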
Communication
  • Clarity and organization of the visual representation: The figure effectively introduces the core components of the study's methodology, offering a clear visual representation of the data collection and analysis pipeline. The use of distinct colors and labels for different stages (Comprehension, Production, and different embedding types) enhances readability. However, the figure is quite complex and could benefit from a more streamlined layout to improve immediate comprehension, perhaps by separating the production and comprehension pipelines more distinctly.
  • Completeness of the figure legend: The figure legend is concise but could be expanded to provide a more detailed explanation of each component, especially the 'Encoder stack' and 'Decoder stack'. While the main text elaborates on these, a self-contained explanation within the figure caption would improve stand-alone understanding.

Results

Key Aspects

Strengths

Suggestions for Improvement

Non-Text Elements

Fig. 2 | Acoustic, speech and language encoding model performance during speech...
Full Caption

Fig. 2 | Acoustic, speech and language encoding model performance during speech production and comprehension.

First Reference in Text
Whisper's acoustic, speech and language embeddings predicted neural activity with remarkable accuracy across conversations comprising hundreds of thousands of words during both speech production and comprehension for numerous electrodes in various regions of the cortical language network (Fig. 2).
Description
  • Overview of encoding model performance: This figure presents the results of how well the different types of information extracted from the "Whisper" model (acoustic, speech, and language) can predict brain activity during both speaking (production) and listening (comprehension). The researchers used a technique called "encoding models" to do this. Think of an encoding model as a way to translate between the language of the computer model (the embeddings) and the language of the brain (the neural activity). The better the translation, the better the model is at capturing what's happening in the brain.
  • Color-coded brain maps representing correlation (r): The figure shows brain maps, color-coded to represent the strength of the prediction (correlation, represented by 'r'). A correlation is a number between -1 and 1 that indicates how well two things are related. A correlation of 0 means no relationship, while 1 (or -1) means a perfect positive (or negative) relationship. Here, the colors represent the correlation between the predicted brain activity (based on the Whisper model's embeddings) and the actual recorded brain activity. The colors range from 0.04 (light yellow) to 0.40 (dark red), indicating varying degrees of positive correlation. The N values (N=64, N=274, etc.) indicate the number of electrodes included in each map.
  • Separate panels for production, comprehension, and embedding types: There are separate brain maps for speech production (when people were talking) and speech comprehension (when people were listening). Within each of these, there are maps for the acoustic embeddings (representing the raw sound), speech embeddings (representing the recognized speech sounds), and language embeddings (representing the meaning of the words). This allows us to see which type of information from the Whisper model best predicts brain activity in different brain areas during different tasks.
  • Statistical Significance: The figure shows results that are statistically significant. The statement 'P < 0.01, FWER' means that the probability of observing these results by chance is less than 1%, and this has been corrected for multiple comparisons using the Family-Wise Error Rate (FWER) method. FWER correction is a way to reduce the chances of getting false positives when you're doing many statistical tests at once (in this case, testing many electrodes).
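One standard way to obtain an FWER-corrected threshold across many electrodes is the max-statistic permutation approach sketched below. Whether the authors used exactly this scheme is an assumption; the sketch simply illustrates what a threshold like 'P < 0.01, FWER' controls.

```python
# Sketch: max-statistic permutation threshold for FWER control across electrodes.
import numpy as np

def fwer_threshold(y_true, y_pred, n_perm=1000, alpha=0.01, seed=0):
    """y_true, y_pred: (n_words, n_electrodes) observed and predicted activity."""
    rng = np.random.default_rng(seed)
    n_words, n_elec = y_true.shape
    max_null = np.empty(n_perm)
    for p in range(n_perm):
        perm = rng.permutation(n_words)       # break the word-to-word alignment
        null_rs = [np.corrcoef(y_true[perm, e], y_pred[:, e])[0, 1]
                   for e in range(n_elec)]
        max_null[p] = max(null_rs)            # keep the largest null correlation
    # An electrode survives if its observed r exceeds the (1 - alpha) quantile
    # of the maximum-across-electrodes null distribution.
    return np.quantile(max_null, 1 - alpha)
```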
Scientific Validity
  • Overall methodological approach: The figure presents compelling evidence for the alignment between the Whisper model's internal representations and neural activity during natural language processing. The use of a large dataset (hundreds of thousands of words) and multiple electrodes strengthens the generalizability of the findings.
  • Statistical significance: The use of a rigorous statistical threshold (P < 0.01, FWER corrected) provides confidence that the observed correlations are not due to chance.
  • Comparison of different embedding types: The presentation of results for different embedding types (acoustic, speech, and language) allows for a nuanced understanding of how different levels of linguistic information are encoded in the brain.
  • Correlation vs. Causation: While the figure shows impressive results, it's important to acknowledge that correlation does not equal causation. The observed correlations suggest an alignment between the model and the brain, but they do not prove that the brain uses the same representations as the model. Further investigation is needed to explore the causal relationship.
Communication
  • Clarity and organization of the visual representation: The figure effectively visualizes the encoding performance for three different embedding types (acoustic, speech, and language) across multiple brain regions. The use of color-coded brain maps allows for a quick comparison of performance across conditions (production and comprehension). However, the figure could benefit from a clearer indication of the scale for the correlation values (r). While the range is stated (0.04 - 0.40), adding tick marks or labels on the color bar would improve readability.
  • Completeness of the figure legend: The figure legend is concise but could be more informative. For example, explicitly stating that 'N' refers to the number of electrodes would be helpful. Also, clarifying the meaning of the 'P < 0.01, FWER' threshold in the legend would enhance stand-alone understanding.
  • Panel organization and layout: The use of separate panels for production and comprehension, and for each embedding type, makes it easy to compare the results across these different conditions. The layout is logical and well-organized.
Fig. 3 | Mixed selectivity for speech and language embeddings during speech...
Full Caption

Fig. 3 | Mixed selectivity for speech and language embeddings during speech production and comprehension.

First Reference in Text
We observed different selectivity patterns for speech and language embeddings, each accounting for different portions of the variance across different cortical areas (Fig. 3).
Description
  • Overall concept of mixed selectivity: This figure shows how well different types of information from the Whisper model – specifically, speech sounds (speech embeddings) and the meaning of words (language embeddings) – predict brain activity in different parts of the brain, and how this changes depending on whether someone is talking (production) or listening (comprehension). The main idea is to see which parts of the brain are more sensitive to the sounds of speech versus the meaning of the words.
  • Color-coded brain maps and unique variance explained: The figure uses color-coded brain maps. The color at each location on the brain represents which type of information (speech or language) is better at predicting brain activity in that area. Red means speech sounds are more important, blue means the meaning of the words is more important, and white means it's a mix of both. The colors show the percentage of "unique variance explained." Variance, in this context, is a measure of how much the brain activity changes over time. 'Unique variance explained' means how much of that change can be predicted by only one type of information (either speech or language), after taking into account any overlap between them.
  • Separate maps for production and comprehension: There are separate brain maps for when people are talking (speech production) and when they are listening (speech comprehension). This allows us to see if the patterns of brain activity are different for these two processes.
  • Individual electrode plots showing temporal dynamics: In addition to the brain maps, there are smaller graphs showing the correlation between predicted and actual brain activity over time (the x-axis is labeled 'Lag (s)', meaning time in seconds). These graphs are for specific electrodes in specific brain regions (like IFG, STG, etc.). The red line shows the correlation for speech embeddings (sounds), and the blue line shows the correlation for language embeddings (meaning). These graphs show how the relationship between the model's information and brain activity changes over a short period of time around when a word is spoken or heard.
  • Statistical threshold and FDR correction: The dotted horizontal line in each of the smaller graphs represents the statistical threshold. This means that any correlation above that line is considered statistically significant, meaning it's unlikely to have happened by chance. The text mentions that the threshold is q < 0.01, two-sided, FDR corrected. This means the probability of a false positive is less than 1%, and this has been adjusted for multiple comparisons using the False Discovery Rate (FDR) method.
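The Benjamini-Hochberg procedure behind an FDR threshold such as q < 0.01 can be applied in a single call; the p-values below are assumed to come from a per-electrode (or per-lag) significance test, and this is an illustration rather than the authors' code.

```python
# Sketch: Benjamini-Hochberg FDR correction over per-electrode p values.
from statsmodels.stats.multitest import multipletests

# `p_values`: one p value per electrode/lag from a permutation or parametric test (assumed)
reject, q_values, _, _ = multipletests(p_values, alpha=0.01, method="fdr_bh")
significant = reject   # boolean mask: which correlations clear q < 0.01
```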
Scientific Validity
  • Overall methodological approach: The figure presents a novel and insightful analysis of the differential roles of speech and language representations in the brain. The use of variance partitioning to quantify the unique contribution of each embedding type is a strong methodological approach (a sketch of the computation follows this list).
  • Comparison of production and comprehension: The inclusion of both production and comprehension data allows for a comparison of the neural substrates involved in these two fundamental aspects of language processing.
  • Presentation of group and individual data: The presentation of results at both the group level (brain maps) and the individual electrode level (plots) provides a comprehensive view of the data.
  • Statistical analysis: The statistical analysis appears to be rigorous, with appropriate correction for multiple comparisons (FDR correction).
  • Interpretation of selectivity: The figure focuses on selectivity, which is the relative importance of speech vs. language. It's important to note that even in areas showing strong selectivity for one type of information, the other type might still contribute to neural activity. The figure does not imply that these areas are exclusively involved in processing only one type of information.
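The unique-variance logic can be made concrete in a few lines: fit the joint model and each single-embedding model, then subtract. Variable names are assumptions, and the authors' exact cross-validation scheme may differ.

```python
# Sketch of variance partitioning: unique variance = R^2(full) - R^2(other alone).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

def cv_r2(X, y):
    pred = cross_val_predict(Ridge(alpha=1.0), X, y, cv=10)
    return np.corrcoef(y, pred)[0, 1] ** 2

# X_speech, X_language: word-aligned 50-d embeddings; y: one electrode (all assumed)
r2_full = cv_r2(np.hstack([X_speech, X_language]), y)
unique_speech = r2_full - cv_r2(X_language, y)     # variance only speech explains
unique_language = r2_full - cv_r2(X_speech, y)     # variance only language explains
```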
Communication
  • Overall organization and clarity: The figure presents a complex set of results, comparing encoding performance for speech and language embeddings during both production and comprehension across various brain regions. The use of separate brain maps for production and comprehension, colored according to the proportion of unique variance explained, is effective for visualizing the spatial distribution of selectivity. The inclusion of individual electrode plots with correlation over time adds another layer of detail. However, the sheer amount of information presented makes the figure somewhat overwhelming. The small size of the individual plots and the lack of clear visual separation between the production and comprehension sections make it challenging to quickly grasp the key findings.
  • Color scheme and representation of mixed selectivity: The color scheme used to represent the percentage of unique variance (ranging from red for speech to blue for language) is intuitive, but the addition of a 'mixed' category (white) adds complexity. It might be helpful to provide a more explicit explanation of what constitutes 'mixed' selectivity in the figure legend.
  • Readability of individual electrode plots: The individual electrode plots are useful for showing the temporal dynamics of encoding performance, but the x-axis labels ('Lag (s)') are small and could be more prominent. Adding tick marks or grid lines to the plots might also improve readability.
  • Use of abbreviations: The use of abbreviations for brain regions (e.g., STG, IFG, preCG) is standard practice, but including a key or expanding these abbreviations in the figure legend would make the figure more accessible to a broader audience.
Fig. 4 | Enhanced encoding for language embeddings fused with auditory speech...
Full Caption

Fig. 4 | Enhanced encoding for language embeddings fused with auditory speech features.

First Reference in Text
In testing both sets of embeddings, we observed that encoding performance for language embeddings was significantly higher when the language decoder received speech information from the encoder, during both production (Fig. 4a) and comprehension (Fig. 4b).
Description
  • Comparison of language embeddings with and without auditory input: This figure compares how well two different types of language information from the Whisper model predict brain activity. The first type ('Only text') is based solely on the written words of the conversation. The second type ('Text + audio') combines the written words with the actual sounds of the speech. The researchers are testing whether adding the sound information improves the prediction of brain activity.
  • Separate panels for production and comprehension: The figure shows results for both when people are talking (production, panel a) and when they are listening (comprehension, panel b). This allows us to see if the effect of adding sound information is different for these two processes.
  • Brain maps showing the difference in correlation: The brain maps show the difference in prediction accuracy between the two types of language information. The colors represent the 'Δ correlation,' which is the difference in correlation values between the 'Text + audio' model and the 'Only text' model. Warmer colors (closer to 0.050) mean the 'Text + audio' model is better, while cooler colors (closer to -0.050) mean the 'Only text' model is better. The 'N' values indicate the number of electrodes.
  • Line graphs showing correlation over time: The line graphs show the correlation values over time (the x-axis is 'Lag (s)', meaning time in seconds) for all electrodes ('All') and for electrodes in the inferior frontal gyrus ('IFG'). The blue line represents the 'Only text' model, and the pink line represents the 'Text + audio' model. These graphs show how the relationship between the model's information and brain activity changes over a short period of time around when a word is spoken or heard.
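The 'Lag (s)' axes in these plots come from re-fitting the encoding model at shifted time points around each word's onset. The schematic helper below shows the idea; all inputs, including the `encode` helper, are assumed (e.g., the cross-validated fit sketched for Fig. 1).

```python
# Schematic: encoding performance as a function of lag around word onset.
import numpy as np

def lag_profile(signal, sfreq, onsets, X, encode):
    """signal: one electrode's time series; sfreq: sampling rate (Hz);
    onsets: word-onset sample indices; X: word embeddings;
    encode(X, y) -> r: a cross-validated encoding fit (assumed)."""
    lags = np.arange(-2.0, 2.25, 0.25)             # seconds relative to onset
    rs = []
    for lag in lags:
        idx = onsets + int(round(lag * sfreq))     # shift every word by this lag
        rs.append(encode(X, signal[idx]))          # one encoding fit per lag
    return lags, np.array(rs)
```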
Scientific Validity
  • Overall methodological approach: The figure provides strong evidence that incorporating auditory speech features improves the encoding performance of language embeddings. This supports the idea that the brain integrates acoustic and linguistic information during both speech production and comprehension.
  • Comparison of production and comprehension: The use of separate analyses for production and comprehension allows for a comparison of the effects of auditory information in these two processes.
  • Presentation of results at different levels: The presentation of results at both the group level (brain maps) and for specific regions (IFG) provides a more detailed view of the data. However, providing similar plots for other regions (like STG) in supplementary materials could further strengthen the findings.
Communication
  • Overall organization and clarity: The figure is divided into two main sections (a and b), clearly separating the results for production and comprehension. The use of brain maps and line graphs effectively visualizes the comparison between language embeddings with and without auditory input. However, the brain maps are relatively small, and the color difference between 'Only text' and 'Text + audio' is subtle, making it somewhat difficult to distinguish between them. The line graphs, while informative, could benefit from more prominent axis labels and tick marks.
  • Caption descriptiveness: The caption is concise but could be more descriptive. It would be helpful to explicitly state what 'enhanced encoding' means in this context (i.e., higher correlation values).
  • Consistency and clarity of notation: The use of 'N' to represent the number of electrodes is consistent with previous figures, but a reminder in the legend would still be beneficial for stand-alone understanding.
Fig. 5 | Comparing speech and language embeddings to symbolic features.
First Reference in Text
Our findings indicate that speech and language embeddings extracted from the multimodal, deep acoustic-to-speech-to-language model outperform symbolic speech and language features (Fig. 5) in predicting neural activity during natural conversations.
Description
  • Comparison of embeddings and symbolic features: This figure compares two different ways of representing speech and language in the Whisper model: embeddings and symbolic features. Embeddings are the internal representations learned by the deep learning model, while symbolic features are traditional linguistic features like phonemes (speech sounds) and parts of speech (nouns, verbs, etc.). The figure shows how well each of these representations predicts brain activity during speech production and comprehension.
  • Panel (a): Speech embeddings vs. symbolic speech features: Panel (a) focuses on speech. It compares how well the speech embeddings (from the Whisper encoder) and symbolic speech features (like phonemes and articulation features) predict brain activity. The line graphs show the correlation between predicted and actual brain activity over time. The red line represents deep speech embeddings, and the orange line represents symbolic speech features.
  • Panel (b): Language embeddings vs. symbolic language features: Panel (b) focuses on language. It compares how well the language embeddings (from the Whisper decoder) and symbolic language features (like parts of speech and syntactic dependencies) predict brain activity. The blue line represents deep language embeddings, and the light blue line represents symbolic language features.
  • Brain maps showing unique variance explained: The brain maps show the percentage of unique variance explained by each type of representation (deep vs. symbolic). This means how much of the change in brain activity can be predicted by only that type of representation, after taking into account any overlap between them. The color coding for % unique variance explained in the brain maps indicates the relative importance of the representations.
  • Statistical significance: The dotted horizontal line in the line graphs represents a statistical threshold. Correlations above this line are considered statistically significant. The text indicates that red dots (in panel a) and blue dots (panel b) indicate a statistically significant difference in performance between the deep embeddings and the symbolic features.
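As a concrete illustration of what symbolic language features can look like, the sketch below one-hot encodes part-of-speech and dependency labels with spaCy. The paper's exact feature set is not reproduced, and the `en_core_web_sm` pipeline is an assumed stand-in.

```python
# Sketch: building symbolic language features (POS and dependency tags) per word.
import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")        # assumed English pipeline
doc = nlp("I feel better today")

pos_tags = sorted({t.pos_ for t in doc})  # tag vocabularies built from this text only
dep_tags = sorted({t.dep_ for t in doc})
features = np.zeros((len(doc), len(pos_tags) + len(dep_tags)))
for i, tok in enumerate(doc):
    features[i, pos_tags.index(tok.pos_)] = 1                   # one-hot POS
    features[i, len(pos_tags) + dep_tags.index(tok.dep_)] = 1   # one-hot dependency
# `features` would then enter the same encoding pipeline as the deep embeddings.
```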
Scientific Validity
  • Overall methodological approach: The figure provides strong evidence that deep embeddings outperform symbolic features in predicting neural activity during natural conversations. This supports the idea that deep learning models capture aspects of language processing that are not captured by traditional linguistic features.
  • Comparison of different conditions and representations: The use of separate analyses for production and comprehension, and for speech and language, allows for a detailed comparison of the different representations.
  • Analysis of different brain regions: The inclusion of results for different brain regions (All, STG, IFG) provides insights into the spatial distribution of encoding performance.
  • Statistical Analysis: The statistical analysis, including the use of FDR correction, appears to be appropriate.
  • Variance partitioning: The variance partitioning analysis is a strong method for quantifying the unique contribution of each representation.
Communication
  • Overall organization and clarity: The figure is divided into two main sections (a and b), clearly separating the comparison for speech and language embeddings. Within each section, results are presented for both production and comprehension, and for different brain regions (All, STG, IFG). The use of line graphs to show correlation over time is effective, and the color-coding (deep vs. symbolic) is consistent. However, the brain images showing unique variance explained are quite small and lack detailed anatomical labels, making it difficult to precisely identify the regions where differences are observed.
  • Caption descriptiveness: The caption is concise but could be more informative. It would be helpful to explicitly state what 'symbolic features' are being compared to the embeddings.
  • Readability of line graphs: The x-axis labels ('Lag (s)') on the line graphs are small and could be more prominent. Adding tick marks or grid lines might also improve readability.
  • Color scheme in brain maps: The color scheme used in the brain maps to represent % unique correlation could be confusing: it uses different colors for deep versus symbolic features, where one might expect a single-color gradient. Readers must rely on the in-figure description to interpret the maps correctly.
Fig. 6 | Representations of phonetic and lexical information in Whisper.