
arxiv

Paper 2501.04359v1

Decoding EEG Speech Perception with Transformers and VAE-based Data Augmentation

Authors: Terrance Yu-Hao Chen, Yulin Chen, Pontus Soederhaell, Sadrishya Agrawal, Kateryna Shapovalenko

Published: 2025-01-08

Abstract:

Decoding speech from non-invasive brain signals, such as electroencephalography (EEG), has the potential to advance brain-computer interfaces (BCIs), with applications in silent communication and assistive technologies for individuals with speech impairments. However, EEG-based speech decoding faces major challenges, such as noisy data, limited datasets, and poor performance on complex tasks like speech perception. This study attempts to address these challenges by employing variational autoencoders (VAEs) for EEG data augmentation to improve data quality and applying a state-of-the-art (SOTA) sequence-to-sequence deep learning architecture, originally successful in electromyography (EMG) tasks, to EEG-based speech decoding. Additionally, we adapt this architecture for word classification tasks. Using the Brennan dataset, which contains EEG recordings of subjects listening to narrated speech, we preprocess the data and evaluate both classification and sequence-to-sequence models for EEG-to-words/sentences tasks. Our experiments show that VAEs have the potential to reconstruct artificial EEG data for augmentation. Meanwhile, our sequence-to-sequence model achieves more promising performance in generating sentences compared to our classification model, though both remain challenging tasks. These findings lay the groundwork for future research on EEG speech perception decoding, with possible extensions to speech production tasks such as silent or imagined speech.

DECODING EEG SPEECH PERCEPTION WITH TRANSFORMERS AND VAE-BASED DATA AUGMENTATION

Terrance Yu-Hao Chen
Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213
terrancc@andrew.cmu.edu

Yulin Chen
Information Networking Institute, Carnegie Mellon University, Pittsburgh, PA 15213
jolinc@andrew.cmu.edu

Pontus Soederhaell
Computational Finance, Carnegie Mellon University, Pittsburgh, PA 15213
psoderha@andrew.cmu.edu

Sadrishya Agrawal
Software and Societal Systems Department, Carnegie Mellon University, Pittsburgh, PA 15213
sadrisha@andrew.cmu.edu

Kateryna Shapovalenko
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213
kshapova@andrew.cmu.edu

A PREPRINT - JANUARY 9, 2025
arXiv:2501.04359v1 [eess.AS] 8 Jan 2025

Keywords: Brain Signal Processing · EEG · EMG · Speech Decoding · Brain-to-Text · Speech Production · Silent Speech · Speech Perception · Deep Learning · Transformers · VAEs · Data Augmentation

1 Introduction

Surface electroencephalography (EEG) has long been a standard, non-invasive method for measuring electrical brain activity. In recent years, the field of Brain-Computer Interfaces (BCIs) has seen significant advancements, largely driven by breakthroughs in artificial intelligence and deep learning. We plan to further these advances, particularly in the domain of the decoding of speech perception.

Decoding imagined speech from EEG signals presents a promising avenue for developing assistive technologies for individuals with speech impairments, as well as communication systems for environments requiring silence or with high background noise. However, EEG-based imagined speech decoding faces several challenges, including low signal-to-noise ratio (SNR), inter-subject variability, and limited datasets. This project aims to address these issues by exploring pre-processing and deep learning techniques on more extensive datasets and utilizing those findings on EEG data.

This project will investigate several innovative approaches to enhance EEG-based speech perception decoding:
• Using variational autoencoders (VAEs) to learn robust latent representations of EMG or EEG signals, potentially improving noise resilience. Cai and Zeng [2024] Chien et al.
[2022]
• Adapting an EMG-based SOTA transformer model to be used on a different modality (EEG), with an aim to increase decoding performance.

By combining these approaches, we aim to advance the state-of-the-art in EEG-based speech decoding, paving the way for more accurate and practical BCI systems in real-world applications.

2 Literature Review

Brain-to-text represents a promising direction in neuroprosthetics and brain-computer interfaces (BCIs), especially for individuals with impaired speech production due to neurological conditions. This technology seeks to interpret neural signals corresponding to various speech-related activities, enabling communication through thought alone. The primary paradigms explored in speech-related BCIs include speech perception, normal speech, silent speech, imagined speech, and inner speech. Each paradigm offers unique insights and challenges for decoding speech from brain signals.

2.1 Speech Perception

Speech perception, also known as auditory comprehension, refers to the brain's processing of external speech stimuli. While this paradigm has contributed to understanding speech perception, its practical applications in BCIs are limited due to the passive nature of the task and the complexity of the neural signals involved. The primary value lies in advancing our understanding of the auditory cortex and its role in speech processing, providing a foundation for more applied speech-decoding efforts. Recent studies have focused on using AI to decode speech from brain activity while subjects listen to dialogues Défossez et al. [2023], revealing patterns in brain activity that correlate with specific auditory stimuli. This research has the potential to enhance our understanding of brain-based speech recognition but remains less well-defined compared to other paradigms.

2.2 Speech Production

2.2.1 Active/Overt Speech

Active speech focuses on decoding neural activity during active speech production.
By analyzing brain signals captured while subjects speak out loud, researchers aim to map cortical activity patterns directly to spoken words. This approach, which primarily uses modalities such as electroencephalography (EEG) and electrocorticography (ECoG), has yielded notable progress in neural decoding. However, challenges remain due to noise and variability introduced by the muscle movements during vocalization.

A particularly promising development is the integration of multimodal speech recognition, combining EEG with audio signals. Das et al. [2024] explore this approach, where EEG data collected during overt speech is used alongside audio signals. By leveraging deep learning techniques, the fusion of neural and audio data significantly enhanced performance, particularly in noisy environments. Their multimodal model achieved a 95.39% classification accuracy and exhibited resilience to white noise, outperforming traditional automatic speech recognition (ASR) systems reliant solely on audio input. This synergy between EEG and audio data offers a promising new dimension for ASR, especially in challenging conditions like background noise or speech impairments.

2.2.2 Silent Speech

Silent speech refers to the motor movements involved in speech production, such as mouth and tongue articulation, without producing an audible sound. Recent advancements in silent speech interfaces have focused on decoding speech using non-audible signals generated during articulation. Electrocorticography (ECoG)-based systems have shown high accuracy in translating cortical signals into intended speech, particularly for individuals with paralysis. Willett et al. [2022] leverage high-resolution intracranial recordings to decode speech, achieving 94% accuracy for phonemes and whole words. These systems represent breakthroughs in assistive communication technologies, offering viable solutions for patients with severe speech impairments Denby et al.
[2010]. Additionally, recent work has explored multimodal silent speech recognition systems, incorporating techniques like electromyography (EMG) Gaddy and Klein [2020] and lip-reading sensors to enhance accuracy Denby et al. [2010]. Cross-modal models such as Multimodal Orofacial Neural Audio (MONA) combine neural data with audio data, narrowing the gap between silent and vocalized speech recognition. These systems achieve a substantial reduction in word error rates, demonstrating the potential of silent speech interfaces in noisy and data-limited environments Benster et al. [2024].

2.2.3 Imaginary/Inner Speech

Two other sub-fields of speech production are imaginary speech and inner speech. Imaginary speech is the imagination of speech without physical movement. Inner speech can be described as an inner monologue or "thinking in words". The two sub-fields are very closely related, and many studies treat them as equivalent. While imaginary/inner speech and silent speech involve different data collection methods and experimental setups, they share significant potential for decoding speech directly from brain activity.

Previous studies have tried to relate brain signals captured with MEG/EEG to the imaginary and inner speech of letters, phonemes, and words. Abdulghani et al. used a headset with eight EEG electrodes to record the brain activity of four subjects while they imagined speaking one of four specified commands: up, down, left, right. Subsequently, an LSTM network was used to classify the commands based on the EEG recordings, with an accuracy of 92.5%. A similar study was conducted by Coretto et al. [2017]. They used Random Forests and Support Vector Machines to achieve 19.60% and 18.26% 6-class classification accuracy for commands, respectively, and 22.72% and 21.94% 5-class classification accuracy for vowels. Nguyen et al.
[2017] compare the classification accuracy of imagined speech between vowels, short words, and long words using Relevance Vector Machines. They achieve accuracies of 49.0%, 51.1%, and 66.2%, respectively, using class sizes of three for vowels and short words and a class size of two for long words.

A major obstacle facing the development of EEG-based BCI technology is the lack of very large datasets. Moreover, the low SNR of EEG signals makes it difficult to distinguish relevant signals from background brain activity EEG [2022]. Charan Mahapatra and Bhuyan [2023] and Lee et al. [2020] investigate two different approaches to combat these challenges. The former apply transfer learning using ResNet and DenseNet models pre-trained on large image datasets. They achieve an 11-class classification accuracy of 82.35% on the KaraOne dataset and a 16-class classification accuracy of 89.01% on the FEIS dataset. Lee et al. [2020], on the other hand, trained a Siamese network with contrastive loss to construct embeddings, followed by a KNN classifier, achieving a 6-class classification accuracy of 31.40%.

2.3 Variational Autoencoder

The variational autoencoder was first introduced by Kingma and Welling [2022]. The VAE is composed of an encoder $q_\phi(z|x)$, which maps the input $x$ to a latent space $z$, and a decoder $p_\theta(x|z)$, which reconstructs the data from the latent variables. The objective function, known as the Evidence Lower Bound (ELBO), balances the reconstruction accuracy and the regularization of the latent space. By assuming that the prior distribution of the latent space is standard Gaussian and applying the reparametrization trick, we obtain the following form of the loss function:

$$\mathcal{L}(\theta, \phi; x^{(i)}) = \frac{1}{2}\sum_{j=1}^{J}\left(1 + \log\left((\sigma_j^{(i)})^2\right) - (\mu_j^{(i)})^2 - (\sigma_j^{(i)})^2\right) + \frac{1}{L}\sum_{l=1}^{L}\log p_\theta\left(x^{(i)} \mid z^{(i,l)}\right)$$

In the above formula, $j$ indexes the $J$ dimensions of the latent space.
The first summation corresponds to the negative KL divergence between the approximate posterior distribution of the encoder $q_\phi(z|x)$ and the prior distribution $p(z)$, while the second summation corresponds to the reconstruction term (the log-likelihood of the data under the decoder). The reparametrization trick is done by expressing $z$ as:

$$z^{(i)} = \mu^{(i)} + \sigma^{(i)} \odot \epsilon^{(i)}, \quad \epsilon^{(i)} \sim \mathcal{N}(0, I)$$

Since $z$ is expressed as a deterministic function of the mean $\mu^{(i)}$, the standard deviation $\sigma^{(i)}$, and a noise term $\epsilon^{(i)}$, gradient descent can be used for optimization.

Variational Autoencoders have been successfully used in studies to improve classification models through data augmentation. Since the latent space follows a multivariate Gaussian distribution, new variations of the latent space can be generated and passed through the decoder. For example, Saldanha et al. [2022] improve the accuracy of a respiratory disease classification model by generating new samples through MLP-VAE, CNN-VAE, and Conditional VAE. Moreover, Nishizaki [2017] uses VAE-based data augmentation to improve a speech recognition model.

Figure 1: Diagram of EEG-to-Text Word Classifier Training Flow

3 Model Description

3.1 EEG-to-Text Models

3.1.1 Word Classifier Model

The Word Classifier model, as illustrated in Figure 1, is designed for EEG-to-word classification tasks. The corpus consists of 601 unique vocabulary entries, making this a 601-class classification problem. This design aligns with Meta's research on the same dataset Défossez et al. [2023], which also focuses on predicting a single word corresponding to a specific EEG input window. A detailed summary of its parameters is provided in Table 1.

Model Architecture:
• Input: The input to the Word Classifier consists of preprocessed EEG signals, similar to the Seq2Seq model. The data is further truncated such that each input window corresponds to a specific word representation in the EEG data.
• Feature Extraction and Sequence Modeling: The model employs ResBlocks for feature extraction, and the extracted features are passed through a Transformer Encoder with six encoder layers. These designs are similar to the Seq2Seq model.
• Output and Classification: The final output is passed through a linear layer, which maps the output of the transformer encoder to a fixed vocabulary size of 601 words. A softmax activation function is applied to generate a probability distribution over the vocabulary. The model is trained using the Cross-Entropy Loss function, which aims to minimize the differences between predicted and actual word distributions.

Table 1: Model Summary: EEGWordClsModel, Params size (MB): 145.68

Layer (type:depth-idx)                  Output Shape       Param #
EEGWordClsModel                         [8, 601]           768
Sequential: 1-1                         [8, 768, 130]      –
  ResBlock: 2-1                         [8, 768, 260]      –
    Conv1d: 3-1                         [8, 768, 260]      139,008
    BatchNorm1d: 3-2                    [8, 768, 260]      1,536
    Conv1d: 3-3                         [8, 768, 260]      1,770,240
    BatchNorm1d: 3-4                    [8, 768, 260]      1,536
    Conv1d: 3-5                         [8, 768, 260]      46,848
    BatchNorm1d: 3-6                    [8, 768, 260]      1,536
  ResBlock: 2-2                         [8, 768, 130]      –
    Conv1d: 3-7                         [8, 768, 130]      1,770,240
    BatchNorm1d: 3-8                    [8, 768, 130]      1,536
    Conv1d: 3-9                         [8, 768, 130]      1,770,240
    BatchNorm1d: 3-10                   [8, 768, 130]      1,536
    Conv1d: 3-11                        [8, 768, 130]      590,592
    BatchNorm1d: 3-12                   [8, 768, 130]      1,536
Linear: 1-2                             [8, 130, 768]      590,592
TransformerEncoder: 1-3                 [131, 8, 768]      –
  ModuleList: 2-3                       –                  –
    TransformerEncoderLayer: 3-13       [131, 8, 768]      7,237,632
    TransformerEncoderLayer: 3-14       [131, 8, 768]      7,237,632
    TransformerEncoderLayer: 3-15       [131, 8, 768]      7,237,632
    TransformerEncoderLayer: 3-16       [131, 8, 768]      7,237,632
    TransformerEncoderLayer: 3-17       [131, 8, 768]      7,237,632
    TransformerEncoderLayer: 3-18       [131, 8, 768]      7,237,632
Linear: 1-4                             [8, 601]           462,169

3.1.2 Sequence to Sequence Model

The Sequence-to-Sequence model (Seq2Seq) is designed to map EEG signals to corresponding sentences. It leverages a combination of convolutional layers and transformer encoder layers to process the input EEG signals and generate meaningful sequences. The model architecture is adapted from Gaddy's EMG recognition model Gaddy and Dan [2022]. The model diagram is illustrated in Figure 2, and a detailed summary of its parameters is provided in Table 2.

Model Architecture:
• Input: The input to the Seq2Seq model consists of preprocessed EEG signals, represented as temporal sequences across multiple EEG channels. The EEG features are derived from the preprocessing pipeline described in Section 4.2.
• Feature Extraction: The initial feature extraction is performed using multiple Residual Blocks (ResBlocks). Each ResBlock comprises stacked 1D convolutional layers, followed by Batch Normalization and residual connections. The goal of these blocks is to capture both spatial and temporal features of the EEG signals effectively. The extracted features are projected to a higher-dimensional representation using a Linear layer to prepare them for sequence modeling.
• Sequence Modeling: The core of the model is the Transformer Encoder, consisting of six encoder layers. Each layer employs a multi-head self-attention mechanism to capture temporal dependencies and long-range relationships within the EEG signal. This design allows the model to attend to different parts of the signal.
• Output and Decoding: The output of the transformer encoder is passed through a CTC Beam Decoder, which aligns the predicted sequences with the target sentence labels. The Connectionist Temporal Classification (CTC) loss is used for training, which enables the model to handle length mismatches between input EEG sequences and output text sequences.

3.2 AugVAE-EEG Model

The VAE is trained on pre-processed EEG data using the ELBO loss function as explained in Section 2.3. A latent space of dimension 64 was used. The VAE was trained on 10 subjects with high comprehension scores. For the final model, the encoder and decoder consist of two linear layers of size 512 and 256 with ReLU activation functions. VAEs were trained on both eeg_raw and eeg_feats (EEG features derived from the manual preprocessing in Section 4.2). However, for the final model, only eeg_raw was used.

The mean and standard deviation of the latent space for each word are found across subjects. The decoder is then integrated into the training pipeline: with a certain probability, a new latent vector is drawn based on the mean and standard deviation for the appropriate label and passed through the decoder to generate a new sample. The generated sample is subsequently used as part of the batch in place of (or in addition to) the real EEG.
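The augmentation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the decoder here is a stand-in with random weights (a trained VAE decoder would take its place), and the names `make_toy_decoder`, `latent_stats_per_word`, `maybe_augment`, and `p_aug` are hypothetical. Only the latent dimension (64) and the layer sizes (256 and 512) come from the text.

```python
import numpy as np

LATENT_DIM = 64  # latent size stated in the text


def make_toy_decoder(out_dim, rng):
    """Stand-in for a trained VAE decoder (random weights, illustration only)."""
    w1 = rng.normal(scale=0.05, size=(LATENT_DIM, 256))
    w2 = rng.normal(scale=0.05, size=(256, 512))
    w3 = rng.normal(scale=0.05, size=(512, out_dim))

    def decode(z):
        h = np.maximum(z @ w1, 0.0)  # linear layer + ReLU
        h = np.maximum(h @ w2, 0.0)  # linear layer + ReLU
        return h @ w3                # project back to the EEG sample size

    return decode


def latent_stats_per_word(latents, labels):
    """Mean and std of the latent codes of each word label, pooled over subjects."""
    stats = {}
    for word in np.unique(labels):
        z = latents[labels == word]
        stats[int(word)] = (z.mean(axis=0), z.std(axis=0))
    return stats


def maybe_augment(x_real, label, stats, decode, p_aug, rng):
    """With probability p_aug, return a decoded latent draw instead of x_real."""
    if rng.random() >= p_aug:
        return x_real
    mu, sigma = stats[label]
    z = mu + sigma * rng.standard_normal(mu.shape)  # z ~ N(mu, diag(sigma^2))
    return decode(z)


# Usage with synthetic data (hypothetical sizes: 40 latent codes, 4 word labels).
rng = np.random.default_rng(0)
decode = make_toy_decoder(out_dim=120, rng=rng)
latents = rng.normal(size=(40, LATENT_DIM))
labels = np.arange(40) % 4
stats = latent_stats_per_word(latents, labels)
x_real = rng.normal(size=120)
x_batch_item = maybe_augment(x_real, 0, stats, decode, p_aug=0.5, rng=rng)
```

Sampling around a per-word mean, rather than from the standard normal prior, keeps the generated sample tied to a known label, which is what allows it to be used as a labeled training example.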
An example of real and generated EEG is shown in Figure 4.

Figure 2: Diagram of EEGToText Seq2Seq Training Flow

Table 2: Model Summary: EEGSeqtoSeqModel, Params size (MB): 143.94

Layer (type:depth-idx)                  Output Shape       Param #
EEGSeqtoSeqModel                        [7, 1250, 38]      –
Sequential: 1-1                         [7, 768, 1250]     –
  ResBlock: 2-1                         [7, 768, 2500]     –
    Conv1d: 3-1                         [7, 768, 2500]     139,008
    BatchNorm1d: 3-2                    [7, 768, 2500]     1,536
    Conv1d: 3-3                         [7, 768, 2500]     1,770,240
    BatchNorm1d: 3-4                    [7, 768, 2500]     1,536
    Conv1d: 3-5                         [7, 768, 2500]     46,848
    BatchNorm1d: 3-6                    [7, 768, 2500]     1,536
  ResBlock: 2-2                         [7, 768, 1250]     –
    Conv1d: 3-7                         [7, 768, 1250]     1,770,240
    BatchNorm1d: 3-8                    [7, 768, 1250]     1,536
    Conv1d: 3-9                         [7, 768, 1250]     1,770,240
    BatchNorm1d: 3-10                   [7, 768, 1250]     1,536
    Conv1d: 3-11                        [7, 768, 1250]     590,592
    BatchNorm1d: 3-12                   [7, 768, 1250]     1,536
Linear: 1-2                             [7, 1250, 768]     590,592
TransformerEncoder: 1-3                 [1250, 7, 768]     –
  ModuleList: 2-3                       –                  –
    TransformerEncoderLayer: 3-13       [1250, 7, 768]     7,237,632
    TransformerEncoderLayer: 3-14       [1250, 7, 768]     7,237,632
    TransformerEncoderLayer: 3-15       [1250, 7, 768]     7,237,632
    TransformerEncoderLayer: 3-16       [1250, 7, 768]     7,237,632
    TransformerEncoderLayer: 3-17       [1250, 7, 768]     7,237,632
    TransformerEncoderLayer: 3-18       [1250, 7, 768]     7,237,632
Linear: 1-4                             [7, 1250, 38]      29,222

Figure 3: Training Flow of AugVAE-EEG Model

Figure 4: Example of real and generated EEG signals

4 Dataset

4.1 Data

We use the Brennan and Hale [2019] dataset, available via this link. The participants are 33 adult volunteers (after exclusions) who passively listened to a 12.4-minute audiobook story (the first chapter of Alice's Adventures in Wonderland) while EEG was recorded. The story comprised 2,129 words in 84 sentences, and the audio was slowed by 20% for better comprehension. The EEG was recorded using 61 active electrodes, a 500 Hz sampling rate, and a 0.1-200 Hz bandpass filter.
4.2 Data Preprocessing

EEG data is known to have low SNR due to its high susceptibility to various artifacts, such as eye movements, muscle activities (subject-generated artifacts), and environmental electrical interference (externally generated artifacts) Shamlo et al. [2015]. Therefore, it is essential that the data is properly cleaned and preprocessed before it is fed into the model. Two key outputs were generated from the preprocessing pipeline: eeg_raw (minimally processed EEG data) and eeg_feats (further processed EEG features). The preprocessing steps are as follows.

Preprocessing for eeg_raw:
1. Channel removal: Removed last two channels.
2. Baseline correction: Subtracted the mean of the first 0.5 seconds to remove DC offset and drifts.
3. Robust scaling: Reduced outlier impact using scikit-learn.
4. Handling outliers: Clipped extreme values (below 5th and above 95th percentiles) and clamped those exceeding 20 standard deviations.
5. Normalization: Standardized to zero mean and unit variance.

Preprocessing for eeg_feats:
1. Temporal shifting: Shifted signals by 150 ms to align with stimuli.
2. Feature extraction: Applied convolutional layers to extract:
   • Double-averaged signal
   • RMS of wavelet coefficients and rectified signal
   • Zero-crossing rate
   • Mean of the rectified signal
3. Feature stacking: Combined extracted features for final representation.

5 Evaluation Metrics

5.1 Word Classifier Model

For the word classification model, we used accuracy as the evaluation metric. Accuracy measures the proportion of correctly classified samples to the total number of samples. It is defined mathematically as:

$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} \tag{1}$$

Let $N$ denote the total number of predictions, and $C$ represent the number of correct predictions. The accuracy metric can be expressed as:

$$\text{Accuracy} = \frac{C}{N} \tag{2}$$

• $C$: The number of samples that were correctly classified by the model.
• $N$: The total number of samples in the dataset.

In the context of our classification task, accuracy is particularly relevant as it provides a straightforward measure of the model's performance.

5.2 Sequence-to-Sequence Model

The primary metric used for our Sequence-to-Sequence model is Word Error Rate (WER). This metric assesses the intelligibility of model outputs generated from EEG signals (or, for the baseline, silent EMG signals) by comparing transcriptions to reference text. The WER is calculated as follows:

$$\text{WER} = \frac{\text{substitutions} + \text{insertions} + \text{deletions}}{\text{reference length}} \tag{3}$$

Let $S$, $I$, and $D$ represent the number of substitutions, insertions, and deletions, respectively, and let $R$ denote the length of the reference text. The formula for WER is:

$$\text{WER} = \frac{S + I + D}{R} \tag{4}$$

• $S$: The number of substitutions needed to match the reference text.
• $I$: The number of extra words inserted compared to the reference text.
• $D$: The number of deletions needed to match the reference text.
• $R$: The total number of words in the reference text.

WER is a critical evaluation metric for our Sequence-to-Sequence model as it directly measures the accuracy of generated transcriptions in terms of word-level edits. Lower WER values indicate better alignment between the model's output and the reference text, thus reflecting improved performance in capturing the intended meaning from EEG or EMG signals.

6 Loss Functions

The choice of loss function is critical for the performance of any machine learning model, as it directly influences how the model learns to represent and generate data.

6.1 Cross-Entropy Loss

For the Word Classifier model (classification task), we utilized the cross-entropy loss function to train the model to predict the correct word class. The cross-entropy loss is a widely used loss function in classification problems, particularly when the target variable is categorical. It measures the dissimilarity between the predicted probability distribution and the true distribution.
Let $N$ be the number of samples, $C$ the number of classes (here $C = 601$, corresponding to the unique words in the recording), and $y_i$ the one-hot encoded vector representing the true class for the $i$-th sample. Let $\hat{y}_i$ represent the predicted probability distribution over the classes for the $i$-th sample, obtained from the softmax layer of the model. The cross-entropy loss is defined as:

$$\mathcal{L}_{\text{CE}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij}\log \hat{y}_{ij},$$

where:
• $y_{ij}$ is the true label for the $j$-th class of the $i$-th sample, which is either 0 or 1,
• $\hat{y}_{ij}$ is the predicted probability for the $j$-th class of the $i$-th sample.

6.2 Connectionist Temporal Classification (CTC) Loss

For the Seq2Seq model trained on EEG and audio transcripts of sentences, we employed the Connectionist Temporal Classification (CTC) loss. This loss is particularly suited for problems where the input and output sequences are of different lengths and the alignment between them is unknown. CTC enables the model to learn alignments implicitly during training.

Let $x$ represent the input sequence (e.g., EEG data), $y$ the target sequence (e.g., the audio transcript of the sentence), and $\mathcal{B}(\cdot)$ the CTC output transformation, which maps the model's predictions to the set of valid sequences by collapsing repeated characters and removing blank symbols. The model outputs a probability distribution over all possible alignments of the input to the target sequence. The CTC loss for a single training example is defined as:

$$\mathcal{L}_{\text{CTC}} = -\log P(y|x),$$

where $P(y|x)$ is the probability of the target sequence given the input, obtained by summing over all valid alignments $a$ of $x$ to $y$:

$$P(y|x) = \sum_{a \in \mathcal{B}^{-1}(y)} P(a|x).$$

6.3 VAE Loss Function

For the AugVAE-EEG model, we deployed a loss function consisting of two components:
1. Reconstruction Loss: Measures how well the decoder reconstructs the input data from the latent representation.
2. KL Divergence Loss: Regularizes the latent space by encouraging the approximate posterior distribution to match the prior distribution (typically a standard normal distribution).

The total loss is expressed as:

$$\mathcal{L}_{\text{VAE}}(x, \hat{x}, \mu, \log\sigma^2) = \mathcal{L}_{\text{reconstruction}}(x, \hat{x}) + \beta \cdot \mathcal{L}_{\text{KL}}(\mu, \log\sigma^2) \tag{5}$$

where:
• $x$ is the input data.
• $\hat{x}$ is the reconstructed data.
• $\mu$ and $\log\sigma^2$ are the mean and log variance of the latent variables, respectively.
• $\beta$ is a weighting factor to balance the two terms.

This loss function is specifically designed for VAEs to address the dual objectives of accurate reconstruction and a well-structured latent space. By jointly optimizing these components, the VAE learns a meaningful low-dimensional representation of the input data, which is critical for generating realistic synthetic data.

6.3.1 Reconstruction Loss

The reconstruction loss ensures that the decoder output $\hat{x}$ is close to the input $x$. This can be defined as the negative log likelihood of the reconstruction:

$$\mathcal{L}_{\text{reconstruction}}(x, \hat{x}) = \|x - \hat{x}\|_2^2 \tag{6}$$

for Mean Squared Error (MSE), or alternatively:

$$\mathcal{L}_{\text{reconstruction}}(x, \hat{x}) = -\sum_{i}\left[x_i \log \hat{x}_i + (1 - x_i)\log(1 - \hat{x}_i)\right] \tag{7}$$

for Binary Cross-Entropy (BCE) when the input is binary.

6.3.2 KL Divergence Loss

The KL divergence loss measures the difference between the approximate posterior $q(z|x)$ and the prior $p(z)$:

$$\mathcal{L}_{\text{KL}}(\mu, \log\sigma^2) = -\frac{1}{2}\sum_{j=1}^{d}\left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right) \tag{8}$$

where $d$ is the dimensionality of the latent space.

7 Baseline Models

7.1 Baseline Model Description

The baseline model we selected is from the work Voicing Silent Speech Gaddy and Dan [2022], which addresses the task of converting electromyography (EMG) data from facial muscle movements produced by silently mouthed words into audible speech. This model also serves as the baseline for A Cross-Modal Approach to Silent Speech with LLM-Enhanced Recognition Benster et al. [2024].
It was the first attempt to train a deep learning model specifically on EMG data from silent speech and provided a benchmark on WER. One of the key innovations of this model is its use of a cross-modal training approach, which aligns audio from vocalized speech with EMG data from silent speech. This alignment is a critical challenge, since no audio is produced during silent speech.

The goal of Gaddy's study is to capture articulatory information from muscle movements using EMG sensors and then train sequence-to-sequence deep learning models to create audio that corresponds to these silently mouthed words. The study consists of a series of architectures, with the most recent one published in his dissertation Voicing Silent Speech Gaddy and Dan [2022], which is the model we replicated. The model is divided into two main components, feature extraction and transduction, with an alignment mechanism for training on silent speech data that has no corresponding time-aligned audio. Audible speech is ultimately generated using a neural vocoder (HiFi-GAN).

7.1.1 Feature extraction

In the learned feature extraction phase, the raw EMG inputs from silent speech ($E_S$) are first pre-processed to reduce noise and normalized, then passed to a CNN to learn a set of latent representations of the EMG input ($E'_S$). This set of EMG features is then passed through a neural transduction model that predicts the corresponding audio features ($\hat{A}'_S$), such as mel-spectrograms (MFCC) and phonemes.

The CNN models utilize three residual convolution blocks, each comprising two convolutions of kernel size 3, each with a batch normalization layer and a ReLU activation function, plus a shortcut path with a width-1 convolution. This design allows the model to learn complex patterns in the EMG data and extract useful feature representations for subsequent processing.

7.1.2 Transduction model

The core transduction model employs a Transformer architecture to convert EMG features into audio features.
This Transformer model uses a self-attention mechanism that comprises multiple attention heads to aggregate information across time. The attention weights in Gaddy's model were computed using a learned vector $p$ that accounts for the relative distance between the query and key positions:

$$a_{ij} = \mathrm{softmax}\left(\frac{(W_K x_j + p_{ij})^\top (W_Q x_i)}{\sqrt{d}}\right)$$

Here, $W_K$ and $W_Q$ are learned projection matrices, and $d$ is the dimension of the vectors. This approach enables the model to effectively utilize temporal relationships within the sequence, which enhances its ability to translate EMG signal patterns into precise audio features.

Figure 5: Baseline Model Workflow

7.1.3 Training with alignment

The training involves both silent and vocalized EMG data as input. The loss between the audio features predicted from vocalized EMG ($\hat{A}'_V$) and the target audio features ($A_V$) is simply the Euclidean distance. However, to train the model on silent EMG data, dynamic time warping (DTW) was used to align the model's audio feature predictions from silent EMG ($\hat{A}'_S$) with the vocalized audio features ($A_V$). This alignment is key to training the model to generate intelligible speech, as silent speech produces no audio. The DTW alignment cost is calculated based on the Euclidean distance between the predicted and target mel-spectrogram features, denoted as

$$\delta[i, j] = \|\hat{A}'_S[i] - A'_V[j]\|$$

By iteratively aligning predicted features with target audio features during training, the model improves its predictions by leveraging the aligned vocalized examples to refine its understanding of silent speech patterns.

7.1.4 Auxiliary phoneme loss

An auxiliary phoneme prediction loss was introduced in Gaddy's study to further improve learning. This approach produces an auxiliary phoneme label along with the audio feature vectors at each time step. Phoneme distributions are predicted by adding a linear layer and softmax activation to the transduction model encoder.
The target phoneme labels for each audio feature frame are obtained using the Montreal Forced Aligner, which aligns phoneme sequences with audio based on reference text and a phonemic dictionary. The aligner runs a Viterbi decode using a pre-trained acoustic model for scoring. The training loss is, therefore, adjusted to include the phoneme negative log likelihood, weighted by $\lambda$:

$$\mathcal{L} = \sum_i \left( \lVert A'[i] - \tilde{A}[i] \rVert - \lambda P[i]^\top \log \tilde{P}[i] \right)$$

where $A'$ and $\tilde{A}$ represent the audio feature targets and the aligned audio feature predictions, and $P$ and $\tilde{P}$ denote the phoneme targets and the aligned predicted probabilities. This phoneme prediction layer is discarded after training. Due to the limited data size, this auxiliary loss provides significant guidance and regularization to prevent overfitting.

7.1.5 Vocoding

The final part of Gaddy’s model is the vocoder, which converts the predicted audio features into audio waveforms that can be played and heard. This study uses HiFi-GAN Kong et al. [2020], a neural model that can generate all audio samples in parallel. This model offers quicker inference than autoregressive models such as WaveNet, which was used in his previous experiments. HiFi-GAN uses a generative adversarial loss with multiple discriminators for realistic audio synthesis. The model is fine-tuned with dataset-specific vocalized examples so that it handles prediction artifacts and produces high-quality output. HiFi-GAN was found to produce more natural-sounding speech than alternatives like WORLD and WaveNet.

7.2 Baseline Implementation Completeness

The baseline model, described in section 3, was replicated using the repository associated with Gaddy and Dan [2022]. We achieved a WER of 34.9% on the open dataset, close to the WER of 36.1% reported by Gaddy and Dan [2022]. Since the discrepancy is small in magnitude and there is some variability between trained models, we made no further effort to resolve it.
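For reference, WER is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The block below is a minimal sketch of that definition, not the scoring code of the replicated repository. The function names, whitespace tokenization, and pooling of edits over all sentences are assumptions made for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def word_error_rate(references, hypotheses):
    """Corpus-level WER: total word edits divided by total reference words.

    Assumption: sentences are tokenized on whitespace and edits are pooled
    over the whole corpus; other scoring conventions exist.
    """
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens = ref.split()
        hyp_tokens = hyp.split()
        total_edits += edit_distance(ref_tokens, hyp_tokens)
        total_words += len(ref_tokens)
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_edits / total_words
```

Because insertions count as errors, WER is not bounded by 100%: a hypothesis much longer than its reference can score above 1.0.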
We also streamlined the training setup into a more readable notebook that produces similar results, for easier reference. This can be found on our GitHub.

8 Experiments

We extended the above architecture primarily in the following two ways:

8.1 Using VAEs for EEG Data Augmentation

VAEs show promising results in learning robust EEG representations by reconstructing masked input data, potentially improving noise resilience. We experimented with the following two VAE architectures:

1. Linear VAEs: We started with a simple linear VAE architecture and trained it on Brennan and Hale [2019].
2. Convolutional VAEs: We also tried a VAE with convolution layers for both eeg_raw and eeg_feats. For the Word Classifier model, we used eeg_raw.

8.2 Extending the EMG-based SOTA model to the EEG modality

We want to test the EMG-based architecture presented in the Baseline section on the EEG dataset Brennan and Hale [2019] and compare the results with those achieved in Défossez et al. [2023].

9 Results

9.1 EEG-to-Text

This section presents the ablation results of our experiments on the EEG-to-Text models. Two primary variants were explored: the Word Classifier model and the Seq2Seq model. The experiments for each model also included various techniques such as VAE-based EEG data augmentation, sentence-level stratified sampling, and masking. The results are discussed in terms of training and validation performance, with a focus on top-1/top-10 word accuracy for the word classifier and on training loss and WER for the sequence-to-sequence model.

9.1.1 Word Classifier Model

The word classification model was developed to predict individual words from EEG signals and evaluated under various configurations, including VAE-based data augmentations. The corpus contained a total of 601 unique words. Performance was measured using Top-1 Accuracy and Top-10 Accuracy.
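As a reading aid, top-k accuracy counts a prediction as correct when the true word is among the k highest-scoring classes. The sketch below illustrates the metric in plain Python, together with a frequency-only reference point that always proposes the k most frequent training words. Both function names are hypothetical and this is not the evaluation code used in our experiments.

```python
from collections import Counter


def top_k_accuracy(scores, targets, k):
    """Fraction of examples whose target index is among the k highest scores.

    scores: one list of per-class scores per example.
    targets: the true class index of each example.
    """
    hits = 0
    for row, target in zip(scores, targets):
        ranked = sorted(range(len(row)), key=lambda idx: row[idx], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(targets)


def frequency_baseline_top_k(train_words, eval_words, k):
    """Top-k accuracy of a model that ignores its input entirely and always
    proposes the k most frequent training words (hypothetical helper)."""
    most_frequent = {word for word, _ in Counter(train_words).most_common(k)}
    return sum(word in most_frequent for word in eval_words) / len(eval_words)
```

Comparing a classifier against the frequency-only number on the same split shows how much of its top-k accuracy could be obtained without using the EEG input at all.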
For the initial experiment, the model was trained on EEG data from 10 subjects using stratified sampling, with time and frequency masking applied during preprocessing. The results showed a Top-1 validation accuracy of 4.1% and a Top-10 validation accuracy of 26.82%. These outcomes align with prior results from Meta’s report on the same dataset, where a Top-10 accuracy of 25.7% was achieved Défossez et al. [2023]. However, upon analyzing the model’s outputs, we observed that the predictions were dominated by the most frequent words in the dataset (e.g., “she,” “the,” and “was”), as shown in the following list:

[‘she’, ‘the’, ‘was’, ‘it’, ‘and’, ‘to’, ‘i’, ‘that’, ‘had’, ‘a’]

This may indicate that the model primarily learned the word frequency distribution rather than meaningful EEG-to-text representations. Figure 6 illustrates the rapid convergence of the Top-10 validation accuracy, which plateaued at 26.82% after just two epochs.

(a) Training top-10 accuracy. (b) Validation top-10 accuracy.
Figure 6: Training and Validation Top-10 accuracy for the Word Classifier model.

To address the observed imbalance in the word frequency distribution, we applied inverse word frequency weights to the cross-entropy loss function. With this adjustment, the model’s performance deteriorated significantly, with the Top-10 validation accuracy barely exceeding 0.1% (see Figure 7). This further demonstrates that the model failed to capture meaningful EEG-to-text mappings, and underscores the inherent difficulty of decoding EEG signals straight into discrete words.

9.1.2 Sequence to Sequence Model

The Seq2Seq model was designed to map EEG signals to full sentences, with the goal of capturing the temporal dependencies inherent in EEG data. Unlike the word-level classifier, the Seq2Seq model generates sentence-level outputs, which enables a more comprehensive decoding of EEG signals.
A series of experiments were conducted to evaluate the effects of masking, stratified sampling, and extended training epochs on the model’s performance. In the initial experiment, the model was trained on EEG data collected from 22 subjects. The data was randomly split into training, validation, and testing sets in an 8:1:1 ratio. A character-level tokenizer was employed for sentence generation, and the model was trained for 200 epochs. The results (Table 3) revealed a training WER of 20.97% and a validation WER of 92.48%, which indicates significant overfitting. To mitigate this issue, time masking (with a parameter of 50) and frequency masking (with a parameter of 10) were applied. These techniques improved the validation WER slightly to 92.32%, while the training WER increased to 37.73% due to the noise introduced by masking. Although this approach reduced overfitting, it did not yield substantial improvements in the model’s generalization capabilities.

The impact of masking on the training and validation WER is illustrated in Figure 8. In Figure 8a, the training WER decreases steadily as the number of epochs increases, with masking leading to slower convergence due to the added noise. Figure 8b shows that the validation WER of the masked model still fluctuates and improves only slightly overall.

(a) Training top-10 accuracy. (b) Validation top-10 accuracy.
Figure 7: Training and Validation Top-10 accuracy after applying inverse word frequency weights to the loss function.

(a) Training WER with and without masking. (b) Validation WER with and without masking.
Figure 8: Impact of time and frequency masking on training and validation WER over 200 epochs.

Upon further analysis of the results, it became evident that the sentence distribution across the training and validation sets was imbalanced. Certain sentences appeared exclusively in the training set, while others were present only in the validation set.
We speculated that this disparity might have posed a challenge, since the model was unable to learn EEG patterns corresponding to all sentences during the training phase. Consequently, the model might struggle to generalize and decode meaningful information for new sentences during validation.

To address this issue, a sentence-level stratified sampling strategy was implemented to ensure that all sentences appeared in both the training and validation sets, though with data from different subjects. Due to constraints in computational resources and time, this experiment was conducted on a subset of 10 subjects. Time and frequency masking were applied as part of the training process. The model was trained in four consecutive stages, with each stage building upon the weights of the preceding one. A total of 800 epochs were completed across these stages. After approximately 400 epochs, the validation WER plateaued at 93.23%, while the training WER continued to improve, eventually reaching 2.91%. The training loss also decreased significantly and converged to 0.06 by the end of the training process.

The training and validation WER metrics for these experiments are visualized in Figure 9. The graphs show the performance of the model trained with stratified sampling over the four training stages, which are distinguished by different colors: brown, green, cyan, and pink, each corresponding to 200 epochs.

(a) Training WER with stratified sampling. (b) Validation WER with stratified sampling.
Figure 9: Impact of stratified sampling on training and validation WER over 800 epochs.

| Experiment | Epochs | Train Loss | Train WER (%) | Val WER (%) |
| --- | --- | --- | --- | --- |
| 22 subjects | 200 | 0.3459 | 20.97 | 92.48 |
| 22 subjects + Time masking + Frequency masking | 200 | 0.5031 | 37.73 | 92.32 |
| 10 subjects + Stratified sampling + Masking | 800 | 0.0600 | 2.91 | 93.23 |

Table 3: Performance metrics for the Seq2Seq model under different experimental conditions.
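The sentence-level stratified sampling described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the code used for the experiments: the function name, the `(subject_id, sentence_id)` record layout, the validation fraction, and the seed are all hypothetical. It also assumes each subject contributes at most one recording per sentence, so that a sentence's training and validation recordings necessarily come from different subjects.

```python
import random
from collections import defaultdict


def sentence_stratified_split(records, val_fraction=0.1, seed=0):
    """Split recordings so every sentence appears in both train and validation.

    records: list of (subject_id, sentence_id) tuples, one per recording.
    Returns two sorted lists of indices into `records`.
    """
    rng = random.Random(seed)

    # Group recording indices by the sentence they correspond to.
    by_sentence = defaultdict(list)
    for idx, (_subject, sentence) in enumerate(records):
        by_sentence[sentence].append(idx)

    train_idx, val_idx = [], []
    for indices in by_sentence.values():
        rng.shuffle(indices)
        if len(indices) < 2:
            # A sentence with a single recording cannot be in both splits.
            train_idx.extend(indices)
            continue
        # Hold out at least one recording, but leave at least one for training.
        n_val = max(1, round(len(indices) * val_fraction))
        n_val = min(n_val, len(indices) - 1)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)
```

Note that a split of this kind means validation sentences are not unseen text. It measures generalization across subjects for known sentences, which is a different question from decoding sentences never seen during training.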
9.1.3 AugVAE-EEG

Augmenting the data by replacing 50% of the EEG signals with VAE-generated signals did not improve the performance of the model (Figure 10b). The top-10 accuracy converges to the same value as that of the word classifier trained without augmented EEG. This indicates that the data augmentation is not sufficient to steer the model away from predicting the most frequent words in the dataset. We observed the same pattern when training the model on 90% generated EEG data. The training accuracy and the top-1 accuracy also closely follow those of the model without augmented data.

(a) Training top-10 accuracy (b) Validation top-10 accuracy
Figure 10: Training and Validation Top-10 accuracy for the word classification model with 50% augmented VAE signals

10 Discussion

The results of our experiments highlight both the potential and challenges of EEG-based speech decoding. The Word Classifier achieved a Top-10 validation accuracy of 26.82%, which aligns with prior work (e.g., Meta’s report). However, examining the prediction results reveals that the classifier primarily learned word frequency distributions rather than meaningful EEG-to-text mappings. These findings suggest that while the models provide initial benchmarks for EEG-to-text decoding, further improvements in data preprocessing, model architectures, and training strategies are needed for practical applications. Similarly, the Seq2Seq model demonstrated the ability to capture temporal dependencies inherent in EEG data by achieving a training WER of 2.91%. However, the validation WER remained high at 93.23%, which indicates significant overfitting and poor generalization. This finding underscores the difficulty of decoding EEG signals into coherent text outputs, particularly in sentence-level tasks. At the same time, augmenting the data with VAE-generated EEG signals did not yield major improvements.
Our hypothesis was that the VAE may learn common patterns across subjects. The lack of improvement could be due to the low signal-to-noise ratio, the lack of large EEG datasets, or an indication that another model architecture is needed.

Here is the analysis of key results:

• Time and Frequency Masking: The marginal improvement in validation performance (WER reduced from 92.48% to 92.32%) indicates that the model has limited sensitivity to masking. While masking mitigates some overfitting, it fails to substantially enhance generalization. This suggests that more sophisticated regularization techniques or alternative augmentation methods may be required to achieve meaningful performance gains.
• Evaluating VAE-Based Data Augmentation: The classifier trained with VAE-augmented data performed similarly to the one trained without it, indicating that the generated data did not contribute meaningful variance or diversity. This highlights the limitations of the current VAE’s capacity to produce informative data. It underscores the need to refine the VAE architecture or explore more advanced strategies to leverage augmented inputs, such as optimizing the classifier’s design or incorporating additional training enhancements.
• The Role of Stratified Sampling in Training: Stratified sampling effectively balanced sentence distribution across training and validation datasets, reducing coverage imbalances and stabilizing training dynamics. Despite these improvements, validation performance remained constrained, with a WER of 93.23%, suggesting that the model’s generalization challenges are rooted in deeper issues such as subject-specific EEG variability and limited dataset size.
• Word Classifier vs Seq2Seq Models: The Word Classifier achieved a reasonable Top-10 accuracy of 26.82%, comparable to prior work. However, it struggled with infrequent words and lacked the ability to model sentence-level dependencies.
In contrast, the Seq2Seq model excelled in capturing temporal dependencies, enabling sentence-level decoding. Yet, its high validation WER highlighted significant challenges in generalization and overfitting.
• Sensitivity to Loss Function Weighting: Applying inverse word frequency weighting to the loss function of the word classifier drastically impacted performance, reducing validation accuracy to a mere 0.1%. This underscores the classifier’s reliance on patterns associated with high-frequency words in the dataset and its inability to generalize effectively to less frequent, yet more informative, words. The failure to adapt when frequency bias was counteracted suggests a fundamental limitation in the model’s architecture or training strategy, which prevents it from learning meaningful representations for rare words critical to real-world applications.

11 Future Work

The models explored in this project have significant potential for developing assistive technologies, particularly for individuals with speech impairments. However, the current performance levels suggest that these models are still far from practical deployment. Our experiments have revealed several areas for improvement and future exploration.

One of the major shortcomings of our study lies in the limited generalization capability of our models. Specifically, the word classifier primarily captures word frequency distributions rather than learning meaningful EEG-to-text mappings, while the Seq2Seq model suffers from overfitting, failing to generalize to unseen data. These issues show the challenges of decoding EEG signals directly into textual outputs and emphasize the need for more robust modeling approaches.

To address these issues, we propose the following future experiments:

• Train an AugVAE-EEG model for the sequence-to-sequence task: Currently, our AugVAE-EEG architecture is only applied to word-level data augmentation.
Extending this model to generate augmented data for the Seq2Seq task could potentially improve its performance by introducing more diversity and robustness to the training set. This would require fine-tuning the VAE to produce sentence-level EEG representations, which could better capture the temporal dependencies inherent in EEG signals.
• Apply the approach to speech production tasks (e.g., silent or imagined speech): The Brennan dataset, while useful, is relatively small and limited to speech perception tasks. Future work could explore the applicability of our methods to speech production tasks, such as decoding silent or imagined speech. These tasks are more aligned with real-world applications, such as assistive technologies for individuals with speech impairments. However, the lack of publicly available datasets for these paradigms poses a significant challenge. Developing new datasets or leveraging transfer learning from related domains could be viable solutions.
• Incorporate advanced data augmentation techniques: Beyond VAEs, more sophisticated data augmentation methods, such as generative adversarial networks (GANs), could be explored. GANs have shown promise in generating realistic and diverse synthetic data in other domains and could help address the limited size and variability of EEG datasets. Additionally, combining multiple augmentation techniques may further enhance model robustness.
• Leverage additional modalities such as audio and phonemes: The availability of audio and phoneme data in the dataset presents an opportunity to incorporate multimodal learning approaches. By integrating these complementary modalities, the model could potentially learn richer representations of the underlying speech signals, improving its ability to decode EEG data. Exploring techniques such as cross-modal training and alignment could further enhance performance and robustness.
12 Conclusions

While our findings did not fully achieve the original goal of robust and generalized EEG-to-text decoding, they provide valuable insights into the challenges and limitations of this task. The use of VAEs for EEG data augmentation showed promise but requires further refinement to produce more meaningful and diverse synthetic data. The Seq2Seq model, despite its overfitting, demonstrated the potential for capturing temporal relationships in EEG signals, which sets a foundation for future improvements.

In conclusion, this study lays the groundwork for future research on EEG-based speech decoding. By addressing the identified shortcomings and exploring new datasets, advanced augmentation techniques, and improved model architectures, we aim to move closer to the ultimate goal of practical brain-to-text systems. These advancements could have profound implications for assistive communication technologies and offer new possibilities for individuals with speech impairments.

13 Code

All the code used to train the model is included in this link, which was forked from the silent speech repository of Gaddy and Klein [2020] and adapted to work on the perception of EEG speech. The ablation study logs are publicly available at this link for further reference and analysis.

14 Division of Work

In alphabetical order by last name.

• Sadrishya Agrawal - Ran the experiments from a newer piece of research with state-of-the-art performance from Benster et al. [2024]. Conducted research on and implemented Variational Autoencoders. Integrated the VAE with the existing pipeline.
• Terrance Chen - Analyzed and ran the experiments from a newer piece of research with state-of-the-art performance from Benster et al. [2024]. Analyzed the methods and outlined the baseline model workflow from Gaddy and Dan [2022] using a diagram. Extended the work of Gaddy and Dan [2022] to the EEG sequence-to-sequence task.
Implemented the modeling framework for the word classification task. Prepared the presentation slides and compiled model diagrams. Scheduled and managed meetings.
• Yulin Chen - Replicated the results of the baseline model from Gaddy and Dan [2022]. Extended the baseline model to the EEG sequence-to-sequence task and ran ablations. Implemented the modeling framework for the word classification task. Compiled model description, results, discussion and conclusion sections for the final report.
• Pontus Soederhaell - Compiled and ran the notebook version of the model architecture from Gaddy and Dan [2022]. Implemented the EMG-to-text version of the baseline model. Implemented Variational Autoencoders and integration with the classification task.
• Kateryna Shapovalenko - Advised on deep learning model design, evaluation strategy, and project direction.

References

Miao Cai and Yu Zeng. Mae-eeg-transformer: A transformer-based approach combining masked autoencoder and cross-individual data augmentation pre-training for eeg classification. Biomedical Signal Processing and Control, 94:106131, 2024. ISSN 1746-8094. doi:10.1016/j.bspc.2024.106131. URL https://www.sciencedirect.com/science/article/pii/S1746809424001897.

Hsiang-Yun Sherry Chien, Hanlin Goh, Christopher M. Sandino, and Joseph Y. Cheng. Maeeg: Masked auto-encoder for eeg representation learning, 2022. URL https://arxiv.org/abs/2211.02625.

Alexandre Défossez, Charlotte Caucheteux, Jérémy Rapin, Ori Kabeli, and Jean-Rémi King. Decoding speech perception from non-invasive brain recordings. Nature Machine Intelligence, 5(10):1097–1107, October 2023. ISSN 2522-5839. doi:10.1038/s42256-023-00714-5. URL http://dx.doi.org/10.1038/s42256-023-00714-5.

Anarghya Das, Puru Soni, Ming-Chun Huang, Feng Lin, and Wenyao Xu. Multimodal speech recognition using eeg and audio signals: A novel approach for enhancing asr systems. Smart Health, 32:100477, 2024. ISSN 2352-6483. doi:10.1016/j.smhl.2024.100477. URL https://www.sciencedirect.com/science/article/pii/S2352648324000333.

Francis R. Willett, Erin M. Kunz, Chaofei Fan, Donald T. Avansino, Guy H. Wilson, Eun Young Choi, Foram Kamdar, Matthew F. Glasser, Leigh R. Hochberg, Shaul Druckmann, Krishna V. Shenoy, and Jaimie M. Henderson. Thinking out loud, an open-access eeg-based bci dataset for inner speech recognition. Scientific Data, 9:52, 2022. doi:10.1038/s41597-022-01147-2.

B. Denby, T. Schultz, K. Honda, T. Hueber, J.M. Gilbert, and J.S. Brumberg. Silent speech interfaces. Speech Communication, 52(4):270–287, 2010. ISSN 0167-6393. doi:10.1016/j.specom.2009.08.002. URL https://www.sciencedirect.com/science/article/pii/S0167639309001307. Silent Speech Interfaces.

David Gaddy and Dan Klein. Digital voicing of silent speech. arXiv preprint arXiv:2010.02960, 2020. doi:10.48550/arXiv.2010.02960. EMNLP 2020.

Tyler Benster, Guy Wilson, Reshef Elisha, Francis R Willett, and Shaul Druckmann. A cross-modal approach to silent speech with llm-enhanced recognition. 2024. URL https://arxiv.org/abs/2403.05583.

M.M. Abdulghani, W.L. Walters, and K.H. Abed. Classification using eeg and deep learning. Bioengineering 2023, 10:649. doi:10.3390/.

Germán A. Pressel Coretto, Iván E. Gareis, and H. Leonardo Rufiner. Open access database of EEG signals recorded during imagined speech. In Eduardo Romero, Natasha Lepore, Jorge Brieva, and Ignacio Larrabide, editors, 12th International Symposium on Medical Information Processing and Analysis, volume 10160, page 1016002. International Society for Optics and Photonics, SPIE, 2017. doi:10.1117/12.2255697. URL https://doi.org/10.1117/12.2255697.

Chuong H Nguyen, George K Karavas, and Panagiotis Artemiadis. Inferring imagined speech using eeg signals: a new approach using riemannian manifold features. Journal of Neural Engineering, 15(1):016002, dec 2017. doi:10.1088/1741-2552/aa8235. URL https://dx.doi.org/10.1088/1741-2552/aa8235.

A state-of-the-art review of eeg-based imagined speech decoding. 16, 2022. doi:10.3389/fnhum.2022.867281.

Nrushingh Charan Mahapatra and Prachet Bhuyan. Decoding of imagined speech electroencephalography neural signals using transfer learning method. J. Phys. Commun., 7, 2023. doi:10.1088/2399-6528/ad0197.

Dong-Yeon Lee, Minji Lee, and Seong-Whan Lee. Classification of imagined speech using siamese neural network. In 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 2979–2984, 2020. doi:10.1109/SMC42975.2020.9282982.

Diederik P Kingma and Max Welling. Auto-encoding variational bayes, 2022. URL https://arxiv.org/abs/1312.6114.

Jane Saldanha, Shaunak Chakraborty, Shruti Patil, Ketan Kotecha, Satish Kumar, and Anand Nayyar. Data augmentation using variational autoencoders for improvement of respiratory disease classification. PLoS ONE, 17(8):e0266467, 2022. doi:10.1371/journal.pone.0266467.

Hiromitsu Nishizaki. Data augmentation and feature extraction using variational autoencoder for acoustic modeling. In 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pages 1222–1227, 2017. doi:10.1109/APSIPA.2017.8282225.

David Gaddy and Klein Dan. Voicing silent speech. PhD thesis, eScholarship, University of California, Berkeley, 2022. URL https://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-68.pdf.

Jonathan R. Brennan and John T. Hale. Hierarchical structure guides rapid linguistic predictions during naturalistic listening. PLoS ONE, 14(1):e0207741, 2019. doi:10.1371/journal.pone.0207741.

Nima Bigdely Shamlo, T. Mullen, Christian Kothe, Kyungmin Su, and K. Robbins. The prep pipeline: standardized preprocessing for large-scale eeg analysis. Frontiers in Neuroinformatics, 9, 2015. doi:10.3389/fninf.2015.00016.

Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis, 2020. URL https://arxiv.org/abs/2010.05646.

---