Authors: Terrance Yu-Hao Chen, Yulin Chen, Pontus Soederhaell, Sadrishya Agrawal, Kateryna Shapovalenko
DECODING EEG SPEECH PERCEPTION WITH TRANSFORMERS
AND VAE-BASED DATA AUGMENTATION
Terrance Yu-Hao Chen
Computer Science Department
Carnegie Mellon University
Pittsburgh, PA 15213
terrancc@andrew.cmu.edu

Yulin Chen
Information Networking Institute
Carnegie Mellon University
Pittsburgh, PA 15213
jolinc@andrew.cmu.edu

Pontus Soederhaell
Computational Finance
Carnegie Mellon University
Pittsburgh, PA 15213
psoderha@andrew.cmu.edu

Sadrishya Agrawal
Software and Societal Systems Department
Carnegie Mellon University
Pittsburgh, PA 15213
sadrisha@andrew.cmu.edu

Kateryna Shapovalenko
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA 15213
kshapova@andrew.cmu.edu
January 9, 2025
ABSTRACT
Decoding speech from non-invasive brain signals, such as electroencephalography (EEG), has the
potential to advance brain-computer interfaces (BCIs), with applications in silent communication
and assistive technologies for individuals with speech impairments. However, EEG-based speech
decoding faces major challenges, such as noisy data, limited datasets, and poor performance on
complex tasks like speech perception. This study attempts to address these challenges by employing
variational autoencoders (VAEs) for EEG data augmentation to improve data quality and applying
a state-of-the-art (SOTA) sequence-to-sequence deep learning architecture, originally successful
in electromyography (EMG) tasks, to EEG-based speech decoding. Additionally, we adapt this
architecture for word classification tasks. Using the Brennan dataset, which contains EEG recordings
of subjects listening to narrated speech, we preprocess the data and evaluate both classification and
sequence-to-sequence models for EEG-to-words/sentences tasks. Our experiments show that VAEs
have the potential to reconstruct artificial EEG data for augmentation. Meanwhile, our sequence-
to-sequence model achieves more promising performance in generating sentences compared to our
classification model, though both remain challenging tasks. These findings lay the groundwork for
future research on EEG speech perception decoding, with possible extensions to speech production
tasks such as silent or imagined speech.
Keywords Brain Signal Processing · EEG · EMG · Speech Decoding · Brain-to-Text · Speech Production · Silent
Speech · Speech Perception · Deep Learning · Transformers · VAEs · Data Augmentation
1 Introduction
Surface electroencephalography (EEG) has long been a standard, non-invasive method for measuring electrical brain
activity. In recent years, the field of Brain-Computer Interfaces (BCIs) has seen significant advancements, largely driven
by breakthroughs in artificial intelligence and deep learning. We plan to further these advances, particularly in the
domain of the decoding of speech perception.
Decoding imagined speech from EEG signals presents a promising avenue for developing assistive technologies for
individuals with speech impairments, as well as communication systems for environments requiring silence or with
high background noise. However, EEG-based imagined speech decoding faces several challenges, including low
signal-to-noise ratio (SNR), inter-subject variability, and limited datasets. This project aims to address these issues by
exploring pre-processing and deep learning techniques on more extensive datasets and utilizing those findings on EEG
data.
arXiv:2501.04359v1 [eess.AS] 8 Jan 2025
A PREPRINT - JANUARY 9, 2025
This project will investigate several innovative approaches to enhance EEG-based speech perception decoding:
• Using variational autoencoders (VAEs) to learn robust latent representations of EMG or EEG signals, potentially
improving noise resilience Cai and Zeng [2024], Chien et al. [2022].
• Adapting an EMG-based SOTA transformer model to be used on a different modality (EEG), with an aim to
increase decoding performance.
By combining these approaches, we aim to advance the state-of-the-art in EEG-based speech decoding, paving the way
for more accurate and practical BCI systems in real-world applications.
2 Literature Review
Brain-to-text represents a promising direction in neuroprosthetics and brain-computer interfaces (BCIs), especially for
individuals with impaired speech production due to neurological conditions. This technology seeks to interpret neural
signals corresponding to various speech-related activities, enabling communication through thought alone. The primary
paradigms explored in speech-related BCIs include speech perception, normal speech, silent speech, imagined speech,
and inner speech. Each paradigm offers unique insights and challenges for decoding speech from brain signals.
2.1 Speech Perception
Speech perception, also known as auditory comprehension, refers to the brain’s processing of external speech stimuli.
While this paradigm has contributed to understanding speech perception, its practical applications in BCIs are limited
due to the passive nature of the task and the complexity of the neural signals involved. The primary value lies in
advancing our understanding of the auditory cortex and its role in speech processing, providing a foundation for more
applied speech-decoding efforts.
Recent studies have focused on using AI to decode speech from brain activity while subjects listen to dialogues Défossez
et al. [2023], revealing patterns in brain activity that correlate with specific auditory stimuli. This research has the
potential to enhance our understanding of brain-based speech recognition but remains less well-defined compared to
other paradigms.
2.2 Speech Production
2.2.1 Active/Overt Speech
Active speech focuses on decoding neural activity during active speech production. By analyzing brain signals captured
while subjects speak out loud, researchers aim to map cortical activity patterns directly to spoken words. This approach,
which primarily uses modalities such as electroencephalography (EEG) and electrocorticography (ECoG), has yielded
notable progress in neural decoding. However, challenges remain due to noise and variability introduced by the muscle
movements during vocalization.
A particularly promising development is the integration of multimodal speech recognition, combining EEG with audio
signals. Das et al. [2024] explore this approach, where EEG data collected during overt speech is used alongside audio
signals. By leveraging deep learning techniques, the fusion of neural and audio data significantly enhanced performance,
particularly in noisy environments. Their multimodal model achieved a 95.39% classification accuracy and exhibited
resilience to white noise, outperforming traditional automatic speech recognition (ASR) systems reliant solely on audio
input. This synergy between EEG and audio data offers a promising new dimension for ASR, especially in challenging
conditions like background noise or speech impairments.
2.2.2 Silent Speech
Silent speech refers to the motor movements involved in speech production, such as mouth and tongue articulation,
without producing an audible sound. Recent advancements in silent speech interfaces have focused on decoding speech
using non-audible signals generated during articulation. Electrocorticography (ECoG)-based systems have shown high
accuracy in translating cortical signals into intended speech, particularly for individuals with paralysis. Willett et al.
[2022] leverage high-resolution intracranial recordings to decode speech, achieving 94% accuracy for phonemes and
whole words. These systems represent breakthroughs in assistive communication technologies, offering viable solutions
for patients with severe speech impairments Denby et al. [2010]. Additionally, recent work has explored multimodal
silent speech recognition systems, incorporating techniques like electromyography (EMG) Gaddy and Klein [2020]
and lip-reading sensors to enhance accuracy Denby et al. [2010]. Cross-modal models such as Multimodal Orofacial
Neural Audio (MONA) combine neural data with audio data, narrowing the gap between silent and vocalized speech
recognition. These systems achieve a substantial reduction in word error rates, demonstrating the potential of silent
speech interfaces in noisy and data-limited environments Benster et al. [2024].
2.2.3 Imaginary/Inner Speech
Two other sub-fields of speech production are imaginary speech and inner speech. Imaginary speech is the imagination
of speech without physical movement. Inner speech can be described as an inner monologue or “thinking in words”.
The two sub-fields are very closely related, and many studies treat them as equivalent. While imaginary/inner speech
and silent speech involve different data collection methods and experimental setups, they share significant potential for
decoding speech directly from brain activity.
Previous studies have tried to relate brain signals captured with MEG/EEG with the imaginary and inner speech of
letters, phonemes, and words. Abdulghani et al. used a headset with eight-channel EEG electrodes to record the brain
activities of four subjects while imagining speaking one of four specified commands: up, down, left, right. Subsequently,
an LSTM network was used to classify the commands based on the EEG recordings, with an accuracy of 92.5%. A
similar study was conducted by Coretto et al. [2017]. They used Random Forests and Support Vector Machines to
achieve 19.60% and 18.26% 6-class classification accuracy for commands, respectively, and 22.72% and 21.94% 5-class
classification accuracy for vowels.
Nguyen et al. [2017] compare the classification accuracy of imagined speech between vowels, short words, and long
words using Relevance Vector Machines. They achieve accuracies of 49.0%, 51.1%, and 66.2%, respectively, using class
sizes of three for vowels and short words and a class size of two for long words.
A major obstacle facing the development of EEG-based BCI technology is the lack of very large datasets. Moreover,
the low SNR of EEG signals makes it difficult to distinguish relevant signals from background brain activity EEG
[2022]. Charan Mahapatra and Bhuyan [2023] and Lee et al. [2020] investigate two different approaches to combat
these challenges. The first-mentioned applies transfer learning by utilizing ResNet and DenseNet, pre-trained on large
amounts of images. They achieve an 11-class classification accuracy of 82.35% on the KaraOne dataset and 16-class
classification accuracy of 89.01% on the FEIS dataset. Lee et al. [2020], on the other hand, trained a Siamese network
with contrastive loss to construct embeddings, followed by a KNN classifier, achieving a 6-class classification accuracy of
31.40%.
2.3 Variational Autoencoder
The variational autoencoder was first introduced by Kingma and Welling [2022]. The VAE is composed of an encoder
$q_\phi(z|x)$, which maps the input $x$ to a latent space $z$, and a decoder $p_\theta(x|z)$, which reconstructs the data from the
latent variables. The objective function, known as the Evidence Lower Bound (ELBO), balances the reconstruction accuracy
and the regularization of the latent space. By assuming that the prior distribution of the latent space is standard Gaussian
and applying the reparametrization trick, we obtain the following form of the objective, which is maximized during training:
\[
\mathcal{L}(\theta, \phi; x^{(i)}) = \frac{1}{2} \sum_{j=1}^{J} \left( 1 + \log\big((\sigma_j^{(i)})^2\big) - (\mu_j^{(i)})^2 - (\sigma_j^{(i)})^2 \right) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x^{(i)} | z^{(i,l)})
\]
In the above formula, $j$ indexes the $J$ dimensions of the latent space and $l$ indexes the $L$ sampled latent vectors. The
first summation is the negative of the KL divergence between the approximate posterior distribution of the encoder
$q_\phi(z|x)$ and the prior distribution $p(z)$, while the second summation is the expected log-likelihood of the data under
the decoder, i.e., the negative of the reconstruction loss. Maximizing the ELBO is therefore equivalent to minimizing the
sum of the KL divergence and the reconstruction loss.
The reparametrization trick is done by expressing $z$ as:
\[
z^{(i)} = \mu^{(i)} + \sigma^{(i)} \odot \epsilon^{(i)}, \quad \epsilon^{(i)} \sim \mathcal{N}(0, I)
\]
Since $z$ is expressed as a deterministic function of the mean $\mu^{(i)}$ and standard deviation $\sigma^{(i)}$ plus a noise term $\epsilon$,
gradient descent can be used for optimization.
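As a minimal sketch of the reparametrization step, the snippet below draws latent samples as $z = \mu + \sigma \odot \epsilon$ using only the Python standard library. The function name `reparameterize` and the toy values of the mean and standard deviation are illustrative assumptions, not taken from the paper's code.

```python
import random

def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps, with eps ~ N(0, I) sampled per latent dimension."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

# Toy 2-dimensional latent distribution (hypothetical values).
rng = random.Random(0)
mu = [0.5, -1.0]
sigma = [0.1, 2.0]

# The randomness lives entirely in eps, so z is a deterministic function of mu and sigma.
samples = [reparameterize(mu, sigma, rng) for _ in range(20000)]
mean_z = [sum(col) / len(col) for col in zip(*samples)]
```

Because the noise is sampled independently of the encoder outputs, gradients with respect to the mean and standard deviation pass through the sampling step unchanged, which is the property the trick relies on.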
Variational Autoencoders have been successfully used in studies to improve classification models through data
augmentation. Since the latent space follows a multivariate Gaussian distribution, new variations of the latent space
can be generated and passed through the decoder. For example, Saldanha et al. [2022] improve the accuracy of a
respiratory disease classification model by generating new samples through MLP-VAE, CNN-VAE, and Conditional
VAE. Moreover, Nishizaki [2017] uses VAE-based data augmentation to improve a speech recognition model.
Figure 1: Diagram of EEG-to-Text Word Classifier Training Flow
3 Model Description
3.1 EEG-to-Text Models
3.1.1 Word Classifier Model
The Word Classifier model, as illustrated in Figure 1, is designed for EEG-to-word classification tasks. The corpus
consists of 601 unique vocabulary entries, making this a 601-class classification problem. This design aligns with Meta’s
research on the same dataset Défossez et al. [2023], which also focuses on predicting a single word corresponding to a
specific EEG input window. A detailed summary of its parameters is provided in Table 1.
Model Architecture:
• Input: The input to the Word Classifier consists of preprocessed EEG signals, similar to the Seq2Seq model.
The data is further truncated such that each input window corresponds to a specific word representation in the
EEG data.
• Feature Extraction and Sequence Modeling: The model employs ResBlocks for feature extraction, and
the extracted features are passed through a Transformer Encoder with six encoder layers. These designs are
similar to the Seq2Seq model.
• Output and Classification: The final output is passed through a linear layer, which maps the output of the
transformer encoder to a fixed vocabulary size of 601 words. A softmax activation function is applied to
generate a probability distribution over the vocabulary. The model is trained using the Cross-Entropy Loss
function, which aims to minimize the differences between predicted and actual word distributions.
3.1.2 Sequence to Sequence Model
The Sequence-to-Sequence model (Seq2Seq) is designed to map EEG signals to corresponding sentences. It leverages
a combination of convolutional layers and transformer encoder layers to process the input EEG signals and generate
meaningful sequences. The model architecture is adapted from Gaddy’s EMG recognition model Gaddy and Dan
[2022]. The model diagram is illustrated in Figure 2, and a detailed summary of its parameters is provided in Table 2.
Model Architecture:
• Input: The input to the Seq2Seq model consists of preprocessed EEG signals, represented as temporal
sequences across multiple EEG channels. The EEG features are derived from the preprocessing pipeline
described in Section 4.2.
• Feature Extraction: The initial feature extraction is performed using multiple Residual Blocks (ResBlocks).
Each ResBlock comprises stacked 1D convolutional layers, followed by Batch Normalization and residual
connections. The goal of these blocks is to capture both spatial and temporal features of the EEG signals
effectively. The extracted features are projected to a higher-dimensional representation using a Linear layer to
prepare them for sequence modeling.
• Sequence Modeling: The core of the model is the Transformer Encoder, consisting of six encoder layers.
Each layer employs a multi-head self-attention mechanism to capture temporal dependencies and long-range
relationships within the EEG signal. This design allows the model to attend to different parts of the signal.
• Output and Decoding: The output of the transformer encoder is passed through a CTC Beam Decoder, which
aligns the predicted sequences with the target sentence labels. The Connectionist Temporal Classification
(CTC) loss is used for training, which enables the model to handle length mismatches between input EEG
sequences and output text sequences.
Table 1: Model Summary: EEGWordClsModel, Params size (MB): 145.68
Layer (type:depth-idx)                    Output Shape      Param #
EEGWordClsModel                           [8, 601]          768
  Sequential: 1-1                         [8, 768, 130]     –
    ResBlock: 2-1                         [8, 768, 260]     –
      Conv1d: 3-1                         [8, 768, 260]     139,008
      BatchNorm1d: 3-2                    [8, 768, 260]     1,536
      Conv1d: 3-3                         [8, 768, 260]     1,770,240
      BatchNorm1d: 3-4                    [8, 768, 260]     1,536
      Conv1d: 3-5                         [8, 768, 260]     46,848
      BatchNorm1d: 3-6                    [8, 768, 260]     1,536
    ResBlock: 2-2                         [8, 768, 130]     –
      Conv1d: 3-7                         [8, 768, 130]     1,770,240
      BatchNorm1d: 3-8                    [8, 768, 130]     1,536
      Conv1d: 3-9                         [8, 768, 130]     1,770,240
      BatchNorm1d: 3-10                   [8, 768, 130]     1,536
      Conv1d: 3-11                        [8, 768, 130]     590,592
      BatchNorm1d: 3-12                   [8, 768, 130]     1,536
  Linear: 1-2                             [8, 130, 768]     590,592
  TransformerEncoder: 1-3                 [131, 8, 768]     –
    ModuleList: 2-3                       –                 –
      TransformerEncoderLayer: 3-13       [131, 8, 768]     7,237,632
      TransformerEncoderLayer: 3-14       [131, 8, 768]     7,237,632
      TransformerEncoderLayer: 3-15       [131, 8, 768]     7,237,632
      TransformerEncoderLayer: 3-16       [131, 8, 768]     7,237,632
      TransformerEncoderLayer: 3-17       [131, 8, 768]     7,237,632
      TransformerEncoderLayer: 3-18       [131, 8, 768]     7,237,632
  Linear: 1-4                             [8, 601]          462,169
3.2 AugVAE-EEG Model
The VAE is trained on pre-processed EEG data using the ELBO loss function as explained in Section 2.3. A latent space
of dimension 64 was used. The VAE was trained on 10 subjects with high comprehension scores. For the final model,
the encoder and decoder consist of two linear layers of size 512 and 256 with ReLU activation functions. VAEs were
trained for both eeg_raw and eeg_feats (EEG features derived from the manual preprocessing in Section 4.2). However,
for the final model, only eeg_raw was used.
The mean and standard deviation of the latent space for each word are computed across subjects. The decoder is then
integrated into the training pipeline: with a certain probability, a new latent vector is drawn based on the
mean and standard deviation for the appropriate label and passed through the decoder to generate a new sample. The
generated sample is subsequently used as part of the batch in place of (or in addition to) the real EEG. An example of
real and generated EEG is shown in Figure 4.
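The label-conditioned augmentation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `latent_stats`, `maybe_augment`, the augmentation probability `p_aug`, and the toy `decoder` callable are all hypothetical names, and the sketch covers only the "in place of the real EEG" variant.

```python
import random
from statistics import mean, pstdev

def latent_stats(latents_by_word):
    """Per-word, per-dimension mean and standard deviation of latent codes pooled across subjects."""
    stats = {}
    for word, codes in latents_by_word.items():
        dims = list(zip(*codes))  # transpose: one tuple of values per latent dimension
        stats[word] = ([mean(d) for d in dims], [pstdev(d) for d in dims])
    return stats

def maybe_augment(real_eeg, word, stats, decoder, p_aug, rng):
    """With probability p_aug, return a decoded draw from the word's latent Gaussian; else the real sample."""
    if rng.random() >= p_aug:
        return real_eeg
    mu, sigma = stats[word]
    z = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
    return decoder(z)

# Toy usage: two subjects, a 2-dimensional latent space, and a stand-in decoder.
stats = latent_stats({"alice": [[1.0, 2.0], [3.0, 4.0]]})
toy_decoder = lambda z: [2.0 * v for v in z]  # placeholder for the trained VAE decoder
sample = maybe_augment([0.0, 0.0], "alice", stats, toy_decoder, p_aug=0.5, rng=random.Random(0))
```

In a real pipeline the decoder would be the trained VAE decoder and the latent codes would come from encoding each subject's EEG window for the given word.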
Figure 2: Diagram of EEGToText Seq2Seq Training Flow
Table 2: Model Summary: EEGSeqtoSeqModel, Params size (MB): 143.94
Layer (type:depth-idx)                    Output Shape      Param #
EEGSeqtoSeqModel                          [7, 1250, 38]     –
  Sequential: 1-1                         [7, 768, 1250]    –
    ResBlock: 2-1                         [7, 768, 2500]    –
      Conv1d: 3-1                         [7, 768, 2500]    139,008
      BatchNorm1d: 3-2                    [7, 768, 2500]    1,536
      Conv1d: 3-3                         [7, 768, 2500]    1,770,240
      BatchNorm1d: 3-4                    [7, 768, 2500]    1,536
      Conv1d: 3-5                         [7, 768, 2500]    46,848
      BatchNorm1d: 3-6                    [7, 768, 2500]    1,536
    ResBlock: 2-2                         [7, 768, 1250]    –
      Conv1d: 3-7                         [7, 768, 1250]    1,770,240
      BatchNorm1d: 3-8                    [7, 768, 1250]    1,536
      Conv1d: 3-9                         [7, 768, 1250]    1,770,240
      BatchNorm1d: 3-10                   [7, 768, 1250]    1,536
      Conv1d: 3-11                        [7, 768, 1250]    590,592
      BatchNorm1d: 3-12                   [7, 768, 1250]    1,536
  Linear: 1-2                             [7, 1250, 768]    590,592
  TransformerEncoder: 1-3                 [1250, 7, 768]    –
    ModuleList: 2-3                       –                 –
      TransformerEncoderLayer: 3-13       [1250, 7, 768]    7,237,632
      TransformerEncoderLayer: 3-14       [1250, 7, 768]    7,237,632
      TransformerEncoderLayer: 3-15       [1250, 7, 768]    7,237,632
      TransformerEncoderLayer: 3-16       [1250, 7, 768]    7,237,632
      TransformerEncoderLayer: 3-17       [1250, 7, 768]    7,237,632
      TransformerEncoderLayer: 3-18       [1250, 7, 768]    7,237,632
  Linear: 1-4                             [7, 1250, 38]     29,222
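The convolutional and linear parameter counts in Tables 1 and 2 can be cross-checked against the usual formulas for layers with a bias term. The sketch below assumes 60 input EEG channels and kernel sizes of 3 and 1; these values are inferred from the reported counts and are not stated explicitly in the tables. The helper names are hypothetical.

```python
def conv1d_params(c_in, c_out, kernel):
    """Weights (c_in * c_out * kernel) plus one bias per output channel."""
    return c_in * c_out * kernel + c_out

def linear_params(d_in, d_out):
    """Weight matrix (d_in * d_out) plus one bias per output unit."""
    return d_in * d_out + d_out

def batchnorm_params(channels):
    """One learned scale and one learned shift per channel."""
    return 2 * channels

counts = {
    "first conv (60 -> 768, k=3)": conv1d_params(60, 768, 3),
    "inner conv (768 -> 768, k=3)": conv1d_params(768, 768, 3),
    "shortcut conv, block 1 (60 -> 768, k=1)": conv1d_params(60, 768, 1),
    "shortcut conv, block 2 (768 -> 768, k=1)": conv1d_params(768, 768, 1),
    "batch norm (768)": batchnorm_params(768),
    "projection linear (768 -> 768)": linear_params(768, 768),
    "classifier head (768 -> 601)": linear_params(768, 601),
    "CTC head (768 -> 38)": linear_params(768, 38),
}
for name, value in counts.items():
    print(f"{name}: {value:,}")
```

Under these assumptions the two models differ only in their output heads: 601 word classes for the classifier and 38 output symbols for the CTC model.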
4 Dataset
4.1 Data
We use the Brennan and Hale [2019] dataset, which is available online. The participants are 33 adult volunteers (after
exclusions) who passively listened to a 12.4-minute audiobook story (the first chapter of Alice’s Adventures in Wonderland)
while EEG was recorded. The story included 2,129 words in 84 sentences and was slowed by 20% for better comprehension.
The EEG was recorded using 61 active electrodes, a 500 Hz sampling rate, and a 0.1-200 Hz bandpass filter.
Figure 3: Training Flow of AugVAE-EEG Model
Figure 4: Example of real and generated EEG signals
4.2 Data Preprocessing
EEG data is known to have low SNR due to its high susceptibility to various artifacts, such as eye movements, muscle
activities (subject-generated artifacts), and environmental electrical interference (externally generated artifacts) Shamlo
et al. [2015]. Therefore, it is essential that the data is properly cleaned and preprocessed before it is fed into
the model. Two key outputs were generated from the preprocessing pipeline: eeg_raw (minimally processed EEG
data) and eeg_feats (further processed EEG features). The preprocessing steps are as follows.
Preprocessing for eeg_raw:
1. Channel removal: Removed last two channels.
2. Baseline correction: Subtracted the mean of the first 0.5 seconds to remove DC offset and drifts.
3. Robust scaling: Reduced outlier impact using scikit-learn.
4. Handling outliers: Clipped extreme values (below 5th and above 95th percentiles) and clamped those
exceeding 20 standard deviations.
5. Normalization: Standardized to zero mean and unit variance.
Preprocessing for eeg_feats:
1. Temporal shifting: Shifted signals by 150 ms to align with stimuli.
2. Feature extraction: Applied convolutional layers to extract:
• Double-averaged signal
• RMS of wavelet coefficients and rectified signal
• Zero-crossing rate
• Mean of the rectified signal
3. Feature stacking: Combined extracted features for final representation.
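The eeg_raw steps 2 to 5 listed above can be sketched for a single channel as follows. This is an illustrative approximation using only the standard library: the median/IQR scaling stands in for the scikit-learn robust scaler mentioned in the text, and the function name `preprocess_channel`, its default arguments, and the percentile interpolation method are assumptions rather than the authors' settings.

```python
from statistics import mean, median, pstdev, quantiles

def preprocess_channel(x, fs=500, baseline_s=0.5, clamp_sd=20.0):
    """Baseline-correct, robust-scale, clip, clamp, and standardize one EEG channel."""
    # Baseline correction: subtract the mean of the first 0.5 s of samples.
    n_base = int(fs * baseline_s)
    base = mean(x[:n_base])
    x = [v - base for v in x]

    # Robust scaling: center on the median and divide by the interquartile range.
    q1, _, q3 = quantiles(x, n=4, method="inclusive")
    iqr = (q3 - q1) or 1.0
    med = median(x)
    x = [(v - med) / iqr for v in x]

    # Outlier handling: clip to the 5th and 95th percentiles.
    pct = quantiles(x, n=20, method="inclusive")  # cut points at 5%, 10%, ..., 95%
    lo, hi = pct[0], pct[-1]
    x = [min(max(v, lo), hi) for v in x]

    # Clamp anything still beyond clamp_sd standard deviations from the mean.
    m, sd = mean(x), (pstdev(x) or 1.0)
    x = [min(max(v, m - clamp_sd * sd), m + clamp_sd * sd) for v in x]

    # Normalization: zero mean and unit variance.
    m, sd = mean(x), (pstdev(x) or 1.0)
    return [(v - m) / sd for v in x]
```

With this ordering the 20-standard-deviation clamp rarely changes anything after percentile clipping; it is kept here only to mirror the listed steps.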
5 Evaluation Metrics
5.1 Word Classifier Model
For the word classification model, we used accuracy as the evaluation metric. Accuracy measures the proportion of
correctly classified samples to the total number of samples. It is defined mathematically as:
\[
\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}. \tag{1}
\]
Let $N$ denote the total number of predictions, and $C$ represent the number of correct predictions. The accuracy metric
can be expressed as:
\[
\text{Accuracy} = \frac{C}{N}. \tag{2}
\]
• $C$: The number of samples that were correctly classified by the model.
• $N$: The total number of samples in the dataset.
In the context of our classification task, accuracy is particularly relevant as it provides a straightforward measure of the
model’s performance.
5.2 Sequence-to-Sequence Model
The primary metric used for our Sequence-to-Sequence model is Word Error Rate (WER). This metric assesses the
intelligibility of model outputs generated from EEG signals (or, for the EMG baseline, from silent EMG signals) by
comparing transcriptions to reference text. The WER is calculated as follows:
\[
\text{WER} = \frac{\text{substitutions} + \text{insertions} + \text{deletions}}{\text{reference length}}. \tag{3}
\]
Let $S$, $I$, and $D$ represent the number of substitutions, insertions, and deletions, respectively, and let $R$ denote the
length of the reference text. The formula for WER is:
\[
\text{WER} = \frac{S + I + D}{R}. \tag{4}
\]
• $S$: The number of substitutions needed to match the reference text.
• $I$: The number of extra words inserted compared to the reference text.
• $D$: The number of deletions needed to match the reference text.
• $R$: The total number of words in the reference text.
WER is a critical evaluation metric for our Sequence-to-Sequence model as it directly measures the accuracy of
generated transcriptions in terms of word-level edits. Lower WER values indicate better alignment between the model’s
output and the reference text, thus reflecting improved performance in capturing the intended meaning from EEG or
EMG signals.
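A minimal sketch of the WER computation in Equation (4), using the standard word-level edit distance, is shown below. The function name `wer` and the whitespace tokenization are illustrative assumptions; the example sentences are made up for demonstration and are not model outputs.

```python
def wer(reference, hypothesis):
    """Word error rate: minimum (S + I + D) over word-level alignments, divided by the reference length R."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum number of edits turning the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("beginning" -> "starting") and one deletion ("very") over 7 reference words.
score = wer("alice was beginning to get very tired", "alice was starting to get tired")
```

Note that WER is not bounded by 1: a hypothesis with many inserted words can yield more edits than there are reference words. The sketch assumes a non-empty reference.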
6 Loss Functions
The choice of loss function is critical for the performance of any machine learning model, as it directly influences how
the model learns to represent and generate data.
6.1 Cross-Entropy Loss
For the Word Classifier model (classification task), we utilized the cross-entropy loss function to train the model
to predict the correct word class. The cross-entropy loss is a widely used loss function in classification problems,
particularly when the target variable is categorical. It measures the dissimilarity between the predicted probability
distribution and the true distribution.
Let $N$ be the number of samples, $C$ the number of classes (here $C = 601$, corresponding to the unique words in the
recording), and $y_i$ the one-hot encoded vector representing the true class for the $i$-th sample. Let $\hat{y}_i$ represent the
predicted probability distribution over the classes for the $i$-th sample, obtained from the softmax layer of the model.
The cross-entropy loss is defined as:
\[
\mathcal{L}_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \log \hat{y}_{ij},
\]
where:
• $y_{ij}$ is the true label for the $j$-th class of the $i$-th sample, which is either 0 or 1,
• $\hat{y}_{ij}$ is the predicted probability for the $j$-th class of the $i$-th sample.
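The cross-entropy loss above can be sketched in a few lines. Because the targets are one-hot, the inner sum over classes reduces to the negative log-probability of the true class, which is what the hypothetical helper `cross_entropy` below computes; the function names are illustrative assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(batch_logits, targets):
    """Mean over the batch of -log(softmax(logits)[target])."""
    losses = []
    for logits, target in zip(batch_logits, targets):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# A model that outputs uniform logits over C = 601 classes has loss log(601).
chance_loss = cross_entropy([[0.0] * 601], [0])
```

The value log(601), roughly 6.4, is a useful reference point: a 601-class model whose training loss stays near this value is not doing better than a uniform guess.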
6.2 Connectionist Temporal Classification (CTC) Loss
For the Seq2Seq model trained on EEG and audio transcripts of sentences, we employed the Connectionist Temporal
Classification (CTC) loss. This loss is particularly suited for problems where the input and output sequences are of
different lengths and the alignment between them is unknown. CTC enables the model to learn alignments implicitly
during training.
Let $x$ represent the input sequence (e.g., EEG data), $y$ the target sequence (e.g., the audio transcript of the sentence),
and $B(\cdot)$ the CTC output transformation, which maps the model’s predictions to the set of valid sequences by collapsing
repeated characters and removing blank symbols. The model outputs a probability distribution over all possible
alignments of the input to the target sequence.
The CTC loss for a single training example is defined as:
\[
\mathcal{L}_{CTC} = -\log P(y|x),
\]
where $P(y|x)$ is the probability of the target sequence given the input, obtained by summing over all valid alignments
$a$ of $x$ to $y$:
\[
P(y|x) = \sum_{a \in B^{-1}(y)} P(a|x).
\]
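To make the collapsing map and the sum over alignments concrete, the sketch below enumerates every alignment for a tiny example by brute force. The blank symbol `_`, the function names, and the toy per-frame probabilities are assumptions for illustration; practical CTC implementations compute the same quantity with dynamic programming rather than enumeration.

```python
from itertools import product

BLANK = "_"

def collapse(alignment):
    """CTC map B: merge consecutive repeated symbols, then remove blanks."""
    out = []
    prev = None
    for symbol in alignment:
        if symbol != prev and symbol != BLANK:
            out.append(symbol)
        prev = symbol
    return "".join(out)

def ctc_prob(frame_probs, target):
    """P(y|x): sum of path probabilities over all alignments that collapse to the target."""
    symbols = list(frame_probs[0])
    total = 0.0
    for path in product(symbols, repeat=len(frame_probs)):
        if collapse(path) == target:
            p = 1.0
            for t, symbol in enumerate(path):
                p *= frame_probs[t][symbol]  # frames are treated as conditionally independent
            total += p
    return total

# Two frames, alphabet {a, blank}: alignments "aa", "a_", and "_a" all collapse to "a".
frames = [{"a": 0.6, BLANK: 0.4}, {"a": 0.6, BLANK: 0.4}]
p_a = ctc_prob(frames, "a")
```

Here the three alignments contribute 0.36 + 0.24 + 0.24, and the only remaining alignment, two blanks, collapses to the empty string.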
6.3 VAE Loss Function
For the AugVAE-EEG model, we used a loss function consisting of two components:
1. Reconstruction Loss: Measures how well the decoder reconstructs the input data from the latent representation.
2. KL Divergence Loss: Regularizes the latent space by encouraging the approximate posterior distribution to
match the prior distribution (typically a standard normal distribution).
The total loss is expressed as:
\[
\mathcal{L}_{VAE}(x, \hat{x}, \mu, \log\sigma^2) = \mathcal{L}_{reconstruction}(x, \hat{x}) + \beta \cdot \mathcal{L}_{KL}(\mu, \log\sigma^2) \tag{5}
\]
where:
• $x$ is the input data.
• $\hat{x}$ is the reconstructed data.
• $\mu$ and $\log\sigma^2$ are the mean and log variance of the latent variables, respectively.
• $\beta$ is a weighting factor to balance the two terms.
This loss function is specifically designed for VAEs to address the dual objectives of accurate reconstruction and a
well-structured latent space. By jointly optimizing these components, the VAE learns a meaningful low-dimensional
representation of the input data, which is critical for generating realistic synthetic data.
6.3.1 Reconstruction Loss
The reconstruction loss ensures that the decoder output $\hat{x}$ is close to the input $x$. This can be defined as the negative log
likelihood of the reconstruction:
\[
\mathcal{L}_{reconstruction}(x, \hat{x}) = \lVert x - \hat{x} \rVert_2^2 \tag{6}
\]
for Mean Squared Error (MSE), or alternatively:
\[
\mathcal{L}_{reconstruction}(x, \hat{x}) = -\sum_i \left[ x_i \log \hat{x}_i + (1 - x_i) \log(1 - \hat{x}_i) \right] \tag{7}
\]
for Binary Cross-Entropy (BCE) when the input is binary.
6.3.2 KL Divergence Loss
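The KL divergence loss measures the difference between the approximate posterior $q(z|x)$ and the prior $p(z)$. For a
diagonal Gaussian posterior and a standard normal prior, it has the closed form:
\[
\mathcal{L}_{KL}(\mu, \log\sigma^2) = -\frac{1}{2} \sum_{j=1}^{d} \left( 1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2 \right) \tag{8}
\]
where $d$ is the dimensionality of the latent space. The leading minus sign makes this term non-negative, so that it acts as
a penalty when added to the reconstruction loss in Equation (5); it is the negative of the first summation of the ELBO in
Section 2.3.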
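The total loss of Equation (5) with the MSE reconstruction term can be sketched as below. The sketch uses the non-negative closed form of the KL divergence between a diagonal Gaussian and a standard normal prior, that is, minus one half of the sum of (1 + log variance − squared mean − variance), so that the term acts as a penalty to be minimized. Function names and the default value of the weighting factor are illustrative assumptions.

```python
import math

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))

def vae_loss(x, x_hat, mu, logvar, beta=1.0):
    """Squared-error reconstruction loss plus beta times the KL penalty."""
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return reconstruction + beta * kl_to_standard_normal(mu, logvar)

# A posterior equal to the prior (mu = 0, log variance = 0) incurs no KL penalty.
no_penalty = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
# Shifting the mean to 1 with unit variance costs 0.5 per dimension.
shifted = kl_to_standard_normal([1.0], [0.0])
```

Parameterizing the encoder output as a log variance, as in Equation (5), keeps the variance positive without any explicit constraint.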
7 Baseline Models
7.1 Baseline Model Description
The baseline model we selected is from the work Voicing Silent Speech Gaddy and Dan [2022], which addresses
the task of converting electromyography (EMG) data from facial muscle movements incurred by silently mouthed
words into audible speech. This model also serves as the baseline for A Cross-Modal Approach to Silent Speech with
LLM-Enhanced Recognition Benster et al. [2024]. It was the first attempt to train a deep learning model specifically on
EMG data from silent speech and provided a benchmark on WER. One of the key innovations of this model is its use
of a cross-modal training approach, which aligns audio from vocalized speech with EMG data from silent speech—a
critical challenge since no audio is produced during silent speech.
The goal of Gaddy’s study is to capture articulatory information from muscle movements using EMG sensors and then
train sequence-to-sequence deep learning models to create audio that corresponds to these silently mouthed words. This
study consists of a series of architectures, with the most recent one published in his dissertation Voicing Silent Speech
Gaddy and Dan [2022], which is the model we replicated. The model is divided into two main components: feature
extraction and transduction, with an alignment mechanism for training on silent speech data with no corresponding
time-aligned audio, and ultimately generating audible speech using a neural vocoder (HiFi-GAN).
7.1.1 Feature extraction
In the learned feature extraction phase, the raw EMG inputs from silent speech ($E_S$) are first pre-processed to reduce
noise and normalized, and then passed to a CNN that learns a set of latent representations of the EMG input ($E'_S$). This
set of EMG features is then passed through a neural transduction model that predicts the corresponding audio features
($\hat{A}'_S$), such as mel-spectrograms (MFCC) and phonemes. The CNN uses three residual convolution blocks, each
comprising two convolutions of kernel size 3 with batch normalization and a ReLU activation function, plus a
shortcut path with a width-1 convolution. This design allows the model to learn complex patterns in the
EMG data and extract useful feature representations for subsequent processing.
7.1.2 Transduction model
The core transduction model employs a Transformer architecture to convert EMG features into audio features. This
Transformer model uses a self-attention mechanism that comprises multiple attention heads to aggregate information
across time. The attention weights in Gaddy’s model were computed using a learned vector $p$ that accounts for the
relative distance between the query and key positions:
\[
a_{ij} = \operatorname{softmax}\left( \frac{(W_K x_j + p_{ij})^\top (W_Q x_i)}{\sqrt{d}} \right)
\]
Here, $W_K$ and $W_Q$ are learned projection matrices, and $d$ is the dimension of the vectors. This approach enables the
model to effectively utilize temporal relationships within the sequence, which enhances its ability to translate EMG
signal patterns into precise audio features.
Figure 5: Baseline Model Workflow
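The relative-position attention weights of Section 7.1.2 can be illustrated for a single head with plain Python lists. This is a sketch under stated assumptions, not Gaddy's implementation: the function names, the dictionary `rel_emb` keyed by the offset j − i, and the toy inputs are hypothetical, and the softmax is taken over the key index j for each query i.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relative_attention_weights(xs, W_K, W_Q, rel_emb):
    """a_ij = softmax_j( (W_K x_j + p_ij)^T (W_Q x_i) / sqrt(d) ), with p_ij = rel_emb[j - i]."""
    d = len(W_Q)  # dimension of the projected query/key vectors
    weights = []
    for i, x_i in enumerate(xs):
        q = matvec(W_Q, x_i)
        scores = []
        for j, x_j in enumerate(xs):
            k = matvec(W_K, x_j)
            p = rel_emb[j - i]
            scores.append(sum((kv + pv) * qv for kv, pv, qv in zip(k, p, q)) / math.sqrt(d))
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights.append([e / z for e in exps])
    return weights

# Toy example: three identical inputs, identity projections, and a bias toward offset 0.
identity = [[1.0, 0.0], [0.0, 1.0]]
xs = [[1.0, 0.0]] * 3
rel_emb = {off: [0.0, 0.0] for off in range(-2, 3)}
rel_emb[0] = [5.0, 0.0]
attn = relative_attention_weights(xs, identity, identity, rel_emb)
```

With identical inputs, any difference between the attention weights comes from the relative-position vectors alone, so each position attends most strongly to itself in this toy setup.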
7.1.3 Training with alignment
The training involves both silent and vocalized EMG data as input. The loss between the audio features predicted from
vocalized EMG ($\hat{A}'_V$) and the target audio features ($A'_V$) is simply the Euclidean distance. However, to train the
model on silent EMG data, dynamic time warping (DTW) was used to align the model’s audio feature predictions from
silent EMG ($\hat{A}'_S$) with the vocalized audio features ($A'_V$). This alignment is key to training the model to generate
intelligible speech, as silent speech produces no audio. The DTW alignment cost is calculated based on the Euclidean
distance between the predicted and target mel-spectrogram features, denoted as
\[
\delta[i, j] = \lVert \hat{A}'_S[i] - A'_V[j] \rVert
\]
By iteratively aligning predicted features with target audio features during training, the model improves its predictions
by leveraging the aligned vocalized examples to refine its understanding of silent speech patterns.
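A minimal sketch of the DTW alignment cost, using the frame-wise Euclidean distance as the local cost, is given below. It implements textbook DTW with the three standard moves (advance either sequence or both); the function name `dtw_cost` and the toy one-dimensional feature sequences are assumptions, and Gaddy's training code may differ in details such as step constraints.

```python
import math

def dtw_cost(pred, target):
    """Minimum cumulative Euclidean cost of a monotonic alignment between two feature sequences."""
    n, m = len(pred), len(target)
    inf = float("inf")
    # D[i][j] = best cost of aligning the first i predicted frames with the first j target frames
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            delta = math.dist(pred[i - 1], target[j - 1])  # local cost delta[i, j]
            D[i][j] = delta + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# A time-stretched copy of the target aligns with zero cost.
stretched = dtw_cost([[0.0], [1.0], [1.0], [2.0]], [[0.0], [1.0], [2.0]])
```

The zero cost for the stretched sequence shows why DTW suits this setting: silent and vocalized utterances of the same text differ in timing, and the alignment absorbs that difference before the loss is computed.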
7.1.4 Auxiliary phoneme loss
An auxiliary phoneme prediction loss was introduced in Gaddy’s study to further improve learning. This approach
produces an auxiliary phoneme label along with the audio feature vectors at each time step. Phoneme distributions are
predicted by adding a linear layer and softmax activation to the transduction model encoder. The target phoneme labels
for each audio feature frame are obtained using the Montreal Forced Aligner, which aligns phoneme sequences with
audio based on reference text and a phonemic dictionary. The aligner runs a Viterbi decode using a pre-trained acoustic
model for scoring. The training loss is, therefore, adjusted to include phoneme negative log likelihood, weighted by λ:
11
Page 12:
APREPRINT - JANUARY 9, 2025
$$\mathcal{L} = \sum_i \left( \lVert A'[i] - \tilde{A}[i] \rVert + \lambda \left( -P[i]^\top \log \tilde{P}[i] \right) \right)$$
where $A'$ and $\tilde{A}$ represent the audio feature targets and the aligned audio feature predictions, and $P$ and
$\tilde{P}$ denote the phoneme targets and the aligned predicted phoneme probabilities.
This phoneme prediction layer is disregarded after training. Due to the limited data size, this auxiliary loss provides
significant guidance and regularization to prevent overfitting.
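As a rough illustration of how the two terms combine, the sketch below sums, over frames, the Euclidean distance between target and predicted audio features plus the λ-weighted phoneme negative log likelihood. The function name, the data layout (lists of frames), and the value `lam=0.1` are placeholders chosen for the example and are not taken from the original work.

```python
import math

def combined_loss(audio_targets, audio_preds, phoneme_targets, phoneme_probs, lam=0.1):
    """Sum over frames of Euclidean audio loss plus lam * phoneme NLL.

    audio_targets / audio_preds: lists of feature vectors, already aligned.
    phoneme_targets: list of one-hot (or soft) target distributions.
    phoneme_probs: list of predicted probability distributions (softmax output).
    lam: weight of the auxiliary term (hypothetical default).
    """
    total = 0.0
    for a_t, a_p, p_t, p_p in zip(audio_targets, audio_preds, phoneme_targets, phoneme_probs):
        dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a_t, a_p)))
        # Negative log likelihood of the target phoneme under the prediction.
        nll = -sum(t * math.log(q) for t, q in zip(p_t, p_p) if t > 0)
        total += dist + lam * nll
    return total

# One frame: perfect audio prediction, uncertain phoneme prediction.
loss = combined_loss(
    audio_targets=[[1.0, 2.0]],
    audio_preds=[[1.0, 2.0]],
    phoneme_targets=[[0.0, 1.0]],
    phoneme_probs=[[0.5, 0.5]],
)
print(loss)
```

With a perfect audio prediction the loss reduces to the weighted phoneme term alone, which shows how the auxiliary signal keeps providing gradient even when the audio term is small.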
7.1.5 Vocoding
The final part of Gaddy’s model is the vocoder, which converts the predicted audio features into audio waveforms that can
be played and heard. The study uses HiFi-GAN Kong et al. [2020], a neural model that generates all audio
samples in parallel. This model offers faster inference than autoregressive models such as WaveNet, which was used in
his previous experiments. HiFi-GAN uses a generative adversarial loss with multiple discriminators for realistic audio
synthesis. The model is fine-tuned on dataset-specific vocalized examples so that it handles prediction artifacts and
produces high-quality output. HiFi-GAN was found to produce more natural-sounding speech than alternatives like WORLD and
WaveNet.
7.2 Baseline Implementation Completeness
The baseline model, described in section 3, was replicated by running the code from the repository associated with Gaddy and
Dan [2022]. We achieved a WER of 34.9% on the open dataset, close to the WER of 36.1% reported
by Gaddy and Dan [2022]. Since the discrepancy is small and results vary somewhat between
training runs, we made no further effort to resolve it. We also streamlined the training setup into a more
readable notebook that produces similar results, for easier reference. This can be found on our GitHub.
8 Experiments
We extended the above architecture primarily in the following two ways:
8.1 Using VAEs for EEG Data Augmentation
VAEs show promising results in learning robust EEG representations by reconstructing masked input data, potentially
improving noise resilience. We experimented with the following two VAE architectures:
1. Linear VAEs: We started with a simple linear VAE architecture and trained it on the Brennan and Hale [2019] dataset.
2. Convolutional VAEs: We also tried a VAE with convolutional layers for both eeg_raw and eeg_feats. For
the Word Classifier model, we used eeg_raw.
8.2 Extending the EMG-based SOTA model to the EEG modality
We want to test the EMG-based architecture presented in the Baseline section on the EEG dataset of Brennan and Hale
[2019] and compare our results with those achieved in Défossez et al. [2023].
9 Results
9.1 EEG-to-Text
This section presents the ablation results of our experiments on the EEG-to-Text models. Two primary variants were
explored: the Word Classifier model and the Seq2Seq model. The experiments for each model also included various
techniques such as VAE-based EEG data augmentation, sentence-level stratified sampling, and masking.
The results are discussed in terms of training and validation performance, with a focus on top-1/top-10 word accuracy
for the word classifier and on training loss and word error rate (WER) for the sequence-to-sequence model.
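For reference, WER is conventionally computed as the word-level edit distance (substitutions, insertions, deletions) between the reference and the hypothesis, divided by the number of reference words. The sketch below follows that standard definition; it is not the evaluation code used in the experiments, and the function name and the whitespace tokenization are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the first i-1 reference words and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution (free if words match)
            ))
        prev = curr
    return prev[-1] / len(ref)

wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"{wer:.4f}")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.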
9.1.1 Word Classifier Model
The word classification model was developed to predict individual words from EEG signals and was evaluated under
various configurations, including VAE-based data augmentation. The corpus contained a total of 601 unique words.
Performance was measured using Top-1 Accuracy and Top-10 Accuracy. For the initial experiment, the model was
trained on EEG data from 10 subjects using stratified sampling, with time and frequency masking applied during
preprocessing. The results showed a Top-1 validation accuracy of 4.1% and a Top-10 validation accuracy of 26.82%.
These outcomes align with prior results from Meta’s report on the same dataset, where a Top-10 accuracy of 25.7%
was achieved Défossez et al. [2023]. However, upon analyzing the model’s outputs, we observed that the predictions
were dominated by the most frequent words in the dataset (e.g., “she,” “the,” and “was”), as shown in the following list:
[‘she’, ‘the’, ‘was’, ‘it’, ‘and’, ‘to’, ‘i’, ‘that’, ‘had’, ‘a’]
This may indicate that the model primarily learned the word frequency distribution rather than meaningful EEG-to-text
representations. Figure 6 illustrates the rapid convergence of the Top-10 validation accuracy, which plateaued at
26.82% after just two epochs.
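One way to check whether a classifier has only learned word frequencies is to compare it with a predictor that ignores the EEG input and always returns the k most frequent training words. The sketch below shows such a check on toy data; the word lists are invented for illustration and the helper names are hypothetical.

```python
from collections import Counter

def top_k_accuracy(ranked_predictions, labels, k):
    """Fraction of examples whose label is among the first k ranked predictions."""
    hits = sum(1 for ranked, label in zip(ranked_predictions, labels) if label in ranked[:k])
    return hits / len(labels)

def frequency_baseline(train_labels, k):
    """The k most frequent training words, most frequent first."""
    return [word for word, _ in Counter(train_labels).most_common(k)]

# Toy data (hypothetical): a skewed word distribution, as in natural text.
train_words = ["the"] * 5 + ["she"] * 3 + ["was"] * 2 + ["rabbit"]
val_words = ["the", "she", "rabbit", "was", "the"]

top2 = frequency_baseline(train_words, k=2)
constant_predictions = [top2] * len(val_words)  # same guess for every EEG segment
baseline_acc = top_k_accuracy(constant_predictions, val_words, k=2)
print(top2, baseline_acc)
```

If a trained model's Top-10 accuracy matches this input-independent baseline on the real data, the accuracy figure alone gives no evidence that EEG information is being used.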
(a) Training top-10 accuracy.
(b) Validation top-10 accuracy.
Figure 6: Training and Validation Top-10 accuracy for the Word Classifier model.
To address the observed imbalance in the word frequency distribution, we applied inverse word frequency weights to
the cross-entropy loss function. Despite this adjustment, the model’s performance significantly deteriorated, with the
Top-10 validation accuracy barely exceeding 0.1% (see Figure 7). This further demonstrates that the model failed to
capture meaningful EEG-to-text mappings, and underscores the inherent difficulty of decoding EEG signals straight
into discrete words.
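The weighting scheme can be sketched as follows. The exact formula used in the experiments is not stated, so the "balanced" form n / (C · count) and the normalization by the sum of weights (the weighted-mean convention common in deep learning libraries) are assumptions, as are the function names and toy data.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-class weight n / (C * count): rare classes get larger weights."""
    counts = Counter(labels)
    n, num_classes = len(labels), len(counts)
    return {word: n / (num_classes * count) for word, count in counts.items()}

def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean negative log likelihood.

    probs: list of dicts mapping each word to its predicted probability.
    The sum of weighted losses is divided by the sum of the applied weights.
    """
    numerator = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    denominator = sum(weights[y] for y in labels)
    return numerator / denominator

# Toy data (hypothetical): one frequent word and one rare word.
labels = ["the", "the", "the", "rabbit"]
weights = inverse_frequency_weights(labels)
probs = [{"the": 0.75, "rabbit": 0.25}] * len(labels)  # frequency-only predictor
loss = weighted_cross_entropy(probs, labels, weights)
print(weights, loss)
```

Under this weighting each class contributes equally to the loss, so a predictor that simply mirrors the word frequencies is no longer rewarded for favouring common words.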
9.1.2 Sequence to Sequence Model
The Seq2Seq model was designed to map EEG signals to full sentences, with the goal of capturing the temporal
dependencies inherent in EEG data. Compared to the word-level classifier, the Seq2Seq model generates sentence-level
outputs, which enables a more comprehensive decoding of EEG signals. A series of experiments were conducted to
evaluate the effects of masking, stratified sampling, and extended training epochs on the model’s performance.
In the initial experiment, the model was trained on EEG data collected from 22 subjects. The data was randomly split
into training, validation, and testing sets in an 8:1:1 ratio. A character-level tokenizer was employed for sentence
generation, and the model was trained for 200 epochs. The results (Table 3) revealed a training WER of 20.97% and a validation
WER of 92.48%, which indicates significant overfitting. To mitigate this issue, time masking (with a parameter of 50)
and frequency masking (with a parameter of 10) were applied. These techniques improved the validation WER slightly
to 92.32%, while the training WER increased to 37.73% due to the noise introduced by masking. Although this
approach reduced overfitting, it did not yield substantial improvements in the model’s generalization capabilities.
The impact of masking on the training and validation WER is illustrated in Figure 8. In Figure 8a, the training WER decreases
steadily as the number of epochs increases, with masking leading to slower convergence due to the added noise. Figure
8b shows that the validation WER of the masked model still fluctuates and that overall performance improves only
marginally.
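Time and frequency masking can be sketched as below, in the SpecAugment style. The sketch assumes that the parameters 50 and 10 are maximum mask widths (in frames and in feature bins), that one mask of each kind is applied, and that masked values are set to zero; none of these details is confirmed by the text, and the function name is hypothetical.

```python
import random

def mask_time_and_frequency(features, time_mask_param=50, freq_mask_param=10, rng=None):
    """Return a masked copy of features plus the chosen mask positions.

    features: list of frames, each a list of feature values (time x frequency).
    One contiguous block of frames and one contiguous block of bins are zeroed.
    """
    rng = rng or random.Random()
    n_frames, n_bins = len(features), len(features[0])
    masked = [list(frame) for frame in features]  # copy, input is left unchanged

    # Time mask: zero every bin in a random run of consecutive frames.
    t_width = rng.randint(0, min(time_mask_param, n_frames))
    t_start = rng.randint(0, n_frames - t_width)
    for t in range(t_start, t_start + t_width):
        masked[t] = [0.0] * n_bins

    # Frequency mask: zero a random run of consecutive bins in every frame.
    f_width = rng.randint(0, min(freq_mask_param, n_bins))
    f_start = rng.randint(0, n_bins - f_width)
    for frame in masked:
        for f in range(f_start, f_start + f_width):
            frame[f] = 0.0

    return masked, (t_start, t_width), (f_start, f_width)

# Toy input (hypothetical shape): 200 frames with 32 features each.
example = [[1.0] * 32 for _ in range(200)]
masked, time_mask, freq_mask = mask_time_and_frequency(example, rng=random.Random(0))
print(time_mask, freq_mask)
```

Because a new mask is drawn for every call, the model sees a differently corrupted version of each example in every epoch, which is consistent with the slower training convergence reported above.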
(a) Training top-10 accuracy.
(b) Validation top-10 accuracy.
Figure 7: Training and Validation Top-10 accuracy after applying inverse word frequency weights to the loss function.
(a) Training WER with and without masking.
(b) Validation WER with and without masking.
Figure 8: Impact of time and frequency masking on training and validation WER over 200 epochs.
Upon further analysis of the results, it became evident that the sentence distribution across the training and validation
sets was imbalanced. Certain sentences appeared exclusively in the training set, while others were present only in the
validation set. We speculated that this disparity might have posed a challenge since the model was unable to learn EEG
patterns corresponding to all sentences during the training phase. Consequently, the model might struggle to generalize
and decode meaningful information for new sentences during validation. To address this issue, a sentence-level stratified
sampling strategy was implemented to ensure that all sentences appeared in both the training and validation sets, though
with data from different subjects.
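A sentence-level stratified split of this kind can be sketched as follows. The sketch assumes one recording per (subject, sentence) pair and a validation fraction of 10%; the record layout, the function name, and the fraction are illustrative assumptions rather than details of the actual pipeline.

```python
import random
from collections import defaultdict

def stratified_split_by_sentence(records, val_fraction=0.1, seed=0):
    """Split so that every sentence occurs in both sets, from different subjects.

    records: list of (subject_id, sentence_id) tuples, one per recording.
    """
    rng = random.Random(seed)
    by_sentence = defaultdict(list)
    for record in records:
        by_sentence[record[1]].append(record)

    train, val = [], []
    for sentence_id in sorted(by_sentence):
        group = by_sentence[sentence_id]
        rng.shuffle(group)
        # Hold out at least one recording, but always keep one for training.
        n_val = max(1, round(val_fraction * len(group)))
        n_val = min(n_val, len(group) - 1)
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val

# Toy data (hypothetical): 10 subjects who each heard the same 5 sentences.
records = [(subject, sentence) for subject in range(10) for sentence in range(5)]
train, val = stratified_split_by_sentence(records)
print(len(train), len(val))
```

Note that this guarantees coverage of sentences, not of subjects: a given subject can appear in both sets, only with different sentences held out.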
Due to constraints in computational resources and time, this experiment was conducted on a subset of 10 subjects. Time
and frequency masking were applied as part of the training process. The model was trained in four consecutive stages,
with each stage building upon the weights of the preceding one. A total of 800 epochs were completed across these
stages. After approximately 400 epochs, the validation WER plateaued at 93.23%, while the training WER continued to
improve, eventually reaching 2.91%. The training loss also decreased significantly and converged to 0.06 by the end of
the training process.
The training and validation WER metrics for these experiments are visualized in Figure 9. The graphs represent the
performance of the model trained with stratified sampling over the four training stages, which are distinguished by
different colors: brown, green, cyan, and pink, each corresponding to 200 epochs.
(a) Training WER with stratified sampling.
(b) Validation WER with stratified sampling.
Figure 9: Impact of stratified sampling on training and validation WER over 800 epochs.
| Experiment | Epochs | Train Loss | Train WER (%) | Val WER (%) |
|---|---|---|---|---|
| 22 subjects | 200 | 0.3459 | 20.97 | 92.48 |
| 22 subjects + Time masking + Frequency masking | 200 | 0.5031 | 37.73 | 92.32 |
| 10 subjects + Stratified sampling + Masking | 800 | 0.0600 | 2.91 | 93.23 |
Table 3: Performance metrics for the Seq2Seq model under different experimental conditions.
9.1.3 AugVAE-EEG
Augmenting the data by replacing 50% of the EEG signals with VAE-generated signals did not improve the performance
of the model (Figure 10b). The top-10 accuracy converges to the same value as that of the word classifier trained without
augmented EEG. This indicates that the data augmentation is not sufficient to steer the model away from predicting the
most frequent words in the dataset. We observed the same pattern when training the model on 90% generated EEG data.
The training accuracy and the top-1 accuracy closely follow those of the model without augmented data.
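The replacement step can be sketched as below. How the generated signals were produced is not specified here, so the `generate` callable is a stand-in for the trained VAE (for example, encoding a real sample and decoding it), and the function name, seed, and toy data are hypothetical.

```python
import random

def augment_with_generated(real_samples, generate, replace_fraction=0.5, seed=0):
    """Replace a random fraction of real samples with generated ones.

    generate: callable mapping a real sample to a synthetic one (stand-in for
    the VAE). Returns the mixed dataset and the set of replaced indices.
    """
    rng = random.Random(seed)
    n_replace = round(replace_fraction * len(real_samples))
    replaced = set(rng.sample(range(len(real_samples)), n_replace))
    mixed = [generate(x) if i in replaced else x for i, x in enumerate(real_samples)]
    return mixed, replaced

# Toy data and a dummy generator (both hypothetical).
real = [[float(i)] * 4 for i in range(10)]
dummy_vae = lambda x: [v + 100.0 for v in x]
mixed, replaced = augment_with_generated(real, dummy_vae, replace_fraction=0.5)
print(sorted(replaced))
```

Since labels are kept unchanged for replaced samples, this scheme can only help if the generated signals preserve word-specific structure; if they mainly reproduce subject-independent averages, the classifier receives no new discriminative information.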
(a) Training top-10 accuracy
(b) Validation top-10 accuracy
Figure 10: Training and Validation Top-10 accuracy for the word classification model with 50% augmented VAE signals
10 Discussion
The results of our experiments highlight both the potential and challenges of EEG-based speech decoding. The
Word Classifier achieved a Top-10 validation accuracy of 26.82%, which aligns with prior work (e.g., Meta’s report).
However, examining the prediction results reveals that the classifier primarily learned word frequency distributions
rather than meaningful EEG-to-text mappings. These findings suggest that while the models provide initial benchmarks
for EEG-to-text decoding, further improvements in data preprocessing, model architectures, and training strategies
are needed for practical applications. Similarly, the Seq2Seq model demonstrated the ability to capture temporal
dependencies inherent in EEG data by achieving a training WER of 2.91%. However, the validation WER remained
high at 93.23%, which indicates significant overfitting and poor generalization. This finding underscores the difficulty
of decoding EEG signals into coherent text outputs, particularly in sentence-level tasks. At the same time, augmenting
the data with VAE-generated EEG signals did not yield major improvements. Our hypothesis was that a VAE may learn
common patterns across subjects. The lack of improvement could be due to the low signal-to-noise ratio, the lack of large
EEG datasets, or an indication that another model architecture is needed.
Here is the analysis of key results:
• Time and Frequency Masking: The marginal improvement in validation performance (WER reduced from
92.48% to 92.32%) indicates that the model has limited sensitivity to masking. While masking
mitigates some overfitting, it fails to substantially enhance generalization. This suggests that more sophisticated
regularization techniques or alternative augmentation methods may be required to achieve meaningful
performance gains.
• Evaluating VAE-Based Data Augmentation: The classifier trained with VAE-augmented data performed
similarly to the one trained without it, indicating that the generated data did not contribute meaningful variance
or diversity. This highlights the limited capacity of the current VAE to produce informative data. It
underscores the need to refine the VAE architecture or explore more advanced strategies to leverage augmented
inputs, such as optimizing the classifier’s design or incorporating additional training enhancements.
• The Role of Stratified Sampling in Training: Stratified sampling balanced the sentence distribution
across the training and validation datasets, reducing coverage imbalances and stabilizing training dynamics.
Despite these improvements, validation performance remained constrained, with a WER of 93.23%, suggesting
that the model’s generalization challenges are rooted in deeper issues such as subject-specific EEG variability
and limited dataset size.
• Word Classifier vs Seq2Seq Models: The Word Classifier achieved a Top-10 accuracy of 26.82%,
comparable to prior work. However, it struggled with infrequent words and lacked the ability to model
sentence-level dependencies. In contrast, the Seq2Seq model is designed to capture temporal dependencies, enabling
sentence-level decoding. Yet, its high validation WER highlighted significant challenges in generalization and
overfitting.
• Sensitivity to Loss Function Weighting: Applying inverse word frequency weighting to the loss function
of the word classifier drastically impacted performance, reducing Top-10 validation accuracy to a mere 0.1%. This
underscores the classifier’s reliance on patterns associated with high-frequency words in the dataset and its
inability to generalize effectively to less frequent, yet more informative, words. The failure to adapt when
frequency bias was counteracted suggests a fundamental limitation in the model’s architecture or training
strategy, which prevents it from learning meaningful representations for rare words critical to real-world
applications.
11 Future Work
The models explored in this project have significant potential for developing assistive technologies, particularly for
individuals with speech impairments. However, the current performance levels suggest that these models are still far
from practical deployment.
Our experiments have revealed several areas for improvement and future exploration. One of the major shortcomings of
our study lies in the limited generalization capability of our models. Specifically, the word classifier primarily captures
word frequency distributions rather than learning meaningful EEG-to-text mappings, while the Seq2Seq model suffers
from overfitting, failing to generalize to unseen data. These issues show the challenges of decoding EEG signals directly
into textual outputs and emphasize the need for more robust modeling approaches.
To address these issues, we propose the following future experiments:
• Train an AugVAE-EEG model for the sequence-to-sequence task: Currently, our AugVAE-EEG architecture
is only applied to word-level data augmentation. Extending this model to generate augmented data for the
Seq2Seq task could potentially improve its performance by introducing more diversity and robustness to the
training set. This would require fine-tuning the VAE to produce sentence-level EEG representations, which
could better capture the temporal dependencies inherent in EEG signals.
• Apply the approach to speech production tasks (e.g., silent or imagined speech): The Brennan dataset,
while useful, is relatively small and limited to speech perception tasks. Future work could explore the
applicability of our methods to speech production tasks, such as decoding silent or imagined speech. These
tasks are more aligned with real-world applications, such as assistive technologies for individuals with speech
impairments. However, the lack of publicly available datasets for these paradigms poses a significant challenge.
Developing new datasets or leveraging transfer learning from related domains could be viable solutions.
• Incorporate advanced data augmentation techniques: Beyond VAEs, more sophisticated data augmentation
methods, such as generative adversarial networks (GANs), could be explored. GANs have shown promise in
generating realistic and diverse synthetic data in other domains and could help address the limited size and
variability of EEG datasets. Additionally, combining multiple augmentation techniques may further enhance
model robustness.
• Leverage additional modalities such as audio and phonemes: The availability of audio and phoneme data
in the dataset presents an opportunity to incorporate multimodal learning approaches. By integrating these
complementary modalities, the model could potentially learn richer representations of the underlying speech
signals, improving its ability to decode EEG data. Exploring techniques such as cross-modal training and
alignment could further enhance performance and robustness.
12 Conclusions
While our findings did not fully achieve the original goal of robust and generalized EEG-to-text decoding, they provide
valuable insights into the challenges and limitations of this task. The use of VAEs for EEG data augmentation showed
promise but requires further refinement to produce more meaningful and diverse synthetic data. The Seq2Seq model,
despite its overfitting, demonstrated the potential for capturing temporal relationships in EEG signals, which lays a
foundation for future improvements.
In conclusion, this study lays the groundwork for future research on EEG-based speech decoding. By addressing
the identified shortcomings and exploring new datasets, advanced augmentation techniques, and improved model
architectures, we aim to move closer to the ultimate goal of practical brain-to-text systems. These advancements could
have profound implications for assistive communication technologies and have the potential to offer new possibilities
for individuals with speech impairments.
13 Code
All the code used to train the model is included in this link, which was forked from the silent speech repository of
Gaddy and Klein [2020] and adapted to work on the perception of EEG speech. The ablation study logs are publicly
available at this link for further reference and analysis.
14 Division of Work
In alphabetical order by last name.
• Sadrishya Agrawal - Ran the experiments from a newer piece of research with state-of-the-art performance
from Benster et al. [2024]. Conducted research on and implemented Variational Autoencoders. Integrated the
VAE with the existing pipeline.
• Terrance Chen - Analyzed and ran the experiments from a newer piece of research with state-of-the-art
performance from Benster et al. [2024]. Analyzed the methods and outlined the baseline model workflow
from Gaddy and Dan [2022] using a diagram. Extended the work of Gaddy and Dan [2022] to the EEG
sequence-to-sequence task. Implemented the modeling framework for the word classification task. Prepared
the presentation slides and compiled model diagrams. Scheduled and managed meetings.
• Yulin Chen - Replicated the results of the baseline model from Gaddy and Dan [2022]. Extended the baseline
model to the EEG sequence-to-sequence task and ran ablations. Implemented the modeling framework for the
word classification task. Compiled the model description, results, discussion, and conclusion sections for the final
report.
• Pontus Soederhaell - Compiled and ran the notebook version of the model architecture from Gaddy and Dan
[2022]. Implemented the EMG-to-text version of the baseline model. Implemented Variational Autoencoders
and their integration with the classification task.
• Kateryna Shapovalenko - Advised on deep learning model design, evaluation strategy, and project direction.
References
Miao Cai and Yu Zeng. Mae-eeg-transformer: A transformer-based approach combining masked autoencoder
and cross-individual data augmentation pre-training for eeg classification. Biomedical Signal Processing and
Control, 94:106131, 2024. ISSN 1746-8094. doi:10.1016/j.bspc.2024.106131. URL
https://www.sciencedirect.com/science/article/pii/S1746809424001897.
Hsiang-Yun Sherry Chien, Hanlin Goh, Christopher M. Sandino, and Joseph Y. Cheng. Maeeg: Masked auto-encoder
for eeg representation learning, 2022. URL https://arxiv.org/abs/2211.02625.
Alexandre Défossez, Charlotte Caucheteux, Jérémy Rapin, Ori Kabeli, and Jean-Rémi King. Decoding speech
perception from non-invasive brain recordings. Nature Machine Intelligence, 5(10):1097–1107, October 2023. ISSN
2522-5839. doi:10.1038/s42256-023-00714-5. URL http://dx.doi.org/10.1038/s42256-023-00714-5.
Anarghya Das, Puru Soni, Ming-Chun Huang, Feng Lin, and Wenyao Xu. Multimodal speech recognition using eeg
and audio signals: A novel approach for enhancing asr systems. Smart Health, 32:100477, 2024. ISSN 2352-6483.
doi:10.1016/j.smhl.2024.100477. URL
https://www.sciencedirect.com/science/article/pii/S2352648324000333.
Francis R. Willett, Erin M. Kunz, Chaofei Fan, Donald T. Avansino, Guy H. Wilson, Eun Young Choi, Foram
Kamdar, Matthew F. Glasser, Leigh R. Hochberg, Shaul Druckmann, Krishna V. Shenoy, and Jaimie M. Henderson.
Thinking out loud, an open-access eeg-based bci dataset for inner speech recognition. Scientific Data, 9:52, 2022.
doi:10.1038/s41597-022-01147-2.
B. Denby, T. Schultz, K. Honda, T. Hueber, J.M. Gilbert, and J.S. Brumberg. Silent speech interfaces. Speech
Communication, 52(4):270–287, 2010. ISSN 0167-6393. doi:10.1016/j.specom.2009.08.002. URL
https://www.sciencedirect.com/science/article/pii/S0167639309001307. Silent Speech Interfaces.
David Gaddy and Dan Klein. Digital voicing of silent speech. arXiv preprint arXiv:2010.02960, 2020.
doi:10.48550/arXiv.2010.02960. EMNLP 2020.
Tyler Benster, Guy Wilson, Reshef Elisha, Francis R Willett, and Shaul Druckmann. A cross-modal approach to silent
speech with llm-enhanced recognition. 2024. URL https://arxiv.org/abs/2403.05583.
M.M. Abdulghani, W.L. Walters, and K.H. Abed. Classification using eeg and deep learning. Bioengineering 2023, 10:
649. doi:10.3390/.
Germán A. Pressel Coretto, Iván E. Gareis, and H. Leonardo Rufiner. Open access database of EEG signals recorded
during imagined speech. In Eduardo Romero, Natasha Lepore, Jorge Brieva, and Ignacio Larrabide,
editors, 12th International Symposium on Medical Information Processing and Analysis, volume 10160, page
1016002. International Society for Optics and Photonics, SPIE, 2017. doi:10.1117/12.2255697. URL
https://doi.org/10.1117/12.2255697.
Chuong H Nguyen, George K Karavas, and Panagiotis Artemiadis. Inferring imagined speech using eeg signals:
a new approach using riemannian manifold features. Journal of Neural Engineering, 15(1):016002, dec 2017.
doi:10.1088/1741-2552/aa8235. URL https://dx.doi.org/10.1088/1741-2552/aa8235.
A state-of-the-art review of eeg-based imagined speech decoding. 16, 2022. doi:10.3389/fnhum.2022.867281.
Nrushingh Charan Mahapatra and Prachet Bhuyan. Decoding of imagined speech electroencephalography neural
signals using transfer learning method. J. Phys. Commun., 7, 2023. doi:10.1088/2399-6528/ad0197.
Dong-Yeon Lee, Minji Lee, and Seong-Whan Lee. Classification of imagined speech using siamese neural
network. In 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 2979–2984, 2020.
doi:10.1109/SMC42975.2020.9282982.
Diederik P Kingma and Max Welling. Auto-encoding variational bayes, 2022. URL
https://arxiv.org/abs/1312.6114.
Jane Saldanha, Shaunak Chakraborty, Shruti Patil, Ketan Kotecha, Satish Kumar, and Anand Nayyar. Data augmentation
using variational autoencoders for improvement of respiratory disease classification. PLoS ONE, 17(8):e0266467,
2022. doi:10.1371/journal.pone.0266467.
Hiromitsu Nishizaki. Data augmentation and feature extraction using variational autoencoder for acoustic modeling. In
2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC),
pages 1222–1227, 2017. doi:10.1109/APSIPA.2017.8282225.
David Gaddy and Klein Dan. Voicing silent speech. PhD thesis, eScholarship, University of California, Berkeley, 2022.
URL https://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-68.pdf.
Jonathan R. Brennan and John T. Hale. Hierarchical structure guides rapid linguistic predictions during naturalistic
listening. PLoS ONE, 14(1):e0207741, 2019. doi:10.1371/journal.pone.0207741.
Nima Bigdely Shamlo, T. Mullen, Christian Kothe, Kyungmin Su, and K. Robbins. The prep pipeline: standardized
preprocessing for large-scale eeg analysis. Frontiers in Neuroinformatics, 9, 2015. doi:10.3389/fninf.2015.00016.
Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. Hifi-gan: Generative adversarial networks for efficient and high
fidelity speech synthesis, 2020. URL https://arxiv.org/abs/2010.05646.