Authors: Chenqi Kong, Anwei Luo, Peijun Bao, Yi Yu, Haoliang Li, Zengwei Zheng, Shiqi Wang, Alex C. Kot
SUBMITTED TO IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING 1
MoE-FFD: Mixture of Experts for Generalized
and Parameter-Efficient Face Forgery Detection
Chenqi Kong, Member, IEEE, Anwei Luo, Peijun Bao, Yi Yu, Haoliang Li, Member, IEEE, Zengwei Zheng,
Shiqi Wang, Senior Member, IEEE, and Alex C. Kot, Life Fellow, IEEE
Abstract —Deepfakes have recently raised significant trust issues and security concerns among the public. Compared to CNN face
forgery detectors, ViT-based methods take advantage of the expressivity of transformers, achieving superior detection performance.
However, these approaches still exhibit the following limitations: (1) Fully fine-tuning ViT-based models from ImageNet weights
demands substantial computational and storage resources; (2) ViT-based methods struggle to capture local forgery clues, leading to
model bias; (3) These methods focus on only one or a few face forgery features, resulting in limited generalizability. To tackle
these challenges, this work introduces Mixture-of-Experts modules for Face Forgery Detection (MoE-FFD), a generalized yet
parameter-efficient ViT-based approach. MoE-FFD only updates lightweight Low-Rank Adaptation (LoRA) and Adapter layers while
keeping the ViT backbone frozen, thereby achieving parameter-efficient training. Moreover, MoE-FFD leverages the expressivity of
transformers and local priors of CNNs to simultaneously extract global and local forgery clues. Additionally, novel MoE modules are
designed to scale the model’s capacity and smartly select optimal forgery experts, further enhancing forgery detection performance.
Our proposed learning scheme can be seamlessly adapted to various transformer backbones in a plug-and-play manner. Extensive
experimental results demonstrate that the proposed method achieves state-of-the-art face forgery detection performance with
significantly reduced parameter overhead. The code is released at: https://github.com/LoveSiameseCat/MoE-FFD.
Index Terms —Deepfakes, Face Forgery Detection, Mixture-of-Experts, Generalizability, Robustness, Parameter-Efficient Training.
1 INTRODUCTION
With rapid advancements in Artificial Intelligence-Generated
Content (AIGC), forged facial content has
become increasingly sophisticated, making it difficult for
the human eye to distinguish between fake and real faces.
Non-experts can easily use face manipulation algorithms
to create highly realistic falsified facial images and videos,
known as Deepfakes. Consequently, the rapid proliferation
of Deepfake content on social media platforms has led to
significant security issues, including disinformation, fraud,
and impersonation [1]. Even worse, the complexity of de-
ployment environments in real-world applications further
deteriorates the performance of detection models. Therefore,
developing generalized and robust face forgery detectors to
counter malicious attacks remains a substantial challenge.
Early traditional Deepfake detection methods focused
on extracting hand-crafted features, such as eye-blinking
frequency [2] and head pose inconsistency [3]. These techniques
fail to capture representative features because
hand-crafted features restrict their scope to only one
or a few kinds of statistical information. With the advent of
deep learning, numerous CNN-based methods have
been proposed to improve detection accuracy [4], [5],
[6], [7], [8]. Many of them employ the Xception network [9]
or EfficientNet [10] as backbones due to their outstanding
performance in face forgery detection. To enhance generalizability
and robustness, some methods extract
common forgery features such as noise patterns [11], [12],
[13], blending artifacts [4], [14], frequency fingerprints [15],
[16], [17], [18], and identity inconsistency [8]. However,
these methods are inherently limited by the local
interactions of CNN architectures.
• C. Kong, P. Bao, Y. Yu, and A. C. Kot are with the Rapid-Rich Object
Search (ROSE) Lab, School of Electrical and Electronic Engineering,
Nanyang Technological University, Singapore, 639798.
E-mail: (chenqi.kong@ntu.edu.sg, peijun001@e.ntu.edu.sg,
yuyi0010@e.ntu.edu.sg, eackot@ntu.edu.sg)
• A. Luo is with the School of Computer Science and Engineering, Sun Yat-sen
University, Guangzhou, China.
E-mail: luoanw@mail2.sysu.edu.cn
• H. Li is with the Department of Electrical Engineering, City University
of Hong Kong, Hong Kong, China.
E-mail: haoliang.li@cityu.edu.hk
• Z. Zheng is with the Department of Computer Science and Computing,
Zhejiang University City College, Zhejiang, China.
E-mail: zhengzw@zucc.edu.cn
• S. Wang is with the Department of Computer Science, City University of
Hong Kong, Hong Kong, China.
E-mail: shiqwang@cityu.edu.hk
Fig. 1. Comparison between MoE-FFD (Ours) and open-source face
forgery detection models on the CelebDF-v2 dataset. The x-axis shows the
number of activated parameters and the y-axis shows the AUC detection
performance.
arXiv:2404.08452v2 [cs.CV] 8 Jun 2024
With the advent of the vision transformer (ViT) [19], ViT
architectures have achieved significant success in a wide
variety of computer vision tasks due to their long-range
interactions and outstanding expressivity. In the realm of
Deepfake detection, numerous ViT-based approaches [20],
[21], [22], [23], [24], [25], [26] have been proposed, achieving
enhanced accuracy and generalizability. For instance, Dong
et al. [20] designed a ViT to capture the identity inconsistency
between the inner and outer face regions. Zhuang et al.
[22] proposed a ViT to mine unsupervised inconsistency-
aware features within each frame. Shao et al. [21] introduced
a concise yet effective Seq-DeepFake Transformer to predict
a sequential vector of facial manipulation operations.
Nevertheless, ViT-based forgery detection approaches
still face several limitations. First, fully fine-tuning ViT-based
models from ImageNet weights demands substantial
computational resources, which hinders their deployment
or fine-tuning in real-world applications, particularly on
mobile devices with limited processing power. Second, al-
though ViT-based methods exhibit outstanding expressiv-
ity, they may struggle to capture forgery features in lo-
cal abnormal regions, resulting in model bias and limited
generalizability. Finally, previous methods focus on certain
forgery artifacts, but it is challenging to empirically select
the optimal features in unpredictable application scenarios.
This paper presents a generalized yet parameter-efficient
approach, MoE-FFD, which uses a Mixture of
Experts for Face Forgery Detection. MoE-FFD draws inspiration
from Parameter-Efficient Fine-Tuning (PEFT) and
integrates lightweight Low-Rank Adaptation (LoRA) layers
and Adapter layers into the ViT backbone. During
training, only the designed LoRA and Adapter
parameters are updated, while the ViT parameters remain
frozen at their ImageNet weights. The designed MoE modules
dynamically select optimal LoRA and Adapter experts
for face forgery detection.
Compared with prior work that directly fine-tunes all
ViT parameters, the proposed fine-tuning strategy
preserves the abundant knowledge learned from ImageNet
while enabling the model to adaptively learn forgery-specific
features. The designed LoRA layers model
the long-range interactions within input faces, while the
Convpass Adapter layers highlight local forgery
anomalies. Integrating the designed
LoRA and Adapter layers therefore leverages both the expressivity of
transformers and the local forgery priors of CNNs, leading
to enhanced generalizability and robustness. In addition,
we design novel Mixture of Experts (MoE) modules within
both LoRA and Adapter layers to scale the model’s capacity
with a fixed number of activated parameters. The MoE dynamically
selects optimal forgery detection experts for input faces,
further enhancing the model’s performance.
As depicted in Fig. 1, our MoE-FFD achieves the best AUC
score on the unseen CelebDF-v2 dataset [27] with the fewest
activated parameters. Overall, the contributions of
this work are summarized as follows:
• We integrate LoRA and Adapter modules
with the ViT backbone for face forgery detection.
This design enables the model to simultaneously
mine global and local forgery clues in a
parameter-efficient manner.
• We design novel Mixture-of-Experts (MoE) modules
to scale up the model capacity. These modules
dynamically select optimal forgery experts for input
faces, thereby boosting detection performance.
Furthermore, the MoE modules can be seamlessly
adapted to other transformer architectures in a
plug-and-play fashion.
• We conduct experiments on six Deepfake datasets
and under various common perturbations. Our experimental
results demonstrate that MoE-FFD achieves
state-of-the-art generalizability and robustness. Extensive
ablation experiments validate the effectiveness of our
proposed MoE learning scheme and the designed
components.
In this paper, Section 2 comprehensively reviews previous
related literature. Section 3 details the proposed mixture-of-experts
framework for face forgery detection. Section 4
presents extensive experimental results under various set-
tings. Finally, Section 5 concludes the paper and discusses
possible future research directions.
2 RELATED WORKS
In this section, we broadly review existing works on
face forgery detection, parameter-efficient fine-tuning, and
mixture of experts.
2.1 Face Forgery Detection
Early attempts at face forgery detection primarily re-
lied on extracting handcrafted features, such as the lack
of eye blinking [2], inconsistency of head pose [3], face
warping artifacts [28], and heart rate anomalies [29]. How-
ever, these methods suffer from limited accuracy due to
their narrow focus on specific statistical information. In re-
sponse, learning-based approaches have emerged, leverag-
ing generic network architectures like Xception Network [9],
EfficientNet [10], and Capsule Network [30] for forgery fea-
ture extraction. Nonetheless, CNN-based methods are prone
to overfitting to the training data, resulting in limited gen-
eralizability and robustness. Follow-up works, such as Face
X-ray [4], F3Net [18], SBI [31], DCL [32], and RECCE [33],
have introduced comprehensive forgery frameworks and
robust feature extraction techniques, enhancing the model’s
generalization capability. With the explosive development
of ViT [19], numerous ViT-based methods [20], [21], [22],
[23], [24], [25], [26] have been proposed to tackle the Deep-
fake problem. These approaches take advantage of non-
inductive bias and global context understanding, achiev-
ing superior face forgery detection performance. However,
most of these methods fine-tune the entire ViT model on
Deepfake datasets, which is computationally expensive and
may lead to loss of valuable knowledge from the ImageNet
dataset [34]. In contrast, our approach fine-tunes external
lightweight LoRA and Adapter parameters during training while
keeping the ViT backbone fixed at its ImageNet weights,
thus enabling the model to learn forgery-specific knowledge
and enhance the detection performance.
Fig. 2. Overview of the designed MoE-FFD framework. (a) Overall model structure; (b) Details of MoE-FFD transformer block; (c) Details of the
designed MoE Adapter layer; (d) Details of each Adapter expert; (e) Details of the designed MoE LoRA layer; (f) Details of each LoRA expert.
2.2 Parameter Efficient Fine-Tuning (PEFT)
Over the past few years, the size of deep learning models
has exponentially increased, especially after the advent of
transformers. Consequently, numerous Parameter Efficient
Fine-Tuning (PEFT) methods have been proposed to reduce
the computational and storage overhead. PEFT only updates
a small portion of the model parameters while freezing
the majority of pretrained weights. Adapter [35], [36] is a
typical PEFT method, comprising a down-sample layer and
an up-sample layer, generally integrated into transformer
layers and blocks. Low-Rank Adaptation (LoRA) [37] layer
aims at updating two low-rank matrices, significantly re-
ducing the trainable parameters. Zhong et al. [38] found
that convolution operation effectively introduces local prior
information for image segmentation. Additionally, Visual
Prompt Tuning (VPT) [39] augments inputs with extra learnable
tokens, which can be regarded as learnable pixels for
vision transformers. Neural Prompt Search (NOAH) [40]
goes further by incorporating Adapter, LoRA, and
VPT into vision transformers and optimizing their design
through a neural architecture search algorithm. Scale and
Shift Feature Modulation (SSF) [41] introduces scale and
shift parameters to modulate visual features during training,
achieving performance comparable to full fine-tuning.
Convpass [42] integrates convolutional bypasses
into large ViT models, leveraging local priors to improve
image classification performance. However, the application
of PEFT methods to face forgery detection remains largely
unexplored. In this work, we incorporate dedicated forgery
Adapter and LoRA layers to mine global and local clues and
achieve parameter-efficient face forgery detection.
2.3 Mixture of Experts (MoE)
Mixture-of-Experts (MoE) [43] models aim to augment
the model’s capacity without increasing computational ex-
penses. MoE comprises multiple sub-experts and incorpo-
rates a gating mechanism to dynamically select the most
relevant Top-k experts, thereby optimizing the results. The
concept of MoE has been widely used in both Computer
Vision [44], [45], [46], [47] and Natural Language Process-
ing [48], [49], [50]. Sparse MoE [48] introduces a router
to select a subset of experts, ensuring that the inference
time is on par with the standalone counterpart. Subsequent
methods such as [51], [52], [53] seek to design novel gating
mechanisms to enhance the performance on specific tasks.
Follow-up works [51], [54], [55] propose leveraging multi-
task learning to guide the model to select optimal experts
for a given input query. Moreover, some studies apply
MoE architectures to domain adaptation [56] and domain
generalization [57] tasks. Previous works usually adopt
Feed-Forward Networks (FFNs) as the experts [44],
[48], [58], [59], [60], [61]. However, FFNs
still consume significant memory and computational
resources during training and inference. In our work, we
use MoE
to automate the selection of different LoRA and
Adapter modules for various test data. The use of MoE
introduces negligible computational overhead, facilitating
efficient Deepfake detection. Additionally, the designed MoE
modules intelligently select optimal experts, significantly
outperforming previous detection methods.
3 METHODOLOGY
Herein, we first briefly review the basic preliminaries
on Vision Transformer (ViT), Low Rank Adaptation (LoRA),
and Convpass Adapter. Then we introduce the overview
of our designed face forgery detection framework. Finally,
we delve into the details of the designed architectures and
objective functions of our method.
3.1 Preliminaries
3.1.1 Vision Transformer
Vision Transformer consists of multiple blocks of Self-
Attention and Multi-Layer Perceptron (MLP). For a given
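input sequence $x \in \mathbb{R}^{N_t \times D}$, $x$ is first projected to queries
$Q \in \mathbb{R}^{N_t \times \mathit{dim}}$, keys $K \in \mathbb{R}^{N_t \times \mathit{dim}}$, and values $V \in \mathbb{R}^{N_t \times \mathit{dim}}$
using three learnable matrices $W_q \in \mathbb{R}^{D \times \mathit{dim}}$, $W_k \in \mathbb{R}^{D \times \mathit{dim}}$,
and $W_v \in \mathbb{R}^{D \times \mathit{dim}}$, where $N_t$, $D$, and $\mathit{dim}$ denote the token
number, embedding dimension, and hidden dimension, respectively.
$Q$, $K$, and $V$ are calculated by:
$$Q = xW_q, \quad K = xW_k, \quad V = xW_v. \tag{1}$$
Then the Self-Attention is computed by:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(QK^{\top} / \sqrt{\mathit{dim}}\right)V. \tag{2}$$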
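As an illustration of Eqs. (1) and (2), the following is a minimal, dependency-free Python sketch of single-head self-attention on nested lists. The helper names (`matmul`, `softmax_rows`, `self_attention`) and the toy inputs are illustrative assumptions, not part of the released code; a practical ViT would use batched, multi-head tensor operations instead.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(u * v for u, v in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Apply a numerically stable softmax to every row of m."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def self_attention(x, w_q, w_k, w_v):
    """Eqs. (1)-(2): softmax(Q K^T / sqrt(dim)) V for a single head."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    dim = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    scores = [[s / math.sqrt(dim) for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)


# Toy example (assumed values): N_t = 2 identical tokens, D = dim = 2.
identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [1.0, 0.0]]
attended = self_attention(tokens, identity, identity, identity)
```

Because the two toy tokens are identical, every attention weight is 0.5 and each output row equals the shared token, which is a quick sanity check on the softmax normalization.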
3.1.2 Low Rank Adaptation (LoRA)
LoRA is a popular parameter-efficient tuning method.
For a pretrained ImageNet weight matrix $W \in \mathbb{R}^{D \times \mathit{dim}}$,
LoRA freezes $W$ during training while adding the product
of two trainable low-rank matrices $W_{down}W_{up}$, where
$W_{down} \in \mathbb{R}^{D \times r}$, $W_{up} \in \mathbb{R}^{r \times \mathit{dim}}$, and rank $r \ll \min(D, \mathit{dim})$.
As such, the forward pass is modified as:
$$h = x(W + \Delta W) = xW + xW_{down}W_{up}, \tag{3}$$
where $W$ can be the Query, Key, or Value matrix,
i.e., $W \in \{W_q, W_k, W_v\}$.
Compared with previous ViT-based fine-tuning methods, the
LoRA learning scheme significantly reduces the number of
trainable parameters.
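To make Eq. (3) concrete, the sketch below evaluates the LoRA forward pass on a toy example and counts trainable parameters. It is a plain-Python illustration under assumed values: the function names, the toy matrices, and the sizes D = dim = 768 and r = 8 used in the count are hypothetical choices for exposition, not settings reported in this paper.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(u * v for u, v in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def lora_forward(x, w, w_down, w_up):
    """Eq. (3): h = x W + (x W_down) W_up, with W kept frozen."""
    return add(matmul(x, w), matmul(matmul(x, w_down), w_up))


def lora_trainable_params(d, dim, r):
    """Trainable entries in W_down (d x r) plus W_up (r x dim)."""
    return r * (d + dim)


# Toy example (assumed values): D = 3, dim = 2, rank r = 1.
x = [[1.0, 2.0, 3.0]]
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # frozen pretrained weight
w_down = [[1.0], [0.0], [-1.0]]           # trainable, D x r
w_up = [[0.5, 2.0]]                       # trainable, r x dim

h = lora_forward(x, w, w_down, w_up)
# Folding the update into the weight gives the same result: x (W + W_down W_up).
h_merged = matmul(x, add(w, matmul(w_down, w_up)))

full_count = 768 * 768                          # one full projection matrix
lora_count = lora_trainable_params(768, 768, 8)  # its rank-8 LoRA update
```

With these assumed sizes, the low-rank update has 12,288 trainable entries versus 589,824 for the full matrix, which is the source of the parameter saving described above.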
3.1.3 Convpass Adapter
The Convpass Adapter integrates local priors for visual tasks by
constructing Convolutional Bypasses (Convpass) within the
Vision Transformer (ViT) framework as adaptation modules.
It can be formulated as:
$$x_{out} = \mathrm{MLP}(x_{in}) + \mathrm{Convpass}(x_{in}), \tag{4}$$
where the parameters in MLP are fixed at their ImageNet
weights during training. Convpass generally consists of several
trainable convolutional layers. Compared with the MLP
layer, the Convpass layer saves computational resources.
3.2 Overview of the Proposed Framework
Fig. 2 illustrates the designed MoE-FFD framework.
Fig. 2 (a) depicts the ViT backbone, which is initialized
with the ImageNet weights. The input face first undergoes
patch and position embedding. Then, the input tokens
are delivered to the designed transformer blocks. Fig. 2 (b)
presents more details regarding the designed transformer
blocks. We modify the standard ViT blocks by integrating
the designed MoE Adapter layer and MoE LoRA layer with
the original blocks.
Fig. 3. Illustration of the designed Convpass Adapter experts. (a) Vanilla
Convolution, (b) Angular Difference Convolution (ADC), (c) Central Difference
Convolution (CDC), (d) Radial Difference Convolution (RDC),
and (e) Second-Order Convolution (SOC).
Fig. 2 (c) details the designed MoE Adapter layer. During
the training process, the MLP parameters are frozen at
the ImageNet weights. Within this layer, a MoE module is
designed, comprising one gate $G_A(\cdot)$ and $M$ forgery Adapter
experts. The gating mechanism dynamically selects
the appropriate experts for each input query, while the
Adapter experts extract specific local forgery
features from the input faces. Fig. 2 (d) illustrates the structure
of Adapter $j$. We first reshape the input feature and
reduce its channel dimension using a 1×1 convolution.
Subsequently, we design convolution operations to extract
specific local forgery clues. The output feature is then passed
through another 1×1 convolution layer to restore the
channel dimension. Finally, the feature map is flattened to
the original shape.
Similarly, the attention weights $W_{q/k/v}$ in the MoE LoRA
layer are fixed at the ImageNet weights. As shown in Fig. 2
(e), the MoE LoRA module consists of one gate $G_L(\cdot)$ and $N$
experts, each with a unique rank. The gate $G_L(\cdot)$ selectively
activates sparse experts for detecting face forgeries. Fig. 2 (f)
depicts the LoRA structure, where ⊗ indicates matrix multiplication.
The output of each expert is derived by multiplying the
input with two learnable low-rank matrices. The resulting
$x_{out}$ is the element-wise sum of the output of the fixed weights and
the output of the learned MoE LoRA branch, which is further processed
by the self-attention mechanism.
The whole framework is trained in an end-to-end manner
under the supervision of the following objective function:
$$L = L_{ce} + \lambda \cdot L_{moe}, \tag{5}$$
where $L_{ce}$ represents the cross-entropy loss. Following [48],
[62], we apply an additional loss $L_{moe}$ to encourage all
experts to have equal importance. Further details on $L_{moe}$
are given at the end of this section.
3.3 Details of the MoE
3.3.1 MoE LoRA Layer.
LoRA modules with different ranks tend to project the
input tokens into various feature spaces. However, in un-
controlled deployment environments, it is challenging to
manually predefine an ideal rank for different testing faces.
To tackle this challenge, we design a MoE LoRA Layer that
learns an optimal LoRA expert for each input query. As
shown in Fig. 2 (e), each LoRA expert $E_L(\cdot)_i$ has its own
rank $r_i$. In this work, we design a gating mechanism $G_L(\cdot)$
to dynamically select the Top-k experts (default: k=1). The
details of the gating mechanism are elaborated in Section 3.3.3. As
such, the output tokens can be calculated by:
$$x_{out} = x_{in}W_{q/k/v} + \sum_{i=1}^{N} G_L(x_{in})_i \, E_L(x_{in})_i, \tag{6}$$
where $W_{q/k/v}$ is fixed at the ImageNet weight during the
training process. Each LoRA expert is formulated by:
$$E_L(x_{in})_i = x_{in} W^{down}_i W^{up}_i, \tag{7}$$
where $W^{up}_i \in \mathbb{R}^{r_i \times \mathit{dim}}$ and $W^{down}_i \in \mathbb{R}^{\mathit{dim} \times r_i}$ are two trainable
matrices.
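The MoE LoRA computation in Eqs. (6) and (7) can be sketched as follows. This is a simplified plain-Python illustration, not the released implementation: the function names, the two toy experts of rank 1 and rank 2, and the hand-written gate scores are assumptions, and the gate scores are taken as given (their computation is described in Section 3.3.3).

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(u * v for u, v in zip(row, col)) for col in zip(*b)] for row in a]


def moe_lora_forward(x, w_frozen, experts, gate_scores):
    """Frozen projection plus the gate-weighted sum of LoRA experts."""
    out = matmul(x, w_frozen)
    for score, (w_down, w_up) in zip(gate_scores, experts):
        if score == 0.0:
            continue  # experts outside the Top-k are never evaluated
        delta = matmul(matmul(x, w_down), w_up)  # Eq. (7)
        out = [[o + score * d for o, d in zip(ro, rd)] for ro, rd in zip(out, delta)]
    return out


# Toy setup (assumed values): one token with dim = 2 and two experts.
x = [[1.0, 2.0]]
w_frozen = [[1.0, 0.0], [0.0, 1.0]]
experts = [
    ([[1.0], [1.0]], [[1.0, 0.0]]),                       # rank r_1 = 1
    ([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]),  # rank r_2 = 2
]

out_first = moe_lora_forward(x, w_frozen, experts, [1.0, 0.0])   # Top-1 picks expert 1
out_second = moe_lora_forward(x, w_frozen, experts, [0.0, 1.0])  # Top-1 picks expert 2
```

With k=1, the softmax over a single retained score is exactly 1, so the selected expert contributes with full weight and the cost per token stays that of one LoRA branch regardless of how many experts exist.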
3.3.2 MoE Adapter Layer.
Previous studies [21], [66], [67] have demonstrated the
effectiveness of local difference convolution for capturing
face forgery clues. However, these CNN-based methods
each focus on only one specific forgery feature. In our
research, we introduce the MoE Adapter Layer, integrated
into the ViT backbone. This design aims to scale the model
capacity and facilitate the dynamic selection of a suitable
forgery Adapter expert. Additionally, the designed MoE
Adapter layer dynamically injects local forgery priors into
the plain ViT backbone.
As shown in Fig. 2 (c), the designed MoE Adapter layer
consists of $M$ Convpass Adapter experts. The output
$x_{out}$ can be formulated as:
$$x_{out} = \mathrm{MLP}(x_{in}) + \sum_{j=1}^{M} G_A(x_{in})_j \, E_A(x_{in})_j, \tag{8}$$
where the MLP is frozen at the ImageNet weights during training.
Adapter expert $E_A(\cdot)_j$ can be calculated by (we omit the
activation layer):
$$E_A(x_{in})_j = \mathrm{Conv}^{up}_{1\times1}\left(\mathrm{Conv}^{j}_{3\times3}\left(\mathrm{Conv}^{down}_{1\times1}(x_{in})\right)\right), \tag{9}$$
where $\mathrm{Conv}^{down}_{1\times1}$ and $\mathrm{Conv}^{up}_{1\times1}$ are two 1×1 convolution layers
that down-sample and up-sample the channels, respectively.
$\mathrm{Conv}^{j}_{3\times3}$ indicates the specific convolution layer in
different experts. We develop $M=5$ types of convolution.
These convolutions model different local interactions,
helping the model capture abundant local forgery
features from input faces.
Fig. 3 shows the five designed convolutions:
(a) Vanilla Convolution, (b) Angular Difference Convolution
(ADC), (c) Central Difference Convolution (CDC), (d) Radial
Difference Convolution (RDC), and (e) Second-Order Convolution
(SOC). The red arrow → indicates the subtraction
operation. We first formulate the vanilla convolution as (we
omit the bias for concision):
$$y = \sum_{p \in \Omega_1} w_p x_p, \tag{10}$$
Then, for the other four convolution types, Eq. (10) can be
rewritten as:
$$y = \sum_{p \in \Omega_j} w_p \hat{x}_p = w_c x_c + \sum_{p \in \Omega_j,\, p \neq c} w_p \hat{x}_p, \tag{11}$$
where $x_c$ and $w_c$ represent the center elements of the input
$x$ and weight $w$.
In ADC and CDC, we set the size of $\Omega$ to 3×3. The $\hat{x}_p$ in
CDC is calculated by $\hat{x}_p = x_p - x_c$, and the $\hat{x}_p$ in ADC
is given by $\hat{x}_p = x_p - x^{next}_p$, where $x^{next}_p$ is the next element
in the clockwise direction.
In the RDC and SOC operations, we set the size of
the $\Omega$ region to 5×5. For each element $x_p$, we define a
corresponding radial element $x^{R}_p$ in the peripheral region
of $\Omega$, which is highlighted in dark green in Fig. 3 (d) and (e).
The $\hat{x}_p$ in RDC is calculated by $\hat{x}_p = x^{R}_p - x_p$. SOC
aims at learning the second-order local anomaly of the input, so
$\hat{x}_p$ in SOC is formulated as $\hat{x}_p = (x^{R}_p - x_p) - (x_p - x_c) =
(x^{R}_p - x_p) + (x_c - x_p)$. In this way, the designed MoE Adapter
layer can effectively search for intrinsic, detailed local forgery
patterns in a larger feature space.
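The 3×3 difference convolutions (CDC and ADC) can be illustrated on a single patch as below. This is a minimal sketch under stated assumptions: it evaluates one output value following the second form of Eq. (11), and the clockwise neighbour ordering in `RING` (starting at the top-left corner) is our assumption about the layout in Fig. 3. RDC and SOC are omitted because their radial-element mapping is defined only in the figure.

```python
# Assumed clockwise ordering of the 8 neighbours around the centre (1, 1).
RING = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]


def vanilla_conv(patch, weight):
    """Eq. (10): plain weighted sum over the 3x3 neighbourhood."""
    return sum(weight[i][j] * patch[i][j] for i in range(3) for j in range(3))


def central_difference_conv(patch, weight):
    """CDC: each neighbour is replaced by x_p - x_c; the centre term is w_c * x_c."""
    x_c = patch[1][1]
    y = weight[1][1] * x_c
    for i, j in RING:
        y += weight[i][j] * (patch[i][j] - x_c)
    return y


def angular_difference_conv(patch, weight):
    """ADC: each neighbour is replaced by x_p minus its clockwise successor."""
    y = weight[1][1] * patch[1][1]
    for idx, (i, j) in enumerate(RING):
        ni, nj = RING[(idx + 1) % len(RING)]
        y += weight[i][j] * (patch[i][j] - patch[ni][nj])
    return y


ones = [[1.0] * 3 for _ in range(3)]
flat = [[5.0] * 3 for _ in range(3)]                          # textureless patch
spiral = [[1.0, 2.0, 3.0], [8.0, 9.0, 4.0], [7.0, 6.0, 5.0]]  # varying patch

flat_responses = (
    vanilla_conv(flat, ones),
    central_difference_conv(flat, ones),
    angular_difference_conv(flat, ones),
)
spiral_responses = (
    vanilla_conv(spiral, ones),
    central_difference_conv(spiral, ones),
    angular_difference_conv(spiral, ones),
)
```

On the flat patch, all neighbour differences vanish and only the centre term survives for CDC and ADC, whereas the vanilla convolution still accumulates the raw intensities. This is why difference convolutions emphasize local variations rather than absolute pixel values.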
3.3.3 Gating Network.
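We adopt Top-k noisy gating [48] as our gating mechanism.
The gating scores $G(x) \in \mathbb{R}^{N_e}$ are determined by the
values $H(x) \in \mathbb{R}^{N_e}$, where $N_e$ indicates the expert number.
For a given input $x \in \mathbb{R}^{N_t \times \mathit{dim}}$, we first apply average
pooling and reshape it to $x_m \in \mathbb{R}^{\mathit{dim}}$. Then, we calculate
$H(x)$ by:
$$H(x) = x_m \otimes W_{gate} + \mathrm{StandardNormal}() \cdot \mathrm{Softplus}(x_m \otimes W_{noise}), \tag{12}$$
where $W_{gate} \in \mathbb{R}^{\mathit{dim} \times N_e}$ and $W_{noise} \in \mathbb{R}^{\mathit{dim} \times N_e}$ are slim
trainable parameters. Softplus is an activation function. For
a given $H(x) \in \mathbb{R}^{N_e}$, we keep the Top-k ($k \le N_e$) values while
setting the others to $-\infty$. Then, the gating scores $G(x)$ can be
calculated by:
$$G(x) = \mathrm{Softmax}(\mathrm{Topk}(H(x), k)). \tag{13}$$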
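A minimal plain-Python sketch of the gating in Eqs. (12) and (13) is given below. The function names and toy weights are illustrative assumptions. In particular, passing `rng=None` to disable the noise term is our own convenience for obtaining deterministic scores and is not something specified in the text.

```python
import math
import random


def mean_pool(tokens):
    """Average an N_t x dim token matrix over the token axis."""
    return [sum(col) / len(tokens) for col in zip(*tokens)]


def project(x_m, w):
    """Compute x_m W for a length-dim vector and a dim x N_e matrix."""
    return [sum(x * row[e] for x, row in zip(x_m, w)) for e in range(len(w[0]))]


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def noisy_topk_gate(x_m, w_gate, w_noise, k, rng=None):
    """Eq. (12)-(13): noisy logits, keep the Top-k, softmax over the kept ones."""
    h = project(x_m, w_gate)
    if rng is not None:  # assumption: the noise term is skipped when rng is None
        scale = [softplus(z) for z in project(x_m, w_noise)]
        h = [v + rng.gauss(0.0, 1.0) * s for v, s in zip(h, scale)]
    top = sorted(range(len(h)), key=lambda e: h[e], reverse=True)[:k]
    peak = max(h[e] for e in top)
    exps = {e: math.exp(h[e] - peak) for e in top}
    total = sum(exps.values())
    # Logits set to -inf receive exactly zero probability under the softmax.
    return [exps[e] / total if e in exps else 0.0 for e in range(len(h))]


# Toy example (assumed values): dim = 2 and N_e = 3 experts.
x_m = mean_pool([[1.0, 2.0], [1.0, 2.0]])
w_gate = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.0]]
w_noise = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

gates_top1 = noisy_topk_gate(x_m, w_gate, w_noise, k=1)
gates_top2 = noisy_topk_gate(x_m, w_gate, w_noise, k=2)
gates_noisy = noisy_topk_gate(x_m, w_gate, w_noise, k=2, rng=random.Random(0))
```

Without noise, the toy logits are (1, 2, 0.5), so Top-1 gating puts all the weight on the second expert, and Top-2 gating splits the weight between the first two experts while the third receives exactly zero.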
To prevent the gating network from converging to a state
where it produces large weights for the same few experts,
we further apply a soft constraint on the batch-wise average
of each gate. As such, for a given batch of data $X$, the MoE
loss $L_{moe}$ is defined as:
$$L_{moe} = CV(\mathit{Importance}(X))^2; \quad \mathit{Importance}(X) = \sum_{x \in X} G(x), \tag{14}$$
TABLE 1
Cross-dataset evaluation on five unseen datasets. ‘*’ indicates the trained model provided by the authors. ‘†’ indicates our re-implementation using
the public official code. Video-level results (highlighted in blue in the original layout) are marked in the Level column. #Params indicates the activated parameter number during training.
| Level | Method | Venue | #Params | CDF AUC | CDF EER | WDF AUC | WDF EER | DFDC-P AUC | DFDC-P EER | DFD AUC | DFD EER | DFR AUC | DFR EER | Average AUC | Average EER |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Frame | Face X-ray [4] | CVPR20 | 41.97M | 74.20 | - | - | - | 70.00 | - | 85.60 | - | - | - | - | - |
| Frame | GFF [11] | CVPR21 | 53.25M | 75.31 | 32.48 | 66.51 | 41.52 | 71.58 | 34.77 | 85.51 | 25.64 | - | - | - | - |
| Frame | LTW [63] | AAAI21 | 20.37M | 77.14 | 29.34 | 67.12 | 39.22 | 74.58 | 33.81 | 88.56 | 20.57 | - | - | - | - |
| Frame | F2Trans-S [26] | TIFS23 | 117.52M | 80.72 | - | - | - | 71.71 | - | - | - | - | - | - | - |
| Frame | ViT-B [19] | ICLR21 | 85.80M | 72.35 | 34.50 | 75.29 | 33.40 | 75.58 | 32.11 | 79.61 | 28.85 | 80.47 | 26.97 | 76.66 | 31.17 |
| Frame | SBI* [31] | CVPR22 | 19.34M | 81.33 | 26.94 | 67.22 | 38.85 | 79.87 | 28.26 | 77.37 | 30.18 | 84.90 | 23.13 | 78.14 | 29.47 |
| Frame | DCL* [32] | AAAI22 | 19.35M | 81.05 | 26.76 | 72.95 | 35.73 | 71.49 | 35.90 | 89.20 | 19.46 | 92.26 | 14.81 | 81.39 | 26.53 |
| Frame | Xception† [9] | ICCV19 | 20.81M | 64.14 | 39.77 | 68.90 | 38.67 | 69.56 | 36.94 | 84.31 | 25.00 | 91.93 | 15.52 | 75.77 | 31.18 |
| Frame | RECCE† [33] | CVPR22 | 25.83M | 61.42 | 41.71 | 74.38 | 32.64 | 64.08 | 40.04 | 83.35 | 24.57 | 92.93 | 14.74 | 75.23 | 30.73 |
| Frame | EN-B4† [10] | ICML19 | 19.34M | 65.24 | 39.41 | 67.89 | 37.21 | 67.96 | 37.60 | 88.67 | 18.46 | 92.18 | 15.51 | 76.39 | 29.64 |
| Frame | CFM* [6] | TIFS24 | 25.37M | 82.78 | 24.74 | 78.39 | 30.79 | 75.82 | 31.67 | 91.47 | 16.80 | 95.18 | 11.87 | 83.94 | 24.20 |
| Frame | MoE-FFD | Ours | 15.51M | 86.69 | 22.06 | 80.64 | 27.11 | 80.83 | 26.67 | 90.37 | 17.67 | 95.39 | 11.29 | 86.78 | 20.96 |
| Video | Lisiam [64] | TIFS22 | - | 78.21 | - | - | - | - | - | - | - | - | - | - | - |
| Video | F3Net [18] | ECCV20 | 42.53M | 68.69 | - | - | - | 67.45 | - | - | - | - | - | - | - |
| Video | FTCN [65] | ICCV21 | 26.60M | 86.90 | - | - | - | 74.00 | - | - | - | - | - | - | - |
| Video | ViT-B [19] | ICLR21 | 85.80M | 78.12 | 30.59 | 78.95 | 29.59 | 79.43 | 28.62 | 84.52 | 23.48 | 86.42 | 19.90 | 81.49 | 26.44 |
| Video | SBI* [31] | CVPR22 | 19.34M | 88.61 | 19.41 | 70.27 | 37.63 | 84.80 | 25.00 | 82.68 | 26.72 | 90.04 | 17.53 | 83.28 | 25.26 |
| Video | DCL* [32] | AAAI22 | 19.35M | 88.24 | 19.12 | 76.87 | 31.44 | 77.57 | 29.55 | 93.91 | 14.40 | 97.41 | 9.96 | 86.80 | 20.89 |
| Video | RECCE† [33] | CVPR22 | 25.83M | 69.25 | 34.38 | 76.99 | 30.49 | 66.90 | 39.39 | 86.87 | 21.55 | 97.15 | 9.29 | 79.42 | 27.01 |
| Video | CFM* [6] | TIFS24 | 25.37M | 89.65 | 17.65 | 82.27 | 26.80 | 80.22 | 27.48 | 95.21 | 11.98 | 97.59 | 9.04 | 88.99 | 18.59 |
| Video | MoE-FFD | Ours | 15.51M | 91.28 | 17.15 | 83.91 | 24.75 | 84.97 | 23.44 | 93.57 | 14.05 | 98.52 | 5.47 | 90.45 | 16.97 |
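where $CV(\cdot)$ indicates the Coefficient of Variation, i.e., the standard deviation divided by the mean:
$$CV(\mathit{Importance}(X)) = \frac{\mathrm{Std}[\mathit{Importance}(X)]}{\mathrm{Mean}[\mathit{Importance}(X)]}. \tag{15}$$
As such, the additional loss $L_{moe}$ encourages all experts to
have equal importance.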
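The load-balancing loss can be sketched in a few lines of plain Python. The sketch uses the standard definition of the coefficient of variation (standard deviation divided by mean). The function names, the use of the population standard deviation, and the small `eps` guard against division by zero are our assumptions, not details given in the text.

```python
def importance(batch_gates):
    """Sum the gating scores G(x) over the batch, giving one value per expert."""
    n_experts = len(batch_gates[0])
    return [sum(gates[e] for gates in batch_gates) for e in range(n_experts)]


def cv_squared(values, eps=1e-10):
    """Squared coefficient of variation: variance / mean^2 (population variance)."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return variance / (mean * mean + eps)


def moe_loss(batch_gates):
    """L_moe = CV(Importance(X))^2 for a batch of gating vectors."""
    return cv_squared(importance(batch_gates))


# Toy batches (assumed values) with two experts and Top-1 gating.
balanced = [[1.0, 0.0], [0.0, 1.0]]   # each expert is selected once
collapsed = [[1.0, 0.0], [1.0, 0.0]]  # the same expert is always selected

loss_balanced = moe_loss(balanced)
loss_collapsed = moe_loss(collapsed)
```

The balanced batch gives a loss of zero, while the collapsed batch gives a loss close to one, so minimizing this term pushes the gate toward using all experts equally often.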
4 EXPERIMENTS
In this section, we evaluate our model in terms of gen-
eralizability, robustness, and parameter-efficiency under a
wide variety of experimental settings. We conduct extensive
ablation studies to demonstrate the effectiveness of the
designed network architecture and the adopted training
strategy.
4.1 Implementation Details
We apply the popular MTCNN face detector [68] to crop
the face regions. The proposed framework is implemented
on the PyTorch [69] platform. The model is trained using the
Adam optimizer [70] with $\beta_1 = 0.9$ and $\beta_2 = 0.999$. We
set the learning rates of the gating network and the other trainable
parameters to 1e-4 and 3e-5, respectively. The loss weight $\lambda$
in Eq. (5) is set to 1. We train the model for 20 epochs on a
single 3090 GPU with a batch size of 32. We follow the official
dataset split strategy in [6] for fair comparison.
4.2 Datasets and Evaluation Metrics
The experiments encompass the following six datasets
for training and testing: FaceForensics++ (FF++) [72],
CelebDF-v2 (CDF) [27], WildDeepfake (WDF) [73], DeepFake
Detection Challenge Preview (DFDC-P) [74],
DeepFakeDetection (DFD) [75], and DeeperForensics-1.0 (DFR)
[76]. FF++ is a widely used dataset in face forgery detection,
which includes four face manipulation types: Deepfakes
(DF) [77], Face2Face (FF) [78], FaceSwap (FS) [79], and
NeuralTextures (NT) [80]. The remaining five datasets are
more recent Deepfake datasets with diverse environment
variables and various video qualities. We perform cross-dataset
and cross-manipulation evaluations to examine the
models’ generalization capability. Additionally, we measure
the model’s robustness by applying it to perturbed face
images with various perturbation types and severity levels.
Consistent with prior work, we adopt Area Under the ROC
Curve (AUC) and Equal Error Rate (EER) as evaluation
metrics. AUC measures the area under the Receiver Operating
Characteristic (ROC) curve, while EER denotes the
error rate at the operating point where the False Positive Rate
(FPR) equals the False Negative Rate (FNR), i.e., where
FPR = 1 − TPR for True Positive Rate (TPR).
TABLE 2
Cross-manipulation detection AUC on the unseen manipulation
technique (FF++ C23 dataset).
| Method | #Params | DF | FF | FS | NT | AVG |
|---|---|---|---|---|---|---|
| Xception [9] | 20.81M | 0.907 | 0.753 | 0.460 | 0.744 | 0.716 |
| EN-B4 [10] | 19.34M | 0.485 | 0.556 | 0.517 | 0.493 | 0.513 |
| AT EN-B4 [10] | 19.34M | 0.911 | 0.801 | 0.543 | 0.774 | 0.757 |
| FL EN-B4 [10] | 19.34M | 0.903 | 0.798 | 0.503 | 0.759 | 0.741 |
| MLDG [71] | 62.38M | 0.918 | 0.771 | 0.609 | 0.780 | 0.770 |
| LTW [63] | 20.37M | 0.927 | 0.802 | 0.640 | 0.773 | 0.786 |
| ViT-B [19] | 85.80M | 0.771 | 0.656 | 0.510 | 0.554 | 0.623 |
| CFM [6] | 25.37M | 0.880 | 0.814 | 0.630 | 0.643 | 0.742 |
| MoE-FFD | 15.51M | 0.947 | 0.877 | 0.647 | 0.759 | 0.808 |
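For reference, the two evaluation metrics of this subsection can be computed as in the following plain-Python sketch. It assumes that label 1 denotes a fake sample, that higher scores mean "more likely fake", and it takes the EER at the threshold where FPR and FNR (that is, 1 − TPR) are closest, which is the usual definition. The function names and toy scores are illustrative; evaluation code in practice typically relies on a library routine instead.

```python
def split_scores(labels, scores):
    """Separate the scores of fake (label 1) and real (label 0) samples."""
    fake = [s for y, s in zip(labels, scores) if y == 1]
    real = [s for y, s in zip(labels, scores) if y == 0]
    return fake, real


def auc(labels, scores):
    """Probability that a random fake scores above a random real (ties count 0.5).

    This pairwise statistic equals the area under the ROC curve.
    """
    fake, real = split_scores(labels, scores)
    wins = sum(1.0 if f > r else 0.5 if f == r else 0.0 for f in fake for r in real)
    return wins / (len(fake) * len(real))


def eer(labels, scores):
    """Error rate at the threshold where FPR and FNR are closest."""
    fake, real = split_scores(labels, scores)
    best_gap, best_rate = None, None
    for t in sorted(set(scores)) + [float("inf")]:
        fpr = sum(s >= t for s in real) / len(real)  # reals flagged as fake
        fnr = sum(s < t for s in fake) / len(fake)   # fakes that are missed
        gap = abs(fpr - fnr)
        if best_gap is None or gap < best_gap:
            best_gap, best_rate = gap, (fpr + fnr) / 2
    return best_rate


# Toy scores (assumed values): two real and two fake samples.
labels = [0, 0, 1, 1]
overlapping = [0.1, 0.4, 0.35, 0.8]
separable = [0.1, 0.2, 0.8, 0.9]

auc_overlapping, eer_overlapping = auc(labels, overlapping), eer(labels, overlapping)
auc_separable, eer_separable = auc(labels, separable), eer(labels, separable)
```

A higher AUC and a lower EER both indicate better detection; a perfectly separable score distribution yields an AUC of 1 and an EER of 0.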
4.3 Comparison with State-of-the-Arts
4.3.1 Cross-Dataset Evaluations.
In this subsection, we perform cross-dataset evaluations
to evaluate the models’ generalization capability. In practical
scenarios, the trained models are vulnerable to unseen domains,
leading to significant performance degradations. We
train all face forgery detectors on the FF++ (C23) dataset and
directly evaluate them on five unseen Deepfake datasets:
CDF, WDF, DFDC-P, DFD, and DFR. Table 1 presents
the frame-level and video-level detection results, divided
into top and bottom sections. Video-level detection results,
highlighted in blue, are obtained by averaging all
frame scores within each video. In the original table, the best
results are in bold and the second-best results are underlined.
Notably, MoE-FFD
achieves the highest detection performance on four datasets
in both frame-level and video-level detection. Additionally,
MoE-FFD’s performance is on par with CFM on the DFD
dataset. Moreover, MoE-FFD outperforms CFM by a clear
margin in average performance: the average frame-level
AUC rises from 83.94% to 86.78%, an improvement of
2.84 percentage points.
Similarly, MoE-FFD also outperforms the other methods in
video-level detection on average. These generalizability
improvements over existing methods support
the effectiveness of the proposed approach.
TABLE 3
Cross-manipulation detection AUC on the remaining three unseen
manipulation techniques (FF++ C23 dataset).
| Train | Methods | DF | FF | FS | NT | AVG∗ |
|---|---|---|---|---|---|---|
| DF | ViT [19] | 99.28 | 59.87 | 49.91 | 62.38 | 57.39 |
| DF | RECCE [33] | 99.95 | 69.75 | 54.72 | 77.15 | 67.21 |
| DF | Ours | 99.80 | 73.46 | 52.15 | 77.45 | 67.69 |
| FF | ViT [19] | 74.72 | 99.21 | 57.19 | 56.38 | 62.76 |
| FF | RECCE [33] | 71.55 | 99.20 | 50.02 | 72.27 | 64.61 |
| FF | Ours | 86.67 | 99.43 | 66.92 | 68.75 | 74.11 |
| FS | ViT [19] | 78.59 | 61.62 | 99.44 | 46.69 | 62.30 |
| FS | RECCE [33] | 63.05 | 66.21 | 99.72 | 58.07 | 62.44 |
| FS | Ours | 79.89 | 71.45 | 99.56 | 48.33 | 66.56 |
| NT | ViT [19] | 78.46 | 68.31 | 45.07 | 97.19 | 63.95 |
| NT | RECCE [33] | 72.37 | 64.69 | 51.61 | 99.59 | 62.89 |
| NT | Ours | 80.02 | 73.02 | 51.94 | 98.70 | 68.33 |
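The video-level scores in Table 1 are obtained by averaging frame scores within each video. A minimal Python sketch of this aggregation is shown below; the function name, the video identifiers, and the toy scores are hypothetical.

```python
from collections import defaultdict


def video_level_scores(frame_video_ids, frame_scores):
    """Average the per-frame fake scores of each video."""
    buckets = defaultdict(list)
    for video_id, score in zip(frame_video_ids, frame_scores):
        buckets[video_id].append(score)
    return {video_id: sum(s) / len(s) for video_id, s in buckets.items()}


# Toy example (assumed values): two frames from video "a", one from video "b".
video_scores = video_level_scores(["a", "a", "b"], [0.25, 0.75, 0.9])
```

The resulting per-video scores can then be fed to the same AUC and EER computation used for frame-level evaluation, with one label per video.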
4.3.2 Cross-Manipulation Evaluations.
Existing face forgery detectors often struggle to handle
emerging manipulation techniques. With the rapid devel-
opment of AIGC, more sophisticated manipulation tech-
niques continue to emerge, posing challenges for existing
detection models. Ensuring model generalizability to un-
seen forgeries is crucial for real-world applications. In this
study, we conduct cross-manipulation experiments involv-
ing four forgery techniques: Deepfakes (DF), Face2Face (FF),
FaceSwap (FS), and NeuralTextures (NT). Table 2 presents
the results, where models trained on three manipulation
types are tested on the remaining one. MoE-FFD exhibits
the best average detection results. Compared to the ViT-B
baseline, the proposed method improves the average AUC by
18.5 percentage points, from 62.3% to 80.8%. In Table 3,
we further examine models trained on one manipulation
type and tested across the other three. AVG∗denotes the
average AUC score across three cross-manipulation trials.
MoE-FFD outperforms the baseline in both intra- and cross-
manipulation evaluations. Remarkably, MoE-FFD achieves
around 99% AUC scores in each intra-manipulation eval-
uation, highlighting its outstanding detection accuracy in
intra-domain evaluations. When compared to the state-of-
the-art RECCE method, our MoE-FFD consistently achieves
superior average cross-manipulation results across all four
evaluation settings. These findings suggest that MoE-FFD
generalizes well to unseen manipulations.

TABLE 4
Ablation experiment on the designed MoE modules.
ViT-B   MoE LoRA   MoE Adapter   CDF             DFDC-P          WDF
                                 AUC     EER     AUC     EER     AUC     EER
✓       -          -             72.35   34.50   75.58   32.11   75.29   33.40
✓       ✓          -             84.84   23.51   79.62   28.13   79.73   28.62
✓       -          ✓             83.21   25.34   76.35   30.90   77.15   30.93
✓       ✓          ✓             86.69   22.06   80.83   26.67   80.64   27.11

TABLE 5
Cross-dataset detection performance on different ViT backbones.
Backbone    MoE-FFD   #Params    CDF             DFDC-P          WDF
                                 AUC     EER     AUC     EER     AUC     EER
ViT-Tiny    -         5.52M      66.41   38.53   71.92   34.23   69.71   37.23
ViT-Tiny    ✓         3.90M      76.56   31.13   73.61   32.61   75.40   31.50
ViT-Small   -         21.67M     70.03   35.71   72.19   34.02   71.67   35.66
ViT-Small   ✓         7.73M      81.22   27.58   78.08   29.22   77.83   30.23
ViT-Large   -         303.30M    73.13   33.28   74.46   32.25   72.96   34.45
ViT-Large   ✓         41.34M     86.21   22.11   77.51   29.45   80.00   28.33
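
As a worked check of the AVG∗ column in Table 3, the cross-manipulation average for a model trained on one manipulation type is the mean AUC over the three unseen types, with the intra-manipulation score excluded. The sketch below recomputes AVG∗ for the "Ours" rows; the helper name and data layout are illustrative assumptions, while the numbers are copied from Table 3.

```python
# AUC rows for "Ours" in Table 3, keyed by the training manipulation.
COLUMNS = ["DF", "FF", "FS", "NT"]
OURS = {
    "DF": [99.80, 73.46, 52.15, 77.45],
    "FF": [86.67, 99.43, 66.92, 68.75],
    "FS": [79.89, 71.45, 99.56, 48.33],
    "NT": [80.02, 73.02, 51.94, 98.70],
}


def cross_manipulation_avg(train, row):
    """Mean AUC over the manipulation types not seen during training."""
    unseen = [score for name, score in zip(COLUMNS, row) if name != train]
    return sum(unseen) / len(unseen)


avg_star = {train: round(cross_manipulation_avg(train, row), 2) for train, row in OURS.items()}
print(avg_star)
```

The recomputed values match the AVG∗ entries reported for MoE-FFD in Table 3.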
4.3.3 Robustness to Real-World Perturbations.
Images and videos transmitted online often undergo
various perturbations that erase forgery cues within the
image/video contents [81]. As such, the detection performance
of existing models drops significantly on distorted data. In
this work, we apply four common perturbations to measure
the model’s robustness: Gaussian blurring, pink
noise, white noise, and block-wise distortion. Each perturbation type
is applied at five severity levels to mimic diverse real-world
conditions. Note that we do not apply any data
augmentation during training, so the tested perturbations
are entirely unseen by our model. As shown in Fig. 4,
the detectors’ AUC consistently deteriorates
as the severity level increases. However, our
proposed MoE-FFD exhibits significantly greater resilience
to most perturbations compared to previous methods. This
indicates that the proposed MoE-FFD effectively captures
the inherent forgery clues. Notably, MoE-FFD achieves sub-
stantial improvements in robustness compared to the ViT-
B baseline, demonstrating that the designed MoE learning
scheme and the PEFT modules greatly enhance the model’s
robustness against image perturbations.
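To make the severity-level protocol concrete, the following sketch applies two of the four perturbations (white noise and block-wise distortion) to a toy grayscale image at severity levels 1 to 5. It is only an illustration: the severity schedules (noise standard deviation of 0.02 per level, one 4×4 block per level) are hypothetical and are not the settings used in the experiments, and Gaussian blur and pink noise are omitted for brevity.

```python
import random


def white_noise(img, severity, rng):
    """Add zero-mean Gaussian noise whose standard deviation grows with severity."""
    sigma = 0.02 * severity  # hypothetical schedule, not taken from the paper
    return [[min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row] for row in img]


def block_wise(img, severity, rng, block=4):
    """Overwrite `severity` randomly placed block x block patches with a constant value."""
    out = [row[:] for row in img]
    height, width = len(out), len(out[0])
    for _ in range(severity):
        top = rng.randrange(height - block + 1)
        left = rng.randrange(width - block + 1)
        value = rng.random()
        for r in range(top, top + block):
            out[r][left:left + block] = [value] * block
    return out


rng = random.Random(0)
clean = [[0.5] * 16 for _ in range(16)]  # toy 16x16 grayscale image in [0, 1]
perturbed = {
    (name, severity): fn(clean, severity, rng)
    for name, fn in (("white_noise", white_noise), ("block_wise", block_wise))
    for severity in range(1, 6)
}
```

Each perturbed copy would then be scored by the detector, and the AUC plotted against the severity level as in Fig. 4.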
4.3.4 Discussion.
MoE-FFD achieves superior generalizability and robust-
ness compared to previous methods, which can be at-
tributed to the following designs: (1) MoE-FFD only up-
dates external modules while preserving the abundant Im-
ageNet knowledge, enabling the model to adaptively learn
forgery-specific features; (2) MoE-FFD integrates the LoRA
and Convpass Adapter with the ViT backbone, effectively
leveraging the expressivity of transformers and the local
forgery priors; (3) The incorporation of MoE modules fa-
cilitates optimal selection of LoRA and Adapter experts for
forgery feature mining. Furthermore, MoE-FFD presents a
parameter-efficient approach to face forgery detection due
to the utilized PEFT strategy. To validate the effectiveness of
the designed components, we perform ablation experiments
next.
Fig. 4. Robustness to various common perturbations at five severity levels: Gaussian blur, pink noise, white noise, and block-wise distortion.
4.4 Ablation Experiments
4.4.1 Impacts of the PEFT Layers.
The proposed MoE-FFD integrates one LoRA layer and
one Adapter layer with each ViT block. The LoRA layer
is designed to learn forgery-specific parameters for sub-
sequent attention mechanisms, capturing long-range inter-
actions within the input faces. Meanwhile, the Convpass
Adapter introduces local forgery priors into the plain ViT
model. To study the effectiveness of the designed PEFT
modules, we report the cross-dataset detection results in
Table 4. Utilizing MoE LoRA and MoE Adapter significantly
improves the model generalizability across all datasets com-
pared to the vanilla ViT-B backbone. This enhancement can
be attributed to the LoRA and Adapter layers effectively
retaining the ImageNet knowledge while adaptively learn-
ing forgery features. Furthermore, the combination of LoRA
and Adapter layers allows the model to capture both long-
range interactions and local forgery cues, further enhancing
its face forgery detection performance.
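To give a sense of why LoRA layers are lightweight, the sketch below counts the trainable parameters of a standard low-rank update $BA$ (with $A \in \mathbb{R}^{r \times d}$ and $B \in \mathbb{R}^{d \times r}$) applied to one frozen $d \times d$ projection, for the ranks studied in Table 6. It assumes the ViT-B hidden size $d = 768$ and a single square projection; which projections actually receive LoRA updates in MoE-FFD is not restated here, so the counts are illustrative only.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters of the low-rank update B @ A, with A: rank x d_in and B: d_out x rank."""
    return rank * (d_in + d_out)


D = 768  # hidden size of ViT-B
frozen = D * D  # parameters of one frozen D x D projection matrix
for rank in (8, 16, 32, 48, 64, 96, 128):
    trainable = lora_trainable_params(D, D, rank)
    print(f"rank={rank:3d}: {trainable:6d} trainable vs {frozen} frozen "
          f"({100 * trainable / frozen:.2f}%)")
```

Even at rank 128, the update holds one third as many parameters as the frozen projection it adapts, and at rank 8 only about 2%.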
4.4.2 Effectiveness on Other Backbones.
Flexibility is crucial in real-world applications to ad-
dress the complexities of practical scenarios. Deployment
devices with varying computational resources may necessi-
tate different detection models. Therefore, we integrate the
proposed MoE-FFD into ViT-Tiny, ViT-Small, and ViT-Large
models to assess the flexibility of our approach. MoE-FFD
can be readily inserted into other vision transformer backbones
in a plug-and-play manner. The cross-dataset results
are presented in Table 5. Compared with directly fine-tuning
the vanilla ViT backbones, MoE-FFD significantly reduces
the number of trainable parameters. Model generalizability
generally improves with model size, although the gains are not
strictly monotonic (e.g., with MoE-FFD, ViT-Small slightly exceeds
ViT-Large in DFDC-P AUC). Additionally,
significant performance improvements are observed across
different ViT backbones by using MoE-FFD. Specifically,
when we combine MoE-FFD with the ViT-Large backbone,
the model achieves AUC gains of 13.08, 3.05, and 7.04 percentage
points on the CDF, DFDC-P, and WDF datasets, respectively, while
training only ∼13.6% as many parameters as full fine-tuning.
This, in turn, demonstrates the flexibility
and effectiveness of our method.
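The figures quoted above can be recomputed directly from Table 5. The short sketch below derives the trainable-parameter ratio and the per-dataset AUC gains for each backbone; the dictionary layout and function name are illustrative, and it assumes that the #Params column reports the parameters updated during training in each setting.

```python
# (#Params in millions, CDF AUC, DFDC-P AUC, WDF AUC), copied from Table 5.
TABLE5 = {
    "ViT-Tiny": {"vanilla": (5.52, 66.41, 71.92, 69.71), "moe_ffd": (3.90, 76.56, 73.61, 75.40)},
    "ViT-Small": {"vanilla": (21.67, 70.03, 72.19, 71.67), "moe_ffd": (7.73, 81.22, 78.08, 77.83)},
    "ViT-Large": {"vanilla": (303.30, 73.13, 74.46, 72.96), "moe_ffd": (41.34, 86.21, 77.51, 80.00)},
}


def summarize(backbone):
    """Return (trainable-parameter ratio in %, AUC gains on CDF / DFDC-P / WDF)."""
    vanilla, ours = TABLE5[backbone]["vanilla"], TABLE5[backbone]["moe_ffd"]
    ratio = round(100 * ours[0] / vanilla[0], 1)
    gains = tuple(round(o - v, 2) for o, v in zip(ours[1:], vanilla[1:]))
    return ratio, gains


for name in TABLE5:
    ratio, gains = summarize(name)
    print(f"{name}: {ratio}% of the parameters, AUC gains {gains}")
```

The relative parameter saving grows with backbone size: the ratio is far smaller for ViT-Large than for ViT-Tiny.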
4.4.3 Impacts of the MoE Learning Scheme.

TABLE 6
Impacts of the matrix rank of LoRA layers and the effectiveness of MoE.
Setting    CDF             DFDC-P          WDF
           AUC     EER     AUC     EER     AUC     EER
rank=8     82.46   25.94   78.30   29.24   79.35   28.77
rank=16    82.16   26.01   77.85   29.74   79.15   29.37
rank=32    81.58   26.63   78.11   29.48   79.48   29.00
rank=48    83.10   25.06   79.06   29.45   79.50   28.92
rank=64    83.85   24.97   79.38   28.19   79.46   27.99
rank=96    83.50   24.86   79.53   27.69   78.28   29.14
rank=128   83.47   24.52   77.36   30.48   78.93   29.89
MoE        84.84   23.51   79.62   28.13   79.73   28.62

In this work, the proposed MoE is designed to dynamically
select the optimal Top-1 expert for face forgery
detection. In Table 6, we compare the cross-dataset detection
performance of the proposed MoE-FFD with individual
LoRA experts of varying ranks. Notably, different datasets
exhibit varying optimal LoRA ranks. The model reaches the
best AUC scores on CDF, DFDC-P , and WDF datasets with
the LoRA rank of 64, 96, and 48, respectively. MoE dynamically
selects the LoRA rank, allowing the model to search
for the optimal feature space for each input query. In Table 6,
MoE achieves a higher AUC than every individual LoRA expert on
all three datasets. Similarly, we investigate the impact of
different Adapters in Table 7. As introduced in Sec. 3.3.2,
the designed adapters tend to expose different local artifacts
of input faces. MoE searches for the best local feature
extractor for each input real/fake face, thereby achieving
the lowest EER on all three datasets and the highest AUC on
CDF and WDF compared to using a single Adapter.
It should be noted that the MoE approach only introduces
negligible additional activated parameters, demonstrating
that the performance improvements stem from the proposed
MoE learning scheme, instead of the model scaling up.
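The Top-1 selection discussed above can be sketched as a small routing function: a gate scores every expert, and only the highest-scoring expert is evaluated. This is a simplified sketch under stated assumptions, not the MoE-FFD implementation: the linear gate, the scaling of the expert output by its gate probability (in the style of Switch Transformers [61]), and the toy scaling "experts" are all choices made for illustration.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top1_moe(x, gate_weights, experts):
    """Route x to the single expert with the highest gate probability."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    probs = softmax(logits)
    k = max(range(len(probs)), key=probs.__getitem__)
    y = experts[k](x)  # only the selected expert is evaluated
    return [probs[k] * v for v in y], k, probs


calls = [0, 0, 0]  # how often each toy expert is executed


def make_expert(index, scale):
    def expert(x):
        calls[index] += 1
        return [scale * v for v in x]
    return expert


# Hypothetical experts and gate weights, chosen so that expert 0 wins for this input.
experts = [make_expert(i, s) for i, s in enumerate((1.0, 2.0, 3.0))]
gate_weights = [[2.0, 0.0], [0.0, 2.0], [-1.0, 0.0]]
output, chosen, probs = top1_moe([1.0, 0.0], gate_weights, experts)
print(chosen, calls)
```

Because the unselected experts are never called, the number of activated parameters per input stays close to that of a single expert, regardless of how many experts are available.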

Fig. 5. LoRA expert selection frequency on (a) CDF, (b) WDF, (c) DFDC-P, and (d) DFD datasets; Adapter expert selection frequency on (e) CDF, (f) WDF, (g) DFDC-P, and (h) DFD datasets.

TABLE 7
Impacts of different designed Adapters and the effectiveness of MoE.
Setting   CDF             DFDC-P          WDF
          AUC     EER     AUC     EER     AUC     EER
Conv      81.25   27.36   75.01   32.39   76.28   31.15
ADC       77.52   29.86   76.41   30.99   76.48   31.45
CDC       82.50   26.12   75.66   31.66   76.06   31.21
RDC       83.05   25.39   75.12   31.90   76.16   31.86
SDC       78.29   29.54   76.80   30.93   76.87   31.35
MoE       83.21   25.34   76.35   30.90   77.15   30.93

TABLE 8
MoE vs. Multi-Experts.
Method    Training Speed   Inference Speed   CDF             DFDC-P          WDF
                                             AUC     EER     AUC     EER     AUC     EER
Multi-E   1.67 Iter/s      7.18 Iter/s       85.04   23.74   80.16   27.78   78.67   30.07
MoE       2.13 Iter/s      8.82 Iter/s       86.69   22.06   80.83   26.67   80.64   27.11

In Fig. 5, we further investigate the expert selection
distributions on four datasets. Fig. 5 (a)-(d) show the LoRA
expert selection frequency on the CDF, WDF, DFDC-P, and DFD
datasets, while Fig. 5 (e)-(h) illustrate the Adapter expert
selection frequency. Different Deepfake datasets generally
exhibit significant domain gaps. From Fig. 5, we observe
distinct LoRA and Adapter expert selection distributions
among the four datasets. Our model dynamically projects
the features into a suitable space and extracts informative
local features tailored to each dataset. This observation
underscores MoE-FFD’s capability to select optimal experts
tailored to different data, offering valuable insights into its
adaptive nature.
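Selection-frequency plots such as those in Fig. 5 can be obtained by logging the index of the chosen expert for every input and normalizing the counts per dataset. The sketch below shows this bookkeeping; the dataset names and routing logs are hypothetical placeholders, not measurements from the paper.

```python
from collections import Counter


def selection_frequency(selected, num_experts):
    """Fraction of inputs routed to each expert."""
    counts = Counter(selected)
    total = len(selected)
    return [counts.get(i, 0) / total for i in range(num_experts)]


# Hypothetical routing logs: the index of the Top-1 expert chosen for each input.
routing = {
    "dataset_A": [0, 0, 1, 2, 0, 1],
    "dataset_B": [2, 2, 2, 1, 0, 2],
}
freq = {name: selection_frequency(indices, 3) for name, indices in routing.items()}
for name, values in freq.items():
    print(name, [round(v, 2) for v in values])
```

Distinct frequency vectors across datasets, as in this toy example, are what the figure interprets as dataset-specific expert preferences.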
4.4.4 MoE vs. Multi-Experts.
While the proposed MoE learning scheme exhibits su-
perior generalizability compared to using a single expert,
it is still unclear whether the performance gains stem
from the dynamic expert selection of MoE or simply from the
joint usage of multiple experts. To separate the two factors,
we compare MoE with Multi-Experts (Multi-E), which
aggregates the features of all designed
experts. The efficiency and the face forgery detection
performance of MoE and Multi-E are reported in Table 8.
Because the gating mechanism activates only a sparse subset of
experts, our MoE achieves a 1.28× speedup
in training and a 1.23× speedup in inference over Multi-E.
Beyond these efficiency gains, MoE also consistently
outperforms Multi-E in cross-dataset detection
performance. This suggests that naively aggregating multiple
experts may not be the optimal strategy for face forgery
detection. One potential explanation for this phenomenon
is that using all feature extractor experts could suppress the
most informative features while introducing noisy ones. The
results in Table 8 further underscore the effectiveness of the
proposed method in selecting the optimal expert for face
forgery detection.
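The reported speedups follow directly from the throughput figures in Table 8, as the short calculation below shows. The dictionary layout is illustrative; the numbers are copied from the table.

```python
# Throughput in iterations per second, copied from Table 8.
speeds = {
    "Multi-E": {"train": 1.67, "infer": 7.18},
    "MoE": {"train": 2.13, "infer": 8.82},
}

speedup = {
    phase: round(speeds["MoE"][phase] / speeds["Multi-E"][phase], 2)
    for phase in ("train", "infer")
}
print(speedup)
```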
4.4.5 Effectiveness of MoE Loss.
The gating network often exhibits a tendency to con-
sistently assign large weights to only a few experts [48],
resulting in overfitting problems. To address this issue, we
introduce an MoE loss component aimed at encouraging
equal importance among all experts. This regularization also
prevents the model from getting trapped in local optima.
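In this subsection, we examine the impact of the proposed
$L_{moe}$. Table 9 presents the cross-dataset evaluation
performance with different values of the $L_{moe}$ loss weight
$\lambda$ in Eq. (5), where $\lambda = 0$ means that no MoE loss is applied.
We observe that $L_{moe}$ effectively mitigates the
gate overfitting problem and consistently boosts model
generalizability: every nonzero $\lambda$ improves the AUC on all
three datasets. The model achieves the best overall
generalizability with $\lambda = 1$, which gives the highest AUC on
CDF and WDF and the lowest EER on all three datasets, while
$\lambda = 5$ yields a slightly higher AUC on DFDC-P (81.47% vs. 80.83%).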
These findings highlight the importance of encouraging
balanced expert contributions in the MoE framework. The $L_{moe}$
component encourages the learned representations to be
more diverse and generalizable across different datasets.
This is particularly crucial in practical applications where
data distributions can vary significantly. The best overall
performance at $\lambda = 1$ suggests that the balance between expert
utilization and regularization is vital.
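A balancing term of this kind can be sketched as follows. The example assumes the importance-based formulation of Shazeer et al. [48], in which the loss is the squared coefficient of variation of the per-expert importance (the gate probability summed over a batch), and assumes that the total objective has the form $L_{ce} + \lambda L_{moe}$. The exact definition used in MoE-FFD is the one given with Eq. (5), so the code below should be read as an approximation of the idea rather than the paper's loss.

```python
def importance(gate_probs):
    """Per-expert importance: gate probability summed over the batch."""
    return [sum(column) for column in zip(*gate_probs)]


def cv_squared(values, eps=1e-10):
    """Squared coefficient of variation (population variance / squared mean)."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return variance / (mean ** 2 + eps)


def total_loss(ce_loss, gate_probs, lam):
    """Assumed objective: cross-entropy plus the weighted balancing term."""
    return ce_loss + lam * cv_squared(importance(gate_probs))


# Toy gate outputs for a batch of 8 samples and 4 experts (hypothetical values).
balanced = [[0.25, 0.25, 0.25, 0.25] for _ in range(8)]
collapsed = [[0.97, 0.01, 0.01, 0.01] for _ in range(8)]

print(cv_squared(importance(balanced)))   # no penalty when experts are used equally
print(cv_squared(importance(collapsed)))  # large penalty when one expert dominates
```

With $\lambda = 0$ the term vanishes and the gate is free to collapse onto a few experts; larger $\lambda$ pushes the gate toward uniform usage, which is consistent with the trade-off observed in Table 9.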
4.5 Visualization Results
To better demonstrate the effectiveness of our MoE-FFD
method, we visualize the t-SNE feature distributions of the
baseline model (ViT-B) and MoE-FFD on the CDF and WDF
datasets. In Fig. 6, green and red marks represent real and
fake data samples, respectively. As shown in Fig. 6 (a) and
(c), the baseline model struggles to discriminate the real
faces from fake ones, leading to limited detection performance.
In contrast, the MoE-FFD feature distributions in
Fig. 6 (b) and (d) show that the real and fake faces are
more separable.

Fig. 6. t-SNE feature distributions of the ViT baseline and MoE-FFD. Visualization results of (a) ViT baseline on CDF dataset, (b) MoE-FFD on CDF dataset, (c) ViT baseline on WDF dataset, and (d) MoE-FFD on WDF dataset.

We further provide the Grad-CAM maps of the baseline
model and MoE-FFD in Fig. 7. Both models are trained on
the FF++ dataset and tested on six Deepfake datasets: FF++/DF,
CDF, DFDC-P, WDF, DFD, and DFR. While the
baseline model often neglects informative fake facial regions
or attends to irrelevant peripheral areas, MoE-FFD consistently
directs attention to the manipulated regions within
each input face. Despite the diverse environments captured
in these datasets, including conditions like poor illumination,
extreme head pose, and low resolution, MoE-FFD
accurately localizes the forgery regions containing abundant
forgery features. This observation further underscores the
generalizability of our method.

Fig. 7. Grad-CAM maps of the baseline model (ViT) and our proposed method MoE-FFD on six Deepfake datasets: FF++/DF, CDF, DFDC-P, WDF, DFD, and DFR.

TABLE 9
Effectiveness of loss components.
$L_{ce}$   $\lambda$   CDF             DFDC-P          WDF
                       AUC     EER     AUC     EER     AUC     EER
✓          0           82.65   25.57   78.14   29.06   76.29   30.43
✓          0.1         85.09   23.29   80.05   26.96   79.56   27.92
✓          1           86.69   22.06   80.83   26.67   80.64   27.11
✓          5           84.01   24.69   81.47   26.68   80.14   27.32
✓          10          83.35   25.53   80.46   28.13   78.93   29.05

5 CONCLUSIONS AND FUTURE WORKS
In this paper, we introduced MoE-FFD, a generalized
yet parameter-efficient method for detecting face forgeries.
By incorporating external lightweight LoRA and Adapter
layers with the frozen ViT backbone, our framework adaptively
acquired forgery-specific knowledge with minimal activated
parameters. This approach not only harnesses the expressiveness
of transformers but also capitalizes on the local
forgery priors with customized adapters, contributing to
enhanced detection performance. Through dynamic expert
selection within both LoRA and Adapter layers, our MoE
design further enhances the model’s generalizability and ro-
bustness. Extensive experiments consistently demonstrated
MoE-FFD’s superiority in face forgery detection across di-
verse datasets, manipulation types, and perturbation sce-
narios. Moreover, MoE-FFD serves as a parameter-efficient
detector and can seamlessly adapt to various ViT backbones,
facilitating its deployment and fine-tuning in real-world ap-
plications. Last but not least, comprehensive ablation stud-
ies demonstrated the effectiveness of our designed LoRA
layers and Convpass Adapter layers, and the MoE learning
scheme indeed helped the model to search for the optimal
forgery experts. We anticipate that MoE-FFD will inspire
future advancements in face forgery detection, particularly
in bolstering generalization and efficiency.
While our MoE-FFD framework achieves generalized
and robust performance in image-level face forgery detec-
tion, adapting our network design and learning scheme
to video-level Deepfake detection is worth considering.
Moreover, detecting audio-visual content forgery is essential
in real-world applications, so multimodal Deepfake detection
opens an important research path forward. We expect
our proposed MoE structure and the PEFT strategy to
remain effective in these tasks, and we leave their exploration
to future work.
REFERENCES
[1] C. Kong, S. Wang, and H. Li, “Digital and physical face attacks:
Reviewing and one step further,” arXiv preprint arXiv:2209.14692 ,
2022.
[2] Y. Li, M.-C. Chang, and S. Lyu, “In ictu oculi: Exposing ai created
fake videos by detecting eye blinking,” in 2018 IEEE International
Workshop on Information Forensics and Security (WIFS) . IEEE, 2018,
pp. 1–7.
[3] X. Yang, Y. Li, and S. Lyu, “Exposing deep fakes using inconsistent
head poses,” in ICASSP 2019-2019 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP) . IEEE, 2019, pp.
8261–8265.
[4] L. Li, J. Bao, T. Zhang, H. Yang, D. Chen, F. Wen, and B. Guo, “Face
x-ray for more general face forgery detection,” in Proceedings of the
IEEE/CVF conference on computer vision and pattern recognition , 2020,
pp. 5001–5010.
[5] S. Dong, J. Wang, R. Ji, J. Liang, H. Fan, and Z. Ge, “Implicit
identity leakage: The stumbling block to improving deepfake
detection generalization,” in Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition , 2023, pp. 3994–4004.
[6] A. Luo, C. Kong, J. Huang, Y. Hu, X. Kang, and A. C. Kot, “Beyond
the prior forgery knowledge: Mining critical clues for general face
forgery detection,” IEEE Transactions on Information Forensics and
Security , vol. 19, pp. 1168–1182, 2024.
[7] C. Kong, B. Chen, W. Yang, H. Li, P . Chen, and S. Wang, “Ap-
pearance matters, so does audio: Revealing the hidden face via
cross-modality transfer,” IEEE Transactions on Circuits and Systems
for Video Technology , vol. 32, no. 1, pp. 423–436, 2021.
[8] B. Huang, Z. Wang, J. Yang, J. Ai, Q. Zou, Q. Wang, and D. Ye,
“Implicit identity driven deepfake face swapping detection,” in
Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition , 2023, pp. 4490–4499.
[9] F. Chollet, “Xception: Deep learning with depthwise separable
convolutions,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition , 2017, pp. 1251–1258.
[10] M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for con-
volutional neural networks,” in International Conference on Machine
Learning . PMLR, 2019, pp. 6105–6114.
[11] Y. Luo, Y. Zhang, J. Yan, and W. Liu, “Generalizing face forgery
detection with high-frequency features,” in Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition ,
2021, pp. 16 317–16 326.
[12] I. Masi, A. Killekar, R. M. Mascarenhas, S. P . Gurudatt, and
W. AbdAlmageed, “Two-branch recurrent network for isolating
deepfakes in videos,” in European Conference on Computer Vision .
Springer, 2020, pp. 667–684.
[13] C. Kong, B. Chen, H. Li, S. Wang, A. Rocha, and S. Kwong,
“Detect and locate: Exposing face manipulation by semantic-and
noise-level telltales,” IEEE Transactions on Information Forensics and
Security , vol. 17, pp. 1741–1756, 2022.
[14] B. Shi, D. Zhang, Q. Dai, Z. Zhu, Y. Mu, and J. Wang, “Informative
dropout for robust representation learning: A shape-bias perspec-
tive,” in International Conference on Machine Learning . PMLR, 2020,
pp. 8828–8839.
[15] C. Miao, Z. Tan, Q. Chu, N. Yu, and G. Guo, “Hierarchical
frequency-assisted interactive networks for face manipulation de-
tection,” IEEE Transactions on Information Forensics and Security ,
vol. 17, pp. 3008–3021, 2022.
[16] K. Xu, M. Qin, F. Sun, Y. Wang, Y.-K. Chen, and F. Ren, “Learning
in the frequency domain,” in Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition , 2020, pp. 1740–1749.
[17] J. Li, H. Xie, J. Li, Z. Wang, and Y. Zhang, “Frequency-aware
discriminative feature learning supervised by single-center loss for
face forgery detection,” in Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition , 2021, pp. 6458–6467.
[18] Y. Qian, G. Yin, L. Sheng, Z. Chen, and J. Shao, “Thinking in
frequency: Face forgery detection by mining frequency-aware
clues,” in European Conference on Computer Vision . Springer, 2020,
pp. 86–103.
[19] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai,
T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly
et al. , “An image is worth 16x16 words: Transformers for image
recognition at scale,” arXiv preprint arXiv:2010.11929 , 2020.
[20] X. Dong, J. Bao, D. Chen, T. Zhang, W. Zhang, N. Yu, D. Chen,
F. Wen, and B. Guo, “Protecting celebrities from deepfake with
identity consistency transformer,” in Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition , 2022, pp.
9468–9478.
[21] R. Shao, T. Wu, and Z. Liu, “Detecting and recovering sequen-
tial deepfake manipulation,” in European Conference on Computer
Vision . Springer, 2022, pp. 712–728.
[22] W. Zhuang, Q. Chu, Z. Tan, Q. Liu, H. Yuan, C. Miao, Z. Luo, and
N. Yu, “Uia-vit: Unsupervised inconsistency-aware method based
on vision transformer for face forgery detection,” in European
Conference on Computer Vision . Springer, 2022, pp. 391–407.
[23] J. Wang, Z. Wu, W. Ouyang, X. Han, J. Chen, Y.-G. Jiang, and S.-
N. Li, “M2tr: Multi-modal multi-scale transformers for deepfake
detection,” in Proceedings of the 2022 international conference on
multimedia retrieval , 2022, pp. 615–623.
[24] C. Kong, H. Li, and S. Wang, “Enhancing general face forgery
detection via vision transformer with low-rank adaptation,” arXiv
preprint arXiv:2303.00917 , 2023.
[25] J. Guan, H. Zhou, Z. Hong, E. Ding, J. Wang, C. Quan, and Y. Zhao,
“Delving into sequential patches for deepfake detection,” Advances
in Neural Information Processing Systems , vol. 35, pp. 4517–4530,
2022.
[26] C. Miao, Z. Tan, Q. Chu, H. Liu, H. Hu, and N. Yu, “F2trans: High-
frequency fine-grained transformer for face forgery detection,”
IEEE Transactions on Information Forensics and Security , vol. 18, pp.
1039–1051, 2023.
[27] Y. Li, X. Yang, P . Sun, H. Qi, and S. Lyu, “Celeb-df: A large-scale
challenging dataset for deepfake forensics,” in Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition ,
2020, pp. 3207–3216.
[28] Y. Li and S. Lyu, “Exposing deepfake videos by detecting face
warping artifacts,” arXiv preprint arXiv:1811.00656 , 2018.
[29] U. A. Ciftci, I. Demir, and L. Yin, “Fakecatcher: Detection of syn-
thetic portrait videos using biological signals,” IEEE transactions
on pattern analysis and machine intelligence , 2020.
[30] H. H. Nguyen, J. Yamagishi, and I. Echizen, “Capsule-forensics:
Using capsule networks to detect forged images and videos,” in
ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP) . IEEE, 2019, pp. 2307–2311.
[31] K. Shiohara and T. Yamasaki, “Detecting deepfakes with self-
blended images,” in Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition , 2022, pp. 18 720–18 729.
[32] K. Sun, T. Yao, S. Chen, S. Ding, J. Li, and R. Ji, “Dual contrastive
learning for general face forgery detection,” in Proceedings of the
AAAI Conference on Artificial Intelligence , vol. 36, no. 2, 2022, pp.
2316–2324.
[33] J. Cao, C. Ma, T. Yao, S. Chen, S. Ding, and X. Yang, “End-to-end
reconstruction-classification learning for face forgery detection,”
in Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition , 2022, pp. 4113–4122.
[34] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Im-
agenet: A large-scale hierarchical image database,” in 2009 IEEE
conference on computer vision and pattern recognition. IEEE, 2009, pp.
248–255.
[35] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Larous-
silhe, A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-
efficient transfer learning for nlp,” in International Conference on
Machine Learning . PMLR, 2019, pp. 2790–2799.
[36] Y.-L. Sung, J. Cho, and M. Bansal, “Lst: Ladder side-tuning for
parameter and memory efficient transfer learning,” Advances in
Neural Information Processing Systems , vol. 35, pp. 12 991–13 005,
2022.
[37] E. J. Hu, Y. Shen, P . Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang,
and W. Chen, “Lora: Low-rank adaptation of large language
models,” arXiv preprint arXiv:2106.09685 , 2021.
[38] Z. Zhong, Z. Tang, T. He, H. Fang, and C. Yuan, “Convolution
meets lora: Parameter efficient finetuning for segment anything
model,” arXiv preprint arXiv:2401.17868 , 2024.
[39] M. Jia, L. Tang, B.-C. Chen, C. Cardie, S. Belongie, B. Hariharan,
and S.-N. Lim, “Visual prompt tuning,” in European Conference on
Computer Vision . Springer, 2022, pp. 709–727.
[40] Y. Zhang, K. Zhou, and Z. Liu, “Neural prompt search,” arXiv
preprint arXiv:2206.04673 , 2022.
[41] D. Lian, D. Zhou, J. Feng, and X. Wang, “Scaling & shifting your
features: A new baseline for efficient model tuning,” Advances in
Neural Information Processing Systems , vol. 35, pp. 109–123, 2022.
[42] S. Jie and Z.-H. Deng, “Convolutional bypasses are better vision
transformer adapters,” arXiv preprint arXiv:2207.07039 , 2022.
[43] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton,
“Adaptive mixtures of local experts,” Neural computation , vol. 3,
no. 1, pp. 79–87, 1991.
[44] C. Riquelme, J. Puigcerver, B. Mustafa, M. Neumann, R. Jenatton,
A. Susano Pinto, D. Keysers, and N. Houlsby, “Scaling vision
with sparse mixture of experts,” Advances in Neural Information
Processing Systems , vol. 34, pp. 8583–8595, 2021.
[45] Y. Lou, F. Xue, Z. Zheng, and Y. You, “Cross-token modeling with
conditional computation,” arXiv preprint arXiv:2109.02008 , 2021.
[46] B. Mustafa, C. Riquelme, J. Puigcerver, R. Jenatton, and
N. Houlsby, “Multimodal contrastive learning with limoe: the
language-image mixture of experts,” Advances in Neural Informa-
tion Processing Systems , vol. 35, pp. 9564–9576, 2022.
[47] S. Shen, Z. Yao, C. Li, T. Darrell, K. Keutzer, and Y. He, “Scaling
vision-language models with sparse mixture of experts,” arXiv
preprint arXiv:2303.07226 , 2023.
[48] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton,
and J. Dean, “Outrageously large neural networks: The sparsely-
gated mixture-of-experts layer,” arXiv preprint arXiv:1701.06538 ,
2017.
[49] D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun,
N. Shazeer, and Z. Chen, “Gshard: Scaling giant models with
conditional computation and automatic sharding,” arXiv preprint
arXiv:2006.16668 , 2020.
[50] M. Artetxe, S. Bhosale, N. Goyal, T. Mihaylov, M. Ott, S. Shleifer,
X. V . Lin, J. Du, S. Iyer, R. Pasunuru et al. , “Efficient large
scale language modeling with mixtures of experts,” arXiv preprint
arXiv:2112.10684 , 2021.
[51] H. Hazimeh, Z. Zhao, A. Chowdhery, M. Sathiamoorthy, Y. Chen,
R. Mazumder, L. Hong, and E. Chi, “Dselect-k: Differentiable
selection in the mixture of experts with applications to multi-
task learning,” Advances in Neural Information Processing Systems ,
vol. 34, pp. 29 335–29 347, 2021.
[52] M. Lewis, S. Bhosale, T. Dettmers, N. Goyal, and L. Zettlemoyer,
“Base layers: Simplifying training of large, sparse models,” in
International Conference on Machine Learning . PMLR, 2021, pp.
6265–6274.
[53] S. Roller, S. Sukhbaatar, J. Weston et al. , “Hash layers for large
sparse models,” Advances in Neural Information Processing Systems ,
vol. 34, pp. 17 555–17 566, 2021.
[54] S. Kudugunta, Y. Huang, A. Bapna, M. Krikun, D. Lepikhin, M.-
T. Luong, and O. Firat, “Beyond distillation: Task-level mixture-
of-experts for efficient inference,” arXiv preprint arXiv:2110.03742 ,
2021.
[55] J. Ma, Z. Zhao, X. Yi, J. Chen, L. Hong, and E. H. Chi, “Modeling
task relationships in multi-task learning with multi-gate mixture-
of-experts,” in Proceedings of the 24th ACM SIGKDD international
conference on knowledge discovery & data mining , 2018, pp. 1930–
1939.
[56] J. Guo, D. J. Shah, and R. Barzilay, “Multi-source domain adap-
tation with mixture of experts,” arXiv preprint arXiv:1809.02256 ,
2018.
[57] B. Li, Y. Shen, J. Yang, Y. Wang, J. Ren, T. Che, J. Zhang, and Z. Liu,
“Sparse mixture-of-experts are domain generalizable learners,”
arXiv preprint arXiv:2206.04046 , 2022.
[58] H. Bao, W. Wang, L. Dong, Q. Liu, O. K. Mohammed, K. Aggarwal,
S. Som, S. Piao, and F. Wei, “Vlmo: Unified vision-language pre-
training with mixture-of-modality-experts,” Advances in Neural
Information Processing Systems, vol. 35, pp. 32 897–32 912, 2022.
[59] N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Y. Xu, M. Krikun,
Y. Zhou, A. W. Yu, O. Firat et al. , “Glam: Efficient scaling of lan-
guage models with mixture-of-experts,” in International Conference
on Machine Learning . PMLR, 2022, pp. 5547–5569.
[60] Y. Zhou, T. Lei, H. Liu, N. Du, Y. Huang, V . Zhao, A. M. Dai,
Q. V . Le, J. Laudon et al. , “Mixture-of-experts with expert choice
routing,” Advances in Neural Information Processing Systems , vol. 35,
pp. 7103–7114, 2022.
[61] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling
to trillion parameter models with simple and efficient sparsity,”
Journal of Machine Learning Research , vol. 23, no. 120, pp. 1–39, 2022.
[62] E. Bengio, P .-L. Bacon, J. Pineau, and D. Precup, “Conditional
computation in neural networks for faster models,” arXiv preprint
arXiv:1511.06297 , 2015.
[63] K. Sun, H. Liu, Q. Ye, Y. Gao, J. Liu, L. Shao, and R. Ji, “Domain
general face forgery detection by learning to weight,” in Proceed-
ings of the AAAI conference on Artificial Intelligence , vol. 35, no. 3,
2021, pp. 2638–2646.
[64] J. Wang, Y. Sun, and J. Tang, “Lisiam: Localization invariance
siamese network for deepfake detection,” IEEE Transactions on
Information Forensics and Security , vol. 17, pp. 2425–2436, 2022.
[65] Y. Zheng, J. Bao, D. Chen, M. Zeng, and F. Wen, “Exploring tem-
poral coherence for more general video face forgery detection,”
in Proceedings of the IEEE/CVF international conference on computer
vision , 2021, pp. 15 044–15 054.
[66] J. Fei, Y. Dai, P . Yu, T. Shen, Z. Xia, and J. Weng, “Learning
second order local anomaly for general face forgery detection,”
in Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition , 2022, pp. 20 270–20 280.
[67] J. Yang, A. Li, S. Xiao, W. Lu, and X. Gao, “Mtd-net: learning to
detect deepfakes images by multi-scale texture difference,” IEEE
Transactions on Information Forensics and Security , vol. 16, pp. 4234–
4245, 2021.
[68] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection
and alignment using multitask cascaded convolutional networks,”
IEEE signal processing letters , vol. 23, no. 10, pp. 1499–1503, 2016.
[69] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan,
T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al. , “Pytorch: An im-
perative style, high-performance deep learning library,” Advances
in neural information processing systems , vol. 32, 2019.
[70] D. P . Kingma and J. Ba, “Adam: A method for stochastic optimiza-
tion,” arXiv preprint arXiv:1412.6980 , 2014.
[71] D. Li, Y. Yang, Y.-Z. Song, and T. Hospedales, “Learning to gener-
alize: Meta-learning for domain generalization,” in Proceedings of
the AAAI conference on artificial intelligence , vol. 32, no. 1, 2018.
[72] A. Rossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and
M. Nießner, “Faceforensics++: Learning to detect manipulated fa-
cial images,” in Proceedings of the IEEE/CVF International Conference
on Computer Vision , 2019, pp. 1–11.
[73] B. Zi, M. Chang, J. Chen, X. Ma, and Y.-G. Jiang, “Wilddeepfake: A
challenging real-world dataset for deepfake detection,” in Proceed-
ings of the 28th ACM International Conference on Multimedia , 2020,
pp. 2382–2390.
[74] B. Dolhansky, R. Howes, B. Pflaum, N. Baram, and C. C. Ferrer,
“The deepfake detection challenge (dfdc) preview dataset,” arXiv
preprint arXiv:1910.08854 , 2019.
[75] https://ai.googleblog.com/2019/09/
contributing-data-to-deepfake-detection.html.
[76] L. Jiang, R. Li, W. Wu, C. Qian, and C. C. Loy, “Deeperforensics-
1.0: A large-scale dataset for real-world face forgery detection,”
in Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition , 2020, pp. 2889–2898.
[77] https://github.com/deepfakes/.
[78] J. Thies, M. Zollhofer, M. Stamminger, C. Theobalt, and
M. Nießner, “Face2face: Real-time face capture and reenactment
of rgb videos,” in Proceedings of the IEEE conference on computer
vision and pattern recognition , 2016, pp. 2387–2395.
[79] https://github.com/MarekKowalski/FaceSwap/.
[80] J. Thies, M. Zollhöfer, and M. Nießner, “Deferred neural rendering:
Image synthesis using neural textures,” ACM Transactions on
Graphics (TOG) , vol. 38, no. 4, pp. 1–12, 2019.
[81] H. Wu, J. Zhou, J. Tian, J. Liu, and Y. Qiao, “Robust image forgery
detection against transmission over online social networks,” IEEE
Transactions on Information Forensics and Security , vol. 17, pp. 443–
456, 2022.