Page 1:
Trainable Adaptive Activation Function Structure (TAAFS)
Enhances Neural Network Force Field Performance with Only
Dozens of Additional Parameters
Enji Li
State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences
University of Chinese Academy of Sciences
lienji23s ict.ac.cn
December 20, 2024
Abstract
At the heart of neural network force fields (NNFFs) is the architecture of neural networks, where
the capacity to model complex interactions is typically enhanced through widening or deepening
multilayer perceptrons (MLPs) or by increasing layers of graph neural networks (GNNs). These
enhancements, while improving the model’s performance, often come at the cost of a substantial
increase in the number of parameters. By applying the Trainable Adaptive Activation Function
Structure (TAAFS), we introduce a method that selects distinct mathematical formulations for
non-linear activations, thereby increasing the precision of NNFFs with an insignificant addition to
the parameter count. In this study, we integrate TAAFS into a variety of neural network models,
resulting in observed accuracy improvements, and further validate these enhancements through
molecular dynamics (MD) simulations using DeepMD.
Keywords: Neural Network Force Fields, Trainable Adaptive Activation Function Structure , Molec-
ular Dynamic
1 Introduction
Molecular dynamics (MD) simulations are a critical research tool widely applied across various dis-
ciplines, including chemical engineering, biomedicine, materials science, and physics. MD models
encompass several methodologies, such as ab initio molecular dynamics (AIMD), potential function
methods, and neural network-based force fields, which rely on mathematical models.
The advent of artificial intelligence and the computation of high-precision ab initio data have
led to the development of neural network force fields. These methods offer near-ab-initio accuracy
while maintaining computational costs similar to those of potential function methods, making them
highly attractive for applications in computational materials science and molecular dynamics simula-
tions. Prominent machine learning force field models include SNAP [1], SIMPLE-NN [2], HDNNP [3],
DeePMD [4], NEP [5], BPNN [6], ANI [7], chgnet [8], MACE [9], among others.
At the heart of neural network force fields lies the neural network itself. Enhancing the expres-
sive power of neural networks has been an ongoing area of exploration. In multi-layer perceptrons
(MLPs [10]), this often involves increasing the width and depth of the network. However, the increase
in parameters does not proportionally enhance expressive power, and performance can plateau once a
certain parameter threshold is reached. Graph neural networks (GNNs) possess unique properties that
make them suitable for machine learning force fields, but they also come with substantial parameter
counts. For instance, CHGNet has 413K parameters, MACE has 4.69M, whereas DeePMD typically
has only 26K. Increasing GNN layers leads to significant increases in parameter count.
Recently, the KAN [11] introduced trainable activation functions on edges, with nodes responsible
only for summation. This approach reduced parameter complexity from O(N ²) to O((G+K)N ²).
Despite its innovative concept, the practical application of KAN did not meet expectations; replacing
1arXiv:2412.14655v1 [cs.LG] 19 Dec 2024
Page 2:
MLPs with KAN in DP resulted in no performance improvement, with parameter count and training
time increasing by approximately 8 times and 10 times, respectively.
Although KAN did not achieve expected results in my tasks, its method of training activation
functions offered significant inspiration. Drawing upon the KAN framework, I propose the Trainable
Adaptive Activation Function Structure (TAAFS). Unlike other trainable activation functions based on
fixed transformations and combinations, TAAFS allows for the free selection of mathematical formulas
for nonlinear computations. The shape of the activation function curve adapts dynamically based on
data and selected mathematical formulas. The primary workflow includes normalization, mathematical
formula computation, and training coefficients, detailed in Chapter 3.
Ultimately, applying TAAFS to models like DP, ANI2, and chgnet resulted in over 10% accuracy
improvements with only a marginal increase in parameters—ranging from tens to hundreds, less than
0.2% of the total parameter count. Experimental results are presented in Chapter 5.
The main contributions of this work are as follows:
1. Transfer of KAN’s trainable activation function to MLPs: Within this structure, we can arbi-
trarily select mathematical formulas as activation bases.
2. First Implementation of Trainable Activation Functions in Neural Network Force Fields: To our
knowledge, this is the first time such an approach has been employed in neural network force
fields.
3. Enhancement of Model Accuracy Using Mathematical Formulas in DP Networks: We improved
model accuracy using multiple mathematical formulas for fitting and validated the results using
LAMMPS [12] for MD simulations.
4. Successful Application of Trainable Activation Functions in GNN Layers: Achieving performance
gains by incorporating trainable activation functions into GNN layers.
2 Related Work
In the absence of activation functions, no matter how many layers a neural network possesses, it
can only function as a linear model. This limitation arises because the composition of multiple linear
transformations can always be reduced to a single linear transformation. The introduction of activation
functions infuses non-linearity into the network, thereby empowering it to learn and approximate
highly complex and non-linear mappings that are essential for capturing intricate patterns within
data. Consequently, activation functions are indispensable components in the architecture of artificial
neural networks.
Activation functions can be broadly classified into two categories: fixed activation functions, which
have predefined forms and parameters that do not change during training; and trainable activation
functions, which possess parameters that can be optimized during the training process to better fit the
data. The adaptability of trainable activation functions allows for greater flexibility and potentially
improved performance in specific tasks.
Below, I will provide a brief introduction to the two categories of activation functions.
2.1 Fixed-shape activation function
Fixed activation functions include traditional ones such as Sigmoid [13] and Hyperbolic Tangent
(tanh) [14], as well as their variants like SiLU (Sigmoid Linear Unit) [15], ReLU (Rectified Linear
Unit) [16], and Leaky ReLU (LReLU) [17].
•Sigmoid : Once popular for mapping predictions to probabilities, sigmoid functions are now less
favored due to the vanishing gradient problem in deep networks.
σ(x) =1
1 +e−x(1)
2
Page 3:
•Hyperbolic Tangent (tanh) : Similar to the sigmoid function but maps inputs to the range
(-1, 1), which can sometimes lead to better convergence.
tanh( x) =ex−e−x
ex+e−x(2)
•ReLU (Rectified Linear Unit) : Introduces non-linearity while mitigating the vanishing gra-
dient problem, though it can suffer from ”dying ReLU” issues.
ReLU( x) = max(0 , x) (3)
•Leaky ReLU (LReLU) : Addresses the ”dying ReLU” problem by allowing a small, non-zero
gradient when the unit is not active.
LReLU( x) = max( αx, x ),where αis a small constant (4)
•SiLU (Sigmoid Linear Unit) : Combines the benefits of sigmoid and ReLU, providing smooth
gradients and avoiding dead neurons.
SiLU( x) =x·σ(x) =x
1 +e−x(5)
2.2 Trainable activation functions
Trainable activation functions most commonly include parameterized versions of standard activation
functions, as well as combinations of multiple basic activation functions to create more complex and
adaptive activations. Below are some examples.
Parametric Exponential Linear Unit (PELU) : Introduces non-linearity with two trainable
parameters γandβ, which respectively adjust the saturation level and slope of the activation function.
This circumvents the need for manual setting of parameters as seen in ELU, allowing PELU to adapt
dynamically during training via backpropagation [18].
PELU( x) =(
β·x
γ, ifx≥0
β·
exp
x
γ
−1
,otherwise(6)
where βandγare trainable parameters.
Parametric ReLU (PReLU) : Extends the standard ReLU [19] by introducing a learnable pa-
rameter αthat modifies the response to negative inputs. This allows the activation function to partially
learn its form from the training data, mitigating the ”dying ReLU” problem where neurons can become
inactive and only output zero.
PReLU( x) =(
x, ifx >0
α·x,otherwise(7)
where αis learned concurrently with the model parameters using gradient-based methods.
Exponential Linear Sigmoid Squashing (EliSH) : Combines properties of both exponential
linear units (ELUs) and sigmoid functions. It introduces non-linearity while maintaining a smooth
gradient flow, even for large input values. EliSH [20] is designed to provide a balance between compu-
tational efficiency and performance, making it suitable for various neural network architectures.
The EliSH function is defined as follows:
EliSH( x) =(
x
1+e−x,ifx≥0
ex−1
1+e−x,otherwise(8)
In this formulation, the negative part of EliSH results from the multiplication of the ELU and
sigmoid functions, whereas the positive part shares similarities with the Swish function. HardELiSH,
meanwhile, is constructed by multiplying the HardSigmoid and ELU functions in the negative domain
and combining the HardSigmoid and linear functions in the positive domain.
3
Page 4:
Figure 1: Activation Function Curves
Mexican Hat Linear Unit (MeLU) : The Mexican Hat Linear Unit [21] addresses the issues
of unstable learning associated with trainable parameters. Unstable learning can lead to decreased
accuracy and increased generalization error when model performance varies significantly in response
to slight changes in data or parameters. Mexican hat-type functions offer a smoother curve compared
to ReLU, which helps prevent saturation and allows for optimal performance.
The core function fis defined as:
fγ,λ(x) = max( λ− |x−γ|,0), (9)
where λandγare real numbers. This function returns zero when |x−γ|> λ. It increases with a
derivative of +1 between γ−λandγ, then decreases with a derivative of −1 between γandγ+λ.
The MeLU is defined for each layer as:
MeLU( x) = PReLU( x) +k−1X
j=1cjfγj,λj(x), (10)
where krepresents the number of learnable parameters per neuron. The cjparameters are learnable
real numbers, while γjandλjare fixed parameters.
In this study, the neural network force field models selected all employ fixed activation functions.
Therefore, the experimental focus is on demonstrating the feasibility of TAAFS (Trainable Activa-
tion and Adaptive Force Field Scheme) and its performance enhancement relative to fixed activation
functions, rather than comparing it with trainable activation functions. Fig1 presents graphical rep-
resentations of several activation functions, which can be compared against the activation function
curves derived from TAAFS, providing an intuitive visualization of their differences.
3 Method
We adopt the approach proposed by KAN, utilizing a Trainable Adaptive Activation Function Struc-
ture (TAAFS) within neural networks. The following sections outline the working principle of TAAFS,
the mathematical formulaused, and comparisons with traditional activation functions.
3.1 Working Principle
The core idea behind TAAFS is to replace fixed activation functions with adaptive ones that can be
trained alongside the network parameters. By fitting mathematical formulato the data, these activation
functions can dynamically adjust their shape during training to better match the underlying patterns
in the dataset. This adaptability allows the model to learn more complex and nuanced representations
compared to using static activation functions.
Traditional Fixed Activation Function Forward Propagation Steps :
1. Weighted Summation: z=Wx, where Wis the weight matrix, xis the input vector.
2. Nonlinear activation is performed using activation functions, such as the Hyperbolic Tangent
(Tanh) and the Sigmoid Linear Unit (SiLU): a(z).
4
Page 5:
3. Addition of Bias : y=a(z) +b′if necessary, where b′is another bias vector that may be added
after the activation.
During backpropagation, the weights Wand biases bare adjusted.
Trainable Adaptive Activation Function Structure Forward Propagation Steps :
1. Weighted Summation: z=Wx.
2. Normalization can be performed using activation functions such as the Hyperbolic Tangent
(Tanh), or alternatively, through methods like the difference quotient: a=f(z).
3. Setting up a grid for mathematical formulaor determining the number of iterations N: This step
depends on the mathematical formulaor type used.
4. Mathematical Formulaor Evaluation: p(a;θ), where p(·;θ) represents the mathematical formulaor
with trainable parameters θ.
5. Addition of Bias: y=p(a;θ) +b, where bis the bias vector added after polynomial evaluation.
During backpropagation, the weights W, biases b, and polynomial coefficients θare adjusted.
3.2 Mathematical Formula
We employ various mathematical formulato fit the data, including B-spline basis function ( Bline ),
Fourier, Gaussian Radial Basis Function (GRBF), Legendre, Hermite, Chebyshev of the second kind
(Chebyshev2), Chebyshev of the first kind (Chebyshev1), Bessel, and Jacobi. These mathematical
formulahave been extensively used in mathematics, physics, statistics, and other fields. One class of
these mathematical formularequires a grid setup for evaluation, such as Bline, Fourier, and GRBF.
The other class only considers the number of iterations for computation, like Chebyshev polynomials
of the first kind (Chebyshev1) and Hermite polynomials.The following is an introduction to a few
representative multiples.
B-spline functions are piecewise polynomial functions that are widely used for constructing
smooth curves and surfaces. Given a knot vector U={u0, u1, . . . , u m}and a degree p, the B-spline
basis function Ni,p(u) is defined recursively using the Cox-de Boor formula:
Forp= 0:
Ni,0(u) =(
1 if ui≤u < u i+1
0 otherwise(11)
Forp >0:
Ni,p(u) =u−ui
ui+p−uiNi,p−1(u) +ui+p+1−u
ui+p+1−ui+1Ni+1,p−1(u) (12)
The advantages of B-splines include local support, where each B-spline function affects only a
limited interval, and smoothness, ensuring continuous derivatives up to order p−1. However, the
recursive nature can make implementation more complex.
The Fourier series represents periodic functions as a sum of sine and cosine terms. For a
function f(x) with period 2 π, the Fourier series is given by:
f(x) =a0
2+∞X
n=1(ancos(nx) +bnsin(nx)) (13)
where the coefficients anandbnare calculated as follows:
an=1
πZπ
−πf(x) cos( nx)dx, b n=1
πZπ
−πf(x) sin(nx)dx (14)
Advantages of Fourier series include their excellent representation of periodic data and the orthog-
onality of sine and cosine functions. However, they suffer from the Gibbs phenomenon, which causes
oscillations near discontinuities in the function.
5
Page 6:
First-kind Chebyshev polynomials Tn(x) are defined on the interval [ −1,1] and have the property
of minimizing the maximum deviation from zero. They are given by the recurrence relation:
T0(x) = 1 , T 1(x) =x, T n+1(x) = 2 xTn(x)−Tn−1(x) (15)
Alternatively, they can be expressed using trigonometric identities:
Tn(x) = cos( narccos( x)) (16)
Advantages of first-kind Chebyshev polynomials include their minimax property, which minimizes
the maximum error in polynomial approximation, and efficient computation through recurrence rela-
tions. However, they are defined on the interval [ −1,1], requiring appropriate scaling for other ranges.
Admittedly, the repertoire of applicable mathematical formulas is not limited to those enumerated
herein. A vast array of alternative mathematical formulations, including but not limited to rational
functions and wavelet transforms, have been explored. Due to constraints imposed by table width
and less favorable convergence outcomes, the results pertaining to these methods are omitted from the
experimental presentation.
3.3 Contrast
Advantages of Using Trainable Adaptive Activation Function Structure:
1 : Wider Output Range: Although all activation functions typically involve normalization, polyno-
mial fitting allows for a broader and more flexible expression range. This flexibility can better capture
complex patterns in the data.
2 : Freedom from Traditional Shapes: Polynomial fitting enables the activation function to break
free from the conventional shapes imposed by standard activation functions, providing greater adapt-
ability to different datasets and tasks.
3 : Improved Smoothness: Polynomial fitting ensures that the resulting activation curve is smoother,
which can contribute to more stable training processes and potentially better generalization perfor-
mance.
4 : Minimal Increase in Parameters: The approach requires only a modest increase in parameters,
making it relatively simple to implement. Users need only specify the polynomial degree and the
number of grid points or iterations.
Disadvantages of Using TAAFS:
1 : Increased Training Time Due to Serial Operations: Training the parameters of the polynomial-
based activation function is a serial operation, which can add to the overall training time compared to
using fixed activation functions.
2 : Sensitivity to Fitting Method Selection: The choice of fitting method can significantly influ-
ence the training outcomes. Selecting an inappropriate fitting technique might lead to suboptimal
performance or convergence issues.
4 Experiments Model
This study employs four models for experimentation: the Deep Potential model (DP), ANI2, PAINN,
and ChgNet.
Deep Potential Model (DP): The DP model comprises two networks arranged sequentially: an
embedding network and a fitting network, as illustrated in fig2(a). The embedding network transforms
atomic configurations into descriptors that capture the local environment of each atom, while the
fitting network maps these descriptors to atomic energies, thereby enabling the prediction of the total
potential energy of the system. In the context of Deep Potential Model (DP), two optimizers are
employed: ADAM [22] and RLEKF [23]. ADAM is a well-established optimizer, yet it tends to train
relatively slowly, often necessitating several hundred epochs to converge. Conversely, RLEKF is a
second-order optimizer capable of converging to the same level of precision as ADAM within just
a few epochs. However, RLEKF demands significantly more memory and each epoch takes longer
than with the ADAM optimizer. Despite these drawbacks, RLEKF shows a rapid decrease in RMSE,
achieving lower error rates quickly. Given this comparative analysis, the utilization of the RLEKF
6
Page 7:
Figure 2: Deep Potential model and ANI2
Table 1: Comparison of DP Parameters Across Different Schemes
Original scheme 1 scheme 2 scheme 3 scheme 4
Parameter 26550 26558 26566 26598 26775
Ratio 100% 100.03% 100.06% 100.18% 100.85%
DP network structure :embedding network [25,25,25], fitting network
[50,50,50,1] .
optimizer predominates in DPI testing due to its capability to quickly converge to a lower RMSE.
Code from: https://github.com/LonxunQuantum/PWMLFF .
ANI2: This model specifically targets the elements carbon (C), hydrogen (H), oxygen (O), and
nitrogen (N) [24]. ANI2 constructs separate neural networks for each element to compute their contri-
butions individually. These individual contributions are then aggregated to calculate the total energy
of the molecular system, as illustrated in fig2(b).Code from: https://github.com/aiqm/torchani/
tree/master .
PAINN and ChgNet: Both PAINN and ChgNet are based on Graph Neural Network (GNN) archi-
tectures, which are designed to handle complex interactions within molecular structures. These GNNs
process atoms as nodes and bonds as edges in a graph representation of molecules. They iteratively
update node features by aggregating information from neighboring nodes, allowing for sophisticated
modeling of interatomic forces and electronic properties. Due to their complexity and effectiveness in
capturing intricate patterns in molecular data, both models represent advanced approaches in the field
of computational chemistry.
5 Experiments
All experiments in this chapter are conducted in a progressive manner, with the results from each
preceding experiment being applied to subsequent experiments.
5.1 Experiment 1:Best practice testing
In this experiment, we conducted tests using the RLEKF optimizer in the DP with a smaller Ag dataset
and used the built-in example dataset from the ANI2 code for testing. The DP network consists of two
serially connected networks: EmbeddingNet and FittingNet. In contrast, the ANI2 model comprises
parallel networks for four elements: carbon (C), hydrogen (H), oxygen (O), and nitrogen (N). The
experiments were carried out according to the following schemes:
1:Global Activation Function: Using a singletrainable adaptive activation function across the entire
network.
2:Per-Network Activation Functions: Using onetrainable adaptive activation function per network.
3:Per-Layer Activation Functions: Using onetrainable adaptive activation function per layer.
4:Per-Neuron Activation Functions: Using onetrainable adaptive activation function per neuron.
The fitting scheme for thetrainable adaptive activation functions was set to third-order Bline. The
parameter counts for each configuration of DP model are summarized in the table 1.
7
Page 8:
Table 2: Convergence Results of the DP Model on the Test Set
Scheme E/atom, eV E Ratio F, eV/ ˚AF Ratio s/epoch Time Ratio
Original 1.90E-04 100.00% 1.13E-02 100.00% 26.2 100.00%
scheme 1 1.82E-04 95.75% 1.12E-02 98.80% 51.6 196.95%
scheme 2 1.80E-04 94.93% 1.11E-02 98.46% 51.7 197.33%
scheme 3 1.77E-04 93.24% 1.12E-02 98.74% 52.1 198.85%
scheme 4 3.91E+03 90697673% 3.36E-01 2969% 594 2267%
Table 3: Convergence Results of ANI2 on example data
Example Data RMSE Ratio second/epoch
Original 1.894896437 100% 1s
scheme 1 2.119502814 111.85% 10s
scheme 2 1.530401059 80.76% 10s
scheme 3 1.350248646 71.26% 129s
scheme 4 / / /
Figure 3: Convergence Curve of ANI2
From the results, it can be observed that in the DP network, schemes 1, 2, and 3 exhibited similar
computational times. As the scope of application for eachtrainable adaptive activation function became
more localized, performance improved, with the best results obtained when using a separatetrainable
adaptive activation function per layer. However, using a separatetrainable adaptive activation function
per neuron (scheme 4) yielded poor performance. Examination of the training logs revealed that during
the first epoch, the error in energy and forces stopped decreasing after processing only a small portion
of the data. This issue arose due to rapid gradient vanishing caused by the excessive number of neurons.
In the ANI2 network, while the per-layer approach showed good performance, the computational
time was significantly higher. Therefore, scheme 4 was not tested in ANI2. For subsequent experiments,
except for ANI2 which used different networks withtrainable adaptive activation functions, all other
configurations utilized a per-layertrainable adaptive activation function.
5.2 Experiment 2:Big data set testing
Based on the results from Experiment 1, we conducted relevant tests usingtrainable adaptive activation
functions on a layer-by-layer basis with larger datasets.
For ANI2, we tested using the dataset provided in the original paper. Despite some discrepancies
between our implementation and the paper’s results, the use oftrainable adaptive activation func-
tions led to precision improvements of 32% and 18%. The training process consistently yielded good
accuracy, with a particularly significant reduction in initial error.
In the DP network, we used the same dataset as in paper (Training one DeePMD Model in Min-
utes: a Step towards Online Learning [25]) and conducted tests using both the ADAM and RLEKF
optimizers. For the ADAM optimizer, the learning rate was reduced until it reached 10−7, at which
8
Page 9:
Table 4: Convergence Results of ANI2
X40.h5 RMSE Ratio Halgren.h5 RMSE Ratio
Thesis results 2.44 / Thesis results 1.77 /
Experimental results 2.28 1 Experimental results 2.16 1
Bline 1.56 68.42% Bline 1.78 82.36%
Table 5: Convergence Results of the DP Model Using the RLEKF Optimizer on a Large Dataset
RLEKF Al Cu CuO Mg NaCl Si H2O
tanh 2.88E-04 2.45E-04 1.03E-04 1.44E-04 1.98E-05 1.51E-04 9.68E-05
E/atom, eV Bline 2.41E-04 2.14E-04 6.28E-05 1.49E-04 1.64E-05 1.37E-04 7.42E-05
Ratio 83.74% 87.54% 60.87% 103.71% 83.00% 91.16% 76.69%
tanh 3.27E-02 4.18E-02 2.36E-02 1.22E-02 3.75E-03 2.26E-02 1.77E-02
F, eV/ ˚A Bline 3.09E-02 4.12E-02 2.16E-02 1.26E-02 3.70E-03 2.22E-02 1.66E-02
Ratio 94.23% 98.51% 91.22% 102.83% 98.65% 98.19% 93.89%
tanh 384 151.7 244 196 1017 641.8 622
s/epoch Bline 712 368 707 399 2874 1708 1614
Ratio 185.42% 242.58% 289.75% 203.57% 282.60% 266.13% 259.49%
Bline denotes thetrainable adaptive activation function using the Bline function. The ratio represents
the quotient of the results obtained using thetrainable adaptive activation function compared to those
obtained using fixed activation functions.
point training stopped. In contrast, for the RLEKF optimizer, we uniformly trained for 30 epochs.
The differing training criteria were set according to the characteristics of each optimizer; notably, the
RLEKF optimizer achieved the convergence precision of ADAM within a few epochs. Therefore, to
better highlight the differences in training outcomes, we set the training duration to 30 epochs for
RLEKF.
From the experimental results, when using the ADAM optimizer, the RMSE of energy across
various systems decreased by 6.2% to 37.5%, while the RMSE of forces decreased by 0.5% to 15%.
Training time increased to 185.42% to 289.75% compared to fixed activation functions. Both methods
had identical learning rate schedules and converged within the same number of epochs, as detailed
in the table. When using the RLEKF optimizer, the RMSE of energy decreased by 8.9% to 39.9%,
and the RMSE of forces decreased by 1.3% to 5.8%. Training time increased to 237.10% to 362.58%
compared to fixed activation functions.
Additionally, we visualized the activation function images. It was observed that the activation
functions varied across different optimizers within the same system, different systems under the same
optimizer, and even between layers. Activation functions evolved continuously during training, some-
Table 6: Convergence Results of the DP Model Using the ADAM Optimizer on a Large Dataset
ADAM Al Cu CuO Mg NaCl Si H2O
tanh 1.35E-03 2.92E-04 2.18E-04 2.36E-04 2.74E-05 2.60E-04 2.32E-04
E/atom, eV Bline 1.26E-03 2.49E-04 1.36E-04 2.18E-04 2.38E-05 2.33E-04 1.57E-04
Ratio 93.82% 85.14% 62.55% 92.48% 86.63% 89.62% 67.67%
tanh 3.99E-02 4.35E-02 2.56E-02 1.46E-02 3.75E-03 2.96E-02 2.36E-02
F, eV/ ˚A Bline 3.61E-02 4.03E-02 2.18E-02 1.39E-02 3.73E-03 2.71E-02 2.00E-02
Ratio 90.65% 92.56% 85.12% 95.47% 99.42% 91.39% 84.84%
tanh 172 78 120 99 470 363 296
s/epoch Bline 408 223 420 245 1704 1000 804
Ratio 237.10% 284.65% 348.00% 246.08% 362.58% 275.00% 271.70%
9
Page 10:
Figure 4: The solid line represents the activation function fitted using the Bline function, while the
dashed line shows the approximation curve of a cubic polynomial.
Figure 5: This depicts the evolution of the activation function for the first layer during fitting. There
is a significant change in the second epoch, followed by smaller subsequent variations, which aligns
with the convergence trend of the RMSE.
times with significant changes. Overall, the magnitude of these changes correlated closely with the
convergence trends.
5.3 Experiment 3:Mathematical Methods Test
In Experiment 2, it can be observed that the use of Bline as atrainable adaptive activation function
in the DP network significantly increases computational time, approximately doubling or tripling the
original duration. This substantial additional overhead is primarily due to the absence of the Bline
function in PyTorch, leading to slower computations, further exacerbated by certain characteristics of
the function. However, the flexibility offered by Bline is unmatched by other mathematical methods.
In this section, we tested eight mathematical fitting methods, including the Fourier function, using
10
Page 11:
Table 7: Convergence Results for Various Fitting Methods
Cu Tanh Bline Fourier GRBF Legendre Hermite Chebyshev2 Chebyshev1 Bessel Jacobi
E/atom, eV 2.45E-04 2.14E-04 1.96E-04 2.62E-04 2.96E-04 2.05E-04 2.36E-04 2.08E-04 2.30E-04 2.17E-04
Ratio 100% 87.54% 80.17% 106.87% 120.84% 83.89% 96.44% 84.83% 93.85% 88.60%
F, eV/ ˚A 4.18E-02 4.12E-02 4.18E-02 4.42E-02 5.23E-02 4.12E-02 4.27E-02 4.12E-02 4.25E-02 4.17E-02
Ratio 100% 98.51% 99.98% 105.77% 125.12% 98.73% 102.24% 98.73% 101.75% 99.77%
s/epoch 151.7 368 153 362 139.133 183 144 142 179 184
Ratio 100% 242.58% 100.86% 238.63% 91.72% 120.63% 94.92% 93.61% 118.00% 121.29%
The convergence results for various fitting methods are presented. The DP network was trained using the
RLEKF optimizer on the Cu system for 30 epochs.
Figure 6: The activation function fitted using Fourier transform.
the LKF optimizer for 30 epochs, with copper (Cu) selected as the baseline material. The test results
are summarized in the table. Considering comprehensive factors such as energy, force, and computa-
tional time, the Fourier and first-kind Chebyshev polynomials performed notably well. Subsequently,
we conducted full-system tests as outlined in Experiment 2. The results indicate that both Fourier
and first-kind Chebyshev outperformed Bline, with Fourier demonstrating superior performance over
Chebyshev. For instance, in CuO, the energy decreased to 38.3 and the force to 78.26 of those obtained
using fixed activation functions, while the computational time increased by only about 20.
We also recorded the number of epochs required to achieve the same energy precision as fixed
activation functions and the equivalent epoch times. Except for NaCl, Bline achieved the same precision
in less time than fixed activation functions. In contrast, Fourier and first-kind Chebyshev showed even
more significant reductions in time compared to fixed methods.
The fitted activation function curves for Fourier and CBSV are illustrated in the figure, where
Fourier exhibits more dramatic variations, whereas other methods show smoother curves. Notably, the
shapes of the fitted activation function curves differ entirely from those of fixed activation functions.
Through Experiments 2 and 3, we have demonstrated that our method employs a data-driven adap-
tive fitting approach. Compared to othertrainable adaptive activation function methods, it exhibits
stronger data dependency and adaptivity.
5.4 Experiment 4:Molecular dynamics simulations
To mitigate issues such as overfitting, we utilized LAMMPS for molecular dynamics (MD) simulations.
The PWMLFF framework inherently supports a LAMMPS interface, which we employed to conduct
11
Page 12:
Table 8: Comparison of Convergence Results for Four Fitting Methods Across Multiple Systems
RLEKF Al Cu CuO Mg NaCl Si H2O
Tanh 2.88E-04 2.45E-04 1.03E-04 1.44E-04 1.98E-05 1.51E-04 9.680E-05
E/atom, eV Bline 2.41E-04 2.14E-04 6.28E-05 1.49E-04 1.64E-05 1.37E-04 7.424E-05
83.74% 87.54% 60.87% 103.71% 83.00% 91.16% 76.69%
Fourier 2.20E-04 1.96E-04 3.95E-05 1.12E-04 1.31E-05 1.29E-04 5.418E-05
76.50% 80.17% 38.30% 78.10% 66.05% 85.56% 55.97%
CBSV1 2.64E-04 2.08E-04 5.77E-05 1.33E-04 1.60E-05 1.34E-04 6.765E-05
91.53% 84.83% 55.89% 92.81% 81.14% 89.11% 69.89%
Tanh 3.27E-02 4.18E-02 2.36E-02 1.22E-02 3.75E-03 2.26E-02 1.773E-02
F, eV/ ˚A Bline 3.09E-02 4.12E-02 2.16E-02 1.26E-02 3.70E-03 2.22E-02 1.665E-02
94.23% 98.51% 91.22% 102.83% 98.65% 98.19% 93.89%
Fourier 2.98E-02 4.18E-02 1.85E-02 1.15E-02 3.67E-03 2.21E-02 1.506E-02
90.89% 99.98% 78.26% 94.07% 97.65% 97.81% 84.92%
CBSV1 3.18E-02 4.12E-02 2.10E-02 1.19E-02 3.69E-03 2.15E-02 1.603E-02
97.26% 98.73% 88.79% 97.45% 98.42% 95.18% 90.42%
Tanh 384 151.7 244 196 1017 641.8 622
s/epoch Bline 712 368 707 399 2874 1708 1614
185.42% 242.58% 289.75% 203.57% 282.60% 266.13% 259.49%
Fourier 395 153.6 298 230 1307 790 879
102.86% 101.25% 122.13% 117.35% 128.52% 123.09% 141.32%
CBSV1 399 142 340 261 1516 858 889
103.92% 93.61% 139.34% 133.16% 149.07% 133.69% 142.93%
This section compares the convergence results between threetrainable adaptive activation functions and
the Tanh function. The percentages indicate the ratios relative to the Tanh results.
Table 9: Comparison of Epochs Required to Reach the Same Convergence Value
Al Cu CuO Mg NaCl Si H2O
Tanh 30 30 30 30 30 30 30
bline train 11 12 10 / 6 17 11
Equivalent 20.40 29.11 28.98 / 16.96 45.24 28.54
Fourier train 6 9 7 10 2 7 5
Equivalent 6.17 9.11 8.55 11.73 2.57 8.62 7.07
Chebyshev1 train 21 11 11 20 4 12 7
Equivalent 21.82 10.30 15.33 26.70 5.97 16.05 10.01
Train: The number of epochs required for each method to reach the specified
convergence value.
Equivalent: The equivalent number of epochs required to achieve the same con-
vergence value compared to the method using fixed activation functions.
12
Page 13:
Figure 7: The activation function fitted using first-kind Chebyshev polynomials.
Figure 8: The total energy variation curves over 1000 steps (with each step being 1 fs) for several
models using LAMMPS for molecular dynamics (MD) simulations.
MD simulations on four models—Bline, Fourier, CBSV, and Hermite—as detailed in Experiment 3.
Each simulation step was set to 1 femtosecond (fs), with a total of 1000 steps executed.
The resulting curves of total energy changes revealed that the Fourier method exhibited significant
deviations compared to the other approaches. Additionally, we visualized the molecular trajectories
and selected a representative frame for comparison, as shown in the figure. The visualization clearly in-
dicates that the trajectory generated by the Fourier method diverges substantially from those produced
by the other methods.
By examining the fitting curves of the activation functions used in the Fourier method, it becomes
evident that this type of periodic function, characterized by abrupt changes, is not suitable for tasks
within the Deep Potential (DP) framework. This conclusion underscores the importance of selecting
appropriate activation functions that align with the specific requirements of the task at hand.
13
Page 14:
Figure 9: Visualization of the 957th Frame Trajectory.
Table 10: PAINN Test set results
l2h64r16 l2h64r32 l2h128r16 l2h128r32
ERMSE F RMSE Time ERMSE F RMSE Time ERMSE F RMSE Time ERMSE F RMSE Time
Silu 8.64E-02 1.56E-02 55712 4.78E-02 1.44E-02 59447 7.77E-02 1.56E-02 59644 1.21E-01 1.43E-02 59838
Bliene 4.72E-02 1.57E-02 74177 4.85E-02 1.49E-02 75826 5.95E-02 1.57E-02 79616 2.86E-02 1.39E-02 79422
CBSV1 4.85E-02 1.60E-02 62656 4.65E-02 1.46E-02 62968 5.84E-02 1.60E-02 62995 5.17E-02 1.43E-02 63306
l3h64r16 l3h64r32 l3h128r16 l3h128r32
ERMSE F RMSE Time ERMSE F RMSE Time ERMSE F RMSE Time ERMSE F RMSE Time
Silu 6.40E-02 1.59E-02 65013 6.28E-02 1.61E-02 64813 8.80E-02 1.58E-02 65029 4.48E-02 1.36E-02 65283
Bliene 4.73E-02 1.55E-02 94409 3.52E-02 1.38E-02 95403 6.25E-02 1.50E-02 104581 3.39E-02 1.42E-02 104513
CBSV1 4.11E-02 1.51E-02 73434 3.33E-02 1.37E-02 74649 8.70E-02 1.51E-02 71864 4.10E-02 1.46E-02 71252
5.5 Experiment 5:GNN model Test
The experiments utilized two GNN-based neural network force field models: PAINN and ChgNet.
Previous experiments have demonstrated the effectiveness oftrainable adaptive activation functions
in standalone MLPs. In the PAINN model, we replaced the SiLU activation function within the
updata and message layers withtrainable adaptive activation functions fitted using BLINE and CBSV.
Specifically, ”L” denotes ”num-interaction”: ”2”, which refers to the number of interaction operations;
”h” represents ”hidden-channels”: ”128”, indicating the number of feature channels in hidden layers
(i.e., layers that are neither input nor output); and ”r” stands for ”num-radial-basis”: ”16”, referring
to the number of radial basis functions (RBFs), a class of functions based on inter-point distances.
The results are summarized in the table. From these results, it is clear that the Bline activation
function performed best, while CBSV, though inferior to Bline, still outperformed fixed activation
functions.
In the ChgNet model, due to the separate implementation of MLP and GatedMLP functions, where
GatedMLP invokes the MLP function, we directly modified the activation functions within the MLP
to use Bline and CBSVtrainable adaptive activation functions. Given that this model was pre-trained,
using the full MPtr dataset for training would be excessively time-consuming; thus, only the first
1% of the MPtr data was used for training. The performance improvements were as follows: energy
predictions improved by 9% with Bline and 15% with CBSV, while force predictions improved by 12%
with Bline and 7% with CBSV. Regarding parameter counts, the fixed activation function model had
403,126 parameters, whereas the Bline model added only 160 parameters, and the CBSV model added
just 64 parameters. Thus, usingtrainable adaptive activation functions in ChgNet led to more than a
10% performance improvement with an increase of only 0.016% in parameters.
Table 11: chgnet Test set results
eMAE f MAE s MAE m MAE Parameters
Silu 0.033 0.105 0.492 0.099 403126
Bline 0.03 0.092 0.435 0.093 403286
CBSV1 0.028 0.087 0.479 0.087 403190
14
Page 15:
6 Conclusion
This paper introduces Trainable Adaptive Activation Function Structure (TAAFS) in neural networks
that exhibits adaptivity through polynomial fitting of data. This approach has demonstrated promising
results in networks containing multiple multi-layer perceptrons (MLPs), such as Deep Potential (DP)
and ANI-2, while also showing improvements in models like PAINN and ChgNet. Overall, this method
effectively enhances the representation capability of neural networks with only a marginal increase in
parameters. However, the effectiveness of the enhancement is data-dependent, and different fitting
strategies can influence task performance. Consequently, it is recommended to select fitting methods
based on the characteristics of the specific task at hand. We have also explored adjusting parameters of
the TAF, such as grid numbers or iteration counts, which impact the training outcomes. Nonetheless,
in our experiments, we primarily used the most common parameter values without extensive tuning.
Additionally, molecular dynamics (MD) simulations conducted within the DP framework confirmed
the generalization ability of the proposed method.
We welcome contributions to expand the repertoire of mathematical formulas. We also welcome the
use of this approach in various types of tasks, with a focus on summarizing the compatibility between
different mathematical formulas and specific tasks.
References
[1] A.P. Thompson, L.P. Swiler, C.R. Trott, S.M. Foiles, and G.J. Tucker. Spectral neighbor analysis
method for automated generation of quantum-accurate interatomic potentials. Journal of Com-
putational Physics , 285:316–330, 2015. ISSN 0021-9991. doi: https://doi.org/10.1016/j.jcp.2014.
12.018. URL https://www.sciencedirect.com/science/article/pii/S0021999114008353 .
[2] Kyuhyun Lee, Dongsun Yoo, Wonseok Jeong, and Seungwu Han. Simple-nn: An efficient package
for training and executing neural-network interatomic potentials. Computer Physics Communi-
cations , 242:95–103, 2019.
[3] J Behler. Representing potential energy surfaces by high-dimensional neural network potentials.
Journal of Physics: Condensed Matter , 26(18):183001, apr 2014. doi: 10.1088/0953-8984/26/18/
183001. URL https://doi.org/10.1088/0953-8984/26/18/183001 .
[4] Han Wang, Linfeng Zhang, Jiequn Han, and E Weinan. Deepmd-kit: A deep learning pack-
age for many-body potential energy representation and molecular dynamics. Computer Physics
Communications , 228:178–184, 2018.
[5] Zheyong Fan, Yanzhou Wang, Penghua Ying, Keke Song, Junjie Wang, Yong Wang, Zezhu Zeng,
Ke Xu, Eric Lindgren, J Magnus Rahm, et al. Gpumd: A package for constructing accurate
machine-learned potentials and performing highly efficient atomistic simulations. The Journal of
Chemical Physics , 157(11), 2022.
[6] Hengchang Dai and Colin MacBeth. Effects of learning parameters on learning procedure and
performance of a bpnn. Neural networks , 10(8):1505–1521, 1997.
[7] Justin S Smith, Olexandr Isayev, and Adrian E Roitberg. Ani-1: an extensible neural network
potential with dft accuracy at force field computational cost. Chemical science , 8(4):3192–3203,
2017.
[8] Bowen Deng, Peichen Zhong, KyuJung Jun, Janosh Riebesell, Kevin Han, Christopher J. Bartel,
and Gerbrand Ceder. Chgnet as a pretrained universal neural network potential for charge-
informed atomistic modelling. Nature Machine Intelligence , page 1–11, 2023. doi: 10.1038/
s42256-023-00716-3.
[9] Ilyes Batatia, David Peter Kovacs, Gregor N. C. Simm, Christoph Ortner, and Gabor Csanyi.
MACE: Higher order equivariant message passing neural networks for fast and accurate force
fields. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances
in Neural Information Processing Systems , 2022. URL https://openreview.net/forum?id=
YPpSngE-ZU .
15
Page 16:
[10] Allan Pinkus. Approximation theory of the mlp model in neural networks. Acta numerica , 8:
143–195, 1999.
[11] Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljaˇ ci´ c,
Thomas Y Hou, and Max Tegmark. Kan: Kolmogorov-arnold networks. arXiv preprint
arXiv:2404.19756 , 2024.
[12] Aidan P Thompson, H Metin Aktulga, Richard Berger, Dan S Bolintineanu, W Michael Brown,
Paul S Crozier, Pieter J In’t Veld, Axel Kohlmeyer, Stan G Moore, Trung Dac Nguyen, et al.
Lammps-a flexible simulation tool for particle-based materials modeling at the atomic, meso, and
continuum scales. Computer Physics Communications , 271:108171, 2022.
[13] University of Illinois at Urbana-Champaign. Center for Supercomputing Research, Development,
and G Cybenko. Continuous valued neural networks with two hidden layers are sufficient . 1988.
[14] Feng Xiao, Y Honma, and T Kono. A simple algebraic interface capturing scheme using hyperbolic
tangent function. International journal for numerical methods in fluids , 48(9):1023–1040, 2005.
[15] George E Dahl, Tara N Sainath, and Geoffrey E Hinton. Improving deep neural networks for lvcsr
using rectified linear units and dropout. In 2013 IEEE international conference on acoustics,
speech and signal processing , pages 8609–8613. IEEE, 2013.
[16] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In
Proceedings of the fourteenth international conference on artificial intelligence and statistics , pages
315–323. JMLR Workshop and Conference Proceedings, 2011.
[17] Andrew L Maas, Awni Y Hannun, Andrew Y Ng, et al. Rectifier nonlinearities improve neural
network acoustic models. In Proc. icml , volume 30, page 3. Atlanta, GA, 2013.
[18] Ludovic Trottier, Philippe Giguere, and Brahim Chaib-Draa. Parametric exponential linear unit
for deep convolutional neural networks. In 2017 16th IEEE international conference on machine
learning and applications (ICMLA) , pages 207–214. IEEE, 2017.
[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing
human-level performance on imagenet classification. In Proceedings of the IEEE international
conference on computer vision , pages 1026–1034, 2015.
[20] Mina Basirat and Peter M Roth. The quest for the golden activation function. arXiv preprint
arXiv:1808.00783 , 2018.
[21] Gianluca Maguolo, Loris Nanni, and Stefano Ghidoni. Ensemble of convolutional neural networks
trained with different activation functions. Expert Systems with Applications , 166:114048, 2021.
[22] Diederik P Kingma. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980 , 2014.
[23] Siyu Hu, Wentao Zhang, Qiuchen Sha, Feng Pan, Lin-Wang Wang, Weile Jia, Guangming Tan,
and Tong Zhao. Rlekf: An optimizer for deep potential with ab initio accuracy. In Proceedings of
the AAAI Conference on Artificial Intelligence , volume 37, pages 7910–7918, 2023.
[24] Francois Berenger and Koji Tsuda. An ani-2 enabled open-source protocol to estimate ligand
strain after docking. Journal of Computational Chemistry , 2024.
[25] Siyu Hu, Tong Zhao, Qiuchen Sha, Enji Li, Xiangyu Meng, Liping Liu, Lin-Wang Wang, Guang-
ming Tan, and Weile Jia. Training one deepmd model in minutes: a step towards online learning.
InProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of
Parallel Programming , pages 257–269, 2024.
16