Page 1:
A parametric algorithm is optimal for
non-parametric regression of smooth functions
Davide Maran*
Marcello Restelli†
December 20, 2024
Abstract
We address the regression problem for a general function f: [−1,1]d→
Rwhen the learner selects the training points {xi}n
i=1to achieve a uniform
error bound across the entire domain. In this setting, known historically as
nonparametric regression Stone (1982), we aim to establish a sample com-
plexity bound that depends solely on the function’s degree of smoothness.
Assuming periodicity at the domain boundaries, we introduce PADUA, an
algorithm that, with high probability, provides performance guarantees op-
timal up to constant or logarithmic factors across all problem parameters.
Notably, PADUA is the first parametric algorithm with optimal sample com-
plexity for this setting. Due to this feature, we prove that, differently from the
non-parametric state of the art, PADUA enjoys optimal space complexity in
the prediction phase. To validate these results, we perform numerical exper-
iments over functions coming from real audio data, where PADUA shows
comparable performance to state-of-the-art methods, while requiring only a
fraction of the computational time.
1 Introduction
Regression is a fundamental task in machine learning and statistics, and it is per-
haps the most classical family of problems to be studied in these fields. While tra-
ditional regression techniques like linear models are well-understood and widely
used, many real-world applications require learning smooth underlying functions
from data affected by noise. This challenge is omnipresent in Science, particu-
larly in fields such as signal processing, bioinformatics, and financial modeling,
where extracting meaningful patterns from noisy observations is essential (Bishop
and Nasrabadi, 2006; LeCun et al., 2015). In this paper, we focus on the prob-
lem of regression for a function f: [−1,1]d→R, which is known to be smooth,
i.e., differentiable a given number of times, and aim to obtain an estimate that is
*Politecnico di Milano, Milan, Italy. Email: davide.maran@polimi.it
†Politecnico di Milano, Milan, Italy. Email: marcello.restelli@polimi.it
1arXiv:2412.14744v1 [cs.LG] 19 Dec 2024
Page 2:
uniformly accurate over the entire domain. This kind of guarantee, which is not
dataset-related, requires the algorithm also to specify the values {xi}iwhere eval-
uations of f(·)are needed, a procedure known as active sampling . Non-parametric
methods, such as kernel regression, Gaussian Processes, and local polynomial esti-
mators, are popular choices for this task due to their flexibility and ability to capture
complex data patterns without assuming a specific functional form (Williams and
Rasmussen, 2006; Hastie et al., 2009). However, this flexibility often comes at the
cost of computational efficiency and scalability. Non-parametric models typically
require storing and processing the entire dataset during inference, making them
less practical for large-scale applications or real-time systems (Wasserman, 2006).
Parametric methods, on the other hand, define models with a finite set of pa-
rameters, offering several advantages over non-parametric approaches. Paramet-
ric models excel in scalability and computational efficiency, as they do not need
to access the entire dataset for making predictions, making them well-suited for
large datasets and applications requiring fast inference (Murphy, 2012). Also,
these models are often more interpretable: With explicit functional forms, para-
metric models provide clearer insights into the relationship between input variables
and outputs, which is valuable for understanding underlying processes and critical
decision-making scenarios (Bishop and Nasrabadi, 2006). Despite these advan-
tages, state-of-the-art parametric methods underperform in noisy settings when it
comes to modeling smooth functions. This is due to a well-known problem that
relates linear models with the possibility of achieving uniform estimation bounds
for regression. We explore these limitations in detail in Section 2.1.1.
Simultaneous Approximation of the derivatives Usually, regression methods
approximate the function f(·)by minimizing some notion of distance, such as
theL2distance, defined asR
Ω(bfn(x)−f(x))2dx(mean squared error), or the
L∞one, given by supx∈Ω|bfn(x)−f(x)|(maximum distance). In this paper, we
aim for a more challenging objective: approximating the derivatives of fbased
on noisy observations. It is well-known that traditional methods for estimating
derivatives, often based on finite-difference approximations Strikwerda (2004), are
particularly vulnerable to noise in the data. To address this, we propose a differ-
ent approach: projecting the estimated function onto an appropriate vector space,
where the derivatives can be computed analytically .
A key feature of our algorithm is its ability to approximate derivatives symulta-
neously : the approximation for f(α)(·)is the α−derivative of bfn(·). This property,
while seeming natural, is not trivial: in fact, classic approximation theorems from
mathematical analysis (Schultz, 1969) do not enforce it (a result of this type was
proved relatively recently by Bagby et al. (2002)).
Paper structure The rest of the paper is organized as follows: after introducing
the necessary notation in Section 2, we establish the foundation of our approach
based on the properties of the Fourier Series (see Section 3). The algorithm is
2
Page 3:
derived in Section 4, as well as its sample complexity guarantee, which matches
the lower bound provided in Section 5. As anticipated, the only algorithm achiev-
ing optimal sample complexity are parametric: we introduce them in Section 6
and compare to PADUA for what concerns computational/space complexity and
empirical performance (Section 7).
2 Preliminaries
In this paper, we focus on a regression problem, i.e., approximating a black-box
function f:X ⊂Rd→Rfrom noisy samples yicorresponding to some xi∈ X.
Assumption 1. (Stochastic active samples with sub-Gaussian noise) The agent is
able to choose nquery points in advance, receiving a dataset of samples {xi, yi}n
i=1
given by realizations of Yisuch that
E[Yi] =E[f(xi)], Y i−f(xi)isσ−subgaussian independent from Yjforj̸=i
We study the case where function fis smooth. Given a multi-index αas a tuple
of non-negative integers (α1, . . . α d), we say that f∈ Cν(X), forν∈(0,+∞),
if it is ν∗−times continuously differentiable for ν∗:=⌈ν−1⌉, and there exists a
constant Lν(f)such that:
∀α:∥α∥1=ν∗,∀x, y∈ X:D(α)f(x)−D(α)f(y)≤Lν(f)∥x−y∥ν−ν∗∞.
(1)
The multi-index derivative is defined as D(α)f:=∂α1+...+αd
∂xα1
1...∂xαd
d.The previ-
ous set becomes a normed space when endowed with a norm ∥f∥Cνdefined as
max
max|α|≤ν∗∥D(α)f∥L∞, Lν(f)
.Note that, when ν∈N, this norm re-
duces to ∥f∥Cν= max |α|≤ν∥D(α)f∥L∞,since the Lipschitz constant Lν(f)of
the derivatives up to order ν∗=ν−1correspond exactly to the upper bound of the
derivatives of order ν(which exists as a Lipschitz function is differentiable almost
everywhere).
Assumption 2. (Smooth and periodic function) f(·)∈ Cν
p([−1,1]d)for some
known ν >0and an upper bound for ∥f∥Cνis known.
Assumption 2 contains, in addition to smoothness, the requirement of period-
icity at the boundaries of [−1,1]d. While this assumption may seem restrictive,
we argue that it does notmake the problem anyway easier. Indeed, for a general
non-periodic function, one could: 1) Estimate f(·)on the boundary of the domain.
This can be done with a negligible number of samples, as the boundary has one less
dimension. 2) Subtract to fa smooth function gwhose value matches the estima-
tion on the boundary. In this way, we get that f−gis identically zero on ∂[−1,1]d
and, a fortiori , periodic. To support this reasoning, we will prove a result showing
3
Page 4:
that the same lower bound for the general case holds for the case with periodicity
(theorem 6).
The method we will introduce in this paper is based on the theory of the Fourier
Series. This discipline focuses on spaces of functions that are endowed with peri-
odic conditions at the boundary of the interval. In the following section, to keep
things intuitive, we will only focus on the one-dimensional case, while the statisti-
cal complexity bound for arbitrary dimension dwill be given in section 4.1.
Periodic functions and Fourier series As anticipated, we focus for now on
d= 1 so that we can take, without loss of generality, X= [−1,1]. We define
trigonometric polynomial of degree Nas a function fthat can be written as
f(x) =a0+PN
t=1btsin(tπx) +atcos(tπx),for real coefficients {at, bt}. Obvi-
ously, any function of this king is periodical on [−1,1]. By convenience, we can
stack the coefficients of this representation into a unique vector θ, instead of having
a, b: defining the function soc :Z×[−1,1]→[−1,1]as:
soc(t, x) :=(
sin(tπx)t >0
cos(tπx)t≤0,
any trigonometric polynomial can be written as
f(x) =NX
t=−Nθtsoc(t, x) =:ϕN(x)⊤θ ϕ N(x) := [ soc(t, x)]N
t=−N.(2)
In the following, we are going to use the definition corresponding to Equation
(2) and call the vector space of functions defined in this way as TN. Periodic func-
tions, and trigonometric polynomials in particular, admit a convolution operator
that will be useful in the following analyses. In the rest of this paper, we will call,
for periodic functions f, gon[−1,1],f∗g(x) :=R1
−1f(y)g(x−y)dy,where
gis extended by periodicity: for z > 1,g(z) := g(z−2)and for z <−1,
g(z) :=g(z+ 2). This operator satisfies the usual properties of convolution and is
usually called circular convolution .
2.1 Regression and Fourier Series
The underlying idea of Fourier Series is to use trigonometric polynomials to ap-
proximate generic functions. In fact, any function f(·)can be written as a trigono-
metric polynomial plus an error term whose magnitude can be bounded, for exam-
ple, with the following classical result.
Theorem 1. (Theorem 4.1 part (ii) from Schultz (1969)) There exists an absolute
constant K > 0such that, for any f∈Cν
p([−1,1]d)we have:
inf
TN∈TN∥TN(·)−f(·)∥L∞≤KN−ν∥f∥Cν.
4
Page 5:
Note that the trigonometric polynomial realizing the infimum of the previous
definition exists but does notcorrespond to the Fourier projection on TN, which
instead is defined by minimizing the L2−norm. For the latter notion of distance,
results analogous to the one presented in theorem 1 are known, which always link
the smoothness of a function to the approximation rate.
In our context of Learning theory (Vapnik, 2013) and precisely, Regression
(Harrell et al., 2001), these kind of results have inspired a simple yet effective idea.
If the target function f(·)is smooth, and we can access it through noisy samples,
we can choose Nso that the error term is negligible and then run the simple Linear
Regression algorithm (Montgomery et al., 2021) to estimate coefficients bθnsuch
thatϕN(·)⊤bθn≈f(·). In order to get a satisfactory result, we need two things to
be high: 1) N, so that the approximation error becomes negligible (low bias); 2)
n, the number of samples, so that the variance is also low. In fact, the two needs
cannot be disentangled: the higher N, the higher the number of samples needed to
avoid over-fitting.
2.1.1 The problem of linear regression under misspecification
Formalizing our setting, we have to fit a function f(·) =ϕ⊤
N(·)θ+ξN[f](·)from
noisy observation, knowing that ∥ξN[f](·)∥L∞≤ξ∞
N→0withN→ ∞ , at a
certain known rate. In case we are interested in minimizing the L2loss, things go
smooth, as from it follows that, sampling uniformly on [−1,1]we can guarantee
Wainwright (2019) (Chapters 13 and 14)
∥f(·)−⟨ϕN(·),bθn⟩∥L2∝r
E
X∼Unif(−1,1)[(f(X)− ⟨ϕN(X),bθn⟩)2|bθn] =Op
N/n +ξ∞
N
.
In light of this result, optimal error bounds for Fourier regression in the case of
theL2were proved Tsybakov (2009) (1.7 Projection estimators). The pain starts
when focusing on the uniform error instead. In fact, Lattimore et al. (2020) proved
that for general feature maps of length Nthe best provable guarantee takes the
following form
∥f(·)− ⟨ϕN(·),bθn⟩∥L∞=Op
N/n +√
Nξ∞
N
.
This dependence on√
Ncoupled with the misspecification is very unlucky
in our case. Not only can it be shown to be suboptimal, but it may also lead to
vacuous statistical complexity bounds (Maran et al., 2024a). As we anticipated,
the whole idea of using the Fourier Series in Regression builds on the fact that the
approximation term ξ∞
Nvanishes for N→ ∞ , a property that may not hold for
the product√
Nξ∞
N. Two more things are particularly discouraging: 1) the result
was proved for the bandit setting, where the learner can adaptively choose the
queries xibased on past observations yi, which is easier than our active sampling
assumption; 2) we are interested in a uniform bound also for the derivatives of
the function f, and the misspecification is known to be amplified for this kind
5
Page 6:
−1−0.5 0 0.5 101020N= 10
−1−0.5 0 0.5 1−202468 N= 10
Figure 1: Dirichlet Kernel (left) and de la Vall ´ee-Poussin one (right) (Maran et al.,
2024b).
of estimation problems. To elude this result, we need some argument that does
not come from linear algebra - as anything that holds for generic feature maps is
doomed by the previous negative result - but something that exploits the specific
features of the Fourier basis.
3 Fourier Series, Convolution, Kernels
The most standard way to find the Fourier coefficients on a given function f∈
L2([−1,1])is by performing a scalar product with the base functions sin(tπ·)and
cos(tπ·). This procedure gives the value of each of the coefficients of the trigono-
metric polynomial in equation 2. Still, there is another way to perform the same
operation, which directly gives the Fourier series truncated at a given order N; this
passes through a function called Dirichlet Kernel , defined as
DN(x) := 1 + 2NX
t=1cos(tπx) =sin((N+ 1/2)πx)
sin(πx/2).
In fact, calling FN[f](·)the Fourier Series of ftruncated to the term N, it
was proved that FN[f](·) =DN∗f(·).As we shall see, expressing this operator
as a convolution is fundamental to what comes next. But first, we must note that
finding the Fourier series is not exactly what we are looking for. In fact, as we want
a method that achieves a performance in infinity norm, we are interested in finding
the trigonometric polynomial TN∈TNminimizing ∥TN(·)−f(·)∥L∞. Instead, by
using Fourier Series, we minimize ∥TN(·)−f(·)∥L2. Unfortunately, the difference
between these two objectives is substantial: we cannot use one in place of the other
without significantly weakening the guarantees. Luckily, a solution can be found
by replacing DNwith another kernel, defined as
VN(x) :=1
N/2 + 1NX
t=N/2Dt(x).
6
Page 7:
This function, known as De la Val ´ee Poussin kernel (de la Vall ´ee Poussin, 1918;
De La Vall ´ee Poussin et al., 1919), is endowed with important properties. Even if,
contrarily to its L2counterpart, it is not able to provide with VN∗f(·)the exact
trigonometric polynomial in TNwhich minimizes the L∞error, it comes close. In
fact, the result of the convolution operator is just a constant time worse than the
optimal projection in ∥ · ∥ L∞, as the next theorem states.
Theorem 2. Letf∈ Cν
p([−1,1])andα∈ {0,1, . . . ν ∗}. Then, VN∗f(·)∈TN,
and we have for a constant K > 0,
∥f(α)(·)−(VN∗f)(α)(·)∥L∞≤K(2/N)ν−α∥f∥Cν.
Comparing this result to the one of theorem 1, we can see that the convolution
VN∗fdoes not only approximate the function fbut also all its derivatives, each
one with optimal order. The proof, which is left to appendix B, is based on a simple
property, which is also going to be useful later,
∥VN(·)∥L1≤Λ1, (3)
for an absolute constant Λ1, that, crucially, does not depend on N, and has
recently proved to be1
3+2√
3
π≈1.43(Mehta, 2015). Thanks to this equation,
we are able to split the VNinto a positive and a negative part, both having finite
integral. We will call these two parts V+
N(·), V−
N(·)≥0as
VN(·) =:β+V+
N(·)−β−V−
N(·),Z1
−1V+
N(x)dx=Z1
−1V−
N(x)dx= 1.(4)
3.1 Fooling our own learner
For now, we have seen that the abstract idea of projecting, w.r.t. the infinity norm,
a function onto the vector space TNcan be performed through a convolution with
VN(·). A key advantage of this approach is that we can approximate all derivatives
of a function at the same time: taking the derivative of the convolution of a function
f(·)with kernel VN(·)is the same as making the convolution of the derivative of
f(·)withVN(·). Therefore, by approximating the function itself, we implicitly
approximate all of its derivatives. However, the real advantage of this approach
is intertwined with the possibility of active sampling. As we have seen in section
2.1.1, problems arise when performing linear regression w.r.t. the basis ϕN(·)of
a function that is not written exactly as ϕN(·)⊤θ. While our target function f(·)
cannot be written in this form, this holds for VN∗f(·), as theorem 2 ensures. Here is
the crucial point: while we are interested in approximating f(·), gathering samples
fromVN∗f(·)is better, as it approximates f(·)at the optimal order and is perfectly
linear in ϕN(·). Unfortunately, the agent-environment interaction does not allow
7
Page 8:
Algorithm 1 PADUA: Projection with Active sampling for Derivative Uniform
Approximation
Require: Query space [−1,1], maximum number nof samples to collect, length
Nof the feature map, order of smoothness ν >0.
1:ε←1/(8πN)
2:Findε−coverCεfor[−1,1]
3:LetXϕ:={ϕN(x) :x∈ Cε}
4:Compute quasi-optimal design ρfor the set Xφ
5:FindV+
N, V−
N, β+, β−from Equation (4)
6:ntot← ⌊n/4⌋
7:i←0
8:forx∈supp(ρ)do
9: forj= 1, . . .⌈ntotρ(x)⌉do
10: xi=x
11: η+∼V+
N(·)
12: η−∼V−
N(·)
13: Query sample y+
iof function f(·)atx+η+
14: Query sample y−
iof function f(·)atx+η−
15: yi=β+y+
i−β−y−
i
16: i←i+ 1
17: end for
18:end for
19:Solve least squares
bθn= arg min
θ∈RNX
i=1
ϕN(xi)⊤θ−yi2
20:Return bfn(·) =ϕN(·)⊤bθnand its derivatives
sampling from the latter function. But this is when active sampling comes into
action.
Recall that, when perturbing the argument of a function g(·)with a noise ηof
density function fη(·), we have E[g(·+η)] =g∗fη(·). If we apply the previous
principle with a noise of density η∼VN(·), we get E[g(·+η)] = g∗VN(·).
Unfortunately, VN(·)isnota density: it is not positive. But luckily, using the
decomposition (4) does the job:
η+∼V+
N(·), η−∼V−
N(·) β+E[g(·+η+)]−β−E[g(·+η−)] =g∗VN(·).(5)
The last equation 5 is what we are going to use to fool our own learner. Here’s
the idea of our algorithm.
1. We start from a linear learner which queries data ¯x1. . .¯xn′from a distribu-
8
Page 9:
tionρ.
2. Instead of querying the model for the point chosen by the learner, we ask for
points x1. . . x 2n′by perturbing the original points as in equation 5.
3. Receiving the outputs y1. . . y n′, the linear learner will act as if the target
function were VN∗f(·), which is linear without any misspecification.
In the next section, we are going to formalize this idea into algorithm 1, and
to prove its theoretical guarantees. This trick, with the name of projection by con-
volution , was recently used by Maran et al. (2024b) in the field of Reinforcement
Learning.
4 Algorithm and main results
Algorithm 1, PADUA, takes as input all the problem parameters plus the order N
of the feature map we are going to use, even if, as we shall see, specific values
ofNare required to get theoretical guarantees. The first lines are standard; we
start finding an εcover of the interval [−1,1]and applying the feature map to each
point. In line 4, the linear learner chooses which points they desire to query to the
black-box model. In this step, we employ the notion of quasi-optimal design for the
least square problem, which we explained in detail in appendix C. Keeping things
simple, using a quasi-optimal design means finding the best distribution of data
to fit linear regression if we are interested in the supremum error on a finite set of
points, Cεin our case. Using a quasi-optimal design is able to reduce the number of
queries of a factor√
Nwhile only paying log(|Cε|), and is necessary to achieve the
optimal sample complexity. The noise densities follow exactly equation (4), and
the sampling process, at lines 13,14 follows exactly equation (5). After these steps,
the algorithm proceeds exactly as a simple linear regression. Thanks to this trick,
algorithm 1 achieves an optimal sample complexity guarantee for all the derivatives
of the target function f, as the next theorem states.
Theorem 3. RunPADUA algorithm 1 for a choice Nsuch that 16Nlog log( N)<
n. Under assumptions 1,2 for d= 1, with probability at least 1−δ, the output
satisfies, for all α={0, . . . ν ∗}
∥f(α)(·)−ϕ(α)
N(·)⊤bθn∥L∞=eO
Nα
∥f∥Cν
Nν+N1/2
√nσ!!
. (6)
Proof Sketch 1. The proof is divided into three parts: 1) formalizing the idea of
sampling η+, η−that we heuristically convey in section 3.1 2) ensuring that the
discretization over Cεpoints makes no harm 3) using the theory of optimal design
to bound the error due to the stochasticity of the samples 4) bounding the bias of
approximation with a trigonometric polynomial with a generalization of theorem 1
and putting everything together.
9
Page 10:
Corollary 4. RunPADUA algorithm 1 for a choice N=O(n1
2ν+1∥f∥2
2ν+1
Cνσ−2
2ν+1).
Under assumptions 1,2 for d= 1, we have with probability at least 1−δ, for all
α={0, . . . ν ∗}
∥f(α)(·)−ϕ(α)
N(·)⊤bθn∥L∞=eO
n−ν−α
2ν+1∥f∥2α+1
2ν+1
Cνσ2ν−2α
2ν+1
. (7)
4.1 Multi-dimensional generalization
After showing that our algorithm works for the one-dimensional setting, we gen-
eralize the result to the regression problem of a function f: [−1,1]d→R. The
Fourier theory for multivariate functions is similar to the one of the univariate case
Katznelson (2004), and approximation theorems hold in the same way. Just, the
feature map ϕN(·), whose length was 2N+ 1, gets replaced by ϕN(·), which
contains interaction terms among the dvariables, and for this reason has length
Nd≈Nd. This worsens the result, as it was expected due to the infamous curse
of dimensionality. Nonetheless, our algorithm 1 is substantially unchanged, just
we replace ϕN(·)withϕN(·)andVN(·)with the multidimensional Vall `ee de la
Poussin’s kernel N ´emeth (2016).
Theorem 5. RunPADUA algorithm 1 for a choice N=O
n1
2ν+d∥f∥2
2ν+d
Cνσ−2
2ν+d
.
Under assumptions 1,2 we have with probability at least 1−δ, for all |α| ≤ν∗,
∀|α| ≤ν∗∥D(α)f(·)−D(α)ϕN(·)⊤bθn∥L∞=eO
n−ν−|α|
2ν+d∥f∥2|α|+d
2ν+d
Cνσ2ν−2|α|
2ν+d
.
(8)
5 Lower bound
In this section, we prove a lower bound, showing that our last result, Corollary 4,
provides the best possible dependence in all variables. While similar results were
known in the nonparametric regression literature, this is the first one holding for
periodic problem instances with fixed probability.
Theorem 6. Any algorithm outputting an estimation D(α)bfn(·)forD(α)f(·)suf-
fers an error lower-bounded by
∥D(α)f(·)−D(α)bfn(·)∥ ≥Ω
n−ν−|α|
2ν+dψ2|α|+d
2ν+d
0 σ2ν−2|α|
2ν+d
,
w.p. at least 1/4, on a problem instance satisfying assumptions 1 and 2 with
∥f(·)∥Cν≤ψ0.
The idea of the theorem is to divide the set [−1,1]dinto disjoint regions and
work on each of them separately. The key to doing this is finding functions that are
10
Page 11:
Algorithm Time training Time prediction Space training Space prediction
LPE O(n) O(nm) O(n) O(n)
PADUA (algorithm 1) O
n2ν+3d
2ν+d
O
mnd
2ν+d
O
n2d
2ν+d
O
nd
2ν+d
Table 1: Table containing the computational complexities of the algorithms with
optimal statistical efficiency. Number of training samples: n, prediction samples:
m.
both smooth and zero at the boundaries of the regions, with all their derivatives. In
this way, we can build the hard instance by arbitrarily combining these functions,
one for each region, still maintaining smoothness and periodicity. Once this is
done, the proof follows a relatively standard KL divergence argument.
6 Related works
The problem of non-parametric regression of a smooth function is one of the most
classical in machine learning research. Among the numerous approaches intro-
duced in the decades, we summarize the one that better fits our problem.
Parametric approaches Due to the fact that smoothness is intrinsically local, es-
tablishing a relation between points that are nearby is the most intuitive approach
to our problem, which is to apply local techniques. In the literature, several ap-
proaches based on piecewise polynomial regression (Sauve, 2009) were studied,
with different estimation schemes (Chaudhuri et al., 1994) and computational com-
plexity (Lokshtanov et al., 2021). This family of methods would indeed work for
the estimation of f(·), as well-known results Chaudhuri et al. (1994) show that
smooth functions can be very well approximated by polynomials locally (think
about Taylor series). Unfortunately, these approaches fail to provide an estimated
function bfnwhose derivative can approximate the ones of f; in fact, restricting re-
gression over multi-intervals provides regression functions that are discontinuous
at the boundaries of these regions. A solution for this issue can be found in Splines
(Wahba, 1990; Quarteroni et al., 2010). Splines are, in brief, locally polynomial
functions with a fixed degree of smoothness at the boundaries of the intervals.
Several types of splines were introduced (Wahba, 1990; Marsh and Cormier, 2001;
Acharjee and Das, 2022; Wang, 2011), always with the aim of obtaining an esti-
mated function bfnthat is smooth on the whole domain. Still, to the best of our
knowledge, no sample complexity results for L∞error over a class of smooth f(·)
have been shown for now. Current results (Li and Ruppert, 2008; Claeskens et al.,
2009) deal only with the L2error in the function in case d= 1. To the best of
our knowledge, this is the first parametric method able to achieve optimal sample
complexity for this scenario.
11
Page 12:
Non-Parametric approaches Perhaps the most celebrated model for non-parametric
regression, Gaussian Process (Williams and Rasmussen, 2006) is widely used across
different fields. This regression method requires as input a kernel function: for the
choice of the Matern kernel; a result follows combining optimal bounds on the in-
formation gain (Vakili et al., 2021) with the fact that RKHS generated by this kernel
corresponds to Sobolev ones (Seeger, 2004). Unfortunately, even in this case, re-
sults are only valid in L2norm. The only approaches that are able to get optimal
approximation properties for a function and its derivatives are Local Polynomial
Estimators (LPE) (see pages 34-42 from Tsybakov (2009)). This kind of estima-
tor extends the Nadaraya–Watson (NW) family (see page 31 Tsybakov (2009)),
achieving, in asymptotic case, the same guarantee of our algorithm 5. Remarkably,
like our PADUA, LPE is able to estimate the derivatives of the target function with
optimal order, as proved in the seminal papers Stone (1982, 1983).
6.1 Comparison with nonparametric methods
As we have seen, the only algorithms matching the theoretical guarantees of our
ones are the LPE from the non-parametric statistic literature (Tsybakov, 2009).
While our bound in theorem 5 is valid for every nand for constants that can be ex-
actly computed, the performance of LPE depends on a constant λ−1
0, where λ0is
only bounded away from zero asymptotically (Tsybakov (2009) Lemma 1.4). On
the other hand, an advantage of LPE is that their guarantee holds for {xi}n
i=1that
are uniformly distributed, while our result requires a more peculiar distribution.
Arguably, the difference between the two approaches mostly concerns computa-
tional complexity.
Comparison on computational complexity. To compare our algorithm with
LPE, we fix a number nof training and mof prediction samples. Results are
summarized in table 1. (PADUA) There are only two parts of algorithm 1 that
are computationally heavy: finding the optimal design at line 4 and solving the
linear regression problem at line 19. The former step can be performed in kN2
d
steps (Lattimore et al. (2020)), while the latter is well-known to take nN2
d+N3
d
(computational complexity of linear regression). Replacing k= 1/ε=O(N),
we get a computational complexity of O(nN2
d+N3
d+mNd). For the opti-
mal choice Nd=O(nd
2ν+d). This number leads to O(n2ν+3d
2ν+d+mnd
2ν+d). The
space complexity in training corresponds to storing the design matrix, which means
O(Nd) =O(n2d
2ν+d), while the one in prediction to O(nd
2ν+d), as the vector of es-
timated bθnis sufficient. (NW/LPE) Both algorithms, like many non-parametric
methods, are learned through lazy learning. That is, nothing is done in the training
phase, and all samples are cycled over while making every prediction. This leads
to a computational complexity of O(mn)and a space complexity of O(n)both in
training and prediction.
Our algorithm is always faster in the prediction case, asd
2ν+d<1. Moreover,
12
Page 13:
0 50 100 150 200 250 300 350 400 450 500 550−0.15−0.1−5·10−205·10−20.1
xy
True Curve
NW
LPE
PADUAFigure 2: True unknown function f(·)used in the experiments and 95% confidence
regions for the predictions fn(·)generated by three algorithms.
the total computational time is superior to the one of LPE provided that n2d
2ν+d<
m,which holds if either we have many more predictions than training samples or if
n≈mandν > d/ 2. This happens in several realistic situations when the function
f(·)comes from a physical process. Thermal and electromagnetic phenomena
are governed by the heat equation and the Laplace-Poisson equation, respectively
(Sobolev, 1964; Tikhonov and Samarskii, 2013; Salsa and Verzini, 2022). Each
of these is characterized by inherent smoothness properties, making the solution
infinitely many times differentiable in the inner of the domain so that we can take
ν=∞. For this favorable case, our computational complexity approaches the
dream-like eO(n+m), while the space is polylogarithmic, as it is possible to choose
Nd=O(log(n)d).
To close this section, we prove a novel result showing that no algorithm can
achieve a space complexity less than the one of PADUA in the prediction phase.
Theorem 7. Any algorithm with optimal statistical complexity for a regression
problem satisfying assumptions 1 and 2 must have a space complexity in the pre-
diction of at least Ω
nd
2ν+d
.
7 Experiments
To empirically validate the results of our PADUA algorithm, we test it on a re-
gression task of a smooth function. Using an analytically defined function f(·)as
the target would not be particularly informative, as functions that can be expressed
as the composition of elemental functions are smooth of order ν= +∞. Making
them less smooth, adding terms like |x|is a viable option; still, we prefer to test
our algorithm on real data, which are smooth in a less artificial manner. As this
paper is concerned about the case of f(·)being periodic, we focus on a common
real case in which functions appear that are endowed of some kind of periodicity,
that is, the one of audio signals (Purwins et al., 2019). In particular, our target
13
Page 14:
0 200 400 600 800 1,00000.20.40.6
Number of samplesL∞errorNW
LPE
PADUA
0 200 400 600 800 1,00010−210−1100101
Number of samplesTime (s)NW
LPE
PADUAFigure 3: Left: L∞error of each of the algorithm, each averaged over 5random
seeds, and with shaded regions representing 95% coverage confidence intervals
for the estimation. Right: log-scale plot of the running time of each algorithm,
compared to the number nof samples.
function f(·)has been extracted from the signal of the song Houdini ©(Lipa et al.,
2023) in the following way. The audio signal was loaded from the .wav file using
thelibrosa library, which returns the waveform as a one-dimensional numpy
array. The audio signal, originally consisting of 8,1858,56samples, was too com-
plex for direct use in our regression task. To prepare the signal for our experiment,
we divided the waveform into intervals of length between 500and1000 samples.
These intervals were carefully selected to ensure periodicity at their boundaries,
which was achieved by taking only intervals starting and ending with a value close
to zero (this can be done for audio signals, where the waveform naturally oscillates
around zero). Then, one of the intervals was selected to perform hyper-parameter
tuning of the algorithms, and one was selected to test their performance. The plot
of the test function can be seen in Figure 2 as the blue solid line. For what con-
cerns the noise of the observation, we have always used a zero-mean Gaussian of
standard deviation 0.1.
Algorithms and results In this numerical experiment, we have compared our
algorithm PADUA with the previously introduced Nadaraya–Watson (NW) and
the Local Polynomial Estimators (LPE). We evaluated the performance of our
PADUA algorithm and the baseline methods across four different sample sizes:
n= 100 ,200,500,1000 . Figure 3 (left) shows that while for n= 1000 all algo-
rithms are able to estimate f(·)almost perfectly, the error of our algorithm PADUA
decreases much faster than the one of LPE, and similarly to the one of NW. As the
other panel 3 (right) shows, PADUA obtains the best running time across the algo-
rithms, outperforming LPE by orders of magnitude. In this experiment, we have
taken the same values for n, while m≈550, as the length of the axis of figure
2, which has been rescaled to [−1,1]in the simulation. Although both LPE and
14
Page 15:
NW share the same theoretical computational complexity of O(nm), in practice,
LPE is significantly slower because it requires solving a small linear regression for
each prediction sample. While the size of these regressions is negligible relative
ton, the additional computations add significant overhead, leading to a noticeable
difference in running time.
8 Conclusions
In this paper, we have introduced PADUA, the first parametric algorithm achiev-
ing optimal L∞performance guarantee over the task of non-parametric regression
of a smooth function ffrom noisy observations at chosen points. In contrast to
previous, non-parametric approaches, PADUA achieves superior computational
complexity and, in particular, matches a lower bound for memory storage proved
in theorem 7. These theoretical guarantees are empirically validated over real data
coming from audio signals, where PADUA shows much improved computational
complexity.
Acknowledgements
Funded by the European Union – Next Generation EU within the project NRPP
M4C2, Investment 1.,3 DD. 341 - 15 march 2022 – FAIR – Future Artificial Intel-
ligence Research – Spoke 4 - PE00000013 - D53C22002380006.
References
Acharjee, M. K. and Das, K. P. (2022). Polynomial spline regression: Theory and
application. arXiv preprint arXiv:2212.14777 .
Bagby, T., Bos, L., and Levenberg, N. (2002). Multivariate simultaneous approxi-
mation. Constructive approximation , 18(4):569–577.
Bishop, C. M. and Nasrabadi, N. M. (2006). Pattern recognition and machine
learning , volume 4. Springer.
Chaudhuri, P., Huang, M.-C., Loh, W.-Y ., and Yao, R. (1994). Piecewise-
polynomial regression trees. Statistica Sinica , pages 143–167.
Claeskens, G., Krivobokova, T., and Opsomer, J. D. (2009). Asymptotic properties
of penalized spline estimators. Biometrika , 96(3):529–544.
de la Vall ´ee Poussin, C. (1918). Sur la meilleure approximation des fonctions
d’une variable r ´eelle par des expressions d’ordre donn ´e.CR Acad. Sci. Paris ,
166:799–802.
15
Page 16:
De La Vall ´ee Poussin, C. J. et al. (1919). Lec ¸ons sur l’approximation des fonctions
d’une variable r ´eelle . Paris.
Evans, L. C. (2022). Partial differential equations , volume 19. American Mathe-
matical Society.
Harrell, F. E. et al. (2001). Regression modeling strategies: with applications to
linear models, logistic regression, and survival analysis , volume 608. Springer.
Hastie, T., Tibshirani, R., Friedman, J. H., and Friedman, J. H. (2009). The ele-
ments of statistical learning: data mining, inference, and prediction , volume 2.
Springer.
(https://math.stackexchange.com/users/793534/luis-yanka annalisc), L. Y . A.
Trigonometric polynomial derivative upper bound. Mathematics Stack Ex-
change. URL:https://math.stackexchange.com/q/4776607 (version: 2023-09-
27).
Katznelson, Y . (2004). An introduction to harmonic analysis . Cambridge Univer-
sity Press.
Kiefer, J. and Wolfowitz, J. (1960). The equivalence of two extremum problems.
Canadian Journal of Mathematics , 12:363–366.
Lattimore, T., Szepesvari, C., and Weisz, G. (2020). Learning with good feature
representations in bandits and in rl with a generative model. In International
Conference on Machine Learning , pages 5662–5670. PMLR.
LeCun, Y ., Bengio, Y ., and Hinton, G. (2015). Deep learning. nature ,
521(7553):436–444.
Li, Y . and Ruppert, D. (2008). On the asymptotics of penalized splines. Biometrika ,
95(2):415–436.
Lipa, D., Parker, K., and Harle, D. L. (2023). Houdini. Audio recording. From the
album Radical Optimism .
Lokshtanov, D., Suri, S., and Xue, J. (2021). Efficient algorithms for least square
piecewise polynomial regression. In ESA21: Proceedings of European Sympo-
sium on Algorithms .
Maran, D., Metelli, A. M., Papini, M., and Restelli, M. (2024a). No-regret rein-
forcement learning in smooth mdps. arXiv preprint arXiv:2402.03792 .
Maran, D., Metelli, A. M., Papini, M., and Restelli, M. (2024b). Projection by con-
volution: Optimal sample complexity for reinforcement learning in continuous-
space mdps. arXiv preprint arXiv:2405.06363 .
16
Page 17:
Marsh, L. C. and Cormier, D. R. (2001). Spline regression models . Number 137.
Sage.
Mehta, H. (2015). The l1 norms of de la vall ´ee poussin kernels. Journal of Math-
ematical Analysis and Applications , 422(2):825–837.
Montgomery, D. C., Peck, E. A., and Vining, G. G. (2021). Introduction to linear
regression analysis . John Wiley & Sons.
Murphy, K. P. (2012). Machine learning: a probabilistic perspective . MIT press.
N´emeth, Z. (2016). De la vall ´ee poussin type approximation methods.
Purwins, H., Li, B., Virtanen, T., Schl ¨uter, J., Chang, S.-Y ., and Sainath, T. (2019).
Deep learning for audio signal processing. IEEE Journal of Selected Topics in
Signal Processing , 13(2):206–219.
Quarteroni, A., Sacco, R., and Saleri, F. (2010). Numerical mathematics , vol-
ume 37. Springer Science & Business Media.
Salsa, S. and Verzini, G. (2022). Partial differential equations in action: from
modelling to theory , volume 147. Springer Nature.
Sauve, M. (2009). Piecewise polynomial estimation of a regression function. IEEE
transactions on information theory , 56(1):597–613.
Schultz, M. H. (1969). l∞-multivariate approximation theory. SIAM Journal on
Numerical Analysis , 6(2):161–183.
Seeger, M. (2004). Gaussian processes for machine learning. International journal
of neural systems , 14(02):69–106.
Sobolev, S. (1964). Partial differential equations of mathematical physics , vol-
ume 56. Courier Corporation.
Stone, C. J. (1982). Optimal global rates of convergence for nonparametric regres-
sion. The annals of statistics , pages 1040–1053.
Stone, C. J. (1983). Optimal uniform rate of convergence for nonparametric esti-
mators of a density function or its derivatives. In Recent advances in statistics ,
pages 393–406. Elsevier.
Strikwerda, J. C. (2004). Finite difference schemes and partial differential equa-
tions . SIAM.
Tikhonov, A. N. and Samarskii, A. A. (2013). Equations of mathematical physics .
Courier Corporation.
Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation . Springer.
17
Page 18:
Vakili, S., Khezeli, K., and Picheny, V . (2021). On information gain and regret
bounds in gaussian process bandits. In International Conference on Artificial
Intelligence and Statistics , pages 82–90. PMLR.
Vapnik, V . (2013). The nature of statistical learning theory . Springer science &
business media.
Wahba, G. (1990). Spline models for observational data . SIAM.
Wainwright, M. J. (2019). High-dimensional statistics: A non-asymptotic view-
point , volume 48. Cambridge university press.
Wang, Y . (2011). Smoothing splines: methods and applications . CRC press.
Wasserman, L. (2006). All of nonparametric statistics . Springer Science & Busi-
ness Media.
Williams, C. K. and Rasmussen, C. E. (2006). Gaussian processes for machine
learning , volume 2. MIT press Cambridge, MA.
18
Page 19:
A Table of Notation
d Dimension of the space, e.g. [−1,1]d
n number of samples available for the algorithm
σ sub-gaussianity constant
R Set of real numbers
ν Index of space of differentiable functions Cν(X)
Cν(X) Space of differentiable functions Cν(X)for some
X ⊂Rd
ν∗ ⌈ν−1⌉
Lν(f) Lipschitz constant of fw.r.t. the index ν
α/α index/multi-index of the derivative
f(α)derivative of a univariate function
D(α)f multi-index derivative for a multivariate function
|α| norm of the multiindex, corresponding to ∥α∥1=
|α1|+···+|αd|
∥ · ∥ L∞ supremum norm of a function, ∥f∥L∞= sup |f|
∥ · ∥ L2 =qR
Ωf(x)2dxfor a function f
∥ · ∥Cν norm over Cν
TN Space of trigonometric polynomial of degree not ex-
ceeding N
TN(·) Element of TN
ϕN(·) Fourier (sin-cos) feature map
FN[f](·) Fourier series of order Nassociated to f
DN(·) Dirichlet Kernel
VN(·) De la Val `ee Poussin Kernel
β+V+
N(·)−
β−V−
N(·)Decomposition of the kernel VN(·)(see (4))
Nd N+d
N
Σ Design matrix
Notation (Part 2)
19
Page 20:
O(·) Order of something, ignoring constants
eO(·) Order of something, ignoring constants and loga-
rithms
k See algorithm 1
ε See algorithm 1
ntot See algorithm 1
supp(ρ) Support of a probability distribution ρ(·)(intersection
of all closed sets of probability one)
N(x;µ, σ) Gaussian density function of parameters µ, σ evalu-
ated in x
Ψ(·) Standard mollifier (see equation 47)
B Proofs of section 3
Theorem 2. Letf∈ Cν
p([−1,1])andα∈ {0,1, . . . ν ∗}. Then, VN∗f(·)∈TN,
and we have for a constant K > 0,
∥f(α)(·)−(VN∗f)(α)(·)∥L∞≤K(2/N)ν−α∥f∥Cν.
Proof. The proof largely builds on theorem 8. Indeed, for α= 0we have exactly
the former result, while for α > 0we note that, by the properties of convolu-
tion, which commutes with differentiation, we have, (VN∗f)(α)=VN∗(f(α));
therefore,
∥f(α)−(VN∗f)(α)∥L∞=∥f(α)−VN∗(f(α))∥L∞.
At this point, we note that the function f(α)∈ Cν−α([−1,1]), asf∈ Cν
p([−1,1]).
Therefore, we can apply theorem 8 for ν′=ν−αto ensure
∥f(α)−(VN∗f)(α)∥L∞≤K(2/N)ν+α∥f(α)∥Cν−α.
Bounding ∥f(α)∥Cν−αby its definition, we can complete the proof:
∥f(α)∥Cν−α= max
max
β≤ν∗−α∥Dβf(α)∥L∞, Lν−α(f(α))
= max
max
β≤ν∗−α∥Dβ+αf∥L∞, Lν(f)
= max
max
α≤β≤ν∗∥Dβf∥L∞, Lν(f)
≤ ∥f∥Cν.
20
Page 21:
Theorem 8. (Theorem 2 in Maran et al. (2024b)) Let f∈ Cν
p([−1,1]). Then,
VN∗f∈TN, and we have
∥f(·)−VN∗f(·)∥L∞≤K1inf
TN/2∈TN/2∥TN/2(·)−f(·)∥L∞≤K2(2/N)ν∥f∥Cν,
where TNdenotes the space of trigonometric polynomials of degree not higher
thanNandK1, K2are universal constants.
C (Quasi-)Optimal Design
Algorithm 1 requires a quasi-optimal design (Kiefer and Wolfowitz, 1960) to choose
which points to sample. To this aim, we recall a result by (Lattimore et al., 2020,
Theorem 4.3).
Theorem 9. (Lattimore et al. (2020), Theorem 4.4) Suppose X ⊂RNis a com-
pact set spanning RN. We can find a probability distribution ρonXsuch that
|supp(ρ)| ≤4Nlog log( N)and, once defined
Σ =E
x∼ρ[xx⊤],
we have, for all x∈Ω,∥x∥2
Σ−1≤2N(here the notation ∥x∥Σ−1stands for√
x⊤Σ−1x).
The distribution ρdefined in Theorem 9 is called quasi-optimal design for the
least square problem. ”Quasi” in the previous definition is because the actual op-
timal design satisfies x∈Ω,∥x∥2
Σ−1≤N; unfortunately, the latter result would
require a larger support for ρ, and it is not suitable for our problem henceforth.
D Proofs of section 4
Theorem 3. RunPADUA algorithm 1 for a choice Nsuch that 16Nlog log( N)<
n. Under assumptions 1,2 for d= 1, with probability at least 1−δ, the output
satisfies, for all α={0, . . . ν ∗}
∥f(α)(·)−ϕ(α)
N(·)⊤bθn∥L∞=eO
Nα
∥f∥Cν
Nν+N1/2
√nσ!!
. (6)
Proof. (Part 1: Expected target and subgaussianity) We start showing that the
average target of the regression problem is exactly linear in the feature map. In-
deed, we have, for every i= 1, . . . n ,
E[yi] =E[β+y+
i]−E[β−y−
i] (9)
=β+E[y+
i]−β−E[y−
i] (10)
=β+E[f(xi+η+)]−β−E[f(xi+η−)] (11)
=β+V+
N∗f(xi)−β−V+
N∗f(xi) (12)
=VN∗f(xi). (13)
21
Page 22:
Here, step 12 follows from 5. Now, let us focus on the sub-gaussianity. We
have
yi−VN∗f(xi) =β+y+
i−β−y−
i−VN∗f(xi) (14)
= (β+y+
i−β+E[y+
i])−(β−y−
i−β−E[y−
i]) (15)
=β+(y+
i−E[y+
i])|{z}
T1−β−(y−
i−E[y−
i])|{z }
T2. (16)
Consider (T1). This term is σ−subgaussian: fix λ >0
E[eλ(y+
i−E[y+
i])] =E[E[eλ(y+
i−E[y+
i])|η+]]
=E[E[eλ(y+
i−f(xi+η+))|η+]]
≤E[eλ2σ2/2|η+] =eλ2σ2
2,
where in the inequality we have used assumption 1. The same reasoning triv-
ially applies to (T2), which is also σ−subgaussian. Therefore, looking at the sum
of two independent subgaussians, we have that yi−VN∗f(xi)isσ′−subgaussian
with
σ′= (β2
++β2
−)1/2σ≤(|β+|+|β−|)σ= Λ 1σ.
Where Λ1is the constant of equation 3. Having proved that yiis unbiased
w.r.t. VN∗f(xi)allows us to ensure the existence of some θ⋆∈R2N+1(asTNis
a vector space of dimension 2N+ 1, as clear from equation (2)) such that
E[yi] =VN∗f(xi) =ϕN(xi)⊤θ⋆,
as it follows from the first part of theorem 2 (the writing ϕN(xi)⊤θ⋆corre-
sponds to saying that the function belongs to TN).
(Part 2: The discretization makes no harm) At this point, we can apply
proposition 2 from Maran et al. (2024b), with xt=ϕN(xt)andX′=Cε, defined
on line 3. In this way, k= 1/ε. The latter result proves, with probability 1−δ,
∀x∈ Cε,|ϕN(x)⊤bθn−ϕN(x)⊤θ⋆| ≤p
log(k/δ) sup
x∈Cε∥x∥Σ−1
nσ′,(17)
where
Σn:=nX
i=1ϕ(xi)ϕ(xi)⊤σ′= Λ 1σ.
At this point, we have to generalize the previous bound to all points of [−1,1],
even not belonging to Cε. To do this, we call ∆fn:=ϕN(·)⊤(bθn−θ⋆)∈TN. The
fact that the function is a trigonometric polynomial allows us to apply Lagrange’s
22
Page 23:
theorem, which ensures that, for any couple points x1, x2, there is x′∈[x1, x2]
such that
∆fn(x1)−∆fn(x2) = ∆ f(1)
n(x′)(x1−x2). (18)
Therefore, we have
∥∆fn(·)∥L∞= sup
x1∈[−1,1]|∆fn(x1)| (19)
(18)= sup
x1∈[−1,1]inf
x2∈Cε|∆fn(x2)−∆f(1)
n(x′)(x1−x2)| (20)
≤sup
x1∈[−1,1]inf
x2∈Cε|∆fn(x2)|+|∥∆f(1)
n∥L∞(x1−x2)| (21)
≤sup
x∈Cε|∆fn(x)|+ε∥∆f(1)
n∥L∞. (22)
Here, (21) follows from bounding the derivative with its infinity norm, and
the one (22) from choosing x2to be the nearest element of x1onCε. Finally, we
can bound ∥∆f(1)
n∥L∞. Indeed, due to the fact that the function is a degree- N
trigonometric polynomial, by theorem 11 we have
∥∆f(1)
n∥L∞≤4πN∥∆fn(·)∥L∞., (23)
which, once merged with the previous result, provides
∥∆fn(·)∥L∞≤1
1−4πεNsup
x∈Cε|∆fn(x)|. (24)
Merging equations (35) and (24), we get
∥∆fn(·)∥L∞≤1
1−4πεNp
log(k/δ) sup
x∈Cε∥x∥Σ−1
nσ′.
(Part 3: Optimal design) Here, what is missing is to estimate ∥x∥Σ−1
n. This
can be done by line 4: indeed,
•ρis a quasi-optimal design for Xϕ.
• The samples are collected according to ⌈ntotρ(x)⌉, which dominates ntotρ(x).
Therefore, from theorem 9, ∥x∥Σ−1
n≤p
2(2N+ 1)/ntot. This result proves that
∥∆fn(·)∥L∞≤1
1−4πεNp
5 log( k/δ)N/n totσ′. (25)
Moreover, theorem 11 ensures that an analogous result also holds for the deriva-
tives:
∀α∈N ∥∆f(α)
n(·)∥L∞≤(4π)αN1/2+α
1−4πεNp
5 log( k/δ)/ntotσ′. (26)
23
Page 24:
(Part 4) At this point, it is just a matter of substituting the known constants and
apply theorem 2. For any α∈ {0, . . . ν ∗}we have
∥f(α)−ϕ(α)
N(·)⊤bθn∥L∞≤ ∥f(α)−(VN∗f)(α)∥L∞+∥(VN∗f)(α)−ϕ(α)
N(·)⊤bθn∥L∞
(27)
=∥f(α)−(VN∗f)(α)∥L∞+∥∆f(α)
n∥L∞ (28)
thm.2
≤K(2/N)ν−α∥f∥Cν+∥∆f(α)
n∥L∞ (29)
(26)
≤K(2/N)ν−α∥f∥Cν+(4π)αN1/2+α
1−4πεNp
5 log( k/δ)/ntotσ′,
(30)
where the first inequality comes from the triangular inequality, the second in-
equality from theorem 2, and the third one from equation (26), which we have
obtained in this proof. At this point, we can replace ntot=⌊n/4⌋,k=ε−1,
σ′= Λ 1σand4πεN = 1/2to get
∥f(α)−ϕ(α)
N(·)⊤bθn∥L∞≤K(2/N)ν−α∥f∥Cν+2(4π)αN1/2+αp
5 log( ε−1/δ)/ntotΛ1σ′,
which means up to logarithmic and constant factors
∥f(α)−ϕ(α)
N(·)⊤bθn∥L∞≲Nα
∥f∥Cν
Nν+N1/2
√nσ!
. (31)
Let us now bound the number of samples collected by the algorithm. The
queries for the model happen at lines 13 and 14, so that the total number of samples
is
2X
x∈supp(ρ)⌈ntotρ(x)⌉ ≤2ntot+ 2|supp(ρ)| ≤2ntot+ 8Nlog log( N).
By assumption, both 2ntotand8Nlog log( N)are less than n/2, so that this
number is itself bounded by nas required by the algorithm.
E Proofs from section 4.1
In this section, we are going to generalize the main result of this paper, theorem
3, to the case of functions with domain [−1,1]d. Fourier series in [−1,1]dcan be
built in a similar way to the univariate setting Katznelson (2004). Still, this time,
the space TN,dof trigonometric polynomials of degree Nhas no longer dimension
24
Page 25:
2N+ 1, as we also need to take into account the mixed terms between the dvari-
ables. In fact, it was proved (see section 3 from Maran et al. (2024b)) that these
trigonometric polynomials take the form
TN(x) =ϕd(x)⊤θ,θ∈RNd, N d=2N+d
d
.
In fact, the latter number corresponds to the number of integer-valued vectors
vsuch that ∥v∥1≤N, and can is bounded by (2N+ 1)d.
About the approximation properties, we have that both theorem 1, the result to
approximate the smooth function with trigonometric polynomials, and theorem 2,
the projection with the la Val ´ee Poussin Kernel are still valid (the proof of theorem
2 is the same in dimension d > 1). Also, the results about this latter function
maintain their validity, just that this time:
∥VN,d(·)∥L1≤Λd, (32)
a dimension-dependent constant Λdwhose expression can be found in N ´emeth
(2016).
Theorem 10. Assume to run algorithm 1 for a choice Nsuch that 16Ndlog log( Nd)<
n. With probability at least 1−δ, the output satisfies, for all |α| ≤ν∗,
∥D(α)f(·)−D(α)ϕd(·)⊤bθn∥L∞=eO
N|α|
∥f∥Cν
Nν+Nd/2σ√n!!
.
Proof. (Part 1) This part follows exactly as made in the analog part of the proof
of theorem 3, proving that
E[yi] =VN,d∗f(xi) (33)
and that yi−E[yi]is sub-gaussian for
σ′= (β2
++β2
−)1/2σ≤(|β+|+|β−|)σ= Λ dσ. (34)
(Part 2) differently from the proof of theorem 3, the cardinality of the ε−cover
here is k:=|Cε|= (2/ε)d. Therefore, applying proposition 2 from Maran et al.
(2024b) results in
∀x∈ Cε,|ϕN(x)⊤bθn−ϕN(x)⊤θ⋆| ≤p
log(k/δ) sup
x∈Cε∥x∥Σ−1
nσ′,(35)
where θ⋆is such that, ϕN(·)⊤θ⋆=VN,d∗f(·)(this is possible since VN,d∗f∈
TN,d), and
Σn:=nX
i=1ϕd(xi)ϕd(xi)⊤σ′= Λ dσ.
25
Page 26:
At this point, we have to generalize the previous bound to all points of [−1,1]d,
even not belonging to Cε. To do this, we call ∆fn:=ϕd(·)⊤(bθn−θ⋆)∈TN,d. The
fact that the function is a trigonometric polynomial (that is always differentiable)
allows us to apply Lagrange’s theorem, which ensures that, for any couple points
x1, x2, there is x′in the segment between the two points such that
∆fn(x1)−∆fn(x2) =∇[∆fn](x′)(x1−x2). (36)
Here,∇denotes the gradient operator. Using this result, we have
∥∆fn(·)∥L∞= sup
x1∈[−1,1]d|∆fn(x1)| (37)
(36)= sup
x1∈[−1,1]dinf
x2∈Cε|∆fn(x2)− ∇[∆fn](x′)⊤(x1−x2)| (38)
= sup
x1∈[−1,1]dinf
x2∈Cε|∆fn(x2)|+|∇[∆fn](x′)⊤(x1−x2)| (39)
≤ sup
x1∈[−1,1]dinf
x2∈Cε|∆fn(x2)|+∥∇[∆fn](x′)⊤∥∞∥x1−x2∥1(40)
≤ sup
x1∈[−1,1]dinf
x2∈Cε|∆fn(x2)|+∥∥∇[∆fn](·)⊤∥∞∥L∞∥x1−x2∥1
(41)
≤sup
x∈Cε|∆fn(x)|+ε∥∥∇[∆fn](·)⊤∥∞∥L∞. (42)
Here, in step 40 we have used the Cauchy-Schwartz inequality between norm
one and norm infinity, (41) follows from bounding the ∥∥∇[∆fn](x′)⊤∥∞with its
infinity norm, and the one (42) from choosing x2to be the nearest element of x1on
Cεw.r.t. the 1−norm. At this point, theorem 12 ensures ∥∥∇[∆fn](·)⊤∥∞∥L∞≤
4πN∥∆fn(·)⊤∥L∞, which gives
∥∆fn(·)∥L∞≤sup
x∈Cε|∆fn(x)|+ε4πN∥∆fn(·)⊤∥L∞.
We rewrite this result as
∥∆fn(·)∥L∞≤1
1−4πεNsup
x∈Cε|∆fn(x)|., (43)
and merge it with equation 35, getting
∥∆fn(·)∥L∞≤1
1−4πεNp
log(k/δ) sup
x∈Cε∥x∥Σ−1
nσ′. (44)
(Part 3) This point corresponds to its analog for the univariate case, except
for the fact that, having a feature map ϕN(·)of dimension Nd, the quasi-optimal
design gives ∥x∥Σ−1
n≤p
2Nd/ntot. Replacing this quantity, we get
∥∆fn(·)∥L∞≤1
1−4πεNp
2 log( k/δ)Nd/ntotσ′, (45)
26
Page 27:
and, applying theorem 11,
∀α ∥D(α)∆fn(·)∥L∞≤(4π)|α|N1/2
dN|α|
1−4πεNp
2 log( k/δ)/ntotσ′. (46)
(Part 4) Repeating the same procedure of the univariate case, we ignore con-
stant and logarithmic terms. For any α∈ {0, . . . ν ∗}, it holds
∥D(α)f−D(α)ϕN(·)⊤bθn∥L∞≤ ∥D(α)f−D(α)(Vd,N∗f)∥L∞
+∥D(α)(Vd,N∗f)−D(α)ϕN(·)⊤bθn∥L∞
=∥D(α)f−D(α)(Vd,N∗f)∥L∞+∥D(α)∆fn∥L∞
≲N|α|−ν∥f∥Cν+∥D(α)∆fn∥L∞
(46)
≤N|α|−ν∥f∥Cν+(4π)|α|N1/2
dN|α|
1−4πεNp
2 log( k/δ)/ntotσ′
≲N|α|−ν∥f∥Cν+N1/2
dN|α|n−1/2σ.
Once recalled that Nd=O(Nd), this proves the desired bound:
∥D(α)f−D(α)ϕN(·)⊤bθn∥L∞=eO
N|α|
∥f∥Cν
Nν+Nd/2σ√n!!
.
To conclude the proof, it is sufficient to prove that the number of samples re-
quired is at most n. The queries for the model happen at lines 13 and 14. Moreover,
theorem 9 ensures that the optimal design satisfies |supp(ρ)| ≤8Ndlog log( Nd),
so that the total number of samples is
2X
x∈supp(ρ)⌈ntotρ(x)⌉ ≤2ntot+ 2|supp(ρ)| ≤2ntot+ 8Ndlog log( Nd).
By assumption, both 2ntotand8Ndlog log( Nd)are less than n/2, so that this
number is itself bounded by nas required by the algorithm.
Theorem 5. RunPADUA algorithm 1 for a choice N=O
n1
2ν+d∥f∥2
2ν+d
Cνσ−2
2ν+d
.
Under assumptions 1,2 we have with probability at least 1−δ, for all |α| ≤ν∗,
∀|α| ≤ν∗∥D(α)f(·)−D(α)ϕN(·)⊤bθn∥L∞=eO
n−ν−|α|
2ν+d∥f∥2|α|+d
2ν+d
Cνσ2ν−2|α|
2ν+d
.
(8)
Proof. It is sufficient to replace N=O
n1
2ν+d∥f∥2
2ν+d
Cνσ−2
2ν+d
in the previous
result. Note that the assumption is automatically satisfied as for this choice of N
Nd=eO(nd
2ν+d)<eO(n1).
27
Page 28:
F Proofs from section 5
Before arriving to the actual proof, we have to introduce some lemmas. We start
with a general result about Gaussian distributions.
Lemma 1. LetN(x;ν, σ) :=1√
2πσe−(x−µ)2/(2σ2), the normal density function.
We have, for all v∈R,
Z
RN(x;µ, σ) logN(x;µ, σ)
N(x+v;µ, σ)dx=v2
2σ2.
Proof.
Z
RN(x;µ, σ) logN(x;µ, σ)
N(x+µ;µ, σ)dx=Z
RN(x;µ, σ)(x+v−µ)2
2σ2−(x−µ)2
2σ2
dx
=1
2σ2Z
RN(x;µ, σ)
2v(x−µ) +v2
dx
=v2
2σ2+2v
2σ2Z
RN(x;µ, σ)(x−µ)dx.
The last integral gives exactly zero, as corresponds to the mean of N(·;µ, σ)minus
µ. This ends the proof.
The next lemmas are going to be based on the properties of the standard molli-
fier, a function which has many applications in mathematical analysis.
F.1 The standard mollifier
Letd∈N. We call standard mollifier Evans (2022) the function Rd→R
Ψ(x) :=
e1
∥x∥2
2−1∥x∥2
2<1
0 ∥x∥2
2≥1.(47)
This function is well-known Evans (2022) for being infinitely times differen-
tiable with compact support B1(0). Apart for this property, we need to prove this
very simple lemma.
Lemma 2. LetΨ(·)be defined as in equation (47). Then, for all ν, there is a
constant cν>0such that
∀|α| ≤ν∗∥D(α)Ψ(·)∥L∞≥cν.
Proof. We take straightforwardly,
cν= min
|α|≤ν∗∥D(α)Ψ(·)∥L∞.
Being the minimum over a finite set, to prove cν>0corresponds to proving that
all∥D(α)Ψ(·)∥L∞are not zero. To this aim, note that ∥D(α)Ψ(·)∥L∞= 0would
imply that D(α)Ψ(·) = 0 , which is false as Ψis not a polynomial function.
28
Page 29:
At this point, we define a squeezed version of Ψ, which we call:
ρ <1 : Ψ ρ(x) := Ψ( x/ρ). (48)
About this function, we prove a crucial lemma which we are going to use in
the proof of the lower bound.
Lemma 3. Fixν >0and let cνbe defined as in lemma 2. For any |α| ≤ν∗, we
have
∥D(α)Ψρ(·)∥L∞≥cνρ−|α|
Proof. By the properties of the derivative, we have
∀α D(α)Ψρ(·) =ρ−|α|D(α)Ψ(·).
This proves that
∥D(α)Ψρ(·)∥L∞=ρ−|α|∥D(α)Ψ(·)∥L∞≥cνρ−|α|.
This result can be completed with the following upper bound.
Lemma 4. ∥Ψρ∥Cν≤ ∥Ψ∥Cνρ−ν.
Proof. We first prove that, for every |α|< ν, we have the correct bound on the
infinity norm.
∥D(α)Ψρ∥L∞=ρ−|α|∥D(α)Ψ∥L∞≤ρ−ν∥Ψ∥Cν.
Next, we need to bound Lν(Ψρ). By definition,
Lν(Ψρ) = max
|α|=ν∗sup
x̸=y|D(α)Ψρ(x)−D(α)Ψρ(y)|
|x−y|ν−ν∗
= max
|α|=ν∗sup
x̸=y|ρ−ν∗D(α)Ψ(x/ρ)−ρ−ν∗D(α)Ψ(y/ρ)|
|x−y|ν−ν∗
=ρ−ν∗max
|α|=ν∗sup
x̸=y|D(α)Ψ(x/ρ)−D(α)Ψ(y/ρ)|
|x−y|ν−ν∗
=ρ−ν∗ρ−ν+ν∗max
|α|=ν∗sup
x̸=y|D(α)Ψ(x/ρ)−D(α)Ψ(y/ρ)|
|x/ρ−y/ρ|ν−ν∗
=ρ−νsup
x̸=y|D(α)Ψ(x/ρ)−D(α)Ψ(y/ρ)|
|x/ρ−y/ρ|ν−ν∗
=ρ−νLν(Ψ).
This ends the proof.
With all these lemmas we are able to pass to the actual proof of the lower
bound.
29
Page 30:
F.2 Proof of the lower bound
Theorem 6. Any algorithm outputting an estimation D(α)bfn(·)forD(α)f(·)suf-
fers an error lower-bounded by
∥D(α)f(·)−D(α)bfn(·)∥ ≥Ω
n−ν−|α|
2ν+dψ2|α|+d
2ν+d
0 σ2ν−2|α|
2ν+d
,
w.p. at least 1/4, on a problem instance satisfying assumptions 1 and 2 with
∥f(·)∥Cν≤ψ0.
Proof. LetK∈N. Let us divide [−1,1]dintoKdhypercubes {Qℓ}(K,K,..K )
ℓ=(1,1,...1),
each of side ρ:= 2/Kand center qℓrespectively. By definition of the sampling
process, there is ℓ⋆such that
nX
i=11(xi∈Qℓ⋆)≤n/Kd.
Let us define two problem instances, both affected by a noise N(0, σ2):
f1(·) = 0 f2(·) =(
0 Qc
ℓ⋆
ψ0∥Ψ∥−1
CνρνΨρ(x−qℓ⋆)Qℓ⋆, (49)
We have to verify that both functions are in Cν
p([−1,1]d). For the first function,
the result is trivial, while for the second one, we have that i) no derivative can be
discontinuous on ˙Qℓ⋆or˙Qc
ℓ⋆, as both Ψρ(·)and0are infinitely times differentiable
ii) On the boundaries between the hypercubes all function and all their derivatives
are identically zero, as it follows from the definition of Ψ.
We can now evaluate the norm of both functions. ∥f1(·)∥L∞=∥f1(·)∥Cν= 0,
while for the second function we have, due to lemma 4,
∥f2(·)∥L∞=ψ0∥Ψ∥−1
Cνρν,∥Ψρ∥Cν≤ψ0∥Ψ∥−1
Cνρν∥Ψ∥Cνρ−ν=ψ0.(50)
With this result fixed, we are able to measure the KL divergence between the
two distributions of the data yiunder the true function f1(·)andf2(·)respectively.
Let us call P1(y1, . . . y i, . . . y n)the distribution of the samples if the true function
isf1(·). We have
P1(y1, . . . y i, . . . y n) =nY
i=1N(yi, σ2).
In the same way, for what concerns P2, we have
P2(y1, . . . y i, . . . y n) =nY
i=1N(yi−f2(xi), σ2).
The KL divergence between the two measures corresponds, by definition, to
30
Page 31:
KL(P1, P2) =Z
Rnlog nY
i=1N(yi, σ2)
N(yi−f2(xi), σ2)!nY
i=1N(yi, σ2)dyi (51)
≤Z
Rnlog
n/KdY
i=1N(yi, σ2)
N(yi−f2(xi), σ2)
nY
i=1N(yi, σ2)dyi.(52)
Where the second passage follows from the fact that f2(xi) = 0 for all but at
most n/Kdsamples, which we assume without loss of generality to be the firsts.
The sum can now be written as
KL(P1, P2)≤Z
Rnn/KdX
i=1logN(yi, σ2)
N(yi−f2(xi), σ2)nY
i=1N(yi, σ2)dyi(53)
=n/KdX
i=1Z
RnlogN(yi, σ2)
N(yi−f2(xi), σ2)
N(yi, σ2)dyi, (54)
indeed, all other variables integrate to one. At this point, using lemma 1, we
have
KL(P1, P2)≤n/KdX
i=1Z
RnlogN(yi, σ2)
N(yi−f2(xi), σ2)
N(yi, σ2)dyi (55)
=n/KdX
i=1f2(xi)2
2σ2≤nψ2
0ρ2ν
2Kd∥Ψ∥2
Cνσ2. (56)
From this, it follows that any algorithm Athat classifies between f1andf2has
an error probability at least (Theorem 2.2 Tsybakov (2009))
p1,e≥1−q
(n/Kd)∥Ψ∥−2
Cνψ2
0ρ2ν/(4σ2)
2.
Remembering that in the former we have ρ= 2/K, the previous writes as
p1,e≥1−(2σ∥Ψ∥Cν)−1ψ0p
n/Kd+2ν
2.
Therefore, the previous inequality lets us find the minimum Ksuch that the
probability of making is bounded from below by 1/4:
K≥ψ2
0n
4σ2∥Ψ∥2
Cν 1
d+2ν
=⇒p1,e≥1/4. (57)
31
Page 32:
Now, following a standard argument for lower bounds, note that if no classifier
Acan distinguish the two functions f1, f2with a given probability, no regression
algorithm producing bfncan achieve error less than ∥f1(·)−f2(·)∥/2with the
same probability in anynorm∥ · ∥. Otherwise, a trivial classifier Athat outputs
the function minimizing, between f1, f2, the distance (w.r.t ∥ · ∥) from bfnwould
violate this condition. In our specific case, we are interested in the infinity norm
difference between order αderivatives, which is
∥D(α)f1(·)−D(α)f2(·)∥L∞=ψ0∥D(α)Ψρ(·)∥L∞
≥ψ0∥Ψ∥−1
Cνρνcνρ−|α|=ψ0cν
∥Ψ∥Cνρν−|α|,
where the last passage comes from lemma 3. Replacing ρ= 2/Kwe get
∥D(α)f1(·)−D(α)f2(·)∥L∞≥ψ0cν
∥Ψ∥Cν(K/2)|α|−ν.
Replacing the lower bound for Kthat we got in equation (57) proves that, w.p.
1/4, the error in estimating any derivative αis
∥D(α)f(·)−D(α)bfn(·)∥ ≥ψd+2|α|
d+2ν
0 cν
∥Ψ∥Cν2ν−|α|n
4σ2∥Ψ∥2
Cν|α|−ν
d+2ν
.
This ends the proof.
F.3 Space complexity lower bound
Theorem 7. Any algorithm with optimal statistical complexity for a regression
problem satisfying assumptions 1 and 2 must have a space complexity in the pre-
diction of at least Ω
nd
2ν+d
.
Proof. The proof strongly relies on lemma 5: due to this result, there are J=
Ω(2ε−d/ν)functions in Cν
p([−1,1]d)such that ∥fj∥Cν≤1and
∀j̸=j′∥fj−fj′∥L∞≥ε.
Any algorithm that attains error Jmust at least be able to distinguish these Jin-
stances. Therefore, it has to occupy a number of bits given by log2(J) = Ω( ε−d/ν).
As we have seen in the main paper (theorem 6), the order-optimal guarantee on the
statistical complexity links the error εto the number of samples in the following
way
ε= Ω
n−ν
2ν+d
,
32
Page 33:
meaning that the total number of bits is given by Ω
n−ν
2ν+d−d
ν
= Ω
nd
2ν+d
.
This ends the proof.
Lemma 5. Fixε >0. There is family of functions {fj}J
j=1⊂ Cν
p([−1,1]d)such
that∥fj∥Cν≤1and
∀j̸=j′∥fj(·)−fj′(·)∥L∞≥ε, J = Ω
2ε−d/ν
Proof. For this proof, we are going to use the notation of the previous part of this
section.
Let us divide [−1,1]dintoKdhypercubes {Qℓ}Kd
ℓ=1, each of side ρ:= 2/K
and center qℓrespectively. Define, for every j= [2Kd]the following function:
fj(·) =KdX
ℓ=1bin(j, ℓ)1Qℓ(·)∥Ψ∥−1
CνρνΨρ(· −qℓ), (58)
where bin (j, ℓ)indicates that the binary digit for j(which has at most Kd
digits) in position ℓ. In section F.1 and specifically lemma 4, we have already
proved that
1Qℓ(·)∥Ψ∥−1
CνρνΨρ(· −qℓ)∈ Cν
p([−1,1]d), (59)
with∥·∥Cνnorm bounded by one. Since, in equation (58), we are summing over
functions of this form with disjoint supports, the same result applies here, showing
that∥fj∥Cν≤1. Now, take two fj, fj′forj̸=j′and measure their infinity norm
difference. As j̸=j′, there must be ℓsuch that bin (j, ℓ0)̸=bin(j′, ℓ0).
∥fj(·)−fj′(·)∥L∞≥ ∥fj(·)−fj′(·)∥L∞(Qℓ0)
=∥Ψ∥−1
Cν∥1Qℓ(·)ρνΨρ(· −qℓ)∥L∞
=∥Ψ∥−1
Cνρν∥1Qℓ(·)Ψρ(· −qℓ)∥L∞=∥Ψ∥−1
Cνρν
(the last passage is due to the fact that ∥Ψρ(· −qℓ)∥L∞= 1, as we can simply see
from its definition). Therefore, for ∥Ψ∥−1
Cνρν≥ε, the condition in the statement is
satisfied. In turn, this implies,
ρν≥ ∥Ψ∥Cνε=⇒ρ≥(∥Ψ∥Cνε)1/ν.
Replacing ρ= 2/K, this condition corresponds to K≤2(∥Ψ∥Cνε)1/ν; which is
satisfied by K=⌊2(∥Ψ∥Cνε)−1/ν⌋, i.e.
J= 2Kd= 2⌊2(∥Ψ∥Cνε)−1/ν⌋d= Ω
2ε−d/ν
.
33
Page 34:
G Fundamental results
Theorem 11. LetTN∈TN. Then, ∥T(1)
N(·)∥L∞≤4πN∥TN(·)∥L∞.
Proof. See (https://math.stackexchange.com/users/793534/luis-yanka annalisc)
An analog theorem holds in dimension d >1.
Theorem 12. LetTN∈Td,N. Then, ∥∥∇TN(·)∥∞∥L∞≤4πN∥TN(·)∥L∞.
Proof. By definition of gradient,
∇TN(·) = [∂1TN(·), . . . ∂ dTN(·)],
where ∂j=D(0, . . .1|{z}
j,...0)
correspond to the partial derivative w.r.t. index
j. Note that, considering only the j−th variable, TN(·)is still a trigonometric
polynomial with a degree at most N. Therefore, theorem 11 ensures,
∥∂jTN(·)∥L∞≤4πN∥TN(·)∥L∞.
Using this property, we have
∥∥∇TN(·)∥∞∥L∞=∥max
j∈[d]|∂jTN(·)|∥L∞
= max
j∈[d]∥∂jTN(·)∥L∞≤4πN∥TN(·)∥L∞,
which ends the proof.
34