1 Introduction
Machine learning (ML) is increasingly used today to make predictions of system responses and to aid or guide decision making. Given a dimensional input vector to a system and the corresponding dimensional output vector , datadriven ML algorithms establish a map on the basis of an available sample set of input observations and of the corresponding output values , where and is a noise term. In classification tasks, the output is discrete, that is, it can take at most a countable number of different values (the class labels). In regression tasks, which this paper is concerned with, the output
takes continuous values. Regression techniques include linear regression, neural networks (NN), kernel methods (such as Gaussian processes, GP), sparse kernel machines (such as support vector machines, SVM), graphical models (such as Bayesian networks and Markov random fields), and others. A comprehensive overview of these methods can be found in
Bishop (2009) and in Witten et al. (2016).Current research on ML algorithms focuses on various open issues. First, there is an increasing interest towards problems where the inputs to the system, and as a consequence the system’s response, are uncertain (see, e.g., Chan and Elsheikh (2018); Kasiviswanathan et al. (2016); Mentch and Hooker (2016)
). Properly accounting for both aleatory and epistemic input uncertainties allows one to estimate the response statistics. Second, novel methods are sought to better automatize the tuning of hyperparameters, which several ML methods are highly sensitive to
Snoek et al. (2012). Third, the paucity of data in specific problems calls for models that can be effectively trained on few observations only Forman and Cohen (2004). Finally, complex models, while possibly yielding accurate predictions, are often difficult if not impossible to interpret.This manuscript revisits polynomial chaos expansions (PCE), a well established metamodelling technique in uncertainty quantification (UQ), as a statistical ML Vapnik (1995) regression algorithm that can deal with these challenges Sudret et al. (2015). UQ classically deals with problems where is uncertain, and is therefore modelled as a random vector. As a consequence, is also uncertain, but its statistics are unknown and have to be estimated. Differently from ML, in UQ the model is typically available (e.g., as a finite element code), and can be computed pointwise. However, is often a computationally expensive blackbox model, so that a Monte Carlo approach to estimate the statistics of (generate a large input sample set to obtain a large output sample set ) is not feasible. PCE is an UQ spectral technique that is used in these settings to express as a polynomial function of . PCE thereby allows one to replace the original computational model with an inexpensivetorun but accurate metamodel. The metamodel can be used to derive, for instance, estimates of various statistics of
, such as its moments or its sensitivity to different components of
Saltelli et al. (2000), in a computationally efficient way. PCE was originally developed for independent, identically distributed input data from a Gaussian Wiener (1938) and then a list of named distributions Xiu and Karniadakis (2002), then generalized to marginal and joint distributions with an arbitrary functional form
Soize and Ghanem (2004); Wan and Karniadakis (2006). A momentbased approach that does not require a functional form of the marginals to be available (or even to exist) was proposed in Oladyshkin and Nowak (2012), and further extended to arbitrary joint distributions in Paulson et al. (2018). In Torre et al. (2019), we established a general framework to perform UQ (including but not limited to PCE metamodelling) in the presence of complex input dependencies modelled through copulas.Here, we reestablish PCE in a purely datadriven ML setup, where the goal is to obtain a model that can predict the response of a system given its inputs . Compared to classical UQ settings where PCE is typically used, four major challenges arise in this datadriven setup:

no computational model of the system is available, determining a significant lack of information compared to classical UQ settings;

as a spectral method, PCE requires the knowledge of the joint of the input. We address instead scenarios where is entirely inferred from data, a step that can affect convergence;

to guarantee generality, inference on the marginal distributions is performed here nonparametrically (by kernel smoothing), the rationale being that too little evidence is available to assume any parametric families;

the obtained model needs to be robust to noise in the data, a feature that to our knowledge has not been investigated for PCE yet.
For simplicity, we consider the case where
is a scalar random variable
. The generalisation to multivariate outputs is straightforward. In the setup of interest here, no computational model is available, but only a set of input values and corresponding responses. and are considered to be (possibly noisy) realisations of and , the true relationship between which is deterministic but unknown.After recalling the underlying theory (Section 2), we first show by means of simulation that datadriven PCE delivers accurate pointwise predictions (Section 3). In addition, PCE also enables a reliable estimation of the statistics of the response (such as its moments and ), thus enabling uncertainty quantification of the predictions being made. Dependencies among the input variables are effectively modelled by copula functions, specifically by vine copulas. Vines copulas have been recently investigated for general UQ applications, including PCE metamodeling, in Torre et al. (2019). Here their applicability to datadriven settings is investigated and demonstrated. The full approach is shown to be robust to noise in the data, a feature that has not been investigated yet for PCE, and that enables denoising. Furthermore, the methodology works well in the presence of small training sets. In Section 4, we further apply PCE to real data previously analyzed by other authors with different NN and/or SVM algorithms, where it achieves a comparable performance. Importantly, the construction of the PCE metamodel does not rely on fine tuning of critical hyperparameters. This and other desirable features of the PCE modelling framework are discussed in Section 5.
2 Methods
2.1 Measures of accuracy
Before introducing PCE, which is a popular metamodelling technique in UQ, as an ML technique used to make pointwise predictions, it is important to clarify the distinction between the error types that UQ and ML typically aim at minimizing.
ML algorithms are generally used to predict, for any given input, the corresponding output of the system. Their performance is assessed in terms of the distance of the prediction from the actual system response, calculated for each input and averaged over a large number of (ideally, all) possible inputs. A common error measure in this sense, also used here, is the mean absolute error (MAE). For a regression model trained on a training data set , the MAE is typically evaluated over a validation data set of size by
(1) 
The associated relative error considers point by point the ratio between the absolute error and the actual response, i.e.,
(2) 
which is well defined if for all .
PCE is instead typically used to draw accurate estimates of various statistics of the output – such as its moments, its
, confidence intervals – given a probabilistic model
of an uncertain input. The relative error associated to the estimates of and of is often quantified by(3) 
provided that and . A popular measure of the error made on the full response
is the KullbackLeibler divergence
(4) 
which quantifies the average difference between the logarithms of the predicted and of the true s. In the simulated experiments performed in Section 3, we used the error measures in (3) and (4) to assess the quality of the PCE estimates , , and . The reference solutions , , and were obtained by MCS on samples. The real data analyzed in Section 4, instead, contained too few points to draw sufficiently accurate reference values. Statistical errors for these data could not be quantified.
Provided a suitable model of the joint of the input, PCE is known to converge (in the meansquare sense) very rapidly as the size of the training set increases, compared for instance to MonteCarlo sampling Puig et al. (2002); Todor and Schwab (2007); Ernst et al. (2012). This happens because PCE, which is a spectral method, efficiently propagates the probabilistic information delivered by the input model to the output (as explained in more details next). Nevertheless, this does not necessarily imply a similarly rapid pointwise convergence of the error, which remains to be demonstrated. Also, the speed of meansquare convergence will generally worsen if is unknown and needs to be inferred from data. For a discussion of various types of convergence, see Ernst et al. (2012) and references therein.
2.2 PCE in datadriven settings
PCE is designed to build an inexpensivetoevaluate analytical model mapping an input random vector onto an output random variable Ghanem and Spanos (1990); Xiu and Karniadakis (2002). PCE assumes that an unknown deterministic map exists, such that .
is additionally assumed to have finite variance:
. Under the further assumption that each has finite moments of all orders Ernst et al. (2012), the space of square integrable functions with respect to admits a basisof polynomials orthonormal to each other with respect to the probability measure
, i.e., such that(5) 
Here, is the Kronecker delta symbol. The element of the multiindex indicates the degree of in the th variable, . has a total degree given by .
Thus, admits the spectral representation
(6) 
The goal of PCE regression is to determine the coefficients of the expansion, truncated at some polynomial order, given an initial set of observations (the training set or experimental design). The advantage of expressing the functional relationship between the inputs and the output by an orthonormal multivariate basis lies in its spectral decay properties. Indeed, given that finite variance models/phenomenons are considered, the orthonormality of the basis implies that the squared sum of the PCE coefficients (equal to the variance, see Section 2.3.5) is finite. Therefore, the orthonormal basis guarantees a rapidly decaying spectrum of the PCE coefficients. This is turn reduces the number of parameters in the PCE representation, yielding a compressed representation. The orthonormal basis is thus paramount to reduce the number of regression coefficients and thus to avoid overfitting on finite datasets. The superiority of the orthonormal basis with respect to a nonorthogonal one in these settings is demonstrated on simulated data in Section 3.4.
In engineering applications the model is often directly available (e.g., as a finite element model) but computationally expensive to evaluate: in these cases, PCE is used as a surrogate to replace the true model with an inexpensivetoevaluate metamodel. In the standard ML settings considered here, conversely, is unknown and has to be modelled solely on the basis of available observations . The primary goal of this paper is to show that the accuracy of a PCE metamodel built purely on can compete with that of stateoftheart machine learning regression models, while requiring little parameter tuning and in some cases offering additional advantages.
2.3 PCE for independent inputs
In this section, we assume that the input random vector
has statistically independent components. The PCE basis is built from the tensor product of univariate orthogonal polynomials. The case of inputs with dependent components is discussed later in Section
2.4.2.3.1 Orthogonalisation for independent inputs
When has independent components, the can be obtained as the tensor product of univariate polynomials,
(7) 
where each is the element of degree of a basis of univariate polynomials orthonormal with respect to the marginals of , that is, such that
(8) 
The problem of building a basis of mutually orthonormal multivariate polynomials with respect to hence becomes the problem of building bases of mutually orthonormal univariate polynomials, each with respect to a marginal . The bases can be built independently.
2.3.2 Specification of the marginals and PCE construction
The construction of the polynomial basis requires a model of the marginal input distributions. Families of univariate orthonormal polynomials associated to classical continuous distributions are described in Xiu and Karniadakis (2002). An input variable with a different distribution (continuous, strictly increasing) may be transformed into a random variable with distribution belonging to one of the classical families by the transformation
(9) 
This relation offers a first possible (but usually not recommended) approach to build the PCE of for inputs with generic continuous marginals : transform into through (9), and then build a PCE of in terms of . The PCE has to be trained on , where is obtained from using (9).
A second approach, undertaken in this study, consists in directly constructing a basis of orthonormal polynomials with respect to , by Stiltjes or GramSchmidt orthogonalisation Soize and Ghanem (2004); Wan and Karniadakis (2006). In practice, performing PCE on the original input is often preferable to building the PCE on inputs transformed via (9
). Indeed, the latter is usually a highly nonlinear transformation, making the map from
to more difficult to approximate by polynomials Oladyshkin and Nowak (2012). In the applications presented in this paper, we opt for nonparametric inference of the input marginalsby kernel density estimation (KDE)
Rosenblatt (1956); Parzen (1962). Given a set of observations of , the kernel density estimate of reads(10) 
where is the kernel function and is the appropriate kernel bandwidth that is learned from the data. Different kernels are used in practice, such as the Epanechnikov or Gaussian kernels. Here we use the latter, that is, is selected as the standard normal .
A third approach for PCE construction, first introduced in Oladyshkin and Nowak (2012) under the name of arbitrary PCE (aPCE), consists in constructing a basis of polynomials orthonormal to the input moments (not to the input directly). The method was first introduced for independent input variables and later extended to correlated variables in Paulson et al. (2018). aPCE offers the advantage to be more general, as it requires only the input moments and does not need (or even assume the existence of) a functional form of the input . An accurate estimation of moments of higher order, however, requires a large number of input samples, thus effectively limiting its applicability in the settings considered here Oladyshkin and Nowak (2018). For this reason, we opt instead for the second approach outlined above.
After estimating the input marginals by KDE, we resort to PCE to build a basis of orthonormal polynomials. The PCE metamodel is then trained on .
2.3.3 Truncation schemes
The sum in (6) involves an infinite number of terms. For practical purposes, it is truncated to a finite sum. Truncation is a critical step: a wrong truncation will lead to underfitting (if too many terms are removed) or to overfitting (if too many terms are retained compared to the number of available training points). The model complexity can be controlled by various truncation schemes.
The standard basis truncation Xiu and Karniadakis (2002) retains the subset of terms defined by
where is the chosen maximum polynomial degree and is the total degree of . Thus, only polynomials of degree up to are considered. contains elements. To further reduce the basis size, several additional truncation schemes have been proposed. Hyperbolic truncation Blatman and Sudret (2011) retains the polynomials with indices in
where and is the norm defined by .
A complementary strategy is to set a limit to the number of nonzero elements in , that is, to the number of interactions among the components of in the expansion. This maximum interaction truncation scheme retains the polynomials with indices in
where .
In our case studies below we combined hyperbolic and maximum truncation, thus retaining the expansion’s polynomials with coefficients in
(11) 
This choice is motivated by the sparsity of effect principle, which assumes that only few meaningful interactions influence system responses and which holds in most engineering applications. The three hyperparameters can all be automatically tuned by crossvalidation within preassigned ranges, as illustrated in Blatman and Sudret (2011). To reduce the already high computational cost due to the large number of simulations performed in the current study, we fixed and tuned and only, within preassigned ranges and . and were selected depending on the amount of training data available. Starting from simpler models (lower and ), more complex models were favored if yielding a lower crossvalidation error. The calibration of each parameter stopped either if no increase in performance was attained in two consecutive steps, or if the maximum allowed value was reached.
2.3.4 Calculation of the coefficients
Selected a truncation scheme and the corresponding set of multiindices, the coefficients in
(12) 
need to be determined on the basis of a set of observed data. In these settings, the vector of expansion coefficients can be determined by regression.
When the number of regressors is smaller than the number of observations,
can be determined by solving the ordinary least squares (OLS) problem
(13) 
The solution reads
(14) 
where , , .
The OLS problem cannot be solved when , because in this case the matrix in (14) is not invertible. Also, OLS tends to overfit the data when is large. Simpler models with fewer coefficients can be constructed by sparse regression.
Least angle regression (LAR), proposed in Efron et al. (2004), is an algorithm that achieves sparse regression by solving the regularised regression problem
(15) 
The last addendum in the expression is the regularisation term, which forces the minimisation to favour sparse solutions. One advantage of using LAR is that it does not require explicit optimization with respect to the parameter, which in turn is never explicitly calculated. The use of LAR in the context of PCE was initially proposed in Blatman and Sudret (2011), which the reader is referred to for more details. In the applications illustrated in Sections 3 and 4, we adopt LAR to determine the PCE coefficients.
2.3.5 Estimation of the output statistics
Given the statistical model of the input and the PCE metamodel (12) of the inputoutput map, the model response is not only known pointwise, but can also be characterised statistically. For instance, the orthonormality of the polynomial basis ensures that the mean and the variance of read, respectively,
(16) 
This property provides a useful interpretation of the expansion coefficients in terms of the first two moments of the output. Other output statistics, such as the Sobol partial sensitivity indices Sobol’ (1993), can be obtained from the expansion coefficients analytically Sudret (2008).
Higherorder moments of the output, as well as other statistics (such as the full ), can be efficiently estimated through MonteCarlo simulation, by sampling sufficiently many realisations of and evaluating the corresponding PCE responses. Polynomial evaluation is computationally cheap and can be trivially vectorised, making estimation by resampling extremely efficient.
2.4 PCE for mutually dependent inputs
If the input vector has statistically dependent components, its joint is not the product of the marginals and the orthogonality property (5) does not hold. As a consequence, one cannot construct multivariate orthogonal polynomials by tensor product of univariate ones, as done in (7). Constructing a basis of orthogonal polynomials in this general case is still possible, for instance by GramSchmidt orthonormalisation NAVARRO et al. (2014). However, the procedure becomes more and more computationally demanding as the number of inputs or the expansion order grow, as it involves the numerical evaluation of higherdimensional integrals. For this reason, we investigate two alternative strategies.
The first strategy consists in ignoring the input dependencies and in building a basis of polynomials orthonormal with respect to by arbitrary PCE. This approach is labelled from here on as aPCEonX. While not forming a basis of the space of square integrable functions with respect to , the considered set of polynomials may still yield a similarly good pointwise approximation.
Accurate estimates of the output statistics may be obtained a posteriori by modelling the input dependencies through copula functions, as detailed in A. The joint of a random vector with copula distribution and marginals (here, obtained by KDE) is given by (1). Resampling from such a probabilistic input model yields more accurate estimates of the output statistics than resampling from the distribution , that is, than ignoring dependencies.
The second possible approach, presented in more details in B, consists in modelling directly the input dependencies by copulas, and then in transforming the input vector into a set of independent random variables with prescribed marginals (e.g., standard uniform) through the associated Rosenblatt transform Rosenblatt (1952), defined in (18). The PCE metamodel is then built between the transformed input vector and the output
. When the input marginals are standard uniform distributions, the corresponding class of orthonormal polynomials is the Legendre family, and the resulting PCE is here indicated by
lPCEonZ. The asymptotic properties of an orthonormal PCE, such as the relations (16), hold. The expression of in terms of is given by the relation (1).At first, the second approach seems advantageous over the first one. It involves, in a different order, the same steps: modelling the input marginals and copula, and building the PCE metamodel. By capturing dependencies earlier on, it enables the construction of a basis of mutually orthonormal polynomials with respect to the inferred joint of the (transformed) inputs. It thereby provides a model with spectral properties (except for unavoidable errors due to inference, truncation of the expansion, and estimation of the expansion parameters). The experiments run in Section 3, however, show that the first approach yields typically more accurate pointwise predictions (although not always better statistical estimates). This is due to the fact that it avoids mapping the input into independent variables via the (typically strongly nonlinear) Rosenblatt transform. The shortcomings of mapping dependent inputs into independent variables were noted, in classical UQ settings, by Eldred et al. (2008), and are thus extended here to datadriven settings. For a more detailed discussion, see Section 3.5.
3 Validation on synthetic data
3.1 Validation scheme
We first investigate datadriven regression by PCE on two different simulated data sets. The first data set is obtained through an analytical, highly nonlinear function of three variables, subject to three uncertain inputs. The second data set is a finite element model of a horizontal truss structure, subject to uncertain inputs. In both cases, the inputs are statistically dependent, and their dependence is modelled through a canonical vine (Cvine) copula (see A).
We study the performance of the two alternative approaches described in Section 2.4: aPCEonX and lPCEonZ. In both cases we model the input copula as a Cvine, fully inferred from the data. For the aPCEonX the choice of the copula only plays a role in the estimation of the output statistics, while for the lPCEonZ it affects also the pointwise predictions. We additionally tested the performance obtained by using the true input copula (known here because it was used to generate the synthetic data) and a Gaussian copula inferred from data. We also investigated the performance obtained by using the true marginals instead of the ones fitted by KDE. Using the true copula and marginals yielded better statistical estimates, but is of little practical interest, as it would not be known in real applications. The Gaussian copula yielded generally similar or slightly worse accuracy. For brevity, we do not show these results.
To assess the pointwise accuracy, we generate a training set of increasing size , build the PCE metamodels both by aPCEonX and by lPCEonZ, and evaluate their rMAE on a validation set of fixed size points. Both and
are sampled from the probabilistic input model with true marginals and vine copula. The statistical accuracy is quantified instead in terms of error on the mean, standard deviation, and full
(by the KullbackLeibler divergence) of the true models, as defined in (2)(4). Reference solutions are obtained by MC sampling from the true input model, while the PCE estimates are obtained by resampling from the inferred marginals and copula. The sample size is in both cases. The statistical estimates obtained by the two PCE approaches are furthermore compared to the corresponding sample estimates obtained from the same training data (sample mean, sample standard deviation, and KDE of the ).For each value of and each error type, error bands are obtained as the minimum to maximum error obtained across different realisations of and . The minimum error is often taken in machine learning studies as an indication of the best performance that can be delivered by a given algorithm in a given task. The maximum error represents analogously the worstcase scenario.
Finally, we assess the robustness to noise by adding a random perturbation to each output observations in
used to train the PCE model. The noise is drawn independently for each observation from a univariate Gaussian distribution with mean
and prescribed standard deviation .3.2 Ishigami function
The first model we consider is
(17) 
where
(18) 
is the Ishigami function Ishigami and Homma (1990), defined for inputs and taking values in . The Ishigami function is often used as a test case in global sensitivity analysis due to its strong nonlinearity and nonmonotonicity. Model (17) is rescaled here to take values in the interval . Rescaling does not affect the approximation accuracy of the PCE, but enables a meaningful evaluation of the rMAE (2) by avoiding values of close to .
We model the input as a random vector with marginals , , and Cvine copula with density
Here, and are the densities of the pairwise Gumbel and t copula families defined in rows and of Table A.4. Thus, correlates positively with both and . and are also positively correlated, but conditionally independent given . The joint of can be obtained from its marginals and copula through (1). The of in response to input , its mean and its standard deviation , obtained on sample points, are shown in the left panel of Figure 1.
Next, we build the aPCEonX and the lPCEonZ on training data , and assess their performance as described in Section 3.1. The errors are shown in Figure 2 (red: aPCEonX; green: lPCEonZ), as a function of the training set size . The dotted line indicates the average error over the repetitions, while the shaded area around it spans the range from the minimum to the maximum error across repetitions. The aPCEonX yields a considerably lower rMAE. This is due to the strong nonlinearity of the Rosenblatt transform used by the lPCEonZ to decouple the components of the input data. Importantly, the methodology works well already in the presence of relatively few data points: the pointwise error and the KullbackLeibler divergence both drop below when using as few as data points. The central panel of Figure 1 shows the histogram obtained from output observations of one training set, the true (black), and the s obtained by resampling from the aPCEonX and the lPCEonZ built on that training set. The statistics of the true response are better approximated by the aPCEonX than by the lPCEonZ or by sample estimation (blue solid lines in Figure 2).
Finally, we examine the robustness of aPCEonX to noise. We perturb each observation in by adding noise drawn from a Gaussian distribution with mean and standard deviation . is assigned as a fixed proportion of the model’s true mean : , , and of (corresponding to , , and of , respectively). The results, shown in Figures 34, indicate that the methodology is robust to noise. Indeed, the errors of all types are significantly smaller than the magnitude of the added noise, and decrease with increasing sample size (see Figure 3). For instance, the rMAE for drops to if or more training points are used. The error on is minorly affected, which is expected since the noise is unbiased. More importantly, and can be recovered with high precision even in the presence of strong noise (see also Figure 1, fourth panel). In this case, the PCE predictor for the standard deviation works significantly better than the sample estimates, as illustrated in Figure 4.
3.3 23bar horizontal truss
We further replicate the analysis carried out in the previous section on a more complex finite element model of a horizontal truss Blatman and Sudret (2011). The structure consists of bars connected at upper nodes, and is meters long and meters high (see Figure 5). The bars belong to two different groups (horizontal and diagonal bars), both having uncertain Young modulus and uncertain crosssectional area , :
where is the univariate lognormal distribution with mean and standard deviation .
The four variables can be considered statistically independent, and their values influence the structural response to loading. An additional source of uncertainty comes from six random loads the truss is subject to, one on each upper node. The loads have Gumbel marginal distribution with mean and standard deviation :
(19) 
where , , and is the Euler–Mascharoni constant. In addition, the loads are made mutually dependent through the Cvine copula with density
(20) 
where each is the density of the paircopula between and , , and belongs to the Gumbel–Hougaard family defined in Table A.4, row 11.
The presence of the loads causes a downward vertical displacement at the mid span of the structure. is taken to be the system’s uncertain response to the dimensional random input consisting of the structural variables and the loads. The true statistics (mean, standard deviation, ) of are obtained by MCS over sample points, and are shown in the left panel of Figure 6.
We analyse the system with the same procedure undertaken for the Ishigami model: we build aPCEonX and lPCEonZ on each of training sets of increasing size , and validate their performance. The pointwise error is evaluated on validation sets of fixed size , while the statistical errors are determined by large resampling.
The results are shown in Figure 7. Both PCEs exhibit high performance, yet the aPCEonX yields a significantly smaller pointwise error (first panel). The lPCEonZ yields a better estimate of the standard deviation, yet the empirical estimates obtained from the training data are the most accurate ones in this case.
Having selected the aPCEonX as the better of the two metamodels, we further assess its performance in the presence of noise. We perturb the response values used to train the model by adding Gaussian noise with increasing standard deviation , set to , , and of (equivalent to , , and of , respectively). The results are shown in Figures 89. The errors of all types are significantly smaller than the magnitude of the added noise, and decrease with increasing sample size for all noise levels (Figure 8). Also, the PCE estimates are significantly better than the sample estimates (Figure 9; see also Figure 6, fourth panel).
3.4 Regularization without orthonormalization
As discussed in the Section 2.3, PCE first determines a basis of polynomials which are mutually orthonormal with respect to the input PDF, then expresses the system response as a sum of such polynomials. The expansion coefficients are determined, in large dimension and in the examples above, by sparse regression. Here in particular, LAR was selected as a method to solve sparse regression.
In the presence of correlations among the inputs, we further explored two strategies. The first one, lPCEonZ, transforms the inputs into statistically independent random variables, and then builds an orthonormal basis with respect to the new variables. The second strategy, aPCEonX, ignores the input dependencies (copula) and builds a basis orthonormal with respect to the product of their marginals. aPCEonX ultimately outperforms lPCEonZ in terms of pointwise accuracy, because it avoids highly nonlinear transformations that make the expansion spectrum decay slower and thus the polynomial representation less efficient.
An even simpler alternative to aPCEonX would be to ignore not only the input copula, but its marginal distributions as well, and to use any nonorthonormal polynomial basis. Various parametric and nonparametric schemes exist to this end. We consider three different ones, which we apply to the Ishigami and horizontal truss models. This investigation allows us to quantify the benefit of basis orthonormality.
The first of such methods is LASSO Tibshirani (1996), which directly solves the problem (15). In standard LASSO the complete basis comprises all and only monomials of the type , . In practice though, the number of different monomials is reduced by neglecting interactions, that is, by considering only monomials of the type . Further simplification is typically achieved by considering only the linear terms. It is clear that, in such settings, LASSO would miss the complex interactions that characterize both the Ishigami and the horizontal truss systems. Indeed, using the standard Matlab implementation of linear LASSO (with fold crossvalidation), we could not achieve a better accuracy than for either case.
A recent improvement of LASSO is SALSA Kandasamy and Yu (2016)
. The method considers interactions among the input variables, and automatically selects the relevant terms in the expansion by an approach analogous to kernel ridge regression. We analyse the Ishigami and horizontal truss data with SALSA, using the Matlab implementation provided by the authors at
github.com/kirthevasank/salsa. The results are illustrated in Figure 10. While performing overall better than LASSO, SALSA (cyan curve) fails to capture the highly nonlinear behavior of the Ishigami function, and performs overall worse than aPCEonX (red) both on the Ishigami and the horizontal truss models.The third attempt to solve sparse regression using a nonorthonormal basis is by performing PCE on the data using Legendre polynomials. We refer to this approach as lPCEonX. Legendre polynomials are the family of orthonormal polynomials with respect to uniform distributions in an interval . We apply lPCEonX to the horizontal truss model only, as for the Ishigami model  whose inputs are truly uniform  it would correspond to the noninteresting case of assuming the correct marginals. We obtain the interval for each input variable from the training data, as the observation range enlarged by both to the left and to the right. The results are shown in Figure 10, right, orange curve. lPCEonX performs similarly to SALSA, and consistently worse than aPCEonX.
These comparisons demonstrate numerically the benefit of performing regression using an orthonormal basis, which ensures, due to finite variance of the response, better compression and thus more rapid convergence.
3.5 Preliminary conclusion
The results obtained in the previous section allow us to draw some important preliminary conclusions on datadriven PCE. The methodology:

delivers reliable predictions of the system response to multivariate inputs;

produces reliable estimates of the response statistics if the input dependencies are properly modelled, as done here through copulas (for aPCEonX: aposteriori);

works well already when trained on few observations;

deals effectively with noise, thus providing a tool for denoising;

involves only few hyperparameters, which have a clear interpretation and are easily tuned by crossvalidation;
In order to build the expansion when the inputs are mutually dependent, we investigated two alternative approaches, labelled as lPCEonZ and aPCEonX. Of the two strategies, aPCEonX appears to be the most effective one in purely datadriven problems. It is worth mentioning, though, that lPCEonZ may provide superior statistical estimates if the joint distribution of the input is known with greater accuracy than in the examples shown here. This was the case, for instance, when we replaced the inferred marginals and copula used to build the lPCEonZ with the true ones (not shown here): in both examples above, we obtained more accurate estimates of , , and (but not better pointwise predictions) than using aPCEonX.
The simulation results show a mismatch between the fast speed of MAE convergence and the slower convergence speed of the other metrics. Such mismatch is explained by the datadriven setup. To see this, let us start from an extreme example. Consider the ideal case where the true model is an exact polynomial. Inferring the input joint from data (first step of the analysis) will generally lead to an inferred pdf . The bases associated to and will thus also differ. Nevertheless, performing regression on the basis associated to may still lead to an exact expansion, that is, to a PCE model that is identical to the true polynomial. In this ideal scenario, the MAE would be . Yet, estimating the output statistics would involve resampling input points from the wrong pdf . The error would be propagated by the correct PCE, and yield positive errors in the statistical estimates despite the PCE model being exact. The numerical examples shown here are not so extreme: the original model is not a polynomial and the obtained PCE only approximates . Still, it is a good approximation, in the sense that the resulting MAE is very small even if the wrong input (and therefore a basis which is not exactly orthonormal) is used. The output statistics instead have a larger estimation error, because their accuracy is more strongly affected by errors in the inferred input distribution.
4 Results on real data sets
We now demonstrate the use of aPCEonX on three different real data sets. The selected data sets were previously analysed by other authors with different machine learning algorithms, which establish here the performance benchmark.
4.1 Analysis workflow
4.1.1 Statistical input model
The considered data sets comprise samples made of multiple input quantities and one scalar output. Adopting the methodology outlined in Section 2, we characterize the multivariate input statistically by modelling its marginals through KDE, and we then resort to arbitrary PCE to express the output as a polynomial of . The basis of the expansion thus consists of mutually orthonormal polynomials with respect to , where is the marginal inferred for .
4.1.2 Estimation of pointwise accuracy
Following the pointwise error assessment procedure carried out in the original publications, for the case studies considered here we assess the method’s performance by crossvalidation. Standard fold crossvalidation partitions the data into subsets, trains the model on of those (the training set ), and assesses the pointwise error between the model’s predictions and the observations on the th one (the validation set ). The procedure is then iterated over all possible combinations of training and validation sets. The final error is computed as the average error over all validation sets. The number of data subsets is chosen as in the reference studies. Differently from the synthetic models considered in the previous section, the true statistics of the system response are not known here, and the error on their estimates cannot be assessed.
A variation on standard crossvalidation consists in performing a fold cross validation on each of multiple random shuffles of the data. The error is then typically reported as the average error obtained across all randomisations, ensuring that the final results are robust to the specific partitioning of the data in its subsets. In the following, we refer to a fold cross validation performed on random permutations of the data (i.e. random fold partitions) as an fold randomised crossvalidation.
4.1.3 Statistical estimation
Finally, we estimate for all cases studies the response aposteriori (AP) by resampling. To this end, we first model their dependencies through a Cvine copula . The vine is inferred from the data as detailed in A.3. Afterwards, resampling involves the following steps:

sample points from . We opt for Sobol’ quasirandom lowdiscrepancy sequences Sobol’ (1967), and set ;

map by the inverse Rosenblatt transform of ;

map by the inverse probability integral transform of each marginal . is a sample set of input observations with copula and marginals ;

evaluate the set of responses to the inputs in .
The of is estimated on by kernel density estimation.
4.2 Combinedcycle power plant
The first real data set we consider consists of data points collected from a combinedcycle power plant (CCPP) over years (20062011). The CCPP generates electricity by gas and steam turbines, combined in one cycle. The data comprise ambient variables and the energy production , measured over time. The four ambient variables are the temperature , the pressure and the relative humidity measured in the gas turbine, and the exhaust vacuum measured in the steam turbine. All five quantities are hourly averages. The data are available online Lichman (2013).
The data were analysed in Tüfekci (2014) with different ML techniques, including regression trees, different types of neural networks (NNs), and support vector regression (SVR), to predict the energy output based on the measured ambient variables. The authors assessed the performance of each method by fold randomised crossvalidation, yielding a total of pairs of training and validation sets. Each set contained instances. The best learner among those tested by the authors was a bagging reduced error pruning (BREP) regression tree. Specifically, the model was obtained as the average response (“bagging”) of pruned regression trees, each one built on a subset of the full data obtained by bootstrapping. Each pruned regression tree was obtained as a regression tree that was then iteratively simplified by reduced error pruning. The final model yielded a mean MAE of (see their Table 10, row 4). The lowest MAE of this model over the validation sets, corresponding to the “best” validation set, was indicated to be . Besides providing an indicative lower bound to the errors, the minimum gives, when compared to the means, an indication of the variability of the performance over different partitions of the data. The actual error variance over the validation sets was not provided in the mentioned study.
MAE  min. MAE  meanmin  rMAE ()  

aPCEonX  3.11 0.03  3.05  0.06  0.68 0.007 
BREP Tüfekci (2014)  3.22 n.a.  2.82  0.40  n.a. 
We analyze the very same training sets by PCE. The results are reported in Table 1. The average MAE yielded by the aPCEonX is slightly smaller than that of the BREP model. More importantly, the difference between the average and the minimum error, calculated over the validation sets, is significantly lower with our approach, indicating a lower sensitivity of the results to the partition of the data, and therefore a higher reliability in the presence of random observations. The average error of the PCE predictions relative to the observed values is below .
4.3 Boston Housing
The second real data set used to validate the PCEbased ML method concerns housing values in the suburbs of Boston, collected in 1970. The data set, downloaded from Lichman (2013), was first published in Harrison and Rubinfeld (1978), and is a known reference in the machine learning and data mining communities.
The data comprise instances, each having attributes. One attribute (the proximity of the neighborhood to the Charles river) is binaryvalued and is therefore disregarded in our analysis. Of the remaining attributes, one is the median housing value of owneroccupied homes in the neighbourhood, in thousands of (MEDV). The remaining attributes are, in order: the per capita crime rate by town (CRIM), the proportion of residential land zones for lots over 25,000 sq.ft. (ZN), the proportion of nonretail business acres per town (INDUS), the nitric oxides concentration, in parts per 10 million (NOX), the average number of rooms per dwelling (RM), the proportion of owneroccupied units built prior to 1940 (AGE), the weighted distances to five Boston employment centres (DIS), the index of accessibility to radial highways (RAD), the fullvalue propertytax rate per 10,000 (TAX), the pupilteacher ratio by town (PTRATIO), the index , where Bk is the proportion of black residents by town, and the lower status of the population (LSTAT).
The data were analysed in previous studies with different regression methods to predict the median house values MEDV on the basis of the other attributes Can (1992); Gilley and Pace (1995); Quinlan (1993); R Kelley Pace (1997). The original publication itself Harrison and Rubinfeld (1978)
was concerned with determining whether the demand for clean air affected housing prices. The data were analysed with different supervised learning methods in
Quinlan (1993). Among them, the best predictor was shown to be an NN model combined with instancebased learning. The NN consisted of a single hidden layer and an output unit. The weights were regularised by a penalty coefficient. The number of hidden units and the penalty coefficients were first optimised by threefold cross validation on the first training set, and each weight was then finally selected as the best of values randomly sampled within . The final NN yielded (rMAE: ) on a fold crossvalidation.We model the data by PCE and quantify the performance by randomised crossvalidation. The results are summarised in Table 2. The errors are comparable to the NN model with instancebased learning in Quinlan (1993). While the latter yields the lowest absolute error, the aPCEonX achieves a smaller relative error. In addition, it does not require the fine parameter tuning that affects most NN models. Finally, we estimate the of the median house value as described in Section 4.1.3. The results are shown in Figure 12.
4.4 Wine quality
The third real data set we consider concerns the quality of wines from the vinho verde region in Portugal. The data set consists of red samples and white samples, collected between 2004 and 2007. The data are available online at http://www3.dsi.uminho.pt/pcortez/wine/. Each wine sample was analysed in laboratory for physicochemical parameters: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol. In addition, each sample was given a quality score based on blinded sensory tests from three or more sensory assessors. The score is the median of the grades (integers between 0 and 10) assigned by each assessor.
The data were previously analysed in Cortez et al. (2009) to predict the wine quality score on the basis of the physicochemical parameters. The study compared various ML algorithms, namely multiple regression, a singlelayer NN, and linear support vector machine (SVM, Schölkopf and Smola (2002)) regression. The NN consisted of a single hidden layer with logistic transfer function and an output node with linear function. The number of hidden nodes and the weights were automatically calibrated by crossvalidation. For linear SVM regression, a Gaussian kernel was selected, with kernel width tuned by crossvalidation. The remaining parameters were fixed to arbitrary values to reduce the computational cost. SVM regression outperformed the other methods (at the expense of a considerably higher computational cost), yielding the lowest MAE, as assessed by means of fold randomised crossvalidation.
We model the data by aPCEonX, and round the predicted wine ratings, which take continuous predicted values, to their closest integer. The performance is then assessed through the same crossvalidation procedure used in Cortez et al. (2009). The results are reported in Table 3. The MAE of aPCEonX is comparable to that of the SVM regressor, and always lower than the best NN.
red wine:  white wine:  

MAE  rMAE ()  MAE  rMAE ()  
aPCEonX  0.44 0.03  8.0 0.6  0.50 0.02  8.8 0.3 
SVM in Cortez et al. (2009)  0.46 0.00  n.a.  0.45 0.00  n.a. 
Best NN in Cortez et al. (2009)  0.51 0.00  n.a.  0.58 0.00  n.a. 
Finally, our framework enables the estimation of the of the wine rating as predicted by the PCE metamodels. The resulting s are shown in Figure 13. One could analogously compute the conditional s given by fixing any subset of inputs to given values (e.g., the residual sugar or alcohol content, which can be easily controlled in the wine making process). This may help, for instance, predicting the wine quality for fixed physicochemical parameters, or choosing the latter so as to optimize the wine quality or to minimize its uncertainty. This analysis goes beyond the scope of the present work.
5 Discussion and conclusions
We proposed an approach to machine learning (ML) that capitalises on polynomial chaos expansion (PCE), an advanced regression technique from uncertainty quantification. PCE is a popular spectral method in engineering applications, where it is often used to replace expensivetorun computational models subject to uncertain inputs with an inexpensive metamodel that retains the statistics of the output (e.g., moments, ). Our paper shows that PCE can also be used as an effective regression model in purely datadriven problems, where only input observations and corresponding system responses  but no computational model of the system  are available. This setup poses various challenges that PCE does not have to deal with in classical uncertainty quantification problems. Namely, the training set is fixed and cannot be enriched, the input joint distribution (needed to build the spectral basis and to draw statistical estimates) is unknown and has to be entirely inferred from the available data, and the data may be noisy.
We tested the performance of PCE on simulated data first, and then on real data by crossvalidation. The reference performance measure was the mean absolute error of the PCE metamodel over all test data. The simulations also allowed us to assess the ability of PCE to estimate the statistics of the response (its mean, standard deviation, and ) in the considered datadriven scenario. Both the pointwise and the statistical errors of the methodology were low, even when relatively few observations were used to train the model. Importantly, high performance was still obtained in the presence of strong noise in the simulated data. PCE can thus be used for denoising, a feature that had not been previously investigated. Finally, the numerical experiments demonstrated the superiority of PCE, which uses as expansion basis the set of orthonormal polynomials with respect to the product of the input marginals, over methods that perform sparse regression over a nonorthonormal basis, such as LASSO Tibshirani (1996), SALSA Kandasamy and Yu (2016), or PCE built enforcing a different basis.
The applications to real data showed a comparable, and sometimes slightly superior, performance to that of other ML methods used in previous studies, such as different types of neural networks and support vector machines. In general, the quality of the approximations will depend on the magnitude of the truncation error, which is dictated by the speed of the decay of the spectrum. The amount of data available also limits in practice the attainable expansion order, as already observed in Oladyshkin and Nowak (2018).
PCE offers several advantages over other established ML regression techniques. First, the framework performs well on very different tasks, with only little parameter tuning needed to adapt the methodology to the specific data considered. The quadratic nature of the PCE regression problem enables direct optimisation, compared for instance to the cumbersome iterative training/calibration procedure in NNs. For PCE, only the total degree , the norm parameter and the interaction degree are to be specified. Automatic calibration of these parameters within preassigned ranges is straightforward. Specifically, as a single analysis takes only a few seconds to a minute to be completed on a standard laptop (depending on the size of the data), it is possible to repeat the analysis over an array of allowed values, and to retain the PCE with minimum error in the end. In our analyses we tuned the parameters and in this way, while setting (not optimized, in order to reduce the total computational cost). This feature distinguishes PCE from the above mentioned ML methods, which instead are known to be highly sensitive to their hyperparameters and require an appropriate and typically time consuming calibration Claesen and Moor (2015). Indeed, it is worth noting that, in the comparisons we made, all PCE metamodels were built by using the very same procedure, including the automatic hyperparameter calibration. When compared to the best NNs or SVM found in other studies, which differed significantly from each other in their construction and structure, the PCE metamodels exhibited a comparable performance.
Second, PCE delivers not only accurate pointwise predictions of the output for any given input, but also statistics thereof in the presence of input uncertainties. This is made possible by combining the PCE metamodel with a proper probabilistic characterisation of the input uncertainties through marginal distributions and copulas, and then by resampling from the obtained input model. The result is not trivial: theory motivates the spectral convergence of PCE in classical UQ settings (see Section 2.2). In the datadriven setup considered here, however, the joint of the input is unknown and has to be inferred (see also Section 3.5), a step which generally harms convergence. As a further difficulty, we addressed the worstcase scenario where parametric inference is not possible due to missing information, and proceeded instead by kernel density estimation. Despite these challenges, quite accurate statistics were obtained. The methodology works well also in the presence of several inputs (as tested on simulated problems of dimension up to ) and of sample sets of comparably small size. It is worth mentioning in this respect that the same resampling approach could be applied in combination with any technique for function approximation (including neural networks and support vector machines). In the particular case of independent inputs, the PCE representation encodes the moments of the output and its Sobol sensitivity indices directly in the polynomial coefficients Sudret (2008).
Third, the analytical expression of the output yielded by PCE in terms of a simple polynomial of the input makes the model easy to interpret, compared for instance to other popular methods for ML regression.
Fourth, the polynomial form makes the calibrated metamodel portable to embedded devices (e.g., drones). For this kind of applications, the demonstrated robustness to noise in the data is a particularly beneficial feature.
Fifth and last, PCE needs relatively few data points to attain acceptable performance levels, as shown here on various test cases. This feature demonstrates the validity of PCE metamodelling for problems affected by data scarcity, also when combined with complex vine copula representations of the input dependencies.
One limitation of PCEbased regression as presented here is its difficulty in dealing with data of large size or consisting of a very large number of inputs. Both features lead to a substantially increased computational cost needed to fit the PCE parameters and (if statistical estimation is wanted and the inputs are dependent) to infer the copula. Various solutions can be nevertheless envisioned. In the presence of very large training sets, the PCE may be initially trained on a subset of the available observations, and subsequently refined by enriching the training set with points in the region where the observed error is larger. Regarding copula inference, which is only needed for an accurate quantification of the prediction’s uncertainty, a possible solution is to employ a Gaussian copula. The latter involves a considerably faster fitting than the more complex vine copulas, and still yielded in our simulations acceptable performance. Alternatively, one may reduce the computational time needed for parameter estimation by parallel computing, as done in Wei et al. (2016). Recently, a technique to efficiently infer vine copulas in highdimensional problems has been proposed in Müller and Czado (2018).
Finally, the proposed methodology has been shown here on data characterised by continuous input variables only. PCE construction in the presence of discrete data is equally possible, and the Stiltjes orthogonalisation procedure is known to be quite stable in that case Gautschi (1982)
. The aposteriori quantification of the output uncertainty, however, generally poses a challenge. Indeed, it involves the inference of a copula among discrete random variables, which requires a different construction
Genest and Nešlehová (2007). Recently, however, methods have been proposed to this end, including inference for Rvines Panagiotelis et al. (2012); Panagiotelis et al. (2017). Further work is foreseen to integrate these advances with PCE metamodelling.Acknowledgments
Emiliano Torre gratefully acknowledges financial support from RiskLab, Department of Mathematics, ETH Zurich and from the Risk Center of the ETH Zurich. Declarations of interest: none.
References
 Aas (2016) Aas, K. (2016). Paircopula constructions for financial applications: A review. Econometrics 4(4), 43.
 Aas et al. (2009) Aas, K., C. Czado, A. Frigessi, and H. Bakkend (2009). Paircopula constructions of multiple dependence. Insurance: Mathematics and Economics 44(2), 182–198.
 Applegate et al. (2006) Applegate, D. L., R. E. Bixby, V. Chvátal, and W. J. Cook (2006). The Traveling Salesman Problem: A Computational Study. NewJersey: Princeton University Press.
 Bedford and Cooke (2002) Bedford, T. and R. M. Cooke (2002). Vines – a new graphical model for dependent random variables. The Annals of Statistics 30(4), 1031–1068.
 Bishop (2009) Bishop, C. (2009). Pattern recognition and machine learning. Springer.
 Blatman and Sudret (2011) Blatman, G. and B. Sudret (2011). Adaptive sparse polynomial chaos expansion based on Least Angle Regression. J. Comput. Phys 230, 2345–2367.
 Can (1992) Can, A. (1992). Specification and estimation of hedonic housing price models. Regional Science and Urban Economics 22(3), 453–474.
 Chan and Elsheikh (2018) Chan, S. and A. H. Elsheikh (2018). A machine learning approach for efficient uncertainty quantification using multiscale methods. Journal of Computational Physics 354, 493–511.
 Claesen and Moor (2015) Claesen, M. and B. D. Moor (2015). Hyperparameter search in machine learning. arXiv 1502.02127.
 Cortez et al. (2009) Cortez, P., A. Cerdeira, F. Almeida, T. Matos, and J. Reis (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems 47, 547–553.
 Czado (2010) Czado, C. (2010). PairCopula Constructions of Multivariate Copulas, pp. 93–109. Berlin, Heidelberg: Springer Berlin Heidelberg.
 Dißmann et al. (2013) Dißmann, J., E. C. Brechmann, C. Czado, and D. Kurowicka (2013). Selecting and estimating regular vine copulae and application to financial returns. Computational Statistics and Data Analysis 59, 52–69.
 Efron et al. (2004) Efron, B., T. Hastie, I. Johnstone, and R. Tibshirani (2004). Least angle regression. Annals of Statistics 32, 407–499.
 Eldred et al. (2008) Eldred, M., C. Webster, and P. Constantine (2008). Evaluation of nonintrusive approaches for WienerAskey generalized polynomial chaos. AIAA 20081892 46.
 Ernst et al. (2012) Ernst, O. G., A. Mugler, H.J. Starkloff, and E. Ullmann (2012). On the convergence of generalized polynomial chaos expansions. ESAIM: M2AN 46(2), 317–339.

Forman and
Cohen (2004)
Forman, G. and I. Cohen (2004).
Learning from little: Comparison of classifiers given little training.
In J.F. Boulicaut, F. Esposito, F. Giannotti, and D. Pedreschi (Eds.), Knowledge Discovery in Databases: PKDD 2004, Berlin, Heidelberg, pp. 161–172. Springer Berlin Heidelberg.  Gautschi (1982) Gautschi, W. (1982). On generating orthogonal polynomials. SIAM J. Sci. Stat. Comput. 3(3), 289–317.
 Genest and Nešlehová (2007) Genest, C. and J. Nešlehová (2007). A primer on copulas for count data. The Astin Bulletin 37, 475–515.
 Ghanem and Spanos (1990) Ghanem, R. and P.D. Spanos (1990). Polynomial chaos in stochastic finite elements. J. Applied Mech. 57, 197–202.
 Gilley and Pace (1995) Gilley, O. W. and R. K. Pace (1995). Improving hedonic estimation with an inequality restricted estimator. The Review of Economics and Statistics 77(4), 609–621.

Haff
et al. (2010)
Haff, I. H., K. Aas, and A. Frigessi (2010).
On the simplified paircopula construction – simply useful or too
simplistic?
Journal of Multivariate Analysis
101, 1296–1310.  Harrison and Rubinfeld (1978) Harrison, D. and D. L. Rubinfeld (1978). Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management 5, 81–102.
 Ishigami and Homma (1990) Ishigami, T. and T. Homma (1990). An importance quantification technique in uncertainty analysis for computer models. In Proc. ISUMA’90, First Int. Symp. Unc. Mod. An, pp. 398–403. University of Maryland.
 Joe (1996) Joe, H. (1996). Families of variate distributions with given margins and bivariate dependence parameters. In L. Rüschendorf, B. Schweizer, and M. D. Taylor (Eds.), Distributions with fixed marginals and related topics, Volume 28 of Lecture Notes–Monograph Series, pp. 120–141. Institute of Mathematical Statistics.
 Joe (2015) Joe, H. (Ed.) (2015). Dependence modeling with copulas. CRC Press.
 Kandasamy and Yu (2016) Kandasamy, K. and Y. Yu (2016). Additive approximations in high dimensional nonparametric regression via the SALSA. Volume 48, pp. 69–78.
 Kasiviswanathan et al. (2016) Kasiviswanathan, K. S., K. P. Sudheer, and J. He (2016). Quantification of Prediction Uncertainty in Artificial Neural Network Models, pp. 145–159. Cham: Springer International Publishing.

Kirk (2014)
Kirk, J. (2014).
Traveling salesman problem – genetic algorithm.
https://ch.mathworks.com/matlabcentral/fileexchange/13680travelingsalesmanproblemgeneticalgorithm.  Kurowicka and Cooke (2005) Kurowicka, D. and R. M. Cooke (2005). Distributionfree continuous Bayesian belief nets. In A. Wilson, N. Limnios, S. KellerMcNulty, and Y. Armijo (Eds.), Modern Statistical and Mathematical Methods in Reliability, Chapter 10, pp. 309–322. World Scientific Publishing.
 Kurz (2015) Kurz, M. (2015). Vine copulas with matlab. https://ch.mathworks.com/matlabcentral/fileexchange/46412vinecopulaswithmatlab.
 Lebrun and Dutfoy (2009) Lebrun, R. and A. Dutfoy (2009). Do Rosenblatt and Nataf isoprobabilistic transformations really differ? Prob. Eng. Mech. 24, 577–584.
 Lichman (2013) Lichman, M. (2013). UCI machine learning repository.

Mentch and
Hooker (2016)
Mentch, L. and G. Hooker (2016).
Quantifying uncertainty in random forests via confidence intervals and hypothesis tests.
Journal of Machine Learning Research 17, 1–41.  MoralesNápoles (2011) MoralesNápoles, O. (2011). Counting vines. In D. Kurowicka and H. Joe (Eds.), Dependence Modeling: Vine Copula Handbook, Chapter 9, pp. 189–218. World Scientific Publisher Co.
 Müller and Czado (2018) Müller, D. and C. Czado (2018). Selection of sparse vine copulas in high dimensions with the Lasso. Statistics and Computing, 1–19.
 NAVARRO et al. (2014) NAVARRO, M., J. WITTEVEEN, and J. BLOM (2014). Polynomial chaos expasion for general multivariate distributions with correlated variables. ArXiv 1406.5483v1.
 Nelsen (2006) Nelsen, R. (2006). An introduction to copulas (Second ed.). Springer Series in Statistics. SpringerVerlag New York.
 Oladyshkin and Nowak (2012) Oladyshkin, S. and W. Nowak (2012). Datadriven uncertainty quantification using the arbitrary polynomial chaos expansion. Reliability Engineering and System Safety 106, 179–190.
 Oladyshkin and Nowak (2018) Oladyshkin, S. and W. Nowak (2018). Incomplete statistical information limits the utility of highorder polynomial chaos expansions. Reliability Engineering & System Safety 169, 137–148.
 Panagiotelis et al. (2012) Panagiotelis, A., C. Czado, and H. Joe (2012). Pair copula constructions for multivariate discrete data. J. Amer. Statist. Assoc. 107, 1063–1072.
 Panagiotelis et al. (2017) Panagiotelis, A., C. Czado, H. Joe, and J. Stöber (2017). Model selection for discrete regular vine copulas. Computational Statistics & Data Analysis 106, 138–152.

Parzen (1962)
Parzen, E. (1962).
On estimation of a probability density function and mode.
The Annals of Mathematical Statistics 33(3), 1065–1076.  Paulson et al. (2018) Paulson, J. A., E. A. Buehler, and A. Mesbah (2018). Arbitrary polynomial chaos for uncertainty propagation of correlated random variables in dynamic systems. IFACPapersOnLine 50(1), 3548–3553.
 Puig et al. (2002) Puig, B., F. Poirion, and C. Soize (2002). NonGaussian simulation using Hermite polynomial expansion: convergences. Prob. Eng. Mech. 17, 253–264.
 Quinlan (1993) Quinlan, J. R. (1993). Combining instancebased and modelbased learning. In Proceedings on the Tenth International Conference of Machine Learning, pp. 236–243.
 R Kelley Pace (1997) R Kelley Pace, O. W. G. (1997). Using the spatial configuration of the data to improve estimation. Journal of Real Estate Finance and Economics 14(3), 333–340.
 Rosenblatt (1952) Rosenblatt, M. (1952). Remarks on a multivariate transformation. The Annals of Mathematical Statistics 23, 470–472.
 Rosenblatt (1956) Rosenblatt, M. (1956). Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics 27(3), 832–837.
 Saltelli et al. (2000) Saltelli, A., K. Chan, and E. Scott (Eds.) (2000). Sensitivity analysis. J. Wiley & Sons.
 Schepsmeier (2015) Schepsmeier, U. (2015). Efficient information based goodnessoffit tests for vine copula models with fixed margins. Journal of Multivariate Analysis 138(C), 34–52.
 Schölkopf and Smola (2002) Schölkopf, B. and A. Smola (2002). Learning with kernels. MIT Press.
 Sklar (1959) Sklar, A. (1959). Fonctions de répartition à dimensions et leurs marges. Publications de l’Institut de Statistique de L’Université de Paris 8, 229–231.
 Snoek et al. (2012) Snoek, J., H. Larochelle, and R. P. Adams (2012). Practical Bayesian optimization of machine learning algorithms. In Proceedings of the 25th International Conference on Neural Information Processing Systems  Volume 2, NIPS’12, USA, pp. 2951–2959. Curran Associates Inc.
 Sobol’ (1993) Sobol’, I. (1993). Sensitivity estimates for nonlinear mathematical models. Mathematical Modeling & Computational Experiment 1, 407–414.
 Sobol’ (1967) Sobol’, I. M. (1967). On the distribution of points in a cube and the approximate evaluation of integrals. USSR Computational Mathematics and Mathematical Physics 7(4), 86–112.
 Soize and Ghanem (2004) Soize, C. and R. Ghanem (2004). Physical systems with random uncertainties: chaos representations with arbitrary probability measure. SIAM J. Sci. Comput. 26(2), 395–410.
 Stöber et al. (2013) Stöber, J., H. Joe, and C. Czado (2013). Simplified pair copula constructions — limitations and extensions. Journal of Multivariate Analysis 119, 101–118.
 Stuart and Ord (1994) Stuart, A. and K. Ord (1994). Kendall’s advanced theory of statistics Vol. 1 – Distribution theory. Arnold.
 Sudret (2008) Sudret, B. (2008). Global sensitivity analysis using polynomial chaos expansions. Reliab. Eng. Sys. Safety 93, 964–979.
 Sudret et al. (2015) Sudret, B., S. Marelli, and C. Lataniotis (2015). Sparse polynomial chaos expansions as a machine learning regression technique. International Symposium on Big Data and Predictive Computational Modeling, invited lecture.
 Tibshirani (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society 58(1), 267–288.
 Todor and Schwab (2007) Todor, R.A. and C. Schwab (2007). Convergence rates for sparse chaos approximations of elliptic problems with stochastic coefficients. IMA J. Numer. Anal 27, 232–261.
 Torre et al. (2019) Torre, E., S. Marelli, P. Embrechts, and B. Sudret (2019). A general framework for datadriven uncertainty quantification under complex input dependencies using vine copulas. Probabilistic Engineering Mechanics 55, 1–16.
 Tüfekci (2014) Tüfekci, P. (2014). Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods. International Journal of Electrical Power & Energy Systems 60, 126–140.

Vapnik (1995)
Vapnik, V. (1995).
The Nature of Statistical Learning Theory
. SpringerVerlag, New York.  Wan and Karniadakis (2006) Wan, X. and G. E. Karniadakis (2006). Beyond WienerAskey expansions: handling arbitrary PDFs. J. Sci. Comput. 27, 455–464.
 Wei et al. (2016) Wei, Z., D. Kim, and E. M. Conlon (2016). Parallel computing for copula parameter estimation with big data: A simulation study. arXiv 1609.05530.
 Wiener (1938) Wiener, N. (1938). The homogeneous chaos. Amer. J. Math. 60, 897–936.
 Witten et al. (2016) Witten, I. H., E. Frank, M. A. Hall, and C. J. Pal (2016). Data Mining, Fourth Edition: Practical Machine Learning Tools and Techniques (4th ed.). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
 Xiu and Karniadakis (2002) Xiu, D. and G. Karniadakis (2002). The WienerAskey polynomial chaos for stochastic differential equations. SIAM J. Sci. Comput. 24(2), 619–644.
Appendix A Mutually dependent inputs modelled through copulas
In Section 2.3 we illustrated PCE for an input with independent components being uniformly distributed in . was obtained from the true input by assuming either that the latter had independent components as well, and thus defining , or that a transformation to perform this mapping was available.
This section recalls known results in probability theory allowing one to specify
in terms of its marginal distributions and a dependence function called the copula of . We focus in particular on regular vine copulas (Rvines), for which the transformation can be computed numerically. Rvines provide a flexible class of dependence models for . This will prove beneficial to the ability of the PCE models to predict statistics of the output accurately, compared to simpler dependence models. However, and perhaps counterintuitively, the pointwise error may not decrease accordingly (see Section 2.1 for details), especially when is highly nonlinear. Examples on simulated data and a discussion are provided in Section 3.a.1 Copulas and Sklar’s theorem
A copula is defined as a variate joint with uniform marginals in the unit interval, that is,
Sklar’s theorem Sklar (1959) guarantees that any variate joint can be expressed in terms of marginals and a copula, specified separately.
Theorem (Sklar).
For any variate with marginals , a copula exists, such that
(1) 
Besides, is unique on , where Ran is the range operator. In particular, is unique on if all are continuous, and it is given by
(2) 
Conversely, for any copula and any set of univariate s , , the function defined by
(3) 
is a variate with marginals .
Throughout this study it is assumed that has continuous marginals . The relation (3) allows one to model any multivariate by modelling separately univariate s and a copula function . One first models the marginals , then transforms each into a uniform random variable by the socalled probability integral transform (PIT)
(4) 
Finally, the copula of is obtained as the joint of . The copula models the dependence properties of the random vector. For instance, mutual independence is achieved by using the independence copula
(5) 
Table A.4 provides a list of different parametric families of pair copulas implemented in the VineCopulaMatlab toolbox Kurz (2015), which was also used in this study. Details on copula theory and on various copula families can be found in Nelsen (2006) and in Joe (2015).
Sklar’s theorem can be restated in terms of probability densities. If admits joint and copula density , , then
(6) 
a.2 Vine copulas
In high dimension , specifying a copula which properly describes all pairwise and higherorder input dependencies may be challenging. Multivariate extensions of paircopula families (e.g. Gaussian or Archimedean copulas) are often inadequate when is large. In Joe (1996) and later in Bedford and Cooke (2002) and Aas et al. (2009), an alternative construction by multiplication of copulas was introduced. Copula models built in this way are called vine copulas. Here we briefly introduce the vine copula formalism, referring to the references for details.
Let be the vector obtained from the vector by removing its th component, i.e., . Similarly, let be the vector obtained by removing the th and th component, and so on. For a general subset , is defined analogously. Also, and indicate in the following the joint and of the random vector conditioned on . In the following, and form a partition of , i.e.,
Comments
There are no comments yet.