santaR is a Functional Data Analysis (FDA) approach in which each individual’s observations are condensed into a continuous smooth function of time, which is then employed as a new analytical unit in subsequent data analysis.
santaR is designed to be automatically applicable to a large number of variables, while simultaneously accounting for:
- biological variability
- measurement error
- missing observations
- asynchronous sampling
- nonlinearity
- low number of time points (e.g. 4-10)
This vignette focuses on the theoretical background underlying santaR and will detail:
- Concept of Functional Data Analysis
- Main hypothesis
- Signal extraction using splines
- Group mean curve fitting
- Intra-class variability (confidence bands on the group mean curves)
- Detection of significantly altered time-trajectories
Concept of Functional Data Analysis
Traditional time-series methodologies consider each measurement (or observation) - for example a metabolic response value at a defined time-point - as the basic representative unit. In this context, a time trajectory is composed of a set of discrete observations taken at successive times. Functional data analysis is a framework proposed by Ramsay and Silverman which continuously models each measured value as a variable (e.g. time) evolves. Contrary to traditional time-series approaches, FDA considers a subject’s entire time-trajectory as a single data point. This continuous representation of the variable through time is then employed for further data analysis.
More precisely, FDA assumes that a variable’s time trajectory is the result of a smooth underlying function of time which must be estimated. Therefore, the analysis’ first step is to convert a subject’s successive observations into the underlying function F(t), which can then be evaluated at any desired time value.
Two main challenges preclude the exact characterisation of the true underlying function of time F(t). First, the limited number of discrete observations is constrained to a finite time interval and does not cover all possible time values, while the true underlying smooth process is defined at every time point. Secondly, intrinsic limitations and variabilities of the analytical platforms mean that the discrete observations do not capture the true underlying function values, but a succession of noisy realisations.
A metabolite’s concentration over the duration of the experiment is
the real quantity of interest. As the true smooth process F(t) cannot be evaluated,
due to the collection and analytical process only providing noisy
snapshots of its value, an approximation of the underlying metabolite
concentration f(t)
must be employed. This approximation can be computed by smoothing the
multiple single measurements available.
The approximation of the underlying function of time, for a single variable observed on a single subject, is generated by smoothing the observations $y_i$ taken at times $t_i$, such that:
$$y_i = f(t_i) + \epsilon_i$$ (Eq. 1)
Where $f(t)$ is the smoothed function that can be evaluated at any desired time t within the defined time domain, and $\epsilon_i$ is an error term allowing for measurement error. By extension, if $f(t)$ is defined as representing the true variable level across a population, $\epsilon$ can allow for both individual-specific deviation and measurement error.
In order to parametrise and computationally manipulate the smooth approximation, the representation is projected onto a finite basis, which expresses the infinite-dimensional f(t) as a linear combination of a set of basis functions. A wide range of basis functions (e.g. polynomials, splines, wavelets, Fourier basis) is available, although the most commonly used are the Fourier basis for periodic data and splines for aperiodic trajectories.
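As a minimal illustration of this projection, the sketch below expresses a short noisy trajectory as a linear combination of B-spline basis functions using base R (`splines::bs()` and `lm()`); the simulated data and the choice of `df = 4` are assumptions made for the example only.

```r
library(splines)

# One subject's noisy observations y_i taken at times t_i (simulated here)
t_obs <- c(0, 1, 2, 4, 7, 10, 14)
y_obs <- sin(t_obs / 4) + rnorm(length(t_obs), sd = 0.1)

# Project the trajectory onto a finite B-spline basis: f(t) is represented
# as a linear combination of 'df' basis functions
basis_fit <- lm(y_obs ~ bs(t_obs, df = 4))

# The smooth representation can now be evaluated at any desired time within
# the observed interval, independently of the original sampling
t_grid <- seq(min(t_obs), max(t_obs), length.out = 100)
f_hat  <- predict(basis_fit, newdata = data.frame(t_obs = t_grid))
```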
Following smoothing, a time-trajectory is expressed as a linear combination of a set of basis functions, whose values can be computed at any desired time without reconsidering the discrete observations. It is assumed that, if a satisfactory fit of the original data is obtained, both the new smooth approximation and the preceding noisy realisations should provide a similar representation of the underlying process.
In practice, by assuming and parametrising (explicitly or implicitly) the smoothness of the function, and by leveraging multiple measurements in time as well as multiple individual trajectories, the smoothed representation can implicitly handle noisy metabolic responses. Additionally, as functional representations do not rely on the input data after their generation, smoothing approaches can handle irregular sampling and missing observations, meaning that trajectories can be compared without requiring missing value imputation. As such, the FDA framework and the smoothness assumption provide the necessary tools for the time-dependent analysis of short and noisy metabolic time-trajectories.
Main hypothesis
The working hypothesis is that analyte values are representative of
the underlying biological mechanism responsible for the observed
phenotype and evolve smoothly through time. The resulting measurements
are noisy realisations of these underlying smooth functions of time.
Based on previous work in Functional Data Analysis, short time evolutions can be analysed by estimating each individual trajectory as a smooth spline or curve. These smooth curves collapse the point-by-point information into a new observational unit that enables the calculation of a group mean curve or of individual curves, each representative of the true underlying function of time.
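A minimal sketch of this step, using base R's `smooth.spline()` rather than santaR's own fitting functions: one individual's noisy observations are estimated as a smoothing spline with a fixed number of effective degrees of freedom (the data and `df = 5` are illustrative choices).

```r
# Simulated observations for one individual (times and values are arbitrary)
t_obs <- c(0, 2, 4, 6, 8, 10, 12, 14)
y_obs <- exp(-t_obs / 5) + rnorm(length(t_obs), sd = 0.05)

# Estimate the individual trajectory as a smoothing spline with fixed df
ind_fit <- smooth.spline(x = t_obs, y = y_obs, df = 5)

# The fitted curve is the new observational unit: it can be evaluated at
# any time within the studied interval
f_hat <- predict(ind_fit, x = seq(0, 14, by = 0.5))$y
```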
Further data analysis then results in the estimation of the
intra-class variability and the identification of time trajectories
significantly altered between classes.
Group mean curve fitting
Individual time-trajectories provide with subject-specific
approximations of the underlying function of time. If a homogenous
subset of the population (or group) can be devised, the replication of
the experiment over this group can be leveraged to generate a group mean
curve. This group mean curve provide with an approximation of the
underlying function of time potentially dissociated of inter-individual
variability.
A group mean curve g(t), representing the underlying function of time for a subset of the population, is defined as the mean response across all subjects i of the group (where $i = 1, \ldots, n_k$):
$$g(t)=\frac{1}{n_k} \sum_{i=1}^{n_k} f_i(t) = \overline{f_i(t)}$$ (Eq. 3)
Following the framework of FDA, each individual trajectory is
employed as the new basic unit of information. The group mean curve
therefore employs the back-projection of each individual curve at every
time of interest, instead of relying on each discrete measurement. In this way, the group mean curve fit is resilient to missing samples in a subject’s trajectory and to asynchronous measurements across individuals.
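The sketch below illustrates Eq. 3 with base R: several individual smoothing splines are fitted on asynchronous, incomplete samplings, then back-projected onto a common time grid and averaged pointwise. The simulated trajectories and `df = 5` are assumptions made for the example and do not reproduce santaR's internal implementation.

```r
set.seed(1)
true_f <- function(t) sin(t / 3)                    # true underlying function

# Fit one smoothing spline per subject, each with its own sampling times
subject_fits <- lapply(1:5, function(i) {
  t_i <- sort(sample(0:14, size = 6))               # asynchronous sampling
  y_i <- true_f(t_i) +
         rnorm(1, sd = 0.1) +                       # subject-specific offset
         rnorm(length(t_i), sd = 0.2)               # measurement noise
  smooth.spline(t_i, y_i, df = 5)                   # individual curve f_i(t)
})

# Back-project every individual curve onto a common grid, then average (Eq. 3)
t_grid     <- seq(0, 14, length.out = 100)
ind_curves <- sapply(subject_fits, function(fit) predict(fit, x = t_grid)$y)
g_t        <- rowMeans(ind_curves)                  # group mean curve g(t)
```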
Multiple group mean curve fits (Center and Right) based on
measurements generated from a true smooth underlying function of time
(Left) with added inter-individual variability and missing observations.
Multiple individual trajectories (Left, dashed green lines) are
generated from a true underlying function of time (Left, continuous
green line). Discrete measurements are generated (dots). Group mean
curves are computed on a subset of these measurements (Center and Right,
continuous blue line, 5 effective degrees of freedom). When the group
mean curve is generated solely based on measurements (Center), the true
underlying function of time is not satisfactorily approximated
(fundamental unit of information is the observation). When the group
mean curve (Right, continuous blue line) is generated from the individual curves (Right, dashed blue lines), a satisfactory approximation of the true underlying function of time can be obtained (fundamental unit of information is the functional representation of each individual).
Intra-class variability (confidence bands on the group mean
curves)
Confidence bands are employed to represent the uncertainty in a function’s estimate based on limited or noisy data. Pointwise confidence bands on the group mean curve can therefore provide a visualisation of the variance of the overall mean function through time. The confidence bands indicate, for each possible time, the values of the mean curve which could have been produced by the provided individual trajectories, with at least a given probability.
As the underlying distribution of F(t) is unknown and inaccessible, confidence on g(t) must be estimated by sampling the approximate distribution available (i.e. the measurements), which is achieved by bootstrapping the observations. The bootstrap assumes that sampling the distribution under a good estimate of the truth (the measurements) should be close to sampling the distribution under the true process. Once a bootstrapped distribution of mean curves is estimated, a pointwise confidence interval on g(t) can be calculated with the percentile method, by taking the α/2 and 1 − α/2 quantiles at each time-point of interest (where α is the error quantile in percent).
As only limited assumptions can be made about the fitted model, empirical confidence bands on the group mean curve are calculated from a non-parametric bootstrapped distribution. Individual curves are sampled with replacement and group mean curves calculated. The distribution of group mean curves is then employed to estimate the pointwise confidence bands.
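A minimal sketch of these bootstrapped confidence bands, reusing `subject_fits` and `t_grid` from the previous sketch; the number of bootstrap rounds and the error quantile below are illustrative choices.

```r
B     <- 500    # number of bootstrap rounds (illustrative)
alpha <- 0.05   # error quantile

# Resample individual curves with replacement and recompute the group mean
boot_means <- replicate(B, {
  resampled <- sample(subject_fits, size = length(subject_fits), replace = TRUE)
  rowMeans(sapply(resampled, function(fit) predict(fit, x = t_grid)$y))
})

# Percentile method: alpha/2 and 1 - alpha/2 quantiles at each grid time
lower_band <- apply(boot_means, 1, quantile, probs = alpha / 2)
upper_band <- apply(boot_means, 1, quantile, probs = 1 - alpha / 2)
```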
Detection of significantly altered time-trajectories
With a continuous time representation of a metabolite’s concentration
accessible (i.e. the group mean curve), a metric quantifying a global
difference between such curves can be devised. This measure of
difference, independent of the original sampling of the data
(e.g. asynchronous and missing measurements), can then be employed to
determine whether the group mean curves are identical across the two
groups and calculate the significance of the result.
In order to identify time-trajectories that are significantly altered
between two experimental groups G1 and G2, this global measure
must be established. The measure must be able to detect baseline
difference when both curves present the same shape, but also shape
differences when the mean values across time are identical for both
groups. We postulate that the area between two group mean curves fitted
independently g1(t) and g2(t) (Eq. 3) is
a good measure of the degree to which the two group time-trajectories
differ. This global distance measure is calculated as:
$$dist(g_1, g_2) = \int_{t_{min}}^{t_{max}} \left| g_1(t) - g_2(t) \right| \, dt$$ (Eq. 4)
Where:
- $t_{min}$ and $t_{max}$ are the lower and upper bounds of the time course.
The global distance measure is a quantification of evidence for
differential time evolution, and the larger the value, the more the
trajectories differ.
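A short sketch of Eq. 4, approximating the integral by the trapezoidal rule on a fine common grid; `global_distance` is a hypothetical helper name, and `g1_t` and `g2_t` are assumed to be two group mean curves already evaluated on `t_grid`, as in the sketches above.

```r
# Global distance (Eq. 4): area between two group mean curves evaluated on
# a common grid, approximated with the trapezoidal rule
global_distance <- function(g1_t, g2_t, t_grid) {
  abs_diff <- abs(g1_t - g2_t)
  sum(diff(t_grid) * (head(abs_diff, -1) + tail(abs_diff, -1)) / 2)
}
```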
Given the groups G1 and G2, their respective group mean curves g1(t) and g2(t), and the observed global distance measure dist(g1, g2), the question of the significance of the difference between the two time trajectories is formulated as the following hypothesis-testing problem:
- H0: g1(t) and g2(t) are identical.
- H1: g1(t) and g2(t) are not identical.
Under the null hypothesis H0, the global distance between g1(t) and g2(t) can be generated by randomly permuting the individuals between the two groups. Under the alternative hypothesis H1, the two curves are different, and very few random assignments of individuals to groups will generate group mean curves that differ more than the observed ones (i.e. produce a larger global distance).
In order to test the null hypothesis, the observed global distance
must be compared to the expected global distance under the null
hypothesis. The null distribution of the global distance measure is
inferred by repeated permutations of the individual class labels.
Individual trajectories are fitted using the chosen df and randomly assigned to one of two groups of sizes s1 and s2 (identical to the sizes of G1 and G2 respectively). The corresponding group mean curves g′1(t) and g′2(t) are computed, and a global distance dist(g′1, g′2) is calculated. This is repeated many times to form the null distribution of the global distance. From this distribution, a p-value can be computed as the proportion of global distances as extreme as or more extreme than the observed global distance dist(g1, g2).
The p-value indicates the probability (if the null hypothesis were true) of observing a global difference as extreme as, or more extreme than, the observed one. If the p-value is below an agreed significance level, the null hypothesis is rejected.
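The sketch below mirrors this permutation scheme under the assumptions of the previous sketches: individual spline fits are relabelled at random, group mean curves recomputed, and the p-value taken as the proportion of permuted distances at least as large as the observed one. `fit_group_mean` and `permutation_pvalue` are hypothetical helper names, and `global_distance` is the function sketched above.

```r
# Pointwise average of a list of individual spline fits on a common grid
fit_group_mean <- function(fits, t_grid) {
  rowMeans(sapply(fits, function(fit) predict(fit, x = t_grid)$y))
}

# Permutation test of H0: identical group mean curves
permutation_pvalue <- function(fits_g1, fits_g2, t_grid, B = 1000) {
  all_fits <- c(fits_g1, fits_g2)
  n1       <- length(fits_g1)
  observed <- global_distance(fit_group_mean(fits_g1, t_grid),
                              fit_group_mean(fits_g2, t_grid), t_grid)

  # Null distribution: random relabelling of individuals into two groups
  null_dist <- replicate(B, {
    idx <- sample(seq_along(all_fits), size = n1)
    global_distance(fit_group_mean(all_fits[idx],  t_grid),
                    fit_group_mean(all_fits[-idx], t_grid), t_grid)
  })

  # Proportion of permuted distances as extreme or more extreme than observed
  mean(null_dist >= observed)
}
```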
Due to the stochastic nature of the permutations, the null
distribution sampled depends on the random draw, resulting in possible
variation of the p-value.
Variation in the permuted p-value can be described by confidence intervals. The p-value is considered as the success probability parameter p in n independent Bernoulli trials, where n = B, the total number of permutation rounds. Confidence intervals on this binomial distribution with parameters p and n can then be computed. As extreme p-values are possible (i.e. close to 0) and the number of permutation rounds could be small, the Wilson score interval is employed to determine the confidence intervals (Eq. 5):
$$p_\pm = \frac{1}{1+\frac{1}{B}z^2}\left( p + \frac{1}{2B}z^2 \pm z\sqrt{\frac{1}{B} p (1 - p) + \frac{1}{4B^2}z^2 } \right)$$ (Eq. 5)
Where:
- $p_\pm$ is the upper or lower bound of the confidence interval.
- p is the calculated p-value.
- B is the total number of permutation rounds.
- z is the $1 - \frac{\alpha}{2}$ quantile of the standard normal distribution, with α the error quantile.
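A direct transcription of Eq. 5 in R, assuming a p-value `p` obtained from `B` permutation rounds; `wilson_interval` is a hypothetical helper name and the example values are arbitrary.

```r
# Wilson score interval (Eq. 5) around a permuted p-value
wilson_interval <- function(p, B, alpha = 0.05) {
  z      <- qnorm(1 - alpha / 2)                 # 1 - alpha/2 quantile
  centre <- p + z^2 / (2 * B)
  margin <- z * sqrt(p * (1 - p) / B + z^2 / (4 * B^2))
  c(lower = (centre - margin) / (1 + z^2 / B),
    upper = (centre + margin) / (1 + z^2 / B))
}

wilson_interval(p = 0.002, B = 1000)
```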
Lastly, if a large number of variables are investigated for significantly altered trajectories, significance results must be corrected for multiple hypothesis testing. Here we employ the well-known Benjamini & Hochberg false discovery rate approach by default, with other methods available as options.
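For example, per-variable permutation p-values can be corrected with base R's `p.adjust()`, whose "BH" method implements the Benjamini & Hochberg procedure (the p-values below are made up for illustration).

```r
raw_pvalues      <- c(0.001, 0.020, 0.300, 0.004, 0.650)
adjusted_pvalues <- p.adjust(raw_pvalues, method = "BH")   # FDR-adjusted
```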