Modeling longitudinal data using matrix

1. Motivation

The key question in medical practice and clinical research is how diseases progress in individual patients. Accurate continuous monitoring of a patient’s condition could considerably improve prevention and treatment. Many medical tests, such as X-ray, magnetic resonance imaging (MRI), motion capture gait analysis, biopsy, or blood tests, cannot be taken routinely, since they are costly, harmful, or inconvenient. Therefore, practitioners and researchers need to reach out for statistical tools to analyze the progression of a patient’s condition with only sparse and noisy observations at hand.

For illustration, consider longitudinal measurements of the gait deviation index (GDI), which is a holistic measure of motor impairment in children. GDI is measured using advanced motion capture hardware and software (Figure 1). Due to high costs, such measurements are taken only a few times in a patient’s life. By looking at individual processes and accounting for between-subject similarities, we can model individual progressions despite observing only few measurements. Such modeling can yield personalized predictions, clustering of patients based on progression patterns, or latent representation of progression patterns, which then could be used as covariates in other predictive models.

Figure 1: — Left plot: We observe the gait deviation index (GDI), a holistic metric of motor impairment, at every visit to a clinic, and we derive the population progression (the thick solid curve) using a locally weighted scatterplot smoothing method. Right plot: The analysis of individual patients (connected dots in the right plot in different colors) reveals patterns in individual trends. We highlight 3 randomly selected subjects in red, green, and blue.

These kinds of data have been commonly analyzed using linear mixed models, where we treat time as a random effect, and the nonlinearity of this effect is imposed by choice of the functional basis (Zeger et al., 1988; Verbeke, 1997; McCulloch and Neuhaus, 2001). When data is very sparse, additional constraints on the covariance structure of trajectories are inferred using cross-validation or information criteria (Rice and Wu, 2001; Bigelow and Dunson, 2009). To further constrain the dimensionality of the problem, practitioners model the covariance as a low-rank matrix (James et al., 2000; Berkey and Kent, 1983; Yan et al., 2017; Hall et al., 2006; Besse and Ramsay, 1986; Yao and Lee, 2006; Greven et al., 2011). Many models have been developed for incorporating additional covariates (Song et al., 2002; Liu and Huang, 2009; Rizopoulos et al., 2014). While these methods are used in practice, they tend to be slow and require careful implementation.

In this work, we pose the problem of trajectory prediction as a matrix completion problem, and we solve it using sparse matrix factorization techniques (Rennie and Srebro, 2005; Candès and Recht, 2009; Fithian et al., 2018). The objective of the classical matrix completion problem is to predict elements of a sparsely observed matrix using its known elements while minimizing a specific criterion, commonly Mean Squared Error (MSE). The motivating example is the “Netflix Prize” competition (Bennett and Lanning, 2007), where teams were tasked to predict unknown movie ratings using a sparse set of observed ratings. We can represent these data as a matrix of $N$ users and $M$ movies, with a subset of known elements measured on a fixed scale, e.g., 1–5.

One popular approach to approximate the observed matrix is to estimate its low-rank decomposition (Srebro et al., 2005). In the low-rank representation $W A^{'}$ , columns of $A$ spanning the space of movies can be loosely associated with some implicit (latent) characteristics such as taste, style, or genre, and each rater is represented as a weighted sum of their preferred characteristics, i.e., a row in the matrix of latent variables $W$ (see Figure 2).

Figure 2: — The key observation motivating this work is the fact that the problem of estimating trajectories can be mapped to a matrix completion problem where we estimate values of unobserved entries of a matrix. Matrix completion can be approached with matrix factorization, where we look for $W A^{'}$ of low rank, approximating observed values in $Y$ (circled green rectangles in the matrix $Y$ ). In the sparse longitudinal setting, we enforce smoothness by fixing the basis $B$ (e.g., splines, here with three elements), and again we find a low-rank matrix $W^{'} A$ , such that $W^{'} A B$ approximates observed values in $Y$ .

We can use the same idea to predict sparsely sampled curves as long as we introduce an additional smoothing step. The low-dimensional latent structure now corresponds to progression patterns, and the trajectory of each individual can be represented as a weighted sum of these “principal” patterns. In Figure 2, the patterns are given by $A^{'} B^{'}$ , while the individual weights are encoded in $W$ .

The key advantage of this approach is that the implementation only relies on the iterative application of matrix multiplication and singular value decomposition. We can leverage existing literature and packages for these operations, making the implementation straightforward and very efficient.

To the best of our knowledge, this is the first application of matrix completion for filling in entire curves to sparsely observed data. Nevertheless, matrix completion techniques have been previously used in this context for estimating the covariance operator Descary and Panaretos (2016), which can subsequently be used for fitting curves.

The manuscript is organized as follows. In Section 2, we introduce the notation of the functional representation of the problem and we discuss optimization techniques. Next, in Section 3, we introduce the matrix representation of the problem and show how it maps to the functional representation. We develop our main results and conclude with links to other results. In Section 4, we extend our framework to the multivariate case. We show example uses of our methods in a simulation study (Section 5) and in a data study (Section 6). We conclude with a discussion section (Section 7).

2. Background and related work

To focus our attention and aid further discussion, we start by introducing notation and methodology for the univariate case. Let $N$ denote the number of subjects. For each individual $i \in {1, 2, \dots, N}$ , we take measurements at $n_{i}$ irregularly sampled time-points $t_{i} = {(t_{i, 1}, t_{i, 2}, \dots, t_{i, n_{i}})}^{'}$ . We assume that $t_{m i n} < t_{i, 1} < t_{i, 2} < \dots < t_{i, n_{i}} < t_{m a x}$ for each $i$ and some $t_{m i n}, t_{m a x} \in R$ . In this work, we ignore the stochastic nature of sampling timepoints and we assume that points $t_{i}$ are fixed for each individual. Vectors $y_{i} = {(y_{i, 1}, \dots, y_{i, n_{i}})}^{'}$ denote observations corresponding to $t_{i}$ for each individual $i$ .

To model individual trajectories given pairs $(t_{i}, y_{i})$ practitioners map observations into a low-dimensional space that represents progression patterns. Conceptually, a small distance between individuals in the latent space reflects similar progression patterns.

In this section, we discuss state-of-the-art approaches to estimating this low-dimensional latent embedding. We classify them into three broad categories: the direct approach, mixed-effect models, and low-rank approximations.

2.1. Direct approach

If the set of observed values for each individual is dense, elementary interpolation using a continuous basis can be sufficient for approximating the entire trajectory. Let $(b_{i} : i \in N)$ be a basis of $L_{2} ((t_{m i n}, t_{m a x}))$ , i.e. the space of functions $f : (t_{m i n}, t_{m a x}) \to R$ such that the integral of $f^{2}$ on $(t_{m i n}, t_{m a x})$ is finite, such as splines or polynomials. In practice, we truncate the basis to a finite set of the first $K \in N$ basis elements. Let $b (t) = {(b_{1} (t), b_{2} (t), \dots, b_{K} (t))}^{'}$ be a vector of $K$ basis elements evaluated at a timepoint $t \in (t_{m i n}, t_{m a x})$ . Throughout this article we use the word basis to refer to some truncated basis of $K$ elements. The choice of the basis is relevant and should be done carefully. We refer the reader to the vast literature on functional data analysis for recommendations on this topic (Ramsay, 2006). Throughout this manuscript, we assume the basis is fixed upfront.

To find an individual trajectory, for each subject $i \in {1, \dots, N}$ , we might use least squares and estimate a set of coefficients $w_{i} \in R^{K}$ , minimizing squared Euclidean distance to observed points

\underset{w_{i}}{argmin} \sum_{j = 1}^{n_{i}} {| y_{i, j} - w_{i}^{'} b (t_{i, j}) |}^{2} .

(1)

This direct approach has two main drawbacks. First, it ignores correlations within the curves, which could potentially improve the fit. Second, if the number of observations $n_{i}$ for an individual $i$ is smaller or equal to the size of the basis $K$ , we can fit a curve with no error leading to overfitting and an unreliable estimator of the variance.

To remedy the first issue, it is common to estimate the covariance operator, compute principal functional components across all individuals and represent curves in the space spanned by the first few of them. Such representation, is referred to as a Karhunen-Loève expansion (Watanabe, 1965; Kosambi, 2016), has became a foundation of many functional data analysis workflows (Ramsay and Dalzell, 1991; Yao et al., 2005; Cnaan et al., 1997; Laird, 1988; Horváth and Kokoszka, 2012; Besse et al., 1997).

A basic idea to remedy the second issue is to estimate both the basis coefficients and the covariance structure simultaneously. Linear mixed-effect models provide a convenient solution to this problem.

2.2. Linear mixed-effect models

A common approach to modeling longitudinal observation $(t_{i}, y_{i})$ is to assume that data come from a linear mixed-effect model (LMM) (Verbeke, 1997; Zeger et al., 1988). We operate in a functionals pace with a basis $b (t)$ of $K$ elements and we assume there exists a fixed effect $μ (t) = m^{'} b (t)$ , where $m = {(m_{1}, \dots, m_{K})}^{'}$ for $m_{k} \in R$ for $1 \leq k \leq K$ . We model the individual random effect as a vector of basis coefficients. In the simplest form, we assume

where $Σ$ is a $K \times K$ covariance matrix. We model individual observations as

y_{i} ∣ w_{i} ~ 𝒩 (μ_{i} + B_{i} w_{i}, σ^{2} I_{n_{i}}),

(3)

where $μ_{i} = {(μ (t_{i, 1}), μ (t_{i, 2}), \dots, μ (t_{i, n_{i}}))}^{'}, σ$ is the standard deviation of observation error and $B_{i} = {(b (t_{i, 1}), \dots, b (t_{i, n_{i}}))}^{'}$ is the basis evaluated at timepoints defined in $t_{i}$ . Estimation of the model parameters is typically accomplished by the expectation-maximization (EM) algorithm (Laird and Ware, 1982). For predicting coefficients $w_{i}$ one can use the best unbiased linear predictor (BLUP) (Henderson, 1950; Robinson, 1991).

Since the LMM estimates the covariance structure and the individual fit simultaneously, it reduces the problem of overfitting of $w_{i}$ , present in the direct approach. However, this model is only applicable if we observe a relatively large number of observations per subject since we attempt to estimate $K$ coefficients for every subject.

To model trajectories from a small number of observations, practitioners further constrain the covariance structure. If we knew the functions which contribute the most to the random effect, we could fit an LMM in a smaller space spanned by these functions. We explore possibilities to learn the basics from the data using low-rank approximations of the covariance matrix.

2.3. Low-rank approximations

There are multiple ways to constrain the covariance structure. We can use cross-validation or information criteria to choose the best basis, the number of elements, or the positions of spline knots (Rice and Wu, 2001; Bigelow and Dunson, 2009). Alternatively, we can place a prior on the covariance matrix (MacLehose and Dunson, 2009).

Another solution is to restrict the latent space to $q < K$ dimensions and learn from the data a mapping $A \in R^{K \times q}$ between the latent space and the basis. In the simplest scenario with Gaussian noise, observations can be modeled as

y_{i} ∣ w_{i} ~ 𝒩 (μ_{i} + B_{i} A w_{i}, σ^{2} I_{n_{i}}),

(4)

following the notation from (3).

James et al. (2000) propose an EM algorithm for finding model parameters and predicting latent variables $w_{i} \in R^{q}$ in (4). In the expectation stage, they compute the conditional mean of $w_{i}$ given $y_{i}$ and the model parameters, while in the maximization stage, with $w_{i}$ assumed observed, they maximize the likelihood with respect to ${μ, A, σ}$ . The likelihood, given $w_{i}$ , takes the form

\begin{matrix} \prod_{i = 1}^{N} \frac{1}{{(2 π)}^{n_{i} / 2} σ^{n_{i}} | Σ |^{1 / 2}} exp {- {(y_{i} - μ_{i} - B_{i} A w_{i})}^{'} (y_{i} - μ_{i} - B_{i} A w_{i}) / 2 σ^{2} \\ - \frac{1}{2} w_{i}^{'} Σ^{- 1} w} . \end{matrix}

Another approach to estimating parameters of (4) is to optimize over $w_{i}$ and marginalize $A$ (Lawrence, 2004). This approach allows modifying the distance measure in the latent space, using the kernel trick (Schulam and Arora, 2016).

Estimation of the covariance structure of processes is central to the estimation of individual trajectories. Descary and Panaretos (2016) propose a method where the estimate of the covariance matrix is obtained through matrix completion.

Methods based on low-rank approximations are widely adopted and applied in practice (Berkey and Kent, 1983; Yan et al., 2017; Hall et al., 2006; Besse and Ramsay, 1986; Yao and Lee, 2006; Greven et al., 2011). However, due to their probabilistic formulation and reliance on the distribution assumptions, these models usually need to be carefully fine-tuned for specific situations and existing implementations tend to be slow. This shortcoming motivates us to develop an elementary optimization framework, using existing, extensively studied, and well–optimized tools for matrix algebra.

3. Modeling sparse longitudinal processes using matrix completion

We first introduce the methodology for univariate sparsely-sampled processes. The direct method, mixed-effect models, and low-rank approximations described in Section 2 have their counterparts in the matrix completion setting. We discuss these analogies in Sections 3.2 and 3.3. Next, in Section 4, we show that the simple representation of the problem allows for extension to multivariate sparsely-sampled processes and a regression setting.

3.1. Notation

For each individual $i \in {1, 2, \dots, N}$ we observe $n_{i}$ measurements $({\dot{y}}_{i, 1}, \dots, {\dot{y}}_{i, n_{i}})$ at time-points $(t_{i, 1}, t_{i, 2}, \dots, t_{i, n_{i}})$ . Unlike in the prior work introduced in Section 2 operating on entire curves, here we discretize time to $T$ time-points $G = (τ_{1}, τ_{2}, \dots, τ_{T})$ , where $t_{m i n} = τ_{1}, t_{m a x} = τ_{T}$ and $T$ is arbitrarily large. Each individual $i$ is expressed as a partially observed vector $r_{i} \in R^{T}$ . For each time-point $t_{i, j}$ for $1 \leq j \leq n_{i}$ we find a corresponding grid-point $g_{i} (j) = \underset{1 \leq k \leq T}{a r g m i n} (τ_{k} - t_{i, j})$ . We assign $y_{i, g_{i} (j)} = {\dot{y}}_{i, j}$ . Let $O_{i} = (g_{i} (j) : 1 \leq j \leq n_{i})$ be a set of grid indices corresponding to observed grid points for an individual $i$ . All elements outside $O_{i}$ , i.e. $(y_{i, j} : j \notin O_{i})$ are considered missing.

For $T$ sufficiently large, our observations can be uniquely represented as a $N \times T$ matrix $Y$ with missing values. We denote the set of all observed elements by pairs of indices as $Ω = ((i, j) : i \in {1, 2, \dots, N}, j \in O_{i})$ . Let $P_{Ω} (Y)$ be the projection onto observed indices, i.e. $P_{Ω} (Y) = \tilde{Y}$ , such that ${\tilde{Y}}_{i, j} = Y_{i, j}$ for $(i, j) \in Ω$ and ${\tilde{Y}}_{i, j} = 0$ otherwise. We define $P_{Ω}^{⊥} (Y)$ as the projection on the complement of $Ω$ , i.e. $P_{Ω}^{⊥} (Y) = Y - P_{Ω} (Y)$ . We use $∥ \cdot ∥_{F}$ to denote the Frobenius norm, i.e. the square root of the sum of squares of matrix elements, and $∥ \cdot ∥_{*}$ to denote the nuclear norm, i.e. the sum of singular values.

As in Section 2 we imposed smoothness by using a continuous basis $b (t) = {(b_{1} (t), b_{2} (t), \dots, b_{K} (t))}^{'}$ . When we evaluate the basis on a grid $G$ we get a $T \times K$ matrix $B = {(b (τ_{1}), b (τ_{2}), \dots, b (τ_{T}))}^{'}$ .

In our algorithms, we use diagonal thresholding operators defined as follows. Let $D = d i a g (d_{1}, \dots, d_{p})$ be a diagonal matrix. We define soft-thresholding as

D_{λ} = d i a g ({(d_{1} - λ)}_{+}, {(d_{2} - λ)}_{+}, \dots, {(d_{p} - λ)}_{+}),

where $(x)_{+} = m a x (x, 0)$ , and hard-thresholding as

D_{λ}^{H} = d i a g (d_{1}, d_{2}, \dots, d_{q}, 0, \dots, 0),

where $q = \underset{k}{a r g m i n} (d_{k} < λ)$ .

3.2. Direct approach

The optimization problem (1) of the direct approach described in Section 2.1 can be approximated in the matrix completion setting. Note that the bias introduced by the grid is negligible if the grid is sufficiently large, i.e. if each observed timepoint is relatively close to some grid point. We defined the basis as a set of continuous functions, so the finer the grid the closer we are to the actual value of the basis at a given timepoint. Eventually, the difference becomes negligible. Similarly, for the observed points, the coordinate corresponding to time is more accurate when the grid is finer, but again, due to continuity assumption this difference becomes negligible. The analysis of the properties of the grid is beyond the scope of the paper. As a rule of thumb, we recommend choosing the grid based on a practical indication. For example, if we expect patients to visit a clinic 1–2 times per year, then a grid corresponding to 12 points per year might be sufficient.

We approximate each observation $y_{i, j}$ with a point on the selected grid $y_{i, g (j)} ~ y_{i, j}$ . Next, we rewrite the optimization problem (1) as a matrix completion problem

\begin{matrix} \underset{{w_{i}}}{\arg m i n} \sum_{i = 1}^{N} \sum_{j = 1}^{n_{i}} {| y_{i, g_{i}, (j)} - w_{i}^{'} b (τ_{g_{i} (j)})) |}^{2} = \underset{{w_{i}}}{\arg \min} \sum_{(i, k) \in Ω} {| y_{i, k} - w_{i}^{'} b (τ_{k})) |}^{2} \\ = \underset{W}{\arg \min} {‖ P_{Ω} (Y - W B^{'}) ‖}_{F}^{2}, \end{matrix}

(5)

where by optimization over $(w_{i})$ we mean optimization over all $(w_{i} : i \in {1, 2, \dots, N})$ and $W$ is an $N \times K$ matrix composed of vectors $(w_{i}^{'})$ stacked vertically.

The matrix formulation in equation (5) and the classic approach in Section 2.1 share multiple characteristics. In both cases, if data is dense, we may find an accurate representation of the underlying process simply by fitting least-squares estimates of $W$ or $(w_{i})$ . Conversely, if the data is too sparse, the problem becomes ill-posed, and the least-squares estimates can overfit.

However, the represenational difference between functional representation (1) and matrix representation (5) leads to different computational approaches. The matrix representation enables us to use the matrix completion framework and, in particular, in Section 3.4 we introduce convex optimization algorithms for solving (5).

Some low-rank constraints on the random effect from the mixed-effect model introduced in Section 2.2 can be expressed in terms of constraints on $W$ . In particular, in Section 3.3 we show that the linear mixed-effect model can be expressed by constraining the $r a n k (W)$ .

3.3. Low-rank matrix approximation

In the low-rank approach described in Section 2.3 we assume that individual trajectories can be represented in a low-dimensional space, by constraining the rank of $W$ .

We use a similar approach in solving (5). One difficulty comes from the fact that optimization with a rank constraint on $W$ turns the original least-squares problem into a nonconvex problem. Motivated by the matrix completion literature, we relax the rank constraint in (5) to a nuclear norm constraint and we attempt to solve

\underset{W}{a r g m i n ∥} P_{Ω} (Y - W B^{'}) ∥_{F}^{2} + λ W ∥_{*,}

(6)

for some parameter $λ > 0$ .

From the practical standpoint, low-rank representation corresponds to describing curves on a basis comprised of only a few functions which are the best functions for obtaining low reconstruction errors. In Figure 3 we illustrate two functions for describing processes introduced in Section 1 and Figure 1. Such low-dimensional representation of processes can also be used for clustering based on the type of their progression as we discuss in the data study (Section 6).

Figure 3: — Left: The first two trends of variability for processes of the Gait Deviation Index, derived from the first two dimensions of the low-rank decomposition of the observed matrix. We can appreciate variability around the age of 10 in the first component, and correction before and after the age of 10 in the second component. Right: We illustrate how the individual score on the first component (positive and negative value) affects the population means.

Cai et al. (2010) propose a first-order singular value thresholding algorithm (SVT), for solving a general class of problems involving a nuclear norm penalty and a linear form of $Y$ , which includes (6). Their algorithm can handle large datasets and has strong guarantees of convergence, but it requires tuning the step size parameter, which can decrease computational efficiency as different steps have to be tried in practice. This limitation was addressed by Ma et al. (2011); Mazumder et al. (2010); Hastie et al. (2015a) who introduced iterative procedures which do not depend on such tuning.

In the last decade, machine learning and statistical learning communities have introduced multiple algorithms for matrix completion. Many of them are suitable for solving (6). However, in this article, we focus on analyzing the benefits of framing the trajectory prediction problem (1) in the matrix completion framework, rather than on benchmarking possible solutions.

3.4. Low-rank approximation with singular value thresholding

The low-rank probabilistic approach, introduced in Section 2.3, is based on the observation that the underlying processes for each subject can be approximated in a low-dimensional space. Here, we exploit the same characteristic using matrix-factorization techniques for solving the optimization problem (6).

For the sake of clarity and simplicity, we choose to solve the problem (6) with an extended version of the SOFT-IMPUTE algorithm designed by Hastie et al. (2015a); Mazumder et al. (2010). We find it easy to implement because it only uses matrix multiplication and Singular Value Decomposition (SVD). Using standard matrix operations also reduces the computational cost of our method, since implementations of these operations leverage well-optimized algorithms. However, as discussed in Section 3.3, many other convex optimization algorithms can be applied.

The key component to the solution is the following property linking problem (6) with the SVD. Consider the fully observed case of (6). Using Theorem 2.1 in Cai et al. (2010), one can show that the optimization problem

\underset{W}{a r g m i n} \frac{1}{2} {(Y - W B^{'})}_{F}^{2} + λ ∥ W ∥_{*},

(7)

with $B^{'} B = I$ , has a unique solution $W = 𝒮_{λ} (Y B)$ , where $𝒮_{λ} (X) = U D_{λ} V^{'}$ and $X = U D V^{'}$ is the SVD of $X$ , i.e., $U$ is an $n \times n$ unitary matrix, $D$ is an $n \times K$ rectangular diagonal matrix, with non-negative real numbers on the diagonal in decreasing order, while $V$ is a $K \times K$ matrix. We refer to $𝒮_{λ} (X)$ as the singular value thresholding (SVT) of $X$ . Note that, the decomposition of $Y B$ is identifiable since the solution of SVD with constrained $D$ is identifiable.

In order to solve (7) with a sparsely observed $Y$ , we modify the SOFT-IMPUTE algorithm to account for the basis $B$ . Our Algorithm 1 iteratively constructs better approximations of the global solution for each $λ$ in some predefined set $(λ_{1}, λ_{2}, \dots, λ_{k})$ for $k \in N$ . For a given approximation of the solution $W^{o l d}$ , we use $W^{o l d} B^{'}$ to impute unknown elements of $Y$ obtaining $\tilde{Y}$ . Then, we construct the next approximation $W^{n e w}$ by computing SVT of $\tilde{Y}$ .

As a stopping criterion, we compute the change between subsequent solutions, relative to the magnitude of the solution, following the methodology in Cai et al. (2010). We set a fixed threshold of $ε > 0$ for this criterion, depending on the desired accuracy.

Algorithm 1.

1. Initialize

W^{o l d}

with zeros

2. For each

λ_{1} > λ_{2} > \dots > λ_{k}

(a) Repeat:

i. Compute

W^{n e w} \leftarrow S_{λ_{i}} ((P_{Ω} (Y) + P_{Ω}^{⊥} (W^{o l d} B^{'})) B)

ii. If

\frac{{(W^{n e w} - W^{o l d})}_{F}^{2}}{{(W^{o l d})}_{F}^{2}} < ε

exit

iii. Assign

W^{o l d} \leftarrow W^{n e w}

(b) Assign

{\hat{W}}_{λ_{i}} \leftarrow W^{n e w}

3. Output

{\hat{W}}_{λ_{1}}, {\hat{W}}_{λ_{2}}, \dots, {\hat{W}}_{λ_{k}}

A common bottleneck of the algorithms introduced by Cai et al. (2010); Mazumder et al. (2010); Ma et al. (2011) as well as other SVT-based approaches is the computation of the SVD of large matrices. This problem is particularly severe in standard matrix completion settings, such as the Netflix problem, where the matrix size is over 400, 000 × 20, 000. However, in our problem,

r a n k (W B^{'}) \leq r a n k (B) = K ≪ N,

(8)

and $K ~ 10$ in our motivating data example. While algorithms for large matrices are applicable here, the property (8) makes the computation of SVD feasible in practice with generic packages such as PROPACK (Larsen, 2004).

For estimating curves on new data, we need to find $w_{i} = V a_{i}$ minimizing the prediction error.

To this end, we regress observed values $Y_{i}$ on $B_{i} V$ with a penalty $\frac{λ}{2} {(a_{i})}^{2}$ on the coeffcients, i.e. we optimize

where $B_{i}$ are basis functions evaluated at the corresponding timepoints.

While our algorithms for finding $W$ do not depend on the density of the grid, it is important to highlight that with the growing $T$ the norm of the solution $W_{T}$ grows to infinity and $W_{T} / \sqrt{T} \to W_{0}$ , for some fixed $W_{0}$ (See Appendix A). In order to obtain comparable results for different $T$ , parameters $λ$ need to be adjusted to this scale (for example, by modifying the penalty to $λ \sqrt{T} ∥ W ∥_{*}$ ).

The algorithm converges at a rate $\frac{1}{t}$ , where $t$ is the number of iterations (this can be established by an elementary extension of the proofs in Mazumder et al. (2010)) as we demonstrate in Appendix A. Each step of our algorithm requires matrix multiplications and an SVD. Thus the complexity of each step is $O (m i n (K n^{2}, n K^{2}))$ . For EM-based algorithms, the minimization step requires multiple matrix multiplications and therefore we expect a larger constant in each step. For practical use, we report computing times on synthetic data in Section 5.

3.5. $l_{0}$ regularization

While the nuclear norm relaxation (6) is motivated by making the problem convex, Mazumder et al. (2010) argue that in many cases it can also give better results than the rank constraint. They draw an analogy to the relation between best-subset regression $(ℓ_{0})$ regularization) and LASSO ( $ℓ_{1}$ , regularization as in Tibshirani (1996); Friedman et al. (2001)). In LASSO, by shrinking model parameters, we allow more parameters to be included, which can potentially improve predictions if the true subset is larger than the one derived through $ℓ_{0}$ regularization. In the case of (6), shrinking the nuclear norm allows us to include more dimensions of $W$ again potentially improving the prediction if the true dimension is high.

In theory, estimating too many components can lead to problems if the true dimension is small. In such cases, shrinking may cause the inclusion of unnecessary dimensions emerging from the noise. To remedy this issue, following the analogy with penalized linear regression, we may consider another class of penalties. In particular, we may consider coming back to the rank constraint by modifying the nuclear norm penalty (6) to $ℓ_{0}$ . We define $∥ W ∥_{0} = r a n k (W)$ . The problem

\underset{W}{m i n} \frac{1}{2} {(P_{Ω} (Y - W B^{'}))}_{F}^{2} + λ ∥ W ∥_{0},

(9)

has a solution $W = 𝒮_{λ}^{H} (Y B)$ , where $𝒮_{λ}^{H} (X) = U D_{λ}^{H} V^{'}$ and $X = U D V^{'}$ is the SVD of $X$ . We refer to $𝒮_{λ}^{H} (X)$ as the hard singular value thresholding (hard SVT) of $X$ . We use Algorithm 2 to find the hard SVT of $Y B$ .

Algorithm 2.

1. Initialize

W_{λ_{i}}^{o l d}

with solutions

{\tilde{W}}_{λ_{i}}

from Soft-Longitudinal-Impute

2. For each

λ_{1} > λ_{2} > \dots > λ_{k}

(a) Repeat:

i. Compute

W^{n e w} \leftarrow S_{λ_{i}}^{H} ((P_{Ω} (Y) + P_{Ω}^{⊥} (W^{o l d} B^{'})) B)

ii. If

\frac{{(W^{n e w} - W^{o l d})}_{F}^{2}}{{(W^{o l d})}_{F}^{2}} < ε

exit

iii. Assign

W^{o l d} \leftarrow W^{n e w}

(b) Assign

{\hat{W}}_{λ_{i}} \leftarrow W^{n e w}

3. Output

{\hat{W}}_{λ_{1}}, {\hat{W}}_{λ_{2}}, \dots, {\hat{W}}_{λ_{k}}

The problem (9) is non-convex, however, by starting from the solutions of Algorithm 1 we explore the space around the global minimum of the $ℓ_{1}$ , version of the problem. Ge et al. (2016) show that this strategy is successful empirically.

In practice, the continuous parameterization via $λ$ leads to similar results as rank-truncation methods for different ranks. There are two advantages of the former. First, it allows for a continuum of warm starts and, as such, it is a natural post-processor for Soft-Impute solutions. Second, it can be a basis for further generalizations with penalties between $ℓ_{0}$ and $ℓ_{1}$ (Mazumder et al., 2010). However, if the true rank can be estimated with high confidence and the signal is strong, then a rank-truncated method can lead to better predictive performance.

3.6. Dense grid and proximal gradient descent

Growing grid size parameter $T$ improves the accuracy of approximation of the data on the grid, but benefits in approximation are negligible when $T$ becomes sufficiently large. However, we show that $T \to \infty$ also leads to a continuous version of the algorithm, equivalent to the proximal gradient descent method applied to the initial formulation of the problem (1) with nuclear norm penalty.

We start by observing that in Algorithm 1 the Step 2(a)(i) depends on the grid and requires multiplication with all evaluations of the basis on that grid. We can rewrite the expression in $S_{λ_{i}}$ such that only a sparse number of evaluations of the basis is required. Indeed, we write

(P_{Ω} (Y) + P_{Ω}^{⊥} (W^{o l d} B^{'})) B = (P_{Ω} (Y) - P_{Ω} (W^{o l d} B^{'}) + W^{o l d} B^{'}) B = (P_{Ω} (Y - W^{o l d} B^{'})) B + W^{o l d}

(10)

and to compute $W^{o l d} B^{'}$ on $Ω$ in (10) we only need $W^{o l d}$ and an evaluation of the basis on gridpoints corresponding to $Ω$ . Therefore, for computing $(P_{Ω} (Y - W^{o l d} B^{'})) B$ we also only need $B$ evaluated on gridpoints with a nonzero number of observations. Now, if we increase the number of grid points to infinity we derive a solution where the basis needs to be evaluated only at observed timepoints.

Since $B$ is orthogonal, when $T \to \infty$ the product $(P_{Ω} (Y - W^{o l d} B^{'})) B$ converges to 0 and the update in Step 2(a)(i) in Algorithm 1 also converges to 0 at the rate $1 / T$ . Therefore, in order to obtain comparable results for different $T$ we need to scale the update direction $(P_{Ω} (Y - W^{o l d} B^{'})) B$ with $α / T$ , for some $α > 0$ . We refer to $α$ as the step size.

We show that at the limit $T \to \infty$ our approach is equivalent to proximal gradient descent. Let us define the loss function as in (1), i.e.

f (W) = \sum_{i = 1}^{N} \sum_{j = 1}^{n_{i}} {| y_{i, j} - w_{i}^{'} b (τ_{i, j})) |}^{2},

where $τ_{i, j}$ are timepoints, $y_{i, j}$ are observations. We optimize the loss function $f$ with a nuclear norm penalty for some $λ > 0$

where $W = {(w_{1}^{'}, w_{2}^{'}, \dots, w_{N}^{'})}^{'}$ . To solve problem (11) we define

G_{i} = \frac{\partial}{\partial w_{i}} f (W) = - 2 \sum_{j = 1}^{n_{i}} (y_{i, j} - w_{i}^{'} b (τ_{i, j})) b^{'} (τ_{i, j})

and the gradient of $f$ takes form

Note that $\nabla_{W} f (W)$ is corresponding to $(P_{Ω} (Y - W^{o l d} B^{'})) B$ in (10) with infinitely dense grid. For the update step in the proximal gradient descent method, we use step size $α > 0$ and singular value thresholding with a threshold $α λ$ as recommended in Hastie et al. (2015b). We describe the full procedure in Algorithm 3.

Algorithm 3.

Grid-free soft-Longitudinal-Impute

1. Initialize

W^{o l d}

with zeros

2. For each

λ_{1} > λ_{2} > \dots > λ_{k}

(a) Repeat:

i. Compute

W^{n e w} : = S_{α λ} (W^{o l d} - α \nabla_{W} f (W^{o l d}))

ii. If

\frac{{(W^{n e w} - W^{o l d})}_{F}^{2}}{{(W^{o l d})}_{F}^{2}} < ε

exit

iii. Assign

W^{o l d} \leftarrow W^{n e w}

(b) Assign

{\hat{W}}_{λ_{i}} \leftarrow W^{n e w}

3. Output

{\hat{W}}_{λ_{1}}, {\hat{W}}_{λ_{2}}, \dots, {\hat{W}}_{λ_{k}}

Nesterov (2013) showed sufficient conditions for convergence of Algorithm 3. If $α \in (0,1 / L]$ where $L$ is the Lipschitz constant of $\nabla_{w} f$ , then the algorithm converges at the rate $1 / t$ , where $t$ is the number of steps.

3.7. A link between the reduced-rank model and Soft-Longitudinal-Impute

Intuitively, we might expect similarity between the principal directions derived using the probabilistic approach (3) and their counterparts derived from the SVT-based approach. We investigate this relation by analyzing the behavior of SVT for matrices sampled from the probabilistic model given by (3).

For simplicity, let us assume that $μ = 0$ and the data is fully observed on a grid $G$ of $T$ timepoints. Assume that observations $i \in {1, 2, \dots, N}$ come from the mixed-effect model

\begin{matrix} y_{i} ∣ w_{i} ~ 𝒩 (B w_{i}, σ^{2} I_{T}), \\ w_{i} ~ 𝒩 (0, Σ), \end{matrix}

(12)

where $Σ$ is an unknown covariance matrix of rank $q < K$ and variables $(w_{i}, y_{i})$ are independent. By the spectral decomposition theorem, we decompose $Σ = V Λ V^{'}$ , where $V$ is a $K \times K$ orthogonal and $Λ$ is a diagonal $K \times K$ matrix with $q$ positive coefficients in decreasing order. Since $y_{i}$ and $w_{i}$ are independent, the distribution (12) can be rewritten as

The model (13) is a factor model with uncorrelated errors and $q$ factors — the first $q$ columns of $B V$ .

The following theorem constitutes a link between the mixed-effect model and SVT (Theorem 9.4.1 in Mardia et al. (1980)),

Theorem 1.

Let $Y = {(y_{1}, \dots, y_{N})}^{'}$ be the observed matrix and let $S_{σ^{2}} (Y B) = U D_{σ^{2}} Q^{'}$ . Then, $(D_{σ^{2}}, Q)$ is the maximum likelihood estimator of $(Λ, V)$ .

Theorem 1 implies that in the fully observed case the estimator of $(Λ, V)$ converges to the true parameter at the rate $1 / \sqrt{n}$ , as the MLE.

Factor analysis methods give not only estimates of $Λ$ and $V$ but also predictors of the individual latent variables $W = (w_{1}, \dots, w_{N})$ . In the multivariate analysis literature, there are multiple ways to estimate factor scores, i.e., a matrix $A$ such that $X ~ A D_{σ^{2}} V^{'}$ . Most notably researchers introduce Spearman’s scores, Bartlett’s scores, and Thompson’s scores (Kim and Mueller, 1978). Simply taking $W = U$ as the estimate of the scores corresponds to the solution of (7) as long as $λ = σ^{2}$ .

In Theorem 1 we assume that $σ^{2}$ is known, which is rarely the case in practice. However, the likelihood of $(V, σ)$ can be parametrized by $σ$ , and we can find the optimal solution analytically. This corresponds to minimizing (7) for different $λ$ .

A similar analogy is drawn between the classical principal component analysis and probabilistic principal component analysis by Tipping and Bishop (1999) and James et al. (2000).

4. Multivariate longitudinal data

In practice, we are often interested in the prediction of a univariate process in the context of other longitudinal measurements and covariates constant over time. Examples include the prediction of disease progression given patient’s demographics, data from clinical visits at which multiple blood tests or physical exams are taken, or measurements that are intrinsically multivariate such as gene expression or x-rays collected over time. The growing interest in this setting stimulated research in latent models (Sammel and Ryan, 1996) and multivariate longitudinal regression (Gray and Brookmeyer, 1998, 2000). Diggle et al. (2002) present an example case study in which longitudinal multivariate modeling enables estimation of joint confidence region of multiple parameters changing over time, shrinking the individual confidence intervals.

In this section, we present an extension of our univariate framework to multivariate measurements sparse in time. We explore two cases: (1) dimensionality reduction, where we project sparsely sampled multivariate processes to a small latent space, and (2) linear regression, where we use a multivariate process and covariates constant over time for the prediction of a univariate process of interest. To motivate the methodology, we start with a regression involving two variables observed over time.

4.1. Illustrative example: Univariate regression

Suppose that the true processes are in a low-dimensional space of some continuous functions (e.g., splines) and that we observe them with noise. More precisely, let

x_{i} = B w_{i} + e_{x, i} a n d y_{i} = B u_{i} + e_{y, i},

(14)

for $1 \leq i \leq N$ , where $x_{i}, y_{i}, e_{x, i}, e_{y, i}$ are $T \times 1$ vectors, $w_{i}, u_{i}$ are $K \times 1$ vectors and $B$ is a $T \times K$ matrix of $K$ splines evaluated on a grid of $T$ points. We assume zero-mean independent errors $e_{x, i}, e_{y, i}$ with fixed covariance matrices $Σ_{X}, Σ_{Y}$ respectively, and that the true processes underlying the observed $x_{i}$ and $y_{i}$ follow a linear relation in the spline space, i.e.

where $A$ is an unknown $K \times K$ matrix.

Let $X, Y, U, W$ be matrices formed from $x_{i}^{'}, y_{i}^{'}, w_{i}^{'}, u_{i}^{'}$ stacked vertically. From (15) we have $U = W A$ , while (14) implies

X = W B^{'} + E_{X} a n d Y = {W A B}^{'} + E_{Y},

(16)

where $E_{X}, E_{Y}$ are matrices of observation noise. Without loss of generality, we assume that $B$ is orthonormal. We have full freedom to choose the basis $B$ , since any basis can be orthogonalized using, for example, the singular value decomposition.

Due to the orthogonality of $B$ and after multiplying both expressions in (16) by $B$ we can map the problem to the classical multivariate errors-in-variables models. Let

\tilde{X} = X B = W + {\tilde{E}}_{X} and \tilde{Y} = Y B = W A + {\tilde{E}}_{Y},

(17)

where ${\tilde{E}}_{X} = E_{X} B$ and ${\tilde{E}}_{Y} = E_{Y} B$ . In errors-in-variables models it is assumed that the parameters $W$ and $A$ are unknown, and are to be estimated. Both regressors and responses are measured with noise (here ${\tilde{E}}_{X}$ and ${\tilde{E}}_{Y}$ ). The parameter $W$ can be interpreted as a latent representation of both $\tilde{X}$ and $\tilde{Y}$ .

The problem of estimating parameters in (17) has been extensively studied in literature dating back to Adcock (1878). Two main methods for estimating parameters in (17) are maximum likelihood estimates (MLE) and generalized least squares estimators (GLSE). The estimators in MLE are derived under the assumption of certain distributions of errors. In GLSE, the only assumption about errors is that they are independent, zero-mean, and they have a common covariance matrix. Then, $\tilde{X} - W$ and $\tilde{Y} - W A$ are zero-mean and estimates for $W$ and $B$ can be found by optimizing some norm of these expressions. Gleser and Watson (1973) show that in the no-intercept model for $\tilde{X}$ and $\tilde{Y}$ of the same size (as in our case) and under the assumption of Gaussian errors, MLE and GLSE give the same estimates of $A$ , if GLSE is derived for the Frobenius norm.

In this work, we focus on the GLSE approach as it can be directly solved in our matrix factorization framework and we find it easier to deploy and extend in practice. To account for different magnitudes of the noise in $X$ and $Y$ , we consider the optimization problem with weights

\underset{A, W}{m i n i m i z e} \frac{1}{σ_{X}^{2}} ∥ X B - W ∥_{F}^{2} + \frac{1}{σ_{Y}^{2}} ∥ Y B - W A ∥_{F}^{2},

(18)

where $σ_{X}, σ_{Y} > 0$ . Let $γ = σ_{X}^{2} / σ_{Y}^{2}$ . Then (18) can be transformed to

\underset{A, W}{m i n i m i z e} ∥ (X B : γ Y B) - W (I : γ A) ∥_{F}^{2},

(19)

where $(\cdot : \cdot)$ operator stacks horizontally matrices with the same number of rows. To solve (19), we show that the SVD of $(X B : γ Y B)$ truncated to the first $K$ dimensions, can be decomposed to $W (I : γ A)$ . Let $U S V^{'}$ be the SVD of $(X B : γ Y B)$ , with

U = (\begin{matrix} U_{1} & U_{2} \end{matrix}), S = (\begin{matrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{matrix}) and V = (\begin{matrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{matrix}),

where each $S_{i j}$ and $V_{i j}$ is a $K \times K$ matrix for $1 \leq i, j \leq 2$ and each $U_{i}$ is a $N \times K$ matrix for $1 \leq i \leq 2$ . By Lemma 2.1 and Lemma 2.2 in (Gleser, 1981) matrix $V_{11}$ is almost surely nonsingular. Therefore, $V_{11}^{- 1}$ almost surely exists and we can transform the decomposition such that $(I : γ A) = {(V_{11}^{'})}^{- 1} (\begin{matrix} V_{11}^{'} & V_{21}^{'} \end{matrix})$ and $W = U_{1} S_{11} V_{11}^{'}$ , solving (19).

For partially observed data, if they are very sparse, it might be essential to constrain the rank of the solution. The partially-observed and rank-constrained version of the problem (19) takes the form

\underset{A, W}{minimize ∥} P_{Ω}^{\sim} ((X : γ Y) - W (B^{'} : γ A B^{'})) ∥_{F}^{2},

$subject to rank (W (B^{'} : γ A B^{'})) = k$ , where $k$ is the desired rank of the solution and $P_{\tilde{Ω}}$ is a projection on

\tilde{Ω} = {(q, r) : (q, r) \in Ω o r (q, r - T) \in Ω} .

As previously, for an unknown $k$ we can introduce a rank penalty using the nuclear norm

\underset{A, W}{m i n i m i z e} {(P_{Ω}^{\sim} ((X : γ Y) - W (B^{'} : γ A B^{'})))}_{F}^{2} + λ {(W (B^{'} : γ A B^{'}))}_{*} .

(20)

The algorithm in the general case of multiple processes is derived in Section 4.2.

Although we motivate the problem as a regression of $Y$ on $X, X$ and $Y$ are symmetric in (19). The low-rank matrix $W$ is, therefore, a joint low-rank representation of matrices $X$ and $Y$ and thus our method can be seen as a dimensionality reduction technique or as a latent space model. In Section 4.2 we extended this idea to a larger number of variables. In Section 4.3 we discuss how this approach can be used for regression.

The linear structure of (16) allows us to draw analogy not only to the errors-in-variables models but also to the vast literature on canonical correlation analysis (CCA), partial least squares (PLS), factor analysis (FA), and linear functional equation (LFE) models. Borga et al. (1997) show that solutions of CCA and PLS can also be derived from the SVD of stacked matrices, as we did with $(X B : γ Y B)$ in (19). Gleser (1981) thoroughly discusses the relation between errors-in-variables, FA, and LFE.

Finally, the method of using SVD for stacked matrices has also been directly applied in the recommender systems context. Condli et al. (1999) showed that for improving the prediction of unknown entries in some partially observed matrix $Q$ one might consider a low-rank decomposition of ( $Q : R$ ), where $R$ are additional covariates for each row, e.g., demographic features of individuals.

4.2. Dimensionality reduction

Suppose that for every individual we observe multiple variables varying in time (e.g. results of multiple medical tests at different times in a clinic) and we want to find a projection on $R^{d}$ maximizing the variance explained for some $d \in N$ . This projection would correspond to characterizing patients by their progression trends of multiple metrics simultaneously.

We extend the equation (19) to account for a larger number of covariates, and we do not impose decomposition of the solution yet. We formulate the following optimization problem

\underset{W}{a r g m i n} {((X_{1} B : X_{2} B : \dots : X_{p} B) - W)}_{F}^{2} + λ ∥ W ∥_{*},

(21)

where $X_{i}$ are some $N \times T$ matrices corresponding to the processes measured on a grid of $T$ points, $B$ is a basis evaluated on the same grid and orthogonalized (a $T \times K$ matrix), and $W$ is a $N \times p K$ matrix.

Let $B = I_{p} \otimes B$ be the Kronecker product of $p \times p$ identity matrix and $B$ , i.e. a $p T \times p K$ matrix with $B$ stacked $p$ times on the diagonal, and let $X = (X_{1} : X_{2} : \dots : X_{p})$ . Matrix $B$ is orthogonal and therefore results developed in Section 2.3 apply here. In particular, singular value thresholding solves

and we can use Algorithm 1 for solving

\underset{W}{a r g m i n} {(P_{Ω} (X - W B^{'}))}_{F}^{2} + λ ∥ W ∥_{*},

(22)

where $P_{Ω}$ is the projection on observed indices $Ω$ .

The optimization problem (21) can be further extended. For example, we can choose a different basis for each process or scale individual contributions of processes using scaling factors. By modifying Equation (22) correspondingly, we can extend our solution to these cases.

4.3. Regression

In practice, we are often interested in the relationship between the progression of an individual parameter (e.g., cancer growth) and some individual features constant over time (e.g., demographics) or progressions of other covariates (e.g., blood tests, vitals, biopsy results). We show how our framework can be used to exploit relations between curves in order to improve the estimation of the progression of the main parameter of interest.

We start with a regression problem with fixed covariates and sparsely observed response trajectories. Assume that for each subject $1 \leq i \leq N$ we observe covariates $x_{i} \in R^{d}$ and samples from trajectories $y_{i} \in R^{n_{i}}$ following the distribution

where $B_{i}$ is the matrix of evaluations of the basis on the same evaluation points as $y_{i}$ (See Section 2.2). We map the problem to the matrix completion framework as in Section 3. Let $X$ be a $N \times d$ matrix of observed covariates, $Y$ be a sparsely observed $N \times T$ matrix of trajectories, and $B$ be a $T \times K$ matrix representing a basis of $K$ splines evaluated on a grid of $T$ points. We consider the optimization problem

where $A$ is a $d \times K$ matrix and $P_{Ω}$ is a projection on the observed indices $Ω$ . To solve (23) we propose an iterative Algorithm 4.

Algorithm 4.

1. Initialize

A

zeros

2. Repeat till convergence:

(a) Impute regressed values

\hat{Y} = P_{Ω} (Y) + P_{Ω}^{⊥} (X A B^{'})

(b) Compute

A^{n e w} \leftarrow {(X^{'} X)}^{- 1} X^{'} \hat{Y} B

\frac{{(A^{n e w} - A)}_{F}^{2}}{∥ A ∥_{F}^{2}} < ε

exit

(d) Assign

A \leftarrow A^{n e w}

3. Return

A

Suppose we want to incorporate other variables varying in time for the prediction of the response process. We can directly apply the method proposed in Section 4.2 and model the response and regressors together. However, it might be suboptimal for prediction, as it optimizes the least-squares distance in all variables rather than only the response. This difference is analogous to the difference between the regression line of some univariate $y$ on independent variables $x_{1}, \dots, x_{p}$ and the first principal component analysis of $(y, x_{1}, \dots, x_{p})$ , used for the prediction of $y$ . While the first one minimizes the distance to $y$ , the latter minimizes the distance to $(y, x_{1}, \dots, x_{p})$ , which is usually less efficient for predicting $y$ .

Alternatively, we can use methodology from Section 4.2 only for covariates. The solution of (22) can be decomposed into $W = U S V^{'}$ , and we regress $Y$ on $U$ as in (23). Let $U$ be an $N \times d_{2}$ orthogonal matrix derived from $(X_{1}, \dots, X_{p})$ , where $d_{2}$ is the numbers of latent components. We search for a $d_{2} \times K$ matrix $A$ solving

\underset{A}{m i n i m i z e ∥} P_{Ω} (Y - U A B^{'}) ∥_{F}^{2} + λ ∥ A ∥_{*},

(24)

where $P_{Ω}$ is a projection on a set of observed coefficients. We propose a two-step iterative procedure (Algorithm 5).

Algorithm 5.

Sparse-Longitudinal-Regression

1. For each

λ_{1}, λ_{2}, \dots, λ_{k}

● Get

A_{λ_{i}}

by solving the regression problem (24) with

Y, U, λ_{i}

(Algorithm 4)

2. Return

A_{λ_{1}}, A_{λ_{2}}, \dots, A_{λ_{k}}

5. Simulations

We illustrate the properties of the multivariate longitudinal fitting in a simulation study. First, we generate curves with quickly decaying eigenvalues of covariance matrices. Then, we compare the performance of the methods in terms of the variance explained by the predictions.

5.1. Data generation

We seek to simulate data similar to observations in Section 6. We want to parametrize the processes to draw insights depending on sample size and relation between variables. Let $G$ be a grid of $T$ equidistributed points and let $B$ be a basis of $K$ spline functions evaluated on the grid $G$ . We simulate three $N \times K$ matrices using the same procedure $𝒢 (r, K, N)$ , where $r, \in \underset{+}{R^{K}}$ :

Define the procedure $ℳ_{(r)}$ generating symmetric matrices $K \times K$ for a given vector $r \in R_{+}^{K} :$
1. Simulate $K \times K$ matrix $S$
2. Use SVD to decompose $S$ to $U D V'$
3. Return $Q = V d i a g [r] V^{'}$ , where $d i a g [r]$ is a diagonal matrix with $r$ on the diagonal
Let $Σ = ℳ_{(r)}$ and $μ = 𝒩_{(0, I_{K})}$ .
Draw
$N$
vectors
$v_{i}$
from the distribution
Return a matrix with rows ${(v_{i})}_{1 \leq i \leq N}$ .

We chose values of $r$ close to values on the diagonal of the spectral decomposition of covariance structure estimated in Section 6. We define

r = [1, 0.39, 0.19, e x p (- 5), \dots, e x p (- K - 1)] .

Let $X_{1}, X_{2}, Z$ be generated following the procedure $𝒢 (r, K, N)$ and let $Y = Z + X_{1} + X_{2}$ . We consider $X_{1}, X_{2}$ , and $Y$ as coefficients in a spline space $B$ . We derive corresponding functions by multiplying these matrices by $B^{'}$ , i.e. $X_{f, 1} = X_{1} B^{'}, X_{f, 2} = X_{2} B^{'}$ and $Y_{f} = Y B^{'}$ . We set $N = 100, K = 7$ and $T = 31$ .

In order to reflect a realistic sampling of curves, we use timepoints at which subjects in the data study were observed (see Section 6). As we are particularly interested in predictive performance, we select subjects with at least 4 observed timepoints to enable leave-one-out estimation. For each simulated curve we randomly select a subject and use grid-aligned timepoints at which this subject visited the clinic. We acknowledge that specifics of sampling might affect relative performance of methods, however a thorough analysis of sampling is beyond the scope of our work. Here we only focus on the case directly derived from our data study.

Each observed element of each matrix $X_{f, 1}, X_{f, 2}$ and $Y$ is drawn with Gaussian noise with mean 0 and standard deviation 0.25. Our goal is to recover $Y$ from such sparsely sampled noisy data.

5.2. Methods

We compare Soft-Longitudinal-Impute (SLI) defined in Algorithm 1, Proximal-Gradient (PG) method defined in Algorithm 3, the fPCA procedure (James et al., 2000), implemented by Peng and Paul (2009), fdapace (Zhou et al., 2022), and face (Xiao et al., 2018). The first three algorithms require a set of basis functions as one of the inputs. In all cases, we use a polynomial $B$ . In SLI and PG we also need to specify the tuning parameter $λ$ , while in fPCA we need to choose the rank $R$ . We use cross-validation to choose $λ$ and $R$ by optimizing for the prediction error on held-out (validation) observations.

We divide the observed coefficients into training (81%), validation (9%), and test (10%) sets. We choose the best parameters of the three models on the validation set and then retrain on the entire training and validation sets combined. We compute the error by taking the mean squared Frobenius distance between $Y$ and estimated $\hat{Y}$ , i.e.

MSE (\hat{Y}) = \frac{1}{T | S |} \sum_{i \in S} {‖ Y_{i} - {\hat{Y}}_{i} ‖}_{F}^{2}

(25)

on the test set $S$ .

We train the algorithms with all combinations of parameters: regularization parameter for SLI and PG procedures $λ \in {10, 15, 20, \dots, 50}$ and the rank for fPCA procedure $d \in {2, 3, 4}$ . We compare the five methods fPCA, fdapace, face, PG, and SLI, to the baseline null model which we define as the population means across all visits.

For testing our multivariate approach, we model $(Y, X_{1}, X_{2})$ together and we compare our Sparse-Longitudinal-Regression (SLR) with MFACE package (Li et al., 2020). We compare compute time and predictive performance in terms of recovering of $Y$ .

5.3. Results and discussion

The SLI and PG methods achieve superior performance at a good running time as presented in Figure 4. Fdapace (Zhou et al., 2022) outperforms all methods in terms of speed, but it performs worse in terms of the mean squared error.

Results indicate that both our methods, SLI and PG, should be considered on par with fdapace and face for practical applications since each method has its own advantages. A solution derived from fdapace might also be used as a warm start for other algorithms to speed them up. We also observed that the performance gap decreases with the growing number of curves.

One additional advantage of SLI and PG is that they return a smooth space of solutions, parametrized with $λ$ . In Figure 5, we present the estimated rank and cross-validation error of one of the simulation runs. Compared to fixed-rank alternatives, our methods allow for tuning the predictive accuracy and the rank simultaneously in a smooth manner.

In the multivariate case, both mface and our SLR successfully recovered addional information from variables $X_{1}$ and $X_{2}$ as their NMSE in modeling $Y$ was better compared to univariate counterpart methods (SLI and face respectively). SLR outperformed mface in terms of NMSE, but required more time to converge. We conclude that SLR should also be considered a viable candidate method for multivariate sparsely sampled functions on function regression.

6. Data study

We present an application of our methods for understanding how longitudinal changes in gait patterns relate to subtypes of neurological disorders. First, we discuss how practitioners collect the data and use them to guide the decision process. Next, we describe our dataset and present how our methodology can improve current workflows.

In clinical gait analysis, at each visit movement of a child is recorded using optical motion capture. Optical motion capture allows estimating 3D positions of body parts using a set of cameras tracking markers positions on the subject’s body. A set of at least three markers is placed at each analyzed body segment so that its 3D position and orientation can be identified uniquely. These data are then used to determine the relative positions of body segments by computing the angle between the corresponding planes. Typically it is done using a biomechanical model for enforcing biomechanical constraints and improving accuracy.

In gait analysis practitioners usually analyze the movement pattern of seven joints in the lower limbs: ankle, knee, hip in each leg, and pelvis (Figure 6). Each joint angle is measured in time. For making the curves comparable between the patients, usually, the time dimension is normalized to the percentage of the gait cycle, defined as the time between two foot strikes.

Figure 6: — Four joints were measured in clinical gait analysis: pelvis, hip, knee, and ankle. Each joint can be measured in three planes: sagittal plane (top row), frontal plate (middle row), and transverse plane (bottom row).

While trajectories of joint angles are a piece of data commonly used by practitioners for taking decisions regarding treatment, their high-dimensional nature hinders their use as a quantitative metric of gait pathology or treatment outcome. This motivates the development of univariate summary metrics of gait impairment, such as questionnaire-based metrics Gillette Functional Assessment Walking Scale (FAQ) (Gorton III et al., 2011), Gross Motor Function Classification System (GMFCS) (Palisano et al., 2008) and Functional Mobility Scale (FMS) (Graham et al., 2004), or observational video analysis scores such as Edinburgh Gait Score (Read et al., 2003).

One of the most widely adopted quantitative measurements of gait impairments in pediatrics is the gait deviation index (GDI) (Schwartz and Rozumalski, 2008). GDI is derived from joint angle trajectories and measures the deviation of the first ten singular values from the population average of the typically developing population. GDI is normalized in such a way that 100 corresponds to the mean value of typically developing children, with the standard deviation equal 10. It is observed to be highly correlated with questionnaire-based methods. Thanks to its deterministic derivation from the motion capture measurements this method is considered more objective than questionnaires.

In medical practice, GDI has been adapted as a metric for diagnosing the severity of impairment, and it constitutes an integral part of the clinical decision-making process and evaluation of treatment outcomes. However, in order to correctly identify the surgery outcome, it is crucial to understand the natural progression of GDI. In particular, a positive outcome of surgery might be negligible when compared to natural improvement during puberty. Similarly, a surgery maintaining the same level of GDI might be incorrectly classified as a failure, if the decline in the patient’s function over time is not accounted for.

Methods introduced in this article can be used to approximate individual progressions of GDI. First, we present how a prediction can be made solely based on the patient’s GDI history and the histories of other patients. Next, using our regression procedure, we predict GDI trajectories using other sparsely observed covariates, namely O₂ expenditure and walking speed.

6.1. Materials and methods

We analyze a dataset of Gillette Children’s Hospital patients visiting the clinic between 1994 and 2014, ages ranging from 4 to 19 years, mostly diagnosed with Cerebral Palsy. The dataset contains 84 visits of 36 patients without gait disorders and 6066 visits of 2898 patients with gait pathologies.

Motion capture data was collected at 120Hz and joint angles in time were extracted. These joint angles were then normalized in time to the gait cycle. Points from these curves were then subsampled (51 equidistributed points) for the downstream analysis established in the clinic.

In the dataset which we received from the hospital, for each patient we know their birthday and disease subtype. From each visit, we observe the following variables: patient ID, time of the visit, GDI of the left leg, GDI of the right leg, walking speed, and O₂ expenditure. Other clinical variables that we received were not included in this study. Walking speed is related to information we lose during the normalization of the gait cycle in time. O₂ expenditure is a measure of a subject’s energy expenditure during walking. Pathological gait is often energy inefficient and reduction of O₂ expenditure is one of the objectives of treatments. Finally, GDI is computed for two legs while in many cases the neurological disorder affects only one limb. To simplify the analysis, we focus on the more impaired limb by analyzing the minimum of the left and the right GDI.

Our objective is to model individual progression curves. We test two methods: functional principal components (fPCA) and Soft-Longitudinal-Impute (SLI). We compare the results to the null model – the population mean across all visits (mean). In SLR, we approximate GDI using latent variables of sparsely observed covariates O₂ expenditure and walking speed, following the methodology from Section 4.3.

Let us denote the test set as $Ω \subset {1, 2, \dots, N} \times {1, 2, \dots, T}$ . We validate each model $M$ on heldout indices by computing the mean squared error as defined in (25). We select the parameters of each of the three methods using cross-validation, using the same validation set.

In our evaluation procedure, for the test set, we randomly select 5% of observations of patients who visited the clinic at least 4 times. We chose the cutoff at 4 in order to have at least 3 observations to fit the curve, and keep 1 observation for testing. We verified with the collaborators at the hospital that three visits are reasonable for the practical use of our methods. We split the remaining 95% of observations into training and validation sets in 90: 10 proportion. We also ran the procedures using a smaller test set and did not observe any relevant difference in results, likely because the training set is already relatively large for the task. We train the algorithms with the following combinations of parameters: the regularization parameter for SLI and SLR procedures $λ \in {0, 0.1, 0.2, \dots, 2.0}$ and the rank for fPCA procedure $d \in {2, 3, 4, \dots, K}$ . We define the grid of $T = 51$ points. We repeat the entire evaluation procedure 20 times.

6.2. Results

Compared to the null model, fPCA and SLI explain around ~30% of the variance. We present detailed results in Table 1. SLR using additional predictors, O₂ expenditure and walking speed, yielded a mean MSE of 0.68 with a standard deviation of 0.8. We conclude that O₂ expenditure and walking speed provide additional information for the prediction of GDI progression.

Table 1:

Distribution of cross-validated MSE of the two methods: functional principal components (fPCA) and Soft-Longitudinal-Impute (SLI).

	mean	sd
fPCA	76.4	11.9
SLI	74.7	7.5

Both fPCA and Sparse-Longitudinal-Impute provide latent representations of patients’ progression curves. We analyze the singular value vectors from our SVD solution which we refer to as principal components. In the left plot in Figure 3 we show the first two estimated principal components. We found that the first component estimates the change between GDI before and after age of 20. The second component model changes around the age of 10 and around the age of 18. In the right plot in Figure 3, by adding a principle component to the population mean curve, we illustrate how differences in the first component are reflected in the patient’s trajectory. By visual investigation of curves returned by our Sparse-Longitudinal-Impute and by fPCA we found similar trends in the first two components.

Since our SVD decomposition defines a low-rank representation of the progression trends, we can also use it to gain insights into progression in different groups of patients. In Cerebral Palsy, we divide paralysis into subtypes depending on which limbs are affected: monoplegia (one leg), diplegia (two legs), hemiplegia (one side of the body), triplegia (three limbs), quadriplegia (four limbs). Hemiplegia is the most prevalent in our population and it might be divided depending on severity, from type I (weak muscles, drop foot) to type IV (severe spasticity). We find differences between trends of progression for different subtypes of paralysis of patients $(F_{6,541} = 17.17, p < 10^{- 15})$ . We illustrate these distributions in Figure 7.

Figure 7: — Progression trends in different subsets of diseases. Negative values of the score, such as most of the quadriplegic group, correspond to individual trends where the first component (the red curve Figure 3 left) is subtracted from the mean (the red curve in Figure 3 right). Positive values of the score, such as most of the hemiplegic group, correspond to individual trends where the first component is added (the green curve in Figure 3 right).

7. Discussion

Results presented in Section 6.2 imply that our Sparse-Longitudinal-Impute and Sparse-Longitudinal-Regression methods can be successfully applied to understand trends of the variability of disease progression. We show how to incorporate progressions of O₂ expenditure and walking speed in the prediction of the progression of GDI. We present how low-rank representation can be leveraged to gain insights into subtypes of impairment.

While a large portion of variance remains unexplained, it is important to note that in practice the individual progression is not accounted for explicitly in the current decision-making process. Instead, practitioners only use the population-level characteristics of the dependence between age and impairment severity. Our model can greatly improve this practice.

Despite successful application, we identify limitations that could be potentially addressed in the extensions of our model. First, the method is meant to capture the natural continuous progression of GDI, while in practice there are many discrete events, such as surgeries that break continuity assumptions and render the mean trajectories less interpretable. Second, our methodology does not address the “cold start problem”, i.e. we do not provide tools for predictions with only one or zero observations. Third, we do not provide explicit equations for the confidence bounds of predicted parameters. Fourth, in our methods, we focus on finding a solution of (7) regardless of the distribution of the data (e.g., the nature of missingness of the data). While we see it as an advantage if distributions are unknown, in certain situations we could leverage the distribution and improve results.

While these and other limitations can constrain the applicability of the method in the current form, they can be addressed using existing techniques of matrix completion. The focus of this paper is to introduce a computational framework rather than build a full solution for all cases. Elementary formulation of the optimization problem as well as the fully-functional $R$ implementation can foster the development of new tools using matrix completion for longitudinal analysis and for mixed-effect models.

In our R package fcomplete available at https://github.com/kidzik/fcomplete, we provide implementations of all algorithms described in this article as well as helper functions for transforming the data, sampling training and test datasets, and plotting functions. For convenience, we also provided an interface for using the fpca package implementing Sparse Functional Principal Components algorithms (James et al., 2000; Peng and Paul, 2009). The analysis was performed on a desktop PC with 64 GB RAM memory and an Intel^® Core^™ Intel(R) Core(TM) i9–10900K CPU @ 3.70GHz, operating on a Ubuntu 18.04 system with R version 4.1.0.

Modeling longitudinal data using matrix completion - PMC

1. Motivation

Figure 1:

Figure 2:

2. Background and related work

2.1. Direct approach

2.2. Linear mixed-effect models

2.3. Low-rank approximations

3. Modeling sparse longitudinal processes using matrix completion

3.1. Notation

3.2. Direct approach

3.3. Low-rank matrix approximation

Figure 3:

3.4. Low-rank approximation with singular value thresholding

Algorithm 1.

3.5. $l_{0}$ regularization

Algorithm 2.

3.6. Dense grid and proximal gradient descent

Algorithm 3.

3.7. A link between the reduced-rank model and Soft-Longitudinal-Impute

Theorem 1.

4. Multivariate longitudinal data

4.1. Illustrative example: Univariate regression

4.2. Dimensionality reduction

4.3. Regression

Algorithm 4.

Algorithm 5.

5. Simulations

5.1. Data generation

5.2. Methods

5.3. Results and discussion

Figure 4:

Figure 5:

6. Data study

Figure 6:

6.1. Materials and methods

6.2. Results

Table 1:

Figure 7:

7. Discussion

Modeling longitudinal data using matrix completion - PMC

1. Motivation

Figure 1:

Figure 2:

2. Background and related work

2.1. Direct approach

2.2. Linear mixed-effect models

2.3. Low-rank approximations

3. Modeling sparse longitudinal processes using matrix completion

3.1. Notation

3.2. Direct approach

3.3. Low-rank matrix approximation

Figure 3:

3.4. Low-rank approximation with singular value thresholding

Algorithm 1.

3.5. l0 regularization

Algorithm 2.

3.6. Dense grid and proximal gradient descent

Algorithm 3.

3.7. A link between the reduced-rank model and Soft-Longitudinal-Impute

Theorem 1.

4. Multivariate longitudinal data

4.1. Illustrative example: Univariate regression

4.2. Dimensionality reduction

4.3. Regression

Algorithm 4.

Algorithm 5.

5. Simulations

5.1. Data generation

5.2. Methods

5.3. Results and discussion

Figure 4:

Figure 5:

6. Data study

Figure 6:

6.1. Materials and methods

6.2. Results

Table 1:

Figure 7:

7. Discussion

3.5. $l_{0}$ regularization