1 Introduction
In many real-world problems we are interested in learning multiple tasks while the training set for each task is quite small. For example, in pharmacological studies, we may be attempting to predict the concentration of some drug at different times across multiple patients. Finding a good regression function for an individual patient based only on his or her measurements can be difficult due to insufficient training points for the patient. Instead, by using measurements across all the patients, we may be able to leverage common patterns across patients to obtain better estimates for the population and for each patient individually. Multi-task learning captures this intuition, aiming to learn multiple correlated tasks simultaneously. This idea has attracted much interest in the literature, and several approaches have been applied to a wide range of domains including medical diagnosis (Bi et al., 2008), recommendation systems (Dinuzzo et al., 2008) and HIV therapy screening (Bickel et al., 2008). Building on the theoretical framework for single-task learning, multi-task learning has recently been formulated by Evgeniou et al. (2006) as a multi-task regularization problem in a vector-valued reproducing kernel Hilbert space.
Several approaches formalizing multi-task learning exist within Bayesian statistics. In hierarchical Bayesian models (Xue et al., 2007; Gelman, 2004), one can view the sharing of the prior's parameters among tasks as a form of multi-task learning where evidence from all tasks is used to infer the parameters. Over the past few years, Bayesian models for multi-task learning have been formalized using Gaussian processes (Yu et al., 2005; Schwaighofer et al., 2005; Pillonetto et al., 2010). In this mixed-effect model, information is shared among tasks by having each task combine a common (fixed-effect) portion and a task-specific portion, each of which is generated by an independent Gaussian process. Our work builds on this formulation, extending it and the associated algorithms in several ways. In particular, we extend the model to include three new aspects. First, we allow the fixed effect to be multimodal so that each task may draw its fixed effect from a different cluster. Second, we extend the model so that each task may be an arbitrarily phase-shifted image of the original time series. This yields our GMT model: the shift-invariant grouped mixed-effect model. Alternatively, our model can be viewed as a probabilistic extension of the Phased K-means algorithm of
Rebbapragada et al. (2009), which performs clustering for phase-shifted time series data, and as a nonparametric Bayesian extension of mixtures of random-effects regressions for curve clustering (Gaffney and Smyth, 2003). Finally, unlike the existing models that require the model order to be set a priori, our extension in the DPGMT model uses a Dirichlet process prior on the mixture proportions so that the number of mixture components is adaptively determined by the data rather than being fixed explicitly. Our main technical contribution is the inference algorithm for the proposed model. We develop details of the EM algorithm for the GMT model and of a variational EM algorithm for the DPGMT model, optimizing the maximum a posteriori (MAP) estimates of the parameters of the models. Technically, the main insights are in estimating the expectation for the coupled hidden variables (the cluster identities and the task-specific portion of the time series) and in solving the regularized least squares problem for a set of phase-shifted observations. In addition, for the DPGMT, we show that the variational EM algorithm can be implemented with the same complexity as the fixed-order GMT without using sampling. Thus the DPGMT provides an efficient model selection algorithm compared to alternatives such as BIC. As a special case our algorithm yields the (infinite) Gaussian mixture model for phase-shifted time series, which may be of independent interest, and which is a generalization of the algorithms of Rebbapragada et al. (2009) and Gaffney and Smyth (2003).
Our model primarily captures regression of time series, but because it is a generative model it can also be used for class discovery, clustering, and classification. We demonstrate the utility of the model in several experiments with both synthetic data and real-world time series data from astrophysics. The experiments show that our model can yield superior results when compared to single-task learning and Gaussian mixture models, especially when each individual task is sparsely and non-synchronously sampled. The DPGMT model yields results that are competitive with model selection using BIC over the GMT model, at much reduced computational cost.
The remainder of the paper is organized as follows. Section 2 provides an introduction to the multi-task learning problem and its Bayesian interpretation and develops the main assumptions of our model. Section 3 defines the new generative model, Section 4 develops the EM algorithm for it, and the infinite mixture extension is addressed in Section 5. The experimental results are reported in Section 6. Related work is discussed in Section 7 and the final section concludes with a discussion and outlines ideas for future work.
2 Preliminaries
Throughout the paper, scalars are denoted using italics, as in $x$; vectors use bold typeface, as in $\mathbf{x}$, and $x_i$ denotes the $i$th entry of $\mathbf{x}$. For a vector $\mathbf{t} = (t_1, \ldots, t_n)^T$ and a real-valued function $f$, we extend the notation for $f$ to vectors so that $f(\mathbf{t}) = (f(t_1), \ldots, f(t_n))^T$, where the superscript $T$ stands for transposition (and the result is a column vector). $k(\cdot,\cdot)$ denotes a kernel function associated to some reproducing kernel Hilbert space (RKHS) $\mathcal{H}_k$ and its norm is denoted as $\|\cdot\|_{\mathcal{H}_k}$. To keep the notation simple, $\|\cdot\|_{\mathcal{H}_k}$ is substituted by $\|\cdot\|$ where the index is not confusing.
2.1 Multi-task learning with kernels
Given a training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathcal{X}$ and $y_i \in \mathbb{R}$, single-task learning focuses on finding a function $f$ that best fits and generalizes the observed data. In the regularization framework, learning amounts to solving the following variational problem (Evgeniou et al., 2000; Schölkopf and Smola, 2002)

$$\min_{f \in \mathcal{H}_k}\; \sum_{i=1}^{n} V(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}_k}^2 \qquad (1)$$
where $V$ is some (typically convex) loss function. The norm $\|f\|_{\mathcal{H}_k}$ relates to a regularity condition on the function, where a large norm penalizes non-smooth functions. The regularization parameter $\lambda$ provides a tradeoff between the loss term and the complexity of the function. Consider a set of $N$ tasks, with the $j$th task given by the examples $\{(x_{ij}, y_{ij})\}_{i=1}^{n_j}$. Multi-task learning seeks to find $f_j$ for each task simultaneously, which, assuming the square loss function, can be formulated as the following regularization problem
$$\min_{f_1, \ldots, f_N}\; \sum_{j=1}^{N} \frac{1}{n_j} \sum_{i=1}^{n_j} \left(y_{ij} - f_j(x_{ij})\right)^2 + \lambda\, J(f_1, \ldots, f_N) \qquad (2)$$
where the penalty term $J$, applying jointly to all the tasks, encodes our prior information on how smooth the functions are, as well as how these tasks are correlated with each other. For example, setting the penalty term to $\sum_{j=1}^{N} \|f_j\|^2$ implies that there is no correlation among the tasks; it further decomposes the optimization functional into $N$ separate single-task learning problems. On the other hand, with a shared penalty, the joint regularization can lead to improved performance. Moreover, we can use a norm in an RKHS with a multi-task kernel $Q$ to incorporate the penalty term (Micchelli and Pontil, 2005). Formally, consider a vector-valued function $\boldsymbol{f}: \mathcal{X} \to \mathbb{R}^N$ defined as $\boldsymbol{f} = (f_1, \ldots, f_N)$. Then Equation (2) can be written as
$$\min_{\boldsymbol{f} \in \mathcal{H}_Q}\; \sum_{j=1}^{N} \frac{1}{n_j} \sum_{i=1}^{n_j} \left(y_{ij} - f_j(x_{ij})\right)^2 + \lambda\, \|\boldsymbol{f}\|_{\mathcal{H}_Q}^2 \qquad (3)$$
where $\|\boldsymbol{f}\|_{\mathcal{H}_Q}$ is the norm in the RKHS with the multi-task kernel $Q$, where $Q((x, j), (x', l))$ measures the similarity between input $x$ for task $j$ and input $x'$ for task $l$. As shown by Evgeniou et al. (2006), the representer theorem gives the form of the solution to Equation (3)
$$f_j(x) = \sum_{l=1}^{N} \sum_{i=1}^{n_l} c_{il}\, Q((x_{il}, l), (x, j)) \qquad (4)$$
with norm $\|\boldsymbol{f}\|_{\mathcal{H}_Q}^2 = \sum_{j,l} \sum_{i,i'} c_{ij}\, c_{i'l}\, Q((x_{ij}, j), (x_{i'l}, l))$. Let $\mathbf{y}$ and $\mathbf{c}$ collect the observations $y_{ij}$ and the coefficients $c_{ij}$ respectively; then the coefficients are given by the following linear system
$$\left(\mathbf{Q} + \lambda \mathbf{I}\right) \mathbf{c} = \mathbf{y} \qquad (5)$$
where $\mathbf{Q}$ is the kernel matrix formed by evaluating the multi-task kernel $Q$ on all pairs of training points.
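The linear system above has the same shape as ordinary kernel ridge regression; a minimal single-task sketch in NumPy (with an illustrative RBF kernel and synthetic data of our choosing, not the paper's multi-task kernel $Q$) is:

```python
import numpy as np

def rbf_kernel(X1, X2, length_scale=1.0):
    """Gaussian (RBF) kernel matrix between two sets of 1-d inputs."""
    d = X1[:, None] - X2[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def fit_kernel_ridge(X, y, lam=0.1):
    """Solve (K + lam*n*I) c = y for the representer coefficients c."""
    n = len(X)
    K = rbf_kernel(X, X)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def predict(X_train, c, X_test):
    """Representer-theorem prediction: f(x) = sum_i c_i k(x_i, x)."""
    return rbf_kernel(X_test, X_train) @ c

rng = np.random.default_rng(0)
X = np.linspace(0, 2 * np.pi, 30)
y = np.sin(X) + 0.1 * rng.standard_normal(30)
c = fit_kernel_ridge(X, y, lam=0.01)
y_hat = predict(X, c, X)
```

The multi-task case has the same structure, with the kernel matrix built from $Q$ over input–task pairs.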
2.2 Bayesian formulation
A Gaussian process is a functional extension of the multivariate Gaussian distribution. In the Bayesian literature, it has been widely used in statistical models by substituting a parametric latent function with a stochastic process with a Gaussian prior (Rasmussen, 2006). More precisely, under the single-task setting a simple Gaussian regression model is given by $y_i = f(x_i) + \epsilon_i$, where $f$'s prior is a zero-mean Gaussian process with covariance function $k$ and $\epsilon_i$ is independent zero-mean white noise with variance $\sigma^2$. Given the data set $\{(x_i, y_i)\}_{i=1}^{n}$, let $K$ denote the kernel matrix with $K_{il} = k(x_i, x_l)$; then the posterior on $f$ is again Gaussian. The predictive distribution for some test point $x^*$ distinct from the training examples is

$$f(x^*) \mid \mathbf{y} \;\sim\; \mathcal{N}\!\left(k_*^T (K + \sigma^2 I)^{-1} \mathbf{y},\;\; k(x^*, x^*) - k_*^T (K + \sigma^2 I)^{-1} k_*\right)$$

where $k_* = (k(x_1, x^*), \ldots, k(x_n, x^*))^T$. Furthermore, under the square loss function, the minimizer of Equation (1) is equal to the expectation of the predictive distribution (Rasmussen, 2006). Finally, a Gaussian process corresponds to an RKHS with kernel $k$ such that
$$k(x, x') = \operatorname{Cov}\!\left(f(x), f(x')\right) \qquad (6)$$
In this way, we can express a prior on functions using a zero-mean Gaussian process (Lu et al., 2008):^1

$$p(f) \propto \exp\!\left(-\tfrac{1}{2}\|f\|_{k}^2\right) \qquad (7)$$

^1 In general, a Gaussian process cannot be thought of as a distribution on the RKHS, because with probability 1 a sample path of the Gaussian process does not belong to the RKHS. However, the equivalence holds between the RKHS and the expectation of a Gaussian process conditioned on a finite number of observations. For more details on the relationship between RKHSs and Gaussian processes we refer the interested reader to Seeger (2004).
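The Gaussian process posterior described above can be sketched numerically; the RBF covariance, the noiseless sine data, and the noise level below are illustrative choices of ours:

```python
import numpy as np

def rbf(a, b, ell=1.0):
    """RBF covariance between two sets of 1-d inputs."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def gp_posterior(X, y, X_star, noise_var=0.01):
    """Posterior mean and pointwise variance of a zero-mean GP at X_star."""
    K = rbf(X, X) + noise_var * np.eye(len(X))
    K_star = rbf(X_star, X)
    mean = K_star @ np.linalg.solve(K, y)
    cov = rbf(X_star, X_star) - K_star @ np.linalg.solve(K, K_star.T)
    return mean, np.diag(cov)

X = np.linspace(0, 2 * np.pi, 25)
y = np.sin(X)
mean, var = gp_posterior(X, y, X)
```

With dense, low-noise observations the posterior mean tracks the data and the posterior variance collapses toward the noise floor, matching the predictive distribution above.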
Applying this framework in the context of multi-task learning, the model is given by $y_{ij} = f_j(x_{ij}) + \epsilon_{ij}$, where the $f_j$ are zero-mean Gaussian processes and $\epsilon_{ij}$ captures i.i.d. zero-mean noise with variance $\sigma^2$. Pillonetto et al. (2010) formalize the connection between the multi-task kernel and the covariance function among the $f_j$ using
$$Q((x, j), (x', l)) = \operatorname{Cov}\!\left(f_j(x), f_l(x')\right) \qquad (8)$$
2.3 Basic Model Assumptions
Given data $\{(\mathbf{t}_j, \mathbf{y}_j)\}_{j=1}^{N}$, the so-called nonparametric Bayesian mixed-effect model (Lu et al., 2008; Pillonetto et al., 2010) captures each task using a sum of an average-effect function $\bar{f}$ and an individual variation $\tilde{f}_j$ for each specific task, $f_j = \bar{f} + \tilde{f}_j$. This assumes that the fixed effect (mean function) is sufficient to capture the behavior of the data, an assumption that is problematic for distributions with several modes. To address this, we introduce a mixture model allowing for multiple modes (just like the standard Gaussian mixture model (GMM)), but maintaining the formulation using Gaussian processes. This amounts to adding a group-effect structure and leads to the following assumption:
Assumption 1
For each $j \in \{1, \ldots, N\}$ and $i \in \{1, \ldots, n_j\}$,

$$y_{ij} = \bar{f}_{z_j}(t_{ij}) + \tilde{f}_j(t_{ij}) + \epsilon_{ij} \qquad (9)$$

where the group effects $\bar{f}_1, \ldots, \bar{f}_K$ and the individual effects $\tilde{f}_j$ are zero-mean Gaussian processes and $\epsilon_{ij} \sim \mathcal{N}(0, \sigma^2)$. In addition, $\bar{f}_{z_j}$, $\tilde{f}_j$ and $\epsilon_{ij}$ are assumed to be mutually independent.
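Because the components in Assumption 1 are independent, the covariance of the observations within one task is simply the sum of the group-effect kernel, the individual-effect kernel, and the noise term. A small numerical check of this additivity (the kernel choices and scales here are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 5)

def rbf(ts, ell):
    d = ts[:, None] - ts[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

K_group = rbf(t, 0.5)        # smooth group-effect kernel (illustrative)
K_indiv = 0.3 * rbf(t, 0.1)  # rougher, smaller individual-effect kernel
sigma2 = 0.05                # observation noise variance

# Independent GPs add: Cov(y) = K_group + K_indiv + sigma2 * I.
L1 = np.linalg.cholesky(K_group + 1e-10 * np.eye(5))
L2 = np.linalg.cholesky(K_indiv + 1e-10 * np.eye(5))
n = 50000
samples = (L1 @ rng.standard_normal((5, n))
           + L2 @ rng.standard_normal((5, n))
           + np.sqrt(sigma2) * rng.standard_normal((5, n)))
emp_cov = np.cov(samples)
total_cov = K_group + K_indiv + sigma2 * np.eye(5)
```

The empirical covariance of the sampled observations matches the kernel sum up to Monte Carlo error.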
With the grouped-effect model and the groups predefined, one can define a kernel that relates (with non-zero similarity) only points from the same example, or points from different examples that share the same center, as follows

$$Q((t, j), (s, l)) = \mathbb{1}[z_j = z_l]\, \bar{k}(t, s) + \mathbb{1}[j = l]\, \tilde{k}(t, s)$$

where $\bar{k}$ and $\tilde{k}$ are the covariance functions of the group effect and the individual effect respectively. However, in our work the groups are not known in advance and we cannot use this formulation. Instead we use a single kernel to relate all tasks.
The second extension allows us to handle phase-shifted time series. In some applications, we face the challenge of learning a periodic function on a single period from samples, where similar functions in a group differ only in their phase. The following assumption presents the model of primary focus in this paper, which extends the mixed-effect model to capture both shift invariance and the clustering property.
Assumption 2
For each $j \in \{1, \ldots, N\}$ and $i \in \{1, \ldots, n_j\}$,

$$y_{ij} = (\bar{f}_{z_j} * \delta_{\tau_j})(t_{ij}) + \tilde{f}_j(t_{ij}) + \epsilon_{ij} \qquad (10)$$

where $\bar{f}_1, \ldots, \bar{f}_K$ and $\tilde{f}_j$ are zero-mean Gaussian processes, $*$ stands for circular convolution and $\delta_{\tau_j}$ is the Dirac function with support at $\tau_j$.^2

^2 Given a periodic function $f$ with period $T$, its circular convolution with another function $g$ is defined as $(f * g)(t) = \int_{0}^{T} f(s)\, g(t - s)\, ds$.
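In discrete time, circular convolution with a Dirac supported at $\tau$ is exactly a circular shift by $\tau$; a quick FFT-based check (the discretization and test signal are ours):

```python
import numpy as np

def circular_convolve(f, g):
    """Circular convolution of two equal-length sequences via the FFT."""
    return np.real(np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)))

n = 32
t = np.arange(n)
f = np.sin(2 * np.pi * t / n)

# A discrete Dirac supported at index k: convolving with it shifts f by k.
k = 5
delta = np.zeros(n)
delta[k] = 1.0
shifted = circular_convolve(f, delta)  # equals np.roll(f, k)
```

This identity is what makes the shift operator in Equation (10) linear in the group effect, a fact exploited by the estimation algorithm later.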
3 Shift-invariant Grouped Mixed-effect Model
In Assumption 1, if we know the cluster assignment of each task, then the model decomposes into independent mixed-effect models, which is the case investigated in (Pillonetto et al., 2010; Evgeniou et al., 2006). Similar results can be obtained for Assumption 2. However, prior knowledge of cluster membership is often not realistic. In this section, based on Assumption 2, a probabilistic generative model is formulated to capture the case of unknown clusters. We start by formally defining the generative model, which we call the Shift-invariant Grouped Mixed-effect Model (GMT). In this model, the group-effect functions are assumed to share the same Gaussian prior, characterized by the kernel $\bar{k}$. The individual-effect functions are Gaussian processes with covariance function $\tilde{k}$. The model is shown in Figure 1; it is characterized by the parameter set $\theta$ and summarized as follows

- Draw the group effects $\bar{f}_k \sim \mathcal{GP}(0, \bar{k})$ for $k = 1, \ldots, K$.
- For the $j$th time series:
  - Draw the cluster assignment $z_j \sim \mathrm{Multinomial}(\boldsymbol{\alpha})$.
  - Draw the individual effect $\tilde{f}_j \sim \mathcal{GP}(0, \tilde{k})$.
  - Draw $\mathbf{y}_j \sim \mathcal{N}(f_j(\mathbf{t}_j), \sigma^2 I)$, where $f_j = \bar{f}_{z_j} * \delta_{\tau_j} + \tilde{f}_j$,

where $\boldsymbol{\alpha}$ is the mixture proportion. Additionally, denote $\mathbf{t} = \{\mathbf{t}_j\}$ and $\mathbf{y} = \{\mathbf{y}_j\}$, where $\mathbf{t}_j$ are the time points at which each time series is sampled and $\mathbf{y}_j$ are the corresponding observations.
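The generative story above can be sketched as an ancestral sampler. Everything concrete below is an illustrative assumption: two fixed waveforms stand in for GP draws of the group effects, white noise stands in for the GP individual effect, and the shifts are randomized here even though the model treats them as parameters to be estimated:

```python
import numpy as np

rng = np.random.default_rng(2)
n_points, n_tasks = 40, 6
t = np.arange(n_points)

# Illustrative group-effect waveforms (in the model these are GP draws).
group_means = [np.sin(2 * np.pi * t / n_points),
               np.sign(np.sin(2 * np.pi * t / n_points))]
pi = np.array([0.5, 0.5])  # mixture proportion
sigma = 0.1                # observation noise std

tasks = []
for j in range(n_tasks):
    z = rng.choice(2, p=pi)                        # cluster assignment
    shift = rng.integers(n_points)                 # circular phase shift
    f_bar = np.roll(group_means[z], shift)         # shifted group effect
    f_tilde = 0.1 * rng.standard_normal(n_points)  # crude stand-in for the GP individual effect
    y = f_bar + f_tilde + sigma * rng.standard_normal(n_points)
    tasks.append((z, shift, y))
```

Each task is thus a phase-shifted copy of its cluster's waveform plus individual variation and noise, which is exactly the structure the EM algorithm must invert.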
We assume that the group-effect kernel $\bar{k}$ and the number of centers $K$ are known. The assumption on $\bar{k}$ is reasonable, in that normally we can get more information on the shape of the mean waveforms, thereby making it possible to design a kernel for the group effect. On the other hand, the individual variations are more arbitrary and therefore $\tilde{k}$ is not assumed to be known. The assumption that $K$ is known requires some form of model selection. An extension using a nonparametric Bayesian model, the Dirichlet process (Teh, 2010), that does not limit $K$ is discussed in Section 5. The group effects, individual shifts, noise variance and the kernel for individual variations are unknown and need to be estimated. The cluster assignments and the individual variations are treated as hidden variables. Note that one could treat the group effects too as hidden variables, but we prefer to get a concrete estimate for these variables because of their role as the mean waveforms in our model.
The model above is a standard model for regression. We propose to use it for classification by learning a mixture model for each class and using the maximum a posteriori (MAP) probability for the class for classification. In particular, consider a training set that has $C$ classes, where each instance is a time series with an associated label from $\{1, \ldots, C\}$. The problem is to learn the model for each class ($C$ models in total) separately, and the classification rule for a new instance is given by
$$c^{*} = \arg\max_{c \in \{1, \ldots, C\}}\; p(\mathbf{y} \mid \mathbf{t}, \theta_c)\, p(c) \qquad (11)$$
As we show in our experiments, the generative model can provide explanatory power for the application while giving excellent classification performance.
4 Parameter Estimation
Given the data set $(\mathbf{t}, \mathbf{y})$, the learning process aims to find the MAP estimates of the parameter set $\theta$
$$\theta^{*} = \arg\max_{\theta}\; p(\theta \mid \mathbf{y}, \mathbf{t}) = \arg\max_{\theta}\; p(\mathbf{y} \mid \mathbf{t}, \theta)\, p(\theta) \qquad (12)$$
The direct optimization of Equation (12) is analytically intractable because of coupled sums that come from the mixture distribution. To solve this problem, we resort to the EM algorithm (Dempster et al., 1977). The EM algorithm is an iterative method for optimizing the maximum likelihood (ML) or MAP estimates of the parameters in the context of hidden variables. In our case, the hidden variables are the cluster assignments (as in the standard GMM) and the individual effects $\tilde{f}_j$. The algorithm iterates between the following expectation and maximization steps until it converges to a local maximum.
4.1 Expectation step
In the E-step, we calculate
$$\mathcal{Q}(\theta \mid \hat{\theta}) = \mathbb{E}_{z, \tilde{f} \mid \mathbf{y}, \hat{\theta}}\!\left[\log p(\mathbf{y}, z, \tilde{f} \mid \theta)\right] + \log p(\theta) \qquad (13)$$
where $\hat{\theta}$ stands for the parameters estimated in the last iteration. For our model, the difficulty comes from estimating the expectation with respect to the coupled latent variables $(z_j, \tilde{f}_j)$. In the following, we show how this can be done. First notice that,
and further that

$$p(z_j, \tilde{f}_j \mid \mathbf{y}_j, \hat{\theta}) = p(z_j \mid \mathbf{y}_j, \hat{\theta})\; p(\tilde{f}_j \mid z_j, \mathbf{y}_j, \hat{\theta}) \qquad (14)$$
The first term in Equation (14) can be further written as
$$p(z_j = k \mid \mathbf{y}_j, \hat{\theta}) = \frac{\hat{\alpha}_k\; p(\mathbf{y}_j \mid z_j = k, \hat{\theta})}{\sum_{l=1}^{K} \hat{\alpha}_l\; p(\mathbf{y}_j \mid z_j = l, \hat{\theta})} \qquad (15)$$
where $\hat{\alpha}_k$ is specified by the parameters estimated in the last iteration. Since $z_j = k$ is given, the second term is a marginal distribution that can be calculated using a Gaussian process regression model. In particular, denoting by $\hat{\mu}_{jk}$ the shifted group effect $\hat{\bar{f}}_k * \delta_{\hat{\tau}_j}$ evaluated at the sample times $\mathbf{t}_j$, and by $\tilde{K}_j$ the kernel matrix for the $j$th task using the parameters from the last iteration, i.e. $(\tilde{K}_j)_{il} = \hat{\tilde{k}}(t_{ij}, t_{lj})$, the marginal distribution is
$$p(\mathbf{y}_j \mid z_j = k, \hat{\theta}) = \mathcal{N}\!\left(\mathbf{y}_j;\; \hat{\mu}_{jk},\; \tilde{K}_j + \hat{\sigma}^2 I\right) \qquad (16)$$
Next consider the second term in Equation (14). Given $z_j = k$, there is no uncertainty about the identity of the group effect, and therefore the calculation amounts to estimating the posterior distribution under standard Gaussian process regression, that is
and the conditional distribution is given by
$$p(\tilde{f}_j(\mathbf{t}_j) \mid z_j = k, \mathbf{y}_j, \hat{\theta}) = \mathcal{N}\!\left(\hat{m}_{jk},\; \hat{C}_j\right) \qquad (17)$$
where $\hat{m}_{jk}$ is the posterior mean
$$\hat{m}_{jk} = \tilde{K}_j \left(\tilde{K}_j + \hat{\sigma}^2 I\right)^{-1} \left(\mathbf{y}_j - \hat{\mu}_{jk}\right) \qquad (18)$$
and $\hat{C}_j$ is the posterior covariance of $\tilde{f}_j(\mathbf{t}_j)$
$$\hat{C}_j = \tilde{K}_j - \tilde{K}_j \left(\tilde{K}_j + \hat{\sigma}^2 I\right)^{-1} \tilde{K}_j \qquad (19)$$
Since Equation (15) is multinomial and Equation (17) is Normal, the marginal distribution of $\tilde{f}_j$ is a Gaussian mixture distribution. To work out the concrete form of the expectation, denote $z_{jk} = 1$ iff $z_j = k$. Then the complete-data likelihood can be reformulated as a product over tasks and groups, using the fact that exactly one $z_{jk}$ is 1 for each $j$ and including the prior term inside the product over $k$ for convenience. Then Equation (13) can be written as
Denote the second term by $\mathcal{Q}_2$. By a version of Fubini's theorem (Stein and Shakarchi, 2005) we have
(20) 
Now, because the last term in Equation (20) does not include any of the optimization variables, the equation can be further decomposed as
(21) 
where
$$\gamma_{jk} = p(z_j = k \mid \mathbf{y}_j, \hat{\theta}) \qquad (22)$$
$\gamma_{jk}$ can be calculated from Equations (15) and (16) and can be viewed as a fractional label indicating how likely the $j$th task is to belong to the $k$th group. Recall that the conditional posterior of $\tilde{f}_j$ is a normal distribution given by Equation (17), and that the prior of each group effect is a multivariate Gaussian determined by the kernel $\bar{k}$. Using these facts and Equation (21), the objective can be reformulated as
(23) 
We next develop explicit closed forms for the remaining expectations. For the first, note that for $x \sim \mathcal{N}(m, C)$ and a constant vector $a$, $\mathbb{E}\!\left[\|a - x\|^2\right] = \|a - m\|^2 + \operatorname{tr}(C)$. Therefore the expectation is
(24) 
where $\hat{m}_{jk}$ is as in Equation (18), where we set $z_j = k$ explicitly.
For the second expectation we have
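The structure of the E-step mirrors that of an ordinary Gaussian mixture: each task receives a fractional label proportional to its mixture weight times its marginal likelihood, as in Equations (15) and (16). A generic sketch for plain multivariate Gaussians (not the paper's GP marginals; the data and parameters below are made up):

```python
import numpy as np

def responsibilities(y, means, covs, pi):
    """Fractional labels r[j, k] proportional to pi_k * N(y_j | mu_k, Sigma_k)."""
    n, d = y.shape
    K = len(pi)
    log_r = np.zeros((n, K))
    for k in range(K):
        diff = y - means[k]
        _, logdet = np.linalg.slogdet(covs[k])
        mah = np.einsum('ij,ij->i', diff @ np.linalg.inv(covs[k]), diff)
        log_r[:, k] = np.log(pi[k]) - 0.5 * (logdet + mah + d * np.log(2 * np.pi))
    log_r -= log_r.max(axis=1, keepdims=True)  # stabilize before exponentiating
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)

# Two well-separated 2-d clusters: responsibilities should be near 0/1.
means = [np.array([0.0, 0.0]), np.array([10.0, 10.0])]
covs = [np.eye(2), np.eye(2)]
pi = np.array([0.5, 0.5])
y = np.array([[0.1, -0.2], [9.8, 10.1]])
r = responsibilities(y, means, covs, pi)
```

In the GMT model the per-component likelihood is the GP marginal of Equation (16) rather than a fixed Gaussian, but the normalization over components is identical.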
4.2 M-step
In this step, we aim to find
$$\theta^{*} = \arg\max_{\theta}\; \mathcal{Q}(\theta \mid \hat{\theta}) \qquad (25)$$
and use the result to update the model parameters. Using the results above, this can be decomposed into three separate optimization problems as follows:
That is, the mixture proportion can be estimated easily using its separate term, one subproblem is only a function of the group effects and shifts, and the other depends only on the individual-variation parameters; we have
(26) 
and
(27) 
The optimizations for the group effects and for the individual-variation parameters are described separately in the following two subsections.
4.2.1 Learning the group effects
To optimize Equation (26) we first assume that the fractional labels are given. In this case, the optimization decouples into $K$ subproblems: finding the $k$th group effect and its corresponding shifts. Denoting by $r_j$ the residual of the $j$th task after subtracting the estimated individual effect, the problem becomes
(28) 
Note that the different $\mathbf{t}_j$ have different dimensions and they are not assumed to be sampled at regular intervals. For further development, following Pillonetto et al. (2010), it is useful to introduce the vector $\bar{\mathbf{t}}$ whose components are the distinct elements of the $\mathbf{t}_j$. For example, if $\mathbf{t}_1 = (1, 3)^T$ and $\mathbf{t}_2 = (1, 2)^T$, then $\bar{\mathbf{t}} = (1, 2, 3)^T$. For the $j$th task, let the binary matrix $P_j$ be such that $P_j \bar{\mathbf{t}} = \mathbf{t}_j$. That is, $P_j$ extracts the values corresponding to the $j$th task from the full vector. If the shifts are fixed, then the optimization in Equation (28) is standard and the representer theorem gives the form of the solution as
(29) 
Denoting the kernel matrix on the distinct time points as $\bar{K} = \bar{k}(\bar{\mathbf{t}}, \bar{\mathbf{t}})$, we obtain a finite-dimensional representation of the group effect. To simplify the optimization we assume that each shift can only take values in a finite discrete space (e.g., a fixed finite fine grid), so that applying a shift corresponds to multiplying by a known binary matrix. Accordingly, Equation (28) is reduced to
(30) 
To solve this optimization, we follow a cyclic optimization approach where we alternate between steps of optimizing the group effect and the shifts respectively,

At the first step, optimize Equation (30) with respect to the shifts given the group effect. Since the group effect is known, it follows immediately that Equation (30) decomposes into $N$ independent tasks, where for the $j$th task we need to find the shift $\tau_j$ such that the shifted group effect is closest to the residual $r_j$ under the Euclidean distance. A brute-force search over the discrete grid of shifts yields the optimal solution. If the time series are synchronously sampled, this is equivalent to finding the shift maximizing the cross-correlation, defined as

$$\operatorname{xcorr}(r_j, \bar{f}_k)[\tau] = \langle r_j,\ \bar{f}_k^{(\tau)} \rangle \qquad (31)$$

where $\bar{f}_k^{(\tau)}$ refers to the vector $\bar{f}_k$ right-shifted (circularly) by $\tau$.
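The equivalence between the brute-force Euclidean search and the cross-correlation maximization for synchronously sampled series follows because all circular shifts of a vector have the same norm, so minimizing the squared distance is the same as maximizing the inner product. A sketch (the test signal is ours):

```python
import numpy as np

def best_shift_bruteforce(y, f):
    """Circular shift tau minimizing ||y - roll(f, tau)||^2 (brute force)."""
    errs = [np.sum((y - np.roll(f, tau)) ** 2) for tau in range(len(f))]
    return int(np.argmin(errs))

def best_shift_xcorr(y, f):
    """Same shift via FFT circular cross-correlation: argmax_tau <y, roll(f, tau)>."""
    xc = np.real(np.fft.ifft(np.fft.fft(y) * np.conj(np.fft.fft(f))))
    return int(np.argmax(xc))

n = 64
t = np.arange(n)
f = np.cos(2 * np.pi * t / n) + 0.3 * np.sin(4 * np.pi * t / n)
y = np.roll(f, 17)  # a synthetic residual: f shifted by 17 samples
tau = best_shift_bruteforce(y, f)
```

The FFT version reduces the per-task cost from quadratic in the number of grid points to $O(n \log n)$.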