1 Deep Mixture Generative Autoencoders Fei Ye and Adrian G. Bors, Senior Member, IEEE Department of Computer Science, University of York, York YO10 5GH, UK E-mail: {fy689, adrian.bors}@york.ac.uk Abstract—Variational autoencoders (VAEs) are one of the most deep spectral clustering [13], [14]. Another generative model popular unsupervised generative models which rely on learning is the Generative Adversarial Network (GAN) [15]. A GAN latent representations of data. In this paper, we extend the is also composed of two networks: the generator and the classical concept of Gaussian mixtures into the deep variational framework by proposing a mixture of VAEs (MVAE). Each discriminator, playing a min-max game. The generator aims to component in the MVAE model is implemented by a variational generate realistic data in order to fool the discriminator whose encoder and has an associated sub-decoder. The separation task is to distinguish the real data from fake. GANs generate between the latent spaces modelled by different encoders is better quality images with higher contrast than VAE, but enforced using the d-variable Hilbert-Schmidt Independence they lack an appropriate inference mechanism. Mixed GAN Criterion (dHSIC) criterion. Each component would capture different data variational features. We also propose a mechanism and VAE architectures have been enabled with representation for finding the appropriate number of VAE components for a learning capabilities [16], [17]. given task, leading to an optimal architecture. The differen- Let us consider the generative ability of deep learning tiable categorical Gumbel-Softmax distribution is used in order to generate dropout masking parameters within the end-to- structures based on representations inferred from the latent end backpropagation training framework. Extensive experiments space. One limitation of the VAE is the fact that its latent space show that the proposed MAVE model learns a rich latent data is fixed after the training, with the data’ posterior probability representation and is able to discover additional underlying data represented by the parameters characterizing a simple Gaus- factors. sian distribution. A single VAE model has a low-dimensional Index Terms—Mixtures of Variational Autoencoders, Genera- latent space and can only capture a few underlying variation tive deep learning, Representation learning, Optimal number of factors of the data. For instance, a VAE with a two-dimensional components in mixtures. stochastic latent variable vector, can only learn two meaningful representations. Furthermore, the posterior in VAEs is of a I. I NTRODUCTION rather simple form, and not able to capture complex structures behind data. Other problems when using single VAEs are Research in deep learning focused initially on addressing overfitting and over-regularisation in data representation, [12]. supervised classification problems, where training data are In this work we address these problems by developing a labelled. Meanwhile, unsupervised learning aims to find the novel Mixture of Variational Autoencoders (MVAE), which intrinsic structure of the data without assuming any a priori enjoys many more advantages than existing models, including knowledge. A classic model for data representation, rooted in memory efficiency and performance. The contributions of this statistical inference and proving excellent statistical approx- research study are summarized as follows: imation properties is the Gaussian mixture model (GMM). 
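As background for the mixture framework developed in this paper, the following is a minimal NumPy sketch (not part of the original implementation) of the GMM density and of the posterior component responsibilities on which EM-style training relies; all function and variable names are illustrative.

import numpy as np

def gmm_log_density(x, weights, means, variances):
    # Log-density of a diagonal-covariance GMM evaluated at the rows of x (n, d).
    n, d = x.shape
    log_comp = np.zeros((n, len(weights)))
    for k, (w, mu, var) in enumerate(zip(weights, means, variances)):
        log_norm = -0.5 * (d * np.log(2.0 * np.pi) + np.sum(np.log(var)))
        quad = -0.5 * np.sum((x - mu) ** 2 / var, axis=1)
        log_comp[:, k] = np.log(w) + log_norm + quad
    m = log_comp.max(axis=1, keepdims=True)                  # log-sum-exp over components
    return (m + np.log(np.exp(log_comp - m).sum(axis=1, keepdims=True))).ravel(), log_comp

def responsibilities(x, weights, means, variances):
    # Posterior probability of each component given x (the E-step of EM).
    log_mix, log_comp = gmm_log_density(x, weights, means, variances)
    return np.exp(log_comp - log_mix[:, None])

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))                                  # toy 2-D data
w = np.array([0.5, 0.5])
mu = [np.zeros(2), np.ones(2)]
var = [np.ones(2), np.ones(2)]
print(responsibilities(x, w, mu, var))                       # each row sums to one

The MVAE introduced below extends this weighted-component view by replacing the Gaussian components with learned variational encoders and sub-decoders.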
GMMs have been used to define Radial Basis Functions (RBF) 1) We propose an efficient network architecture design networks by adding a second layer of linear processing units for the VAE mixture model. Unlike in other mixture for supervised classification [1], [2], [3], [4]. RBF networks models using deep networks for the decoder [12], [18], have been shown to have universal approximation properties MVAE implements the decoder of each component as [5], [6], [7]. The variational GMM model addresses the a simple non-linear mapping requiring few parameters uncertainty in the estimation of the mixture model parameters and low computational costs. A Dirichlet sampling pro- by defining a lower bound on the marginal log-likelihood, cess is used for assigning mixing parameters for each replacing the true posterior distribution with a variational component. Unlike using a fixed symmetric Dirichlet approximation, [8], [9], [10]. distribution as in other approaches [19], the proposed Lately, generative deep models have gained an increas- sampling process finds automatically the optimal weight ing attention from the research community. The Variational of each component. Autoencoder (VAE) [11] consists of two convolution neural 2) We propose a measure for enforcing the separability of network (CNN) components: the encoder and the decoder. The the latent space representation associated with each VAE decoder is used to approximate the conditional distribution component, by using the d-variable Hilbert-Schmidt p(x|z) for reconstructing the data x from the estimated latent Information criterion (dHSIC) [20]. The dHSIC measure space z. Meanwhile, the latent space is modelled by a vari- is always positive and can be easily integrated in the ational distribution q(z|x), estimated by the encoder, which objective function used for training MVAE deep learning aims to match the prior distribution during the training. In model. The dHSIC measure can also significantly relieve relation to classical clustering methods, VAEs have considered the over-regularization problem which affects other VAE GMMs as prior distributions [12] and they have been used for based methods [11], [12], [18]. 2 3) We propose a new way for selecting the number of where X is the empirical distribution characterizing the data components, during the training. A dropout mechanism and pη (x|z) is a generative model implemented by the decoder is enabled by sampling from either a Gumbel-softmax of parameters η, while p(z) is the prior distribution of the or a Gaussian distribution. We define a loss function latent space, usually a Gaussian distribution with the identity which includes the dropout rate estimation, controlling matrix as its covariance. The re-parameterization trick [22], how the number of components is decided and ensuring [23] is used in order to allow for the gradients to be effi- the end-to-end training for MVAE. ciently transferred from the VAE’s encoder to decoder when 4) We show through extensive experimentation, that the maximizing ELBO from (1). The first term from the right proposed mixture model learns several distinct clusters side of (1) can be calculated using the reconstruction error, in the latent space, which enables a rich data representa- while the others can be seen as the Kullback-Leibler (KL) tion, benefiting many down-stream tasks. 
The proposed divergence between the variational posterior and the prior model provides a competitive performance when com- distributions, encouraging the variational posterior to match pared with the state of the art VAE frameworks. the prior distribution. The rest of paper is organized as follows. Related research is discussed in Section II. The proposed model and its ob- C. Representation learning in VAEs jective function are described in detail in Section III, while in Section IV we present the architecture of the model and VAEs provide an efficient inference mechanism for esti- its training algorithm. The experimental results are reported mating informative latent variables corresponding to the given Section V and the conclusions are drawn in Section VI. data. There are two measures to evaluate the quality of approximate inference in VAEs, [22]. One consists in the ability of the variational posterior to match the true posterior. II. R ELATED WORK The second measure represents the capacity of the inference A. Probabilistic mixture model network to yield good variational parameters. One way for improving the quality of the approximate inference consists in The Gaussian mixture model (GMM) is an unsupervised increasing the expressiveness of the approximate posteriors. learning model which can be trained using the Expectation For instance, the normalizing flow [24], [25], [26], [27] was Maximization (EM) algorithm, [21]. Meanwhile, a varia- used in VAEs in order to make the approximate posteriors tional model can be used to define a lower bound for the more expressive. Another way is by introducing auxiliary marginal log-likelihood [8], [9]. The Variational Expectation- variables, such as the Hierarchical Variational Models [14], Maximization (EM) algorithm was used to find a set of hy- [28], the Hamiltonian variational inference [29], or using two perparameters for the variational GMM model [10]. Gaussian stochastic layers [30]. Importance sampling [31], [32] is also distributions are used to model the distribution of the means used in VAEs for improving the quality of the inference. of each Gaussian component of the mixture, Wishard distri- butions are used for their corresponding covariance matrices, One of the problems displayed by VAEs is that of posterior and Dirichlet for the mixing parameters. collapse when the variational distribution closely matches the uninformative prior for a subset of latent variables. InfoVAE GMMs have been embedded into Radial Basis Functions [33] aims to address the posterior collapse problem by using (RBF) networks by adding a second layer of linear processing a mutual information term in the objective function. Ma et units [1], [2], [3], [4], [5] for supervised classification. RBF al. [34], introduced a new regularization term in the VAE networks have been shown to have universal approximation objective, called the mutual posterior-divergence, used to mea- properties [5], [6], [7] and were trained using backpropagation sure the diversity of posteriors. Zhang et al. [35], proposed [1], robust clustering [2], or orthogonal least squares [3]. a new form of VAE, namely Wasserstein-Wasserstein Auto- Encoders, which replaces the KL divergence term with a B. The variational autoencoder (VAE) new regularization term measuring the squared Wasserstein- A variational autoencoder (VAE) [11] is a probabilistic 2 distance between the prior and the aggregated posterior. model which learns a compressed data representation. 
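As a concrete illustration of how such a model is trained, namely reparameterized sampling of z followed by a Monte-Carlo estimate of the ELBO in Eq. (1), the following NumPy sketch evaluates the objective for a toy linear VAE; it is not the paper's TensorFlow implementation, and the linear encoder/decoder and all names are placeholders.

import numpy as np

rng = np.random.default_rng(1)
d_x, d_z = 784, 2                                   # data and latent dimensionalities (illustrative)
W_enc = rng.normal(scale=0.01, size=(d_x, 2 * d_z)) # stand-in encoder producing (mu, log_var)
W_dec = rng.normal(scale=0.01, size=(d_z, d_x))     # stand-in decoder producing Bernoulli means

def encode(x):
    h = x @ W_enc
    return h[:, :d_z], h[:, d_z:]                   # mu and log-variance of q(z|x)

def decode(z):
    return 1.0 / (1.0 + np.exp(-(z @ W_dec)))       # p(x|z) parameterised by per-pixel means

def elbo(x):
    mu, log_var = encode(x)
    z = mu + np.exp(0.5 * log_var) * rng.normal(size=mu.shape)   # reparameterization trick
    x_hat = decode(z)
    # Reconstruction term of Eq. (1): Bernoulli log-likelihood of the data.
    rec = np.sum(x * np.log(x_hat + 1e-8) + (1 - x) * np.log(1 - x_hat + 1e-8), axis=1)
    # KL(q(z|x) || N(0, I)) in closed form for a diagonal Gaussian posterior.
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=1)
    return np.mean(rec - kl)

x = rng.uniform(size=(8, d_x))                      # toy batch standing in for images
print("ELBO estimate:", round(float(elbo(x)), 3))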
The VAE model is made up of two complementary networks: D. Deep mixture models using VAEs encoder and decoder. VAEs have probabilistic inference mech- anisms that can capture data’ characteristics. Let x be a data Some recent efforts propose to use mixture models based sample vector and z a vector of stochastic latent variables. on VAEs for learning complex structures behind the data. While the encoder maximizes pη (z|x), the decoder in VAE Kurle et al [18] introduced a mixture model, called Multi- implements a distribution qθ (x|z), called the variational pos- Source Neural Variational Inference (MSVI) aiming to capture terior, where η and θ represent the parameters of the Convolu- probabilistic characteristics from multiple sources. However, tion Neural Networks (CNNs) implementing the encoder and MSVI relies on multiple source domains and would not decoder, respectively. The evidence lower bound (ELBO) on encourage disentanglement between encoding distributions. the marginal log-likelihood in VAEs is defined as : Dilokthanakul et al [12], [18] develops a model considering a GMM as prior distribution for unsupervised clustering tasks. Ex∼X [log p(x)] ≥Ex∼X [Eqθ (z|x) [log pη (x|z) This model still suffers from over-regularisation. A mixture of (1) VAEs defined in a fixed architecture was proposed in [36]. + log p(z) − log qθ (z|x)]] 3 In summary, existing VAE models do not learn multiple latent representations are then combined, considering the mix- separate representations of data, where each representation ing parameters, to form a single representation: could capture rich statistical characteristics in different ways. K X One advantage for the proposed Mixture of VAEs model over s= wi si , (4) other mixture models [12], [18], is that it can learn multiple i=1 disentangled representations by using an efficient network K architecture which requires a lower computational complexity P characterized by the constraint wi = 1, and considering and a reduced number of parameters. Furthermore, instead of i=1 the inference in the latent space we have the probability for using a symmetric Dirichlet distribution [12] for sampling the output variable s : the mixing parameter, we model the sampling process by using inference models, leading to optimal configuration for XK Z the mixing parameters. We also propose a novel dropout p(s|z) = w i ti pθi (zi |x)dx (5) mechanism for the selection of MVAE’s components by using i=1 dHSIC regularization which overcomes the over-regularization where θi represents the parameters of each encoder network problem characteristic in VAEs [11]. i = 1, . . . , K. Afterwards, in the mixture model, we have a single mix-decoder for reconstructing the data x0 : III. T HE OBJECTIVE FUNCTION FOR THE M IXTURE OF x0 ∼ p(x|s). (6) VARIATIONAL AUTOENCODERS A single VAE has a fixed processing capacity defined For the generation process, each i-th sub-decoder corre- by its structure which is not able to model probabilistic sponding to an encoder, is seen as a component of the mixture relationships characterizing complex data. In the following, model. Each component has its own independent inference for comprehensively modelling complex data, we introduce the mechanism, which is beneficial for representation learning Mixture of Variational Autoencoders (MVAE) model. MVAE when representing complex latent spaces. Let us consider an extends the concept of Mixtures of Gaussians [8], [10], into the approximate posterior qθ (z, w|x) implemented by an indepen- deep learning framework. 
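To fix ideas before the derivation, the data-generation pipeline of Eqs. (2)-(6), with per-component encoders, lightweight sub-decoders, Dirichlet-distributed mixing weights and a shared mix-decoder, can be sketched as follows. This is a hedged NumPy illustration with single linear layers standing in for the networks; the actual model uses convolutional encoders and a deep mix-decoder (Section IV), and the way each encoder emits its Dirichlet parameter a_i here (a softplus of one batch-averaged output) is only an assumption made for the sketch.

import numpy as np

rng = np.random.default_rng(2)
K, d_x, d_z, d_s = 4, 784, 8, 64                    # number of components and layer sizes (illustrative)
enc_W = [rng.normal(scale=0.01, size=(d_x, 2 * d_z + 1)) for _ in range(K)]   # heads for (mu_i, log_var_i, a_i)
sub_W = [rng.normal(scale=0.01, size=(d_z, d_s)) for _ in range(K)]           # one-layer sub-decoders t_i
mix_W = rng.normal(scale=0.01, size=(d_s, d_x))                               # shared mix-decoder (one layer here)

def forward(x):
    outs, alphas = [], []
    for i in range(K):
        h = x @ enc_W[i]
        mu, log_var = h[:, :d_z], h[:, d_z:2 * d_z]
        alphas.append(np.log1p(np.exp(h[:, -1].mean())))               # softplus, batch-averaged: a_i > 0
        z_i = mu + np.exp(0.5 * log_var) * rng.normal(size=mu.shape)   # Eq. (2): reparameterized latent code
        outs.append(np.tanh(z_i @ sub_W[i]))                           # sub-decoder output s_i = t_i(z_i)
    w = rng.dirichlet(np.asarray(alphas))                              # Eq. (3): mixing weights, summing to one
    s = sum(w_i * s_i for w_i, s_i in zip(w, outs))                    # Eq. (4): combined representation
    x_hat = 1.0 / (1.0 + np.exp(-(s @ mix_W)))                         # Eq. (6): reconstruction by the mix-decoder
    return x_hat, w

x = rng.uniform(size=(8, d_x))
x_hat, w = forward(x)
print("mixing weights:", np.round(w, 3))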
MVAE model learns a collection of dent encoder. We use the Jensen’s inequality to obtain the separate latent spaces, extracted independently by each VAE variational lower bound as follows : component of a mixture. In Section III-A we define the basic p(x, z, w) log p(x) = log Eqθ (z,w|x) objective function for the mixture model. In Section III-B q (z, w|x) θ (7) we discuss how various VAE components can learn different p(x, z, w) aspects of the data, while deciding the number of components ≥ Eqθ (z,w|x) log . qθ (z, w|x) in the MVAE model is explained in Sections III-C and III-D. In the following we consider the independence of the latent variables in the joint log-likelihood p(x, z, w) = A. The underlying framework and objective function p(x|z, w)p(z)p(w) and also q(z, w|x) = q(z|x)q(w|x) and The learning goal of the proposed mixture model is to model we replace these in (7). Then we have the upper bound on a collection of separate latent spaces that capture different − log pη (x) representing the mixture of the lower error bound aspects of data. Let us denote by pηi (x|zi ) the variational (MELBO) basic objective function for MVAE, where the loss : posterior for the i-th decoder, implemented by a Convolution LM ELBO = −Eqθ (z,w|x) [log pη (x|z, w)] Neural Network (CNN) of parameters ηi , where zi represents the inferred stochastic latent variable vector, for i = 1, . . . , K, p(z) p(w) + Eqθ (z,w|x) log + Eqθ (z,w|x) log where K is the number of VAE components. q(z|x) q(w|x) The data generation process is defined by the following = −Eqθ (z,w|x) [log pη (x|z, w)] stages. The latent variable zi is yielded by the i-th encoder : + DKL (qθ (z|x)||p(z)) + DKL (qθ (w|x)||p(w)) zi ∼ N (µθi (x), σθi (x)), (2) (8) where p(w) and p(z) are the priors for the mixing parameters where each encoder, is defined as a CNN of parameters θi , w and for the latent variables z = {zi |i = 1, . . . , K}, while and models a multivariate Gaussian function, [36]. θ = {θi |i = 1, . . . , K}, η = {ηi |i = 1, . . . , K}, represent the We do not directly recover the data by using a decoder, as parameters of the networks modelling the mixture of encoders in the classical VAE [11], due to the complexity of the mixture and decoders, respectively. After explicitly expanding the basic model. Instead, we firstly consider a simple network as a sub- objective function corresponding to the mixture model we decoder, implemented by a nonlinear transformation function have: denoted as ti (·), i = 1, . . . , K which outputs a variable si . LM ELBO = − Ep(s|z) [log pη (x|s)] + DKL (qθ (w|x)||p(w)) In the following, the mixing weights are sampled from a K Dirichlet distribution, Dir(a) [10], as : 1 X + DKL (qθi (z|x)||p(z)) K i=1 {w1 , . . . , wK } ∼ Dir(a), (3) (9) where a = {a1 , a2 , . . . , aK } represents its parameters, with where K is the number of components, qθi (z|x) represents the each entry provided by one of the encoders. The transformed variational modelled by one of the mixing components, defined 4 by the parameters θi , and pη (x|s) is the probability of the mix- where q(zi ) is the marginal probability of the latent variable decoder, modelled by a network defined by the parameters η, zi , which represents the output of the i-th encoder. Let us aiming to reconstruct the data x0 . Let us consider tγi (si |z) the consider the cross-covariance operator Czi ,zj defined on the function of each sub-decoder, where {γi |i = 1, . . . , K} are the encoders output variables zi and zj . 
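To make the resulting training signal concrete, the sketch below combines the reconstruction and averaged per-component KL terms of Eq. (9) with a kernel-based independence penalty on the component codes, as in Eq. (10). The penalty shown is a sum of pairwise Gaussian-kernel HSIC estimates, used only as a simple stand-in for the d-variable criterion (dHSIC) adopted in the next subsection; the Dirichlet KL term of Eq. (9) is omitted for brevity, and all function names are illustrative.

import numpy as np

rng = np.random.default_rng(3)

def gaussian_gram(z, sigma=1.0):
    # Gaussian-kernel Gram matrix for a batch of latent codes z of shape (n, d).
    sq = np.sum(z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * z @ z.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic(z_a, z_b, sigma=1.0):
    # Biased HSIC estimate, tr(K H L H) / (n - 1)^2, between two sets of latent codes.
    n = z_a.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K, L = gaussian_gram(z_a, sigma), gaussian_gram(z_b, sigma)
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def mixture_objective(x, x_hat, mus, log_vars, beta=1.0):
    # Reconstruction error plus the averaged per-component Gaussian KL of Eq. (9),
    # plus a pairwise independence penalty on reparameterized component codes.
    rec = 0.5 * np.mean(np.sum((x - x_hat) ** 2, axis=1))
    kl = np.mean([0.5 * np.sum(np.exp(lv) + m ** 2 - 1.0 - lv, axis=1).mean()
                  for m, lv in zip(mus, log_vars)])
    zs = [m + np.exp(0.5 * lv) * rng.normal(size=m.shape) for m, lv in zip(mus, log_vars)]
    indep = sum(hsic(zs[i], zs[j]) for i in range(len(zs)) for j in range(i + 1, len(zs)))
    return rec + kl + beta * indep

x = rng.uniform(size=(16, 32)); x_hat = rng.uniform(size=(16, 32))       # toy batch and reconstruction
mus = [rng.normal(size=(16, 2)) for _ in range(3)]                       # K = 3 components, 2-D codes
log_vars = [np.zeros((16, 2)) for _ in range(3)]
print("objective:", round(float(mixture_objective(x, x_hat, mus, log_vars)), 3))

In the paper's objective the penalty is the joint d-variable statistic rather than a sum of pairwise terms, which avoids evaluating all component pairs separately.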
The largest eigenvalue of parameters of K sub-decoders. The output of each sub-decoder the operator Czi ,zj measures the dependence score between is multiplied by its corresponding mixing parameter and the the distributions defined by the latent variables zi and zj and results are then summed up like in equation (5). this should be zero for ensuring the independence of the two latent spaces. The covariance operator is defined based on the squared Hilbert-Schmidt norm as follows, [20] : B. Enforcing the separation in the latent space among the MVAE components Czi ,zj = Ezi ,zj [k(zi ), l(zj )] − Ezi [k(zi )]Ezj [l(zj )] (13) LM ELBO from equation (9) is the basic objective function where zi and zj are the latent variables produced by i-th for the mixture model, where each encoder defines its own and j-th encoders and k(·), l(·) represent kernel functions in latent space. We enforce that various components learn distinct the latent space, usually defined as Gaussian. The Hilbert- aspects of the data by employing a regularization term r(z) : Schmidt independence criterion (HSIC) [20] represents the LObj = LM ELBO + β r(z) (10) generalization of Frobenius norm and is defined as : HSIC(zi , zj ) = kCzi ,zj k2 (14) where LObj is the objective function of MVAE, and β is a hyperparameter controlling the relative strength of the regu- where Czi ,zj is provided in (13). larization. The definition of the cross-covariance operator can be ex- In the mixture model, each encoder has its own independent tended for assessing the independence of d variables, similarly inference mechanism. Nevertheless, without a regularization to the expressions from (13) and (14), representing the latent mechanism, the encoders might all learn the same features, space {z1 , z2 , . . . , zd } defined by d VAE components [20], resulting in overlaps of their characteristic latent spaces. This [38]. This criterion is called dHSIC and evaluates the inde- happens because each associated encoder and sub-decoder pendence among all K encoder components. dHSIC is null shares the same network architecture. Consequently, we should if and only if all components zi , i = 1, . . . , K are mutually increase the distinction between the latent representations of independent. The K components are mutually independent if various MVAE components in order to encourage them to learn their joint distribution is equal to the product of their marginal different aspects of the data. distributions, [39]. dHSIC can be easily integrated as the Various measures for the regularization function r(z), from regularization factor r(z) in the optimization function LObj , (10), can be used for enforcing each component to learn a from (10), as follows: distinct latent space from the others. For example, the L2 norm between the latent variables of two different encoders, LObj = LM ELBO + β dHSIC (z1 , z2 , . . . , zK ) , (15) or statistical distances such as KL divergence or its symmetric where LM ELBO defines the inference process for MVAE and correspondent, Jensen-Shannon (JS) divergence [37] between β is the contribution of the constraint associated with dHSIC. the probabilistic representation of the latent spaces for each pair of encoders, can be used. In any of these measures we C. 
Deciding the number of VAE components using dropout calculate the differences between the latent space representa- probabilities tions for all possible pairs of encoders from the mixture: The number of components in classical mixture models, K X K X such as GMMs or RBF networks, was selected in various r(z) = d(p(zi ), p(zj )), i 6= j (11) ways. New processing units were added in [1] as long as i=1 j=1 decreasing the classification error. Bayesian Information Cri- where d(·, ·) represents one of the discriminatory measures terion [40], which is equivalent to the Minimum Description mentioned above, evaluated between the probabilistic repre- Length [41], was used in the context of Variation Expectation- sentations p(zi ) and p(zj ), characterizing the latent space of Maximization algorithm in [10] for deciding the number of a pair of encoders {i, j}i,j=1,...,K . mixing components. In this research study, we propose to All measures listed above are always positive, and when extend the idea of probabilistic dropout for selecting the considered in (10) they would be differentiated during the appropriate number of VAE components during the training Stochastic Gradient Descent (SGD) training which can lead to of MVAE while ensuring an efficient end-to-end training derailing the convergence during the training. In order to avoid mechanism. Each mixing component is seen as a probabilistic this situation we consider a new measure of independence by node which contributes to the mixture output in the network, assuming that the joint probability of the latent space for the according to a dropout probability. whole mixture of VAEs is equal to the product of marginal We consider two different approaches for the dropout pro- probabilities of the individual variables, [38]: cedure when selecting the number of components in MVAE. K First we introduce a traditional dropout method, by sampling Y a set of variables, which are either 0 or 1, according to the q(z1 , z2 , . . . , zK ) = q(zi ) (12) i=1 dropout rate. This models a vector whose entries represent 5 masking parameters for the outputs of each of the K sub- D. Sampling the Gumbel-softmax distribution for selecting the decoders. We denote a masking vector, sampled from the number of components Bernoulli distribution, by m = {m1 , m2 , . . . , mK }, whose In this section we consider the Gumbel-softmax distribution entries are used as the weights for the mixing parameters : [43], representing a categorical distribution which is also dif- mi wi ferentiable, for selecting the number K of VAE components. bi = K . (16) A sample vector is drawn from a categorical distribution with P mj wj probabilities {ai |i = 1, . . . , K} for each encoder of MVAE j=1 by using the Gumbel-softmax trick [44], [45] : Then, the output of each sub-decoder is multiplied by the one hot arg max(log ai + gi ) (20) corresponding mixing parameter and the probability of the i variable s, for the whole MVAE model (5), becomes: where gi is a sample drawn from the Gumbel(0, 1) distribu- K X Z tion. The Gumbel-softmax trick adopts the softmax function p(s|z) = bi ti pθi (zi |x)dx . (17) as a continuous, differentiable approximation to the arg max i=1 expression as, [43] : In this way, only the components contributing significantly to exp[(log(ai ) + gi )/T ] mi = K (21) the inference will be considered for the generation process P exp[(log(ai ) + gi )/T ] in MVAE, according to the dropout masking parameters from i=1 (16). 
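A minimal NumPy sketch of this masking step, i.e. sampling a binary mask and renormalizing the mixing weights as in Eq. (16), is given below; here keep_prob denotes one minus the dropout rate, and the guard that keeps at least one component active is an assumption added for the sketch and is not stated in the text.

import numpy as np

def masked_mixing_weights(w, keep_prob, rng):
    # Bernoulli component dropout followed by the renormalization of Eq. (16):
    # b_i = m_i * w_i / sum_j m_j * w_j, with m_i in {0, 1}.
    m = rng.binomial(1, keep_prob, size=w.shape)
    if m.sum() == 0:                                # assumption: keep at least one component alive
        m[rng.integers(len(w))] = 1
    b = m * w
    return b / b.sum()

rng = np.random.default_rng(4)
w = rng.dirichlet(np.ones(6))                       # e.g. K = 6 mixing weights summing to one
print(np.round(masked_mixing_weights(w, keep_prob=0.8, rng=rng), 3))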
The variable s is characteristic of the mixed latent space, where m = {m1 , m2 , . . . , mK } is a K-dimensional mask- which is then fed into the mix-decoder, yielding as outputs ing vector and T is the temperature parameter. When the the reconstructed data as in (6). The Bernoulli distribution temperature T is increasing, the samples inferred from the can be used for modelling the masking. However, this is non- Gumbel-softmax become uniformly distributed and they are no differentiable and cannot be used directly in the context of the longer selective. In contrast, if the temperature T approaches end-to-end backpropagation training. zero, the Gumbel-Softmax distribution becomes the one hot We could also consider the variational Gaussian dropout for selective categorical distribution, picking up one component selecting the VAE components (MVAE-Gau) as in [42]. In this of the mixture over the others. case, a dropout vector m for VAE components is sampled from In the following, we estimate the dropout masking param- a Gaussian distribution N (1, τ ), where τ = p/(1−p), and p is eters by using the following sampling process: the dropout rate. The sampled masking vector m is then used directly on the mixing parameters w as in (16). The dropout exp((log(1 − p) + g1 )/T ) mi = loss is taken into account in the objective function LObj from exp((log(1 − p) + g1 )/T ) + exp((log(1 − p) + g2 )/T ) (15) by adding the KL-divergence penalty associated with (22) the dropout DKL (q(m)||p(m)), measuring the KL divergence where mi is a sampled masking value, which can be closer between the variational distribution for the masking parameters to either 1 or 0, corresponding to keeping or dropping the q(m) and its corresponding posterior p(m). After considering mixing component i, g1 , g2 ∼ Gumbel(0, 1), [46], where p the general objective function for the mixture model (9), the is a learnable dropout rate, while we set the temperature as dHSIC regularization from (15), and considering the loss due T = 0.1. The dropout masking parameters mi , depending on to each VAE component dropout, we obtain the following a single dropout rate p, are then combined with the mixing objective function for the MVAE-Gau model : parameters, sampled from the Dirichlet distribution, as in equation (16), and then used for calculating the mixed latent LM V AE−Gau = − Ep(s|z) [log pη (x|s)] + DKL (pθ (w|x)||p(w)) variable, as in (17). Eventually, the mix-decoder yields the K reconstructed data, as in (6). This approach is called the 1 X + DKL (qθi (z|x)||p(z)) Mixture of VAEs with Gumbel-softmax dropout (MVAE-GS). K i=1 The Gumbel-softmax trick from (21) approximates a + β dHSIC (z1 , z2 , . . . , zK ) Bernoulli distribution by generating samples close to either 0 + DKL (q(m)||p(m)). or 1, while it is also differentiable. Gal et al. in [47] considered (18) the dropout regularization term as the entropy of a Bernoulli The last term DKL (q(m)||p(m)), represents the contribution random variable : of the dropout loss and is not analytically tractable when H(p) = −p log p − (1 − p) log(1 − p) (23) considering a Gaussian distribution dropout, but it can be approximated using the following polynomial expression : where p is the dropout probability in (22). We can see that this regularization term depends only on the dropout rate p and if DKL (q(m)||p(m)) ≈ c−0.5 log(τ )−c1 τ −c2 τ 2 −c3 τ 3 (19) we fix the dropout rate during the training, this term can be omitted. 
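For illustration, the relaxed keep/drop masks of Eqs. (21)-(22) and the entropy regularizer of Eq. (23) can be sketched as below. The sketch uses the standard binary-concrete form, with log(1-p) as the keep logit and log p as the drop logit, which is assumed here to be the intended reading of Eq. (22); it is written in NumPy for clarity, whereas in the model the sampling runs inside the computational graph so that the dropout rate p remains trainable.

import numpy as np

rng = np.random.default_rng(5)

def gumbel_softmax_mask(p, K, T=0.1):
    # Relaxed keep/drop decision per component: each m_i is pushed towards 0 or 1
    # by the low temperature T but stays a differentiable function of p.
    m = np.empty(K)
    for i in range(K):
        g_keep, g_drop = rng.gumbel(size=2)              # g ~ Gumbel(0, 1)
        keep = np.exp((np.log(1.0 - p) + g_keep) / T)
        drop = np.exp((np.log(p) + g_drop) / T)
        m[i] = keep / (keep + drop)
    return m

def dropout_entropy(p):
    # Bernoulli entropy of Eq. (23), used as the regularizer on the learnable rate p.
    return -p * np.log(p) - (1.0 - p) * np.log(1.0 - p)

m = gumbel_softmax_mask(p=0.2, K=6)
print("masks:", np.round(m, 3), " entropy:", round(float(dropout_entropy(0.2)), 3))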
However, if we optimize the dropout rate, this term where τ defines the variance of the Gaussian distribution plays an important role in the MVAE mixture components used for sampling the dropout, and the constants used in the selection. For instance, the dropout rate becomes close to 0.5, polynomial approximation c, c1 , c2 , c3 are provided in [42]. when maximizing H(p) in (23). 6 After considering the entropy of the dropout regularization component dropout procedure is adopted for the mixing model. from (23) for the Gambel-softmax dropout penalty, instead of Similarly to the classical VAE [11], we consider the isotropic the last term from (18) used for MVAE-Gau, the objective multivariate Gaussian N (0, I) as the prior distribution for function for MVAE-GS method becomes : each decoder over their latent variable space representations. LM V AE−GS = − Ep(s|z) [log pη (x|s)] + DKL (pθ (w|x)||p(w)) The KL divergence DKL (qθi (z|x)||p(z)), between the prior K and posterior for each i-th encoders, is calculated considering 1 X Gaussian pdfs in the latent space, as: + DKL (qθi (z|x)||p(z)) K i=1 Si 1X 2 − µ2i,j − σi,j 2 + β dHSIC (z1 , z2 , . . . , zK ) − H(p), D (q KL θi (z|x)||p(z)) = (1 + log σi,j ), (24) 2 j=1 where H(p) is provided in (23). (27) where Si is the dimension of the latent variable space z IV. MVAE STRUCTURE AND TRAINING for the i-th encoder and the characteristic latent space vari- ables µi,j and σi,j are evaluated from the training data. In the following we discuss the architecture and the training The total KL divergence, is calculated considering all K algorithm for the MVAE model. components, where each VAE component contributes with the expression from (27). We also calculate the KL di- A. The MVAE model structure vergence DKL (qθ (w|x)||p(w)) corresponding to the mixing The proposed mixture model is based on a collection of en- parameters, representing the second term from either objective coders and sub-decoders processing the information in parallel. function, (18) or (24). The KL divergence corresponding to Their outputs are weighted according to their contribution to the mixing parameters is evaluated analytically between two the data representation. Meanwhile, this structure is enabled Dirichlet distributions of parameters a = {a1 , . . . , aK } and with a dropout mechanism aiming to define a minimal process- b = {bi , . . . , bK } as: ing architecture. The structure of the MVAE model is shown X K in the diagram from Fig. 1. Each encoder with the associated DKL (qθ (a|x)||p(b)) = log Γ(a0 ) − log Γ (ai ) sub-decoder can be seen as a component in the MVAE mixture i=1 model. The encoder outputs the hyperparameters {µi , σi } of XK a Gaussian distribution, and one parameter ai of the Dirichlet − log Γ(b0 ) + log Γ(bi ) (28) distribution, for i = 1, . . . , K, considering initially K mixing i=1 components. Then, the code is sampled from the distribution XK modelled by the corresponding hyperparameters. In order to + (ai − bi )(ψ(ai ) − ψ(a0 )) allow the gradients to be estimated from the encoder to sub- i=1 K K decoder, we use the reparametrization trick, [11] : where a0 = P ai , b0 = P bi , Γ(·) is the Gamma func- i=1 i=1 zi = µi + σi ε, (25) tion and Ψ(·) is the Digamma function. In practice, a are where ε ∼ N (0, I) is a random variable sampled from the the parameters of the Dirichlet distribution estimated by all Gaussian distribution of mean 0 and having the identity matrix encoders, following the training, while b are the parameters I as the covariance. 
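The two sampling steps used by each component, namely the reparameterization of Eq. (25) and the Dirichlet draw of the mixing weights in Eq. (26), can be illustrated with the standard construction of a Dirichlet sample as a set of normalized Gamma draws. The sketch below is NumPy-based and does not reproduce the implicit reparameterization gradients [48] that the model relies on to differentiate through the Dirichlet sampling; all names are illustrative.

import numpy as np

rng = np.random.default_rng(6)

def reparameterize(mu, sigma):
    # Eq. (25): z = mu + sigma * eps, with eps ~ N(0, I).
    return mu + sigma * rng.normal(size=np.shape(mu))

def dirichlet_from_gammas(a):
    # A Dirichlet(a) sample obtained by normalizing independent Gamma(a_i, 1) draws,
    # mirroring the normalization in Eq. (26); gradients w.r.t. a would require the
    # implicit reparameterization of [48], which is not reproduced here.
    g = rng.gamma(shape=np.asarray(a, dtype=float), scale=1.0)
    return g / g.sum()

z = reparameterize(np.zeros(8), np.ones(8))               # one latent code of dimension 8
w = dirichlet_from_gammas([1.2, 0.7, 2.0, 0.5])            # mixing weights for K = 4 components
print("z[:3] =", np.round(z[:3], 3), " w =", np.round(w, 3), " sum(w) =", float(w.sum()))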
of an empirical Dirichlet distribution, p(w). For the mixing parameters, we adopt implicit reparameter- For the reconstruction error, when considering N training ization gradients [48]. The mixing parameters are sampled images, we use the mean squared error (MSE) criterion, from the Dirichlet distribution Dir(a1 , a2 , . . . , aK ), with each representing the first term from (9) and in the expressions parameter produced by one of the encoders: from (18) and (24), as : N 1X 0 ! w1 wK 2 , . . . , PK ∼ Dir(a1 , . . . , aK ), LRec = kx − xi kF (29) PK (26) 2 i=1 i j=1 wj j=1 wj where the sum of the mixing parameters is 1. The mixing where x0 represents the reconstructed image result by the mix- parameters are then multiplied by the dropout parameters decoder while k · kF denotes the Frobenius norm. which are either 0 or 1, defined as in Section III-C for MVAE- The gradient used for SGD optimization [49], when consid- Gau, or as in Section III-D for MVAE-GS. The contribution ering the objective function LM V AE−Gau from equation (18), of each mixing component is calculated according to (16). derived in Section III-C, is given by : K X M V AE−Gau B. Training the Mixture of VAEs model ∇Ω [LRec + DKL (qθi (z|x)||p(z)) i=1 (30) Although the proposed mixture model is based on a collec- + DKL (qθ (a|x)||p(b)) + β dHSIC(z1 , . . . , zK ) tion of encoders and sub-decoders, we show that it can be eas- ily trained by using the SGD algorithm [49], when considering + DKL (q(m)||p(m))], either the objective function from equation (18) for MVAE- where the first three terms represent the image reconstruction Gau or that from (24) for MVAE-GS, depending on what (29), the KL divergence for all MVAE’s components, where 7 Calculate the mixing pararmeters for each component Dlrichlet distribution Dir (a1 , a2 , a3 , a4 ) ~ {w1 , w2 , w3 , w4 } wm GS ( p, k ) ~ {m1, m2 , m3 , m4} bi = K i i Gumble softmax dropout Gaussian dropout N (1,τ ) ~ {m1, m2 , m3 , m4} ∑ wi mi j =1 Each ai is the Dirichlet parameter Encoder1 {µ1 , σ 1 , a1} Z1 S1 reparametrization Encoder2 {µ2 , σ 2 , a2 } Z2 S2 K ∑S b j =1 j j Select components Encoder3 {µ3 , σ 3 , a3} Z3 S3 Encoder4 {µ4 , σ 4 , a4 } Z4 Sub-decoder4 S4 Fig. 1. The structure of the proposed MVAE model, when considering K = 4 encoders with associated sud-decoders. each component contributes with the expression from (27) and Algorithm 1: MVAE-GS training algorithm. their mixing weights (28), respectively, while the last two Algorithm : Training procedure with Bernoulli dropout components represent the objective functions for enforcing 1:Sample X r = { x1 , x 2 ,...., x N } from training dataset the independence of the mixing components (15) and for 2:While epoch < epoch max do evaluating their dropout probabilities. 3 : While batch < batch max do minibatch procedure When considering the MVAE-GS approach, described in 4: xbatch = Select ( epoch, X r ) batch samples Section III-D, we have the following updating gradient for the 5: ( µ1 , σ 1 , a1 ) .., ( µk , σ k , ak ) ← E1 ( xbatch ) .., Ek ( xbatch ) SGD based training: 6: { z1 ,..., zk } ← latent variables from all enccoders K X 7: ( a1 ,...., ak ) ← Dirichlet parameters from all encoders ∇M Ω V AE−GS [LRec + DKL (qθi (z|x)||p(z)) 8: ( w1 ,...., wk ) ← Dir ( a1 ,...., ak ) Sampling mixing weights i=1 9: {s1 ,....., sk } ← t1 ( z1 ) .., tk ( zk ) Transform by sub-decoders + DKL (qθ (a|x)||p(b)) + β dHSIC(z1 , . . . 
, zK ) − H(p) 10 : GS ( p, k ) ~ ( m1 ,..., mk ) Dropout from Gumbel-Softmax (31) wi mi 11 : bi = Drropout many components with the last term provided in equation (23). ∑w m k k k The parameters being updated in the MVAE model are : k 12: s = ∑ s j b j Integral outputs of all components Ω = {θ, γ, ν, η} (32) j =1 13 : x′ = p ( x | s ) reconstruct data by mix-decoder where θ and γ are the parameters of the mixture of K 14 : Calculate reconstruction error using equation (29) encoders qθ (z|x) and sub-decoders pγ (s|z), and η are the 15 : Calculatte kl divergence on guassian using equation (28) 16 : Calculate kl divergence on dirichlet using equation (27) parameters for the mix-decoder network, while ν represent the 17 : Calculate dHSIC using equation (14) parameters of the network inferring the dropout parameters for 18 : Update parameters of model according to equation (31) the components. 20 : End The pseudocode of the MVAE-GS training algorithm is 1 21:End provided in Algorithm 1. V. E XPERIMENTAL RESULTS components. For the implementation we use Python language In this section we assess the results provided by the and the deep learning Tensorflow platform. proposed Mixture of VAEs (MVAE) model on a variety of datasets. Each component of the MVAE model is represented by an encoder and a sub-decoder. The probabilistic models A. MNIST dataset are implemented by using fully connected and convolutional The MNIST dataset [50], consists of images of handwritten networks, depending on the complexity of the dataset. The digits of size 28 × 28 pixels. We train MVAE with K = 4 prior is the standard Gaussian distribution N (0, I) and the components when using the MVAE-Gau loss, defined in outputs for each encoder are the hyperparameters of the equation (18), considering 60,000 and 100,000 images for Gaussian distribution defining its latent space. We set β = 1 in training and testing, respectively. A set of original MNIST the objective function from (15), representing the weight for images are shown in Fig. 2a, and their reconstruction by the dHSIC term, defining the independence among the mixing MVAE is provided in Fig. 2f, while the reconstructions by each 8 of the four components are shown in Figures 2b-e. We observe face images with 10,177 identities. Each encoder of the that the mixture model gives a better reconstruction than mixture model is implemented by a deep convolution net each individual component. In the following, we investigate consisting of 5 convolution layers while the fully connected the uniqueness of the latent space representation by each layers are used to output the 256-dimensional hyperparameters component considering a different sub-decoder than its cor- of the Gaussian and Dirichlet distributions defining the latent responding encoder during the tests. For the images of digits space for MVAE. Each sub-decoder is implemented by a from Fig. 3a we consider reconstructions by mismatching the simple architecture with only a single layer of 8 × 8 × 256. sub-encoders and encoders for the same images. The results The output of the sub-decoder is then transformed into a 3D of such mismatches are shown in Figures 3b-e and we observe tensor after multiplying with the mixing weights and dropout that these images are not properly reconstructed. This happens parameters, as shown in Fig. 1. 
The mix-decoder is a deep because each sub-decoder is uniquely fitted to the associated deconvolution net consisting of 7 layers, which receive the encoder and not to any of the others which would yield tensor as the input and generates images as the output. We different latent space representations. set the kernel size as 3 × 3 for all convolution processing units. The mixture model is initially built using K = 6 mixing components considering the dropout rate optimized (a) Real test samples. during the training using the Gaussian dropout (MVAE-Gau), as described in Section III-C. We train the mixture model using the Adam optimization algorithm for 10 epochs with a learning (b) Results by the VAE component 1. rate of 0.001. A set of face images from CelebA dataset is shown in Fig. 4a, while their reconstructions by MVAE- (c) Results by the VAE component 2. Gau are provided in Fig. 4b. We also show the reconstruction results in Fig. 5c for MVAE-Gau for the subset of images from ImageNet database [52], shown in Fig. 5a. For comparison the (d) Results by the VAE component 3. same images are reconstructed by MSVI [18] in Fig. 5b. It can be observed that both human face and natural images are (e) Results by the VAE component 4. reconstructed well by the MVAE-Gau model. We also explore the manifold continuity by performing interpolation experiments in the latent space. For a single (f) Results by the MVAE model. VAE, we can directly perform interpolations on the latent Fig. 2. Reconstructed images from the MNIST dataset by MVAE and its variables inferred by that VAE. Nevertheless, in the mixture individual VAE components. model, we have multiple variational encoders, each defining its own latent space, which allows us to perform interpolations in new regions of the latent space, located in between the latent spaces modeled by different VAE components. Initially, a pair (a) Real test samples. of images {x1 , x2 } is randomly selected and we map these into the latent representation by using the trained encoders. We consider K = 6 components, and the interpolation di is (b) Results when using the encoder 1 with the latent variables of the performed on the output of each encoder as : sub-decoder 2. di = (1 − a)Ei (x1 )+aEi (x2 ), (33) (c) Results when using the encoder 2 with the latent variables of the sub-decoder 3. where Ei (·) is the output i-th variational encoder, i = 1, . . . , 6 and a is a hyperparameter controlling the interpolation in the latent space. Then, the latent space interpolations are (d) Results when using the encoder 3 with the latent variables of the transformed through the sub-decoders SubDeci (·) into : sub-decoder 4. 6 X c= wi · SubDeci (di ), (34) (e) Results when using the encoder 4 with the latent variables of the sub-decoder 3. i=1 Fig. 3. Results when mis-matching the encoders and sub-decoders when where wi is the weight modeling the contribution of each sub- reconstructing images from the MNIST dataset. decoder i = 1, . . . , K to the result of the interpolation c. Then, the mix-decoder M ixDec(·) is used to generate the image x0 : B. Representing complex images x0 = M ixDec(c). (35) In the section, we evaluate the performance of the proposed Interpolation results on CelebA dataset are shown in Fig. 6, MVAE mixture model on some databases which contain more where {x1 , x2 } are displayed as the extreme left and right im- complex images, such as the human face dataset entitled ages from each row of images. 
The interpolated images, when Celebrities Faces Attributes (CelebA) [51], and ImageNet varying a in (33) are shown in between the reconstructions of database [52]. CelebA dataset contains almost 200,000 human the original images. From these results we observe smooth and 9 realistic transitions which model various changes in the human their latent variables using the encoders. Then we perform face appearance, such as changing the illumination in the interpolations for each sub-decoder, where instead of using image, varying the hair style, modifying the age appearance, the corresponding inputs as in the previous experiment, we and so on. By exploring the manifold continuity, these results would use the outputs of a different encoder as the input to show the enriched data representations which can be achieved the given sub-decoder, replacing (34) with : in the latent space of the MVAE model. 6 X ĉ = wi · SubDeci (dj ) (36) i=1 where j 6= i. For example, the first sub-decoder is fed with the (a) Randomly selected images. interpolation results in the latent space of the second encoder. The reconstruction results are shown in Fig. 7, where the real images {x1 , x2 } are shown as the first and last on each row, while the images in between are the interpolation results using (36). We observe that in these situations, the mixture (b) Reconstructed face images by MVAE-Gau. model does not generate reasonable results. The main reason Fig. 4. Reconstruction results for face images from CelebA dataset, [51]. is that each component encoder learns a unique latent space, which is distinct from all other latent spaces modelled by the other encoders. This result is the consequence of enforcing the learning of distinct latent spaces for each VAE, by using the dHSIC measure, as described in Section III-B. (a) Randomly selected images. (b) Reconstructed natural images by MSVI [18]. (c) Reconstructed natural images by the MVAE-Gau. (a) Results of component 1. (b) Results of component 2. Fig. 5. Reconstruction results for images from ImageNet dataset. (c) Results of component 3. (d) Results of component 4. Fig. 8. Generated images of digits by different VAE components when fixing Fig. 6. Interpolation results in the latent space, where the extreme left and the class label and changing the first latent variable z1 from 0 to 3. right are real images from CelebA, while the images in between are the reconstruction results by MVAE model. C. Latent space analysis In the following we explore the latent space representation for the MVAE model. Each encoder and associated sub- decoder consider jointly the data and their corresponding class labels {x, y}, where the class information y is represented as a one-hot vector. We train the MVAE-Gau model, with Gaussian defined dropout, for K = 4 components, and considering only Fig. 7. Interpolation results, when the sub-decoder uses the latent space two latent variables, z1 and z2 , on images from the MNIST corresponding to a different encoder as its input, according to equation (36), where the extreme left and right images are real images from CelebA. dataset. Then we fix the class label while changing only one of the latent variables from 0 to 3. The generated results We also investigate the separation in the latent space be- with images of the digit ’5’ are shown in the Figures 8a-d tween the information encoded by different encoders. For this for each of the four components considered. 
Meanwhile, the experiment, we choose randomly a pair of images and extract generated images of the handwritten digit ’5’, for K = 6 10 (a) Component 1. (b) Component 2. (c) Component 3. (d) Component 4. (e) Component 5. (f) Component 6. Fig. 9. Generated images of digits for six different VAE components when fixing the class label and changing the first latent variable z1 from 0 to 6.0. (a) Component 1. (b) Component 2. (c) Component 3. (d) Component 4. (e) Component 5. (f) Component 6. Fig. 10. Generated images of digits for six different VAE components when fixing the class label and changing the second latent variable z2 from 0 to 6.0. components, when changing either the variable z1 or z2 After training MVAE-GS, considering the cost function from the latent space are shown in Figures 9a-f and 10a-f, from (24) with K = 6 components, we randomly select respectively, for each VAE component. We observe that each a batch of images belonging to the same class, and then component provides a different output when changing a single estimate their corresponding latent vectors by using various variable in the latent space, which shows the ability of the mixing components. We map the latent representation results latent space to model various data attributes. We observe that in Figures 13a and 13b, for β = 10 and β = 0, respectively, we achieve a better disentanglement in the data representation where the colors represent the latent space representations when increasing the number of components from 4 to 6. By produced by different components. We observe that the latent considering two different latent variables for each component space represented in Fig. 13a contains distinct clusters of of the mixture we can model additional underlying factors, latent vectors for the MVAE components, while when not while each component defines a different writing style for the considering the dHSIC term, i.e. β = 0, each component tends generated images. to embed data into the same region of the latent space, as We also train MVAE-Gau with K = 4 components, under shown in Fig. 13b. These results show that the dHSIC measure, the unsupervised learning setting, assuming that the class used in the objective function from (24), plays an important labels are not known. The latent variables corresponding to role in encouraging each component to embed data in different the images from the MNIST database, during testing, are regions of the latent space. plotted in Fig. 11, where the colours represent different class labels. These results indicate that MVAE provides a rich data E. Visual quality evaluation representation, where the information is distributed among In this section, we evaluate the reconstruction and repre- different regions of the latent space for each component. sentation learning ability of the proposed MVAE model and compare with the results achieved by the state of the art. The following models are considered for comparison: (1) Multiple D. Enforcing the separation between the latent space repre- Source Variational Inference (MSVI) [18] model, which is sentations of MVAE’s components a mixture of experts where each expert is implemented by The dHSIC measure is used to enforce the independence a VAE. We implement MSVI by using the same network between the latent spaces of the encoders, as explained in Sec- architecture used for our model. However, MSVI has more tion III-B. 
In this section we evaluate the separation between parameters than the proposed model, since we use a sub- the latent spaces modelled by the encoders, through exper- decoder implemented by a single layer instead of an entire iments following the training of MVAE-GS on the MNIST deep convolution net. (2) InfoVAE [33] is the current state of dataset. We consider K = 6 VAE components and a latent art VAE framework, which is able to balance accurate infer- space consisting of two variables, z1 and z2 . First, we estimate ence with the reconstruction quality. We implement InfoVAE the dHSIC measure between each pair of encoders’ latent by using a large network architecture. (3) β-VAE is a variant space distributions. The results, when assuming a penalty of of the VAE framework which aims to learn disentangled β = 1 and β = 10 in the objective function from (24) are representations. (4) We also compare with a single VAE [11] provided in Figures 12a and 12b, respectively. We observe implemented by a large network architecture. (5) Finally, we that when using a larger β we achieve better independence consider for comparison several other recent VAE frameworks, between the encoding distributions. such as those proposed in [34], [35], [53]. 11 4 4 4 4 2 2 2 2 0 0 0 0 −2 −2 −2 −2 −4 −4 −4 −4 −4 −2 0 2 4 −4 −2 0 2 4 −4 −2 0 2 4 −4 −2 0 2 4 (a) Component 1. (b) Component 2. (c) Component 3. (d) Component 4. Fig. 11. The representation of the latent space for the MVAE model for the MNIST dataset. Independence of latent variables Independence of latent variables TABLE I RMSE AND I NCEPTION SCORE ON CIFAR10 DATABASE . 0.23 0.14 0.14 0.14 0.13 0.19 0.24 0.081 0.095 0.1 0.099 0.11 0.24 C1 C1 0.22 0.14 0.23 0.2 0.18 0.17 0.13 0.081 0.24 0.15 0.14 0.17 0.14 0.21 Model RMSE IS C2 C2 0.20 DCGAN* [54] in [16] - 6.16 0.14 0.2 0.23 0.18 0.17 0.13 0.095 0.15 0.24 0.19 0.17 0.13 0.18 C3 C3 0.18 ALI* [55] in [16], [56] - 5.34 0.14 0.18 0.18 0.23 0.2 0.13 0.1 0.14 0.19 0.24 0.15 0.12 PixelCNN++* [57] in [32] 3.289 5.51 0.15 C4 C4 0.16 BEGAN [58] - 5.62 0.13 0.17 0.17 0.2 0.22 0.12 0.099 0.17 0.17 0.15 0.24 0.16 MVAE-Gau 2.97 6.38 C5 C5 0.12 0.14 MVAE-Gau fixed K 3.52 6.09 0.19 0.13 0.13 0.13 0.12 0.23 0.11 0.14 0.13 0.12 0.16 0.26 MVAE-GS 3.36 6.26 C6 C6 0.09 C1 C2 C3 C4 C5 C6 C1 C2 C3 C4 C5 C6 MSVI [18] 3.46 5.84 InfoVAE [33] 3.24 6.17 β-VAE [59] 9.12 4.92 (a) β = 1 (b) β = 10 VAE 4.64 5.04 Wasserstein-Wasserstein Auto-Encoders [35] 3.49 6.05 Fig. 12. Analysing the independence of the latent spaces for the mixture Continuous Bernoulli VAE* [53] - 4.55 components in MVAE-GS, considering the cost function from (24). MAE [34] 4.11 5.49 4 12.5 10.0 TABLE II 2 7.5 RMSE AND I NCEPTION SCORE ON CIFAR100 DATABASE . 5.0 0 2.5 Model RMSE IS −2 0.0 MVAE-Gau 3.10 5.64 −2.5 −4 MVAE-Gau fixed K 3.68 5.44 −5.0 −4 −2 0 2 4 6 8 10 −5 −4 −3 −2 −1 0 1 2 3 MVAE-GS 3.11 5.60 MSVI [18] 5.30 4.72 InfoVAE [33] 4.29 5.06 (a) β = 10 (b) β = 0 β-VAE [59] 9.17 3.31 VAE 6.84 4.07 Fig. 13. The representation of the latent spaces for MVAE for the images Wasserstein-Wasserstein Auto-Encoders [35] 5.83 4.97 showing the handwritten digit ’1’ from the MNIST dataset. Different colours MAE [34] 4.11 5.20 represent the latent vector projections of different components. We evaluate the reconstruction ability of the proposed as well as the Inception Score for CIFAR10 and CIFAR100 approach on CIFAR10 dataset [60], which contains 60,000 databases are provided in Tables I and II, where “MVAE- images grouped into 10 classes. 
We train the proposed MVAE Gau fixed K” denotes that the model considers K fixed and model using 50,000 images, with K = 6 components initially, ‘*’ represents that we cite results reported at the indicated using the Gambel-softmax dropout during the training, while reference. We also indicate the number of components found setting the maximum number of training epochs to 100. In when considering the Gambel-softmax dropout (MVAE-GS), order to evaluate both generative ability and the likelihood of and the average, calculated from several runs, for this database model collapse, we consider the Inception Score (IS), [61]: is K = 4.11. From Table I we find that MVAE-Gau provides the best result. We can also observe that selecting the number IS = exp(Ex [DKL (p(y|x)||p∗ (y))]) (37) of components definitely improves the performance, as shown where DKL represents the Kullback-Leibler divergence be- when comparing the results of MVAE-Gau with those provided tween the distributions of the labels of the recovered images by MVAE-Gau using a fixed K. and those from the database; x represents the image and We also evaluate the performance on the more challenging p(y|x) is the probability of the softmax output for the trained dataset, ImageNet [52]. The results are reported in Table III, classifier; p∗ (y) represents the labels statistics for the given where it can be observed that the proposed MVAE based images. The reconstruction Root Mean Square Error (RMSE) models outperform all other VAE based methods considered 12 for comparison when applied on ImageNet. TABLE V T HE RESULTS BY THREE CLASSIFIERS TRAINED ON THE LATENT VARIABLES INFERRED BY THE ENCODER WITH G AUSSIAN DROPOUT. TABLE III RMSE AND I NCEPTION SCORE (IS) ON I MAGE N ET DATABASE . Classifier Mixture C1 C2 C3 C4 VAE Model RMSE IS MLP 98.04 96.94 96.87 96.85 96.97 97.21 Linear SVM 95.62 93.65 93.81 93.83 93.67 93.81 MVAE-Gau 19.44 6.84 KNN 97.25 96.91 96.98 96.85 96.15 96.51 MVAE-Gau fixed K 20.87 6.30 MVAE-GS 20.45 6.52 MSVI [18] 22.29 6.12 InfoVAE [33] 22.73 6.14 G. Selecting the number of VAE components β-VAE [59] 31.47 5.05 VAE 28.44 5.46 The use of component dropout, defining the appropriate Wasserstein-Wasserstein Auto-Encoders [35] 25.63 5.79 number of MVAE components, as discussed in Sections III-C MAE [34] 23.25 5.87 and III-D, reduces the overfitting to the training set while also easing the requirements on computation and architecture complexity. The dropout masking vector m, depends on the dropout rate p, which is learned during the training stage. F. Evaluation of the representation learning We train the MVAE mixture model changing the initial The representation learning ability is a very important prop- number of VAE components by adopting different dropout erty for deep learning models. We assume that the proposed mechanisms. The reconstruction error results for the MNIST mixture model is able to provide a rich representation of data, dataset, when considering the selection of VAE components which would help to avoid overfitting because it embeds data using the Gaussian dropout (MVAE-Gau) from (18), and the into different regions of the latent space, as shown in Fig. 11. Gumbel-Softmax dropout (MVAE-GS) using equation (24) as In order to measure the representation learning ability, we the objective function, are provided in Table VI. The “Selected consider using a simple classifier on the latent representations K” column represents the average number of VAE components extracted by various models. 
For the mixture model, we firstly required by the model calculated over all trials. When using extract features from each component and then concatenate the Gumbel-softmax trick from (21) to generate the dropout these features into a single vector. The classifier is trained masking parameters mi , i = 1, . . . , K for MVAE-GS we on the latent representations of the training data and then consider a threshold of 0.1, while removing all components evaluated on a different dataset. We consider initially a simple with lower mi ’s. When considering the MVAE-Gau algorithm, network consisting of two layers as a basic classifier. The we set the threshold as 0.8 on the dropout masking parameter classification results, after training on the latent representa- mi , given that the sampling takes place from a Gaussian tions, are provided in Table IV, where we only compare with distribution with the mean of 1. We observe that the mixture the best models, InfoVAE and MSVI, according to the results model with all its components provides a lower MSE error from the previous section. The results show that the proposed than the MVAE model with dropout, except for the MVAE- approach outperforms other methods by a large margin on Gau model. This is due to the fact that the proposed dropout CIFAR10 database. We achieve better results than the mixture method reduces overfitting. Furthermore, we also consider a model MSVI [18] which actually uses more parameters. This mixture model with a single component, K = 1 as well as shows that the proposed model is able to provide a rich data using a single VAE sharing the same network architecture representation. We also compare with the recently proposed and hyperparameters with the MVAE mixture model. It can GUIDE model [62]. be observed from Table VI that the performance of the mixture model with a single component is worse than all other TABLE IV architectures considered. C LASSIFICATION RESULTS OF VARIOUS METHODS . TABLE VI Dataset InfoVAE MSVI MVAE-GS MVAE-Gua GUIDE T HE RECONSTRUCTION ERROR ON THE MNIST DATASET WHEN USING COMPONENT DROPOUT. CIFAR10 42.92 49.48 52.13 51.78 - MNIST 96.73 97.15 98.04 97.59 98.15 Model Initial No Selected K MSE of Components We also investigate the performance of three classic clas- 4 3.07 7.35 sifiers: Multilayer Perceptron (MLP), Linear Support Vector 4 4 7.42 MVAE-GS 6 4.05 7.30 Machines (SVM) and K-nearest neighbours (KNN), which are 6 6 7.41 trained on the latent representations extracted by each VAE 1 1 10.35 component of the MVAE model. The results are reported in 4 4 9.52 Table V, where C1 denotes that the classifier is trained on MVAE-Gau 6 3.45 7.53 the representation extracted by the first component of MVAE- 6 6 7.89 1 1 12.62 GS. This result demonstrates that the proposed model embeds data into several distinct latent subspaces, which enhances the Single VAE 1 1 9.07 performance in classification tasks. 13 The objective value The objective value 270 MNIST Fashion MNIST 140 The dropout rate The dropout rate 260 0.24 Celeba 130 0.40 250 0.22 0.35 120 Loss Loss 0.20 240 0.30 Dropout rate Dropout rate 0.18 110 230 0.25 0.16 100 0.20 0.14 220 0 20 40 60 80 100 0 20 40 60 80 100 0.15 MNIST 0.12 Epochs Epochs Fashion MNIST CIFAR10 0.10 0.10 0 20 40 60 80 100 0 20 40 60 80 100 (a) MNIST (b) MNIST Fashion Epochs 100 batch The objective value The objective value (a) MNIST, MNIST Fashion (b) CelebaA database. CIFAR10 1200 Celeba 350 1100 and CIFAR10 databases 300 1000 Fig. 15. 
G. Selecting the number of VAE components

The use of component dropout for defining the appropriate number of MVAE components, as discussed in Sections III-C and III-D, reduces the overfitting to the training set, while also easing the requirements on computation and architecture complexity. The dropout masking vector m depends on the dropout rate p, which is learned during the training stage. We train the MVAE mixture model while changing the initial number of VAE components and adopting different dropout mechanisms. The reconstruction error results for the MNIST dataset, when considering the selection of VAE components using the Gaussian dropout (MVAE-Gau) from (18), and the Gumbel-Softmax dropout (MVAE-GS) using equation (24) as the objective function, are provided in Table VI. The "Selected K" column represents the average number of VAE components required by the model, calculated over all trials. When using the Gumbel-softmax trick from (21) to generate the dropout masking parameters m_i, i = 1, ..., K, for MVAE-GS we consider a threshold of 0.1, removing all components with lower m_i's. When considering the MVAE-Gau algorithm, we set the threshold to 0.8 on the dropout masking parameter m_i, given that the sampling takes place from a Gaussian distribution with a mean of 1 (a short illustrative sketch of this selection rule is given at the end of this subsection). We observe from Table VI that the MVAE models employing component dropout provide a lower MSE than the corresponding mixture models retaining all of their components, for both MVAE-GS and MVAE-Gau. This is due to the fact that the proposed dropout method reduces overfitting. Furthermore, we also consider a mixture model with a single component, K = 1, as well as a single VAE sharing the same network architecture and hyperparameters with the MVAE mixture model. It can be observed from Table VI that the performance of the mixture model with a single component is worse than that of all the other architectures considered.

TABLE VI
THE RECONSTRUCTION ERROR ON THE MNIST DATASET WHEN USING COMPONENT DROPOUT.
Model | Initial No. of Components | Selected K | MSE
MVAE-GS | 4 | 3.07 | 7.35
MVAE-GS | 4 | 4 | 7.42
MVAE-GS | 6 | 4.05 | 7.30
MVAE-GS | 6 | 6 | 7.41
MVAE-GS | 1 | 1 | 10.35
MVAE-Gau | 4 | 4 | 9.52
MVAE-Gau | 6 | 3.45 | 7.53
MVAE-Gau | 6 | 6 | 7.89
MVAE-Gau | 1 | 1 | 12.62
Single VAE | 1 | 1 | 9.07

In the following we train a mixture model by initially setting K = 6 components and considering the VAE component dropout implemented using the Gumbel-Softmax approach (MVAE-GS), on four datasets: MNIST, MNIST Fashion, CIFAR10 and CelebA. We consider 100 training epochs for the first three datasets, while for the CelebA dataset we measure the dropout rate every 100 batches during a single epoch. The variation of the objective function L_{MVAE-GS} from equation (24) during the training is shown in Figs. 14a-d for the four databases: MNIST, MNIST Fashion, CIFAR10 and CelebA, respectively. The variation of the dropout rate p for the MVAE-GS algorithm is provided in Fig. 15a for the MNIST, MNIST Fashion and CIFAR10 databases. We observe that p initially increases quickly and afterwards becomes rather stable during the training for each of the first three databases. The result for the CelebA database is shown in Fig. 15b, where a single training epoch is considered, because this dataset is larger and more complex than the others. The dropout masking parameters m_i, i = 1, ..., K, depend on the dropout rate p, according to equation (22) for MVAE-GS. We consider a threshold of 0.1 on the estimated dropout parameter p for removing a VAE component. The results for MNIST, MNIST Fashion and CIFAR10 are shown in Fig. 16a, while those for the CelebA database are provided in Fig. 16b. It can be observed that MVAE requires more components for modelling and generating the images corresponding to the CIFAR10 dataset than for the other datasets, because CIFAR10 contains more diverse images, displaying more complex information than the MNIST and MNIST Fashion databases.

Fig. 14. The variation of the objective function L_{MVAE-GS} during the training: (a) MNIST, (b) MNIST Fashion, (c) CIFAR10, (d) CelebA.

Fig. 15. The variation of the dropout rate p during the training for MVAE-GS: (a) the MNIST, MNIST Fashion and CIFAR10 databases; (b) the CelebA database.

Fig. 16. The variation in the number of VAE components through updating the dropout parameter p as in (22), when initially considering K = 6 components and using the Gumbel-softmax: (a) the MNIST, MNIST Fashion and CIFAR10 databases; (b) the CelebA database.
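The following is a minimal sketch of the component-selection rule referenced above: components are kept or removed by thresholding their dropout masking parameters, with 0.1 used for the Gumbel-Softmax masks and 0.8 for the Gaussian masks. The mask-sampling step shown here is a generic Gumbel-Softmax relaxation and only approximates the paper's equations (21)-(22); all variable names and the temperature value are assumptions for illustration.

```python
import numpy as np

def select_components(mask, variant):
    """Return the indices of the VAE components kept after dropout.

    mask:    array of K dropout masking parameters m_i.
    variant: "gs"  -> Gumbel-Softmax masks, remove components with m_i < 0.1;
             "gau" -> Gaussian masks (mean 1), remove components with m_i < 0.8.
    """
    threshold = 0.1 if variant == "gs" else 0.8
    return np.flatnonzero(mask >= threshold)

# Illustrative Gumbel-Softmax-style "keep" masks for K = 6 components, driven
# by a learned dropout rate p (a simplification of the sampling in the paper).
rng = np.random.default_rng(0)
p = 0.35                                     # learned dropout rate
gumbel = -np.log(-np.log(rng.uniform(size=(6, 2))))   # Gumbel(0,1) noise
logits = np.log(np.array([1.0 - p, p]))      # keep / drop log-probabilities
y = np.exp((logits + gumbel) / 0.5)          # temperature 0.5
y = y / y.sum(axis=1, keepdims=True)         # softmax over {keep, drop}
m = y[:, 0]                                  # relaxed "keep" mask per component
print(select_components(m, "gs"))
```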
H. Data generation diversity by each VAE component in MVAE

In order to show that each component learns different characteristics of the data, we train the MVAE model with K = 6 components on the CelebA dataset. We then sample a random vector from a Gaussian distribution, which is used as the input for each sub-decoder. The output of each sub-decoder is then fed into the final decoder, which outputs the generated images x'. In Figs. 17a-f we show, on each row, the face images generated by each of the VAE components i = 1, ..., 6, where the images in each column correspond to the same input random vector, sampled from a Gaussian distribution. From the results in Fig. 17 we can observe that each component outputs different human faces, or different appearances for the same face, thus showing the ability of MVAE to generate diverse images.

Fig. 17. Generation results by each individual VAE component for the CelebA dataset for a set of random inputs: (a)-(f) results for VAE components 1 to 6.
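The generation procedure used for Fig. 17 can be summarised by the following minimal sketch. It assumes trained sub-decoders and a shared final (mix-)decoder exposing a simple callable interface; the function names, shapes and interface are illustrative assumptions, not the exact implementation.

```python
import numpy as np

def generate_per_component(sub_decoders, mix_decoder, latent_dim,
                           n_samples=8, seed=0):
    """Generate one row of images per VAE component from shared random inputs.

    sub_decoders: list of K trained sub-decoders, each mapping latent vectors
                  to an intermediate code (assumed callable interface).
    mix_decoder:  the shared final decoder mapping that code to images x'.
    Returns a list of K arrays; all rows share the same latent inputs,
    mirroring the column-wise comparison in Fig. 17.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_samples, latent_dim))  # same z for every component
    rows = []
    for sub_dec in sub_decoders:
        h = sub_dec(z)                # component-specific intermediate code
        rows.append(mix_decoder(h))   # shared decoder produces the images x'
    return rows
```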
VI. CONCLUSIONS

In this research study we propose a deep learning mixture model based on collections of variational encoders and sub-decoders, called the Mixture of Variational Autoencoders (MVAE). The latent space of each VAE component captures specific characteristics of the data in different ways, providing rich latent representations that benefit many tasks. These properties result in enhanced data generation abilities for the MVAE model. Each sub-decoder has a simple design consisting of a single-layer network, benefiting from quick training, while the mix-decoder is implemented by a deeper CNN. The separability between the latent spaces corresponding to each VAE is enforced by using the d-variable Hilbert-Schmidt independence criterion (dHSIC). Each component of MVAE models a distinct latent space, avoiding the overfitting which occurs in other models. We also consider a component dropout mechanism in order to select the appropriate number of VAE components in MVAE. The training of MVAE jointly estimates the parameters of the encoders, the sub-decoders and the mix-decoder, while enforcing the dHSIC criterion, performing the Dirichlet sampling of the mixing weights and applying the component dropout procedure, within an end-to-end training procedure using stochastic gradient descent (SGD). A variety of data manipulations, including interpolations in the joint latent spaces of the VAE components, show the capabilities of the proposed MVAE model.

REFERENCES

[1] A. G. Bors and M. Gabbouj, "Minimal topology for a radial basis functions neural network for pattern classification," Digital Signal Processing, vol. 4, no. 3, pp. 173-188, 1994.
[2] A. G. Bors and I. Pitas, "Median radial basis function neural network," IEEE Trans. on Neural Networks, vol. 7, no. 6, pp. 1351-1364, 1996.
[3] S. Chen, C. F. N. Cowan, and P. M. Grant, "Orthogonal least squares learning algorithm for radial basis function networks," IEEE Trans. on Neural Networks, vol. 2, no. 2, pp. 302-309, 1991.
[4] L. Weruaga and J. Via, "Sparse multivariate Gaussian mixture regression," IEEE Trans. on Neural Networks and Learning Systems, vol. 26, no. 5, pp. 1098-1108, 2015.
[5] E. J. Hartman, J. D. Keeler, and J. M. Kowalski, "Layered neural networks with Gaussian hidden units as universal approximations," Neural Computation, vol. 2, no. 2, pp. 210-215, 1990.
[6] J. Park and I. W. Sandberg, "Universal approximation using radial-basis-function networks," Neural Computation, vol. 3, no. 2, pp. 246-257, 1991.
[7] T. Poggio and F. Girosi, "Networks for approximation and learning," Proc. of the IEEE, vol. 78, no. 9, pp. 1481-1497, 1990.
[8] H. Attias, "A variational Bayesian framework for graphical models," in Advances in Neural Inf. Proc. Systems (NIPS), 2000, pp. 209-215.
[9] Z. Ghahramani and M. Beal, "Variational inference for Bayesian mixtures of factor analysers," in Advances in Neural Inf. Proc. Systems (NIPS), 2000, pp. 449-455.
[10] N. Nasios and A. G. Bors, "Variational learning for Gaussian mixture models," IEEE Trans. on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 36, no. 4, pp. 849-862, 2006.
[11] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," 2013. [Online]. Available: https://arxiv.org/abs/1312.6114
[12] N. Dilokthanakul, P. Mediano, M. Garnelo, M. Lee, H. Salimbeni, K. Arulkumaran, and M. Shanahan, "Deep unsupervised clustering with Gaussian mixture variational autoencoders," in Proc. Int. Conf. on Learning Representations (ICLR), 2018. [Online]. Available: https://arxiv.org/abs/1611.02648
[13] X. Yang, C. Deng, F. Zheng, J. Yan, and W. Liu, "Deep spectral clustering using dual autoencoder network," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4066-4075.
[14] A. Klushyn, N. Chen, R. Kurle, B. Cseke, and P. van der Smagt, "Learning hierarchical priors in VAEs," in Advances in Neural Inf. Proc. Systems (NeurIPS), 2019, pp. 2870-2879.
[15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Inf. Proc. Systems (NIPS), 2014, pp. 2672-2680.
[16] L. Chen, S. Dai, Y. Pu, C. Li, Q. Su, and L. Carin, "Symmetric variational autoencoder and connections to adversarial learning," in Proc. Int. Conf. on Artificial Intelligence and Statistics (AISTATS), vol. PMLR 84, 2018, pp. 661-669.
[17] F. Ye and A. G. Bors, "Learning joint latent representations based on information maximization," Information Sciences, 2021.
[18] R. Kurle, S. Günnemann, and P. van der Smagt, "Multi-source neural variational inference," in Proc. AAAI Conf. on Artificial Intelligence, vol. 33, 2019, pp. 4114-4121.
[19] E. Abbasnejad, M. Dick, and A. van den Hengel, "Infinite variational autoencoder for semi-supervised learning," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5888-5897.
[20] Z. Szabó and B. K. Sriperumbudur, "Characteristic and universal tensor product kernels," Journal of Machine Learning Research, vol. 18, no. 233, pp. 1-29, 2018.
[21] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1-38, 1977.
[22] C. Cremer, X. Li, and D. Duvenaud, "Inference suboptimality in variational autoencoders," in Proc. Int. Conf. on Learning Representations (ICLR), 2018. [Online]. Available: https://arxiv.org/abs/1801.03558
[23] D. J. Rezende, S. Mohamed, and D. Wierstra, "Stochastic backpropagation and approximate inference in deep generative models," in Proc. Int. Conf. on Machine Learning (ICML), vol. PMLR 32(2), 2014, pp. 1278-1286.
[24] R. van den Berg, L. Hasenclever, J. M. Tomczak, and M. Welling, "Sylvester normalizing flows for variational inference," in Proc. Uncertainty in Artificial Intelligence (UAI), 2018, pp. 393-402.
[25] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling, "Improving variational inference with inverse autoregressive flow," in Advances in Neural Inf. Proc. Systems (NIPS), 2016, pp. 4743-4751.
[26] J. M. Tomczak and M. Welling, "Improving variational auto-encoders using Householder flow," 2016. [Online]. Available: https://arxiv.org/abs/1611.09630
[27] D. J. Rezende and S. Mohamed, "Variational inference with normalizing flows," in Proc. Int. Conf. on Machine Learning (ICML), vol. PMLR 37, 2015, pp. 1530-1538.
[28] R. Ranganath, D. Tran, and D. Blei, "Hierarchical variational models," in Proc. Int. Conf. on Machine Learning (ICML), vol. PMLR 48, 2016, pp. 324-333.
[29] T. Salimans, D. P. Kingma, and M. Welling, "Markov chain Monte Carlo and variational inference: Bridging the gap," in Proc. Int. Conf. on Machine Learning (ICML), vol. PMLR 37, 2015, pp. 1218-1226.
[30] L. Maaløe, C. K. Sønderby, S. K. Sønderby, and O. Winther, "Auxiliary deep generative models," in Proc. Int. Conf. on Machine Learning (ICML), vol. PMLR 48, 2016, pp. 1445-1453.
[31] J. Domke and D. R. Sheldon, "Importance weighting and variational inference," in Advances in Neural Inf. Proc. Systems (NeurIPS), 2018, pp. 4470-4479.
[32] Y. Pu, W. Wang, R. Henao, L. Chen, Z. Gan, C. Li, and L. Carin, "Adversarial symmetric variational autoencoder," in Advances in Neural Inf. Proc. Systems (NIPS), 2017, pp. 4333-4342.
[33] S. Zhao, J. Song, and S. Ermon, "InfoVAE: Balancing learning and inference in variational autoencoders," in Proc. AAAI Conf. on Artificial Intelligence, vol. 33, 2019, pp. 5885-5892.
[34] X. Ma, C. Zhou, and E. Hovy, "MAE: Mutual posterior-divergence regularization for variational autoencoders," in Proc. Int. Conf. on Learning Representations (ICLR), 2019. [Online]. Available: https://arxiv.org/abs/1901.01498
[35] S. Zhang, Y. Gao, Y. Jiao, J. Liu, Y. Wang, and C. Yang, "Wasserstein-Wasserstein auto-encoders," 2019. [Online]. Available: https://arxiv.org/abs/1902.09323
[36] F. Ye and A. G. Bors, "Mixtures of variational autoencoders," in Proc. Int. Conf. on Image Processing Theory, Tools and Applications (IPTA), 2020.
[37] D. M. Endres and J. E. Schindelin, "A new metric for probability distributions," IEEE Trans. on Information Theory, vol. 49, no. 3, pp. 1858-1860, 2003.
[38] N. Pfister, P. Bühlmann, B. Schölkopf, and J. Peters, "Kernel-based tests for joint independence," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 80, no. 1, pp. 5-31, 2018.
[39] R. Lopez, J. Regier, M. I. Jordan, and N. Yosef, "Information constraints on auto-encoding variational Bayes," in Advances in Neural Inf. Proc. Systems (NeurIPS), 2018, pp. 6117-6128.
[40] G. Schwarz, "Estimating the dimension of a model," Annals of Statistics, vol. 6, no. 2, pp. 461-464, 1978.
[41] J. Rissanen, Stochastic Complexity in Statistical Inquiry. World Scientific, 1989.
[42] D. Kingma, T. Salimans, and M. Welling, "Variational dropout and the local reparameterization trick," in Advances in Neural Inf. Proc. Systems (NIPS), 2015, pp. 2575-2583.
[43] E. Jang, S. Gu, and B. Poole, "Categorical reparameterization with Gumbel-Softmax," in Proc. Int. Conf. on Learning Representations (ICLR), 2017. [Online]. Available: https://arxiv.org/abs/1611.01144
[44] E. J. Gumbel, Statistical Theory of Extreme Values and Some Practical Applications: A Series of Lectures, 1954.
[45] C. Maddison, D. Tarlow, and T. Minka, "A* sampling," in Advances in Neural Inf. Proc. Systems (NIPS), 2014, pp. 3086-3094.
[46] E. J. Gumbel, "Bivariate logistic distributions," Journal of the American Statistical Association, vol. 56, pp. 335-349, 1961.
[47] Y. Gal, J. Hron, and A. Kendall, "Concrete dropout," in Advances in Neural Inf. Proc. Systems (NIPS), 2017, pp. 3581-3590.
[48] M. Figurnov, S. Mohamed, and A. Mnih, "Implicit reparameterization gradients," in Advances in Neural Inf. Proc. Systems (NeurIPS), 2018, pp. 439-450.
[49] S. Ruder, "An overview of gradient descent optimization algorithms," 2016. [Online]. Available: https://arxiv.org/abs/1609.04747
[50] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[51] Z. Liu, P. Luo, X. Wang, and X. Tang, "Deep learning face attributes in the wild," in Proc. IEEE Int. Conf. on Computer Vision (ICCV), 2015, pp. 3730-3738.
[52] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu, "Pixel recurrent neural networks," in Proc. Int. Conf. on Machine Learning (ICML), vol. PMLR 48, 2016, pp. 1747-1756.
[53] G. Loaiza-Ganem and J. P. Cunningham, "The continuous Bernoulli: Fixing a pervasive error in variational autoencoders," in Advances in Neural Inf. Proc. Systems (NeurIPS), 2019, pp. 13266-13276.
[54] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," in Proc. Int. Conf. on Learning Representations (ICLR), 2015. [Online]. Available: https://arxiv.org/abs/1511.06434
[55] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville, "Adversarially learned inference," in Proc. Int. Conf. on Learning Representations (ICLR), 2017. [Online]. Available: https://arxiv.org/abs/1606.00704
[56] D. Warde-Farley and Y. Bengio, "Improving generative adversarial networks with denoising feature matching," in Proc. Int. Conf. on Learning Representations (ICLR), 2017.
[57] T. Salimans, A. Karpathy, X. Chen, and D. Kingma, "PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications," in Proc. Int. Conf. on Learning Representations (ICLR), 2017. [Online]. Available: https://arxiv.org/abs/1701.05517
[58] D. Berthelot, T. Schumm, and L. Metz, "BEGAN: Boundary equilibrium generative adversarial networks," 2017. [Online]. Available: https://arxiv.org/abs/1703.10717
[59] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner, "β-VAE: Learning basic visual concepts with a constrained variational framework," in Proc. Int. Conf. on Learning Representations (ICLR), 2017.
[60] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Univ. of Toronto, Tech. Rep., 2009.
[61] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, "Improved techniques for training GANs," in Advances in Neural Inf. Proc. Systems (NIPS), 2016, pp. 2234-2242.
[62] Z. Ding, Y. Xu, W. Xu, G. Parmar, Y. Yang, M. Welling, and Z. Tu, "Guided variational autoencoder for disentanglement learning," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 7920-7929.

Fei Ye is currently a third-year PhD student in Computer Science at the University of York. He received the bachelor's degree from Chengdu University of Technology, China, in 2014, and the master's degree in Computer Science and Technology from Southwest Jiaotong University, China, in 2018. His research topics include deep generative image models, lifelong learning and mixture models.

Adrian G. Bors received the MSc degree in Electronics Engineering from the Polytechnic University of Bucharest, Romania, in 1992, and the PhD degree in Informatics from the University of Thessaloniki, Greece, in 1999. In 1999 he joined the Department of Computer Science, University of York, U.K., where he is currently a lecturer. Dr. Bors held temporary positions at Tampere University of Technology, Finland, the University of California at San Diego (UCSD), and the University of Montpellier, France. He has authored and co-authored more than 130 research papers, including 30 in journals. His research interests include computer vision, computational intelligence and image processing. Dr. Bors was an associate editor of IEEE Trans. on Image Processing between 2010 and 2014 and of IEEE Trans. on Neural Networks from 2001 to 2009. He was a co-guest editor for a special issue on Machine Vision for the International Journal of Computer Vision in 2018, and for the Journal of Pattern Recognition in 2015. Dr. Bors was a member of the organisation committees for IEEE WIFS 2021, IPTA 2020, IEEE ICIP 2018, BMVC 2016, IPTA 2014, CAIP 2013 and IEEE ICIP 2001. He is a Senior Member of the IEEE.