# View Embedding

## Generalized Canonical Correlation Analysis

class mvlearn.embed.GCCA(n_components=None, fraction_var=None, sv_tolerance=None, n_elbows=2, tall=False, max_rank=False, n_jobs=None)

An implementation of Generalized Canonical Correlation Analysis [1]. It first applies single-view dimensionality reduction, which makes it suitable for cases where the number of features exceeds the number of samples. It then computes individual projections into a common subspace such that the pairwise correlations between projections are maximized. Note that this applies to any number of views, not just two.

Parameters:

- n_components : int (positive), optional, default=None. If self.sv_tolerance=None, selects the number of SVD components to keep for each view. If None, another selection method is used.
- fraction_var : float, default=None. If self.sv_tolerance=None and self.n_components=None, selects the number of SVD components to keep for each view by capturing enough of the variance. If None, another selection method is used.
- sv_tolerance : float, optional, default=None. Selects the number of SVD components to keep for each view by thresholding singular values. If None, another selection method is used.
- n_elbows : int, optional, default=2. If self.fraction_var=None, self.sv_tolerance=None, and self.n_components=None, then compute the optimal embedding dimension using utils.select_dimension(). Otherwise, ignored.
- tall : boolean, default=False. Set to True if n_samples > n_features; speeds up the SVD.
- max_rank : boolean, default=False. If True, sets the rank of the common latent space as the maximum rank of the individual spaces. If False, uses the minimum individual rank.
- n_jobs : int (positive), default=None. The number of jobs to run in parallel when computing the SVDs for each view in fit and partial_fit. None means 1 job, -1 means using all processors.

Attributes

- projection_mats_ : list of arrays. A projection matrix for each view, from the given space to the latent space.
- ranks_ : list of ints. Number of left singular vectors kept for each view during the first SVD.

Notes

Consider two views $$X_1$$ and $$X_2$$. Canonical Correlation Analysis seeks to find vectors $$a_1$$ and $$a_2$$ that maximize the correlation between $$X_1 a_1$$ and $$X_2 a_2$$, expanded below.

$\left(\frac{a_1^TC_{12}a_2} {\sqrt{a_1^TC_{11}a_1a_2^TC_{22}a_2}} \right)$

where $$C_{11}$$, $$C_{22}$$, and $$C_{12}$$ are respectively the view 1, view 2, and between-view covariance matrix estimates. GCCA maximizes the sum of these correlations across all pairs of views and computes a set of linearly independent components. This specific algorithm first applies principal component analysis (PCA) independently to each view and then aligns the most informative projections to find correlated and informative subspaces. Parameters that control the embedding dimension apply to the PCA step. The dimension of each aligned subspace is the maximum or minimum of the individual dimensions, per the max_rank parameter. Using the maximum will capture the most information from all views but also noise from some views. Using the minimum will better remove noise dimensions but at the cost of information from some views.
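As a rough illustration of the quantity being maximized, the sketch below evaluates the correlation expression from covariance estimates with NumPy. The function name `cca_objective` and the toy data are assumptions made for this example; it is not how GCCA is implemented in mvlearn.

```python
import numpy as np

def cca_objective(X1, X2, a1, a2):
    """Correlation between the projections X1 @ a1 and X2 @ a2."""
    X1c = X1 - X1.mean(axis=0)
    X2c = X2 - X2.mean(axis=0)
    n = X1.shape[0]
    C11 = X1c.T @ X1c / (n - 1)   # view 1 covariance estimate
    C22 = X2c.T @ X2c / (n - 1)   # view 2 covariance estimate
    C12 = X1c.T @ X2c / (n - 1)   # between-view covariance estimate
    return (a1 @ C12 @ a2) / np.sqrt((a1 @ C11 @ a1) * (a2 @ C22 @ a2))

# Toy data (hypothetical): view 2 is a noisy copy of two columns of view 1
rng = np.random.default_rng(0)
X1 = rng.normal(size=(100, 3))
X2 = X1[:, :2] + 0.5 * rng.normal(size=(100, 2))
a1 = np.array([1.0, 0.0, 0.0])
a2 = np.array([1.0, 0.0])
rho = cca_objective(X1, X2, a1, a2)
```

The value `rho` agrees with the Pearson correlation of the two projected vectors, which is what CCA optimizes over `a1` and `a2`.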

References

 [1] B. Afshin-Pour, G.A. Hossein-Zadeh, S.C. Strother, H. Soltanian-Zadeh. Enhancing reproducibility of fMRI statistical maps using generalized canonical correlation analysis in NPAIRS framework. Neuroimage, 60 (2012), pp. 1970-1981

Examples

>>> from mvlearn.datasets import load_UCImultifeature
>>> from mvlearn.embed import GCCA
>>> # Load full dataset, labels not needed
>>> Xs, _ = load_UCImultifeature()
>>> gcca = GCCA(fraction_var=0.9)
>>> # Transform the first 5 views
>>> Xs_latents = gcca.fit_transform(Xs[:5])
>>> print([X.shape[1] for X in Xs_latents])
[9, 9, 9, 9, 9]

fit(Xs, y=None)

Calculates a projection from each view to a latent space such that the sum of pairwise latent space correlations is maximized. Each view 'X' is normalized and the left singular vectors of 'X^T X' are calculated using SVD. The number of singular vectors kept is determined by either the percent variance explained, a given rank threshold, or a given number of components. The kept singular vectors are concatenated, and the SVD of the concatenation is used to calculate the projections for each view.

Parameters:

- Xs : list of array-likes or numpy.ndarray. Xs length: n_views; Xs[i] shape: (n_samples, n_features_i). The data to fit to. Each view will receive its own embedding.
- y : ignored. Included for API compliance.

Returns:

- self : returns an instance of self.
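The two-stage procedure described for fit (a per-view SVD followed by an SVD of the concatenated singular vectors) can be sketched with NumPy as follows. This is a simplified illustration under stated assumptions: a fixed `rank` for every view, centering as the only normalization, and hypothetical names (`gcca_sketch`, the toy mixing matrices). It is not mvlearn's implementation.

```python
import numpy as np

def gcca_sketch(Xs, rank, n_latent):
    """Return one projection matrix per view (simplified two-stage SVD)."""
    bases, factors = [], []
    for X in Xs:
        Xc = X - X.mean(axis=0)                       # center each view
        U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        bases.append(U[:, :rank])                     # kept left singular vectors
        factors.append((Vt[:rank].T, s[:rank]))
    # Joint step: SVD of the concatenated per-view bases
    _, _, Wt = np.linalg.svd(np.hstack(bases), full_matrices=False)
    W = Wt[:n_latent].T
    projections = []
    for i, (V, s) in enumerate(factors):
        Wi = W[i * rank:(i + 1) * rank]               # rows belonging to view i
        projections.append(V @ (Wi / s[:, None]))     # maps centered X_i to U_i @ Wi
    return projections

# Toy data (hypothetical): two views driven by the same 2-D latent signal
rng = np.random.default_rng(0)
z = rng.normal(size=(200, 2))
A1 = np.array([[1.0, 0.0, 1.0, -1.0, 0.5], [0.0, 1.0, -1.0, 1.0, 0.5]])
A2 = np.array([[1.0, 1.0, 0.0, 0.0, 1.0, -1.0], [0.0, 1.0, 1.0, -1.0, 0.0, 1.0]])
Xs = [z @ A1 + 0.1 * rng.normal(size=(200, 5)),
      z @ A2 + 0.1 * rng.normal(size=(200, 6))]
P1, P2 = gcca_sketch(Xs, rank=3, n_latent=2)
Z1 = (Xs[0] - Xs[0].mean(axis=0)) @ P1
Z2 = (Xs[1] - Xs[1].mean(axis=0)) @ P2
```

Because both toy views share the latent signal `z`, the leading columns of `Z1` and `Z2` come out strongly correlated.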
partial_fit(Xs, reset=False, multiview_step=True)

Performs like fit, but will not overwrite previously fitted single views and instead uses them as well as the new data. Useful if the data needs to be processed in batches.

Parameters:

- Xs : list of array-likes or numpy.ndarray. Xs length: n_views; Xs[i] shape: (n_samples, n_features_i). The data to fit to. Each view will receive its own embedding.
- reset : boolean, default=False. If True, overwrites all prior computations.
- multiview_step : boolean, default=True. If True, performs the joint SVD step on the results from individual views. Must be set to True in the final call.

Returns:

- self : returns an instance of self.

transform(Xs, view_idx=None)

Embeds data matrices using the fitted projection matrices. May be used for out-of-sample embeddings.

Parameters:

- Xs : list of array-likes or numpy.ndarray. Xs length: n_views; Xs[i] shape: (n_samples, n_features_i). A list of data matrices from each view to transform based on the prior fit function. If view_idx is defined, then Xs is a 2D data matrix corresponding to a single view.
- view_idx : int, default=None. For transformation of a single view. If not None, then Xs is 2D and view_idx specifies the index of the view from which Xs comes.

Returns:

- Xs_transformed : list of array-likes or array-like. Same shape as Xs.
fit_transform(Xs, y=None)

Fit an embedder to the data and transform the data

Parameters:

- Xs : list of array-likes or numpy.ndarray. Xs length: n_views; Xs[i] shape: (n_samples, n_features_i).
- y : array, shape (n_samples,), optional. Targets to be used if fitting the algorithm is supervised.

Returns:

- X_transformed : list of array-likes. X_transformed length: n_views; X_transformed[i] shape: (n_samples, n_components_i).

## Kernel Canonical Correlation Analysis

class mvlearn.embed.KCCA(n_components=2, ktype='linear', constant=0.1, sigma=1.0, degree=2.0, reg=0.1, decomp='full', method='kettenring-like', mrank=2, precision=1e-06)

Kernel canonical correlation analysis (KCCA) generalizes classical linear canonical correlation analysis (CCA) to the nonlinear setting. It can capture nonlinear relations between two sets of variables and enables applications of classical multivariate data analysis that were originally constrained to linear relations (CCA).

If the linear kernel is used, this is equivalent to CCA.

Parameters:

- n_components : int, default=2. Number of canonical dimensions to keep.
- ktype : string, default='linear'. Type of kernel; can be 'linear', 'gaussian' or 'poly'. If 'linear', KCCA is equivalent to CCA.
- constant : float, default=0.1. Balances impact of lower-degree terms in Polynomial kernel.
- sigma : float, default=1.0. Standard deviation of Gaussian kernel.
- degree : float, default=2.0. Degree of Polynomial kernel.
- reg : float, default=0.1. Regularization parameter.
- decomp : string, default='full'. Decomposition type; can be 'full' or 'icd'. Incomplete Cholesky Decomposition (ICD) can reduce computation times and storage.
- method : string, default='kettenring-like'. Decomposition method; can only be 'kettenring-like'.
- mrank : int, default=2. The rank of the ICD approximated kernel matrix.
- precision : float, default=1e-06. Precision of computing the ICD kernel matrix.

Attributes

 weights_ (list of array-likes) Canonical weights for each view.

Notes

This class implements kernel canonical correlation analysis as described in [2] and [3].

Traditional CCA aims to find useful projections of the high-dimensional variable sets onto compact linear representations, the canonical components (components_).

Each resulting canonical variate is computed from the weighted sum of every original variable indicated by the canonical weights (weights_).

The canonical correlation quantifies the linear correspondence between the two views of data based on Pearson’s correlation between their canonical components.

Canonical correlation can be seen as a metric of successful joint information reduction between two views and, therefore, routinely serves as a performance measure for CCA.

CCA may not extract useful descriptors of the data because of its linearity. KCCA offers an alternative solution by first projecting the data onto a higher dimensional feature space

$\phi: \mathbf{x} = (x_1,...,x_m) \mapsto \phi(\mathbf{x}) = (\phi_1(\mathbf{x}),...,\phi_N(\mathbf{x})), (m < N)$

before performing CCA in the new feature space.

Kernels implicitly map data into a higher dimensional feature space, a method known as the kernel trick. A kernel is a function K such that for all $$\mathbf{x}, \mathbf{z} \in X$$,

$K(\mathbf{x}, \mathbf{z}) = \langle\phi(\mathbf{x}) \cdot \phi(\mathbf{z})\rangle,$

where $$\phi$$ is a mapping from X to feature space F.
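To make the kernel trick concrete, the sketch below compares an explicit feature map with the corresponding kernel for the homogeneous degree-2 polynomial kernel. The function names are chosen for this example only, and the kernel form is a textbook one; it is not necessarily the exact 'poly' kernel used by KCCA, which also has a constant term.

```python
import numpy as np

def quadratic_features(x):
    """Explicit feature map phi: all pairwise products x_i * x_j (N = m**2)."""
    return np.outer(x, x).ravel()

def quadratic_kernel(x, z):
    """K(x, z) = (x . z)**2, computed without ever forming phi."""
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
explicit = float(quadratic_features(x) @ quadratic_features(z))  # inner product in feature space
implicit = quadratic_kernel(x, z)                                # same value via the kernel
```

Both routes give the same number, but the kernel only needs an inner product in the original 3-dimensional space rather than in the 9-dimensional feature space.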

The directions $$\mathbf{w_x}$$ and $$\mathbf{w_y}$$ (of length N) can be rewritten as the projection of the data onto the direction $$\alpha$$ and $$\beta$$ (of length m):

$\mathbf{w_x} = X'\alpha$
$\mathbf{w_y} = Y'\beta$

Letting $$K_x = XX'$$ and $$K_y = YY'$$ be the kernel matrices and adding a regularization term ($$\kappa$$) to prevent overfitting, we are effectively solving for:

$\rho = \underset{\alpha,\beta}{\text{max}} \frac{\alpha'K_xK_y\beta} {\sqrt{(\alpha'K_x^2\alpha+\kappa\alpha'K_x\alpha) \cdot (\beta'K_y^2\beta + \kappa\beta'K_y\beta)}}$
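The regularized objective can be evaluated directly for any candidate pair of directions. The sketch below does so with linear kernel matrices; the function name `kcca_objective` and the random data are assumptions for illustration, and no optimization over $$\alpha$$ and $$\beta$$ is performed.

```python
import numpy as np

def kcca_objective(Kx, Ky, alpha, beta, kappa):
    """Value of the regularized KCCA objective for given alpha and beta."""
    numerator = alpha @ Kx @ Ky @ beta
    denom_x = alpha @ Kx @ Kx @ alpha + kappa * (alpha @ Kx @ alpha)
    denom_y = beta @ Ky @ Ky @ beta + kappa * (beta @ Ky @ beta)
    return numerator / np.sqrt(denom_x * denom_y)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
Y = X[:, :3] + 0.3 * rng.normal(size=(30, 3))
Kx, Ky = X @ X.T, Y @ Y.T                      # linear kernel matrices
alpha = rng.normal(size=30)
beta = rng.normal(size=30)
rho_unreg = kcca_objective(Kx, Ky, alpha, beta, kappa=0.0)
rho_reg = kcca_objective(Kx, Ky, alpha, beta, kappa=0.1)
```

With `kappa=0.0` the value is the cosine between `Kx @ alpha` and `Ky @ beta`; a positive `kappa` enlarges the denominator and shrinks the objective in magnitude, which penalizes large-norm directions.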

Kernel matrices grow quadratically with the number of samples. Not only do $$n^2$$ elements have to be stored, but the cost of the matrix eigenvalue problem must also be faced. In a Cholesky decomposition, a positive definite matrix $$A$$ is decomposed into a lower triangular matrix $$L$$ such that $$A = LL'$$.

The Incomplete Cholesky Decomposition (ICD) looks for a low rank approximation of $$L$$ to reduce the cost of operations of the matrix such that $$A \approx \tilde{L}\tilde{L}'$$. The algorithm skips a column if its diagonal element is small. The diagonal elements to the right of the column being updated are also updated. To select a column to update, it finds the largest diagonal element and pivots the element to the current diagonal by exchanging the corresponding rows and columns. The algorithm ends when all diagonal elements are below a specified accuracy.

ICD with rank $$m$$ yields storage requirements of $$O(mn)$$ instead of $$O(n^2)$$, and the computational cost becomes $$O(nm^2)$$ instead of $$O(n^3)$$ [4]. Unlike full decomposition, ICD cannot be performed out of sample, i.e., you must fit and transform on the same data.
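A minimal sketch of a pivoted incomplete Cholesky factorization is given below. It tracks the pivot by index instead of physically exchanging rows and columns, so the returned factor is not triangular in the original ordering but still satisfies the low-rank approximation. The function name, the `tol` default, and the test matrix are assumptions for this example and are not taken from mvlearn.

```python
import numpy as np

def incomplete_cholesky(K, max_rank, tol=1e-6):
    """Pivoted incomplete Cholesky: returns L with K approximately L @ L.T."""
    n = K.shape[0]
    d = np.diag(K).astype(float)        # residual diagonal (astype makes a copy)
    L = np.zeros((n, max_rank))
    for j in range(max_rank):
        p = int(np.argmax(d))           # pivot: largest remaining diagonal element
        if d[p] <= tol:                 # all residual diagonals are small: stop
            return L[:, :j]
        L[:, j] = (K[:, p] - L[:, :j] @ L[p, :j]) / np.sqrt(d[p])
        d -= L[:, j] ** 2               # update the residual diagonal
    return L

# A rank-3 linear kernel matrix on 50 samples
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
K = X @ X.T
L = incomplete_cholesky(K, max_rank=10)
```

Here the factor has only 3 columns, so it stores 50 x 3 numbers instead of the full 50 x 50 kernel matrix.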

References

 [2] D. R. Hardoon, S. Szedmak and J. Shawe-Taylor, "Canonical Correlation Analysis: An Overview with Application to Learning Methods", Neural Computation, Volume 16 (12), Pages 2639--2664, 2004.
 [3] J. R. Kettenring, "Canonical analysis of several sets of variables," Biometrika, vol. 58, no. 3, pp. 433–451, 1971.
 [4] M. I. Jordan, "Regularizing KCCA, Cholesky Decomposition", Lecture 9 Notes: CS281B/Stat241B, University of California, Berkeley.

Examples

>>> import numpy as np
>>> from mvlearn.embed.kcca import KCCA
>>> np.random.seed(1)
>>> # Define two latent variables
>>> N = 100
>>> latvar1 = np.random.randn(N, )
>>> latvar2 = np.random.randn(N, )
>>> # Define independent components for each dataset
>>> indep1 = np.random.randn(N, 3)
>>> indep2 = np.random.randn(N, 4)
>>> x = 0.25*indep1 + 0.75*np.vstack((latvar1, latvar2, latvar1)).T
>>> y = 0.25*indep2 + 0.75*np.vstack((latvar1, latvar2,
...                                   latvar1, latvar2)).T
>>> Xs = [x, y]
>>> Xs_train = [Xs[0][:80], Xs[1][:80]]
>>> Xs_test = [Xs[0][80:], Xs[1][80:]]
>>> kcca = KCCA(ktype ="linear", n_components = 3,  reg = 0.01)
>>> kcca.fit(Xs_train)
>>> linear_transform = kcca.transform(Xs_test)
>>> stats = kcca.get_stats()
>>> # Print the correlations of first 3 transformed variates
>>> # from the testing data
>>> print(stats['r'])
[0.85363047 0.91171037 0.06029391]

fit(Xs, y=None)

Creates the KCCA mapping by determining canonical weights from Xs.

Parameters:

- Xs : list of array-likes or numpy.ndarray. Xs length: n_views; Xs[i] shape: (n_samples, n_features_i). The data for kcca to fit to. Each sample will receive its own embedding.
- y : ignored. Included for API compliance.

Returns:

- self : returns an instance of self.

transform(Xs)

Uses KCCA weights to transform Xs into canonical components and calculates correlations.

Parameters:

- Xs : list of array-likes or numpy.ndarray. Xs length: 2; Xs[i] shape: (n_samples, n_features_i). The data to transform. Each sample will receive its own embedding.

Returns:

- components_ : returns Xs_transformed, a list of numpy.ndarray. Xs length: 2; Xs[i] shape: (n_samples, n_samples).

get_stats()

Compute relevant statistics for the KCCA model after fitting and transforming.

Implementations of the statistics generally follow the code in the Matlab implementation of the function canoncorr.

Note: most statistics are only available if the linear kernel is used and the decomposition method is full, i.e. self.ktype == 'linear' and self.decomp == 'full'.

Returns:

- stats : dict. Dict containing the statistics, with the following keys:
  - 'r' : numpy.ndarray of shape (n_components,). Canonical correlations of each component.
  - 'Wilks' : numpy.ndarray of shape (n_components,). Wilks' Lambda likelihood ratio statistic. Only available if self.ktype == 'linear'.
  - 'df1' : numpy.ndarray of shape (n_components,). Degrees of freedom for the chi-squared statistic, and the numerator degrees of freedom for the F statistic. Only available if self.ktype == 'linear'.
  - 'df2' : numpy.ndarray of shape (n_components,). Denominator degrees of freedom for the F statistic. Only available if self.ktype == 'linear'.
  - 'F' : numpy.ndarray of shape (n_components,). Rao's approximate F statistic for H_0(k). Only available if self.ktype == 'linear'.
  - 'pF' : numpy.ndarray of shape (n_components,). Right-tail significance level for stats['F']. Only available if self.ktype == 'linear'.
  - 'chisq' : numpy.ndarray of shape (n_components,). Bartlett's approximate chi-squared statistic for H_0(k) with Lawley's modification. Only available if self.ktype == 'linear'.
  - 'pChisq' : numpy.ndarray of shape (n_components,). Right-tail significance level for stats['chisq']. Only available if self.ktype == 'linear'.
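As an illustration of how one of these statistics relates to the canonical correlations, the sketch below computes Wilks' Lambda using its standard definition, the product of $$1 - r_i^2$$ over the components not yet accounted for under H_0(k). The function name `wilks_lambda` is hypothetical, and the code has not been checked against the mvlearn or Matlab implementations.

```python
import numpy as np

def wilks_lambda(r):
    """Wilks' Lambda for H_0(k), k = 0..d-1: product of (1 - r_i**2) for i >= k."""
    r = np.asarray(r, dtype=float)
    one_minus_r2 = 1.0 - r ** 2
    # Reverse cumulative product so that entry k multiplies components k..d-1
    return np.cumprod(one_minus_r2[::-1])[::-1]

lam = wilks_lambda([0.5, 0.0])
```

Values near 1 indicate that the remaining components carry little correlation, while values near 0 indicate strong remaining correlation.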
fit_transform(Xs, y=None)

Fit an embedder to the data and transform the data

Parameters:

- Xs : list of array-likes or numpy.ndarray. Xs length: n_views; Xs[i] shape: (n_samples, n_features_i).
- y : array, shape (n_samples,), optional. Targets to be used if fitting the algorithm is supervised.

Returns:

- X_transformed : list of array-likes. X_transformed length: n_views; X_transformed[i] shape: (n_samples, n_components_i).

## Deep Canonical Correlation Analysis

class mvlearn.embed.DCCA(input_size1=None, input_size2=None, n_components=2, layer_sizes1=None, layer_sizes2=None, use_all_singular_values=False, device=device(type='cpu'), epoch_num=200, batch_size=800, learning_rate=0.001, reg_par=1e-05, tolerance=0.001, print_train_log_info=False)

An implementation of Deep Canonical Correlation Analysis [5] with PyTorch. It computes projections into a common subspace in order to maximize the correlation between pairwise projections into the subspace from two views of data. To obtain these projections, two fully connected deep networks are trained to initially transform the two views of data. Then, the transformed data is projected using linear CCA. This can be thought of as training a kernel for each view that initially acts on the data before projection. The networks are trained to maximize the ability of the linear CCA to maximize the correlation between the final dimensions.

Parameters:

- input_size1 : int (positive). The dimensionality of the input vectors in view 1.
- input_size2 : int (positive). The dimensionality of the input vectors in view 2.
- n_components : int (positive), default=2. The output dimensionality of the correlated projections. The deep network will transform the data to this size. Must satisfy: n_components <= max(layer_sizes1[-1], layer_sizes2[-1]).
- layer_sizes1 : list of ints, default=None. The sizes of the layers of the deep network applied to view 1 before CCA. For example, if the input dimensionality is 256, and there is one hidden layer with 1024 units and the output dimensionality is 100 before applying CCA, layer_sizes1=[1024, 100]. If None, set to [1000, self.n_components_].
- layer_sizes2 : list of ints, default=None. The sizes of the layers of the deep network applied to view 2 before CCA. Does not need to have the same hidden layer architecture as layer_sizes1, but the final dimensionality must be the same. If None, set to [1000, self.n_components_].
- use_all_singular_values : boolean, default=False. Whether or not to use all the singular values in the CCA computation to calculate the loss. If False, only the top n_components singular values are used.
- device : string, default='cpu'. The torch device for processing. Can be used with a GPU if available.
- epoch_num : int (positive), default=200. The max number of epochs to train the deep networks.
- batch_size : int (positive), default=800. Batch size for training the deep networks.
- learning_rate : float (positive), default=1e-3. Learning rate for training the deep networks.
- reg_par : float (positive), default=1e-5. Weight decay parameter used in the RMSprop optimizer.
- tolerance : float (positive), default=1e-3. Threshold difference between successive iteration losses to define convergence and stop training.
- print_train_log_info : boolean, default=False. If True, the training loss at each epoch will be printed to the console when DCCA.fit() is called.

Attributes

- input_size1_ : int (positive). The dimensionality of the input vectors in view 1.
- input_size2_ : int (positive). The dimensionality of the input vectors in view 2.
- n_components_ : int (positive). The output dimensionality of the correlated projections. The deep network will transform the data to this size. If not specified, will be set to 2.
- layer_sizes1_ : list of ints. The sizes of the layers of the deep network applied to view 1 before CCA. For example, if the input dimensionality is 256, and there is one hidden layer with 1024 units and the output dimensionality is 100 before applying CCA, layer_sizes1=[1024, 100].
- layer_sizes2_ : list of ints. The sizes of the layers of the deep network applied to view 2 before CCA. Does not need to have the same hidden layer architecture as layer_sizes1, but the final dimensionality must be the same.
- device_ : string. The torch device for processing.
- batch_size_ : int (positive). Batch size for training the deep networks.
- learning_rate_ : float (positive). Learning rate for training the deep networks.
- reg_par_ : float (positive). Weight decay parameter used in the RMSprop optimizer.
- deep_model_ : DeepPairedNetworks object. 2 view Deep CCA object used to transform 2 views of data together.
- linear_cca_ : linear_cca object. Linear CCA object used to project final transformations from output of deep_model to the n_components.
- model_ : torch.nn.DataParallel object. Wrapper around deep_model to allow parallelisation.
- loss_ : cca_loss object. Loss function for deep_model. Defined as the negative correlation between outputs of transformed views.
- optimizer_ : torch.optim.RMSprop object. Optimizer used to train the networks.

Warns: In order to run DCCA, pytorch and certain other optional dependencies must be installed. See the installation page for details.

Notes

Deep Canonical Correlation Analysis is a method of finding highly correlated subspaces for 2 views of data using nonlinear transformations learned by deep networks. It can be thought of as using deep networks to learn the best potentially nonlinear kernels for a variant of kernel CCA.

The networks used for each view in DCCA consist of fully connected linear layers with a sigmoid activation function.

The DCCA problem is formulated as in [5]. Consider two views $$X_1$$ and $$X_2$$. DCCA seeks to find the parameters for each view, $$\Theta_1$$ and $$\Theta_2$$, such that they maximize

$\text{corr}\left(f_1\left(X_1;\Theta_1\right), f_2\left(X_2;\Theta_2\right)\right)$

These parameters are estimated in the deep network by following gradient descent on the input data. Let $$H_1, H_2 \in R^{o \times m}$$ be the outputs of the deep networks, with one column for each of the $$m$$ input samples. Take the centered matrices $$\bar{H}_1 = H_1-\frac{1}{m}H_1\mathbf{1}$$ and $$\bar{H}_2 = H_2-\frac{1}{m}H_2\mathbf{1}$$, where $$\mathbf{1}$$ is the $$m \times m$$ matrix of ones. Then, define

$\begin{aligned} \hat{\Sigma}_{12} &= \frac{1}{m-1}\bar{H}_1\bar{H}_2^T \\ \hat{\Sigma}_{11} &= \frac{1}{m-1}\bar{H}_1\bar{H}_1^T + r_1I \\ \hat{\Sigma}_{22} &= \frac{1}{m-1}\bar{H}_2\bar{H}_2^T + r_2I \end{aligned}$

Where $$r_1$$ and $$r_2$$ are regularization constants $$>0$$ so the matrices are guaranteed to be positive definite.

The correlation objective function is the sum of the top $$k$$ singular values of the matrix $$T$$, where

$T = \hat{\Sigma}_{11}^{-1/2}\hat{\Sigma}_{12}\hat{\Sigma}_{22}^{-1/2}$

When all singular values are used ($$k = o$$), this sum is the trace norm of $$T$$. Thus, the loss is

$L(X_1, X_2) = -\text{corr}\left(H_1, H_2\right) = -\text{tr}(T^TT)^{1/2}.$
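The correlation objective can be computed from network outputs with a few lines of NumPy. The sketch below follows the definitions above; the function names, the default regularization constants, and the random stand-in for network outputs are assumptions for this example. mvlearn's own loss is implemented in PyTorch and is not reproduced here.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

def dcca_correlation(H1, H2, r1=1e-6, r2=1e-6, k=None):
    """Sum of the top-k singular values of T; H1 and H2 have shape (o, m)."""
    m = H1.shape[1]
    H1c = H1 - H1.mean(axis=1, keepdims=True)    # center over the m samples
    H2c = H2 - H2.mean(axis=1, keepdims=True)
    S12 = H1c @ H2c.T / (m - 1)
    S11 = H1c @ H1c.T / (m - 1) + r1 * np.eye(H1.shape[0])
    S22 = H2c @ H2c.T / (m - 1) + r2 * np.eye(H2.shape[0])
    T = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
    singular_values = np.linalg.svd(T, compute_uv=False)
    return singular_values[:k].sum()             # k=None uses all singular values

# Stand-in for network outputs: o = 2 output units, m = 500 samples
rng = np.random.default_rng(0)
H = rng.normal(size=(2, 500))
corr_all = dcca_correlation(H, H)        # identical views: close to o = 2
corr_top1 = dcca_correlation(H, H, k=1)  # only the largest singular value
loss = -corr_all
```

Passing `k` mirrors the choice between using all singular values and only the top n_components of them.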

References

 [5] Andrew, G., Arora, R., Bilmes, J., & Livescu, K. (2013, February). Deep canonical correlation analysis. In International conference on machine learning (pp. 1247-1255).

Examples

>>> from mvlearn.embed import DCCA
>>> import numpy as np
>>> # Exponential data as example of finding good correlation
>>> view1 = np.random.normal(loc=2, size=(1000, 75))
>>> view2 = np.exp(view1)
>>> view1_test = np.random.normal(loc=2, size=(200, 75))
>>> view2_test = np.exp(view1_test)
>>> input_size1, input_size2 = 75, 75
>>> n_components = 2
>>> layer_sizes1 = [1024, 4]
>>> layer_sizes2 = [1024, 4]
>>> dcca = DCCA(input_size1, input_size2, n_components, layer_sizes1,
...             layer_sizes2)
>>> dcca = dcca.fit([view1, view2])
>>> outputs = dcca.transform([view1_test, view2_test])
>>> print(outputs[0].shape)
(200, 2)

fit(Xs, y=None)

Fits the deep networks for each view such that the output of the linear CCA has maximum correlation.

Parameters:

- Xs : list of array-likes or numpy.ndarray. Xs length: n_views; Xs[i] shape: (n_samples, n_features_i). The data to fit to. Each view will receive its own embedding.
- y : ignored. Included for API compliance.

Returns:

- self : returns an instance of self.

transform(Xs, return_loss=False)

Embeds data matrices using the trained deep networks and fitted CCA projection matrices. May be used for out-of-sample embeddings.

Parameters:

- Xs : list of array-likes or numpy.ndarray. Xs length: n_views; Xs[i] shape: (n_samples, n_features_i). A list of data matrices from each view to transform based on the prior fit function. If view_idx defined, then Xs is a 2D data matrix corresponding to a single view.

Returns:

- Xs_transformed : list of array-likes or array-like. Transformed samples. Same structure as Xs, but potentially different n_features_i.
- loss : float. Average loss over data, defined as negative correlation of transformed views. Only returned if return_loss=True.

fit_transform(Xs, y=None)

Fit an embedder to the data and transform the data

Parameters:

- Xs : list of array-likes or numpy.ndarray. Xs length: n_views; Xs[i] shape: (n_samples, n_features_i).
- y : array, shape (n_samples,), optional. Targets to be used if fitting the algorithm is supervised.

Returns:

- X_transformed : list of array-likes. X_transformed length: n_views; X_transformed[i] shape: (n_samples, n_components_i).

## Omnibus Embedding

class mvlearn.embed.Omnibus(n_components=2, distance_metric='euclidean', normalize='l1', algorithm='randomized', n_iter=5)

Omnibus computes the pairwise distances for each view. Each of these matrices is an n x n dissimilarity matrix, where n is the number of rows in each view. Omnibus embedding [6] is then performed over the dissimilarity matrices and the computed embeddings are returned.

Parameters:

- n_components : strictly positive int, default=2. Desired dimensionality of output embeddings. See graspy docs for additional details.
- distance_metric : string, default='euclidean'. Distance metric used to compute pairwise distances. Metrics must be found in sklearn.neighbors.DistanceMetric.
- normalize : string or None, default='l1'. Normalize function to use on views before computing pairwise distances. Must be 'l2', 'l1', 'max' or None. If None, the distance matrices will not be normalized.
- algorithm : string, default='randomized'. SVD solver to use. Must be 'full', 'randomized', or 'truncated'. See graspy docs for details.
- n_iter : positive int, default=5. Number of iterations for randomized SVD solver. See graspy docs for details.

Attributes

 embeddings_: list of arrays (default = None) List of Omnibus embeddings. One embedding matrix is provided per view. If fit() has not been called, embeddings_ is set to None.

Notes

From an implementation perspective, omnibus embedding is performed using the GrasPy package's implementation graspy.embed.OmnibusEmbed for dissimilarity matrices.
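For intuition, the omnibus construction can be sketched in plain NumPy: block (i, j) of the omnibus matrix is the average of dissimilarity matrices i and j, and a single SVD of that matrix yields one embedding per view. The helper names and the random views below are assumptions for this example; normalization and the graspy solver options are not reproduced.

```python
import numpy as np

def pairwise_euclidean(X):
    """Dense (n x n) matrix of Euclidean distances between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def omnibus_embed(Ds, n_components):
    """Embed m (n x n) dissimilarity matrices through one joint SVD."""
    m, n = len(Ds), Ds[0].shape[0]
    M = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(m):
            # Block (i, j) is the average of matrices i and j
            M[i * n:(i + 1) * n, j * n:(j + 1) * n] = (Ds[i] + Ds[j]) / 2.0
    U, s, _ = np.linalg.svd(M)
    Z = U[:, :n_components] * np.sqrt(s[:n_components])
    # Rows i*n .. (i+1)*n of Z are the embedding of view i
    return [Z[i * n:(i + 1) * n] for i in range(m)], M

rng = np.random.default_rng(0)
view1 = rng.random((20, 5))
view2 = rng.random((20, 8))
Ds = [pairwise_euclidean(view1), pairwise_euclidean(view2)]
embeddings, M = omnibus_embed(Ds, n_components=3)
```

Because all views are embedded from the same matrix, the resulting per-view embeddings live in a shared space and can be compared directly.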

References

Examples

>>> from mvlearn.embed import omnibus
>>> import numpy as np
>>> # Create 2 random data views with feature sizes 50 and 100
>>> view1 = np.random.rand(1000, 50)
>>> view2 = np.random.rand(1000, 100)
>>> embedder = omnibus.Omnibus(n_components=3)
>>> embeddings = embedder.fit_transform([view1, view2])
>>> view1_hat, view2_hat = embeddings
>>> print(view1_hat.shape, view2_hat.shape)
(1000, 3) (1000, 3)

fit(Xs, y=None)

Fit the model with Xs and apply the embedding on Xs. The embeddings are saved as a class attribute.

Parameters:

- Xs : list of array-likes or numpy.ndarray. Xs length: n_views; Xs[i] shape: (n_samples, n_features_i). The data to embed based on the prior fit function. Each X in Xs will receive its own embedding.
- y : ignored. Included for API compliance.

fit_transform(Xs, y=None)

Fit the model with Xs and apply the embedding on Xs using the fit() function. The resulting embeddings are returned.

Parameters:

- Xs : list of array-likes or numpy.ndarray. Xs length: n_views; Xs[i] shape: (n_samples, n_features_i). The data to embed based on the prior fit function. Each X in Xs will receive its own embedding.
- y : ignored. Included for API compliance.

Returns:

- embeddings : list of arrays. List of (n_samples, n_components) matrices for each X in Xs.

## Multiview Multidimensional Scaling

class mvlearn.embed.MVMDS(n_components=2, num_iter=15, dissimilarity='euclidean')

An implementation of Classical Multiview Multidimensional Scaling for jointly reducing the dimensions of multiple views of data [7]. A Euclidean distance matrix is created for each view, double centered, and the k largest common eigenvectors between the matrices are found based on the stepwise estimation of common principal components. Using these common principal components, the views are jointly reduced and a single view of k-dimensions is returned.

MVMDS is often a better alternative to PCA for multi-view data. See the tutorials in the documentation.

Parameters:

- n_components : int (positive), default=2. Represents the number of components that the user would like to be returned from the algorithm. This value must be greater than 0 and less than the number of samples within each view.
- num_iter : int (positive), default=15. Number of iterations stepwise estimation goes through.
- dissimilarity : {'euclidean', 'precomputed'}, default='euclidean'. Dissimilarity measure to use. 'euclidean': pairwise Euclidean distances between points in the dataset. 'precomputed': Xs is treated as pre-computed dissimilarity matrices.

Attributes

 components_: numpy.ndarray, shape(n_samples, n_components) Joint transformed MVMDS components of the input views.

Notes

Classical Multiview Multidimensional Scaling can be broken down into two steps. The first step involves calculating the Euclidean Distance matrices, $$Z_i$$, for each of the $$k$$ views and double-centering these matrices through the following calculations:

$\Sigma_{i}=-\frac{1}{2}J_iZ_iJ_i$
$\text{where }J_i=I_i-{\frac {1}{n}}\mathbb{1}\mathbb{1}^T$
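The double-centering step can be written directly in NumPy. The sketch below assumes that the matrix being centered holds squared Euclidean distances, which is the classical MDS convention; under that assumption the result equals the Gram matrix of the centered data. The function name and toy data are chosen for this example only.

```python
import numpy as np

def double_center(Z):
    """Apply -1/2 * J Z J with J = I - (1/n) * ones * ones^T."""
    n = Z.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    return -0.5 * J @ Z @ J

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))
# Squared Euclidean distances between all pairs of rows (assumed convention)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
B = double_center(sq_dists)
Xc = X - X.mean(axis=0)
```

Here `B` matches `Xc @ Xc.T`, which is why the eigenvectors of the double-centered matrices play the role of principal components.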

The second step involves finding the common principal components of the $$\Sigma$$ matrices. These can be thought of as multiview generalizations of the principal components found in principal component analysis (PCA) given several covariance matrices. The central hypothesis of the common principal component model states that given k normal populations (views), their $$p$$ x $$p$$ covariance matrices $$\Sigma_{i}$$, for $$i = 1,2,...,k$$ are simultaneously diagonalizable as:

$\Sigma_{i} = QD_i^2Q^T$

where $$Q$$ is the common $$p$$ x $$p$$ orthogonal matrix and $$D_i^2$$ are positive $$p$$ x $$p$$ diagonal matrices. The $$Q$$ matrix contains all the common principal components. The common principal component, $$q_j$$, is found by solving the minimization problem:

$\text{Minimize} \sum_{i=1}^{k}n_i\log(q_j^TS_iq_j)$
$\text{Subject to } q_j^Tq_j = 1$

where $$n_i$$ represent the degrees of freedom and $$S_i$$ represent sample covariance matrices.

This class does not support MVMDS.transform() due to the iterative nature of the algorithm and the fact that the transformation is done during iterative fitting. Use MVMDS.fit_transform() to do both fitting and transforming at once.

References

 [7] Trendafilov, Nickolay T. “Stepwise Estimation of Common Principal Components.” Computational Statistics & Data Analysis, vol. 54, no. 12, 2010, pp. 3446–3457., doi:10.1016/j.csda.2010.03.010.
 [8] Samir Kanaan-Izquierdo, Andrey Ziyatdinov, Maria Araceli Burgueño, Alexandre Perera-Lluna, Multiview: a software package for multiview pattern recognition methods, Bioinformatics, Volume 35, Issue 16, 15 August 2019, Pages 2877–2879

Examples

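>>> from mvlearn.datasets import load_UCImultifeature
>>> from mvlearn.embed import MVMDS
>>> # Assumed setup: the UCI multiple features dataset, labels not needed
>>> Xs, _ = load_UCImultifeature()
>>> print(len(Xs)) # number of views
6
>>> print(Xs[0].shape) # (n_samples, n_features) of the first view
(2000, 76)
>>> mvmds = MVMDS(n_components=5)
>>> Xs_reduced = mvmds.fit_transform(Xs)
>>> print(Xs_reduced.shape)
(2000, 5)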

fit(Xs, y=None)

Calculates dimensionally reduced components by inputting the Euclidean distances of each view, double centering them, and using the _commonpcs function to find common components between views. Works similarly to traditional, single-view Multidimensional Scaling.

Parameters:

- Xs : list of array-likes or numpy.ndarray. Xs length: n_views; Xs[i] shape: (n_samples, n_features_i).
- y : ignored. Included for API compliance.

fit_transform(Xs, y=None)

Embeds data matrices using fitted projection matrices.

Parameters:

- Xs : list of array-likes or numpy.ndarray. Xs length: n_views; Xs[i] shape: (n_samples, n_features_i). The data to embed based on the fit function.
- y : ignored. Included for API compliance.

Returns:

- X_transformed : numpy.ndarray, shape (n_samples, n_components). Joint transformed MVMDS components of the input views.

## Split Autoencoder

class mvlearn.embed.SplitAE(hidden_size=64, num_hidden_layers=2, embed_size=20, training_epochs=10, batch_size=16, learning_rate=0.001, print_info=False, print_graph=True)

Implements an autoencoder that creates an embedding of a view View1 and from that embedding reconstructs View1 and another view View2, as described in [9].

Parameters:

- hidden_size : int, default=64. Number of nodes in the hidden layers.
- num_hidden_layers : int, default=2. Number of hidden layers in each encoder or decoder net.
- embed_size : int, default=20. Size of the bottleneck vector in the autoencoder.
- training_epochs : int, default=10. How many times the network trains on the full dataset.
- batch_size : int, default=16. Batch size while training the network.
- learning_rate : float, default=0.001. Learning rate of the Adam optimizer.
- print_info : bool, default=False. Whether or not to print errors as the network trains.
- print_graph : bool, default=True. Whether or not to graph training loss.

Attributes

 view1_encoder_ (torch.nn.Module) the View1 embedding network as a PyTorch module view1_decoder_ (torch.nn.Module) the View1 decoding network as a PyTorch module view2_decoder_ (torch.nn.Module) the View2 decoding network as a PyTorch module
Warns: In order to run SplitAE, pytorch and other certain optional dependencies must be installed. See the installation page for details.

Notes

In the notation below, $$\textbf{x}$$ is View1 and $$\textbf{y}$$ is View2.

Each encoder / decoder network is a fully connected neural net with parameter count equal to:

$\left(\text{input_size} + \text{embed_size}\right) \cdot \text{hidden_size} + \sum_{1}^{\text{num_hidden_layers}-1}\text{hidden_size}^2$

Where $$\text{input_size}$$ is the number of features in View1 or View2.
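As a quick worked instance, the formula above can be evaluated directly. The function name `splitae_param_count` is hypothetical, and the example input size of 100 features is an assumption; the other defaults mirror the SplitAE constructor.

```python
def splitae_param_count(input_size, hidden_size=64, num_hidden_layers=2,
                        embed_size=20):
    """Evaluate the parameter-count formula for one encoder or decoder net.

    The formula as written has no separate bias term, so none is added here.
    """
    # Input-to-hidden and hidden-to-embedding connections.
    outer = (input_size + embed_size) * hidden_size
    # One hidden_size x hidden_size block per hidden-to-hidden connection.
    inner = (num_hidden_layers - 1) * hidden_size ** 2
    return outer + inner


count = splitae_param_count(input_size=100)
print(count)  # (100 + 20) * 64 + 64**2 = 11776
```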

The loss that is reduced via gradient descent is:

$J = \left(p(f(\textbf{x})) - \textbf{x}\right)^2 + \left(q(f(\textbf{x})) - \textbf{y}\right)^2$

Where $$f$$ is the encoder, $$p$$ and $$q$$ are the decoders, $$\textbf{x}$$ is View1, and $$\textbf{y}$$ is View2.
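The loss can be written out numerically as follows. This is only a sketch: the callables `f`, `p`, and `q` stand in for the trained networks, and summing the squared errors over all entries (rather than averaging) is an assumption, since the formula does not state the reduction.

```python
import numpy as np


def splitae_loss(x, y, f, p, q):
    """Evaluate J for a batch; summing over all entries is an assumption."""
    z = f(x)  # shared embedding computed from View1 only
    view1_error = np.sum((p(z) - x) ** 2)  # reconstruction of View1
    view2_error = np.sum((q(z) - y) ** 2)  # prediction of View2
    return view1_error + view2_error


def identity(a):
    """Toy stand-in for an encoder or decoder."""
    return a


x = np.array([[1.0, 2.0], [3.0, 4.0]])
y = x + 1.0
loss = splitae_loss(x, y, f=identity, p=identity, q=identity)
print(loss)  # View1 is reproduced exactly; View2 is off by 1 in 4 entries -> 4.0
```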

References

 [9] Weiran Wang, Raman Arora, Karen Livescu, and Jeff Bilmes. "On Deep Multi-View Representation Learning." ICML, 2015.

For more extensive examples, see the tutorials for SplitAE in this documentation.

fit(Xs, validation_Xs=None, y=None)[source]

Given two views, create and train the autoencoder.

Parameters: Xs : list of array-likes or numpy.ndarray. Xs[0] is View1 and Xs[1] is View2 Xs length: n_views, only 2 is currently supported for SplitAE. Xs[i] shape: (n_samples, n_features_i) validation_Xs : list of array-likes or numpy.ndarray optional validation data with the same shape as Xs. If print_info=True, then validation error, calculated with this data, will be printed as the network trains. y : ignored Included for API compliance.
transform(Xs)[source]

Transform the given view with the trained autoencoder. Provide a single view within a list.

Parameters: Xs : a list of exactly one array-like, or an np.ndarray Represents the View1 of some data. The array must have the same number of columns (features) as the View1 presented in the fit(...) step. Xs length: 1 Xs[0] shape: (n_samples, n_features_0)
Returns: embedding : np.ndarray of shape (n_samples, embed_size) the embedding of the View1 data view1_reconstructions : np.ndarray of shape (n_samples, n_features_0) the reconstructed View1 view2_prediction : np.ndarray of shape (n_samples, n_features_1) the predicted View2
fit_transform(Xs, y=None)[source]

Calls fit(Xs) and then transform(Xs[:1]). Note that this method embeds the same data that the autoencoder was trained on.

Parameters: Xs : see fit(...) Xs parameters y : ignored Included for API compliance.
Returns: see transform(...) return values.

## DCCA Utilities¶

class mvlearn.embed.linear_cca[source]

Implementation of linear CCA to act on the output of the deep networks in DCCA.

Consider two views $$X_1$$ and $$X_2$$. Canonical Correlation Analysis seeks to find vectors $$a_1$$ and $$a_2$$ to maximize the correlation between $$X_1 a_1$$ and $$X_2 a_2$$.
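A minimal NumPy sketch of this objective: whiten each view's covariance, then take the SVD of the whitened cross-covariance. This illustrates the standard linear CCA solution and is not necessarily the code inside `linear_cca`; the helper names, the ridge term `reg`, and the synthetic data are assumptions.

```python
import numpy as np


def inv_sqrt(S):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T


def linear_cca_sketch(H1, H2, n_components, reg=1e-4):
    """Hypothetical helper: returns weights, means, canonical correlations."""
    H1 = np.asarray(H1, dtype=float)
    H2 = np.asarray(H2, dtype=float)
    n = H1.shape[0]
    m = [H1.mean(axis=0), H2.mean(axis=0)]
    H1c, H2c = H1 - m[0], H2 - m[1]
    # Regularized covariance estimates (reg is an assumed ridge term).
    S11 = H1c.T @ H1c / (n - 1) + reg * np.eye(H1.shape[1])
    S22 = H2c.T @ H2c / (n - 1) + reg * np.eye(H2.shape[1])
    S12 = H1c.T @ H2c / (n - 1)
    S11_is, S22_is = inv_sqrt(S11), inv_sqrt(S22)
    # Singular vectors of the whitened cross-covariance give the directions.
    U, s, Vt = np.linalg.svd(S11_is @ S12 @ S22_is)
    w = [S11_is @ U[:, :n_components], S22_is @ Vt.T[:, :n_components]]
    return w, m, s[:n_components]


rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))  # shared latent signal
H1 = z @ np.array([[1.0, -0.5, 0.3]]) + 0.1 * rng.normal(size=(500, 3))
H2 = z @ np.array([[0.8, 0.2, -0.6, 1.0]]) + 0.1 * rng.normal(size=(500, 4))

w, m, corrs = linear_cca_sketch(H1, H2, n_components=2)
proj1 = (H1 - m[0]) @ w[0]
proj2 = (H2 - m[1]) @ w[1]
top_corr = np.corrcoef(proj1[:, 0], proj2[:, 0])[0, 1]
```

The lists `w` and `m` play the same role as the `w_` and `m_` attributes: one projection matrix and one mean vector per view. Because both toy views are driven by the same latent signal, the first pair of projections is strongly correlated.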

Attributes

 w_ (list (length=2)) w[i] : nd-array List of the two weight matrices for projecting each view. m_ (list (length=2)) m[i] : nd-array List of the means of the data in each view.
fit(H1, H2, n_components)[source]

Fit the linear CCA model to the outputs of the deep network transformations on the two views of data.

Parameters: H1: nd-array, shape (n_samples, n_features) View 1 data after deep network. H2: nd-array, shape (n_samples, n_features) View 2 data after deep network. n_components : int (positive) The output dimensionality of the CCA transformation.
transform(H1, H2)[source]

Transform inputs based on already fit matrices.

Parameters: H1 : nd-array, shape (n_samples, n_features) View 1 data. H2 : nd-array, shape (n_samples, n_features) View 2 data.
Returns: results : list, length=2 Results of linear transformation on input data.
class mvlearn.embed.cca_loss(n_components, use_all_singular_values, device)[source]

An implementation of the loss function of linear CCA as introduced in the original paper for DCCA [5]. Details of how this loss is computed can be found in the paper or in the documentation for DCCA.

Parameters: n_components : int (positive) The output dimensionality of the CCA transformation. use_all_singular_values : boolean Whether or not to use all the singular values in the loss calculation. If False, only use the top n_components singular values. device : torch.device object The torch device being used in DCCA.

Attributes

 n_components_ (int (positive)) The output dimensionality of the CCA transformation. use_all_singular_values_ (boolean) Whether or not to use all the singular values in the loss calculation. If False, only use the top n_components singular values. device_ (torch.device object) The torch device being used in DCCA.
loss(H1, H2)[source]

Compute the loss (negative correlation) between 2 views. Details can be found in [5] or the documentation for DCCA.

Parameters: H1: torch.tensor, shape (n_samples, n_features) View 1 data. H2: torch.tensor, shape (n_samples, n_features) View 2 data.
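To make the role of use_all_singular_values concrete, here is a NumPy sketch of a negative-correlation loss. It is an assumption-laden illustration rather than the `cca_loss` implementation: it works on arrays instead of torch tensors, whitens with Cholesky factors, and uses a hypothetical ridge term `reg`.

```python
import numpy as np


def cca_loss_sketch(H1, H2, n_components, use_all_singular_values, reg=1e-4):
    """Negative sum of singular values of the whitened cross-covariance."""
    H1 = np.asarray(H1, dtype=float)
    H2 = np.asarray(H2, dtype=float)
    n = H1.shape[0]
    H1c = H1 - H1.mean(axis=0)
    H2c = H2 - H2.mean(axis=0)
    S11 = H1c.T @ H1c / (n - 1) + reg * np.eye(H1.shape[1])
    S22 = H2c.T @ H2c / (n - 1) + reg * np.eye(H2.shape[1])
    S12 = H1c.T @ H2c / (n - 1)
    # Whiten with Cholesky factors: T = L1^{-1} S12 L2^{-T}.
    L1 = np.linalg.cholesky(S11)
    L2 = np.linalg.cholesky(S22)
    T = np.linalg.solve(L1, S12) @ np.linalg.inv(L2).T
    s = np.linalg.svd(T, compute_uv=False)  # sorted in decreasing order
    if not use_all_singular_values:
        s = s[:n_components]  # keep only the top n_components values
    return -np.sum(s)


rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))  # shared latent signal
H1 = z @ np.array([[1.0, -0.5, 0.3]]) + 0.1 * rng.normal(size=(500, 3))
H2 = z @ np.array([[0.8, 0.2, -0.6, 1.0]]) + 0.1 * rng.normal(size=(500, 4))

loss_top1 = cca_loss_sketch(H1, H2, n_components=1, use_all_singular_values=False)
loss_all = cca_loss_sketch(H1, H2, n_components=1, use_all_singular_values=True)
```

Using all singular values can only make the loss more negative (or leave it unchanged), because every singular value is non-negative.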
class mvlearn.embed.MlpNet(layer_sizes, input_size)[source]

Multilayer perceptron implementation of a fully connected network. Used by DCCA for the transformation of a single view before linear CCA. Extends torch.nn.Module.

Parameters: layer_sizes : list of ints The sizes of the layers of the deep network applied to the view before CCA. For example, if the input dimensionality is 256, and there is one hidden layer with 1024 units and the output dimensionality is 100 before applying CCA, layer_sizes=[1024, 100]. input_size : int (positive) The dimensionality of the input vectors to the deep network.
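The relationship between input_size and layer_sizes can be spelled out by listing the (in_features, out_features) pair of each fully connected layer. The helper `mlp_layer_shapes` is hypothetical and only mirrors the example in the parameter description; how MlpNet builds its layers internally is not shown here.

```python
def mlp_layer_shapes(layer_sizes, input_size):
    """List (in_features, out_features) for each fully connected layer."""
    sizes = [input_size] + list(layer_sizes)
    # Each layer maps the previous width to the next one.
    return list(zip(sizes[:-1], sizes[1:]))


shapes = mlp_layer_shapes(layer_sizes=[1024, 100], input_size=256)
print(shapes)  # [(256, 1024), (1024, 100)]
```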

Attributes

 layers_ (torch.nn.ModuleList object) The layers in the network.
forward(x)[source]

Feed input forward through layers.

Parameters: x : torch.tensor Input tensor to transform by the network.
Returns: x : torch.tensor The output after being fed forward through the network.
bfloat16() → T

Casts all floating point parameters and buffers to bfloat16 datatype.

Returns:
Module: self
parameters(recurse: bool = True) → Iterator[torch.nn.parameter.Parameter]

Returns an iterator over module parameters.

This is typically passed to an optimizer.

Args:
recurse (bool): if True, then yields parameters of this module
and all submodules. Otherwise, yields only parameters that are direct members of this module.
Yields:
Parameter: module parameter

Example:

>>> for param in model.parameters():
...     print(type(param), param.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)

requires_grad_(requires_grad: bool = True) → T

Change if autograd should record operations on parameters in this module.

This method sets the parameters' requires_grad attributes in-place.

This method is helpful for freezing part of the module for finetuning or training parts of a model individually (e.g., GAN training).

Args:
requires_grad (bool): whether autograd should record operations on parameters in this module. Default: True.
Returns:
Module: self
class mvlearn.embed.DeepPairedNetworks(layer_sizes1, layer_sizes2, input_size1, input_size2, n_components, use_all_singular_values, device=device(type='cpu'))[source]

A pair of deep networks for operating on the two views of data. Consists of two MlpNet objects for transforming 2 views of data in DCCA. Extends torch.nn.Module.

Parameters: layer_sizes1 : list of ints The sizes of the layers of the deep network applied to view 1 before CCA. For example, if the input dimensionality is 256, and there is one hidden layer with 1024 units and the output dimensionality is 100 before applying CCA, layer_sizes1=[1024, 100]. layer_sizes2 : list of ints The sizes of the layers of the deep network applied to view 2 before CCA. Does not need to have the same hidden layer architecture as layer_sizes1, but the final dimensionality must be the same. input_size1 : int (positive) The dimensionality of the input vectors in view 1. input_size2 : int (positive) The dimensionality of the input vectors in view 2. n_components : int (positive), default=2 The output dimensionality of the correlated projections. The deep network will transform the data to this size. If not specified, will be set to 2. use_all_singular_values : boolean (default=False) Whether or not to use all the singular values in the CCA computation to calculate the loss. If False, only the top n_components singular values are used. device : string, default='cpu' The torch device for processing.

Attributes

 model1_ (MlpNet object) Deep network for view 1 transformation. model2_ (MlpNet object) Deep network for view 2 transformation. loss_ (cca_loss object) Loss function for the 2 view DCCA.
forward(x1, x2)[source]

Feed two views of data forward through the respective network.

Parameters: x1 : torch.tensor, shape=(batch_size, n_features) View 1 data to transform. x2 : torch.tensor, shape=(batch_size, n_features) View 2 data to transform.
Returns: outputs : list, length=2 outputs[i] : torch.tensor List of the outputs from each view transformation.
bfloat16() → T

Casts all floating point parameters and buffers to bfloat16 datatype.

Returns:
Module: self
parameters(recurse: bool = True) → Iterator[torch.nn.parameter.Parameter]

Returns an iterator over module parameters.

This is typically passed to an optimizer.

Args:
recurse (bool): if True, then yields parameters of this module
and all submodules. Otherwise, yields only parameters that are direct members of this module.
Yields:
Parameter: module parameter

Example:

>>> for param in model.parameters():
...     print(type(param), param.size())
<class 'torch.Tensor'> (20L,)
<class 'torch.Tensor'> (20L, 1L, 5L, 5L)

requires_grad_(requires_grad: bool = True) → T

Change if autograd should record operations on parameters in this module.

This method sets the parameters' requires_grad attributes in-place.

This method is helpful for freezing part of the module for finetuning or training parts of a model individually (e.g., GAN training).

Args:
requires_grad (bool): whether autograd should record operations on parameters in this module. Default: True.
Returns:
Module: self

## Dimension Selection¶

mvlearn.embed.select_dimension(X, n_components=None, n_elbows=2, threshold=None, return_likelihoods=False)[source]

Generates profile likelihoods from an array based on the Zhu and Ghodsi method [11]. Elbows correspond to the optimal embedding dimension.

Parameters: X : 1d or 2d array-like Input array to generate profile likelihoods for. If 1d-array, it should be sorted in decreasing order. If 2d-array, shape should be (n_samples, n_features). n_components : int, optional, default: None. Number of components to embed. If None, n_components = floor(log2(min(n_samples, n_features))). Ignored if X is 1d-array. n_elbows : int, optional, default: 2. Number of likelihood elbows to return. Must be > 1. threshold : float, int, optional, default: None If given, only consider the singular values that are > threshold. Must be >= 0. return_likelihoods : bool, optional, default: False If True, returns all likelihoods associated with each elbow.
Returns: elbows : list Elbows indicate subsequent optimal embedding dimensions. Number of elbows may be less than n_elbows if there are not enough singular values. sing_vals : list The singular values associated with each elbow. likelihoods : list of array-like Array of likelihoods corresponding to each elbow. Only returned if return_likelihoods is True.
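The idea behind the profile likelihood can be sketched in plain Python for a single elbow. This is a simplified illustration in the spirit of [11], not the `select_dimension` implementation: the function name `first_elbow`, the pooled-variance normalization by p - 2, and the toy singular values are assumptions.

```python
import math


def first_elbow(values):
    """Return the 1-based split index q with the highest profile log-likelihood.

    Simplified illustration: one elbow only, two Gaussian groups with
    separate means and a shared (pooled) variance.
    """
    d = sorted(values, reverse=True)
    p = len(d)
    if p < 3:
        raise ValueError("need at least three values")
    best_q, best_ll = 1, -math.inf
    for q in range(1, p):
        left, right = d[:q], d[q:]
        mu1 = sum(left) / len(left)
        mu2 = sum(right) / len(right)
        # Residual sum of squares around the two group means.
        ss = (sum((v - mu1) ** 2 for v in left)
              + sum((v - mu2) ** 2 for v in right))
        var = max(ss / (p - 2), 1e-12)  # guard against a zero variance
        # Gaussian log-likelihood with a shared variance for both groups.
        ll = -0.5 * p * math.log(2 * math.pi * var) - ss / (2 * var)
        if ll > best_ll:
            best_q, best_ll = q, ll
    return best_q


singular_values = [10.0, 9.5, 9.0, 1.0, 0.8, 0.5, 0.3]
elbow = first_elbow(singular_values)
print(elbow)  # 3: the large gap falls after the third value
```

Subsequent elbows, as requested through n_elbows, would be found by repeating the same search on the values that follow the previous elbow.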

References

 [10] Code from the https://github.com/neurodata/graspy package, reproduced and shared with permission.
 [11] Zhu, M. and Ghodsi, A. (2006). Automatic dimensionality selection from the scree plot via the use of profile likelihood. Computational Statistics & Data Analysis, 51(2), pp.918-930.