Semi-Supervised
Co-training Classifier

class mvlearn.semi_supervised.CTClassifier(estimator1=None, estimator2=None, p=None, n=None, unlabeled_pool_size=75, num_iter=50, random_state=None)[source]

This class implements the co-training classifier for supervised and semi-supervised learning, following the framework described in [1]. The best use case is when the two views of input data are sufficiently distinct and independent, as detailed in [1]. However, co-training can also succeed when a single matrix of input data is supplied as both views and two estimators of quite different types are chosen [2]. See the examples below.

In the semi-supervised case, performance can vary greatly, so using a separate validation set or a cross-validation procedure is recommended to ensure the classifier has fit well.
Parameters: estimator1 : classifier object (default=sklearn GaussianNB)
The classifier object which will be trained on view 1 of the data. This classifier should support the predict_proba() method so that classification probabilities can be computed and co-training can be performed effectively.
estimator2 : classifier object (default=sklearn GaussianNB)
The classifier object which will be trained on view 2 of the data. It does not need to be of the same type as estimator1, but it should support predict_proba().
p : int, optional (default=None)
The number of positive classifications from the unlabeled_pool training set which will be given a positive "label". If None, the default is the floor of the ratio of positive to negative examples in the labeled training data (at least 1). If only one of p or n is not None, the other is set to the same value. When the labels are 0 or 1, positive is defined as 1; in general, positive is the larger label.
n : int, optional (default=None)
The number of negative classifications from the unlabeled_pool training set which will be given a negative "label". If None, the default is the floor of the ratio of negative to positive examples in the labeled training data (at least 1). If only one of p or n is not None, the other is set to the same value. When the labels are 0 or 1, negative is defined as 0; in general, negative is the smaller label.
unlabeled_pool_size : int, optional (default=75)
The number of unlabeled samples which will be kept in a separate pool for classification and selection by the updated classifiers at each training iteration.
num_iter : int, optional (default=50)
The maximum number of training iterations to run.
random_state : int (default=None)
The starting random seed for fit() and class operations, passed to numpy.random.seed().
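The defaulting rule for p and n can be sketched as follows. The helper below is purely illustrative (not part of the mvlearn API) and simply restates the documented rule under the reading that p follows the positive-to-negative ratio and n the negative-to-positive ratio:

```python
def default_p_n(y_labeled):
    """Illustrative sketch of the documented default: p is the floor of
    the positive-to-negative ratio in the labeled data, n the floor of
    the negative-to-positive ratio, each clipped to at least 1.
    (Hypothetical helper, not part of mvlearn.)"""
    pos = sum(1 for v in y_labeled if v == 1)
    neg = sum(1 for v in y_labeled if v == 0)
    p = max(1, pos // neg)
    n = max(1, neg // pos)
    return p, n
```

For example, with four positive and two negative labeled samples this yields p=2, n=1, so each round pseudo-labels twice as many positives as negatives, preserving the observed class balance.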
Attributes

estimator1_ : classifier object
The classifier used on view 1.
estimator2_ : classifier object
The classifier used on view 2.
class_name_ : string
The name of the class.
p_ : int
The number of positive classifications from the unlabeled_pool training set which will be given a positive "label" each round.
n_ : int
The number of negative classifications from the unlabeled_pool training set which will be given a negative "label" each round.
classes_ : array-like of shape (n_classes,)
Unique class labels.

Notes
Multiview co-training is most helpful in semi-supervised learning tasks where each view offers unique information not seen in the other. As shown in the example notebooks for this algorithm, multiview co-training can provide good classification results even when the number of unlabeled samples far exceeds the number of labeled samples. This classifier uses two estimators which work individually on each view but share information, and thus perform better than treating the views completely separately, or even than concatenating the views to get more features in a single-view setting. The classifier can be initialized with or without the view-specific estimators specified; if an estimator for a certain view is specified, it must support a predict_proba() method, because the algorithm must be able to determine which training samples it is most confident about during training epochs. The algorithm, as first proposed by Blum and Mitchell, is described in detail below.
Algorithm

Given:

* a set L of labeled training samples (with 2 views)
* a set U of unlabeled samples (with 2 views)

Create a pool U' of examples by choosing u examples at random from U.

Loop for k iterations:

* Use L to train a classifier h1 (estimator1) that considers only the view 1 portion of the data (i.e. Xs[0])
* Use L to train a classifier h2 (estimator2) that considers only the view 2 portion of the data (i.e. Xs[1])
* Allow h1 to label p (self.p_) positive and n (self.n_) negative samples from view 1 of U'
* Allow h2 to label p positive and n negative samples from view 2 of U'
* Add these self-labeled samples to L
* Randomly take 2p + 2n samples from U to replenish U'
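The loop above can be sketched end to end in plain NumPy. The Centroid classifier and the two-view toy data below are illustrative stand-ins (mvlearn would use real scikit-learn estimators), and for brevity this sketch draws directly from U rather than maintaining a separate replenished pool U':

```python
import numpy as np

rng = np.random.default_rng(0)

class Centroid:
    """Tiny stand-in classifier exposing predict_proba-style confidences."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.mu_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self
    def predict_proba(self, X):
        d = np.linalg.norm(X[:, None, :] - self.mu_[None, :, :], axis=2)
        w = np.exp(-d)                     # closer centroid -> higher score
        return w / w.sum(axis=1, keepdims=True)
    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]

# Two synthetic "views" of the same 40 samples, each informative on its own.
m = 40
y_true = np.repeat([0, 1], m // 2)
X1 = y_true[:, None] * 4.0 + rng.normal(size=(m, 2))
X2 = y_true[:, None] * 4.0 + rng.normal(size=(m, 2))

# Only four labels are known; -1 marks unlabeled (mvlearn uses np.nan).
y_work = np.full(m, -1)
for i in (0, 1, m - 2, m - 1):
    y_work[i] = y_true[i]

p = n_neg = 1                              # pseudo-labels per class, per view, per round
for _ in range(10):                        # the "Loop for k iterations" above
    lab = y_work >= 0
    h1 = Centroid().fit(X1[lab], y_work[lab])   # h1 sees only view 1
    h2 = Centroid().fit(X2[lab], y_work[lab])   # h2 sees only view 2
    for h, X in ((h1, X1), (h2, X2)):
        U = [i for i in range(m) if y_work[i] < 0]
        if not U:
            break
        proba = h.predict_proba(X[U])
        for cls, k in ((1, p), (0, n_neg)):     # most confident pos/neg picks
            col = int(np.where(h.classes_ == cls)[0][0])
            for j in np.argsort(proba[:, col])[::-1][:k]:
                y_work[U[j]] = cls              # self-labeled sample joins L

lab = y_work >= 0
acc = (Centroid().fit(X1[lab], y_work[lab]).predict(X1) == y_true).mean()
```

On this well-separated toy data the pool is fully pseudo-labeled within the iteration budget and the final view-1 classifier recovers the true labels almost perfectly.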
References

[1] Blum, A., & Mitchell, T. (1998, July). Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory (pp. 92-100). ACM.

[2] Goldman, Sally, and Yan Zhou. "Enhancing supervised learning with unlabeled data." ICML. 2000.

Examples
>>> # Supervised learning of single-view data with 2 distinct estimators
>>> from mvlearn.semi_supervised import CTClassifier
>>> from mvlearn.datasets import load_UCImultifeature
>>> import numpy as np
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.model_selection import train_test_split
>>> data, labels = load_UCImultifeature(select_labeled=[0, 1])
>>> X1 = data[0]  # Only using the first view
>>> X1_train, X1_test, l_train, l_test = train_test_split(X1, labels)
>>> # Supervised learning with a single view of data and 2 estimator types
>>> estimator1 = GaussianNB()
>>> estimator2 = RandomForestClassifier()
>>> ctc = CTClassifier(estimator1, estimator2, random_state=1)
>>> # Use the same matrix for each view
>>> ctc = ctc.fit([X1_train, X1_train], l_train)
>>> preds = ctc.predict([X1_test, X1_test])
>>> print("Accuracy: ", sum(preds == l_test) / len(preds))
Accuracy:  0.97

fit_predict(Xs, y)

Fit a co-train estimator to the semi-supervised data and then predict.
Parameters: Xs : list of array-likes or numpy.ndarray
* Xs length: n_views
* Xs[i] shape: (n_samples, n_features_i)
A list of the different views of data to fit and then predict.
y : array, shape (n_samples,)
Targets of the training data. Unlabeled examples should have label np.nan.
Returns: y_pred : array-like (n_samples,)
Predictions for each sample.

fit(Xs, y)[source]

Fit the classifier object to the data in Xs, y.

Parameters: Xs : list of array-likes or numpy.ndarray
* Xs length: n_views
* Xs[i] shape: (n_samples, n_features_i)
A list of the different views of data to train on.
y : array, shape (n_samples,)
The labels of the training data. Unlabeled examples should have label np.nan.
Returns: self : returns an instance of self

predict(Xs)[source]

Predict the classes of the examples in the two input views.

Parameters: Xs : list of array-likes or numpy.ndarray
* Xs length: n_views
* Xs[i] shape: (n_samples, n_features_i)
A list of the different views of data to predict.
Returns: y_pred : array-like (n_samples,)
The predicted class of each input example. If the two classifiers disagree on a sample, the prediction of the classifier with the higher predicted probability from predict_proba() is used.
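That disagreement rule can be sketched as follows; the function and its argument names are illustrative (not mvlearn's code), assuming per-sample predictions and the corresponding predict_proba() confidences from each view's classifier:

```python
import numpy as np

def resolve(pred1, pred2, conf1, conf2):
    """Where the two view-specific classifiers disagree, keep the
    prediction whose classifier reports the higher probability.
    (Sketch of the rule described above, not mvlearn's implementation.)"""
    pred1, pred2 = np.asarray(pred1), np.asarray(pred2)
    conf1, conf2 = np.asarray(conf1), np.asarray(conf2)
    # Agreement keeps the shared label; disagreement defers to confidence.
    return np.where((pred1 == pred2) | (conf1 >= conf2), pred1, pred2)
```

For instance, if the classifiers agree on samples 0 and 2 but split on sample 1, the more confident view decides sample 1.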

predict_proba(Xs)[source]

Predict the probability of each example belonging to each class.

Parameters: Xs : list of array-likes or numpy.ndarray
* Xs length: n_views
* Xs[i] shape: (n_samples, n_features_i)
A list of the different views of data to predict.
Returns: y_proba : array-like (n_samples, n_classes)
The probability of each sample being in each class.
Co-training Regressor

class mvlearn.semi_supervised.CTRegressor(estimator1=None, estimator2=None, k_neighbors=5, unlabeled_pool_size=50, num_iter=100, random_state=None)[source]

This class implements the co-training regressor for supervised and semi-supervised learning, following the framework described in [3]. The best use case is when the two views of input data are sufficiently distinct and independent, as detailed in [3]. However, co-training can also succeed when a single matrix of input data is supplied as both views and two estimators which are quite different are chosen [4].

In the semi-supervised case, performance can vary greatly, so using a separate validation set or a cross-validation procedure is recommended to ensure the regression model has fit well.
Parameters: estimator1 : sklearn object (only KNeighborsRegressor is supported)
The regressor object which will be trained on view 1 of the data.
estimator2 : sklearn object (only KNeighborsRegressor is supported)
The regressor object which will be trained on view 2 of the data.
k_neighbors : int, optional (default=5)
The number of neighbors to be considered for determining the mean squared error.
unlabeled_pool_size : int, optional (default=50)
The number of unlabeled samples which will be kept in a separate pool for regression and selection by the updated regressors at each training iteration.
num_iter : int, optional (default=100)
The maximum number of iterations to be performed.
random_state : int (default=None)
The seed for the fit() method and other class operations.
Attributes

estimator1_ : regressor object
The regressor used on view 1.
estimator2_ : regressor object
The regressor used on view 2.
class_name_ : string
The name of the class.
k_neighbors_ : int
The number of neighbors to be considered for determining the mean squared error.
unlabeled_pool_size : int
The number of unlabeled samples which will be kept in a separate pool for regression and selection by the updated regressors at each training iteration.
num_iter : int
The maximum number of iterations to be performed.
n_views : int
The number of views in the data.

Notes
Multiview co-training is most helpful in semi-supervised learning tasks where each view offers unique information not seen in the other. As shown in the example notebooks for this algorithm, multiview co-training can provide good regression results even when the number of unlabeled samples far exceeds the number of labeled samples. This regressor uses two sklearn regressors which work individually on each view but share information, and thus perform better than treating the views completely separately. The regressors must be KNeighborsRegressor, as described in [3].
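The k_neighbors parameter feeds the selection step of Zhou and Li's scheme [3]: a candidate pseudo-labeled point is scored by how much its inclusion reduces the squared error on its k labeled nearest neighbors. A sketch of that criterion (an illustrative helper, not the mvlearn implementation; the caller is assumed to supply the neighbors' targets and the regressor's predictions before and after refitting):

```python
import numpy as np

def coreg_gain(y_neighbors, pred_before, pred_after):
    """Confidence score for a candidate pseudo-labeled point: the drop in
    squared error on its k labeled nearest neighbors after refitting the
    kNN regressor with the candidate included.  A positive gain suggests
    the point is a trustworthy addition.  (Sketch of the criterion in [3].)"""
    y = np.asarray(y_neighbors, dtype=float)
    before = np.sum((y - np.asarray(pred_before, dtype=float)) ** 2)
    after = np.sum((y - np.asarray(pred_after, dtype=float)) ** 2)
    return before - after
```

At each iteration, the unlabeled candidate with the largest positive gain in one view is pseudo-labeled and handed to the other view's regressor.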
References

[3] Semi-Supervised Regression with Co-Training, by Zhi-Hua Zhou and Ming Li. https://pdfs.semanticscholar.org/437c/85ad1c05f60574544d31e96bd8e60393fc92.pdf

[4] Goldman, Sally, and Yan Zhou. "Enhancing supervised learning with unlabeled data." ICML. 2000. http://www.cs.columbia.edu/~dplewis/candidacy/goldman00enhancing.pdf

Examples
>>> from mvlearn.semi_supervised import CTRegressor
>>> from sklearn.neighbors import KNeighborsRegressor
>>> import numpy as np
>>> # X1 and X2 are the 2 views of the data
>>> X1 = [[0], [1], [2], [3], [4], [5], [6]]
>>> X2 = [[2], [3], [4], [6], [7], [8], [10]]
>>> y = [10, 11, 12, 13, 14, 15, 16]
>>> # Converting some of the labeled values to nan
>>> y_train = [10, np.nan, 12, np.nan, 14, np.nan, 16]
>>> knn1 = KNeighborsRegressor(n_neighbors=2)
>>> knn2 = KNeighborsRegressor(n_neighbors=2)
>>> ctr = CTRegressor(knn1, knn2, k_neighbors=2, random_state=42)
>>> ctr = ctr.fit([X1, X2], y_train)
>>> pred = ctr.predict([X1, X2])
>>> print("True value\n{}".format(y))
True value
[10, 11, 12, 13, 14, 15, 16]
>>> print("Predicted value\n{}".format(pred))
Predicted value
[10.75 11.25 11.25 13.25 13.25 14.75 15.25]

fit(Xs, y)[source]

Fit the regressor object to the data in Xs, y.

Parameters: Xs : list of array-likes or numpy.ndarray
* Xs length: n_views
* Xs[i] shape: (n_samples, n_features_i)
A list of the different views of data to train on.
y : array, shape (n_samples,)
The target values of the training data. Unlabeled examples should have label np.nan.
Returns: self : returns an instance of self

predict(Xs)[source]

Predict the values of the samples in the two input views.

Parameters: Xs : list of array-likes or numpy.ndarray
* Xs length: n_views
* Xs[i] shape: (n_samples, n_features_i)
A list of the different views of data to predict.
Returns: y_pred : array-like (n_samples,)
The average of the predictions from both estimators is returned.
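For instance, with hypothetical per-view predictions, the returned value is simply their elementwise mean:

```python
import numpy as np

# predict() returns the elementwise mean of the two view-specific kNN
# regressors' outputs (the arrays here are illustrative, not real output).
pred_view1 = np.array([10.5, 11.0, 12.5])
pred_view2 = np.array([11.0, 11.5, 12.0])
y_pred = (pred_view1 + pred_view2) / 2
```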

fit_predict(Xs, y)

Fit a co-train estimator to the semi-supervised data and then predict.

Parameters: Xs : list of array-likes or numpy.ndarray
* Xs length: n_views
* Xs[i] shape: (n_samples, n_features_i)
A list of the different views of data to fit and then predict.
y : array, shape (n_samples,)
Targets of the training data. Unlabeled examples should have label np.nan.
Returns: y_pred : array-like (n_samples,)
Predictions for each sample.
