Semi-Supervised

Cotraining Classifier

class mvlearn.semi_supervised.CTClassifier(estimator1=None, estimator2=None, p=None, n=None, unlabeled_pool_size=75, num_iter=50, random_state=None)[source]

This class implements the co-training classifier for supervised and semi-supervised learning, following the framework described in [1]. The best use case is when the two views of input data are sufficiently distinct and independent, as detailed in [1]. However, it can also be successful when a single matrix of input data is given as both views and two quite different estimators are chosen [2]. See the examples below.

In the semi-supervised case, performance can vary greatly, so using a separate validation set or cross validation procedure is recommended to ensure the classifier has fit well.

Parameters:

estimator1 : classifier object, (default=sklearn GaussianNB)

The classifier object which will be trained on view 1 of the data. This classifier should support the predict_proba() function so that classification probabilities can be computed and co-training can be performed effectively.

estimator2 : classifier object, (default=sklearn GaussianNB)

The classifier object which will be trained on view 2 of the data. Does not need to be of the same type as estimator1, but should support predict_proba().

p : int, optional (default=None)

The number of positive classifications from the unlabeled_pool training set which will be given a positive "label". If None, the default is the floor of the ratio of positive to negative examples in the labeled training data (at least 1). If only one of p or n is not None, the other will be set to be the same. When the labels are 0 or 1, positive is defined as 1, and in general, positive is the larger label.

n : int, optional (default=None)

The number of negative classifications from the unlabeled_pool training set which will be given a negative "label". If None, the default is the floor of the ratio of positive to negative examples in the labeled training data (at least 1). If only one of p or n is not None, the other will be set to be the same. When the labels are 0 or 1, negative is defined as 0, and in general, negative is the smaller label.

unlabeled_pool_size : int, optional (default=75)

The number of unlabeled_pool samples which will be kept in a separate pool for classification and selection by the updated classifier at each training iteration.

num_iter : int, optional (default=50)

The maximum number of training iterations to run.

random_state : int (default=None)

The starting random seed for fit() and class operations, passed to numpy.random.seed().
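
For example, an instance with explicit per-round label counts, a larger pool, and a fixed seed could be constructed as below (the parameter values here are only illustrative, not recommendations):

>>> from sklearn.naive_bayes import GaussianNB
>>> from mvlearn.semi_supervised import CTClassifier
>>> ctc = CTClassifier(GaussianNB(), GaussianNB(), p=2, n=2,
...                    unlabeled_pool_size=100, num_iter=30, random_state=0)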

Attributes

estimator1_ (classifier object) The classifier used on view 1.
estimator2_ (classifier object) The classifier used on view 2.
class_name_ (string) The name of the class.
p_ (int) The number of positive classifications from the unlabeled_pool training set which will be given a positive "label" each round.
n_ (int) The number of negative classifications from the unlabeled_pool training set which will be given a negative "label" each round.
classes_ (array-like of shape (n_classes,)) Unique class labels.

Notes

Multi-view co-training is most helpful for semi-supervised learning tasks in which each view offers unique information not present in the other. As shown in the example notebooks for this algorithm, multi-view co-training can provide good classification results even when the number of unlabeled samples far exceeds the number of labeled samples. This class uses two classifiers which work individually on each view but share information, yielding better performance than treating the views completely separately or concatenating them into a single-view feature matrix.

The classifier can be initialized with or without the per-view estimators specified, but any estimator that is specified must support a predict_proba() method, because the algorithm must determine which training samples it is most confident about during training epochs. The algorithm, as first proposed by Blum and Mitchell, is described in detail below.
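
For instance, a semi-supervised fit can be sketched as follows, where unlabeled rows are marked with np.nan. The arrays X1_train, X2_train, X1_test, X2_test and l_train are hypothetical placeholders (the Examples section further below uses only a single view):

>>> import numpy as np
>>> from mvlearn.semi_supervised import CTClassifier
>>> # Hypothetical: keep only the first 10 labels, mark the rest as unlabeled
>>> y_semi = l_train.astype(float)
>>> y_semi[10:] = np.nan
>>> ctc = CTClassifier(random_state=1).fit([X1_train, X2_train], y_semi)
>>> preds = ctc.predict([X1_test, X2_test])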

Algorithm

Given:

  • a set L of labeled training samples (with 2 views)
  • a set U of unlabeled samples (with 2 views)

Create a pool U' of examples by choosing u examples at random from U

Loop for k iterations

  • Use L to train a classifier h1 (estimator1) that considers only the view 1 portion of the data (i.e. Xs[0])
  • Use L to train a classifier h2 (estimator2) that considers only the view 2 portion of the data (i.e. Xs[1])
  • Allow h1 to label p (self.p_) positive and n (self.n_) negative samples from view 1 of U'
  • Allow h2 to label p positive and n negative samples from view 2 of U'
  • Add these self-labeled samples to L
  • Randomly take 2p + 2n samples from U to replenish U'
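
The loop can be illustrated with the following minimal sketch. It is not the library's internal implementation; it assumes binary labels {0, 1}, np.nan marking unlabeled rows, two numpy view matrices, and estimators that support predict_proba():

    import numpy as np
    from sklearn.base import clone

    def cotrain_sketch(est1, est2, X1, X2, y,
                       p=1, n=1, pool_size=75, num_iter=50, seed=0):
        X1, X2, y = np.asarray(X1), np.asarray(X2), np.asarray(y, dtype=float)
        labeled = list(np.flatnonzero(~np.isnan(y)))
        unlabeled = list(np.flatnonzero(np.isnan(y)))
        rng = np.random.RandomState(seed)
        rng.shuffle(unlabeled)
        pool, unlabeled = unlabeled[:pool_size], unlabeled[pool_size:]
        for _ in range(num_iter):
            if not pool:
                break
            est1 = clone(est1).fit(X1[labeled], y[labeled])
            est2 = clone(est2).fit(X2[labeled], y[labeled])
            for est, X in ((est1, X1), (est2, X2)):
                if not pool:
                    break
                proba = est.predict_proba(X[pool])[:, 1]
                order = np.argsort(proba)
                # n most confident negatives and p most confident positives
                chosen = set(order[:n]) | set(order[-p:])
                for idx in sorted(chosen, reverse=True):
                    sample = pool.pop(idx)
                    y[sample] = 1.0 if proba[idx] >= 0.5 else 0.0
                    labeled.append(sample)
            # replenish the pool with 2p + 2n fresh unlabeled samples
            pool += unlabeled[:2 * p + 2 * n]
            unlabeled = unlabeled[2 * p + 2 * n:]
        return est1, est2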

References

[1] Blum, A., & Mitchell, T. (1998, July). Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory (pp. 92-100). ACM.
[2] Goldman, Sally, and Yan Zhou. "Enhancing supervised learning with unlabeled data." ICML. 2000.

Examples

>>> # Supervised learning of single-view data with 2 distinct estimators
>>> from mvlearn.semi_supervised import CTClassifier
>>> from mvlearn.datasets import load_UCImultifeature
>>> import numpy as np
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.model_selection import train_test_split
>>> data, labels = load_UCImultifeature(select_labeled=[0,1])
>>> X1 = data[0]  # Only using the first view
>>> X1_train, X1_test, l_train, l_test = train_test_split(X1, labels)
>>> # Supervised learning with a single view of data and 2 estimator types
>>> estimator1 = GaussianNB()
>>> estimator2 = RandomForestClassifier()
>>> ctc = CTClassifier(estimator1, estimator2, random_state=1)
>>> # Use the same matrix for each view
>>> ctc = ctc.fit([X1_train, X1_train], l_train)
>>> preds = ctc.predict([X1_test, X1_test])
>>> print("Accuracy: ", sum(preds==l_test) / len(preds))
Accuracy:  0.97
fit_predict(Xs, y)

Fit a co-train estimator to the semi-supervised data and then predict.

Parameters:

Xs : list of array-likes or numpy.ndarray

  • Xs length: n_views
  • Xs[i] shape: (n_samples, n_features_i)

A list of the different views of data to fit and then predict.

y : array, shape (n_samples,)

Targets of the training data. Unlabeled examples should have label np.nan.

Returns:

y_pred : array-like (n_samples,)

Predictions for each sample.
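
A minimal usage sketch, reusing the hypothetical arrays from the semi-supervised sketch in the Notes above:

>>> y_pred = ctc.fit_predict([X1_train, X2_train], y_semi)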

fit(Xs, y)[source]

Fit the classifier object to the data in Xs, y.

Parameters:

Xs : list of array-likes or numpy.ndarray

  • Xs length: n_views
  • Xs[i] shape: (n_samples, n_features_i)

A list of the different views of data to train on.

y : array, shape (n_samples,)

The labels of the training data. Unlabeled examples should have label np.nan.

Returns:

self : returns an instance of self

predict(Xs)[source]

Predict the classes of the examples in the two input views.

Parameters:

Xs : list of array-likes or numpy.ndarray

  • Xs length: n_views
  • Xs[i] shape: (n_samples, n_features_i)

A list of the different views of data to predict.

Returns:

y_pred : array-like (n_samples,)

The predicted class of each input example. If the two classifiers disagree on a sample, the prediction from the classifier with the higher predicted probability (from predict_proba()) is used.
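
The rule can be illustrated with the following sketch. It is not the library's internal code; it assumes two fitted per-view estimators est1 and est2 whose predict_proba() columns are in the same order:

    import numpy as np

    def resolve(est1, est2, X1, X2):
        proba1, proba2 = est1.predict_proba(X1), est2.predict_proba(X2)
        # class indices per view; map through classes_ to get label values
        pred1, pred2 = proba1.argmax(axis=1), proba2.argmax(axis=1)
        # keep agreed predictions; otherwise take the more confident estimator's
        use_first = proba1.max(axis=1) >= proba2.max(axis=1)
        return np.where(pred1 == pred2, pred1,
                        np.where(use_first, pred1, pred2))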

predict_proba(Xs)[source]

Predict the probability of each example belonging to each class.

Parameters:

Xs : list of array-likes or numpy.ndarray

  • Xs length: n_views
  • Xs[i] shape: (n_samples, n_features_i)

A list of the different views of data to predict.

Returns:

y_proba : array-like (n_samples, n_classes)

The probability of each sample being in each class.
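
For example, continuing the supervised example above (the note about column ordering is an assumption following the usual sklearn convention, not something stated by this API):

>>> y_proba = ctc.predict_proba([X1_test, X1_test])
>>> # y_proba has shape (n_samples, n_classes); columns presumably follow ctc.classes_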

Cotraining Regressor

class mvlearn.semi_supervised.CTRegressor(estimator1=None, estimator2=None, k_neighbors=5, unlabeled_pool_size=50, num_iter=100, random_state=None)[source]

This class implements the co-training regressor for supervised and semi-supervised learning, following the framework described in [3]. The best use case is when the two views of input data are sufficiently distinct and independent, as detailed in [3]. However, it can also be successful when a single matrix of input data is given as both views and two quite different estimators are chosen [4].

In the semi-supervised case, performance can vary greatly, so using a separate validation set or cross validation procedure is recommended to ensure the regression model has fit well.

Parameters:

estimator1 : sklearn object (only KNeighborsRegressor is supported)

The regressor object which will be trained on view 1 of the data.

estimator2 : sklearn object (only KNeighborsRegressor is supported)

The regressor object which will be trained on view 2 of the data.

k_neighbors : int, optional (default=5)

The number of neighbors to be considered for determining the mean squared error.

unlabeled_pool_size : int, optional (default=50)

The number of unlabeled_pool samples which will be kept in a separate pool for regression and selection by the updated regressor at each training iteration.

num_iter : int, optional (default=100)

The maximum number of training iterations to run.

random_state : int (default=None)

The starting random seed for fit() and other class operations.

Attributes

estimator1_ (regressor object) The regressor used on view 1.
estimator2_ (regressor object) The regressor used on view 2.
class_name_ (string) The name of the class.
k_neighbors_ (int) The number of neighbors to be considered for determining the mean squared error.
unlabeled_pool_size (int) The number of unlabeled_pool samples which will be kept in a separate pool for regression and selection by the updated regressor at each training iteration.
num_iter (int) The maximum number of training iterations to run.
n_views (int) The number of views in the data.

Notes

Multi-view co-training is most helpful for semi-supervised learning tasks in which each view offers unique information not present in the other. As shown in the example notebooks for this algorithm, multi-view co-training can provide good regression results even when the number of unlabeled samples far exceeds the number of labeled samples. This regressor uses two sklearn regressors which work individually on each view but share information, yielding better performance than treating the views completely separately. Each regressor must be a KNeighborsRegressor, as described in [3].
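
The role of k_neighbors can be illustrated with a sketch of the selection criterion from [3]: a candidate's pseudo-label is considered trustworthy when adding it reduces the squared error on the candidate's labeled neighbors. This is an illustration of the idea only, not the library's internal code, and the function name delta_mse is hypothetical:

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor

    def delta_mse(X_lab, y_lab, x_cand, k_neighbors=5):
        X_lab, y_lab = np.asarray(X_lab), np.asarray(y_lab, dtype=float)
        x_cand = np.asarray(x_cand, dtype=float).reshape(1, -1)
        knn = KNeighborsRegressor(n_neighbors=k_neighbors).fit(X_lab, y_lab)
        y_cand = knn.predict(x_cand)[0]                        # pseudo-label
        neigh = knn.kneighbors(x_cand, return_distance=False)[0]
        # squared error on the candidate's labeled neighbors, before ...
        before = np.sum((y_lab[neigh] - knn.predict(X_lab[neigh])) ** 2)
        # ... and after retraining with the pseudo-labeled candidate added
        knn_aug = KNeighborsRegressor(n_neighbors=k_neighbors).fit(
            np.vstack([X_lab, x_cand]), np.append(y_lab, y_cand))
        after = np.sum((y_lab[neigh] - knn_aug.predict(X_lab[neigh])) ** 2)
        return before - after   # positive values indicate a trustworthy candidate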

References

[3] Zhou, Zhi-Hua, and Ming Li. "Semi-supervised regression with co-training." IJCAI. 2005. https://pdfs.semanticscholar.org/437c/85ad1c05f60574544d31e96bd8e60393fc92.pdf
[4] Goldman, Sally, and Yan Zhou. "Enhancing supervised learning with unlabeled data." ICML. 2000. http://www.cs.columbia.edu/~dplewis/candidacy/goldman00enhancing.pdf

Examples

>>> from mvlearn.semi_supervised import CTRegressor
>>> from sklearn.neighbors import KNeighborsRegressor
>>> import numpy as np
>>> # X1 and X2 are the 2 views of the data
>>> X1 = [[0], [1], [2], [3], [4], [5], [6]]
>>> X2 = [[2], [3], [4], [6], [7], [8], [10]]
>>> y = [10, 11, 12, 13, 14, 15, 16]
>>> # Converting some of the labeled values to nan
>>> y_train = [10, np.nan, 12, np.nan, 14, np.nan, 16]
>>> knn1 = KNeighborsRegressor(n_neighbors = 2)
>>> knn2 = KNeighborsRegressor(n_neighbors = 2)
>>> ctr = CTRegressor(knn1, knn2, k_neighbors = 2, random_state =  42)
>>> ctr = ctr.fit([X1, X2], y_train)
>>> pred = ctr.predict([X1, X2])
>>> print("True value\n{}".format(y))
True value
[10, 11, 12, 13, 14, 15, 16]
>>> print("Predicted value\n{}".format(pred))
Predicted value
[10.75 11.25 11.25 13.25 13.25 14.75 15.25]
fit(Xs, y)[source]

Fit the regressor object to the data in Xs, y.

Parameters:

Xs : list of array-likes or numpy.ndarray

  • Xs length: n_views
  • Xs[i] shape: (n_samples, n_features_i)

A list of the different views of data to train on.

y : array, shape (n_samples,)

The target values of the training data. Unlabeled examples should have label np.nan.

Returns:

self : returns an instance of self

predict(Xs)[source]

Predict the values of the samples in the two input views.

Parameters:

Xs : list of array-likes or numpy.ndarray

  • Xs length: n_views
  • Xs[i] shape: (n_samples, n_features_i)

A list of the different views of data to predict.

Returns:

y_pred : array-like (n_samples,)

The predicted value of each input sample, computed as the average of the two estimators' predictions.

fit_predict(Xs, y)

Fit a co-train estimator to the semi-supervised data and then predict.

Parameters:

Xs : list of array-likes or numpy.ndarray

  • Xs length: n_views
  • Xs[i] shape: (n_samples, n_features_i)

A list of the different views of data to fit and then predict.

y : array, shape (n_samples,)

Targets of the training data. Unlabeled examples should have label np.nan.

Returns:

y_pred : array-like (n_samples,)

Predictions for each sample.