ssl_framework.main#
This module contains the core SelfTrainingClassifier class, which implements the main semi-supervised learning functionality.
SelfTrainingClassifier#
- class ssl_framework.main.SelfTrainingClassifier(base_model, max_iter=10, selection_strategy=None, integration_strategy=None, patience=3, tol=0.01, labeling_convergence_threshold=5)[source]#
Bases: BaseEstimator, ClassifierMixin
Semi-supervised learning classifier using a self-training approach.
This classifier wraps a base supervised model and iteratively trains it on both labeled and pseudo-labeled data, following the scikit-learn API.
Methods
SelfTrainingClassifier.fit(X_labeled, ...[, ...]) – Fit the self-training classifier using semi-supervised learning.
SelfTrainingClassifier.predict(X) – Predict class labels for samples in X.
SelfTrainingClassifier.predict_proba(X) – Predict class probabilities for samples in X.
Attributes
After fitting, the following attributes are available:
- classes_: numpy.ndarray#
The classes seen during fit().
- history_: List[Dict[str, Any]]#
Training history containing metrics for each iteration. Each dictionary contains:
- iteration (int): Iteration number
- labeled_data_count (int): Number of labeled samples before adding new ones
- new_labels_count (int): Number of new pseudo-labels added
- average_confidence (float): Mean confidence of newly added samples
- validation_score (float, optional): Validation score, if validation data was provided
- stopping_reason (str, optional): Reason for stopping, if applicable
- stopping_reason_: str#
Reason why training stopped (e.g., “Maximum iterations reached”, “Early stopping: no improvement”, “Labeling convergence”).
- feature_names_: List[str] or None#
Feature names if input was DataFrame, None otherwise.
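As a minimal sketch of how the history_ entries described above might be consumed, the snippet below summarizes a hand-written list of dictionaries that uses the documented keys. The values are illustrative placeholders, not output from a real run, and summarize_history is a hypothetical helper that is not part of ssl_framework.

```python
# Illustrative history in the documented shape; the numbers are made up.
example_history = [
    {"iteration": 1, "labeled_data_count": 100, "new_labels_count": 40,
     "average_confidence": 0.97, "validation_score": 0.81},
    {"iteration": 2, "labeled_data_count": 140, "new_labels_count": 12,
     "average_confidence": 0.96, "validation_score": 0.83},
    {"iteration": 3, "labeled_data_count": 152, "new_labels_count": 3,
     "average_confidence": 0.95, "validation_score": 0.83,
     "stopping_reason": "Labeling convergence"},
]


def summarize_history(history):
    """Hypothetical helper: condense per-iteration records into totals."""
    total_new = sum(entry["new_labels_count"] for entry in history)
    # validation_score and stopping_reason are optional keys, so use .get().
    scores = [entry["validation_score"] for entry in history
              if entry.get("validation_score") is not None]
    return {
        "iterations": len(history),
        "total_pseudo_labels": total_new,
        "best_validation_score": max(scores) if scores else None,
        "stopping_reason": history[-1].get("stopping_reason") if history else None,
    }


summary = summarize_history(example_history)
print(summary)
```

With a fitted classifier, the same helper could be applied to ssl_clf.history_, assuming the entries follow the key layout listed above.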
- __init__(base_model, max_iter=10, selection_strategy=None, integration_strategy=None, patience=3, tol=0.01, labeling_convergence_threshold=5)[source]#
Initialize the SelfTrainingClassifier.
- Parameters:
base_model (estimator) – Base supervised model that implements fit, predict, and predict_proba. Must be sklearn-compatible (e.g., LogisticRegression, RandomForestClassifier).
max_iter (int, default=10) – Maximum number of iterations for the self-training loop.
selection_strategy (object, default=None) – Strategy for selecting which unlabeled samples to pseudo-label. If None, uses ConfidenceThreshold(0.95). Available strategies: ConfidenceThreshold, TopKFixedCount.
integration_strategy (object, default=None) – Strategy for integrating pseudo-labeled samples into the labeled set. If None, uses AppendAndGrow(). Available strategies: AppendAndGrow, FullReLabeling, ConfidenceWeighting.
patience (int, default=3) – Number of iterations with no improvement to wait before early stopping. Only used when validation data is provided.
tol (float, default=0.01) – The minimum improvement in validation score to be considered an improvement. Only used when validation data is provided.
labeling_convergence_threshold (int, default=5) – Stop if fewer than this many new labels are added in an iteration. Prevents infinite loops when no more confident samples can be found.
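The interplay of patience, tol, and labeling_convergence_threshold can be sketched in plain Python. This is an assumption about the comparison logic (a score counts as an improvement only if it beats the best score so far by more than tol, and the label-count check runs first), not the library's actual implementation. The helper first_stop is hypothetical.

```python
def first_stop(val_scores, new_label_counts, patience=3, tol=0.01,
               labeling_convergence_threshold=5):
    """Hypothetical helper: return (iteration, reason) for the first
    stopping rule that fires, or (None, None) if none does.

    Assumes an improvement must exceed the best score so far by more
    than tol; the real library may compare differently.
    """
    best = float("-inf")
    stale = 0  # consecutive iterations without a sufficient improvement
    pairs = zip(val_scores, new_label_counts)
    for iteration, (score, n_new) in enumerate(pairs, start=1):
        # Too few new pseudo-labels: treat labeling as converged.
        if n_new < labeling_convergence_threshold:
            return iteration, "Labeling convergence"
        if score > best + tol:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                return iteration, "Early stopping: no improvement"
    return None, None


plateau = first_stop([0.80, 0.85, 0.853, 0.851, 0.855], [50, 40, 30, 20, 10])
few_labels = first_stop([0.80, 0.85], [50, 3])
print(plateau, few_labels)
```

Under these assumptions, the first call stops at iteration 5 because three consecutive scores fail to beat 0.85 by more than 0.01, and the second stops at iteration 2 because only 3 new labels were added.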
Examples
>>> from sklearn.linear_model import LogisticRegression
>>> from ssl_framework.main import SelfTrainingClassifier
>>> from ssl_framework.strategies import ConfidenceThreshold, AppendAndGrow
>>>
>>> base_model = LogisticRegression(random_state=42)
>>> selection_strategy = ConfidenceThreshold(threshold=0.9)
>>> integration_strategy = AppendAndGrow()
>>>
>>> ssl_clf = SelfTrainingClassifier(
...     base_model=base_model,
...     selection_strategy=selection_strategy,
...     integration_strategy=integration_strategy,
...     max_iter=10
... )
- fit(X_labeled, y_labeled, X_unlabeled, X_val=None, y_val=None)[source]#
Fit the self-training classifier using semi-supervised learning.
This method iteratively trains the base model by:
1. Training on the current labeled data
2. Making predictions on the unlabeled data
3. Selecting confident predictions using the selection strategy
4. Integrating new pseudo-labels using the integration strategy
5. Repeating until stopping criteria are met
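The five steps above can be sketched as a small, dependency-free loop. This is an illustration of generic self-training, not the source of fit(): CentroidModel is a toy 1-D stand-in for a real base model, self_train is a hypothetical function, selection is a fixed confidence threshold, and integration simply appends the selected samples.

```python
import math


class CentroidModel:
    """Toy 1-D classifier standing in for a real base model (hypothetical)."""

    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        self.centroids_ = {
            c: sum(x for x, label in zip(X, y) if label == c) / y.count(c)
            for c in self.classes_
        }
        return self

    def predict_proba(self, X):
        rows = []
        for x in X:
            # Closer centroids get exponentially larger scores.
            scores = [math.exp(-abs(x - self.centroids_[c])) for c in self.classes_]
            total = sum(scores)
            rows.append([s / total for s in scores])
        return rows


def self_train(model, X_labeled, y_labeled, X_unlabeled, threshold=0.9, max_iter=10):
    X_lab, y_lab, pool = list(X_labeled), list(y_labeled), list(X_unlabeled)
    history = []
    for iteration in range(1, max_iter + 1):
        model.fit(X_lab, y_lab)                       # 1. train on labeled data
        if not pool:
            break
        probas = model.predict_proba(pool)            # 2. predict on unlabeled data
        keep = [max(row) >= threshold for row in probas]  # 3. select confident rows
        history.append({"iteration": iteration,
                        "labeled_data_count": len(X_lab),
                        "new_labels_count": sum(keep)})
        if not any(keep):
            break                                     # 5. nothing left to add
        for x, row, selected in zip(pool, probas, keep):
            if selected:                              # 4. append pseudo-labels
                X_lab.append(x)
                y_lab.append(model.classes_[row.index(max(row))])
        pool = [x for x, selected in zip(pool, keep) if not selected]
    model.fit(X_lab, y_lab)  # final fit on everything collected
    return model, history


model, history = self_train(CentroidModel(), [0.0, 1.0, 10.0, 11.0], [0, 0, 1, 1],
                            [0.5, 10.5, 5.5])
print(history)
```

In this toy run the two points near the class centroids are pseudo-labeled in the first iteration, while the midpoint 5.5 stays at probability 0.5 and is never added, so the loop ends on the second iteration.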
- Parameters:
X_labeled (array-like of shape (n_labeled_samples, n_features)) – Initial labeled training data. Can be numpy array or pandas DataFrame.
y_labeled (array-like of shape (n_labeled_samples,)) – Target values for labeled data. Can be numpy array or pandas Series.
X_unlabeled (array-like of shape (n_unlabeled_samples, n_features)) – Unlabeled training data to iteratively pseudo-label. Can be numpy array or pandas DataFrame.
X_val (array-like of shape (n_val_samples, n_features), optional) – Validation data for early stopping. If provided with y_val, enables early stopping based on validation score plateau.
y_val (array-like of shape (n_val_samples,), optional) – Validation targets for early stopping.
- Returns:
self – Returns the fitted instance.
- Return type:
SelfTrainingClassifier
- classes_#
The classes seen during fit.
- Type:
ndarray of shape (n_classes,)
- history_#
Training history containing metrics for each iteration:
- iteration: iteration number
- labeled_data_count: number of labeled samples before adding new ones
- new_labels_count: number of new pseudo-labels added
- average_confidence: mean confidence of newly added samples
- validation_score: validation score (if validation data provided)
- stopping_reason: reason for stopping (if applicable)
- stopping_reason_#
Reason why training stopped (e.g., “Maximum iterations reached”, “Early stopping: no improvement”, “Labeling convergence”).
- Type:
str
Examples
>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> from ssl_framework.main import SelfTrainingClassifier
>>>
>>> # Create sample data
>>> X_labeled = np.array([[0, 0], [1, 1], [10, 10], [11, 11]])
>>> y_labeled = np.array([0, 0, 1, 1])
>>> X_unlabeled = np.array([[0.5, 0.5], [10.5, 10.5], [5, 5]])
>>>
>>> # Fit SSL classifier
>>> ssl_clf = SelfTrainingClassifier(LogisticRegression())
>>> ssl_clf.fit(X_labeled, y_labeled, X_unlabeled)
>>>
>>> # Check training progress
>>> print(f"Stopped due to: {ssl_clf.stopping_reason_}")
>>> print(f"Training iterations: {len(ssl_clf.history_)}")
- predict(X)[source]#
Predict class labels for samples in X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Samples to predict. Can be numpy array or pandas DataFrame.
- Returns:
y_pred – Predicted class labels for each sample.
- Return type:
ndarray of shape (n_samples,)
- predict_proba(X)[source]#
Predict class probabilities for samples in X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Samples to predict probabilities for. Can be numpy array or pandas DataFrame.
- Returns:
y_proba – Predicted class probabilities for each sample and class.
- Return type:
ndarray of shape (n_samples, n_classes)
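Under the usual scikit-learn convention, column j of the predict_proba output corresponds to classes_[j], so labels can be recovered by taking the most probable column of each row. Whether SelfTrainingClassifier.predict does exactly this or delegates to the base model's predict is not stated here, so the following is only an illustration. The helper labels_from_proba is hypothetical.

```python
def labels_from_proba(proba_rows, classes):
    """Hypothetical helper: map each probability row to its most likely class."""
    labels = []
    for row in proba_rows:
        # Index of the largest probability in this row.
        best_column = max(range(len(row)), key=row.__getitem__)
        labels.append(classes[best_column])
    return labels


classes = ["cat", "dog"]  # plays the role of classes_
proba = [[0.9, 0.1], [0.2, 0.8], [0.55, 0.45]]
predicted = labels_from_proba(proba, classes)
print(predicted)
```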
- set_fit_request(*, X_labeled: bool | None | str = '$UNCHANGED$', X_unlabeled: bool | None | str = '$UNCHANGED$', X_val: bool | None | str = '$UNCHANGED$', y_labeled: bool | None | str = '$UNCHANGED$', y_val: bool | None | str = '$UNCHANGED$') → SelfTrainingClassifier#
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
X_labeled (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_labeled parameter in fit.
X_unlabeled (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_unlabeled parameter in fit.
X_val (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_val parameter in fit.
y_labeled (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for y_labeled parameter in fit.
y_val (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for y_val parameter in fit.
- Returns:
self – The updated object.
- Return type:
SelfTrainingClassifier
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → SelfTrainingClassifier#
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to score.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Examples#
Basic Usage#
from sklearn.linear_model import LogisticRegression
from ssl_framework.main import SelfTrainingClassifier
import numpy as np
# Sample data
X_labeled = np.array([[0, 0], [1, 1], [10, 10], [11, 11]])
y_labeled = np.array([0, 0, 1, 1])
X_unlabeled = np.array([[0.5, 0.5], [10.5, 10.5]])
# Create and fit SSL classifier
ssl_clf = SelfTrainingClassifier(LogisticRegression())
ssl_clf.fit(X_labeled, y_labeled, X_unlabeled)
# Make predictions
predictions = ssl_clf.predict([[0.2, 0.2], [10.2, 10.2]])
probabilities = ssl_clf.predict_proba([[0.2, 0.2], [10.2, 10.2]])
With Custom Strategies#
from ssl_framework.strategies import TopKFixedCount, ConfidenceWeighting
ssl_clf = SelfTrainingClassifier(
base_model=LogisticRegression(),
selection_strategy=TopKFixedCount(k=10),
integration_strategy=ConfidenceWeighting(),
max_iter=5
)
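To make the difference between the two selection strategies concrete, here is a rough sketch of threshold-based versus top-k selection over a list of class-probability rows. The internals of ConfidenceThreshold and TopKFixedCount are not shown in this reference, so both functions below are hypothetical stand-ins for the general idea rather than the library's code.

```python
def select_by_threshold(probas, threshold=0.95):
    """Hypothetical: indices of rows whose top probability meets the threshold."""
    return [i for i, row in enumerate(probas) if max(row) >= threshold]


def select_top_k(probas, k=10):
    """Hypothetical: indices of the k most confident rows, in original order."""
    ranked = sorted(range(len(probas)), key=lambda i: max(probas[i]), reverse=True)
    return sorted(ranked[:k])


probas = [[0.98, 0.02], [0.60, 0.40], [0.10, 0.90], [0.03, 0.97]]
by_threshold = select_by_threshold(probas, threshold=0.95)
by_top_k = select_top_k(probas, k=3)
print(by_threshold, by_top_k)
```

A threshold can select anywhere from zero to all samples in an iteration, whereas a fixed k always adds the same number (as long as enough unlabeled samples remain), even when the model is not very confident about them.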
With Early Stopping#
# Validation data for early stopping
X_val = np.array([[0.3, 0.3], [10.3, 10.3]])
y_val = np.array([0, 1])
ssl_clf = SelfTrainingClassifier(
base_model=LogisticRegression(),
patience=3,
tol=0.01
)
ssl_clf.fit(X_labeled, y_labeled, X_unlabeled, X_val, y_val)
print(f"Stopped due to: {ssl_clf.stopping_reason_}")
Pandas Integration#
import pandas as pd
# DataFrame inputs
X_labeled_df = pd.DataFrame([[0, 0], [1, 1]], columns=['x', 'y'])
y_labeled_series = pd.Series([0, 1], name='target')
X_unlabeled_df = pd.DataFrame([[0.5, 0.5]], columns=['x', 'y'])
ssl_clf = SelfTrainingClassifier(LogisticRegression())
ssl_clf.fit(X_labeled_df, y_labeled_series, X_unlabeled_df)
print(ssl_clf.feature_names_) # ['x', 'y']