ssl_framework.main#

This module contains the core SelfTrainingClassifier class, which implements the main semi-supervised learning functionality.

SelfTrainingClassifier#

class ssl_framework.main.SelfTrainingClassifier(base_model, max_iter=10, selection_strategy=None, integration_strategy=None, patience=3, tol=0.01, labeling_convergence_threshold=5)#

Bases: BaseEstimator, ClassifierMixin

Semi-supervised learning classifier using a self-training approach.

This classifier wraps a base supervised model and iteratively trains it on both labeled and pseudo-labeled data, following the scikit-learn API.

Methods

SelfTrainingClassifier.fit(X_labeled, ...[, ...])

Fit the self-training classifier using semi-supervised learning.

SelfTrainingClassifier.predict(X)

Predict class labels for samples in X.

SelfTrainingClassifier.predict_proba(X)

Predict class probabilities for samples in X.

Attributes

After fitting, the following attributes are available:

classes_: numpy.ndarray#

The classes seen during fit().

history_: List[Dict[str, Any]]#

Training history containing metrics for each iteration. Each dictionary contains:

  • iteration (int): Iteration number

  • labeled_data_count (int): Number of labeled samples before adding new ones

  • new_labels_count (int): Number of new pseudo-labels added

  • average_confidence (float): Mean confidence of newly added samples

  • validation_score (float, optional): Validation score if validation data provided

  • stopping_reason (str, optional): Reason for stopping if applicable
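
As a rough illustration of how these records can be consumed, the sketch below summarizes a history list in plain Python. The records are invented placeholder values that only mimic the keys listed above; they are not output from a real run. `summarize_history` is a hypothetical helper, not part of ssl_framework.

```python
from typing import Any, Dict, List, Optional

# Made-up records that mimic the documented keys; not real training output.
example_history: List[Dict[str, Any]] = [
    {"iteration": 1, "labeled_data_count": 4,
     "new_labels_count": 2, "average_confidence": 0.97},
    {"iteration": 2, "labeled_data_count": 6,
     "new_labels_count": 0, "average_confidence": 0.0,
     "stopping_reason": "Labeling convergence"},
]


def summarize_history(history: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical helper: count iterations and pseudo-labels, find the stop reason."""
    total_new = sum(rec["new_labels_count"] for rec in history)
    # validation_score and stopping_reason are optional, so read them with .get().
    reasons = [rec["stopping_reason"] for rec in history if rec.get("stopping_reason")]
    last_reason: Optional[str] = reasons[-1] if reasons else None
    return {
        "iterations": len(history),
        "total_new_labels": total_new,
        "stopping_reason": last_reason,
    }


summary = summarize_history(example_history)
print(summary)
```

Reading the optional keys with `.get()` keeps the helper usable whether or not validation data was passed to fit.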

stopping_reason_: str#

Reason why training stopped (e.g., “Maximum iterations reached”, “Early stopping: no improvement”, “Labeling convergence”).

feature_names_: List[str] or None#

Feature names if the input was a pandas DataFrame; None otherwise.

__init__(base_model, max_iter=10, selection_strategy=None, integration_strategy=None, patience=3, tol=0.01, labeling_convergence_threshold=5)#

Initialize the SelfTrainingClassifier.

Parameters:
  • base_model (estimator) – Base supervised model that implements fit, predict, and predict_proba. Must be sklearn-compatible (e.g., LogisticRegression, RandomForestClassifier).

  • max_iter (int, default=10) – Maximum number of iterations for the self-training loop.

  • selection_strategy (object, default=None) – Strategy for selecting which unlabeled samples to pseudo-label. If None, uses ConfidenceThreshold(0.95). Available strategies: ConfidenceThreshold, TopKFixedCount.

  • integration_strategy (object, default=None) – Strategy for integrating pseudo-labeled samples into the labeled set. If None, uses AppendAndGrow(). Available strategies: AppendAndGrow, FullReLabeling, ConfidenceWeighting.

  • patience (int, default=3) – Number of iterations with no improvement to wait before early stopping. Only used when validation data is provided.

  • tol (float, default=0.01) – Minimum increase in validation score that counts as an improvement. Only used when validation data is provided.

  • labeling_convergence_threshold (int, default=5) – Stop if fewer than this many new pseudo-labels are added in an iteration. This avoids spending further iterations once few or no confident samples remain.
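
The two selection ideas named in the parameter list can be pictured with plain Python. This is a sketch under assumptions, not the library's implementation: `select_by_confidence` and `select_top_k` are hypothetical helpers, and the real ConfidenceThreshold and TopKFixedCount classes may differ in interface, comparison operator, and tie-breaking.

```python
from typing import List, Sequence


def max_confidences(proba: Sequence[Sequence[float]]) -> List[float]:
    """Highest class probability for each sample."""
    return [max(row) for row in proba]


def select_by_confidence(proba: Sequence[Sequence[float]],
                         threshold: float = 0.95) -> List[int]:
    """Indices whose top probability is at least `threshold` (>= is an assumption)."""
    return [i for i, c in enumerate(max_confidences(proba)) if c >= threshold]


def select_top_k(proba: Sequence[Sequence[float]], k: int = 10) -> List[int]:
    """Indices of the k most confident samples, most confident first."""
    conf = max_confidences(proba)
    ranked = sorted(range(len(conf)), key=lambda i: conf[i], reverse=True)
    return ranked[:k]


# Illustrative probabilities for four unlabeled samples and two classes.
proba = [[0.98, 0.02], [0.55, 0.45], [0.03, 0.97], [0.20, 0.80]]
print(select_by_confidence(proba, threshold=0.95))  # [0, 2]
print(select_top_k(proba, k=3))                     # [0, 2, 3]
```

A threshold rule can select nothing in a given iteration, while a fixed-count rule always selects up to k samples regardless of how confident they are; that trade-off is the main reason to choose one over the other.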

Examples

>>> from sklearn.linear_model import LogisticRegression
>>> from ssl_framework.main import SelfTrainingClassifier
>>> from ssl_framework.strategies import ConfidenceThreshold, AppendAndGrow
>>>
>>> base_model = LogisticRegression(random_state=42)
>>> selection_strategy = ConfidenceThreshold(threshold=0.9)
>>> integration_strategy = AppendAndGrow()
>>>
>>> ssl_clf = SelfTrainingClassifier(
...     base_model=base_model,
...     selection_strategy=selection_strategy,
...     integration_strategy=integration_strategy,
...     max_iter=10
... )

fit(X_labeled, y_labeled, X_unlabeled, X_val=None, y_val=None)#

Fit the self-training classifier using semi-supervised learning.

This method iteratively trains the base model by:

  1. Training on current labeled data

  2. Making predictions on unlabeled data

  3. Selecting confident predictions using the selection strategy

  4. Integrating new pseudo-labels using the integration strategy

  5. Repeating until stopping criteria are met
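
The five steps above can be sketched as a small loop. This is a toy illustration, not the code behind fit: the 1-D nearest-centroid "model", the fixed confidence threshold, the append-style integration, and the names `fit_centroids`, `toy_predict_proba`, and `self_train` are all assumptions made so the example runs without ssl_framework or scikit-learn.

```python
import math
from typing import Dict, List, Tuple


def fit_centroids(X: List[float], y: List[int]) -> Dict[int, float]:
    """Toy 1-D model: the mean feature value of each class."""
    groups: Dict[int, List[float]] = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)
    return {label: sum(v) / len(v) for label, v in groups.items()}


def toy_predict_proba(centroids: Dict[int, float], x: float) -> Dict[int, float]:
    """Normalized exp(-distance) to each centroid, used as a confidence."""
    scores = {label: math.exp(-abs(x - c)) for label, c in centroids.items()}
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}


def self_train(X_lab, y_lab, X_unl, threshold=0.95, max_iter=10):
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unl)
    history = []
    for iteration in range(1, max_iter + 1):
        centroids = fit_centroids(X_lab, y_lab)           # 1. train on labeled data
        selected: List[Tuple[float, int]] = []
        for x in pool:
            proba = toy_predict_proba(centroids, x)       # 2. predict on unlabeled data
            label = max(proba, key=proba.get)
            if proba[label] >= threshold:                 # 3. select confident predictions
                selected.append((x, label))
        history.append({"iteration": iteration, "new_labels_count": len(selected)})
        if not selected:                                  # 5. stop when nothing new is added
            break
        for x, label in selected:                         # 4. integrate pseudo-labels
            X_lab.append(x)
            y_lab.append(label)
            pool.remove(x)
    return fit_centroids(X_lab, y_lab), history, pool


centroids, history, pool = self_train([0, 1, 10, 11], [0, 0, 1, 1], [0.5, 10.5, 5.0])
print(history)  # two iterations: 2 new labels, then 0
print(pool)     # [5.0] stays unlabeled because its confidence is below the threshold
```

In this toy run the ambiguous point 5.0 never reaches the threshold, so the loop ends after the second iteration with that sample still unlabeled.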

Parameters:
  • X_labeled (array-like of shape (n_labeled_samples, n_features)) – Initial labeled training data. Can be numpy array or pandas DataFrame.

  • y_labeled (array-like of shape (n_labeled_samples,)) – Target values for labeled data. Can be numpy array or pandas Series.

  • X_unlabeled (array-like of shape (n_unlabeled_samples, n_features)) – Unlabeled training data to iteratively pseudo-label. Can be numpy array or pandas DataFrame.

  • X_val (array-like of shape (n_val_samples, n_features), optional) – Validation data for early stopping. If provided together with y_val, enables early stopping when the validation score plateaus.

  • y_val (array-like of shape (n_val_samples,), optional) – Validation targets for early stopping.
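
One common way to turn a validation score plateau into a stopping rule uses the patience and tol parameters described for __init__. The sketch below is an assumption about how such a rule can work, not a transcription of this library's logic; `early_stop_iteration` is a hypothetical helper and the scores are invented for illustration.

```python
from typing import List, Optional


def early_stop_iteration(scores: List[float],
                         patience: int = 3,
                         tol: float = 0.01) -> Optional[int]:
    """Return the 1-based iteration at which training would stop, or None."""
    best = float("-inf")
    stale = 0  # consecutive iterations without a sufficient improvement
    for iteration, score in enumerate(scores, start=1):
        if score > best + tol:  # assumed rule: must beat the best score by more than tol
            best = score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return iteration
    return None


# Invented validation scores: a quick gain followed by a plateau.
val_scores = [0.80, 0.85, 0.852, 0.851, 0.853, 0.90]
stop = early_stop_iteration(val_scores, patience=3, tol=0.01)
print(stop)  # 5
```

With a larger patience, for example `patience=5`, the same scores would not trigger a stop, because the jump to 0.90 resets the counter before it runs out.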

Returns:

self – Returns the fitted instance.

Return type:

SelfTrainingClassifier

classes_#

The classes seen during fit.

Type:

ndarray of shape (n_classes,)

history_#

Training history containing metrics for each iteration:

  • iteration: iteration number

  • labeled_data_count: number of labeled samples before adding new ones

  • new_labels_count: number of new pseudo-labels added

  • average_confidence: mean confidence of newly added samples

  • validation_score: validation score (if validation data provided)

  • stopping_reason: reason for stopping (if applicable)

Type:

list of dict

stopping_reason_#

Reason why training stopped (e.g., “Maximum iterations reached”, “Early stopping: no improvement”, “Labeling convergence”).

Type:

str

feature_names_#

Feature names if the input was a pandas DataFrame; None otherwise.

Type:

list or None

Examples

>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> from ssl_framework.main import SelfTrainingClassifier
>>>
>>> # Create sample data
>>> X_labeled = np.array([[0, 0], [1, 1], [10, 10], [11, 11]])
>>> y_labeled = np.array([0, 0, 1, 1])
>>> X_unlabeled = np.array([[0.5, 0.5], [10.5, 10.5], [5, 5]])
>>>
>>> # Fit SSL classifier
>>> ssl_clf = SelfTrainingClassifier(LogisticRegression())
>>> ssl_clf.fit(X_labeled, y_labeled, X_unlabeled)
>>>
>>> # Check training progress
>>> print(f"Stopped due to: {ssl_clf.stopping_reason_}")
>>> print(f"Training iterations: {len(ssl_clf.history_)}")

predict(X)#

Predict class labels for samples in X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Samples to predict. Can be numpy array or pandas DataFrame.

Returns:

y_pred – Predicted class labels for each sample.

Return type:

ndarray of shape (n_samples,)

predict_proba(X)#

Predict class probabilities for samples in X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Samples to predict probabilities for. Can be numpy array or pandas DataFrame.

Returns:

y_proba – Predicted class probabilities for each sample and class.

Return type:

ndarray of shape (n_samples, n_classes)
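
In scikit-learn classifiers, column j of the predict_proba output corresponds to classes_[j], and predict usually returns the class with the highest probability. Assuming this wrapper follows the same convention (the reference above does not state it explicitly), the relationship between the two methods can be sketched as follows; `labels_from_proba` is a hypothetical helper.

```python
from typing import List, Sequence


def labels_from_proba(proba: Sequence[Sequence[float]],
                      classes: Sequence) -> List:
    """Map each probability row to the class with the highest probability."""
    labels = []
    for row in proba:
        best_column = max(range(len(row)), key=row.__getitem__)
        labels.append(classes[best_column])
    return labels


classes = ["cat", "dog"]             # plays the role of classes_
proba = [[0.9, 0.1], [0.3, 0.7]]     # plays the role of predict_proba(X)
labels = labels_from_proba(proba, classes)
print(labels)  # ['cat', 'dog']
```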

set_fit_request(*, X_labeled: bool | None | str = '$UNCHANGED$', X_unlabeled: bool | None | str = '$UNCHANGED$', X_val: bool | None | str = '$UNCHANGED$', y_labeled: bool | None | str = '$UNCHANGED$', y_val: bool | None | str = '$UNCHANGED$') SelfTrainingClassifier#

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • X_labeled (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_labeled parameter in fit.

  • X_unlabeled (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_unlabeled parameter in fit.

  • X_val (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_val parameter in fit.

  • y_labeled (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for y_labeled parameter in fit.

  • y_val (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for y_val parameter in fit.

Returns:

self – The updated object.

Return type:

object

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') SelfTrainingClassifier#

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns:

self – The updated object.

Return type:

object

Examples#

Basic Usage#

from sklearn.linear_model import LogisticRegression
from ssl_framework.main import SelfTrainingClassifier
import numpy as np

# Sample data
X_labeled = np.array([[0, 0], [1, 1], [10, 10], [11, 11]])
y_labeled = np.array([0, 0, 1, 1])
X_unlabeled = np.array([[0.5, 0.5], [10.5, 10.5]])

# Create and fit SSL classifier
ssl_clf = SelfTrainingClassifier(LogisticRegression())
ssl_clf.fit(X_labeled, y_labeled, X_unlabeled)

# Make predictions
predictions = ssl_clf.predict([[0.2, 0.2], [10.2, 10.2]])
probabilities = ssl_clf.predict_proba([[0.2, 0.2], [10.2, 10.2]])

With Custom Strategies#

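from sklearn.linear_model import LogisticRegression
from ssl_framework.main import SelfTrainingClassifier
from ssl_framework.strategies import TopKFixedCount, ConfidenceWeighting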

ssl_clf = SelfTrainingClassifier(
    base_model=LogisticRegression(),
    selection_strategy=TopKFixedCount(k=10),
    integration_strategy=ConfidenceWeighting(),
    max_iter=5
)

With Early Stopping#

# Reuses the imports and X_labeled, y_labeled, X_unlabeled from Basic Usage
# Validation data for early stopping
X_val = np.array([[0.3, 0.3], [10.3, 10.3]])
y_val = np.array([0, 1])

ssl_clf = SelfTrainingClassifier(
    base_model=LogisticRegression(),
    patience=3,
    tol=0.01
)

ssl_clf.fit(X_labeled, y_labeled, X_unlabeled, X_val, y_val)
print(f"Stopped due to: {ssl_clf.stopping_reason_}")

Pandas Integration#

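import pandas as pd
from sklearn.linear_model import LogisticRegression
from ssl_framework.main import SelfTrainingClassifier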

# DataFrame inputs
X_labeled_df = pd.DataFrame([[0, 0], [1, 1]], columns=['x', 'y'])
y_labeled_series = pd.Series([0, 1], name='target')
X_unlabeled_df = pd.DataFrame([[0.5, 0.5]], columns=['x', 'y'])

ssl_clf = SelfTrainingClassifier(LogisticRegression())
ssl_clf.fit(X_labeled_df, y_labeled_series, X_unlabeled_df)

print(ssl_clf.feature_names_)  # ['x', 'y']