Estimator Guide¶

Estimators are the ML models at the core of the recommendation system. They predict engagement, conversion, or reward for user-item pairs.

Available Estimators¶

Classification Estimators¶

For binary outcomes (click/no-click, convert/no-convert):

XGBClassifierEstimator - XGBoost classifier (most popular)
LightGBMClassifierEstimator - LightGBM classifier (fast, memory efficient)
LogisticRegressionClassifierEstimator - Logistic regression (simple baseline)
SklearnUniversalClassifierEstimator - Wrap any sklearn-compatible classifier (e.g. RandomForestClassifier)
DeepFMClassifier - DeepFM for feature interactions (requires torch)

Regression Estimators¶

For continuous outcomes (revenue, time-spent, rating):

XGBRegressorEstimator - XGBoost regressor
LightGBMRegressorEstimator - LightGBM regressor
SklearnUniversalRegressorEstimator - Wrap any sklearn-compatible regressor (e.g. Ridge, RandomForestRegressor)

Multi-output Estimators¶

MultiOutputClassifierEstimator - Wrapper for multi-output classification: multiple binary numeric targets per user, with values strictly in {0, 1}. Pairs with MultioutputScorer in classifier mode and fits one separate booster per target (sklearn.MultiOutputClassifier). Two checks guard the targets. MultioutputScorer._validate_targets runs at the scorer level, before the wrapper sees the data, and enforces the strict {0, 1} value contract: strings and signed-integer encodings such as {-1, 1} are rejected there, while bool columns are accepted (in Python, True == 1 and False == 0 for set membership, so {True, False} collapses to {0, 1}). The wrapper's own _validate_binary_targets is a defense-in-depth pre-flight scan that keeps the wrapper safe to use standalone; it enforces the per-target cardinality contract of exactly two unique values per column, so single-class and 3+-class columns are both rejected. See scorer migration paths.
MultiOutputRegressorEstimator - Wrapper for multi-output regression (multiple continuous targets per user). Fits one separate booster per target. Pairs with MultioutputScorer in regressor mode for predictions like per-user revenue, engagement minutes, click counts, etc.
JointXGBMultiOutputClassifierEstimator - A single joint XGBoost booster over all N binary labels (vs N independent boosters above). Pairs with MultioutputScorer classifier mode; reshapes XGBoost's (n, N) predict_proba into the per-label list the scorer expects. multi_strategy='one_output_per_tree' (default, GPU-capable) does not share structure across labels; multi_strategy='multi_output_tree' (vector leaf, CPU-only) does — that's the only mode with genuine cross-label learning. See Per-label vs joint estimators.
JointXGBMultiOutputRegressorEstimator - The regressor analogue: a single joint XGBoost booster over N continuous targets, for MultioutputScorer regressor mode (no reshape needed — predict is already (n, N)).
Note on multi_strategy: "joint" means one model artifact; it does not automatically mean cross-target learning. Only multi_output_tree (or sklearn tree ensembles below) shares tree structure across targets. None of these condition one target on another's value — for that, use the conditional estimators under MixedTypeMultiTargetScorer.
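The reshape described for JointXGBMultiOutputClassifierEstimator can be pictured with plain Python lists. This is a minimal sketch, not the library's implementation: the helper name to_per_label_blocks is hypothetical, and it assumes that each column of the (n, N) matrix holds the positive-class probability for one label and that the per-label list follows sklearn's multilabel layout (N blocks of shape (n, 2)).

```python
from typing import List


def to_per_label_blocks(proba: List[List[float]]) -> List[List[List[float]]]:
    """Turn an (n, N) matrix of positive-class probabilities into
    N blocks of shape (n, 2), each row being [P(label=0), P(label=1)]."""
    if not proba:
        return []
    n_labels = len(proba[0])
    return [
        [[1.0 - row[j], row[j]] for row in proba]
        for j in range(n_labels)
    ]


# Two users, three labels (illustrative numbers only).
joint = [[0.9, 0.2, 0.5],
         [0.1, 0.7, 0.5]]
blocks = to_per_label_blocks(joint)  # list of 3 blocks, one per label
```

Each block can then be consumed exactly like the output of a single binary classifier for that label.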
sklearn tree ensembles (RandomForestClassifier, ExtraTreesClassifier, regressors) are joint multi-output for free: they natively accept a 2-D target matrix, grow multi-output trees that share structure across all targets by default, and their multilabel predict_proba already returns the list-of-blocks layout MultioutputScorer expects. Use them through the plain SklearnUniversalClassifierEstimator / SklearnUniversalRegressorEstimator (2-D y) — no dedicated estimator needed.
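The two target checks described for MultiOutputClassifierEstimator (the {0, 1} value contract and the two-unique-values cardinality contract) can be illustrated in a few lines. This is a sketch under stated assumptions, not the library's _validate_targets or _validate_binary_targets: the function name, the dict-of-columns input, and the error messages are all hypothetical.

```python
from typing import Dict, Sequence


def validate_binary_targets(columns: Dict[str, Sequence]) -> None:
    """Raise ValueError unless every target column is strictly binary.

    Check 1 (value contract): every value is in {0, 1}.
    Check 2 (cardinality contract): exactly two unique values per column.
    """
    for name, values in columns.items():
        unique = set(values)
        # True == 1 and False == 0, so bool columns pass this subset test;
        # strings and {-1, 1} encodings do not.
        if not unique <= {0, 1}:
            shown = sorted(unique, key=repr)
            raise ValueError(f"{name}: values must be in {{0, 1}}, got {shown}")
        if len(unique) != 2:
            raise ValueError(f"{name}: expected both classes, got {len(unique)} unique value(s)")


# Passes: an int column and a bool column, each containing both classes.
validate_binary_targets({"clicked": [0, 1, 1], "bought": [True, False, True]})
```

A column such as [1, 1, 1] passes the first check but fails the second, which is why the two contracts are described separately.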

Class weighting & fit-time parameters¶

Every sklearn-API estimator (XGBoost / LightGBM / sklearn wrappers, single- and multi-target) takes two optional constructor args (and matching set_sample_weight / set_fit_params setters):

sample_weight - a row-weight strategy: 'balanced' (calls compute_sample_weight('balanced', y) at fit time; this is the production class-balancing recipe), a callable fn(y) -> weights, or an explicit per-row array. Defaults to uniform weights.
fit_params - a dict of static kwargs forwarded verbatim to the wrapped model's fit (feature_weights, base_margin, a custom objective, callbacks, …).

from skrec.estimator.classification.xgb_classifier import XGBClassifierEstimator

# class-balanced multiclass (firmographics-industry recipe)
estimator = XGBClassifierEstimator(
    {"n_estimators": 426, "objective": "multi:softprob"},
    sample_weight="balanced",
)

This is the general replacement for ad-hoc weighting; the row weights are resolved against y at fit time, so they stay aligned through the scorer pipeline. (WeightedXGBClassifierEstimator's item/action weighting still exists and composes multiplicatively with a 'balanced' strategy.)
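To make the 'balanced' strategy concrete, the sketch below reproduces the formula that sklearn documents for balanced weights, n_samples / (n_classes * count(class)), in plain Python. The helper name balanced_weights is hypothetical and the code is illustrative only; the library itself delegates to compute_sample_weight.

```python
from collections import Counter
from typing import List, Sequence


def balanced_weights(y: Sequence) -> List[float]:
    """Per-row weights: n_samples / (n_classes * count of the row's class)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return [n_samples / (n_classes * counts[label]) for label in y]


y = [0, 0, 0, 1]
weights = balanced_weights(y)  # the rare class 1 gets the largest weight
```

Because it maps y to one weight per row, a function like this also has the fn(y) -> weights shape of the callable strategy; whether the library expects a list or an array as the return type is not stated here.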

Embedding Estimators¶

Specialized estimators for building models that learn user and item embeddings (e.g., two-tower, matrix factorization).

Factorized Inputs: Unlike general classifiers/regressors that take a single X matrix, embedding estimators are typically trained using separate DataFrames for users, items, and interactions. This is handled by the method fit_embedding_model(users: Optional[DataFrame], items: Optional[DataFrame], interactions: DataFrame, ...).
Specialized Prediction: Inference is made with predict_proba_with_embeddings, which supports two primary modes:
- Batch Prediction Mode: When called as predict_proba_with_embeddings(interactions: DataFrame, users: None), the estimator uses its internally learned user embeddings and features (if any) to make predictions for the users specified in the interactions DataFrame. This is suitable for offline batch scoring.
- Real-time Inference Mode: When called as predict_proba_with_embeddings(interactions: DataFrame, users: DataFrame), the users DataFrame supplies pre-computed user embeddings (under the USER_EMBEDDING_NAME column) and, optionally, other user features. This allows efficient real-time prediction from externally managed user embeddings.
Embedding Management:
- get_user_embeddings() -> DataFrame: Extracts the learned user embeddings from the trained model into a DataFrame, typically containing USER_ID_NAME and USER_EMBEDDING_NAME columns.
- truncate_user_data(): Modifies the estimator's internal state to reduce its memory footprint, typically after user embeddings have been extracted for deployment. This involves removing most user-specific data from the model while preserving a placeholder embedding for unknown users. This makes the model more lightweight for pickling and deployment in real-time systems.
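The lifecycle above (batch scoring, exporting embeddings, truncating, then real-time scoring) can be sketched with a toy model. Everything here is hypothetical: ToyEmbeddingModel, the UNKNOWN_USER key, and the use of dicts and tuples in place of DataFrames are illustrative stand-ins, not the library's classes or schema.

```python
import math
from typing import Dict, List, Optional, Tuple

Vector = List[float]
UNKNOWN_USER = "__unknown__"  # hypothetical placeholder key


class ToyEmbeddingModel:
    """Stand-in for an embedding estimator; not the library's class."""

    def __init__(self, user_emb: Dict[str, Vector], item_emb: Dict[str, Vector]):
        dim = len(next(iter(item_emb.values())))
        self.user_emb = dict(user_emb)
        self.user_emb.setdefault(UNKNOWN_USER, [0.0] * dim)
        self.item_emb = dict(item_emb)

    def predict(self, interactions: List[Tuple[str, str]],
                users: Optional[Dict[str, Vector]] = None) -> List[float]:
        # users=None: batch mode, look up internally stored embeddings.
        # users given: real-time mode, use the caller's embeddings instead.
        table = self.user_emb if users is None else users
        fallback = self.user_emb[UNKNOWN_USER]
        scores = []
        for user_id, item_id in interactions:
            u = table.get(user_id, fallback)
            v = self.item_emb[item_id]
            dot = sum(a * b for a, b in zip(u, v))
            scores.append(1.0 / (1.0 + math.exp(-dot)))  # sigmoid of dot product
        return scores

    def get_user_embeddings(self) -> Dict[str, Vector]:
        return {k: list(v) for k, v in self.user_emb.items() if k != UNKNOWN_USER}

    def truncate_user_data(self) -> None:
        # Keep only the placeholder so the pickled model stays small.
        self.user_emb = {UNKNOWN_USER: self.user_emb[UNKNOWN_USER]}


model = ToyEmbeddingModel({"u1": [1.0, 0.0]}, {"i1": [2.0, 0.0]})
batch = model.predict([("u1", "i1")])                # batch mode
store = model.get_user_embeddings()                  # export to an embedding store
model.truncate_user_data()                           # shrink before deployment
online = model.predict([("u1", "i1")], users=store)  # real-time mode
unknown = model.predict([("u1", "i1")])              # truncated: placeholder is used
```

After truncation the model no longer knows u1 on its own, so scoring without the external store falls back to the placeholder embedding, while passing the exported store reproduces the batch score.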

Available Embedding Estimators:

MatrixFactorizationEstimator - Native collaborative filtering - ALS and SGD, continuous/binary/ordinal outcomes; NumPy-only, no PyTorch. See Collaborative Filtering (Matrix Factorization).
NCFEstimator - Neural Collaborative Filtering - GMF, MLP, and NeuMF variants for collaborative filtering
NeuralFactorizationEstimator - Neural factorization with contextual interactions
ContextualizedTwoTowerEstimator - Two-tower architecture with three selectable context modes (user_tower, trilinear, scoring_layer). See Contextualized Two-Tower Guide.
DeepCrossNetworkEstimator - Deep cross network for feature interactions

Sequential Estimators¶

Specialized estimators for modelling the order of user interactions. Unlike embedding estimators, these are trained on sequences of items (sorted by timestamp) rather than individual user-item pairs. Both families listed below (SASRec and HRNN) support early stopping via early_stopping_patience together with restore_best_weights.

Class	Architecture	Loss	Use for
`SASRecClassifierEstimator`	Transformer (self-attention)	BCE	Implicit feedback, long histories
`SASRecRegressorEstimator`	Transformer (self-attention)	MSE	Explicit ratings
`HRNNClassifierEstimator`	GRU + GRU (two-level)	BCE	Session-structured data
`HRNNRegressorEstimator`	GRU + GRU (two-level)	MSE	Continuous outcomes with sessions

See SASRec Guide and HRNN Guide for full documentation.
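As a rough picture of what "sequences of items (sorted by timestamp)" means, the sketch below groups raw interaction rows into one ordered item list per user. The (user_id, item_id, timestamp) tuple layout and the helper name build_sequences are assumptions made for illustration; the actual input schema is described in the SASRec and HRNN guides.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_sequences(interactions: List[Tuple[str, str, int]]) -> Dict[str, List[str]]:
    """Group (user_id, item_id, timestamp) rows into per-user item lists,
    ordered by timestamp (oldest first)."""
    per_user = defaultdict(list)
    for user_id, item_id, _ts in sorted(interactions, key=lambda row: row[2]):
        per_user[user_id].append(item_id)
    return dict(per_user)


rows = [("u1", "b", 20), ("u2", "x", 5), ("u1", "a", 10), ("u1", "c", 30)]
sequences = build_sequences(rows)  # u1's items come out in time order: a, b, c
```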

Quick Start¶

from skrec.estimator.classification.xgb_classifier import XGBClassifierEstimator

# Initialize with hyperparameters
estimator = XGBClassifierEstimator({
    "learning_rate": 0.1,
    "n_estimators": 100,
    "max_depth": 5,
    "subsample": 0.8
})

Hyperparameter Tuning¶

Manual Tuning¶

estimator = XGBClassifierEstimator({
    "learning_rate": 0.1,
    "n_estimators": 100,
    "max_depth": 5
})

Tuned Estimators¶

Each estimator type ships with a Tuned* variant that wraps sklearn's GridSearchCV or RandomizedSearchCV:

from skrec.estimator.classification.xgb_classifier import TunedXGBClassifierEstimator
from skrec.estimator.datatypes import HPOType

param_space = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.05, 0.1],
}

estimator = TunedXGBClassifierEstimator(
    hpo_method=HPOType.GRID_SEARCH_CV,
    param_space=param_space,
    optimizer_params={"cv": 5, "scoring": "roc_auc"},
)
estimator.fit(X_train, y_train)
proba = estimator.predict_proba(X_test)
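It helps to know how much work a grid like the one above implies before launching it. The following sketch enumerates the candidates the same way an exhaustive grid search does (one value per hyperparameter, all combinations) and counts the cross-validation fits; it uses only the standard library and does not call the library or sklearn.

```python
from itertools import product

param_space = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.05, 0.1],
}
cv_folds = 5

# Every combination of one value per hyperparameter is a candidate.
names = list(param_space)
candidates = [dict(zip(names, values)) for values in product(*param_space.values())]

n_candidates = len(candidates)    # 3 * 3 * 3 = 27
n_fits = n_candidates * cv_folds  # one fit per candidate per fold = 135
```

If the wrapper leaves sklearn's refit default in place, one more fit on the full training data follows. When the count grows too large, a randomized search over the same space is the usual way to cap it.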

TunedEstimator can also be used directly with any sklearn-compatible estimator:

from sklearn.ensemble import RandomForestClassifier
from skrec.estimator.datatypes import HPOType
from skrec.estimator.tuned_estimator import TunedEstimator

estimator = TunedEstimator(
    estimator_class=RandomForestClassifier,
    hpo_method=HPOType.GRID_SEARCH_CV,
    param_space={"n_estimators": [50, 100], "max_depth": [3, 5]},
    optimizer_params={"cv": 3},
)
estimator.fit(X_train, y_train)
proba = estimator.predict_proba(X_test)

Learn more: HPO Guide

Choosing an Estimator¶

Decision Guide¶

What type of outcome?
- Binary (0/1) → Classification estimators
- Continuous (revenue, time) → Regression estimators
- Multiple outcomes → MultiOutput estimators
What's your priority?
- Performance: XGBoost or LightGBM
- Speed: LightGBM or LogisticRegression
- Interpretability: LogisticRegression or LinearRegression
- Feature interactions: DeepFM
- Learned embeddings + real-time inference: Embedding estimators
How much data?
- Large datasets (>100K rows): XGBoost, LightGBM, DeepFM, Embedding estimators
- Medium datasets: XGBoost, RandomForest
- Small datasets: LogisticRegression, LinearRegression
What's your deployment architecture?
- Traditional batch prediction: Any estimator
- Real-time with embedding store: Embedding estimators (NeuralFactorization, TwoTower, DeepCrossNetwork)
- Cold-start scenarios: Embedding estimators or content-based approaches

Comparison Table¶

Estimator	Speed	Performance	Interpretability	Data Needs	Use Case
XGBoost	Medium	⭐⭐⭐⭐⭐	Medium	Medium-Large	General-purpose
LightGBM	Fast	⭐⭐⭐⭐⭐	Medium	Medium-Large	General-purpose
LogisticRegression	Very Fast	⭐⭐⭐	High	Any	Baseline/Simple
RandomForest	Slow	⭐⭐⭐⭐	Medium	Medium-Large	General-purpose
DeepFMClassifier	Slow	⭐⭐⭐⭐⭐	Low	Large	Feature interactions
MatrixFactorization	Fast	⭐⭐⭐⭐	Medium (latent factors)	Medium-Large	Collaborative filtering (native)
NCF	Slow	⭐⭐⭐⭐⭐	Low	Large	Collaborative filtering (neural)
NeuralFactorization	Slow	⭐⭐⭐⭐⭐	Low	Large	Embeddings + context
TwoTower	Slow	⭐⭐⭐⭐⭐	Low	Large	Real-time embeddings + context modes
DeepCrossNetwork	Slow	⭐⭐⭐⭐⭐	Low	Large	Cross-feature learning

Best Practices¶

1. Start Simple¶

# Start with XGBoost and a small, conventional set of parameters
estimator = XGBClassifierEstimator({"learning_rate": 0.1, "n_estimators": 100})

2. Use Cross-Validation¶

# Use Tuned estimators with CV for robust parameter selection
estimator = TunedXGBClassifierEstimator(
    hpo_method=HPOType.GRID_SEARCH_CV,
    param_space=param_space,
    optimizer_params={"cv": 5},
)

3. Feature Engineering Matters¶

Good features > complex models
XGBoost/LightGBM handle raw features well
Deep models benefit from feature engineering

4. Monitor Training¶

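# XGBoost early stopping: set n_estimators high and let validation loss decide.
# XGBoost requires a validation set (eval_set) at fit time for this to work.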
estimator = XGBClassifierEstimator({
    "n_estimators": 1000,
    "early_stopping_rounds": 50
})
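The rule behind a setting like early_stopping_rounds can be shown without any ML library. This is a minimal sketch of patience-based early stopping in general, not XGBoost's or this library's code; the function name and the list-of-losses input are hypothetical.

```python
from typing import List, Tuple


def early_stopping_round(val_losses: List[float], patience: int) -> Tuple[int, int]:
    """Return (best_round, stop_round) for a sequence of validation losses.

    Training stops once `patience` rounds in a row fail to improve on the
    best loss seen so far; the model from best_round is the one to keep.
    """
    best_round, best_loss = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_round, best_loss = i, loss
        elif i - best_round >= patience:
            return best_round, i
    return best_round, len(val_losses) - 1


losses = [0.70, 0.60, 0.55, 0.56, 0.57, 0.58, 0.40]
best, stopped = early_stopping_round(losses, patience=3)
```

In this run, training halts at round 5 and keeps round 2, so the later drop to 0.40 is never reached. That is the trade-off a small patience value makes.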

API Contract¶

The library has two estimator families with different calling conventions:

Family	Training	Inference
Tabular (`BaseClassifier`, `BaseRegressor`)	`fit(X, y)`	`predict_proba(X)` or `predict(X)`
Embedding / Sequential (`BaseEmbeddingEstimator`, `SequentialEstimator`)	`fit_embedding_model(users, items, interactions)`	`predict_proba_with_embeddings(interactions, users)`

Within each family, the method contract is enforced by abstract base classes — any concrete subclass that doesn't implement the required hook raises TypeError at instantiation, not at call time.
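The instantiation-time failure is standard behavior of Python's abc module and can be reproduced in isolation. The class names below are hypothetical stand-ins, not the library's BaseClassifier.

```python
from abc import ABC, abstractmethod


class ToyBaseClassifier(ABC):
    """Hypothetical stand-in for a tabular base class."""

    @abstractmethod
    def fit(self, X, y): ...

    @abstractmethod
    def predict_proba(self, X): ...


class Incomplete(ToyBaseClassifier):
    def fit(self, X, y):
        return self
    # predict_proba is deliberately missing


class Complete(Incomplete):
    def predict_proba(self, X):
        return [[0.5, 0.5] for _ in X]


try:
    Incomplete()
    error = None
except TypeError as exc:
    error = exc  # raised at instantiation, before any method is called

proba = Complete().fit([[1]], [0]).predict_proba([[1], [2]])
```

The practical benefit is that a half-implemented custom estimator fails as soon as it is constructed rather than midway through a training pipeline.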

Implementation Details¶

For implementation details and the complete API, see:
- skrec/estimator/classification/ - Classification estimators
- skrec/estimator/regression/ - Regression estimators
- skrec/estimator/embedding/ - Embedding estimators (MF, NCF, TwoTower, DCN)
- skrec/estimator/sequential/ - Sequential estimators (SASRec, HRNN)
- skrec/estimator/base_estimator.py - Base estimator interface

Next Steps¶

Scorer Guide - Pair your estimator with the right scorer
HPO Guide - Optimize hyperparameters
Training Guide - Train your complete pipeline