Decision rule: which scorer + estimator for which data?
This page is the single source of truth for two recurring questions:
- Which scorer fits my data shape and target types?
- Within MixedTypeMultiTargetScorer, do I want joint or independent?
It also publishes the canonical per-TargetType metric dispatch table that the runtime, capability_matrix()["target_type_metric_compat"], scorers.md, and scikit-rec-agent all reference.
Decision tree
```text
Q1: Long format (USER_ID, ITEM_ID, OUTCOME triples) or wide (one row per user with ITEM_* columns)?
├── long
│   ├── Need timestamps? → InteractionsDataset + TIMESTAMP + SequentialScorer / HierarchicalScorer
│   ├── Single multi-class target (ITEM_ID IS the class)? → MulticlassScorer
│   └── Otherwise → UniversalScorer or IndependentScorer
└── wide
    ├── All-binary targets, single mode → MultioutputScorer (classifier mode)
    ├── All-continuous targets, single mode → MultioutputScorer (regressor mode)
    └── Heterogeneous targets (mix of binary, regression, multiclass, multilabel)
                                  │
                                  ▼
                MixedTypeMultiTargetScorer + one of:
                                  │
            ┌─────────────────────┴─────────────────────┐
            │                                           │
      Joint family                                 Independent
            │                                           │
   Targets are correlated;                  Targets are independent; OR
   want a shared representation;            you want per-target estimator
   moderate target count.                   type flexibility (e.g. XGB on
   Multilabel groups carry an               one target, LightGBM on another).
   inductive bias.                          Multilabel group inductive bias
            │                               is lost.
            │
    ┌───────┴──────────────────┐
    │                          │
JointMultiTargetMLP        JointMultiTargetTransformer
Default; small to          FT-Transformer-style; better
moderate features.         when pairwise feature
                           interactions matter.
```
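The long/wide half of the tree can be read as a small decision function. The sketch below is hypothetical: pick_scorer is not part of skrec, it returns the scorer names from the tree as plain strings, and it assumes the caller already knows the data format and the set of target type names. Treating every wide-format combination other than all-binary or all-continuous as heterogeneous is an assumption of the sketch.

```python
from typing import AbstractSet


def pick_scorer(
    long_format: bool,
    target_types: AbstractSet[str],
    has_timestamps: bool = False,
    item_id_is_class: bool = False,
) -> str:
    """Hypothetical walk of the decision tree; returns recommended scorer name(s)."""
    if long_format:
        if has_timestamps:
            # Used together with InteractionsDataset + TIMESTAMP.
            return "SequentialScorer / HierarchicalScorer"
        if item_id_is_class:
            return "MulticlassScorer"
        return "UniversalScorer or IndependentScorer"
    # Wide format: one row per user with ITEM_* columns.
    if target_types == {"BINARY"}:
        return "MultioutputScorer (classifier mode)"
    if target_types == {"REGRESSION"}:
        return "MultioutputScorer (regressor mode)"
    # Any other combination is treated as heterogeneous (assumption of this sketch).
    return "MixedTypeMultiTargetScorer"
```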
Per-TargetType metric dispatch (canonical table)
The runtime constant TARGET_TYPE_TO_METRICS in skrec.scorer.mixed_type_multi_target is the source of truth; capability_matrix()["target_type_metric_compat"] mirrors it; gate 7 (tests/test_mixed_type_multi_target_gates.py) asserts both agree with this table.
| Target type | Compatible metrics | Notes |
|---|---|---|
| BINARY | ROC_AUC, PR_AUC | Standard binary classification metrics |
| REGRESSION | RMSE, MAE | Per-target prediction error |
| MULTICLASS | MULTICLASS_ACCURACY | Top-1 accuracy (new v2 metric). For log-loss / macro-F1, use score_per_target |
| MULTILABEL member | ROC_AUC, PR_AUC | Each fanned-out member is binary |
Metrics outside this table are reachable via scorer.score_per_target(metric_callables=...); see scorers.md for the callable contract per TargetType.
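To make the dispatch idea concrete, here is a minimal sketch assuming the runtime constant behaves like a mapping from target-type name to a set of metric names. The DISPATCH literal is hand-copied from the table above, not imported from skrec.scorer.mixed_type_multi_target. The "MULTILABEL_MEMBER" key, split_metrics and tables_agree are all hypothetical. split_metrics separates the metrics the built-in dispatch covers from those that would need a score_per_target callable. tables_agree is an order-insensitive comparison in the spirit of gate 7, not that test's actual code.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

# Hand-copied mirror of the canonical table; NOT the real TARGET_TYPE_TO_METRICS.
DISPATCH: Dict[str, FrozenSet[str]] = {
    "BINARY": frozenset({"ROC_AUC", "PR_AUC"}),
    "REGRESSION": frozenset({"RMSE", "MAE"}),
    "MULTICLASS": frozenset({"MULTICLASS_ACCURACY"}),
    # Multilabel groups fan out into binary members, which reuse the binary metrics.
    "MULTILABEL_MEMBER": frozenset({"ROC_AUC", "PR_AUC"}),
}


def split_metrics(target_type: str, requested: List[str]) -> Tuple[List[str], List[str]]:
    """Split requested metrics into (built-in, needs a score_per_target callable)."""
    allowed = DISPATCH[target_type]
    built_in = [m for m in requested if m in allowed]
    needs_callable = [m for m in requested if m not in allowed]
    return built_in, needs_callable


def tables_agree(a: Dict[str, Iterable[str]], b: Dict[str, Iterable[str]]) -> bool:
    """Order-insensitive equality of two type -> metrics mappings."""
    def normalize(t: Dict[str, Iterable[str]]) -> Dict[str, FrozenSet[str]]:
        return {k: frozenset(v) for k, v in t.items()}
    return normalize(a) == normalize(b)
```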
Joint vs independent: when to pick which
| Question | Lean joint | Lean independent |
|---|---|---|
| Are your targets correlated? | Yes — joint shares a learned representation that improves prediction on each | No — independent estimators avoid the joint loss-balancing complexity |
| Do you have a multilabel group? | Yes — joint preserves group inductive bias | No — independent fans out into per-member binaries |
| Do you want per-target estimator-type flexibility (XGB here, LightGBM there)? | No | Yes — independent's per_target dict supports this directly |
| Is your data feature-rich (>20 features)? | Joint Transformer if pairwise feature interactions matter; joint MLP otherwise | Either; sub-estimator handles its own features |
| Do you have very different target scales (e.g. dollar regression + binary)? | Joint MLP / Transformer with regression_normalize=True (default) | Independent: each sub-estimator has its own scaling internals |
| Is HPO important? | Both work via skrec/orchestrator/hpo.py; joint has one model to tune | Both work; independent has K sub-estimators, so per-target tuning is more granular |
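The questions in this table can be folded into a rough rule of thumb. The sketch below is hypothetical: lean_family is not a skrec function, and the priority order is an assumption. Per-target estimator-type flexibility is treated as a hard requirement because only the independent family's per_target dict offers it. Target correlation or a multilabel group then leans joint, and pairwise feature interactions pick the Transformer over the MLP.

```python
def lean_family(
    correlated: bool,
    has_multilabel_group: bool,
    wants_per_target_estimator_types: bool,
    pairwise_interactions_matter: bool = False,
) -> str:
    """Hypothetical encoding of the joint-vs-independent table (priority order assumed)."""
    if wants_per_target_estimator_types:
        # Only the independent family supports mixing estimator types per target;
        # a multilabel group's inductive bias is given up in this case.
        return "independent"
    if correlated or has_multilabel_group:
        # Joint family: the Transformer when pairwise feature interactions matter.
        if pairwise_interactions_matter:
            return "JointMultiTargetTransformer"
        return "JointMultiTargetMLP"
    return "independent"
```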
When in doubt, start with the joint MLP (smaller default network, low setup cost) and compare it against the independent family with XGB defaults as a baseline. The v2 quickstart notebook runs this comparison side by side.
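Comparing the two families means comparing two per-target Dict[str, float] results whose metrics point in different directions: a higher AUC is better, a lower RMSE is better. The sketch below makes several assumptions. compare_per_target and the HIGHER_IS_BETTER map are hypothetical, the result keys are assumed to be target names, and the caller supplies which metric each target was scored with.

```python
from typing import Dict

# Direction per metric name from the dispatch table: True if higher is better.
HIGHER_IS_BETTER: Dict[str, bool] = {
    "ROC_AUC": True,
    "PR_AUC": True,
    "MULTICLASS_ACCURACY": True,
    "RMSE": False,
    "MAE": False,
}


def compare_per_target(
    joint: Dict[str, float],
    independent: Dict[str, float],
    metric_of: Dict[str, str],
) -> Dict[str, str]:
    """Return 'joint', 'independent' or 'tie' for each target scored by both models."""
    verdict: Dict[str, str] = {}
    for target in joint.keys() & independent.keys():
        a, b = joint[target], independent[target]
        if a == b:
            verdict[target] = "tie"
        elif (a > b) == HIGHER_IS_BETTER[metric_of[target]]:
            # Joint moved in the "better" direction for this metric.
            verdict[target] = "joint"
        else:
            verdict[target] = "independent"
    return verdict
```

A per-target verdict keeps the comparison honest in the same way evaluate() does: no cross-type averaging is needed to see where each family wins.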
When to add a new target type vs. encode as an existing one
| Situation | Choose |
|---|---|
| Multi-class target with 2 classes only | BINARY (multiclass machinery is overkill) |
| Continuous target bounded to [0, 1] (e.g. probability) | REGRESSION — model handles the bound; switching to BINARY loses information |
| Multi-class target with class imbalance | MULTICLASS — top-1 accuracy is the v2 metric; for class-weighted F1, use score_per_target |
| Several related binary outcomes (engagement signals) | TargetGroupSpec(type=MULTILABEL, columns=[...]) — joint families preserve the group bias |
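The first three rows can be expressed as a rough heuristic over a single column's values. The sketch is hypothetical: suggest_target_type is not a skrec function, it returns type names as strings, and it assumes the caller states whether the column is categorical. It cannot detect the multilabel case, because that requires knowing that several columns belong together, which is what TargetGroupSpec declares.

```python
from typing import Hashable, Sequence


def suggest_target_type(values: Sequence[Hashable], is_categorical: bool) -> str:
    """Hypothetical heuristic for the edge cases above; returns a type name."""
    if is_categorical:
        # Two classes only: BINARY, since the multiclass machinery is overkill.
        # Class imbalance does not change the choice: 3+ classes stay MULTICLASS.
        return "BINARY" if len(set(values)) <= 2 else "MULTICLASS"
    # Continuous values, including ones bounded to [0, 1], stay REGRESSION:
    # thresholding them into BINARY would discard information.
    return "REGRESSION"
```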
"No scalar default" for evaluation
MixedTypeMultiTargetScorer.evaluate() always returns Dict[str, float]. There is no honest macro aggregation across heterogeneous target types (binary AUC of 0.85 and regression RMSE of 12.7 aren't on a common scale).
If you need a single number for HPO or model comparison:
- Pick a primary target (the one your downstream task cares about most) and use that target's metric as the objective.
- Compose a weighted aggregate via score_per_target with user-supplied callables:
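```python
# Caller-chosen weights per target; the choice belongs to the caller, not the scorer.
weights = {"ITEM_clicked": 0.5, "ITEM_revenue": 0.5}
# Per-target metrics as Dict[str, float] (arguments elided here).
result = scorer.score_per_target(...)
# Weighted sum over the targets that have a weight; unweighted targets are ignored.
score = sum(weights[k] * v for k, v in result.items() if k in weights)
```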
This is intentional — baking a "primary metric" default into the scorer would mask a per-use-case choice that belongs in the caller's hands.
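A weighted sum still adds numbers on different scales and with opposite directions (an AUC to maximize, an RMSE to minimize). One option, sketched here under stated assumptions, is to divide each metric by a caller-chosen reference scale and negate lower-is-better metrics before weighting, so the result is a single number to maximize. weighted_objective, the scale values and the higher-is-better flags are hypothetical inputs you choose; none of them is a skrec default.

```python
from typing import Dict


def weighted_objective(
    result: Dict[str, float],
    weights: Dict[str, float],
    higher_is_better: Dict[str, bool],
    scale: Dict[str, float],
) -> float:
    """Combine per-target metrics into one number to maximize.

    Every weighted target must be present in result, higher_is_better and scale.
    """
    total = 0.0
    for target, w in weights.items():
        # Bring each metric onto a comparable, caller-defined scale.
        value = result[target] / scale[target]
        if not higher_is_better[target]:
            # Flip error metrics such as RMSE so that larger is better.
            value = -value
        total += w * value
    return total
```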
Real-time label conditioning (v3, available)
For scenarios where the caller has observed the ground truth for some targets at inference and wants those values to condition predictions for the rest, use a conditional estimator:
- ConditionalJointMultiTargetMLPEstimator
- ConditionalJointMultiTargetTransformerEstimator
Both implement the runtime-checkable ConditionalMultiTargetEstimator Protocol subclass. The scorer's inference validator delegates to the estimator: vanilla estimators reject OBSERVED_* columns with a clear error, while conditional estimators accept them and enforce the multilabel-group rule that all members of a group must be masked together in each row. When OBSERVED_* columns are present, predictions flow through predict_with_observed(X, observed); the vanilla predict_proba_dict(X) is equivalent to passing an empty observed dict.
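The routing described above can be sketched with typing.Protocol and runtime_checkable. This is a toy stand-in, not skrec's code. The Protocol and method names mirror this section, but several details are assumptions: the signatures, X as a list of row dicts, observed values as per-row dicts, the OBSERVED_ prefix handling, and the predict and check_group_masks_together helpers. For brevity the checks are inlined in one function, whereas the section says skrec delegates them to the estimator.

```python
from typing import Dict, List, Protocol, Sequence, runtime_checkable

Row = Dict[str, float]
Predictions = Dict[str, List[float]]
PREFIX = "OBSERVED_"  # assumed column prefix, as in OBSERVED_* above


@runtime_checkable
class MultiTargetEstimator(Protocol):
    def predict_proba_dict(self, X: List[Row]) -> Predictions: ...


@runtime_checkable
class ConditionalMultiTargetEstimator(MultiTargetEstimator, Protocol):
    def predict_with_observed(self, X: List[Row], observed: List[Row]) -> Predictions: ...


def check_group_masks_together(rows: List[Row], members: Sequence[str]) -> None:
    """Each row must observe either all members of a multilabel group or none."""
    for i, row in enumerate(rows):
        seen = [m for m in members if PREFIX + m in row]
        if seen and len(seen) != len(members):
            raise ValueError(f"row {i}: group {list(members)} partially observed: {seen}")


def predict(
    estimator: MultiTargetEstimator,
    rows: List[Row],
    groups: Sequence[Sequence[str]] = (),
) -> Predictions:
    if not any(k.startswith(PREFIX) for row in rows for k in row):
        # No observed labels anywhere: the plain prediction path.
        return estimator.predict_proba_dict(rows)
    if not isinstance(estimator, ConditionalMultiTargetEstimator):
        raise ValueError("this estimator does not accept OBSERVED_* columns")
    for members in groups:
        check_group_masks_together(rows, members)
    # Split each row into features and observed target values (prefix stripped).
    features = [{k: v for k, v in r.items() if not k.startswith(PREFIX)} for r in rows]
    observed = [{k[len(PREFIX):]: v for k, v in r.items() if k.startswith(PREFIX)} for r in rows]
    return estimator.predict_with_observed(features, observed)
```

Because runtime_checkable isinstance only checks that the methods exist, any class providing both methods is treated as conditional; a class with only predict_proba_dict falls on the vanilla side.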
capability_matrix()["scorer_supports_observed_conditioning"] is now ("mixed_type_multi_target",).
OBSERVED_* columns are auto-preserved through interactions_schema.apply() via the BaseScorer.preserved_inference_columns() hook even when the client schema doesn't declare them, so recommend_online works out of the box.
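The preservation behaviour can be pictured as "keep declared columns plus anything the scorer asks to preserve". This is a rough sketch under assumptions. apply_schema is a hypothetical stand-in for interactions_schema.apply(). The schema is reduced to a list of declared column names. preserved_prefixes plays the role of BaseScorer.preserved_inference_columns(), assumed here to yield name prefixes. None of these signatures come from skrec.

```python
from typing import Dict, List, Sequence

Row = Dict[str, float]


def apply_schema(
    rows: List[Row],
    declared: Sequence[str],
    preserved_prefixes: Sequence[str] = ("OBSERVED_",),
) -> List[Row]:
    """Keep declared columns, plus any column matching a preserved prefix."""
    declared_set = set(declared)
    prefixes = tuple(preserved_prefixes)  # str.startswith accepts a tuple of prefixes
    return [
        {k: v for k, v in row.items() if k in declared_set or k.startswith(prefixes)}
        for row in rows
    ]
```

Without the preserved prefixes, undeclared OBSERVED_* columns would be dropped before they reach the estimator, which is the failure the hook avoids for recommend_online.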
Independent + conditional is NOT supported (v3 locked decision #1). Feeding observed values of other targets in as features is structurally different from joint masking; if revisited, it would land in v4+. Use the joint families when you need conditioning.
See mixed_type_multi_target_plan_v3.md for the four locked design decisions and the v3 implementation details.