How It Works
This page explains what is happening end-to-end: how point-in-time features and forward-looking labels are built, how training and inference run inside your warehouse, what models are trained, how they're stored and versioned, when retraining happens, and how to interpret output metrics.
If you just want to get to first outputs, start with the Quickstart. If you want to understand or debug the pipeline, start here.
Pipeline overview
At a high level, the pipeline does the following:
- Define a monthly modeling grain: one row per `(person_id, anchor_month)`.
- Build features from history ending at `anchor_month_end_date` (no leakage).
- Build labels from outcomes starting after the anchor month (forward window).
- Assemble a sparse model matrix per target/horizon and data source.
- Train one model per `(data_source, target_key)` and bundle them into a single versioned artifact.
- Score the latest artifact for the configured prediction anchor month.
- Write predictions and standardized evaluation metrics back as warehouse tables.
Where it runs
The package is intentionally split between SQL and Python:
- SQL models: build anchors, features, labels, and sparse-coordinate artifacts. These run as standard warehouse SQL and support Snowflake, Databricks, and Microsoft Fabric.
- Python models: train, score, and compute metrics inside the configured warehouse runtime.
The Python models are: `train_model_registry`, `predict_values`, `predict_probabilities_long`, and `train_metrics_long`. Everything else is SQL that prepares the inputs they consume.
Configuration
Configuration is driven by dbt vars and staged into small config tables so that Python models read a consistent view at runtime.
Key levers:
- Runtime config: train/predict anchor bounds, `test_size`, random seed, artifact stage, and `ml_force_train`.
- Target policy: which targets/horizons are enabled and which claims are included via `target_dimension` + exact `target_values`.
- Feature policy: enable/disable feature groups (demographics, utilization, conditions, HCC).
- Count probability policy: thresholds `k` used to publish `P(Y >= k)` outputs.
- Spend percentile probability policy: percentile cutoffs used to publish `P(spend in top k%)` outputs for `paid_amount` targets.
Point-in-time design (no leakage)
The modeling grain is monthly anchors:
- Anchor row: `(person_id, anchor_month)`
- Feature windows: lookbacks ending at `anchor_month_end_date`
- Outcome windows: starting at `anchor_month + 1 month`, extending `horizon_months` forward
Labels are always in the future relative to features. There is no lookahead by construction.
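The window arithmetic above can be sketched in a few lines. This is an illustrative helper, not the package's code (the package builds these windows in SQL); the function names `add_months` and `anchor_windows` are assumptions for the example.

```python
from datetime import date

def add_months(d: date, n: int) -> date:
    """Shift a first-of-month date by n calendar months."""
    total = d.year * 12 + (d.month - 1) + n
    return date(total // 12, total % 12 + 1, 1)

def anchor_windows(anchor_month: date, lookback_months: int, horizon_months: int):
    """Return (feature_start, feature_end_exclusive, outcome_start,
    outcome_end_exclusive) for one anchor. Features end at the anchor
    month; outcomes start at anchor_month + 1 month."""
    feature_end = add_months(anchor_month, 1)        # exclusive bound = month after anchor
    feature_start = add_months(feature_end, -lookback_months)
    outcome_start = feature_end                      # anchor_month + 1 month
    outcome_end = add_months(outcome_start, horizon_months)
    return feature_start, feature_end, outcome_start, outcome_end

# 12-month lookback, 6-month horizon anchored at 2024-03:
# features cover 2023-04 through 2024-03, outcomes cover 2024-04 through 2024-09.
print(anchor_windows(date(2024, 3, 1), 12, 6))
```

Note that the feature window's exclusive upper bound equals the outcome window's start, so the two can never overlap: labels are in the future by construction.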
Training window layout
- `ml_train_anchor_start_month` and `ml_train_anchor_end_month` bound the date range of eligible training anchors.
- `ml_train_anchor_stride_months` controls how many anchors are sampled per person within that range, reducing overlap between adjacent months.
- For each anchor, feature windows look backward and are shortened by `ml_claims_lag_months` to match real-world claims completeness at prediction time.
- Target windows begin after the anchor month and extend forward by `horizon_months`. An anchor only receives a complete label when its full target window is observable in the data.
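The striding behavior can be illustrated with a minimal sketch. This is an assumption about the sampling rule (keep every Nth anchor per person, offset from that person's earliest anchor); the package implements striding in SQL and may anchor the offset differently.

```python
def stride_anchors(anchor_months, stride_months):
    """Keep every `stride_months`-th anchor for one person, deterministically.
    `anchor_months` is a sorted list of month indices (e.g. year*12 + month)."""
    if not anchor_months:
        return []
    first = anchor_months[0]
    return [m for m in anchor_months if (m - first) % stride_months == 0]

# 30 consecutive monthly anchors, stride of 12: one anchor per year survives,
# so adjacent training rows for the same person share little window overlap.
months = list(range(100, 130))
print(stride_anchors(months, 12))   # [100, 112, 124]
```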
Features
Features are built in a canonical long format:
- Each row is `(person_id, anchor_month, feature_name, window_months, feature_value, feature_data_type)`.
- Categorical features (e.g., `sex`, `race`, `state`) are automatically one-hot encoded during matrix assembly.
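A small sketch of how long rows pivot into a wide matrix, including the one-hot step for categoricals. The row shape is from the docs above; the pivot itself and the column-naming scheme (`sex__F`, `paid_amount_sum_12m`) are illustrative assumptions, not the package's internals.

```python
# Long-format rows: (person_id, anchor_month, feature_name,
#                    window_months, feature_value, feature_data_type)
rows = [
    ("p1", "2024-03", "paid_amount_sum", 12, 1500.0, "numeric"),
    ("p1", "2024-03", "sex", None, "F", "categorical"),
    ("p2", "2024-03", "paid_amount_sum", 12, 200.0, "numeric"),
    ("p2", "2024-03", "sex", None, "M", "categorical"),
]

def to_wide(rows):
    """Pivot long rows to one dict per (person_id, anchor_month);
    categorical values become one-hot indicator columns."""
    wide = {}
    for pid, anchor, name, window, value, dtype in rows:
        rec = wide.setdefault((pid, anchor), {})
        if dtype == "categorical":
            rec[f"{name}__{value}"] = 1.0   # one-hot encode at assembly time
        else:
            col = name if window is None else f"{name}_{window}m"
            rec[col] = float(value)
    return wide

print(to_wide(rows))
```

Absent columns (a person with no `sex__M` row, for example) stay missing in this dict-of-dicts form, which is what makes a sparse-coordinate representation natural downstream.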
Feature groups (v1):
- Demographics: age, sex, race, state, and enrollment/history flags.
- Utilization: paid and encounter counts over 3/6/12 month windows.
- Conditions: binary indicators for condition assignments in the last 12 months.
- HCC: binary indicators for CMS-HCC assignments in the last 12 months.
The feature policy controls which groups are active at runtime. The feature dictionary defines each feature's metadata and default fill value.
Labels
Targets are computed as per-member-per-month rates over the forward horizon:
- Spend (
paid_amount): paid amount in the filtered outcome claims divided by member-month exposure. - Utilization counts (
encounter_count): distinct encounter count in the filtered outcome claims divided by member-month exposure.
Labels are only considered trainable when the outcome window is complete, which prevents partial labels from contaminating training.
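The label rule plus the completeness gate can be sketched as follows. Names and the month-index encoding are illustrative assumptions; the package computes labels in SQL.

```python
def pmpm_label(total_outcome, member_months, last_observed_month, outcome_end_month):
    """Per-member-per-month label plus a trainability flag.
    The flag is True only when the full forward outcome window
    (through outcome_end_month) is observable in the data."""
    window_complete = last_observed_month >= outcome_end_month
    rate = (total_outcome / member_months) if member_months else None
    return rate, window_complete

# $1,200 paid over 5 member-months of exposure; outcome window ends 2024-09
# and data runs through 2024-10, so the label is complete and trainable.
print(pmpm_label(1200.0, 5, 202410, 202409))   # (240.0, True)
```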
Training set construction
Before fitting, the training population is constrained to avoid leakage and reduce redundant anchors:
- Label completeness filter: only rows where the forward outcome window is complete are included.
- Train anchor window: you can bound the anchor months used for training; if unset, defaults to the min/max anchors whose forward outcome windows are fully observable. This is an outcome-completeness rule, not a full-lookback-history rule.
- Deterministic anchor striding: by default, a per-person subset of anchors is used (e.g., every 12th month) to reduce overlap across adjacent months. Controlled by `ml_train_anchor_stride_months`.
- Train/test split by person: a grouped shuffle split on `person_id` so anchors from the same person never leak across train and test sets.
Early anchors can still have partial feature history. The package does not require a full 12 months of prior enrollment or claims to keep an anchor; instead it carries that information into training through features such as `member_months_lookback_12m` and `cold_start_flag`.
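The grouped split can be sketched with the standard library (the package runs an equivalent grouped shuffle split on `person_id`; this stdlib version is an illustrative stand-in):

```python
import random

def grouped_train_test_split(person_ids, test_size=0.2, seed=42):
    """Split row indices by person so no person_id appears in both
    train and test. Shuffling the distinct people with a fixed seed
    makes the split deterministic."""
    people = sorted(set(person_ids))
    rng = random.Random(seed)
    rng.shuffle(people)
    n_test = max(1, round(len(people) * test_size))
    test_people = set(people[:n_test])
    train_idx = [i for i, p in enumerate(person_ids) if p not in test_people]
    test_idx = [i for i, p in enumerate(person_ids) if p in test_people]
    return train_idx, test_idx

# Multiple anchors per person: every person's rows land wholly on one side.
ids = ["a", "a", "b", "b", "c", "c", "d"]
train_idx, test_idx = grouped_train_test_split(ids, test_size=0.25, seed=0)
```

Splitting on rows instead of people would put one person's 2024-01 anchor in train and their 2024-02 anchor in test, and those rows share most of their feature window, which is exactly the leakage this avoids.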
Models
The package trains gradient-boosted tree regressors (XGBoost) in a warehouse-native Python runtime:
- `paid_amount` targets use a Tweedie regression objective (`reg:tweedie`) for non-negative, heavy-tailed spend distributions.
- `encounter_count` targets use a Poisson objective (`count:poisson`) for non-negative event rates.

Training runs independently per `data_source`. The published artifact contains separate model entries keyed by `(data_source, target_key)`.
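The objective selection above can be sketched as a parameter map. The two objective strings are real XGBoost objectives from the docs; the other hyperparameters (`tweedie_variance_power`, `max_depth`) are illustrative assumptions, not the package's defaults.

```python
def xgb_params(target_key: str) -> dict:
    """Choose the XGBoost objective by target family, as described above."""
    if target_key.startswith("paid_amount"):
        return {
            "objective": "reg:tweedie",
            "tweedie_variance_power": 1.5,  # assumed: between Poisson (1) and gamma (2)
            "max_depth": 6,                 # assumed
        }
    if target_key.startswith("encounter_count"):
        return {"objective": "count:poisson", "max_depth": 6}
    raise ValueError(f"unknown target family: {target_key}")

print(xgb_params("paid_amount_12m")["objective"])      # reg:tweedie
print(xgb_params("encounter_count_6m")["objective"])   # count:poisson
```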
Calibration
After training, a simple multiplicative calibration factor is computed on the training set:
`calibration_factor = sum(pred_raw_train) / sum(actual_train)`
At scoring time:
`predicted_value = pred_raw / calibration_factor`
This aligns aggregate predicted totals to aggregate actuals, making the `pa_ratio` metric easier to interpret (closer to 1.0 is better).
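The two formulas above, as a direct sketch:

```python
def calibration_factor(pred_raw_train, actual_train):
    """calibration_factor = sum(pred_raw_train) / sum(actual_train)"""
    return sum(pred_raw_train) / sum(actual_train)

def calibrate(pred_raw, factor):
    """predicted_value = pred_raw / calibration_factor"""
    return pred_raw / factor

# Raw predictions total 220 against actuals totaling 200, so factor = 1.1.
factor = calibration_factor([130.0, 90.0], [100.0, 100.0])
# After calibration, aggregate predictions match aggregate actuals (pa_ratio -> 1.0).
total = sum(calibrate(p, factor) for p in [130.0, 90.0])   # 200.0
```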
For count targets, `P(Y >= k)` probabilities are derived by treating the calibrated prediction as a Poisson mean and computing the tail probability.
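The Poisson tail probability is a closed-form computation; a plain-Python sketch (the same quantity is available as `scipy.stats.poisson.sf(k - 1, mu)` where SciPy is present):

```python
import math

def p_count_at_least(mu: float, k: int) -> float:
    """P(Y >= k) when Y ~ Poisson(mu): one minus the CDF at k - 1."""
    return 1.0 - sum(math.exp(-mu) * mu**i / math.factorial(i) for i in range(k))

# A calibrated prediction of 2 encounters per period: chance of 3 or more.
print(round(p_count_at_least(2.0, 3), 4))   # 0.3233
```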
For spend targets, `P(spend in top k%)` probabilities are derived by:
- computing the spend cutoff for the requested top `k%` separately for each `data_source` and spend target / horizon on the training labels
- turning that into a binary event such as `1[paid_amount >= cutoff]`
- fitting a one-dimensional calibration map from the model's predicted spend to that binary event
At prediction time, the calibrated spend prediction is passed through that learned map to produce a probability.
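The three steps can be sketched end-to-end. The quantile-binned map below is a stand-in assumption: the docs do not specify which one-dimensional calibration map the package fits (isotonic regression is a common choice for this), and the percentile arithmetic is illustrative.

```python
def top_k_cutoff(train_spend, k_pct):
    """Step 1: spend cutoff so roughly k% of training labels sit at or above it."""
    s = sorted(train_spend)
    idx = int(len(s) * (1 - k_pct / 100.0))
    return s[min(idx, len(s) - 1)]

def fit_binned_map(preds, events, n_bins=4):
    """Step 3 (sketch): quantile-bin predicted spend and record the observed
    event rate per bin, giving a monotone-ish map from prediction to probability."""
    pairs = sorted(zip(preds, events))
    size = max(1, len(pairs) // n_bins)
    return [(chunk[-1][0], sum(e for _, e in chunk) / len(chunk))
            for chunk in (pairs[i:i + size] for i in range(0, len(pairs), size))]

def predict_top_k_probability(bins, pred_spend):
    """At prediction time: look up the calibrated spend prediction in the map."""
    for upper, rate in bins:
        if pred_spend <= upper:
            return rate
    return bins[-1][1]

spend = [100, 200, 300, 400, 500, 600, 700, 5000]
cutoff = top_k_cutoff(spend, 25)                       # 700: top 25% of this sample
events = [1 if x >= cutoff else 0 for x in spend]      # Step 2: 1[paid_amount >= cutoff]
bins = fit_binned_map(spend, events)                   # here preds == labels (a perfect model)
print(predict_top_k_probability(bins, 800))            # 1.0
```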
Model storage and versioning
Training publishes a single bundle per run to warehouse-managed artifact storage:
- A `model_version` is generated (timestamp + uuid suffix).
- The bundle contains: runtime config used for training, per-target feature column lists, trained XGBoost estimators, and calibration factors.
- The bundle is serialized as a pickle file stored in the location configured by `ml_artifact_stage`. In Snowflake, that is typically `@<database>.<schema>.ML_MODEL_STAGE`.
The registry table stores the `artifact_uri` so inference can fetch and load the exact artifact.
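A minimal sketch of the publication step, assuming the bundle layout described above. The field names, the local temp-dir path (standing in for `ml_artifact_stage`), and the exact version format are illustrative assumptions; only "timestamp + uuid suffix, pickled bundle" comes from the docs.

```python
import os
import pickle
import tempfile
import uuid
from datetime import datetime, timezone

def publish_bundle(models, calibration, runtime_config, feature_columns):
    """Generate a model_version and pickle the full bundle to one file."""
    model_version = (datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
                     + "_" + uuid.uuid4().hex[:8])
    bundle = {
        "model_version": model_version,
        "runtime_config": runtime_config,     # config used for this training run
        "feature_columns": feature_columns,   # per-target column lists
        "models": models,                     # keyed by (data_source, target_key)
        "calibration": calibration,           # per-target calibration factors
    }
    path = os.path.join(tempfile.gettempdir(), f"{model_version}.pkl")
    with open(path, "wb") as f:
        pickle.dump(bundle, f)
    return model_version, path               # path plays the role of artifact_uri
```

Bundling everything in one file is what makes scoring reproducible: inference loads the exact feature columns and calibration factors the estimators were trained with.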
When retraining happens
Two modes control retraining:
- `ml_force_train: false` (default): reuses an existing artifact bundle if the training signature matches a prior entry in `train_model_registry_history`.
- `ml_force_train: true`: always trains a new model version.
The training signature is a deterministic hash of: train/test split controls (`random_seed`, `test_size`), claims-lag settings, the enabled target/horizon plan with available training window characteristics (row counts and anchor bounds), and the per-target feature-column layout used for training.
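A sketch of how such a signature behaves. The real package's serialization and hash are not shown here; canonical JSON plus SHA-256 is one standard way to get a deterministic hash, and it illustrates why any change to these inputs forces a retrain.

```python
import hashlib
import json

def training_signature(split_controls, claims_lag, target_plan, feature_layout):
    """Hash a canonical (sorted-keys) JSON rendering of the signature inputs."""
    payload = json.dumps(
        {"split": split_controls, "claims_lag": claims_lag,
         "targets": target_plan, "features": feature_layout},
        sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode()).hexdigest()

a = training_signature({"random_seed": 42, "test_size": 0.2}, 3,
                       {"paid_amount_12m": {"rows": 1000}}, ["age", "sex__F"])
b = training_signature({"random_seed": 42, "test_size": 0.2}, 3,
                       {"paid_amount_12m": {"rows": 1001}}, ["age", "sex__F"])
print(a == b)   # False: one extra training row changes the signature
```

This is why new months of data typically trigger automatic retraining: they change row counts and anchor bounds, which are part of the hashed plan.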
Practical guidance:
- If you want to force a new artifact even when the signature would otherwise match, use `ml_force_train: true`.
- If new months of data arrive, the signature typically changes and a new model is trained automatically.
- Reuse is evaluated at the full bundle level. If you add, remove, or change any target, all targets retrain together.
- A policy row with multiple `target_values` expands into multiple distinct targets. It is shorthand for repeated YAML, not a combined target.
Evaluation metrics
After training, `train_metrics_long` loads the latest artifact, re-scores the training matrix, and computes metrics on a person-level grouped train/test split.
All targets:
| Metric | Description |
|---|---|
| `mae` | Mean absolute error |
| `rmse` | Root mean squared error |
| `r2` | Coefficient of determination |
| `pa_ratio` | sum(pred) / sum(actual); aggregate calibration check |
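The four all-target metrics, written out in plain Python as a reference for how they relate:

```python
import math

def regression_metrics(actual, pred):
    """mae, rmse, r2, and pa_ratio as defined in the table above."""
    n = len(actual)
    mae = sum(abs(a - p) for a, p in zip(actual, pred)) / n
    rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / n)
    mean_actual = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, pred))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot           # < 0 means worse than predicting the mean
    pa_ratio = sum(pred) / sum(actual) # aggregate calibration check
    return {"mae": mae, "rmse": rmse, "r2": r2, "pa_ratio": pa_ratio}

print(regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 310.0]))
```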
Probability targets (for each configured threshold):
| Metric | Description |
|---|---|
| `auc` | ROC AUC for the configured binary event |
| `brier` | Brier score (calibration + sharpness) |
| `logloss` | Cross-entropy loss |
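The three probability metrics in plain Python (the pairwise form of AUC shown here is equivalent to the usual ROC-curve definition; a production pipeline would use a library implementation):

```python
import math

def binary_metrics(y, p):
    """ROC AUC, Brier score, and log loss for binary labels y and probabilities p."""
    n = len(y)
    brier = sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / n
    eps = 1e-15   # clip to avoid log(0)
    logloss = -sum(yi * math.log(max(pi, eps)) + (1 - yi) * math.log(max(1 - pi, eps))
                   for yi, pi in zip(y, p)) / n
    # AUC = P(score of a random positive > score of a random negative), ties count half.
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    auc = wins / (len(pos) * len(neg))
    return {"auc": auc, "brier": brier, "logloss": logloss}

print(binary_metrics([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```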
Is the model output good?
There is no single universal threshold. Difficulty depends on population mix, claims completeness, and event prevalence. Use these checks to quickly detect broken or weak models.
Sanity checks (all targets):
- `pa_ratio` near `1.0` is a basic calibration check. Values far from 1.0 indicate systematic under/over prediction.
- Compare `mean_pred` vs `mean_actual`. They should be in the same ballpark.
- Watch for tiny `feature_count`, tiny `x_train_nnz`, or near-zero prediction variance; those typically indicate a feature pipeline issue.
Spend (`paid_amount`):
- `r2 < 0` means the model is worse than predicting the mean. That is a strong signal something is wrong.
- Individual-level healthcare spend is genuinely hard to predict and modest R² is normal. See the industry context below for what to expect.
- Focus on `mae`/`rmse` relative to `mean_actual`, and validate ranking performance. Are top-decile predictions meaningfully higher than average?
Count thresholds (`encounter_count`, `P(Y >= k)`):
- `auc = 0.5` is random. `0.6–0.7` is modest signal; `0.7–0.8` is typically good; `> 0.8` is strong but not always achievable depending on the event.
- Brier score is easiest to interpret relative to a baseline that always predicts prevalence; lower is better.
Spend percentile thresholds (`paid_amount`, `P(spend in top k%)`):
- Interpret AUC/Brier/logloss the same way as other binary-event metrics, but remember these are harder events as `k` gets smaller. `top 1%` is a very sparse event, so expect noisier metrics than `top 5%`.
- These thresholds are derived per `data_source` and spend target / horizon, so compare runs within the same target definition.
Industry benchmarks for spend prediction
Individual-level prospective spend is one of the hardest prediction problems in healthcare analytics. These benchmarks from actuarial literature provide context for what to expect.
Diagnosis-only models (e.g. CMS-HCC):
The CMS-HCC model, Medicare's official risk adjuster, uses age, sex, and diagnosed condition categories and achieves R² of roughly 11–13% on Medicare data. That means ~87–89% of the variation in individual spending remains unexplained. MAE for models in this class is typically around 95–105% of mean actual cost. In other words, the average prediction error is approximately equal to the average annual spend per member. A 2016 Society of Actuaries comparative study found that diagnosis-only prospective models ranged from roughly 9–20% R² across different model families and populations.
Models that include prior utilization:
Adding prior-year spend or utilization data roughly doubles the explained variance. According to the same SOA study:
| Model | R² (diagnosis only) | R² (with prior cost) | MAE% (diagnosis only) | MAE% (with prior cost) |
|---|---|---|---|---|
| Milliman MARA | ~20% | ~24–25% | ~97% | ~91–92% |
| DxCG (Verisk) | ~18–19% | ~23–24% | ~99% | ~91% |
| Johns Hopkins ACG | ~16% | ~18% | ~101% | n/a |
| CMS-HCC | ~11–13% | n/a | ~100% | n/a |
MAE% = mean absolute error as a percentage of mean actual cost.
What to target:
A model incorporating demographics, diagnoses, HCC risk, and utilization history should aim for R² in the 0.20–0.25 range, which is on par with leading commercial risk models. If your R² is in the low teens, the model is functioning but is likely missing utilization signal (check the feature policy). If it's in the mid-twenties or higher, the model is performing well for this problem class.
Even a well-performing model will have MAE around 85–95% of mean cost. This is not a defect; it reflects the inherent randomness of individual healthcare utilization. The value of the model is in ranking and segmentation (identifying the top-risk members), not in precise dollar-level prediction for any given individual.
Sources: Society of Actuaries: Accuracy of Claims-Based Risk Scoring Models (2016); GAO: Medicare Advantage: Comparison of Plan Bids to Fee-for-Service Costs (2012).