How It Works
This page explains what is happening end-to-end: how point-in-time features and forward-looking labels are built, how training and inference run inside your warehouse, what models are trained, how they're stored and versioned, when retraining happens, and how to interpret output metrics.
If you just want to get to first outputs, start with the Quickstart. If you want to understand or debug the pipeline, start here.
Pipeline overview
At a high level, the pipeline does the following:
- Define a monthly modeling grain: one row per `(person_id, anchor_month)`.
- Build features from history ending at `anchor_month_end_date` (no leakage).
- Build labels from outcomes starting after the anchor month (forward window).
- Assemble a sparse model matrix per target/horizon and data source.
- Train one model per `(data_source, target_key)` and bundle them into a single versioned artifact.
- Score the latest artifact for the configured prediction anchor month.
- Write predictions and standardized evaluation metrics back as warehouse tables.
Where it runs
The package is intentionally split between SQL and Python:
- SQL models: build anchors, features, labels, and sparse-coordinate artifacts. These run as standard warehouse SQL and support Snowflake, Databricks, and Microsoft Fabric.
- Python models: train, score, and compute metrics inside the configured warehouse runtime.
The Python models are: `train_model_registry`, `predict_values`, `predict_probabilities_long`, and `train_metrics_long`. Everything else is SQL that prepares the inputs they consume.
Configuration
Configuration is driven by dbt vars and staged into small config tables so that Python models read a consistent view at runtime.
Key levers:
- Runtime config: train/predict anchor bounds, `test_size`, random seed, artifact stage, and `ml_force_train`.
- Target policy: which targets/horizons are enabled and which claims are included via `target_dimension` + exact `target_values`.
- Feature policy: enable/disable feature groups (demographics, utilization, conditions, HCC).
- Count probability policy: thresholds `k` used to publish `P(Y >= k)` outputs.
- Spend percentile probability policy: percentile cutoffs used to publish `P(spend in top k%)` outputs for `paid_amount` targets.
Point-in-time design (no leakage)
The modeling grain is monthly anchors:
- Anchor row: `(person_id, anchor_month)`
- Feature windows: lookbacks ending at `anchor_month_end_date`
- Outcome windows: starting at `anchor_month + 1 month`, extending `horizon_months` forward
Labels are always in the future relative to features. There is no lookahead by construction.
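The window arithmetic above can be sketched in a few lines. This is an illustrative helper, not the package's code (the package builds these windows in SQL); the function names `add_months` and `anchor_windows` are assumptions for the example.

```python
from datetime import date

def add_months(d: date, n: int) -> date:
    """Shift a first-of-month date by n calendar months."""
    total = d.year * 12 + (d.month - 1) + n
    return date(total // 12, total % 12 + 1, 1)

def anchor_windows(anchor_month: date, lookback_months: int, horizon_months: int):
    """Return (feature_start, feature_end_exclusive, outcome_start,
    outcome_end_exclusive) for one anchor. Features end at the anchor
    month; outcomes start at anchor_month + 1 month."""
    feature_end = add_months(anchor_month, 1)        # exclusive bound = month after anchor
    feature_start = add_months(feature_end, -lookback_months)
    outcome_start = feature_end                      # anchor_month + 1 month
    outcome_end = add_months(outcome_start, horizon_months)
    return feature_start, feature_end, outcome_start, outcome_end

# 12-month lookback, 6-month horizon anchored at 2024-03:
# features cover 2023-04 through 2024-03, outcomes cover 2024-04 through 2024-09.
print(anchor_windows(date(2024, 3, 1), 12, 6))
```

Note that the feature window's exclusive upper bound equals the outcome window's start, so the two can never overlap: labels are in the future by construction.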
Training window layout
- `ml_train_anchor_start_month` and `ml_train_anchor_end_month` bound the date range of eligible training anchors.
- `ml_train_anchor_stride_months` controls how many anchors are sampled per person within that range, reducing overlap between adjacent months.
- For each anchor, feature windows look backward and are shortened by `ml_claims_lag_months` to match real-world claims completeness at prediction time.
- Target windows begin after the anchor month and extend forward by `horizon_months`. An anchor only receives a complete label when its full target window is observable in the data.
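The striding behavior can be illustrated with a minimal sketch. This is an assumption about the sampling rule (keep every Nth anchor per person, offset from that person's earliest anchor); the package implements striding in SQL and may anchor the offset differently.

```python
def stride_anchors(anchor_months, stride_months):
    """Keep every `stride_months`-th anchor for one person, deterministically.
    `anchor_months` is a sorted list of month indices (e.g. year*12 + month)."""
    if not anchor_months:
        return []
    first = anchor_months[0]
    return [m for m in anchor_months if (m - first) % stride_months == 0]

# 30 consecutive monthly anchors, stride of 12: one anchor per year survives,
# so adjacent training rows for the same person share little window overlap.
months = list(range(100, 130))
print(stride_anchors(months, 12))   # [100, 112, 124]
```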
Features
Features are built in a canonical long format:
- Each row is `(person_id, anchor_month, feature_name, window_months, feature_value, feature_data_type)`.
- Categorical features (e.g., `sex`, `race`, `state`) are automatically one-hot encoded during matrix assembly.
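A small sketch of how long rows pivot into a wide matrix, including the one-hot step for categoricals. The row shape is from the docs above; the pivot itself and the column-naming scheme (`sex__F`, `paid_amount_sum_12m`) are illustrative assumptions, not the package's internals.

```python
# Long-format rows: (person_id, anchor_month, feature_name,
#                    window_months, feature_value, feature_data_type)
rows = [
    ("p1", "2024-03", "paid_amount_sum", 12, 1500.0, "numeric"),
    ("p1", "2024-03", "sex", None, "F", "categorical"),
    ("p2", "2024-03", "paid_amount_sum", 12, 200.0, "numeric"),
    ("p2", "2024-03", "sex", None, "M", "categorical"),
]

def to_wide(rows):
    """Pivot long rows to one dict per (person_id, anchor_month);
    categorical values become one-hot indicator columns."""
    wide = {}
    for pid, anchor, name, window, value, dtype in rows:
        rec = wide.setdefault((pid, anchor), {})
        if dtype == "categorical":
            rec[f"{name}__{value}"] = 1.0   # one-hot encode at assembly time
        else:
            col = name if window is None else f"{name}_{window}m"
            rec[col] = float(value)
    return wide

print(to_wide(rows))
```

Absent columns (a person with no `sex__M` row, for example) stay missing in this dict-of-dicts form, which is what makes a sparse-coordinate representation natural downstream.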
Feature groups (v1):
- Demographics: age, sex, race, state, and enrollment/history flags.
- Utilization: paid and encounter counts over 3/6/12 month windows.
- Conditions: binary indicators for condition assignments in the last 12 months.
- HCC: binary indicators for CMS-HCC assignments in the last 12 months.
The feature policy controls which groups are active at runtime. The feature dictionary defines each feature's metadata and default fill value.
Labels
Targets are computed as per-member-per-month rates over the forward horizon:
- Spend (
paid_amount): paid amount in the filtered outcome claims divided by member-month exposure. - Utilization counts (
encounter_count): distinct encounter count in the filtered outcome claims divided by member-month exposure.
Labels are only considered trainable when the outcome window is complete, which prevents partial labels from contaminating training.
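The label rule plus the completeness gate can be sketched as follows. Names and the month-index encoding are illustrative assumptions; the package computes labels in SQL.

```python
def pmpm_label(total_outcome, member_months, last_observed_month, outcome_end_month):
    """Per-member-per-month label plus a trainability flag.
    The flag is True only when the full forward outcome window
    (through outcome_end_month) is observable in the data."""
    window_complete = last_observed_month >= outcome_end_month
    rate = (total_outcome / member_months) if member_months else None
    return rate, window_complete

# $1,200 paid over 5 member-months of exposure; outcome window ends 2024-09
# and data runs through 2024-10, so the label is complete and trainable.
print(pmpm_label(1200.0, 5, 202410, 202409))   # (240.0, True)
```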
Training set construction
Before fitting, the training population is constrained to avoid leakage and reduce redundant anchors:
- Label completeness filter: only rows where the forward outcome window is complete are included.
- Train anchor window: you can bound the anchor months used for training; if unset, defaults to the min/max anchors whose forward outcome windows are fully observable. This is an outcome-completeness rule, not a full-lookback-history rule.
- Deterministic anchor striding: by default, a per-person subset of anchors is used (e.g., every 12th month) to reduce overlap across adjacent months. Controlled by `ml_train_anchor_stride_months`.
- Train/test split by person: a grouped shuffle split on `person_id` so anchors from the same person never leak across train and test sets.
Early anchors can still have partial feature history. The package does not require a full 12 months of prior enrollment or claims to keep an anchor; instead it carries that information into training through features such as `member_months_lookback_12m` and `cold_start_flag`.
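The grouped split can be sketched with the standard library (the package runs an equivalent grouped shuffle split on `person_id`; this stdlib version is an illustrative stand-in):

```python
import random

def grouped_train_test_split(person_ids, test_size=0.2, seed=42):
    """Split row indices by person so no person_id appears in both
    train and test. Shuffling the distinct people with a fixed seed
    makes the split deterministic."""
    people = sorted(set(person_ids))
    rng = random.Random(seed)
    rng.shuffle(people)
    n_test = max(1, round(len(people) * test_size))
    test_people = set(people[:n_test])
    train_idx = [i for i, p in enumerate(person_ids) if p not in test_people]
    test_idx = [i for i, p in enumerate(person_ids) if p in test_people]
    return train_idx, test_idx

# Multiple anchors per person: every person's rows land wholly on one side.
ids = ["a", "a", "b", "b", "c", "c", "d"]
train_idx, test_idx = grouped_train_test_split(ids, test_size=0.25, seed=0)
```

Splitting on rows instead of people would put one person's 2024-01 anchor in train and their 2024-02 anchor in test, and those rows share most of their feature window, which is exactly the leakage this avoids.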
Models
The package trains gradient-boosted tree regressors (XGBoost) in a warehouse-native Python runtime:
- `paid_amount` targets use a Tweedie regression objective (`reg:tweedie`) for non-negative, heavy-tailed spend distributions.
- `encounter_count` targets use a Poisson objective (`count:poisson`) for non-negative event rates.

Training runs independently per `data_source`. The published artifact contains separate model entries keyed by `(data_source, target_key)`.
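The objective selection above can be sketched as a parameter map. The two objective strings are real XGBoost objectives from the docs; the other hyperparameters (`tweedie_variance_power`, `max_depth`) are illustrative assumptions, not the package's defaults.

```python
def xgb_params(target_key: str) -> dict:
    """Choose the XGBoost objective by target family, as described above."""
    if target_key.startswith("paid_amount"):
        return {
            "objective": "reg:tweedie",
            "tweedie_variance_power": 1.5,  # assumed: between Poisson (1) and gamma (2)
            "max_depth": 6,                 # assumed
        }
    if target_key.startswith("encounter_count"):
        return {"objective": "count:poisson", "max_depth": 6}
    raise ValueError(f"unknown target family: {target_key}")

print(xgb_params("paid_amount_12m")["objective"])      # reg:tweedie
print(xgb_params("encounter_count_6m")["objective"])   # count:poisson
```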
Calibration
After training, a simple multiplicative calibration factor is computed on the training set:
`calibration_factor = sum(pred_raw_train) / sum(actual_train)`
At scoring time:
`predicted_value = pred_raw / calibration_factor`
This aligns aggregate predicted totals to aggregate actuals, making the `pa_ratio` metric easier to interpret (closer to 1.0 is better).
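The two formulas above, as a direct sketch:

```python
def calibration_factor(pred_raw_train, actual_train):
    """calibration_factor = sum(pred_raw_train) / sum(actual_train)"""
    return sum(pred_raw_train) / sum(actual_train)

def calibrate(pred_raw, factor):
    """predicted_value = pred_raw / calibration_factor"""
    return pred_raw / factor

# Raw predictions total 220 against actuals totaling 200, so factor = 1.1.
factor = calibration_factor([130.0, 90.0], [100.0, 100.0])
# After calibration, aggregate predictions match aggregate actuals (pa_ratio -> 1.0).
total = sum(calibrate(p, factor) for p in [130.0, 90.0])   # 200.0
```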
For count targets, `P(Y >= k)` probabilities are derived by treating the calibrated prediction as a Poisson mean and computing the tail probability.
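The Poisson tail probability is a closed-form computation; a plain-Python sketch (the same quantity is available as `scipy.stats.poisson.sf(k - 1, mu)` where SciPy is present):

```python
import math

def p_count_at_least(mu: float, k: int) -> float:
    """P(Y >= k) when Y ~ Poisson(mu): one minus the CDF at k - 1."""
    return 1.0 - sum(math.exp(-mu) * mu**i / math.factorial(i) for i in range(k))

# A calibrated prediction of 2 encounters per period: chance of 3 or more.
print(round(p_count_at_least(2.0, 3), 4))   # 0.3233
```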
For spend targets, `P(spend in top k%)` probabilities are derived by:
- computing the spend cutoff for the requested top `k%` separately for each `data_source` and spend target / horizon on the training labels
- turning that into a binary event such as `1[paid_amount >= cutoff]`
- fitting a one-dimensional calibration map from the model's predicted spend to that binary event
At prediction time, the calibrated spend prediction is passed through that learned map to produce a probability.
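The three steps can be sketched end-to-end. The quantile-binned map below is a stand-in assumption: the docs do not specify which one-dimensional calibration map the package fits (isotonic regression is a common choice for this), and the percentile arithmetic is illustrative.

```python
def top_k_cutoff(train_spend, k_pct):
    """Step 1: spend cutoff so roughly k% of training labels sit at or above it."""
    s = sorted(train_spend)
    idx = int(len(s) * (1 - k_pct / 100.0))
    return s[min(idx, len(s) - 1)]

def fit_binned_map(preds, events, n_bins=4):
    """Step 3 (sketch): quantile-bin predicted spend and record the observed
    event rate per bin, giving a monotone-ish map from prediction to probability."""
    pairs = sorted(zip(preds, events))
    size = max(1, len(pairs) // n_bins)
    return [(chunk[-1][0], sum(e for _, e in chunk) / len(chunk))
            for chunk in (pairs[i:i + size] for i in range(0, len(pairs), size))]

def predict_top_k_probability(bins, pred_spend):
    """At prediction time: look up the calibrated spend prediction in the map."""
    for upper, rate in bins:
        if pred_spend <= upper:
            return rate
    return bins[-1][1]

spend = [100, 200, 300, 400, 500, 600, 700, 5000]
cutoff = top_k_cutoff(spend, 25)                       # 700: top 25% of this sample
events = [1 if x >= cutoff else 0 for x in spend]      # Step 2: 1[paid_amount >= cutoff]
bins = fit_binned_map(spend, events)                   # here preds == labels (a perfect model)
print(predict_top_k_probability(bins, 800))            # 1.0
```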
Model storage and versioning
Training publishes a single bundle per run to warehouse-managed artifact storage:
- A `model_version` is generated (timestamp + uuid suffix).
- The bundle contains: runtime config used for training, per-target feature column lists, trained XGBoost estimators, and calibration factors.
- The bundle is serialized as a pickle file stored in the location configured by `ml_artifact_stage`. In Snowflake, that is typically `@<database>.<schema>.ML_MODEL_STAGE`.
The registry table stores the `artifact_uri` so inference can fetch and load the exact artifact.
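A minimal sketch of the publication step, assuming the bundle layout described above. The field names, the local temp-dir path (standing in for `ml_artifact_stage`), and the exact version format are illustrative assumptions; only "timestamp + uuid suffix, pickled bundle" comes from the docs.

```python
import os
import pickle
import tempfile
import uuid
from datetime import datetime, timezone

def publish_bundle(models, calibration, runtime_config, feature_columns):
    """Generate a model_version and pickle the full bundle to one file."""
    model_version = (datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
                     + "_" + uuid.uuid4().hex[:8])
    bundle = {
        "model_version": model_version,
        "runtime_config": runtime_config,     # config used for this training run
        "feature_columns": feature_columns,   # per-target column lists
        "models": models,                     # keyed by (data_source, target_key)
        "calibration": calibration,           # per-target calibration factors
    }
    path = os.path.join(tempfile.gettempdir(), f"{model_version}.pkl")
    with open(path, "wb") as f:
        pickle.dump(bundle, f)
    return model_version, path               # path plays the role of artifact_uri
```

Bundling everything in one file is what makes scoring reproducible: inference loads the exact feature columns and calibration factors the estimators were trained with.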
When retraining happens
Two modes control retraining:
- `ml_force_train: false` (default): reuses an existing artifact bundle if the training signature matches a prior entry in `train_model_registry_history`.
- `ml_force_train: true`: always trains a new model version.
The training signature is a deterministic hash of: train/test split controls (`random_seed`, `test_size`), claims-lag settings, the enabled target/horizon plan with available training window characteristics (row counts and anchor bounds), and the per-target feature-column layout used for training.
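A sketch of how such a signature behaves. The real package's serialization and hash are not shown here; canonical JSON plus SHA-256 is one standard way to get a deterministic hash, and it illustrates why any change to these inputs forces a retrain.

```python
import hashlib
import json

def training_signature(split_controls, claims_lag, target_plan, feature_layout):
    """Hash a canonical (sorted-keys) JSON rendering of the signature inputs."""
    payload = json.dumps(
        {"split": split_controls, "claims_lag": claims_lag,
         "targets": target_plan, "features": feature_layout},
        sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode()).hexdigest()

a = training_signature({"random_seed": 42, "test_size": 0.2}, 3,
                       {"paid_amount_12m": {"rows": 1000}}, ["age", "sex__F"])
b = training_signature({"random_seed": 42, "test_size": 0.2}, 3,
                       {"paid_amount_12m": {"rows": 1001}}, ["age", "sex__F"])
print(a == b)   # False: one extra training row changes the signature
```

This is why new months of data typically trigger automatic retraining: they change row counts and anchor bounds, which are part of the hashed plan.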
Practical guidance:
- If you want to force a new artifact even when the signature would otherwise match, use `ml_force_train: true`.
- If new months of data arrive, the signature typically changes and a new model is trained automatically.
- Reuse is evaluated at the full bundle level. If you add, remove, or change any target, all targets retrain together.
- A policy row with multiple `target_values` expands into multiple distinct targets. It is shorthand for repeated YAML, not a combined target.
Evaluation metrics
After training, `train_metrics_long` loads the latest artifact, re-scores the training matrix, and computes metrics on a person-level grouped train/test split.
All targets:
| Metric | Description |
|---|---|
| `mae` | Mean absolute error |
| `rmse` | Root mean squared error |
| `r2` | Coefficient of determination |
| `pa_ratio` | sum(pred) / sum(actual); aggregate calibration check |
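The four all-target metrics, written out in plain Python as a reference for how they relate:

```python
import math

def regression_metrics(actual, pred):
    """mae, rmse, r2, and pa_ratio as defined in the table above."""
    n = len(actual)
    mae = sum(abs(a - p) for a, p in zip(actual, pred)) / n
    rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / n)
    mean_actual = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, pred))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot           # < 0 means worse than predicting the mean
    pa_ratio = sum(pred) / sum(actual) # aggregate calibration check
    return {"mae": mae, "rmse": rmse, "r2": r2, "pa_ratio": pa_ratio}

print(regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 310.0]))
```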
Probability targets (for each configured threshold):
| Metric | Description |
|---|---|
| `auc` | ROC AUC for the configured binary event |
| `brier` | Brier score (calibration + sharpness) |
| `logloss` | Cross-entropy loss |
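The three probability metrics in plain Python (the pairwise form of AUC shown here is equivalent to the usual ROC-curve definition; a production pipeline would use a library implementation):

```python
import math

def binary_metrics(y, p):
    """ROC AUC, Brier score, and log loss for binary labels y and probabilities p."""
    n = len(y)
    brier = sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / n
    eps = 1e-15   # clip to avoid log(0)
    logloss = -sum(yi * math.log(max(pi, eps)) + (1 - yi) * math.log(max(1 - pi, eps))
                   for yi, pi in zip(y, p)) / n
    # AUC = P(score of a random positive > score of a random negative), ties count half.
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    auc = wins / (len(pos) * len(neg))
    return {"auc": auc, "brier": brier, "logloss": logloss}

print(binary_metrics([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```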
Is the model output good?
There is no single universal threshold. Difficulty depends on population mix, claims completeness, and event prevalence. Use these checks to quickly detect broken or weak models.
Sanity checks (all targets):
- `pa_ratio` near `1.0` is a basic calibration check. Values far from 1.0 indicate systematic under/over prediction.
- Compare `mean_pred` vs `mean_actual`. They should be in the same ballpark.
- Watch for tiny `feature_count`, tiny `x_train_nnz`, or near-zero prediction variance; those typically indicate a feature pipeline issue.
Spend (`paid_amount`):
- `r2 < 0` means the model is worse than predicting the mean. That is a strong signal something is wrong.
- Individual-level healthcare spend is genuinely hard to predict and modest R² is normal. See the industry context below for what to expect.
- Focus on `mae`/`rmse` relative to `mean_actual`, and validate ranking performance. Are top-decile predictions meaningfully higher than average?
Count thresholds (`encounter_count`, `P(Y >= k)`):
- `auc = 0.5` is random. `0.6–0.7` is modest signal; `0.7–0.8` is typically good; `> 0.8` is strong but not always achievable depending on the event.
- Brier score is easiest to interpret relative to a baseline that always predicts prevalence; lower is better.
Spend percentile thresholds (`paid_amount`, `P(spend in top k%)`):
- Interpret AUC/Brier/logloss the same way as other binary-event metrics, but remember these are harder events as `k` gets smaller. `top 1%` is a very sparse event, so expect noisier metrics than `top 5%`.
- These thresholds are derived per `data_source` and spend target / horizon, so compare runs within the same target definition.
Industry benchmarks for spend prediction
Individual-level prospective spend is one of the hardest prediction problems in healthcare analytics. These benchmarks from actuarial literature provide context for what to expect.
Diagnosis-only models (e.g. CMS-HCC):
The CMS-HCC model, Medicare's official risk adjuster, uses age, sex, and diagnosed condition categories and achieves R² of roughly 11–13% on Medicare data. That means ~87–89% of the variation in individual spending remains unexplained. MAE for models in this class is typically around 95–105% of mean actual cost. In other words, the average prediction error is approximately equal to the average annual spend per member. A 2016 Society of Actuaries comparative study found that diagnosis-only prospective models ranged from roughly 9–20% R² across different model families and populations.
Models that include prior utilization:
Adding prior-year spend or utilization data roughly doubles the explained variance. According to the same SOA study:
| Model | R² (diagnosis only) | R² (with prior cost) | MAE% (diagnosis only) | MAE% (with prior cost) |
|---|---|---|---|---|
| Milliman MARA | ~20% | ~24–25% | ~97% | ~91–92% |
| DxCG (Verisk) | ~18–19% | ~23–24% | ~99% | ~91% |
| Johns Hopkins ACG | ~16% | ~18% | ~101% | n/a |
| CMS-HCC | ~11–13% | n/a | ~100% | n/a |
MAE% = mean absolute error as a percentage of mean actual cost.
What to target:
A model incorporating demographics, diagnoses, HCC risk, and utilization history should aim for R² in the 0.20–0.25 range, which is on par with leading commercial risk models. If your R² is in the low teens, the model is functioning but is likely missing utilization signal (check the feature policy). If it's in the mid-twenties or higher, the model is performing well for this problem class.
Even a well-performing model will have MAE around 85–95% of mean cost. This is not a defect; it reflects the inherent randomness of individual healthcare utilization. The value of the model is in ranking and segmentation (identifying the top-risk members), not in precise dollar-level prediction for any given individual.
Sources: Society of Actuaries: Accuracy of Claims-Based Risk Scoring Models (2016); GAO: Medicare Advantage: Comparison of Plan Bids to Fee-for-Service Costs (2012).