# Configuration Reference
All package behavior is controlled through dbt vars. No seed files or code changes are required. Start with the defaults and override only what you need.
Recommended approach:

- Define project-level vars in `dbt_project.yml` for settings that apply to all runs.
- Use `--vars` on the command line for one-off overrides (e.g., forcing a retrain or targeting a specific prediction month).
## Core runtime vars
| Var | Type | Default | Description |
|---|---|---|---|
| `ml_enabled` | bool | `true` | Enables or disables all ML models in the package; when omitted it runs as enabled. |
| `ml_force_train` | bool | `false` | `false` reuses a matching prior model bundle from `train_model_registry_history`; `true` always trains a new model; when omitted it behaves as `false`. |
| `ml_train_anchor_start_month` | date string or null | `null` | Inclusive lower bound for training anchors; when omitted/null there is no manual lower bound and training uses the earliest anchor whose forward outcome window is fully observable in the run. |
| `ml_train_anchor_end_month` | date string or null | `null` | Inclusive upper bound for training anchors; when omitted/null there is no manual upper bound and training uses the latest anchor whose forward outcome window is fully observable in the run. |
| `ml_train_anchor_stride_months` | int | `12` | Deterministic per-person anchor striding to reduce overlap in training rows. |
| `ml_prediction_anchor_month` | date string or null | `null` | Month to generate predictions for. Defaults to the latest available anchor month in the anchor population if not set. |
| `ml_artifact_stage` | stage URI or null | `null` | Artifact storage location for trained model bundles. In Snowflake, this defaults to `@<db>.<schema>.ML_MODEL_STAGE`. |
| `ml_random_seed` | int | `42` | Random seed for the train/test split and model training. |
| `ml_test_size` | float | `0.2` | Fraction of data reserved for the test split. |
| `ml_claims_lag_months` | int | `3` | Number of months of recent claims to exclude from feature lookback windows (see Claims Lag Adjustment below). |
| `tuva_schema_prefix` | string or null | `null` | If set, writes package outputs to `{tuva_schema_prefix}_ml`. |
Null or undefined runtime vars do not raise errors; the package falls back to the defaults above.
For the training anchor defaults, "complete" refers to the forward-looking label window, not the backward-looking feature window. Early anchors can still have less than 12 months of prior history; those rows remain eligible for training and the model accounts for that with features such as member_months_lookback_12m and cold_start_flag.
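The default anchor-window rule above can be sketched in a few lines. This is an illustrative helper, not package code: it assumes the latest eligible training anchor is simply the latest data month shifted back by the label horizon, so that the forward window `[anchor+1, anchor+horizon]` is fully observable.

```python
from datetime import date


def default_training_anchor_bounds(earliest_month, latest_month, horizon_months):
    """Hypothetical sketch of the default training anchor window:
    the lower bound is the earliest available anchor month, and the
    upper bound is the latest data month minus the label horizon, so
    every eligible anchor has a fully observed forward label window."""
    def shift(d, months):
        # Shift a first-of-month date by a (possibly negative) month count.
        total = d.year * 12 + (d.month - 1) + months
        return date(total // 12, total % 12 + 1, 1)

    return earliest_month, shift(latest_month, -horizon_months)


# With data through 2018-12 and a 12-month horizon, the latest eligible
# training anchor is 2017-12.
bounds = default_training_anchor_bounds(date(2015, 1, 1), date(2018, 12, 1), 12)
```

Note that only the upper bound depends on label completeness; early anchors stay eligible even with partial feature history, as described above.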
## Prediction window reference

- `ml_prediction_anchor_month` sets the single member month being scored.
- Feature windows look backward from the anchor and are shortened by `ml_claims_lag_months` to account for incomplete recent claims.
- The target horizon defines the forward prediction period, with duration set by `ml_target_policy.horizon_months`.
## Claims lag adjustment
In most claims environments, the most recent 1-3 months of data are not fully adjudicated because claims are still being submitted and processed (sometimes called IBNR or "claims lag"). This means utilization counts, paid amounts, condition flags, and HCC assignments are understated for recent months.
Without adjustment, the model trains on fully adjudicated historical data but scores on incomplete recent data at prediction time. This train/predict mismatch causes systematic underprediction of risk.
How it works: `ml_claims_lag_months` trims the end of every claims-based feature lookback window by N months, in both training and prediction. The lookback start stays fixed, so the effective window shrinks. For example, with `ml_claims_lag_months: 3`:

- A 12-month utilization window uses months `[anchor-11, anchor-3]` (9 months of claims)
- A 6-month utilization window uses months `[anchor-5, anchor-3]` (3 months of claims)
- A 3-month utilization window becomes empty (all zeros), since it falls entirely within the lag period
- Condition and HCC features (12-month lookback) use 9 months of claims
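The window arithmetic above can be sketched as follows; `lagged_lookback` is a hypothetical helper for illustration, not the package's implementation.

```python
from datetime import date


def lagged_lookback(anchor, lookback_months, lag_months):
    """Sketch of the claims-lag trimming described above: the lookback
    start stays fixed at anchor-(lookback-1) while the end moves from
    anchor back to anchor-lag. Returns None when the whole window falls
    inside the lag period (i.e., the window is empty)."""
    def shift(d, months):
        total = d.year * 12 + (d.month - 1) + months
        return date(total // 12, total % 12 + 1, 1)

    start = shift(anchor, -(lookback_months - 1))
    end = shift(anchor, -lag_months)
    return (start, end) if end >= start else None


# With lag 3 at a 2018-06 anchor: a 12-month window keeps [2017-07, 2018-03]
# (9 months of claims), and a 3-month window is empty.
window = lagged_lookback(date(2018, 6, 1), 12, 3)
```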
This applies symmetrically to training and prediction, so the model learns from the same truncated-window pattern it will encounter at scoring time. Train/test metrics will be slightly lower (the model sees less signal), but they honestly reflect real-world prediction performance.
Forward-looking targets are not affected. "spend in the next 6 months" still uses the full outcome window from anchor+1 through anchor+6.
Tuning guidance:
| Value | When to use |
|---|---|
| 0 | Fully adjudicated data, retrospective studies |
| 1 | Fast-paying payers, electronic-only claims |
| 2-3 | Typical commercial claims (default: 3) |
| 4+ | Slow payers, complex claim types |
## Dev sampling vars
Use these to run the pipeline on a reduced population while iterating. This is useful for validating end-to-end wiring without full compute costs.
| Var | Type | Default | Description |
|---|---|---|---|
| `ml_dev_sample_enabled` | bool | `false` | Enables deterministic anchor-row downsampling. |
| `ml_dev_sample_rows` | int | `10000` | Maximum number of sampled anchor rows. |
| `ml_dev_sample_seed` | int | `42` | Seed for deterministic sampling. |
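One common way to make seed-driven downsampling deterministic is to rank rows by a stable hash of the seed and a row key, then keep the first N. The sketch below illustrates that idea only; `deterministic_sample` is a hypothetical helper and the package's actual sampling logic may differ.

```python
import hashlib


def deterministic_sample(rows, max_rows, seed):
    """Illustrative deterministic downsampling: order rows by a stable
    hash of (seed, row key) and keep the first max_rows. The same seed
    and input always yield the same sample, independent of input order."""
    def rank(key):
        return hashlib.sha256(f"{seed}|{key}".encode()).hexdigest()

    return sorted(rows, key=rank)[:max_rows]


# Re-running with the same seed yields the same sampled rows.
sample = deterministic_sample([f"member_{i}" for i in range(100)], 10, 42)
```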
## Policy vars
### Feature policy
Controls which feature groups are included in training and prediction. `feature_group` can be a single value or a list:

```yaml
ml_feature_policy:
  - feature_group: [demographics, utilization, conditions, hcc]
```

Each row is enabled by existence. To disable a group, omit it from `ml_feature_policy`.
### Target policy
Controls which targets and horizons are active, and which claims are included in each target numerator.
- `target_measure`: `paid_amount` or `encounter_count`
- `target_dimension`: `all`, `encounter_group`, or `encounter_type`
- `target_values`: omitted for `all`; otherwise required as a non-empty list
- Encounter target values are validated against the Tuva terminology seed `terminology__encounter_type`
```yaml
ml_target_policy:
  - target_measure: paid_amount
    horizon_months: [6, 12]
    target_dimension: all
  - target_measure: encounter_count
    horizon_months: [12]
    target_dimension: encounter_group
    target_values: [inpatient]
  - target_measure: encounter_count
    horizon_months: [12]
    target_dimension: encounter_type
    target_values: [emergency department, ambulatory surgery center]
  - target_measure: paid_amount
    horizon_months: [12]
    target_dimension: encounter_type
    target_values: [acute inpatient]
  - target_measure: paid_amount
    horizon_months: [12]
    target_dimension: encounter_type
    target_values: [emergency department]
```
`horizon_months` can be a scalar or a list. `target_values` must always be a list for encounter targets, even when there is only one value. If a single policy row includes multiple `target_values`, the package expands that row into one separate target per listed value.
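The expansion rule can be sketched as follows. This is an illustrative model of the behavior described above, not package code; `expand_target_policy` is a hypothetical helper.

```python
def expand_target_policy(policy_rows):
    """Sketch of target-policy expansion: each (horizon, target_value)
    pair in a row becomes its own target. A scalar horizon_months is
    treated as a one-element list; dimension 'all' has no target_values."""
    targets = []
    for row in policy_rows:
        horizons = row["horizon_months"]
        if not isinstance(horizons, list):
            horizons = [horizons]
        values = row.get("target_values") or [None]  # None => dimension 'all'
        for h in horizons:
            for v in values:
                targets.append(
                    (row["target_measure"], h, row["target_dimension"], v)
                )
    return targets


# A single row listing two encounter types expands into two targets.
rows = [{"target_measure": "encounter_count",
         "horizon_months": [12],
         "target_dimension": "encounter_type",
         "target_values": ["emergency department",
                           "ambulatory surgery center"]}]
expanded = expand_target_policy(rows)
```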
Examples:

- Spend across all claims:

  ```yaml
  - target_measure: paid_amount
    horizon_months: [6, 12]
    target_dimension: all
  ```

- Spend limited to acute inpatient encounters:

  ```yaml
  - target_measure: paid_amount
    horizon_months: [12]
    target_dimension: encounter_type
    target_values: [acute inpatient]
  ```

- Spend limited to emergency department encounters:

  ```yaml
  - target_measure: paid_amount
    horizon_months: [12]
    target_dimension: encounter_type
    target_values: [emergency department]
  ```
Optional probability outputs:

- For `encounter_count` targets, the package can also output threshold probabilities such as "probability of at least 1 emergency department visit in the next 12 months." These are controlled separately by `ml_count_probability_policy`.
- For `paid_amount` targets, the package can also output threshold probabilities such as "probability this member is in the top 1% of spend next year." Those spend percentile cutoffs are derived separately for each `data_source` and each spend target / horizon combination.
Each row is enabled by existence. To disable a target, remove it from `ml_target_policy`.
### Count probability policy
Controls the `k` thresholds used to compute P(Y >= k) for count targets. `threshold_k` can be a single int or a list:

```yaml
ml_count_probability_policy:
  - threshold_k: [1, 2, 3, 5]
```

Each row is enabled by existence. To disable a threshold, remove it from `ml_count_probability_policy`.
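To make the semantics of P(Y >= k) concrete, the sketch below computes it empirically over a toy set of member visit counts. This is illustration only; the package produces model-based probabilities, and `count_threshold_probabilities` is a hypothetical helper.

```python
def count_threshold_probabilities(counts, thresholds):
    """Empirical P(Y >= k) for each threshold k: the share of members
    whose count meets or exceeds k. Purely illustrative of the metric's
    meaning, not the package's estimator."""
    n = len(counts)
    return {k: sum(c >= k for c in counts) / n for k in thresholds}


# Five members with 0, 0, 1, 2, and 5 ED visits:
# 3 of 5 had at least one visit, so P(Y >= 1) = 0.6.
probs = count_threshold_probabilities([0, 0, 1, 2, 5], [1, 2, 3, 5])
```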
### Spend percentile probability policy
Controls the percentile cutoffs used to compute P(spend in top k%) for `paid_amount` targets. `top_percent` can be a single numeric value or a list. These percentile cutoffs are derived separately for each `data_source` and each spend target / horizon combination.

```yaml
ml_spend_percentile_probability_policy:
  - top_percent: [1, 5]
```
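The per-group cutoff derivation can be sketched as below. `spend_percentile_cutoffs` is a hypothetical helper using a simple nearest-rank percentile; the package's actual derivation may differ.

```python
def spend_percentile_cutoffs(spend_by_group, top_percents):
    """Sketch of per-data_source cutoff derivation: for each group, the
    top-k% cutoff is the spend value at the (100-k)th percentile of that
    group's outcome spend (nearest-rank, clamped to the max index)."""
    cutoffs = {}
    for group, spend in spend_by_group.items():
        s = sorted(spend)
        n = len(s)
        cutoffs[group] = {
            p: s[min(n - 1, int(n * (100 - p) / 100))] for p in top_percents
        }
    return cutoffs


# For 100 members spending 1..100, the top-5% cutoff is 96: exactly the
# five members spending 96-100 clear it.
cuts = spend_percentile_cutoffs({"source_a": list(range(1, 101))}, [1, 5])
```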
Examples:

- Probability a member lands in the top 1% of total spend in the next 12 months:

  ```yaml
  ml_target_policy:
    - target_measure: paid_amount
      horizon_months: [12]
      target_dimension: all
  ml_spend_percentile_probability_policy:
    - top_percent: [1]
  ```

- Probability a member lands in the top 5% of emergency department spend in the next 12 months:

  ```yaml
  ml_target_policy:
    - target_measure: paid_amount
      horizon_months: [12]
      target_dimension: encounter_type
      target_values: [emergency department]
  ml_spend_percentile_probability_policy:
    - top_percent: [5]
  ```
Each row is enabled by existence. To disable a threshold, remove it from `ml_spend_percentile_probability_policy`.
## Common run patterns
### First-time run

```shell
dbt deps
dbt run --select package:illuminate_predictive_models
```
Validation checklist after the run completes:

- `train_model_registry` has rows with a `trained_...` or `skipped_...` status.
- `predict_values` has non-zero rows.
- `train_metrics_long` has both `TRAIN` and `TEST` scopes.
### Monthly prediction cycle (reuse existing models)

Update `ml_prediction_anchor_month` to the target month and run. With `ml_force_train` omitted (default `false`), training is skipped if the model signature matches the current runtime, target, and feature configuration already recorded in `train_model_registry_history`:

```shell
dbt run --select package:illuminate_predictive_models --vars '{ml_prediction_anchor_month: 2026-02-01}'
```
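Signature-based reuse typically works by hashing the configuration that defines a model and comparing it against what the registry recorded. The sketch below illustrates that pattern only; `model_signature` is a hypothetical helper, and the package's actual signature fields may differ.

```python
import hashlib
import json


def model_signature(runtime_vars, target_row, feature_groups):
    """Illustrative config signature: serialize the runtime vars, target
    definition, and feature groups deterministically (sorted keys, sorted
    groups) and hash the result. Identical configs hash identically, so a
    matching registry row means training can be skipped."""
    payload = json.dumps(
        {"runtime": runtime_vars,
         "target": target_row,
         "features": sorted(feature_groups)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()


# Same config (feature order aside) => same signature; changing any
# runtime var, such as the seed, produces a new signature and a retrain.
sig = model_signature({"ml_random_seed": 42},
                      {"target_measure": "paid_amount"},
                      ["demographics", "hcc"])
```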
### Force retraining

Run training first, then re-generate all prediction and metric outputs:

```shell
dbt run --select train_model_registry --vars '{ml_force_train: true}'
dbt run --select predict_values predict_probabilities_long train_metrics_long
```
### Low-cost development loop

Run with a downsampled anchor population to validate end-to-end wiring without full compute costs:

```shell
dbt run --select package:illuminate_predictive_models --vars '{ml_dev_sample_enabled: true, ml_dev_sample_rows: 10000, ml_dev_sample_seed: 20260301}'
```
## Recommended example

A practical `dbt_project.yml` vars block with the settings most users will actually change:
```yaml
vars:
  ml_enabled: true
  ml_train_anchor_start_month: "2017-01-01"  # optional; if unset, uses earliest anchor with a fully observed forward outcome window
  ml_train_anchor_end_month: "2017-12-01"    # optional; if unset, uses latest anchor with a fully observed forward outcome window
  ml_prediction_anchor_month: "2018-06-01"   # optional; if unset, uses the latest member month available in the anchor population
  ml_claims_lag_months: 3
  ml_feature_policy:
    - feature_group: [demographics, utilization, conditions, hcc]
  ml_target_policy:
    - target_measure: paid_amount
      horizon_months: [6, 12]
      target_dimension: all
    - target_measure: encounter_count
      horizon_months: [12]
      target_dimension: encounter_group
      target_values: [inpatient]
    - target_measure: encounter_count
      horizon_months: [12]
      target_dimension: encounter_type
      target_values: [emergency department, ambulatory surgery center]
  ml_count_probability_policy:
    - threshold_k: [1, 2, 3, 5]
  ml_spend_percentile_probability_policy:
    - top_percent: [1, 5]
```
## Full reference example
```yaml
vars:
  ml_enabled: true
  ml_force_train: false
  ml_train_anchor_start_month: "2017-01-01"  # optional; if unset, uses earliest anchor with a fully observed forward outcome window
  ml_train_anchor_end_month: "2017-12-01"    # optional; if unset, uses latest anchor with a fully observed forward outcome window
  ml_prediction_anchor_month: "2018-06-01"   # optional; if unset, uses the latest member month available in the anchor population
  ml_train_anchor_stride_months: 12
  ml_artifact_stage: null
  ml_random_seed: 42
  ml_test_size: 0.2
  ml_claims_lag_months: 3
  ml_dev_sample_enabled: false
  ml_dev_sample_rows: 10000
  ml_dev_sample_seed: 42
  ml_feature_policy:
    - feature_group: [demographics, utilization, conditions, hcc]
  ml_target_policy:
    - target_measure: paid_amount
      horizon_months: [6, 12]
      target_dimension: all
    - target_measure: encounter_count
      horizon_months: [12]
      target_dimension: encounter_group
      target_values: [inpatient]
    - target_measure: encounter_count
      horizon_months: [12]
      target_dimension: encounter_type
      target_values: [emergency department, ambulatory surgery center]
    - target_measure: paid_amount
      horizon_months: [12]
      target_dimension: encounter_type
      target_values: [acute inpatient]
    - target_measure: paid_amount
      horizon_months: [12]
      target_dimension: encounter_type
      target_values: [emergency department]
  ml_count_probability_policy:
    - threshold_k: [1, 2, 3, 5]
```
If the anchor-month vars are unset, those defaults are global within the run, not recalculated separately for each `data_source`. For training, that default window is based on label completeness only; early anchors may still have partial feature lookback history.