Configuration Reference

All package behavior is controlled through dbt vars. No seed files or code changes are required. Start with the defaults and override only what you need.

Recommended approach:

  1. Define project-level vars in dbt_project.yml for settings that apply to all runs.
  2. Use --vars on the command line for one-off overrides (e.g., forcing a retrain or targeting a specific prediction month).

Core runtime vars

| Var | Type | Default | Description |
| --- | --- | --- | --- |
| ml_enabled | bool | true | Enables or disables all ML models in the package; when omitted, runs as enabled. |
| ml_force_train | bool | false | false reuses a matching prior model bundle from train_model_registry_history; true always trains a new model; when omitted, behaves as false. |
| ml_train_anchor_start_month | date string or null | null | Inclusive lower bound for training anchors; when omitted/null there is no manual lower bound, and training uses the earliest anchor whose forward outcome window is fully observable in the run. |
| ml_train_anchor_end_month | date string or null | null | Inclusive upper bound for training anchors; when omitted/null there is no manual upper bound, and training uses the latest anchor whose forward outcome window is fully observable in the run. |
| ml_train_anchor_stride_months | int | 12 | Deterministic per-person anchor striding to reduce overlap in training rows. |
| ml_prediction_anchor_month | date string or null | null | Month to generate predictions for; defaults to the latest available anchor month in the anchor population if not set. |
| ml_artifact_stage | stage URI or null | null | Artifact storage location for trained model bundles. In Snowflake, this defaults to @<db>.<schema>.ML_MODEL_STAGE. |
| ml_random_seed | int | 42 | Random seed for the train/test split and model training. |
| ml_test_size | float | 0.2 | Fraction of data reserved for the test split. |
| ml_claims_lag_months | int | 3 | Number of recent months of claims to exclude from feature lookback windows (see Claims Lag Adjustment below). |
| tuva_schema_prefix | string or null | null | If set, writes package outputs to {tuva_schema_prefix}_ml. |

Null or undefined runtime vars do not raise errors; the package falls back to the defaults above.

For the training anchor defaults, "complete" refers to the forward-looking label window, not the backward-looking feature window. Early anchors can still have less than 12 months of prior history; those rows remain eligible for training and the model accounts for that with features such as member_months_lookback_12m and cold_start_flag.
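As an illustration of that default, the eligible anchor range can be derived from the last fully observed claims month and the target horizon. This is a sketch under that assumption, not the package's actual code; both function names are hypothetical:

```python
from datetime import date

def add_months(d: date, n: int) -> date:
    # Shift a first-of-month date by n months (n may be negative).
    total = d.year * 12 + (d.month - 1) + n
    return date(total // 12, total % 12 + 1, 1)

def default_train_anchor_bounds(earliest_anchor: date,
                                latest_claims_month: date,
                                horizon_months: int) -> tuple[date, date]:
    # Keep every anchor whose forward outcome window
    # (anchor+1 .. anchor+horizon) is fully observable in the data.
    return earliest_anchor, add_months(latest_claims_month, -horizon_months)
```

For example, with claims observed through 2018-12 and a 12-month horizon, the latest eligible training anchor is 2017-12, since its outcome window (2018-01 through 2018-12) is fully observable.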

Prediction window reference

[Diagram: prediction window layout showing how a single anchor month defines the feature lookback and the scored target horizon.]

  • ml_prediction_anchor_month sets the single member month being scored.
  • Feature windows look backward from the anchor and are shortened by ml_claims_lag_months to account for incomplete recent claims.
  • The target horizon defines the forward prediction period, with duration set by ml_target_policy.horizon_months.

Claims lag adjustment

In most claims environments, the most recent 1-3 months of data are not fully adjudicated because claims are still being submitted and processed (sometimes called IBNR or "claims lag"). This means utilization counts, paid amounts, condition flags, and HCC assignments are understated for recent months.

Without adjustment, the model trains on fully adjudicated historical data but scores on incomplete recent data at prediction time. This train/predict mismatch causes systematic underprediction of risk.

How it works: ml_claims_lag_months trims the end of every claims-based feature lookback window by N months, in both training and prediction. The lookback start stays fixed, so the effective window shrinks. For example, with ml_claims_lag_months: 3:

  • A 12-month utilization window uses months [anchor-11, anchor-3] (9 months of claims)
  • A 6-month utilization window uses months [anchor-5, anchor-3] (3 months of claims)
  • A 3-month utilization window becomes empty (all zeros) since it falls entirely within the lag period
  • Condition and HCC features (12-month lookback) use 9 months of claims

This applies symmetrically to training and prediction, so the model learns from the same truncated-window pattern it will encounter at scoring time. Train/test metrics will be slightly lower (the model sees less signal), but they honestly reflect real-world prediction performance.

Forward-looking targets are not affected: "spend in the next 6 months" still uses the full outcome window from anchor+1 through anchor+6.
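The window arithmetic above can be sketched as follows. This is an illustrative sketch of the documented behavior, not the package's implementation; the function names are hypothetical:

```python
from datetime import date

def add_months(d: date, n: int) -> date:
    # Shift a first-of-month date by n months (n may be negative).
    total = d.year * 12 + (d.month - 1) + n
    return date(total // 12, total % 12 + 1, 1)

def feature_window(anchor: date, lookback_months: int, claims_lag_months: int):
    # The lookback start stays fixed; the end is trimmed by the claims lag.
    # Returns None when the window falls entirely inside the lag period.
    start = add_months(anchor, -(lookback_months - 1))
    end = add_months(anchor, -claims_lag_months)
    return (start, end) if start <= end else None

def target_window(anchor: date, horizon_months: int):
    # Forward-looking targets are not trimmed: anchor+1 through anchor+horizon.
    return add_months(anchor, 1), add_months(anchor, horizon_months)
```

With ml_claims_lag_months: 3 and an anchor of 2024-06, a 12-month window yields [2023-07, 2024-03] (9 months of claims), and a 3-month window is empty, matching the bullets above.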

Tuning guidance:

| Value | When to use |
| --- | --- |
| 0 | Fully adjudicated data, retrospective studies |
| 1 | Fast-paying payers, electronic-only claims |
| 2-3 | Typical commercial claims (default: 3) |
| 4+ | Slow payers, complex claim types |

Dev sampling vars

Use these to run the pipeline on a reduced population while iterating. This is useful for validating end-to-end wiring without full compute costs.

| Var | Type | Default | Description |
| --- | --- | --- | --- |
| ml_dev_sample_enabled | bool | false | Enables deterministic anchor-row downsampling. |
| ml_dev_sample_rows | int | 10000 | Maximum number of sampled anchor rows. |
| ml_dev_sample_seed | int | 42 | Seed for deterministic sampling. |
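Deterministic here means the same population and seed always yield the same sample. A minimal sketch of one way to achieve that (seeded-hash ordering; the key name person_id and the mechanism are assumptions, not the package's actual code):

```python
import hashlib

def dev_sample(anchor_rows: list[dict], max_rows: int, seed: int) -> list[dict]:
    # Rank rows by a seeded hash of their key, then keep the first max_rows.
    # Reruns with the same inputs and seed return exactly the same rows.
    def rank(row: dict) -> str:
        return hashlib.sha256(f"{seed}:{row['person_id']}".encode()).hexdigest()
    return sorted(anchor_rows, key=rank)[:max_rows]
```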

Policy vars

Feature policy

Controls which feature groups are included in training and prediction. feature_group can be a single value or a list:

ml_feature_policy:
  - feature_group: [demographics, utilization, conditions, hcc]

Each row is enabled by existence. To disable a group, omit it from ml_feature_policy.
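The scalar-or-list convention behaves like this sketch (illustrative only; not the package's code):

```python
def enabled_feature_groups(feature_policy: list[dict]) -> list[str]:
    # A group is enabled simply by being listed; feature_group may be
    # a single value or a list of values.
    groups: list[str] = []
    for row in feature_policy:
        fg = row["feature_group"]
        groups.extend(fg if isinstance(fg, list) else [fg])
    return groups
```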

Target policy

Controls which targets and horizons are active, and which claims are included in each target numerator.

  • target_measure: paid_amount or encounter_count
  • target_dimension: all, encounter_group, or encounter_type
  • target_values: omitted for all; otherwise required as a non-empty list
  • Encounter target values are validated against Tuva terminology seed terminology__encounter_type

ml_target_policy:
  - target_measure: paid_amount
    horizon_months: [6, 12]
    target_dimension: all
  - target_measure: encounter_count
    horizon_months: [12]
    target_dimension: encounter_group
    target_values: [inpatient]
  - target_measure: encounter_count
    horizon_months: [12]
    target_dimension: encounter_type
    target_values: [emergency department, ambulatory surgery center]
  - target_measure: paid_amount
    horizon_months: [12]
    target_dimension: encounter_type
    target_values: [acute inpatient]
  - target_measure: paid_amount
    horizon_months: [12]
    target_dimension: encounter_type
    target_values: [emergency department]

horizon_months can be a scalar or list. target_values must always be a list for encounter targets, even when there is only one value. If a single policy row includes multiple target_values, the package expands that row into one separate target per listed value.
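That expansion behaves like the following sketch (a hypothetical helper illustrating the documented behavior, not the package's code):

```python
def expand_target_policy(policy: list[dict]) -> list[tuple]:
    # One concrete target per (measure, horizon, dimension, value);
    # a scalar horizon_months is treated as a one-element list, and an
    # omitted target_values means the "all" dimension (value None here).
    targets = []
    for row in policy:
        horizons = row["horizon_months"]
        if not isinstance(horizons, list):
            horizons = [horizons]
        values = row.get("target_values") or [None]
        for h in horizons:
            for v in values:
                targets.append((row["target_measure"], h,
                                row["target_dimension"], v))
    return targets
```

A single encounter_type row listing two target_values therefore produces two separate targets, and a paid_amount row with horizon_months: [6, 12] produces one target per horizon.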

Examples:

  • Spend across all claims:

    - target_measure: paid_amount
      horizon_months: [6, 12]
      target_dimension: all

  • Spend limited to acute inpatient encounters:

    - target_measure: paid_amount
      horizon_months: [12]
      target_dimension: encounter_type
      target_values: [acute inpatient]

  • Spend limited to emergency department encounters:

    - target_measure: paid_amount
      horizon_months: [12]
      target_dimension: encounter_type
      target_values: [emergency department]

Optional count probabilities:

  • For encounter_count targets, the package can also output threshold probabilities such as "probability of at least 1 emergency department visit in the next 12 months."
  • These are controlled separately by ml_count_probability_policy.
  • For paid_amount targets, the package can also output threshold probabilities such as "probability this member is in the top 1% of spend next year."
  • Those spend percentile cutoffs are derived separately for each data_source and each spend target / horizon combination.

Each row is enabled by existence. To disable a target, remove it from ml_target_policy.

Count probability policy

Controls the k thresholds used to compute P(Y >= k) for count targets. threshold_k can be a single int or a list:

ml_count_probability_policy:
  - threshold_k: [1, 2, 3, 5]

Each row is enabled by existence. To disable a threshold, remove it from ml_count_probability_policy.
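The threshold computation can be sketched as below, assuming a predicted count distribution is available as a probability mass function (the package's internals may differ):

```python
def count_threshold_probs(count_pmf: dict[int, float],
                          thresholds: list[int]) -> dict[int, float]:
    # P(Y >= k) is the total pmf mass at counts greater than or equal to k.
    return {k: sum(p for count, p in count_pmf.items() if count >= k)
            for k in thresholds}
```

For a member whose predicted distribution is 70% zero visits, 20% one visit, and 10% two visits, threshold_k: [1, 2] yields P(Y >= 1) = 0.3 and P(Y >= 2) = 0.1.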

Spend percentile probability policy

Controls the percentile cutoffs used to compute P(spend in top k%) for paid_amount targets. top_percent can be a single numeric value or a list.

These percentile cutoffs are derived separately for each data_source and each spend target / horizon combination.

ml_spend_percentile_probability_policy:
  - top_percent: [1, 5]

Examples:

  • Probability a member lands in the top 1% of total spend in the next 12 months:

    ml_target_policy:
      - target_measure: paid_amount
        horizon_months: [12]
        target_dimension: all

    ml_spend_percentile_probability_policy:
      - top_percent: [1]

  • Probability a member lands in the top 5% of emergency department spend in the next 12 months:

    ml_target_policy:
      - target_measure: paid_amount
        horizon_months: [12]
        target_dimension: encounter_type
        target_values: [emergency department]

    ml_spend_percentile_probability_policy:
      - top_percent: [5]

Each row is enabled by existence. To disable a threshold, remove it from ml_spend_percentile_probability_policy.
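One way to derive a "top k%" cutoff from observed outcome spend is sketched below. This is an illustrative assumption about the mechanism, not the package's code; the package derives cutoffs separately per data_source and per spend target/horizon combination:

```python
def top_percent_cutoff(spend_values: list[float], top_percent: float) -> float:
    # The cutoff is the smallest spend value still inside the top k% of
    # members when spend is ranked from highest to lowest.
    ranked = sorted(spend_values, reverse=True)
    k = max(1, round(len(ranked) * top_percent / 100))
    return ranked[k - 1]
```

For a population of 100 members with distinct spends, top_percent: 5 selects the 5th-highest spend as the cutoff; members at or above it receive the "top 5%" label during training.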

Common run patterns

First-time run

dbt deps
dbt run --select package:illuminate_predictive_models

Validation checklist after the run completes:

  1. train_model_registry has rows with a trained_... or skipped_... status.
  2. predict_values has non-zero rows.
  3. train_metrics_long has both TRAIN and TEST scopes.

Monthly prediction cycle (reuse existing models)

Update ml_prediction_anchor_month to the target month and run. With ml_force_train omitted (default false), training is skipped if the model signature matches the current runtime, target, and feature configuration already recorded in train_model_registry_history:

dbt run --select package:illuminate_predictive_models --vars '{ml_prediction_anchor_month: 2026-02-01}'

Force retraining

Run training first, then re-generate all prediction and metric outputs:

dbt run --select train_model_registry --vars '{ml_force_train: true}'
dbt run --select predict_values predict_probabilities_long train_metrics_long

Low-cost development loop

Run with a downsampled anchor population to validate end-to-end wiring without full compute costs:

dbt run --select package:illuminate_predictive_models --vars '{ml_dev_sample_enabled: true, ml_dev_sample_rows: 10000, ml_dev_sample_seed: 20260301}'

A practical dbt_project.yml vars block with the settings most users will actually change:

vars:
  ml_enabled: true
  ml_train_anchor_start_month: "2017-01-01"  # optional; if unset, uses earliest anchor with a fully observed forward outcome window
  ml_train_anchor_end_month: "2017-12-01"    # optional; if unset, uses latest anchor with a fully observed forward outcome window
  ml_prediction_anchor_month: "2018-06-01"   # optional; if unset, uses the latest member month available in the anchor population
  ml_claims_lag_months: 3

  ml_feature_policy:
    - feature_group: [demographics, utilization, conditions, hcc]

  ml_target_policy:
    - target_measure: paid_amount
      horizon_months: [6, 12]
      target_dimension: all
    - target_measure: encounter_count
      horizon_months: [12]
      target_dimension: encounter_group
      target_values: [inpatient]
    - target_measure: encounter_count
      horizon_months: [12]
      target_dimension: encounter_type
      target_values: [emergency department, ambulatory surgery center]

  ml_count_probability_policy:
    - threshold_k: [1, 2, 3, 5]

  ml_spend_percentile_probability_policy:
    - top_percent: [1, 5]

Full reference example

vars:
  ml_enabled: true
  ml_force_train: false
  ml_train_anchor_start_month: "2017-01-01"  # optional; if unset, uses earliest anchor with a fully observed forward outcome window
  ml_train_anchor_end_month: "2017-12-01"    # optional; if unset, uses latest anchor with a fully observed forward outcome window
  ml_prediction_anchor_month: "2018-06-01"   # optional; if unset, uses the latest member month available in the anchor population
  ml_train_anchor_stride_months: 12
  ml_artifact_stage: null
  ml_random_seed: 42
  ml_test_size: 0.2
  ml_claims_lag_months: 3
  ml_dev_sample_enabled: false
  ml_dev_sample_rows: 10000
  ml_dev_sample_seed: 42

  ml_feature_policy:
    - feature_group: [demographics, utilization, conditions, hcc]

  ml_target_policy:
    - target_measure: paid_amount
      horizon_months: [6, 12]
      target_dimension: all
    - target_measure: encounter_count
      horizon_months: [12]
      target_dimension: encounter_group
      target_values: [inpatient]
    - target_measure: encounter_count
      horizon_months: [12]
      target_dimension: encounter_type
      target_values: [emergency department, ambulatory surgery center]
    - target_measure: paid_amount
      horizon_months: [12]
      target_dimension: encounter_type
      target_values: [acute inpatient]
    - target_measure: paid_amount
      horizon_months: [12]
      target_dimension: encounter_type
      target_values: [emergency department]

  ml_count_probability_policy:
    - threshold_k: [1, 2, 3, 5]

If the anchor-month vars are unset, those defaults are global within the run, not recalculated separately for each data_source. For training, that default window is based on label completeness only; early anchors may still have partial feature lookback history.