Configuration Reference

All package behavior is controlled through dbt vars. No seed files or code changes are required. Start with the defaults and override only what you need.

Recommended approach:

  1. Define project-level vars in dbt_project.yml for settings that apply to all runs.
  2. Use --vars on the command line for one-off overrides (e.g., forcing a retrain or targeting a specific prediction month).

Core runtime vars

| Var | Type | Default | Description |
| --- | --- | --- | --- |
| ml_enabled | bool | true | Enables or disables all ML models in the package; when omitted, runs as enabled. |
| ml_force_train | bool | false | false reuses a matching prior model bundle from train_model_registry_history; true always trains a new model; when omitted, behaves as false. |
| ml_train_anchor_start_month | date string or null | null | Inclusive lower bound for training anchors; when omitted/null there is no manual lower bound, and training uses the earliest anchor whose forward outcome window is fully observable in the run. |
| ml_train_anchor_end_month | date string or null | null | Inclusive upper bound for training anchors; when omitted/null there is no manual upper bound, and training uses the latest anchor whose forward outcome window is fully observable in the run. |
| ml_train_anchor_stride_months | int | 12 | Deterministic per-person anchor striding to reduce overlap in training rows. |
| ml_prediction_anchor_month | date string or null | null | Month to generate predictions for; defaults to the latest available anchor month in the anchor population if not set. |
| ml_artifact_stage | stage URI or null | null | Artifact storage location for trained model bundles. In Snowflake, this defaults to @<db>.<schema>.ML_MODEL_STAGE. |
| ml_random_seed | int | 42 | Random seed for the train/test split and model training. |
| ml_test_size | float | 0.2 | Fraction of data reserved for the test split. |
| ml_claims_lag_months | int | 3 | Number of recent months of claims to exclude from feature lookback windows (see Claims Lag Adjustment below). |
| tuva_schema_prefix | string or null | null | If set, writes package outputs to {tuva_schema_prefix}_ml. |

Null or undefined runtime vars do not raise errors; the package falls back to the defaults above.

For the training anchor defaults, "complete" refers to the forward-looking label window, not the backward-looking feature window. Early anchors can still have less than 12 months of prior history; those rows remain eligible for training and the model accounts for that with features such as member_months_lookback_12m and cold_start_flag.
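As an illustration of that default, the eligible anchor range can be derived from the last fully observed claims month and the target horizon. This is a sketch under that assumption, not the package's actual code; both function names are hypothetical:

```python
from datetime import date

def add_months(d: date, n: int) -> date:
    # Shift a first-of-month date by n months (n may be negative).
    total = d.year * 12 + (d.month - 1) + n
    return date(total // 12, total % 12 + 1, 1)

def default_train_anchor_bounds(earliest_anchor: date,
                                latest_claims_month: date,
                                horizon_months: int) -> tuple[date, date]:
    # Keep every anchor whose forward outcome window
    # (anchor+1 .. anchor+horizon) is fully observable in the data.
    return earliest_anchor, add_months(latest_claims_month, -horizon_months)
```

For example, with claims observed through 2018-12 and a 12-month horizon, the latest eligible training anchor is 2017-12, since its outcome window (2018-01 through 2018-12) is fully observable.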

Prediction window reference

[Diagram: prediction window layout showing how a single anchor month defines the feature lookback and the scored target horizon.]

  • ml_prediction_anchor_month sets the single member month being scored.
  • Feature windows look backward from the anchor and are shortened by ml_claims_lag_months to account for incomplete recent claims.
  • The target horizon defines the forward prediction period, with duration set by ml_target_policy.horizon_months.

Claims lag adjustment

In most claims environments, the most recent 1-3 months of data are not fully adjudicated because claims are still being submitted and processed (sometimes called IBNR or "claims lag"). This means utilization counts, paid amounts, condition flags, and HCC assignments are understated for recent months.

Without adjustment, the model trains on fully adjudicated historical data but scores on incomplete recent data at prediction time. This train/predict mismatch causes systematic underprediction of risk.

How it works: ml_claims_lag_months trims the end of every claims-based feature lookback window by N months, in both training and prediction. The lookback start stays fixed, so the effective window shrinks. For example, with ml_claims_lag_months: 3:

  • A 12-month utilization window uses months [anchor-11, anchor-3] (9 months of claims)
  • A 6-month utilization window uses months [anchor-5, anchor-3] (3 months of claims)
  • A 3-month utilization window becomes empty (all zeros) since it falls entirely within the lag period
  • Condition and HCC features (12-month lookback) use 9 months of claims

This applies symmetrically to training and prediction, so the model learns from the same truncated-window pattern it will encounter at scoring time. Train/test metrics will be slightly lower (the model sees less signal), but they honestly reflect real-world prediction performance.

Forward-looking targets are not affected: "spend in the next 6 months" still uses the full outcome window from anchor+1 through anchor+6.
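The window arithmetic above can be sketched as follows. This is an illustrative sketch of the documented behavior, not the package's implementation; the function names are hypothetical:

```python
from datetime import date

def add_months(d: date, n: int) -> date:
    # Shift a first-of-month date by n months (n may be negative).
    total = d.year * 12 + (d.month - 1) + n
    return date(total // 12, total % 12 + 1, 1)

def feature_window(anchor: date, lookback_months: int, claims_lag_months: int):
    # The lookback start stays fixed; the end is trimmed by the claims lag.
    # Returns None when the window falls entirely inside the lag period.
    start = add_months(anchor, -(lookback_months - 1))
    end = add_months(anchor, -claims_lag_months)
    return (start, end) if start <= end else None

def target_window(anchor: date, horizon_months: int):
    # Forward-looking targets are not trimmed: anchor+1 through anchor+horizon.
    return add_months(anchor, 1), add_months(anchor, horizon_months)
```

With ml_claims_lag_months: 3 and an anchor of 2024-06, a 12-month window yields [2023-07, 2024-03] (9 months of claims), and a 3-month window is empty, matching the bullets above.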

Tuning guidance:

| Value | When to use |
| --- | --- |
| 0 | Fully adjudicated data, retrospective studies |
| 1 | Fast-paying payers, electronic-only claims |
| 2-3 | Typical commercial claims (default: 3) |
| 4+ | Slow payers, complex claim types |

Dev sampling vars

Use these to run the pipeline on a reduced population while iterating. This is useful for validating end-to-end wiring without full compute costs.

| Var | Type | Default | Description |
| --- | --- | --- | --- |
| ml_dev_sample_enabled | bool | false | Enables deterministic anchor-row downsampling. |
| ml_dev_sample_rows | int | 10000 | Maximum number of sampled anchor rows. |
| ml_dev_sample_seed | int | 42 | Seed for deterministic sampling. |
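Deterministic here means the same population and seed always yield the same sample. A minimal sketch of one way to achieve that (seeded-hash ordering; the key name person_id and the mechanism are assumptions, not the package's actual code):

```python
import hashlib

def dev_sample(anchor_rows: list[dict], max_rows: int, seed: int) -> list[dict]:
    # Rank rows by a seeded hash of their key, then keep the first max_rows.
    # Reruns with the same inputs and seed return exactly the same rows.
    def rank(row: dict) -> str:
        return hashlib.sha256(f"{seed}:{row['person_id']}".encode()).hexdigest()
    return sorted(anchor_rows, key=rank)[:max_rows]
```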

Policy vars

Feature policy

Controls which feature groups are included in training and prediction. feature_group can be a single value or a list:

ml_feature_policy:
  - feature_group: [demographics, utilization, conditions, hcc]

Each row is enabled by existence. To disable a group, omit it from ml_feature_policy.
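The scalar-or-list convention behaves like this sketch (illustrative only; not the package's code):

```python
def enabled_feature_groups(feature_policy: list[dict]) -> list[str]:
    # A group is enabled simply by being listed; feature_group may be
    # a single value or a list of values.
    groups: list[str] = []
    for row in feature_policy:
        fg = row["feature_group"]
        groups.extend(fg if isinstance(fg, list) else [fg])
    return groups
```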

Target policy

Controls which targets and horizons are active, and which claims are included in each target numerator.

  • target_measure: paid_amount or encounter_count
  • target_dimension: all, encounter_group, or encounter_type
  • target_values: omitted for all; otherwise required as a non-empty list
  • Encounter target values are validated against Tuva terminology seed terminology__encounter_type

ml_target_policy:
  - target_measure: paid_amount
    horizon_months: [6, 12]
    target_dimension: all
  - target_measure: encounter_count
    horizon_months: [12]
    target_dimension: encounter_group
    target_values: [inpatient]
  - target_measure: encounter_count
    horizon_months: [12]
    target_dimension: encounter_type
    target_values: [emergency department, ambulatory surgery center]
  - target_measure: paid_amount
    horizon_months: [12]
    target_dimension: encounter_type
    target_values: [acute inpatient]
  - target_measure: paid_amount
    horizon_months: [12]
    target_dimension: encounter_type
    target_values: [emergency department]

horizon_months can be a scalar or list. target_values must always be a list for encounter targets, even when there is only one value. If a single policy row includes multiple target_values, the package expands that row into one separate target per listed value.
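That expansion behaves like the following sketch (a hypothetical helper illustrating the documented behavior, not the package's code):

```python
def expand_target_policy(policy: list[dict]) -> list[tuple]:
    # One concrete target per (measure, horizon, dimension, value);
    # a scalar horizon_months is treated as a one-element list, and an
    # omitted target_values means the "all" dimension (value None here).
    targets = []
    for row in policy:
        horizons = row["horizon_months"]
        if not isinstance(horizons, list):
            horizons = [horizons]
        values = row.get("target_values") or [None]
        for h in horizons:
            for v in values:
                targets.append((row["target_measure"], h,
                                row["target_dimension"], v))
    return targets
```

A single encounter_type row listing two target_values therefore produces two separate targets, and a paid_amount row with horizon_months: [6, 12] produces one target per horizon.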

Examples:

  • Spend across all claims:

    - target_measure: paid_amount
      horizon_months: [6, 12]
      target_dimension: all

  • Spend limited to acute inpatient encounters:

    - target_measure: paid_amount
      horizon_months: [12]
      target_dimension: encounter_type
      target_values: [acute inpatient]

  • Spend limited to emergency department encounters:

    - target_measure: paid_amount
      horizon_months: [12]
      target_dimension: encounter_type
      target_values: [emergency department]

Optional count probabilities:

  • For encounter_count targets, the package can also output threshold probabilities such as "probability of at least 1 emergency department visit in the next 12 months."
  • These are controlled separately by ml_count_probability_policy.
  • For paid_amount targets, the package can also output threshold probabilities such as "probability this member is in the top 1% of spend next year."
  • Those spend percentile cutoffs are derived separately for each data_source and each spend target / horizon combination.

Each row is enabled by existence. To disable a target, remove it from ml_target_policy.

Count probability policy

Controls the k thresholds used to compute P(Y >= k) for count targets. threshold_k can be a single int or a list:

ml_count_probability_policy:
  - threshold_k: [1, 2, 3, 5]

Each row is enabled by existence. To disable a threshold, remove it from ml_count_probability_policy.
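The threshold computation can be sketched as below, assuming a predicted count distribution is available as a probability mass function (the package's internals may differ):

```python
def count_threshold_probs(count_pmf: dict[int, float],
                          thresholds: list[int]) -> dict[int, float]:
    # P(Y >= k) is the total pmf mass at counts greater than or equal to k.
    return {k: sum(p for count, p in count_pmf.items() if count >= k)
            for k in thresholds}
```

For a member whose predicted distribution is 70% zero visits, 20% one visit, and 10% two visits, threshold_k: [1, 2] yields P(Y >= 1) = 0.3 and P(Y >= 2) = 0.1.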

Spend percentile probability policy

Controls the percentile cutoffs used to compute P(spend in top k%) for paid_amount targets. top_percent can be a single numeric value or a list.

These percentile cutoffs are derived separately for each data_source and each spend target / horizon combination.

ml_spend_percentile_probability_policy:
  - top_percent: [1, 5]

Examples:

  • Probability a member lands in the top 1% of total spend in the next 12 months:

    ml_target_policy:
      - target_measure: paid_amount
        horizon_months: [12]
        target_dimension: all

    ml_spend_percentile_probability_policy:
      - top_percent: [1]

  • Probability a member lands in the top 5% of emergency department spend in the next 12 months:

    ml_target_policy:
      - target_measure: paid_amount
        horizon_months: [12]
        target_dimension: encounter_type
        target_values: [emergency department]

    ml_spend_percentile_probability_policy:
      - top_percent: [5]

Each row is enabled by existence. To disable a threshold, remove it from ml_spend_percentile_probability_policy.
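One way to derive a "top k%" cutoff from observed outcome spend is sketched below. This is an illustrative assumption about the mechanism, not the package's code; the package derives cutoffs separately per data_source and per spend target/horizon combination:

```python
def top_percent_cutoff(spend_values: list[float], top_percent: float) -> float:
    # The cutoff is the smallest spend value still inside the top k% of
    # members when spend is ranked from highest to lowest.
    ranked = sorted(spend_values, reverse=True)
    k = max(1, round(len(ranked) * top_percent / 100))
    return ranked[k - 1]
```

For a population of 100 members with distinct spends, top_percent: 5 selects the 5th-highest spend as the cutoff; members at or above it receive the "top 5%" label during training.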

Common run patterns

First-time run

dbt deps
dbt run --select package:illuminate_predictive_models

Validation checklist after the run completes:

  1. train_model_registry has rows with a trained_... or skipped_... status.
  2. predict_values has non-zero rows.
  3. train_metrics_long has both TRAIN and TEST scopes.

Monthly prediction cycle (reuse existing models)

Update ml_prediction_anchor_month to the target month and run. With ml_force_train omitted (default false), training is skipped if the model signature matches the current runtime, target, and feature configuration already recorded in train_model_registry_history:

dbt run --select package:illuminate_predictive_models --vars '{ml_prediction_anchor_month: 2026-02-01}'

Force retraining

Run training first, then re-generate all prediction and metric outputs:

dbt run --select train_model_registry --vars '{ml_force_train: true}'
dbt run --select predict_values predict_probabilities_long train_metrics_long

Low-cost development loop

Run with a downsampled anchor population to validate end-to-end wiring without full compute costs:

dbt run --select package:illuminate_predictive_models --vars '{ml_dev_sample_enabled: true, ml_dev_sample_rows: 10000, ml_dev_sample_seed: 20260301}'

A practical dbt_project.yml vars block with the settings most users will actually change:

vars:
  ml_enabled: true
  ml_train_anchor_start_month: "2017-01-01"  # optional; if unset, uses earliest anchor with a fully observed forward outcome window
  ml_train_anchor_end_month: "2017-12-01"    # optional; if unset, uses latest anchor with a fully observed forward outcome window
  ml_prediction_anchor_month: "2018-06-01"   # optional; if unset, uses the latest member month available in the anchor population
  ml_claims_lag_months: 3

  ml_feature_policy:
    - feature_group: [demographics, utilization, conditions, hcc]

  ml_target_policy:
    - target_measure: paid_amount
      horizon_months: [6, 12]
      target_dimension: all
    - target_measure: encounter_count
      horizon_months: [12]
      target_dimension: encounter_group
      target_values: [inpatient]
    - target_measure: encounter_count
      horizon_months: [12]
      target_dimension: encounter_type
      target_values: [emergency department, ambulatory surgery center]

  ml_count_probability_policy:
    - threshold_k: [1, 2, 3, 5]

  ml_spend_percentile_probability_policy:
    - top_percent: [1, 5]

Full reference example

vars:
  ml_enabled: true
  ml_force_train: false
  ml_train_anchor_start_month: "2017-01-01"  # optional; if unset, uses earliest anchor with a fully observed forward outcome window
  ml_train_anchor_end_month: "2017-12-01"    # optional; if unset, uses latest anchor with a fully observed forward outcome window
  ml_prediction_anchor_month: "2018-06-01"   # optional; if unset, uses the latest member month available in the anchor population
  ml_train_anchor_stride_months: 12
  ml_artifact_stage: null
  ml_random_seed: 42
  ml_test_size: 0.2
  ml_claims_lag_months: 3
  ml_dev_sample_enabled: false
  ml_dev_sample_rows: 10000
  ml_dev_sample_seed: 42

  ml_feature_policy:
    - feature_group: [demographics, utilization, conditions, hcc]

  ml_target_policy:
    - target_measure: paid_amount
      horizon_months: [6, 12]
      target_dimension: all
    - target_measure: encounter_count
      horizon_months: [12]
      target_dimension: encounter_group
      target_values: [inpatient]
    - target_measure: encounter_count
      horizon_months: [12]
      target_dimension: encounter_type
      target_values: [emergency department, ambulatory surgery center]
    - target_measure: paid_amount
      horizon_months: [12]
      target_dimension: encounter_type
      target_values: [acute inpatient]
    - target_measure: paid_amount
      horizon_months: [12]
      target_dimension: encounter_type
      target_values: [emergency department]

  ml_count_probability_policy:
    - threshold_k: [1, 2, 3, 5]

If the anchor-month vars are unset, those defaults are global within the run, not recalculated separately for each data_source. For training, that default window is based on label completeness only; early anchors may still have partial feature lookback history.