Quickstart
This is the shortest path to running identity resolution with empi_dbt_unified.
The package supports Snowflake, Databricks, and Microsoft Fabric.
1. Add package dependency
In your host project packages.yml:
packages:
  - git: "https://github.com/illuminatehealth/empi_dbt.git"
    revision: main
Then install:
dbt deps
2. Create pre-contract models
Your project must provide empi_pre__* models that map your source data into the expected schema. At a minimum, claims-only linkage needs the following model:
-- models/pre_contract/empi_pre__eligibility.sql
select
    source_system,
    source_id,
    first_name,
    middle_name,
    last_name,
    birth_date,
    sex,
    social_security_number,
    address,
    city,
    state,
    zip_code,
    phone,
    email,
    data_source,
    ingest_datetime,
    file_date
from {{ ref('your_eligibility_staging') }}
See Outputs and Contract for the full column spec.
3. Configure vars
In your dbt_project.yml:
vars:
  claims_enabled: true
  clinical_enabled: true
  provider_attribution_enabled: true
Optional development vars:
| Variable | Default | Purpose |
|---|---|---|
| empi_source_person_limit | null | Row cap for dev runs |
| empi_enable_case_suite | false | Enable package regression tests |
4. Seed configuration
The package ships seed CSVs that control all linkage behavior. Load them first:
dbt seed --select package:empi_dbt_unified
This loads:
- empi_blocking_rules: candidate pair generation predicates
- empi_scoring_rules: Splink comparison definitions
- empi_priors: Fellegi-Sunter m/u prior probabilities
- empi_hard_rules: post-match disqualifiers
- empi_runtime_config: thresholds and toggles
- empi_survivorship_rules: golden record attribute selection
See Configuration for details on each seed.
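For readers new to the m/u priors held in empi_priors, the sketch below shows how the textbook Fellegi-Sunter model turns them into a match weight and then into a match probability. It is an illustration only: the function names and the example numbers are assumptions, not the package's seed contents or its actual scoring code.

```python
import math
from typing import Sequence


def match_weight(m: float, u: float) -> float:
    """Fellegi-Sunter match weight for one comparison level: log2(m / u).

    m = P(fields agree | records are the same person)
    u = P(fields agree | records are different people)
    """
    if not (0 < m <= 1 and 0 < u <= 1):
        raise ValueError("m and u must be probabilities in (0, 1]")
    return math.log2(m / u)


def match_probability(prior: float, weights: Sequence[float]) -> float:
    """Combine a prior match probability with per-field match weights."""
    if not (0 < prior < 1):
        raise ValueError("prior must be strictly between 0 and 1")
    prior_odds = prior / (1 - prior)
    # Weights are log2 Bayes factors, so they add; odds multiply by 2**sum.
    posterior_odds = prior_odds * 2 ** sum(weights)
    return posterior_odds / (1 + posterior_odds)


# Hypothetical priors: agreement on last name and on birth date.
weights = [match_weight(m=0.9, u=0.01), match_weight(m=0.95, u=0.003)]
probability = match_probability(prior=0.001, weights=weights)
```

A positive weight means agreement on that field is evidence for a match, and a negative weight is evidence against one.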
5. Build the EMPI pipeline
dbt build --select tag:empi
In shared projects with multiple packages, scope the selector:
dbt build --select package:empi_dbt_unified,tag:empi
What happens:
- Staging models standardize and combine claims + clinical demographics.
- Adapter-specific Python models score candidate pairs and assign clusters.
- Pair decisions apply thresholds and any existing manual overrides.
- Survivorship selects golden-record attributes per person.
- Final outputs publish to the empi, core, and input_layer schemas.
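The pair-decision and clustering steps above can be pictured with a small sketch. Everything in it is an assumption made for illustration: the threshold values, the decision labels, and the use of connected components for clustering are not taken from the package, which reads its real settings from empi_runtime_config.

```python
from typing import Dict, Hashable, Iterable, Optional, Tuple


def decide_pair(
    probability: float,
    match_threshold: float = 0.95,   # hypothetical value
    review_threshold: float = 0.80,  # hypothetical value
    override: Optional[str] = None,
) -> str:
    """Return 'match', 'review', or 'non_match' for one candidate pair."""
    if override is not None:
        # A manual decision always wins over the model score.
        return override
    if probability >= match_threshold:
        return "match"
    if probability >= review_threshold:
        return "review"  # would land in a manual work queue
    return "non_match"


def assign_clusters(
    matched_pairs: Iterable[Tuple[Hashable, Hashable]],
) -> Dict[Hashable, Hashable]:
    """Group records linked by matched pairs (connected components, union-find)."""
    parent: Dict[Hashable, Hashable] = {}

    def find(x: Hashable) -> Hashable:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Map every record to the representative of its cluster.
    return {record: find(record) for record in list(parent)}


clusters = assign_clusters([("a", "b"), ("b", "c"), ("d", "e")])
```

With these assumptions, records a, b, and c share one cluster representative, while d and e share another.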
6. Validate outputs
Check these primary tables:
-- Core crosswalk: source records → resolved person_id
select * from empi.person_crosswalk limit 100;
-- Golden record: one row per person with consolidated attributes
select * from empi.person_attrs limit 100;
-- Work queue: pairs flagged for manual review
select count(*) from empi.work_queue;
-- Person dimension
select * from core.person limit 100;
-- Remapped input layer (spot-check person_id values)
select person_id, count(*) from input_layer.eligibility group by 1 order by 2 desc limit 20;
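If you pull crosswalk rows into Python for a spot check, one useful invariant is that each source record resolves to exactly one person_id. The sketch below assumes the crosswalk exposes source_system, source_id, and person_id columns; that layout is inferred from the pre-contract model and is not confirmed here, so check it against Outputs and Contract first.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Set, Tuple

# Assumed row layout: (source_system, source_id, person_id)
Row = Tuple[Hashable, Hashable, Hashable]


def find_conflicting_sources(
    rows: Iterable[Row],
) -> Dict[Tuple[Hashable, Hashable], Set[Hashable]]:
    """Return source records that map to more than one person_id."""
    seen: Dict[Tuple[Hashable, Hashable], Set[Hashable]] = defaultdict(set)
    for source_system, source_id, person_id in rows:
        seen[(source_system, source_id)].add(person_id)
    return {key: ids for key, ids in seen.items() if len(ids) > 1}


# Hypothetical rows; in practice these come from your warehouse client.
sample_rows = [
    ("claims", "A1", "p-001"),
    ("claims", "A2", "p-001"),
    ("clinical", "Z9", "p-002"),
    ("clinical", "Z9", "p-003"),  # same source record, two persons
]
conflicts = find_conflicting_sources(sample_rows)
```

An empty result means the invariant holds for the rows you sampled.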
7. Full refresh
To rebuild all incremental models from scratch:
dbt build --full-refresh --select tag:empi
Use this after significant configuration changes (e.g., modifying blocking rules or thresholds).