
Quickstart

This is the shortest path to running identity resolution with empi_dbt_unified.

The package supports Snowflake, Databricks, and Microsoft Fabric.

1. Add package dependency

In your host project's packages.yml:

packages:
  - git: "https://github.com/illuminatehealth/empi_dbt.git"
    revision: main

Then install:

dbt deps

2. Create pre-contract models

Your project must provide empi_pre__* models that map your source data into the expected schema. At a minimum, for claims-only linkage:

-- models/pre_contract/empi_pre__eligibility.sql
select
    source_system,
    source_id,
    first_name,
    middle_name,
    last_name,
    birth_date,
    sex,
    social_security_number,
    address,
    city,
    state,
    zip_code,
    phone,
    email,
    data_source,
    ingest_datetime,
    file_date
from {{ ref('your_eligibility_staging') }}

See Outputs and Contract for the full column spec.

3. Configure vars

In your dbt_project.yml:

vars:
  claims_enabled: true
  clinical_enabled: true
  provider_attribution_enabled: true

Optional development vars:

| Variable | Default | Purpose |
| --- | --- | --- |
| empi_source_person_limit | null | Row cap for dev runs |
| empi_enable_case_suite | false | Enable package regression tests |
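
These go in the same vars block as the required toggles. For example, a development profile might look like this (the limit value here is illustrative, not a recommendation):

```yaml
vars:
  # Development-only overrides; remove or revert for production runs
  empi_source_person_limit: 10000   # illustrative cap for faster iteration
  empi_enable_case_suite: true      # run the package's regression tests
```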

4. Seed configuration

The package ships seed CSVs that control all linkage behavior. Load them first:

dbt seed --select package:empi_dbt_unified

This loads:

  • empi_blocking_rules: candidate pair generation predicates
  • empi_scoring_rules: Splink comparison definitions
  • empi_priors: Fellegi-Sunter m/u prior probabilities
  • empi_hard_rules: post-match disqualifiers
  • empi_runtime_config: thresholds and toggles
  • empi_survivorship_rules: golden record attribute selection

See Configuration for details on each seed.

5. Build the EMPI pipeline

dbt build --select tag:empi

In shared projects with multiple packages, scope the selector:

dbt build --select package:empi_dbt_unified,tag:empi

What happens:

  1. Staging models standardize and combine claims + clinical demographics.
  2. Adapter-specific Python models score candidate pairs and assign clusters.
  3. Pair decisions apply thresholds and any existing manual overrides.
  4. Survivorship selects golden-record attributes per person.
  5. Final outputs publish to empi, core, and input_layer schemas.
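
As a rough illustration of step 3, thresholded pair decisions conceptually resemble the query below. This is a hypothetical sketch, not the package's actual model: the table name scored_pairs, the column names, and the threshold values are all assumptions for illustration; the real thresholds live in the empi_runtime_config seed.

```sql
-- Hypothetical sketch only: shows how probabilistic match scores
-- typically map to decisions. Real model and column names differ.
select
    pair_id,
    match_probability,
    case
        when match_probability >= 0.95 then 'auto_match'  -- above upper threshold
        when match_probability >= 0.80 then 'review'      -- between thresholds: sent to work queue
        else 'non_match'
    end as decision
from scored_pairs
```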

6. Validate outputs

Check these primary tables:

-- Core crosswalk: source records → resolved person_id
select * from empi.person_crosswalk limit 100;

-- Golden record: one row per person with consolidated attributes
select * from empi.person_attrs limit 100;

-- Work queue: pairs flagged for manual review
select count(*) from empi.work_queue;

-- Person dimension
select * from core.person limit 100;

-- Remapped input layer (spot-check person_id values)
select person_id, count(*) from input_layer.eligibility group by 1 order by 2 desc limit 20;
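
Beyond spot checks, one useful invariant is that each source record resolves to exactly one person_id. A minimal sketch, assuming the crosswalk carries the source_system and source_id columns from the pre-contract schema:

```sql
-- Should return zero rows: each (source_system, source_id)
-- must map to exactly one resolved person_id
select source_system, source_id, count(distinct person_id) as n_persons
from empi.person_crosswalk
group by source_system, source_id
having count(distinct person_id) > 1;
```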

7. Full refresh

To rebuild all incremental models from scratch:

dbt build --full-refresh --select tag:empi

Use this after significant configuration changes (e.g., modifying blocking rules or thresholds).