
Quickstart

This is the shortest path to running identity resolution with empi_dbt_unified.

The package supports Snowflake, Databricks, and Microsoft Fabric.

1. Add package dependency

In your host project's packages.yml:

packages:
  - git: "https://github.com/illuminatehealth/empi_dbt.git"
    revision: main

Then install:

dbt deps

2. Create pre-contract models

Your project must provide empi_pre__* models that map your source data into the expected schema. At a minimum, for claims-only linkage:

-- models/pre_contract/empi_pre__eligibility.sql
select
    source_system,
    source_id,
    first_name,
    middle_name,
    last_name,
    birth_date,
    sex,
    social_security_number,
    address,
    city,
    state,
    zip_code,
    phone,
    email,
    data_source,
    ingest_datetime,
    file_date
from {{ ref('your_eligibility_staging') }}

See Outputs and Contract for the full column spec.

3. Configure vars

In your dbt_project.yml:

vars:
  claims_enabled: true
  clinical_enabled: true
  provider_attribution_enabled: true

Optional development vars:

| Variable | Default | Purpose |
| --- | --- | --- |
| empi_source_person_limit | null | Row cap for dev runs |
| empi_enable_case_suite | false | Enable package regression tests |
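
These go in the same vars block as the required toggles. For example, a development profile might look like this (the limit value here is illustrative, not a recommendation):

```yaml
vars:
  # Development-only overrides; remove or revert for production runs
  empi_source_person_limit: 10000   # illustrative cap for faster iteration
  empi_enable_case_suite: true      # run the package's regression tests
```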

4. Seed configuration

The package ships seed CSVs that control all linkage behavior. Load them first:

dbt seed --select package:empi_dbt_unified

This loads:

  • empi_blocking_rules: candidate pair generation predicates
  • empi_scoring_rules: Splink comparison definitions
  • empi_priors: Fellegi-Sunter m/u prior probabilities
  • empi_hard_rules: post-match disqualifiers
  • empi_runtime_config: thresholds and toggles
  • empi_survivorship_rules: golden record attribute selection

See Configuration for details on each seed.

5. Build the EMPI pipeline

dbt build --select tag:empi

In shared projects with multiple packages, scope the selector:

dbt build --select package:empi_dbt_unified,tag:empi

What happens:

  1. Staging models standardize and combine claims + clinical demographics.
  2. Adapter-specific Python models score candidate pairs and assign clusters.
  3. Pair decisions apply thresholds and any existing manual overrides.
  4. Survivorship selects golden-record attributes per person.
  5. Final outputs publish to empi, core, and input_layer schemas.
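
As a rough illustration of step 3, thresholded pair decisions conceptually resemble the query below. This is a hypothetical sketch, not the package's actual model: the table name scored_pairs, the column names, and the threshold values are all assumptions for illustration; the real thresholds live in the empi_runtime_config seed.

```sql
-- Hypothetical sketch only: shows how probabilistic match scores
-- typically map to decisions. Real model and column names differ.
select
    pair_id,
    match_probability,
    case
        when match_probability >= 0.95 then 'auto_match'  -- above upper threshold
        when match_probability >= 0.80 then 'review'      -- between thresholds: sent to work queue
        else 'non_match'
    end as decision
from scored_pairs
```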

6. Validate outputs

Check these primary tables:

-- Core crosswalk: source records → resolved person_id
select * from empi.person_crosswalk limit 100;

-- Golden record: one row per person with consolidated attributes
select * from empi.person_attrs limit 100;

-- Work queue: pairs flagged for manual review
select count(*) from empi.work_queue;

-- Person dimension
select * from core.person limit 100;

-- Remapped input layer (spot-check person_id values)
select person_id, count(*) from input_layer.eligibility group by 1 order by 2 desc limit 20;
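
Beyond spot checks, one useful invariant is that each source record resolves to exactly one person_id. A minimal sketch, assuming the crosswalk carries the source_system and source_id columns from the pre-contract schema:

```sql
-- Should return zero rows: each (source_system, source_id)
-- must map to exactly one resolved person_id
select source_system, source_id, count(distinct person_id) as n_persons
from empi.person_crosswalk
group by source_system, source_id
having count(distinct person_id) > 1;
```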

7. Full refresh

To rebuild all incremental models from scratch:

dbt build --full-refresh --select tag:empi

Use this after significant configuration changes (e.g., modifying blocking rules or thresholds).