Semantic Similarity for Dental EDI

High-performance document comparison using SimHash, MinHash, and ensemble scoring.

Features

SimHash

Content-based similarity using locality-sensitive hashing for fast comparisons.

MinHash

Token-based Jaccard similarity estimation for accurate overlap detection.

Ensemble Scoring

Combines multiple algorithms with adaptive weights for optimal accuracy.

Dental EDI Support

Built-in synonym resolution for common dental field name variations.

Performance

Comparison Time: <20µs
Memory Efficient: Zero-copy design
Type-Safe: Compile-time checks

How Commodus Works

Commodus uses multiple algorithms to detect similar records, even when field names or formats differ. Each algorithm contributes to a weighted ensemble score.

SimHash (Content Fingerprinting)

Creates a compact "fingerprint" of document content that stays similar even with small changes. Think of it like a document's DNA.

Technical Details

Algorithm: Locality-Sensitive Hashing (LSH) that maps documents to 64-bit signatures.

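How it works: Hashes each token, accumulates a weighted vote for each of the 64 bit positions, then thresholds each position to produce the final signature.

Similarity measure: Hamming distance between signatures, normalized to [0, 1] so that identical signatures score 1 (typically 1 − distance / 64).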

Complexity: O(n) where n is document length.

Key property: Symmetric — similarity(A, B) = similarity(B, A)
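
The steps above can be sketched in a few lines of Python. This is an illustrative sketch, not Commodus's actual implementation: the function names, the choice of BLAKE2b as the token hash, the default weight of 1 per token, and the 1 − distance / 64 normalization are all assumptions made for the example.

```python
import hashlib

BITS = 64  # signature width stated in the text


def token_hash(token: str) -> int:
    """Stable 64-bit hash of a token (BLAKE2b is an assumption, any good hash works)."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def simhash(tokens, weights=None) -> int:
    """Build a 64-bit SimHash signature from an iterable of tokens."""
    votes = [0] * BITS
    for tok in tokens:
        w = 1 if weights is None else weights.get(tok, 1)
        h = token_hash(tok)
        for bit in range(BITS):
            # A set bit votes +w for that position, a clear bit votes -w.
            votes[bit] += w if (h >> bit) & 1 else -w
    signature = 0
    for bit in range(BITS):
        if votes[bit] > 0:  # threshold step: keep positions with a positive total
            signature |= 1 << bit
    return signature


def simhash_similarity(sig_a: int, sig_b: int) -> float:
    """Normalized similarity: 1.0 for identical signatures, 0.0 for opposite ones."""
    hamming = bin(sig_a ^ sig_b).count("1")
    return 1.0 - hamming / BITS
```

Because XOR and the bit count do not depend on argument order, the symmetry property follows directly from this definition.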

MinHash (Set Similarity)

Estimates how many tokens (words, numbers, identifiers) two documents have in common, even when the set of possible token values runs into the millions.

Technical Details

Algorithm: Probabilistic data structure for estimating Jaccard similarity between sets.

How it works: Applies k hash functions to every token and keeps the minimum value produced by each function. The fraction of signature positions on which two documents agree estimates their Jaccard similarity.

Similarity measure: Jaccard index — |A ∩ B| / |A ∪ B|

Signature size: 128 hashes (configurable from 64 to 256)

Key property: Order-independent — same tokens in any order produce same result
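
A minimal Python sketch of this estimator follows. It is not Commodus's code: the function names are hypothetical, and deriving the k hash functions from BLAKE2b with a per-function salt is an assumption chosen to keep the example self-contained.

```python
import hashlib


def _seeded_hash(token: str, seed: int) -> int:
    """64-bit hash of a token; the seed selects one of the k hash functions."""
    salt = seed.to_bytes(8, "big")
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(tokens, num_hashes: int = 128) -> list:
    """For each hash function, keep the minimum hash over the token set."""
    token_set = set(tokens)  # duplicates and ordering are discarded here
    if not token_set:
        raise ValueError("cannot build a MinHash signature from an empty token set")
    return [min(_seeded_hash(t, seed) for t in token_set) for seed in range(num_hashes)]


def minhash_similarity(sig_a, sig_b) -> float:
    """Estimate the Jaccard index as the fraction of matching signature positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

Converting the input to a set before hashing is what makes the result order-independent. A larger signature reduces the variance of the estimate at the cost of more hashing per document.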

Structural (Field Schema)

Compares the structure of documents — which fields are present, how they're nested, and their overall shape.

Technical Details

Algorithm: Field set overlap with synonym resolution.

How it works: Normalizes field names, resolves synonyms, then computes Jaccard similarity of canonical field sets.

Similarity measure: |Fields_A ∩ Fields_B| / |Fields_A ∪ Fields_B|

Key property: Handles schema variations between different EDI systems
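
As a rough illustration, the structural comparison could look like the Python sketch below. The function names, the dotted-path representation of nested fields, the tiny synonym table, and the choice to score two empty documents as 1.0 are assumptions for the example, not details confirmed by Commodus.

```python
def canonical_fields(doc: dict, synonyms: dict, prefix: str = "") -> set:
    """Flatten a nested record into a set of canonical, dotted field paths."""
    fields = set()
    for key, value in doc.items():
        name = key.strip().lower()          # normalize the field name
        name = synonyms.get(name, name)     # resolve a synonym if one is known
        path = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict):
            fields |= canonical_fields(value, synonyms, path)
        else:
            fields.add(path)
    return fields


def structural_similarity(doc_a: dict, doc_b: dict, synonyms: dict) -> float:
    """Jaccard similarity of the two canonical field sets."""
    a = canonical_fields(doc_a, synonyms)
    b = canonical_fields(doc_b, synonyms)
    if not a and not b:
        return 1.0  # assumption: two empty records have identical structure
    return len(a & b) / len(a | b)


# Hypothetical synonym table mapping aliases to one canonical name.
SYNONYMS = {"subscriber_id": "member_id", "dos": "service_date"}

record_a = {"member_id": "M1", "service_date": "2024-01-05", "provider": {"npi": "123"}}
record_b = {"subscriber_id": "M1", "dos": "2024-01-05", "claim_id": "C9"}
score = structural_similarity(record_a, record_b, SYNONYMS)  # 2 shared of 4 fields
```

Only field names are compared here; field values play no part in the structural score.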

Field Normalization

Recognizes that "member_id" and "subscriberId" mean the same thing in dental records, enabling cross-system matching.

Dental EDI Synonyms

Patient identifiers:

member_id, subscriber_id, patient_id, insured_id, enrollee_id

Provider identifiers:

provider_id, npi, dentist_id, practitioner_id

Procedure codes:

procedure_code, cdt_code, ada_code, service_code

Claim identifiers:

claim_id, claim_number, dcn, claim_reference

Date fields:

service_date, date_of_service, dos, treatment_date
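
One way to turn the synonym groups above into a lookup is sketched below in Python. The group contents come from the lists above; everything else is an assumption for illustration, including the function names, the normalization rules (camelCase, spaces, and hyphens converted to snake_case), and the choice of the first entry in each group as the canonical name.

```python
import re

# Synonym groups as listed above; the first entry is treated as canonical (an assumption).
SYNONYM_GROUPS = [
    ["member_id", "subscriber_id", "patient_id", "insured_id", "enrollee_id"],
    ["provider_id", "npi", "dentist_id", "practitioner_id"],
    ["procedure_code", "cdt_code", "ada_code", "service_code"],
    ["claim_id", "claim_number", "dcn", "claim_reference"],
    ["service_date", "date_of_service", "dos", "treatment_date"],
]

# Map every alias (including the canonical name itself) to its canonical name.
CANONICAL = {alias: group[0] for group in SYNONYM_GROUPS for alias in group}


def normalize(name: str) -> str:
    """Convert camelCase, spaces, and hyphens to lower snake_case."""
    s = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    s = re.sub(r"[\s\-]+", "_", s)
    return s.lower()


def canonical_field(name: str) -> str:
    """Return the canonical field name, or the normalized name if no synonym is known."""
    normalized = normalize(name)
    return CANONICAL.get(normalized, normalized)
```

With this table, `canonical_field("subscriberId")` and `canonical_field("member_id")` both resolve to `member_id`, which is what lets records from different systems be compared field by field. Unknown fields pass through in normalized form rather than being dropped.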

Ensemble Scoring

Combines all algorithms into a single weighted score. Weights can be static or adaptive based on document characteristics.

Technical Details

Formula:
score = w₁×SimHash + w₂×MinHash + w₃×Structural

Default weights: SimHash 40%, MinHash 40%, Structural 20%

Adaptive mode: Adjusts weights based on field overlap and token similarity characteristics.

Threshold: Documents with score ≥ 0.5 are considered "similar" by default.
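
The static form of the formula, with the default weights and threshold given above, can be sketched in Python as follows. The names `Weights`, `ensemble_score`, and `is_similar` are hypothetical, and dividing by the weight total is an added safeguard for custom weights that do not sum to 1. Adaptive mode is not shown because its weighting rules are not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    """Static ensemble weights; defaults follow the 40/40/20 split in the text."""
    simhash: float = 0.4
    minhash: float = 0.4
    structural: float = 0.2


def ensemble_score(simhash: float, minhash: float, structural: float,
                   weights: Weights = Weights()) -> float:
    """Weighted combination of the three per-algorithm similarity scores."""
    total = weights.simhash + weights.minhash + weights.structural
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    raw = (weights.simhash * simhash
           + weights.minhash * minhash
           + weights.structural * structural)
    return raw / total  # normalizing keeps the score in [0, 1] for any positive weights


def is_similar(score: float, threshold: float = 0.5) -> bool:
    """Apply the default similarity threshold (score >= 0.5)."""
    return score >= threshold
```

For example, component scores of 0.9, 0.5, and 0.25 combine to 0.4 × 0.9 + 0.4 × 0.5 + 0.2 × 0.25 = 0.61, which clears the default threshold.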
