Commodus uses multiple algorithms to detect similar records,
even when field names or formats differ. Each algorithm contributes
to a weighted ensemble score.
🔷
SimHash (Content Fingerprinting)
Creates a compact "fingerprint" of document content that stays
similar even with small changes. Think of it like a document's DNA.
Technical Details
Algorithm: A locality-sensitive hashing (LSH) scheme
that maps each document to a 64-bit signature.
How it works: Hashes each token, accumulates a weighted
vote for each bit position, then thresholds the totals to produce
the final signature.
Similarity measure: Hamming distance between
signatures, normalized to [0, 1] so that identical signatures score 1.
Complexity: O(n), where n is the number of tokens in the document.
Key property: Symmetric — similarity(A, B) = similarity(B, A)
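The steps above can be illustrated with a minimal sketch. It is not Commodus's actual implementation: the BLAKE2b token hash, the unit default weight, and the function names token_hash, simhash, and simhash_similarity are all assumptions made for illustration.

```python
import hashlib

BITS = 64  # signature width stated in the text


def token_hash(token: str) -> int:
    """Return a 64-bit hash of a token (BLAKE2b is an assumed choice)."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def simhash(tokens, weights=None) -> int:
    """Build a 64-bit signature from an iterable of string tokens."""
    weights = weights or {}
    totals = [0.0] * BITS
    for token in tokens:
        h = token_hash(token)
        w = weights.get(token, 1.0)  # assumption: unit weight unless given
        for bit in range(BITS):
            # A set bit votes +w for this position, a clear bit votes -w.
            totals[bit] += w if (h >> bit) & 1 else -w
    signature = 0
    for bit, total in enumerate(totals):
        if total > 0:  # threshold each position at zero
            signature |= 1 << bit
    return signature


def simhash_similarity(a: int, b: int) -> float:
    """1 minus the normalized Hamming distance, so the result is in [0, 1]."""
    hamming = bin(a ^ b).count("1")
    return 1.0 - hamming / BITS


sig_a = simhash("claim 1001 procedure D1110 provider 1234567890".split())
sig_b = simhash("claim 1001 procedure D1110 provider 1234567891".split())
print(simhash_similarity(sig_a, sig_b))
```

Because XOR is commutative, simhash_similarity(a, b) and simhash_similarity(b, a) always agree, which matches the symmetry property listed above.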
🔶
MinHash (Set Similarity)
Estimates how many tokens (words, numbers, identifiers) two
documents share, even when there are millions of possible values.
Technical Details
Algorithm: Probabilistic data structure for
estimating Jaccard similarity between sets.
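How it works: Applies k hash functions to every token
and keeps the minimum value under each, producing a k-value signature.
The fraction of positions where two signatures agree estimates
the similarity of the underlying sets.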
Similarity measure: Jaccard index —
|A ∩ B| / |A ∪ B|
Signature size: 128 hashes by default (configurable from 64 to 256)
Key property: Order-independent. The same tokens
in any order produce the same result.
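A minimal sketch of this idea follows, under stated assumptions: salted BLAKE2b stands in for the k hash functions, and the names minhash_signature, estimate_jaccard, and exact_jaccard are hypothetical, not taken from Commodus.

```python
import hashlib


def minhash_signature(tokens, num_hashes=128):
    """Return num_hashes minimum hash values for a non-empty token collection."""
    token_set = set(tokens)  # duplicates and order are irrelevant
    if not token_set:
        raise ValueError("at least one token is required")
    signature = []
    for seed in range(num_hashes):
        # Assumption: a different salt per seed acts as a different hash function.
        salt = seed.to_bytes(8, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(t.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for t in token_set
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """|A ∩ B| / |A ∪ B|, for comparison with the estimate."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


doc_a = ["claim", "1001", "D1110", "npi", "1234567890"]
doc_b = ["claim", "1001", "D0120", "npi", "1234567890"]
print(exact_jaccard(doc_a, doc_b))
print(estimate_jaccard(minhash_signature(doc_a), minhash_signature(doc_b)))
```

The estimate is probabilistic, so it will generally be close to the exact value rather than equal to it; a larger signature reduces the error at the cost of more hashing.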
🔷
Structural (Field Schema)
Compares the structure of documents — which fields are
present, how they're nested, and their overall shape.
Technical Details
Algorithm: Field set overlap with
synonym resolution.
How it works: Normalizes field names,
resolves synonyms, then computes Jaccard similarity
of canonical field sets.
Similarity measure: |Fields_A ∩ Fields_B| /
|Fields_A ∪ Fields_B|
Key property: Handles schema variations
between different EDI systems
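As a rough illustration of the field-set comparison, the sketch below collects field names from nested dictionaries and computes their Jaccard overlap. The helper names, the small alias map, and the choice to ignore lists and nesting depth are assumptions for the example only.

```python
def field_names(doc, aliases=None):
    """Collect canonical field names from a (possibly nested) dict."""
    aliases = aliases or {}
    names = set()
    for key, value in doc.items():
        name = key.strip().lower()
        names.add(aliases.get(name, name))  # map a synonym to its canonical name
        if isinstance(value, dict):
            # Assumption: nested objects contribute their keys to the same set.
            names |= field_names(value, aliases)
    return names


def structural_similarity(doc_a, doc_b, aliases=None):
    """Jaccard similarity of the two canonical field sets."""
    a = field_names(doc_a, aliases)
    b = field_names(doc_b, aliases)
    if not a and not b:
        return 1.0  # assumption: two empty documents count as identical
    return len(a & b) / len(a | b)


# Hypothetical alias map for the example records below.
ALIASES = {
    "subscriber_id": "member_id",
    "claim_number": "claim_id",
    "dos": "service_date",
}

record_a = {"member_id": "M1", "claim": {"claim_id": "C9", "dos": "2024-01-05"}}
record_b = {
    "subscriber_id": "M1",
    "claim_number": "C9",
    "service_date": "2024-01-05",
    "npi": "1234567890",
}
print(structural_similarity(record_a, record_b, ALIASES))
```

With the alias map, the two records share three of five distinct fields; without it, they share none, which is why synonym resolution matters for cross-system matching.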
🔗
Field Normalization
Recognizes that "member_id" and "subscriberId" mean the same
thing in dental records, enabling cross-system matching.
Dental EDI Synonyms
Patient identifiers:
- member_id, subscriber_id, patient_id, insured_id, enrollee_id
Provider identifiers:
- provider_id, npi, dentist_id, practitioner_id
Procedure codes:
- procedure_code, cdt_code, ada_code, service_code
Claim identifiers:
- claim_id, claim_number, dcn, claim_reference
Date fields:
- service_date, date_of_service, dos, treatment_date
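The synonym groups above could be applied along these lines. This is a sketch, not the product's code: using the first name in each group as the canonical form, converting camelCase to snake_case with a regular expression, and the names normalize_name and canonical_field are all assumptions.

```python
import re

# Synonym groups copied from the lists above.
SYNONYM_GROUPS = [
    ["member_id", "subscriber_id", "patient_id", "insured_id", "enrollee_id"],
    ["provider_id", "npi", "dentist_id", "practitioner_id"],
    ["procedure_code", "cdt_code", "ada_code", "service_code"],
    ["claim_id", "claim_number", "dcn", "claim_reference"],
    ["service_date", "date_of_service", "dos", "treatment_date"],
]

# Assumption: the first entry of each group serves as the canonical name.
CANONICAL = {name: group[0] for group in SYNONYM_GROUPS for name in group}


def normalize_name(name: str) -> str:
    """Convert camelCase, spaces, hyphens, and dots to lowercase snake_case."""
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    name = re.sub(r"[\s\-.]+", "_", name)
    return name.lower()


def canonical_field(name: str) -> str:
    """Return the canonical name for a field, or its normalized form if unknown."""
    normalized = normalize_name(name)
    return CANONICAL.get(normalized, normalized)


print(canonical_field("member_id") == canonical_field("subscriberId"))
```

Unknown fields pass through in normalized form, so they can still match each other across systems that differ only in casing or separators.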
⚖️
Ensemble Scoring
Combines all algorithms into a single weighted score.
Weights can be static or adaptive based on document characteristics.
Technical Details
Formula:
score = w₁×SimHash + w₂×MinHash + w₃×Structural
Default weights: SimHash 40%, MinHash 40%, Structural 20%
Adaptive mode: Adjusts weights based on
field overlap and token similarity characteristics.
Threshold: Documents with score ≥ 0.5 are
considered "similar" by default.
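The static-weight formula and default threshold can be sketched as follows. The function names and the dictionary layout are hypothetical, and adaptive mode is not shown because the text does not specify how the weights are adjusted.

```python
# Default weights and threshold as stated in the text.
DEFAULT_WEIGHTS = {"simhash": 0.4, "minhash": 0.4, "structural": 0.2}
DEFAULT_THRESHOLD = 0.5


def ensemble_score(scores, weights=None):
    """Weighted average of per-algorithm scores, each expected in [0, 1]."""
    weights = weights or DEFAULT_WEIGHTS
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Dividing by the total keeps the result in [0, 1] even if custom
    # weights do not sum to exactly 1 (an assumption of this sketch).
    return sum(weights[name] * scores[name] for name in weights) / total


def is_similar(scores, weights=None, threshold=DEFAULT_THRESHOLD):
    """Apply the similarity threshold to the ensemble score."""
    return ensemble_score(scores, weights) >= threshold


example = {"simhash": 0.9, "minhash": 0.8, "structural": 0.5}
print(ensemble_score(example))
print(is_similar(example))
```

For the example scores the weighted sum works out to 0.4 × 0.9 + 0.4 × 0.8 + 0.2 × 0.5 = 0.78, which is above the default threshold.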