Compare Documents

Paste two JSON, XML, or HTML documents to analyze their semantic similarity. The format of each document is detected automatically.

Document A

Document B

Synchronized scrolling

Enable synonym resolution

Load sample documents or paste your own to get started.

Sample Documents

Explore how the similarity algorithm responds to different document pairs. The algorithm measures structural and token similarity.

Same Structure (High Similarity)

Documents with identical field names score high, even with different values.

Near Duplicate

Identical claim data

~100%

Same Patient, Different Visit

Same fields, different procedure/dates

~100%

Same Provider, Different Patients

Same structure, different patient data

~98%

String vs Numeric Types

"1200.00" vs 1200.00 - same structure

~95%

Date Format Variations

2024-01-15 vs 01/15/2024 - same fields

~98%

Claim Resubmission

Original vs corrected - some new fields

~85%

Partial Overlap (Medium Similarity)

Documents sharing some field names but with structural differences.

Eligibility Request/Response

270/271 pair - shared subscriber fields

~60%

Dental vs Medical

Different claim types, some field overlap

~63%

Case Normalization (High Similarity)

Field names written in camelCase, PascalCase, or kebab-case are automatically normalized to snake_case before comparison.

camelCase vs snake_case

claimId → claim_id (normalized)

~99%
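
As an illustration of the normalization described above, here is a minimal sketch that converts camelCase, PascalCase, and kebab-case names to snake_case. The function name to_snake_case and the exact regular expressions are assumptions made for this example; the tool's actual rules may differ, for instance in how it treats acronyms or digits.

```python
import re

def to_snake_case(name: str) -> str:
    """Normalize camelCase, PascalCase, and kebab-case names to snake_case."""
    # kebab-case and spaces: turn separators into underscores.
    s = re.sub(r"[-\s]+", "_", name)
    # Split an acronym from a following capitalized word: "NPINumber" -> "NPI_Number".
    s = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", s)
    # Split a lowercase letter or digit from a following capital: "claimId" -> "claim_Id".
    s = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", s)
    # Collapse repeated underscores and lowercase the result.
    return re.sub(r"_+", "_", s).lower()

print(to_snake_case("claimId"))   # claim_id
print(to_snake_case("ClaimId"))   # claim_id
print(to_snake_case("claim-id"))  # claim_id
```

Names that are already in snake_case pass through unchanged, so the function can be applied to every field name without checking its style first.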

Semantic Synonyms

Different words for the same concept (member_id vs subscriber_id). Turn on "Enable synonym resolution" above to see the higher scores.

Field Synonyms

member_id vs subscriber_id

~39% (~95% with synonyms)

Abbreviated vs Full Names

dob vs date_of_birth

~39% (~95% with synonyms)

Structural Differences (Low Similarity)

Different JSON structure or completely different document types.

Nested vs Flat JSON

Different structure depth

~39%

Claim vs Remittance

837D vs 835 - different transaction types

~35%

EDI Segments vs JSON

X12 format vs normalized JSON

~31%

Patient vs Provider

Completely different entity types

~39%

Cross-Format: XML ↔ JSON

Compare documents in different formats. Format is auto-detected.

XML to JSON - Same Fields

Identical field names across formats

~88%

XML to JSON - Synonyms

subscriber_id ↔ member_id with synonyms

~88% (with synonyms)

XML with Attributes

XML attributes extracted as fields

~83%
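
To make "XML attributes extracted as fields" concrete, the sketch below collects field names from an XML document using Python's standard xml.etree.ElementTree module. Treating every child element tag and every attribute name as a field, and ignoring the root tag, are assumptions of this example rather than documented behavior of the tool.

```python
import xml.etree.ElementTree as ET

def xml_field_names(xml_text: str) -> set[str]:
    """Collect element tags (except the root) and attribute names as field names."""
    root = ET.fromstring(xml_text)
    fields: set[str] = set()
    for element in root.iter():
        if element is not root:
            fields.add(element.tag)
        # Attribute names are treated the same way as element names.
        fields.update(element.attrib.keys())
    return fields

xml_doc = '<claim claim_id="C1"><patient member_id="M1"><dob>1980-01-01</dob></patient></claim>'
print(sorted(xml_field_names(xml_doc)))  # ['claim_id', 'dob', 'member_id', 'patient']
```

The resulting set can be compared directly with the key names of a JSON document, which is what makes a cross-format comparison possible.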

Cross-Format: HTML ↔ JSON

Compare HTML forms and tables with JSON. Field names are extracted from form inputs, table headers, and labels.

HTML Form to JSON

Form inputs → JSON fields

~85%

HTML Table to JSON

Table headers → JSON fields

~85%

HTML Data Attributes

data-* attributes as fields

~83%
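
The following sketch shows one way field names could be pulled from HTML with Python's standard html.parser module. The class name FieldExtractor and the specific rules (the name attribute of form controls, the text of th cells, and data-* attribute names with the prefix removed) are assumptions for illustration; label handling is omitted to keep the example short.

```python
from html.parser import HTMLParser

class FieldExtractor(HTMLParser):
    """Collects candidate field names from form controls, table headers, and data-* attributes."""

    def __init__(self) -> None:
        super().__init__()
        self.fields: set[str] = set()
        self._in_header = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Form controls: use the "name" attribute as the field name.
        if tag in ("input", "select", "textarea") and attributes.get("name"):
            self.fields.add(attributes["name"])
        if tag == "th":
            self._in_header = True
        # data-* attributes: strip the "data-" prefix.
        for key in attributes:
            if key.startswith("data-"):
                self.fields.add(key[len("data-"):])

    def handle_endtag(self, tag):
        if tag == "th":
            self._in_header = False

    def handle_data(self, data):
        # Table headers: use the header text as the field name.
        if self._in_header and data.strip():
            self.fields.add(data.strip())

def html_field_names(html_text: str) -> set[str]:
    parser = FieldExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.fields

page = ('<form><input name="claim_id"><input name="dob"></form>'
        '<table><tr><th>member_id</th></tr></table><div data-npi="123"></div>')
print(sorted(html_field_names(page)))  # ['claim_id', 'dob', 'member_id', 'npi']
```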

How Commodus Works

Commodus uses multiple algorithms to detect similar records, even when field names or formats differ. Each algorithm contributes to a weighted ensemble score.

SimHash (Content Fingerprinting)

Creates a compact "fingerprint" of document content that stays similar even with small changes. Think of it like a document's DNA.

Technical Details

Algorithm: A locality-sensitive hashing (LSH) scheme that maps each document to a 64-bit signature.

How it works: Hashes each token, accumulates a weighted vote for every bit position, then thresholds the totals to produce the final signature.

Similarity measure: Hamming distance between signatures, normalized to [0, 1].

Complexity: O(n) where n is document length.

Key property: Symmetric — similarity(A, B) = similarity(B, A)
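
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's implementation: the tokenizer, the use of token frequency as the weight, and the choice of BLAKE2b as the 64-bit token hash are all assumptions made for the example.

```python
import hashlib
import re
from collections import Counter

BITS = 64  # signature width stated in the description above

def tokenize(text: str) -> list[str]:
    # Assumed tokenizer: lowercase runs of letters, digits, and underscores.
    return re.findall(r"[a-z0-9_]+", text.lower())

def simhash(text: str) -> int:
    """Compute a 64-bit SimHash signature, weighting each token by its frequency."""
    totals = [0] * BITS
    for token, weight in Counter(tokenize(text)).items():
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        token_hash = int.from_bytes(digest, "big")
        for bit in range(BITS):
            # A set bit votes +weight for that position, a clear bit votes -weight.
            totals[bit] += weight if (token_hash >> bit) & 1 else -weight
    signature = 0
    for bit, total in enumerate(totals):
        if total > 0:  # threshold: keep the bit when the positive votes win
            signature |= 1 << bit
    return signature

def simhash_similarity(a: int, b: int) -> float:
    """Normalized similarity in [0, 1]: 1 minus Hamming distance over signature width."""
    hamming = bin(a ^ b).count("1")
    return 1.0 - hamming / BITS

sig_a = simhash("claim_id member_id procedure_code D1110")
sig_b = simhash("claim_id member_id procedure_code D0120")
print(simhash_similarity(sig_a, sig_b))
```

Because the Hamming distance is computed with XOR, swapping the two signatures cannot change the result, which is the symmetry property noted above.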

MinHash (Set Similarity)

Estimates how many tokens (words, numbers, identifiers) two documents have in common, even when there are millions of possible values.

Technical Details

Algorithm: Probabilistic technique for estimating the Jaccard similarity between sets.

How it works: Applies k hash functions to the token set and keeps the minimum value from each, creating a signature that preserves set similarity.

Similarity measure: Jaccard index — |A ∩ B| / |A ∪ B|

Signature size: 128 hashes (configurable, 64 to 256)

Key property: Order-independent — same tokens in any order produce same result
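
A minimal MinHash sketch is shown below. Simulating the k hash functions by salting BLAKE2b with the function index is an assumption of this example; real implementations often use a family of cheaper hash functions. The fraction of signature positions where two signatures agree serves as the estimate of the Jaccard index.

```python
import hashlib

def _hash(token: str, seed: int) -> int:
    # One of k hash functions, simulated by salting BLAKE2b with the seed.
    salt = seed.to_bytes(8, "big")
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")

def minhash_signature(tokens: set[str], num_hashes: int = 128) -> list[int]:
    """For each hash function, keep the minimum hash value over all tokens."""
    if not tokens:
        raise ValueError("cannot build a MinHash signature from an empty set")
    return [min(_hash(token, seed) for token in tokens) for seed in range(num_hashes)]

def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of positions where the two signatures agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

doc_a = {"claim_id", "member_id", "dob", "procedure_code"}
doc_b = {"claim_id", "member_id", "dob", "service_date"}
exact = len(doc_a & doc_b) / len(doc_a | doc_b)  # 3 / 5 = 0.6
estimate = estimate_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
print(exact, estimate)
```

The estimate approaches the exact Jaccard index as the number of hashes grows, which is the trade-off behind the configurable signature size. Since the signature is built from a set, the order in which tokens appear in the document has no effect.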

Structural (Field Schema)

Compares the structure of documents — which fields are present, how they're nested, and their overall shape.

Technical Details

Algorithm: Field set overlap with synonym resolution.

How it works: Normalizes field names, resolves synonyms, then computes Jaccard similarity of canonical field sets.

Similarity measure: |Fields_A ∩ Fields_B| / |Fields_A ∪ Fields_B|

Key property: Handles schema variations between different EDI systems
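
The field-set comparison can be sketched as follows for JSON input. Representing nested fields as dotted paths is an assumption of this example, chosen so that nesting affects the score; name normalization and synonym resolution are left out here to keep the focus on the Jaccard computation.

```python
import json

def field_paths(node, prefix: str = "") -> set[str]:
    """Collect every field in a parsed JSON value as a dotted path."""
    paths: set[str] = set()
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= field_paths(value, path)
    elif isinstance(node, list):
        # List items share the path of the list itself.
        for item in node:
            paths |= field_paths(item, prefix)
    return paths

def jaccard(a: set[str], b: set[str]) -> float:
    """|A intersect B| / |A union B|, defined as 1.0 when both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

doc_a = json.loads('{"claim_id": "C1", "patient": {"dob": "1980-01-01"}}')
doc_b = json.loads('{"claim_id": "C2", "patient": {"dob": "1990-02-02", "name": "X"}}')
print(jaccard(field_paths(doc_a), field_paths(doc_b)))  # 0.75
```

Field values never enter the calculation, which is why documents with the same fields but different data still score high on this component.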

Field Normalization

Recognizes that "member_id" and "subscriberId" mean the same thing in dental records, enabling cross-system matching.

Dental EDI Synonyms

Patient identifiers:

member_id, subscriber_id, patient_id, insured_id, enrollee_id

Provider identifiers:

provider_id, npi, dentist_id, practitioner_id

Procedure codes:

procedure_code, cdt_code, ada_code, service_code

Claim identifiers:

claim_id, claim_number, dcn, claim_reference

Date fields:

service_date, date_of_service, dos, treatment_date
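
The synonym groups above can be expressed as a lookup table that maps each alias to one canonical name. Using the first name of each group as the canonical form is an assumption of this sketch, and field names are assumed to be in snake_case already.

```python
# Synonym groups as listed above; the dictionary key is the (assumed) canonical name.
SYNONYM_GROUPS = {
    "member_id": ["member_id", "subscriber_id", "patient_id", "insured_id", "enrollee_id"],
    "provider_id": ["provider_id", "npi", "dentist_id", "practitioner_id"],
    "procedure_code": ["procedure_code", "cdt_code", "ada_code", "service_code"],
    "claim_id": ["claim_id", "claim_number", "dcn", "claim_reference"],
    "service_date": ["service_date", "date_of_service", "dos", "treatment_date"],
}

# Invert the groups into an alias -> canonical lookup.
CANONICAL = {
    alias: canonical
    for canonical, aliases in SYNONYM_GROUPS.items()
    for alias in aliases
}

def canonicalize(fields: set[str]) -> set[str]:
    """Replace known aliases with their canonical name; leave other fields unchanged."""
    return {CANONICAL.get(field, field) for field in fields}

system_a = {"member_id", "cdt_code", "dos"}
system_b = {"subscriber_id", "procedure_code", "service_date"}

print(len(system_a & system_b))                              # 0 shared names before resolution
print(canonicalize(system_a) == canonicalize(system_b))      # True after resolution
```

This shows why the synonym samples jump from a low score to a high one once resolution is enabled: the raw field sets barely overlap, while the canonical sets match.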

Ensemble Scoring

Combines all algorithms into a single weighted score. Weights can be static or adaptive based on document characteristics.

Technical Details

Formula:
score = w₁×SimHash + w₂×MinHash + w₃×Structural

Default weights: SimHash 40%, MinHash 40%, Structural 20%

Adaptive mode: Adjusts weights based on field overlap and token similarity characteristics.

Threshold: Documents with score ≥ 0.5 are considered "similar" by default.
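
The formula, default weights, and threshold above translate directly into code. The function names are hypothetical, and only the static weighting is shown; the adaptive mode is not sketched because its rules are not specified here.

```python
DEFAULT_WEIGHTS = (0.4, 0.4, 0.2)  # SimHash, MinHash, Structural
DEFAULT_THRESHOLD = 0.5

def ensemble_score(simhash_score: float, minhash_score: float, structural_score: float,
                   weights: tuple[float, float, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted sum of the three component scores, each expected in [0, 1]."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    w1, w2, w3 = weights
    return w1 * simhash_score + w2 * minhash_score + w3 * structural_score

def is_similar(score: float, threshold: float = DEFAULT_THRESHOLD) -> bool:
    """Documents at or above the threshold count as similar."""
    return score >= threshold

# Worked example: 0.4 * 0.9 + 0.4 * 0.8 + 0.2 * 0.5 = 0.36 + 0.32 + 0.10 = 0.78
score = ensemble_score(0.9, 0.8, 0.5)
print(round(score, 2), is_similar(score))  # 0.78 True
```

Requiring the weights to sum to 1 keeps the combined score in the same [0, 1] range as its components, so the 0.5 threshold stays meaningful when the weights are changed.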