CrossER Benchmark

LLMs struggle with cross-system entity resolution without institutional context. CrossER measures the gap—and benchmarks how different context delivery mechanisms (RAG, knowledge graphs, tool use) help close it.

7,603 pairs · 7,162 entities · 5 systems · 237 documents

Leaderboard

Rank      Submission       Base Model       Context Method      F1     Precision  Recall
baseline  –                –                Attribute matching  0.145  0.080      0.783
baseline  –                Claude Sonnet 4  BM25 RAG (top-10)   0.091  0.077      0.111
baseline  –                Claude Sonnet 4  Zero-shot           0.090  0.143      0.065
baseline  –                –                String matching     0.000  0.000      0.000
–         Your submission  Any              Your approach       –      –          –

More baselines coming soon.

About

Task definition and evaluation

Task

Given two entity records from different enterprise systems, predict whether they refer to the same real-world entity (match), different entities (no_match), or evidence is insufficient (ambiguous).


Systems have access to auxiliary context: migration runbooks, email threads, wiki pages—typical enterprise documentation.
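To make the interface concrete, here is a minimal sketch of a zero-shot prompt for a single pair. It is not the official baseline: `call_llm` is a placeholder for whatever model client you use, and the record fields follow the entity format described under Data.

```python
import json

def build_prompt(entity_a: dict, entity_b: dict) -> str:
    """Assemble a zero-shot prompt for one pair (no auxiliary context)."""
    return (
        "Do these two records from different enterprise systems refer to the same "
        "real-world entity? Answer with exactly one word: match, no_match, or ambiguous.\n\n"
        f"Record A ({entity_a['source_system']}): {entity_a['name']}\n"
        f"Attributes: {json.dumps(entity_a['attributes'])}\n\n"
        f"Record B ({entity_b['source_system']}): {entity_b['name']}\n"
        f"Attributes: {json.dumps(entity_b['attributes'])}"
    )

def call_llm(prompt: str) -> str:
    """Placeholder: call your model here and return 'match', 'no_match', or 'ambiguous'."""
    raise NotImplementedError
```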

Source Systems

  • ERP_ALPHA — Global HQ, formal naming
  • ERP_BETA — European subsidiary, German names
  • ERP_GAMMA — Acquired company, abbreviations
  • ERP_DELTA — Regional ops, informal naming
  • LEGACY_SIGMA — Decommissioned, cryptic codes

Why It's Hard

  • Match pairs have low string similarity (avg 0.29)
  • No-match pairs have high similarity (avg 0.94)
  • Context required to resolve most pairs

Evaluation Metrics

  • F1 Score — Primary ranking metric
  • Precision — Accuracy of match predictions
  • Recall — Coverage of true matches
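For reference, a minimal sketch of how these scores can be computed from predictions and gold labels, assuming match is the positive class (which is what the precision and recall descriptions above imply); the authoritative implementation is eval/evaluate.py.

```python
def score(preds: dict[str, str], gold: dict[str, str]) -> tuple[float, float, float]:
    """Precision, recall, and F1 with 'match' as the positive class (assumed definition)."""
    tp = sum(1 for pid, label in gold.items() if label == "match" and preds.get(pid) == "match")
    pred_pos = sum(1 for label in preds.values() if label == "match")
    gold_pos = sum(1 for label in gold.values() if label == "match")
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / gold_pos if gold_pos else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```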

Evaluation

Submissions are evaluated in Raw Context mode.

Raw Context

237 enterprise documents posing a multi-hop reasoning challenge. This is the competition mode.

python baselines/llm_rag_raw_baseline.py

No Context

Entity pairs only. Zero-shot baseline reference.

python baselines/llm_baseline.py

Raw Context Documents

The context/raw/ folder contains 237 enterprise documents that pose a multi-hop reasoning challenge:

  • Runbooks — 30 markdown files (Source → IC)
  • Email Threads — 169 text files (IC → Target)
  • Wiki Pages — 26 markdown docs
  • Slack Exports — 10 JSON files
  • Policy Documents — 2 markdown docs
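To illustrate one way of delivering this context, here is a rough sketch of top-10 retrieval in the spirit of the BM25 RAG baseline; it is not the official baselines/llm_rag_raw_baseline.py, and the rank_bm25 package, the data/context/raw path, and the query construction are assumptions.

```python
from pathlib import Path
from rank_bm25 import BM25Okapi  # pip install rank-bm25

# Load all raw context documents: runbooks, email threads, wiki pages, Slack exports, policies.
doc_paths = sorted(p for p in Path("data/context/raw").rglob("*") if p.is_file())
texts = [p.read_text(errors="ignore") for p in doc_paths]
bm25 = BM25Okapi([t.lower().split() for t in texts])

def top_k_context(entity_a: dict, entity_b: dict, k: int = 10) -> list[str]:
    """Return the k documents most relevant to a pair (query construction is an assumption)."""
    query = f"{entity_a['name']} {entity_b['name']}".lower().split()
    return bm25.get_top_n(query, texts, n=k)
```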

Data

Synthetically generated, based on Phyvant's observations across Fortune 500 enterprise systems.

  • Full Dataset — entities and pairs with labels (crosser_dataset.json)
  • Raw Context — 237 enterprise documents (context/raw/)
  • Evaluation Scripts — scoring and validation (eval/)

Directory Structure

CrossER/
├── data/
│   ├── entities.json          # All 7,162 entities
│   ├── pairs.json             # All 7,603 pairs with labels
│   └── context/
│       └── raw/               # 237 enterprise documents
│           ├── migration_runbooks/
│           ├── email_threads/
│           ├── wiki_pages/
│           ├── slack_exports/
│           └── policy_documents/
├── eval/                      # Evaluation scripts
└── baselines/                 # Baseline implementations

Entity & Pair Format

// entities.json
[
  {
    "entity_id": "erp_alpha_prod_001",
    "source_system": "ERP_ALPHA",
    "name": "Maltodextrin DE20 Grade A",
    "attributes": { "category": "modified_starches", ... }
  }
]

// pairs.json
[
  {
    "pair_id": "pair_0001",
    "entity_a": "erp_alpha_prod_001",
    "entity_b": "legacy_sigma_prod_023",
    "label": "match",
    "split": "test"
  }
]
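A minimal loading sketch, assuming the data/ layout shown above, that joins each pair to its two entity records:

```python
import json
from pathlib import Path

entities = {e["entity_id"]: e for e in json.loads(Path("data/entities.json").read_text())}
pairs = json.loads(Path("data/pairs.json").read_text())

for pair in (p for p in pairs if p["split"] == "test"):
    # Each pair resolves to two full records from different source systems.
    a, b = entities[pair["entity_a"]], entities[pair["entity_b"]]
    print(pair["pair_id"], f"{a['source_system']}:{a['name']}", "<->", f"{b['source_system']}:{b['name']}")
```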

Submission Format

[ { "pair_id": "pair_0001", "predicted_label": "match" }, { "pair_id": "pair_0002", "predicted_label": "no_match" }, ... ]

Submit Your Approach

Compete on context delivery mechanisms.

Evaluate

Run the evaluation script on your predictions:

python eval/evaluate.py predictions.json --split test

Submit

  1. Fork nihalgunu/CrossER
  2. Add your predictions to submissions/
  3. Include: team/company name, base model used, context approach description
  4. Submit pull request

Proprietary approaches welcome—you don't need to open-source your method, just submit predictions.

Citation

If you use CrossER in your research:

@misc{crosser2026,
  author    = {Gunukula, Nihal and Murthy, Sameer},
  title     = {{CrossER: Cross-System Entity Resolution Benchmark}},
  year      = {2026},
  publisher = {GitHub},
  url       = {https://github.com/nihalgunu/CrossER}
}