CrossER Benchmark

LLMs struggle with cross-system entity resolution without institutional context. CrossER measures the gap—and benchmarks how different context delivery mechanisms (RAG, knowledge graphs, tool use) help close it.

7,603 pairs · 7,162 entities · 5 systems · 237 documents

Leaderboard

Rank      Submission       Base Model       Context Method      F1     Precision  Recall
baseline  –                –                Attribute matching  0.145  0.080      0.783
baseline  –                Claude Sonnet 4  BM25 RAG (top-10)   0.091  0.077      0.111
baseline  –                Claude Sonnet 4  Zero-shot           0.090  0.143      0.065
baseline  –                –                String matching     0.000  0.000      0.000
–         Your submission  Any              Your approach       –      –          –

More baselines coming soon.

About

Task definition and evaluation

Task

Given two entity records from different enterprise systems, predict whether they refer to the same real-world entity (match), different entities (no_match), or evidence is insufficient (ambiguous).


Systems have access to auxiliary context: migration runbooks, email threads, wiki pages—typical enterprise documentation.
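To make the interface concrete, here is a minimal sketch of a zero-shot prompt for a single pair. It is not the official baseline: `call_llm` is a placeholder for whatever model client you use, and the record fields follow the entity format described under Data.

```python
import json

def build_prompt(entity_a: dict, entity_b: dict) -> str:
    """Assemble a zero-shot prompt for one pair (no auxiliary context)."""
    return (
        "Do these two records from different enterprise systems refer to the same "
        "real-world entity? Answer with exactly one word: match, no_match, or ambiguous.\n\n"
        f"Record A ({entity_a['source_system']}): {entity_a['name']}\n"
        f"Attributes: {json.dumps(entity_a['attributes'])}\n\n"
        f"Record B ({entity_b['source_system']}): {entity_b['name']}\n"
        f"Attributes: {json.dumps(entity_b['attributes'])}"
    )

def call_llm(prompt: str) -> str:
    """Placeholder: call your model here and return 'match', 'no_match', or 'ambiguous'."""
    raise NotImplementedError
```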

Source Systems

  • ERP_ALPHA — Global HQ, formal naming
  • ERP_BETA — European subsidiary, German names
  • ERP_GAMMA — Acquired company, abbreviations
  • ERP_DELTA — Regional ops, informal naming
  • LEGACY_SIGMA — Decommissioned, cryptic codes

Why It's Hard

  • Match pairs have low string similarity (avg 0.29)
  • No-match pairs have high similarity (avg 0.94)
  • Context required to resolve most pairs

Evaluation Metrics

  • F1 Score — Primary ranking metric
  • Precision — Accuracy of match predictions
  • Recall — Coverage of true matches
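For reference, a minimal sketch of how these scores can be computed from predictions and gold labels, assuming match is the positive class (which is what the precision and recall descriptions above imply); the authoritative implementation is eval/evaluate.py.

```python
def score(preds: dict[str, str], gold: dict[str, str]) -> tuple[float, float, float]:
    """Precision, recall, and F1 with 'match' as the positive class (assumed definition)."""
    tp = sum(1 for pid, label in gold.items() if label == "match" and preds.get(pid) == "match")
    pred_pos = sum(1 for label in preds.values() if label == "match")
    gold_pos = sum(1 for label in gold.values() if label == "match")
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / gold_pos if gold_pos else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```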

Evaluation

Submissions are evaluated in Raw Context mode.

Raw Context

237 enterprise documents posing a multi-hop reasoning challenge. This is the competition mode.

python baselines/llm_rag_raw_baseline.py

No Context

Entity pairs only. Zero-shot baseline reference.

python baselines/llm_baseline.py

Raw Context Documents

The context/raw/ folder contains 237 enterprise documents that pose a multi-hop reasoning challenge:

  • Runbooks — 30 markdown files (Source → IC)
  • Email Threads — 169 text files (IC → Target)
  • Wiki Pages — 26 markdown docs
  • Slack Exports — 10 JSON files
  • Policy Documents — 2 markdown docs
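To illustrate one way of delivering this context, here is a rough sketch of top-10 retrieval in the spirit of the BM25 RAG baseline; it is not the official baselines/llm_rag_raw_baseline.py, and the rank_bm25 package, the data/context/raw path, and the query construction are assumptions.

```python
from pathlib import Path
from rank_bm25 import BM25Okapi  # pip install rank-bm25

# Load all raw context documents: runbooks, email threads, wiki pages, Slack exports, policies.
doc_paths = sorted(p for p in Path("data/context/raw").rglob("*") if p.is_file())
texts = [p.read_text(errors="ignore") for p in doc_paths]
bm25 = BM25Okapi([t.lower().split() for t in texts])

def top_k_context(entity_a: dict, entity_b: dict, k: int = 10) -> list[str]:
    """Return the k documents most relevant to a pair (query construction is an assumption)."""
    query = f"{entity_a['name']} {entity_b['name']}".lower().split()
    return bm25.get_top_n(query, texts, n=k)
```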

Data

Synthetically generated, based on Phyvant's observations across Fortune 500 enterprise systems.

  • Full Dataset — entities and pairs with labels (crosser_dataset.json)
  • Raw Context — 237 enterprise documents (context/raw/)
  • Evaluation Scripts — scoring and validation (eval/)

Directory Structure

CrossER/
├── data/
│   ├── entities.json          # All 7,162 entities
│   ├── pairs.json             # All 7,603 pairs with labels
│   └── context/
│       └── raw/               # 237 enterprise documents
│           ├── migration_runbooks/
│           ├── email_threads/
│           ├── wiki_pages/
│           ├── slack_exports/
│           └── policy_documents/
├── eval/                      # Evaluation scripts
└── baselines/                 # Baseline implementations

Entity & Pair Format

// entities.json
[
  {
    "entity_id": "erp_alpha_prod_001",
    "source_system": "ERP_ALPHA",
    "name": "Maltodextrin DE20 Grade A",
    "attributes": { "category": "modified_starches", ... }
  }
]

// pairs.json
[
  {
    "pair_id": "pair_0001",
    "entity_a": "erp_alpha_prod_001",
    "entity_b": "legacy_sigma_prod_023",
    "label": "match",
    "split": "test"
  }
]
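A minimal loading sketch, assuming the data/ layout shown above, that joins each pair to its two entity records:

```python
import json
from pathlib import Path

entities = {e["entity_id"]: e for e in json.loads(Path("data/entities.json").read_text())}
pairs = json.loads(Path("data/pairs.json").read_text())

for pair in (p for p in pairs if p["split"] == "test"):
    # Each pair resolves to two full records from different source systems.
    a, b = entities[pair["entity_a"]], entities[pair["entity_b"]]
    print(pair["pair_id"], f"{a['source_system']}:{a['name']}", "<->", f"{b['source_system']}:{b['name']}")
```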

Submission Format

[ { "pair_id": "pair_0001", "predicted_label": "match" }, { "pair_id": "pair_0002", "predicted_label": "no_match" }, ... ]

Submit Your Approach

Compete on context delivery mechanisms.

Evaluate

Run the evaluation script on your predictions:

python eval/evaluate.py predictions.json --split test

Submit

  1. Fork nihalgunu/CrossER
  2. Add your predictions to submissions/
  3. Include: team/company name, base model used, context approach description
  4. Submit pull request

Proprietary approaches welcome—you don't need to open-source your method, just submit predictions.

Citation

If you use CrossER in your research:

@misc{crosser2026,
  author    = {Gunukula, Nihal and Murthy, Sameer},
  title     = {{CrossER: Cross-System Entity Resolution Benchmark}},
  year      = {2026},
  publisher = {GitHub},
  url       = {https://github.com/nihalgunu/CrossER}
}