| Rank | Submission | Base Model | Context Method | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| — | baseline | — | Attribute matching | 0.145 | 0.080 | 0.783 |
| — | baseline | Claude Sonnet 4 | BM25 RAG (top-10) | 0.091 | 0.077 | 0.111 |
| — | baseline | Claude Sonnet 4 | Zero-shot | 0.090 | 0.143 | 0.065 |
| — | baseline | — | String matching | 0.000 | 0.000 | 0.000 |
| — | Your submission | Any | Your approach | — | — | — |
More baselines coming soon.
## Task definition and evaluation
Given two entity records from different enterprise systems, predict whether
they refer to the same real-world entity (match), different entities
(no_match), or evidence is insufficient (ambiguous).
Systems have access to auxiliary context typical of enterprise documentation: migration runbooks, email threads, and wiki pages.
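For concreteness, here is a minimal sketch of one pair and its label (field names and values are illustrative assumptions, not the benchmark's actual schema):

```python
# Hypothetical sketch of one CrossER pair; field names and values are
# illustrative assumptions, not the benchmark's actual schema.
pair = {
    "record_a": {"system": "CRM", "name": "Acme Corp.", "country": "US"},
    "record_b": {"system": "ERP", "name": "ACME Corporation", "country": "USA"},
}

# A submission assigns each pair exactly one of three labels.
VALID_LABELS = {"match", "no_match", "ambiguous"}
prediction = "match"
assert prediction in VALID_LABELS
```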
Submissions are evaluated in **Raw Context** mode: 237 enterprise documents posing a multi-hop challenge. This is the competition mode.

```bash
python baselines/llm_rag_raw_baseline.py
```

A **Zero-shot** mode serves as a baseline reference: the model sees entity pairs only, with no auxiliary context.

```bash
python baselines/llm_baseline.py
```
The `context/raw/` folder contains the 237 enterprise documents that form the multi-hop reasoning challenge:

- Synthetically generated based on Phyvant's observations across Fortune 500 enterprise systems
Participants compete on context delivery mechanisms: how the raw documents are retrieved, filtered, and presented to the model, as in the sketch below.
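As a minimal sketch of one such mechanism, mirroring the BM25 RAG (top-10) baseline from the leaderboard (this assumes the `rank_bm25` package and plain-text files under `context/raw/`; the query string is illustrative, and this is not the repo's actual baseline code):

```python
# Minimal BM25 context-delivery sketch. Assumes the rank_bm25 package and
# plain-text documents under context/raw/; not the repo's actual baseline code.
from pathlib import Path
from rank_bm25 import BM25Okapi

docs = [p.read_text() for p in Path("context/raw").iterdir() if p.is_file()]
bm25 = BM25Okapi([d.lower().split() for d in docs])

# Retrieve the 10 most relevant documents for a record pair, as in the baseline.
query = "Acme Corp ACME Corporation CRM ERP migration"
top_docs = bm25.get_top_n(query.lower().split(), docs, n=10)
```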
Run the evaluation script on your predictions:
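A hypothetical invocation, assuming an `evaluate.py` entry point (the script name and flag are placeholders, not the repo's documented CLI):

```bash
# Placeholder command; substitute the repo's actual evaluation script.
python evaluate.py --predictions submissions/my_predictions.jsonl
```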
Place your prediction file in `submissions/`. Proprietary approaches are welcome: you don't need to open-source your method, just submit predictions.
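One plausible shape for a predictions file, written as JSON Lines (the schema here is an assumption; check the repo's submission spec for the real format):

```python
# Hypothetical predictions writer; the JSONL schema is an assumption,
# not the repo's documented submission format.
import json

predictions = [
    {"pair_id": "0001", "prediction": "match"},
    {"pair_id": "0002", "prediction": "ambiguous"},
]
with open("submissions/my_predictions.jsonl", "w") as f:
    for p in predictions:
        f.write(json.dumps(p) + "\n")
```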
If you use CrossER in your research, please cite it: