LOCOMO Benchmark
Benchmark runner for measuring CME retrieval quality using LOCOMO QA pairs.
Use this for CI integration and evaluating search quality across modes.
Endpoints
Service surface
POST
python -m scripts.run_locomo --api-url <url> --tenant <id> --search-mode hybrid
Run the LOCOMO benchmark against the CME API with the specified search mode.
N/A
Run hybrid benchmark
Execute LOCOMO with hybrid search.
python -m scripts.run_locomo \
  --api-url http://localhost:8080 \
  --tenant <tenant-id> \
  --search-mode hybrid \
  --graph \
  --output-format json
Request example
CLI invocation
Run LOCOMO with vector-only mode for ablation study.
python -m scripts.run_locomo \
  --api-url http://localhost:8080 \
  --tenant <tenant-id> \
  --search-mode vector-only \
  --k 10 \
  --output-format json
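For an ablation study, the same invocation is repeated once per retrieval configuration. The sketch below only builds the argument lists from the flags shown in the examples above. The helper name `build_command`, the tenant value `demo-tenant`, and the choice to express the graph variant as `--search-mode hybrid --graph` are assumptions made for illustration, not part of the runner itself.

```python
import shlex
import sys
from typing import List


def build_command(api_url: str, tenant: str, search_mode: str,
                  graph: bool = False, k: int = 10) -> List[str]:
    """Build the argv list for one benchmark run (hypothetical helper)."""
    cmd = [
        sys.executable, "-m", "scripts.run_locomo",
        "--api-url", api_url,
        "--tenant", tenant,
        "--search-mode", search_mode,
        "--k", str(k),
        "--output-format", "json",
    ]
    if graph:
        # Assumption: graph retrieval is switched on with the --graph flag.
        cmd.append("--graph")
    return cmd


# One entry per configuration to compare: (search mode, graph enabled).
ABLATION_RUNS = [("vector-only", False), ("hybrid", False), ("hybrid", True)]

commands = [
    build_command("http://localhost:8080", "demo-tenant", mode, graph)
    for mode, graph in ABLATION_RUNS
]

for cmd in commands:
    # Print a copy-pasteable shell line for each run.
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, capture_output=True, text=True, check=True)` from the repository root, so that the JSON written to stdout can be collected per configuration.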
Base path
scripts/run_locomo.py
Schemas
OpenAPI-style field tables
CLI arguments
Supported command-line arguments for run_locomo.py.
| Field | Type | Required | Description |
|---|---|---|---|
| --api-url | string | required | CME API base URL. |
| --tenant | string | required | Tenant ID for test data. |
| --search-mode | string | optional | Search mode: vector-only, hybrid, hybrid+graph (default hybrid). |
| --graph | boolean | optional | Enable graph retrieval context. |
| --k | number | optional | Number of results to retrieve (default 10). |
| --output-format | string | optional | Output format: json (default) or text. |
| --dataset-path | string | optional | Path to LOCOMO dataset (downloads if not set). |
| --harness-dir | string | optional | Path to LOCOMO harness directory containing datasets/. |
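To make the table concrete, here is a minimal `argparse` sketch of a parser that accepts these arguments with the documented defaults. It is an illustration of the interface described above, not the actual contents of `run_locomo.py`; the function name `build_parser` and the tenant value `demo-tenant` are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented CLI arguments (illustrative only)."""
    parser = argparse.ArgumentParser(prog="run_locomo")
    parser.add_argument("--api-url", required=True, help="CME API base URL.")
    parser.add_argument("--tenant", required=True, help="Tenant ID for test data.")
    parser.add_argument(
        "--search-mode",
        choices=["vector-only", "hybrid", "hybrid+graph"],
        default="hybrid",
        help="Search mode.",
    )
    parser.add_argument("--graph", action="store_true",
                        help="Enable graph retrieval context.")
    parser.add_argument("--k", type=int, default=10,
                        help="Number of results to retrieve.")
    parser.add_argument("--output-format", choices=["json", "text"],
                        default="json", help="Output format.")
    parser.add_argument("--dataset-path", default=None,
                        help="Path to LOCOMO dataset.")
    parser.add_argument("--harness-dir", default=None,
                        help="Path to LOCOMO harness directory.")
    return parser


# Only the two required arguments are given; everything else uses defaults.
args = build_parser().parse_args(
    ["--api-url", "http://localhost:8080", "--tenant", "demo-tenant"]
)
print(args.search_mode, args.k, args.output_format, args.graph)
```

Note that `argparse` exposes dashed option names with underscores, so `--search-mode` is read as `args.search_mode`.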
Score metrics
Composite scores computed by the benchmark runner.
| Field | Type | Required | Description |
|---|---|---|---|
| recall@10 | number | required | Fraction of the relevant results that appear in the top 10. |
| mrr | number | required | Mean reciprocal rank of first relevant result. |
| ndcg@10 | number | required | Normalized discounted cumulative gain at 10. |
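The three metrics can be sketched for a single query as follows, using the standard textbook definitions with binary relevance. This is an assumption about how the scores are defined, not the runner's actual code; the function names and the example IDs are hypothetical, and the ranked list is assumed to contain no duplicate IDs. The reported `mrr` would be the mean of `reciprocal_rank` over all queries.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    # The ideal ranking places every relevant item first, up to k positions.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical query: two of the three relevant items are retrieved.
ranked_ids = ["a", "b", "c", "d"]
relevant_ids = {"b", "d", "z"}

print(recall_at_k(ranked_ids, relevant_ids))
print(reciprocal_rank(ranked_ids, relevant_ids))
print(ndcg_at_k(ranked_ids, relevant_ids))
```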
Response examples
What the API returns
Benchmark output
JSON output with composite scores for CI integration.
{
  "mode": "hybrid",
  "graph": true,
  "scores": {
    "recall@10": 0.87,
    "mrr": 0.72,
    "ndcg@10": 0.81
  },
  "total_queries": 150,
  "passed": true
}
Notes
Implementation notes
- The benchmark runner falls back to synthetic smoke-test data if the LOCOMO dataset is unavailable.
- Supports ablation studies: compare vector-only vs hybrid vs hybrid+graph modes.
- Set the CME_API_KEY environment variable for API authentication, or use session auth.
- Output JSON format is designed for CI integration and trending over time.
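As an illustration of the CI use case, the sketch below reads a report in the JSON shape shown under "Benchmark output" and compares each score against a minimum. The threshold values and the `check_report` helper are hypothetical and chosen only for the example; a real pipeline would pick its own limits and read the runner's stdout rather than an inline sample.

```python
import json
from typing import Dict, List

# Hypothetical minimum scores; these are not values defined by the runner.
THRESHOLDS: Dict[str, float] = {"recall@10": 0.80, "mrr": 0.65, "ndcg@10": 0.75}


def check_report(report: dict, thresholds: Dict[str, float]) -> List[str]:
    """Return one message per metric that is missing or below its minimum."""
    failures = []
    scores = report.get("scores", {})
    for metric, minimum in thresholds.items():
        value = scores.get(metric)
        if value is None:
            failures.append(f"{metric}: missing from report")
        elif value < minimum:
            failures.append(f"{metric}: {value:.2f} is below {minimum:.2f}")
    return failures


# Sample report copied from the benchmark output example.
sample = json.loads("""
{
  "mode": "hybrid",
  "graph": true,
  "scores": {"recall@10": 0.87, "mrr": 0.72, "ndcg@10": 0.81},
  "total_queries": 150,
  "passed": true
}
""")

failures = check_report(sample, THRESHOLDS)
print("OK" if not failures else "\n".join(failures))
```

In a CI job, a non-empty `failures` list would typically be turned into a non-zero exit code, for example with `sys.exit(1)`.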