LOCOMO Benchmark

Benchmark runner for measuring CME retrieval quality using LOCOMO QA pairs.

Use this for CI integration and evaluating search quality across modes.

Endpoints

Service surface

python -m scripts.run_locomo --api-url <url> --tenant <id> --search-mode hybrid

Run the LOCOMO benchmark against the CME API with the specified search mode.

Run hybrid benchmark

Execute LOCOMO with hybrid search.

python -m scripts.run_locomo \
  --api-url http://localhost:8080 \
  --tenant <tenant-id> \
  --search-mode hybrid \
  --graph \
  --output-format json

Request example

CLI invocation

Run LOCOMO in vector-only mode for an ablation study.

python -m scripts.run_locomo \
  --api-url http://localhost:8080 \
  --tenant <tenant-id> \
  --search-mode vector-only \
  --k 10 \
  --output-format json

Base path

scripts/run_locomo.py

Schemas

OpenAPI-style field tables

CLI arguments

Supported command-line arguments for run_locomo.py.

Field	Type	Required	Description
--api-url	string	required	CME API base URL.
--tenant	string	required	Tenant ID for test data.
--search-mode	string	optional	Search mode: vector-only, hybrid, hybrid+graph (default hybrid).
--graph	boolean	optional	Enable graph retrieval context.
--k	number	optional	Number of results to retrieve (default 10).
--output-format	string	optional	Output format: json (default) or text.
--dataset-path	string	optional	Path to LOCOMO dataset (downloads if not set).
--harness-dir	string	optional	Path to LOCOMO harness directory containing datasets/.
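
The table above can be read as an argparse definition. The following is a minimal sketch of such a parser, not the actual code in scripts/run_locomo.py: the function name build_parser, the use of choices, and the tenant value "demo" are assumptions made for illustration, while the flag names and defaults are taken from the table.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented argument table."""
    parser = argparse.ArgumentParser(prog="run_locomo")
    parser.add_argument("--api-url", required=True, help="CME API base URL.")
    parser.add_argument("--tenant", required=True, help="Tenant ID for test data.")
    parser.add_argument(
        "--search-mode",
        choices=["vector-only", "hybrid", "hybrid+graph"],
        default="hybrid",
        help="Search mode (default hybrid).",
    )
    parser.add_argument("--graph", action="store_true",
                        help="Enable graph retrieval context.")
    parser.add_argument("--k", type=int, default=10,
                        help="Number of results to retrieve (default 10).")
    parser.add_argument("--output-format", choices=["json", "text"],
                        default="json", help="Output format (default json).")
    parser.add_argument("--dataset-path", default=None,
                        help="Path to LOCOMO dataset.")
    parser.add_argument("--harness-dir", default=None,
                        help="Path to LOCOMO harness directory.")
    return parser


# Parse a sample argument list; "demo" is a placeholder tenant ID.
args = build_parser().parse_args(
    ["--api-url", "http://localhost:8080", "--tenant", "demo",
     "--search-mode", "vector-only"]
)
print(args.search_mode, args.k, args.output_format, args.graph)
```

Omitted optional flags fall back to the defaults listed in the table, so only --api-url and --tenant must be supplied.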

Score metrics

Composite scores computed by the benchmark runner.

Field	Type	Required	Description
recall@10	number	required	Proportion of relevant results that appear in the top 10.
mrr	number	required	Mean, over all queries, of the reciprocal rank of the first relevant result.
ndcg@10	number	required	Normalized discounted cumulative gain at rank 10.
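
For reference, the three metrics can be computed as in the sketch below. It uses the standard textbook definitions with binary relevance and unique result IDs; the function names and the sample IDs are hypothetical, and the runner's own implementation may differ (for example, in how it grades relevance).

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the relevant IDs that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is found."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# One query: two relevant memories, found at ranks 2 and 4.
ranked = ["m3", "m1", "m7", "m2"]
relevant = {"m1", "m2"}
recall = recall_at_k(ranked, relevant)
rr = reciprocal_rank(ranked, relevant)
ndcg = ndcg_at_k(ranked, relevant)
print(recall, rr, round(ndcg, 4))
```

MRR is the mean of reciprocal_rank over all queries in the benchmark; the other two scores are likewise averaged per query.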

Response examples

What the API returns

Benchmark output

JSON output with composite scores for CI integration.

{
  "mode": "hybrid",
  "graph": true,
  "scores": {
    "recall@10": 0.87,
    "mrr": 0.72,
    "ndcg@10": 0.81
  },
  "total_queries": 150,
  "passed": true
}
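
A CI job could gate on this output roughly as follows. This is a sketch under stated assumptions: the threshold values and the helper failing_metrics are hypothetical, and the logic behind the runner's own "passed" field is not documented here, so the example checks the scores directly.

```python
import json

# Hypothetical minimum scores; choose values that match your own baseline.
THRESHOLDS = {"recall@10": 0.80, "mrr": 0.65, "ndcg@10": 0.75}


def failing_metrics(report: dict, thresholds: dict) -> list:
    """Return the names of metrics whose score is below its threshold."""
    scores = report.get("scores", {})
    return [
        name for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum
    ]


# Sample report copied from the benchmark output example above.
report = json.loads("""
{
  "mode": "hybrid",
  "graph": true,
  "scores": {"recall@10": 0.87, "mrr": 0.72, "ndcg@10": 0.81},
  "total_queries": 150,
  "passed": true
}
""")

failures = failing_metrics(report, THRESHOLDS)
if failures:
    print("Below threshold:", ", ".join(failures))
else:
    print("All metrics meet their thresholds.")
```

In a pipeline, a non-empty failures list would typically be turned into a non-zero exit code so that the job fails.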

Notes

Implementation notes

The benchmark runner falls back to synthetic smoke-test data if the LOCOMO dataset is unavailable.
The runner supports ablation studies: compare the vector-only, hybrid, and hybrid+graph modes.
Set the CME_API_KEY environment variable for API authentication, or use session auth.
The JSON output format is designed for CI integration and for tracking score trends over time.
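
An ablation run across the three documented modes could be scripted along these lines. The helper build_command and the tenant value "demo-tenant" are hypothetical; the flags are the ones listed in the CLI arguments table. The sketch only builds and prints the commands, so it can be inspected without a running CME API.

```python
import shlex
import sys

# Search modes listed in the CLI arguments table.
MODES = ["vector-only", "hybrid", "hybrid+graph"]


def build_command(api_url: str, tenant: str, mode: str, k: int = 10) -> list:
    """Build one benchmark invocation as an argument list (hypothetical helper)."""
    return [
        sys.executable, "-m", "scripts.run_locomo",
        "--api-url", api_url,
        "--tenant", tenant,
        "--search-mode", mode,
        "--k", str(k),
        "--output-format", "json",
    ]


# "demo-tenant" is a placeholder; substitute a real tenant ID.
commands = [build_command("http://localhost:8080", "demo-tenant", m) for m in MODES]
for command in commands:
    print(shlex.join(command))
```

Each argument list can be passed to subprocess.run(command, capture_output=True, text=True) with CME_API_KEY set in the environment. Parsing the captured stdout with json.loads assumes the runner prints only the JSON report there, which this page does not confirm.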