BitterBench

Comparative inference benchmarks for local cells, clouds, and routers.

Evidence Over Vendor Claims

Benchmark the same workload across the inference stack.

BitterBench makes local cells, cloud APIs, and routers comparable by running the same packet through each plane and keeping metrics, cost, and output evidence in one place.

Sources

BitterMill, cloud, routers

One benchmark surface for local cells, provider APIs, and routed model traffic.

Packets

Same workload, every time

Prompt shape, limits, schemas, and notes travel together so throughput claims stay comparable.
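As a sketch of what "travels together" could mean, here is a minimal packet shape. The class and field names are illustrative assumptions, not BitterBench's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical benchmark packet: the workload definition that travels with
# every run so throughput claims stay comparable. Field names are assumed.
@dataclass(frozen=True)
class Packet:
    name: str                            # human-readable workload label
    prompt_template: str                 # prompt shape, with placeholders
    max_output_tokens: int               # generation limit applied on every plane
    output_schema: Optional[str] = None  # expected JSON schema, if structured
    notes: str = ""                      # context that travels with the results

summarize = Packet(
    name="summarize-1k",
    prompt_template="Summarize the following document:\n{document}",
    max_output_tokens=256,
    notes="English news articles, ~1k input tokens",
)
```

Freezing the dataclass means a packet cannot drift mid-benchmark: the same object is handed to every plane.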

Current phase

Product shell first

The public edge stays simple while the app surface hardens around real benchmark records.

Workloads

Frozen benchmark packets

A benchmark starts with a durable workload definition, not an anecdotal prompt.
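One way to make a workload definition durable rather than anecdotal is to derive its identity from its content, so any edit to the prompt, limits, or schema yields a new benchmark instead of silently shifting an existing one. The function below is an illustrative sketch, not BitterBench's actual scheme.

```python
import hashlib
import json

def freeze_workload(workload: dict) -> str:
    """Derive a stable ID by hashing a canonical serialization of the
    workload. Illustrative only; the real ID scheme is assumed."""
    canonical = json.dumps(workload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

wl = {"prompt": "Summarize: {doc}", "max_output_tokens": 256, "schema": None}
wl_id = freeze_workload(wl)

# Any change to the definition produces a different ID:
assert freeze_workload({**wl, "max_output_tokens": 512}) != wl_id
```

Sorting keys before hashing keeps the ID independent of dict ordering, which is what makes the definition durable across processes and machines.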

Runs

Timing and output truth

Queue, prefill, decode, total, cost, tokens, failures, and output evidence should sit in one comparable record.
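A minimal sketch of such a record, with derived metrics computed from the stored timings. The field names and the example numbers are assumptions for illustration, not BitterBench's actual schema.

```python
from dataclasses import dataclass

# Hypothetical run record: one comparable row per (workload, plane) execution.
@dataclass
class RunRecord:
    plane: str              # e.g. "local-cell", "cloud-api", "router"
    queue_ms: float         # time waiting before the model started
    prefill_ms: float       # prompt processing time
    decode_ms: float        # token generation time
    cost_usd: float
    output_tokens: int
    failed: bool = False
    output: str = ""        # the evidence: what the model actually produced

    @property
    def total_ms(self) -> float:
        return self.queue_ms + self.prefill_ms + self.decode_ms

    @property
    def decode_tokens_per_s(self) -> float:
        return self.output_tokens / (self.decode_ms / 1000.0)

run = RunRecord(plane="local-cell", queue_ms=12.0, prefill_ms=180.0,
                decode_ms=2000.0, cost_usd=0.0, output_tokens=300)
```

Deriving totals and throughput from the raw timings, instead of storing them, keeps every record internally consistent when planes are compared side by side.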

Decisions

Promotion by evidence

The goal is not pretty charts. The goal is to decide what should run where, and why.

Boundary

The root domain stays public. The benchmark workbench stays sealed.

`bitterbench.com` should explain the offer clearly. Authentication, workload submission, run history, and comparison state stay on `app.bitterbench.com`.