| docs | ||
| results | ||
| src/benchmark | ||
| static | ||
| tests | ||
| .coveragerc | ||
| .env.example | ||
| .gitignore | ||
| config.yaml | ||
| filler.toml | ||
| main.py | ||
| prompts.yaml | ||
| pyproject.toml | ||
| README.md | ||
LLM Principles Benchmark
A framework for evaluating whether principles prompts measurably improve LLM behavior across conciseness, filler reduction, guessing abstention, and code correctness.
Why This Exists
Most LLM interactions suffer from:
- Sycophantic openers: "Sure! I'd be happy to help..."
- Verbose hedging: "Probably, maybe, it seems like..."
- Unnecessary exploration: Investigating beyond what's needed
- Guessing without context: Inventing data when information is missing
- Unicode bloat: Smart quotes, em dashes, non-ASCII characters
This benchmark tests whether a compact principles prompt (14 rules, ~179 tokens) can reduce these behaviors without degrading technical correctness.
Hypotheses Tested
| Hypothesis | Expected Effect |
|---|---|
| H1 | Response length decreases ≥ 20% |
| H2 | Filler patterns decrease ≥ 50% |
| H3 | Correct abstention increases when context is missing |
| H4 | Unnecessary exploration decreases |
| H5 | Technical correctness does not degrade |
| H6 | Thinking/reasoning tokens decrease or stay flat |
| H7 | Effect differs between session persistence vs. per-prompt injection |
Supported Providers
| Provider | Model | Input $/1M | Output $/1M | Reasoning |
|---|---|---|---|---|
| OpenAI | gpt-5.5 | $5.00 | $30.00 | high |
| Anthropic | claude-opus-4-8 | $5.00 | $25.00 | adaptive |
| gemini-3.5-flash | $1.50 | $9.00 | disabled | |
| DeepSeek | deepseek-v4-pro | $0.435 | $0.87 | enabled |
| Moonshot | kimi-k2.6 | $0.95 | $4.00 | enabled |
Architecture
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ prompts.yaml │────▶│ experiment │────▶│ runner │
│ config.yaml │ │ builder │ │ (async/await) │
└─────────────────┘ └─────────────────┘ └────────┬────────┘
│
┌────────────────────────────────────────────┼────────────┐
│ │ │
▼ ▼ ▼
┌────────────┐ ┌────────────┐ ┌──────────┐
│ providers │ │ metrics │ │ judge │
│ (5 APIs) │ │ (local) │ │ (LLM) │
└────────────┘ └────────────┘ └──────────┘
│ │ │
│ ┌──────────────┐ │ │
└────────▶│ results/raw/ │◀──────────────────┘◀───────────┘
│ *.json │
└──────┬───────┘
│
▼
┌──────────────┐
│ normalization│
│ (usage data)│
└──────┬───────┘
│
┌────────────────┼────────────────┐
│ │ │
▼ ▼ ▼
┌────────────┐ ┌────────────┐ ┌────────────┐
│ normalized │ │ charts │ │ comparison │
│ .csv │ │ (.json) │ │ (W/T/L) │
└─────┬──────┘ └─────┬──────┘ └────────────┘
│ │
└───────┬───────┘
│
▼
┌────────────┐
│ report │
│ (.md) │
└─────┬──────┘
│
▼
┌─────────────┐
│inject_report│
│ (.html) │
└─────────────┘
Data Flow
Input Files Generated Files
───────────────────────────────── ─────────────────────────────────
prompts.yaml results/raw/<uuid>.json (raw API responses)
config.yaml results/normalized.csv (processed metrics)
filler.toml results/report.md (markdown report)
.env results/chart_data.json (Chart.js data)
results/judge_conclusion.txt (raw LLM judge output)
static/index.html (injected dashboard)
Quick Start
1. Install
uv sync
2. Configure API Keys
cp .env.example .env
# Edit .env and add your API keys
Required keys depend on which providers you configure in config.yaml:
OPENAI_API_KEYANTHROPIC_API_KEYGEMINI_API_KEYDEEPSEEK_API_KEYMOONSHOT_API_KEY
3. Verify Setup
# Check all providers
uv run python -m benchmark.check --provider all
# Or check just one
uv run python -m benchmark.check --provider openai
# Skip API calls, test functions only
uv run python -m benchmark.check --provider all --skip-api
4. Run Benchmark
# Full benchmark (expensive - 500 API calls, ~$2.17)
uv run python -m benchmark.runner
# Generate report from results (includes judge analysis if enabled in config.yaml)
uv run python -m benchmark.report
# Inject report into website
uv run python -m benchmark.inject_report
5. View Results
Open static/index.html in a browser or serve it:
python -m http.server 8000 --directory web
# Navigate to http://localhost:8000
Workflow
The benchmark follows a 4-step pipeline:
┌─────────────┐ ┌──────────────┐ ┌──────────────┐ ┌─────────────────┐
│ check │───▶│ runner │───▶│ report │───▶│ inject_report │
│ (smoke) │ │ (benchmark) │ │ (analyze) │ │ (dashboard) │
└─────────────┘ └──────────────┘ └──────────────┘ └─────────────────┘
│ │ │ │
▼ ▼ ▼ ▼
API connectivity results/raw/ results/report.md static/index.html
Metric functions *.json files normalized.csv (interactive
Code extraction chart_data.json charts + tables)
judge_conclusion.txt
Step 1: Smoke Test
uv run python -m benchmark.check --provider all
Verifies API connectivity, response parsing, and metrics extraction without running the full benchmark.
Step 2: Run Benchmark
uv run python -m benchmark.runner
Executes all 500 API calls across 5 providers, 2 modes, and 25 prompts. Results saved to results/raw/.
Step 3: Generate Report
uv run python -m benchmark.report
Processes raw results into results/report.md with tables, statistics, and comparison analysis.
If judge.enabled: true in config.yaml, the report generation also:
- Calls the configured judge LLM to analyze the full report
- Saves raw judge response to
results/judge_conclusion.txt - The judge prompt includes the full principles text for context
Step 4: Inject into Website
uv run python -m benchmark.inject_report
Converts the markdown report to HTML and injects it into static/index.html between REPORT_BLOCK_START and REPORT_BLOCK_END markers. Generates interactive Chart.js visualizations from results/chart_data.json.
Project Structure
llm-principles-benchmark/
├── src/benchmark/
│ ├── __init__.py
│ ├── check.py # Smoke test script (moved from root)
│ ├── runner.py # Main benchmark orchestration (~900 lines)
│ ├── providers.py # API clients for 5 providers with reasoning params
│ ├── metrics.py # Local metrics (filler, guessing, abstention, ASCII)
│ ├── code_tests.py # Code extraction + AST validation + pytest runner
│ ├── judge.py # LLM judge for subjective evaluation + report analysis
│ ├── pricing.py # Cost calculation including reasoning tokens
│ ├── report.py # Markdown report generation with judge conclusion
│ ├── charts.py # Chart data generation for 5 chart types
│ ├── comparison.py # WIN/TIE/LOSS scoring logic
│ ├── experiment.py # Experiment matrix builder (mode × model × prompt)
│ ├── normalization.py # Provider-specific usage normalization
│ ├── schemas.py # Pydantic data models
│ ├── ui.py # Rich terminal dashboard components
│ └── inject_report.py # Markdown-to-HTML injector with Chart.js
├── tests/
│ ├── conftest.py # Shared fixtures
│ ├── test_charts.py # Chart data generation
│ ├── test_generated_code.py # Code extraction and pytest runner
│ ├── test_hypothesis_properties.py # Property-based tests (Hypothesis)
│ ├── test_code_tests_llm.py # LLM fallback code extraction
│ ├── test_metrics.py # Filler, guessing, abstention detection
│ ├── test_pricing.py # Cost calculation
│ ├── test_schemas.py # Pydantic model validation
│ ├── test_providers.py # Mocked API clients for 5 providers
│ ├── test_judge.py # Mocked LLM judge
│ ├── test_report.py # Report generation
│ ├── test_runner.py # Runner helper functions
│ ├── test_runner_async.py # Async runner functions
│ ├── test_runner_integration.py # File-based integration tests
│ ├── test_runner_edge_cases.py # Edge cases and error handling
│ └── test_ui.py # Terminal UI components
├── static/
│ └── index.html # Flat dashboard template (gets injected)
├── results/
│ ├── raw/ # Raw API responses (*.json)
│ ├── normalized.csv # Processed results
│ ├── report.md # Generated markdown report
│ ├── chart_data.json # Serialized chart data for dashboard
│ └── judge_conclusion.txt # Raw LLM judge response with markers
├── docs/
│ └── PLAN.md # Full experimental design
├── config.yaml # Benchmark configuration
├── prompts.yaml # 25 test prompts + principles
├── filler.toml # 100+ regex patterns for filler detection
├── .env.example # Required environment variables
├── pyproject.toml # Dependencies (uv)
└── main.py # Entry point stub
Configuration
Edit config.yaml:
concurrency: 5 # Parallel API calls
timeout_seconds: 120 # Request timeout
max_output_tokens: 2048 # Max tokens per response (accommodates reasoning mode)
temperature: 0 # Deterministic output (omitted for some providers)
execution_modes:
- session_once # Principles sent once at session start (Mode 2)
- isolated_per_prompt # Principles resent for each prompt (Mode 1)
judge:
enabled: true # Enable LLM judge for report analysis
provider: moonshot # Judge provider
model: kimi-k2.6 # Judge model
blind: true # Randomized A/B position
models:
- provider: openai
model: gpt-5.5
reasoning: high # OpenAI reasoning effort (high/xhigh)
- provider: anthropic
model: claude-opus-4-8
reasoning: adaptive # Anthropic thinking mode (adaptive/enabled)
- provider: gemini
model: gemini-3.5-flash
reasoning: disabled # Gemini thoughts (disabled/includeThoughts)
- provider: deepseek
model: deepseek-v4-pro
reasoning: enabled # DeepSeek thinking (enabled/disabled)
- provider: moonshot
model: kimi-k2.6
reasoning: enabled # Moonshot reasoning (enabled/disabled)
Reasoning Configuration
Different providers expose reasoning/thinking tokens differently:
| Provider | Config Value | Behavior |
|---|---|---|
| OpenAI | high |
reasoning.effort: "high" (not xhigh — consumes all tokens) |
| Anthropic | adaptive |
thinking.type: "adaptive" |
| Gemini | disabled |
thinkingConfig: {includeThoughts: false} (tracks thoughtsTokenCount separately) |
| DeepSeek | enabled |
Returns both content and reasoning_content |
| Moonshot | enabled |
Returns both content and reasoning_content |
When reasoning is enabled, the framework:
- Extracts reasoning tokens from provider-specific response fields
- Includes them in cost calculation (billed separately or as output tokens)
- Falls back to
reasoning_contentwhencontentis empty
How It Works
Execution Modes
Mode 1: Isolated Per Prompt
- Baseline: New session per prompt, no principles
- Treatment: New session per prompt, principles included as first message
- Measures: Direct effect per task, minimal cross-contamination
- Higher input cost (principles repeated 25× per model)
Mode 2: Session Once
- Baseline: 25 prompts in one session without principles
- Treatment: Principles sent in first turn, then 25 prompts in same session
- Measures: Instruction persistence, context degradation, conversation cost
- Risk of context contamination in later turns
Prompt Categories
| Category | Prompts | What It Tests |
|---|---|---|
| Conciseness | P01-P05 | Response brevity under constraints |
| Guessing | P06-P10 | Abstention when context is missing |
| Code | P11-P15 | Correctness via pytest validation |
| Rewrite | P16-P20 | Filler removal and tone improvement |
| Long Context | P21-P25 | Smallest useful action first |
Metrics Collected
Objective Metrics:
visible_chars,visible_lines— Response lengthvisible_output_tokens— Token count from APIreasoning_tokens— Thinking tokens (when exposed by provider)filler_count— Regex matches against 100+ patterns infiller.tomlascii_violation— Non-ASCII character detectionguessing_violation— Heuristic invention detectioncorrect_abstention— Proper "I don't know" responsesclosing_fluff— Generic endings detectedsmallest_useful_action— Concrete first stepscost_usd— Real API cost including reasoning tokenslatency_ms— Response time
Code Validation:
- Deterministic extraction from markdown (
extract_code_deterministic) - AST parsing validation (
is_valid_python) - Pytest execution (P11-P15)
Subjective Evaluation:
- LLM judge (
judge_pair) compares baseline vs. principles blindly - Winner determined by conciseness, correctness, filler, guessing
- Per-prompt WIN/TIE/LOSS scoring (
calculate_comparisons)
Report Analysis:
- Separate judge call (
analyze_report) generates executive summary - Receives full benchmark summary + principles text as context
- Saves raw response with
BEGIN_CONCLUSION/END_CONCLUSIONmarkers - Parsed at injection time into styled HTML card with disclaimer
Comparison Groups
Three mandatory comparisons:
- Baseline vs. Mode 1 (Isolated): Does principles help when repeated?
- Baseline vs. Mode 2 (Session): Does principles help when sent once?
- Mode 1 vs. Mode 2: Is once enough or is repetition better?
Important: Comparisons stay within the same execution mode — baseline isolated vs. principles isolated, baseline session vs. principles session. Never crosses modes.
Success Criteria
Principles "works" if vs. baseline:
- Output tokens ↓ ≥ 20%
- Thinking tokens ↓ or flat
- Total billable tokens don't increase significantly
- Filler ↓ ≥ 50%
- Guessing ↓ on impossible prompts
- Code pass rate doesn't degrade
- ASCII violations ≈ 0
- Cost and latency don't rise meaningfully
Web Dashboard
The generated static/index.html features:
Interactive Charts (Chart.js):
- Mode 1 & 2 Scorecards — Side-by-side metric comparison cards
- Baseline vs Principles by Model — Grouped bar charts with warm/cool color palette (orange = baseline, teal = principles)
- Total Token Usage — Input/output/thinking breakdown per provider
- Win/Tie/Loss — Stacked horizontal bars showing per-prompt comparisons
- Improvement % Heatmaps — Color-coded tables (green = improvement, red = degradation) for both Isolated and Session modes
- Comparison Tables — Baseline vs Mode 1, Baseline vs Mode 2, Mode 1 vs Mode 2
Design:
- Flat design with Fira Code monospace font
- 0 border-radius (no rounded corners)
- Dark header with accent colors
- Responsive card grids replacing tables where appropriate
Content:
- Judge conclusion card with 💡 icon and disclaimer
- Methodology explanation cards
- Copy button for principles text
- Category tabs for prompt browsing
Smoke Testing
Before running the full benchmark, verify everything works:
# Test all functions and API connectivity
uv run python -m benchmark.check --provider all
# Skip API calls, test functions only
uv run python -m benchmark.check --provider all --skip-api
# Test specific provider
uv run python -m benchmark.check --provider gemini
Development
Running Tests
All tests (~55 tests, ~5 seconds):
uv run pytest tests/ -v
Run a specific test file:
uv run pytest tests/test_providers.py -v
uv run pytest tests/test_runner.py -v
With coverage report:
uv run pytest tests/ --cov=benchmark --cov-report=term-missing
Coverage threshold is 90%. Configured in .coveragerc.
Test Suite
All API calls are mocked via unittest.mock and pytest-asyncio. No real HTTP requests are made during tests, and no API keys or environment variables are required.
| Test File | Tests | What It Covers |
|---|---|---|
test_charts.py |
~15 | Chart data generation, aggregation logic |
test_generated_code.py |
19 | Code extraction from markdown, AST validation, pytest runner |
test_hypothesis_properties.py |
22 | Property-based tests: invariants, edge cases, fuzzing |
test_metrics.py |
27 | Filler detection, guessing heuristics, abstention, ASCII checks |
test_pricing.py |
7 | Cost calculation per model, hypothesis property testing |
test_schemas.py |
13 | Pydantic model validation, hypothesis fuzzing |
test_providers.py |
20 | Mocked HTTP calls for all 5 LLM providers + retry logic |
test_judge.py |
13 | Mocked provider calls for LLM judge, text extraction |
test_report.py |
5 | Report generation with mocked pandas DataFrames |
test_runner.py |
27 | Runner functions: messages, usage normalization, comparisons |
test_runner_async.py |
13 | Async runner functions with mocked clients |
test_runner_integration.py |
5 | File-based code tests & judge with temp directories |
test_runner_edge_cases.py |
6 | Exception handling, provider filtering, comparison edge cases |
test_code_tests_llm.py |
6 | Async LLM code extraction fallback |
test_ui.py |
27 | Dashboard, tables, panels, status badges |
Test Design
- Unit tests: Fast, isolated, no I/O. Most tests run in < 10ms.
- Async tests:
pytest-asyncio(auto mode) with mockedAsyncClient. All provider calls are mocked withAsyncMock. - Integration tests: Temporary files and directories (
tmp_pathfixture), no network access. - Property tests: Hypothesis generates random inputs to verify invariants across 100+ examples per test.
- Coverage: Configured in
.coveragercwith 90% fail-under threshold.
Hypothesis Property Tests
Property-based tests verify invariants across hundreds of generated examples:
# Run only hypothesis tests
uv run pytest tests/test_hypothesis_properties.py -v
# Run with verbose output showing generated examples
uv run pytest tests/test_hypothesis_properties.py -v --hypothesis-verbosity=verbose
# Run with specific seed for reproducibility
uv run pytest tests/test_hypothesis_properties.py -v --hypothesis-seed=12345
Properties tested:
strip_repl_promptsis idempotentis_valid_pythonnever crashes on any inputcalculate_costalways returns non-negativevisible_metricsalways returns dict with required keysdetect_guessingreturns bool for all inputsbuild_messagesreturns correct message countscalculate_comparisonsalways produces valid outcomes (WIN/LOSS/TIE)
Adding Tests
When adding a new provider:
- Add mocked test in
test_providers.py - Mock env vars with
patch.dict("os.environ", ...) - Assert on call arguments (URL, headers, payload)
When adding runner logic:
- Add sync tests in
test_runner.py - Add async tests in
test_runner_async.py - Add integration tests in
test_runner_integration.pyif files are involved - Add property tests in
test_hypothesis_properties.pyif there are invariants to verify
Linting
uv run ruff check src/ tests/
uv run ruff check src/benchmark/check.py
Adding a New Provider
- Add API function to
src/benchmark/providers.py - Register in
PROVIDER_CALLSdict - Add pricing to
PRICINGdict - Add to
config.yamlmodels list with reasoning config - Add env var to
.env.example - Add usage normalization in
src/benchmark/normalization.py - Add mocked test in
tests/test_providers.py
Adding a New Prompt
- Add to
prompts.yamlwithid,category,text - For code prompts, add
expected_symboland test template tocode_tests.py - Update
judge_required()injudge.pyif needed
Cost Estimate
Full run (2 modes × 5 models × 25 prompts × 2 conditions):
| Mode | Input Tokens | Output Tokens | Estimated Cost |
|---|---|---|---|
| Session Once | ~9,475 | 15,000 | ~$1.06 |
| Isolated | ~30,955 | 15,000 | ~$1.11 |
| Total | ~40,430 | 30,000 | ~$2.17 |
Actual cost calculated from API usage metadata including reasoning tokens.
Limitations
- 25 prompts show trends, not universal proof
- Single run has variance; multiple runs recommended for publication
- Models change without notice
- APIs report usage differently (especially reasoning token separation)
- Semantic evaluation depends on judge model choice
- Session mode risks context contamination in later turns
- Isolated mode increases input cost (principles repeated)
- Some providers don't expose reasoning token counts separately
License
MIT
References
See docs/PLAN.md for full experimental design, methodology, and analysis plan.