No description
Find a file
2026-06-01 00:55:28 -03:00
docs first commit 2026-06-01 00:26:57 -03:00
results first commit 2026-06-01 00:26:57 -03:00
src/benchmark first commit 2026-06-01 00:26:57 -03:00
static fix: repo path, remove pages (which didnt work) 2026-06-01 00:55:28 -03:00
tests first commit 2026-06-01 00:26:57 -03:00
.coveragerc first commit 2026-06-01 00:26:57 -03:00
.env.example first commit 2026-06-01 00:26:57 -03:00
.gitignore first commit 2026-06-01 00:26:57 -03:00
config.yaml first commit 2026-06-01 00:26:57 -03:00
filler.toml first commit 2026-06-01 00:26:57 -03:00
main.py first commit 2026-06-01 00:26:57 -03:00
prompts.yaml first commit 2026-06-01 00:26:57 -03:00
pyproject.toml first commit 2026-06-01 00:26:57 -03:00
README.md first commit 2026-06-01 00:26:57 -03:00

LLM Principles Benchmark

A framework for evaluating whether principles prompts measurably improve LLM behavior across conciseness, filler reduction, guessing abstention, and code correctness.

Why This Exists

Most LLM interactions suffer from:

  • Sycophantic openers: "Sure! I'd be happy to help..."
  • Verbose hedging: "Probably, maybe, it seems like..."
  • Unnecessary exploration: Investigating beyond what's needed
  • Guessing without context: Inventing data when information is missing
  • Unicode bloat: Smart quotes, em dashes, non-ASCII characters

This benchmark tests whether a compact principles prompt (14 rules, ~179 tokens) can reduce these behaviors without degrading technical correctness.

Hypotheses Tested

Hypothesis Expected Effect
H1 Response length decreases ≥ 20%
H2 Filler patterns decrease ≥ 50%
H3 Correct abstention increases when context is missing
H4 Unnecessary exploration decreases
H5 Technical correctness does not degrade
H6 Thinking/reasoning tokens decrease or stay flat
H7 Effect differs between session persistence vs. per-prompt injection

Supported Providers

Provider Model Input $/1M Output $/1M Reasoning
OpenAI gpt-5.5 $5.00 $30.00 high
Anthropic claude-opus-4-8 $5.00 $25.00 adaptive
Google gemini-3.5-flash $1.50 $9.00 disabled
DeepSeek deepseek-v4-pro $0.435 $0.87 enabled
Moonshot kimi-k2.6 $0.95 $4.00 enabled

Architecture

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   prompts.yaml  │────▶│   experiment    │────▶│     runner      │
│   config.yaml   │     │    builder      │     │   (async/await) │
└─────────────────┘     └─────────────────┘     └────────┬────────┘
                                                         │
            ┌────────────────────────────────────────────┼────────────┐
            │                                            │            │
            ▼                                            ▼            ▼
     ┌────────────┐                             ┌────────────┐  ┌──────────┐
     │ providers  │                             │  metrics   │  │  judge   │
     │ (5 APIs)   │                             │  (local)   │  │  (LLM)   │
     └────────────┘                             └────────────┘  └──────────┘
            │                                            │            │
            │         ┌──────────────┐                   │            │
            └────────▶│ results/raw/ │◀──────────────────┘◀───────────┘
                      │  *.json      │
                      └──────┬───────┘
                             │
                             ▼
                      ┌──────────────┐
                      │ normalization│
                      │  (usage data)│
                      └──────┬───────┘
                             │
            ┌────────────────┼────────────────┐
            │                │                │
            ▼                ▼                ▼
     ┌────────────┐  ┌────────────┐  ┌────────────┐
     │ normalized │  │   charts   │  │ comparison │
     │   .csv     │  │  (.json)   │  │  (W/T/L)   │
     └─────┬──────┘  └─────┬──────┘  └────────────┘
           │               │
           └───────┬───────┘
                   │
                   ▼
            ┌────────────┐
            │   report   │
            │   (.md)    │
            └─────┬──────┘
                  │
                  ▼
           ┌─────────────┐
           │inject_report│
           │  (.html)    │
           └─────────────┘

Data Flow

Input Files                          Generated Files
─────────────────────────────────    ─────────────────────────────────
prompts.yaml                         results/raw/<uuid>.json      (raw API responses)
config.yaml                          results/normalized.csv       (processed metrics)
filler.toml                          results/report.md            (markdown report)
.env                                 results/chart_data.json      (Chart.js data)
                                     results/judge_conclusion.txt (raw LLM judge output)
                                     static/index.html            (injected dashboard)

Quick Start

1. Install

uv sync

2. Configure API Keys

cp .env.example .env
# Edit .env and add your API keys

Required keys depend on which providers you configure in config.yaml:

  • OPENAI_API_KEY
  • ANTHROPIC_API_KEY
  • GEMINI_API_KEY
  • DEEPSEEK_API_KEY
  • MOONSHOT_API_KEY

3. Verify Setup

# Check all providers
uv run python -m benchmark.check --provider all

# Or check just one
uv run python -m benchmark.check --provider openai

# Skip API calls, test functions only
uv run python -m benchmark.check --provider all --skip-api

4. Run Benchmark

# Full benchmark (expensive - 500 API calls, ~$2.17)
uv run python -m benchmark.runner

# Generate report from results (includes judge analysis if enabled in config.yaml)
uv run python -m benchmark.report

# Inject report into website
uv run python -m benchmark.inject_report

5. View Results

Open static/index.html in a browser or serve it:

python -m http.server 8000 --directory web
# Navigate to http://localhost:8000

Workflow

The benchmark follows a 4-step pipeline:

┌─────────────┐    ┌──────────────┐    ┌──────────────┐    ┌─────────────────┐
│    check    │───▶│    runner    │───▶│    report    │───▶│ inject_report   │
│  (smoke)    │    │  (benchmark) │    │  (analyze)   │    │  (dashboard)    │
└─────────────┘    └──────────────┘    └──────────────┘    └─────────────────┘
       │                  │                  │                    │
       ▼                  ▼                  ▼                    ▼
  API connectivity   results/raw/      results/report.md    static/index.html
  Metric functions   *.json files      normalized.csv       (interactive
  Code extraction                     chart_data.json       charts + tables)
                                      judge_conclusion.txt

Step 1: Smoke Test

uv run python -m benchmark.check --provider all

Verifies API connectivity, response parsing, and metrics extraction without running the full benchmark.

Step 2: Run Benchmark

uv run python -m benchmark.runner

Executes all 500 API calls across 5 providers, 2 modes, and 25 prompts. Results saved to results/raw/.

Step 3: Generate Report

uv run python -m benchmark.report

Processes raw results into results/report.md with tables, statistics, and comparison analysis.

If judge.enabled: true in config.yaml, the report generation also:

  • Calls the configured judge LLM to analyze the full report
  • Saves raw judge response to results/judge_conclusion.txt
  • The judge prompt includes the full principles text for context

Step 4: Inject into Website

uv run python -m benchmark.inject_report

Converts the markdown report to HTML and injects it into static/index.html between REPORT_BLOCK_START and REPORT_BLOCK_END markers. Generates interactive Chart.js visualizations from results/chart_data.json.

Project Structure

llm-principles-benchmark/
├── src/benchmark/
│   ├── __init__.py
│   ├── check.py              # Smoke test script (moved from root)
│   ├── runner.py             # Main benchmark orchestration (~900 lines)
│   ├── providers.py          # API clients for 5 providers with reasoning params
│   ├── metrics.py            # Local metrics (filler, guessing, abstention, ASCII)
│   ├── code_tests.py         # Code extraction + AST validation + pytest runner
│   ├── judge.py              # LLM judge for subjective evaluation + report analysis
│   ├── pricing.py            # Cost calculation including reasoning tokens
│   ├── report.py             # Markdown report generation with judge conclusion
│   ├── charts.py             # Chart data generation for 5 chart types
│   ├── comparison.py         # WIN/TIE/LOSS scoring logic
│   ├── experiment.py         # Experiment matrix builder (mode × model × prompt)
│   ├── normalization.py      # Provider-specific usage normalization
│   ├── schemas.py            # Pydantic data models
│   ├── ui.py                 # Rich terminal dashboard components
│   └── inject_report.py      # Markdown-to-HTML injector with Chart.js
├── tests/
│   ├── conftest.py                   # Shared fixtures
│   ├── test_charts.py                # Chart data generation
│   ├── test_generated_code.py        # Code extraction and pytest runner
│   ├── test_hypothesis_properties.py # Property-based tests (Hypothesis)
│   ├── test_code_tests_llm.py        # LLM fallback code extraction
│   ├── test_metrics.py               # Filler, guessing, abstention detection
│   ├── test_pricing.py               # Cost calculation
│   ├── test_schemas.py               # Pydantic model validation
│   ├── test_providers.py             # Mocked API clients for 5 providers
│   ├── test_judge.py                 # Mocked LLM judge
│   ├── test_report.py                # Report generation
│   ├── test_runner.py                # Runner helper functions
│   ├── test_runner_async.py          # Async runner functions
│   ├── test_runner_integration.py    # File-based integration tests
│   ├── test_runner_edge_cases.py     # Edge cases and error handling
│   └── test_ui.py                    # Terminal UI components
├── static/
│   └── index.html              # Flat dashboard template (gets injected)
├── results/
│   ├── raw/                    # Raw API responses (*.json)
│   ├── normalized.csv          # Processed results
│   ├── report.md               # Generated markdown report
│   ├── chart_data.json         # Serialized chart data for dashboard
│   └── judge_conclusion.txt    # Raw LLM judge response with markers
├── docs/
│   └── PLAN.md                 # Full experimental design
├── config.yaml                 # Benchmark configuration
├── prompts.yaml                # 25 test prompts + principles
├── filler.toml                 # 100+ regex patterns for filler detection
├── .env.example                # Required environment variables
├── pyproject.toml              # Dependencies (uv)
└── main.py                     # Entry point stub

Configuration

Edit config.yaml:

concurrency: 5              # Parallel API calls
timeout_seconds: 120        # Request timeout
max_output_tokens: 2048     # Max tokens per response (accommodates reasoning mode)
temperature: 0              # Deterministic output (omitted for some providers)

execution_modes:
  - session_once            # Principles sent once at session start (Mode 2)
  - isolated_per_prompt     # Principles resent for each prompt (Mode 1)

judge:
  enabled: true             # Enable LLM judge for report analysis
  provider: moonshot        # Judge provider
  model: kimi-k2.6          # Judge model
  blind: true               # Randomized A/B position

models:
  - provider: openai
    model: gpt-5.5
    reasoning: high         # OpenAI reasoning effort (high/xhigh)
  - provider: anthropic
    model: claude-opus-4-8
    reasoning: adaptive     # Anthropic thinking mode (adaptive/enabled)
  - provider: gemini
    model: gemini-3.5-flash
    reasoning: disabled     # Gemini thoughts (disabled/includeThoughts)
  - provider: deepseek
    model: deepseek-v4-pro
    reasoning: enabled      # DeepSeek thinking (enabled/disabled)
  - provider: moonshot
    model: kimi-k2.6
    reasoning: enabled      # Moonshot reasoning (enabled/disabled)

Reasoning Configuration

Different providers expose reasoning/thinking tokens differently:

Provider Config Value Behavior
OpenAI high reasoning.effort: "high" (not xhigh — consumes all tokens)
Anthropic adaptive thinking.type: "adaptive"
Gemini disabled thinkingConfig: {includeThoughts: false} (tracks thoughtsTokenCount separately)
DeepSeek enabled Returns both content and reasoning_content
Moonshot enabled Returns both content and reasoning_content

When reasoning is enabled, the framework:

  1. Extracts reasoning tokens from provider-specific response fields
  2. Includes them in cost calculation (billed separately or as output tokens)
  3. Falls back to reasoning_content when content is empty

How It Works

Execution Modes

Mode 1: Isolated Per Prompt

  • Baseline: New session per prompt, no principles
  • Treatment: New session per prompt, principles included as first message
  • Measures: Direct effect per task, minimal cross-contamination
  • Higher input cost (principles repeated 25× per model)

Mode 2: Session Once

  • Baseline: 25 prompts in one session without principles
  • Treatment: Principles sent in first turn, then 25 prompts in same session
  • Measures: Instruction persistence, context degradation, conversation cost
  • Risk of context contamination in later turns

Prompt Categories

Category Prompts What It Tests
Conciseness P01-P05 Response brevity under constraints
Guessing P06-P10 Abstention when context is missing
Code P11-P15 Correctness via pytest validation
Rewrite P16-P20 Filler removal and tone improvement
Long Context P21-P25 Smallest useful action first

Metrics Collected

Objective Metrics:

  • visible_chars, visible_lines — Response length
  • visible_output_tokens — Token count from API
  • reasoning_tokens — Thinking tokens (when exposed by provider)
  • filler_count — Regex matches against 100+ patterns in filler.toml
  • ascii_violation — Non-ASCII character detection
  • guessing_violation — Heuristic invention detection
  • correct_abstention — Proper "I don't know" responses
  • closing_fluff — Generic endings detected
  • smallest_useful_action — Concrete first steps
  • cost_usd — Real API cost including reasoning tokens
  • latency_ms — Response time

Code Validation:

  • Deterministic extraction from markdown (extract_code_deterministic)
  • AST parsing validation (is_valid_python)
  • Pytest execution (P11-P15)

Subjective Evaluation:

  • LLM judge (judge_pair) compares baseline vs. principles blindly
  • Winner determined by conciseness, correctness, filler, guessing
  • Per-prompt WIN/TIE/LOSS scoring (calculate_comparisons)

Report Analysis:

  • Separate judge call (analyze_report) generates executive summary
  • Receives full benchmark summary + principles text as context
  • Saves raw response with BEGIN_CONCLUSION / END_CONCLUSION markers
  • Parsed at injection time into styled HTML card with disclaimer

Comparison Groups

Three mandatory comparisons:

  1. Baseline vs. Mode 1 (Isolated): Does principles help when repeated?
  2. Baseline vs. Mode 2 (Session): Does principles help when sent once?
  3. Mode 1 vs. Mode 2: Is once enough or is repetition better?

Important: Comparisons stay within the same execution mode — baseline isolated vs. principles isolated, baseline session vs. principles session. Never crosses modes.

Success Criteria

Principles "works" if vs. baseline:

  • Output tokens ↓ ≥ 20%
  • Thinking tokens ↓ or flat
  • Total billable tokens don't increase significantly
  • Filler ↓ ≥ 50%
  • Guessing ↓ on impossible prompts
  • Code pass rate doesn't degrade
  • ASCII violations ≈ 0
  • Cost and latency don't rise meaningfully

Web Dashboard

The generated static/index.html features:

Interactive Charts (Chart.js):

  • Mode 1 & 2 Scorecards — Side-by-side metric comparison cards
  • Baseline vs Principles by Model — Grouped bar charts with warm/cool color palette (orange = baseline, teal = principles)
  • Total Token Usage — Input/output/thinking breakdown per provider
  • Win/Tie/Loss — Stacked horizontal bars showing per-prompt comparisons
  • Improvement % Heatmaps — Color-coded tables (green = improvement, red = degradation) for both Isolated and Session modes
  • Comparison Tables — Baseline vs Mode 1, Baseline vs Mode 2, Mode 1 vs Mode 2

Design:

  • Flat design with Fira Code monospace font
  • 0 border-radius (no rounded corners)
  • Dark header with accent colors
  • Responsive card grids replacing tables where appropriate

Content:

  • Judge conclusion card with 💡 icon and disclaimer
  • Methodology explanation cards
  • Copy button for principles text
  • Category tabs for prompt browsing

Smoke Testing

Before running the full benchmark, verify everything works:

# Test all functions and API connectivity
uv run python -m benchmark.check --provider all

# Skip API calls, test functions only
uv run python -m benchmark.check --provider all --skip-api

# Test specific provider
uv run python -m benchmark.check --provider gemini

Development

Running Tests

All tests (~55 tests, ~5 seconds):

uv run pytest tests/ -v

Run a specific test file:

uv run pytest tests/test_providers.py -v
uv run pytest tests/test_runner.py -v

With coverage report:

uv run pytest tests/ --cov=benchmark --cov-report=term-missing

Coverage threshold is 90%. Configured in .coveragerc.

Test Suite

All API calls are mocked via unittest.mock and pytest-asyncio. No real HTTP requests are made during tests, and no API keys or environment variables are required.

Test File Tests What It Covers
test_charts.py ~15 Chart data generation, aggregation logic
test_generated_code.py 19 Code extraction from markdown, AST validation, pytest runner
test_hypothesis_properties.py 22 Property-based tests: invariants, edge cases, fuzzing
test_metrics.py 27 Filler detection, guessing heuristics, abstention, ASCII checks
test_pricing.py 7 Cost calculation per model, hypothesis property testing
test_schemas.py 13 Pydantic model validation, hypothesis fuzzing
test_providers.py 20 Mocked HTTP calls for all 5 LLM providers + retry logic
test_judge.py 13 Mocked provider calls for LLM judge, text extraction
test_report.py 5 Report generation with mocked pandas DataFrames
test_runner.py 27 Runner functions: messages, usage normalization, comparisons
test_runner_async.py 13 Async runner functions with mocked clients
test_runner_integration.py 5 File-based code tests & judge with temp directories
test_runner_edge_cases.py 6 Exception handling, provider filtering, comparison edge cases
test_code_tests_llm.py 6 Async LLM code extraction fallback
test_ui.py 27 Dashboard, tables, panels, status badges

Test Design

  • Unit tests: Fast, isolated, no I/O. Most tests run in < 10ms.
  • Async tests: pytest-asyncio (auto mode) with mocked AsyncClient. All provider calls are mocked with AsyncMock.
  • Integration tests: Temporary files and directories (tmp_path fixture), no network access.
  • Property tests: Hypothesis generates random inputs to verify invariants across 100+ examples per test.
  • Coverage: Configured in .coveragerc with 90% fail-under threshold.

Hypothesis Property Tests

Property-based tests verify invariants across hundreds of generated examples:

# Run only hypothesis tests
uv run pytest tests/test_hypothesis_properties.py -v

# Run with verbose output showing generated examples
uv run pytest tests/test_hypothesis_properties.py -v --hypothesis-verbosity=verbose

# Run with specific seed for reproducibility
uv run pytest tests/test_hypothesis_properties.py -v --hypothesis-seed=12345

Properties tested:

  • strip_repl_prompts is idempotent
  • is_valid_python never crashes on any input
  • calculate_cost always returns non-negative
  • visible_metrics always returns dict with required keys
  • detect_guessing returns bool for all inputs
  • build_messages returns correct message counts
  • calculate_comparisons always produces valid outcomes (WIN/LOSS/TIE)

Adding Tests

When adding a new provider:

  1. Add mocked test in test_providers.py
  2. Mock env vars with patch.dict("os.environ", ...)
  3. Assert on call arguments (URL, headers, payload)

When adding runner logic:

  1. Add sync tests in test_runner.py
  2. Add async tests in test_runner_async.py
  3. Add integration tests in test_runner_integration.py if files are involved
  4. Add property tests in test_hypothesis_properties.py if there are invariants to verify

Linting

uv run ruff check src/ tests/
uv run ruff check src/benchmark/check.py

Adding a New Provider

  1. Add API function to src/benchmark/providers.py
  2. Register in PROVIDER_CALLS dict
  3. Add pricing to PRICING dict
  4. Add to config.yaml models list with reasoning config
  5. Add env var to .env.example
  6. Add usage normalization in src/benchmark/normalization.py
  7. Add mocked test in tests/test_providers.py

Adding a New Prompt

  1. Add to prompts.yaml with id, category, text
  2. For code prompts, add expected_symbol and test template to code_tests.py
  3. Update judge_required() in judge.py if needed

Cost Estimate

Full run (2 modes × 5 models × 25 prompts × 2 conditions):

Mode Input Tokens Output Tokens Estimated Cost
Session Once ~9,475 15,000 ~$1.06
Isolated ~30,955 15,000 ~$1.11
Total ~40,430 30,000 ~$2.17

Actual cost calculated from API usage metadata including reasoning tokens.

Limitations

  • 25 prompts show trends, not universal proof
  • Single run has variance; multiple runs recommended for publication
  • Models change without notice
  • APIs report usage differently (especially reasoning token separation)
  • Semantic evaluation depends on judge model choice
  • Session mode risks context contamination in later turns
  • Isolated mode increases input cost (principles repeated)
  • Some providers don't expose reasoning token counts separately

License

MIT

References

See docs/PLAN.md for full experimental design, methodology, and analysis plan.