No description

Find a file

Rodolfo De Nadai 4f877c9dea fix: repo path, remove pages (which didnt work)		2026-06-01 00:55:28 -03:00
docs	first commit	2026-06-01 00:26:57 -03:00
results	first commit	2026-06-01 00:26:57 -03:00
src/benchmark	first commit	2026-06-01 00:26:57 -03:00
static	fix: repo path, remove pages (which didnt work)	2026-06-01 00:55:28 -03:00
tests	first commit	2026-06-01 00:26:57 -03:00
.coveragerc	first commit	2026-06-01 00:26:57 -03:00
.env.example	first commit	2026-06-01 00:26:57 -03:00
.gitignore	first commit	2026-06-01 00:26:57 -03:00
config.yaml	first commit	2026-06-01 00:26:57 -03:00
filler.toml	first commit	2026-06-01 00:26:57 -03:00
main.py	first commit	2026-06-01 00:26:57 -03:00
prompts.yaml	first commit	2026-06-01 00:26:57 -03:00
pyproject.toml	first commit	2026-06-01 00:26:57 -03:00
README.md	first commit	2026-06-01 00:26:57 -03:00

README.md

LLM Principles Benchmark

A framework for evaluating whether principles prompts measurably improve LLM behavior across conciseness, filler reduction, guessing abstention, and code correctness.

Why This Exists

Most LLM interactions suffer from:

Sycophantic openers: "Sure! I'd be happy to help..."
Verbose hedging: "Probably, maybe, it seems like..."
Unnecessary exploration: Investigating beyond what's needed
Guessing without context: Inventing data when information is missing
Unicode bloat: Smart quotes, em dashes, non-ASCII characters

This benchmark tests whether a compact principles prompt (14 rules, ~179 tokens) can reduce these behaviors without degrading technical correctness.

Hypotheses Tested

Hypothesis	Expected Effect
H1	Response length decreases ≥ 20%
H2	Filler patterns decrease ≥ 50%
H3	Correct abstention increases when context is missing
H4	Unnecessary exploration decreases
H5	Technical correctness does not degrade
H6	Thinking/reasoning tokens decrease or stay flat
H7	Effect differs between session persistence vs. per-prompt injection

Supported Providers

Provider	Model	Input $/1M	Output $/1M	Reasoning
OpenAI	gpt-5.5	$5.00	$30.00	high
Anthropic	claude-opus-4-8	$5.00	$25.00	adaptive
Google	gemini-3.5-flash	$1.50	$9.00	disabled
DeepSeek	deepseek-v4-pro	$0.435	$0.87	enabled
Moonshot	kimi-k2.6	$0.95	$4.00	enabled

Architecture

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   prompts.yaml  │────▶│   experiment    │────▶│     runner      │
│   config.yaml   │     │    builder      │     │   (async/await) │
└─────────────────┘     └─────────────────┘     └────────┬────────┘
                                                         │
            ┌────────────────────────────────────────────┼────────────┐
            │                                            │            │
            ▼                                            ▼            ▼
     ┌────────────┐                             ┌────────────┐  ┌──────────┐
     │ providers  │                             │  metrics   │  │  judge   │
     │ (5 APIs)   │                             │  (local)   │  │  (LLM)   │
     └────────────┘                             └────────────┘  └──────────┘
            │                                            │            │
            │         ┌──────────────┐                   │            │
            └────────▶│ results/raw/ │◀──────────────────┘◀───────────┘
                      │  *.json      │
                      └──────┬───────┘
                             │
                             ▼
                      ┌──────────────┐
                      │ normalization│
                      │  (usage data)│
                      └──────┬───────┘
                             │
            ┌────────────────┼────────────────┐
            │                │                │
            ▼                ▼                ▼
     ┌────────────┐  ┌────────────┐  ┌────────────┐
     │ normalized │  │   charts   │  │ comparison │
     │   .csv     │  │  (.json)   │  │  (W/T/L)   │
     └─────┬──────┘  └─────┬──────┘  └────────────┘
           │               │
           └───────┬───────┘
                   │
                   ▼
            ┌────────────┐
            │   report   │
            │   (.md)    │
            └─────┬──────┘
                  │
                  ▼
           ┌─────────────┐
           │inject_report│
           │  (.html)    │
           └─────────────┘

Data Flow

Input Files                          Generated Files
─────────────────────────────────    ─────────────────────────────────
prompts.yaml                         results/raw/<uuid>.json      (raw API responses)
config.yaml                          results/normalized.csv       (processed metrics)
filler.toml                          results/report.md            (markdown report)
.env                                 results/chart_data.json      (Chart.js data)
                                     results/judge_conclusion.txt (raw LLM judge output)
                                     static/index.html            (injected dashboard)

Quick Start

1. Install

uv sync

2. Configure API Keys

cp .env.example .env
# Edit .env and add your API keys

Required keys depend on which providers you configure in config.yaml:

OPENAI_API_KEY
ANTHROPIC_API_KEY
GEMINI_API_KEY
DEEPSEEK_API_KEY
MOONSHOT_API_KEY

3. Verify Setup

# Check all providers
uv run python -m benchmark.check --provider all

# Or check just one
uv run python -m benchmark.check --provider openai

# Skip API calls, test functions only
uv run python -m benchmark.check --provider all --skip-api

4. Run Benchmark

# Full benchmark (expensive - 500 API calls, ~$2.17)
uv run python -m benchmark.runner

# Generate report from results (includes judge analysis if enabled in config.yaml)
uv run python -m benchmark.report

# Inject report into website
uv run python -m benchmark.inject_report

5. View Results

Open static/index.html in a browser or serve it:

python -m http.server 8000 --directory web
# Navigate to http://localhost:8000

Workflow

The benchmark follows a 4-step pipeline:

┌─────────────┐    ┌──────────────┐    ┌──────────────┐    ┌─────────────────┐
│    check    │───▶│    runner    │───▶│    report    │───▶│ inject_report   │
│  (smoke)    │    │  (benchmark) │    │  (analyze)   │    │  (dashboard)    │
└─────────────┘    └──────────────┘    └──────────────┘    └─────────────────┘
       │                  │                  │                    │
       ▼                  ▼                  ▼                    ▼
  API connectivity   results/raw/      results/report.md    static/index.html
  Metric functions   *.json files      normalized.csv       (interactive
  Code extraction                     chart_data.json       charts + tables)
                                      judge_conclusion.txt

Step 1: Smoke Test

uv run python -m benchmark.check --provider all

Verifies API connectivity, response parsing, and metrics extraction without running the full benchmark.

Step 2: Run Benchmark

uv run python -m benchmark.runner

Executes all 500 API calls across 5 providers, 2 modes, and 25 prompts. Results saved to results/raw/.

Step 3: Generate Report

uv run python -m benchmark.report

Processes raw results into results/report.md with tables, statistics, and comparison analysis.

If judge.enabled: true in config.yaml, the report generation also:

Calls the configured judge LLM to analyze the full report
Saves raw judge response to results/judge_conclusion.txt
The judge prompt includes the full principles text for context

Step 4: Inject into Website

uv run python -m benchmark.inject_report

Converts the markdown report to HTML and injects it into static/index.html between REPORT_BLOCK_START and REPORT_BLOCK_END markers. Generates interactive Chart.js visualizations from results/chart_data.json.

Project Structure

llm-principles-benchmark/
├── src/benchmark/
│   ├── __init__.py
│   ├── check.py              # Smoke test script (moved from root)
│   ├── runner.py             # Main benchmark orchestration (~900 lines)
│   ├── providers.py          # API clients for 5 providers with reasoning params
│   ├── metrics.py            # Local metrics (filler, guessing, abstention, ASCII)
│   ├── code_tests.py         # Code extraction + AST validation + pytest runner
│   ├── judge.py              # LLM judge for subjective evaluation + report analysis
│   ├── pricing.py            # Cost calculation including reasoning tokens
│   ├── report.py             # Markdown report generation with judge conclusion
│   ├── charts.py             # Chart data generation for 5 chart types
│   ├── comparison.py         # WIN/TIE/LOSS scoring logic
│   ├── experiment.py         # Experiment matrix builder (mode × model × prompt)
│   ├── normalization.py      # Provider-specific usage normalization
│   ├── schemas.py            # Pydantic data models
│   ├── ui.py                 # Rich terminal dashboard components
│   └── inject_report.py      # Markdown-to-HTML injector with Chart.js
├── tests/
│   ├── conftest.py                   # Shared fixtures
│   ├── test_charts.py                # Chart data generation
│   ├── test_generated_code.py        # Code extraction and pytest runner
│   ├── test_hypothesis_properties.py # Property-based tests (Hypothesis)
│   ├── test_code_tests_llm.py        # LLM fallback code extraction
│   ├── test_metrics.py               # Filler, guessing, abstention detection
│   ├── test_pricing.py               # Cost calculation
│   ├── test_schemas.py               # Pydantic model validation
│   ├── test_providers.py             # Mocked API clients for 5 providers
│   ├── test_judge.py                 # Mocked LLM judge
│   ├── test_report.py                # Report generation
│   ├── test_runner.py                # Runner helper functions
│   ├── test_runner_async.py          # Async runner functions
│   ├── test_runner_integration.py    # File-based integration tests
│   ├── test_runner_edge_cases.py     # Edge cases and error handling
│   └── test_ui.py                    # Terminal UI components
├── static/
│   └── index.html              # Flat dashboard template (gets injected)
├── results/
│   ├── raw/                    # Raw API responses (*.json)
│   ├── normalized.csv          # Processed results
│   ├── report.md               # Generated markdown report
│   ├── chart_data.json         # Serialized chart data for dashboard
│   └── judge_conclusion.txt    # Raw LLM judge response with markers
├── docs/
│   └── PLAN.md                 # Full experimental design
├── config.yaml                 # Benchmark configuration
├── prompts.yaml                # 25 test prompts + principles
├── filler.toml                 # 100+ regex patterns for filler detection
├── .env.example                # Required environment variables
├── pyproject.toml              # Dependencies (uv)
└── main.py                     # Entry point stub

Configuration

Edit config.yaml:

concurrency: 5              # Parallel API calls
timeout_seconds: 120        # Request timeout
max_output_tokens: 2048     # Max tokens per response (accommodates reasoning mode)
temperature: 0              # Deterministic output (omitted for some providers)

execution_modes:
  - session_once            # Principles sent once at session start (Mode 2)
  - isolated_per_prompt     # Principles resent for each prompt (Mode 1)

judge:
  enabled: true             # Enable LLM judge for report analysis
  provider: moonshot        # Judge provider
  model: kimi-k2.6          # Judge model
  blind: true               # Randomized A/B position

models:
  - provider: openai
    model: gpt-5.5
    reasoning: high         # OpenAI reasoning effort (high/xhigh)
  - provider: anthropic
    model: claude-opus-4-8
    reasoning: adaptive     # Anthropic thinking mode (adaptive/enabled)
  - provider: gemini
    model: gemini-3.5-flash
    reasoning: disabled     # Gemini thoughts (disabled/includeThoughts)
  - provider: deepseek
    model: deepseek-v4-pro
    reasoning: enabled      # DeepSeek thinking (enabled/disabled)
  - provider: moonshot
    model: kimi-k2.6
    reasoning: enabled      # Moonshot reasoning (enabled/disabled)

Reasoning Configuration

Different providers expose reasoning/thinking tokens differently:

Provider	Config Value	Behavior
OpenAI	`high`	`reasoning.effort: "high"` (not `xhigh` — consumes all tokens)
Anthropic	`adaptive`	`thinking.type: "adaptive"`
Gemini	`disabled`	`thinkingConfig: {includeThoughts: false}` (tracks thoughtsTokenCount separately)
DeepSeek	`enabled`	Returns both `content` and `reasoning_content`
Moonshot	`enabled`	Returns both `content` and `reasoning_content`

When reasoning is enabled, the framework:

Extracts reasoning tokens from provider-specific response fields
Includes them in cost calculation (billed separately or as output tokens)
Falls back to reasoning_content when content is empty

How It Works

Execution Modes

Mode 1: Isolated Per Prompt

Baseline: New session per prompt, no principles
Treatment: New session per prompt, principles included as first message
Measures: Direct effect per task, minimal cross-contamination
Higher input cost (principles repeated 25× per model)

Mode 2: Session Once

Baseline: 25 prompts in one session without principles
Treatment: Principles sent in first turn, then 25 prompts in same session
Measures: Instruction persistence, context degradation, conversation cost
Risk of context contamination in later turns

Prompt Categories

Category	Prompts	What It Tests
Conciseness	P01-P05	Response brevity under constraints
Guessing	P06-P10	Abstention when context is missing
Code	P11-P15	Correctness via pytest validation
Rewrite	P16-P20	Filler removal and tone improvement
Long Context	P21-P25	Smallest useful action first

Metrics Collected

Objective Metrics:

visible_chars, visible_lines — Response length
visible_output_tokens — Token count from API
reasoning_tokens — Thinking tokens (when exposed by provider)
filler_count — Regex matches against 100+ patterns in filler.toml
ascii_violation — Non-ASCII character detection
guessing_violation — Heuristic invention detection
correct_abstention — Proper "I don't know" responses
closing_fluff — Generic endings detected
smallest_useful_action — Concrete first steps
cost_usd — Real API cost including reasoning tokens
latency_ms — Response time

Code Validation:

Deterministic extraction from markdown (extract_code_deterministic)
AST parsing validation (is_valid_python)
Pytest execution (P11-P15)

Subjective Evaluation:

LLM judge (judge_pair) compares baseline vs. principles blindly
Winner determined by conciseness, correctness, filler, guessing
Per-prompt WIN/TIE/LOSS scoring (calculate_comparisons)

Report Analysis:

Separate judge call (analyze_report) generates executive summary
Receives full benchmark summary + principles text as context
Saves raw response with BEGIN_CONCLUSION / END_CONCLUSION markers
Parsed at injection time into styled HTML card with disclaimer

Comparison Groups

Three mandatory comparisons:

Baseline vs. Mode 1 (Isolated): Does principles help when repeated?
Baseline vs. Mode 2 (Session): Does principles help when sent once?
Mode 1 vs. Mode 2: Is once enough or is repetition better?

Important: Comparisons stay within the same execution mode — baseline isolated vs. principles isolated, baseline session vs. principles session. Never crosses modes.

Success Criteria

Principles "works" if vs. baseline:

Output tokens ↓ ≥ 20%
Thinking tokens ↓ or flat
Total billable tokens don't increase significantly
Filler ↓ ≥ 50%
Guessing ↓ on impossible prompts
Code pass rate doesn't degrade
ASCII violations ≈ 0
Cost and latency don't rise meaningfully

Web Dashboard

The generated static/index.html features:

Interactive Charts (Chart.js):

Mode 1 & 2 Scorecards — Side-by-side metric comparison cards
Baseline vs Principles by Model — Grouped bar charts with warm/cool color palette (orange = baseline, teal = principles)
Total Token Usage — Input/output/thinking breakdown per provider
Win/Tie/Loss — Stacked horizontal bars showing per-prompt comparisons
Improvement % Heatmaps — Color-coded tables (green = improvement, red = degradation) for both Isolated and Session modes
Comparison Tables — Baseline vs Mode 1, Baseline vs Mode 2, Mode 1 vs Mode 2

Design:

Flat design with Fira Code monospace font
0 border-radius (no rounded corners)
Dark header with accent colors
Responsive card grids replacing tables where appropriate

Content:

Judge conclusion card with 💡 icon and disclaimer
Methodology explanation cards
Copy button for principles text
Category tabs for prompt browsing

Smoke Testing

Before running the full benchmark, verify everything works:

# Test all functions and API connectivity
uv run python -m benchmark.check --provider all

# Skip API calls, test functions only
uv run python -m benchmark.check --provider all --skip-api

# Test specific provider
uv run python -m benchmark.check --provider gemini

Development

Running Tests

All tests (~55 tests, ~5 seconds):

uv run pytest tests/ -v

Run a specific test file:

uv run pytest tests/test_providers.py -v
uv run pytest tests/test_runner.py -v

With coverage report:

uv run pytest tests/ --cov=benchmark --cov-report=term-missing

Coverage threshold is 90%. Configured in .coveragerc.

Test Suite

All API calls are mocked via unittest.mock and pytest-asyncio. No real HTTP requests are made during tests, and no API keys or environment variables are required.

Test File	Tests	What It Covers
`test_charts.py`	~15	Chart data generation, aggregation logic
`test_generated_code.py`	19	Code extraction from markdown, AST validation, pytest runner
`test_hypothesis_properties.py`	22	Property-based tests: invariants, edge cases, fuzzing
`test_metrics.py`	27	Filler detection, guessing heuristics, abstention, ASCII checks
`test_pricing.py`	7	Cost calculation per model, hypothesis property testing
`test_schemas.py`	13	Pydantic model validation, hypothesis fuzzing
`test_providers.py`	20	Mocked HTTP calls for all 5 LLM providers + retry logic
`test_judge.py`	13	Mocked provider calls for LLM judge, text extraction
`test_report.py`	5	Report generation with mocked pandas DataFrames
`test_runner.py`	27	Runner functions: messages, usage normalization, comparisons
`test_runner_async.py`	13	Async runner functions with mocked clients
`test_runner_integration.py`	5	File-based code tests & judge with temp directories
`test_runner_edge_cases.py`	6	Exception handling, provider filtering, comparison edge cases
`test_code_tests_llm.py`	6	Async LLM code extraction fallback
`test_ui.py`	27	Dashboard, tables, panels, status badges

Test Design

Unit tests: Fast, isolated, no I/O. Most tests run in < 10ms.
Async tests: pytest-asyncio (auto mode) with mocked AsyncClient. All provider calls are mocked with AsyncMock.
Integration tests: Temporary files and directories (tmp_path fixture), no network access.
Property tests: Hypothesis generates random inputs to verify invariants across 100+ examples per test.
Coverage: Configured in .coveragerc with 90% fail-under threshold.

Hypothesis Property Tests

Property-based tests verify invariants across hundreds of generated examples:

# Run only hypothesis tests
uv run pytest tests/test_hypothesis_properties.py -v

# Run with verbose output showing generated examples
uv run pytest tests/test_hypothesis_properties.py -v --hypothesis-verbosity=verbose

# Run with specific seed for reproducibility
uv run pytest tests/test_hypothesis_properties.py -v --hypothesis-seed=12345

Properties tested:

strip_repl_prompts is idempotent
is_valid_python never crashes on any input
calculate_cost always returns non-negative
visible_metrics always returns dict with required keys
detect_guessing returns bool for all inputs
build_messages returns correct message counts
calculate_comparisons always produces valid outcomes (WIN/LOSS/TIE)

Adding Tests

When adding a new provider:

Add mocked test in test_providers.py
Mock env vars with patch.dict("os.environ", ...)
Assert on call arguments (URL, headers, payload)

When adding runner logic:

Add sync tests in test_runner.py
Add async tests in test_runner_async.py
Add integration tests in test_runner_integration.py if files are involved
Add property tests in test_hypothesis_properties.py if there are invariants to verify

Linting

uv run ruff check src/ tests/
uv run ruff check src/benchmark/check.py

Adding a New Provider

Add API function to src/benchmark/providers.py
Register in PROVIDER_CALLS dict
Add pricing to PRICING dict
Add to config.yaml models list with reasoning config
Add env var to .env.example
Add usage normalization in src/benchmark/normalization.py
Add mocked test in tests/test_providers.py

Adding a New Prompt

Add to prompts.yaml with id, category, text
For code prompts, add expected_symbol and test template to code_tests.py
Update judge_required() in judge.py if needed

Cost Estimate

Full run (2 modes × 5 models × 25 prompts × 2 conditions):

Mode	Input Tokens	Output Tokens	Estimated Cost
Session Once	~9,475	15,000	~$1.06
Isolated	~30,955	15,000	~$1.11
Total	~40,430	30,000	~$2.17

Actual cost calculated from API usage metadata including reasoning tokens.

Limitations

25 prompts show trends, not universal proof
Single run has variance; multiple runs recommended for publication
Models change without notice
APIs report usage differently (especially reasoning token separation)
Semantic evaluation depends on judge model choice
Session mode risks context contamination in later turns
Isolated mode increases input cost (principles repeated)
Some providers don't expose reasoning token counts separately

License

MIT

References

See docs/PLAN.md for full experimental design, methodology, and analysis plan.

README.md Unescape Escape