sleepy

Campaign evidence

The campaign ledger

Sleepy is not an AI coding assistant. It is an autonomous software optimization campaign, and this page is the lab notebook: pinned repositories, exact benchmark commands, validation checks, rejected candidates, open PRs, caveats, and review outcomes.

Public ledger, checked July 5, 2026 UTC

  • 9 campaigns recorded
  • 9 public repositories
  • 9 verified optimizations
  • 9 open upstream PRs
  • 0 merged PRs recorded so far
  • 30.9% median displayed benchmark improvement
  • 8 visible campaign lessons
  • 20x benchmark sample count for final claims

Every number above is an index into public artifacts on this page. Maintainer acceptance is tracked separately from local validation; the current PR states were checked on July 5, 2026 UTC.
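The 30.9% median can be re-derived from the page itself. The sketch below assumes the median is taken over the nine headline deltas shown on the campaign cards; the function and variable names are illustrative and not part of Sleepy.

```go
package main

import (
	"fmt"
	"sort"
)

// headlineImprovements holds the headline percentage reduction from each
// campaign card, in page order (assumed to be the inputs to the median).
var headlineImprovements = []float64{72.8, 14.97, 6.91, 42.41, 21.52, 58.63, 58.22, 30.90, 11.51}

// median returns the middle value of xs, or the mean of the two middle
// values when len(xs) is even. It sorts a copy so the input is not reordered.
func median(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	s := append([]float64(nil), xs...)
	sort.Float64s(s)
	n := len(s)
	if n%2 == 1 {
		return s[n/2]
	}
	return (s[n/2-1] + s[n/2]) / 2
}

func main() {
	// prints "median headline improvement: 30.9%"
	fmt.Printf("median headline improvement: %.1f%%\n", median(headlineImprovements))
}
```

With nine cards the median is simply the fifth value after sorting, which is the buger/jsonparser result.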

Campaigns

Small diffs, measured wins, published caveats.

These cards show the PR-ready artifacts, not every generated candidate. Several raw candidates were rejected because they changed behavior, regressed a sub-benchmark, or became too large to submit upstream.

winnow-rs/winnow: non-SIMD slice search

PR opened · Rust parser · Criterion bench
-72.8% find_slice/slice/medium
  • slice/medium: 3.1145us -> 846.90ns
  • slice/large: 2.5519ms -> 246.87us
  • byte cases: unchanged path

Validation

rustfmt --check src/stream/mod.rs, cargo test, cargo test --features simd, and a temporary oracle comparing public FindSlice behavior to a naive byte search and str::find across fixed and generated small cases.

Caveat

The claim is scoped to the non-SIMD multi-byte fallback. The local mixed-model run was stopped after generation 0 because the Rust cargobench: loop was projected to need roughly two more hours. z.ai GLM-5.2 participated, but its crossover candidates were wrapped in prose or code fences and were therefore invalid.

ohler55/ojg: JSONPath parse token allocation

PR opened · Go JSONPath parser · 10 allocs/op
-14.97% BenchmarkParse
  • BenchmarkParse: 175.7n -> 149.4n
  • B/op: 374 -> 326
  • allocs/op: 12 -> 10

Validation

go test -count=1 ./..., 20-count BenchmarkParse benchstat, and a temporary oracle against the previous parser over 10,058 fixed and deterministic randomized JSONPath strings/byte strings comparing String(), BracketString(), and error text.

Caveat

The claim is scoped to jp.Parse. Hosted Sleepy stalled after one failed candidate; a local mixed-model run produced the useful direction but was stopped after generation 0 because the remaining search was projected at about 2.5 hours with mostly invalid candidates and provider failures.

fxamacker/cbor: map-to-struct fallback matching

PR opened · Go CBOR decoder · 0 alloc delta
-6.91% geomean ns/op
  • default unknown fields: -18.21%
  • default unknown duplicate: -16.91%
  • reject duplicate unknown: -12.88%

Validation

go test -count=1 ./..., 20-count BenchmarkUnmarshalMapToStruct matrix, and a temporary oracle comparing the byte matcher to the previous strings.EqualFold helper over fixed ASCII, Unicode, invalid-UTF-8, and 10,000 deterministic random byte-string cases.

Caveat

The claim is scoped to map-to-struct decoding. The hosted mixed-model run produced invalid candidates only; the submitted patch came from a local mixed-model follow-up, with z.ai sampled but not responsible for the final valid candidate.
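A minimal sketch of the oracle pattern described in this card's validation, under stated assumptions: `equalFoldBytes` is a hypothetical stand-in for a byte matcher and is not the submitted patch, and the fixed cases, alphabet, and seed are illustrative. The only library behavior relied on is `strings.EqualFold`, which serves as the reference.

```go
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// isASCII reports whether every byte in p is below 0x80.
func isASCII(p []byte) bool {
	for _, c := range p {
		if c >= 0x80 {
			return false
		}
	}
	return true
}

// asciiLower lowercases one ASCII letter and leaves every other byte unchanged.
func asciiLower(c byte) byte {
	if 'A' <= c && c <= 'Z' {
		return c + ('a' - 'A')
	}
	return c
}

// equalFoldBytes is a hypothetical candidate: it compares pure-ASCII inputs
// byte by byte and defers to strings.EqualFold whenever either side contains
// a non-ASCII byte (Unicode folding or invalid UTF-8).
func equalFoldBytes(a, b []byte) bool {
	if !isASCII(a) || !isASCII(b) {
		return strings.EqualFold(string(a), string(b))
	}
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if asciiLower(a[i]) != asciiLower(b[i]) {
			return false
		}
	}
	return true
}

// runOracle compares the candidate with strings.EqualFold on fixed cases plus
// n seeded random byte-string pairs. It returns the number of cases checked
// and a description of the first disagreement ("" if there was none).
func runOracle(seed int64, n int) (int, string) {
	cases := [][2]string{
		{"Name", "name"},
		{"K", "\u212a"}, // Kelvin sign folds to ASCII 'k'
		{"stra\u00dfe", "STRASSE"},
		{"\xff\xfe", "\xff\xfe"}, // invalid UTF-8
		{"", ""},
	}
	rng := rand.New(rand.NewSource(seed)) // fixed seed keeps the run deterministic
	alphabet := []byte("aAkKsS_@`\x7f\xc5\xbf\xe2\x84\xaa\xff")
	randBytes := func() string {
		p := make([]byte, rng.Intn(4))
		for i := range p {
			p[i] = alphabet[rng.Intn(len(alphabet))]
		}
		return string(p)
	}
	for i := 0; i < n; i++ {
		cases = append(cases, [2]string{randBytes(), randBytes()})
	}
	for i, c := range cases {
		want := strings.EqualFold(c[0], c[1])
		if got := equalFoldBytes([]byte(c[0]), []byte(c[1])); got != want {
			return i, fmt.Sprintf("%q vs %q: got %v, want %v", c[0], c[1], got, want)
		}
	}
	return len(cases), ""
}

func main() {
	checked, mismatch := runOracle(1, 10000)
	if mismatch != "" {
		fmt.Println("mismatch:", mismatch)
		return
	}
	fmt.Println("oracle agreed on", checked, "cases")
}
```

The small alphabet is deliberate: short strings drawn from a few letters, punctuation neighbors of the letter ranges, and fragments of multi-byte runes produce far more equal and near-equal pairs than uniformly random bytes would.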

agnivade/levenshtein: ASCII distance fast path

PR opened · Go edit distance · ASCII fast path
-42.41% geomean ns/op
  • ASCII / Long lead: -27.72% / -64.89%
  • Long middle / trail: -56.21% / -40.40%
  • Long diff: -4.31%, 720 B -> 144 B

Validation

go test -count=1 ./..., a temporary independent rune-based oracle over fixed ASCII, Unicode, invalid-UTF-8, and 10,000 deterministic random string pairs, repeated ASCII benchmarks, and Unicode control benchmarks.

Caveat

The claim is scoped to existing BenchmarkSimple ASCII and long-string cases. A much faster local Sleepy candidate used unsafe and a Myers rewrite, but it was rejected as too large for a focused upstream PR.

BurntSushi/toml: faster Key.String formatting

PR opened · Go TOML parser · 48 B/op
-21.52% BenchmarkKey
  • Key.String: 38.72n -> 30.39n
  • B/op: 64 -> 48
  • allocs/op: 1 -> 1

Validation

go test -count=1 ./..., plus a temporary differential test against the previous implementation over fixed edge cases, empty keys, Unicode, invalid UTF-8 bytes, and 5,000 deterministic random keys.

Caveat

The claim is limited to the existing BenchmarkKey helper benchmark. The hosted run was cut short when the engine entered the cancelled state after a test-failing candidate, so the submitted patch rests on local validation and repeated benchmarks.

segmentio/encoding: ISO8601 Flexible validation

PR opened · Go validator · 0 allocs/op
-58.63% geomean ns/op
  • Validate/success: -55.55%
  • Validate/failure: -61.50%
  • Geomean: 12.71n -> 5.257n

Validation

go test -count=1 ./..., then differential testing against the previous implementation across all 32 flag combinations and 50,000 deterministic random strings.

Caveat

The claim is limited to the existing BenchmarkValidate/(success|failure) Flexible cases, not broad package throughput.

tidwall/match: exact literal fast path

PR opened · Go matcher · 0 allocs/op
-58.22% geomean ns/op
  • BenchmarkAscii: -57.27%
  • BenchmarkUnicode: -59.15%
  • Geomean: 10.61n -> 4.431n

Validation

go test -count=1 ./... plus a temporary differential harness over random strings, patterns, escapes, Unicode, invalid bytes, and MatchLimit variants.

Caveat

The final claim is exact-literal matching only. The raw Sleepy candidate improved ASCII but regressed Unicode, so it was simplified before submission.

Links

PR #5 and public Gist.

buger/jsonparser: short parseInt path

PR opened · Go parser helper · 0 allocs/op
-30.90% BenchmarkParseInt
  • ParseInt: 3.129n -> 2.163n
  • B/op: 0 -> 0
  • allocs/op: 0 -> 0

Validation

go test -count=1 ./... plus differential testing against the previous implementation over fixed edge cases and 10,000 deterministic random byte slices.

Caveat

The benchmark parses the short input "123". Faster raw candidates were rejected after changing edge behavior or expanding into large manual unrolling.
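To illustrate why a benchmark on "123" alone cannot gate a candidate, here is a hedged sketch of a differential check. Nothing in it is jsonparser code: `parseIntRef` (a `strconv.ParseInt` wrapper) stands in for the previous implementation, `parseIntFast` is a hypothetical short-input fast path, and `parseIntBuggy` is a deliberately broken variant that accepts a lone "-".

```go
package main

import (
	"fmt"
	"math/rand"
	"strconv"
)

// parseIntRef is the reference for this sketch: base-10 strconv parsing with
// every error collapsed to (0, false).
func parseIntRef(b []byte) (int64, bool) {
	v, err := strconv.ParseInt(string(b), 10, 64)
	if err != nil {
		return 0, false
	}
	return v, true
}

// parseIntFast is a hypothetical fast path for inputs of at most 18 bytes,
// which cannot overflow int64. Everything else defers to the reference.
func parseIntFast(b []byte) (int64, bool) {
	if len(b) == 0 || len(b) > 18 || b[0] == '+' {
		return parseIntRef(b)
	}
	i, neg := 0, false
	if b[0] == '-' {
		if len(b) == 1 {
			return 0, false // a lone "-" is not a number
		}
		i, neg = 1, true
	}
	var v int64
	for ; i < len(b); i++ {
		c := b[i]
		if c < '0' || c > '9' {
			return 0, false
		}
		v = v*10 + int64(c-'0')
	}
	if neg {
		v = -v
	}
	return v, true
}

// parseIntBuggy is correct on the benchmark input but accepts "-" as zero.
func parseIntBuggy(b []byte) (int64, bool) {
	if string(b) == "-" {
		return 0, true
	}
	return parseIntFast(b)
}

// firstDiff returns the first input on which candidate and reference
// disagree, checking fixed edge cases before n seeded random inputs.
func firstDiff(candidate func([]byte) (int64, bool), seed int64, n int) (string, bool) {
	inputs := []string{"123", "", "-", "+", "-0", "+7", "007", "12a",
		"9223372036854775807", "9223372036854775808", "-9223372036854775808"}
	rng := rand.New(rand.NewSource(seed))
	const alphabet = "0123456789-+ x"
	for i := 0; i < n; i++ {
		p := make([]byte, rng.Intn(6))
		for j := range p {
			p[j] = alphabet[rng.Intn(len(alphabet))]
		}
		inputs = append(inputs, string(p))
	}
	for _, in := range inputs {
		gv, gok := candidate([]byte(in))
		wv, wok := parseIntRef([]byte(in))
		if gv != wv || gok != wok {
			return in, true
		}
	}
	return "", false
}

func main() {
	if in, found := firstDiff(parseIntBuggy, 1, 10000); found {
		fmt.Printf("buggy candidate differs on %q\n", in)
	}
	if _, found := firstDiff(parseIntFast, 1, 10000); !found {
		fmt.Println("fast candidate matches the reference")
	}
}
```

Both candidates return the same value for "123", so a benchmark restricted to that input would rank them equally; only the differential pass separates them.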

valyala/fastjson: single-pass string validation

PR opened · Go JSON validator · 0 allocs/op
-11.51% Validate geomean
  • small / medium / large: -13.79% to -16.32%
  • canada: -3.19%
  • citm / twitter: -10.01% / -10.88%

Validation

go test -count=1 ./..., PR-branch retest, and differential checks for both Validate and ValidateBytes against encoding/json.Valid over focused cases and 20,000 deterministic random inputs.

Caveat

The evidence is for Validate fixture benchmarks, not full JSON parser throughput. One fixture improved only 3.19%, so per-fixture numbers are shown.

Review outcomes

The PR state is evidence too.

Local validation proves a submitted patch was worth review. Maintainer review proves something different, so the ledger keeps those signals separate.

9

Open PRs

0

Merged PRs

No campaign PR is recorded as merged yet. This should change only when GitHub shows the PR merged or a maintainer explicitly confirms acceptance.

1

Outcome rule

Do not count an optimization as accepted just because local benchmarks pass. Track opened, reviewed, merged, closed, and rejected states separately.
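One way to keep those signals separate in data is sketched below. The `Outcome` and `Campaign` types are hypothetical and do not describe Sleepy's internals; the repositories and their current state come from the cards above. As a simplification, the sketch treats only a merged PR as accepted and does not model an explicit maintainer confirmation.

```go
package main

import "fmt"

// Outcome is a hypothetical enum for the upstream PR states the ledger tracks.
type Outcome int

const (
	Opened Outcome = iota
	Reviewed
	Merged
	Closed
	Rejected
)

// Campaign keeps local validation and upstream outcome as separate fields so
// that one can never be read as the other.
type Campaign struct {
	Repo             string
	LocallyValidated bool
	Outcome          Outcome
}

// Accepted is true only for a merged PR; passing local benchmarks is not enough.
func (c Campaign) Accepted() bool { return c.Outcome == Merged }

// tally counts campaigns per outcome state.
func tally(ledger []Campaign) map[Outcome]int {
	counts := make(map[Outcome]int)
	for _, c := range ledger {
		counts[c.Outcome]++
	}
	return counts
}

// newLedger builds the ledger as recorded on this page: nine locally
// validated campaigns, each with an open upstream PR.
func newLedger() []Campaign {
	repos := []string{
		"winnow-rs/winnow", "ohler55/ojg", "fxamacker/cbor",
		"agnivade/levenshtein", "BurntSushi/toml", "segmentio/encoding",
		"tidwall/match", "buger/jsonparser", "valyala/fastjson",
	}
	ledger := make([]Campaign, 0, len(repos))
	for _, r := range repos {
		ledger = append(ledger, Campaign{Repo: r, LocallyValidated: true, Outcome: Opened})
	}
	return ledger
}

func main() {
	ledger := newLedger()
	counts := tally(ledger)
	accepted := 0
	for _, c := range ledger {
		if c.Accepted() {
			accepted++
		}
	}
	// prints "open=9 merged=0 accepted=0"
	fmt.Printf("open=%d merged=%d accepted=%d\n", counts[Opened], counts[Merged], accepted)
}
```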

Campaign method

What counts as proof.

The bar is intentionally higher than a single lucky benchmark run. Sleepy can search, but the final artifact still has to be boring enough for a maintainer to review.

1. Pin the target

Record the repository, commit, target file, benchmark regex or Criterion filter, machine, language/toolchain version, and exact commands.

2. Gate correctness

Run existing tests before measurement. Parser, matcher, validator, and codec changes also get differential or oracle checks when practical.

3. Repeat measurement

Use repeated local benchmark counts, benchstat for Go, or Criterion output for Rust. Report every matched sub-benchmark, not only the aggregate.

4. Review the diff

Reject bulky unrolling, fixture-specific shortcuts, unrelated rewrites, and candidates that are faster but harder to maintain.

5. Publish caveats

State exactly what was measured and what was not. Do not convert a narrow benchmark win into a broad throughput claim.

6. Export the story

Keep the final diff, raw outputs, lineage, PR body, result writeup, and follow-up status so another reviewer can audit the path.

Learning log

The misses are part of the evidence.

The campaign records failed setup, stalled hosted runs, rejected candidates, and product fixes. That is how the service avoids turning benchmark noise into marketing copy.

Auth failure, no PR

The first tidwall/match attempt reached hosted Sleepy but could not create a remote run without workspace auth. No candidate was generated and no upstream PR was opened.

Useful but unsafe candidate

The first jsonparser candidate found the right direction but changed parseInt("-") behavior. Differential testing caught it before submission.

Hosted orchestration fixes

Long hosted runs exposed provider stream errors and generation back-pressure handling gaps. Those issues became product follow-ups instead of hidden caveats.

Cancelled hosted engine

The BurntSushi/toml run produced a useful candidate, then the hosted engine entered the cancelled state after a test-failing mutation. The candidate was validated locally, and the orchestration issue is tracked as a product caveat.

Fastest was not best

The agnivade/levenshtein local mixed-model run found a much faster unsafe Myers-style rewrite. The submitted PR used the smaller ASCII dynamic-programming fast path because reviewability is part of the proof standard.

Hosted invalid, local winner

The fxamacker/cbor hosted mixed-model run sampled both GPT-5.5 and z.ai GLM-5.2 but produced only invalid candidates. A local mixed-model follow-up found the reviewable byte-matching fallback, then oracle testing and 20-count benchmarks decided whether it was PR-worthy.

Reduce before submitting

The ohler55/ojg raw candidate also rewrote number and quoted-string scanning. The submitted PR kept only the token-slicing allocation win that was benchmarked, oracle-checked, and small enough to review.

Rust harness cost matters

The winnow-rs/winnow run showed that a filtered Criterion target can still be expensive inside Sleepy's cargobench: loop. The campaign stopped the search early, exported lineage, and promoted the candidate only after independent oracle checks and full sibling benchmark reporting.

Claim format: public proof
target: repository + commit + file
gate: go test -count=1 ./... / cargo test
benchmark: go test -run '^$' -bench '...' -benchmem -count=20 / cargo bench ...
extra check: differential / oracle / fuzz-style validation
result: per-sub-benchmark benchstat, final diff, caveats, PR link
ledger: public card, outcome status, learning note, next campaign decision

Run the same campaign on your benchmark.

Sleepy hosts the search. Your worker runs the code. The final claim belongs to the evidence you can reproduce and publish.