Testing, validation, and Lucy logs

This page ties together how we stress poly/, where artifacts land, and how to read parity tables in captured logs (for example lucy/lucy_testing_output/log.txt).


Where logs come from

The Lucy tree (lucy/) drives broad layer suites: forward/backward parity, training matrices, save/reload checks, and GPU timing tables. A typical full run writes a transcript under:

  • lucy/lucy_testing_output/log.txt

That file is meant for human review and regression diffing: it records the adapter name, the backend (Metal/Vulkan), per-dtype rows, and summary tallies at the end of each section.


How to read parity summary lines

Sections often end with a line shaped like:

>> [Forward Parity] 84 Tests | 💎 42 | ✅ 24 | 🟨 0 | 🟠 0 | 🟤 18 | ❌ 0 | 💀 0

Rough meaning (exact thresholds live in the test harness, not duplicated here):

Symbol     Typical meaning
💎         Exact / diamond-grade agreement within the tightest tolerance
✅         Pass within configured industry-grade tolerance
🟨 / 🟠    Elevated drift bands (still classified by the harness)
🟤         Heavy drift (e.g. H-DRIFT in backward tables); worth investigating dtype + path
❌         Hard failure (assert or threshold breach)
💀         Fatal / panic / infrastructure failure

Backward tables may label columns INDUS (industry tolerance) vs H-DRIFT (heavy drift). Treat 🟤 rows as “numerically alive but not interchangeable with FP32 reference at the same tolerance,” not necessarily as engine bugs: some combinations are expected to diverge when the reference path is float32-simulated and the subject path is true low-bit or integer-native.
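
For regression diffing it can help to turn a summary line into numbers. The following is a minimal Go sketch, assuming the line always has the shape shown above (a ">> [Section] N Tests" head followed by " | "-separated "symbol count" buckets). The names Tally and ParseSummary are hypothetical helpers for illustration, not part of the Lucy harness.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// Tally holds the counts from one parity summary line (hypothetical type).
type Tally struct {
	Section string
	Total   int
	Counts  map[string]int // keyed by symbol, e.g. "💎"
}

// headRe matches the leading ">> [Section] N Tests" part of the line.
var headRe = regexp.MustCompile(`^>> \[(.+?)\] (\d+) Tests$`)

// ParseSummary splits the line on " | " and reads "<symbol> <count>" pairs.
func ParseSummary(line string) (Tally, error) {
	parts := strings.Split(strings.TrimSpace(line), " | ")
	m := headRe.FindStringSubmatch(parts[0])
	if m == nil {
		return Tally{}, fmt.Errorf("not a summary line: %q", line)
	}
	total, err := strconv.Atoi(m[2])
	if err != nil {
		return Tally{}, err
	}
	t := Tally{Section: m[1], Total: total, Counts: map[string]int{}}
	for _, p := range parts[1:] {
		f := strings.Fields(p)
		if len(f) != 2 {
			return Tally{}, fmt.Errorf("bad bucket %q", p)
		}
		n, err := strconv.Atoi(f[1])
		if err != nil {
			return Tally{}, err
		}
		t.Counts[f[0]] = n
	}
	return t, nil
}

// Sum adds up all buckets; it should equal Total on a well-formed line.
func (t Tally) Sum() int {
	s := 0
	for _, n := range t.Counts {
		s += n
	}
	return s
}

func main() {
	line := ">> [Forward Parity] 84 Tests | 💎 42 | ✅ 24 | 🟨 0 | 🟠 0 | 🟤 18 | ❌ 0 | 💀 0"
	t, err := ParseSummary(line)
	if err != nil {
		panic(err)
	}
	// Section, total, whether buckets add up, and hard + fatal failures.
	fmt.Println(t.Section, t.Total, t.Sum() == t.Total, t.Counts["❌"]+t.Counts["💀"])
}
```

In the example line the buckets add up to the stated total (42 + 24 + 18 = 84), which makes a cheap sanity check before comparing two transcripts.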


May 2026 full-suite snapshot (log.txt)

Recent Run All Layer Tests captures (Metal / arm64, ~2992 rows) show:

Metric             Value
Broken (❌)        0
Fatal / NaN (💀)   0
Bit-exact (💎)     ~75% of classified rows
Heavy drift (🟤)   ~17%, mostly forward parity vs FP32 reference on native-int / low-bit paths

Fixes reflected in this run (vs earlier transcripts):

  • Training matrix: File / RAM columns print correctly (no %!s(MISSING)); every Dense training row shows TrainOK PASS and Save/Reload PASS for all 21 dtypes.
  • Save/Reload: CNN1/2/3, Dense, Embedding, LSTM, MHA, Residual, RNN, and SwiGLU each end with [Save/Reload <layer>] PASS.
  • Global manifest: no hard failures across the full layer sweep.

Still classified as 🟤 (not ❌): Dense forward parity rows where CPU uses true integer/low-bit math and the harness compares to a float-shaped reference; CNN backward H-DRIFT on Float16/BFloat16/Int4 (GPU vs CPU reference). Treat as tolerance bands — see parity legend above.
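
Why an integer-native path can land in a drift band without being wrong is easiest to see on a toy dot product. The Go sketch below assumes a simple symmetric per-tensor int8 quantization, which is an illustration chosen here and not necessarily the scheme poly/ uses; quantize and RelDrift are hypothetical names.

```go
package main

import (
	"fmt"
	"math"
)

// quantize maps x to int8 with a symmetric scale so that max|x| becomes 127.
func quantize(x []float32) ([]int8, float32) {
	var maxAbs float32
	for _, v := range x {
		if a := float32(math.Abs(float64(v))); a > maxAbs {
			maxAbs = a
		}
	}
	q := make([]int8, len(x))
	if maxAbs == 0 {
		return q, 1
	}
	scale := maxAbs / 127
	for i, v := range x {
		q[i] = int8(math.Round(float64(v / scale)))
	}
	return q, scale
}

// RelDrift compares an int8 dot product (rescaled to float) with the
// float32 reference and returns the relative difference.
func RelDrift(a, b []float32) float64 {
	var ref float32
	for i := range a {
		ref += a[i] * b[i]
	}
	qa, sa := quantize(a)
	qb, sb := quantize(b)
	var acc int32
	for i := range qa {
		acc += int32(qa[i]) * int32(qb[i])
	}
	got := float32(acc) * sa * sb
	return math.Abs(float64(got-ref)) / math.Abs(float64(ref))
}

func main() {
	a := []float32{0.11, 0.52, 0.93, 0.27}
	b := []float32{0.64, 0.08, 0.31, 0.75}
	// The drift is small but far above a bit-exact tolerance.
	fmt.Printf("relative drift: %.2e\n", RelDrift(a, b))
}
```

Both results are reasonable answers to the same question; they just cannot agree to float32 precision, which is the situation a 🟤 row describes.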


Dense forward ASM (Plan 9)

Lucy Dense → Generic Layer Suite prints Go SC · Go MC · ASM SC · ASM MC · GPU SC · GPU MC and speedup columns:

  • Go/Asm↑ = Go wall time ÷ ASM wall time (> 1.0 = assembly wins).
  • Toggle: UseAsmForward on the network/layer; kernels live under poly/asm/ (see asm/README.md).

Latest Dense bench (8×1024→512, Metal host, from log.txt):

Highlight            Go/Asm↑ SC                         Go/Asm↑ MC
Best single-core     Uint8 ~2.46×                       -
Best multi-core      -                                  Uint4 ~3.55×
Strong quant MC      -                                  Ternary ~3.21×, FP4 ~3.25×, Binary ~2.78×, Int8 ~2.72×
Float32              ~1.11× SC                          ~1.00× MC (parity)
Float64              < 1× (asm slower on this shape)    ~0.61× MC

Low-bit and morphed-uint8 paths benefit most from native integer dots in Plan 9. Float64 SC/MC still favors Go tiled matmul on the current tile sizes — tuning item, not a broken toggle.

Backward / training: asm is forward-only today; Dense backward parity uses Go CPU vs GPU; training does not call asm.


Interpreting a real log (examples)

The following patterns show up in recent log.txt captures (Metal adapter, tiled CNN1 suite):

  1. CNN1 generic suite note: The harness itself reminds you that generic CNN1 tests still include simulated / PTQ fallback where a dtype has no strict native path. For a strict native-only CPU/GPU/tiling audit, use the Glitch layer_matrix example (see Glitch docs / examples in-repo).

  2. Float64 on GPU forward: CPU times in microseconds set against GPU times in milliseconds often produce a speedup ratio well below 1×; that is frequently dispatch overhead dominating tiny work, not a claim that FP64 GPU is slower than CPU math in the large-batch limit.

  3. Wide integer CNN1 backward: Int64 / Uint64 / Int32 / Uint32 rows may show 🟤 H-DRIFT vs float reference in GPU backward parity. The harness compares against an FP32-shaped reference while the native path uses integer semantics, so read those rows as classification / tolerance, not as “GPU kernel wrong.”

  4. Save/Reload after training: On the Dense suite (May 2026 log), Save/Reload is PASS for all 21 dtypes after training. Older CNN-only rows or pre-native-save builds may still show FAIL on specific combos; diff against current persistence.go (Native: true + per-layer dtype) before treating them as open bugs.

  5. Uint CPU training: Uint64 / Uint32 (and sometimes Uint16) may show TrainOK FAIL on CPU-tiled modes while GPU modes PASS. That points at CPU-side training / loss scaling for unsigned paths, not at GPU correctness.

  6. Peak performance gap line: The footer PEAK PERFORMANCE GAP (e.g. Dense Forward Float16) is a headline ratio from one worst row in the scan table; it is useful for spotting outliers, not as a single global quality score.
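
The dispatch-overhead effect in item 2 can be reproduced with a toy cost model: GPU time is a fixed launch cost plus per-element work, CPU time is per-element work only. The constants in this Go sketch are hypothetical and chosen only to show the crossover; they are not measurements from any adapter.

```go
package main

import "fmt"

// Hypothetical cost constants, in microseconds.
const (
	gpuOverheadUs = 500.0 // fixed dispatch / sync cost per call
	gpuPerElemUs  = 0.001 // GPU cost per element
	cpuPerElemUs  = 0.01  // CPU cost per element
)

// cpuOverGPU returns CPU time divided by GPU time for n elements.
// Values below 1.0 mean the GPU run was slower end to end.
func cpuOverGPU(n int) float64 {
	cpu := float64(n) * cpuPerElemUs
	gpu := gpuOverheadUs + float64(n)*gpuPerElemUs
	return cpu / gpu
}

func main() {
	for _, n := range []int{1000, 100000, 10000000} {
		fmt.Printf("n=%d ratio=%.2f\n", n, cpuOverGPU(n))
	}
}
```

With these numbers the ratio is far below 1× for small inputs and climbs above 1× once the per-element work outweighs the fixed overhead, which is why a tiny-shape row says little about large-batch behavior.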


Poly package: what the suites actually exercise

High-signal files and areas (not exhaustive):

Area                          Representative files
Core types & dispatch         poly.go, forward.go, backward.go, training.go
Numerical morphing            weights.go, quantization.go, CNN/ dense / MHA polymorphic *.go
GPU / WebGPU                  wgpu_context.go, wgpu_forward.go, wgpu_kernels.go, wgpu_shaders.go, wgpu_softmax.go
Tiling & tile size            tile_detection.go, *_tiled*.go paths in dense / CNN / MHA
Serialization                 serialization.go, persistence.go, safetensors.go
Native layer matrix harness   native_layer_matrix.go, native_matrix_builtin_hooks.go
Telemetry                     tanhi.go, hardware probes in hardware.go

When you add a layer or dtype, extend both the Lucy (or Glitch) harness and this doc if the log format or tolerance bands change.


Exact entrypoints move with refactors; for current pointers, prefer:

  • lucy/README.md — MRBiVS stack and pointers into poly/.
  • poly/README.md — version checklist and capability matrix.
  • welvet/cabi/internal/check/ — C-ABI vs poly/ export parity scanner (Go).

See also