# Testing, validation, and Lucy logs > This page ties together **how we stress `poly/`**, where **artifacts land**, and how to read **parity tables** in captured logs (for example `lucy/lucy_testing_output/log.txt`). Canonical: https://openfluke.com/docs/testing-and-validation --- # Testing, validation, and Lucy logs This page ties together **how we stress `poly/`**, where **artifacts land**, and how to read **parity tables** in captured logs (for example `lucy/lucy_testing_output/log.txt`). --- ## Where logs come from The **Lucy** tree (`lucy/`) drives broad layer suites: forward/backward parity, training matrices, save/reload checks, and GPU timing tables. Typical transcripts: | Log | Menu | Contents | |-----|------|----------| | `lucy/lucy_testing_output/log.txt` | Dense L1 / GPU parity / layer matrices | Forward/backward parity, ASM timers, GPU tables | | `lucy/lucy_testing_output/seven_layer.txt` | **[7] Seven-layer CPU suite** | 10 layer types × 21 dtypes × 1³/2³/3³ grids, SC/MC, train, save/reload | Both files are meant for human review and regression diffing (adapter name, per-dtype rows, summary tallies). **Seven-layer suite (v0.79+):** See [`bedrock_validation.md`](bedrock_validation.md) for what the harness gates (MHA layout, KV decode, native ternary save, C-ABI `SyncInferenceWeights`). Run `cd lucy && go run .` → **[7]** or **[0]**. --- ## How to read parity summary lines Sections often end with a line shaped like: ```text >> [Forward Parity] 84 Tests | 💎 42 | ✅ 24 | 🟨 0 | 🟠 0 | 🟤 18 | ❌ 0 | 💀 0 ``` Rough meaning (exact thresholds live in the test harness, not duplicated here): | Symbol | Typical meaning | |--------|-----------------| | **💎** | Exact / diamond-grade agreement within the tightest tolerance | | **✅** | Pass within configured industry-grade tolerance | | **🟨 / 🟠** | Elevated drift bands (still classified by the harness) | | **🟤** | Heavy drift (e.g. **H-DRIFT** in backward tables) — worth investigating dtype + path | | **❌** | Hard failure (assert or threshold breach) | | **💀** | Fatal / panic / infrastructure failure | Backward tables may label columns **INDUS** (industry tolerance) vs **H-DRIFT** (heavy drift). Treat **🟤** rows as “numerically alive but not interchangeable with FP32 reference at the same tolerance,” not necessarily as engine bugs: some combinations are expected to diverge when the reference path is float32-simulated and the subject path is true low-bit or integer-native. --- ## May 2026 full-suite snapshot (`log.txt`) Recent **Run All Layer Tests** captures (Metal / arm64, ~2992 rows) show: | Metric | Value | |--------|--------| | **Broken (❌)** | **0** | | **Fatal / NaN (💀)** | **0** | | Bit-exact (💎) | ~75% of classified rows | | Heavy drift (🟤) | ~17% — mostly forward parity vs FP32 reference on native-int / low-bit paths | **Fixes reflected in this run (vs earlier transcripts):** - **Training matrix** — `File` / `RAM` columns print correctly (no `%!s(MISSING)`); every Dense training row **TrainOK PASS** and **Save/Reload PASS** for all 21 dtypes. - **Save/Reload** — CNN1/2/3, Dense, Embedding, LSTM, MHA, Residual, RNN, SwiGLU each end with `[Save/Reload ] PASS`. - **Global manifest** — no hard failures across the full layer sweep. **Still classified as 🟤 (not ❌):** Dense forward parity rows where CPU uses true integer/low-bit math and the harness compares to a float-shaped reference; CNN backward **H-DRIFT** on Float16/BFloat16/Int4 (GPU vs CPU reference). Treat as tolerance bands — see parity legend above. --- ## Dense forward ASM (Plan 9) Lucy **Dense → Generic Layer Suite** prints **Go SC · Go MC · ASM SC · ASM MC · GPU SC · GPU MC** and speedup columns: - **Go/Asm↑** = Go wall time ÷ ASM wall time (**> 1.0** = assembly wins). - Toggle: `UseAsmForward` on the network/layer; kernels live under `poly/asm/` (see [`asm/README.md`](../poly/asm/README.md)). **Latest Dense bench (8×1024→512, Metal host, from `log.txt`):** | Highlight | Go/Asm↑ SC | Go/Asm↑ MC | |-----------|------------|------------| | Best single-core | **Uint8** ~**2.46×** | — | | Best multi-core | — | **Uint4** ~**3.55×** | | Strong quant MC | — | **Ternary** ~3.21×, **FP4** ~3.25×, **Binary** ~2.78×, **Int8** ~2.72× | | Float32 | ~1.11× SC, ~1.00× MC (parity) | | | Float64 | **< 1×** (asm slower on this shape) | ~0.61× MC | Low-bit and morphed-`uint8` paths benefit most from native integer dots in Plan 9. Float64 SC/MC still favors Go tiled matmul on the current tile sizes — tuning item, not a broken toggle. **Backward / training:** asm is **forward-only** today; Dense backward parity uses Go CPU vs GPU; training does not call asm. --- ## Interpreting a real log (examples) The following patterns show up in recent `log.txt` captures (Metal adapter, tiled CNN1 suite): 1. **CNN1 generic suite note** — The harness itself reminds you that generic CNN1 tests still include **simulated / PTQ fallback** where a dtype has no strict native path. For a **strict native-only** CPU/GPU/tiling audit, use the **Glitch** `layer_matrix` example (see Glitch docs / examples in-repo). 2. **Float64 on GPU forward** — CPU microseconds vs GPU milliseconds often look like a large “speedup ratio < 1×”; that is frequently **dispatch overhead dominating tiny work**, not a claim that FP64 GPU is slower than CPU math in the large-batch limit. 3. **Wide integer CNN1 backward** — **Int64 / Uint64 / Int32 / Uint32** rows may show **🟤 H-DRIFT** vs float reference in GPU backward parity: the harness compares against an FP32-shaped reference while the native path uses integer semantics — read those rows as **classification / tolerance**, not as “GPU kernel wrong.” 4. **Save/Reload after training** — On the **Dense** suite (May 2026 log), **Save/Reload PASS** for all 21 dtypes after training. Older CNN-only rows or pre-native-save builds may still show FAIL on specific combos; diff against current `persistence.go` (`Native: true` + per-layer `dtype`) before treating as open bugs. 5. **Uint CPU training** — **Uint64 / Uint32** (and sometimes **Uint16**) may show **TrainOK FAIL** on CPU-tiled modes while GPU modes **PASS**: that points at **CPU-side training / loss scaling** for unsigned paths, not at GPU correctness. 6. **Peak performance gap line** — The footer **PEAK PERFORMANCE GAP** (e.g. Dense Forward Float16) is a **headline ratio** from one worst row in the scan table; it is useful for spotting outliers, not as a single global quality score. --- ## Poly package: what the suites actually exercise High-signal files and areas (not exhaustive): | Area | Representative files | |------|------------------------| | Core types & dispatch | `poly.go`, `forward.go`, `backward.go`, `training.go` | | Numerical morphing | `weights.go`, `quantization.go`, CNN/ dense / MHA polymorphic `*.go` | | GPU / WebGPU | `wgpu_context.go`, `wgpu_forward.go`, `wgpu_kernels.go`, `wgpu_shaders.go`, `wgpu_softmax.go` | | Tiling & tile size | `tile_detection.go`, `*_tiled*.go` paths in dense / CNN / MHA | | Serialization | `serialization.go`, `persistence.go`, `safetensors.go` | | Native layer matrix harness | `native_layer_matrix.go`, `native_matrix_builtin_hooks.go` | | Telemetry | `tanhi.go`, hardware probes in `hardware.go` | When you add a layer or dtype, extend **both** the Lucy (or Glitch) harness **and** this doc if the log format or tolerance bands change. --- ## Related commands (developer workflow) Exact entrypoints move with refactors; prefer: - `lucy/README.md` — MRBiVS stack and pointers into `poly/`. - `poly/README.md` — version checklist and capability matrix. - `welvet/cabi/internal/check/` — C-ABI vs `poly/` export parity scanner (Go); expect **461/461 (100%)** after v0.79 (`LoomSyncInferenceWeights`). --- ## See also - [bedrock_validation.md](bedrock_validation.md) — v0.79.0 seven-layer suite, MHA/KV, C-ABI - [numerical_types.md](numerical_types.md) — DType list and `WeightStore` lifecycle - [gpu.md](gpu.md) — WebGPU context and dispatch overview - [serialization.md](serialization.md) — Save/load and safetensors - [training.md](training.md) — Training modes and loss paths