ASM & Volumetric Executor Exploration (Archive)
Status: Exploratory work — not merged as a long-term direction.
Date: June 2026
Context: Lucy menu [7] (seven-layer CPU suite), Dense forward only.
This document captures everything built and learned during the Plan 9 ASM dense forward push and the volumetric fused executor (v1) experiment. Copy this file before deleting the exploration code; the complexity-to-gain ratio did not justify keeping it in-tree.
Related existing docs: poly/asm/README.md, bitnet_cpu.md, dispatch.md.
Executive summary
What we tried
- Native-quant ASM dense forward — BitNet-style W8A8: morphed weights stay in native storage, activations quantized per row to int8, integer dot in Plan 9 asm, one output scale per row. No expanding morphed weights to float32 for matmul on the ASM path.
- BitNet I2_S scaffolding — Microsoft
ggml-bitnet-mad.cppI2_S pack + scalar/asm row dots for ternary. - Volumetric executor v1 —
BuildDenseExecPlan+ForwardDenseExecutor: skipDispatchLayerper cell×layer, pre-plan dtype→kernel, morph weights once, direct dense kernel calls.
What we learned
| Finding | Detail |
|---|---|
| ASM wins on fat layers | 1×1×1 with pyramid dims (16→64): quant native ASM ~1.5–2.4× vs Go float-dequant tiled path |
| ASM loses on thin grids | 3×3×3 with flatEndpoints(4) (4×4 micro-layers): ASM ~0.5–0.93× vs Go — setup + 189 tiny GEMVs dominate |
| Executor v1 saves ~15–38% | Skipping interpret dispatch on 3×3×3 dense: consistent 1.15–1.38× (Interp/Exec↑), perfect parity |
| Dispatch is not the main 3³ problem | Executor proves interpret tax ≈ 25% of forward; remaining ~75% is per-layer 4×4 work |
| Quant does not win bigger via executor | Int8/Int4 gains ≈ Float32 — no special breakout |
| Not worth the complexity | Many files, dual parity paths, Lucy table sprawl, MC races, dtype routing — for ~20–35% on a narrow CPU interpret path |
Decision
Archive the exploration. Keep the interpreter (ForwardPolymorphic + DispatchLayer) as the research path. Peak inference belongs on GPU (poly/wgpu_forward.go) or a future batched executor that fuses matmul across cells — not per-hop ASM + per-hop dispatch trimming.
Part I — Architecture before and after
Two tiers (intended end state)
┌─────────────────────────────────────────────────────────────┐
│ Tier 1 — Interpreter (keep) │
│ ForwardPolymorphic → spatial tile loops → DispatchLayer │
│ Build, train, parity, volumetric experiments │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Tier 2 — Executor (explored, v1 only) │
│ Plan at load: dtype → kernel map, pooled buffers │
│ Fused visit order, few CPU entry points │
│ v1: no cross-cell matmul batching yet │
└─────────────────────────────────────────────────────────────┘
Interpret path (baseline)
ForwardPolymorphic (forward.go)
→ spatial 4×4×4 tile loops (or classic z→y→x→l)
→ DispatchLayer (jump table)
→ DenseForwardPolymorphic (dense.go)
→ asm: denseForwardAsm
→ bitnet: DenseForwardBitNetNativeQuantCPU
→ default: DenseForwardTiled (float dequant via denseActiveWeights)
Lucy [7] Go column for quant dtypes timed DenseForwardTiled (float dequant), not native BitNet — so ASM-vs-Go tables compare different semantics unless you read the captions.
ASM dense path
DenseForwardPolymorphic
└─ layerUseAsmForward && asm.Enabled()
└─ denseForwardAsm (dense_asm.go)
├─ ternary: denseForwardAsmBitNetTernary → I2S asm if available
├─ quant / low-bit float morph:
│ denseForwardAsmNativeQuant (dense_asm_native_quant.go)
│ · morphed []uint8 weights (WeightStore.Morph)
│ · per-row int8 activation quant (bitNetQuantizeActivationNumeric)
│ · dot.U8BytesTileNativeI64 / packed nibble / 2-bit / binary paths
│ · one output scale per row → float preAct
└─ Float32/64: denseForwardAsmByDType → asm/matmul GEMV
Key principle: weights use proper native/entity layouts; forward does not re-morph from FP32 on every hop if inference sync already populated Versions[dtype].
Executor v1 path
BuildDenseExecPlan(net)
· forwardVisitLayerIndices — same order as ForwardPolymorphic
· verify all visited layers are LayerDense
· prepDenseExecWeights — Morph once per layer
· pickDenseExecKernel — dtype → DenseExecKernel enum
ForwardDenseExecutor(net, plan, input)
· for step in plan.Steps:
denseExecForwardStep(kernel, layer, current) // no DispatchLayer
· kernels mirror DenseForwardPolymorphic (no ASM): Tiled / BitNet native / I2S try
What v1 did not do: fuse tiles under the grid, batch 27 cells into one GEMV, or call ASM from the executor loop.
Part II — ASM exploration (detail)
Goals
- Hand-written Plan 9 assembly (
*.s, not CGO) on amd64 and arm64. - 21 dtypes on Dense forward via
UseAsmForward/VolumetricNetwork.SetAsmForward. - Native integer dots — multiply/accumulate in storage width; dequant once at boundary.
- BitNet W8A8 reference in Go (
DenseForwardBitNetNativeQuantCPU) for quant parity vs ASM. - Lucy
[7]timing: Go tiled vs ASM, all dtypes whenForwardCapable.
Package layout (poly/asm/)
| Package | Role |
|---|---|
asm/dot/ |
f32/f64 dots, native int dots (native_int_*.s), packed row dots (native_packed_*.s) |
asm/matmul/ |
Tiled GEMV, OverOutputTiles MC, wires dot into tile callbacks |
asm/dense/ |
Thin float forward entry for poly |
asm/bitnet/ |
rowDotI2SI8, ternaryWordDot16 (scalar asm stubs on arm64/amd64) |
BitNet I2_S (poly/bitnet/)
Reference: Microsoft ggml-bitnet-mad.cpp (cloned under loom/BitNet/ during exploration).
| Constant / type | Meaning |
|---|---|
QK_I2S = 128 |
Block size; cols must be % 128 == 0 for I2S matrix |
I2SMatrix |
Row-major ternary in ACT_PARALLEL layout |
Codes 0,1,2 |
→ weights -1, 0, +1 |
RowDotI2SI8Go |
Go scalar reference |
| ASM | poly/asm/bitnet/i2s_amd64.s, i2s_arm64.s — not full AVX2 maddubs loop from Microsoft |
Not ported: full AVX2 I2_S inner loop from BitNet/src/ggml-bitnet-mad.cpp.
BitNet W8A8 native quant flow
Per output row:
- Read morphed weight row from
WeightStore.Versions[dtype](uint8 storage). - Quantize input row to int8 (
bitNetQuantizeActivationNumeric). - Integer dot (Go ref or
dot.U8BytesTileNativeI64in asm). outScale = weightScale * actMax / 127→ float pre-activation.- Apply layer activation.
Dtype-specific branches in denseForwardAsmNativeQuant: wide int16/32/64, FP8 morph, nibble Int4, 2-bit Int2/Ternary, binary.
Poly files (ASM exploration)
| File | Role |
|---|---|
poly/dense_asm.go |
SetAsmForward, denseForwardAsm routing |
poly/dense_asm_dtypes.go |
Dtype → asm matmul dispatch |
poly/dense_asm_native.go |
Legacy native width paths |
poly/dense_asm_native_matmul.go |
Native tiled matmul wiring |
poly/dense_asm_native_quant.go |
BitNet W8A8 asm path |
poly/dense_bitnet_w8a8_cpu.go |
Go scalar BitNet reference |
poly/bitnet_cpu.go |
Ternary word dot delegates to asm |
poly/bitnet_i2s_cpu.go |
GetBitNetI2SMatrix, DenseForwardBitNetI2SCPU |
poly/bitnet/i2s.go |
I2_S pack + Go row dot |
poly/bitnet/ternary_u32.go |
TernaryWordDot16Go |
poly/dense.go |
layerUseBitNetNativeQuant, DenseForwardPolymorphic routing |
Network flags (poly/poly.go)
UseAsmForward bool // Plan 9 assembly CPU kernels
UseBitNetNativeQuant bool // Dense: BitNet W8A8 Go ref (Lucy quant parity column)
Bugs fixed during exploration
| Bug | Symptom | Fix |
|---|---|---|
| Lucy gated ASM timing on float-only determinism | Blank ASM columns for quant dtypes | Time ASM for all dtypes when ForwardCapable |
Shared xq/xu buffers in runRows closure |
MC parity ~1e19 on Float16 | Allocate quant buffers inside each runRows goroutine |
GetBitNetI2SMatrix with cols=16 |
Panic — I2_S needs cols % 128 | Guard / fallback to packed ternary |
Duplicate bitNetTernaryWordDotTail |
Build error | Cleaned bitnet_cpu.go |
Lucy [7] ASM integration
Files: lucy/examples/seven_layer/runner.go, summary.go, common.go
requiresAsmGoTiledParity— Float32/64: ASM vs Go tiledrequiresAsmBitNetParity— quant: ASM vscaptureForwardBitNetNative(Go BitNet ref)- Table:
printDenseForwardAsmTimingTable— Go SC/MC, ASM SC/MC, Go/Asm↑, |Δ|
Captions:
- Go =
DenseForwardTiled(float dequant) - ASM = native quant (BitNet W8A8) or float GEMV
- Go/Asm↑ = Go÷ASM (>1 means assembly wins)
ASM benchmark findings (Lucy Dense)
1×1×1 (pyramid 16→24→32→48→64→48→32→8)
| Category | Go/Asm↑ (approx) |
|---|---|
| Int8, FP8, Float16 | ~1.5–2.4× ASM wins |
| Int4, Binary, Ternary | ~0.5–1.0× (scalar packed dots; ternary may hit Go matvec) |
| Float32/64 | ~1.0–1.1× |
3×3×3 (flatEndpoints(4) → 4×4 layers, 189 hops)
| Category | Go/Asm↑ (approx) |
|---|---|
| Everything | ~0.5–0.93× — ASM slower than Go tiled |
Why 3³ is bad for ASM:
- 27 cells × 7 layers = 189 serial dense forwards
- Each layer is 4×4 GEMV — too small to amortize asm entry, quant buffers, tile setup
- Interpreter adds overhead too, but Go tiled float-dequant path is cheap per micro-op
- Native quant correctness is fine; kernel economics are wrong at this shape
Tests added
| Test | Package |
|---|---|
dense_asm_parity_test.go |
Float → Go tiled; quant → BitNet native ref |
bitnet_i2s_test.go |
I2S row dot, ternary word dot, I2S forward smoke |
bitnet_cpu_test.go |
BitNet CPU helpers |
cd loom/poly
go test ./asm/... ./tests/ -run 'DenseAsm|BitNet|I2S' -count=1
Part III — Volumetric executor (detail)
Motivation
User goal: "dtype-aware volumetric executor that fuses tiles under the grid, instead of re-interpreting every cell×layer on the way through."
v1 scope: prove interpret tax measurable; skip DispatchLayer; pre-plan kernels. Not true tile fusion under the grid.
API (poly/volumetric_executor.go)
type DenseExecKernel uint8 // Tiled | BitNetNative | BitNetI2SThenTiled
type DenseExecStep struct {
LayerIdx int
Kernel DenseExecKernel
}
type DenseExecPlan struct {
Steps []DenseExecStep
}
func BuildDenseExecPlan(n *VolumetricNetwork) (*DenseExecPlan, error)
func ForwardDenseExecutor(n, plan, input) (*Tensor[float32], time.Duration)
ErrNotDenseOnly if any visited layer is not LayerDense.
Lucy [7] executor table
Printed after ASM table for Dense suites: Interpret vs Fused Executor.
| Column | Meaning |
|---|---|
| Interp SC/MC | ForwardPolymorphic + DispatchLayer (CPU Go, no ASM) |
| Exec SC/MC | ForwardDenseExecutor with pre-built plan |
| Interp/Exec↑ | Interpret÷Executor (>1 = executor wins) |
| |Δ| | max |Interpret − Executor| — parity |
Files: runner.go (capture), common.go (benchmarkExecutorForward), summary.go (printDenseExecutorTimingTable), dense_executor_3x3_test.go (smoke).
Measured results — Dense 3×3×3 (5 passes, user run)
╔══════════════════════════════════════════════════════════════════════╗
║ Dense 3×3×3 — forward: Interpret vs Fused Executor (avg of 5 passes) ║
╚══════════════════════════════════════════════════════════════════════╝
| DType | Interp SC | Exec SC | Interp/Exec↑ | |Δ| SC | Interp MC | Exec MC | Interp/Exec↑ | |Δ| MC |
| Float64 | 31.4µs | 25.9µs | 1.21× | 0.00e+00 | 152.8µs | 131.0µs | 1.17× | 0.00e+00 |
| Float32 | 31.2µs | 22.6µs | 1.38× | 0.00e+00 | 136.1µs | 140.1µs | 0.97× | 0.00e+00 |
| Float16 | 38.7µs | 32.9µs | 1.18× | 0.00e+00 | 152.9µs | 121.9µs | 1.25× | 0.00e+00 |
| BFloat16 | 36.5µs | 27.4µs | 1.33× | 0.00e+00 | 136.0µs | 122.7µs | 1.11× | 0.00e+00 |
| FP8-E4M3 | 38.0µs | 32.9µs | 1.15× | 0.00e+00 | 145.8µs | 120.9µs | 1.21× | 0.00e+00 |
| FP8-E5M2 | 38.5µs | 33.2µs | 1.16× | 0.00e+00 | 151.4µs | 131.0µs | 1.16× | 0.00e+00 |
| Int64 | 36.0µs | 30.4µs | 1.19× | 0.00e+00 | 139.2µs | 114.8µs | 1.21× | 0.00e+00 |
| Uint64 | 36.2µs | 29.7µs | 1.22× | 0.00e+00 | 144.7µs | 127.7µs | 1.13× | 0.00e+00 |
| Int32 | 37.1µs | 30.0µs | 1.24× | 0.00e+00 | 159.2µs | 126.7µs | 1.26× | 0.00e+00 |
| Uint32 | 36.4µs | 29.9µs | 1.22× | 0.00e+00 | 132.2µs | 129.3µs | 1.02× | 0.00e+00 |
| Int16 | 36.7µs | 28.0µs | 1.31× | 0.00e+00 | 140.5µs | 114.2µs | 1.23× | 0.00e+00 |
| Uint16 | 35.9µs | 29.2µs | 1.23× | 0.00e+00 | 131.4µs | 117.9µs | 1.11× | 0.00e+00 |
| Int8 | 38.7µs | 30.8µs | 1.26× | 0.00e+00 | 141.3µs | 111.1µs | 1.27× | 0.00e+00 |
| Uint8 | 38.1µs | 27.8µs | 1.37× | 0.00e+00 | 155.2µs | 133.0µs | 1.17× | 0.00e+00 |
| Int4 | 36.8µs | 30.3µs | 1.21× | 0.00e+00 | 141.9µs | 114.9µs | 1.23× | 0.00e+00 |
| Uint4 | 40.4µs | 32.1µs | 1.26× | 0.00e+00 | 143.7µs | 125.6µs | 1.14× | 0.00e+00 |
| FP4 | 36.9µs | 28.9µs | 1.28× | 0.00e+00 | 130.8µs | 115.4µs | 1.13× | 0.00e+00 |
| Int2 | 38.7µs | 30.6µs | 1.27× | 0.00e+00 | 145.5µs | 110.2µs | 1.32× | 0.00e+00 |
| Uint2 | 37.4µs | 29.4µs | 1.27× | 0.00e+00 | 151.8µs | 122.7µs | 1.24× | 0.00e+00 |
| Ternary | 54.9µs | 73.5µs | 0.75× | 0.00e+00 | 56.6µs | 44.0µs | 1.29× | 0.00e+00 |
| Binary | 37.3µs | 31.6µs | 1.18× | 0.00e+00 | 138.7µs | 126.8µs | 1.09× | 0.00e+00 |
Best Interp/Exec↑ SC: Float32 at 1.38× | Best MC: Int2 at 1.32×
Anomaly: Ternary SC executor slower (0.75×) while MC faster (1.29×) — same |Δ|=0; likely kernel path + 5-pass noise. Worth noting if revisiting.
1×1×1 executor (reference)
Most dtypes ~1.0–1.1× — fat matmul dominates; dispatch tax invisible.
When executor wins (rule of thumb)
noticeability ∝ (dispatch hops) / (compute per hop)
| Workload | Executor feel |
|---|---|
| Dense 3×3×3, width 4 | Noticeable (~1.2–1.4×) |
| Dense 1×1×1, pyramid | Barely (~1.0–1.1×) |
| MHA / SwiGLU / CNN | Less — heavy per-hop compute |
| Training | None — executor forward-only |
| GPU path | Different tier — wgpu_forward.go |
Part IV — Lessons learned
1. Interpret dispatch is real but bounded
Executor v1 isolates ~20–35% overhead from DispatchLayer + routing on 3×3×3 dense. That is not enough to explain ASM being 0.5× on the same grid — the bottleneck is 189 × tiny GEMV, not the jump table alone.
2. ASM economics need layer mass
Native quant ASM needs enough FLOPs per call to amortize:
- Plan 9 entry
- Per-row activation quant buffers
- Tile orchestration
- Output scale + activation
4×4 on 3³ fails; 16×48 on 1³ succeeds.
3. Compare apples to apples in benchmarks
Lucy Go column used float-dequant tiled forward for quant dtypes; ASM used native BitNet. Parity was split (requiresAsmGoTiledParity vs requiresAsmBitNetParity) but the timing comparison was semantically asymmetric.
4. Executor v1 ≠ "fused volumetric"
True fusion means e.g. one batched GEMV per layer-index across all 27 cells, pooled activations, ASM called once per slab — not 189 direct kernel calls with dispatch removed.
5. Complexity inventory
- Dual forward paths (interpret / executor / asm / bitnet ref)
- Per-dtype asm branches in Go +
.sstubs per arch - Lucy table proliferation (ASM + executor + determinism + save/reload)
- MC threading bugs in quant buffer pooling
- I2S layout constraints (cols % 128)
Gain: ~1.3× on a narrow CPU interpret forward for one layer type on one grid shape.
Verdict: Good experiment, wrong ROI to maintain.
Part V — What v2 would have required
If revisiting (e.g. after GPU path matures):
- Cross-cell batching — for each
lin0..LayersPerCell-1, stack 27 cell activations → onebatch×4GEMV (or wider). - ASM inside executor — single plan entry point; no
DispatchLayerand no per-layer asm re-entry. - Buffer pool —
xq/xu/output tiles allocated per frame, not per row batch. - Full I2_S SIMD — port AVX2/NEON from Microsoft BitNet for ternary at scale.
- GPU executor tier — mirror plan on device;
BeginFrame+ block dispatches (partial precedent inwgpu_forward.go).
Part VI — File inventory (exploration code to delete)
Use git status / git log for the exact commit set. Approximate list:
poly/ — executor
volumetric_executor.gotests/dense_executor_test.go
poly/ — BitNet W8A8 + I2S (if removing entire exploration)
dense_asm.go,dense_asm_dtypes.go,dense_asm_native.go,dense_asm_native_matmul.go,dense_asm_native_quant.godense_bitnet_w8a8_cpu.gobitnet_cpu.go,bitnet_i2s_cpu.gobitnet/(package)tests/dense_asm_parity_test.go,tests/bitnet_i2s_test.go(and related)
poly/asm/bitnet/
dot.go,dot_decl.go,dot_stub.goi2s_amd64.s,i2s_arm64.sternary_amd64.s,ternary_arm64.s
lucy/examples/seven_layer/
- Executor fields in
summary.go(DTypeRowexecutor columns,printDenseExecutorTimingTable) - Executor capture in
runner.go,benchmarkExecutorForwardincommon.go dense_executor_3x3_test.go
Revert / keep decisions
| Keep | Remove (exploration-only) |
|---|---|
poly/asm/dot, matmul, dense if predating this work and used elsewhere |
volumetric_executor.go + Lucy executor table |
ForwardPolymorphic, DispatchLayer |
BitNet I2S asm if unused |
wgpu_forward.go GPU path |
dense_asm_native_quant.go if reverting to pre-BitNet asm |
docs/bitnet_cpu.md (older ternary doc) |
This archive doc keep |
Note: poly/asm/README.md may describe paths you're removing — update or revert that README when deleting code.
Part VII — How to reproduce (before delete)
# Poly parity
cd loom/poly
go test ./tests/ -run 'DenseExecutor|DenseAsm|BitNet' -count=1
# Lucy Dense 1³ regression (21 dtypes + ASM + executor tables)
cd loom/lucy/examples/seven_layer
go test -run TestRunLayerSuiteDense1x1AllDTypes -v -count=1
# Lucy 3³ executor smoke
go test -run TestDenseExecutor3x3Float32Smoke -v -count=1
# Full menu
cd loom/lucy && go run . # → [7] Dense
Closing
The exploration validated:
- Native quant ASM can beat float-dequant Go on fat dense layers.
- It cannot rescue thin volumetric grids without batching.
- Skipping dispatch saves ~25%, not 2×.
- A real volumetric performance story needs fusion below the grid, not more interpreter variants.
Copy this document, then delete the code. The interpreter + GPU paths remain the production-shaped architecture.