Memory history, GPU load, and HF→entity convert

This page covers poly/memory_history.go: the timed samples taken during LLM GPU upload and the in-terminal chart Lucy prints after load. It also explains how the memory policy ties into block-wise safetensor import (HF → .entity convert) and block-wise GPU upload (.entity → chat), so that full CPU and GPU weight copies are not held at the same time.

Why this exists

When a transformer moves from CPU weights to GPU buffers, peak RAM matters — especially on mobile (SoulGlitch / iOS) and when loading large .entity checkpoints.

Two failure modes were measured on Lucy ENTITY Talk [8] (Qwen3-0.6B):

GPU load (`.entity` → chat, GPU enabled)

Phase	Old behavior	Fixed behavior
Decoder blocks	Already OK — `ReleaseInferenceHostWeights()` after each block	Same (~7.5 MB host freed per block)
Globals (embeddings, LM head, final norm)	Single `SyncToGPU()` uploaded all globals while ~1187 MB CPU weights still resident → ~2637 MB host+gpu overlap	`SyncGlobalWeightsToGPUSequential()` uploads one global tensor, releases CPU, then next → ~2044 MB peak overlap
Steady state	Host weights eventually dropped	Host 0 MB, GPU ~1451 MB

The remaining ~2044 MB GPU-load peak is a brief per-tensor overlap during each global upload (CPU slice still present until that tensor’s GPU buffer exists). It is not the old “entire model doubled at once” bug.
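The difference between the two global-upload policies can be modeled with a few lines of arithmetic. The following is a minimal sketch, not poly code: it only tracks the global tensors (decoder blocks already on the GPU are ignored), and the tensor sizes are hypothetical placeholders rather than measured values.

```go
package main

import "fmt"

// peakBulk models the old path: every global tensor is uploaded while all
// CPU copies are still resident, so the peak is host total + GPU total.
func peakBulk(sizesMB []float64) float64 {
	total := 0.0
	for _, s := range sizesMB {
		total += s
	}
	return 2 * total
}

// peakSequential models upload-then-release per tensor: the only overlap is
// the tensor currently in flight, whose CPU slice and GPU buffer coexist.
func peakSequential(sizesMB []float64) float64 {
	host := 0.0
	for _, s := range sizesMB {
		host += s
	}
	gpu, peak := 0.0, host
	for _, s := range sizesMB {
		gpu += s // GPU buffer now exists; CPU slice is still present
		if host+gpu > peak {
			peak = host + gpu
		}
		host -= s // CPU slice released before the next tensor
	}
	return peak
}

func main() {
	// Hypothetical sizes in MB (embeddings, LM head, final norm).
	globals := []float64{600, 600, 1}
	fmt.Printf("bulk peak:       %.0f MB\n", peakBulk(globals))
	fmt.Printf("sequential peak: %.0f MB\n", peakSequential(globals))
}
```

In this model the sequential peak is the total plus the largest single tensor, which is why a brief per-tensor overlap remains even after the fix.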

HF → `.entity` convert (safetensors → save)

Phase	Old behavior	Fixed behavior
Import	`LoadSafetensors()` into one map, then `LoadWithPrefixes()` copied into `WeightStore.Master` while the map stayed resident → ~2× decoder weights until GC	Globals first, then one decoder block at a time + `ReleaseTransientSafetensorMap()` after each block
Encode + save	Full FP32 decoder in RAM through `SerializeEntityTransformer` (all layers + growing `[]byte` payload + final file buffer)	Low-RAM path: `ImportHFSaveEntityTransformerBlockwise` — Q4/FP32 bake one block at a time into a streaming payload file, `releaseEntityConvertLayerWeights()` after each block, then `writeEntityWireStreaming()` (header + `io.Copy` payload — no full-file `[]byte`)
BitNet	Already block-wise via `ImportHFBitNetCheckpointDir`	Unchanged (still `SaveEntityTransformer` one-shot)

Memory history is not wired to the convert step yet — only GPU chat load. Convert progress is visible via HFEntityConvertProgress callbacks (SoulGlitch task UI) or Lucy safetensor import logs (see below).

Enabling recording

Lucy (interactive)

When GPU is enabled, Lucy prompts:

📈 Measure memory during GPU load? (terminal chart after load — CPU weights vs GPU upload vs release) (1=yes / 0=no) [1]:

This appears in Poly Talk [1] and ENTITY Talk [8] (lucy/poly_talk_session.go, lucy/hf_entity.go). Answering yes calls poly.SetMemoryHistoryRecording(true).

Block-by-block upload is a separate prompt:

📥 Block-by-block GPU upload? (1=yes / 0=no) [0]:

For a meaningful chart, enable all three: GPU, block-by-block upload, and memory measurement.

Environment variables

Variable	Effect
`LOOM_MEMORY_HISTORY=1`	Enable sampling without the Lucy prompt (off by default)
`LOOM_MEMORY_HISTORY_JSON=/path/out.json`	After load, write samples as JSON in addition to the terminal report
`COLUMNS=100`	Widen the braille chart (default 80, max 120)

When the Lucy prompt has been answered, that runtime override takes precedence over the environment setting.
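The precedence rule and the chart width limits can be sketched as follows. The helper names `recordingEnabled` and `chartWidth` are hypothetical and are not part of poly. The sketch also assumes that only the value `1` enables recording, and that an unset, invalid, or non-positive `COLUMNS` falls back to the default of 80.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// recordingEnabled is a hypothetical resolver: a non-nil override (the Lucy
// prompt answer) wins; otherwise LOOM_MEMORY_HISTORY=1 enables sampling.
func recordingEnabled(override *bool, envValue string) bool {
	if override != nil {
		return *override
	}
	return envValue == "1"
}

// chartWidth parses a COLUMNS value. Assumption: unset, invalid, or
// non-positive values fall back to 80; larger values are capped at 120.
func chartWidth(columns string) int {
	const defaultWidth, maxWidth = 80, 120
	n, err := strconv.Atoi(columns)
	if err != nil || n <= 0 {
		return defaultWidth
	}
	if n > maxWidth {
		return maxWidth
	}
	return n
}

func main() {
	enabled := recordingEnabled(nil, os.Getenv("LOOM_MEMORY_HISTORY"))
	width := chartWidth(os.Getenv("COLUMNS"))
	fmt.Println("recording:", enabled, "chart width:", width)
}
```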

What gets recorded

Each sample is a poly.MemorySample:

Field	Meaning
`elapsed_sec`	Seconds since session start
`label`	Step name (e.g. `block_03_after_release`, `embeddings_after_sync`)
`host_weights_mb`	Poly-accounted CPU model weights (`MemoryFootprint.HostWeightsMB`)
`gpu_weights_mb`	Poly-accounted GPU weight buffers
`gpu_kv_mb`	KV cache reservation on GPU
`vram_total_mb`	Total VRAM usage from `GetVRAMUsage()`
`heap_alloc_mb` / `heap_sys_mb`	Go runtime heap
`process_rss_mb`	OS process RSS (`getrusage` on Unix; 0 on unsupported platforms)

Important: host_weights_mb + gpu_weights_mb is the Poly overlap metric used for diagnosis. RSS can stay high after host weights drop because the Go runtime and the OS may retain pages until there is memory pressure; the diagnosis block calls this out.
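To inspect a file written via `LOOM_MEMORY_HISTORY_JSON` outside of Lucy, something like the sketch below may help. It assumes the file is a top-level JSON array of samples using the field names from the table above; the actual file layout is not specified on this page. The `sample` struct is illustrative, not `poly.MemorySample`, and the numbers are made up.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sample mirrors a subset of the documented JSON field names.
// Illustrative only; this is not the poly.MemorySample type.
type sample struct {
	ElapsedSec    float64 `json:"elapsed_sec"`
	Label         string  `json:"label"`
	HostWeightsMB float64 `json:"host_weights_mb"`
	GPUWeightsMB  float64 `json:"gpu_weights_mb"`
	ProcessRSSMB  float64 `json:"process_rss_mb"`
}

// parseSamples assumes the JSON document is a top-level array of samples.
func parseSamples(data []byte) ([]sample, error) {
	var out []sample
	err := json.Unmarshal(data, &out)
	return out, err
}

// peakOverlap returns the largest host+gpu sum and the label where it occurred.
func peakOverlap(samples []sample) (float64, string) {
	peak, label := 0.0, ""
	for _, s := range samples {
		if v := s.HostWeightsMB + s.GPUWeightsMB; v > peak {
			peak, label = v, s.Label
		}
	}
	return peak, label
}

func main() {
	// Made-up numbers for illustration only.
	data := []byte(`[
	  {"elapsed_sec":0.1,"label":"embeddings_before_sync","host_weights_mb":900,"gpu_weights_mb":300},
	  {"elapsed_sec":0.2,"label":"embeddings_after_sync","host_weights_mb":900,"gpu_weights_mb":800},
	  {"elapsed_sec":0.3,"label":"embeddings_after_release","host_weights_mb":400,"gpu_weights_mb":800}
	]`)
	samples, err := parseSamples(data)
	if err != nil {
		panic(err)
	}
	peak, label := peakOverlap(samples)
	fmt.Printf("peak host+gpu overlap: %.0f MB at %s\n", peak, label)
}
```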

Terminal output

When the GPU load session finishes, GlobalMemoryHistory.FinishSession() prints:

Braille chart — four series: host weights (H), GPU weights (G), process RSS (R), VRAM (V)
ASCII sparklines — same series as .:-=+*# ramps (works in all terminals)
Sample log — table of every labeled step
Peak overlap line — peak host+gpu Poly weights overlap: X MB when overlap exceeds baseline by >5%
Diagnosis block — pass/fail hints for block release, embeddings release, global sequential upload, RSS retention

Example labels during a healthy ENTITY GPU load:

block_01_before_sync … block_28_after_release
embeddings_before_sync → embeddings_after_sync → embeddings_after_release
lm_head_before_sync → lm_head_after_sync → lm_head_after_release
final_norm_before_sync → final_norm_after_sync → final_norm_after_release
host_weights_released → after_gc

Legacy builds used a single embeddings_on_gpu label; diagnosis still recognizes that for regression comparison.
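When post-processing the sample log, the labels above can be split into a component and a phase. The parser below is a hypothetical helper written against the label shapes listed in this section; it is not how poly's diagnosis is implemented, and labels without a recognized phase suffix (such as `after_gc` or the legacy `embeddings_on_gpu`) are simply reported as not parsed.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// phases are the suffixes seen in the example labels.
var phases = []string{"before_sync", "after_sync", "after_release"}

// parseLabel splits a label into component, block number, and phase.
// "block_03_after_release" -> ("block", 3, "after_release", true)
// "lm_head_after_sync"     -> ("lm_head", 0, "after_sync", true)
// Labels without a known phase suffix return ok == false.
func parseLabel(label string) (component string, block int, phase string, ok bool) {
	for _, p := range phases {
		suffix := "_" + p
		if !strings.HasSuffix(label, suffix) {
			continue
		}
		component = strings.TrimSuffix(label, suffix)
		if strings.HasPrefix(component, "block_") {
			n, err := strconv.Atoi(strings.TrimPrefix(component, "block_"))
			if err != nil {
				return "", 0, "", false
			}
			return "block", n, p, true
		}
		return component, 0, p, true
	}
	return "", 0, "", false
}

func main() {
	for _, l := range []string{"block_03_after_release", "lm_head_after_sync", "after_gc"} {
		c, b, p, ok := parseLabel(l)
		fmt.Printf("%-24s component=%q block=%d phase=%q ok=%v\n", l, c, b, p, ok)
	}
}
```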

API (poly)

// Process-wide recorder (Lucy uses this)
poly.GlobalMemoryHistory.BeginSession("entity_gpu_load")
poly.RecordFromTransformer(poly.GlobalMemoryHistory, tr, "block_01_after_sync")
_ = poly.GlobalMemoryHistory.FinishSession() // chart + table + diagnosis

// Toggle without env
poly.SetMemoryHistoryRecording(true)
poly.ResetMemoryHistoryRecording()

// Footprint at any time
fp := poly.NewMemoryFootprintFromTransformer(tr)
fmt.Printf("host %.1f MB | gpu %.1f MB\n", fp.HostWeightsMB, fp.GPUWeightsMB)

Source files:

File	Role
`memory_history.go`	`MemoryHistory`, samples, diagnosis
`memory_history_chart.go`	Braille chart + sparklines
`process_memory_unix.go`	RSS via `getrusage`
`process_memory_stub.go`	RSS stub on unsupported OS
`hf_entity_convert.go`	`ImportHFSaveEntityTransformerBlockwise(Progress)`
`entity_convert_io.go`	Streaming payload accumulator, per-block Q4 encode, `writeEntityWireStreaming`
`hf_import.go`	Block-wise safetensor import (`ImportHFCheckpointDir`)
`poly/tests/memory_history_test.go`	Unit tests

HF → `.entity` convert (import + encode memory)

There are two llama-style convert lanes in poly today:

Lane	API	Peak RAM during convert	Who uses it
Standard	`ImportHFCheckpointDir` → `SaveEntityTransformer`	Block-wise safetensor import, then full FP32 network held through encode	Lucy `[8]` `convertEntityEntry`, `ImportHFToEntity`
Low-RAM encode	`ImportHFSaveEntityTransformerBlockwise` (+ optional `Progress`)	~one decoder block FP32 + globals briefly at start + payload temp file on disk	SoulGlitch / mvp-simulation (iOS/macOS `.entity` convert)

Both lanes share the same block-wise safetensor import in hf_import.go. The low-RAM lane adds block-wise encode in hf_entity_convert.go + entity_convert_io.go.

Block-wise safetensor import (both lanes)

// 1. Globals only
LoadSafetensorsSelective(f, HFWeightIsGlobal)
mapper.MapWeights(globalTensors) // embeddings, lm_head, final_norm

// 2. One transformer block at a time
for li := 0; li < numLayers; li++ {
    LoadSafetensorsSelective(sf, HFWeightMatchesLayer(k, li))
    LoadWithPrefixes(net, layerMap)   // copy into WeightStore.Master
    ReleaseTransientSafetensorMap(layerMap)
}
ReleaseTransientSafetensorMap(globalTensors, embeddings, lmHead, finalNorm)

copyWeights in prefix_safetensor.go copies HF slices into Master; without per-block release, the full safetensor map and the network both held decoder weights.
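The selective load relies on name predicates to decide which tensors belong to each pass. A rough sketch of such predicates is shown below. `isGlobal` and `matchesLayer` are hypothetical stand-ins for `HFWeightIsGlobal` and `HFWeightMatchesLayer`, whose real signatures are not shown on this page, and the `model.layers.<index>.` prefix is assumed from the usual Hugging Face llama-style tensor naming.

```go
package main

import (
	"fmt"
	"strings"
)

// isGlobal reports whether name is one of the three global weights that
// appear in Lucy's import log. Hypothetical stand-in for HFWeightIsGlobal.
func isGlobal(name string) bool {
	switch name {
	case "model.embed_tokens.weight", "model.norm.weight", "lm_head.weight":
		return true
	}
	return false
}

// matchesLayer reports whether name belongs to decoder layer li, assuming a
// "model.layers.<index>." prefix. The trailing dot keeps layer 1 from
// matching layer 10.
func matchesLayer(name string, li int) bool {
	return strings.HasPrefix(name, fmt.Sprintf("model.layers.%d.", li))
}

// selectNames returns the tensor names one selective pass would read.
func selectNames(all []string, keep func(string) bool) []string {
	var out []string
	for _, n := range all {
		if keep(n) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	names := []string{
		"model.embed_tokens.weight",
		"model.layers.1.self_attn.q_proj.weight",
		"model.layers.10.self_attn.q_proj.weight",
		"model.norm.weight",
	}
	fmt.Println("globals:", selectNames(names, isGlobal))
	fmt.Println("layer 1:", selectNames(names, func(n string) bool { return matchesLayer(n, 1) }))
}
```

Because each pass only materializes the names it selects, the transient map never holds more than the globals or one block at a time.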

Block-wise encode + streaming save (low-RAM lane only)

// ImportHFSaveEntityTransformerBlockwiseProgress(modelDir, entityPath, weightDType, progress)

// 1. Encode globals → payload temp file; drop embeddings/lm_head/final_norm from RAM
collectEntityGlobalBlobAcc("embeddings", …)
if !entityLMHeadTied(embeddings, lmHead) {
    collectEntityGlobalBlobAcc("lm_head", …)  // same rule as SaveEntityTransformer
}
collectEntityGlobalBlobAcc("final_norm", …)
embeddings, lmHead, finalNorm = nil, nil, nil; runtime.GC()

// 2. Per transformer block
for li := 0; li < numLayers; li++ {
    LoadSafetensorsSelective + LoadWithPrefixes  // one block FP32 in net
    ReleaseTransientSafetensorMap(layerMap)
    for j := 0; j < 4; j++ {
        collectEntityWeightBlobsAcc(&net.Layers[base+j], …, weightDType) // Q4_0 bake if INT4
        releaseEntityConvertLayerWeights(&net.Layers[base+j])             // drop Master
    }
    runtime.GC()
}

// 3. Write .entity: fixed header + JSON blob index + io.Copy(payload file)
writeEntityWireStreaming(entityPath, net, trSpec, blobs, payloadPath)
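The idea behind step 3, writing a header and then copying the payload from disk without building a full-file `[]byte`, can be illustrated with the standard library alone. The sketch below uses a toy layout (an 8-byte length followed by the payload) that is deliberately not the `.entity` wire format, and `writeStreaming` and `demo` are hypothetical names.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// writeStreaming writes an 8-byte little-endian payload length, then the
// payload copied straight from payloadPath. Toy layout, NOT the .entity format.
func writeStreaming(outPath, payloadPath string) (written int64, err error) {
	payload, err := os.Open(payloadPath)
	if err != nil {
		return 0, err
	}
	defer payload.Close()

	info, err := payload.Stat()
	if err != nil {
		return 0, err
	}

	out, err := os.Create(outPath)
	if err != nil {
		return 0, err
	}
	defer func() {
		// Surface a close error only if nothing failed earlier.
		if cerr := out.Close(); err == nil {
			err = cerr
		}
	}()

	var header [8]byte
	binary.LittleEndian.PutUint64(header[:], uint64(info.Size()))
	if _, err = out.Write(header[:]); err != nil {
		return 0, err
	}

	// io.Copy streams file to file; the payload is never loaded into one []byte.
	n, err := io.Copy(out, payload)
	return int64(len(header)) + n, err
}

// demo writes a small payload file and streams it into an output file.
func demo() (int64, error) {
	dir, err := os.MkdirTemp("", "stream-demo")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	payloadPath := filepath.Join(dir, "payload.bin")
	if err := os.WriteFile(payloadPath, []byte("block0block1"), 0o600); err != nil {
		return 0, err
	}
	return writeStreaming(filepath.Join(dir, "out.bin"), payloadPath)
}

func main() {
	n, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println("bytes written:", n)
}
```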

What gets Q4-baked: decoder MHA + SwiGLU only (via collectEntityQ4_0LayerAcc). RMSNorm, MHA q_norm/k_norm, embeddings, lm_head, final_norm stay FP32 — same rules as entity_q4.go / SaveEntityTransformer.

Progress callback (HFEntityConvertProgress): blockIndex is 1-based per packed block; detail like packed block 14/28. SoulGlitch maps this to its convert task progress bar.
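A consumer of the callback might turn the 1-based block index into a progress-bar fraction along these lines. The function names and parameters here are hypothetical; the real `HFEntityConvertProgress` signature is not shown on this page, so treat this purely as a sketch of the mapping.

```go
package main

import "fmt"

// progressFraction converts a 1-based block index into a 0..1 fraction,
// clamping out-of-range input. Hypothetical helper, not part of poly.
func progressFraction(blockIndex, totalBlocks int) float64 {
	if totalBlocks <= 0 || blockIndex <= 0 {
		return 0
	}
	if blockIndex >= totalBlocks {
		return 1
	}
	return float64(blockIndex) / float64(totalBlocks)
}

// progressDetail formats a detail string in the style quoted above.
func progressDetail(blockIndex, totalBlocks int) string {
	return fmt.Sprintf("packed block %d/%d", blockIndex, totalBlocks)
}

func main() {
	const total = 28
	for _, i := range []int{1, 14, 28} {
		fmt.Printf("%s -> %.2f\n", progressDetail(i, total), progressFraction(i, total))
	}
}
```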

Terminal signature (Lucy standard convert)

When using ImportHFCheckpointDir + SaveEntityTransformer (Lucy [8] convertEntityEntry), a Qwen3-0.6B reconvert prints three global ✓ Loaded … lines, followed by one ✅ Finished loading weights with prefixes. line per decoder layer (num_hidden_layers lines in total, 28 for Qwen3-0.6B). The old bulk path printed a single bulk load with no per-block messages.

Example:

⏳ Converting Qwen/Qwen3-0.6B → lucy_entities/Qwen--Qwen3-0.6B.entity [Q4 (INT4)] …
  ✓ Loaded model.embed_tokens.weight: … (role: embeddings)
  ✓ Loaded model.norm.weight: … (role: final_norm)
  ✓ Loaded lm_head.weight: … (role: lm_head)
✅ Finished loading weights with prefixes.   ← block 1
…                                            ← repeat per layer
   ✅ Qwen--Qwen3-0.6B.entity  …

The low-RAM lane does not print those per-block safetensor lines (import is silent); use HFEntityConvertProgress or reconvert on Mac with Lucy to compare.

Supported converts

Path	API	Safetensor import	Encode / save
Lucy `[8]` convert	`convertEntityEntry`	✅ block-wise	`SaveEntityTransformer` (full network in RAM during encode)
Programmatic (standard)	`ImportHFToEntity` / `ImportHFCheckpointDir` + `SaveEntityTransformer`	✅ block-wise	Full network during encode
Programmatic (low-RAM)	`ImportHFSaveEntityTransformerBlockwise(Progress)`	✅ block-wise	✅ block-wise encode + streaming payload
SoulGlitch convert	mvp → `ImportHFSaveEntityTransformerBlockwiseProgress`	✅	✅ (+ CHGLUE standalone wrapper streams loom bytes — app layer)
BitNet	`ImportHFBitNetCheckpointDir` + `SaveEntityTransformer`	✅ (packed ternary per block)	One-shot save

Remaining encode overlap (expected)

Even the low-RAM lane still holds one block’s FP32 weights plus the globals encode spike at the start (embeddings + lm_head for large-vocab models are large). That is much smaller than holding all blocks through SerializeEntityTransformer, but not zero — see roadmap below.

GPU load path (what the history measures)

Lucy centralizes inference GPU setup in lucy/inference_setup.go → setupTransformerForInference. Welvet SoulGlitch mirrors the same policy in welvet/cabi/llm_ext.go (LoomCreateLLM).

Step 1 — Init WGPU

tr.Network.InitWGPU()

Step 2 — Decoder blocks (when `sequentialGPULoad`)

For each transformer block (4 grid layers: input norm, MHA, post-attn norm, SwiGLU):

layer.SyncToGPU()
(&tr.Network.Layers[idx]).ReleaseInferenceHostWeights()

Step 3 — Global weights (sequential)

Prefer SyncGlobalWeightsToGPUSequential() over bulk SyncToGPU() for inference load:

tr.SyncEmbeddingsToGPU(); tr.ReleaseEmbeddingsHost()
tr.SyncLMHeadToGPU();     tr.ReleaseLMHeadHost()      // skips duplicate buffer when tied
tr.SyncFinalNormToGPU();  tr.ReleaseFinalNormHost()
// or:
tr.SyncGlobalWeightsToGPUSequential()

SyncToGPU() still uploads all three globals without mid-upload CPU release — kept for training paths and legacy callers.

Step 4 — Warmup and final cleanup

_, _ = tr.ForwardTokenIDsWGPU([]uint32{0}, nil, true, true)
tr.Reset()
tr.ReleaseInferenceHostWeights() // sweep any remaining host slices
runtime.GC()
debug.FreeOSMemory()
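The effect of the final `runtime.GC()` and `debug.FreeOSMemory()` calls on the Go heap can be observed with `runtime.ReadMemStats`. The sketch below is independent of poly: it allocates a stand-in weight slice, drops it, and compares `HeapAlloc` before and after. Whether `HeapSys` or process RSS shrinks as well depends on the runtime and the OS, which is the retention effect the diagnosis block warns about.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

const mb = 1 << 20

// heapMB returns the Go heap in use and the heap memory obtained from the OS, in MB.
func heapMB() (alloc, sys float64) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return float64(m.HeapAlloc) / mb, float64(m.HeapSys) / mb
}

// measureRelease allocates sizeMB of float32 stand-in weights, drops them,
// and reports HeapAlloc (MB) before and after the cleanup calls.
func measureRelease(sizeMB int) (before, after float64) {
	weights := make([]float32, sizeMB*mb/4)
	for i := range weights {
		weights[i] = 1 // touch every element so the pages are really used
	}
	before, _ = heapMB()
	runtime.KeepAlive(weights) // weights stay reachable until this point

	weights = nil // drop the only reference, like releasing host weights
	runtime.GC()
	debug.FreeOSMemory()
	after, _ = heapMB()
	return before, after
}

func main() {
	before, after := measureRelease(64)
	_, sys := heapMB()
	fmt.Printf("heap_alloc before: %.1f MB, after: %.1f MB, heap_sys now: %.1f MB\n", before, after, sys)
}
```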

Where each policy is used

Caller	Import / convert	Decoder upload	Global upload
Lucy `[8]` HF → `.entity`	Block-wise import + `SaveEntityTransformer`	—	—
SoulGlitch / mvp HF → `.entity`	`ImportHFSaveEntityTransformerBlockwiseProgress`	—	—
Lucy `setupTransformerForInference`	—	Block-wise + release	`SyncGlobalWeightsToGPUSequential`
Welvet `LoomCreateLLM` (safetensors)	Block-wise (chat load)	Block-wise + release	`SyncGlobalWeightsToGPUSequential`
Welvet `LoomSyncToGPU` / bulk `SyncToGPU()`	—	All layers, no mid-release	Bulk, no mid-release
Training / demos calling `SyncToGPU()` directly	Varies	Varies	Bulk

Entity on SoulGlitch: LoomLoadEntityTransformerAs builds a full CPU transformer only. GPU setup must follow the Lucy sequence above (or a future LoomCreateLLMFromEntity export). See entity.md — GPU load.

Further peak reduction (roadmap)

GPU load: stream or mmap entity globals so embeddings/LM head never exist as full FP32 CPU slices before GPU upload; quantize-on-upload for globals (v1 entity keeps globals FP32 on disk)
Convert: optional memory history during convert (same chart as GPU load); stream globals encode in two passes to shrink the initial globals spike on mobile
Load: staged DeserializeEntityWithOptions + block GPU upload without full-file deserialize peak (see entity.md)