Memory history, GPU load, and HF→entity convert
This page covers poly/memory_history.go: timed samples during LLM GPU upload, the in-terminal chart Lucy prints after load, and how memory policy ties to block-wise safetensor import (HF → .entity convert) and block-wise GPU upload (.entity → chat) so we do not hold full CPU and GPU weight copies at once.
Why this exists
When a transformer moves from CPU weights to GPU buffers, peak RAM matters — especially on mobile (SoulGlitch / iOS) and when loading large .entity checkpoints.
Two failure modes were measured on Lucy ENTITY Talk [8] (Qwen3-0.6B):
GPU load (.entity → chat, GPU enabled)
| Phase | Old behavior | Fixed behavior |
|---|---|---|
| Decoder blocks | Already OK — ReleaseInferenceHostWeights() after each block |
Same (~7.5 MB host freed per block) |
| Globals (embeddings, LM head, final norm) | Single SyncToGPU() uploaded all globals while ~1187 MB CPU weights still resident → ~2637 MB host+gpu overlap |
SyncGlobalWeightsToGPUSequential() uploads one global tensor, releases CPU, then next → ~2044 MB peak overlap |
| Steady state | Host weights eventually dropped | Host 0 MB, GPU ~1451 MB |
The remaining ~2044 MB GPU-load peak is a brief per-tensor overlap during each global upload (CPU slice still present until that tensor’s GPU buffer exists). It is not the old “entire model doubled at once” bug.
HF → .entity convert (safetensors → save)
| Phase | Old behavior | Fixed behavior |
|---|---|---|
| Import | LoadSafetensors() into one map, then LoadWithPrefixes() copied into WeightStore.Master while the map stayed resident → ~2× decoder weights until GC |
Globals first, then one decoder block at a time + ReleaseTransientSafetensorMap() after each block |
| Encode + save | Full FP32 decoder in RAM through SerializeEntityTransformer (all layers + growing []byte payload + final file buffer) |
Low-RAM path: ImportHFSaveEntityTransformerBlockwise — Q4/FP32 bake one block at a time into a streaming payload file, releaseEntityConvertLayerWeights() after each block, then writeEntityWireStreaming() (header + io.Copy payload — no full-file []byte) |
| BitNet | Already block-wise via ImportHFBitNetCheckpointDir |
Unchanged (still SaveEntityTransformer one-shot) |
Memory history is not wired to the convert step yet — only GPU chat load. Convert progress is visible via HFEntityConvertProgress callbacks (SoulGlitch task UI) or Lucy safetensor import logs (see below).
Enabling recording
Lucy (interactive)
When GPU is enabled, Lucy prompts:
📈 Measure memory during GPU load? (terminal chart after load — CPU weights vs GPU upload vs release) (1=yes / 0=no) [1]:
This appears in Poly Talk [1] and ENTITY Talk [8] (lucy/poly_talk_session.go, lucy/hf_entity.go). Answering yes calls poly.SetMemoryHistoryRecording(true).
Block-by-block upload is a separate prompt:
📥 Block-by-block GPU upload? (1=yes / 0=no) [0]:
For meaningful charts, use GPU + block upload + measure memory.
Environment variables
| Variable | Effect |
|---|---|
LOOM_MEMORY_HISTORY=1 |
Enable sampling without the Lucy prompt (off by default) |
LOOM_MEMORY_HISTORY_JSON=/path/out.json |
After load, write samples as JSON in addition to the terminal report |
COLUMNS=100 |
Widen the braille chart (default 80, max 120) |
Lucy’s runtime prompt override takes precedence over env when set.
What gets recorded
Each sample is a poly.MemorySample:
| Field | Meaning |
|---|---|
elapsed_sec |
Seconds since session start |
label |
Step name (e.g. block_03_after_release, embeddings_after_sync) |
host_weights_mb |
Poly-accounted CPU model weights (MemoryFootprint.HostWeightsMB) |
gpu_weights_mb |
Poly-accounted GPU weight buffers |
gpu_kv_mb |
KV cache reservation on GPU |
vram_total_mb |
Total VRAM usage from GetVRAMUsage() |
heap_alloc_mb / heap_sys_mb |
Go runtime heap |
process_rss_mb |
OS process RSS (getrusage on Unix; 0 on unsupported platforms) |
Important: host_weights_mb + gpu_weights_mb is the Poly overlap metric used for diagnosis. RSS can stay high after host weights drop because Go/OS may retain pages until memory pressure — that is called out in the diagnosis block.
Terminal output
When the GPU load session finishes, GlobalMemoryHistory.FinishSession() prints:
- Braille chart — four series: host weights (H), GPU weights (G), process RSS (R), VRAM (V)
- ASCII sparklines — same series as
.:-=+*#ramps (works in all terminals) - Sample log — table of every labeled step
- Peak overlap line —
peak host+gpu Poly weights overlap: X MBwhen overlap exceeds baseline by >5% - Diagnosis block — pass/fail hints for block release, embeddings release, global sequential upload, RSS retention
Example labels during a healthy ENTITY GPU load:
block_01_before_sync … block_28_after_release
embeddings_before_sync → embeddings_after_sync → embeddings_after_release
lm_head_before_sync → lm_head_after_sync → lm_head_after_release
final_norm_before_sync → final_norm_after_sync → final_norm_after_release
host_weights_released → after_gc
Legacy builds used a single embeddings_on_gpu label; diagnosis still recognizes that for regression comparison.
API (poly)
// Process-wide recorder (Lucy uses this)
poly.GlobalMemoryHistory.BeginSession("entity_gpu_load")
poly.RecordFromTransformer(poly.GlobalMemoryHistory, tr, "block_01_after_sync")
_ = poly.GlobalMemoryHistory.FinishSession() // chart + table + diagnosis
// Toggle without env
poly.SetMemoryHistoryRecording(true)
poly.ResetMemoryHistoryRecording()
// Footprint at any time
fp := poly.NewMemoryFootprintFromTransformer(tr)
fmt.Printf("host %.1f MB | gpu %.1f MB\n", fp.HostWeightsMB, fp.GPUWeightsMB)
Source files:
| File | Role |
|---|---|
memory_history.go |
MemoryHistory, samples, diagnosis |
memory_history_chart.go |
Braille chart + sparklines |
process_memory_unix.go |
RSS via getrusage |
process_memory_stub.go |
RSS stub on unsupported OS |
hf_entity_convert.go |
ImportHFSaveEntityTransformerBlockwise(Progress) |
entity_convert_io.go |
Streaming payload acc, per-block Q4 encode, writeEntityWireStreaming |
hf_import.go |
Block-wise safetensor import (ImportHFCheckpointDir) |
poly/tests/memory_history_test.go |
Unit tests |
HF → .entity convert (import + encode memory)
There are two llama-style convert lanes in poly today:
| Lane | API | Peak RAM during convert | Who uses it |
|---|---|---|---|
| Standard | ImportHFCheckpointDir → SaveEntityTransformer |
Block-wise safetensor import, then full FP32 network held through encode | Lucy [8] convertEntityEntry, ImportHFToEntity |
| Low-RAM encode | ImportHFSaveEntityTransformerBlockwise (+ optional Progress) |
~one decoder block FP32 + globals briefly at start + payload temp file on disk | SoulGlitch / mvp-simulation (iOS/macOS .entity convert) |
Both lanes share the same block-wise safetensor import in hf_import.go. The low-RAM lane adds block-wise encode in hf_entity_convert.go + entity_convert_io.go.
Block-wise safetensor import (both lanes)
// 1. Globals only
LoadSafetensorsSelective(f, HFWeightIsGlobal)
mapper.MapWeights(globalTensors) // embeddings, lm_head, final_norm
// 2. One transformer block at a time
for li := 0; li < numLayers; li++ {
LoadSafetensorsSelective(sf, HFWeightMatchesLayer(k, li))
LoadWithPrefixes(net, layerMap) // copy into WeightStore.Master
ReleaseTransientSafetensorMap(layerMap)
}
ReleaseTransientSafetensorMap(globalTensors, embeddings, lmHead, finalNorm)
copyWeights in prefix_safetensor.go copies HF slices into Master; without per-block release, the full safetensor map and the network both held decoder weights.
Block-wise encode + streaming save (low-RAM lane only)
// ImportHFSaveEntityTransformerBlockwiseProgress(modelDir, entityPath, weightDType, progress)
// 1. Encode globals → payload temp file; drop embeddings/lm_head/final_norm from RAM
collectEntityGlobalBlobAcc("embeddings", …)
if !entityLMHeadTied(embeddings, lmHead) {
collectEntityGlobalBlobAcc("lm_head", …) // same rule as SaveEntityTransformer
}
collectEntityGlobalBlobAcc("final_norm", …)
embeddings, lmHead, finalNorm = nil; GC
// 2. Per transformer block
for li := 0; li < numLayers; li++ {
LoadSafetensorsSelective + LoadWithPrefixes // one block FP32 in net
ReleaseTransientSafetensorMap(layerMap)
for j := 0; j < 4; j++ {
collectEntityWeightBlobsAcc(&net.Layers[base+j], …, weightDType) // Q4_0 bake if INT4
releaseEntityConvertLayerWeights(&net.Layers[base+j]) // drop Master
}
GC
}
// 3. Write .entity: fixed header + JSON blob index + io.Copy(payload file)
writeEntityWireStreaming(entityPath, net, trSpec, blobs, payloadPath)
What gets Q4-baked: decoder MHA + SwiGLU only (via collectEntityQ4_0LayerAcc). RMSNorm, MHA q_norm/k_norm, embeddings, lm_head, final_norm stay FP32 — same rules as entity_q4.go / SaveEntityTransformer.
Progress callback (HFEntityConvertProgress): blockIndex is 1-based per packed block; detail like packed block 14/28. SoulGlitch maps this to its convert task progress bar.
Terminal signature (Lucy standard convert)
When using ImportHFCheckpointDir + SaveEntityTransformer (Lucy [8] convertEntityEntry), a Qwen3-0.6B reconvert prints three global ✓ Loaded … lines, then num_hidden_layers lines of ✅ Finished loading weights with prefixes. (28 for Qwen3-0.6B). The old bulk path printed one bulk load without per-block messages.
Example:
⏳ Converting Qwen/Qwen3-0.6B → lucy_entities/Qwen--Qwen3-0.6B.entity [Q4 (INT4)] …
✓ Loaded model.embed_tokens.weight: … (role: embeddings)
✓ Loaded model.norm.weight: … (role: final_norm)
✓ Loaded lm_head.weight: … (role: lm_head)
✅ Finished loading weights with prefixes. ← block 1
… ← repeat per layer
✅ Qwen--Qwen3-0.6B.entity …
The low-RAM lane does not print those per-block safetensor lines (import is silent); use HFEntityConvertProgress or reconvert on Mac with Lucy to compare.
Supported converts
| Path | API | Safetensor import | Encode / save |
|---|---|---|---|
Lucy [8] convert |
convertEntityEntry |
✅ block-wise | SaveEntityTransformer (full network in RAM during encode) |
| Programmatic (standard) | ImportHFToEntity / ImportHFCheckpointDir + SaveEntityTransformer |
✅ block-wise | Full network during encode |
| Programmatic (low-RAM) | ImportHFSaveEntityTransformerBlockwise(Progress) |
✅ block-wise | ✅ block-wise encode + streaming payload |
| SoulGlitch convert | mvp → ImportHFSaveEntityTransformerBlockwiseProgress |
✅ | ✅ (+ CHGLUE standalone wrapper streams loom bytes — app layer) |
| BitNet | ImportHFBitNetCheckpointDir + SaveEntityTransformer |
✅ (packed ternary per block) | One-shot save |
Remaining encode overlap (expected)
Even the low-RAM lane still holds one block’s FP32 weights plus the globals encode spike at the start (embeddings + lm_head for large-vocab models are large). That is much smaller than holding all blocks through SerializeEntityTransformer, but not zero — see roadmap below.
GPU load path (what the history measures)
Lucy centralizes inference GPU setup in lucy/inference_setup.go → setupTransformerForInference. Welvet SoulGlitch mirrors the same policy in welvet/cabi/llm_ext.go (LoomCreateLLM).
Step 1 — Init WGPU
tr.Network.InitWGPU()
Step 2 — Decoder blocks (when sequentialGPULoad)
For each transformer block (4 grid layers: input norm, MHA, post-attn norm, SwiGLU):
layer.SyncToGPU()
(&tr.Network.Layers[idx]).ReleaseInferenceHostWeights()
Step 3 — Global weights (sequential)
Prefer SyncGlobalWeightsToGPUSequential() over bulk SyncToGPU() for inference load:
tr.SyncEmbeddingsToGPU(); tr.ReleaseEmbeddingsHost()
tr.SyncLMHeadToGPU(); tr.ReleaseLMHeadHost() // skips duplicate buffer when tied
tr.SyncFinalNormToGPU(); tr.ReleaseFinalNormHost()
// or:
tr.SyncGlobalWeightsToGPUSequential()
SyncToGPU() still uploads all three globals without mid-upload CPU release — kept for training paths and legacy callers.
Step 4 — Warmup and final cleanup
_, _ = tr.ForwardTokenIDsWGPU([]uint32{0}, nil, true, true)
tr.Reset()
tr.ReleaseInferenceHostWeights() // sweep any remaining host slices
runtime.GC()
debug.FreeOSMemory()
Where each policy is used
| Caller | Import / convert | Decoder upload | Global upload |
|---|---|---|---|
Lucy [8] HF → .entity |
Block-wise import + SaveEntityTransformer |
— | — |
SoulGlitch / mvp HF → .entity |
ImportHFSaveEntityTransformerBlockwiseProgress |
— | — |
Lucy setupTransformerForInference |
— | Block-wise + release | SyncGlobalWeightsToGPUSequential |
Welvet LoomCreateLLM (safetensors) |
Block-wise (chat load) | Block-wise + release | SyncGlobalWeightsToGPUSequential |
Welvet LoomSyncToGPU / bulk SyncToGPU() |
— | All layers, no mid-release | Bulk, no mid-release |
Training / demos calling SyncToGPU() directly |
Varies | Varies | Bulk |
Entity on SoulGlitch: LoomLoadEntityTransformerAs builds a full CPU transformer only. GPU setup must follow the Lucy sequence above (or a future LoomCreateLLMFromEntity export). See entity.md — GPU load.
Further peak reduction (roadmap)
- GPU load: stream or mmap entity globals so embeddings/LM head never exist as full FP32 CPU slices before GPU upload; quantize-on-upload for globals (v1 entity keeps globals FP32 on disk)
- Convert: optional memory history during convert (same chart as GPU load); stream globals encode in two passes to shrink the initial globals spike on mobile
- Load: staged
DeserializeEntityWithOptions+ block GPU upload without full-file deserialize peak (see entity.md)
See also
- gpu.md —
InitWGPU, per-layerSyncToGPU,ReleaseInferenceHostWeights - transformer.md —
SyncGlobalWeightsToGPUSequential, tied weights - entity.md — Lucy [8] ENTITY Talk flow
- testing_and_validation.md — reading Lucy logs