Docs Memory history, GPU load, and HF→entity convert

Memory history, GPU load, and HF→entity convert

This page covers poly/memory_history.go: timed samples during LLM GPU upload, the in-terminal chart Lucy prints after load, and how memory policy ties to block-wise safetensor import (HF → .entity convert) and block-wise GPU upload (.entity → chat) so we do not hold full CPU and GPU weight copies at once.


Why this exists

When a transformer moves from CPU weights to GPU buffers, peak RAM matters — especially on mobile (SoulGlitch / iOS) and when loading large .entity checkpoints.

Two failure modes were measured on Lucy ENTITY Talk [8] (Qwen3-0.6B):

GPU load (.entity → chat, GPU enabled)

Phase Old behavior Fixed behavior
Decoder blocks Already OK — ReleaseInferenceHostWeights() after each block Same (~7.5 MB host freed per block)
Globals (embeddings, LM head, final norm) Single SyncToGPU() uploaded all globals while ~1187 MB CPU weights still resident → ~2637 MB host+gpu overlap SyncGlobalWeightsToGPUSequential() uploads one global tensor, releases CPU, then next → ~2044 MB peak overlap
Steady state Host weights eventually dropped Host 0 MB, GPU ~1451 MB

The remaining ~2044 MB GPU-load peak is a brief per-tensor overlap during each global upload (CPU slice still present until that tensor’s GPU buffer exists). It is not the old “entire model doubled at once” bug.

HF → .entity convert (safetensors → save)

Phase Old behavior Fixed behavior
Import LoadSafetensors() into one map, then LoadWithPrefixes() copied into WeightStore.Master while the map stayed resident → ~2× decoder weights until GC Globals first, then one decoder block at a time + ReleaseTransientSafetensorMap() after each block
Encode + save Full FP32 decoder in RAM through SerializeEntityTransformer (all layers + growing []byte payload + final file buffer) Low-RAM path: ImportHFSaveEntityTransformerBlockwise — Q4/FP32 bake one block at a time into a streaming payload file, releaseEntityConvertLayerWeights() after each block, then writeEntityWireStreaming() (header + io.Copy payload — no full-file []byte)
BitNet Already block-wise via ImportHFBitNetCheckpointDir Unchanged (still SaveEntityTransformer one-shot)

Memory history is not wired to the convert step yet — only GPU chat load. Convert progress is visible via HFEntityConvertProgress callbacks (SoulGlitch task UI) or Lucy safetensor import logs (see below).


Enabling recording

Lucy (interactive)

When GPU is enabled, Lucy prompts:

📈 Measure memory during GPU load? (terminal chart after load — CPU weights vs GPU upload vs release) (1=yes / 0=no) [1]:

This appears in Poly Talk [1] and ENTITY Talk [8] (lucy/poly_talk_session.go, lucy/hf_entity.go). Answering yes calls poly.SetMemoryHistoryRecording(true).

Block-by-block upload is a separate prompt:

📥 Block-by-block GPU upload? (1=yes / 0=no) [0]:

For meaningful charts, use GPU + block upload + measure memory.

Environment variables

Variable Effect
LOOM_MEMORY_HISTORY=1 Enable sampling without the Lucy prompt (off by default)
LOOM_MEMORY_HISTORY_JSON=/path/out.json After load, write samples as JSON in addition to the terminal report
COLUMNS=100 Widen the braille chart (default 80, max 120)

Lucy’s runtime prompt override takes precedence over env when set.


What gets recorded

Each sample is a poly.MemorySample:

Field Meaning
elapsed_sec Seconds since session start
label Step name (e.g. block_03_after_release, embeddings_after_sync)
host_weights_mb Poly-accounted CPU model weights (MemoryFootprint.HostWeightsMB)
gpu_weights_mb Poly-accounted GPU weight buffers
gpu_kv_mb KV cache reservation on GPU
vram_total_mb Total VRAM usage from GetVRAMUsage()
heap_alloc_mb / heap_sys_mb Go runtime heap
process_rss_mb OS process RSS (getrusage on Unix; 0 on unsupported platforms)

Important: host_weights_mb + gpu_weights_mb is the Poly overlap metric used for diagnosis. RSS can stay high after host weights drop because Go/OS may retain pages until memory pressure — that is called out in the diagnosis block.


Terminal output

When the GPU load session finishes, GlobalMemoryHistory.FinishSession() prints:

  1. Braille chart — four series: host weights (H), GPU weights (G), process RSS (R), VRAM (V)
  2. ASCII sparklines — same series as .:-=+*# ramps (works in all terminals)
  3. Sample log — table of every labeled step
  4. Peak overlap linepeak host+gpu Poly weights overlap: X MB when overlap exceeds baseline by >5%
  5. Diagnosis block — pass/fail hints for block release, embeddings release, global sequential upload, RSS retention

Example labels during a healthy ENTITY GPU load:

block_01_before_sync … block_28_after_release
embeddings_before_sync → embeddings_after_sync → embeddings_after_release
lm_head_before_sync → lm_head_after_sync → lm_head_after_release
final_norm_before_sync → final_norm_after_sync → final_norm_after_release
host_weights_released → after_gc

Legacy builds used a single embeddings_on_gpu label; diagnosis still recognizes that for regression comparison.


API (poly)

// Process-wide recorder (Lucy uses this)
poly.GlobalMemoryHistory.BeginSession("entity_gpu_load")
poly.RecordFromTransformer(poly.GlobalMemoryHistory, tr, "block_01_after_sync")
_ = poly.GlobalMemoryHistory.FinishSession() // chart + table + diagnosis

// Toggle without env
poly.SetMemoryHistoryRecording(true)
poly.ResetMemoryHistoryRecording()

// Footprint at any time
fp := poly.NewMemoryFootprintFromTransformer(tr)
fmt.Printf("host %.1f MB | gpu %.1f MB\n", fp.HostWeightsMB, fp.GPUWeightsMB)

Source files:

File Role
memory_history.go MemoryHistory, samples, diagnosis
memory_history_chart.go Braille chart + sparklines
process_memory_unix.go RSS via getrusage
process_memory_stub.go RSS stub on unsupported OS
hf_entity_convert.go ImportHFSaveEntityTransformerBlockwise(Progress)
entity_convert_io.go Streaming payload acc, per-block Q4 encode, writeEntityWireStreaming
hf_import.go Block-wise safetensor import (ImportHFCheckpointDir)
poly/tests/memory_history_test.go Unit tests

HF → .entity convert (import + encode memory)

There are two llama-style convert lanes in poly today:

Lane API Peak RAM during convert Who uses it
Standard ImportHFCheckpointDirSaveEntityTransformer Block-wise safetensor import, then full FP32 network held through encode Lucy [8] convertEntityEntry, ImportHFToEntity
Low-RAM encode ImportHFSaveEntityTransformerBlockwise (+ optional Progress) ~one decoder block FP32 + globals briefly at start + payload temp file on disk SoulGlitch / mvp-simulation (iOS/macOS .entity convert)

Both lanes share the same block-wise safetensor import in hf_import.go. The low-RAM lane adds block-wise encode in hf_entity_convert.go + entity_convert_io.go.

Block-wise safetensor import (both lanes)

// 1. Globals only
LoadSafetensorsSelective(f, HFWeightIsGlobal)
mapper.MapWeights(globalTensors) // embeddings, lm_head, final_norm

// 2. One transformer block at a time
for li := 0; li < numLayers; li++ {
    LoadSafetensorsSelective(sf, HFWeightMatchesLayer(k, li))
    LoadWithPrefixes(net, layerMap)   // copy into WeightStore.Master
    ReleaseTransientSafetensorMap(layerMap)
}
ReleaseTransientSafetensorMap(globalTensors, embeddings, lmHead, finalNorm)

copyWeights in prefix_safetensor.go copies HF slices into Master; without per-block release, the full safetensor map and the network both held decoder weights.

Block-wise encode + streaming save (low-RAM lane only)

// ImportHFSaveEntityTransformerBlockwiseProgress(modelDir, entityPath, weightDType, progress)

// 1. Encode globals → payload temp file; drop embeddings/lm_head/final_norm from RAM
collectEntityGlobalBlobAcc("embeddings", …)
if !entityLMHeadTied(embeddings, lmHead) {
    collectEntityGlobalBlobAcc("lm_head", …)  // same rule as SaveEntityTransformer
}
collectEntityGlobalBlobAcc("final_norm", …)
embeddings, lmHead, finalNorm = nil; GC

// 2. Per transformer block
for li := 0; li < numLayers; li++ {
    LoadSafetensorsSelective + LoadWithPrefixes  // one block FP32 in net
    ReleaseTransientSafetensorMap(layerMap)
    for j := 0; j < 4; j++ {
        collectEntityWeightBlobsAcc(&net.Layers[base+j], …, weightDType) // Q4_0 bake if INT4
        releaseEntityConvertLayerWeights(&net.Layers[base+j])             // drop Master
    }
    GC
}

// 3. Write .entity: fixed header + JSON blob index + io.Copy(payload file)
writeEntityWireStreaming(entityPath, net, trSpec, blobs, payloadPath)

What gets Q4-baked: decoder MHA + SwiGLU only (via collectEntityQ4_0LayerAcc). RMSNorm, MHA q_norm/k_norm, embeddings, lm_head, final_norm stay FP32 — same rules as entity_q4.go / SaveEntityTransformer.

Progress callback (HFEntityConvertProgress): blockIndex is 1-based per packed block; detail like packed block 14/28. SoulGlitch maps this to its convert task progress bar.

Terminal signature (Lucy standard convert)

When using ImportHFCheckpointDir + SaveEntityTransformer (Lucy [8] convertEntityEntry), a Qwen3-0.6B reconvert prints three global ✓ Loaded … lines, then num_hidden_layers lines of ✅ Finished loading weights with prefixes. (28 for Qwen3-0.6B). The old bulk path printed one bulk load without per-block messages.

Example:

⏳ Converting Qwen/Qwen3-0.6B → lucy_entities/Qwen--Qwen3-0.6B.entity [Q4 (INT4)] …
  ✓ Loaded model.embed_tokens.weight: … (role: embeddings)
  ✓ Loaded model.norm.weight: … (role: final_norm)
  ✓ Loaded lm_head.weight: … (role: lm_head)
✅ Finished loading weights with prefixes.   ← block 1
…                                            ← repeat per layer
   ✅ Qwen--Qwen3-0.6B.entity  …

The low-RAM lane does not print those per-block safetensor lines (import is silent); use HFEntityConvertProgress or reconvert on Mac with Lucy to compare.

Supported converts

Path API Safetensor import Encode / save
Lucy [8] convert convertEntityEntry ✅ block-wise SaveEntityTransformer (full network in RAM during encode)
Programmatic (standard) ImportHFToEntity / ImportHFCheckpointDir + SaveEntityTransformer ✅ block-wise Full network during encode
Programmatic (low-RAM) ImportHFSaveEntityTransformerBlockwise(Progress) ✅ block-wise ✅ block-wise encode + streaming payload
SoulGlitch convert mvp → ImportHFSaveEntityTransformerBlockwiseProgress ✅ (+ CHGLUE standalone wrapper streams loom bytes — app layer)
BitNet ImportHFBitNetCheckpointDir + SaveEntityTransformer ✅ (packed ternary per block) One-shot save

Remaining encode overlap (expected)

Even the low-RAM lane still holds one block’s FP32 weights plus the globals encode spike at the start (embeddings + lm_head for large-vocab models are large). That is much smaller than holding all blocks through SerializeEntityTransformer, but not zero — see roadmap below.


GPU load path (what the history measures)

Lucy centralizes inference GPU setup in lucy/inference_setup.gosetupTransformerForInference. Welvet SoulGlitch mirrors the same policy in welvet/cabi/llm_ext.go (LoomCreateLLM).

Step 1 — Init WGPU

tr.Network.InitWGPU()

Step 2 — Decoder blocks (when sequentialGPULoad)

For each transformer block (4 grid layers: input norm, MHA, post-attn norm, SwiGLU):

layer.SyncToGPU()
(&tr.Network.Layers[idx]).ReleaseInferenceHostWeights()

Step 3 — Global weights (sequential)

Prefer SyncGlobalWeightsToGPUSequential() over bulk SyncToGPU() for inference load:

tr.SyncEmbeddingsToGPU(); tr.ReleaseEmbeddingsHost()
tr.SyncLMHeadToGPU();     tr.ReleaseLMHeadHost()      // skips duplicate buffer when tied
tr.SyncFinalNormToGPU();  tr.ReleaseFinalNormHost()
// or:
tr.SyncGlobalWeightsToGPUSequential()

SyncToGPU() still uploads all three globals without mid-upload CPU release — kept for training paths and legacy callers.

Step 4 — Warmup and final cleanup

_, _ = tr.ForwardTokenIDsWGPU([]uint32{0}, nil, true, true)
tr.Reset()
tr.ReleaseInferenceHostWeights() // sweep any remaining host slices
runtime.GC()
debug.FreeOSMemory()

Where each policy is used

Caller Import / convert Decoder upload Global upload
Lucy [8] HF → .entity Block-wise import + SaveEntityTransformer
SoulGlitch / mvp HF → .entity ImportHFSaveEntityTransformerBlockwiseProgress
Lucy setupTransformerForInference Block-wise + release SyncGlobalWeightsToGPUSequential
Welvet LoomCreateLLM (safetensors) Block-wise (chat load) Block-wise + release SyncGlobalWeightsToGPUSequential
Welvet LoomSyncToGPU / bulk SyncToGPU() All layers, no mid-release Bulk, no mid-release
Training / demos calling SyncToGPU() directly Varies Varies Bulk

Entity on SoulGlitch: LoomLoadEntityTransformerAs builds a full CPU transformer only. GPU setup must follow the Lucy sequence above (or a future LoomCreateLLMFromEntity export). See entity.md — GPU load.


Further peak reduction (roadmap)

  • GPU load: stream or mmap entity globals so embeddings/LM head never exist as full FP32 CPU slices before GPU upload; quantize-on-upload for globals (v1 entity keeps globals FP32 on disk)
  • Convert: optional memory history during convert (same chart as GPU load); stream globals encode in two passes to shrink the initial globals spike on mobile
  • Load: staged DeserializeEntityWithOptions + block GPU upload without full-file deserialize peak (see entity.md)

See also