LlamaCppEx.Server.Strategy.Batch (LlamaCppEx v0.8.36)

Shared batch-assembly helpers used by the batching strategies.

The strategies (DecodeMaximal, PrefillPriority, Balanced) differ only in the order in which decode and prefill are scheduled and in how the token budget is split between them. The per-slot assembly of decode tokens and prefill chunks is identical, so it lives here.

Sampling, detokenization, and stream delivery all happen after the forward pass (fused into the batch_eval_sample NIF and the server's post-tick result handling) — these helpers only assemble batch entries and update slot bookkeeping.

Performance notes

These helpers run on every tick of the server loop, once per active slot, and the prefill helper runs once per token of every prompt. To keep the loop linear in the number of entries:

A running n_entries count is threaded through the accumulators instead of calling length/1 on the growing entries list. Entries are prepended and reversed once by the caller, so a newly added entry's final batch index is exactly the value of n_entries at the moment it is added.
Prompt length comes from the cached slot.n_prompt_tokens rather than length(slot.prompt_tokens).
Prefill chunks are sliced from slot.prompt_tokens_tuple (O(1) random access) rather than Enum.slice/3 on a list (O(prefill_pos)).
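
The index bookkeeping described in these notes can be illustrated with a minimal sketch. The module BatchIndexSketch and its push_all/3 function are hypothetical names made up for this example and are not part of LlamaCppEx; entries are plain atoms here rather than real batch entries.

```elixir
defmodule BatchIndexSketch do
  # Hypothetical helper, not part of LlamaCppEx: prepends tokens to the
  # entries list while threading a running count alongside it.
  def push_all(tokens, entries, n_entries) do
    Enum.reduce(tokens, {entries, n_entries, []}, fn token, {acc, n, idxs} ->
      # The entry's final batch index is the count *before* it is added,
      # so length/1 is never called on the growing list.
      {[token | acc], n + 1, [n | idxs]}
    end)
  end
end

{entries, n_entries, idxs} = BatchIndexSketch.push_all([:a, :b, :c], [], 0)

# The caller reverses once, after all entries have been collected.
batch = Enum.reverse(entries)
indices = Enum.reverse(idxs)

# elem/2 on a tuple is O(1); slicing a list would walk it from the head.
prompt = List.to_tuple([10, 11, 12, 13, 14])
chunk = for i <- 2..3, do: elem(prompt, i)
```

After the single reverse, each recorded index points at its own entry in batch, which is the property the helpers rely on when they store an entry's batch position.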

All helpers take slots, entries, n_entries, and budget as arguments (add_prefill_chunks also takes chunk_size) and return the 4-tuple {entries, n_entries, slots, budget}.

Functions

add_decode_tokens(slots, entries, n_entries, budget)

Adds one decode token for each generating slot (lowest seq_id first) until the budget is exhausted.

generated_token_ids is updated here — at feed time — so it tracks exactly the tokens present in the slot's KV cache (the prefix-cache bookkeeping depends on that invariant; a sampled-but-never-fed final token must not appear in cached_tokens).
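
A rough sketch of the behaviour described above follows. It is not the library's implementation: the slot fields state, pending_token, and n_past, the entry shape {token, pos, seq_id, logits?}, and the module name DecodeSketch are all assumptions chosen for illustration. Only seq_id, generated_token_ids, and the argument and return shapes come from this page.

```elixir
defmodule DecodeSketch do
  # Illustrative only. `slots` is assumed to be a map of seq_id => slot map.
  def add_decode_tokens(slots, entries, n_entries, budget) do
    slots
    |> Map.values()
    |> Enum.filter(&(&1.state == :generating))
    |> Enum.sort_by(& &1.seq_id)
    |> Enum.reduce({entries, n_entries, slots, budget}, fn
      # Budget exhausted: leave the remaining slots untouched.
      _slot, {_, _, _, 0} = acc ->
        acc

      slot, {ents, n, slots_acc, left} ->
        token = slot.pending_token
        entry = {token, slot.n_past, slot.seq_id, true}

        # The token is recorded at feed time, so generated_token_ids only
        # ever lists tokens that actually enter the KV cache.
        # (Newest-first ordering is an assumption of this sketch.)
        updated = %{
          slot
          | n_past: slot.n_past + 1,
            generated_token_ids: [token | slot.generated_token_ids]
        }

        {[entry | ents], n + 1, Map.put(slots_acc, slot.seq_id, updated), left - 1}
    end)
  end
end

slots = %{
  0 => %{seq_id: 0, state: :generating, pending_token: 101, n_past: 7, generated_token_ids: []},
  1 => %{seq_id: 1, state: :prefilling, pending_token: nil, n_past: 0, generated_token_ids: []},
  2 => %{seq_id: 2, state: :generating, pending_token: 202, n_past: 4, generated_token_ids: [9]}
}

# With a budget of 1, only the generating slot with the lowest seq_id is fed.
{entries, n_entries, slots_after, budget_left} =
  DecodeSketch.add_decode_tokens(slots, [], 0, 1)
```

In this sketch slot 2 keeps its pending token out of generated_token_ids because the budget ran out before it was fed, which mirrors the invariant that a sampled-but-never-fed token must not be recorded.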

add_prefill_chunks(slots, entries, n_entries, budget, chunk_size)

Fills the remaining budget with prefill chunks for each prefilling slot (lowest seq_id first). Only the last token of a slot's final chunk requests logits.
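
The chunking and logits rule can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's code: the state and prefill_pos field layout, the entry shape {token, pos, seq_id, logits?}, the one-chunk-per-slot-per-call policy, and the module name PrefillSketch are hypothetical. The names n_prompt_tokens, prompt_tokens_tuple, prefill_pos, and chunk_size are taken from this page.

```elixir
defmodule PrefillSketch do
  # Illustrative only. `slots` is assumed to be a map of seq_id => slot map.
  def add_prefill_chunks(slots, entries, n_entries, budget, chunk_size) do
    slots
    |> Map.values()
    |> Enum.filter(&(&1.state == :prefilling))
    |> Enum.sort_by(& &1.seq_id)
    |> Enum.reduce({entries, n_entries, slots, budget}, fn slot, {ents, n, slots_acc, left} ->
      remaining = slot.n_prompt_tokens - slot.prefill_pos
      take = Enum.min([chunk_size, left, remaining])

      if take <= 0 do
        {ents, n, slots_acc, left}
      else
        last_pos = slot.n_prompt_tokens - 1
        stop = slot.prefill_pos + take - 1

        ents =
          Enum.reduce(slot.prefill_pos..stop, ents, fn pos, acc ->
            # O(1) tuple access; logits are requested only for the
            # prompt's very last token.
            [{elem(slot.prompt_tokens_tuple, pos), pos, slot.seq_id, pos == last_pos} | acc]
          end)

        updated = %{slot | prefill_pos: slot.prefill_pos + take}
        {ents, n + take, Map.put(slots_acc, slot.seq_id, updated), left - take}
      end
    end)
  end
end

slot = %{
  seq_id: 0,
  state: :prefilling,
  prefill_pos: 0,
  n_prompt_tokens: 5,
  prompt_tokens_tuple: {10, 11, 12, 13, 14}
}

# Two ticks with chunk_size 3: the first chunk carries no logits request,
# the second (final) chunk requests logits on its last token only.
{e1, n1, s1, b1} = PrefillSketch.add_prefill_chunks(%{0 => slot}, [], 0, 10, 3)
{e2, n2, _s2, b2} = PrefillSketch.add_prefill_chunks(s1, [], 0, b1, 3)
```

Requesting logits only at the end of the prompt matches the description above: intermediate prompt positions are never sampled from, so their logits would be wasted work.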