LlamaCppEx.Server.Strategy.Batch (LlamaCppEx v0.8.20)


Shared batch-assembly helpers used by the batching strategies.

The strategies (DecodeMaximal, PrefillPriority, Balanced) differ only in the order in which decode and prefill work is scheduled and in how the token budget is split between them. The per-slot assembly of decode tokens and prefill chunks is identical, so it lives here.

Performance notes

These helpers run on every tick of the server loop, once per active slot, and the prefill helper runs once per token of every prompt. To keep the loop linear in the number of entries:

  • A running n_entries count is threaded through the accumulators instead of calling length/1 on the growing entries list. Entries are prepended and reversed once by the caller, so a newly added entry's final batch index is exactly the value of n_entries at the moment it is added.
  • Prompt length comes from the cached slot.n_prompt_tokens rather than length(slot.prompt_tokens).
  • Prefill chunks are sliced from slot.prompt_tokens_tuple (O(1) random access) rather than Enum.slice/3 on a list (O(prefill_pos)).
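
The first and third points can be illustrated with a minimal sketch. The module name, function names, and entry shape below are hypothetical and chosen only for illustration; they are not taken from the library's source.

```elixir
defmodule BatchSketch do
  # Hypothetical illustration only; not LlamaCppEx code.

  # Prepends one entry per token while threading a running count.
  # Each entry's batch index is the count at the moment it is added,
  # so length/1 is never called on the growing list.
  def add_tokens(tokens, seq_id, entries, n_entries) do
    Enum.reduce(tokens, {entries, n_entries}, fn token, {acc, n} ->
      entry = %{token: token, seq_id: seq_id, batch_index: n}
      {[entry | acc], n + 1}
    end)
  end

  # Reads a chunk out of a tuple with elem/2. Each lookup is O(1),
  # whereas slicing a list must first walk to the start position.
  def chunk_from_tuple(tokens_tuple, pos, count) when count > 0 do
    for i <- pos..(pos + count - 1), do: elem(tokens_tuple, i)
  end

  def chunk_from_tuple(_tokens_tuple, _pos, _count), do: []
end

{entries, n} = BatchSketch.add_tokens([10, 11], 0, [], 0)
{entries, n} = BatchSketch.add_tokens([20], 1, entries, n)

# The caller reverses once; batch_index then matches list position.
IO.inspect(Enum.reverse(entries))
IO.inspect(n)
IO.inspect(BatchSketch.chunk_from_tuple({5, 6, 7, 8}, 1, 2))
```

After the single reversal, each entry's recorded index matches its position in the list, which is why the count can stand in for length/1.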

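All helpers take the accumulator state (entries, n_entries, slots, budget) as arguments, as shown in the signatures below, and return it as the 4-tuple {entries, n_entries, slots, budget}.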

Functions

add_decode_tokens(slots, entries, n_entries, budget, model_ref)

Adds one decode token for each generating slot (lowest seq_id first) until the budget is exhausted. Streams each token piece to the slot's subscriber.
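
As a rough sketch of the behaviour described above, the following shows one way to add a decode token per generating slot in seq_id order until the budget runs out. It rests on several assumptions: slots are a list of plain maps with hypothetical keys (seq_id, state, last_token), decode entries request logits, and streaming to the subscriber (which is where model_ref would be needed) is left out. The real data structures may differ.

```elixir
defmodule DecodeSketch do
  # Hypothetical slot shape: %{seq_id: integer, state: atom, last_token: integer}.
  # Streaming the token piece to a subscriber is deliberately omitted.
  def add_decode_tokens(slots, entries, n_entries, budget) do
    slots
    |> Enum.filter(&(&1.state == :generating))
    |> Enum.sort_by(& &1.seq_id)
    |> Enum.reduce_while({entries, n_entries, budget}, fn slot, {acc, n, left} ->
      if left > 0 do
        # Assumption: a decode token always needs logits for the next sample.
        entry = %{token: slot.last_token, seq_id: slot.seq_id, logits: true}
        {:cont, {[entry | acc], n + 1, left - 1}}
      else
        # Budget exhausted: stop without visiting the remaining slots.
        {:halt, {acc, n, left}}
      end
    end)
    |> then(fn {acc, n, left} -> {acc, n, slots, left} end)
  end
end

slots = [
  %{seq_id: 2, state: :generating, last_token: 7},
  %{seq_id: 0, state: :generating, last_token: 5},
  %{seq_id: 1, state: :prefilling, last_token: nil}
]

# With a budget of 1, only the generating slot with the lowest seq_id is served.
IO.inspect(DecodeSketch.add_decode_tokens(slots, [], 0, 1))
```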

add_prefill_chunks(slots, entries, n_entries, budget, chunk_size)

Fills the remaining budget with prefill chunks for each prefilling slot (lowest seq_id first). Only the last token of a slot's final chunk requests logits.
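
The chunking and logits rule can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: slots are a list of maps, the entry shape is hypothetical, and the field names prompt_tokens_tuple, n_prompt_tokens, and prefill_pos are borrowed from the performance notes. Each chunk is capped by chunk_size, by the tokens left in the prompt, and by the remaining budget, and only the token at the last prompt position sets logits to true.

```elixir
defmodule PrefillSketch do
  # Hypothetical slot shape:
  #   %{seq_id:, state:, prompt_tokens_tuple:, n_prompt_tokens:, prefill_pos:}
  def add_prefill_chunks(slots, entries, n_entries, budget, chunk_size) do
    slots
    |> Enum.sort_by(& &1.seq_id)
    |> Enum.reduce({entries, n_entries, [], budget}, fn
      %{state: :prefilling} = slot, {acc, n, done, left} ->
        remaining = slot.n_prompt_tokens - slot.prefill_pos
        count = Enum.min([chunk_size, remaining, left])

        if count > 0 do
          last = slot.n_prompt_tokens - 1

          chunk =
            for i <- slot.prefill_pos..(slot.prefill_pos + count - 1) do
              # Only the final prompt token requests logits.
              %{token: elem(slot.prompt_tokens_tuple, i), seq_id: slot.seq_id,
                pos: i, logits: i == last}
            end

          slot = %{slot | prefill_pos: slot.prefill_pos + count}
          # Entries are kept newest-first, so the chunk is prepended reversed.
          {Enum.reverse(chunk) ++ acc, n + count, [slot | done], left - count}
        else
          {acc, n, [slot | done], left}
        end

      slot, {acc, n, done, left} ->
        # Slots that are not prefilling pass through unchanged.
        {acc, n, [slot | done], left}
    end)
    |> then(fn {acc, n, done, left} -> {acc, n, Enum.reverse(done), left} end)
  end
end

slot = %{seq_id: 0, state: :prefilling, prompt_tokens_tuple: {1, 2, 3, 4, 5},
         n_prompt_tokens: 5, prefill_pos: 0}

# Two ticks with chunk_size 3: the first adds 3 tokens, the second the last 2.
{e, n, s, b} = PrefillSketch.add_prefill_chunks([slot], [], 0, 10, 3)
{e, n, _s, b} = PrefillSketch.add_prefill_chunks(s, e, n, b, 3)
IO.inspect({Enum.reverse(e), n, b})
```

Note that this sketch returns the slots sorted by seq_id and does not model the transition from prefilling to generating once a prompt is fully consumed.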