# `LlamaCppEx.Server.Strategy.Batch`
[🔗](https://github.com/nyo16/llama_cpp_ex/blob/main/lib/llama_cpp_ex/server/strategy/batch.ex#L1)

Shared batch-assembly helpers used by the batching strategies.

The strategies (`DecodeMaximal`, `PrefillPriority`, `Balanced`) differ only in
the order in which decode and prefill are scheduled and in how the token budget
is split between them. The per-slot assembly of decode tokens and prefill chunks
is identical, so it lives here.

## Performance notes

These helpers run on every tick of the server loop, once per active slot, and
the prefill helper runs once per token of every prompt. To keep the loop linear
in the number of entries:

  * A running `n_entries` count is threaded through the accumulators instead of
    calling `length/1` (O(n) per call) on the growing `entries` list. Entries are
    prepended and reversed once by the caller, so a newly added entry's final
    batch index is exactly the value of `n_entries` at the moment it is added.
  * Prompt length comes from the cached `slot.n_prompt_tokens` rather than
    `length(slot.prompt_tokens)`.
  * Prefill chunks are sliced from `slot.prompt_tokens_tuple` (O(1) random
    access) rather than `Enum.slice/3` on a list (O(prefill_pos)).
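
As a rough illustration of the last point, the sketch below reads a chunk out of a tuple by index. `ChunkSketch` and `chunk/3` are hypothetical names invented for this example, not part of `LlamaCppEx`; only `elem/2`, `List.to_tuple/1` and `Enum.slice/3` are standard library calls.

```elixir
defmodule ChunkSketch do
  # Hypothetical helper: returns `count` tokens starting at `pos`.
  # `elem/2` is O(1), so the cost depends on `count` only, not on `pos`.
  def chunk(tokens_tuple, pos, count) when count > 0 do
    for i <- pos..(pos + count - 1), do: elem(tokens_tuple, i)
  end

  def chunk(_tokens_tuple, _pos, _count), do: []
end

tokens = Enum.to_list(100..109)
tokens_tuple = List.to_tuple(tokens)

# Same elements as Enum.slice(tokens, 6, 3), without walking the first 6 cells.
ChunkSketch.chunk(tokens_tuple, 6, 3)
# => [106, 107, 108]
```

The tuple has to be built once per prompt, which is presumably why the slot caches it alongside the list.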

All helpers take and return the 4-tuple `{entries, n_entries, slots, budget}`.
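
A minimal sketch of how such an accumulator might be threaded, assuming a made-up entry shape of `{token, seq_id, batch_index}`. `AccSketch.push/3` is hypothetical and only illustrates the counting and prepend-then-reverse pattern described above; the real entry format may differ.

```elixir
defmodule AccSketch do
  # Hypothetical: prepend one entry while budget remains.
  # The entry's final batch index is the count *before* it is added.
  def push({entries, n_entries, slots, budget}, token, seq_id) when budget > 0 do
    entry = {token, seq_id, n_entries}
    {[entry | entries], n_entries + 1, slots, budget - 1}
  end

  # Budget exhausted: return the accumulator unchanged.
  def push(acc, _token, _seq_id), do: acc
end

{entries, n_entries, _slots, budget} =
  {[], 0, %{}, 2}
  |> AccSketch.push(11, 0)
  |> AccSketch.push(22, 1)
  |> AccSketch.push(33, 2)

# The caller reverses once, so stored indices match list positions.
Enum.reverse(entries)
# => [{11, 0, 0}, {22, 1, 1}]
```

Because the count travels with the list, no step ever needs to call `length/1` on `entries`.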

# `add_decode_tokens`

Adds one decode token for each generating slot (lowest `seq_id` first) until the
budget is exhausted. Streams each token piece to the slot's subscriber.
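
The behaviour described above could look roughly like the following. This is a sketch under assumptions, not the module's source: the slot fields (`state`, `last_token`, `subscriber`), the `{:token, seq_id, token}` message and the entry shape are all invented for illustration, and the token id stands in for the decoded text piece.

```elixir
defmodule DecodeSketch do
  # Hypothetical slot shape: %{state: atom, last_token: integer, subscriber: pid}.
  def add_decode_tokens({entries, n_entries, slots, budget}) do
    slots
    |> Enum.filter(fn {_seq_id, slot} -> slot.state == :generating end)
    |> Enum.sort_by(fn {seq_id, _slot} -> seq_id end)
    |> Enum.reduce_while({entries, n_entries, slots, budget}, fn
      # No budget left: stop visiting further slots.
      {_seq_id, _slot}, {_e, _n, _s, 0} = acc ->
        {:halt, acc}

      {seq_id, slot}, {e, n, s, b} ->
        # Assumed message format; stands in for streaming the token piece.
        send(slot.subscriber, {:token, seq_id, slot.last_token})
        {:cont, {[{slot.last_token, seq_id, n} | e], n + 1, s, b - 1}}
    end)
  end
end

slots = %{
  2 => %{state: :generating, last_token: 7, subscriber: self()},
  0 => %{state: :generating, last_token: 5, subscriber: self()},
  1 => %{state: :prefilling, last_token: nil, subscriber: self()}
}

# With a budget of 1, only the lowest generating seq_id (0) gets an entry.
DecodeSketch.add_decode_tokens({[], 0, slots, 1})
```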

# `add_prefill_chunks`

Fills the remaining budget with prefill chunks for each prefilling slot
(lowest `seq_id` first). Only the last token of a slot's final chunk requests
logits.
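
To make the logits rule concrete, here is a hedged single-slot sketch. `PrefillSketch.add_chunk/2`, the `prefill_pos` field and the `{token, seq_id, wants_logits?}` entry shape are assumptions made for this example; `prompt_tokens_tuple` and `n_prompt_tokens` are the slot fields named in the performance notes.

```elixir
defmodule PrefillSketch do
  # Hypothetical: emit one chunk for `seq_id`, limited by the remaining budget.
  def add_chunk({entries, n_entries, slots, budget}, seq_id) do
    slot = Map.fetch!(slots, seq_id)
    count = min(slot.n_prompt_tokens - slot.prefill_pos, budget)

    if count <= 0 do
      {entries, n_entries, slots, budget}
    else
      last = slot.n_prompt_tokens - 1
      stop = slot.prefill_pos + count - 1

      new_entries =
        Enum.reduce(slot.prefill_pos..stop, entries, fn pos, acc ->
          # Only the very last prompt token asks for logits.
          [{elem(slot.prompt_tokens_tuple, pos), seq_id, pos == last} | acc]
        end)

      slots = Map.put(slots, seq_id, %{slot | prefill_pos: stop + 1})
      {new_entries, n_entries + count, slots, budget - count}
    end
  end
end

slot = %{prompt_tokens_tuple: {10, 11, 12, 13, 14}, n_prompt_tokens: 5, prefill_pos: 0}

# First tick: budget 3 covers positions 0..2, none of which is the final token.
{first, _n, slots, _budget} = PrefillSketch.add_chunk({[], 0, %{0 => slot}, 3}, 0)
Enum.reverse(first)
# => [{10, 0, false}, {11, 0, false}, {12, 0, false}]

# Second tick: the final chunk ends the prompt, so its last token wants logits.
{second, _n, _slots, _budget} = PrefillSketch.add_chunk({[], 0, slots, 3}, 0)
Enum.reverse(second)
# => [{13, 0, false}, {14, 0, true}]
```

Requesting logits only at the end of the prompt is enough because intermediate prefill positions are never sampled from.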

---

*Consult [api-reference.md](api-reference.md) for complete listing*
