# `LlamaCppEx.Context`
[🔗](https://github.com/nyo16/llama_cpp_ex/blob/main/lib/llama_cpp_ex/context.ex#L1)

Inference context with KV cache.

# `t`

```elixir
@type t() :: %LlamaCppEx.Context{model: LlamaCppEx.Model.t(), ref: reference()}
```

# `clear`

```elixir
@spec clear(t()) :: :ok
```

Clears the KV cache.

# `create`

```elixir
@spec create(
  LlamaCppEx.Model.t(),
  keyword()
) :: {:ok, t()} | {:error, String.t()}
```

Creates a new inference context for the given model.

## Options

### Core

  * `:n_ctx` - Context size (max tokens). Defaults to `2048`.
  * `:n_batch` - Max tokens per decode batch. Defaults to `:n_ctx`.
  * `:n_ubatch` - Max tokens per micro-batch. Defaults to `512`.
  * `:n_threads` - Number of threads for generation. Defaults to system CPU count.
  * `:n_threads_batch` - Number of threads for prompt processing. Defaults to `:n_threads`.
  * `:n_seq_max` - Max number of concurrent sequences. Defaults to `1`.
  * `:embeddings` - Enable embedding extraction. Defaults to `false`.
  * `:pooling_type` - Pooling type for embeddings: `:unspecified`, `:none`, `:mean`,
    `:cls`, `:last`, `:rank`. Defaults to `:unspecified`.

### KV Cache Quantization

  * `:type_k` - Data type for K cache. Reduces memory at the cost of precision.
    Values: `:f16` (default), `:f32`, `:q8_0`, `:q4_0`, `:q4_1`, `:q5_0`, `:q5_1`, `:bf16`.
  * `:type_v` - Data type for V cache. Same values as `:type_k`. Defaults to `:f16`.
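
To get a feel for the memory trade-off, the cache size can be estimated from the model dimensions. The sketch below is only back-of-the-envelope arithmetic: the module name `KVCacheEstimate` and the model dimensions (32 layers, 8 KV heads, head size 128) are hypothetical, the per-block byte sizes follow ggml's block layouts (32 elements per quantized block), and the real allocation may differ because of padding or backend details.

```elixir
defmodule KVCacheEstimate do
  # {bytes per block, elements per block} for each cache type (ggml block layouts).
  @block_layout %{
    f32: {4, 1},
    f16: {2, 1},
    bf16: {2, 1},
    q8_0: {34, 32},
    q5_1: {24, 32},
    q5_0: {22, 32},
    q4_1: {20, 32},
    q4_0: {18, 32}
  }

  # Approximate bytes needed for the K and V caches combined.
  def bytes(type_k, type_v, dims) do
    %{n_layer: n_layer, n_ctx: n_ctx, n_head_kv: n_head_kv, head_dim: head_dim} = dims
    elements = n_layer * n_ctx * n_head_kv * head_dim
    tensor_bytes(type_k, elements) + tensor_bytes(type_v, elements)
  end

  defp tensor_bytes(type, elements) do
    {block_bytes, block_size} = Map.fetch!(@block_layout, type)
    div(elements, block_size) * block_bytes
  end
end

# Hypothetical model dimensions, chosen only for illustration.
dims = %{n_layer: 32, n_ctx: 8192, n_head_kv: 8, head_dim: 128}

IO.puts(KVCacheEstimate.bytes(:f16, :f16, dims))   # 1073741824 (1 GiB)
IO.puts(KVCacheEstimate.bytes(:q8_0, :q8_0, dims)) # 570425344
```

Under these assumptions, moving both caches from `:f16` to `:q8_0` roughly halves the cache footprint.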

### Flash Attention & GPU Offload

  * `:flash_attn` - Flash Attention mode: `:auto` (default), `:enabled`, `:disabled`.
  * `:offload_kqv` - Offload KQV ops and KV cache to GPU. Defaults to `true`.
  * `:op_offload` - Offload host tensor operations to device. Defaults to `true`.

### RoPE Scaling (Context Extension)

  * `:rope_scaling_type` - RoPE scaling mode: `:unspecified` (default), `:none`,
    `:linear`, `:yarn`, `:longrope`.
  * `:rope_freq_base` - RoPE base frequency. `0.0` uses model default.
  * `:rope_freq_scale` - RoPE frequency scale. `0.0` uses model default.
  * `:yarn_ext_factor` - YaRN extrapolation mix factor. `-1.0` to disable.
  * `:yarn_attn_factor` - YaRN magnitude scaling. `-1.0` to disable.
  * `:yarn_beta_fast` - YaRN low correction dimension. `-1.0` to disable.
  * `:yarn_beta_slow` - YaRN high correction dimension. `-1.0` to disable.
  * `:yarn_orig_ctx` - YaRN original context length. `0` to disable.
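
As an illustration of linear scaling, the sketch below builds an option list for running a model beyond its trained context length. It assumes `:rope_freq_scale` follows the upstream llama.cpp convention, where the scale is the reciprocal of the extension factor (trained length divided by target length); the module name `RopeOpts` is hypothetical.

```elixir
defmodule RopeOpts do
  # Builds options for linear RoPE scaling from `trained_ctx` to `target_ctx` tokens.
  # Assumption: the frequency scale is trained_ctx / target_ctx (llama.cpp convention).
  def linear(trained_ctx, target_ctx)
      when is_integer(trained_ctx) and trained_ctx > 0 and target_ctx >= trained_ctx do
    [
      n_ctx: target_ctx,
      rope_scaling_type: :linear,
      rope_freq_scale: trained_ctx / target_ctx
    ]
  end
end

# Extending a model trained on 4096 tokens to a 16384-token window:
IO.inspect(RopeOpts.linear(4096, 16_384))
# [n_ctx: 16384, rope_scaling_type: :linear, rope_freq_scale: 0.25]
```

The resulting keyword list can be passed as the second argument to `create/2`.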

### Misc

  * `:attention_type` - Attention type: `:unspecified` (default), `:causal`, `:non_causal`.
    Use `:non_causal` for embedding models.
  * `:no_perf` - Disable performance timing. Defaults to `true`.
  * `:swa_full` - Use full-size sliding window attention cache. Defaults to `true`.
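
For an embedding model, the options above might be combined as in the following sketch. The module name `EmbeddingContext` and the choice of `:mean` pooling are illustrative assumptions; the right pooling type depends on the model.

```elixir
defmodule EmbeddingContext do
  alias LlamaCppEx.Context

  # Options for an embedding-only context. Mean pooling is an assumption;
  # some models expect :cls or :last instead.
  def opts(n_ctx \\ 512) do
    [n_ctx: n_ctx, embeddings: true, pooling_type: :mean, attention_type: :non_causal]
  end

  # `model` is an already loaded LlamaCppEx.Model.t().
  def create(model, n_ctx \\ 512), do: Context.create(model, opts(n_ctx))
end
```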

### Speculative decoding / MTP

  * `:ctx_type` - Context kind. `:default` (the main target context, default) or `:mtp`
    (a draft context that consumes MTP heads from the same model). Use `:mtp` together
    with a separate `:default` context to drive multi-token-prediction speculative
    decoding via `LlamaCppEx.MTP`.
  * `:n_rs_seq` - Number of recurrent-state snapshots per sequence to retain for
    partial rollback of speculative drafts. `0` (default) disables rollback. For an
    MTP draft context, set this to your intended max draft length (e.g. `3`).
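
A minimal sketch of setting up the two contexts described above, assuming `model` is an already loaded `LlamaCppEx.Model.t()`. The module name `ContextSetup`, the context size, and the returned map shape are hypothetical; how the pair is then handed to `LlamaCppEx.MTP` is not shown here.

```elixir
defmodule ContextSetup do
  alias LlamaCppEx.Context

  # Options for the main (target) context.
  def target_opts(n_ctx), do: [n_ctx: n_ctx, ctx_type: :default]

  # Options for the MTP draft context; n_rs_seq is set to the max draft length.
  def draft_opts(n_ctx, max_draft), do: [n_ctx: n_ctx, ctx_type: :mtp, n_rs_seq: max_draft]

  # Creates both contexts, returning the first error if either call fails.
  def create_pair(model, n_ctx \\ 4096, max_draft \\ 3) do
    with {:ok, target} <- Context.create(model, target_opts(n_ctx)),
         {:ok, draft} <- Context.create(model, draft_opts(n_ctx, max_draft)) do
      {:ok, %{target: target, draft: draft}}
    end
  end
end
```

Because `create/2` returns `{:error, reason}` on failure, the `with` expression passes that tuple straight through to the caller.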

# `decode`

```elixir
@spec decode(t(), [integer()]) :: :ok | {:error, String.t()}
```

Decodes a list of token IDs through the model. Returns `:ok` on success or `{:error, reason}` on failure.
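
The following sketch wraps `decode/2` with a size check against `n_ctx/1` before calling into the model. The module name `DecodeExample` and the error message are hypothetical, and the check is a conservative assumption: the library may enforce its own limits (for example via `:n_batch`).

```elixir
defmodule DecodeExample do
  alias LlamaCppEx.Context

  # True when a non-empty token list of this length fits in the context window.
  def fits?(n_tokens, n_ctx), do: n_tokens > 0 and n_tokens <= n_ctx

  # Decodes `tokens` only if they fit; otherwise returns an error tuple.
  def decode_prompt(ctx, tokens) do
    n_ctx = Context.n_ctx(ctx)

    if fits?(length(tokens), n_ctx) do
      Context.decode(ctx, tokens)
    else
      {:error, "expected between 1 and #{n_ctx} tokens, got #{length(tokens)}"}
    end
  end
end
```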

# `generate`

```elixir
@spec generate(t(), LlamaCppEx.Sampler.t(), [integer()], keyword()) ::
  {:ok, String.t()} | {:error, String.t()}
```

Runs the generation loop: decodes the prompt tokens, then generates up to `:max_tokens` new tokens.

Returns the generated text (not including the prompt).

## Options

  * `:max_tokens` - Maximum tokens to generate. Defaults to `256`.
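
A sketch of calling `generate/4` with a token budget derived from the context size. It assumes `ctx`, `sampler`, and `prompt_tokens` were created elsewhere; the module name `GenerateExample` is hypothetical. Clearing the cache before each run is a precaution taken here, not documented behavior of `generate/4`.

```elixir
defmodule GenerateExample do
  alias LlamaCppEx.Context

  # Number of new tokens that can be requested without exceeding the window.
  def token_budget(n_ctx, prompt_len, max_tokens) do
    max(min(max_tokens, n_ctx - prompt_len), 0)
  end

  # Runs generation from an empty cache, capping :max_tokens to the remaining space.
  def complete(ctx, sampler, prompt_tokens, max_tokens \\ 256) do
    case token_budget(Context.n_ctx(ctx), length(prompt_tokens), max_tokens) do
      0 ->
        {:error, "prompt leaves no room for generated tokens"}

      budget ->
        :ok = Context.clear(ctx)
        Context.generate(ctx, sampler, prompt_tokens, max_tokens: budget)
    end
  end
end
```

With `n_ctx` at `2048` and a 2000-token prompt, `token_budget/3` caps a request for 256 tokens at 48.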

# `n_ctx`

```elixir
@spec n_ctx(t()) :: integer()
```

Returns the context size in tokens.

# `n_rs_seq`

```elixir
@spec n_rs_seq(t()) :: non_neg_integer()
```

Returns the number of recurrent-state snapshots per sequence available for
partial rollback of speculative drafts.

`0` means the context does not support partial rollback (e.g. a regular target
context with `n_rs_seq: 0`). For an MTP draft context created with
`n_rs_seq: N`, this returns at most `N`.
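
Since the returned value can be lower than what was requested, a caller might clamp its draft length to it. The sketch below assumes `draft_ctx` is an existing context; the module name `DraftLength` and the error message are hypothetical.

```elixir
defmodule DraftLength do
  alias LlamaCppEx.Context

  # Limits the requested draft length to the number of available snapshots.
  def clamp(requested, available) when requested >= 0 and available >= 0 do
    min(requested, available)
  end

  # Returns the usable draft length for `draft_ctx`, or an error when the
  # context reports no rollback snapshots.
  def for_context(draft_ctx, requested) do
    case Context.n_rs_seq(draft_ctx) do
      0 -> {:error, "context does not support partial rollback"}
      available -> {:ok, clamp(requested, available)}
    end
  end
end
```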

# `n_seq_max`

```elixir
@spec n_seq_max(t()) :: integer()
```

Returns the maximum number of concurrent sequences.

---

*Consult [api-reference.md](api-reference.md) for complete listing*
