Chunker
Stream a JSONL file and ask the configured LLM to split each record's content into smaller chunks following a natural-language method (e.g. "split by paragraph", "split by topic"). Each input record is emitted multiple times — once per produced chunk — with the chunk written to a configurable attribute. Useful for preparing documents for retrieval / RAG pipelines.
For deterministic, non-LLM splitting see AI::FixedChunker, AI::RecursiveChunker,
or AI::MarkdownChunker.
Prerequisite: install an AI provider application from Profile > {Organization} > Applications. The provider must support tool/function calling.
Parameters
Model identifier from the selected provider. Must support tool calling. Defaults to the provider's recommended model when left empty.
- Batch (default): submits all records as a single batch job. Cheaper, asynchronous, best for large files.
- Direct: calls the AI per record synchronously. Finer control, faster on small files.
Verifies that the produced chunks reconstruct the source text exactly (whitespace and pipes ignored):
- No check (default): use chunks as-is.
- Exact or fail: abort the step on any mismatch.
- Exact or original: fall back to the full source as a single chunk on mismatch.
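The three verification modes can be sketched as follows. This is a minimal illustration, not the step's actual implementation: the function names and mode strings (`none`, `exact_or_fail`, `exact_or_original`) are hypothetical, and it assumes "ignored" means that all whitespace and `|` characters are stripped from both sides before comparing.

```python
import re


def normalize(text: str) -> str:
    # Assumption: whitespace and pipe characters are removed before comparison.
    return re.sub(r"[\s|]+", "", text)


def verify_chunks(source: str, chunks: list[str], mode: str = "none") -> list[str]:
    """Return the chunks to use, given a (hypothetical) verification mode."""
    if mode == "none":
        # No check: use the chunks as produced by the model.
        return chunks
    if normalize("".join(chunks)) == normalize(source):
        # The chunks reconstruct the source text.
        return chunks
    if mode == "exact_or_fail":
        # Abort on any mismatch.
        raise ValueError("chunks do not reconstruct the source text")
    if mode == "exact_or_original":
        # Fall back to the full source as a single chunk.
        return [source]
    raise ValueError(f"unknown mode: {mode}")
```

For example, `verify_chunks("hello world", ["hello"], "exact_or_original")` would return the full source as one chunk, because the model dropped part of the text.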
Natural-language instruction telling the model how to split the content (e.g. "split by paragraph", "split by section heading", "semantic chunks of ~500 words").
Field on each incoming JSON record holding the text to chunk. Leave empty to chunk the full record.
Field name written on each outgoing record holding the chunk text.
Defaults to chunk when left empty.
Number of incoming records to bundle together before sending to the model.
Useful for keeping pages or sections together. Defaults to 1.
Number of records shared between consecutive groups (sliding window).
Higher overlap preserves context across boundaries at the cost of
duplicate processing. Defaults to 0.
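To make the interaction between group size and overlap concrete, here is a small sliding-window sketch. The function `group_records` and its parameter names are hypothetical, and how the real step treats a short trailing group is an assumption (here it is kept as-is).

```python
def group_records(records: list, group_size: int = 1, overlap: int = 0) -> list[list]:
    """Bundle records into groups that share `overlap` records with the previous group."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    if not 0 <= overlap < group_size:
        raise ValueError("overlap must be between 0 and group_size - 1")

    step = group_size - overlap  # how far the window advances each time
    groups = []
    for start in range(0, len(records), step):
        groups.append(records[start:start + group_size])
        if start + group_size >= len(records):
            # The window has reached the end; stop to avoid emitting
            # a trailing group made only of already-seen records.
            break
    return groups
```

With five records, a group size of 3 and an overlap of 1, the third record appears in both groups, so it is sent to the model twice. That is the duplicate processing cost mentioned above.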
Input
JSONL file. Each line is a JSON record from which the source text is read.
Output
JSONL file with one line per produced chunk. Each line is a copy of the original record with the source attribute removed and the output attribute set to the chunk text.
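The shape of the output can be illustrated with a short sketch. The helper `emit_chunk_records` and the source field name `text` are hypothetical; `chunk` is the documented default for the output attribute. The sketch covers only the case where a source field is configured.

```python
import json


def emit_chunk_records(record: dict, chunks: list[str],
                       source_field: str = "text",
                       output_field: str = "chunk") -> list[str]:
    """Build one JSONL line per chunk from a single input record."""
    # Keep every attribute except the one the source text was read from.
    base = {key: value for key, value in record.items() if key != source_field}
    # Each output line repeats the remaining attributes and adds the chunk text.
    return [json.dumps({**base, output_field: chunk}) for chunk in chunks]


lines = emit_chunk_records(
    {"id": 1, "text": "First paragraph. Second paragraph."},
    ["First paragraph.", "Second paragraph."],
)
```

Here one input record with two chunks yields two output lines, each carrying `id` and `chunk` but no longer `text`.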