Chunker

Stream a JSONL file and ask the configured LLM to split each record's content into smaller chunks following a natural-language method (e.g. "split by paragraph", "split by topic"). Each input record is emitted once per produced chunk, with the chunk written to a configurable attribute. Useful for preparing documents for retrieval / RAG pipelines.

For deterministic, non-LLM splitting see AI::FixedChunker, AI::RecursiveChunker, or AI::MarkdownChunker.

Prerequisite: Install an AI provider application from Profile > {Organization} > Applications. The provider must support tool/function calling.

Parameters

Provider—REQUIRED

Configured AI application.

Model

Model identifier from the selected provider. Must support tool calling. Defaults to the provider's recommended model when left empty.

Mode—REQUIRED

- Batch (default): submits all records as a single batch job. Cheaper, asynchronous, best for large files.
- Direct: calls the AI per record synchronously. Finer control, faster on small files.

Coverage Check

Verifies that the produced chunks reconstruct the source text exactly (whitespace and pipes ignored):

- No check (default): use chunks as-is.
- Exact or fail: abort the step on any mismatch.
- Exact or original: fall back to the full source as a single chunk on mismatch.
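
The three coverage options can be pictured with a minimal Python sketch. This is an illustration only, not the step's actual implementation: the function names and the mode strings ("none", "exact_or_fail", "exact_or_original") are hypothetical, and it assumes the comparison simply strips all whitespace and pipe characters from both sides before checking equality.

```python
import re


def normalize(text: str) -> str:
    # Assumption: "whitespace and pipes ignored" means both are dropped entirely.
    return re.sub(r"[\s|]+", "", text)


def apply_coverage_check(source: str, chunks: list[str], mode: str = "none") -> list[str]:
    """Return the chunks to use, given a hypothetical coverage mode name."""
    if mode == "none":
        # No check: use chunks as-is.
        return chunks
    if normalize("".join(chunks)) == normalize(source):
        return chunks
    if mode == "exact_or_fail":
        # Abort on any mismatch.
        raise ValueError("chunks do not reconstruct the source text")
    if mode == "exact_or_original":
        # Fall back to the full source as a single chunk.
        return [source]
    raise ValueError(f"unknown coverage mode: {mode}")


source = "Alpha beta.\n\nGamma | delta."
print(apply_coverage_check(source, ["Alpha beta.", "Gamma delta."], "exact_or_fail"))
print(apply_coverage_check(source, ["Alpha beta."], "exact_or_original"))
```

In this sketch a model that silently drops or rewrites text is caught by either checked mode, while reformatting that only touches whitespace or pipes passes.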

Chunking Method—REQUIRED

Natural-language instruction telling the model how to split the content (e.g. "split by paragraph", "split by section heading", "semantic chunks of ~500 words").

Source Attribute

Field on each incoming JSON record holding the text to chunk. Leave empty to chunk the full record.

Output Attribute—REQUIRED

Field name written on each outgoing record holding the chunk text. Defaults to "chunk" when left empty.

Group Size

Number of incoming records to bundle together before sending to the model. Useful for keeping pages or sections together. Defaults to 1.

Overlap

Number of records shared between consecutive groups (sliding window). Higher overlap preserves context across boundaries at the cost of duplicate processing. Defaults to 0.
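
How Group Size and Overlap interact can be sketched as a sliding window over the incoming records. The sketch below is an assumption about the behavior, not the step's code: `group_records` is a hypothetical helper, the window is assumed to advance by group size minus overlap, overlap is assumed to be smaller than group size, and the last group is assumed to be allowed to be shorter than the others.

```python
def group_records(records: list, group_size: int = 1, overlap: int = 0) -> list[list]:
    """Bundle records into windows of `group_size` sharing `overlap` records."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    if not 0 <= overlap < group_size:
        raise ValueError("overlap must be >= 0 and smaller than group_size")

    step = group_size - overlap  # how far the window advances each time
    groups = []
    for start in range(0, len(records), step):
        groups.append(records[start:start + group_size])
        if start + group_size >= len(records):
            # The window already reached the end; avoid emitting a redundant tail.
            break
    return groups


pages = ["p1", "p2", "p3", "p4", "p5"]
print(group_records(pages, group_size=3, overlap=1))
# [['p1', 'p2', 'p3'], ['p3', 'p4', 'p5']]
```

With the defaults (group size 1, overlap 0) every record forms its own group, which matches per-record chunking.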

Input

File—REQUIRED

JSONL file. Each line is a JSON record from which the source text is read.

Output

File

JSONL file with one line per produced chunk. The original record is preserved with the source attribute removed and the output attribute filled with the chunk text.
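
The shape of the output can be illustrated with a short Python sketch. It is a hedged example rather than the step's implementation: `expand_record`, the sample record, and the field names "text" and "chunk" are hypothetical, and it covers only the case where a Source Attribute is set.

```python
import json


def expand_record(record: dict, chunks: list[str], source_attr: str, output_attr: str = "chunk") -> list[dict]:
    """Emit one copy of the record per chunk, dropping the source attribute."""
    base = {key: value for key, value in record.items() if key != source_attr}
    return [{**base, output_attr: chunk} for chunk in chunks]


record = {"id": 7, "text": "First paragraph.\n\nSecond paragraph."}
chunks = ["First paragraph.", "Second paragraph."]  # stand-in for the model's output

for out in expand_record(record, chunks, source_attr="text"):
    print(json.dumps(out))
# {"id": 7, "chunk": "First paragraph."}
# {"id": 7, "chunk": "Second paragraph."}
```

All other fields of the input record, such as the "id" here, are carried onto every output line, so each chunk can be traced back to its source record.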
