
Math: BM25

Embed text as a BM25 sparse vector (indices + values) suitable for hybrid search in vector databases like Qdrant. Pair this with a dense embedding upstream to power a hybrid (dense + sparse) retrieval pipeline.

Example output
{
"indices": [12, 84, 132, ...],
"values": [0.42, 0.31, 0.18, ...]
}

Parameters

avgdl

Average document length used by the BM25 length-normalisation term. Tune to the typical token count of your corpus. Defaults to 256.

b

Length-normalisation parameter: 0.0 disables length normalisation, 1.0 applies it fully. Defaults to 0.0, so avgdl has no effect unless you raise b.

k1

Term-frequency saturation parameter. Higher values let repeated occurrences of a term keep adding weight for longer before the score saturates; typical range 1.2 to 2.0. Defaults to 1.2.
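To see how avgdl, b, and k1 interact, the sketch below computes the term-frequency component of the standard Okapi BM25 formula for a single term. It is an illustration only: the function name bm25_tf_weight is hypothetical, the IDF factor is left out, and this page does not confirm that the tool computes its values in exactly this way.

```python
def bm25_tf_weight(tf: float, doc_len: int, k1: float = 1.2,
                   b: float = 0.0, avgdl: float = 256.0) -> float:
    """Term-frequency component of standard Okapi BM25 (IDF omitted).

    Hypothetical helper for illustration; defaults mirror the ones
    listed on this page (k1=1.2, b=0.0, avgdl=256).
    """
    # Length normalisation: equals 1.0 when b == 0 or doc_len == avgdl.
    norm = 1.0 - b + b * (doc_len / avgdl)
    # Saturating function of tf; it approaches k1 + 1 as tf grows.
    return tf * (k1 + 1.0) / (tf + k1 * norm)


if __name__ == "__main__":
    # With b = 0.0 the document length is ignored entirely.
    print(bm25_tf_weight(tf=3, doc_len=64), bm25_tf_weight(tf=3, doc_len=1024))
    # With b = 1.0 a longer-than-average document is penalised.
    print(bm25_tf_weight(tf=3, doc_len=1024, b=1.0))
```

Under this formula, raising k1 raises the ceiling (k1 + 1) that a heavily repeated term can reach, while b controls how strongly documents longer than avgdl are discounted.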

language

Language hint for tokenisation / stop-word handling. Use detect (default) to auto-detect, or an ISO-639 code (en, fr, …) when you know the language up front.

Input

Text (required)
The text to embed.

Output

Results

JSON object holding the sparse vector (indices + values arrays).
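As a rough illustration of how a result of this shape is consumed, the sketch below scores a query against a document by taking the dot product over the indices the two sparse vectors share, which is how sparse vectors are typically compared. The function name sparse_dot and the sample numbers are assumptions for the example, not output from this tool.

```python
def sparse_dot(a: dict, b: dict) -> float:
    """Dot product of two sparse vectors given as {"indices": [...], "values": [...]}.

    Hypothetical helper; only indices present in both vectors contribute.
    """
    a_map = dict(zip(a["indices"], a["values"]))
    return sum(v * a_map.get(i, 0.0) for i, v in zip(b["indices"], b["values"]))


if __name__ == "__main__":
    # Made-up vectors in the same shape as the Results object.
    doc = {"indices": [12, 84, 132], "values": [0.42, 0.31, 0.18]}
    query = {"indices": [12, 132], "values": [1.0, 1.0]}
    print(sparse_dot(query, doc))
```

In a hybrid pipeline this sparse score would be combined with the dense similarity score by whatever fusion method the vector database provides.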