Math: BM25
Embed text as a BM25 sparse vector (indices + values) suitable for hybrid search in vector databases like Qdrant. Pair this with a dense embedding upstream to power a hybrid (dense + sparse) retrieval pipeline.
{
"indices": [12, 84, 132, ...],
"values": [0.42, 0.31, 0.18, ...]
}
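To make the shape above concrete: each entry in indices pairs with the entry at the same position in values, and two sparse vectors are typically compared with a dot product over the indices they share. The sketch below is illustrative only. The helper name sparse_dot and the sample numbers are assumptions, not part of this tool.

```python
def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as indices/values arrays."""
    # Build an index -> value lookup for the second vector.
    b_lookup = dict(zip(b["indices"], b["values"]))
    # Only indices present in both vectors contribute to the score.
    return sum(
        value * b_lookup[index]
        for index, value in zip(a["indices"], a["values"])
        if index in b_lookup
    )


# Made-up sample vectors for illustration.
doc = {"indices": [12, 84, 132], "values": [0.42, 0.31, 0.18]}
query = {"indices": [84, 500], "values": [1.0, 1.0]}

score = sparse_dot(doc, query)  # only index 84 overlaps
```

Because only overlapping indices contribute, a document scores zero against a query with which it shares no terms. That gap is what the dense half of a hybrid pipeline is meant to cover.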
Parameters
Average document length used by the BM25 length-normalisation term. Tune
to the typical token count of your corpus. Defaults to 256.
Length-normalisation parameter — 0.0 disables length normalisation,
1.0 applies it fully. Defaults to 0.0.
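Term-frequency saturation parameter. Higher values delay saturation, so
repeated occurrences of a term keep adding to its weight; lower values cap
the contribution of frequent terms sooner. Typical range 1.2–2.0. Defaults
to 1.2.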
Language hint for tokenisation and stop-word handling. Use detect
(default) to auto-detect, or an ISO 639 code (en, fr, …) when you know
the language up front.
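As a rough illustration of how the three numeric parameters interact, the sketch below computes the standard BM25 term-frequency weight, tf * (k + 1) / (tf + k * (1 - b + b * len / avg_len)), for each token and emits it as indices and values. It is a minimal sketch under stated assumptions, not this tool's implementation. The names bm25_sparse_vector, avg_len, b and k are hypothetical, the tokeniser is a plain regex with no language detection or stop-word removal, token indices come from a CRC32 hash, and the IDF factor is left out.

```python
import re
import zlib
from collections import Counter


def bm25_sparse_vector(text, avg_len=256.0, b=0.0, k=1.2):
    """Hypothetical sketch: map text to {"indices": [...], "values": [...]}."""
    # Assumed tokeniser: lowercase word characters only.
    tokens = re.findall(r"\w+", text.lower())
    doc_len = len(tokens)
    # Length normalisation: 1.0 when b == 0.0, doc_len / avg_len when b == 1.0.
    norm = 1.0 - b + b * (doc_len / avg_len)
    weights = {}
    for token, tf in Counter(tokens).items():
        # Assumed token-to-index mapping: a stable 32-bit hash of the token.
        index = zlib.crc32(token.encode("utf-8"))
        # Saturating term-frequency weight (IDF omitted in this sketch).
        weight = tf * (k + 1.0) / (tf + k * norm)
        weights[index] = weights.get(index, 0.0) + weight
    indices = sorted(weights)
    return {"indices": indices, "values": [weights[i] for i in indices]}


vector = bm25_sparse_vector("cat cat dog")
```

With the defaults shown here (b = 0.0, k = 1.2), a term that appears once gets weight 1.0 and a term that appears twice gets about 1.375 rather than 2.0, which is the saturation effect. Raising b toward 1.0 lowers the weights of documents longer than avg_len and raises those of shorter ones.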
Input
Output
JSON object holding the sparse vector (indices + values arrays).