LLM Integration
The FOLIO Python Library can route searches through any LLM provider supported by alea-llm-client — OpenAI, Anthropic, Grok, Google, vLLM, or Together — to turn a natural-language query into a ranked list of FOLIO classes. This is semantic search: it catches matches that fuzzy, prefix, or definition search would miss because the user’s words don’t overlap with the ontology’s labels. It is async, returns structured scores against a candidate pool you control, and is cost-aware via universal effort and tier knobs that auto-translate to each provider’s reasoning and service-tier parameters.
How it works
At construction time, FOLIO() tries to auto-initialize self.llm to an OpenAIModel(model="gpt-5.4") instance from alea_llm_client. When OpenAI credentials are present, f.llm is ready to use — you do not have to wire anything up. When credentials are missing or alea-llm-client isn’t installed, the initialization fails silently (a warning is logged) and f.llm stays None. The LLM, once set, is consumed by two methods:
- `search_by_llm(query, search_set, …)` — score a single candidate pool against a query.
- `parallel_search_by_llm(query, search_sets=None, …)` — fan out across many pools concurrently via `asyncio.gather`.
Both methods are async coroutines and must be awaited inside an event loop.
A few non-obvious points worth calling out before you see the examples:
- `f.llm` is auto-initialized. It is not `None` by default when credentials are available — there is no “register an LLM” step. Passing `llm=...` only matters if you want to override the default model or pick a different provider.
- No credentials, no search. If the default OpenAI init fails and you don’t pass an `llm=` override, calling `search_by_llm()` raises `RuntimeError("search extra must be installed to use llm search functions: pip install folio-python[search]")`. Set `OPENAI_API_KEY` or pass a different `alea_llm_client` model to get real results.
- `search_by_llm` is async. Calling it without `await` returns a coroutine object, not a list. You’ll see `coroutine 'FOLIO.search_by_llm' was never awaited` if you forget.
- `search_set` is required. It is a positional `List[OWLClass]` parameter, not a default. Calling `f.search_by_llm("query")` raises `TypeError: FOLIO.search_by_llm() missing 1 required positional argument: 'search_set'`.
The LLM never sees the full 18,323-class ontology in a single call. You always pass a candidate pool (one branch, a few subtrees, or the output of a lexical search) and the model scores that pool against the query. That keeps token counts — and cost — under control.
Setup
Default: OpenAI
Set one environment variable and you’re done:
```shell
export OPENAI_API_KEY=sk-...
```

```python
from folio import FOLIO

folio = FOLIO()
print(folio.llm)
# Output:
# <alea_llm_client.llms.models.openai_model.OpenAIModel object at 0x...>
print(folio.llm.model)
# Output:
# gpt-5.4
```

The default model is `gpt-5.4`. If that model is too new for your account or you want something cheaper, override it via the `llm=` parameter (next section) — you don’t need to touch environment variables.
Alternative providers via FOLIO(llm=...)
Pass any alea_llm_client model instance as the llm keyword argument and FOLIO will use it for every LLM search call. Each provider reads its own API-key environment variable (OPENAI_API_KEY, ANTHROPIC_API_KEY, XAI_API_KEY, GOOGLE_API_KEY, etc.) or a per-provider key file under ~/.alea/keys/:
```python
from folio import FOLIO

# Grok — fastest and cheapest in the March 2026 benchmark
from alea_llm_client import GrokModel

folio = FOLIO(llm=GrokModel(model="grok-4-fast-non-reasoning"))

# Google Gemini — effort translates to a provider-specific reasoning knob
from alea_llm_client import GoogleModel

folio = FOLIO(
    llm=GoogleModel(model="gemini-3-flash-preview"),
    effort="low",
)

# Anthropic Claude — effort becomes output_config={'effort': 'low'}
from alea_llm_client import AnthropicModel

folio = FOLIO(llm=AnthropicModel(model="claude-sonnet-4-6"))
```

`alea-llm-client` also ships `VLLMModel` and `TogetherModel` for self-hosted or Together-hosted open-weight models. The FOLIO side of the API is identical regardless of provider.
Required dependencies
All LLM features live behind the [search] extra:
```shell
# uv (recommended)
uv add 'folio-python[search]'

# pip
pip install 'folio-python[search]'
```

That single extra pulls in `alea-llm-client` (for the LLM calls), `rapidfuzz` (for fuzzy search), and `marisa-trie` (for prefix search). If you install the base package without `[search]`, the auto-init of `f.llm` is a no-op and all LLM methods will raise `RuntimeError` the first time you call them.
Effort and tier
Added in version 0.3.1, the effort and tier parameters are universal knobs that map to the right provider-specific kwargs for whatever model you’re using. You set them once at FOLIO() construction time and they flow through to every LLM call via self.llm_kwargs.
effort
Accepts "low", "medium", or "high". Higher effort asks the model for more reasoning time — the exact translation is provider-specific and model-aware:
```python
from folio.graph import get_llm_kwargs
from alea_llm_client import OpenAIModel, AnthropicModel

print(get_llm_kwargs(OpenAIModel(model="gpt-5.4"), effort="low", tier="flex"))
# Output:
# {'reasoning_effort': 'none', 'service_tier': 'flex'}

print(get_llm_kwargs(OpenAIModel(model="gpt-5.4"), effort="high"))
# Output:
# {'reasoning_effort': 'high'}

# Non-reasoning models get an empty dict — effort is silently a no-op
print(get_llm_kwargs(OpenAIModel(model="gpt-4.1-mini"), effort="low", tier="flex"))
# Output:
# {}

print(get_llm_kwargs(AnthropicModel(model="claude-sonnet-4-6"), effort="high"))
# Output:
# {'output_config': {'effort': 'high'}}
```

Avoid `effort="high"` for taxonomy search. The README puts it plainly: “Avoid `effort: "high"` — benchmarks show 5x latency with no quality improvement for structured search tasks.” Structured search is a classification problem, not a reasoning problem; the model spends time thinking when it just needs to score items.
tier
Accepts "flex", "standard", or "priority". This maps to OpenAI’s service_tier parameter and is a no-op on providers that don’t expose a tier control. In order of increasing price and decreasing latency variance:
- `flex` — cheapest; requests may queue behind priority traffic.
- `standard` — normal pricing and latency (the default when `tier` is unset).
- `priority` — premium pricing, most consistent latency.
How llm_kwargs is merged
Both effort and tier are translated via get_llm_kwargs() and then merged into self.llm_kwargs at FOLIO() construction time. The merge order matters: provider-inferred kwargs are the base and any user-supplied llm_kwargs={…} overrides them. This lets you layer your own per-call options on top of the universal effort/tier knobs:
```python
from folio import FOLIO

# Recommended config for cost-effective batch search
folio = FOLIO(effort="low", tier="flex")
print(folio.llm_kwargs)
# Output:
# {'reasoning_effort': 'none', 'service_tier': 'flex'}

# Override with an explicit reasoning_effort while keeping tier=flex
folio = FOLIO(
    effort="low",
    tier="flex",
    llm_kwargs={"reasoning_effort": "minimal"},
)
print(folio.llm_kwargs)
# Output:
# {'reasoning_effort': 'minimal', 'service_tier': 'flex'}
```

Every call to `search_by_llm` passes `**self.llm_kwargs` through to the underlying `alea_llm_client` model, so anything you put in `llm_kwargs` (or inherit from `effort`/`tier`) reaches the provider on every request without you having to plumb it through.
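The merge order can be sketched in plain Python. This is a hypothetical reconstruction of the semantics described above, not the library’s actual code:

```python
# Kwargs inferred from effort/tier form the base; user-supplied
# llm_kwargs win on any conflicting key.
inferred = {"reasoning_effort": "none", "service_tier": "flex"}  # from get_llm_kwargs
user = {"reasoning_effort": "minimal"}                           # from llm_kwargs=...

merged = {**inferred, **user}
print(merged)
# {'reasoning_effort': 'minimal', 'service_tier': 'flex'}
```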
search_by_llm
```python
async def search_by_llm(
    query: str,
    search_set: List[OWLClass],
    limit: int = 10,
    scale: int = 10,
    include_reason: bool = False,
) -> List[Tuple[OWLClass, int | float]]
```

`search_by_llm` scores every class in `search_set` against `query` on a 1-to-`scale` scale, drops irrelevant items, and returns the top `limit` tuples sorted descending by score. It is an `async def` coroutine — you must `await` it from inside an event loop (typically `asyncio.run(...)` at the top level).
A full, runnable example — narrow the candidate pool with a taxonomy helper, then hand it to the LLM:
```python
import asyncio
from folio import FOLIO

folio = FOLIO()

async def main():
    # search_by_llm requires a candidate pool — it never searches the whole
    # ontology by itself. Here we pick the 31 top-level areas of law.
    candidates = folio.get_areas_of_law(max_depth=1)
    results = await folio.search_by_llm(
        "court that hears patent disputes",
        candidates,
        limit=3,
    )
    for owl_class, score in results:
        print(f"{score:3d} {owl_class.label}")

asyncio.run(main())
# Example output (OPENAI_API_KEY set):
#  10 Intellectual Property Law
#   8 Information Technology and Cyber Law
#   6 Information Security Law
```

Two things to notice in that example:
- The candidate pool is 31 classes — just the top-level areas of law (`max_depth=1`). The LLM sees a JSONL-encoded description of each class and assigns a relevance score. The smaller the pool, the cheaper and faster the call.
- No `OPENAI_API_KEY`, no results. If the env var isn’t set when `FOLIO()` is constructed, `f.llm` will be `None` and the `await` raises `RuntimeError("search extra must be installed to use llm search functions: ...")`. Catch it if you want a graceful fallback to lexical search.
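One way to structure that fallback is a small wrapper that catches the `RuntimeError` and degrades to lexical matching. The sketch below uses stand-in functions so it runs anywhere; in real code the stand-ins would be `folio.search_by_llm` and a lexical method such as `folio.search_by_label`:

```python
import asyncio

async def llm_search(query, candidates, limit=5):
    # Stand-in for folio.search_by_llm when f.llm is None: the library
    # raises RuntimeError in that situation.
    raise RuntimeError("search extra must be installed to use llm search functions")

def lexical_search(query, candidates, limit=5):
    # Stand-in for a lexical method: trivial case-insensitive substring match.
    return [c for c in candidates if query.lower() in c.lower()][:limit]

async def search_with_fallback(query, candidates, limit=5):
    try:
        return await llm_search(query, candidates, limit=limit)
    except RuntimeError:
        # No LLM available: degrade to deterministic lexical matching.
        return lexical_search(query, candidates, limit=limit)

labels = ["Patent Law", "Intellectual Property Law", "Tax Law"]
print(asyncio.run(search_with_fallback("patent", labels)))
# ['Patent Law']
```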
If you want a broader candidate pool, chain helpers together:
```python
# Search across all areas of law AND all courts
candidates = (
    folio.get_areas_of_law(max_depth=2)
    + folio.get_forum_venues(max_depth=2)
)
results = await folio.search_by_llm("patent litigation venue", candidates, limit=5)
```

Or build a pool from a lexical search and let the LLM rerank:
```python
# Lexical shortlist → LLM rerank
prelim = [c for c, _ in folio.search_by_label("patent", limit=40)]
results = await folio.search_by_llm("where patent cases are heard", prelim, limit=5)
```

include_reason=True
When you set include_reason=True, each result is a 3-tuple — (OWLClass, score, explanation) — with a short natural-language reason the model produced. This is invaluable for debugging “why did this match?” and for surfacing a justification to end users:
```python
results = await folio.search_by_llm(
    "court that hears patent disputes",
    folio.get_forum_venues(max_depth=2),
    limit=3,
    include_reason=True,
)
for cls, score, reason in results:
    print(f"{score:3d} {cls.label}")
    print(f"    {reason}")
```

scale
The scale parameter (default 10) controls the integer range the LLM uses for scoring. Lower scales produce coarser ratings; higher scales give finer-grained ordering but can confuse models that prefer round numbers. Most users should leave it at the default.
What the LLM actually sees
Under the hood, search_by_llm builds a structured prompt with four parts: a JSONL block of items (one class per line, produced by format_classes_for_llm), a bulleted instruction list, the query itself, and a JSON response schema. The model is asked to respond with {"results": [{"iri": string, "relevance": integer}]} — or with an explanation field added when include_reason=True. Results with IRIs that aren’t in self.iri_to_index are dropped, duplicates are deduped, and the final list is sorted by -relevance before being trimmed to limit. If you want to see exactly what the prompt looks like, call folio.format_classes_for_llm(search_set) directly and print it.
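The post-processing steps can be illustrated in plain Python. This is a reconstruction of the behavior described above, not the library’s exact code, and the IRIs are made up:

```python
# Drop unknown IRIs, dedupe, sort by descending relevance, trim to limit.
iri_to_index = {"iri:A": 0, "iri:B": 1, "iri:C": 2}
raw = [
    {"iri": "iri:A", "relevance": 7},
    {"iri": "iri:Z", "relevance": 9},  # not in iri_to_index: dropped
    {"iri": "iri:B", "relevance": 10},
    {"iri": "iri:A", "relevance": 7},  # duplicate: deduped
]

seen, cleaned = set(), []
for item in raw:
    if item["iri"] in iri_to_index and item["iri"] not in seen:
        seen.add(item["iri"])
        cleaned.append(item)

cleaned.sort(key=lambda item: -item["relevance"])
limit = 2
print([(i["iri"], i["relevance"]) for i in cleaned[:limit]])
# [('iri:B', 10), ('iri:A', 7)]
```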
Troubleshooting
A few common error cases and what they mean:
| Symptom | Cause | Fix |
|---|---|---|
| `TypeError: FOLIO.search_by_llm() missing 1 required positional argument: 'search_set'` | Called without the candidate pool | Pass a `List[OWLClass]` as the second argument |
| `RuntimeError: search extra must be installed to use llm search functions: pip install folio-python[search]` | `f.llm` is `None` — no credentials or extra not installed | Set `OPENAI_API_KEY`, install `folio-python[search]`, or pass `llm=...` explicitly |
| `RuntimeError: Error searching with LLM.` wrapping a provider error | The LLM call itself failed (network, rate limit, schema parse error) | Inspect the traceback; retry with backoff or a smaller `search_set` |
| `coroutine 'FOLIO.search_by_llm' was never awaited` | Called without `await` | Wrap in `async def` and run via `asyncio.run(...)` |
parallel_search_by_llm
```python
async def parallel_search_by_llm(
    query: str,
    search_sets: Optional[List[List[OWLClass]]] = None,
    limit: int = 10,
    scale: int = 10,
    include_reason: bool = False,
    max_depth: int = DEFAULT_SEARCH_MAX_DEPTH,  # = 2
) -> List[Tuple[OWLClass, int | float]]
```

`parallel_search_by_llm` runs one `search_by_llm` per search set concurrently via `asyncio.gather`, then flattens, sorts by score, and trims to `limit`. Use it when you don’t know which FOLIO branch the answer lives in — each branch gets its own small, cheap LLM call instead of one giant prompt covering the whole ontology.
When search_sets is None (the default), the method fans out across all 24 FOLIO branches via folio.get_folio_branches(max_depth=max_depth). At the default max_depth=2 that’s ~2,150 candidate classes spread across 24 parallel calls — broad coverage at modest depth, good for “I have no idea where this lives.”
Explicit search sets
Passing your own search_sets is how you get the best price/quality trade-off. Pick the branches that could plausibly contain the answer and skip the rest:
```python
import asyncio
from folio import FOLIO

folio = FOLIO()

async def search_example():
    results = await folio.parallel_search_by_llm(
        "redline lease agreement",
        search_sets=[
            folio.get_areas_of_law(max_depth=1),
            folio.get_player_actors(max_depth=2),
        ],
    )
    for cls, score in results:
        print(f"{score:3d} {cls.label}")

asyncio.run(search_example())
```

Two search sets, two concurrent LLM calls, one merged and sorted list. The top result should be something like Real Property Law (for the area-of-law side) or Tenant / Lessee (for the actor side), depending on how the model weighs the intent.
When to use parallel vs single
| Situation | Use |
|---|---|
| You already know which branch the answer is in (e.g. “courts,” “areas of law”) | search_by_llm with one narrow pool |
| You’ve run a lexical pre-filter and have a shortlist of ≤100 classes | search_by_llm with the shortlist as the pool |
| You have no idea which branch to look in | parallel_search_by_llm(search_sets=None) — all 24 branches |
| You want 2–3 branches that you think are plausible | parallel_search_by_llm(search_sets=[...]) |
The parallel form trades cost for coverage: 24 LLM calls cost more than one, but total wall-clock latency is bounded by the slowest call, not the sum. In practice, end-to-end parallel searches run in 1–4 seconds depending on the provider (see the benchmark table below).
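The “bounded by the slowest call” property is just how `asyncio.gather` behaves. A minimal sketch with fake per-branch delays (stand-ins for real LLM calls) shows it:

```python
import asyncio
import time

async def fake_branch_call(delay):
    # Stand-in for one per-branch search_by_llm call
    await asyncio.sleep(delay)
    return delay

async def main():
    start = time.perf_counter()
    # Three "branches" with different latencies, run concurrently
    results = await asyncio.gather(*(fake_branch_call(d) for d in (0.05, 0.1, 0.2)))
    elapsed = time.perf_counter() - start
    # Wall clock is roughly the slowest call (0.2s), not the sum (0.35s)
    print(f"{len(results)} calls finished in ~{elapsed:.2f}s")

asyncio.run(main())
```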
Benchmark results
The FOLIO team benchmarked four LLM configurations on 5 legal queries against the FOLIO 2.0.0 ontology in March 2026. These are real numbers copied verbatim from the README:
| Config | Avg Latency | Avg Results | Cost/M input |
|---|---|---|---|
| grok-4-fast-non-reasoning | 1.1s | 4.0 | $0.20 |
| gpt-5.4 effort=low tier=flex | 1.8s | 3.8 | $2.50 |
| gemini-3-flash-preview effort=low | 3.6s | 4.8 | low |
| gpt-4.1-mini | 1.7s | 4.0 | $0.40 |
Reading the table:
- Grok `grok-4-fast-non-reasoning` is the overall winner on both latency (1.1s) and price ($0.20 per million input tokens). If you want “fast and cheap,” this is the default.
- OpenAI `gpt-5.4` with `effort="low"` and `tier="flex"` is the recommended default when you care about quality — slightly slower (1.8s) and pricier, but it produces the tightest, most accurate scoring. This is the config the library encourages via the defaults plus two keyword arguments.
- Google `gemini-3-flash-preview` was slowest in this benchmark (3.6s) but returned the most results on average, which can be useful if you want a wider top-K.
- OpenAI `gpt-4.1-mini` is a budget option — cheap ($0.40) and fast (1.7s) — but note that `effort` is a no-op for this model because it’s not a reasoning model.
And again, from the README: avoid effort="high". The benchmarks show it adds 5x latency with no quality improvement on structured search tasks.
Building prompts manually
If you want to build your own LLM prompts — perhaps for few-shot classification, custom reranking, or a task the library doesn’t expose — format_classes_for_llm() is the helper that turns a list of OWLClass instances into the same JSONL format search_by_llm uses internally:
```python
classes = folio.get_areas_of_law(max_depth=1)[:3]
print(folio.format_classes_for_llm(classes))
```

Each line is a compact JSON object with `iri`, `label`, `preferred_label` (when set), `definition`, `alt_labels`, and `parents`. Empty fields are omitted so the prompt stays short. See the Serialization page for the full format spec, examples with captured output, and tips on combining it with `to_jsonld()` for richer context.
When to use lexical vs LLM search
Lexical search (`search_by_label`, `search_by_definition`, `search_by_prefix`) is fast, deterministic, free, and offline. It runs in microseconds, returns the same results every time, and costs nothing per query — use it when the user already knows roughly what they’re looking for, when you need a deterministic pipeline, or when you’re on a tight latency budget.

LLM search is slow (1–4 seconds), costs real money per query, and is non-deterministic — but it handles synonymy, intent, and cross-domain vocabulary in a way no string-matching algorithm can. Reach for it when lexical methods are missing obvious matches, when the user’s query is a description rather than a name (“court that hears patent disputes”), or when you need to rerank a lexical shortlist with real semantic understanding.

The most cost-effective architecture is a two-stage funnel: run `search_by_label` or `search_by_prefix` first to get a candidate shortlist of 20–100 classes, then hand that shortlist to `search_by_llm` for semantic reranking.
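The two-stage funnel can be sketched with stand-ins for the library calls. Both functions here are toys invented for illustration; in real code stage 1 would be `search_by_label` and stage 2 would be `search_by_llm`:

```python
labels = ["Patent Law", "Patent Court", "Tax Law", "Patent Agent"]

def lexical_shortlist(query, limit=40):
    # Stage 1: cheap, deterministic filter (stand-in for search_by_label)
    return [l for l in labels if query.lower() in l.lower()][:limit]

def semantic_rerank(query, shortlist, limit=2):
    # Stage 2: stand-in for search_by_llm, using a toy scorer that
    # prefers court-related labels for a "where ... heard" query
    scored = [(l, 10 if "Court" in l else 5) for l in shortlist]
    return sorted(scored, key=lambda t: -t[1])[:limit]

shortlist = lexical_shortlist("patent")
print(semantic_rerank("where patent cases are heard", shortlist))
# [('Patent Court', 10), ('Patent Law', 5)]
```

The shape is the point: the expensive semantic step only ever sees the handful of candidates the cheap lexical step let through.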
See also
See also: Searching for the lexical search methods that feed into LLM reranking, Querying for structured filters that can narrow a candidate pool before an LLM call, and Serialization for the format_classes_for_llm helper and related prompt-building utilities.