Searching

The FOLIO Python Library ships four complementary ways to find classes: fuzzy label search (search_by_label) for typo-tolerant lookups by name, fuzzy definition search (search_by_definition) for definition-body matching, prefix search (search_by_prefix) for typeahead and autocomplete, and exact lookups (get_by_label / get_by_alt_label) for when you already know the canonical string. Each is tuned for a different shape of query, so picking the right one keeps results fast and precise. The two fuzzy methods require the [search] extra (uv add 'folio-python[search]' or pip install 'folio-python[search]'); prefix search is faster with the extra but has a pure-Python fallback, and exact lookups work with the base install.

Choosing a search method

Method	Best for	Returns	Requires `[search]`
`search_by_label(label, …)`	Typo-tolerant lookup by name or acronym	`List[Tuple[OWLClass, float]]` ranked by score	Yes
`search_by_definition(definition, …)`	Finding classes whose definition body mentions a phrase	`List[Tuple[OWLClass, float]]` ranked by score	Yes
`search_by_prefix(prefix, …)`	Typeahead, autocomplete, “starts with” lookups	`List[OWLClass]` ranked primary-label first, then length	Recommended (pure-Python fallback available)
`get_by_label(label, …)` / `get_by_alt_label(alt_label, …)`	Exact-match lookup when you already have the canonical string	`List[OWLClass]` (all classes with that exact label)	No

Rule of thumb: start with search_by_prefix for interactive UIs, search_by_label for one-shot lookups of user input, get_by_label for programmatic pipelines where the label is known to be canonical, and search_by_definition only when the label search has failed. For structured combinations of filters (branch + parent + label substring), skip these and use query() on the Querying page instead.

A worked comparison

To make the trade-offs concrete, here is how three of the four methods handle the same query, "new york", against the same ontology (search_by_definition is left out because it matches definitions rather than names):

# Fuzzy label: typo-tolerant, ranked by score, 90-plateau tail
for c, score in f.search_by_label("new york", limit=5):
    print(f"{score:6.2f}  {c.label!r}")
# Output:
# 100.00  'New York'
#  90.00  'New York Supreme Court'
#  90.00  'New York City - Civil Court'
#  90.00  'New York Supreme Court - Appellate Terms'

# Prefix: no score, primary-label first, deterministic ordering
for c in f.search_by_prefix("new york")[:5]:
    print(f"  {c.label!r}")
# Output:
#  'New York'
#  'New York City Court'
#  'New York Town Court'
#  'New York Civil Court'
#  'New York County Court'

# Exact: one canonical class, no ranking
print(f.get_by_label("New York"))
# Output:
# [OWLClass(label='New York', iri='…/RE4Ea9963A08024006374a25', …)]

Three methods, three different shapes of result for the same phrase. The fuzzy label search gives you scored candidates (and tail noise); the prefix search gives you a clean, ordered list suitable for a dropdown; the exact lookup collapses to the single class the user almost certainly meant. Pick the method whose output shape matches what your caller needs to do next.

Fuzzy label search

search_by_label(label, include_alt_labels=True, limit=10) is the general-purpose “find a class by name” method. It runs the query against every primary label and alt label using rapidfuzz.fuzz.WRatio, returns List[Tuple[OWLClass, float]] ranked by score (descending) with length as a tiebreaker, and deduplicates by IRI before truncating to limit.

from folio import FOLIO

f = FOLIO()

for owl_class, score in f.search_by_label("Michigan", limit=5):
    print(f"{score:6.2f}  {owl_class.label!r:45s} {owl_class.iri}")

# Output:
# 100.00  'Michigan'                                      https://folio.openlegalstandard.org/R8BD30978Ccbc4C2f0f8459f
#  90.00  'Ig'                                            https://folio.openlegalstandard.org/R25B7be5E4c8a07E5ba2153f
#  90.00  'U.S. District Court - D. Michigan'             https://folio.openlegalstandard.org/R3FBe0474D8c62BF080588f3
#  90.00  'U.S. Bankruptcy Court - E.D. Michigan'         https://folio.openlegalstandard.org/R3E0603BEE859EAB1d6C3a40

Three things are worth pointing out even in this simple example. First, the exact match scores 100 and comes back first; that’s the common case. Second, because results are deduplicated by IRI before the limit is applied, the returned list can be shorter than limit when multiple labels collapse to the same class. Third, passing include_alt_labels=False restricts matching to primary rdfs:label / skos:prefLabel only, which is useful when alt labels are noisy for your domain.

The include_alt_labels default is True on purpose. The FOLIO 2.0.0 ontology carries a huge number of skos:altLabel entries (court abbreviations, jurisdiction codes, synonym sets, foreign-language translations), and alt-label coverage was substantially improved in v0.3.4 — before that release, roughly 90% of lang-tagged alt labels were invisible to search_by_label. In 0.3.4+ those alt labels are part of the search corpus, so queries like "SDNY", "MICH", or "US+MI" resolve directly to the courts and jurisdictions they abbreviate:

for c, score in f.search_by_label("SDNY", limit=3):
    print(f"{score:6.2f}  {c.label!r}")

# Output:
# 100.00  'U.S. District Court - S.D. New York'
#  90.00  'New York State Courts'
#  90.00  'Sudan'

Set include_alt_labels=False only when you want to force matching against preferred labels exclusively: for example, when normalizing user input to a canonical display label, or when alt labels are producing noise you cannot tolerate. Note the tail of that result as well: Sudan scores 90 because SDN is a substring of the query SDNY. This is the WRatio false-positive behavior that the next section walks through with "contract law".

WRatio false positives on short queries

WRatio is a composite scorer that combines several fuzz ratios and applies heuristics. It is very forgiving, which is usually what you want — but for short queries, it plateaus at score 90 for anything sharing a few characters with your input. That means the ranking stops being meaningful below the top result. Concretely:

for owl_class, score in f.search_by_label("contract law", limit=10):
    print(f"{score:6.2f}  {owl_class.label!r}")

# Output:
# 100.00  'Contract Law'
#  90.00  'Colombia'
#  90.00  'Colorado State Courts'
#  90.00  'Turkey'
#  90.00  "Lao People's Democratic Republic"
#  90.00  'Louisiana State Courts'
#  90.00  'Connecticut State Courts'
#  90.00  'Aruba'
#  90.00  'Confidential Matter Narrative'
#  90.00  'Construction Industry'

Colombia, Turkey, and Lao People's Democratic Republic are obviously not what the user meant by “contract law”. They most likely sit at 90 because of how WRatio treats strings of very different lengths: it scores the best-matching substring with a partial ratio and scales the result by 0.9, so a short label or alt label that appears inside the query lands at exactly 90. The top result is still correct, but do not trust the tail.

If you need precision (for example, to populate a filtered dropdown or run a bulk classification), use query(label="contract law") from the Querying page for a plain substring match, or set limit=1 and ignore anything below 100.
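The "ignore anything below 100" advice can be applied as a small post-filter. The sketch below is illustrative only: exact_score_only is a hypothetical helper, not part of the library, and the sample data uses plain label strings in place of OWLClass objects.

```python
from typing import Any, List, Tuple


def exact_score_only(
    results: List[Tuple[Any, float]], min_score: float = 100.0
) -> List[Any]:
    """Keep only the classes whose fuzzy score reaches min_score."""
    return [owl_class for owl_class, score in results if score >= min_score]


# Stand-in data shaped like search_by_label output
# (label strings instead of OWLClass objects).
sample = [("Contract Law", 100.0), ("Colombia", 90.0), ("Turkey", 90.0)]
print(exact_score_only(sample))  # ['Contract Law']
```

With a real instance the call would look like exact_score_only(f.search_by_label("contract law")). Lowering min_score lets the 90-plateau tail back in, so there is little middle ground between 100 and 90 for short queries.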

Fuzzy definition search

search_by_definition(definition, limit=10) runs the query against every class’s definition field (skipping classes without one) using rapidfuzz.fuzz.partial_token_set_ratio. This scorer is designed to find scattered query tokens inside a larger body of text, which is exactly what you want for matching against multi-sentence definitions — but it has the same “too forgiving” tendency as WRatio, and it is even more aggressive because partial token set ratio ignores token order entirely.

for owl_class, score in f.search_by_definition("court of appeals", limit=3):
    print(f"{score:6.2f}  {owl_class.label!r}")
    if owl_class.definition:
        print(f"        {owl_class.definition[:100]}...")

# Output:
# 100.00  'Confectionery Merchant Wholesalers'
#         This industry comprises establishments primarily engaged in the merchant wholesale distribution...
# 100.00  'Deep Sea, Coastal, and Great Lakes Water Transportation'
#         This industry comprises establishments primarily engaged in providing deep sea, coastal, Great ...
# 100.00  'Other Miscellaneous Nondurable Goods Merchant Wholesalers'
#         This industry comprises establishments primarily engaged in the merchant wholesale distribution...

Every result scores 100 and none of them is a court. The scorer has latched onto incidental tokens scattered across long NAICS-style industry descriptions. This is the nature of partial_token_set_ratio: as soon as the query and a definition share a whole token, even a common word such as "of", the score saturates at 100.

Recommendation: for definition lookup, prefer query(any_text=..., match_mode="substring") or query(definition=..., match_mode="fuzzy"), both documented on the Querying page. search_by_definition is still useful as a last-resort, wide-net recall tool, but treat a score of 100 as “this definition shares at least one token with your query” rather than as a similarity ranking.
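If you do use search_by_definition as a wide net, one way to tighten it is to re-check the hits yourself. The sketch below keeps only results whose definition contains every query token as a whole word. The helper names are hypothetical, the two definitions are made-up stand-ins, and SimpleNamespace objects take the place of OWLClass.

```python
import re
from types import SimpleNamespace


def _tokens(text: str) -> set:
    """Casefolded word tokens, ignoring punctuation."""
    return set(re.findall(r"\w+", text.casefold()))


def keep_full_token_matches(results, query):
    """Drop (class, score) pairs whose definition lacks any query token."""
    wanted = _tokens(query)
    return [
        (owl_class, score)
        for owl_class, score in results
        if owl_class.definition and wanted <= _tokens(owl_class.definition)
    ]


# Made-up stand-ins for search_by_definition output.
hits = [
    (SimpleNamespace(label="Wholesalers",
                     definition="Wholesale distribution of goods."), 100.0),
    (SimpleNamespace(label="Appellate Court",
                     definition="A court of appeals reviews lower courts."), 100.0),
]
kept = keep_full_token_matches(hits, "court of appeals")
print([c.label for c, _ in kept])  # ['Appellate Court']
```

This is a whole-word check only; it does no stemming, so "appeal" would not match "appeals".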

Prefix search

search_by_prefix(prefix, case_sensitive=False) is backed by a marisa-trie index over every primary label and alt label in the ontology. It is the right tool for typeahead widgets, autocomplete menus, and any “starts with” lookup. Results come back as List[OWLClass] — there is no score — sorted so that primary-label matches rank ahead of alt-label matches, then by ascending label length, and deduplicated by IRI.

As of v0.3.5 (2026-04-08), search_by_prefix is case-insensitive by default. A parallel lowercase trie is built at load time (using str.casefold() for Unicode-safe folding) so lowercase, mixed-case, and uppercase queries all match. Passing case_sensitive=True falls back to the original behavior if you need it — see below for when that matters.
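To make the casefolded-index idea concrete, here is a minimal stdlib sketch that mimics the described behavior with a sorted list and bisect instead of a marisa-trie. It is not the library's implementation; Entry, build_index, and prefix_search are hypothetical names and the IRIs are placeholders.

```python
from bisect import bisect_left
from typing import Iterable, List, NamedTuple, Sequence, Tuple


class Entry(NamedTuple):
    key: str          # casefolded label used for matching
    iri: str
    label: str        # primary label of the owning class
    is_primary: bool  # True when key came from the primary label


def build_index(classes: Iterable[Tuple[str, str, Sequence[str]]]) -> List[Entry]:
    """classes yields (iri, primary_label, alt_labels) triples."""
    entries = []
    for iri, label, alt_labels in classes:
        entries.append(Entry(label.casefold(), iri, label, True))
        for alt in alt_labels:
            entries.append(Entry(alt.casefold(), iri, label, False))
    return sorted(entries)  # sorted by key first, so prefixes are contiguous


def prefix_search(index: List[Entry], prefix: str) -> List[str]:
    folded = prefix.casefold()
    keys = [e.key for e in index]  # rebuilt per call; fine for a sketch
    hits = []
    for entry in index[bisect_left(keys, folded):]:
        if not entry.key.startswith(folded):
            break
        hits.append(entry)
    # Primary-label matches first, then shorter labels.
    hits.sort(key=lambda e: (not e.is_primary, len(e.label)))
    seen, labels = set(), []
    for entry in hits:  # deduplicate by IRI, keeping the best-ranked hit
        if entry.iri not in seen:
            seen.add(entry.iri)
            labels.append(entry.label)
    return labels


index = build_index([
    ("iri:michigan", "Michigan", ["US+MI"]),
    ("iri:mich-supreme", "Michigan Supreme Court", ["MICH"]),
    ("iri:dui", "Driving Under the Influence", ["DUI"]),
])
print(prefix_search(index, "MiCh"))  # ['Michigan', 'Michigan Supreme Court']
print(prefix_search(index, "dui"))   # ['Driving Under the Influence']
```

str.casefold() is used rather than str.lower() because it also folds characters such as "ß" to "ss", which keeps non-ASCII labels matchable.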

Default (case-insensitive)

for owl_class in f.search_by_prefix("Mich")[:5]:
    print(f"{owl_class.label!r:30s} {owl_class.iri}")

# Output:
# 'Michigan'                     https://folio.openlegalstandard.org/R8BD30978Ccbc4C2f0f8459f
# 'Michoacan de Ocampo'          https://folio.openlegalstandard.org/R9F9550531bbb547c4724478
# 'Michigan State Courts'        https://folio.openlegalstandard.org/RA3C143EB06fcBA8f410Fd50
# 'Michigan Circuit Court'       https://folio.openlegalstandard.org/R9CPJK9GY42oQZFTPL3DyA1
# 'Michigan Supreme Court'       https://folio.openlegalstandard.org/R8372b3AC7127F95f5238a85

Notice that Michigan comes first — not Michigan Supreme Court, even though Michigan Supreme Court also has an alt label MICH that matches the prefix. That’s the 0.3.5 primary-label ordering at work. Before 0.3.5, alt-label matches could outrank their own preferred class.

The case-insensitivity is the headline change. Here is the proof:

for owl_class in f.search_by_prefix("dui"):
    print(f"{owl_class.label!r:40s} {owl_class.iri}")

# Output:
# 'Driving Under the Influence'            https://folio.openlegalstandard.org/RB0QemTkd6XZOLDEO42pdIw
# 'Boating DUI/BUI'                        https://folio.openlegalstandard.org/RCMeywGHqtDlkLxwnDrD0CF

The query "dui" is lowercase. The alt label in the ontology is DUI (uppercase). Before 0.3.5, this call returned zero results because the old case-sensitive trie had no key starting with dui. Now it returns both Driving Under the Influence, whose alt label starts with DUI, and Boating DUI/BUI. The latter's primary label does not start with DUI, so it presumably matches through an alt label as well. You get both without having to uppercase user input or maintain your own folding layer.

A longer prefix can still cover a lot of ontology territory. securit pulls back several Securities/Security classes in one call:

for owl_class in f.search_by_prefix("securit")[:8]:
    print(f"{owl_class.label!r:55s}")

# Output:
# 'Securities Fraud'
# 'Security Deposit'
# 'Securities Expert'
# 'Security Incident'
# 'Security Agreement'
# 'Securities Law Claims'
# 'Securitization Practice'
# 'Security Deposit Clause'

Case-sensitive mode

Setting case_sensitive=True disables the lowercase trie and matches labels exactly as they appear in the ontology. This is occasionally the right move — specifically, when you are targeting acronym-shaped alt labels like MICH, TAX, or CAL and you want to exclude the dozens of mixed-case labels that would otherwise match:

for owl_class in f.search_by_prefix("MICH", case_sensitive=True):
    print(f"{owl_class.label!r:40s} {owl_class.iri}")

# Output:
# 'Michigan Supreme Court'                 https://folio.openlegalstandard.org/R8372b3AC7127F95f5238a85
# 'U.S. District Court - D. Michigan'      https://folio.openlegalstandard.org/R3FBe0474D8c62BF080588f3
# 'Michigan Court of Appeals'              https://folio.openlegalstandard.org/RCD1eef57D69a4E97a3BCeb6

Every result here is a class with an alt label literally starting with the four uppercase letters MICH. Compare that against the case-insensitive call, which also pulls in Michoacan, Michigan, and so on. For a general-purpose search box, you want the case-insensitive default. For looking up a citator abbreviation or a jurisdiction code, case_sensitive=True is the precise tool.
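One possible way to combine both modes behind a single search box is to switch to case-sensitive matching only when the input looks like an acronym. This is a heuristic sketch, not library behavior: prefix_lookup is a hypothetical helper, and StubFolio only echoes its arguments so the routing is visible without loading the ontology.

```python
def prefix_lookup(folio, text: str):
    """Route all-uppercase input (e.g. 'MICH', 'US+MI') to case-sensitive mode."""
    looks_like_acronym = len(text) >= 2 and text.isupper()
    return folio.search_by_prefix(text, case_sensitive=looks_like_acronym)


class StubFolio:
    """Stand-in that echoes its arguments instead of searching."""

    def search_by_prefix(self, prefix, case_sensitive=False):
        return (prefix, case_sensitive)


stub = StubFolio()
print(prefix_lookup(stub, "MICH"))  # ('MICH', True)
print(prefix_lookup(stub, "Mich"))  # ('Mich', False)
```

The trade-off is that a user typing a full name in capitals, such as "MICHIGAN", would also be routed to case-sensitive mode, so a fallback to the default mode on empty results may be worth adding.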

About `MIN_PREFIX_LENGTH`

The module exports a MIN_PREFIX_LENGTH = 3 constant (at folio.graph.MIN_PREFIX_LENGTH). It is used internally at index-build time to decide which labels are short enough to skip — but it is not enforced at query time by search_by_prefix. You can still call the method with a 1- or 2-character prefix, and you will get a lot of results:

from folio.graph import MIN_PREFIX_LENGTH

print(MIN_PREFIX_LENGTH)                       # 3
print(len(f.search_by_prefix("a")))            # 2931
print(len(f.search_by_prefix("ab")))           # 108

Treat MIN_PREFIX_LENGTH as advisory: your UI layer should still impose a 3-character minimum on user input before calling search_by_prefix, unless you specifically want the runaway result sets (which is rarely the case for interactive typeahead). The library will not do this for you.
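A UI-side guard along these lines might look like the sketch below. typeahead and StubFolio are hypothetical, and MIN_CHARS simply mirrors the documented value of MIN_PREFIX_LENGTH so the example runs without the library installed; in real code you would import the constant from folio.graph.

```python
MIN_CHARS = 3  # mirrors the documented folio.graph.MIN_PREFIX_LENGTH value


def typeahead(folio, text: str, max_results: int = 10):
    """Return at most max_results suggestions, or nothing for short input."""
    prefix = text.strip()
    if len(prefix) < MIN_CHARS:
        return []  # skip the runaway result sets for 1- or 2-character input
    return folio.search_by_prefix(prefix)[:max_results]


class StubFolio:
    """Stand-in with a tiny label list instead of the real ontology."""

    labels = ["Michigan", "Michigan State Courts", "Minnesota"]

    def search_by_prefix(self, prefix, case_sensitive=False):
        folded = prefix.casefold()
        return [l for l in self.labels if l.casefold().startswith(folded)]


stub = StubFolio()
print(typeahead(stub, "mi"))    # []
print(typeahead(stub, "mich"))  # ['Michigan', 'Michigan State Courts']
```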

Pure-Python fallback

When marisa_trie is not installed — i.e., you did not install the [search] extra — search_by_prefix falls back to a plain Python filter over the label and alt-label dicts with identical semantics (same case handling, same primary-first ordering, same deduplication). It is noticeably slower on the full 18,323-class ontology but still works end-to-end, so prefix search is usable without the optional dependency. Install [search] whenever you can; the trie build happens once at load and is then practically free per query.
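If you want to know at startup which path you will get, you can check whether the optional module is importable. This sketch uses only the standard library and assumes the import name marisa_trie mentioned above; has_trie_backend is a hypothetical helper.

```python
import importlib.util


def has_trie_backend() -> bool:
    """True when the optional marisa_trie module can be imported."""
    return importlib.util.find_spec("marisa_trie") is not None


if not has_trie_backend():
    # Prefix search should still work, just on the slower fallback path.
    print("marisa_trie not found; install the [search] extra for faster prefix search")
```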

Exact lookup

get_by_label(label, include_alt_labels=False) and get_by_alt_label(alt_label, include_hidden_labels=True) are not fuzzy searches. They are dictionary lookups: they return every class whose label (or alt label) matches the query string exactly. No scoring, no ranking, no tolerance for case or whitespace. When you already know the canonical string — because it came from a database, a previous API call, or a curated list — these are faster and more predictable than any of the search methods above.

# Exact lookup by primary label
print(f.get_by_label("Michigan"))
# -> [OWLClass(label='Michigan', iri='…/R8BD30978Ccbc4C2f0f8459f', …)]

# Exact lookup by alt label
print(f.get_by_alt_label("US+MI"))
# -> [OWLClass(label='Michigan', iri='…/R8BD30978Ccbc4C2f0f8459f', …)]

Both methods return a List[OWLClass] because a single label string can, in principle, belong to multiple classes — for example, short acronyms are often shared across jurisdictions. In practice most lookups return a single-element list.

The `include_alt_labels` / `include_hidden_labels` flags

get_by_label has an include_alt_labels: bool = False parameter. When True, it also searches the alt-label index for the same string, which lets you look up a class by its acronym or alias without switching to get_by_alt_label:

# Default: only the primary-label index is consulted
print(f.get_by_label("MICH"))
# -> []

# include_alt_labels=True also searches alt labels
for c in f.get_by_label("MICH", include_alt_labels=True):
    print(c.label, c.iri)
# -> Michigan Supreme Court https://folio.openlegalstandard.org/R8372b3AC7127F95f5238a85

get_by_alt_label mirrors this with include_hidden_labels: bool = True (defaulting the other way): it searches the alt-label index first and, unless you opt out, falls back to the primary-label index for the same string. Together, the two methods give you precise control over which index you hit first:

# Alt label in uppercase — present in the alt-label index
for c in f.get_by_alt_label("MICH"):
    print(c.label, "<-", c.alternative_labels)
# Output:
# Michigan Supreme Court <- ['MICH']

# include_hidden_labels=False restricts to the alt-label index only
# 'Turkey Production' is only a primary rdfs:label, not an alt label,
# so the strict call returns nothing:
print(f.get_by_alt_label("Turkey Production", include_hidden_labels=False))
# -> []

The pair is designed so that get_by_label starts at the primary index and get_by_alt_label starts at the alt index, and you decide whether to let each one spill over into the other. In day-to-day use the defaults are almost always right. Note that since v0.2.1, skos:prefLabel values are indexed alongside rdfs:label in the primary index, so prefLabel-only classes are now correctly reachable from both methods.

When to use exact lookup

Prefer get_by_label / get_by_alt_label over the fuzzy methods when:

You have a known canonical label that came from the ontology itself (e.g., a previous OWLClass.label you cached)
You are resolving an alt-label acronym or identifier such as MICH, SDNY, or a jurisdiction code like US+MI
You want deterministic results with no scoring — either the class exists under that label or it does not
You are looking up a label inside a hot loop where even the rapidfuzz overhead matters

For anything less precise — user-typed queries, misspellings, partial strings — search_by_label or search_by_prefix are the right entry points.
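The two entry points can be chained so that a pipeline tries the cheap exact lookup first and only falls back to fuzzy search when it finds nothing. The sketch below is one possible arrangement, not a library feature: resolve is a hypothetical helper, and StubFolio imitates only the method signatures described on this page.

```python
def resolve(folio, text: str):
    """Try an exact lookup first, then fall back to the top fuzzy hit."""
    exact = folio.get_by_label(text, include_alt_labels=True)
    if exact:
        return exact[0], "exact"
    fuzzy = folio.search_by_label(text, limit=1)
    if fuzzy:
        owl_class, _score = fuzzy[0]
        return owl_class, "fuzzy"
    return None, "none"


class StubFolio:
    """Stand-in: strings play the role of OWLClass objects."""

    labels = {"Michigan": "class:michigan"}
    alt_labels = {"US+MI": "class:michigan"}

    def get_by_label(self, label, include_alt_labels=False):
        hits = [self.labels[label]] if label in self.labels else []
        if include_alt_labels and label in self.alt_labels:
            hits.append(self.alt_labels[label])
        return hits

    def search_by_label(self, label, include_alt_labels=True, limit=10):
        # Pretend fuzzy search: case-insensitive containment only.
        folded = label.casefold()
        found = [(c, 90.0) for l, c in self.labels.items() if folded in l.casefold()]
        return found[:limit]


stub = StubFolio()
print(resolve(stub, "US+MI"))   # ('class:michigan', 'exact')
print(resolve(stub, "michig"))  # ('class:michigan', 'fuzzy')
print(resolve(stub, "zzz"))     # (None, 'none')
```

Returning how the match was made lets the caller decide whether a fuzzy hit needs confirmation before it is trusted.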

Performance and caching

search_by_prefix results are memoized on two private dicts on the FOLIO instance: _prefix_cache for case-sensitive calls and _ci_prefix_cache for case-insensitive calls. The case-insensitive cache is keyed by prefix.casefold(), so "cont", "Cont", and "CONT" all resolve to the same cache entry. The first call for a given prefix walks the trie (or the pure-Python fallback) and builds the result list; subsequent calls are a dict lookup:

import time

t0 = time.perf_counter()
first  = f.search_by_prefix("Cont")
t1 = time.perf_counter()
cached = f.search_by_prefix("Cont")
t2 = time.perf_counter()
print(f"first:  {(t1-t0)*1e6:7.1f} us  ({len(first)} results)")
print(f"cached: {(t2-t1)*1e6:7.1f} us  ({len(cached)} results)")

# Output (indicative, cold trie):
# first:   1586.4 us  (174 results)
# cached:     3.7 us  (174 results)

That’s a speedup of roughly 430x, between two and three orders of magnitude. For typeahead UIs issuing a prefix call on every keystroke, this means each distinct prefix pays the search cost only once per FOLIO instance; any repeat (after a backspace, or when another user types the same characters) is near-instantaneous. Both caches are cleared by f.refresh() so you will not see stale results after re-downloading the ontology.
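The casefold-keyed cache can be pictured with a few lines of standard-library code. This is a simplified model rather than the library's code: PrefixMemo and slow_search are hypothetical, and only the case-insensitive cache is modeled.

```python
from typing import Callable, Dict, List


class PrefixMemo:
    """Simplified model of a prefix cache keyed by prefix.casefold()."""

    def __init__(self, search: Callable[[str], List[str]]) -> None:
        self._search = search
        self._ci_cache: Dict[str, List[str]] = {}
        self.misses = 0

    def lookup(self, prefix: str) -> List[str]:
        key = prefix.casefold()
        if key not in self._ci_cache:  # first call for this prefix
            self.misses += 1
            self._ci_cache[key] = self._search(key)
        return self._ci_cache[key]

    def refresh(self) -> None:
        self._ci_cache.clear()  # drop cached results, e.g. after a reload


LABELS = ["Contract Law", "Contempt of Court", "Tort Law"]


def slow_search(folded_prefix: str) -> List[str]:
    return [l for l in LABELS if l.casefold().startswith(folded_prefix)]


memo = PrefixMemo(slow_search)
for typed in ("cont", "Cont", "CONT"):
    memo.lookup(typed)
print(memo.misses)  # 1
```

All three spellings share one cache entry, so only the first of them triggers the underlying search.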

The fuzzy search methods have a different caching story. search_by_label and search_by_definition are not wrapped in caches themselves, but they delegate to a static helper FOLIO._basic_search(query, search_list, limit, search_type) that is decorated with functools.cache. That cache is keyed on all of the helper's arguments, including the full tuple of labels (or definitions) passed in, so repeated calls with the same query and the same limit / include_alt_labels settings hit the cache even across instances. In practice the savings are smaller than the prefix cache because rapidfuzz is already fast, but you do not pay twice for the same query inside a single process.
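The cross-instance behavior follows from how functools.cache works on a static method: the cache lives on the function, not on any instance. The sketch below shows that with a hypothetical Searcher class and a trivial containment search standing in for rapidfuzz. The search list has to be a tuple because cache keys must be hashable.

```python
from functools import cache


class Searcher:
    @staticmethod
    @cache
    def _basic_search(query: str, search_list: tuple, limit: int) -> tuple:
        # Trivial stand-in for fuzzy scoring: case-insensitive containment.
        hits = [s for s in search_list if query.casefold() in s.casefold()]
        return tuple(hits[:limit])


labels = ("Contract Law", "Contract Clause", "Tort Law")
a, b = Searcher(), Searcher()

a._basic_search("contract", labels, 10)  # miss: computed and stored
b._basic_search("contract", labels, 10)  # hit: same arguments, other instance
info = Searcher._basic_search.cache_info()
print(info.hits, info.misses)  # 1 1

Searcher._basic_search.cache_clear()  # empties the shared cache
print(Searcher._basic_search.cache_info().currsize)  # 0
```

Changing any argument, including limit, produces a separate cache entry.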

If memory pressure ever becomes a concern (e.g., a long-running service that issues many unique queries), you can clear the fuzzy-search cache explicitly with FOLIO._basic_search.cache_clear() and the prefix caches by setting f._prefix_cache = {} / f._ci_prefix_cache = {}. Under normal workloads none of these caches grows large enough to matter.