2009/11/17

Devoxx 2009: Full Text Search for Hibernate

17/11/2009, University sessions, Emmanuel Bernard

Search solutions:
  • categorize upfront
  • show detailed search screen
  • use single search box (preferred)
Plain SQL search limits:
  • performance: like '%...%' causes a full table scan
  • no support for approximate matching or synonyms
  • no proximity concept
  • no relevance scoring
  • no simple multi-column search
Full-text search solutions:
  • word based
  • captures / indexes frequency and position
  • solutions:
    • RDBMS (like Oracle Text):
      • less flexible
      • not portable (vendor-specific API and behavior)
    • standalone: Lucene
      • text-only
      • no synchronization with model objects
Hibernate Search, general features:
  • LGPL
  • uses Hibernate core
  • uses Lucene under the hood
  • solves object vs text mismatch
  • converts objects to text documents (and back) → the Hibernate application uses objects, not text documents
  • convention over configuration
  • heavily built on annotations
  • optimizes Lucene access:
    • update Lucene docs on commit
    • object graphs are consolidated to single Lucene docs to provide relevant searches
    • avoid flooding Lucene indexer:
      • batch Lucene updates on commit
      • optionally trigger the Lucene indexer asynchronously
    • supports clustering (JMS)
Hibernate Search Annotations:
  • @Indexed
  • @Field: the conversion to text is tunable with, among others, @FieldBridge, e.g. to convert a number to a zero-padded string.
  • @IndexedEmbedded
  • @Boost: promote a particular field in the relevance score (can be at indexing time or at query time)
  • @Analyzer: e.g. anagram-support
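Why the zero-padding mentioned under @FieldBridge matters: index terms are compared as text, so "9" sorts after "10" unless both are padded to the same width. The following is a minimal sketch of that conversion in plain Java. It does not use the Hibernate Search API; the class `PaddedNumber` and its methods `pad` and `unpad` are hypothetical names chosen for illustration, and the fixed width of 5 is an assumption.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Hibernate Search: shows the kind of
// object-to-text conversion a field bridge could perform for numbers.
final class PaddedNumber {
    private PaddedNumber() {}

    /** Left-pads a non-negative number with zeros to a fixed width. */
    static String pad(long value, int width) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values are not handled in this sketch");
        }
        String digits = Long.toString(value);
        if (digits.length() > width) {
            throw new IllegalArgumentException("value does not fit in " + width + " digits");
        }
        StringBuilder sb = new StringBuilder(width);
        for (int i = digits.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(digits).toString();
    }

    /** Reverse direction: turns the indexed text back into a number. */
    static long unpad(String text) {
        return Long.parseLong(text);
    }

    public static void main(String[] args) {
        // Unpadded strings sort as text: "10" and "100" come before "9".
        String[] raw = {"9", "10", "100"};
        Arrays.sort(raw);
        System.out.println(Arrays.toString(raw));

        // Padded strings sort in the same order as the numbers themselves.
        String[] padded = {pad(9, 5), pad(10, 5), pad(100, 5)};
        Arrays.sort(padded);
        System.out.println(Arrays.toString(padded));
    }
}
```

With equal-width strings, text ordering and numeric ordering agree, which is what makes range queries and sorting on such a field behave as expected.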
Lucene Index as used by Hibernate Search:
  • event based
  • batches updates per transaction (=at commit time)
  • sync or async mode (optimizes Lucene's locking mechanism)
Query:
  • HQL
  • Full Text (Lucene syntax), e.g. with the ~ operator
  • JPA2 criteria
  • native SQL
  • → always returns Objects, not Lucene documents.
Advanced stuff:
  • tokenizer: splits text into words, removes common words
  • complex searches: combination of indexing and querying
  • fuzzy search:
    • “Levenshtein distance”: quantifies similarity
    • “n-gram”: the word is split into groups of 3 letters → the number of matching groups determines the score (the demo looked a bit hacky)
  • phonetic search (soundex-like): disappointing in practice
  • synonyms: use your application-specific list
  • stemming: → 'reduction' of words to their root form
    • Porter Algorithm
    • Snowball stemmer
  • filters: provide efficient and pluggable support for
    • security, categories, temporal data, caching...
  • “explain” query result
  • clustering / Scaling Lucene
    • one Lucene writer at a given time
    • use a JMS queue for indexing (→ 'Master') → small delay, but very scalable.
    • Distributed in-memory index (Infinispan 4.0) – technical preview
    • index optimizations:
      • sharding
      • defragmenting or re-indexing
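The two fuzzy search ideas from the notes above can be sketched in plain Java. This is only an illustration of the concepts, not how Lucene or Hibernate Search implement them: the class `FuzzySketch` and its methods are hypothetical, and counting shared 3-letter groups is an assumed, simplified stand-in for the real relevance score.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical illustration class, not part of Lucene or Hibernate Search.
final class FuzzySketch {
    private FuzzySketch() {}

    /** Levenshtein distance: minimum number of single-character
     *  insertions, deletions or substitutions to turn a into b. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building the first j letters of b from an empty prefix
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i letters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insert, delete
                        prev[j - 1] + cost);                    // substitute or keep
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Splits a word into its distinct overlapping groups of 3 letters. */
    static Set<String> trigrams(String word) {
        Set<String> grams = new LinkedHashSet<>();
        for (int i = 0; i + 3 <= word.length(); i++) {
            grams.add(word.substring(i, i + 3));
        }
        return grams;
    }

    /** Assumed toy score: how many 3-letter groups the two words share. */
    static int sharedTrigrams(String a, String b) {
        Set<String> shared = new HashSet<>(trigrams(a));
        shared.retainAll(trigrams(b));
        return shared.size();
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("hibernate", "hybernate"));
        System.out.println(trigrams("hibernate"));
        System.out.println(sharedTrigrams("hibernate", "hybernate"));
    }
}
```

For the misspelling "hybernate", the edit distance to "hibernate" is 1, and 5 of the 7 trigrams still match, so either measure would rank it as a close candidate.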

1 comment:

green tea said...

Hibernate Search integrates transparently with Hibernate, the object/relational (O/R) mapping and persistence engine, with little to no configuration (past specifying what entities to index). With advanced features such as query filter and index sharding, Hibernate Search can be embedded into user applications.