Episode 117: Full Text Search

Faceoff Show

Apr 19, 2011•34 min

Episode description

Add enterprise-level search to your site.

News and Follow-ups – 01:00

Geek Tools – 14:13

Yikerz! – Super fun magnet game

Webapps – 16:12

Surfboard – Flipboard as a web app
InstaLyrics – Find lyrics quickly

Full Text Search – 22:11

Options
- Google Custom Search
  - Commercial
  - Benefits
    - Super fast to set up
    - Easy to implement
    - Ability to add AdSense to search results
  - Downsides
    - No way to adjust content ranking or do custom integration
    - Mainly for indexing HTML pages, not for running search queries over other text
- Sphinx
  - “Searching via SphinxAPI is as simple as 3 lines of code, and querying via SphinxQL is even simpler, with search queries expressed in good old SQL.”
  - Open source with commercial support
  - Result relevance ranking is the default. You can set up your own sorting should you wish, and give specific fields higher weightings.
  - The search service daemon (searchd) is pretty low on memory usage – and you can set limits on how much memory the indexer process uses too.
  - API for:
    - Java, PHP, Python, Ruby, Perl, C, and other languages.
  - Written in C++
  - Stats
    - Indexing: 60+ MB/sec per server
    - Searching: 500+ queries/sec
    - The biggest known Sphinx cluster indexes 5 billion documents, resulting in over 6 TB of data. The busiest known one is, unsurprisingly, Craigslist, which serves 50+ million search queries/day.
  - Companies using Sphinx
    - Craigslist
    - Slashdot
    - Mozilla
    - WordPress.org
- Lucene
  - Developed by the Apache Software Foundation
  - Open source
  - Written in Java
  - Search types
    - ranked searching — best results returned first
    - many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
    - fielded searching (e.g., title, author, contents)
    - date-range searching
    - sorting by any field
    - multiple-index searching with merged results
    - allows simultaneous update and searching
  - Stats
    - indexing speed of over 95GB/hour on modern hardware
    - small RAM requirements — only 1MB heap
    - index size roughly 20-30% the size of text indexed
- Solr
  - Lucene is a library, whereas Solr is a search server built on Lucene that supports XML and REST-style HTTP requests
  - Benefits over Sphinx
    - Solr is easily embeddable in Java applications.
    - Solr can be integrated with Hadoop to build distributed applications
    - Solr can index proprietary formats like Microsoft Word, PDF, etc. Sphinx can’t.
  - Companies using Solr
    - eHarmony
    - Ticketmaster
    - Digg
    - AOL
    - Zappos
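
Because Solr runs as a server rather than a library, a search is just an HTTP request. The sketch below only builds the request URL for Solr's `/select` handler using its standard `q`, `rows`, `wt`, and `fl` parameters. The base URL, the `title` field, and the helper name `build_solr_select_url` are assumptions for illustration; the path on a real install depends on how Solr is deployed.

```python
from urllib.parse import urlencode


def build_solr_select_url(base_url, query, rows=10, fields=None):
    """Build a URL for Solr's /select request handler.

    base_url is an assumption (for example a local Solr instance);
    q, rows, wt and fl are standard Solr query parameters.
    """
    params = {"q": query, "rows": rows, "wt": "json"}
    if fields:
        # fl limits which stored fields come back in each result
        params["fl"] = ",".join(fields)
    return f"{base_url.rstrip('/')}/select?{urlencode(params)}"


# Hypothetical host and field names, for illustration only.
url = build_solr_select_url(
    "http://localhost:8983/solr",
    "title:search",
    rows=5,
    fields=["id", "title"],
)
print(url)
```

Fetching that URL with any HTTP client returns the matching documents, which is why Solr is straightforward to call from languages other than Java.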

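The ranked, fielded searching described for Sphinx and Lucene above can be illustrated with a toy inverted index. This is a minimal sketch of the general idea only, not how either engine scores results: the field weights, sample documents, and function names are all hypothetical, and real engines use far more sophisticated ranking.

```python
from collections import defaultdict

# Hypothetical weights: a title hit counts three times as much as a body hit.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}


def tokenize(text):
    """Lowercase and split on whitespace (no stemming or punctuation handling)."""
    return text.lower().split()


def build_index(docs):
    """Map each term to {doc_id: {field: occurrence_count}}."""
    index = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            for term in tokenize(text):
                index[term][doc_id][field] += 1
    return index


def search(index, query, weights=FIELD_WEIGHTS):
    """Return (doc_id, score) pairs, highest score first."""
    scores = defaultdict(float)
    for term in tokenize(query):
        for doc_id, field_counts in index.get(term, {}).items():
            for field, count in field_counts.items():
                scores[doc_id] += weights.get(field, 1.0) * count
    # Sort by descending score, then by doc_id to keep ties stable.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Made-up sample documents.
docs = {
    1: {"title": "Full text search", "body": "Notes on adding search to a site"},
    2: {"title": "Magnet games", "body": "A search for fun"},
}
index = build_index(docs)
results = search(index, "search")
print(results)
```

Document 1 ranks first because the query term appears in its more heavily weighted title as well as its body, which is the effect of giving specific fields higher weightings.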