Tokenized and Untokenized Fields

We have already briefly touched on the tokenization of search fields. Tokenization breaks the indexed data into units called tokens, and is performed by an analyzer. Different analyzers behave differently: some break tokens at whitespace, some at punctuation, and so on. The resulting tokens are also usually lowercased. For tokenized fields, query strings are tokenized in the same way, so searches are case insensitive, among other benefits.
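One way to picture an analyzer is as a function from text to a list of tokens. The sketch below is plain Python, not Lucene's actual StandardAnalyzer (the `analyze` name is made up for illustration): it splits at runs of whitespace and punctuation and lowercases each token.

```python
import re

def analyze(text):
    """Minimal analyzer sketch: split at runs of non-word characters
    (whitespace, punctuation), drop empties, lowercase each token."""
    return [tok.lower() for tok in re.split(r"[^\w]+", text) if tok]

# Both indexed text and query strings go through the same analyzer,
# which is what makes tokenized searches case insensitive.
analyze("Joyce Way, Parkwest - Dublin")
# -> ['joyce', 'way', 'parkwest', 'dublin']
```

Note that the punctuation and the original capitalisation are gone: only the normalised tokens end up in the index.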

For some fields it doesn't make sense to tokenize. Good examples are computer-generated values, such as codetable codes. In general, however, most of your fields should be tokenized. In particular, the behaviour of multi-word untokenized fields and searches is counterintuitive. If your searches are not returning the data you expect, consider whether an untokenized field may be the cause.

Example: Take an address field, with a document containing "Joyce Way Parkwest Dublin". If this were a tokenized field using the standard analyzer, the index would contain four terms: joyce, way, parkwest and dublin. Any query string containing terms that match these terms (exactly or via a wildcard) will find this document, for instance "Dublin", "Joyce Way" or "park*".
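The matching behaviour for the tokenized case can be sketched as follows. This is an illustrative model, not Lucene's query machinery; `analyze` and `matches` are made-up names, and wildcard matching is approximated with `fnmatchcase` (the split pattern keeps `*` so wildcard tokens survive):

```python
import re
from fnmatch import fnmatchcase

def analyze(text):
    # Lowercase and split at runs of non-word characters, keeping '*'
    # so a wildcard query token like "park*" survives tokenization.
    return [t.lower() for t in re.split(r"[^\w*]+", text) if t]

# Terms produced by indexing the example document.
index_terms = set(analyze("Joyce Way Parkwest Dublin"))

def matches(query):
    # The document matches if every query token matches some indexed
    # term, exactly or as a wildcard pattern.
    return all(any(fnmatchcase(term, tok) for term in index_terms)
               for tok in analyze(query))

matches("Dublin")     # True: case folded to the indexed term
matches("Joyce Way")  # True: both tokens are in the index
matches("park*")      # True: wildcard matches 'parkwest'
matches("Cork")       # False: no such term indexed
```

Because the query goes through the same analyzer as the document, "DUBLIN", "dublin" and "Dublin" all behave identically.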

However, if this field is untokenized and the same document is added, the index will contain a single term: "Joyce Way Parkwest Dublin". Far fewer query strings will match it: essentially only the full string itself, or a leading part of it as a prefix search. The search will also be case sensitive.
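The untokenized behaviour reduces to whole-string comparison against the single stored term. A minimal sketch (the `untokenized_match` helper is hypothetical, for illustration only):

```python
# The entire field value is stored as one term, original case preserved.
stored_term = "Joyce Way Parkwest Dublin"

def untokenized_match(query, prefix=False):
    """Compare the query against the whole stored term: an exact
    match, or a leading-substring match for a prefix search. Nothing
    is lowercased, so the comparison is case sensitive."""
    return stored_term.startswith(query) if prefix else stored_term == query

untokenized_match("Dublin")                      # False: not the whole term
untokenized_match("Joyce Way Parkwest Dublin")   # True: exact match
untokenized_match("Joyce Way", prefix=True)      # True: prefix search
untokenized_match("joyce way", prefix=True)      # False: case sensitive
```

Contrast this with the tokenized case, where "Dublin" alone would have found the document.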