posts tagged search | /dev/notes

Date Range with Null Search in Solr 4.x

To query for a date range in Solr, use the syntax:

field_name:[Start TO Finish]

You can also use the * wildcard and constants such as NOW as either bound, e.g.:

[NOW TO *] or [* TO *]

To search for documents that have no value for that date field, i.e. the field is NULL, use the syntax:

-field_name:[* TO *]

It is harder, though, to search for documents whose date is EITHER NULL OR lies in a specific range.

It would seem logical to specify

date_field:[Start TO Finish] OR -date_field:[* TO *]

Unfortunately, Solr does not appear to support combining clauses on the same field in this way. A likely cause is that a purely negative clause nested inside an OR matches nothing on its own.

So the trick is to effectively query for everything you do not want, then negate the result.

This approach says: select anything that has a value for the date field but does not fall within the range. When you invert that result, you get all the documents that either sit inside the date range or are NULL.

The query to weave this magic is:

-(-date_field:[Start TO Finish] AND date_field:[* TO *])
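
As a rough sketch of the trick above, the query string can be assembled by a small helper. The function names (`date_range_or_null`, `matches`) and the sample values are made up for illustration; the second function only replays the same boolean logic on plain Python values to show that the negation really does select "in range OR NULL".

```python
def date_range_or_null(field: str, start: str, finish: str) -> str:
    """Build a Solr query for docs whose `field` is in range or missing."""
    in_range = f"{field}:[{start} TO {finish}]"
    has_value = f"{field}:[* TO *]"
    # Select what we do NOT want (has a value, but outside the range),
    # then negate the whole expression.
    return f"-(-{in_range} AND {has_value})"


def matches(value, start, finish) -> bool:
    """Replay the same logic on a plain value, where None stands for NULL."""
    in_range = value is not None and start <= value <= finish
    has_value = value is not None
    return not ((not in_range) and has_value)


# Hypothetical field name and bounds, for illustration only.
query = date_range_or_null("date_field", "2013-01-01T00:00:00Z", "NOW")
print(query)

# Integers stand in for dates here; only None and 15 should survive.
kept = [v for v in [None, 5, 15, 25] if matches(v, 10, 20)]
print(kept)
```

The helper does no escaping or validation of the bounds, so it is only suitable for values you control.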

Zend_Search_Lucene Indexing Fun

When working with a Lucene index using the Zend Framework's Lucene search component, you'll often want to update documents over the course of the index's lifecycle. This can prove tricky with the current implementation, as there is no in-place update feature: you must first delete the old document and then add a new one. The tricky part is locating the unique document you want to update. The 'old way' was as follows:

// Retrieve documents with the find() method using a query string
$query = $idFieldName . ':' . $docId;
$hits  = $index->find($query);
foreach ($hits as $hit) {
    $title    = $hit->title;
    $contents = $hit->contents;
}

This proves _painfully_ slow: you're loading the full index in an attempt to find a single document by its ID. It is even worse if your unique ID happens to be a string such as a URL or path. Since ZF 1.5, the 'best practice' way to perform this type of task is to use the Zend_Search_Lucene::termDocs() method:

// Look up document IDs directly by an exact term (text first, then field name)
$term   = new Zend_Search_Lucene_Index_Term('/somepath/somewhere', 'path');
$docIds = $index->termDocs($term);
foreach ($docIds as $id) {
    $doc      = $index->getDocument($id);
    $title    = $doc->title;
    $contents = $doc->contents;
}

Performance-wise, this proves much more efficient. However, unless you're careful at the indexing stage, you may run into trouble when running termDocs() on a string value such as a URL or path, as opposed to an integer ID. The cause is the field having been added as a tokenized field. This is the most common way fields are added, and corresponds to:

$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Text('title', $title));

If you want to use termDocs() on an identifying field, you need to add the field as type Keyword:

$doc = new Zend_Search_Lucene_Document();
// Keyword() takes the field name first, then the value
$doc->addField(Zend_Search_Lucene_Field::Keyword('uri', 'http://a.com/uri'));

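Keyword fields are not tokenized, so the whole value is indexed as a single term that termDocs() can match exactly. A tokenized Text field is split into several terms, so no term equal to the full URL or path exists in the index. The distinction between the two field types is documented in Zend_Search_Lucene_Field's phpdocs: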

Zend_Search_Lucene_Field::Text() constructs a String-valued Field that is tokenized and indexed, and is stored in the index, for return with hits. Useful for short text fields, like "title" or "subject". Term vector will not be stored for this field.

In contrast, see:

Zend_Search_Lucene_Field::Keyword() constructs a String-valued Field that is not tokenized, but is indexed and stored. Useful for non-text fields, e.g. date or url.
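
To see why tokenization matters for an exact-term lookup, here is a toy inverted index. It is only an illustration, not Zend's implementation: the `ToyIndex` class, its method names, and the simple "split on non-alphanumerics" tokenizer are all assumptions made for the sketch.

```python
import re
from collections import defaultdict


class ToyIndex:
    """A tiny inverted index mapping (field, term) to a set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.next_id = 0

    def add(self, field, value, tokenized):
        doc_id = self.next_id
        self.next_id += 1
        # Assumed tokenizer: tokenized fields are split into lowercase
        # alphanumeric terms; keyword fields are indexed as a single term.
        if tokenized:
            terms = re.findall(r"[a-z0-9]+", value.lower())
        else:
            terms = [value]
        for term in terms:
            self.postings[(field, term)].add(doc_id)
        return doc_id

    def term_docs(self, field, term):
        # Exact term lookup: the lookup text itself is never analyzed.
        return sorted(self.postings.get((field, term), set()))


index = ToyIndex()
text_id = index.add("path_text", "/somepath/somewhere", tokenized=True)
keyword_id = index.add("path_keyword", "/somepath/somewhere", tokenized=False)

# The tokenized field holds "somepath" and "somewhere", never the full path.
print(index.term_docs("path_text", "/somepath/somewhere"))
print(index.term_docs("path_text", "somepath"))
# The keyword field holds the full path as one term, so the lookup succeeds.
print(index.term_docs("path_keyword", "/somepath/somewhere"))
```

In this sketch the full-path lookup against the tokenized field comes back empty, which mirrors the symptom described above when termDocs() is run against a Text field.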

This caught me out until I dug around the source to see where termDocs() was going wrong. Hopefully this helps save someone else some time, and hopefully Zend can update their documentation to draw other developers' attention to this quirk.