Friday, May 1, 2009

Search Engine Death Match - Solr Wins (Pt. 2)

The key to integrating Solr and removing Verity in our model was to abstract the application's search functionality. The process started by creating a 'CollectionManager' which held the methods needed to interact with the Verity server through cfsearch, and consolidated all calls to 'cfsearch' through this service.

Once that was done and working fine, we were ready to dive into Solr, figure out the nuances of the product, and ultimately create a 'SolrManager' to replace the CollectionManager. Since we were using ColdSpring, this was a mere class path change in coldspring.xml, and that was it (I love ColdSpring).
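For illustration, the swap in coldspring.xml might look something like this (the bean id and package paths here are hypothetical, not our actual names):

```xml
<!-- Before: the Verity-backed implementation -->
<!-- <bean id="searchService" class="com.app.search.CollectionManager" /> -->

<!-- After: same bean id, Solr-backed implementation -->
<bean id="searchService" class="com.app.search.SolrManager" />
```

Because every consumer asks ColdSpring for 'searchService' rather than instantiating a component directly, nothing else in the application has to change.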

Solr Data-Config:

I'll get a more formalized presentation together about swapping out Verity in favor of Solr, but in general, our data lived in a SQL database rather than XML files, so we used 'data-config.xml' to define the main query that populates the index (each Solr instance has a single index/collection, vs. Verity where one server holds many collections), as well as delta and delete queries for updating and removing collection data through JDBC. This data-config.xml file is simply an XML file in which you write SQL to select your columns based on some criteria.
Note: it's best to have created, lastmodified, id, and active columns in the database table you wish to index, as these help Solr determine what data has changed and what has been deleted. Hint: keep these queries simple and use a 'view' if your SQL is at all complex; the view hides the complexity, makes the file easier to read, and abstracts the data so you can change the view w/o having to change the data-config.xml file each time. It's also worth pointing out that the 'server time' should be exactly the same on both your database and Solr machines (if not on the same box), as Solr internally stores the last time it indexed your data in a value referenced in your SQL as '${dataimporter.last_index_time}'.
** If you have a 'smalldatetime' value for lastmodified, where SQL will 'round' the seconds of the field, you may run into issues where Solr won't pick up on changes that have happened since its last_index_time as compared to your database timestamps.
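Putting those pieces together, a data-config.xml following this pattern might look like the sketch below. The view, column, and connection details are hypothetical stand-ins; the entity attributes (query, deltaQuery, deltaImportQuery, deletedPkQuery) are standard DataImportHandler configuration:

```xml
<dataConfig>
  <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
              url="jdbc:sqlserver://dbhost;databaseName=jobs"
              user="solr" password="secret"/>
  <document>
    <!-- 'vwJobSearch' is a view hiding the underlying join complexity -->
    <entity name="job" pk="id"
            query="SELECT id, title, description, location FROM vwJobSearch WHERE active = 1"
            deltaQuery="SELECT id FROM vwJobSearch
                        WHERE lastmodified &gt; '${dataimporter.last_index_time}'"
            deltaImportQuery="SELECT id, title, description, location FROM vwJobSearch
                              WHERE id = '${dataimporter.delta.id}'"
            deletedPkQuery="SELECT id FROM vwJobSearch WHERE active = 0"/>
  </document>
</dataConfig>
```

The full import runs 'query', while a delta-import runs deltaQuery to find changed ids, deltaImportQuery to re-fetch those rows, and deletedPkQuery to find rows to drop from the index.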

Solr Schema:
This is basically the data definition of your index, held in a single schema.xml. You define your 'fields', which map to the columns from the data-config, along with their datatypes. This is very flexible, including the ability to 'group' multiple fields together as a single field (copyField) as well as to define the default field searched when none is specified in the query (i.e. a search for 'MN jobs' could search job titles, job descriptions, locations, etc.).
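A minimal schema.xml sketch for a job index like the one described (field names are illustrative; 'contents' is an assumed catch-all field populated via copyField):

```xml
<fields>
  <field name="id"          type="string" indexed="true" stored="true" required="true"/>
  <field name="title"       type="text"   indexed="true" stored="true"/>
  <field name="description" type="text"   indexed="true" stored="true"/>
  <field name="location"    type="text"   indexed="true" stored="true"/>
  <!-- catch-all field: not stored, only searched -->
  <field name="contents"    type="text"   indexed="true" stored="false" multiValued="true"/>
</fields>

<uniqueKey>id</uniqueKey>

<!-- group several fields into one searchable field -->
<copyField source="title"       dest="contents"/>
<copyField source="description" dest="contents"/>
<copyField source="location"    dest="contents"/>

<!-- field searched when the query doesn't name one -->
<defaultSearchField>contents</defaultSearchField>
```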

Solr Config:
There was a one-line change here just to tell Solr to use the data-config for its data:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>

Wayne Graham has published an open source component on RIAForge called SolManager to assist with the low-level cfhttp calls. I've added this as a helper to my main 'SolrManager' component, and I just call it to make the cfhttp calls to Solr and then do what I need with the data it returns.
Since the data is returned as XML, and the 'schema' of the data defines which columns/nodes are returned, I then have full control over the result. I convert the XML to a query using queryNew(), xmlParse(), and xmlSearch(), and for our purpose, as long as the query is in the same format as one returned by a cfsearch against Verity, the application won't know the difference.
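A rough sketch of that conversion (the function name and column list are hypothetical, and the field-node structure assumed here is the standard Solr XML response, where each result row is a &lt;doc&gt; element whose children are named typed values):

```cfm
<cffunction name="solrXmlToQuery" returntype="query" output="false">
    <cfargument name="solrXml" type="string" required="true" />
    <cfset var xml = xmlParse(arguments.solrXml) />
    <!--- each <doc> node under <result> is one search hit --->
    <cfset var docs = xmlSearch(xml, "//result/doc") />
    <cfset var cols = "id,title,score" />
    <cfset var result = queryNew(cols) />
    <cfset var fields = 0 />
    <cfset var i = 0 />
    <cfset var f = 0 />
    <cfloop from="1" to="#arrayLen(docs)#" index="i">
        <cfset queryAddRow(result) />
        <!--- children look like <str name="id">42</str>, <float name="score">1.3</float> --->
        <cfset fields = docs[i].xmlChildren />
        <cfloop from="1" to="#arrayLen(fields)#" index="f">
            <cfif listFindNoCase(cols, fields[f].xmlAttributes.name)>
                <cfset querySetCell(result, fields[f].xmlAttributes.name, fields[f].xmlText) />
            </cfif>
        </cfloop>
    </cfloop>
    <cfreturn result />
</cffunction>
```

From there you just make sure the column names match what the application expects from its old Verity cfsearch results.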

FAST! Solr does its own caching and warming of searches and results, and from my metrics, depending on the size of the XML result set (which can EASILY be limited with a '&rows=x' attribute on the search query string), the performance difference is negligible, if not better across the board for searches.

There's a lot more to discuss, such as search syntax, replication, the admin dashboard, etc., but I'll leave a few stones unturned for you to dive in and get your feet wet. They have good documentation on the site, as well as an extremely easy and efficient Jetty package that you can start up, run a few commands against through the tutorial on their site, and get a feel for what to expect.

All in all, we had reached the point where Verity was NO LONGER AN OPTION or a viable solution for our company and was requiring nearly hourly babysitting, so this solution was implemented JIT (Just In Time) and has saved us from feeling the effects of daily search stress.

Have fun and stay tuned for a full run down with code samples at MN CFUG soon.

1 comment:

ADelgadillo said...

Hi, how did you manage to convert the XML to a query? I've been going at it for days now; maybe you can give some tips on how to do it.