The Apache Solr project is an open source technology that wraps the Lucene search engine as a stand-alone web service. Lucene/Solr provides a lot of awesome functionality out of the box but, for those customers with special search needs, here are the three most common and effective ways to customize version 4 of Solr.
Both Lucene and Solr are written in Java. All of these customizations involve coding classes in Java, placing the resulting JAR file in the appropriate folder on the server, and configuring your custom classes in the Solr configuration XML file. The classes for these types of customizations run in the JVM of Solr itself.
Other types of customizations, not covered here, can be written in any language and run outside of Solr itself. These customizations are usually connectors that populate and synchronize your Solr index with other datastores. Solr does ship with a data import handler, but it is not recommended for large data sets.
Solr exposes its functionality as web service calls and is built on the Java Servlet API, so it requires a servlet container to run in. Which folder to copy your JAR file to depends on which servlet container you use. Both Resin and Tomcat have configurable common library folder locations. If you are using Jetty, you will need to repackage the Solr WAR file with your custom classes.
In all likelihood, Solr can already handle just about any possible query that you would need to run. If you have an elaborate, complex, or dynamic scoring algorithm or if you need real-time filtering by some data outside the index, then you may need to add your own custom function to Solr.
To implement your own custom function, you will need to subclass three classes: ValueSourceParser, ValueSource, and FunctionValues.
The ValueSourceParser is a factory that parses user queries to generate ValueSource instances. You will need to override the parse method, and you will most likely want to implement the init method too. At parse time, use the FunctionQParser to resolve the fields you need from the index schema, and hold on to their value sources for later use in your FunctionValues subclass.
Your subclass of ValueSource is responsible for instantiating function values for a particular query. You will need to implement getValues, equals, hashCode, and description. The first method returns the payload that you care most about and contains mostly glue code. The next two methods are important because they work with Solr's query results cache. The last method isn't very important and is used mostly for debug logging purposes.
FunctionValues represents field values as different types. FunctionValues is distinct from ValueSource because an object needs to be created at query evaluation time that is not referenced by the query itself: query objects should be thread safe and are often used as keys for caching, so you don't want the query carrying around big objects. The important methods to override are floatVal and doubleVal; one is used for scoring and the other for frange filtering. You will also become quite familiar with AtomicReader and ValueSource. Never access fields from the document directly; always go through the ValueSource objects that you obtained in your ValueSourceParser. Getting field values from the document always goes to disk, whereas going through the ValueSource leverages Solr's three caches for better performance.
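As a minimal sketch of how the three subclasses fit together, here is a hypothetical function that doubles a numeric value. It assumes the Solr 4 function-query APIs; the class and function names (DoubleItParser, DoubleItValueSource, doubleit) are made up for illustration and this will only compile against the Solr/Lucene 4 jars.

```java
import java.io.IOException;
import java.util.Map;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.queries.function.FunctionValues;
import org.apache.lucene.queries.function.ValueSource;
import org.apache.lucene.queries.function.docvalues.DoubleDocValues;
import org.apache.solr.search.FunctionQParser;
import org.apache.solr.search.SyntaxError;
import org.apache.solr.search.ValueSourceParser;

// Hypothetical custom function: doubleit(field) returns 2 * field.
public class DoubleItParser extends ValueSourceParser {

    @Override
    public ValueSource parse(FunctionQParser fp) throws SyntaxError {
        // Resolve the argument's ValueSource at parse time and hold on to it.
        return new DoubleItValueSource(fp.parseValueSource());
    }

    static class DoubleItValueSource extends ValueSource {
        private final ValueSource source;

        DoubleItValueSource(ValueSource source) { this.source = source; }

        @Override
        public FunctionValues getValues(Map context, AtomicReaderContext readerContext)
                throws IOException {
            final FunctionValues vals = source.getValues(context, readerContext);
            return new DoubleDocValues(this) {
                @Override
                public double doubleVal(int doc) {
                    // doubleVal serves frange filtering; DoubleDocValues
                    // derives floatVal (used for scoring) from it.
                    return 2.0 * vals.doubleVal(doc);
                }
            };
        }

        // equals and hashCode matter: they are keys into the query results cache.
        @Override
        public boolean equals(Object o) {
            return o instanceof DoubleItValueSource
                    && source.equals(((DoubleItValueSource) o).source);
        }

        @Override
        public int hashCode() {
            return DoubleItValueSource.class.hashCode() ^ source.hashCode();
        }

        // Mostly for debug logging.
        @Override
        public String description() {
            return "doubleit(" + source.description() + ")";
        }
    }
}
```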
You tell Solr about your new custom function by adding a valueSourceParser tag to the solrconfig.xml file. You can also include configuration information there that gets passed to your custom function at init time. Custom functions can be referenced in a query in two places: in the sort parameter, and inside an frange call in a filter query parameter.
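As a sketch, assuming a hypothetical function named doubleit implemented by a class com.example.DoubleItParser, the registration might look like this:

```xml
<!-- solrconfig.xml: register a hypothetical custom function -->
<valueSourceParser name="doubleit" class="com.example.DoubleItParser">
  <!-- child configuration here is passed to init() -->
  <float name="scale">2.0</float>
</valueSourceParser>
```

The function could then appear in a sort (`sort=doubleit(price) desc`) or in a filter query (`fq={!frange l=0 u=100}doubleit(price)`).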
The select and update endpoints are the most often used request handlers in Solr, but there are many others (e.g. replication and administration) and you can add your own.
To do that, subclass RequestHandlerBase and override the handleRequestBody, getSource, and getDescription methods. If you have ever written a servlet, then you should know what to do in your implementation of handleRequestBody; the APIs are not the same, but the concept is similar. The other two methods are for debugging and diagnostics.
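A minimal sketch, assuming the Solr 4 request-handler APIs; the class name, parameter, and URL are hypothetical, and this only compiles against the Solr 4 jars:

```java
import org.apache.solr.handler.RequestHandlerBase;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;

// Hypothetical handler that echoes a request parameter back to the caller.
public class EchoRequestHandler extends RequestHandlerBase {

    @Override
    public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp)
            throws Exception {
        // Read from the request and write into the response,
        // much as a servlet does with its request and response objects.
        String message = req.getParams().get("message", "hello");
        rsp.add("echo", message);
    }

    // The remaining methods are for debugging and diagnostics.
    @Override
    public String getDescription() {
        return "Echoes the message parameter back to the caller";
    }

    @Override
    public String getSource() {
        return "https://example.com/repo/EchoRequestHandler.java";
    }
}
```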
You let Solr know about your new request handler by adding a requestHandler section to the Solr configuration file. The name is the relative path in the URI for your endpoint, and the class attribute is the fully qualified class name for Solr to load. The defaults list gets parsed and passed into the init method.
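For example, a hypothetical handler class com.example.EchoRequestHandler mounted at /echo might be configured like this:

```xml
<!-- solrconfig.xml: mount a hypothetical custom handler at /echo -->
<requestHandler name="/echo" class="com.example.EchoRequestHandler">
  <lst name="defaults">
    <!-- parsed and passed into init() -->
    <str name="message">hello</str>
  </lst>
</requestHandler>
```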
Search requests in Solr are not monolithic processes. They are composed of a chain of components that you can customize in the Solr configuration file.
Here you must subclass SearchComponent and override the prepare, process, getSource, and getDescription methods. The prepare methods of all components in the chain are called before any process method, so prepare is the place to do request-dependent initialization. The process method is where you do the actual work of handling the request. The other two methods are for debugging and diagnostics.
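A minimal sketch, assuming the Solr 4 search-component APIs; the class name and response key are hypothetical, and this only compiles against the Solr 4 jars:

```java
import java.io.IOException;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

// Hypothetical component that appends a summary section to each response.
public class SummaryComponent extends SearchComponent {

    @Override
    public void prepare(ResponseBuilder rb) throws IOException {
        // Request-dependent setup; runs before any component's process().
    }

    @Override
    public void process(ResponseBuilder rb) throws IOException {
        // Runs after the query component, so only the rows being
        // returned are visible here, not the entire result set.
        rb.rsp.add("summary",
                "returned " + rb.getResults().docList.size() + " rows");
    }

    // The remaining methods are for debugging and diagnostics.
    @Override
    public String getDescription() {
        return "Adds a summary section to each response";
    }

    @Override
    public String getSource() {
        return "https://example.com/repo/SummaryComponent.java";
    }
}
```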
The most common way to extend search functionality is to prepend or append your own custom component to this chain. For example, if you wanted to include a summary report in each request, you would write your SearchComponent subclass, configure it in a searchComponent section, then reference it in the last-components area of the requestHandler configuration. Be advised that your custom search component will see only the rows to be returned, not the entire result set of the search.
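Assuming a hypothetical component class com.example.SummaryComponent, the configuration might look like this:

```xml
<!-- solrconfig.xml: append a hypothetical component to the /select chain -->
<searchComponent name="summary" class="com.example.SummaryComponent"/>

<requestHandler name="/select" class="solr.SearchHandler">
  <arr name="last-components">
    <str>summary</str>
  </arr>
</requestHandler>
```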
Lucene/Solr is a full-featured search engine, and it is rare to need to extend its search capabilities. If you do find yourself needing something not available in vanilla Solr, then functions, request handlers, and search components are three basic ways to extend Solr with your own custom Java code to handle special search requirements.
1. Don’t make another code generation wizard.
Most MDSD systems hit the mainstream packaged as some kind of application development wizard. The developer answers a set of questions about the desired application, and the code for a starter application that roughly fits the GUI requirements is generated. The motivation behind this approach is that the developer feels productive right away without really learning much about the framework on which the generated application depends. This generation is a one-time-only deal: the developer cannot tweak the model and re-generate. As developers add more of their own code, use of the wizard declines until it is never used. In the end, progress on the application will stall until the developer does learn the entire framework. The wizard approach may be a useful onboarding tool, or helpful in making the sale, but in the long run it contributes only minimally to overall productivity.
2. Avoid an overly complicated meta-model.
In MDSD, a generator parses a model, based on a predetermined schema or meta-model, and pours that model into a set of templates in order to generate the output code in the target system. If you try to develop a general-purpose MDSD system, then you will be tempted to make a very complicated meta-model. Don't succumb to that temptation. If the meta-model is overly complex, the cognitive overhead required to author any model in its language will tempt developers to abandon MDSD and just write the code themselves. The easiest solution is to make the generator domain specific, perhaps as part of the core assets of a software product line. There are advanced techniques for developing a general-purpose MDSD with a relatively simple meta-model, but that topic is beyond the scope of this blog.
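The parse-a-model, pour-into-templates flow can be sketched in miniature. This toy is my own illustration, not code from any real MDSD system: the "meta-model" is just field name to type, and the template is a plain string with placeholders.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A toy generator: a flat model (field name -> type) is poured into a
// template to emit one field declaration per model entry.
public class ToyGenerator {
    // The template for a single generated line, with ${...} placeholders.
    static final String TEMPLATE = "    private ${type} ${name};\n";

    public static String generate(String className, Map<String, String> model) {
        StringBuilder out = new StringBuilder("public class " + className + " {\n");
        for (Map.Entry<String, String> field : model.entrySet()) {
            // Fill the template from the model entry.
            out.append(TEMPLATE
                    .replace("${type}", field.getValue())
                    .replace("${name}", field.getKey()));
        }
        return out.append("}\n").toString();
    }

    public static void main(String[] args) {
        Map<String, String> model = new LinkedHashMap<>();
        model.put("title", "String");
        model.put("pages", "int");
        System.out.print(generate("Book", model));
    }
}
```

A real generator differs mainly in scale: the meta-model constrains which models are valid, and the templates cover whole files rather than single lines.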
3. Provide a custom model editor.
In all likelihood, the model will be written in one of the following formats: XML, YAML, or JSON. If you leave it to developers to use a general-purpose editor for one of these formats, there will be massive abandonment, because it will be too easy for a developer to go astray and end up with a non-compliant model that results in a buggy and unstable system. Be sure to factor in the time needed to build a validating model editor with which developers can construct their models more easily and with fewer errors. Feel free to provide wizards in the model editor that accelerate productivity in specifying the model.
4. Be able to import models from a wide variety of sources.
Developers may already have a favorite modeling tool, so be prepared to import documents written in other tools. The model editor I wrote can import UML models from Eclipse and RDFS models from Protege, as well as reverse engineer a model from a pre-existing relational database schema.
5. Don’t edit the output by hand.
As time progresses and your MDSD gains adoption, it is going to be extremely tempting to just add that one line change to the output in order to satisfy some new requirement. Don’t do it. Instead, take the time to figure out how your MDSD system can model that new requirement properly. As soon as you start editing the output, you will stop using the MDSD system to create new releases. At that point, you are back to the diminishing returns of an app wizard. The vast majority of the work needed in any successful application goes into the subsequent releases so this is where a well designed MDSD system can really shine.
6. When writing the generator and template code, be sure to pick a programming language that is unlikely to be used as the target language.
7. The output should be just as maintainable as if you were writing it by hand.
The build process may start with the model and end up with the code but the debugging process starts with the code and ends up with changes to the templates and/or model. If the code is hard to comprehend, then it will be very hard to debug. Developers despise monolithic spaghetti code no matter how it was written.
8. It’s not just code that should be generated. It’s markup, resources, styling, and configuration too.
9. Give your templates some skin.
Skinning an app is a design approach in which the actual GUI is somewhat decoupled from business rules, data access, and binding. CSS may be a necessary, but not sufficient, means of achieving this decoupling for web apps. For Google Closure-style apps, that might mean maintaining different Soy files. I chose not to use Soy to generate markup for other reasons, so instead I created an abstract class called Skin that declared various methods for rendering the basic GUI components, with subclasses that provided the actual implementations of those methods. Examples include rendering an input text field versus a select field with menu items for capturing hour of the day, or an anchor tag versus a link button for navigation.
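The shape of that Skin hierarchy might look something like the sketch below. The class and method names are hypothetical, not the actual ones from my generator, and real implementations would emit markup through the templates rather than plain strings:

```java
// An abstract Skin declares the rendering methods; concrete subclasses
// choose the actual markup, so templates never hard-code a widget style.
abstract class Skin {
    abstract String hourField(String name);            // capture hour of day
    abstract String navLink(String href, String label); // navigation control
}

// One skin renders a free-form text input and a plain anchor tag.
class PlainSkin extends Skin {
    String hourField(String name) {
        return "<input type=\"text\" name=\"" + name + "\"/>";
    }
    String navLink(String href, String label) {
        return "<a href=\"" + href + "\">" + label + "</a>";
    }
}

// Another skin renders a select menu and a link button instead.
class ButtonSkin extends Skin {
    String hourField(String name) {
        StringBuilder sb = new StringBuilder("<select name=\"" + name + "\">");
        for (int h = 0; h < 24; h++) {
            sb.append("<option>").append(h).append("</option>");
        }
        return sb.append("</select>").toString();
    }
    String navLink(String href, String label) {
        return "<button onclick=\"location.href='" + href + "'\">" + label + "</button>";
    }
}

public class SkinDemo {
    // Generated templates render against the abstract Skin,
    // so swapping the skin swaps the markup everywhere.
    static String renderNav(Skin skin) {
        return skin.navLink("/home", "Home");
    }

    public static void main(String[] args) {
        System.out.println(renderNav(new PlainSkin()));
        System.out.println(renderNav(new ButtonSkin()));
    }
}
```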
10. Decouple the back end data store from the front end GUI.
Not everyone who is interested in generating Google Closure apps will want to use XMPP. Decouple the code that is responsible for the GUI from the code that is responsible for communicating with the back end data store. I made this decoupling pluggable and configurable so that developers can mix and match Closure with XMPP or AJAX.
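One way to sketch that plug point, with entirely hypothetical names and stubbed-out transports standing in for the real XMPP and AJAX wiring:

```java
// The generated GUI code talks only to this interface; a configuration
// choice selects which transport implementation gets plugged in.
interface DataStore {
    String fetch(String key);
}

// Stub standing in for an AJAX transport (would issue an XHR).
class AjaxDataStore implements DataStore {
    public String fetch(String key) { return "ajax:" + key; }
}

// Stub standing in for an XMPP transport (would send a stanza).
class XmppDataStore implements DataStore {
    public String fetch(String key) { return "xmpp:" + key; }
}

public class DataStoreDemo {
    // A trivial factory standing in for the configurable plug point.
    static DataStore create(String kind) {
        return "xmpp".equals(kind) ? new XmppDataStore() : new AjaxDataStore();
    }

    public static void main(String[] args) {
        // GUI code is identical either way; only the configuration differs.
        System.out.println(create("ajax").fetch("user42"));
        System.out.println(create("xmpp").fetch("user42"));
    }
}
```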
If you are considering the development of an MDSD system, then avoid some costly mistakes by learning these ten lessons, which I learned writing an MDSD system that generates web 2.0 apps based on Google Closure and XMPP. If you keep it simple, keep it useful, provide an editor, provide importers, generate clean code, automate the entire build process, and decouple both the GUI and the data store, then you'll have a really cool tool.