Posted Wednesday, November 12th, 2008 at 3:40 pm by Alan Gallauresi
Microsoft quietly launched its free Microsoft Search Server 2008 earlier this year, a product that heavily leverages the same Enterprise Search technology found in SharePoint 2007. That quiet introduction belies the fact that Search Server is poised to take on the de facto market leader in low-cost website spidering and searching, the Google Mini. When the Mini first launched, it had a great price point, the Google name, and a slew of glaring deficiencies that have since been largely patched out of existence. Search technology, which used to be the largest feature gap our company had to account for when implementing content management systems without their own search provider, had become a no-brainer – sometimes clients without the slightest idea what sort of technology to use for a website redesign were coming to us with a Mini already purchased. That is, until MS Search Server 2008 arrived, immediately becoming the Mini’s foremost competitor in the low and mid-tier market and presenting a compelling case for clients already heavily invested in Microsoft technology.
Right now, we’re in the midst of a website redesign that utilizes Search Server 2008. It’s a pretty common type of redesign for us at Beaconfire – a website built on a content management system that doesn’t have a built-in search technology, or at least not one we want to use. The client’s search needs are also pretty typical for a lot of site builds:
- The search is primarily targeted at the site being redesigned – federated searching of external content might be a plus, but it takes a back seat to getting good search results for the site the user is searching on.
- The focus is on public HTML content, not documents or database items, and there’s little in the way of heavily role-based content restrictions.
- The need for advanced filtering is minimal but needs to be expandable for the future.
- The search is fully integrated into the main website. The search and the results returned are displayed seamlessly in pages on the site instead of shuttling users off to another server.
- Emphasis is placed on returning the best results possible through keywords and “best bet” mechanisms.
Those requirements add up to something decidedly short of enterprise-level search, but a very good example of the type of search plenty of customers want in a redesign – not a fancy search, just a particularly smooth and relevant one.
With those requirements in mind, here’s my quick take on how the two technologies stack up based on my impressions of both, right after the jump.
Price

A quick and easy win for Microsoft. You can’t beat free, and the one circumstance where you need to buy a license (load-balanced search servers) is uncommon enough that those who need it can afford the minor software fee. For many organizations, the Google Mini’s rather reasonable purchase price ($2,999 for the standard model with two years of support) is less of a burden than the hosting costs associated with racking it up.
Indexing

I’ve had far too much experience with Minis to still be as baffled as I am by the indexing interface. It’s only become more complex over time: settings scattered across different places for toggling how frequently particular URLs get indexed (which only apply in certain crawling modes), a feature to index a page as soon as possible (as in NOW) that shows little visible effect, and a prediction of what it plans to index in the next hour that rarely seems to match reality. Starting a Google crawl is like a strongman competition for pulling a dump truck with a nylon rope and your own teeth – incredibly tough to get started, slow going at the best of times, and very painful to stop. MS Search, on the other hand, is brilliant – check a box and it starts, crawling the site extremely quickly.
Development to Deployment
Search engines are a particular pain to network because they usually sit in the same network segment as the website you’re deploying, yet they spider content as if they were an external user looking at your site. Combine that with switching a DNS entry at redesign launch and you get a ton of small but significant problems related to hostnames and networking. Every time we launch, there are tons of tasks for switching search settings, reindexing immediately, and testing. Search Server 2008 makes this a lot easier with a simple mechanism: integrated URL mapping. Just map your staged website’s address to the URL of the soon-to-be-live site and you can worry about reindexing later, when things aren’t so crazy. Google, take note – I want this in the Mini.
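The mapping itself is trivial to sketch. Everything here is my own illustration (the function name, the result structure with a "url" key), not Search Server’s API – the point is simply that results crawled under the staging hostname can be served under the production one, so reindexing can wait until after launch:

```python
# Hypothetical sketch of the URL-mapping idea; names are illustrative,
# not anything Search Server actually exposes.
def map_result_urls(results, staged_host, live_host):
    """Rewrite staging hostnames in result URLs to the live hostname."""
    mapped = []
    for hit in results:
        hit = dict(hit)  # don't mutate the caller's result objects
        hit["url"] = hit["url"].replace(staged_host, live_host, 1)
        mapped.append(hit)
    return mapped
```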
Reporting

Microsoft’s approach is simple – a set of reasonable, pre-made reports. Google has a couple of prepared reports, like content-type stats, but for most things the appliance uses a byzantine routine of creating a report, naming it, refreshing the page to find it already finished by the time you named it, and then viewing it, which is terribly unfriendly by comparison.
Results Display

Given that we fully integrate both the search interface and the results display into our site pages, we want a result set that can be easily shoehorned into whatever interface the designers want, from basic results to AJAX-y showpieces. Both engines return text summaries that show the term hits in the context of the page, strung together into a hit-highlighted abstract. Google’s hit highlighting is basic but good enough for most purposes, as it surrounds hit terms with <b> tags – like a lot of the HTML Google produces, it’s lowest common denominator (as anyone who has ever looked at their <FONT>-tag-ridden XSLT stylesheets can attest). Microsoft uses a numbered hit tag system (c0, c1, etc.) – easy enough to understand once various helpful bloggers explain it, but still a bad idea. Note to Microsoft: don’t return hit results in tags that Firefox and Chrome can style natively with CSS but your own Internet Explorer ignores; don’t require an XSL transformation when a span with classes or ids would do fine; and don’t number the tag names so each one has to be styled separately – there’s a reason you rarely see tags with numbers in the name. Devs, be prepared to do some wholesale text replacement on Search 2008 results if you want to format them nicely, like placing ellipses between term hits.
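That wholesale text replacement can be as little as a pair of regular expressions. This is a sketch of our own clean-up step, not anything Search Server provides:

```python
import re

def normalize_hit_tags(abstract):
    """Collapse Search Server's numbered hit tags (<c0>, <c1>, ...)
    into a single span that one CSS rule can style in every browser."""
    abstract = re.sub(r"<c\d+>", '<span class="hit">', abstract)
    abstract = re.sub(r"</c\d+>", "</span>", abstract)
    return abstract
```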
Metadata

A feature-list examination would give Microsoft a clear edge here – defined crawled properties for many different types of documents, which can be interactively mapped into managed properties in a chosen order, so properties can gracefully fall back onto less relevant ones if the most targeted ones aren’t available. That’s fabulous, especially compared to Google’s need-to-know approach of “we’ll tell you what metadata properties we indexed when you search and not before.” But Google also understands some of the basics: just about anything you index is going to be a string, except for numbers and dates. There’s a good chance you’ll want dates for range filtering or sorting, and Google is extremely tolerant at guessing dates from a variety of formats and sources, like meta fields or server headers. MS Search 2008, on the other hand, seems to think just about every HTML property is a text field, and won’t let you map crawled text properties to date-type managed properties, simultaneously segregating HTML content from nicely indexed documents like PDFs or Word files and forcing workarounds for dealing with ranges.
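One workaround we’ve leaned on for the date problem: normalize whatever date text the CMS emits into ISO-8601 strings before the crawler sees it, since those sort and range-compare correctly even as plain text. A minimal sketch – the format list is illustrative, not exhaustive:

```python
from datetime import datetime

# Assumed input: a date rendered as text in a meta field; the format
# list below is illustrative and would be tuned to your CMS.
def to_sortable_date(raw, formats=("%m/%d/%Y", "%Y-%m-%d", "%d %b %Y")):
    """Normalize a text date to ISO-8601 so that plain string
    comparison on the indexed property agrees with chronological order."""
    for fmt in formats:
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # unparseable; leave the property out rather than guess
```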
Index Integrity

You might recall I loved the instant gratification of MS Search indexing. Less lovable is the fact that Search Server will happily replace a perfectly good index with whatever it just saw in its latest spidering attempt, even if it found absolutely nothing because, say, your entire site was down for a few minutes or a networking hiccup kept the crawler from seeing your site. I’d love to say the Mini handles this situation very well – older versions had prominent toggles for a minimum number of crawled pages and specific page URLs that had to be crawled before a staged index was placed into production – but this feature is either gone or extremely well hidden in modern Minis.
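The safeguard those older Minis had is easy enough to approximate in your own deployment scripts. A hypothetical sanity check before promoting a fresh crawl (names and thresholds are mine):

```python
def accept_new_index(new_pages, old_pages, min_pages=50, max_drop=0.5):
    """Refuse to promote a crawl that looks like a failure (site down,
    network hiccup) rather than a genuine shrinking of the site."""
    if new_pages < min_pages:
        return False
    # reject if the page count dropped by more than max_drop vs. the
    # currently serving index
    if old_pages and new_pages < old_pages * (1 - max_drop):
        return False
    return True
```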
Best Bets, Keywords and Synonyms
On the surface, both engines seem capable in this regard, and each has some unique features. Both have keymatching or “best bet” functionality; both allow synonyms but not fancy stemming or fuzzy matching. MS lets you specify authoritative pages. Google lets you import keymatches and related queries from text documents. Etc. Best bets (defining a single “featured” link for a common search term) and synonyms (treating a search for one term the same as a search for another) are particularly useful when organizations want to push users to special events or articles. Unfortunately, Search Server 2008 has a fine-print limitation on its synonym feature: synonyms only apply when coupled with an already existing best bet match, meaning it’s basically just a way to avoid retyping the same best bet with a different term. I’d really love to see a true synonym system similar to Google’s.
Error Handling

To my mind, Google has this right – it is tolerant of mistakes or “bad” queries to the point that it’s nearly impossible to cause an actual error. The Microsoft query web service takes the heavy-handed approach of throwing an exception for just about everything: an unrecoverable network problem, a developer configuration error like requesting a property the search doesn’t know about, or a user searching for something like the words “and the”. That last query returns “Your query included only common words and / or characters, which were removed. No results are available. Try to add query terms.” They could have added “Keep trying at that internet thing, you’ll figure it out eventually.” Yes, each of those conditions should be handled, but they’re not remotely in the same class. Developer exception trapping shouldn’t be a substitute for reasonable behavior from a search engine.
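In practice that means shielding the site with a wrapper that sorts the exceptions into classes. A hedged sketch – `safe_search` and `execute_query` are stand-ins of mine, and the exception types you would actually catch come from your web service proxy, not Python’s built-ins:

```python
def safe_search(execute_query, terms):
    """Turn the query service's throw-for-everything habit into sane
    behavior: user-level 'errors' become an empty result set, while
    real failures are reported as failures."""
    try:
        return {"ok": True, "results": execute_query(terms)}
    except ValueError:
        # stand-in for "only common words" style complaints about the
        # user's query -- show zero results, not an error page
        return {"ok": True, "results": []}
    except Exception as exc:
        # network or configuration failure -- genuinely exceptional
        return {"ok": False, "error": str(exc)}
```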
Results and Ranking
This is the most crucial feature, and in my opinion the Mini wins it hands down. Google’s search technology is widespread for a very good reason – their “magic” algorithms do a great job of returning relevant results. Things have only improved over time as they’ve added features to the Mini borrowed from its big brother, Google.com search, like automatically filtering similar results to drastically reduce the number of results displayed. Meanwhile, MS lacks some important features like partial-page indexing (excluding content within a page from being indexed), which results in matches for keywords found in common navigation areas and the like. Google has had that feature (googleoff:index) forever, but has improved relevancy to the point where it hardly needs to be used any more. Google has also begun opening up its sacrosanct black box of hit relevance by letting administrators bias results by date or site section.
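For reference, the Google mechanism is just a pair of HTML comments around the content to exclude. This sketch mimics its effect locally – say, to preview what a crawler honoring the markup would actually index; the comment syntax is Google’s, while the regex and function are my own:

```python
import re

# Content between the googleoff:index / googleon:index comment pair is
# invisible to the appliance's index -- typically shared navigation.
GOOGLEOFF = re.compile(
    r"<!--\s*googleoff:\s*index\s*-->.*?<!--\s*googleon:\s*index\s*-->",
    re.DOTALL,
)

def indexable_text(html):
    """Strip the regions a googleoff-aware crawler would skip."""
    return GOOGLEOFF.sub("", html)
```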
So there’s my quick take. I’ve skipped over a lot of functionality, and there are areas like federated searching and role-based authentication where MS clearly shines. Then there is the full-fledged Google Search Appliance whose features I haven’t touched on. So tell me where I’m right, tell me where I missed things, and tell me where I’m just plain wrong in the comments.