Search Engines for Fun and Profit Interlude: The Bungle

In my last article I talked about using NFS to separate resources for indexing and querying. I mentioned my preference for using a third, dedicated file server for both indexing and query servers. It didn't take long for the graphs to tell me that disabling the file attribute cache (the noac mount option), which is essential for the stable release's implementation of Lucene to work distributed across NFS, had decreased crawling efficiency about tenfold:

The outbound spike early on is me moving the /data/ directory to the NFS server. The large mutual inbound/outbound block is the traffic between the spider and NFS servers. In that run the spider indexed about 7000 pages. The drop-off is where I re-mounted the share with the noac option; that run indexed 300 pages.

I theorize that if you use the spider server as the NFS server for the query server you should be fine. You should still be able to use a third, dedicated NFS server if, like mine, your network has an unequal distribution of storage capacity vs. processing capacity. It's simply a matter of daisy-chaining: mount /data/ from the NFS server onto the spider with attribute caching, then serve it from the spider to the query server without it. The spider benefits from the cache and the read-only query server always sees fresh data.
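To make the daisy chain concrete, here is a minimal Python sketch that builds (but does not run) the two mount commands involved. It is as untested as the theory itself. The host names nfs-server and spider are made up for illustration, and the helper function is mine, not part of any tool. The ac/noac and ro/rw flags are the standard NFS mount options. Whether the spider's NFS server will happily re-export a directory that is itself an NFS mount depends on the server implementation, so treat this as a starting point only.

```python
import shlex


def nfs_mount_command(server, export, mountpoint, attribute_cache, read_only):
    """Build the argument list for one NFS hop (hypothetical helper)."""
    options = [
        "ro" if read_only else "rw",
        "ac" if attribute_cache else "noac",  # noac disables attribute caching
    ]
    return ["mount", "-t", "nfs", "-o", ",".join(options),
            f"{server}:{export}", mountpoint]


# Hop 1: the spider mounts /data from the dedicated file server, cache enabled.
# "nfs-server" is an assumed host name.
spider_hop = nfs_mount_command("nfs-server", "/data", "/data",
                               attribute_cache=True, read_only=False)

# Hop 2: the query server mounts /data from the spider, read-only, no cache.
# "spider" is an assumed host name.
query_hop = nfs_mount_command("spider", "/data", "/data",
                              attribute_cache=False, read_only=True)

print(shlex.join(spider_hop))
print(shlex.join(query_hop))
```

Run the first command on the spider and the second on the query server. Only the second hop pays the noac penalty, and that hop is read-only anyway.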

The good news is there is a much less ass-backward way of doing this available in the latest developer and beta releases of OSS: index replication (and authentication!). I'm having trouble getting it to work flawlessly, but the benefits are so tremendous I think it's going to be worth the wait. The short story is that it will be easier to set up a dedicated spider in your home or office, or anywhere else with a consumer-grade connection and abundant space and power, so your server(s) can be cheap and huge. It is then merely a matter of setting up a VPN with your colocated/hosted servers to update the read-only index, all at once or periodically. That's zesty.

