Search Engines for Fun and Profit Part Three: Indexing your Sites

Once you have a working installation of OSS you will need to index some content so there is something to work with when implementing the front-end. Start by going to the OSS configuration interface at http://[server-address]:8080 and create a new index with the web crawler template using the form on the front page. You can create multiple indices to offer up results for different sets of sites or fields. This makes OSS a good fit for search-as-a-service, since all of your clients can be consolidated on a single server and managed through a single interface.

Once you've created your index, select it and a tab menu will show up across the top of the page. Click the Crawler tab and, if it is not already selected, the Web sub-tab. Click on the Pattern List tab and add some sites to be indexed, following the instructions regarding wildcards:

Enter http://www.open-search-server.com if you only want to crawl the home page
Enter http://www.open-search-server.com/* if you want to crawl all the content
Enter http://www.open-search-server.com/*wiki* if you only wish to crawl URLs containing the word "wiki" within the open-search-server.com domain.
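
To get a feel for what each of those patterns will and won't pick up before committing to a crawl, here is a rough Python sketch. It assumes a * simply stands for "any run of characters" and that everything else must match literally; OSS's own matcher may differ in the details (trailing slashes, for instance), and the matches helper is made up for this illustration rather than taken from OSS.

```python
import re

def pattern_to_regex(pattern):
    """Turn a wildcard pattern into a regex: * matches any run of characters,
    every other character is taken literally (an assumption, not OSS's code)."""
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(".*".join(literal_parts))

def matches(pattern, url):
    """Return True if the whole URL matches the wildcard pattern."""
    return pattern_to_regex(pattern).fullmatch(url) is not None

if __name__ == "__main__":
    base = "http://www.open-search-server.com"
    urls = [base, base + "/docs/page.html", base + "/wiki/Install"]
    for pattern in (base, base + "/*", base + "/*wiki*"):
        hits = [u for u in urls if matches(pattern, u)]
        print(pattern, "->", hits)
```

Running it against a handful of URLs copied from your own site is a cheap way to catch a pattern that is too greedy or too narrow.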

Click the Add button, then click on the Crawl process tab. Change the UserAgent to something relevant, then tune the timing settings to be as timid or aggressive as your situation requires. Start indexing by de-selecting the Dry run check-box (if selected), selecting the Optimize check-box and clicking on the Not running - Click to start button. The statistics and threads panes should begin to populate.

See the Quick Start Guide to Crawl the Web for screencaps of this process.

Errors I have encountered while crawling include:

Error (org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: SimpleFSLock@/opt/open-search-server/data/furfinder/index/20101207171650/write.lock)

This is caused by the write.lock lockfile being left over from an unclean shutdown. Make sure nothing is still writing to the index, then delete the file and start crawling again.
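
If you have several indices it can be tedious to hunt for leftover lockfiles by hand. The following is a minimal Python sketch that walks a data directory looking for write.lock files and, only when asked, removes the ones older than a cutoff. The function names, the one-hour cutoff and the dry-run default are all my own assumptions. Age alone does not prove a lock is stale, so stop OSS (or at least the crawler) before letting it delete anything.

```python
import time
from pathlib import Path

def find_lock_files(data_dir):
    """Return (path, age_in_seconds) for every write.lock under data_dir."""
    now = time.time()
    locks = sorted(Path(data_dir).rglob("write.lock"))
    return [(lock, now - lock.stat().st_mtime) for lock in locks]

def remove_stale_locks(data_dir, min_age_seconds=3600, dry_run=True):
    """List lockfiles at least min_age_seconds old; delete them unless dry_run.

    The cutoff is an arbitrary assumption. Only run with dry_run=False once
    you are sure no process is still writing to the index.
    """
    stale = []
    for lock, age in find_lock_files(data_dir):
        if age >= min_age_seconds:
            if not dry_run:
                lock.unlink()
            stale.append(lock)
    return stale

if __name__ == "__main__":
    # Path taken from the error message above; adjust to your installation.
    for lock in remove_stale_locks("/opt/open-search-server/data"):
        print("stale lock candidate:", lock)
```

With the default dry_run=True it only reports candidates, which is the safer way to run it the first time.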

Error (background merge hit exception: _1k8:C27497 _1kj:c1116 _1kk:c27 _1kl:c4 _1km:c13 into _1kn [optimize])

Lucene, the "guts" behind OSS, is having trouble optimizing the index after the crawl. Reading the catalina.out file in Tomcat's logs directory indicated that there was not enough free storage to work with, so the /data/ directory was moved off of the VM and onto a file server.
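
It is possible to check for this before starting a crawl rather than finding out from a failed optimize. The sketch below, in Python, compares the current size of an index directory against the free space on the volume holding it. The factor of 2 is a rule-of-thumb assumption on my part, not a figure from OSS, and the function names are made up for the example; merging can need a good deal of temporary room on top of the index itself, so err on the generous side.

```python
import shutil
from pathlib import Path

def directory_size(path):
    """Total size in bytes of all regular files under path."""
    return sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file())

def has_merge_headroom(index_dir, factor=2.0):
    """Return True if free space on the volume is at least factor times the
    index size. The default factor is an assumed safety margin."""
    needed = directory_size(index_dir) * factor
    free = shutil.disk_usage(index_dir).free
    return free >= needed

if __name__ == "__main__":
    # Path taken from the error message above; adjust to your installation.
    index_dir = "/opt/open-search-server/data"
    print("index size (bytes):", directory_size(index_dir))
    print("enough room to optimize?", has_merge_headroom(index_dir))
```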

foxpa.ws
