Search Engines for Fun and Profit Part Three: Indexing your Sites

Once you have a working installation of OSS you will need to index some content so there is something to work with when implementing the front-end. Start by going to the OSS configuration interface at http://[server-address]:8080 and creating a new index with the web crawler template, using the form on the front page. You can create multiple indices to offer up results for different sets of sites or fields. This makes OSS well suited to search-as-a-service, as all of your clients can be consolidated on a single server and managed through a single interface.

Once you’ve created your index, select it and a tab menu will show up across the top of the page. Click the Crawler tab and, if it is not already selected, the Web sub-tab. Click on the Pattern List tab and add some sites to be indexed, following the instructions regarding wildcards:

Enter http://www.open-search-server.com if you only want to crawl the home page
Enter http://www.open-search-server.com/* if you want to crawl all the content
Enter http://www.open-search-server.com/*wiki* if you only wish to crawl URLs containing the word "wiki" within the open-search-server.com domain.
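
If you want to sanity-check a pattern before feeding it to the crawler, the wildcard rules above can be roughly approximated in a few lines of Python. This is only a sketch of the behaviour as the instructions describe it, not OSS's own matcher: the function names are made up for illustration, and it assumes * matches any run of characters and that a pattern without * must match the URL exactly.

```python
import re


def pattern_to_regex(pattern):
    """Turn a crawler-style pattern into a regex; '*' matches any run of characters.

    Assumption: this mirrors the wildcard rules as described in the OSS
    instructions. It is not taken from the OSS source.
    """
    literal_parts = [re.escape(piece) for piece in pattern.split("*")]
    return re.compile(".*".join(literal_parts))


def url_matches(pattern, url):
    """Return True if the whole URL matches the pattern."""
    return pattern_to_regex(pattern).fullmatch(url) is not None


if __name__ == "__main__":
    # The /download and /wiki/Main_Page paths are hypothetical test URLs.
    urls = [
        "http://www.open-search-server.com",
        "http://www.open-search-server.com/download",
        "http://www.open-search-server.com/wiki/Main_Page",
    ]
    for pattern in (
        "http://www.open-search-server.com",
        "http://www.open-search-server.com/*",
        "http://www.open-search-server.com/*wiki*",
    ):
        matched = [u for u in urls if url_matches(pattern, u)]
        print(pattern, "->", matched)
```

Note that under this exact-match assumption the bare home page pattern would not match the same address with a trailing slash; whether OSS treats the two as equivalent is something to confirm in the crawler itself.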

Click the Add button, then click the Crawl process tab. Change the UserAgent to something relevant, then tune the timing settings to be as timid or aggressive as your situation requires. Start indexing by de-selecting the Dry run check-box (if selected), selecting the Optimize check-box and clicking on the Not running – Click to start button. Your statistics and threads panes should begin to populate.

See the Quick Start Guide to Crawl the Web for screencaps of this process.

Errors I have encountered while crawling include:

Error (org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: SimpleFSLock@/opt/open-search-server/data/furfinder/index/20101207171650/write.lock)

This is caused by the write.lock lockfile being left over from an unclean shutdown. Once you are sure nothing is still writing to the index, simply delete the file and start crawling again.
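
For anyone who would rather script the cleanup, here is a minimal sketch in Python. The function name is hypothetical, and it assumes you have already stopped OSS (or otherwise confirmed that no crawl is running), since removing the lock out from under a live writer can damage the index.

```python
from pathlib import Path


def remove_stale_lock(index_dir):
    """Delete a leftover write.lock from index_dir.

    Returns True if a lock file was found and removed, False if there was
    nothing to remove. Assumption: the caller has made sure no process is
    still writing to this index.
    """
    lock_file = Path(index_dir) / "write.lock"
    if lock_file.is_file():
        lock_file.unlink()
        return True
    return False
```

With the path from the error message above, the call would be remove_stale_lock("/opt/open-search-server/data/furfinder/index/20101207171650"); your index directory will differ.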

Error (background merge hit exception: _1k8:C27497 _1kj:c1116 _1kk:c27 _1kl:c4 _1km:c13 into _1kn [optimize])

Lucene, the “guts” behind OSS, was having trouble optimizing the index after the crawl. The catalina.out file in Tomcat’s logs directory indicated that there was not enough free storage to work with, so I moved the /data/ directory off of the VM and onto a file server.
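
A quick way to catch this before the optimize step fails is to compare the size of the index with the free space on its volume. The sketch below does that in Python; the function names are hypothetical, and the factor of 2 is an assumed rule of thumb for how much scratch space a merge might want, not a figure from OSS or from this post.

```python
import os
import shutil


def directory_size(path):
    """Total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def has_room_to_optimize(index_bytes, free_bytes, factor=2.0):
    """True if free space is at least factor times the index size.

    Assumption: factor=2.0 is a guess at the headroom needed; tune it.
    """
    return free_bytes >= index_bytes * factor


def check_index(index_dir, factor=2.0):
    """Check the volume holding index_dir for enough free space."""
    free_bytes = shutil.disk_usage(index_dir).free
    return has_room_to_optimize(directory_size(index_dir), free_bytes, factor)
```

Running check_index against your index directory before ticking the Optimize check-box gives a rough yes/no; a False result suggests moving the data directory somewhere roomier first.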

Search Engines for Fun and Profit Part One: Introduction

I started playing with search engines a couple of years ago when I started up FurFinder.net for the Bad Karma Networks line of sites. At the time the pickings were slim; most of the software available was either discontinued or poorly supported. This was also true of DataparkSearch but it was the thinnest kid at fat camp. More importantly, it was the first one I got to work, despite much sweat and many tears.

Eventually I started redeploying DP search in my professional work; it’s good to have search results embedded in your site since they provide great fodder for Google et al. Unfortunately, due to the round-about way I had to compile DP, it did not lend itself well to library updates and so on. In the end I let FurFinder slip into disrepair; it was simply not worth the trouble dicking around with DP’s ./configure options and lengthy config files to keep the spider/index end of it operational.

This week, however, I have had reason to revisit the whole search engine arena. I’m finishing up a contract with a client who had requested this feature, and I set them up with DP search about a year ago. In that time the “stored” content storage daemon, responsible in part for producing relevant content excerpts in the search results, had decided it would like to stop starting or producing any output at all. Additionally, I was planning on making a large multi-item announcement on the BKN sites covering things like the new image board interfaces and mass virtual hosting platform. Wouldn’t it be nice to re-launch FurFinder.net too?

Normally I avoid Java software like the plague, but this time Open Search Server caught my eye thanks to rave reviews. I was not let down. Five minutes after I downloaded their binary package I was logged into the web-based front end. Half an hour later I had my client’s site indexed. An hour after that I had the search results seamlessly integrated with their site.

This series will chronicle my experience migrating FurFinder.net to OSS, from installation to fine tuning and maybe even some code samples here and there.

foxpa.ws
Made in Canada  •  There's a fox in the Gibson!  •  2010-12