=^.^=

Search Engines for Fun and Profit Part Four: Separating Resources for Queries and Indexing

karma

NOTE Before you go and do any of this please read Interlude: The Bungle.

If you're indexing a wide array of URLs you will quickly find that indexing operations are load-intensive, and this can slow down delivery on your front-end. To overcome this problem we can put the indices on an NFS share and run two instances of OSS: one dedicated to crawling and one that only serves queries. Virtual machines lend themselves especially well to this setup, since you will be making an exact copy of the original server plus some minor configuration changes.

You can either set up the original OSS server as the NFS server or use a third server. I prefer using a dedicated file server, though your resources may not permit it. If you are using a dedicated file server, first shut down OSS, then move the contents of the /data/ directory down a level. Mount the NFS share onto the /data/ directory, move the indices onto the share, then set the ownership for the mount point:

# chown oss: data/ -R

If you're going to use the indexing server as the NFS server your /etc/exports should look something like this:

/opt/open-search-server/data [QUERY_SERVER_ADDRESS](sync,no_subtree_check,ro,root_squash)

Note that if you are hosting the indices on a dedicated NFS server you should be using 'rw' in your exports file and the fstab of the indexing server instead of 'ro'. Add this to the /etc/fstab of the query server:

192.168.8.22:/opt/open-search-server/data        /opt/open-search-server/data    nfs     defaults,ro,noac             0 0

It is important that you mount the share with the noac option, or you may end up with java.io.IOException: Stale NFS file handle errors resulting from the file attribute cache lagging behind changes made by the indexing server. Now restart your nfs and netmount init scripts (where available) or mount the share manually:

# mount /opt/open-search-server/data

It's now safe to start OSS on the query server.

# cd /opt/open-search-server/
# sudo -u oss ./start.sh
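
For the dedicated file server variant mentioned above, the entries might look roughly like the sketch below. The export path /srv/oss-data and the bracketed addresses are placeholders made up for illustration; only the option lists follow what is described above (rw for the indexing server, ro plus noac for the query server).

```
# /etc/exports on the dedicated file server (path and addresses are placeholders)
/srv/oss-data [INDEX_SERVER_ADDRESS](rw,sync,no_subtree_check,root_squash) [QUERY_SERVER_ADDRESS](ro,sync,no_subtree_check,root_squash)

# /etc/fstab on the indexing server
[FILE_SERVER_ADDRESS]:/srv/oss-data    /opt/open-search-server/data    nfs     defaults,rw          0 0

# /etc/fstab on the query server
[FILE_SERVER_ADDRESS]:/srv/oss-data    /opt/open-search-server/data    nfs     defaults,ro,noac     0 0
```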

Search Engines for Fun and Profit Part Three: Indexing your Sites

karma

Once you have a working installation of OSS you will need to index some content so there is something to work with when implementing the front-end. Start by going to the OSS configuration interface at http://[server-address]:8080 and create a new index with the web crawler template using the form on the front page. You can create multiple indices to offer up results for different sets of sites or fields. This makes OSS an ideal solution for search-as-a-service, as all of your clients can be consolidated on a single server and managed through a single interface.

Once you've created your index, select it and a tab menu will show up across the top of the page. Click the crawler tab and, if it is not already selected, the Web sub-tab. Click on the Pattern List tab and add some sites to be indexed, following the instructions regarding wildcards:

Enter http://www.open-search-server.com if you only want to crawl the home page
Enter http://www.open-search-server.com/* if you want to crawl all the content
Enter http://www.open-search-server.com/*wiki* if you only wish to crawl URLs containing the word "wiki" within the open-search-server.com domain.
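
To get a feel for how these wildcards behave, here is a minimal sketch that treats * as "any run of characters" using the shell's own glob matching. This is only an approximation for experimenting; OSS's real matcher may differ in detail (for instance, the shell also treats ? and [ as special). The url_matches helper and the /download and /wiki/en/Main_Page URLs are made up for illustration.

```shell
#!/bin/sh
# Succeed when URL ($1) matches PATTERN ($2), where * stands for any run of
# characters. The pattern is left unquoted on purpose so the shell globs it.
url_matches() {
    case $1 in
        $2) return 0 ;;
        *)  return 1 ;;
    esac
}

# Try every hypothetical URL against each of the three example patterns.
for url in \
    "http://www.open-search-server.com" \
    "http://www.open-search-server.com/download" \
    "http://www.open-search-server.com/wiki/en/Main_Page"
do
    for pattern in \
        "http://www.open-search-server.com" \
        "http://www.open-search-server.com/*" \
        "http://www.open-search-server.com/*wiki*"
    do
        if url_matches "$url" "$pattern"; then
            echo "match: $url  <-  $pattern"
        fi
    done
done
```

Under this simplified matching the bare home page URL only matches the first pattern, while the wiki URL matches both wildcard patterns.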

Click the add button, then click on the Crawl process tab. Change the UserAgent to something relevant, then tune the timing settings to be as timid or aggressive as your situation requires. Start indexing by de-selecting (if selected) the Dry run check-box, selecting the Optimize check-box and clicking on the Not running - Click to start button. Your statistics and threads panes should begin to populate.

See the Quick Start Guide to Crawl the Web for screencaps of this process.

Errors I have encountered while crawling include:

Error (org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: SimpleFSLock@/opt/open-search-server/data/furfinder/index/20101207171650/write.lock)

This is caused by the write.lock lockfile being left over from an unclean shutdown. Simply delete the file and start crawling again.
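
Deleting the lock is only safe when nothing is still writing to the index, so it is worth checking that first. A minimal sketch, assuming pgrep is installed and that OSS runs as java under the oss account; the remove_stale_lock helper is made up for illustration, and the lock path is whatever your own error message reports:

```shell
#!/bin/sh
# Remove a leftover Lucene write.lock, refusing if a java process owned by
# the given user is still running (it may legitimately hold the lock).
# $1 = path to write.lock, $2 = account OSS runs under (defaults to oss).
remove_stale_lock() {
    lock=$1
    user=${2:-oss}
    # If pgrep is missing or the user does not exist this check is skipped,
    # so treat it as a convenience rather than a guarantee.
    if pgrep -u "$user" java >/dev/null 2>&1; then
        echo "java is still running as $user; leaving $lock alone" >&2
        return 1
    fi
    if [ -f "$lock" ]; then
        rm -f -- "$lock" && echo "removed $lock"
    else
        echo "no lock file at $lock"
    fi
}

# Example, using the path from the error above:
# remove_stale_lock /opt/open-search-server/data/furfinder/index/20101207171650/write.lock
```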

Error (background merge hit exception: _1k8:C27497 _1kj:c1116 _1kk:c27 _1kl:c4 _1km:c13 into _1kn [optimize])

Lucene, the "guts" behind OSS, is having trouble optimizing the index after the crawl. Reading the catalina.out file in Tomcat's logs directory indicated that there was not enough free storage to work with, so the /data/ directory was moved off of the VM and onto a file server.
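
An optimize writes the merged segments out before the old ones are dropped, so it temporarily needs free space on the order of the index size, possibly more. A rough pre-flight check might look like the sketch below. The helper names and the "free space at least equal to the index size" threshold are assumptions for illustration, not a rule from the OSS or Lucene documentation.

```shell
#!/bin/sh
# Print free space (KiB) on the filesystem holding directory $1.
free_kb() {
    df -Pk "$1" | awk 'NR==2 { print $4 }'
}

# Print the total size (KiB) of the directory tree at $1.
used_kb() {
    du -sk "$1" | awk '{ print $1 }'
}

# Warn (and return 1) when free space is smaller than the data itself.
check_headroom() {
    dir=$1
    free=$(free_kb "$dir")
    used=$(used_kb "$dir")
    if [ "$free" -lt "$used" ]; then
        echo "WARNING: $dir uses $used KiB but only $free KiB is free"
        return 1
    fi
    echo "OK: $dir uses $used KiB, $free KiB free"
}

# Only run the check when the data directory from this series exists.
DATA_DIR=${DATA_DIR:-/opt/open-search-server/data}
if [ -d "$DATA_DIR" ]; then
    check_headroom "$DATA_DIR"
fi
```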

Search Engines for Fun and Profit Part Two: Installing Open Search Server

karma

After downloading the stable release of OSS from their SourceForge page, load up their Quick Start guide for your platform. Decompress the package to your /opt/ directory. The first thing you will want to do is create a "nobody" user account for the server to run under.

On Gentoo, run:

# useradd -d /opt/open-search-server -s /sbin/nologin -r oss

The -d flag specifies the user's home directory, -s /sbin/nologin specifies a disabled shell and -r puts the UID of the given user in the "system accounts" range. It is important that the data directory is writable by the new account so change its ownership:

# chown oss: data/ -R

You also need to make some of the Tomcat servlet container's files writable:

# chown oss: apache-tomcat-6.0.20/logs/ -R
# chown oss: apache-tomcat-6.0.20/temp/ -R
# chown oss: apache-tomcat-6.0.20/work/ -R

Now we want the server to start on boot-up. You can either make an init script or take the easy route and drop it in your local. On Gentoo edit /etc/conf.d/local.start to reflect:

cd /opt/open-search-server
sudo -u oss ./start.sh

You will either need to change the working directory as shown above or add OSS to your PATH otherwise java will complain about missing files.

Similarly, configure your local to shut down the app gracefully:

killall -15 java
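
Note that killall -15 java signals every Java process on the machine. If the box runs other Java software, a narrower variant might look like the sketch below. It assumes pgrep is available and that OSS is the only java process owned by the oss account; the stop_and_wait helper and its 30-second default are made up for illustration.

```shell
#!/bin/sh
# Send SIGTERM to one PID and wait up to TIMEOUT seconds for it to exit.
# $1 = PID, $2 = timeout in seconds (defaults to 30).
# Returns 0 once the process is gone, 1 if it is still alive afterwards.
stop_and_wait() {
    pid=$1
    timeout=${2:-30}
    # If the signal cannot be delivered, the process is already gone
    # (or is not ours to signal), so there is nothing to wait for.
    kill -TERM "$pid" 2>/dev/null || return 0
    while [ "$timeout" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0  # process has exited
        sleep 1
        timeout=$((timeout - 1))
    done
    return 1
}

# Stop only the java processes owned by the oss account (assumed to be OSS).
for p in $(pgrep -u oss java 2>/dev/null); do
    stop_and_wait "$p" || echo "PID $p did not exit in time" >&2
done
```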

Start your init script or restart local and you should find your instance of OSS chugging away:

spider open-search-server # ps aux | grep java
oss       9744 57.9 63.5 279724 166616 pts/0   Sl   14:43   2:28 /usr/lib/jvm/icedtea6-bin/bin/java -Djava.util.logging.config.file=/opt/open-search-server/apache-tomcat-6.0.20/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djava.endorsed.dirs=/opt/open-search-server/apache-tomcat-6.0.20/endorsed -classpath :/opt/open-search-server/apache-tomcat-6.0.20/bin/bootstrap.jar -Dcatalina.base=/opt/open-search-server/apache-tomcat-6.0.20 -Dcatalina.home=/opt/open-search-server/apache-tomcat-6.0.20 -Djava.io.tmpdir=/opt/open-search-server/apache-tomcat-6.0.20/temp org.apache.catalina.startup.Bootstrap start

You can now connect to the management interface at http://[server-address]:8080.

Search Engines for Fun and Profit Part One: Introduction

karma

I started playing with search engines a couple of years ago when I started up FurFinder.net for the Bad Karma Networks line of sites. At the time the pickings were slim; most of the software available was either discontinued or poorly supported. This was also true of DataparkSearch, but it was the thinnest kid at fat camp. More importantly, it was the first one I got to work, despite much sweat and tears expended.

Eventually I started redeploying DP search in my professional work; it's good to have search results embedded in your site since they provide great fodder for Google et al. Unfortunately, due to the round-about way I had to compile DP, it did not lend itself well to library updates and so on. Eventually I let FurFinder slip into disrepair; it was simply not worth the trouble dicking around with DP's ./configure options and lengthy config files to keep the spider/index end of it operational.

This week, however, I have had reason to revisit the whole search engine arena again. I'm finishing up a contract with a client who had requested this feature, and I set them up with DP search about a year ago. In that time the "stored" content storage daemon, responsible in part for producing relevant content excerpts in the search results, had decided it would like to stop starting or producing any output at all. Additionally, I was planning on making a large multi-item announcement on the BKN sites covering things like the new image board interfaces and mass virtual hosting platform. Wouldn't it be nice to re-launch FurFinder.net too?

Normally I avoid Java software like the plague but this time Open Search Server caught my eye thanks to rave reviews. I was not let down. Five minutes after I downloaded their binary package I was logged into the web-based front end. Half an hour later I had my client's site indexed. An hour after that I had the search results seamlessly integrated with their site.

This series will chronicle my experience migrating FurFinder.net to OSS, from installation to fine tuning and maybe even some code samples here and there.

ISC.Org ANY Request DRDoS Update

karma

It has been some time now since I started talking about the curious case of the isc.org ANY request flood (later revealed to be a UDP amplification attack) and our friends are still knocking at the gates hot and heavy. In the past couple of days I have noticed some particularly voluminous activity, culminating in this wave this afternoon.

ID       Blocked IP       Date      Time      Time Remaining
4000002  69.197.22.82     11/17/10  15:36:49  1d 00:00:00
4000002  72.20.9.147      11/17/10  15:35:31  23:58:42
4000002  72.20.9.154      11/17/10  15:24:38  23:47:49
4000002  95.168.172.188   11/17/10  15:24:01  23:47:12
4000002  85.195.105.91    11/17/10  15:22:31  23:45:42
4000002  72.20.9.150      11/17/10  15:19:21  23:42:32
4000002  84.16.227.96     11/17/10  15:15:08  23:38:19
4000002  78.159.121.149   11/17/10  15:13:21  23:36:32
4000002  178.162.182.250  11/17/10  15:11:02  23:34:13
4000002  72.20.9.156      11/17/10  15:10:21  23:33:32
4000002  78.159.99.146    11/17/10  15:07:53  23:31:04
4000002  78.129.164.142   11/17/10  15:05:57  23:29:08
4000002  78.159.107.219   11/17/10  15:05:13  23:28:24
4000002  72.20.9.149      11/17/10  15:04:32  23:27:43
4000002  206.217.216.249  11/17/10  15:02:50  23:26:01
4000002  95.154.240.8     11/17/10  15:00:38  23:23:49
4000002  72.20.56.237     11/17/10  14:58:20  23:21:31
4000002  78.159.108.198   11/17/10  14:50:12  23:13:23

Since the attacks are so frequent the IPS is having a hard time keeping up and enough packets are getting through that I have decided this is no longer amusing enough to keep tracking. At the bottom of this page is the netfilter panacea.

For the curious, this is what I have been seeing in my packet captures:

0000  00 16 3e cc 00 02 00 16  3e bb 00 02 08 00 45 00   ..>..... >.....E.
0010  05 dc b6 5b 20 00 40 11  a1 8a 00 00 00 00 48 14   ...[ .@. ......H.
0020  09 93 00 35 63 01 06 c7  8d 82 2a 39 81 00 00 01   ...5c... ..*9....
0030  00 00 00 08 00 0f 03 69  73 63 03 6f 72 67 00 00   .......i sc.org..
0040  ff 00 01 c0 0c 00 02 00  01 00 00 76 ac 00 0e 04   ........ ...v....
0050  73 66 62 61 06 73 6e 73  2d 70 62 c0 0c c0 0c 00   sfba.sns -pb.....
0060  02 00 01 00 00 76 ac 00  06 03 6f 72 64 c0 2a c0   .....v.. ..ord.*.
0070  0c 00 02 00 01 00 00 76  ac 00 06 03 61 6d 73 c0   .......v ....ams.
0080  2a c0 0c 00 02 00 01 00  00 76 ac 00 19 02 6e 73   *....... .v....ns
0090  03 69 73 63 0b 61 66 69  6c 69 61 73 2d 6e 73 74   .isc.afi lias-nst
00a0  04 69 6e 66 6f 00 c0 0c  00 2e 00 01 00 00 93 fe   .info... ........
00b0  00 9b 00 02 05 02 00 00  a8 c0 4d 0a b9 03 4c e3   ........ ..M...L.
00c0  2c 03 38 79 03 69 73 63  03 6f 72 67 00 52 d3 b5   ,.8y.isc .org.R..
00d0  f4 98 f3 d6 75 d8 6c 8f  1b 95 b8 55 82 4b 1a ff   ....u.l. ...U.K..
00e0  93 99 29 95 09 a4 d8 1f  46 8b c9 92 45 6c 72 05   ..)..... F...Elr.
00f0  96 28 a7 53 4c 8c d6 e6  a3 b2 4d d6 3d 45 8b be   .(.SL... ..M.=E..
0100  c4 5b a5 2b f9 f1 95 3a  9a 66 02 d7 5e 58 f5 7a   .[.+...: .f..^X.z
0110  f2 f3 d6 94 f1 da a6 2b  e8 43 9a 86 71 48 a1 7b   .......+ .C..qH.{
0120  2e e2 d2 1c a9 9f 68 61  66 11 43 ca 70 88 d9 a0   ......ha f.C.p...
0130  03 82 0f af d3 e8 46 f7  86 33 21 ae 01 b8 62 01   ......F. .3!...b.
0140  84 41 f1 fe 88 23 2d 9c  27 7a 36 6c b7 c0 9a 00   .A...#-. 'z6l....
0150  2b 00 01 00 01 3c bd 00  18 32 5c 05 01 98 21 13   +....<.. .2\...!.
0160  d0 8b 4c 6a 1d 9f 6a ee  1e 22 37 ae f6 9f 3f 97   ..Lj..j. ."7...?.
0170  59 c0 9a 00 2b 00 01 00  01 3c bd 00 24 32 5c 05   Y...+... .<..$2\.
0180  02 f1 e1 84 c0 e1 d6 15  d2 0e b3 c2 23 ac ed 3b   ........ ....#..;
0190  03 c7 73 dd 95 2d 5f 0e  b5 c7 77 58 6d e1 8d a6   ..s..-_. ..wXm...
01a0  b5 c0 9a 00 2e 00 01 00  01 3c bd 00 97 00 2b 07   ........ .<....+.
01b0  02 00 01 51 80 4c f6 79  3a 4c e3 f6 2a f0 9e 03   ...Q.L.y :L..*...
01c0  6f 72 67 00 64 1a d8 1f  c6 51 40 a6 25 28 e7 b9   org.d... .Q@.%(..
01d0  21 c2 2a 4b 30 a0 e8 74  30 83 76 b2 52 eb 0c ec   !.*K0..t 0.v.R...
01e0  e4 e2 4c 3f f1 0e ec 6d  3a d6 b7 d6 2e 4e a3 4a   ..L?...m :....N.J
01f0  5d f6 ac 08 40 25 a5 de  0a 89 90 5d d9 c0 b3 d3   ]...@%.. ...]....
0200  ef 4b d0 8a c3 d5 c2 49  fa c4 c3 84 29 4e 4e 16   .K.....I ....)NN.
0210  47 2e 5c f4 09 9f c4 70  9d 2c 40 c2 63 4b 52 2a   G.\....p .,@.cKR*
0220  14 5b 55 ef 54 9d cc 20  9b 71 61 f4 6e 88 84 49   .[U.T..  .qa.n..I
0230  2c f3 08 77 c4 f0 4d cf  54 ea 64 19 be d3 bf 6c   ,..w..M. T.d....l
0240  cd c0 cb 2f c0 63 00 01  00 01 00 01 1f 6c 00 04   .../.c.. .....l..
0250  c7 fe 3f fe c0 63 00 1c  00 01 00 01 1f 6c 00 10   ..?..c.. .....l..
0260  20 01 05 00 00 2c 00 00  00 00 00 00 00 00 02 54    ....,.. .......T
0270  c0 51 00 01 00 01 00 00  93 fd 00 04 c7 06 01 1e   .Q...... ........
0280  c0 51 00 1c 00 01 00 00  93 fd 00 10 20 01 05 00   .Q...... .... ...
0290  00 60 00 00 00 00 00 00  00 00 00 30 c0 3f 00 01   .`...... ...0.?..
02a0  00 01 00 00 93 fd 00 04  c7 06 00 1e c0 3f 00 1c   ........ .....?..
02b0  00 01 00 00 93 fe 00 10  20 01 05 00 00 71 00 00   ........  ....q..
02c0  00 00 00 00 00 00 00 30  c0 25 00 01 00 01 00 00   .......0 .%......
02d0  76 ac 00 04 95 14 40 03  c0 25 00 1c 00 01 00 00   v.....@. .%......
02e0  76 ac 00 10 20 01 04 f8  00 00 00 02 00 00 00 00   v... ... ........
02f0  00 00 00 19 c0 51 00 2e  00 01 00 00 93 fd 00 9b   .....Q.. ........
0300  00 01 05 04 00 00 a8 c0  4d 0a b9 03 4c e3 2c 03   ........ M...L.,.
0310  38 79 03 69 73 63 03 6f  72 67 00 bb dc f9 a8 90   8y.isc.o rg......
0320  58 9c 7a 62 dd 73 82 89  78 82 1d b2 d6 6f e6 e6   X.zb.s.. x....o..
0330  36 d1 af d5 a1 a7 ff d7  54 c8 70 f2 14 57 f9 89   6....... T.p..W..
0340  99 fa 4e cb 70 23 cd 56  cc dd 8f 5b a7 a7 b7 ad   ..N.p#.V ...[....
0350  32 68 1b a1 c0 de 1b e5  a7 f8 7a 5c 57 1c 72 09   2h...... ..z\W.r.
0360  3f f4 1a 22 c1 9d d9 f7  28 91 b9 e2 17 09 f9 a2   ?..".... (.......
0370  52 89 a5 d8 7f 7f d9 ba  31 52 d0 53 f0 de a5 b2   R....... 1R.S....
0380  37 6e 30 fb 0c e4 0d 46  dc b6 f5 50 55 64 3d 32   7n0....F ...PUd=2
0390  ec 3d 26 41 fa 56 ad ad  20 13 29 c0 51 00 2e 00   .=&A.V..  .).Q...
03a0  01 00 00 93 fd 00 9b 00  1c 05 04 00 00 a8 c0 4d   ........ .......M
03b0  0a b9 03 4c e3 2c 03 38  79 03 69 73 63 03 6f 72   ...L.,.8 y.isc.or
03c0  67 00 47 51 42 a0 24 40  77 c3 eb 0d 1d 92 8f 04   g.GQB.$@ w.......
03d0  78 3e b2 f6 e7 93 73 98  41 ae ea e2 60 87 97 65   x>....s. A...`..e
03e0  4f e5 45 d1 3f b6 c9 ad  3b 52 48 e3 f8 cd 81 cc   O.E.?... ;RH.....
03f0  18 75 50 90 26 58 28 47  39 f5 b7 a7 7d 39 de aa   .uP.&X(G 9...}9..
0400  69 59 d0 36 de 09 a9 10  33 2b 0c ad 51 4e e0 74   iY.6.... 3+..QN.t
0410  dc ab 35 6c 1b a9 0d c4  31 31 b9 b6 b5 f1 42 11   ..5l.... 11....B.
0420  ef 08 c6 4f 4f eb 32 d6  9b fb 85 7d 67 1c 3f 8d   ...OO.2. ...}g.?.
0430  25 cc 50 c4 55 1f 40 2a  0e f8 db 78 38 8f 74 0f   %.P.U.@* ...x8.t.
0440  58 65 c0 3f 00 2e 00 01  00 00 93 fd 00 9b 00 01   Xe.?.... ........
0450  05 04 00 00 a8 c0 4d 0a  b9 03 4c e3 2c 03 38 79   ......M. ..L.,.8y
0460  03 69 73 63 03 6f 72 67  00 0d fd 01 af 6b 47 87   .isc.org .....kG.
0470  51 e1 92 82 64 82 f2 b4  27 36 d1 e5 55 79 21 14   Q...d... '6..Uy!.
0480  31 e9 78 e9 2a 64 b8 bc  1a 59 67 33 e0 cf 5d c6   1.x.*d.. .Yg3..].
0490  ac 30 be 9d 02 75 a0 1e  03 9e 40 46 63 9c b5 cc   .0...u.. ..@Fc...
04a0  18 fb 81 6d ca f5 7b c3  35 ce 2e 7a ad 6c a3 6f   ...m..{. 5..z.l.o
04b0  df 6f 14 4f ee 71 57 fe  f3 96 d0 b0 7b 43 54 65   .o.O.qW. ....{CTe
04c0  cf c8 d1 56 4e 9b 62 82  32 b5 db 73 67 3b f1 35   ...VN.b. 2..sg;.5
04d0  02 19 3a 1c bd cc d5 ad  7c 23 2c 53 1a 8c 0a 45   ..:..... |#,S...E
04e0  eb 10 f2 83 21 68 f3 7d  7a c0 3f 00 2e 00 01 00   ....!h.} z.?.....
04f0  00 93 fe 00 9b 00 1c 05  04 00 00 a8 c0 4d 0a b9   ........ .....M..
0500  03 4c e3 2c 03 38 79 03  69 73 63 03 6f 72 67 00   .L.,.8y. isc.org.
0510  76 61 9f e1 a7 45 ee c6  78 71 d9 a2 a3 e0 20 56   va...E.. xq.... V
0520  d6 64 17 a7 25 d1 11 5b  51 80 50 24 c5 9f 4b 19   .d..%..[ Q.P$..K.
0530  fa 5c e3 6f e2 f2 ca 9e  e9 c0 9d ee 13 f8 21 03   .\.o.... ......!.
0540  22 d9 58 54 92 48 5f 71  95 d7 f4 4b 94 d4 5f 54   ".XT.H_q ...K.._T
0550  bf 1e da c1 f4 95 35 28  75 8f 09 f8 6a 15 11 eb   ......5( u...j...
0560  ef 86 99 6f 45 5b 37 4d  bc c8 8c 2b de b7 fc 7c   ...oE[7M ...+...|
0570  77 e5 15 06 b4 cd 03 66  6b 32 da aa c1 c1 f5 0f   w......f k2......
0580  46 24 ea cb 9e 2b 2a 04  b7 2a d4 b7 3d be 58 23   F$...+*. .*..=.X#
0590  c0 25 00 2e 00 01 00 00  76 ac 00 9b 00 01 05 04   .%...... v.......
05a0  00 00 a8 c0 4d 0a b9 03  4c e3 2c 03 38 79 03 69   ....M... L.,.8y.i
05b0  73 63 03 6f 72 67 00 45  62 4e 36 4e c3 e8 69 a4   sc.org.E bN6N..i.
05c0  94 da 56 f0 6a 73 e5 1f  16 e0 56 c8 95 b4 83 0b   ..V.js.. ..V.....
05d0  28 d1 dd 06 10 da da 0c  78 43 4b c0 60 09 88 26   (....... xCK.`..&
05e0  d8 36 8e a0 69 3a 7d cd  9e 31                     .6..i:}. .1

The above seems to be a new version of the attack which makes use of fragmentation. It has a differing payload in each packet and comes in short waves. This one seems to be mostly emanating from (or targeting) the 72.20.9.0/24 block. It appears to be getting used in conjunction with the old request:

0000  00 16 3e bb 00 02 00 16  3e cc 00 02 08 00 45 00   ..>..... >.....E.
0010  00 40 a6 11 00 00 e8 11  2f 70 48 14 09 93 00 00   .@...... /pH.....
0020  00 00 63 01 00 35 00 2c  00 00 2a 39 01 00 00 01   ..c..5., ..*9....
0030  00 00 00 00 00 01 03 69  73 63 03 6f 72 67 00 00   .......i sc.org..
0040  ff 00 01 00 00 29 10 00  00 00 80 00 00 00         .....).. ......

Here is the magic rule my friends:

# iptables -A INPUT -p udp -m string --hex-string "|03697363036f726700|" --algo bm --to 65535 -j DROP

UPDATE Thanks to David (below) for pointing out that matching on all ports is inefficient and could interfere with legitimate traffic, so the rule now includes --dport 53. Additionally, I was able to fix a problem resolving domains that involve .nl tld servers by extending the pattern with the header bytes that precede the name in the request:

# iptables -A INPUT -p udp -m string --hex-string "|00000000000103697363036f726700|" --algo bm --to 65535 --dport 53 -j DROP
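
For anyone wondering where the hex string comes from: it is simply isc.org in DNS wire format, where each label is prefixed by a one-byte length and the name ends with a zero byte. You can see the same bytes at offset 0x36 of the request capture above. A small sketch for generating the pattern for another name, assuming a POSIX shell with od and tr available (the qname_hex helper is made up for illustration):

```shell
#!/bin/sh
# Encode a domain name as DNS wire-format labels and print it as hex.
# Each label is preceded by its length in one byte; a zero byte ends the name.
qname_hex() {
    name=$1
    out=""
    old_ifs=$IFS
    IFS=.
    # Word splitting on "." yields one label per iteration.
    for label in $name; do
        len=$(printf '%02x' "${#label}")
        hex=$(printf '%s' "$label" | od -An -tx1 | tr -d ' \n')
        out="$out$len$hex"
    done
    IFS=$old_ifs
    printf '%s00\n' "$out"
}

qname_hex isc.org
# prints 03697363036f726700
```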