Posts Tagged ‘failure’

Recovering From a Disk Failure with Software RAID

Recovering from a disk failure in a software RAID is a straightforward process. If your server, chipset and kernel module allow it, you may be able to replace the offending drive without downtime. This is generally not the case with cheap SATA setups, where the server has to be powered down for the drive replacement. Although it might be safe to remove a SATA drive while the system is running, your kernel module may not recognise the new drive without being reloaded or the machine being rebooted.

The failed drive has to be identified first. This is easy with hardware controllers (typically an indicator light flashes on the failed drive) but not quite as simple with software RAID. The surefire way to identify a drive is by serial number: find the device node (through either cat /proc/mdstat or dmesg), then run either:

# hdparm -i /dev/{disk node}
# smartctl -a /dev/{disk node}
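If the kernel has already flagged a member as faulty, /proc/mdstat marks it with (F). The sketch below pulls those entries out so you know which device node to ask for a serial number. failed_members is a made-up helper name, and a drive that has dropped off the bus entirely may simply be missing from the listing (showing only as _ in the status brackets), so treat this as a starting point rather than a guarantee.

```shell
# Print "array member" for every member flagged (F) in mdstat-formatted
# text read from stdin. failed_members is a hypothetical helper name.
failed_members() {
    awk '$1 ~ /^md/ && $2 == ":" {
        for (i = 3; i <= NF; i++)
            if ($i ~ /\(F\)/) {
                dev = $i
                sub(/\[.*/, "", dev)   # strip the "[2](F)" suffix
                print $1, dev
            }
    }'
}

# Usage on a live system:
#   failed_members < /proc/mdstat
```

Feed the resulting device node to hdparm or smartctl as above and match the serial number against the label on the physical drive.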

If your configuration does not support hot swapping, power down the machine and replace the bad drive. If it does, you may first need to remove the other partitions on the target disk from their respective arrays:

# mdadm -f /dev/{md node} /dev/{partition node}

This marks the partition as failed, after which it can be removed from the set:

# mdadm -r /dev/{md node} /dev/{partition node}

If you are running swap space on the failed drive, be sure to disable it with swapoff before removing the disk.
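To see whether any active swap lives on the disk you are about to pull, you can filter /proc/swaps. This is a rough sketch: swap_on_disk is a hypothetical helper, the disk name sdb is only an example, and it does a plain prefix match, so it will not notice swap that sits on top of an md array.

```shell
# Print active swap devices whose path starts with /dev/<disk>.
# Expects /proc/swaps-formatted text on stdin; $1 is the disk name
# without /dev/, e.g. sdb. swap_on_disk is a hypothetical helper name.
swap_on_disk() {
    awk -v prefix="/dev/$1" 'NR > 1 && index($1, prefix) == 1 { print $1 }'
}

# Usage on a live system:
#   swap_on_disk sdb < /proc/swaps
# then run swapoff on each device it prints.
```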

It should be possible to boot the machine off the remaining set, but if you run into trouble it is just as easy to perform these operations from a live CD.

First we need to copy the partition table exactly from one of the healthy disks. Let sda represent a healthy disk and sdb the new one; getting the order backwards would overwrite the healthy disk's table:

# sfdisk -d /dev/sda | sfdisk /dev/sdb

Now we need to add the new partition(s) to our existing RAID set. If you are booting off a live CD and your sets were not automatically assembled on boot (they should have been), use mdadm --assemble to assemble them. Then:

# mdadm /dev/md0 -a /dev/sdb1

for each set and partition which needs to be added.
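With more than a couple of arrays this gets repetitive. Here is a sketch that prints (but does not run) one add command per array by pairing each partition of the healthy disk with the same-numbered partition on the new disk. It assumes both disks share a partition layout, which the sfdisk copy should have ensured. readd_commands is a hypothetical helper, and it lists every array the healthy disk belongs to, so read the output before pasting it.

```shell
# Print an "mdadm ... -a" command for every array that contains a
# partition of the healthy disk. Reads mdstat-formatted text on stdin.
# $1: healthy disk (e.g. sda), $2: new disk (e.g. sdb).
# readd_commands is a hypothetical helper name.
readd_commands() {
    awk -v h="$1" -v n="$2" '$1 ~ /^md/ && $2 == ":" {
        for (i = 3; i <= NF; i++) {
            dev = $i
            sub(/\[.*/, "", dev)              # "sda2[1]" -> "sda2"
            if (index(dev, h) == 1) {
                part = substr(dev, length(h) + 1)
                if (part ~ /^[0-9]+$/)
                    printf "mdadm /dev/%s -a /dev/%s%s\n", $1, n, part
            }
        }
    }'
}

# Usage on a live system:
#   readd_commands sda sdb < /proc/mdstat
```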

We can watch the resynchronization status with:

# watch cat /proc/mdstat

Personalities : [raid1] [raid6] [raid5] [raid4]
md1 : active raid1 sdb2[2] sda2[1]
      243922816 blocks [2/1] [_U]
      [================>....]  recovery = 83.8% (204426880/243922816) finish=58.3min speed=11271K/sec

md0 : active raid1 sdb1[0] sda1[1]
      272960 blocks [2/2] [UU]

The resync process will slow down considerably if there is heavy disk I/O at the same time, and you can expect below-average performance until recovery has completed.
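If you would rather log progress than sit in front of watch, the percentage can be scraped out of the same file. A minimal sketch, assuming the mdstat layout shown above; recovery_progress is a hypothetical helper name.

```shell
# Print "array percent" for every array that is currently recovering or
# resyncing. Reads mdstat-formatted text on stdin.
# recovery_progress is a hypothetical helper name.
recovery_progress() {
    awk '$1 ~ /^md/ && $2 == ":" { array = $1 }
         /recovery =|resync =/ {
             for (i = 1; i <= NF; i++)
                 if ($i ~ /%$/) print array, $i
         }'
}

# Usage on a live system:
#   recovery_progress < /proc/mdstat
```

The kernel also exposes rebuild speed floors and ceilings in /proc/sys/dev/raid/speed_limit_min and speed_limit_max (in KB/s) if the defaults do not suit your workload.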

ApacheBench Shows Lots of Failed Requests due to Length

Breathe easy. Smile. You’re probably here because you’ve just run ab and got output something like:

This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright 2006 The Apache Software Foundation, http://www.apache.org/

Benchmarking **** (be patient)
Completed 100000 requests
Completed 200000 requests
Completed 300000 requests
Completed 400000 requests
Completed 500000 requests
Completed 600000 requests
Completed 700000 requests
Completed 800000 requests
Completed 900000 requests
Finished 1000000 requests


Server Software:        nginx/1.2.1
Server Hostname:        ****
Server Port:            80

Document Path:          ****
Document Length:        162 bytes

Concurrency Level:      5
Time taken for tests:   9502.884921 seconds
Complete requests:      1000000
Failed requests:        697730
   (Connect: 0, Length: 697730, Exceptions: 0)
Write errors:           0
Total transferred:      279019852 bytes
Total POSTed:           247000494
HTML transferred:       158019731 bytes
Requests per second:    105.23 [#/sec] (mean)
Time per request:       47.514 [ms] (mean)
Time per request:       9.503 [ms] (mean, across all concurrent requests)
Transfer rate:          28.67 [Kbytes/sec] received
                        25.38 kb/s sent
                        54.06 kb/s total

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0   15  26.7     13    3183
Processing:     0   31  18.8     29    2319
Waiting:        0   30  15.9     28    1719
Total:          0   46  38.9     42    4333

Percentage of the requests served within a certain time (ms)
  50%     42
  66%     45
  75%     46
  80%     47
  90%     51
  95%     55
  98%    104
  99%    201
 100%   4333 (longest request)

697730 out of 1 million requests failed? No, not really.

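ApacheBench expects to be run against something that produces consistent output. It records the length of the first response as the Document Length and counts every later response of a different length as a Length failure. Chances are you’ve pointed it at a script with dynamic output, and the length of that output has changed since the first request.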
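You can confirm this by fetching the page a few times and tallying the body sizes. The sketch below is one way to do that with curl; sample_lengths and distinct_lengths are hypothetical helper names and the URL in the usage line is a placeholder. More than one line of output means the length varies and the Length failures are harmless.

```shell
# Fetch a URL several times and print each response body size in bytes.
# $1: URL, $2: number of requests (defaults to 5).
sample_lengths() {
    url=$1 count=${2:-5}
    i=0
    while [ "$i" -lt "$count" ]; do
        curl -s -o /dev/null -w '%{size_download}\n' "$url"
        i=$((i + 1))
    done
}

# Tally sizes read from stdin, one per line, e.g. "2 x 162 bytes".
distinct_lengths() {
    sort -n | uniq -c | awk '{ print $1 " x " $2 " bytes" }'
}

# Usage (placeholder URL):
#   sample_lengths http://example.com/page 10 | distinct_lengths
```

Newer builds of ab (the manual says 2.4.7 and later) also document a -l option that stops it reporting errors when the response length is not constant, which the 2.0.40-dev build shown here predates.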

Let’s have a nice cup of tea :)

Dumpster Diving Part Two: Self Indulgence and KSU Resets due to Power Loss

I wasn’t going to write again about the Meridian I adopted way back because I planned on cleaning it up and selling the system. It turns out I have had better things to do. One night recently I decided to reward myself for being productive by dicking around with it for a bit and managed to get it working on a VoIP line by way of an ATA. I was so impressed with the quality of the audio I decided to keep the system for personal use. There is a subtle irony to having a three metre analogue bridge between two perfectly digital systems and all this fire-retardant 1990s-beige/grey plastic is getting to my head.

The POTS interface is generally terminated with a 25-pair BIX (Building Industry Cross-connect) block. These require a punch-down tool. My only punch-down tool at the moment has automatic snippers on one end, which makes it useless here as the position of the terminal teeth alternates from top to bottom.

I jammed the pairs in with the attached blade tool which is strongly discouraged as you invariably weaken the teeth and risk cracking the (especially ancient) plastic.

I, however, am a rebel.

The KSU had been unplugged for some time and had reset to its defaults. Interestingly, the dialing mode defaults to pulse which, intuitively enough, doesn’t work with my ATA. It turns out you have to go in and set each line to tone dialing individually.

m7324: I rode this phone into Germany during WWII and rescued some POWs.

It took me forever to find out how to do this so pay close attention, I’m only going to remember this once:

  • Punch FEATURE **CONFIG
    • **CONFIG is **266344
  • The password should be CONFIG; if it has been lost, it will have to be reset.
  • Press the top-rightmost indicated meta key, the display will read:
    • 1. Trk/Line Data
  • Press the top-rightmost indicated meta key again. The display will read:
    • Show line: _
  • This prompt expects three digits. To configure line one, press 001. The display will read:
    • Trunk data
  • Press the top-rightmost indicated meta key once then the bottom-rightmost twice. The display will read:
    • Dial Mode: Pulse
  • The rightmost display key will read
    • CHANGE
  • Press the CHANGE display key and it will toggle between Pulse and Tone.
  • Press Rls to exit the menu
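In case the **266344 above looks arbitrary, it is just CONFIG spelled out on a standard telephone keypad. A throwaway sketch of the letter-to-digit mapping (keypad_digits is a made-up name, and it assumes the usual ABC=2 through WXYZ=9 layout):

```shell
# Translate a word to the digits it spells on a standard phone keypad.
# keypad_digits is a hypothetical helper name.
keypad_digits() {
    printf '%s\n' "$1" | tr 'a-z' 'A-Z' | tr 'A-Z' '22233344455566677778889999'
}

# Usage:
#   keypad_digits CONFIG
```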

Above you can see the data and software cartridges for the M8x24; one fits into the other which fits into the cabinet. If your KSU loses all of its settings when the power goes out you need to replace the backup capacitors mounted on the data board. They are 1 farad and 5.5 volts each.

These appear to be a very common capacitor configuration for data backup and shouldn’t be hard to find at a reasonable price.

Something neat I learned in my travels is that when these 24V phones are subjected to the ring voltage on a POTS line (nominally around 90V AC in North America) they tend to blow. What they may lack in ruggedness, however, they more than make up for in ease of installation, as their all-digital signalling makes their ports polarity agnostic.

foxpa.ws
Made in Canada  •  There's a fox in the Gibson!  •  2010-12