When managing a large number of servers, you will most likely come across one that uses a RAID controller, and you will need to know the status of its disks and its disk array.
As these are usually proprietary controllers, normal tools like ipmiutil, lshw, lspci, etc. aren’t of much use.
Obviously we are talking about Unix-like systems, and the agent must run as a cron job to send us up-to-date information about our hardware.
It seems that a lot of the major vendors’ (Dell, HP, IBM) controllers are either Adaptec-based or HP-based and thus it is possible to use the CLI utils: arcconf or hrconf (see the end of the post for download links).
As my own experience was mainly with the arcconf utility, the script included below was written based on it. An example of arcconf’s output can be viewed here.
Small disclaimer: even though this blog is new, a few of these scripts are not so new… This one, for example, is one of my earlier works. It is very quirky and not especially robust, but it gets the job done!
I will also go into some detail over some basic stuff used in the script like awk, handling command output as variables, and sending mail with bash.
USUAL DISCLAIMER FOR SCRIPTS AND STUFF: PLEASE USE COMMON SENSE, HOPEFULLY YOU KNOW WHAT YOU ARE DOING BEFORE YOU RUN COMMANDS AS ROOT IN *NIX ENVIRONMENTS.
### check environment ###
if [ -f /usr/bin/arcconf ] # MAIN CHECK IF FILE NOT THERE SKIP EVERYTHING
then
### do raid controller check ###
output=`arcconf getconfig 1`
### set some variables
serverName=$(hostname -f)
status="OK" # initialization
specError="Report for $serverName:" # for nicer output when sending the email
### find stuff from check ###
contStatus=`echo -en "($output)" | awk '/Controller Status/ {print $4}'` # should be 'Okay'
stripes=`echo -en "($output)" | awk '/Defunct stripes/ {print $4}'` # should be 'No'
segments=`echo -en "($output)" | awk '/Defunct segments/ {print $4}'` # should be 'No'
PFA=`echo -en "($output)" | awk '/PFA/ {print $3}'` # should be 'No'
physState=`echo -en "($output)" | awk '/State/ {print $3}'` # should be 'Online'
logStatus=`echo -en "($output)" | awk '/Status of logical drive/ {print $6}'` # should be 'Okay'
### Controller status check
if [[ $contStatus != "Okay" ]]; then
status="BAD"
specError="$specError \n Controller Status is BAD"
fi
### Logical drive status loop for every logical drive
g=0 # for device identifier
for i in $logStatus; do
if [[ $i != "Okay" ]]; then
specError="$specError \n Logical Drive no. $g is not OK"
status="BAD"
fi
g=$((g+1))
done
### Defunct stripes loop for every logical drive
g=0 # for device identifier
for i in $stripes; do
if [[ $i != "No" ]]; then
specError="$specError \n Defunct Stripes in logical device $g"
status="BAD"
fi
g=$((g+1))
done
### Defunct segments loop for every logical drive
g=0 # for device identifier
for i in $segments; do
if [[ $i != "No" ]]; then
specError="$specError \n Defunct Segments in logical device $g"
status="BAD"
fi
g=$((g+1))
done
### PFA loops for every physical drive
g=0 # for device identifier
for i in $PFA; do
if [[ $i != "No" ]]; then
specError="$specError \n PFA of device $g is bad"
status="BAD"
fi
g=$((g+1))
done
### State loop for every physical drive
g=0 # for device identifier
for i in $physState; do
if [[ $i != "Online" ]]; then
specError="$specError \n Physical State of device $g is bad"
status="BAD"
fi
g=$((g+1))
done
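else # end of checking /usr/bin/arcconf
serverName=$(hostname -f) # the if-branch never ran, so set this here for the mail subject
status="BAD"
specError="Report for $serverName: \n arcconf is not in /usr/bin! doing nothing!"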
fi ### END OF MAIN CHECK
### do appropriate action ###
if [[ $status != "OK" ]]; then
# echo -e $specError # for testing
echo -e $specError | mail -s "$serverName RAID controller status error" support@domain.com -- -f raid_agent@server.com
fi
Now a little break-down of the script:
### check environment ###
if [ -f /usr/bin/arcconf ] # MAIN CHECK IF FILE NOT THERE SKIP EVERYTHING
then
### do raid controller check ###
output=`arcconf getconfig 1`
### set some variables
serverName=$(hostname -f)
status="OK" # initialization
specError="Report for $serverName:" # for nicer output when sending the email
Here for some reason I decided to put arcconf in /usr/bin/ and hardcode it into the script, and make the whole script dependent on it being there… NM.
Also regarding code-style, there is some inconsistency here, once using the old backtick operator to run programs from within the script and grab their result, and once using the $(command) syntax.
Oh well. FYI they do the same thing, although the $(command) form is easier to nest.
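To make the comparison concrete, here is a minimal sketch of both command-substitution forms side by side. The variable names are made up for the illustration, and the path is only used as a string, so arcconf does not need to be installed.

```bash
# Both forms capture a command's standard output into a variable.
old_style=`printf 'hello'`
new_style=$(printf 'hello')

# Nesting is where they differ in convenience: $(...) nests as-is,
# while the inner pair of backticks has to be escaped.
nested_new=$(basename "$(dirname /usr/bin/arcconf)")
nested_old=`basename \`dirname /usr/bin/arcconf\``

printf '%s %s %s %s\n' "$old_style" "$new_style" "$nested_new" "$nested_old"
```

Both nested variants end up holding the name of the directory that contains the file.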
$output is basically the core of the whole script, as it includes the result of the arcconf command, and from it we retrieve all the data we find interesting.
Also, I just figured out while writing this why the ugly hardcoding above was done: when the script runs via cron, anything built on relative directories (“./arcconf” rather than “/usr/bin/arcconf”) breaks.
$serverName and $specError are just variables to include when sending the report at the end.
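One way to keep the absolute path in a single place is sketched below. The helper name tool_path and the variable ARCCONF are hypothetical (they are not part of the original script); the only things taken from the script are the /usr/bin/arcconf location and the "getconfig 1" call.

```bash
# tool_path PATH: print PATH if it is absolute and points at an executable
# regular file; fail otherwise. (Hypothetical helper for this sketch.)
tool_path() {
    case $1 in
        /*) [ -f "$1" ] && [ -x "$1" ] && printf '%s\n' "$1" ;;
        *)  return 1 ;;   # refuse relative paths: under cron the working directory is not ours
    esac
}

if ARCCONF=$(tool_path /usr/bin/arcconf); then
    output=$("$ARCCONF" getconfig 1)   # always call the tool by its full path
else
    echo "arcconf is not in /usr/bin" >&2
fi
```

The point of the design is that the path is checked and used through the same variable, so the check and the call cannot drift apart.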
$status is initialized with the value “OK” as part of the logic of the script.
That is, we start positive, and then do various tests, which may negate the positive status, in which case we send a report.
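The "start positive, let any test negate it" logic can be sketched in isolation like this. The check function, the report variable and the sample values are assumptions made up for the illustration; the real script inlines each test instead.

```bash
status="OK"          # start positive
report="Report:"

# check NAME ACTUAL EXPECTED: on a mismatch, flip status to BAD and note why.
check() {
    if [ "$2" != "$3" ]; then
        status="BAD"
        report="$report
$1 is $2 (expected $3)"
    fi
}

check "Controller Status" "Okay" "Okay"       # matches, nothing is recorded
check "Logical Drive 0" "Degraded" "Okay"     # made-up failing value

# Only a negated status produces any output.
if [ "$status" != "OK" ]; then
    printf '%s\n' "$report"
fi
```

Because the flag only ever moves from OK to BAD, the order of the tests does not matter and a single failure is enough to trigger the report.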
contStatus=`echo -en "($output)" | awk '/Controller Status/ {print $4}'` # should be 'Okay'
stripes=`echo -en "($output)" | awk '/Defunct stripes/ {print $4}'` # should be 'No'
segments=`echo -en "($output)" | awk '/Defunct segments/ {print $4}'` # should be 'No'
PFA=`echo -en "($output)" | awk '/PFA/ {print $3}'` # should be 'No'
physState=`echo -en "($output)" | awk '/State/ {print $3}'` # should be 'Online'
logStatus=`echo -en "($output)" | awk '/Status of logical drive/ {print $6}'` # should be 'Okay'
### Controller status check
if [[ $contStatus != "Okay" ]]; then
status="BAD"
specError="$specError \n Controller Status is BAD"
fi
### Logical drive status loop for every logical drive
g=0 # for device identifier
for i in $logStatus; do
if [[ $i != "Okay" ]]; then
specError="$specError \n Logical Drive no. $g is not OK"
status="BAD"
fi
g=$((g+1))
done
### Defunct stripes loop for every logical drive
g=0 # for device identifier
for i in $stripes; do
if [[ $i != "No" ]]; then
specError="$specError \n Defunct Stripes in logical device $g"
status="BAD"
fi
g=$((g+1))
done
### Defunct segments loop for every logical drive
g=0 # for device identifier
for i in $segments; do
if [[ $i != "No" ]]; then
specError="$specError \n Defunct Segments in logical device $g"
status="BAD"
fi
g=$((g+1))
done
### PFA loops for every physical drive
g=0 # for device identifier
for i in $PFA; do
if [[ $i != "No" ]]; then
specError="$specError \n PFA of device $g is bad"
status="BAD"
fi
g=$((g+1))
done
### State loop for every physical drive
g=0 # for device identifier
for i in $physState; do
if [[ $i != "Online" ]]; then
specError="$specError \n Physical State of device $g is bad"
status="BAD"
fi
g=$((g+1))
done
Ok, so here we do some actual tests, to see if the report is good.
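$contStatus, $stripes etc. are simply variables that store the part of each matching row in the report that indicates the status of an element, which we test later.
This is done as follows:
The variable $output is surrounded with double quotes so that echo doesn’t destroy its newline characters, as it otherwise would: unquoted, the shell splits the text into words and echo joins them with single spaces, so awk would see one long line. (The parentheses inside the quotes are just literal characters; they end up attached to the first and last lines of the text and are not needed.)
This is piped into awk, which simply looks for rows containing the specified string, for instance “Controller Status”, and prints the (in this case) 4th whitespace-separated word of each such row.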
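Here is a small sketch of both points, quoting and field extraction. The sample text is made up in the "Label : Value" shape the script expects; it is not captured arcconf output, and the flat and kept variables exist only for this demonstration.

```bash
# Made-up sample, NOT real arcconf output.
output='Controller Status : Okay
Status of logical drive : Okay
Status of logical drive : Degraded'

# Unquoted: word splitting collapses the newlines, so awk sees a single line.
flat=$(echo $output | awk 'END {print NR}')
# Quoted: the newlines survive, so awk sees three separate lines.
kept=$(echo "$output" | awk 'END {print NR}')

# Fields are whitespace-separated, and the lone ":" counts as a field too.
contStatus=$(echo "$output" | awk '/Controller Status/ {print $4}')
logStatus=$(echo "$output" | awk '/Status of logical drive/ {print $6}')

echo "lines unquoted: $flat, quoted: $kept"
echo "controller: $contStatus"
echo "logical drives:" $logStatus
```

Note that $logStatus ends up holding one word per matching row, which is what the loops in the script rely on.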
$contStatus is different from the other cases, as it is singular. The other rows may appear more than once: there is more than one physical disk in every disk array (and possibly more than one logical drive), and more than one row accordingly.
For the others I created a funny little for loop that loops over every instance of data found, each of which corresponds to one logical or physical drive. $g is an index to indicate which drive has the problem.
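The loop-with-counter idea, reduced to its core, looks roughly like this. The per-disk states are invented for the example, and the problems variable is a stand-in for the text the script appends to $specError.

```bash
# Made-up states, one word per disk, in the form awk would hand them back.
physState='Online
Failed
Online'

problems=""
g=0                           # index of the disk being examined
for i in $physState; do       # unquoted on purpose: split into one word per disk
    if [ "$i" != "Online" ]; then
        problems="$problems $g"
    fi
    g=$((g+1))
done

echo "disks checked: $g, problem indexes:$problems"
```

Keep in mind that $g is only the position of the row in the output, counted from zero. It matches the controller's own device numbering only as long as every matching row belongs to a device and the rows come in device order.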
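else # end of checking /usr/bin/arcconf
serverName=$(hostname -f) # the if-branch never ran, so set this here for the mail subject
status="BAD"
specError="Report for $serverName: \n arcconf is not in /usr/bin! doing nothing!"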
fi ### END OF MAIN CHECK
### do appropriate action ###
if [[ $status != "OK" ]]; then
# echo -e $specError # for testing
echo -e $specError | mail -s "$serverName RAID controller status error" support@domain.com -- -f raid_agent@server.com
fi
Finally we have two more items in the script. The first is the else clause of the very first test (whether arcconf is in /usr/bin). Also, please take care when using exclamation marks in bash: if they come after a word, like here, there is no problem, but it is not a trivial issue to escape an exclamation mark on its own or preceding some string.
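For the exclamation mark, the simplest approach I know of is single quotes, sketched below with made-up variable names. In bash, history expansion (the feature that gives "!" its special meaning) is only switched on by default in interactive shells, so a script run from cron is not affected. At a prompt, though, a "!" followed by text inside double quotes can be expanded.

```bash
# Single quotes: nothing inside is expanded, so "!" is always literal.
msg='arcconf is not in /usr/bin! doing nothing!'

# A "!" in front of a word is safe the same way.
alert='!urgent'

printf '%s\n%s\n' "$msg" "$alert"
```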
And finally we echo $specError and pipe it into mail, which uses the following syntax:
- “-s” for the subject.
- Then our destination.
- Then we use “--” to separate the options section.
- “-f” and state a from address.
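The sending step can be wrapped so that it is possible to dry-run it without a working mail setup. This is only a sketch: MAILER and send_report are hypothetical names, and the from-address part is left out because passing “-- -f address” through to the underlying mailer depends on which mail implementation is installed.

```bash
# MAILER is a hypothetical override for dry runs; it defaults to mail.
MAILER=${MAILER:-mail}

# send_report SUBJECT RECIPIENT BODY
send_report() {
    # printf '%b' expands the \n sequences in the body, like echo -e in the script.
    printf '%b\n' "$3" | "$MAILER" -s "$1" "$2"
}

# Intended use, with the variables from the script:
# send_report "$serverName RAID controller status error" support@domain.com "$specError"
```

Pointing MAILER at a small stand-in command that prints its arguments and input makes it easy to see exactly what would be sent before any real mail goes out.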
That’s about it.
As promised here are links to the relevant command line tools, which should work for a pretty wide range of products.
Of course, this script is just an example, and should probably be tweaked according to your specific needs, however hopefully some of the methods described here may be useful to you.
- arcconf is supposedly available for various operating systems and architectures here.
- hrconf is available for x86 architecture and for x86_64.
- Also I uploaded the actual tool I used on 32-bit GNU/Linux environment, which you can download if that suits you as well (don’t forget chmod 755!).
While looking at keywords relevant to this website, I found that someone wrote something very similar a few months ago, another bash based arcconf reporter for RAID status.
Anyways it is interesting – another strange case of simultaneous invention I guess.
Probably if I had found that script, I wouldn’t have bothered with writing this one.
They are very similar in their methods.
I am glad I did write my own, since this way I learned a lot, and had some fun.
Oh well
Link | July 23rd, 2009 at 00:18