Sunday, April 27, 2008

IP takeover, IP failover: how it works ...

I've been interested lately in IP failover for Linux/FreeBSD machines, and from time to time I've been looking at the HA Linux web page. Frankly, I didn't find any document describing how it is done "the right way" or "how it is usually done" ... probably because I usually pick the wrong keywords to google for.

Anyway, I decided I'd try to understand how it works and maybe implement something that does it. My theory was that it should work using a ping test plus ifconfig to bring up an alias IP on the interfaces of the machines that fail over for each other, and that I might have to reload some services with new configs. Anyway, I didn't really know, so I started looking at how this could be done, and as usual I'm writing the post over the span of multiple days ... or weeks :)

A quick search on Google brought up this page, which is a little old (written in 2003) but made me feel good, since it described pretty much what I thought the idea behind IP takeover was, with some very important additions: mainly send_arp, and the need for a mechanism to bring things back to normal when the other IP starts responding again.

Fiddling around the net for a while, I found a very nice article/document, probably the best I've found so far; you can see it here on linux-ha.org. It mentions a piece of software called fake, so I quickly searched the packages provided by Ubuntu and found a package named fake. I installed it (apt-get install fake) and looked up where it lives: /usr/sbin/fake. It turns out to be a bash script! (YAY) ... A quick look at it shows that it uses the above-mentioned send_arp ... it's nice when you find consistent information on the net ;)

So the basic idea is this: you have an IP you wish to keep up, let's say 192.168.1.1, and you have two boxes, A and B, each with its own IP; let's say A has 192.168.1.10 and B has 192.168.1.20. Each box monitors the other (let's say via ping or arping). Initially 192.168.1.1 lives on box A alongside 192.168.1.10. Whenever box B stops getting responses from box A (on 192.168.1.10), it runs ifconfig to acquire 192.168.1.1, plus any other desired scripts and services. Whenever 192.168.1.10 starts responding again (box A is back up, and it's the default machine for 192.168.1.1), box B brings 192.168.1.1 down so that box A can answer requests again ... tada!
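Here is a rough shell sketch of that loop from box B's point of view. It's a minimal sketch under my own assumptions, not how fake or heartbeat actually do it: the interface eth0, the alias eth0:0, the /24 netmask, the 2-second interval and the use of iputils arping -U (in place of send_arp) to announce the move are all hypothetical choices. Source it and run monitor as root.

```shell
# Box B's side of the takeover. Assumptions (hypothetical): eth0, alias eth0:0,
# /24 netmask, 2s polling, Linux iputils ping/arping.
VIP=192.168.1.1     # the shared service IP
PEER=192.168.1.10   # box A's own IP, which we monitor
IFACE=eth0

# decide PEER_UP HAVE_VIP (both 0/1) -> prints acquire, release or none
decide() {
  local peer_up=$1 have_vip=$2
  if [ "$peer_up" -eq 0 ] && [ "$have_vip" -eq 0 ]; then
    echo acquire      # A is down and we don't hold the VIP yet
  elif [ "$peer_up" -eq 1 ] && [ "$have_vip" -eq 1 ]; then
    echo release      # A is back: hand the VIP back to its default owner
  else
    echo none
  fi
}

# one ping with a 1-second timeout (Linux ping syntax)
peer_alive() {
  ping -c 1 -W 1 "$PEER" >/dev/null 2>&1
}

# main loop; never returns
monitor() {
  local have_vip=0 up
  while true; do
    if peer_alive; then up=1; else up=0; fi
    case $(decide "$up" "$have_vip") in
      acquire)
        ifconfig "$IFACE:0" "$VIP" netmask 255.255.255.0 up
        # gratuitous ARP so neighbours update their caches (send_arp's job in fake)
        arping -U -c 3 -I "$IFACE" "$VIP"
        have_vip=1
        ;;
      release)
        ifconfig "$IFACE:0" down
        have_vip=0
        ;;
    esac
    sleep 2
  done
}
```

Keeping the decision in its own function makes the state logic easy to check without touching any interface. A real setup would probably want several consecutive failed pings before acquiring, so a single lost packet doesn't cause a takeover.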

Saturday, April 5, 2008

Oracle RAC on iSCSI ... the small details

Lately I've been trying to install Oracle 11g RAC on iSCSI (following this document from Oracle ... a very good document, I have to say), but with little success. So I think I'll share what I've been trying to do ... beyond just following the document.

Starting with Debian Etch, moving to RHES, CentOS and now Enterprise Linux (Oracle Unbreakable), I figured out how much I hate RPM package management and how annoying the RH key thing is ... well, maybe this is because I come from a .deb-based distro, or it's simply personal preference.

Anyway, after installing the OS according to the steps in the above-mentioned document, I had to install 18 packages, and I really didn't want to search for each one and install them manually, so I did the following:

* I copied the ISO images, mounted them, copied the RPMs, created a repo, and installed everything in one shot. Here is how to do it quickly:


# create directories to mount the 4 ISO images I have
mkdir -p /mnt/cd{1..4}
# mount the ISO images (located in /root in my case)
loopid=1
for i in /root/Ent*; do
  mount "$i" /mnt/cd$loopid -o loop=/dev/loop$loopid
  loopid=$((loopid + 1))
  sleep 2
done

# copy the RPM files to create a local repo
mkdir /tmp/repo
# quote the pattern so the shell doesn't expand it before find sees it
find /mnt/ -name '*.rpm' -exec cp {} /tmp/repo/ \;

# now generate the repository metadata
createrepo /tmp/repo

# add it to the yum repos as [localrepo2]
# ('EOF' is quoted so $releasever reaches yum unexpanded)
cat << 'EOF' > /etc/yum.repos.d/handcrafted2.repo
[localrepo2]
name=Enterprise $releasever - My Local Repo
baseurl=file:///tmp/repo
enabled=1
gpgcheck=0
#gpgkey=file:///path/to/your/RPM-GPG-KEY
EOF

# refresh yum (note: "yum update" also updates installed packages that have newer versions)
yum update


Question: does anyone know what the story is with the 8 loop device limit?
Now, let's install the 18 packages plus the iSCSI initiator. Here are their names (in a friendly, copy-pasteable form):
binutils compat-libstdc++-33 elfutils-libelf elfutils-libelf-devel glibc-2.5 glibc-common-2.5 glibc-devel-2.5 gcc gcc-c++ libaio libaio-devel libgcc libstdc++ libstdc++-devel make sysstat unixODBC unixODBC-devel iscsi-initiator-utils


The list already includes the iscsi-initiator-utils package, which will be needed later when configuring iSCSI on the nodes.

All you have to do is run "yum install [paste line here]", or you can copy the line to a file, oracle-rac-pack-list for example, and run "yum install `cat oracle-rac-pack-list`". If for some reason (I had one!) you want every package on a separate line, just run "cat oracle-rac-pack-list | sed -e 's/ /\n/g'".

Now we have the packages installed on the first node, but what about the others? I decided not to repeat the steps above. I thought about using sshfs, but it quickly turned out that it isn't included as a package in Enterprise Linux 5.1, so I decided to copy the needed packages to the other nodes:
scp `cat oracle-rac-pack-list | sed -e 's/ /-\[0-9\]\*\n/g' | xargs -l find /mnt/ -name ` myusername@remote-ip-addr:~

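Note: make sure there is a space at the end of the line in the file; the sed expression only rewrites spaces, so without it the last package name gets no version pattern appended.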

But this apparently was not enough: missing dependencies! I thought I'd cut the crap, collect them once into a file and post them here, so I can copy them from here later if I need them ... I don't like this solution, but I need to get going. The list that emerged was the following packages:

libgomp glibc-headers elfutils-libelf-devel-static kernel-headers

Now you can use the same line from above to copy the files; just replace oracle-rac-pack-list with the name of a file containing the missing dependencies.
Note: make sure there is a space at the end of the line.
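The trailing-space requirement comes from the sed trick only rewriting spaces. A possibly cleaner variant, sketched below with hypothetical helper names (rpm_patterns, find_rpms), splits the list on any whitespace and appends the version pattern to every name, so the last package is handled too. It assumes RPM file names look like NAME-VERSION-RELEASE.ARCH.rpm with VERSION starting with a digit.

```shell
# rpm_patterns LISTFILE: one find(1) -name pattern per package name,
# e.g. "gcc" -> "gcc-[0-9]*.rpm" (so gcc does not also match gcc-c++).
# The extra echo guarantees a final newline even if the file lacks one.
rpm_patterns() {
  { cat "$1"; echo; } | tr -s ' \t' '\n\n' | sed -e '/^$/d' -e 's/$/-[0-9]*.rpm/'
}

# find_rpms LISTFILE [DIR]: print matching RPM paths under DIR (default /mnt)
find_rpms() {
  local list=$1 dir=${2:-/mnt}
  rpm_patterns "$list" | while IFS= read -r pat; do
    find "$dir" -name "$pat"
  done
}

# Usage (same destination as above):
#   scp $(find_rpms oracle-rac-pack-list) myusername@remote-ip-addr:~
```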

On the remote machine, go to the user's home directory (or whatever empty directory you copied the .rpm files to) and run "rpm -i *.rpm". You might have to delete one glibc-2.5-18.iXXX.rpm package, since you will have two, one for i386 and one for i686 ... pick the one that matches your arch. (This section is ugly! Any suggestions on how to do it in a cleaner way are very welcome.)
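One way to make the duplicate-glibc step a little less manual: the hypothetical helper below lists RPM files that share NAME-VERSION-RELEASE with a sibling built for the wanted architecture, so only the non-matching duplicate gets removed. It assumes file names of the form NAME-VERSION-RELEASE.ARCH.rpm and that the wanted arch is whatever uname -m reports (e.g. i686).

```shell
# dups_to_remove WANTED_ARCH FILE...
# Prints every FILE whose arch differs from WANTED_ARCH while a file with the
# same NAME-VERSION-RELEASE built for WANTED_ARCH is also in the list.
dups_to_remove() {
  local want=$1 f g base arch
  shift
  for f in "$@"; do
    base=${f%.*.rpm}                   # strip ".ARCH.rpm"
    arch=${f%.rpm}; arch=${arch##*.}   # keep just ARCH
    [ "$arch" = "$want" ] && continue
    for g in "$@"; do
      if [ "$g" = "$base.$want.rpm" ]; then
        echo "$f"
        break
      fi
    done
  done
}

# Usage in the directory holding the copied RPMs:
#   dups_to_remove "$(uname -m)" *.rpm | xargs -r rm -v && rpm -ivh *.rpm
```

Packages that exist for only one arch (or as noarch) are left alone, since they have no sibling for the wanted arch.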

Next I configured the iSCSI initiator. I already had the iSCSI target configured on a Debian machine, so I won't go into how I did that here, maybe some other time. I checked whether I could see the volumes I had created on it using the command:
iscsiadm -m discovery -t sendtargets -p iscsi-storage-priv

Apparently I could see them, so I wanted to log in to them. Instead of writing multiple lines, I decided to use the output of the previous command:

iscsiadm -m discovery -t sendtargets -p iscsi-storage-priv| cut -f2 -d\ | xargs -l -I '{}' iscsiadm -m node -T {} -p XX.XX.XX.XX -l

Notice that in the last command I put the IP; for some reason, using the hostname defined in /etc/hosts did not work (!). And of course, using the iscsiadm discovery results assumes that all the volumes defined there are for your Oracle nodes. If you are using the iSCSI server for other things, you will need to run the commands one by one for the desired volumes, or use grep if there is something in the names used explicitly for your Oracle RAC storage.
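For the grep route, a small filter like the hypothetical one below pulls the target names out of the discovery output, keeps only those matching a pattern you pick (the rac pattern in the usage comment is only an example), and drops duplicates, which show up when a target is reachable through more than one portal. It assumes each discovery line has the usual open-iscsi form PORTAL,TPGT TARGETNAME.

```shell
# iqns_from_discovery [REGEX]: read `iscsiadm -m discovery -t sendtargets`
# output on stdin and print the unique target names matching REGEX (default: all).
iqns_from_discovery() {
  awk '{ print $2 }' | grep -E -- "${1:-.}" | sort -u
}

# Example (portal IP left as the placeholder used above):
#   iscsiadm -m discovery -t sendtargets -p iscsi-storage-priv | iqns_from_discovery rac |
#     xargs -I '{}' iscsiadm -m node -T {} -p XX.XX.XX.XX -l
```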

Now, to have them logged in automatically when the system starts up, I did something similar to the command above:
iscsiadm -m discovery -t sendtargets -p iscsi-storage-priv| cut -f2 -d\ | xargs -l -I '{}' iscsiadm -m node -T {} -p XX.XX.XX.XX --op update -n node.startup -v automatic
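Since the login and the startup update loop over the same targets, both could be driven from one pass. The hypothetical helper below only prints the two iscsiadm commands per target (the same login and node.startup=automatic commands as above), so you can review the list before piping it to sh. The portal argument is the IP you used above; XX.XX.XX.XX stays a placeholder.

```shell
# iscsi_cmds PORTAL: read discovery output (PORTAL,TPGT TARGETNAME) on stdin
# and print, without running them, the login and auto-startup commands.
iscsi_cmds() {
  local portal=$1 pt target
  while read -r pt target; do
    [ -n "$target" ] || continue
    printf 'iscsiadm -m node -T %s -p %s -l\n' "$target" "$portal"
    printf 'iscsiadm -m node -T %s -p %s --op update -n node.startup -v automatic\n' \
      "$target" "$portal"
  done
}

# Review first, then execute:
#   iscsiadm -m discovery -t sendtargets -p iscsi-storage-priv | iscsi_cmds XX.XX.XX.XX
#   iscsiadm -m discovery -t sendtargets -p iscsi-storage-priv | iscsi_cmds XX.XX.XX.XX | sh
```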


Now getting OCFS2! Again, I hate RPM, and no, I don't want to set up a local repo on two nodes, so I'm still sticking with copying the required RPMs. Here are the packages and dependencies:
ocfs2-2.6.18-53.el5 ocfs2-tools ocfs2-tools-devel e2fsprogs-devel glib2-devel ocfs2console


Now comes the OCFS2 configuration step using the ocfs2console tool. The configuration goes well on the first node with no problems, but on the second node I get an error message:

Could not start cluster stack. This must be resolved before any OCFS2 filesystem can be mounted


Now the search begins for why this happens. I thought I'd start by checking the sysctl values, so I ran sysctl -a on both nodes and compared the results; I saw nothing abnormal. Anyway, I decided to copy the sysctl.conf file from the first node to the second one and apply it with sysctl -p. Still the same problem.
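Comparing sysctl -a output by eye is error-prone. A hypothetical helper like the one below compares two saved outputs (each line in the KEY = VALUE form sysctl -a prints) and reports keys that differ or exist on only one side; file names such as linux1.sysctl are just examples. Expect some noise from counters like fs.dentry-state, which change all the time.

```shell
# sysctl_diff FILE1 FILE2: compare two saved `sysctl -a` outputs.
sysctl_diff() {
  awk -F' = ' '
    NR == FNR { a[$1] = $2; next }          # first file: remember values
    {
      seen[$1] = 1
      if (!($1 in a))       print "only in second: " $1
      else if (a[$1] != $2) print "differs: " $1
    }
    END { for (k in a) if (!(k in seen)) print "only in first: " k }
  ' "$1" "$2" | sort
}

# On each node:  sysctl -a 2>/dev/null > "$(hostname -s).sysctl"
# Then, with both files in one place:  sysctl_diff linux1.sysctl linux2.sysctl
```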

Checking /var/log/messages yields an interesting error:
modprobe: FATAL: Module ocfs2_nodemanager


But why? The same packages should be on both nodes! So I started checking whether I had different packages using "yum list | grep installed | wc" ... I have 726 packages on the "working node" and 737 on the "failing node". I needed a comparison, so I created two lists and looked at the difference! It turned out that, for some reason, I didn't have the ocfs2 kernel module installed on the second node, so installing ocfs2-2.6.18-53.el5.i686 did it! And the ocfs2console config completed successfully ... I think :)
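The two-list comparison can be done with comm. A minimal sketch with the hypothetical function name missing_on_second, assuming each node's list comes from rpm -qa with a NAME-VERSION-RELEASE.ARCH query format (my choice, not the command used above):

```shell
# missing_on_second LIST1 LIST2: packages in LIST1 that are absent from LIST2
# (one package per line; comm needs sorted input, hence the sort -u).
missing_on_second() {
  local a b
  a=$(mktemp); b=$(mktemp)
  sort -u "$1" > "$a"
  sort -u "$2" > "$b"
  comm -23 "$a" "$b"
  rm -f "$a" "$b"
}

# On each node:
#   rpm -qa --qf '%{NAME}-%{VERSION}-%{RELEASE}.%{ARCH}\n' > "$(hostname -s).pkgs"
# Then:
#   missing_on_second linux1.pkgs linux2.pkgs   # what linux2 lacks
#   missing_on_second linux2.pkgs linux1.pkgs   # what linux1 lacks
```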

Installing the ASM stuff went fine: I installed oracleasm-support-2.0.4-1.el5.i386.rpm, oracleasm-2.6.18-53.el5-2.0.4-1.el5.i686.rpm and oracleasmlib-2.0.3-1.el5.i386.rpm; the last one I had to download from the Oracle website.

Then I downloaded and extracted the Oracle Clusterware software and the Oracle database, and installed cvuqdisk on both nodes following the document mentioned at the beginning of this post. On the first node I ran exec for ssh-agent and then ssh-add; I hadn't set any passphrase for now, so I wasn't prompted for one. Then, on the first node, linux1, I ran:

./runcluvfy.sh stage -pre crsinst -n linux1,linux2 -verbose

And it failed! Check: User equivalence for user "oracle" failed for node linux1, the same node I'm running the test on. Easy, I did:

cat id_rsa.pub >> authorized_keys

I also added a swap file to fix another warning related to the swap size. The Total memory check still failed, but I ignored it: one of the nodes has 1027200KB of RAM but the other only has 512MB. I pray ;) and continue.
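Appending the key by hand works, but running it twice leaves duplicate lines. A hypothetical idempotent variant for the authorized_keys step is sketched below; the chmod values reflect sshd's StrictModes check, which refuses keys when ~/.ssh or authorized_keys is writable by others. The paths in the usage comment assume the oracle user's default id_rsa key.

```shell
# add_authorized_key PUBKEY AUTHFILE: append the key only if that exact line
# is not already present, then tighten permissions on the file.
add_authorized_key() {
  local pub=$1 auth=$2
  touch "$auth"
  if ! grep -qxF -f "$pub" "$auth"; then
    cat "$pub" >> "$auth"
  fi
  chmod 600 "$auth"
}

# As oracle on linux1:
#   chmod 700 ~/.ssh
#   add_authorized_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
```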

The next test went fine, and now I'm at "20. Install Oracle 11g Clusterware Software". Everything is going quite smoothly and I'm following the document from Oracle ... the wizard brings up the two scripts I need to run as root. I run the first on both nodes ... successfully. The second one runs successfully on the first node and fails on the other! I look at the logs, and there you go ... Failed to get IP for linux2.mydomain.tld ... :( ... Sure, I have nothing like that in the DNS, but why did it append the domain name? Anyway, I cd to /etc/sysconfig/, edit network with vi, remove the domain name from the HOSTNAME line ... and rerun the root.sh script! It exits quickly, printing only two lines saying that Oracle CRS is already configured and that it will be running under init(1M)! Now I'm not sure whether it's working correctly or not!
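The vi edit can also be scripted. A minimal sketch with a hypothetical function name, assuming the unquoted HOSTNAME=host.domain layout of /etc/sysconfig/network on this RHEL-style system; it keeps a .bak copy. The running hostname is a separate thing, set with the hostname command.

```shell
# strip_domain_from_hostname FILE: turn "HOSTNAME=linux2.mydomain.tld" into
# "HOSTNAME=linux2", leaving other lines alone; the original is kept as FILE.bak.
strip_domain_from_hostname() {
  sed -i.bak 's/^HOSTNAME=\([^.]*\)\..*$/HOSTNAME=\1/' "$1"
}

# As root on linux2:
#   strip_domain_from_hostname /etc/sysconfig/network
#   hostname linux2
```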

I perform the following test described in the doc, and here is what I get:
[oracle@linux1 ~]$ $ORA_CRS_HOME/bin/crs_stat -t -v
Name           Type        R/RA   F/FT   Target    State     Host
----------------------------------------------------------------------
ora.linux1.gsd application 0/5    0/0    ONLINE    ONLINE    linux1
ora.linux1.ons application 0/3    0/0    ONLINE    ONLINE    linux1
ora.linux1.vip application 0/0    0/0    ONLINE    ONLINE    linux1
ora.linux2.gsd application 0/5    0/0    ONLINE    ONLINE    linux2
ora.linux2.ons application 0/3    0/0    ONLINE    OFFLINE
ora.linux2.vip application 0/0    0/0    ONLINE    ONLINE    linux2


Aaaaahhhh ... something is wrong ... I cd to /u01/app/crs/bin and try to guess which executable can be used. I try ./onsctl start ... and it's still trying to use the domain name attached to the hostname ... I switch to root, run hostname linux2, exit root and retry ... now here is what I get:

[oracle@linux2 bin]$ ./onsctl start
globalInitNLS: NLS boot file not found or invalid
-- default linked-in boot block used
Number of onsconfiguration retrieved, numcfg = 0
globalInitNLS: NLS boot file not found or invalid
-- default linked-in boot block used
globalInitNLS: NLS boot file not found or invalid
-- default linked-in boot block used
Number of onsconfiguration retrieved, numcfg = 0
onsctl: ons started


Looks good ... but rerunning $ORA_CRS_HOME/bin/crs_stat -t -v yields the same output as before.
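Rather than scanning the table by eye each time, the State column can be filtered. A small sketch with a hypothetical name; it assumes the crs_stat -t -v layout shown above (Name Type R/RA F/FT Target State Host, after two header lines), so State is the sixth field.

```shell
# not_online: read `crs_stat -t -v` output on stdin and print the names of
# resources whose State column is anything other than ONLINE.
not_online() {
  awk 'NR > 2 && NF >= 6 && $6 != "ONLINE" { print $1 }'
}

# Usage:  $ORA_CRS_HOME/bin/crs_stat -t -v | not_online
```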

A quick Google search brings me to this blog post, and srvctl grabs my attention! I played around a little and did:


./srvctl stop asm -n linux2
$ORA_CRS_HOME/bin/crs_stat -t -v
./srvctl stop nodeapps -n linux2
$ORA_CRS_HOME/bin/crs_stat -t -v



Then I switch to root and do:


/etc/init.d/init.crs stop
/etc/init.d/init.crs start



And now $ORA_CRS_HOME/bin/crs_stat -t -v reports everything as online :) ... I have no clue which, if any, of the above was sufficient, nor do I know whether this will screw up my installation later on! But the Oracle Clusterware installation reported that it finished successfully.

I went ahead and started installing the Oracle database. I got a warning about ip_local_port_range; again, although I thought I had followed the doc step by step, it seems this step was not done, and now I fear there is something else I missed ... :( Anyway, I fixed it using sysctl and wrote the values to /etc/sysctl.conf too ... and proceeded. The database installed, seemingly successfully; the examples installed, seemingly successfully; the TNS listener too. Then I started creating a database, following the doc, with ASM ... the ASM creator is working ... BINGO! I get an error:

PRKS-1009: Failed to start ASM instance "+ASM2" on node "linux2", [CRS-0215: Could not start resource ora.linux2.ASM.asm".]
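About the ip_local_port_range fix mentioned above: the value goes into /etc/sysctl.conf and is applied with sysctl -p. A hypothetical helper that updates the key if it's already there and appends it otherwise, so reruns don't pile up duplicate lines, could look like this. The "1024 65000" in the usage comment is only an example value; take the one your installation guide asks for.

```shell
# set_sysctl_conf FILE KEY VALUE: replace an existing "KEY = ..." line in FILE,
# or append one if KEY is not there yet (a .bak copy is kept when replacing).
# Dots in KEY act as regex wildcards, which is harmless for sysctl names.
set_sysctl_conf() {
  local file=$1 key=$2 value=$3
  if grep -q "^[[:space:]]*$key[[:space:]]*=" "$file"; then
    sed -i.bak "s|^[[:space:]]*$key[[:space:]]*=.*|$key = $value|" "$file"
  else
    printf '%s = %s\n' "$key" "$value" >> "$file"
  fi
}

# As root (example value only):
#   set_sysctl_conf /etc/sysctl.conf net.ipv4.ip_local_port_range "1024 65000"
#   sysctl -p
```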


I go to the linux2 node and try /etc/init.d/oracleasm listdisks ... I get nothing ... :( ... I try /etc/init.d/oracleasm scandisks ... it finishes successfully ... but still no good; I get nothing when I list the disks! Checking for a permission issue, I did:

cd app/
[root@linux2 app]# ls
crs oracle oraInventory
[root@linux2 app]# ls -alh
total 20K
drwxrwxr-x 5 root oinstall 4.0K Apr 27 18:57 .
drwxr-xr-x 3 root root 4.0K Apr 9 15:34 ..
drwxr-xr-x 35 root oinstall 4.0K Apr 27 18:57 crs
drwxrwxr-x 5 oracle oinstall 4.0K Apr 28 16:24 oracle
drwxrwx--- 4 oracle oinstall 4.0K Apr 28 15:37 oraInventory


As you can see, the crs owner was root; according to the document it should have been oracle, so I did chown -R oracle:oinstall /u01/app. Still, this did not help! I look into the logs and find in /u01/app/crs/log/linux2/crsd/crsd.log:
2008-04-28 16:25:23.484: [ CRSRES][128224144] startRunnable: setting CLI values
2008-04-28 16:26:01.377: [ CRSAPP][128224144] StartResource error for ora.linux2.ASM2.asm error code = 1


Digging more in the logs (whose structure I don't understand, by the way), I found this "warning":
Starting ORACLE instance (normal)
WARNING: You are trying to use the MEMORY_TARGET feature. This feature requires the /dev/shm file system to be mounted for at least 285212672 bytes. /dev/shm is either not mounted or is mounted with available space less than this size. Please fix this so that MEMORY_TARGET can work as expected. Current available is 263954432 and used is 0 bytes.
memory_target needs larger /dev/shm


So I manually changed the size by setting it in /etc/fstab, inserting the line:

tmpfs /dev/shm tmpfs size=300m 0 0

And now things go fine again :), though I risk Linux deadlocking ;) ... Now I try to create a disk group and get a message saying:

Could not mount the diskgroup on remote node linux2 using connection service linux2:1521:+ASM2. Ensure that the listener is running on this node and the ASM instance is registered to the listener. Recived the following error:

ORA-15032: not all alterations performed
ORA-15063: ASM discovered an insufficient number of disks for diskgroup "ORCL_DATA1"


I get the same error when I try to create the FLASH_RECOVERY_AREA group ... so I check /etc/init.d/oracleasm listdisks ... and it yields nothing! I try scandisks, still the same thing, so I stop init.crs and start it again ... listdisks still brings nothing! Then I ran mount -a as root, then scandisks and listdisks, and I get the volumes .... good ... I click OK on the message and get another one, something like "no more data to read from socket" ... no idea whether this is normal or not! I pray ;), check the box beside ORCL_DATA1 and press Next!

I followed the doc and everything went fine until, again, it complained that shm was too small, so I had to increase it manually ... I continued, things went fine, and the database installation started ...
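The numbers in the MEMORY_TARGET warning explain why size=300m was enough: 285212672 bytes is exactly 272 MiB (272 × 1048576), the 263954432 bytes available were only about 251.7 MiB, and 300m is 314572800 bytes. Below is a hypothetical helper to check an fstab-style size against the required byte count before remounting; it only understands plain integers with an optional k/m/g suffix (binary units, as tmpfs uses), not tmpfs's percentage form.

```shell
# size_to_bytes SIZE: "300m" -> 314572800; integer with optional k/m/g suffix
# (any case), multiplied by powers of 1024.
size_to_bytes() {
  local n unit
  n=${1%[kKmMgG]}
  unit=${1#"$n"}
  case $unit in
    k|K) echo $((n * 1024)) ;;
    m|M) echo $((n * 1024 * 1024)) ;;
    g|G) echo $((n * 1024 * 1024 * 1024)) ;;
    *)   echo "$n" ;;
  esac
}

# shm_size_ok SIZE REQUIRED_BYTES: succeeds if SIZE covers REQUIRED_BYTES
shm_size_ok() {
  [ "$(size_to_bytes "$1")" -ge "$2" ]
}

# With the numbers from the warning, applying the new size without a reboot:
#   shm_size_ok 300m 285212672 && mount -o remount,size=300m /dev/shm
```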

Well, the database installation took too long, so I packed up, left it, and decided to continue the next day ... What happened was a power failure (bad ... :( bad ...), and the node linux1 did not finish creating the database ... so I ran dbca again, deleted the database, got a few warnings ... and started creating a new database. This time I got an error that there was not enough disk space on the ASM!! I think the data created by the first attempt is still there ... To make a long story short, the database installation failed, and I decided I need to do things differently, starting with better hardware and larger iSCSI targets.