crazy little thing called love: Oracle RAC on ISCSI ... the small details

Lately I've been trying to install Oracle 11g RAC on iSCSI (following this document from Oracle ... a very good document, I have to say) but with little success. So I thought I'd share what I've been doing, other than following the document.

Starting with Debian etch, then moving to RHES, CentOS and now Enterprise Linux (Oracle Unbreakable), I figured out how much I hate rpm package management and how annoying the RH key thing is ... well, maybe this is because I come from a .deb based distro, or simply a personal preference.

Anyway, after installing the OS according to the steps described in the above-mentioned document, I had to install 18 packages, and I really didn't want to search for each and every one of them and install them manually, so I did the following:

* I copied the ISO images, mounted them, copied the rpms, created a repo, and installed them all in one shot. Here is how to do this quickly:

#create directories to mount the 4 iso images i have
mkdir -p /mnt/cd{1..4}
#mount the iso images (located in /root in my case)
loopid=1
for i in /root/Ent*; do
    mount "$i" /mnt/cd$loopid -o loop=/dev/loop$loopid
    loopid=`expr $loopid + 1`
    sleep 2
done
#copy the rpm files to create a local repo
mkdir /tmp/repo
#quote the pattern so the shell does not expand it before find sees it
find /mnt/ -name '*.rpm' -exec cp {} /tmp/repo/ \;
#now generate file list (i suppose it's called that way!)
createrepo /tmp/repo
#add it to the yum repos as [localrepo2]
cat << EOF > /etc/yum.repos.d/handcrafted2.repo
[localrepo2]
name=Enterprise releasever - My Local Repo
baseurl=file:///tmp/repo
enabled=1
gpgcheck=0
#gpgkey=file:///path/to/you/RPM-GPG-KEY
EOF
#update yum repo
yum update

Question: does anyone know what the story is with the 8 loop devices limit?
Now, let's install the 18 + iscsi-initiator packs. Here are their names (in a friendly way):

binutils compat-libstdc++-33 elfutils-libelf elfutils-libelf-devel glibc-2.5 glibc-common-2.5 glibc-devel-2.5 gcc gcc-c++ libaio libaio-devel libgcc libstdc++ libstdc++-devel make sysstat unixODBC unixODBC-devel iscsi-initiator-utils

The last name in the list, iscsi-initiator-utils, is not one of the 18; I added it because it will be needed later when configuring iSCSI on the nodes.

All you have to do is run "yum install [paste line here]", or you can copy the line to a file, oracle-rac-pack-list for example, and run "yum install `cat oracle-rac-pack-list`". If for some reason (I had one!) you want every package on a separate line, just run "cat oracle-rac-pack-list | sed -e 's/ /\n/g' ".
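A small sketch of the same conversion done with `tr` instead of `sed`, which also copes with a doubled space between names. The temporary file and the three package names are just sample values picked from the list above, not the full list.

```shell
# sample list file; in the post it is called oracle-rac-pack-list
pack_list=$(mktemp)
printf '%s ' binutils gcc make > "$pack_list"

# one package per line; -s squeezes repeated spaces into a single newline
one_per_line=$(tr -s ' ' '\n' < "$pack_list")
printf '%s\n' "$one_per_line"

# and back to a single line, ready to paste after "yum install"
single_line=$(printf '%s\n' "$one_per_line" | tr '\n' ' ')
printf '%s\n' "$single_line"
```

The round trip leaves a trailing space on the single line, which happens to be what the copy command further down expects.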

Now we have the packs installed on the first node, but what about the others? I did not want to repeat the steps above, so I thought about using sshfs, but it quickly turned out that it is not included as a package with Enterprise Linux 5.1. So I decided to copy the needed packs to the other nodes:

scp `cat oracle-rac-pack-list | sed -e 's/ /-\[0-9\]\*\n/g' | xargs -l find /mnt/ -name ` myusername@remote-ip-addr:~

Note: make sure there is a space at the end of the line in the file, otherwise the sed expression never adds the -[0-9]* suffix to the last package name.
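A sketch that sidesteps the trailing-space requirement by looping over the names instead of rewriting the line with sed. `find_pack_rpms` is a made-up helper name, and the demo uses empty stand-in files with invented version numbers rather than real rpms.

```shell
# print the rpm files under a directory that match the names in a list file
find_pack_rpms() {
    list_file=$1
    search_dir=$2
    # word splitting on the file content is intended: names are space separated
    for pkg in $(cat "$list_file"); do
        # "name-[0-9]*" matches name-<version>... but not name-devel-<version>...
        find "$search_dir" -name "${pkg}-[0-9]*.rpm"
    done
}

# demo with stand-in files (hypothetical versions)
demo_dir=$(mktemp -d)
touch "$demo_dir/make-3.81-1.1.i386.rpm" \
      "$demo_dir/libaio-0.3.106-3.2.i386.rpm" \
      "$demo_dir/libaio-devel-0.3.106-3.2.i386.rpm"
printf 'make libaio' > "$demo_dir/pack-list"   # no trailing space needed

find_pack_rpms "$demo_dir/pack-list" "$demo_dir"
# the result could then be handed to scp, for example:
#   scp $(find_pack_rpms oracle-rac-pack-list /mnt/) myusername@remote-ip-addr:~
```

Because the pattern insists on a digit right after the dash, listing libaio does not drag in libaio-devel; that one is only copied if it is in the list itself.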

But this apparently was not enough: missing dependencies! I thought I'd cut the crap, collect them once into a list and post it here, so I can copy it from here later if I need it ... I don't like the solution, but I need to get going. The list that emerged was the following packages:

libgomp glibc-headers elfutils-libelf-devel-static kernel-headers

Now you can use the same line from above to copy the files, just replace oracle-rac-pack-list with the name of a file containing the missing dependencies.
Note: make sure there is a space at the end of the line.

On the remote machine, go to the user's home directory (or to whatever empty directory you copied the .rpm files to) and run "rpm -i *.rpm". You might have to delete one glibc-2.5-18.iXXX.rpm package, since you will have two, one for i386 and one for i686 ... pick one according to your arch. (This section is ugly! Any suggestions on how to do it in a cleaner way are very welcome.)
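One possibly cleaner way to deal with the duplicate glibc: skip an i386 file whenever an i686 build of the same package and version sits next to it. This is only a sketch; `pick_rpms` is a made-up name, the demo files are empty stand-ins (the libgomp version is invented), and on a machine that really is i386-only the preference would have to be flipped.

```shell
# list the rpm files in a directory, preferring i686 over i386 duplicates
pick_rpms() {
    rpm_dir=$1
    for f in "$rpm_dir"/*.rpm; do
        case "$f" in
            *.i386.rpm)
                # same name and version also available as i686? then skip this one
                if [ -e "${f%.i386.rpm}.i686.rpm" ]; then
                    continue
                fi
                ;;
        esac
        printf '%s\n' "$f"
    done
}

# demo with stand-in files
demo_dir=$(mktemp -d)
touch "$demo_dir/glibc-2.5-18.i386.rpm" \
      "$demo_dir/glibc-2.5-18.i686.rpm" \
      "$demo_dir/libgomp-4.1.2-14.el5.i386.rpm"
pick_rpms "$demo_dir"
# on the node itself this could become: rpm -i $(pick_rpms .)
```

Packages that only exist as i386 are still listed, so nothing else gets lost.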

Next I configured the iSCSI initiator. I already had the iSCSI target configured on a Debian machine, so I won't go into how I did that here, maybe some other time. I checked whether I could see the volumes I created on it using the command:

iscsiadm -m discovery -t sendtargets -p iscsi-storage-priv

Apparently I could see them, so I wanted to log in to them. Instead of writing multiple lines, I decided to reuse the output of the previous command:

iscsiadm -m discovery -t sendtargets -p iscsi-storage-priv| cut -f2 -d\ | xargs -l -I '{}' iscsiadm -m node -T {} -p XX.XX.XX.XX -l

Notice that in the last command I've put the IP; for some reason using the hostname defined in /etc/hosts did not work (!). And of course, using the iscsiadm discovery results assumes that all the volumes defined are for your current services or your Oracle nodes. If you are using the iSCSI server for other things as well, you will need to run the commands one by one for the desired volumes, or use grep if there is something in the names that is used only for your Oracle RAC storage.
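A sketch of that filtering idea, assuming the discovery output has the usual "portal,tpgt target-name" shape. `rac_targets` is a made-up helper, and the IQN names and the `:rac.` marker are invented for the demo; use whatever is unique in your own target names.

```shell
# keep only the target names (second column) that contain a given string
rac_targets() {
    awk -v pat="$1" 'index($2, pat) > 0 { print $2 }'
}

# invented sample of discovery output
sample='192.168.2.195:3260,1 iqn.2008-04.tld.mydomain:rac.crs
192.168.2.195:3260,1 iqn.2008-04.tld.mydomain:rac.asm1
192.168.2.195:3260,1 iqn.2008-04.tld.mydomain:backup.vol1'

printf '%s\n' "$sample" | rac_targets ':rac.'
# real use, keeping the login command from the post:
#   iscsiadm -m discovery -t sendtargets -p iscsi-storage-priv | rac_targets ':rac.' |
#     xargs -I '{}' iscsiadm -m node -T '{}' -p XX.XX.XX.XX -l
```

Using `index` rather than a regular expression keeps the dots in the names literal.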

Now, to have them logged in to automatically when the system starts up, I did something similar to the above command:

iscsiadm -m discovery -t sendtargets -p iscsi-storage-priv| cut -f2 -d\ | xargs -l -I '{}' iscsiadm -m node -T {} -p XX.XX.XX.XX --op update -n node.startup -v automatic

Now getting OCFS2! Again, I hate rpm, and no, I don't want to set up a local repo on two nodes, so I'm still sticking with copying the required rpms. Here are the packs and dependencies:
ocfs2-2.6.18-53.el5 ocfs2-tools ocfs2-tools-devel e2fsprogs-devel glib2-devel ocfs2console

Now comes the OCFS2 configuration step, using the ocfs2console tool. The configuration goes well on the first node, no problems. On the second node, I get an error message:

Could not start cluster stack. This must be resolved before any OCFS2 filesystem can be mounted

The search now begins for why this happens. I thought I'd start by checking the sysctl values, so I ran sysctl -a on both nodes and compared the results, and saw nothing abnormal. Anyway, I decided to copy the sysctl.conf file from the first node to the second one and load it with sysctl -p. Still the same problem.

Checking /var/log/messages yields an interesting error:

modprobe: FATAL: Module ocfs2_nodemanager

But why? The same packs should be on both nodes! So I started checking whether I have different packs using "yum list| grep installed|wc" ... I have 726 packs on the "working node" and 737 packs on the "failing node". I need a comparison, so I created two lists to see the difference! The difference was that for some reason I didn't have the ocfs2 kernel module installed on the second node, so installing ocfs2-2.6.18-53.el5.i686 did it, and the ocfs2console config completed successfully ... I think :)
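The two-list comparison can be sketched with `comm`. The helper name `only_in_first` and the tiny sample lists are made up for the demo; on real nodes each list could come from something like `rpm -qa --qf '%{NAME}\n'`.

```shell
# print lines that are in the first sorted list but not in the second
only_in_first() {
    LC_ALL=C comm -23 "$1" "$2"
}

# on each node the list could be produced with:
#   rpm -qa --qf '%{NAME}\n' | LC_ALL=C sort > /tmp/packs.$(hostname)
demo_dir=$(mktemp -d)
printf '%s\n' ocfs2-tools ocfs2console ocfs2-2.6.18-53.el5 | LC_ALL=C sort > "$demo_dir/working"
printf '%s\n' ocfs2-tools ocfs2console | LC_ALL=C sort > "$demo_dir/failing"

# what does the working node have that the failing node lacks?
only_in_first "$demo_dir/working" "$demo_dir/failing"
```

Both lists have to be sorted the same way, which is why the sort and the comparison are pinned to the C locale.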

Installing the ASM stuff went fine. I installed oracleasm-support-2.0.4-1.el5.i386.rpm, oracleasm-2.6.18-53.el5-2.0.4-1.el5.i686.rpm and oracleasmlib-2.0.3-1.el5.i386.rpm; the last one I had to download from the Oracle website.

Then I downloaded and extracted the Oracle Clusterware software and the Oracle database, and installed cvuqdisk on both nodes following the document mentioned at the beginning of this post. On the first node I did the exec for ssh-agent and ssh-add; I had not set any passphrase for now, so I was not prompted for one. Then, on the first node, linux1, I did:

./runcluvfy.sh stage -pre crsinst -n linux1,linux2 -verbose

And it failed! Check: User equivalence for user "oracle" failed for node linux1, the same node I'm performing the test on. Easy, I did:

cat id_rsa.pub >> authorized_keys

I also added a swapfile to fix another warning related to the swap size. I still had "Total memory check failed", but I ignored this: one of the nodes has 1027200KB but the other node only has 512MB of RAM. I pray ;) and continue.
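For the authorized_keys step, a sketch that can be re-run without piling up duplicate lines. `add_key` is a made-up helper and the key in the demo is a dummy string, not a real key.

```shell
# append a public key to an authorized_keys file unless it is already there
add_key() {
    pub_file=$1
    auth_file=$2
    touch "$auth_file"
    # -x: whole line must match, -F: treat the key as a fixed string
    if ! grep -qxF "$(cat "$pub_file")" "$auth_file"; then
        cat "$pub_file" >> "$auth_file"
    fi
    # sshd tends to ignore the file if others can write to it
    chmod 600 "$auth_file"
}

# demo with a dummy key
demo_dir=$(mktemp -d)
printf 'ssh-rsa AAAAexamplekey oracle@linux1\n' > "$demo_dir/id_rsa.pub"
add_key "$demo_dir/id_rsa.pub" "$demo_dir/authorized_keys"
add_key "$demo_dir/id_rsa.pub" "$demo_dir/authorized_keys"   # second run changes nothing
cat "$demo_dir/authorized_keys"
```

On the node itself the two arguments would be the files used above, id_rsa.pub and authorized_keys.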

The next test went fine, and now I'm at "20. Install Oracle 11g Clusterware Software". Everything is going quite smoothly and I'm following the document from Oracle ... the wizard brings up the two scripts that I need to run as root. I run the first on both nodes ... successfully. The second one runs successfully on the first node and fails on the other! I look at the logs, and there you go ... Failed to get IP for linux2.mydomain.tld ... :( ... sure, I have nothing like this in the DNS, but why did it append the domain name? Anyway, I cd to /etc/sysconfig/, vi network and remove the domain name from the HOSTNAME line ... and retry running the root.sh script! It exits quickly and outputs only two lines, saying that Oracle CRS is already configured and that it will be running under init(1M)! I'm now not sure if it's working correctly or not!
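The HOSTNAME edit can also be expressed as a one-line sed, sketched here against a throwaway copy instead of the real /etc/sysconfig/network. `strip_domain` is a made-up name, and it only prints the result; writing it back is left to you.

```shell
# print a sysconfig network file with the domain part removed from HOSTNAME
strip_domain() {
    sed 's/^\(HOSTNAME=[^.]*\)\..*$/\1/' "$1"
}

# demo on a throwaway file
demo_file=$(mktemp)
printf 'NETWORKING=yes\nHOSTNAME=linux2.mydomain.tld\n' > "$demo_file"
strip_domain "$demo_file"
```

The file is only read at boot, so the running system keeps its old name until the hostname command is used or the node is restarted.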

I perform the following test described in the doc, and here is what I get:

[oracle@linux1 ~]$ $ORA_CRS_HOME/bin/crs_stat -t -v
Name           Type           R/RA   F/FT   Target    State     Host
----------------------------------------------------------------------
ora.linux1.gsd application    0/5    0/0    ONLINE    ONLINE    linux1
ora.linux1.ons application    0/3    0/0    ONLINE    ONLINE    linux1
ora.linux1.vip application    0/0    0/0    ONLINE    ONLINE    linux1
ora.linux2.gsd application    0/5    0/0    ONLINE    ONLINE    linux2
ora.linux2.ons application    0/3    0/0    ONLINE    OFFLINE
ora.linux2.vip application    0/0    0/0    ONLINE    ONLINE    linux2

Aaaaahhhh ... there is something incorrect ... I cd to /u01/app/crs/bin and try to guess which executable can be used. I try ./onsctl start ... and it's still trying to use the domain name attached to the hostname ... I switch to root, do hostname linux2, exit root, retry ... now here is what I get:

[oracle@linux2 bin]$ ./onsctl start
globalInitNLS: NLS boot file not found or invalid -- default linked-in boot block used
Number of onsconfiguration retrieved, numcfg = 0
globalInitNLS: NLS boot file not found or invalid -- default linked-in boot block used
globalInitNLS: NLS boot file not found or invalid -- default linked-in boot block used
Number of onsconfiguration retrieved, numcfg = 0
onsctl: ons started

Looks good ... but retrying $ORA_CRS_HOME/bin/crs_stat -t -v yields the same output as before.

A quick Google search brings me to this blog post, and srvctl grabs my attention! Well, I played around a little and did:

./srvctl stop asm -n linux2
$ORA_CRS_HOME/bin/crs_stat -t -v
./srvctl stop nodeapps -n linux2
$ORA_CRS_HOME/bin/crs_stat -t -v

Then I switch to root and do:

/etc/init.d/init.crs stop
/etc/init.d/init.crs start

And now $ORA_CRS_HOME/bin/crs_stat -t -v reports everything to be online :) ... I have no clue which, if any, of the above was sufficient, nor do I know if this will screw up my installation later on! But the Oracle Clusterware software installation reported finishing successfully.
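Instead of eyeballing the table each time, a short awk filter can name the resources that are not online. This sketch assumes the "-t -v" column layout shown earlier in this post (State is the sixth column); `not_online` is a made-up name and the sample is a cut-down copy of that output.

```shell
# print the name of every ora.* resource whose State column is not ONLINE
not_online() {
    awk '/^ora\./ && $6 != "ONLINE" { print $1 }'
}

# cut-down sample in the layout of "crs_stat -t -v"
sample='Name           Type        R/RA F/FT Target State   Host
ora.linux2.gsd application 0/5  0/0  ONLINE ONLINE  linux2
ora.linux2.ons application 0/3  0/0  ONLINE OFFLINE
ora.linux2.vip application 0/0  0/0  ONLINE ONLINE  linux2'

printf '%s\n' "$sample" | not_online
# real use: $ORA_CRS_HOME/bin/crs_stat -t -v | not_online
```

No output means every listed resource reported ONLINE.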

I went ahead and started installing the Oracle database. I got a warning about ip_local_port_range; again, although I thought I had followed the doc step by step, it seemed this step was not done, and now I fear there is something else I did not do ... :( Anyway, I fixed it using sysctl and wrote the values to /etc/sysctl.conf too ... and proceeded ... database installed, seemed to go successfully; examples installed, seemed to go successfully; TNS listener too. Now I started creating a database, following the doc, with ASM ... ASM creator working ... BINGO! I get an error:

PRKS-1009: Failed to start ASM instance "+ASM2" on node "linux2", [CRS-0215: Could not start resource ora.linux2.ASM.asm".]

I go to the linux2 node and try /etc/init.d/oracleasm listdisks ... I get none ... :( ... I try /etc/init.d/oracleasm scandisks ... it finishes successfully ... but still no good, I get nothing when I do listdisks! Checking for a permission issue, I did:

cd app/
[root@linux2 app]# ls
crs  oracle  oraInventory
[root@linux2 app]# ls -alh
total 20K
drwxrwxr-x  5 root   oinstall 4.0K Apr 27 18:57 .
drwxr-xr-x  3 root   root     4.0K Apr  9 15:34 ..
drwxr-xr-x 35 root   oinstall 4.0K Apr 27 18:57 crs
drwxrwxr-x  5 oracle oinstall 4.0K Apr 28 16:24 oracle
drwxrwx---  4 oracle oinstall 4.0K Apr 28 15:37 oraInventory

As you can see, the owner of crs was root; according to the document it should have been oracle, so I did chown -R oracle:oinstall /u01/app. Still, this did not help! I look into the logs, and in /u01/app/crs/log/linux2/crsd/crsd.log I find:

2008-04-28 16:25:23.484: [ CRSRES][128224144] startRunnable: setting CLI values
2008-04-28 16:26:01.377: [ CRSAPP][128224144] StartResource error for ora.linux2.ASM2.asm error code = 1

Fiddling more in the logs (whose structure I don't understand, btw), I found this "warning":

Starting ORACLE instance (normal)
WARNING: You are trying to use the MEMORY_TARGET feature. This feature requires the /dev/shm file system to be mounted for at least 285212672 bytes. /dev/shm is either not mounted or is mounted with available space less than this size. Please fix this so that MEMORY_TARGET can work as expected. Current available is 263954432 and used is 0 bytes.
memory_target needs larger /dev/shm

So I manually changed the size in /etc/fstab by inserting the line:

tmpfs /dev/shm tmpfs size=300m 0 0

And now things go fine again :), but I risk Linux deadlocking ;) ... now I try to create a disk group and I get a message saying:

Could not mount the diskgroup on remote node linux2 using connection service linux2:1521:+ASM2. Ensure that the listener is running on this node and the ASM instance is registered to the listener. Received the following error:

ORA-15032: not all alterations performed
ORA-15063: ASM discovered an insufficient number of disks for diskgroup "ORCL_DATA1"

I get the same error when I try to create the group FLASH_RECOVERY_AREA ... so I check /etc/init.d/oracleasm listdisks ... and it yields nothing! I try scandisks, still the same thing, so then I stop init.crs and start it again ... listdisks still brings nothing! Then I did mount -a as root, then scandisks and listdisks, and I get the volumes ... good ... I click OK on the message and get another message, something like "no more data to read from socket" ... no idea if this is normal or not! I pray ;) and check the box beside ORCL_DATA1 and press next!

I followed the doc and everything went fine, until again it complained that the shm was too small, so I had to increase its size manually ... then I continued, things went fine and the database installation started ...
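To check the /dev/shm size before Oracle complains, the numbers from the MEMORY_TARGET warning can be compared by hand. A sketch: `shm_big_enough` is a made-up helper, 257768 KB is simply the 263954432 bytes from the warning divided by 1024, and 307200 KB is the size=300m from the fstab line.

```shell
# succeed if a tmpfs with avail_kb kilobytes free can hold required_bytes
shm_big_enough() {
    avail_kb=$1
    required_bytes=$2
    [ $(( avail_kb * 1024 )) -ge "$required_bytes" ]
}

required=285212672   # "at least" value from the MEMORY_TARGET warning

for kb in 257768 307200; do
    if shm_big_enough "$kb" "$required"; then
        echo "$kb KB: ok"
    else
        echo "$kb KB: too small"
    fi
done

# the live value could be read with:  df -Pk /dev/shm | awk 'NR == 2 { print $4 }'
# resizing without a reboot:          mount -o remount,size=300m /dev/shm
```

A later complaint like the one above would just mean a larger required value, so the same check applies with the new number.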

Well, the database installation took too long, so I packed up, left it running and decided to continue the next day ... what happened was that there was a power failure (bad ... :( bad ...) and the node linux1 did not finish creating the database ... so I ran dbca again, deleted the database, got a few warnings ... and started creating a new database. This time I got an error that there is not enough disk space on the ASM!! I think the data created by the first attempt is still there ... to make a long story short, the database installation failed and I decided I need to do things differently, starting by using better hardware and larger iSCSI targets.


Saturday, April 5, 2008
