......................................................................
First Created June 29, 2012 -  Raj Patel
Rev1:  Mar 26, 2012 - By Raj Patel
Rev12: Apr 14, 2015 - By Raj Patel 

Comments and feedback to => rajpat@us.ibm.com

==============================================================
Perform the minimum tests below before escalating to AMVIRT,165.
Unless specified otherwise, all commands are to be run from the padmin shell.
==============================================================

*******************************************************************
* Only commands run in the padmin shell are supported by VIOS.    *
* Commands run within oem_setup_env are for component owners such *
* as CAA.                                                         *
*******************************************************************

1) Modify the /etc/netsvc.conf file of the VIOS logical partition before 
   creating the cluster.

   You must make changes to the /etc/netsvc.conf file of the
   VIOS logical partition before creating the cluster. This file is 
   used to specify the ordering of name resolution for networking 
   routines and commands. Later, if you want to edit the 
   /etc/netsvc.conf file, perform the following steps on each 
    VIOS logical partition:

    To stop cluster services on the VIOS logical partition, type 
     the following command:

    clstartstop -stop -n clustername -m vios_hostname

    Make the required changes in the /etc/netsvc.conf file ( a sample entry 
    is shown at the end of this step ). Ensure that you do not change the 
    IP address that resolves to the host name that is being used for the 
    cluster.
   
    To restart cluster services on the VIOS logical partition, type 
    the following command:

    clstartstop -start -n clustername -m vios_hostname

    Maintain the same ordering of name resolution for all the VIOS 
    logical partitions that are part of the same cluster. You must 
    not make changes to the /etc/netsvc.conf file when you are 
    migrating a cluster from IPv4 to IPv6.
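
   For reference, the name-resolution ordering in /etc/netsvc.conf is set with
   a single "hosts" line. The entry below is only an illustration ( local =
   /etc/hosts first, bind = DNS second ); whatever ordering you use must be
   identical on every VIOS logical partition in the cluster:

     hosts = local, bind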

2) Multicast must be enabled between all nodes in the cluster

   
   NOTE: The following applies ONLY to VIO levels 2.2.2.2 and lower.

   =============================================================
   ==**  VIOS 2.2.2.2 and lower requiring Multicast         **==
   ==**  For VIOS 2.2.3.1 and above using Unicast see below **==
   =============================================================
  
   Get the MULTICAST ADDRESS from each node:

     $ lscluster -i | grep -i multi
       IPv4 MULTICAST ADDRESS: 228.3.131.88  broadcast 0.0.0.0  netmask 0.0.0.0

   mping test to ALL candidate nodes ( within oem_setup_env )

   Start below from all receiving nodes:  Example: node1.austin.ibm.com

     # mping -r -a [ip-of-receiving-node] -c 100
       mping version 1.1
       Listening on 227.1.1.1/4098:
       Replying to mping from 9.3.131.87 bytes=32 seqno=1 ttl=32
       Replying to mping from 9.3.131.87 bytes=32 seqno=2 ttl=32
       Replying to mping from 9.3.131.87 bytes=32 seqno=3 ttl=32
       Replying to mping from 9.3.131.87 bytes=32 seqno=4 ttl=32
       Replying to mping from 9.3.131.87 bytes=32 seqno=5 ttl=32

   Start below from the sending node:  Example: node2.austin.ibm.com

     # mping -s -a [ip-of-sender-node] -c 100
       mping version 1.1
       mpinging 227.1.1.1/4098 with ttl=32:
       32 bytes from 9.3.131.88: seqno=1 ttl=32 time=0.387 ms
       32 bytes from 9.3.131.88: seqno=2 ttl=32 time=0.318 ms
       32 bytes from 9.3.131.88: seqno=3 ttl=32 time=1.488 ms
       32 bytes from 9.3.131.88: seqno=4 ttl=32 time=0.345 ms
       32 bytes from 9.3.131.88: seqno=5 ttl=32 time=0.332 ms

   Then reverse the test in the other direction and repeat for all nodes.
   The number of packets transmitted and received should match. The mping
   test must be run in both directions: both NodeA and NodeB must be able to
   transmit and receive multicast packets. For more than 4 nodes, repeat the
   process through all nodes.

   TIPS:

   a) The primary reason for failure is that Multicast is not set up. This is
      something the customer needs to work out with their switch vendor and
      their local network engineers.

   b) If you are using Cisco Catalyst or Nexus, please follow the Cisco doc
      below to set up Multicast:
      http://www.cisco.com/en/US/products/hw/switches/ps708/products_tech_note09186a008059a9df.shtml
      Other ref: PowerHA_7.1_and_Multicast_v1.pdf

   c) Please contact and work directly with the switch vendor for other
      types and models.

   d) Create the /var/ct/cfg/netmon.cf file on each node and put in an IP
      address for all nodes ( a sample file is shown at the end of this step ).

   e) Make sure the following APAR is installed on ALL nodes if your VIO
      level is below 2.2.1.4:
      - IV11698 ( Defect 819570 )
      NOTE: VIOS level 2.2.1.4 already includes this ( IV13154 )

   ***********************************************************
   ** Basically, with an mping failure, the first step is   **
   ** to make sure Multicast is set up.                     **
   ***********************************************************

   How to check Cisco Nexus switch settings:

     # show ip igmp snooping vlan 127

   They should be configured with the below:
     IGMP snooping enabled
     IGMP querier enabled
     Switch-querier enabled

   Multicast Setup on Juniper Switches
   mping iptrace steps - Debugged by CAA / HACMP team
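
   As noted in tip (d) above, /var/ct/cfg/netmon.cf should list an IP address
   for each cluster node. A minimal sketch of the file, reusing the example
   addresses from the mping output above ( substitute your own node IPs, one
   per line ):

     9.3.131.87
     9.3.131.88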

       ======================================================================
       ======================================================================
       ==** For VIOS 2.2.3.1 and above using Unicast use test (f) below **==
       ======================================================================
       ======================================================================

      f) Use ssh and/or telnet to verify that you can reach every node from 
         every other node.
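
         A quick way to run this check is a loop from oem_setup_env on one
         node; this is only a sketch, the hostnames below are examples, and
         the loop should list every node and be repeated from each node:

           # for node in node1.austin.ibm.com node2.austin.ibm.com; do
           >    ssh -o ConnectTimeout=5 padmin@$node hostname || echo "FAILED: $node"
           > done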
      
3) Locating the repository disk

     $ lscluster -d | grep -p REPDISK
     # /usr/lib/cluster/clras lsrepos        ( within oem_setup_env )

   NOTE: $ lscluster -c  will give additional details such as IP addresses and disks.

4) Locating the cluster disks

     $ lscluster -d | grep -p CLUSDISK

5) Detect if a Primary Database Node (DBN) has been elected, from ALL candidate nodes.

     $ lssrc -ls vio_daemon

6) To restart vio_daemon on the node where it was stopped ( within oem_setup_env )

     # startsrc -s vio_daemon "-a -d 4"      ( change 4 to 8 for more details )

   ================ To repopulate the CMDB =================
     # stopsrc -s vio_daemon
     # rm -rf /var/vio/CM
     # startsrc -s vio_daemon
     # kill -hup [vio_daemon pid]

7) To stop vio_daemon running on the other node to prevent a DBN election.
   This is used when nodes are down and you are trying to focus on bringing
   one node up. This has to be done within oem_setup_env.

     # stopsrc -s vio_daemon

   Note: It may be necessary to kill the PID if vio_daemon does not respond.

   To get the PID of vio_daemon:
     # lssrc -s vio_daemon

   To kill the vio_daemon PID:
     # kill -9 [PID]

   ====================================================================
   NOTE: When vio_daemon on a DBN node is stopped, the daemon will send
   a message to all the other nodes to elect another DBN.
   ====================================================================

8) To get the cluster name:

     $ cluster -list

9) To locate the DBN node AND get cluster node status:

     $ cluster -status -clustername [cluster-name] -verbose | grep -p DBN

10) To locate the mfsMgr node ( run in oem_setup_env ):

     # pooladm dump poolroot | grep mfsMgr

    This will show 'mfsMgr=1' or 'mfsMgr=0'.  mfsMgr=1 == mfsMgr node.
    NOTE: The error log entry will also have this info.

11) To check the IP resolved by poolfs ( run in oem_setup_env ):

     # pooladm dump node | grep naddr

12) Check that the CAA cluster is intact and operational, from ALL candidate nodes.

     $ lscluster -m

13) Check that the repository disk can be accessed by ALL candidate nodes.
    This is run from within oem_setup_env.

     # /usr/lib/cluster/clras dumprepos [-r reposdisk]
     # /usr/lib/cluster/clras sfwinfo -a

14) Cluster state from ALL candidate nodes

    a) First get the cluster name.
         $ cluster -list
    b) Get the cluster status.
         $ cluster -status -clustername [output_from_cluster_list]

15) Access to ALL the pool disks by ALL candidate nodes

     $ lssp -clustername [output_from_cluster_list]
       Pool        Size(mb)   Free(mb)   LUs   Type     PoolID
       leftv1_pl   3432448    3152896    420   CLPOOL   1234...

16) Enable detailed debug for CAA.
    Data Collection with CAA debug enabled

17) Collect snaps from ALL VIOS SSP nodes.
    If there are too many nodes, collect one from the DBN node and one from a
    non-working node. This may not be enough in most cases, so if possible
    collect from all nodes. ( A scripted sketch is shown after step 18. )

     $ snap
       - creates /home/padmin/snap.pax.Z
       - rename as viossp.node1.snap.pax.Z
       - rename as viossp.node2.snap.pax.Z
       - and so on ...

    To get the DBN node:
     $ cluster -status -clustername [cluster-name] -verbose | grep -p DBN

18) Collect ctsnap from all SSP nodes - part of new versions of VIO and can be skipped.

     $ oem_setup_env
     # ctsnap -x runrpttr
       - This will create => /tmp/ctsupt/ctsnap*.tar.gz
       - Move /tmp/ctsupt/ctsnap*.tar.gz to /tmp/PMRno.ZZZ.000.vio1.source.ctsnap.pax.Z
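
   As referenced in step 17, when many nodes are involved the snap collection
   and renaming can be driven from a central staging host. This is only a
   sketch: it assumes ssh/scp access to each node as padmin, and the hostnames
   are examples only.

     # for node in node1.austin.ibm.com node2.austin.ibm.com; do
     >    ssh padmin@$node snap                     ( creates /home/padmin/snap.pax.Z on the node )
     >    scp padmin@$node:/home/padmin/snap.pax.Z ./viossp.$node.snap.pax.Z
     > done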
19) Creating a single pax file for FTP.
    A single file is preferred; if the file is too large, a couple of files
    may be FTP'ed.

    ************************************************************
    ** Move all above pax.Z, zip and log files into a single  **
    ** directory.                                             **
    ************************************************************

     - mkdir -p /tmp/pmr#/pmdata     ( sample name only )
     - move or ftp or scp the data to the pmdata directory
     - cd /tmp/pmr#
     - pax -xpax -vw pmdata | gzip -c > pmr#.pax.gz
     - ftp the single file to testcase.software.ibm.com
         ftp testcase.software.ibm.com
       See "FTP procedure" below for detailed FTP steps.

20) FTP procedure to Boulder:

    Rename the file(s) being FTPed to include the PMR number. For example, if
    your PMR is 12345.999.000 ( where 12345 is the pmr#, 999 is the branch#,
    and 000 is the country code ), you would do something similar to the
    following:

      mv data_collected.pax.gz 12345.999.000_data_collected.pax.gz

    FTP the file to IBM:

      ftp testcase.software.ibm.com, login: anonymous, passwd: your email address
      ftp> cd /toibm/aix
      ftp> bin
      ftp> put 12345.999.000_data_collected.pax.gz
      ftp> quit

============= END OF BASIC DEBUGGING & DATA COLLECTION ==========

===========================================================
Steps required when applying ifix / APARs or upgrading nodes
===========================================================

1) Stop the node on the cluster

     $ clstartstop -stop -n [cluster-name] -m [node]
     $ clstartstop -stop -n clvio12 -m vionode1.austin.ibm.com

2) Create a directory on each node

     $ mkdir /home/padmin/ifix

3) FTP the attached ifix in BINARY to all nodes in the above location /home/padmin/ifix

4) Commit

     $ updateios -commit

5) Install

     $ updateios -dev /home/padmin/ifix -install -accept

6) After applying the patch or upgrading VIO, rejoin the node.
   ** A reboot may be required after the upgrade or ifix install. **

     $ clstartstop -start -n [cluster-name] -m [node]
     $ clstartstop -start -n clvio12 -m vionode1.austin.ibm.com

Steps for removing and adding a Backing Device
==============================================
Adding back a Backing Device

== Other Debugging Tips ====================
Possible steps for recovering when node1 is "DOWN"
Possible steps for recovering when node2 is "DOWN"
Correct way for removing SSP backing devices
Re-creating SSP backing devices
Possible steps to recover caavg_private when the repository disk is no longer reachable.
==================================================================

=======================================================
REFERENCES: Below - IBM Internal Access - Requires GSA
=======================================================
Doc reference guide from Development:
All require Blue Page access and GSA access ( contact => Carlos Gomez (cgomez@us.ibm.com) )
1) CAA Wiki
2) SSP
3) SF-Store

Raj's own reference .... Not for external users.
Raj's reference page

=================================================================
STEPS to rebuild the repository disk.
Only use this if caavg_private is still good. This can be checked if the other
nodes still have access and are working, and only the non-working node needs to
be fixed ( see the check after this section ).

     # clusterconf -r [repo_disk] -v

If this does not help and caavg_private is still accessible on the other nodes,
reboot the problematic node.
=================================================================
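Before running clusterconf on the non-working node, it can help to confirm from
a WORKING node that the repository is still intact, using the same checks as
steps 12 and 13 above ( hdisk22 is only an example repository disk name ):

     $ lscluster -m                                   ( cluster should still be intact on working nodes )
     # /usr/lib/cluster/clras dumprepos -r hdisk22    ( within oem_setup_env; repository should dump cleanly )
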
==================================================================
STEPS TO ** DELETE NODE ** IF ONE CAA VG IS STILL OPEN and unable to
varyoff, and no VTD / LUs are mapped.
===================================================================

1) Confirm the CAA volume group state:

     # lsvg -o | xargs lsvg -l
       caavg_private:
       LV NAME          TYPE  LPs  PPs  PVs  LV STATE      MOUNT POINT
       caalv_private1   boot  1    1    1    closed/syncd  N/A
       caalv_private2   boot  1    1    1    closed/syncd  N/A
       caalv_private3         4    4    1    open/syncd    N/A
       powerha_crlv     boot  1    1    1    closed/syncd  N/A

2) Confirm the repo disk: ( from the snaps it was hdisk22 )

     $ lsvg -pv caavg_private

3) Remove the cluster:

     $ oem_setup_env
     # export CAA_FORCE_ENABLED=1
     # rmcluster -f -r hdisk22
     # rmdev -dl cluster0
     # chpv -C hdisk22
     # reboot
     # lqueryvg -Atp /dev/hdisk22     ( must show nothing in that VGDA )

   OR, from padmin:

3) $ rmcluster -fr caa_private0
   $ shutdown -restart
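
   After the reboot / restart, a quick way to confirm the node is clean
   ( hdisk22 is the example repository disk from above ):

     $ oem_setup_env
     # lsvg | grep caavg_private       ( should return nothing )
     # lspv | grep hdisk22             ( the disk should show no volume group )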