......................................................................
First Created June 29, 2012 - Raj Patel
Rev1: Mar 26, 2012 - By Raj Patel
Rev12: Apr 14, 2015 - By Raj Patel
Comments and feedback to => rajpat@us.ibm.com
==============================================================
Perform the minimum tests below before escalating to AMVIRT,165.
Unless specified otherwise, all commands are to be run from the padmin shell.
==============================================================
*******************************************************************
* Only commands used in the padmin shell are supported by VIOS.   *
* Commands run within oem_setup_env are for component owners      *
* such as CAA.                                                     *
*******************************************************************
1) Modify the /etc/netsvc.conf file of the VIOS logical partition before
creating the cluster.
This file is used to specify the ordering of name resolution for
networking routines and commands. If you need to edit the
/etc/netsvc.conf file later, perform the following steps on each
VIOS logical partition:
To stop cluster services on the VIOS logical partition, type
the following command:
$ clstartstop -stop -n clustername -m vios_hostname
Make the required changes in /etc/netsvc.conf file. Ensure that
you do not change the IP address that resolves to the host name
that is being used for the cluster.
To restart cluster services on the VIOS logical partition, type
the following command:
$ clstartstop -start -n clustername -m vios_hostname
Maintain the same ordering of name resolution for all the VIOS
logical partitions that are part of the same cluster. You must
not make changes to the /etc/netsvc.conf file when you are
migrating a cluster from IPv4 to IPv6.
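For example, a common /etc/netsvc.conf ordering that resolves host names
from /etc/hosts first and then DNS ( shown only as an illustration; use
whatever ordering the site requires, identically on every node ):
    hosts = local, bind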
2) Multicast must be enabled between all nodes in the cluster
NOTE: The section below applies ONLY to VIOS levels 2.2.2.2 and lower.
=============================================================
==** VIOS 2.2.2.2 and lower requiring Multicast            **==
==** For VIOS 2.2.3.1 and above using Unicast, see below   **==
=============================================================
# get MULTICAST ADDRESS from each node:
$ lscluster -i | grep -i multi
IPv4 MULTICAST ADDRESS: 228.3.131.88 broadcast 0.0.0.0 netmask 0.0.0.0
mping test to ALL candidate nodes ( within oem_setup_env ).
Use the multicast address obtained from lscluster -i above ( the sample
output below shows the mping default group of 227.1.1.1 ).
Start the command below on all receiving nodes:
Example: node1.austin.ibm.com
# mping -r -a [multicast-address] -c 100
mping version 1.1
Listening on 227.1.1.1/4098:
Replying to mping from 9.3.131.87 bytes=32 seqno=1 ttl=32
Replying to mping from 9.3.131.87 bytes=32 seqno=2 ttl=32
Replying to mping from 9.3.131.87 bytes=32 seqno=3 ttl=32
Replying to mping from 9.3.131.87 bytes=32 seqno=4 ttl=32
Replying to mping from 9.3.131.87 bytes=32 seqno=5 ttl=32
Start below from sending node:
Example: node2.austin.ibm.com
# mping -s -a [multicast-address] -c 100
mping version 1.1
mpinging 227.1.1.1/4098 with ttl=32:
32 bytes from 9.3.131.88: seqno=1 ttl=32 time=0.387 ms
32 bytes from 9.3.131.88: seqno=2 ttl=32 time=0.318 ms
32 bytes from 9.3.131.88: seqno=3 ttl=32 time=1.488 ms
32 bytes from 9.3.131.88: seqno=4 ttl=32 time=0.345 ms
32 bytes from 9.3.131.88: seqno=5 ttl=32 time=0.332 ms
Repeat the test in the reverse direction, and repeat for all nodes.
The number of packets transmitted and received should match.
Both the sending and receiving nodes must be able to transmit
and receive multicast packets. For clusters with more than two
nodes, repeat the process across all nodes.
TIPS:
a) The primary cause of mping failures is that multicast is not set
up on the network; the customer needs to work with their switch
vendor and their local network engineers to enable it.
b) If you are using Cisco Catalyst or Nexus switches, follow the
Cisco documentation below to set up multicast:
http://www.cisco.com/en/US/products/hw/switches/ps708/products_tech_note09186a008059a9df.shtml
Other ref:
PowerHA_7.1_and_Multicast_v1.pdf
c) For other switch types and models, contact and work directly
with the switch vendor.
d) Create the /var/ct/cfg/netmon.cf file on each node and add an
IP address for every node in the cluster, as in the sketch below.
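Minimal sketch of /var/ct/cfg/netmon.cf for a three-node cluster,
one address per line ( the first two addresses are taken from the
mping example above, the third is hypothetical ):
    9.3.131.87
    9.3.131.88
    9.3.131.89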
e) Make sure the following APAR is installed on ALL nodes if your
VIOS level is below 2.2.1.4:
- IV11698 ( Defect 819570 )
NOTE: VIOS level 2.2.1.4 already includes this fix ( IV13154 )
***********************************************************
** Basically, with an mping failure, the first step is   **
** to make sure multicast is set up                      **
***********************************************************
How to check Cisco Nexus switch settings:
"# show ip igmp snooping vlan 127"
The switch should be configured with the following:
IGMP snooping enabled
IGMP querier enabled
Switch-querier enabled
Multicast Setup on Juniper Switches
mping iptrace steps - Debugged by CAA / HACMP team
======================================================================
======================================================================
==** For VIOS 2.2.3.1 and above using Unicast, use test (f) below **==
======================================================================
======================================================================
f) Use ssh and/or telnet to verify that you can connect between all
nodes, for example as in the sketch below.
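A minimal connectivity-check sketch, run within oem_setup_env; the node
names are hypothetical placeholders, and it assumes ssh is configured
between the nodes:
# for n in vionode1.austin.ibm.com vionode2.austin.ibm.com; do ssh -o ConnectTimeout=5 padmin@$n ioslevel || echo "ssh to $n FAILED"; done
Repeat from each node ( or reverse the direction ) so that every node
is tested against every other node.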
3) Locating the repository disk
$ lscluster -d | grep -p REPDISK
# /usr/lib/cluster/clras lsrepos ( within oem_setup_env )
NOTE: $ lscluster -c ( gives additional details such as IP addresses and disks )
4) Locating cluster disk
$ lscluster -d | grep -p CLUSDISK
5) Detect whether a Primary Database Node (DBN) has been elected;
run on ALL candidate nodes.
$ lssrc -ls vio_daemon
6) To restart vio_daemon on the node where it was stopped ( within oem_setup_env ):
# startsrc -s vio_daemon "-a -d 4" ( change 4 to 8 for more details )
======================
To repopulate the CMDB
======================
# stopsrc -s vio_daemon
# rm -rf /var/vio/CM
# startsrc -s vio_daemon
# kill -hup [vio_daemon pid]
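A quick follow-up check ( a sketch; it assumes the CMDB directory under
/var/vio/CM is recreated as the daemon repopulates it ):
# lssrc -s vio_daemon        ( Status column should show active )
# ls /var/vio/CM             ( directory should be recreated and filling in )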
7) To stop vio_daemon on the other nodes to prevent a DBN election.
This is used when nodes are down and you are trying to focus on
bringing a single node up.
This has to be done within oem_setup_env
# stopsrc -s vio_daemon
Note: It may be necessary to kill the PID if vio_daemon does not
respond ( a one-line sketch follows the NOTE box below ).
To get the PID of vio_daemon:
# lssrc -s vio_daemon
To kill the vio_daemon PID:
# kill -9 [PID]
====================================================================
NOTE: when vio_daemon on the DBN node is stopped, the daemon will
send a message to all the other nodes to elect another DBN
====================================================================
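A one-line sketch to capture the PID and kill it, assuming the default
Subsystem / Group / PID / Status column layout of lssrc output:
# kill -9 $(lssrc -s vio_daemon | awk '/vio_daemon/ {print $3}')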
8) To get Cluster Name:
$ cluster -list
9) To locate DBN node AND get cluster node status:
$ cluster -status -clustername [cluster-name] -verbose | grep -p DBN
10) To locate the mfsMgr node ( run in oem_setup_env ):
# pooladm dump poolroot | grep mfsMgr
This shows 'mfsMgr=1' or 'mfsMgr=0'; the node with mfsMgr=1 is the
mfsMgr node.
NOTE: error log entries will also contain this information.
11) To check the IP resolved by poolfs ( run in oem_setup_env )
# pooladm dump node | grep naddr
12) Check that the CAA cluster is intact and operational from ALL candidate nodes.
$ lscluster -m
13) Check that the repository disk can be accessed by ALL candidate nodes.
This is run from within oem_setup_env
# /usr/lib/cluster/clras dumprepos [-r reposdisk]
# /usr/lib/cluster/clras sfwinfo -a
14) Cluster State from ALL candidate nodes
a) First get the cluster name.
$ cluster -list
b) Get Cluster Status.
$ cluster -status -clustername [output_from_cluster_list]
15) Verify access to ALL the pool disks from ALL candidate nodes
$ lssp -clustername [output_from_cluster_list]
Pool        Size(mb)    Free(mb)    LUs    Type      PoolID
leftv1_pl   3432448     3152896     420    CLPOOL    1234...
16) Enable detailed debug for CAA.
Data Collection with CAA debug enabled
17) Collect snaps from ALL VIOS SSP nodes.
If there are too many nodes, collect at least one from the DBN node
and one from a non-working node. This may not be enough in many
cases, so if possible collect from all nodes.
$ snap
- create /home/padmin/snap.pax.Z
- rename as viossp.node1.snap.pax.Z
- rename as viossp.node2.snap.pax.Z
- and so on ... ( see the rename sketch below )
To get DBN node:
$ cluster -status -clustername [cluster-name] -verbose | grep -p DBN
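A sketch of the rename step from the list above, run within oem_setup_env;
the node tag in the target name is a placeholder and should match the node
the snap was taken on:
$ oem_setup_env
# mv /home/padmin/snap.pax.Z /home/padmin/viossp.node1.snap.pax.Z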
18) Collect ctsnap from all SSP nodes - included in newer VIOS
versions and can be skipped there.
$ oem_setup_env
# ctsnap -x runrpttr
- This will create => /tmp/ctsupt/ctsnap*.tar.gz
- move /tmp/ctsupt/ctsnap*.tar.gz to
/tmp/PMRno.ZZZ.000.vio1.source.ctsnap.pax.Z
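A sketch of the move/rename, assuming a single ctsnap file and using the
sample PMR number 12345.999.000 from the FTP procedure below as a
placeholder:
# mv /tmp/ctsupt/ctsnap*.tar.gz /tmp/12345.999.000.vio1.source.ctsnap.pax.Z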
19) Creating a single pax file for FTP.
A single file is preferred; if the file is too large, a couple
of files may be FTP'ed.
************************************************************
** Move all above pax.Z zip and log files into a single **
** directory. **
************************************************************
- mkdir -p /tmp/pmr#/pmdata (sample name only)
- move or ftp or scp data to pmdata directory
- cd /tmp/pmr#
- pax -xpax -vw pmdata | gzip -c > pmr#.pax.gz
- ftp the single file to testcase.software.ibm.com
ftp testcase.software.ibm.com
See "FTP procedure" below for detailed FTP steps.
20) FTP procedure to Boulder:
Rename the file(s) being FTPed to include the PMR number.
For example, if your pmr is 12345.999.000
( where 12345 is the pmr#, 999 is the branch#, and 000 is the
country code), you would do something similar to the following.
mv data_collected.pax.gz 12345.999.000_data_collected.pax.gz
FTP the file to ibm:
ftp testcase.software.ibm.com
login: anonymous
passwd: your email address
ftp> cd /toibm/aix
ftp> bin
ftp> put 12345.999.000_data_collected.pax.gz
ftp> quit
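A non-interactive alternative ( a sketch; the file name is the same
placeholder as above and the email address is hypothetical ):
# ftp -n testcase.software.ibm.com <<EOF
user anonymous your_email@example.com
binary
cd /toibm/aix
put 12345.999.000_data_collected.pax.gz
quit
EOF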
============= END OF BASIC DEBUGGING & DATA COLLECTION ==========
===========================================================
Steps required when applying ifixes / APARs or upgrading nodes
===========================================================
1) Stop the node in the cluster
$ clstartstop -stop -n [cluster-name] -m [node]
$ clstartstop -stop -n clvio12 -m vionode1.austin.ibm.com
2) Create directory on each node
$ mkdir /home/padmin/ifix
3) FTP ( in BINARY mode ) the attached ifix to /home/padmin/ifix on all
nodes ( see the copy sketch after step 6 ).
4) Commit
$ updateios -commit
5) Install
$ updateios -dev /home/padmin/ifix -install -accept
6) After applying the patch or upgrading VIOS, rejoin the node.
** A reboot may be required after an upgrade or ifix install. **
$ clstartstop -start -n [cluster-name] -m [node]
$ clstartstop -start -n clvio12 -m vionode1.austin.ibm.com
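Copy sketch for step 3, run within oem_setup_env on the node that holds
the ifix; the node names and the .epkg.Z file name pattern are hypothetical,
and it assumes ssh/scp is configured between the nodes:
# for n in vionode2.austin.ibm.com vionode3.austin.ibm.com; do scp /home/padmin/ifix/*.epkg.Z padmin@$n:/home/padmin/ifix/; done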
Steps for removing and adding Backing Device
==============================================
Adding back Backing Device
== Other Debugging Tips ====================
Possible steps for recovering when node1 is "DOWN"
Possible steps for recovering when node2 is "DOWN"
Correct way for removing SSP backing devices
Re-creating SSP backing devices
Possible steps to recover caavg_private when repository disk no longer reachable.
==================================================================
=======================================================
REFERENCES: Below - IBM Internal Access - Requires GSA
=======================================================
Doc reference guide from Development: All require Blue Page Access and GSA access ( contact => Carlos Gomez (cgomez@us.ibm.com) )
1) CAA Wiki
2) SSP
3) SF-Store
Raj's own reference .... Not for external users.
Raj's reference page
=================================================================
STEPS to rebuild the repository disk. Only use this if caavg_private
is still good. This can be checked if the other nodes still have
access and are working, and only the non-working node needs to be fixed.
# clusterconf -r [repo_disk] -v
If this does not help and caavg_private is still accessible on the
other nodes, reboot the problematic node.
==================================================================
STEPS TO ** DELETE NODE ** IF THE CAA VG IS STILL OPEN and unable to
be varied off, with no VTDs / LUs mapped.
===================================================================
1) lsvg -o | xargs lsvg -l
caavg_private:
LV NAME          TYPE   LPs  PPs  PVs  LV STATE      MOUNT POINT
caalv_private1   boot   1    1    1    closed/syncd  N/A
caalv_private2   boot   1    1    1    closed/syncd  N/A
caalv_private3          4    4    1    open/syncd    N/A
powerha_crlv     boot   1    1    1    closed/syncd  N/A
2) Confirm the repo disk: ( from snaps it was hdisk22 )
$ lsvg -pv caavg_private
3) Remove cluster
$ oem_setup_env
# export CAA_FORCE_ENABLED=1
# rmcluster -f -r hdisk22
# rmdev -dl cluster0
# chpv -C hdisk22
# reboot
# lqueryvg -Atp /dev/hdisk22    ( must show nothing in that VGDA )
OR From padmin:
3) $ rmcluster -fr caa_private0
$ shutdown -restart