......................................................................
First Created June 29, 2012 - Raj Patel
Rev1: Mar 26, 2012 - By Raj Patel
Rev12: Apr 14, 2015 - By Raj Patel
Comments and feedback to => rajpat@us.ibm.com
==============================================================
Perform the minimum tests below before escalating to AMVIRT,165.
Unless specified otherwise, all commands are to be run from the padmin shell.
==============================================================
*******************************************************************
* Only commands used in the padmin shell are supported by VIOS.   *
* Commands run within oem_setup_env are for component owners      *
* such as CAA.                                                     *
*******************************************************************
1) Modify the /etc/netsvc.conf file of the VIOS logical partition before
creating the cluster.
This file is used to specify the ordering of name resolution for
networking routines and commands. If you need to edit the
/etc/netsvc.conf file later, perform the following steps on each
VIOS logical partition:
To stop cluster services on the VIOS logical partition, type
the following command:
$ clstartstop -stop -n clustername -m vios_hostname
Make the required changes in /etc/netsvc.conf file. Ensure that
you do not change the IP address that resolves to the host name
that is being used for the cluster.
To restart cluster services on the VIOS logical partition, type
the following command:
$ clstartstop -start -n clustername -m vios_hostname
Maintain the same ordering of name resolution for all the VIOS
logical partitions that are part of the same cluster. You must
not make changes to the /etc/netsvc.conf file when you are
migrating a cluster from IPv4 to IPv6.
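For example, a common /etc/netsvc.conf ordering that resolves host names
from /etc/hosts first and then DNS ( shown only as an illustration; use
whatever ordering the site requires, identically on every node ):
    hosts = local, bind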
2) Multicast must be enabled between all nodes in the cluster
NOTE: The section below applies ONLY to VIOS levels 2.2.2.2 and lower.
=============================================================
==** VIOS 2.2.2.2 and lower requiring Multicast            **==
==** For VIOS 2.2.3.1 and above using Unicast, see below   **==
=============================================================
# get MULTICAST ADDRESS from each node:
$ lscluster -i | grep -i multi
IPv4 MULTICAST ADDRESS: 228.3.131.88 broadcast 0.0.0.0 netmask 0.0.0.0
mping test to ALL candidate nodes ( within oem_setup_env ).
Use the multicast address obtained from lscluster -i above ( the sample
output below shows the mping default group of 227.1.1.1 ).
Start the command below on all receiving nodes:
Example: node1.austin.ibm.com
# mping -r -a [multicast-address] -c 100
mping version 1.1
Listening on 227.1.1.1/4098:
Replying to mping from 9.3.131.87 bytes=32 seqno=1 ttl=32
Replying to mping from 9.3.131.87 bytes=32 seqno=2 ttl=32
Replying to mping from 9.3.131.87 bytes=32 seqno=3 ttl=32
Replying to mping from 9.3.131.87 bytes=32 seqno=4 ttl=32
Replying to mping from 9.3.131.87 bytes=32 seqno=5 ttl=32
Start below from sending node:
Example: node2.austin.ibm.com
# mping -s -a [multicast-address] -c 100
mping version 1.1
mpinging 227.1.1.1/4098 with ttl=32:
32 bytes from 9.3.131.88: seqno=1 ttl=32 time=0.387 ms
32 bytes from 9.3.131.88: seqno=2 ttl=32 time=0.318 ms
32 bytes from 9.3.131.88: seqno=3 ttl=32 time=1.488 ms
32 bytes from 9.3.131.88: seqno=4 ttl=32 time=0.345 ms
32 bytes from 9.3.131.88: seqno=5 ttl=32 time=0.332 ms
Repeat the test in the reverse direction, and repeat for all nodes.
The number of packets transmitted and received should match.
Both the sending and receiving nodes must be able to transmit
and receive multicast packets. For clusters with more than two
nodes, repeat the process across all nodes.
TIPS:
a) The primary cause of mping failures is that multicast is not set
up on the network; the customer needs to work with their switch
vendor and their local network engineers to enable it.
b) If you are using Cisco Catalyst or Nexus switches, follow the
Cisco documentation below to set up multicast:
http://www.cisco.com/en/US/products/hw/switches/ps708/products_tech_note09186a008059a9df.shtml
Other ref:
PowerHA_7.1_and_Multicast_v1.pdf
c) For other switch types and models, contact and work directly
with the switch vendor.
d) Create the /var/ct/cfg/netmon.cf file on each node and add an
IP address for every node in the cluster, as in the sketch below.
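Minimal sketch of /var/ct/cfg/netmon.cf for a three-node cluster,
one address per line ( the first two addresses are taken from the
mping example above, the third is hypothetical ):
    9.3.131.87
    9.3.131.88
    9.3.131.89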
e) Make sure the following APAR is installed on ALL nodes if your
VIOS level is below 2.2.1.4:
- IV11698 ( Defect 819570 )
NOTE: VIOS level 2.2.1.4 already includes this fix ( IV13154 )
***********************************************************
** Basically, with an mping failure, the first step is   **
** to make sure multicast is set up                      **
***********************************************************
How to check Cisco Nexus switch settings:
"# show ip igmp snooping vlan 127"
The switch should be configured with the following:
IGMP snooping enabled
IGMP querier enabled
Switch-querier enabled
Multicast Setup on Juniper Switches
mping iptrace steps - Debugged by CAA / HACMP team
======================================================================
======================================================================
==** For VIOS 2.2.3.1 and above using Unicast, use test (f) below **==
======================================================================
======================================================================
f) Use ssh and/or telnet to verify that you can connect between all
nodes, for example as in the sketch below.
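A minimal connectivity-check sketch, run within oem_setup_env; the node
names are hypothetical placeholders, and it assumes ssh is configured
between the nodes:
# for n in vionode1.austin.ibm.com vionode2.austin.ibm.com; do ssh -o ConnectTimeout=5 padmin@$n ioslevel || echo "ssh to $n FAILED"; done
Repeat from each node ( or reverse the direction ) so that every node
is tested against every other node.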
3) Locating the repository disk
$ lscluster -d | grep -p REPDISK
# /usr/lib/cluster/clras lsrepos ( within oem_setup_env )
NOTE: $ lscluster -c ( gives additional details such as IP addresses and disks )
4) Locating cluster disk
$ lscluster -d | grep -p CLUSDISK
5) Detect whether a Primary Database Node (DBN) has been elected;
run on ALL candidate nodes.
$ lssrc -ls vio_daemon
6) To restart vio_daemon on the node where it was stopped ( within oem_setup_env ):
# startsrc -s vio_daemon "-a -d 4" ( change 4 to 8 for more details )
======================
To repopulate the CMDB
======================
# stopsrc -s vio_daemon
# rm -rf /var/vio/CM
# startsrc -s vio_daemon
# kill -hup [vio_daemon pid]
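A quick follow-up check ( a sketch; it assumes the CMDB directory under
/var/vio/CM is recreated as the daemon repopulates it ):
# lssrc -s vio_daemon        ( Status column should show active )
# ls /var/vio/CM             ( directory should be recreated and filling in )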
7) To stop vio_daemon on the other nodes to prevent a DBN election.
This is used when nodes are down and you are trying to focus on
bringing a single node up.
This has to be done within oem_setup_env
# stopsrc -s vio_daemon
Note: It may be necessary to kill the PID if vio_daemon does not
respond ( a one-line sketch follows the NOTE box below ).
To get the PID of vio_daemon:
# lssrc -s vio_daemon
To kill the vio_daemon PID:
# kill -9 [PID]
====================================================================
NOTE: when vio_daemon on the DBN node is stopped, the daemon will
send a message to all the other nodes to elect another DBN
====================================================================
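A one-line sketch to capture the PID and kill it, assuming the default
Subsystem / Group / PID / Status column layout of lssrc output:
# kill -9 $(lssrc -s vio_daemon | awk '/vio_daemon/ {print $3}')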
8) To get Cluster Name:
$ cluster -list
9) To locate DBN node AND get cluster node status:
$ cluster -status -clustername [cluster-name] -verbose | grep -p DBN
10) To locate the mfsMgr node ( run in oem_setup_env ):
# pooladm dump poolroot | grep mfsMgr
This shows 'mfsMgr=1' or 'mfsMgr=0'; the node with mfsMgr=1 is the
mfsMgr node.
NOTE: error log entries will also contain this information.
11) To check the IP resolved by poolfs ( run in oem_setup_env )
# pooladm dump node | grep naddr
12) Check that the CAA cluster is intact and operational from ALL candidate nodes.
$ lscluster -m
13) Check that the repository disk can be accessed by ALL candidate nodes.
This is run from within oem_setup_env
# /usr/lib/cluster/clras dumprepos [-r reposdisk]
# /usr/lib/cluster/clras sfwinfo -a
14) Cluster State from ALL candidate nodes
a) First get the cluster name.
$ cluster -list
b) Get Cluster Status.
$ cluster -status -clustername [output_from_cluster_list]
15) Verify access to ALL the pool disks from ALL candidate nodes
$ lssp -clustername [output_from_cluster_list]
Pool        Size(mb)    Free(mb)    LUs    Type      PoolID
leftv1_pl   3432448     3152896     420    CLPOOL    1234...
16) Enable detailed debug for CAA.
Data Collection with CAA debug enabled
17) Collect snaps from ALL VIOS SSP nodes.
If there are too many nodes, collect at least one from the DBN node
and one from a non-working node. This may not be enough in many
cases, so if possible collect from all nodes.
$ snap
- create /home/padmin/snap.pax.Z
- rename as viossp.node1.snap.pax.Z
- rename as viossp.node2.snap.pax.Z
- and so on ... ( see the rename sketch below )
To get DBN node:
$ cluster -status -clustername [cluster-name] -verbose | grep -p DBN
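A sketch of the rename step from the list above, run within oem_setup_env;
the node tag in the target name is a placeholder and should match the node
the snap was taken on:
$ oem_setup_env
# mv /home/padmin/snap.pax.Z /home/padmin/viossp.node1.snap.pax.Z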
18) Collect ctsnap from all SSP nodes - included in newer VIOS
versions and can be skipped there.
$ oem_setup_env
# ctsnap -x runrpttr
- This will create => /tmp/ctsupt/ctsnap*.tar.gz
- move /tmp/ctsupt/ctsnap*.tar.gz to
/tmp/PMRno.ZZZ.000.vio1.source.ctsnap.pax.Z
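A sketch of the move/rename, assuming a single ctsnap file and using the
sample PMR number 12345.999.000 from the FTP procedure below as a
placeholder:
# mv /tmp/ctsupt/ctsnap*.tar.gz /tmp/12345.999.000.vio1.source.ctsnap.pax.Z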
19) Creating a single pax file for FTP.
A single file is preferred; if the file is too large, a couple
of files may be FTP'ed.
************************************************************
** Move all above pax.Z zip and log files into a single **
** directory. **
************************************************************
- mkdir -p /tmp/pmr#/pmdata (sample name only)
- move or ftp or scp data to pmdata directory
- cd /tmp/pmr#
- pax -xpax -vw pmdata | gzip -c > pmr#.pax.gz
- ftp the single file to testcase.software.ibm.com
ftp testcase.software.ibm.com
See "FTP procedure" below for detailed FTP steps.
20) FTP procedure to Boulder:
Rename the file(s) being FTPed to include the PMR number.
For example, if your pmr is 12345.999.000
( where 12345 is the pmr#, 999 is the branch#, and 000 is the
country code), you would do something similar to the following.
mv data_collected.pax.gz 12345.999.000_data_collected.pax.gz
FTP the file to ibm:
ftp testcase.software.ibm.com
login: anonymous
passwd: your email address
ftp> cd /toibm/aix
ftp> bin
ftp> put 12345.999.000_data_collected.pax.gz
ftp> quit
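A non-interactive alternative ( a sketch; the file name is the same
placeholder as above and the email address is hypothetical ):
# ftp -n testcase.software.ibm.com <<EOF
user anonymous your_email@example.com
binary
cd /toibm/aix
put 12345.999.000_data_collected.pax.gz
quit
EOF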
============= END OF BASIC DEBUGGING & DATA COLLECTION ==========
===========================================================
Steps required when applying ifixes / APARs or upgrading nodes
===========================================================
1) Stop the node in the cluster
$ clstartstop -stop -n [cluster-name] -m [node]
$ clstartstop -stop -n clvio12 -m vionode1.austin.ibm.com
2) Create directory on each node
$ mkdir /home/padmin/ifix
3) FTP ( in BINARY mode ) the attached ifix to /home/padmin/ifix on all
nodes ( see the copy sketch after step 6 ).
4) Commit
$ updateios -commit
5) Install
$ updateios -dev /home/padmin/ifix -install -accept
6) After applying the patch or upgrading VIOS, rejoin the node.
** A reboot may be required after an upgrade or ifix install. **
$ clstartstop -start -n [cluster-name] -m [node]
$ clstartstop -start -n clvio12 -m vionode1.austin.ibm.com
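Copy sketch for step 3, run within oem_setup_env on the node that holds
the ifix; the node names and the .epkg.Z file name pattern are hypothetical,
and it assumes ssh/scp is configured between the nodes:
# for n in vionode2.austin.ibm.com vionode3.austin.ibm.com; do scp /home/padmin/ifix/*.epkg.Z padmin@$n:/home/padmin/ifix/; done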
Steps for removing and adding Backing Device
==============================================
Adding back Backing Device
== Other Debugging Tips ====================
Possible steps for recovering when node1 is "DOWN"
Possible steps for recovering when node2 is "DOWN"
Correct way for removing SSP backing devices
Re-creating SSP backing devices
Possible steps to recover caavg_private when repository disk no longer reachable.
==================================================================
=======================================================
REFERENCES: Below - IBM Internal Access - Requires GSA
=======================================================
Doc reference guide from Development: All require Blue Page Access and GSA access ( contact => Carlos Gomez (cgomez@us.ibm.com) )
1) CAA Wiki
2) SSP
3) SF-Store
Raj's own reference .... Not for external users.
Raj's reference page
=================================================================
STEPS to rebuild the repository disk. Only use this if caavg_private
is still good. This can be checked if the other nodes still have
access and are working, and only the non-working node needs to be fixed.
# clusterconf -r [repo_disk] -v
If this does not help and caavg_private is still accessible on the
other nodes, reboot the problematic node.
==================================================================
STEPS TO ** DELETE NODE ** IF THE CAA VG IS STILL OPEN and unable to
be varied off, with no VTDs / LUs mapped.
===================================================================
1) lsvg -o | xargs lsvg -l
caavg_private:
LV NAME          TYPE   LPs  PPs  PVs  LV STATE      MOUNT POINT
caalv_private1   boot   1    1    1    closed/syncd  N/A
caalv_private2   boot   1    1    1    closed/syncd  N/A
caalv_private3          4    4    1    open/syncd    N/A
powerha_crlv     boot   1    1    1    closed/syncd  N/A
2) Confirm the repo disk: ( from snaps it was hdisk22 )
$ lsvg -pv caavg_private
3) Remove cluster
$ oem_setup_env
# export CAA_FORCE_ENABLED=1
# rmcluster -f -r hdisk22
# rmdev -dl cluster0
# chpv -C hdisk22
# reboot
# lqueryvg -Atp /dev/hdisk22    ( must show nothing in that VGDA )
OR From padmin:
3) $ rmcluster -fr caa_private0
$ shutdown -restart