Question
========
How do I correct the “No RMC Connection” error I get when I try to perform a DLPAR operation, an LPM operation, or many other similar operations involving virtual machines in a Power Systems environment?
Answer
======
There are some basic commands that can be run to check the status of the RMC configuration. Which commands you use depends on the RSCT version.
RSCT 3.2.x.x levels are the newest and are available in the latest releases of AIX and VIOS. Most installations will have RSCT at 3.1.x.x levels.
The basic queries you can run to check RMC health are listed below.
1. To check RMC status on a LPAR as root (AIX or VIOS)
a. Applies to all AIX and VIOS levels
lslpp -l rsct.core.rmc —> This fileset needs to be 3.1.0.x level or higher
/usr/sbin/rsct/bin/ctsvhbac —> Are all IP and host IDs trusted?
lsrsrc IBM.MCP —> Is the HMC listed as a resource?
b. Only applies if AIX 6.1 TL5 or lower is used
lslpp -l csm.client —> This fileset needs to be installed
lsrsrc IBM.ManagementServer —> Is HMC listed as a resource?
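The checks in step 1a can be run in one pass with a small wrapper. This is only a sketch: run_check and rmc_lpar_checks are hypothetical helper names, not part of RSCT, and the wrapper only labels and runs the commands listed above. You still have to read the output and answer the questions yourself. The step 1b checks are left out because they apply only to AIX 6.1 TL5 or lower.

```shell
#!/bin/sh
# Sketch only: run_check and rmc_lpar_checks are hypothetical helpers.

# Print a label, then run the command if it exists on this system.
run_check() {
    label=$1
    shift
    echo "== $label"
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "skipped: $1 not found on this system"
    fi
}

# The three step 1a checks, in the order given above (run as root on the LPAR).
rmc_lpar_checks() {
    run_check "rsct.core.rmc fileset level (needs 3.1.0.x or higher)" \
        lslpp -l rsct.core.rmc
    run_check "Are all IP and host IDs trusted?" \
        /usr/sbin/rsct/bin/ctsvhbac
    run_check "Is the HMC listed as an IBM.MCP resource?" \
        lsrsrc IBM.MCP
}
```

After sourcing the file as root on the LPAR, call rmc_lpar_checks and review each labeled section of the output.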
2. To check RMC status on Management Console (HMC or FSM as admin user)
lspartition -dlpar —> Is the LPAR’s DCaps value non-zero?
3. If you answer no to any of the above then corrective action is required.
a. Missing file sets or fixes need to be installed.
b. If RSCT file set rsct.core.rmc is at 3.1.5.0 or 3.2.0.0 then APARs apply.
c. Fix-it commands (run as root on the LPAR or on the Management Console)
(1) The commands to run as root
/usr/sbin/rsct/install/bin/recfgct
/usr/sbin/rsct/bin/rmcctrl -p
(2) CAUTION: Running the commands listed above on AIX LPARs is only safe if the node is a member of only the Management Console’s RMC domain. These commands should not be used in an active CAA cluster environment. To determine whether your system is a member of a CAA cluster, refer to the Reliable Scalable Cluster Technology document titled “Troubleshooting the resource monitoring and control (RMC) subsystem.”
http://www-01.ibm.com/support/knowledgecenter/SGVKBA_3.1.5/com.ibm.rsct315.trouble/bl507_diagrmc.htm
(3) Pay particular attention to the section titled Diagnostic procedures to help learn whether your node is a member of any domain other than the Management Console’s management domain.
(4) You would need a pesh password for your management console if you need to run the above fix commands on the Management Console.
(5) You can try the following command first as a super admin user
lspartition -dlparreset
(6) If that does not help you will need to request pesh passwords from IBM Support for your Management Console so you can run the recfgct and rmcctrl commands listed above.
(7) After running the above commands it will take several minutes before the RMC connection is restored. The best way to monitor is to run the lspartition -dlpar command on the Management Console every few minutes and watch for the target LPAR to show up with a non-zero DCaps value.
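The monitoring described in (7) can be partly automated by filtering the lspartition -dlpar output. The sketch below assumes each partition entry is printed as a line containing Partition:<...> followed by a line containing DCaps:<0x...>. That layout is an assumption and should be checked against the output of your own Management Console. dcaps_report and dcaps_nonzero are hypothetical helper names.

```shell
#!/bin/sh
# Sketch only: assumes each entry has a "Partition:<...>" line followed by
# a line containing "DCaps:<0x...>". Verify against your own console output.

# Read lspartition -dlpar output on stdin and print one
# "<partition details> | <DCaps value>" line per entry.
dcaps_report() {
    awk '
        /Partition:</ {
            part = $0
            sub(/.*Partition:</, "", part)
            sub(/>.*/, "", part)
        }
        /DCaps:</ {
            d = $0
            sub(/.*DCaps:</, "", d)
            sub(/>.*/, "", d)
            print part " | " d
        }
    '
}

# Succeed (exit status 0) only if the hex value has a non-zero digit.
dcaps_nonzero() {
    case $1 in
        0x*[1-9a-fA-F]*) return 0 ;;
        *) return 1 ;;
    esac
}
```

One way to use it is from a workstation with ssh access to the console, for example ssh hscroot@hmc lspartition -dlpar | dcaps_report, where the user and host names are placeholders. Repeat every few minutes until the target LPAR shows a value that dcaps_nonzero accepts.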
4. Things to consider before using the above fix commands or if the reconfigure commands don’t help.
a. If you are still unsure whether your LPAR is a member of a CAA cluster, then some application names might help (PowerHA 7, HPC applications such as GPFS, ViSDs, CSM, etc.). Most administrators should have a good idea how their servers are configured and what is running on them, so the decision to proceed can be easy. The diagnostic checks covered in the document should help with the decision if you are unsure.
b. Network issues are often overlooked or disregarded. If the commands that reconfigure RSCT don’t restore DLPAR functions, there may be network configuration issues, and perhaps even APAR issues, that need to be addressed. Those issues require additional debug steps not covered in this tech note. However, some common network issues can prevent RMC communications from passing between the Management Console and the LPARs, including the following.
(1) Firewalls blocking bidirectional RMC related traffic for UDP and TCP on port 657.
(2) Mix of jumbo frames and standard Ethernet frames between the Management Console and LPARs.
(3) Multiple interfaces with IP addresses on the LPARs that can route traffic to the Management Console.
c. The above steps cover only the more common and simple issues involved in RMC communication errors. If you are unable to reestablish the RMC connection by running the commands suggested, then a more detailed look at the problem is required. Data gathering tools such as pedbg on the Management Console and ctsnap on the LPARs are the next tools to use to look at the problem more closely.
5. If the basic things listed above have been checked or performed and RMC still does not work, then it’s appropriate to collect additional data.
a. RMC Connection Errors Data Collection on LPAR
(1) Please check the clock setting on LPAR and management console to make sure they are in sync (use date command). Synchronizing clocks will make data analysis much easier.
(2) From LPAR collect a snap
(a) If AIX LPAR as root run
snap -gtkc
(b) If VIOS LPAR as padmin run
snap
b. Collect a ctsnap from the LPAR as root
/usr/sbin/rsct/bin/ctsnap -x runrpttr
c. Collect a pedbg from the management console as described below.
(1) If HMC then run “pedbg -c -q 4” as user hscpe and refer to the following document for additional information if needed.
Gathering and Transmitting PE Debug Data from an HMC
http://www-01.ibm.com/support/docview.wss?&uid=isg3T1012079
(2) If FSM or SDMC then run “pedbg -c -q r” as user pe and refer to the following document for additional information if needed.
Collecting pedbg on Flex System Manager (FSM) and Systems Director Management Console (SDMC)
https://www-304.ibm.com/support/docview.wss?uid=nas777a837116ee2fca8862578320079823c
d. Rename the data files collected on the LPAR.
(1) Rename the snap file
(a) On an AIX LPAR the snap is in /tmp/ibmsupt, so as root run the following.
mv /tmp/ibmsupt/snap.pax.Z /tmp/ibmsupt/<PMR#.Branch.000>-snap.pax.Z
(b) On a VIOS LPAR the snap is in /home/padmin, so as padmin run the following.
mv /home/padmin/snap.pax.Z /home/padmin/<PMR#.Branch.000>-snap.pax.Z
(2) Rename the ctsnap file
Note: The output file for ctsnap will be in /tmp/ctsupt with a name similar to ctsnap…tar.gz, so renaming it requires you to list /tmp/ctsupt to view the current name.
ls -l /tmp/ctsupt
mv /tmp/ctsupt. <PMR#.Branch.000>-
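Because the ctsnap file name differs from system to system, the renaming in step d can also be done with a wildcard instead of typing the name by hand. The following is a minimal sketch. prefix_files is a hypothetical helper, and the PMR value in the usage note is a made-up placeholder to be replaced with your own <PMR#.Branch.000> string.

```shell
#!/bin/sh
# Sketch only: prefix_files is a hypothetical helper, not an IBM tool.
# Usage: prefix_files <directory> <file pattern> <PMR#.Branch.000>

prefix_files() {
    dir=$1
    pattern=$2
    prefix=$3
    # $pattern is left unquoted on purpose so the shell expands the wildcard.
    for f in "$dir"/$pattern; do
        [ -f "$f" ] || continue        # pattern matched nothing
        base=${f##*/}
        case $base in
            "$prefix"-*) continue ;;   # already carries the prefix
        esac
        mv "$f" "$dir/$prefix-$base"
    done
}
```

For example, prefix_files /tmp/ctsupt 'ctsnap*' 12345.678.000 renames each ctsnap file in /tmp/ctsupt, and prefix_files /tmp/ibmsupt snap.pax.Z 12345.678.000 does the same for the AIX snap. Listing the directory afterwards with ls -l confirms the new names.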
e. Transmit the data files to IBM.
(1) FTP or HTTPS site is testcase.software.ibm.com
(2) User ID is anonymous and password can be your email address
(3) Directory is /toibm/aix
(4) Include the snap and ctsnap from the LPAR
(5) Include the pedbg from the HMC (FSM or SDMC)