The major components of the Cluster Systems Management monitoring tool are the Resource Monitoring and Control (RMC) subsystem and certain resource managers. These are described in the following sections.
The Resource Monitoring and Control (RMC) subsystem monitors and queries resources. The RMC daemon manages an RMC session and recovers from communications problems.
The RMC subsystem is used by its clients to monitor the state of system resources and to send commands to resource managers. The RMC subsystem acts as a broker between the client processes that use it and the resource manager processes that control resources.
A resource manager is a process that maps resource and resource-class abstractions into calls and commands for one or more specific types of resources. A resource manager is a stand-alone daemon. The resource manager contains definitions of all resource classes that the resource manager supports. A resource class definition includes a description of all attributes, actions, and other characteristics of a resource class.
See the man pages for the RMC and ERRM commands or Cluster Systems Management for Linux Technical Reference to learn how to access the resource classes and manipulate their attributes through the command line interface.
The following resource managers are provided:
The Audit Log subsystem is implemented as a resource manager within the RMC subsystem. It has two resource classes, IBM.AuditLog for subsystem definitions and IBM.AuditLogTemplate for audit-log-template definitions. Entries in the audit log are called records. Records can be added, retrieved, and removed through actions on a specific subsystem or on the subsystem class. The template definition class contains a description of each record type that a subsystem can add to the audit log. The template definition contains the data type, a descriptive message, and other information for each subsystem-specific field within the record.
There are typically two types of clients for the audit-log subsystem: subsystems that need to add records to the audit log, and users who extract records from the audit log through the command line. The formatted message for each record provides a concise description of the situation and lets a user see, at a high level, what has been happening on the system.
Each resource of this class represents a subsystem that will be adding records to the audit log. A resource of this class must be added before the subsystem can add records to the audit log. The resource can be added as part of the installation of the subsystem or at runtime.
The following properties can be monitored for this resource class:
This resource class holds all audit log templates. An audit log template describes the information that exists in each audit log record that is based on the template. In addition, an audit log template contains information on how to present records that use the template to an end user. Each template corresponds to a resource within this class. The attributes of this resource class are internal.
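The relationship between a template and the records based on it can be pictured with a small model. This is only an illustrative sketch: the class name `AuditLogTemplate`, the `{placeholder}` message syntax, and the use of Python types for field data types are assumptions made for the example, not the internal format of the Audit Log resource manager.

```python
from dataclasses import dataclass


@dataclass
class AuditLogTemplate:
    template_id: int
    message: str        # descriptive message; the {placeholder} syntax is an assumption
    field_types: dict   # subsystem-specific field name -> expected Python type

    def format_record(self, fields: dict) -> str:
        """Check each field against its declared type, then build the message."""
        for name, expected in self.field_types.items():
            if not isinstance(fields.get(name), expected):
                raise TypeError(f"field {name!r} must be of type {expected.__name__}")
        return self.message.format(**fields)


# Hypothetical template for a record that reports file system usage.
template = AuditLogTemplate(
    template_id=1,
    message="File system {name} is {pct}% full",
    field_types={"name": str, "pct": int},
)
print(template.format_record({"name": "/tmp", "pct": 93}))
```

A record whose fields do not match the declared types is rejected before any message is produced, which is one way a template can keep records consistent.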
The distributed management server resource manager (IBM.DMSRM) controls the managed node (IBM.ManagedNode) resource class and the node group (IBM.NodeGroup) resource class. The distributed management server resource manager runs on the node designated as the management server and is automatically started by the RMC subsystem.
The program name of this resource class is IBM.ManagedNode. It runs on the management server and is started by the RMC subsystem. It is controlled by the distributed management server resource manager.
The following dynamic attributes can be monitored for the IBM.ManagedNode resource class:
The following table shows the predefined conditions and example expressions that are available for the IBM.ManagedNode resource class.
Condition Name | Event Expression | Event Description | Rearm Expression | Rearm Description | Notes |
---|---|---|---|---|---|
NodeReachability | Status!=1 | An event is generated when a node in the network cannot be reached from the management server. | Status=1 | The event is rearmed when the node can be reached again. | None. |
NodeChanged | ConfigChanged=1 | An event is generated when a node definition in the ManagedNode resource class changes. | None. | None. | NodeNames = {localnode} |
The program name of the node group resource class is IBM.NodeGroup. The node group resource class runs on the management server.
The following dynamic attributes of the node group resource class can be monitored:
The system administrator interacts with the Event Response resource manager (ERRM) through the ERRM command-line interface.
When an event occurs, ERRM runs user-configured commands, which can include scripts provided by RSCT. A command and its attributes are a type of action, and many actions can be configured for a single Event Response resource. An action consists of a name, a command to be run, and other variables. You specify the range of times when the command is run (day, start time, and end time). If the condition occurs at a time outside the specified time ranges, the command is not run; if every action within the Event Response resource has the same time ranges, none of the commands are run. If no time ranges are specified, the command is always run. Event and rearm event flags specify the events for which the command is run. Three combinations are allowed: only the event flag set, only the rearm event flag set, or both flags set.
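The selection rules described above (time ranges plus event and rearm event flags) can be sketched as a small model. This is a minimal illustration, not ERRM's implementation: the class names `TimeRange` and `Action`, the weekday encoding, and the function `commands_to_run` are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, time


@dataclass
class TimeRange:
    days: set      # weekday numbers, Monday=0 .. Sunday=6 (hypothetical encoding)
    start: time
    end: time

    def contains(self, when: datetime) -> bool:
        return when.weekday() in self.days and self.start <= when.time() <= self.end


@dataclass
class Action:
    name: str
    command: str
    time_ranges: list = field(default_factory=list)  # empty list: always run
    on_event: bool = True
    on_rearm: bool = False

    def should_run(self, when: datetime, is_rearm: bool) -> bool:
        # First check the flag for this kind of event, then the time ranges.
        wanted = self.on_rearm if is_rearm else self.on_event
        if not wanted:
            return False
        if not self.time_ranges:
            return True
        return any(r.contains(when) for r in self.time_ranges)


def commands_to_run(actions, when, is_rearm=False):
    """Return the command of every action that applies at this moment."""
    return [a.command for a in actions if a.should_run(when, is_rearm)]
```

In this model, an action limited to weekdays between 08:00 and 17:00 is skipped for an event on a Saturday, while an action with no time ranges still runs.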
The Event Response Resource Manager (ERRM) is automatically started when the RMC subsystem is started.
Although performance is important, ensuring that no events are lost and that the user's commands are run is of greater importance. Other factors outside the control of ERRM may affect performance as well (for example, network load, system load, and the performance of other required subsystems).
The only user ID that can define, undefine, and modify ERRM resources is root. All other users have read access to ERRM resources. Security is governed by the RMC daemon, which authenticates clients and performs authorization checks. No security audits are generated, and no encryption mechanisms are used. ERRM communicates only with other local subsystems on the same node.
Information is handled as follows:
There are three Event Response resource classes:
The Condition resource class contains the necessary information (event expression and rearm expression) for the ERRM to register with the RMC for event notifications that the administrator deems important. Conditions contain essential information such as the resource attributes of the resource to be monitored, the event expression, and the optional rearm expression.
Configuration of ERRM begins with the definition of a set of Condition resources. A Condition resource is registered with the RMC subsystem when the Condition resource is used in the definition of an active Association resource.
Notes:
An Event Response resource is configured by defining one or more actions. Each action contains the name of the action, a command, and other fields within the action attribute. The Event Response resource runs any number of configured commands when an event with an active association occurs. When an event occurs, all of the actions associated with its Event Response resource are evaluated to determine whether they should be run.
Predefined responses are available to use and to serve as templates for creating your own responses. For a description of predefined responses and how to use them, see Predefined Responses. Scripts for notification and logging of events and for broadcasting messages to logged-in user consoles are provided in Cluster Systems Management for Linux Technical Reference.
See Getting Started with the Monitoring Application for specific task information on how to configure actions for Event Response resources and Event Response resources for Conditions.
The Association resource class joins the Condition resource class together with the Event Response resource class. It contains a flag that indicates whether the association between the condition and the event response is active. Event Responses and Conditions are separate entities, but for monitoring to take place, they need to be associated. An event cannot occur unless at least one Event Response is associated with a Condition. You can configure one or more actions for an Event Response, and one or more Event Responses for a Condition.
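The way associations tie Conditions to Event Responses can be modeled in a few lines. The sketch below is an assumption-laden illustration: the `Association` class and both helper functions are hypothetical names, and conditions and responses are represented only by their names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Association:
    condition: str
    response: str
    active: bool


def monitored_conditions(associations):
    """Conditions that appear in at least one active association."""
    return {a.condition for a in associations if a.active}


def responses_for(condition, associations):
    """Event Responses that run when the named condition generates an event."""
    return [a.response for a in associations
            if a.condition == condition and a.active]


associations = [
    Association("File system space used", "LogEventsAnyTime", True),
    Association("File system space used", "EmailEventsToRootAnyTime", False),
    Association("Paging percent space used", "LogEventsAnyTime", False),
]
print(monitored_conditions(associations))
print(responses_for("File system space used", associations))
```

Only the condition with an active association is monitored in this model, and only the response joined to it by an active association is run.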
See Getting Started with the Monitoring Application for information on how to get started using the capabilities of the Event Response resource manager to monitor your system.
The File System resource manager (FSRM) manages file systems. It can do the following:
There is one File System resource manager (FSRM) on a node. It is started implicitly by the RMC subsystem and is run only when an attribute of an FSRM resource class is monitored (thus cutting down on performance overhead).
To enforce security, only root can start the FSRM resource manager (although it is strongly recommended that the FSRM resource manager not be started manually). Security is governed by the RMC daemon, which authenticates clients and performs authorization checks. No security audits are generated, and no encryption mechanisms are used. The FSRM communicates only with other local subsystems on the same node and with the RMC subsystem. The FSRM has no direct contact with clients.
Information is handled as follows:
These attributes of a file system resource can be monitored:
The following table shows the predefined conditions and examples of expressions that are used to monitor the file system:
Condition Name | Event Expression | Event Description | Rearm Expression | Rearm Description | Monitored Resources | Notes |
---|---|---|---|---|---|---|
File system state | OpState != 1 | An event is generated when any file system goes offline. | OpState == 1 | The event is rearmed when any file system comes back online. | all | n/a |
File system i-nodes used | PercentINodeUsed > 90 | An event is generated when more than 90% of the total i-nodes in any file system are in use. | PercentINodeUsed < 85 | The event is rearmed when the percentage of i-nodes used in the file system falls below 85%. | all | n/a |
File system space used | PercentTotUsed > 90 | An event is generated when more than 90% of the total space of any file system is in use. | PercentTotUsed < 85 | The event is rearmed when the space used in the file system falls below 85%. | all | n/a |
/tmp space used | PercentTotUsed > 90 | An event is generated when more than 90% of the total space in the /tmp file system is in use. | PercentTotUsed < 85 | The event is rearmed when the space used in the /tmp file system falls below 85%. | /tmp | n/a |
/var space used | PercentTotUsed > 90 | An event is generated when more than 90% of the total space in the /var file system is in use. | PercentTotUsed < 85 | The event is rearmed when the space used in the /var file system falls below 85%. | /var | n/a |
AnyNodeFileSystemInodesUsed | PercentINodeUsed > 90 | An event is generated when more than 90% of the total i-nodes in the file system are in use. | PercentINodeUsed < 75 | The event is rearmed when the percentage of i-nodes used in the file system falls below 75%. | all | n/a |
AnyNodeFileSystemSpaceUsed | PercentTotUsed > 90 | An event is generated when more than 90% of the total space of the file system is in use. | PercentTotUsed < 75 | The event is rearmed when the percentage of space used in the file system falls below 75%. | all | n/a |
AnyNodeTmpSpaceUsed | PercentTotUsed > 90 | An event is generated when more than 90% of the total space in the /tmp directory is in use. | PercentTotUsed < 75 | The event is rearmed when the percentage of space used in the /tmp directory falls below 75%. | /tmp | Use Name= '/tmp' for select string. |
AnyNodeVarSpaceUsed | PercentTotUsed > 90 | An event is generated when more than 90% of the total space in the /var directory is in use. | PercentTotUsed < 75 | The event is rearmed when the percentage of space used in the /var directory falls below 75%. | /var | Use Name= '/var' for select string. |
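
The pairing of an event expression with a rearm expression, as in the table above, behaves like a two-state switch: after an event is generated, no further event is reported until the rearm expression becomes true. The following sketch models that alternation for a series of PercentTotUsed samples. The function name `track_events` and the use of Python predicates in place of RMC expressions are assumptions made for illustration.

```python
def track_events(samples, event, rearm):
    """Return ("event" | "rearm", value) pairs for a series of attribute values.

    While armed, only the event predicate is checked; after an event,
    only the rearm predicate is checked until it becomes true.
    """
    armed = True
    notifications = []
    for value in samples:
        if armed and event(value):
            notifications.append(("event", value))
            armed = False
        elif not armed and rearm(value):
            notifications.append(("rearm", value))
            armed = True
    return notifications


# Thresholds taken from the "File system space used" condition.
percent_tot_used = [80, 91, 95, 88, 92, 84, 93]
result = track_events(
    percent_tot_used,
    event=lambda v: v > 90,
    rearm=lambda v: v < 85,
)
print(result)
```

Because the rearm threshold (85%) is lower than the event threshold (90%), a value that hovers around 90% does not produce a stream of repeated events.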
The Host resource manager allows system resources for an individual machine to be monitored, particularly resources related to operating system load and status.
The Host resource manager is started implicitly by the RMC subsystem only when an attribute of a Host resource class is first monitored (thus cutting down on performance overhead).
Security is governed by the RMC daemon, which authenticates clients and performs authorization checks. The Host resource manager runs as root. No security audits are generated, no encryption mechanisms are used, and there is no communication outside the node. The RMC daemon detects any unsuccessful authentication or authorization attempts. All interprocess communication is accomplished through pipes and shared memory.
Information is handled as follows:
The Host resource manager consumes minimal system resources during normal operation. This is because the following approaches have been implemented:
The Host resource manager has the following resource classes that you can use to monitor system resources:
The program name of this resource class is IBM.Host. It allows the following resources of a host system to be monitored:
The following attribute monitors the percentage of paging space in use:
The following table shows the predefined condition that is available for monitoring paging space, and example expressions:
Condition Name | Event Expression | Event Description | Rearm Expression | Rearm Description |
---|---|---|---|---|
Paging percent space used | PctTotalPgSpUsed > 90 | An event is generated when more than 90% of the total paging space is in use. | PctTotalPgSpUsed < 85 | The event is rearmed when the percentage falls below 85%. |
The values represented for this attribute reflect total processor utilization across all of the active processors in a system.
This attribute can be monitored:
The following table shows the predefined condition that is available for monitoring system-wide processor idle time, and example expressions:
Condition Name | Event Expression | Event Description | Rearm Expression | Rearm Description |
---|---|---|---|---|
Processor idle time | PctTotalTimeIdle >= 70 | An event is generated when, on average, all processors are idle at least 70% of the time. | PctTotalTimeIdle < 10 | The event is rearmed when the idle time decreases below 10%. |
The program name of this resource class is IBM.Program. This resource class can monitor a set of processes that are running a specific program or command whose attributes match a filter criterion. The filter criterion includes the real or effective user name of the process, the arguments that the process was started with, and so on. The primary aspect of a program resource that can be monitored is the set of processes that meet the program definition. A client can be informed when processes with the properties that meet the program definition are initiated and when they are terminated. This resource class typically is used to detect when a required subsystem encounters a problem so that recovery actions can be performed and the administrator can be notified.
A program definition requires the program name and the user name of the owner of the program. The program should be identified by user name in addition to program name to avoid confusion when two or more programs have the same name. These attributes are defined as follows:
For a process to match a program definition and thus be considered to be running the program, its name must match the ProgramName attribute value. In addition, the expression defined by the Filter attribute must evaluate to TRUE when applied to the properties of the process. The Filter attribute is a string that consists of the names of various properties of a process, comparison operators, and literal values. For example, a value of user==greg restricts the process set to those processes that run ProgramName under the user ID greg. The syntax for the Filter value is the same as for a selection string.
For more information on selection strings, see Using Expressions.
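The two-part match described above (program name first, then the Filter) can be sketched as follows. This is a simplified model: the `Process` and `ProgramDefinition` classes are hypothetical, and the Filter is represented by a Python predicate rather than by a parsed selection string.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Process:
    pid: int
    name: str
    user: str


@dataclass
class ProgramDefinition:
    program_name: str
    process_filter: Callable[[Process], bool]  # stands in for the Filter attribute

    def matches(self, process: Process) -> bool:
        # The name must match, and the filter must evaluate to true.
        return process.name == self.program_name and self.process_filter(process)


def matching_pids(definition, processes):
    """Return the sorted PIDs of the processes that meet the program definition."""
    return sorted(p.pid for p in processes if definition.matches(p))


# Equivalent in spirit to ProgramName biod with the Filter user==greg.
definition = ProgramDefinition("biod", lambda p: p.user == "greg")
processes = [
    Process(7786, "biod", "root"),
    Process(8040, "biod", "greg"),
    Process(9100, "sendmail", "greg"),
]
print(matching_pids(definition, processes))
```

Only the process that satisfies both parts of the definition is counted; a process with the right name but a different user, or the right user but a different name, is ignored.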
Processes must run for a minimum duration (approximately 15 seconds) to be monitored by the IBM.Program resource class. (If a program runs for only a few seconds, some of the processes that run the program may not be detected.)
This attribute can be monitored: Processes
These elements of the Processes attribute can be monitored:
    ps -e -o "ruser,pid,ppid,comm" | grep biod
    root  7786  8040 biod
    root  8040  5624 biod
    root  8300  8040 biod
    root  8558  8040 biod
    root  8816  8040 biod
    root  9074  8040 biod
To be informed when the number of processes running the specified program changes, you can define this event expression:
Processes.CurPidCount!=Processes.PrevPidCount
To be informed when no processes are running the specified program, you can define this event expression:
Processes.CurPidCount==0
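To make the two expressions concrete, the following sketch counts the processes in ps-style output and evaluates both comparisons in plain Python. It is an illustration only: the helper names are hypothetical, and the IBM.Program resource class obtains its process list internally rather than by parsing ps output.

```python
PS_SAMPLE = """\
root 7786 8040 biod
root 8040 5624 biod
root 8300 8040 biod
root 8558 8040 biod
root 8816 8040 biod
root 9074 8040 biod
"""


def pids_running(ps_output, program):
    """Extract PIDs from lines of the form: ruser pid ppid comm."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        # Skip headers and malformed lines; keep only the named program.
        if len(fields) == 4 and fields[1].isdigit() and fields[3] == program:
            pids.append(int(fields[1]))
    return pids


def evaluate(prev_pids, cur_pids):
    """Evaluate the two example expressions for one observation."""
    prev_count, cur_count = len(prev_pids), len(cur_pids)
    return {
        "Processes.CurPidCount!=Processes.PrevPidCount": cur_count != prev_count,
        "Processes.CurPidCount==0": cur_count == 0,
    }


previous = pids_running(PS_SAMPLE, "biod")
current = []  # hypothetical later observation: every biod process has exited
print(evaluate(previous, current))
```

In this example both expressions are true, because the count changed and no process remains.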
This resource class is typically used to detect when a required subsystem encounters a problem so that some recovery action can be performed or an administrator can be notified. The following table shows the predefined conditions and example expressions that are available for monitoring programs.
Condition Name | Event Expression | Event Description | Rearm Expression | Rearm Description | Monitored Resources | Notes |
---|---|---|---|---|---|---|
sendmail daemon state | Processes.CurPidCount <= 0 | An event is generated whenever the sendmail daemon is not running. | Processes.CurPidCount > 1 | The event is rearmed when the sendmail daemon is running. | sendmail | n/a |
inetd daemon state | Processes.CurPidCount <= 0 | An event is generated whenever the inetd daemon is not running. | Processes.CurPidCount > 1 | The event is rearmed when the inetd daemon is running. | inetd | n/a |
MgmtSvrCfd Status | Processes.CurPidCount <= 0 | An event is generated when the cfengine daemon stops running. | Processes.CurPidCount > 1 | The event is rearmed when the cfengine daemon starts running again. | CSM Mgmt Server | Use ProgramName= 'cfd' for the select string. |
AnyNodeCfd Status | Processes.CurPidCount <= 0 | An event is generated when the cfengine daemon stops running. | Processes.CurPidCount > 1 | The event is rearmed when the cfengine daemon starts running again. | all nodes | Use ProgramName= 'cfd' for the select string. |
The Sensor resource manager makes the output of a user-written script known to the RMC subsystem as a dynamic attribute of a sensor resource. The Sensor resource manager runs the script at a specified interval to update this attribute. Thus, an administrator can set up a user-defined sensor to monitor an attribute of interest and then create Conditions and Responses with associated actions that are performed when the attribute has a certain value. For example, a script can be written to return the number of users logged on to the system. Then an ERRM Condition and Response can be defined to run an action when the number of users logged on exceeds a certain threshold.
The Sensor resource manager has one class, IBM.Sensor. Each resource in the IBM.Sensor resource class represents one sensor and includes information such as the script command, the user name under which the command is run, and how often it should be run. The output of the script causes a dynamic attribute within the resource to be set. This attribute can then be monitored in the typical way.
See the mksensor man page for details on how to set up a sensor.
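The idea of a sensor (a command whose output becomes a monitorable attribute) can be modeled in a few lines. The sketch below is not how the Sensor resource manager is implemented and does not use mksensor; the `Sensor` class, its fields, and the placeholder command are hypothetical, and scheduling at the configured interval is left out.

```python
import subprocess
import sys


def run_sensor(command):
    """Run the sensor command once and return its trimmed standard output."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout.strip()


class Sensor:
    def __init__(self, name, command, interval_seconds):
        self.name = name
        self.command = command                    # argument list for the script
        self.interval_seconds = interval_seconds  # how often refresh() would be called
        self.value = None                         # models the dynamic attribute

    def refresh(self):
        """Update the attribute; return True if it changed since the last run."""
        new_value = run_sensor(self.command)
        changed = self.value is not None and new_value != self.value
        self.value = new_value
        return changed


# A stand-in "script" that prints a fixed number of logged-on users.
sensor = Sensor("NumUsers", [sys.executable, "-c", "print(3)"], interval_seconds=60)
sensor.refresh()
print(sensor.name, sensor.value)
```

A condition on such an attribute would then compare `value` against a threshold, or react to a change between two consecutive runs.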
The following table shows the predefined condition and example expression that are available for the IBM.Sensor resource class.
Condition Name | Event Expression | Event Description | Notes |
---|---|---|---|
CFMRootModTimeChanged | "String!=\@P" | An event is generated when a file under /cfmroot is modified, added, or deleted. | Selection String = 'Name="CFMRootModTime"' |
The following predefined responses are shipped as templates or as starting points for monitoring.
See Using Expressions for a summary of the data types and operators that you can
use in selection strings for a customized response.
Response Name | Command |
---|---|
BroadcastEventsAnyTime | /usr/sbin/rsct/bin/wallevent |
CForce | /opt/csm/bin/cforce -a |
EmailEventsToRootAnyTime | /usr/sbin/rsct/bin/notifyevent root |
DisplayEventsAnyTime | /usr/sbin/rsct/bin/displayevent admindesktop:0 |
LogEventsAnyTime | /usr/sbin/rsct/bin/logevent /var/log/csm/systemEvents |
MsgEventsToRootAnytime | /usr/sbin/rsct/bin/msgevent root |
You can use the following commands, scripts, utilities, and files to control Monitoring on your system. See the command man pages or Cluster Systems Management for Linux Technical Reference for detailed usage information.