The CSM Monitoring application offers a comprehensive set of monitoring and response capabilities that lets you detect, and in many cases correct, system resource problems such as a critical file system becoming full. You can monitor virtually all aspects of your system resources and specify a wide range of actions to be taken when a problem occurs, from simple notification by email to recovery that runs a user-written script. You can specify an unlimited number of actions to be taken in response to an event.
As system administrator, you have a great deal of flexibility in responding to events. You can respond to an event in different ways based on the day of the week and time of day. The following are some examples of how you can use monitoring:
See Using the Monitoring Application for more details.
CSM uses RMC to monitor the system and to perform many of its operations. For information about the command line interface to the RMC subsystem, see IBM Cluster Systems Management for Linux Technical Reference. For RMC diagnostic information, see Recovering from RMC and Resource Manager Problems. For information about authorization and modifying the ACL file, see Security Considerations.
Monitoring lets you detect conditions of interest in the cluster nodes and their associated resources and automatically take action when those conditions occur. The key elements in monitoring are conditions and responses. A condition identifies one or more resources that you want to monitor, such as the /var file system, and the specific resource state you are interested in, such as /var being more than 90% full. A response specifies one or more actions to be taken when the condition is found to be true. Actions can include notification, running commands, and logging.
To understand and use conditions, you need to know about the following:
System resources that you can monitor are organized into general categories called resource classes. Examples of resource classes include Processor, File System, Physical Volume, and Ethernet Device.
Each resource class includes individual system resources that belong to the class. For example, the File System resource class might include these resources:
When a resource is specified for use in a condition, it is called a monitored resource.
Each resource within a resource class also has a set of attributes that you can monitor. For example, the File System resource class has the following attributes available for monitoring:
For a condition, you specify the monitored attribute of the resource in a logical expression that defines a threshold or state of the monitored resource. When the logical expression is true (the threshold is reached or the state becomes true), an event is generated. The logical expression is the event expression of the condition. Event expressions are typically used to monitor potential problems and significant changes in the system. For example, the event expression for a /var space used condition might be PercentTotUsed > 90.
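To make the idea of an event expression concrete, the following Python sketch evaluates a simple expression of the form `<attribute> <operator> <number>` against a sampled attribute value. It is only an illustration: the `evaluate` function, the operator table, and the sample value are assumptions made for this example and are not part of RMC, whose expression syntax is richer.

```python
import operator

# Comparison operators supported by this sketch (an assumption; not the full RMC set).
_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
}


def evaluate(expression: str, attributes: dict) -> bool:
    """Evaluate '<attribute> <operator> <number>' against sampled attribute values."""
    name, op, threshold = expression.split()
    return _OPS[op](attributes[name], float(threshold))


# Hypothetical sampled value for the /var file system.
sample = {"PercentTotUsed": 93}
event_fired = evaluate("PercentTotUsed > 90", sample)
```

With the sample value above, the expression is true, which corresponds to the point at which an event would be generated.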
The rearm expression of a condition is optional. A rearm expression typically indicates when the monitored resource has returned to an acceptable state. When the rearm expression is met, monitoring for the condition resumes. If a rearm expression is not specified, then once the event expression becomes true, an event is generated for certain attributes every time the monitored attribute is evaluated.
If a rearm expression is specified, evaluation of the rearm expression starts after the event expression becomes true. When the rearm expression becomes true, a rearm event is generated; then the evaluation of the event expression starts again. For example, if the event expression for a /var space used condition is PercentTotUsed > 90 and the rearm expression is PercentTotUsed < 80, then an event is generated when /var is more than 90% full. The next time the condition is evaluated, the rearm expression is used. When /var is less than 80% full, a rearm event is generated indicating that the condition has been reset, and the event expression is used again to evaluate the condition.
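The alternation between the event expression and the rearm expression can be sketched as a small two-state model. This is a simplified illustration, not the RMC implementation: the `ConditionMonitor` class, its field names, and the sample readings are hypothetical, and the no-rearm case is reduced to "an event on every evaluation where the expression is true".

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ConditionMonitor:
    """Toy model of event/rearm evaluation; not the RMC implementation."""
    event_expr: Callable[[float], bool]
    rearm_expr: Optional[Callable[[float], bool]] = None
    armed: bool = True  # True: use the event expression; False: use the rearm expression

    def evaluate(self, value: float) -> Optional[str]:
        """Return 'event', 'rearm', or None for one sampled value."""
        if self.armed:
            if self.event_expr(value):
                # Switch to the rearm expression only if one was specified.
                if self.rearm_expr is not None:
                    self.armed = False
                return "event"
            return None
        if self.rearm_expr(value):
            # Condition has been reset; go back to the event expression.
            self.armed = True
            return "rearm"
        return None


monitor = ConditionMonitor(
    event_expr=lambda pct: pct > 90,   # PercentTotUsed > 90
    rearm_expr=lambda pct: pct < 80,   # PercentTotUsed < 80
)
samples = [85, 92, 95, 85, 79, 91]     # hypothetical /var usage readings
results = [monitor.evaluate(s) for s in samples]
# results == [None, "event", None, None, "rearm", "event"]
```

Note that the readings of 95 and 85 produce nothing: once the event has fired, only the rearm expression is evaluated until usage drops below 80%.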
See Using Expressions for more information about data types and operators that you can use in an event expression or a rearm expression.
Predefined conditions are provided with the Monitoring application.
To create a new condition, you have to set the following condition components:
Condition Component | Description | Example |
---|---|---|
Condition name | Required. The name you want to give the condition. | /var space used |
Resource class | Required. The resource class to be monitored. | FileSystem |
Monitored attribute | Optional. The attribute of the resource class to be monitored. If not specified, it will be extracted from the event expression. | PercentTotUsed |
Monitored resources | Optional. The specific resources in the resource class that are to be monitored. If not specified, the default is all resources in the specified resource class. | /var |
Event expression | Required. A logical expression defining the value or state of the monitored attribute that is to generate an event. | PercentTotUsed > 90 |
Event description | Optional. A text description of the event expression. If not specified, the default is a NULL string. | An event occurs when /var is more than 90% full. |
Rearm expression | Optional. When a rearm expression is specified, it is evaluated after the event expression becomes true. When the rearm expression becomes true, the event expression is used for evaluation again. If not specified, the condition is monitored using only the event expression. | PercentTotUsed < 80 |
Rearm description | Optional. A text description of the rearm expression. If not specified, the default is a NULL string. | A rearm event occurs when /var is less than 80% full. |
Severity | Optional. The severity of the condition: Informational, Warning, or Critical. If not specified, the default is Informational. | Critical |
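The required and optional components in the table, together with their defaults, can be modeled as a small data structure. The sketch below is illustrative only: the `ConditionDefinition` class and its field names are hypothetical, and taking the first token of the event expression as the monitored attribute is an assumption about how that value might be extracted.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

SEVERITIES = ("Informational", "Warning", "Critical")


@dataclass
class ConditionDefinition:
    """Hypothetical model of the condition components listed in the table."""
    name: str                                   # required
    resource_class: str                         # required
    event_expression: str                       # required
    monitored_attribute: Optional[str] = None   # default: taken from the event expression
    monitored_resources: Tuple[str, ...] = ()   # empty: all resources in the class
    event_description: str = ""                 # default: empty (NULL) string
    rearm_expression: Optional[str] = None      # default: event expression only
    rearm_description: str = ""                 # default: empty (NULL) string
    severity: str = "Informational"

    def __post_init__(self) -> None:
        if not (self.name and self.resource_class and self.event_expression):
            raise ValueError("name, resource class, and event expression are required")
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")
        if self.monitored_attribute is None:
            # Assumption: the attribute is the first token of the event expression.
            self.monitored_attribute = self.event_expression.split()[0]


var_condition = ConditionDefinition(
    name="/var space used",
    resource_class="FileSystem",
    event_expression="PercentTotUsed > 90",
    monitored_resources=("/var",),
    rearm_expression="PercentTotUsed < 80",
    severity="Critical",
)
```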
Finally, a user-defined sensor can be created to monitor an attribute of interest. Conditions can then be defined with expressions that refer to that attribute, along with responses whose associated actions are performed when the attribute has a certain value. For example, a script can be written to return the number of users logged on, and a condition and response can be defined so that a specified action is taken when the number of users exceeds a certain threshold.
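As a rough illustration of the logged-on users example, the Python sketch below counts distinct user names in text formatted like the output of the `who` command and compares the count with a threshold. The function names, the sample text, and the choice to count distinct users rather than sessions are assumptions for this example; how a sensor script reports its value to RMC is not covered here.

```python
def count_logged_on_users(who_output: str) -> int:
    """Count distinct user names in text formatted like the output of `who`."""
    users = {line.split()[0] for line in who_output.splitlines() if line.strip()}
    return len(users)


def users_exceed(who_output: str, threshold: int) -> bool:
    """True when the number of logged-on users is above the threshold."""
    return count_logged_on_users(who_output) > threshold


# Hypothetical sample; a real script might obtain this text with
# subprocess.run(["who"], capture_output=True, text=True).stdout
sample_who = """\
alice    pts/0        2024-01-15 09:12 (10.0.0.5)
bob      pts/1        2024-01-15 09:40 (10.0.0.7)
alice    pts/2        2024-01-15 10:02 (10.0.0.5)
"""

user_count = count_logged_on_users(sample_who)   # 2 distinct users
```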
A response consists of one or more actions to be performed by the system when an event or rearm event occurs for a condition. The Monitoring application allows you to use predefined responses or create new responses and associate them with conditions as needed. You can associate multiple responses with one condition, and one response with multiple conditions.
To create a new response, you have to set the following response and action components:
Response Component | Description | Example |
---|---|---|
Response name | The name you want to give the response. | Response for critical conditions |
Actions | One or more actions to be taken as part of the response. | Log events to a file |

Action Component | Description | Example |
---|---|---|
Action name | The name of an action to be taken as part of the response. | Send email to the operator |
When in effect | The days and times when this action is to be used to respond to the condition. | 08:00 - 17:00 Monday - Friday |
Use for event, rearm event, or both | Whether the action is to be used to respond to an event, a rearm event, or both. | Event |
Command | The command to be run when an event or rearm event occurs. | A recovery script |
If you want to define different responses based on when the event occurs, you can associate multiple responses with a condition. For example, you might have a work day response and a weekend response, each containing one or more actions. During working hours, you might want to email the operator, run a command, and broadcast a message to users who are logged on. During weekend hours, you might want to email the system administrator and log a message to a file.
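The work day and weekend example can be sketched as actions that each carry a "when in effect" window, with only the actions in effect at the time of the event being selected. The `Action` class, its fields, and the specific windows below are hypothetical and serve only to illustrate the selection logic.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import FrozenSet, List

WEEKDAYS = frozenset(range(0, 5))   # Monday=0 .. Friday=4, as in datetime.weekday()
WEEKEND = frozenset({5, 6})         # Saturday, Sunday


@dataclass(frozen=True)
class Action:
    """Hypothetical action with a 'when in effect' window."""
    name: str
    days: FrozenSet[int]
    start: time
    end: time

    def in_effect(self, when: datetime) -> bool:
        return when.weekday() in self.days and self.start <= when.time() <= self.end


def actions_in_effect(actions: List[Action], when: datetime) -> List[str]:
    """Names of the actions that apply at the given moment."""
    return [a.name for a in actions if a.in_effect(when)]


actions = [
    Action("Send email to the operator", WEEKDAYS, time(8, 0), time(17, 0)),
    Action("Broadcast message to logged-on users", WEEKDAYS, time(8, 0), time(17, 0)),
    Action("Send email to the system administrator", WEEKEND, time.min, time.max),
]

monday_morning = datetime(2024, 1, 15, 10, 0)    # a Monday
saturday_morning = datetime(2024, 1, 13, 10, 0)  # a Saturday
```

In this sketch, an event on Monday at 20:00 selects no actions at all, which shows why the windows of the associated responses should together cover every period in which a reaction is expected.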
After monitoring for the condition begins, the system evaluates the event expression to see if it is true. When the event expression becomes true, an event occurs that automatically notifies all of the associated event responses, which causes each event response to run its defined actions.
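The notification step can be pictured as a lookup from a condition to its associated responses, each of which runs its own list of actions. The following sketch is a simplified model under stated assumptions: the dictionaries, the `notify` function, the "/tmp space used" condition, and the "Notify administrator" response are hypothetical, and the actions merely append to a list instead of sending email or writing files.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Action = Callable[[str], None]   # an action receives the condition name

responses: Dict[str, List[Action]] = {}                         # response name -> actions
associations: DefaultDict[str, List[str]] = defaultdict(list)   # condition -> response names
log: List[str] = []                                             # stands in for real side effects


def notify(condition: str) -> None:
    """Run every action of every response associated with the condition."""
    for response_name in associations[condition]:
        for action in responses[response_name]:
            action(condition)


responses["Response for critical conditions"] = [
    lambda cond: log.append(f"email operator: {cond}"),
    lambda cond: log.append(f"log to file: {cond}"),
]
responses["Notify administrator"] = [
    lambda cond: log.append(f"email admin: {cond}"),
]

# One condition can have several responses, and one response can serve several conditions.
associations["/var space used"] += ["Response for critical conditions", "Notify administrator"]
associations["/tmp space used"] += ["Notify administrator"]

notify("/var space used")
```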
The event expression and the rearm expression work together as follows when a condition is monitored:
This cycle is illustrated below:
The interactions are illustrated below: