IJ34882 |
High Importance |
Assert: SGNotQuiesced sgmrpc.C
Symptom |
Scale mmfsd daemon process crash |
Environment |
All |
Trigger |
Snapshot create or delete operations |
Workaround |
None |
|
5.1.1.4 |
Core GPFS |
IJ34886 |
Suggested |
When multiple nodes are creating files in the same directory, creates can slow down during recovery.
Symptom |
Long Waiters |
Environment |
All |
Trigger |
File system crash |
Workaround |
None |
|
5.1.1.4 |
Core GPFS |
IJ34346 |
High Importance |
If FIPS is enabled, call home uploads fail; manual call home uploads crash with an error mentioning FIPS.
Symptom |
Component Level Outage |
Environment |
Linux |
Trigger |
Enabling FIPS |
Workaround |
Disable FIPS. |
|
5.1.1.4 |
Call home |
IJ34917 |
High Importance |
AFM gateway node crashes if the home is not responding while mounting the fileset target path.
Symptom |
Crash |
Environment |
Linux |
Trigger |
AFM caching with unresponsive home |
Workaround |
None |
|
5.1.1.4 |
AFM |
IJ34927 |
High Importance |
logAssertFailed: exclLockWord == 0
Symptom |
Assert |
Environment |
POWER |
Trigger |
NA |
Workaround |
Disable the assert. |
|
5.1.1.4 |
Core GPFS |
IJ34928 |
HIPER |
Data loss may happen when an application uses the direct I/O mode to write to a pre-allocated file block.
Symptom |
Unexpected Results/Behavior |
Environment |
All |
Trigger |
Application uses the direct I/O mode to write to a pre-allocated file block. |
Workaround |
None |
|
5.1.1.4 |
Core GPFS |
IJ34931 |
High Importance |
Drives on an ESS3k may not show up after a boot or reboot of a canister. You can detect these errors using: lspci -s 0x87 | grep DpcSta | grep Trigger+ or lspci -s 0x3c | grep DpcSta | grep Trigger+
Symptom |
Component Level Outage |
Environment |
Linux (x86_64) |
Trigger |
ESS 3000 boot or reboot canister (very rare) |
Workaround |
You can use the setpci utility to manually clear the DPC error flag of the 0x87 and 0x3c busses. This forces the devices to attempt to retrain. If a drive still does not train, there is some other issue. |
|
5.1.1.4 |
ESS, GNR |
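The detection step above can be sketched as a small loop. The DpcSta sample line below is an illustrative stand-in for real output (on a live canister you would read it from `lspci -s <bus> -vvv`); the bus numbers are the ones named in the entry.

```shell
# Check both PCIe buses named in the entry for a tripped DPC
# (Downstream Port Containment) flag. The sample line stands in for
# real `lspci -vvv` output so this sketch runs anywhere.
sample='DpcSta: Trigger+ Reason:Uncorrectable SwTrigger- RP PIO ErrPtr: 1f'
for bus in 0x87 0x3c; do
  if printf '%s\n' "$sample" | grep 'DpcSta' | grep -q 'Trigger+'; then
    echo "bus $bus: DPC triggered - downstream drives may be missing"
  fi
done
```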
IJ33911 |
Suggested |
The mmhealth encryption component shows "checking" instead of "healthy". Unlike other components, this check is not refreshed by a timer but only by incoming events, so it must start with the "healthy" status.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
Encryption active |
Workaround |
For the encryption component, read "checking" as "healthy" as no error or warning events have happened. |
|
5.1.1.3 |
System health |
IJ33948 |
Suggested |
If a file system is set to maintenance mode, it is listed as 'SUSPENDED', but only an 'unmounted_fs_check' event is shown as the reason. It should say 'maintenance state' instead.
Symptom |
Error output/message |
Environment |
All |
Trigger |
The 'fs_maintenance_mode' event is only at info-level, since it is a user intended state. Info-level events are in general not reported by 'mmhealth node show' since they do not indicate an issue or error state. A code change was done to allow the 'fs_maintenance_mode' event to be listed as a reason. |
Workaround |
None |
|
5.1.1.3 |
System health |
IJ33949 |
High Importance |
On a cluster with two quorum nodes and tiebreaker disks, an unexpected quorum loss can be seen on the challenger node when the current cluster manager is shut down (mmshutdown) or its node reboots.
Symptom |
File System Outage (unexpected GPFS file system unmount for about 30 seconds) |
Environment |
All |
Trigger |
GPFS shutdown (mmshutdown) or node reboot of current cluster manager |
Workaround |
Move the cluster manager role by using the 'mmchmgr -c ' command. |
|
5.1.1.3 |
Cluster Manager |
IJ33997 |
Suggested |
AFM prefetch derives the actual mount path from a given directory path by using the character count of the fileset path. If an unrelated directory path matches up to that count, prefetch starts processing it successfully even though the path does not belong to the fileset.
Symptom |
Prefetch processes an invalid directory path that does not belong to the fileset. |
Environment |
All |
Trigger |
Prefetch starts working on an invalid path where the dir path doesn't belong to the same fileset. |
Workaround |
None |
|
5.1.1.3 |
AFM |
IJ34000 |
Suggested |
GPFS has fileset-level permission flags which can deny setting the mode or EAs on fileset entities, depending on which mode the operation targets. AFM doesn't consider this flag on the fileset, which results in E_PERM from the home and causes the queue to stall. The normal queue is fine, but mostly the recovery or resync queue hits this issue.
Symptom |
Unexpected Results |
Environment |
Linux |
Trigger |
Set the same fileset level permissions flag (setAclOnly, ChmodOnly, chmodAndUpdateAcl, etc.) on both Cache/Primary and/or Home/Secondary sites and perform IO to the fileset and then run recovery or resync. |
Workaround |
Drop the operations that stall the queue when the fileset-level permissions are enabled at only one of the two sites. |
|
5.1.1.3 |
AFM |
IJ34001 |
High Importance |
mmkeyserv client register, deregister or rkm change command will fail if the new RKM.conf contains expired certificates.
Symptom |
Error output/message Unexpected Results/Behavior |
Environment |
Linux, Windows (x86_64) |
Trigger |
This occurs when there is a client that is registered to multiple tenants and the certificate has expired, or when there are multiple clients that are registered to at least one tenant and their certificates have expired. |
Workaround |
Use the mmkeyserv client update command to update the client certificates. Otherwise, shut down GPFS and retry the command. |
|
5.1.1.3 |
Admin commands, Encryption |
IJ34002 |
High Importance |
When a file system has a high number of block allocation regions, the processing of allocation manager RPC could be slower than expected.
Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
Environment |
All |
Trigger |
Running mmdf |
Workaround |
Avoid running mmdf. |
|
5.1.1.3 |
Core GPFS |
IJ34136 |
High Importance |
With thousands of client nodes mounted in the file system, adding some more disks serviced by ESS 3000 nodes can cause long waiters trying to get NSD disk information on each client node.
Symptom |
Stuck mmadddisk command |
Environment |
Linux |
Trigger |
Create new NSD disks from an ESS 3000 or an ECE cluster and add them to a file system before starting the GPFS service. |
Workaround |
Restart the GPFS service on the ESS 3000 or ECE nodes, or fail over the RG master from one node to the other one. |
|
5.1.1.3 |
ESS 3000, ECE, Admin commands |
IJ34142 |
High Importance |
The automatic restart of NFS (remedy action) is blocked by an open unmounted_fs_check event which is not relevant for NFS/SMB exports.
Symptom |
Performance Impact/Degradation |
Environment |
Linux (CES nodes running NFS) |
Trigger |
File systems with the automount flag set and an unmounted file system |
Workaround |
Remove the "automount" flag from the affected file system. |
|
5.1.1.3 |
System health |
IJ34144 |
Suggested |
The RAS event dir_sharedroot_perm_problem was sometimes raised by mmhealth without need; when it was warranted, the event description did not describe what is wrong with the permissions and which permissions should be provided.
Symptom |
Error output |
Environment |
Linux |
Trigger |
cesSharedRoot does not have permissions 'rx' for 'group' and 'others'. |
Workaround |
Provide the necessary permissions for cesSharedRoot ('rx' for 'group' and 'others'). |
|
5.1.1.3 |
System health |
IJ34145 |
High Importance |
The Mellanox firmware manager was called frequently (around every minute) by the system health monitor. That caused a high CPU load.
Symptom |
Performance Impact/Degradation |
Environment |
All |
Trigger |
The Mellanox firmware check is executed too frequently by the system health monitor. There is no need for so much checking. |
Workaround |
None |
|
5.1.1.3 |
System health |
IJ34151 |
Suggested |
The timestamps displayed in the output of "mmdiag --iohist" on Windows nodes may show incorrect values, especially for the decimal part of the seconds. This may also cause incorrect duration reporting of the affected I/O operations.
Symptom |
Unexpected Results/Behavior |
Environment |
Windows (x86_64) |
Trigger |
None |
Workaround |
None |
|
5.1.1.3 |
Admin commands |
IJ34152 |
Critical |
mmsysmon daemon does not start and mmhealth does not work on AIX.
Symptom |
Component Level Outage |
Environment |
AIX |
Trigger |
Installing Scale 5.1.1.0-5.1.1.2 on an AIX node. |
Workaround |
1. In the file /usr/lpp/mmfs/lib/mmsysmon/CallhomeUpdateRequest.py, remove the line "import requests". 2. Restart Sysmonitor on this node: mmsysmoncontrol restart |
|
5.1.1.3 |
System health, Call home, GUI |
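The two workaround steps above can be rehearsed as a short shell sketch. It operates on a scratch copy rather than the real /usr/lpp/mmfs/lib/mmsysmon/CallhomeUpdateRequest.py, GNU sed's `-i` is assumed, and the restart step is shown only as a comment.

```shell
# Rehearse step 1 of the workaround on a scratch copy of the file.
f=$(mktemp)
printf 'import os\nimport requests\nimport sys\n' > "$f"
sed -i '/^import requests$/d' "$f"   # drop the import that fails on AIX
grep -c '^import' "$f"               # two imports remain (os, sys)
rm -f "$f"
# Step 2 on the real node: mmsysmoncontrol restart
```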
IJ34190 |
Suggested |
Ganesha fails to open files when over 1 million files are open.
Symptom |
Check for logs "Futility count exceeded. Client load is opening FDs faster than the LRU thread can close them." and values of current_open and former_open. |
Environment |
Linux |
Trigger |
Whenever a client opens more than 1 million files. |
Workaround |
None |
|
5.1.1.3 |
CES NFS |
IJ34194 |
High Importance |
When an application reads with an I/O size that is a multiple of the GPFS block size, prefetching doesn't start until the application issues a second read request, unless the read starts at the beginning of the file or prefetchAggressiveness is set to prefetchOnFirstAccess. This can cause slow read performance when the read I/O size is very large.
Symptom |
Performance Impact/Degradation |
Environment |
All |
Trigger |
An application issues reads with an I/O size that is much bigger than the GPFS block size. |
Workaround |
Set prefetchAggressiveness configuration to prefetchOnFirstAccess or reduce the read IO size to the GPFS block size. |
|
5.1.1.3 |
Core GPFS |
IJ34200 |
Suggested |
When the mmchmgr command is used to assign a new file system manager, it could fail with a "No log available" message after the current file system panics with a "No log available" error. This can happen if the file system is not externally mounted on any node.
Symptom |
Error output/message |
Environment |
All |
Trigger |
Using the mmchmgr command to assign a new file system manager. |
Workaround |
Mount the file system before issuing the mmchmgr command. |
|
5.1.1.3 |
Core GPFS |
IJ34221 |
High Importance |
Too many slots are reported by tslsenclslot for an LSI enclosure which reports duplicate enclosure ids.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
Users with LSI Megaraid enclosures which have repeated eidx values when using the 'storcli /call/eall show all j' command. |
Workaround |
None |
|
5.1.1.3 |
ESS, GNR |
IJ34289 |
Critical |
AFM gateway may assert if the home server is not responding during a prefetch.
Symptom |
Crash |
Environment |
Linux |
Trigger |
AFM prefetch |
Workaround |
Stop prefetch until the efix is installed. |
|
5.1.1.3 |
AFM |
IJ34315 |
Suggested |
After a remote error 2, the fileset goes to the NeedResync state.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
The fileset is getting replicated to COS and there is a rename operation in the queue. |
Workaround |
None |
|
5.1.1.3 |
AFM |
IJ34389 |
Critical |
Running online fsck in repair mode (-o -y) can cause it to detect false-positive lost blocks (that is, blocks that are actually assigned to files) and mark them as free, which can lead to duplicate block corruptions.
Symptom |
Data corruption due to duplicate blocks |
Environment |
All |
Trigger |
Running online fsck in repair mode (-o -y) |
Workaround |
Use offline fsck to fix corruptions. |
|
5.1.1.3 |
Online FSCK |
IJ34393 |
Critical |
Hard lockup between 2 pemsmod kernel threads can panic the kernel. Stack trace at vmcore-dmesg.txt will have something like this: [88432.803601] CPU: 27 PID: 14563 Comm: pemsRollUpQueue Kdump: loaded Tainted: G
Symptom |
Kernel crash |
Environment |
Linux (x86_64) |
Trigger |
System running heavy I/O workload |
Workaround |
None |
|
5.1.1.3 |
ESS, GNR |
IJ34170 |
Suggested |
The timestamps displayed in the output of "mmdiag --iohist" on Windows nodes may show incorrect values, especially for the decimal part of the seconds. This may also cause misreporting of the duration of the affected I/O operations.
Symptom |
Unexpected Results/Behavior |
Environment |
Windows (x86_64) |
Trigger |
Running "mmdiag --iohist" on Windows |
Workaround |
None |
|
5.1.1.3 |
Admin commands |
IJ34251 |
High Importance |
Too many slots are reported by tslsenclslot for an LSI enclosure which reports duplicate enclosure ids.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
Users with LSI Megaraid enclosures which have repeated eidx values when using the 'storcli /call/eall show all j' command. |
Workaround |
None |
|
5.1.1.3 |
ESS, GNR |
IJ32947 |
High Importance |
On an AIX node, on some occasions, including when the /var file system becomes full, mmfsd is unable to run child processes, resulting in different failures depending on the process that mmfsd attempts to run. Operations that have been seen to fail include mmadddisk and mmauth.
Once the problem is triggered, it remains until the mmfsd daemon is restarted. If the problem is initiated by the /var file system getting full, freeing up space on that file system is not enough to solve the problem. An indication that the problem is taking place is in the output of the /usr/lpp/mmfs/bin/tslsfs nonexistent_FS command (that is, passing the name of a nonexistent file system as the parameter). On a system where the problem is occurring, the output will be "mmcommon getEFOptions nonexistent_FS failed. Return code 1", while on a system without the problem, the output will be "mmcommon: File system nonexistent_FS is not known to the GPFS cluster."
Symptom |
Unexpected Results/Behavior |
Environment |
AIX |
Trigger |
A likely trigger for the problem is the /var file system being filled, possibly around the time an operation is taking place that results in information being produced to the mmfs.log file. |
Workaround |
Once the issue in /var is resolved, restart mmfsd. |
|
5.1.1.2 |
Core GPFS |
IJ33003 |
Suggested |
While using IBM Spectrum Scale Erasure Code Edition running on LSI MegaRaid adapters, if the slotmap.yaml file is edited directly, several unintended consequences can arise that would not show up when using the drive mapping utility. This can include allowing several disallowed characters such as the hyphen in the location code name.
Symptom |
Error output/message Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
Users who edit the slotmap.yaml file on LSI MegaRAID systems may be affected. |
Workaround |
Avoid using leading "0" when creating slot location codes. After editing a slotmap.yaml file, run tslsenclslot.lmr --check-slot-map to verify that the mapped location codes are valid and have the expected form. |
|
5.1.1.2 |
ESS, ECE, GNR |
IJ33049 |
High Importance |
In the current implementation of eviction on a file, the eviction program first acquires a DMAPI lock on the file and punches a hole in it. The program can be terminated at any point without the DMAPI lock being released, causing a lock leak; a later DMAPI lock acquire on the file can then deadlock, and the only way out is to restart mmfsd.
Symptom |
Deadlock |
Environment |
Linux, AIX |
Trigger |
Trying to evict a file or list of files, and the eviction getting killed midway through. |
Workaround |
None |
|
5.1.1.2 |
AFM |
IJ33082 |
High Importance |
If a new file is created and renamed before AFM could replicate it to the COS, with parallel IO enabled, an incorrect target path is sent to the worker gateway node, causing remote error 2.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
AFM replication with parallel IO enabled |
Workaround |
Disable parallel IO. |
|
5.1.1.2 |
AFM |
IJ33084 |
Suggested |
Mounting a file system can hang.
Symptom |
File system mount hangs. |
Environment |
Linux |
Trigger |
Mounting a file system. |
Workaround |
None |
|
5.1.1.2 |
Core GPFS |
IJ33095 |
Suggested |
Assert "(verify == 0) || (ofP == __null) || (ofP->sgP == __null) || ofP->isRoSnap() || (ofP->metadata.getInodeStatus() != 1) || (!ofP->sgP->isFileIncludedInSnapshot(ofP->getInodeNum(), ofP->getSnapId(), getInodeStatus())) || (ofP->assertInodeWasCopiedToPrevSnapshot()) || (ofP->isBeingRestriped() || ofP->beenRestriped)".
Symptom |
Daemon crash |
Environment |
All |
Trigger |
Operations triggering a statlite call on a node without sufficient stat file token. |
Workaround |
Disable the statlite config parameter with "mmchconfig statliteMaxAttrAge=0 -i". |
|
5.1.1.2 |
Core GPFS |
IJ33103 |
Critical |
The afmParallelMounts option can be enabled at the fileset level, which creates parallel mounts to the different NFS servers. Some dentries created as part of the parallel mounts may not connect to the file system root dentry, causing a VFS busy-inodes issue when the fileset is stopped or unlinked.
Symptom |
Unexpected results |
Environment |
Linux |
Trigger |
AFM replication with afmParallelMounts enabled |
Workaround |
Disable afmParallelMounts. |
|
5.1.1.2 |
AFM, AFM DR |
IJ33163 |
Suggested |
This occurs on a compliant or compliant-plus mode fileset whose immutable files remain as is. When such files are taken up for AFM replication, the Resync/Recovery path can set the immutable attribute at the secondary and also remove the write flag. This ends up being seen as an ACL data mismatch between the sites.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
None |
Workaround |
None |
|
5.1.1.2 |
AFM |
IJ33173 |
Critical |
During reconnect in the middle of a write operation, the below error may be reported: 2021-03-30_12:59:35.050-0400: [W] Encountered first checksum error on network I/O from NSD Client 10.10.10.10
Symptom |
IO error |
Environment |
Linux (s390x) |
Trigger |
An unstable network, which can lead to TCP connection reconnects. |
Workaround |
None |
|
5.1.1.2 |
Core GPFS |
IJ33174 |
Suggested |
Compliant and Compliant-Plus fileset modes can stall the queue.
Symptom |
Unexpected Behavior |
Environment |
Linux |
Trigger |
Role reversal in compliant IAM mode, with the filesets having Immutable files with expiration time set on them. |
Workaround |
None |
|
5.1.1.2 |
AFM, AFM DR |
IJ33177 |
Suggested |
When compiling gpfs.gplbin rpm packages on RHEL8 for multiple kernel versions, installing them at the same time might fail due to conflicting build ids in the packages.
Symptom |
Upgrade/Install failure |
Environment |
Red Hat Enterprise Linux 8.x |
Trigger |
RHEL8 RPM builds |
Workaround |
Remove the installed gpfs.gplbin package before installing the new one. |
|
5.1.1.2 |
Core GPFS |
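A general rpmbuild technique, not a Scale-specific option, that avoids build-id file conflicts between kernel-module packages is to disable build-id symlink generation when rebuilding the gpfs.gplbin packages; whether your rpm version (4.14+) honors this macro is an assumption to verify before relying on it.

```
# ~/.rpmmacros on the build host, or pass on the command line:
#   rpmbuild --define "_build_id_links none" ...
%_build_id_links none
```

The documented workaround, removing the installed gpfs.gplbin package before installing the new one, remains the safe path.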
IJ33190 |
High Importance |
IBM Spectrum Scale on an AIX node will crash when trying to put an NFSv4 ACL on a .snapshots directory (e.g. through the "aclput -t nfs4" command).
Symptom |
Abend/Crash |
Environment |
AIX |
Trigger |
Storing NFSv4 ACL on .snapshots directory on an AIX node. |
Workaround |
Do not try this operation. |
|
5.1.1.2 |
Core GPFS |
IJ33365 |
Suggested |
mmnetverify creates temporary test files when validating network functionality. After mmnetverify executes, the temporary test files still exist on the tested nodes of the cluster.
Symptom |
Accumulation of files in /var/mmfs/tmp and /tmp directories |
Environment |
Linux |
Trigger |
Running mmnetverify will cause the test files to be created. |
Workaround |
Run this command: mmdsh -N all rm -rf /var/mmfs/tmp/copy_file.* /tmp/copy_file.* |
|
5.1.1.2 |
mmnetverify |
IJ33366 |
Critical |
A readdir operation fails after a rename operation on an AFM object fileset in independent-writer mode, due to incorrect updates of the remote attributes.
Symptom |
Unexpected Results |
Environment |
All |
Trigger |
Rename on an AFM Object fileset in IW mode. |
Workaround |
None |
|
5.1.1.2 |
AFM |
IJ33367 |
High Importance |
A failover situation was generated by the NFS health monitor while a node was expelled in the cluster. The NFS service monitor detected a potential hung situation. As a result, a failover was triggered even though the system was able to recover itself after several minutes.
Symptom |
Performance Impact/Degradation |
Environment |
Linux (CES nodes) |
Trigger |
The NFS service monitor detected a potential hang, which means that the NFS NULL check failed and the number of internal NFS operations did not increase for a while (around 60 seconds). During that time NFS is in grace mode (allowing previous clients to reclaim their locks) and therefore unable to let new clients start their I/O work. This grace time was not considered by the systemhealth monitor, but it should increase the waiting time. |
Workaround |
The systemhealth monitor can be configured with a configuration option to signal a degraded state (nfs_unresponsive event) instead of triggering a failover (nfs_not_active event, error state). |
|
5.1.1.2 |
System health |
IJ33368 |
Suggested |
The mmces events active object command is failing because object is not a valid option.
Symptom |
Unexpected Behavior |
Environment |
Linux (CES nodes) |
Trigger |
The OBJ option was not included in the mmces events active command. |
Workaround |
None |
|
5.1.1.2 |
CES |
IJ33418 |
High Importance |
When the fileset moves to the unmounted or disconnected state, there is a window where SETXATTR operations from an SW cache can get queued to a non-GPFS home site and remain queued forever.
Symptom |
Unexpected Behavior |
Environment |
Linux (AFM gateway nodes) |
Trigger |
Fileset moving to unmounted or disconnected state and running SETXATTR operations from SW cache |
Workaround |
1. Drop the queued SetXattr operation using the "mmfsadm afm msgdrop" command, or 2. Ensure that the fileset is always active/dirty before performing SETXATTR operations at the SW/IW cache site. |
|
5.1.1.2 |
AFM |
IJ33419 |
Suggested |
The mmafmctl command has a provision to reclaim deleted inodes when resync is run manually. The inodes are reclaimed only under some conditions and not in other cases.
Symptom |
Unexpected Behavior |
Environment |
Linux (AFM gateway nodes) |
Trigger |
Running manual resync on fileset without recovery, with afmSkipResyncRecovery flag set on the fileset level. |
Workaround |
1. A full resync and then recovery needs to be run in order to reclaim deleted inodes. 2. A resync with afmSkipResyncRecovery tuned at the cluster level using the mmchconfig command should be run; fileset-level tuning doesn't work currently. 3. Worst case, an mmfsck needs to be run to reclaim the inodes. |
|
5.1.1.2 |
AFM |
IJ33420 |
Suggested |
When a thread performing shutdown and a thread initiating startup run concurrently, a kernel crash can result.
Symptom |
Abend/Crash |
Environment |
Linux |
Trigger |
Very small race window in GPFS cleanup process |
Workaround |
None |
|
5.1.1.2 |
Core GPFS |
IJ33421 |
Suggested |
If a Linux node is overloaded and the thread cannot be scheduled quickly, a kernel panic can result: RIP list_del_entry_valid.cold.
Symptom |
Abend/Crash |
Environment |
Linux |
Trigger |
mmshutdown on a busy Linux node |
Workaround |
None |
|
5.1.1.2 |
Core GPFS |
IJ33254 |
HIPER |
AFM might incorrectly drop write messages during an AFM recovery, causing a data mismatch between the cache or primary and the home or secondary cluster. AFM recovery is triggered if the in-memory queue is lost, for example, by a gateway node restart. With parallel IO enabled, WriteSplit messages are sent to the worker gateway nodes to write the file in parallel. If a WriteSplit message fails on the worker gateway node, the failed WriteSplit request is retried three times before it is dropped. Since the Write request is dropped without replicating the data to the home or secondary, it results in a data mismatch between the cache or primary and the home or secondary.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
AFM recovery with parallel IO enabled |
Workaround |
Disable parallel IO using the command "mmchfileset device fileset -p afmParallelWriteThreshold=disable" |
|
5.1.1.2 |
AFM, AFM DR |
IJ33530 |
Critical |
AFM gateway node crash when the home or secondary is not responding.
Symptom |
Crash |
Environment |
Linux |
Trigger |
AFM replication when the home or secondary is not responding. |
Workaround |
None |
|
5.1.1.2 |
AFM, AFM DR |
IJ33532 |
Critical |
When the mmafmcosctl upload command is used with the --all option, AFM LU (local-updates) mode uploads an incorrect object name (the old name) if the file was already renamed at the cache.
Symptom |
Unexpected results |
Environment |
Linux |
Trigger |
Uploading the renamed objects from the LU mode cache. |
Workaround |
None |
|
5.1.1.2 |
AFM |
IJ33535 |
Critical |
tsenclstat causes a coredump with segmentation fault whenever it runs on a system with only one SAS adapter hooked up to a storage enclosure. This most commonly occurs in a "daisy-chain" configuration.
Symptom |
Error output/message |
Environment |
All |
Trigger |
This issue affects customers with daisy-chain storage enclosure configurations in which only one SAS adapter is connected from the server to a storage enclosure. It occurs whenever tsenclstat runs, which will occur automatically every few minutes as part of the daemon's regular status check. |
Workaround |
Ensure there are two SAS adapters hooked up to each storage enclosure. |
|
5.1.1.2 |
ESS, GNR |
IJ33568 |
High Importance |
cNFS does not work on RHEL8.x. This is due to a change in the pidof command in RHEL8.
Symptom |
Unexpected Results/Behavior, Node Reboot |
Environment |
Red Hat Enterprise Linux 8.x |
Trigger |
Enabling cNFS on RHEL8.x nodes. |
Workaround |
Downgrade or upgrade procps-ng to procps-ng-3.3.15-3.el8 |
|
5.1.1.2 |
cNFS |
IJ33567 |
High Importance |
AFM Prefetch is not generating the prefetch end callback event registered through the afmPrepopEnd event.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
Register for afmPrepopEnd callback event, and run AFM prefetch with list file or directory option. |
Workaround |
None |
|
5.1.1.2 |
AFM |
IJ33607 |
Suggested |
[X] logAssertFailed: numaNodesP[node].numaNode != -2 in mmfs.log.latest and daemon will not start
Symptom |
Abend/Crash |
Environment |
Linux |
Trigger |
One or more NUMA nodes without any CPU or memory resources. |
Workaround |
Reallocate the LPAR and ensure there are no NUMA nodes without any CPU or memory resources. |
|
5.1.1.2 |
NUMA Awareness |
IJ32097 |
High Importance |
If the disks for a file system are not ready to be used yet and the command "mmfsadm dump deferreddeletions" is run at the same time, the command will fail with the side effect of causing a long waiter 'waiting for SG cleanup' when the file system is deleted and recreated.
Symptom |
Long Waiters |
Environment |
All |
Trigger |
NA |
Workaround |
None |
|
5.1.1.1 |
Core GPFS |
IJ32159 |
High Importance |
Operations requiring allocation of full metadata blocks are slow. Examples: expanding the number of allocated inodes, creating a new independent fileset.
Symptom |
Performance Impact/Degradation |
Environment |
All |
Trigger |
Operations requiring allocation of full metadata blocks. Examples: expanding the number of allocated inodes, creating a new independent fileset. |
Workaround |
Add more disks to the system pool. |
|
5.1.1.1 |
Core GPFS |
IJ32186 |
High Importance |
There appears to be an issue at the systemd layer that causes the startup service to fail with a connection timeout during reboot. If autoload is set to yes, GPFS may not be able to start up, or it may get stuck waiting for the environment to be initialized.
Symptom |
GPFS does not start after a reboot. |
Environment |
Linux |
Trigger |
This issue affects clusters with autoload set to yes that hit the systemd connection timeout during reboot. |
Workaround |
Manually restart GPFS. |
|
5.1.1.1 |
GPFS startup, CCR, systemd |
IJ31735 |
Suggested |
The gpfs_next_inode and gpfs_stat_inode APIs return inode 0 as the first inode with an invalid state.
Symptom |
Unexpected result |
Environment |
All |
Trigger |
gpfs_next_inode/gpfs_stat_inode APIs |
Workaround |
None |
|
5.1.1.1 |
GPFS APIs |
IJ31841 |
High Importance |
When getting the stats of a file, users could run into the assert: "Assert exp((verify == 0) || (ofP == __null) || (ofP->sgP == __null) || ofP->isRoSnap() || (ofP->metadata.getInodeStatus() != 1) || !ofP->sgP->isFileIncludedInSnapshot(ofP->getInodeNum(), ofP->getSnapId(), getInodeStatus())) || (ofP->assertInodeWasCopiedToPrevSnapshot()) || (ofP->isBeingRestriped() || ofP->beenRestriped)" if there are writes to the same file from other nodes.
Symptom |
Daemon crash |
Environment |
Linux |
Trigger |
Getting the lite stat of a file while writes are in progress from other nodes. |
Workaround |
Run the mmchconfig command to reset the configuration "statliteMaxAttrAge=0", which disables statlite and avoids this problem, but it may also impact write performance on the other nodes. |
|
5.1.1.1 |
gpfs_statlite API |
IJ32218 |
Critical |
AFM prefetch fails with "too many open files" error.
Symptom |
Unexpected results |
Environment |
All |
Trigger |
AFM prefetch |
Workaround |
None |
|
5.1.1.1 |
AFM |
IJ32219 |
High Importance |
AFM logs error 124 (error not supported) when the control file is not available at the home site (a non-GPFS home site).
Symptom |
Daemon crash |
Environment |
Linux gateway nodes |
Trigger |
Try to set EAs on a file when home is a non-GPFS node which doesn't contain the AFM control file. |
Workaround |
None |
|
5.1.1.1 |
AFM |
IJ32223 |
Suggested |
After converting a legacy recovery group to an mmvdisk-managed recovery group, poor write performance is observed from an application, and the GPFS daemon does not come up on some nodes because of an OOM issue.
Symptom |
Abend/Crash Performance Impact/Degradation |
Environment |
Linux |
Trigger |
When converting legacy recovery group to mmvdisk managed recovery group by using the following command: mmvdisk recoverygroup convert --recovery-group RgName[,RgName] --node-class NcName |
Workaround |
Use the following command to reset the pagepool to 60%: mmvdisk server change --node-class NcName --pagepool 60% --recycle one |
|
5.1.1.1 |
ESS, GNR |
IJ32226 |
High Importance |
When users run the mmlsfileset command, it randomly doesn't show the junction paths of some filesets.
Symptom |
Unexpected results |
Environment |
All |
Trigger |
One fileset's root directory has been corrupted for an unknown reason. |
Workaround |
None |
|
5.1.1.1 |
Fileset |
IJ32227 |
High Importance |
logAssertFailed: isNotCached() at ShHashS.C
Symptom |
Abend/Crash |
Environment |
All |
Trigger |
The race occurs between the initialization and release of the indirect block descriptor. |
Workaround |
This assert can be safely ignored by using mmchconfig disableAssert='ShHashS.C:5400-5800:isNotCached()' |
|
5.1.1.1 |
Core GPFS |
IJ31571 |
High Importance |
When mmchattr is issued with "--no-attr-ctime", it should not result in a ctime update.
Symptom |
Unexpected results |
Environment |
All |
Trigger |
mmchattr --no-attr-ctime |
Workaround |
None |
|
5.1.1.1 |
Core GPFS |
IJ32238 |
Suggested |
The system health monitor does not detect all paths for RDMA support (the libibverbs.so library) on Ubuntu machines and therefore reports an "ib_rdma_libs_wrong_path" issue.
(show details)
Symptom |
Error output/messages |
Environment |
Ubuntu Linux |
Trigger | The issue shows up on Ubuntu machines with RDMA in use. |
Workaround |
None |
|
5.1.1.1 |
System health |
IJ32245 |
Suggested |
The error "Command: err 46: tsunlinkfileset -f" occurs after mmunlinkfileset commands are invoked.
(show details)
Symptom |
Unable to unlink or delete the fileset which encountered this error. |
Environment |
All |
Trigger | Invoking the mmunlinkfileset command |
Workaround |
Reboot the node and retry the mmunlinkfileset command. |
|
5.1.1.1 |
Filesets |
IJ32287 |
Critical |
Application performance degradation while running on AFM filesets.
(show details)
Symptom |
Performance Impact/Degradation |
Environment |
Linux |
Trigger |
AFM replication |
Workaround |
None |
|
5.1.1.1 |
AFM, AFM DR |
IJ32344 |
Suggested |
When a fileset is created using mmafmcosconfig with the --perm option, the file entries of this fileset are created with the default 700 permission instead of the value specified with --perm.
(show details)
Symptom |
Permission set to default 700. |
Environment |
Linux |
Trigger | The file entries get listed from fileset root path. |
Workaround |
None |
|
5.1.1.1 |
AFM |
IJ32481 |
HIPER |
AFM recovery may incorrectly delete the files at home or secondary if there are any network issues during the home readdir.
(show details)
Symptom |
Unexpected Results |
Environment |
Linux |
Trigger |
AFM recovery |
Workaround |
Resync the fileset if there are any missing files at home. |
|
5.1.1.1 |
AFM, AFM DR |
IJ32506 |
Critical |
Assert exp(isFastCondvarPrepSignal(fcLockP->ul) && fcLockP->lw.slot < 16384) in line 4570 of file /project/sprelmax511/build/rmax511067B2b/src/avs/fs/mmfs/ts/tasking/dSynch.C
(show details)
Symptom |
Abend/Crash |
Environment |
Linux |
Trigger |
Race condition between two RDMA threads |
Workaround |
None |
|
5.1.1.1 |
RDMA |
IJ32507 |
High Importance
|
When a dependent fileset is created and linked under an AFM independent fileset, ACLs from the home dependent fileset are not fetched and set on the cache dependent fileset. This happens only for the dependent fileset root path.
(show details)
Symptom |
Unexpected Results |
Environment |
Linux |
Trigger |
AFM caching with dependent filesets |
Workaround |
None |
|
5.1.1.1 |
AFM |
IJ32521 |
High Importance
|
If the file system panics while the rapid repair functionality is being enabled or disabled with the mmchfs command, log recovery could fail due to the log records generated by rapid repair.
(show details)
Symptom |
I/O error for log recovery |
Environment |
All |
Trigger |
File system panic happens while the rapid repair is being enabled or disabled. |
Workaround |
None |
|
5.1.1.1 |
Rapid repair |
IJ32560 |
Suggested |
Copying an uncached file from a Samba share fails with the object backend while writing data to the cache. There is another issue: if a setXattr operation is in the queue, a synchronous read for the same file fails to return data to the application.
(show details)
Symptom |
Read operation fails. |
Environment |
Linux |
Trigger | Read uncached file from Samba share of the AFM cache. |
Workaround |
Add node names to the /etc/hosts file. |
|
5.1.1.1 |
AFM |
IJ32553 |
High Importance
|
AFM prefetch fails with error 238 if the prefetch list file contains symlinks and if their target paths do not exist as part of the same fileset.
(show details)
Symptom |
Unexpected results |
Environment |
Linux |
Trigger |
AFM prefetch |
Workaround |
None |
|
5.1.1.1 |
AFM |
IJ32554 |
High Importance
|
Issuing "mmchnode --daemon-interface" attempts to change the cluster configuration repository (CCR). When this mmchnode is issued from a Windows node, the CCR gets committed with invalid IPv4 information, leaving the cluster in a non-working state.
(show details)
Symptom |
The mmchnode command fails with a message trail resembling: 'mmchnode: Unable to commit new changes.' 'mmchnode: [E] The command was unable to reach the CCR service on any quorum node. Ensure the CCR service (mmfsd or mmsdrserv daemon) is running on all quorum nodes and the communication port is not blocked by the firewall.' 'mmchnode: 6027-1271 Unexpected error from function setRunningCommand. Return code: 149' |
Environment |
Windows (x86_64) |
Trigger |
Issuing "mmchnode --daemon-interface" command on a Windows node specifying an alternate IPv4 address. |
Workaround |
None. A manual CCR restore (mmsdrrestore --ccr-repair) may be necessary to restore the cluster to a working state. |
|
5.1.1.1 |
CCR |
IJ32554 |
Suggested |
The Linux fallocate(2) API doesn't work correctly on Spectrum Scale file systems when punching a hole beyond the end of the file.
(show details)
Symptom |
Punching a hole beyond the end of a file fails with EINVAL(22) error. |
Environment |
Linux |
Trigger | Punching a hole through the Linux fallocate(2) API. |
Workaround |
None |
|
5.1.1.1 |
fallocate(2) |
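The failing call path above can be exercised from the shell with the util-linux fallocate(1) wrapper around the fallocate(2) API; a minimal sketch, assuming util-linux is installed (the temporary file and sizes are illustrative, not part of the APAR):

```shell
# Create a 1 MiB sparse file, then punch a 64 KiB hole at offset 0.
# On affected Spectrum Scale levels, a hole beyond EOF failed with EINVAL(22).
f=$(mktemp)
truncate -s 1M "$f"
if fallocate --punch-hole --offset 0 --length 65536 "$f" 2>/dev/null; then
    echo "hole punched"
else
    echo "punch-hole not supported on this filesystem"
fi
rm -f "$f"
```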
IJ32608 |
High Importance
|
With the introduction of 5-level page tables, supported by Intel's Ice Lake processor generation, user-space memory is expanded by a factor of 512. This changes the kernel base address, and as a result GPFS asserts with the message "logAssertFailed: (UIntPtr)(vmallocStart)" while validating kernel addresses.
(show details)
Symptom |
Assert |
Environment |
Linux (x86_64) |
Trigger |
Systems that attempt to install Spectrum Scale on a newer Intel x86_64 processor with 5-level page tables enabled. |
Workaround |
Disable the 5-level page table setting by adding no5lvl to the kernel command line and then rebooting the node. Check the documentation of the Linux distribution used for details on how to apply this change. For example, on RHEL 8:
# grubby --update-kernel=ALL --args="no5lvl"
# cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-240.10.1.el8_3.x86_64 root=/dev/mapper/rhel-root ro crashkernel=auto resume=/dev/mapper/rhel-swap rd.lvm.lv=rhel/root rd.lvm.lv=rhel/swap rhgb quiet net.ifnames=0 biosdevname=0 no5lvl |
|
5.1.1.1 |
Core GPFS |
IJ32627 |
High Importance
|
When doing preallocation and writes (e.g., Spectrum Protect Plus copy restore), the block usage of the file system is bigger than the total data size of these files.
(show details)
Symptom |
More disk space usage than expected. |
Environment |
All |
Trigger |
Preallocate the data blocks of the file, and then write as much data as the file size. |
Workaround |
Issue this command: mmchattr --compact=fragment |
|
5.1.1.1 |
Disk space preallocation of files |
IJ32628 |
Suggested |
When the mmdf command is run from a current working directory that has become stale (the directory was deleted after changing into it), the command states it was run from an invalid directory.
(show details)
Symptom |
The command states it was run from an invalid directory, and it fails with various additional errors. |
Environment |
All |
Trigger | Running the mmdf command from a directory that is stale (directory was deleted after going to it). |
Workaround |
Only use mm commands in a valid current working directory. Move to a directory that still exists within the node's file systems. |
|
5.1.1.1 |
Core GPFS |
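The workaround above can be automated with a small pre-flight check; a minimal sketch, assuming a POSIX shell (the fallback directory / is illustrative):

```shell
# Verify the current working directory still exists before running
# mm* administration commands; move to / if it has gone stale.
if [ -d "$(pwd)" ]; then
    echo "cwd valid"
else
    cd / && echo "moved to /"
fi
```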
IJ32632 |
Suggested |
Long waiters when running file audit logging or watch folder
(show details)
Symptom |
Long Waiters |
Environment |
All |
Trigger | Heavy stress on audited or watched filesystems or filesets. |
Workaround |
None |
|
5.1.1.1 |
Watch folder, File audit logging |
IJ32648 |
High Importance
|
GPFS allows NSD names of up to 255 characters, and there is no rule requiring them to contain an alphabetic character. Sufficiently long NSD names consisting entirely of digits can be a problem: two such NSDs can incorrectly be identified as the same NSD.
(show details)
Symptom |
Error output/message Unexpected Results/Behavior |
Environment |
All |
Trigger |
Long NSD names of all digits |
Workaround |
Add an alphabetic character to the NSD name. |
|
5.1.1.1 |
Core GPFS, Admin commands |
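NSD names that consist entirely of digits can be screened for before creation; a minimal sketch in POSIX shell, with hypothetical example names:

```shell
# Flag NSD names made up only of digits, which this APAR notes can be
# misidentified; names containing at least one letter are fine.
for name in 1234567890123 nsd001 987; do
    case "$name" in
        *[!0-9]*) echo "$name: ok" ;;
        *)        echo "$name: all digits -- consider renaming" ;;
    esac
done
```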
IJ32649 |
Critical |
On HAWC-enabled file systems, 'down' disks can cause a replica mismatch after the file system repair.
(show details)
Symptom |
Operation failure due to file system corruption |
Environment |
All |
Trigger |
Writes to HAWC-enabled file systems which have 'down' disks |
Workaround |
None |
|
5.1.1.1 |
HAWC |
IJ32651 |
High Importance
|
When a disk is down, the assertion "Assert exp(!addrDirty && !synchedStale) in line 6446 of file bufdesc.C" may be hit during a directory block merge if the block resides on the down disk.
(show details)
Symptom |
Abend/Crash |
Environment |
All |
Trigger |
Disk is down. |
Workaround |
None |
|
5.1.1.1 |
Core GPFS |
IJ32666 |
High Importance
|
logAssertFailed: mdiWorkingIndexP[entryIndex].wSlotAddr == slots line 4956 mdIndex.C when doing recovery group master recovery
(show details)
Symptom |
Abend/Crash |
Environment |
Linux (x86_64, PPC64, PPC64LE) |
Trigger |
RG master failure which causes recovery. |
Workaround |
None |
|
5.1.1.1 |
ESS, GNR |
IJ32667 |
Suggested |
Offline fsck cannot repair all corruptions when using the option of applying a patch file (i.e., mmfsck FSchk -v --patch-file path-towrite-patchfile --patch). When repairing corruption by applying a patch file, the fsck output shows the following messages indicating the issue: ---------------- Invalid BlockType Inode. Skipping patch. ----------------
(show details)
Symptom |
Error output/message and all corruptions not fixed |
Environment |
All |
Trigger | Offline fsck repairing corruptions by applying patch file. |
Workaround |
Run fsck repair with the regular option -y to fix the corruptions. |
|
5.1.1.1 |
FSCK |
IJ32668 |
Suggested |
There is no longer any difference in the stdout format between "mmces state cluster NFS" and "mmces state cluster NFS -Y". In former versions, a readable table was generated when -Y was not used, but the current output of "mmces state cluster NFS" is, for example:
# mmces state cluster NFS
mmcesstatecluster::HEADER:version:reserved:reserved:NODE:COMPONENT:STATE:EVENTS:
mmcesstatecluster::0:1::::nas22ces01-i:NFS:HEALTHY:csm_resync_forced,no_longwaiters_found,ccr_quorum_nodes_ok,service_running,node_resumed,nfs_dbus_ok,node_resumed,dns_found,dns_found:
mmcesstatecluster::0:1::::nas22ces02-i:NFS:HEALTHY:csm_resync_forced,ccr_quorum_nodes_ok,nlockmgr_rpcinfo_ok,mountd_rpcinfo_ok,service_running,node_resumed,nfs_rpcinfo_ok,nfsd_up,nfs_dbus_ok,wnbd_up,node_resumed,ads_up,dns_found,dns_krb_tcp_dc_msdcs_up,dns_found,dns_query_ok:
mmcesstatecluster::0:1::::nas22ces03-i:NFS:HEALTHY:csm_resync_forced,ccr_quorum_nodes_ok,service_running,node_resumed,nfs_dbus_ok,node_resumed,dns_found,dns_found:
mmcesstatecluster::0:1::::nas22ces04-i:NFS:HEALTHY:service_running,ccr_quorum_nodes_ok,service_running,node_resumed,nfs_dbus_ok,service_running,node_resumed,dns_found,dns_found:
mmcesstatecluster::0:1::::nas22ces05-i:NFS:HEALTHY:service_running,ccr_quorum_nodes_ok,service_running,node_resumed,nfs_dbus_ok,service_running,node_resumed,dns_found,dns_found:
mmcesstatecluster::0:1::::nas22ces06-i:NFS:HEALTHY:service_running,nfs_exported_fs_chk,nlockmgr_rpcinfo_ok,mountd_rpcinfo_ok,service_running,nfs_rpcinfo_ok,nfsd_up,service_running,dns_found,dns_found
(show details)
Symptom |
Incorrect output |
Environment |
Linux |
Trigger | Parsing of the machine readable output was not done correctly. |
Workaround |
None |
|
5.1.1.1 |
CES |
INFO001 |
Suggested |
The release of v5.1.1.0 aligned with the release of the v5.1.0.3 PTF. Please refer to v5.1.0.3 for the list of APARs.
|
5.1.1.0 |
INFO |