IJ54197 |
High Importance
|
tsapolicy calls flushBuf() for each chosen record that is written when a policy list operation is performed. This may result in additional processing time. This problem does not occur if "-I prepare" or "-I test" is specified.
Symptom |
Performance Impact/Degradation |
Environment |
all platforms that support mmapplypolicy |
Trigger |
Run a policy list operation with the "-I defer" or "-I yes" option, or without the -I option. |
Workaround |
none |
|
5.1.9.10 |
mmapplypolicy |
IJ54322 |
Suggested |
Since IBM Storage Scale 5.1.7.0, tsapolicy sends all debug information to the trace file. But if the cluster level is older than 5.1.7.0, one of the debug lines is unconditionally displayed in the command output. There is no functional impact.
Symptom |
Error output/message |
Environment |
all platforms that support mmapplypolicy |
Trigger |
run mmapplypolicy while minReleaseLevel is older than 5.1.7.0. |
Workaround |
none |
|
5.1.9.10 |
mmapplypolicy |
IJ54425 |
High Importance
|
Today, when a Gateway node fails and comes back up, a recovery attempt is made on almost all filesets that the Gateway node used to handle. Although the number of recoveries is capped by the afmMaxParallelRecoveries tunable, there is still a chance that all filesets attempt to mount first, flooding the Gateway node with mount requests.
Symptom |
Unexpected behavior |
Environment |
Linux Only |
Trigger |
Starting replication on many filesets belonging to the same Gateway node at the same time. |
Workaround |
Keep all filesets stopped and start only afmMaxParallelRecoveries filesets per Gateway node at a time. |
|
5.1.9.10 |
AFM |
IJ54494 |
High Importance
|
BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
Symptom |
Crash |
Environment |
Linux Only |
Trigger |
Shutting down mmfs after running a workload with CIFS for some time. |
Workaround |
None |
|
5.1.9.10 |
GPFS Core |
IJ53557 |
High Importance
|
GPFS asserted due to unexpected hold count on events exporter object during destructor.
Symptom |
Assert |
Environment |
All platforms |
Trigger |
A race condition between EventsExporterReceiverThread and EventsExporterListenThread and an error path where the destructor is called |
Workaround |
None |
|
5.1.9.10 |
All Scale Users |
IJ54780 |
Critical |
Node failure could lead to unexpected log recovery failure that would require offline mmfsck to repair. This could happen on file system with replication and snapshot.
Symptom |
Cluster/File System Outage |
Environment |
ALL Operating System environments |
Trigger |
Node failure with file system that has replication and snapshot |
Workaround |
Avoid creating snapshot |
|
5.1.9.10 |
Snapshots |
IJ54781 |
High Importance
|
GPFS daemon could fail unexpectedly with logAssertFailed: firstValidOffset >= blockOffset && lastValidOffset > firstValidOffset. This could occur on file system with HAWC enabled.
Symptom |
Abend/Crash |
Environment |
ALL Operating System environments |
Trigger |
Race between log wrap and write past end of file. |
Workaround |
Disable HAWC |
|
5.1.9.10 |
HAWC |
IJ54782 |
Critical |
When accessing an mmapped file concurrently, even from internal GPFS operations like restripe, the required flush of the mmap data can be incomplete, resulting in data from previous mmap writes not being written to disk.
Symptom |
Data corruption |
Environment |
ALL Linux OS environments |
Trigger |
Have a file mmapped for writing, write data to the mmap region, and then trigger some concurrent access (e.g., mmrestripefs) |
Workaround |
No workaround. |
|
5.1.9.10 |
All Scale Users |
IJ53214 |
High Importance
|
With FAL and NFS Ganesha enabled, running workloads against a path in an NFS export for long periods of time could result in NFS client IPs not being logged in the audit log.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
With FAL and NFS Ganesha enabled, run workloads against a path under the NFS mount point for long periods of time |
Workaround |
Restart NFS Ganesha if NFS client IPs are not being logged |
|
5.1.9.10 |
File Audit Logging, NFS |
IJ54629 |
High Importance
|
mmrestorefs recreates all files and directories that were deleted after the snapshot was taken. If a deleted file is a special file, mmrestorefs uses the mknod() system call to create the file. But mknod() cannot create a socket file on AIX. Hence, if socket files were deleted after the snapshot was taken, mmrestorefs on AIX will fail while re-creating the socket file.
Symptom |
Component Level Outage |
Environment |
AIX only |
Trigger |
run mmrestorefs when a socket file was deleted after the snapshot was taken. |
Workaround |
none |
|
5.1.9.10 |
mmrestorefs |
IJ54874 |
High Importance
|
An assertion goes off when we attempt to check whether parent and child are local and try to restrict replication for such entities.
Symptom |
Abend/Crash |
Environment |
Linux Only |
Trigger |
Replicating any message in the AFM to COS backend with the fix for 344884 |
Workaround |
None |
|
5.1.9.10 |
AFM |
IJ54878 |
High Importance
|
If the dependent fileset is created as a non-root user and linked, then the uid/gid are not replicated for the dependent fileset to the remote site.
(show details)
Symptom |
Unexpected Behavior |
Environment |
Linux Only |
Trigger |
Create and Link dependent fileset inside DR primary fileset as a non-root user. |
Workaround |
None |
|
5.1.9.10 |
AFM |
IJ54879 |
Suggested |
When capturing traces in blocking mode, the kernel error message "BUG: scheduling while atomic" can be triggered. This could also lead to a deadlocked node, requiring a reboot.
Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
Environment |
ALL Linux OS environments |
Trigger |
- |
Workaround |
Capture traces in overwrite mode, not in blocking mode. |
|
5.1.9.10 |
All Scale Users |
IJ54880 |
Suggested |
mmafmconfig enable command on AFM primary mode fileset fails
Symptom |
Unexpected Behavior |
Environment |
All OS Environments |
Trigger |
Run mmafmconfig enable command on AFM primary mode fileset |
Workaround |
None |
|
5.1.9.10 |
AFM |
IJ54783 |
High Importance
|
When trying to install Storage Scale on Windows with the latest Cygwin version (3.6.1), the installation can fail due to security issues.
Symptom |
Upgrade/Install failure. |
Environment |
Windows/x86_64 only |
Trigger |
Upgrading Cygwin to version 3.6.1 before trying to install Storage Scale on Windows |
Workaround |
Downgrade Cygwin to version 3.6.0 or below before attempting to install Storage Scale on Windows |
|
5.1.9.10 |
Install/Upgrade |
IJ54962 |
High Importance
|
Snapshots are not listed under the .snapshots directory when AFM is enabled on the file system.
Symptom |
Unexpected results |
Environment |
All OS environments |
Trigger |
Listing snapshots when AFM is enabled on the file system |
Workaround |
None |
|
5.1.9.10 |
AFM |
IJ54963 |
High Importance
|
Symlinks are appended with a null character, which causes the pwd -P command to fail to resolve the real path.
Symptom |
Unexpected results |
Environment |
Linux Only |
Trigger |
AFM caching with symlinks |
Workaround |
None |
|
5.1.9.10 |
AFM |
IJ54964 |
High Importance
|
Ganesha crashes during blocked lock request processing. The thread stack in Ganesha during a crash may resemble the following examples:
Crash-1:
raise
crash_handler
raise
abort
_nl_load_domain.cold.0
lock_entry_dec_ref
process_blocked_lock_upcall
state_blocked_lock_caller
fridgethr_start_routine
start_thread
__clone
Crash-2:
raise
abort
state_hdl_cleanup
mdcache_lru_clean
_mdcache_lru_unref
mdcache_put_ref
lock_entry_dec_ref
remove_from_locklist
try_to_grant_lock
process_blocked_lock_upcall
state_blocked_lock_caller
fridgethr_start_routine
start_thread
__clone
Symptom |
Crash |
Environment |
Linux Only |
Trigger |
Ganesha may crash when multiple NFS clients attempt to acquire file locks simultaneously using blocked lock requests. |
Workaround |
None |
|
5.1.9.10 |
NFS |
IJ54965 |
High Importance
|
NFSV4 ACLs are not replicated with AFM fileset level options afmSyncNFSV4ACL and afmNFSV4
Symptom |
Unexpected results |
Environment |
Linux Only |
Trigger |
Using options afmSyncNFSV4ACL and afmNFSV4 to replicate NFSv4 ACLs. |
Workaround |
None |
|
5.1.9.10 |
AFM |
IJ54966 |
High Importance
|
Kernel crash with SELinux enabled.
Symptom |
Crash |
Environment |
Linux Only |
Trigger |
File creation with SELinux enabled. |
Workaround |
None |
|
5.1.9.10 |
Scale core |
IJ54967 |
High Importance
|
crash during cxiStrcpy in setSecurityXattr
Symptom |
Crash |
Environment |
Linux Only |
Trigger |
File creation with SELinux enabled. |
Workaround |
None |
|
5.1.9.10 |
Scale core |
IJ54968 |
High Importance
|
opening a new file with O_RDWR|O_CREAT fails with EINVAL.
Symptom |
File creation returns an EINVAL error. |
Environment |
Linux Only |
Trigger |
Unknown |
Workaround |
None |
|
5.1.9.10 |
Scale Core |
IJ54969 |
High Importance
|
Kernel panic: general protection fault / ovl_dentry_revalidate_common / mmfsd, or running lsof /proc on a node crashes the node.
Symptom |
Crash |
Environment |
Linux Only |
Trigger |
Running lsof /proc on a node crashes the node. |
Workaround |
None |
|
5.1.9.10 |
Scale core |
IJ54978 |
High Importance
|
There is an issue where the .ptrash directory's local bit gets reset. As a result, some operations performed during recovery in the trash directory are requeued to the remote site, causing the queue to be dropped in a repeated loop.
Symptom |
Unexpected Behaviour |
Environment |
All Linux OS Environments (AFM Gateway nodes) |
Trigger |
Recovery repeatedly failing because the .ptrash directory's local bit is not set. |
Workaround |
Set the .ptrash directory's local bit manually and re-run the recovery. |
|
5.1.9.10 |
AFM |
IJ54979 |
High Importance
|
With afmFastCreate enabled, if the Create that tries to push the initial chunk of data fails to complete and is requeued, then the requeued Create replays all data when it retries. Later, a couple of Write messages starting from the offset where the Create initially went in flight are also played, totaling almost twice the file size in replicated data.
Symptom |
Unexpected Behaviour |
Environment |
All Linux OS Environments (AFM Gateway nodes) |
Trigger |
afmFastCreate replication failing initially because of a lock or network error, and replication later being retried. |
Workaround |
Set a higher value of afmAsyncDelay to push replication out while the file is still being written. |
|
5.1.9.10 |
AFM |
IJ54593 |
High Importance
|
During token minimization, a deadlock can occur on a client node. With token minimization, a client node is first asked to give up any tokens that are only for cached files. Without the fix, calling this code path for files that have been deleted could result in a deadlock.
Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
Environment |
ALL Linux OS environments |
Trigger |
Have many files cached on a client node. Delete files. Trigger a token server change, which then uses token minimization. |
Workaround |
Disable token minimization to avoid the problem: mmchconfig tokenXferMinimization=no. Or restart GPFS on the client node, to get out of the deadlock. |
|
5.1.9.10 |
All Scale Users |
IJ53743 |
High Importance
|
mmvdisk does not generate the NSD stanza properly when the failuregroups option is used, causing no thin inodes to be created.
Symptom |
Lack of thin inodes on the expected pools |
Environment |
all |
Trigger |
Using mmvdisk fs create with failuregroups option where thin inodes are expected to be created |
Workaround |
None |
|
5.1.9.9 |
GNR |
IJ53744 |
Medium Importance |
Kernel panic and crash when Scale is used as an overlay filesystem. The failure message looks like the following:
Failure at line 31071 in file mmfs/ts/kernext/gpfsops.C rc 0 reason 0 data (refcount != 0)
Symptom |
Crash |
Environment |
Linux Only |
Trigger |
Scale is used as an overlay filesystem and files are mmapped. |
Workaround |
None |
|
5.1.9.9 |
Scale core |
IJ51961 |
Suggested |
Inside the GPFS daemon, the variable that represents the number of allocations is an integer type, which can overflow on a large system. As a result, some statistics are shown as negative.
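The sign flip described above can be reproduced in miniature. This sketch (the variable name is illustrative, not the daemon's actual counter) reinterprets a large allocation count as a 32-bit signed integer:

```python
import ctypes

# A hypothetical allocation count that exceeds the 32-bit signed maximum
# (2**31 - 1). Stored in a 32-bit signed integer, it wraps around to a
# negative value, which is how the statistics end up displayed as negative.
allocations = 2**31 + 5
as_int32 = ctypes.c_int32(allocations).value
print(as_int32)  # -2147483643
```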
Symptom |
Error output/message |
Environment |
ALL Operating System environments |
Trigger |
run "/usr/lpp/mmfs/bin/mmdiag --memory" or "/usr/lpp/mmfs/bin/mmfsadm dump alloc" in highly loaded system. |
Workaround |
none |
|
5.1.9.9 |
Admin Commands (mmdiag and mmfsadm) |
IJ53784 |
High Importance
|
There is a race condition that involves multiple threads performing a full-track read operation to the same track while disk errors exist. When the configuration parameter nsdRAIDClientOnlyChecksum is enabled, this race condition could create a situation where, without going through the checksum validation, data read from disks could be used for the reconstruction of data that failed to read due to disk errors.
Symptom |
The potential final outcome could be silent data corruption; however, there are intermediate signs such as "Error validating buffer checksum in vdisk RG001LG002VS004 vtrack 7611...", which itself is not necessarily a sign of silent data corruption. |
Environment |
Linux Only |
Trigger |
The race condition and the specific data buffer corruption by disk drives. |
Workaround |
Disable client only checksum by running "mmchconfig nsdRAIDClientOnlyChecksum=no -i -N <server nodes or nodeclass>" |
|
5.1.9.9 |
GNR |
IJ53600 |
Suggested |
A Linux kernel change caused GPFS to break disk I/O into many small requests.
(show details)
Symptom |
Performance Impact/Degradation |
Environment |
ALL Linux OS environments with kernel version >= 5.1 |
Trigger |
N/A |
Workaround |
None |
|
5.1.9.9 |
All Scale Users |
IJ53548 |
Suggested |
Attempting to set a timestamp in GPFS to a time before Jan 1 1970 results in an unexpected timestamp being stored. GPFS currently stores timestamps as a 32-bit unsigned integer, and thus can store timestamps from Jan 1 1970 00:00:00 UTC to Feb 7 2106 06:28:15 UTC. Setting a timestamp before 1970 was silently accepted.
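A quick sketch of the representable range, assuming the 32-bit unsigned seconds-since-epoch storage described above:

```python
from datetime import datetime, timezone

# A 32-bit unsigned count of seconds since the Unix epoch yields this range:
min_ts = datetime.fromtimestamp(0, tz=timezone.utc)
max_ts = datetime.fromtimestamp(2**32 - 1, tz=timezone.utc)

print(min_ts)  # 1970-01-01 00:00:00+00:00
print(max_ts)  # 2106-02-07 06:28:15+00:00
```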
Symptom |
Unexpected Results/Behavior |
Environment |
ALL Linux OS environments |
Trigger |
Attempt to set timestamp on a GPFS inode before 1970, e.g.: touch -m -t 196001010000 testfile |
Workaround |
Avoid setting timestamps outside the supported range in GPFS. |
|
5.1.9.9 |
All Scale Users |
IJ53910 |
Critical |
Unexpected long waiters on PrefetchListMutex could cause user applications to hang. This could cause cluster wide performance degradation.
Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
Environment |
ALL Operating System environments |
Trigger |
Setting prefetchPartition configuration to value other than default |
Workaround |
Set prefetchPartition configuration to default |
|
5.1.9.9 |
All Scale Users |
IJ53911 |
High Importance
|
[AFM] Resync is not able to sync xattrs from evicted files, resulting in an ACL mismatch if the AFM control file is created after an ACL change in the cache.
Symptom |
ACL mismatch between cache and home |
Environment |
Linux Only |
Trigger |
ACL change in cache without afm control file. |
Workaround |
None |
|
5.1.9.9 |
AFM |
IJ53912 |
Critical |
Online mmfsckx could report false critical corruption (duplicate reference)
Symptom |
Error output/message |
Environment |
ALL Operating System environments |
Trigger |
Directory block split or compaction triggered by file creation and deletion |
Workaround |
Run offline mmfsck to confirm |
|
5.1.9.9 |
All Scale Users |
IJ53724 |
High Importance
|
Automatic inode expansion for an inode space can get disabled.
Symptom |
file creations can fail. |
Environment |
ALL Operating System environments |
Trigger |
creating many files which causes allocated inodes in an inode space to become equal to max inodes. |
Workaround |
perform manual inode-space expansion. |
|
5.1.9.9 |
File creation/Inode allocation. |
IJ53723 |
Critical |
Under certain conditions, when the SED discovery command fails for a pdisk with an EIO error, the hardware type information is not set correctly in the pdisk. This results in an SSD disk being shown as an HDD.
Symptom |
Wrong hardware type is set for a disk. |
Environment |
All supported platform |
Trigger |
SED discovery command failing for a pdisk with an EIO error, leaving the hardware type information set incorrectly in the pdisk |
Workaround |
None |
|
5.1.9.9 |
GNR |
IJ54002 |
High Importance
|
When adjusting number of prefetch threads in a partition, unexpected long waiter on PrefetchListMutex could be triggered leading to other long waiters and application hang.
Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
Environment |
ALL Operating System environments |
Trigger |
Changes in number of files concurrently opened by user applications |
Workaround |
Perform a sequential read on a file with a file size bigger than 2 times the file system block size |
|
5.1.9.9 |
All Scale Users |
IJ54021 |
Medium Importance |
AFM replication fails in IW mode if the remote mount hangs as the messages are incorrectly dropped.
Symptom |
Unexpected results |
Environment |
Linux Only |
Trigger |
This issue happens when remote cluster mount is not responding and write message is stuck in IW mode |
Workaround |
None |
|
5.1.9.9 |
AFM |
IJ54022 |
High Importance
|
Client or CES node crashes when afmFastLookup is enabled, due to an invalid fileset pointer access.
Symptom |
Crash |
Environment |
Linux Only |
Trigger |
afmFastLookup with memory mapped IO |
Workaround |
disable afmFastLookup |
|
5.1.9.9 |
AFM |
IJ54023 |
High Importance
|
AFM outband prefetch causes deadlock due to incorrect handling of async messages while creating the files/dirs in the cache.
Symptom |
Deadlock |
Environment |
Linux Only |
Trigger |
AFM outband prefetch |
Workaround |
None |
|
5.1.9.9 |
AFM |
IJ54024 |
High Importance
|
AFM resync/changeSecondary commands hang as they try to send lookup on the local .ptrash directory
Symptom |
Deadlock |
Environment |
Linux Only |
Trigger |
AFM resync/changeSecondary operations |
Workaround |
None |
|
5.1.9.9 |
AFM |
IJ54025 |
High Importance
|
Daemon asserts with Assert exp(oldValue == 0)*
Symptom |
Abend/Crash |
Environment |
Linux Only |
Trigger |
A mount failure and then multiple I/O threads coming in to retry the local SG mount. |
Workaround |
None |
|
5.1.9.9 |
AFM |
IJ54026 |
High Importance
|
There is a race between the local SG panic handling thread and msgMgrThreadBody on the handlerList. This causes a deadlock in deciding who should delete and free the filesystem.
Symptom |
Deadlock |
Environment |
Linux Only |
Trigger |
Local filesystem panic happening when there are active I/O requests on many filesets on the same FS. |
Workaround |
None |
|
5.1.9.9 |
AFM |
IJ54004 |
High Importance
|
The file system encryption functionality requires the CA certificates to be compliant with the RFC 5280 specification, which requires that CA certificates' basicConstraints are marked as critical. Consequently, Storage Scale does not allow the use of CA certificates that don't have basicConstraints marked as critical.
Symptom |
Failure to establish secure connections to the KMIP key server and retrieve the master encryption key required by the file system encryption functionality. |
Environment |
ALL |
Trigger |
The use of KMIP client and server certificates signed by CA certificates that have non-critical basicConstraints. |
Workaround |
Use KMIP client and server certificates that are signed by CA certificates with basicConstraints marked as critical, in conformance to RFC 5280. |
|
5.1.9.9 |
GPFS Core |
IJ53563 |
High Importance
|
With the simplified setup for file system encryption, when the KMIP client and server certificates are signed by CA certificate chains that have common certificates (e.g., the same CA root and possibly the same intermediate certificates), the mmgskkm command returns an error. That error forces the calling "mmkeyserv client create" command to fail.
Symptom |
Failure to create a mmkeyserv client |
Environment |
AIX, Linux |
Trigger |
The use of KMIP client and server certificates signed by CA certificates chains with shared certificates. |
Workaround |
Use self-signed, system generated KMIP client certificates. |
|
5.1.9.9 |
GPFS Core |
IJ53647 |
High Importance
|
In a two-node+tiebreaker cluster using server-based cluster configuration, when one of the nodes is powered off and the other node tries to run an election and open the tiebreaker disk, it calls Disk::devOpen(), which has a side effect of retrieving the WWN from the device. This WWN-retrieval logic checks the disk lease before sending the SCSI request, hitting a deadlock there. With CCR configuration, the election path invokes OpenDevice() from CCR directly when accessing the tiebreaker disk, so it does not hit the problem. Removing the call to wwnFromDevice() from Disk::devOpen() eliminates this deadlock.
Symptom |
Deadlock during cluster probing in a two-node cluster with tiebreaker and server-based cluster configuration. |
Environment |
ALL Operating System environments |
Trigger |
Deadlock during cluster probing (after node failure) in a two node cluster with tiebreaker disk and server-based cluster configuration |
Workaround |
None |
|
5.1.9.9 |
Cluster configuration |
IJ54063 |
High Importance
|
An application like the SMB server may invoke the gpfs_stat_x() call (available in libgpfs.so) to retrieve stat information for a file. Such a call implements "statlite" semantics, meaning that the size information is not assured to be the latest. Other applications which invoke standard stat()/fstat() calls do expect the size information to be up to date. However, due to a problem in the logic, after gpfs_stat_x() is invoked, information is cached inside the kernel, and the cache is not purged even when other nodes change the file size (for example by appending data to it). The result is that stat() invoked on the node may still retrieve out-of-date file size information as other nodes write into the file.
Symptom |
Unexpected Results/Behavior |
Environment |
ALL Operating System environments |
Trigger |
Applications on a node invoke gpfs_stat_x() (including the SMB server) on the same node where stat()/fstat() is called. |
Workaround |
None |
|
5.1.9.9 |
All Scale Users |
IJ53693 |
HIPER |
Symptom |
Unexpected Results/Behavior |
Environment |
ALL Linux OS environments |
Trigger |
All of these conditions need to be met in order for this problem to occur: An application is using Linux shared memory, is issuing a direct i/o read request into that memory and that memory is being swapped out. |
Workaround |
Not using Linux shared memory, or preventing swapping would avoid this problem. |
|
5.1.9.9 |
All Scale Users |
IJ54106 |
Suggested |
Added support for NFS over RDMA (RoCE) while using AFM
Symptom |
Unexpected results |
Environment |
Linux Only |
Trigger |
Using AFM NFS filesets with RDMA (RoCE). |
Workaround |
None |
|
5.1.9.9 |
AFM |
IJ54107 |
High Importance
|
The .afm directory, where some failure files are redirected today, has a default permission of 700 when created, so only the owning user has full access to it. But there are times when group users and others might want to view failure list files; with the file permissions also being 700, the read gets denied.
Symptom |
Unexpected Behavior |
Environment |
All OS Environments |
Trigger |
Reading failure list in .afm directory of AFM filesets as non-root users. |
Workaround |
None |
|
5.1.9.9 |
AFM |
IJ54108 |
High Importance
|
Kernel panic - not syncing: hung_task: blocked tasks.
Symptom |
hung tasks while calling iterate_supers or super_lock. |
Environment |
Linux kernel version >= 6.5 |
Trigger |
Linux kernel version >= 6.5 |
Workaround |
None |
|
5.1.9.9 |
Scale Core |
IJ53799 |
Suggested |
The pmcollector service can segfault in LogStore::readAndProcess(), and the service will restart. There is a data race between two parallel code threads, observed when pmcollector aggregation was running (every 6 hours) while the node was under load.
Symptom |
Abend/Crash |
Environment |
ALL Linux OS environments |
Trigger |
High system CPU utilization while pmcollector is processing its data aggregation. |
Workaround |
None |
|
5.1.9.8 |
perfmon (Zimon) |
IJ53136 |
High Importance
|
As a result of an internal race condition, file system operations on encrypted file systems may fail with error 786. The error may be reported by either an mm command or in the error message log, e.g.,
2024-10-11_21:53:22.106-0500: [A] Log recovery fs9 failed for log group 204, error 786
Symptom |
File system operation on encrypted file systems like error log recovery, deletion of snapshots, etc., may fail with error 786. |
Environment |
Linux, AIX |
Trigger |
There is no specific trigger that forces this issue to manifest. |
Workaround |
None |
|
5.1.9.8 |
GPFS Core |
IJ53151 |
High Importance
|
AFM getOutbandList fails to get the changed files and users may not be able to detect the changes to run the prefetch command later.
Symptom |
Unexpected Results |
Environment |
All OS environments |
Trigger |
Running mmafmctl getOutbandList command |
Workaround |
None |
|
5.1.9.8 |
AFM |
IJ53183 |
High Importance
|
On Gateway node shutdown, the Gateway node forcefully returns EIO to the application node, which promptly passes it on to the application that triggered the Read operation.
Symptom |
IO Error |
Environment |
Linux Only |
Trigger |
Trigger a Read on a large 2 GB file from the application node and, while the read is in progress, run mmshutdown on the Gateway node. |
Workaround |
None |
|
5.1.9.8 |
AFM |
IJ53213 |
Suggested |
Remove the kernel version dependency for afmNFSNconnect.
Symptom |
Unexpected Results |
Environment |
Linux Only |
Trigger |
On some kernel versions below 5.3 it is possible to enable the nconnect option. |
Workaround |
None |
|
5.1.9.8 |
AFM |
IJ53324 |
Critical |
In an extremely rare case, a directory entry with the wrong length could be created, leading to a file system panic on the client node and a log recovery failure on the file system manager node. This could eventually lead to the file system being unmounted everywhere.
Symptom |
Cluster/File System Outage |
Environment |
ALL Operating System environments |
Trigger |
Creating new directory entry via file/link create. |
Workaround |
None |
|
5.1.9.8 |
All Scale Users |
IJ53332 |
High Importance
|
The mmbackup command internally communicates with the tsbuhelper process using a formatted string to get the backup result, and the format was changed in Spectrum Scale 5.1.9.0.
mmbackup should accept both the old and the new format but fails to handle the old format properly. As a result, the backup count from a node using the old format is not correctly added up.
Symptom |
Error output/message |
Environment |
all platforms that support mmbackup. |
Trigger |
This problem could occur if one of remote helper nodes has Spectrum Scale 5.1.8 or older version installed while master node has Spectrum Scale 5.1.9 or higher version installed. |
Workaround |
run mmbackup on the node where Spectrum Scale 5.1.8 or older version is installed |
|
5.1.9.8 |
mmbackup |
IJ52584 |
Suggested |
sdrServ was not able to initialize due to a hostname resolution failure for the legacy server-based configuration server. This prevents the GPFS daemon from coming up.
Symptom |
Startup failure. Hostname resolution failure messages found in mmfs.log. |
Environment |
All |
Trigger |
Startup GPFS |
Workaround |
Temporarily fix the hostname resolution. |
|
5.1.9.8 |
admin command |
IJ53333 |
High Importance
|
Add an option in mmafmctl to checkDeleted files and dirs which might be hogging the usedInodes count on the fileset.
Symptom |
Unexpected Behavior. |
Environment |
Linux Only |
Trigger |
A file/dir has been deleted at the cache/primary site but its replication to the remote site is not completed. The fileset was stopped or AFM was disabled before this, leading to a permanent hold on the deleted inodes. |
Workaround |
Run a policy manually to see NLINK 0 inodes in the AFM fileset. |
|
5.1.9.8 |
AFM |
IJ53421 |
High Importance
|
Failed to register with GPFS: Bad file descriptor when SMB tries tree connect
Symptom |
Crash |
Environment |
Linux Only |
Trigger |
A Samba process calls gpfs_register_cifs_export. That results in the process being registered in a table. This interface calls alloc_file() which triggers the issue. |
Workaround |
None |
|
5.1.9.8 |
GPFS core |
IJ53426 |
Critical |
When a new file system manager takes over after the old file system manager loses quorum, it is possible for the new file system manager to read the stripe group descriptor too early, which can cause stripe group descriptor updates to be lost.
Symptom |
Unexpected Results/Behavior |
Environment |
ALL Operating System environments |
Trigger |
File system manager loses quorum while running command that updates stripe group descriptor. |
Workaround |
None |
|
5.1.9.8 |
All Scale Users |
IJ53420 |
High Importance
|
GPFS daemon could fail unexpectedly with assert after file system unmounted due to panic.
Symptom |
Abend/Crash |
Environment |
ALL Operating System environments |
Trigger |
File system panic |
Workaround |
None |
|
5.1.9.8 |
All Scale Users |
IJ53372 |
High Importance
|
GPFS leaks kernel memory every time a user that is a member in more than 32 groups tries to access an inode that denies access to that user through simple modebits (no ACL). This might go unnoticed, but if these conditions occur repeatedly, the kernel memory leak can affect the node operations, requiring a reboot to avoid outages.
Symptom |
Abend/Crash (in the worst case that the kernel memory leak goes undetected, leading to OOM kills and node outage) |
Environment |
ALL Linux OS environments |
Trigger |
All Scale Users |
Workaround |
The only workaround would be reducing the number of groups to ensure that no user is a member in more than 32 groups. |
|
5.1.9.8 |
All Scale Users |
IJ53490 |
High Importance
|
The timeout test result is not consistent on the AMD EPYC-Genoa processor. If the test passes, the GSKit hangs workaround will not be applied. This causes problems later.
Symptom |
Installation and admin commands hang. |
Environment |
Linux OS environments |
Trigger |
This problem affects AMD EPYC-Genoa. |
Workaround |
Manually apply the workaround |
|
5.1.9.8 |
Admin Commands, gskit |
IJ53592 |
High Importance
|
If it is the first or only operation in the list and we attempt to queue it through startMarker, the escaped path is used instead of the unescaped path, causing a failure to queue the file name in the proper format.
Symptom |
Unexpected behavior |
Environment |
Linux Only |
Trigger |
AFM |
Workaround |
In the list file given for download, place files without escape sequences in their names ahead of special-character filenames that require escaping. |
|
5.1.9.8 |
AFM |
IJ53593 |
High Importance
|
Logging a failure to the failed-list file causes a deadlock within the mmafmcosctl binary.
Symptom |
Deadlock. |
Environment |
Linux Only |
Trigger |
Having failures to log in the download/upload sub command of mmafmcosctl. |
Workaround |
Run download/upload without --enable-failed-list-file and this problem should not happen. |
|
5.1.9.8 |
AFM |
IJ53594 |
High Importance
|
An earlier fix for the same issue returned RESTART between the Gateway node and the application node only when the queue was dropped. But there can be cases where the Gateway node is shut down without the queue being in the dropped state.
(show details)
Symptom |
IO Failure |
Environment |
Linux Only |
Trigger |
Trigger a read on a single large file from COS to cache and meanwhile shut down the fileset's gateway, or start a gateway node that was already shut down. |
Workaround |
None |
|
5.1.9.8 |
AFM |
IJ53595 |
High Importance
|
The AFM gateway node becomes unresponsive during startup due to numerous filesystem mount requests triggered by active I/O to multiple filesets.
(show details)
Symptom |
Performance impact |
Environment |
Linux Only |
Trigger |
Gateway node startup with multiple AFM filesets starting the recovery. |
Workaround |
None |
|
5.1.9.8 |
AFM |
IJ53186 |
High Importance
|
There is a bug in mmvdisk when processing '--mmcrfs' options.
It handles two option forms: --opt value and --opt=value.
An argument like the following will reproduce the issue: "afmMode=RO,afmTarget=gpfs:///gpfs/ssoft_src/"
Because the option value of '-p' includes '=', mmvdisk mistakenly splits its value 'afmMode=RO,afmTarget=gpfs:///gpfs/ssoft_src/' into two parts, 'afmMode' and 'RO,afmTarget=gpfs:///gpfs/ssoft_src/'.
That is why mmcrfs reports "Incorrect option: -p afmMode".
(show details)
Symptom |
Error output/message |
Environment |
ALL Operating System environments |
Trigger |
An argument containing one or several "=" characters for option "--mmcrfs" |
Workaround |
Apply the following patch to filesystem.py:
diff --git a/ts/appadmin/vdisk/filesystem.py b/ts/appadmin/vdisk/filesystem.py
index aa0822e..a6a7aa8 100755
--- a/ts/appadmin/vdisk/filesystem.py
+++ b/ts/appadmin/vdisk/filesystem.py
@@ -787,7 +787,7 @@
fs_option[:2] in ['-M', '-m', '-R', '-r', '-j', '-B']:
tmp_arg_fsoptions.append(fs_option[:2])
tmp_arg_fsoptions.append(fs_option[2:])
- elif '=' in fs_option:
+ elif fs_option[0] == '-' and '=' in fs_option:
idx = fs_option.find('=')
tmp_arg_fsoptions.append(fs_option[:idx])
tmp_arg_fsoptions.append(fs_option[idx+1:]) |
|
5.1.9.7 |
mmvdisk |
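The mis-split behind IJ53186 above can be illustrated with a minimal sketch. This is a hypothetical simplification, not the actual filesystem.py code; the function name and `fixed` flag are invented for illustration. The buggy logic split any token containing '=', while the patched check (as in the workaround diff) only splits tokens that actually look like --opt=value:

```python
# Hypothetical sketch of mmvdisk's '--mmcrfs' token splitting.
# Buggy behavior (fixed=False): any token containing '=' is split at the
# first '=', mangling option *values* such as the '-p' pool options.
# Fixed behavior (fixed=True): only tokens starting with '-' are split.
def split_fs_options(tokens, fixed=True):
    out = []
    for tok in tokens:
        if '=' in tok and (tok.startswith('-') or not fixed):
            idx = tok.find('=')
            out.append(tok[:idx])       # option name
            out.append(tok[idx + 1:])   # option value
        else:
            out.append(tok)             # plain value: leave intact
    return out
```

With the buggy logic, ['-p', 'afmMode=RO,afmTarget=gpfs:///gpfs/ssoft_src/'] becomes ['-p', 'afmMode', 'RO,afmTarget=gpfs:///gpfs/ssoft_src/'], matching the "Incorrect option: -p afmMode" error described above; the fixed check leaves the '-p' value whole while still splitting genuine --opt=value tokens.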
IJ52272 |
High Importance
|
2024-08-26_04:00:08.796+0100: [X] *** Assert exp(context != unknownOp) in line 6125 of file /project/sprelgpfs519/build/rgpfs519s005i/src/avs/fs/mmfs/ts/fs/openfile.C
2024-08-26_04:00:08.796+0100: [E] *** Traceback:
2024-08-26_04:00:08.796+0100: [E] 2:0x55AE8B58E84A logAssertFailed + 0x3AA at ??:0
2024-08-26_04:00:08.796+0100: [E] 3:0x55AE8B2E093D LockFile(OpenFile**, StripeGroup*, FileUID, OperationLockMode, LkObj::LockModeEnum, LkObj::LockModeEnum*, LkObj::LockModeEnum*, int, int) + 0x69D at ??:0
2024-08-26_04:00:08.796+0100: [E] 4:0x55AE8B22CB2B FSOperation::createLockedFile(StripeGroup*, FileUID, OperationLockMode, LkObj::LockModeEnum, OpenFile**, unsigned int*, int, int) + 0x9B at ??:0
2024-08-26_04:00:08.796+0100: [E] 5:0x55AE8B8D4F4C markAuditLogInactive(StripeGroup*, FileUID) + 0x5C at ??:0
2024-08-26_04:00:08.796+0100: [E] 6:0x55AE8B8DE14C AuditWriter::processRequest(WriteRequest) + 0x3FC at ??:0
2024-08-26_04:00:08.796+0100: [E] 7:0x55AE8B8C5DFE serveWriteRequests(WriteRequest const&, void*) + 0xBE at ??:0
2024-08-26_04:00:08.796+0100: [E] 8:0x55AE8B8C61A0 AuditWriter::callback() + 0x1A0 at ??:0
2024-08-26_04:00:08.796+0100: [E] 9:0x55AE8B036C42 RunQueue::threadBody(RunQueueWorker*) + 0x392 at ??:0
2024-08-26_04:00:08.796+0100: [E] 10:0x55AE8B038EC2 Thread::callBody(Thread*) + 0x42 at ??:0
2024-08-26_04:00:08.796+0100: [E] 11:0x55AE8B025EC0 Thread::callBodyWrapper(Thread*) + 0xA0 at ??:0
2024-08-26_04:00:08.796+0100: [E] 12:0x7FA83AF9A1CA start_thread + 0xEA at ??:0
2024-08-26_04:00:08.796+0100: [E] 13:0x7FA839C9F8D3 __GI___clone + 0x43 at ??:0 mmfsd: /project/sprelgpfs519/build/rgpfs519s005i/src/avs/fs/mmfs/ts/fs/openfile.C:6125: void logAssertFailed(UInt32, const char*, UInt32, Int32, Int32, UInt32, const char*, const char*): Assertion `context != unknownOp' failed.
(show details)
Symptom |
Abend/Crash |
Environment |
Linux Only |
Trigger |
The problem happens when the file audit logging code is writing to the audit log, and attempts to grab a lock on the file. This happens frequently in file audit logging code, but the condition does not happen every time, so the logAsserts will be periodic. |
Workaround |
Disable file audit logging |
|
5.1.9.7 |
File Audit Logging |
IJ52511 |
Suggested |
Issuing the undocumented "tsdbfs test patch desc format" command results in mmfsd failures on other nodes.
(show details)
Symptom |
Abend/Crash |
Environment |
ALL Operating System environments |
Trigger |
Issuing the undocumented "tsdbfs test patch desc format" command. |
Workaround |
If issuing the command is required, unmount the file system from all the nodes before issuing the command. |
|
5.1.9.7 |
All Scale Users |
IJ52808 |
High Importance
|
ls command hangs on an NFS mounted directory
(show details)
Symptom |
ls command (readdir operation) on the nfs mounted directory hangs. |
Environment |
Linux Only, kernel version > 6.1 |
Trigger |
NFS export a gpfs directory and mount it via nfs 3/v4. An ls command on the mounted directory will hang. |
Workaround |
None |
|
5.1.9.7 |
NFS |
IJ52845 |
High Importance
|
When File Audit Logging is enabled, during fileset deletion, the LWE registry configuration is loaded into memory to retrieve the audit fileset name to check whether the fileset to be deleted is an audit fileset. This LWE registry configuration is not freed after retrieving the audit fileset name, leading to old configuration data being used when a new audit producer is configured.
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
- Enable File Audit Logging
- Generate IOs so that event producers are configured and events are generated in the audit log
- Delete a fileset or attempt to delete an audit fileset
- Disable File Audit Logging
- Re-enable File Audit Logging |
Workaround |
None |
|
5.1.9.7 |
File Audit Logging |
IJ52846 |
High Importance
|
File Audit Logging producers are not cleaned up from memory when audit is disabled.
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
- Enable File Audit Logging
- Generate IOs so that event producers are configured and events are generated in the audit log
- Disable File Audit Logging
- Check for active LWE producers |
Workaround |
None |
|
5.1.9.7 |
File Audit Logging |
IJ52847 |
High Importance
|
When a new policy is installed for the audit log, the audit registry config currently in memory is updated to the latest info and the LWE garbage collector (LWE GC) is triggered to clean up defunct producers. There could be a case where the LWE GC finishes first and the registry update second, resulting in stale data being loaded in memory when there are no producers present (when audit is disabled). The next time audit is enabled, the old config data is used to configure a new audit producer.
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
- Enable File Audit Logging on node A
- Generate IOs so that event producers are configured and events are generated in the audit log
- Disable File Audit Logging on node B
- Re-enable File Audit Logging on node A
- Generate IOs on node B |
Workaround |
None |
|
5.1.9.7 |
File Audit Logging |
IJ52848 |
High Importance
|
The major problem identified here is whether killInflight issued on the mount is working at all.
(show details)
Symptom |
Deadlock |
Environment |
Linux Only |
Trigger |
The home site times out, causing AFM to issue killInflight; when the in-flight messages cannot be killed, the invoked command becomes stuck. |
Workaround |
None |
|
5.1.9.7 |
AFM |
IJ52849 |
Suggested |
Users with NFSv4 WRITE permission to a file can get permission denied when setting file timestamps to the current time
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
All Operating System environments |
Trigger |
- A file has NFSv4 WRITE permission and no WRITE_ATTR permission for a user
- As the user, set file timestamps to the current time |
Workaround |
- Update the timestamps as the file owner, or
- Provide NFSv4 WRITE_ATTR permission to the user |
|
5.1.9.7 |
All Scale Users |
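The WRITE vs WRITE_ATTR distinction in IJ52849 above follows POSIX utimensat semantics: setting timestamps to the current time needs ownership or write permission, while setting explicit timestamps needs ownership (or, with NFSv4 ACLs, WRITE_ATTR). A sketch of the intended check — a hypothetical helper, not Scale code:

```python
# Hypothetical permission check for setting file timestamps, modeled on
# POSIX utimensat rules. The bug above denied the to_current_time case
# even though the user held NFSv4 WRITE permission.
def may_set_times(is_owner, has_write, has_write_attr, to_current_time):
    if is_owner or has_write_attr:
        return True          # owner / WRITE_ATTR may set any timestamps
    if to_current_time:
        return has_write     # "set to now" only requires write permission
    return False             # explicit timestamps need ownership/WRITE_ATTR
```

Under this model a user with only WRITE permission can touch a file to the current time, which is the behavior the fix restores.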
IJ52850 |
High Importance
|
Some client commands, when invoked in a fast, repetitive sequence, may fail to connect to the mmfsd daemon.
(show details)
Symptom |
Some client commands may fail. |
Environment |
ALL |
Trigger |
There is no specific trigger that forces this issue to manifest. |
Workaround |
None. |
|
5.1.9.7 |
GPFS Core |
IJ52851 |
High Importance
|
Deadlock while performing multiple small reads on same file
(show details)
Symptom |
Deadlock |
Environment |
Linux Only |
Trigger |
Multiple small reads on a single large file with or without parallel IO enabled leads to resource starvation on the gateway node. |
Workaround |
Increase afmWorkerThreads to 4096 or higher on the Gateway nodes. |
|
5.1.9.7 |
AFM |
IJ52949 |
High Importance
|
A script error in mmcrfileset leads to enabling afmObjectSyncOpenFiles on the RO mode fileset, which fails promptly as expected.
(show details)
Symptom |
Unexpected behavior |
Environment |
Linux Only |
Trigger |
Enabling afmObjectACL and afmObjectAZ flags together while creating the fileset. |
Workaround |
Enable afmObjectAz/afmobjectpreferdir/afmevictrange separately without enabling any ioFlags with them so that none of the ioFlags2 are unintentionally enabled. |
|
5.1.9.7 |
AFM |
IJ52950 |
High Importance
|
Cannot mount the file system because it does not have a manager in a file system with more than 8192 inode spaces. The failure is due to a wrong sanity check for the number of inode spaces created in the file system.
(show details)
Symptom |
Unable to mount the file system. |
Environment |
ALL Operating System environments |
Trigger |
The sanity check for the number of inode spaces created in a file system was using the plain maxInodeSpaces configured in the file system. This is wrong because internally, with the use of inodeSpaceMask, the maximum number of inode spaces in a file system should be maxInodeSpaces rounded up to the next power of 2. |
Workaround |
None |
|
5.1.9.7 |
filesets |
IJ52992 |
High Importance
|
This APAR addresses a problem in the NFS health check:
When NFS is under load, the rpc null check may fail. The failure is tolerated if the statistics check shows that NFS-Ganesha is still completing NFS rpc requests. In 5.1.9.6 the statistics check is not working, so the rpc null check failure is not ignored and a CES IP failover is triggered even though NFS-Ganesha is still completing NFS rpc requests.
(show details)
Symptom |
CES IP failover. NFS service marked as failed; the mmsysmonitor log constantly shows statistics check failures. |
Environment |
Linux Only. Only 5.1.9.6 ptf is affected. |
Trigger |
When NFS is under load, the rpc null check may fail. The failure is tolerated if the statistics check shows that NFS-Ganesha is still completing NFS rpc requests. In 5.1.9.6 the statistics check is not working, so the rpc null check failure is not ignored and a CES IP failover is triggered even though NFS-Ganesha is still completing NFS rpc requests. |
Workaround |
None |
|
5.1.9.7 |
NFS-Ganesha, CES-IP failover. |
IJ53044 |
High Importance
|
In Scale 5.2.1 the lowDiskSpace callback is not triggered when disk space usage reaches the high occupancy threshold specified in the current policy rules, breaking applications that use thresholds to migrate data between pools.
(show details)
Symptom |
Potential symptoms: migration via callback does not work; the mmapplypolicy command for no_disk_space_warn does not run automatically |
Environment |
ALL Operating System environments |
Trigger |
An unrelated code fix added in 5.2.1.0 affects the free space recovery processing but mistakenly disabled the lowDiskSpace threshold check. |
Workaround |
None |
|
5.1.9.7 |
Block allocation manager code. |
IJ52948 |
High Importance
|
Kernel crash in Scale 5.2.1.1 - general protection fault and system crash. The crash happens due to a memory corruption after mounting a GPFS filesystem. Sometimes this happens during a filesystem mount and sometimes a little while after.
(show details)
Symptom |
Memory corruption and subsequent crash |
Environment |
Linux Only |
Trigger |
No particular kernel version is required: the issue was hit on 4.18.0-553.16.1.el8_10.x86_64 and has also been reproduced on a 6.4 kernel. The length of the fstab entry must be in a sweet spot. The memory is allocated from slab caches, which have fixed object sizes (8, 16, 32, 64, 96, 128, 192, 256, 512 bytes and so on), so there may be extra room in the allocated memory until the object boundary is reached, and no corruption occurs until the write crosses that boundary. For the problem to appear, the fstab entry needs a sizeable number of characters and options after the gpfsdev="fsname" option, which leads to writing a larger size than what was requested. |
Workaround |
None |
|
5.1.9.7 |
Scale core |
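The slab size-class effect described in the trigger for IJ52948 above can be sketched as follows. This is a generic illustration of kernel slab rounding, not Scale code; the size classes are the ones listed in the trigger, and the helper names are invented:

```python
# Small kernel allocations are backed by slab caches with fixed object
# sizes, so a request is silently rounded up to the next size class.
SLAB_CLASSES = [8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096]

def slab_object_size(requested):
    for size in SLAB_CLASSES:
        if requested <= size:
            return size
    raise ValueError("larger requests come from the page allocator")

def overflow_is_silent(requested, written):
    # An out-of-bounds write that stays within the rounded-up object
    # corrupts nothing visible; crossing the object boundary corrupts
    # the neighboring object and eventually crashes the kernel.
    return requested < written <= slab_object_size(requested)
```

For example, a 100-byte request is backed by a 128-byte object, so an overrun of up to 28 bytes goes unnoticed, while one more byte crosses the boundary. This is why only fstab entries of a "sweet spot" length trigger the crash.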
IJ53006 |
High Importance
|
Deadlock hang condition in which InodeRevokeWorkerThread threads will hang, and dumping waiters (e.g. via "mmdiag --waiters") will show: InodeRevokeWorkerThread: for flush mapped pages, VMM iowait
(show details)
Symptom |
Hang/deadlock |
Environment |
Linux Only |
Trigger |
Use of mmap when there are replicas that have missed updates. |
Workaround |
Avoiding the use of mmap or of file replication will avoid the problem |
|
5.1.9.7 |
GPFS Core (mmap) |
IJ53007 |
High Importance
|
Kernel null pointer exception while running outbound metadata prefetch.
(show details)
Symptom |
Deadlock |
Environment |
Linux Only |
Trigger |
A second AFM outbound metadata prefetch after a first prefetch and a file type change at home. |
Workaround |
None |
|
5.1.9.7 |
AFM |
IJ53034 |
High Importance
|
For unknown reasons (a possible reason is /tmp being full), the update of RKM.conf was not able to obtain the KMIP port from a file. The code does not check for any error and continues to produce the kmipServerUri line with a missing port number. Encrypted files may not be accessible due to this issue.
(show details)
Symptom |
File system unmount |
Environment |
All |
Trigger |
mount file system; access file |
Workaround |
On the problem node, remove the bad RKM.conf file then run a GPFS command to regenerate the RKM.conf file. For example, run mmlscluster. |
|
5.1.9.7 |
encryption, admin command |
IJ53036 |
High Importance
|
Daemon assert going off: index >= 0 && index < maxTcpConnsPerNodeConn in file llcomm_m.C resulting in a GPFS daemon shutdown.
(show details)
Symptom |
Abend/Crash |
Environment |
ALL Linux OS environments |
Trigger |
All Scale Users |
Workaround |
None |
|
5.1.9.7 |
All Scale Users |
IJ53038 |
Medium Importance |
Reviewed bug fixes that went into the Ganesha 6 upstream branch and cherry-picked applicable and important fixes to the IBM Ganesha 5.7 branch. Also, utility commands for enabling log rotation have been added.
(show details)
Symptom |
None |
Environment |
Linux Only |
Trigger |
None |
Workaround |
None |
|
5.1.9.7 |
NFS-Ganesha |
IJ53039 |
High Importance
|
Frequent NFS hangs were observed, and an NFS crash is also fixed in this tag update.
1. For crash-related info, please check defect: https://jazz07.rchland.ibm.com:21443/jazz/web/projects/GPFS#action=com.ibm.team.workitem.viewWorkItem&id=337064
2. For the NFS hang: https://jazz07.rchland.ibm.com:21443/jazz/web/projects/GPFS#action=com.ibm.team.workitem.viewWorkItem&id=338989
(show details)
Symptom |
None |
Environment |
Linux Only |
Trigger |
None |
Workaround |
None |
|
5.1.9.7 |
NFS-Ganesha |
IJ53040 |
Medium Importance |
When the parsing of the RKM.conf file results in errors, the mmfsd daemon did not log error messages to the Scale message log and did not raise mmhealth rkmconf_instance_err events. With this change, errors encountered during the parsing of individual RKM.conf stanzas are logged in the message log and an mmhealth event is raised.
(show details)
Symptom |
The file system encryption functionality may fail if the RKM.conf file is corrupted, making it difficult for the sysadmin to determine the source of the problem. |
Environment |
ALL |
Trigger |
Incorrect or incomplete RKM.conf stanzas. |
Workaround |
None |
|
5.1.9.7 |
GPFS Core |
IJ53048 |
High Importance
|
Daemon assert goes off when a Read does not have the right child ID on the queue.
(show details)
Symptom |
Abend/Crash |
Environment |
Linux Only |
Trigger |
Prefetch Reads on the queue causing an invalid message format. |
Workaround |
None |
|
5.1.9.7 |
AFM |
IJ46422 |
Critical |
Undetected silent data corruption may be left on disk after truncate operation
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
All Operating Systems |
Trigger |
truncate operation |
Workaround |
None |
|
5.1.9.6 |
All Scale Users |
IJ50999 |
High Importance
|
While mmcheckquota is running, a failed copy of user data from user space to kernel space leads to some cleanup work, and an assertion goes off because one mutex-related flag is not acquired when fixing the quota accounting value.
(show details)
Symptom |
Abend/Crash |
Environment |
ALL Operating System environments |
Trigger |
Failure when copying data between user and kernel space |
Workaround |
N/A |
|
5.1.9.6 |
Quotas |
IJ52204 |
High Importance
|
mmimgrestore reports failure if a symbolic link has no content. This could result in an incomplete file system restore.
(show details)
Symptom |
File system Outage after restore. |
Environment |
all platforms that support SOBAR (mmimgbackup/mmimgrestore) |
Trigger |
This problem could occur if the file system that the image backup was taken from contains dangling symbolic links. |
Workaround |
none |
|
5.1.9.6 |
SOBAR (mmimgbackup/ mmimgrestore) |
IJ52205 |
High Importance
|
mmapplypolicy performs an inode scan in parallel, and the number of iscanBuckets can be specified via the -A option. If the -A option is not specified, tsapolicy calculates it from the total used inodes, between 7 and 4096. In a large file system, the calculated value is often larger than the open file limit, and tsapolicy could fail with "Too many open files".
(show details)
Symptom |
Component Level Outage |
Environment |
all platforms that support mmapplypolicy |
Trigger |
This problem could occur if the total used inodes is larger than (open file limit - 100) million. |
Workaround |
none |
|
5.1.9.6 |
mmapplypolicy |
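The bucket sizing problem in IJ52205 above can be sketched as follows. This is a hypothetical model, not the actual tsapolicy code; the one-bucket-per-million-inodes ratio and the 100-descriptor headroom are assumptions drawn from the clamp range and the trigger description:

```python
# Hypothetical model of iscanBuckets sizing. Each bucket holds an open
# file, so an uncapped calculation can exceed RLIMIT_NOFILE and fail
# with "Too many open files" on large file systems.
def iscan_buckets(total_used_inodes, nofile_limit):
    # Assumed ratio: roughly one bucket per million used inodes,
    # clamped to the documented [7, 4096] range.
    buckets = min(4096, max(7, total_used_inodes // 1_000_000))
    # The fix must additionally cap buckets below the process
    # open-file limit, leaving headroom for other descriptors.
    return min(buckets, max(7, nofile_limit - 100))
```

With a typical soft limit of 1024 open files, a file system with 2 billion used inodes would otherwise ask for 2000 buckets; the cap keeps it at 924.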
IJ52206 |
Suggested |
The mmxcp utility does not copy the file mode bits for SUID, SGID, or "sticky bit".
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
ALL Linux OS environments |
Trigger |
The problem occurs when "mmxcp enable" copies a file or directory that has any of the special mode bits set (SUID, SGID, or "sticky bit"). |
Workaround |
The chmod command can be used to set the SUID, SGID or "sticky bit" on any files or directories in the target directory that need them. |
|
5.1.9.6 |
Filesets |
IJ51645 |
High Importance
|
When mmapplypolicy runs with a single node, it runs in strictly local mode since 5.1.9.0. But it could show slower performance during the sorting phase with a large file system because it does not execute parallel sorting.
(show details)
Symptom |
Component Level Outage |
Environment |
all platforms that support mmapplypolicy |
Trigger |
This problem could occur when mmapplypolicy is invoked with a single node |
Workaround |
none |
|
5.1.9.6 |
mmapplypolicy |
IJ52208 |
High Importance
|
When a message is being sent to multiple destinations, if a reconnect happens while the sender thread is still sending, doing the resend in another thread could cause this assert.
(show details)
Symptom |
Abend/Crash |
Environment |
ALL Operating System environments |
Trigger |
TCP connection reconnect |
Workaround |
No |
|
5.1.9.6 |
All Scale Users |
IJ52209 |
Suggested |
With either the static or dynamic pagepool, when the allocation of pagepool memory is not possible (e.g. the node running out of memory), an error message "[X] Cannot pin a page pool at the address X bytes because it is already in use" is printed, which is just confusing.
(show details)
Symptom |
Error output/message |
Environment |
ALL Linux OS environments |
Trigger |
This is triggered by the node running out of memory while trying to allocate pagepool memory. Either a static pagepool is configured too big, or the dynamic pagepool is attempting to grow while there is no more memory available. |
Workaround |
There is no workaround. When running out of memory while allocating pagepool memory, the above error message is still printed, and the administrator has to know that this is likely the node running out of memory. |
|
5.1.9.6 |
All Scale Users |
IJ52262 |
High Importance
|
CTDB uses a queue to receive requests and send answers. However, an issue gave priority to the receiving side: when a request was processed and the answer posted to the queue, if another incoming request arrived, it was served before the previous answer was sent. Under high traffic this can lead to long CTDB waits and starvation.
(show details)
Symptom |
Frequent CTDB recovery and crashes |
Environment |
ALL Operating System environments |
Trigger |
High traffic and large number of accesses to contented file path. |
Workaround |
None |
|
5.1.9.6 |
SMB |
IJ52264 |
High Importance
|
The following assertion can be hit during a fileset deletion: Assert exp(!isValid() || inodesInTransit == NULL || inodesInTransit->getNumberOfOneBits() == 0 ...
(show details)
Symptom |
Abend/Crash |
Environment |
ALL Operating System environments |
Trigger |
A fileset deletion can result in the above assert being hit. |
Workaround |
None |
|
5.1.9.6 |
All Scale Users |
IJ52265 |
High Importance
|
Unable to handle kernel paging request crash in kxGanesha. The problem happens because the value of fdp->fd changed after it was copied.
(show details)
Symptom |
Crash |
Environment |
Linux Only |
Trigger |
The problem is triggered by a race between two threads that happen to be accessing the fdp pointer. |
Workaround |
None |
|
5.1.9.6 |
NFS Ganesha |
IJ52266 |
High Importance
|
When mmimgrestore creates a file with a saved inode, if the inode is already assigned as logfile, Storage Scale tries to move the logfile to another available inode. But if moving the logfile to another inode fails, Storage Scale returns EAGAIN and the mmimgrestore command will fail.
(show details)
Symptom |
Cluster/File System Outage |
Environment |
all platforms that support SOBAR |
Trigger |
This problem could occur if the system is highly loaded and mmimgrestore is executed on a very large file system. |
Workaround |
none |
|
5.1.9.6 |
SOBAR |
IJ52221 |
High Importance
|
Dangling entry in RO mode after re-uploading data to COS
(show details)
Symptom |
Unexpected Results |
Environment |
Linux Only |
Trigger |
Set the immutable flag on a file in an MU fileset and pull it using RO mode in object |
Workaround |
None |
|
5.1.9.6 |
AFM |
IJ52270 |
Critical |
When the manual update mode is in use at the AFM cache site, and an existing file in the cache is renamed and recreated with the same name, the AFM reconcile operation uploads the file to the AFM home site but may incorrectly update the file at the AFM home site.
(show details)
Symptom |
Unexpected results |
Environment |
All Linux OS environments |
Trigger |
Recreating the same named object/file after the rename and uploading to the COS |
Workaround |
None |
|
5.1.9.6 |
AFM |
IJ52271 |
High Importance
|
tsapolicy gets the current cipherList setting from mmfs.cfg, but it gets an empty string if the cipherList configuration variable is set to the default value (EMPTY). tsapolicy incorrectly treats the value as a real cipher if the value is not "EMPTY", "DEFAULT", or "AUTHONLY".
(show details)
Symptom |
Component Level Outage |
Environment |
all platforms that support mmapplypolicy |
Trigger |
This problem could occur if mmapplypolicy is invoked when cipherList configuration variable is set to "EMPTY" and cluster level is 5.1.6.0 or higher. |
Workaround |
none |
|
5.1.9.6 |
mmapplypolicy |
IJ52324 |
High Importance
|
ACLs are not fetched to the AFM cache from home when the opened AFM control file becomes stale. The AFM control file is used to fetch ACLs and EAs from home.
(show details)
Symptom |
Unexpected results |
Environment |
All Linux OS environments |
Trigger |
Invalid AFM control file at the target |
Workaround |
Stop and Start the AFM replication using the mmafmctl command |
|
5.1.9.6 |
AFM |
IJ52323 |
High Importance
|
When an already encoded list file is passed to mmafmctl's evict sub-command, it is encoded once more and passed to tspcacheevict, causing tspcacheevict to not recognize the decoded file name and fail the eviction. If the list is already in encoded format, the list file should not be encoded again.
(show details)
Symptom |
Unexpected Behavior. |
Environment |
All OS environments. |
Trigger |
Run Evict with an encoded list file. |
Workaround |
Use a list file that is not encoded for evict alone. |
|
5.1.9.6 |
AFM |
IJ52322 |
Suggested |
When a previous snapshot is created with the expiration time set, the next snapshot created can get the expiration time of the previous snap when the expiration time is not explicitly provided for this next snapshot.
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
All Operating System environments |
Trigger |
- Create a snapshot with an expiration time
- Wait until the snapshot expiration time, then delete the snapshot
- Create another snapshot without an expiration time. This snapshot is created with the expiration time of the previous snapshot |
Workaround |
None |
|
5.1.9.6 |
Snapshots |
IJ52321 |
High Importance
|
Daemon assert logAssertFailed: isNotCached() while disabling AFM online with afmFastLookup enabled. This happens due to accessing an invalid fileset pointer.
(show details)
Symptom |
Abend/Crash |
Environment |
All Linux OS environments |
Trigger |
Disabling AFM online with afmFastLookup=yes option |
Workaround |
Disable AFM online after setting the afmFastLookup=no to avoid the assert |
|
5.1.9.6 |
AFM |
IJ51031 |
High Importance
|
Metadata corruption on one node, with folders not being correctly visible. Cannot cd into a directory on one node.
(show details)
Symptom |
System that seems to have cached some bad data for a directory and cannot cd into the directory on the bad node |
Environment |
Linux Only |
Trigger |
Unknown |
Workaround |
None |
|
5.1.9.5 |
No adverse effect. This is a failsafe change |
IJ51457 |
Critical |
If the File Audit Logging audit logs are compressed while GPFS is appending to them, the audit log can become corrupted and unrecoverable. This can happen when a compression policy is run against the audit log fileset / audit logs.
(show details)
Symptom |
Operation failure due to FS corruption |
Environment |
Linux Only |
Trigger |
The problem could be triggered if the audit logs are compressed by something other than the File Audit Logging code. When File Audit Logging wraps logs, FAL compresses the audit logs after mmfsd is done appending to them. If a user or program attempts to compress logs that are currently being appended to, unrecoverable corruption can happen to that audit log. |
Workaround |
None |
|
5.1.9.5 |
File Audit Logging |
IJ49862 |
High Importance
|
When the daemon restarts on a worker node, a race condition can cause the worker's local state change to take place after GNR's readmit operation, which intends to repair tracks with stale data. The delayed state change could cause the intended readmit operation to fail to repair the data on the given disks, leaving stale sectors in tracks that could have been fixed once the delayed state change takes place. With more disk failures before the next cycle of scan-and-repair operations has a chance to repair these vtracks, this could result in data loss if the number of faults is beyond the fault tolerance of the vdisk.
(show details)
Symptom |
Daemon crash |
Environment |
All |
Trigger |
Daemon restart on an individual ECE node, or a shared ESS node (even though much less likely), followed by more failing disks. |
Workaround |
Before the fix is installed, manually verify whether any vtracks are stuck in the stale state. |
|
5.1.9.5 |
GNR |
IJ51652 |
High Importance
|
Configuring perfmon --collectors with a non-cluster node name (e.g. a hostname that differs from the admin or daemon name) will fail mmsysmon noderoles detection and cause the perfmon query port to go down, and the GUI node will raise the gui_pmcollector_connection_failed event.
(show details)
Symptom |
Event gui_pmcollector_connection_failed on GUI node. |
Environment |
Linux Only |
Trigger |
Use of mmperfmon config option --collectors with an invalid cluster node identifier. |
Workaround |
None |
|
5.1.9.5 |
Performance Monitoring Tool, GUI, mmhealth thresholds, GrafanaBridge |
IJ51658 |
Critical |
Signal 11 hit in function AclDataFile::hashInsert in acl.C, due to race condition when adding ACLs and handling cached ACL data invalidation during node recovery or hitting a disk error, resulting in mmfsd daemon crash.
(show details)
Symptom |
Abend/Crash |
Environment |
All Operating System environments |
Trigger |
Setting ACLs while node recovery or a disk error happens. Node recovery or a disk error invalidates the internal cached ACL data, so there is a small window during which setting ACLs can cause unassigned memory access. |
Workaround |
Avoid setting ACLs on inodes |
|
5.1.9.5 |
All Scale Users |
IJ51363 |
High Importance
|
Scanning a directory with policy or the Scale gpfs_ireaddir64 API has degraded since the 5.1.3 release.
(show details)
Symptom |
Performance impact |
Environment |
All Operating Systems |
Trigger |
Run policy job or use Scale gpfs_ireaddir64 API to scan directory in Scale file system |
Workaround |
None |
|
5.1.9.5 |
policy or gpfs_ireaddir64 API |
IJ51704 |
High Importance
|
Triggering recovery on an IW fileset (by running ls -l on the root of the fileset) with the afmIOFlag afmRecoveryUseFset set on it causes a deadlock, which resolves itself after almost 10 minutes (300 retries of queueing of Getattr for the ls command).
(show details)
Symptom |
Deadlock |
Environment |
Linux Only |
Trigger |
Triggering Recovery on the fileset using ls -l. |
Workaround |
Don't trigger recovery on an IW fileset using "ls -l" on the fileset. Instead use the makeActive subcommand of mmafmctl. |
|
5.1.9.5 |
AFM |
IJ51705 |
High Importance
|
1. Introduce a new config option - afmSkipPtrash - to skip moving files to the .ptrash directory.
2. Also add a mmafmctl subcommand "emptyPtrash" to clean up the .ptrash directory without relying on rm -rf, similar to the --empty-ptrash flag of prefetch.
(show details)
Symptom |
Unexpected Behavior |
Environment |
All OS Environments |
Trigger |
Need for a separate command that can assist in deletion of .ptrash directory contents. |
Workaround |
Manually clean up the .ptrash directory by hand, or perform --empty-ptrash with the prefetch subcommand. |
|
5.1.9.5 |
AFM |
IJ51706 |
High Importance
|
afmCheckRefreshDisable is today a cluster-level tunable that prevents refresh from going to the filesystem itself, returning from the dcache instead. But when tuned, it applies to all AFM filesets in the cluster. A fileset-level tunable is needed to do the same, so that it does not impact all other filesets in the cluster as it does today.
(show details)
Symptom |
Unexpected Behavior |
Environment |
All OS Environments |
Trigger |
Enabling afmCheckRefreshDisable config at the cluster level. |
Workaround |
None |
|
5.1.9.5 |
AFM |
IJ51707 |
High Importance
|
For some threshold events, the system pushes them from the Threshold to the Filesystem component internally. Due to misaligned data, the event could get suppressed.
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
ALL Linux OS environments |
Trigger |
Some parts get sorted, and if the fileset name sorts before the file system name, the issue hits. This is most likely with uppercase fileset names (assuming a lowercase file system name) |
Workaround |
Instead of using the default threshold rule, create one that only looks at the absolute gpfs_fset_freeInodes number to reliably raise events, or check inode usage over time in the GUI |
|
5.1.9.5 |
• System Health • perfmon (Zimon) |
IJ51708 |
High Importance
|
When the dynamic pagepool is enabled, pagepool memory shrinks slowly while a memory-consuming application is requesting memory
(show details)
Symptom |
Abend/Crash |
Environment |
ALL Linux OS environments |
Trigger |
Low system memory |
Workaround |
None |
|
5.1.9.5 |
All Scale Users |
IJ51709 |
High Importance
|
Pagepool growth is rejected due to recent pagepool change history
(show details)
Symptom |
Performance Impact/Degradation |
Environment |
ALL Linux OS environments |
Trigger |
Daemon startup |
Workaround |
None |
|
5.1.9.5 |
All Scale Users |
IJ51710 |
High Importance
|
Memory allocation from the shared segment failed
(show details)
Symptom |
Abend/Crash |
Environment |
ALL Linux OS environments |
Trigger |
pagepool growing and shrinking |
Workaround |
None |
|
5.1.9.5 |
All Scale Users |
IJ51711 |
High Importance
|
If a mount of a symbolic link is attempted on an existing symlink to a directory, it ends up creating a symbolic link with the same name as the source inside the target directory. Since the DR is mostly RO in nature, the operation gets an E_ROFS and these failures are printed to the log.
(show details)
Symptom |
Unexpected Behavior |
Environment |
Linux Only |
Trigger |
Remount of AFM DR target over the NSD backend target. |
Workaround |
None |
|
5.1.9.5 |
AFM |
IJ51712 |
High Importance
|
mmwmi.exe is a helper utility on Windows which is used by various mm* administration scripts to query various system settings such as IP addresses, mounted filesystems and so on. Under certain conditions such as active realtime scanning by security endpoints and anti-malwares, the output of mmwmi is not sent to stdout and any connected pipes that depend on it. This can cause various GPFS configuration scripts, such as mmcrcluster to fail.
(show details)
Symptom |
Unexpected Results/Behavior. |
Environment |
Windows/x86_64 only. |
Trigger |
Execute GPFS administration commands (such as mmcrcluster) during active anti-virus and anti-malware realtime scanning. |
Workaround |
None. |
|
5.1.9.5 |
Admin Commands. |
IJ51713 |
High Importance
|
The problem here is that during conversion a wrong target is specified, with the protocol as jttp instead of http. This leads parsePcacheTarget to find the target invalid, but the code later tries to persist a NULL to disk, where the assert goes off.
(show details)
Symptom |
Crash |
Environment |
All OS Environments |
Trigger |
Converting a non-AFM fileset to AFM MU mode with wrong protocol in the target. |
Workaround |
Specify one of the correct supported protocols. |
|
5.1.9.5 |
AFM |
IJ51781 |
Suggested |
"mmperfmon delete" shows a usage string, referencing "usage: perfkeys delete [-h]" instead of the proper usage.
(show details)
Symptom |
Error output/message |
Environment |
Linux Only |
Trigger |
"mmperfmon delete" should be used |
Workaround |
None |
|
5.1.9.5 |
perfmon (ZIMON) |
IJ51782 |
Suggested |
Customers are getting "SyntaxWarning: invalid escape sequence" errors when "mmperfmon" is used for custom scripting.
(show details)
Symptom |
Error output/message |
Environment |
Linux Only |
Trigger |
"mmperfmon" should be used for custom scripting. |
Workaround |
None |
|
5.1.9.5 |
perfmon (ZIMON) |
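The class of warning involved can be reproduced outside of mmperfmon. A minimal sketch in plain Python (unrelated to the Scale code itself) shows why a backslash in a normal string literal triggers it and how a raw string avoids it:

```python
import warnings

def warning_names(src):
    """Compile a snippet and return the names of the warnings it raises."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        compile(src, "<snippet>", "exec")
    return [w.category.__name__ for w in caught]

# The compiled source text is: pattern = "\d+"
# "\d" is an invalid escape sequence: DeprecationWarning on older
# Python 3 releases, SyntaxWarning on Python 3.12 and later.
bad = warning_names('pattern = "\\d+"')

# The raw-string form r"\d+" is the usual fix and raises nothing.
good = warning_names('pattern = r"\\d+"')
```

Scripts that wrap mmperfmon output with regular expressions can silence the warning the same way, by switching to raw strings.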
IJ51783 |
High Importance
|
Recovery is not syncing old directories
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
Linux Only |
Trigger |
1. Create a fileset without creating the target bucket.
2. Create 2 directories; the fileset will be in an unmounted state.
3. Stop and start the fileset.
4. Create another new directory. |
Workaround |
Instead of a mkdir operation, run a touch operation; it will sync the old directories in case of recovery. |
|
5.1.9.5 |
AFM |
IJ51784 |
High Importance
|
mmafmcosconfig fails to create an afmcos fileset in a sudo-configured setup
(show details)
Symptom |
Error output/message |
Environment |
Linux Only |
Trigger |
On a sudo-configured setup, try to create a fileset using mmafmcosconfig; it fails with an error. |
Workaround |
None |
|
5.1.9.5 |
AFM |
IJ51785 |
High Importance
|
Not able to initialize a download when the fileset is in Dirty state
(show details)
Symptom |
Error output/message |
Environment |
Linux Only |
Trigger |
Create an IW fileset and create one file under it; while the fileset is in Dirty state, try running a download. It gives an error. |
Workaround |
None |
|
5.1.9.5 |
AFM |
IJ51786 |
Critical |
AFM fileset is going into NeedsResync state due to replication of a file whose parent directory is local.
(show details)
Symptom |
Fileset in needsResync state. |
Environment |
Linux. |
Trigger |
Upload of files from AFM fileset where parent is local. |
Workaround |
None |
|
5.1.9.5 |
AFM |
IJ51787 |
High Importance
|
When a large number of secure connections are created at the same time between the mmfsd daemon instances in a Scale cluster, some of the secure connections may fail as a result of timeouts, resulting in unstable cluster operations.
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
ALL |
Trigger |
Rebooting all nodes of a large Scale cluster at the same time. |
Workaround |
Stage the rebooting of nodes in large Scale clusters such that they don't reboot at the same time. |
|
5.1.9.5 |
GPFS Core |
IJ51843 |
High Importance
|
Kernel crashes with the following assert message:
GPFS logAssertFailed: vinfoP->viInUse.
(show details)
Symptom |
Crash |
Environment |
Linux Only |
Trigger |
The problem can happen when closing a file opened via NFS. |
Workaround |
None |
|
5.1.9.5 |
NFS exports |
IJ51844 |
High Importance
|
Newer versions of the libmount1 package that are installed by default on SUSE 15 SP6 filter out the device name from GPFS mount options, causing the mount to fail.
(show details)
Symptom |
Mount failure |
Environment |
Linux SLES15 SP6 |
Trigger |
System should have a libmount1 package installed that is > 2.37. |
Workaround |
Downgrade libmount1 package if possible. |
|
5.1.9.5 |
GPFS core |
IJ51845 |
Critical |
AFM Gateway node reboot due to Out of memory exception. There is memory leak while doing the upload/reconcile operation in MU mode fileset.
(show details)
Symptom |
OOM exception on AFM Gateway node. |
Environment |
Linux. |
Trigger |
Upload of files from AFM MU mode fileset to COS. |
Workaround |
None |
|
5.1.9.5 |
AFM |
IJ50654 |
High Importance
|
mmshutdown caused a kernel crash while calling dentry_unlink_inode, with a backtrace like this:
...
#10 page_fault at ffffffff8d8012e4
#11 iput at ffffffff8cef25cc
#12 dentry_unlink_inode at ffffffff8ceed5d6
#13 __dentry_kill at ffffffff8ceedb6f
#14 dput at ffffffff8ceee480
#15 __fput at ffffffff8ced3bcd
#16 ____fput at ffffffff8ced3d7e
#17 task_work_run at ffffffff8ccbf41f
#18 do_exit at ffffffff8cc9f69e
(show details)
Symptom |
Kernel crash during mmshutdown |
Environment |
All Linux OS environments |
Trigger |
Kernel crash with dentry_unlink_inode when running mmshutdown. For a normal open(), the kernel calls fops_get, which is a call to try_module_get. The fix: we need to call try_module_get when we install cleanupFD. This will hold the module in place until gpfs_f_cleanup (called when the last mmfsd process terminates; allows basic cleanup for the next daemon startup) has been called for the cleanupFD. |
Workaround |
None |
|
5.1.9.5 |
All Scale Users (Linux) |
IJ51332 |
High Importance
|
GPFS daemon could assert unexpectedly with EXP(REGP != _NULL) in file alloc.C. This could occur on client nodes where there are active block allocation activities.
(show details)
Symptom |
Abend/Crash |
Environment |
ALL Operating System environments |
Trigger |
Block allocation and deallocation activities on a client node |
Workaround |
None |
|
5.1.9.5 |
All Scale Users |
IJ50480 |
High Importance
|
Long ACL garbage collection runs in the filesystem manager can cause lock conflicts with nodes that need to retrieve ACLs during garbage collection. The conflicts will resolve after garbage collection has finished.
(show details)
Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
Environment |
All Operation System environments |
Trigger |
- Set new and unique ACLs for inodes
- Delete inodes with ACLs
- After some time, the ACL GC is started to clean up unreferenced ACLs (ACLs that no inodes reference)
- During the ACL GC run, retrieve the ACLs from existing inodes in the filesystem |
Workaround |
- Avoid setting new and unique ACLs for the inodes or,
- Change the filesystem manager to another node to stop the current garbage collection run or,
- Wait for the ACL GC to finish or,
- Use mode bits instead of ACLs |
|
5.1.9.5 |
All Scale Users |
IJ51846 |
Suggested |
Due to a locale issue, a few callhome commands were gathering region-specific data, which caused an error in AoAtool while parsing this data
(show details)
Symptom |
Performance Impact/Degradation |
Environment |
Linux Only |
Trigger |
Run the callhome commands on a non-English locale |
Workaround |
None |
|
5.1.9.5 |
Callhome |
IJ51864 |
High Importance
|
Crash of the node mounting a filesystem or while starting the node.
(show details)
Symptom |
Crash |
Environment |
Linux Only |
Trigger |
A failure or error condition hit during the parsing of fstab entry. |
Workaround |
None |
|
5.1.9.5 |
Filesystem mount |
IJ51908 |
High Importance
|
When the dynamic pagepool is enabled, the pagepool may not shrink because a pagepool grow is still in progress, which results in out of memory
(show details)
Symptom |
Abend/Crash |
Environment |
ALL Linux OS environments |
Trigger |
Pagepool growing and shrinking |
Workaround |
None |
|
5.1.9.5 |
Dynamic pagepool |
IJ51909 |
Suggested |
There are a few occasions where error code 809 may be used inside the CCR Quorum management component. Although not user actionable, the product was changed to make some note of this in mmfs.log instead of, as had been the case, only making it available in GPFS Trace. The intent is to improve RAS in certain situations.
(show details)
Symptom |
N/a |
Environment |
All |
Trigger |
N/a |
Workaround |
Use "mmlsmgr" before running "mmchnode --nonquorum" to determine the current cluster manager. If the node to be changed is the current cluster manager, then use mmlsmgr to determine the new cluster manager and then, if necessary, run mmchmgr to make an explicit choice of cluster manager. |
|
5.1.9.5 |
CCR |
IJ51011 |
Suggested |
Nessus vulnerability scanner found HSTS communication is not enforced on mmsysmon port 9980
(show details)
Symptom |
Nessus vulnerability scan finding/record (medium severity) |
Environment |
Linux |
Trigger |
Nessus vulnerability scan |
Workaround |
None |
|
5.1.9.4 |
mmsysmon on GUI/pmcollector node |
IJ50232 |
High Importance
|
The automated node expel mechanism (see references to the mmhealthPendingRPCExpelThreshold configuration parameter) uses the internal mmsdrcli to issue an expel request to a node in the home cluster. If the sdrNotifyAuthEnabled configuration parameter is set to false (not recommended), then the command will fail with a message like the following:
[W] The TLS handshake with node 192.168.132.151 failed with error 410 (client side). mmsdrcli: [err 144] Connection shut down, and the expel request will fail.
(show details)
Symptom |
Error output/message |
Environment |
ALL Operating System environments |
Trigger |
The problem is triggered by the following (all conditions required) :
- The sdrNotifyAuthEnabled configuration parameter is set to false
- Automated expel is enabled via the mmhealthPendingRPCExpelThreshold configuration parameter
- A node becomes hung or otherwise unable to respond to a token revoke request |
Workaround |
Set the sdrNotifyAuthEnabled configuration parameter to 'true' and then restart (in a rolling fashion) all the nodes on both home and client clusters. Once that is done, the mmsdrcli command should no longer fail.
See also https://www.ibm.com/support/pages/node/6560094 |
|
5.1.9.4 |
System Health
(even though the fix is not in system health itself) |
IJ51036 |
Medium Importance |
mm{add|del}disk will fail, triggered by signal 11.
(show details)
Symptom |
failure of the command. |
Environment |
Linux Only |
Trigger |
Run mm{add|del}disk with multiple NSDs. |
Workaround |
The problem/symptom would typically occur when there are multiple NSDs added/deleted with the command. By running
the command with one NSD at a time, we can avoid the problem/symptom. |
|
5.1.9.4 |
disk configuration and region management |
IJ51037 |
Suggested |
mmkeyserv returns an error when used to delete a previously deleted tenant, instead of returning a success return code.
(show details)
Symptom |
Failure to remove an already deleted tenant. |
Environment |
ALL Linux OS environments |
Trigger |
Remove a Scale tenant from the GKLM server prior to invoking the 'mmkeyserv tenant delete' command. |
Workaround |
mmkeyserv can be used with the --force option to remove the Scale definition of a deleted tenant. |
|
5.1.9.4 |
File System Core |
IJ51057 |
Medium Importance |
From a Windows client, in the MMC permissions tab on a share, the ACL listing was always showing as Everyone. If a subdirectory inside a subdirectory is deleted, traversal to the inner subdirectory in a snapshot that was taken before showed errors.
(show details)
Symptom |
Error output/message |
Environment |
ALL Operating System environments |
Trigger |
The MMC permissions problem can happen when viewing permissions from the MMC permissions tab for a share. Snapshot traversal would show problems from the client when a sub-directory within a sub-directory is deleted on the actual filesystem and the same path is accessed in the snapshot directory. |
Workaround |
None |
|
5.1.9.4 |
SMB |
IJ51148 |
High Importance
|
find or download all, when run on a given path, sets the time for each of the individual entities with respect to COS and ends up blocking a following revalidation from fetching actual changes on the object's metadata from the COS to the cache.
(show details)
Symptom |
Unexpected Behavior |
Environment |
All OS Environments |
Trigger |
Performing an ls on directory before trying lookup for the file. |
Workaround |
None |
|
5.1.9.4 |
AFM |
IJ51149 |
High Importance
|
Due to an issue with the way mmfsckx scans compressed files and internally stores information to detect inconsistent compressed groups, mmfsckx will report and/or repair false positive inconsistencies for compressed files.
The mmfsckx output will report something like the following, for example:
!Inode 791488 snap 6 fset 6 "user file" indirect block 1 level 1 @4:13508288: disk address (ditto) in slot 0 replica 0 pointing to data block 226 code 2012 is invalid
(show details)
Symptom |
False positive corrections by mmfsckx |
Environment |
ALL Operating System environments |
Trigger |
mmfsckx run on file system having compressed files |
Workaround |
Run offline mmfsck |
|
5.1.9.4 |
mmfsckx |
IJ51150 |
High Importance
|
mmfsckx captures allocation and deallocation information of blocks from remote client nodes or non-participating nodes that mount the file system while mmfsckx is running. Once the file system gets unmounted from these nodes, it stops the capture of such information. But due to an issue, mmfsckx was stopping the capture before the complete unmount event ended, which led to mmfsckx reporting and/or repairing false positive lost blocks, bad (incorrectly allocated) blocks, and duplicate blocks.
(show details)
Symptom |
False positive corrections by mmfsckx |
Environment |
ALL Operating System environments |
Trigger |
File system is unmounted on remote client or non-participating nodes while mmfsckx is running |
Workaround |
Run offline mmfsck |
|
5.1.9.4 |
mmfsckx |
IJ51252 |
Suggested |
Prefetch command fails but returns error code 0
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
Linux Only |
Trigger |
Run prefetch on a non-existent file. |
Workaround |
None |
|
5.1.9.4 |
AFM |
IJ51225 |
High Importance
|
There is a build failure while executing the mmbuildgpl command. The failure is seen while compiling /usr/lpp/mmfs/src/gpl-linux/kx.c due to there being no member named '__st_ino' in struct stat.
Please refrain from upgrading to affected kernel and/or OpenShift levels until a fix is available
(show details)
Symptom |
Build failure while building kernel gpl modules. |
Environment |
Linux Only
OpenShift 4.13.42+, 4.14.14.25+, 4.15.13+ with IBM Storage Scale Container Native, Fusion with GDP |
Trigger |
The problem could be triggered by newer kernels containing Linux kernel commit 5ae2702d7c482edbf002499e23a2e22ac4047af1 |
Workaround |
None |
|
5.1.9.4 |
Build / Installation |
IJ51222 |
Suggested |
If a problem with an encryption server happens, just the RkmId is visible in the default "mmhealth node show" view. Furthermore, there are two monitoring mechanisms: one uses an index to convey whether the main or backup server is affected, and one directly uses the hostname or IP for that. Moreover, the usual way to resolve an event with "mmhealth event resolve" has been broken for that component.
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
ALL Linux OS environments |
Trigger |
Code change to have RkmId as the common ground for old and new monitoring methods. |
Workaround |
mmhealth node show encryption -v (or -Y) to see the server information. mmhealth node eventlog -Y would work as well. Resolving the event needs a "mmsysmoncontrol restart" |
|
5.1.9.4 |
System Health |
IJ51160 |
Suggested |
Daemon assert gets triggered when afmLookupMapSize is set to a higher value of 32. The supported range is only 0 to 30.
(show details)
Symptom |
Abend/Crash |
Environment |
All OS environments |
Trigger |
Setting afmLookupMapSize to the high value of 32. |
Workaround |
Set the value of afmLookupMapSize in the 0 to 30 range only. |
|
5.1.9.4 |
AFM |
IJ51282 |
High Importance
|
mmrestoreconfig also restores the fileset configuration of a file system. If the cluster version (minReleaseLevel) is below 5.1.3.0, the fileset restore will fail because it tries to restore the fileset permission inherit mode even if it is the default. The permission inherit mode was not enabled before Storage Scale version 5.1.3.0.
(show details)
Symptom |
• Error output/message • Unexpected Results/Behavior |
Environment |
ALL Linux OS environments |
Trigger |
mmbackupconfig and mmrestoreconfig on Storage Scale Product version 5.1.3.0 or higher while the cluster minReleaseLevel is below 5.1.3.0. |
Workaround |
N/A |
|
5.1.9.4 |
Admin Commands SOBaR
|
IJ51283 |
Suggested |
The mmchnode and mmumount commands did not clean up temporary node files in /var/mmfs/tmp.
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
ALL Operating System environments |
Trigger |
Run mmchnode and mmumount command. |
Workaround |
Manually remove leftover tmp files from mmchnode and mmumount |
|
5.1.9.4 |
Admin Commands |
IJ51286 |
High Importance
|
GPFS daemon could unexpectedly fail with signal 11 when mounting a file system if file system quiesce is triggered during the mount process.
(show details)
Symptom |
Abend/Crash |
Environment |
ALL Operating System environments |
Trigger |
File system quiesce triggered via file system command while file system mount is in progress |
Workaround |
Avoid running commands that trigger file system quiesce while client nodes are in the process of mounting the file system |
|
5.1.9.4 |
All Scale Users |
IJ51265 |
Critical |
It is possible for the EA overflow block to be corrupted as a result of log recovery after a node failure. This can lead to the loss of some extended attributes that cannot be stored in the inode.
(show details)
Symptom |
Operation failure due to FS corruption |
Environment |
ALL Operating System environments |
Trigger |
Node failure after repeated extended attributes operations which trigger creation and deletion of overflow block |
Workaround |
None |
|
5.1.9.4 |
All Scale Users |
IJ51344 |
Critical |
When writing to a memory-mapped file, there is a chance that incorrect data could be written to the file before and after the targeted write range
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
ALL Operating System environments |
Trigger |
Writing to memory-mapped files with offsets and lengths unaligned to the internal buffer range size (usually subblock size or 4k) could cause incorrect data to be written before and after the targeted write range |
Workaround |
Stop using memory-mapping |
|
5.1.9.4 |
All scale users |
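The trigger can be illustrated with a small filesystem-agnostic sketch (temporary file, illustrative sizes). On a correct system the assertions hold; the defect could corrupt the bytes before and after the targeted write range:

```python
import mmap
import os
import tempfile

# Two pages of known data.
fd, path = tempfile.mkstemp()
os.write(fd, b"A" * 8192)
os.close(fd)

# A 4-byte write through the mapping, straddling a 4 KiB boundary,
# i.e. unaligned to the internal buffer range size named in the trigger.
with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)
    mm[4094:4098] = b"XXXX"
    mm.flush()
    mm.close()

data = open(path, "rb").read()
assert data[:4094] == b"A" * 4094   # bytes before the range untouched
assert data[4094:4098] == b"XXXX"   # the targeted range
assert data[4098:] == b"A" * 4094   # bytes after the range untouched
os.unlink(path)
```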
IJ49992 |
Suggested |
If the local cluster nistCompliance value is off, the mmremotecluster and mmauth commands fail with an unclear error message.
(show details)
Symptom |
Error output/message |
Environment |
ALL Operating System environments |
Trigger |
Running the mmauth and mmremotecluster commands to add or update a remote cluster while the local cluster is not NIST compliant. |
Workaround |
Fix the error and reissue the command. |
|
5.1.9.3 |
All Scale Users |
IJ50066 |
Suggested |
Creating an AFM LU mode fileset from a filesystem to a target in the same filesystem (snapshot) using the NSD backend fails with error 1.
This happens because another code fix had an unintended consequence for this code path.
(show details)
Symptom |
Unexpected Behavior |
Environment |
Linux Only |
Trigger |
Creating a LU mode fileset to a target (snapshot) in the same filesystem. |
Workaround |
None |
|
5.1.9.3 |
AFM |
IJ50067 |
High Importance
|
When afmResyncVer2 is run with afmSkipResyncRecovery set to yes, the priority directories that AFM usually queues should not be queued, since such directories might exist under parents that are not yet in sync, leading to error 112.
(show details)
Symptom |
Unexpected Behavior |
Environment |
Linux Only |
Trigger |
Run afmResyncVer2 on a fileset which has never synced to home, and set afmSkipResyncRecovery on it. |
Workaround |
Fall back to Resync Version 1, where afmSkipResyncRecovery has effect. |
|
5.1.9.3 |
AFM |
IJ50068 |
High Importance
|
This APAR addresses two issues related to NFS-Ganesha that can cause crashes. Here are the details:
(gdb) bt
#0 0x00007fff88239a68 in raise ()
#1 0x00007fff8881ffb8 in crash_handler (signo=11, info=0x7ffb42abbe48, ctx=0x7ffb42abb0d0)
#3 0x00007fff888da5f4 in atomic_add_int64_t (augend=0x148, addend=1)
#4 0x00007fff888da658 in atomic_inc_int64_t (var=0x148)
#5 0x00007fff888de44c in _get_gsh_export_ref (a_export=0x0)
#6 0x00007fff8888c6c0 in release_lock_owner (owner=0x7ffef94a1cc0)
#7 0x00007fff88923e9c in nfs4_op_release_lockowner (op=0x7ffef922be60, data=0x7ffef954d290, resp=0x7ffef8629c30)
#8 0x00007fff888fb810 in process_one_op (data=0x7ffef954d290, status=0x7ffb42abcdf4)
#9 0x00007fff888fcc9c in nfs4_Compound (arg=0x7ffef95eec38, req=0x7ffef95ee410, res=0x7ffef8ce4b40)
#10 0x00007fff88819130 in nfs_rpc_process_request (reqdata=0x7ffef95ee410, retry=false)
#11 0x00007fff88819864 in nfs_rpc_valid_NFS (req=0x7ffef95ee410)
#12 0x00007fff88750618 in svc_vc_decode (req=0x7ffef95ee410)
#13 0x00007fff8874a8f4 in svc_request (xprt=0x7fff30039ca0, xdrs=0x7ffef95eb400)
#14 0x00007fff887504ac in svc_vc_recv (xprt=0x7fff30039ca0)
#15 0x00007fff8874a82c in svc_rqst_xprt_task_recv (wpe=0x7fff30039ed8)
#16 0x00007fff8874b858 in svc_rqst_epoll_loop (wpe=0x10041cc5cb0)
#17 0x00007fff8875b22c in work_pool_thread (arg=0x7ffdcd1047d0)
#18 0x00007fff88229678 in start_thread ()
#19 0x00007fff880d8938 in clone ()
Or
(gdb) bt
#0 0x00007f96f58d9b8f in raise ()
#1 0x00007f96f75c6633 in crash_handler (signo=11, info=0x7f96ad9fc9b0, ctx=0x7f96ad9fc880) a
#3 dec_nfs4_state_ref (state=0x7f9640465440)
#4 0x00007f96f76762f9 in dec_state_t_ref (state=0x7f9640465440)
#5 0x00007f96f767640c in nfs4_op_free_stateid (op=0x7f8dec12fba0, data=0x7f8dec1992b0, resp=0x7f8dec04ce70)
#6 0x00007f96f766dbae in process_one_op (data=0x7f8dec1992b0, status=0x7f96ad9fe128)
#7 0x00007f96f766ee80 in nfs4_Compound (arg=0x7f8dec110ab8, req=0x7f8dec110290, res=0x7f8dec5b7db0)
#8 0x00007f96f75c17db in nfs_rpc_process_request (reqdata=0x7f8dec110290, retry=false)
#9 0x00007f96f75c1cf1 in nfs_rpc_valid_NFS (req=0x7f8dec110290)
#10 0x00007f96f733edfd in svc_vc_decode (req=0x7f8dec110290)
#11 0x00007f96f733ac61 in svc_request (xprt=0x7f95d00c4a60, xdrs=0x7f8dec18dd00)
#12 0x00007f96f733ed06 in svc_vc_recv (xprt=0x7f95d00c4a60)
#13 0x00007f96f733abe1 in svc_rqst_xprt_task_recv (wpe=0x7f95d00c4c98)
#14 0x00007f96f73462f6 in work_pool_thread (arg=0x7f8ddc0cc2f0)
#15 0x00007f96f58cf1ca in start_thread ()
#16 0x00007f96f5119e73 in clone ()
(show details)
Symptom |
Abend/Crash |
Environment |
Linux Only |
Trigger |
The crash occurs when the NFSv4 client attempts to access and delete a file simultaneously through different processes or threads, potentially leading to timing issues. |
Workaround |
None |
|
5.1.9.3 |
NFS-Ganesha crash followed by CES-IP failover. |
IJ49856 |
Critical |
Multi-threaded applications that issue mmap I/O and I/O system calls concurrently can hit a deadlock on the buffer lock. This is likely not a common pattern, but this problem has been observed with database applications.
(show details)
Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
Environment |
ALL Operating System environments |
Trigger |
Multiple applications perform reads/writes to the same file at the same time. |
Workaround |
Avoid concurrent reads/writes to the same file from multiple processes. |
|
5.1.9.3 |
All Scale Users |
IJ50208 |
High Importance
|
Multi-threaded applications that issue mmap I/O and I/O system calls concurrently can hit a deadlock on the buffer lock. This is likely not a common pattern, but this problem has been observed with database applications.
(show details)
Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
Environment |
ALL Linux OS environments |
Trigger |
This problem is a race between three threads within the same process:
1) One thread accessing data in a mmap'ed GPFS file.
2) A second thread issuing any system calls that modifies the memory layout of the process (e.g. mmap, munmap, ...)
3) A third thread issuing an I/O system call (read or write) that accesses the same file and the same offset as thread 1, where accessing the userspace buffer also hits a page fault.
|
Workaround |
Since this problem is specific to a newer codepath, this codepath can be disabled through a hidden config setting: mmchconfig mmapOptimizations=0 |
|
5.1.9.3 |
All Scale Users |
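The three-way race described in the trigger can be sketched in ordinary Python (temporary file, illustrative loop counts; this shows the access pattern, not GPFS internals). On a fixed system the pattern completes, whereas the defect could deadlock on the buffer lock:

```python
import mmap
import os
import tempfile
import threading

fd, path = tempfile.mkstemp()
os.write(fd, b"B" * 4096)
os.close(fd)

stop = threading.Event()

def mmap_reader():
    # Thread 1: access data through a memory mapping of the file.
    with open(path, "rb") as f:
        mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        while not stop.is_set():
            _ = mapped[0]
        mapped.close()

def layout_changer():
    # Thread 2: mmap/munmap calls that modify the process memory layout.
    while not stop.is_set():
        scratch = mmap.mmap(-1, 4096)
        scratch.close()

t1 = threading.Thread(target=mmap_reader)
t2 = threading.Thread(target=layout_changer)
t1.start()
t2.start()

# Thread 3 (here, the main thread): a read() system call into a buffer
# that is itself a fresh mapping, so filling it takes a page fault.
buf = mmap.mmap(-1, 4096)
with open(path, "rb") as f:
    for _ in range(200):
        f.seek(0)
        f.readinto(buf)
result = bytes(buf)
buf.close()

stop.set()
t1.join()
t2.join()
os.unlink(path)
```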
IJ50209 |
Suggested |
Setting security header as suggested by RFC 6797
(show details)
Symptom |
Unexpected Results/Behavior [not really, unless one really looks at the returned header fields of the HTTP response - body data is not affected] |
Environment |
ALL Linux OS environments |
Trigger |
Running Scale 5.1.2 or later |
Workaround |
None |
|
5.1.9.3 |
perfmon (Zimon) |
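For context, RFC 6797 enforcement amounts to the server attaching a Strict-Transport-Security header to every HTTPS response. A minimal sketch of the header construction (illustrative values, not the Scale implementation):

```python
def hsts_header(max_age_seconds=31536000, include_subdomains=True):
    """Build the RFC 6797 header an HSTS-enforcing server would send.

    max_age_seconds is how long clients should force HTTPS for the host.
    """
    value = "max-age=%d" % max_age_seconds
    if include_subdomains:
        value += "; includeSubDomains"
    return ("Strict-Transport-Security", value)

name, value = hsts_header()
```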
IJ50210 |
High Importance
|
With File Audit Logging (FAL) enabled, when a change to the policy file happens and the LWE garbage collector runs for FAL, there is a small window in which a deadlock can occur, with the long waiter message 'waiting for shared ThSXLock' seen for the PolicyCmdThread.
(show details)
Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
Environment |
Linux |
Trigger |
- Enable FAL
- Generate events in the file system
- Make a change to the policy file
|
Workaround |
- Restart GPFS, and/or
- Disable FAL |
|
5.1.9.3 |
File Audit Logging |
IJ50211 |
High Importance
|
During a mount operation of the file system, updating LWE configuration information for File Audit Logging before the Fileset metadata file (FMF) is initialized results in signal 11: NotGlobalMutexClass::acquire() + 0x10 at mastSMsg.C:44
(show details)
Symptom |
Abend/Crash |
Environment |
Linux |
Trigger |
- Enable FAL
- Mount the file system
- The update to the LWE config information happens before the FMF is initialized during the mount operation |
Workaround |
Disable FAL |
|
5.1.9.3 |
File Audit Logging |
IJ50320 |
Critical |
AFM fileset is going into NeedsResync state due to replication of a file whose parent directory is local.
(show details)
Symptom |
Fileset in needsResync state. |
Environment |
Linux |
Trigger |
Upload of files from AFM fileset where parent is local |
Workaround |
None |
|
5.1.9.3 |
AFM |
IJ50321 |
High Importance
|
When a thread is flushing the file metadata of the ACL file to disk, there's a small window that a deadlock can occur when a different thread tries to get a Windows security descriptor, as getting the security descriptor requires reading the ACL file.
(show details)
Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
Environment |
Windows |
Trigger |
- Make changes to Windows Security Descriptors/ACLs and ensure the changes go to the disk
- Retrieve the Windows Security Descriptor of inodes |
Workaround |
- Avoid setting and getting Windows Security Descriptor at the same time
- Restart GPFS |
|
5.1.9.3 |
All Scale Users |
IJ50323 |
High Importance
|
When checking the block alloc map, mmfsckx excludes the regions that are being checked or are already checked from further updates in the internal shadow map. But when checking for such excluded regions, it was not checking which poolId the region belonged to. This resulted in mmfsckx not updating the shadow map for a region belonging to one pool while checking the block alloc map for the same region belonging to a different pool. This led to mmfsckx falsely marking blocks as lost blocks, and then later to this assert.
(show details)
Symptom |
Node assert |
Environment |
ALL Operating System environments |
Trigger |
Run mmfsckx with --repair |
Workaround |
Run offline mmfsck to fix corruptions |
|
5.1.9.3 |
mmfsckx |
IJ50035 |
High Importance
|
When RDMA verbsSend is enabled and the number of RDMA connections is larger than 16, a reconnect can cause a segmentation fault.
(show details)
Symptom |
Abend/Crash |
Environment |
ALL Operating System environments |
Trigger |
RDMA verbsSend and TCP reconnect |
Workaround |
None |
|
5.1.9.3 |
All Scale Users |
IJ50372 |
Critical |
O_TRUNC is not ignored correctly after a successful file lookup during atomic_open(), so truncation can happen during the open routine, before permission checks happen. This leads to a scenario in which a user on a different node can truncate a file for which they do not have write permission.
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
All Linux OS environments |
Trigger |
- A file is created under a user with no write permissions for group and others (e.g. mode 644) on one node
- A user on a different node atomic-opens the file with the O_TRUNC flag and tries to write to it |
Workaround |
- Avoid using O_TRUNC with atomic_open() |
|
5.1.9.3 |
All Scale Users |
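The flag's semantics can be shown locally with a short sketch (temporary file; plain POSIX behavior, not GPFS internals). O_TRUNC empties the file as a side effect of open() itself, which is why performing it before the cross-node permission check is dangerous:

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"important data")
os.close(fd)

# The truncation happens inside open(); the defect allowed this to run
# before GPFS had verified the caller's write permission on the file.
fd = os.open(path, os.O_WRONLY | os.O_TRUNC)
os.close(fd)

size_after = os.path.getsize(path)
os.unlink(path)
```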
IJ50373 |
Suggested |
For certain performance monitoring operations in the case of an error the query and response get logged. That response can be large and logging it regularly will cause mmsysmon.log to grow rapidly.
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
ALL Linux OS environments |
Trigger |
Incomplete delete of performance metrics. Some checks are done on so-called “measurements”, which take several metrics and calculate a composite result. If only a subset of the metrics for the calculation is available, the error is triggered. |
Workaround |
Changing the log level or altering the log line in the Python code. An alternative would be to eliminate the trigger, either by re-doing “mmperfmon delete -expiredKeys” or removing all collected performance data in /opt/IBM/zimon/data/ |
|
5.1.9.3 |
System Health / perfmon (Zimon) |
IJ50374 |
High Importance
|
With File Audit Logging (FAL) enabled, when deciding to run the LWE garbage collector for FAL, an attempt to try-acquire the lock on the policy file mutex is performed. If the policy file mutex is busy, the attempt is canceled and retried on the next attempt. Upon canceling, the policy file mutex can be released without being held, leading to the log assert.
(show details)
Symptom |
Abend/Crash |
Environment |
All Linux OS environments |
Trigger |
- FAL is enabled
- Generate events in the file system
- Make a change to the policy file
- Listing policy partitions
|
Workaround |
- Disable FAL if it is enabled |
|
5.1.9.3 |
File Audit Logging |
IJ50375 |
Critical |
GPFS daemon could assert unexpectedly with: Assert exp(0) in direct.C. This could happen on the file system manager node after a node failure.
(show details)
Symptom |
Abend/Crash |
Environment |
ALL Operating System environments |
Trigger |
Node failure after repeatedly creating and deleting the same file in a directory |
Workaround |
None |
|
5.1.9.3 |
All Scale Users |
IJ50439 |
High Importance
|
The ts commands do not always return the correct error code, providing incorrect results to the mm commands that call them and resulting in incorrect cluster operations.
(show details)
Symptom |
Failed cluster operations |
Environment |
All |
Trigger |
Incorrect cluster information or file system state |
Workaround |
None |
|
5.1.9.3 |
Core |
IJ50440 |
High Importance
|
mmfsckx fails to detect a file that has an illReplicated extended attribute overflow block, and in repair mode it will not mark the illReplicated flag on the file.
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
All supported |
Trigger |
mmfsckx run on a file system containing a file with an illReplicated extended attribute overflow block |
Workaround |
Run offline mmfsck to fix corruption |
|
5.1.9.3 |
mmfsckx |
IJ50441 |
High Importance
|
When scanning a compressed file, mmfsckx can in some cases incorrectly report the file as having a bad disk address.
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
All supported |
Trigger |
mmfsckx run on a file system having sparsely compressed files |
Workaround |
Run offline mmfsck to fix corruption |
|
5.1.9.3 |
mmfsckx |
IJ50442 |
High Importance
|
When scanning a file system that has a corrupted snapshot, mmfsckx can cause a node assert with logAssertFailed: countCRAs() == 0 && "likely a leftover cached inode in inode0 d'tor"
(show details)
Symptom |
Node assert |
Environment |
All supported |
Trigger |
When scanning a file system that has a corrupted snapshot |
Workaround |
Run offline mmfsck to fix corruption |
|
5.1.9.3 |
mmfsckx |
IJ50443 |
High Importance
|
AFM policy-generated intermediate files are always placed in the /var file system: /var/mmfs/tmp for resync/failover and /var/mmfs/afm for recovery. We have seen in customer setups that /var is provisioned very small, while other file systems may be better provisioned to handle such large files, such as /opt or even a location inside the fileset itself.
(show details)
Symptom |
Unexpected Behavior |
Environment |
Linux Only |
Trigger |
Run AFM recovery or resync on a fileset that is very large in terms of inode space, for example 100M inodes or more. |
Workaround |
None |
|
5.1.9.3 |
AFM |
IJ50463 |
High Importance
|
Stale data may be read while "mmchdisk start" is running.
(show details)
Symptom |
Either no symptom or an fsstruct assert |
Environment |
all |
Trigger |
Disks are marked down and data on the disks become stale before "mmchdisk start" is run. |
Workaround |
Stop all workload before running "mmchdisk start". |
|
5.1.9.3 |
All Scale Users |
IJ50563 |
Critical |
In a file system with replication configured, for a large file with more than 5000 data blocks, if some data blocks are miss-updated due to disk failures on one replica disk, these stale replicas will not be repaired when helper nodes are involved in repairing them.
(show details)
Symptom |
replica mismatch |
Environment |
All Operating Systems |
Trigger |
I/O errors on a disk cause it to be marked as "down", and some further write failures happen on a large file with more than 5000 data blocks; the down disk is then started with multiple participant nodes. |
Workaround |
Specify only the file system manager node as the participant node for the mmchdisk command. |
|
5.1.9.3 |
Scale Users |
IJ50577 |
High Importance
|
When there is a TCP network error, we try to reconnect the TCP connection, but the reconnect can fail with a "Connection timed out" error, which results in a node expel.
(show details)
Symptom |
Node expel/Lost Membership |
Environment |
ALL Operating System environments |
Trigger |
Poor network conditions that lead to a TCP connection reconnect |
Workaround |
No |
|
5.1.9.3 |
All Scale Users |
IJ50708 |
Critical |
In a file system with replication configured, the miss-update info set in the disk address could be overwritten by the log recovery process, leading to stale data being read and preventing the start disk process from repairing such stale replicas.
(show details)
Symptom |
replica mismatch |
Environment |
All Operating Systems |
Trigger |
I/O errors happen and generate miss-update info in the disk address of data blocks; a subsequent mmfsd daemon crash can then result in this problem. |
Workaround |
No |
|
5.1.9.3 |
All Operating Systems |
IJ50794 |
High Importance
|
Symbolic links may be incorrectly deleted during the offline mmfsck and may cause undetected data loss
(show details)
Symptom |
Offline mmfsck detects non-critical corruption (corrupt indirection level) and may delete the file if directed to.
mmfsck fsName -v -n
...
Error in inode 18177 snap 0: has corrupt indirection level 0
Delete inode? no |
Environment |
ALL Operating System environments |
Trigger |
When symbolic links are created in the filesystem with a current format version less than or equal to 3.5 and also with IBM Storage Scale V5.1.9.2, they may be incorrectly stored in the inode, even though the filesystem format does not support storing the symbolic links in the inode. This causes offline mmfsck to delete the incorrectly stored symbolic links. |
Workaround |
None |
|
5.1.9.3 |
General file system, creation of symbolic links. |
IJ50890 |
Suggested |
Metadata eviction gives an error from the second attempt onwards.
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
Linux Only |
Trigger |
Running metadata eviction multiple times triggers the issue |
Workaround |
None |
|
5.1.9.3 |
AFM |
IJ49762 |
High Importance
|
mmlsquota -d can cause the GPFS daemon to crash
(show details)
Symptom |
GPFS daemon can crash when displaying default quota (mmlsquota -d) if default quota is not on. |
Environment |
ALL Operating System environments |
Trigger |
Assertion happens because the entryType of the root quota entry has entry type e (explicit) instead of default state. The root quota entry type could have been changed if we edit the quota entry (entry type is changed to EXPLICIT_ENTRY) via the mmsetquota, mmedquota, or mmdefedquota commands. When displaying default quota limits (mmlsquota -d), if default quota is on, the entry type will revert to "default on", which would not cause the assertion. If default quota is off, the entryType remains e, hitting the assertion when displaying default quota limits. Fix: correct the mmlsquota -d processing so that the default quota status stored in root quota entries is updated to the expected values, based on quota options in sgDesc, avoiding the assertion. |
Workaround |
Enable default quota (all types: user, group, fileset) on the file system and then run mmlsquota -d. |
|
5.1.9.3 |
Quotas |
IJ49856 |
Critical |
Unexpected long waiters could appear with a fetch thread waiting on FetchFlowControlCondvar with reason 'wait for buffer for fetch'. This can happen when a workload causes all prefetch/writebehind threads to be assigned to prefetching.
(show details)
Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
Environment |
ALL Operating System environments |
Trigger |
Multiple applications perform reads/writes to the same file at the same time. |
Workaround |
Avoid concurrent reads/writes to the same file from multiple processes. |
|
5.1.9.3 |
All Scale Users |
IJ50061 |
High Importance
|
When mmfsckx is run on a file system such that it requires multiple scan passes to complete then mmfsckx can abort with reason "Assert failed "nEnqueuedNodes > 1"."
(show details)
Symptom |
Command aborts |
Environment |
ALL Operating System environments |
Trigger |
When mmfsckx is run on a file system such that it requires multiple inode scan passes to complete |
Workaround |
Increase the pagepool to make sure mmfsckx can run in a single scan pass. |
|
5.1.9.3 |
mmfsckx |
IJ49583 |
Suggested |
When an RDMA connection to a remote node has to be shut down due to network errors (e.g., the network link goes down), it can sometimes happen that the affected RDMA connection is not closed and the resources assigned to it (memory, VERBS Queue Pair, ...) are not freed.
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
ALL Linux OS environments |
Trigger |
verbsRdmaSend must be enabled. Loss of a RDMA connection to a node because of network errors in the RDMA fabric. |
Workaround |
No work around available |
|
5.1.9.2 |
RDMA |
IJ49584 |
High Importance
|
Spectrum Scale Erasure Code Edition interacts with third-party software/hardware APIs for internal disk enclosure management. If the management interface becomes degraded and starts to hang commands in the kernel, the hang may also block communication handling threads.
This causes a node to fail to renew its lease, causing it to be fenced off from the rest of the cluster, which may lead to additional outages. A previous APAR was issued for this in 5.1.4, but that fix was incomplete.
(show details)
Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
Environment |
Linux Only |
Trigger |
Degradation in back-end storage management that causes commands to hang in the kernel |
Workaround |
The node with hardware problems will show waiters 'Until NSPDServer discovery completes.' It is recommended to reboot nodes with such GPFS waiters exceeding 2 minutes if the node is also being expelled. |
|
5.1.9.2 |
ESS/GNR |
IJ49585 |
Suggested |
If a tiebreaker disk has outdated version info, ccrrestore can abort with Python 3 errors.
(show details)
Symptom |
CCR files will not get restored. |
Environment |
ALL Operating System environments |
Trigger |
Running "mmsdrrestore --ccr-repair" on a node that is upgraded to a newer release while a tiebreaker disk still has state data from a previous release. |
Workaround |
None |
|
5.1.9.2 |
CCR |
IJ49659 |
Critical |
AFM sets pcache attributes on the inode after reading an uncached file from home, modifying the inode while the file system is quiesced. The assert is hit because of this.
(show details)
Symptom |
Assert |
Environment |
All OS environments |
Trigger |
Reading an uncached file in AFM while the file system is quiesced |
Workaround |
None |
|
5.1.9.2 |
AFM |
IJ49586 |
High Importance
|
File systems that have a large number of independent filesets usually tend to have a sparse inode space.
If mmfsckx is run on such a file system with a large sparse inode space, it takes longer to run because it unnecessarily parses inode alloc map segments pointing to sparse inode spaces instead of skipping them.
(show details)
Symptom |
Slowness to complete run |
Environment |
All |
Trigger |
Run mmfsckx on a file system having a large number of independent filesets |
Workaround |
None |
|
5.1.9.2 |
FSCKX |
IJ49587 |
High Importance
|
When building an NFSv4 ACL from a POSIX access and default ACL of a directory, in between the retrievals of the access ACL and the default ACL, if an update or store ACL to another file or a directory happens, a deadlock can occur and the long waiter message "waiting for exclusive NF ThSXLock for readers to finish" is seen.
(show details)
Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
Environment |
Linux |
Trigger |
- Have directories with POSIX access and default ACL
- Retrieve the NFSv4 ACL of the directories
- At the same time, store or update the ACLs of other files/directories
- If the store/update occurs in between the retrieval of the access ACL and the default ACL during the process of building the NFSv4 ACL, the deadlock will be hit. |
Workaround |
- If NFSv4 ACL is needed, use NFSv4 ACL as the native ACL instead of using POSIX ACL, or
- Avoid retrieving ACLs of directories as NFSv4 ACLs when their native version are POSIX, or
- Use mode bits instead of ACLs. |
|
5.1.9.2 |
All Scale Users |
IJ49660 |
High Importance
|
When replicating over NFS with KRB plus AD: if a user who is not included in the AD at the primary site creates a file, the file is first replicated as root to the DR site, and then a Setattr is attempted with the user/group to which the file/directory belongs.
If the user does not exist in AD and is local to the primary cluster alone, NFS rejects the Setattr, and hence the whole create operation from primary to DR gets stuck with E_INVAL.
(show details)
Symptom |
Unexpected Behavior |
Environment |
Linux Only |
Trigger |
Trying AFM DR replication with NFS + KRB + AD.
A local user who is not present in the AD causes NFS to reject the user-ID-related operations at the DR site, leaving the queue stuck. |
Workaround |
None |
|
5.1.9.2 |
AFM |
IJ49661 |
Suggested |
cluster health showing "healthy" for disabled CES services
(show details)
Symptom |
Error output/message |
Environment |
ALL OS environments |
Trigger |
Transfer of events from nodes to cluster manager was not working correctly if some field was empty. |
Workaround |
None |
|
5.1.9.2 |
System Health |
IJ49662 |
Suggested |
In certain cases the network status was not accounted for correctly, which could result in "stuck" events like cluster_connections_bad and cluster_connections_down.
(show details)
Symptom |
Error output/message |
Environment |
ALL Linux OS environments |
Trigger |
A network channel between nodes was used, then failed, and was no longer needed afterwards. |
Workaround |
As a stop-gap solution the events could be ignored, that would however also mute valid events. |
|
5.1.9.2 |
System Health |
IJ49710 |
Suggested |
For a failed callhome upload, remove the job from the queue if the DC package is not available.
(show details)
Symptom |
Performance Impact/Degradation |
Environment |
Linux Only |
Trigger |
remove/rename the DC packaging file before callhome retry upload schedule |
Workaround |
None |
|
5.1.9.2 |
Callhome |
IJ49699 |
Suggested |
Sometimes the callhome upload fails due to a curl (52) error
(show details)
Symptom |
Performance Impact/Degradation |
Environment |
Linux Only |
Trigger |
upload large/medium size file using callhome sendfile |
Workaround |
None |
|
5.1.9.2 |
Callhome |
IJ49700 |
Suggested |
Sometimes an exception appears in the logs when the callhome sendfile progress is converted to an integer
(show details)
Symptom |
Performance Impact/Degradation |
Environment |
Linux Only |
Trigger |
upload the file using callhome sendfile |
Workaround |
None |
|
5.1.9.2 |
Callhome |
IJ49701 |
High Importance
|
Processes hang due to deadlocks in the Storage Scale cluster. There are deadlock notifications on multiple nodes, triggered by 'long waiter' events on those nodes.
(show details)
Symptom |
Client processes hang and system deadlocks |
Environment |
Linux Only |
Trigger |
A single large file being read sequentially from one node (causing a readahead to be performed on the file or by using a posix_fadvise call to trigger readahead forcefully) and also being truncated/deleted from another node at the same time. |
Workaround |
None |
|
5.1.9.2 |
Regular file read flow in kernel version >= 5.14 |
IJ49714 |
Suggested |
Creating an AFM fileset with more than 32 afmNumFlushThreads gives an error
(show details)
Symptom |
Error output/message |
Environment |
ALL Operating System environment |
Trigger |
Creating an AFM fileset with more than 32 afmNumFlushThreads gives an error |
Workaround |
Create the fileset with mmcrfileset using afmNumFlushThreads < 32, and later change the value using mmchfileset. |
|
5.1.9.2 |
AFM |
IJ49715 |
Suggested |
The 'rpc.statd' process may be terminated or crash due to statd-related issues. In these instances, the NFSv3 client will relinquish control over NFSv3 exports, and the GPFS health monitor will indicate 'statd_down'.
(show details)
Symptom |
The GPFS health monitor will show a 'statd_down' warning and the NFSv3 client loses control over NFSv3 exports. |
Environment |
Linux Only |
Trigger |
'rpc.statd' crashes or is stopped by an external process. |
Workaround |
None |
|
5.1.9.2 |
NFS |
IJ49580 |
High Importance
|
When the device file for an NSD disk goes offline or becomes detached from a node, I/O issued from that node fails with a "No such device or address" error (6), even though there are other NSD servers defined and available to service the I/O request.
(show details)
Symptom |
I/O error |
Environment |
All Operating Systems |
Trigger |
The disk device goes offline or becomes detached from a node. |
Workaround |
Reboot the node |
|
5.1.9.2 |
All Scale Users |
IJ49770 |
High Importance
|
An AFM object fileset fails to pull new objects from the S3/Azure store when the fileset is exported via nfs-ganesha and readdir is performed over the NFS mount. However, performing the readdir on the fileset directly pulls the entries correctly.
(show details)
Symptom |
Unexpected results |
Environment |
All OS environments |
Trigger |
Accessing the AFM object fileset over NFS mount with nfs-ganesha |
Workaround |
None |
|
5.1.9.2 |
AFM |
IJ49771 |
High Importance
|
AFM outband metadata prefetch hangs if an orphan file already exists for one of the entries in the list file. AFM orphan files have an inode allocated but not initialized.
(show details)
Symptom |
Deadlock |
Environment |
All OS environments |
Trigger |
AFM outband metadata prefetch with orphan files |
Workaround |
None |
|
5.1.9.2 |
AFM |
IJ49772 |
High Importance
|
Daemon assert going off: otherP == NULL in clicmd.C, resulting in a daemon restart.
(show details)
Symptom |
Abend/Crash |
Environment |
All platforms |
Trigger |
Random occurrence of the condition due to collision of randomly generated numbers |
Workaround |
None |
|
5.1.9.2 |
All |
IJ49792 |
High Importance
|
Add a config option to set nconnect for the NFS mount
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
Linux Only |
Trigger |
create a fileset and set nfs relationship between target and fileset |
Workaround |
None |
|
5.1.9.2 |
AFM |
IJ49793 |
Suggested |
Prefetch is not generating the afmPrepopEnd callback event.
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
Linux Only |
Trigger |
Run Prefetch on any fileset |
Workaround |
None |
|
5.1.9.2 |
AFM |
IJ49794 |
High Importance
|
Prefix downloads fail, or read or ls fails, if the prefix option is used with download or fileset creation.
(show details)
Symptom |
Error output/message |
Environment |
Linux Only |
Trigger |
Create fileset with --prefix option and do ls on fileset path or create a fileset and run download with --prefix option |
Workaround |
Run the following command on the fileset: mmchfileset <fs_name> <fset_name> -p afmobjectpreferdir=yes |
|
5.1.9.2 |
AFM |
IJ49795 |
Suggested |
Rename is not reflected to COS automatically if afmMUAutoRemove is configured.
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
Linux Only |
Trigger |
Configure a fileset with afmMUAutoRemove, then rename a file |
Workaround |
None |
|
5.1.9.2 |
AFM |
IJ49796 |
High Importance
|
AFM COS to GCS hangs the file system on GCS errors if the credentials do not have enough permission.
(show details)
Symptom |
Stuck IO |
Environment |
Linux Only |
Trigger |
With credentials that lack read permission, do ls on the fileset path |
Workaround |
None |
|
5.1.9.2 |
AFM |
IJ49851 |
High Importance
|
There is crash observed in read_pages when called from page_cache_ra_unbound on SLES with kernel version >=5.14.
(show details)
Symptom |
The node crashes on SLES 15 machine. It is specific only to kernel version >=5.14 |
Environment |
Linux SLES, kernel version >=5.14 |
Trigger |
A single large file being read sequentially from one node (causing a readahead to be performed on the file or by using a posix_fadvise call to trigger readahead forcefully) and also being truncated/deleted from another node at the same time. |
Workaround |
None |
|
5.1.9.2 |
Regular file read flow in kernel version >= 5.14 |
IJ49852 |
High Importance
|
With showNonZeroBlockCountForNonEmptyFiles set, the block count is always shown as one, reporting a fake nonzero block count.
This is a work-around for faulty applications (e.g., Gnu tar --sparse) that erroneously assume zero st_blocks means the file contains no nonzero bytes.
(show details)
Symptom |
Unexpected results |
Environment |
All OS environments |
Trigger |
Block count display issue on evicted files |
Workaround |
None |
|
5.1.9.2 |
AFM |
IJ49142 |
Suggested |
When running a workload on Windows which creates and deletes lots of files and directories in a short span, the inode number assigned to GPFS objects may be reused. If a stale inode entry somehow persists in the GPFS cache due to in-flight hold counts, a conflict between the old and new object types can cause this stale entry to result in a file or directory not found error.
(show details)
Symptom |
Unexpected Results/Behavior |
Environment |
Windows/x86_64 only |
Trigger |
Running a workload on Windows which continuously creates and deletes lots of files and directories quickly |
Workaround |
None |
|
5.1.9.1 |
All Scale Users |
IJ49144 |
High Importance
|
When a dependent fileset is created inline using afmOnlineDepFset, or created offline as in the earlier supported method, enabling mmafmconfig is mandated so that .afm/.afmtrash is present inside the dependent fileset at the DR site, to handle the conflict renames that AFM performs.
mmafmconfig enable at the DR site on the dependent fileset also creates the .afmctl file, which has the CTL attribute enabled and disallows anyone from removing it except through mmafmlocal. This causes the restore to fail when removing the .afmctl inside the dependent fileset while restoring to a snapshot without the dependent fileset.
The fix is to have mmafmconfig enable create .afm/.afmtrash without creating the .afmctl file, which is not needed inside dependent filesets anyway.
(show details)
Symptom |
Unexpected Behavior |
Environment |
Linux Only (at the DR site) |
Trigger |
Running failoverToSecondary with --restore option with dependent filesets inside the independent DR fileset. |
Workaround |
mmafmconfig disable on all the dependent filesets at the DR site before running failoverToSecondary with --restore option |
|
5.1.9.1 |
AFM |
IJ49145 |
High Importance
|
When failover is performed to an entirely new secondary fileset at the DR site, within the same file system as the previous target secondary fileset, the dependent fileset path requested to link under should change too.
For this, the existing dependent fileset is unlinked; when it is then linked under the new path, E_EXIST is returned because the dependent fileset already exists, and the primary later fails the queue when looking up the remote attributes. The fix returns E_EXIST only if the fileset exists in the linked state, so that the follow-up operation from the primary to build remote attributes succeeds.
(show details)
Symptom |
Unexpected Behavior |
Environment |
All Linux OS environments (At the DR site) |
Trigger |
Performing changeSecondary to the same Secondary site, but to a different fileset in it, with a dependent fileset in it. |
Workaround |
Manually create/link the dependent fileset at the new secondary site/fileset |
|
5.1.9.1 |
AFM |
IJ49151 |
High Importance
|
Memory corruption can happen if an application using the GPFS_FINE_GRAIN_WRITE_SHARING hint is running on a file system with its NSD servers having different endianness than the client node the application is running on.
(show details)
Symptom |
Segmentation fault, assert, or kernel crash |
Environment |
Linux only |
Trigger |
Run an application using the GPFS_FINE_GRAIN_WRITE_SHARING hint on nodes with mixed endianness. |
Workaround |
Don't run an application using the GPFS_FINE_GRAIN_WRITE_SHARING hint on nodes with mixed endianness. |
|
5.1.9.1 |
Data shipping |
IJ49152 |
High Importance
|
When running mmexpelnode to expel the node on which the command is running, the daemon may hit an assert.
(show details)
Symptom |
Abend/Crash |
Environment |
ALL Operating System environments |
Trigger |
Expel the node on which mmexpelnode is running |
Workaround |
None |
|
5.1.9.1 |
All Scale Users |
IJ49044 |
High Importance
|
When a file is opened with the O_APPEND flag, sequential small read performance is poor
(show details)
Symptom |
Performance Impact/Degradation |
Environment |
ALL Operating System environments |
Trigger |
File is opened with O_APPEND flag |
Workaround |
None |
|
5.1.9.1 |
All Scale Users |
IJ49154 |
Critical |
GPFS daemon could fail unexpectedly with an assert when handling disk address changes.
This could happen when the number of blocks in a file becomes very large and causes a variable used in an internal calculation to overflow.
This is more likely to happen on file systems where the block size is very small.
(show details)
Symptom |
Abend/Crash |
Environment |
ALL Operating System environments |
Trigger |
When the number of blocks in a user or system file increases past a certain point. The actual threshold differs depending on block size, replication factor, etc. |
Workaround |
None |
|
5.1.9.1 |
All Scale Users |
IJ49169 |
High Importance
|
AFM metadata prefetch does not preserve ctime on the files if they are migrated at home. This causes ctime mismatch between cache and home.
(show details)
Symptom |
Unexpected results |
Environment |
All Linux OS environments |
Trigger |
AFM metadata prefetch using AFM when files are migrated at home. |
Workaround |
None |
|
5.1.9.1 |
AFM |
IJ49196 |
High Importance
|
If a COS bucket has an object and a directory object with the same name, by default the file objects were downloaded, while the customer requirement was to download the directory content instead of the files.
(show details)
Symptom |
Unexpected Behavior |
Environment |
Linux Only |
Trigger |
A COS bucket having an object and a directory with the same name. |
Workaround |
None |
|
5.1.9.1 |
AFM |
IJ49197 |
Suggested |
Exception in mmsysmonitor.log because some files were removed during mmcallhome data collection
(show details)
Symptom |
Performance Impact/Degradation |
Environment |
Linux Only |
Trigger |
Some files (e.g., CCR files) change during mmcallhome GatherSend data collection |
Workaround |
None |
|
5.1.9.1 |
Callhome |
IJ49198 |
Suggested |
mmcallhome SendFile: progress percentage not updated
(show details)
Symptom |
Performance Impact/Degradation |
Environment |
Linux Only |
Trigger |
Only start and done (100%) were shown when using mmcallhome sendfile to upload |
Workaround |
None |
|
5.1.9.1 |
Callhome |
IJ49216 |
High Importance
|
Quota manager/client nodes may assert during a per-fileset quota check when there is a being-deleted inode.
(show details)
Symptom |
Cluster/File System Outage |
Environment |
ALL |
Trigger |
An invalid filesetId from a being-deleted inode is not handled correctly in the per-fileset quota check logic. |
Workaround |
None |
|
5.1.9.1 |
Quota |
IJ49135 |
Critical |
The assert "logAssertFailed: oldDA1Found[i].compAddr(synched1[I])" goes off, resulting in an mmfsd daemon crash, and can ultimately leave the file system unable to be mounted on any node.
(show details)
Symptom |
Abend/Crash |
Environment |
All Operating Systems |
Trigger |
Run fsck to fix the duplicated disk address on compressed files. |
Workaround |
None |
|
5.1.9.0 |
Compression |
IJ48873 |
Critical |
File data loss when copying or archiving data from migrated files (e.g., using a "cp" or "tar" command that supports detecting sparse holes in source files with the lseek(2) interface).
(show details)
Symptom |
Data Loss |
Environment |
Linux Only |
Trigger |
Using copy or archive tools that detect sparse holes in the source file with the lseek(2) interface. |
Workaround |
Use other copy or archive tools to copy or archive the data from migrated files, or recall the file before using the copy or archive application. |
|
5.1.9.0 |
DMAPI |
IJ48871 |
Critical |
File data loss when copying or archiving data from snapshot and clone files (e.g., using a "cp" or "tar" command that supports detecting sparse holes in source files with the lseek(2) interface).
(show details)
Symptom |
Data Loss |
Environment |
Linux Only |
Trigger |
Using copy or archive tools that detect sparse holes in the source file with the lseek(2) interface. |
Workaround |
Use other copy or archive tools to copy or archive the data from snapshot and clone files. |
|
5.1.9.0 |
Snapshot and clone files |