In some cases, when the AIX kernel detects severe file system metadata corruption, it panics and brings the system down immediately. In most cases, the file system can later be repaired with fsck.
A file system consists of two types of data: user data and metadata. User data consists of the actual contents of files in the file system. Metadata consists of internal data structures such as the superblock, i-nodes, indirect data pointers, and directory files.
Not all file system problems will result in a system crash. Corruption of user data, the contents of a file, will not cause a system to crash. This is because the file system itself knows nothing about what the user data should be and therefore is unable to detect corruption. It is the responsibility of applications to detect this kind of corruption and almost all applications have the facilities to do so. File system metadata, on the other hand, is data that is maintained by the operating system to organize and store the user data. File system metadata is used on all file systems, both rootvg and non-rootvg. Corruption of file system metadata, even on a non-rootvg file system, can sometimes cause the system to crash. Because of a new design in the JFS2 file system, system crashes are much less common with JFS2 than with JFS.
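As an illustration of the point that detecting user data corruption is left to applications, here is a minimal sketch in Python of one common approach: record a checksum when the data is known to be good and compare against it later. The function names `sha256_of` and `is_intact` are hypothetical, and this is not a description of how any particular application or AIX component works.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected_hex):
    """Return True if the file's current digest matches one recorded earlier."""
    return sha256_of(path) == expected_hex
```

An application would store the digest somewhere other than the file it protects and call `is_intact` before trusting the file's contents. A mismatch shows only that the contents changed, not why they changed.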
In some cases, fatal metadata corruption might make it impossible to unmount the file system. In all cases, fatal metadata corruption jeopardizes all of the user data on that file system. Moreover, it can have negative effects on other file systems. For this reason, at the first sign of serious trouble, the kernel stops all file system activity and halts abruptly, which results in a system crash. The kernel ceases operation after detecting fatal file system corruption to protect the safety and integrity of the file system.
File system corruption is most often caused by hardware-related problems, but bugs in the software that manages and uses the file system can sometimes cause it as well. It is difficult, if not impossible, for AIX Support to determine the precise cause of file system corruption from a system dump when there are no detectable hardware problems on the system and no known AIX problems that could contribute to file system corruption.
Here are the most likely causes of file system corruption:
- The system is accidentally powered off or crashes
- Adapter hardware or down-level microcode
- Down-level adapter driver
- Disk hardware or down-level microcode
- Down-level disk drivers
- Storage cables loose, defective, or unterminated
- Down-level system firmware
- A program that uses the file system dumped core
- Bugs in software programs using the file system
When a system crashes due to file system corruption, the following steps should be taken to help prevent future crashes.
- Determine which file system was corrupted and, after unmounting the file system, run fsck -p on this file system two times to make sure it is clean. If check=true for a particular file system in /etc/filesystems, then fsck should have been run automatically by the operating system when the system was rebooted. Note that even though check is set to false for rootvg file systems, they are always automatically checked and repaired by init during the boot process before the /etc/filesystems file is ever processed.
- Check the error report for any hardware-related entries. Specifically, look for disk-related errors on the disks where the file system resides, or disk adapter entries for adapters that are connected to those disks. When possible, run diag on the adapters and disks that contain the file system.
- Check the levels of the disk and adapter device drivers and upgrade if necessary.
- Check the firmware and microcode levels in the disks and disk adapters and upgrade if necessary.
- Check the firmware in the computer system and upgrade if necessary.
- Check the software that is using the file system. All software that uses the file system should be investigated for potential bugs that can result in file system corruption.
- Ensure there are no known AIX issues that might contribute to file system corruption at the current operating system level.
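The first step above depends on the check attribute in /etc/filesystems. As a rough aid, the following Python sketch lists which stanzas set check = true. It assumes the usual stanza layout (a mount point followed by a colon, then `attribute = value` lines, with comment lines starting with `*`). The function names and the sample stanzas are hypothetical, and the sketch recognizes only the literal value true, not any other values the attribute may take.

```python
def parse_filesystems(text):
    """Parse /etc/filesystems-style stanza text into {mount_point: {attr: value}}."""
    stanzas = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("*"):  # skip blank lines and comments
            continue
        if line.endswith(":") and "=" not in line:
            current = line[:-1]               # stanza header, e.g. "/home:"
            stanzas[current] = {}
        elif "=" in line and current is not None:
            key, _, value = line.partition("=")
            stanzas[current][key.strip()] = value.strip()
    return stanzas


def auto_checked(stanzas):
    """Return the sorted mount points whose stanza sets check = true."""
    return sorted(
        mount_point
        for mount_point, attrs in stanzas.items()
        if attrs.get("check", "").lower() == "true"
    )


# Hypothetical sample stanzas, for illustration only.
SAMPLE = """\
* sample stanzas
/home:
        dev   = /dev/hd1
        vfs   = jfs2
        check = true

/:
        dev   = /dev/hd4
        vfs   = jfs2
        check = false
"""

if __name__ == "__main__":
    print(auto_checked(parse_filesystems(SAMPLE)))  # prints ['/home']
```

On a real system the text would come from reading /etc/filesystems itself. A file system missing from the result was not necessarily left unchecked at boot, because rootvg file systems are checked by init regardless of this attribute.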
If the problem recurs after taking all of the preventative measures listed above, then it is recommended that the file system be backed up and then removed, recreated, and then restored from the backup. If the problem still persists, then the file system can be removed and recreated on separate disks to rule out a problem with the disk drive hardware. Also, consider using JFS2 rather than JFS.