Troubleshooting the "ASM Health Checker Found 1 New Failures" Alert
If you manage an Oracle Database environment that uses Automatic Storage Management (ASM), the alert "ASM health checker found 1 new failures" can be jarring. The message is usually triggered by the Oracle Health Monitor (HM), a framework that runs diagnostic checks against components of the database and ASM instances and records what it finds.
When this alert surfaces in your alert log or monitoring dashboard (like Enterprise Manager), it means ASM has identified a specific issue that could potentially impact the availability or performance of your storage layer.
Here is a deep dive into what this error means, how to diagnose it, and the steps to resolve it.
1. Understanding the ASM Health Checker
The ASM Health Checker is part of the broader Oracle Health Monitor. It runs periodic checks—and can be triggered manually—to assess the integrity of:
ASM metadata (disk headers, File Directory, Alias Directory)
Disk group health
Process responsiveness
When a "new failure" is reported, Oracle has logged a diagnostic entry into its ADR (Automatic Diagnostic Repository). The alert doesn't tell you the problem directly; it tells you that a report is waiting for your review.
2. Immediate Diagnostic Steps
To fix the failure, you first have to identify it. You can do this from the command line with ADRCI.
Step A: Access ADRCI
Log in to your Grid Infrastructure server and run:
    adrci
Step B: Set the Home Path
Check which home is reporting the error (usually the ASM home), adjusting the path to match your SID:
    show homes
    set homepath diag/asm/+asm/+asm1
Step C: List the Failures
List the Health Monitor runs and the incidents recorded in that home:
    show hm_run
    show incident
LIST FAILURE, which is often quoted for this step, is an RMAN Data Recovery Advisor command and is not available in ADRCI. The Health Monitor run records show which check ran, when it ran, its status, and the ID of any incident it raised.
3. Common Causes for ASM Failures
While the "1 new failure" could technically be anything, it usually falls into one of these three categories:
A. Disk Corruption or Metadata Inconsistency
The most common cause is an inconsistency in the ASM metadata. This can happen after an unexpected power loss, a bug in the storage firmware, or "lost writes."
The Fix: Run an internal ASM consistency check using the CHECK clause of ALTER DISKGROUP:
    ALTER DISKGROUP ... CHECK
B. Offline Disks or Path Issues
If a path to a physical disk is lost (due to HBA failure or cable issues), ASM might mark the disk as "OFFLINE." If the diskgroup is still mounted but missing a member, the Health Checker will flag it.
The Fix: Check v$asm_disk to ensure all disks are ONLINE and HEADER_STATUS is MEMBER.
C. Resource Exhaustion
Sometimes the failure is not about the disks themselves, but about the ASM instance's ability to manage them, such as running out of processes or memory in the SGA.
4. How to Resolve the Failure
Once you have identified the failure, you can ask Oracle for repair advice. ADVISE FAILURE and REPAIR FAILURE are Data Recovery Advisor commands and are issued from an RMAN session, not from ADRCI.
Advise on Failure:
    advise failure;
This generates a report explaining the impact and recommending a script or manual action to fix it.
Execute Repair: If Oracle provides a repair script, you can run:
    repair failure;
Note: Always back up your metadata and ensure you have a valid backup before running automated repair scripts on production storage.
5. Clearing the Alert
After the underlying issue is resolved (e.g., the disk is back online or the metadata is repaired), you need to "close" the failure in the ADR so the health checker stops reporting it. Inside ADRCI:
    set homepath
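Once the fix is in place, it can help to re-run the ADRCI listing non-interactively and confirm that no new Health Monitor runs report failures. The Python sketch below only builds the command line. It assumes ADRCI's `exec="..."` batch form and reuses the example home path from this guide, which you would adjust for your SID. The helper name `build_adrci_command` is made up for illustration.

```python
import shlex


def build_adrci_command(homepath, commands):
    """Return an argv list that runs the given ADRCI commands in one batch.

    homepath and commands are caller-supplied; nothing here is validated
    against a real ADR, so treat this as a sketch.
    """
    script = "; ".join([f"set homepath {homepath}", *commands])
    return ["adrci", f"exec={script}"]


# Example home path taken from this guide; adjust it to your own SID.
argv = build_adrci_command("diag/asm/+asm/+asm1", ["show hm_run", "show incident"])

# Print a shell-safe rendering; pass argv to subprocess.run() to execute it.
print(shlex.join(argv))
```

Building the argument list separately from running it keeps the sketch safe to try on a machine without Grid Infrastructure installed.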
The "ASM health checker found 1 new failures" alert is a call to action to check your storage integrity. By using ADRCI to drill down into the specific failure, you can move from a vague warning to a concrete resolution plan.
Pro Tip: Regularly monitor your v$asm_operation view. If you see long-running "REBAL" (rebalance) operations following a failure, ensure your ASM_POWER_LIMIT is set high enough to complete the recovery quickly without impacting database I/O.
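The rebalance check in the tip above can be automated once the rows of v$asm_operation have been fetched by whatever client you use. The following is a minimal sketch that works on already-fetched rows. The `AsmOperation` class, the function name, and the 60-minute threshold are illustrative assumptions, not Oracle defaults.

```python
from dataclasses import dataclass


@dataclass
class AsmOperation:
    """One row of v$asm_operation, reduced to the columns used here."""
    operation: str    # e.g. "REBAL"
    state: str        # e.g. "RUN" or "WAIT"
    power: int        # power the operation is running at
    est_minutes: int  # estimated minutes remaining


def long_running_rebalances(operations, max_minutes=60):
    """Return running rebalances whose estimate exceeds max_minutes.

    max_minutes is an arbitrary example threshold, not an Oracle default.
    """
    return [
        op for op in operations
        if op.operation == "REBAL" and op.state == "RUN" and op.est_minutes > max_minutes
    ]


# Made-up rows for illustration only.
sample = [
    AsmOperation("REBAL", "RUN", 1, 240),
    AsmOperation("REBAL", "WAIT", 1, 0),
]

for op in long_running_rebalances(sample):
    print(f"Rebalance at power {op.power} still needs about {op.est_minutes} minutes")
```

A rebalance flagged this way is a candidate for a higher power setting, weighed against the extra I/O load on the database.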
The alert "ASM Health Checker found 1 new failures" is a critical notification from Oracle's Automatic Storage Management (ASM) health monitoring system. It typically appears in the ASM alert logs or via automated email notifications when a storage-related incident is detected.
Failure Overview
This specific message indicates that the Fault Diagnosability Infrastructure has identified a new incident in the Automatic Diagnostic Repository (ADR). While "1 new failure" is a generic count, it often points to one of the following underlying issues:
Disk Group Instability: A disk may have failed, leading to a loss of redundancy or a disk group being forced to dismount.
Metadata Corruption: Corruption in ASM metadata blocks (typically within the first 250 blocks) detected during routine operations or rebalancing.
Rebalance Failures: An error occurring during the addition or removal of disks, often accompanied by background process (ARB0) alerts.
Resource State Changes: CRS (Cluster Ready Services) resources moving to an INTERMEDIATE or OFFLINE state due to storage latency or connectivity issues.
Immediate Diagnostic Actions
To identify the exact cause, execute the following steps within your environment:
Check the ADRCI Utility: Use the ADR Command Interpreter (ADRCI) to list the Health Monitor runs and incidents recorded for the ASM home. LIST FAILURE is an RMAN command and is not available in ADRCI.
    adrci> show hm_run
    adrci> show incident
These commands show each Health Monitor run with its check name and status, plus the ID of any incident it raised.
Inspect ASM Alert Logs: Locate the log file (usually in the trace directory of your Oracle Base) to see the events leading up to the "1 new failure" message. Look for:
ORA-15xxx errors (ASM-specific).
SUCCESS: ALTER DISKGROUP... followed by immediate GMON dumping or failure notes.
Run Data Recovery Advisor: If the failure involves data loss or disk group mounting issues, use RMAN to get a repair recommendation:
    RMAN> list failure;
    RMAN> advise failure;
Query V$ Views: Verify the status of your disks and current operations:
Disk Status: SELECT name, path, mount_status, header_status, state FROM v$asm_disk;
Active Operations: SELECT operation, state, est_minutes FROM v$asm_operation;
Common Remediation Steps
If you manage Oracle Grid Infrastructure (GI) or a standalone Automatic Storage Management (ASM) instance, one notification can send a chill down your spine: "ASM health checker found 1 new failures."
This message, often found in your alert log, crsd.log, or email alerts from Enterprise Manager (EM12c/13c), indicates that the automated ASM Health Checker has detected a new issue affecting the integrity, availability, or performance of your ASM environment. Ignoring it is not an option; unresolved failures can lead to disk group mount issues, I/O latency, or even database crashes.
This article provides a 360-degree breakdown of this alert: what triggers it, how to diagnose the root cause, step-by-step repair procedures, and long-term prevention strategies.
Introduction
The terse message “asm health checker found 1 new failures” appears straightforward but carries significant operational weight: it signals that an ASM (Automatic Storage Management, or a similarly named subsystem) health-check routine has detected a failure. Whether that ASM is Oracle ASM, a cloud Autoscaling/Service Mesh monitor, or a custom “Application Service Monitor,” the phrasing implies an automated health-scan discovered one additional fault relative to its prior baseline. This essay examines the message’s possible meanings, root causes, investigative approach, risk implications, and systematic remediation and prevention strategies. The aim is to move from alarm to actionable resolution, and from reactive fixes to durable system hardening.
Conclusion
“asm health checker found 1 new failures” is more than a log line: it is an early warning. Responding effectively requires prompt triage, methodical diagnosis, and decisive remediation, combined with post-incident learning and engineering improvements to reduce recurrence. By classifying possible causes (storage, probe, resource, network, regression, auth), following a disciplined RCA approach, and implementing monitoring and automation best practices, teams can convert such alerts from frightening unknowns into manageable events and steadily improve system resilience.
Appendix: Minimal quick runbook (steps to execute immediately)
Troubleshooting Guide: ASM Health Checker Found 1 New Failure
If you are managing an Oracle database environment and receive the alert "ASM Health Checker found 1 new failure," it’s time to pay attention. While Oracle Automatic Storage Management (ASM) is robust, this specific notification indicates that the internal diagnostic framework has detected an issue that could potentially impact disk group availability or performance.
Here is a comprehensive breakdown of what this error means, how to diagnose it, and the steps to resolve it.
1. Understanding the ASM Health Checker (CHMA)
The ASM Health Checker is part of the Oracle Health Monitor framework. It runs periodic checks on the ASM instance, disk groups, and metadata to ensure everything is operating within healthy parameters.
When it reports a "new failure," it means a specific "check" (such as disk connectivity, metadata consistency, or space usage) has moved from a PASS to a FAIL state.
2. Immediate Step: Identify the Failure
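The alert itself is generic. To find out what actually failed, you need to query the ASM instance. Run this SQL command in your ASM instance:
    SELECT check_name, failure_pri, status, repair_script FROM v$asm_healthcheck_status WHERE status = 'FAILED';
Note that v$asm_healthcheck_status does not appear among the documented V$ASM views. If the query fails with ORA-00942 (table or view does not exist), list the Health Monitor runs for the ASM home with ADRCI (show hm_run) instead.
Common culprits include: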
Disk Offline: One or more disks in a disk group are no longer accessible.
Metadata Corruption: Inconsistencies in the ASM metadata (e.g., File Directory or Disk Directory).
Space Issues: A disk group is nearing 100% capacity, risking an instance crash.
Stale Quorum: Issues with voting files in a CRS/Grid Infrastructure environment.
3. Deep Dive into the Logs
To get the granular details, look at the ASM Alert Log. You can usually find this in your Oracle Base directory:
    $ORACLE_BASE/diag/asm/+asm/+asm1/trace/alert_+asm1.log
Search for the timestamp of the alert. You will often see a corresponding ORA- error code (like ORA-15078 or ORA-15032) that provides the exact technical reason for the health check failure.
4. How to Resolve the Failure
Scenario A: Disk Connectivity Issues
If the health checker found a disk failure, check the OS-level connectivity. Command: lsdsk (within ASMCMD) or fdisk -l (Linux).
Fix: If a disk is "OFFLINE," try to bring it back online using the ONLINE DISK clause of ALTER DISKGROUP:
    ALTER DISKGROUP ... ONLINE DISK ...
Scenario B: Metadata Inconsistency
If the health check indicates metadata issues, you may need to run a manual check on the disk group.
Action: Execute the CHECK command:
    ALTER DISKGROUP ... CHECK
Note: This checks for consistency but does not fix errors. If errors are found, you may need to involve Oracle Support.
Scenario C: Space Pressure
If the failure is related to "Insufficient Space," rebalance the disk group or add new disks immediately.
Action: Check free space:
    SELECT name, free_mb, total_mb, usable_file_mb FROM v$asm_diskgroup;
5. Clearing the Alert
Once you have fixed the underlying physical or logical issue, the Health Checker should automatically update during its next run. However, if the status remains "Failed" in the views, you can manually trigger a re-run of the health check or use ADRCI to purge the alert.
Summary Checklist
Query v$asm_healthcheck_status to identify the specific check.
Review the ASM Alert Log for specific ORA- error codes.
Verify Physical Disks at the OS level to ensure no hardware failure.
Check Disk Group Capacity to ensure you haven't hit a "disk full" state.
By catching these "1 new failures" early, you prevent minor disk hiccups from turning into major database outages.
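The log-review item on the checklist lends itself to a small script: find each health checker line in the alert log and collect the ORA- codes logged around it. The Python sketch below assumes the alert log has already been read into a string. The function name, the 20-line window, and the sample excerpt are illustrative assumptions, and the excerpt is made up rather than real log output.

```python
import re

# Matches the alert text discussed in this guide, singular or plural.
HEALTH_RE = re.compile(r"ASM Health Checker found (\d+) new failures?", re.IGNORECASE)
ORA_RE = re.compile(r"ORA-\d{5}")


def scan_alert_log(text, context=20):
    """Return one finding per health checker line, with nearby ORA- codes.

    context is the number of lines inspected on each side of the match;
    20 is an arbitrary choice for this sketch.
    """
    lines = text.splitlines()
    findings = []
    for index, line in enumerate(lines):
        match = HEALTH_RE.search(line)
        if not match:
            continue
        window = lines[max(0, index - context): index + context + 1]
        codes = sorted({code for entry in window for code in ORA_RE.findall(entry)})
        findings.append({
            "line": index + 1,  # 1-based line number in the log
            "failures": int(match.group(1)),
            "ora_codes": codes,
        })
    return findings


# Made-up excerpt for illustration; real alert logs carry timestamps and more detail.
SAMPLE_LOG = """\
ORA-15032: not all alterations performed
ORA-15078: ASM diskgroup was forcibly dismounted
NOTE: ASM Health Checker found 1 new failures
"""

for finding in scan_alert_log(SAMPLE_LOG):
    print(finding)
```

The collected codes are only a starting point; the surrounding log lines still need to be read to see which disk group and operation were involved.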
The error message "ASM Health Checker found 1 new failures" typically appears in the Oracle ASM alert logs when the system detects an issue with a disk or disk group. This message indicates that a failure has been logged in the Automatic Storage Management (ASM) health check framework, often related to disk group dismounts, header corruption, or voting file issues.
Oracle ASM Health Check Failure Report
Report Field | Description / Details
Alert Message | ASM Health Checker found 1 new failures
System Component | Oracle Automatic Storage Management (ASM)
Detection Source | ASM Alert Log (typically located at diag/asm/+asm/
(Requires immediate investigation to prevent data loss or service disruption)
Potential Causes & Findings
Disk Group Dismount: A disk group may have been forced to dismount due to lost connectivity or multiple disk failures in a failure group.
Disk Header Corruption: The metadata (headers) on one or more ASM disks may be corrupted or in a "FORMER" or "PROVISIONED" status instead of "MEMBER".
Voting File Issues: If the ASM disk group hosts the Cluster Registry (OCR) or Voting Disks, a failure can cause node evictions or cluster instability.
Storage Latency/I/O Timeouts: The health checker may trigger a failure if it waits too long (e.g., >15 seconds) for I/O operations to complete on a specific disk.
Recommended Troubleshooting Steps
ASM Health Checker Found 1 New Failure: What It Means and How to Resolve It
The Automatic Storage Management (ASM) health checker is a crucial tool in Oracle databases that monitors the health and integrity of the storage infrastructure. When the ASM health checker reports a new failure, it's essential to understand the implications and take corrective actions to prevent data loss or system downtime. In this blog post, we'll discuss what an ASM health checker failure means, how to investigate the issue, and steps to resolve it.
What does an ASM health checker failure mean?
When the ASM health checker detects a problem, it logs an error message indicating that a failure has been detected. The message may look like this:
"ASM health checker found 1 new failure"
This message indicates that the ASM health checker has detected a single failure in the storage system. The failure could be related to various issues, such as:
Investigating the ASM health checker failure
To investigate the failure, follow these steps:
Check the ASM alert log: Review the alert log in the $ORACLE_BASE/diag/asm/+ASM/<instance_name>/trace directory.
Use the asmcmd command: The asmcmd command-line tool provides a comprehensive view of the ASM configuration and status. Run asmcmd with the lsdg command to list the disk groups and their status:
    asmcmd lsdg
Check the affected disk group: Pass the disk group name to lsdg to check the status of the affected disk group:
    asmcmd lsdg <disk_group_name>
Resolving the ASM health checker failure
Once you've identified the root cause of the failure, take corrective actions to resolve the issue:
Best practices to prevent ASM health checker failures
To minimize the likelihood of ASM health checker failures:
By understanding the causes of ASM health checker failures and taking proactive steps to prevent them, you can ensure the reliability and performance of your Oracle database storage infrastructure.
An "ASM health checker found 1 new failures" message in Oracle (AHF/ORAchk) signals a logged incident in the Automatic Diagnostic Repository (ADR), often caused by disk connectivity issues, failed rebalances, or metadata corruption. Immediate investigation requires using ADRCI to identify the specific incident and checking V$ASM_DISK for failed or dropped disks. Detailed diagnostic procedures are available from the Oracle Help Center.
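Checking V$ASM_DISK for failed or dropped disks can be reduced to a filter over the rows returned by the query. This is a minimal Python sketch under stated assumptions: the rows are already fetched as dictionaries, they cover only disks expected to belong to a mounted disk group, and the function name and sample rows are made up for illustration.

```python
def find_problem_disks(rows):
    """Return (disk, reasons) pairs for rows that do not look healthy.

    Each row is a dict with the v$asm_disk columns name, path,
    mount_status, header_status, and state.
    """
    problems = []
    for row in rows:
        reasons = []
        if row["header_status"] != "MEMBER":
            reasons.append(f"header_status={row['header_status']}")
        if row["mount_status"] in ("MISSING", "CLOSED"):
            reasons.append(f"mount_status={row['mount_status']}")
        if row["state"] != "NORMAL":
            reasons.append(f"state={row['state']}")
        if reasons:
            # Fall back to the path when the disk name is empty.
            problems.append((row["name"] or row["path"], reasons))
    return problems


# Made-up rows for illustration only.
sample_rows = [
    {"name": "DATA_0000", "path": "/dev/mapper/data0",
     "mount_status": "CACHED", "header_status": "MEMBER", "state": "NORMAL"},
    {"name": "DATA_0001", "path": "",
     "mount_status": "MISSING", "header_status": "UNKNOWN", "state": "NORMAL"},
]

for disk, reasons in find_problem_disks(sample_rows):
    print(disk, ", ".join(reasons))
```

Candidate or provisioned disks that are not yet part of any disk group would also be flagged by this filter, so restrict the input rows accordingly.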
If compatible.asm, compatible.rdbms, or compatible.advm values are set incorrectly relative to the GI version, the health checker will report advisories as failures.
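A quick way to spot such a mismatch is to compare the attribute values as version tuples. The sketch below assumes the usual ordering rules: compatible.asm must not exceed the Grid Infrastructure software version, and compatible.rdbms and compatible.advm must not exceed compatible.asm. The function names and sample values are hypothetical, and the values would normally come from v$asm_attribute.

```python
def parse_version(text, width=5):
    """Turn '12.1' or '19.0.0.0.0' into a fixed-width tuple of ints."""
    parts = [int(piece) for piece in text.split(".")]
    parts += [0] * (width - len(parts))  # pad so '12.1' equals '12.1.0.0.0'
    return tuple(parts[:width])


def check_compatibility(gi_version, attributes):
    """Return a list of human-readable problems; empty means consistent.

    attributes maps names such as 'compatible.asm' to version strings.
    """
    problems = []
    asm = parse_version(attributes["compatible.asm"])
    if asm > parse_version(gi_version):
        problems.append("compatible.asm is higher than the GI software version")
    for name in ("compatible.rdbms", "compatible.advm"):
        if name in attributes and parse_version(attributes[name]) > asm:
            problems.append(f"{name} is higher than compatible.asm")
    return problems


# Hypothetical values for illustration only.
print(check_compatibility(
    "19.0.0.0.0",
    {"compatible.asm": "12.1", "compatible.rdbms": "12.2"},
))
```

Padding every version to the same width avoids treating a short form such as 12.1 as lower than its long form.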