ASM Health Checker Found 1 New Failures

Troubleshooting the "ASM Health Checker Found 1 New Failures" Alert

If you are managing an Oracle Database environment using Automatic Storage Management (ASM), encountering the alert "ASM health checker found 1 new failures" can be a jarring experience. This message is usually triggered by the Oracle Health Monitor (HM), a framework that runs diagnostic checks on components of the database and ASM instances and records what it finds.

When this alert surfaces in your alert log or monitoring dashboard (like Enterprise Manager), it means ASM has identified a specific issue that could potentially impact the availability or performance of your storage layer.

Here is a deep dive into what this error means, how to diagnose it, and the steps to resolve it.

1. Understanding the ASM Health Checker

The ASM Health Checker is part of the broader Oracle Health Monitor. It runs periodic checks—and can be triggered manually—to assess the integrity of:

ASM Metadata (Disk headers, File Directory, Alias Directory)
Disk Group health
Process responsiveness

When a "new failure" is reported, Oracle has logged a diagnostic entry into its ADR (Automatic Diagnostic Repository). The alert doesn't tell you the problem directly; it tells you that a report is waiting for your review.

2. Immediate Diagnostic Steps

To fix the failure, you first have to identify it. You can do this from the command line using ADRCI, the ADR Command Interpreter.

Step A: Access ADRCI

Log in to your grid infrastructure server and run:

    adrci

Step B: Set the Home Path

Check which home is reporting the error (usually the ASM home), adjusting the path for your SID:

    show homes
    set homepath diag/asm/+asm/+asm1

Step C: List the Failures

ADRCI itself has no list failure command (LIST FAILURE belongs to RMAN's Data Recovery Advisor). To see what the Health Monitor recorded in this home, run:

    show hm_run
    show problem
    show incident

These commands list the Health Monitor runs and the problems and incidents logged in the ADR, each with an ID that you can use to drill down further. For a database target, RMAN's LIST FAILURE additionally reports a failure ID, a priority (CRITICAL, HIGH, or LOW), and a brief summary of what went wrong.

3. Common Causes for ASM Failures

While the "1 new failure" could technically be anything, it usually falls into one of these three categories:

A. Disk Corruption or Metadata Inconsistency

The most common cause is an inconsistency in the ASM metadata. This can happen due to an unexpected power loss, a bug in the storage firmware, or "lost writes."

The Fix: Run an internal ASM consistency check, substituting the name of the affected disk group:

    ALTER DISKGROUP <disk_group_name> CHECK ALL;

B. Offline Disks or Path Issues

If a path to a physical disk is lost (due to HBA failure or cable issues), ASM might mark the disk as "OFFLINE." If the diskgroup is still mounted but missing a member, the Health Checker will flag it.

The Fix: Query v$asm_disk and confirm that every disk has MODE_STATUS ONLINE and HEADER_STATUS MEMBER.

C. Resource Exhaustion

Sometimes the failure is not about the disks themselves but about the ASM instance's ability to manage them, such as running out of processes or memory in the SGA.

4. How to Resolve the Failure

Once you have identified the failure, you can ask Oracle for repair advice. ADVISE FAILURE and REPAIR FAILURE are RMAN Data Recovery Advisor commands, so run them from an RMAN session connected to the affected database rather than from ADRCI.

Advise on Failure:

    ADVISE FAILURE;

This will generate a report explaining the impact and recommending a script or manual action to fix it.

Execute Repair: If Oracle provides a repair script, you can run:

    REPAIR FAILURE;

Note: Always back up your metadata and ensure you have a valid backup before running automated repair scripts on production storage.

5. Clearing the Alert

After the underlying issue is resolved (e.g., the disk is back online or the metadata is repaired), you need to "close" the failure so that it stops being reported. CHANGE FAILURE, like LIST FAILURE, is an RMAN Data Recovery Advisor command. After verifying the fix, list the failures to get the ID and then close it:

    LIST FAILURE;
    CHANGE FAILURE <failure_id> CLOSED;

The "ASM health checker found 1 new failures" alert is a call to action to check your storage integrity. By using ADRCI to drill down into what was logged for the specific failure, you can move from a vague warning to a concrete resolution plan.

Pro Tip: Regularly monitor your v$asm_operation view. If you see long-running "REBAL" (rebalance) operations following a failure, ensure your ASM_POWER_LIMIT is set high enough to complete the recovery quickly without impacting database I/O.
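As a rough illustration of that monitoring, the sketch below flags rebalance operations whose estimated remaining time exceeds a threshold. It assumes the rows have already been fetched from v$asm_operation into plain dictionaries; the function name, the row shape, and the 30-minute default are hypothetical choices for the example, not part of any Oracle API.

```python
from typing import Iterable, List


def slow_rebalances(rows: Iterable[dict], max_minutes: int = 30) -> List[dict]:
    """Return REBAL rows whose est_minutes exceeds max_minutes.

    Each row is assumed to be a dict with the keys 'operation', 'state'
    and 'est_minutes', mirroring columns of v$asm_operation.
    """
    flagged = []
    for row in rows:
        if row["operation"] != "REBAL":
            continue  # only rebalance operations are of interest here
        est = row["est_minutes"]
        if est is not None and est > max_minutes:
            flagged.append(row)
    return flagged


# Illustrative rows only; real values come from your own ASM instance.
sample = [
    {"operation": "REBAL", "state": "RUN", "est_minutes": 95},
    {"operation": "REBAL", "state": "RUN", "est_minutes": 4},
]
print(slow_rebalances(sample))
```

A rebalance that stays in this list across several polls is a reasonable trigger for reviewing the power limit, subject to how much I/O headroom the databases have.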


The alert "ASM Health Checker found 1 new failures" is a critical notification from Oracle's Automatic Storage Management (ASM) health monitoring system. It typically appears in the ASM alert logs or via automated email notifications when a storage-related incident is detected.

Failure Overview

This specific message indicates that the Fault Diagnosability Infrastructure has identified a new incident in the Automatic Diagnostic Repository (ADR). While "1 new failure" is a generic count, it often points to one of the following underlying issues:

Disk Group Instability: A disk may have failed, leading to a loss of redundancy or a disk group being forced to dismount.

Metadata Corruption: Corruption in ASM metadata blocks (typically within the first 250 blocks) detected during routine operations or rebalancing.

Rebalance Failures: An error occurring during the addition or removal of disks, often accompanied by background process (ARB0) alerts.

Resource State Changes: CRS (Cluster Ready Services) resources moving to an INTERMEDIATE or OFFLINE state due to storage latency or connectivity issues.

Immediate Diagnostic Actions

To identify the exact cause, execute the following steps within your environment:

Check the ADRCI Utility: Use the ADR Command Interpreter (ADRCI) to list what was logged for the failure. ADRCI does not provide a list failure command; its closest equivalents are:

    adrci> show hm_run
    adrci> show problem

These commands list the Health Monitor runs and the logged problems, each with an ID that can be used to drill down further.
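ADRCI can also be driven non-interactively through its exec= option, which is convenient for scripted collection. The sketch below only assembles the argument list for such a call and does not run anything; the helper name is hypothetical, the home path is the example path used earlier in this article, and show hm_run and show problem are the ADRCI commands for listing Health Monitor runs and logged problems.

```python
from typing import List


def build_adrci_args(homepath: str, commands: List[str]) -> List[str]:
    """Build an argv list for a non-interactive ADRCI call.

    ADRCI accepts several commands in one exec= string, separated by
    semicolons. The home path is set first so later commands apply to it.
    """
    script = "; ".join(["set homepath " + homepath] + list(commands))
    return ["adrci", "exec=" + script]


args = build_adrci_args("diag/asm/+asm/+asm1", ["show hm_run", "show problem"])
print(args)
```

On a host where adrci is on the PATH, the resulting list could be handed to subprocess.run(args, capture_output=True, text=True) to capture the output for later review.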

Inspect ASM Alert Logs: Locate the log file (usually in the trace directory under your Oracle Base) to see the events leading up to the "1 new failure" message. Look for:

ORA-15xxx errors (ASM-specific).
SUCCESS: ALTER DISKGROUP... followed by immediate GMON dumping or failure notes.
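Scanning for the ORA-15xxx codes can be automated with a short script. The following is a minimal sketch, assuming the alert log has been read into a list of lines; the function name is hypothetical and the sample lines are made up for illustration rather than copied from a real log.

```python
import re
from collections import Counter
from typing import Iterable

# Matches ASM-specific error codes in the ORA-15000 to ORA-15999 range.
ORA_15_PATTERN = re.compile(r"\bORA-15\d{3}\b")


def count_asm_errors(lines: Iterable[str]) -> Counter:
    """Count how often each ORA-15xxx code appears in the given lines."""
    counts = Counter()
    for line in lines:
        counts.update(ORA_15_PATTERN.findall(line))
    return counts


# Illustrative input only; in practice read the lines from the alert log file.
sample_lines = [
    "ORA-15032: not all alterations performed",
    "ORA-15078: ASM diskgroup was forcibly dismounted",
    "ORA-15032: not all alterations performed",
    "NOTE: unrelated informational line",
]
print(count_asm_errors(sample_lines).most_common())
```

Counting rather than merely printing matches makes it easier to see which error dominates around the time of the alert.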

Run Data Recovery Advisor: If the failure involves data loss or disk group mounting issues, use RMAN to get a repair recommendation:

    RMAN> list failure;
    RMAN> advise failure;

Query V$ Views: Verify the status of your disks and current operations:

Disk Status: SELECT name, path, mount_status, header_status, state FROM v$asm_disk;

Active Operations: SELECT operation, state, est_minutes FROM v$asm_operation;
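Once the disk status rows have been fetched, a small filter can highlight the disks that need attention. This is a sketch under stated assumptions: the rows are plain dictionaries keyed by the column names from the disk status query, every row is for a disk that is expected to belong to a mounted disk group, and the set of "healthy" MOUNT_STATUS values is a choice made for this example. The function name and sample data are hypothetical.

```python
from typing import Iterable, List

# Assumption: values treated as healthy for a disk in a mounted disk group.
HEALTHY_MOUNT_STATUS = {"CACHED", "OPENED"}


def unhealthy_disks(rows: Iterable[dict]) -> List[str]:
    """Describe every disk whose status columns look abnormal."""
    findings = []
    for row in rows:
        reasons = []
        if row["header_status"] != "MEMBER":
            reasons.append("header_status=" + row["header_status"])
        if row["mount_status"] not in HEALTHY_MOUNT_STATUS:
            reasons.append("mount_status=" + row["mount_status"])
        if row["state"] != "NORMAL":
            reasons.append("state=" + row["state"])
        if reasons:
            findings.append(row["name"] + ": " + ", ".join(reasons))
    return findings


# Illustrative rows only; names and paths are invented for the example.
sample_rows = [
    {"name": "DATA_0000", "path": "/dev/mapper/asm01",
     "mount_status": "CACHED", "header_status": "MEMBER", "state": "NORMAL"},
    {"name": "DATA_0001", "path": "",
     "mount_status": "MISSING", "header_status": "UNKNOWN", "state": "NORMAL"},
]
print(unhealthy_disks(sample_rows))
```

Candidate disks that are not yet part of any disk group would also be reported by this filter, so exclude them first if your query returns them.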

Decoding the Alert: "ASM Health Checker Found 1 New Failures" – Causes, Fixes, and Prevention

If you manage Oracle Grid Infrastructure (GI) or a standalone Automatic Storage Management (ASM) instance, one notification can send a chill down your spine: "ASM health checker found 1 new failures."

This message, often found in your alert log, crsd.log, or email alerts from Enterprise Manager (EM12c/13c), indicates that the automated ASM Health Checker has detected a new issue affecting the integrity, availability, or performance of your ASM environment. Ignoring it is not an option; unresolved failures can lead to disk group mount issues, I/O latency, or even database crashes.

This article provides a 360-degree breakdown of this alert: what triggers it, how to diagnose the root cause, step-by-step repair procedures, and long-term prevention strategies.

What "ASM Health Checker found 1 new failures" means

In an Oracle environment, the ASM Health Checker is the Health Monitor component that runs automated checks against the ASM instance, its disk groups, and their metadata.
A message saying it “found 1 new failures” indicates a previously healthy check has now failed — one discrete test or probe returned an error or an unexpected result.
The failure could be transient (temporary network hiccup, brief service restart) or persistent (configuration drift, resource exhaustion, corrupt files).

Essay: “asm health checker found 1 new failures” — diagnosis, causes, and remediation

Introduction

The terse message “asm health checker found 1 new failures” appears straightforward but carries significant operational weight: it signals that an ASM (Automatic Storage Management, or a similarly named subsystem) health-check routine has detected a failure. Whether that ASM is Oracle ASM, a cloud Autoscaling/Service Mesh monitor, or a custom “Application Service Monitor,” the phrasing implies an automated health-scan discovered one additional fault relative to its prior baseline. This essay examines the message’s possible meanings, root causes, investigative approach, risk implications, and systematic remediation and prevention strategies. The aim is to move from alarm to actionable resolution, and from reactive fixes to durable system hardening.

Interpreting the message

Literal reading: an automated health-check process labelled “asm health checker” has logged a detection: exactly one new failure event.
Context sensitivity: precise interpretation depends on environment:
- Oracle ASM: a disk/member or diskgroup issue (failed redundancy, I/O errors, disk offline).
- Service mesh/Autoscaling manager: a service instance or probe failed a liveness/readiness check.
- Custom agent named “asm health checker”: any monitored component (process, thread, port, storage, network) could be implicated.
Important implications:
- “New” indicates a state transition from healthy to unhealthy (not a stale alert).
- “1” suggests a single point of failure, but single failures often cascade; it’s important to identify whether this is isolated or symptomatic.

Immediate triage checklist (first 15–60 minutes)

Capture the context:
- Timestamp and full log entry (message, surrounding logs, stack traces).
- Host, resource identifier (disk name, instance ID, pod, container, node).
- Correlate with monitoring dashboards, metrics (CPU, memory, I/O, latency), and recent deployments/changes.
Prevent escalation:
- If the failure affects production traffic, consider circuit-breaker actions: divert traffic, scale up healthy instances, enable read-only modes, or failover to replicas.
- Announce to on-call and stakeholders with concise facts: what, when, where, impact, mitigation in progress.
Gather artifacts:
- Health-check configuration (probe interval, timeout, retries).
- Recent configuration changes, deployments, patching, or maintenance windows.
- Checkpoint / snapshot data for storage systems; container logs and systemd/journald entries for nodes.

Root-cause analysis (systematic approach)

Classify the failure type:
- Infrastructure (disk or network failure, node crash, lost quorum).
- Application-level (process crash, thread deadlock, memory leak).
- Configuration/compatibility (misconfigured probe, wrong path, permissions).
- Environmental change (updated TLS certs, rotated keys, firewall rules).
Use hypothesis-driven testing:
- Reproduce the health-check manually (curl, nc, mock probe) to see failure mode and error output.
- Run platform-specific diagnostics:
  - Oracle ASM: check v$asm_disk, v$asm_disk_stat, alert logs; verify diskgroup status, disk headers, ASMLib/udev mappings.
  - Kubernetes/service mesh: describe Pod, inspect readiness/liveness probe commands, check kubelet and container logs, look for OOMKilled or CrashLoopBackOff events.
  - Generic: run system-level checks (dmesg for kernel I/O errors, SMART for disks, iptables/netstat for network problems).
- Cross-check metrics around the failure time (spikes in latency, error rates, system load).
Look for correlated events:
- Recent rolling deployment, configuration commit, or scaledown/scaleup around the timestamp.
- Hardware alerts from infrastructure providers or cloud provider incident notices.

Common root causes and how they manifest

Disk/device failure (storage-backed ASM):
- Symptoms: I/O errors, device disappearing, degraded redundancy, “disk offline” in ASM tooling.
- Manifest: slow operations, timeouts, database errors, degraded redundancy alarms.
Probe misconfiguration:
- Symptoms: probe path or command changed, insufficient privileges, changed binary path or API endpoint.
- Manifest: instant failures after configuration change or deployment, but application otherwise healthy.
Resource exhaustion:
- Symptoms: OOMs, CPU saturation, slow responsiveness leading to probe timeouts.
- Manifest: spike in latency/queue depths, container restarts.
Network partitions and DNS:
- Symptoms: connectivity failures, name resolution errors, requests timing out.
- Manifest: distributed systems losing quorum or failing health checks intermittently.
Software/regression bug:
- Symptoms: new code path introduced a crash or deadlock.
- Manifest: reproducible failure tied to a recent change, stack traces in logs.
Permission or credential expiry:
- Symptoms: auth failures, TLS handshake errors, permission denied.
- Manifest: logs showing unauthorized, certificate expired, or permission denied messages.

Remediation steps (concrete actions)

If storage/device failure (ASM storage example):
- Mark failed disk offline in ASM; replace or reattach physical/virtual disk.
- Recreate or restore disk headers if corrupted (only after backups and vendor guidance).
- Rebalance diskgroup to restore redundancy; monitor rebalance progress.
- Verify backups before destructive repair; engage support for hardware-level issues.
If probe/config issue:
- Fix probe command/path/permissions; redeploy probe configuration.
- Adjust probe tolerances (timeouts/retries) conservatively; loosening them too far can mask real failures.
- Use start-up probes to give warm-up time before liveness checks.
If resource exhaustion:
- Increase resource limits, add replicas, tune GC or request/limit settings in orchestrator.
- Identify memory leaks or CPU hotspots; apply fixes or rollback problematic release.
If networking:
- Restore connectivity (route, firewall rules, security groups), verify DNS resolution.
- Consider adjusting health-check endpoints to use local checks where possible.
If software regression:
- Roll back to last known-good version or hotfix the bug; add test coverage to catch similar errors.
If credential/certificate issues:
- Renew or rotate credentials; update configuration; automate certificate renewal.

Validation and recovery verification

Confirm health-check returns healthy consistently across multiple intervals.
Re-run full end-to-end tests and smoke tests for functionality.
Monitor for recurrence over a longer window than the probe interval (e.g., 3× intervals).
Check downstream systems for residual impact: queues, caches, replication lag, user-facing errors.
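The first and third checks above amount to requiring several consecutive passing probes before declaring recovery. A minimal sketch of that rule follows; the function name and the default of three consecutive passes are assumptions made for the example.

```python
from typing import Sequence


def recovered(results: Sequence[bool], required_consecutive: int = 3) -> bool:
    """Return True if the most recent probes all passed.

    results holds probe outcomes in chronological order (True = healthy).
    Recovery is declared only when the last required_consecutive entries
    are all True, so a single lucky pass is not enough.
    """
    if required_consecutive <= 0:
        raise ValueError("required_consecutive must be positive")
    if len(results) < required_consecutive:
        return False  # not enough observations yet
    return all(results[-required_consecutive:])


print(recovered([False, True, True, True]))
print(recovered([True, True, False]))
```

Choosing a larger window trades a slower all-clear for more confidence that the fault is not merely intermittent.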

Post-incident actions (SRE-style)

Incident timeline: assemble a precise timeline of events, alerts, root cause, and actions taken.
Blameless postmortem: document root cause, contributing factors, mitigations, and long-term fixes.
Action items with owners and deadlines:
- Fix root cause (code, config, hardware replacement).
- Add or adjust monitoring (better metrics, alert thresholds, synthetic tests).
- Improve runbooks: clearly document steps for this exact alert.
- Automate repetitive fixes where safe (auto-replace failed disks, auto-scale on thresholds).
Prevent recurrence:
- Increase redundancy, improve testing (chaos testing, canary releases), and validate health-check semantics.
- Ensure alerts escalate appropriately (avoid noisy alerts causing fatigue).

Design considerations for health checkers to reduce false positives and improve signal

Health-check best practices:
- Use layered checks: shallow liveness probe (process alive) and deep readiness/smoke tests (end-to-end).
- Include health endpoints that report application-specific readiness (dependency statuses, DB connectivity).
- Use gradual degradation rather than binary failure where possible (report degraded vs unhealthy).
- Align probe intervals, retries, and timeouts with realistic warm-up, GC, and transient network behavior.
- Avoid heavy-weight operations inside probes that amplify load.
Observability:
- Emit structured health-check telemetry (success/failure counts, latencies, error codes).
- Correlate health-check events with trace IDs to debug distributed failures.
- Provide contextual metadata in alerts (node id, disk id, pod name, last successful probe timestamp).
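To make the point about retries and transient behavior concrete, here is a small sketch of a probe wrapper that tolerates a limited number of failed attempts before reporting a failure. The wrapper name, its parameters, and the simulated flaky check are all hypothetical; a real probe would call whatever check your platform provides.

```python
import time
from typing import Callable


def probe_with_retries(check: Callable[[], bool], retries: int = 3,
                       delay_seconds: float = 0.0) -> bool:
    """Run check up to retries times; report healthy on the first pass.

    An exception raised by check counts as a failed attempt rather than
    aborting the probe.
    """
    for attempt in range(retries):
        try:
            if check():
                return True
        except Exception:
            pass  # treat errors as a failed attempt and try again
        if attempt < retries - 1:
            time.sleep(delay_seconds)
    return False


# Simulated flaky dependency: fails twice, then succeeds.
outcomes = iter([False, False, True])
result = probe_with_retries(lambda: next(outcomes), retries=3)
print(result)
```

Keeping the retry count small is deliberate: each extra attempt delays detection of a genuine outage by one more interval.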

Risk assessment and business impact

Single failure significance:
- Might be low impact if redundancy absorbs the fault; but single failure can escalate into wider outage if not contained (e.g., degraded rebuilds, increased load on remaining resources).
Time-to-recovery (MTTR) and detection (MTTD):
- Shortening MTTD and MTTR reduces blast radius. Invest in precise alerts and automated remediation for common faults.
Regulatory and data risks:
- For storage/ASM, degraded redundancy increases risk of data loss if another failure occurs before rebuild completes; prioritize repair.

Conclusion

“asm health checker found 1 new failures” is more than a log line: it is an early warning. Responding effectively requires prompt triage, methodical diagnosis, and decisive remediation, combined with post-incident learning and engineering improvements to reduce recurrence. By classifying possible causes (storage, probe, resource, network, regression, auth), following a disciplined RCA approach, and implementing monitoring and automation best practices, teams can convert such alerts from frightening unknowns into manageable events and steadily improve system resilience.

Appendix: Minimal quick runbook (steps to execute immediately)

Capture the alert details and correlate logs/metrics.
Identify the affected resource (disk/pod/node/service).
Attempt a manual probe/connection to reproduce failure.
If production-impacting, trigger failover/scaleup and notify on-call.
Apply targeted remediation (replace disk, fix probe, rollback deployment).
Verify health across multiple intervals; monitor for recurrence.
Create postmortem and assign permanent fixes.


Troubleshooting Guide: ASM Health Checker Found 1 New Failure

If you are managing an Oracle database environment and receive the alert "ASM Health Checker found 1 new failure," it’s time to pay attention. While Oracle Automatic Storage Management (ASM) is robust, this specific notification indicates that the internal diagnostic framework has detected an issue that could potentially impact disk group availability or performance.

Here is a comprehensive breakdown of what this error means, how to diagnose it, and the steps to resolve it.

1. Understanding the ASM Health Checker (CHMA)

The ASM Health Checker is part of the Oracle Check Framework. It runs periodic checks on the ASM instance, disk groups, and metadata to ensure everything is operating within healthy parameters.

When it reports a "new failure," it means a specific "check" (such as disk connectivity, metadata consistency, or space usage) has moved from a PASS to a FAIL state.

2. Immediate Step: Identify the Failure

The alert itself is generic. To find out what actually failed, you need to query the ASM instance. Run this SQL command in your ASM instance:

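    SELECT check_name, failure_pri, status, repair_script FROM v$asm_healthcheck_status WHERE status = 'FAILED';

This view is not among the documented V$ASM_* views, so confirm that it exists in your release before relying on it. The Health Monitor views V$HM_RUN and V$HM_FINDING and the ADRCI command show hm_run are documented places to look.

Common culprits include: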

Disk Offline: One or more disks in a disk group are no longer accessible.

Metadata Corruption: Inconsistencies in the ASM metadata (e.g., File Directory or Disk Directory).

Space Issues: A disk group is nearing 100% capacity, risking an instance crash.

Stale Quorum: Issues with voting files in a CRS/Grid Infrastructure environment.

3. Deep Dive into the Logs

To get the granular details, look at the ASM Alert Log. You can usually find this under your Oracle Base directory:

    $ORACLE_BASE/diag/asm/+asm/+asm1/trace/alert_+asm1.log

Search for the timestamp of the alert. You will often see a corresponding ORA- error code (like ORA-15078 or ORA-15032) that provides the exact technical reason for the health check failure.

4. How to Resolve the Failure

Scenario A: Disk Connectivity Issues

If the health checker found a disk failure, check the OS-level connectivity.

Command: lsdsk (within ASMCMD) or fdisk -l (Linux).

Fix: If a disk is "OFFLINE," try to bring it online using:

    ALTER DISKGROUP <disk_group_name> ONLINE DISK <disk_name>;

Scenario B: Metadata Inconsistency

If the health check indicates metadata issues, you may need to run a manual check on the disk group.

Action: Execute the CHECK command:

    ALTER DISKGROUP <disk_group_name> CHECK ALL;

Note: This checks for consistency but does not fix errors. If errors are found, you may need to involve Oracle Support.

Scenario C: Space Pressure

If the failure is related to "Insufficient Space," add new disks or free up space in the disk group immediately. A rebalance only redistributes existing data across the disks; it does not create free space by itself.

Action: Check free space:

    SELECT name, free_mb, total_mb, usable_file_mb FROM v$asm_diskgroup;

5. Clearing the Alert

Once you have fixed the underlying physical or logical issue, the Health Checker should automatically update during its next run. However, if the status remains "Failed" in the views, you can manually trigger a re-run of the health check or use ADRCI to purge the alert.

Summary Checklist

Query v$asm_healthcheck_status to identify the specific check.
Review the ASM Alert Log for specific ORA- error codes.
Verify Physical Disks at the OS level to ensure no hardware failure.
Check Disk Group Capacity to ensure you haven't hit a "disk full" state.

By catching these "1 new failures" early, you prevent minor disk hiccups from turning into major database outages.
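The capacity item in the checklist can be turned into a simple rule over the free_mb, total_mb, and usable_file_mb columns of v$asm_diskgroup. The sketch below assumes the rows are available as plain dictionaries; the function name, the 90 percent threshold, and the sample figures are hypothetical. A negative usable_file_mb is flagged separately because it indicates there is not enough free space to restore redundancy after a disk failure.

```python
from typing import Iterable, List


def space_warnings(rows: Iterable[dict], max_used_pct: float = 90.0) -> List[str]:
    """Return human-readable warnings for disk groups under space pressure."""
    warnings = []
    for row in rows:
        total = row["total_mb"]
        if total > 0:  # a dismounted group can report 0 total_mb
            used_pct = 100.0 * (total - row["free_mb"]) / total
            if used_pct >= max_used_pct:
                warnings.append("%s: %.1f%% used" % (row["name"], used_pct))
        if row["usable_file_mb"] < 0:
            warnings.append("%s: usable_file_mb is negative" % row["name"])
    return warnings


# Illustrative figures only.
sample_groups = [
    {"name": "DATA", "total_mb": 1000, "free_mb": 50, "usable_file_mb": -20},
    {"name": "RECO", "total_mb": 1000, "free_mb": 600, "usable_file_mb": 300},
]
print(space_warnings(sample_groups))
```

The threshold is worth tuning per disk group, since a group holding archived logs fills and drains far faster than one holding data files.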

The error message "ASM Health Checker found 1 new failures" typically appears in the Oracle ASM alert logs when the system detects an issue with a disk or disk group. This message indicates that a failure has been logged in the Automatic Storage Management (ASM) health check framework, often related to disk group dismounts, header corruption, or voting file issues.

Oracle ASM Health Check Failure Report

Report Field: Description / Details
Alert Message: ASM Health Checker found 1 new failures
System Component: Oracle Automatic Storage Management (ASM)
Detection Source: ASM Alert Log (typically located at diag/asm/+asm//trace/alert_+asm.log)
Incident Status: (Requires immediate investigation to prevent data loss or service disruption)

Potential Causes & Findings

Disk Group Dismount: A disk group may have been forced to dismount due to lost connectivity or multiple disk failures in a failure group.
Disk Header Corruption: The metadata (headers) on one or more ASM disks may be corrupted or in a "FORMER" or "PROVISIONED" status instead of "MEMBER".
Voting File Issues: If the ASM disk group hosts the Cluster Registry (OCR) or Voting Disks, a failure can cause node evictions or cluster instability.
Storage Latency/I/O Timeouts: The health checker may trigger a failure if it waits too long (e.g., >15 seconds) for I/O operations to complete on a specific disk.

ASM Health Checker Found 1 New Failure: What It Means and How to Resolve It

The Automatic Storage Management (ASM) health checker is a crucial tool in Oracle databases that monitors the health and integrity of the storage infrastructure. When the ASM health checker reports a new failure, it's essential to understand the implications and take corrective actions to prevent data loss or system downtime. In this blog post, we'll discuss what an ASM health checker failure means, how to investigate the issue, and steps to resolve it.

What does an ASM health checker failure mean?

When the ASM health checker detects a problem, it logs an error message indicating that a failure has been detected. The message may look like this:

"ASM health checker found 1 new failure"

This message indicates that the ASM health checker has detected a single failure in the storage system. The failure could be related to various issues, such as:

Disk errors or corruption
Connectivity problems between the database server and storage
Insufficient disk space or quota issues
ASM configuration errors

Investigating the ASM health checker failure

To investigate the failure, follow these steps:

Check the ASM alert log: The ASM alert log provides detailed information about the failure, including the error message, timestamp, and affected disk group. You can find the alert log in the $ORACLE_BASE/diag/asm/+ASM/<instance_name>/trace directory.
Run the asmcmd command: The asmcmd command-line tool provides a comprehensive view of the ASM configuration and status. Run asmcmd with the lsdg command to list the disk groups and their status: asmcmd lsdg
Check the disk group status: Pass the disk group name to lsdg to check the status of the affected disk group: asmcmd lsdg <disk_group_name>

Resolving the ASM health checker failure

Once you've identified the root cause of the failure, take corrective actions to resolve the issue:

Replace a failed disk: If the failure is due to a disk error, replace the disk and re-add it to the ASM disk group.
Check and correct connectivity: Verify that the storage connections are stable and functioning correctly.
Free up disk space: If the failure is due to insufficient disk space, free up space by deleting unnecessary files or expanding the disk group.
Reconfigure ASM: If the failure is due to an ASM configuration error, reconfigure ASM with the correct settings.

Best practices to prevent ASM health checker failures

To minimize the likelihood of ASM health checker failures:

Regularly monitor ASM alerts: Regularly check the ASM alert log and respond promptly to any errors or warnings.
Perform routine maintenance: Regularly perform routine maintenance tasks, such as checking disk space and replacing failed disks.
Test and validate ASM configurations: Test and validate ASM configurations to ensure they are correct and optimal.

By understanding the causes of ASM health checker failures and taking proactive steps to prevent them, you can ensure the reliability and performance of your Oracle database storage infrastructure.

An "ASM health checker found 1 new failures" message in Oracle (AHF/ORAchk) signals a logged incident in the Automatic Diagnostic Repository (ADR), often caused by disk connectivity issues, failed rebalances, or metadata corruption. Immediate investigation requires using ADRCI to identify the specific incident and checking V$ASM_DISK for failed or dropped disks. Detailed diagnostic procedures are available from the Oracle Help Center.

4. Mismatched Disk Group Compatibility

If compatible.asm, compatible.rdbms, or compatible.advm values are set incorrectly relative to the GI version, the health checker will report advisories as failures.
