Clearing a Full vSAN Trace Ramdisk Across ESXi Hosts in Parallel

While reviewing an SOS support bundle from a VMware Cloud Foundation environment, I noticed every ESXi host in the management cluster was logging the same warning, over and over, right up to the moment the bundle was collected:

[vob.visorfs.ramdisk.full] Cannot extend visorfs file
/vsantraces/vsantracesLSOMVerbose--...zst because its ramdisk
(vsantraceFailover) is full.

These are INFO-level VOB events, not errors, and they do not touch your data or VMs. But when the same symptom appears on all hosts at once, fires continuously, and never clears on its own, it is worth understanding what is actually happening and fixing it cluster-wide rather than logging into each host by hand. This post walks through the diagnosis and a small Bash script that queries, fixes, and reclaims space on every host in parallel.

Understanding the vSAN Trace Ramdisk

vSAN writes diagnostic traces (DOM, LSOM, CLOM, PLOG, and others) to /vsantraces, an in-memory location backed by a ramdisk. Under normal operation these traces rotate into compressed .zst archives and the ramdisk stays well under capacity. ESXi also keeps a secondary vsantraceFailover ramdisk that catches trace writes when the primary cannot be extended.

The warning above means the failover ramdisk has hit 100% and the trace daemon can no longer write new trace data. Because tracing is purely diagnostic, the cluster keeps running normally — but you lose trace history, the logs fill with noise, and a genuinely useful troubleshooting tool is effectively offline.

A few things are easy to get wrong here, so it is worth being precise:

This is not caused by verbose tracing being left on. On a healthy vSAN ESA cluster, LSOMVerbose is enabled by default. Confirm the configured level before assuming someone changed it.
A vsantraced restart re-initializes the daemon but does not purge files already sitting on the failover ramdisk. If the ramdisk is full, restarting alone will often leave it full.
The real reclaim comes from removing the old rotated .zst archives, which are the bulk of the consumed space.

Diagnosing Before You Touch Anything

The first job is to confirm the state on every host with read-only commands. Three pieces of information tell you almost everything:

# What trace level is actually configured?
esxcli vsan trace get

# Is the failover ramdisk full?
vdf -h | grep -i vsantrace

# What is consuming the space?
ls -lhS /vsantraces/ | head

If vdf shows a line like vsantraceFailover 300M 300M 0B 100%, that is your smoking gun. The ls output will typically show several large vsantraces--*.zst archives (often the configured max file size each) as the dominant consumers.
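If you would rather not eyeball the vdf output, the check can be scripted. The following is a minimal sketch, not part of the original script: ramdisk_over is a hypothetical helper name, and it assumes the usage column is the first field that looks like a number followed by a percent sign, as in the example line above. It is meant to run on your workstation against captured output.

```shell
# Hypothetical helper: print the name and Use% of every vsantrace ramdisk
# at or above a threshold (default 90). Reads `vdf -h`-style lines on stdin.
# Assumption: the usage column is the first field of the form "<digits>%".
ramdisk_over() {
  threshold="${1:-90}"
  awk -v t="$threshold" '
    tolower($1) ~ /vsantrace/ {
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^[0-9]+%$/) {
          pct = $i
          sub(/%/, "", pct)
          if (pct + 0 >= t + 0) print $1, $i
          break
        }
      }
    }'
}
```

For example, `ssh root@esx-01.example.lab vdf -h | ramdisk_over 100` would print only the trace ramdisks that are completely full, and nothing at all on a healthy host.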

A note on SSH: ESXi SSH is disabled by default, and a recent SOS bundle will reflect that. Enable it per host first (vCenter > Host > Configure > Services > SSH > Start) before running anything below, and disable it again when you are done. If you would rather keep SSH off entirely, the same commands can be issued through vCenter with PowerCLI Get-EsxCli.

Querying Every Host at Once

Logging into hosts one at a time does not scale, and the whole point is that this condition tends to hit the entire cluster together. The script below fans out over SSH: it launches one background job per host, waits for all of them, and writes each host’s output to its own file. A per-connection timeout keeps a single unreachable host from hanging the whole run.

#!/usr/bin/env bash
#
# query-vsan-traces.sh
# Query (and optionally fix) all ESXi hosts in parallel for the
# vSAN "vsantraceFailover ramdisk full" condition.
#
# Usage:
#   ./query-vsan-traces.sh                 # QUERY only (read-only, default)
#   ./query-vsan-traces.sh --fix           # QUERY, then restart vsantraced
#   ./query-vsan-traces.sh --clean         # QUERY, then delete OLD .zst archives
#   ./query-vsan-traces.sh --clean host1   # restrict the action to specific hosts
#   ./query-vsan-traces.sh --clean --yes   # skip the confirmation prompt
#   SSH_USER=root ./query-vsan-traces.sh   # override the SSH user (default: root)

set -u

# --- config -----------------------------------------------------------------
HOSTS_DEFAULT=(
  esx-01.example.lab
  esx-02.example.lab
  esx-03.example.lab
  esx-04.example.lab
)
SSH_USER="${SSH_USER:-root}"
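# Per-connection timeouts so one unreachable host can't hang the whole run.
# BatchMode=yes disables password prompts, so key-based SSH auth must already
# be in place; accept-new needs OpenSSH 7.6 or later on the workstation.
SSH_OPTS=(-o ConnectTimeout=10 -o BatchMode=yes -o StrictHostKeyChecking=accept-new)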
OUTDIR="vsan-trace-report-$(date +%Y%m%d-%H%M%S)"
# ---------------------------------------------------------------------------

# --- arg parsing ------------------------------------------------------------
DO_FIX=0
DO_CLEAN=0
ASSUME_YES=0
HOSTS=()
for arg in "$@"; do
  case "$arg" in
    --fix) DO_FIX=1 ;;
    --clean) DO_CLEAN=1 ;;
    --yes|-y) ASSUME_YES=1 ;;
    -*) echo "Unknown option: $arg" >&2; exit 2 ;;
    *) HOSTS+=("$arg") ;;
  esac
done
[ ${#HOSTS[@]} -eq 0 ] && HOSTS=("${HOSTS_DEFAULT[@]}")

mkdir -p "$OUTDIR"

# The diagnostic commands run on each host. Keep them read-only.
read -r -d '' QUERY_CMDS <<'EOF'
echo "===== $(hostname) ====="
echo "--- esxcli vsan trace get ---"
esxcli vsan trace get 2>&1
echo "--- ramdisk usage (vsantrace) ---"
vdf -h 2>/dev/null | grep -i vsantrace || echo "(no vsantrace ramdisk line)"
echo "--- top trace files by size ---"
if [ -d /vsantraces ]; then ls -lhS /vsantraces/ | head -n 15; else echo "(/vsantraces not present)"; fi
echo "--- recent ramdisk-full VOBs (last 20) ---"
# The trailing "grep ." fails when nothing matched, so the fallback actually prints.
grep "vob.visorfs.ramdisk.full" /var/log/vobd.log 2>/dev/null | tail -n 20 | grep . || echo "(none)"
EOF

# Remediation A: restart vsantraced and show before/after ramdisk usage.
read -r -d '' FIX_CMDS <<'EOF'
echo "===== $(hostname) ====="
echo "--- BEFORE: ramdisk usage ---"
vdf -h 2>/dev/null | grep -i vsantrace || echo "(no vsantrace ramdisk line)"
echo "--- restarting vsantraced ---"
/etc/init.d/vsantraced restart 2>&1
sleep 3
echo "--- AFTER: ramdisk usage ---"
vdf -h 2>/dev/null | grep -i vsantrace || echo "(no vsantrace ramdisk line)"
EOF

# Remediation B: delete OLD .zst archives, keeping the newest 3 per host.
# Active (non-.zst) trace files are never touched.
read -r -d '' CLEAN_CMDS <<'EOF'
echo "===== $(hostname) ====="
echo "--- BEFORE: ramdisk usage ---"
vdf -h 2>/dev/null | grep -i vsantrace || echo "(no vsantrace ramdisk line)"
echo "--- deleting all but the newest 3 .zst archives ---"
ls -t /vsantraces/*.zst 2>/dev/null | tail -n +4 | while read -r f; do
  echo "rm $f"
  rm -f "$f"
done
echo "--- AFTER: ramdisk usage ---"
vdf -h 2>/dev/null | grep -i vsantrace || echo "(no vsantrace ramdisk line)"
EOF

run_on_host() {
  local host="$1" cmds="$2" suffix="$3"
  local out="$OUTDIR/${host}${suffix}.txt"
  if ssh "${SSH_OPTS[@]}" "${SSH_USER}@${host}" "$cmds" >"$out" 2>&1; then
    echo "OK    $host  -> $out"
  else
    echo "FAIL  $host  (see $out for error)"
  fi
}

fan_out() {
  local cmds="$1" suffix="$2"
  local pids=()
  for h in "${HOSTS[@]}"; do
    run_on_host "$h" "$cmds" "$suffix" &
    pids+=("$!")
  done
  wait "${pids[@]}"
}

confirm() {
  [ "$ASSUME_YES" -eq 1 ] && return 0
  printf "Proceed on all %d hosts? [y/N] " "${#HOSTS[@]}"
  read -r reply
  case "$reply" in y|Y|yes|YES) return 0 ;; *) echo "Aborted."; exit 0 ;; esac
}

# --- 1. QUERY (always) ------------------------------------------------------
echo "Querying ${#HOSTS[@]} hosts in parallel as user '$SSH_USER'..."
echo "Output dir: $OUTDIR"
echo
fan_out "$QUERY_CMDS" ""
echo
echo "Hosts with the failover ramdisk at 100%:"
grep -l "vsantraceFailover.*100%" "$OUTDIR"/*.txt 2>/dev/null \
  || echo "  (none — ramdisks have headroom)"

# --- 2. FIX (only with --fix) -----------------------------------------------
if [ "$DO_FIX" -eq 1 ]; then
  echo; echo "FIX MODE: restart 'vsantraced' on ${#HOSTS[@]} host(s)."
  echo "Only affects diagnostic tracing — no vSAN data, VMs, or I/O touched."
  confirm
  fan_out "$FIX_CMDS" "-fix"
fi

# --- 3. CLEAN (only with --clean) -------------------------------------------
if [ "$DO_CLEAN" -eq 1 ]; then
  echo; echo "CLEAN MODE: delete old .zst archives on ${#HOSTS[@]} host(s)."
  echo "Keeps the newest 3 archives per host; active trace files untouched."
  confirm
  fan_out "$CLEAN_CMDS" "-clean"
fi
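One limitation worth noting: the script reports failures on screen but does not reflect them in its exit status, which matters if you call it from a scheduler or pipeline. The sketch below shows one possible variation of the fan-out that counts failed jobs. The names fan_out_counted and FAILED_COUNT are hypothetical, and the worker argument stands in for run_on_host, which would need a `return 1` in its failure branch for the count to be meaningful.

```shell
# Sketch: run one background job per host and count how many failed.
# "$1" is the name of a worker function; remaining arguments are hosts.
# Sets FAILED_COUNT and returns nonzero if any job failed.
fan_out_counted() {
  worker="$1"; shift
  pids=""
  FAILED_COUNT=0
  for h in "$@"; do
    "$worker" "$h" &
    pids="$pids $!"
  done
  # Waiting on each PID individually returns that job's own exit status,
  # whereas a single wait on all PIDs reports only the last one.
  for pid in $pids; do
    wait "$pid" || FAILED_COUNT=$((FAILED_COUNT + 1))
  done
  [ "$FAILED_COUNT" -eq 0 ]
}
```

A caller could then end with something like `fan_out_counted my_worker "${HOSTS[@]}" || exit 1`, where my_worker is whatever per-host function you use.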

Step 1: Run the Query

With your hostnames filled into HOSTS_DEFAULT (or passed on the command line), run it with no arguments first. This is read-only and safe:

./query-vsan-traces.sh

You get one output file per host plus a summary line listing exactly which hosts are sitting at 100%. Confirm from the esxcli vsan trace get output that the trace level is the default before going further — if it is, you know this is a stuck-ramdisk problem, not a misconfiguration.
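With more than a handful of hosts, comparing the trace configuration by eye gets tedious. Here is a rough sketch that flags any host whose esxcli vsan trace get section differs from the first host's. The function names trace_section and config_outliers are my own, and the parsing assumes the section headers are exactly the "--- ... ---" lines the query prints. It tells you the hosts disagree, not which one is at the default, so you still need to read one report yourself.

```shell
# Print the "esxcli vsan trace get" section of one report file.
trace_section() {
  awk '/^--- esxcli vsan trace get ---$/ { show = 1; next }
       /^--- /                           { show = 0 }
       show' "$1"
}

# List report files whose trace config differs from the first report's.
# Skips the -fix and -clean outputs, which do not contain this section.
config_outliers() {
  dir="$1"; have_ref=0; ref=""
  for f in "$dir"/*.txt; do
    [ -e "$f" ] || continue
    case "$f" in *-fix.txt|*-clean.txt) continue ;; esac
    section="$(trace_section "$f")"
    if [ "$have_ref" -eq 0 ]; then
      ref="$section"; have_ref=1     # first host becomes the reference
    elif [ "$section" != "$ref" ]; then
      echo "$f"
    fi
  done
}
```

Running `config_outliers vsan-trace-report-<timestamp>` and getting no output means every host reported the same trace settings.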

Step 2: Try the Restart

The lightest-touch remediation is to restart the trace daemon, which re-initializes the trace directories:

./query-vsan-traces.sh --fix

The script prints before/after vdf output for each host. In my case the restart completed cleanly on all hosts but the failover ramdisk stayed at 100% — the daemon came back, but the files already on the ramdisk were not purged. That is the expected outcome when the ramdisk is already full, and it is exactly why the script does not stop here.

Step 3: Reclaim the Space

The reclaim step deletes the old rotated .zst archives, which are what actually fills the ramdisk. The script keeps the newest three archives per host so you do not lose all recent history, and it never touches the active (non-.zst) trace files:

./query-vsan-traces.sh --clean

Each remediation mode prompts for confirmation once, showing how many hosts it is about to act on, and you can skip the prompt with --yes in an automation context. After the cleanup, the AFTER vdf line should show the failover ramdisk back under capacity, and the ramdisk.full VOBs should stop appearing within a minute or two.
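If you want to see what a keep-the-newest-N policy would remove before deleting anything, the same ls and tail pipeline the cleanup uses can be run as a dry run. This is a small sketch under stated assumptions: old_archives is a hypothetical name, and like the cleanup itself it assumes archive file names contain no whitespace.

```shell
# Dry run: print the .zst archives that a "keep newest N" policy would
# delete from a directory. Nothing is removed.
# Usage: old_archives <dir> [keep]   (keep defaults to 3)
old_archives() {
  dir="$1"
  keep="${2:-3}"
  # ls -t sorts newest first; tail -n +K starts output at line K,
  # so everything after the newest $keep entries is listed.
  ls -t "$dir"/*.zst 2>/dev/null | tail -n +"$((keep + 1))"
}
```

On a host, `old_archives /vsantraces 3` lists exactly the files the --clean mode would pass to rm, which makes for a quick sanity check on one host before fanning out.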

Why This Order Matters

It is tempting to jump straight to deleting files, but running the query first buys you two things. You confirm the trace configuration is actually default (ruling out a real misconfiguration), and you capture a record of the pre-change state in the per-host output files. The restart is offered before the cleanup simply because it is the lower-impact action; when it does not free space, the cleanup is the definitive fix.

If a host is still at 100% after a cleanup, that points to something actively re-filling the ramdisk faster than rotation can drain it — a stuck writer, or a persistent trace target that is unwritable. At that point you are past the mechanical fixes and it is worth opening a support request rather than looping on the same commands.
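One way to tell whether the warnings have really stopped, or are still firing at the same rate, is to bucket them by minute. The sketch below is illustrative only: vobs_per_minute is a hypothetical name, and it assumes each vobd.log line begins with an ISO-8601 timestamp so that the first 16 characters identify the minute. Check a few lines of your own log before relying on it.

```shell
# Count ramdisk-full VOBs per minute from vobd.log-style lines on stdin.
# Assumption: lines start with a timestamp like 2024-05-01T10:00:01.000Z,
# so characters 1-16 ("YYYY-MM-DDTHH:MM") identify the minute.
vobs_per_minute() {
  grep -F 'vob.visorfs.ramdisk.full' \
    | cut -c1-16 \
    | sort \
    | uniq -c \
    | awk '{ print $2, $1 }'
}
```

Piping a host's /var/log/vobd.log through this after a cleanup should show the per-minute counts ending at around the time of the cleanup. If new minutes keep appearing, something is still filling the ramdisk.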

Conclusion

The “vsantraceFailover ramdisk full” warning looks alarming because it repeats endlessly across every host, but it is a contained, diagnostic-only condition with a safe, repeatable fix. The key is to treat the whole cluster as a unit: query every host in parallel, confirm the configuration is default, attempt the cheap restart, and reclaim space by clearing old trace archives when the restart is not enough. A small Bash wrapper turns what would be a tedious host-by-host chore into three commands you can run from your workstation, with a written record of each host’s state along the way.
