I Couldn’t Debug My AI/ML GPU Incident - So I Built My Own Tool
Several weeks ago, I ran into problems with the ML jobs running on my GPU server. Alerts fired at midnight, and one of the jobs failed due to GPU memory usage.
The next morning, I performed a root-cause analysis to understand what had happened overnight. However, I couldn't identify the issue, because I only had access to overall GPU usage metrics at the current point in time. I used nvidia-smi and nvtop to inspect the current state, but there was no trace left of the issue from the night before. I realized I needed a solution to prevent similar problems in the future.
I tried using the DCGM exporter to expose GPU metrics, but it couldn't provide PID-level metrics. I also tested it in a Kubernetes environment to get pod-level metrics, but that didn't work because our GPUs only support time-slicing mode.
Therefore, I developed an open-source tool called gpuxray to monitor GPUs at the process level. gpuxray has helped our team significantly in observing and investigating bottlenecks in AI/ML processes running on Linux servers. It exposes metrics in Prometheus format, which we use to build Grafana dashboards for visualizing resource usage per process.
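To give a sense of the output, per-process metrics in the Prometheus exposition format look roughly like the following. The metric and label names here are illustrative assumptions, not gpuxray's exact schema:

```
# Illustrative example only: metric and label names are hypothetical.
# HELP gpuxray_process_gpu_memory_used_bytes GPU memory used per process
# TYPE gpuxray_process_gpu_memory_used_bytes gauge
gpuxray_process_gpu_memory_used_bytes{gpu="0",pid="41235",comm="python3"} 2147483648
gpuxray_process_gpu_memory_used_bytes{gpu="0",pid="52110",comm="python3"} 734003200
```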
We deployed gpuxray in a Kubernetes cluster as a DaemonSet on all GPU nodes that need to be monitored.
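As a sketch, such a DaemonSet manifest looks roughly like the one below. The namespace and node selector match the kubectl output that follows; the image reference, metrics port, and security settings are assumptions (eBPF tools typically need privileged access and host PID visibility), not our exact manifest:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpuxray
  namespace: kube-operators
spec:
  selector:
    matchLabels:
      app: gpuxray
  template:
    metadata:
      labels:
        app: gpuxray
    spec:
      nodeSelector:
        node.k8s.cluster/gpu: exists   # only schedule on GPU nodes
      hostPID: true                    # assumption: see host PIDs for process-level metrics
      containers:
        - name: gpuxray
          image: ghcr.io/vuvietnguyenit/gpuxray:latest   # hypothetical image reference
          securityContext:
            privileged: true           # assumption: required to load eBPF programs
          ports:
            - name: metrics
              containerPort: 9100      # hypothetical metrics port
```

After the rollout, the DaemonSet reports ready pods on every GPU node: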
```
> kubectl -n kube-operators get daemonset/gpuxray
NAME      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                 AGE
gpuxray   2         2         2       2            2           node.k8s.cluster/gpu=exists   20d
```
With the setup described here, we can easily enable per-process GPU observability.
gpuxray achieves high performance while consuming minimal resources, because it is built on eBPF to trace GPU memory-related events. This is powerful because eBPF allows us to observe what is happening inside the kernel for specific use cases; in this case, gpuxray creates probes that attach to CUDA API calls. The project is built on a solid codebase, making it easy to extend in the future. If you have ideas, feel free to open a discussion or a pull request.
Design and Architecture
Now, I will describe the architecture of gpuxray to help you understand how it works.
Basically, the userspace code handles the main logic and is written in Go. The eBPF program is attached to CUDA API calls; when these APIs are invoked, events are captured. The eBPF program performs lightweight processing at the kernel level, updates eBPF maps, and sends events to a ring buffer. The userspace code then consumes events from the ring buffer, processes them, and produces the final metrics output, as illustrated by the sketch below.
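To make the flow concrete, here is a minimal sketch of the uprobe-plus-ring-buffer pattern in Go using the cilium/ebpf library. The object file name, program and map names, event layout, and library path are illustrative assumptions, not gpuxray's actual internals:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"log"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
	"github.com/cilium/ebpf/ringbuf"
)

// allocEvent mirrors the (assumed) struct the eBPF program pushes into the
// ring buffer for every traced CUDA allocation.
type allocEvent struct {
	Pid   uint64 // PID of the process calling cudaMalloc
	Bytes uint64 // requested allocation size
}

func main() {
	// Load the compiled eBPF object (file name is an assumption).
	coll, err := ebpf.LoadCollection("gpuxray.bpf.o")
	if err != nil {
		log.Fatalf("load eBPF objects: %v", err)
	}
	defer coll.Close()

	// Attach a uprobe to cudaMalloc in the CUDA runtime library, so the
	// eBPF program runs on every allocation (library path is an assumption).
	ex, err := link.OpenExecutable("/usr/local/cuda/lib64/libcudart.so")
	if err != nil {
		log.Fatalf("open libcudart: %v", err)
	}
	up, err := ex.Uprobe("cudaMalloc", coll.Programs["trace_cuda_malloc"], nil)
	if err != nil {
		log.Fatalf("attach uprobe: %v", err)
	}
	defer up.Close()

	// Consume the events the eBPF program emitted into the ring buffer
	// (map name "events" is an assumption).
	rd, err := ringbuf.NewReader(coll.Maps["events"])
	if err != nil {
		log.Fatalf("open ring buffer: %v", err)
	}
	defer rd.Close()

	for {
		rec, err := rd.Read()
		if err != nil {
			return // reader closed
		}
		var ev allocEvent
		if err := binary.Read(bytes.NewReader(rec.RawSample), binary.LittleEndian, &ev); err != nil {
			continue // skip malformed samples
		}
		log.Printf("pid=%d requested %d bytes via cudaMalloc", ev.Pid, ev.Bytes)
	}
}
```

The key design point is that the eBPF side only records a tiny event per CUDA call, so the heavy lifting (aggregation and metric formatting) stays in userspace.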
Performance and Resource Usage
With the mon option, gpuxray consumes almost no resources on the GPU server.
When tracing memory leaks with the memtrace option for a specific PID, I used a Python script to generate more than 2,000 malloc/free calls per second on the GPU and observed the resource usage: gpuxray consumed only about 8% of a single CPU core (on a server with 32 CPU cores and 125 GB of RAM).
This is impressive, because ~2,000 malloc/free operations per second is well beyond a typical real-world workload. As a result, we don't need to worry about performance or resource overhead when using gpuxray.
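For reference, invocations look roughly like this. The mon and memtrace option names come from the text above, but the exact command spelling and the PID argument are assumptions; check the project README for the real syntax:

```
# monitor mode: export per-process GPU metrics (assumed invocation)
gpuxray mon

# memtrace mode: trace GPU malloc/free events for one process
# (--pid flag is a hypothetical spelling)
gpuxray memtrace --pid 41235
```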
Feel free to explore the project, try it out, and contribute your ideas: https://github.com/vuvietnguyenit/gpuxray