Products model benchmark launch application feature report

124x Slower: What PyTorch DataLoader Actually Does at the Kernel Level

DEV Communityby IngeroApril 1, 20266 min read2 views

<p><strong>TL;DR:</strong> PyTorch's DataLoader can be 50-124x slower than direct tensor indexing for in-memory GPU workloads. We reproduced a real PyTorch issue on an RTX 4090 and traced every CUDA API call and Linux kernel event to find the root cause. The GPU wasn't slow - it was starving. DataLoader workers generated 200,000 CPU context switches and 300,000 page allocations in 40 seconds, leaving the GPU waiting an average of 301ms per data transfer that should take microseconds.</p> <h2> The Problem </h2> <p>A PyTorch user reported that DataLoader was 7-22x slower than direct tensor indexing for a simple MLP inference workload. Even with <code>num_workers=12</code>, <code>pin_memory=True</code>, and <code>prefetch_factor=12</code>, the gap remained massive. GPU utilization sat at 10-2

TL;DR: PyTorch's DataLoader can be 50-124x slower than direct tensor indexing for in-memory GPU workloads. We reproduced a real PyTorch issue on an RTX 4090 and traced every CUDA API call and Linux kernel event to find the root cause. The GPU wasn't slow - it was starving. DataLoader workers generated 200,000 CPU context switches and 300,000 page allocations in 40 seconds, leaving the GPU waiting an average of 301ms per data transfer that should take microseconds.

The Problem

A PyTorch user reported that DataLoader was 7-22x slower than direct tensor indexing for a simple MLP inference workload. Even with num_workers=12, pin_memory=True, and prefetch_factor=12, the gap remained massive. GPU utilization sat at 10-20%.

We reproduced it. The gap was even worse on our hardware:

Method Time vs Direct

Direct tensor indexing 0.39s 1x

DataLoader (shuffle=True) 48.49s 124x slower

DataLoader (optimized, 4 workers, pin_memory) 43.29s 111x slower

The workload is trivial: 7M samples, 100 features, 2-layer MLP, batch size 1M. The model processes a batch in milliseconds. So where does the time go?

What nvidia-smi Shows

Nothing useful. GPU utilization flickers between 0% and 30%. Memory usage is stable. Temperature is fine. The GPU is clearly underutilized, but nvidia-smi can't explain why.

What torch.profiler Shows

The reporter tried PyTorch's built-in profiler and "obtained no meaningful trace data." This is a common frustration - application-level profilers can show what CUDA kernels are running, but they cannot see the host-side scheduling, memory, and process lifecycle events that determine whether data arrives at the GPU on time.

What Kernel-Level Tracing Shows

We ran the benchmark while tracing both CUDA API calls (via eBPF uprobes on libcudart.so) and Linux kernel events (scheduler context switches, memory page allocations, process forks) simultaneously. The results tell the complete story.

4 HIGH-severity causal chains

The causal chain engine detected 4 high-severity patterns, all with the same root cause:

[HIGH] cudaStreamSync p99=42ms (1,638x p50=25us) - CPU 100% + 1,880 sched_switch events Timeline:  [SYSTEM] CPU 100%  [HOST ] 1,880 context switches (21s off-CPU)  [CUDA ] p99=42ms (1,638x p50=25us) Root cause: DataLoader workers fighting for CPU, massive page allocation pressure

[HIGH] cudaStreamSync p99=42ms (1,638x p50=25us) - CPU 100% + 1,880 sched_switch events Timeline:  [SYSTEM] CPU 100%  [HOST ] 1,880 context switches (21s off-CPU)  [CUDA ] p99=42ms (1,638x p50=25us) Root cause: DataLoader workers fighting for CPU, massive page allocation pressure

[HIGH] cudaLaunchKernel p99=24.67ms (349x p50=70us) - CPU 100% Root: 34 sched_switch events

[HIGH] cuMemAlloc p99=627us (4.0x p50) - CPU 100%

[HIGH] cuLaunchKernel p99=106us (4.0x p50) - CPU 100%`

Enter fullscreen mode

Exit fullscreen mode

The cudaStreamSync p99 is 1,638 times the p50. That's not GPU slowness - that's the GPU waiting for data that never arrives on time.

The Per-Process Breakdown

This is where it gets clear. The main process and its 4 DataLoader workers are visible as separate entities:

Main process:

cudaMemcpyAsync (host-to-device transfer): avg 301ms, max 2.9 seconds
cudaStreamSync: p99 = 42ms (normally 25us)
1,567 context switches, avg 16ms off-CPU, worst stall 5 seconds
799,018 page allocations

DataLoader worker 1: 52,863 context switches, 89,338 page allocations, worst stall 5s DataLoader worker 2: 50,638 context switches, 83,509 page allocations, worst stall 5s DataLoader worker 3: 49,361 context switches, 70,035 page allocations, worst stall 5s DataLoader worker 4: 38,862 context switches, 56,354 page allocations, worst stall 5s

Total across workers: ~191,000 context switches and ~299,000 page allocations in 40 seconds.

What This Means

The DataLoader workers are doing three expensive things that direct indexing avoids entirely:

Shuffling and indexing. DataLoader with shuffle=True generates a random permutation of indices, then each worker selects its chunk. This requires random memory access across the full 7M-sample tensor - terrible for cache locality and triggers page faults.
Collation and copying. Each worker gathers scattered samples into a contiguous batch tensor. This means allocating new memory (page allocations), copying data from random locations (cache misses), and serializing the result back to the main process via shared memory or a queue.
Competing for CPU. Four workers + the main process on a 4-vCPU machine means constant preemption. Each worker gets descheduled 50,000 times. The worst-case stall is 5 seconds - during which the GPU has nothing to process.

With direct indexing: X[i:i+batch_size] is a zero-copy view of a contiguous tensor already in memory. .to(device) triggers one DMA transfer from a single contiguous region. No workers, no shuffling, no collation, no cross-process copies, no context switches. The GPU gets data in microseconds, not hundreds of milliseconds.

The Fix

For in-memory GPU workloads where the entire dataset fits in RAM:

Don't use DataLoader. Direct indexing with a pre-shuffled index array is simpler and 100x faster:

indices = torch.randperm(num_samples) for i in range(0, num_samples, batch_size):  batch = X[indices[i:i+batch_size]].to(device)  output = model(batch)

indices = torch.randperm(num_samples) for i in range(0, num_samples, batch_size):  batch = X[indices[i:i+batch_size]].to(device)  output = model(batch)

Enter fullscreen mode

Exit fullscreen mode

When DataLoader is necessary, match num_workers to the actual CPU core count minus 1. On a 4-core machine, num_workers=2 reduces contention. Add persistent_workers=True to avoid fork overhead.
For larger-than-memory datasets where DataLoader is necessary, the real bottleneck shifts to disk I/O. Use prefetch_factor=2 (not higher - more prefetching means more memory pressure) and ensure the storage subsystem can keep up.

The Bigger Picture

This investigation illustrates a pattern we see constantly in GPU workloads: the GPU is fast, the host is the bottleneck, and GPU metrics can't see it. nvidia-smi reported low utilization but couldn't explain why. torch.profiler captured CUDA kernels but missed the 200,000 context switches happening in userspace.

The only way to see the full picture was to trace both sides simultaneously - CUDA API calls at the library level and Linux kernel scheduling events - and correlate them by time and process ID.

The causal chain CPU 100% -> 1,880 sched_switch -> cudaMemcpyAsync 301ms -> cudaStreamSync 42ms tells the complete story in one line. Without cross-stack tracing, this would have remained a mystery - as it was for the original reporter who spent weeks debugging it.

Try It Yourself

Reproduce the benchmark:

import torch, time from torch.utils.data import DataLoader

import torch, time from torch.utils.data import DataLoader

X = torch.randn(7_000_000, 100) model = torch.nn.Sequential( torch.nn.Linear(100, 512), torch.nn.ReLU(), torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 10) ).cuda()

Fast path

start = time.time() with torch.no_grad(): for i in range(0, len(X), 1_048_576): model(X[i:i+1_048_576].cuda()) torch.cuda.synchronize() print(f"Direct: {time.time()-start:.3f}s")

Slow path

loader = DataLoader(X, batch_size=1_048_576, shuffle=True) start = time.time() with torch.no_grad(): for batch in loader: model(batch.cuda()) torch.cuda.synchronize() print(f"DataLoader: {time.time()-start:.3f}s")`

Enter fullscreen mode

Exit fullscreen mode

Trace with Ingero to see what's happening under the hood:

git clone https://github.com/ingero-io/ingero.git cd ingero && make build sudo ./bin/ingero trace --duration 60s # in one terminal python3 benchmark.py # in another terminal ./bin/ingero explain --since 60s # after benchmark completes

git clone https://github.com/ingero-io/ingero.git cd ingero && make build sudo ./bin/ingero trace --duration 60s # in one terminal python3 benchmark.py # in another terminal ./bin/ingero explain --since 60s # after benchmark completes

Enter fullscreen mode

Exit fullscreen mode

GitHub: github.com/ingero-io/ingero

Original issue: pytorch/pytorch#154318

Investigation performed on TensorDock RTX 4090 (24GB), Ubuntu 22.04, PyTorch 2.10.0+cu128.

Original source

DEV Community

https://dev.to/ingero/124x-slower-what-pytorch-dataloader-actually-does-at-the-kernel-level-3o3a

Was this article helpful?

Ask AI about this article

Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

modelbenchmarklaunch

ModelsLive

Gemma 4 just casually destroyed every model on our leaderboard except Opus 4.6 and GPT-5.2. 31B params, $0.20/run

Tested Gemma 4 (31B) on our benchmark. Genuinely did not expect this. 100% survival, 5 out of 5 runs profitable, +1,144% median ROI. At $0.20 per run. It outperforms GPT-5.2 ($4.43/run), Gemini 3 Pro ($2.95/run), Sonnet 4.6 ($7.90/run), and absolutely destroys every Chinese open-source model we've tested — Qwen 3.5 397B, Qwen 3.5 9B, DeepSeek V3.2, GLM-5. None of them even survive consistently. The only model that beats Gemma 4 is Opus 4.6 at $36 per run. That's 180× more expensive. 31 billion parameters. Twenty cents. We double-checked the config, the prompt, the model ID — everything is identical to every other model on the leaderboard. Same seed, same tools, same simulation. It's just genuinely this good. Strongly recommend trying it for your agentic workflows. We've tested 22 models so

Reddit r/LocalLLaMA

1m40 minutes ago

Self-Evolving AIFresh

Real-time AI (audio/video in, voice out) on an M3 Pro with Gemma E2B

Sure you can't do agentic coding with the Gemma 4 E2B, but this model is a game-changer for people learning a new language. Imagine a few years from now that people can run this locally on their phones. They can point their camera at objects and talk about them. And this model is multi-lingual, so people can always fallback to their native language if they want. This is essentially what OpenAI demoed a few years ago. Repo: https://github.com/fikrikarim/parlor submitted by /u/ffinzy [link] [comments]

Reddit r/LocalLLaMA

1mabout 2 hours ago

ModelsFresh

Drummer's Skyfall 31B v4.2 aka SKYFALL-31B-V4.2-UNCENSORED-OPUS-4.6-ROLEPLAYING-100000X-XTREME-VALUE

Yes, Google stole my proprietary model size (31B). Yes, I plan to tune all the Gemma 4 models. Join us, and support the mission! Thank you all for the love submitted by /u/TheLocalDrummer [link] [comments]

Reddit r/LocalLLaMA

1mabout 3 hours ago

Knowledge Map

TopicsEntitiesSource

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 192 connections

Scroll to zoom · drag to pan · click to open

Discussion

No comments yet — be the first to share your thoughts!

124x Slower: What PyTorch DataLoader Actually Does at the Kernel Level

The Problem

What nvidia-smi Shows

What torch.profiler Shows

What Kernel-Level Tracing Shows

4 HIGH-severity causal chains

The Per-Process Breakdown

What This Means

The Fix

The Bigger Picture

Try It Yourself

Fast path

Slow path

Daily AI Digest

More about

Gemma 4 just casually destroyed every model on our leaderboard except Opus 4.6 and GPT-5.2. 31B params, $0.20/run

Real-time AI (audio/video in, voice out) on an M3 Pro with Gemma E2B

Drummer's Skyfall 31B v4.2 aka SKYFALL-31B-V4.2-UNCENSORED-OPUS-4.6-ROLEPLAYING-100000X-XTREME-VALUE

Knowledge Map

Connected Articles — Knowledge Graph

Discussion

More in Products

20 Must-Know Tips for Java Developers in the AI Era

OpenAI’s Top Executive Fidji Simo to Take Medical Leave From Company - WSJ

OpenAI’s Top Executive Fidji Simo to Take Medical Leave From Company - WSJ

OpenAI’s Top Executive Fidji Simo to Take Medical Leave From Company - WSJ