124x Slower: What PyTorch DataLoader Actually Does at the Kernel Level
TL;DR: PyTorch's DataLoader can be 50-124x slower than direct tensor indexing for in-memory GPU workloads. We reproduced a real PyTorch issue on an RTX 4090 and traced every CUDA API call and Linux kernel event to find the root cause. The GPU wasn't slow - it was starving. DataLoader workers generated 200,000 CPU context switches and 300,000 page allocations in 40 seconds, leaving the GPU waiting an average of 301ms per data transfer that should take microseconds.
The Problem
A PyTorch user reported that DataLoader was 7-22x slower than direct tensor indexing for a simple MLP inference workload. Even with num_workers=12, pin_memory=True, and prefetch_factor=12, the gap remained massive. GPU utilization sat at 10-20%.
We reproduced it. The gap was even worse on our hardware:
| Method | Time | vs Direct |
|---|---|---|
| Direct tensor indexing | 0.39s | 1x |
| DataLoader (shuffle=True) | 48.49s | 124x slower |
| DataLoader (optimized, 4 workers, pin_memory) | 43.29s | 111x slower |
The workload is trivial: 7M samples, 100 features, 2-layer MLP, batch size 1M. The model processes a batch in milliseconds. So where does the time go?
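A quick back-of-the-envelope check of the table and the workload size, as a minimal sketch. The timings and shapes come from the article; the 4 bytes per element is an assumption based on float32, the default dtype of torch.randn used in the benchmark.

```python
import math

# Timings from the table above (seconds).
direct_s = 0.39
loader_shuffle_s = 48.49
loader_tuned_s = 43.29

# Workload from the article; 4 bytes per element assumes float32,
# the default dtype of torch.randn.
num_samples = 7_000_000
num_features = 100
batch_size = 1_048_576
bytes_per_element = 4

slowdown_shuffle = loader_shuffle_s / direct_s
slowdown_tuned = loader_tuned_s / direct_s
num_batches = math.ceil(num_samples / batch_size)
batch_mib = batch_size * num_features * bytes_per_element / 2**20
dataset_gib = num_samples * num_features * bytes_per_element / 2**30

print(f"slowdown (shuffle=True): {slowdown_shuffle:.0f}x")
print(f"slowdown (tuned):        {slowdown_tuned:.0f}x")
print(f"batches per pass:        {num_batches}")
print(f"full batch size:         {batch_mib:.0f} MiB")
print(f"dataset size:            {dataset_gib:.2f} GiB")
```

One pass over the data is only seven batches, so every stalled batch accounts for a large share of the total runtime.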
What nvidia-smi Shows
Nothing useful. GPU utilization flickers between 0% and 30%. Memory usage is stable. Temperature is fine. The GPU is clearly underutilized, but nvidia-smi can't explain why.
What torch.profiler Shows
The reporter tried PyTorch's built-in profiler and "obtained no meaningful trace data." This is a common frustration - application-level profilers can show what CUDA kernels are running, but they cannot see the host-side scheduling, memory, and process lifecycle events that determine whether data arrives at the GPU on time.
What Kernel-Level Tracing Shows
We ran the benchmark while tracing both CUDA API calls (via eBPF uprobes on libcudart.so) and Linux kernel events (scheduler context switches, memory page allocations, process forks) simultaneously. The results tell the complete story.
4 HIGH-severity causal chains
The causal chain engine detected 4 high-severity patterns, all with the same root cause:
    [HIGH] cudaStreamSync p99=42ms (1,638x p50=25us) - CPU 100% + 1,880 sched_switch events
      Timeline:
        [SYSTEM] CPU 100%
        [HOST  ] 1,880 context switches (21s off-CPU)
        [CUDA  ] p99=42ms (1,638x p50=25us)
      Root cause: DataLoader workers fighting for CPU, massive page allocation pressure
    [HIGH] cudaLaunchKernel p99=24.67ms (349x p50=70us) - CPU 100%
      Root: 34 sched_switch events
    [HIGH] cuMemAlloc p99=627us (4.0x p50) - CPU 100%
    [HIGH] cuLaunchKernel p99=106us (4.0x p50) - CPU 100%
The cudaStreamSync p99 is 1,638 times the p50. That's not GPU slowness - that's the GPU waiting for data that never arrives on time.
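To make the p99-to-p50 ratio concrete, here is a minimal sketch of how such a tail ratio can be computed from raw latencies. The nearest-rank percentile, the helper names, and the synthetic latencies are all assumptions for illustration; this is not how the tracing tool computes its figures, and the numbers are not the traced data.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def tail_ratio(samples):
    """Return (p50, p99, p99 / p50) for a list of latencies."""
    p50 = percentile(samples, 50)
    p99 = percentile(samples, 99)
    return p50, p99, p99 / p50


# Synthetic latencies in microseconds (illustrative only):
# 98 fast calls at 25 us plus 2 calls stalled for 42 ms.
latencies_us = [25] * 98 + [42_000] * 2
p50, p99, ratio = tail_ratio(latencies_us)
print(f"p50={p50}us p99={p99}us ratio={ratio:.0f}x")
```

With the rounded figures (42 ms and 25 us) the ratio comes out at 1,680x; the 1,638x in the trace output was presumably computed from unrounded values. Either way, a median in the tens of microseconds with a tail in the tens of milliseconds points at occasional long stalls rather than uniformly slow calls.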
The Per-Process Breakdown
This is where it gets clear. The main process and its 4 DataLoader workers are visible as separate entities:
Main process:
- cudaMemcpyAsync (host-to-device transfer): avg 301ms, max 2.9 seconds
- cudaStreamSync: p99 = 42ms (normally 25us)
- 1,567 context switches, avg 16ms off-CPU, worst stall 5 seconds
- 799,018 page allocations
DataLoader workers:
- Worker 1: 52,863 context switches, 89,338 page allocations, worst stall 5s
- Worker 2: 50,638 context switches, 83,509 page allocations, worst stall 5s
- Worker 3: 49,361 context switches, 70,035 page allocations, worst stall 5s
- Worker 4: 38,862 context switches, 56,354 page allocations, worst stall 5s
Total across workers: ~192,000 context switches and ~299,000 page allocations in 40 seconds.
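The totals can be re-derived from the per-worker counters above. The sketch below also turns them into per-second rates, using the 40-second trace window stated in the article; the dictionary layout is just an illustration.

```python
# Per-worker counters as reported in the breakdown above.
workers = {
    "worker 1": {"context_switches": 52_863, "page_allocs": 89_338},
    "worker 2": {"context_switches": 50_638, "page_allocs": 83_509},
    "worker 3": {"context_switches": 49_361, "page_allocs": 70_035},
    "worker 4": {"context_switches": 38_862, "page_allocs": 56_354},
}
trace_seconds = 40

total_switches = sum(w["context_switches"] for w in workers.values())
total_allocs = sum(w["page_allocs"] for w in workers.values())
switches_per_s = total_switches / trace_seconds
allocs_per_s = total_allocs / trace_seconds

print(f"context switches: {total_switches:,} ({switches_per_s:,.0f}/s)")
print(f"page allocations: {total_allocs:,} ({allocs_per_s:,.0f}/s)")
```

That works out to several thousand context switches per second across four workers, sustained for the whole run.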
What This Means
The DataLoader workers are doing three expensive things that direct indexing avoids entirely:
- Shuffling and indexing. DataLoader with shuffle=True generates a random permutation of indices, and each worker fetches the samples for its assigned batches. That means random memory access across the full 7M-sample tensor, which is terrible for cache locality and triggers page faults.
- Collation and copying. Each worker gathers scattered samples into a contiguous batch tensor. This means allocating new memory (page allocations), copying data from random locations (cache misses), and serializing the result back to the main process via shared memory or a queue.
- Competing for CPU. Four workers plus the main process on a 4-vCPU machine means constant preemption. Each worker was descheduled roughly 39,000 to 53,000 times in 40 seconds. The worst-case stall is 5 seconds, during which the GPU has nothing to process.
With direct indexing: X[i:i+batch_size] is a zero-copy view of a contiguous tensor already in memory. .to(device) triggers one DMA transfer from a single contiguous region. No workers, no shuffling, no collation, no cross-process copies, no context switches. The GPU gets data in microseconds, not hundreds of milliseconds.
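The difference between a view and a gathered copy can be checked on a small CPU tensor. This is a minimal sketch: shares_storage is a hypothetical helper, and the per-sample stack is a simplified stand-in for how a map-style DataLoader with the default collate function assembles a batch, not a trace of its actual code path.

```python
import torch


def shares_storage(a, b):
    """True if both tensors are backed by the same underlying memory."""
    return a.untyped_storage().data_ptr() == b.untyped_storage().data_ptr()


X = torch.arange(40, dtype=torch.float32).reshape(10, 4)

# Slicing a contiguous range returns a view: same storage, no copy.
view = X[2:5]

# Indexing with an index tensor gathers rows into newly allocated memory.
idx = torch.tensor([7, 1, 4])
gathered = X[idx]

# Simplified per-sample path: fetch each row separately, then stack.
stacked = torch.stack([X[i] for i in idx.tolist()])

print("slice shares storage: ", shares_storage(view, X))
print("gather shares storage:", shares_storage(gathered, X))
print("stack equals gather:  ", torch.equal(stacked, gathered))
```

The per-sample path yields the same batch as the single gather but gets there through one small fetch per row. Note that the pre-shuffled loop in The Fix uses the gather form, so it still copies each batch once; what it avoids is the per-sample work and the worker processes.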
The Fix
For in-memory GPU workloads where the entire dataset fits in RAM:
- Don't use DataLoader. Direct indexing with a pre-shuffled index array is simpler and 100x faster:
    # Assumes X, model, device, num_samples and batch_size are already defined.
    indices = torch.randperm(num_samples)  # shuffle once, up front
    for i in range(0, num_samples, batch_size):
        # Gather one shuffled batch on the host, then copy it to the GPU.
        batch = X[indices[i:i+batch_size]].to(device)
        output = model(batch)
- When DataLoader is necessary, keep num_workers at or below the CPU core count minus 1 so the main process keeps a core. On a 4-core machine, num_workers=2 reduces contention. Add persistent_workers=True so workers are not re-created at the start of every epoch.
- For larger-than-memory datasets where DataLoader is necessary, the real bottleneck shifts to disk I/O. Use prefetch_factor=2 (not higher, since more prefetching means more memory pressure) and ensure the storage subsystem can keep up.
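The worker-count advice above can be sketched as a small helper that sizes the DataLoader arguments to the host. The names available_cpus and loader_kwargs are hypothetical, and the defaults (reserve one core, pin_memory=True, prefetch_factor=2) are assumptions taken from the recommendations in this section rather than universal settings.

```python
import os


def available_cpus():
    """CPUs this process may run on (respects affinity masks on Linux)."""
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def loader_kwargs(cpus, reserve=1, max_workers=None):
    """Hypothetical helper: DataLoader keyword arguments sized to the host."""
    # Leave `reserve` cores for the main process that feeds the GPU.
    num_workers = max(0, cpus - reserve)
    if max_workers is not None:
        num_workers = min(num_workers, max_workers)
    kwargs = {"num_workers": num_workers, "pin_memory": True}
    if num_workers > 0:
        # These two options are only valid when worker processes exist.
        kwargs["persistent_workers"] = True
        kwargs["prefetch_factor"] = 2
    return kwargs


print(loader_kwargs(cpus=4, max_workers=2))
print(loader_kwargs(cpus=1))
```

The returned dictionary is meant to be unpacked into the constructor, for example DataLoader(dataset, batch_size=batch_size, **loader_kwargs(available_cpus())). On a single-core host the helper falls back to num_workers=0 and leaves out the worker-only options.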
The Bigger Picture
This investigation illustrates a pattern we see constantly in GPU workloads: the GPU is fast, the host is the bottleneck, and GPU metrics can't see it. nvidia-smi reported low utilization but couldn't explain why. torch.profiler can show CUDA kernels, but it cannot see the roughly 200,000 context switches happening in the Linux scheduler.
The only way to see the full picture was to trace both sides simultaneously - CUDA API calls at the library level and Linux kernel scheduling events - and correlate them by time and process ID.
The causal chain CPU 100% -> 1,880 sched_switch -> cudaMemcpyAsync 301ms -> cudaStreamSync 42ms tells the complete story in one line. Without cross-stack tracing, this would have remained a mystery - as it was for the original reporter who spent weeks debugging it.
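Correlating by time and process ID can be sketched as an interval-overlap computation: for each CUDA call, add up how long the calling process was off-CPU while the call was in flight. The record layout, function names, and events below are made up for illustration; this is not the tool's causal chain engine and not the traced data.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    pid: int
    start_ms: float
    end_ms: float
    label: str


def overlap_ms(a, b):
    """Length of the time range shared by two intervals (0 if disjoint)."""
    return max(0.0, min(a.end_ms, b.end_ms) - max(a.start_ms, b.start_ms))


def off_cpu_share(cuda_calls, off_cpu):
    """For each CUDA call, sum the time its own process was off-CPU during the call."""
    report = []
    for call in cuda_calls:
        stalled = sum(overlap_ms(call, gap) for gap in off_cpu if gap.pid == call.pid)
        duration = call.end_ms - call.start_ms
        share = stalled / duration if duration > 0 else 0.0
        report.append((call.label, duration, stalled, share))
    return report


# Made-up events for illustration.
cuda_calls = [
    Interval(100, 0.0, 300.0, "cudaMemcpyAsync"),
    Interval(100, 300.0, 342.0, "cudaStreamSync"),
]
off_cpu = [
    Interval(100, 10.0, 250.0, "sched_switch"),   # main process descheduled
    Interval(100, 310.0, 340.0, "sched_switch"),
    Interval(201, 0.0, 400.0, "sched_switch"),    # different pid, ignored
]

report = off_cpu_share(cuda_calls, off_cpu)
for label, duration, stalled, share in report:
    print(f"{label}: {duration:.0f}ms total, {stalled:.0f}ms off-CPU ({share:.0%})")
```

When most of a call's wall-clock time overlaps with its own process being descheduled, the latency belongs to the host scheduler rather than to the GPU.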
Try It Yourself
Reproduce the benchmark:
    import time

    import torch
    from torch.utils.data import DataLoader

    X = torch.randn(7_000_000, 100)
    model = torch.nn.Sequential(
        torch.nn.Linear(100, 512), torch.nn.ReLU(),
        torch.nn.Linear(512, 512), torch.nn.ReLU(),
        torch.nn.Linear(512, 10),
    ).cuda()

    # Fast path
    start = time.time()
    with torch.no_grad():
        for i in range(0, len(X), 1_048_576):
            model(X[i:i+1_048_576].cuda())
    torch.cuda.synchronize()
    print(f"Direct: {time.time()-start:.3f}s")

    # Slow path
    loader = DataLoader(X, batch_size=1_048_576, shuffle=True)
    start = time.time()
    with torch.no_grad():
        for batch in loader:
            model(batch.cuda())
    torch.cuda.synchronize()
    print(f"DataLoader: {time.time()-start:.3f}s")
Trace with Ingero to see what's happening under the hood:
    git clone https://github.com/ingero-io/ingero.git
    cd ingero && make build
    sudo ./bin/ingero trace --duration 60s   # in one terminal
    python3 benchmark.py                     # in another terminal
    ./bin/ingero explain --since 60s         # after benchmark completes
GitHub: github.com/ingero-io/ingero
Original issue: pytorch/pytorch#154318
Investigation performed on TensorDock RTX 4090 (24GB), Ubuntu 22.04, PyTorch 2.10.0+cu128.
DEV Community
https://dev.to/ingero/124x-slower-what-pytorch-dataloader-actually-does-at-the-kernel-level-3o3a