The Parallel Lanes Nobody Uses
SIMD and the Eight-Lane Highway You've Been Driving Solo
Reading time: ~13 minutes
You ran ripgrep across a 2GB log file and it finished in half a second. grep would have taken ten. You called np.array * 2 and it finished before the function call overhead had time to register.
Here's what actually happened: your CPU has 256-bit registers that can process 8 floats simultaneously. Those tools used all eight lanes of an eight-lane highway. Your Python for-loop uses one.
This is what your CPU can actually do.
The Fundamental Idea
SIMD stands for Single Instruction, Multiple Data. It's not a clever trick. It's a first-class feature of every CPU you've used in the last twenty years.
The idea is direct. A normal CPU instruction operates on one value:
ADD rax, rbx # add one 64-bit integer to one other 64-bit integer
A SIMD instruction operates on a packed vector of values in a single clock:
VADDPS ymm0, ymm1, ymm2 # add eight 32-bit floats at once
Eight additions. One instruction. One cycle.
The register ymm0 is 256 bits wide. You pack 8 floats (each 32 bits) into it and treat the whole thing as a single operand. The arithmetic unit is physically wider — eight adders in parallel — and the instruction wires them all to fire simultaneously.
This is not a metaphor. It's silicon.
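You can see this from plain Rust, no assembly required. A sketch (the function name is mine): compile with --release and AVX enabled, inspect the output on Compiler Explorer, and the loop below collapses into a vaddps on a ymm register.

```rust
/// Element-wise add of two fixed-size arrays. With AVX enabled, the
/// compiler turns the whole loop into a single 8-wide vector add.
pub fn add8(a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] {
    let mut out = [0.0f32; 8];
    for i in 0..8 {
        out[i] = a[i] + b[i];
    }
    out
}
```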
How We Got Here: The Register Zoo
The story of SIMD is a story of Intel and AMD racing to add bigger and bigger registers while pretending backward compatibility wasn't getting worse.
MMX (1996) — Intel introduced the first SIMD extension in the Pentium MMX. Eight 64-bit registers (mm0–mm7) for integer operations. The catch: those registers were aliased to the mantissa fields of the x87 ST(0)–ST(7) floating-point registers. Switching between MMX and x87 FP required executing EMMS to reset the x87 tag word first. (I'm simplifying the aliasing here — the full story involves how x87 tracks "empty" register slots.) Programmers used it. Suffered for it. Moved on.
SSE (1999) — Streaming SIMD Extensions. Eight new 128-bit registers (xmm0–xmm7), finally independent of the FPU stack. Supported 4 single-precision floats or integer variants. Used heavily for 3D graphics and audio in the early 2000s.
SSE2 (2001) — Added double-precision floats and 128-bit integer operations. x86-64 made SSE2 mandatory, so as of 64-bit mode you can assume it exists. This is the baseline.
SSE3, SSSE3, SSE4.1, SSE4.2 (2004–2008) — A string of incremental additions. String-comparison instructions, dot products, population counts. Useful but baroque. The naming got embarrassing.
AVX (2011) — Intel widened the registers to 256 bits (ymm0–ymm15). Now you could do 8 floats or 4 doubles at once. The ymm registers are actually the full-width versions of the xmm registers — xmm0 is the lower 128 bits of ymm0.
AVX2 (2013) — Extended AVX to integer operations and added gather instructions (load scattered values from memory into a vector register). Available on Intel Haswell and later, AMD Ryzen. This is the register set most production code targets today.
AVX-512 (2017) — 512-bit registers (zmm0–zmm31). 16 floats or 8 doubles at once. Intel pushed this hard in server chips; it's common in the data center. Desktop support is inconsistent — Intel disabled AVX-512 on Alder Lake desktop SKUs specifically because AVX-512 instructions are power-hungry enough to trigger thermal throttling, and Alder Lake's big/little core design made the behavior unpredictable. AMD added AVX-512 starting with Zen 4. The instruction set is 300+ pages of documentation.
The registers kept doubling. The theoretical throughput kept doubling. Most application code never noticed.
Why the Compiler Sometimes Does This For You
Modern compilers — GCC, Clang, MSVC, and rustc (which uses LLVM) — can auto-vectorize loops. This is when the compiler looks at your scalar loop and emits SIMD instructions for it without you asking.
This works well when:
- The loop has no data dependencies between iterations (iteration N doesn't use the result of iteration N-1)
- The data is contiguous in memory (array, not linked list)
- The compiler can prove there's no aliasing (the input and output arrays don't overlap)
- The trip count is known or the compiler can generate a scalar fallback for the remainder
A simple sum-of-squares is a textbook case the compiler handles automatically:
pub fn sum_squares(a: &[f32]) -> f32 {
    a.iter().map(|x| x * x).sum()
}
Compile with --release targeting AVX2 and... the multiply vectorizes (vmulps) but the sum stays scalar (vaddss). Wait, what?
Floating-point addition isn't associative — (a + b) + c can give a different result from a + (b + c) due to rounding. The compiler won't reorder your additions without permission, which means it can't pack 8 sums into a single vaddps. Switch to integers and the story changes:
pub fn sum_squares_i32(a: &[i32]) -> i32 {
    a.iter().map(|x| x * x).sum()
}
Now you get vpmulld and vpaddd on ymm registers — 8 integers at once, fully vectorized. Integer addition is associative, so LLVM can reorder freely. See both versions side by side on Compiler Explorer →
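The non-associativity blocking the float version is easy to demonstrate directly. A minimal sketch (the helper name is mine): sum the same three f32 values with two groupings and compare.

```rust
/// Sum the same three f32 values with two different groupings.
fn both_groupings() -> (f32, f32) {
    let (a, b, c) = (1.0e8_f32, -1.0e8_f32, 1.0_f32);
    // (a + b) + c = 0.0 + 1.0 = 1.0
    // a + (b + c): b + c rounds back to -1.0e8, because the spacing
    // between representable f32 values near 1e8 is 8 — so we get 0.0.
    ((a + b) + c, a + (b + c))
}
```

Two groupings, two different answers. That's exactly the reordering the compiler refuses to do on your behalf.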
This is the kind of thing that makes auto-vectorization both powerful and frustrating. The compiler is doing the right thing — it won't change your program's semantics — but it means the "just write clean code and the compiler will vectorize it" advice has a large asterisk on it.
This breaks down further the moment things get complicated. Add a branch inside the loop: the compiler has to use masked operations or give up. Use a data structure it can't prove is contiguous: it has to generate both a vectorized path and a scalar fallback, with a runtime check. Access non-contiguous memory: it has to use gather instructions, which are slower than you'd hope. Add any function call it can't inline: it bails entirely.
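A concrete example of the branch case. Both functions below compute ReLU; the second expresses the branch as a select (max), which maps straight onto a per-lane vector maximum (vmaxps). Modern LLVM can often if-convert the first form too, so treat this as a sketch of the principle, not a guaranteed 8x:

```rust
/// Branchy: the `if` must be if-converted or handled with masked ops.
fn relu_branchy(v: &mut [f32]) {
    for x in v.iter_mut() {
        if *x < 0.0 {
            *x = 0.0;
        }
    }
}

/// Branch-free: `max` lowers directly to a per-lane vector maximum.
fn relu_branchless(v: &mut [f32]) {
    for x in v.iter_mut() {
        *x = x.max(0.0);
    }
}
```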
Rust's ownership model actually helps here — slices guarantee contiguous memory and the borrow checker proves non-aliasing at compile time. That's information the auto-vectorizer can use. In C, the compiler has to assume two float* arguments might alias unless you annotate with restrict.
The compiler's auto-vectorizer is optimistic but conservative. You can inspect the emitted SIMD with cargo rustc --release -- --emit asm, or use Compiler Explorer to see exactly what LLVM generated. Read that output. It's educational in a way that is sometimes painful.
Intrinsics: Taking the Wheel
When auto-vectorization isn't enough, you can write SIMD code directly using intrinsics — functions in Rust's std::arch module that map one-to-one to specific CPU instructions.
This is not assembly. You're still writing Rust. You're just telling the compiler exactly which instruction to emit. The ISA-specific code lives inside unsafe blocks, making it explicit where you're stepping outside the compiler's guarantees:
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Add two float slices element-wise using AVX.
/// Handles lengths that aren't a multiple of 8 with a scalar tail.
#[target_feature(enable = "avx")]
unsafe fn add_arrays(a: &[f32], b: &[f32], out: &mut [f32]) {
    let n = a.len().min(b.len()).min(out.len());
    let mut i = 0;
    while i + 8 <= n {
        let va = _mm256_loadu_ps(a.as_ptr().add(i));     // load 8 floats
        let vb = _mm256_loadu_ps(b.as_ptr().add(i));     // load 8 floats
        let vc = _mm256_add_ps(va, vb);                  // add all 8
        _mm256_storeu_ps(out.as_mut_ptr().add(i), vc);   // store 8 floats
        i += 8;
    }
    // scalar tail for remainder (if n % 8 != 0)
    for j in i..n {
        out[j] = a[j] + b[j];
    }
}
The __m256 type is a 256-bit vector. _mm256_loadu_ps loads 8 unaligned single-precision floats. _mm256_add_ps adds them. One call, one instruction. The #[target_feature(enable = "avx")] attribute tells the compiler this function requires AVX — calling it on hardware without AVX is undefined behavior, which is why the function is unsafe.
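Because of that UB, the standard pattern is a safe wrapper that checks CPU features at runtime and falls back to scalar code otherwise. A sketch using a compacted version of the kernel above (function names are mine):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn add_avx(a: &[f32], b: &[f32], out: &mut [f32]) {
    let n = a.len().min(b.len()).min(out.len());
    let mut i = 0;
    while i + 8 <= n {
        let sum = _mm256_add_ps(
            _mm256_loadu_ps(a.as_ptr().add(i)),
            _mm256_loadu_ps(b.as_ptr().add(i)),
        );
        _mm256_storeu_ps(out.as_mut_ptr().add(i), sum);
        i += 8;
    }
    for j in i..n {
        out[j] = a[j] + b[j]; // scalar tail
    }
}

/// Safe entry point: run the AVX kernel only after proving the CPU has AVX.
fn add_arrays_safe(a: &[f32], b: &[f32], out: &mut [f32]) {
    #[cfg(target_arch = "x86_64")]
    if is_x86_feature_detected!("avx") {
        // SAFETY: the runtime check above guarantees AVX is available.
        unsafe { add_avx(a, b, out) };
        return;
    }
    // Portable scalar fallback.
    for (o, (x, y)) in out.iter_mut().zip(a.iter().zip(b.iter())) {
        *o = x + y;
    }
}
```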
Intrinsics code is not fun to write. The naming convention (_mm256_loadu_ps vs _mm256_load_ps vs _mm512_loadu_ps) requires memorizing a taxonomy. The Intel Intrinsics Guide (at intrinsics.intel.com) is the reference — it lists every intrinsic, the instruction it maps to, the latency, and the throughput. You'll spend time there.
The upside over C: Rust's type system catches width mismatches at compile time. If you accidentally pass an __m128 where an __m256 is expected, that's a type error, not a silent runtime bug. The unsafe boundary also makes it easy to audit — every line that touches raw SIMD is visually contained.
For a higher-level alternative, Rust's portable SIMD API (std::simd) provides type-safe, architecture-independent vector types like f32x8. It's available on nightly and progressing toward stable. When it lands, it will be the preferred way to write explicit SIMD without unsafe or platform-specific intrinsics.
Most application programmers don't write intrinsics. But the programmers who write the libraries you depend on — numpy, simdjson, ripgrep — absolutely do.
Where SIMD Actually Lives
String Search
Finding a byte in a buffer. You do it constantly, you never think about it, and it's the single operation where SIMD makes the most visceral difference. A naive loop checks one byte at a time. SIMD checks 32 with a single _mm256_cmpeq_epi8 — compare 32 bytes simultaneously, get a 32-bit mask of which positions matched.
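Here's what that core loop looks like with intrinsics. A sketch (the function name is mine; a production implementation adds alignment handling and unrolling on top of this):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Find the first occurrence of `needle` in `hay`, 32 bytes at a time.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn find_byte_avx2(hay: &[u8], needle: u8) -> Option<usize> {
    let splat = _mm256_set1_epi8(needle as i8); // needle in all 32 lanes
    let mut i = 0;
    while i + 32 <= hay.len() {
        let chunk = _mm256_loadu_si256(hay.as_ptr().add(i) as *const __m256i);
        let eq = _mm256_cmpeq_epi8(chunk, splat);   // 0xFF in matching lanes
        let mask = _mm256_movemask_epi8(eq) as u32; // one bit per lane
        if mask != 0 {
            // Lowest set bit = first matching byte in this chunk.
            return Some(i + mask.trailing_zeros() as usize);
        }
        i += 32;
    }
    // Scalar tail for the last < 32 bytes.
    hay[i..].iter().position(|&b| b == needle).map(|p| i + p)
}
```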
memchr — the fundamental byte-search operation — is implemented with SIMD at every level: glibc's C implementation, and Rust's memchr crate (which we'll get to in a moment). The function you call every day is already vectorized.
ripgrep is fast partly because of SIMD-accelerated memchr. The memchr crate by Andrew Gallant implements memchr, memmem, and substring search using AVX2 (and AVX-512 where available). The core idea for substring search is Teddy — an algorithm that uses SIMD to find candidate positions in bulk, then verifies them. When ripgrep is blazing through a 2GB log file, it's pushing 32 bytes at a time through vectorized comparisons. This is why it outperforms grep by 5–10x on many workloads. It's not magic. It's lanes.
That's also why string search benchmarks look bizarre to anyone who hasn't seen SIMD before. A loop that calls find in a hot path and a SIMD-accelerated version can differ by 8x with identical O() complexity. The algorithm doesn't tell you the constant factor.
JSON Parsing
In 2019 Daniel Lemire and Geoff Langdale published "Parsing Gigabytes of JSON per Second," showing that the hot path of JSON parsing is fundamentally a SIMD problem — and giving birth to simdjson. The bottleneck in parsing isn't the logic — it's scanning through bytes looking for structural characters ({, }, [, ], :, ,, ").
simdjson processes 64 bytes at a time using AVX-512 (or 32 with AVX2). It classifies every byte simultaneously — is this a structural character? A whitespace? A quote? — using bitwise SIMD operations to produce bitmasks. Then it uses those bitmasks to drive parsing without a byte-at-a-time loop.
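The classification step is easiest to see in scalar form first. This sketch (function name mine, and deliberately simplified — real simdjson must also mask out structural characters that appear inside strings) builds the same kind of bitmask one byte at a time; the SIMD version produces an identical mask with a handful of vector compares:

```rust
/// Mark structural JSON characters in a 64-byte block as a bitmask:
/// bit i is set iff block[i] is one of { } [ ] : , "
fn structural_mask(block: &[u8; 64]) -> u64 {
    let mut mask = 0u64;
    for (i, &b) in block.iter().enumerate() {
        if matches!(b, b'{' | b'}' | b'[' | b']' | b':' | b',' | b'"') {
            mask |= 1u64 << i;
        }
    }
    mask
}
```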
The result: simdjson parses JSON at 2–3 GB/s on a modern CPU. The fastest pure-scalar parser does maybe 300–500 MB/s. The 6x difference is entirely SIMD.
That's why simdjson exists. That's why it's in MongoDB, Clickhouse, and dozens of other systems that care about throughput.
Image Processing
Every pixel is independent. Every channel is independent. This is SIMD's dream workload — no data dependencies, no branches, just arithmetic on contiguous arrays of bytes. SSE2 processes 16 pixels at once with saturating addition (u8x16::saturating_add in portable SIMD). OpenCV, libjpeg-turbo, libpng — they all have SIMD paths for their hot loops. When Photoshop applies a filter to a 24-megapixel image in under a second, this is why.
ML Inference
This is the one that matters most right now.
Neural network inference is fundamentally matrix multiplication: take a weight matrix, multiply by an input vector, pass through an activation function. Repeat. The core operation — multiply-accumulate on large matrices — is exactly what SIMD was built for.
AVX2's fused multiply-add (_mm256_fmadd_ps via std::arch, or f32x8::mul_add in portable SIMD) computes a * b + c on 8 floats in one instruction. For a naive matrix multiply loop, this is an 8x multiplier before you've thought about anything else. Add tiling for cache efficiency and you're in the range of what high-performance BLAS libraries actually do.
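To make the multiply-accumulate concrete: a dot product written with f32::mul_add, which compiles to FMA instructions on targets that have them. A sketch — note the accumulator chain here is still serial because of the float-associativity rule, which is why real kernels keep several independent accumulators:

```rust
/// Dot product as a chain of fused multiply-adds (x * y + acc).
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .fold(0.0f32, |acc, (x, y)| x.mul_add(*y, acc))
}
```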
AVX-512 with VNNI (Vector Neural Network Instructions, 2019) goes further — it adds instructions specifically for the quantized integer dot products used in 8-bit inference. A single vpdpbusd instruction (exposed as _mm512_dpbusd_epi32 in intrinsics) performs 64 byte-wise multiplies and accumulates them into 16 32-bit lanes, four multiply-accumulates per lane. llama.cpp, the library that lets you run large language models on consumer hardware, has hand-written AVX2 and AVX-512 kernels for its matrix multiplication. When you run a local model on your laptop, those kernels are running in tight loops for every token you generate.
The Mindset Shift
Here's the insight that changes how you write code even if you never touch an intrinsic.
SIMD forces you to think in batches, not items.
Scalar code says: "for each element, do this." SIMD code says: "take 8 elements, do this to all of them at once, advance 8." The data structure implications are real.
Arrays of Structures vs Structures of Arrays
Consider a particle system. You might model it like this:
struct Particle {
    x: f32, y: f32, z: f32,    // position
    vx: f32, vy: f32, vz: f32, // velocity
    mass: f32,
}

let particles: Vec<Particle> = Vec::with_capacity(1_000_000);
This is AoS — Array of Structures. Each particle's data is packed together. Intuitive. Natural.
The goal: update all x positions — x += vx * dt — for every particle.
The problem: x and vx are separated by 24 bytes in each struct. When you load a SIMD vector of 8 x values, you also pull in y, z, vx, vy, vz, mass — data you don't need. Your cache lines are full of noise. Your SIMD registers require a scatter-gather to populate.
The SIMD-friendly layout is SoA — Structure of Arrays:
struct Particles {
    x: Vec<f32>,
    y: Vec<f32>,
    z: Vec<f32>,
    vx: Vec<f32>,
    // ...
}
With SoA, all x values are contiguous. Loading &particles.x[i..i+8] gives 8 consecutive x values, ready to go. Loading &particles.vx[i..i+8] gives the matching 8 vx values. One fused multiply-add updates 8 particles. No scatter-gather. No cache waste.
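The resulting update loop is a textbook auto-vectorization candidate. A sketch with just the two fields involved (method name is mine):

```rust
struct Particles {
    x: Vec<f32>,
    vx: Vec<f32>,
}

impl Particles {
    /// x += vx * dt for every particle: contiguous, dependency-free,
    /// and provably alias-free, so the compiler can go 8 lanes wide.
    fn integrate_x(&mut self, dt: f32) {
        for (x, vx) in self.x.iter_mut().zip(&self.vx) {
            *x += vx * dt;
        }
    }
}
```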
This is not a micro-optimization. The difference in a physics simulation inner loop can be 4–8x. The code is otherwise identical.
That's why SoA and AoS matter — two data structures with identical asymptotic behavior, identical logical content, identical algorithmic logic. One is auto-vectorizable. One isn't. The difference is 8x. Nobody mentioned this in algorithms class.
This also explains why entity-component systems (ECS) — used in game engines like Unity DOTS and Bevy — look structurally odd until you see SIMD. ECS stores component data in contiguous arrays per component type, not per entity. That's SoA. The performance difference for physics and animation simulations is why the pattern exists.
Alignment
SIMD instructions have opinions about memory alignment. Aligned loads — _mm256_load_ps — require the address to be 32-byte aligned (the address mod 32 == 0). Unaligned loads — _mm256_loadu_ps — work on any address, but may be slower on older hardware.
On modern CPUs (Intel Skylake and later, AMD Zen 2 and later), unaligned loads are as fast as aligned loads — as long as you don't cross a 64-byte cache line boundary. So in practice: align your arrays when you can, and use _mm256_loadu_ps so correctness never depends on it.
In Rust, you control alignment with #[repr(align(32))]:
#[repr(C, align(32))]
struct AlignedBlock {
    data: [f32; 8],
}
This is the equivalent of GCC/Clang's __attribute__((aligned(32))) or C11/C++'s alignas(32). It means: "I plan to load this with SIMD and I want the first element to be register-friendly."
You Don't Need to Write Intrinsics
The practical message is not "go rewrite your code in intrinsics." It's shorter:
Write in a way the compiler can vectorize. Keep your hot loops simple and branch-free. Lay your data out contiguously in the access order you need it. Prefer SoA over AoS in performance-critical code. Reach for libraries (numpy, simdjson, BLAS, any vectorized BLAS-backed ML framework) before reaching for intrinsics.
That's why numpy is fast and a Python for-loop isn't. numpy's inner loops are SIMD-vectorized C. When you call arr * 2, numpy dispatches to a vectorized multiply kernel operating on the entire array in chunks of 8 or 16 elements. Your Python for-loop multiplies one element per bytecode interpretation cycle.
Understand that when two seemingly equivalent implementations have an 8x performance difference, this is frequently why. Not cache (though that's related). Not branch prediction (though that matters too). The data layout didn't allow the CPU to use seven of its eight lanes.
If you do need explicit SIMD, Rust gives you options before you reach for raw intrinsics:
- std::simd — Rust's portable SIMD API (nightly, progressing toward stable). Type-safe vector types like f32x8 that compile to the best available instructions on any architecture. This is the future.
- wide — a stable crate providing portable SIMD types today. Good for production code that can't wait for std::simd.
- pulp — runtime CPU feature detection with safe SIMD dispatch.
For C++ codebases, highway (Google's portable SIMD abstraction) serves a similar role. Don't write raw _mm256_* calls unless you've exhausted the higher-level options — though in Rust, at least the type system will catch width mismatches at compile time instead of letting you discover them at midnight.
What the CPU Looks Like Now
One instruction:
  ADD rax, rbx → adds two 64-bit integers → uses 64 bits of register space

One SIMD instruction:
  VADDPS ymm0, ymm1, ymm2 → adds eight 32-bit floats → uses 256 bits of register space → eight physical adders firing simultaneously

Your loop over 8 million floats:
  Scalar:   8,000,000 add instructions
  AVX2:     1,000,000 add instructions (8x fewer)
  AVX-512:    500,000 add instructions (16x fewer)
The lanes are there. They've been there since 1999, getting wider every few years. Every calculation you've ever run in a Python loop touched one lane of a machine that had eight available.
Further Reading
- Intel Intrinsics Guide — The reference. Every intrinsic, its instruction, latency, and throughput. Searchable by operation type. Directly maps to Rust's std::arch function names.
- Rust std::simd tracking issue — The portable SIMD API's path to stabilization. Good overview of the design and current status.
- std::arch module docs — Rust's platform intrinsics. Every _mm256_* function from the Intel guide has a corresponding Rust binding here.
- memchr crate (Rust) — Andrew Gallant's SIMD-accelerated byte/substring search. Read the source and the README for a clear explanation of the Teddy algorithm.
- wide crate — Portable SIMD types on stable Rust. A practical alternative while std::simd stabilizes.
- simdjson paper — Lemire et al., 2019. "Parsing Gigabytes of JSON per Second." The original paper. Section 3 explains the SIMD classification step.
- "What Every Programmer Should Know About Memory" — Ulrich Drepper — Section 6 covers SIMD and its interaction with the cache hierarchy. This was the reference when AVX didn't exist yet; the principles are unchanged.
- Agner Fog's optimization manuals — Tables of instruction latencies and throughputs for every SIMD instruction on every microarchitecture. Dense. Invaluable if you're actually tuning.
I'm writing a book about what makes developers irreplaceable in the age of AI. Join the early access list →
Naz Quadri once hand-wrote AVX2 intrinsics for a function the Rust compiler had already vectorised better. He blogs at nazquadri.dev. Rabbit holes all the way down 🐇🕳️.