Rewriting a FIX Engine in C++23: What Got Simpler (and What Didn't)
I've been working on a FIX protocol engine in C++23. Header-only, about 5K lines, compiled with -O2 -march=native on Clang 18. Parses an ExecutionReport in ~246 ns on my bench rig. QuickFIX does the same message in ~730 ns.
Before anyone gets excited: single core, pinned affinity, warmed cache, synthetic input. Not production traffic. The 3x gap will shrink on real messages with variable-length fields and optional tags. I know.
But the code that got there was more interesting to me than the final number. Most of the gains came from replacing stuff that QuickFIX had to build by hand because C++98 didn't have the tools.
The pool that disappeared
QuickFIX has a hand-rolled object pool. About 1,000 lines of allocation logic, intrusive free lists, manual cache line alignment. Made total sense when it was written. C++98 didn't give you anything better.
Now there's std::pmr::monotonic_buffer_resource. Stack buffer, pointer bump, reset between messages:
```cpp
template <std::size_t N>
class MonotonicPool : public std::pmr::memory_resource {
    alignas(64) std::array<std::byte, N>  buffer_{};
    std::pmr::memory_resource*            upstream_;
    std::pmr::monotonic_buffer_resource   resource_;

public:
    MonotonicPool() noexcept
        : upstream_{std::pmr::null_memory_resource()}
        , resource_{buffer_.data(), buffer_.size(), upstream_} {}

    void reset() noexcept { resource_.release(); }

    // do_allocate/do_deallocate just forward to resource_
};
```
Call reset() after each message. P99 went from 780 ns to 56 ns. That's 14x on the tail, and it's basically just "stop hitting the allocator."
I also use mimalloc for per-session heaps. mi_heap_new() per session, mi_heap_destroy() on disconnect. Felt wasteful at first, like I was throwing away too much memory per session. But perf stat said otherwise so I stopped arguing.
consteval tag lookup
FIX messages are key-value pairs with integer tag numbers. Tag 35 is MsgType, tag 49 is SenderCompID, tag 55 is Symbol. QuickFIX resolves these with a switch statement, fifty-something cases.
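For reference, the wire format is just `tag=value` pairs separated by SOH (rendered here as `|`); this hypothetical ExecutionReport fragment shows the tags in question:

```
8=FIX.4.4|9=178|35=8|49=BROKER|56=CLIENT|55=MSFT|...|10=128|
```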
C++23 lets you build the lookup table at compile time:
```cpp
inline constexpr int MAX_COMMON_TAG = 200;

struct TagEntry {
    std::string_view name;
    bool is_header;
    bool is_required;
};

consteval std::array<TagEntry, MAX_COMMON_TAG> create_tag_table() {
    std::array<TagEntry, MAX_COMMON_TAG> table{};
    for (auto& entry : table) {
        entry = {"", false, false};
    }
    table[1]  = {TagInfo<1>::name,  TagInfo<1>::is_header,  TagInfo<1>::is_required};
    table[8]  = {TagInfo<8>::name,  TagInfo<8>::is_header,  TagInfo<8>::is_required};
    table[35] = {TagInfo<35>::name, TagInfo<35>::is_header, TagInfo<35>::is_required};
    // ~30 more entries
    return table;
}

inline constexpr auto TAG_TABLE = create_tag_table();

[[nodiscard]] inline constexpr std::string_view tag_name(int tag_num) noexcept {
    if (tag_num >= 0 && tag_num < MAX_COMMON_TAG) [[likely]] {
        return TAG_TABLE[tag_num].name;
    }
    return "";
}
```
An array index instead of a switch: O(1), and no branching on the tag value at runtime beyond the single range check. About 300 branches eliminated across the parser.

Field offsets use the same trick. QuickFIX stores them in a std::map, so every field access is a tree traversal. Here it's offsets_[tag]. Took me a while to get the constexpr initialization right for nested structs, but once it compiled it was basically free.
SIMD: the scenic route
FIX uses SOH (0x01) as the field delimiter. Scanning for it byte-by-byte is fine until your messages have 40+ fields.
Started with raw AVX2 intrinsics. Worked. Process 32 bytes, compare against SOH, extract positions from the bitmask:
```cpp
const __m256i soh_vec = _mm256_set1_epi8(fix::SOH);

for (size_t i = 0; i < simd_end; i += 32) {
    __m256i chunk = _mm256_loadu_si256(
        reinterpret_cast<const __m256i*>(ptr + i));
    __m256i cmp   = _mm256_cmpeq_epi8(chunk, soh_vec);
    uint32_t mask = static_cast<uint32_t>(_mm256_movemask_epi8(cmp));

    while (mask != 0) {
        int bit = __builtin_ctz(mask);           // lowest set bit
        result.push(static_cast<uint32_t>(i + bit));
        mask &= mask - 1;                        // clear it
    }
}
```
Then I realized I'd need an AVX-512 path, an SSE path, and an ARM NEON path. Four copies of the same logic with different intrinsic names. Maintaining that sounded miserable.
Tried Highway (Google's portable SIMD library). Nice API, but the build dependency was heavy for a header-only project. Compile times went up noticeably. I spent a couple hours trying to make it work as a submodule before giving up.
Ended up on xsimd. Header-only, template-based, picks the instruction set at compile time:
```cpp
template <class Arch>
inline SohPositions scan_soh_xsimd(std::span<const char> data) noexcept {
    using batch_t = xsimd::batch<char, Arch>;
    constexpr size_t width = batch_t::size;

    const batch_t soh_vec(static_cast<char>(fix::SOH));
    // same loop as the AVX2 version, portable across architectures
}
```
Raw AVX2 was maybe 5% faster on the same hardware. I kept both paths in the repo but default to xsimd. The portability is worth 5%.
SOH scan throughput: 3.32 GB/s. Sounds impressive until you realize that's just finding delimiters. Actual parsing is slower. But it means delimiter scanning isn't the bottleneck anymore, which is the whole point.
What didn't get simpler
Session state. FIX sessions have sequence numbers, heartbeat timers, gap fill logic, reject handling. I was hoping std::expected would clean up the error propagation and... it helped a little. Like 10% less boilerplate. The complexity is in the protocol, not the language. It's a state machine with a lot of branches and I don't think any C++ standard is going to fix that.
Message type coverage. I've got 9 types (NewOrderSingle, ExecutionReport, the session-level ones). QuickFIX covers all of them. Adding a new type isn't hard, just tedious. Field definitions, validation rules, serialization. About a day per message type if you include tests. I got to nine and just... stopped. Started working on the transport layer instead because that was more interesting. Not my proudest engineering decision.
Header-only at 5K lines. Compiles in 2.8s on Clang, 4.1s on GCC. That's fine on my machine. No idea what happens on a CI runner with 2GB of RAM. I keep saying I'll add a compiled-library option. Haven't done it.
Benchmarks
```
$ ./bench --iterations=100000 --pin-cpu=3

ExecutionReport parse:   246 ns        (QuickFIX: 730 ns)
NewOrderSingle parse:    229 ns        (QuickFIX: 661 ns)
Field access (4):         11 ns        (QuickFIX:  31 ns)
Throughput:              4.17M msg/sec (QuickFIX: 1.19M msg/sec)
```
Single core, RDTSCP timing, 100K iterations, synthetic messages. Not captured from a real feed. The gap will narrow on production traffic with variable-length fields and optional tags. I'm pretty confident the parser is faster, just not sure by how much once you leave the lab.
Where I am with it
Not production-ready. Parser and session layer work well enough to benchmark, but nobody should route real orders through this.
The thing that kept surprising me was how much of QuickFIX's complexity was the language, not the problem. PMR replaced a thousand-line pool. consteval eliminated a fifty-case switch. And xsimd collapsed four architecture-specific codepaths into one template. These aren't exotic features either, they just didn't exist in C++98. I don't know if this thing will ever cover all the message types QuickFIX does, but the parser core feels solid enough that I keep coming back to it on weekends.
GitHub: github.com/Lattice9AI/NexusFIX
Still figuring out: whether header-only holds past 10K lines, how much the 3x gap closes on captured traffic, and which message types actually matter beyond the obvious nine. If you've worked with FIX and have opinions on any of that, I'm interested.
Part of NexusFix, an open-source FIX protocol engine in C++23.