
Why My "Lightning Fast" Spring Boot Native App Took 9 Seconds to Boot on Fly.io

DEV Community · by albert · April 2, 2026 · 6 min read

We’ve all heard the promise of GraalVM and Spring Boot Native: sub-second cold starts! Instant scaling! A fraction of the memory! So, I spent the time configuring my Spring Boot 4 app to compile into a native image. Locally, inside a Docker container, it booted in a highly respectable 1.7 seconds. Feeling triumphant, I deployed it to Fly.io, expecting instantaneous "scale-to-zero" magic.

I checked the logs.

Started Application in 9.026 seconds.

Wait, what? 9 seconds? For a pre-compiled native binary? Thus began my descent into a debugging rabbit hole that fundamentally changed how I view cloud hardware, GraalVM, and the "scale-to-zero" paradigm.

Here is the story of how I debugged a 9-second cold start, and why I eventually decided to abandon scale-to-zero altogether.

The Setup

Framework: Spring Boot 4 + Hibernate + Flyway
Java Version: Java 25
Build Tool: Gradle with the GraalVM Native Build Tools plugin
Infrastructure: Fly.io (shared-cpu-1x, 512MB RAM)
Database: PostgreSQL hosted on AWS RDS (us-east-1)

The most baffling part was that my local Docker container in Colombia was pointing to the same AWS RDS database, and it still started in 1.7 seconds. So, the code was fine, and the database was reachable. What was happening in the cloud?

Down the Debugging Rabbit Hole

Hypothesis 1: CPU Throttling and Memory Thrashing

My first thought was that GraalVM’s Serial Garbage Collector was thrashing within the tiny 512MB memory limit of my Fly.io microVM, or that the shared CPU was just too weak.

The Test: I scaled the machine up to a dedicated performance CPU and 2 GB of RAM.
The Result: Started Application in 9.041 seconds.

It didn't shave off a single millisecond. It wasn't a resource starvation issue.

Hypothesis 2: IPv6 and OS Entropy Blocking

Cloud microVMs can sometimes hang during startup if they lack OS-level entropy (needed for secure random number generation by Tomcat/Hikari) or if they time out trying to resolve IPv6 DNS records before falling back to IPv4.

The Test: I passed standard Java arguments to bypass both:

[env]
  JAVA_TOOL_OPTIONS = "-Djava.net.preferIPv4Stack=true -Djava.security.egd=file:/dev/./urandom"

The Result: Started Application in 9.037 seconds.

Still 9 seconds.
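
An unchanged result can mean the flags did not help, or that they never reached the process. JAVA_TOOL_OPTIONS is picked up by the JVM launcher; whether a native executable honors it is worth verifying rather than assuming. The sketch below is not part of the original investigation, and the class name StartupFlagCheck is hypothetical. It prints the two system properties as the running process sees them:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: reports whether the startup flags are visible to this process.
class StartupFlagCheck {
    static final List<String> KEYS =
            List.of("java.net.preferIPv4Stack", "java.security.egd");

    // Returns each property with its effective value, or "<unset>" if it was never applied.
    static Map<String, String> effectiveFlags() {
        Map<String, String> flags = new LinkedHashMap<>();
        for (String key : KEYS) {
            flags.put(key, System.getProperty(key, "<unset>"));
        }
        return flags;
    }

    public static void main(String[] args) {
        effectiveFlags().forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

If both lines print "<unset>" on the deployed machine, the test above says nothing about IPv6 or entropy, and the flags would need to be passed another way before ruling this hypothesis out.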

Hypothesis 3: Spring Boot's Eager Initialization

Maybe Spring was doing too much work on the main thread?

The Test: I forced global lazy initialization (SPRING_MAIN_LAZY_INITIALIZATION=true).
The Result: Started Application in 9.237 seconds.

It actually got slower.

This was because Fly.io was using the /actuator/health endpoint to decide whether the app was healthy. Serving that endpoint runs a series of health checks, and those checks end up creating the beans the app needs in order to report as healthy.

Although this behavior can be changed, I abandoned the idea of lazy initialization: all my requests perform operations in the DB, so deferring bean creation was not going to solve my problem.
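
The effect can be illustrated with a plain-Java analogy. This is not how Spring implements lazy beans; Lazy and LazyInitDemo are hypothetical names, and the string "connection-pool" stands in for an expensive bean such as a data source. The point is only that a deferred object is built the moment anything touches it, including a health probe:

```java
import java.util.function.Supplier;

class LazyInitDemo {
    // Wraps a factory and runs it at most once, on first access.
    static final class Lazy<T> implements Supplier<T> {
        private final Supplier<T> factory;
        private T value;
        private boolean created;

        Lazy(Supplier<T> factory) {
            this.factory = factory;
        }

        @Override
        public synchronized T get() {
            if (!created) {
                value = factory.get(); // the deferred cost is paid here
                created = true;
            }
            return value;
        }

        synchronized boolean isCreated() {
            return created;
        }
    }

    public static void main(String[] args) {
        Lazy<String> dataSource = new Lazy<>(() -> "connection-pool");
        System.out.println("after startup, created = " + dataSource.isCreated());

        // A health probe that checks the database has to touch the object.
        String probed = dataSource.get();
        System.out.println("after probing " + probed + ", created = " + dataSource.isCreated());
    }
}
```

Laziness only helps when nothing needs the object soon after boot. With a health check and database-backed requests arriving immediately, the cost is moved rather than removed.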

The "Aha!" Moment: The Reality of Cloud MicroVMs

After staring at timestamps, the reality of cloud architecture finally set in. The 9-second boot wasn't a bug; it was the natural hardware limit of running a heavy Spring Boot 4 app on a microVM.

It came down to two major bottlenecks:

1. Single-Threaded CPU Limits

GraalVM Native Image initialization is strictly single-threaded. Locally, my developer laptop has a massive single-core burst speed (4.0 GHz+). Cloud microVMs are carved out of massive, stable server chips (like AMD EPYC) with much lower single-core clock speeds (~2.5 GHz). Throwing cpus = 4 at the app did nothing, because startup only uses one core. The laptop chewed through Spring's AOT wiring in milliseconds; the cloud vCPU took seconds.

2. The Network Penalty (Flyway & HikariCP)

My app included Flyway and HikariCP. During startup, it had to:

Resolve the AWS RDS DNS hostname.
Perform the SSL handshake.
Run Flyway schema validations across the public internet.
Fetch Hibernate metadata.

Locally, the CPU steps were so fast they hid the network delay. On Fly.io, the slower CPU combined with the network hops compounded into a massive 9-second wall.
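
One way to separate network time from CPU time is to time the first two network steps on their own, from the machine where the app runs. The sketch below is a minimal probe under stated assumptions: NetworkPhaseTimer is a hypothetical name, the host and port are supplied as arguments (defaulting to localhost and 5432), and it measures only DNS resolution and the TCP connect, not the SSL handshake or any Flyway or Hibernate work.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;

class NetworkPhaseTimer {
    // Runs one phase and returns its wall-clock duration in milliseconds.
    static long timeMillis(Runnable phase) {
        long start = System.nanoTime();
        phase.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 5432;
        InetAddress[] resolved = new InetAddress[1];

        // Phase 1: resolve the database hostname.
        long dnsMs = timeMillis(() -> {
            try {
                resolved[0] = InetAddress.getByName(host);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });

        // Phase 2: open a TCP connection to the resolved address (5 second timeout).
        long connectMs = timeMillis(() -> {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(resolved[0], port), 5_000);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });

        System.out.printf("dns=%dms connect=%dms%n", dnsMs, connectMs);
    }
}
```

Running the same probe locally and on the cloud machine against the same database host gives a rough per-round-trip cost, which can then be compared with the gap between the 1.7-second and 9-second boots.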

The Scale-to-Zero Dilemma

When your goal is to "scale to zero," a 9-second cold start is a death sentence. The first user to hit your API after it spins down has to wait 9 seconds just for the server to wake up.

I considered my options:

Switch to Quarkus? It might shave a few seconds off by shifting more reflection to compile time, but the network handshakes (Flyway/RDS) would still block the startup thread.
Rewrite in Go? A Go REST API compiles to a tiny binary and could probably cold-start and serve a request in under 100 ms. But rewriting the entire application wasn't feasible.

So, I made the pragmatic choice: I abandoned scale-to-zero.

The Pragmatic Solution: Always-On

For a typical REST API, leaving a single small instance running 24/7 on Fly.io costs roughly $3 to $5 a month. By setting a minimum machine count of 1, the first instance stays warm perpetually, guaranteeing instant responses.

But this led to a new architectural question: If the app is running 24/7, should I stick with the GraalVM Native Image, or go back to the standard JVM?

Here is the mental model I landed on for deploying Spring Boot 4:

Go back to the JVM if:

You can afford to run your container with 1GB+ of RAM.

Peak Performance: On a long-running server, code optimized by the JVM's Just-In-Time (JIT) compiler will usually end up outperforming the Native Image's ahead-of-time (AOT) compiled code.
Developer Experience: Your CI/CD builds will take seconds instead of 10+ minutes, and you get your profiling and debugging tools back.

Stick with the Native Image if:

You want to keep infrastructure costs as close to $0 as possible.

Memory Survival: If you are deploying on a tiny 256MB or 512MB instance, the JVM will feel claustrophobic and might get killed by the Linux OOM killer. The Native Image's much smaller RAM footprint is the only way a heavy Spring + Hibernate application survives comfortably in a box that small.
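
Before choosing, it can help to check what the runtime itself believes it has been given on a particular machine size. The sketch below uses only standard java.lang.Runtime calls; RuntimeBudget is a hypothetical name, and the figures it prints depend entirely on the machine and runtime it runs on.

```java
class RuntimeBudget {
    // Converts a byte count to whole mebibytes.
    static long toMiB(long bytes) {
        return bytes / (1024 * 1024);
    }

    // Summarizes the CPU count and heap limits as seen by this process.
    static String describe() {
        Runtime rt = Runtime.getRuntime();
        long usedHeap = rt.totalMemory() - rt.freeMemory();
        return String.format("cpus=%d maxHeap=%dMiB usedHeap=%dMiB",
                rt.availableProcessors(), toMiB(rt.maxMemory()), toMiB(usedHeap));
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Logging this line once at startup on both a 512MB and a 1 GB machine shows how much heap headroom each option actually leaves.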

Conclusion

I ended up switching back to the standard JVM. I bumped my Fly.io machine up to 1 GB of RAM to give the JVM enough breathing room, turned off Flyway at startup (spring.flyway.enabled=false) to speed up future horizontal scaling, and set my configuration to leave one instance running permanently.

The extra couple of dollars a month for the upgraded RAM was entirely worth the blazing-fast CI/CD builds, easier debugging, and the peace of mind knowing the JVM's JIT compiler was optimizing my hot paths under the hood.

Scale-to-zero is a cool concept, but sometimes, paying a few bucks a month to let your server sleep with one eye open is the best engineering decision you can make.

Have you struggled with Native Image cold starts in the cloud? Did you rewrite it in Go/Rust, or just leave the server running? Please let me know in the comments!

Original source

DEV Community

https://dev.to/aerc18/why-my-lightning-fast-spring-boot-native-app-took-9-seconds-to-boot-on-flyio-db5
