When Deploys Fail
Leaving the Happy Path
So far, we have walked the successful deployment path:
- the image built
- the runtime started
- the public URL loaded
- the app talked to Atlas
Win!!!
Sadly, real deployment work is not just about successful launches.
It is also about knowing how to respond when things fail.
And things will fail sometimes.
We will eventually:
- mistype an environment variable
- forget to commit a needed file
- break the frontend build
- push code that works locally but fails in the hosted runtime
That does not mean the platform is random, cursed, or personally offended.
A failed deploy is not chaos.
It is evidence.
The First Question to Ask
When a deploy fails, do not start with:
“What is wrong with deployment?”
Start with:
Which phase failed?
At this point in the lesson, the most important diagnostic split is:
- Build failure
- Runtime failure
That distinction is everything.
Because once you identify the failure phase, your next debugging move becomes much clearer.
Do not diagnose everything as “the deploy failed.”
First identify the phase. Then debug that phase.
Failure Type 1: Build Failure
A build failure means the platform could not successfully create the deployable image.
In other words:
- Docker never finished packaging the app
- the image was never completed
- the application never even reached startup
That means this is still a build pipeline problem, not yet a live-app problem.
Common build failure causes
Typical examples include:
- missing dependencies
- bad `COPY` paths in the Dockerfile
- frontend build errors
- files referenced by the build that were never committed
- project structure mismatches between the repo and the Dockerfile
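To make the `COPY`-path and structure-mismatch causes concrete, here is a minimal multi-stage Dockerfile sketch for a Node app. The layout, stage names, and the `server.js` entrypoint are placeholders for illustration, not this project's actual files:

```dockerfile
# Minimal sketch; paths and entrypoint are assumed, not this project's real layout.
FROM node:20-alpine AS build
WORKDIR /app

# If package.json is not at the repo root, this COPY fails the build right here.
COPY package*.json ./
RUN npm install

# Files never committed to the repo are invisible to this step.
COPY . .

# Frontend build errors surface at this step, before any runtime exists.
RUN npm run build

FROM node:20-alpine
WORKDIR /app
COPY --from=build /app ./
# Assumed entrypoint; adjust to the project's actual start file.
CMD ["node", "server.js"]
```

Every one of the causes listed above maps to a specific instruction in a file like this, which is why build failures are usually diagnosable line by line.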
What build failures usually look like
You will usually see errors during steps such as:
- `npm install`
- `npm run build`
- file copy steps
- Docker instruction execution
The important point is simple:
the app never reached runtime.
If the image never built successfully, there is no point testing the public URL, blaming Atlas, or guessing about frontend routing yet.
Why build failures are often less scary
A build failure usually blocks the new deploy.
It often does not immediately destroy the last working version already running on the platform.
So while build failures are annoying, they are often a blocked release, not an instantly broken live service.
Failure Type 2: Runtime Failure
A runtime failure happens after the image built successfully.
That means:
- Docker packaging worked
- the image exists
- the container starts, or tries to start
- the application fails during startup or cannot stay healthy
This is a different class of problem entirely.
Common runtime failure causes
Typical examples include:
- missing or incorrect `MONGO_URI`
- missing or incorrect `SESSION_SECRET`
- missing or incorrect `NODE_ENV`
- hardcoded local port assumptions
- app crashes during startup
- Atlas connection failures
- production static-serving logic pointing at the wrong built asset path
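A cheap way to catch the env-var class of runtime failure is to validate configuration at startup and fail loudly. A minimal sketch, using the variable names listed above; `missingEnvVars` is a hypothetical helper, not part of any framework:

```javascript
// Hypothetical startup guard: report which required env vars are absent.
const REQUIRED_VARS = ["MONGO_URI", "SESSION_SECRET", "NODE_ENV"];

function missingEnvVars(env, required = REQUIRED_VARS) {
  // Treat unset and blank values the same way.
  return required.filter((name) => !env[name] || env[name].trim() === "");
}

// Example with a literal object (in a real app, pass process.env):
const example = { MONGO_URI: "mongodb+srv://...", NODE_ENV: "production" };
console.log(missingEnvVars(example)); // logs: [ 'SESSION_SECRET' ]
```

In a real app you would run this check before the server starts listening and exit with a clear message if anything is missing, so the failure appears in startup logs instead of as a mysterious crash later.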
What runtime failures usually look like
Runtime failures often show up as:
- a build that succeeds but never becomes healthy
- a service that restarts repeatedly
- a public URL returning platform or proxy errors
- startup exceptions in the application logs
The key idea here is:
the image exists, but the app cannot run correctly inside its hosted environment.
The Clean Split
Here is the useful mental model:
Build Failure
- Docker build fails
- image is never completed
- startup never happens
- likely causes: Dockerfile, file layout, dependencies, build tooling
Runtime Failure
- Docker build succeeds
- image is created
- startup begins
- app fails because of config, connectivity, port binding, or runtime code behavior
That distinction should become automatic.
Because once you know which side of the line you are on, the debugging path stops being blurry.
The Two-Question Triage Check
When a deploy fails, ask these questions in order:
1) Did the Docker build complete successfully?
If no, you are in build failure territory.
2) Did the built app start and stay healthy?
If no, you are in runtime failure territory.
That one-two check will save a shocking amount of wasted effort.
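The two-question check can be sketched as a tiny helper (the function and field names here are illustrative, not a real platform API):

```javascript
// Classify a failed deploy by phase, per the two-question triage.
function classifyFailure({ buildSucceeded, appHealthy }) {
  if (!buildSucceeded) return "build failure"; // question 1
  if (!appHealthy) return "runtime failure";   // question 2
  return "healthy";                            // nothing to debug
}

console.log(classifyFailure({ buildSucceeded: true, appHealthy: false }));
// logs: runtime failure
```

The point is the ordering: the second question is only worth asking once the first one has answered yes.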
Our First Move for Each Failure Type
Section titled “Our First Move for Each Failure Type”If it is a build failure, check:
- the Dockerfile
- project file paths
- committed files
- package metadata
- frontend build output
If it is a runtime failure, check:
- environment variables
- Atlas connectivity
- startup logs
- port binding
- application exceptions during boot
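One concrete check from that list: port binding. Hosted platforms typically inject a `PORT` environment variable, so a hardcoded local port works on your machine and fails in the hosted runtime. A minimal sketch; the fallback value of 3000 is an assumption for local development:

```javascript
// Resolve the listen port from the environment, falling back for local dev.
function resolvePort(env) {
  const parsed = Number.parseInt(env.PORT ?? "", 10);
  return Number.isNaN(parsed) ? 3000 : parsed; // 3000 only makes sense locally
}

console.log(resolvePort({ PORT: "8080" })); // logs: 8080
```

In an Express-style app this would feed `app.listen(resolvePort(process.env), "0.0.0.0")`, since many platforms also require binding to all interfaces rather than to localhost.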
Different failure class.
Different first move.
Why This Distinction Matters
Without this framework, we tend to collapse every hosted problem into one mushy feeling:
“Something in the cloud is broken.”
That is not useful.
This is useful:
- Build failed → check Dockerfile, dependencies, repo state, build tooling
- Runtime failed → check environment config, Atlas, startup behavior, logs
That turns deployment from spooky mystery into a process we can reason about.
Do Not Solve the Wrong Problem
One of the most expensive debugging habits is fixing the wrong layer.
Examples:
- rewriting runtime code when the image never built
- editing the Dockerfile when the real issue is a bad `MONGO_URI`
- blaming Atlas when the frontend bundle failed before startup
- blaming routing when the service never became healthy
First classify the failure.
Then fix the layer that actually failed.
⏭ Platform Logs as Truth
Now that we know how to classify deployment failures, the next step is learning where to look for the evidence that actually matters.