Vibe Coding a Full AI Security CTF Platform in One Day (Here's My Full Process)
From zero to 18 flags: architecture, UX, challenge design, deployment, and launch strategy for an AI CTF event.
Static workshops don’t teach security. They teach patience. You sit through slides, nod at diagrams, and leave with zero hands-on reps. That’s not useful. So instead of running another workshop, I built a full browser first CTF platform in one day.
I work DevRel at KubeArmor. The goal was simple: give AI security learners something they could actually play, not just watch.
The result: 3 tracks, 15 primary challenges plus 3 bonus flags, a CTFd scoring model, deterministic solve checks, admin controls for reset and operations, and a production Docker deployment path.
This is the build journal. The decisions, the tradeoffs, the build order, and the exact setup that made one day work without sacrificing reliability.
Why Browser First
The first decision was strategic, not technical.
If participation has friction, conversion drops. A CTF that demands local setup loses people before they ever see a challenge. The event had to be browser first, full stop.
That drove product shape:
CTFd for registration and official scoring
a custom browser app for the actual game experience
a backend runtime that keeps secrets and challenge logic server side
This separation gave three concrete wins. Players get a familiar, trusted scoreboard workflow. The custom UX handles AI-native challenge interaction. Secrets stay server side and solve logic stays deterministic.
Think Top Down, Not Challenge First
My rule for this build: do not start from individual challenges. Start from experience architecture.
The classic CTF trap is writing many challenge ideas before the runtime foundations are stable. You end up with 20 half-formed challenge concepts and no working platform. I reversed the order.
Layer 1: Event Structure
What is the official scoring source of truth? CTFd.
Where do players actually solve? Browser app.
What is the challenge unit design? 3 tracks, 5 primary challenges each, plus 1 bonus.
Layer 2: Platform Contract
Track pages must be coherent and navigable.
Challenge pages must support both chat based and action based workflows.
Solve checks must be stable, even when model output drifts.
Layer 3: Build Order
Ship platform core once.
Prove vertical slices end to end.
Scale challenge content only after reliability exists.
That sequencing prevented most of the rework. Scope ambiguity at layer 1 multiplies into hours of changes at layer 3.
The Exact Build Order
1. Lock the challenge map and scoring model first
I fixed scope early: three tracks, fixed point distribution. This removed downstream rework in both UI and backend. Any scope ambiguity at this stage is the most expensive kind.
2. Build the runtime shell first
Express + EJS. Server rendered pages are simple and debuggable. I was not trying to win a framework debate. I was trying to ship.
3. Implement one complete challenge slice per track before writing content
Before authoring all challenge files, I validated one representative solve path per track:
Prompt Injection
Agent Workflow
Hidden API
4. Add deterministic solve checks
AI output variance is real. Solve logic cannot rely purely on model text. I paired prompt flows with deterministic regex and action checks to keep challenge solving reliable across different model states.
5. Expand content to the full catalog
Only after the runtime was stable did I fill out all challenge JSON definitions and bonus progression mechanics.
6. Build admin and operational controls last
Reset paths, user and session visibility, solution library views. These are for organizers, not players. Build what players need first.
Architecture And Key Considerations
Browser First Challenge Delivery
Everything playable in the browser: prompt manipulation tasks, constrained agent workflow attacks, hidden API tamper paths. No local AI setup required.
Challenge Authoring Model
Challenges live as JSON content loaded dynamically. Fast to add and tune:
challenge metadata
system prompt behavior
flag values
solve patterns
track-level bonus progression
Reliability Over Model Luck
When the live LLM is unavailable or inconsistent, fallback mode preserves deterministic challenge checks. The CTF stays playable and testable regardless of model state. This saved me during testing more than I expected. The LLM I used was Groq given it’s generous free tier with `llm-.3-70b-verstile` allowing upto 1K/requests per day
Security
provider keys stay server side
no raw secret exposure in front end assets
challenge solve states tracked per user session
role-based admin access for control operations
UI And UX Choices
I treated this as a product, not a challenge dump. That distinction sounds obvious until you see how most CTFs are actually built.
Principles I used:
Immediate clarity: players understand what to do within seconds
Visible progression: solved states and track completion are obvious
Low navigation friction: home to track to challenge is one click
Challenge identity: strong visual differentiation per track
The home page acts as a mission board with progress context. Track pages show challenge list and status in one view. Challenge pages keep conversation, solved state, and prompt box in the same loop. No context switching.
For organizers, I built an admin dashboard with rate limit visibility, user management, and quick access to solution guides. An event platform that is hard to run doesn’t get run twice.
Testing Strategy
The fastest way to lose trust in a CTF is inconsistent solve behavior. I prioritized testing ergonomics over coverage metrics.
What I built in:
deterministic solve checks in runtime
reset paths for fast retesting
seeded demo users
visible solve status per challenge
structured challenge activity and session storage
Then three practical test passes:
Creator pass: can I solve from blank state quickly?
Peer pass: can another person solve with only available hints?
Fresh pass: can someone with no context navigate and complete the objective?
The third pass is the one most CTF builders skip. Don’t skip it.
Infrastructure And Deployment
Optimized for event reliability and organizer simplicity. The main app is just a Docker file that I packaged with peristent volume mounts for player, score and other data backups.
Local And Container Support
local Node run for rapid iteration
Docker Compose for repeatable deployment
persistent volume for user and session data
Environment Strategy
root-level environment file for runtime config
explicit provider and rate limit controls
predictable data paths for backup and restore
For DevRel-led events, setup friction for organizers is as critical as player friction. A platform that is hard to run won’t scale to repeated workshops or community events. Docker came in on day one, not as an afterthought.
Difficulty Design
Difficulty was intentionally layered.
Track progression:
easy starts build confidence
medium stages teach deeper patterns
hard stages require chaining and precision
bonus flags reward full-track completion
From observed session data, prompt injection flows solved quickly. Deeper workflow and API attacks demanded chained, multi-step reasoning. That’s the intended shape. Beginners complete meaningful portions. Experienced players still hit real depth.
How I Actually Vibe-Coded This
Most build logs skip the honest part. Here it is.
I used GitHub Copilot for most of the implementation. The speed didn’t come from blindly accepting generated code. It came from using Copilot as a fast implementation partner after I had already fixed the intent and boundaries of each slice.
My actual workflow:
Write a clear objective for one thin vertical slice. One challenge page. One solve path. One admin operation.
Use Copilot to draft implementation candidates quickly.
Validate behavior immediately in browser and session data.
Keep only code that matched deterministic solve logic and product flow.
Vibe coding gave me velocity. Constraints gave me quality. The combination is what made one day actually work.
For UI exploration, I used Google Stitch as a design thinking accelerator. Not as final output. I used it to generate layout directions, visual hierarchy ideas, and interaction ordering quickly, then translated those ideas into the Express + EJS runtime in a way that preserved reliability.
Stitch helped most in three places:
homepage information hierarchy
track page scannability
challenge page rhythm
Copilot for implementation speed. Stitch for fast UX ideation. Manual runtime constraints for correctness.
How I Wrote Prompts During The Build
Prompt quality was the hidden multiplier. I didn’t use one giant prompt and hope for magic. Short, explicit, context-rich prompts structured around intent, constraints, and acceptance criteria.
My prompt shape:
Objective: what exactly should be built in this step
Context: files, routes, current behavior, constraints
Guardrails: what must not change
Output contract: how I want the result formatted
Verification: what should be true after implementation
Example from the implementation work:
“Implement a deterministic solve check for this challenge route. Keep existing API shape unchanged. Do not move secrets client-side. Return minimal patch. Add only required validation and update solved state in session.”
For content and UX iteration:
“Rewrite this challenge instruction so a first-time participant can act within 10 seconds without losing technical depth. Keep tone assertive and concise.”
For debugging:
“Given this route and this observed behavior, list the top three likely failure points, then propose the smallest safe fix first.”
Prompt granularity is everything. Mixing architecture decisions, copywriting, and debugging in one request produces garbage. Smaller prompts, tighter loops, fewer regressions.
How To Iterate When Vibe Coding a Complex Application
CTFd Integration And Deployment Model
The deployment model is intentionally hybrid.
CTFd is the official scoring authority: registration, team context, scoreboard trust. The custom browser platform is the gameplay authority: AI interaction, challenge runtime, deterministic solve checks.
Why this pairing works:
players get familiar CTF lifecycle in CTFd
organizers get richer AI-native challenge UX in the custom platform
no need to force gameplay UX into CTFd plugin limitations
Deployment mechanics:
containerized runtime via Docker Compose
environment-driven config for provider keys and rate controls
persistent storage for session and solve state continuity
admin visibility for operational interventions during live events
This gives a repeatable event model. Scoreboard confidence from CTFd, experiential depth from the custom AI runtime.
What I’d Improve Next
If I run this at larger scale:
deeper analytics per challenge step and drop-off points
richer anti-abuse controls for public events
one-click flag relay into CTFd API
post-event reporting for learning outcomes
What Actually Matters
A good CTF platform is not just challenge content. It’s experience design, reliability engineering, and event operations all in one.
In one day, we shipped a complete, browser first AI security CTF that participants can play, organizers can run, and DevRel can scale.
If you’re designing AI security programs, the architecture pattern here is practical:
keep scoring and gameplay concerns separated
make solve logic deterministic where it matters
design UX for flow, not just visuals
ship with operations in mind from day one
That’s how KubeArmor AI CTF came together.






















