How We Shipped a Production AI Product in 6 Weeks

Six weeks from blank Notion doc to live production system. That's what Nuclear Marmalade did for a mid-sized professional services firm that needed an AI-powered business intelligence tool — not a demo, not a prototype, but something their team used every single day. This post is the honest account of how that happened.

What does 'production AI product' actually mean?

A production AI product is a system that real users depend on daily — one that handles edge cases, fails gracefully, and doesn't need a developer babysitting it. It's not a chatbot demo you show at a board meeting. It's not a Python script that works on your laptop. Production means it's live, it's monitored, and people are annoyed when it goes down. That distinction matters enormously when you're scoping a build, because the gap between 'impressive demo' and 'reliable tool' is where most AI projects quietly die. At Nuclear Marmalade, we've seen that gap swallow budgets and timelines on projects we weren't involved in — and we've learned to design around it from day one.

Why did six weeks feel impossible — and why it wasn't?

Six weeks sounds fast because most enterprise software projects run for six months and deliver something that needs a rewrite anyway. The reason we could move quickly wasn't magic — it was constraint. We scoped ruthlessly. The client wanted twelve features. We built four. The other eight went on a backlog with honest notes about why they weren't in the first release. When you strip a product down to the one or two things that create real value, the build gets dramatically simpler. Our AI agents service is designed around this principle — start with a single, well-defined job the AI needs to do, and make it do that job exceptionally well before adding complexity. The client's core need was surfacing patterns from unstructured internal documents. Everything else was noise until that worked.

How did we actually structure the six weeks?

Weeks one and two were entirely discovery and architecture. No code. That sounds counterintuitive when you're racing a deadline, but writing code before you understand the data is how you build beautiful things that solve the wrong problem. We mapped every data source, identified the messiest edge cases, and made a deliberate decision about which AI model fit the task — not the most impressive model, the most appropriate one. Weeks three and four were a focused build sprint with daily check-ins. Not status meetings — actual working sessions where the client's team saw real output and gave real feedback. Week five was integration and hardening: connecting to their existing systems, handling errors properly, building the monitoring layer. Week six was controlled rollout to a small internal group, then full deployment. Glen Healy ran point on the architecture decisions throughout — the kind of calls that don't show up in a project plan but determine whether the thing actually works.

What went wrong?

Honestly? Week three nearly derailed everything. The document ingestion pipeline we'd designed assumed reasonably consistent formatting across their internal files. Their files were chaos — seven years of different team members saving things differently, inconsistent naming conventions, PDFs that were scanned images rather than searchable text. We lost four days rebuilding the ingestion layer. That hurt. We'd asked about document quality in discovery, been told it was 'mostly fine,' and didn't push hard enough for a proper sample audit before we started building. That's on us. If I were doing it again, I'd spend two days in week one just ingesting a representative sample of the actual data before designing anything. Lesson learned, and it's now a standard part of how we scope product development projects.
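That sample audit is simple enough to sketch. The snippet below is a minimal, hypothetical version of the check we now run in week one: given a handful of real documents and the character count a text extractor recovered from each, it flags files that likely need OCR (scanned images with no text layer) and summarises the file formats in play. The `SampleDoc` shape and `audit_sample` helper are illustrative, not our actual tooling.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class SampleDoc:
    name: str
    extracted_chars: int  # chars a text extractor recovered; near zero ≈ scanned image

def audit_sample(docs, min_chars=50):
    """Summarise a representative sample before designing ingestion:
    which files have no usable text layer, and what formats exist."""
    needs_ocr = [d.name for d in docs if d.extracted_chars < min_chars]
    extensions = Counter(d.name.rsplit(".", 1)[-1].lower() for d in docs)
    return {
        "total": len(docs),
        "needs_ocr": needs_ocr,
        "ocr_ratio": len(needs_ocr) / len(docs) if docs else 0.0,
        "extensions": dict(extensions),
    }

# A toy sample: one clean report, one scanned PDF, one messy filename.
sample = [
    SampleDoc("Q3_report.pdf", 12000),
    SampleDoc("scan001.pdf", 0),          # image-only PDF, no text layer
    SampleDoc("notes FINAL(2).docx", 800),
]
report = audit_sample(sample)
```

An `ocr_ratio` anywhere above a few percent tells you the ingestion layer needs an OCR path before you write a line of retrieval code — exactly the signal we were missing in week three.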

Why does the tech stack choice matter more than people think?

The tech stack is where six-week timelines either hold or collapse. We didn't pick the newest tools. We picked the ones we knew deeply and that failed in legible ways — meaning when something broke, the error messages were useful. We used a retrieval-augmented generation approach rather than fine-tuning, because fine-tuning on a six-week timeline with a dataset that size was a risk we didn't need to take. RAG let us iterate on the retrieval logic quickly without retraining anything. The AI memory architecture we built gave the system genuine context about recurring queries, which cut average response generation time from 8 seconds to under 2. That's the difference between a tool people use and a tool people avoid. Stack decisions like that aren't glamorous, but they're where the real product work happens.
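The recurring-query layer is the easiest part of that architecture to illustrate. The sketch below is a hypothetical simplification (the class names and TTL policy are ours for this post, not the production code): normalise the query, key a cache on it, and only fall through to the slow retrieval-plus-generation path on a miss. Repeat questions then return in milliseconds rather than seconds.

```python
import hashlib
import time

class QueryCache:
    """Toy recurring-query memory: cache answers keyed on a
    normalised query string, with a time-to-live for freshness."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (answer, stored_at)

    def _key(self, query: str) -> str:
        # Collapse case and whitespace so trivially different phrasings match.
        normalised = " ".join(query.lower().split())
        return hashlib.sha256(normalised.encode()).hexdigest()

    def get(self, query: str):
        entry = self._store.get(self._key(query))
        if entry and time.time() - entry[1] < self.ttl:
            return entry[0]
        return None

    def put(self, query: str, answer: str):
        self._store[self._key(query)] = (answer, time.time())

def answer(query, cache, generate):
    cached = cache.get(query)
    if cached is not None:
        return cached            # fast path: recurring query
    result = generate(query)     # slow path: retrieval + generation
    cache.put(query, result)
    return result
```

The design choice worth noting is keying on a normalised query rather than the raw string: without it, "What drove Q3 revenue?" and "what drove q3 revenue?" would both pay the full generation cost.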

How do you keep an AI product from drifting after launch?

This is the question clients rarely ask before launch and always ask three months after. AI products drift when the underlying data changes, when user behaviour shifts in ways you didn't anticipate, or when the model itself gets updated by the provider. We built a lightweight monitoring layer that flagged low-confidence outputs for human review — not every output, just the ones the system itself wasn't sure about. We also scheduled a six-week post-launch review into the contract from day one. Not because we expected things to break, but because production systems always surface use cases you didn't predict. The business intelligence work we've done consistently shows that the first month of real usage teaches you more than six months of user research. Build for learning, not just for launch.
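The flagging logic itself is deliberately boring. Here's a minimal sketch of the idea, with a hypothetical `ReviewQueue` class standing in for the real monitoring layer: any output whose self-reported confidence falls below a threshold gets queued for human review instead of shipping silently. The 0.7 threshold is an illustrative default, not the figure from this project.

```python
from dataclasses import dataclass, field

@dataclass
class ReviewQueue:
    """Toy drift monitor: route only low-confidence outputs
    to a human review queue; everything else ships directly."""
    threshold: float = 0.7
    flagged: list = field(default_factory=list)

    def check(self, query: str, answer: str, confidence: float) -> bool:
        """Return True if the answer ships; False if held for review."""
        if confidence < self.threshold:
            self.flagged.append({
                "query": query,
                "answer": answer,
                "confidence": confidence,
            })
            return False
        return True
```

A queue like this doubles as a drift detector for free: if the fraction of flagged outputs climbs week over week, the underlying data or user behaviour has shifted and it's time for that post-launch review.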

Would this timeline work for every AI product?

No. And anyone who tells you otherwise is selling something. Six weeks worked here because the scope was tight, the client had a dedicated internal contact who could make decisions without a committee, and the core AI task — document retrieval and synthesis — is a well-understood problem with solid tooling. A six-week timeline for an AI product that needs real-time data integration, complex multi-agent workflows, or a public-facing interface with unpredictable user input would be reckless. What six weeks can always produce is a meaningful first version — something that tests the core assumption, creates real value for real users, and gives you a foundation to build from. That's what Nuclear Marmalade delivered here. If you're curious about what's realistic for your specific situation, the consulting page is a good place to start that conversation.


Key Takeaways

  • Scope is your timeline. The client wanted 12 features. We built 4. That's why it shipped in 6 weeks.
  • The demo-to-production gap is real. Most AI projects die there. Design for reliability from day one, not as an afterthought.
  • We got burned by bad data assumptions. Always audit a real data sample before designing the ingestion layer. Always.
  • Fast doesn't mean cheap on thinking. Two full weeks of discovery with zero code written is what made the build sprint possible.
  • Launch is the beginning, not the end. Build your monitoring and review cadence into the contract before you start, not after something breaks.

If you're sitting on an AI project that's been in planning for longer than it should be, Nuclear Marmalade can help you figure out what a realistic first version looks like — and build it. Start with our AI agents service or just read more about how we work.