90-Day AI Pilot Plan Template: Week-by-Week Execution Guide
Most AI pilots fail not because the technology doesn't work — but because the project drifted for 9 months without a clear decision point. A real 90-day pilot has explicit weekly milestones, a defined success bar set BEFORE you start, and a hard go/no-go gate at day 90. This template walks through every week of those 90 days: who does what, what artifacts get produced, and what the red-flag signals are. Copy it, adapt the bracketed bits to your use case, and put it in front of your sponsor before day one.
Before Day 1: Three Things That Must Exist
Don't start the calendar until these are in place:
1. **Named project owner** with at least 30% of their time allocated for 90 days. Not 'I'll fit it in.' Calendar-blocked time.
2. **Written success criteria** signed off by the executive sponsor. Three numbers max — e.g., 'reduce average response time from 14 minutes to under 4 minutes,' 'cut error rate from 6% to under 2%,' 'maintain CSAT score within 5 points of baseline.' If you can't write the success criteria in three numbers, you're not ready.
3. **Approved budget for the full 90 days plus 60 days of post-pilot** — costs don't drop the day the pilot ends. Year-1 TCO usually doubles in months 4–6 (production usage, scaling).
If any of these is missing, stop and fix it. Pilots without these three preconditions slip into 'is this even a project anymore' territory by week 6.
Weeks 1–2: Kickoff and Discovery
**Week 1 — Discovery:**
• Day 1: Kickoff meeting with sponsor, owner, end-user reps, and vendor. Confirm success criteria, scope, and timeline in writing.
• Day 2–3: Owner shadow-interviews 5–8 end users about the current process. Document pain points and 'don't break this' constraints.
• Day 4–5: Data inventory — what data exists, where it lives, who owns it, what shape it's in.
**Week 2 — Scope lock:**
• Day 6–8: Vendor delivers a Statement of Work (or internal team delivers an architecture doc). Owner reviews against success criteria.
• Day 9–10: Scope freeze — what's in, what's out, what's in 'phase 2 if we get there.' Both sides sign off.
**Deliverables:** Kickoff deck, success criteria doc (signed), data inventory, scope-freeze doc.
**Red flags this phase:** Sponsor not in the kickoff. Vendor pushes back on signing the scope freeze. Data inventory turns up that the key data lives in someone's email/PDF/spreadsheet.
Weeks 3–4: Data Preparation
**Week 3 — Cleaning:**
• Day 11–13: Extract a representative dataset (typically 1,000–10,000 records, last 6–12 months). Do NOT use cherry-picked clean data — use what production actually looks like.
• Day 14–15: Cleaning pass — standardize formats, fill obvious gaps, flag (don't fix) anomalies for the vendor/model to see.
**Week 4 — Compliance and pipeline:**
• Day 16–18: Data flow review with security/legal — where does data go, what stays on-prem, what consent or DPA changes are needed.
• Day 19–20: Stand up the data pipeline (batch or streaming) that will feed the production system. This is also where vendor integration starts.
**Deliverables:** Cleaned representative dataset, signed data-flow review, working pipeline (even if rough).
**Red flags this phase:** Data prep is taking 3× longer than the vendor estimated (very common — budget for it). Legal reviewer says 'we'll get to it' — get the review on the calendar formally.
Weeks 5–6: Build and Configure
**Week 5 — Initial build:**
• Day 21–23: Vendor (or internal team) builds the first working version against the cleaned dataset.
• Day 24–25: First demo to the owner only — not the wider team yet. Outputs reviewed against a small held-out test set.
**Week 6 — Iterate:**
• Day 26–28: Tune prompts, configuration, tool calls, thresholds. Most pilots see 10–20% performance improvement from week-6 tuning alone.
• Day 29–30: Second demo, broader audience (3–5 end users). Capture qualitative feedback.
**Deliverables:** Working v1 system, evaluation report against held-out test set, tuning log.
**Red flags this phase:** v1 is wildly off the success criteria (more than 50% gap). This is a tractable v1 problem; if you're not at least closing 70%+ of the gap during week 6 tuning, the architecture or model choice may be wrong.
Weeks 7–8: Integration and User Training
**Week 7 — Integration:**
• Day 31–33: System integrated into the actual workflow (CRM, help desk, internal tool, etc.) — not just running in a sandbox.
• Day 34–35: Edge-case and failure-mode testing. What happens when the input is weird, the API is down, the model is wrong, the user disagrees? Each path should have a documented behavior.
**Week 8 — Training:**
• Day 36–38: End-user training — 60–90 minute live session, hands-on. Cover 'what to do when it gets it wrong' explicitly.
• Day 39–40: Documentation finalized — quick-reference guide, escalation paths, who-to-call list.
**Deliverables:** Integrated system, failure-mode runbook, user training complete with sign-in sheet, documentation.
**Red flags this phase:** Vendor wants to skip the failure-mode testing ('it's working great in our environment'). Hard no — failure modes are where pilots die in production. End users report they don't understand what the system does — pause and redo training.
Weeks 9–10: Soft Launch (Monitored Production)
**Week 9 — Soft launch:**
• Day 41–43: System goes live for a subset of users (one team, one region, 25% of volume — pick a meaningful but recoverable slice).
• Day 44–45: Daily 15-minute standup between owner, vendor, and one end-user rep. Surface issues fast.
**Week 10 — Stabilize:**
• Day 46–48: Address top 3 issues from soft launch. Most are tuning, not engineering.
• Day 49–50: Soft-launch retrospective: what worked, what surprised us, what should we do differently before full rollout.
**Deliverables:** Soft-launch metrics dashboard (live), issue log with resolutions, retrospective notes.
**Red flags this phase:** End users actively bypassing the system. Either the UX is wrong or the system isn't trustworthy yet. Either way, do not roll out wider until this is solved.
Weeks 11–12: Full Rollout and Measurement
**Week 11 — Expand:**
• Day 51–55: Roll out to remaining users/volume. Keep the daily standup running.
**Week 12 — Measure:**
• Day 56–60: First clean week of full-rollout metrics. Compare against baseline AND against success criteria.
**Deliverables:** Full-rollout in production, week-12 metrics report.
**Red flags this phase:** Metrics dip during rollout (normal for 1–2 weeks, concerning beyond that). End-user satisfaction drops — investigate immediately; CSAT recovery takes 3× the time of CSAT damage.
Weeks 13: Pilot Review and Go/No-Go
**Week 13 — Decision:**
• Day 61–63: Owner compiles the full pilot report: success criteria vs. actuals, costs vs. budget, end-user feedback, vendor performance, lessons learned.
• Day 64–65: Sponsor + owner + key stakeholders meet for the go/no-go decision. Three possible outcomes:
• **Go (scale):** All three success criteria met or within 10%. Approve year-1 production budget and rollout plan.
• **Go (continue iteration):** 2 of 3 criteria met; 1 close. Extend pilot by 30–60 days with focused work on the gap. NOT 'we'll just keep going' — extend with specific milestones.
• **No-go:** 0–1 success criteria met. Document why honestly. Either change vendors, change scope, or shelve the project. Don't double down on a failing pilot — sunk-cost fallacy is the #1 cause of multi-year AI projects that never deliver.
**Deliverables:** Pilot report, signed go/no-go decision, communication to organization.
Bonus: Days 90–150 (Post-Pilot Stabilization)
If the decision is 'Go,' the work isn't done. The 60 days after pilot are when systems mature or quietly decay.
• **Days 90–105:** Daily monitoring continues; weekly tuning passes; vendor renewal terms negotiated based on actual usage patterns.
• **Days 105–120:** First model drift check — is the system still performing where it was at day 90, or are users seeing degradation? AI systems decay silently.
• **Days 120–150:** First quarterly review. Re-baseline, re-confirm success criteria still match business need, plan phase 2 if any.
Most AI implementations that fail in year 2 fail because nobody owned this post-pilot period.
Frequently Asked Questions
Frequently Asked Questions
It depends on the complexity. For most SMB use cases (customer service, document processing, scheduling, content generation), 90 days is enough to know whether to scale. For enterprise-grade multi-system integrations, 90 days will get you to a soft launch — full rollout often takes 6–9 months total. In that case, this template still applies; the 'rollout' weeks at the end become 'first integration phase' instead.
Stop and reset. Don't run out the clock to week 13 hoping it improves — make the call early. The pilot exists to find out, not to deliver. Week 6 is the right inflection point to either pivot vendor/approach or kill the project with most of the budget intact.
The business team that owns the outcome. IT is a critical partner; operations executes; but the owner needs to be accountable for the business result, not the technical implementation. Pilots owned by IT end up technically successful but with no business adoption.
Highly variable. Lightweight pilots (SaaS AI tool added to existing workflow) can be $5K–$25K in license costs plus internal time. Custom pilots with integration work and tuning typically run $25K–$150K for SMB-scale projects. Enterprise pilots with custom development can reach $250K+. If a vendor is quoting under $10K for a serious pilot, ask what's not included — it's usually data prep, integration, and training.