Council of Advisors — Test Cases & Eval
Should-trigger test cases
- “I’m trying to decide between applying for a $50K grant with a 2-week deadline that’s only 60% aligned to our mission, or waiting for a better opportunity next quarter. Convene the council.”
- “We’ve been offered a partnership with an edtech company that wants to co-brand our AI literacy curriculum. They’d fund our pilot but their product collects student usage data. I can’t decide. Council?”
- “The council — I need perspective. We’re thinking about pivoting from school-based programs to a train-the-trainer model to scale faster. My gut says yes but I want this stress-tested.”
- “Get me multiple perspectives on whether we should publicly criticize a competitor nonprofit’s AI literacy program that we think is teaching kids to trust AI uncritically.”
- “I keep coming back to the same two options for our fundraising strategy and I don’t love either of them. What would the council say?”
- “I feel like I’m missing something obvious about why our Discord community isn’t growing despite consistent posting. Call a council session.”
Should-NOT-trigger test cases
- “Draft an acknowledgment letter for the $500 donation from Sarah Chen.”
- “What’s the deadline for the MacArthur Foundation’s next grant cycle?”
- “Post the AI literacy tip to LinkedIn.”
- “How many donors lapsed this month?”
- “Fix the error in the social delivery script.”
- “Schedule the board meeting reminder for next week.”
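The two lists above can be scored with a small harness. This is a minimal sketch, not the skill's actual trigger logic: `classify_request` is a hypothetical predicate standing in for whatever routing the agent really uses, and the sample cases below are abbreviated from the lists above.

```python
# Hypothetical triggering harness (assumption: the skill exposes some
# boolean trigger predicate we can call as classify_request(text)).
SHOULD_TRIGGER = [
    "Convene the council on the grant-deadline decision.",
    "What would the council say about our fundraising options?",
]
SHOULD_NOT_TRIGGER = [
    "Draft an acknowledgment letter for the $500 donation from Sarah Chen.",
    "How many donors lapsed this month?",
]

def eval_triggering(classify_request) -> dict:
    """Score on/off behavior: a case passes when the predicate's
    verdict matches the expected label for that list."""
    results = {"pass": 0, "fail": []}
    cases = [(t, True) for t in SHOULD_TRIGGER] + \
            [(t, False) for t in SHOULD_NOT_TRIGGER]
    for text, expected in cases:
        if classify_request(text) is expected:
            results["pass"] += 1
        else:
            results["fail"].append(text)
    return results
```

In the full eval, the two lists would hold all 12 cases and a perfect run reports 12 passes with an empty fail list.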
Evaluation checklist
- Triggering: correct on/off behavior for all 12 tests.
- Queue integration: all 6 advisor seats are enqueued immediately (fast registration) rather than run directly in parallel.
- Polling behavior: council polls callback files until each seat reaches complete|timeout|error.
- Cleanup behavior: consumed callback files are removed after ingest.
- Runtime envelope: full six-seat sequence typically lands in ~3–4 minutes (acceptable), avoiding saturation/timeouts.
- Distinctness: six advisors produce genuinely distinct perspectives, not minor variants.
- Magnus quality: surfaces a real implementation gap, not mere restatement.
- Dante quality: presents a substantive adversarial case, not only clarifying questions.
- Flow integrity: all 5 steps executed in order.
- Cross-pollination: max one round; only when sharp conflict exists.
- Synthesis quality: explicit decision, what shifted, and what was set aside.
- Disagreement clarity: Burt explicitly states where and why he disagrees with at least one advisor when applicable.
- Constraint compliance: no council for routine/deterministic/time-critical emergency paths.
- Persistence: record saved to ./data/council/YYYY-MM-DD-[topic-slug].md and retrievable.
- Readability: complete session output should be scannable in <= 5 minutes.
- Delivery: full record goes to Burt directly; the 100-word synthesis is routed via the channel matrix.
- Operationality: next actions executable within 24h.
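The polling and cleanup items in the checklist can be sketched as a single loop. This is an illustrative sketch only: the per-seat JSON callback files, the `status` field, and the file naming (`<seat>.json` in a callback directory) are assumptions, not the skill's documented layout.

```python
import json
import time
from pathlib import Path

# Terminal states a seat can reach, per the checklist:
# complete | timeout | error.
TERMINAL = {"complete", "timeout", "error"}

def poll_council(callback_dir: str, seats: list[str],
                 interval: float = 5.0, deadline: float = 300.0) -> dict:
    """Poll hypothetical per-seat callback files until every seat
    reaches a terminal status, deleting each file after ingest."""
    results: dict[str, dict] = {}
    pending = set(seats)
    start = time.monotonic()
    while pending and time.monotonic() - start < deadline:
        for seat in list(pending):
            path = Path(callback_dir) / f"{seat}.json"
            if not path.exists():
                continue  # seat has not reported back yet
            payload = json.loads(path.read_text())
            if payload.get("status") in TERMINAL:
                results[seat] = payload
                path.unlink()  # cleanup: consumed callbacks are removed
                pending.discard(seat)
        if pending:
            time.sleep(interval)
    return results
```

The 300-second default deadline matches the ~3-4 minute runtime envelope above with headroom; a seat that never writes a terminal status simply stays out of `results` rather than blocking the session forever.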
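The persistence path format in the checklist implies a topic-slug convention. A minimal sketch of building that path, assuming a simple lowercase-and-hyphenate slug rule (the actual slug rules are not specified in this document):

```python
import re
from datetime import date
from pathlib import Path

def council_record_path(topic: str, base: str = "./data/council") -> Path:
    """Build ./data/council/YYYY-MM-DD-[topic-slug].md for a session.
    Slug rule is an assumption: lowercase, non-alphanumerics collapsed
    to single hyphens, leading/trailing hyphens stripped."""
    slug = re.sub(r"[^a-z0-9]+", "-", topic.lower()).strip("-")
    return Path(base) / f"{date.today().isoformat()}-{slug}.md"
```

For example, a session titled "Grant Deadline vs. Mission Fit" held on 2025-01-15 would persist to ./data/council/2025-01-15-grant-deadline-vs-mission-fit.md.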