Research

How we think about making AI more reliable.

Notes, findings, and engineering decisions from the team building Alethe AI — the platform where AI models debate to find the truth.

Research areas

Multi-model consensus

When do multiple AI models agree? When do they diverge? We study the conditions that produce convergence and what that convergence actually signals about reliability.
Why one AI is never enough: the case for collaborative inference

Semantic agreement scoring

Beyond word overlap: measuring alignment between AI responses using vector embeddings and cosine similarity. We research scoring approaches that stay stable without sacrificing responsiveness.
Real-time semantic scoring at scale: our smoothing approach
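As a sketch of the core idea, assuming each model's response has already been embedded into a vector: score agreement as the mean pairwise cosine similarity across responses. The helper names here (`cosine_similarity`, `agreement_score`) are illustrative, not Alethe's production code.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def agreement_score(embeddings: list[list[float]]) -> float:
    """Mean pairwise cosine similarity across all model responses."""
    pairs = [(i, j) for i in range(len(embeddings))
                    for j in range(i + 1, len(embeddings))]
    if not pairs:  # a single response trivially agrees with itself
        return 1.0
    return sum(cosine_similarity(embeddings[i], embeddings[j])
               for i, j in pairs) / len(pairs)
```

Unlike word-overlap metrics, this scores paraphrases of the same claim as near-identical, which is what matters in a debate.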

Collaborative AI architectures

How should multiple models share context? What is the optimal order for a round-robin debate? We experiment with transcript injection, turn-taking, and cross-analysis protocols.
Building the SuperIntelligence orchestrator: a GroupChat architecture
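A minimal sketch of round-robin turn-taking with transcript injection: each model sees everything said so far, and its reply is appended before the next speaker goes. `ask` stands in for a real model call and is purely hypothetical.

```python
def run_round(models: list[str], transcript: list[str], ask) -> None:
    """One round-robin round. Each model receives the full transcript so far
    (transcript injection); its reply is appended for the next speaker.
    `ask(model, context)` is a hypothetical stand-in for an API call."""
    for model in models:
        context = "\n".join(transcript)
        reply = ask(model, context)
        transcript.append(f"{model}: {reply}")

# Stub model call, just to show the flow.
def echo_ask(model: str, context: str) -> str:
    return f"reply given {len(context)} chars of context"

transcript: list[str] = []
run_round(["claude", "gpt", "gemini"], transcript, echo_ask)
```

Because context grows with every turn, speaker order changes what each model actually sees, which is why turn order is a research question rather than a detail.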

Epistemic calibration

Teaching models to know what they know — and say so. We explore how explicit uncertainty annotation affects the quality and trustworthiness of AI responses.
Convergence conditions: when should a debate stop?
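One simple family of stopping rules: end the debate once the agreement score has held above a threshold for several consecutive rounds. The threshold and patience values below are illustrative assumptions, not Alethe's actual convergence criteria.

```python
def should_stop(scores: list[float],
                threshold: float = 0.9,
                patience: int = 2) -> bool:
    """Stop once the per-round agreement score has stayed at or above
    `threshold` for `patience` consecutive rounds.
    (Illustrative rule; parameter values are assumptions.)"""
    if len(scores) < patience:
        return False
    return all(s >= threshold for s in scores[-patience:])
```

Requiring sustained agreement, rather than a single high-scoring round, guards against stopping on a momentary spike.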

Language contagion in LLMs

When one model in a multilingual debate switches language, others follow. We documented this contagion mechanism and developed multi-layer directive approaches to prevent it.
Language persistence in multi-model debates: the contagion problem

Prompt engineering for debate

System prompts that elicit genuine disagreement — not performative conflict. We study what prompt structures cause models to hold positions under pressure versus capitulate prematurely.
Prompt caching with Claude and GPT: significant token cost reduction on long debates

Open methodology

We believe research on AI reliability should be transparent. Our scoring algorithms, prompt strategies, and orchestration architecture are documented in detail — not locked away as proprietary secrets.
Agreement scoring formula
The 70/30 EMA smoothing and cosine similarity pipeline are fully documented, including why we chose these parameters over simpler alternatives.
Orchestrator design
GroupChat architecture, asyncio queue injection, backpressure events — all described in our engineering posts.
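The queue-injection idea can be illustrated with a bounded `asyncio.Queue`: a full queue makes `put()` block, which is the backpressure that keeps fast producers from flooding the transcript. This is a toy sketch under assumed names, not Alethe's orchestrator.

```python
import asyncio

async def orchestrator() -> list[str]:
    # Bounded queue: when full, put() suspends the producer until a
    # consumer drains a slot -- that suspension is the backpressure.
    queue: asyncio.Queue[str] = asyncio.Queue(maxsize=2)
    transcript: list[str] = []

    async def model_turn(name: str) -> None:
        # Hypothetical stand-in for a real model call.
        await queue.put(f"{name}: response")

    async def consumer() -> None:
        for _ in range(3):
            msg = await queue.get()
            transcript.append(msg)  # inject into the shared transcript
            queue.task_done()

    await asyncio.gather(consumer(),
                         *(model_turn(m) for m in ("claude", "gpt", "gemini")))
    return transcript

transcript = asyncio.run(orchestrator())
```

With `maxsize=2` and three producers, at least one producer is forced to wait for the consumer, demonstrating the backpressure path.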
Prompt templates
The system prompts we use for each mode are evolving, but we publish what we learn about what works and why.

Want to collaborate?

We're interested in partnerships with researchers, labs, and teams working on AI reliability, evaluation, and multi-model systems.
Get in touch