Probably raises $9M to build a more reliable kind of AI
Startup Probably secured $9 million in seed funding from Andreessen Horowitz to develop a system that prevents LLM hallucinations using a deterministic validator and harness engineering, allowing smaller models to achieve high accuracy.

Probably, a startup focused on making AI more reliable, has raised $9 million in seed funding from venture capital firm Andreessen Horowitz. The company aims to achieve 99.99% accuracy – a level common in deterministic systems but challenging for large language models (LLMs) – by preventing hallucinations and factual errors before they reach users.
Founder Peter Elias explains that the key is not the model itself but the engineering around it. The startup’s first product is a data science tool designed to quickly answer complex queries. Each result comes with citations and an audit trail, a practice becoming more common in AI tools.
The system uses an elaborate harness that Elias calls a “data science mech suit.” The LLM’s initial answers are checked against a deterministic validator, which rejects any results inconsistent with the dataset. The LLM is trained against this validator, and the entire pipeline is optimized for speed and accuracy. “What we learned building this was that the better your harness engineering is, the weaker the model can be,” Elias says. “If you can refine the context enough, the model does not have to work very hard to do the right thing. Basically, it’s an exercise in reducing ambiguity.”
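The validate-and-reject step described above can be sketched in a few lines. This is only an illustration of the general pattern, not Probably's implementation: the claim format, the `validate` and `answer_with_harness` names, the list of supported aggregates, and the retry limit are all assumptions made for the example, and a plain list of candidate answers stands in for the LLM.

```python
from typing import Callable, Iterable, Optional

# Hypothetical shapes: a dataset is a list of rows, and a candidate answer is a
# claim such as {"op": "sum", "column": "revenue", "value": 42.5}.
Row = dict
Claim = dict

# Aggregates the validator knows how to recompute (illustrative subset).
AGGREGATES: dict[str, Callable[[list], float]] = {
    "sum": sum,
    "max": max,
    "min": min,
}


def validate(claim: Claim, dataset: list[Row], tol: float = 1e-9) -> bool:
    """Recompute the claimed aggregate from the data and compare the result."""
    op = AGGREGATES.get(str(claim.get("op")))
    column = claim.get("column")
    # Reject unknown operations, empty data, and columns missing from any row.
    if op is None or not dataset or any(column not in row for row in dataset):
        return False
    expected = op([row[column] for row in dataset])
    value = claim.get("value")
    return isinstance(value, (int, float)) and abs(value - expected) <= tol


def answer_with_harness(
    candidates: Iterable[Claim], dataset: list[Row], max_attempts: int = 3
) -> Optional[Claim]:
    """Return the first candidate the validator accepts, or None if none pass."""
    for attempt, claim in enumerate(candidates):
        if attempt >= max_attempts:
            break
        if validate(claim, dataset):
            return claim
    return None  # nothing consistent with the dataset reaches the user


# Stand-in for model output: the first candidate is wrong, the second is right.
dataset = [{"revenue": 10.0}, {"revenue": 32.5}]
candidates = [
    {"op": "sum", "column": "revenue", "value": 40.0},
    {"op": "sum", "column": "revenue", "value": 42.5},
]
result = answer_with_harness(candidates, dataset)
```

Because the check is a plain recomputation rather than a second model call, the same input always produces the same verdict, which is what makes the validator deterministic in this sketch.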
This approach allows Probably to run on significantly smaller models. Currently, it uses a model “four classes weaker than the frontier models,” which can run on local hardware like a desktop computer rather than a data center, drastically reducing token costs – a growing concern as AI usage expands.
Elias sees the same engine as applicable to other precision-sensitive fields, such as accounting or medical services. He criticizes major AI labs for not pursuing this path, suggesting they have little incentive to do so because users who must correct errors end up buying more tokens.


