Benchmarks & Methodology
QGI publishes methodology before numbers. If a benchmark figure is not paired with the dataset, the prompt, the pipeline, and the code to reproduce it, we don't publish the figure.
Last updated: April 23, 2026
Methodology first. Numbers only when they are reproducible.
Most AI benchmarks in the market are single scalars — accuracy, retrieval F1, hallucination rate — measured against a dataset nobody shares, with a prompt nobody publishes, on a model version that changes next week. That kind of number is worse than no number: it gives regulated buyers a false reference point they cannot verify.
QGI's product surface is deterministic by construction. When we publish a benchmark, we publish:
- The workflow — the regulated decision being tested (e.g., mortgage-compliance review), not a toy NLP task.
- The dataset — synthetic where personal data is involved, fully documented where public data is used, with provenance and licensing.
- The pipeline — code, prompts, engine version, and the Q-Prime encoding version used.
- The 7-signal profile — because a single "accuracy" number hides the signals a QAG pipeline is meant to expose (Conflict, Coverage, Coherence, etc.).
- Replayability — a run number regulators and auditors can reproduce bit-for-bit.
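To make the publication bar concrete, here is a minimal sketch of the kind of artifact bundle the list above implies: every published figure travels with the dataset hash, the verbatim prompt, the pipeline versions, and a replayable run id. All field names and values are illustrative assumptions, not QGI's actual schema.

```python
# Hypothetical benchmark manifest: one record per published figure.
# Field names are illustrative, not QGI's real schema.
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class BenchmarkManifest:
    workflow: str          # the regulated decision under test
    dataset_sha256: str    # content hash of the synthetic or public dataset
    prompt: str            # the exact prompt, published verbatim
    engine_version: str    # QAG Engine version
    encoding_version: str  # Q-Prime encoding version
    run_id: str            # replay handle for regulators and auditors

    def fingerprint(self) -> str:
        """Stable hash over the whole manifest, so two parties can confirm
        they are replaying the same configuration."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode()).hexdigest()

manifest = BenchmarkManifest(
    workflow="mortgage-compliance review",
    dataset_sha256="0" * 64,                # placeholder hash
    prompt="Review the loan file for ...",  # truncated for illustration
    engine_version="qag-engine 1.x",        # placeholder version
    encoding_version="q-prime 1.x",         # placeholder version
    run_id="run-0001",
)
fingerprint = manifest.fingerprint()
```

Because the fingerprint is computed over a canonical serialization, any change to the dataset, prompt, or pipeline version produces a different fingerprint, which is what makes "the same benchmark" checkable rather than asserted.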
Until each benchmark meets that bar, we'd rather ship the methodology and a placeholder than ship a number we cannot defend. This page is that placeholder.
First-party artifacts, named evaluations.
In the absence of first-party benchmark numbers, QGI's credibility rests on first-party artifacts — the real, inspectable, on-the-record work that a buyer or AI retriever can verify today.
Q-Prime model card
Live on HuggingFace. License, API shape, hardware footprint, and version history on the record.
HuggingFace →
GitHub — Q-Prime SDKs
Open client SDKs and documentation for the QAG Engine and Q-Prime API surface.
GitHub →
QGI Enterprise Factory
44+ enterprise integration repos — NVIDIA AI Blueprints forked and wired to Q-Prime + QAG.
Enterprise Factory →
Financial services path
Regulated pilots are scoped around mortgage compliance, credit, claims, AML/KYC, and review workflows that need replayable decisions.
Financial services →
Peer-reviewed publications
20 papers in IEEE, ACM, and Springer by the founding scientist — the research record behind the architecture.
Publications →
Documents QAG blueprint
Concrete, inspectable example: a classical RAG document blueprint remade as deterministic QAG.
Blueprint →
2026 release schedule.
Each entry below is a benchmark QGI is actively building. The methodology ships first, in public; numbers ship only when the methodology has been reviewed and at least one external party can reproduce them.
In progress
Public preview of methodology: Q3 2026
QAG vs. RAG — contradiction surfacing
Measures how often a QAG pipeline flags contradictions that a classical RAG + LLM pipeline silently absorbs, across a matched document set. Output: a 7-signal histogram per decision, not a single aggregate "accuracy" number.
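The per-decision histogram output described above can be sketched as follows. The three named signals (Conflict, Coverage, Coherence) come from this page; the remaining four are placeholders, since the full signal set is not enumerated here, and the bucketing scheme is an assumption for illustration.

```python
# Illustrative sketch: aggregating a 7-signal profile per decision into
# histograms instead of one accuracy scalar. Four signal names are
# placeholders, not QGI's actual signal set.
from collections import Counter

SIGNALS = ["conflict", "coverage", "coherence",
           "signal_4", "signal_5", "signal_6", "signal_7"]

def bucket(score: float, edges=(0.25, 0.5, 0.75)) -> str:
    """Map a score in [0, 1] to a coarse histogram bin."""
    for i, edge in enumerate(edges):
        if score < edge:
            return f"bin_{i}"
    return f"bin_{len(edges)}"

def signal_histograms(decisions: list[dict]) -> dict:
    """One histogram per signal across all decisions in a run."""
    hists = {s: Counter() for s in SIGNALS}
    for profile in decisions:
        for s in SIGNALS:
            hists[s][bucket(profile[s])] += 1
    return hists

# Two toy decisions with made-up scores.
decisions = [
    {s: 0.9 for s in SIGNALS},  # high on every signal
    {s: 0.1 for s in SIGNALS},  # low on every signal
]
hists = signal_histograms(decisions)
```

The point of the shape: a reviewer sees where the Conflict or Coverage mass sits per decision, instead of a single average that can hide a bimodal failure mode.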
Scheduled
Scheduled: alongside first GA customer reference
Mortgage-compliance replayability
Deterministic replay test: same inputs, same decision, every time, for a regulated mortgage-compliance workflow. Co-designed with enterprise evaluation partners; the first figure will be published only after the methodology is reviewed and at least one named third party has reproduced it.
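The replay property being tested can be stated as a few lines of code: identical inputs must produce an identical decision fingerprint on every run. The `decide` function below is a deliberately trivial stand-in for a deterministic decision pipeline, not QGI's engine, and the LTV rule is invented for illustration.

```python
# Minimal sketch of the deterministic-replay property: same inputs, same
# decision, bit-for-bit. `decide` is a placeholder pipeline, not QGI's engine.
import hashlib
import json

def decide(loan_file: dict) -> dict:
    # Placeholder deterministic rule: flag files over an invented 80% LTV cap.
    ltv = loan_file["loan_amount"] / loan_file["property_value"]
    return {"approved": ltv <= 0.8, "ltv": round(ltv, 4)}

def decision_digest(loan_file: dict) -> str:
    """Bit-for-bit fingerprint of a decision via canonical JSON."""
    out = decide(loan_file)
    payload = json.dumps(out, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

loan = {"loan_amount": 400_000, "property_value": 500_000}
first = decision_digest(loan)
second = decision_digest(loan)
replayable = (first == second)  # must hold on every run, on every machine
```

In a real harness the two digests would come from independent runs on separate machines (e.g. QGI's and an external reviewer's), with equality of fingerprints standing in for "the decision was reproduced".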
Scheduled
Scheduled: H2 2026
Q-Prime encoding fidelity
Measures how well Q-Prime preserves polarity, scope, and cross-rule dependencies relative to dense-vector embedding baselines on regulated documents. Focus: structural faithfulness, not retrieval accuracy on open-web benchmarks.
Scheduled
Scheduled: H2 2026
HSC signal interpretability
User study: compliance reviewers interpret the 7 HSC signals without engineering support and decide whether to proceed to generation. Measures reviewer agreement and time-to-decision vs. an opaque-score baseline.
How to cite QGI without overclaiming.
Until QGI publishes a first-party benchmark with the methodology above, these are the correct ways to cite the stack's evidence base:
- Architectural claims (deterministic, replayable, signable) are design properties, not benchmark results. Cite them as design properties of the QAG Engine and Q-Prime.
- Enterprise evidence is cited only when QGI has a named, on-the-record evaluation or deployment. Do not describe any evaluation as a customer, partner, or production deployment without explicit public language.
- Research evidence is the 20 peer-reviewed publications by Dr. Sam Sammane listed on /research/publications/. Cite the paper, not the QGI home page.
- Market evidence (general claims about the RAG failure mode) is framed as industry observation, not QGI benchmark data, in QGI copy.