Tags: AI, survey programming, market research, research operations, ResOps, data quality, LLM evaluation

How We Measure Survey Programming Efficacy (Without Guessing)

AI survey programming needs evaluation that matches how researchers judge quality: against real programmed surveys, across content and logic, with platform-aware detail and a shared respondent model—not a vibes check.

David Thor · 11 min read
A scorecard-style breakdown comparing golden survey exports to AI-generated programming across content, hidden logic, and flow categories

If you are evaluating AI that programs surveys, the first question everyone asks is deceptively simple: how close did it get?

Close to what, though? A researcher skimming the live link? A PDF questionnaire that may not match what was actually programmed? A generic rubric about "quality"?

We think the bar has to be higher—and more specific—than that. The evaluation that actually matters is structural alignment with a finished survey your team already trusts: the export from Decipher, Qualtrics, ConfirmIt, Alchemer, or whatever platform you field on. That export is what went to respondents. It is the ground truth for routing, quotas, embedded data, and question definitions.

We built a deterministic scorecard to measure that alignment. It compares our AI-generated programming against those golden exports, scores the gaps in categories researchers care about, and produces a report you can read question by question—not a single headline number from the vendor.

This post is about how that comparison works: what is measured at the macro level, what requires platform-specific machinery, how we normalize flow logic into a Respondent Experience Model (REM) so apples-to-apples logic comparison is even possible, and why one-shot efficacy is only half the story.

Golden surveys, not golden questionnaires

A questionnaire document is a specification. A programmed survey is an implementation. They diverge constantly—sometimes on purpose (programmer fixes a wording ambiguity), sometimes by accident (a skip path never made it from the Word doc into the platform).

For efficacy measurement, we anchor on the golden export: the client's finished survey file in native platform form (.qsf, Decipher XML, ConfirmIt project export, and so on). We parse that into an internal representation of what the platform actually contains, then program the same study from the questionnaire (or from an agreed input) and compare generated vs golden on the same platform.

That choice matters for market researchers and data teams:

  • You are scored against what fielded, not what was drafted in Word.
  • Differences surface at export tag / question ID granularity—Q5, Screener_3, matrix row codes—not "the AI kind of got the vibe."
  • The methodology is repeatable: same inputs, same parser, same rules, same report schema. No LLM-as-judge roulette on whether skip logic "feels right."

We are explicit that golden exports are not perfect truth either. They can embed legacy programmer choices, platform quirks, or errors that shipped anyway. The scorecard can attach client notes (for example, known golden-vs-questionnaire caveats on a benchmark sample) so a 94% score is read in context—not as a marketing claim of infallibility.

Practitioner frame: Think of the golden export like a labeled dataset in ML evaluation. The label is "what the platform says the respondent experience is," not "what the questionnaire PDF literally says."

Five dimensions researchers actually care about

Every scorecard report breaks alignment into five scored categories. Each category uses match points over max points—partial credit when, say, six of seven matrix row labels align—so a single typo does not zero out an entire question block.

| Category | What it measures |
| --- | --- |
| Respondent content | Questions respondents see: wording, choice codes and labels, matrix rows/columns, validation |
| Hidden programming | Server-side variables, scripts, quotas, embedded data (programming that does not collect direct input) |
| Termination logic | Screen-out and complete rules: when respondents terminate and from which step |
| Flow logic (simple) | Default progression and straightforward skip paths |
| Flow logic (complex) | Multi-clause visibility, nested branching, cross-question conditions |

Illustrative scorecard breakdown (representative enterprise tracker; anonymized):

| Category | Match |
| --- | --- |
| Respondent content | 93% |
| Hidden programming | 88% |
| Termination logic | 90% |
| Flow logic (simple) | 95% |
| Flow logic (complex) | 82% |
| Overall | 90% |

Content scoring can go field-by-field per question: question text, row labels, randomization flags, validation rules, and platform-specific shape labels (e.g. matrix-likert vs constant-sum). When Q12 misses only rowLabels, you see that in the diff—not a vague "matrix question wrong."
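To make the match-over-max idea concrete, here is a minimal Python sketch of field-level scoring with partial credit. It is an illustration under stated assumptions, not the scorecard's actual implementation: the function names, the one-point-per-scalar-field and one-point-per-list-item weighting, and the case-insensitive label comparison are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FieldScore:
    field: str
    matched: int      # points earned
    max_points: int   # points available


def score_list_field(field, golden, generated):
    """Position-by-position partial credit for list fields such as row labels."""
    matched = sum(
        1 for g, a in zip(golden, generated)
        if g.strip().lower() == a.strip().lower()
    )
    # Assumption: extra generated items raise the maximum, so they cost points.
    return FieldScore(field, matched, max(len(golden), len(generated)))


def score_question(golden, generated):
    """Score every field present in the golden question definition."""
    scores = []
    for field, golden_value in golden.items():
        generated_value = generated.get(field)
        if isinstance(golden_value, list):
            scores.append(score_list_field(field, golden_value, generated_value or []))
        else:
            # Scalar fields (text, shape, flags) are all-or-nothing, one point each.
            scores.append(FieldScore(field, int(golden_value == generated_value), 1))
    return scores


golden_q12 = {
    "questionText": "How satisfied are you with each of the following?",
    "shape": "matrix-likert",
    "rowLabels": ["Price", "Quality", "Support", "Delivery",
                  "Selection", "Website", "Returns"],
}
# The generated question differs in a single row label.
generated_q12 = {**golden_q12, "rowLabels": ["Price", "Quality", "Support", "Delivery",
                                             "Selection", "Web site", "Returns"]}

scores = score_question(golden_q12, generated_q12)
matched = sum(s.matched for s in scores)
max_points = sum(s.max_points for s in scores)
misses = [s.field for s in scores if s.matched < s.max_points]
print(f"Q12: {matched}/{max_points} points, fields with gaps: {misses}")
# prints: Q12: 8/9 points, fields with gaps: ['rowLabels']
```

Six of seven row labels align, so the question keeps most of its points and the report can name rowLabels as the only field with a gap.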

Flow splits simple vs complex on purpose. A team can be excellent on default next-page progression and still struggle on nested boolean routing that references three prior multi-selects. Separating the buckets tells you where to send QA time.

Macro engine, platform adapters

The scorecard is intentionally split into two layers.

Macro layer (shared across platforms), @repo/scorecard-core:

  • Report schema and category definitions
  • Tagged-question scoring framework (missing / extra / partial / mismatch)
  • REM construction from Questra programs
  • REM normalization, equivalence environment, termination and routing compare
  • Canonical routing representation for skip vs visibility reconciliation

Platform layer — per-platform scorecard packages (Qualtrics, Decipher, ConfirmIt, Alchemer):

  • Parse golden export format into platform IR
  • Build survey content indexes (what is "respondent-facing" vs "hidden")
  • Field-level comparators per question kind (matrix, constant sum, etc.)
  • REM extraction from platform IR (how that vendor encodes blocks, loops, display logic, and terminate points)

Pipeline overview (Mermaid flowchart source):

flowchart LR
  subgraph inputs [Inputs]
    Q[Questionnaire]
    G[Golden platform export]
  end
  subgraph ai [AI programming]
    P[Questra program]
  end
  subgraph platform [Platform layer]
    GP[Parse golden IR]
    Gen[Serialize program to platform IR]
    CC[compareContent golden vs generated IR]
    RGP[buildRemFromPlatformIr golden]
  end
  subgraph macro [Macro layer]
    RGen[buildRemFromProgram generated]
    REM[Normalize REM graphs]
    EQ[Equivalence env]
    T[compareTerminations]
    F[compareRouting simple and complex]
    R[Scorecard report]
  end
  Q --> P
  P --> Gen
  G --> GP
  GP --> CC
  Gen --> CC
  GP --> RGP
  P --> RGen
  RGP --> REM
  RGen --> REM
  REM --> EQ
  EQ --> T
  EQ --> F
  CC --> R
  T --> R
  F --> R

Content comparison stays in platform IR because that is where vendor-specific semantics live: Qualtrics embedded data and display lists, Decipher cond and exec patterns, ConfirmIt routing tables. Trying to compare raw XML strings or QSF JSON blobs would measure formatting noise, not respondent impact.

Logic comparison lifts both sides into REM so the macro layer can ask one question: would these two surveys route the same respondent the same way?
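The split between the two layers can be sketched in Python as an adapter interface plus a shared orchestration function. Everything here is hypothetical: the PlatformAdapter protocol, the snake_case method names (loosely mirroring the labels in the diagram), and the toy adapter are illustrative stand-ins, not the real package API.

```python
from typing import Any, Callable, Protocol


class PlatformAdapter(Protocol):
    """Hypothetical per-platform interface; names are illustrative only."""

    def parse_golden(self, export: Any) -> Any: ...
    def serialize_program(self, program: Any) -> Any: ...
    def compare_content(self, golden_ir: Any, generated_ir: Any) -> dict: ...
    def build_rem_from_platform_ir(self, golden_ir: Any) -> dict: ...


def run_scorecard(
    adapter: PlatformAdapter,
    export: Any,
    program: Any,
    build_rem_from_program: Callable[[Any], dict],
    compare_logic: Callable[[dict, dict], dict],
) -> dict:
    # Platform layer: content is compared in platform IR.
    golden_ir = adapter.parse_golden(export)
    generated_ir = adapter.serialize_program(program)
    content = adapter.compare_content(golden_ir, generated_ir)
    # Macro layer: logic is compared after both sides are lifted into REM.
    golden_rem = adapter.build_rem_from_platform_ir(golden_ir)
    generated_rem = build_rem_from_program(program)
    return {"content": content, "logic": compare_logic(golden_rem, generated_rem)}


class ToyAdapter:
    """Toy adapter whose 'export' is already a dict of tag -> question text."""

    def parse_golden(self, export):
        return dict(export)

    def serialize_program(self, program):
        return {q["tag"]: q["text"] for q in program["questions"]}

    def compare_content(self, golden_ir, generated_ir):
        matched = sum(1 for tag, text in golden_ir.items()
                      if generated_ir.get(tag) == text)
        return {"matched": matched, "max": len(golden_ir)}

    def build_rem_from_platform_ir(self, golden_ir):
        return {"steps": set(golden_ir)}


def toy_rem_from_program(program):
    return {"steps": {q["tag"] for q in program["questions"]}}


def toy_compare_logic(golden_rem, generated_rem):
    shared = golden_rem["steps"] & generated_rem["steps"]
    return {"matched": len(shared), "max": len(golden_rem["steps"])}


report = run_scorecard(
    ToyAdapter(),
    {"Q1": "Age?", "Q2": "Region?"},
    {"questions": [{"tag": "Q1", "text": "Age?"},
                   {"tag": "Q2", "text": "Where do you live?"}]},
    toy_rem_from_program,
    toy_compare_logic,
)
print(report)
```

In this toy run the content comparison scores 1 of 2 (the Q2 wording differs) while the logic comparison scores 2 of 2 (both step codes exist), which is the kind of separation the two layers are meant to give.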

The Respondent Experience Model (REM)

REM is our normalized graph of the respondent journey:

  • Steps — input questions and present-only pages, keyed by stable codes (typically export tags)
  • Terminations — screen-out and complete rules with when conditions and from step
  • Routing — visibility and transition rules (from → to, with when), each classified as simple or complex
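One way to picture the three REM components is as plain data classes. This is a sketch under assumptions: the class names, field names, and the string values for kind, outcome, and complexity are hypothetical and only mirror the bullets above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Step:
    code: str   # stable key, typically the export tag
    kind: str   # "input" or "present_only" (hypothetical labels)


@dataclass(frozen=True)
class Termination:
    from_step: str
    when: str      # condition under which the respondent exits
    outcome: str   # "screen_out" or "complete" (hypothetical labels)


@dataclass(frozen=True)
class RoutingRule:
    from_step: str
    to_step: str
    when: Optional[str]   # None stands for default progression
    complexity: str       # "simple" or "complex"


@dataclass
class Rem:
    steps: List[Step] = field(default_factory=list)
    terminations: List[Termination] = field(default_factory=list)
    routing: List[RoutingRule] = field(default_factory=list)


# A three-step screener: respondents who answer Q1 == 3 exit at Screener_2.
rem = Rem(
    steps=[Step("Q1", "input"), Step("Screener_2", "input"), Step("Q5", "input")],
    terminations=[Termination("Screener_2", "Q1 == 3", "screen_out")],
    routing=[
        RoutingRule("Q1", "Screener_2", None, "simple"),
        RoutingRule("Screener_2", "Q5", None, "simple"),
    ],
)
print(len(rem.steps), len(rem.terminations), len(rem.routing))
# prints: 3 1 2
```

Frozen data classes compare by value, so two rules with the same from, to, and when are equal regardless of which side produced them.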

Golden REM is built from the parsed golden export (what the platform encodes). Generated REM is built from the Questra program produced by AI programming (the source of truth for what we intended). Both graphs pass through the same normalization pipeline before compare.

That asymmetry is deliberate. Platform exports often omit guards that exist in program logic, or express the same behavior as display logic vs skip logic vs JavaScript. REM comparison is about behavioral equivalence, not textual equality of vendor artifacts.

Normalization: making unlike platforms comparable

Before termination and routing rules are scored, we prepare both REM graphs:

  • Expression normalization — align dialect and structure of when clauses using a shared equivalence environment (choice value maps, step codes, piped fields from the underlying programs when available)
  • Co-located termination merge — collapse redundant terminate rules that differ only cosmetically at the same step
  • Platform-only noise removal — drop survey-end markers that do not change who sees what
  • Canonical routing — reconcile skip/goto vs visibility representations so the same branching intent does not count as a miss twice
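Two of these steps, expression normalization and co-located termination merge, can be sketched as follows. The sketch assumes a when clause is a flat AND of (question, operator, value) tuples and that the equivalence environment is a per-question map from raw choice codes to canonical labels. Both representations and all names are hypothetical simplifications.

```python
def normalize_clause(clause, choice_maps):
    """Rewrite (question, op, raw_code) so the value is a canonical choice label."""
    question, op, raw = clause
    canonical = choice_maps.get(question, {}).get(raw, raw)
    return (question, op, canonical)


def normalize_when(clauses, choice_maps):
    """Canonical form for an AND of clauses: mapped values, no duplicates, stable order."""
    return tuple(sorted({normalize_clause(c, choice_maps) for c in clauses}))


def merge_colocated_terminations(terminations):
    """Collapse terminate rules that share a step and a normalized condition."""
    seen, merged = set(), []
    for from_step, when in terminations:
        key = (from_step, when)
        if key not in seen:
            seen.add(key)
            merged.append(key)
    return merged


# Each side encodes the same answer choice with a different raw code.
golden_map = {"Q1": {"3": "under_18"}}
generated_map = {"Q1": {"r3": "under_18"}}

golden_when = normalize_when([("Q1", "==", "3")], golden_map)
generated_when = normalize_when([("Q1", "==", "r3")], generated_map)
print(golden_when == generated_when)  # prints: True

# Two terminate rules at Screener_1 that differ only by a repeated clause.
golden_terminations = [
    ("Screener_1", normalize_when([("Q1", "==", "3")], golden_map)),
    ("Screener_1", normalize_when([("Q1", "==", "3"), ("Q1", "==", "3")], golden_map)),
]
print(len(merge_colocated_terminations(golden_terminations)))  # prints: 1
```

Once both sides are in canonical form, equality of the normalized tuples is what counts as a match, not the text of the platform expression.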

We also split simple and complex routing at classification time—multi-clause and cross-question conditions land in flow_complex, not buried in a single "logic" bucket.
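A classifier for that split might look like the sketch below. It assumes when expressions are nested tuples with "and", "or", and "not" nodes over (question, operator, value) leaves, and it uses thresholds we chose for illustration: more than one clause, or more than one referenced question, counts as complex. Only the flow_complex label comes from the text above; flow_simple and everything else is hypothetical.

```python
BOOLEAN_OPS = {"and", "or", "not"}


def leaves(expr):
    """Yield the leaf comparisons of a nested boolean expression."""
    if expr is None:
        return
    if expr[0] in BOOLEAN_OPS:
        for child in expr[1:]:
            yield from leaves(child)
    else:
        yield expr


def classify_routing(when):
    """Return 'flow_simple' or 'flow_complex' for one routing rule."""
    clauses = list(leaves(when))
    questions = {question for question, _op, _value in clauses}
    if len(questions) > 1:
        return "flow_complex"   # cross-question condition
    if len(clauses) > 1:
        return "flow_complex"   # multi-clause condition on one question
    return "flow_simple"        # default progression or a single-clause skip


print(classify_routing(None))                                              # flow_simple
print(classify_routing(("Q1", "==", "3")))                                 # flow_simple
print(classify_routing(("and", ("Q1", "==", "3"), ("Q2", "in", "a"))))     # flow_complex
print(classify_routing(("or", ("Q1", "==", "1"), ("not", ("Q1", "==", "2")))))  # flow_complex
```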

Data nerd detail: Routing compare uses structural keys on from, to, and equivalent when expressions—not string diff of vendor scripts. Two surveys can disagree in ConfirmIt script style and still match if REM says the same person reaches Q20 under the same answers.
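Comparing on structural keys can be sketched with plain sets. This assumes each routing rule has already been normalized into a dict with from, to, and a hashable when tuple (empty for unconditional progression). The dict shape and the source field are hypothetical.

```python
def routing_key(rule):
    """Structural key: from step, to step, and the normalized when expression."""
    return (rule["from"], rule["to"], rule["when"])


def compare_routing(golden_rules, generated_rules):
    golden_keys = {routing_key(r) for r in golden_rules}
    generated_keys = {routing_key(r) for r in generated_rules}
    return {
        "matched": sorted(golden_keys & generated_keys),
        "missing": sorted(golden_keys - generated_keys),  # in golden only
        "extra": sorted(generated_keys - golden_keys),    # in generated only
    }


golden_rules = [
    # Encoded as skip logic on the golden side.
    {"from": "Q19", "to": "Q20", "when": (("Q3", "==", "yes"),), "source": "skip"},
    {"from": "Q20", "to": "Q21", "when": ()},
]
generated_rules = [
    # Encoded as display logic on the generated side; the key ignores "source".
    {"from": "Q19", "to": "Q20", "when": (("Q3", "==", "yes"),), "source": "display"},
    {"from": "Q20", "to": "Q22", "when": ()},
]

result = compare_routing(golden_rules, generated_rules)
print(len(result["matched"]), result["missing"], result["extra"])
# prints: 1 [('Q20', 'Q21', ())] [('Q20', 'Q22', ())]
```

The rule into Q20 matches even though the two sides record a different encoding, while the differing destination after Q20 shows up as one missing and one extra rule.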

Self-parity and calibration

Internally we run self-parity checks: golden export → platform IR → Questra program, with REM built from the platform IR compared against REM built directly from that program. That validates platform REM builders without conflating AI error with parser drift.

When self-parity on a platform is tight, client scorecards are trustworthy. When it is not, we fix the adapter before quoting benchmark numbers externally.

Why deterministic—not "ask Claude if it's good"

Survey programming evaluation sits in a sweet spot for deterministic checking, which we have written about elsewhere in "Three Ways to Put an AI on Trial." Either the export tag exists or it does not. Either a terminate fires from Screener_2 under Q1 == 3 or it does not. LLM-as-judge is useful for copy and insight summaries; it is a poor primary metric for structural survey equivalence.

Determinism gives you:

  • Auditability — every diff row has category, severity, golden vs actual, field
  • Regression tracking — rerun the same golden after a model or rules change
  • Procurement-grade conversation — "88% hidden programming, here are the 14 diffs" beats "the pilot felt fine"
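Auditability and regression tracking both fall out of treating diffs as structured rows. The sketch below is hypothetical: the DiffRow fields follow the bullet above plus an assumed tag field, and the category and severity strings are placeholders rather than the report schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiffRow:
    category: str   # e.g. "hidden_programming" (placeholder name)
    severity: str   # e.g. "error" or "warning" (placeholder values)
    tag: str        # export tag of the affected question or variable (assumed field)
    field: str
    golden: str
    actual: str


def diff_key(row):
    """Identity of a diff across runs: where it is, not what the values were."""
    return (row.category, row.tag, row.field)


def regression(before, after):
    """Compare two runs against the same golden export."""
    before_keys = {diff_key(r) for r in before}
    after_keys = {diff_key(r) for r in after}
    return {
        "fixed": sorted(before_keys - after_keys),
        "introduced": sorted(after_keys - before_keys),
    }


run_1 = [
    DiffRow("hidden_programming", "error", "PIPE_BRAND", "exists", "present", "missing"),
    DiffRow("termination", "error", "Screener_1", "when", "Q1 == 3", "Q1 == 2"),
]
# After a rules change, the termination is fixed but a label regresses.
run_2 = [
    DiffRow("hidden_programming", "error", "PIPE_BRAND", "exists", "present", "missing"),
    DiffRow("respondent_content", "warning", "Q12", "rowLabels", "Website", "Web site"),
]

changes = regression(run_1, run_2)
print(changes["fixed"])       # prints: [('termination', 'Screener_1', 'when')]
print(changes["introduced"])  # prints: [('respondent_content', 'Q12', 'rowLabels')]
```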

We still use LLMs to produce the program. We do not use them to grade alignment with a golden export.

What a diff actually tells your QA team

A scorecard is not a pass/fail gate for auto-fielding. It is a prioritized QA backlog.

Example diff shapes researchers see:

  • Q7: question shape differs (matrix-likert vs single-choice) — kind mismatch before field-level detail matters
  • Question PIPE_BRAND missing in generated survey — structural miss
  • Termination from Screener_1: when condition mismatch — logic divergence with serialized golden vs actual expressions in the report

Warnings call out compare limitations explicitly—e.g. when generated logic REM must be built from platform IR instead of the Questra program because an artifact was missing. We would rather surface uncertainty than silently score 100%.

You can run this pipeline on our public scorecard benchmark: upload a questionnaire and a golden export, we program from scratch, and you get a PDF report in minutes. No account required. It is the same engine we use internally on client samples.

One-shot efficacy is not the whole story

A high scorecard against a golden export answers an important question: did the AI reproduce a known-good programmed survey from the inputs we had?

It does not answer several other questions that determine whether AI programming is safe in production:

The questionnaire is often wrong. Typos, ambiguous routing notes, tables that imply loops the researcher did not spell out—golden exports sometimes fix the questionnaire. Scoring against golden measures implementation fidelity, not whether the study design itself is what the stakeholder intended.

Golden does not capture late revisions. Email threads, Slack messages, and call notes change quotas, stimuli, and terminate paths after the "final" questionnaire PDF. Unless those revisions flow back into the input the AI sees, no one-shot score reflects them.

Team preferences are not in the export. Your shop rotates anchors, uses specific font and spacing conventions, names embedded data a certain way, or always wraps MaxDiff in a particular template. Golden encodes a programmer's choices, not necessarily your playbook unless you teach the system that playbook.

Programming is iterative. The first pass is a draft; QA finds five issues; the client changes Q14 on a call. Efficacy is a time series, not a single number at generation time.

That is why we treat supportive tooling for researchers and programmers as equally important as generation:

  • Human review of logic paths, not just question text
  • Editors that expose structure (tags, routing, hidden variables) rather than hiding it in vendor UI chrome
  • Fast iteration when the client moves a terminate rule at 4pm
  • Org-level memory for standards that no single golden file contains

Closing frame: The scorecard is how we prove the AI can hit a rigorous structural bar. The product around it is how your team owns the result—fix what the questionnaire got wrong, apply your standards, and ship with confidence.

If you are benchmarking AI survey programming vendors, ask for golden-export comparison with category-level and question-level breakdowns—not a demo link and a handshake. And ask what happens after the first draft, when reality inevitably diverges from the PDF.


Try it: Upload a questionnaire and golden export on our free scorecard and see the five-category breakdown on your own study—or talk to us if you want to run a benchmark on a portfolio of trackers.

About the author

David Thor, Founder & CEO

Has spent 15 years building AI products and tools that make teams more productive — from Confirm.io (acq. by Facebook) to Architect.io. Holds two patents in AI-powered document authentication. Started Questra after watching his wife Emily, a market research consultant, deal with long wait times between survey drafts and revisions just to get studies into field.