Quantified Fake‑Door: Exact Sample Sizes & Decision Rules for No‑Code Prebuild Tests

Written by AppWispr editorial

Market Research · June 10, 2026 · 6 min read · 1,178 words

Fake‑door tests (no‑code prebuilds that measure demand before you build) are cheap and seductive — but most teams run them without clear rules. This post gives a compact, evergreen playbook: exact sample‑size formulas and calculators, practical acceptance thresholds, segmentation rules, and UTM / KPI templates you can copy so a signup or click becomes a defensible build (or a clear kill).

Tags: quantified-fake-door-sample-sizes, fake door testing, no-code prebuild tests, sample size calculator, product‑market fit tests

1) What to measure and the decision you’re making

A fake‑door test reduces a product decision to a binary signal: someone saw the CTA and either converted (clicked, signed up, paid intent) or didn’t. That makes the primary metric a proportion (conversion rate). Design your test so the observed event maps to the future value you care about — e.g., a paid intent click is a stronger signal than an email capture.

Before you run any numbers, write a one‑line decision rule: “If conversion ≥ X% with N users and sustained across our two priority segments, we build MVP A; otherwise we kill it.” That rule forces you to quantify business tradeoffs (cost to build, expected LTV, acceptable risk) and select a meaningful Minimum Detectable Effect (MDE).

  • Pick a single binary primary outcome (click, email, payment intent).
  • Translate conversion into business value (LTV × expected conversion funnel) before picking thresholds.
  • Decide whether you need one‑sided (only care about uplift) or two‑sided testing (care about any difference).
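One way to turn "translate conversion into business value" into a number is a break‑even conversion rate. The sketch below is illustrative only: the function name, the idea of a fixed reachable audience, and the intent‑to‑paid rate are assumptions introduced here, not figures from this post.

```python
def break_even_conversion(build_cost, audience, intent_to_paid, ltv):
    """Smallest fake-door conversion rate at which expected revenue covers build cost.

    build_cost     -- cost to build and operate the MVP (hypothetical input)
    audience       -- people you expect to reach with the real feature
    intent_to_paid -- assumed share of fake-door converters who later pay
    ltv            -- expected lifetime value per paying customer
    """
    if min(build_cost, audience, intent_to_paid, ltv) <= 0:
        raise ValueError("all inputs must be positive")
    value_per_conversion = intent_to_paid * ltv
    return build_cost / (audience * value_per_conversion)


# Hypothetical inputs: $20k build, 50k reachable users, 10% intent-to-paid, $200 LTV.
threshold = break_even_conversion(20_000, 50_000, 0.10, 200)
print(f"Break-even conversion: {threshold:.1%}")
```

With these made‑up inputs the break‑even rate is 2%, so an X in the decision rule below 2% could not justify the build on its own numbers.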

2) Exact sample‑size formula and a compact calculator you can use

For a binary outcome, compare the observed conversion rate to your baseline (or compare two buckets) using the standard two‑proportion z‑test sample‑size formula. With baseline rate p1, target rate p2 = p1 + MDE, and pooled rate p̄ = (p1 + p2)/2, the per‑variant sample size n is approximately: n = (Z_{1−α/2}·√(2p̄(1−p̄)) + Z_{1−β}·√(p1(1−p1) + p2(1−p2)))² / (p1 − p2)². In practice most teams specify the baseline p1 (current conversion), the desired MDE (absolute or relative), α (commonly 0.05) and power 1−β (commonly 0.8). For a one‑sided test, replace Z_{1−α/2} with Z_{1−α}.

If that formula looks heavy, use any reputable online calculator that implements the two‑proportion test (Evan Miller‑style calculators, Statsig, or other sample‑size tools). Plug in your baseline and the smallest uplift you’d act on (your MDE); the result is the minimum per‑variant sample. If your traffic can’t reach that sample in a reasonable time, either increase the MDE you’ll accept or reframe the test (e.g., run a more targeted paid campaign to accelerate traffic).

  • Inputs you must set: baseline conversion p, MDE (absolute or relative), α (type I error), power (1−β).
  • If you compare treatment vs control directly, use two‑sample proportion formulas; for single‑arm judgment against a target, use one‑sample proportion tests.
  • If volume is low, increase MDE or run targeted campaigns to reach required N.
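The two‑proportion formula in this section can be sketched as a small calculator using only the Python standard library. The function name and argument names are choices made here for illustration; results from online calculators may differ by a few users because of rounding or continuity corrections.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.8, two_sided=True):
    """Approximate per-variant n for a two-proportion z-test.

    p1 -- baseline conversion rate
    p2 -- conversion rate you want to be able to detect (p1 + absolute MDE)
    """
    if not (0 < p1 < 1 and 0 < p2 < 1) or p1 == p2:
        raise ValueError("p1 and p2 must be distinct probabilities in (0, 1)")
    z = NormalDist().inv_cdf                      # standard normal quantile
    z_alpha = z(1 - alpha / 2) if two_sided else z(1 - alpha)
    z_beta = z(power)
    p_bar = (p1 + p2) / 2                         # pooled rate under H0
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)


# Hypothetical test: 5% baseline, act on a 40% relative uplift (7%).
print(sample_size_two_proportions(0.05, 0.07))
print(sample_size_two_proportions(0.05, 0.07, two_sided=False))
```

For a 5% baseline and a 7% target this comes out at roughly 2,200 visitors per variant for the two‑sided test, and fewer for the one‑sided version. Halving the MDE roughly quadruples the required sample, which is why low‑traffic tests usually need a larger MDE.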

3) Practical acceptance thresholds & decision rules (statistical + operational)

Statistical acceptance alone (p < 0.05) is not enough. Use a two‑part rule: (A) a statistical test that meets your α/power requirements, and (B) a business threshold that maps to value. Example rule: “We require at least 80% power to detect our MDE, and observed conversion ≥ target conversion (baseline × multiplier) across both priority segments — otherwise fail.” That prevents small but statistically significant lifts with no business impact from triggering builds.

Add operational constraints: require the effect to persist over a pre‑specified observation window (e.g., at least one full acquisition cycle or 7–14 days) and check primary segments separately (top acquisition channel and target persona). If the signal is statistically significant overall but driven by one fringe segment, don’t build for the entire market.

  • Combine statistical and business thresholds — both must pass.
  • Require persistence: effect holds across a pre‑specified time window (e.g., 7–14 days) to avoid early stopping bias.
  • Segment checks: require the effect in at least two priority segments (channel, persona, geography).
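The combined rule (statistical test, business threshold, persistence, segment checks) could be encoded roughly as follows. This is a sketch under assumptions: the `Arm` class, the `verdict` function, the pooled one‑sided z‑test, and all the counts in the example are hypothetical, and a real pipeline would pre‑register these thresholds before launch.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import NormalDist


@dataclass
class Arm:
    visitors: int
    conversions: int

    @property
    def rate(self) -> float:
        return self.conversions / self.visitors


def one_sided_p_value(control: Arm, treatment: Arm) -> float:
    """Pooled two-proportion z-test; H1 is 'treatment converts better'."""
    total = control.visitors + treatment.visitors
    pooled = (control.conversions + treatment.conversions) / total
    se = sqrt(pooled * (1 - pooled) * (1 / control.visitors + 1 / treatment.visitors))
    if se == 0:  # no conversions (or only conversions) in both arms
        return 1.0
    z = (treatment.rate - control.rate) / se
    return 1 - NormalDist().cdf(z)


def verdict(overall, segments, target_rate, required_n, days_run,
            min_days=7, alpha=0.05):
    """Return ('build' | 'kill', checks). Every check must pass to build."""
    control, treatment = overall
    checks = {
        "sample_reached": min(control.visitors, treatment.visitors) >= required_n,
        "persisted": days_run >= min_days,
        "significant": one_sided_p_value(control, treatment) < alpha,
        "business_threshold": treatment.rate >= target_rate,
        "segments_pass": all(t.rate >= target_rate for _c, t in segments.values()),
    }
    return ("build" if all(checks.values()) else "kill"), checks


# Hypothetical results: (control, treatment) overall and per priority segment.
overall = (Arm(2300, 115), Arm(2300, 170))
segments = {
    "paid": (Arm(1200, 60), Arm(1200, 90)),
    "organic": (Arm(1100, 55), Arm(1100, 80)),
}
decision, checks = verdict(overall, segments, target_rate=0.07,
                           required_n=2213, days_run=10)
print(decision, checks)
```

Returning the individual checks alongside the verdict makes it easy to see which gate failed when the answer is "kill".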

4) Segmentation rules and how to avoid false positives

Segment early and pre‑register. Define 2–4 priority segments before the test (e.g., organic vs paid, power users vs new users, country A vs B). Treat segment checks like mini decision gates: if overall significance passes but fails for both priority segments, downgrade the signal and require a follow‑up test targeted to the winning segment.

Adjust for multiple comparisons: every extra segment or metric you check inflates false positive risk. Use Bonferroni or Benjamini‑Hochberg corrections if you perform many independent tests, or focus on a small, pre‑specified set of segments to keep statistical interpretation simple.

  • Predefine 2–4 priority segments and keep exploratory segmentation separate.
  • Use corrections (Bonferroni or FDR) when you test multiple independent hypotheses.
  • If only one non‑priority segment shows uplift, run a targeted follow‑up before committing to a full build.
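To make the two corrections concrete, here is a minimal sketch of Bonferroni and Benjamini‑Hochberg applied to a list of per‑segment p‑values. The function names and the example p‑values are made up for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 for a segment only if p <= alpha / (number of tests)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Control the false discovery rate at level alpha.

    Sort p-values, find the largest rank k with p_(k) <= (k / m) * alpha,
    and reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        rejected[idx] = rank <= cutoff
    return rejected


# Hypothetical p-values for four pre-registered segments.
segment_p = [0.01, 0.02, 0.03, 0.20]
print(bonferroni(segment_p))          # stricter: family-wise error control
print(benjamini_hochberg(segment_p))  # looser: false discovery rate control
```

On these example values Bonferroni keeps only the first segment, while Benjamini‑Hochberg keeps the first three. Which one fits depends on how costly a false "build" is for you.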

5) UTM & KPI templates you can copy for every fake‑door test

Instrument each fake‑door using a strict UTM scheme and a single canonical KPI so results are unambiguous. Example UTM pattern (pick one medium per link): utm_source=fakedoor&utm_medium=banner|email|paid&utm_campaign=feature-name_v1&utm_term=segment. Capture the canonical KPI (e.g., click‑to‑intent rate or paid‑intent actions) and a secondary KPI for quality (e.g., email open, trial activation).
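Generating the links from one helper keeps the scheme consistent across channels. A minimal sketch using Python's `urllib.parse`; the helper name, the example domain, and the feature and segment values are placeholders.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def fake_door_url(base_url, channel, feature, version, segment):
    """Append the canonical fake-door UTM parameters to a landing URL."""
    utm = {
        "utm_source": "fakedoor",
        "utm_medium": channel,                    # e.g. banner, email, paid
        "utm_campaign": f"{feature}_v{version}",  # versioned so tests never mix
        "utm_term": segment,
    }
    parts = urlsplit(base_url)
    # Keep any query parameters already on the URL, then add the UTM set.
    query = parse_qsl(parts.query) + list(utm.items())
    return urlunsplit(parts._replace(query=urlencode(query)))


url = fake_door_url("https://example.com/pricing", "email",
                    "team-inbox", 1, "paid_us")
print(url)
```

Bumping `version` whenever copy, pricing, or targeting changes gives each test run its own campaign name in analytics.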

Report a one‑page results snapshot for each test showing: sample sizes by variant and segment, raw conversions, conversion rates with 95% CIs, p‑value (or posterior probability if using Bayesian), business value mapping (expected customers × LTV), and the final green/amber/red verdict per the pre‑registered decision rule. Store templates and past results in a lightweight tracker (AppWispr users can keep these templates in their analysis folder).

  • Canonical UTM: utm_source=fakedoor&utm_medium={channel}&utm_campaign={feature}_v{n}&utm_term={segment}
  • KPI snapshot: N per variant, conversions, conversion rate, 95% CI, p‑value, business value estimate, decision verdict.
  • Keep an internal changelog (versioned campaign names) so you never conflate tests.
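The snapshot fields can be computed from raw counts. The sketch below uses the Wilson score interval for the 95% CI, which is one common choice for proportions and is an assumption here rather than something this post prescribes. The row layout and example counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(conversions, visitors, confidence=0.95):
    """Wilson score confidence interval for a conversion rate."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = conversions / visitors
    denom = 1 + z ** 2 / visitors
    centre = (p_hat + z ** 2 / (2 * visitors)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / visitors
                    + z ** 2 / (4 * visitors ** 2)) / denom
    return centre - half, centre + half


def snapshot_row(variant, segment, visitors, conversions):
    """One line of the results snapshot for a variant/segment pair."""
    low, high = wilson_interval(conversions, visitors)
    return {
        "variant": variant,
        "segment": segment,
        "n": visitors,
        "conversions": conversions,
        "rate": conversions / visitors,
        "ci_95": (low, high),
    }


row = snapshot_row("treatment", "all", 2300, 170)
print(f"{row['rate']:.2%} (95% CI {row['ci_95'][0]:.2%} to {row['ci_95'][1]:.2%})")
```

The p‑value, business value estimate, and verdict would be added to the same row from whatever test and value model you pre‑registered.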

FAQ

Common follow-up questions

How do I pick a reasonable Minimum Detectable Effect (MDE)?

Pick the smallest uplift that meaningfully changes your build decision. Translate uplift into expected additional paying customers (or LTV) and compare that to your build cost. If the incremental revenue from the MDE over reasonable timeframes exceeds the cost to build and operate, it’s a reasonable MDE. Practically, founders often start with 20–50% relative uplift targets for low‑traffic fake‑doors and smaller MDEs only when traffic supports large samples.

What if my traffic is too low to reach the required sample size?

Options: (1) increase the MDE you’ll act on (accept larger effect sizes), (2) run a focused paid acquisition campaign to accelerate traffic to the fake‑door, (3) convert the test to a qualitative prelaunch (interviews or moderated usability), or (4) run a single‑arm test judged against a business target rather than a two‑arm statistical test.
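For option (4), a single‑arm test needs fewer users than a two‑arm comparison. A rough sketch of the standard one‑sample proportion sample‑size approximation follows; the function name and the 5% target and 7% hoped‑for rate are illustrative assumptions.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_one_proportion(p0, p1, alpha=0.05, power=0.8):
    """Approximate n for a one-sided, one-sample test of H0: rate = p0.

    p0 -- business target you are judging against
    p1 -- true rate you want a good chance of detecting (p1 > p0)
    """
    if not (0 < p0 < p1 < 1):
        raise ValueError("need 0 < p0 < p1 < 1")
    z = NormalDist().inv_cdf
    numerator = (z(1 - alpha) * sqrt(p0 * (1 - p0))
                 + z(power) * sqrt(p1 * (1 - p1))) ** 2
    return ceil(numerator / (p1 - p0) ** 2)


# Hypothetical: is the true rate above a 5% target if it is really 7%?
print(sample_size_one_proportion(0.05, 0.07))
```

With these inputs the requirement is on the order of 800 visitors in total. The trade‑off is that a single‑arm test has no concurrent control, so seasonality or a change in traffic mix can masquerade as demand.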

Should I use frequentist p‑values or a Bayesian decision rule?

Both can work. Frequentist tests are widely understood and simple to pre‑register (α, power, two‑sample z test). Bayesian rules let you express decisions as posterior probabilities (e.g., P(uplift > MDE) > 0.9). Pick one framework and pre‑register thresholds so the decision isn’t shifted after peeking at results.
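A Bayesian rule of the form P(uplift > MDE) > 0.9 can be approximated by sampling. The sketch below assumes independent uniform Beta(1, 1) priors on each arm and an absolute MDE; both are modelling choices made here for illustration, and the counts are hypothetical.

```python
import random


def prob_uplift_exceeds(control, treatment, mde, draws=20_000, seed=0):
    """Monte Carlo estimate of P(rate_treatment - rate_control > mde).

    control, treatment -- (conversions, visitors) tuples
    mde                -- absolute uplift threshold, e.g. 0.02 for 2 points
    """
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    (c_conv, c_n), (t_conv, t_n) = control, treatment
    hits = 0
    for _ in range(draws):
        # Posterior under a Beta(1, 1) prior is Beta(1 + successes, 1 + failures).
        p_c = rng.betavariate(1 + c_conv, 1 + c_n - c_conv)
        p_t = rng.betavariate(1 + t_conv, 1 + t_n - t_conv)
        if p_t - p_c > mde:
            hits += 1
    return hits / draws


p_any = prob_uplift_exceeds((115, 2300), (170, 2300), mde=0.0)
p_mde = prob_uplift_exceeds((115, 2300), (170, 2300), mde=0.02)
print(f"P(any uplift) = {p_any:.3f}, P(uplift > 2 points) = {p_mde:.3f}")
```

In this example the data are near‑certain evidence of some uplift, but much weaker evidence that the uplift clears a 2‑point MDE. That gap is why the threshold and the MDE both need to be fixed before looking at results.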

How long should I run a fake‑door test?

Run until you have reached the required sample size and covered at least one acquisition cycle for your product (commonly 7–14 days). Don’t stop early when the effect temporarily looks good. Computing the required N in advance, estimating duration from expected traffic, and committing to both is what protects you from early‑stopping bias.
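Estimating duration up front is simple arithmetic. A small sketch, assuming traffic is split evenly across variants and stays roughly constant; the function name and traffic figures are hypothetical.

```python
from math import ceil


def estimated_days(required_n_per_variant, variants, daily_visitors, min_days=7):
    """Days to run: enough traffic for every variant, and at least min_days."""
    if daily_visitors <= 0:
        raise ValueError("daily_visitors must be positive")
    days_for_sample = ceil(required_n_per_variant * variants / daily_visitors)
    return max(days_for_sample, min_days)


# Hypothetical: 2,213 users per variant, two variants.
print(estimated_days(2213, 2, daily_visitors=300))   # traffic-bound
print(estimated_days(2213, 2, daily_visitors=2000))  # bound by the 7-day floor
```

At 300 visitors a day the sample size sets the duration (15 days). At 2,000 a day the sample is reached in about three days, so the 7‑day minimum takes over.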
