Marketing Engineering

Incrementality Testing at Scale: A Geo-Lift Framework for Measuring True Campaign Impact

Half your marketing budget is wasted. It's the classic joke, except now we can identify which half: geo-lift experiments measure what would have happened without the campaign, not just what happened with it.


TL;DR: Platform-reported conversion lift overstates true incrementality by 30-70% because ad targeting selects users who were already likely to convert. Geo-lift experiments -- holding out entire geographic markets from campaign exposure -- measure the true counterfactual: what would have happened without the campaign. A 22% reported retargeting lift typically corresponds to just 4% true incremental lift.


The Most Expensive Misunderstanding in Marketing

John Wanamaker's line about half his advertising being wasted dates to the early 1900s. Over a century later, most marketing teams still cannot answer a question that is far more specific and far more important: of the conversions your campaign claims credit for, how many would have happened anyway?

This is not a philosophical question. It is a financial one. And the difference between the answer marketing dashboards give and the answer reality gives is often 30-70% of reported campaign value.

A paid search campaign reports 10,000 conversions. The CPA looks excellent. The ROAS exceeds target. Everyone is satisfied. But embedded in those 10,000 conversions are people who searched your brand name because they saw a billboard last week. People who were already navigating to your site and clicked an ad because it appeared above the organic link. People who would have purchased regardless of whether you spent a single dollar on that keyword.

These are not conversions your campaign caused. They are conversions your campaign claimed. The gap between those two statements is where marketing budgets go to die.

Incrementality testing exists to close that gap. And the most rigorous, scalable method for doing so is the geo-lift experiment.

Conversion Lift Is Not Incremental Lift

Before we get to methodology, we need to dismantle a confusion that persists even among experienced marketers.

Conversion lift measures the difference in conversion rate between people who saw your ad and people who did not. Most platform-reported lift studies use this framework. Facebook's conversion lift, Google's brand lift -- they compare exposed users to a holdout group and report the difference.

Incremental lift measures the difference in business outcomes between a world where your campaign ran and a world where it did not. This is a fundamentally different question, because it accounts for the fact that ad platforms do not randomly assign exposure. They target people who are already likely to convert.

Here is a concrete illustration. Suppose you run a retargeting campaign aimed at users who visited your product page but did not purchase. The platform reports a 22% conversion lift -- users who saw the retargeting ad converted at 22% higher rates than those who did not.

But retargeting, by definition, targets people who already demonstrated purchase intent. Many of them were coming back to buy regardless. The retargeting ad may have merely intercepted a journey that was already in motion. The true incremental lift -- conversions that would not have occurred without the ad -- might be 4%.

This is not a hypothetical. eBay's landmark geo-experiment with the economists Blake, Nosko, and Tadelis (published in 2015) found that the incremental return on its branded search advertising was essentially zero. The ads were capturing demand, not creating it. The same family of quasi-experimental methods now enables causal measurement of SEO's impact on branded search. The company was spending millions annually to place ads in front of people who were already typing "eBay" into a search bar.

Reported Conversion Lift vs. True Incremental Lift by Channel

The pattern is consistent. Channels that target high-intent users show the largest gap between reported and incremental lift. Channels that reach genuinely new audiences show a smaller gap. This is not because retargeting is worthless -- it is because its value is dramatically overstated by standard measurement.

The Free Rider Problem: Attribution's Dirty Secret

Economists have a term for entities that benefit from a resource without contributing to it: free riders. In marketing attribution, campaigns are the free riders -- they claim credit for conversions they did not cause.

Every attribution model -- last-click, first-click, linear, time-decay, data-driven -- distributes credit among touchpoints. But none of them answers the counterfactual question: what would have happened if a specific touchpoint had not existed? The fundamental causal inference limitations of multi-touch attribution models make this gap structural, not fixable with better algorithms.

Consider a customer journey: organic search, then a display ad, then a branded search click, then conversion. In a last-click model, branded search gets full credit. In a linear model, all three touchpoints share credit equally. In both cases, the display ad receives attribution.

But what if the user would have converted after the organic search alone? Then the display ad and the branded search ad are both free riders -- claiming credit for an outcome they did not influence. And the marketing team, looking at their attribution dashboard, sees a display CPA that justifies continued spend.

This is how budgets become misallocated at scale. Not through incompetence, but through measurement systems that are structurally incapable of answering the right question.

The free rider problem is worst in three scenarios:

Brand campaigns claiming organic demand. When brand awareness is high, much of the search and direct traffic exists independently of paid campaigns. Running paid brand search alongside strong organic presence creates maximum free-ridership.

Retargeting claiming purchase intent. Users in retargeting pools are already in the consideration phase. Retargeting accelerates some conversions and captures others that were inevitable. Distinguishing between acceleration and capture requires incrementality measurement.

Last-touch channels claiming multi-touch journeys. The final touchpoint before conversion is often the lowest-value touchpoint -- the user was already going to convert, and the last ad was just in the way. Yet it receives disproportionate credit in most attribution systems.

Geo-lift experiments cut through all of this. They do not ask "which touchpoint gets credit?" They ask: "in regions where this campaign ran versus regions where it did not, what was the difference in total conversions?" This question has no free rider problem. Either the campaign caused additional conversions or it did not. Geography does not lie.

Geo-Lift Methodology: The Geography of Causation

The logic of a geo-lift experiment is straightforward. Select a set of geographic regions. Run your campaign in some of them (treatment group). Withhold the campaign from others (control group). Measure the difference in your business outcome.

If the treatment regions show higher conversion rates than the control regions, and the groups were properly matched before the experiment, the difference is your campaign's incremental impact. No attribution model needed. No platform data required. Just sales in versus sales out.

This is geographic randomization, and it is the closest thing to a gold standard that marketing measurement has.

The method has roots in medical trial design. In a randomized controlled trial, researchers assign treatment randomly to eliminate confounders. In a geo-lift experiment, the "treatment" is advertising, the "patients" are geographic markets, and the "outcome" is a business metric -- revenue, conversions, store visits, app installs.

The practical steps are:

1. Define the outcome metric. This must be measurable at the geographic level and sensitive enough to detect a campaign's effect. Total revenue by DMA (Designated Market Area) is common for national advertisers. Store-level sales work for retail. App installs by metro area work for mobile.

2. Select geographic units. DMAs are standard in the United States (210 total). In Europe, regions vary by country. The units must be large enough to have stable baseline metrics and small enough that you can afford to withhold campaign spend from several of them.

3. Match treatment and control groups. This is the hardest step and the one most likely to determine whether your results are credible. We will address it in detail.

4. Run the experiment for a sufficient duration. Most geo-lift tests require 4-8 weeks of in-market time to accumulate enough data for statistical significance, plus a longer pre-period (typically 8-16 weeks) of historical data for calibration.

5. Analyze the results. Compare actual outcomes in treatment regions to what would have been expected based on the control regions. The difference is your incremental lift.

Geo-Lift Experiment Design Parameters

| Parameter | Typical Range | Consideration |
| --- | --- | --- |
| Geographic Unit | DMA, State, Metro, Zip Code | Larger units have more stable baselines; smaller units allow more statistical power from more units |
| Number of Treatment Regions | 60-80% of total | More treatment regions increase statistical power but reduce available control regions |
| Number of Control Regions | 20-40% of total | Must be enough to construct a reliable synthetic control |
| Pre-Period Length | 8-16 weeks | Longer pre-periods improve synthetic control calibration |
| Test Period Length | 4-8 weeks | Must be long enough to detect the expected effect size |
| Cooldown Period | 2-4 weeks | Accounts for lagged effects after campaign ends |

Synthetic Control: Building a Counterfactual World

The fundamental challenge of any causal inference is the counterfactual. You want to know what would have happened in the treatment regions if the campaign had not run. But you can only observe one reality -- the one where the campaign did run.

The synthetic control method, formalized by Alberto Abadie, Alexis Diamond, and Jens Hainmueller in their 2010 paper, offers an elegant solution. Instead of comparing treatment regions to any single control region, you construct a "synthetic" version of each treatment region using a weighted combination of control regions.

Here is the intuition. Suppose you are testing a campaign in the Dallas DMA. No single other DMA is a perfect match for Dallas. Houston is similar in some ways but different in others. Atlanta shares some characteristics but not all. However, a weighted combination of Houston (40%), Atlanta (25%), Phoenix (20%), and Charlotte (15%) might track Dallas's pre-campaign sales almost perfectly.

If this synthetic Dallas matches real Dallas closely during the pre-period, then any divergence during the campaign period is attributable to the campaign.

Formally, the synthetic control estimator constructs a counterfactual for the treated unit $j = 1$ using a weighted combination of $J$ control units:

$$\hat{Y}_{1t}^{(0)} = \sum_{j=2}^{J+1} w_j \, Y_{jt}$$

where the weights $w_j \geq 0$ satisfy $\sum_{j} w_j = 1$ and are chosen to minimize the pre-period discrepancy:

$$\min_{w} \sum_{t=1}^{T_0} \left( Y_{1t} - \sum_{j=2}^{J+1} w_j \, Y_{jt} \right)^2$$

The estimated average treatment effect on the treated (ATT) is then:

$$\hat{\tau}_t = Y_{1t} - \hat{Y}_{1t}^{(0)}, \quad t > T_0$$

The method works by minimizing the difference between the treatment region's pre-period outcomes and the weighted combination of control regions' pre-period outcomes. The weights are chosen algorithmically, not subjectively. And the quality of the synthetic control is measurable -- you can see how well it tracked the treatment region before the campaign started.
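
To make the weight-fitting step concrete, here is a minimal sketch in Python using scipy's constrained optimizer. The function name and data shapes are illustrative; production tools such as GeoLift add regularization and covariate matching on top of this core idea.

import numpy as np
from scipy.optimize import minimize

def fit_synthetic_control(y_treated_pre, Y_controls_pre):
    """Fit non-negative weights that sum to 1 and best reproduce the treated
    region's pre-period series as a weighted combination of control regions."""
    n_controls = Y_controls_pre.shape[1]

    def pre_period_loss(w):
        # squared discrepancy between the treated series and the weighted controls
        return np.sum((y_treated_pre - Y_controls_pre @ w) ** 2)

    w0 = np.full(n_controls, 1.0 / n_controls)                     # start from equal weights
    bounds = [(0.0, 1.0)] * n_controls                             # w_j >= 0
    constraints = {"type": "eq", "fun": lambda w: w.sum() - 1.0}   # weights sum to 1
    result = minimize(pre_period_loss, w0, bounds=bounds, constraints=constraints)
    return result.x

# The counterfactual for the test period is the same weighted combination:
# weights = fit_synthetic_control(dallas_pre, donors_pre)
# synthetic_dallas = donors_test @ weights
# estimated_lift = actual_dallas_test - synthetic_dallas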

Abadie first applied the method, with Javier Gardeazabal, to estimate the economic impact of terrorism in the Basque Country, using other Spanish regions as controls; the 2010 paper's flagship application was California's tobacco control program. The marketing application follows the same logic. The "intervention" is an advertising campaign instead of a political or policy event, but the statistical framework is identical.

The key assumptions of the synthetic control method are:

No interference between units. Advertising in treatment regions should not affect outcomes in control regions. This holds when geographic units are large enough that media spillover is minimal.

Stable relationships. The relationship between treatment and control regions during the pre-period should remain stable during the test period, absent the intervention. This fails when external shocks hit treatment and control regions asymmetrically.

Convex hull condition. The treatment region's outcomes should lie within the range of control region outcomes. If the treatment region is an extreme outlier, the synthetic control cannot be reliably constructed.

The Modern Toolkit: Meta's GeoLift and Google's Matched Markets

The synthetic control method went from academic methodology to marketing tool when Meta released GeoLift as an open-source R package in 2022, and Google developed its Matched Markets framework (also available in R and Python).

Meta's GeoLift automates the full workflow: market selection, power analysis, test execution, and causal inference. It uses an augmented synthetic control method with ridge regression to improve performance when the number of control regions is small. The package also includes tools for determining the minimum budget required to detect an effect and for selecting which markets to use as treatment versus control.

GeoLift's power analysis is particularly useful. Given historical data, a proposed budget, and a target effect size, it will tell you whether your test has sufficient statistical power to detect the effect -- before you spend a dollar. This is where most geo experiments fail or succeed, and having an automated tool for this calculation eliminates a significant source of human error.
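
GeoLift itself is an R package, but the idea behind simulation-based power analysis can be sketched in a few lines of Python. The toy below injects a known lift into simulated market-level data and counts how often a simple two-sample t-test detects it; GeoLift's actual procedure uses placebo tests against the synthetic control, so treat this as intuition, not a reimplementation.

import numpy as np
from scipy import stats

def simulated_power(baseline_weekly, noise_sd, lift, weeks=6,
                    n_treat=20, n_control=10, n_sims=2000, alpha=0.05, seed=0):
    """Monte Carlo power estimate: fraction of simulated geo tests in which
    a two-sample t-test on market-level means detects an injected lift."""
    rng = np.random.default_rng(seed)
    detected = 0
    for _ in range(n_sims):
        control = rng.normal(baseline_weekly, noise_sd,
                             size=(n_control, weeks)).mean(axis=1)
        treated = rng.normal(baseline_weekly * (1 + lift), noise_sd,
                             size=(n_treat, weeks)).mean(axis=1)
        _, p_value = stats.ttest_ind(treated, control, equal_var=False)
        detected += p_value < alpha
    return detected / n_sims

# e.g. power to detect a 5% lift given noisy weekly conversions per market:
# simulated_power(baseline_weekly=10_000, noise_sd=1_500, lift=0.05)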

Google's Matched Markets (sometimes referred to as CausalImpact at the market level) takes a Bayesian structural time-series approach. Rather than constructing a synthetic control from weighted donor regions, it fits a state-space model to the pre-period data and forecasts what would have happened in the treatment regions absent the campaign. The difference between the forecast and actual outcomes is the estimated causal effect, with full posterior distributions providing uncertainty estimates.

The Bayesian approach has a distinct advantage: it naturally produces credible intervals rather than just point estimates and p-values. Instead of "the campaign generated 5,000 incremental conversions (p = 0.03)," you get "the campaign generated between 3,200 and 6,800 incremental conversions with 95% probability." This is the same philosophical advantage that makes Bayesian A/B testing superior for product decisions. For decision-making, this richer output is far more useful.

Comparison of Geo-Lift Frameworks: Key Capabilities

Here is a Python sketch of that workflow, using one of the Python ports of CausalImpact (the original library is Google's R package):

from causalimpact import CausalImpact
import pandas as pd
 
# Prepare time series: the first column must be the response (treatment region);
# the remaining columns are control series used to build the counterfactual
data = pd.DataFrame({
    'treatment_dma': treatment_revenue,        # daily revenue, treatment region
    'control_dma_1': control_1_revenue,         # synthetic control donor 1
    'control_dma_2': control_2_revenue,         # synthetic control donor 2
    'control_dma_3': control_3_revenue,         # synthetic control donor 3
}, index=date_index)
 
# Define pre-period (calibration) and post-period (campaign active)
pre_period = ['2025-01-01', '2025-03-31']
post_period = ['2025-04-01', '2025-05-31']
 
# Run Bayesian structural time-series model
ci = CausalImpact(data, pre_period, post_period)
 
# Print summary with incremental lift and credible intervals
print(ci.summary())
print(ci.summary(output='report'))
 
# Extract the posterior estimate of the cumulative causal effect over the test period
print(f"Incremental revenue: {ci.summary_data['cumulative']['abs_effect']:.0f}")
print(f"95% CI: [{ci.summary_data['cumulative']['abs_effect_lower']:.0f}, "
      f"{ci.summary_data['cumulative']['abs_effect_upper']:.0f}]")

Neither framework is uniformly superior. GeoLift excels at pre-test planning and works well with the kind of DMA-level data most US advertisers have. Matched Markets produces richer uncertainty estimates and handles temporal dynamics more gracefully. In practice, running both on the same experiment and comparing results provides a robustness check that strengthens confidence in the findings.

Test Design: Selecting Treatment and Control Markets

Market selection is the decision that determines whether a geo-lift experiment produces credible results or expensive noise. Get this wrong and no amount of statistical sophistication will save you.

The goals are in tension. You want treatment markets that represent your overall marketing footprint -- large, diverse, strategically important. But you also want control markets that closely match them in pre-campaign behavior, which is easier when markets are similar in size and characteristics.

Here are the principles that should govern selection:

Maximize pre-period fit. The synthetic control must track the treatment region closely during the pre-period. If it cannot, the experiment is unreliable before it starts. Select control markets that, in combination, can reproduce the treatment markets' historical trajectory. GeoLift's market selection tool automates this by searching over possible treatment/control splits and scoring them by pre-period fit.

Minimize media spillover. If your control market receives media intended for a treatment market -- through cross-DMA TV coverage, digital ad leakage, or commuting patterns -- your experiment is contaminated. Adjacent DMAs are poor control candidates for TV campaigns. Digital campaigns can be geo-targeted more precisely but still spill through VPNs and location data errors.

Ensure sufficient scale. Control markets must collectively generate enough conversions during the test period to provide a stable baseline. A control group with 50 conversions per week will be too noisy to detect a 10% lift. You need hundreds to thousands of conversions in both groups for reasonable statistical power.

Avoid asymmetric confounders. If a major event -- a sports championship, a weather disaster, a competitor's product launch -- affects treatment markets but not control markets during the test period, your results are biased. Select markets where such events are either unlikely or likely to affect both groups equally.

Market Selection Checklist

| Criterion | What to Check | Red Flag |
| --- | --- | --- |
| Pre-period correlation | R-squared between treatment and synthetic control > 0.95 | R-squared below 0.90 indicates poor fit |
| Media spillover | No shared TV markets, minimal commuter overlap | Adjacent DMAs with shared media coverage |
| Baseline volume | Control group generates 500+ weekly conversions | Fewer than 200 weekly conversions in control |
| Historical stability | No structural breaks in past 12 months | Market entered or exited in past year |
| Seasonality alignment | Treatment and control show same seasonal patterns | One group has tourism/university seasonality the other lacks |
| Competitor activity | No known competitive launches in test markets only | Major competitor opening stores in treatment only |

A common mistake is selecting the largest markets as treatment and the smallest as control. This makes the pre-period fit poor (large markets behave differently from small ones) and introduces systematic bias. The better approach is stratified randomization: rank markets by size, divide them into strata, and randomly assign within each stratum.
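
As a sketch of what stratified assignment can look like in code (the column names and the 70/30 split are illustrative assumptions, not prescriptions):

import numpy as np
import pandas as pd

def stratified_assignment(markets, size_col="baseline_revenue",
                          n_strata=5, treat_share=0.7, seed=42):
    """Rank markets by size, cut them into strata of similar size,
    and randomly assign treatment vs. control within each stratum."""
    rng = np.random.default_rng(seed)
    df = markets.copy()
    df["stratum"] = pd.qcut(df[size_col].rank(method="first"),
                            n_strata, labels=False)
    df["group"] = "control"
    for _, idx in df.groupby("stratum").groups.items():
        idx = rng.permutation(np.asarray(list(idx)))      # shuffle markets in this stratum
        n_treat = int(round(len(idx) * treat_share))
        df.loc[idx[:n_treat], "group"] = "treatment"
    return df

# usage: assignments = stratified_assignment(dma_table)
#        assignments.groupby("group")["baseline_revenue"].describe()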

Power Analysis and Minimum Detectable Effects

Statistical power is the probability that your experiment will detect a real effect if one exists. A power of 80% means there is an 80% chance of detecting a true effect at the specified significance level. Run an underpowered test and you are likely to spend the budget and learn nothing you can act on.

For geo-lift experiments, power depends on three factors: the expected effect size, the variability of your outcome metric across regions, and the number of geographic units.

The expected effect size is driven by your campaign budget relative to the baseline. If you spend $100,000 on a campaign in markets that normally generate $5,000,000 in monthly revenue, you are looking for a 2% lift (assuming a 1:1 return). Detecting a 2% lift requires much more power than detecting a 20% lift.

The minimum detectable effect (MDE) at significance level $\alpha$ and power $1 - \beta$ is:

$$\text{MDE} = (z_{1-\alpha/2} + z_{1-\beta}) \times \sqrt{\frac{\sigma^2_{\text{treatment}}}{n_{\text{treatment}}} + \frac{\sigma^2_{\text{control}}}{n_{\text{control}}}}$$

where $\sigma^2$ represents the variance of the outcome metric across regions and $n$ is the number of geographic units in each group.
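
Transcribed directly into Python (an illustrative helper; the variances would come from your own pre-period geo data):

from scipy.stats import norm

def minimum_detectable_effect(sigma_treat, sigma_control, n_treat, n_control,
                              alpha=0.05, power=0.80):
    """Two-sided MDE for the difference in geo-level means at the given
    significance level and statistical power."""
    z_alpha = norm.ppf(1 - alpha / 2)    # critical value for the two-sided test
    z_beta = norm.ppf(power)             # quantile corresponding to the target power
    standard_error = (sigma_treat**2 / n_treat + sigma_control**2 / n_control) ** 0.5
    return (z_alpha + z_beta) * standard_error

# e.g. 20 treatment and 10 control DMAs with a weekly-revenue std dev of 50,000:
# minimum_detectable_effect(50_000, 50_000, 20, 10)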

Minimum Detectable Effect by Weekly Campaign Budget (80% Power)

The chart above shows a representative relationship between weekly campaign budget and minimum detectable effect (MDE) for a geo-lift experiment using 30 DMAs (20 treatment, 10 control) with moderate baseline variance. The numbers will vary based on your specific baseline metrics and variance, but the pattern is universal: smaller budgets require larger true effects to be detectable.

This creates a painful catch-22. The campaigns most in need of incrementality measurement -- large, always-on brand campaigns with ambiguous ROI -- are often the ones with effect sizes too small to detect in a geo-lift test of reasonable duration. And the campaigns easiest to measure -- concentrated bursts with high expected lift -- are the ones where incrementality is least in question.

The way out is to design experiments around the minimum budget that achieves acceptable power at the effect size you care about. If your ROAS target implies a 5% revenue lift and your power analysis says you need $200,000/week in treatment markets to detect a 5% lift with 80% power, then anything below $200,000/week is a wasted experiment. Spend the budget or do not run the test.

Confounders: What Can Ruin Your Experiment

A well-designed geo-lift experiment controls for most confounders through geographic randomization. But "most" is not "all." Several categories of confounders can still undermine your results.

Macro-economic shocks. A recession, a stimulus check, an interest rate change -- these affect all markets but not necessarily equally. If your treatment markets are concentrated in regions with high exposure to a particular industry, a sector-specific shock can bias results in either direction.

Weather and seasonal events. A hurricane hitting treatment markets depresses sales. An unusually warm winter in control markets boosts outdoor recreation spending. These effects have nothing to do with your campaign but will appear in your data as if they do.

Competitive actions. A competitor launches a promotion in three of your control markets. Sales in those markets drop, making your treatment markets look better by comparison. You report incremental lift that is partly or entirely an artifact of competitive activity.

Local events. A major sporting event, a music festival, a university graduation -- these drive temporary demand spikes in specific markets. If they fall asymmetrically across your treatment and control groups, your results are contaminated.

Media spillover. The most insidious confounder. Digital campaigns can be geo-targeted, but TV, radio, and out-of-home cannot be contained perfectly within DMA boundaries. If 15% of your control markets' population commutes into treatment market media zones, your control group is partially treated, and your estimated incremental effect is biased downward.

The defenses against confounders are:

Pre-registration. Declare your hypothesis, test design, and analysis plan before the experiment starts. This prevents post-hoc rationalization of unexpected results.

Multiple control groups. If possible, hold out two separate control groups. If they agree with each other but differ from treatment, the effect is likely real. If they disagree with each other, something external is contaminating the results.

Covariate adjustment. Include observable covariates (weather data, competitive spend estimates, macro indicators) in your analysis model. The Bayesian structural time-series approach in Matched Markets handles this naturally.

Sensitivity analysis. After the experiment, test how robust your results are to excluding individual markets. If removing one market from the control group changes your conclusion, the result is fragile.

Staggered Rollout Designs

A limitation of the classic geo-lift experiment is that it requires holding spend at zero in control markets for the entire test period. For large advertisers with always-on campaigns, going dark in 20-40% of their markets for six weeks is operationally painful and strategically risky.

Staggered rollout designs offer an alternative. Instead of treatment versus control at a single point in time, you roll the campaign out to different markets at different times. Each market serves as its own control during its pre-rollout period and as treatment after rollout begins.

The logic is similar to difference-in-differences estimation. You compare the change in each market's outcomes before and after its rollout date, netting out common time trends by using markets that have not yet been treated as contemporaneous controls.
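
A stripped-down version of that comparison, in the spirit of the group-time estimators discussed later in this section. The column names are assumptions, and this toy omits the inference corrections the published estimators provide:

import pandas as pd

def staggered_did(df, window=4):
    """For each rollout cohort, compare its pre/post change in revenue against
    markets not yet treated in the same window, then average across cohorts.
    Expects columns: market, week, revenue, rollout_week (NaN = never treated)."""
    effects = []
    treated_df = df.dropna(subset=["rollout_week"])
    for g, cohort in treated_df.groupby("rollout_week"):
        pre = (df["week"] >= g - window) & (df["week"] < g)
        post = (df["week"] >= g) & (df["week"] < g + window)
        in_cohort = df["market"].isin(cohort["market"].unique())
        # "not yet treated" controls: never treated, or rolled out after the window
        not_yet = df["rollout_week"].isna() | (df["rollout_week"] >= g + window)

        delta_treated = (df.loc[post & in_cohort, "revenue"].mean()
                         - df.loc[pre & in_cohort, "revenue"].mean())
        delta_control = (df.loc[post & not_yet, "revenue"].mean()
                         - df.loc[pre & not_yet, "revenue"].mean())
        effects.append({"cohort": g,
                        "n_markets": cohort["market"].nunique(),
                        "att": delta_treated - delta_control})

    out = pd.DataFrame(effects)
    weights = out["n_markets"] / out["n_markets"].sum()
    return out, float((out["att"] * weights).sum())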

The advantages of staggered rollout are:

No permanently dark markets. Every market eventually receives the campaign. This eliminates the business objection to withholding spend.

More statistical power. With more variation in treatment timing, you have more data points for estimation. The entire timeline contributes to inference, not just the test period.

Built-in replication. If the effect appears at every rollout date, your confidence increases. If it appears at some dates but not others, you learn something about temporal heterogeneity.

The disadvantages are significant:

Stronger assumptions. Staggered rollout designs assume that treatment effects are constant across rollout cohorts and that early-treated markets do not affect later-treated markets. Both assumptions can fail.

Analytical complexity. Recent econometric research (Callaway and Sant'Anna, 2021; Sun and Abraham, 2021) has shown that naive two-way fixed effects estimators can produce severely biased estimates under staggered treatment timing. The appropriate methods are newer and less widely implemented.

Longer total duration. Because each market needs both pre- and post-rollout observation time, the total experiment can take 12-16 weeks rather than 6-8.

For organizations that cannot tolerate dark markets, staggered rollout is the right design. But it requires greater analytical sophistication and should be implemented with the newer heterogeneity-robust estimators, not the classic two-way fixed effects regression.

Combining Geo-Lift with Media Mix Modeling

Geo-lift experiments and media mix models (MMMs) answer related but distinct questions. Geo-lift answers: "What is the incremental impact of this specific campaign in this specific time period?" MMM, particularly in its modern Bayesian form, answers: "What is the marginal return on each channel across my entire portfolio over the past 1-2 years?"

Both are valuable. Neither is sufficient alone. The state of the art is to use them together, with geo-lift experiments calibrating the MMM and the MMM extending geo-lift insights across channels and time periods.

Here is how the integration works:

Step 1: Run geo-lift experiments on your largest or most ambiguous channels. Prioritize channels where the gap between attributed and incremental performance is likely largest -- brand search, retargeting, and always-on social are common starting points.

Step 2: Use geo-lift results as informative priors in the MMM. If your geo-lift experiment found that branded search has a 0.8x iROAS (incremental return on ad spend), feed this into the MMM as a prior on the branded search coefficient. This prevents the MMM from overestimating branded search's contribution, as it otherwise tends to do because brand search volume correlates with demand that would exist regardless.

Step 3: Use the calibrated MMM to estimate incrementality for channels you did not test. You cannot run geo-lift experiments on everything. But an MMM calibrated with geo-lift results on key channels will produce more credible estimates for all channels.

Step 4: Use the MMM to plan future geo-lift experiments. The MMM identifies which channels have the most uncertain estimates. Target your next round of geo-lift experiments at those channels to reduce uncertainty where it matters most.
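
A minimal sketch of Step 2's "feed the experiment in as a prior" idea, here with PyMC. The spend series, variable names, and the 0.2 prior width are illustrative assumptions, and a real MMM would add adstock and saturation transforms; the point is only that the experimental iROAS becomes the prior mean on the channel coefficient.

import numpy as np
import pymc as pm

rng = np.random.default_rng(7)
weeks = 104
# Hypothetical weekly spend and revenue series (stand-ins for real data)
brand_search_spend = rng.gamma(5.0, 2_000.0, weeks)
prospecting_spend = rng.gamma(5.0, 5_000.0, weeks)
revenue = (50_000 + 0.8 * brand_search_spend
           + 2.5 * prospecting_spend + rng.normal(0, 5_000, weeks))

with pm.Model() as mmm:
    intercept = pm.Normal("intercept", mu=50_000, sigma=20_000)
    # Geo-lift result as an informative prior: iROAS centered at 0.8x
    beta_brand = pm.Normal("beta_brand_search", mu=0.8, sigma=0.2)
    # No experiment for prospecting yet: weakly informative prior
    beta_prospecting = pm.Normal("beta_prospecting", mu=1.0, sigma=2.0)
    noise = pm.HalfNormal("noise", sigma=10_000)
    expected = (intercept + beta_brand * brand_search_spend
                + beta_prospecting * prospecting_spend)
    pm.Normal("revenue", mu=expected, sigma=noise, observed=revenue)
    idata = pm.sample(1_000, tune=1_000, chains=2, progressbar=False)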

This cycle -- experiment, calibrate, model, plan, repeat -- is what separates organizations that measure incrementality from organizations that understand and act on it.

The Incrementality Measurement Hierarchy

Not every organization is ready for geo-lift experiments. And not every measurement question requires one. We propose a hierarchy of incrementality measurement approaches, ordered by rigor and resource requirements.

Think of this as a maturity model. Start where you are. Move up as your data infrastructure, analytical capabilities, and organizational appetite for experimentation allow.

Level 1: Holdout Tests (Platform-Native)

The simplest form of incrementality measurement. Use the ad platform's built-in tools to create a holdout group -- a random subset of your target audience that does not see ads. Compare conversion rates between the exposed group and the holdout. This is what Facebook's Conversion Lift and Google's brand lift studies provide.

Rigor: Low. Platform-measured, platform-reported. Selection bias is partially controlled but not eliminated. Results tend to overstate incrementality.

Resource requirement: Minimal. Configured through the platform's UI. No external data needed.

Level 2: Ghost Ads / Intent-to-Treat Analysis

An improvement on platform holdout tests. Instead of comparing people who saw ads to people who did not, compare people who were eligible to see ads (whether or not they actually did) between treatment and control groups. This eliminates the selection bias inherent in exposure-based comparisons.

Rigor: Medium. Controls for targeting selection bias. Still relies on platform-randomized groups and platform-reported outcomes.

Resource requirement: Low to Medium. Requires custom analysis but not external data.

Level 3: Geo-Lift Experiments (Single Channel)

The method we have been discussing. Run a campaign in treatment regions, withhold it from control regions, measure the difference in business outcomes using your own data.

Rigor: High. True geographic randomization. Outcomes measured independently of the ad platform. Controls for most confounders.

Resource requirement: Medium to High. Requires historical geo-level data, statistical expertise, and willingness to forgo spend in control markets.

Level 4: Multi-Channel Geo-Lift with MMM Calibration

Run geo-lift experiments on multiple channels, use results to calibrate a media mix model, and use the calibrated model for portfolio-level optimization.

Rigor: Very High. Combines experimental causal evidence with econometric modeling. Produces channel-level incrementality estimates with uncertainty quantification.

Resource requirement: High. Requires a dedicated measurement team, 6-12 months of experimentation, and significant analytical infrastructure.

Level 5: Always-On Experimentation Platform

Continuous geo-lift testing across channels, with automated market selection, power analysis, and result integration. New campaigns are tested by default. The MMM is continuously updated with fresh experimental results.

Rigor: Highest achievable outside a controlled lab. Near-real-time incrementality estimates with continuously shrinking uncertainty.

Resource requirement: Very High. Requires a measurement engineering team, automated data pipelines, and organizational commitment to experimentation as a core capability.

Incrementality Measurement Hierarchy

| Level | Method | Rigor | Resource Cost | Timeline to First Result |
| --- | --- | --- | --- | --- |
| Level 1 | Platform Holdout Tests | Low | Minimal | 2-4 weeks |
| Level 2 | Ghost Ads / Intent-to-Treat | Medium | Low-Medium | 4-6 weeks |
| Level 3 | Single-Channel Geo-Lift | High | Medium-High | 8-12 weeks |
| Level 4 | Multi-Channel Geo-Lift + MMM | Very High | High | 6-12 months |
| Level 5 | Always-On Experimentation Platform | Highest | Very High | 12-18 months to full maturity |

Most companies should begin at Level 1 or 2 and work toward Level 3 within their first year of incrementality measurement. Level 4 is achievable for mid-size and large marketing organizations within 12-18 months. Level 5 is currently the domain of the largest advertisers -- companies spending $100M+ annually on media -- though the open-source tooling from Meta and Google is steadily lowering the barrier.

Case Studies Across Channels

The theory of geo-lift testing is well-established. What matters is what happens when organizations apply it. Here are findings that illustrate common patterns across channels.

Branded Search: The Emperor's Clothes

A large e-commerce company running $3M per month in branded search advertising conducted a geo-lift test across 15 treatment and 8 control DMAs over six weeks. The platform reported a 4.2x ROAS based on last-click attribution. The geo-lift experiment found an incremental ROAS of 0.6x. More than 80% of the conversions attributed to branded search would have occurred through organic search or direct navigation. The company reduced branded search spend by 60% and reallocated the budget to prospecting channels. Total revenue was unchanged.

This finding is consistent with the eBay study and with subsequent research from other large advertisers. Branded search incrementality is almost always lower than attribution suggests, often dramatically so.

Retargeting: Smaller Than Reported, Still Positive

A subscription-box company tested their retargeting campaigns using a geo-lift framework across 20 treatment and 10 control metros. Platform-reported ROAS was 8.5x. The geo-lift experiment found incremental ROAS of 2.1x. Retargeting was generating real incremental conversions, but at roughly one-quarter the rate that the platform claimed. The company maintained retargeting spend but adjusted its CPA targets upward and reduced frequency caps to avoid diminishing returns.

Paid Social Prospecting: Closer to Attribution Than Expected

A DTC fitness brand ran a geo-lift test on their Meta prospecting campaigns (lookalike audiences, interest targeting) across 18 treatment and 12 control DMAs. Attribution showed a 1.8x ROAS. The geo-lift experiment found 1.4x incremental ROAS. The gap was smaller than for branded search or retargeting because prospecting targets users who have not yet engaged with the brand -- there is less organic demand to cannibalize.

Connected TV: The Hardest Channel to Measure

A home services company tested a CTV campaign using a staggered rollout across 25 DMAs. The estimated incremental lift in branded search volume was 12% in treatment markets, with a 6-8% lift in direct website visits. CTV's primary incremental value appeared not in direct conversions but in demand generation that manifested through other channels 1-3 weeks later. This lagged effect would be invisible to any attribution system but was clearly detectable in the geo-lift framework.

Direct Mail: The Surprising Winner

A financial services firm tested direct mail campaigns via geo-lift and found an incremental ROAS of 3.8x, making it the highest-incrementality channel in their portfolio. The reason: direct mail reaches audiences that digital campaigns do not. Older demographics, lower internet usage, different media consumption habits. The lack of overlap with digital channels meant almost zero free-ridership.

These case studies reveal a pattern: a channel's incrementality is inversely proportional to its overlap with organic demand and other paid channels. The more a channel reaches audiences that would convert anyway or that other channels also reach, the lower its true incremental contribution.

The Uncomfortable Math

If you have followed the logic this far, you may be confronting a disquieting realization. If branded search, retargeting, and last-click-attributed conversions are systematically overstated, then the actual incremental return on marketing spend for most companies is substantially lower than reported.

This is correct. And it is not a reason for despair. It is a reason for reallocation.

The insight from incrementality testing is not that marketing does not work. It is that marketing works differently than attribution dashboards suggest. Prospecting channels, brand awareness investments, and non-digital media often generate more incremental value per dollar than the retargeting and branded search campaigns that dominate performance dashboards.

The companies that measure incrementality do not spend less on marketing. They spend differently. They shift budget from high-attribution, low-incrementality channels to low-attribution, high-incrementality channels. And they outperform competitors who optimize against attribution metrics that reward free-ridership.

Wanamaker said half his advertising was wasted. The geo-lift framework tells us something more specific: the wasted half is probably the half that looks best in your attribution reports.

The question is whether you have the measurement infrastructure to identify it and the organizational courage to act on what you find.

References

  • Abadie, A., Diamond, A., & Hainmueller, J. (2010). Synthetic control methods for comparative case studies: Estimating the effect of California's tobacco control program. Journal of the American Statistical Association, 105(490), 493-505.

  • Abadie, A., Diamond, A., & Hainmueller, J. (2015). Comparative politics and the synthetic control method. American Journal of Political Science, 59(2), 495-510.

  • Blake, T., Nosko, C., & Tadelis, S. (2015). Consumer heterogeneity and paid search effectiveness: A large-scale field experiment. Econometrica, 83(1), 155-174.

  • Brodersen, K. H., Gallusser, F., Koehler, J., Remy, N., & Scott, S. L. (2015). Inferring causal impact using Bayesian structural time-series models. Annals of Applied Statistics, 9(1), 247-274.

  • Callaway, B., & Sant'Anna, P. H. C. (2021). Difference-in-differences with multiple time periods. Journal of Econometrics, 225(2), 200-230.

  • Gordon, B. R., Zettelmeyer, F., Bhatt, N., & Lovett, M. J. (2019). Close enough? A large-scale exploration of non-experimental approaches to advertising measurement. Marketing Science, 38(6), 991-1014.

  • Meta Open Source. (2022). GeoLift: An open-source solution for measuring incrementality. GitHub repository. https://github.com/facebookincubator/GeoLift

  • Sun, L., & Abraham, S. (2021). Estimating dynamic treatment effects in event studies with heterogeneous treatment effects. Journal of Econometrics, 225(2), 175-199.

  • Vaver, J., & Koehler, J. (2011). Measuring ad effectiveness using geo experiments. Google Technical Report.

  • Ye, M., & Pan, S. (2022). GeoLift: Measuring the true value of advertising through geo-experimentation. Meta Marketing Science Research.

The Conversation

5 replies

Ravi Subramaniam

We ran essentially this exact experiment in 2023, 11-week hold-out in Portland vs synthetic from the other 24 DMAs. The incrementality number came in at 0.38x the last-click figure, which matches what you describe almost to the decimal. The hard part isn't the model, it's convincing the paid-acquisition team that their KPIs were structurally overstated for three years running.

Prof. Nadia Berger

abadie's synthetic control is the right workhorse here but readers should note its limitations for marketing contexts. the method assumes the treated unit can be represented as a convex combination of control units, which breaks down when you have country-level interventions (too few candidates) or when the treatment is continuous rather than binary. the recent matrix-completion methods from athey, bayati et al. (2021) handle both limitations more gracefully.

Mehmet Arslan

geo-lift is the right framework but in turkey the market structure makes it harder than the paper suggests. Istanbul alone is ~20% of GMV which breaks the 'one treated, many control' cleanness. We've had to use a staggered rollout across smaller cities and pool them to get stable enough synthetic controls. Method works but the implementation is messier than the US case.

Lauren McAllister

nerdy correction, the 'half your marketing budget is wasted' quote is usually attributed to john wanamaker but the actual original appears to be from william lever (lord leverhulme) from around 1900. matters for a historically-literate piece

Kavya Nair

one practical addition: running geo holdouts requires finance buy-in for the revenue you're deliberately foregoing during the test window. we had to build an 'investment case' for the hold-out itself, quantifying the value of knowing true incrementality over the next 18 months of spend decisions. honestly that conversation is harder than building the synthetic control
