Pricing Strategy

Pricing Experimentation Without the Legal Risk: An Operator Framework for Defensible A/B Tests

Price A/B tests are not, by themselves, illegal. Most of the legal risk lies in how the cohorts are formed, what data is used, and what the team can show a regulator a year later. This is the framework that survives the question.

TL;DR: Price A/B tests are not, by themselves, illegal in the US, UK, or EU. Most of the legal risk is in the design choices around them: how the cohorts are formed, what personal data is used, what the customer was told, and how much the team can document a year later when a regulator or a class-action plaintiff asks. This essay walks through the actual legal exposure (Robinson-Patman in the US, ARP equivalents in EU/UK, the FTC's deceptive pricing guides, the EU UCPD, GDPR/CCPA on cohort construction), the operational frameworks that produce defensible tests (uniform-cohort randomization, opt-in trials, fairness budgets, holdout cells), and the documentation discipline that determines whether the practice survives the question.

A note on sources and scope. The Amazon 2000 DVD case, the FTC surveillance pricing study, and the EU and UK enforcement actions referenced are public. The operational figures (test volumes, opt-in rates, hold-out frequencies) come from advisory engagements with five operators across travel, SaaS, fashion, and grocery. The legal interpretations are practitioner readings of public guidance, not legal advice. The framework presumes a sophisticated pricing team with engineering, product, legal, and analytics functions willing to coordinate.


The Misunderstanding at the Center of the Conversation

Talk to a product team about A/B testing prices and one of two things happens. Either someone says "that is illegal" and the conversation ends, or someone says "everyone does it" and the conversation also ends. Both responses are wrong in the same way. Neither is engaging with what the law actually prohibits and what it allows. The legal picture for price experimentation is more permissive than the first response and more constrained than the second. The honest answer is that price A/B tests sit inside a narrow band of legally defensible practice, and almost every team that tests prices operates inside that band without fully understanding where its edges are.

This essay is the operator framework. It is opinionated about three things. First, the source of legal risk in price tests is not the test itself; it is the cohorting and the documentation. Second, the regulatory direction across the US, UK, and EU is converging on more disclosure and more constraint on personal-data inputs, so the operationally smart move is to design tests to a tighter standard than any single regulator currently requires. Third, most teams have weak documentation practices for price tests; this is the single point of leverage where adding modest discipline produces large reductions in residual legal risk.

The empirical motivation matters. The most famous cautionary example, Amazon's 2000 DVD price-personalization episode, was technically a test (Bezos described it as "a random price experiment") but was perceived as personalized price discrimination. The customer reaction was sharp enough that Amazon has not deployed personalized pricing on the consumer site in the twenty-five years since (Retail Dive, 2019). The cost of getting this wrong is not just litigation. It is brand damage that lasts a generation.


What the Law Actually Says (US, UK, EU)

A short tour of the rules. This is the practitioner's reading; consult counsel before acting on it.

United States: Robinson-Patman and the FTC. The Robinson-Patman Act of 1936 prohibits a seller from charging different prices to competing buyers for commodities of like grade and quality, where the effect may be to substantially lessen competition. The Act applies most cleanly to wholesale (seller-to-merchant) transactions. The case law on its application to consumer (seller-to-end-user) price discrimination is sparse, and the differentiated nature of most consumer goods limits its reach. The FTC's overview (Price Discrimination: Robinson-Patman Violations) describes the Act's contemporary scope. After roughly forty years of dormancy, the FTC under chair Lina Khan revived RPA enforcement, culminating in the Southern Glazer's complaint of December 2024 (Paul Weiss memo).

For consumer price tests, the more active US framework is Section 5 of the FTC Act (unfair or deceptive practices) and the Guides Against Deceptive Pricing at 16 CFR Part 233. Part 233 covers former-price comparisons (Section 233.1), competitor pricing (Section 233.2), and manufacturer's suggested retail price (Section 233.3). Tests that involve marketing the test price as a discount from a reference price that is not bona fide can run afoul of Part 233 even if the test design is otherwise sound. The FTC's 2024 surveillance-pricing study (FTC release, July 2024) signals the agency's increasing focus on personalization rather than experimentation per se.

State-level: California, New York. California's surveillance-pricing sweep, announced by AG Bonta on January 27, 2026, focuses on businesses using personal data to set targeted prices under the CCPA's purpose-limitation principle (California AG release). New York's algorithmic pricing disclosure law, effective in 2025, requires businesses to disclose when an algorithm uses personal data to set price.

European Union: UCPD, GDPR, AI Act. The Unfair Commercial Practices Directive (Directive 2005/29/EC) prohibits commercial practices that are misleading or aggressive. The European Commission's interpretation, in the 2021 update to the UCPD guidance, treats undisclosed personalized pricing as a UCPD breach in many configurations. The GDPR's purpose-limitation and lawful-basis requirements constrain the use of personal data for cohort construction. The AI Act (Regulation (EU) 2024/1689) classifies certain pricing systems as high-risk (notably in credit and insurance), with the bulk of obligations coming into force in August 2026.

United Kingdom: ARP, ASA, CMA. The UK's Digital Markets, Competition and Consumers Act 2024 modernized the consumer-protection regime. The CMA enforces algorithmic pricing scrutiny on designated digital platforms. The ASA addresses misleading price claims (drip pricing, deceptive "from" prices). The "ARP equivalents" sometimes referenced in this space refer to the broader Anti-Retail-Pricing and competition framework rather than a single statute; the practical effect is similar to the EU UCPD.

Legal Framework Touching Price Experimentation, Practitioner Summary (Early 2026)

| Jurisdiction | Framework | What It Covers | Test-Design Implication |
|---|---|---|---|
| US Federal | Robinson-Patman Act (1936) | B2B price discrimination on commodities | Largely wholesale; rare for consumer tests |
| US Federal | FTC Section 5 | Unfair / deceptive practices | Avoid deceptive framing of test price |
| US Federal | 16 CFR Part 233 (Deceptive Pricing Guides) | Reference-price representations | Test cannot create a fictitious former price |
| California | CCPA + surveillance pricing sweep (Jan 2026) | Personal data in pricing | Cohort cannot use sensitive personal data without basis |
| New York | Algorithmic Pricing Disclosure Law (2025) | Algorithm + personal data → price | Disclosure at point of sale if personal data used |
| EU | UCPD 2005/29/EC | Misleading / aggressive practices | Undisclosed personalization risks UCPD breach |
| EU | GDPR (2016/679) | Personal data processing | Lawful basis + purpose limitation for cohort data |
| EU | AI Act (2024/1689) (high-risk from Aug 2026) | AI systems in credit/insurance | Documentation + human oversight obligations |
| UK | Digital Markets, Competition and Consumers Act 2024 | Designated platforms; consumer protection | CMA scrutiny on platform-level pricing |
| UK | ASA misleading-price rules | Advertising / price-claim accuracy | Test cannot create misleading "from" prices |

The cumulative effect, for a team operating in even three of these jurisdictions, is a constraint set rather than a single rule. The operational stance we recommend is to design test mechanics to a stricter internal standard than any one regulator currently requires, on the bet that the regulatory direction across all of them is converging on more disclosure, more documented oversight, and more constraint on the personal-data inputs to cohort construction.


The Three Cohort Patterns and Their Risk Profiles

Most price tests we see in advisory work use one of three cohorting patterns. The three patterns have materially different legal risk profiles, and the team should choose deliberately rather than by convenience.

Pattern 1: Uniform-cohort randomization. Every visit is assigned to a cohort using a uniform random draw at the moment of arrival. The cohort assignment is independent of who the customer is, what they have bought before, where they came from, what device they are on, or anything else. The customer cannot move from one cohort to another by changing their behavior; the assignment is determined by chance at session start. This is the cleanest design and the one with the lowest legal risk in essentially every jurisdiction. It is also the design that most resembles a clinical trial: the cohort is random by construction, and the difference in outcomes between cohorts is attributable to the price treatment.

Pattern 2: Cohort-attribute randomization within a known segment. The team defines a segment using non-protected, business-legitimate attributes (a product category, a market, a marketing channel) and randomizes within that segment. The risk is whether the segment-defining attributes are themselves proxies for protected attributes, and whether the segment definition can be characterized as discriminatory. A test that randomizes within "users of Channel X" is usually defensible. A test that randomizes within "users whose ZIP code matches a low-income census tract" usually is not, because the segment-defining attribute is a strong proxy for race and income.

Pattern 3: Personalized pricing (no test). The model produces a price tailored to the individual customer. This is not technically an A/B test in the controlled-experiment sense; it is the production output of a personalization model. The legal risk profile is the topic of the dynamic pricing fairness audit essay and is materially higher than the test patterns above. Most of the recent regulatory attention (FTC 2024 study, California 2026 sweep, EU UCPD guidance) targets personalized pricing rather than uniform-cohort experimentation.

Relative Legal Risk by Cohort Pattern (Practitioner Heuristic, US/UK/EU)

The chart's risk scale is ordinal, not cardinal; it ranks the patterns relative to one another rather than measuring an absolute probability of enforcement. The point is that the gap between Pattern 1 and Pattern 3 is roughly an order of magnitude in our advisory observations, and the team often does not realize how much legal headroom it has bought itself by sticking to Pattern 1.


The Test Design That Survives a Regulator's Question

A price test is most defensible when, a year after the test, the team can produce a short stack of documents that answers four questions cleanly.

  1. What was randomized? The randomization key, the seed, the unit of randomization (visit, session, account, market), and the cohort assignment function.
  2. Who was eligible? The eligibility definition, the exclusion criteria, and the count of customers in and out of each cohort.
  3. What was the test price and what was the control price? The specific price levels in each cohort, the SKU(s) involved, the duration, and the markets covered.
  4. What was the customer told? The disclosed price at point of sale, any pricing disclosures in the privacy policy or shopping interface, and whether the customer was given an opt-out option.

If the team can produce clean answers to these four questions, the test is defensible in essentially any jurisdiction. If it cannot, the test is exposed regardless of how the cohorting was done. Most of the residual legal risk in price experimentation comes from documentation gaps rather than design choices.
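The items in question 1 (key, seed, unit, assignment function) are easiest to produce a year later when the assignment is deterministic and reproducible. The following is a minimal sketch, not a description of any specific team's system. It assumes the randomization unit is a visit, that `visit_id` is a random identifier generated at session start with no customer attributes in it, and that `test_salt` plays the role of the per-test seed; the function name and both parameters are hypothetical. Hashing a random identifier is swapped in here for a live random draw because it gives the same uniform split while letting anyone recompute the assignment from the record.

```python
import hashlib
import uuid


def assign_cohort(visit_id: str, test_salt: str,
                  cohorts: tuple[str, ...] = ("control", "treatment")) -> str:
    """Map a visit to a cohort using only a random visit ID and a per-test salt.

    No customer attribute (history, device, location) enters the function,
    so the cohort cannot be reverse-engineered from who the customer is.
    """
    digest = hashlib.sha256(f"{test_salt}:{visit_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # 64-bit integer from the hash
    return cohorts[bucket % len(cohorts)]


# Hypothetical usage: the visit ID is generated at session start.
visit_id = uuid.uuid4().hex
cohort = assign_cohort(visit_id, test_salt="price-test-example")
```

Because the function is pure, the cohort assignment specification only needs to store the salt, the cohort list, and the code version; the assignment for any logged visit ID can then be recomputed on request.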

Defensible price-experiment workflow with documentation gates

The workflow above looks bureaucratic and is. The point of the bureaucracy is that the team can produce, on demand, the seven artifacts a regulator might ask for: the hypothesis document, the cohort assignment specification, the eligibility and exclusion criteria, the disclosure language, the execution log, the result analysis, and the decision record (including any sunset or review schedule). Teams that have these seven artifacts available for every test in the last two years are essentially immune to the "show me your records" question. Teams that have them for half of their tests, and missing or inconsistent records for the rest, are exposed in proportion to the gap.


Disclosure: What Customers Need to Be Told

The disclosure question is the area where teams most often over-engineer or under-engineer. The minimum viable disclosure, in our reading of the current state of US/UK/EU regulation, has three components.

Component 1: A privacy-policy paragraph that acknowledges the use of randomized testing. This is the easiest piece. A short paragraph in the privacy policy or terms of service that says, in plain language, that the company runs randomized tests on its website to evaluate product, layout, and pricing decisions, that the assignment is random, and that the company does not use sensitive personal information to determine the cohort. Most teams have this already and just need to re-read it to confirm it covers pricing.

Component 2: Point-of-sale disclosure when personalization is used. New York's algorithmic-pricing disclosure law and the forthcoming EU Digital Fairness Act point toward mandatory point-of-sale disclosure when an algorithm uses personal data to set price. For uniform-cohort tests (Pattern 1), no point-of-sale disclosure is generally required because no personal data is used. For Pattern 2 with proxy-protected segments and Pattern 3 (personalized pricing), point-of-sale disclosure is increasingly necessary and will likely be mandatory in most jurisdictions within the next 24 to 36 months.

Component 3: Opt-out mechanism for high-risk tests. When the test involves personal data, the customer should be able to opt out. The opt-out is more legally defensible than its absence even in jurisdictions where it is not strictly required, because it shifts the burden from "the company decided to price this customer differently" to "the customer agreed to be in the experiment." The trade-off is that opt-out implementations create selection bias: customers who opt out are different from customers who do not, and the test results are no longer valid for the population.

Disclosure Matrix by Cohort Pattern and Jurisdiction

| Pattern | US (Fed) | California | New York | UK | EU |
|---|---|---|---|---|---|
| Uniform-cohort (Pattern 1) | Privacy policy mention | Privacy policy mention | Privacy policy mention | Privacy policy mention | Privacy policy mention |
| Segment + business-legit (Pattern 2a) | Privacy policy + cohort rationale documented | Same | Same | Same | Same + lawful-basis check |
| Segment + proxy-protected (Pattern 2b) | High risk; avoid | High risk + CCPA review | Disclosure if personal data used | High risk; avoid | High risk + UCPD review |
| Personalized (Pattern 3) | Disclosure + opt-out advisable | Disclosure + opt-out required | Disclosure required | Disclosure advisable | Disclosure + lawful basis |
| Personalized with sensitive attribs | Do not deploy | Do not deploy | Do not deploy | Do not deploy | Do not deploy |

The matrix is a practitioner's reading of public guidance, not legal advice. The pattern is that as cohort construction moves from random to personalized, and as the data inputs move from non-personal to personal to sensitive, the disclosure burden moves from "privacy policy" to "point of sale" to "opt-out" to "do not deploy."


Fairness Budgets: The Aggregate Constraint Concept

A useful idea, borrowed from the differential privacy literature, is the "fairness budget" applied to a portfolio of price experiments. The idea is that the aggregate disparate impact of all simultaneous tests on any individual customer should be bounded, even if each individual test is within its own bounds.

The intuition is straightforward. A single test that varies a customer's price by $1 is small. Five tests running simultaneously, each varying the customer's price by $1, can add up to $5 of variation that the customer is unaware of and that bears no relationship to anything the customer would consider legitimate. The aggregate, not the individual test, is what creates the disparity the customer might perceive as unfair.

The fairness-budget design works like this. At a portfolio level, the team sets a maximum acceptable aggregate price variance per customer across all simultaneously running tests. The portfolio's allocation function then constrains which tests a given customer can be enrolled in, with priority given to tests with stronger business cases or higher information value. Customers who would breach the budget are deferred from new test enrollment until existing tests complete.
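The allocation step described above can be sketched as a greedy enrollment check. This is an illustration under stated assumptions rather than a reference implementation: it measures a customer's aggregate exposure as the sum of each test's worst-case percentage deviation from the reference price (a conservative reading that matches the "$1 times five tests" intuition), and the names `PriceTest`, `enroll_within_budget`, and `budget_pct` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceTest:
    name: str
    max_variation_pct: float  # worst-case deviation from the reference price
    priority: int             # higher = stronger business case or information value


def enroll_within_budget(candidates: list[PriceTest],
                         active: list[PriceTest],
                         budget_pct: float) -> tuple[list[PriceTest], list[PriceTest]]:
    """Enroll one customer in candidate tests, highest priority first,
    without letting summed worst-case variation exceed the budget.

    Returns (enrolled, deferred). Deferred tests wait until active ones end.
    """
    used = sum(t.max_variation_pct for t in active)
    enrolled, deferred = [], []
    for test in sorted(candidates, key=lambda t: t.priority, reverse=True):
        if used + test.max_variation_pct <= budget_pct:
            enrolled.append(test)
            used += test.max_variation_pct
        else:
            deferred.append(test)
    return enrolled, deferred


# Hypothetical portfolio: one test already running for this customer.
active = [PriceTest("shipping-fee", 2.0, 1)]
candidates = [PriceTest("bundle-price", 4.0, 3),
              PriceTest("plan-tier", 5.0, 2),
              PriceTest("add-on", 3.0, 1)]
enrolled, deferred = enroll_within_budget(candidates, active, budget_pct=10.0)
```

Summing worst-case deviations overstates exposure when tests touch different products, so a team with per-SKU tests may prefer a per-basket measure; the greedy ordering by priority is one simple choice among several.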

Per-Customer Aggregate Test Variance Over Time, With and Without Fairness Budget

The illustration shows what we have observed in advisory work: without a fairness budget, the per-customer aggregate variance grows roughly linearly with test concurrency, and the customer's experience drifts away from a stable reference price. With the budget, the aggregate stays bounded, even as test concurrency increases. The customer's experience remains close to the reference price even as the team learns from many simultaneous tests.


Opt-In Trials: The Belt-and-Suspenders Approach for High-Risk Tests

For tests that approach the regulatory edge (new personalization features, sensitive-segment tests, tests in newly-regulated jurisdictions), the safest design is the opt-in trial. The customer is offered an explicit choice to participate, the participation is documented, and the test is restricted to customers who have opted in.

The opt-in design has two strengths and two weaknesses. The strengths are legal defensibility (the customer consented in a way that survives most regulatory tests) and transparency with participants (the team can talk to opted-in customers about the experience without ambiguity). The weaknesses are selection bias (opted-in customers differ from the population, so the test results may not generalize) and recruitment cost (getting enough opt-in volume for a powered test is hard, particularly for low-frequency purchase categories).

For most consumer e-commerce tests, opt-in is overkill. For personalized-pricing experiments in regulated jurisdictions, opt-in is increasingly the only design that produces both useful data and legal defensibility. Two design patterns help.

Pattern A: Loyalty-tier opt-in. Make participation in price experiments a stated feature of the loyalty program. Loyalty members agree at sign-up to be enrolled in experiments in exchange for an explicit benefit (early access to new products, exclusive discounts on opt-in days, accelerated tier progression). The opt-in is contractual and durable; the loyalty members who do not want to participate can opt out.

Pattern B: Recruit-into-test interstitial. At checkout or at session start, present customers with a clear short notice describing the test and asking for participation. The interstitial cost is high (it adds friction to the conversion funnel and depresses the topline metric) and the recruitment rate is low (typical opt-in rates in our advisory observations are 15-30%). The pattern works for high-stakes tests where the team needs unambiguous consent.
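The recruitment cost of an opt-in design can be estimated before the test is built. The sketch below is a rough sizing aid, assuming the outcome is a conversion rate and using the standard normal-approximation formula for comparing two proportions; the conversion rates, function names, and defaults are hypothetical, and only the 15-30% opt-in range comes from the observations above.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p_control: float, p_treatment: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate customers needed per cohort to detect the difference
    between two conversion rates (two-sided test, normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return math.ceil((z_alpha + z_beta) ** 2 * variance
                     / (p_control - p_treatment) ** 2)


def visitors_to_prompt(n_per_arm: int, arms: int, opt_in_rate: float) -> int:
    """Visitors who must see the interstitial to fill every cohort."""
    return math.ceil(n_per_arm * arms / opt_in_rate)


# Hypothetical example: 5% baseline conversion, 4% expected at the test price.
n = sample_size_per_arm(0.05, 0.04)
prompts_low = visitors_to_prompt(n, arms=2, opt_in_rate=0.15)
prompts_high = visitors_to_prompt(n, arms=2, opt_in_rate=0.30)
```

Halving the opt-in rate doubles the number of visitors who must be shown the interstitial, which is why this pattern is hard to power for low-frequency purchase categories.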

Decision path: Which test design should you use?

Does the test require any personal data to construct cohorts?

  • If yes: Are you in California, EU, UK, or New York?
    • If yes: Is the test in a sensitive category (financial, health, legal services)?
      • If yes: Outcome: Opt-in only. Full disclosure. Document lawful basis. Probably overkill for revenue-only goals; consider whether the test is worth the design cost.
      • If no: Outcome: Pattern 2a with documented business-legit segment, full privacy-policy disclosure, point-of-sale disclosure where personalization is involved.
    • If no: Outcome: Pattern 2a with documented business-legit segment and privacy-policy disclosure.
  • If no: Will the test cohort assignment be visible to customers (URL parameter, badge, banner)?
    • If yes: Outcome: Pattern 1 (uniform-cohort) with brief privacy-policy disclosure noting that website variations are tested.
    • If no: Outcome: Pattern 1 (uniform-cohort) is sufficient. No additional disclosure beyond the standard privacy-policy mention.
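The decision path above can be encoded as a small function so that the choice is recorded alongside the test rather than made informally. This is a direct transcription of the branches, offered as a sketch; the function and parameter names are hypothetical, and `regulated_jurisdiction` stands for "California, EU, UK, or New York" as listed in the path.

```python
def choose_test_design(uses_personal_data: bool,
                       regulated_jurisdiction: bool,
                       sensitive_category: bool,
                       assignment_visible: bool) -> str:
    """Return the recommended design for a price test.

    Branch order mirrors the decision path: personal data first, then
    jurisdiction, then category sensitivity, then cohort visibility.
    """
    if uses_personal_data:
        if regulated_jurisdiction:
            if sensitive_category:
                return "opt-in only; full disclosure; document lawful basis"
            return ("Pattern 2a; privacy-policy disclosure; "
                    "point-of-sale disclosure where personalization is involved")
        return "Pattern 2a; privacy-policy disclosure"
    if assignment_visible:
        return "Pattern 1; privacy-policy disclosure noting variation testing"
    return "Pattern 1; standard privacy-policy mention"


# Hypothetical usage: a uniform-cohort test with no visible cohort marker.
design = choose_test_design(uses_personal_data=False,
                            regulated_jurisdiction=True,
                            sensitive_category=False,
                            assignment_visible=False)
```

The output string is only a starting point for the legal review, not a substitute for it.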

Documentation Discipline: The Boring Part That Actually Determines the Outcome

We have repeated the phrase "documentation discipline" several times. This section is what we mean.

A defensible price experimentation program produces, for every test, a small bundle of seven artifacts. The bundle is created before the test runs, updated as the test executes, and frozen on test completion. The artifacts are versioned, stored in a system that survives team turnover, and indexed in a way that any one of them can be retrieved within an hour by a non-author.

Artifact 1: Hypothesis document. What you are testing, what you expect to find, what threshold of improvement you would consider material, what risks the test might surface.

Artifact 2: Cohort assignment specification. The randomization unit (visit, session, account, market). The hash function. The cohort sizes. The pre-test power calculation showing the test is properly powered for the threshold in Artifact 1.

Artifact 3: Eligibility and exclusion criteria. Who is in the test. Who is excluded and why. The expected and actual count of customers in each cohort.

Artifact 4: Disclosure language. The text the customer sees, the location of the disclosure (privacy policy, footer, point-of-sale, interstitial), and the wording reviewed by legal.

Artifact 5: Test execution log. When the test started, when it ended, any interruptions, any cohort-assignment failures, any modifications during execution.

Artifact 6: Result analysis. The metric movements, the confidence intervals, the segment-level breakdowns, the decision rationale. If a result was promoted to production, the rationale for why and the sunset/review schedule.

Artifact 7: Decision record. Who made the call to roll out, roll back, or extend. The signature, the date, and the trigger conditions for re-review.

Test Artifact Bundle, Practitioner Template

| Artifact | Created When | Author | Reviewed By | Storage Duration |
|---|---|---|---|---|
| Hypothesis document | Pre-test | PM/Analyst | Product lead | 5 years |
| Cohort assignment spec | Pre-test | Engineer | Data science lead | 5 years |
| Eligibility + exclusion criteria | Pre-test | PM/Analyst | Legal lead | 5 years |
| Disclosure language | Pre-test | Legal/PM | Legal lead | 5 years |
| Execution log | During test | Engineering/On-call | Data science lead | 5 years |
| Result analysis | Post-test | Analyst | Product lead + DS lead | 5 years |
| Decision record | Post-test | Product lead | Compliance lead | 5 years |

Five years is a defensible retention floor for most jurisdictions. The cost of producing each artifact, the first time, is roughly thirty minutes to an hour. The cost of retrieving the bundle eighteen months later when a regulator asks is small compared to the cost of explaining a missing or contradictory artifact. The teams that have institutionalized this bundle treat it as a test-completion gate; the test is not done until the bundle exists.
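A completion gate of this kind can be as simple as a check that every artifact has a recorded storage location. The sketch below assumes the bundle is tracked as a mapping from artifact name to a location string; the identifiers and the example paths are hypothetical, and a real program would also verify versioning and retention.

```python
REQUIRED_ARTIFACTS = (
    "hypothesis_document",
    "cohort_assignment_spec",
    "eligibility_criteria",
    "disclosure_language",
    "execution_log",
    "result_analysis",
    "decision_record",
)


def missing_artifacts(bundle: dict[str, str]) -> list[str]:
    """List required artifacts that are absent or have an empty location."""
    return [name for name in REQUIRED_ARTIFACTS if not bundle.get(name)]


def bundle_is_complete(bundle: dict[str, str]) -> bool:
    """The test counts as done only when all seven artifacts are recorded."""
    return not missing_artifacts(bundle)


# Hypothetical bundle for one test, with the execution log not yet filed.
bundle = {name: f"records/price-test-example/{name}.md"
          for name in REQUIRED_ARTIFACTS}
bundle["execution_log"] = ""
gaps = missing_artifacts(bundle)
```

Running the check in the same pipeline that closes out the test makes the gate hard to skip.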


A Working Definition of "Low-Risk" for Operating Teams

The framework can be synthesized into a working definition that an operating team can use in practice. A price experiment is low-risk when all of the following are true.

  1. The randomization is uniform-cohort. No personal data is used to construct the cohort. The cohort cannot be reverse-engineered from a customer attribute.
  2. The eligibility filter is documented and applied identically to all visits. No silent exclusions of protected groups, no opaque "VIP override" rules.
  3. The price difference between cohorts is bounded. A common cap is that no cohort sees a price more than 10% above or below the reference. Larger variances are not necessarily illegal but invite scrutiny.
  4. The test duration is bounded. A typical defensible window is two to eight weeks for general-merchandise consumer e-commerce, with shorter windows for high-velocity categories. Indefinite-duration tests look like personalization, not experimentation.
  5. The customer is informed at the privacy-policy level. Mention of website variation testing is in the standard privacy/terms-of-service document.
  6. The team can produce the seven artifacts on demand. The bundle exists and is retrievable within an hour.
  7. No protected or sensitive attribute is used in cohort assignment, not even as a feature in a model that informs the assignment. The legal defensibility erodes quickly if a regulator can show the cohort was constructed using inputs correlated with protected attributes.

Tests meeting all seven conditions are, in our practitioner reading of current US/UK/EU law, low-risk. Tests meeting fewer than five of the seven are exposed and should be redesigned. Tests on the regulatory frontier (personalization, sensitive segments, new jurisdictions) require legal review before deploy, regardless of how many of the seven conditions are met.
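The thresholds above can be turned into a pre-launch check. The sketch below is illustrative: the condition names are hypothetical labels for the seven items, and the text does not say how to treat a test that meets five or six conditions, so the "not low-risk" label for that middle band is an assumption.

```python
LOW_RISK_CONDITIONS = (
    "uniform_cohort_randomization",
    "documented_eligibility_filter",
    "bounded_price_difference",
    "bounded_duration",
    "privacy_policy_disclosure",
    "artifacts_retrievable",
    "no_protected_attributes",
)


def classify_test(checks: dict[str, bool],
                  on_regulatory_frontier: bool = False) -> str:
    """Classify a planned price test from the seven yes/no conditions."""
    if on_regulatory_frontier:
        # Personalization, sensitive segments, or new jurisdictions:
        # legal review regardless of how many conditions are met.
        return "legal review required"
    met = sum(1 for name in LOW_RISK_CONDITIONS if checks.get(name, False))
    if met == len(LOW_RISK_CONDITIONS):
        return "low-risk"
    if met < 5:
        return "exposed: redesign"
    # Assumption: five or six conditions met is treated as not yet low-risk.
    return "not low-risk: close the unmet conditions"


# Hypothetical usage: every condition met except the duration bound.
checks = {name: True for name in LOW_RISK_CONDITIONS}
checks["bounded_duration"] = False
verdict = classify_test(checks)
```

Treating a missing key as unmet keeps the check conservative: a condition nobody evaluated does not count in the test's favor.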

The legal risk in price experimentation is not in the experiment. It is in the cohorting, the data, and the records. Get the cohorting right, restrict the data, keep the records, and the experiment itself is the easy part.


The Honest Cost of Doing This Right

Building the program described above is not free. The first-time cost of standing up a defensible price experimentation pipeline, including the documentation discipline, the cohort-assignment infrastructure, the disclosure review, and the fairness-budget portfolio constraint, is in the range of one to two engineer-quarters for a sophisticated team and substantially more for a team starting from a less mature base. The ongoing cost is roughly 10 to 20 percent of the team's engineering capacity dedicated to test infrastructure, calibration, and documentation rather than to running new experiments.

This cost is, in our advisory observations, repaid in three ways.

Repayment 1: Lower base rate of revoked tests. Teams without strong documentation discipline routinely have to pull tests mid-flight because someone surfaces a concern that the test design did not anticipate. The cost of a pulled test is the engineering cost of building it plus the opportunity cost of the analytical insight that never materializes. In our advisory observations, teams with the discipline pull tests at roughly 20 to 30 percent of the rate of teams without.

Repayment 2: Faster legal review on new tests. Once the framework is in place, new tests can be reviewed in days rather than weeks because the legal team is reviewing variations on a known pattern rather than novel designs. Test velocity, after the initial set-up cost, is often higher than before.

Repayment 3: Insurance against the bad month. A regulatory inquiry, a class action, or a press cycle that exposes one bad test can cost the company multiples of the program cost. Teams that can produce clean records for the year prior to the inquiry generally settle smaller, more cheaply, and with less brand damage than teams that cannot.

The trade is not glamorous. It is the same trade that pharmaceutical R&D made decades ago when it moved from informal practitioner experimentation to GxP documentation. The work is more boring. The outcomes are more durable.


Two Special Cases Worth Calling Out

Two recurring scenarios in advisory work deserve a separate note because they sit awkwardly in the framework above.

Special case 1: SaaS pricing-page tests. SaaS pricing tests are usually conducted on the pricing page itself: the same plan is shown to different visitors at different price points, and the team measures conversion. The legal risk here is materially lower than for transactional e-commerce pricing because the SaaS pricing page is not yet a transaction, and most jurisdictions treat the pricing page as advertising rather than a transaction with a customer. The complications arise when an existing subscriber sees a different price than a new visitor (the existing-vs-new dimension can be a proxy for tenure-related discrimination claims under some readings of consumer-protection law), and when the test price persists into the actual subscription (where the customer signs a contract at the test price and renews against it for years).

The defensible SaaS pricing test, in our reading, is one where the pricing page test is restricted to net-new visitors, where the test price is honored on conversion (the customer is not bait-and-switched at checkout), and where the legacy subscriber population is shielded from the test entirely. Many SaaS teams default to this configuration by accident because their billing systems treat new and existing subscribers separately. The accidental defensibility is real defensibility and should be preserved deliberately rather than left to chance.

Special case 2: Pricing tests in marketplace and aggregator products. When the company runs a marketplace (Amazon, Etsy, eBay, a hotel aggregator) and the prices come from the merchants on the platform, the legal posture of price experimentation is materially different. The marketplace operator is generally not the seller of record on most transactions, which means the Robinson-Patman issues sit with the merchants rather than with the platform. The platform's exposure shifts to algorithmic decisions about which merchants' prices are shown to which customers, which is closer to a recommendation problem than a pricing problem. The technical framing is different, the legal framing is different, and the auditing techniques described in the dynamic pricing fairness audit essay apply with some translation.

A marketplace that introduces personalization into search ranking based on customer-specific factors that include inferred willingness to pay is, functionally, doing personalized pricing through the recommendation system. Regulators are increasingly aware of this equivalence and are starting to apply pricing-fairness analyses to the recommendation layer in marketplaces. The defensive design is to keep willingness-to-pay signals out of the recommendation ranker and rely on relevance, quality, and merchant-side signals instead.


What We Are Watching For Next

A few open questions in the area that we do not have settled answers on.

The cross-jurisdiction holdout question. The defensible test design described above assumes the test runs in a market with a coherent regulatory framework. The trickier case is a test that runs across markets with different rules (a US-EU test, or a California-Texas test in the same calendar window). The right approach is usually to slice the test by jurisdiction and run separate trials, but this splits the sample, which cuts the statistical power of each trial and slows learning. We do not have a clean general solution; the right move depends on which jurisdictions are involved and what the test is measuring.

The cumulative-effect question. Most regulators evaluate the legality of a single practice, not the cumulative effect of a portfolio. A team that runs ten tests over a year, each of them defensible in isolation, can collectively produce a customer experience that any single test would not have. The fairness budget addresses part of this, but the question of regulator perception (does the aggregate look discriminatory even if no single test was) is not yet settled in case law.

The retention-period question. We recommend five-year retention for test artifacts. In most jurisdictions this is conservative; in some it may be longer than required and creates its own data-minimization risk. The right retention period balances regulatory exposure against data-minimization principles and depends on the team's GDPR/CCPA posture for the rest of its data.

Key Takeaways

  1. Price A/B tests are not illegal. The legal risk is in the cohorting, the data inputs, and the documentation. Uniform-cohort randomization with documented eligibility, bounded price ranges, and good records is low-risk in essentially every US/UK/EU jurisdiction.
  2. Choose the cohort pattern deliberately. Uniform-cohort (Pattern 1) is materially less risky than segment-attribute randomization (Pattern 2), which is less risky than personalized pricing (Pattern 3). The gap is roughly an order of magnitude in our practitioner reading.
  3. Treat disclosure as a portfolio decision, not a per-test one. A privacy-policy mention of website variation testing covers most uniform-cohort tests. Point-of-sale disclosure and opt-out are increasingly mandatory for personalization in regulated jurisdictions and should be designed in from the start.
  4. Implement a fairness budget at the portfolio level. Without an aggregate constraint, simultaneous tests compound on the same customer and produce per-customer price variance that the customer experiences as unstable pricing. A portfolio cap of 8 to 10 percent variance per 30-day window holds the customer experience in place while preserving most of the test velocity.
  5. Document the seven artifacts. Hypothesis, cohort spec, eligibility, disclosure, execution log, result analysis, decision record. Five-year retention. Retrievable within an hour. The single best defense against any regulator's question is the records you can produce.
  6. The cost is worth it. The first-time set-up cost is one to two engineer-quarters. The repayment comes through lower pull-rate, faster legal review on new tests, and durable insurance against the kind of inquiry that has cost less prepared companies multiples more.
