Identity Resolution in a Cookieless World: A Probabilistic Reality

The cookie was always probabilistic. Cookieless makes the probability legible. Operators who treat new identifiers as deterministic will misattribute spend and contaminate downstream measurement.

TL;DR: The third-party cookie has been declared dead for years and is now collapsing for real, in stages, across Safari, Firefox, and (after several reversals) Chrome. The replacement is not a single better identifier. It is a spectrum: a hashed email at one end (close to 100 percent confidence in match, but constrained by logged-in audience size), an IP-plus-user-agent fingerprint somewhere in the middle (often 55 to 70 percent confidence, and degrading), and Chrome Topics or Protected Audience aggregated cohorts at the other end (no per-user resolution at all). Operators who continue to treat probabilistic identifiers as if they were deterministic will misattribute spend, double-count users in CDPs, and produce audience overlaps that do not survive scrutiny. This essay maps the spectrum, the match-rate problem that anchors most identity-resolution decisions, the role of CDPs and clean rooms, and the operating discipline required to live with uncertainty rather than pretend it away.

A note on the named vendors. LiveRamp, mParticle, Segment, Snowplow, and Google's Privacy Sandbox documentation appear as widely cited reference points, not as data sources. Match-rate figures and CDP cost ranges come from advisory engagements with anonymized partner operators in retail, publishing, and SaaS archetypes, and are presented as observed ranges rather than vendor benchmarks. Where a specific vendor capability is described, it is sourced to public documentation.


For most of the last fifteen years the operating fiction was that the third-party cookie identified a user. It did not. It identified a browser instance, and that browser instance was treated as a proxy for a user because the conversion math worked well enough that nobody had to look closely. A single human running Chrome on a laptop, Safari on a phone, and a private window on an iPad showed up to the ad ecosystem as three distinct cookies. The deduplication problem was solved, when it was solved at all, by frequency-capping rules that assumed a generous overlap and by attribution models that quietly accepted the over-count.

Operators are treating the cookie's collapse as a sudden discontinuity. Mechanically, it is not. Safari's Intelligent Tracking Prevention capped third-party cookie lifetime to 24 hours in 2017 and began blocking third-party cookies outright in 2020. Firefox followed with Enhanced Tracking Protection enabled by default in 2019. Chrome's third-party cookie phase-out has had several publicly announced timelines (originally 2022, later pushed to 2024, and then the July 2024 reversal that made third-party cookies a matter of user choice rather than a default deprecation; see Google's July 2024 announcement), but the practical effect is the same: cookie-based identification on the open web has been degrading since 2017, and operators who measured carefully knew it.

What changed is not that the cookie became less reliable. It is that the gap between "deterministic identifier" and "probabilistic identifier" became impossible to ignore. The cookie's degradation forced a vocabulary that the industry had previously suppressed: the difference between a 100-percent-confidence match (the user is logged in, the hashed email is on file, the join is exact) and a 60-percent-confidence match (the IP and the user-agent overlap with a previous session, the time-of-day is consistent, and the device class is plausible). Both are useful. They are not interchangeable, and treating them as if they were is the source of most identity-resolution malpractice we have observed.

The honest framing is that every identifier was always probabilistic at some level, and what the cookieless transition does is force the probability onto the dashboard. A cookie identified a browser with high confidence and a user with much lower confidence. The new identifiers (hashed emails, device graphs, fingerprints, aggregated cohorts) each have their own confidence profile. The operating shift is not from certainty to uncertainty; it is from hidden uncertainty to visible uncertainty.


The Match-Rate Problem Anchors Everything

The single most important number in cookieless identity resolution is the match rate: the percentage of a target audience for which the resolution layer can produce a usable identifier. Match rate is what determines whether a CDP segment can be activated, whether a server-side Conversions API (CAPI) event can be deduplicated against a client-side pixel fire, and whether a clean-room join is large enough to be statistically meaningful.

The match-rate ceiling for deterministic identifiers is set by logged-in audience size. Across the partner operators we have advised, the share of traffic that arrives logged in (with a usable hashed email or account ID at the moment of the relevant event) typically sits in a range of 8 to 18 percent for consumer commerce, 22 to 41 percent for content publishers with paid subscriptions, and 47 to 78 percent for SaaS products where the primary use case requires a login. The variance is dominated by the product's structural login pressure, not by the cleverness of the team's data collection.

Observed Logged-In Traffic Share by Archetype, Across Partner Operators (2023-2024)

| Archetype | Logged-In Share (Median) | Range | Notes |
|---|---|---|---|
| Consumer commerce (no membership) | 11.4% | 8.2-17.8% | Browse-heavy, login at checkout only |
| Consumer commerce (loyalty program) | 23.7% | 18.4-34.6% | Loyalty pressure raises baseline |
| Content publisher (subscription) | 31.8% | 21.9-41.3% | Higher when paywall is metered + low (3-5 articles) |
| Content publisher (advertising-only) | 6.3% | 4.7-9.1% | Newsletter signups are the main lever |
| Marketplace (two-sided) | 28.4% | 19.2-37.1% | Sellers near 100%, buyers much lower |
| B2B SaaS (product-led) | 62.7% | 47.4-78.3% | Login is the product |
| Banking and fintech | 71.6% | 58.4-83.9% | Regulatory + transactional login pressure |

The number that matters is not the median; it is the gap between the median and 100 percent. A consumer commerce operator with 11 percent logged-in traffic can deterministically resolve 11 percent of the audience. The other 89 percent is the addressable problem for probabilistic resolution, and the choice of probabilistic technique determines what fraction of that 89 percent can be brought into the addressable pool, at what confidence, and at what cost.

The other anchor is the conversion-side match rate: when a purchase happens, what fraction of conversions can be linked back to an upstream advertising touchpoint with sufficient confidence to inform bidding? Meta's CAPI documentation reports that advertisers who implement server-side Conversions API with deduplication tend to see match-rate improvements over pixel-only baselines, but the absolute numbers depend on the share of conversions that include a usable identifier in the server payload. In partner data we have observed, the post-CAPI conversion-side match rate (the share of conversions Meta can link to a Meta-side identity) lifts from a typical 47-61 percent on a well-implemented pixel to 68-84 percent with CAPI plus hashed email and phone, with a wider spread when the operator is sloppy about identifier hygiene.
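The deduplication and match-rate arithmetic can be sketched in a few lines. This is a minimal illustration, not Meta's implementation: it assumes the browser and server copies of a conversion share an event name and a client-generated event ID, and it keeps whichever copy carries more identifiers. The `Event` class, the `dedupe` function, and the identifier labels are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    event_name: str
    event_id: str                     # shared by the pixel and server copies (assumption)
    source: str                       # "pixel" or "server"
    identifiers: frozenset = field(default_factory=frozenset)


def dedupe(events):
    """Keep one event per (event_name, event_id), preferring the richer copy."""
    best = {}
    for ev in events:
        key = (ev.event_name, ev.event_id)
        kept = best.get(key)
        if kept is None or len(ev.identifiers) > len(kept.identifiers):
            best[key] = ev
    return list(best.values())


events = [
    Event("Purchase", "order-1001", "pixel", frozenset({"browser_id"})),
    Event("Purchase", "order-1001", "server",
          frozenset({"browser_id", "hashed_email", "hashed_phone"})),
    Event("Purchase", "order-1002", "pixel", frozenset()),  # no usable identifier
]

unique = dedupe(events)
# Conversion-side match rate: share of unique conversions with any usable identifier.
match_rate = sum(1 for ev in unique if ev.identifiers) / len(unique)
print(len(unique), match_rate)
```

The point of the sketch is that the match rate is computed after deduplication: counting both copies of `order-1001` would inflate the denominator and understate the lift from the server-side payload.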


The Deterministic-to-Probabilistic Spectrum

The spectrum runs from "the user told us who they are" to "we are guessing from environmental signals" with several useful gradations in between. The operating mistake is to treat the spectrum as a binary (cookies are gone, replace them with hashed emails) when in practice every operator runs four to six identifier types simultaneously and has to make joint decisions about which to trust under what conditions.

The clean way to think about the spectrum is by confidence weight. A first-party hashed email collected at checkout, normalized (trimmed, lowercased), hashed with SHA-256, and joined against a known customer record is near-100-percent-confidence: SHA-256 collisions are negligible, so the residual false-match rate comes from shared or mistyped addresses rather than from the hash. A logged-in account ID is similarly high. A device graph that has linked a phone, a tablet, and a desktop through repeated co-occurrence of IP plus hashed email login is typically 85 to 95 percent confident per edge, with the confidence degrading as the graph reaches further from any deterministic anchor.
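The normalization step matters more than it looks, because two spellings of the same address hash to unrelated values. A minimal sketch using Python's standard `hashlib`, with `normalize_email` and `hash_email` as hypothetical helper names; real pipelines often add provider-specific rules, which are out of scope here:

```python
import hashlib


def normalize_email(raw: str) -> str:
    """Trim surrounding whitespace and lowercase before hashing."""
    return raw.strip().lower()


def hash_email(raw: str) -> str:
    """Return the hex SHA-256 digest of the normalized address."""
    return hashlib.sha256(normalize_email(raw).encode("utf-8")).hexdigest()


# Both spellings resolve to the same join key once normalized.
same_key = hash_email("  Jane.Doe@Example.com ") == hash_email("jane.doe@example.com")
print(same_key)
```

Skipping normalization on either side of a join silently lowers the match rate without producing any error, which is why identifier hygiene shows up as a spread in the observed ranges.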

IP plus user-agent fingerprinting (the most common fully probabilistic method in the absence of a login) typically resolves at 55 to 70 percent confidence per match in the data we have seen, with the confidence dropping sharply when the operator is dealing with shared IPs (carrier-grade NAT on mobile, corporate networks, large household routers) or with browsers that ship anti-fingerprinting defenses (Safari, Firefox with resistFingerprinting, Brave). Active fingerprinting (canvas, WebGL, audio context) can push confidence higher but is increasingly visible to privacy-monitoring tools and to the browsers themselves, and is now formally restricted by WebKit's Tracking Prevention Policy (which governs Safari) and effectively deprecated in Chrome's Privacy Sandbox documentation.

At the far end of the spectrum is the cohort-based approach: the user is not resolved at all; instead, the user is assigned to one or more cohorts whose membership is exposed to advertisers without revealing the individual. Chrome's Topics API and Protected Audience (formerly FLEDGE) are the public examples. The cohort approach trades per-user resolution for population-level signal: the advertiser cannot identify a person, only a probabilistically defined group.

Identifier Confidence by Resolution Method, Across Partner Operators (2023-2024)

The chart should be read as a tradeoff curve. High confidence comes with low match rate (the user must be logged in or anchored to a deterministic record), and high match rate comes with low confidence (everyone is in a Topics cohort, but nobody is individually resolved). The operating decision is not which point on the curve is best; it is what blend across the curve fits the use case. A bid request needs high match rate even at modest confidence. A revenue-attribution model needs high confidence even at low match rate. A churn-prediction feature inside a CDP wants both, knows it cannot have both, and chooses how to weight them.


What the Privacy Sandbox Actually Does (and Does Not)

Chrome's Privacy Sandbox is the most consequential single piece of infrastructure in the cookieless transition, even after the July 2024 reversal that walked back the default third-party-cookie deprecation. The Sandbox is best understood not as a single replacement for cookies but as a set of APIs, each addressing a specific advertising-ecosystem use case that cookies previously handled.

Topics API assigns each browser to a small set of high-level interest topics (selected from a curated taxonomy of a few hundred topics) based on the sites the user has recently visited. The browser exposes the topics to the publisher, who can pass them to bidders. Topics are computed locally; they are time-bounded; and they are intentionally coarse, so that no single topic identifies a specific person.

Protected Audience API replaces the remarketing use case. An advertiser tags an interest group on a user's browser when the user visits, say, a product page. Later, when the user visits a publisher site that runs a Protected Audience auction, the browser runs the auction locally and picks an ad. The advertiser never learns the auction outcome at the per-user level; aggregated reporting is provided through the Attribution Reporting API.

Attribution Reporting API provides two reporting modes: event-level (a single noised event per conversion, with strict per-source limits) and aggregated (summary reports across many conversions, with differential-privacy noise added). The aggregation server is run by the publisher's choice of aggregation service provider, with cross-checks against tampering.

The Sandbox is sufficient for population-level advertising decisions and is insufficient for per-user attribution. Operators who tried to build a "cookieless equivalent of MTA" on Sandbox APIs typically discovered that the noise budget consumed by the per-user attribution made the model uninformative at the segment level. The honest read is that Sandbox APIs are good at "should this ad be served" and bad at "did this specific user convert because of that specific impression." The operators we have seen succeed with the Sandbox use it as one component of a measurement stack alongside CAPI, clean rooms, and incrementality testing, not as a replacement for any single existing system.
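Why noise that is tolerable at the campaign level becomes uninformative at the user level can be shown with a generic Laplace mechanism. This is a sketch under stated assumptions, not Chrome's implementation: it assumes noise of a fixed scale is added to every reported count regardless of the count's size, and `NOISE_SCALE` is a hypothetical value picked for illustration.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_count(true_count: int, scale: float, rng: random.Random) -> float:
    """Report a count with fixed-scale noise added."""
    return true_count + laplace_noise(scale, rng)


def expected_relative_error(true_count: int, scale: float) -> float:
    """Expected |noise| for Laplace(0, b) is b, so relative error is b / count."""
    return scale / true_count


NOISE_SCALE = 200.0  # hypothetical, for illustration only
rng = random.Random(7)

for label, count in [("whole campaign", 40_000), ("one segment", 400), ("one user path", 1)]:
    reported = noisy_count(count, NOISE_SCALE, rng)
    print(label, round(reported, 1), expected_relative_error(count, NOISE_SCALE))
```

Under these assumptions the expected error is half a percent of a 40,000-conversion campaign, half of a 400-conversion segment, and two hundred times a single user's path, which is the shape of the problem operators hit when they tried to rebuild per-user attribution on aggregated reports.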


CDPs, Identity Graphs, and the Resolution Layer

A Customer Data Platform's central technical job, beneath the marketing language, is identity resolution. The pixel fires, the SDK call lands, the API event arrives. The CDP receives these events, each carrying a different identifier set (anonymous device ID here, hashed email there, account ID on a third event), and decides which of them belong to the same person. The downstream activations (segmentation, journey orchestration, ad audience export) are only as good as that resolution layer.

The resolution layer is graph-shaped. Each identifier is a node. Edges are created when two identifiers co-occur in the same event (a hashed email and an anonymous ID on the same page view) or are linked by a third-party graph provider. Edge weights encode confidence: an edge created by a logged-in event has weight near 1; a device-graph edge anchored to a deterministic login has weight 0.85 to 0.95; an edge created by a probabilistic fingerprint match has weight 0.55 to 0.70.

The resolution algorithm walks the graph, collapses identifiers above a confidence threshold into a single "person" entity, and exposes that entity to downstream consumers. Common implementations are based on connected-components in a weighted graph (the academic literature on entity resolution, Christophides et al., is a good reference), with practitioner additions for time decay (an edge created two years ago counts less than one created last week) and for explicit user-driven merges or splits (the user logs in on a new device and links it to their account).
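A minimal sketch of that walk, assuming a union-find structure over identifier nodes, a one-year half-life for time decay, and a 0.85 merge threshold. The half-life, the function names, and the sample identifiers are all assumptions made for the example, not any vendor's actual rules.

```python
from collections import defaultdict


def decayed_weight(weight: float, age_days: float, half_life_days: float = 365.0) -> float:
    """Halve an edge's confidence for every `half_life_days` of age (assumed decay rule)."""
    return weight * 0.5 ** (age_days / half_life_days)


def resolve(edges, threshold: float = 0.85):
    """Group identifiers into persons: connected components over edges at or above threshold."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b, weight, age_days in edges:
        root_a, root_b = find(a), find(b)  # registers both nodes even if not merged
        if root_a != root_b and decayed_weight(weight, age_days) >= threshold:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


edges = [
    ("email:abc", "device:laptop", 0.99, 3),        # logged-in event
    ("email:abc", "device:phone", 0.92, 30),        # device-graph edge, recent
    ("device:phone", "device:tv", 0.62, 10),        # fingerprint match, below threshold
    ("email:abc", "device:old-tablet", 0.95, 730),  # strong once, decayed after two years
]
people = resolve(edges)
print(people)
```

One design caveat: connected components are transitive, so a chain of individually acceptable edges can join two identifiers that were never directly linked. That is one route to the over-merging failure mode, and it is why the merge threshold is usually set conservatively.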

The CDP identity-resolution graph and its downstream consumers

The single most common failure mode we observe in CDP identity resolution is over-merging: two users with overlapping IPs (a household, an office, a shared device) get collapsed into one person, and the downstream personalization is now mistargeted. The defense is a conservative merge threshold (we have seen 0.85 or higher work in partner data), explicit handling for shared-IP signals, and a regular audit of merged entities that the operator can spot-check.

The second-most-common failure mode is under-merging: the same user shows up as four people because the resolution layer was too conservative or because the deterministic anchors never landed. The cost surfaces as duplicate ad impressions, inflated unique-user counts, and journey messaging that contradicts itself across devices. The defense is investing in deterministic anchors at every possible product moment (newsletter signup, account creation, post-purchase email capture) so that the graph has anchors to walk toward.

CDP Identity-Resolution Confidence Thresholds by Activation Use Case, Practitioner Defaults

| Use Case | Recommended Confidence Threshold | Typical Match Rate at Threshold | Cost of False Match |
|---|---|---|---|
| Email send (transactional) | 0.95+ | 14-21% | High: regulatory and trust |
| Email send (marketing) | 0.85+ | 27-38% | Medium: unsubscribe + spam complaint |
| Display ad audience export | 0.65+ | 52-68% | Low: wasted impression only |
| In-product personalization | 0.80+ | 32-43% | Medium: confusion + trust loss |
| Churn-risk scoring | 0.75+ | 41-54% | Low to medium: false-positive interventions |
| Look-alike modeling seed | 0.70+ | 47-61% | Low: dilutes model quality only |
| Revenue attribution | 0.90+ | 18-27% | High: feeds into financial reporting |

The table is the operating point of the CDP: different downstream consumers tolerate different false-match rates, and the resolution layer should expose a confidence score rather than a binary "this is the same person." Activation systems that demand a binary force the CDP to pick a single threshold, and that single threshold is wrong for at least half of the downstream uses.
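Exposing a score rather than a binary lets each consumer apply its own floor. A minimal sketch, using the thresholds from the table above; the `eligible` function, the use-case keys, and the sample profiles are hypothetical:

```python
# Confidence floors per activation use case, taken from the practitioner-default table.
THRESHOLDS = {
    "email_transactional": 0.95,
    "email_marketing": 0.85,
    "display_audience_export": 0.65,
    "in_product_personalization": 0.80,
    "churn_risk_scoring": 0.75,
    "lookalike_seed": 0.70,
    "revenue_attribution": 0.90,
}


def eligible(profiles: dict, use_case: str) -> list:
    """Return the profile IDs whose resolution confidence clears the use case's floor."""
    floor = THRESHOLDS[use_case]
    return [pid for pid, confidence in profiles.items() if confidence >= floor]


# Hypothetical resolved profiles: profile ID -> resolution confidence.
profiles = {"p1": 0.99, "p2": 0.88, "p3": 0.72, "p4": 0.66, "p5": 0.40}

for use_case in ("email_transactional", "lookalike_seed", "display_audience_export"):
    print(use_case, eligible(profiles, use_case))
```

The same five profiles yield one addressable person for a transactional email and four for a display export, which is the tradeoff the table encodes.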


Clean Rooms and the Two-Sided Resolution Problem

A clean room (Snowflake Data Clean Room, AWS Clean Rooms, Google Ads Data Hub, LiveRamp's Safe Haven, the Habu family of products) is a controlled environment where two parties (typically an advertiser and a publisher, or two advertisers, or a brand and a retailer) can join their data without either side seeing the other's row-level records. The clean room enforces aggregation thresholds, restricts the queries that can be run, and returns only outputs that meet a minimum cohort-size criterion.

Clean rooms are the structural answer to the question "how do we measure advertising performance when the platform no longer shares user-level data and the cookie no longer joins our data to the platform's?" The answer is that both parties contribute their first-party identifiers to a neutral environment, the environment performs a join on whatever overlapping identifiers exist (typically hashed email, hashed phone, or a LiveRamp RampID), and the environment returns aggregated metrics: overlap size, joint conversion rate, lift, frequency distribution.

The match rate inside the clean room is the analogue of the CDP's resolution match rate, with one important difference: the clean room operates on the intersection of two parties' deterministic identifier sets, and that intersection can be no larger than the smaller of the two sets and is usually far smaller. A retailer with 14 percent logged-in traffic and a publisher with 28 percent logged-in traffic will see clean-room match rates not of 14 percent or 28 percent but of the joint intersection, often 4 to 9 percent in partner data we have observed, depending on the audience overlap.
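The intersection arithmetic, together with the minimum-cohort rule a clean room enforces on outputs, can be sketched with plain sets. The data is synthetic: a shared audience of 1,000 visitors, a retailer holding identifiers for 140 of them, a publisher holding 280, and a hypothetical minimum cohort size of 50. Function names are illustrative and do not correspond to any vendor's API.

```python
from collections import Counter

MIN_COHORT_SIZE = 50  # hypothetical aggregation threshold


def clean_room_overlap(advertiser_ids, publisher_rows, min_cohort_size=MIN_COHORT_SIZE):
    """Count overlapping IDs per publisher segment, suppressing small cohorts.

    publisher_rows is an iterable of (hashed_id, segment) pairs.
    """
    advertiser = set(advertiser_ids)
    counts = Counter(seg for hashed_id, seg in publisher_rows if hashed_id in advertiser)
    return {seg: n for seg, n in counts.items() if n >= min_cohort_size}


def intersection_match_rate(advertiser_ids, publisher_ids, audience_size):
    """Share of the total audience that both parties can deterministically identify."""
    return len(set(advertiser_ids) & set(publisher_ids)) / audience_size


# Synthetic identifiers standing in for hashed emails.
advertiser_ids = [f"h{i}" for i in range(140)]                       # 14% of 1,000
publisher_rows = [(f"h{i}", "sports" if i < 130 else "news")
                  for i in range(80, 360)]                           # 28% of 1,000

report = clean_room_overlap(advertiser_ids, publisher_rows)
rate = intersection_match_rate(advertiser_ids, [h for h, _ in publisher_rows], 1000)
print(report, rate)
```

In this synthetic case the two parties share 60 identifiers, a 6 percent match rate, and the ten-person segment is suppressed entirely rather than reported. Both effects shrink what a clean-room query can say, which is why a pilot is the only reliable way to learn the real joint match rate.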

LiveRamp's RampID and equivalent third-party identity-graph services exist in part to widen this intersection. RampID maps a hashed email to a stable cross-publisher identifier, so that the clean-room join can be performed on RampID rather than on hashed email, and the intersection includes users for whom both parties have a different deterministic anchor that maps to the same RampID. The lift is meaningful: in partner data we have seen, the RampID intersection is typically 1.4 to 2.3 times the hashed-email intersection alone, with the larger lifts in advertiser-publisher pairs that have low direct identifier overlap.

Clean-Room Match Rates by Identifier Strategy, Retailer-Publisher Pairs, Across Partner Operators (2023-2024)

The chart shows a pattern we see consistently: deterministic-anchored methods (email, phone, RampID) hold or improve over time as the operator's first-party data hygiene improves, while probabilistic-anchored device graphs degrade slowly as browser defenses tighten and as the IP space becomes noisier through carrier NAT and VPN adoption. The directional implication is straightforward, even if the magnitudes vary: the time horizon over which operators should expect device-graph performance to remain stable is short, and the investment that compounds is in first-party identifier collection.


The Operating Discipline: Logging Confidence, Not Just Identifiers

The instrument that lets an operator survive the cookieless transition is not a particular vendor or a particular API. It is the discipline of logging confidence alongside every identifier resolution, and of allowing downstream systems to consume that confidence rather than collapsing it to a binary.

The operational shape of this discipline is straightforward. Every event in the warehouse carries one or more identifier fields and, alongside each, a confidence score and a resolution-method tag. A row in the events table might carry user_id_email_hash, user_id_account, user_id_device_graph, with corresponding confidence_email, confidence_account, confidence_device_graph columns. The downstream consumers (the attribution model, the CDP audience, the retention dashboard, the look-alike modeling pipeline) read both the identifier and its confidence, and each makes its own decision about which identifier to use under what threshold.
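A minimal sketch of such a row and of a consumer choosing among its identifiers. The column names follow the ones above; the `EventRow` class, the `best_identifier` helper, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EventRow:
    event_id: str
    user_id_email_hash: Optional[str] = None
    confidence_email: float = 0.0
    user_id_account: Optional[str] = None
    confidence_account: float = 0.0
    user_id_device_graph: Optional[str] = None
    confidence_device_graph: float = 0.0


def best_identifier(row: EventRow, min_confidence: float):
    """Return (method, value, confidence) for the strongest usable identifier, or None."""
    candidates = [
        ("email_hash", row.user_id_email_hash, row.confidence_email),
        ("account", row.user_id_account, row.confidence_account),
        ("device_graph", row.user_id_device_graph, row.confidence_device_graph),
    ]
    usable = [c for c in candidates if c[1] is not None and c[2] >= min_confidence]
    return max(usable, key=lambda c: c[2]) if usable else None


# An anonymous session resolved only through the device graph.
row = EventRow(event_id="evt-1", user_id_device_graph="dg-42", confidence_device_graph=0.72)

print(best_identifier(row, 0.65))  # a display-audience consumer can use this row
print(best_identifier(row, 0.90))  # a revenue-attribution consumer cannot
```

The same row is usable for one consumer and excluded for another, with no change to the stored data. That is the property a single binary "resolved" flag cannot provide.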

The discipline pays off in three ways. First, the operator can answer the question "how confident are we in this number" with an actual number, not a hand-wave. When the CFO asks why the channel attribution shifted, the analytics team can show that the underlying identifier confidence changed (a Safari version update degraded device-graph confidence, a CDP merge rule was tightened, a login push lifted the deterministic share) rather than chase a phantom market shift. Second, the operator can run sensitivity analysis: how much does the attribution change if we set the confidence threshold to 0.7 versus 0.85? If the answer is "a lot," the metric is fragile and the team should improve identifier hygiene before doing anything else. If the answer is "barely," the team has headroom to trust the metric. Third, the operator can stage the cookieless transition without a flag day: if and when third-party cookies disappear entirely from Chrome (whether that is in 2026, in 2028, or on no fixed date at all), the systems that were already consuming confidence-weighted identifiers continue to work, just with a different mix of identifier types.
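The sensitivity check in the second point is cheap to run once confidence is logged. A minimal sketch, assuming each conversion carries the confidence of the identifier that linked it to a channel; the function names and the sample conversions are hypothetical.

```python
def attributed_revenue(conversions, threshold: float) -> dict:
    """Sum revenue per channel, counting only matches at or above the threshold.

    conversions is an iterable of (channel, revenue, confidence) tuples.
    """
    totals = {}
    for channel, revenue, confidence in conversions:
        if confidence >= threshold:
            totals[channel] = totals.get(channel, 0.0) + revenue
    return totals


def sensitivity(conversions, low: float = 0.70, high: float = 0.85) -> dict:
    """Fraction of each channel's revenue that disappears when the threshold is tightened."""
    loose = attributed_revenue(conversions, low)
    strict = attributed_revenue(conversions, high)
    return {ch: (loose[ch] - strict.get(ch, 0.0)) / loose[ch] for ch in loose if loose[ch]}


conversions = [
    ("paid_social", 100.0, 0.95),
    ("paid_social", 100.0, 0.72),  # device-graph match
    ("paid_social", 50.0, 0.78),   # device-graph match
    ("email", 200.0, 0.99),
    ("email", 50.0, 0.90),
]

print(sensitivity(conversions))
```

In this made-up data the email channel does not move at all while paid social loses most of its credited revenue when the threshold tightens. The first metric has headroom; the second is resting on probabilistic matches and should be treated as fragile.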

The instrumentation effort to log confidence is smaller than most teams initially fear. The pattern we have seen work is to extend the existing event schema with two additional columns per identifier (<identifier>_value and <identifier>_confidence), populate them in the CDP's resolution pipeline, and propagate them downstream. The CDP already computes the resolution decisions, so the confidence number is a byproduct of work the CDP is doing anyway. The cost is in exposing the field to the warehouse and to the downstream consumers, which is mostly a documentation and integration effort rather than a new technical capability. A team with a working CDP and a working warehouse can typically wire the full confidence schema in 4 to 8 engineer-weeks.

The harder organizational shift is getting downstream consumers to actually use the confidence field. The attribution team, the CDP audience team, the personalization team, and the look-alike modeling team have each built their queries against the existing identifier column and need to be persuaded to consume the confidence column as well. The pattern that works is to start with a single high-visibility consumer (typically the executive attribution dashboard), demonstrate that the confidence-aware version reconciles better with the source-of-truth financial number, and use that as the lever to bring the other consumers along. The teams that try to migrate every consumer simultaneously typically stall; the teams that pick a single consumer first and demonstrate the value build organizational momentum that carries the rest.

Identifier Strategy Investment Mix by Archetype, Practitioner Defaults

| Archetype | First-Party Login Push | Server-Side CAPI Investment | Clean Room Adoption | Device Graph Reliance |
|---|---|---|---|---|
| Consumer commerce (no membership) | High: loyalty + post-purchase email | High: Meta + Google CAPI minimum | Medium: retailer-publisher pilots | Decreasing |
| Consumer commerce (loyalty) | Sustaining: already strong | High | High: retail media network plays | Low |
| Content publisher (subscription) | High: registration walls + newsletter | Medium: contextual + first-party | High: data collaboration core | Low |
| Content publisher (ad-only) | Medium: newsletter is the lever | Low: limited conversion events | High: SSP + clean-room joint reports | Medium |
| Marketplace (two-sided) | Sustaining: sellers locked in | High: buyer-side CAPI essential | Medium: brand-side pilots | Medium |
| B2B SaaS (PLG) | Sustaining: login is the product | High: server-side first | Low: limited applicability | Low |
| Banking and fintech | Sustaining: regulatory baseline | Medium: compliance constraints | Low: data residency limits | Low |

The investment mix shifts with archetype, and the decisions are joint. A consumer commerce operator with low logged-in share has the strongest case for a loyalty push and a CAPI investment in parallel; an ad-only publisher with no conversion events has the strongest case for clean-room partnerships and newsletter-driven first-party data. Portfolio thinking is what feels new about the cookieless world, although it should always have been there.


What Breaks When You Pretend Probabilistic Is Deterministic

The failure modes that surface when an operator treats a probabilistic identifier as if it were deterministic are predictable and, once seen, embarrassing. The CDP merges two real users into one because their IP and user-agent overlap on the home network of a household, and the personalization layer starts addressing each user by the other's purchase history. The attribution model gives credit to a touchpoint that was on a different person's device entirely. The frequency-cap rule undercounts impressions because the probabilistic graph split a single user across three nodes.

Each of these errors compounds. The CDP's incorrect merge contaminates the audience export to the ad platform, which contaminates the look-alike model, which contaminates the next campaign's targeting. The attribution error contaminates the spend allocation, which contaminates the budget decision for the next quarter, which contaminates the channel mix for the year. No individual error is catastrophic. The cumulative drift is.

The operating defense is the explicit confidence-weighting we described above, plus a periodic audit of merges (and splits) that look anomalous. A CDP that merged two users at 0.62 confidence and is now sending one of them the other's "we noticed you abandoned your cart" email is the visible tip of a larger pattern. The audit catches the visible cases and surfaces the systemic threshold that is producing them.

The audit cadence we have seen work in partner data is quarterly for the systemic-threshold review (does the merge threshold need to move, has the underlying identifier confidence distribution shifted) and continuous for the visible-case review (every customer complaint about misaddressed personalization is triaged against the identity graph). The combination catches both the slow drift (the threshold that was right last year is wrong this year because the underlying identifier mix has changed) and the acute failures (a single high-visibility incident that surfaces a systemic issue affecting many users).
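The visible-case half of that audit can start as a simple scan over merged entities. A minimal sketch, assuming each resolved person records its deterministic anchors and the weakest edge used in the merge; the two flagging rules, the field names, and the sample data are assumptions for illustration. A person can legitimately own two email addresses, so a flag means "review", not "split".

```python
def audit_merges(persons: dict, review_below: float = 0.85, max_anchors: int = 1) -> dict:
    """Flag merged entities that deserve a manual spot-check.

    persons maps person_id -> {"anchors": set of hashed emails,
                               "min_edge_confidence": weakest edge used in the merge}.
    """
    flagged = {}
    for person_id, person in persons.items():
        reasons = []
        if len(person["anchors"]) > max_anchors:
            reasons.append("multiple deterministic anchors")
        if person["min_edge_confidence"] < review_below:
            reasons.append("merged below review threshold")
        if reasons:
            flagged[person_id] = reasons
    return flagged


persons = {
    "person-a": {"anchors": {"e1"}, "min_edge_confidence": 0.93},
    "person-b": {"anchors": {"e2", "e3"}, "min_edge_confidence": 0.62},  # likely household merge
    "person-c": {"anchors": set(), "min_edge_confidence": 0.70},
}

for person_id, reasons in audit_merges(persons).items():
    print(person_id, reasons)
```

Tracking the share of flagged entities quarter over quarter gives the systemic-threshold review something concrete to look at: a rising share suggests the identifier mix has shifted under a threshold that has not moved.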

The cookie was always probabilistic. Cookieless makes the probability legible. The operating discipline is to read the probability rather than round it to one or zero.

The teams that survive the transition with their measurement stack intact are the teams that internalized this discipline early. The teams that are still pretending that hashed email is the new cookie, and that the cookieless transition is solved once they have wired hashed email into everything, are still doing the cookie-era thing, just with a different identifier. The cookie's flaw was never that it was a cookie. The flaw was the binary treatment of an identifier that was always uncertain. Replacing the cookie with hashed email and continuing the binary treatment reproduces the same failure mode in a new vocabulary.


A Twelve-to-Eighteen-Month Migration: What Changes in Order

For an operator running on cookie-era identity infrastructure today and planning the cookieless transition, the order of investments matters. The pieces interact, and getting them out of sequence produces a year of debugging that the well-ordered sequence avoids.

  1. Identifier inventory and confidence taxonomy. Document every identifier the operator collects, where it is collected, and the deterministic anchor (or lack of one) for each. Assign a confidence band per identifier type. Without this baseline, every subsequent decision is made in the dark.

  2. First-party data hygiene push. The investment that compounds is in raising the deterministic-anchor share. Loyalty programs, newsletter registration, post-purchase email capture, paywall metering, login pressure at high-intent moments. This work is unsexy and slow and is the single highest-leverage move in the cookieless transition.

  3. Server-side tagging and CAPI integration. Move conversion events to a server-side container (Google Tag Manager Server-Side or equivalent), wire Meta CAPI and Google Enhanced Conversions, with proper deduplication against client-side fires. This is where the conversion-side match-rate lift comes from.

  4. CDP resolution-layer audit. Whether the operator runs a vendor CDP (Segment, mParticle, Tealium, Rudderstack) or built their own, audit the merge and split rules, expose confidence scores to downstream consumers, and establish the periodic merge-audit cadence.

  5. Clean-room pilot. Pick one partner relationship (a publisher, a retailer, a brand-collaboration partner) and run a clean-room measurement pilot. The pilot teaches the operator what the joint match rate actually is and what the operating cadence of a clean-room program feels like. Do not start with a full program; start with a single use case.

  6. Privacy Sandbox readiness. Implement Topics API consumption on publisher properties, Protected Audience interest-group tagging on advertiser properties where the use case calls for it. Treat Sandbox APIs as one input to the measurement stack, not as a replacement for any single existing system.

  7. Confidence-weighted attribution refresh. Rebuild the attribution model to consume the confidence-tagged identifier graph, with sensitivity analysis built in. Report the attribution numbers with confidence intervals, not as point estimates.

The full sequence takes twelve to eighteen months for a mid-market operator, longer for an enterprise with many marketing properties. The teams that move faster than this generally skip the identifier inventory or the CDP audit, and the cost surfaces six months later when the numbers stop reconciling between systems.

The cookieless transition is not finished. It is in motion, in different stages across different browsers, with the Chrome timeline still uncertain. The discipline that makes the transition survivable is the discipline of accepting that identifiers are probabilistic, that confidence is the operating variable, and that no single replacement is coming. The operators who internalized this in 2020 are now three years ahead of the ones who are still waiting.


Key Takeaways

  1. The cookie was always probabilistic. Cookieless makes the probability visible. The operating shift is from hidden uncertainty to legible uncertainty, not from certainty to uncertainty.

  2. The match-rate problem anchors every identity-resolution decision. Logged-in traffic share (8 to 18 percent for consumer commerce, 47 to 78 percent for SaaS) sets the deterministic ceiling, and the gap between that share and 100 percent is the probabilistic problem.

  3. Identity resolution is a graph problem with weighted edges. Confidence scores belong on edges, not collapsed to a binary, and downstream consumers should read the confidence rather than the system imposing a single threshold.

  4. The Privacy Sandbox APIs (Topics, Protected Audience, Attribution Reporting) are useful for population-level ad decisions and insufficient for per-user attribution. They are one component of a measurement stack, not a replacement for any single existing system.

  5. CDPs are identity-resolution systems first and activation systems second. The most common failure mode is over-merging (collapsing two real users into one); the second-most-common is under-merging (splitting one user into many). Both have specific defenses and both require periodic auditing.

  6. Clean rooms join the deterministic intersection of two parties' identifier sets, which can be no larger than the smaller set and is usually far smaller. Match rates are often 4 to 9 percent for retailer-publisher pairs. RampID and equivalent identity-graph services widen the intersection by 1.4 to 2.3 times in partner data we have observed.

  7. The portfolio approach is what is new. Every operator runs four to six identifier types simultaneously in the cookieless world, with different confidence profiles, and the operating skill is in the blending. There is no single replacement coming.

  8. The investments that compound are upstream: loyalty programs, login pressure, server-side CAPI integration, first-party data hygiene. The investments that do not compound are downstream cleverness (better attribution models on a poor identifier base).

  9. The teams that internalized probabilistic identity in 2020 are three years ahead of the teams that are still waiting for the post-cookie cookie. The transition is in motion and is not waiting for vendor consensus.
