TL;DR: Lighthouse lab scores and Chrome User Experience Report (CrUX) field data measure related but materially different constructs, and the gap between them is the single most common reason performance work fails to translate into conversion gains. Field data correlates with conversion meaningfully; lab data correlates with conversion weakly and unpredictably. The right operating discipline is to use lab data for diagnosis and field data for optimisation targets, with the conversion-rate uplift estimated on the field-LCP slope rather than on the Lighthouse score. The published case studies (Vodafone, Walmart, Pinterest, Booking.com via web.dev) all show field-data improvements producing measurable revenue lift; lab-only improvements often produce neither.
A note on tools and brands. Vodafone, Walmart, Pinterest, Booking.com, Akamai, and the named SaaS retailers appear in this essay because their public case studies, talks, and engineering blogs are the available evidence base. Quantitative claims framed as advisory-engagement observation come from anonymized partner operators, not from the named companies. Public claims are attributed inline to their source.
The Lab-Field Gap
The most common version of the performance-optimisation cycle in 2025 and 2026 looks like this. A team runs Google PageSpeed Insights on a key landing page. The lab score is 32 out of 100, with red flags on Largest Contentful Paint, Total Blocking Time, and Cumulative Layout Shift. Engineering spends six weeks shipping the recommended optimisations: image lazy-loading, JavaScript splitting, font preloading, render-blocking-resource removal, CDN cache-rule tightening. The lab score rises to 91. The team declares victory.
Six months later, conversion rate is unchanged. The CrUX field data, which the team had not been monitoring, shows that real-user Largest Contentful Paint moved from a 75th-percentile value of 4.2 seconds to 3.8 seconds. The improvement is real but marginal. Measured in the lab, on a developer machine with a simulated 4G connection and no third-party scripts active, the optimisation looked dramatic. Measured in the field, across the actual distribution of users, devices, network conditions, and third-party loads, it looked modest. The conversion-rate sensitivity to LCP in the field-relevant range was small enough that the 400 ms improvement did not show up against the noise floor.
This is the lab-field gap. It is structural, well-documented, and routinely underweighted by both performance engineering teams and conversion optimisation teams. The literature, the case studies, and the field telemetry all converge on the same point: lab scores are a starting diagnostic; field data is the optimisation target; the conversion impact runs through the field-data distribution, not through the lab-score gauge.
This essay maps the gap, reviews the case-study literature, and proposes an operating discipline for performance work that translates into conversion. The first half covers the measurement side (what lab and field actually measure, why they disagree, what the published case studies show). The second half covers the operating side (how to set targets, how to estimate conversion sensitivity, when lab optimisation will not translate and what to do about it).
What Lighthouse Measures
Lighthouse, the engine behind PageSpeed Insights' lab section, runs a simulated load of the target URL in a controlled environment. The environment is specified: a simulated mid-range mobile device (a Motorola Moto G4 equivalent, throttled to 4x CPU slowdown from a desktop baseline), a simulated slow 4G connection (in the default mobile profile, roughly 1.6 Mbps down, 750 Kbps up, and 150 ms round-trip time), no extensions, no cookies, no logged-in state, no third-party scripts beyond what the page explicitly loads. The Lighthouse score is a weighted composite of the metrics measured in that single load: First Contentful Paint, Speed Index, Largest Contentful Paint, Total Blocking Time, and Cumulative Layout Shift, with the weights adjusted across Lighthouse versions (the 10.x scoring rubric is the current standard as of 2025).
The simulation is designed to be repeatable. The same URL on the same Lighthouse version produces approximately the same score across runs, with residual variance from cache warming and throttling variability. This repeatability is the lab metric's main virtue: it is auditable and comparable across builds, which makes it suitable for CI-pipeline performance budgets and for cross-team accountability.
The simulation is also unrepresentative of most real-world loads, in three structural ways.
The first is the user device. The simulated Moto G4 is intentionally a slow device; the median real-world mobile device in many markets is faster, while devices in lower-income markets and older smartphones can be substantially slower. The lab score is a single point on a distribution; the field score reflects the whole distribution.
The second is the network. The simulated 4G connection ignores the substantial fraction of real users on 5G (faster), 3G or congested 4G (slower), or wifi connections of widely varying quality. Real-world LCP distributions are bimodal or multimodal in many catalogues, with one cluster of fast loads and one or more clusters of slow loads driven by network conditions.
The third is the page state. The lab load is a cold load: no cookies, no logged-in state, no personalisation. The real user load is frequently a warm-cache load in a logged-in state with user-specific content, with third-party scripts firing in their actual configuration and personalisation calls completing in their actual timing. The lab measurement abstracts all of this away in the name of reproducibility.
The implication is that Lighthouse measures a specific, well-defined scenario that does not occur for most real users. When the team optimises the lab scenario, they are optimising one point in a high-dimensional distribution, and whether that point is representative of the conversion-relevant distribution depends on the page.
What CrUX Field Data Measures
The Chrome User Experience Report (CrUX) is the public-facing dataset of real-user performance metrics, collected from Chrome users who have opted in to anonymous usage statistics and who are browsing the eligible web. The dataset is aggregated at the origin level (the entire domain) and the URL level (where sufficient data exists), with metrics reported as 28-day trailing windows.
CrUX measures the same Core Web Vitals metrics (Largest Contentful Paint, Interaction to Next Paint, Cumulative Layout Shift) plus additional metrics (First Contentful Paint, Time to First Byte, navigation type, device type), but the measurement is the distribution of real Chrome loads, not a single simulated load. The headline numbers are typically the 75th percentile of each metric, which is the percentile at which Google assesses a page against its published "Good" thresholds.
The "Good" thresholds, as of 2024 and 2025, are LCP under 2.5 seconds at the 75th percentile, INP under 200 milliseconds at the 75th percentile, and CLS under 0.1 at the 75th percentile. A page passes Core Web Vitals if it meets all three thresholds for the relevant device type (mobile, desktop, or both).
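The pass rule above can be sketched in a few lines. This is a minimal illustration, not an official implementation: the function and key names are hypothetical, and treating a value exactly at the threshold as passing is an assumption.

```python
# "Good" thresholds at the 75th percentile, as stated in the text above.
GOOD_THRESHOLDS = {
    "lcp_ms": 2500,  # Largest Contentful Paint, milliseconds
    "inp_ms": 200,   # Interaction to Next Paint, milliseconds
    "cls": 0.1,      # Cumulative Layout Shift, unitless
}


def assess_core_web_vitals(p75: dict) -> dict:
    """Return a pass flag per metric plus an overall 'passes' flag.

    Assumption: a value exactly at the threshold counts as passing.
    """
    result = {name: p75[name] <= limit for name, limit in GOOD_THRESHOLDS.items()}
    # The page passes only if every individual metric passes.
    result["passes"] = all(result.values())
    return result


# Hypothetical field values: good INP and CLS, but a slow 75th-percentile LCP.
report = assess_core_web_vitals({"lcp_ms": 3800, "inp_ms": 180, "cls": 0.05})
print(report)
```

A single failing metric fails the page, which is why a site with excellent CLS and INP can still sit in the failing bucket on LCP alone.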
The field data has three properties that make it materially different from lab data. First, it represents the real distribution. The 75th-percentile LCP captures the experience of the worst-quartile users, who are typically on slower devices, slower networks, or both. Optimising the 75th percentile usually means addressing the structural causes of the slow tail rather than the median or the fast head. Second, it includes third-party effects. Real users load real ads, real analytics, real chat widgets, real personalisation calls. The lab does not. Third, it includes warm-cache effects. Many real loads benefit from prior visits, CDN cache hits, browser cache, service worker cache. The lab cold-loads every time.
The trade-off is that field data is slower, less granular, and less actionable than lab data. CrUX reports on a 28-day window, which means engineering changes do not show up in CrUX for weeks. The data is aggregated, so individual-page debugging requires supplementary tooling (the Real User Monitoring market exists for this reason: SpeedCurve, Sentry, Datadog RUM, New Relic, and Akamai mPulse all instrument the equivalent of CrUX at finer granularity).
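For teams that want the raw 75th-percentile values rather than the PageSpeed Insights summary, the sketch below shows one way to build a request to the public CrUX API and pull the p75 figures out of a response. It assumes the `records:queryRecord` endpoint and the response shape (`record.metrics.<name>.percentiles.p75`) as commonly documented; verify both against the current API reference before relying on them. The helper names, the API key placeholder, and the sample payload are hypothetical.

```python
import json
import urllib.request

# Assumed endpoint of the public CrUX API; confirm against current documentation.
CRUX_ENDPOINT = "https://chromeuxreport.googleapis.com/v1/records:queryRecord"


def build_crux_request(origin: str, api_key: str, form_factor: str = "PHONE") -> urllib.request.Request:
    """Build (but do not send) a POST request for origin-level field data."""
    body = json.dumps({"origin": origin, "formFactor": form_factor}).encode("utf-8")
    return urllib.request.Request(
        f"{CRUX_ENDPOINT}?key={api_key}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_p75(response: dict) -> dict:
    """Map metric name to its p75 value, skipping metrics without percentiles."""
    metrics = response["record"]["metrics"]
    return {
        name: float(metric["percentiles"]["p75"])  # some p75 values arrive as strings
        for name, metric in metrics.items()
        if "percentiles" in metric
    }


# Hypothetical, abbreviated payload in the assumed response shape.
sample_response = {
    "record": {
        "metrics": {
            "largest_contentful_paint": {"percentiles": {"p75": 3800}},
            "cumulative_layout_shift": {"percentiles": {"p75": "0.08"}},
        }
    }
}
print(extract_p75(sample_response))
```

Sending the request is a single call, `urllib.request.urlopen(build_crux_request(origin, key))`, followed by `json.load` on the result; it is left out here so the sketch runs without network access or credentials.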
Why Lab and Field Diverge in Practice
The lab-field gap has consistent structural causes. The five most common we have seen across advisory engagements are the following.
Third-party scripts that defer during the lab load but fire eagerly in the field. Lab simulations frequently underrepresent third-party performance impact because lab loads complete before some third-party scripts even initialise. Real users sit on the page long enough for the third-party scripts to load, execute, mutate the DOM, and contribute to layout shifts. The lab's LCP and CLS look fine; the field's do not.
Authenticated-state penalties. Many landing pages perform substantially worse for logged-in users because authenticated state triggers personalisation calls, user-specific data fetching, header customisation, and cart-state reconciliation that anonymous loads skip. The lab measures anonymous; the field includes the authenticated mix.
Personalisation and A/B testing. Personalisation engines, recommendation engines, and A/B testing platforms typically inject content client-side after initial render. The injection contributes to CLS and can push back LCP. Lab loads either skip the injection or measure it at the wrong time.
Network heterogeneity by region. The lab's simulated 4G profile is a single calibration point: it is materially better than the median rural experience and materially worse than median fixed-line or 5G. CrUX, drawn from a global Chrome population, shows much wider network variation. Sites with traffic concentrated in slow-network regions can have Lighthouse scores that look fine and field LCPs that do not.
Device heterogeneity. The simulated Moto G4 is a calibration point, not the median device. Real-user device distributions vary by category: e-commerce skews mobile and often mid-range Android; B2B SaaS skews desktop and modern hardware. The lab calibration is consistent across categories; the field reality is not.
Figure: Lab and field measurement pipelines, with structural divergences.
The diagram captures the analytical posture that has worked in advisory engagements: treat the two measurements as answering different questions. The lab answers "what is the structural performance of this page when nothing external interferes?" The field answers "what is the experience of the actual users who visit this page?" The first question is useful for diagnosis. The second is the right target for conversion-relevant optimisation.
What the Case Studies Actually Show
The published case studies on page speed and conversion are dominated by a handful of large operators whose engineering blogs and conference talks have been picked up by Google and the wider web-performance community. The headline numbers from these case studies have been cited so often that they have become folklore, sometimes misquoted or oversimplified. The full studies are more nuanced.
Walmart (2012, public talk by Cliff Crocker). The original Walmart performance study found that for every 100 milliseconds of improvement in page load time, conversion rates rose by approximately 1 percent. The study was conducted on the Walmart.com mobile site in 2011 to 2012, measured load times as full-page-load (the older metric, predating LCP), and reported conversion as transactional sessions. The headline 1-percent-per-100ms ratio has been widely cited but its applicability is bounded: it reflects the speed-conversion sensitivity at Walmart's particular range of load times (multi-second loads typical of 2012 mobile web), in Walmart's particular catalogue (broad-line retail), with Walmart's particular conversion-rate baseline. Generalising the ratio to faster sites, narrower catalogues, or different user expectations is a stretch.
Pinterest (2017, public engineering blog by Zack Argyle). Pinterest reported that a 40-percent reduction in perceived wait time led to a 15-percent increase in search-engine traffic and signups. The metric in question was "perceived wait time" defined as a custom-engineered composite, not LCP or other Core Web Vitals (which did not exist until 2020). The reported improvement was on a redesigned mobile web flow, where multiple changes shipped together (faster server response, lazy-loaded images, simplified initial render). The 15-percent number reflects the joint effect, not the isolated effect of speed alone. Pinterest's engineering team has subsequently shared additional case studies showing similar directional effects for INP improvements (the new Core Web Vitals metric replacing FID), with the same caveats about attribution to specific interventions.
Vodafone (2021, web.dev case study, validated A/B test). Vodafone ran a controlled A/B test on landing pages comparing a Web-Vitals-optimised variant against the unoptimised baseline. The optimised page saw a 31 percent improvement in LCP and reported 8 percent more sales, 15 percent improvement in lead-to-visit rate, and 11 percent improvement in cart-to-visit rate. The test had 50-50 random traffic allocation and approximately 100K daily clicks per variant. The Vodafone study is one of the cleanest in the public literature because it is a randomised experiment rather than a before-after observation, which makes the causal claim defensible.
Booking.com (multiple public talks). Booking has shared multiple performance case studies over the years, including the often-cited finding that a 1-second increase in page load time reduces conversion by approximately 0.5 percent in their travel-booking flow. The number has held roughly stable across multiple years of measurement, with refinements as they have moved from full-page-load metrics to LCP and INP.
Akamai (2017, "Akamai Online Retail Performance Report"). The Akamai retail study, drawn from real-user monitoring across hundreds of retail sites, reported that a 2-second delay in page load time corresponded to a 103-percent increase in bounce rates and that a 1-second delay in load time reduced conversions by approximately 7 percent. The Akamai data is broadly consistent with the Walmart and Booking numbers but is reported in pooled-cross-site form, which makes it more representative as a benchmark but less applicable as a specific causal estimate for any single site.
Public Performance-Conversion Case Studies (Cited with Methodological Caveats)
| Operator | Year | Metric Studied | Reported Effect | Methodology Caveat |
|---|---|---|---|---|
| Walmart | 2012 | Full page load time | 1% conversion increase per 100ms improvement | Pre-LCP era; specific to mobile retail; not a controlled experiment |
| Pinterest | 2017 | Perceived wait time (custom composite) | 40% reduction yielded 15% signup increase | Custom metric; joint effect of multiple shipped changes |
| Vodafone | 2021 | LCP (field) | 31% LCP improvement yielded 8% sales increase | Randomised A/B test; cleanest causal claim in the public literature |
| Booking.com | Multiple years | Page load time, later LCP and INP | 1s slower yielded approximately 0.5% lower conversion | Multiple talks; specific to travel booking; treated as guidance not precision |
| Akamai | 2017 | Page load time (cross-site) | 2s delay yielded 103% bounce rate increase; 1s delay yielded 7% conversion reduction | Pooled across retail sites; benchmark not single-site causal estimate |
Two patterns are robust across all five studies. The first is that the effect direction is consistent: faster pages convert better. The second is that the magnitude estimates vary by an order of magnitude depending on the specific site, the specific metric, the specific baseline, and the specific study methodology. Anyone using the cited numbers as a planning input should treat them as order-of-magnitude guidance rather than as precise estimates applicable to their particular page.
The Conversion Sensitivity to LCP, in the Field
The right operating question is not "what is the cross-site average sensitivity?" but "what is the LCP-conversion sensitivity for this particular site?" This is a site-specific empirical question, and answering it requires the operator's own field telemetry rather than industry benchmarks.
The methodology is straightforward in principle. Segment real-user sessions by their LCP (or another Core Web Vitals metric). Compute the conversion rate within each segment. Plot the curve. The slope of the curve, in the relevant LCP range, is the site's conversion sensitivity.
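The segmentation step can be sketched as follows. This is a minimal illustration on hypothetical session records of the form `(lcp_ms, converted)`; the function name and the one-second bucket width are assumptions, and it is the naive version of the analysis, with no control for confounders.

```python
from collections import defaultdict


def conversion_by_lcp_bucket(sessions, bucket_ms=1000):
    """Group sessions into LCP buckets and compute the conversion rate in each.

    sessions: iterable of (lcp_ms, converted) pairs.
    Returns {bucket_start_ms: (session_count, conversion_rate)}, ordered by bucket.
    """
    counts = defaultdict(lambda: [0, 0])  # bucket -> [sessions, conversions]
    for lcp_ms, converted in sessions:
        bucket = int(lcp_ms // bucket_ms) * bucket_ms
        counts[bucket][0] += 1
        counts[bucket][1] += int(converted)
    return {b: (n, c / n) for b, (n, c) in sorted(counts.items())}


# Hypothetical sessions; real analyses need thousands of sessions per bucket.
sessions = [(800, True), (900, False), (1500, True), (2500, False), (2700, False)]
for bucket, (n, rate) in conversion_by_lcp_bucket(sessions).items():
    print(f"{bucket}-{bucket + 1000} ms: n={n}, conversion={rate:.1%}")
```

Returning the session count alongside the rate matters in practice: the slowest buckets usually hold the fewest sessions, so their rates are the noisiest points on the curve.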
The methodology is harder in practice. Confounders are everywhere. Users with slow LCP often have slow networks, which correlates with lower-end devices, which correlates with different demographics, different shopping intent, and different baseline conversion propensity. A naive LCP-conversion segmentation will attribute network-driven and demographic-driven conversion differences to the speed variable, overstating the true causal effect.
The clean version of the analysis controls for these confounders. The cleanest version is a randomised experiment, where some users are deliberately served a slower variant (often through artificial latency injection) and the conversion comparison is causal. Few teams run these experiments because they are uncomfortable: deliberately slowing your site for a percentage of users is a hard sell internally. The cleanest published example is the Vodafone study.
The second-cleanest version is a regression-with-controls approach: model conversion as a function of LCP plus a long list of control variables (device, network, region, time of day, prior session activity, traffic source). The residual coefficient on LCP, after controlling for the rest, is the conditional sensitivity. This is what most well-resourced teams actually do, and the results are typically smaller than the naive segmentation would suggest, often by a factor of two or three.
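A full regression is beyond a short example, so the sketch below swaps in a simpler technique, stratification on a single confounder (device tier), to show why controlled estimates shrink. All figures are hypothetical and chosen only to illustrate the mechanism; the function names are assumptions.

```python
# Hypothetical aggregated cells: device tier crossed with a fast/slow LCP split.
CELLS = [
    {"device": "high_end", "slow": False, "sessions": 800, "conversions": 40},
    {"device": "high_end", "slow": True, "sessions": 200, "conversions": 8},
    {"device": "low_end", "slow": False, "sessions": 200, "conversions": 4},
    {"device": "low_end", "slow": True, "sessions": 800, "conversions": 8},
]


def _rate(cells):
    """Pooled conversion rate across the given cells."""
    return sum(c["conversions"] for c in cells) / sum(c["sessions"] for c in cells)


def naive_gap(cells):
    """Fast-minus-slow conversion gap, ignoring device tier."""
    fast = [c for c in cells if not c["slow"]]
    slow = [c for c in cells if c["slow"]]
    return _rate(fast) - _rate(slow)


def stratified_gap(cells):
    """Session-weighted average of the fast-minus-slow gap within each device tier."""
    total = sum(c["sessions"] for c in cells)
    gap = 0.0
    for device in sorted({c["device"] for c in cells}):
        stratum = [c for c in cells if c["device"] == device]
        weight = sum(c["sessions"] for c in stratum) / total
        gap += weight * naive_gap(stratum)
    return gap


print(f"naive gap:      {naive_gap(CELLS):.3f}")
print(f"stratified gap: {stratified_gap(CELLS):.3f}")
```

In this constructed data the naive gap is 2.8 percentage points while the within-tier gap is 1.0, because low-end devices are both slower and less likely to convert regardless of speed. A regression with more controls applies the same idea across many variables at once.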
The curve above is illustrative, drawn from a stylised mobile e-commerce range and intended to convey shape rather than calibration. The shape that matters is the convexity: the marginal conversion lift from moving LCP from 6 seconds to 4 seconds is substantially larger than the lift from moving 2 seconds to 1 second, because the curve flattens as load times approach the perceptual instant. This convexity is consistent across the published case studies and across our partner data. It is also consistent with the perceptual-thresholds literature dating back to Miller (1968).
The operating implication is that performance work has diminishing returns in absolute speed terms but increasing returns at the slow tail. A site with 75th-percentile LCP at 5 seconds has substantial room to convert better at 3 seconds. A site already at 2.0 seconds has less to gain from moving to 1.5. The slowest-quartile users are where the conversion uplift concentrates.
Perceptual Thresholds and the Psychology of Wait
The perceptual literature on response time is older than the web and substantially richer than most performance discourse acknowledges.
Miller (1968), in "Response time in man-computer conversational transactions," published the classical thresholds: 100ms is perceived as instantaneous, 1 second preserves the user's flow of thought, 10 seconds is the limit of attention before the user begins to multi-task or leave. The thresholds were derived from terminal-and-mainframe interaction studies, but they have replicated in subsequent web-context research with broadly the same numbers.
Doherty and Thadhani (1982), in "The economic value of rapid response time," refined Miller's framework with productivity studies at IBM. They found that response-time improvements continued to produce productivity gains well below the 2-second mark that prevailing mainframe-era wisdom had treated as adequate, with productivity peaking near 400 milliseconds. The "Doherty threshold" of 400ms has been picked up by the UX community as the target for interactive feedback (the modern equivalent is the INP target of under 200 milliseconds for the 75th percentile).
Nielsen (1993), in "Response Time Limits," consolidated the Miller-Doherty findings for the web: under 100ms feels instant, under 1 second preserves attention, under 10 seconds is the upper bound for tolerable wait. The thresholds inform every modern Core Web Vitals discussion, even when they are not cited.
The psychological model that explains the convex curve in conversion data is straightforward. At very fast loads (under 1 second), the user has not formed an expectation of wait, the page appears, and the interaction is on the user's terms. At moderate loads (1 to 3 seconds), the user notices the wait but tolerates it, with conversion impact small but cumulative. At slow loads (3 to 8 seconds), the user begins to consider alternatives (back button, competing tabs, alternative apps), and the conversion impact is large. Above 8 to 10 seconds, the user has often left the page, and the conversion rate approaches the residual of users who waited only because they had no good alternative.
The Interaction-to-Next-Paint Inflection
The 2024 replacement of First Input Delay (FID) with Interaction to Next Paint (INP) as a Core Web Vitals metric materially changed what "responsiveness" means in the field measurement. The change has implications for both lab-field divergence and conversion sensitivity.
FID measured the delay between the first user interaction (a tap, click, or key press) and the browser's first response to that interaction. The metric was easy to pass because most pages had a fast first interaction, but it ignored everything that happened after the first interaction.
INP reports the longest interaction latency observed across the entire page visit (for visits with many interactions, a near-longest value that discards the occasional outlier), with the "Good" threshold at 200 milliseconds at the 75th percentile. A page that responds quickly to the first tap but slowly to subsequent taps (the typical pattern for sites with heavy client-side rendering after initial load) passes FID but fails INP.
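A minimal sketch of how a per-visit INP value can be derived from a list of interaction latencies. It assumes the rule in Google's published INP definition as commonly documented: take the longest interaction, but skip one highest value for every 50 interactions so that a single outlier on a long visit does not dominate. The function name and sample latencies are hypothetical.

```python
from typing import List, Optional


def interaction_to_next_paint(latencies_ms: List[float]) -> Optional[float]:
    """Return the INP value for one page visit, or None if there were no interactions.

    Assumption: one highest-latency interaction is ignored per 50 interactions.
    """
    if not latencies_ms:
        return None
    ranked = sorted(latencies_ms, reverse=True)
    skip = min(len(ranked) // 50, len(ranked) - 1)
    return ranked[skip]


# Fast first tap, slow later taps: a first-input metric sees 40 ms, INP sees 650 ms.
print(interaction_to_next_paint([40, 80, 650]))
```

The example makes the FID-to-INP shift concrete: the first interaction is well inside the threshold, yet the visit as a whole is more than three times over it.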
The shift from FID to INP has been disruptive for many sites that had passed Core Web Vitals comfortably under the older metric and find themselves failing under the new one. In partner data through 2024 and 2025, we have seen pass rates on INP that are 20 to 40 percentage points lower than the prior FID pass rates for the same sites, primarily on heavy client-side-rendered sites with substantial post-load JavaScript work.
The conversion sensitivity to INP is, by all available evidence, substantial. The Pinterest team's INP-focused case study and the web.dev case-study library both report meaningful conversion effects, comparable in magnitude to the LCP sensitivity for many sites. The cumulative effect of poor INP, the friction of every tap and click being slow, is in some ways more damaging than a slow LCP because it compounds across the session rather than impacting only the initial impression.
Operating Discipline: Lab for Diagnosis, Field for Targets
The practical implication of the lab-field gap is a specific operating discipline. The discipline has four parts.
Use Lighthouse and PageSpeed Insights lab scores for diagnosis, not for optimisation targets. The lab score tells you what is structurally wrong with the page when nothing external interferes. It is the right tool for finding the specific opportunities (oversized images, render-blocking JavaScript, long main-thread tasks, layout shifts caused by font loading). It is not the right number to track over time as evidence that the site is getting faster from the user's perspective.
Use CrUX and RUM field data as the optimisation target. The 75th-percentile LCP, INP, and CLS, monitored over 28-day windows, are the metrics that drive Core Web Vitals pass/fail and that correlate with conversion. The improvements that move these metrics are often different from the improvements that move the lab score, because the field includes the third-party effects, the authenticated-state penalties, the network and device heterogeneity, and the personalisation effects that the lab abstracts away.
Estimate site-specific conversion sensitivity from field data, not from industry benchmarks. The Walmart 1-percent-per-100ms and Akamai 7-percent-per-1s numbers are useful as starting hypotheses, but the actual sensitivity for any given site is empirically estimated from the site's own conversion-by-LCP curve, ideally with confounders controlled. Most well-resourced teams maintain a regression model that updates monthly.
Prioritise the slow-quartile tail, not the median. Performance work that improves the 50th-percentile LCP without improving the 75th-percentile is often Lighthouse-visible but field-invisible. The conversion lift concentrates in the slow tail, where users are closest to abandoning, where the perceptual curve is steepest, and where structural causes (slow third parties, poor mobile network, low-end device) are most fixable through engineering.
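The difference between median-focused and tail-focused work is easy to see on a small set of numbers. The sketch below uses a nearest-rank percentile and three hypothetical LCP samples (before, after a change that only speeds up already-fast loads, and after a change that targets the slow tail); none of the figures come from real telemetry.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical LCP samples in milliseconds.
before = [1200, 1500, 1800, 2200, 2600, 3400, 4800, 6100]
fast_head_fix = [900, 1100, 1300, 1600, 2600, 3400, 4800, 6100]   # only fast loads improved
slow_tail_fix = [1200, 1500, 1800, 2200, 2400, 2700, 3300, 4000]  # only slow loads improved

for label, sample in [("before", before), ("fast-head fix", fast_head_fix), ("slow-tail fix", slow_tail_fix)]:
    print(f"{label}: p50={percentile(sample, 50)} ms, p75={percentile(sample, 75)} ms")
```

The fast-head fix moves the median by 600 ms and leaves the 75th percentile untouched; the slow-tail fix does the opposite, and only the second change shows up in the number that Core Web Vitals assessment uses.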
Operating Discipline: What Each Measurement Layer Is For
| Measurement | Best For | Not Suitable For | Key Caveat |
|---|---|---|---|
| Lighthouse lab score | Diagnostic: identifying specific opportunities (images, JS, fonts, layout) | Tracking real-user experience; conversion attribution | Single simulated load; underrepresents third-party and authenticated state |
| Lighthouse CI in pipeline | Performance budget enforcement; regression detection | Real-world conversion impact estimation | Useful as a regression guard but not a customer-experience metric |
| CrUX field data (origin, URL) | Long-horizon Core Web Vitals pass-fail; competitive benchmarking | Real-time debugging; per-session granularity | 28-day window; aggregated; Chrome-only |
| Real User Monitoring (RUM, e.g. SpeedCurve, Sentry, Datadog) | Real-time monitoring; per-session debugging; segmentation by user type | Replacing CrUX as the official Core Web Vitals signal | Vendor-specific; sampling and instrumentation overhead |
| Synthetic monitoring (e.g. Pingdom, GTmetrix, Akamai mPulse Synthetic) | Multi-region performance baselines; uptime | Real-user experience inference | Predictable but unrepresentative |
| A/B-tested performance experiment | Causal estimation of conversion sensitivity to speed changes | Continuous monitoring | Operationally hard; requires injection of artificial latency or staged rollout |
The Decision Tree for Performance Work
The right way to scope a performance project depends on where the site currently sits on each measurement layer. The tree below captures the high-leverage decision points we have used in advisory engagements.
Where Performance Work Goes Wrong
Three failure modes recur across the engagements we have seen.
Optimising the wrong percentile. Teams that focus on median or mean LCP often deliver real lab improvements without moving the 75th-percentile field LCP. The 75th percentile is the part of the distribution Google evaluates for CWV pass/fail and the part that correlates most strongly with conversion at the page level. The slow-quartile tail is harder to fix, often because it requires engineering effort against third-party performance issues or network heterogeneity rather than against the more accessible local rendering issues.
Lab regression masking field improvement. Conversely, some teams introduce field-visible improvements (better CDN edge caching, smarter prefetch, third-party script deferral) that briefly regress the lab score because the Lighthouse simulation does not benefit from the caching infrastructure. Engineers panic; the field metrics improve over the next four weeks; the lab and field converge. Teams that monitor only the lab score in this window miss the wins.
Optimising for Lighthouse score without understanding the rubric. The Lighthouse score is a composite, weighted across metrics, with weights that have shifted across versions. A team that ships a change that improves Total Blocking Time at the cost of LCP can see the Lighthouse score rise (because TBT is heavily weighted in the rubric) while the actually-conversion-relevant metric (LCP) gets worse. The composite hides the trade-off.
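The way a composite can hide this trade-off is visible in a small worked example. The weights below are the Lighthouse 10 weights as commonly documented (FCP 10%, Speed Index 10%, LCP 25%, TBT 30%, CLS 25%) and should be checked against the version in use; the per-metric scores are hypothetical.

```python
# Assumed Lighthouse 10 metric weights; verify against the version you run.
WEIGHTS = {"fcp": 0.10, "si": 0.10, "lcp": 0.25, "tbt": 0.30, "cls": 0.25}


def composite_score(metric_scores: dict) -> float:
    """Weighted average of per-metric scores, each on a 0-100 scale."""
    return sum(WEIGHTS[name] * metric_scores[name] for name in WEIGHTS)


# Hypothetical per-metric scores before and after a change that trades LCP for TBT.
before = {"fcp": 70, "si": 70, "lcp": 60, "tbt": 30, "cls": 90}
after = {"fcp": 70, "si": 70, "lcp": 45, "tbt": 80, "cls": 90}

print(f"before: {composite_score(before):.2f}")
print(f"after:  {composite_score(after):.2f}")
```

Under these assumptions the composite rises from 60.5 to 71.75 even though the LCP sub-score fell by a quarter, which is why the per-metric breakdown deserves more attention than the headline number.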
A Performance Work Calendar That Translates to Conversion
The operating discipline that we have seen produce the most reliable conversion translation has a specific shape over a 6-month cycle.
In month one, the team runs the diagnostic baseline. Lighthouse audit of the top 20 highest-traffic pages, CrUX data pull for the same set, RUM instrumentation if not already in place, and a per-page mapping of lab score versus field 75th-percentile LCP. The output is a heat map of pages where the lab and field disagree most strongly. These are the highest-leverage targets, because they are the pages where the lab-focused optimisation has not yet captured the field-relevant improvements.
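One way to build that ranking is sketched below. It compares lab LCP with field 75th-percentile LCP rather than the composite lab score, which is an assumption made so that both sides are in the same unit; the class, function, URLs, and values are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class PageBaseline:
    url: str
    lab_lcp_ms: float        # LCP from a Lighthouse run
    field_p75_lcp_ms: float  # 75th-percentile LCP from CrUX or RUM

    @property
    def gap_ratio(self) -> float:
        """Field LCP relative to lab LCP; above 1 means the field is slower."""
        return self.field_p75_lcp_ms / self.lab_lcp_ms


def rank_by_disagreement(pages):
    """Sort pages by how far the field/lab ratio is from 1, in either direction."""
    return sorted(pages, key=lambda p: abs(math.log(p.gap_ratio)), reverse=True)


# Hypothetical baseline for three pages.
pages = [
    PageBaseline("/home", lab_lcp_ms=2000, field_p75_lcp_ms=2400),
    PageBaseline("/checkout", lab_lcp_ms=1800, field_p75_lcp_ms=5400),
    PageBaseline("/search", lab_lcp_ms=3000, field_p75_lcp_ms=2000),
]
for page in rank_by_disagreement(pages):
    print(f"{page.url}: field/lab = {page.gap_ratio:.2f}")
```

Using the absolute log of the ratio treats a page that is three times slower in the field and a page that is three times faster in the field as equally surprising, so pages where the lab overstates the problem surface as well.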
In months two and three, the team ships field-targeted improvements. The most common high-leverage work is third-party deferral (chat widgets, analytics, ad scripts moved to defer or async), authenticated-state optimisation (skipping unnecessary personalisation calls, pre-rendering more shell content), and image and font preloading optimised for the actual LCP element. Each change is monitored in CrUX over the 28-day window.
In month four, the team runs a conversion-sensitivity experiment if not already done. The cleanest version, where feasible, is artificial latency injection on a small percentage of traffic to estimate the causal slope of conversion against LCP. Some operators are too uncomfortable with deliberately slowing some users; the next-best version is a longitudinal regression on the actual variation that occurred during the prior months, with controls.
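The arithmetic behind a latency-injection experiment can be sketched as a two-proportion comparison. The function name and all counts are hypothetical, and expressing the result as a linear change per 100 ms is a simplifying assumption that only holds over a narrow latency range.

```python
import math


def conversion_slope(control, treated, added_latency_ms):
    """Estimate the conversion effect of injected latency.

    control, treated: (sessions, conversions) tuples.
    Returns (relative conversion change per 100 ms, z statistic).
    """
    n_c, x_c = control
    n_t, x_t = treated
    p_c, p_t = x_c / n_c, x_t / n_t
    # Pooled standard error for a two-proportion z-test.
    pooled = (x_c + x_t) / (n_c + n_t)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    relative_per_100ms = (p_t - p_c) / p_c / (added_latency_ms / 100)
    return relative_per_100ms, z


# Hypothetical experiment: 300 ms of added latency on the treated arm.
slope, z = conversion_slope(control=(50_000, 1_500), treated=(50_000, 1_410), added_latency_ms=300)
print(f"relative change per 100 ms: {slope:.1%}, z = {z:.2f}")
```

With these counts the point estimate is a 2 percent relative drop per 100 ms, yet the z statistic is about -1.7, short of the conventional 1.96 cut-off. A plausible-looking slope can still sit inside the noise floor, which is the reason to size the experiment before running it.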
In months five and six, the team prioritises the next wave of work based on the estimated sensitivity. If the sensitivity is high (every 200ms of LCP improvement moves conversion meaningfully), continue investing in field-focused engineering. If the sensitivity is low (the page is fast enough that further improvements have diminishing returns), shift engineering elsewhere. The decision is data-driven.
Where Generative Experiences and Streaming Render Fit
The 2024-2026 web increasingly includes generative experiences (AI chat interfaces, dynamic content generation, streaming responses) that interact with the Core Web Vitals framework in non-obvious ways. The implications for performance measurement are still emerging.
The first implication is that the LCP element on a generative page is harder to define. A page where the initial render is a typing animation, a streaming response, or a progressive disclosure of content does not have a single "largest contentful paint" in the classical sense. Browsers and CrUX have been adapting their LCP definitions, and the 2024 to 2025 specification updates have included logic for streaming content, but the metric is less stable in these contexts than on classical static pages.
The second implication is that INP becomes more important. Generative experiences typically have lower expectations for initial-render speed (users accept that an LLM call takes a few seconds) but very high expectations for ongoing interaction responsiveness. The classical performance optimisation focused on first paint loses some of its priority; INP optimisation becomes central.
The third implication is on session-level measurement. Generative experiences have longer sessions (sometimes 20 or 30 turns of interaction) and the performance characteristics within the session matter as much as the initial load. RUM tools that instrument the full session, not just the first load, are becoming the operational requirement.
The performance literature is still catching up to these patterns, but operating teams are increasingly building their own session-level instrumentation rather than relying solely on the page-level Core Web Vitals.
The single highest-leverage move for most performance projects is not the next Lighthouse optimisation in the lab. It is reading the CrUX 75th-percentile LCP for the page, segmenting the slow tail by user device and network, and finding the third-party script or authenticated-state penalty that the lab never measured. Field-first is not a slogan; it is the difference between performance work that moves conversion and performance work that produces internal high-fives.
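The slow-tail segmentation described here can be sketched as follows, assuming raw RUM samples are available as `(device, network, lcp_ms)` tuples. CrUX itself exposes aggregated distributions rather than per-sample rows, so the input shape, the function name, and the choice to define the tail as everything slower than the overall p75 are assumptions made for illustration.

```python
import math
from collections import Counter


def slow_tail_breakdown(
    samples: list[tuple[str, str, float]],
) -> tuple[float, dict[tuple[str, str], float]]:
    """Return the overall p75 LCP (nearest-rank) and, for the samples
    slower than that p75, each (device, network) segment's share of
    the slow tail."""
    ordered = sorted(lcp_ms for _device, _network, lcp_ms in samples)
    p75_lcp = ordered[math.ceil(0.75 * len(ordered)) - 1]
    tail = [(device, network) for device, network, lcp_ms in samples if lcp_ms > p75_lcp]
    counts = Counter(tail)
    shares = {segment: n / len(tail) for segment, n in counts.items()}
    return p75_lcp, shares


# Made-up samples for illustration only.
samples = [
    ("desktop", "wifi", 1200.0), ("desktop", "wifi", 1500.0),
    ("desktop", "wifi", 1800.0), ("desktop", "wifi", 2000.0),
    ("phone", "4g", 2500.0), ("phone", "4g", 3900.0),
    ("phone", "4g", 5200.0), ("phone", "3g", 6100.0),
]
p75_lcp, tail_shares = slow_tail_breakdown(samples)
```

In this made-up data the slow tail is entirely phone traffic even though half the samples are desktop, which is the kind of concentration that points the investigation at a specific device and network class before any lab profiling starts.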
Key Takeaways
- Lab and field measure different constructs. Lighthouse measures a single throttled cold load with no third parties and no authenticated state. CrUX and RUM measure the actual distribution of real-user sessions. The two routinely diverge by factors of 1.5 to 3 on LCP, more on INP. The divergence is structural, not a calibration error.
- Field data is the conversion-relevant signal. The empirical conversion-LCP curve, conditional on appropriate confounder controls, follows a convex shape with diminishing returns at the fast end and large returns at the slow end. Conversion uplift concentrates in the slow-quartile users, so the 75th-percentile LCP is the right optimisation target.
- The published case studies confirm direction but not magnitude. Walmart's 1-percent-per-100ms, Akamai's 7-percent-per-1s, Vodafone's 8-percent-for-31-percent-LCP-improvement, Pinterest's 15-percent-for-40-percent-perceived-wait-improvement, and Booking's 0.5-percent-per-1s all agree that faster is better. The magnitudes differ by an order of magnitude depending on site, metric, and methodology. Use the case studies as guidance, not precision inputs.
- The lab-field gap usually points at third-party scripts and authenticated-state penalties. When the lab looks good and the field looks bad, the cause is typically scripts that did not fire in the simulation or page work that anonymous loads skip. Field-focused engineering on these causes typically moves the field metrics without moving the lab score, and conversion follows the field.
- The Doherty 400ms and Nielsen 1-second thresholds still hold. The decades-old perceptual literature predicts the convex conversion curve modern field data exhibits. Under 1 second the user notices little; from 1 to 3 seconds the impact is small and cumulative; above 3 seconds the conversion damage accelerates. Pace performance investment against where the field distribution actually sits.
- Operating discipline: lab for diagnosis, field for targets, conversion sensitivity from site-specific data. The discipline is not novel but it is consistently underapplied. Teams that follow it ship conversion-translatable performance work; teams that optimise the lab score in isolation often do not.