TL;DR: A north-star metric earns its keep when it informs decisions that would otherwise be made on intuition. The construction problem is hard: the metric needs to track customer value, be measurable on a useful cadence, decompose into actionable inputs, and resist gaming. The harder problem, and the one most playbooks skip, is the revision discipline. Every north-star metric drifts. The product changes, the customer base shifts, the strategy adjusts, and the metric that was load-bearing in year one becomes a vestigial number by year three that the team is still optimizing because nobody has the authority to change it. Goodhart's law and Campbell's law together predict that any unrevised metric will, over time, be gamed and corrupted, and the accounting literature on surrogation (Choi, Hecht, and Tayler, 2012) shows that managers will start treating the metric as the strategy rather than as a proxy for it. This essay maps the construction question, the revision conditions, and the political economy of changing a metric that has accumulated organizational sediment.
A note on the named companies. Spotify, Facebook, Airbnb, and Slack appear throughout as well-known examples of north-star metric archetypes (engagement, growth, marketplace liquidity, team activation). Quantitative figures framed as advisory observation come from anonymized partner operators in the same archetypes, not from those companies themselves. Public claims and academic results are attributed inline to their sources.
What A North-Star Metric Is For
Sean Ellis's original framing of the north-star metric defined it as "the single metric that best captures the core value that your product delivers to customers" and as "the key to driving sustainable growth across your full customer base." The Amplitude team's North Star Playbook, co-authored by John Cutler, extends the framing with input metrics, a glossary, and a workshop methodology. The Croll and Yoskovitz One Metric That Matters (OMTM) framework in Lean Analytics is the related concept positioned as stage-dependent: the right metric for an early-stage startup is not the right metric for a scaling business.
The frameworks differ in detail. They converge on a small set of properties a north-star metric should have. The metric should track customer value (not internal cost or vanity engagement). It should be measurable at a cadence fast enough to inform decisions (weekly or daily for most operators, monthly at the slowest). It should decompose into actionable inputs (the team can change something today that will move the metric in three to six months). It should be one number, or a small set of related numbers, that the entire organization can rally around. It should be resistant to gaming, though the frameworks vary in how seriously they take this last property.
The justification for a single metric is operational. Organizations of more than fifty people lose alignment on what success means as the number of competing metrics grows. The product team measures activation, the growth team measures acquisition, the engineering team measures velocity, the finance team measures revenue. Each metric is locally rational. The cross-team prioritization devolves into political negotiation rather than analytic comparison because there is no shared scale to weight one team's improvement against another team's improvement. The north-star metric is the proposed shared scale. It is supposed to be the single number that all functions can be evaluated against and that all initiatives can be prioritized by.
The conditions for the framework to work are non-trivial. The single metric needs to be the right metric, which requires the team to know what "customer value" means with enough precision to operationalize it. The metric needs to be measurable in a way that is robust to data collection noise. The metric needs to be defensible against the executive who wants to redefine it every quarter to flatter the numbers, and against the team that wants to redefine it to make their hard-won progress visible. And the metric needs to be revisable when the underlying conditions change, which is the part of the framework most playbooks treat as an afterthought.
Construction: The Customer-Value Question
The hardest part of constructing a north-star metric is operationalizing what customer value means in the specific product context. The metric needs to be one of three structural types, and the typology determines what construction problem the team is solving.
Type 1: Engagement intensity. The metric measures how much of the value-producing behavior the user does. Spotify's reported time spent listening, Facebook's daily active users, WhatsApp's messages sent. These metrics work when the product's value is realized through repeated use and when the depth of use correlates with retention and monetization. The construction question is what unit of engagement matters: time, sessions, actions, units of content consumed. Each unit has gaming risks (time can be inflated by inactive sessions, sessions can be inflated by aggressive notifications, actions can be inflated by encouraging low-value actions).
Type 2: Outcome events. The metric measures the count of completed value-delivery events. Airbnb's nights booked, Stripe's transactions processed, Shopify's gross merchandise volume processed by merchants. These metrics work when the value is delivered in discrete units that can be counted. The construction question is what counts as an event (a free trial vs. a paid signup, a one-night vs. a multi-night booking, a refunded vs. a completed transaction) and how to weight events of different sizes.
Type 3: Marketplace liquidity or two-sided activation. The metric measures the rate at which the two sides of a marketplace successfully match. Uber's completed rides, DoorDash's delivered orders, Etsy's checkout completions. These metrics are structurally engagement metrics or outcome metrics in disguise, but they require an explicit treatment of the matching function because optimizing one side in isolation produces marketplace imbalances.
The three types are not mutually exclusive; a product can have a primary metric of one type and supporting metrics of others. The construction process is to identify the structural type, define the unit precisely, pressure-test the definition against known gaming risks, and validate that movements in the metric correlate with movements in retention and monetization on a 12-to-24-month lag.
The north-star metric construction filter: five conditions, applied in order
The filter above eliminates more candidates than most teams expect. The retention-correlation step in particular eliminates a category of metrics that look value-aligned in the abstract but turn out to be uncorrelated with retention in the data (one common example is "feature usage rate," which often tracks the engagement of users who are already going to retain rather than predicting which users will retain).
Construction: The Decomposition Question
A north-star metric that the team cannot change is a vanity number. The decomposition into actionable inputs is what makes the metric load-bearing. The Amplitude playbook calls these "input metrics"; the John Cutler version emphasizes that the inputs should be the levers the team actually controls, not just upstream events in the funnel.
The decomposition typology, drawn from advisory work across categories.
Multiplicative decomposition. The metric is the product of components: active users times sessions per user times average session length. Each component is a separate lever. The advantage is that the components are typically independent enough that a team can own one component. The disadvantage is that movements in components do not always sum to movements in the product (because of cross-component correlations).
Funnel decomposition. The metric is the end-state count of a funnel: signups become activated users become retained users become paying users. Each stage is a separate conversion rate. The advantage is that the funnel matches the operational organization (acquisition team owns top, activation team owns middle, monetization team owns bottom). The disadvantage is that funnel stages are often not independent; a change at one stage shifts the user mix at downstream stages.
Cohort decomposition. The metric is the sum across cohorts of users, each cohort weighted by its size and behavior. Each cohort is a separate analytical object. The advantage is that the cohort structure reveals when the metric is driven by the behavior of new users versus the behavior of older users, which has very different operational implications. The disadvantage is that cohort analysis requires a longer history and more sophisticated tooling than the multiplicative or funnel views.
North-Star Metric Construction: Four Archetypes With Inputs
| Archetype | Example NSM | Primary Inputs | Common Gaming Risk |
|---|---|---|---|
| Engagement intensity (consumer media) | Weekly active users with at least 3 sessions | New user activation, push notification CTR, content recommendation quality, session frequency | Inflating sessions via low-quality notifications; counting near-zero sessions |
| Outcome events (transactional) | Paid orders per active customer per quarter | Conversion rate, repeat purchase rate, basket size, win-back rate | Splitting one order into multiple to inflate count; counting cancelled orders |
| Marketplace liquidity (two-sided) | Successful matches per week | Supply side onboarding, demand side acquisition, matching algorithm quality, drop rate | Loosening match criteria to inflate count at expense of match quality |
| Team activation (B2B SaaS) | Teams with 5+ active users in the past 14 days | Signup-to-team-creation rate, team-to-active-user expansion, retention of active users | Counting bot accounts; counting one user invited to many teams |
| Adoption (developer tooling) | Monthly active integrations | Documentation quality, time-to-first-successful-call, integration retention | Counting test integrations; counting integrations that never reach production |
| Compounding value (subscription) | Annual recurring revenue from cohorts at 12-month tenure | New cohort acquisition, cohort activation, cohort retention, cohort expansion | Pulling forward expansion deals to make current quarter look better |
The table is not exhaustive. The categories overlap, and many products fit more than one. The diagnostic value is in mapping the archetype to the gaming risks, which the Goodhart and Campbell literature warns us will manifest within twelve to twenty-four months of the metric becoming a target.
The Goodhart Problem And Its Four Variants
Goodhart's law in its original 1975 form was a remark about monetary policy. The popular formulation ("when a measure becomes a target, it ceases to be a good measure") generalizes the observation into a structural prediction about any quantitative target subject to optimization pressure. The Manheim and Garrabrant paper, Categorizing Variants of Goodhart's Law (arXiv:1803.04585, 2018), is the most useful practitioner-readable taxonomy of how the failure mode actually shows up. They distinguish four variants, each of which corresponds to a distinct failure pattern in north-star metric optimization.
Variant 1: Regressional Goodhart. The proxy correlates with the goal, so optimizing the proxy improves the goal on average. But the proxy has noise. Optimizing past the noise level produces measured improvement that does not correspond to real improvement. In a north-star metric context, this looks like a team driving the metric upward through small wins that aggregate into a statistically significant lift but do not move retention or revenue. The metric moves, the underlying outcome does not.
Variant 2: Extremal Goodhart. The proxy correlates with the goal in the typical range but diverges at the extremes. Optimizing the proxy past its useful range produces results that are good for the proxy and bad for the goal. In a north-star metric context, this is the team that drives daily active users from 60 percent of monthly active users to 75 percent by sending eight push notifications per day, the result of which is that the user base hates the product and churns at a higher rate over the following two quarters.
Variant 3: Causal Goodhart. The proxy and the goal are correlated because they share a common cause, not because the proxy causes the goal. Interventions that move the proxy do not move the goal because the causal arrow runs from the common cause to both. In a north-star metric context, this is the team that observes a correlation between "users who add a photo" and "user retention," concludes that adding a photo causes retention, builds a feature flow that pushes new users hard to add a photo, and finds that retention does not improve because the original correlation was driven by motivated users (who would have retained anyway) being more likely to add photos.
Variant 4: Adversarial Goodhart. The agent being measured actively games the metric. In a north-star metric context, this is the salesperson who restructures the deal to time the booking inside the quarter or the engineer who renames inactive sessions to fit the active-session definition.
The composite frequency above reflects which Goodhart variant appears most often in advisory engagements. Causal Goodhart leads the count, which matches the academic intuition that the easiest mistake to make in metric design is to read a correlation as if it implied a useful intervention point. Regressional and extremal Goodhart are also common because both arise from extending optimization past the range where the proxy was validated. Adversarial Goodhart is the least frequent but the most damaging when it occurs, because it implies a misalignment between the team and the strategy that is not easily fixed by metric revision alone.
The practical implication of the Manheim-Garrabrant taxonomy is that the construction of the metric does not eliminate the Goodhart problem; it only delays it. Every metric is subject to one or more of the four variants over time, and the magnitude of the failure depends on how aggressively the metric is optimized and how long the team optimizes without revisiting the metric's validity. The revision discipline is what bounds the failure.
The Campbell Problem: When The Metric Corrupts The Process
Donald Campbell's 1979 paper, "Assessing the Impact of Planned Social Change" in Evaluation and Program Planning, made a stronger claim than Goodhart's. Campbell's law states that "the more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor." The shift from Goodhart to Campbell is from "the metric becomes a worse measure" to "the metric corrupts the underlying activity it was supposed to measure."
In a north-star metric context, the Campbell problem manifests as decisions about what to build that are driven by what will move the metric rather than by what serves the customer. A team optimizing for daily active users will build re-engagement features even when the underlying product complaint is that the product is over-engaging. A team optimizing for transactions per customer will build features that encourage transaction-splitting even when transaction-splitting harms unit economics. A team optimizing for time spent listening will build playlist-extension features even when the user wanted to stop after one playlist.
The Campbell corruption is harder to detect than the Goodhart drift because it operates at the level of strategic direction rather than at the level of metric measurement. The Goodhart drift produces a metric that is moving without moving the underlying outcome. The Campbell corruption produces a product roadmap that is shaped by what moves the metric, with the underlying outcome silently sliding in a direction the team is not measuring.
The accounting literature on surrogation, beginning with Choi, Hecht, and Tayler's 2012 paper "Lost in Translation" in the Journal of Accounting Research, is the closest empirical work on this question. They show that managers compensated on a measure tend to "lose sight of the strategic construct the measure is intended to represent, and subsequently act as though the measure is the construct." The replication studies that followed (Bentley 2019, others) have confirmed the effect across multiple compensation contexts. The implication for a north-star metric is that any metric tied to compensation (whether explicitly through bonus or implicitly through promotion) will eventually be treated as the strategy itself rather than as a proxy for the strategy.
The composite illustrates the pattern across advisory engagements where a north-star metric was not revised over a multi-quarter window. The metric continues to grow at roughly 8 percent per quarter, an apparently healthy rate. Retention and revenue track the metric for the first four to five quarters, then begin to flatten and eventually decline. By quarter ten, the metric is up 84 percent from the baseline, retention is up 2 percent, and revenue is down 7 percent from its peak. The team is technically hitting its target. The business is sliding. The metric's continued growth is producing decisions that are extracting short-term metric movement at the cost of long-term outcomes.
When To Revise
The hardest decision in north-star metric stewardship is when to revise. Revising too often resets the organizational alignment that the metric exists to produce. Revising too late lets the Goodhart and Campbell failures accumulate to the point of strategic damage. The honest answer is that there is a set of triggers that should prompt a revision review, not an automatic schedule.
Trigger 1: The metric and the underlying outcome have decoupled. If the metric is moving and retention, satisfaction, or monetization is not (or is moving in the opposite direction) over a four-to-six-quarter window, the metric has lost its proxy validity. The revision question is whether the metric should be redefined to restore correlation or whether a new metric should replace it. The diagnostic data is the cohort retention curve compared to the metric trajectory.
Trigger 2: The product has materially changed. A pivot, a major new product line, an acquisition that changes the user mix, a shift from B2C to B2B or vice versa. The original metric was constructed for the original product. The new product needs a metric that reflects its value proposition. The revision question is whether the original metric still captures value for the original product line while a new metric is needed for the new line, or whether a unified metric across the new portfolio is feasible.
Trigger 3: The customer base has materially shifted. The product was used by early adopters in the first eighteen months and by a mainstream segment after that, with the two segments having different value-realization patterns. The metric calibrated for the early adopters may not capture mainstream value. The revision question is whether to redefine the metric or to track separate metrics by segment.
Trigger 4: The strategy has changed. A shift from growth to monetization, from acquisition to retention, from breadth to depth. The metric needs to reflect the new strategic priority. The revision question is when the strategy change is durable enough to justify the metric change.
Trigger 5: The metric has been actively gamed. The team has identified specific instances of gaming, either through internal whistleblowing or through diagnostic analysis of the metric components. The revision question is whether to add guardrails (paired metrics that move in opposite directions under gaming) or to change the metric to one less susceptible to the specific gaming pattern.
Revision Triggers and Recommended Responses
| Trigger | Diagnostic | Recommended Response | Common Pitfall |
|---|---|---|---|
| Metric and outcome have decoupled | 12-month cohort retention vs. NSM trajectory; correlation drops below 0.5 | Redefine metric or add a paired outcome metric | Adding paired metric without retiring the old; team optimizes both, exhausts itself |
| Product has materially changed | New product line is more than 20 percent of revenue or users | Construct a new NSM for the new line; keep old for old line | Forcing one metric across lines that have different value units |
| Customer base has materially shifted | Segment composition has moved more than 30 percentage points in 18 months | Segment-level NSMs with a unified rollup | Optimizing the aggregate while the segment-level NSMs diverge |
| Strategy has changed | Executive priority shift documented in board materials or annual plan | Wait one quarter to confirm durability; then revise | Revising at every strategy update; the metric loses its anchor function |
| Metric has been gamed | Specific gaming patterns identified; metric movements not validated by outcome movements | Add guardrail metric; if gaming continues, replace the NSM | Adding guardrails without enforcement; the gaming continues with one more number ignored |
| Metric has accumulated political weight beyond its analytic merit | Team identifies the metric as a constraint on the work they think matters | Honest external review; decide whether the metric is right or the team is wrong | Letting political weight dictate the revision; the metric moves to whatever is convenient |
The triggers are not independent. A product change usually accompanies a customer-base shift, which often accompanies a strategy change. The accumulation of triggers is itself a diagnostic; a single trigger is a reason to review, multiple triggers are a reason to revise.
The Political Economy Of Revision
The technical answer to revising a north-star metric is straightforward. Identify the trigger, propose the new metric, validate it against retention and outcome data, transition the dashboards. The political answer is the binding constraint.
The metric, once it is on the wall, accumulates organizational sediment. The team that owns the metric has built its identity around it. The compensation structure ties bonuses to the metric. The OKRs cascade from the metric down to individual contributors. The board reviews the metric quarterly and expects to see continued movement. The investor deck includes the metric on slide three. Each of these layers makes the metric harder to change, even when the metric has decoupled from the underlying outcome and the team knows it.
The pattern we see most often in advisory engagements is that the revision decision is delayed by twelve to twenty-four months past the point at which the metric stopped being load-bearing. The delay is not driven by analytic disagreement; the analysts usually agree that the metric needs to change. The delay is driven by the political cost of admitting that the metric was wrong, by the organizational disruption of revising the compensation and OKR cascade, and by the executive who has staked credibility on the metric.
The structural moves that have worked, drawn from advisory observation:
Move 1: Pre-commit to the revision triggers in the construction phase. When the metric is first established, the executive sponsor signs off on the conditions under which the metric will be reviewed and possibly revised. The triggers are documented. The review happens automatically when a trigger fires, regardless of whether the political environment is convenient. The pre-commitment makes the revision a contractual event rather than a discretionary one.
Move 2: Pair the north-star metric with a guardrail metric from the beginning. The guardrail is a metric that should not degrade as the north-star moves. Daily active users paired with NPS or retention. Transactions paired with refund rate or customer service contacts. The guardrail does not need to be a target. It needs to be tracked and reviewed alongside the NSM, with the revision triggered if the NSM moves up while the guardrail moves down beyond a defined threshold.
Move 3: Separate the analytic metric from the compensation metric. The analytic metric is what the team uses to understand the business. The compensation metric is what is tied to bonuses and OKRs. They can be related but should not be identical. The Choi-Hecht-Tayler surrogation finding is strongest when compensation ties directly to a single metric. Pairing compensation with multiple metrics, or compensating on outcomes rather than the metric itself, reduces the surrogation pressure.
Move 4: Make the revision proposal the analyst's prerogative. The team responsible for measuring the metric has the responsibility to propose revisions when the diagnostic conditions warrant. The proposal goes to the executive sponsor with the analytic case. The executive sponsor cannot veto on political grounds without an alternative metric proposal. The structural change is that the burden of justification shifts from "why do we change" to "why do we not change given the diagnostic evidence."
What A Revised Metric Looks Like In Practice
The revised metric should pass the construction filter from the start: track customer value, measurable on a useful cadence, decomposable into actionable inputs, correlated with retention at a 12-month lag, resistant to known gaming patterns. The revision should also include three additional elements that the original construction often skipped.
Element 1: Documented revision conditions. The new metric should ship with the documented conditions under which it will itself be reviewed. The conditions can be the same triggers used to revise the original metric. The point is to embed the revision discipline in the metric from the start, rather than waiting for the next decoupling to require a political fight.
Element 2: A paired guardrail. The guardrail metric is monitored alongside the north-star and is the early warning for gaming, surrogation, or strategic drift. The guardrail does not need to be a target. It needs to be on the same dashboard, reviewed at the same cadence, and tied to the revision triggers.
Element 3: An explicit decomposition tree. The decomposition into input metrics is documented and shared. Each team understands which input metric it owns and how movements in its input metric will affect the north-star. The decomposition is itself revisable; if the team's work changes such that an input metric is no longer the right lever, the decomposition is updated.
The metric is the easy part. The revision discipline is the hard part. Most organizations write the metric on the wall and never look at it again. The ones that compound over a decade are the ones that treat the metric as a hypothesis to be tested, not as a permanent fixture to be defended.
The revised metric is itself temporary. The next strategy change, the next product pivot, the next customer-base shift will produce the next set of revision triggers. The discipline is not "find the right metric." The discipline is "build the practice of revising the metric when the conditions warrant." Organizations that internalize this distinction outlast their north-star metrics. Organizations that do not internalize it are eventually outlasted by their north-star metrics, in the sense that the metric continues to be reported, the team continues to be measured against it, and the actual business slides while the dashboard stays green.
When A North-Star Metric Is The Wrong Tool
There is a category of business where a single north-star metric is not the right operating frame, and the playbook industry has been less honest about this than the practitioner community needs. The cases where a single NSM does not work include the following.
Multi-product portfolios where the products are structurally different. A company selling enterprise software, consumer subscriptions, and developer infrastructure does not have a single customer value unit that spans all three. Forcing one metric across the portfolio produces either a meaningless aggregate (the sum of three different value units) or a single-product bias (the metric tracks one product's value and is reported across the other two as a vanity number).
Early-stage products where the product-market fit is still being discovered. A pre-PMF product does not have a stable customer value definition. The metric that captures value next quarter is likely different from the metric that captures value this quarter. The Lean Analytics OMTM framework explicitly addresses this by recommending stage-dependent metrics; the north-star framework that assumes a stable metric is poorly suited to the discovery phase.
Highly cyclical or seasonal businesses where the metric is not interpretable on the weekly cadence. A tax software company has a north-star metric only in the sense that filings completed during tax season is the relevant outcome. The other ten months of the year, the metric is structurally near zero and the team needs different leading indicators to be measured against.
Businesses where the value is delivered to one party but the metric is measured on another. A platform business with consumers on one side and merchants on the other has at least two value units. A single north-star metric can be constructed (matches completed, transactions processed), but the construction will systematically underweight one side's experience.
In these cases, the right operating frame is a small portfolio of metrics with documented relationships between them, not a single north-star. The framework discipline still applies: each metric in the portfolio should pass the construction filter, each should have documented revision triggers, each should have paired guardrails. The difference is that the alignment function the single NSM was supposed to serve has to be served by a process (a shared review of the metric portfolio, a periodic prioritization debate informed by the portfolio) rather than by a number.
The practical implication is that the question "what is our north-star metric" should not be answered "we don't believe in north-star metrics" or "we have eleven." The answer should be "given our product structure, our stage, and our customer base, the right metric (or set) is this, and here are the conditions under which we will revise it." The honesty about when the framework does and does not fit is part of the discipline.
Key Takeaways
-
A north-star metric earns its keep when it informs prioritization decisions that would otherwise be made on intuition. A metric that has not overruled an intuition in the past six months is either too far from the daily work or too aligned with what the team would have done anyway.
-
The construction filter is five conditions in order. Track customer value, measurable on a useful cadence, decompose into actionable inputs, correlate with retention at a 12-month lag, resist known gaming patterns. Most candidates fail at the retention-correlation step.
-
Goodhart's law has four distinct variants (Manheim and Garrabrant, 2018). Regressional, extremal, causal, and adversarial. Each manifests as a different failure pattern in north-star metric optimization. The construction does not eliminate the failure; revision discipline bounds it.
-
Campbell's law and the surrogation literature predict that an unrevised metric will corrupt the underlying process it was supposed to measure. Choi, Hecht, and Tayler (2012) show empirically that managers compensated on a measure begin to treat the measure as the strategy. The corruption is hard to detect because it operates on roadmap decisions rather than on metric measurements.
-
The revision triggers are five, and the revision decision should be pre-committed at construction. Decoupling from outcomes, product change, customer-base shift, strategy change, evidence of gaming. The political cost of revising rises with time, so embedding the revision discipline in the metric from the start is the structural protection against the delay pattern.
-
A single north-star metric is the wrong tool in multi-product portfolios, pre-PMF discovery, cyclical businesses, and platform contexts with structurally distinct value units. The honest framework acknowledges where it does not fit. A small portfolio of metrics with documented relationships and paired guardrails is the alternative for those cases.
Read Next
- Business Analytics
Data Warehouse to BI Layer Arbitration Patterns: Where the Semantic Layer Should Live
An analysis of the architectural debate between BI-tool-as-semantic-layer, warehouse-as-semantic-layer, and headless BI, with the knock-on effects on metric consistency, query cost, and analyst velocity.
- Business Analytics
Anomaly Detection on Analytics Dashboards: When the Alert Fires
A 4% revenue drop on a Tuesday could be a payment outage, a pricing bug, or normal variance. The difference between sound monitoring and alert theatre is not the model. It is the loop the alert sits inside.
- Business Analytics
The GA4 Transition Forensics: What Universal Analytics Did Better
An honest post-mortem of the UA to GA4 migration. What broke, what is genuinely better, what remains unchanged, and the opportunity cost question that nobody at Google wants to discuss in public.
The Conversation
Be the first to weigh in
Join the conversation
Disagree, share a counter-example from your own work, or point at research that changes the picture. Comments are moderated, no account required.