Business Analytics

Causal Discovery in Business Data: Applying PC Algorithm and FCI to Find Revenue Drivers Without Experiments

Correlation tells you that feature usage and retention move together. It doesn't tell you which causes which — or whether a third factor drives both. Causal discovery algorithms can untangle this from observational data alone.


TL;DR: Only 35% of correlational predictions about what would improve a metric actually produce the predicted improvement when tested experimentally. Causal discovery algorithms like the PC algorithm and FCI can infer causal structure from observational business data without running experiments -- identifying which features genuinely drive retention, revenue, and conversion versus which merely co-occur with them.


The Correlation Trap in Business Analytics

Most business analytics operates on a foundation that is structurally incapable of answering the questions it claims to address.

Consider a standard analysis: a data team observes that customers who use Feature X have 40% higher retention than those who do not. The slide deck concludes that Feature X drives retention. The product team allocates engineering resources to improve Feature X. Leadership celebrates the data-driven culture.

But the analysis has established only that Feature X usage and retention co-occur. It has not established the direction of causation. It has not ruled out the possibility that a third variable — say, technical sophistication of the user, or the size of the account, or the specific onboarding path they took — independently causes both Feature X usage and higher retention. The business is making causal decisions on correlational evidence. The same structural problem afflicts multi-touch attribution, where touchpoint correlations are mistaken for causal contributions to conversion.

This is not a pedantic distinction. It is the difference between investing in something that changes outcomes and investing in something that merely co-occurs with outcomes. Companies that confuse these two spend millions optimizing indicators rather than causes.

The problem compounds in business settings because the data is observational by default. Customers self-select into behaviors. Markets shift. Seasonal patterns interact with product changes. The data-generating process is tangled with feedback loops and hidden variables that no amount of regression analysis can untangle through statistical control alone.

And yet, there is a class of algorithms designed to do something that sounds almost impossible: infer causal structure from observational data, without running experiments. These algorithms do not give certainty. They give principled constraints on what the causal structure could be, given the statistical patterns in the data. That is far more valuable than the unconstrained speculation that passes for causal reasoning in most business analytics today.


Why Experiments Are Not Always an Option

The gold standard for causal inference is the randomized controlled experiment. Randomly assign customers to treatment and control groups, intervene on one variable, measure the outcome -- ideally using Bayesian A/B testing for proper uncertainty quantification. Any difference in outcomes can be attributed to the intervention. Clean. Definitive.

But in business, the gold standard is frequently unavailable.

Ethical constraints. You cannot randomly degrade the experience for a subset of enterprise customers paying seven figures annually to measure the causal effect of poor support response times on churn. The experiment would generate the answer and destroy the accounts simultaneously.

Operational constraints. Pricing experiments require legal review, competitive analysis, and coordination across sales teams. By the time the experiment is approved, the market has shifted. Testing radically different pricing structures is not something most organizations can execute quickly or at scale.

Strategic constraints. Some variables are not manipulable at all. You cannot randomly assign customers to different industries to see whether industry vertical causes different product usage patterns. You cannot randomize macroeconomic conditions. You cannot make a customer smaller to test whether company size drives adoption.

Sample size constraints. Enterprise SaaS companies may have a few hundred accounts. Running an experiment with sufficient statistical power to detect realistic effect sizes requires a sample that exceeds their entire customer base.

Table 1: When Experiments Are Infeasible and Causal Discovery Provides an Alternative

| Constraint Type | Example | Why Experiments Fail | Causal Discovery Alternative |
|---|---|---|---|
| Ethical | Deliberately degrading service quality | Harms real customers | Observe natural variation in service quality across accounts |
| Operational | Radical pricing restructure | Takes months to approve, market shifts | Analyze existing pricing tiers as quasi-natural variation |
| Strategic | Customer industry vertical | Non-manipulable variable | Discover whether industry is a cause or confounder from data |
| Sample size | Enterprise B2B with 300 accounts | Insufficient power for A/B tests | Constraint-based algorithms can work with moderate samples |
| Temporal | Long-run brand effects | Outcomes manifest over years | Longitudinal observational data captures long-run patterns |

These are not edge cases. For many of the most important business questions — what causes churn among high-value accounts, what drives expansion revenue, why certain market segments adopt faster — experiments are either impossible, impractical, or too slow. The business needs answers, and it cannot wait three years to design and run the definitive experiment.

This is where causal discovery enters. Not as a replacement for experiments, but as a systematic method for extracting causal hypotheses from the observational data that already exists. In organic search measurement, where experiments are particularly difficult, the synthetic control method for measuring SEO's causal impact represents a parallel approach to extracting causal signal from observational data.


Causal Discovery: Learning Structure from Data

Causal discovery is a family of algorithms that attempt to learn the causal graph — the directed acyclic graph (DAG) that represents causal relationships among variables — from statistical patterns in observational data.

The foundational insight, developed through decades of work by Judea Pearl, Peter Spirtes, Clark Glymour, and Richard Scheines, is that different causal structures produce different statistical signatures. If X causes Y, the statistical pattern is different from Y causes X, which is different from both X and Y being caused by a hidden variable Z. Not always. Not in every case. But often enough to narrow the set of plausible causal structures considerably.

The key assumptions that make this possible:

Causal Markov Condition. Each variable in the graph is independent of its non-descendants, conditional on its parents. Formally, for any variable X_i in a DAG with parent set Pa(X_i):

X_i \perp\!\!\!\perp \text{NonDesc}(X_i) \mid \text{Pa}(X_i)

This is the statistical fingerprint of causal structure. If smoking causes tar deposits and tar deposits cause lung cancer, then smoking and lung cancer are independent once you condition on tar deposits. The conditional independence relationships in the data reflect the causal structure.

Faithfulness. There are no exact cancellations of causal effects that produce misleading independence patterns. If two variables appear independent in the data, it is because the causal structure makes them independent — not because two causal paths between them happen to cancel out perfectly. This assumption rules out pathological special cases.

Causal Sufficiency (for some algorithms). All common causes of the measured variables are included in the dataset. This is a strong assumption. The PC algorithm requires it; the FCI algorithm does not.

Given these assumptions, the algorithms proceed by testing conditional independence relationships in the data and using them to reconstruct the causal skeleton and, where possible, orient the edges into causal arrows.
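All of this rests on conditional independence testing. For continuous data the workhorse is the Fisher-z test on partial correlations; here is a minimal sketch on simulated chain data (variable names and coefficients are invented for illustration):

```python
import numpy as np
from scipy import stats

def fisher_z_pvalue(data, i, j, cond=()):
    """p-value for X_i independent of X_j given the columns in `cond`,
    via partial correlation and Fisher's z-transform (assumes Gaussian data)."""
    sub = data[:, [i, j, *cond]]
    prec = np.linalg.inv(np.corrcoef(sub, rowvar=False))   # precision matrix
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])     # partial correlation
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(len(data) - len(cond) - 3)
    return 2 * (1 - stats.norm.cdf(abs(z)))

# Simulated chain: Training -> FeatureDepth -> Revenue
rng = np.random.default_rng(42)
n = 5000
training = rng.normal(size=n)
depth = 0.8 * training + rng.normal(size=n)
revenue = 0.8 * depth + rng.normal(size=n)
data = np.column_stack([training, depth, revenue])

p_marginal = fisher_z_pvalue(data, 0, 2)           # Training vs Revenue
p_given_depth = fisher_z_pvalue(data, 0, 2, (1,))  # ... given FeatureDepth
```

The marginal test rejects independence (tiny p-value) while the conditional test does not: exactly the statistical fingerprint of a chain, and exactly the kind of pattern constraint-based algorithms exploit.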

Figure 1: Comparative Assessment of Causal Inference Methods (0-100 Scale)

No observational method matches the causal validity of a well-designed experiment. But the gap between naive correlation analysis and algorithmic causal discovery is substantial. The question is not whether causal discovery is perfect — it is whether it is better than the alternative, which in most business settings is intuition dressed up with scatter plots.


Constraint-Based vs. Score-Based Approaches

Causal discovery algorithms divide into two broad families, with a third hybrid category that combines elements of both.

Constraint-based algorithms (PC, FCI, RFCI) work by testing conditional independence relationships. The logic: if two variables are independent given some set of other variables, that independence constrains what the causal structure could be. The algorithm systematically tests these relationships and eliminates causal structures that are inconsistent with the observed independences.

The PC algorithm is the canonical example. Named after its creators Peter Spirtes and Clark Glymour (the P and C), it starts with a fully connected graph and removes edges where conditional independence is found. Then it orients the remaining edges using specific rules derived from the Markov condition and faithfulness assumption.

Score-based algorithms (GES, FGES, NOTEARS) take a different approach. They define a scoring function that measures how well a given causal graph fits the data — typically the Bayesian Information Criterion (BIC) or a similar metric that balances fit against complexity. Then they search over the space of possible graphs to find the one with the best score.

The Greedy Equivalence Search (GES) is the most widely used score-based method. It operates in two phases: a forward phase that adds edges to improve the score, and a backward phase that removes edges to simplify. Under the same assumptions as the PC algorithm, GES is guaranteed to find the correct equivalence class of causal graphs in the large sample limit.

Hybrid algorithms (GFCI, cGNF) combine both approaches. They might use constraint-based methods to establish a skeleton and score-based methods to orient edges, or vice versa.

Table 2: Constraint-Based vs. Score-Based Causal Discovery Approaches

| Property | Constraint-Based (PC, FCI) | Score-Based (GES, FGES) | Hybrid (GFCI) |
|---|---|---|---|
| Core mechanism | Conditional independence tests | Graph scoring and search | Both |
| Handles hidden confounders | FCI: yes; PC: no | Not natively | Yes (GFCI) |
| Computational cost | Moderate (depends on tests) | Can be high (search space) | Moderate to high |
| Sensitivity to errors | High (cascading test errors) | Moderate (score is global) | Moderate |
| Output type | CPDAG or PAG | CPDAG | PAG |
| Sample size needs | Moderate | Moderate to large | Moderate to large |
| Strongest when | Clear independence patterns | Well-specified variables | Hidden confounders likely |

For business applications, the choice between families matters less than the choice of assumptions. If you believe all relevant common causes are measured, the PC algorithm is a reasonable starting point. If hidden confounders are plausible — and in business data, they almost always are — FCI or GFCI should be the default.


The PC Algorithm, Step by Step

The PC algorithm is the most widely taught and implemented causal discovery method. Understanding it in detail clarifies how all constraint-based approaches work.

Input: A dataset with n observations across p variables. A significance level alpha for independence tests (typically 0.01 or 0.05).

Output: A Completed Partially Directed Acyclic Graph (CPDAG), which represents the equivalence class of causal structures consistent with the data.

The algorithm proceeds in three phases:

Phase 1: Skeleton Discovery

Start with a complete undirected graph — every variable connected to every other variable by an undirected edge. Then systematically remove edges.

Step 1. For each pair of adjacent variables (X, Y), test whether X and Y are marginally independent (unconditionally) at significance level alpha:

X \perp\!\!\!\perp Y \mid \emptyset \quad \Longrightarrow \quad \text{remove edge } X - Y

If they are independent, remove the edge between them. Record the separating set (in this case, the empty set).

Step 2. For each pair of still-adjacent variables (X, Y), test whether X and Y are conditionally independent given any single variable Z that is adjacent to X or Y. If conditional independence holds for some Z, remove the edge. Record Z as the separating set.

Step 3. For each pair of still-adjacent variables (X, Y), test whether they are conditionally independent given any pair of variables (Z, W) adjacent to X or Y. If so, remove the edge. Record the separating set.

Continue increasing the conditioning set size until all remaining edges have been tested with conditioning sets of all feasible sizes. The result is the undirected skeleton of the causal graph.
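Under the simplifying assumptions of continuous Gaussian data and a Fisher-z test, Phase 1 fits in a few dozen lines. This sketch draws conditioning sets only from X's neighbours and omits the order-independent "stable" refinement that production implementations use:

```python
import numpy as np
from itertools import combinations
from scipy import stats

def indep(data, i, j, cond, alpha):
    """Fisher-z test: True if X_i and X_j look independent given `cond`."""
    sub = data[:, [i, j, *cond]]
    prec = np.linalg.inv(np.corrcoef(sub, rowvar=False))
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(len(data) - len(cond) - 3)
    return 2 * (1 - stats.norm.cdf(abs(z))) > alpha

def pc_skeleton(data, alpha=0.01):
    """Phase 1: start fully connected, remove edges with growing conditioning sets."""
    p = data.shape[1]
    adj = {i: set(range(p)) - {i} for i in range(p)}
    sepset = {}
    depth = 0
    while any(len(adj[i]) - 1 >= depth for i in adj):
        for i in range(p):
            for j in sorted(adj[i]):
                if j < i:
                    continue  # test each pair once per depth
                others = sorted(adj[i] - {j})  # condition on i's other neighbours
                for S in combinations(others, depth):
                    if indep(data, i, j, S, alpha):
                        adj[i].discard(j)
                        adj[j].discard(i)
                        sepset[(i, j)] = sepset[(j, i)] = set(S)
                        break
        depth += 1
    return adj, sepset

# Simulated chain 0 -> 1 -> 2: the skeleton should keep 0-1 and 1-2, drop 0-2
rng = np.random.default_rng(7)
n = 5000
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(size=n)
z = 0.8 * y + rng.normal(size=n)
adj, sepset = pc_skeleton(np.column_stack([x, y, z]))
```

On the chain data the 0-2 edge is removed with separating set {1}, which is the record Phase 2 needs.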

Phase 2: V-Structure Orientation

This is where undirected edges begin to acquire direction. A v-structure (also called a collider) occurs when two variables X and Z both cause a third variable Y, but X and Z are not directly connected. The pattern is: X -> Y <- Z.

The algorithm identifies v-structures using the separating sets from Phase 1. For every triple (X, Y, Z) where X and Z are both adjacent to Y but not to each other, check whether Y was in the separating set of X and Z. If Y was not in the separating set, then X -> Y <- Z is a v-structure. Orient those edges accordingly.

The logic: if Y is a common effect of X and Z (a collider), then conditioning on Y creates a dependence between X and Z (Berkson's bias). So X and Z would not be rendered independent by conditioning on Y. If the separating set that makes X and Z independent does not include Y, then Y must be a collider.
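Given the skeleton and the recorded separating sets, the collider check itself is short. A sketch using a dict-of-neighbour-sets encoding of the skeleton:

```python
from itertools import combinations

def orient_v_structures(adj, sepset):
    """adj: dict node -> set of neighbours (undirected skeleton).
    sepset: dict (x, z) -> separating set recorded when edge x-z was removed.
    Returns a set of directed edges (a, b), meaning a -> b."""
    directed = set()
    for y in adj:
        for x, z in combinations(sorted(adj[y]), 2):
            # unshielded triple x - y - z: x and z adjacent to y but not each other
            if z not in adj[x] and y not in sepset.get((x, z), set()):
                directed.add((x, y))
                directed.add((z, y))
    return directed

# Ground truth 0 -> 1 <- 2: skeleton 0-1, 1-2; 0 and 2 separated by the empty set
adj = {0: {1}, 1: {0, 2}, 2: {1}}
sepset = {(0, 2): set(), (2, 0): set()}
```

With an empty separating set the triple is oriented as a collider; had the separating set been {1} (the chain case), no orientation would be made.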

Phase 3: Edge Orientation Propagation

Apply a set of orientation rules (Meek's rules) that propagate edge directions without creating new v-structures or directed cycles:

Rule 1. If X -> Y — Z and X and Z are not adjacent, orient as X -> Y -> Z (to avoid creating a new v-structure at Y).

Rule 2. If X -> Y -> Z and X — Z, orient X — Z as X -> Z (orienting it the other way would create a directed cycle).

Rule 3. If X — Z, Y -> Z, W -> Z, X — Y, X — W, and Y and W are not adjacent, orient X — Z as X -> Z.

Any edges that remain unoriented after all rules have been applied genuinely cannot be oriented from observational data alone — both directions are consistent with the observed conditional independence pattern.


Handling Hidden Confounders with FCI

The PC algorithm assumes causal sufficiency — that every common cause of the measured variables is included in the dataset. In business data, this assumption is almost certainly violated. There are always unmeasured factors: employee morale, competitive dynamics, macroeconomic sentiment, regulatory anticipation, internal organizational politics.

The Fast Causal Inference (FCI) algorithm, also developed by Spirtes, Glymour, and Scheines, relaxes the causal sufficiency assumption. It can detect when the statistical patterns suggest the presence of a hidden common cause, even when that cause is not in the dataset.

FCI modifies the constraint-based approach by searching over a larger collection of conditioning sets than the PC algorithm's adjacency-based ones. The core test remains the same: for variables X and Y, find a separating set S such that

X \perp\!\!\!\perp Y \mid \mathbf{S} \quad \text{where } \mathbf{S} \subseteq \text{Possible-D-Sep}(X, Y)

FCI extends the PC algorithm in two critical ways:

First, it introduces additional edge types. Where the PC algorithm's output contains only directed (->) and undirected (—) edges, FCI adds bidirected edges (<->) and edges with circle endpoints (o) that represent genuine ambiguity about whether a variable is a cause, an effect, or connected through a hidden confounder. The edge X o-> Y means that X might cause Y, or there might be a hidden common cause, but Y does not cause X.

Second, FCI includes additional orientation rules — ten rules instead of the PC algorithm's three — that account for the possibility of hidden variables. These rules are more conservative, which means FCI orients fewer edges than the PC algorithm would on the same data. The trade-off is accuracy: when FCI orients an edge, the orientation is valid even if hidden confounders exist.

The practical difference is significant. Consider a business dataset with three measured variables: Marketing Spend, Feature Usage, and Revenue. The PC algorithm might output Marketing Spend -> Feature Usage -> Revenue, suggesting a causal chain. But if there is an unmeasured variable — say, Product Quality — that independently affects both Feature Usage and Revenue, the PC algorithm will produce a misleading graph. FCI would instead output Marketing Spend -> Feature Usage o-> Revenue, with the circle endpoint signaling that a hidden confounder might be present.

Figure 2: Accuracy of PC vs. FCI Under Increasing Hidden Confounders (Simulated, 1000 Observations)

The chart above, based on simulation studies following the methodology described by Colombo et al. (2012), shows the accuracy degradation pattern. When no hidden confounders exist, both algorithms perform well, with PC slightly better due to its stronger assumptions. But as hidden confounders enter the picture — the realistic case for business data — FCI maintains reasonable accuracy while PC collapses.

The lesson is straightforward: for business applications, FCI should be the default algorithm unless you have strong reasons to believe causal sufficiency holds. The cost of the more conservative output (more ambiguous edges) is far smaller than the cost of confidently wrong causal claims.


Interpreting Partial Ancestral Graphs

FCI produces a Partial Ancestral Graph (PAG), which uses a richer edge vocabulary than the simple directed graphs most analysts are accustomed to reading. The four edge types carry precise causal meanings:

X -> Y (directed): X is a cause of Y, or X is an ancestor of Y. No hidden common cause connects them that would reverse or invalidate this direction.

X <-> Y (bidirected): There exists a hidden common cause of both X and Y. Neither X causes Y nor Y causes X directly through the observed variables.

X o-> Y (partially directed): X might cause Y, or there might be a hidden common cause of both, but Y does not cause X. The circle represents genuine ambiguity that the data cannot resolve.

X o-o Y (nondirected): The algorithm cannot determine the causal relationship. Any configuration — X causes Y, Y causes X, hidden common cause — is consistent with the data.

For business interpretation, the key is understanding what each edge type implies about intervention. If you see X -> Y, intervening on X will affect Y. If you see X <-> Y, intervening on X will not affect Y (the association is spurious, driven by a hidden factor). If you see X o-> Y, intervening on X might affect Y — you need additional evidence to be sure.

This hierarchy gives business teams a principled way to prioritize. Instead of treating all correlations as equally actionable, the PAG classifies relationships into categories with different implications for decision-making. That classification alone — even when it leaves some edges ambiguous — is a substantial improvement over the undifferentiated correlational approach that dominates current practice.
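This hierarchy is mechanical enough to encode as a lookup. A toy helper (the category labels are this article's shorthand, not standard terminology):

```python
def intervention_implication(edge_type):
    """Map a PAG edge between X and Y to what it implies about
    intervening on X in order to move Y."""
    implications = {
        "->":  ("act",        "intervening on X should affect Y"),
        "<->": ("ignore",     "association comes from a hidden common cause; "
                              "intervening on X will not move Y"),
        "o->": ("experiment", "X may cause Y; confirm with a targeted test"),
        "o-o": ("unknown",    "any causal configuration fits the data"),
    }
    return implications[edge_type]
```

Even this crude triage (act / ignore / experiment / unknown) is a sharper prioritization scheme than ranking relationships by correlation strength.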


Applying Causal Discovery to Revenue Drivers

Consider a concrete business application. A B2B SaaS company wants to understand what drives revenue expansion among existing accounts. The dataset includes monthly observations across several hundred accounts:

  • Revenue (monthly recurring revenue per account)
  • Feature_Depth (number of distinct features used in the month)
  • Support_Tickets (number of support requests filed)
  • Training_Hours (hours of training or onboarding consumed)
  • Team_Size (number of active users within the account)
  • NPS_Score (net promoter score from quarterly surveys)
  • Contract_Length (months remaining on current contract)

A standard regression approach would estimate the coefficient of each variable on Revenue, controlling for the others. But this tells you nothing about causal structure. Does Training_Hours improve Revenue because training causes deeper feature adoption? Or do accounts that are already growing fast invest more in training? Regression cannot distinguish these.

Running the FCI algorithm on this data produces a PAG. A stylized but realistic output might show:

Training_Hours -> Feature_Depth -> Revenue (directed chain: training causes deeper feature use, which causes revenue growth)

Team_Size -> Feature_Depth (larger teams use more features)

Team_Size -> Revenue (larger teams generate more revenue)

Support_Tickets <-> NPS_Score (bidirected: a hidden factor, perhaps product complexity, drives both)

NPS_Score o-> Revenue (partially directed: satisfaction might cause revenue, or a hidden confounder might be at play)

Contract_Length o-o Revenue (nondirected: the causal relationship is ambiguous)

Figure 3: Correlation with Revenue vs. Actionability Score from Causal Discovery

The divergence between correlation magnitude and actionability is the entire point. Support Tickets have a meaningful negative correlation with Revenue, and a naive analysis might conclude that reducing support tickets would increase revenue. But the causal graph reveals that the association is confounded — a hidden variable (product complexity, perhaps, or account maturity) drives both. Reducing support tickets by making them harder to file would not increase revenue. It would probably decrease it.

Conversely, Training Hours has a moderate correlation with Revenue but a clear causal pathway through Feature Depth. This makes Training Hours one of the most actionable levers — investing in training programs should produce revenue growth through the mediating mechanism of deeper feature adoption.

This is the difference between analytics that describe and analytics that prescribe.


Applying Causal Discovery to Churn Analysis

Churn analysis is another domain where causal discovery reframes the problem. The standard approach builds a churn prediction model — logistic regression, random forest, gradient boosting — that identifies which variables predict churn. The assumption is that the strongest predictors are also the best intervention targets.

This assumption is often wrong.

A variable can predict churn without causing it. Declining login frequency predicts churn because it is a symptom of disengagement, not a cause. Forcing users to log in more frequently would not reduce churn — it would accelerate it.

Causal discovery can distinguish between predictors that are causal (intervening on them would change the outcome), symptomatic (they co-occur with churn but do not cause it), and confounded (a hidden factor causes both the predictor and churn).

Consider a churn analysis with the following variables: Login_Frequency, Feature_Breadth (number of features used), Integration_Count (number of third-party integrations), Champion_Departure (whether the internal advocate left the company), Billing_Issues (number of billing disputes), and Churned (binary outcome).

A prediction model might rank these by feature importance. Causal discovery would produce a graph that tells a different story:

Champion_Departure -> Integration_Count (when the champion leaves, integrations decay)

Champion_Departure -> Login_Frequency (the champion's departure reduces overall team engagement)

Integration_Count -> Churned (fewer integrations increase churn — this is a direct causal path)

Login_Frequency o-> Churned (ambiguous: might cause churn, might just reflect it)

Billing_Issues <-> Churned (confounded: some hidden factor, perhaps budget pressure, causes both)

The actionable insight is not to fix billing issues or increase login frequency. It is to (a) invest in multi-champion strategies so that a single departure does not unravel the account, and (b) deepen integrations, which create genuine switching costs and directly reduce churn probability. These are the structural interventions that the causal graph identifies and that correlational analysis would miss.


Software Tools for Causal Discovery

The barrier to applying causal discovery in practice has dropped substantially in recent years. Several mature software packages implement the algorithms discussed above, with varying trade-offs between usability and flexibility.

causal-learn (Python). Maintained by the Center for Causal Discovery at Carnegie Mellon, this is the most comprehensive Python library for causal discovery. It implements PC, FCI, GES, GFCI, and several other algorithms. The API is designed for researchers but is accessible to data scientists with moderate Python experience. It handles continuous, discrete, and mixed data types.

TETRAD (Java, with Python wrapper). The original implementation of the PC and FCI algorithms, also from Carnegie Mellon. TETRAD includes a graphical interface for exploring causal models interactively — useful for presenting results to non-technical stakeholders. The py-tetrad wrapper makes it callable from Python.

DoWhy (Python). Microsoft Research's causal inference library. DoWhy focuses on the downstream task — estimating causal effects once the graph is specified — but it integrates with causal discovery tools for graph learning. Its strength is the refutation framework: after estimating a causal effect, DoWhy provides automated sensitivity analyses to test how robust the estimate is to violations of assumptions.

gCastle (Python). Developed by Huawei's research group, gCastle implements gradient-based causal discovery methods (NOTEARS, DAG-GNN) that frame graph learning as a continuous optimization problem. These methods are newer and less theoretically understood but can scale to higher-dimensional datasets.

bnlearn (R). The most mature Bayesian network learning library in R. It implements constraint-based, score-based, and hybrid algorithms with extensive options for independence testing, scoring criteria, and model comparison. Particularly strong for categorical and mixed data.

Table 3: Software Tools for Causal Discovery in Business Applications

| Tool | Language | Key Algorithms | Hidden Confounders | Best For |
|---|---|---|---|---|
| causal-learn | Python | PC, FCI, GES, GFCI, LiNGAM | Yes (FCI, GFCI) | Research-grade analysis, comprehensive algorithm selection |
| TETRAD | Java (Python wrapper) | PC, FCI, GES, FGES, GFCI | Yes (FCI, GFCI) | Interactive exploration, presenting to stakeholders |
| DoWhy | Python | Graph specification + effect estimation | Via integration | Causal effect estimation and robustness testing |
| gCastle | Python | NOTEARS, DAG-GNN, gradient methods | Limited | High-dimensional data, continuous optimization approach |
| bnlearn | R | PC, GS, HC, Tabu, MMHC | Limited | Bayesian network learning, categorical data |

Here is a minimal working example using causal-learn to run the PC algorithm on a business dataset:

import numpy as np
import pandas as pd
from causallearn.search.ConstraintBased.PC import pc
from causallearn.utils.GraphUtils import GraphUtils
 
# Load your business dataset
df = pd.read_csv("account_metrics.csv")
variables = ["Revenue", "Feature_Depth", "Training_Hours",
             "Team_Size", "NPS_Score", "Support_Tickets"]
data = df[variables].dropna().values
 
# Run the PC algorithm
# alpha: significance level for conditional independence tests
# indep_test: "fisherz" for continuous data, "chisq" for discrete
cg = pc(data, alpha=0.05, indep_test="fisherz",
        node_names=variables)
 
# Visualize the learned causal graph
pdy = GraphUtils.to_pydot(cg.G, labels=variables)
pdy.write_png("causal_graph.png")
 
# Print adjacency matrix with edge types
print(cg.G.graph)
# 1 = tail, -1 = arrowhead, 0 = no edge
# Example: row i, col j = -1 and row j, col i = 1
#   means i --> j (i causes j)

For a team starting with causal discovery, the practical recommendation is to begin with causal-learn for graph learning and DoWhy for effect estimation. Use TETRAD's graphical interface for stakeholder communication. The workflow would be: learn the graph with causal-learn's FCI implementation, validate it against domain knowledge, then estimate specific causal effects using DoWhy with the learned graph as input.
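The estimation step that DoWhy automates can also be made concrete without any library. A hand-rolled back-door adjustment on simulated data shaped like the stylized revenue graph from earlier (all coefficients invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
# Team_Size confounds Training_Hours and Revenue;
# true total effect of Training_Hours on Revenue = 0.5 * 0.6 = 0.30
team = rng.normal(size=n)
training = 0.7 * team + rng.normal(size=n)
depth = 0.5 * training + 0.8 * team + rng.normal(size=n)
revenue = 0.6 * depth + 0.9 * team + rng.normal(size=n)

def ols_coef(y, *xs):
    """Least-squares coefficients (after the intercept) of y on the regressors."""
    X = np.column_stack([np.ones(len(y)), *xs])
    return np.linalg.lstsq(X, y, rcond=None)[0][1:]

naive = ols_coef(revenue, training)[0]           # confounded estimate
adjusted = ols_coef(revenue, training, team)[0]  # back-door: adjust for Team_Size
print(f"naive: {naive:.2f}  adjusted: {adjusted:.2f}  truth: 0.30")
```

The naive regression overstates the training effect because Team_Size opens a back-door path; adjusting for it recovers the true effect. Deriving which adjustment set the graph licenses is exactly what DoWhy does from the learned graph.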


Practical Challenges with Business Data

The textbook versions of causal discovery assume clean, stationary, complete data generated from a single causal structure. Business data violates every one of these assumptions. Understanding the specific ways it violates them is essential for applying causal discovery responsibly.

Non-stationarity. Business data is generated by processes that change over time. The causal structure linking marketing spend to revenue in Q1 may differ from Q4 due to seasonal demand patterns, competitive actions, or internal strategy shifts. Applying a causal discovery algorithm that assumes a fixed causal structure across a non-stationary dataset will produce a graph that represents no actual time period — a statistical chimera.

The mitigation is to either (a) restrict analysis to periods where stationarity is plausible, (b) use change-point detection to segment the data into stationary windows and run causal discovery separately on each, or (c) use algorithms designed for time-varying causal structures (PCMCI for time-series, or regime-switching extensions of standard algorithms).
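Option (b) can be wired up generically. A sketch that runs a discovery routine per window and keeps only edges recovered in every window; the discover callable is a placeholder (a crude correlation threshold here, an FCI wrapper in practice):

```python
import numpy as np

def stable_edges(data, window, discover):
    """Split a time-ordered dataset into consecutive windows, run `discover`
    on each, and keep only edges found in every window."""
    starts = range(0, len(data) - window + 1, window)
    results = [set(discover(data[s:s + window])) for s in starts]
    return set.intersection(*results) if results else set()

def corr_edges(block):
    """Stand-in discoverer: flag an 'edge' wherever |correlation| > 0.3."""
    c = np.corrcoef(block, rowvar=False)
    p = c.shape[0]
    return {(i, j) for i in range(p) for j in range(i + 1, p) if abs(c[i, j]) > 0.3}

rng = np.random.default_rng(1)
x = rng.normal(size=2000)
y = 0.9 * x + rng.normal(size=2000)
data = np.column_stack([x, y, rng.normal(size=2000)])  # third column is pure noise
edges = stable_edges(data, window=500, discover=corr_edges)
```

Edges that survive every window are at least consistent with a stable structure; edges that appear only in some windows are candidates for regime-dependent relationships worth investigating separately.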

Missing values. Business datasets almost always have missing data. NPS scores are collected quarterly, not monthly. Some accounts do not report team size. Support ticket data may be incomplete for accounts using third-party support channels. Most causal discovery implementations either listwise-delete (losing observations) or require complete data. Both approaches introduce bias.

The mitigation is multiple imputation before running the algorithm, with causal discovery performed on each imputed dataset and results aggregated. This is computationally expensive but principled. An alternative is to use the test-wise deletion approach implemented in some versions of the PC algorithm, which uses all available data for each individual independence test.

Small samples. Enterprise B2B datasets commonly have 200-500 accounts. Conditional independence tests lose power with small samples, and the PC algorithm's cascade structure means that a single incorrect independence decision early in the process can propagate errors throughout the entire graph.

The mitigation is threefold: use conservative significance levels (alpha = 0.01 rather than 0.05) to reduce false positives, limit the number of variables to those with strong theoretical justification rather than including everything available, and use bootstrapping to assess the stability of discovered edges.
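The bootstrap check is a thin wrapper around whatever discovery routine is in use; discover is again a placeholder (a crude correlation threshold here, an FCI wrapper in practice):

```python
import numpy as np

def edge_stability(data, discover, n_boot=50, seed=0):
    """Re-run `discover` on bootstrap resamples of the rows;
    return each edge's selection frequency across resamples."""
    rng = np.random.default_rng(seed)
    n = len(data)
    counts = {}
    for _ in range(n_boot):
        resample = data[rng.integers(0, n, size=n)]
        for edge in discover(resample):
            counts[edge] = counts.get(edge, 0) + 1
    return {edge: c / n_boot for edge, c in counts.items()}

def corr_edges(block):
    """Stand-in discoverer: flag an 'edge' wherever |correlation| > 0.3."""
    c = np.corrcoef(block, rowvar=False)
    p = c.shape[0]
    return {(i, j) for i in range(p) for j in range(i + 1, p) if abs(c[i, j]) > 0.3}

rng = np.random.default_rng(2)
x = rng.normal(size=1000)
y = 0.9 * x + rng.normal(size=1000)
freq = edge_stability(np.column_stack([x, y]), corr_edges)
```

Edges with selection frequency near 1.0 are stable under resampling; edges that flicker in and out at small sample sizes should not be presented to stakeholders as findings.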

Measurement error. Business metrics are proxies. Revenue is a proxy for value delivered. NPS is a proxy for satisfaction. Feature usage counts are a proxy for engagement. When the measured variables are noisy proxies for the true causal variables, causal discovery algorithms can produce misleading results — finding edges where none exist or missing edges that are present.

Multicollinearity. Business variables tend to be highly correlated. Team size, revenue, feature usage, and contract value often move together simply because they all reflect account size. The conditional independence tests that causal discovery relies on can become unreliable when variables are near-collinear.


Combining Causal Discovery with Domain Expertise

Causal discovery algorithms are not oracles. They are hypothesis generators that operate under specific assumptions. The most productive use of these algorithms is not autonomous — it is collaborative, combining algorithmic output with domain knowledge.

Domain expertise enters the process at three points:

Before running the algorithm: variable selection and constraint specification. The choice of which variables to include is itself a causal judgment. Including irrelevant variables increases the chance of spurious edges. Excluding relevant variables violates the causal sufficiency assumption. Domain experts should curate the variable set based on theoretical understanding of the business process.

Additionally, most causal discovery implementations allow users to specify background knowledge — constraints on the graph that the algorithm must respect. For example: "Marketing spend cannot be caused by revenue in the same month" (temporal ordering). Or: "Industry vertical cannot be caused by any other variable in the dataset" (it is exogenous). These constraints reduce the search space and improve accuracy.
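One minimal way to encode such constraints is a tier-based representation, sketched below with hypothetical variable names; libraries such as causal-learn expose a comparable construct (a `BackgroundKnowledge` object with forbidden and required edges) that these sets would feed into:

```python
# Temporal tiers: a variable in a later tier cannot cause one in an
# earlier tier. Exogenous variables sit in tier 0 and nothing may
# cause them. (Variable names here are illustrative.)
TIERS = {
    "industry_vertical": 0,   # exogenous
    "marketing_spend":   1,
    "feature_depth":     2,
    "revenue":           3,
}

EXOGENOUS = {"industry_vertical"}

def forbidden_edges(tiers, exogenous):
    """Enumerate directed edges the discovery algorithm must not orient."""
    forbidden = set()
    for a, ta in tiers.items():
        for b, tb in tiers.items():
            if a == b:
                continue
            if ta > tb:              # later tier cannot cause earlier tier
                forbidden.add((a, b))
            if b in exogenous:       # nothing causes an exogenous variable
                forbidden.add((a, b))
    return forbidden
```

The exact API for passing these constraints in is library-specific, but the content — a set of forbidden orientations derived from temporal ordering and exogeneity — is the same everywhere.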

During interpretation: adjudicating ambiguous edges. The algorithm will produce some edges that are undirected or partially directed. Domain experts can often resolve these ambiguities. If the algorithm produces Feature_Depth o-o Team_Size, a product leader who knows that feature adoption increases as teams onboard new users can orient the edge as Team_Size -> Feature_Depth. This is not overriding the algorithm — it is complementing it with information the data does not contain.

After interpretation: plausibility assessment. Every discovered graph should be reviewed for face validity. If the algorithm claims that Support Ticket volume causes Revenue growth, something is wrong — either the data is confounded, the sample is too small, or a variable is missing. Domain review catches errors that statistical tests cannot.

The workflow that produces the best results in practice:

  1. Domain experts specify variables, temporal ordering, and known causal constraints.
  2. The algorithm runs with these constraints as background knowledge.
  3. Domain experts review the output, flagging edges that are implausible or surprising.
  4. Surprising-but-plausible edges become hypotheses for targeted investigation (additional analysis or, where feasible, experiments).
  5. Implausible edges are investigated for data quality issues or missing confounders.

This iterative, human-in-the-loop approach avoids both extremes: the pure correlation approach that ignores causal structure entirely, and the pure algorithmic approach that trusts the output without sanity checks.


Validation Strategies

A causal graph learned from observational data should be validated before it informs business decisions. Several validation strategies exist, with different levels of rigor and feasibility.

Stability analysis (bootstrapping). Run the causal discovery algorithm on many bootstrap resamples of the data. Edges that appear in 80% or more of bootstrap samples are stable; edges that appear in fewer than 50% are fragile. Focus attention and resources on stable edges. Bootstrapping needs nothing library-specific — it is a thin wrapper around whatever discovery algorithm you run.

Cross-validation against held-out data. Split the data temporally — learn the graph from the first two-thirds and test its predictions on the final third. If the conditional independence relationships implied by the graph hold in the test set, the structure has predictive validity. This does not prove the graph is causal, but it establishes that the statistical patterns are reproducible.

Intervention testing. The strongest validation: take a directed edge from the discovered graph, design an experiment that intervenes on the cause variable, and check whether the effect variable responds as predicted. If the graph says Training_Hours -> Feature_Depth -> Revenue, run a randomized trial that increases training for a subset of accounts and measure the downstream effects. This closes the loop between discovery and confirmation.

Sensitivity analysis. Use DoWhy's refutation framework to test how sensitive the estimated causal effects are to assumption violations. Key tests include: adding a random common cause (if the estimated effect changes substantially, it is fragile), replacing the treatment with a placebo (the estimated effect should vanish), and testing subset stability (the effect should hold across data subsets).
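DoWhy wraps these checks behind its refutation API, but the placebo idea is simple enough to sketch from scratch: permute the treatment column, re-estimate, and confirm the "effect" collapses toward zero. A minimal version with a one-regressor OLS estimate:

```python
import random
from statistics import mean

def ols_slope(treatment, outcome):
    """One-regressor OLS effect estimate: cov(t, y) / var(t)."""
    mt, my = mean(treatment), mean(outcome)
    cov = sum((t - mt) * (y - my) for t, y in zip(treatment, outcome))
    var = sum((t - mt) ** 2 for t in treatment)
    return cov / var

def placebo_refute(treatment, outcome, n_perm=200, seed=0):
    """Placebo-treatment refutation: replace the treatment with random
    permutations of itself and re-estimate. Returns the mean absolute
    placebo 'effect', which should be near zero for a genuine effect."""
    rng = random.Random(seed)
    t = list(treatment)
    placebo_effects = []
    for _ in range(n_perm):
        rng.shuffle(t)
        placebo_effects.append(abs(ols_slope(t, outcome)))
    return mean(placebo_effects)
```

If the placebo estimates are comparable in size to the original estimate, the original estimate was tracking something other than the treatment.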

Multi-algorithm comparison. Run PC, FCI, GES, and a hybrid algorithm on the same data. Edges that appear consistently across algorithms with different assumptions are more credible than edges that depend on a specific algorithm's idiosyncrasies.

The validation hierarchy, from least to most rigorous:

  1. Face validity check by domain experts
  2. Bootstrap stability analysis
  3. Multi-algorithm consistency
  4. Cross-validation on held-out data
  5. Sensitivity analysis with DoWhy
  6. Targeted experimental confirmation

A responsible causal discovery practice applies at least the first four before making resource allocation decisions based on the discovered graph.


The Causal Discovery Decision Framework

When should a business team invest in causal discovery, and which algorithm should they use? The following framework provides structured guidance.

Step 1: Assess the question. Is the question causal? "What predicts churn?" is a prediction question — standard ML is sufficient. "What causes churn, and what should we change to reduce it?" is a causal question — causal discovery adds value.

Step 2: Assess experimental feasibility. Can you run a randomized experiment to answer the question? If yes, and the timeline is acceptable, run the experiment. Nothing beats experimental evidence for causal claims. If no, proceed to causal discovery.

Step 3: Assess data quality. Do you have at least 200 observations? Are the variables measured with reasonable accuracy? Is the data approximately stationary over the analysis period? If any of these fail, fix the data issues before running the algorithm.

Step 4: Choose the algorithm. If hidden confounders can be ruled out (rare in business settings), use the PC algorithm. If hidden confounders are plausible (the safer default assumption), use FCI or GFCI. If you have time-series data, use PCMCI. If you have more than 50 variables, use FGES or NOTEARS for scalability.

Step 5: Specify background knowledge. Encode temporal ordering, exogenous variables, and known causal relationships as constraints. This is not optional — it is essential for producing useful results.

Step 6: Run, interpret, and validate. Execute the algorithm, review the output with domain experts, run bootstrap stability analysis, and compare across algorithms. Treat the output as hypotheses, not conclusions.

Step 7: Prioritize actionable edges. Focus on directed edges between manipulable variables and business outcomes. These are the intervention targets. Design targeted experiments where feasible to confirm the most important causal claims before committing significant resources.

Figure 4: Recommended Effort Allocation Across Causal Discovery Workflow Stages

Notice where the effort should concentrate. Running the algorithm — the part that seems most technically impressive — accounts for a small fraction of the total effort and impact. The high-impact activities are upstream (variable selection, background knowledge specification) and downstream (validation, expert review). The algorithm is the middle step. The judgment on either side of it is what determines whether the output is useful.


Further Reading

References

  • Spirtes, P., Glymour, C., & Scheines, R. (2000). Causation, Prediction, and Search (2nd ed.). MIT Press. The foundational text for constraint-based causal discovery, introducing the PC and FCI algorithms.

  • Pearl, J. (2009). Causality: Models, Reasoning, and Inference (2nd ed.). Cambridge University Press. The theoretical framework underlying causal discovery, including the do-calculus and structural causal models.

  • Colombo, D., Maathuis, M. H., Kalisch, M., & Richardson, T. S. (2012). Learning high-dimensional directed acyclic graphs with latent and selection variables. Annals of Statistics, 40(1), 294-321. Foundational work on the FCI algorithm and its properties under hidden confounders.

  • Chickering, D. M. (2002). Optimal structure identification with greedy search. Journal of Machine Learning Research, 3, 507-554. The theoretical justification for the GES algorithm and its consistency guarantees.

  • Zheng, X., Aragam, B., Ravikumar, P., & Xing, E. P. (2018). DAGs with NO TEARS: Continuous optimization for structure learning. Advances in Neural Information Processing Systems, 31. The NOTEARS approach to continuous optimization for causal discovery.

  • Runge, J., Nowack, P., Kretschmer, M., Flaxman, S., & Sejdinovic, D. (2019). Detecting and quantifying causal associations in large nonlinear time series datasets. Science Advances, 5(11). The PCMCI algorithm for causal discovery in time-series data, relevant to business metrics measured over time.

  • Sharma, A., & Kiciman, E. (2020). DoWhy: An end-to-end library for causal inference. arXiv preprint arXiv:2011.04216. The DoWhy framework for causal effect estimation and robustness testing.

  • Zheng, Y., Huang, B., Chen, W., Ramsey, J., Gong, M., Cai, R., ... & Zhang, K. (2024). causal-learn: Causal discovery in Python. Journal of Machine Learning Research, 25(60), 1-8. The causal-learn library documentation and methodological foundations.

  • Glymour, C., Zhang, K., & Spirtes, P. (2019). Review of causal discovery methods based on graphical models. Frontiers in Genetics, 10, 524. A comprehensive review of causal discovery methods and their assumptions.

  • Peters, J., Janzing, D., & Schölkopf, B. (2017). Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press. A modern treatment of causal inference with emphasis on identifiability and algorithmic approaches.
