Glossary · E-commerce ML

Contextual Bandits

Definition

Contextual bandits are online learning algorithms that choose an action (a price, a layout, a recommendation) given a context (user features), observe a reward, and update their policy to balance exploration and exploitation. They are the modern foundation of real-time personalization and dynamic pricing.
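The choose/observe/update loop described above can be sketched in its simplest form with an epsilon-greedy policy. This is a toy illustration, not a production recipe: the two-arm environment, the payoff probabilities, and the `EpsilonGreedy` class are all invented for the example, and for simplicity the policy here ignores the context when estimating rewards (a truly contextual policy would condition its estimates on the user features).

```python
import random

class EpsilonGreedy:
    """Toy epsilon-greedy policy: exploits the arm with the best running
    average reward, and explores a random arm at rate epsilon."""
    def __init__(self, n_actions, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_actions
        self.means = [0.0] * n_actions

    def choose(self, context):
        if random.random() < self.epsilon:
            return random.randrange(len(self.means))      # explore
        return max(range(len(self.means)), key=self.means.__getitem__)  # exploit

    def update(self, context, action, reward):
        self.counts[action] += 1
        # incremental running-mean update
        self.means[action] += (reward - self.means[action]) / self.counts[action]

# Hypothetical environment: arm 1 converts at 0.7, arm 0 at 0.3.
random.seed(0)
policy = EpsilonGreedy(n_actions=2, epsilon=0.1)
for t in range(2000):
    context = None                       # user features would go here
    action = policy.choose(context)      # balance exploration/exploitation
    reward = 1.0 if random.random() < (0.3, 0.7)[action] else 0.0
    policy.update(context, action, reward)
```

After enough rounds, traffic concentrates on the better arm while the epsilon fraction keeps estimating the worse one, which is exactly the regret-minimizing behavior the definition refers to.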

Unlike A/B tests, which split traffic in fixed proportions until an experiment concludes, contextual bandits learn continuously and shift traffic toward the currently best-performing option, minimizing cumulative regret over time. Algorithms such as LinUCB, Thompson Sampling, and neural bandits suit different context and reward structures. Fairness-constrained variants add explicit constraints to prevent systematic disadvantage for protected groups. Practical deployment requires careful reward definition, off-policy evaluation, and guardrails against distribution shift.
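Of the algorithms named above, LinUCB is the most compact to sketch: each arm keeps a ridge-regression estimate of reward as a linear function of the context, and picks the arm whose estimate plus an uncertainty bonus is highest. The code below is a minimal sketch of that idea; the arm count, context dimension, `alpha` value, and the simulated linear-reward environment are all assumptions made for the example.

```python
import numpy as np

class LinUCBArm:
    """One arm of a disjoint-model LinUCB bandit (sketch)."""
    def __init__(self, dim, alpha=1.0):
        self.alpha = alpha        # width of the exploration bonus
        self.A = np.eye(dim)      # ridge-regression Gram matrix
        self.b = np.zeros(dim)    # reward-weighted sum of contexts

    def ucb(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b    # ridge estimate of the reward weights
        # predicted reward + upper-confidence exploration bonus
        return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

# Hypothetical setup: 3 arms, 4-dimensional user contexts,
# rewards simulated as linear in the context plus noise.
rng = np.random.default_rng(0)
arms = [LinUCBArm(dim=4) for _ in range(3)]
true_w = rng.normal(size=(3, 4))   # unknown per-arm reward weights

for t in range(500):
    x = rng.normal(size=4)
    chosen = max(range(3), key=lambda a: arms[a].ucb(x))
    reward = true_w[chosen] @ x + rng.normal(scale=0.1)
    arms[chosen].update(x, reward)
```

The exploration bonus shrinks as an arm accumulates observations in a given direction of context space, so the policy naturally shifts from exploring to exploiting, per arm and per context.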
