multi-armed bandit


description: a problem in which a fixed, limited set of resources must be allocated among competing (alternative) choices so as to maximize the expected gain, when each choice's properties are only partially known at the time of allocation and may become better understood as time passes or as resources are allocated to that choice

8 results

pages: 523 words: 143,139

Algorithms to Live By: The Computer Science of Human Decisions
by Brian Christian and Tom Griffiths
Published 4 Apr 2016

He then introduced the mathematical form of the problem to his colleagues, and it ultimately became generalized to the multi-armed bandit. A comprehensive introduction to multi-armed bandits appears in Berry and Fristedt, Bandit Problems. Our focus in this chapter is on bandits where each arm either produces a payoff or doesn’t, with different probabilities but the same payoff amount on all arms. This is known as a Bernoulli bandit in the literature, as the probability distribution that describes a coin flip is called the Bernoulli distribution (after the seventeenth-century Swiss mathematician Jacob Bernoulli). Other kinds of multi-armed bandits are also possible, with unknown distributions of different kinds characterizing the payoffs from each arm.
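
To make the Bernoulli bandit concrete, here is a minimal simulation sketch in Python (the book itself contains no code; the payoff probabilities and seed are invented for illustration): every arm pays the same amount, 1, with its own fixed but unknown probability, and 0 otherwise.

```python
import random

# A minimal Bernoulli bandit: each arm pays 1 with its own fixed probability,
# otherwise 0. The probabilities and seed below are hypothetical.

class BernoulliBandit:
    def __init__(self, payoff_probs, seed=0):
        self.payoff_probs = payoff_probs
        self.rng = random.Random(seed)

    def pull(self, arm):
        """Pull one arm; pay 1 with that arm's probability, otherwise 0."""
        return 1 if self.rng.random() < self.payoff_probs[arm] else 0

bandit = BernoulliBandit([0.2, 0.5, 0.65])
print([bandit.pull(2) for _ in range(10)])   # ten pulls of the best arm
```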

best algorithms to use remain hotly contested: Chris Stucchio, for instance, penned a cutting article titled “Why Multi-armed Bandit Algorithms Are Superior to A/B Testing,” which was then countered by an equally cutting article called “Don’t Use Bandit Algorithms—They Probably Won’t Work for You”—also written by Chris Stucchio. See https://www.chrisstucchio.com/blog/2012/bandit_algorithms_vs_ab.html and https://www.chrisstucchio.com/blog/2015/dont_use_bandits.html. Stucchio’s 2012 post was written partly in reference to an article by Paras Chopra titled “Why Multi-armed Bandit Algorithm Is Not ‘Better’ than A/B Testing” (https://vwo.com/blog/multi-armed-bandit-algorithm/), which was itself written partly in reference to an article by Steve Hanov titled “20 lines of code that will beat A/B testing every time” (http://stevehanov.ca/blog/index.php?

Gittins (albeit many years before the first episode of Deal or No Deal aired) realized that the multi-armed bandit problem is no different. For every slot machine we know little or nothing about, there is some guaranteed payout rate which, if offered to us in lieu of that machine, will make us quite content never to pull its handle again. This number—which Gittins called the “dynamic allocation index,” and which the world now knows as the Gittins index—suggests an obvious strategy on the casino floor: always play the arm with the highest index.* In fact, the index strategy turned out to be more than a good approximation. It completely solves the multi-armed bandit with geometrically discounted payoffs.
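
The exact Gittins index is tedious to compute, but its defining idea (the guaranteed payout rate that would make you content never to pull the arm again) can be sketched directly. The Python below approximates the index of a Bernoulli arm with a Beta posterior via the standard retirement-option calibration: binary-search for the per-step payout at which playing now, with the option to retire later, is worth exactly as much as retiring immediately. The discount factor, horizon cutoff, and example arm counts are illustrative assumptions, and the truncated dynamic program makes this an approximation rather than the exact index.

```python
from functools import lru_cache

def gittins_index(a, b, gamma=0.9, horizon=100, tol=1e-4):
    """Approximate Gittins index of a Bernoulli arm whose payoff probability has
    a Beta(a, b) posterior, under geometric discounting by gamma.

    Retirement-option calibration: find the guaranteed per-step payout lam at
    which pulling the arm (keeping the option to retire later) is exactly as
    good as retiring now. The recursion is truncated after `horizon` pulls, so
    this is an approximation, not the exact index.
    """
    def advantage_of_playing(lam):
        retire = lam / (1.0 - gamma)          # value of taking lam forever

        @lru_cache(maxsize=None)
        def v(s, f):                          # s extra successes, f extra failures observed
            if s + f >= horizon:
                return retire                 # truncate: assume we retire from here on
            p = (a + s) / (a + s + b + f)     # posterior mean payoff probability
            play = p * (1.0 + gamma * v(s + 1, f)) + (1.0 - p) * gamma * v(s, f + 1)
            return max(retire, play)

        return v(0, 0) - retire               # > 0 means playing beats retiring now

    lo, hi = 0.0, 1.0                         # a Bernoulli arm's index lies in [0, 1]
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if advantage_of_playing(mid) > 0.0:
            lo = mid                          # still worth playing: the index is higher
        else:
            hi = mid
    return 0.5 * (lo + hi)

# The index strategy on the casino floor: always pull the arm whose posterior has
# the highest index. The Beta parameters below are hypothetical observation counts.
arms = {"never tried": (1, 1), "3 wins, 1 loss": (4, 2), "1 win, 3 losses": (2, 4)}
print(max(arms, key=lambda name: gittins_index(*arms[name])))
```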

pages: 543 words: 153,550

Model Thinker: What You Need to Know to Make Data Work for You
by Scott E. Page
Published 27 Nov 2018

To make sense of how opioids received approval and how the epidemic arose, we apply four models to generate some core intuitions as to how the crisis came to be. The first model, the multi-armed bandit model, explains why opioids were approved for use. When seeking drug approval, a pharmaceutical company runs clinical trials to demonstrate drug efficacy and a lack of deleterious side effects. We can model a clinical trial as a multi-armed bandit problem where one arm corresponds to prescribing the new drug and the other arm corresponds to a placebo or the existing drug.

A Model of Opioid Approval: Multi-Armed Bandit Model

To demonstrate their efficacy, opioids were tested against placebos. In clinical trials, patients were randomly assigned to take either opioids or a placebo.

According to the first model, charismatic CEOs who can convince well-connected employees can introduce new strategies that trump culture. According to the second model, culture trumps weak incentives but not strong ones.

27. Multi-Armed Bandit Problems

There’s one thing I’m really good at, and that’s hitting the ball over a net, in a box. I’m excellent. —Serena Williams

In this chapter, we add uncertainty to the problem of learning the best alternative to create a class of models known as multi-armed bandit problems. In bandit problems, rewards from alternatives are distributions rather than fixed amounts. Bandit problems apply to a wide variety of real-world situations.

Bayesian Multi-Armed Bandit Problems

In a Bayesian bandit problem, the decision maker has prior beliefs over the reward distributions of the alternatives. Given these prior beliefs, a decision maker can quantify the trade-off between exploration and exploitation and (in theory) make optimal decisions in each period. However, except for the simplest of bandit problems, determining the optimal action requires rather tedious calculations. In real-world applications, these exact calculations may be impractical, obliging decision makers to rely on approximations.

Bayesian Multi-Armed Bandit Problems

A collection of alternatives {A, B, C, D, …, N} has associated reward distributions {f(A), f(B), f(C), f(D), …, f(N)}.
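
One widely used approximation in exactly this setting is Thompson sampling: keep a Beta posterior over each alternative's payoff probability, draw a sample from every posterior each period, and choose the alternative whose draw is largest. The sketch below is a minimal Bernoulli-reward illustration, not the book's own procedure; the reward probabilities, round count, and seed are invented.

```python
import random

# Thompson sampling for a Bayesian Bernoulli bandit: each arm keeps a
# Beta(successes + 1, failures + 1) posterior; each round we sample from every
# posterior and pull the arm with the largest draw.

def thompson_sampling(true_probs, rounds=1000, seed=0):
    rng = random.Random(seed)
    successes = [0] * len(true_probs)
    failures = [0] * len(true_probs)
    total_reward = 0
    for _ in range(rounds):
        draws = [rng.betavariate(successes[i] + 1, failures[i] + 1)
                 for i in range(len(true_probs))]
        arm = max(range(len(true_probs)), key=lambda i: draws[i])
        reward = 1 if rng.random() < true_probs[arm] else 0   # simulate the pull
        total_reward += reward
        if reward:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return total_reward, successes, failures

# Hypothetical payoff probabilities, unknown to the algorithm.
print(thompson_sampling([0.3, 0.5, 0.7]))
```

Because draws from uncertain posteriors are occasionally optimistic, under-explored alternatives keep getting tried, which is how the scheme trades exploration against exploitation without the tedious exact calculations the text mentions.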

pages: 406 words: 109,794

Range: Why Generalists Triumph in a Specialized World
by David Epstein
Published 1 Mar 2019

That could be a problem of grit, or it could be a decision made in response to match quality information that could not have been gleaned without giving it a try. Carnegie Mellon economics and statistics professor Robert A. Miller modeled career matching—and a decision to attend a military academy is a major career choice—as a “multi-armed bandit process.” “One-armed bandit” is slang for a slot machine. A multi-armed bandit process refers to a hypothetical scenario: a single gambler is sitting in front of an entire row of slot machines; each machine has its own unique probability of reward with each pull; the gambler’s challenge is to test the different machines and try to figure out the best way to allocate their lever pulls to maximize rewards.

Persevering through difficulty is a competitive advantage for any traveler of a long road, but he suggested that knowing when to quit is such a big strategic advantage that every single person, before undertaking an endeavor, should enumerate conditions under which they should quit. The important trick, he said, is staying attuned to whether switching is simply a failure of perseverance, or astute recognition that better matches are available. Beast Barracks is perfect for a multi-armed bandit approach to quitting. A group of high achievers, not one of whom has an iota of military experience, pulls the West Point “lever,” so to speak. That is, they begin a high-risk, high-reward program and from week one get a massive information signal about whether military discipline is for them.

The Grit Scale statement “I have been obsessed with a certain idea or project for a short time but later lost interest” is Van Gogh in a nutshell, at least up until the final few years of his life when he settled on his unique style and creatively erupted. Van Gogh was an example of match quality optimization, Robert Miller’s multi-armed bandit process come to life. He tested options with maniacal intensity and got the maximum information signal about his fit as quickly as possible, and then moved to something else and repeated, until he had zigzagged his way to a place no one else had ever been, and where he alone excelled. Van Gogh’s Grit Scale score, according to Naifeh’s assessment, was flush with hard work but low on sticking with every goal or project.

pages: 625 words: 167,349

The Alignment Problem: Machine Learning and Human Values
by Brian Christian
Published 5 Oct 2020

As Stuart Russell put it in his original 1998 paper, “Can we determine the reward function by observation during rather than after learning?” Russell, “Learning Agents for Uncertain Environments (Extended Abstract).”

31. This remains very much an open research question. For recent work, see Chan et al., “The Assistive Multi-Armed Bandit.”

32. Berkeley’s Smitha Milli and Anca Drăgan have explored this question: see Milli and Drăgan, “Literal or Pedagogic Human?”

33. Stefano Ermon, interview by Ariel Conn, Future of Life Institute, January 26, 2017, https://futureoflife.org/2017/01/26/stefano-ermon-interview/.

34. Roman Yampolskiy, interview by Ariel Conn, Future of Life Institute, January 18, 2017, https://futureoflife.org/2017/01/18/roman-yampolskiy-interview/.

35.

In Machine Learning: Proceedings of the Tenth International Conference, 41–48, 1993.

Caswell, Estelle. “Color Film Was Built for White People. Here’s What It Did to Dark Skin.” Vox, September 18, 2015. https://www.vox.com/2015/9/18/9348821/photography-race-bias.

Chan, Lawrence, Dylan Hadfield-Menell, Siddhartha Srinivasa, and Anca Drăgan. “The Assistive Multi-Armed Bandit.” In 2019 14th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 354–63. IEEE, 2019.

“A Chance to Fix Parole in New York.” New York Times, September 4, 2015.

Chen, Dawn, Joshua C. Peterson, and Thomas L. Griffiths. “Evaluating Vector-Space Models of Analogy.” In Proceedings of the 39th Annual Conference of the Cognitive Science Society, 2017.

pages: 344 words: 96,020

Hacking Growth: How Today's Fastest-Growing Companies Drive Breakout Success
by Sean Ellis and Morgan Brown
Published 24 Apr 2017

Recall from the last chapter, for example, how the engineers on the Pinterest engagement growth team built the Copytune machine learning program in order to supercharge the speed of experimentation with copy in 30 languages across the numerous emails used to retain existing Pinterest users. This is an example of what’s called a multivariate test, which goes beyond comparing two alternatives to comparing every possible combination of each element of a message to find the highest-performing permutation. Or take what’s called a multi-armed bandit test, a more sophisticated testing scheme used by companies to find winning results faster. We’ll introduce more types of tests that can be run and explain them in more detail later in the book.

EXPERIMENTS WITHIN THE PRODUCT

The more complicated tests that require significant engineering time are usually those that change the products themselves.
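
For a sense of how a bandit-style test differs from a fixed A/B split, here is a minimal epsilon-greedy sketch in Python: a small share of visitors is assigned to random variants while the rest go to the current best performer, so traffic shifts toward the winning copy during the test rather than after it. The conversion rates, epsilon, and visitor count are made up for illustration; this is not the specific scheme the book describes.

```python
import random

# Epsilon-greedy allocation of visitors to message variants: explore with
# probability epsilon, otherwise show the variant with the best observed
# conversion rate so far. All numbers below are hypothetical.

def epsilon_greedy_test(conversion_rates, visitors=10_000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    shows = [0] * len(conversion_rates)
    conversions = [0] * len(conversion_rates)
    for _ in range(visitors):
        if rng.random() < epsilon or sum(shows) == 0:
            variant = rng.randrange(len(conversion_rates))        # explore
        else:
            variant = max(range(len(conversion_rates)),           # exploit current leader
                          key=lambda i: conversions[i] / shows[i] if shows[i] else 0.0)
        shows[variant] += 1
        if rng.random() < conversion_rates[variant]:              # simulated conversion
            conversions[variant] += 1
    return shows, conversions

print(epsilon_greedy_test([0.02, 0.025, 0.031]))
```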

pages: 416 words: 112,268

Human Compatible: Artificial Intelligence and the Problem of Control
by Stuart Russell
Published 7 Oct 2019

Neither author refers to the early work of Harsanyi, “Games with incomplete information, Parts I–III,” or Cyert and de Groot, “Adaptive utility.”

41. An initial paper on helping humans who don’t know their own preferences and are learning about them: Lawrence Chan et al., “The assistive multi-armed bandit,” in Proceedings of the 14th ACM/IEEE International Conference on Human–Robot Interaction (HRI), ed. David Sirkin et al. (IEEE, 2019).

42. Eliezer Yudkowsky, in Coherent Extrapolated Volition (Singularity Institute, 2004), lumps all these aspects, as well as plain inconsistency, under the heading of muddle—a term that has not, unfortunately, caught on.

43.

pages: 665 words: 159,350

Shape: The Hidden Geometry of Information, Biology, Strategy, Democracy, and Everything Else
by Jordan Ellenberg
Published 14 May 2021

One more thing I’d like to acknowledge is all the things it would have been great to write about in a book about geometry, but aren’t here, because I ran out of time and space. I had meant to write about “lumpers and splitters” and the theory of clustering; Judea Pearl and the use of directed acyclic graphs in the study of causality; navigation charts of the Marshall Islands; exploitation versus exploration and multi-armed bandits; binocular vision in praying mantis larvae; the maximum size of a subset of the N by N grid with no three points forming an isosceles triangle (let me know if you solve this actually); much more on dynamics, starting from Poincaré and going through billiards, Sinai, and Mirzakhani; much more on Descartes, who launched the unification of algebra and geometry and somehow ended up nearly absent here; and much more on Grothendieck, who took that unification much farther than Descartes could have dreamed and somehow ditto; catastrophe theory; the tree of life.

pages: 2,466 words: 668,761

Artificial Intelligence: A Modern Approach
by Stuart Russell and Peter Norvig
Published 14 Jul 2019

Similarity search in high dimensions via hashing. In Proc. 25th Very Large Database (VLDB) Conference.

Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2016). Region-based convolutional networks for accurate object detection and segmentation. PAMI, 38, 142–58.

Gittins, J. C. (1989). Multi-Armed Bandit Allocation Indices. Wiley.

Gittins, J. C. and Jones, D. M. (1974). A dynamic allocation index for the sequential design of experiments. In Gani, J. (Ed.), Progress in Statistics. North-Holland.

Glanc, A. (1978). On the etymology of the word “robot”. SIGART Newsletter, 67, 12.

Glickman, M.