reinforcement learning from human feedback


description: a variant of reinforcement learning in which human preference judgments supply the reward signal

2 results

pages: 848 words: 227,015

On the Edge: The Art of Risking Everything
by Nate Silver
Published 12 Aug 2024

Early LLMs, when you asked them what the Moon is made out of, would often respond with “cheese.” This answer might minimize the loss function in the training data because the Moon being made out of cheese is a centuries-old trope. But this is still misinformation, however harmless in this instance. So LLMs undergo another stage in their training: what’s called RLHF, or reinforcement learning from human feedback. Basically, it works like this: the AI labs hire cheap labor—often from Amazon’s Mechanical Turk, where you can employ human AI trainers from any of roughly fifty countries—to score the model’s answers in the form of an A/B test:

A: The Moon is made out of cheese.
B: The Moon is primarily composed of a variety of rocks and minerals.
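In practice, A/B judgments like these are typically used to fit a reward model over pairwise comparisons. Below is a minimal sketch of that idea with a toy linear reward model, synthetic response embeddings, and a Bradley-Terry loss; the data, model, and dimensions are invented for illustration and are not any lab’s actual pipeline.

```python
# Minimal sketch: learn a reward model from pairwise A/B preferences
# (Bradley-Terry loss). All data here is synthetic and illustrative.
import numpy as np

rng = np.random.default_rng(0)

dim = 8
w = np.zeros(dim)  # reward-model weights, learned from comparisons

def reward(x, weights):
    """Scalar score the reward model assigns to a response embedding."""
    return x @ weights

# Synthetic comparisons: (chosen, rejected) pairs, standing in for a human
# rater preferring answer B ("rocks and minerals") over answer A ("cheese").
true_w = rng.normal(size=dim)  # hidden "rater taste" used to label pairs
pairs = []
for _ in range(500):
    a, b = rng.normal(size=dim), rng.normal(size=dim)
    chosen, rejected = (a, b) if reward(a, true_w) > reward(b, true_w) else (b, a)
    pairs.append((chosen, rejected))

# Maximize log sigmoid(r(chosen) - r(rejected)) by gradient ascent.
lr = 0.05
for epoch in range(20):
    for chosen, rejected in pairs:
        margin = reward(chosen, w) - reward(rejected, w)
        p = 1.0 / (1.0 + np.exp(-margin))       # P(rater prefers "chosen")
        w += lr * (1.0 - p) * (chosen - rejected)  # gradient of log-likelihood

acc = np.mean([reward(c, w) > reward(r, w) for c, r in pairs])
print(f"reward model agrees with the rater on {acc:.0%} of pairs")
```

Once fit, a reward model like this can score candidate answers automatically, so the expensive human A/B ratings only need to be collected once.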

For example, regression analysis can analyze how weather conditions and days of the week influence sales at a BBQ restaurant (a code sketch follows these entries).

Regulatory capture: The tendency for entrenched companies to benefit when new regulation is crafted ostensibly in the public interest, such as through successful lobbying.

Reinforcement Learning from Human Feedback (RLHF): A late stage of training a large language model in which human evaluators give it thumbs-up or thumbs-down based on subjective criteria to make the LLM more aligned with human values. Colloquially referred to by Stuart Russell as “spanking.”

Repugnant Conclusion: Formulated by the philosopher Derek Parfit, the proposition that any amount of positive utility multiplied by a sufficiently large number of people—infinity people eating one stale batch of Arby’s curly fries before dying—has higher utility than some smaller number of people living in abundance.
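To make the regression example above concrete, here is a minimal ordinary-least-squares sketch on synthetic data; the restaurant, variables, and coefficients are all invented for illustration.

```python
# Minimal sketch of the glossary's regression example: fit daily BBQ sales
# against temperature and a weekend indicator. Synthetic, illustrative data.
import numpy as np

rng = np.random.default_rng(1)
n_days = 200

temperature = rng.uniform(5, 35, n_days)   # daily high, degrees C
is_weekend = rng.integers(0, 2, n_days)    # 1 if Saturday/Sunday
# Made-up ground truth: warm weekends sell more BBQ, plus noise.
sales = 200 + 4.0 * temperature + 80.0 * is_weekend + rng.normal(0, 20, n_days)

# Ordinary least squares: solve for [intercept, temp effect, weekend effect].
X = np.column_stack([np.ones(n_days), temperature, is_weekend])
beta, *_ = np.linalg.lstsq(X, sales, rcond=None)
intercept, temp_effect, weekend_effect = beta
print(f"per-degree effect: {temp_effect:.1f} sales, weekend bump: {weekend_effect:.1f} sales")
```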

The term possibly derives from poker’s Mississippi riverboat origins, where if the dealer was suspected of cheating, he’d be thrown into the river.

(The) River: A geographical metaphor for the territory covered in this book, a sprawling ecosystem of like-minded, highly analytical, and competitive people that includes everything from poker to Wall Street to AI. The demonym is Riverian.

RLHF: See: Reinforcement Learning from Human Feedback.

Robust: In philosophy or statistical inference, reliable across many conditions or changes in parameters. A highly desirable property.

ROI: See: Return on Investment.

Rug pull: Hyping up a crypto project to attract investors, and then pulling a disappearing act before bringing the idea to fruition.

pages: 444 words: 117,770

The Coming Wave: Technology, Power, and the Twenty-First Century's Greatest Dilemma
by Mustafa Suleyman
Published 4 Sep 2023

There are still multiple examples of biased, even overtly racist, LLMs, as well as serious problems with everything from inaccurate information to gaslighting. But for those of us who have worked in the field from the beginning, the exponential progress at eliminating bad outputs has been incredible, undeniable. It’s easy to overlook quite how far and fast we’ve come. A key driver behind this progress is called reinforcement learning from human feedback. To fix their bias-prone LLMs, researchers set up cunningly constructed multi-turn conversations with the model, prompting it to say obnoxious, harmful, or offensive things to see where and how it goes wrong. Researchers then flag these missteps and reintegrate the human insights into the model, eventually teaching it a more desirable worldview, in a way not wholly dissimilar from how we try to teach children not to say inappropriate things at the dinner table.
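As a toy illustration of the loop Suleyman describes, the sketch below treats the model as a softmax policy over canned replies and applies a REINFORCE-style update so that replies a simulated human reviewer flags become less likely. Real RLHF fine-tunes a full LLM, typically against a learned reward model with an algorithm such as PPO; everything here is a simplified stand-in.

```python
# Toy feedback loop: red-teamers flag bad replies, and reinforcement
# updates make flagged replies less likely. Illustrative, not production RLHF.
import numpy as np

rng = np.random.default_rng(2)

replies = ["helpful answer", "evasive answer", "offensive answer"]
logits = np.zeros(len(replies))  # the "model": preferences over replies

def policy(logits):
    """Softmax distribution over the canned replies."""
    p = np.exp(logits - logits.max())
    return p / p.sum()

def human_feedback(idx):
    """Stand-in for a red-teamer: flag the offensive reply, approve the rest."""
    return -1.0 if replies[idx] == "offensive answer" else 1.0

# REINFORCE-style updates: raise log-probability of approved replies,
# lower it for flagged ones (grad of log softmax is one-hot minus p).
lr = 0.1
for step in range(2000):
    p = policy(logits)
    idx = rng.choice(len(replies), p=p)
    reward = human_feedback(idx)
    grad = -p
    grad[idx] += 1.0
    logits += lr * reward * grad

final = policy(logits)
print({r: round(prob, 3) for r, prob in zip(replies, final)})
```

After training, the probability mass on the flagged reply collapses toward zero, which is the toy analogue of the model learning not to say inappropriate things at the dinner table.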