description: variant of reinforcement learning
generative artificial intelligence
10 results
by Sebastian Mallaby · 30 Mar 2026 · 607pp · 161,998 words
ensure that Sparrow was aligned with human preferences on all these subtle dimensions, Irving turned to a technique that OpenAI had also tried: RLHF, or reinforcement learning from human feedback. Following on the heels of few-shot prompting and supervised fine-tuning, RLHF elevated post-training to a third level. Whereas fine-tuning was a
…
’s ability to augment answers with web search; OpenAI’s chatbot couldn’t match that. Sparrow was guided not just by the general version of reinforcement learning from human feedback, but by DeepMind’s twenty-three conduct rules. ChatGPT lacked this refinement. The night before ChatGPT’s release, OpenAI’s core team placed bets on
…
the Hinton–Bengio warning—the one represented by Geoffrey Irving and Jan Leike. To prevent models from deceiving humans, you needed technical solutions such as reinforcement learning from human feedback, bolstered by pre-release red-team tests to discover residual misbehaviors. Already, thanks to this formula, chatbots had become less biased, less toxic, and less
…
taking on the role, he had given an inspiring talk to his research colleagues, laying out how sophisticated, machine-based RL—going well beyond simple reinforcement learning from human feedback—could take large language models to the next level. But Silver quickly ran into a wall. It was partly that he was not cut out
…
chinks of light. Inscrutable, unpredictable, and therefore inherently dangerous systems became at least partially understandable. Irving’s second promising project addressed the central weakness in reinforcement learning from human feedback. As AI generated ever more sophisticated outputs, it would outstrip humans’ ability to provide feedback on them. It was one thing for humans to judge
…
firm ground. During the evolution of standard language models, post-training had begun with clever prompting, then moved on to fine-tuning, and then incorporated reinforcement learning from human feedback. Likewise, now that thinking models had moved through chain-of-thought prompting and fine-tuning, the logical next step was to fortify their reasoning with
…
–2, 106 Q* project and, 356, 359 from raw experience, 412n24 reasoning and, 358, 362–66 safety risks of, 373–75 with verified rewards, 438n27 reinforcement learning from human feedback (RLHF), 298–99, 301, 335 Renaissance Technologies, 232 Reos Partners, 67 Republic game, 36–41 residual neural network (ResNet), 197–98 Responsible AI team, Gemini
by Carissa Véliz · 21 Apr 2026 · 503pp · 129,255 words
enormous amounts of text, which allows them to mimic the patterns that they have picked up on. Second, they are refined through a process of reinforcement learning from human feedback. People teach the AI what kinds of responses they prefer. Through numerous iterations, the system learns how to satisfy human beings’ tastes, thereby becoming more
by Nate Silver · 12 Aug 2024 · 848pp · 227,015 words
-old trope. But this is still misinformation, however harmless in this instance. So LLMs undergo another stage in their training: what’s called RLHF, or reinforcement learning from human feedback. Basically, it works like this: the AI labs hire cheap labor—often from Amazon’s Mechanical Turk, where you can employ human AI trainers from
…
. Regulatory capture: The tendency for entrenched companies to benefit when new regulation is crafted ostensibly in the public interest, such as because of successful lobbying. Reinforcement Learning from Human Feedback (RLHF): A late stage of training a large language model in which human evaluators give it thumbs-up or thumbs-down based on subjective criteria
…
ecosystem of like-minded, highly analytical, and competitive people that includes everything from poker to Wall Street to AI. The demonym is Riverian. RLHF: See: Reinforcement Learning from Human Feedback. Robust: In philosophy or statistical inference, reliable across many conditions or changes in parameters. A highly desirable property. ROI: See: Return on Investment. Rug pull
…
, 135, 143–44, 157, 513n, 514n poker and, 13 River-Village conflict and, 31 Silicon Valley, 269–70, 272 regulatory capture, 31, 269, 270, 495 reinforcement learning from human feedback (RLHF), 440–41, 442, 495 Reinkemeier, Tobias, 102–3 replication crisis, 179, 497 Repugnant Conclusion, 364–65, 403, 495 resilience, 116–17 results-oriented thinking
…
, 30, 267–68, 271, 505–6n regulatory capture and, 31, 269 risk aversion and, 493 Silicon Valley and, 26, 267–75, 290, 295, 505n RLHF (reinforcement learning from human feedback), 440–41, 442, 495 Robins, Jason, 184, 186 robustness, 495 Rock, Arthur, 257, 296 rock paper scissors, 47, 58 Roffman, Marvin, 151 Rogers, Kenny, 229
by Mustafa Suleyman · 4 Sep 2023 · 444pp · 117,770 words
has been incredible, undeniable. It’s easy to overlook quite how far and fast we’ve come. A key driver behind this progress is called reinforcement learning from human feedback. To fix their bias-prone LLMs, researchers set up cunningly constructed multi-turn conversations with the model, prompting it to say obnoxious, harmful, or offensive
by Karen Hao · 19 May 2025 · 660pp · 179,531 words
technique in a blog post as a way to get AI models to follow difficult-to-specify directions. The researchers on the team called it “reinforcement learning from human feedback.” Amodei wanted to move beyond the toy environment, and Radford’s work with GPT-1 made language models seem like a good option. But GPT
…
content-moderation filter, a revelation first reported by Time magazine correspondent Billy Perrigo. It would also employ over a thousand other contractors globally to perform reinforcement learning from human feedback, or RLHF, the technique it had developed to teach an AI agent backflips, on its language models, including prompting the models repeatedly and scoring the
…
GPT-2 with roughly twice the number of parameters, a group of researchers had begun advancing the AI safety work that Amodei had wanted: testing reinforcement learning from human feedback as a way to guide the model toward generating cheerful and positive content and away from anything offensive. But late one night, a researcher made
…
quickly as possible. The live version on the API didn’t have any kind of content-moderation filtering, nor had its outputs been refined with reinforcement learning from human feedback. In meetings, the two camps sought to find a middle ground. Instead, they talked around each other in endless circles. At one point, Welinder, who
…
and quality of training data, in part by tapping into user data and shifting the model toward the best parts of the data distribution with reinforcement learning from human feedback. Beneath a section titled “How to accomplish it,” the road map elaborated further. As an initial step, OpenAI would bring various deep learning models up
…
ChatGPT. The company said it also used more than one thousand other contractors in the US and around the world to refine its models with reinforcement learning from human feedback, the AI safety technique that it had developed. To source those workers, it leaned heavily on the same platform that became the staple of the
…
the things that put OpenAI on the map and would bring it increasing commercial success had begun as AI safety projects: scaling laws, code generation, reinforcement learning from human feedback, the combination of these three into incredibly compelling large language and then multimodal models. Many Doomers would feel their work was being co-opted and
…
powerful and context sensitive, it was producing garbage responses.” But Brockman pushed forward, pulling together the resources to improve the model with human contractors conducting reinforcement learning from human feedback. With each week, the results looked better and better, until the performance truly began to wow people internally. GPT-4 now had built-in multimodal
…
new idea began to percolate between him and Altman: a team laser focused on developing new alignment methods for superintelligence, in anticipation of methods like reinforcement learning from human feedback no longer being sufficient once systems could, in their view, outsmart humans. Altman called it the Alignment Manhattan Project. At first, the two discussed spinning
…
VP of Research of the post-training team, which oversaw the preparation of OpenAI’s models for prime time, such as by aligning them with reinforcement learning from human feedback. The three sat side by side around a small round table to demo 4o, including its ability to be a real-time voice-to-voice
…
“Your goal is to provide”: RLHF documents. during a talk at UC Berkeley: “John Schulman—Reinforcement Learning from Human Feedback: Progress and Challenges,” posted April 19, 2023, by UC Berkeley EECS, YouTube, 1 hr., 3 min., 31 sec., youtu.be/hhiLw5Q_UFg.
…
Regalado, Antonio, 186, 187 regulations (regulatory policy), 25, 27, 84, 86, 134, 136, 265, 272, 301, 303–4, 306–7, 311–12, 357, 358, 384 reinforcement learning from human feedback (RLHF), 123, 137, 146, 155, 176, 213–23, 245, 248, 315, 381, 387 Remotasks, 203–4, 218–23, 416 Renaldi, Adi, 186 renewable energy, 77
by Keach Hagey · 19 May 2025 · 439pp · 125,379 words
’s tendency to spew out lies or other antisocial statements, researchers taught it how humans would actually like it to behave using a process called reinforcement learning from human feedback (RLHF). Humans would rate how well a response fit their expectations, and that feedback would help create a filter that would civilize the model. The
by Christopher Summerfield · 11 Mar 2025 · 412pp · 122,298 words
in a manner that is aligned with developers’ values. Two popular varieties of human-in-the-loop fine-tuning are supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), and they are typically used in tandem. The combined power of these methods was first revealed to the AI community in a 2022 paper
…
form of filter bubble for the user. Recent papers by the AI research company Anthropic have studied the tendency for LLMs fine-tuned with RLHF (reinforcement learning from human feedback) to be sycophantic. The researchers co-opted the term ‘sycophancy’ to describe the model’s propensity to bend its speech to suit the supposed preferences
…
(learning to learn), 158–61 one-shot learning, 253–4 prediction and, 154–62 reinforcement learning (RL), 188, 190, 192, 251, 258, 267, 305, 322 reinforcement learning from human feedback (RLHF), 188, 189–91, 192, 251, 257, 267 trial-and-error learning, 158–61, 268 Leibniz, Gottfried, 19–21, 24–5, 28, 29, 30, 47
by Ethan Mollick · 2 Apr 2024 · 189pp · 58,076 words
to fit the preferences of the human, providing additional learning that reinforces good answers and reduces bad answers, which is why the process is called Reinforcement Learning from Human Feedback (RLHF). After an AI has gone through this initial phase of reinforcement learning, they can continue to be fine-tuned and adjusted. This type of
…
seen, gathering training data has its own problems. The most common approach to reducing bias is for humans to correct the AIs, as in the Reinforcement Learning from Human Feedback (RLHF) process, which is part of the fine-tuning of LLMs that we discussed in the previous chapter. This process allows human raters to penalize
by Joanna Walsh · 22 Sep 2025 · 255pp · 80,203 words
because they don’t distinguish between the likes given by a bot and a human. AI used to be trained using what OpenAI calls RLHF – reinforcement learning from human feedback – and this work was paid, but aesthetic evaluation is increasingly sourced via apps that persuade users that interacting with an AI for free is play
…
carbon footprint, 122 energy consumption, 124 environmental racism, 122 GPT large language model, 113 GPT NLP (natural language processing) models, 104 and humour, 101 RLHF (reinforcement learning from human feedback), 205–6 self-learning, 123, 206 spammers and, 214 stupidity of, 124 training, 113, 114–16, 118–19, 122, 205–6 AI art, aestheticness, 127
…
art, 35–6 retrofuturism, 57 revivalism, 76 rewards, 29 Ricoeur, Paul, 34 Riefenstahl, Leni, 177 Rights Alliance, 115 Riley, Denise, 134 risk, 18–20 RLHF (reinforcement learning from human feedback), 205–6 romanticisation, 157 Romanticism, 103 Ronell, Avital, 136 Rosler, Martha, 147 Ruskin, John, 61–2, 69 Russell, Legacy, 97 Saito, Yuriko, 128 salaried work
by Tim Berners-Lee · 8 Sep 2025 · 347pp · 100,038 words
speech, or on deliberate disinformation, will reproduce those same flaws in its results. To prevent the nastiest outputs, LLMs like ChatGPT are fine-tuned using ‘reinforcement learning from human feedback’, an AI technique that incorporates editorial judgement from humans. Of course, this immediately leads to the objection that the output of the model is being
…
need for inclusivity ref1 neural networks ref1, ref2, ref3, ref4 OpenAI ref1, ref2, ref3, ref4, ref5 paradigm shift ref1 RAGs (Retrieval-Augmented Generation systems) ref1 reinforcement learning from human feedback ref1 search engines ref1 semantic web ref1 simplified text ref1 singularity ref1 speed of development ref1 superintelligence ref1 trust ref1 see also ChatGPT Asimov, Isaac