Attention Is All You Need

back to index

description: AI scientific article about the transformer architecture, published in June 2017

4 results

pages: 336 words: 91,806

Code Dependent: Living in the Shadow of AI
by Madhumita Murgia
Published 20 Mar 2024

They started playing around with some early prototypes on English–German translations, and found it worked. Their work formalized a months-long collaboration in 2017 that eventually produced a piece of software for processing language, known simply as the ‘transformer’. The eight research scientists who eventually played a part in its creation described it in a short paper with a snappy title: ‘Attention Is All You Need’.2 One of the authors, Llion Jones, who grew up in a tiny Welsh village, says the title was a nod to the Beatles song ‘All You Need Is Love’. The paper was first published in June 2017, and it kick-started an entirely new era of artificial intelligence: the rise of generative AI. The genesis of the transformer and the story of its creators help to account for how we got to this moment in artificial intelligence: an inflection point, comparable to our transition to the web or to smartphones, that has seeded a new generation of entrepreneurs building AI-powered consumer products for the masses.

OpenAI took a hefty investment of more than $10bn from Microsoft and converted itself into what was, for all intents and purposes, a for-profit enterprise that sold AI technologies to large corporations and governments around the world.4 OpenAI’s crown jewel was an algorithm called GPT – the Generative Pre-trained Transformer – software that could produce text-based answers in response to human queries. One of the authors of the ‘Attention Is All You Need’ paper, Lukasz Kaiser, had ended up working there and helping to build it. It was an impressive piece of technology, but until November 2022 it was small-scale, clunky and mostly in the hands of tech-savvy programmers. To have invented a computer program that could employ our own language to communicate directly with us was quite a feat.

Chang Chien, ‘How China’s Police Used Phones and Faces to Track Protesters’, The New York Times, December 4, 2022, https://www.nytimes.com/2022/12/02/business/china-protests-surveillance.html.

CHAPTER 10: YOUR SOCIETY

1 M. Murgia, ‘Transformers: The Google Scientists Who Pioneered an AI Revolution’, The Financial Times, July 23, 2023, https://www.ft.com/content/37bb01af-ee46-4483-982f-ef3921436a50.
2 A. Vaswani et al., ‘Attention Is All You Need’, arXiv, June 12, 2017, https://arxiv.org/abs/1706.03762.
3 M. Murgia, ‘OpenAI’s Mira Murati: The Woman Charged with Pushing Generative AI into the Real World’, The Financial Times, June 18, 2023, https://www.ft.com/content/73f9686e-12cd-47bc-aa6e-52054708b3b3.
4 R. Waters and T. Kinder, ‘Microsoft’s $10bn Bet on ChatGPT Developer Marks New Era of AI’, The Financial Times, January 16, 2023, https://www.ft.com/content/a6d71785-b994-48d8-8af2-a07d24f661c5.
5 M.

pages: 848 words: 227,015

On the Edge: The Art of Risking Everything
by Nate Silver
Published 12 Aug 2024

Post describes him: Nitasha Tiku, “OpenAI Leaders Warned of Abusive Behavior before Sam Altman’s Ouster,” The Washington Post, December 8, 2023, washingtonpost.com/technology/2023/12/08/open-ai-sam-altman-complaints.

companies like OpenAI and Anthropic: “Google Brain Drain: Where are the Authors of ‘Attention Is All You Need’ Now?” AIChat, aichat.blog/google-exodus-where-are-the-authors-of-attention-is-all-you-need-now.

Altman has tipped his hat: @SamA, https://twitter.com/sama/status/1540227243368058880?lang=en.

thinks the “schtick”: roon (@tszzl), “e/acc’s are both dangerous and cringe and cribbed half my schtick. disavow!

“Even last year, what large language models were doing was kind of babbling and not very interesting,” he said when we spoke in 2023. “And then suddenly this threshold was passed, where, gosh, it seems like human-level text generation. And, you know, nobody really anticipated that.” In 2017, a group of researchers at Google published a paper called “Attention Is All You Need” that introduced something called a “transformer.” I’ll provide a more detailed description of a transformer later, but it isn’t important for now—the intuition is just that it parses a sentence all at once instead of sequentially. (So, for example, in the sentence “Alice came over for dinner, but unlike Bob, she forgot to bring wine,” it figures out that it’s Alice and not Bob who forgot the wine.)

On its own, the tell is not very meaningful, but in the context of other semantic information (the player is breathing heavily and avoiding eye contact) it might be. This part of the process, as ChatGPT says, is hidden from view. Exactly how the transformer makes these inferences is something of a mystery—this is the “bag of numbers” stage. But it just seems to work out somehow. In the famous Google paper on transformers, “Attention Is All You Need,” “attention” essentially refers to the importance of the relationships between different pairs of tokens. Once a transformer figures out these relationships, there isn’t a whole lot else it needs to do. For instance, the tokens “Alice” and “Bob” have an important relationship that the transformer will pay more attention to.
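To make the “relationships between pairs of tokens” idea concrete, here is a minimal sketch of scaled dot-product self-attention in NumPy. The toy sentence, the tiny embedding size, and the random projection matrices are illustrative assumptions rather than weights from any real model; the point is only the mechanics described in the excerpt: every token is scored against every other token, and the softmaxed scores become the attention weights.

```python
# Minimal self-attention sketch (illustrative only: toy tokens, random weights).
import numpy as np

rng = np.random.default_rng(0)
tokens = ["Alice", "came", "over", "but", "unlike", "Bob", "she", "forgot", "wine"]
d = 8                                    # toy embedding / head dimension
X = rng.normal(size=(len(tokens), d))    # stand-in embeddings, one row per token

# In a trained transformer these projections are learned; here they are random.
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Every token's query is compared with every token's key, all at once ...
scores = Q @ K.T / np.sqrt(d)
# ... and a softmax turns each row of scores into attention weights over all pairs.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Each token's output is a weighted mix of every token's value vector, so "she"
# can draw on "Alice" directly, however far apart they sit in the sentence.
output = weights @ V

she = tokens.index("she")
for tok, w in zip(tokens, weights[she]):
    print(f"she -> {tok}: {w:.2f}")
```

With trained weights, the row for “she” would concentrate on “Alice”; with the random weights above the numbers are arbitrary, but the computation is the same one the paper’s title refers to.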

The Singularity Is Nearer: When We Merge with AI
by Ray Kurzweil
Published 25 Jun 2024

Notably, one of the doctoral students who designed Proverb, the first AI to master crossword puzzles better than most human solvers, was Noam Shazeer. He went on to work at Google, where he was a lead author of “Attention Is All You Need,” the paper that invented the transformer architecture for large language models that has powered the latest AI revolution. See Duke University, “Duke Researchers Pit Computer Against Human Crossword Puzzle Players,” ScienceDaily, April 20, 1999, https://www.sciencedaily.com/releases/1999/04/990420064821.htm; Vaswani et al., “Attention Is All You Need.”

For a representative video clip from the matches and analyses of Watson and the competition, see O’Reilly, “Jeopardy!

For a more detailed explainer on how transformers work, and the original technical paper, see Giuliano Giacaglia, “How Transformers Work,” Towards Data Science, March 10, 2019, https://towardsdatascience.com/transformers-141e32e69591; Ashish Vaswani et al., “Attention Is All You Need,” arXiv:1706.03762v5 [cs.CL], December 6, 2017, https://arxiv.org/pdf/1706.03762.pdf.

Irene Solaiman et al., “GPT-2: 1.5B Release,” OpenAI, November 5, 2019, https://openai.com/blog/gpt-2-1-5b-release.

Tom B.

pages: 2,466 words: 668,761

Artificial Intelligence: A Modern Approach
by Stuart Russell and Peter Norvig
Published 14 Jul 2019

The score of each word is the log-probability generated by the target RNN softmax, and the score of each hypothesis is the sum of the word scores. At timestep 3, the highest-scoring hypothesis “La entrada” can only generate low-probability continuations, so it “falls off the beam.”

25.4 The Transformer Architecture

The influential article “Attention is all you need” (Vaswani et al., 2018) introduced the transformer architecture, which uses a self-attention mechanism that can model long-distance context without a sequential dependency.

25.4.1 Self-attention

Previously, in sequence-to-sequence models, attention was applied from the target RNN to the source RNN.
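The scoring rule at the top of this excerpt (each hypothesis scored by the sum of its words’ log-probabilities, with low-scoring hypotheses dropped) can be sketched in a few lines of Python. The `next_word_logprobs` table below is a made-up stand-in for the target RNN softmax, not an example from the book; it only shows how hypotheses are extended, re-scored, and pruned to the beam width.

```python
# Beam search sketch: hypothesis score = sum of word log-probabilities.
# The toy conditional distributions stand in for the target RNN softmax.
import math

def next_word_logprobs(prefix):
    table = {
        None:      {"la": 0.5, "una": 0.4, "el": 0.1},
        "la":      {"entrada": 0.6, "puerta": 0.3, "<eos>": 0.1},
        "una":     {"puerta": 0.3, "entrada": 0.1, "<eos>": 0.6},
        "entrada": {"<eos>": 0.9, "la": 0.1},
        "puerta":  {"<eos>": 0.9, "la": 0.1},
        "el":      {"<eos>": 1.0},
    }
    last = prefix[-1] if prefix else None
    return {w: math.log(p) for w, p in table[last].items()}

def beam_search(beam_width=2, max_len=4):
    beams = [([], 0.0)]                          # (words so far, summed log-prob)
    for _ in range(max_len):
        candidates = []
        for words, score in beams:
            if words and words[-1] == "<eos>":
                candidates.append((words, score))        # finished hypothesis
                continue
            for w, lp in next_word_logprobs(words).items():
                candidates.append((words + [w], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]          # the rest "fall off the beam"
    return beams

for words, score in beam_search():
    print(" ".join(words), f"(log-prob {score:.2f})")
```

Because log-probabilities are negative, the hypothesis whose sum is closest to zero ranks highest; anything outside the top `beam_width` at a given timestep is discarded, which is exactly what the excerpt means by falling off the beam.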

Vasilache, N., Johnson, J., Mathieu, M., Chintala, S., Piantino, S., and LeCun, Y. (2014). Fast convolutional nets with fbfft: A GPU performance evaluation. arXiv:1412.7580.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2018). Attention is all you need. In NeurIPS 30.
Veach, E. and Guibas, L. J. (1995). Optimally combining sampling techniques for Monte Carlo rendering. In Proc. 22nd Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH).
Venkatesh, S. (2012). The Theory of Probability: Explorations and Applications.