labeled data

description: group of samples that have been tagged with one or more labels

76 results

pages: 296 words: 66,815

The AI-First Company
by Ash Fontana
Published 4 May 2021

HUMAN GENERATED Data Labeling Many ML models require labeled data for training recognition algorithms. There are some promising transfer and semisupervised learning techniques that may provide alternatives to gathering a great deal of labeled data, especially for generic domains such as image, video, and language understanding. However, the state of the art doesn’t seem to offer enough just yet, and particularly not for specific domains. Accessing and owning processed data to feed models can be the single hardest problem in starting a vertical, AI-First business. Supervised ML models need labeled data. Getting lots of labeled examples for specific domains is hard.

These systems don’t necessarily have active or interactive learning—they could just have people labeling data that goes in a bucket somewhere. Similarly, active and interactive learning systems don’t necessarily have humans in the loop—it could be a robot doing the labeling. HIL systems are the general form of getting users to label data. HUMAN-IN-THE-LOOP SYSTEMS The loop is the preceding diagram. Typically, there are a few different ways to get humans in the loop. Creating. Ask them to create brand-new data in the “get data” step; for instance, by completing surveys. Labeling. Label data either by entering text or picking from a list of labels used by the ML system.

Mathematically, if a labeler makes a mistake 5 percent of the time, the probability that three labelers make the same mistake is 5 percent × 5 percent × 5 percent, or about 0.0125 percent, assuming their errors are independent. Outsourced or crowdsourced workers both create and label data. They can collect data in a multitude of ways: for example, by calling people, completing online searches, copying information manually from a website, and other methods that can be reduced to a standard, discrete, short task. They label data using the interface provided to them by the marketplace to both add labels and track their work. Some marketplaces have a human-in-the-loop system to label data. Workers can also be effective at cleaning data, whether by de-duplicating lists of data, correcting spelling errors in lists of text, or discarding blurry images.
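The arithmetic in this excerpt can be checked with a few lines of Python. This is a minimal sketch that assumes the labelers err independently; the function names are illustrative and do not come from the book. It also computes a related quantity, the chance that a majority vote of three labelers is wrong.

```python
from math import comb

def prob_all_wrong(p, n):
    """Chance that all n independent labelers make a mistake on the same item."""
    return p ** n

def prob_majority_wrong(p, n):
    """Chance that more than half of n independent labelers are wrong."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(need, n + 1))

p = 0.05
all_three = prob_all_wrong(p, 3)        # 0.05 ** 3
majority = prob_majority_wrong(p, 3)    # 3 * p^2 * (1 - p) + p^3
print(f"all three wrong: {all_three:.4%}")
print(f"majority wrong:  {majority:.4%}")
```

Under these assumptions the first figure is 0.0125 percent, which rounds to the 0.01 percent quoted above; the majority-vote figure is larger because only two of the three labelers need to be wrong.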

pages: 681 words: 64,159

Numpy Beginner's Guide - Third Edition
by Ivan Idris
Published 23 Jun 2015

From the array returned by convolve(), we extracted the data in the center of size N. The following code makes an array of time values and plots with matplotlib that we will cover in a later chapter:

    c = np.loadtxt('data.csv', delimiter=',', usecols=(6,), unpack=True)
    sma = np.convolve(weights, c)[N-1:-N+1]
    t = np.arange(N - 1, len(c))
    plt.plot(t, c[N-1:], lw=1.0, label="Data")
    plt.plot(t, sma, '--', lw=2.0, label="Moving average")
    plt.title("5 Day Moving Average")
    plt.xlabel("Days")
    plt.ylabel("Price ($)")
    plt.grid()
    plt.legend()
    plt.show()

In the following chart, the smooth dashed line is the 5 day SMA and the jagged thin line is the close price: What just happened?

We learned that the ones() function can create an array with ones and the convolve() function calculates the convolution of a dataset with specified weights (see sma.py):

    from __future__ import print_function
    import numpy as np
    import matplotlib.pyplot as plt

    N = 5
    weights = np.ones(N) / N
    print("Weights", weights)
    c = np.loadtxt('data.csv', delimiter=',', usecols=(6,), unpack=True)
    sma = np.convolve(weights, c)[N-1:-N+1]
    t = np.arange(N - 1, len(c))
    plt.plot(t, c[N-1:], lw=1.0, label="Data")
    plt.plot(t, sma, '--', lw=2.0, label="Moving average")
    plt.title("5 Day Moving Average")
    plt.xlabel("Days")
    plt.ylabel("Price ($)")
    plt.grid()
    plt.legend()
    plt.show()

Exponential Moving Average

The Exponential Moving Average (EMA) is a popular alternative to the SMA. This method uses exponentially decreasing weights.
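To see why the slice [N-1:-N+1] keeps exactly the fully overlapping windows, the convolution can be written out by hand. The following is a plain-Python sketch, not the book's code: `full_convolve` and `simple_moving_average` are hypothetical helper names, and the prices are made up.

```python
def full_convolve(a, v):
    """Full discrete convolution: out[n] = sum over m of a[m] * v[n - m]."""
    out = [0.0] * (len(a) + len(v) - 1)
    for i, ai in enumerate(a):
        for j, vj in enumerate(v):
            out[i + j] += ai * vj
    return out

def simple_moving_average(prices, n):
    """Mean of each window of n consecutive prices."""
    weights = [1.0 / n] * n
    full = full_convolve(weights, prices)
    # Keep only positions where the window fully overlaps the data;
    # for n > 1 this is the same slice as [N-1:-N+1] in the excerpt.
    return full[n - 1:len(full) - (n - 1)]

prices = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0]   # made-up closing prices
sma = simple_moving_average(prices, 5)
print(sma)
```

With seven prices and a window of five, three averages remain, one for each window that lies entirely inside the data.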

Normalize the weights with the ndarray sum() method:

    weights /= weights.sum()
    print("Weights", weights)

For N = 5, we get these weights:

    Weights [ 0.11405072 0.14644403 0.18803785 0.24144538 0.31002201]

3. After this, use the convolve() function that we learned about in the SMA section and also plot the results:

    c = np.loadtxt('data.csv', delimiter=',', usecols=(6,), unpack=True)
    ema = np.convolve(weights, c)[N-1:-N+1]
    t = np.arange(N - 1, len(c))
    plt.plot(t, c[N-1:], lw=1.0, label='Data')
    plt.plot(t, ema, '--', lw=2.0, label='Exponential Moving Average')
    plt.title('5 Days Exponential Moving Average')
    plt.xlabel('Days')
    plt.ylabel('Price ($)')
    plt.legend()
    plt.grid()
    plt.show()

This gives us a nice chart where, again, the close price is the thin jagged line and the EMA is the smooth dashed line: What just happened?
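The excerpt omits the step that creates the unnormalized weights. The printed values are consistent with exponentially spaced weights running from exp(-1) to exp(0), so the sketch below assumes that construction; the helper names and prices are hypothetical. It applies the weights window by window so that the orientation is explicit: the last, largest weight multiplies the newest price.

```python
import math

def exponential_weights(n):
    """n weights growing exponentially from exp(-1) to exp(0), scaled to sum to 1."""
    if n == 1:
        return [1.0]
    raw = [math.exp(-1.0 + i / (n - 1)) for i in range(n)]
    total = sum(raw)
    return [w / total for w in raw]

def weighted_moving_average(prices, weights):
    """Apply weights to each window, with the last (largest) weight on the newest price."""
    n = len(weights)
    return [
        sum(w * p for w, p in zip(weights, prices[i:i + n]))
        for i in range(len(prices) - n + 1)
    ]

weights = exponential_weights(5)
print([round(w, 8) for w in weights])
prices = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0]   # made-up closing prices
ema = weighted_moving_average(prices, weights)
print(ema)
```

One point worth checking when using np.convolve(weights, c) directly: convolution reverses one sequence relative to the other, so weights[0] is paired with the newest price in each window. If the largest weight should fall on the most recent price, reversing the array first (weights[::-1]) is one way to get that orientation.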

pages: 586 words: 186,548

Architects of Intelligence
by Martin Ford
Published 16 Nov 2018

That’s very different from a system that has a big store of facts, and you can copy facts from one system to another. MARTIN FORD: Is it true that the vast majority of applications of deep learning rely heavily on labeled data, or what’s called supervised learning, and that we still need to solve unsupervised learning? GEOFFREY HINTON: That’s not entirely true. There’s a lot of reliance on labeled data, but there are some subtleties in what counts as labeled data. For example, if I give you a big string of text and I ask you to try and predict the next word, then I’m using the next word as a label of what the right answer is, given the previous words.
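Hinton's point that raw text supplies its own labels can be made concrete. The following is a small sketch, with a hypothetical helper name and a toy sentence, that turns unannotated text into (context, label) pairs where each label is simply the next word.

```python
def next_word_examples(text, context_size=3):
    """Turn raw text into (context, label) pairs; the label is simply the next word."""
    words = text.split()
    examples = []
    for i in range(context_size, len(words)):
        context = tuple(words[i - context_size:i])
        label = words[i]
        examples.append((context, label))
    return examples

pairs = next_word_examples("the cat sat on the mat", context_size=2)
for context, label in pairs:
    print(context, "->", label)
```

No human annotator is involved: every position in the text yields a training example, which is why this setup scales to very large corpora.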

JAMES MANYIKA: For example, we know that many of these techniques still largely rely on labelled data, and there’s still lots of limitations in terms of the availability of labelled data. Often this means that humans must label underlying data, which can be a sizable and error-prone chore. In fact, some autonomous vehicle companies are hiring hundreds of people to manually annotate hours of video from prototype vehicles to help train the algorithms. There are some new techniques that are emerging to get around the issue of needing labeled data, for example, in-stream supervision pioneered by Eric Horvitz and others; the use of techniques like Generative Adversarial Networks or GANs, which is a semi-supervised technique through which usable data can be generated in a way that reduces the need for datasets that require labeling by humans.

Supervised learning powers language translation (trained with millions of documents pre-translated into two different languages) and AI radiology systems (trained with millions of medical images labeled either “Cancer” or “No Cancer”). One problem with supervised learning is that it requires massive amounts of labeled data. This explains why companies that control huge amounts of data, like Google, Amazon, and Facebook, have such a dominant position in deep learning technology. REINFORCEMENT LEARNING essentially means learning through practice or trial and error. Rather than training an algorithm by providing the correct, labeled outcome, the learning system is set loose to find a solution for itself, and if it succeeds it is given a “reward.”

pages: 579 words: 76,657

Data Science from Scratch: First Principles with Python
by Joel Grus
Published 13 Apr 2015

Observe that we don’t actually care which label is associated with each probability, only what the probabilities are:

    def class_probabilities(labels):
        total_count = len(labels)
        return [count / total_count
                for count in Counter(labels).values()]

    def data_entropy(labeled_data):
        labels = [label for _, label in labeled_data]
        probabilities = class_probabilities(labels)
        return entropy(probabilities)

The Entropy of a Partition

What we’ve done so far is compute the entropy (think “uncertainty”) of a single set of labeled data. Now, each stage of a decision tree involves asking a question whose answer partitions data into one or (hopefully) more subsets. For instance, our “does it have more than five legs?”
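The excerpt relies on an `entropy` function and a `Counter` import that are defined elsewhere in the book. A self-contained version might look like the sketch below; the `entropy` body here is an assumed implementation of Shannon entropy in bits, and the sample data is made up.

```python
import math
from collections import Counter

def entropy(class_probabilities):
    """Shannon entropy in bits; zero probabilities contribute nothing."""
    return sum(-p * math.log2(p) for p in class_probabilities if p > 0)

def class_probabilities(labels):
    total_count = len(labels)
    return [count / total_count for count in Counter(labels).values()]

def data_entropy(labeled_data):
    labels = [label for _, label in labeled_data]
    return entropy(class_probabilities(labels))

# Hypothetical (features, label) pairs
pure = [({"legs": 4}, "dog"), ({"legs": 4}, "dog")]
mixed = [({"legs": 4}, "dog"), ({"legs": 8}, "spider")]
print(data_entropy(pure), data_entropy(mixed))
```

A set with a single label has zero entropy, while an even split between two labels has one bit, which is the most uncertain a two-class set can be.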

Nonetheless, it’s an incredibly stupid test, and a good illustration of why we don’t typically use “accuracy” to measure how good a model is. Imagine building a model to make a binary judgment. Is this email spam? Should we hire this candidate? Is this air traveler secretly a terrorist? Given a set of labeled data and such a predictive model, every data point lies in one of four categories: True positive: “This message is spam, and we correctly predicted spam.” False positive (Type 1 Error): “This message is not spam, but we predicted spam.” False negative (Type 2 Error): “This message is spam, but we predicted not spam.”
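The categories listed here (together with the fourth, true negatives, where a non-spam message is correctly predicted as not spam) can be counted directly. The sketch below uses made-up data and hypothetical function names to show how a model that never predicts spam can still score high accuracy while catching nothing.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for boolean labels."""
    tp = sum(a and p for a, p in zip(actual, predicted))
    fp = sum((not a) and p for a, p in zip(actual, predicted))
    fn = sum(a and (not p) for a, p in zip(actual, predicted))
    tn = sum((not a) and (not p) for a, p in zip(actual, predicted))
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def recall(tp, fp, fn, tn):
    return tp / (tp + fn) if (tp + fn) else 0.0

# Made-up data: 2 spam messages out of 100, and a model that never predicts spam.
actual = [True] * 2 + [False] * 98
predicted = [False] * 100
counts = confusion_counts(actual, predicted)
print(counts, accuracy(*counts), recall(*counts))
```

Accuracy is 98 percent here even though the model finds none of the spam, which is the kind of result that makes accuracy alone a poor yardstick for rare classes.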

We’ll look at ways to address this. Most people divide decision trees into classification trees (which produce categorical outputs) and regression trees (which produce numeric outputs). In this chapter, we’ll focus on classification trees, and we’ll work through the ID3 algorithm for learning a decision tree from a set of labeled data, which should help us understand how decision trees actually work. To make things simple, we’ll restrict ourselves to problems with binary outputs like “should I hire this candidate?” or “should I show this website visitor advertisement A or advertisement B?” or “will eating this food I found in the office fridge make me sick?”

pages: 502 words: 107,510

Natural Language Annotation for Machine Learning
by James Pustejovsky and Amber Stubbs
Published 14 Oct 2012

The labels are typically metadata tags provided by humans who annotate the corpus for training purposes.

Unsupervised learning: Any technique that tries to find structure from an input set of unlabeled data.

Semi-supervised learning: Any technique that generates a function mapping from inputs of both labeled data and unlabeled data; a combination of both supervised and unsupervised learning.

Table 1-4 shows a general overview of ML algorithms and some of the annotation tasks they are frequently used to emulate. We’ll talk more about why these algorithms are used for these different tasks in Chapter 7.

Just to give you an idea of what the algorithms listed in that table mean, the rest of this section gives an overview of the main types of ML algorithms. Classification Classification is the task of identifying the labeling for a single entity from a set of data. For example, in order to distinguish spam from not-spam in your email inbox, an algorithm called a classifier is trained on a set of labeled data, where individual emails have been assigned the label [+spam] or [-spam]. It is the presence of certain (known) words or phrases in an email that helps to identify an email as spam. These words are essentially treated as features that the classifier will use to model the positive instances of spam as compared to not-spam.

DATE: Monday, April 2011
DURATION: 30 minutes, two years, four days
SET: every hour, every other month
Event: Meeting, vacation, promotion, maternity leave, etc.
Temporal_Relations ::= BEFORE | AFTER | DURING | EQUAL | OVERLAP | ...

We will come back to this problem in a later chapter, when we discuss the impact of the initial model on the subsequent performance of the algorithms you are trying to train over your labeled data.

Warning: In later chapters, we will see that there are actually several models that might be appropriate for describing a phenomenon, each providing a different view of the data. We will call this multimodel annotation of the phenomenon. A common scenario for multimodel annotation involves annotators who have domain expertise in an area (such as biomedical knowledge).

pages: 1,082 words: 87,792

Python for Algorithmic Trading: From Idea to Cloud Deployment
by Yves Hilpisch
Published 8 Dec 2020

Derives the time series momentum as the mean of the recent log returns. Calculates the simple moving average. Calculates the rolling maximum value. Calculates the rolling minimum value. Adds the lagged features data to the DataFrame object. Defines the labels data as the market direction (+1 or up and -1 or down). Shows a small sub-set from the resulting lagged features data. Given the features and label data, different supervised learning algorithms could now be applied. In what follows, a so-called AdaBoost algorithm for classification is used from the scikit-learn ML package (see AdaBoostClassifier). The idea of boosting in the context of classification is to use an ensemble of base classifiers to arrive at a superior predictor that is supposed to be less prone to overfitting (see “Data Snooping and Overfitting”).
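The code that these callouts annotate is not part of the excerpt. As a rough illustration of the two ideas that matter for labeled data, lagged features and a +1/-1 direction label, here is a simplified plain-Python sketch. It is not the book's pandas code: the function names and prices are hypothetical, and the momentum, moving-average, and rolling min/max features are left out.

```python
import math

def log_returns(prices):
    """Log return between each pair of consecutive prices."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]

def lagged_dataset(returns, lags=2):
    """Build (features, label) pairs: features are the previous `lags` returns,
    the label is +1 if the current return is positive and -1 otherwise."""
    rows = []
    for t in range(lags, len(returns)):
        features = returns[t - lags:t]
        label = 1 if returns[t] > 0 else -1
        rows.append((features, label))
    return rows

prices = [100.0, 101.0, 100.5, 102.0, 103.0, 102.0]   # made-up prices
data = lagged_dataset(log_returns(prices), lags=2)
for features, label in data:
    print([round(f, 4) for f in features], label)
```

Each row uses only information available before the labeled period, which respects the time ordering of the data; a classifier such as the AdaBoost model mentioned above would then be trained on these pairs.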

Figure 5-1 shows the data and the regression line:

    In [1]: import os
            import random
            import numpy as np
            from pylab import mpl, plt
            plt.style.use('seaborn')
            mpl.rcParams['savefig.dpi'] = 300
            mpl.rcParams['font.family'] = 'serif'
            os.environ['PYTHONHASHSEED'] = '0'

    In [2]: x = np.linspace(0, 10)

    In [3]: def set_seeds(seed=100):
                random.seed(seed)
                np.random.seed(seed)
            set_seeds()

    In [4]: y = x + np.random.standard_normal(len(x))

    In [5]: reg = np.polyfit(x, y, deg=1)

    In [6]: reg
    Out[6]: array([0.94612934, 0.22855261])

    In [7]: plt.figure(figsize=(10, 6))
            plt.plot(x, y, 'bo', label='data')
            plt.plot(x, np.polyval(reg, x), 'r', lw=2.5, label='linear regression')
            plt.legend(loc=0);

Imports NumPy. Imports matplotlib. Generates an evenly spaced grid of floats for the x values between 0 and 10. Fixes the seed values for all relevant random number generators. Generates the randomized data for the y values.

Enlarging the interval to, say, [0, 20] allows one to “predict” values for the dependent variable y beyond the domain of the original data set by an extrapolation given the optimal regression parameters. Figure 5-2 visualizes the extrapolation:

    In [8]: plt.figure(figsize=(10, 6))
            plt.plot(x, y, 'bo', label='data')
            xn = np.linspace(0, 20)
            plt.plot(xn, np.polyval(reg, xn), 'r', lw=2.5, label='linear regression')
            plt.legend(loc=0);

Generates an enlarged domain for the x values.

Figure 5-2. Prediction (extrapolation) based on linear regression

The Basic Idea for Price Prediction

Price prediction based on time series data has to deal with one special feature: the time-based ordering of the data.
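What np.polyfit with deg=1 and np.polyval do in this excerpt can be spelled out without NumPy. The sketch below is an assumed, minimal ordinary-least-squares implementation with hypothetical function names and noise-free toy data, so the fitted line is exact and the extrapolated values are easy to check.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict(params, xs):
    """Evaluate the fitted line at the given x values."""
    slope, intercept = params
    return [slope * x + intercept for x in xs]

xs = [0.0, 2.5, 5.0, 7.5, 10.0]
ys = [2 * x + 1 for x in xs]             # noise-free toy data, so the fit is exact
params = fit_line(xs, ys)
extrapolated = predict(params, [15.0, 20.0])   # x values outside the fitted range
print(params, extrapolated)
```

The (slope, intercept) ordering mirrors np.polyfit, which returns coefficients from the highest degree down. Extrapolation simply evaluates the same line outside the range it was fitted on, so its reliability depends on the linear relationship continuing to hold there.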

pages: 161 words: 39,526

Applied Artificial Intelligence: A Handbook for Business Leaders
by Mariya Yao , Adelyn Zhou and Marlene Jia
Published 1 Jun 2018

If you are trying to predict whether an image is of a cat or a dog, this is a classification problem with discrete classes. If you are trying to predict the numerical price of a stock or some other asset, this can be framed as a regression problem with continuous outputs. Unsupervised learning occurs when computers are given unlabeled rather than labeled data, i.e. no input-output pairs, and asked to discover inherent structures and patterns that lie within the data. One common application of unsupervised learning is clustering, where input data is divided into different groups based on a measure of “similarity.” For example, you may want to cluster your LinkedIn or Facebook friends into social groups based on how connected they are to each other.

Rather than building custom deep learning solutions, many enterprises opt for Machine Learning as a Service (MLaaS) solutions from Google, Amazon, IBM, Microsoft, or leading AI startups. Deep learning also suffers from technical drawbacks. Successful models typically require a large volume of reliably-labeled data, which enterprises often lack. They also require significant and specialized computing power in the form of graphical processing units (GPUs) or GPU alternatives such as Google’s tensor processing units (TPUs). After deployment, they also require constant training and updating to maintain performance.

Advances in AI mean that many of these tasks can now be done automatically and much more accurately than before. Multiple companies such as Paxata and Trifacta now offer AI-powered data wrangling services that automate portions of the data preparation process, using algorithms and machine learning to transform raw data inputs into well-labeled data structures that are ready for use. These companies emphasize ease of use. Paxata, for example, offers an Excel-like UI, while Trifacta offers a proprietary interface that encourages users to visually interact with their data. Data Architecture While centralized data allows the company to more efficiently and more intelligently assess its overall state of being, the process of collecting that data is complicated by the existence of data silos.

pages: 350 words: 98,077

Artificial Intelligence: A Guide for Thinking Humans
by Melanie Mitchell
Published 14 Oct 2019

In 2017, the Financial Times reported that “most companies working on this technology employ hundreds or even thousands of people, often in offshore outsourcing centres in India or China, whose job it is to teach the robo-cars to recognize pedestrians, cyclists and other obstacles. The workers do this by manually marking up or ‘labeling’ thousands of hours of video footage, often frame by frame.”10 New companies have sprung up to offer labeling data as a service; Mighty AI, for example, offers “the labeled data you need to train your computer vision models” and promises “known, verified, and trusted annotators who specialize in autonomous driving data.”11 The Long Tail The supervised-learning approach, using large data sets and armies of human annotators, works well for at least some of the visual abilities needed for self-driving cars (many companies are also exploring the use of video-game-like driving-simulation programs to augment supervised training).

FIGURE 14: Salt lines on a highway, in advance of a forecasted snowstorm, were reported to be confusing Tesla’s Autopilot feature.

A commonly proposed solution is for AI systems to use supervised learning on small amounts of labeled data and learn everything else via unsupervised learning. The term unsupervised learning refers to a broad group of methods for learning categories or actions without labeled data. Examples include methods for clustering examples based on their similarity or learning a new category via analogy to known categories. As I’ll describe in a later chapter, perceiving abstract similarity and analogies is something at which humans excel, but to date there are no very successful AI methods for this kind of unsupervised learning.

In a 2015 article, the journalist Tom Simonite interviewed Yann LeCun about the unexpected triumph of ConvNets: LeCun recalls seeing the community that had mostly ignored neural networks pack into the room where the winners presented a paper on their results. “You could see right there a lot of senior people in the community just flipped,” he says. “They said, ‘Okay, now we buy it. That’s it, now—you won.’”8 At almost the same time, Geoffrey Hinton’s group was also demonstrating that deep neural networks, trained on huge amounts of labeled data, were significantly better than the current state of the art in speech recognition. The Toronto group’s ImageNet and speech-recognition results had substantial ripple effects. Within a year, a small company started by Hinton was acquired by Google, and Hinton and his students Krizhevsky and Sutskever became Google employees.

pages: 288 words: 86,995

Rule of the Robots: How Artificial Intelligence Will Transform Everything
by Martin Ford
Published 13 Sep 2021

The critical importance of accurately labeling massive datasets, especially for applications that involve understanding visual information, is especially well demonstrated by the meteoric ascent of Scale AI, which was founded by nineteen-year-old MIT dropout Alexandr Wang in 2016. Scale AI contracts with over 30,000 crowdsourced workers who label data for clients including Uber, Lyft, Airbnb and Alphabet’s self-driving car division, Waymo. The company has received more than $100 million in venture capital and now ranks as a Silicon Valley “unicorn”—a startup valued in excess of $1 billion.3 In many other cases, however, nearly incomprehensible quantities of beautifully labeled data are generated seemingly automatically—and for the companies that possess it, virtually free of charge. The massive torrent of data generated by platforms like Facebook, Google or Twitter is valuable in large measure because it is carefully annotated by the people using the platforms.

The technique powers AI radiology systems (trained with a huge number of medical images labeled either “Cancer” or “No Cancer”), language translation (trained with millions of documents pre-translated into different languages) and a nearly limitless number of other applications that essentially involve comparing and classifying different forms of information. Supervised learning typically requires vast amounts of labeled data, but the results can be very impressive—routinely resulting in systems with a superhuman ability to recognize patterns. Five years after the 2012 ImageNet competition that marked the onset of the deep learning explosion, the image recognition algorithms had become so proficient that the annual competition was reoriented toward a new challenge involving the recognition of real-world three-dimensional objects.2 In cases where labeling all this data requires the kind of interpretation that only a human can provide, as in attaching descriptive annotations to photographs, the process is expensive and cumbersome.

The achievements may also have led to what venture capitalist and author Kai-Fu Lee has called a “Sputnik moment” in China, in the wake of which the government quickly moved to position the country to become a leader in artificial intelligence.8 While supervised learning depends on massive quantities of labeled data, reinforcement learning requires a huge number of practice runs, the majority of which end in spectacular failure. Reinforcement learning is especially well suited to games, in which an algorithm can rapidly churn through more matches than a human being could play in a lifetime. The approach can also be applied to real-world activities that can be simulated at high speed.

Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage
by Zdravko Markov and Daniel T. Larose
Published 5 Apr 2007

The mixture model also includes the probability of sampling for each class (the probability that a random value belongs to a particular class). All these parameters are also shown in Figure 3.2. Using the mixture model, we can define three problems: a finite mixture problem, a classification problem, and a clustering problem.

Finite Mixture Problem

Given a labeled data set (i.e., we know the class for each attribute value) the problem is to find the mean, standard deviation, and the probability of sampling for each ...

[Figure 3.2: Two-class mixture model for the term offers. Normal (Gaussian) distribution, probability density plot. Class A: mean mA = 0.132, standard deviation sA = 0.229, P(A) = 0.55. Class B: mean mB = 0.494, standard deviation sB = 0.449, P(B) = 0.45.]

\[ P(B \mid (0, 0, 0, 0, 0.976, 0.254)) \approx (0.705)(7.979)(7.979)(0.486)(0.698)(1.604)(0.45) = 10.99 \]

After normalization we have the following probabilities:

\[ P(A \mid (0, 0, 0, 0, 0.976, 0.254)) = \frac{0.019}{0.019 + 10.99} = 0.002 \]

\[ P(B \mid (0, 0, 0, 0, 0.976, 0.254)) = \frac{10.99}{0.019 + 10.99} = 0.998 \]

Clearly, the Theatre document belongs to class B, which is now the correct classification because this is its original cluster.

Clustering Problem

So far we have discussed two tasks associated with our probabilistic setting: learning (creating models given labeled data) and classification (predicting labels using models). Recall, however, that the cluster labels were created automatically by k-means clustering. So a natural question is whether we can also get these labels automatically within a probabilistic setting. Expectation maximization (EM) is a popular algorithm used for clustering in the context of mixture models.
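The numbers in this calculation can be reproduced in a few lines. In the sketch below the six density values and the prior 0.45 are copied from the excerpt; the value 0.019 is taken to be the corresponding unnormalized score for class A, which is an assumption based on how it is paired with 10.99 in the normalization step. The variable names are illustrative.

```python
import math

# Per-attribute density values and prior for class B, copied from the excerpt.
densities_b = [0.705, 7.979, 7.979, 0.486, 0.698, 1.604]
prior_b = 0.45
score_b = math.prod(densities_b) * prior_b

# Unnormalized score assumed to belong to class A (quoted in the normalization step).
score_a = 0.019

# Normalizing makes the two class scores sum to 1.
total = score_a + score_b
posterior_a = score_a / total
posterior_b = score_b / total
print(round(score_b, 2), round(posterior_a, 3), round(posterior_b, 3))
```

Normalization divides each class score by the sum of all class scores, so the results can be read as probabilities that add up to one.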

If we know that our labeling is correct and reflects closely the content (representation) of documents, we can evaluate the quality of clustering. For example, considering the example from Table 3.7, we may decide that clustering EM1 is better than EM2 because the former has 15% errors and the latter has 30% errors with respect to manual classification (labeling).

2. On the other hand, we may know that our algorithm works well and the representation reflects accurately the content of documents.

Copyright © 2007 John Wiley & Sons, Inc.

pages: 660 words: 141,595

Data Science for Business: What You Need to Know About Data Mining and Data-Analytic Thinking
by Foster Provost and Tom Fawcett
Published 30 Jun 2013

Micro-outsourcing is particularly relevant to data science, because it changes the economics, as well as the practicalities, of investing in data.[84] As one example, recall the requirements for applying supervised modeling. We need to have specified a target variable precisely, and we need to actually have values for the target variable (“labels”) for a set of training data. Sometimes we can specify the target variable precisely, but we find we do not have any labeled data. In certain cases, we can use micro-outsourcing systems such as Mechanical Turk to label data. For example, advertisers would like to keep their advertisements off of objectionable web pages, like those that contain hate speech. However, with billions of pages to put their ads on, how can they know which ones are objectionable? It would be far too costly to have employees look at them all.

Deduction starts with general rules and specific facts, and creates other specific facts from them. The use of our models can be considered a procedure of (probabilistic) deduction. We will get to this shortly. The input data for the induction algorithm, used for inducing the model, are called the training data. As mentioned in Chapter 2, they are called labeled data because the value for the target variable (the label) is known. Let’s return to our example churn problem. Based on what we learned in Chapter 1 and Chapter 2, we might decide that in the modeling stage we should build a “supervised segmentation” model, which divides the sample into segments having (on average) higher or lower tendency to leave the company after contract expiration.

As depicted in the bottom pane of Figure 5-9, in each iteration of the cross-validation, a different fold is chosen as the test data. In this iteration, the other k–1 folds are combined to form the training data. So, in each iteration we have (k–1)/k of the data used for training and 1/k used for testing. Figure 5-9. An illustration of cross-validation. The purpose of cross-validation is to use the original labeled data efficiently to estimate the performance of a modeling procedure. Here we show five-fold cross-validation: the original dataset is split randomly into five equal-sized pieces. Then, each piece is used in turn as the test set, with the other four used to train a model. The result is five different accuracy results, which then can be used to compute the average accuracy and its variance.
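The splitting scheme described here is short enough to write out. This is a minimal sketch with a hypothetical function name; the integers stand in for labeled examples, and the seed parameter is only there to make the shuffle repeatable.

```python
import random

def k_fold_splits(data, k, seed=0):
    """Yield (train, test) pairs; each fold serves as the test set exactly once."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

data = list(range(10))            # stand-in for 10 labeled examples
splits = list(k_fold_splits(data, k=5))
for train, test in splits:
    print(len(train), len(test), sorted(test))
```

With k = 5, each iteration trains on 4/5 of the data and tests on the remaining 1/5, and every example is used for testing exactly once across the five iterations.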

Scikit-Learn Cookbook
by Trent Hauck
Published 3 Nov 2014

Label propagation with semi-supervised learning Label propagation is a semi-supervised technique that makes use of the labeled and unlabeled data to learn about the unlabeled data. Quite often, data that will benefit from a classification algorithm is difficult to label. For example, labeling data might be very expensive, so only a subset is cost-effective to manually label. This said, there does seem to be slow, but growing, support for companies to hire taxonomists. Getting ready Another problem area is censored data. You can imagine a case where the frontier of time will affect your ability to gather labeled data. Say, for instance, you took measurements of patients and gave them an experimental drug. In some cases, you are able to measure the outcome of the drug, if it happens fast enough, but you might want to predict the outcome of the drugs that have a slower reaction time.

The whole point is that we might give some ability to predict well on the training set and to work on a wider range of situations.

How it works…

Label propagation works by creating a graph of the data points, with weights placed on the edge equal to the following: The algorithm then works by labeled data points propagating their labels to the unlabeled data. This propagation is in part determined by edge weight. The edge weights can be placed in a matrix of transition probabilities. We can iteratively determine a good estimate of the actual labels.

5 Postmodel Workflow

This chapter will cover the following recipes:

- K-fold cross validation
- Automatic cross validation
- Cross validation with ShuffleSplit
- Stratified k-fold
- Poor man's grid search
- Brute force grid search
- Using dummy estimators to compare results
- Regression model evaluation
- Feature selection
- Feature selection on L1 norms
- Persisting models with joblib

Introduction

Even though by design the chapters are unordered, you could argue by virtue of the art of data science, we've saved the best for last.
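The propagation procedure described under "How it works" can be sketched on a toy problem. The edge-weight formula is missing from the excerpt, so the code below assumes a Gaussian function of the distance between points; that choice, the function name, the parameters, and the data are all illustrative rather than the book's or scikit-learn's implementation.

```python
import math

def propagate_labels(points, labels, n_classes, sigma=1.0, iterations=100):
    """labels[i] is a class index, or None for an unlabeled point."""
    n = len(points)
    # Edge weights: a Gaussian of the distance between points (an assumed choice).
    weights = [[math.exp(-((points[i] - points[j]) ** 2) / sigma ** 2)
                for j in range(n)] for i in range(n)]
    # Row-normalize the weights into a matrix of transition probabilities.
    transition = [[w / sum(row) for w in row] for row in weights]

    def clamp(dist):
        # Reset every labeled point to its known class.
        for i, lab in enumerate(labels):
            if lab is not None:
                dist[i] = [1.0 if c == lab else 0.0 for c in range(n_classes)]
        return dist

    dist = clamp([[1.0 / n_classes] * n_classes for _ in range(n)])
    for _ in range(iterations):
        # Each point takes the weighted average of its neighbors' label distributions.
        dist = [[sum(transition[i][j] * dist[j][c] for j in range(n))
                 for c in range(n_classes)] for i in range(n)]
        dist = clamp(dist)   # known labels stay fixed
    return [max(range(n_classes), key=lambda c: row[c]) for row in dist]

points = [0.0, 0.5, 1.0, 4.0, 4.5, 5.0]      # made-up 1-D data
labels = [0, None, None, None, None, 1]      # only the two ends are labeled
result = propagate_labels(points, labels, n_classes=2)
print(result)
```

Because nearby points have much larger edge weights than distant ones, each unlabeled point ends up with the label of the labeled point in its own cluster.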

pages: 250 words: 73,574

Nine Algorithms That Changed the Future: The Ingenious Ideas That Drive Today's Computers
by John MacCormick and Chris Bishop
Published 27 Dec 2011

Unfortunately, no one has ever been able to explicitly “teach” a computer to solve more interesting classification tasks, such as the handwritten digits on the next page. So computer scientists turn to the other strategy available: getting a computer to automatically “learn” how to classify samples. The basic strategy is to give the computer a large amount of labeled data: samples that have already been classified. The figure on page 84 shows an example of some labeled data for the handwritten digit task. Because each sample comes with a label (i.e., its class), the computer can use various analytical tricks to extract characteristics of each class. When it is later presented with an unlabeled sample, the computer can guess its class by choosing the one whose characteristics are most similar to the unlabeled sample.
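The strategy outlined here, extracting characteristics of each class and then picking the most similar class for a new sample, can take many forms. One of the simplest stand-ins is a nearest-centroid classifier, sketched below; the book does not name a specific method at this point, so the choice of class averages as "characteristics", the function names, and the data are all assumptions.

```python
def train(samples):
    """Training phase: average the feature vectors of each class."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def classify(centroids, features):
    """Classification phase: pick the class whose average is closest."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(center, features))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

# Made-up two-feature samples with labels
labeled = [([1.0, 1.0], "a"), ([1.2, 0.8], "a"), ([5.0, 5.0], "b"), ([4.8, 5.2], "b")]
centroids = train(labeled)
print(classify(centroids, [1.1, 0.9]), classify(centroids, [4.0, 4.5]))
```

The labeled samples are used only in `train`; `classify` then works on samples that carry no label at all.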

Most pattern recognition tasks can be phrased as classification problems. Here, the task is to classify each handwritten digit as one of the 10 digits 0,1,…, 9. Data source: MNIST data of LeCun et al. 1998.

The process of learning the characteristics of each class is often called “training,” and the labeled data itself is the “training data.” So in a nutshell, pattern recognition tasks are divided into two phases: first, a training phase in which the computer learns about the classes based on some labeled training data; and second, a classification phase in which the computer classifies new, unlabeled data samples.

Obviously, this is an example of a classification task that cannot be performed with perfect accuracy, even by a human: a person's address doesn't tell us enough to predict political affiliations. But, nevertheless, we would like to train a classification system that predicts which party a person is most likely to donate to, based only on a home address.

To train a classifier, a computer needs some labeled data. Here, each sample of data (a handwritten digit) comes with a label specifying one of the 10 possible digits. The labels are on the left, and the training samples are in boxes on the right. Data source: MNIST data of LeCun et al. 1998.

The figure on the next page shows some training data that could be used for this task.

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps
by Valliappa Lakshmanan , Sara Robinson and Michael Munn
Published 31 Oct 2020

Machine learning problems (see Figure 1-1) can be broken into two types: supervised and unsupervised learning. Supervised learning defines problems where you know the ground truth label for your data in advance. For example, this could include labeling an image as “cat” or labeling a baby as being 2.3 kg at birth. You feed this labeled data to your model in hopes that it can learn enough to label new examples. With unsupervised learning, you do not know the labels for your data in advance, and the goal is to build a model that can find natural groupings of your data (called clustering), compress the information content (dimensionality reduction), or find association rules.

Using the first rule of thumb, we would choose an embedding dimension for plurality of 5, and using the second rule of thumb, we’d choose 40. If we are doing hyperparameter tuning, it might be worth searching within this range. Autoencoders Training embeddings in a supervised way can be hard because it requires a lot of labeled data. For an image classification model like Inception to be able to produce useful image embeddings, it is trained on ImageNet, which has 14 million labeled images. Autoencoders provide one way to get around this need for a massive labeled dataset. The typical autoencoder architecture, shown in Figure 2-11, consists of a bottleneck layer, which is essentially an embedding layer.
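The excerpt does not restate the two rules of thumb. The sketch below assumes they are the fourth root of the number of categories and 1.6 times its square root, and assumes the feature has 625 distinct values. Both assumptions are chosen because they reproduce the 5 and 40 quoted above; the function name is hypothetical.

```python
import math

def embedding_dim_candidates(num_categories):
    """Return (fourth-root rule, 1.6 * square-root rule); both are assumed heuristics."""
    low = round(num_categories ** 0.25)
    high = round(1.6 * math.sqrt(num_categories))
    return low, high

# 625 distinct values is an assumption that matches the excerpt's 5 and 40.
low, high = embedding_dim_candidates(625)
search_range = list(range(low, high + 1))  # candidate sizes for hyperparameter tuning
```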

This allows us to break a hard machine learning problem into two parts. First, we use all the unlabeled data we have to go from high cardinality to lower cardinality by using autoencoders as an auxiliary learning task. Then, we solve the actual image classification problem for which we typically have much less labeled data using the embedding produced by the auxiliary autoencoder task. This is likely to boost model performance, because now the model only has to learn the weights for the lower-dimension setting (i.e., it has to learn fewer weights). In addition to image autoencoders, recent work has focused on applying deep learning techniques for structured data.

Data Mining: Concepts and Techniques
by Jiawei Han, Micheline Kamber and Jian Pei
Published 21 Jun 2011

The greater the distance, the more likely that errors will be corrected. 9.7.2. Semi-Supervised Classification Semi-supervised classification uses labeled data and unlabeled data to build a classifier. Let Xl be the set of labeled data and Xu be the set of unlabeled data. Here we describe a few examples of this approach for learning. Self-training is the simplest form of semi-supervised classification. It first builds a classifier using the labeled data. The classifier then tries to label the unlabeled data. The tuple with the most confident label prediction is added to the set of labeled data, and the process repeats (Figure 9.17). Although the method is easy to understand, a disadvantage is that it may reinforce errors.
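The self-training loop described above can be sketched as follows. This is an illustration under stated assumptions rather than the book's algorithm: the base classifier is a one-dimensional nearest-centroid model, "confidence" is taken to be the gap between the distances to the two nearest class centroids, and all names and data are hypothetical.

```python
def fit_centroids(labeled):
    """Base classifier: one mean value per class (1-D nearest centroid)."""
    groups = {}
    for x, y in labeled:
        groups.setdefault(y, []).append(x)
    return {y: sum(xs) / len(xs) for y, xs in groups.items()}

def predict_with_confidence(centroids, x):
    """Return (label, confidence); needs at least two classes."""
    ranked = sorted(centroids, key=lambda y: abs(centroids[y] - x))
    best, runner_up = ranked[0], ranked[1]
    return best, abs(centroids[runner_up] - x) - abs(centroids[best] - x)

def self_train(labeled, unlabeled):
    labeled, unlabeled = list(labeled), list(unlabeled)
    while unlabeled:
        centroids = fit_centroids(labeled)  # rebuild the classifier each round
        # Pick the unlabeled tuple with the most confident prediction.
        x = max(unlabeled, key=lambda u: predict_with_confidence(centroids, u)[1])
        label, _ = predict_with_confidence(centroids, x)
        labeled.append((x, label))          # trust the prediction as a label
        unlabeled.remove(x)
    return fit_centroids(labeled), labeled

final_centroids, pseudo_labeled = self_train(
    [(1.0, "low"), (9.0, "high")],  # made-up labeled seed
    [2.0, 8.5, 4.0, 7.0],           # made-up unlabeled values
)
```

Because each pseudo-label is trusted once added, an early wrong prediction shifts the centroids and can pull later predictions with it, which is the error-reinforcement risk the excerpt mentions.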

Suppose we split the feature set into two sets and train two classifiers, f1 and f2, where each classifier is trained on a different set. Then, f1 and f2 are used to predict the class labels for the unlabeled data, Xu. Each classifier then teaches the other in that the tuple having the most confident prediction from f1 is added to the set of labeled data for f2 (along with its label). Similarly, the tuple having the most confident prediction from f2 is added to the set of labeled data for f1. The method is summarized in Figure 9.17. Cotraining is less sensitive to errors than self-training. A difficulty is that the assumptions for its usage may not hold true, that is, it may not be possible to split the features into mutually exclusive and class-conditionally independent sets.
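A compact sketch of the cotraining exchange follows. It makes several simplifying assumptions that are not in the excerpt: each sample has exactly two features, one per classifier; both classifiers are nearest-centroid models; and a sample leaves the unlabeled pool once either classifier has labeled it. All names and data are hypothetical.

```python
def fit(labeled, view):
    """Nearest-centroid classifier that only sees feature index `view`."""
    groups = {}
    for x, y in labeled:
        groups.setdefault(y, []).append(x[view])
    return {y: sum(v) / len(v) for y, v in groups.items()}

def confident_pick(centroids, pool, view):
    """Return (sample, label) for the pool item this classifier is surest about."""
    def score(x):
        d = sorted(abs(c - x[view]) for c in centroids.values())
        return d[1] - d[0]  # gap between the two nearest centroids
    x = max(pool, key=score)
    label = min(centroids, key=lambda y: abs(centroids[y] - x[view]))
    return x, label

def co_train(seed, pool):
    sets = [list(seed), list(seed)]  # labeled sets for f1 and f2
    pool = list(pool)
    while pool:
        for view, other in ((0, 1), (1, 0)):
            if not pool:
                break
            x, label = confident_pick(fit(sets[view], view), pool, view)
            sets[other].append((x, label))  # this classifier teaches the other one
            pool.remove(x)
    return fit(sets[0], 0), fit(sets[1], 1)

seed = [((0.0, 0.0), "neg"), ((10.0, 10.0), "pos")]      # made-up labeled data
pool = [(1.0, 2.0), (9.0, 8.0), (3.0, 1.0), (7.0, 9.0)]  # made-up unlabeled data
f1, f2 = co_train(seed, pool)
```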

Regression analysis is beyond the scope of this book. Sources for further information are given in the bibliographic notes. 1.4.4. Cluster Analysis Unlike classification and regression, which analyze class-labeled (training) data sets, clustering analyzes data objects without consulting class labels. In many cases, class-labeled data may simply not exist at the beginning. Clustering can be used to generate class labels for a group of data. The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity. That is, clusters of objects are formed so that objects within a cluster have high similarity in comparison to one another, but are rather dissimilar to objects in other clusters.
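One common way to form such clusters is k-means. The excerpt states only the general principle, so k-means is a swapped-in example here; the one-dimensional data, starting centers, and function name are hypothetical.

```python
def k_means_1d(values, centers, iterations=10):
    """Lloyd-style k-means on numbers: assign to nearest center, then recompute means."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(centers[i] - v))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = k_means_1d([1.0, 1.5, 2.0, 8.0, 9.0, 10.0], centers=[0.0, 5.0])
```

No class labels are consulted at any point; the cluster index each value ends up with can afterwards serve as a generated label.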

pages: 416 words: 118,522

Why Machines Learn: The Elegant Math Behind Modern AI
by Anil Ananthaswamy
Published 15 Jul 2024

Also, we did a very particular kind of problem solving called regression, where given some independent variables (x1, x2), we built a model (or equation) to predict the value of a dependent variable (y). There are many other types of models we could have built, and we’ll come to them in due course. In this case, the correlation, or pattern, was so simple that we needed only a small amount of labeled data. But modern ML requires orders of magnitude more—and the availability of such data has been one of the factors fueling the AI revolution. (The ducklings, for their part, likely indulge in a more sophisticated form of learning. No parent duck sits around labeling the data for its ducklings, and yet the babies learn.

With these barest of bare-minimum basics of probability and statistics in hand, we can get back to thinking about machine learning as probabilistic reasoning and statistical learning. SIX OF ONE, HALF A DOZEN OF THE OTHER Let’s start with the most common form of machine learning, one we have already encountered, called supervised learning. We are given some labeled data, X. Each instance of X is a d-dimensional vector, meaning it has d components. So, X is a matrix, where each row of the matrix is one instance of the data. [x1, x2, x3, x4,…, xd] Each instance of X could represent, say, a person. And the components [x1, x2, x3,…, xd] could be values for the person’s height, weight, body mass, cholesterol levels, blood pressure, and so on.

In the first method, given data, the ML algorithm figures out the best θ, for some choice of distribution type (Bernoulli or Gaussian or something else), which maximizes the likelihood of seeing the data, D. In other words, you are estimating the best underlying distribution, with parameter θ, such that if you were to sample from that distribution, you would maximize the likelihood of observing the labeled data you already had in hand. Not surprisingly, this method is called maximum likelihood estimation (MLE). It maximizes P(D | θ), the probability of observing D given θ, and is loosely associated with frequentist methodology. As a concrete example, let’s take two populations of people, one tall and the other short.
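For a Gaussian, the θ that maximizes P(D | θ) has a closed form: the sample mean and the variance computed with a divisor of n. A minimal sketch, assuming a one-dimensional Gaussian and made-up height data (the function name is hypothetical):

```python
def gaussian_mle(samples):
    """Closed-form MLE for a 1-D Gaussian: sample mean and (biased) variance."""
    n = len(samples)
    mu = sum(samples) / n
    var = sum((x - mu) ** 2 for x in samples) / n  # divides by n, not n - 1
    return mu, var

heights_cm = [150.0, 160.0, 170.0]  # made-up data for one population
mu, var = gaussian_mle(heights_cm)
```

Fitting one such Gaussian per population would give two distributions that a new height could then be compared against.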

pages: 307 words: 88,180

AI Superpowers: China, Silicon Valley, and the New World Order
by Kai-Fu Lee
Published 14 Sep 2018

SECOND WAVE: BUSINESS AI First-wave AI leverages the fact that internet users are automatically labeling data as they browse. Business AI takes advantage of the fact that traditional companies have also been automatically labeling huge quantities of data for decades. For instance, insurance companies have been covering accidents and catching fraud, banks have been issuing loans and documenting repayment rates, and hospitals have been keeping records of diagnoses and survival rates. All of these actions generate labeled data points—a set of characteristics and a meaningful outcome—but until recently, most traditional businesses had a hard time exploiting that data for better results.

The writing wasn’t exactly poetry, but the speed was incredible: the “reporter” produced short summaries within two seconds of some events’ finish, and it “covered” over thirty events per day. Algorithms are also being used to sniff out “fake news” on the platform, often in the form of bogus medical treatments. Originally, readers discovered and reported misleading stories—essentially, free labeling of that data. Toutiao then used that labeled data to train an algorithm that could identify fake news in the wild. Toutiao even trained a separate algorithm to write fake news stories. It then pitted those two algorithms against each other, competing to fool one another and improving both in the process. This AI-driven approach to content is paying off.

All of these actions generate labeled data points—a set of characteristics and a meaningful outcome—but until recently, most traditional businesses had a hard time exploiting that data for better results. Business AI mines these databases for hidden correlations that often escape the naked eye and human brain. It draws on all the historic decisions and outcomes within an organization and uses labeled data to train an algorithm that can outperform even the most experienced human practitioners. That’s because humans normally make predictions on the basis of strong features, a handful of data points that are highly correlated to a specific outcome, often in a clear cause-and-effect relationship. For example, in predicting the likelihood of someone contracting diabetes, a person’s weight and body mass index are strong features.

Text Analytics With Python: A Practical Real-World Approach to Gaining Actionable Insights From Your Data
by Dipanjan Sarkar
Published 1 Dec 2016

This is extremely useful in text document categorization and is also called document clustering, where we cluster documents into groups purely based on their features, similarity, and attributes, without training any model on previously labelled data. Later chapters further discuss unsupervised learning, covering topic models, document summarization, similarity analysis, and clustering. Supervised learning refers to specific ML techniques or algorithms that are trained on pre-labelled data samples known as training data. Features or attributes are extracted from this data using feature extraction, and for each data point we will have its own feature set and corresponding class/label.

Remember that documents are basically sentences or paragraphs of text. This forms a corpus. Our task would be to determine which class or classes each document belongs to. This entire process involves several steps which we will be discussing in detail later in this chapter. Briefly, for a supervised classification problem, we need to have some labelled data that we could use for training a text classification model. This data would essentially be curated documents that are already assigned to some specific class or category beforehand. Using this, we would essentially extract features and attributes from each document and make our model learn these attributes corresponding to each particular document and its class/category by feeding it to a supervised ML algorithm.

We also used our TF-IDF weights and vocabulary, obtained earlier when we implemented TF-IDF–based feature vector extraction from documents. Now you have a good grasp on how to extract features from text data that can be used for training a classifier. Classification Algorithms Classification algorithms are supervised ML algorithms that are used to classify, categorize, or label data points based on what they have observed in the past. Each classification algorithm, being a supervised learning algorithm, requires training data. This training data consists of a set of training observations where each observation is a pair consisting of an input data point, usually a feature vector like we observed earlier, and a corresponding output outcome for that input observation.
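The shape of such training data, pairs of a feature vector and an outcome, can be illustrated with a toy example. This sketch assumes plain word counts rather than the TF-IDF weights the book uses, and a deliberately simple word-overlap scorer in place of a real classification algorithm; the corpus and all function names are hypothetical.

```python
from collections import Counter

def extract_features(document):
    """Bag-of-words feature vector: word -> count."""
    return Counter(document.lower().split())

def train(observations):
    """Sum the feature vectors of each class."""
    totals = {}
    for features, label in observations:
        totals.setdefault(label, Counter()).update(features)
    return totals

def predict(model, features):
    """Score each class by how many of the document's words it has seen."""
    def overlap(label):
        return sum(min(count, model[label][word])
                   for word, count in features.items())
    return max(model, key=overlap)

corpus = [("the team won the match", "sports"),
          ("stocks fell on monday", "finance")]
# Each training observation is a (feature vector, label) pair.
training_data = [(extract_features(text), label) for text, label in corpus]
model = train(training_data)
label = predict(model, extract_features("the match was won"))
```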

pages: 122 words: 29,286

Learning Scikit-Learn: Machine Learning in Python
by Raúl Garreta and Guillermo Moncecchi
Published 14 Sep 2013

It aims to provide similar features to those of R, the popular language and environment for statistical computing. We will use pandas to import the Titanic data we presented in Chapter 2, Supervised Learning, and convert them to the scikit-learn format. Let's start by importing the original titanic.csv data into a pandas DataFrame data structure (DataFrame is essentially a two-dimensional labeled data structure where columns can potentially include different data types and each row represents an instance). As usual, we previously import the numpy and pyplot packages.
>>> %pylab inline
>>> import pandas as pd
>>> import numpy as np
>>> import matplotlib.pyplot as plt
Then we import the Titanic data with pandas.
>>> titanic = pd.read_csv('data/titanic.csv')
>>> print titanic
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1313 entries, 0 to 1312
Data columns (total 11 columns):
row.names    1313 non-null values
pclass       1313 non-null values
survived     1313 non-null values
name         1313 non-null values
age          633 non-null values
embarked     821 non-null values
home.dest    754 non-null values
room         77 non-null values
ticket       69 non-null values
boat         347 non-null values
sex          1313 non-null values
dtypes: float64(1), int64(2), object(8)
You can see that each csv column has a corresponding feature into the DataFrame, and that the feature type is induced from the available data.

They combine small amounts of annotated data with huge amounts of unlabeled data. Usually, unlabeled data can reveal the underlying distribution of elements and obtain better results in combination with a small, labeled dataset. Active learning is a particular case within semi-supervised methods. Again, it is useful when labeled data is scarce or hard to obtain. In active learning, the algorithm actively queries a human expert to answer the label of certain unlabeled instances, and thus learn the concept over a reduced set of labeled instances. Reinforcement learning proposes methods where an agent learns from feedback (rewards or reinforcements) after performing actions within an environment.
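The query loop in active learning can be sketched with uncertainty sampling, one common strategy for choosing which instance to ask about. Everything here is an assumption for illustration: a one-dimensional threshold classifier, an `oracle` function standing in for the human expert, and made-up data.

```python
def fit_threshold(labeled):
    """1-D classifier: threshold halfway between the two class means."""
    neg = [x for x, y in labeled if y == 0]
    pos = [x for x, y in labeled if y == 1]
    return (sum(neg) / len(neg) + sum(pos) / len(pos)) / 2

def active_learn(labeled, pool, oracle, budget):
    labeled, pool = list(labeled), list(pool)
    for _ in range(min(budget, len(pool))):
        threshold = fit_threshold(labeled)
        # Query the instance closest to the decision boundary (least certain).
        query = min(pool, key=lambda x: abs(x - threshold))
        labeled.append((query, oracle(query)))
        pool.remove(query)
    return fit_threshold(labeled), labeled

def oracle(x):
    # Stand-in for the human expert; the true boundary at 6.0 is made up.
    return int(x >= 6.0)

threshold, labeled = active_learn(
    [(0.0, 0), (10.0, 1)],      # small labeled seed
    [1.0, 4.0, 5.5, 6.5, 9.0],  # unlabeled pool
    oracle,
    budget=2,
)
```

With a budget of two queries, the loop asks only about the two points nearest the boundary and leaves the easy instances unlabeled.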

The Deep Learning Revolution (The MIT Press)
by Terrence J. Sejnowski
Published 27 Sep 2018

Unsupervised Learning and Cortical Development The Boltzmann machine can be used either in its supervised version, where both inputs and outputs are clamped, or in its unsupervised version, where only the inputs are clamped. Geoffrey Hinton used the unsupervised version to build up a deep Boltzmann machine one layer at a time.22 Starting with a layer of hidden units connected to the input units, called a restricted Boltzmann machine, Geoffrey trained these on unlabeled data, which are a lot easier to come by than labeled data (there are billions of unlabeled images and audio recordings on the Internet), and learning is much faster. The first step in unsupervised learning is to extract from the data statistical regularities that are common to all the data, but the first layer of hidden units can only extract simple features, features that a perceptron can represent.

He later moved to AT&T Bell Laboratories in Holmdel, New Jersey, where he trained a network that could read handwritten zip codes on letters, using the Modified National Institute of Standards and Technology (MNIST) database, a labeled data benchmark. Millions of letters each day have to be routed to mailboxes; today this is fully automated. The same technology also made it possible to read automatically the amount on your bank check at ATM machines. Interestingly, the hardest part is locating where on the check the numbers are written since each check has a different format. [Figure 9.1: Geoffrey Hinton and Yann LeCun have mastered deep learning. This photo was taken at a meeting of the Neural Computation and Adaptive Perception Program of the Canadian Institute for Advanced Research around 2000, a program that was an incubator for what became the field of deep learning. Courtesy of Geoffrey Hinton.]

The world of fashion may be on the brink of a new era, along with many other businesses that depend on creativity. It’s All about Scaling Most of the current learning algorithms were discovered more than twenty-five years ago, so why did it take so long for them to have an impact on the real world? With the computers and labeled data that were available to researchers in the 1980s, it was only possible to demonstrate proof of principle on toy problems. Despite some promising results, we did not know how well network learning and performance would scale as [Figure 9.5: Generative adversarial networks (GANs).]

pages: 1,331 words: 163,200

Hands-On Machine Learning With Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
by Aurélien Géron
Published 13 Mar 2017

Chapter 15: Autoencoders Here are some of the main tasks that autoencoders are used for: Feature extraction Unsupervised pretraining Dimensionality reduction Generative models Anomaly detection (an autoencoder is generally bad at reconstructing outliers) If you want to train a classifier and you have plenty of unlabeled training data, but only a few thousand labeled instances, then you could first train a deep autoencoder on the full dataset (labeled + unlabeled), then reuse its lower half for the classifier (i.e., reuse the layers up to the codings layer, included) and train the classifier using the labeled data. If you have little labeled data, you probably want to freeze the reused layers when training the classifier. The fact that an autoencoder perfectly reconstructs its inputs does not necessarily mean that it is a good autoencoder; perhaps it is simply an overcomplete autoencoder that learned to copy its inputs to the codings layer and then to the outputs.

Running an association rule on your sales logs may reveal that people who purchase barbecue sauce and potato chips also tend to buy steak. Thus, you may want to place these items close to each other. Semisupervised learning Some algorithms can deal with partially labeled training data, usually a lot of unlabeled data and a little bit of labeled data. This is called semisupervised learning (Figure 1-11). Some photo-hosting services, such as Google Photos, are good examples of this. Once you upload all your family photos to the service, it automatically recognizes that the same person A shows up in photos 1, 5, and 11, while another person B shows up in photos 2, 5, and 7.

Whenever someone learns Machine Learning, sooner or later they tackle MNIST. Scikit-Learn provides many helper functions to download popular datasets. MNIST is one of them. The following code fetches the MNIST dataset:1
>>> from sklearn.datasets import fetch_mldata
>>> mnist = fetch_mldata('MNIST original')
>>> mnist
{'COL_NAMES': ['label', 'data'],
 'DESCR': 'mldata.org dataset: mnist-original',
 'data': array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
 'target': array([ 0., 0., 0., ..., 9., 9., 9.])}
Datasets loaded by Scikit-Learn generally have a similar dictionary structure including:
A DESCR key describing the dataset
A data key containing an array with one row per instance and one column per feature
A target key containing an array with the labels
Let’s look at these arrays:
>>> X, y = mnist["data"], mnist["target"]
>>> X.shape
(70000, 784)
>>> y.shape
(70000,)
There are 70,000 images, and each image has 784 features.

pages: 463 words: 105,197

Radical Markets: Uprooting Capitalism and Democracy for a Just Society
by Eric Posner and E. Weyl
Published 14 May 2018

If we allow a complex set of rules to predict presidential elections, there are too few examples to fit these complex rules and thus our rules can easily “overfit” to inessential features of the elections, resulting in bad predictions. The more complex the rules we want to fit (the deeper and more fully connected the neural net), the more data we need to avoid overfitting. Computer scientists and statisticians call the number of labeled data points needed to avoid overfitting for a problem (such as recognizing faces, or artistic styles) the “sample complexity” of the problem.15 FIGURE 5.3: The problem of overfitting, illustrated by predicting presidential elections. Source: Excerpted from “Electoral Precedent” at https://xkcd.com/1122/.

Technofeudalism Why, then, do siren servers not voluntarily pay their users to supply the high-quality data that would allow them to develop the best services? If data production is labor, why doesn’t a market for data work emerge as a part of the broader labor market? In fact, we have seen tentative first signs of markets for high-quality, labeled data. Many researchers and some companies use Amazon’s Mechanical Turk (mTurk) marketplace to pay online workers to label and clean data sets, and to participate in social-science experiments. This is not entirely new. Television ratings are still determined by Nielsen, which pays households a small fee to record their viewing.

Instead, they are smaller companies, academic researchers, and financial firms with no direct access to data. Many of these businesses have exciting prospects. Work Fusion, for example, offers a sophisticated incentive scheme to workers to help train AIs to automate business processes. Might AI firms hire workers to label maps and road images and sell the labeled data to companies producing self-driving cars? However, the total size of these markets is tiny compared to the number of users who produce data used by the siren servers. The number of workers on mTurk is in the tens of thousands, compared to billions of users of services offered by Google and Facebook.25 The data titans (Google, Facebook, Microsoft, etc.) do not pay for most of their data.

Mastering Machine Learning With Scikit-Learn
by Gavin Hackeling
Published 31 Oct 2014

Learning from experience Machine learning systems are often described as learning from experience either with or without supervision from humans. In supervised learning problems, a program predicts an output for an input by learning from pairs of labeled inputs and outputs; that is, the program learns from examples of the right answers. In unsupervised learning, a program does not learn from labeled data. Instead, it attempts to discover patterns in the data. For example, assume that you have collected data describing the heights and weights of people. An example of an unsupervised learning problem is dividing the data points into groups. A program might produce groups that correspond to men and women, or children and adults. Now assume that the data is also labeled with the person's sex.

However, machine learning algorithms also follow the maxim "garbage in, garbage out." A student who studies for a test by reading a large, confusing textbook that contains many errors will likely not score better than a student who reads a short but well-written textbook. Similarly, an algorithm trained on a large collection of noisy, irrelevant, or incorrectly labeled data will not perform better than an algorithm trained on a smaller set of data that is more representative of problems in the real world. Many supervised training sets are prepared manually, or by semi-automated processes. Creating a large collection of supervised data can be costly in some domains.

Four Battlegrounds
by Paul Scharre
Published 18 Jan 2023

“The thing that we spent the most time on is data,” Brown said. “You need data to train your models.” Yet getting access to data, especially clean data that could be used to train a machine learning model, was no small feat. Brown and his team faced a raft of problems in gaining access to sufficient amounts of clean, labeled data to train their algorithms. Their experience demonstrates the many practical challenges with data, particularly in a bureaucracy like the Defense Department. First, there were bureaucratic hurdles to overcome. The normal practice was for drone footage collected during domestic operations to be destroyed afterwards for intelligence oversight reasons.

Again piggybacking off of Maven, the JAIC contracted with a data labeling company to label the images appropriately so that CrowdAI could build the semantic segmentation model to “paint” the video. All told, it took six months to access, assemble, clean, label, and pre-process the data. “Once we had the labeled data set, we sent it over to the subcontractor, and they produced an initial model within two weeks,” Air Force Captain Dominic Garcia said. “Among all of these pieces and parts, the easiest is the algorithm,” Colonel Jason Brown said. It was a point I repeatedly heard echoed by others across DoD.

“We didn’t know anything about radar,” John said, but “it turned out not to be as important as some of the other things” such as “knowing how to do deep learning well” and “a really disciplined software engineering approach.” He said, “That’s one of the recurring stories of . . . deep learning since 2012, is that domain expertise isn’t always the thing that’s going to matter so much. . . . The labeled data was sufficient.” Deep Learning Analytics won a $6 million contract from DARPA for the TRACE program, beating out competitors that had better human expert knowledge on radar imaging. Now they suddenly had to grow the company. Yet John’s first hire wasn’t an engineer. His biggest concern wasn’t technology; it was government contracting requirements.

pages: 208 words: 57,602

Futureproof: 9 Rules for Humans in the Age of Automation
by Kevin Roose
Published 9 Mar 2021

Social networks like Facebook, Twitter, and YouTube rely on armies of low-paid contractors who sift through objectionable content all day, deciding which posts to leave up and which to take down. AI assistants like Alexa are helped by “data annotators,” humans who listen in on recordings of users’ conversations and help the system improve over time by labeling data, correcting mistakes, and training the AI to understand accents and unusual requests. In China, “data labeling” companies have sprung up to fill a need for huge numbers of workers who spend all day doing the kinds of mundane clerical work that make AI possible—for example, labeling images and tagging audio clips.

When we see something unexpected, we do a double-take—we back up and reprocess the visual information, making different assumptions about what it might represent. But today’s AIs can’t do that. Since they have no holistic model of the world and how humans interact with it (what we might call “common sense”), most AIs depend on having lots of high-quality examples at their disposal. There are types of AI that don’t require a whole bunch of labeled data, such as unsupervised learning, a technique in which an algorithm is told to go out and hunt for patterns in a big, messy data set. And some types of AI are getting better at handling new situations. But they’re still pretty far away from being able to navigate them with ease. Which means that humans who are good at handling the unexpected—who are cool in a crisis, who like dealing with messy problems and novel scenarios, and who can move forward even in the absence of a concrete plan—still have an advantage.

The Data Journalism Handbook
by Jonathan Gray, Lucy Chambers and Liliana Bounegru
Published 9 May 2012

The most famous is her “coxcomb,” a spiral of sections each representing deaths per month, which highlighted that the vast majority of deaths were from preventable diseases rather than bullets. Figure 1-10. Mortality of the British army by Florence Nightingale (image from Wikipedia) Data Journalism and Computer-Assisted Reporting At the moment there is a “continuity and change” debate going on around the label “data journalism” and its relationship with previous journalistic practices that employ computational techniques to analyze datasets. Some argue that there is a difference between CAR and data journalism. They say that CAR is a technique for gathering and analyzing data as a way of enhancing (usually investigative) reportage, whereas data journalism pays attention to the way that data sits within the whole journalistic workflow.

It is by now common sense that even the most recent media practices have histories, as well as something new in them. Rather than debating whether or not data journalism is completely novel, a more fruitful position would be to consider it as part of a longer tradition, but responding to new circumstances and conditions. Even if there might not be a difference in goals and techniques, the emergence of the label “data journalism” at the beginning of the century indicates a new phase wherein the sheer volume of data that is freely available online—combined with sophisticated user-centric tools, self-publishing, and crowdsourcing tools—enables more people to work with more data more easily than ever before. Data Journalism Is About Mass Data Literacy Digital technologies and the web are fundamentally changing the way information is published.

pages: 487 words: 124,008

Your Face Belongs to Us: A Secretive Startup's Quest to End Privacy as We Know It
by Kashmir Hill
Published 19 Sep 2023

He funded a professor at Virginia’s George Mason University to recruit more than one thousand people to sit for extensive portrait sessions, their faces captured from a variety of angles, to create a face database called FERET. The weaselly acronym incorporated the F and the E from FacE, RE from REcognition, and T from Technology. People in the software world call these thousands of photos, each clearly tied to a particular individual, “labeled data”; they are key for both training algorithms and testing how well they work. FERET became the benchmark against which NIST could measure facial recognition programs. It could do a head-to-head test—literally—to see which company excelled. In 2000, P. Jonathon Phillips ran the first Facial Recognition Vendor Test, giving each company’s algorithm the same fairly simple challenge: judging whether two photos, both taken in a studio with ideal lighting, were of the same person.

“Picasa’s facial recognition technology will ask you to identify people in your pictures that you haven’t tagged yet,” TechCrunch had reported at the time. “Once you do and start uploading more pictures, Picasa starts suggesting tags for people based on the similarity between their face in the picture and the tags you already put in place for them.” The most tedious task in artificial intelligence was gathering heaps of “labeled data,” those examples that a computer needs to parse in order to learn. With Picasa, Google had come up with a clever way to outsource that tedium. Picasa users dutifully labeled their friends’ faces when they uploaded party pictures, vacation photos, and family portraits. Unpaid and for fun, they helped train Google’s facial recognition algorithm so that it could more easily link different photos of the same face.

See also NIST (National Institute of Standards and Technology) Faraday bags, 227 FarmVille, 4–5, 6 “fatface,” 32 Fawkes, 242 FBI (Federal Bureau of Investigation) Capitol insurrection and, 229 Congress and, 142, 143–144, 158 Government Accountability Office and, 247 NIST and, 70, 71 NYPD and, 132 Federal Police of Brazil, 136 Federal Trade Commission (FTC), 121–123, 126–127, 143, 240 FERET, 67–68, 103 Ferg-Cadima, James, 82–87, 151, 158 Ferrara, Nicholas, xii–xiii Fight for the Future, 238 Financial Crimes Task Force, 128–131 FindFace, 34–35, 220–222 Findley, Josh, 134–136 fingerprints/fingerprinting early efforts regarding, 22 Pay by Touch and, 82, 84–87 theft of, 84 First 48, The, 230 First Amendment Abrams case and, 206–207, 208–209, 212–213 Gawker and, 15 violence and, 306n206 FitMob, 8 Flanagan, Chris, 129, 131 Flickr, 199, 242, 246–247 Flipshot, 9 Floyd, George, 207, 239 FOIA (Freedom of Information Act), 141, 157 4Chan, 94 Fourth Amendment, 141 Franken, Al, 140–144, 148–151 Freedom of Information Act (FOIA), 141, 157 FreeOnes, 197 Frydman, Ken, 28 FTC (Federal Trade Commission), 121–123, 126–127, 143, 240 Fuller, Matt, 274n52 Fussey, Pete, 218–219, 235 G Gaetz, Matt, 285n95 Galaxy Nexus, 109 Galton, Francis, 17–20, 22, 25, 33 GAO (General Accounting Office), 62, 65–66 Garrison, Jessica Medeiros, 138–139, 160 Garvie, Clare, 156–157, 178 Gaslight lounge, 50–51 Gawker Media Johnson and, 11, 12 Pay by Touch and, 85 Thiel and, 14–15 Ton-That and, xi, 7, 93, 116, 164 gaydar, 31 GDPR (General Data Protection Regulation), 191 gender Congressional representation and, 89 differences in facial recognition and, 48, 69–70, 124, 125, 156, 178, 240 IQ and, 32 General Accounting Office (GAO), 62, 65–66 General Data Protection Regulation (GDPR), 191 Genetic Information Nondiscrimination Act (GINA), 83 Gibson, William, 113 Giesea, Jeff, 53 Gilbert, John J., III, 129 GINA (Genetic Information Nondiscrimination Act), 83 Girard, René, 12 GitHub, 72, 74 Giuliani, Rudy Schwartz 
and, 27–28, 29, 89, 90, 129, 161 Waxman and, 80 Global Positioning System (GPS), 43 Gmail, 102 Gone Wild, 200 Good, John, x, 160 Google AI Principles of, 108 BIPA and, 151 Clearview AI and, 96, 165 CSAM and, 135 diverse datasets and, 179 “Face Facts” workshop and, 121–122 Face Unlock and, 109, 179 Franken and, 141, 142 FTC and, 123 Goggles and, 99–100, 102 hesitation of regarding facial recognition technology, 99–110 identification technology and, 145 Images, 79 lawsuits against, 306n206 Lunar XPRIZE and, 192 Maps and, 100–101, 102 mining of Gmail by, 102 monetization and, 6–7 neural networks technology and, 74 phone locks and, ix Photos app, 272n48 PittPatt and, 108–109, 110 Schmidt and, 27 State Privacy and Security Coalition and, 150 Street View and, 100–101, 102 TensorFlow and, 208 ViddyHo and, 7 GotNews, 11–12 Government Accountability Office, 247 GPS (Global Positioning System), 43 Greenwald, Glenn, 149 Greer, Evan, 237–238 Grewal, Gurbir, 165 Gristedes, 114 Grossenbacher, Timo, 194–196 Grother, Patrick, 69–70 Grunin, Nikolay, 309n222 Guardian, The, 149 gunshot detection systems, 232–233, 234 Gutierrez, Alejandro (“Gooty”), 231–232, 234 H Hacker News, 4 Hamburg Data Protection Authority (DPA), 192 Haralick, Robert, 271n47 Harrelson, Woody, 156–157 Harris, Andy, 274n52 Harvard Law Review, on privacy, viii Harvey, Adam, 241 Haskell, 4 Health and Human Services Department, 247 Hereditary Genius (Galton), 20 Hikvision, 177, 215–216, 226 Hinton, Geoffrey, 73, 74, 295n146 History of Animals (Aristotle), 17 Hogan, Hulk, 15 Hola, 275n58 Hollerith, Herman, 24–25 Homeland Security, Department of, 134–136, 235 Hooton, Earnest A., 25, 26 Hot Ones, 115 Howard, Zach, 145 Howell, Christopher, 207–208 HTML (Hyper Text Markup Language), 78 Huawei, 226 HuffPost, 165 Hungary, 56–59 Hyper Text Markup Language (HTML), 78 I “I Have a Dream” speech (King), 39 IARPA (Intelligence Advanced Research Projects Activity), 106 IBM, 25, 156, 239 Incredibles, The, 119 Independent, The, 34 
Indiana State Police, 133 Inmar Intelligence, 87 Insight Camera, 187 Instagram Brown and, 11 Clearview AI and, vii Marx and, 192 scraping and, 195 Williams and, 182 Instaloader, 194–195 Instant Checkmate, 58 Intel, ix, 123–125 Intelligence Advanced Research Projects Activity (IARPA), 106 internet DARPA and, 43 embodied, 145–146 Trump and, 54 Internet Archive, 78 Interpol, 136 Intimcity, 221 investigative lead reports, 71, 131, 176, 180 iPhone developer tools for, 9 release of, ix, 6 unlocking and, 109 IQ, 32 “I’ve Just Seen a Face,” 125 Iveco vans, 214 J James, LeBron, 14 January 6 insurrection, 228–230 Java, 4 Jayapal, Pramila, 239 Je Suis Maidan, 222 Jenner, Kendall, 115 Jewel-Osco, 82 Jobs, Steve, ix Johnson, Charles Carlisle “Chuck” background of, 11–12 on blocking author’s face, 163 contact with, 165 early plans of, 31, 34 FindFace and, 220 Gawker and, 15 investors and, 116 Jukic and, 134 Orbán and, 56 ouster of, 94–96 at Republican Convention, 11, 15–16 Scalzo and, 118–119 Schwartz and, 29, 161–162 Smartcheckr and, 52–53, 72, 79, 80 Ton-That and, 12–14, 27, 33, 161–162, 247–249 Trump and, 50, 51–52, 53, 54 Joint Terrorism Task Force, 132 Jukic, Marko, xvii–xviii, 134, 162 Justice Department, 66, 247 K Kalanick, Travis, 189 Kanade, Takeo, 40, 41–42, 103 Keeper, 117 Kennedy, John F., 41 King, Martin Luther, Jr., 39–40 King-Hurley Research Group, 268–269n38 Kirenaga Partners, xiv–xv, 111, 112, 160 Knox, Belle, 198 Kodak, viii, 104, 179 Krizhevsky, Alex, 295n146 Kroger’s, 87 Krolik, Aaron, 164 Kutcher, Ashton, 114–115 Kuznetsova, Anna, 222–223 L L-1 Identity Solutions, 71 labeled data, 67 Lambert, Hal, 118 law enforcement. 
See also New York City Police Department (NYPD); individual government entities abuses by, 156–157 Atlanta Police Department, 157–159, 160 “broken windows” policing, 129 Chicago Police Department, 155 Clearview AI and, 128, 130–139 Detroit Police Department, 169–177, 183 Federal Police of Brazil, 136 Franken and, 143–144 Metropolitan Police (London), 214–219 Miami Police Department, 230–236 Michigan State Police, 180 New York City Police Department (NYPD), 128–133, 138 Ontario regional police services, 136 PittPatt and, 105–106 private surveillance and, 204 Queensland Police Service, 136 Royal Canadian Mounted Police, 136 Schwartz and, 132–133 South Wales Police, 219 surveillance by, 155 Swedish Police Authority, 136 Toronto Police, 136 laws/legislation antidiscrimination laws, 188 antihacking laws, 117 anti-wiretapping laws, 238 Biometric Information Privacy Act (BIPA; Illinois), 86, 122, 151–152, 158, 204, 205, 206, 213 Computer Fraud and Abuse Act, 117–118 consumer protection laws, 205 Freedom of Information Act (FOIA), 141, 157 General Data Protection Regulation (GDPR), 191 Genetic Information Nondiscrimination Act (GINA), 83 location privacy bill, 149–150 privacy laws, 191 Texas Capture or Use of Biometric Identifier Act, 83 USA PATRIOT Act, 64 Lawyer Appreciation Day, 213 LeCun, Yann, 73 Leone, Doug, 113–114 Les Ambassadeurs, 219–220 Lewis, John, 138 Leyvand, Tommer, 144–146, 152 Linden, Lisa Abrams case and, 209 Lawyer Appreciation Day and, 213 NYT article and, 161 Ton-That interviews and, 164, 245, 249, 250 LinkedIn Clearview AI and, vii, 112, 160, 165 Clearview AI’s lack of presence on, x Schwartz and, 94 scraping and, 58, 117–118 Smartcheckr and, 53 use of contact lists by, 7 Lipton, Beryl, 157 Lisp, 4 Liu, Terence Z.

pages: 589 words: 69,193

Mastering Pandas
by Femi Anthony
Published 21 Jun 2015

A more detailed introduction to machine learning is given in the following paper: A Few Useful Things to Know about Machine Learning at http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf

Supervised versus unsupervised learning

For supervised learning problems, the input to a learning problem is a dataset consisting of labeled data. By this we mean that we have outputs whose values are known. The learning program is fed input samples and their corresponding outputs and its goal is to decipher the relationship between them. Such input is known as labeled data. Supervised learning problems include the following:

Classification: The learned attribute is categorical (nominal) or discrete
Regression: The learned attribute is numeric/continuous

In unsupervised learning or data mining, the learning program is fed inputs but no corresponding outputs.
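The distinction between classification and regression on labeled data can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the sample values, the nearest-neighbour classifier, and the least-squares line are assumptions chosen for brevity, not the book's own example.

```python
# Labeled data: every sample pairs an input with a known output.
# All values below are invented for illustration.
fruit_samples = [(150.0, "apple"), (170.0, "apple"),
                 (110.0, "lemon"), (120.0, "lemon")]      # weight (g) -> category
house_samples = [(50.0, 150.0), (80.0, 240.0), (100.0, 300.0)]  # area -> price


def classify_nearest(samples, x):
    """Classification: return the categorical label of the closest training input."""
    _, label = min(samples, key=lambda pair: abs(pair[0] - x))
    return label


def fit_line(samples):
    """Regression: ordinary least-squares fit of y = slope * x + intercept."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


predicted_label = classify_nearest(fruit_samples, 160.0)   # a category
slope, intercept = fit_line(house_samples)
predicted_price = slope * 90.0 + intercept                  # a number
```

In both cases the learner sees inputs together with their known outputs; only the type of the output (category versus number) differs.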

pages: 339 words: 92,785

I, Warbot: The Dawn of Artificially Intelligent Conflict
by Kenneth Payne
Published 16 Jun 2021

Perhaps it’s the ability to sift through mountains of data, looking for a small but telling nugget—like the supercomputers of the NSA trawling relentlessly through electronic intercepts of mobile phone communication looking for keywords; or the hydrophone detectors lining choke-points on submarine routes, waiting for the subtlest sounds of a cavitating screw. Many tactical activities are susceptible, at least in part, to the comparative advantages of deep learning AI. Where there are masses of labelled data on which to train an algorithm with the techniques of supervised learning, as with image recognition, AI has potential. So too where there is scope to rehearse an activity endlessly, like a simulated dogfight, applying the techniques of reinforcement learning, or genetic algorithms. Warfare at cyber speed AI’s speed will certainly be important in one vital aspect of tactics—the struggle to secure the information on which modern warfare depends.

Once the master algorithm has figured out what to look for in each environment, that information is used to train specialised algorithms which quickly achieve superhuman performance in each game. This is a form of unsupervised learning—of an algorithm making sense of its environment from first principles, rather than by working backwards from a set of laboriously labelled data. As with its earlier research, the test bed for DeepMind’s research was a suite of classic Atari arcade games—suggesting that we are still a long way from an AI designing a new submarine. Other deep learning approaches will doubtless play a part here too, like Generative Adversarial Networks, which improve performance by competing against one another—a very loose analogy to natural selection.

pages: 292 words: 94,660

The Loop: How Technology Is Creating a World Without Choices and How to Fight Back
by Jacob Ward
Published 25 Jan 2022

Machine learning refers to algorithms that get better at a task through experience. Machine learning draws on past patterns to make future predictions. But it cannot reach out beyond the data it has; to make new predictions, it needs new data. There are several forms of machine learning in common use at the moment. First, supervised learning refers to systems shown enough labeled data and enough correct answers (“this is an orange; this is an orange that has gone bad; this is an orange that is ripe and healthy”) that it can pick out patterns in the data. Ask it to identify specific outcomes (a ripe orange, an orange that will be ripe after a week of shipping, a rotten orange), and if it has seen enough of the patterns that correlate to those outcomes in the past, it can spot the patterns that will likely correlate to the same outcomes in the future.

Third, reinforcement learning is another way of processing raw, unlabeled data, this time through reward and punishment. A training algorithm infers what you want out of the data, then flogs the system for incorrect answers and rewards it for correct answers. In this way, reinforcement learning teaches the system to sort out the most efficient means of avoiding punishment and earning praise. With no labeled data, it just goes on a craven search for whatever patterns are most likely to earn a reward. Let’s apply any or all of these three flavors of machine learning to a single task: distinguishing cows from dogs. Imagine we’re in a theater. Ranged across the stage are a dozen dogs and a dozen cows. Some of the dogs are sitting, some are standing, but it’s hot in the vast room, so all of them are panting.

pages: 688 words: 107,867

Python Data Analytics: With Pandas, NumPy, and Matplotlib
by Fabio Nelli
Published 27 Sep 2018

In fact, this choice not only makes this library compatible with most other modules, but also takes advantage of the high quality of the NumPy module. Another fundamental choice was to design ad hoc data structures for data analysis. In fact, instead of using existing data structures built into Python or provided by other libraries, two new data structures were developed. These data structures are designed to work with relational data or labeled data, thus allowing you to manage data with features similar to those designed for SQL relational databases and Excel spreadsheets. Throughout the book in fact, you will see a series of basic operations for data analysis, which are normally used on database tables and spreadsheets. pandas in fact provides an extended set of functions and methods that allow you to perform these operations efficiently.

To do this, you can use the SVR method provided by the scikit-learn library.

    x = np.array(dist)
    y = np.array(temp_max)
    x1 = x[x<100]
    x1 = x1.reshape((x1.size,1))
    y1 = y[x<100]
    x2 = x[x>50]
    x2 = x2.reshape((x2.size,1))
    y2 = y[x>50]
    from sklearn.svm import SVR
    svr_lin1 = SVR(kernel='linear', C=1e3)
    svr_lin2 = SVR(kernel='linear', C=1e3)
    svr_lin1.fit(x1, y1)
    svr_lin2.fit(x2, y2)
    xp1 = np.arange(10,100,10).reshape((9,1))
    xp2 = np.arange(50,400,50).reshape((7,1))
    yp1 = svr_lin1.predict(xp1)
    yp2 = svr_lin2.predict(xp2)
    plt.plot(xp1, yp1, c="r", label='Strong sea effect')
    plt.plot(xp2, yp2, c="b", label='Light sea effect')
    plt.axis((0,400,27,32))
    plt.scatter(x, y, c="k", label="data")

This code will produce the chart shown in Figure 10-12.

Figure 10-12. The two trends described by the maximum temperatures in relation to distance

As you can see, temperature increase in the first 60 km is very rapid, rising from 28 to 31 degrees. It then increases very mildly (if at all) over longer distances.

pages: 197 words: 35,256

NumPy Cookbook
by Ivan Idris
Published 30 Sep 2012

At the end, the correlation will be printed, and plot will be shown.

Creating the data frame. To create the data frame, we will create a dictionary containing stock symbols as keys, and the corresponding log returns as values. The data frame itself has the date as index and the stock symbols as column labels:

    data = {}
    for i in xrange(len(symbols)):
        data[symbols[i]] = numpy.diff(numpy.log(close[i]))
    df = pandas.DataFrame(data, index=dates[0][:-1], columns=symbols)

Operating on the data frame. We can now perform operations, such as calculating a correlation matrix or plotting, on the data frame:

    print df.corr()
    df.plot()

The complete source code that also downloads the price data is as follows:

    import pandas
    from matplotlib.pyplot import show, legend
    from datetime import datetime
    from matplotlib import finance
    import numpy

    # 2011 to 2012
    start = datetime(2011, 01, 01)
    end = datetime(2012, 01, 01)
    symbols = ["AA", "AXP", "BA", "BAC", "CAT"]
    quotes = [finance.quotes_historical_yahoo(symbol, start, end, asobject=True) for symbol in symbols]
    close = numpy.array([q.close for q in quotes]).astype(numpy.float)
    dates = numpy.array([q.date for q in quotes])
    data = {}
    for i in xrange(len(symbols)):
        data[symbols[i]] = numpy.diff(numpy.log(close[i]))
    df = pandas.DataFrame(data, index=dates[0][:-1], columns=symbols)
    print df.corr()
    df.plot()
    legend(symbols)
    show()

Output for the correlation matrix:

               AA       AXP        BA       BAC       CAT
    AA   1.000000  0.768484  0.758264  0.737625  0.837643
    AXP  0.768484  1.000000  0.746898  0.760043  0.736337
    BA   0.758264  0.746898  1.000000  0.657075  0.770696
    BAC  0.737625  0.760043  0.657075  1.000000  0.657113
    CAT  0.837643  0.736337  0.770696  0.657113  1.000000

The following image shows the plot for the log returns of the five stocks:

How it works...
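The two calculations the recipe relies on, log returns and a correlation matrix, can also be sketched without any third-party library. The following is a minimal Python 3 illustration using only the standard library; the ticker names and prices are made up, and the helper names `log_returns` and `pearson` are hypothetical, not part of NumPy or pandas.

```python
import math


def log_returns(prices):
    """Differences of consecutive log prices (what diff(log(prices)) computes)."""
    logs = [math.log(p) for p in prices]
    return [later - earlier for earlier, later in zip(logs, logs[1:])]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical closing prices; BBB is exactly twice AAA, so their
# log returns coincide and their correlation should be 1.
closes = {
    "AAA": [10.0, 11.0, 10.5, 12.0],
    "BBB": [20.0, 22.0, 21.0, 24.0],
    "CCC": [30.0, 29.0, 31.0, 30.0],
}
returns = {symbol: log_returns(prices) for symbol, prices in closes.items()}

# Nested dict playing the role of the correlation matrix.
corr = {a: {b: pearson(returns[a], returns[b]) for b in returns} for a in returns}
```

A series of n prices yields n - 1 log returns, which is why the recipe drops the last date (`dates[0][:-1]`) when building the index.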

System Error: Where Big Tech Went Wrong and How We Can Reboot
by Rob Reich , Mehran Sahami and Jeremy M. Weinstein
Published 6 Sep 2021

The trick is to build a model with the small amount of supervised data available and then use the model to predict labels for a large volume of unsupervised data. Now armed with a new batch of labeled data, programmers repeat the process over and over, thereby enabling them to label even more of the previously unlabeled data. Rinse and repeat with oceans of unlabeled data points that are just sitting out there on the open web or have been collected by companies such as Google and Facebook that track nearly everything we do online. Many thousands of computers then process all of this at high speed. A revolution in capability was unleashed that is growing to this day. Of course, as the pool of labeled data expands in this way, it is also possible for the model to go horribly wrong since an erroneous prediction early in the process will be compounded by later predictions.
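The loop described here (train on the labeled pool, predict labels for unlabeled points, fold the predictions back in, repeat) is often called self-training. A minimal sketch follows, assuming one-dimensional inputs, a nearest-centroid model, and a confidence margin; all three are illustrative choices and not the method the authors have in mind.

```python
def centroids(labeled):
    """The 'model' in this sketch: the mean input value for each label."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(model, x):
    """Return the nearest label and how much closer it is than the runner-up."""
    ranked = sorted(model, key=lambda y: abs(model[y] - x))
    best, second = ranked[0], ranked[1]
    margin = abs(model[second] - x) - abs(model[best] - x)
    return best, margin


def self_train(labeled, unlabeled, min_margin=1.0, max_rounds=10):
    """Repeatedly pseudo-label confident points and retrain on the larger pool."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        model = centroids(labeled)
        confident, uncertain = [], []
        for x in unlabeled:
            label, margin = predict(model, x)
            if margin >= min_margin:
                confident.append((x, label))
            else:
                uncertain.append(x)
        if not confident:          # nothing new to learn from
            break
        labeled.extend(confident)  # predictions now count as labels
        unlabeled = uncertain
    return labeled, unlabeled


seed = [(1.0, "low"), (9.0, "high")]             # small hand-labeled set
pool = [2.0, 3.0, 4.0, 5.2, 6.0, 7.0, 8.0]       # unlabeled inputs
final_labeled, leftover = self_train(seed, pool)
```

The margin threshold is one simple guard against the failure mode the excerpt warns about: a pseudo-label that is wrong early on is treated as ground truth in every later round.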

pages: 424 words: 114,905

Deep Medicine: How Artificial Intelligence Can Make Healthcare Human Again
by Eric Topol
Published 1 Jan 2019

It’s also useful to think of algorithms as existing on a continuum from those that are entirely human guided to those that are entirely machine guided, with deep learning at the far machine end of the scale.12

Artificial Intelligence: the science and engineering of creating intelligent machines that have the ability to achieve goals like humans via a constellation of technologies
Neural Network (NN): software constructions modeled after the way adaptable neurons in the brain were understood to work instead of human guided rigid instructions
Deep Learning: a type of neural network, the subset of machine learning composed of algorithms that permit software to train itself to perform tasks by processing multilayered networks of data
Machine Learning: computers’ ability to learn without being explicitly programmed, with more than fifteen different approaches like Random Forest, Bayesian networks, Support Vector machine uses, computer algorithms to learn from examples and experiences (datasets) rather than predefined, hard rules-based methods
Supervised Learning: an optimization, trial-and-error process based on labeled data, algorithm comparing outputs with the correct outputs during training
Unsupervised Learning: the training samples are not labeled; the algorithm just looks for patterns, teaches itself
Convolutional Neural Network: using the principle of convolution, a mathematical operation that basically takes two functions to produce a third one; instead of feeding in the entire dataset, it is broken into overlapping tiles with small neural networks and max-pooling, used especially for images
Natural-Language Processing: a machine’s attempt to “understand” speech or written language like humans
Generative Adversarial Networks: a pair of jointly trained neural networks, one generative and the other discriminative, whereby the former generates fake images and the latter tries to distinguish them from real images
Reinforcement Learning: a type of machine learning that shifts the focus to an abstract goal or decision making, a technology for learning and executing actions in the real world
Recurrent Neural Network: for tasks that involve sequential inputs, like speech or language, this neural network processes an input sequence one element at a time
Backpropagation: an algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation on the previous layer passing values backward through the network; how the synapses get updated over time; signals are automatically sent back through the network to update and adjust the weighting values
Representation Learning: set of methods that allows a machine with raw data to automatically discover the representations needed for detection or classification
Transfer Learning: the ability of an AI to learn from different tasks and apply its precedent knowledge to a completely new task
General Artificial Intelligence: perform a wide range of tasks, including any human task, without being explicitly programmed

TABLE 4.1: Glossary.

Li’s 2015 TED Talk “How We’re Teaching Computers to Understand Pictures” has been viewed more than 2 million times, and it’s one of my favorites.41 FIGURE 4.6: Over time, deep learning AI has exceeded human performance for image recognition. Source: Adapted from Y. Shoham et al., “Artificial Intelligence Index 2017 Annual Report,” CDN AI Index (2017): http://cdn.aiindex.org/2017-report.pdf. The open-source nature of ImageNet’s large carefully labeled data was essential for this transformation of machine image interpretation to take hold. Following suit, in 2016, Google made its Open Images database, with 9 million images in 6,000 categories, open source. Image recognition isn’t just a stunt for finding cats in videos. The human face has been at center stage.

pages: 444 words: 117,770

The Coming Wave: Technology, Power, and the Twenty-First Century's Greatest Dilemma
by Mustafa Suleyman
Published 4 Sep 2023

Much of AI’s progress during the mid-2010s was powered by the effectiveness of “supervised” deep learning. Here AI models learn from carefully hand-labeled data. Quite often the quality of the AI’s predictions depends on the quality of the labels in the training data. However, a key ingredient of the LLM revolution is that for the first time very large models could be trained directly on raw, messy, real-world data, without the need for carefully curated and human-labeled data sets. As a result almost all textual data on the web became useful. The more the better. Today’s LLMs are trained on trillions of words.

pages: 2,466 words: 668,761

Artificial Intelligence: A Modern Approach
by Stuart Russell and Peter Norvig
Published 14 Jul 2019

Although such systems can reach a high level of test-set accuracy—as shown by the ImageNet competition results, for example—they often require far more labeled data than a human would for the same task. For example, a child needs to see only one picture of a giraffe, rather than thousands, in order to be able to recognize giraffes reliably in a wide range of settings and views. Clearly, something is missing in our deep learning story; indeed, it may be the case that our current approach to supervised deep learning renders some tasks completely unattainable because the requirements for labeled data would exceed what the human race (or the universe) can supply. Moreover, even in cases where the task is feasible, labeling large data sets usually requires scarce and expensive human labor.

For these reasons, there is intense interest in several learning paradigms that reduce the dependence on labeled data. As we saw in Chapter 19, these paradigms include unsupervised learning, transfer learning, and semisupervised learning. Unsupervised learning algorithms learn solely from unlabeled inputs x, which are often more abundantly available than labeled examples. Unsupervised learning algorithms typically produce generative models, which can produce realistic text, images, audio, and video, rather than simply predicting labels for such data.

In this section, we introduce the idea of pretraining: a form of transfer learning (see Section 22.7.2) in which we use a large amount of shared general-domain language data to train an initial version of an NLP model. From there, we can use a smaller amount of domain-specific data (perhaps including some labeled data) to refine the model. The refined model can learn the vocabulary, idioms, syntactic structures, and other linguistic phenomena that are specific to the new domain. 25.5.1 Pretrained word embeddings In Section 25.1, we briefly introduced word embeddings. We saw how similar words like banana and apple end up with similar vectors, and we saw that we can solve analogy problems with vector subtraction.
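Both properties of word embeddings mentioned here, similar words having similar vectors and analogies being solvable by vector arithmetic, can be illustrated with a toy example. The three-dimensional vectors below are hand-made for the illustration and were not learned from text; real pretrained embeddings have hundreds of dimensions and only approximately satisfy such relations.

```python
import math

# Hand-built toy vectors (an assumption, not real embeddings).
vectors = {
    "Athens": (1.0, 0.0, 1.0),
    "Greece": (1.0, 0.0, 0.0),
    "Oslo":   (0.0, 1.0, 1.0),
    "Norway": (0.0, 1.0, 0.0),
    "banana": (0.2, 0.1, -1.0),
    "apple":  (0.1, 0.2, -1.0),
}


def cosine(u, v):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' via the vector b - a + c."""
    target = tuple(vb - va + vc
                   for va, vb, vc in zip(vectors[a], vectors[b], vectors[c]))
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


fruit_similarity = cosine(vectors["banana"], vectors["apple"])
answer = analogy("Athens", "Greece", "Oslo")
```

Excluding the three query words from the candidates is a common convention, since the target vector often lies closest to one of the inputs.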

pages: 566 words: 122,184

Code: The Hidden Language of Computer Hardware and Software
by Charles Petzold
Published 28 Sep 1999

But instead of containing numbers to be added, it contains the codes that indicate what the automated adder is supposed to do with the corresponding address in the original RAM array. These two RAM arrays can be labeled Data (the original RAM array) and Code (the new one): We've already established that our new automated adder needs to be able to write sums into the original RAM array (labeled Data). But the new RAM array (labeled Code) will be written to solely through the control panel. We need four codes for the four actions we want the new automated adder to do. These codes can be anything we want to assign.
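The arrangement of two parallel RAM arrays can be simulated in a few lines. This is a rough sketch under stated assumptions: the four actions are taken to be load, add, store, and halt, the numeric codes are arbitrary (the excerpt notes they can be anything), and the adder is assumed to be 8 bits wide. None of these specifics come from the excerpt itself.

```python
# Arbitrary illustrative codes for the four assumed actions.
LOAD, STORE, ADD, HALT = 0x10, 0x11, 0x20, 0xFF


def run(code, data):
    """Step through both arrays with a single shared address counter."""
    data = list(data)          # work on a copy of the Data array
    accumulator = 0
    for address, op in enumerate(code):
        if op == LOAD:         # copy Data[address] into the accumulator
            accumulator = data[address]
        elif op == ADD:        # add Data[address]; wrap like an 8-bit adder
            accumulator = (accumulator + data[address]) & 0xFF
        elif op == STORE:      # write the running sum back into Data
            data[address] = accumulator
        elif op == HALT:
            break
        else:
            raise ValueError(f"unknown code {op:#04x} at address {address}")
    return data


code = [LOAD, ADD, ADD, STORE, HALT]        # the Code array
data = [0x27, 0x1E, 0x46, 0x00, 0x00]       # the Data array
result = run(code, data)
```

Each entry in the Code array says what to do with the entry at the same address in the Data array, so the sum of the first three values ends up at address 3.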

AI 2041: Ten Visions for Our Future
by Kai-Fu Lee and Qiufan Chen
Published 13 Sep 2021

SUPERVISED NLP A few years ago, virtually all deep learning–based NLP neural networks learned language using the standard “supervised learning” discussed earlier. “Supervised” implies that when AI learns, it would need to be provided with the right answer for each training input. (Note this “supervision” does not imply the human would “program” rules into the AI; as established in chapter 1, that does not work.) AI would receive pairs of labeled data—the input and the “correct” output, and then the AI would learn to produce the output that corresponds to a given input. Remember the example of AI recognizing the cat image? Supervised deep learning is the training process in which AI learns to produce the word “cat.” When it comes to natural language, we can apply supervised learning by finding data that has been labeled for human purposes.

And not only “present” in the data, but also “labeled” by a human being to give sufficient clues for the AI training. Data labeling for supervised training of language-understanding systems has been a large industry for twenty years now. As an example, in an automated airline customer service system, labeled data for training language understanding looks something like this: [BOOK_FLIGHT_INTENT] I want to [METHOD: fly] from [ORIGIN: Boston] at [DEP_TIME: 838 am] and arrive in [DEST: Denver] at [ARR_TIME: 1110 in the morning] That’s a very basic example. You can imagine the overhead involved in marking up hundreds of thousands of utterances at this level of detail.
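An annotated utterance in this bracket style can be turned into the intent, the slot values, and the plain text a model would train on. The sketch below is only an illustration: it assumes balanced brackets and upper-case tag names joined by underscores, and the `parse` function and its regular expressions are hypothetical rather than any real annotation tool's format.

```python
import re

INTENT = re.compile(r"^\s*\[([A-Z_]+)\]")          # leading [INTENT_NAME]
SLOT = re.compile(r"\[([A-Z_]+):\s*([^\]]+)\]")    # inline [SLOT: value]


def parse(annotated):
    """Split an annotated utterance into (intent, slots, plain text)."""
    match = INTENT.match(annotated)
    intent = match.group(1) if match else None
    slots = {name: value.strip() for name, value in SLOT.findall(annotated)}
    # Recover the raw utterance: drop the intent tag, keep only slot values.
    text = INTENT.sub("", annotated)
    text = SLOT.sub(lambda m: m.group(2).strip(), text).strip()
    return intent, slots, text


example = ("[BOOK_FLIGHT_INTENT] I want to [METHOD: fly] from [ORIGIN: Boston] "
           "at [DEP_TIME: 838 am] and arrive in [DEST: Denver] "
           "at [ARR_TIME: 1110 in the morning]")
intent, slots, text = parse(example)
```

The raw text is the model's input, while the intent and slot dictionary are the labels a human annotator had to supply, which is where the markup overhead comes from.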

pages: 523 words: 61,179

Human + Machine: Reimagining Work in the Age of AI
by Paul R. Daugherty and H. James Wilson
Published 15 Jan 2018

Thanks to the explosion of available data for training these algorithms, ML is now used in fields as diverse and sprawling as vision-based research, fraud detection, price prediction, natural language processing, and more. Supervised learning. A type of ML in which an algorithm is presented with preclassified and sorted data (known in the field as “labeled data”) consisting of example inputs and desired outputs. The goal for the algorithm is to learn the general rules that connect the inputs to the outputs and use those rules to predict future events with input data alone. FIGURE 2-1 The constellation of AI technologies and business applications Unsupervised learning.

Work in the Future The Automation Revolution-Palgrave MacMillan (2019)
by Robert Skidelsky Nan Craig
Published 15 Mar 2020

Now we are seeing machine learning, where we have all these capabilities, where machines can do better than humans on certain tasks. They can read lips or read x-rays better than humans. Machine learning systems tend to be exceptionally good at individual tasks, typically tasks where there are large amounts of labelled data. They do not have common sense or meaning in the sense that we attach meaning to things. Machine learning has been around since the 1980s and has been enhanced by improvements in software, but also because we have much greater access to large pools of data and much bigger hardware to throw at it.

pages: 430 words: 68,225

Blockchain Basics: A Non-Technical Introduction in 25 Steps
by Daniel Drescher
Published 16 Mar 2017

Such a structure is useful for storing and linking data together that are not fully available at one given point in time but instead arrive step by step in an ongoing fashion. Figure 11-4 illustrates this idea by using the symbols introduced above. The creation of such a chain starts with the piece of data labeled Data 1 and the creation of the hash reference R1. Being the first piece of data, Data 1 does not contain any hash reference. When new data arrive, they are put together with the hash reference that points to Data 1. The hash reference R2 refers to the newly arrived data and the hash reference R1.
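The chain of hash references can be sketched as follows. This is a minimal illustration that assumes SHA-256 as the hash function and a simple dictionary per piece of data; the function names are hypothetical and the excerpt does not prescribe either choice.

```python
import hashlib


def hash_reference(payload, previous_ref):
    """Hash a piece of data together with the reference to its predecessor."""
    digest = hashlib.sha256()
    digest.update((previous_ref or "").encode("utf-8"))  # empty for the first item
    digest.update(b"\x00")                               # separator between fields
    digest.update(payload.encode("utf-8"))
    return digest.hexdigest()


def build_chain(items):
    """Link items in arrival order; each stores the reference to the one before."""
    chain, previous_ref = [], None
    for payload in items:
        ref = hash_reference(payload, previous_ref)
        chain.append({"data": payload, "prev_ref": previous_ref, "ref": ref})
        previous_ref = ref
    return chain


def is_intact(chain):
    """Recompute every reference and check that each link still matches."""
    previous_ref = None
    for block in chain:
        if block["prev_ref"] != previous_ref:
            return False
        if hash_reference(block["data"], previous_ref) != block["ref"]:
            return False
        previous_ref = block["ref"]
    return True


chain = build_chain(["Data 1", "Data 2", "Data 3"])
```

Here the first entry carries no reference, the second stores R1, and R2 covers both the new data and R1, so altering an earlier item invalidates every later reference.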

Hands-On Machine Learning With Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
by Aurelien Geron
Published 14 Aug 2019

Running an association rule on your sales logs may reveal that people who purchase barbecue sauce and potato chips also tend to buy steak. Thus, you may want to place these items close to each other. Semisupervised learning Some algorithms can deal with partially labeled training data, usually a lot of unlabeled data and a little bit of labeled data. This is called semisupervised learning (Figure 1-11). Some photo-hosting services, such as Google Photos, are good examples of this. Once you upload all your family photos to the service, it automatically recognizes that the same person A shows up in photos 1, 5, and 11, while another person B shows up in photos 2, 5, and 7.

pages: 721 words: 197,134

Data Mining: Concepts, Models, Methods, and Algorithms
by Mehmed Kantardzić
Published 2 Jan 2003

From the original paper, “19 employers and 6 doctors were implicated with 152 medical claims.” The labels of the larger data set were revealed to be not sufficiently accurate for data mining. Contradictory data points were found. A lack of standards in recording these medical claims, with a large number of missing values contributed to the poorly labeled data set. Instead of the larger 500,000 point data set, the authors were “forced” to rebuild a subset of these data. This required manual labeling of the subset. The manual labeling would require a much smaller set of data points to be used from the original 500,000. To cope with a smaller set of data points, the problem was split into four smaller problems, namely identifying fraudulent medical claims, affiliates, medical professionals, and employers.

Webb, Supervised Descriptive Rule Discovery: A Unifying Survey of Contrast Set, Emerging Pattern and Subgroup Mining, Journal of Machine Learning Research, Vol. 10, 2009, p. 377–403. This paper gives a survey of contrast set mining (CSM), emerging pattern mining (EPM), and subgroup discovery (SD) in a unifying framework named supervised descriptive rule discovery. While all these research areas aim at discovering patterns in the form of rules induced from labeled data, they use different terminology and task definitions, claim to have different goals, claim to use different rule learning heuristics, and use different means for selecting subsets of induced patterns. This paper contributes a novel understanding of these subareas of data mining by presenting a unified terminology, by explaining the apparent differences between the learning tasks as variants of a unique supervised descriptive rule discovery task and by exploring the apparent differences between the approaches.

Mining of Massive Datasets
by Jure Leskovec , Anand Rajaraman and Jeffrey David Ullman
Published 13 Nov 2014

Creating a Training Set It is reasonable to ask where the label information that turns data into a training set comes from. The obvious method is to create the labels by hand, having an expert look at each feature vector and classify it properly. Recently, crowdsourcing techniques have been used to label data. For example, in many applications it is possible to use Mechanical Turk to label data. Since the “Turkers” are not necessarily reliable, it is wise to use a system that allows the question to be asked of several different people, until a clear majority is in favor of one label. One often can find data on the Web that is implicitly labeled.
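Asking several people until a clear majority emerges can be sketched as a small loop. The thresholds below (at least three votes, at most nine, a 70 percent share) and the `ask_worker` callback are assumptions made for the illustration; they are not values from the book or parameters of any crowdsourcing platform's API.

```python
from collections import Counter


def label_by_majority(ask_worker, min_votes=3, max_votes=9, min_share=0.7):
    """Keep asking workers until one label holds a clear majority.

    ask_worker is any zero-argument callable returning one worker's label.
    Returns (label, votes), with label None if no clear majority emerges.
    """
    votes = Counter()
    for asked in range(1, max_votes + 1):
        votes[ask_worker()] += 1
        if asked >= min_votes:
            label, count = votes.most_common(1)[0]
            if count / asked >= min_share:
                return label, dict(votes)
    return None, dict(votes)


# Simulated workers answering one question in turn.
answers = iter(["cat", "dog", "cat", "cat"])
label, votes = label_by_majority(lambda: next(answers))
```

With these answers the first three votes split two to one, which falls short of the threshold, so a fourth worker is asked before the label is accepted.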

pages: 300 words: 79,315

Getting Things Done: The Art of Stress-Free Productivity
by David Allen
Published 31 Dec 2002

This arrangement can cause a person’s mind to go numb to the stack because of all the decisions that are still pending about the next-action level of doing. My personal system is highly portable, with almost everything kept on lists, but I still maintain two categories of paper-based reminders. I travel with a “Read/Review” plastic file folder and another one labeled “Data Entry.” In the latter I put anything for which the next action is simply to input data into my computer (business cards that need to get into my telephone/address list, quotes for my “Quotes” database, articles about restaurants I want to put on my “Travel—Cities” sublists, etc.). Managing E-mail-Based Workflow Like some paper-based materials, e-mails that need action are sometimes best as their own reminders—in this case within the tracked e-mail system itself.

pages: 301 words: 85,126

AIQ: How People and Machines Are Smarter Together
by Nick Polson and James Scott
Published 14 May 2018

The trouble was a small but persistent bias in the underlying polls, which underestimated support for Donald Trump. Many algorithms in AI suffer from a similar problem: bias in, bias out. There’s a classic parable here about a neural-network model the U.S. Army once built to detect tanks that were partially hidden on the edge of a forest.24 Army scientists trained their model using a labeled data set of photographs, some with tanks and some without. The neural network turned out to have surprisingly high accuracy. It even did well when the army held out some of the original training data and used it exclusively to test the performance of the model. (This practice of validating results using notionally “out-of-sample” data is standard in AI.)

pages: 308 words: 85,850

Cloudmoney: Cash, Cards, Crypto, and the War for Our Wallets
by Brett Scott
Published 4 Jul 2022

In the second I would be directed to a website put up by the owners of the chain – the employees would just shrug their shoulders. The first was a down-to-earth diner where anyone would feel comfortable, whereas the second establishment was run for those who want to be on the up – professionals with laptops who gather to have meetings about start-up strategies, events organisation or a new fashion label. Data from several studies shows that cash usage is lowest among those with higher incomes and education, but you don’t need to be a social scientist to see the obvious class divide in payments choices. Venture to any trendy café where the clientele have good credit ratings, and you will quickly see that digital payment thrives among social climbers who see themselves as sophisticated.

pages: 336 words: 93,672

The Future of the Brain: Essays by the World's Leading Neuroscientists
by Gary Marcus and Jeremy Freeman
Published 1 Nov 2014

In the same way that early explorers relied on maps, even though they were originally coarse and error prone, neuroscientists are beginning to build atlases that anchor different types of data and help complete the layers of the complex brain maps that are gradually emerging. Ensuring data is registered and accessible through such next-generation atlases will surely be an important way of organizing and analyzing brain data. Building atlases depends on two key aspects: using jointly agreed upon vocabularies or ontologies to label data and locating the data in a standardized atlas coordinate space. For example, neuroscientists today use different words to refer to the same brain area. For example, in the visual system, reticular nucleus of the thalamus, nucleus reticularis, and perigeniculate nucleus all refer to the same brain structure—making it difficult to find all data about this single brain region.

Driverless: Intelligent Cars and the Road Ahead
by Hod Lipson and Melba Kurman
Published 22 Sep 2016

By simply recognizing objects depicted in images, deep-learning software is finally unlocking the puzzle of artificial perception and enabling the development of robust mid-level control software. We’ll explain the inner workings of deep learning software in-depth in chapter 10. Here we will summarize it as a type of machine-learning software that uses artificial neural networks to recognize objects in streams of raw visual data. In the past, without an accurately labeled data feed, an occupancy grid was fairly toothless, a mere crude approximation of a few large physical objects in the nearby environment. Without knowing what objects lurked outside the car, the rest of the car’s software programs could not figure out how best to react to them, or to predict what these unidentified objects would do next.

The Myth of Artificial Intelligence: Why Computers Can't Think the Way We Do
by Erik J. Larson
Published 5 Apr 2021

It means that features useful for machine learning must always be in the data, and no clues can be provided by humans that can’t also be exploited by the machine “in the wild” when testing the system or after it is released for use. Feature extraction is performed in the first, training phase, and then again after a model has been trained, in what’s called the production phase. During the training phase, labeled data is provided to the learning algorithm as input. For example, if the objective is to recognize pictures of horses, the input is a photo with a horse in it, and the output is a label: HORSE. The machine learning system (“learner”) thus receives labeled or tagged pictures of horses as input-output pairs, and the learning task is to simulate the tagging of images so that only horse images receive the HORSE label.

pages: 336 words: 91,806

Code Dependent: Living in the Shadow of AI
by Madhumita Murgia
Published 20 Mar 2024

For work that dealt with imagery of human sacrifice, beheadings, hate speech and child abuse, Daniel and his colleagues at Sama were reportedly paid about $2.20 (£1.80) per hour.9 ‘Public companies going to poor countries, or employing poor people anywhere, under the guise of upliftment and economic empowerment, can still be exploitation,’ he said, in a phone conversation from his home. ‘These companies are only interested in profit and not in the lives of the people whom they destroy.’ Although his job was to make judgements on graphic or illegal material, rather than label data like Ian and Benja, his work was also used to train algorithms – every decision he made was teaching Facebook’s content-moderation AI systems how to distinguish between good and bad content on the platform. According to Daniel, Facebook had designed its moderation system to time workers per task.

pages: 299 words: 99,080

The Soul of a New Machine
by Tracy Kidder
Published 1 Jan 1981

Surely by 1980 such a record entitled Data General to respectability. But some trade journalists still looked askance at the company; one told me Data General was widely known among his colleagues as "the Darth Vader of the computer industry." Investors still seemed jittery about Data General's stock. An article published in Fortune in 1979 had labeled Data General "the upstarts," while calling DEC "the gentlemen." The memory of that article, particularly the part that made it sound as if Data General routinely cheated its customers, still rankled Herb Richman. Building 14A/B is essentially divided into an upstairs and a downstairs, and in one corner of the upstairs the corporate officers reside.

pages: 443 words: 98,113

The Corruption of Capitalism: Why Rentiers Thrive and Work Does Not Pay
by Guy Standing
Published 13 Jul 2016

As Michael Bernstein, who leads a Stanford University project on crowd sourcing, put it, ‘AMT is notoriously bad at ensuring high quality results, producing respect and fair wages for workers, and making it easy to author effective tasks.’27 Requesters and broker platforms often seem to forget that the tasks are being done by ‘real people’ with real needs and emotions. Lukas Biewald, CEO of CrowdFlower, which specialises in collecting, cleaning and labelling data, revealed his true attitude in a moment of hubris when telling an audience, Before the internet, it would be really difficult to find someone, sit them down for ten minutes and get them to work for you, and then fire them after those ten minutes. But with technology you can actually find them, pay them the tiny amount of money and then get rid of them when you don’t want them any more.28 Even though a majority of taskers have been in rich countries so far, reflecting access to reliable internet connections, cloud labour is globalising.

pages: 340 words: 97,723

The Big Nine: How the Tech Titans and Their Thinking Machines Could Warp Humanity
by Amy Webb
Published 5 Mar 2019

This is public knowledge. The challenge is that improving the data and learning models is a big financial liability. For example, one corpus with serious problems is ImageNet, which I’ve made reference to several times in this book. ImageNet contains 14 million labeled images, and roughly half of that labeled data comes solely from the United States. Here in the US, a “traditional” image of a bride is a woman wearing a white dress and a veil, though in reality that image doesn’t come close to representing most people on their wedding days. There are women who get married in pantsuits, women who get married on the beach wearing colorful summery dresses, and women who get married wearing kimono and saris.

pages: 337 words: 103,522

The Creativity Code: How AI Is Learning to Write, Paint and Think
by Marcus Du Sautoy
Published 7 Mar 2019

You may wonder why, if this is the case, you are still being asked to identify bits of images when you want to buy tickets to the latest gig to prove you are human. What you are actually doing is helping to prepare the training data that will then be fed to the algorithms so that they can try to learn to do what you do so effortlessly. Algorithms need labelled data to learn from. What we are really doing is training the algorithms in visual recognition. This training data is used to learn the best sorts of questions to ask to distinguish cats from non-cats. Every time it gets it wrong, the algorithm is altered so that the next time it will get it right.

pages: 392 words: 108,745

Talk to Me: How Voice Computing Will Transform the Way We Live, Work, and Think
by James Vlahos
Published 1 Mar 2019

First, Microsoft had people manually review utterances in the training conversations and tag them according to the predominant emotion that each expressed. They used psychologist Paul Ekman’s classic model of six basic emotions—anger, disgust, fear, happiness, sadness, surprise. Engineers then trained XiaoIce using this labeled data so her neural networks could learn to spot sentiments in unlabeled statements in the future. XiaoIce’s acuity is far from that of a real person’s. But when she perceives sentiment correctly, the experience of chatting with her becomes compelling. If you tell a conventional virtual assistant, “I don’t feel so good today,” you might get a response along the lines of “Here’s what I found on the web for ‘I don’t feel so good today.’”

pages: 390 words: 109,519

Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media
by Tarleton Gillespie
Published 25 Jun 2018

Machine learning depends on starting with a known database, a “gold standard” collection of examples, an agreed upon “ground truth” that becomes the basis upon which an algorithm can be expected to learn distinctive features. In computer science, this has either meant that researchers (or their underpaid undergrad research assistants) manually labeled data themselves, paid crowdworkers through sites like Amazon Mechanical Turk to do it, or have been in a position to get data from social media platforms willing to share it. Platforms designing their own algorithms can use the corpus of data they have already moderated. This raises two further problems.

pages: 283 words: 102,484

Everything Is Predictable: How Bayesian Statistics Explain Our World
by Tom Chivers
Published 6 May 2024

Now you can ask your smartphone to search your camera roll for pictures of dogs, or of babies, or beaches, or whatever, and it will bring them all up in a fraction of a second.) At a very abstracted level, here’s what it does: You give it however many thousands or millions of pictures of rats, dogs, and lions, each labeled as rat, dog, or lion, to train on (its “labeled data”). It sloshes them around in its circuits in some fashion, and then, once it’s done that, you give it new pictures to identify (its “test data”). It will then label each of those pictures as rat, dog, or lion, according to its best guess. This model of AI is called “supervised learning.” What it’s doing is predicting what the humans who labeled the training data would label the test data.

pages: 982 words: 221,145

Ajax: The Definitive Guide
by Anthony T. Holdener
Published 25 Jan 2008

The following is an example of what would be returned to match the default values from Example 11-7:

    [
        'Deine pers&#246;nlichen Informationen eintragen.',
        'Familienname: ',
        'Vorname: ',
        'Mittlere Initiale: ',
        'Adresse: ',
        'Stadt: ',
        'Zustand: ',
        'Rei&#223;verschluss-Code: ',
        'Telefonnummer: ',
        'E-mail: '
    ]

Switching Out the Data

So, we have the data we need in the xhrResponse.responseText from the server. Now what? It is a simple matter of replacing the original array with the new JSON that was sent and rerunning the function that creates the form in the first place. Example 11-9 shows the JavaScript necessary to perform such an action.

Example 11-9. Switching out the label data

    /* Example 11-9. Switching out the label data. */

    /**
     * This function, reloadForm, takes the XMLHttpRequest JSON server response
     * /xhrResponse/ from the server and sets it equal to the global <label> element
     * array /arrLabels/. It then calls /loadForm/ which re-creates the form with the
     * new data.
     *
     * @param {Object} xhrResponse The XMLHttpRequest JSON server response.
     */
    function reloadForm(xhrResponse) {
        /* /eval/ the JSON response to create an array to replace the old one */
        /* *** You should always validate data before executing the eval *** */
        arrLabels = eval(xhrResponse.responseText);
        /* Load the form with the global /arrLabels/ array */
        loadForm( );
    }

After this code is executed, the form will look like Figure 11-11.

pages: 396 words: 117,149

The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World
by Pedro Domingos
Published 21 Sep 2015

This rule set looks like it’s 100 percent accurate, but that’s an illusion: it will predict that every new example is negative, and therefore get every positive one wrong. If there are more positive than negative examples overall, this will be even worse than flipping coins. Imagine a spam filter that decides an e-mail is spam only if it’s an exact copy of a previously labeled spam message. It’s easy to learn and looks great on the labeled data, but you might as well have no spam filter at all. Unfortunately, our “divide and conquer” algorithm could easily learn a rule set like that. In his story “Funes the Memorious,” Jorge Luis Borges tells of meeting a youth with perfect memory. This might at first seem like a great fortune, but it is in fact an awful curse.
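The memorizing spam filter described above is easy to reproduce. The sketch below is only an illustration: the function names and the toy e-mails are invented for the example, and "learning" is nothing more than storing the exact text of each labeled spam message.

```python
def train_memorizer(labeled_emails):
    """'Learn' by storing the exact text of every labeled spam message."""
    return {text for text, is_spam in labeled_emails if is_spam}


def predict(spam_seen, text):
    """Flag an e-mail as spam only if it is an exact copy of known spam."""
    return text in spam_seen


def accuracy(spam_seen, labeled_emails):
    """Fraction of e-mails whose predicted label matches the given label."""
    hits = sum(predict(spam_seen, text) == is_spam for text, is_spam in labeled_emails)
    return hits / len(labeled_emails)


# Invented toy data: (text, is_spam) pairs.
train = [
    ("win a free prize now", True),
    ("cheap pills online", True),
    ("lunch at noon?", False),
    ("meeting notes attached", False),
]
new = [
    ("win a FREE prize today", True),
    ("cheap pills, best price", True),
    ("see you at lunch", False),
]

model = train_memorizer(train)
train_acc = accuracy(model, train)  # perfect on the data it memorized
new_acc = accuracy(model, new)      # every new spam message slips through
print(train_acc, new_acc)
```

On the labeled data the filter scores 100 percent, yet on unseen mail it labels everything as non-spam, which is exactly the illusion the passage warns about.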

pages: 447 words: 111,991

Exponential: How Accelerating Technology Is Leaving Us Behind and What to Do About It
by Azeem Azhar
Published 6 Sep 2021

The site took the form of a meticulously detailed collection of 14,197,122 images, all hand-annotated with tags like ‘vegetable’, ‘musical instrument’, ‘sport’ and, yes, ‘dog’ and ‘cat’. This dataset was used as the basis for an annual competition to find the algorithm that could most consistently and accurately identify objects. Thanks to ImageNet, good-quality labelled data was suddenly in high supply. Alongside the profusion of data came an explosion in computing power. By 2010, Moore’s Law had resulted in enough power to facilitate a new kind of machine learning, ‘deep learning’, which involved creating layers of artificial neurons modelled on the cells that underpin human brains.

pages: 523 words: 112,185

Doing Data Science: Straight Talk From the Frontline
by Cathy O'Neil and Rachel Schutt
Published 8 Oct 2013

Think about it this way: if the word “Viagra” appears, this adds to the probability that the email is spam. But it’s not conclusive, yet. We need to see what else is in the email. Let’s first focus on just one word at a time, which we generically call “word.” Then, applying Bayes’ Law, we have: p(spam|word) = p(word|spam) p(spam) / p(word). The righthand side of this equation is computable using enough pre-labeled data. If we refer to nonspam as “ham” then we only need compute p(word|spam), p(word|ham), p(spam), and p(ham) = 1-p(spam), because we can work out the denominator using the formula we used earlier in our medical test example, namely: p(word) = p(word|spam) p(spam) + p(word|ham) p(ham). In other words, we’ve boiled it down to a counting exercise: p(spam) counts spam emails versus all emails, p(word|spam) counts the prevalence of those spam emails that contain “word,” and p(word|ham) counts the prevalence of the ham emails that contain “word.”
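The counting exercise can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the function name, the whitespace tokenization, and the toy e-mails are invented here, and the code assumes both spam and ham examples are present. It estimates p(spam|word) as p(word|spam) p(spam) divided by p(word|spam) p(spam) + p(word|ham) p(ham).

```python
def spam_given_word(emails, word):
    """Estimate p(spam|word) from (text, is_spam) pairs by simple counting."""
    spam = [text for text, is_spam in emails if is_spam]
    ham = [text for text, is_spam in emails if not is_spam]
    if not spam or not ham:
        raise ValueError("need at least one spam and one ham example")

    p_spam = len(spam) / len(emails)
    p_ham = 1 - p_spam
    # Prevalence of the word within each class (naive whitespace tokens).
    p_word_spam = sum(word in text.lower().split() for text in spam) / len(spam)
    p_word_ham = sum(word in text.lower().split() for text in ham) / len(ham)

    # Denominator via the law of total probability.
    p_word = p_word_spam * p_spam + p_word_ham * p_ham
    if p_word == 0:
        return None  # the word never appears, so the estimate is undefined
    return p_word_spam * p_spam / p_word


# Invented toy data: 3 spam and 5 ham e-mails.
emails = [
    ("buy viagra now", True),
    ("viagra discount", True),
    ("cheap watches", True),
    ("viagra study results attached", False),
    ("lunch tomorrow", False),
    ("project update", False),
    ("see you soon", False),
    ("notes from call", False),
]
estimate = spam_given_word(emails, "viagra")
print(round(estimate, 3))  # 0.667
```

Three of the toy e-mails contain the word and two of them are spam, so the estimate of two thirds agrees with a direct count, which is a useful sanity check on the formula.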

Succeeding With AI: How to Make AI Work for Your Business
by Veljko Krunic
Published 29 Mar 2020

Answer to question 3: For a research question, this is a free-form exercise. Finding a dataset for your research question clearly depends on the problem you’re trying to address, but it’s not uncommon that the answer to the question “Is it possible to acquire a dataset?” is no. Often, obtaining labeled data is the real obstacle to an application of AI. Also, if during this hypothetical conversation, neither of you thought about which data science/AI/ML methods you could use on the dataset, chances are, you might have missed some of the needed data. Remember that needed data and its quantity depend on the AI methods you use (and vice versa).

pages: 370 words: 112,809

The Equality Machine: Harnessing Digital Technology for a Brighter, More Inclusive Future
by Orly Lobel
Published 17 Oct 2022

And like Barzilay, the MacArthur Foundation awarded Katabi a “genius” grant, citing her “ability to translate long-recognized theoretical advances into practical solutions that could be deployed in the real world.”26 Among Katabi’s other groundbreaking current projects is a radio-frequency system that monitors sleep postures called BodyCompass. BodyCompass tracks radio-frequency reflections in the environment, identifies the signals that bounced off the sleeping person’s body, and analyzes those signals via a machine learning algorithm. Katabi and her collaborators found that with just sixteen minutes of labeled data from the sleeping person, BodyCompass’s accuracy is 84 percent; within one week, its accuracy went up to 94 percent. Monitoring sleep posture is important for many health contexts, including monitoring patients after surgery, tracking progression of diseases including Parkinson’s, and more. The potential of AI extends to mental health as well.

pages: 381 words: 113,173

The Geek Way: The Radical Mindset That Drives Extraordinary Results
by Andrew McAfee
Published 14 Nov 2023

For analyzing Glassdoor data, the training data could be a set of employee reviews that have been labeled by people along two dimensions: the aspect of culture that is being discussed, and the employee’s feeling about that aspect (for example, “This is a highly favorable review about agility at the company”). After the ML software is trained with enough labeled data, it can go through all the Glassdoor reviews for a large group of companies and classify them automatically. This process yields a consistent analysis of culture across companies, and so allows comparisons and rankings. Business researchers Don and Charlie Sull have done this for more than five hundred companies, most of them based in the US, that have enough Glassdoor reviews to enable meaningful analysis.

pages: 1,064 words: 114,771

Tcl/Tk in a Nutshell
by Paul Raines and Jeff Tranter
Published 25 Mar 1999

If not specified, the width will be whatever the window requests.

-window pathName (window, Window)
Pathname of window to use for the marker. The window must be a descendant of the chart widget.

Example

    set x {0.0 1.0 2.0 3.0 4.0 5.0 6.0}
    set y {0.0 0.1 2.3 4.5 1.2 5.4 9.6}
    graph .g -title "Example Graph"
    .g element create x -label "Data Points" -xdata $x -ydata $y
    pack .g

[3] The format of this command may change for the final Version 2.4 to require a specific printer ID.

Name: hierbox

hierbox pathName [option value...]

The hierbox command creates a new hierbox widget named pathName. A hierbox widget displays a hierarchy tree of entries for navigation and selection.

pages: 481 words: 125,946

What to Think About Machines That Think: Today's Leading Thinkers on the Age of Machine Intelligence
by John Brockman
Published 5 Oct 2015

That has come from the steady Moore’s Law doubling of circuit density every two years or so, not from any fundamentally new algorithms. That exponential rise in crunch power lets ordinary-looking computers tackle tougher problems of Big Data and pattern recognition. Consider the most popular algorithms in Big Data and machine learning. One algorithm is unsupervised (requires no teacher to label data). The other is supervised (requires a teacher). They account for a great deal of applied AI. The unsupervised algorithm is called k-means clustering, arguably the most popular algorithm for working with Big Data. It clusters like with like and underlies Google News. Start with a million data points.

pages: 424 words: 123,180

Democracy's Data: The Hidden Stories in the U.S. Census and How to Read Them
by Dan Bouk
Published 22 Aug 2022

It focuses on just the numbers at the expense of the processes and systems and people behind them. The fact is that every data set has a doorstep, a place where plans and dreams of order meet the throbbing tumult of experience, and from such encounters, via eruptions of ingenuity, we get these strange texts that bear the label “data.” They are texts with power in the world, texts that indeed should be treated carefully, debated in public, and sometimes regulated by governments. They are texts, too, that we can use to better understand ourselves, our political systems, and our societies. We just have to learn how to read them.

Producing Open Source Software: How to Run a Successful Free Software Project
by Karl Fogel
Published 13 Oct 2005

Because the Internet is not really a room, we don't have to worry about replicating those parts of parliamentary procedure that keep some people quiet while others are speaking. But when it comes to information management techniques, well-run open source projects are parliamentary procedure on steroids. Since almost all communication in open source projects happens in writing, elaborate systems have evolved for routing and labeling data appropriately, for minimizing repetitions so as to avoid spurious divergences, for storing and retrieving data, for correcting bad or obsolete information, and for associating disparate bits of information with each other as new connections are observed. Active participants in open source projects internalize many of these techniques, and will often perform complex manual tasks to ensure that information is routed correctly.

pages: 470 words: 128,328

Reality Is Broken: Why Games Make Us Better and How They Can Change the World
by Jane McGonigal
Published 20 Jan 2011

Others were mathematical errors or inconsistencies suggesting individuals were reimbursed more than they were owed. As one player noted, “Bad math on page 29 of an invoice from MP Denis MacShane, who claimed £1,730 worth of reimbursement, when the sum of those items listed was only £1,480.” But perhaps most importantly, the website also featured a section labeled “Data: What we’ve learned from your work so far.” This page put the individual players’ efforts into a much bigger context—and guaranteed that contributors would see the real results of their efforts. Some of the key results of the game included these findings:
• On average, each MP expensed twice his or her annual salary, or more than £140,000 in expenses on top of a £60,675 salary

Data Wrangling With Python: Tips and Tools to Make Your Life Easier
by Jacqueline Kazil
Published 4 Feb 2016

Here’s how we’d do that:

    import matplotlib.pyplot as plt

    plt.plot(africa_cpi_cl.columns['CPI 2013 Score'],
             africa_cpi_cl.columns['Total (%)'])
    plt.xlabel('CPI Score - 2013')
    plt.ylabel('Child Labor Percentage')
    plt.title('CPI & Child Labor Correlation')
    plt.show()

Uses pylab’s plot method to pass the x and y label data. The first variable passed is the x-axis and the second variable is the y-axis. This creates a Python chart plotting those two datasets.
Calls the xlabel and ylabel methods to label our chart axes.
Calls the title method to title our chart.
Calls the show method to draw the chart.

pages: 467 words: 149,632

If Then: How Simulmatics Corporation Invented the Future
by Jill Lepore
Published 14 Sep 2020

“Don’t be evil,” the motto of Google, marked the limit of a swaggering, devil-may-care ethical ambition; doing good did not come into it.7 Incubated decades before, beneath a honeycombed, geodesic dome in Wading River, this work found a place, too, in universities. In the 2010s, a flood of money into universities attempted to make the study of data a science, with data science initiatives, data science programs, data science degrees, data science centers.8 Much academic research that fell under the label “data science” produced excellent and invaluable work, across many fields of inquiry, findings that would not have been possible without computational discovery.9 And no field should be judged by its worst practitioners. Still, the shadiest data science, like the shadiest behavioral science, grew in influence by way of self-mystification, exaggerated claims, and all-around chicanery, including fast-changing, razzle-dazzle buzzwords, from “big data” to “data analytics.”

Evidence-Based Technical Analysis: Applying the Scientific Method and Statistical Inference to Trading Signals
by David Aronson
Published 1 Nov 2006

Figure 6.35 shows the relationship between the data-mining bias on the vertical axis versus the number of observations used to compute ATR mean returns on the horizontal axis. Because the expected return for all ATRs is equal to zero, the data-mining bias is equal to the average observed performance of the best. Thus, the vertical axis, which is labeled data-mining bias, could as easily have been labeled average performance of the best rule. Note the steep decline in the magnitude of
[FIGURE 6.35 Data-mining bias versus number observations. Vertical axis: data-mining bias, % per year; horizontal axis: number of months used to compute mean ATR return, 1 to 1000; curves: observed performance of best-of-100 ATRs and best-of-10 ATRs.]

pages: 704 words: 182,312

This Is Service Design Doing: Applying Service Design Thinking in the Real World: A Practitioners' Handbook
by Marc Stickdorn , Markus Edgar Hormess , Adam Lawrence and Jakob Schneider
Published 12 Jan 2018

One way of supporting less-knowledgeable clients could be by assigning them a clear role, such as a supporting interviewer or observer, while the members of the key project team lead the fieldwork. Indexing During your research, it is important to index your data so that you can trace insights back to the data sources they are based on. A simple way of indexing could be to label data with a short index, such as “i6.17” for interview 6, line 17, or “v12.3:22” for video 12, at minute 3:22. This allows you to later base your design decisions not only on insights you have generated, but on raw data. You might even be able to include the participants who reported a specific phenomenon in your prototyping of solutions to improve the original situation. 15 Data visualization, synthesis, and analysis There are many ways to synthesize and analyze data (also known as sensemaking) in design research.
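Index tags of the kind described above are easy to parse mechanically, which helps when tracing an insight back to its source. The sketch below is only one possible reading of the two examples given ("i6.17" and "v12.3:22"); the pattern, the function name, and the returned field names are assumptions made for illustration.

```python
import re

# Assumed tag shape: a kind letter, a source number, a dot, then either a
# line number (interviews) or a minute:second timestamp (videos).
INDEX_PATTERN = re.compile(
    r"^(?P<kind>[iv])(?P<source>\d+)\.(?P<position>\d+(?::\d{2})?)$"
)
KINDS = {"i": "interview", "v": "video"}


def parse_index(tag):
    """Split a short index tag into its kind, source number, and position."""
    match = INDEX_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unrecognized index tag: {tag!r}")
    return {
        "kind": KINDS[match["kind"]],
        "source": int(match["source"]),
        "position": match["position"],  # line number or minute:second
    }


interview_ref = parse_index("i6.17")
video_ref = parse_index("v12.3:22")
print(interview_ref)
print(video_ref)
```

Keeping the position as text avoids guessing whether a bare number is a line or a minute; a real project would fix its own tag grammar before relying on a parser like this.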

pages: 619 words: 177,548

Power and Progress: Our Thousand-Year Struggle Over Technology and Prosperity
by Daron Acemoglu and Simon Johnson
Published 15 May 2023

This new AI approach has already had three important implications. First, it has intertwined AI with the use of massive quantities of data. In the words of an AI scientist, Alberto Romero, who became disillusioned with the industry and left it in 2021, “If you work in AI you are most likely collecting data, cleaning data, labeling data, splitting data, training with data, evaluating with data. Data, data, data. All for a model to say: It’s a cat.” This focus on vast quantities of data is a fundamental consequence of the Turing-inspired emphasis on autonomy. Second, this approach has made modern AI appear highly scalable and transferable, and of course, in domains much more interesting and important than recognizing cats.

pages: 764 words: 261,694

The Elements of Statistical Learning (Springer Series in Statistics)
by Trevor Hastie , Robert Tibshirani and Jerome Friedman
Published 25 Aug 2009

Typically the initial centers are R randomly chosen observations from the training data. Details of the K-means procedure, as well as generalizations allowing for different variable types and more general distance measures, are given in Chapter 14. To use K-means clustering for classification of labeled data, the steps are: • apply K-means clustering to the training data in each class separately, using R prototypes per class; • assign a class label to each of the K × R prototypes; • classify a new feature x to the class of the closest prototype. Figure 13.1 (upper panel) shows a simulated example with three classes and two features.
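The three steps listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the book's code: the function names, the fixed iteration count, the seeded random initialization, and the toy data are all invented here, and the K-means routine is a basic Lloyd's iteration with Euclidean distance.

```python
import math
import random


def kmeans(points, r, iters=20, seed=0):
    """Basic Lloyd's algorithm; initial centers are r randomly chosen observations."""
    rng = random.Random(seed)
    centers = rng.sample(points, r)
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Move each center to its cluster mean; keep the old center if the cluster is empty.
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers


def fit_prototypes(X, y, r):
    """Run K-means within each class separately and tag each prototype with that class."""
    prototypes = []
    for label in sorted(set(y)):
        class_points = [x for x, lab in zip(X, y) if lab == label]
        prototypes += [(center, label) for center in kmeans(class_points, r)]
    return prototypes


def classify(prototypes, x):
    """Assign x to the class of the closest prototype."""
    return min(prototypes, key=lambda proto: math.dist(x, proto[0]))[1]


# Invented toy data: two well-separated classes with two features each.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
y = ["a"] * 4 + ["b"] * 4
prototypes = fit_prototypes(X, y, r=2)
print(classify(prototypes, (0.5, 0.5)), classify(prototypes, (5.5, 5.5)))  # a b
```

With two classes and R = 2 the model keeps four labeled prototypes in total, so prediction costs only four distance computations regardless of the training-set size.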