statistical model

description: type of mathematical model

349 results

Natural Language Processing with Python and spaCy
by Yuli Vasiliev
Published 2 Apr 2020

Statistical language modeling is vital to many natural language processing tasks, such as natural language generation and natural language understanding. For this reason, a statistical model lies at the heart of virtually any NLP application. Figure 1-4 provides a conceptual depiction of how an NLP application uses a statistical model. Figure 1-4: A high-level conceptual view of an NLP application’s architecture. The application interacts with spaCy’s API, which abstracts the underlying statistical model. The statistical model contains information like word vectors and linguistic annotations. The linguistic annotations might include features such as part-of-speech tags and syntactic annotations. The statistical model also includes a set of machine learning algorithms that can extract the necessary pieces of information from the stored data.
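
A minimal sketch of how an application reads those linguistic annotations out of a loaded spaCy model; it assumes the en_core_web_sm package has already been downloaded, and the sample sentence is made up:

```python
import spacy

# Load a pretrained statistical model package (must be downloaded beforehand,
# e.g. with: python -m spacy download en_core_web_sm).
nlp = spacy.load("en_core_web_sm")

doc = nlp("The application interacts with spaCy's API.")

# The model supplies linguistic annotations such as part-of-speech tags
# and syntactic dependency labels for each token.
for token in doc:
    print(token.text, token.pos_, token.dep_)

# Named entities are another annotation layer the model predicts.
for ent in doc.ents:
    print(ent.text, ent.label_)
```

Note that bundled static word vectors only ship with the larger packages (en_core_web_md, en_core_web_lg), so similarity queries generally need one of those models.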

If you decide to upgrade your installed spaCy package to the latest version, you can do this using the following pip command: $ pip install -U spacy. Installing Statistical Models for spaCy: The spaCy installation doesn’t include statistical models that you’ll need when you start using the library. The statistical models contain knowledge collected about the particular language from a set of sources. You must separately download and install each model you want to use. Several pretrained statistical models are available for different languages. For English, for example, the following models are available for download from spaCy’s website: en_core_web_sm, en_core_web_md, en_core_web_lg, and en_vectors_web_lg.
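
A hedged sketch of downloading and loading one of the listed English models from within Python rather than the shell; the model name used here is just one of the packages the excerpt mentions:

```python
import spacy
from spacy.cli import download

# Equivalent to running "python -m spacy download en_core_web_sm" in a shell.
download("en_core_web_sm")

# Once installed, the model package can be loaded by name.
nlp = spacy.load("en_core_web_sm")
print(nlp.meta["lang"], nlp.meta["name"], nlp.meta["version"])
```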

Figure 1-1 provides a high-level depiction of the model training stage. Figure 1-1: Generating a statistical model with a machine learning algorithm using a large volume of text data as input Your model processes large volumes of text data to understand which words share characteristics; then it creates word vectors for those words that reflect those shared characteristics. As you’ll learn in “What Is a Statistical Model in NLP?” on page 8, such a word vector space is not the only component of a statistical model built for NLP. The actual structure is typically more complicated, providing a way to extract linguistic features for each word depending on the context in which it appears.

pages: 50 words: 13,399

The Elements of Data Analytic Style
by Jeff Leek
Published 1 Mar 2015

As an example, suppose you are analyzing data to identify a relationship between geography and income in a city, but all the data from suburban neighborhoods are missing. 6. Statistical modeling and inference The central goal of statistical modeling is to use a small subsample of individuals to say something about a larger population. The reasons for taking this sample are often the cost or difficulty of measuring data on the whole population. The subsample is identified with probability (Figure 6.1). Figure 6.1 Probability is used to obtain a sample Statistical modeling and inference are used to try to generalize what we see in the sample to the population. Inference involves two separate steps, first obtaining a best estimate for what we expect in the population (Figure 6.2).

Before these steps, it is critical to tidy, check, and explore the data to identify dataset specific conditions that may violate your model assumptions. 6.12.4 Assuming the statistical model fit is good Once a statistical model is fit to data it is critical to evaluate how well the model describes the data. For example, with a linear regression analysis it is critical to plot the best fit line over the scatterplot of the original data, plot the residuals, and evaluate whether the estimates are reasonable. It is ok to fit only one statistical model to a data set to avoid data dredging, as long as you carefully report potential flaws with the model. 6.12.5 Drawing conclusions about the wrong population When you perform inference, the goal is to make a claim about the larger population you have sampled from.
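
A small illustration of the fit-checking step described above, on made-up data: plot the best-fit line over the scatterplot, then inspect the residuals against the fitted values.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 100)   # synthetic data with a known linear structure

slope, intercept = np.polyfit(x, y, 1)       # fit a simple linear regression
fitted = intercept + slope * x
residuals = y - fitted

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(x, y, s=10)
ax1.plot(np.sort(x), intercept + slope * np.sort(x), color="red")
ax1.set_title("Data with best-fit line")

ax2.scatter(fitted, residuals, s=10)
ax2.axhline(0, color="red")
ax2.set_title("Residuals vs fitted values")   # look for curvature or funnel shapes
plt.show()
```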

Histograms and boxplots are good ways to check that the measurements you observe fall on the right scale. 4.10 Common mistakes 4.10.1 Failing to check the data at all A common temptation in data analysis is to load the data and immediately leap to statistical modeling. Checking the data before analysis is a critical step in the process. 4.10.2 Encoding factors as quantitative numbers If a scale is qualitative, but the variable is encoded as 1, 2, 3, etc. then statistical modeling functions may interpret this variable as a quantitative variable and incorrectly order the values. 4.10.3 Not making sufficient plots A common mistake is to only make tabular summaries of the data when doing data checking.
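
A small pandas sketch of the factor-encoding pitfall above: a qualitative scale stored as 1/2/3 gets treated as a number unless it is explicitly converted to a categorical variable. The column names and labels are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"satisfaction_code": [1, 3, 2, 3, 1, 2]})

# Treated as quantitative, the codes acquire a misleading arithmetic meaning.
print(df["satisfaction_code"].mean())   # 2.0, but an "average code" is not a measurement

# Converting to an explicit categorical stops modeling functions from
# averaging or mis-ordering the codes as if they were quantities.
labels = {1: "low", 2: "medium", 3: "high"}
df["satisfaction"] = pd.Categorical(df["satisfaction_code"].map(labels),
                                    categories=["low", "medium", "high"],
                                    ordered=True)
print(df["satisfaction"].value_counts())
```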

pages: 442 words: 94,734

The Art of Statistics: Learning From Data
by David Spiegelhalter
Published 14 Oct 2019

Even with the Bradford Hill criteria outlined above, statisticians are generally reluctant to attribute causation unless there has been an experiment, although computer scientist Judea Pearl and others have made great progress in setting out the principles for building causal regression models from observational data.2 Table 5.2 Correlations between heights of adult children and parent of the same gender, and gradients of the regression of the offspring’s on the parent’s height: mothers and daughters, Pearson correlation 0.31, regression gradient 0.33; fathers and sons, Pearson correlation 0.39, regression gradient 0.45. Regression Lines Are Models: The regression line we fitted between fathers’ and sons’ heights is a very basic example of a statistical model. The US Federal Reserve define a model as a ‘representation of some aspect of the world which is based on simplifying assumptions’: essentially some phenomenon will be represented mathematically, generally embedded in computer software, in order to produce a simplified ‘pretend’ version of reality.3 Statistical models have two main components. First, a mathematical formula that expresses a deterministic, predictable component, for example the fitted straight line that enables us to make a prediction of a son’s height from his father’s.

We have data that can help us answer some of these questions, with which we have already done some exploratory plotting and drawn some informal conclusions about an appropriate statistical model. But we now come to a formal aspect of the Analysis part of the PPDAC cycle, generally known as hypothesis testing. What Is a ‘Hypothesis’? A hypothesis can be defined as a proposed explanation for a phenomenon. It is not the absolute truth, but a provisional, working assumption, perhaps best thought of as a potential suspect in a criminal case. When discussing regression in Chapter 5, we saw the claim that observation = deterministic model + residual error. This represents the idea that statistical models are mathematical representations of what we observe, which combine a deterministic component with a ‘stochastic’ component, the latter representing unpredictability or random ‘error’, generally expressed in terms of a probability distribution.

A two-sided test would be appropriate for a null hypothesis that a treatment effect, say, is exactly zero, and so both positive and negative estimates would lead to the null being rejected. one-tailed and two-tailed P-values: those corresponding to one-sided and two-sided tests. over-fitting: building a statistical model that is over-adapted to training data, so that its predictive ability starts to decline. parameters: the unknown quantities in a statistical model, generally denoted with Greek letters. Pearson correlation coefficient: for a set of n paired numbers, (x1, y1), (x2, y2) … (xn, yn), where x̄, sx are the sample mean and standard deviation of the xs, and ȳ, sy are the sample mean and standard deviation of the ys, the Pearson correlation coefficient is given by r = Σ(xi − x̄)(yi − ȳ)/(n sx sy). Suppose xs and ys have both been standardized to Z-scores given by us and vs respectively, so that ui = (xi − x̄)/sx, and vi = (yi − ȳ)/sy.
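
A numerical check of the glossary definition on made-up data: standardize the two variables to Z-scores and average the products, which reproduces numpy's built-in Pearson coefficient.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.6 * x + rng.normal(size=200)

# Standardize to Z-scores u, v using the (ddof=0) mean and standard deviation.
u = (x - x.mean()) / x.std()
v = (y - y.mean()) / y.std()

r_manual = np.mean(u * v)            # average product of the Z-scores
r_builtin = np.corrcoef(x, y)[0, 1]  # numpy's Pearson correlation
print(round(r_manual, 6), round(r_builtin, 6))   # the two agree
```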

pages: 227 words: 62,177

Numbers Rule Your World: The Hidden Influence of Probability and Statistics on Everything You Do
by Kaiser Fung
Published 25 Jan 2010

Figure C-1: Drawing a Line Between Natural and Doping Highs. Because the anti-doping laboratories face bad publicity for false positives (while false negatives are invisible unless the dopers confess), they calibrate the tests to minimize false accusations, which allows some athletes to get away with doping. The Virtue of Being Wrong: The subject matter of statistics is variability, and statistical models are tools that examine why things vary. A disease outbreak model links causes to effects to tell us why some people fall ill while others do not; a credit-scoring model identifies correlated traits to describe which borrowers are likely to default on their loans and which will not. These two examples represent two valid modes of statistical modeling. George Box is justly celebrated for his remark “All models are false but some are useful.” The mark of great statisticians is their confidence in the face of fallibility.

Highway engineers in Minnesota tell us why their favorite tactic to reduce congestion is a technology that forces commuters to wait more, while Disney engineers make the case that the most effective tool to reduce wait times does not actually reduce average wait times. Second, variability does not need to be explained by reasonable causes, despite our natural desire for a rational explanation of everything; statisticians are frequently just as happy to pore over patterns of correlation. In Chapter 2, we compare and contrast these two modes of statistical modeling by trailing disease detectives on the hunt for tainted spinach (causal models) and by prying open the black box that produces credit scores (correlational models). Surprisingly, these practitioners freely admit that their models are “wrong” in the sense that they do not perfectly describe the world around us; we explore how they justify what they do.

Their special talent is the educated guess, with emphasis on the adjective. The leaders of the pack are practical-minded people who rely on detailed observation, directed research, and data analysis. Their Achilles heel is the big I, when they let intuition lead them astray. This chapter celebrates two groups of statistical modelers who have made lasting, positive impacts on our lives. First, we meet the epidemiologists whose investigations explain the causes of disease. Later, we meet credit modelers who mark our fiscal reputation for banks, insurers, landlords, employers, and so on. By observing these scientists in action, we will learn how they have advanced the technical frontier and to what extent we can trust their handiwork. ~###~ In November 2006, the U.S.

pages: 523 words: 112,185

Doing Data Science: Straight Talk From the Frontline
by Cathy O'Neil and Rachel Schutt
Published 8 Oct 2013

He was using it to mean data models—the representation one is choosing to store one’s data, which is the realm of database managers—whereas she was talking about statistical models, which is what much of this book is about. One of Andrew Gelman’s blog posts on modeling was recently tweeted by people in the fashion industry, but that’s a different issue. Even if you’ve used the terms statistical model or mathematical model for years, is it even clear to yourself and to the people you’re talking to what you mean? What makes a model a model? Also, while we’re asking fundamental questions like this, what’s the difference between a statistical model and a machine learning algorithm? Before we dive deeply into that, let’s add a bit of context with this deliberately provocative Wired magazine piece, “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete,” published in 2008 by Chris Anderson, then editor-in-chief.

In the case of proteins, a model of the protein backbone with side-chains by itself is removed from the laws of quantum mechanics that govern the behavior of the electrons, which ultimately dictate the structure and actions of proteins. In the case of a statistical model, we may have mistakenly excluded key variables, included irrelevant ones, or assumed a mathematical structure divorced from reality. Statistical modeling: Before you get too involved with the data and start coding, it’s useful to draw a picture of what you think the underlying process might be with your model. What comes first? What influences what? What causes what?

These two seem obviously different, so it seems the distinction should be obvious. Unfortunately, it isn’t. For example, regression can be described as a statistical model as well as a machine learning algorithm. You’ll waste your time trying to get people to discuss this with any precision. In some ways this is a historical artifact of statistics and computer science communities developing methods and techniques in parallel and using different words for the same methods. The consequence of this is that the distinction between machine learning and statistical modeling is muddy. Some methods (for example, k-means, discussed in the next section) we might call an algorithm because it’s a series of computational steps used to cluster or classify objects—on the other hand, k-means can be reinterpreted as a special case of a Gaussian mixture model.
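
A short scikit-learn sketch of the point about k-means and mixture models, on synthetic blobs: the same data clustered by k-means and by a Gaussian mixture with spherical components, which is the probabilistic model k-means can be read as a hard-assignment special case of.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

gmm = GaussianMixture(n_components=3, covariance_type="spherical", random_state=0)
gmm_labels = gmm.fit_predict(X)

# On well-separated spherical clusters the two partitions largely coincide
# (up to an arbitrary permutation of the cluster labels).
print(np.bincount(km_labels), np.bincount(gmm_labels))
```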

pages: 209 words: 13,138

Empirical Market Microstructure: The Institutions, Economics and Econometrics of Securities Trading
by Joel Hasbrouck
Published 4 Jan 2007

If we know that the structural model is the particular one described in section 9.2, we simply set vt so that qt = +1, set ut = 0 and forecast using equation (9.7). We do not usually know the structural model, however. Typically we’re working from estimates of a statistical model (a VAR or VMA). This complicates specification of ε0. From the perspective of the VAR or VMA model of the trade and price data, the innovation vector and its variance are: εt = (εp,t, εq,t)′ with covariance matrix Ω = [σp², σp,q; σp,q, σq²] (9.15). The innovations in the statistical model are simply associated with the observed variables, and have no necessary structural interpretation. We can still set εq,t according to our contemplated trade (εq,t = +1), but how should we set εp,t?
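
The excerpt leaves the question open. One natural choice, not necessarily the book's, is to set εp,t to its expectation conditional on the trade innovation under joint normality, which is (σp,q/σq²)·εq,t. A minimal numpy sketch with invented covariance numbers:

```python
import numpy as np

# Invented innovation covariance matrix Omega for (eps_p, eps_q), in the spirit of eq. (9.15).
sigma_p2, sigma_q2, sigma_pq = 0.04, 0.25, 0.03
omega = np.array([[sigma_p2, sigma_pq],
                  [sigma_pq, sigma_q2]])

eps_q = 1.0                               # the contemplated trade innovation
eps_p = (sigma_pq / sigma_q2) * eps_q     # E[eps_p | eps_q] under joint normality

innovation = np.array([eps_p, eps_q])     # vector fed into the VAR/VMA forecast
print(innovation)
```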

The role they play and how they should be regulated are ongoing concerns of practical interest. 117 12 Limit Order Markets The worldwide proliferation of limit order markets (LOMs) clearly establishes a need for economic and statistical models of these mechanisms. This chapter discusses some approaches, but it should be admitted at the outset that no comprehensive and realistic models (either statistical or economic) exist. One might start with the view that a limit order, being a bid or offer, is simply a dealer quote by another name. The implication is that a limit order is exposed to asymmetric information risk and also must recover noninformational costs of trade. This view supports the application of the economic and statistical models described earlier to LOM, hybrid, and other nondealer markets.

To Lisa, who inspires these pages and much more. Preface: This book is a study of the trading mechanisms in financial markets: the institutions, the economic principles underlying the institutions, and statistical models for analyzing the data they generate. The book is aimed at graduate and advanced undergraduate students in financial economics and practitioners who design or use order management systems. Most of the book presupposes only a basic familiarity with economics and statistics. I began writing this book because I perceived a need for treatment of empirical market microstructure that was unified, authoritative, and comprehensive.

pages: 257 words: 13,443

Statistical Arbitrage: Algorithmic Trading Insights and Techniques
by Andrew Pole
Published 14 Sep 2007

Orders, up to a threshold labeled “visibility threshold,” have less impact on large-volume days. Fitting a mathematical curve or statistical model to the order size–market impact data yields a tool for answering the question: How much will I have to pay to buy 10,000 shares of XYZ? Note that buy and sell responses may be different and may be dependent on whether the stock is moving up or down that day. Breaking down the raw (60-day) data set and analyzing up days and down days separately will illuminate that issue. More formally, one could define an encompassing statistical model including an indicator variable for up or down day and test the significance of the estimated coefficient.
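
A hedged statsmodels sketch of that encompassing-model idea, not the author's code: regress a made-up market-impact measure on order size plus an up-day indicator, then read off the indicator's estimated coefficient and its p-value.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 500
order_size = rng.uniform(1_000, 50_000, n)
up_day = rng.integers(0, 2, n)                 # 1 = stock moving up that day
impact_bps = (0.0004 * order_size
              + 3.0 * up_day                   # effect built in purely for illustration
              + rng.normal(0, 5, n))

X = sm.add_constant(np.column_stack([order_size, up_day]))
fit = sm.OLS(impact_bps, X).fit()

# Coefficient on the indicator and its p-value test whether up days behave differently.
print(fit.params[2], fit.pvalues[2])
```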

Consideration of change is introduced from this first toe dipping into analysis, because temporal dynamics underpin the entirety of the project. Without the dynamic there is no arbitrage. In Chapter 3 we increase the depth and breadth of the analysis, expanding the modeling scope from simple observational rules1 for pairs to formal statistical models for more general portfolios. Several popular models for time series are described but detailed focus is on weighted moving averages at one extreme of complexity and factor analysis at another, these extremes serving to carry the message as clearly as we can make it. Pair spreads are referred to throughout the text serving, as already noted, as the simplest practical illustrator of the notions discussed.

Events in trading volume series provide information sometimes not identified (by turning point analysis) in price series. Volume patterns do not directly affect price spreads but volume spurts are a useful warning that a stock may be subject to unusual trading activity and that price development may therefore not be as characterized in statistical models that have been estimated on average recent historical price series. In historical analysis, flags of unusual activity are extremely important in the evaluation of, for example, simulation results. Figure 2.8: Adjusted close price trace (General Motors) with 20 percent turning points identified. Table 2.1 Event return summary for Chrysler–GM (criterion, number of events, return correlation): daily, 332, 0.53; 30% move, 22, 0.75; 25% move, 26, 0.73; 20% move, 33, 0.77.

Thinking with Data
by Max Shron
Published 15 Aug 2014

This could be part of a solution to the first two needs, verifying that there is a strong relationship between public transit and the housing market, and trying to predict whether apartments are under- or overpriced. Digging into our experience, we know that graphs are just one way to express a relationship. Two others are models and maps. How might we capture the relevant relationships with a statistical model? A statistical model would be a way to relate some notion of transit access to some notion of apartment price, controlling for other factors. We can clarify our idea with a mockup. The mockup here would be a sentence interpreting the hypothetical output. Results from a model might have conclusions like, “In New York City, apartment prices fall by 5% for every block away from the A train, compared to similar apartments.”

Depending on the resolution of the map, this could potentially meet the first two needs (making a case for a connection and finding outliers) as well, through visual inspection. A map is easier to inspect, but harder to calibrate or interpret. Each has its strengths and weaknesses. A scatterplot is going to be easy to make once we have some data, but potentially misleading. The statistical model will collapse down a lot of variation in the data in order to arrive at a general, interpretable conclusion, potentially missing interesting patterns. The map is going to be limited in its ability to account for variables that aren’t spatial, and we may have a harder time interpreting the results.

It is rare that we can get an audience to understand something just from lists of facts. Transformations make data intelligible, allowing raw data to be incorporated into an argument. A transformation puts an interpretation on data by highlighting things that we take to be essential. Counting all of the sales in a month is a transformation, as is plotting a graph or fitting a statistical model of page visits against age, or making a map of every taxi pickup in a city. Returning to our transit example, if we just wanted to show that there is some relationship between transit access and apartment prices, a high-resolution map of apartment prices overlaid on a transit map would be reasonable evidence, as would a two-dimensional histogram or scatterplot of the right quantities.

pages: 327 words: 103,336

Everything Is Obvious: *Once You Know the Answer
by Duncan J. Watts
Published 28 Mar 2011

Next, we compared the performance of these two polls with the Vegas sports betting market—one of the oldest and most popular betting markets in the world—as well as with another prediction market, TradeSports. And finally, we compared the prediction of both the markets and the polls against two simple statistical models. The first model relied only on the historical probability that home teams win—which they do 58 percent of the time—while the second model also factored in the recent win-loss records of the two teams in question. In this way, we set up a six-way comparison between different prediction methods—two statistical models, two markets, and two polls.6 Given how different these methods were, what we found was surprising: All of them performed about the same.

One might think, therefore, that prediction markets, with their far greater capacity to factor in different sorts of information, would outperform simplistic statistical models by a much wider margin for baseball than they do for football. But that turns out not to be true either. We compared the predictions of the Las Vegas sports betting markets over nearly twenty thousand Major League baseball games played from 1999 to 2006 with a simple statistical model based again on home-team advantage and the recent win-loss records of the two teams. This time, the difference between the two was even smaller—in fact, the performance of the market and the model were indistinguishable.

Because AI researchers had to program every fact, rule, and learning process into their creations from scratch, and because their creations failed to behave as expected in obvious and often catastrophic ways—like driving off a cliff or trying to walk through a wall—the frame problem was impossible to ignore. Rather than trying to crack the problem, therefore, AI researchers took a different approach entirely—one that emphasized statistical models of data rather than thought processes. This approach, which nowadays is called machine learning, was far less intuitive than the original cognitive approach, but it has proved to be much more productive, leading to all kinds of impressive breakthroughs, from the almost magical ability of search engines to complete queries as you type them to building autonomous robot cars, and even a computer that can play Jeopardy!

pages: 829 words: 186,976

The Signal and the Noise: Why So Many Predictions Fail-But Some Don't
by Nate Silver
Published 31 Aug 2012

Moreover, even the aggregate economic forecasts have been quite poor in any real-world sense, so there is plenty of room for progress. Most economists rely on their judgment to some degree when they make a forecast, rather than just take the output of a statistical model as is. Given how noisy the data is, this is probably helpful. A study62 by Stephen K. McNess, the former vice president of the Federal Reserve Bank of Boston, found that judgmental adjustments to statistical forecasting methods resulted in forecasts that were about 15 percent more accurate. The idea that a statistical model would be able to “solve” the problem of economic forecasting was somewhat in vogue during the 1970s and 1980s when computers came into wider use.

In these cases, it is much more likely that the fault lies with the forecaster’s model of the world and not with the world itself. In the instance of CDOs, the ratings agencies had no track record at all: these were new and highly novel securities, and the default rates claimed by S&P were not derived from historical data but instead were assumptions based on a faulty statistical model. Meanwhile, the magnitude of their error was enormous: AAA-rated CDOs were two hundred times more likely to default in practice than they were in theory. The ratings agencies’ shot at redemption would be to admit that the models had been flawed and the mistake had been theirs. But at the congressional hearing, they shirked responsibility and claimed to have been unlucky.

Barack Obama had led John McCain in almost every national poll since September 15, 2008, when the collapse of Lehman Brothers had ushered in the worst economic slump since the Great Depression. Obama also led in almost every poll of almost every swing state: in Ohio and Florida and Pennsylvania and New Hampshire—and even in a few states that Democrats don’t normally win, like Colorado and Virginia. Statistical models like the one I developed for FiveThirtyEight suggested that Obama had in excess of a 95 percent chance of winning the election. Betting markets were slightly more equivocal, but still had him as a 7 to 1 favorite.2 But McLaughlin’s first panelist, Pat Buchanan, dodged the question. “The undecideds will decide this weekend,” he remarked, drawing guffaws from the rest of the panel.

Data Mining: Concepts and Techniques: Concepts and Techniques
by Jiawei Han , Micheline Kamber and Jian Pei
Published 21 Jun 2011

Data mining has an inherent connection with statistics. A statistical model is a set of mathematical functions that describe the behavior of the objects in a target class in terms of random variables and their associated probability distributions. Statistical models are widely used to model data and data classes. For example, in data mining tasks like data characterization and classification, statistical models of target classes can be built. In other words, such statistical models can be the outcome of a data mining task. Alternatively, data mining tasks can be built on top of statistical models. For example, we can use statistics to model noise and missing data values.

For each object y in region R, we can estimate the probability that this point fits the Gaussian distribution. Because this probability is very low, y is unlikely to have been generated by the Gaussian model, and thus is an outlier. The effectiveness of statistical methods highly depends on whether the assumptions made for the statistical model hold true for the given data. There are many kinds of statistical models. For example, the statistical models used in the methods may be parametric or nonparametric. Statistical methods for outlier detection are discussed in detail in Section 12.3. Proximity-Based Methods: Proximity-based methods assume that an object is an outlier if the nearest neighbors of the object are far away in feature space, that is, the proximity of the object to its neighbors significantly deviates from the proximity of most of the other objects to their neighbors in the same data set.
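
A small scipy sketch of the parametric (Gaussian) approach described above, on synthetic data: fit a normal distribution to the points in a region and flag an observation whose density under that fit is very low.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
region = rng.normal(loc=10.0, scale=2.0, size=500)   # points assumed to follow a Gaussian
y = 25.0                                             # a suspicious new observation

mu, sigma = region.mean(), region.std()
density_y = norm.pdf(y, loc=mu, scale=sigma)

# A very small density (equivalently, a large z-score) marks y as a likely outlier.
z = abs(y - mu) / sigma
print(density_y, z, z > 3.0)
```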

Then, when mining patterns in a large data set, the data mining process can use the model to help identify and handle noisy or missing values in the data. Statistics research develops tools for prediction and forecasting using data and statistical models. Statistical methods can be used to summarize or describe a collection of data. Basic statistical descriptions of data are introduced in Chapter 2. Statistics is useful for mining various patterns from data as well as for understanding the underlying mechanisms generating and affecting the patterns. Inferential statistics (or predictive statistics) models data in a way that accounts for randomness and uncertainty in the observations and is used to draw inferences about the process or population under investigation.

pages: 174 words: 56,405

Machine Translation
by Thierry Poibeau
Published 14 Sep 2017

Introduction of Linguistic Information into Statistical Models: Statistical translation models, despite their increasing complexity to better fit language specificities, have not solved all the difficulties encountered. In fact, bilingual corpora, even large ones, remain insufficient at times to properly cover rare or complex linguistic phenomena. One solution is to then integrate more information of a linguistic nature in the machine translation system to better represent the relations between words (syntax) and their meanings (semantics). Alignment Models Accounting for Syntax: The statistical models described so far are all direct translation systems: they search for equivalences between the source language and the target language at word level, or, at best, they take into consideration sequences of words that are not necessarily linguistically coherent.

On the one hand, the analysis of existing translations and their generalization according to various linguistic strategies can be used as a reservoir of knowledge for future translations. This is known as example-based translation, because in this approach previous translations are considered examples for new translations. On the other hand, with the increasing amount of translations available on the Internet, it is now possible to directly design statistical models for machine translation. This approach, known as statistical machine translation, is the most popular today. Unlike a translation memory, which can be relatively small, automatic processing presumes the availability of an enormous amount of data. Robert Mercer, one of the pioneers of statistical translation,1 proclaimed: “There is no data like more data.”

This approach naturally takes into account the statistical nature of language, which means that the approach focuses on the most frequent patterns in a language and, despite its limitations, is able to produce acceptable translations for a significant number of simple sentences. In certain cases, statistical models can also identify idioms thanks to asymmetric alignments (one word from the source language aligned with several words from the target language, for example), which means they can also overcome the word-for-word limitation. In the following section, we will examine several lexical alignment models developed toward the end of the 1980s and the beginning of the 1990s.

pages: 204 words: 58,565

Keeping Up With the Quants: Your Guide to Understanding and Using Analytics
by Thomas H. Davenport and Jinho Kim
Published 10 Jun 2013

In big-data environments, where the data just keeps coming in large volumes, it may not always be possible for humans to create hypotheses before sifting through the data. In the context of placing digital ads on publishers’ sites, for example, decisions need to be made in thousandths of a second by automated decision systems, and the firms doing this work must generate several thousand statistical models per week. Clearly this type of analysis can’t involve a lot of human hypothesizing and reflection on results, and machine learning is absolutely necessary. But for the most part, we’d advise sticking to hypothesis-driven analysis and the steps and sequence in this book. The Modeling (Variable Selection) Step A model is a purposefully simplified representation of the phenomenon or problem.

Data analysis. Key Software Vendors for Different Analysis Types (listed alphabetically). Reporting software: BOARD International; IBM Cognos; Information Builders WebFOCUS; Oracle Business Intelligence (including Hyperion); Microsoft Excel/SQL Server/SharePoint; MicroStrategy; Panorama; SAP BusinessObjects. Interactive visual analytics: QlikTech QlikView; Tableau; TIBCO Spotfire. Quantitative or statistical modeling: IBM SPSS; R (an open-source software package); SAS. While all of the listed reporting software vendors also have capabilities for graphical display, some vendors focus specifically on interactive visual analytics, or the use of visual representations of data and reporting.

Such tools are often used simply to graph data and for data discovery—understanding the distribution of the data, identifying outliers (data points with unexpected values) and visual relationships between variables. So we’ve listed these as a separate category. We’ve also listed key vendors of software for the other category of analysis, which we’ll call quantitative or statistical modeling. In that category, you’re trying to use statistics to understand the relationships between variables and to make inferences from your sample to a larger population. Predictive analytics, randomized testing, and the various forms of regression analysis are all forms of this type of modeling.

pages: 301 words: 89,076

The Globotics Upheaval: Globalisation, Robotics and the Future of Work
by Richard Baldwin
Published 10 Jan 2019

The chore is to identify which features of the digitalized speech data are most useful when making an educated guess as to the corresponding word. To tackle this chore, the computer scientists set up a “blank slate” statistical model. It is a blank slate in the sense that every feature of the speech data is allowed to be, in principle, an important feature in the guessing process. What they are looking for is how to weight each aspect of the speech data when trying to find the word it is associated with. The revolutionary thing about machine learning is that the scientists don’t fill in the blanks. They don’t write down the weights in the statistical model. Instead, they write a set of step-by-step instructions for how the computer should fill in the blanks itself.

That is to say, it identifies the features of the speech data that are useful in predicting the corresponding words. The scientists then make the statistical model take an exam. They feed it a fresh set of spoken words and ask it to predict the written words that they correspond to. This is called the “testing data set.” Usually, the model—which is also called an “algorithm”—is not good enough to be released “into the wild,” so the computer scientists do some sophisticated trial and error of their own by manually tweaking the computer program that is used to choose the weights. After what can be a long sequence of iterations like this, and after the statistical model has achieved a sufficiently high degree of accuracy, the new language model graduates to the next level.
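
A generic scikit-learn stand-in for the train-then-test loop described here, using synthetic features in place of digitized speech; nothing about the actual speech-recognition models is implied.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic "features" and "labels" standing in for speech data and words.
X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

model = LogisticRegression(max_iter=1_000)   # the "blank slate" whose weights get filled in
model.fit(X_train, y_train)                  # the training data chooses the weights

# The "exam" on held-out testing data measures whether the model generalizes.
print(accuracy_score(y_test, model.predict(X_test)))
```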

We haven’t a clue as to how our elephant thinks—how we, for example, recognize a cat or keep our balance when running over hill and dale. A form of AI called “machine learning” solved the paradox by changing the way computers are programmed. With machine learning, humans help the computer (the “machine” part) estimate a very large statistical model that the computer then uses to guess the solution to a particular problem (the “learning” part). Thanks to mind-blowing advances in computing power and access to hallucinatory amounts of data, white-collar robots trained by machine learning routinely achieve human-level performance on specific guessing tasks, like recognizing speech.

pages: 276 words: 81,153

Outnumbered: From Facebook and Google to Fake News and Filter-Bubbles – the Algorithms That Control Our Lives
by David Sumpter
Published 18 Jun 2018

Instead, Mona described to me a culture where colleagues judged each other on how advanced their mathematical techniques were. They believed there was a direct trade-off between the quality of statistical results and the ease with which they can be communicated. If FiveThirtyEight offered a purely statistical model of the polls then the socio-economic background of their statisticians wouldn’t be relevant. But they don’t offer a purely statistical model. Such a model would have come out strongly for Clinton. Instead, they use a combination of their skills as forecasters and the underlying numbers. Work environments consisting of people with the same background and ideas are typically less likely to perform as well on difficult tasks, such as academic research and running a successful business.12 It is difficult for a bunch of people who all have the same background to identify all of the complex factors involved in predicting the future.

Google’s search engine was making racist autocomplete suggestions; Twitterbots were spreading fake news; Stephen Hawking was worried about artificial intelligence; far-right groups were living in algorithmically created filter-bubbles; Facebook was measuring our personalities, and these were being exploited to target voters. One after another, the stories of the dangers of algorithms accumulated. Even the mathematicians’ ability to make predictions was called into question as statistical models got both Brexit and Trump wrong. Stories about the maths of football, love, weddings, graffiti and other fun things were suddenly replaced by the maths of sexism, hate, dystopia and embarrassing errors in opinion poll calculations. When I reread the scientific article on Banksy, a bit more carefully this time, I found that very little new evidence was presented about his identity.

CHAPTER TWO Make Some Noise After the mathematical unmasking of Banksy had sunk in, I realised that I had somehow missed the sheer scale of the change that algorithms were making to our society. But let me be clear. I certainly hadn’t missed the development of the mathematics. Machine learning, statistical models and artificial intelligence are all things I actively research and talk about with my colleagues every day. I read the latest articles and keep up to date with the biggest developments. But I was concentrating on the scientific side of things: looking at how the algorithms work in the abstract.

pages: 400 words: 94,847

Reinventing Discovery: The New Era of Networked Science
by Michael Nielsen
Published 2 Oct 2011

At the least we should take seriously the idea that these statistical models express truths not found in more conventional explanations of language translation. Might it be that the statistical models contain more truth than our conventional theories of language, with their notions of verb, noun, and adjective, subjects and objects, and so on? Or perhaps the models contain a different kind of truth, in part complementary, and in part overlapping, with conventional theories of language? Maybe we could develop a better theory of language by combining the best insights from the conventional approach and the approach based on statistical modeling into a single, unified explanation?

The program would also examine the corpus to figure out how words moved around in the sentence, observing, for example, that “hola” and “hello” tend to be in the same parts of the sentence, while other words get moved around more. Repeating this for every pair of words in the Spanish and English languages, their program gradually built up a statistical model of translation—an immensely complex model, but nonetheless one that can be stored on a modern computer. I won’t describe the models they used in complete detail here, but the hola-hello example gives you the flavor. Once they had analyzed the corpus and built up their statistical model, they used that model to translate new texts. To translate a Spanish sentence, the idea was to find the English sentence that, according to the model, had the highest probability.
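
A toy sketch in the spirit of the hola-hello example: estimate word-translation probabilities from co-occurrence counts over a tiny invented parallel corpus. Real systems use vastly larger corpora and richer alignment models; this only illustrates the counting idea.

```python
from collections import Counter, defaultdict

# Tiny invented Spanish-English sentence pairs.
corpus = [
    ("hola amigo", "hello friend"),
    ("hola mundo", "hello world"),
    ("adios amigo", "goodbye friend"),
]

cooc = defaultdict(Counter)
source_totals = Counter()
for es, en in corpus:
    for s in es.split():
        source_totals[s] += len(en.split())
        for e in en.split():
            cooc[s][e] += 1

# t(e | s): how often e appears alongside s, normalized over all co-occurrences of s.
def translation_prob(s, e):
    return cooc[s][e] / source_totals[s]

print(translation_prob("hola", "hello"))   # relatively high
print(translation_prob("hola", "world"))   # lower
```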

But it’s stimulating to speculate that nouns and verbs, subjects and objects, and all the other paraphernalia of language are really emergent properties whose existence can be deduced from statistical models of language. Today, we don’t yet know how to make such a deductive leap, but that doesn’t mean it’s not possible. What status should we give to complex explanations of this type? As the data web is built, it will become easier and easier for people to construct such explanations, and we’ll end up with statistical models of all kinds of complex phenomena. We’ll need to learn how to look into complex models such as the language models and extract emergent concepts such as verbs and nouns.

pages: 354 words: 26,550

High-Frequency Trading: A Practical Guide to Algorithmic Strategies and Trading Systems
by Irene Aldridge
Published 1 Dec 2009

Legal risk—the risk of litigation expenses. All current risk measurement approaches fall into four categories: statistical models, scalar models, scenario analysis, and causal modeling. Statistical models generate predictions about worst-case future conditions based on past information. The Value-at-Risk (VaR) methodology is the most common statistical risk measurement tool, discussed in detail in the sections that focus on market and liquidity risk estimation. Statistical models are the preferred methodology of risk estimation whenever statistical modeling is feasible. Scalar models establish the maximum foreseeable loss levels as percentages of business parameters, such as revenues, operating costs, and the like.
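
A minimal numpy illustration of the statistical category: estimate a one-day 99% Value-at-Risk by taking a low percentile of a return history, here simulated rather than observed, and the portfolio value is invented.

```python
import numpy as np

rng = np.random.default_rng(42)
daily_returns = rng.normal(0.0005, 0.012, 1_000)   # stand-in for an observed P&L history

# Historical-simulation VaR: the loss exceeded on only 1% of past days.
var_99 = -np.percentile(daily_returns, 1)
portfolio_value = 10_000_000
print(f"1-day 99% VaR: {var_99 * portfolio_value:,.0f}")
```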

Yet, readers relying on software packages with preconfigured statistical procedures may find the level of detail presented here to be sufficient for quality analysis of trading opportunities. The depth of the statistical content should be also sufficient for readers to understand the models presented throughout the remainder of this book. Readers interested in a more thorough treatment of statistical models may refer to Tsay (2002); Campbell, Lo, and MacKinlay (1997); and Gouriéroux and Jasiak (2001). This chapter begins with a review of the fundamental statistical estimators, moves on to linear dependency identification methods and volatility modeling techniques, and concludes with standard nonlinear approaches for identifying and modeling trading opportunities.

These high-frequency strategies, which trade on the market movements surrounding news announcements, are collectively referred to as event arbitrage. This chapter investigates the mechanics of event arbitrage in the following order: overview of the development process; generating a price forecast through statistical modeling of directional forecasts and point forecasts; applying event arbitrage to corporate announcements, industry news, and macroeconomic news; and documented effects of events on foreign exchange, equities, fixed income, futures, emerging economies, commodities, and REIT markets. DEVELOPING EVENT ARBITRAGE TRADING STRATEGIES: Event arbitrage refers to the group of trading strategies that place trades on the basis of the markets’ reaction to events.

pages: 320 words: 33,385

Market Risk Analysis, Quantitative Methods in Finance
by Carol Alexander
Published 2 Jan 2007

Its most important market risk modelling applications are to: multivariate GARCH modelling, generating copulas, and simulating asset prices. I.3.5 INTRODUCTION TO STATISTICAL INFERENCE: A statistical model will predict well only if it is properly specified and its parameter estimates are robust, unbiased and efficient. Unbiased means that the expected value of the estimator is equal to the true model parameter and efficient means that the variance of the estimator is low, i.e. different samples give similar estimates. When we set up a statistical model the implicit assumption is that this is the ‘true’ model for the population. We estimate the model’s parameters from a sample and then use these estimates to infer the values of the ‘true’ population parameters.
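
A quick simulation of the estimator properties named here, on a population with known parameters: the sample mean is unbiased for the population mean, and dividing the sum of squared deviations by n-1 rather than n removes the bias of the sample variance.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean, true_var = 5.0, 4.0
n, reps = 20, 100_000

samples = rng.normal(true_mean, np.sqrt(true_var), size=(reps, n))

means = samples.mean(axis=1)
var_biased = samples.var(axis=1, ddof=0)     # divide by n
var_unbiased = samples.var(axis=1, ddof=1)   # divide by n-1

print(means.mean())          # close to 5.0 -> sample mean is unbiased
print(var_biased.mean())     # close to 3.8 -> biased downwards by (n-1)/n
print(var_unbiased.mean())   # close to 4.0 -> unbiased
```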

A case study in this chapter applies PCA to European equity indices, and several more case studies are given in subsequent volumes of Market Risk Analysis. A very good free downloadable Excel add-in has been used for these case studies and examples. Further details are given in the chapter. Chapter 3, Probability and Statistics, covers the probabilistic and statistical models that we use to analyse the evolution of financial asset prices or interest rates. Starting from the basic concepts of a random variable, a probability distribution, quantiles and population and sample moments, we then provide a catalogue of probability distributions. We describe the theoretical properties of each distribution and give examples of practical applications to finance.

We describe the theoretical properties of each distribution and give examples of practical applications to finance. Stable distributions and kernel estimates are also covered, because they have broad applications to financial risk management. The sections on statistical inference and maximum likelihood lay the foundations for Chapter 4. Finally, we focus on the continuous time and discrete time statistical models for the evolution of financial asset prices and returns, which are further developed in Volume III. xxvi Preface Much of the material in Volume II rests on the Introduction to Linear Regression given in Chapter 4. Here we start from the basic, simple linear model, showing how to estimate and draw inferences on the parameters, and explaining the standard diagnostic tests for a regression model.

pages: 250 words: 64,011

Everydata: The Misinformation Hidden in the Little Data You Consume Every Day
by John H. Johnson
Published 27 Apr 2016

You collect all the data on every wheat price in the history of humankind, and all the different factors that determine the price of wheat (temperature, feed prices, transportation costs, etc.). First, you need to develop a statistical model to determine what factors have affected the price of wheat in the past and how these various factors relate to one another mathematically. Then, based on that model, you predict the price of wheat for next year.14 The problem is that no matter how big your sample is (even if it’s the full population), and how accurate your statistical model is, there are still unknowns that can cause your forecast to be off: What if a railroad strike doubles the transportation costs?

As Hovenkamp said, “the plaintiff’s expert had ignored a clear ‘outlier’ in the data.”33 If that outlier data had been excluded—as it arguably should have been—then the results would have shown a clear increase in market share for Conwood. Instead, the conclusion—driven by an extreme observation—showed a decrease. If your conclusions change dramatically by excluding a data point, then that data point is a strong candidate to be an outlier. In a good statistical model, you would expect that you can drop a data point without seeing a substantive difference in the results. It’s something to think about when looking for outliers. ARE YOU BETTER THAN AVERAGE? The average American: Sleeps more than 8.7 hours per day34 Weighs approximately 181 pounds (195.5 pounds for men and 166.2 pounds for women)35 Drinks 20.8 gallons of beer per year36 Drives 13,476 miles per year (hopefully not after drinking all that beer)37 Showers six times a week, but only shampoos four times a week38 Has been at his or her current job 4.6 years39 So, are you better than average?
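
A quick numpy check of the drop-a-point test suggested above, on invented data containing one extreme observation: refit the line with and without that point and compare the slopes.

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.arange(20, dtype=float)
y = 1.0 + 0.3 * x + rng.normal(0, 0.5, 20)
y[-1] = 40.0                                  # a single extreme observation

slope_all, _ = np.polyfit(x, y, 1)            # fit using every point
slope_drop, _ = np.polyfit(x[:-1], y[:-1], 1) # fit with the suspect point dropped

# A large swing in the estimate flags the dropped point as a candidate outlier.
print(round(slope_all, 3), round(slope_drop, 3))
```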

(On its website, Visa even suggests that you tell your financial institution if you’ll be traveling, which can “help ensure that your card isn’t flagged for unusual activity.”18) This is a perfect example of a false positive—the credit card company predicted that the charges on your card were potentially fraudulent, but it was wrong. Events like this, which may not be accounted for in the statistical model, are potential sources of prediction error. Just as sampling error tells us about the uncertainty in our sample, prediction error is a way to measure uncertainty in the future, essentially by comparing the predicted results to the actual outcomes, once they occur.19 Prediction error is often measured using a prediction interval, which is the range in which we expect to see the next data point.

pages: 252 words: 71,176

Strength in Numbers: How Polls Work and Why We Need Them
by G. Elliott Morris
Published 11 Jul 2022

Although political pollsters would still have to conduct traditional RDD telephone polls to sample the attitudes of Americans who are not registered to vote and therefore do not show up in states’ voter files, any pre-election polling would ideally still be mixed with polls conducted off the voter file in order to adjust nonresponse biases among the voting population. But the world of public opinion research is far from ideal. First, not every polling outfit has access to a voter file. Subscriptions can be very expensive, often more expensive than the added cost of calling people who won’t respond to your poll. Reengineering statistical models to incorporate the new methods also takes time, which many firms do not have. Further, many pollsters clinging to RDD phone polls would not have the technical know-how to make the switch even if they tried; Hartman and her colleagues were in a league of their own when it came to their programming and statistical abilities.

Another analyst, extremely sharp but perhaps prone to dramatic swings, memorably declares “We’re fucked.” A senior member of the team excused himself, and I later found out that he proceeded immediately to the bathroom, in order to vomit.11 Ghitza’s thesis reports on a group of projects related to statistical modeling and political science. He had been hired by the Obama campaign roughly six weeks before the election to program a model that could predict how the election was unfolding throughout the day, based on the way turnout among different groups was looking. Ghitza was interested in answering two questions: First, “Could we measure deviations from expected turnout for different groups of the electorate in real time?”

David Shor, Ghitza’s colleague who created the campaign’s pre-election poll-based forecasts, later remarked, “That was the worst 12 hours of my life.”12 Ghitza was not hired by the Obama campaign to work on its voter file and polling operation, but he probably should have been. He had studied breakthrough statistical modeling during his doctoral work at Columbia, developing a lot of the methods his current employer, Catalist, uses to merge polls with voter files and model support for political candidates at the individual level. The hallmark method of his dissertation, “multilevel regression with post-stratification” (MRP), was thought up by his advisor, Andrew Gelman, in the 1990s.
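
A toy sketch of only the post-stratification step of MRP: combine per-cell support estimates, however they were modeled, with known population shares of those cells to get an overall estimate. All cell definitions and numbers below are invented.

```python
# Cells defined by, say, age group x education; every value here is invented.
estimated_support = {           # modeled share supporting a candidate, per cell
    ("18-34", "college"): 0.62,
    ("18-34", "no college"): 0.55,
    ("35+", "college"): 0.48,
    ("35+", "no college"): 0.41,
}
population_share = {            # each cell's share of the target population (sums to 1)
    ("18-34", "college"): 0.15,
    ("18-34", "no college"): 0.20,
    ("35+", "college"): 0.25,
    ("35+", "no college"): 0.40,
}

# Post-stratified estimate: population-weighted average of the cell estimates.
overall = sum(estimated_support[c] * population_share[c] for c in population_share)
print(round(overall, 3))
```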

pages: 848 words: 227,015

On the Edge: The Art of Risking Everything
by Nate Silver
Published 12 Aug 2024

It’s about the most natural thing in the world to a former professional poker player like me, but will be completely alien to other people. On November 8, 2016, the statistical model I built for FiveThirtyEight said there was a 71 percent chance that Hillary Clinton would win the presidency and a 29 percent chance that Donald Trump would win. For context, this estimate of Trump’s chances was considered high at the time. Other statistical models put Trump’s chances at anywhere from 15 percent to less than 1 percent. And betting markets put them at around 1 chance in 6 (17 percent). Trump won, of course, by sweeping several Rust Belt swing states.

A barbecue restaurant in Austin, looking at its sales numbers, could run a regression analysis to adjust for factors like the day of the week, the weather, and if there was a big football game in town. The natural companion to analytic thinking is abstract thinking—that is, trying to derive general rules or principles from the things you observe in the world. Another way to describe this is “model building.” The models can be formal, as in a statistical model or even a philosophical model.[*6] Or they can be informal, as in a mental model, or a set of heuristics (rules of thumb) that adapt well to new situations. In poker, for instance, there are millions of permutations for how a particular hand might play out, and it’s impossible to plan for every one.

Top-down is in line with a game-theory equilibrium that assumes everyone playing the sports-betting game is pretty smart, resulting in a market where betting lines are reasonably efficient and there aren’t major edges to be had through number-crunching or data-mining—instead you need street smarts and hustle.[*1] But in truth, the distinction is more philosophical than practical: the most successful sports bettors use a mix of both approaches. Top-down guys like Spanky might not build statistical models themselves, but they’re data literate and employ models that others have built.[*2] And bottom-up guys, no matter how good they are at statistics, still need to figure out how to get the money down. If you think this is a trivial problem, I can personally attest that it isn’t. In January 2022, mobile sports betting went live in New York.

pages: 265 words: 74,000

The Numerati
by Stephen Baker
Published 11 Aug 2008

And when he got his master's, he decided to look for a job "at places where they hire Ph.D.'s." He landed at Accenture, and now, at an age at which many of his classmates are just finishing their doctorate, he runs the analytics division from his perch in Chicago. Ghani leads me out of his office and toward the shopping cart. For statistical modeling, he explains, grocery shopping is one of the first retail industries to conquer. This is because we buy food constantly. For many of us, the supermarket functions as a chilly, Muzak-blaring annex to our pantries. (I would bet that millions of suburban Americans spend more time in supermarkets than in their formal living room.)

He thinks that over the next generation, many of us will surround ourselves with the kinds of networked gadgets he and his team are building and testing. These machines will busy themselves with far more than measuring people's pulse and counting the pills they take, which is what today's state-of-the-art monitors can do. Dishman sees sensors eventually recording and building statistical models of almost every aspect of our behavior. They'll track our pathways in the house, the rhythm of our gait. They'll diagram our thrashing in bed and chart our nightly trips to the bathroom—perhaps keeping tabs on how much time we spend in there. Some of these gadgets will even measure the pause before we recognize a familiar voice on the phone.

From that, they can calculate a 90 percent probability that toothbrush movement involves teeth cleaning. (They could factor in time variables, but there's more than enough complexity ahead, as we'll see.) Next they move to the broom and the teakettle, and they ask the same questions. The goal is to build a statistical model for each of us that will infer from a series of observations what we're most likely to be doing. The toothbrush was easy. For the most part, it sticks to only one job. But consider the kettle. What are the chances that it's being used for tea? Maybe a person uses it to make instant soup (which is more nutritious than tea but dangerously salty for people like my mother).
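The inference the passage describes can be illustrated in a few lines of Python. All of the prior and conditional probabilities below are invented; the point is only to show how an observation of an object in motion updates the probability of an activity.

```python
# Toy version of the inference: given that the kettle is moving, how likely is each
# activity? Every number here is hypothetical.
priors = {"making tea": 0.6, "making instant soup": 0.3, "other": 0.1}
p_kettle_moves_given = {"making tea": 0.95, "making instant soup": 0.9, "other": 0.05}

# Bayes' rule: P(activity | kettle moves) is proportional to
# P(kettle moves | activity) * P(activity).
unnormalized = {a: p_kettle_moves_given[a] * priors[a] for a in priors}
total = sum(unnormalized.values())
posterior = {a: v / total for a, v in unnormalized.items()}
print(posterior)  # roughly {'making tea': 0.67, 'making instant soup': 0.32, 'other': 0.01}
```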

pages: 764 words: 261,694

The Elements of Statistical Learning (Springer Series in Statistics)
by Trevor Hastie , Robert Tibshirani and Jerome Friedman
Published 25 Aug 2009

–Ian Hacking

Contents (excerpt): Preface to the Second Edition; Preface to the First Edition; 1 Introduction; 2 Overview of Supervised Learning: 2.1 Introduction; 2.2 Variable Types and Terminology; 2.3 Two Simple Approaches to Prediction: Least Squares and Nearest Neighbors (2.3.1 Linear Models and Least Squares; 2.3.2 Nearest-Neighbor Methods; 2.3.3 From Least Squares to Nearest Neighbors); 2.4 Statistical Decision Theory; 2.5 Local Methods in High Dimensions; 2.6 Statistical Models, Supervised Learning and Function Approximation (2.6.1 A Statistical Model for the Joint Distribution Pr(X, Y); 2.6.2 Supervised Learning; 2.6.3 Function Approximation); 2.7 Structured Regression Models (2.7.1 Difficulty of the Problem)

We will see that there is a whole spectrum of models between the rigid linear models and the extremely flexible 1-nearest-neighbor models, each with their own assumptions and biases, which have been proposed specifically to avoid the exponential growth in complexity of functions in high dimensions by drawing heavily on these assumptions. Before we delve more deeply, let us elaborate a bit on the concept of statistical models and see how they fit into the prediction framework. 2.6 Statistical Models, Supervised Learning and Function Approximation. Our goal is to find a useful approximation f̂(x) to the function f(x) that underlies the predictive relationship between the inputs and outputs. In the theoretical setting of Section 2.4, we saw that squared error loss led us to the regression function f(x) = E(Y | X = x) for a quantitative response.

The class of nearest-neighbor methods can be viewed as direct estimates of this conditional expectation, but we have seen that they can fail in at least two ways: • if the dimension of the input space is high, the nearest neighbors need not be close to the target point, and can result in large errors; • if special structure is known to exist, this can be used to reduce both the bias and the variance of the estimates. We anticipate using other classes of models for f(x), in many cases specifically designed to overcome the dimensionality problems, and here we discuss a framework for incorporating them into the prediction problem. 2.6.1 A Statistical Model for the Joint Distribution Pr(X, Y). Suppose in fact that our data arose from a statistical model Y = f(X) + ε, (2.29) where the random error ε has E(ε) = 0 and is independent of X. Note that for this model, f(x) = E(Y | X = x), and in fact the conditional distribution Pr(Y | X) depends on X only through the conditional mean f(x). The additive error model is a useful approximation to the truth.
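As a quick illustration of the additive error model discussed here, the sketch below simulates data from Y = f(X) + ε (with an arbitrarily chosen f) and fits both a linear least-squares model and a nearest-neighbor model to it. The choice of f, the noise level, and k are assumptions made for the example, not values from the text.

```python
# Simulate data from Y = f(X) + eps and fit the two approaches the chapter contrasts.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(300, 1))
f = lambda x: np.sin(x)                              # an arbitrary "true" f
y = f(X).ravel() + rng.normal(0, 0.3, size=300)      # eps: mean 0, independent of X

linear = LinearRegression().fit(X, y)
knn = KNeighborsRegressor(n_neighbors=15).fit(X, y)

x_test = np.linspace(-3, 3, 5).reshape(-1, 1)
print("linear fit:", linear.predict(x_test).round(2))
print("15-NN fit: ", knn.predict(x_test).round(2))
print("true f(x): ", f(x_test).ravel().round(2))
```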

pages: 321

Finding Alphas: A Quantitative Approach to Building Trading Strategies
by Igor Tulchinsky
Published 30 Sep 2019

For example, in alpha research the task of predicting stock prices can be a good application of supervised learning, and the task of selecting stocks for inclusion in a portfolio is an application of unsupervised learning. [Figure 16.1: The most developed directions of machine learning (the most popular in black). Unsupervised methods: clusterization algorithms. Supervised methods: statistical models, support vector machines, neural networks and deep learning algorithms, fuzzy logic, and ensemble methods such as random forest and AdaBoost.] Statistical Models. Models like naive Bayes, linear discriminant analysis, the hidden Markov model, and logistic regression are good for solving relatively simple problems that do not need high precision of classification or prediction.

Models like naive Bayes, linear discriminant analysis, the hidden Markov model, and logistic regression are good for solving relatively simple problems that do not need high precision of classification or prediction. These methods are easy to implement and not too sensitive to missing data. The disadvantage is that each of these approaches presumes some specific data model. Trend analysis is an example of applications of statistical models in alpha research. In particular, a hidden Markov model is frequently utilized for that purpose, based on the belief that price movements of the stock market are not totally random. In a statistics framework, the hidden Markov model is a composition of two or more stochastic processes: a hidden Markov chain, which accounts for the temporal variability, and an observable process, which accounts for the spectral variability.
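As a rough illustration of using a hidden Markov model for regime or trend analysis, here is a sketch that fits a two-state Gaussian HMM to simulated returns. It assumes the third-party hmmlearn package, which the book does not mention, and all of the numbers are synthetic.

```python
# Two-state Gaussian HMM on simulated returns (assumes the hmmlearn package).
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(2)
# 500 "days" of returns: a calm, mildly up-trending regime, then a volatile one.
calm = rng.normal(0.0005, 0.005, size=300)
volatile = rng.normal(-0.0005, 0.02, size=200)
returns = np.concatenate([calm, volatile]).reshape(-1, 1)

model = GaussianHMM(n_components=2, covariance_type="diag", n_iter=200, random_state=0)
model.fit(returns)
states = model.predict(returns)                 # most likely hidden regime for each day
print("estimated state means:", model.means_.ravel())
print("fraction of days in each state:", np.bincount(states) / len(states))
```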

(There will be a range of views on both of these horizons, but we can still use the implied causal relationship between the extreme weather and the commodity supply to narrow the range of candidates.) We can now test our idea by gathering data on historical weather forecasts and price changes for the major energy contracts, and testing for association between the two datasets, using a partial in-sample historical dataset. The next step is to fit a simple statistical model and test it for robustness, while varying the parameters in the fit. One good robustness test is to include a similar asset for comparison, where we expect the effect to be weaker. In the case of our weather alpha example, Brent crude oil would be a reasonable choice. Crude oil is a global market, so we would expect some spillover from a US supply disruption.

pages: 625 words: 167,349

The Alignment Problem: Machine Learning and Human Values
by Brian Christian
Published 5 Oct 2020

“He was on fire about reforming the criminal justice system,” says Brennan. “It was a job of a passion for Dave.” Brennan and Wells decided to team up.10 They called their company Northpointe. As the era of the personal computer dawned, the use of statistical models at all points in the criminal justice system, in jurisdictions large and small, exploded. In 1980, only four states were using statistical models to assist in parole decisions. By 1990, it was twelve states, and by 2000, it was twenty-six.11 Suddenly it began to seem strange not to use such models; as the Association of Paroling Authorities International’s 2003 Handbook for New Parole Board Members put it, “In this day and age, making parole decisions without benefit of a good, research-based risk assessment instrument clearly falls short of accepted best practice.”12 One of the most widely used tools of this new era had been developed by Brennan and Wells in 1998; they called it Correctional Offender Management Profiling for Alternative Sanctions—or COMPAS.13 COMPAS uses a simple statistical model based on a weighted linear combination of things like age, age at first arrest, and criminal history to predict whether an inmate, if released, would commit a violent or nonviolent crime within approximately one to three years.14 It also includes a broad set of survey questions to identify a defendant’s particular issues and needs—things like chemical dependency, lack of family support, and depression.
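To show what "a weighted linear combination" of inputs looks like in code, here is a generic sketch. The features, weights, and logistic link are invented for illustration and are emphatically not the actual COMPAS model.

```python
# Illustration only: a generic weighted linear risk score of the kind described above.
# The inputs and weights are made up and do NOT reproduce COMPAS.
import numpy as np

def risk_score(age, age_at_first_arrest, prior_arrests,
               weights=(-0.05, -0.04, 0.30), intercept=1.0):
    """Linear combination of inputs; higher values mean higher predicted risk."""
    x = np.array([age, age_at_first_arrest, prior_arrests])
    return intercept + x @ np.array(weights)

def predicted_probability(score):
    """Squash the linear score into a 0-1 range with a logistic link."""
    return 1.0 / (1.0 + np.exp(-score))

print(predicted_probability(risk_score(age=23, age_at_first_arrest=17, prior_arrests=4)))
```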

Even in cases where the human decision makers were given the statistical prediction as yet another piece of data on which to make their decision, their decisions were still worse than just using the prediction itself.23 Other researchers tried the reverse tack: feeding the expert human judgments into a statistical model as input. They didn’t appear to add much.24 Conclusions like these, which have been supported by numerous studies since, should give us pause.25 For one, they seem to suggest that, whatever myriad issues we face in turning decision-making over to statistical models, human judgment alone is not a viable alternative. At the same time, perhaps complex, elaborate models really aren’t necessary to match or exceed this human baseline.

In recent years, alarm bells have gone off in two distinct communities. The first are those focused on the present-day ethical risks of technology. If a facial-recognition system is wildly inaccurate for people of one race or gender but not another, or if someone is denied bail because of a statistical model that has never been audited and that no one in the courtroom—including the judge, attorneys, and defendant—understands, this is a problem. Issues like these cannot be addressed within traditional disciplinary camps, but rather only through dialogue: between computer scientists, social scientists, lawyers, policy experts, ethicists.

The Ethical Algorithm: The Science of Socially Aware Algorithm Design
by Michael Kearns and Aaron Roth
Published 3 Oct 2019

A far more common type of machine learning is the supervised variety, where we wish to use data to make specific predictions that can later be verified or refuted by observing the truth—for example, using past meteorological data to predict whether it will rain tomorrow. The “supervision” that guides our learning is the feedback we get tomorrow, when either it rains or it doesn’t. And for much of the history of machine learning and statistical modeling, many applications, like this example, were focused on making predictions about nature or other large systems: predicting tomorrow’s weather, predicting whether the stock market will go up or down (and by how much), predicting congestion on roadways during rush hour, and the like. Even when humans were part of the system being modeled, the emphasis was on predicting aggregate, collective behaviors.

But if we go too far down the path toward individual fairness, other difficulties arise. In particular, if our model makes even a single mistake, then it can potentially be accused of unfairness toward that one individual, assuming it makes any loans at all. And anywhere we apply machine learning and statistical models to historical data, there are bound to be mistakes except in the most idealized settings. So we can ask for this sort of individual level of fairness, but if we do so naively, its applicability will be greatly constrained and its costs to accuracy are likely to be unpalatable; we’re simply asking for too much.

Sometimes decisions made using biased data or algorithms are the basis for further data collection, forming a pernicious feedback loop that can amplify discrimination over time. An example of this phenomenon comes from the domain of “predictive policing,” in which large metropolitan police departments use statistical models to forecast neighborhoods with higher crime rates, and then send larger forces of police officers there. The most popularly used algorithms are proprietary and secret, so there is debate about how these algorithms estimate crime rates, and concern that some police departments might be in part using arrest data.

pages: 353 words: 97,029

How Big Things Get Done: The Surprising Factors Behind Every Successful Project, From Home Renovations to Space Exploration
by Bent Flyvbjerg and Dan Gardner
Published 16 Feb 2023

To do that, we made another reference-class forecast for the remaining work, roughly half of the total. The estimate had to be high confidence because MTR had only one more shot at getting approval from the Hong Kong government for more time and money. Having data from almost two hundred relevant projects to draw on enabled us to statistically model the uncertainties, risks, and likely outcomes of various strategies. Then MTR could decide how much risk it was willing to take. I told the MTR board that it was like buying insurance. “How insured do you want to be against further time and budget overruns? Fifty percent? Seventy? Ninety?” The more insured you want to be, the more money you have to set aside.31 A settlement was eventually reached between MTR and the government in November 2015.

In fact, for the twenty-plus project types for which I have data, only a handful have a kurtosis for cost overrun that indicates a normal or near-normal distribution (the results are similar for schedule overruns and benefit shortfalls, albeit with fewer data). A large majority of project types has a kurtosis higher than 3—often much higher—indicating fat-tailed and very-fat-tailed distributions. Statistics and decision theory further talk about “kurtosis risk,” which is the risk that results when a statistical model assumes the normal (or near-normal) distribution but is applied to observations that have a tendency to occasionally be much further (in terms of number of standard deviations) from the average than is expected for a normal distribution. Project management scholarship and practice largely ignore kurtosis risk, which is unfortunate given the extreme levels of kurtosis documented above and which is a root cause of why this type of management so often goes so systematically and spectacularly wrong. 18.
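A small sketch of the kurtosis check described here, using synthetic overrun data. The threshold of 3 corresponds to a normal distribution when scipy's kurtosis is computed with fisher=False; the data and parameters are made up.

```python
# Checking for fat tails in cost overruns (synthetic numbers).
# With fisher=False, scipy reports raw kurtosis, so a normal distribution gives 3.
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(3)
normal_overruns = rng.normal(loc=0.1, scale=0.3, size=5000)           # thin-tailed
fat_tailed_overruns = rng.standard_t(df=3, size=5000) * 0.3 + 0.1     # fat-tailed

print("kurtosis, near-normal project type:", round(kurtosis(normal_overruns, fisher=False), 2))
print("kurtosis, fat-tailed project type: ", round(kurtosis(fat_tailed_overruns, fisher=False), 2))
```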

Jordy Batselier and Mario Vanhoucke, “Practical Application and Empirical Evaluation of Reference-Class Forecasting for Project Management,” Project Management Journal 47, no. 5 (2016): 36; further documentation of RCF accuracy can be found in Li Liu and Zigrid Napier, “The Accuracy of Risk-Based Cost Estimation for Water Infrastructure Projects: Preliminary Evidence from Australian Projects,” Construction Management and Economics 28, no. 1 (2010): 89–100; Li Liu, George Wehbe, and Jonathan Sisovic, “The Accuracy of Hybrid Estimating Approaches: A Case Study of an Australian State Road and Traffic Authority,” The Engineering Economist 55, no. 3 (2010): 225–45; Byung-Cheol Kim and Kenneth F. Reinschmidt, “Combination of Project Cost Forecasts in Earned Value Management,” Journal of Construction Engineering and Management 137, no. 11 (2011): 958–66; Robert F. Bordley, “Reference-Class Forecasting: Resolving Its Challenge to Statistical Modeling,” The American Statistician 68, no. 4 (2014): 221–29; Omotola Awojobi and Glenn P. Jenkins, “Managing the Cost Overrun Risks of Hydroelectric Dams: An Application of Reference-Class Forecasting Techniques,” Renewable and Sustainable Energy Reviews 63 (September 2016): 19–32; Welton Chang et al., “Developing Expert Political Judgment: The Impact of Training and Practice on Judgmental Accuracy in Geopolitical Forecasting Tournaments,” Judgment and Decision Making 11, no. 5 (September 2016): 509–26; Jordy Batselier and Mario Vanhoucke, “Improving Project Forecast Accuracy by Integrating Earned Value Management with Exponential Smoothing and Reference-Class Forecasting,” International Journal of Project Management 35, no. 1 (2017): 28–43. 16.

pages: 294 words: 82,438

Simple Rules: How to Thrive in a Complex World
by Donald Sull and Kathleen M. Eisenhardt
Published 20 Apr 2015

A simple rule—take the midpoint of the two most distant crime scenes—got police closer to the criminal than more sophisticated decision-making approaches. Another study compared a state-of-the-art statistical model and a simple rule to determine which did a better job of predicting whether past customers would purchase again. According to the simple rule, a customer was inactive if they had not purchased in x months (the number of months varies by industry). The simple rule did as well as the statistical model in predicting repeat purchases of online music, and beat it in the apparel and airline industries. Other research finds that simple rules match or beat more complicated models in assessing the likelihood that a house will be burgled and in forecasting which patients with chest pain are actually suffering from a heart attack.

Statisticians have found: Professor Scott Armstrong of the Wharton School reviewed thirty-three studies comparing simple and complex statistical models used to forecast business and economic outcomes. He found no difference in forecasting accuracy in twenty-one of the studies. Sophisticated models did better in five studies, while simple models outperformed complex ones in seven cases. See J. Scott Armstrong, “Forecasting by Extrapolation: Conclusions from 25 Years of Research,” Interfaces 14 (1984): 52–66. Spyros Makridakis has hosted a series of competitions for statistical models over two decades, and consistently found that complex models fail to outperform simpler approaches.

And yet it works. One recent study of alternative investment approaches pitted the Markowitz model and three extensions of his approach against the 1/N rule, testing them on seven samples of data from the real world. This research ran a total of twenty-eight horseraces between the four state-of-the-art statistical models and the 1/N rule. With ten years of historical data to estimate risk, returns, and correlations, the 1/N rule outperformed the Markowitz equation and its extensions 79 percent of the time. The 1/N rule earned a positive return in every test, while the more complicated models lost money for investors more than half the time.
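The 1/N rule itself is trivial to implement, which is part of the point. The sketch below applies it to synthetic return data; the Markowitz-style comparison models from the study are not reproduced here.

```python
# The 1/N rule: split the money equally across the N assets.
import numpy as np

rng = np.random.default_rng(4)
returns = rng.normal(0.005, 0.04, size=(120, 4))    # 10 years x 4 assets, synthetic

n_assets = returns.shape[1]
weights = np.full(n_assets, 1.0 / n_assets)          # the 1/N allocation

portfolio_returns = returns @ weights
print("1/N weights:", weights)
print("mean monthly return: %.4f, volatility: %.4f"
      % (portfolio_returns.mean(), portfolio_returns.std()))
```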

pages: 370 words: 107,983

Rage Inside the Machine: The Prejudice of Algorithms, and How to Stop the Internet Making Bigots of Us All
by Robert Elliott Smith
Published 26 Jun 2019

That act of faith remains largely hidden from everyone outside that community by a cloud of seemingly impenetrable mathematics. This obscures the dangers inherent in using statistics and probability as a basis for reasoning about people via algorithms. Statistical models, after all, aren’t unbiased, particularly when, as is the case for most algorithms today, they are motivated by the pursuit of profit. Just like expert systems, statistical models require a frame within which to operate, which is then populated by particular atoms. That frame and those atoms are subject to the same brittleness (limitations) and biases. On top of that, the probabilities drawn from these statistics, which become the grist for the statistical algorithmic mill, often aren’t what we think they are at all.

Unlike Wollstonecraft, Byron was a game-changing personality who challenged conventions and social mores and opened the door to a new Romantic Age. At least for men. The casual definition of outlier is ‘a person or thing situated away or detached from the main body or system,’ but in statistical modelling, it is ‘a data point on a graph or in a set of results that is very much bigger or smaller than the next nearest data point.’ In terms of algorithms, a statistical model is like the flattened and warped rugby ball, a shape that can be mathematically characterized by a few numbers, which can be in turn manipulated by an algorithm to fit data. In this sense, an outlier is a point that is far from the other points, the fluff on the data cloud which can’t easily be fitted inside the warped rugby ball.

However, there is another way to view the Bell Curve: not as a natural law, but as an artefact of trying to see complex and uncertain phenomena through the limiting lens of sampling and statistics. The CLT does not prove that everything follows a Bell Curve; it shows that when you sample in order to understand things that you can’t observe, you will always get a Bell Curve. That’s all. Despite this reality, faith in CLT and the Bell Curve still dominates in statistical modelling of all sorts of things today from presidential approval ratings to reoffending rates for criminals to educational success or failure, to whether jobs can be done by computers as well as people. What’s more, faith in this mathematical model inevitably led to its use in areas where it was ill-suited and inappropriate, such as Quetelet’s Theory of Probabilities as Applied and to the Moral and Political Sciences.

pages: 401 words: 109,892

The Great Reversal: How America Gave Up on Free Markets
by Thomas Philippon
Published 29 Oct 2019

This pattern holds for the whole economy as well as within the manufacturing sector, where we can use more granular data (NAICS level 6, a term explained in the Appendix section on industry classification). The relationship is positive and significant over the 1997–2002 period but not after. In fact, the relationship appears to be negative, albeit noisy, in the 2007–2012 period. Box 4.2. Statistical Models. Table 4.2 presents the results of five regressions, that is, five statistical models. The right half of the table considers the whole economy; the left half focuses on the manufacturing sector.

TABLE 4.2 Regression Results (dependent variable: productivity growth)

                         Manufacturing (NAICS-6)            Whole economy (KLEMS)
                         (1) 97–02   (2) 02–07   (3) 07–12   (4) 89–99   (5) 00–15
Census CR4 growth          0.13*       0.01       −0.13
                          [0.06]      [0.05]      [0.17]
Compustat CR4 growth                                            0.14*      −0.09
                                                               [0.06]     [0.07]
Year fixed effects           Y           Y           Y            Y          Y
Observations               469         466         299           92        138
R2                        0.03        0.00        0.02          0.07       0.09

Notes: Log changes in TFP and in top 4 concentration.

When BLS data collectors cannot obtain a price for an item in the CPI sample (for example, because the outlet has stopped selling it), they look for a replacement item that is closest to the missing one. The BLS then adjusts for changes in quality and specifications. It can use manufacturers’ cost data or hedonic regressions to compute quality adjustments. Hedonic regressions are statistical models to infer consumers’ willingness to pay for goods or services. When it cannot estimate an explicit quality adjustment, the BLS imputes the price change using the average price change of similar items in the same geographic area. Finally, the BLS has specific procedures to estimate the price of housing (rents and owners’ equivalent rents) and medical care.
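A minimal sketch of what a hedonic regression looks like in practice: regress log price on observable characteristics so that the coefficients can be read as implicit valuations of those characteristics. The product features, data, and coefficients below are invented and are not the BLS's actual specification.

```python
# Hedonic regression sketch: log price on characteristics, simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 500
df = pd.DataFrame({
    "screen_inches": rng.uniform(12, 17, size=n),
    "ram_gb": rng.choice([8, 16, 32], size=n),
    "ssd": rng.integers(0, 2, size=n),
})
df["log_price"] = (5.5 + 0.05 * df["screen_inches"] + 0.015 * df["ram_gb"]
                   + 0.10 * df["ssd"] + rng.normal(0, 0.1, size=n))

hedonic = smf.ols("log_price ~ screen_inches + ram_gb + ssd", data=df).fit()
print(hedonic.params)   # each coefficient approximates the percent premium per unit of the feature
```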

To test this idea, Matias Covarrubias, Germán Gutiérrez, and I (2019) study the relationship between changes in concentration and changes in total factor productivity (TFP) across industries during the 1990s and 2000s. We use our trade-adjusted concentration measures to control for foreign competition and for exports. Box 4.2 and its table summarize our results and discuss the interpretation of the various numbers in statistical models. We find that the relationship between concentration and productivity growth has changed over the past twenty years. During the 1990s (1989–1999) this relationship was positive. Industries with larger increases in concentration were also industries with larger productivity gains. This is no longer the case.

pages: 545 words: 137,789

How Markets Fail: The Logic of Economic Calamities
by John Cassidy
Published 10 Nov 2009

Maybe because of shifts in psychology or government policy, there are periods when markets will settle into a rut, and other periods when they will be apt to gyrate in alarming fashion. This picture seems to jibe with reality, but it raises some tricky issues for quantitative finance. If the underlying reality of the markets is constantly changing, statistical models based on past data will be of limited use, at best, in determining what is likely to happen in the future. And firms and investors that rely on these models to manage risk may well be exposing themselves to danger. The economics profession didn’t exactly embrace Mandelbrot’s criticisms. As the 1970s proceeded, the use of quantitative techniques became increasingly common on Wall Street.

After listening to Vincent Reinhart, the head of the Fed’s Division of Monetary Affairs, suggest several ways the Fed could try to revive the economy if interest rate changes could no longer be used, he dismissed the discussion as “premature” and described the possibility of a prolonged deflation as “a very small probability event.” The discussion turned to the immediate issue of whether to keep the funds rate at 1.25 percent. Since the committee’s previous meeting, Congress had approved the Bush administration’s third set of tax cuts since 2001, which was expected to give spending a boost. The Fed’s own statistical model of the economy was predicting a vigorous upturn later in 2003, suggesting that further rate cuts would be unnecessary and that some policy tightening might even be needed. “But that forecast has a very low probability, as far as I’m concerned,” Greenspan said curtly. “It points to an outcome that would be delightful if it were to materialize, but it is not a prospect on which we should focus our policy at this point.”

Greenspan’s method of analysis was inductive: he ingested as many figures as he could, from as many sources as he could find, then tried to fit them together into a coherent pattern. When I visited Greenspan at his office one day in 2000, I discovered him knee-deep in figures. He explained that he was trying to revamp a forty-year-old statistical model that his consulting firm had used to estimate realized capital gains on home sales. What made Greenspan such an interesting and important figure is that his empiricism was accompanied by a fervent belief in the efficiency and morality of the free market system. The conclusion that untrammeled capitalism provides a uniquely productive method of organizing production Greenspan took from his own observations and his reading of Adam Smith.

Big Data at Work: Dispelling the Myths, Uncovering the Opportunities
by Thomas H. Davenport
Published 4 Feb 2014

For reasons not entirely understood (by anyone, I think), the results of big data analyses are often expressed in visual formats. Now, visual analytics have a lot of strengths: They are relatively easy for non-quantitative executives to interpret, and they get attention. The downside is that they are not generally well suited for expressing complex multivariate relationships and statistical models. Put in other terms, most visual displays of data are for descriptive analytics, rather than predictive or prescriptive ones. They can, however, show a lot of data at once, as figure 4-1 illustrates. It’s a display of the tweets and retweets on Twitter involving particular New York Times articles.5 I find—as with many other complex big data visualizations—this one difficult to decipher.

In effect, big data is not just a large volume of unstructured data, but also the technologies that make processing and analyzing it possible. Specific big data technologies analyze textual, video, and audio content. When big data is fast moving, technologies like machine learning allow for the rapid creation of statistical models that fit, optimize, and predict the data. This chapter is devoted to all of these big data technologies and the difference they make. The technologies addressed in the chapter are outlined in table 5-1. *I am indebted in this section to Jill Dyché, vice president of SAS Best Practices, who collaborated with me on this work and developed many of the frameworks in this section.

This makes it useful for analysts who are familiar with that query language. Business View The business view layer of the stack makes big data ready for further analysis. Depending on the big data application, additional processing via MapReduce or custom code might be used to construct an intermediate data structure, such as a statistical model, a flat file, a relational table, or a data cube. The resulting structure may be intended for additional analysis or to be queried by a traditional SQL-based query tool. Many vendors are moving to so-called “SQL on Hadoop” approaches, simply because SQL has been used in business for a couple of decades, and many people (and higher-level languages) know how to create SQL queries.

pages: 443 words: 51,804

Handbook of Modeling High-Frequency Data in Finance
by Frederi G. Viens , Maria C. Mariani and Ionut Florescu
Published 20 Dec 2011

Contents (excerpt): Preface; Contributors. Part One: Analysis of Empirical Data. 1 Estimation of NIG and VG Models for High Frequency Financial Data (José E. Figueroa-López, Steven R. Lancette, Kiseop Lee, and Yanhui Mi): 1.1 Introduction; 1.2 The Statistical Models; 1.3 Parametric Estimation Methods; 1.4 Finite-Sample Performance via Simulations; 1.5 Empirical Results; 1.6 Conclusion; References. 2 A Study of Persistence of Price Movement using High Frequency Financial Data (Dragos Bozdog, Ionuţ Florescu, Khaldoun Khashanah, and Jim Wang): 2.1 Introduction; 2.2 Methodology; 2.4 Rare Events Distribution; 2.5 Conclusions; References. 3 Using Boosting for Financial Analysis and Trading (Germán Creamer): 3.1 Introduction; 3.2 Methods; 3.3 Performance Evaluation; 3.4 Earnings Prediction and Algorithmic Trading; 3.5 Final Comments and Conclusions; References. 4 Impact of Correlation Fluctuations on Securitized Structures (Eric Hillebrand, Ambar N. …)

The data was obtained from the NYSE TAQ database of 2005 trades via Wharton’s WRDS system. For the sake of clarity and space, we only present the results for Intel and defer a full analysis of other stocks for a future publication. We finish with a section of conclusions and further recommendations. 1.2 The Statistical Models 1.2.1 GENERALITIES OF EXPONENTIAL LÉVY MODELS Before introducing the specific models we consider in this chapter, let us briefly motivate the application of Lévy processes in financial modeling. We refer the reader to the monographs of Cont & Tankov (2004) and Sato (1999) or the recent review papers Figueroa-López (2011) and Tankov (2011) for further information.

A geometric Brownian motion (also called Black–Scholes model) postulates the following conditions about the price process (St)t≥0 of a risky asset: (1) The (log) return on the asset over a time period [t, t + h] of length h, that is, Rt,t+h := log(St+h/St), is Gaussian with mean μh and variance σ²h (independent of t); (2) Log returns on disjoint time periods are mutually independent; (3) The price path t → St is continuous; that is, P(Su → St, as u → t, ∀ t) = 1. The previous assumptions can equivalently be stated in terms of the so-called log return process (Xt)t, denoted henceforth as Xt := log St.
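A short sketch that simulates log returns under the geometric Brownian motion assumptions just listed and checks condition (1), that returns over windows of length h have mean μh and variance σ²h. The drift, volatility, and horizon values are arbitrary.

```python
# Simulate GBM log returns and verify the Gaussian mean/variance of condition (1).
import numpy as np

mu, sigma = 0.05, 0.2            # annualized mean log return and volatility (arbitrary)
h = 1.0 / 252                    # one trading day
n_steps, s0 = 100_000, 100.0

rng = np.random.default_rng(6)
log_returns = rng.normal(mu * h, sigma * np.sqrt(h), size=n_steps)
prices = s0 * np.exp(np.cumsum(log_returns))     # the S_t path implied by the model

print("sample mean of R_{t,t+h}:", log_returns.mean(), " theory:", mu * h)
print("sample var  of R_{t,t+h}:", log_returns.var(),  " theory:", sigma**2 * h)
```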

pages: 460 words: 122,556

The End of Wall Street
by Roger Lowenstein
Published 15 Jan 2010

[Flattened back-of-book index entries for AIG and AIG bailouts; subentries include Ben Bernanke, Warren Buffett, CDOs, collateral calls, credit default swaps, credit rating agencies, Goldman Sachs, Lehman Brothers, the New York Federal Reserve Bank, Hank Paulson, statistical modeling of, stock price, and systemic effects of failure, followed by entries running from Akers, John through bailouts.]

[Flattened index entries for credit crisis, credit default swaps, and credit rating agencies; subentries include AIG, Goldman Sachs, Morgan Stanley, Lehman Brothers, the Monte Carlo method, mortgage-backed securities, and statistical modeling used by, followed by entries running from Credit Suisse through derivatives.]

[Flattened index entries for Freddie Mac and Fannie Mae (subentries include accounting problems, affordable housing, bailout, the Federal Reserve, Hank Paulson, statistical models of, and stock price), Richard Fuld, Timothy Geithner, and Goldman Sachs, running from foreclosure(s) through government, U.S.]

Evidence-Based Technical Analysis: Applying the Scientific Method and Statistical Inference to Trading Signals
by David Aronson
Published 1 Nov 2006

It was a review of prior studies, known as a meta-analysis, which examined 20 studies that had compared the subjective diagnoses of psychologists and psychiatrists with those produced by linear statistical models. The studies covered the prediction of academic success, the likelihood of criminal recidivism, and predicting the outcomes of electrical shock therapy. In each case, the experts rendered a judgment by evaluating a multitude of variables in a subjective manner. “In all studies, the statistical model provided more accurate predictions or the two methods tied.”34 A subsequent study by Sawyer35 was a meta analysis of 45 studies. “Again, there was not a single study in which clinical global judgment was superior to the statistical prediction (termed ‘mechanical combination’ by Sawyer).”36 Sawyer’s investigation is noteworthy because he considered studies in which the human expert was allowed access to information that was not considered by the statistical model, and yet the model was still superior.

The prediction problems spanned nine different fields: (1) academic performance of graduate students, (2) life-expectancy of cancer patients, (3) changes in stock prices, (4) mental illness using personality tests, (5) grades and attitudes in a psychology course, (6) business failures using financial ratios, (7) students’ ratings of teaching effectiveness, (8) performance of life insurance sales personnel, and (9) IQ scores using Rorschach Tests. Note that the average correlation of the statistical model was 0.64 versus the expert average of 0.33. In terms of information content, which is measured by the correlation coefficient squared or r-squared, the model’s predictions were on average 3.76 times as informative as the experts’. Numerous additional studies comparing expert judgment to statistical models (rules) have confirmed these findings, forcing the conclusion that people do poorly when attempting to combine a multitude of variables to make predictions or judgments.
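The "3.76 times as informative" figure follows directly from squaring the two correlations, as a two-line check shows.

```python
# Information content is measured by r-squared, so the ratio is (0.64**2) / (0.33**2).
model_r, expert_r = 0.64, 0.33
print(model_r**2 / expert_r**2)   # ~3.76
```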

The average accuracy of the experts, as measured by the correlation coefficient between their prediction of violence and the actual manifestation of violence, was a poor 0.12. The single best expert had a score of 0.36. The predictions of a linear statistical model, using the same set of 19 inputs, achieved a correlation of 0.82. In this instance the model’s predictions were nearly 50 times more informative than the experts’. Meehl continued to expand his research of comparing experts and statistical models and in 1986 concluded that “There is no controversy in social science which shows such a large body of qualitatively diverse studies coming out so uniformly in the same direction as this one.

pages: 263 words: 75,455

Quantitative Value: A Practitioner's Guide to Automating Intelligent Investment and Eliminating Behavioral Errors
by Wesley R. Gray and Tobias E. Carlisle
Published 29 Nov 2012

We need some means to protect us from our cognitive biases, and the quantitative method is that means. It serves both to protect us from our own behavioral errors and to exploit the behavioral errors of others. The model need not be complex to achieve this end. In fact, the weight of evidence indicates that even simple statistical models outperform the best experts. It speaks to the diabolical nature of our faulty cognitive apparatus that those simple statistical models continue to outperform the best experts even when those same experts are given access to the models' output. This is as true for a value investor as it is for any other expert in any other field of endeavor. This book is aimed at value investors.

Tetlock's conclusion is that experts suffer from the same behavioral biases as the laymen. Tetlock's study fits within a much larger body of research that has consistently found that experts are as unreliable as the rest of us. A large number of studies have examined the records of experts against simple statistical models, and, in almost all cases, concluded that experts either underperform the models or can do no better. It's a compelling argument against human intuition and for the statistical approach, whether it's practiced by experts or nonexperts.37 Even Experts Make Behavioral Errors. In many disciplines, simple quantitative models outperform the intuition of the best experts.

The model predicted O'Connor's vote correctly 70 percent of the time, while the experts' success rate was only 61 percent.41 How can it be that simple models perform better than experienced clinical psychologists or renowned legal experts with access to detailed information about the cases? Are these results just flukes? No. In fact, the MMPI and Supreme Court decision examples are not even rare. There are an overwhelming number of studies and meta-analyses—studies of studies—that corroborate this phenomenon. In his book, Montier provides a diverse range of studies comparing statistical models and experts, ranging from the detection of brain damage, the interview process to admit students to university, the likelihood of a criminal to reoffend, the selection of “good” and “bad” vintages of Bordeaux wine, and the buying decisions of purchasing managers. Value Investors Have Cognitive Biases, Too. Graham recognized early on that successful investing required emotional discipline.

pages: 294 words: 77,356

Automating Inequality
by Virginia Eubanks

The electronic registry of the unhoused I studied in Los Angeles, called the coordinated entry system, was piloted seven years later. It deploys computerized algorithms to match unhoused people in its registry to the most appropriate available housing resources. The Allegheny Family Screening Tool, launched in August 2016, uses statistical modeling to provide hotline screeners with a predictive risk score that shapes the decision whether or not to open child abuse and neglect investigations. I started my reporting in each location by reaching out to organizations working closely with the families most directly impacted by these systems.

“[P]renatal risk assessments could be used to identify children at risk … while still in the womb.”3 On the other side of the world, Rhema Vaithianathan, associate professor of economics at the University of Auckland, was on a team developing just such a tool. As part of a larger program of welfare reforms led by conservative Paula Bennett, the New Zealand Ministry of Social Development (MSD) commissioned the Vaithianathan team to create a statistical model to sift information on parents interacting with the public benefits, child protective, and criminal justice systems to predict which children were most likely to be abused or neglected. Vaithianathan reached out to Putnam-Hornstein to collaborate. “It was such an exciting opportunity to partner with Rhema’s team around this potential real-time use of data to target children,” said Putnam-Hornstein.

It is an early adopter in a nationwide algorithmic experiment in child welfare: similar systems have been implemented recently in Florida, Los Angeles, New York City, Oklahoma, and Oregon. As this book goes to press, Cherna and Dalton continue to experiment with data analytics. The next iteration of the AFST will employ machine learning rather than traditional statistical modeling. They also plan to introduce a second predictive model, one that will not rely on reports to the hotline at all. Instead, the planned model “would be run on a daily or weekly basis on all babies born in Allegheny County the prior day or week,” according to a September 2017 email from Dalton.

pages: 461 words: 128,421

The Myth of the Rational Market: A History of Risk, Reward, and Delusion on Wall Street
by Justin Fox
Published 29 May 2009

First, modeling financial risk is hard. Statistical models can never fully capture all things that can go wrong (or right). It was as physicist and random walk pioneer M. F. M. Osborne told his students at UC–Berkeley back in 1972: For everyday market events the bell curve works well. When it doesn’t, one needs to look outside the statistical models and make informed judgments about what’s driving the market and what the risks are. The derivatives business and other financial sectors on the rise in the 1980s and 1990s were dominated by young quants. These people knew how to work statistical models, but they lacked the market experience needed to make informed judgments.

Traditional ratios of loan-to-value and monthly payments to income gave way to credit scoring and purportedly precise gradations of default risk that turned out to be worse than useless. In the 1970s, Amos Tversky and Daniel Kahneman had argued that real-world decision makers didn’t follow the statistical models of John von Neumann and Oskar Morgenstern, but used simple heuristics—rules of thumb—instead. Now the mortgage lending industry was learning that heuristics worked much better than statistical models descended from the work of von Neumann and Morgenstern. Simple trumped complex. In 2005, Robert Shiller came out with a second edition of Irrational Exuberance that featured a new twenty-page chapter on “The Real Estate Market in Historical Perspective.”

These people knew how to work statistical models, but they lacked the market experience needed to make informed judgments. Meanwhile, those with the experience, wisdom, and authority to make informed judgments—the bosses—didn’t understand the statistical models. It’s possible that, as more quants rise into positions of high authority (1986 Columbia finance Ph.D. Vikram Pandit, who became CEO of Citigroup in 2007, was the first quant to run a major bank), this particular problem will become less pronounced. But the second obstacle to risk-free living through derivatives is much harder to get around. It’s the paradox that killed portfolio insurance—when enough people subscribe to a particular means of taming financial risk, then that in itself brings new risks.

pages: 1,829 words: 135,521

Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython
by Wes McKinney
Published 25 Sep 2017

.: categories=['a', 'b'])

In [25]: data
Out[25]:
   x0    x1    y category
0   1  0.01 -1.5        a
1   2 -0.01  0.0        b
2   3  0.25  3.6        a
3   4 -4.10  1.3        a
4   5  0.00 -2.0        b

If we wanted to replace the 'category' column with dummy variables, we create dummy variables, drop the 'category' column, and then join the result:

In [26]: dummies = pd.get_dummies(data.category, prefix='category')

In [27]: data_with_dummies = data.drop('category', axis=1).join(dummies)

In [28]: data_with_dummies
Out[28]:
   x0    x1    y  category_a  category_b
0   1  0.01 -1.5           1           0
1   2 -0.01  0.0           0           1
2   3  0.25  3.6           1           0
3   4 -4.10  1.3           1           0
4   5  0.00 -2.0           0           1

There are some nuances to fitting certain statistical models with dummy variables. It may be simpler and less error-prone to use Patsy (the subject of the next section) when you have more than simple numeric columns.

13.2 Creating Model Descriptions with Patsy

Patsy is a Python library for describing statistical models (especially linear models) with a small string-based “formula syntax,” which is inspired by (but not exactly the same as) the formula syntax used by the R and S statistical programming languages.
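For readers who want to see the formula syntax in action, here is a minimal sketch using the same small DataFrame, assuming the patsy package is installed; the exact column names it produces depend on Patsy's defaults.

```python
# Minimal Patsy example reusing the `data` DataFrame from the excerpt above.
import patsy

y_mat, X_mat = patsy.dmatrices("y ~ x0 + x1 + C(category)", data, return_type="dataframe")
print(X_mat.columns.tolist())
# The design matrix includes an 'Intercept' column, the numeric columns x0 and x1,
# and a dummy column for category (one level is dropped as the baseline).
```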

This includes such submodules as:

Regression models: Linear regression, generalized linear models, robust linear models, linear mixed effects models, etc.
Analysis of variance (ANOVA)
Time series analysis: AR, ARMA, ARIMA, VAR, and other models
Nonparametric methods: Kernel density estimation, kernel regression
Visualization of statistical model results

statsmodels is more focused on statistical inference, providing uncertainty estimates and p-values for parameters. scikit-learn, by contrast, is more prediction-focused. As with scikit-learn, I will give a brief introduction to statsmodels and how to use it with NumPy and pandas.

1.4 Installation and Setup

Since everyone uses Python for different applications, there is no single solution for setting up Python and required add-on packages.

While readers may have many different end goals for their work, the tasks required generally fall into a number of different broad groups:

Interacting with the outside world: Reading and writing with a variety of file formats and data stores
Preparation: Cleaning, munging, combining, normalizing, reshaping, slicing and dicing, and transforming data for analysis
Transformation: Applying mathematical and statistical operations to groups of datasets to derive new datasets (e.g., aggregating a large table by group variables)
Modeling and computation: Connecting your data to statistical models, machine learning algorithms, or other computational tools
Presentation: Creating interactive or static graphical visualizations or textual summaries

Code Examples

Most of the code examples in the book are shown with input and output as it would appear executed in the IPython shell or in Jupyter notebooks:

In [5]: CODE EXAMPLE
Out[5]: OUTPUT

When you see a code example like this, the intent is for you to type in the example code in the In block in your coding environment and execute it by pressing the Enter key (or Shift-Enter in Jupyter).

pages: 301 words: 85,126

AIQ: How People and Machines Are Smarter Together
by Nick Polson and James Scott
Published 14 May 2018

People rely on billions of language facts, most of which they take for granted—like the knowledge that “drop your trousers” and “drop off your trousers” are used in very different situations, only one of which is at the dry cleaner’s. Knowledge like this is hard to codify in explicit rules, because there’s too much of it. Believe it or not, the best way we know to teach it to machines is to give them a giant hard drive full of examples of how people say stuff, and to let the machines sort it out on their own with a statistical model. This purely data-driven approach to language may seem naïve, and until recently we simply didn’t have enough data or fast-enough computers to make it work. Today, though, it works shockingly well. At its tech conference in 2017, for example, Google boldly announced that machines had now reached parity with humans at speech recognition, with a per-word dictation error rate of 4.9%—drastically better than the 20–30% error rates common as recently as 2013.

This is about 250 times more common than “whether report” (0.0000000652%), which is used mainly as a bad pun or an example of phonetic ambiguity. From the 1980s onward, NLP researchers began to recognize the value of this purely statistical information. Before, they’d been hand-building rules capable of describing how a given linguistic task should be performed. Now, these experts started training statistical models capable of predicting that a person would perform a task in a certain way. As a field, NLP shifted its focus from understanding to mimicry—from knowing how, to knowing that. These new models required lots of data. You fed the machine as many examples as you could find of how humans use language, and you programmed the machine to use the rules of probability to find patterns in those examples.
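A toy version of that data-driven approach: estimate how much more common one phrase is than another by counting occurrences in a corpus. The "corpus" below is a four-sentence stand-in; real systems count over billions of words.

```python
# Compare phrase frequencies in a tiny, made-up corpus.
corpus = [
    "the weather report said rain",
    "did you see the weather report today",
    "whether report or rumor, check the source",
    "the weather report was wrong again",
]

def count_phrase(phrase, sentences):
    return sum(s.count(phrase) for s in sentences)

weather = count_phrase("weather report", corpus)
whether = count_phrase("whether report", corpus)
print(f"'weather report': {weather}, 'whether report': {whether}, ratio: {weather / whether:.1f}")
```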

You may remember a time when people dialed 411 to look up a phone number for a local business, at a dollar or so per call. Google 411 lets you do the same thing for free, by dialing 1-800-GOOG-411. It was a useful service in an age before ubiquitous smartphones—and also a great way for Google to build up an enormous database of voice queries that would help train its statistical models for speech recognition. The system quietly shut down in 2010, presumably because Google had all the data it needed. Of course, there’s been an awful lot of Grace Hopper–style coding since 2007 to turn all that data into good prediction rules. So more than a decade later, what’s the result?

Know Thyself
by Stephen M Fleming
Published 27 Apr 2021

By tweaking the settings of the scanner, rapid snapshots can also be taken every few seconds that track changes in blood oxygen levels in different parts of the brain (this is known as functional MRI, or fMRI). Because more vigorous neural firing uses up more oxygen, these changes in blood oxygen levels are useful markers of neural activity. The fMRI signal is very slow compared to the rapid firing of neurons, but, by applying statistical models to the signal, it is possible to reconstruct maps that highlight brain regions as being more or less active when people are doing particular tasks. If I put you in an fMRI scanner and asked you to think about yourself, it’s a safe bet that I would observe changes in activation in two key parts of the association cortex: the medial PFC and the medial parietal cortex (also known as the precuneus), which collectively are sometimes referred to as the cortical midline structures.

Metacognitive sensitivity is subtly but importantly different from metacognitive bias, which is the overall tendency to be more or less confident. While on average I might be overconfident, if I am still aware of each time I make an error (the Ds in the table), then I can still achieve a high level of metacognitive sensitivity. We can quantify people’s metacognitive sensitivity by fitting parameters from statistical models to people’s confidence ratings (with names such as meta-d’ and Φ). Ever more sophisticated models are being developed, but they ultimately all boil down to quantifying the extent to which our self-evaluations track whether we are actually right or wrong.4 What Makes One Person’s Metacognition Better than Another’s?
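The meta-d' and Φ models mentioned here are beyond a short sketch, but a simple nonparametric stand-in for metacognitive sensitivity is how well trial-by-trial confidence ratings discriminate correct from incorrect answers (a so-called type-2 AUROC). The ratings below are invented.

```python
# Type-2 AUROC as a rough proxy for metacognitive sensitivity
# (0.5 = no sensitivity, 1.0 = perfect discrimination of right from wrong answers).
import numpy as np
from sklearn.metrics import roc_auc_score

correct    = np.array([1, 1, 0, 1, 0, 1, 0, 0, 1, 1])     # 1 = right, 0 = wrong
confidence = np.array([0.9, 0.8, 0.75, 0.7, 0.4, 0.85, 0.3, 0.55, 0.6, 0.95])

print("metacognitive sensitivity (type-2 AUROC):", roc_auc_score(correct, confidence))
# A uniformly overconfident observer shifts all ratings up (a bias) but can still
# score high here, as long as errors still receive relatively lower ratings.
```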

This kind of self-endorsement of our choices is a key aspect of decision-making, and it can have profound consequences for whether we decide to reverse or undo such decisions. Together with our colleagues Neil Garrett and Ray Dolan, Benedetto and I set out to investigate people’s self-awareness about their subjective choices in the lab. In order to apply the statistical models of metacognition that we encountered in Chapter 4, we needed to get people to make lots of choices, one after the other, and rate their confidence in choosing the best option—a proxy for whether they in fact wanted what they chose. We collected a set of British snacks, such as chocolate bars and crisps, and presented people with all possible pairs of items to choose between (hundreds of pairs in total).

Learn Algorithmic Trading
by Sebastien Donadio
Published 7 Nov 2019

In-sample versus out-of-sample data. When building a statistical model, we use cross-validation to avoid overfitting. Cross-validation imposes a division of the data into two or three different sets. One set will be used to create your model, while the other sets will be used to validate the model's accuracy. Because the model has not been created with the other datasets, we will have a better idea of its performance. When testing a trading strategy with historical data, it is important to set aside a portion of the data for testing. In a statistical model, the initial data used to create the model is called the training data.
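A minimal sketch of the split described above, done chronologically rather than by random shuffling, since shuffled splits leak future information when backtesting. The series and the split fraction are arbitrary.

```python
# Chronological in-sample / out-of-sample split on a synthetic price series.
import numpy as np

prices = np.cumsum(np.random.default_rng(7).normal(0, 1, size=1000)) + 100

split = int(len(prices) * 0.7)
train, test = prices[:split], prices[split:]    # in-sample vs out-of-sample

# Build and tune the strategy on `train` only, then evaluate it once on `test`.
print(len(train), "in-sample points,", len(test), "out-of-sample points")
```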

R is not significantly more recent than Python. It was released in 1995 by its two founders, Ross Ihaka and Robert Gentleman, while Python was released in 1991 by Guido van Rossum. Today, R is mainly used by the academic and research world. Unlike many other languages, Python and R allow us to write a statistical model with a few lines of code. Because they both have their own advantages, it is impossible to choose one over the other, and they can easily be used in a complementary manner. Developers have created a multitude of libraries that make it easy to use one language in conjunction with the other.

The last step of the time series analysis is to forecast the time series. We have two possible scenarios: a strictly stationary series without dependencies among values, for which we can use a regular linear regression to forecast values; or a series with dependencies among values, for which we will be forced to use other statistical models. In this chapter, we chose to focus on using the Auto-Regressive Integrated Moving Average (ARIMA) model. This model has three parameters: Autoregressive (AR) term (p): lags of the dependent variable. For example, with p = 3, the predictors for x(t) are x(t-1), x(t-2), and x(t-3). Moving average (MA) term (q): lags of the errors in prediction.
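A short sketch of fitting an ARIMA model with statsmodels on a synthetic series; the order (1, 1, 1) is purely illustrative and is not a recommendation from the book.

```python
# Fit an ARIMA(p, d, q) model and produce a few forecasts (synthetic data).
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(8)
series = np.cumsum(rng.normal(0.1, 1.0, size=300))    # a drifting, nonstationary series

model = ARIMA(series, order=(1, 1, 1))                # p=1 AR lag, d=1 difference, q=1 MA lag
result = model.fit()
print(result.params)
print("next 5 forecasts:", result.forecast(steps=5))
```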

Analysis of Financial Time Series
by Ruey S. Tsay
Published 14 Oct 2001

Stable Distribution The stable distributions are a natural generalization of normal in that they are stable under addition, which meets the need of continuously compounded returns rt . Furthermore, stable distributions are capable of capturing excess kurtosis shown by historical stock returns. However, non-normal stable distributions do not have a finite variance, which is in conflict with most finance theories. In addition, statistical modeling using non-normal stable distributions is difficult. An example of non-normal stable distributions is the Cauchy distribution, which is symmetric with respect to its median, but has infinite variance. Scale Mixture of Normal Distributions Recent studies of stock returns tend to use scale mixture or finite mixture of normal distributions.

Furthermore, the lag-ℓ autocovariance of $r_t$ is
$$\gamma_\ell = \operatorname{Cov}(r_t, r_{t-\ell}) = E\left[\left(\sum_{i=0}^{\infty}\psi_i a_{t-i}\right)\left(\sum_{j=0}^{\infty}\psi_j a_{t-\ell-j}\right)\right] = E\left[\sum_{i,j=0}^{\infty}\psi_i\psi_j a_{t-i}a_{t-\ell-j}\right] = \sum_{j=0}^{\infty}\psi_{j+\ell}\,\psi_j\,E\!\left(a_{t-\ell-j}^2\right) = \sigma_a^2\sum_{j=0}^{\infty}\psi_j\psi_{j+\ell}.$$
Consequently, the ψ-weights are related to the autocorrelations of $r_t$ as follows:
$$\rho_\ell = \frac{\gamma_\ell}{\gamma_0} = \frac{\sum_{i=0}^{\infty}\psi_i\psi_{i+\ell}}{1+\sum_{i=1}^{\infty}\psi_i^2}, \qquad \ell \ge 0, \tag{2.5}$$
where $\psi_0 = 1$. Linear time series models are econometric and statistical models used to describe the pattern of the ψ-weights of $r_t$. 2.4 Simple Autoregressive Models. The fact that the monthly return $r_t$ of the CRSP value-weighted index has a statistically significant lag-1 autocorrelation indicates that the lagged return $r_{t-1}$ might be useful in predicting $r_t$. A simple model that makes use of such predictive power is
$$r_t = \phi_0 + \phi_1 r_{t-1} + a_t, \tag{2.6}$$
where $\{a_t\}$ is assumed to be a white noise series with mean zero and variance $\sigma_a^2$.
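As a quick numerical check of relation (2.5), not taken from the book: for the AR(1) model (2.6) the ψ-weights are ψ_i = φ1^i, so the autocorrelations reduce to ρ_ℓ = φ1^ℓ, which a short simulation reproduces:

import numpy as np

rng = np.random.default_rng(42)
phi0, phi1, sigma_a = 0.0, 0.6, 1.0
n = 200_000

a = rng.normal(0.0, sigma_a, n)
r = np.zeros(n)
for t in range(1, n):
    r[t] = phi0 + phi1 * r[t - 1] + a[t]      # AR(1) recursion

def acf(x, lag):
    x = x - x.mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

print(acf(r, 1), phi1)          # sample lag-1 autocorrelation vs. 0.6
print(acf(r, 2), phi1 ** 2)     # sample lag-2 autocorrelation vs. 0.36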

If we treat the random-walk model as a special AR(1) model, then the coefficient of $p_{t-1}$ is unity, which does not satisfy the weak stationarity condition of an AR(1) model. A random-walk series is, therefore, not weakly stationary, and we call it a unit-root nonstationary time series. The random-walk model has been widely considered as a statistical model for the movement of logged stock prices. Under such a model, the stock price is not predictable or mean reverting. To see this, the 1-step ahead forecast of model (2.32) at the forecast origin h is $\hat{p}_h(1) = E(p_{h+1} \mid p_h, p_{h-1}, \ldots) = p_h$, which is the log price of the stock at the forecast origin.

pages: 416 words: 39,022

Asset and Risk Management: Risk Oriented Finance
by Louis Esch , Robert Kieffer and Thierry Lopez
Published 28 Nov 2005

Table 6.3 Student distribution quantiles

ν         γ2      z0.95    z0.975    z0.99
5         6.00    2.601    3.319     4.344
10        1.00    2.026    2.491     3.090
15        0.55    1.883    2.289     2.795
20        0.38    1.818    2.199     2.665
25        0.29    1.781    2.148     2.591
30        0.23    1.757    2.114     2.543
40        0.17    1.728    2.074     2.486
60        0.11    1.700    2.034     2.431
120       0.05    1.672    1.997     2.378
normal    0       1.645    1.960     2.326

8 Blattberg R. and Gonedes N., A comparison of stable and student distributions as statistical models for stock prices, Journal of Business, Vol. 47, 1974, pp. 244–80. 9 Pearson E. S. and Hartley H. O., Biometrika Tables for Statisticians, Biometrika Trust, 1976, p. 146. This clearly shows that when the normal law is used in place of the Student laws, the VaR parameter is underestimated unless the number of degrees of freedom is high.
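The same comparison is easy to reproduce with scipy. Note that Table 6.3 applies the book's own variance scaling, so the raw Student quantiles printed below differ from its entries while telling the same story:

from scipy.stats import norm, t

levels = [0.95, 0.975, 0.99]
print("normal:", [round(norm.ppf(p), 3) for p in levels])
for nu in (5, 10, 30, 120):
    print(f"Student nu={nu}:", [round(t.ppf(p, df=nu), 3) for p in levels])
# At every level and for every nu the Student quantile exceeds the normal one, so a
# normal-based VaR understates the loss threshold when returns are really Student-distributed.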

Using pt presents the twofold advantage of:
• making the magnitudes of the various factors likely to be involved in evaluating an asset or portfolio relative;
• supplying a variable that has been shown to be capable of possessing certain distributional properties (normality or quasi-normality for returns on equities, for example).
1 Estimating quantiles is often a complex problem, especially for arguments close to 0 or 1. Interested readers should read Gilchrist W. G., Statistical Modelling with Quantile Functions, Chapman & Hall/CRC, 2000. 2 If the risk factor X is a share price, we are looking at the return on that share (see Section 3.1.1).
Figure 7.1 Estimating VaR (valuation models and historical data feed an estimation technique that produces the VaR).
Note: In most calculation methods, a different expression is taken into consideration: $\Delta^{*}(t) = \ln \frac{X(t)}{X(t-1)}$. As we saw in Section 3.1.1, this is in fact very similar to $\Delta(t)$ and has the advantage that it can take on any real value and that the logarithmic return for several consecutive periods is the sum of the logarithmic returns for each of those periods.
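A tiny numerical illustration of that additivity property, using a made-up price path rather than real market data:

import numpy as np

X = np.array([100.0, 102.0, 99.0, 101.0])    # hypothetical prices X(0), ..., X(3)

log_returns = np.log(X[1:] / X[:-1])         # one-period logarithmic returns
print(log_returns.sum())                     # sum of the per-period log returns
print(np.log(X[-1] / X[0]))                  # log return over the whole span: the same number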

$\Delta(\Delta(\ldots \Delta(y_t)))$, applied r times, instead of $y_t$ (where $\Delta(y_t) = y_t - y_{t-1}$). We therefore use an ARIMA(p, r, q) procedure.16 If this procedure fails because of nonconstant volatility in the error term, it will be necessary to use the ARCH-GARCH or EGARCH models (Appendix 7). B. The equation on the replicated positions. This equation may be estimated by a statistical model (such as the SAS/OR procedure PROC NPL), using multiple regression with the constraints
$$\sum_{i=3\text{ months}}^{15\text{ years}} \alpha_i = 1 \quad \text{and} \quad \alpha_i \ge 0.$$
It is also possible to estimate the replicated positions (b) with the single constraint (by using the SAS/STAT procedure)
$$\sum_{i=3\text{ months}}^{15\text{ years}} \alpha_i = 1.$$
In both cases, the duration of the demand product is a weighted average of the durations.

Science Fictions: How Fraud, Bias, Negligence, and Hype Undermine the Search for Truth
by Stuart Ritchie
Published 20 Jul 2020

Or do you leave them in? Do you split the sample up into separate age groups, or by some other criterion? Do you merge observations from week one and week two and compare them to weeks three and four, or look at each week separately, or make some other grouping? Do you choose this particular statistical model, or that one? Precisely how many ‘control’ variables do you throw in? There aren’t objective answers to these kinds of questions. They depend on the specifics and context of what you’re researching, and on your perspective on statistics (which is, after all, a constantly evolving subject in itself): ask ten statisticians, and you might receive as many different answers.

– we’re looking for generalisable facts about the world (‘what is the link between taking antipsychotic drugs and schizophrenia symptoms in humans in general?’). Figure 3, below, illustrates overfitting. As you can see, we have a set of data: one measurement of rainfall is made each month across the space of a year. We want to draw a line through that data that describes what happens to rainfall over time: the line will be our statistical model of the data. And we want to use that line to predict how much rain will fall in each month next year. The laziest possible solution is just to try a straight line, as in graph 3A – but it looks almost nothing like the data: if we tried to use that line to predict the next year’s measurements, forecasting the exact same amount of rain for every month, we’d do a terribly inaccurate job.
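A sketch of that contrast with made-up monthly rainfall numbers (the data, the seasonal pattern, and the polynomial degrees compared are all illustrative assumptions, not the figure's actual values):

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 12)                        # twelve months on a rescaled axis
pattern = 60 + 25 * np.sin(np.pi * x)             # invented seasonal rainfall pattern
this_year = pattern + rng.normal(0, 8, 12)
next_year = pattern + rng.normal(0, 8, 12)

for degree in (1, 3, 11):
    pred = np.polyval(np.polyfit(x, this_year, degree), x)
    fit_err = float(np.mean((pred - this_year) ** 2))    # how closely the curve hugs this year
    test_err = float(np.mean((pred - next_year) ** 2))   # how well it predicts a fresh year
    print(degree, round(fit_err, 1), round(test_err, 1))
# The degree-11 curve threads through every one of this year's points, so its fit error is
# essentially zero, but that near-perfect fit mostly memorises this year's noise and does not
# carry over to next year's measurements; the straight line misses the seasonal shape entirely.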

For the American Statistical Association’s consensus position on p-values, written surprisingly comprehensibly, see Ronald L. Wasserstein & Nicole A. Lazar, ‘The ASA Statement on p-Values: Context, Process, and Purpose’, The American Statistician 70, no. 2 (2 April 2016): pp. 129–33; https://doi.org/10.1080/00031305.2016.1154108. It defines the p-value like this: ‘the probability under a specified statistical model that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value’ (p. 131). 18. Why does the definition of the p-value (‘how likely is it that pure noise would give you results like the ones you have, or ones with an even larger effect’) have that ‘or an even larger effect’ clause in it?
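That definition can be made concrete with a small permutation test on made-up measurements (the groups and numbers below are illustrative, not data from any study):

import numpy as np

rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 1.0, 20)        # hypothetical measurements for group A
group_b = rng.normal(0.3, 1.0, 20)        # hypothetical measurements for group B
observed = abs(group_a.mean() - group_b.mean())

# Specified statistical model: both groups come from the same distribution (pure noise).
# The p-value is the share of such null datasets whose mean difference is at least as extreme.
pooled = np.concatenate([group_a, group_b])
diffs = []
for _ in range(10_000):
    shuffled = rng.permutation(pooled)
    diffs.append(abs(shuffled[:20].mean() - shuffled[20:].mean()))
print(float(np.mean(np.array(diffs) >= observed)))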

pages: 233 words: 67,596

Competing on Analytics: The New Science of Winning
by Thomas H. Davenport and Jeanne G. Harris
Published 6 Mar 2007

As more tangible benefits began to appear, the CEO’s commitment to competing on analytics grew. In his letter to shareholders, he described the growing importance of analytics and a new growth initiative to “outsmart and outthink” the competition. Analysts expanded their work to use propensity analysis and neural nets (an artificial intelligence technology incorporating nonlinear statistical modeling to identify patterns) to target and provide specialized services to clients with both personal and corporate relationships with the bank. They also began testing some analytically enabled new services for trust clients. Today, BankCo is well on its way to becoming an analytical competitor.

They can also be used to help streamline the flow of information or products—for example, they can help employees of health care organizations decide where to send donated organs according to criteria ranging from blood type to geographic limitations. Emerging Analytical Technologies These are some of the leading-edge technologies that will play a role in analytical applications over the next few years: Text categorization is the process of using statistical models or rules to rate a document’s relevance to a certain topic. For example, text categorization can be used to dynamically evaluate competitors’ product assortments on their Web sites. Genetic algorithms are a class of stochastic optimization methods that use principles found in natural genetic reproduction (crossover or mutations of DNA structures).

Commercially purchased analytical applications usually have an interface to be used by information workers, managers, and analysts. But for proprietary analyses, the presentation tools determine how different classes of individuals can use the data. For example, a statistician could directly access a statistical model, but most managers would hesitate to do so. A new generation of visual analytical tools—from new vendors such as Spotfire and Visual Sciences and from traditional analytics providers such as SAS—allow the manipulation of data and analyses through an intuitive visual interface. A manager, for example, could look at a plot of data, exclude outlier values, and compute a regression line that fits the data—all without any statistical skills.

pages: 338 words: 104,815

Nobody's Fool: Why We Get Taken in and What We Can Do About It
by Daniel Simons and Christopher Chabris
Published 10 Jul 2023

The platform calculates separate ratings for games played under different time limits. For regular games, in which each player has ten or more minutes in total for all their moves, lazzir’s rating had gained 1,442 rating points in eleven days—after having been almost unchanged for the previous five years. According to the statistical model underpinning the rating system, that 1,442-point gain meant that the lazzir who beat Chris would have been over a 1,000-to-1 favorite to beat the lazzir of just two weeks earlier. No one in chess gets better so consistently over such a short time window; even the fictional Beth Harmon from The Queen’s Gambit had more setbacks in her meteoric rise to the top.
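Chess.com's ratings are Glicko-based, but the arithmetic behind "over a 1,000-to-1 favorite" can be sketched with the simpler Elo logistic curve, which gives the same order of magnitude:

def expected_score(rating_diff):
    # Probability that the lower-rated player wins, under the Elo logistic formula.
    return 1.0 / (1.0 + 10.0 ** (rating_diff / 400.0))

p_old_lazzir_wins = expected_score(1442)
print(p_old_lazzir_wins)              # about 0.00025
print(1 / p_old_lazzir_wins - 1)      # roughly 4,000-to-1 against, comfortably over 1,000-to-1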

Mysteriously, lazzir stopped playing on the site a couple of days later, and within months, his account was permanently closed for violating Chess.com’s “fair play” policy. The lazzir case is not an isolated one: Chess.com closes about eight hundred accounts every day for cheating, often because their behavior too closely matches statistical models of what a nonhuman entity would produce. An absence of noise, of the human tendency to make occasional blunders in complex situations, is a critical signal.14 COME ON, FEEL THE NOISE Most people and organizations think of noise in human behavior as a problem to eliminate. That’s the meaning of noise popularized by Daniel Kahneman, Olivier Sibony, and Cass Sunstein in their book Noise: problematic, unpredictable, or unjustified variability in performance between decisionmakers.

Lehrer, The Smarter Screen: Surprising Ways to Influence and Improve Online Behavior (New York: Portfolio, 2015), 127. The time-reversal heuristic was proposed in a blog post by Andrew Gelman, “The Time-Reversal Heuristic—a New Way to Think About a Published Finding That Is Followed Up by a Large, Preregistered Replication (in Context of Claims About Power Pose),” Statistical Modeling, Causal Inference, and Social Science, January 26, 2016 [https://statmodeling.stat.columbia.edu/2016/01/26/more-power-posing/]. 25. L. Magrath, and L. Weld, “Abusive Earnings Management and Early Warning Signs,” CPA Journal, August 2002, 50–54. Kenneth Lay’s indictment lays out the nature of the manipulation used to beat estimates [https://www.justice.gov/archive/opa/pr/2004/July/04_crm_470.htm]; he was convicted in 2006 [https://www.justice.gov/archive/opa/pr/2006/May/06_crm_328.html].

pages: 252 words: 72,473

Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy
by Cathy O'Neil
Published 5 Sep 2016

Their spectacular failure comes, instead, from what they chose not to count: tuition and fees. Student financing was left out of the model. This brings us to the crucial question we’ll confront time and again. What is the objective of the modeler? In this case, put yourself in the place of the editors at U.S. News in 1988. When they were building their first statistical model, how would they know when it worked? Well, it would start out with a lot more credibility if it reflected the established hierarchy. If Harvard, Stanford, Princeton, and Yale came out on top, it would seem to validate their model, replicating the informal models that they and their customers carried in their own heads.

A child places her finger on the stove, feels pain, and masters for the rest of her life the correlation between the hot metal and her throbbing hand. And she also picks up the word for it: burn. A machine learning program, by contrast, will often require millions or billions of data points to create its statistical models of cause and effect. But for the first time in history, those petabytes of data are now readily available, along with powerful computers to process them. And for many jobs, machine learning proves to be more flexible and nuanced than the traditional programs governed by rules. Language scientists, for example, spent decades, from the 1960s to the early years of this century, trying to teach computers how to read.

Probably not a model trained on such demographic and behavioral data. I should note that in the statistical universe proxies inhabit, they often work. More times than not, birds of a feather do fly together. Rich people buy cruises and BMWs. All too often, poor people need a payday loan. And since these statistical models appear to work much of the time, efficiency rises and profits surge. Investors double down on scientific systems that can place thousands of people into what appear to be the correct buckets. It’s the triumph of Big Data. And what about the person who is misunderstood and placed in the wrong bucket?

pages: 250 words: 79,360

Escape From Model Land: How Mathematical Models Can Lead Us Astray and What We Can Do About It
by Erica Thompson
Published 6 Dec 2022

Such models can now write poetry, answer questions, compose articles and hold conversations. They do this by scraping a huge archive of text produced by humans – basically most of the content of the internet plus a lot of books, probably with obviously offensive words removed – and creating statistical models that link one word with the probability of the next word given a context. And they do it remarkably well, to the extent that it is occasionally difficult to tell whether text has been composed by a human or by a language model. Bender, Gebru and colleagues point out some of the problems with this.
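A toy bigram model shows the basic idea of linking one word to the probability of the next word given what came before, albeit at a microscopic scale compared with the systems described above (the corpus is invented):

from collections import Counter, defaultdict

corpus = "the model writes text . the model answers questions . the text is plain".split()

# Count how often each word follows each preceding word.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_word_probs(prev):
    total = sum(counts[prev].values())
    return {word: c / total for word, c in counts[prev].items()}

print(next_word_probs("the"))    # roughly {'model': 2/3, 'text': 1/3} on this toy corpus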

I think the most coherent argument here is that we often need to impose as much structure on the model as we can, to represent the areas in which we do genuinely have physical confidence, in order to avoid overfitting. If we are willing to calibrate everything with respect to data, then we will end up with a glorified statistical model overfitted to that data rather than something that reflects our expert judgement about the underlying mechanisms involved. If I can’t make a reasonable model without requiring that π=4 or without violating conservation of mass, then there must be something seriously wrong with my other assumptions.

Second, even if we had 200+ years of past data, we are unsure whether the conditions that generate flood losses have remained the same: perhaps flood barriers have been erected; perhaps a new development has been built on the flood plain; perhaps agricultural practices upstream have changed; perhaps extreme rainfall events have become more common. Our simple statistical model of an extreme flood will have to change to take all this into account. Given all of those factors, our calculation of a 1-in-200-year flood event will probably come with a considerable level of uncertainty, and that’s before we start worrying about correcting for returns on investment, inflation or other changes in valuation.

pages: 197 words: 35,256

NumPy Cookbook
by Ivan Idris
Published 30 Sep 2012

If not specified, first-order differences are computed.
log: calculates the natural log of the elements in a NumPy array.
sum: sums the elements of a NumPy array.
dot: does matrix multiplication for 2D arrays and calculates the inner product for 1D arrays.
Installing scikits-statsmodels. The scikits-statsmodels package focuses on statistical modeling. It can be integrated with NumPy and Pandas (more about Pandas later in this chapter). How to do it... Source and binaries can be downloaded from http://statsmodels.sourceforge.net/install.html. If you are installing from source, you need to run the following command:
python setup.py install
If you are using setuptools, the command is:
easy_install statsmodels
Performing a normality test with scikits-statsmodels. The scikits-statsmodels package has lots of statistical tests.

The data in the Dataset class of statsmodels follows a special format. Among others, this class has the endog and exog attributes. Statsmodels has a load function, which loads data as NumPy arrays. Instead, we used the load_pandas method, which loads data as Pandas objects. We did an OLS fit, basically giving us a statistical model for copper price and consumption. Resampling time series data In this tutorial, we will learn how to resample time series with Pandas. How to do it... We will download the daily price time series data for AAPL, and resample it to monthly data by computing the mean. We will accomplish this by creating a Pandas DataFrame, and calling its resample method.
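A minimal sketch of that resampling step with pandas, using a made-up daily series in place of the downloaded AAPL prices:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
days = pd.date_range("2012-01-01", periods=120, freq="D")
prices = pd.Series(100 + np.cumsum(rng.standard_normal(120)), index=days)

monthly = prices.resample("M").mean()    # downsample the daily series to monthly means
print(monthly)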

pages: 396 words: 117,149

The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World
by Pedro Domingos
Published 21 Sep 2015

we can ask, “What is the algorithm that produces this output?” We will soon see how to turn this insight into concrete learning algorithms. Some learners learn knowledge, and some learn skills. “All humans are mortal” is a piece of knowledge. Riding a bicycle is a skill. In machine learning, knowledge is often in the form of statistical models, because most knowledge is statistical: all humans are mortal, but only 4 percent are Americans. Skills are often in the form of procedures: if the road curves left, turn the wheel left; if a deer jumps in front of you, slam on the brakes. (Unfortunately, as of this writing Google’s self-driving cars still confuse windblown plastic bags with deer.)

If you can tell which e-mails are spam, you know which ones to delete. If you can tell how good a board position in chess is, you know which move to make (the one that leads to the best position). Machine learning takes many different forms and goes by many different names: pattern recognition, statistical modeling, data mining, knowledge discovery, predictive analytics, data science, adaptive systems, self-organizing systems, and more. Each of these is used by different communities and has different associations. Some have a long half-life, some less so. In this book I use the term machine learning to refer broadly to all of them.

They called this scheme the EM algorithm, where the E stands for expectation (inferring the expected probabilities) and the M for maximization (estimating the maximum-likelihood parameters). They also showed that many previous algorithms were special cases of EM. For example, to learn hidden Markov models, we alternate between inferring the hidden states and estimating the transition and observation probabilities based on them. Whenever we want to learn a statistical model but are missing some crucial information (e.g., the classes of the examples), we can use EM. This makes it one of the most popular algorithms in all of machine learning. You might have noticed a certain resemblance between k-means and EM, in that they both alternate between assigning entities to clusters and updating the clusters’ descriptions.
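A compact sketch of EM for a two-component Gaussian mixture, where the missing information is which cluster each point came from (the data and starting values are invented for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])   # unlabeled data

w = np.array([0.5, 0.5])          # initial mixing weights
mu = np.array([-1.0, 1.0])        # initial means
sigma = np.array([1.0, 1.0])      # initial standard deviations

for _ in range(50):
    # E-step: expected probability that each point belongs to each component.
    dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    resp = w * dens
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: maximum-likelihood parameters given those expected assignments.
    n_k = resp.sum(axis=0)
    w = n_k / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / n_k
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / n_k)

print(w.round(2), mu.round(2), sigma.round(2))   # should land near 0.3/0.7, -2/3, and 1/1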

pages: 481 words: 125,946

What to Think About Machines That Think: Today's Leading Thinkers on the Age of Machine Intelligence
by John Brockman
Published 5 Oct 2015

At the University of Chicago Booth School of Business, where I teach, recruiters devote endless hours to interviewing students on campus for potential jobs—a process that selects the few who will be invited to visit the employer, where they will undergo another extensive set of interviews. Yet research shows that interviews are nearly useless in predicting whether a job prospect will perform well on the job. Compared to a statistical model based on objective measures such as grades in courses relevant to the job in question, interviews primarily add noise and introduce the potential for prejudice. (Statistical models don’t favor any particular alma mater or ethnic background and cannot detect good looks.) These facts have been known for more than four decades, but hiring practices have barely budged. The reason is simple: Each of us just knows that if we are the one conducting an interview, we will learn a lot about the candidate.

There’s an algorithm for computing the optimal action for achieving a desired outcome, but it’s computationally expensive. Experiments have found that simple learning algorithms with lots of training data often outperform complex hand-crafted models. Today’s systems primarily provide value by learning better statistical models and performing statistical inference for classification and decision making. The next generation will be able to create and improve their own software and are likely to self-improve rapidly. In addition to improving productivity, AI and robotics are drivers for numerous military and economic arms races.

More disturbing to me is the stubborn reluctance in many segments of society to allow computers to take over tasks that simple models perform demonstrably better than humans. A literature pioneered by psychologists such as the late Robyn Dawes finds that virtually any routine decision-making task—detecting fraud, assessing the severity of a tumor, hiring employees—is done better by a simple statistical model than by a leading expert in the field. Let me offer just two illustrative examples, one from human-resource management and the other from the world of sports. First, let’s consider the embarrassing ubiquity of job interviews as an important, often the most important, determinant of who gets hired.

Text Analytics With Python: A Practical Real-World Approach to Gaining Actionable Insights From Your Data
by Dipanjan Sarkar
Published 1 Dec 2016

Even though we have a large number of machine learning and data analysis techniques at our disposal, most of them are tuned to work with numerical data, hence we have to resort to areas like natural language processing (NLP ) and specialized techniques, transformations, and algorithms to analyze text data, or more specifically natural language, which is quite different from programming languages that are easily understood by machines. Remember that textual data, being highly unstructured, does not follow or adhere to structured or regular syntax and patterns—hence we cannot directly use mathematical or statistical models to analyze it. Before we dive into specific techniques and algorithms to analyze textual data, we will be going over some of the main concepts and theoretical principles associated with the nature of text data in this chapter. The primary intent here is to get you familiarized with concepts and domains associated with natural language understanding, processing, and text analytics.

Note here the emphasis on corpus of documents because the more diverse the set of documents you have, the more topics or concepts you can generate—unlike with a single document, where you will not get too many topics or concepts if it talks about a singular concept. Topic models are also often known as probabilistic statistical models, which use specific statistical techniques including singular value decomposition and latent Dirichlet allocation to discover connected latent semantic structures in text data that yield topics and concepts. They are used extensively in text analytics and even bioinformatics. Automated document summarization is the process of using a computer program or algorithm based on statistical and ML techniques to summarize a document or corpus of documents such that we obtain a short summary that captures all the essential concepts and themes of the original document or corpus.

The end result is still in the form of some document, but with a few sentences based on the length we might want the summary to be. This is similar to having a research paper with an abstract or an executive summary. The main objective of automated document summarization is to perform this summarization without involving human inputs except for running any computer programs. Mathematical and statistical models help in building and automating the task of summarizing documents by observing their content and context. There are mainly two broad approaches towards document summarization using automated techniques: Extraction-based techniques: These methods use mathematical and statistical concepts like SVD to extract some key subset of content from the original document such that this subset of content contains the core information and acts as the focal point of the entire document.

pages: 447 words: 104,258

Mathematics of the Financial Markets: Financial Instruments and Derivatives Modelling, Valuation and Risk Issues
by Alain Ruttiens
Published 24 Apr 2013

FABOZZI, The Mathematics of Financial Modeling and Investment Management, John Wiley & Sons, Inc., Hoboken, 2004, 800 p. Lawrence GALITZ, Financial Times Handbook of Financial Engineering, FT Press, 3rd ed. Scheduled on November 2011, 480 p. Philippe JORION, Financial Risk Manager Handbook, John Wiley & Sons, Inc., Hoboken, 5th ed., 2009, 752 p. Tze Leung LAI, Haipeng XING, Statistical Models and Methods for Financial Markets, Springer, 2008, 374 p. David RUPPERT, Statistics and Finance, An Introduction, Springer, 2004, 482 p. Dan STEFANICA, A Primer for the Mathematics of Financial Engineering, FE Press, 2011, 352 p. Robert STEINER, Mastering Financial Calculations, FT Prentice Hall, 1997, 400 p.

More generally, Jarrow has developed some general but very useful considerations about model risk in an article devoted to risk management models, but valid for any kind of (financial) mathematical model.17 In his article, Jarrow distinguishes between statistical and theoretical models: the former refer to modeling the evolution of a market price or return, based on historical data, such as a GARCH model. What is usually developed as “quantitative models” by some fund or portfolio managers also belongs to the statistical models. On the other hand, theoretical models aim to evidence some causality based on financial or economic reasoning, for example the Black–Scholes formula. Both types of model imply some assumptions: Jarrow distinguishes between robust and non-robust assumptions, depending on the size of the impact when the assumption is slightly modified.

MAILLET (eds), Multi-Moment Asset Allocation and Pricing Models, John Wiley & Sons, Ltd, Chichester, 2006, 233 p. Ioannis KARATZAS, Steven E. SHREVE, Methods of Mathematical Finance, Springer, 2010, 430 p. Donna KLINE, Fundamentals of the Futures Market, McGraw-Hill, 2000, 256 p. Tze Leung LAI, Haipeng XING, Statistical Models and Methods for Financial Markets, Springer, 2008, 374 p. Raymond M. LEUTHOLD, Joan C. JUNKUS, Jean E. CORDIER, The Theory and Practice of Futures Markets, Stipes Publishing, 1999, 410 p. Bob LITTERMAN, Modern Investment Management – An Equilibrium Approach, John Wiley & Sons, Inc., Hoboken, 2003, 624 p.

pages: 518 words: 147,036

The Fissured Workplace
by David Weil
Published 17 Feb 2014

Using a statistical model to predict the factors that increase the likelihood of contracting out specific types of jobs, Abraham and Taylor demonstrate that the higher the typical wage for the workforce at an establishment, the more likely that establishment will contract out its janitorial work. They also show that establishments that do any contracting out of janitorial workers tend to shift out the function entirely.36 Wages and benefits for workers employed directly versus contracted out can be compared given the significant number of people in both groups. Using statistical models that control for both observed characteristics of the workers and the places in which they work, several studies directly compare the wages and benefits for these occupations.

That competition (and franchising only indirectly) might lead them to have higher incentives to not comply. Alternatively, company-owned outlets might be in locations with stronger consumer markets, higher-skilled workers, or lower crime rates, all of which might also be associated with compliance. To adequately account for these problems, statistical models that consider all of the potentially relevant factors, including franchise status, are generated to predict compliance levels. By doing so, the effect of franchising can be examined, holding other factors constant. This allows measurement of the impact on compliance of an outlet being run by a franchisee with otherwise identical features, as opposed to a company-owned outlet.

This narrative is based on Federal Mine Safety and Health Review Commission, Secretary of Labor MSHA v. Ember Contracting Corporation, Office of Administrative Law Judges, November 4, 2011. I am grateful to Greg Wagner for flagging this case and to Andrew Razov for additional research on it. 26. These estimates are based on quarterly mining data from 2000–2010. Using statistical modeling techniques, two different measures of traumatic injuries and a direct measure of fatality rates are associated with contracting status of the mine operator as well as other explanatory factors, including mining method, physical attributes of the mine, union status, size of operations, year, and location.

pages: 721 words: 197,134

Data Mining: Concepts, Models, Methods, and Algorithms
by Mehmed Kantardzić
Published 2 Jan 2003

Throughout this book, we will simply call this subset of the population a data set, to eliminate confusion between the two definitions of sample: one (explained earlier) denoting the description of a single entity in the population, and the other (given here) referring to the subset of a population. From a given data set, we build a statistical model of the population that will help us to make inferences concerning that same population. If our inferences from the data set are to be valid, we must obtain samples that are representative of the population. Very often, we are tempted to choose a data set by selecting the most convenient members of the population.

Generalized linear regression models are currently the most frequently applied statistical techniques. They are used to describe the relationship between the trend of one variable and the values taken by several other variables. Modeling this type of relationship is often called linear regression. Fitting models is not the only task in statistical modeling. We often want to select one of several possible models as being the most appropriate. An objective method for choosing between different models is called ANOVA, and it is described in Section 5.5. The relationship that fits a set of data is characterized by a prediction model called a regression equation.

All these ideas are still in their infancy, and we expect that the next generation of text-mining techniques and tools will improve the quality of information and knowledge discovery from text. 11.7 LATENT SEMANTIC ANALYSIS (LSA) LSA is a method that was originally developed to improve the accuracy and effectiveness of IR techniques by focusing on semantic meaning of words across a series of usage contexts, as opposed to using simple string-matching operations. LSA is a way of partitioning free text using a statistical model of word usage that is similar to eigenvector decomposition and factor analysis. Rather than focusing on superficial features such as word frequency, this approach provides a quantitative measure of semantic similarities among documents based on a word’s context. Two major shortcomings to the use of term counts are synonyms and polysemes.
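A small sketch of the idea with scikit-learn, using TF-IDF weights and a truncated SVD as the decomposition (a toy stand-in for a real LSA pipeline, with invented documents):

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the car was driven on the road",
    "the automobile was driven on the highway",
    "a truck was driven on the road",
    "the chef cooked dinner in the kitchen",
]

tfidf = TfidfVectorizer().fit_transform(docs)        # term weights per document
lsa = TruncatedSVD(n_components=2, random_state=0)   # project into a small semantic space
doc_vectors = lsa.fit_transform(tfidf)

# The three driving documents come out far more similar to one another than to the cooking
# one, even though "car", "automobile", and "truck" are different strings.
print(cosine_similarity(doc_vectors).round(2))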

pages: 451 words: 103,606

Machine Learning for Hackers
by Drew Conway and John Myles White
Published 10 Feb 2012

Knowing the number of nonzero coefficients is useful because many people would like to be able to assert that only a few inputs really matter, and we can assert this more confidently if the model performs well even when assigning zero weight to many of the inputs. When the majority of the inputs to a statistical model are assigned zero coefficients, we say that the model is sparse. Developing tools for promoting sparsity in statistical models is a major topic in contemporary machine learning research. The second column, %Dev, is essentially the R2 for this model. For the top row, it’s 0% because you have a zero coefficient for the one input variable and therefore can’t get better performance than just using a constant intercept.

This is easiest to see in a residuals plot, as shown in panel C of Figure 6-1. In this plot, you can see all of the structure of the original data set, as none of the structure is captured by the default linear regression model. Using ggplot2's geom_smooth function without any method argument, we can fit a more complex statistical model called a Generalized Additive Model (or GAM for short) that provides a smooth, nonlinear representation of the structure in our data:
set.seed(1)
x <- seq(-10, 10, by = 0.01)
y <- 1 - x ^ 2 + rnorm(length(x), 0, 5)
ggplot(data.frame(X = x, Y = y), aes(x = X, y = Y)) +
  geom_point() +
  geom_smooth(se = FALSE)
The result, shown in panel D of Figure 6-1, lets us immediately see that we want to fit a curved line instead of a straight line to this data set.

Quantitative Trading: How to Build Your Own Algorithmic Trading Business
by Ernie Chan
Published 17 Nov 2008

Data-Snooping Bias: In Chapter 2, I mentioned data-snooping bias—the danger that backtest performance is inflated relative to the future performance of the strategy because we have overoptimized the parameters of the model based on transient noise in the historical data. Data-snooping bias is pervasive in the business of building predictive statistical models from historical data, but it is especially serious in finance because of the limited amount of independent data we have. High-frequency data, while in abundant supply, is useful only for high-frequency models. And while we have stock market data stretching back to the early parts of the twentieth century, only data within the past 10 years are really suitable for building predictive models.

He also co-manages EXP Quantitative Investments, LLC and publishes the Quantitative Trading blog (epchan.blogspot.com), which is syndicated to multiple financial news services including www.tradingmarkets.com and Yahoo! Finance. He has been quoted by the New York Times and CIO magazine on quantitative hedge funds, and has appeared on CNBC’s Closing Bell. Ernie is an expert in developing statistical models and advanced computer algorithms to discover patterns and trends from large quantities of data. He was a researcher in computer science at IBM’s T. J. Watson Research Center, in data mining at Morgan Stanley, and in statistical arbitrage trading at Credit Suisse. He has also been a senior quantitative strategist and trader at various hedge funds, with sizes ranging from millions to billions of dollars.

pages: 442 words: 39,064

Why Stock Markets Crash: Critical Events in Complex Financial Systems
by Didier Sornette
Published 18 Nov 2002

For this purpose, I shall describe a new set of computational methods that are capable of searching and comparing patterns, simultaneously and iteratively, at multiple scales in hierarchical systems. I shall use these patterns to improve the understanding of the dynamical state before and after a financial crash and to enhance the statistical modeling of social hierarchical systems with the goal of developing reliable forecasting skills for these large-scale financial crashes. IS PREDICTION POSSIBLE? A WORKING HYPOTHESIS With the low of 3227 on April 17, 2000, identified as the end of the “crash,” the Nasdaq Composite index lost in five weeks over 37% of its all-time high of 5133 reached on March 10, 2000.

In reality, the three crashes occurred in less than one century. This result is a first indication that the exponential model may not apply to the large crashes. As an additional test, 10,000 so-called synthetic data sets, each covering a time span close to a century, hence adding up to about 1 million years, were generated using a standard statistical model used by the financial industry [46]. We use the model version GARCH(1,1) estimated from the true index with a Student distribution with four degrees of freedom. This model includes both the nonstationarity of volatilities (the amplitude of price variations) and the fat-tailed nature of the distribution of the price returns seen in Figure 2.7.
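A sketch of simulating such a GARCH(1,1) series with Student-t innovations (the parameter values are illustrative assumptions, not the ones estimated in the study):

import numpy as np

rng = np.random.default_rng(0)
omega, alpha, beta, nu = 1e-6, 0.08, 0.90, 4     # illustrative GARCH(1,1) parameters
n = 400 * 5                                      # roughly 400 weeks of daily returns

r = np.zeros(n)
var = omega / (1 - alpha - beta)                 # start at the unconditional variance
for t in range(1, n):
    var = omega + alpha * r[t - 1] ** 2 + beta * var
    shock = rng.standard_t(nu) * np.sqrt((nu - 2) / nu)   # unit-variance Student-t noise
    r[t] = np.sqrt(var) * shock

print(r.std(), r.min(), r.max())   # fat-tailed returns with clustered volatility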

More recently, Feigenbaum has examined the first differences for the logarithm of the S&P 500 from 1980 to 1987 and finds that he cannot reject the log-periodic component at the 95% confidence level [127]: in plain words, this means that the probability that the log-periodic component results from chance is about or less than one in twenty. To test furthermore the solidity of the advanced log-periodic hypothesis, Johansen, Ledoit, and I [209] tested whether the null hypothesis that a standard statistical model of financial markets, called the GARCH(1,1) model with Student-distributed noise, could “explain” the presence of log-periodicity. In the 1,000 surrogate data sets of length 400 weeks generated using this GARCH(1,1) model with Student-distributed noise and analyzed as for the real crashes, only two 400-week windows qualified.

pages: 673 words: 164,804

Peer-to-Peer
by Andy Oram
Published 26 Feb 2001

If the seller does a lot of volume, she could have a higher reputation in this system than someone who trades perfectly but has less than three quarters the volume. Other reputation metrics can have high sensitivity to lies or losses of information. Other approaches to reputation are principled.[92] One of the approaches to reputation that I like is working from statistical models of behavior, in which reputation is an unbound model parameter to be determined from the feedback data, using Maximum Likelihood Estimation (MLE). MLE is a standard statistical technique: it chooses model parameters that maximize the likelihood of getting the sample data. The reputation calculation can also be performed with a Bayesian approach.

An entity’s reputation is an ideal to be estimated from the samples as measured by the different entities providing feedback points. An entity’s reputation is accompanied by an expression of the confidence or lack of confidence in the estimate. Our reputation calculator is a platform that accepts different statistical models of how entities might behave during the transaction and in providing feedback. For example, one simple model might assume that an entity’s performance rating follows a normal distribution (bell) curve with some average and standard deviation. To make things even simpler, one can assume that feedback is always given honestly and with no bias.


pages: 336 words: 113,519

The Undoing Project: A Friendship That Changed Our Minds
by Michael Lewis
Published 6 Dec 2016

He helped hire new management, then helped to figure out how to price tickets, and, finally, inevitably, was asked to work on the problem of whom to select in the NBA draft. “How will that nineteen-year-old perform in the NBA?” was like “Where will the price of oil be in ten years?” A perfect answer didn’t exist, but statistics could get you to some answer that was at least a bit better than simply guessing. Morey already had a crude statistical model to evaluate amateur players. He’d built it on his own, just for fun. In 2003 the Celtics had encouraged him to use it to pick a player at the tail end of the draft—the 56th pick, when the players seldom amount to anything. And thus Brandon Hunter, an obscure power forward out of Ohio University, became the first player picked by an equation.* Two years later Morey got a call from a headhunter who said that the Houston Rockets were looking for a new general manager.

The closest he came to certainty was in his approach to making decisions. He never simply went with his first thought. He suggested a new definition of the nerd: a person who knows his own mind well enough to mistrust it. One of the first things Morey did after he arrived in Houston—and, to him, the most important—was to install his statistical model for predicting the future performance of basketball players. The model was also a tool for the acquisition of basketball knowledge. “Knowledge is literally prediction,” said Morey. “Knowledge is anything that increases your ability to predict the outcome. Literally everything you do you’re trying to predict the right thing.

The Indian was DeAndre Jordan all over again; he was, like most of the problems you faced in life, a puzzle, with pieces missing. The Houston Rockets would pass on him—and be shocked when the Dallas Mavericks took him in the second round of the NBA draft. Then again, you never knew.†† And that was the problem: You never knew. In Morey’s ten years of using his statistical model with the Houston Rockets, the players he’d drafted, after accounting for the draft slot in which they’d been taken, had performed better than the players drafted by three-quarters of the other NBA teams. His approach had been sufficiently effective that other NBA teams were adopting it. He could even pinpoint the moment when he felt, for the first time, imitated.

pages: 49 words: 12,968

Industrial Internet
by Jon Bruner
Published 27 Mar 2013

“Imagine trying to operate a highway system if all you have are monthly traffic readings for a few spots on the road. But that’s what operating our power system was like.” The utility’s customers benefit, too — an example of the industrial internet creating value for every entity to which it’s connected. Fort Collins utility customers can see data on their electric usage through a Web portal that uses a statistical model to estimate how much electricity they’re using on heating, cooling, lighting and appliances. The site then draws building data from county records to recommend changes to insulation and other improvements that might save energy. Water meters measure usage every hour — frequent enough that officials will soon be able to dispatch inspection crews to houses whose vacationing owners might not know about a burst pipe.

pages: 1,088 words: 228,743

Expected Returns: An Investor's Guide to Harvesting Market Rewards
by Antti Ilmanen
Published 4 Apr 2011

Note, though, that most academic studies rely on such in-sample relations; econometricians simply assume that any observed statistical relation between predictors and subsequent market returns was already known to rational investors in real time. Practitioners who find this assumption unrealistic try to avoid in-sample bias by selecting and/or estimating statistical models repeatedly, using only data that were available at each point in time, so as to assess predictability in a quasi-out-of-sample sense, but never completely succeeding in doing so.
Table 8.6. Correlations with future excess returns of the S&P 500, 1962–2009. Sources: Haver Analytics, Robert Shiller's website, Amit Goyal's website, own calculations.

They treat default (or rating change) as a random event whose probability can be estimated from observed market prices in the context of an analytical model (or directly from historical default data). Useful indicators, besides equity volatility and leverage, include past equity returns, certain financial ratios, and proxies for the liquidity premium. This modeling approach is sort of a compromise between statistical models and theoretically purer structural models. Reduced-form models can naturally match market spreads better than structural models, but unconstrained indicator selection can make them overfitted to in-sample data. Box 10.1. (wonkish) Risk-neutral and actual default probabilities Under certain assumptions (continuous trading, a single-factor diffusion process), positions in risky assets can be perfectly hedged and thus should earn riskless return.

However, there is some evidence of rising correlations across all quant strategies, presumably due to common positions among leveraged traders. 12.7 NOTES [1] Like many others, I prefer to use economic intuition as one guard against data mining, but the virtues of such intuition can be overstated as our intuition is inevitably influenced by past experiences. Purely data-driven statistical approaches are even worse, but at least then statistical models can help assess the magnitude of data-mining bias. [2] Here are some additional points on VMG: —No trading costs or financing costs related to shorting are subtracted from VMG returns. This is typical for academic studies because such costs are trade specific and/or investor specific and, moreover, such data are not available over long histories.

pages: 467 words: 116,094

I Think You'll Find It's a Bit More Complicated Than That
by Ben Goldacre
Published 22 Oct 2014

Obviously, there are no out gay people in the eighteen-to-twenty-four group who came out at an age later than twenty-four; so the average age at which people in the eighteen-to-twenty-four group came out cannot possibly be greater than the average age of that group, and certainly it will be lower than, say, thirty-seven, the average age at which people in their sixties came out. For the same reason, it’s very likely indeed that the average age of coming out will increase as the average age of each age group rises. In fact, if we assume (in formal terms we could call this a ‘statistical model’) that at any time, all the people who are out have always come out at a uniform rate between the age of ten and their current age, you would get almost exactly the same figures (you’d get fifteen, twenty-three and thirty-five, instead of seventeen, twenty-one and thirty-seven). This is almost certainly why ‘the average coming-out age has fallen by over twenty years’: in fact you could say that Stonewall’s survey has found that on average, as people get older, they get older.
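The arithmetic of that toy model is short enough to spell out; the group mean ages used below are illustrative assumptions rather than figures from the survey:

# If everyone currently out came out at a uniform rate between age ten and their current age,
# the expected coming-out age in a group is roughly the midpoint of 10 and the group's mean age.
for group, mean_age in [("18-24", 21), ("middle-aged", 36), ("sixties", 62)]:
    print(group, (10 + mean_age) / 2)    # about 15.5, 23, and 36: close to the 15/23/35 pattern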

The study concluded that compulsory cycle-helmet legislation may selectively reduce cycling in the second group. There are even more complex second-round effects if each individual cyclist’s safety is improved by increased cyclist density through ‘safety in numbers’, a phenomenon known as Smeed’s law. Statistical models for the overall impact of helmet habits are therefore inevitably complex and based on speculative assumptions. This complexity seems at odds with the current official BMA policy, which confidently calls for compulsory helmet legislation. Standing over all this methodological complexity is a layer of politics, culture and psychology.


pages: 592 words: 125,186

The Science of Hate: How Prejudice Becomes Hate and What We Can Do to Stop It
by Matthew Williams
Published 23 Mar 2021

The combination of unemployed locals and an abundance of employed migrants, competing for scarce resources in a time of recession and cutbacks, creates a greater feeling of ‘us’ versus ‘them’. A lack of inter-cultural interactions and understanding between the local and migrant populations results in rising tensions. Combined with the galvanising effect of the referendum result, these factors create the perfect conditions for hate crime to flourish. In our analysis we used statistical models that take account of a number of factors known to have an effect on hate crimes. In each of the populations of the forty-three police force areas of England and Wales we measured the unemployment rate, average income, educational attainment, health deprivation, general crime rate, barriers to housing and services, quality of living, rate of migrant inflow, and Leave vote share.
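As an illustration of this kind of area-level model, here is a minimal sketch in Python using statsmodels; the data are randomly generated and the variable names are assumptions for illustration, not the study's actual dataset or specification.

# Sketch only: regress a hate-crime rate on area-level covariates.
# All numbers below are fabricated, not the real 43-force data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_areas = 43
df = pd.DataFrame({
    "unemployment_rate": rng.uniform(3, 10, n_areas),   # percent
    "avg_income": rng.uniform(20, 40, n_areas),          # thousands
    "migrant_inflow": rng.uniform(0, 5, n_areas),        # per 1,000 residents
    "leave_vote_share": rng.uniform(30, 70, n_areas),    # percent
})
# Simulate an outcome loosely related to the covariates.
df["hate_crimes_per_100k"] = (
    5 + 1.2 * df["unemployment_rate"] + 0.15 * df["leave_vote_share"]
    + rng.normal(0, 3, n_areas)
)

model = smf.ols(
    "hate_crimes_per_100k ~ unemployment_rate + avg_income"
    " + migrant_inflow + leave_vote_share",
    data=df,
).fit()
print(model.params)  # estimated effect of each factor, holding the others constant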

This psychosocial criminological approach to behaviour can be especially useful in understanding crimes caused in part by the senses of grievance and frustration, such as terrorism and hate crime.1 ‘Instrumental’ crimes, such as burglary and theft, can often be understood as a product of wider social and economic forces. Economic downturns, cuts to state benefits, widespread unemployment, increases in school expulsions, income inequality and poor rental housing stock can all combine to explain much of the variance (the total amount that can be explained in a statistical model) in the propensity of someone to burgle a home or shoplift.*2 Their commission is often rational – ‘I have no money, it’s easier to get it illegitimately than legitimately, and the chances of getting caught are low.’ But these ‘big issue’ drivers do not explain so much of the variance in hate crimes.

. ** A macroeconomic panel regression technique was used by G. Edwards and S. Rushin (‘The Effect of President Trump’s Election on Hate Crimes’, SSRN, 18 January 2018) to rule out a wide range of the most likely alternative explanations for the dramatic increase in hate crimes in the fourth quarter of 2016. While a powerful statistical model, it cannot account for all possible explanations for the rise. To do so, a ‘true experiment’ is required in which one location at random is subjected to the ‘Trump effect’, while another control location is not. As the 2016 presidential election affected all US jurisdictions, there is simply no way of running a true experiment, meaning we cannot say with absolute certainty that Trump’s rise to power caused a rise in hate crimes.

pages: 586 words: 186,548

Architects of Intelligence
by Martin Ford
Published 16 Nov 2018

We do find that people, including myself, have all kinds of speculations about the future, but as a scientist, I like to base my conclusions on the specific data that we’ve seen. And what we’ve seen is people using deep learning as high-capacity statistical models. High capacity is just some jargon that means that the model keeps getting better and better the more data you throw at it. Statistical models that at their core are based on matrices of numbers being multiplied, and added, and subtracted, and so on. They are a long way from something where you can see common sense or consciousness emerging. My feeling is that there’s no data to support these claims and if such data appears, I’ll be very excited, but I haven’t seen it yet.
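To make the "matrices of numbers" point concrete, here is a toy layer of such a model in plain NumPy; the sizes and values are arbitrary.

# One layer of a neural network is a matrix multiplication, an addition,
# and a simple elementwise nonlinearity.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 10))        # a batch of 4 inputs, 10 features each
W = rng.normal(size=(10, 5))        # weights learned from data
b = np.zeros(5)                     # biases learned from data

hidden = np.maximum(0, x @ W + b)   # multiply, add, then ReLU
print(hidden.shape)                 # (4, 5)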

From 1996 to 1999, he worked for Digital Equipment Corporation’s Western Research Lab in Palo Alto, where he worked on low-overhead profiling tools, design of profiling hardware for out-of-order microprocessors, and web-based information retrieval. From 1990 to 1991, Jeff worked for the World Health Organization’s Global Programme on AIDS, developing software to do statistical modeling, forecasting, and analysis of the HIV pandemic. In 2009, Jeff was elected to the National Academy of Engineering, and he was also named a Fellow of the Association for Computing Machinery (ACM) and a Fellow of the American Association for the Advancement of Sciences (AAAS). His areas of interest include large-scale distributed systems, performance monitoring, compression techniques, information retrieval, application of machine learning to search and other related problems, microprocessor architecture, compiler optimizations, and development of new products that organize existing information in new and interesting ways.

I went to Berkeley as a postdoc, and there I started to really think about how what I was doing was relevant to actual problems that people cared about, as opposed to just being mathematically elegant. That was the first time I started to get into machine learning. I then returned to Stanford as faculty in 1995 where I started to work on areas relating to statistical modeling and machine learning. I began studying applied problems where machine learning could really make a difference. I worked in computer vision, in robotics, and from 2000 on biology and health data. I also had an ongoing interest in technology-enabled education, which led to a lot of experimentation at Stanford into ways in which we could offer an enhanced learning experience.

pages: 619 words: 177,548

Power and Progress: Our Thousand-Year Struggle Over Technology and Prosperity
by Daron Acemoglu and Simon Johnson
Published 15 May 2023

The modern approach bypasses the step of modeling or even understanding how humans make decisions. Instead, it relies on a large data set of humans making correct recognition decisions based on images. It then fits a statistical model to large data sets of image features to predict when humans say that there is a cat in the frame. It subsequently applies the estimated statistical model to new pictures to predict whether there is a cat there or not. Progress was made possible by faster computer processor speed, as well as new graphics processing units (GPUs), originally used to generate high-resolution graphics in video games, which proved to be a powerful tool for data crunching.

There have also been major advances in data storage, reducing the cost of storing and accessing massive data sets, and improvements in the ability to perform large amounts of computation distributed across many devices, aided by rapid advances in microprocessors and cloud computing. Equally important has been progress in machine learning, especially “deep learning,” by using multilayer statistical models, such as neural networks. In traditional statistical analysis a researcher typically starts with a theory specifying a causal relationship. A hypothesis linking the valuation of the US stock market to interest rates is a simple example of such a causal relationship, and it naturally lends itself to statistical analysis for investigating whether it fits the data and for forecasting future movements.

To start with, these approaches will have difficulty with the situational nature of intelligence because the exact situation is difficult to define and codify. Another perennial challenge for statistical approaches is “overfitting,” which is typically defined as using more parameters than justified for fitting some empirical relationship. The concern is that overfitting will make a statistical model account for irrelevant aspects of the data and then lead to inaccurate predictions and conclusions. Statisticians have devised many methods to prevent overfitting—for example, developing algorithms on a different sample than the one in which they are deployed. Nevertheless, overfitting remains a thorn in the side of statistical approaches because it is fundamentally linked to the shortcomings of the current approach to AI: lack of a theory of the phenomena being modeled.
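One of the standard guards mentioned above, developing the model on a different sample than the one it is judged on, looks roughly like this; the data are synthetic and the model is deliberately over-flexible.

# Fit an over-flexible model, then compare its fit on the training sample
# with its fit on a held-out sample it has never seen.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, 60)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(X_train, y_train)

print("R^2 on training data:", model.score(X_train, y_train))  # looks excellent
print("R^2 on held-out data:", model.score(X_test, y_test))    # usually far worse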

pages: 183 words: 17,571

Broken Markets: A User's Guide to the Post-Finance Economy
by Kevin Mellyn
Published 18 Jun 2012

Moreover, distributing risk to large numbers of sophisticated institutions seemed safer than leaving it concentrated on the books of individual banks. Besides, even the Basel-process experts had become convinced that bank risk management had reached a new level of effectiveness through the use of sophisticated statistical models, and the Basel II rules that superseded Basel I especially allowed the largest and most sophisticated banks to use approved models to set their capital requirements. The fly in the ointment of market-centric finance was that it allowed an almost infinite expansion of credit in the economy, but creditworthy risks are by definition finite.

It is critical to understand that a credit score is only a measure of whether a consumer can service a certain amount of credit—that is, make timely interest and principal payments. It is not concerned with the ability to pay off debts over time. What it really measures is the probability that an individual will default. This is a statistical model–based determination, and as such is hostage to historical experience of the behavior of tens of millions of individuals. The factors that over time have proved most predictive include not only behavior—late or missed payments on any bill, not just a loan, signals potential default—but also circumstances.
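A stripped-down sketch of a default-probability model of this kind, fitted to fabricated records rather than real credit histories; the feature names are assumptions for illustration.

# Logistic regression estimating the probability that a borrower defaults,
# trained on made-up behavioural and circumstantial features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 5000
late_payments = rng.poisson(0.5, n)           # missed or late bills in the past year
utilisation = rng.uniform(0, 1, n)            # share of credit limit in use
years_of_history = rng.uniform(0, 30, n)

# Simulate defaults that become more likely with late payments and high utilisation.
logit = -4 + 1.5 * late_payments + 2.0 * utilisation - 0.05 * years_of_history
defaulted = rng.uniform(size=n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([late_payments, utilisation, years_of_history])
model = LogisticRegression(max_iter=1000).fit(X, defaulted)

new_applicant = [[2, 0.9, 3.0]]  # two late payments, 90% utilisation, short history
print("Estimated default probability:", model.predict_proba(new_applicant)[0, 1])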

pages: 238 words: 77,730

Final Jeopardy: Man vs. Machine and the Quest to Know Everything
by Stephen Baker
Published 17 Feb 2011

The Google team had fed millions of translated documents, many of them from the United Nations, into their computers and supplemented them with a multitude of natural-language text culled from the Web. This training set dwarfed their competitors’. Without knowing what the words meant, their computers had learned to associate certain strings of words in Arabic and Chinese with their English equivalents. Since they had so very many examples to learn from, these statistical models caught nuances that had long confounded machines. Using statistics, Google’s computers won hands down. “Just like that, they bypassed thirty years of work on machine translation,” said Ed Lazowska, the chairman of the computer science department at the University of Washington. The statisticians trounced the experts.

The human players were more complicated. Tesauro had to pull together statistics on the thousands of humans who had played Jeopardy: how often they buzzed in, their precision in different levels of clues, their betting patterns for Daily Doubles and Final Jeopardy. From these, the IBM team pieced together statistical models of two humans. Then they put them into action against the model of Watson. The games had none of the life or drama of Jeopardy—no suspense, no jokes, no jingle while the digital players came up with their Final Jeopardy responses. They were only simulations of the scoring dynamics of Jeopardy.

pages: 305 words: 75,697

Cogs and Monsters: What Economics Is, and What It Should Be
by Diane Coyle
Published 11 Oct 2021

Nate Silver writes in his bestseller The Signal and the Noise: The government produces data on literally 45,000 economic indicators each year. Private data providers track as many as four million statistics. The temptation that some economists succumb to is to put all this data into a blender and claim that the resulting gruel is haute cuisine. If you have a statistical model that seeks to explain eleven outputs but has to choose from among four million inputs to do so, many of the relationships it identifies are going to be spurious (Silver 2012). Econometricians, those economists specialising in applied statistics, know well the risk of over-fitting of economic models, the temptation to prefer inaccurate precision to the accurate imprecision that would more properly characterise noisy data.
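The "gruel" problem is easy to demonstrate: with enough candidate predictors, pure noise will correlate impressively with any target series. A small simulation (toy scale, nowhere near four million indicators):

# With many random candidate "indicators", some will correlate strongly with a
# target series by chance alone, even though every one of them is pure noise.
import numpy as np

rng = np.random.default_rng(4)
n_quarters = 40
target = rng.normal(size=n_quarters)              # stand-in for an economic series
candidates = rng.normal(size=(10_000, n_quarters))

correlations = np.array([np.corrcoef(target, c)[0, 1] for c in candidates])
best = np.abs(correlations).max()
print(f"Best absolute correlation among 10,000 noise series: {best:.2f}")
# Usually well above 0.5: impressive-looking, and entirely spurious.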

Gamble, A., 1988, The Free Economy and the Strong State: The Politics of Thatcherism, London, New York: Macmillan. Gawer, A., M. Cusumano, and D. B. Yoffie, 2019, The Business of Platforms: Strategy in the Age of Digital Competition, Innovation, and Power, New York: Harper Business, 2019. Gelman, A., 2013, ‘The Recursion of Pop-Econ’, Statistical Modeling, Causal Inference, and Social Science, 10 May, https://statmodeling.stat.columbia.edu/2013/05/10/the-recursion-of-pop-econ-or-of-trolling/. Gerlach, P., 2017, ‘The Games Economists Play: Why Economics Students Behave More Selfishly than Other Students’, PloS ONE, 12 (9), e0183814, https://doi.org/10.1371/journal.pone.0183814.

The Smartphone Society
by Nicole Aschoff

Federici, Caliban and the Witch, 97 50. James, Sex, Race, and Class, 45. 51. James, Sex, Race, and Class, 50. 52. Machine learning (often conflated with artificial intelligence) refers to a field of study whose goal is to learn statistical models directly from large datasets. Machine learning also refers to the software and algorithms that implement these statistical models and make predictions on new data. 53. Mayer-Schönberger and Cukier, Big Data, 93. 54. Levine, Surveillance Valley, 153. 55. Facebook’s financials can be found in its 2018 10-K filing for the Securities and Exchange Commission: https://www.sec.gov/Archives/edgar/data/1326801/000132680119000009/fb-12312018x10k.htm. 56.

pages: 752 words: 131,533

Python for Data Analysis
by Wes McKinney
Published 30 Dec 2011

Preparation: Cleaning, munging, combining, normalizing, reshaping, slicing and dicing, and transforming data for analysis.
Transformation: Applying mathematical and statistical operations to groups of data sets to derive new data sets. For example, aggregating a large table by group variables.
Modeling and computation: Connecting your data to statistical models, machine learning algorithms, or other computational tools.
Presentation: Creating interactive or static graphical visualizations or textual summaries.
In this chapter I will show you a few data sets and some things we can do with them. These examples are just intended to pique your interest and thus will only be explained at a high level.
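As a small illustration of the "aggregating a large table by group variables" step, in current pandas (toy data):

# Group a table by a categorical column and aggregate each group.
import pandas as pd

df = pd.DataFrame({
    "city": ["Austin", "Austin", "Dallas", "Dallas", "Dallas"],
    "sales": [120, 90, 200, 150, 175],
})

summary = df.groupby("city")["sales"].agg(["count", "sum", "mean"])
print(summary)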

To create a Panel, you can use a dict of DataFrame objects or a three-dimensional ndarray: import pandas.io.data as web pdata = pd.Panel(dict((stk, web.get_data_yahoo(stk, '1/1/2009', '6/1/2012')) for stk in ['AAPL', 'GOOG', 'MSFT', 'DELL'])) Each item (the analogue of columns in a DataFrame) in the Panel is a DataFrame: In [297]: pdata Out[297]: <class 'pandas.core.panel.Panel'> Dimensions: 4 (items) x 861 (major) x 6 (minor) Items: AAPL to MSFT Major axis: 2009-01-02 00:00:00 to 2012-06-01 00:00:00 Minor axis: Open to Adj Close In [298]: pdata = pdata.swapaxes('items', 'minor') In [299]: pdata['Adj Close'] Out[299]: <class 'pandas.core.frame.DataFrame'> DatetimeIndex: 861 entries, 2009-01-02 00:00:00 to 2012-06-01 00:00:00 Data columns: AAPL 861 non-null values DELL 861 non-null values GOOG 861 non-null values MSFT 861 non-null values dtypes: float64(4) ix-based label indexing generalizes to three dimensions, so we can select all data at a particular date or a range of dates like so: In [300]: pdata.ix[:, '6/1/2012', :] Out[300]: Open High Low Close Volume Adj Close AAPL 569.16 572.65 560.52 560.99 18606700 560.99 DELL 12.15 12.30 12.05 12.07 19396700 12.07 GOOG 571.79 572.65 568.35 570.98 3057900 570.98 MSFT 28.76 28.96 28.44 28.45 56634300 28.45 In [301]: pdata.ix['Adj Close', '5/22/2012':, :] Out[301]: AAPL DELL GOOG MSFT Date 2012-05-22 556.97 15.08 600.80 29.76 2012-05-23 570.56 12.49 609.46 29.11 2012-05-24 565.32 12.45 603.66 29.07 2012-05-25 562.29 12.46 591.53 29.06 2012-05-29 572.27 12.66 594.34 29.56 2012-05-30 579.17 12.56 588.23 29.34 2012-05-31 577.73 12.33 580.86 29.19 2012-06-01 560.99 12.07 570.98 28.45 An alternate way to represent panel data, especially for fitting statistical models, is in “stacked” DataFrame form: In [302]: stacked = pdata.ix[:, '5/30/2012':, :].to_frame() In [303]: stacked Out[303]: Open High Low Close Volume Adj Close major minor 2012-05-30 AAPL 569.20 579.99 566.56 579.17 18908200 579.17 DELL 12.59 12.70 12.46 12.56 19787800 12.56 GOOG 588.16 591.90 583.53 588.23 1906700 588.23 MSFT 29.35 29.48 29.12 29.34 41585500 29.34 2012-05-31 AAPL 580.74 581.50 571.46 577.73 17559800 577.73 DELL 12.53 12.54 12.33 12.33 19955500 12.33 GOOG 588.72 590.00 579.00 580.86 2968300 580.86 MSFT 29.30 29.42 28.94 29.19 39134000 29.19 2012-06-01 AAPL 569.16 572.65 560.52 560.99 18606700 560.99 DELL 12.15 12.30 12.05 12.07 19396700 12.07 GOOG 571.79 572.65 568.35 570.98 3057900 570.98 MSFT 28.76 28.96 28.44 28.45 56634300 28.45 DataFrame has a related to_panel method, the inverse of to_frame: In [304]: stacked.to_panel() Out[304]: <class 'pandas.core.panel.Panel'> Dimensions: 6 (items) x 3 (major) x 4 (minor) Items: Open to Adj Close Major axis: 2012-05-30 00:00:00 to 2012-06-01 00:00:00 Minor axis: AAPL to MSFT Chapter 6.

There are much more efficient sampling-without-replacement algorithms, but this is an easy strategy that uses readily available tools: In [183]: df.take(np.random.permutation(len(df))[:3]) Out[183]: 0 1 2 3 1 4 5 6 7 3 12 13 14 15 4 16 17 18 19 To generate a sample with replacement, the fastest way is to use np.random.randint to draw random integers: In [184]: bag = np.array([5, 7, -1, 6, 4]) In [185]: sampler = np.random.randint(0, len(bag), size=10) In [186]: sampler Out[186]: array([4, 4, 2, 2, 2, 0, 3, 0, 4, 1]) In [187]: draws = bag.take(sampler) In [188]: draws Out[188]: array([ 4, 4, -1, -1, -1, 5, 6, 5, 4, 7]) Computing Indicator/Dummy Variables Another type of transformation for statistical modeling or machine learning applications is converting a categorical variable into a “dummy” or “indicator” matrix. If a column in a DataFrame has k distinct values, you would derive a matrix or DataFrame containing k columns containing all 1’s and 0’s. pandas has a get_dummies function for doing this, though devising one yourself is not difficult.
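For instance, a minimal use of get_dummies on a toy column:

# Convert a categorical column into indicator (dummy) columns of 1s and 0s.
import pandas as pd

df = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"]})
print(pd.get_dummies(df["key"], dtype=int))
# Each row has a 1 in the column matching its category and 0 elsewhere.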

pages: 58 words: 18,747

The Rent Is Too Damn High: What to Do About It, and Why It Matters More Than You Think
by Matthew Yglesias
Published 6 Mar 2012

That said, though automobiles are unquestionably a useful technology, they’re not teleportation devices and they haven’t abolished distance. Location still matters, and some land is more valuable than other land. Since land and structures are normally sold in a bundle, it’s difficult in many cases to get precise numbers on land prices as such. But researchers at the Federal Reserve Bank of New York used a statistical model based on prices paid for vacant lots and for structures that were torn down to be replaced by brand-new buildings and found that the price of land in the metro area is closely linked to its distance from the Empire State Building:
Chart 1: Land Prices and Distance of Property from Empire State Building (natural logarithm of land price per square foot, plotted against distance from the Empire State Building in kilometers)
In general, the expensive land should be much more densely built upon than the cheap land.
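A sketch of the kind of fit implied by Chart 1: regress the natural log of land price per square foot on distance from a central landmark. The figures below are invented for illustration, not the New York Fed's data.

# Fit log(price per square foot) against distance from a city-centre landmark.
import numpy as np

distance_km = np.array([1, 2, 5, 8, 12, 18, 25, 35, 50], dtype=float)
price_per_sqft = np.array([900, 700, 350, 220, 140, 90, 60, 40, 25], dtype=float)

slope, intercept = np.polyfit(distance_km, np.log(price_per_sqft), 1)
print(f"log price falls by roughly {abs(slope):.3f} per km from the landmark")
print(f"implied price at 10 km: ${np.exp(intercept + slope * 10):.0f} per square foot")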

Statistics in a Nutshell
by Sarah Boslaugh
Published 10 Nov 2012

For instance, in the field of study and salary example, by using age as a continuous covariate, you are examining what the relationship between those two factors would be if all the subjects in your study were the same age. Another typical use of ANCOVA is to reduce the residual or error variance in a design. We know that one goal of statistical modeling is to explain variance in a data set and that we generally prefer models that can explain more variance, and have lower residual variance, than models that explain less. If we can reduce the residual variance by including one or more continuous covariates in our design, it might be easier to see the relationships between the factors of interest and the dependent variable.
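A minimal sketch of the field-of-study-and-salary example as an ANCOVA-style model, with a categorical factor plus age as a continuous covariate; the data are fabricated and the statsmodels formula interface is my choice here, not the book's own tooling.

# ANCOVA-style model: salary explained by field of study (categorical factor),
# with age as a continuous covariate. All data below are made up.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 300
df = pd.DataFrame({
    "field": rng.choice(["engineering", "humanities", "business"], n),
    "age": rng.uniform(25, 60, n),
})
base = df["field"].map({"engineering": 70, "humanities": 50, "business": 65})
df["salary"] = base + 0.8 * (df["age"] - 25) + rng.normal(0, 5, n)

# C(field) enters as a factor; the age covariate soaks up residual variance.
model = smf.ols("salary ~ C(field) + age", data=df).fit()
print(model.params)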

For example, in the mid-1970s, models focused on variables derived from atmospheric conditions, whereas in the near future, models will be available that are based on atmospheric data combined with land surface, ocean and sea ice, sulphate and nonsulphate aerosol, carbon cycle, dynamic vegetation, and atmospheric chemistry data. By combining these additional sources of variation into a large-scale statistical model, predictions of weather activity of qualitatively different types have been made possible at different spatial and temporal scales. In this chapter, we will be working with multiple regression on a much smaller scale. This is not unrealistic from a real-world point of view; in fact, useful regression models may be built using a relatively small number of predictor variables (say, from 2 to 10), although the people building the model might consider far more predictors for inclusion before selecting those to keep in the final model.

Perhaps wine drinkers eat better diets than people who don’t drink at all, or perhaps they are able to drink wine because they are in better health. (Treatment for certain illnesses precludes alcohol consumption, for instance.) To try to eliminate these alternative explanations, researchers often collect data on a variety of factors other than the factor of primary interest and include the extra factors in the statistical model. Such variables, which are neither the outcome nor the main predictors of interest, are called control variables because they are included in the equation to control for their effect on the outcome. Variables such as age, gender, socioeconomic status, and race/ethnicity are often included in medical and social science studies, although they are not the variables of interest, because the researcher wants to know the effect of the main predictor variables on the outcome after the effects of these control variables have been accounted for.

pages: 309 words: 86,909

The Spirit Level: Why Greater Equality Makes Societies Stronger
by Richard Wilkinson and Kate Pickett
Published 1 Jan 2009

One factor is the strength of the relationship, which is shown by the steepness of the lines in Figures 4.1 and 4.2. People in Sweden are much more likely to trust each other than people in Portugal. Any alternative explanation would need to be just as strong, and in our own statistical models we find that neither poverty nor average standards of living can explain our findings. We also see a consistent association among both the United States and the developed countries. Earlier we described how Uslaner and Rothstein used a statistical model to show the ordering of inequality and trust: inequality affects trust, not the other way round. The relationships between inequality and women’s status and between inequality and foreign aid also add coherence and plausibility to our belief that inequality increases the social distance between different groups of people, making us less willing to see them as ‘us’ rather than ‘them’.

pages: 304 words: 80,965

What They Do With Your Money: How the Financial System Fails Us, and How to Fix It
by Stephen Davis , Jon Lukomnik and David Pitt-Watson
Published 30 Apr 2016

That is why the day you get married is so memorable. In fact, the elements of that day are not likely to be present in the sample of any of the previous 3,652 days.28 So how could the computer possibly calculate the likelihood of their recurring tomorrow, or next week? Similarly, in the financial world, if you feed a statistical model data that have come from a period where there has been no banking crisis, the model will predict that it is very unlikely you will have a banking crisis. When statisticians worked out that a financial crisis of the sort we witnessed in 2008 would occur once in billions of years, their judgment was based on years of data when there had not been such a crisis.29 It compounds the problem that people tend to simplify the outcome of risk models.

The compass that bankers and regulators were using worked well according to its own logic, but it was pointing in the wrong direction, and they steered the ship onto the rocks. History does not record whether the Queen was satisfied with the academics’ response. She might, however, have noted that this economic-statistical model had been found wanting before—in 1998, when the collapse of the hedge fund Long-Term Capital Management nearly took the financial system down with it. Ironically, its directors included the two people who had shared the Nobel Prize in Economics the previous year.20 The Queen might also have noted the glittering lineup of senior economists who, over the last century, have warned against excessive confidence in predictions made using models.

pages: 360 words: 85,321

The Perfect Bet: How Science and Math Are Taking the Luck Out of Gambling
by Adam Kucharski
Published 23 Feb 2016

The scales can tip one way or the other: whichever produces the combined prediction that lines up best with actual results. Strike the right balance, and good predictions can become profitable ones. WHEN WOODS AND BENTER arrived in Hong Kong, they did not meet with immediate success. While Benter spent the first year putting together the statistical model, Woods tried to make money exploiting the long-shot-favorite bias. They had come to Asia with a bankroll of $150,000; within two years, they’d lost it all. It didn’t help that investors weren’t interested in their strategy. “People had so little faith in the system that they would not have invested for 100 percent of the profits,” Woods later said.

Some of which provide clear hints about the future, while others just muddy the predictions. To pin down which factors are useful, syndicates need to collect reliable, repeated observations about races. Hong Kong was the closest Bill Benter could find to a laboratory setup, with the same horses racing on a regular basis on the same tracks in similar conditions. Using his statistical model, Benter identified factors that could lead to successful race predictions. He found that some came out as more important than others. In Benter’s early analysis, for example, the model said the number of races a horse had previously run was a crucial factor when making predictions. In fact, it was more important than almost any other factor.

Data Wrangling With Python: Tips and Tools to Make Your Life Easier
by Jacqueline Kazil
Published 4 Feb 2016

Exception handling: Enables you to anticipate and manage Python exceptions with code. It’s always better to be specific and explicit, so you don’t disguise bugs with overly general exception catches.
numpy corrcoef: Uses statistical models like Pearson’s correlation to determine whether two parts of a dataset are related.
agate mad_outliers and stdev_outliers: Use statistical models and tools like standard deviations or mean average deviations to determine whether your dataset has specific outliers or data that “doesn’t fit.”
agate group_by and aggregate: Group your dataset on a particular attribute and run aggregation analysis to see if there are notable differences (or similarities) across groupings.
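For the numpy entry above, a quick example of corrcoef in use (made-up numbers):

# Pearson correlation between two columns of a dataset.
import numpy as np

hours_studied = np.array([2, 4, 5, 7, 9, 10])
exam_score = np.array([55, 60, 68, 74, 83, 88])

r = np.corrcoef(hours_studied, exam_score)[0, 1]
print(f"Pearson's r: {r:.2f}")  # close to 1, suggesting a strong linear relationship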

This interactive displays different scenarios The Guardian staff researched and coded. Not every simulation turns out with the same outcome, allowing users to understand there is an element of chance, while still showing probability (i.e., less chance of infection with higher vaccination rates). This takes a highly politicized topic and brings out real-world scenarios using statistical models of outbreaks. Although interactives take more experience to build and often require a deeper coding skillset, they are a great tool, especially if you have frontend coding experience. As an example, for our child labor data we could build an interactive showing how many people in your local high school would have never graduated due to child labor rates if they lived in Chad.

pages: 88 words: 25,047

The Mathematics of Love: Patterns, Proofs, and the Search for the Ultimate Equation
by Hannah Fry
Published 3 Feb 2015

Simulating Social Phenomena, edited by Rosaria Conte, Rainer Hegselmann, Pietro Terna, 419–36. Berlin: Springer Berlin Heidelberg, 1997. CHAPTER 8: HOW TO OPTIMIZE YOUR WEDDING Bellows, Meghan L. and J. D. Luc Peterson. ‘Finding an Optimal Seating Chart.’ Annals of Improbable Research, 2012. Alexander, R. A Statistically Modelled Wedding. (2014): http://www.bbc.co.uk/news/magazine-25980076. CHAPTER 9: HOW TO LIVE HAPPILY EVER AFTER Gottman, John M., James D. Murray, Catherine C. Swanson, Rebecca Tyson and Kristin R. Swanson. The Mathematics of Marriage: Dynamic Nonlinear Models. Cambridge, MA.: Basic Books, 2005.

pages: 398 words: 86,855

Bad Data Handbook
by Q. Ethan McCallum
Published 14 Nov 2012

He has spent the past 15 years extracting information from messy data in fields ranging from intelligence to quantitative finance to social media. Richard Cotton is a data scientist with a background in chemical health and safety, and has worked extensively on tools to give non-technical users access to statistical models. He is the author of the R packages “assertive” for checking the state of your variables and “sig” to make sure your functions have a sensible API. He runs The Damned Liars statistics consultancy. Philipp K. Janert was born and raised in Germany. He obtained a Ph.D. in Theoretical Physics from the University of Washington in 1997 and has been working in the tech industry since, including four years at Amazon.com, where he initiated and led several projects to improve Amazon’s order fulfillment process.

As the first and second examples show, a scientist can spot faulty experimental setups, because of his or her ability to test the data for internal consistency and for agreement with known theories, and thereby prevent wrong conclusions and faulty analyses. What possibly could be more important to a scientist? And if that means taking a trip to the factory, I’ll be glad to go. Chapter 8. Blood, Sweat, and Urine Richard Cotton A Very Nerdy Body Swap Comedy I spent six years working in the statistical modeling team at the UK’s Health and Safety Laboratory.[23] A large part of my job was working with the laboratory’s chemists, looking at occupational exposure to various nasty substances to see if an industry was adhering to safe limits. The laboratory gets sent tens of thousands of blood and urine samples each year (and sometimes more exotic fluids like sweat or saliva), and has its own team of occupational hygienists who visit companies and collect yet more samples.

Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage
by Zdravko Markov and Daniel T. Larose
Published 5 Apr 2007

WHY THE BOOK IS NEEDED
The book provides the reader with:
- The models and techniques to uncover hidden nuggets of information in Web-based data
- Insight into how web mining algorithms really work
- The experience of actually performing web mining on real-world data sets
“WHITE-BOX” APPROACH: UNDERSTANDING THE UNDERLYING ALGORITHMIC AND MODEL STRUCTURES
The best way to avoid costly errors stemming from a blind black-box approach to data mining is to apply, instead, a white-box methodology, which emphasizes an understanding of the algorithmic and statistical model structures underlying the software. The book applies this white-box approach by:
- Walking the reader through various algorithms
- Providing examples of the operation of web mining algorithms on actual large data sets
- Testing the reader’s level of understanding of the concepts and algorithms
- Providing an opportunity for the reader to do some real web mining on large Web-based data sets
Algorithm Walk-Throughs
The book walks the reader through the operations and nuances of various algorithms, using small sample data sets, so that the reader gets a true appreciation of what is really going on inside an algorithm.

CHAPTER 4 EVALUATING CLUSTERING
APPROACHES TO EVALUATING CLUSTERING
SIMILARITY-BASED CRITERION FUNCTIONS
PROBABILISTIC CRITERION FUNCTIONS
MDL-BASED MODEL AND FEATURE EVALUATION
CLASSES-TO-CLUSTERS EVALUATION
PRECISION, RECALL, AND F-MEASURE
ENTROPY
APPROACHES TO EVALUATING CLUSTERING
Clustering algorithms group documents by similarity or create statistical models based solely on the document representation, which in turn reflects document content. Then the criterion functions evaluate these models objectively (i.e., using only the document content). In contrast, when we label documents by topic we use additional knowledge, which is generally not explicitly available in document content and representation.
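A compact sketch of a classes-to-clusters style check, using scikit-learn rather than the book's own code; the documents and labels are toy examples.

# Cluster a few toy documents, then compare the clusters with known topic labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

docs = [
    "the team won the football match",
    "a late goal decided the football game",
    "the stock market fell sharply today",
    "investors worried as shares dropped",
]
true_topics = [0, 0, 1, 1]

X = TfidfVectorizer().fit_transform(docs)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("adjusted Rand index:", adjusted_rand_score(true_topics, clusters))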

pages: 346 words: 92,984

The Lucky Years: How to Thrive in the Brave New World of Health
by David B. Agus
Published 29 Dec 2015

Tomasetti and Vogelstein were accused of focusing on rare cancers while leaving out several common cancers that indeed are largely preventable. The International Agency for Research on Cancer, the cancer arm of the World Health Organization, published a press release stating it “strongly disagrees” with the report. To arrive at their conclusion, Tomasetti and Vogelstein used a statistical model they developed based on known rates of cell division in thirty-one types of tissue. Stem cells were their main focal point. As a reminder, these are the small, specialized “mothership” cells in each organ or tissue that divide to replace cells that die or wear out. Only in recent years have researchers been able to conduct these kinds of studies due to advances in the understanding of stem-cell biology.

Kristal et al., “Baseline Selenium Status and Effects of Selenium and Vitamin E Supplementation on Prostate Cancer Risk,” Journal of the National Cancer Institute 106, no. 3 (March 2014): djt456, doi:10.1093/jnci/djt456, Epub February 22, 2014. 12. Johns Hopkins Medicine, “Bad Luck of Random Mutations Plays Predominant Role in Cancer, Study Shows—Statistical Modeling Links Cancer Risk with Number of Stem Cell Divisions,” news release, January 1, 2015, www.hopkinsmedicine.org/news/media/releases/bad_luck_of_random_mutations_plays_predominant_role_in_cancer_study_shows. 13. C. Tomasetti and B. Vogelstein, “Cancer Etiology. Variation in Cancer Risk Among Tissues Can Be Explained by the Number of Stem Cell Divisions,” Science 347, no. 6217 (January 2, 2015): 78–81, doi:10.1126/science.1260825. 14.

pages: 340 words: 94,464

Randomistas: How Radical Researchers Changed Our World
by Andrew Leigh
Published 14 Sep 2018

Critics mocked his ‘pocket handkerchief wheat plots’.4 But after trying hundreds of different breeding combinations, Farrer created a new ‘Federation Wheat’ based not on reputation or appearance, but on pure performance. Agricultural trials of this kind are often called ‘field experiments’, a term which some people also use to describe randomised trials in social science. Modern agricultural field experiments use spatial statistical models to divide up the plots.5 As in medicine and aid, the most significant agricultural randomised trials are now conducted across multiple countries. They are at the heart of much of our understanding of genetically modified crops, the impact of climate change on agriculture, and drought resistance

But all of these studies are limited by the assumptions that the methods required us to make. New developments in non-randomised econometrics – such as machine learning – are generally even more complicated than the older approaches.34 As economist Orley Ashenfelter notes, if an evaluator is predisposed to give a program the thumbs-up, statistical modelling ‘leaves too many ways for the researcher to fake it’.35 That’s why one leading econometrics text teaches non-random approaches by comparing each to the ‘experimental ideal’.36 Students are encouraged to ask the question: ‘If we could run a randomised experiment here, what would it look like?’

pages: 339 words: 94,769

Possible Minds: Twenty-Five Ways of Looking at AI
by John Brockman
Published 19 Feb 2019

Ross, 39, 179 Ashby’s Law of Requisite Variety (First Law of Cybernetics), 39, 179, 180 Asilomar AI Principles, 2017, 81, 84 Asimov, Isaac, 250 astonishing corollary (natural intelligence as special case of AI), 67–70 astonishing hypothesis, 66–67 Astonishing Hypothesis (Crick), 66 AUM Conference, xxi–xxii automation, in manufacturing, 4, 154 Barry, Judith, 262 Bateson, Gregory, xx–xxi, 179, 264–65 Bateson, Mary Catherine, 264 Bayesian models, 226–28 Better Angels of Our Nature, The (Pinker), 118 Bostrom, Nick, xxvi, 27, 80 bounded optimality, 132 brain organoids, 245–46 Brand, Lois, xvii Brand, Stewart, xvii, xxv Bricogne, Gérard, 183 Bronowski, Jacob, 118 Brook, Peter, 213 Brooks, Rodney, 54–63 background and overview of work of, 54–55 data gathering and exploitation, computation platforms used for, 61–63 software engineering, lack of standards and failures in, 60–61 on Turing, 57, 60 on von Neumann, 57–58, 60 on Wiener, 56–57, 59–60 buffer overrun, 61 Bush, Vannevar, 163, 179–80 Cage, John, xvi causal reasoning, 17–19 cellular automaton, von Neumann’s, 57–58 Cheng, Ian, 216–18 chess, 8, 10, 119–20, 150, 184, 185 children, learning in, 222, 228–30 Chinese Room experiment, 250 Chomsky, Noam, 223, 226 Church, Alonzo, 180 Church, George M., 49, 240–53 AI safety concerns, 242–43 background and overview of work of, 240–41 conventional computers versus bio-electronic hybrids, 246–48 equal rights, 248–49 ethical rules for intelligent machines, 243–44 free will of machines, and rights, 250–51 genetic red lines, 251–52 human manipulation of humans, 244–46, 252 humans versus nonhumans and hybrids, treatment of, 249–53 non-Homo intelligences, fair and safe treatment of, 247–48 rights for nonhumans and hybrids, 249–53 science versus religion, 243–44 self-consciousness of machines, and rights, 250–51 technical barriers/red lines, malleability of, 244–46 transhumans, rights of, 252–53 clinical (subjective) method of prediction, 233, 234–35 Colloquy of Mobiles (Pask), 259 Colossus: The Forbin Project (film), 242 competence of superintelligent AGI, 85 computational theory of mind, 102–3, 129–33, 222 computer learning systems Bayesian models, 226–28 cooperative inverse-reinforcement learning (CIRL), 30–31 deep learning (See deep learning) human learning, similarities to, 11 reality blueprint, need for, 16–17 statistical, model-blind mode of current, 16–17, 19 supervised learning, 148 unsupervised learning, 225 Computer Power and Human Reason (Weizenbaum), 48–49, 248 computer virus, 61 “Computing Machinery and Intelligence” (Turing), 43 conflicts among hybrid superintelligences, 174–75 controllable-agent designs, 31–32 control systems beyond human control (control problem) AI designed as tool and not as conscious agent, 46–48, 51–53 arguments against AI risk (See risk posed by AI, arguments against) Ashby’s Law and, 39, 179, 180 cognitive element in, xx–xxi Dyson on, 38–39, 40 Macy conferences, xx–xxi purpose imbued in machines and, 23–25 Ramakrishnan on, 183–86 risk of superhuman intelligence, arguments against, 25–29 Russell on templates for provably beneficial AI, 29–32 Tallinn on, 93–94 Wiener’s warning about, xviii–xix, xxvi, 4–5, 11–12, 22–23, 35, 93, 104, 172 Conway, John Horton, 263 cooperative inverse-reinforcement learning (CIRL), 30–31 coordination problem, 137, 138–41 corporate/AI scenario, in relation of machine superintelligences to hybrid superintelligences, 176 corporate superintelligences, 172–74 credit-assignment function, 196–200 AI and, 196–97 humans, applied to, 197–200 Crick, 
Francis, 58, 66 culture in evolution, selecting for, 198–99 curiosity, and AI risk denial, 96 Cybernetic Idea, xv cybernetics, xv–xxi, 3–7, 102–4, 153–54, 178–80, 194–95, 209–10, 256–57 “Cybernetic Sculpture” exhibition (Tsai), 258, 260–61 “Cybernetic Serendipity” exhibition (Reichardt), 258–59 Cybernetics (Wiener), xvi, xvii, 3, 5, 7, 56 “Cyborg Manifesto, A” (Haraway), 261 data gathering and exploitation, computation platforms used for, 61–63 Dawkins, Richard, 243 Declaration of Helsinki, 252 declarative design, 166–67 Deep Blue, 8, 184 Deep Dream, 211 deep learning, 184–85 bottom-up, 224–26 Pearl on lack of transparency in, and limitations of, 15–19 reinforcement learning, 128, 184–85, 225–26 unsupervised learning, 225 visualization programs, 211–13 Wiener’s foreshadowing of, 9 Deep-Mind, 184–85, 224, 225, 262–63 Deleuze, Gilles, 256 Dennett, Daniel C., xxv, 41–53, 120, 191 AI as “helpless by themselves,” 46–48 AI as tool, not colleagues, 46–48, 51–53 background and overview of work of, 41–42 dependence on new tools and loss of ability to thrive without them, 44–46 gap between today’s AI and public’s imagination of AI, 49 humanoid embellishment of AI, 49–50 intelligent tools versus artificial conscious agents, need for, 51–52 operators of AI systems, responsibilities of, 50–51 on Turing Test, 46–47 on Weizenbaum, 48–50 on Wiener, 43–45 Descartes, René, 191, 223 Desk Set (film), 270 Deutsch, David, 113–24 on AGI risks, 121–22 background and overview of work of, 113–14 creating AGIs, 122–24 developing AI with goals under unknown constraints, 119–21 innovation in prehistoric humans, lack of, 116–19 knowledge imitation of ancestral humans, understanding inherent in, 115–16 reward/punishment of AI, 120–21 Differential Analyzer, 163, 179–80 digital fabrication, 167–69 digital signal encoding, 180 dimensionality, 165–66 distributed Thompson sampling, 198 DNA molecule, 58 “Dollie Clone Series” (Hershman Leeson), 261, 262 Doubt and Certainty in Science (Young), xviii Dragan, Anca, 134–42 adding people to AI problem definition, 137–38 background and overview of work of, 134–35 coordination problem, 137, 138–41 mathematical definition of AI, 136 value-alignment problem, 137–38, 141–42 The Dreams of Reason: The Computer and the Rise of the Science of Complexity (Pagels), xxiii Drexler, Eric, 98 Dyson, Freeman, xxv, xxvi Dyson, George, xviii–xix, 33–40 analog and digital computation, distinguished, 35–37 background and overview of work of, 33–34 control, emergence of, 38–39 electronics, fundamental transitions in, 35 hybrid analog/digital systems, 37–38 on three laws of AI, 39–40 “Economic Possibilities for Our Grandchildren” (Keynes), 187 “Einstein, Gertrude Stein, Wittgenstein and Frankenstein” (Brockman), xxii emergence, 68–69 Emissaries trilogy (Cheng), 216–17 Empty Space, The (Brook), 213 environmental risk, AI risk as, 97–98 Eratosthenes, 19 Evans, Richard, 217 Ex Machina (film), 242 expert systems, 271 extreme wealth, 202–3 fabrication, 167–69 factor analysis, 225 Feigenbaum, Edward, xxiv Feynman, Richard, xxi–xxii Fifth Generation, xxiii–xxiv The Fifth Generation: Artificial Intelligence and Japan’s Computer Challenge to the World (Feigenbaum and McCorduck), xxiv Fodor, Jerry, 102 Ford Foundation, 202 Foresight and Understanding (Toulmin), 18–19 free will of machines, and rights, 250–51 Frege, Gottlob, 275–76 Galison, Peter, 231–39 background and overview of work of, 231–32 clinical versus objective method of prediction, 233–35 scientific objectivity, 235–39 Gates, Bill, 202 generative 
adversarial networks, 226 generative design, 166–67 Gershenfeld, Neil, 160–69 background and overview of work of, 160–61 boom-bust cycles in evolution of AI, 162–63 declarative design, 166–67 digital fabrication, 167–69 dimensionality problem, overcoming, 165–66 exponentially increasing amounts of date, processing of, 164–65 knowledge in AI systems, 164 scaling, and development of AI, 163–66 Ghahramani, Zoubin, 190 Gibson, William, 253 Go, 10, 150, 184–85 goal alignment.

F., 222, 225 Sleepwalkers, The (Koestler), 153 Sloan Foundation, 202 social sampling, 198–99 software failure to advance in conjunction with increased processing power, 10 lack of standards of correctness and failure in engineering of, 60–61 Solomon, Arthur K., xvi–xvii “Some Moral and Technical Consequences of Automation” (Wiener), 23 Stapledon, Olaf, 75 state/AI scenario, in relation of machine superintelligences to hybrid superintelligences, 175–76 statistical, model-blind mode of learning, 16–17, 19 Steveni, Barbara, 218 Stewart, Potter, 247 Steyerl, Hito on AI visualization programs, 211–12 on artificial stupidity, 210–11 subjective method of prediction, 233, 234–35 subjugation fear in AI scenarios, 108–10 Superintelligence: Paths, Dangers, Strategies (Bostrom), 27 supervised learning, 148 surveillance state dystopias, 105–7 switch-it-off argument against AI risk, 25 Szilard, Leo, 26, 83 Tallinn, Jaan, 88–99 AI-risk message, 92–93 background and overview of work of, 88–89 calibrating AI-risk message, 96–98 deniers of AI-risk, motives of, 95–96 environmental risk, AI risk as, 97–98 Estonian dissidents, messages of, 91–92 evolution’s creation of planner and optimizer greater than itself, 93–94 growing awareness of AI risk, 98–99 technological singularity.

The Myth of Artificial Intelligence: Why Computers Can't Think the Way We Do
by Erik J. Larson
Published 5 Apr 2021

Getting a misclassified photo on Facebook or a boring movie recommendation on Netflix may not get us into much trouble with reliance on data-driven induction, but driverless cars and other critical technologies certainly can. A growing number of AI scientists understand the issue. Oren Etzioni, head of the Allen Institute for Artificial Intelligence, calls machine learning and big data “high-capacity statistical models.”9 That’s impressive computer science, but it’s not general intelligence. Intelligent minds bring understanding to data, and can connect dots that lead to an appreciation of failure points and abnormalities. Data and data analysis aren’t enough. THE PROBLEM OF INFERENCE AS TRUST In an illuminating critique of induction as used for financial forecasting, former stock trader Nassim Nicholas Taleb divides statistical prediction problems into four quadrants, with the variables being, first, whether the decision to be made is simple (binary) or complex, and second, whether the randomness involved is “mediocre” or extreme.

Like the meandering line explaining the existing points on a scatter plot, the models turned out to have no predictive or scientific value. There are numerous such fiascoes involving earthquake prediction by geologists, as Silver points out, culminating in the now-famous failure of Russian mathematical geophysicist Vladimir Keilis-Borok to predict an earthquake in the Mojave Desert in 2004 using an “elaborate and opaque” statistical model that identified patterns from smaller earthquakes in particular regions, generalizing to larger ones. Keilis-Borok’s student David Bowman, who is now Chair of the Department of Geological Sciences at Cal State Fullerton, admitted in a rare bit of scientific humility that the Keilis-Borok model was simply overfit.

pages: 125 words: 27,675

Applied Text Analysis With Python: Enabling Language-Aware Data Products With Machine Learning
by Benjamin Bengfort , Rebecca Bilbro and Tony Ojeda
Published 10 Jun 2018

In [Link to Come] we will explore classification models and applications, then in [Link to Come] we will take a look at clustering models, often called topic modeling in text analysis. 1 Kumar, A., McCann, R., Naughton, J., Patel, J. (2015) Model Selection Management Systems: The Next Frontier of Advanced Analytics 2 Wickham, H., Cooke, D., Hofmann, H. (2015) Visualizing statistical models: Removing the blindfold 3 https://arxiv.org/abs/1405.4053

pages: 317 words: 106,130

The New Science of Asset Allocation: Risk Management in a Multi-Asset World
by Thomas Schneeweis , Garry B. Crowder and Hossein Kazemi
Published 8 Mar 2010

There are libraries of statistical books dedicated to the simple task of coming up with estimates of the parameters used in MPT. Here is the point: It is not simple. For example, (1) for what period is one estimating the parameters (week, month, year)? and (2) how constant are the estimates (e.g., do they change and, if they do, do we have statistical models that permit us to systematically reflect those changes?)? There are many more issues in parameter estimation, but probably the biggest is that when two assets exist with the same true expected return, standard deviation, and correlation, but the risk parameters are estimated with error (e.g., a standard deviation that comes out larger or smaller than its true value), the procedure for determining the efficient frontier always picks the asset with the downward-biased risk estimate (e.g., the lower estimated standard deviation) and the upward-biased return estimate.
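A small simulation of that selection effect, under deliberately simplified assumptions (two truly identical assets, with risks estimated from independent noisy samples):

# When two assets are identical but their risks are estimated from noisy samples,
# always choosing the lower *estimated* risk systematically understates true risk.
import numpy as np

rng = np.random.default_rng(7)
true_sigma = 0.20
picks = []
for _ in range(10_000):
    est_a = rng.normal(0, true_sigma, 60).std(ddof=1)  # 60 noisy return observations
    est_b = rng.normal(0, true_sigma, 60).std(ddof=1)
    picks.append(min(est_a, est_b))                    # the optimizer's "safer" choice

print("true volatility:        ", true_sigma)
print("average chosen estimate:", round(float(np.mean(picks)), 3))  # biased low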

The primary issue, of course, remains how to create a comparably risky investable non-actively managed asset. Even when one believes in the use of ex ante equilibrium (e.g., CAPM) or arbitrage (e.g., APT) models of expected return, problems in empirically estimating the required parameters usually result in alpha being determined using statistical models based on the underlying theoretical model. As generally measured in a statistical sense, the term alpha is often derived from a linear regression in which the equation that relates an observed variable y (asset return) to some other factor x (market index) is written as y = α + βx + ε. The first term, α (alpha), represents the intercept; β (beta) represents the slope; and ε (epsilon) represents a random error term.
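As a small illustration of estimating alpha and beta from return series with an ordinary least-squares fit (simulated returns, not real fund data):

# Estimate alpha (intercept) and beta (slope) of an asset versus a market index.
import numpy as np

rng = np.random.default_rng(6)
market = rng.normal(0.006, 0.04, 120)                    # 120 months of index returns
asset = 0.002 + 1.2 * market + rng.normal(0, 0.02, 120)  # true alpha 0.2%, beta 1.2

beta, alpha = np.polyfit(market, asset, 1)               # slope first, then intercept
print(f"estimated alpha: {alpha:.4f} per month, beta: {beta:.2f}")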

pages: 317 words: 100,414

Superforecasting: The Art and Science of Prediction
by Philip Tetlock and Dan Gardner
Published 14 Sep 2015

He also appreciated the absurdity of an academic committee on a mission to save the world. So I am 98% sure he was joking. And 99% sure his joke captures a basic truth about human judgment. Probability for the Stone Age Human beings have coped with uncertainty for as long as we have been recognizably human. And for almost all that time we didn’t have access to statistical models of uncertainty because they didn’t exist. It was remarkably late in history—arguably as late as the 1713 publication of Jakob Bernoulli’s Ars Conjectandi—before the best minds started to think seriously about probability. Before that, people had no choice but to rely on the tip-of-your-nose perspective.

For more details, visit www.goodjudgment.com. (1) Triage. Focus on questions where your hard work is likely to pay off. Don’t waste time either on easy “clocklike” questions (where simple rules of thumb can get you close to the right answer) or on impenetrable “cloud-like” questions (where even fancy statistical models can’t beat the dart-throwing chimp). Concentrate on questions in the Goldilocks zone of difficulty, where effort pays off the most. For instance, “Who will win the presidential election, twelve years out, in 2028?” is impossible to forecast now. Don’t even try. Could you have predicted in 1940 the winner of the election, twelve years out, in 1952?

pages: 347 words: 97,721

Only Humans Need Apply: Winners and Losers in the Age of Smart Machines
by Thomas H. Davenport and Julia Kirby
Published 23 May 2016

The term “artificial intelligence” alone, for example, has been used to describe such technologies as expert systems (collections of rules facilitating decisions in a specified domain, such as financial planning or knowing when a batch of soup is cooked), neural networks (a more mathematical approach to creating a model that fits a data set), machine learning (semiautomated statistical modeling to achieve the best-fitting model to data), natural language processing or NLP (in which computers make sense of human language in textual form), and so forth. Wikipedia lists at least ten branches of AI, and we have seen other sources that mention many more. To make sense of this army of machines and the direction in which it is marching, it helps to remember where it all started: with numerical analytics supporting and supported by human decision-makers.

This work required a broad range of sophisticated models including “neural network” models; some were vendor supplied; some were custom-built. Cathcart, who was an English major at Dartmouth College but also learned the BASIC computer language there from its creator, John Kemeny, knew his way around computer systems and statistical models. Most important, he knew when to trust them and when not to. The models and analyses began to exhibit significant problems. No matter how automated and sophisticated the models were, Cathcart realized that they were becoming less valid over time with changes in the economy and banking climate.

pages: 311 words: 99,699

Fool's Gold: How the Bold Dream of a Small Tribe at J.P. Morgan Was Corrupted by Wall Street Greed and Unleashed a Catastrophe
by Gillian Tett
Published 11 May 2009

JPMorgan Chase, Deutsche Bank, and many other banks and funds suffered substantial losses. For a few weeks after the turmoil, the banking community engaged in soul-searching. At J.P. Morgan the traders stuck bananas on their desks as a jibe at the so-called F9 model monkeys, the mathematical wizards who had created such havoc. (The “monkeys” who wrote the statistical models tended to use the “F9” key on the computer when they performed their calculations, giving rise to the tag.) J.P. Morgan, Deutsche, and others conducted internal reviews that led them to introduce slight changes in their statistical systems. GLG Ltd., one large hedge fund, told its investors that it would use a wider set of data to analyze CDOs in the future.

Compared to Greenspan, Geithner was not just younger, but he also commanded far less clout and respect. As the decade wore on, though, he became privately uneasy about some of the trends in the credit world. From 2005 onwards, he started to call on bankers to prepare for so-called “fat tails,” a statistical term for extremely negative events that occur more often than the normal bell curve statistical models the banks’ risk assessment relied on so much implied. He commented in the spring of 2006: “A number of fundamental changes in the US financial system over the past twenty-five years appear to have rendered it able to withstand the stress of a broader array of shocks than was the case in the past.

pages: 420 words: 100,811

We Are Data: Algorithms and the Making of Our Digital Selves
by John Cheney-Lippold
Published 1 May 2017

The example of Shimon offers us an excellent insight into the real cultural work that is being done by algorithmic processing. Shimon’s algorithmic ‘Coltrane’ and ‘Monk’ are new cultural forms that are innovative, a touch random, and ultimately removed from a doctrinaire politics of what jazz is supposed to be. It instead follows what ‘jazz’ is and can be according to the musical liberties taken by predictive statistical modeling. This is what anthropologist Eitan Wilf has called the “stylizing of styles”—or the way that learning algorithms challenge how we understand style as an aesthetic form.97 ‘Jazz’ is jazz but also not. As a measurable type, it’s something divergent, peculiarly so but also anthropologically so.

For example, inspired by Judith Butler’s theory of gender performance, some U.S. machine-learning researchers looked to unpack the intersectional essentialism implicit in closed, concretized a priori notions like ‘old’ and ‘man’:117 The increasing prevalence of online social media for informal communication has enabled large-scale statistical modeling of the connection between language style and social variables, such as gender, age, race, and geographical origin. Whether the goal of such research is to understand stylistic differences or to learn predictive models of “latent attributes,” there is often an implicit assumption that linguistic choices are associated with immutable and essential categories of people.

Calling Bullshit: The Art of Scepticism in a Data-Driven World
by Jevin D. West and Carl T. Bergstrom
Published 3 Aug 2020

Modeling the changes in winning times, the authors predicted that women will outsprint men by the 2156 Olympic Games. It may be true that women will someday outsprint men, but this analysis does not provide a compelling argument. The authors’ conclusions were based on an overly simplistic statistical model. As shown above, the researchers fit a straight line through the times for women, and a separate straight line through the times for men. If you use this model to estimate future times, it predicts that women will outsprint men in the year 2156. In that year, the model predicts that women will finish the hundred-meter race in about 8.08 seconds and men will be shortly behind with times of about 8.10 seconds.

Using the same model, he extrapolated further into the future and came to the preposterous conclusion that late-millennium sprinters will run the hundred-meter dash in negative times. Clearly this can’t be true, so we should be skeptical of the paper’s other surprising results, such as the forecasted gender reversal in winning times. Another lesson here is to be careful about what kind of model is employed. A model may pass all the formal statistical model-fitting tests. But if it does not account for real biology—in this case, the physical limits to how fast any organism can run—we should be careful about what we conclude. BE MEMORABLE Functional magnetic resonance imaging (fMRI) allows neuroscientists to explore what brain regions are involved in what sorts of cognitive tasks.
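The extrapolation trap is easy to reproduce: fit a straight line to winning times and push it far enough into the future, and the model cheerfully predicts impossible results. A toy version with rough, illustrative times:

# Fit a straight line to approximate women's 100 m winning times and extrapolate.
import numpy as np

years = np.array([1948, 1968, 1988, 2008], dtype=float)
times = np.array([11.9, 11.1, 10.5, 10.8])  # rough illustrative values, in seconds

slope, intercept = np.polyfit(years, times, 1)
for year in (2156, 2636):
    print(year, round(slope * year + intercept, 2))
# The line keeps falling forever, eventually predicting negative sprint times,
# because nothing in the model knows the physical limits of running.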

pages: 362 words: 103,087

The Elements of Choice: Why the Way We Decide Matters
by Eric J. Johnson
Published 12 Oct 2021

But the order of the hotels in the random list made a big difference: the first hotel was selected 50 percent more often than the second and almost twice as often as the fifth. People searched very little; 93 percent of them clicked on only one hotel. And this was a fairly expensive purchase, as the average hotel cost about $160 a night. Ursu analyzed this data across 4.5 million Expedia searches. She used statistical models to see how this effect of order translated to cost. These models controlled for differences in hotels (such as distance from downtown, swimming pools, room quality, chain name, and the like), allowing her to look at whether people should search more and how much it cost them to search too little.

Imagine you are asked, “How many times do you think you will go to the doctor this year?” You find yourself fumbling, trying to remember if that trip to the dermatologist was last January or maybe the previous December. Picwell does not rely on your memory, instead estimating how often you will go to the doctor by building statistical models based on the usage records of people similar to you.8 Again, its system does not replace the chooser, but it augments their intelligence, letting them make decisions about what they want while aiding them in estimating the consequences of their choices. It might be easy to see that we need help with our calculations, and that we can use computers and smartphones to help us.

pages: 2,466 words: 668,761

Artificial Intelligence: A Modern Approach
by Stuart Russell and Peter Norvig
Published 14 Jul 2019

A second major contribution in 1988 was Rich Sutton’s work connecting reinforcement learning—which had been used in Arthur Samuel’s checker-playing program in the 1950s—to the theory of Markov decision processes (MDPs) developed in the field of operations research. A flood of work followed connecting AI planning research to MDPs, and the field of reinforcement learning found applications in robotics and process control as well as acquiring deep theoretical foundations. One consequence of AI’s newfound appreciation for data, statistical modeling, optimization, and machine learning was the gradual reunification of subfields such as computer vision, robotics, speech recognition, multiagent systems, and natural language processing that had become somewhat separate from core AI. The process of reintegration has yielded significant benefits both in terms of applications—for example, the deployment of practical robots expanded greatly during this period—and in a better theoretical understanding of the core problems of AI. 1.3.7Big data (2001–present) Remarkable advances in computing power and the creation of the World Wide Web have facilitated the creation of very large data sets—a phenomenon sometimes known as big data.

Other application areas include gesture analysis (Suk et al., 2010), driver fatigue detection (Yang et al., 2010), and urban traffic modeling (Hofleitner et al., 2012). The link between HMMs and DBNs, and between the forward-backward algorithm and Bayesian network propagation, was explicated by Smyth et al. (1997). A further unification with Kalman filters (and other statistical models) appears in Roweis and Ghahramani (1999). Procedures exist for learning the parameters (Binder et al., 1997a; Ghahramani, 1998) and structures (Friedman et al., 1998) of DBNs. Continuous-time Bayesian networks (Nodelman et al., 2002) are the discrete-state, continuous-time analog of DBNs, avoiding the need to choose a particular duration for time steps.

Chomsky (1956, 1957) pointed out the limitations of finite-state models compared with context-free models, concluding, “Probabilistic models give no particular insight into some of the basic problems of syntactic structure.” This is true, but probabilistic models do provide insight into some other basic problems—problems that context-free models ignore. Chomsky’s remarks had the unfortunate effect of scaring many people away from statistical models for two decades, until these models reemerged for use in the field of speech recognition (Jelinek, 1976), and in cognitive science, where optimality theory (Smolensky and Prince, 1993; Kager, 1999) posited that language works by finding the most probable candidate that optimally satisfies competing constraints.

pages: 103 words: 32,131

Program Or Be Programmed: Ten Commands for a Digital Age
by Douglas Rushkoff
Published 1 Nov 2010

As baseball became a business, the fans took back baseball as a game—even if it had to happen on their computers. The effects didn’t stay in the computer. Leveraging the tremendous power of digital abstraction back to the real world, Billy Bean, coach of the Oakland Athletics, applied these same sorts of statistical modeling to players for another purpose: to assemble a roster for his own Major League team. Bean didn’t have the same salary budget as his counterparts in New York or Los Angeles, and he needed to find another way to assemble a winning combination. So he abstracted and modeled available players in order to build a better team that went from the bottom to the top of its division, and undermined the way that money had come to control the game.

pages: 123 words: 32,382

Grouped: How Small Groups of Friends Are the Key to Influence on the Social Web
by Paul Adams
Published 1 Nov 2011

Research by Forrester found that cancer patients trust their local care physician more than world renowned cancer treatment centers, and in most cases, the patient had known their local care physician for years.16 We overrate the advice of experts Psychologist Philip Tetlock conducted numerous studies to test the accuracy of advice from experts in the fields of journalism and politics. He quantified over 82,000 predictions and found that the journalism experts tended to perform slightly worse than picking answers at random. Political experts didn’t fare much better. They slightly outperformed random chance, but did not perform as well as a basic statistical model. In fact, they actually performed slightly better at predicting things outside their area of expertise, and 80 percent of their predictions were wrong. Studies in finance also show that only 20 percent of investment bankers outperform the stock market.17 We overestimate what we know Sometimes we consider ourselves as experts, even though we don’t know as much as we think we know.

pages: 719 words: 104,316

R Cookbook
by Paul Teetor
Published 28 Mar 2011

Solution The factor function encodes your vector of discrete values into a factor: > f <- factor(v) # v is a vector of strings or integers If your vector contains only a subset of possible values and not the entire universe, then include a second argument that gives the possible levels of the factor: > f <- factor(v, levels) Discussion In R, each possible value of a categorical variable is called a level. A vector of levels is called a factor. Factors fit very cleanly into the vector orientation of R, and they are used in powerful ways for processing data and building statistical models. Most of the time, converting your categorical data into a factor is a simple matter of calling the factor function, which identifies the distinct levels of the categorical data and packs them into a factor: > f <- factor(c("Win","Win","Lose","Tie","Win","Lose")) > f [1] Win Win Lose Tie Win Lose Levels: Lose Tie Win Notice that when we printed the factor, f, R did not put quotes around the values.

See Also The help page for par lists the global graphics parameters; the chapter of R in a Nutshell on graphics includes the list with useful annotations. R Graphics contains extensive explanations of graphics parameters. Chapter 11. Linear Regression and ANOVA Introduction In statistics, modeling is where we get down to business. Models quantify the relationships between our variables. Models let us make predictions. A simple linear regression is the most basic model. It’s just two variables and is modeled as a linear relationship with an error term: y_i = β_0 + β_1 x_i + ε_i We are given the data for x and y.
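
The cookbook goes on to fit this model in R; as a language-neutral illustration, here is a minimal ordinary-least-squares sketch in Python on simulated data (the variable names and numbers are mine, not the book's):

# Minimal sketch of the simple linear regression y_i = b0 + b1*x_i + e_i,
# fit by ordinary least squares on simulated data.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=x.size)   # true b0 = 2.0, b1 = 0.5

b1, b0 = np.polyfit(x, y, deg=1)   # polyfit returns [slope, intercept]
print(f"estimated intercept b0 = {b0:.2f}, slope b1 = {b1:.2f}")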

pages: 502 words: 107,657

Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die
by Eric Siegel
Published 19 Feb 2013

GlaxoSmithKline (UK): Vladimir Anisimov, GlaxoSmithKline, “Predictive Analytic Patient Recruitment and Drug Supply Modelling in Clinical Trials,” Predictive Analytics World London Conference, November 30, 2011, London, UK. www.predictiveanalyticsworld.com/london/2011/agenda.php#day1–16. Vladimir V. Anisimov, “Statistical Modelling of Clinical Trials (Recruitment and Randomization),” Communications in Statistics—Theory and Methods 40, issue 19–20 (2011): 3684–3699. www.tandfonline.com/toc/lsta20/40/19–20. MultiCare Health System (four hospitals in Washington): Karen Minich-Pourshadi for HealthLeaders Media, “Hospital Data Mining Hits Paydirt,” HealthLeaders Media Online, November 29, 2010. www.healthleadersmedia.com/page-1/FIN-259479/Hospital-Data-Mining-Hits-Paydirt.

Johnson, Serena Lee, Frank Doherty, and Arthur Kressner (Consolidated Edison Company of New York), “Predicting Electricity Distribution Feeder Failures Using Machine Learning Susceptibility Analysis,” March 31, 2006. www.phillong.info/publications/GBAetal06_susc.pdf. This work has been partly supported by a research contract from Consolidated Edison. BNSF Railway: C. Tyler Dick, Christopher P. L. Barkan, Edward R. Chapman, and Mark P. Stehly, “Multivariate Statistical Model for Predicting Occurrence and Location of Broken Rails,” Transportation Research Board of the National Academies, January 26, 2007. http://trb.metapress.com/content/v2j6022171r41478/. See also: http://ict.uiuc.edu/railroad/cee/pdf/Dick_et_al_2003.pdf. TTX: Thanks to Mahesh Kumar at Tiger Analytics for this case study, “Predicting Wheel Failure Rate for Railcars.”

pages: 416 words: 108,370

Hit Makers: The Science of Popularity in an Age of Distraction
by Derek Thompson
Published 7 Feb 2017

, Guys and Dolls, Lady and the Tramp, Strategic Air Command, Not as a Stranger, To Hell and Back, The Sea Chase, The Seven Year Itch, and The Tall Men. If you’ve heard of five of those twelve movies, you have me beat. And yet they were all more popular than the film that launched the bestselling rock song of all time. There is no statistical model in the world to forecast that the forgotten B-side of a middling record played over the credits of the thirteenth most popular movie of any year will automatically become the most popular rock-and-roll song of all time. The business of creativity is a game of chance—a complex, adaptive, semi-chaotic game with Bose-Einstein distribution dynamics and Pareto’s power law characteristics with dual-sided uncertainty.

killing 127 people in three days: Kathleen Tuthill, “John Snow and the Broad Street Pump,” Cricket 31, no. 3 (November 2003), reprinted by UCLA Department of Epidemiology, www.ph.ucla.edu/epi/snow/snowcricketarticle.html. “There were only ten deaths in houses”: John Snow, Medical Times and Gazette 9, September 23, 1854: 321–22, reprinted by UCLA Department of Epidemiology, www.ph.ucla.edu/epi/snow/choleraneargoldensquare.html. Note: Other accounts of Snow’s methodology, such as David Freedman’s paper “Statistical Models and Shoe Leather,” give more weight to Snow’s investigation of the water supply companies. A few years before the outbreak, one of London’s water suppliers had moved its intake point upstream from the main sewage discharge on the Thames, while another company kept its intake point downstream from the sewage.

pages: 319 words: 106,772

Irrational Exuberance: With a New Preface by the Author
by Robert J. Shiller
Published 15 Feb 2000

Another argument advanced to explain why days of unusually large stock price movements have often not been found to coincide with important news is that a confluence of factors may cause a significant market change, even if the individual factors themselves are not particularly newsworthy. For example, suppose certain investors are informally using a particular statistical model that forecasts fundamental value using a number of economic indicators. If all or most of these particular indicators point the same way on a given day, even if no single one of them is of any substantive importance by itself, their combined effect will be noteworthy. Both of these interpretations of the tenuous relationship between news and market movements assume that the public is paying continuous attention to the news—reacting sensitively to the slightest clues about market fundamentals, constantly and carefully adding up all the disparate pieces of evidence.

Merton, with Terry Marsh, wrote an article in the American Economic Review in 1986 that argued against my results and concluded, ironically, that speculative markets were not too volatile.26 John Campbell and I wrote a number of papers attempting to put these claims of excess volatility on a more secure footing, and we developed statistical models to study the issue and deal with some of the problems emphasized by the critics.27 We felt that we had established in a fairly convincing way that stock markets do violate the efficient markets model. Our research has not completely settled the matter, however. There are just too many possible statistical issues that can be raised, and the sample provided by only a little over a century of data cannot prove anything conclusively.

pages: 502 words: 107,510

Natural Language Annotation for Machine Learning
by James Pustejovsky and Amber Stubbs
Published 14 Oct 2012

The British National Corpus (BNC) is compiled and released as the largest corpus of English to date (100 million words). The Text Encoding Initiative (TEI) is established to develop and maintain a standard for the representation of texts in digital form. 2000s: As the World Wide Web grows, more data is available for statistical models for Machine Translation and other applications. The American National Corpus (ANC) project releases a 22-million-word subcorpus, and the Corpus of Contemporary American English (COCA) is released (400 million words). Google releases its Google N-gram Corpus of 1 trillion word tokens from public web pages.

We can identify two basic methods for sequence classification: Feature-based classification A sequence is transformed into a feature vector. The vector is then classified according to conventional classifier methods. Model-based classification An inherent model of the probability distribution of the sequence is built. HMMs and other statistical models are examples of this method. Included in feature-based methods are n-gram models of sequences, where an n-gram is selected as a feature. Given a set of such n-grams, we can represent a sequence as a binary vector of the occurrence of the n-grams, or as a vector containing frequency counts of the n-grams.
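
A minimal Python sketch of the feature-based approach, with a made-up bigram inventory, shows both representations side by side:

# Sketch of feature-based sequence classification: represent a token sequence
# as a vector over a fixed set of n-grams, either binary (occurrence) or counts.
# The bigram inventory and the sentence are invented for illustration.
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

vocab = [("the", "dog"), ("dog", "barked"), ("the", "cat")]   # chosen bigram features
tokens = "the dog barked at the dog".split()

counts = Counter(ngrams(tokens, 2))
binary_vector = [1 if g in counts else 0 for g in vocab]
count_vector = [counts[g] for g in vocab]

print(binary_vector)   # [1, 1, 0]
print(count_vector)    # [2, 1, 0]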

pages: 350 words: 103,270

The Devil's Derivatives: The Untold Story of the Slick Traders and Hapless Regulators Who Almost Blew Up Wall Street . . . And Are Ready to Do It Again
by Nicholas Dunbar
Published 11 Jul 2011

The mattress had done its job—it had given international regulators the confidence to sign off as commercial banks built up their trading businesses. Betting—and Beating—the Spread Now return to the trading floor, to the people regulators and bank senior management need to police. Although they are taught to overcome risk aversion, traders continue to look for a mattress everywhere, in the form of “free lunches.” But do they use statistical modeling to identify a mattress, and make money? If you talk to traders, the answer tends to be no. Listen to the warning of a senior Morgan Stanley equities trader who I interviewed in 2009: “You can compare to theoretical or historic value. But these forms of trading are probably a bit dangerous.”

According to the Morgan Stanley trader, “You study the perception of the market: I buy this because the next tick will be on the upside, or I sell because the next tick will be on the downside. This is probably based on the observations of your peers and so on. If you look purely at the anticipation of the price, that’s a way to make money in trading.” One reason traders don’t tend to make outright bets on the basis of statistical modeling is that capital rules such as VAR discourage it. The capital required to be set aside by VAR scales up with the size of the positions and the degree of worst-case scenario projected by the statistics. For volatile markets like equities, that restriction takes a big bite out of potential profit since trading firms must borrow to invest.5 On the other hand, short-term, opportunistic trading (which might be less profitable) slips under the VAR radar because the positions never stay on the books for very long.

pages: 353 words: 106,704

Choked: Life and Breath in the Age of Air Pollution
by Beth Gardiner
Published 18 Apr 2019

Tiny airborne particles known as PM2.5, so small they are thought to enter the bloodstream and penetrate vital organs, including the brain, were a far more potent danger. Nitrogen dioxide, one of a family of gases known as NOx, also had a powerful effect. In fact, it poured out of cars, trucks, and ships in such close synchronicity with PM2.5 that even Jim Gauderman’s statistical models couldn’t disentangle the two pollutants’ effects. That wasn’t all. In what may have been their most worrisome discovery, the team found the pollutants were wreaking harm even at levels long assumed to be safe. In the years to come, the implications of that uncomfortable finding would be felt far beyond the pages of prestigious scientific journals

These numbers are everywhere: more than a million and a half annual air pollution deaths each for China and India.10 Approaching a half million in Europe.11 Upward of a hundred thousand in America.12 None are arrived at by counting individual cases; like Walton’s, they’re all derived through complex statistical modeling. Even if you tried, David Spiegelhalter says, it would be impossible to compile a body-by-body tabulation, since pollution—unlike, say, a heart attack or stroke—is not a cause of death in the medical sense. It’s more akin to smoking, obesity, or inactivity, all risk factors that can hasten a death or make it more likely, either alone or as one of several contributing factors.

Capital Ideas Evolving
by Peter L. Bernstein
Published 3 May 2007

* Unless otherwise specified, quotations are from personal interviews or correspondence. While he was at Bronx Science, Lo read The Foundation Trilogy by the science fiction writer Isaac Asimov. The story was about a mathematician who develops a theory of human behavior called “psychohistory.” Psychohistory can predict the future course of human events, but only when the population reaches a certain size because the predictions are based on statistical models. Lo was hooked. He found Asimov’s narrative to be plausible enough to become a reality some day, and he wanted to be the one to make it happen. Economics, especially game theory and mathematical economics, looked like the best way to get started. He made the decision in his second year at Yale to do just that.

At that moment, in the early 1980s, academics in the field of financial economics were still working out the full theoretical implications of Markowitz’s theory of portfolio selection, the Efficient Market Hypothesis, the Capital Asset Pricing Model, the options pricing model, and Modigliani and Miller’s iconoclastic ideas about corporate finance and the central role of arbitrage. That emphasis on theory made the bait even tastier for Lo. He saw the way clear to follow Asimov’s advice. By applying statistical models to the daily practice of finance in the real world, he would not only move the field of finance forward from its focus on theory, but even more enticing, he would also find the holy grail he was seeking in the first place: solutions to Asimov’s psychohistory. Progress was rapid. By 1988 he was an untenured professor at MIT, having turned down an offer of tenure to stay at Wharton.

pages: 407 words: 104,622

The Man Who Solved the Market: How Jim Simons Launched the Quant Revolution
by Gregory Zuckerman
Published 5 Nov 2019

You can’t make any money in mathematics,” he sneered. The experience taught Patterson to distrust most moneymaking operations, even those that appeared legitimate—one reason why he was so skeptical of Simons years later. After graduate school, Patterson thrived as a cryptologist for the British government, building statistical models to unscramble intercepted messages and encrypt secret messages in a unit made famous during World War II when Alan Turing famously broke Germany’s encryption codes. Patterson harnessed the simple-yet-profound Bayes’ theorem of probability, which argues that, by updating one’s initial beliefs with new, objective information, one can arrive at improved understandings.
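
For reference, the updating rule in question is the standard textbook form of Bayes' theorem; the tiny Python sketch below, with invented numbers rather than anything from the book, shows a single update of a prior belief:

# Bayes' theorem as a single update: posterior = likelihood * prior / evidence.
# Generic textbook form with invented numbers -- not code or data from the book.
def bayes_update(prior, p_evidence_if_true, p_evidence_if_false):
    evidence = p_evidence_if_true * prior + p_evidence_if_false * (1 - prior)
    return p_evidence_if_true * prior / evidence

# A 10% prior belief, evidence that is 0.80 likely if the belief is right and
# 0.05 likely if it is wrong, updates to a posterior of 0.64.
print(bayes_update(0.10, 0.80, 0.05))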

“Pie” is more likely to follow the word “apple” in a sentence than words like “him” or “the,” for example. Similar probabilities also exist for pronunciation, the IBM crew argued. Their goal was to feed their computers with enough data of recorded speech and written text to develop a probabilistic, statistical model capable of predicting likely word sequences based on sequences of sounds. Their computer code wouldn’t necessarily understand what it was transcribing, but it would learn to transcribe language, nonetheless. In mathematical terms, Brown, Mercer, and the rest of Jelinek’s team viewed sounds as the output of a sequence in which each step along the way is random, yet dependent on the previous step—a hidden Markov model.

pages: 456 words: 185,658

More Guns, Less Crime: Understanding Crime and Gun-Control Laws
by John R. Lott
Published 15 May 2010

As to the concern that other changes in law enforcement may have been occurring at the same time, the estimates account for changes in other gun-control laws and changes in law enforcement as measured by arrest and conviction rates as well as by prison terms. No previous study of crime has attempted to control for as many different factors that might explain changes in the crime rate. 3 Did I assume that there was an immediate and constant effect from these laws and that the effect should be the same everywhere? The “statistical models assumed: (1) an immediate and constant effect of shall-issue laws, and (2) similar effects across different states and counties.” (Webster, “Claims,” p. 2; see also Dan Black and Daniel Nagin, “Do ‘Right-to-Carry’ Laws Deter Violent Crime?” Journal of Legal Studies 27 [January 1998], p. 213.) One of the central arguments both in the original paper and in this book is that the size of the deterrent effect is related to the number of permits issued, and it takes many years before states reach their long-run level of permits.

A major reason for the larger effect on crime in the more urban counties was that in rural areas, permit requests already were being approved; hence it was in urban areas that the number of permitted concealed handguns increased the most. A week later, in response to a column that I published in the Omaha World-Herald,20 Mr. Webster modified this claim somewhat: Lott claims that his analysis did not assume an immediate and constant effect, but that is contrary to his published article, in which the vast majority of the statistical models assume such an effect. (Daniel W. Webster, “Concealed-Gun Research Flawed,” Omaha World-Herald, March 12, 1997; emphasis added.) When one does research, it is most appropriate to take the simplest specifications first and then gradually make things more complicated. The simplest way of doing this is to examine the mean crime rates before and after the change in a law.

While he includes a chapter that contains replies to his critics, unfortunately he doesn’t directly respond to the key Black and Nagin finding that formal statistical tests reject his methods. The closest he gets to addressing this point is to acknowledge “the more serious possibility is that some other factor may have caused both the reduction in crime rates and the passage of the law to occur at the same time,” but then goes on to say that he has “presented over a thousand [statistical model] specifications” that reveal “an extremely consistent pattern” that right-to-carry laws reduce crime. Another view would be that a thousand versions of a demonstrably invalid analytical approach produce boxes full of invalid results. (Jens Ludwig, “Guns and Numbers,” Washington Monthly, June 1998, p. 51)76 We applied a number of specification tests suggested by James J.

pages: 161 words: 39,526

Applied Artificial Intelligence: A Handbook for Business Leaders
by Mariya Yao , Adelyn Zhou and Marlene Jia
Published 1 Jun 2018

“We’ve had a lot of success hiring from career fairs that Galvanize organizes, where we present the unique challenges our company tackles in healthcare,” he adds.(57) Experienced Scientists and Researchers Hiring experienced data scientists and machine learning researchers requires a different approach. For these positions, employers typically look for a doctorate or extensive experience in machine learning, statistical modeling, or related fields. You will usually source these talented recruits through strategic networking, academic conferences, or blatant poaching. To this end, you can partner with universities or research departments and sponsor conferences to build your brand reputation. You can also host competitions on Kaggle or similar platforms.

pages: 147 words: 39,910

The Great Mental Models: General Thinking Concepts
by Shane Parrish
Published 22 Nov 2019

“It became possible also to map out master plans for the statistical city, and people take these more seriously, for we are all accustomed to believe that maps and reality are necessarily related, or that if they are not, we can make them so by altering reality.” 12 Jacobs’ book is, in part, a cautionary tale of what can happen when faith in the model influences the decisions we make in the territory. When we try to fit complexity into the simplification. _ Jacobs demonstrated that mapping the interaction between people and sidewalks was an important factor in determining how to improve city safety. «In general, when building statistical models, we must not forget that the aim is to understand something about the real world. Or predict, choose an action, make a decision, summarize evidence, and so on, but always about the real world, not an abstract mathematical world: our models are not the reality. » David Hand13 Conclusion Maps have long been a part of human society.

pages: 302 words: 82,233

Beautiful security
by Andy Oram and John Viega
Published 15 Dec 2009

Ashenfelter is a statistician at Princeton who loves wine but is perplexed by the pomp and circumstance around valuing and rating wine in much the same way I am perplexed by the pomp and circumstance surrounding risk management today. In the 1980s, wine critics dominated the market with predictions based on their own reputations, palate, and frankly very little more. Ashenfelter, in contrast, studied the Bordeaux region of France and developed a statistical model about the quality of wine. His model was based on the average rainfall in the winter before the growing season (the rain that makes the grapes plump) and the average sunshine during the growing season (the rays that make the grapes ripe), resulting in a simple formula: quality = 12.145 + (0.00117 * winter rainfall) + (0.0614 * average growing season temperature) - (0.00386 * harvest rainfall) Of course he was chastised and lampooned by the stuffy wine critics who dominated the industry, but after several years of producing valuable results, his methods are now widely accepted as providing important valuation criteria for wine.
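
The equation as quoted drops straight into code. The sketch below simply evaluates the formula with the coefficients given above on made-up inputs; units follow the original study, so treat it purely as an illustration:

# The quoted Ashenfelter-style equation as a function. Coefficients are as
# given in the excerpt; the input values are invented for illustration.
def wine_quality(winter_rainfall, growing_season_temp, harvest_rainfall):
    return (12.145
            + 0.00117 * winter_rainfall
            + 0.0614 * growing_season_temp
            - 0.00386 * harvest_rainfall)

# Hypothetical vintage: wet winter, warm growing season, dry harvest.
print(wine_quality(600, 17.5, 100))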

I believe the right elements are really coming together where technology can create better technology. Advances in technology have been used to both arm and disarm the planet, to empower and oppress populations, and to attack and defend the global community and all it will have become. The areas I’ve pulled together in this chapter—from business process management, number crunching and statistical modeling, visualization, and long-tail technology—provide fertile ground for security management systems in the future that archive today’s best efforts in the annals of history. At least I hope so, for I hate mediocrity with a passion and I think security management systems today are mediocre at best!

pages: 541 words: 109,698

Mining the Social Web: Finding Needles in the Social Haystack
by Matthew A. Russell
Published 15 Jan 2011

For example, what would the precision, recall, and F1 score have been if your algorithm had identified “Mr. Green”, “Colonel”, “Mustard”, and “candlestick”? As somewhat of an aside, you might find it interesting to know that many of the most compelling technology stacks used by commercial businesses in the NLP space use advanced statistical models to process natural language according to supervised learning algorithms. A supervised learning algorithm is essentially an approach in which you provide training samples of the form [(input1, output1), (input2, output2), ..., (inputN, outputN)] to a model such that the model is able to predict the tuples with reasonable accuracy.
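
To make the [(input, output), ...] format concrete, here is a deliberately trivial Python sketch: it "trains" on a few labeled strings and predicts a label for a new one by majority vote over words. The data and the classifier are invented for illustration and are far simpler than the statistical models the passage refers to:

# Minimal sketch of the supervised-learning setup: a model is trained on
# (input, output) pairs and then predicts outputs for new inputs. The "model"
# here is a toy word-vote classifier, just to make the data format concrete.
from collections import Counter, defaultdict

training = [("Mr. Green", "person"), ("Colonel Mustard", "person"),
            ("candlestick", "object"), ("library", "place")]

label_counts = defaultdict(Counter)
for text, label in training:
    for word in text.split():
        label_counts[word][label] += 1

def predict(text):
    votes = Counter()
    for word in text.split():
        votes.update(label_counts.get(word, Counter()))
    return votes.most_common(1)[0][0] if votes else "unknown"

print(predict("Mrs. Green"))   # -> "person", because "Green" was seen with that label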

SocialGraph Node Mapper, Brief analysis of breadth-first techniques sorting, Sensible Sorting, Sorting Documents by Value documents by value, Sorting Documents by Value documents in CouchDB, Sensible Sorting split method, using to tokenize text, Data Hacking with NLTK, Before You Go Off and Try to Build a Search Engine… spreadsheets, visualizing Facebook network data, Visualizing with spreadsheets (the old-fashioned way) statistical models processing natural language, Quality of Analytics stemming verbs, Querying Buzz Data with TF-IDF stopwords, Data Hacking with NLTK, Analysis of Luhn’s Summarization Algorithm downloading NLTK stopword data, Data Hacking with NLTK filtering out before document summarization, Analysis of Luhn’s Summarization Algorithm streaming API (Twitter), Analyzing Tweets (One Entity at a Time) Strong Links API, The Infochimps “Strong Links” API, Interactive 3D Graph Visualization student’s t-score, How the Collocation Sausage Is Made: Contingency Tables and Scoring Functions subject-verb-object triples, Entity-Centric Analysis: A Deeper Understanding of the Data, Man Cannot Live on Facts Alone summarizing documents, Summarizing Documents, Analysis of Luhn’s Summarization Algorithm, Summarizing Documents, Analysis of Luhn’s Summarization Algorithm analysis of Luhn’s algorithm, Analysis of Luhn’s Summarization Algorithm Tim O’Reilly Radar blog post (example), Summarizing Documents summingReducer function, Frequency by date/time range, What entities are in Tim’s tweets?

pages: 404 words: 43,442

The Art of R Programming
by Norman Matloff

This approach is used in the loop beginning at line 53. (Arguably, in this case, the increase in speed comes at the expense of readability of the code.) 9.1.7 Extended Example: A Procedure for Polynomial Regression As another example, consider a statistical regression setting with one predictor variable. Since any statistical model is merely an approximation, in principle, you can get better and better models by fitting polynomials of higher and higher degrees. However, at some point, this becomes overfitting, so that the prediction of new, future data actually deteriorates for degrees higher than some value. The class "polyreg" aims to deal with this issue.
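
The overfitting trade-off is easy to see numerically. A hedged Python sketch on synthetic data (the book's own polyreg class addresses the same issue in R) fits polynomials of rising degree and compares training error with error on held-out points:

# Sketch of the overfitting point: higher-degree polynomials fit the training
# points ever more closely, but past some degree they typically predict
# held-out points worse. Synthetic data, invented for illustration.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

idx = rng.permutation(x.size)
train, test = idx[:30], idx[30:]          # hold out 10 points

for degree in (1, 3, 6, 9):
    coeffs = np.polyfit(x[train], y[train], deg=degree)
    pred = np.polyval(coeffs, x)
    train_mse = np.mean((pred[train] - y[train]) ** 2)
    test_mse = np.mean((pred[test] - y[test]) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, held-out MSE {test_mse:.3f}")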

Input/Output 239 We’ll create a function called extractpums() to read in a PUMS file and create a data frame from its Person records. The user specifies the filename and lists fields to extract and names to assign to those fields. We also want to retain the household serial number. This is good to have because data for persons in the same household may be correlated and we may want to add that aspect to our statistical model. Also, the household data may provide important covariates. (In the latter case, we would want to retain the covariate data as well.) Before looking at the function code, let’s see what the function does. In this data set, gender is in column 23 and age in columns 25 and 26. In the example, our filename is pumsa.

pages: 492 words: 118,882

The Blockchain Alternative: Rethinking Macroeconomic Policy and Economic Theory
by Kariappa Bheemaiah
Published 26 Feb 2017

Thus, owing to their fundamental role in monetary policy decision making, it is important to understand the history, abilities and limitations of these models. Currently, most central banks, such as the Federal Reserve and the ECB,13 use two kinds of models to study and build forecasts about the economy (Axtell and Farmer, 2015). The first, statistical models , fit current aggregate data of variables such as GDP, interest rates, and unemployment to empirical data in order to predict/suggest what the near future holds. The second type of models (which are more widely used), are known as “Dynamic Stochastic General Equilibrium” (DSGE) models . These models are constructed on the basis that the economy would be at rest (i.e.: static equilibrium) if it wasn’t being randomly perturbed by events from outside the economy.

Buiter The Precariat: The New Dangerous Class (2011), Guy Standing Inventing the Future: Postcapitalism and a World Without Work (2015), Nick Srnicek and Alex Williams Raising the Floor: How a Universal Basic Income Can Renew Our Economy and Rebuild the American Dream (2016), Andy Stern Index A Aadhaar program Agent Based Computational Economics (ABCE) models complexity economists developments El Farol problem and minority games Kim-Markowitz Portfolio Insurers Model Santa Fe artificial stock market model Agent based modelling (ABM) aggregate behavioural trends axiomatisation, linearization and generalization black-boxing bottom-up approach challenge computational modelling paradigm conceptualizing, individual agents EBM enacting agent interaction environmental factors environment creation individual agent parameters and modelling decisions simulation designing specifying agent behaviour Alaska Anti-Money Laundering (AML) ARPANet Artificial Neural Networks (ANN) Atlantic model Automatic Speech Recognition (ASR) Autor-Levy-Murnane (ALM) B Bandits’ Club BankID system Basic Income Earth Network (BIEN) Bitnation Blockchain ARPANet break down points decentralized communication emails fiat currency functions Jiggery Pokery accounts malware protocols Satoshi skeleton keys smart contract TCP/IP protocol technological and financial innovation trade finance Blockchain-based regulatory framework (BRF) BlockVerify C Capitalism ALM hypotheses and SBTC Blockchain and CoCo canonical model cashlessenvironment See(Multiple currencies) categories classification definition of de-skilling process economic hypothesis education and training levels EMN fiat currency CBDC commercial banks debt-based money digital cash digital monetary framework fractional banking system framework ideas and methods non-bank private sector sovereign digital currency transition fiscal policy cashless environment central bank concept of control spending definition of exogenous and endogenous function fractional banking system Kelton, Stephanie near-zero interest rates policy instrument QE and QQE tendency ultra-low inflation helicopter drops business insider ceteris paribus Chatbots Chicago Plan comparative charts fractional banking keywords technology UBI higher-skilled workers ICT technology industry categories Jiggery Pokery accounts advantages bias information Blockchain CFTC digital environment Enron scandal limitations private/self-regulation public function regulatory framework tech-led firms lending and payments CAMELS evaluation consumers and SMEs cryptographic laws fundamental limitations governments ILP KYB process lending sector mobile banking payments industry regulatory pressures rehypothecation ripple protocol sectors share leveraging effect technology marketing money cashless system crime and taxation economy IRS money Seigniorage tax evasion markets and regulation market structure multiple currency mechanisms occupational categories ONET database policies economic landscape financialization monetary and fiscal policy money creation methods The Chicago Plan transformation probabilities regulation routine and non-routine routinization hypothesis Sarbanes-Oxley Act SBTC scalability issue skill-biased employment skills and technological advancement skills downgrading process trades See(Trade finance) UBI Alaska deployment Mincome, Canada Namibia Cashless system Cellular automata (CA) Central bank digital currency (CBDC) Centre for Economic Policy Research (CEPR) Chicago Plan Clearing House Interbank Payments System 
(CHIPS) Collateralised Debt Obligations (CDOs) Collateralized Loan Obligations (CLOs) Complexity economics agent challenges consequential decisions deterministic and axiomatized models dynamics education emergence exogenous and endogenous changes feedback loops information affects agents macroeconoic movements network science non-linearity path dependence power laws self-adapting individual agents technology andinvention See(Technology and invention) Walrasian approach Computing Congressional Research Service (CRS) Constant absolute risk aversion (CARA) Contingent convertible (CoCo) Credit Default Swaps (CDSs) CredyCo Cryptid Cryptographic law Currency mechanisms Current Account Switching System (CASS) D Data analysis techniques Debt and money broad and base money China’s productivity credit economic pressures export-led growth fractional banking See also((Fractional Reserve banking) GDP growth households junk bonds long-lasting effects private and public sectors problems pubilc and private level reaganomics real estate industry ripple effects security and ownership societal level UK DigID Digital trade documents (DOCS) Dodd-Frank Act Dynamic Stochastic General Equilibrium (DSGE) model E EBM SeeEquation based modelling (EBM) Economic entropy vs. economic equilibrium assemblages and adaptations complexity economics complexity theory DSGE based models EMH human uncertainty principle’ LHC machine-like system operating neuroscience findings reflexivity RET risk assessment scientific method technology and economy Economic flexibility Efficient markets hypothesis (EMH) eID system Electronic Discrete Variable Automatic Computer (EDVAC) Elliptical curve cryptography (ECC) EMH SeeEfficient Market Hypothesis (EMH) Equation based modelling (EBM) Equilibrium business-cycle models Equilibrium economic models contract theory contact incompleteness efficiency wages explicit contracts implicit contracts intellectual framework labor market flexibility menu cost risk sharing DSGE models Federal Reserve system implicit contracts macroeconomic models of business cycle NK models non-optimizing households principles RBC models RET ‘rigidity’ of wage and price change SIGE steady state equilibrium, economy structure Taylor rule FRB/US model Keynesian macroeconomic theory RBC models Romer’s analysis tests statistical models Estonian government European Migration Network (EMN) Exogenous and endogenous function Explicit contracts F Feedback loop Fiat currency CBDC commercial banks debt-based money digital cash digital monetary framework framework ideas and methods non-bank private sector sovereign digital currency transition Financialization de facto definition of eastern economic association enemy of my enemy is my friend FT slogans Palley, Thomas I.

pages: 370 words: 112,809

The Equality Machine: Harnessing Digital Technology for a Brighter, More Inclusive Future
by Orly Lobel
Published 17 Oct 2022

And it can lower costs, increase the size of the pie, and accelerate the pace of progress. Malice or Competence: What We Fear For all the talk about the possibilities of AI and robotics, we’re really only at the embryonic stage of our grand machine-human integration. And AI means different things in different conversations. The most common use refers to machine learning—using statistical models to analyze large quantities of data. The next step from basic machine learning, referred to as deep learning, uses a multilayered architecture of networks, making connections and modeling patterns across data sets. AI can be understood as any machine—defined for our purposes as hardware running digital software—that mimics human behavior (i.e., human reactions).

We check boxes and upload images, and the algorithm learns how to direct us toward a successful connection. Online, we seem to be reduced to a menu of preselected choices. Despite Tinder’s recent announcement about forgoing automated scoring that takes ethnicity and socioeconomic status into account, many dating algorithms still use statistical models that allow them to classify users according to gender, race, sexuality, and other markers. At the same time, we can redefine our communities, seek love outside of our regular circles, and to some extent test the plasticity of our online identity beyond the rigid confines of the physical world.

pages: 133 words: 42,254

Big Data Analytics: Turning Big Data Into Big Money
by Frank J. Ohlhorst
Published 28 Nov 2012

CHALLENGES REMAIN Locating the right talent to analyze data is the biggest hurdle in building a team. Such talent is in high demand, and the need for data analysts and data scientists continues to grow at an almost exponential rate. Finding this talent means that organizations will have to focus on data science and hire statistical modelers and text data–mining professionals as well as people who specialize in sentiment analysis. Success with Big Data analytics requires solid data models, statistical predictive models, and test analytic models, since these will be the core applications needed to do Big Data. Locating the appropriate talent takes more than just a typical IT job placement; the skills required for a good return on investment are not simple and are not solely technology oriented.

pages: 428 words: 121,717

Warnings
by Richard A. Clarke
Published 10 Apr 2017

The deeper they dig, the harder it gets to climb out and see what is happening outside, and the more tempting it becomes to keep on doing what they know how to do . . . uncovering new reasons why their initial inclination, usually too optimistic or pessimistic, was right.” Still, maddeningly, even the foxes, considered as a group, were only ever able to approximate the accuracy of simple statistical models that extrapolated trends. They did perform somewhat better than undergraduates subjected to the same exercises, and they outperformed the proverbial “chimp with a dart board,” but they didn’t come close to the predictive accuracy of formal statistical models. Later books have looked at Tetlock’s foundational results in some additional detail. Dan Gardner’s 2012 Future Babble draws on recent research in psychology, neuroscience, and behavioral economics to detail the biases and other cognitive processes that skew our judgment when we try to make predictions about the future.

pages: 1,164 words: 309,327

Trading and Exchanges: Market Microstructure for Practitioners
by Larry Harris
Published 2 Jan 2003

Arbitrageurs generally should be reluctant to trade against markets that quickly and efficiently aggregate new information because the prices in such markets tend to accurately reflect fundamental values. 17.3.2.3 Statistical Arbitrage Statistical arbitrageurs use factor models to generalize the pairs trading strategy to many instruments. Factor models are statistical models that represent instrument returns by a weighted sum of common factors plus an instrument-specific factor. The weights, called factor loadings, are unique for each instrument. The arbitrageur must estimate them. Either statistical arbitrageurs specify the factors, or they use statistical methods to identify the factors from returns data for many instruments.
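
A small Python sketch of the bookkeeping behind such a factor model, with factors, loadings, and instrument-specific returns all invented for illustration:

# Each instrument's return is a weighted sum of common factors plus its own
# instrument-specific factor: r_i = sum_k beta_ik * f_k + eps_i.
import numpy as np

factors = np.array([0.010, -0.004])              # returns of two common factors today
loadings = np.array([[1.2, 0.3],                 # factor loadings, one row per instrument
                     [0.8, -0.5],
                     [1.0, 0.9]])
specific = np.array([0.002, -0.001, 0.000])      # instrument-specific returns

returns = loadings @ factors + specific
print(returns)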

The variance of a set of price changes is the average squared difference between the price change and the average price change. The standard deviation is the square root of the variance. The mean absolute deviation is the average absolute difference between the price change and the average price change. Statistical models are necessary to identify and estimate the two components of total volatility. These models exploit the primary distinguishing characteristics of the two types of volatility: Fundamental volatility consists of seemingly random price changes that do not revert, whereas transitory volatility consists of price changes that ultimately revert.

Roll showed that we can estimate the latter term from the expected serial covariance. It is Cov(Δp_t, Δp_{t−1}) = −s²/4, where s is the bid/ask spread. Inverting this expression gives s = 2√(−Cov(Δp_t, Δp_{t−1})). Roll’s serial covariance spread estimator substitutes the sample serial covariance for the expected serial covariance in this last expression. * * * The simplest statistical model that can estimate these variance components is Roll’s serial covariance spread estimator model. Roll analyzed this simple model to create a simple serial covariance estimator of bid/ask spreads. The model assumes that fundamental values follow a random walk, and that observed prices are equal to fundamental value plus or minus half of the bid/ask spread.
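
A hedged Python sketch of the estimator on a made-up price series: compute price changes, take the first-order serial covariance, and, when it is negative as the model requires, invert it to get the spread:

# Roll's serial-covariance spread estimator: under the model, the first-order
# serial covariance of price changes equals -s^2/4, so s = 2 * sqrt(-cov).
# The price series is invented for illustration.
import numpy as np

prices = np.array([20.00, 20.05, 20.00, 20.05, 20.05, 20.00, 20.05, 20.00])
dp = np.diff(prices)                       # price changes
cov = np.cov(dp[:-1], dp[1:])[0, 1]        # first-order serial covariance

if cov < 0:
    spread = 2 * np.sqrt(-cov)
    print(f"estimated bid/ask spread: {spread:.3f}")
else:
    # Positive sample covariances do occur in real data; the estimator is then undefined.
    print("positive serial covariance -- Roll estimator undefined for this sample")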

pages: 199 words: 47,154

Gnuplot Cookbook
by Lee Phillips
Published 15 Feb 2012

These new features include the use of Unicode characters, transparency, new graph positioning commands, plotting objects, internationalization, circle plots, interactive HTML5 canvas plotting, iteration in scripts, lua/tikz/LaTeX integration, cairo and SVG terminal drivers, and volatile data. What this book covers Chapter 1, Plotting Curves, Boxes, Points, and more, covers the basic usage of Gnuplot: how to make all kinds of 2D plots for statistics, modeling, finance, science, and more. Chapter 2, Annotating with Labels and Legends, explains how to add labels, arrows, and mathematical text to our plots. Chapter 3, Applying Colors and Styles, covers the basics of colors and styles in gnuplot, plus transparency, and plotting with points and objects.

Syntactic Structures
by Noam Chomsky
Published 17 Oct 2008

We shall see, in fact, in § 7, that there are deep structural reasons for distinguishing (3) and (4) from (5) and (6); but before we are able to find an explanation for such facts as these we shall have to carry the theory of syntactic structure a good deal beyond its familiar limits. 2.4 Third, the notion "grammatical in English" cannot be identified in any way with the notion "high order of statistical approximation to English." It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally 'remote' from English. Yet (1), though nonsensical, is grammatical, while (2) is not. Presented with these sentences, a speaker of English will read (1) with a normal sentence intonation, but he will read (2) with a falling intonation on each word; in fact, with just the intonation pattern given to any sequence of unrelated words.

pages: 624 words: 127,987

The Personal MBA: A World-Class Business Education in a Single Volume
by Josh Kaufman
Published 2 Feb 2011

MBA programs teach many worthless, outdated, even outright damaging concepts and practices—assuming your goal is to actually build a successful business and increase your net worth. Many of my MBA-holding readers and clients come to me after spending tens (sometimes hundreds) of thousands of dollars learning the ins and outs of complex financial formulas and statistical models, only to realize that their MBA program didn’t teach them how to start or improve a real, operating business. That’s a problem—graduating from business school does not guarantee having a useful working knowledge of business when you’re done, which is what you actually need to be successful. 3.

Over time, managers and executives began using statistics and analysis to forecast the future, relying on databases and spreadsheets in much the same way ancient seers relied on tea leaves and goat entrails. The world itself is no less unpredictable or uncertain: as in the olden days, the signs only “prove” the biases and desires of the soothsayer. The complexity of financial transactions and the statistical models those transactions relied upon continued to grow until few practitioners fully understood how they worked or respected their limits. As Wired revealed in a February 2009 article, “Recipe for Disaster: The Formula That Killed Wall Street,” the inherent limitations of deified financial formulas such as the Black-Scholes option pricing model, the Gaussian copula function, and the capital asset pricing model (CAPM) played a major role in the tech bubble of 2000 and the housing market and derivatives shenanigans behind the 2008 recession.

pages: 480 words: 138,041

The Book of Woe: The DSM and the Unmaking of Psychiatry
by Gary Greenberg
Published 1 May 2013

If he was going to revise the DSM, Frances told Pincus, then his goal would be stabilizing the system rather than trying to perfect it—or, as he put it to me, “loving the pet, even if it is a mutt5.” Frances thought there was a way to protect the system from both instability and pontificating: meta-analysis, a statistical method that, thanks to advances in computer technology and statistical modeling, had recently allowed statisticians to compile results from large numbers of studies by combining disparate data into common terms. The result was a statistical synthesis by which many different research projects could be treated as one large study. “We needed something that would leave it up to the tables rather than the people,” he told me, and meta-analysis was perfect for the job.

Kraemer seemed to be saying that the point wasn’t to sift through the wreckage and try to prevent another catastrophe but, evidently, to crash the plane and then announce that the destruction could have been a lot worse. To be honest, however, I wasn’t sure. She was not making all that much sense, or maybe I just didn’t grasp the complexities of statistical modeling. And besides, I was distracted by a memory of something Steve Hyman once wrote. Fixing the DSM, finding another paradigm, getting away from its reifications—this, he said, was like “repairing a plane while it is flying.” It was a suggestive analogy, I thought at the time, one that recognized the near impossibility of the task even as it indicated its high stakes—and the necessity of keeping the mechanics from swearing and banging too loudly, lest the passengers start asking for a quick landing and a voucher on another airline.

pages: 444 words: 138,781

Evicted: Poverty and Profit in the American City
by Matthew Desmond
Published 1 Mar 2016

With Jonathan Mijs, I combined all eviction court records between January 17 and February 26, 2011 (the Milwaukee Eviction Court Study period) with information about aspects of tenants’ neighborhoods, procured after geocoding the addresses that appeared in the eviction records. Working with the Harvard Center for Geographic Analysis, I also calculated the distance (in drive miles and time) between tenants’ addresses and the courthouse. Then I constructed a statistical model that attempted to explain the likelihood of a tenant appearing in court based on aspects of that tenant’s case and her or his neighborhood. The model generated only null findings. How much a tenant owed a landlord, her commute time to the courthouse, her gender—none of these factors were significantly related to appearing in court.

All else equal, a 1 percent increase in the percentage of children in a neighborhood is predicted to increase a neighborhood’s evictions by almost 7 percent. These estimates are based on court-ordered eviction records that took place in Milwaukee County between January 1, 2010, and December 31, 2010. The statistical model evaluating the association between a neighborhood’s percentage of children and its number of evictions is a zero-inflated Poisson regression, which is described in detail in Matthew Desmond et al., “Evicting Children,” Social Forces 92 (2013): 303–27. 3. That misery could stick around. At least two years after their eviction, mothers like Arleen still experienced significantly higher rates of depression than their peers.

pages: 504 words: 139,137

Efficiently Inefficient: How Smart Money Invests and Market Prices Are Determined
by Lasse Heje Pedersen
Published 12 Apr 2015

For instance, volatility does not capture well the risk of selling out-the-money options, a strategy with small positive returns on most days but infrequent large crashes. To compute the volatility of a large portfolio, hedge funds need to account for correlations across assets, which can be accomplished by simulating the overall portfolio or by using a statistical model such as a factor model. Another measure of risk is value-at-risk (VaR), which attempts to capture tail risk (non-normality). The VaR measures the maximum loss with a certain confidence, as seen in figure 4.1 below. For example, the VaR is the most that you can lose with a 95% or 99% confidence.
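
A minimal sketch of the historical flavour of this calculation, with simulated daily returns standing in for a real P&L history; the confidence levels and numbers are illustrative only:

# Historical value-at-risk: the 95% (or 99%) VaR is the loss exceeded on only
# about 5% (or 1%) of days in the return sample. Returns here are simulated.
import numpy as np

rng = np.random.default_rng(42)
returns = rng.normal(loc=0.0005, scale=0.01, size=1000)   # daily returns

var_95 = -np.percentile(returns, 5)    # loss exceeded on roughly 5% of days
var_99 = -np.percentile(returns, 1)    # loss exceeded on roughly 1% of days
print(f"95% VaR: {var_95:.2%}   99% VaR: {var_99:.2%}")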

The intermediary makes money when the wave subsides. Then the flows and equilibrium pricing are in the same direction. LHP: Or you might even short at a nickel cheap? MS: You might. Trend following is based on understanding macro developments and what governments are doing. Or they are based on statistical models of price movements. A positive up price tends to result in a positive up price. Here, however, it is not possible to determine whether the trend will continue. LHP: Why do spreads tend to widen during some periods of stress? MS: Well, capital becomes more scarce, both physical capital and human capital, in the sense that there isn’t enough time for intermediaries to understand what is happening in chaotic times.

pages: 186 words: 49,251

The Automatic Customer: Creating a Subscription Business in Any Industry
by John Warrillow
Published 5 Feb 2015

You have taken on a risk in guaranteeing your customer’s roof replacement and need to be paid for placing that bet. The repair job could have cost you $3,000, and then you would have taken an underwriting loss of $1,800 ($1,200−$3,000). Calculating your risk is the primary challenge of running a peace-of-mind model company. Big insurance companies employ an army of actuaries who use statistical models to predict the likelihood of a claim being made. You don’t need to be quite so scientific. Instead, start by looking back at the last 20 roofs you’ve installed with a guarantee and figure out how many service calls you needed to make. That will give you a pretty good idea of the possible risk of offering a peace-of-mind subscription.
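
A back-of-the-envelope version of that calculation, with every number invented for illustration:

# Estimate the claim rate and average claim cost from past guaranteed jobs,
# then compare the expected cost with the subscription price. All numbers are
# hypothetical, not from the book.
past_jobs = 20
service_calls = 3
avg_cost_per_call = 400          # dollars
annual_subscription = 150        # dollars per customer per year

claim_rate = service_calls / past_jobs
expected_cost = claim_rate * avg_cost_per_call
margin = annual_subscription - expected_cost
print(f"claim rate {claim_rate:.0%}, expected cost ${expected_cost:.0f}, "
      f"expected underwriting margin ${margin:.0f} per customer")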

pages: 222 words: 53,317

Overcomplicated: Technology at the Limits of Comprehension
by Samuel Arbesman
Published 18 Jul 2016

say, 99.9 percent of the time: I made these numbers up for effect, but if any linguist wants to chat, please reach out! “based on millions of specific features”: Alon Halevy et al., “The Unreasonable Effectiveness of Data,” IEEE Intelligent Systems 24, no. 2 (2009): 8–12. In some ways, these statistical models are actually simpler than those that start from seemingly more elegant rules, because the latter end up being complicated by exceptions. sophisticated machine learning techniques: See Douglas Heaven, “Higher State of Mind,” New Scientist 219 (August 10, 2013), 32–35, available online (under the title “Not Like Us: Artificial Minds We Can’t Understand”): http://complex.elte.hu/~csabai/simulationLab/AI_08_August_2013_New_Scientist.pdf.

pages: 172 words: 51,837

How to Read Numbers: A Guide to Statistics in the News (And Knowing When to Trust Them)
by Tom Chivers and David Chivers
Published 18 Mar 2021

But that’s probably OK, because devastating pandemics come along less than once every twenty years, so it shouldn’t be inside your 95 per cent forecast.) As a reader, you need to be aware of how forecasts are made, and you need to know that they are not mystical insights into fate – but nor are they random guesses. They’re the outputs of statistical models, which can be more or less accurate; and the very precise numbers (1.2 per cent, 50,000 deaths, whatever) are central estimates inside a much bigger range of uncertainty. The media, more importantly, has a duty to report that uncertainty, because being told ‘the economy will grow by 1.2 per cent this year’ might bring forth a very different response to being told ‘the economy might shrink a bit or it might grow quite a lot or it might do anything in between, but our best guess is somewhere around 1.2 per cent growth.’

Beautiful Data: The Stories Behind Elegant Data Solutions
by Toby Segaran and Jeff Hammerbacher
Published 1 Jul 2009

Although this is a fairly simple application, it highlights the distributed nature of the solution, combining open data with free visualization methods from multiple sources. More importantly, the distributed nature of the system and free accessibility of the data allow experts in different domains—experimentalists generating data, software developers creating interfaces, and computational modelers creating statistical models—to easily couple their expertise. The true promise of open data, open services, and the ecosystem that supports them is that this coupling can occur without requiring any formal collaboration. Researchers will find and use the data in ways that the generators of that data never considered. By doing this they add value to the original data set and strengthen the ecosystem around it, whether they are performing complementary experiments, doing new analyses, or providing new services that process the data.

We try to apply the following template: • “Figure X shows…” • “Each point (or line) in the graph represents…” • “The separate graphs indicate…” • “Before making this graph, we did…which didn’t work, because…” • “A natural extension would be…” We do not have a full theory of statistical graphics—our closest attempt is to link exploratory graphical displays to checking the fit of statistical models (Gelman 2003)—but we hope that this small bit of structure can help readers in their own efforts. We think of our graphs not as beautiful standalone artifacts but rather as tools to help us understand beautiful reality. We illustrate using examples from our own work, not because our graphs are particularly beautiful, but because in these cases we know the story behind each plot.

pages: 566 words: 155,428

After the Music Stopped: The Financial Crisis, the Response, and the Work Ahead
by Alan S. Blinder
Published 24 Jan 2013

To date, there have been precious few studies of the broader effects of this grab bag of financial-market policies. The only one I know of that even attempts to estimate the macroeconomic impacts of the entire potpourri was published in July 2010 by Mark Zandi and me. Our methodology was pretty simple—and very standard. Take a statistical model of the U.S. economy—we used the Moody’s Analytics model—and simulate it both with and without the policies. The differences between the two simulations are then estimates of the effects of the policies. These estimates, of course, are only as good as the model, but ours were huge. By 2011, we estimated, real GDP was about 6 percent higher, the unemployment rate was nearly 3 percentage points lower, and 4.8 million more Americans were employed because of the financial-market policies (as compared with sticking with laissez-faire).
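The Moody's Analytics model itself is large and proprietary; the toy sketch below only illustrates the with-and-without-policy simulation logic the passage describes, with every number invented.

    def simulate_gdp(years, baseline_growth, policy_boost):
        """Toy GDP path: constant growth plus an assumed policy effect."""
        gdp = [100.0]                                  # index level at the start
        for _ in range(years):
            gdp.append(gdp[-1] * (1 + baseline_growth + policy_boost))
        return gdp

    with_policy = simulate_gdp(3, baseline_growth=0.01, policy_boost=0.02)
    without_policy = simulate_gdp(3, baseline_growth=0.01, policy_boost=0.0)

    # The estimated effect of the policies is the gap between the two simulations.
    effect = 100 * (with_policy[-1] / without_policy[-1] - 1)
    print(f"estimated GDP effect after 3 years: {effect:.1f}%")

As the passage notes, estimates produced this way are only as good as the model that generated both paths.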

The standard analysis of conventional monetary policy—what we teach in textbooks and what central bankers are raised on—is predicated, roughly speaking, on constant risk spreads. When the Federal Reserve lowers riskless interest rates, like those on federal funds and T-bills, riskier interest rates, like those on corporate lending and auto loans, are supposed to follow suit.* The history on which we economists base our statistical models looks like that. Figure 9.1 shows the behavior of the interest rates on 10-year Treasuries (the lower line) and Moody’s Baa corporate bonds (the upper line) over the period from January 1980 through June 2007, just before the crisis got started. The spread between these two rates is the vertical distance between the two lines, and the fact that they look roughly parallel means that the spread did not change much over those twenty-seven years.

pages: 517 words: 147,591

Small Wars, Big Data: The Information Revolution in Modern Conflict
by Eli Berman , Joseph H. Felter , Jacob N. Shapiro and Vestal Mcintyre
Published 12 May 2018

He found that experiencing an indiscriminate attack was associated with a more than 50 percent decrease in the rate of insurgent attacks in a village—which amounts to a 24.2 percent reduction relative to the average.59 Furthermore, the correlation between the destructiveness of the random shelling and subsequent insurgent violence from that village was either negative or statistically insignificant, depending on the exact statistical model.60 While it’s not clear how civilians subject to these attacks interpreted them, what is clear is that in this case objectively indiscriminate violence by the government reduced local insurgent activity. Both of these studies are of asymmetric conflicts, and while the settings differ in important ways, each provides evidence that is not obviously consistent with the model.

Looking at subsequent village council elections, villages that had the training centers installed were much more likely to have a candidate from the PMLN place in the top two positions. The odds of a PMLN candidate either winning or being runner-up rose by 10 to 20 percentage points (depending on the statistical model). While other studies have shown that provision of public goods can sway attitudes, the effect is not usually so large. Remember, the training was funded and was going to be provided anyway. On the other hand, villages where vouchers were distributed for training elsewhere—making them less useful to men and virtually unusable by women—saw no increased support for the PMLN.

pages: 207 words: 57,959

Little Bets: How Breakthrough Ideas Emerge From Small Discoveries
by Peter Sims
Published 18 Apr 2011

One of the men in charge of U.S. strategy in the war for many years was Robert McNamara, secretary of defense under Presidents Kennedy and Johnson. McNamara was known for his enormous intellect, renowned for achievements at Ford Motors (where he was once president) and in government. Many considered him the best management mind of his era. During World War II, McNamara had gained acclaim for developing statistical models to optimize the destruction from bombing operations over Japan. The challenge of Vietnam, however, proved to be different in ways that exposed the limits of McNamara’s approach. McNamara assumed that increased bombing in Vietnam would reduce the Viet Cong resistance with some degree of proportionality, but it did not.

pages: 244 words: 58,247

The Gone Fishin' Portfolio: Get Wise, Get Wealthy...and Get on With Your Life
by Alexander Green
Published 15 Sep 2008

Or take Long Term Capital Management (LTCM). LTCM was a hedge fund created in 1994 with the help of two Nobel Prize-winning economists. The fund incorporated a complex mathematical model designed to profit from inefficiencies in world bond prices. The brilliant folks in charge of the fund used a statistical model that they believed eliminated risk from the investment process. And if you’ve eliminated risk, why not bet large? So they did, accumulating positions totaling $1.25 trillion. Of course, they hadn’t really eliminated risk. And when Russia defaulted on its sovereign debt in 1998, the fund blew up.

pages: 190 words: 62,941

Wild Ride: Inside Uber's Quest for World Domination
by Adam Lashinsky
Published 31 Mar 2017

Kalanick bragged about the advanced math that went into Uber’s calculation of when riders should expect their cars to show up. Uber’s “math department,” as he called it, included a computational statistician, a rocket scientist, and a nuclear physicist. They were running, he informed me, a Gaussian process emulation—a fancy statistical model—to improve on data available from Google’s mapping products. “Our estimates are far superior to Google’s,” Kalanick said. I was witnessing for the first time the cocksure Kalanick. I told him I had an idea for a market for Uber. I had recently sent a babysitter home in an Uber, a wonderful convenience because I could pay with my credit card from Uber’s app and then monitor the car’s progress on my phone to make sure the sitter got home safely.
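Uber's actual system is not public; purely as an illustration of the idea, here is a Gaussian process fit with scikit-learn that corrects a baseline travel-time estimate using observed arrivals, with all data invented.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    # Hypothetical pairs: a mapping service's baseline ETA (minutes) vs. what happened.
    baseline_eta = np.array([[3.0], [5.0], [8.0], [12.0], [15.0]])
    observed_eta = np.array([4.1, 6.0, 8.5, 14.0, 18.2])

    gp = GaussianProcessRegressor(kernel=RBF(length_scale=5.0), alpha=0.5)
    gp.fit(baseline_eta, observed_eta)

    mean, std = gp.predict(np.array([[10.0]]), return_std=True)
    print(f"corrected ETA: {mean[0]:.1f} min (+/- {std[0]:.1f})")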

pages: 219 words: 63,495

50 Future Ideas You Really Need to Know
by Richard Watson
Published 5 Nov 2013

One day, we may, for example, develop a tiny chip that can hold the full medical history of a person including any medical conditions, allergies, prescriptions and contact information (this is already planned in America). Digital vacuums Digital vacuuming refers to the practice of scooping up vast amounts of data then using mathematical and statistical models to determine content and possible linkages. The data itself can be anything from phone calls in historical or real time (the US company AT&T, for example, holds the records of 1.9 trillion telephone calls) to financial transactions, emails and Internet site visits. Commercial applications could range from predicting future health risks to counterterrorism.

pages: 256 words: 60,620

Think Twice: Harnessing the Power of Counterintuition
by Michael J. Mauboussin
Published 6 Nov 2012

This mistake, I admit, is hard to swallow and is a direct affront to experts of all stripes. But it is also among the best documented findings in the social sciences. In 1954, Paul Meehl, a psychologist at the University of Minnesota, published a book that reviewed studies comparing the clinical judgment of experts (psychologists and psychiatrists) with linear statistical models. He made sure the analysis was done carefully so he could be confident that the comparisons were fair. In study after study, the statistical methods exceeded or matched the expert performance.16 More recently, Philip Tetlock, a psychologist at the University of California, Berkeley, completed an exhaustive study of expert predictions, including twenty-eight thousand forecasts made by three hundred experts hailing from sixty countries over fifteen years.

Logically Fallacious: The Ultimate Collection of Over 300 Logical Fallacies (Academic Edition)
by Bo Bennett
Published 29 May 2017

But if you cross the line, hopefully you are with people who care about you enough to tell you. Tip: People don’t like to be made to feel inferior. You need to know when showing tact and restraint is more important than being right. Ludic Fallacy ludus Description: Assuming flawless statistical models apply to situations where they actually don’t. This can result in over-confidence in probability theory or simply not knowing exactly where it applies, as opposed to chaotic situations or situations with external influences too subtle or numerous to predict. Example #1: The best example of this fallacy is presented by the person who coined this term, Nassim Nicholas Taleb in his 2007 book, The Black Swan.

pages: 504 words: 89,238

Natural language processing with Python
by Steven Bird , Ewan Klein and Edward Loper
Published 15 Dec 2009

Structure of the published TIMIT Corpus: The CD-ROM contains doc, train, and test directories at the top level; the train and test directories both have eight sub-directories, one per dialect region; each of these contains further subdirectories, one per speaker; the contents of the directory for female speaker aks0 are listed, showing 10 wav files accompanied by a text transcription, a word-aligned transcription, and a phonetic transcription. There is a split between training and testing sets, which gives away its intended use for developing and evaluating statistical models. Finally, notice that even though TIMIT is a speech corpus, its transcriptions and associated data are just text, and can be processed using programs just like any other text corpus. Therefore, many of the computational methods described in this book are applicable. Moreover, notice that all of the data types included in the TIMIT Corpus fall into the two basic categories of lexicon and text, which we will discuss later.

For example, one intermediate position is to assume that humans are innately endowed with analogical and memory-based learning methods (weak rationalism), and use these methods to identify meaningful patterns in their sensory language experience (empiricism). We have seen many examples of this methodology throughout this book. Statistical methods inform symbolic models anytime corpus statistics guide the selection of productions in a context-free grammar, i.e., “grammar engineering.” Symbolic methods inform statistical models anytime a corpus that was created using rule-based methods is used as a source of features for training a statistical language model, i.e., “grammatical inference.” The circle is closed. NLTK Roadmap The Natural Language Toolkit is a work in progress, and is being continually expanded as people contribute code.

Debtor Nation: The History of America in Red Ink (Politics and Society in Modern America)
by Louis Hyman
Published 3 Jan 2011

In computer models, feminist credit advocates believed they had found the solution to discriminatory lending, ushering in the contemporary calculated credit regimes under which we live today. Yet removing such basic demographics from any model was not as straightforward as the authors of the ECOA had hoped because of how all statistical models function, but which legislators seem to not have fully understood. The “objective” credit statistics that legislators had pined for during the early investigations of the Consumer Credit Protection Act could now exist, but with new difficulties that stemmed from using regressions and not human judgment to decide on loans.

The higher the level of education and income, the lower the effective interest rate paid, since such users tended more frequently to be non-revolvers.96 The researchers found that young, large, low-income families who could not save for major purchases, paid finance charges, while their opposite, older, smaller, high-income families who could save for major purchases, did not pay finance charges. Effectively the young and poor cardholders subsidized the convenience of the old and rich.97 And white.98 The new statistical models revealed that the second best predictor of revolving debt, after a respondent’s own “self-evaluation of his or her ability to save,” was race.99 But what these models revealed was that the very group—African Americans—that the politicians wanted to increase credit access to, tended to revolve their credit more than otherwise similar white borrowers.

pages: 632 words: 166,729

Addiction by Design: Machine Gambling in Las Vegas
by Natasha Dow Schüll
Published 15 Jan 2012

To most profitably manage player relationships, the industry must determine the specific value of those relationships. “What is the relationship of a particular customer to you, and you to them? Is that customer profitable or not?” asked a Harrah’s executive at G2E in 2008.38 “What is the order of value of that player to me?” echoed Bally’s Rowe.39 Using statistical modeling, casinos “tier” players based on different parameters, assigning each a “customer value” or “theoretical player value”—a value, that is, based on the theoretical revenue they are likely to generate. On a panel called “Patron Rating: The New Definition of Customer Value,” one specialist shared his system for gauging patron worth, recommending that casinos give each customer a “recency score” (how recently he has visited), a “frequency score” (how often he visits), and a “monetary score” (how much he spends), and then create a personalized marketing algorithm out of these variables.40 “We want to maximize every relationship,” Harrah’s Richard Mirman told a journalist.41 Harrah’s statistical models for determining player value, similar to those used for predicting stocks’ future worth, are the most advanced in the industry.

On a panel called “Patron Rating: The New Definition of Customer Value,” one specialist shared his system for gauging patron worth, recommending that casinos give each customer a “recency score” (how recently he has visited), a “frequency score” (how often he visits), and a “monetary score” (how much he spends), and then create a personalized marketing algorithm out of these variables.40 “We want to maximize every relationship,” Harrah’s Richard Mirman told a journalist.41 Harrah’s statistical models for determining player value, similar to those used for predicting stocks’ future worth, are the most advanced in the industry. The casino franchise, which maintains ninety different demographic segments for its customers, has determined that player value is most strongly associated with frequency of play, type of game played, and the number of coins played per spin or hand.
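A minimal sketch of the recency/frequency/monetary scoring the panelist describes; the cut-offs, point values, and tier labels below are all invented for illustration.

    def patron_tier(days_since_visit, visits_per_month, monthly_spend):
        """Toy RFM score: each dimension scored 1-3, then summed into a tier."""
        recency = 3 if days_since_visit <= 7 else 2 if days_since_visit <= 30 else 1
        frequency = 3 if visits_per_month >= 8 else 2 if visits_per_month >= 2 else 1
        monetary = 3 if monthly_spend >= 1000 else 2 if monthly_spend >= 200 else 1
        score = recency + frequency + monetary
        return "high value" if score >= 8 else "mid value" if score >= 5 else "low value"

    print(patron_tier(days_since_visit=3, visits_per_month=10, monthly_spend=2500))  # high value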

pages: 569 words: 165,510

There Is Nothing for You Here: Finding Opportunity in the Twenty-First Century
by Fiona Hill
Published 4 Oct 2021

The digital divide in this case was manifested not by inadequate technological hardware and bandwidth but rather by the ones and zeroes that flowed through it and the human biases that they channeled. With schools closed and students in lockdown to stem disease transmission, the spring A-level exams were canceled. The UK government’s national exams and assessments regulatory board, known by its awkward acronym, Ofqual, decided to use a standardized statistical model instead of the exam to determine students’ grades. Teachers were instructed to submit grade predictions, but the national exam board then adjusted these using the algorithm they had devised. This drew on the historic data of the school and the results of previous students taking the same subject-based exams.

If this had been the approach to A-levels in 1984, my friends and I would surely have fallen into that unfortunate category. Bishop Barrington Comprehensive School had only a few years of A-level results in a smattering of subjects. There would have been no “historic” data for Ofqual to plug into its statistical model. In French, I didn’t even have a teacher to offer a prediction. I had been studying on my own in the months leading up to the exam. I could hardly have written my own assessment and would probably have been assigned an “unclassified” grade. Reading about the debacle from afar, I felt white-hot with sympathetic rage reading the students’ stunned comments.

pages: 204 words: 67,922

Elsewhere, U.S.A: How We Got From the Company Man, Family Dinners, and the Affluent Society to the Home Office, BlackBerry Moms,and Economic Anxiety
by Dalton Conley
Published 27 Dec 2008

Should they have gotten a discount since the first word of their brand is also the first word of American Airlines and thereby reinforces—albeit in a subtle way—the host company’s image? In order to know the value of the deal, they would have had to know how much the marketing campaign increases their business. Impossible. No focus group or statistical model will tell Amex how much worse or better their bottom line would have been in the absence of this marketing campaign. Ditto for the impact of billboards, product placement, and special promotions like airline mileage plans. There are simply too many other forces that come into play to be able to isolate the impact of a specific effort.

pages: 305 words: 69,216

A Failure of Capitalism: The Crisis of '08 and the Descent Into Depression
by Richard A. Posner
Published 30 Apr 2009

Quantitative models of risk—another fulfillment of Weber's prophecy that more and more activities would be brought under the rule of rationality— are also being blamed for the financial crisis. Suppose a trader is contemplating the purchase of a stock using largely borrowed money, so that if the stock falls even a little way the loss will be great. He might consult a statistical model that predicted, on the basis of the ups and downs of the stock in the preceding two years, the probability distribution of the stock's behavior over the next few days or weeks. The criticism is that the model would have based the prediction on market behavior during a period of rising stock values; the modeler should have gone back to the 1980s or earlier to get a fuller picture of the riskiness of the stock.

Exploring Everyday Things with R and Ruby
by Sau Sheong Chang
Published 27 Jun 2012

LOESS is not suitable for a large number of data points, however, because it scales on an O(n²) basis in memory, so instead we use the mgcv library and its gam method. We also send in the formula y~s(x), where s is the smoother function for GAM. GAM stands for generalized additive model, which is a statistical model used to describe how items of data relate to each other. In our case, we use GAM as an algorithm in the smoother to provide us with a reasonably good estimation of how a large number of data points can be visualized. In Figure 8-5, you can see that the population of roids fluctuates over time between two extremes caused by the oversupply and exhaustion of food, respectively.
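mgcv and its gam call are R; as a rough Python stand-in for the general idea of fitting a smooth curve through many noisy points (not the book's actual GAM), here is a smoothing spline from SciPy with invented data.

    import numpy as np
    from scipy.interpolate import UnivariateSpline

    rng = np.random.default_rng(2)
    x = np.linspace(0, 10, 200)
    y = np.sin(x) + rng.normal(scale=0.3, size=x.size)     # noisy observations

    # The smoothing factor s plays a role loosely analogous to s(x) in the GAM
    # formula: it trades closeness of fit against wiggliness of the curve.
    smoother = UnivariateSpline(x, y, s=len(x) * 0.09)
    y_smooth = smoother(x)
    print(y_smooth[:5])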

pages: 242 words: 68,019

Why Information Grows: The Evolution of Order, From Atoms to Economies
by Cesar Hidalgo
Published 1 Jun 2015

GNP considers the goods and services produced by the citizens of a country, whether or not those goods are produced within the boundaries of the country. 5. Simon Kuznets, “Modern Economic Growth: Findings and Reflections,” American Economic Review 63, no. 3 (1973): 247–258. 6. Technically, total factor productivity is the residual or error term of the statistical model. Also, economists often refer to total factor productivity as technology, although this is a semantic deformation that is orthogonal to the definition of technology used by anyone who has ever developed a technology. In the language of economics, technology is the ability to do more—of anything—with the same cost.

pages: 239 words: 70,206

Data-Ism: The Revolution Transforming Decision Making, Consumer Behavior, and Almost Everything Else
by Steve Lohr
Published 10 Mar 2015

“The altered field,” he wrote, “will be called ‘data science.’” In his paper, Cleveland, who is now a professor of statistics and computer science at Purdue University, described the contours of this new field. Data science, he said, would touch all disciplines of study and require the development of new statistical models, new computing tools, and educational programs in schools and corporations. Cleveland’s vision of a new field is now rapidly gaining momentum. The federal government, universities, and foundations are funding data science initiatives. Nearly all of these efforts are multidisciplinary melting pots that seek to bring together teams of computer scientists, statisticians, and mathematicians with experts who bring piles of data and unanswered questions from biology, astronomy, business and finance, public health, and elsewhere.

Once the American Dream: Inner-Ring Suburbs of the Metropolitan United States
by Bernadette Hanlon
Published 18 Dec 2009

In this study, he suggests the role of population growth was somewhat exaggerated but finds other characteristics much more pertinent. Aside from population growth, he includes the variables of suburban age, initial suburban status levels, the suburbs’ geographic locations, suburban racial makeup, and employment specialization in his statistical model. He finds (1979: 946) that suburban age, the percentage of black inhabitants, and employment specialization within a suburb affected its then-current status (in 1970) “inasmuch as they also affected earlier (1960) status levels.” He describes how a suburb’s initial, established “ecological niche” was a great determinant of its future status.

pages: 210 words: 65,833

This Is Not Normal: The Collapse of Liberal Britain
by William Davies
Published 28 Sep 2020

In this new world, data is captured first, research questions come later. In the long term, the implications of this will likely be as profound as the invention of statistics was in the late seventeenth century. The rise of ‘big data’ provides far greater opportunities for quantitative analysis than any amount of polling or statistical modelling. But it is not just the quantity of data that is different. It represents an entirely different type of knowledge, accompanied by a new mode of expertise. First, there is no fixed scale of analysis (such as the nation), nor are there any settled categories (such as ‘unemployed’). These vast new data sets can be mined in search of patterns, trends, correlations and emergent moods, which becomes a way of tracking the identities people bestow upon themselves (via hashtags and tags) rather than imposing classifications on them.

pages: 227 words: 63,186

An Elegant Puzzle: Systems of Engineering Management
by Will Larson
Published 19 May 2019

“Availability in Globally Distributed Storage Systems” This paper explores how to think about availability in replicated distributed systems, and is a useful starting point for those of us who are trying to determine the correct way to measure uptime for our storage layer or for any other sufficiently complex system. From the abstract: We characterize the availability properties of cloud storage systems based on an extensive one-year study of Google’s main storage infrastructure and present statistical models that enable further insight into the impact of multiple design choices, such as data placement and replication strategies. With these models we compare data availability under a variety of system parameters given the real patterns of failures observed in our fleet. Particularly interesting is the focus on correlated failures, building on the premise that users of distributed systems only experience the failure when multiple components have overlapping failures.
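The paper's key finding, that correlated failures dominate availability, can be sketched with a two-line calculation; the failure probabilities and replication factor below are assumptions for illustration, not figures from the paper.

    p_fail = 0.01      # assumed chance a single replica is unavailable
    p_shared = 0.001   # assumed chance of a correlated fault (rack, power, network)
    replicas = 3

    # Independent failures only: data is unavailable when all replicas are down at once.
    p_loss_independent = p_fail ** replicas
    # With correlated failures, one shared fault takes out every replica together.
    p_loss_correlated = p_shared + (1 - p_shared) * p_fail ** replicas

    print(p_loss_independent, p_loss_correlated)   # 1e-06 vs roughly 1e-03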

pages: 666 words: 181,495

In the Plex: How Google Thinks, Works, and Shapes Our Lives
by Steven Levy
Published 12 Apr 2011

Because Och and his colleagues knew they would have access to an unprecedented amount of data, they worked from the ground up to create a new translation system. “One of the things we did was to build very, very, very large language models, much larger than anyone has ever built in the history of mankind.” Then they began to train the system. To measure progress, they used a statistical model that, given a series of words, would predict the word that came next. Each time they doubled the amount of training data, they got a .5 percent boost in the metrics that measured success in the results. “So we just doubled it a bunch of times.” In order to get a reasonable translation, Och would say, you might feed something like a billion words to the model.
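Google's production models were far larger and more sophisticated; the toy below only shows the basic mechanic of predicting the next word from counts in training text, with made-up training sentences.

    from collections import Counter, defaultdict

    training_text = (
        "the model predicts the next word the model learns from data "
        "the data improves the model"
    ).split()

    # Count, for each word, which words follow it in the training text.
    following = defaultdict(Counter)
    for current, nxt in zip(training_text, training_text[1:]):
        following[current][nxt] += 1

    def predict_next(word):
        """Return the most frequently observed continuation of `word`."""
        return following[word].most_common(1)[0][0] if following[word] else None

    print(predict_next("the"))   # 'model', the most common continuation here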

“We are trying to understand the mechanisms behind the metrics,” says Qing Wu, a decision support analyst at Google. His specialty was forecasting. He could predict patterns of queries from season to season, in different parts of the day, and the climate. “We have the temperature data, we have the weather data, and we have the queries data so we can do correlation and statistical modeling.” To make sure that his predictions were on track, Qing Wu and his colleagues made use of dozens of onscreen dashboards with information flowing through them, a Bloomberg of the Googlesphere. “With a dashboard you can monitor the queries, the amount of money you make, how many advertisers we have, how many keywords they’re bidding on, what the ROI is for each advertiser.”

pages: 741 words: 179,454

Extreme Money: Masters of the Universe and the Cult of Risk
by Satyajit Das
Published 14 Oct 2011

HE (home equity) and HELOC (home equity line of credit), borrowing against the equity in existing homes, became prevalent. Empowered by high-tech models, lenders loaned to less creditworthy borrowers, believing they could price any risk. Ben Bernanke shared his predecessor Alan Greenspan’s faith: “banks have become increasingly adept at predicting default risk by applying statistical models to data, such as credit scores.” Bernanke concluded that banks “have made substantial strides...in their ability to measure and manage risks.”13 Innovative affordability products included jumbo and super jumbo loans that did not conform to guidelines because of their size. More risky than prime but less risky than subprime, Alt A (Alternative A) mortgages were for borrowers who did not meet normal criteria.

Although Moody’s reversed the upgrades, all three banks collapsed in 2008. Unimpeded by insufficient disclosure, lack of information transparency, fraud, and improper accounting, traders anticipated these defaults, marking down bond prices well before rating downgrades. Rating-structured securities required statistical models, mapping complex securities to historical patterns of default on normal bonds. With mortgage markets changing rapidly, this was like “using weather in Antarctica to forecast conditions in Hawaii.”17 Antarctica from 100 years ago! The agencies did not look at the underlying mortgages or loans in detail, relying instead on information from others.

pages: 757 words: 193,541

The Practice of Cloud System Administration: DevOps and SRE Practices for Web Services, Volume 2
by Thomas A. Limoncelli , Strata R. Chalup and Christina J. Hogan
Published 27 Aug 2014

Standard capacity planning is sufficient for small sites, sites that grow slowly, and sites with simple needs. It is insufficient for large, rapidly growing sites. They require more advanced techniques. Advanced capacity planning is based on core drivers, capacity limits of individual resources, and sophisticated data analysis such as correlation, regression analysis, and statistical models for forecasting. Regression analysis finds correlations between core drivers and resources. Forecasting uses past data to predict future needs. With sufficiently large sites, capacity planning is a full-time job, often done by project managers with technical backgrounds. Some organizations employ full-time statisticians to build complex models and dashboards that provide the information required by a project manager.
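A minimal sketch of the core-driver approach: regress a resource on its driver, forecast the driver, then project the resource. All of the numbers below are invented.

    import numpy as np

    # Hypothetical history: monthly active users (core driver) vs. storage used (TB).
    users_millions = np.array([1.0, 1.2, 1.5, 1.9, 2.4])
    storage_tb = np.array([52, 60, 77, 95, 121])

    # Regression: how much storage does each additional million users cost?
    slope, intercept = np.polyfit(users_millions, storage_tb, deg=1)

    # Forecast the driver (say 3.0M users next quarter) and project the resource.
    projected_tb = slope * 3.0 + intercept
    print(f"{slope:.1f} TB per million users; projected need: {projected_tb:.0f} TB")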

Capacity planning involves the technical work of understanding how many resources are needed per unit of growth, plus non-technical aspects such as budgeting, forecasting, and supply chain management. These topics are covered in Chapter 18. Sample Assessment Questions • How much capacity do you have now? • How much capacity do you expect to need three months from now? Twelve months from now? • Which statistical models do you use for determining future needs? • How do you load-test? • How much time does capacity planning take? What could be done to make it easier? • Are metrics collected automatically? • Are metrics available always or does their need initiate a process that collects them? • Is capacity planning the job of no one, everyone, a specific person, or a team of capacity planners?

pages: 274 words: 75,846

The Filter Bubble: What the Internet Is Hiding From You
by Eli Pariser
Published 11 May 2011

If Netflix shows me a romantic comedy and I like it, it’ll show me another one and begin to think of me as a romantic-comedy lover. But if it wants to get a good picture of who I really am, it should be constantly testing the hypothesis by showing me Blade Runner in an attempt to prove it wrong. Otherwise, I end up caught in a local maximum populated by Hugh Grant and Julia Roberts. The statistical models that make up the filter bubble write off the outliers. But in human life it’s the outliers who make things interesting and give us inspiration. And it’s the outliers who are the first signs of change. One of the best critiques of algorithmic prediction comes, remarkably, from the late-nineteenth-century Russian novelist Fyodor Dostoyevsky, whose Notes from Underground was a passionate critique of the utopian scientific rationalism of the day.

pages: 291 words: 77,596

Total Recall: How the E-Memory Revolution Will Change Everything
by Gordon Bell and Jim Gemmell
Published 15 Feb 2009

“World Explorer: Visualizing Aggregate Data from Unstructured Text in Geo-Referenced Collections.” In Proceedings, Seventh ACM/IEEE-CS Joint Conference on Digital Libraries ( JCDL 07), June 2007. The Stuff I’ve Seen project did some experiments that showed how displaying milestones alongside a timeline may help orient the user. Horvitz et al. used statistical models to infer the probability that users will consider events to be memory landmarks. Ringel, M., E. Cutrell, S. T. Dumais, and E. Horvitz. 2003. “Milestones in Time: The Value of Landmarks in Retrieving Information from Personal Stores.” Proceedings of IFIP Interact 2003. Horvitz, Eric, Susan Dumais, and Paul Koch.

pages: 322 words: 77,341

I.O.U.: Why Everyone Owes Everyone and No One Can Pay
by John Lanchester
Published 14 Dec 2009

That means it should statistically have happened only once every 3 billion years. And it wasn’t the only one. The last decades have seen numerous 5-, 6-, and 7-sigma events. Those are supposed to happen, respectively, one day in every 13,932 years, one day in every 4,039,906 years, and one day in every 3,105,395,365 years. Yet no one concluded from this that the statistical models in use were wrong. The mathematical models simply didn’t work in a crisis. They worked when they worked, which was most of the time; but the whole point of them was to assess risk, and some risks by definition happen at the edges of known likelihoods. The strange thing is that this is strongly hinted at in the VAR model, as propounded by its more philosophically minded defenders such as Philippe Jorion: it marks the boundaries of the known world, up to the VAR break, and then writes “Here be Dragons.”
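Those frequencies follow directly from the normal distribution; assuming one-sided tails and roughly 250 trading days per year, a quick check approximately reproduces the quoted figures.

    from scipy.stats import norm

    trading_days_per_year = 250        # assumption used to convert days into years
    for k in (5, 6, 7):
        p = norm.sf(k)                           # chance of a single-day move beyond k sigma
        years = 1 / p / trading_days_per_year    # expected wait between such days
        print(f"{k}-sigma: once every {years:,.0f} years")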

pages: 279 words: 75,527

Collider
by Paul Halpern
Published 3 Aug 2009

Although this could represent an escaping graviton, more likely possibilities would need to be ruled out, such as the commonplace production of neutrinos. Unfortunately, even a hermetic detector such as ATLAS can’t account for the streams of lost neutrinos that pass unhindered through almost everything in nature—except by estimating the missing momentum and assuming it is all being transferred to neutrinos. Some physicists hope that statistical models of neutrino production would eventually prove sharp enough to indicate significant differences between the expected and actual pictures. Such discrepancies could prove that gravitons fled from collisions and ducked into regions beyond. Another potential means of establishing the existence of extra dimensions would be to look for the hypothetical phenomena called Kaluza-Klein excitations (named for Klein and an earlier unification pioneer, German mathematician Theodor Kaluza).

pages: 306 words: 78,893

After the New Economy: The Binge . . . And the Hangover That Won't Go Away
by Doug Henwood
Published 9 May 2005

Even classic statements of this skills argument, like that of Juhn, Murphy, and Pierce (1993), find that the standard proxies for skill like years of education and years of work experience (proxies being needed because skill is nearly impossible to define or measure) only explain part of the increase in polarization—less than half, in fact. Most of the increase remains unexplained by statistical models, a remainder that is typically attributed to "unobserved" attributes. That is, since conventional economists believe as a matter of faith that market rates of pay are fair compensation for a worker's productive contribution, any inexplicable anomalies in pay must be the result of things a boss can see that elude the academic's model.

pages: 225 words: 11,355

Financial Market Meltdown: Everything You Need to Know to Understand and Survive the Global Credit Crisis
by Kevin Mellyn
Published 30 Sep 2009

Like much of the ‘‘progress’’ of the last century, it was a matter of replacing common sense and tradition with science. The models produced using advanced statistics and computers were designed by brilliant minds from the best universities. At the Basle Committee, which set global standards for bank regulation to be followed by all major central banks, the use of statistical models to measure risk and reliance on the rating agencies were baked into the proposed rules for capital adequacy. The whole thing blew up not because of something obvious like greed. It failed because of the hubris, the fatal pride, of men and women who sincerely thought that they could build computer models that were capable of predicting risk and pricing it correctly.

Deep Work: Rules for Focused Success in a Distracted World
by Cal Newport
Published 5 Jan 2016

It turns out to be really difficult to answer a simple question such as: What’s the impact of our current e-mail habits on the bottom line? Cochran had to conduct a company-wide survey and gather statistics from the IT infrastructure. He also had to pull together salary data and information on typing and reading speed, and run the whole thing through a statistical model to spit out his final result. And even then, the outcome is fungible, as it’s not able to separate out, for example, how much value was produced by this frequent, expensive e-mail use to offset some of its cost. This example generalizes to most behaviors that potentially impede or improve deep work.

pages: 579 words: 76,657

Data Science from Scratch: First Principles with Python
by Joel Grus
Published 13 Apr 2015

Of course, she doesn’t want to write thousands of web pages, nor does she want to pay a horde of “content strategists” to do so. Instead she asks you whether you can somehow programmatically generate these web pages. To do this, we’ll need some way of modeling language. One approach is to start with a corpus of documents and learn a statistical model of language. In our case, we’ll start with Mike Loukides’s essay “What is data science?” As in Chapter 9, we’ll use requests and BeautifulSoup to retrieve the data. There are a couple of issues worth calling attention to. The first is that the apostrophes in the text are actually the Unicode character u"\u2019".
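A minimal sketch of the clean-up step the passage flags, replacing the Unicode right single quote before splitting the text into words; the sample string is invented rather than fetched from the essay.

    import re

    raw = "What\u2019s data science? It\u2019s a discipline."   # sample text
    text = raw.replace("\u2019", "'")        # normalize the curly apostrophe

    # Simple tokenization: lowercase words, keeping internal apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    print(words)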

pages: 589 words: 69,193

Mastering Pandas
by Femi Anthony
Published 21 Jun 2015

P(H|D), the probability of our hypothesis given the observed data, is called the posterior. P(D|H) is the probability of obtaining the data given our hypothesis; this is called the likelihood. Thus, Bayesian statistics amounts to applying Bayes' rule, P(H|D) = P(D|H) × P(H) / P(D), to solve problems in inferential statistics with H representing our hypothesis and D the data. A Bayesian statistical model is cast in terms of parameters, and the uncertainty in these parameters is represented by probability distributions. This is different from the Frequentist approach where the values are regarded as deterministic. An alternative representation is P(θ|x) ∝ P(x|θ) × P(θ), where θ represents our unknown parameters and x our observed data. In Bayesian statistics, we make assumptions about the prior and use the likelihood to update to the posterior probability using Bayes' rule.
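A small worked example of that update, using a beta prior with binomial data so the posterior has a closed form; the prior and the counts are invented.

    from scipy.stats import beta

    prior_a, prior_b = 2, 2        # Beta(2, 2) prior on an unknown success rate
    successes, failures = 7, 3     # observed data

    # Conjugate update: posterior is Beta(prior_a + successes, prior_b + failures).
    posterior = beta(prior_a + successes, prior_b + failures)
    print(posterior.mean())            # about 0.64, between the prior mean (0.5) and the data (0.7)
    print(posterior.interval(0.95))    # 95% credible interval for the rate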

pages: 373 words: 80,248

Empire of Illusion: The End of Literacy and the Triumph of Spectacle
by Chris Hedges
Published 12 Jul 2009

He added that “much of Latin America, former Soviet Union states, and sub-Saharan Africa lack sufficient cash reserves, access to international aid or credit, or other coping mechanism.” “When those growth rates go down, my gut tells me that there are going to be problems coming out of that, and we’re looking for that,” he said. He referred to “statistical modeling” showing that “economic crises increase the risk of regime-threatening instability if they persist over a one- to two-year period.” Blair articulated the newest narrative of fear. As the economic unraveling accelerates, we will be told it is not the bearded Islamic extremists who threaten us most, although those in power will drag them out of the Halloween closet whenever they need to give us an exotic shock, but instead the domestic riffraff, environmentalists, anarchists, unions, right-wing militias, and enraged members of our dispossessed working class.

pages: 280 words: 79,029

Smart Money: How High-Stakes Financial Innovation Is Reshaping Our WorldÑFor the Better
by Andrew Palmer
Published 13 Apr 2015

Public data from a couple of longitudinal studies showing the long-term relationship between education and income in the United States enabled him to build what he describes as “a simple multivariate regression model”—you know the sort, we’ve all built one—and work out the relationships between things such as test scores, degrees, and first jobs on later income. That model has since grown into something whizzier. An applicant’s education, SAT scores, work experience, and other details are pumped into a proprietary statistical model, which looks at people with comparable backgrounds and generates a prediction of that person’s personal income. Upstart now uses these data to underwrite loans to younger people—who often find it hard to raise money because of their limited credit histories. But the model was initially used to determine how much money an applicant could raise for each percentage point of future income they gave away.
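Upstart's model is proprietary; purely as a shape-of-the-thing sketch, here is a multivariate regression from background variables to income, with every number invented and scikit-learn standing in for whatever they actually use.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical training rows: [years of education, SAT score, years of experience].
    X = np.array([[16, 1400, 2], [18, 1500, 4], [14, 1200, 3], [16, 1300, 6], [19, 1550, 5]])
    income = np.array([65000, 95000, 52000, 78000, 110000])

    model = LinearRegression().fit(X, income)
    predicted = model.predict([[17, 1450, 3]])[0]

    # Crude value of selling 1% of income for 10 years, ignoring growth and discounting.
    print(predicted, 0.01 * predicted * 10)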

pages: 277 words: 80,703

Revolution at Point Zero: Housework, Reproduction, and Feminist Struggle
by Silvia Federici
Published 4 Oct 2012

At least since the Zapatistas, on December 31, 1993, took over the zócalo of San Cristóbal to protest legislation dissolving the ejidal lands of Mexico, the concept of the “commons” has gained popularity among the radical Left, internationally and in the United States, appearing as a ground of convergence among anarchists, Marxists/socialists, ecologists, and ecofeminists.1 There are important reasons why this apparently archaic idea has come to the center of political discussion in contemporary social movements. Two in particular stand out. On the one side, there has been the demise of the statist model of revolution that for decades has sapped the efforts of radical movements to build an alternative to capitalism. On the other, the neoliberal attempt to subordinate every form of life and knowledge to the logic of the market has heightened our awareness of the danger of living in a world in which we no longer have access to seas, trees, animals, and our fellow beings except through the cash-nexus.

pages: 251 words: 76,128

Borrow: The American Way of Debt
by Louis Hyman
Published 24 Jan 2012

As credit-rating agencies began to reassess the safety of the AAA mortgage-backed securities, insurance companies had to pony up greater quantities of collateral to guarantee the insurance policies on the bonds. The global credit market rested on a simple assumption: housing prices would always go up. Foreclosures would be randomly distributed, as the statistical models assumed. Yet as those models, and the companies that had created them, began to fail, a shudder ran through the corpus of global capitalism. The insurance giant AIG, which had hoped for so much profit in 1998, watched as its entire business—both traditional and new—went down, supported only by the U.S. government.

Raw Data Is an Oxymoron
by Lisa Gitelman
Published 25 Jan 2013

Data storage of this scale, potentially measured in petabytes, would necessarily require sophisticated algorithmic querying in order to detect informational patterns. For David Gelernter, this type of data management would require “topsight,” a topdown perspective achieved through software modeling and the creation of microcosmic “mirror worlds,” in which raw data filters in from the bottom and the whole comes into focus through statistical modeling and rule and pattern extraction.36 The promise of topsight, in Gelernter’s terms, is a progression from annales to annalistes, from data collection that would satisfy a “neo-Victorian curatorial” drive to data analysis that calculates prediction scenarios and manages risk.37 What would be the locus of suspicion and paranoid fantasy (Poster calls it “database anxiety”) if not such an intricate and operationally efficient system, the aggregating capacity of which easily ups the ante on Thomas Pynchon’s paranoid realization that “everything is connected”?

pages: 238 words: 75,994

A Burglar's Guide to the City
by Geoff Manaugh
Published 17 Mar 2015

* The fundamental premise of the capture-house program is that police can successfully predict what sorts of buildings and internal spaces will attract not just any criminal but a specific burglar, the unique individual each particular capture house was built to target. This is because burglars unwittingly betray personal, as well as shared, patterns in their crimes; they often hit the same sorts of apartments and businesses over and over. But the urge to mathematize this, and to devise complex statistical models for when and where a burglar will strike next, can lead to all sorts of analytical absurdities. A great example of this comes from an article published in the criminology journal Crime, Law and Social Change back in 2011. Researchers from the Physics Engineering Department at Tsinghua University reported some eyebrow-raisingly specific data about the meteorological circumstances during which burglaries were most likely to occur in urban China.

pages: 232 words: 72,483

Immortality, Inc.
by Chip Walter
Published 7 Jan 2020

Melamud’s graphs showed that the longer people lived, the longer the list of diseases became: malfunctioning hearts, cancer, and Alzheimer’s being the three biggest killers. Whatever slowed those diseases and increased life span occurred thanks only to alpha’s whack-a-mole–style medicine. For fun, Melamud changed the statistical model for beta—the constant 8.5-year number that set the evolutionary life limit for humans at no more than 120 years. When that number was zeroed out, the calculations didn’t merely show an improvement; they blew everybody away. If the increase in beta was halted at age 30—a huge if to be sure—the median life span of that person would leap to 695 years!

pages: 345 words: 75,660

Prediction Machines: The Simple Economics of Artificial Intelligence
by Ajay Agrawal , Joshua Gans and Avi Goldfarb
Published 16 Apr 2018

For example, in a mobile phone churn model, researchers utilized data on hour-by-hour call records in addition to standard variables such as bill size and payment punctuality. The machine learning methods also got better at leveraging the data available. In the Duke competition, a key component of success was choosing which of the hundreds of available variables to include and choosing which statistical model to use. The best methods at the time, whether machine learning or classic regression, used a combination of intuition and statistical tests to select the variables and model. Now, machine learning methods, and especially deep learning methods, allow flexibility in the model and this means variables can combine with each other in unexpected ways.
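One way to let the model do the variable selection, sketched with an L1-penalized logistic regression that shrinks uninformative coefficients to zero; the churn data here is synthetic.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(4)
    X = rng.normal(size=(500, 20))       # 20 candidate churn predictors
    churn = (X[:, 0] - 0.8 * X[:, 3] + rng.normal(size=500) > 0).astype(int)

    # The L1 penalty drives irrelevant coefficients to exactly zero, so the
    # fitted model effectively chooses its own variables.
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, churn)
    print(np.flatnonzero(clf.coef_[0]))  # typically just the informative columns 0 and 3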

pages: 229 words: 72,431

Shadow Work: The Unpaid, Unseen Jobs That Fill Your Day
by Craig Lambert
Published 30 Apr 2015

Algorithms are another tool that democratizes expertise, using the revolutionary power of data to outdo established authorities. For example, Theodore Ruger, then a law professor at Washington University in St. Louis, and three colleagues ran a contest to predict the outcome of Supreme Court cases on the 2002 docket. The four political scientists developed a statistical model based on six general case characteristics they extracted from previous trials; the model ignored information about specific laws and the facts of the actual cases. Their friendly contest pitted this model against the qualitative judgments of eighty-seven law professors, many of whom had clerked at the Court.

pages: 267 words: 72,552

Reinventing Capitalism in the Age of Big Data
by Viktor Mayer-Schönberger and Thomas Ramge
Published 27 Feb 2018

Divergences would be flagged and brought to the attention of factory directors, then to government decision makers sitting in a futuristic operations room. From there the officials would send directives back to the factories. Cybersyn was quite sophisticated for its time, employing a network approach to capturing and calculating economic activity and using Bayesian statistical models. Most important, it relied on feedback that would loop back into the decision-making processes. The system never became fully operational. Its communications network was in place and was used in the fall of 1972 to keep the country running when striking transportation workers blocked goods from entering Santiago.

pages: 296 words: 78,631

Hello World: Being Human in the Age of Algorithms
by Hannah Fry
Published 17 Sep 2018

When all the inmates were eventually granted their release, and so were free to violate the terms of their parole if they chose to, Burgess had a chance to check how good his predictions were. From such a basic analysis, he managed to be remarkably accurate. Ninety-eight per cent of his low-risk group made a clean pass through their parole, while two-thirds of his high-risk group did not.17 Even crude statistical models, it turned out, could make better forecasts than the experts. But his work had its critics. Sceptical onlookers questioned how much the factors which reliably predicted parole success in one place at one time could apply elsewhere. (They had a point: I’m not sure the category ‘farm boy’ would be much help in predicting recidivism among modern inner-city criminals.)

pages: 267 words: 71,941

How to Predict the Unpredictable
by William Poundstone

There might be a happy medium, though, a range of probabilities where it does make sense to bet on a strong away team. That’s what the Bristol group found. To use this rule you need a good estimate of the probability of an away team win. Such estimates are not hard to come by on the web. You can also find spreadsheets or software that can be used as is or adapted to create your own statistical model. Note that the bookie odds are not proper estimates of the chances, as they factor in the commission and other tweaks. The researchers’ optimal rule was to bet on the away team when its chance of winning was between 44.7 and 71.5 percent. This is a selective rule. It applied to just twenty-two of the 194 matches in October 2007.
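The band itself comes from the passage; encoding the rule is then one line, with the probability estimate supplied by your own model rather than read off the bookmaker's odds.

    def bet_on_away_team(p_away_win):
        """Bristol-group rule: back the away side only inside the reported band."""
        return 0.447 <= p_away_win <= 0.715

    print(bet_on_away_team(0.60))   # True
    print(bet_on_away_team(0.30))   # False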

Hands-On Machine Learning With Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
by Aurelien Geron
Published 14 Aug 2019

They often end up selecting the same model, but when they differ, the model selected by the BIC tends to be simpler (fewer parameters) than the one selected by the AIC, but it does not fit the data quite as well (this is especially true for larger datasets). Likelihood function The terms “probability” and “likelihood” are often used interchangeably in the English language, but they have very different meanings in statistics: given a statistical model with some parameters θ, the word “probability” is used to describe how plausible a future outcome x is (knowing the parameter values θ), while the word “likelihood” is used to describe how plausible a particular set of parameter values θ are, after the outcome x is known. Consider a one-dimensional mixture model of two Gaussian distributions centered at -4 and +1.
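A small illustration of the distinction: the data are held fixed (here five invented points) and the likelihood is evaluated as a function of candidate parameter values for the two-component mixture the passage describes.

    import numpy as np
    from scipy.stats import norm

    x = np.array([-4.2, -3.8, 1.1, 0.9, -4.5])     # observed data, held fixed

    def log_likelihood(mu1, mu2, weight=0.5, sigma=1.0):
        """Log-likelihood of a two-Gaussian mixture for the fixed data above."""
        density = weight * norm.pdf(x, mu1, sigma) + (1 - weight) * norm.pdf(x, mu2, sigma)
        return np.log(density).sum()

    print(log_likelihood(-4, 1))   # parameters close to the data: higher likelihood
    print(log_likelihood(0, 5))    # implausible parameters: much lower likelihood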

pages: 303 words: 74,206

GDP: The World’s Most Powerful Formula and Why It Must Now Change
by Ehsan Masood
Published 4 Mar 2021

But they remain a minority and to some extent marginal voices. Given the explosion of data and the tools with which to manipulate data, the trend is completely in the other direction. Our world today is what Keynes feared it would become. Most scientists and economists rely heavily on numerical and statistical models. Pick a country—any country in the world—and its economy, as well as its financial systems, is likewise built on such models. Some of these models, such as GDP, are simplistic. Others, such as those used in banking, can be far more complex. In either case, there are few practitioners who now have the ability to explain, rationalize, or critique using non-mathematical language what they do and why they do it.

pages: 271 words: 79,355

The Dark Cloud: How the Digital World Is Costing the Earth
by Guillaume Pitron
Published 14 Jun 2023

It was a time when ‘everyone was trying to understand the markets, rather than model them’, says a former analyst at HSBC Bank.16 But in 1982, James Simons, a former mathematician at the National Security Agency, created a revolutionary fund called Renaissance Technologies. ‘Simons wanted to automate the processing of signals that traditional hedge funds look out for’, the analyst explains.17 It involves lines of code that inject large volumes of data into statistical models in order to identify combinations that best predict profitable market activity. Today, these processes include increasingly advanced non-financial information (such as real-time monitoring of industrial supplies, and logistics flows via satellite images, or market sentiment as expressed on social media).18 All this can be bought or sold with a time advantage.

pages: 804 words: 212,335

Revelation Space
by Alastair Reynolds
Published 1 Jan 2000

But if Sajaki's equipment was not the best, chances were good that he had excellent algorithms to distil memory traces. Over centuries, statistical models had studied patterns of memory storage in ten billion human minds, correlating structure against experience. Certain impressions tended to be reflected in similar neural structures — internal qualia — which were the functional blocks out of which more complex memories were assembled. Those qualia were never the same from mind to mind, except in very rare cases, but neither were they encoded in radically different ways, since nature would never deviate far from the minimum-energy route to a particular solution. The statistical models could identify those qualia patterns very efficiently, and then map the connections between them out of which memories were forged.

Mining of Massive Datasets
by Jure Leskovec , Anand Rajaraman and Jeffrey David Ullman
Published 13 Nov 2014

Originally, “data mining” or “data dredging” was a derogatory term referring to attempts to extract information that was not supported by the data. Section 1.2 illustrates the sort of errors one can make by trying to extract what really isn’t in the data. Today, “data mining” has taken on a positive meaning. Now, statisticians view data mining as the construction of a statistical model, that is, an underlying distribution from which the visible data is drawn. EXAMPLE 1.1Suppose our data is a set of numbers. This data is much simpler than data that would be data-mined, but it will serve as an example. A statistician might decide that the data comes from a Gaussian distribution and use a formula to compute the most likely parameters of this Gaussian.
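For the Gaussian in the example, the "formula" is just the maximum-likelihood estimate of the mean and standard deviation; the data points below are invented.

    import numpy as np

    data = np.array([12.1, 9.8, 11.4, 10.2, 10.9, 9.5])   # hypothetical observations

    mu_hat = data.mean()       # maximum-likelihood estimate of the Gaussian mean
    sigma_hat = data.std()     # maximum-likelihood estimate of the standard deviation
    print(mu_hat, sigma_hat)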

When Free Markets Fail: Saving the Market When It Can't Save Itself (Wiley Corporate F&A)
by Scott McCleskey
Published 10 Mar 2011

The methodology is the description of what information to gather and how to process the information to arrive at a rating (what conditions would lead to an AAA rating, to an AA rating, etc.). The statistical models are the algorithms that predict the outcomes of various scenarios, such as what would happen to an airline if the price of oil rose to $100 per barrel. The analyst does his or her homework and comes up with the rating he or she believes is correct, but this is only the beginning of the process.

pages: 337 words: 86,320

Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are
by Seth Stephens-Davidowitz
Published 8 May 2017

Crossley, “Validity of Responses to Survey Questions,” Public Opinion Quarterly 14, 1 (1950). 106 survey asked University of Maryland graduates: Frauke Kreuter, Stanley Presser, and Roger Tourangeau, “Social Desirability Bias in CATI, IVR, and Web Surveys,” Public Opinion Quarterly 72(5), 2008. 107 failure of the polls: For an article arguing that lying might be a problem in trying to predict support for Trump, see Thomas B. Edsall, “How Many People Support Trump but Don’t Want to Admit It?” New York Times, May 15, 2016, SR2. But for an argument that this was not a large factor, see Andrew Gelman, “Explanations for That Shocking 2% Shift,” Statistical Modeling, Causal Inference, and Social Science, November 9, 2016, http://andrewgelman.com/2016/11/09/explanations-shocking-2-shift/. 107 says Tourangeau: I interviewed Roger Tourangeau by phone on May 5, 2015. 107 so many people say they are above average: This is discussed in Adam Grant, Originals: How Non-Conformists Move the World (New York: Viking, 2016).

Psychopathy: An Introduction to Biological Findings and Their Implications
by Andrea L. Glenn and Adrian Raine
Published 7 Mar 2014

In behavioral genetics studies, the similarity of MZ twins on a given trait is compared to the similarity of DZ twins on that trait. If MZ twins are more similar than DZ twins, then it can be inferred that the trait being measured is at least partly due to genetic factors. Across large samples, statistical modeling techniques can determine the proportion of the variance in a particular trait or phenotype (in this case, psychopathy or a subcomponent of it) that is accounted for by genetic versus environmental factors. Genetic factors either can be additive or nonadditive. Additive means that genes summate to contribute to a phenotype.

pages: 302 words: 86,614

The Alpha Masters: Unlocking the Genius of the World's Top Hedge Funds
by Maneet Ahuja , Myron Scholes and Mohamed El-Erian
Published 29 May 2012

Wong says that the one thing most people don’t understand about systematic trading is the trade-off between profit potential in the long term and the potential for short-term fluctuation and losses. “We are all about the long run,” he says. “It’s why I say, over and over, the trend is your friend.” “If you’re a macro trader and you basically have 20 positions, you better make sure that no more than two or three are wrong. But we base our positions on statistical models, and we take hundreds of positions. At any given time, a lot of them are going to be wrong, and we have to accept that. But in the long run, we’ll be more right than wrong.” Evidently—since 1990, AHL’s total returns have exceeded 1,000 percent. Still, AHL is hardly invulnerable. The financial crisis brought on a sharp reversal, and the firm remains vulnerable to the Fed-induced drop in market volatility.

The Armchair Economist: Economics and Everyday Life
by Steven E. Landsburg
Published 1 May 2012

The commissioner became obsessed with the need to discourage punting and called in his assistants for advice on how to cope with the problem. One of those assistants, a fresh M.B.A., breathlessly announced that he had taken courses from an economist who was a great expert on all aspects of the game and who had developed detailed statistical models to predict how teams behave. He proposed retaining the economist to study what makes teams punt. The commissioner summoned the economist, who went home with a large retainer check and a mandate to discover the causes of punting. Many hours later (he billed by the hour) the answer was at hand.

pages: 322 words: 84,752

Pax Technica: How the Internet of Things May Set Us Free or Lock Us Up
by Philip N. Howard
Published 27 Apr 2015

Important events and recognizable causal connections can’t be replicated or falsified. We can’t repeat the Arab Spring in some kind of experiment. We can’t test its negation—an Arab Spring that never happened, or an Arab Spring minus one key factor that resulted in a different outcome. We don’t have enough large datasets about Arab Spring–like events to run statistical models. That doesn’t mean we shouldn’t try to learn from the real events that happened. In fact, for many in the social sciences, tracing how real events unfolded is the best way to understand political change. The richest explanations of the fall of the Berlin Wall, for example, as sociologist Steve Pfaff crafts them, come from such process tracing.2 We do, however, know enough to make some educated guesses about what will happen next.

pages: 291 words: 81,703

Average Is Over: Powering America Beyond the Age of the Great Stagnation
by Tyler Cowen
Published 11 Sep 2013

On the age dynamics for achievement for non-economists, see Benjamin F. Jones and Bruce A. Weinberg, “Age Dynamics in Scientific Creativity,” published online before print, PNAS, November 7, 2011, doi: 10.1073/pnas.1102895108. On data crunching pushing out theory, see the famous essay by Leo Breiman, “Statistical Modeling: The Two Cultures,” Statistical Science, 2001, 16(3): 199–231, including the comments on the piece as well. See also the recent piece by Betsey Stevenson and Justin Wolfers, “Business is Booming in Empirical Economics,” Bloomberg.com, August 6, 2012. And as mentioned earlier, see Daniel S.

pages: 283 words: 81,163

How Capitalism Saved America: The Untold History of Our Country, From the Pilgrims to the Present
by Thomas J. Dilorenzo
Published 9 Aug 2004

Wages rose by a phenomenal 13.7 percent during the first three quarters of 1937 alone.46 The union/nonunion wage differential increased from 5 percent in 1933 to 23 percent by 1940.47 On top of this, the Social Security payroll and unemployment insurance taxes contributed to a rapid rise in government-mandated fringe benefits, from 2.4 percent of payrolls in 1936 to 5.1 percent just two years later. Economists Richard Vedder and Lowell Gallaway have determined the costs of all this misguided legislation, showing how most of the abnormal unemployment of the 1930s would have been avoided had it not been for the New Deal. Using a statistical model, Vedder and Gallaway concluded that by 1940 the unemployment rate was more than 8 percentage points higher than it would have been without the legislation-induced growth in unionism and government-mandated fringe-benefit costs imposed on employers.48 Their conclusion: “The Great Depression was very significantly prolonged in both its duration and its magnitude by the impact of New Deal programs.”49 In addition to fascistic labor policies and government-mandated wage and fringe-benefit increases that destroyed millions of jobs, the Second New Deal was responsible for economy-destroying tax increases and massive government spending on myriad government make-work programs.

pages: 294 words: 81,292

Our Final Invention: Artificial Intelligence and the End of the Human Era
by James Barrat
Published 30 Sep 2013

When I asked Jason Freidenfelds, from Google PR, he wrote: … it’s much too early for us to speculate about topics this far down the road. We’re generally more focused on practical machine learning technologies like machine vision, speech recognition, and machine translation, which essentially is about building statistical models to match patterns—nothing close to the “thinking machine” vision of AGI. But I think Page’s quotation sheds more light on Google’s attitudes than Freidenfelds’s. And it helps explain Google’s evolution from the visionary, insurrectionist company of the 1990s, with the much touted slogan DON’T BE EVIL, to today’s opaque, Orwellian, personal-data-aggregating behemoth.

pages: 561 words: 87,892

Losing Control: The Emerging Threats to Western Prosperity
by Stephen D. King
Published 14 Jun 2010

WE’RE NOT ON OUR OWN In my twenty-five years as a professional economist, initially as a civil servant in Whitehall but, for the most part, as an employee of a major international bank, I’ve spent a good deal of time looking into the future. As the emerging nations first appeared on the economic radar screen, I began to realize I could talk about the future only by delving much further into the past. I wasn’t interested merely in the history incorporated into statistical models of the economy, a history which typically includes just a handful of years and therefore ignores almost all the interesting economic developments that have taken place over the last millennium. Instead, the history that mattered to me had to capture the long sweep of economic and political progress and all too frequent reversal.

pages: 345 words: 86,394

Frequently Asked Questions in Quantitative Finance
by Paul Wilmott
Published 3 Jan 2007

This is hat B. The final hat’s numbers have mean of zero and standard deviation 10. This is hat C. You don’t know which hat is which. You pick a number out of one hat, it is −2.6. Which hat do you think it came from? MLE can help you answer this question. Long Answer A large part of statistical modelling concerns finding model parameters. One popular way of doing this is Maximum Likelihood Estimation. The method is easily explained by a very simple example. You are attending a maths conference. You arrive by train at the city hosting the event. You take a taxi from the train station to the conference venue.
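To make the idea concrete, here is a minimal sketch of maximum likelihood estimation for the hat puzzle above. The excerpt only states that hat C has standard deviation 10, so the standard deviations assumed for hats A and B are illustrative guesses; the MLE answer is simply the hat under which the observed −2.6 is most probable.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a Normal(mu, sigma^2) distribution at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

observation = -2.6
# Hat C's standard deviation (10) comes from the excerpt; A and B are assumed.
hats = {"A": 1.0, "B": 5.0, "C": 10.0}

likelihoods = {hat: normal_pdf(observation, sigma=s) for hat, s in hats.items()}
print(likelihoods)
print("Maximum-likelihood choice:", max(likelihoods, key=likelihoods.get))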

pages: 280 words: 83,299

Empty Planet: The Shock of Global Population Decline
by Darrell Bricker and John Ibbitson
Published 5 Feb 2019

“If we continue at this pace, one day the next species we extinguish may be ourselves,” Bourne warned.77 But the biggest neo-Malthusian of them all is an institution, and a highly respected one at that. The United Nations Population Division, a critical component of the UN’s Department of Economic and Social Affairs, is almost as old as the UN itself, having existed in one form or another since 1946. Its principal goal is to develop statistical models that will accurately project the growth of the global population. The demographers and statisticians who work there are good at their jobs. In 1958, the division predicted that the global population would reach 6.28 billion by 2000. In fact, it was a bit lower, at 6.06 billion, about 200 million out—a difference small enough not to count.78 This was remarkably impressive, given that demographers at that time had highly inadequate data for Africa and China.

pages: 290 words: 83,248

The Greed Merchants: How the Investment Banks Exploited the System
by Philip Augar
Published 20 Apr 2005

As we saw in Chapter 1, the deal was intended to create a world-class modern media company but instead led to the largest losses in corporate history. Bankers such as the legendary dealmaker Bruce Wasserstein are dismissive of the value-destroying arguments: ‘The problem with many academic studies is that they make questionable assumptions to squeeze untidy data points into a pristine statistical model.’32 But the weight of evidence from the late-twentieth-century merger wave seems to show that the handsome profits made by the selling shareholders were usually offset by subsequent losses for the acquirers. This would suggest that many mergers were not well thought out and were attempted by managers that lacked the skills and techniques to make them work.

pages: 245 words: 83,272

Artificial Unintelligence: How Computers Misunderstand the World
by Meredith Broussard
Published 19 Apr 2018

We created the Survived column and got a number that we can call 97 percent accurate. We learned that fare is the most influential factor in a mathematical analysis of Titanic survivor data. This was narrow artificial intelligence. It was not anything to be scared of, nor was it leading us toward a global takeover by superintelligent computers. “These are just statistical models, the same as those that Google uses to play board games or that your phone uses to make predictions about what word you’re saying in order to transcribe your messages,” Carnegie Mellon professor and machine learning researcher Zachary Lipton told the Register about AI. “They are no more sentient than a bowl of noodles.”17 For a programmer, writing an algorithm is that easy.

pages: 297 words: 84,447

The Star Builders: Nuclear Fusion and the Race to Power the Planet
by Arthur Turrell
Published 2 Aug 2021

Clery, “Laser Fusion Reactor Approaches ‘Burning Plasma’ Milestone,” Science 370 (2020): 1019–20. 15. D. Clark et al., “Three-Dimensional Modeling and Hydrodynamic Scaling of National Ignition Facility Implosions,” Physics of Plasmas 26 (2019): 050601; V. Gopalaswamy et al., “Tripled Yield in Direct-Drive Laser Fusion Through Statistical Modelling,” Nature 565 (2019): 581–86. 16. K. Hahn et al., “Fusion-Neutron Measurements for Magnetized Liner Inertial Fusion Experiments on the Z Accelerator,” in Journal of Physics: Conference Series, vol. 717 (IOP Publishing, 2016), 012020. 17. O. Hurricane et al., “Approaching a Burning Plasma on the NIF,” Physics of Plasmas 26 (2019): 052704; P.

pages: 303 words: 84,023

Heads I Win, Tails I Win
by Spencer Jakab
Published 21 Jun 2016

We’re talking about less than one-tenth of one percent of all trading days during that span. Sure, the payoff from missing a major selloff would be huge. The very smartest people on Wall Street would give up a major bodily appendage to identify even one of those episodes, though, and there’s no evidence any of them has managed to do it with any consistency. Their statistical models aren’t even very good at predicting how bad those bad days will be once they arrive—potentially a fatal miscalculation for those using borrowed money to enhance returns. For example, the October 1987 stock market crash was what risk managers call a 21 standard deviation event. That’s a statistical definition and I won’t bore you with the math.

pages: 348 words: 83,490

More Than You Know: Finding Financial Wisdom in Unconventional Places (Updated and Expanded)
by Michael J. Mauboussin
Published 1 Jan 2006

Psychologist Phil Tetlock asked nearly three hundred experts to make literally tens of thousands of predictions over nearly two decades. These were difficult predictions related to political and economic outcomes—similar to the types of problems investors tackle. The results were unimpressive. Expert forecasters improved little, if at all, on simple statistical models. Further, when Tetlock confronted the experts with their poor predicting acuity, they went about justifying their views just like everyone else does. Tetlock doesn’t describe in detail what happens when the expert opinions are aggregated, but his research certainly shows that ability, defined as expertise, does not lead to good predictions when the problems are hard.

pages: 623 words: 448,848

Food Allergy: Adverse Reactions to Foods and Food Additives
by Dean D. Metcalfe
Published 15 Dec 2008

Furthermore, this approach allows for the possibility that almost 10% of patients allergic to that food will react to ingestion of that dose and this possibility may be considered as too high. Modeling of collective data from several studies is probably the preferred approach to determine the population-based threshold, although the best statistical model to use remains to be determined [8]. typical servings of these foods. Thus, it is tempting to speculate that those individuals with very low individual threshold doses would be less likely to outgrow their food allergy or would require a longer time period for that to occur. In at least one study [25], individuals with histories of severe food allergies had significantly lower individual threshold doses.

The knowledge of individual threshold doses would allow physicians to offer more complete advice to food-allergic patients in terms of their comparative vulnerability to hidden residues of allergenic foods. The clinical determination of large numbers of individual threshold doses would allow estimates of population-based thresholds using appropriate statistical modeling approaches. The food industry and regulatory agencies could also make effective use of information on population-based threshold doses to establish improved labeling regulations and practices and allergen control programs. References 1 Gern JE, Yang E, Evrard HM, et al. Allergic reactions to milk-contaminated “non-dairy” products.

Standardization of double-blind, placebocontrolled food challenges. Allergy 2001;56:75–7. 75 Caffarelli C, Petroccione T. False-negative food challenges in children with suspected food allergy. Lancet 2001;358:1871–2. 76 Sampson HA. Use of food-challenge tests in children. Lancet 2001; 358:1832–3. 77 Briggs D, Aspinall L, Dickens A, Bindslev-Jensen C. Statistical model for assessing the proportion of subjects with subjective sensitisations in adverse reactions to foods. Allergy 2001; 56:83–5. 78 Chinchilli VM, Fisher L, Craig TJ. Statistical issues in clinical trials that involve the double-blind, placebo-controlled food challenge. J Allergy Clin Immunol 2005;115:592–7. 21 CHAPTER 21 IgE Tests: In Vitro Diagnosis Kirsten Beyer KEY CONCEPTS • The presence of food allergen-specific IgE determines the sensitization to a specific food.

pages: 335 words: 94,657

The Bogleheads' Guide to Investing
by Taylor Larimore , Michael Leboeuf and Mel Lindauer
Published 1 Jan 2006

During a 15-year period when the S&P 500 had average annual returns of 15.3 percent, the Mensa Investment Club's performance averaged returns of only 2.5 percent. 3. In 1994, a hedge fund called Long Term Capital Management (LTCM) was created with the help of two Nobel Prize-winning economists. They believed they had a statistical model that could eliminate risk from investing. The fund was extremely leveraged. They controlled positions totaling $1.25 trillion, an amount equal to the annual budget of the U.S. government. After some spectacular early successes, a financial panic swept across Asia. In 1998, LTCM hemorrhaged and faced bankruptcy.

pages: 257 words: 94,168

Oil Panic and the Global Crisis: Predictions and Myths
by Steven M. Gorelick
Published 9 Dec 2009

At a depth of over 5 miles, this find contains anywhere between 3 and 15 billion barrels and could comprise 11 percent of US production by 2013.107 In 2009, Chevron reported another deep-water discovery just 44 miles away that may yield 0.5 billion barrels and could be profitably produced at an oil price of $50 per barrel.108 The second insight from discovery trends is that an underlying premise of many statistical models of oil discovery is probably incorrect. This premise is that larger oil fields are found first, followed by the discovery of smaller fields. Large fields in geologically related proximity to one another are typically discovered first simply because they are the most easily detected targets.

The Data Revolution: Big Data, Open Data, Data Infrastructures and Their Consequences
by Rob Kitchin
Published 25 Aug 2014

The difference between the humanities and social sciences in this respect is because the statistics used in the digital humanities are largely descriptive – identifying patterns and plotting them as counts, graphs, and maps. In contrast, the computational social sciences employ the scientific method, complementing descriptive statistics with inferential statistics that seek to identify causality. In other words, they are underpinned by an epistemology wherein the aim is to produce sophisticated statistical models that explain, simulate and predict human life. This is much more difficult to reconcile with post-positivist approaches. The defence then rests on the utility and value of the method and models, not on providing complementary analysis of a more expansive set of data. There are alternatives to this position, such as that adopted within critical GIS (Geographic Information Science) and radical statistics, and those who utilise mixed-method approaches, that either employ models and inferential statistics while being mindful of their shortcomings, or more commonly only utilise descriptive statistics that are complemented with small data studies.

pages: 322 words: 88,197

Wonderland: How Play Made the Modern World
by Steven Johnson
Published 15 Nov 2016

Probability theory served as a kind of conceptual fossil fuel for the modern world. It gave rise to the modern insurance industry, which for the first time could calculate with some predictive power the claims it could expect when insuring individuals or industries. Capital markets—for good and for bad—rely extensively on elaborate statistical models that predict future risk. “The pundits and pollsters who today tell us who is likely to win the next election make direct use of mathematical techniques developed by Pascal and Fermat,” the mathematician Keith Devlin writes. “In modern medicine, future-predictive statistical methods are used all the time to compare the benefits of various drugs and treatments with their risks.”

pages: 323 words: 89,795

Food and Fuel: Solutions for the Future
by Andrew Heintzman , Evan Solomon and Eric Schlosser
Published 2 Feb 2009

Specifically, there were some indications that China’s catch reports were too high. For example, some of China’s major fish populations were declared overexploited decades ago. In 2001, Watson and Pauly published an eye-opening study in the journal Nature about the true status of our world’s fisheries. These researchers used a statistical model to compare China’s officially reported catches to those that would be expected, given oceanographic conditions and other factors. They determined that China’s actual catches were likely closer to one half their reported levels. The implications of China’s over-reporting are dramatic: instead of global catches increasing by 0.33 million tonnes per year since 1988, as reported by the FAO, catches have actually declined by 0.36 million tonnes per year.

pages: 312 words: 89,728

The End of My Addiction
by Olivier Ameisen
Published 23 Dec 2008

J., Sunde, N. et al. (2002) Evidence of tolerance to baclofen in treatment of severe spasticity with intrathecal baclofen. Clinical Neurology and Neurosurgery 104, 142–145. Pelc, I., Ansoms, C., Lehert, P. et al. (2002) The European NEAT program: an integrated approach using acamprosate and psychosocial support for the prevention of relapse in alcohol-dependent patients with a statistical modeling of therapy success prediction. Alcoholism: Clinical and Experimental Research 26, 1529–1538. Roberts, D. C. and Andrews, M. M. (1997) Baclofen suppression of cocaine self-administration: demonstration using a discrete trials procedure. Psychopharmacology (Berlin) 131, 271–277. Shoaib, M., Swanner, L.

The Fractalist
by Benoit Mandelbrot
Published 30 Oct 2012

There were far too many big price jumps and falls. And the volatility kept shifting over time. Some years prices were stable, other years wild. “We’ve done all we can to make sense of these cotton prices. Everything changes, nothing is constant. This is a mess of the worst kind.” Nothing could make the data fit the existing statistical model, originally proposed in 1900, which assumed that each day’s price change was independent of the last and followed the mildly random pattern predicted by the bell curve. In short order, we made a deal: he’d let me see what I could do. He handed me cardboard boxes of computer punch cards recording the data.

pages: 297 words: 91,141

Market Sense and Nonsense
by Jack D. Schwager
Published 5 Oct 2012

The premise underlying statistical arbitrage is that short-term imbalances in buy and sell orders cause temporary price distortions, which provide short-term trading opportunities. Statistical arbitrage is a mean-reversion strategy that seeks to sell excessive strength and buy excessive weakness based on statistical models that define when short-term price moves in individual equities are considered out of line relative to price moves in related equities. The origin of the strategy was a subset of statistical arbitrage called pairs trading. In pairs trading, the price ratios of closely related stocks are tracked (e.g., Ford and General Motors), and when the mathematical model indicates that one stock has gained too much versus the other (either by rising more or by declining less), it is sold and hedged by the purchase of the related equity in the pair.
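As an illustration of the pairs-trading flavour of statistical arbitrage described above, here is a minimal sketch: it watches the log price ratio of two related stocks and signals a mean-reversion trade when the ratio is stretched relative to its recent history. The lookback window and entry threshold are arbitrary illustrative choices, not parameters from the book.

```python
import numpy as np

def pairs_signal(prices_a, prices_b, lookback=20, entry_z=2.0):
    """Return +1 (buy A / sell B), -1 (sell A / buy B), or 0 for the latest bar,
    based on how stretched the log price ratio is versus its recent history."""
    ratio = np.log(np.asarray(prices_a, dtype=float) / np.asarray(prices_b, dtype=float))
    window = ratio[-lookback:]
    z = (ratio[-1] - window.mean()) / window.std(ddof=1)
    if z > entry_z:       # A looks rich relative to B: sell A, buy B
        return -1
    if z < -entry_z:      # A looks cheap relative to B: buy A, sell B
        return 1
    return 0              # ratio is in line with recent history: no trade
```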

pages: 342 words: 94,762

Wait: The Art and Science of Delay
by Frank Partnoy
Published 15 Jan 2012

It is worth noting that when economists attempt to describe human behavior using high-level math, it often doesn’t go particularly well. Because the math is complex, people are prone to rely on it without question. And the equations often are vulnerable to unrealistic assumptions. Most recently, the financial crisis was caused in part by overreliance on statistical models that didn’t take into account the chances of declines in housing prices. But that was just the most recent iteration: the collapse of Enron, the implosion of the hedge fund Long-Term Capital Management, the billions of dollars lost by rogue traders Kweku Adoboli, Jerome Kerviel, Nick Leeson, and others—all of these fiascos have, at their heart, a mistaken reliance on complex math.

pages: 339 words: 88,732

The Second Machine Age: Work, Progress, and Prosperity in a Time of Brilliant Technologies
by Erik Brynjolfsson and Andrew McAfee
Published 20 Jan 2014

To test this hypothesis, Erik asked Google if he could access data about its search terms. He was told that he didn’t have to ask; the company made these data freely available over the Web. Erik and his doctoral student Lynn Wu, neither of whom was versed in the economics of housing, built a simple statistical model to look at the data utilizing the user-generated content of search terms made available by Google. Their model linked changes in search-term volume to later housing sales and price changes, predicting that if search terms like the ones above were on the increase today, then housing sales and prices in Phoenix would rise three months from now.
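The model described above is, at heart, a lagged regression. A toy version, with invented numbers standing in for search volume and later home sales, might look like this:

```python
import numpy as np

# Toy, invented monthly series: an index of housing-related search volume and
# home sales three months later (the lag described in the excerpt). Real data
# would come from Google Trends and local sales records.
search_volume = np.array([80, 85, 90, 95, 100, 110, 120, 115, 105, 100], dtype=float)
sales_3m_later = np.array([210, 220, 228, 240, 250, 270, 295, 285, 260, 252], dtype=float)

# Ordinary least squares fit of sales_{t+3} = a + b * search_volume_t
X = np.column_stack([np.ones_like(search_volume), search_volume])
(a, b), *_ = np.linalg.lstsq(X, sales_3m_later, rcond=None)
print(f"intercept = {a:.1f}, slope = {b:.2f}")
print(f"forecast for a search index of 125: {a + b * 125:.0f} sales")
```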

pages: 291 words: 90,200

Networks of Outrage and Hope: Social Movements in the Internet Age
by Manuel Castells
Published 19 Aug 2012

Particularly significant, before the Arab Spring, was the transformation of social involvement in Egypt and Bahrain with the help of ICT diffusion. In a stream of research conducted in 2011 and 2012 after the Arab uprisings, Howard and Hussain, using a series of quantitative and qualitative indicators, probed a multi-causal, statistical model of the processes and outcomes of the Arab uprisings by using fuzzy logic (Hussain and Howard 2012). They found that the extensive use of digital networks by a predominantly young population of demonstrators had a significant effect on the intensity and power of these movements, starting with a very active debate on social and political demands in the social media before the demonstrations’ onset.

pages: 345 words: 92,849

Equal Is Unfair: America's Misguided Fight Against Income Inequality
by Don Watkins and Yaron Brook
Published 28 Mar 2016

In these cases the question to ask is: “Assuming this is a problem, what is your solution?” Inevitably, the inequality critics’ answer will be that some form of force must be used to tear down the top by depriving them of the earned, and to prop up the bottom by giving them the unearned. But nothing can justify an injustice, nor can any statistical model erase the fact that all of the values human life requires are a product of the human mind, and that the human mind cannot function without freedom. Don’t concede that the inequality alarmists value equality. The egalitarians pose as defenders of equality. But there is no such thing as being for equality across the board: different types of equality conflict.

pages: 353 words: 88,376

The Investopedia Guide to Wall Speak: The Terms You Need to Know to Talk Like Cramer, Think Like Soros, and Buy Like Buffett
by Jack (edited By) Guinan
Published 27 Jul 2009

Related Terms: • Defined-Benefit Plan • Defined-Contribution Plan • Individual Retirement Account—IRA • Roth IRA • Tax Deferred 241 242 The Investopedia Guide to Wall Speak Quantitative Analysis What Does Quantitative Analysis Mean? A business or financial analysis technique that is used to understand market behavior by employing complex mathematical and statistical modeling, measurement, and research. By assigning a numerical value to variables, quantitative analysts try to replicate reality in mathematical terms. Quantitative analysis helps measure performance evaluation or valuation of a financial instrument. It also can be used to predict real-world events such as changes in a share’s price.

pages: 285 words: 86,853

What Algorithms Want: Imagination in the Age of Computing
by Ed Finn
Published 10 Mar 2017

The black box structure of Siri’s knowledge ontology obfuscated the category error the system made by excluding Planned Parenthood facilities. Fixing this glitch in the culture machine necessarily involves human intervention: behind the facade of the black box, engineers had to overrule baseline statistical models with exceptions and workarounds. There must be thousands of such exceptions, particularly for responses that mimic human affect. Siri and its various counterparts offer a vision of universal language computation, but in practice depend on an “effective” computation that requires constant tweaking and oversight.

pages: 408 words: 94,311

The Great Depression: A Diary
by Benjamin Roth , James Ledbetter and Daniel B. Roth
Published 21 Jul 2009

Rather, Roth’s diary is a reminder that our economic security, individually and collectively, always rests on a complex interaction of market forces, politics, consumer perception, and the impact of unforeseen (and sometimes unforeseeable) events. As in so many other areas, those offering predictions for the future or even detailed readings of the present are often wrong because of incomplete information, flawed statistical models, or hidden agendas. And even when they are right within a particular time frame, history often has other plans in mind. The Youngstown that Benjamin Roth knew and hoped to see revived—the booming steel town, where soot-choked skies meant prosperity—did in fact survive the Depression, thanks in large part to the military buildup during World War II, a major theme of this book’s final chapter.

Driverless: Intelligent Cars and the Road Ahead
by Hod Lipson and Melba Kurman
Published 22 Sep 2016

After a few thousand games more, the software began to play with what some observers might call “strategy.” Since most moves can lead to both a loss and a win, depending on subsequent moves, the database didn’t just record a win/lose outcome. Instead, it recorded the probability that each move would eventually lead to a win. In other words, the database was essentially a big statistical model. Figure 8.2 AI techniques used in driverless cars. Most robotic systems use a combination of techniques. Object recognition for real-time obstacle detection and traffic negotiation is the most challenging for AI (far left). As the software learned, it spent countless hours in “self-play,” amassing more gaming experience than any human could in a lifetime.
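The "big statistical model" described above can be pictured as a table of win rates per (position, move) pair, updated after every self-play game. The sketch below is a hypothetical toy; the game-specific parts (what a state is, which moves are legal) are left as inputs.

```python
import random
from collections import defaultdict

stats = defaultdict(lambda: [0, 0])   # (state, move) -> [wins, plays]

def choose_move(state, legal_moves, explore=0.1):
    """Mostly pick the move with the best estimated win rate, sometimes explore."""
    if random.random() < explore:                 # occasional exploration
        return random.choice(legal_moves)
    def win_rate(move):
        wins, plays = stats[(state, move)]
        return wins / plays if plays else 0.5     # unseen moves get a neutral prior
    return max(legal_moves, key=win_rate)

def record_game(history, won):
    """history: list of (state, move) pairs played this game; won: bool."""
    for state, move in history:
        stats[(state, move)][0] += int(won)
        stats[(state, move)][1] += 1
```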

High-Frequency Trading
by David Easley , Marcos López de Prado and Maureen O'Hara
Published 28 Sep 2013

With variable market speed of trading we need to discretise the time interval [0, T] over n steps ∆ti, and Equation 2.1 becomes

X = \sum_{i=1}^{n} v_i \, \Delta t_i \qquad (2.2)

Therefore, executing a CLOCK or VWAP or POV strategy is a scheduling problem, ie, we are trying to enforce Equation 2.2 within each evaluation interval ∆ti while targeting a predetermined T or a variable vi. Controlling the speed of trading vi is a non-trivial practical problem that requires statistical models for forecasting the market volume over short horizons, as well as local adjustments for tracking the target schedule (Markov et al 2011). These scheduling techniques are also used in later generation algorithms.

Second generation algorithms

Second generation algorithms introduce the concepts of price impact and risk.
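A percentage-of-volume (POV) scheduler in the spirit of Equation 2.2 boils down to sizing each child order from the volume the market just printed. The sketch below is a simplified illustration, not the algorithm from the chapter; all names and numbers are hypothetical.

```python
def pov_child_quantity(market_volume_interval, target_rate, executed_so_far, parent_qty):
    """Percentage-of-volume sizing: trade target_rate of the volume printed in
    the last interval, capped by what is left of the parent order."""
    desired = target_rate * market_volume_interval
    remaining = parent_qty - executed_so_far
    return max(0.0, min(desired, remaining))

# e.g. 10% participation, 50,000 shares just traded, 3,000 shares left to execute
print(pov_child_quantity(50_000, 0.10, executed_so_far=12_000, parent_qty=15_000))  # 3000.0
```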

pages: 297 words: 95,518

Ten Technologies to Save the Planet: Energy Options for a Low-Carbon Future
by Chris Goodall
Published 1 Jan 2010

Every country in the world that relies on increasing amounts of wind, marine, or solar power will probably need to use all three of these mechanisms to align short-term supply and demand. In the U.S., this three-pronged approach is appropriately called the “smart grid.” The construction and operation of this new kind of grid are fascinating challenges to engineers and also to the mathematicians who will use statistical modeling to minimize the risk of not having enough power or, perhaps even more expensively, having grossly excessive power production for many hours a week. Elsewhere, the standard approach, which we might call the “twentieth-century model,” simply tries to predict changes in demand and then adjusts supply to meet these variations.

Deep Value
by Tobias E. Carlisle
Published 19 Aug 2014

Sell only if market price is equal to or greater than intrinsic value, or a better opportunity can be found, hold otherwise. Resistance to the application of statistical prediction rules in value investment runs deep. Many investors recoil at the thought of ceding control of investment decisions to a statistical model, believing that it would be better to use the output from the statistical prediction rule and retain the discretion to follow the rule’s output or not. There is some evidence to support this possibility. Traditional experts are shown to make better decisions when they are provided with the results of statistical prediction.

pages: 304 words: 90,084

Net Zero: How We Stop Causing Climate Change
by Dieter Helm
Published 2 Sep 2020

And that is what they are doing, just like the tobacco companies, the manufacturers of sugary drinks, arms manufacturers, construction companies that use cement and steel, farmers who use fertilisers and pesticides, and so on. Every company in the FT100 index is embedded in the fossil fuel economy. Those who say that this is what is wrong with the ‘capitalist model’ need to consider just what would happen if we jumped off now, rather than over a sensible transition, and why there is no effective carbon price. The statist model is, from a carbon perspective, much worse. It is the work of Saudi Aramco and Rosneft. Climate activists attack European and US politicians and company executives. They don’t dare take on Vladimir Putin, Xi Jinping and Mohammad bin Salman. Gluing yourself to the HQ of Shell or BP is easy: doing it in Moscow, Beijing or Riyadh is much tougher.

pages: 286 words: 92,521

How Medicine Works and When It Doesn't: Learning Who to Trust to Get and Stay Healthy
by F. Perry Wilson
Published 24 Jan 2023

We may see a study that notes that Black people are twice as likely to develop diabetes as white people and, erroneously, concludes that it is due to some inherent unchangeable biology, when in fact this is a correlation induced by multiple third factors—confounders such as poor socioeconomic conditions, which are things we can change. This is why I’ve moved my lab away from using race as a variable in our statistical models. It’s not that there is no correlation between race and the kind of stuff I research (kidney disease outcomes). There is. But race is correlational, not causal. Better instead to focus on the real causal agents: racism (implicit and explicit) and societal inequality. While I don’t have a pill to fix those, I am fortunate enough to have a platform in which to urge everyone to recognize the causality of health inequality in this country and to move our government and ourselves toward addressing it.

pages: 406 words: 88,977

How to Prevent the Next Pandemic
by Bill Gates
Published 2 May 2022

Until fairly recently, the government there counted deaths by surveying small samples of the country every few years and then using the data to estimate nationwide mortality. In 2018, though, Mozambique began building what’s known as a “sample registration system,” which involves continuous surveillance in areas that are representative of the country as a whole. Data from these samples is fed into statistical models that make high-quality estimates about what’s going on throughout the nation. For the first time, Mozambique’s leaders can see accurate monthly reports on how many people died, how and where they died, and how old they were. Mozambique is also one of several countries that are deepening their understanding of child mortality by participating in a program called Child Health and Mortality Prevention Surveillance, or CHAMPS, a global network of public health agencies and other organizations.
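The core of the sample registration system described above is scaling what is observed at representative surveillance sites up to the whole country. A deliberately simplified sketch, with invented site populations and death counts:

```python
# Toy illustration of scaling mortality counts observed at surveillance sites
# up to a national estimate. Site populations, counts, and the national total
# are invented; a real system uses far more careful statistical modelling.
sites = [
    {"pop": 120_000, "deaths": 540},
    {"pop": 80_000, "deaths": 410},
    {"pop": 200_000, "deaths": 860},
]
national_population = 31_000_000

sample_pop = sum(s["pop"] for s in sites)
sample_deaths = sum(s["deaths"] for s in sites)
crude_rate = sample_deaths / sample_pop            # deaths per person per year
estimate = crude_rate * national_population
print(f"crude death rate: {1000 * crude_rate:.1f} per 1,000")
print(f"estimated national deaths: {estimate:,.0f}")
```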

pages: 336 words: 91,806

Code Dependent: Living in the Shadow of AI
by Madhumita Murgia
Published 20 Mar 2024

In the Netherlands, risk assessment technologies like the Top600 and 400 lists are part of a wider national security policy, which Anouk de Koning, a Dutch anthropologist at Leiden University, calls ‘diffuse policing’.6 Apart from the ProKid algorithms, there is also the Crime Anticipation System, or CAS, an AI software that makes predictions about where and when crimes will occur. The system was developed in Amsterdam and rolled out nationally. Part of the diffuse policing strategy includes the use of statistical models – including AI methods – to help target specific demographics: mostly poor and non-white urban youths. The objective is to predict and prevent trouble. Together, these policies enact a combination of ‘care and coercion’, captured in the slogan used by the police for the Top600: ‘Handled with care.’7 Following eighteen months of interviews with young men and their families in the Amsterdam district of Diamantbuurt, as well as care workers and police, Anouk found that the neighbourhood’s Moroccan-Dutch youths made up the largest single subgroup in the Top600.

pages: 364 words: 101,286

The Misbehavior of Markets: A Fractal View of Financial Turbulence
by Benoit Mandelbrot and Richard L. Hudson
Published 7 Mar 2006

Econometrica 34, 1966 (Supplement): 152-153. Mandelbrot, Benoit B. 1970. Long-run interdependence in price records and other economic time series. Econometrica 38: 122-123. Mandelbrot, Benoit B. 1972. Possible refinement of the lognormal hypothesis concerning the distribution of energy dissipation in intermittent turbulence. Statistical Models and Turbulence. M. Rosenblatt and C. Van Atta, eds. Lecture Notes in Physics 12. New York: Springer, 333-351. • Reprint: Chapter N14 of Mandelbrot 1999a. Mandelbrot, Benoit B. 1974a. Intermittent turbulence in self-similar cascades; divergence of high moments and dimension of the carrier. Journal of Fluid Mechanics 62: 331-358. • Reprint: Chapter N15 of Mandelbrot 1999a.

pages: 227 words: 32,306

Using Open Source Platforms for Business Intelligence: Avoid Pitfalls and Maximize Roi
by Lyndsay Wise
Published 16 Sep 2012

In the past, much risk management within BI remained within the realm of finance, insurance, and banking, but most organizations need to assess potential risk and help mitigate its effects on the organization. Within BI, this goes beyond information visibility and means using predictive modeling and other advanced statistical models to ensure that customers with accounts past due are not allowed to submit new orders unless it is known beforehand, or that insurance claims aren’t being submitted fraudulently. The National Health Care Anti-Fraud Association (NHCAA) estimates that in 2010, 3% of all health care spending or $68 billion is lost to health care fraud in the United States.2 This makes fraud detection in health care extremely important, especially when you consider that if you are paying for insurance in the United States, part of your insurance premiums are probably being paid to cover the instances of fraud that occur, making this relevant beyond health care insurance providers.

pages: 364 words: 99,613

Servant Economy: Where America's Elite Is Sending the Middle Class
by Jeff Faux
Published 16 May 2012

“Larry Summers and Michael Steele,” This Week with Christiane Amanpour, ABC News, February 8, 2009. 10. CNN Politics, Election Center, November 24, 2010, http://www.cnn.com/ELECTION/2010/results/polls.main. 11. Andrew Gelman, “Unsurprisingly, More People Are Worried about the Economy and Jobs Than about Deficit,” Statistical Modeling, Causal Interference, and Social Science, June 19, 2010, http://www.stat.columbia.edu/~cook/movabletype/archives/2010/06/unsurprisingly.html;Ryan Grim, “Mayberry Machiavellis: Obama Political Team Handcuffing Recovery,” Huffington Post, July 6, 2010, http://www.huffingtonpost.com/2010/07/06/mayberry-machiavellis-oba_n_636770.html. 12.

pages: 313 words: 101,403

My Life as a Quant: Reflections on Physics and Finance
by Emanuel Derman
Published 1 Jan 2004

We ran daily reports on the desk's inventory using both these models. Different clients preferred different metrics, depending on their sophistication and on the accounting rules and regulations to which they were subject. We also did some longer-term, client-focused research, developing improved statistical models for homeowner prepayments or programs for valuing the more exotic ARM-based structures that were growing in popularity. The traders on the desk used the option-adjusted spread model to decide how much to bid for newly available ARM pools. The calculation was arduous. Each pool consisted of a variety of mortgages with a range of coupons and a spectrum of servicing fees, and the optionadjusted spread was calculated by averaging over thousands of future scenarios, each one involving a month-by-month simulation of interest rates over hundreds of months.

pages: 377 words: 97,144

Singularity Rising: Surviving and Thriving in a Smarter, Richer, and More Dangerous World
by James D. Miller
Published 14 Jun 2012

Nobel Prize-winning economist James Heckman has written that “an entire literature has found” that cognitive abilities “significantly affect wages.”147 Of course, “cognitive abilities” aren’t necessarily the same thing as g or IQ. Recall that the theory behind g, and therefore IQ’s importance, is that a single variable can represent intelligence. To check whether a single measure of cognitive ability has predictive value, Heckman developed a statistical model testing whether one number essentially representing g and another representing noncognitive ability can explain most of the variations in wages.148 Heckman’s model shows that it could. Heckman, however, carefully points out that noncognitive traits such as “stick-to-it-iveness” are at least as important as cognitive traits in determining wages—meaning that a lazy worker with a high IQ won’t succeed at Microsoft or Goldman Sachs.

pages: 370 words: 94,968

The Most Human Human: What Talking With Computers Teaches Us About What It Means to Be Alive
by Brian Christian
Published 1 Mar 2011

UCSD’s computational linguist Roger Levy: “Programs have gotten relatively good at what is actually said. We can devise complex new expressions, if we intend new meanings, and we can understand those new meanings. This strikes me as a great way to break the Turing test [programs] and a great way to distinguish yourself as a human. I think that in my experience with statistical models of language, it’s the unboundedness of human language that’s really distinctive.”4 Dave Ackley offers very similar confederate advice: “I would make up words, because I would expect programs to be operating out of a dictionary.” My mind on deponents and attorneys, I think of drug culture, how dealers and buyers develop their own micro-patois, and how if any of these idiosyncratic reference systems started to become too standardized—if they use the well-known “snow” for cocaine, for instance—their text-message records and email records become much more legally vulnerable (i.e., have less room for deniability) than if the dealers and buyers are, like poets, ceaselessly inventing.

pages: 364 words: 102,926

What the F: What Swearing Reveals About Our Language, Our Brains, and Ourselves
by Benjamin K. Bergen
Published 12 Sep 2016

So if you believe that exposure to violence in media could be a confounding factor—it correlates with exposure to profanity and could explain some amount of aggression—then you measure not only how much profanity but also how much violence children are exposed to. The two will probably correlate, but the key point is that you can measure exactly how much media violence correlates with child aggressiveness, and you can pull that apart in a statistical model from the amount that profanity exposure correlates with child aggressiveness. The authors of the Pediatrics study tried to do this. But to know that profanity exposure per se and not any of these other possible confounding factors is responsible for increased reports of aggressiveness, you’d need to do the same thing not just for exposure to media violence, as the authors did, but for every other possible confounding factor, which they did not.
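One standard way to "pull apart" correlated exposures is to put both in the same regression, so each coefficient reflects its association with the outcome while holding the other fixed. A toy simulation (all effect sizes invented) makes the point:

```python
import numpy as np

# Simulated data: violence exposure drives most of the aggression score, and
# profanity exposure correlates with violence exposure (the confound).
rng = np.random.default_rng(0)
violence = rng.normal(size=200)
profanity = 0.7 * violence + rng.normal(scale=0.5, size=200)
aggression = 0.8 * violence + 0.1 * profanity + rng.normal(scale=0.3, size=200)

# Regress aggression on both predictors at once.
X = np.column_stack([np.ones(200), profanity, violence])
coef, *_ = np.linalg.lstsq(X, aggression, rcond=None)
print("intercept, profanity, violence:", np.round(coef, 2))
# With violence included, the profanity coefficient stays small; drop violence
# from X and profanity would appear to matter far more than it actually does.
```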

pages: 349 words: 98,868

Nervous States: Democracy and the Decline of Reason
by William Davies
Published 26 Feb 2019

One study conducted across Europe found that the experience of unemployment leads people to become less trusting in parliament, but more trusting in the police.3 The elites who are in trouble are the ones whose lineage begins in the seventeenth century: journalists, experts, officials. They are the ones whose task it was originally to create portraits, maps, statistical models of the world, that the rest of us were expected to accept, on the basis that they were unpolluted by personal feelings or bias. Social media has accelerated this declining credibility, but it is not the sole cause. This split reflects something about the role of speed in our politics. The work of government and of establishing facts can be slow and frustrating.

Data and the City
by Rob Kitchin,Tracey P. Lauriault,Gavin McArdle
Published 2 Aug 2017

Here, a visualization is not simply describing or displaying the data, but is used as a visual analytical tool to extract information, build visual models and explanation, and to guide further statistical analysis (Keim et al. 2010). Often several different types of visual graphics are used in conjunction with each other so that the data can be examined from more than one perspective simultaneously. In addition, data mining and statistical modelling, such as prediction, simulation and optimization, can be performed and outputted through visual interfaces and outputs (Thomas and Cook 2006). In the context of city dashboards, this epistemology is framed within the emerging field of urban informatics (Foth 2009) and urban science (Batty 2013).

Risk Management in Trading
by Davis Edwards
Published 10 Jul 2014

As a result, two equally qualified risk managers can come up with slightly different estimates for VAR. In addition, there are several common approaches to estimating VAR. These approaches can include using historical price movements, forward implied volatility from options markets, or a variety of statistical models. One common approach used to estimate VAR is to assume that percentage changes in price (called percent returns) are normally distributed. Historical data would then be used to estimate the size of a typical price move. This assumption used in the model (that percent returns are normally distributed and can be described by a single parameter called volatility) would give the model its name (this is called a parametric model).
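For the parametric approach described above, a bare-bones version is: estimate the volatility of daily percent returns, then multiply by the z-score for the chosen confidence level and the square root of the horizon. The prices below are invented, and the z-score table is a stand-in for a proper inverse-normal function.

```python
import numpy as np

def parametric_var(prices, confidence=0.99, horizon_days=1):
    """Parametric VaR sketch: assume daily percent returns are normally
    distributed and estimate volatility from history. Returns a loss fraction."""
    returns = np.diff(prices) / prices[:-1]
    sigma = returns.std(ddof=1)
    # z-scores for common confidence levels; scipy.stats.norm.ppf would
    # generalise this, but a small table keeps the sketch dependency-free.
    z = {0.95: 1.645, 0.99: 2.326}[confidence]
    return z * sigma * np.sqrt(horizon_days)

prices = np.array([100.0, 101.2, 100.5, 102.0, 101.1, 103.0, 102.2, 104.1])
print(f"1-day 99% VaR ≈ {parametric_var(prices):.2%} of position value")
```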

Lessons-Learned-in-Software-Testing-A-Context-Driven-Approach
by Anson-QA

If the open bug count is low near the desired end of the project, does this mean that the product is more stable, or that the test team is spending too much time writing reports, running regression tests (tests that rarely find new bugs), demonstrating the product at tradeshows, and doing other activities that aren't geared toward finding new bugs? We can't tell this from the bug counts. We are particularly unimpressed with statistical models of bug arrival rates (how many bugs will be found per unit time) as vehicles for managing projects because we see no reason to believe that the assumptions underlying the probability models have any correspondence to the realities of the project. Simmonds (2000) provides a clear, explicit statement of the assumptions of one such model.

pages: 463 words: 105,197

Radical Markets: Uprooting Capitalism and Democracy for a Just Society
by Eric Posner and E. Weyl
Published 14 May 2018

The core idea of ML is that the world and the human minds that intelligently navigate it are more complicated and uncertain than any programmer can precisely formulate in a set of rules. Instead of attempting to characterize intelligence through a set of instructions that the computer will directly execute, ML devises algorithms that train often complicated and opaque statistical models to “learn” to classify or predict outcomes of interest, such as how creditworthy a borrower is or whether a photo contains a cat. The most famous example of an ML algorithm is a “neural network,” or neural net for short. Neural nets imitate the structure of the human brain rather than perform a standard statistical analysis.

pages: 571 words: 105,054

Advances in Financial Machine Learning
by Marcos Lopez de Prado
Published 2 Feb 2018

Bubbles are formed in compressed (low entropy) markets.

18.8.2 Maximum Entropy Generation

In a series of papers, Fiedor [2014a, 2014b, 2014c] proposes to use Kontoyiannis [1997] to estimate the amount of entropy present in a price series. He argues that, out of the possible future outcomes, the one that maximizes entropy may be the most profitable, because it is the one that is least predictable by frequentist statistical models. It is the black swan scenario most likely to trigger stop losses, thus generating a feedback mechanism that will reinforce and exacerbate the move, resulting in runs in the signs of the returns time series.

18.8.3 Portfolio Concentration

Consider an NxN covariance matrix V, computed on returns.
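A very rough way to see the entropy-of-a-price-series idea from "Maximum Entropy Generation" above is to estimate the entropy of the signs of returns. The sketch below uses a simple plug-in estimate over short binary "words"; it is a stand-in for, not an implementation of, the Kontoyiannis estimator cited in the excerpt.

```python
import math
from collections import Counter

def sign_entropy(prices, word_len=3):
    """Rough plug-in estimate of the entropy (bits per move) of the signs of
    price changes, based on frequencies of short binary words."""
    signs = "".join("1" if b > a else "0" for a, b in zip(prices, prices[1:]))
    words = [signs[i:i + word_len] for i in range(len(signs) - word_len + 1)]
    counts = Counter(words)
    n = len(words)
    h_words = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return h_words / word_len   # lower values suggest a more predictable series

prices = [100, 101, 100.5, 102, 101.7, 103, 102.4, 104, 103.1, 105]
print(f"estimated entropy ≈ {sign_entropy(prices):.2f} bits per move")
```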

pages: 323 words: 100,772

Prisoner's Dilemma: John Von Neumann, Game Theory, and the Puzzle of the Bomb
by William Poundstone
Published 2 Jan 1993

However, the ALMOST TIT FOR TAT strategy, which throws in a test defection to see if it’s dealing with ALL C, is not as good as plain TIT FOR TAT when paired with TIT FOR TAT. It’s 1 point worse. Beating TIT FOR TAT is tougher than it looks. Axelrod’s tournaments included sophisticated strategies designed to detect an exploitable opponent. Some created a constantly updated statistical model of their opponent’s behavior. This allowed them to predict what the opponent strategy would do after cooperation and after defection, and to adjust their own choices accordingly. This sounds great. It does allow these strategies to exploit unresponsive strategies like ALL C and RANDOM. The trouble is, no one submitted an unresponsive strategy (other than the RANDOM strategy Axelrod included).
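A "constantly updated statistical model of the opponent" can be as simple as counting how often the opponent cooperated after each of your own moves. The class below is an illustrative toy, not one of the actual tournament entries; the 0.8 exploitation threshold is an arbitrary choice.

```python
from collections import defaultdict

class OpponentModel:
    def __init__(self):
        # my_last_move -> [times opponent cooperated, times observed]; weak prior
        self.counts = defaultdict(lambda: [1, 2])
        self.opponent_last = "C"

    def update(self, my_last_move, their_move):
        coops, total = self.counts[my_last_move]
        self.counts[my_last_move] = [coops + (their_move == "C"), total + 1]
        self.opponent_last = their_move

    def p_cooperate_after(self, my_move):
        coops, total = self.counts[my_move]
        return coops / total

    def choose(self):
        if self.p_cooperate_after("D") > 0.8:   # opponent looks unresponsive: exploit it
            return "D"
        return self.opponent_last               # otherwise fall back to TIT FOR TAT
```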

pages: 304 words: 99,836

Why I Left Goldman Sachs: A Wall Street Story
by Greg Smith
Published 21 Oct 2012

The problem was that these hedge funds were not anticipating “Black Swan” events, a term coined by Nassim Nicholas Taleb to explain once-in-a-thousand-year-type events that people do not expect and that models can’t predict. What we saw in 2008 and 2009 was a series of Black Swan events that the statistical models would have told you were not possible, according to history. Instead of the S&P 500 Index having average daily percentage swings of 1 percent, for a sustained period the market was swinging back and forth more than 5 percent per day—five times what was normal. No computer model could have predicted this.

Artificial Whiteness
by Yarden Katz

For discussion of the narratives that have been used in the past to explain the rise and fall of neural networks research, see Mikel Olazaran, “A Sociological Study of the Official History of the Perceptrons Controversy,” Social Studies of Science 26, no. 3 (1996): 611–59.     4.   As one example of many, consider the coverage of a scientific journal article that presented a statistical model which the authors claim recognizes emotions better than people: Carlos F. Benitez-Quiroz, Ramprakash Srinivasan, and Aleix M. Martinez, “Facial Color Is an Efficient Mechanism to Visually Transmit Emotion,” Proceedings of the National Academy of Sciences 115, no. 14 (2018): 3581–86. The article does not reference AI at all, but that is how it was described in the media.

pages: 419 words: 102,488

Chaos Engineering: System Resiliency in Practice
by Casey Rosenthal and Nora Jones
Published 27 Apr 2020

Nevertheless, the process of “glueing” these pieces together can be thought of as establishing a set of abstract mappings. We describe this in a recent paper.5 Boolean formulae were just one possible way to build our models and it is easy to imagine others. Perhaps most appropriate to the fundamentally uncertain nature of distributed systems would be probabilistic models or trained statistical models such as deep neural networks. These are outside of my own area of expertise, but I would very much like to find collaborators and recruit graduate students who are interested in working on these problems! It was not my intent in this section to advertise the LDFI approach per se, but rather to provide an example that shows that the sort of end-to-end “intuition automation” for which I advocated in the sidebar is possible in practice.

pages: 289 words: 95,046

Chaos Kings: How Wall Street Traders Make Billions in the New Age of Crisis
by Scott Patterson
Published 5 Jun 2023

A Dragon King, he explained, is a dynamic process that moves toward massive instability, known as a phase transition. As an example, he showed a slide of water heating to one hundred degrees Celsius—the boiling point. The bad news is that Dragon Kings occur much more frequently than traditional statistical models would imply. The good news, he said, is that this behavior can be predicted as a system approaches what he called bifurcation—the sudden shift in the phase transition, the leap from water to steam. “Close to bifurcation you have a window of visibility,” like a plane flying from clouds into the sunshine.

pages: 599 words: 98,564

The Mutant Project: Inside the Global Race to Genetically Modify Humans
by Eben Kirksey
Published 10 Nov 2020

Jiankui He zoomed through his PhD, completing his dissertation in three and a half years—extremely fast, especially for someone who was still trying to perfect his English along the way. The dissertation was ambitious and interdisciplinary: a study of the “modularity, diversity, and stochasticity” of evolutionary processes over the last 4 billion years. He used statistical models and differential equations to study seemingly unrelated systems: the structure of animal bodies, the dynamics of global financial markets, emergent strains of the influenza virus, and—fatefully—the CRISPR molecule in bacteria. He defended his dissertation in December 2010, more than a year before Jennifer Doudna and Emmanuelle Charpentier demonstrated how to manipulate DNA with CRISPR.

pages: 307 words: 101,998

IRL: Finding Realness, Meaning, and Belonging in Our Digital Lives
by Chris Stedman
Published 19 Oct 2020

In an article for the journal Nature Human Behavior, “The Association between Adolescent Well-Being and Digital Technology Use,” researchers Amy Orben and Andrew K. Przybylski argue that the relationship between technology and well-being actually varies a great deal depending on how you set up the statistical model. When they test many of these different approaches, they conclude that the negative relationship between technology and well-being is really quite small and probably negligible. Yes, our digital tools make some of us unhappy, but it’s not correct to say that they always do, or that they must.

pages: 411 words: 108,119

The Irrational Economist: Making Decisions in a Dangerous World
by Erwann Michel-Kerjan and Paul Slovic
Published 5 Jan 2010

The particular danger, now both available and salient, is likely to be overestimated in the future. Second, and by contrast, we tend to raise our probability estimate insufficiently when an experienced risk occurs. Follow-up research should document these tendencies with many more examples, and in laboratory settings. If improved predictions are our goal, it should also provide rigorous statistical models of effective updating of virgin and experienced risks. Future inquiry should consider resembled risks as well. Evidence from both terrorist incidents and financial markets suggests that we have difficulty extrapolating from risks that, though varied, bear strong similarities. Behavioral biases such as these are difficult to counteract, but awareness of them is the first step.

pages: 446 words: 102,421

Network Security Through Data Analysis: Building Situational Awareness
by Michael S Collins
Published 23 Feb 2014

An Introduction to R for Security Analysts

R is an open source statistical analysis package developed initially by Ross Ihaka and Robert Gentleman of the University of Auckland. R was designed primarily by statisticians and data analysts, and is related to commercial statistical packages such as S and SPSS. R is a toolkit for exploratory data analysis; it provides statistical modeling and data manipulation capabilities, visualization, and a full-featured programming language. R fulfills a particular utility-knife-like role for analysis. Analytic work requires some tool for creating and manipulating small ad hoc databases that summarize raw data: for example, hourly summaries of traffic volume from a particular host, broken down by service.
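
The kind of ad hoc summary described above, hourly traffic volume for one host broken down by service, takes only a few lines in most data-analysis toolkits. A rough sketch, in Python and pandas rather than R and with invented column names (timestamp, host, service, bytes):

    import pandas as pd

    # Hypothetical flow records: one row per flow, with a timestamp,
    # source host, service label, and byte count.
    flows = pd.DataFrame({
        "timestamp": pd.to_datetime([
            "2014-02-01 10:05", "2014-02-01 10:40",
            "2014-02-01 11:10", "2014-02-01 11:55",
        ]),
        "host": ["10.0.0.5", "10.0.0.5", "10.0.0.5", "10.0.0.9"],
        "service": ["http", "dns", "http", "ssh"],
        "bytes": [12000, 300, 45000, 8000],
    })

    # Hourly traffic volume for one host, broken down by service.
    host_flows = flows[flows["host"] == "10.0.0.5"]
    hourly = (host_flows
              .groupby([host_flows["timestamp"].dt.floor("h"), "service"])["bytes"]
              .sum())
    print(hourly)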

pages: 385 words: 111,113

Augmented: Life in the Smart Lane
by Brett King
Published 5 May 2016

The skills and level of education required for each job were taken into consideration too. These features were weighted according to how automatable they were, and according to the engineering obstacles currently preventing automation or computerisation. The results were calculated with a common statistical modelling method. The outcome was clear. In the United States, more than 45 per cent of jobs could be automated within one to two decades. Table 2.3 shows a few jobs that are basically at 100 per cent risk of automation (I’ve highlighted a few of my favourites):8

Table 2.3: Some of the Jobs at Risk from Automation and AI
Telemarketers
Data Entry Professionals
Procurement Clerks
Title Examiners, Abstractors and Searchers
Timing Device Assemblers and Adjusters
Shipping, Receiving and Traffic Clerks
Sewers, Hand
Insurance Claims and Policy Processing Clerks
Milling and Planing Machine Setters, Operators
Mathematical Technicians
Brokerage Clerks
Credit Analysts
Insurance Underwriters
Order Clerks
Parts Salespersons
Watch Repairers
Loan Officers
Claims Adjusters, Examiners and Investigators
Cargo and Freight Agents
Insurance Appraisers, Auto Damage
Driver/Sales Workers
Tax Preparers
Umpires, Referees and Other Sports Officials
Radio Operators
Photographic Process Workers and Processing Machine Operators
Bank Tellers
Legal Secretaries
New Accounts Clerks
Etchers and Engravers
Bookkeeping, Accounting and Auditing Clerks
Library Technicians
Packaging and Filling Machine Operators
Inspectors, Testers, Sorters, Samplers and Weighing Technicians

One often voiced concern is that AI will create huge wealth for a limited few who own the technology, thus implying that the wealth gap will become even more acute.

pages: 338 words: 106,936

The Physics of Wall Street: A Brief History of Predicting the Unpredictable
by James Owen Weatherall
Published 2 Jan 2013

Sornette, Didier, and Christian Vanneste. 1992. “Dynamics and Memory Effects in Rupture of Thermal Fuse.” Physical Review Letters 68: 612–15. — — — . 1994. “Dendrites and Fronts in a Model of Dynamical Rupture with Damage.” Physical Review E 50 (6, December): 4327–45. Sornette, D., C. Vanneste, and L. Knopoff. 1992. “Statistical Model of Earthquake Foreshocks.” Physical Review A 45: 8351–57. Sourd, Véronique, Le. 2008. “Hedge Fund Performance in 2007.” EDHEC Risk and Asset Management Research Centre. Spence, Joseph. 1820. Observations, Anecdotes, and Characters, of Books and Men. London: John Murray. Stewart, James B. 1992.

pages: 461 words: 106,027

Zero to Sold: How to Start, Run, and Sell a Bootstrapped Business
by Arvid Kahl
Published 24 Jun 2020

Forecasting will allow you to explore several scenarios of where your business could go if you made certain decisions that are hard to reverse and would be very risky to attempt in reality: hiring a number of people, switching to another audience completely, or pivoting to another kind of product. It's business experimentation powered by statistical models that are at least less biased than your hopeful entrepreneurial perspective. It's a projection of your ambitions into the future. Being able to share this kind of projection will give your acquirer the confidence that you have thought about these things, and there is a statistically significant chance that the goals you have set may be reached in reality.

pages: 688 words: 107,867

Python Data Analytics: With Pandas, NumPy, and Matplotlib
by Fabio Nelli
Published 27 Sep 2018

Other methods of data mining, such as decision trees and association rules, automatically extract important facts or rules from the data. These approaches can be used in parallel with data visualization to uncover relationships between the data.

Predictive Modeling

Predictive modeling is a process used in data analysis to create or choose a suitable statistical model to predict the probability of a result. After exploring the data, you have all the information needed to develop the mathematical model that encodes the relationship between the data. These models are useful for understanding the system under study, and in a specific way they are used for two main purposes.

The Deep Learning Revolution (The MIT Press)
by Terrence J. Sejnowski
Published 27 Sep 2018

These success stories had a common trajectory. In the past, computers were slow and only able to explore toy models with just a few parameters. But these toy models generalized poorly to real-world data. When abundant data were available and computers were much faster, it became possible to create more complex statistical models and to extract more features and relationships between the features. Deep learning automates this process. Instead of having domain experts handcraft features for each application, deep learning can extract them from very large data sets. As computation replaces labor and continues to get cheaper, more labor-intensive cognitive tasks will be performed by computers.

pages: 398 words: 105,917

Bean Counters: The Triumph of the Accountants and How They Broke Capitalism
by Richard Brooks
Published 23 Apr 2018

The BBC’s then economics editor Robert Peston blogged that ‘some would say [there] is a flaw the size of Greater Manchester in its analysis – because KPMG is ignoring one of the fundamental causes of lacklustre growth in many parts of the UK, which is a shortage of skilled labour and of easily and readily developable land’.35 When a committee of MPs came to examine the report, academics lined up to rubbish it. ‘I don’t think the statistical work is reliable,’ said a professor of statistical modelling at Imperial College, London. ‘They [KPMG] apply this procedure which is essentially made up, which provides them with an estimate,’ added a professor of economic geography from the London School of Economics. ‘It is something that really shouldn’t be done in a situation where we are trying to inform public debate using statistical analysis.’36 Noting that HS2 ‘stands or falls on this piece of work’, the committee’s acerbic chairman Andrew Tyrie summoned the report’s authors.37 One exchange with KPMG’s Lewis Atter (a former Treasury civil servant) spoke volumes for the bean counters’ role in lumbering the taxpayer with monolithic projects:

Tyrie: It [the £15bn a year economic projection] is a reasonable forecast of what we might hope to get from this project?

pages: 392 words: 108,745

Talk to Me: How Voice Computing Will Transform the Way We Live, Work, and Think
by James Vlahos
Published 1 Mar 2019

Because unit selection assembles snippets of actual human speech, the method has traditionally been the best way to concoct a natural-sounding voice. It’s like cooking with ingredients from the local farmers market. A second-tier method, called parametric synthesis, has historically been the speech industry’s Velveeta cheese. For it, audio engineers build statistical models of all of the various language sounds. Then they use the data to synthetically reproduce those sounds and concatenate them into full words and phrases. This approach typically produces a more robotic-sounding voice than a unit selection one. The advantage, though, is that engineers don’t need to spend eons recording someone like Bennett.

pages: 414 words: 109,622

Genius Makers: The Mavericks Who Brought A. I. To Google, Facebook, and the World
by Cade Metz
Published 15 Mar 2021

He later called it his “come-to-Jesus moment,” when he realized he had spent six years writing rules that were now obsolete. “My fifty-two-year-old body had one of those moments when I saw a future where I wasn’t involved,” he says. The world’s natural language researchers soon overhauled their approach, embracing the kind of statistical models unveiled that afternoon at the lab outside Seattle. This was just one of many mathematical methods that spread across the larger community of AI researchers in the 1990s and on into the 2000s, with names like “random forests,” “boosted trees,” and “support vector machines.” Researchers applied some to natural language understanding, others to speech recognition and image recognition.

pages: 428 words: 103,544

The Data Detective: Ten Easy Rules to Make Sense of Statistics
by Tim Harford
Published 2 Feb 2021

David Jackson and Gary Marx, “Data Mining Program Designed to Predict Child Abuse Proves Unreliable, DCFS Says,” Chicago Tribune, December 6, 2017; and Dan Hurley, “Can an Algorithm Tell When Kids Are in Danger?,” New York Times Magazine, January 2, 2018, https://www.nytimes.com/2018/01/02/magazine/can-an-algorithm-tell-when-kids-are-in-danger.html. 22. Hurley, “Can an Algorithm.” 23. Andrew Gelman, “Flaws in Stupid Horrible Algorithm Revealed Because It Made Numerical Predictions,” Statistical Modeling, Causal Inference, and Social Science (blog), July 3, 2018, https://statmodeling.stat.columbia.edu/2018/07/03/flaws-stupid-horrible-algorithm-revealed-made-numerical-predictions/. 24. Sabine Hossenfelder, “Blaise Pascal, Florin Périer, and the Puy de Dôme Experiment,” BackRe(Action) (blog), November 21, 2007, http://backreaction.blogspot.com/2007/11/blaise-pascal-florin-p-and-puy-de-d.html; and David Wootton, The Invention of Science: A New History of the Scientific Revolution (London: Allen Lane, 2015), chap. 8. 25.

pages: 363 words: 109,834

The Crux
by Richard Rumelt
Published 27 Apr 2022

They contain a strong random element. Track your monthly spending on groceries. A blip upward does not mean your finances are out of control, and a downward blip does not signal coming starvation. However, to insert proper logic into their estimates of value, the analysts would need PhDs in advanced Bayesian statistical modeling and certainly would not use spreadsheets. By construction, their fairly primitive estimating tools grossly overreact to blips. A third problem is that the “true” value of a company is very hard to know. Fischer Black, coauthor of the famous 1973 Black-Scholes option-pricing formula, was a believer that market prices were unbiased estimates of true value.3 But, over drinks, he also told me that the “true” value of a company was anywhere from half to twice the current stock price.

pages: 368 words: 102,379

Pandemic, Inc.: Chasing the Capitalists and Thieves Who Got Rich While We Got Sick
by J. David McSwane
Published 11 Apr 2022

Fintech entered the mainstream with the advent of companies like SoFi, which offered more favorable rates than banks for those looking to consolidate student loan debt. But the model found a niche—and billions in easy profit—in servicing small and struggling businesses that banks had overlooked or turned away. Through automation, data, and statistical models that help determine if applicants will repay a loan, fintechs removed much of the human work from the loan approval process. With less human involvement, it appears, came less racial bias. Researchers at New York University, for instance, found that businesses owned by Black people were 70 percent more likely to have gotten their PPP loan from fintech than a small bank.

pages: 918 words: 257,605

The Age of Surveillance Capitalism
by Shoshana Zuboff
Published 15 Jan 2019

The company describes itself “at the forefront of innovation in machine intelligence,” a term in which it includes machine learning as well as “classical” algorithmic production, along with many computational operations that are often referred to with other terms such as “predictive analytics” or “artificial intelligence.” Among these operations Google cites its work on language translation, speech recognition, visual processing, ranking, statistical modeling, and prediction: “In all of those tasks and many others, we gather large volumes of direct or indirect evidence of relationships of interest, applying learning algorithms to understand and generalize.”9 These machine intelligence operations convert raw material into the firm’s highly profitable algorithmic products designed to predict the behavior of its users.

For individuals, the attraction is the possibility of a world where everything is arranged for your convenience—your health checkup is magically scheduled just as you begin to get sick, the bus comes just as you get to the bus stop, and there is never a line of waiting people at city hall. As these new abilities become refined by the use of more sophisticated statistical models and sensor capabilities, we could well see the creation of a quantitative, predictive science of human organizations and human society.38

III. The Principles of an Instrumentarian Society

Pentland’s theory of instrumentarian society came to full flower in his 2014 book Social Physics, in which his tools and methods are integrated into an expansive vision of our futures in a data-driven instrumentarian society governed by computation.

pages: 484 words: 120,507

The Last Lingua Franca: English Until the Return of Babel
by Nicholas Ostler
Published 23 Nov 2010

In essence these resources are nothing other than large quantities of text (text corpora) or recorded speech (speech databases) in some form that is systematic and well documented enough to be tractable for digital analysis. From these files, it is possible to derive indices, glossaries, and thesauri, which can be the basis for dictionaries; it is also possible to derive statistical models of the languages, and (if they are multilingual files as, e.g., the official dossiers of the Canadian Parliament, the European Union, or some agency of the United Nations) models of equivalences among languages. These models are calculations of the conditional probability of sequences of sounds, or sequences of words, on the basis of past performance in all those recorded files.
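
At their simplest, the “calculations of the conditional probability of sequences of words” described above are n-gram counts over a corpus. A minimal bigram sketch with toy data, not any particular system’s implementation:

    from collections import Counter

    corpus = "the cat sat on the mat the cat ate".split()

    # Count single words and adjacent word pairs (bigrams).
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))

    def p_next(word, nxt):
        """Estimated conditional probability P(nxt | word) from past performance."""
        return bigrams[(word, nxt)] / unigrams[word]

    print(p_next("the", "cat"))  # 2 occurrences of "the cat" / 3 of "the" = 0.667

Real systems smooth these estimates and condition on longer histories, but the underlying idea is the same counting exercise.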

pages: 459 words: 118,959

Confidence Game: How a Hedge Fund Manager Called Wall Street's Bluff
by Christine S. Richard
Published 26 Apr 2010

Although an earthquake in California doesn’t increase the chance of an earthquake occurring in Florida, bond defaults tend to be contagious and closely correlated in times of economic stress. That makes CDOs, which mingle various types of loans across different geographic regions, vulnerable to the same pressures. In fact, the whole bond-insurance industry might be vulnerable to faulty statistical models that rely on the past to predict the future, Ackman argued in the report. These models estimated that MBIA faced just a 1-in-10,000 chance of confronting a scenario that would leave it unable to meet all its claims. Yet historical data-based models considered the 1987 stock market crash an event so improbable that it would be expected to happen only once in a trillion years, Ackman explained.

pages: 437 words: 113,173

Age of Discovery: Navigating the Risks and Rewards of Our New Renaissance
by Ian Goldin and Chris Kutarna
Published 23 May 2016

Sequencing machines arrived to automate many of the lab technicians’ decoding tasks. DNA copy machines were invented that could take a single DNA snippet of interest and make millions of copies overnight, which in turn enabled a new generation of faster sequencers designed to apply brute force to now-inexhaustible source material. Mathematicians developed new statistical models to puzzle out how to stitch any number of snippets back together into their correct order, and the “shotgun sequencing” technique (basically, blasting the entire genome into tens of thousands of very short segments) was born to take advantage of this new “sequence now, line up later” capability.

pages: 402 words: 110,972

Nerds on Wall Street: Math, Machines and Wired Markets
by David J. Leinweber
Published 31 Dec 2008

Fischer Black’s Quantitative Strategies Group at Goldman Sachs were algo pioneers. They were perhaps the first to use computers for actual trading, as well as for identifying trades. The early alpha seekers were the first combatants in the algo wars. Pairs trading, popular at the time, relied on statistical models. Finding stronger short-term correlations than the next guy had big rewards. Escalation beyond pairs to groups of related securities was inevitable. Parallel developments in futures markets opened the door to electronic index arbitrage trading. Automated market making was a valuable early algorithm.

pages: 401 words: 112,784

Hard Times: The Divisive Toll of the Economic Slump
by Tom Clark and Anthony Heath
Published 23 Jun 2014

There is also steady downward-shifting between the categories in the frequency with which people claim to help. Analysis of CPS data presented at the SCHMI seminar in Sarasota, Florida, March 2012, by James Laurence and Chaeyoon Lim. 37. Data presented at the SCHMI seminar in Sarasota, Florida, March 2012, by James Laurence and Chaeyoon Lim. 38. All the statistical models – the results of which are reported in Table I of Lim and Laurence, ‘Doing good when times are bad’ – adjust for personal characteristics, including employment status, and yet the significant decline in volunteering remains. Factoring household income into the modelling, the authors report, yields results that are ‘almost identical’. 39.

pages: 403 words: 111,119

Doughnut Economics: Seven Ways to Think Like a 21st-Century Economist
by Kate Raworth
Published 22 Mar 2017

Given its uncanny resemblance to that famous inequality curve of Chapter 5, this new one was soon known as the Environmental Kuznets Curve. The Environmental Kuznets Curve, which suggests that growth will eventually fix the environmental problems that it creates. Having discovered another apparent economic law of motion, the economists could not resist the urge to use statistical modelling in order to identify the level of income at which the curve magically turned. For lead contamination in rivers, they found, pollution peaked and started to fall when national income reached $1,887 per person (measured in 1985 US dollars, the standard metric of the day). What about sulphur dioxide in the air?

pages: 409 words: 118,448

An Extraordinary Time: The End of the Postwar Boom and the Return of the Ordinary Economy
by Marc Levinson
Published 31 Jul 2016

Although the influx of foreign capital set off a boom after 1986, job creation did not follow. Spain continued to have by far the highest unemployment rate in the industrial world. Its experience, like that of France, showed that the economic malaise afflicting the wealthy economies was beyond the reach of ideologically driven solutions. While the statist model had failed to revive growth, stimulate investment, and raise living standards in both France and Spain, more market-oriented policies had proven no more efficacious. Neither approach offered a realistic chance of bringing back the glorious years, which were beyond the ability of any government to restore.21

CHAPTER 13  Morning in America

October 6, 1979, was a chilly Saturday in Washington.

Human Frontiers: The Future of Big Ideas in an Age of Small Thinking
by Michael Bhaskar
Published 2 Nov 2021

Later his three-colour principle was vital for the invention of colour television. He made vaulting gains in the understanding of Saturn's rings, then one of the most intractable problems in planetary physics. Before moving to electromagnetism, Maxwell had theorised the radical idea of a field. His understanding of gases led towards the use in science of statistical models, a mathematical advance that paved the way for modern physics. Maxwell is pivotal here: after him, physics grew ever more abstract, conceptually reliant on the most sophisticated mathematical techniques. Maxwell understood that while some processes were inaccessible to direct human perception, statistical virtuosity could bridge the gap.

pages: 372 words: 116,005

The Secret Barrister: Stories of the Law and How It's Broken
by Secret Barrister
Published 1 Jul 2018

The analysis used seventeen broad ‘offence groups’, and compared defendants from different ethnic backgrounds within these groups. The groups each comprised a wide range of offences; for example ‘violence against the person’ included crimes ranging from common assault to murder, and drug offence categories did not distinguish between Class A and Class B offences, or between possession and supply. Furthermore, the statistical modelling did not take into account aggravating and mitigating features of the offences. Further analysis is therefore required into sentencing of specific offences, including aggravating and mitigating factors, before any meaningful comparisons might be drawn. 10. See, for example, M R Banaji and A G Greenwald, Blind Spot: Hidden Biases of Good People, Delacorte Press, 2013.

Super Thinking: The Big Book of Mental Models
by Gabriel Weinberg and Lauren McCann
Published 17 Jun 2019

The app developers think that their app can improve this rate, helping more people fall asleep in less than ten minutes. The developers plan a study in a sleep lab to test their theory. The test group will use their app and the control group will just go to sleep without it. (A real study might have a slightly more complicated design, but this simple design will let us better explain the statistical models.) The statistical setup behind most experiments (including this one) starts with a hypothesis that there is no difference between the groups, called the null hypothesis. If the developers collect sufficient evidence to reject this hypothesis, then they will conclude that their app really does help people fall asleep faster.
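
To make the setup concrete, here is a minimal sketch of the comparison the developers might run, assuming they simply record minutes-to-sleep in each group and apply a two-sample t-test; the numbers and the choice of test are invented for illustration:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Hypothetical minutes-to-fall-asleep for each group of 50 people.
    control = rng.normal(loc=14.0, scale=4.0, size=50)    # no app
    treatment = rng.normal(loc=11.5, scale=4.0, size=50)  # app group

    # Null hypothesis: no difference in mean time to fall asleep.
    t_stat, p_value = stats.ttest_ind(treatment, control)
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
    # A sufficiently small p-value is evidence against the null hypothesis
    # that the app makes no difference.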

Financial Statement Analysis: A Practitioner's Guide
by Martin S. Fridson and Fernando Alvarez
Published 31 May 2011

There is ample evidence, as well, of inefficiency in many large, bureaucratic organizations. The point, however, is not to debate whether big corporations are invincible or nimble, but to determine whether they meet their obligations with greater regularity, on average, than their pint-size peers. Statistical models of default risk confirm that they do. Therefore, the bond-rating agencies are following sound methodology when they create size-based peer groups. Line of business is another basis for defining a peer group. Because different industries have different financial characteristics, ratio comparisons across industry lines may not be valid.

Succeeding With AI: How to Make AI Work for Your Business
by Veljko Krunic
Published 29 Mar 2020

PID compares errors between current values and a desired value of some process variable for the system under control and applies the correction to that process variable based on proportional, integral, and derivative terms. PID controllers are widely used in various control systems.
Quantitative analysis (QA)—According to Will Kenton [187]: Quantitative analysis (QA) is a technique that seeks to understand behavior by using mathematical and statistical modeling, measurement, and research. Quantitative analysts aim to represent a given reality in terms of a numerical value.
Quantitative analyst (quant)—A practitioner of quantitative analysis [187]. Common business verticals in which quants work are trading and other financial services.
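
The PID description above maps directly onto a few lines of code. A minimal discrete-time sketch, with invented gains, time step, and toy process (real controllers add tuning, anti-windup, and filtering):

    def pid_step(error, state, kp=1.0, ki=0.1, kd=0.05, dt=0.1):
        """One update of a simple PID controller.
        state = (integral_of_error, previous_error)."""
        integral, prev_error = state
        integral += error * dt
        derivative = (error - prev_error) / dt
        correction = kp * error + ki * integral + kd * derivative
        return correction, (integral, error)

    # Drive a toy process variable toward a setpoint of 1.0.
    setpoint, pv, state = 1.0, 0.0, (0.0, 0.0)
    for _ in range(50):
        correction, state = pid_step(setpoint - pv, state)
        pv += 0.1 * correction  # toy process response
    print(round(pv, 3))  # approaches 1.0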

pages: 755 words: 121,290

Statistics hacks
by Bruce Frey
Published 9 May 2006

Contributors The following people contributed their hacks, writing, and inspiration to this book: Joseph Adler is the author of Baseball Hacks (O'Reilly), and a researcher in the Advanced Product Development Group at VeriSign, focusing on problems in user authentication, managed security services, and RFID security. Joe has years of experience analyzing data, building statistical models, and formulating business strategies as an employee and consultant for companies including DoubleClick, American Express, and Dun & Bradstreet. He is a graduate of the Massachusetts Institute of Technology with an Sc.B. and an M.Eng. in computer science and computer engineering. Joe is an unapologetic Yankees fan, but he appreciates any good baseball game.

pages: 401 words: 119,488

Smarter Faster Better: The Secrets of Being Productive in Life and Business
by Charles Duhigg
Published 8 Mar 2016

equally successful group In comments sent in response to fact-checking questions, a Google spokeswoman wrote: “We wanted to test many group norms that we thought might be important. But at the testing phase we didn’t know that the how was going to be more important than the who. When we started running the statistical models, it became clear that not only were the norms more important in our models but that 5 themes stood out from the rest.” Boston hospitals Amy C. Edmondson, “Learning from Mistakes Is Easier Said than Done: Group and Organizational Influences on the Detection and Correction of Human Error,” The Journal of Applied Behavioral Science 32, no. 1 (1996): 5–28; Druskat and Wolff, “Group Emotional Intelligence,” 132–55; David W.

pages: 472 words: 117,093

Machine, Platform, Crowd: Harnessing Our Digital Future
by Andrew McAfee and Erik Brynjolfsson
Published 26 Jun 2017

In the 1980s, I judged fully automated recognition of connected speech (listening to connected conversational speech and writing down accurately what was said) to be too difficult for machines. . . . The speech engineers have accomplished it without even relying on any syntactic§§ analysis: pure engineering, aided by statistical modeling based on gigantic amounts of raw data. . . . I not only didn’t think I would see this come about, I would have confidently bet against it.” A remark attributed to the legendary computer scientist Frederick Jelinek captures the reason behind the broad transition within the artificial intelligence community from rule-based to statistical approaches.

pages: 410 words: 119,823

Radical Technologies: The Design of Everyday Life
by Adam Greenfield
Published 29 May 2017

This conceit helps us see that while our ability to act is invariably constrained by history, existing structures of power, and the operations of chance, we nevertheless have a degree of choice as to the kind of world we wish to bring into being. As it was originally developed by Royal Dutch Shell’s Long-Term Studies group,24 scenario planning emphasized quantification, and the creation of detailed statistical models. The scenarios that follow aren’t nearly as rigorous as all that. They are by no means a comprehensive survey of the possible futures available to us, nor is there anything particularly systematic about the way I’ve presented them. They are simply suggestive of the various choices we might plausibly make.

pages: 561 words: 120,899

The Theory That Would Not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant From Two Centuries of Controversy
by Sharon Bertsch McGrayne
Published 16 May 2011

Cochran WG, Mosteller F, Tukey JW. (1954) Statistical Problems of the Kinsey Report on Sexual Behavior in the Human Male. American Statistical Association. Converse, Jean M. (1987) Survey Research in the United States: Roots and Emergence 1890–1960. University of California Press. Fienberg SE, Hoaglin DC, eds. (2006) Selected Papers of Frederick Mosteller. Springer. Fienberg SE et al., eds. (1990) A Statistical Model: Frederick Mosteller’s Contributions to Statistics, Science and Public Policy. Springer-Verlag. Hedley-Whyte J. (2007) Frederick Mosteller (1916–2006): Mentoring, A Memoir. International Journal of Technology Assessment in Health Care (23) 152–54. Ingelfinger, Joseph, et al. (1987) Biostatistics in Clinical Medicine.

pages: 421 words: 125,417

Common Wealth: Economics for a Crowded Planet
by Jeffrey Sachs
Published 1 Jan 2008

One test of this is the cross-country evidence on economic growth. We can examine whether countries with high fertility rates indeed have lower growth rates of income per person. The standard tests have been carried out by the leaders of empirical growth modeling, Robert Barro and Xavier Sala-i-Martin. Their statistical model accounts for each country’s average annual growth rate of income per person according to various characteristics of the country, including the level of income per person, the average educational attainment, the life expectancy, an indicator of the “rule of law,” and other variables, including the total fertility rate.

pages: 387 words: 119,409

Work Rules!: Insights From Inside Google That Will Transform How You Live and Lead
by Laszlo Bock
Published 31 Mar 2015

It’s possible to measure the change in patient outcomes after a physician learns a new technique by recording recovery times, incidence of complications, and degree of vision improvement. It’s much more difficult to measure the impact of training on less structured jobs or more general skills. You can develop fantastically sophisticated statistical models to draw connections between training and outcomes, and at Google we often do. In fact, we often have to, just because our engineers won’t believe us otherwise! But for most organizations, there’s a shortcut. Skip the graduate-school math and just compare how identical groups perform after only one has received training.

When Computers Can Think: The Artificial Intelligence Singularity
by Anthony Berglas , William Black , Samantha Thalind , Max Scratchmann and Michelle Estes
Published 28 Feb 2015

And like every other formalism, decision table conditions can be learnt from experience using various algorithms.

Regression

[Figure: linear and exponential regression.]

Statisticians have used regression methods since the nineteenth century to fit a function to a set of data points. In the chart above, Excel was used to automatically fit two statistical models to the data represented by the red dots. The first is a simple straight line, while the second is a curved exponential function. In both cases the 14 data points are modelled by just two numbers that are shown on the chart. The R2 value shows the sum of squares correlation between the models and the data, and shows that the exponential model is a better fit.
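
The same exercise, fitting a straight line and an exponential to a handful of points and comparing R-squared values, can be reproduced outside Excel in a few lines; the data below are made up and are not the book's 14 points:

    import numpy as np

    # Made-up data points, roughly exponential in shape.
    x = np.arange(1, 15)
    y = 2.0 * np.exp(0.25 * x) + np.random.default_rng(1).normal(0, 0.5, size=x.size)

    def r_squared(y_true, y_pred):
        ss_res = np.sum((y_true - y_pred) ** 2)
        ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
        return 1 - ss_res / ss_tot

    # Straight-line fit: y = a*x + b (two parameters).
    a, b = np.polyfit(x, y, 1)
    print("linear R^2:", r_squared(y, a * x + b))

    # Exponential fit: y = c*exp(k*x), done as a line fit in log space.
    k, log_c = np.polyfit(x, np.log(y), 1)
    print("exponential R^2:", r_squared(y, np.exp(log_c) * np.exp(k * x)))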

pages: 388 words: 125,472

The Establishment: And How They Get Away With It
by Owen Jones
Published 3 Sep 2014

Rather than flogging off the banks that were bailed out by the taxpayer, government could turn these institutions into publicly owned regional investment banks, helping to rebuild local economies across Britain. They would have specific mandates, such as supporting small businesses currently being starved of loans, as well as helping to reshape the economy and encouraging the new industrial strategy. Again, this does not mean entirely replicating a top-down statist model. British taxpayers bailed out the banks. The old American revolutionary slogan was ‘no taxation without representation’, and the same principle should apply to finance. We, the taxpayers, should have democratic representation on the boards of the banks we have saved, helping to ensure that these same banks are responsive to the needs of consumers and communities.

pages: 497 words: 123,718

A Game as Old as Empire: The Secret World of Economic Hit Men and the Web of Global Corruption
by Steven Hiatt; John Perkins
Published 1 Jan 2006

In his book Globalization and Its Discontents, Stiglitz writes: To make its [the IMF’s] programs seem to work, to make the numbers “add up,” economic forecasts have to be adjusted. Many users of these numbers do not realize that they are not like ordinary forecasts; in these instances GDP forecasts are not based on a sophisticated statistical model, or even on the best estimates of those who know the economy well, but are merely the numbers that have been negotiated as part of an IMF program. …1 Globalization, as it has been advocated, often seems to replace the old dictatorships of national elites with new dictatorships of international finance….

pages: 461 words: 125,845

This Machine Kills Secrets: Julian Assange, the Cypherpunks, and Their Fight to Empower Whistleblowers
by Andy Greenberg
Published 12 Sep 2012

So they’re planning on eventually integrating their submissions page directly into the home pages themselves, a trick that requires coaching their media partners on how to excise security bugs from the most complex portion of their sites. Once they have what the OpenLeaks engineer calls that “armored car” version of the partner sites set up, they plan to go even further than WikiLeaks, building more convincing cover traffic than has ever existed before, this unnamed engineer tells me. They’ve statistically modeled the timing and file size of uploads to WikiLeaks and have used it to spoof those submissions with high statistical accuracy. Most submissions to WikiLeaks were between 1.5 and 2 megabytes, for instance. Less than one percent are above 700 megabytes. Their cover traffic aims to follow exactly the same bell curve, making it theoretically indistinguishable from real submissions under the cover of SSL encryption, even when the user isn’t running Tor.

pages: 320 words: 87,853

The Black Box Society: The Secret Algorithms That Control Money and Information
by Frank Pasquale
Published 17 Nov 2014

It might seem risky to give any one household a loan; the breadwinner might fall ill, they might declare bankruptcy, they may hit the lottery and pay off the loan tomorrow (denying the investor a steady stream of interest payments). It’s hard to predict what will happen to any given family. But statistical models can much better predict the likelihood of defaults happening in, say, a group of 1,000 families. They “know” that, in the data used, rarely do, say, more than thirty in a 1,000 borrowers default. This statistical analysis, programmed in proprietary software, was one “green light” for massive investments in the mortgage market.21 That sounds simple, but as finance automation took off, such deals tended to get hedged around by contingencies, for instance about possible refinancings or defaults.
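
The pooling logic described above (hard to predict one household, much easier to bound a group of 1,000) can be illustrated with a toy binomial calculation; the 2 percent per-borrower default probability and the independence assumption are purely illustrative:

    from scipy.stats import binom

    n, p = 1000, 0.02  # 1,000 borrowers, assumed 2% chance each defaults

    # Probability that more than 30 of the 1,000 borrowers default,
    # assuming (crucially) that defaults are independent of one another.
    print(binom.sf(30, n, p))  # roughly 1%

The sketch shows only the mechanics: the tail probability looks small exactly as long as the independence assumption holds.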

pages: 415 words: 125,089

Against the Gods: The Remarkable Story of Risk
by Peter L. Bernstein
Published 23 Aug 1996

Richard Thaler started thinking about these problems in the early 1970s, while working on his doctoral dissertation at the University of Rochester, an institution known for its emphasis on rational theory. His subject was the value of a human life, and he was trying to prove that the correct measure of that value is the amount people would be willing to pay to save a life. After studying risky occupations like mining and logging, he decided to take a break from the demanding statistical modeling he was doing and began to ask people what value they would put on their own lives. He started by asking two questions. First, how much would you be willing to pay to eliminate a one-in-a-thousand chance of immediate death? And how much would you have to be paid to accept a one-in-a-thousand chance of immediate death?

pages: 391 words: 123,597

Targeted: The Cambridge Analytica Whistleblower's Inside Story of How Big Data, Trump, and Facebook Broke Democracy and How It Can Happen Again
by Brittany Kaiser
Published 21 Oct 2019

Rospars and his team at Blue State described themselves as pioneers who understood that “people don’t just vote on Election Day—they vote every day with their wallets, with their time, with their clicks and posts and tweets.”2 Other senior-level members of the Obama for America analytics team founded BlueLabs in 2013.3 Daniel Porter had been director of statistical modeling on the 2012 campaign, “the first in the history of presidential politics to use persuasion modeling” to identify swing voters. Sophie Schmidt’s father, Eric, founded Civis in 2013, the same year that Sophie interned at CA. Civis’s mission was to “democratize data science so organizations can stop guessing and make decisions based on numbers and scientific fact.”

pages: 400 words: 121,988

Trading at the Speed of Light: How Ultrafast Algorithms Are Transforming Financial Markets
by Donald MacKenzie
Published 24 May 2021

As described in chapter 1, these are computer programs that investors can use to break up big orders into small parts and execute them automatically (Whitcomb interview 2). Instinet did not adopt Whitcomb’s suggestion. However, one of his former students, James Hawkes, who taught statistics at the College of Charleston, ran a small firm, Quant Systems, which sold software for statistical analysis. Whitcomb and Hawkes had earlier collaborated on a statistical model to predict the outcomes of horse races. Their equation displayed some predictive power, but because of bookmakers’ large “vigs,” or “takes” (the profits they earn by setting odds unfavorable to the gambler), it did not earn Hawkes and Whitcomb money (Whitcomb interviews 1 and 2). Hawkes, though, also traded stock options, and had installed a satellite dish on the roof of his garage to receive a share-price datafeed.

pages: 945 words: 292,893

Seveneves
by Neal Stephenson
Published 19 May 2015

Or would it split up into two or more distinct swarms that would try different things? Arguments could be made for all of the above scenarios and many more, depending on what actually happened in the Hard Rain. Since the Earth had never before been bombarded by a vast barrage of lunar fragments, there was no way to predict what it was going to be like. Statistical models had been occupying much of Doob’s time because they had a big influence on which scenarios might be most worth preparing for. To take a simplistic example, if the moon could be relied on to disassemble itself into pea-sized rocks, then the best strategy was to remain in place and not worry too much about maneuvering.

A clutter of faint noise and clouds on the optical telescope gave them data about the density of objects too small and numerous to resolve. All of it fed into the plan. Doob looked tired, and nodded off frequently, and hadn’t eaten a square meal since the last perigee, but he pulled himself together when he was needed and fed any new information into a statistical model, prepared long in advance, that would enable them to maximize their chances by ditching Amalthea and doing the big final burn at just the right times. But as he kept warning Ivy and Zeke, the time was coming soon when they would become so embroiled in the particulars of which rock was coming from which direction that it wouldn’t be a statistical exercise anymore.

pages: 503 words: 131,064

Liars and Outliers: How Security Holds Society Together
by Bruce Schneier
Published 14 Feb 2012

Majumdar (2006), “Two-Stage Credit Card Fraud Detection Using Sequence Alignment,” Information Systems Security, Lecture Notes in Computer Science, Springer-Verlag, 4332:260–75. predictive policing programs Martin B. Short, Maria R. D'Orsogna, Virginia B. Pasour, George E. Tita, P. Jeffrey Brantingham, Andrea L. Bertozzi, and Lincoln B. Chayes (2008), “A Statistical Model of Criminal Behavior,” Mathematical Models and Methods in Applied Sciences, 18 (Supplement):1249–67. Beth Pearsall (2010), “Predictive Policing: The Future of Law Enforcement?” NIJ Journal, 266:16–9. Nancy Murray (2011), “Profiling in the Age of Total Information Awareness,” Race & Class, 51:3–24.

pages: 484 words: 136,735

Capitalism 4.0: The Birth of a New Economy in the Aftermath of Crisis
by Anatole Kaletsky
Published 22 Jun 2010

Mandelbrot’s research program undermined most of the mathematical assumptions of modern portfolio theory, which is the basis for the conventional risk models used by regulators, credit-rating agencies, and unsophisticated financial institutions. Mandelbrot’s analysis, presented to nonspecialist readers in his 2004 book (Mis)behavior of Markets, shows with mathematical certainty that these standard statistical models based on neoclassical definitions of efficient markets and rational expectations among investors cannot be true. Had these models been valid, events such as the 1987 stock market crash and the 1998 hedge fund crisis would not have occurred even once in the fifteen billion years since the creation of the universe.9 In fact, four such extreme events occurred in just two weeks after the Lehman bankruptcy.

pages: 419 words: 130,627

Last Man Standing: The Ascent of Jamie Dimon and JPMorgan Chase
by Duff McDonald
Published 5 Oct 2009

Ralph Cioffi of Bear Stearns wasn’t the only one putting his equity at risk by loading up on debt; all of Wall Street was in on the scheme. Warren Buffett thinks Dimon separated himself from the pack by relying on his own judgment and not becoming slave to the software that tried to simplify all of banking into a mathematical equation. “Too many people overemphasize the power of these statistical models,” he says. “But not Jamie. The CEO of any of these firms has to be the chief risk officer. At Berkshire Hathaway, it’s my number one job. I have to be correlating the chance of an earthquake in California not only causing a big insurance loss, but also the effect on Wells Fargo’s earnings, or the availability of money tomorrow.

pages: 349 words: 134,041

Traders, Guns & Money: Knowns and Unknowns in the Dazzling World of Derivatives
by Satyajit Das
Published 15 Nov 2006

The back office has a large, diverse cast. Risk managers are employed to ensure that the risk taken by traders is within specified limits. They ensure that the firm does not self-destruct as a result of some trader betting the bank on the correlation between the lunar cycle and the $/yen exchange rate. Risk managers use elaborate statistical models to keep tabs on the traders. Like double and triple agents, risk managers spy on the traders, each other and even themselves. Lawyers are employed to ensure that hopefully legally binding contracts are signed. Compliance officers ensure that the firm does not break any laws or at least is not caught breaking any laws.

pages: 486 words: 132,784

Inventors at Work: The Minds and Motivation Behind Modern Inventions
by Brett Stern
Published 14 Oct 2012

When we were doing our work then, computers really weren’t around. They were in the university. The math statistics group at Corning had a big IBM mainframe computer that could tackle really difficult problems, but modeling capabilities just didn’t exist. I had grown up with computers that were basically doing the statistical modeling of molecular spectrum. So, I was reasonably familiar with doing this and eventually got the first computer in the lab. I was actually taking data off the optical bench that was in my lab directly into a computer. Would I have used 3D modeling if the capability had existed then? Sure, you use whatever tool is available to you.

pages: 500 words: 145,005

Misbehaving: The Making of Behavioral Economics
by Richard H. Thaler
Published 10 May 2015

Yet, although he received a Nobel Prize in economics, unfortunately I think it is fair to say that he had little impact on the economics profession.* I believe many economists ignored Simon because it was too easy to brush aside bounded rationality as a “true but unimportant” concept. Economists were fine with the idea that their models were imprecise and that the predictions of those models would contain error. In the statistical models used by economists, this is handled simply by adding what is called an “error” term to the equation. Suppose you try to predict the height that a child will reach at adulthood using the height of both parents as predictors. This model will do a decent job since tall parents tend to have tall children, but the model will not be perfectly accurate, which is what the error term is meant to capture.
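
The height example translates directly into the regression equation with an explicit error term. A minimal simulation sketch (the coefficients and spreads below are invented, not Thaler's):

    import numpy as np

    rng = np.random.default_rng(42)
    n = 500

    # Simulated mid-parent heights (cm) and children's adult heights.
    parent_height = rng.normal(172, 7, size=n)
    error = rng.normal(0, 5, size=n)          # everything the model misses
    child_height = 25 + 0.85 * parent_height + error

    # Fit child_height = b0 + b1 * parent_height + error.
    b1, b0 = np.polyfit(parent_height, child_height, 1)
    print(f"intercept = {b0:.1f}, slope = {b1:.2f}")
    # The fitted slope lands near 0.85, but individual predictions still
    # miss by roughly the 5 cm standard deviation of the error term.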

Making Globalization Work
by Joseph E. Stiglitz
Published 16 Sep 2006

A claims board could establish, for instance, the magnitude of the damage suffered by each individual and provide compensation on that basis. A separate tribunal could establish the extent of the corporation’s culpability, whether it took actions which caused harm—say, as a result of inappropriate environmental policies—and then assess, using a statistical model, appropriate penalties. Additional punitive damages might be assessed to provide further deterrence or in response to particularly outrageous behavior.

Chapter Eight

1. The ruble fell from R6.28 to the dollar before the crisis to R23 to the dollar in January 1999. 2. Argentina abandoned its long-standing foreign exchange regime, in which the peso was convertible to the dollar on a one-to-one basis, in December 2001.

The Trade Lifecycle: Behind the Scenes of the Trading Process (The Wiley Finance Series)
by Robert P. Baker
Published 4 Oct 2015

Then in our fixed bond we have the EUR cashflows as 0.97951 × 0.05 × 10,000,000 × 0.8086 = 396,015.9 and 0.95983 × 1.05 × 10,000,000 × 0.7904 = 7,965,821 giving an NPV of EUR 8,361,837. Unknown cashflows In many cases cashflows are not known with certainty. An option is an example of a product that has an unknown cashflow. To value the option we have to use some sort of statistical model that predicts the likely price of the underlying instrument on the exercise date and from there we can calculate the value of the option. Let’s examine a simple case where the underlying price could only be one of a discrete set of possibilities. Suppose Table 26.5 shows an option is struck at 0.9 and the probability of certain prices.
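
The bond arithmetic in the passage is simply a sum of discounted cashflows, and it reproduces exactly in a few lines; the figures are copied from the text above, and the code does not try to label which factor is the discount factor and which is the conversion rate:

    # (factor1, coupon/principal fraction, notional, factor2), per the text above
    cashflows = [
        (0.97951, 0.05, 10_000_000, 0.8086),  # first coupon
        (0.95983, 1.05, 10_000_000, 0.7904),  # final coupon plus principal
    ]

    npv = sum(f1 * frac * notional * f2 for f1, frac, notional, f2 in cashflows)
    print(f"NPV = EUR {npv:,.0f}")  # EUR 8,361,837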

How I Became a Quant: Insights From 25 of Wall Street's Elite
by Richard R. Lindsey and Barry Schachter
Published 30 Jun 2007

At his invitation, we presented our findings on complexity and disentangling at the CFA Institute’s 1988 conference on continuing education. We also later presented them to the Institute for Quantitative Research in Finance (“Q Group”).

Integrating the Investment Process

Our research laid the groundwork for our investment approach. Statistical modeling and disentangling of a wide range of stocks and numerous fundamental, behavioral, and economic factors results in a multidimensional security selection system capable of maximizing the number of insights that can be exploited while capturing the intricacies of stock price behavior. This, in turn, allows for construction of portfolios that can achieve consistency of performance through numerous exposures to a large number of precisely defined profit opportunities.

pages: 303 words: 67,891

Advances in Artificial General Intelligence: Concepts, Architectures and Algorithms: Proceedings of the Agi Workshop 2006
by Ben Goertzel and Pei Wang
Published 1 Jan 2007

Commons and colleagues have also proposed a task-based model which provides a framework for explaining stage discrepancies across tasks and for generating new stages based on classification of observed logical behaviors. [32] promotes a statistical conception of stage, which provides a good bridge between task-based and stage-based models of development, as statistical modeling allows for stages to be roughly defined and analyzed based on collections of task behaviors. [29] postulates the existence of a postformal stage by observing elevated levels of abstraction which, they argue, are not manifested in formal thought. [33] observes a postformal stage when subjects become capable of analyzing and coordinating complex logical systems with each other, creating metatheoretical supersystems.

The Science of Language
by Noam Chomsky
Published 24 Feb 2012

Chapter 2 Page 20, On biology as more than selectional evolution Kauffman, D’Arcy Thompson, and Turing (in his work on morphogenesis) all emphasize that there is a lot more to evolution and development than can be explained by Darwinian (or neo-Darwinian) selection. (In fact, Darwin himself acknowledged as much, although this is often forgotten.) Each uses mathematics in studying biological systems in different ways. Some of Kauffman's more surprising suggestions concern self-organizing systems and the use of statistical modeling in trying to get a grip on how timing of gene protein expression can influence cell specialization during growth. Page 22, On Plato's Problem and its explanation The term “I-language” is explained – along with “I-belief” and “I-concept” – in Appendix I. For discussion, see Chomsky (1986) and (2000).

The Book of Why: The New Science of Cause and Effect
by Judea Pearl and Dana Mackenzie
Published 1 Mar 2018

John Snow, the Broad Street pump, and modern epidemiology. International Journal of Epidemiology 12: 393–396. Cox, D., and Wermuth, N. (2015). Design and interpretation of studies: Relevant concepts from the past and some extensions. Observational Studies 1. Available at: https://arxiv.org/pdf/1505.02452.pdf. Freedman, D. (2010). Statistical Models and Causal Inference: A Dialogue with the Social Sciences. Cambridge University Press, New York, NY. Glynn, A., and Kashin, K. (2018). Front-door versus back-door adjustment with unmeasured confounding: Bias formulas for front-door and hybrid adjustments. Journal of the American Statistical Association.

pages: 475 words: 134,707

The Hype Machine: How Social Media Disrupts Our Elections, Our Economy, and Our Health--And How We Must Adapt
by Sinan Aral
Published 14 Sep 2020

In the fall of 2001, while Mark Zuckerberg was still in high school at Phillips Exeter Academy, three years before he founded Facebook at Harvard, I was a PhD student down the street at MIT, sitting in the reading room at Dewey Library studying for two very different classes: Econometrics I, taught by the world-renowned statistician Jerry Hausman, and The Sociology of Strategy, taught by the then-rising-star sociologist Ezra Zuckerman, who is now the dean of faculty at MIT’s Sloan School of Management. Ezra’s class was heavily focused on social networks, while Jerry’s class introduced us to “BLUE” estimators—the theory of what generates the best linear unbiased statistical models. I had my statistics textbook in one hand and a stack of papers on networks in the other. As I read the statistics text, I saw that it repeated one main assumption of classical statistics over and over again—the assumption that all the observations in the data we were analyzing (the people, firms, or countries) were “independent and identically distributed (or IID).”

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps
by Valliappa Lakshmanan , Sara Robinson and Michael Munn
Published 31 Oct 2020

Technically, a 2-element feature vector is enough to provide a unique mapping for a vocabulary of size 3:

Categorical input    Numeric feature
English              [0.0, 0.0]
Chinese              [1.0, 0.0]
German               [0.0, 1.0]

This is called dummy coding. Because dummy coding is a more compact representation, it is preferred in statistical models that perform better when the inputs are linearly independent. Modern machine learning algorithms, though, don’t require their inputs to be linearly independent and use methods such as L1 regularization to prune redundant inputs. The additional degree of freedom allows the framework to transparently handle a missing input in production as all zeros:

Categorical input    Numeric feature
English              [1.0, 0.0, 0.0]
Chinese              [0.0, 1.0, 0.0]
German               [0.0, 0.0, 1.0]
(missing)            [0.0, 0.0, 0.0]

Therefore, many machine learning frameworks often support only one-hot encoding.
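
A minimal sketch of the two encodings just described, in plain Python rather than any particular framework's API:

    vocab = ["English", "Chinese", "German"]

    def one_hot(value):
        """Length-3 vector with a 1.0 at the matching category;
        an unknown or missing value maps to all zeros."""
        return [1.0 if value == v else 0.0 for v in vocab]

    def dummy(value):
        """Length-2 vector: the first category is the all-zeros baseline."""
        return [1.0 if value == v else 0.0 for v in vocab[1:]]

    print(one_hot("Chinese"))  # [0.0, 1.0, 0.0]
    print(one_hot("Spanish"))  # [0.0, 0.0, 0.0]  (treated as missing)
    print(dummy("English"))    # [0.0, 0.0]
    print(dummy("German"))     # [0.0, 1.0]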

A Dominant Character
by Samanth Subramanian
Published 27 Apr 2020

His work spoke to their deepest instinct for self-preservation. Haldane himself wasn’t a scientist in this mold. He published often and widely. In the 1930s alone, he wrote a paper on the link between quantum mechanics and philosophy, another on an economic theory of price fluctuations, several on statistical models, and a paper each on the cosmology of space-time and the future of warfare. (After the zoologist Karl von Frisch showed that bees communicate with each other through intricate dances, Haldane recalled Aristotle’s description of bee waggles and, in the Journal of Hellenic Studies, inspected it in the light of modern science.

pages: 502 words: 132,062

Ways of Being: Beyond Human Intelligence
by James Bridle
Published 6 Apr 2022

Because of the different ways that different animals react to natural phenomena, according to their size, speed and species, the ICARUS team found it necessary to use particularly complex forms of analysis to pick up on the differences in the data generated from different tags at different times – a welter of subtle and subtly variable signals. To do this, they turned to statistical models developed for financial econometrics: software designed to generate wealth by picking up on subtle signals in stock markets and investment patterns. I like to think of this as a kind of rehabilitation: penitent banking algorithms retiring from the City to start a new life in the countryside, and helping to remediate the Earth.

Advanced Software Testing—Vol. 3, 2nd Edition
by Jamie L. Mitchell and Rex Black
Published 15 Feb 2015

(For example, an ATM could be tested this way.) The tests might be selected from a pool of different tests, selected randomly, which would work if the variation between tests were something that could be predicted or limited. (For example, an e-commerce system could be tested this way.) The test can be generated on the fly, using some statistical model (called stochastic testing). For example, telephone switches are tested this way, because the variability in called number, call duration, and the like is very large. The test data can also be randomly generated, sometimes according to a model. In addition, standard tools and scripting techniques exist for reliability testing.
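
A toy version of that kind of on-the-fly, model-driven test generation for a telephone switch might look like the sketch below; the distributions, field names, and parameters are invented for illustration:

    import random

    random.seed(7)

    def generate_call():
        """Draw one synthetic test case from a simple statistical model."""
        return {
            "called_number": "".join(random.choice("0123456789") for _ in range(10)),
            # Call durations modelled as exponential: many short calls, few long ones.
            "duration_s": round(random.expovariate(1 / 180), 1),  # mean 3 minutes
            "call_type": random.choices(["voice", "fax", "data"],
                                        weights=[0.9, 0.02, 0.08])[0],
        }

    # Generate a small batch of test cases on the fly.
    for _ in range(3):
        print(generate_call())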

pages: 473 words: 154,182

Moby-Duck: The True Story of 28,800 Bath Toys Lost at Sea and of the Beachcombers, Oceanographers, Environmentalists, and Fools, Including the Author, Who Went in Search of Them
by Donovan Hohn
Published 1 Jan 2010

As it collides with the continental shelf and then with the freshwater gushing out of the rainforests of the coastal mountains, and then with the coast, the North Pacific Drift loses its coherence, crazies, sends out fractal meanders and eddies and tendrils that tease the four voyagers apart. We don’t know for certain what happens next, but statistical models suggest that at least one of the four voyagers I’m imagining—the frog, let’s pretend—will turn south, carried by an eddy or a meander into the California Current, which will likely deliver it, after many months, into the North Pacific Subtropical Gyre. You may now forget about the frog. We already know its story—how, as it disintegrates, it will contribute a few tablespoons of plastic to the Garbage Patch, or to Hawaii’s Plastic Beach, or to the dinner of an albatross, or to a sample collected in the codpiece of Charlie Moore’s manta trawl.

pages: 470 words: 144,455

Secrets and Lies: Digital Security in a Networked World
by Bruce Schneier
Published 1 Jan 2000

Just as antivirus software needs to be constantly updated with new signatures, this type of IDS needs a constantly updated database of attack signatures. It’s unclear whether such a database can ever keep up with the hacker tools. The other IDS paradigm is anomaly detection. The IDS does some statistical modeling of your network and figures out what is normal. Then, if anything abnormal happens, it sounds an alarm. This kind of thing can be done with rules (the system knows what’s normal and flags anything else), statistics (the system figures out statistically what’s normal and flags anything else), or with artificial-intelligence techniques.
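A minimal sketch of the statistical variant of anomaly detection described above, in Python (the metric, history, and threshold are illustrative assumptions, not from the book): learn what is "normal" as a mean and standard deviation, then flag anything that strays too far from it.

```python
import statistics

def build_baseline(history):
    """Learn 'normal' for a metric from past observations."""
    return statistics.mean(history), statistics.stdev(history)

def is_anomalous(value, baseline, threshold=3.0):
    """Flag anything more than `threshold` standard deviations from the mean."""
    mean, stdev = baseline
    return abs(value - mean) > threshold * stdev

# e.g. failed logins per minute observed during a quiet week
history = [4, 6, 5, 7, 5, 6, 4, 5, 6, 5]
baseline = build_baseline(history)
print(is_anomalous(5, baseline))    # False -> looks normal
print(is_anomalous(60, baseline))   # True  -> sound the alarm
```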

Beginning R: The Statistical Programming Language
by Mark Gardener
Published 13 Jun 2012

The bats data yielded a significant interaction term in the two-way ANOVA. Look at this further. Make a graphic of the data and then follow up with a post-hoc analysis. Draw a graph of the interaction.

What You Learned in This Chapter

Topic: Formula syntax (response ~ predictor). Key points: The formula syntax enables you to specify complex statistical models. Usually the response variables go on the left and predictor variables go on the right. The syntax can also be used in more simple situations and for graphics.

Topic: Stacking samples (stack()). Key points: In more complex analyses, the data need to be in a layout where each column is a separate item; that is, a column for the response variable and a column for each predictor variable.
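The response ~ predictor idiom is not unique to R. As a rough Python analogue (using statsmodels, which borrows R's formula syntax; the data frame below is invented and simply stands in for the bats example), the same one-line model specification looks like this:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Invented data in the stacked layout: one column per variable.
df = pd.DataFrame({
    "response":  [4.1, 5.2, 6.3, 7.9, 3.8, 5.0],
    "predictor": ["a", "a", "b", "b", "c", "c"],
})

# R-style formula: response on the left, predictor(s) on the right.
model = smf.ols("response ~ predictor", data=df).fit()
print(model.summary())
```

In the same spirit, pandas' melt() plays a role similar to R's stack(), reshaping a wide table into the long layout with one column per variable.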

pages: 570 words: 158,139

Overbooked: The Exploding Business of Travel and Tourism
by Elizabeth Becker
Published 16 Apr 2013

That was the data that was missing. If the new council could measure how much money tourists spent, the industry would know how much it contributed to national economies as well as the global marketplace. From there they could begin flexing their muscles. The WTTC teamed with the Wharton School to produce a statistical model that a region or country could use to measure income from tourism. The statisticians defined the industry by categories: accommodation services; food and beverage services; passenger transport; travel agencies, tour operators and tourist guide services; cultural services; recreation and other entertainment services and a final miscellaneous category that included financial and insurance services.

pages: 560 words: 158,238

Fifty Degrees Below
by Kim Stanley Robinson
Published 25 Oct 2005

So actually, to have the idea of something broached without any subsequent repercussion is actually a kind of, what. A kind of inoculation for an event you don’t want investigated.” “Jesus. So how does it work, do you know?” “Not the technical details, no. I know they target certain counties in swing states. They use various statistical models and decision-tree algorithms to pick which ones, and how much to intervene.” “I’d like to see this algorithm.” “Yes, I thought you might.” She reached into her purse, pulled out a data disk in a paper sleeve. She handed it to him. “This is it.” “Whoah,” Frank said, staring at it. “And so . . .

pages: 444 words: 151,136

Endless Money: The Moral Hazards of Socialism
by William Baker and Addison Wiggin
Published 2 Nov 2009

Being hedged, the fund loses its correlation with the overall market. But due to the mathematics, its sensitivity to the change in covariance of its positions is magnified fourfold. Probably half of all statistical arbitrage funds that deployed this strategy have moved on to greener pastures. But the use of value-at-risk statistical models to control exposure in hedge funds or even for large pension funds that allocate between different asset types continues, and it is virtually a mandatory exercise for institutional managers. There is hardly a large pension plan that has not developed a PowerPoint presentation that boasts it realigned its investments to increase excess return (alpha) and also reduced risk (variance).

pages: 589 words: 147,053

The Age of Em: Work, Love and Life When Robots Rule the Earth
by Robin Hanson
Published 31 Mar 2016

Such a database would hardly be possible if the differing jobs within each of these 974 categories were not very similar. In fact, a factor analysis of 226 of these descriptors finds that the top four factors account for 75% of the variance in these descriptors, and the top 15 factors account for 91% of this variance (Lee 2011). Also, statistical models to predict the income and performance of workers usually have at most only a few dozen parameters. These analyses have mostly been about post-skill types, that is, about how workers differ after they have been trained to do particular tasks. Pre-skill types should vary even less than do post-skill types.

pages: 582 words: 160,693

The Sovereign Individual: How to Survive and Thrive During the Collapse of the Welfare State
by James Dale Davidson and William Rees-Mogg
Published 3 Feb 1997

And by no means, however, are all of Morris's fingers pointed at Bill Clinton. His wife comes in for some critical attention as well. For example, consider this excerpt from Morris's account of Hillary Clinton's miraculous commodity trading: "In 1995 economists at Auburn and North Florida Universities ran a sophisticated computer statistical model of the First Lady's trades for publication in the Journal of Economics and Statistics, using all the available records as well as market data from the Wall Street Journal. The probability of Hillary Rodham's having made her trades legitimately, they calculated, was less than one in 250,000,000." 22 Morris musters many incriminating details about the drug-running and money-laundering operation that prospered in Arkansas under Clinton.

pages: 595 words: 143,394

Rigged: How the Media, Big Tech, and the Democrats Seized Our Elections
by Mollie Hemingway
Published 11 Oct 2021

Getting them to the voting booth seemed like a comparatively easy task. Facebook data was just the starting point. In 2012, Sasha Issenberg wrote The Victory Lab: The Secret Science of Winning Campaigns, which discusses in detail “cutting edge persuasion experiments, innovative ways to mobilize voters, and statistical models predicting the behavior of every voter in the country.”17 It soon became clear that campaigns were engaged in much more concerning behavior than merely adapting to the smartphone era. The Obama campaign released an app to the public to help Obama volunteers canvass their neighborhoods. The app contained detailed and intimate information about people’s political tendencies, such as partisan affiliation.

pages: 553 words: 153,028

The Vortex: A True Story of History's Deadliest Storm, an Unspeakable War, and Liberation
by Scott Carney and Jason Miklian
Published 28 Mar 2022

He adapted storm-surge models into NHC hurricane forecasts for the first time, and designed a system to simply and directly inform the public what was coming, how seriously they should take it, and what to do. Frank’s work formed the foundation of United States hurricane action plans that have warned the American people for over fifty years. Frank helped move the NHC from what seemed like an alchemy-based organization to a hard-science paradise. They used statistical modeling in hurricane tracking for the first time, and bought a secret weapon: a state-of-the-art mainframe with a brand-new terminal interface. The eight-thousand-pound machine took up an entire room, and the quantum leap in computational power meant that the NHC could forecast forty-eight or even seventy-two hours out instead of just twelve.

Globalists: The End of Empire and the Birth of Neoliberalism
by Quinn Slobodian
Published 16 Mar 2018

The four-person team eventually expanded. New members include two other experts and active League economists—Meade, an architect of GATT who had also played a key role in formulating Britain’s postwar full-employment policies, and the Dutch econometrician Jan Tinbergen, who created the first macroeconomic statistical model of a national economy while at the League. They were joined by Roberto Campos, a Brazilian economist who had been one of his nation’s delegates at Bretton Woods and the head of the Brazilian Development Bank, whose U.S.-friendly policies had earned him the nickname “Bob Fields.”95 Another former League economist, Hans Staehle, had helped assemble the group.

Sorting Things Out: Classification and Its Consequences (Inside Technology)
by Geoffrey C. Bowker
Published 24 Aug 2000

As noted in the case of New Zealand above, its need for information is effectively infinite. Below, for example, is a wish list from 1985 for a national medical information system in the United States: The system must capture more data than just the names of lesions and diseases and the therapeutic procedures used to correct them to meet these needs. In a statistical model proposed by Kerr White, all factors affecting health are incorporated: genetic and biological; environmental, behavioral, psychological, and social conditions which precipitate health problems; complaints, symptoms, and diseases which prompt people to seek medical care; and evaluation of severity and functional capacity, including impairment and handicaps.

pages: 636 words: 140,406

The Case Against Education: Why the Education System Is a Waste of Time and Money
by Bryan Caplan
Published 16 Jan 2018

The neglected master’s. Evidence on the master’s degree is sparse. Estimates of the sheepskin effect are scarce and vary widely, so I stipulate that the master’s sheepskin breakdown matches the bachelor’s. Completion rates for the master’s are lower than the bachelor’s. But I failed to locate any statistical models that estimate how master’s completion varies by prior academic performance. While broad outlines are not in doubt, I also located no solid evidence on how, correcting for student ability, the master’s payoff varies by discipline. Sins of omission. To keep my write-up manageable, I gloss over three major credentials: the associate degree, the professional degree, and the Ph.D.

pages: 543 words: 153,550

Model Thinker: What You Need to Know to Make Data Work for You
by Scott E. Page
Published 27 Nov 2018

A successful auction design had to be immune to strategic manipulation, generate efficient outcomes, and be comprehensible to participants. The economists used game theory models to analyze whether features could be exploited by strategic bidders, computer simulation models to compare the efficiency of various designs, and statistical models to choose parameters for experiments with real people. The final design, a multiple-round auction that allowed participants to back out of bids and prohibited sitting out early periods to mask intentions, proved successful. Over the past thirty years, the FCC has raised nearly $60 billion using this type of auction.10 REDCAPE: Communicate By creating a common representation, models improve communication.

pages: 598 words: 150,801

Snakes and Ladders: The Great British Social Mobility Myth
by Selina Todd
Published 11 Feb 2021

Most significantly, upward mobility rose dramatically after the Second World War when all these countries increased room at the top, by investing public money in job creation and welfare measures like free education.5 Britain did not offer fewer opportunities to be upwardly mobile than societies that are popularly assumed to be less class-bound, such as the United States. Since the 1980s, upward mobility has declined in both Britain and the USA, due to the destruction of many secure, reasonably well-paid jobs and the decimation of welfare provision and social security.6 My focus on Britain reflects the aim of this book. This is not to construct a statistical ‘model’ of social mobility that enables measurements of and between large populations – as many valuable studies have already done. Rather, I explore the historically specific circumstances that made it possible and desirable for some people to climb the ladder, and caused others to slide down it.

pages: 661 words: 156,009

Your Computer Is on Fire
by Thomas S. Mullaney , Benjamin Peters , Mar Hicks and Kavita Philip
Published 9 Mar 2021

It’s important, dare I say imperative, that policy makers think through the implications of what it will mean when this kind of predictive software is embedded in decision-making robotics, like artificial police officers or military personnel that will be programmed to make potentially life-or-death decisions based on statistical modeling and a recognition of certain patterns or behaviors in targeted populations. Crawford and Shultz warn that the use of predictive modeling through gathering data on the public also poses a serious threat to privacy; they argue for new frameworks of “data due process” that would allow individuals a right to appeal the use of their data profiles.17 This could include where a person moves about, how one is captured in modeling technologies, and use of surveillance data for use in big data projects for behavioral predictive modeling such as in predictive policing software: Moreover, the predictions that these policing algorithms make—that particular geographic areas are more likely to have crime—will surely produce more arrests in those areas by directing police to patrol them.

pages: 543 words: 157,991

All the Devils Are Here
by Bethany McLean
Published 19 Oct 2010

Merrill did a number of these deals with Magnetar. The performance of these CDOs can be summed up in one word: horrible. The essence of the ProPublica allegation is that Magnetar, like Paulson, was betting that “its” CDOs would implode. Magnetar denies that this was its intent and claims that its strategy was based on a “mathematical statistical model.” The firm says it would have done well regardless of the direction of the market. It almost doesn’t matter. The triple-As did blow up. You didn’t have to be John Paulson, picking out the securities you were then going to short, to make a fortune in this trade. Given that the CDOs referenced poorly underwritten subprime mortgages, they had to blow up, almost by definition.

pages: 512 words: 165,704

Traffic: Why We Drive the Way We Do (And What It Says About Us)
by Tom Vanderbilt
Published 28 Jul 2008

Michael Schreckenberg, the German physicist known as the “jam professor,” has worked with officials in North Rhine–Westphalia in Germany to provide real-time information, as well as “predictive” traffic forecasts. Like Inrix, if less extensively, they have assembled some 360,000 “fundamental diagrams,” or precise statistical models of the flow behavior of highway sections. They have a good idea of what happens on not only a “normal” day but on all the strange variations: weeks when a holiday falls on Wednesday, the first day there is ice on the road (most people, he notes, will not have yet put on winter tires), the first day of daylight savings time, when a normally light morning trip may occur in the dark.

pages: 574 words: 164,509

Superintelligence: Paths, Dangers, Strategies
by Nick Bostrom
Published 3 Jun 2014

Optical character recognition of handwritten and typewritten text is routinely used in applications such as mail sorting and digitization of old documents.66 Machine translation remains imperfect but is good enough for many applications. Early systems used the GOFAI approach of hand-coded grammars that had to be developed by skilled linguists from the ground up for each language. Newer systems use statistical machine learning techniques that automatically build statistical models from observed usage patterns. The machine infers the parameters for these models by analyzing bilingual corpora. This approach dispenses with linguists: the programmers building these systems need not even speak the languages they are working with.67 Face recognition has improved sufficiently in recent years that it is now used at automated border crossings in Europe and Australia.

Digital Accounting: The Effects of the Internet and Erp on Accounting
by Ashutosh Deshmukh
Published 13 Dec 2005

[Figure (business intelligence architecture): data extraction, transformation, and load feed a business information warehouse from the ERP system; business intelligence tools then serve reports (key performance measures, ad-hoc queries, business intelligence and OLAP metadata), OLAP analysis (business logic, mathematical/statistical models, data mining), executive and management dashboards, executive information systems, and pre-packaged solutions such as planning and budgeting, consolidations, financial analytics, ABC/ABM, balanced scorecard, and corporate performance management.] These tools were soon superseded by specialized report-writing tools and analytical tools, which have now evolved into a new category of Business Intelligence (BI) tools; Crystal Reports/Business Objects and Cognos are examples of leading software vendors in this area.

pages: 630 words: 174,171

Caliban's War
by James S. A. Corey
Published 6 Jun 2012

There was a massive upwelling of elemental iron in the northern hemisphere that lasted fourteen hours. There has also been a series of volcanic eruptions. Since the planet doesn’t have any tectonic motion, we’re assuming the protomolecule is doing something in the mantle, but we can’t tell what. The brains put together a statistical model that shows the approximate energy output expected for the changes we’ve seen. It suggests that the overall level of activity is rising about three hundred percent per year over the last eighteen months.” The secretary-general nodded, his expression grave. It was almost as if he’d understood any part of what she’d said.

pages: 559 words: 161,035

Class Warfare: Inside the Fight to Fix America's Schools
by Steven Brill
Published 15 Aug 2011

For years the academics had been doing this, but no one in the public knew anything about it. It was all a game of inside baseball.” Finally, they found an expert named Richard Buddin, who had published widely on the subject in peer-reviewed journals and was carefully vetted by the paper. Buddin built them their statistical model and analyzed the data. They paid him $50 an hour, in part with a grant from the Hechinger Institute at Columbia Teachers College. Between Buddin and other costs, the paper would end up spending about $50,000 on the project on top of the near-full-time salary for a year that the two Jasons worked on it, plus the time chipped in by other reporters, interns, and editors.

pages: 667 words: 186,968

The Great Influenza: The Story of the Deadliest Pandemic in History
by John M. Barry
Published 9 Feb 2004

The CDC based that range, however, on different estimates of the effectiveness and availability of a vaccine and of the age groups most vulnerable to the virus. It did not factor in the most important determinant of deaths: the lethality of the virus itself. The CDC simply figured virulence by computing an average from the last three pandemics, those in 1918, 1957, and 1968. Yet two of those three real pandemics fall outside the range of the statistical model. The 1968 pandemic was less lethal than the best case scenario, and the 1918 pandemic was more lethal than the worst case scenario. After adjusting for population growth, the 1918 virus killed four times as many as the CDC’s worst case scenario, and medical advances cannot now significantly mitigate the killing impact of a virus that lethal.

pages: 584 words: 187,436

More Money Than God: Hedge Funds and the Making of a New Elite
by Sebastian Mallaby
Published 9 Jun 2010

See also Steven Drobny, Inside the House of Money: Top Hedge Fund Traders on Profiting in the Global Markets, (Hoboken, NJ: John Wiley & Sons, 2006), p. 174. 16. Wadhwani recalls, “Often it was the case that you were already using the input variables these guys were talking about, but you were perhaps using these input variables in a more naive way in your statistical model than the way they were actually using it.” Wadhwani interview. 17. Mahmood Pradhan, who worked with Wadhwani at Tudor, elaborates: “There are times when particular variables explain certain asset prices, and there are times when other things determine the price. So you need to understand when your model is working and when it isn’t.

pages: 708 words: 176,708

The WikiLeaks Files: The World According to US Empire
by Wikileaks
Published 24 Aug 2015

More than this, however, the disruption to the old oligarchic rule represented by Allende, and the Pinochet regime’s relative autonomy from the business class, enabled the dictatorship to restructure industry in such a way as to displace the dominance of old mining and industrial capital. This was part of a global trend, as investors everywhere felt shackled by the old statist models of development. They demanded the reorganization of industry, the freeing up of the financial sector, and the opening of international markets. In place of the old economic model of “import substitution,” protecting and developing the nation’s industries to overcome dependence on imports, a new model of “export-led growth” was implemented, in which domestic consumption was suppressed so that goods could be more profitably exported abroad.103 The WikiLeaks documents, taken together with previous historical findings, show us a US government immensely relieved by the Pinochet coup, and desperate to work with the new regime.

pages: 651 words: 180,162

Antifragile: Things That Gain From Disorder
by Nassim Nicholas Taleb
Published 27 Nov 2012

Franklin, James, 2001, The Science of Conjecture: Evidence and Probability Before Pascal. Baltimore: Johns Hopkins University Press. Freedman, D. A., and D. B. Petitti, 2001, “Salt and Blood Pressure: Conventional Wisdom Reconsidered.” Evaluation Review 25(3): 267–287. Freedman, D., D. Collier, et al., 2010, Statistical Models and Causal Inference: A Dialogue with the Social Sciences. Cambridge: Cambridge University Press. Freeman, C., and L. Soete, 1997, The Economics of Industrial Innovation. London: Routledge. Freidson, Eliot, 1970, Profession of Medicine: A Study of the Sociology of Applied Knowledge. Chicago: University of Chicago Press.

pages: 687 words: 189,243

A Culture of Growth: The Origins of the Modern Economy
by Joel Mokyr
Published 8 Jan 2016

The obvious reason is that social knowledge depends on specialization, simply because the set of total knowledge is far too large for a single mind to comprehend. Complex social and physical processes are often impossible for laypersons to comprehend, yet the information may be essential to guide certain important behaviors. Subtle statistical models and sophisticated experimentation may be needed to discriminate between important hypotheses about, say, the effects of certain foods on human health or the causes of crime. Especially for propositional knowledge (the knowledge underpinning techniques in use), authorities and the division of knowledge are indispensable because such knowledge can operate effectively only if a fine subdivision of knowledge through specialization is practiced.

pages: 652 words: 172,428

Aftershocks: Pandemic Politics and the End of the Old International Order
by Colin Kahl and Thomas Wright
Published 23 Aug 2021

Research has consistently identified several indicators associated with higher levels of state fragility and civil strife, including poor health, low per capita income, economic vulnerability produced by dependence on oil and other natural resources, low levels of international trade, government discrimination, democratic backsliding, and instability in neighboring countries—and all of these were exacerbated by the pandemic. In July, for example, a group of conflict researchers at the University of Denver’s Korbel School of International Studies updated a statistical model of internal war to include the possible effects of COVID-19. Prior to the pandemic, their statistical simulation—which incorporated a wide array of human and social development indicators—predicted that the number of armed conflicts around the world would plateau or even decline starting in 2020 and continue on that path through the remainder of the decade.

pages: 685 words: 203,949

The Organized Mind: Thinking Straight in the Age of Information Overload
by Daniel J. Levitin
Published 18 Aug 2014

Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST), Mountain View, CA. Retrieved from http://www.pdl.cmu.edu/ftp/Failure/failure-fast07.pdf See also: He, Z., Yang, H., & Xie, M. (2012, October). Statistical modeling and analysis of hard disk drives (HDDs) failure. Institute of Electrical and Electronics Engineers APMRC, pp. 1–2. suffer a disk failure within two years Vishwanath, K. V., & Nagappan, N. (2010). Characterizing cloud computing hardware reliability. In Proceedings of the 1st ACM symposium on cloud computing.

pages: 701 words: 199,010

The Crisis of Crowding: Quant Copycats, Ugly Models, and the New Crash Normal
by Ludwig B. Chincarini
Published 29 Jul 2012

TABLE 15.4 Annualized Returns of Hedge Fund Strategies and Major Indices Notes 1. His stay as president of the large hedge fund Paloma Partners was short lived, and he eventually teamed up with LTCM alum Robert Shustak and the fund’s former controller, Bruce Wilson, to start Quantitative Alternatives LLC, in Rye Brook, New York. Their plan was to use statistical models for trading strategies much like those employed by LTCM. The fund never raised enough funds, and the three partners folded the operation at the end of 2008. Rosenfeld is now retired, but teaches part-time at MIT’s Sloan School of Management. 2. This phenomenon was discussed in Chapter 9. 3.

pages: 741 words: 199,502

Human Diversity: The Biology of Gender, Race, and Class
by Charles Murray
Published 28 Jan 2020

“Evolutionary Framework for Identifying Sex-and Species-Specific Vulnerabilities in Brain Development and Functions.” Journal of Neuroscience Research 95 (1–2): 355–61. Geddes, Patrick, and J. Arthur Thomson. 1889. The Evolution of Sex. New York: Humboldt Publishing. Gelman, Andrew. 2018. “You Need 16 Times the Sample Size to Estimate an Interaction Than to Estimate a Main Effect.” Statistical Modeling, Causal Inference, and Social Science (March 15). Geschwind, Norman, and Albert M. Galaburda. 1985. “Cerebral Lateralization, Biological Mechanisms, Associations, and Pathology: I. A Hypothesis and a Program for Research.” Archive of Neurology 42 (5): 428–59. Giedd, Jay N., Armin Raznahan, Aaron Alexander-Bloch et al. 2014.

pages: 691 words: 203,236

Whiteshift: Populism, Immigration and the Future of White Majorities
by Eric Kaufmann
Published 24 Oct 2018

Naturally there are exceptions like Brixton in London or Brooklyn, New York, where gentrification has taken place. This shows up as the line of dots on the left side of the American graph where there is a spike of places that were less than 10 per cent white in 2000 but had rapid white growth in the 2000s. Still, the overwhelming story, which the statistical models tell, is one in which whites are moving towards the most heavily white neighbourhoods. An identical pattern can be found in Stockholm neighbourhoods in the 1990s, and appears to hold within many American cities.31 We see it as well in urban British Columbia and Ontario, Canada, in figure 9.5.

Cultural Backlash: Trump, Brexit, and Authoritarian Populism
by Pippa Norris and Ronald Inglehart
Published 31 Dec 2018

New York: Palgrave Macmillan. Golder, Matthew. 2003. ‘Explaining variation in the success of extreme right parties in Western Europe.’ Comparative Political Studies, 36(4): 432–466. 2016. ‘Far right parties in Europe.’ Annual Review of Political Science, 19: 477–497. Goldstein, Harvey. 1995. Multilevel Statistical Models. 3rd Edn. New York: Halstead Press. Golsan, Richard J. Ed. 1995. Fascism’s Return: Scandal, Revision and Ideology since 1980. Lincoln, NE: University of Nebraska Press. Goodhart, David. 2017. The Road to Somewhere: The Populist Revolt and the Future of Politics. London: Hurst & Company. Goodwin, Matthew J. 2006.

pages: 1,409 words: 205,237

Architecting Modern Data Platforms: A Guide to Enterprise Hadoop at Scale
by Jan Kunigk , Ian Buss , Paul Wilkinson and Lars George
Published 8 Jan 2019

Another common challenge when transitioning analytics problems to the Hadoop realm is that analysts need to master the various data formats that are used in Hadoop, since, for example, the previously dominant model of cubing data is almost never used in Hadoop. Data scientists typically also need extensive experience with SQL as a tool to drill down into the datasets that they require to build statistical models, via SparkSQL, Hive, or Impala. Machine learning and deep learning Simply speaking, machine learning is where the rubber of big data analytics hits the road. While certainly a hyped term, machine learning goes beyond classic statistics, with more advanced algorithms that predict an outcome by learning from the data—often without explicitly being programmed.
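For illustration only (not code from the book), a small PySpark sketch of the SQL-style drill-down the passage mentions; the path, view, and column names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("drilldown").getOrCreate()

# Hypothetical dataset stored in a columnar format on the cluster.
events = spark.read.parquet("/data/events")
events.createOrReplaceTempView("events")

# Aggregate per-user features that could feed a statistical model.
features = spark.sql("""
    SELECT user_id,
           COUNT(*)            AS n_events,
           AVG(session_length) AS avg_session_length
    FROM events
    GROUP BY user_id
""")
features.show(5)
```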

pages: 1,294 words: 210,361

The Emperor of All Maladies: A Biography of Cancer
by Siddhartha Mukherjee
Published 16 Nov 2010

Whose victory was this—a victory of prevention or of therapeutic intervention?* Berry’s answer was a long-due emollient to a field beset by squabbles between the advocates of prevention and the proponents of chemotherapy. When Berry assessed the effect of each intervention independently using statistical models, it was a satisfying tie: both cancer prevention and chemotherapy had diminished breast cancer mortality equally—12 percent for mammography and 12 percent for chemotherapy, adding up to the observed 24 percent reduction in mortality. “No one,” as Berry said, paraphrasing the Bible, “had labored in vain.”

pages: 843 words: 223,858

The Rise of the Network Society
by Manuel Castells
Published 31 Aug 1996

What matters for our research purposes are two teachings from this fundamental experience of interrupted technological development: on the one hand, the state can be, and has been in history, in China and elsewhere, a leading force for technological innovation; on the other hand, precisely because of this, when the state reverses its interest in technological development, or becomes unable to perform it under new conditions, a statist model of innovation leads to stagnation, because of the sterilization of society’s autonomous innovative energy to create and apply technology. That the Chinese state could, centuries later, build anew an advanced technological basis, in nuclear technology, missiles, satellite launching, and electronics,13 demonstrates again the emptiness of a predominantly cultural interpretation of technological development and backwardness: the same culture may induce very different technological trajectories depending on the pattern of relationships between state and society.

Engineering Security
by Peter Gutmann

Purdue professor Gene Spafford thinks that this may have its origins in work done with a standalone US Department of Defence (DoD) mainframe system for which the administrators calculated that their mainframe could brute-force a password in x days and so a period slightly less than this was set as the password-change interval [79]. Like the ubiquitous “Kilroy was here” there are various other explanations floating around for the origins of this requirement, but in truth no-one really knows for sure where it came from. In fact the conclusion of the sole documented statistical modelling of password change, carried out in late 2006, is that changing passwords doesn’t really matter (the analysis takes a number of different variables into account rather than just someone’s estimate of what a DoD mainframe may have done in the 1960s, for the full details see the original article) [80]. Even if we don't know where the password-change requirement really originated, we do know the effect that it has.

This means that the chance of compromise for a certificate with a lifetime of one year is 0.002%. With a rather longer five-year lifetime it’s 0.01%, and with a ten-year lifetime it’s 0.02% (remember that this is a simplified model used to illustrate a point, since in practice it’s possible to argue endlessly over the sort of statistical model that you’d use for key compromise and we have next to no actual data on when actual key compromises occur since they’re so infrequent). In any case though, in those ten years of using the same key how many security holes and breaches do you think will be found in the web site that don’t involve the site’s private key?
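To make the simplified arithmetic above concrete, here is a tiny Python sketch; the 0.002%-per-year figure is just the rate implied by the passage's own example, not measured data, and the linear scaling is the same simplification the passage uses.

```python
# Simplified model from the passage: a fixed per-year chance of key compromise,
# scaled linearly with certificate lifetime. The rate is illustrative only.
PER_YEAR = 0.00002  # 0.002% per year

for years in (1, 5, 10):
    print(f"{years:2d}-year lifetime: {PER_YEAR * years:.4%} chance of compromise")
```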

pages: 753 words: 233,306

Collapse
by Jared Diamond
Published 25 Apr 2011

All eight of those variables make Easter susceptible to deforestation. Easter's volcanoes are of moderate age (probably 200,000 to 600,000 years); Easter's Poike Peninsula, its oldest volcano, was the first part of Easter to become deforested and exhibits the worst soil erosion today. Combining the effects of all those variables, Barry's and my statistical model predicted that Easter, Nihoa, and Necker should be the worst deforested Pacific islands. That agrees with what actually happened: Nihoa and Necker ended up with no human left alive and with only one tree species standing (Nihoa's palm), while Easter ended up with no tree species standing and with about 90% of its former population gone.

pages: 761 words: 231,902

The Singularity Is Near: When Humans Transcend Biology
by Ray Kurzweil
Published 14 Jul 2005

Franz Josef Och, a computer scientist at the University of Southern California, has developed a technique that can generate a new language-translation system between any pair of languages in a matter of hours or days.209 All he needs is a "Rosetta stone"—that is, text in one language and the translation of that text in the other language—although he needs millions of words of such translated text. Using a self-organizing technique, the system is able to develop its own statistical models of how text is translated from one language to the other and develops these models in both directions. This contrasts with other translation systems, in which linguists painstakingly code grammar rules with long lists of exceptions to each rule. Och's system recently received the highest score in a competition of translation systems conducted by the U.S.

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
by Martin Kleppmann
Published 17 Apr 2017

Stream processing is similar, but it extends operators to allow managed, fault-tolerant state (see “Rebuilding state after a failure” on page 478). The principle of deterministic functions with well-defined inputs and outputs is not only good for fault tolerance (see “Idempotence” on page 478), but also simplifies reasoning about the dataflows in an organization [7]. No matter whether the derived data is a search index, a statistical model, or a cache, it is helpful to think in terms of data pipelines that derive one thing from another, pushing state changes in one system through functional application code and applying the effects to derived systems. In principle, derived data systems could be maintained synchronously, just like a relational database updates secondary indexes synchronously within the same transaction as writes to the table being indexed.
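As a loose illustration of the pipeline idea described above (not code from the book; the changelog format and the toy index are invented), a deterministic function can derive a search index from a stream of state changes, so re-running it over the same log always reproduces the same derived state.

```python
def derive_index(changelog):
    """Deterministic derivation: the same changelog always yields the same index."""
    index = {}
    for event in changelog:
        for word in event["text"].lower().split():
            index.setdefault(word, set()).add(event["doc_id"])
    return index

changelog = [
    {"doc_id": 1, "text": "statistical models of language"},
    {"doc_id": 2, "text": "stream processing of models"},
]
print(derive_index(changelog)["models"])   # {1, 2}
```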

pages: 1,237 words: 227,370

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
by Martin Kleppmann
Published 16 Mar 2017

Stream processing is similar, but it extends operators to allow managed, fault-tolerant state (see “Rebuilding state after a failure”). The principle of deterministic functions with well-defined inputs and outputs is not only good for fault tolerance (see “Idempotence”), but also simplifies reasoning about the dataflows in an organization [7]. No matter whether the derived data is a search index, a statistical model, or a cache, it is helpful to think in terms of data pipelines that derive one thing from another, pushing state changes in one system through functional application code and applying the effects to derived systems. In principle, derived data systems could be maintained synchronously, just like a relational database updates secondary indexes synchronously within the same transaction as writes to the table being indexed.

pages: 1,072 words: 237,186

How to Survive a Pandemic
by Michael Greger, M.D., FACLM

Teenagers and health workers were found to be the prime violators of quarantine rules in Toronto during the SARS outbreak in 2003.805 Stating the obvious in regard to the difference between stopping the 1997 Hong Kong outbreak among chickens and stopping a human outbreak, experts have written, “Slaughter and quarantine of people is not an option.”806 “Even if it was possible to cordon off a city,” noted then Center for Biosecurity’s O’Toole, “that is not going to contain influenza.”807 Based on failed historical attempts along with contemporary statistical models,808 influenza experts are confident that efforts at quarantine “simply will not work.”809 Experts consider quarantine efforts “doomed to fail”810 because of the extreme contagiousness of influenza,811 which is a function of its incubation period and mode of transmission. SARS, in retrospect, was an easy virus to contain because people essentially became symptomatic before they became infectious.812 People showed signs of the disease before they could efficiently spread it, so tools like thermal image scanners at airports to detect fever or screening those with a cough could potentially stem the spread of the disease.813 The influenza virus, however, gets a head start.

pages: 801 words: 242,104

Collapse: How Societies Choose to Fail or Succeed
by Jared Diamond
Published 2 Jan 2008

All eight of those variables make Easter susceptible to deforestation. Easter’s volcanoes are of moderate age (probably 200,000 to 600,000 years); Easter’s Poike Peninsula, its oldest volcano, was the first part of Easter to become deforested and exhibits the worst soil erosion today. Combining the effects of all those variables, Barry’s and my statistical model predicted that Easter, Nihoa, and Necker should be the worst deforested Pacific islands. That agrees with what actually happened: Nihoa and Necker ended up with no human left alive and with only one tree species standing (Nihoa’s palm), while Easter ended up with no tree species standing and with about 90% of its former population gone.

pages: 944 words: 243,883

Private Empire: ExxonMobil and American Power
by Steve Coll
Published 30 Apr 2012

They would come from “all walks of life,” such as business, government, and the media, and they would be “aware of, and concerned about, the current debate and issues surrounding the world energy resources/use as well as climate change.” The ideal audience would be “open-minded,” as well as “information hungry” and “socially responsible.” The characteristics of the elites ExxonMobil sought to educate were derived in part from statistical modeling that Ken Cohen’s public affairs department had commissioned in the United States and Europe, to understand in greater depth the corporation’s reputation among opinion leaders. That model had allowed Cohen and his colleagues to forecast how elites would react to particular statements that ExxonMobil might make or actions it might take.

The Dawn of Everything: A New History of Humanity
by David Graeber and David Wengrow
Published 18 Oct 2021

Indeed, this complex subsector of the coast, between the Eel River and the mouth of the Columbia River, posed significant problems of classification for scholars seeking to delineate the boundaries of those culture areas, and the issue of their affiliation remains contentious today. See Kroeber 1939; Jorgensen 1980; Donald 2003. 45. The historicity of First Nations oral narratives concerning ancient migrations and wars on the Northwest Coast has been the subject of an innovative study which combines archaeology with the statistical modelling of demographic shifts that can be scientifically dated back to periods well over a millennium into the past. Its authors conclude that the ‘Indigenous oral record has now been subjected to extremely rigorous testing. Our result – that the [in this case] Tsimshian oral record is correct (properly not disproved) in its accounting of events from over 1,000 years ago – is a major milestone in the evaluation of the validity of Indigenous oral traditions.’

She Has Her Mother's Laugh
by Carl Zimmer
Published 29 May 2018

The only way out of that paradox is to join some of those forks back together. In other words, your ancestors must have all been related to each other, either closely or distantly. The geometry of this heredity has long fascinated mathematicians, and in 1999 a Yale mathematician named Joseph Chang created the first statistical model of it. He found that it has an astonishing property. If you go back far enough in the history of a human population, you reach a point in time when all the individuals who have any descendants among living people are ancestors of all living people. To appreciate how weird this is, think again about Charlemagne.
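Chang's result lends itself to a quick simulation. Here is a toy Python version; the population size, seed, and the choice of two random parents per person are my assumptions about the setup, scaled far below a real population. It stops at the point the passage describes: a generation in which everyone who has any living descendants is an ancestor of all living people.

```python
import random

random.seed(1)
N = 200  # individuals per generation (tiny compared with a real population)

# descendants[i] = set of present-day people descended from individual i
descendants = [{i} for i in range(N)]   # generation 0 = the present day

generation = 0
while True:
    generation += 1
    previous = [set() for _ in range(N)]
    for child in range(N):
        # Each person's two parents are drawn at random from the prior generation.
        for parent in random.sample(range(N), 2):
            previous[parent] |= descendants[child]
    descendants = previous
    # Stop when everyone either left no descendants or is an ancestor of everyone.
    if all(len(d) in (0, N) for d in descendants):
        print(f"~{generation} generations back: everyone with living "
              f"descendants is an ancestor of all {N} present-day people")
        break
```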

pages: 1,042 words: 273,092

The Silk Roads: A New History of the World
by Peter Frankopan
Published 26 Aug 2015

Europe even began to export in the opposite direction too, flooding the market in the Middle East and causing a painful contraction that stood in direct contrast to the invigorated economy to the west.71 As recent research based on skeletal remains in graveyards in London demonstrates, the rise in wealth led to better diets and to better general health. Indeed, statistical modelling based on these results even suggests that one of the effects of the plague was a substantial improvement in life expectancy. London’s post-plague population was considerably healthier than it had been before the Black Death struck – raising life expectancy sharply.72 Economic and social development did not occur evenly across Europe.

pages: 1,758 words: 342,766

Code Complete (Developer Best Practices)
by Steve McConnell
Published 8 Jun 2004

Chapter 5 of this book describes Humphrey's Probe method, which is a technique for estimating work at the individual developer level. Conte, S. D., H. E. Dunsmore, and V. Y. Shen. Software Engineering Metrics and Models. Menlo Park, CA: Benjamin/Cummings, 1986. Chapter 6 contains a good survey of estimation techniques, including a history of estimation, statistical models, theoretically based models, and composite models. The book also demonstrates the use of each estimation technique on a database of projects and compares the estimates to the projects' actual lengths. Gilb, Tom. Principles of Software Engineering Management. Wokingham, England: Addison-Wesley, 1988.

pages: 1,079 words: 321,718

Surfaces and Essences
by Douglas Hofstadter and Emmanuel Sander
Published 10 Sep 2012

If he accepts the job, his salary and professional prestige will both take leaps, but on the other hand, the move would be a huge emotional upheaval for his entire family. A colleague whom he privately asks for advice reacts, “Hey, what’s with you? You’re one of the world’s experts on how decisions are made. Why are you coming to see me? You’re the one who invented super-sophisticated statistical models for making optimal decisions. Apply your own work to your dilemma; that’ll tell you what to do!” His friend looks at him straight in the eye and says, “Come off it, would you? This is serious!” The fact is that when we are faced with serious decisions, although we can certainly draw up a list of all sorts of outcomes, assigning them numerical weights that reflect their likelihoods of happening as well as the amount of pleasure they would bring us, on the basis of which we can then calculate the “optimal” choice, this is hardly the way that people who are in the throes of major decision-making generally proceed.

pages: 1,351 words: 385,579

The Better Angels of Our Nature: Why Violence Has Declined
by Steven Pinker
Published 24 Sep 2012

Combine exponentially growing damage with an exponentially shrinking chance of success, and you get a power law, with its disconcertingly thick tail. Given the presence of weapons of mass destruction in the real world, and religious fanatics willing to wreak untold damage for a higher cause, a lengthy conspiracy producing a horrendous death toll is within the realm of thinkable probabilities. A statistical model, of course, is not a crystal ball. Even if we could extrapolate the line of existing data points, the massive terrorist attacks in the tail are still extremely (albeit not astronomically) unlikely. More to the point, we can’t extrapolate it. In practice, as you get to the tail of a power-law distribution, the data points start to misbehave, scattering around the line or warping it downward to very low probabilities.
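The claim that exponentially growing damage combined with an exponentially shrinking chance of success yields a power law can be made precise with a short, standard derivation (the symbols below are my own, not the book's). Suppose a plot of duration $t$ produces damage $x(t) = e^{a t}$, while the chance of escaping detection for that long is $\Pr(T > t) = e^{-b t}$. Then the damage exceeds $x$ only if the plot survives past $t = (\ln x)/a$, so

$$\Pr(X > x) \;=\; \Pr\!\left(T > \frac{\ln x}{a}\right) \;=\; e^{-\frac{b}{a}\ln x} \;=\; x^{-b/a},$$

which is exactly a power law, with the thick tail governed by the exponent $b/a$.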

pages: 889 words: 433,897

The Best of 2600: A Hacker Odyssey
by Emmanuel Goldstein
Published 28 Jul 2008

I started with only a few keywords and found myself with many more based on the keyword tool. But this was where more problems started to occur. I found that my keywords were being canceled way too easily and were not given a fair chance to perform. Like I said earlier, if the campaign was on a larger scale, then this statistics model may hold true. But for smaller campaigns it simply was more of a hassle. It also led to another problem that I found slightly ironic, which is that the keyword tool suggested words and phrases to me that I was later denied due to their ToS (Terms of Service) anyway. Why recommend them if you are not going to allow me to use them?

pages: 1,737 words: 491,616

Rationality: From AI to Zombies
by Eliezer Yudkowsky
Published 11 Mar 2015

When there is some phenomenon A that we want to investigate, and an observation X that is evidence about A—for example, in the previous example, A is breast cancer and X is a positive mammography—Bayes’s Theorem tells us how we should update our probability of A, given the new evidence X. By this point, Bayes’s Theorem may seem blatantly obvious or even tautological, rather than exciting and new. If so, this introduction has entirely succeeded in its purpose. * * * Bayes’s Theorem describes what makes something “evidence” and how much evidence it is. Statistical models are judged by comparison to the Bayesian method because, in statistics, the Bayesian method is as good as it gets—the Bayesian method defines the maximum amount of mileage you can get out of a given piece of evidence, in the same way that thermodynamics defines the maximum amount of work you can get out of a temperature differential.
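A worked instance of the update described above, in Python. The numbers are common textbook illustrations (1% prevalence, 80% sensitivity, 9.6% false-positive rate), assumed here for the sketch rather than quoted from the excerpt.

```python
# Bayes's Theorem: P(A|X) = P(X|A) * P(A) / P(X)
# A = breast cancer, X = positive mammography; numbers are illustrative.
p_a             = 0.01    # prior probability of cancer
p_x_given_a     = 0.80    # P(positive | cancer)
p_x_given_not_a = 0.096   # P(positive | no cancer)

p_x = p_x_given_a * p_a + p_x_given_not_a * (1 - p_a)   # law of total probability
p_a_given_x = p_x_given_a * p_a / p_x                   # posterior

print(f"P(cancer | positive mammography) = {p_a_given_x:.3f}")   # ~0.078
```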