sparse data


51 results

pages: 304 words: 82,395

Big Data: A Revolution That Will Transform How We Live, Work, and Think
by Viktor Mayer-Schönberger and Kenneth Cukier
Published 5 Mar 2013

. [>] Netflix identified individual—Ryan Singel, “Netflix Spilled Your Brokeback Mountain Secret, Lawsuit Claims,” Wired, December 17, 2009 (http://www.wired.com/threatlevel/2009/12/netflix-privacy-lawsuit/). On the Netflix data release—Arvind Narayanan and Vitaly Shmatikov, “Robust De-Anonymization of Large Sparse Datasets,” Proceedings of the 2008 IEEE Symposium on Security and Privacy, p. 111 et seq. (http://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf); Arvind Narayanan and Vitaly Shmatikov, “How to Break the Anonymity of the Netflix Prize Dataset,” October 18, 2006, arXiv:cs/0610105 [cs.CR] (http://arxiv.org/abs/cs/0610105).

“Space-Efficient Indexing of Chess Endgame Tables.” ICGA Journal 23, no. 3 (2000), pp. 148–162. Narayanan, Arvind, and Vitaly Shmatikov. “How to Break the Anonymity of the Netflix Prize Dataset.” October 18, 2006, arXiv:cs/0610105 (http://arxiv.org/abs/cs/0610105). ———. “Robust De-Anonymization of Large Sparse Datasets.” Proceedings of the 2008 IEEE Symposium on Security and Privacy, p. 111 (http://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf). Nazareth, Rita, and Julia Leite. “Stock Trading in U.S. Falls to Lowest Level Since 2008.” Bloomberg, August 13, 2012 (http://www.bloomberg.com/news/2012-08-13/stock-trading-in-u-s-hits-lowest-level-since-2008-as-vix-falls.html).

pages: 519 words: 102,669

Programming Collective Intelligence
by Toby Segaran
Published 17 Dec 2008

In the movie example, since every critic has rated nearly every movie, the dataset is dense (not sparse). On the other hand, it would be unlikely to find two people with the same set of del.icio.us bookmarks—most bookmarks are saved by a small group of people, leading to a sparse dataset. Item-based filtering usually outperforms user-based filtering in sparse datasets, and the two perform about equally in dense datasets. Tip To learn more about the difference in performance between these algorithms, check out a paper called "Item-based Collaborative Filtering Recommendation Algorithms" by Sarwar et al. at http://citeseer.ist.psu.edu/sarwar01itembased.html.
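The user-based versus item-based contrast above can be sketched with a toy bookmark dataset. This is an illustrative fragment in the spirit of the book's Python, not Segaran's code; the users and URLs are invented:

```python
from math import sqrt

# Toy "bookmark" data: each user saves only a few items, so the
# user-item matrix is sparse (most entries absent, not zero).
bookmarks = {
    "alice": {"python.org", "scipy.org"},
    "bob": {"python.org", "nltk.org"},
    "carol": {"scipy.org"},
}

def item_similarity(data, item_a, item_b):
    """Cosine similarity between two items over the users who saved them."""
    users_a = {u for u, items in data.items() if item_a in items}
    users_b = {u for u, items in data.items() if item_b in items}
    if not users_a or not users_b:
        return 0.0
    return len(users_a & users_b) / sqrt(len(users_a) * len(users_b))

sim = item_similarity(bookmarks, "python.org", "scipy.org")  # one shared user
```

Because item-item similarities like this can be precomputed offline, item-based filtering holds up better when each user has touched only a tiny fraction of the items.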

The Ethical Algorithm: The Science of Socially Aware Algorithm Design
by Michael Kearns and Aaron Roth
Published 3 Oct 2019

References and Further Reading Chapter 1: Algorithmic Privacy: From Anonymity to Noise References An extended discussion of successful “de-anonymization” attacks, including the Massachusetts GIC and Netflix cases, can be found in “Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization” by Paul Ohm, which appeared in the UCLA Law Review 57 (2010). Details on the Netflix attack are described in “Robust De-anonymization of Large Sparse Datasets” by Arvind Narayanan and Vitaly Shmatikov, which was published in the IEEE Symposium on Security and Privacy (IEEE, 2008). Details of the original Genome-Wide Association Study attack can be found in “Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays” by Nils Homer, Szabolcs Szelinger, Margot Redman, David Duggan, Waibhav Tembe, Jill Muehling, John V.

pages: 296 words: 78,631

Hello World: Being Human in the Age of Algorithms
by Hannah Fry
Published 17 Sep 2018

Jon Brodkin, ‘Senate votes to let ISPs sell your Web browsing history to advertisers’, Ars Technica, 23 March 2017, https://arstechnica.com/tech-policy/2017/03/senate-votes-to-let-isps-sell-your-web-browsing-history-to-advertisers/. 16. Svea Eckert and Andreas Dewes, ‘Dark data’, DEFCON Conference 25, 20 Oct. 2017, https://www.youtube.com/watch?v=1nvYGi7-Lxo. 17. The researchers based this part of their work on Arvind Narayanan and Vitaly Shmatikov, ‘Robust de-anonymization of large sparse datasets’, paper presented to IEEE Symposium on Security and Privacy, 18–22 May 2008. 18. Michal Kosinski, David Stillwell and Thore Graepel, ‘Private traits and attributes are predictable from digital records of human behavior’, Proceedings of the National Academy of Sciences, vol. 110, no. 15, 2013, pp. 5802–5. 19. Ibid. 20. Wu Youyou, Michal Kosinski and David Stillwell, ‘Computer-based personality judgments are more accurate than those made by humans’, Proceedings of the National Academy of Sciences, vol. 112, no. 4, 2015, pp. 1036–40. 21.

The Internet Trap: How the Digital Economy Builds Monopolies and Undermines Democracy
by Matthew Hindman
Published 24 Sep 2018

So, for instance, a category might represent action movies, with movies with a lot of action at the top, and slow movies at the bottom, and correspondingly users who like action movies at the top, and those who prefer slow movies at the bottom.17 While this is true in theory, interpreting factors can be difficult in practice, as we shall see. svd had rarely been used with recommender systems because the technique performed poorly on “sparse” datasets, those (like the Netflix data) in which most of the values are missing. But Funk adapted the technique to ignore missing values, and found a way to implement the approach in only two lines of C code.18 Funk even titled the blog post explaining his method “Try This at Home,” encouraging other entrants to incorporate svd.
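Funk's trick of ignoring missing values can be sketched as plain stochastic gradient descent over only the observed ratings. This is a minimal illustration of the idea, not Funk's actual C implementation; the ratings, learning rate, and dimensions are invented:

```python
import random

# Ratings stored only for observed (user, item) pairs -- missing
# entries are simply absent, never treated as zeros.
ratings = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0, (2, 1): 1.0}
n_users, n_items, k = 3, 2, 2   # k latent factors per user/item

random.seed(0)
P = [[random.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
Q = [[random.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]

lr, reg = 0.05, 0.02
for _ in range(200):                      # epochs of SGD
    for (u, i), r in ratings.items():     # observed entries only
        pred = sum(P[u][f] * Q[i][f] for f in range(k))
        err = r - pred
        for f in range(k):
            pu, qi = P[u][f], Q[i][f]
            P[u][f] += lr * (err * qi - reg * pu)
            Q[i][f] += lr * (err * pu - reg * qi)

# Mean squared error on the observed entries should now be small.
mse = sum((r - sum(P[u][f] * Q[i][f] for f in range(k))) ** 2
          for (u, i), r in ratings.items()) / len(ratings)
```

The inner two lines updating P and Q are the whole of the method, which is why Funk could describe it as a couple of lines of C.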

pages: 422 words: 104,457

Dragnet Nation: A Quest for Privacy, Security, and Freedom in a World of Relentless Surveillance
by Julia Angwin
Published 25 Feb 2014

In 2006, the New York Times: Michael Barbaro and Tom Zeller Jr., “A Face Is Exposed for AOL Searcher No. 4417749,” New York Times, August 9, 2006, http://www.nytimes.com/2006/08/09/technology/09aol.html?_r=0&gwh=2CACC912D19D87BDFD3A39B96C429022. In 2008, researchers at the University of Texas: Arvind Narayanan and Vitaly Shmatikov, “Robust De-anonymization of Large Sparse Datasets,” Security and Privacy (2008): 111–25, http://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf. In 2012, my Wall Street Journal team: Jennifer Valentino-Devries and Jeremy Singer-Vine, “They Know What You’re Shopping For,” Wall Street Journal, December 7, 2012, http://online.wsj.com/article/SB10001424127887324784404578143144132736214.html.

Reset
by Ronald J. Deibert
Published 14 Aug 2020

Drone pandemic: Will coronavirus invite the world to meet Big Brother? Retrieved from https://thebulletin.org/2020/04/drone-pandemic-will-coronavirus-invite-the-world-to-meet-big-brother/ How easy it is to unmask real identities contained in large personal data sets: Narayanan, A., & Shmatikov, V. (2008). Robust de-anonymization of large sparse datasets. IEEE Symposium on Security and Privacy, 111–125. http://doi.org/10.1109/SP.2008.33 “At least eight surveillance and cyber-intelligence companies attempting to sell repurposed spy and law enforcement tools”: Schectman, J., Bing, C., & Stubbs, J. (2020, April 28). Cyber-intel firms pitch governments on spy tools to trace coronavirus.

pages: 448 words: 117,325

Click Here to Kill Everybody: Security and Survival in a Hyper-Connected World
by Bruce Schneier
Published 3 Sep 2018

That’s a calm year for me; in 2015, my average speed was 33 miles per hour. 144 It wasn’t always like this: This is a good summary: Mark Hansen, Carolyn McAndrews, and Emily Berkeley (Jul 2008), “History of aviation safety oversight in the United States,” DOT/FAA/AR-08-39, National Technical Information Service, http://www.tc.faa.gov/its/worldpac/techrpt/ar0839.pdf. 144 The result is that today: The taxi ride to the airport is the most dangerous part of the trip. 145 Whenever industry groups write about this: Here’s one example: Coalition for Cybersecurity and Policy and Law (26 Oct 2017), “New whitepaper: Building a national cybersecurity strategy: Voluntary, flexible frameworks,” Center for Responsible Enterprise and Trade, https://create.org/news/new-whitepaper-building-national-cybersecurity-strategy. 145 The Federal Aviation Administration has: April Glaser (15 Mar 2017), “Federal privacy laws won’t necessarily protect you from spying drones,” Recode, https://www.recode.net/2017/3/15/14934050/federal-privacy-laws-spying-drones-senate-hearing. 148 in 2006, Netflix published 100 million: Katie Hafner (2 Oct 2006), “And if you liked the movie, a Netflix contest may reward you handsomely,” New York Times, http://www.nytimes.com/2006/10/02/technology/02netflix.html. 148 Researchers were able to de-anonymize: Arvind Narayanan and Vitaly Shmatikov (18 May 2008), “Robust de-anonymization of large sparse datasets,” 2008 IEEE Symposium on Security and Privacy (SP ’08), https://dl.acm.org/citation.cfm?id=1398064. 148 which surprised pretty much everyone: Paul Ohm (13 Aug 2009), “Broken promises of privacy: Responding to the surprising failure of anonymization,” UCLA Law Review 57, https://papers.ssrn.com/sol3/papers.cfm?

pages: 598 words: 134,339

Data and Goliath: The Hidden Battles to Collect Your Data and Control Your World
by Bruce Schneier
Published 2 Mar 2015

researchers were able to attach names: Michael Barbaro and Tom Zeller Jr. (9 Aug 2006), “A face is exposed for AOL Searcher No. 4417749,” New York Times, http://www.nytimes.com/2006/08/09/technology/09aol.html. Researchers were able to de-anonymize people: Arvind Narayanan and Vitaly Shmatikov (18–20 May 2008), “Robust de-anonymization of large sparse datasets,” 2008 IEEE Symposium on Security and Privacy, Oakland, California, http://dl.acm.org/citation.cfm?id=1398064 and http://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf. correlation opportunities pop up: Also for research purposes, in the mid-1990s the Massachusetts Group Insurance Commission released hospital records from state employees with the names, addresses, and Social Security numbers removed.

Data Mining: Concepts and Techniques: Concepts and Techniques
by Jiawei Han , Micheline Kamber and Jian Pei
Published 21 Jun 2011

Figure 3.5 Principal components analysis. Y1 and Y2 are the first two principal components for the given data. PCA can be applied to ordered and unordered attributes, and can handle sparse data and skewed data. Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions. Principal components may be used as inputs to multiple regression and cluster analysis. In comparison with wavelet transforms, PCA tends to be better at handling sparse data, whereas wavelet transforms are more suitable for data of high dimensionality. 3.4.4. Attribute Subset Selection Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining task or redundant.
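The principal components idea described above can be illustrated with a minimal power-iteration sketch. This is a pure-Python toy (the dataset is invented), not the book's notation or a production PCA:

```python
from math import sqrt

# Tiny 2-D dataset stretched along the x-axis; the first principal
# component should point along that dominant direction.
data = [(-4.0, -1.0), (-2.0, -0.5), (0.0, 0.0), (2.0, 0.5), (4.0, 1.0)]

# Center the data, then form the 2x2 covariance matrix.
n = len(data)
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n
centered = [(x - mx, y - my) for x, y in data]
cxx = sum(x * x for x, _ in centered) / n
cyy = sum(y * y for _, y in centered) / n
cxy = sum(x * y for x, y in centered) / n

# Power iteration: repeatedly apply the covariance matrix to a vector
# and renormalize; it converges to the dominant eigenvector, i.e. the
# first principal component.
v = (1.0, 1.0)
for _ in range(50):
    wx = cxx * v[0] + cxy * v[1]
    wy = cxy * v[0] + cyy * v[1]
    norm = sqrt(wx * wx + wy * wy)
    v = (wx / norm, wy / norm)
```

Projecting each point onto v gives the one-dimensional reduced representation; subsequent components are found the same way after subtracting out the first.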

[Back-of-book index excerpt; the entries matching this query include: sparse data 102; sparse data cubes 190; sparsest cuts 539; sparsity coefficient 579.]

We then discuss how object dissimilarity can be computed for objects described by nominal attributes (Section 2.4.2), by binary attributes (Section 2.4.3), by numeric attributes (Section 2.4.4), by ordinal attributes (Section 2.4.5), or by combinations of these attribute types (Section 2.4.6). Section 2.4.7 provides similarity measures for very long and sparse data vectors, such as term-frequency vectors representing documents in information retrieval. Knowing how to compute dissimilarity is useful in studying attributes and will also be referenced in later topics on clustering (Chapter 10 and Chapter 11), outlier analysis (Chapter 12), and nearest-neighbor classification (Chapter 9). 2.4.1.

pages: 586 words: 186,548

Architects of Intelligence
by Martin Ford
Published 16 Nov 2018

JUDEA PEARL: Neural networks and reinforcement learning will all be essential components when properly utilized in causal modeling. MARTIN FORD: So, you think it might be a hybrid system that incorporates not just neural networks, but other ideas from other areas of AI? JUDEA PEARL: Absolutely. Even today, people are building hybrid systems when you have sparse data. There’s a limit, however, to how much you can extrapolate or interpolate sparse data if you want to get cause-effect relationships. Even if you have infinite data, you can’t tell the difference between A causes B and B causes A. MARTIN FORD: If someday we have strong AI, do you think that a machine could be conscious, and have some kind of inner experience like a human being?

Early on, we used these ideas from Bayesian statistics, Bayesian inference, and Bayesian networks, to use the mathematics of probability theory to formulate how people’s mental models of the causal structure of the world might work. It turns out that tools that were developed by mathematicians, physicists, and statisticians to make inferences from very sparse data in a statistical setting were being deployed in the 1990s in machine learning and AI, and it revolutionized the field. It was part of the move from an earlier symbolic paradigm for AI to a more statistical paradigm. To me, that was a very, very powerful way to think about how our minds were able to make inferences from sparse data. In the last ten years or so, our interests have turned more to where these mental models come from. We’re looking at the minds and brains of babies and young children, and really trying to understand the most basic kind of learning processes that build our basic common-sense understanding of the world.

However, it seemed like we still didn’t really have a handle on what intelligence is really about—a flexible, general-purpose intelligence that allows you to do all of those things that you can do. 10 years ago in cognitive science, we had a bunch of really satisfying models of individual cognitive capacities using this mathematics of ways people made inferences from sparse data, but we didn’t have a unifying theory. We had tools, but we didn’t have any kind of model of common sense. If you look at machine learning and AI technologies, and this is as true now as it was ten years ago, we were increasingly getting machine systems that did remarkable things that we used to think only humans could do.

pages: 625 words: 167,349

The Alignment Problem: Machine Learning and Human Values
by Brian Christian
Published 5 Oct 2020

For simplicity, we focus our discussion on the former, but both approaches have advantages, though they tend to result ultimately in fairly similar models. 55. Shannon, “A Mathematical Theory of Communication.” 56. See Jelinek and Mercer, “Interpolated Estimation of Markov Source Parameters from Sparse Data,” and Katz, “Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer”; for an overview, see Manning and Schütze, Foundations of Statistical Natural Language Processing. 57. This famous phrase originated in Bellman, Dynamic Programming. 58. See Hinton, “Learning Distributed Representations of Concepts,” and “Connectionist Learning Procedures,” and Rumelhart and McClelland, Parallel Distributed Processing. 59.

New York: Macmillan, 1892. Jaynes, Edwin T. “Information Theory and Statistical Mechanics.” Physical Review 106, no. 4 (1957): 620–30. Jefferson, Thomas. Notes on the State of Virginia. Paris, 1785. Jelinek, Fred, and Robert L. Mercer. “Interpolated Estimation of Markov Source Parameters from Sparse Data.” In Proceedings, Workshop on Pattern Recognition in Practice, edited by Edzard S. Gelsema and Laveen N. Kanal, 381–97. 1980. Jeon, Hong Jun, Smitha Milli, and Anca D. Drăgan. “Reward-Rational (Implicit) Choice: A Unifying Formalism for Reward Learning.” arXiv Preprint arXiv:2002.04833, 2020. Joffe-Walt, Chana.

“Curiosity and Interest: The Benefits of Thriving on Novelty and Challenge.” Oxford Handbook of Positive Psychology 2 (2009): 367–74. Kasparov, Garry. How Life Imitates Chess: Making the Right Moves, from the Board to the Boardroom. Bloomsbury USA, 2007. Katz, Slava. “Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer.” IEEE Transactions on Acoustics, Speech, and Signal Processing 35, no. 3 (1987): 400–01. Kellogg, Winthrop Niles, and Luella Agger Kellogg. The Ape and the Child: A Comparative Study of the Environmental Influence upon Early Behavior. Whittlesey House, 1933.

pages: 62 words: 14,996

SciPy and NumPy
by Eli Bressert
Published 14 Oct 2012

This means that the sparse matrix was 100 times more memory efficient and the Eigen operation was roughly 150 times faster than the non-sparse cases. Tip If you’re unfamiliar with sparse matrices, I suggest reading http://www.scipy.org/SciPyPackages/Sparse, where the basics on sparse matrices and operations are discussed. In 2D and 3D geometry, there are many sparse data structures used in fields like engineering, computational fluid dynamics, electromagnetism, thermodynamics, and acoustics. Non-geometric instances of sparse matrices are applicable to optimization, economic modeling, mathematics and statistics, and network/graph theories. Using scipy.io, you can read and write common sparse matrix file formats such as Matrix Market and Harwell-Boeing, or load MatLab files.
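The memory argument above can be illustrated with a minimal dictionary-of-keys structure, the same idea behind scipy.sparse.dok_matrix. This is a pure-Python sketch of the concept, not SciPy's implementation:

```python
# A minimal dictionary-of-keys sparse matrix: store only the
# nonzero entries, keyed by (row, col).
class SparseMatrix:
    def __init__(self, shape):
        self.shape = shape
        self.data = {}          # (row, col) -> value; zeros omitted

    def __setitem__(self, key, value):
        if value:
            self.data[key] = value
        else:
            self.data.pop(key, None)   # storing zero frees the slot

    def __getitem__(self, key):
        return self.data.get(key, 0.0)

    def matvec(self, x):
        """Multiply by a dense vector, touching only stored entries."""
        y = [0.0] * self.shape[0]
        for (i, j), v in self.data.items():
            y[i] += v * x[j]
        return y

m = SparseMatrix((1000, 1000))   # a million cells...
m[0, 0] = 2.0                    # ...but only two stored values
m[999, 1] = 3.0
y = m.matvec([1.0] * 1000)
```

A dense version would hold a million floats; here both storage and the matrix-vector product scale with the number of nonzeros, which is the source of the memory and speed gains the excerpt measures.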

pages: 589 words: 69,193

Mastering Pandas
by Femi Anthony
Published 21 Jun 2015

It is not a public API. panel.py, panel4d.py, and panelnd.py: These provide the functionality for the pandas Panel object. series.py: This defines the pandas Series class and its various methods that Series inherits from NDFrame and IndexOpsMixin. sparse.py: This provides support for handling sparse data structures. Sparse data structures are compressed, in that data points matching NaN or missing values are omitted. For more information on this, go to http://pandas.pydata.org/pandas-docs/stable/sparse.html. strings.py: This has various functions for handling strings. pandas/io This module contains various modules for data I/O.

Statistics in a Nutshell
by Sarah Boslaugh
Published 10 Nov 2012

A researcher might collect exact information on the number of children per household (0 children, 1 child, 2 children, 3 children, etc.) but choose to group this data into categories for the purpose of analysis, such as 0 children, 1–2 children, and 3 or more children. This type of grouping is often used if there are large numbers of categories and some of them contain sparse data. In the case of the number of children in a household, for instance, a data set might include a relatively few households with large numbers of children, and the low frequencies in those categories can adversely affect the power of the study or make it impossible to use certain analytical techniques.

The Pearson’s chi-square test is suitable for data in which all observations are independent (the same person is not measured twice, for instance) and the categories are mutually exclusive and exhaustive (so that no case may be classified into more than one cell, and all potential cases can be classified into one of the cells). It is also assumed that no cell has an expected value less than 1, and no more than 20% of the cells have an expected value less than 5. The reason for the last two requirements is that the chi-square is an asymptotic test and might not be valid for sparse data (data in which one or more cells have a low expected frequency). Yates’s correction for continuity is a procedure developed by the British statistician Frank Yates for the chi-square test of independence when applied to 2×2 tables. The chi-square distribution is continuous, whereas the data used in a chi-square test is discrete, and Yates’s correction is meant to correct for this discrepancy.

Use of Yates’s correction is not universally endorsed, however; some researchers feel that it might be an overcorrection leading to a loss of power and increased probability of a Type II error (wrongly failing to reject the null hypothesis). Some statisticians reject the use of Yates’s correction entirely, although some find it useful with sparse data, particularly when at least one cell in the table has an expected cell frequency of less than 5. A less controversial remedy for sparse categorical data is to use Fisher’s exact test, discussed later, instead of the chi-square test, when the distributional assumptions previously named (no more than 20% of cells with an expected value less than 5 and no cell with an expected value of less than 1) are not met.
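The chi-square statistic and Yates's correction described above can be computed directly; this is a sketch for a 2×2 table (the counts are invented), reducing |O − E| by 0.5 before squaring when the correction is applied:

```python
# Pearson chi-square for a 2x2 contingency table, with and without
# Yates's continuity correction.
table = [[12, 5], [6, 9]]

row = [sum(r) for r in table]            # row totals
col = [sum(c) for c in zip(*table)]      # column totals
total = sum(row)

def chi_square(table, yates=False):
    stat = 0.0
    for i, r in enumerate(table):
        for j, obs in enumerate(r):
            exp = row[i] * col[j] / total      # expected count
            diff = abs(obs - exp)
            if yates:
                diff = max(diff - 0.5, 0.0)    # continuity correction
            stat += diff * diff / exp
    return stat

plain = chi_square(table)                 # ~3.03
corrected = chi_square(table, yates=True) # ~1.91, always <= plain
```

As the passage notes, the corrected statistic is always smaller, which is exactly why some see it as an overcorrection that costs power.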

pages: 504 words: 89,238

Natural language processing with Python
by Steven Bird , Ewan Klein and Edward Loper
Published 15 Dec 2009

Its overall accuracy score is very low: >>> bigram_tagger.evaluate(test_sents) 0.10276088906608193 As n gets larger, the specificity of the contexts increases, as does the chance that the data we wish to tag contains contexts that were not present in the training data. This is known as the sparse data problem, and is quite pervasive in NLP. As a consequence, there is a trade-off between the accuracy and the coverage of our results (and this is related to the precision/recall trade-off in information retrieval). Caution! N-gram taggers should not consider context that crosses a sentence boundary.
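The sparse data problem can be demonstrated without NLTK by counting how many test contexts were never seen in training as n grows; the two toy word sequences below are invented:

```python
# As the context grows from 1 to 3 tokens, the share of test contexts
# absent from training rises sharply -- the sparse data problem.
train = "the cat sat on the mat and the dog sat on the log".split()
test = "the cat sat on the log and the dog lay on the mat".split()

def contexts(tokens, n):
    """The set of all n-token windows in a sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

unseen = {}
for n in (1, 2, 3):
    seen = contexts(train, n)
    test_ctx = contexts(test, n)
    unseen[n] = len(test_ctx - seen) / len(test_ctx)
```

Even on these near-identical sentences the unseen fraction climbs with n; with real vocabularies the growth is far steeper, which is why backoff and smoothing are needed.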

• A dictionary is used to map between arbitrary types of information, such as a string and a number: freq['cat'] = 12. We create dictionaries using the brace notation: pos = {}, pos = {'furiously': 'adv', 'ideas': 'n', 'colorless': 'adj'}. • N-gram taggers can be defined for large values of n, but once n is larger than 3, we usually encounter the sparse data problem; even with a large quantity of training data, we see only a tiny fraction of possible contexts. • Transformation-based tagging involves learning a series of repair rules of the form “change tag s to tag t in context c,” where each rule fixes mistakes and possibly introduces a (smaller) number of errors. 5.9 Further Reading Extra materials for this chapter are posted at http://www.nltk.org/, including links to freely available resources on the Web.

.) ◑ Consider the regular expression tagger developed in the exercises in the previous section. Evaluate the tagger using its accuracy() method, and try to come up with ways to improve its performance. Discuss your findings. How does objective evaluation help in the development process? ◑ How serious is the sparse data problem? Investigate the performance of n-gram taggers as n increases from 1 to 6. Tabulate the accuracy score. Estimate the training data required for these taggers, assuming a vocabulary size of 10^5 and a tagset size of 10^2. ◑ Obtain some tagged data for another language, and train and evaluate a variety of taggers on it.

RDF Database Systems: Triples Storage and SPARQL Query Processing
by Olivier Cure and Guillaume Blin
Published 10 Dec 2014

An important drawback is related to the fact that most useful queries rarely retrieve all the information from a given tuple, but rather retrieve only a subset of it. That implies that a large portion of the tuple’s data is unnecessarily transferred into the main memory. This has an impact on the I/O efficiency of row stores. In Abadi (2007) the author states that column stores are good candidates for extremely wide tables and for databases handling sparse data. The paper demonstrates the potential of column stores for the Semantic Web through the storage of RDF. Based on these remarks, it’s not a surprise that the current trend with database vendors emphasizes that column stores are getting more popular and can, in fact, compete with row stores in many use cases.
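The point about queries touching only a subset of a wide, sparse tuple can be sketched by contrasting row and column layouts. This is an illustrative toy (the table and attribute names are invented), not any particular store's implementation:

```python
# Row store vs. column store for a wide, sparse table: a query over
# one attribute scans every tuple in the row layout, but reads only
# one short column in the column layout.
rows = [
    {"id": 1, "name": "a", "age": 30},
    {"id": 2, "name": "b"},              # sparse: attributes may be absent
    {"id": 3, "age": 25},
]

# Column layout: one dict per attribute, keyed by row id; absent
# values simply take no space at all.
columns = {}
for r in rows:
    rid = r["id"]
    for attr, val in r.items():
        if attr != "id":
            columns.setdefault(attr, {})[rid] = val

# "SELECT avg(age)" touches only the age column, not whole tuples.
ages = columns["age"].values()
avg_age = sum(ages) / len(ages)
```

In the row layout the same query would drag every tuple's name field through memory as well, which is the I/O cost the excerpt describes.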

A typical architecture may require a NoSQL key value store for serving fast access to cached data, a standard RDBMS or NewSQL database to support high transaction rates, an RDF store to serve as a data warehouse and to enable data integration of Linked Open Data. REFERENCES Abadi, D.J., 2007. Column stores for wide and sparse data. CIDR, 292–297. Abadi, D., Madden, S., Ferreira, M., 2006. Integrating compression and execution in column-oriented database systems. In: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data. ACM Press, New York, pp. 671–682. Abadi, D.J., Marcus, A., Madden, S., Hollenbach, K.J., 2007a.

pages: 161 words: 39,526

Applied Artificial Intelligence: A Handbook for Business Leaders
by Mariya Yao , Adelyn Zhou and Marlene Jia
Published 1 Jun 2018

Even then, neural networks trained on tiger photos do not reliably recognize abstractions or representations of tigers, such as cartoons or costumes. Because we are Systems That Master, humans have no trouble with this. A System That Masters is an intelligent agent capable of constructing abstract concepts and strategic plans from sparse data. By creating modular, conceptual representations of the world around us, we are able to transfer knowledge from one domain to another, a key feature of general intelligence. As we discussed earlier, no modern AI system is an AGI, or artificial general intelligence. While humans are Systems That Master, current AI programs are not.

pages: 397 words: 113,304

Spineless: The Science of Jellyfish and the Art of Growing a Backbone
by Juli Berwald
Published 14 May 2017

The NCEAS team collected thirty-seven datasets spanning the years 1970 to 2011. Before 1970 there were fewer than ten datasets published on jellyfish in any year. After that, the number of datasets increased into the twenties and thirties. Some NCEAS members believed that comparing the period before 1970 with the period after 1970 didn’t make sense. During the years of sparse data, each bit of information carried more weight than during years that were more data rich. This could skew the analysis by giving a disproportionate impact to earlier data relative to later data. Other members argued that if data existed, it needed to be included, otherwise deciding what data to include and what to exclude imposed biases.

Lucas explained that if all the data were included, the analysis showed that the abundances of jellyfish oscillate in a cycle that repeats roughly every twenty years. The most recent upswing started in 2004, and we’re still in it. Jellyfish have been noticed more, not because of some aberration, but because we are on the part of the normal cycle that’s tracking upward. But if you did the analysis excluding the sparse data before 1970, the conclusion was different. Over the past forty years, the data revealed an oscillation, but that up-and-down cycle was superimposed on an overall increase in jellyfish abundances. This difference in how to perform the analysis—with or without the data before 1970—caused the rift in the NCEAS group.

pages: 398 words: 31,161

Gnuplot in Action: Understanding Data With Graphs
by Philipp Janert
Published 2 Jan 2010

You can always come back to this chapter when you need a specific plot type. 5.1 Choosing plot styles: Different types of data call for different display styles. For instance, it makes sense to plot a smooth function with one continuous line, but to use separate symbols for a sparse data set where each individual point counts. Experimental data often requires error bars together with the data, whereas counting statistics call for histograms. Choosing an appropriate style for the data leads to graphs that are both informative and aesthetically pleasing. There are two ways to choose a style for the data: inline, as part of the plot command, or globally, using the set style directive.

Since lines are such fundamental objects, I have collected all this material in a separate section at the end of this chapter for easier reference (section 5.3). The linespoints style is a combination of the previous two: each data point is marked with a symbol, and adjacent points are connected with straight lines. This style is mostly useful for sparse data sets. DOTS: The dots style prints a “minimal” dot (a single pixel for bitmap terminals) for each data point. This style is occasionally useful for very large, unsorted data sets (such as large scatter plots). Figure 1.2 in chapter 1 was drawn using dots. 5.2.2 Box styles: Box styles, which draw a box of finite width, are sometimes useful for counting statistics, or for other data sets where the x values cannot take on a continuous spectrum of values.

pages: 260 words: 78,229

Without Conscience: The Disturbing World of the Psychopaths Among Us
by Robert D. Hare
Published 1 Nov 1993

Even more frightening is the possibility that “cool” but vicious psychopaths will become twisted role models for children raised in dysfunctional families or disintegrating communities where little value is placed on honesty, fair play, and concern for the welfare of others. “WHAT HAVE I DONE?” It is hard to imagine any parent of a psychopath who has not asked the question, almost certainly with a sense of desperation, “What have I done wrong as a parent to bring this about in my child?” The answer is, possibly nothing. To summarize our sparse data, we do not know why people become psychopaths, but current evidence leads us away from the commonly held idea that the behavior of parents bears sole or even primary responsibility for the disorder. This does not mean that parents and the environment are completely off the hook. Parenting behavior may not be responsible for the essential ingredients of the disorder, but it may have a great deal to do with how the syndrome develops and is expressed.

pages: 250 words: 75,586

When the Air Hits Your Brain: Tales From Neurosurgery
by Frank Vertosick
Published 1 Jan 1996

The course of an illness when doctors don’t interfere with it is called its natural history. Ironically, for many diseases (including SAH), medicine has been fiddling with them for as long as they have been recognized as diseases. We are, therefore, totally clueless about the natural history of those diseases, except for what sparse data we can glean from patients who escape our clutches, either because they are too sick or have stubbornly refused our care. Given this lack of hard data, a surgeon is left to choose the option for each patient. If the surgeon is aggressive, then the patient will be steered toward surgery. Unlike the bunion patient, who alone knows how much it hurts and how much surgical risk she is willing to assume to alleviate her suffering, candidates for statistical surgery are completely at the surgeon’s mercy.

pages: 294 words: 77,356

Automating Inequality
by Virginia Eubanks

In the case of the AFST, Allegheny County is concerned with child abuse, especially potential fatalities. But the number of child maltreatment–related fatalities and near fatalities in Allegheny County is very low—luckily, only a handful a year. A statistically meaningful model cannot be constructed with such sparse data. Failing that, it might seem logical to use child maltreatment as substantiated by CYF caseworkers to stand in for actual child maltreatment. But substantiation is an imprecise metric: it simply means that CYF believes there is enough evidence that a child may be harmed to accept a family for services.

pages: 275 words: 74,972

Complete Guide to Fasting: Heal Your Body Through Intermittent, Alternate-Day, and Extended Fasting
by Jimmy Moore and Jason Fung
Published 18 Oct 2016

The “fasting-mimicking diet” is a diet created by researchers to re-create the benefits of fasting without actual fasting. It is a complicated regimen of reduced caloric intake over five days every month. The first day allows 1,090 calories, composed of 10 percent protein, 56 percent fat, and 34 percent carbohydrate. That’s followed by four days of 725 calories, with the same nutritional breakdown. There is sparse data to support the claim that this diet provides all the benefits of fasting, and I don’t recommend it because of its unnecessary complexity. To me, it is far simpler to follow five days of regular fasting per month.

FASTING ALL-STARS
DR. THOMAS SEYFRIED

All types of fasts can have therapeutic benefit.

pages: 284 words: 84,169

Talk on the Wild Side
by Lane Greene
Published 15 Dec 2018

(Actual correction by parents – “it’s brought, sweetie…” – seems to play much less of a role.) * So human children seem neither to blindly follow the rules they learn, nor to merely infer from lots of data. The rules get them going quickly, making useful sentences of their own after learning from relatively sparse data. And the data-gathering approach lets them gradually store up and recall some of the more intricate and rarer bits of the language they need. This is an absolutely stunning ability, given that we’re talking about kids who cannot yet tie shoelaces. Yet since every cognitively typical child does it, it’s not a miracle, even though it looks like one, a testament to the power of the human language faculty.

pages: 713 words: 93,944

Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement
by Eric Redmond , Jim Wilson and Jim R. Wilson
Published 7 May 2012

As you recall, each row has text:, revision:author, and revision:comment columns. The links table has no such regularity. Each row may have one column or hundreds. And the variety of column names is as diverse as the row keys themselves (titles of Wikipedia articles). That’s OK! HBase is a so-called sparse data store for exactly this reason. To find out just how many rows are now in your table, you can use the count command.

hbase> count 'wiki', INTERVAL => 100000, CACHE => 10000
Current count: 100000, row: Alexander wilson (vauxhall)
Current count: 200000, row: Bachelor of liberal studies
Current count: 300000, row: Brian donlevy
...
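HBase itself aside, the “sparse store” idea the excerpt describes — each row carries only the columns it actually has — can be modeled in a few lines of plain Python. This is a toy sketch, not the HBase API; the class and method names are invented:

```python
# Toy sparse wide-row store: a row with one link column and a row
# with hundreds each cost only what they actually store.
class SparseTable:
    def __init__(self):
        self.rows = {}  # row key -> {column name -> value}

    def put(self, row, col, val):
        self.rows.setdefault(row, {})[col] = val

    def get(self, row, col, default=None):
        return self.rows.get(row, {}).get(col, default)

    def count(self, interval=100000):
        # Rough analogue of the shell's count command: tally rows,
        # reporting progress every `interval` rows.
        n = 0
        for key in self.rows:
            n += 1
            if n % interval == 0:
                print(f"Current count: {n}, row: {key}")
        return n

table = SparseTable()
table.put("Albert Einstein", "links:Physics", "1")
table.put("Albert Einstein", "links:Violin", "1")
table.put("Stub article", "links:Physics", "1")
print(table.count(interval=1))
```

Reads of absent columns simply return the default, which is what makes sparsity cheap.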

Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage
by Zdravko Markov and Daniel T. Larose
Published 5 Apr 2007

The items are used as features to represent persons as vectors (rows in the person × item matrix). Then person vectors are clustered by using any clustering algorithm that we have discussed so far (e.g., k-means or EM). Finally, the missing values are taken from the cluster representation, where each person belongs. A problem in applying this approach involves the highly sparse data: In each person vector there are many missing values. The probabilistic algorithms can easily handle missing values; they are simply omitted from the computation of probabilities and the algorithm proceeds as usual. In similarity-based clustering such as k-means, a little adjustment is made for the missing feature values.
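The “little adjustment” for missing values in similarity-based clustering can be sketched as follows — my own illustration, not the book's code: compute the distance only over features both vectors actually have, then rescale by the fraction observed so sparse overlaps are not unfairly “close”:

```python
# Euclidean-style distance over only the co-present features of two
# sparse person vectors (dicts of item -> rating), rescaled as if the
# missing features contributed the average squared difference.
def sparse_distance(u, v):
    shared = [k for k in u if k in v]
    if not shared:
        return float("inf")  # nothing to compare
    sq = sum((u[k] - v[k]) ** 2 for k in shared)
    total = len(set(u) | set(v))
    return (sq * total / len(shared)) ** 0.5

alice = {"item1": 5.0, "item3": 2.0}
bob   = {"item1": 4.0, "item2": 1.0, "item3": 2.0}
print(sparse_distance(alice, bob))
```

A k-means variant would use this in place of plain Euclidean distance when assigning persons to clusters.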

pages: 404 words: 92,713

The Art of Statistics: How to Learn From Data
by David Spiegelhalter
Published 2 Sep 2019

MRP is no panacea—if a large number of respondents give systematically misleading answers and so do not represent their ‘cell’, then no amount of sophisticated statistical analyses will counter that bias. But it appears to be beneficial to use Bayesian modelling of every single voting area, and we shall see later that this has been spectacularly successful in exit polls conducted on the day of elections. Bayesian ‘smoothing’ can bring precision to very sparse data, and the techniques are being increasingly used for modelling, for example, how diseases spread over space and time. Bayesian learning is also now seen as a fundamental process of human awareness of the environment, in that we have prior expectations about what we will see in any context, and then only need to take notice of unexpected features in our vision which are then used to update our current perceptions.
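One common form of the Bayesian “smoothing” mentioned here is shrinkage of a sparse cell's raw rate toward a prior mean — a toy sketch with invented numbers, not an example from the book:

```python
# Beta-binomial shrinkage: a cell's raw rate is pulled toward the
# prior mean, and the fewer observations, the stronger the pull.
def smoothed_rate(successes, trials, prior_mean, prior_strength):
    return (successes + prior_strength * prior_mean) / (trials + prior_strength)

# A cell with 1 of 2 respondents vs. one with 100 of 200, both with a
# raw rate of 0.5 and a prior mean of 0.3:
print(smoothed_rate(1, 2, prior_mean=0.3, prior_strength=20))    # pulled near 0.3
print(smoothed_rate(100, 200, prior_mean=0.3, prior_strength=20))  # stays close to 0.5
```

The sparse cell's estimate moves most, which is exactly what gives precision to very sparse data.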

pages: 442 words: 94,734

The Art of Statistics: Learning From Data
by David Spiegelhalter
Published 14 Oct 2019

MRP is no panacea – if a large number of respondents give systematically misleading answers and so do not represent their ‘cell’, then no amount of sophisticated statistical analyses will counter that bias. But it appears to be beneficial to use Bayesian modelling of every single voting area, and we shall see later that this has been spectacularly successful in exit polls conducted on the day of elections. Bayesian ‘smoothing’ can bring precision to very sparse data, and the techniques are being increasingly used for modelling, for example, how diseases spread over space and time. Bayesian learning is also now seen as a fundamental process of human awareness of the environment, in that we have prior expectations about what we will see in any context, and then only need to take notice of unexpected features in our vision which are then used to update our current perceptions.

pages: 356 words: 102,224

Pale Blue Dot: A Vision of the Human Future in Space
by Carl Sagan
Published 8 Sep 1997

Early scientific speculation included fetid swamps crawling with monster amphibians, like the Earth in the Carboniferous Period; a world desert; a global petroleum sea; and a seltzer ocean dotted here and there with limestone-encrusted islands. While based on some scientific data, these “models” of Venus—the first dating from the beginnings of the century, the second from the 1930s, and the last two from the mid-1950s—were little more than scientific romances, hardly constrained by the sparse data available. Then, in 1956, a report was published in The Astrophysical Journal by Cornell H. Mayer and his colleagues. They had pointed a newly completed radio telescope, built in part for classified research, on the roof of the Naval Research Laboratory in Washington, D.C., at Venus and measured the flux of radio waves arriving at Earth.

pages: 502 words: 107,510

Natural Language Annotation for Machine Learning
by James Pustejovsky and Amber Stubbs
Published 14 Oct 2012

Applying this function to the bigrams from the IMDb corpus, we can see the following results:

>>> bigram_measures = nltk.collocations.BigramAssocMeasures()
>>> finder1 = BigramCollocationFinder.from_words(imdbcorpus.words())
>>> finder1.nbest(bigram_measures.pmi, 10)
[('".(', 'Check'), ('10th', 'Sarajevo'), ('16100', 'Patranis'), ('1st', 'Avenue'), ('317', 'Riverside'), ('5000', 'Reward'), ('6310', 'Willoughby'), ('750hp', 'tire'), ('ALEX', 'MILLER'), ('Aasoo', 'Bane')]
>>> finder1.apply_freq_filter(10) # look only at collocations that occur 10 times or more
>>> finder1.nbest(bigram_measures.pmi, 10)
[('United', 'States'), ('Los', 'Angeles'), ('Bhagwan', 'Shri'), ('martial', 'arts'), ('Lan', 'Yu'), ('Devi', 'Maa'), ('New', 'York'), ('qv', ')),'), ('qv', '))'), ('I', ")'")]
>>> finder1.apply_freq_filter(15)
>>> finder1.nbest(bigram_measures.pmi, 10)
[('Bhagwan', 'Shri'), ('Devi', 'Maa'), ('New', 'York'), ('qv', ')),'), ('qv', '))'), ('I', ")'"), ('no', 'longer'), ('years', 'ago'), ('none', 'other'), ('each', 'other')]

One issue with using this simple formula, however, involves the problem of sparse data. That is, the probabilities of observed rare events are overestimated, and the probabilities of unobserved rare events are underestimated. Researchers in computational linguistics have found ways to get around this problem to a certain extent, and we will return to this issue when we discuss ML algorithms in more detail in Chapter 7.
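PMI itself is simple to compute from counts. A minimal pure-Python version (my own sketch, not NLTK's implementation) also makes the sparse-data problem visible: a pair seen once gets a huge score even though one co-occurrence is weak evidence:

```python
import math

def pmi(pair_count, w1_count, w2_count, total):
    # PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) )
    p_pair = pair_count / total
    return math.log2(p_pair / ((w1_count / total) * (w2_count / total)))

N = 1_000_000  # total tokens (invented corpus size)
# A frequent, genuinely associated pair:
print(pmi(1000, 5000, 5000, N))
# Two hapaxes that co-occur once outscore it, which is the
# rare-event overestimation the passage describes.
print(pmi(1, 1, 1, N))
```

This is why frequency filters (like apply_freq_filter above) or likelihood-ratio scores are used in practice.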

pages: 363 words: 109,077

The Raging 2020s: Companies, Countries, People - and the Fight for Our Future
by Alec Ross
Published 13 Sep 2021

Dolber first learned of Rideshare Drivers United at an academic conference. He decided to attend one of its meetings in Los Angeles, and while there he crossed paths with Ivan Pardo. By that point, the organization had connected with only about five hundred of the estimated three hundred thousand rideshare drivers in California. Because Uber and Lyft publish only sparse data on their contractors, identifying and contacting new drivers was a laborious process. Up to that point, the group had recruited most of its members by canvassing parking lots at LAX. However, Dolber and Pardo developed a more scalable strategy for picking out rideshare drivers: Facebook. “Facebook is able to identify drivers better than anybody else,” Dolber said.

pages: 372 words: 110,208

Who We Are and How We Got Here: Ancient DNA and the New Science of the Human Past
by David Reich
Published 22 Mar 2018

This is evident from the fact that our data include at least three distinct East African Forager groups within Africa—one spanning the ancient Ethiopian and ancient Kenyan, a second contributing large fractions of the ancestry of the ancient foragers from the Zanzibar Archipelago and Malawi, and a third represented in the present-day Hadza.38 Based on the sparse data we had, we were not able to determine the date when these groups separated from one another. But given the extended geographic span and the antiquity of human occupation in this region, it would not be surprising if some of the differences among these groups dated back tens of thousands of years.

pages: 428 words: 103,544

The Data Detective: Ten Easy Rules to Make Sense of Statistics
by Tim Harford
Published 2 Feb 2021

.* Convergence continued throughout the 1950s and 1960s and sometimes into the 1970s.8 It’s a powerful demonstration of the way that even scientists measuring essential and unchanging facts filter the data to suit their preconceptions. This shouldn’t be entirely surprising. Our brains are always trying to make sense of the world around us based on incomplete information. The brain makes predictions about what it expects, and tends to fill in the gaps, often based on surprisingly sparse data. That is why we can understand a routine telephone conversation on a bad line—until the point at which genuinely novel information such as a phone number or street address is being spoken through the static. Our brains fill in the gaps—which is why we see what we expect to see and hear what we expect to hear, just as Millikan’s successors found what they expected to find.

The Deepest Map
by Laura Trethewey
Published 15 May 2023

Neither of them realized that he had just given her a task that would consume the rest of her life. Marie began to draw in a looser physiographic style that showed the seafloor at an oblique angle, the way the Rocky Mountains look through an airplane window on a transcontinental flight. It took all her geographical and geological training to translate the sparse data points into more understandable terrain. On land, geologists climb a mountain, look around, take measurements, and make a map. Marie didn’t have the opportunity to survey the seafloor with her own eyes; she had to decide what features to emphasize and create the “feel” of the new frontier rather than a set of recorded data points.62 “It was a very demanding technique where you had data.

pages: 541 words: 109,698

Mining the Social Web: Finding Needles in the Social Haystack
by Matthew A. Russell
Published 15 Jan 2011

Ironically (in the context of the current discussion), the calculations involved in computing the PMI lead it to score high-frequency words lower than low-frequency words, which is opposite of the desired effect. Therefore, it is a good measure of independence but not a good measure of dependence (i.e., a less than ideal choice for scoring collocations). It has also been shown that sparse data is a particular stumbling block for PMI scoring, and that other techniques such as the likelihood ratio tend to outperform it. Tapping into Your Gmail Google Buzz is a great source of clean textual data that you can mine, but it’s just one of many starting points. Since this chapter showcases Google technology, this section provides a brief overview of how to tap into your Gmail data so that you can mine the text of what may be many thousands of messages in your inbox.

pages: 404 words: 118,036

The Terraformers
by Annalee Newitz

With a pang, Scrubjay realized they had never heard the actual voice of Wasakeejack, which made the speech seem more like propaganda than history. As for the Trickster Squad’s greatest heroes—Wasakeejack, Muskrat, Irontooth, and Sky—they weren’t avengers from the worlds above and below. Instead, they were just a group of H. sapiens, known from a sparse data trail of court orders and arrest records related to trespassing, destruction of property, and land ownership rights. It’s not as if Scrubjay had ever really believed in the story about how the Squad erected new continents above the floods and repopulated the land with one magical decanter. Obviously the Battle of Saskatchewan was a fairy tale too.

pages: 397 words: 121,211

Coming Apart: The State of White America, 1960-2010
by Charles Murray
Published 1 Jan 2012

This variable has three values, drawing on the categories used in chapter 11: (1) de facto seculars—those either with no religion or professing a religion but attending worship services no more than once a year; (2) believers who profess a religion and attend services at least several times a year but do not qualify for the third category; and (3) those who attend services at least nearly every week and say that they have a strong affiliation with their religion. Community. Because of the GSS’s sparse data on measures of social and civic engagement during the 1990s and 2000s, we are restricted to an index of social trust, which sums the optimistic responses to the helpfulness, fairness, and trustworthiness questions discussed in chapter 14. The three items were coded so that the negative answer (e.g., “most people try to take advantage of you”) is scored as 0, the “it depends” answer is scored as 1, and the positive answer is scored as 2.

pages: 377 words: 21,687

Digital Apollo: Human and Machine in Spaceflight
by David A. Mindell
Published 3 Apr 2008

But the astronauts and Grumman were accustomed to systems from aircraft that had four gimbals, including Gemini, and nervous about the prospect of ‘‘forbidden attitudes’’ and the danger of gimbal lock. The issue hinged on the knotty problem of reliability—what was the likelihood that a gyro would fail, and hence require the redundancy of four instead of three? The trouble was, reliability is notoriously difficult to predict, and conflicts arose over how to interpret the sparse data. The IL provided its own estimates of reliability, based on experience with Polaris, showing that three gimbals would be reliable enough. But Grumman too developed estimates, which NASA found ‘‘highly pessimistic.’’ Grumman extrapolated reliability numbers from earlier missile programs and used them to argue for a redundant, four-gimbal platform.

Beautiful Data: The Stories Behind Elegant Data Solutions
by Toby Segaran and Jeff Hammerbacher
Published 1 Jul 2009

For example, Alice might change her status to “Busy on the phone,” and then later change it to “Off the phone, anybody wanna chat?” When Alice changes her status, we write it into her profile record so that her friends can see it. The profile table might look like Table 4-1. Notice that to support evolving web applications, we must allow for a flexible schema and sparse data; not every record will have a value for every field, and adding new fields must be cheap.

Table 4-1. User profile table

Username  FullName       Location            Status                              IM        BlogID  Photo   …
Alice     Alice Smith    Sunnyvale, CA       Off the phone, anybody wanna chat?  Alice345  3411    me.jpg  …
Bob       Bob Jones      Singapore           Eating dinner                                 5539            …
Charles   Charles Adams  New York, New York  Sleeping                                                      …

How should we update her profile record?
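The flexible-schema requirement — sparse records, cheap new fields — is easy to model with per-record maps. A toy sketch (not the system the chapter describes; the helper name is invented):

```python
# Sparse profile records: absent fields simply aren't stored, and a
# new field appears the first time any record sets it.
profiles = {
    "Alice": {"FullName": "Alice Smith", "Location": "Sunnyvale, CA",
              "Status": "Busy on the phone", "IM": "Alice345"},
    "Bob":   {"FullName": "Bob Jones", "Location": "Singapore"},
}

def update(user, field, value):
    profiles.setdefault(user, {})[field] = value

update("Alice", "Status", "Off the phone, anybody wanna chat?")
update("Bob", "BlogID", 5539)  # brand-new field, no schema change needed
print(profiles["Alice"]["Status"])
print(profiles["Bob"].get("Photo"))  # sparse: missing fields are just absent
```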

Amritsar 1919: An Empire of Fear and the Making of a Massacre
by Kim Wagner
Published 26 Mar 2019

There were just two women listed, namely Bibi Har Kaur and Masammat Bisso, which reflected the fact that women rarely joined such large gatherings. Of the overwhelmingly male list, fifteen were fifteen years or younger, while thirty-two were fifty or over, and the youngest was eight while the oldest was eighty years old.56 Combined with the sparse data from other supplementary records, this list provided the most comprehensive reflection of the composition of the crowd on 13 April. While the exact number of people who were killed at Jallianwala Bagh on 13 April would never be known, the figure of 379 (or 376) was certainly too low and reflected only those victims whose identity was confirmed.

pages: 626 words: 167,836

The Technology Trap: Capital, Labor, and Power in the Age of Automation
by Carl Benedikt Frey
Published 17 Jun 2019

Trade emerged across continents, and new goods were discovered and consumed that had previously been unknown: colonial goods like sugar, spices, tea, tobacco, and rice—to name a few—were shipped distances that mankind had once not known existed. Though empirical evidence on the rise of international trade is sparse, data for the period 1622–1700 shows that British imports and exports doubled. The growing importance of trade is similarly suggested by the rapid expansion of shipping. Between 1470 and the early nineteenth century, the merchant fleet of Western Europe grew sevenfold.28 As many of the colonial goods and other imports became attainable for a growing share of the population, people started to drink more tea, often sweetened with sugar; bought more luxurious clothing; and discovered new spices for their meals.

pages: 635 words: 186,208

House of Suns
by Alastair Reynolds
Published 16 Apr 2008

Whatever it is is large, and it is moving towards us.’ Dalliance pushed her faculties to the limit, lowering her detection thresholds now that I had independent evidence that something else was lurking in the cloud. In a few moments, something appeared in the displayer - a hazy blob, framed in a box and accompanied by the exceedingly sparse data my ship had managed to extract. The object was well camouflaged but large - five or six kilometres wide - and Hesperus had been right about it coming nearer. ‘It could be a big ship, or a big ship carrying a Homunculus weapon, or just one of the weapons on its own,’ I said. ‘I see smaller signals grouped around it - other ships, perhaps.’

HBase: The Definitive Guide
by Lars George
Published 29 Aug 2011

This will improve the performance of the query significantly, since it uses a Scan internally, selecting only the mapped column families. If you have a sparsely set family, this will only scan the much smaller files on disk, as opposed to running a job that has to scan everything just to filter out the sparse data. Mapping an existing table requires the Hive EXTERNAL keyword, which is also used in other places to access data stored in unmanaged Hive tables, that is, those that are not under Hive’s control:

hive> CREATE EXTERNAL TABLE hbase_table_2(key int, value string)
      STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
      TBLPROPERTIES("hbase.table.name" = "<existing-table-name>");

External tables are not deleted when the table is dropped within Hive.

Four Battlegrounds
by Paul Scharre
Published 18 Jan 2023

Rather than train a model to identify the broad category “snake,” which would not be helpful for determining whether or not the snake was poisonous, the iNaturalist challenge identified specific species. From a technical standpoint, the iNaturalist challenge also pushed the boundaries in training models on sparse data. While the 2018 dataset included 450,000 images across 8,000 categories, the training images were not evenly distributed across categories. Some categories had hundreds of training images while rarer species had only a few dozen images. (For comparison, ImageNet’s goal is an average of 1,000 images per category.)

pages: 933 words: 205,691

Hadoop: The Definitive Guide
by Tom White
Published 29 May 2009

[126] On regionserver crash, when running on an older version of Hadoop, edits written to the commit log kept in HDFS were not recoverable, as files that had not been properly closed lost all edits no matter how much had been written to them at the time of the crash.
[127] Yes, this file is named for Hadoop, though it’s for setting up HBase metrics.
[128] “Column-Stores for Wide and Sparse Data” by Daniel J. Abadi.

Chapter 14. ZooKeeper

So far in this book, we have been studying large-scale data processing. This chapter is different: it is about building general distributed applications using Hadoop’s distributed coordination service, called ZooKeeper. Writing distributed applications is hard.

pages: 795 words: 215,529

Genius: The Life and Science of Richard Feynman
by James Gleick
Published 1 Jan 1992

Across the continent, where the Jet Propulsion Laboratory in Pasadena served as the army’s main collaborator in rocket research, a team was struggling with the task of tracking the satellite’s course. They used a room-size IBM 704 digital computer. It was temperamental. They entered the primitively sparse data available for tracking the metal can that the army’s rocket had hurled forward: the frequency of the radio signal, changing Doppler-fashion as the velocity in the line of flight changed; the time of disappearance from the observers at Cape Canaveral; observations from other tracking stations. The JPL team had learned that small variations in the computer’s input caused enormous variations in its output.

pages: 796 words: 223,275

The WEIRDest People in the World: How the West Became Psychologically Peculiar and Particularly Prosperous
by Joseph Henrich
Published 7 Sep 2020

The relationships between the prevalence of first cousin marriage in regions of Spain, Italy, France, and Turkey and four dimensions of psychology: (A) Individualism-Independence, (B) Conformity-Obedience, (C) Impersonal Trust, and (D) Impersonal Fairness.

Although the analyses displayed in Figure 7.2 are based on sparse data from various corners of Europe that received lower dosages of the MFP for idiosyncratic historical reasons, they nevertheless illuminate a portion of the pathway that runs from the MFP through the historical dissolution of intensive kinship and into the minds of contemporary Europeans. Now let’s zoom in even closer to focus on an enduring puzzle in the social sciences: the Italian enigma.

pages: 1,034 words: 241,773

Enlightenment Now: The Case for Reason, Science, Humanism, and Progress
by Steven Pinker
Published 13 Feb 2018

The peaks in the graph correspond to mass killings in the Indonesian anti-Communist “year of living dangerously” (1965–66, 700,000 deaths), the Chinese Cultural Revolution (1966–75, 600,000), Tutsis against Hutus in Burundi (1965–73, 140,000), the Bangladesh War of Independence (1971, 1.7 million), north-against-south violence in Sudan (1956–72, 500,000), Idi Amin’s regime in Uganda (1972–79, 150,000), Pol Pot’s regime in Cambodia (1975–79, 2.5 million), killings of political enemies in Vietnam (1965–75, 500,000), and more recent massacres in Bosnia (1992–95, 225,000), Rwanda (1994, 700,000), and Darfur (2003–8, 373,000).15 The barely perceptible swelling from 2014 to 2016 includes the atrocities that contribute to the impression that we are living in newly violent times: at least 4,500 Yazidis, Christians, and Shiite civilians killed by ISIS; 5,000 killed by Boko Haram in Nigeria, Cameroon, and Chad; and 1,750 killed by Muslim and Christian militias in the Central African Republic.16 One can never use the word “fortunately” in connection with the killing of innocents, but the numbers in the 21st century are a fraction of those in earlier decades. Of course, the numbers in a dataset cannot be interpreted as a direct readout of the underlying risk of war. The historical record is especially scanty when it comes to estimating any change in the likelihood of very rare but very destructive wars.17 To make sense of sparse data in a world whose history plays out only once, we need to supplement the numbers with knowledge about the generators of war, since, as the UNESCO motto notes, “Wars begin in the minds of men.” And indeed we find that the turn away from war consists in more than just a reduction in wars and war deaths; it also may be seen in nations’ preparations for war.

pages: 764 words: 261,694

The Elements of Statistical Learning (Springer Series in Statistics)
by Trevor Hastie , Robert Tibshirani and Jerome Friedman
Published 25 Aug 2009

Linear models were largely developed in the precomputer age of statistics, but even in today’s computer era there are still good reasons to study and use them. They are simple and often provide an adequate and interpretable description of how the inputs affect the output. For prediction purposes they can sometimes outperform fancier nonlinear models, especially in situations with small numbers of training cases, low signal-to-noise ratio or sparse data. Finally, linear methods can be applied to transformations of the inputs and this considerably expands their scope. These generalizations are sometimes called basis-function methods, and are discussed in Chapter 5. In this chapter we describe linear methods for regression, while in the next chapter we discuss linear methods for classification.
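For a single input, the least-squares fit the passage alludes to has a simple closed form, which makes “linear methods for regression” concrete — a minimal sketch with invented data, not code from the book:

```python
# Simple linear regression via the closed-form least-squares solution:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x).
def fit_line(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

With so few parameters to estimate, such a model stays stable even when the training data are small or noisy, which is the advantage the authors describe.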

pages: 2,466 words: 668,761

Artificial Intelligence: A Modern Approach
by Stuart Russell and Peter Norvig
Published 14 Jul 2019

The Logic of Decision (2nd edition). University of Chicago Press.
Jeffreys, H. (1948). Theory of Probability. Oxford.
Jelinek, F. (1976). Continuous speech recognition by statistical methods. Proc. IEEE, 64, 532–556.
Jelinek, F. and Mercer, R. L. (1980). Interpolated estimation of Markov source parameters from sparse data. In Proc. Workshop on Pattern Recognition in Practice.
Jennings, H. S. (1906). Behavior of the Lower Organisms. Columbia University Press.
Jenniskens, P., Betlem, H., Betlem, J., and Barifaijo, E. (1994). The Mbale meteorite shower. Meteoritics, 29, 246–254.
Jensen, F. V. (2007). Bayesian Networks and Decision Graphs.