information retrieval


description: activity of obtaining information resources relevant to an information need from a collection of information resources

174 results

Understanding search engines: mathematical modeling and text retrieval
by Michael W. Berry and Murray Browne
Published 15 Jan 2005

The work of William Frakes and Ricardo Baeza-Yates (eds.), Information Retrieval: Data Structures & Algorithms, a 1992 collection of journal articles on various related topics, Gerald Kowalski's (1997) Information Retrieval Systems: Theory and Implementation, a broad overview of information retrieval systems, and Ricardo Baeza-Yates and Berthier Ribeiro-Neto's (1999) Modern Information Retrieval, a computer-science perspective of information retrieval, are all fine textbooks on the topic, but understandably they lack the gritty details of the mathematical computations needed to build more successful search engines.

The work of William Frakes and Ricardo Baeza-Yates (eds.), Information Retrieval: Data Structures & Algorithms, a 1992 collection of journal articles on various related topics, and Gerald Kowalski's (1997) Information Retrieval Systems: Theory and Implementation, a broad overview of information retrieval systems, are fine textbooks on the topic, but both understandably lack the gritty details of the mathematical computations needed to build more successful search engines. With this in mind, USE does not provide an overview of information retrieval systems but prefers to assume a supplementary role to the aforementioned books.

Further Reading

but it does address the subject of IR (indexing, queries, and index construction), albeit from a unique compression perspective. One of the first books that covers various information retrieval topics was actually a collection of survey papers edited by William B. Frakes and Ricardo Baeza-Yates. Their 1992 book [30], Information Retrieval: Data Structures & Algorithms, contains several seminal works in this area, including the use of signature-based text retrieval methods by Christos Faloutsos and the development of ranking algorithms by Donna Harman. Ricardo Baeza-Yates and Berthier Ribeiro-Neto's [2] Modern Information Retrieval is another collection of well-integrated research articles from various authors with a computer-science perspective of information retrieval.

9.2 Computational Methods and Software

Two SIAM Review articles (Berry, Dumais, and O'Brien in 1995 [8] and Berry, Drmac, and Jessup in 1999 [7]) demonstrate the use of linear algebra for vector space IR models such as LSI.
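The SIAM Review articles cited here describe vector space IR models such as LSI in linear-algebra terms. As a rough illustration (a sketch, not the book's code, over a made-up toy term-document matrix), LSI via a truncated SVD looks like this in Python:

    # Minimal LSI sketch: factor a term-document matrix with SVD, keep the
    # top-k singular triplets, and rank documents against a query in the
    # reduced "concept" space. All data here is a hypothetical toy example.
    import numpy as np

    # Rows = terms ("retrieval", "search", "engine", "matrix"); columns = documents.
    A = np.array([[2, 0, 1, 0],
                  [1, 1, 0, 0],
                  [0, 2, 0, 1],
                  [0, 0, 1, 2]], dtype=float)

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

    # Fold the query "retrieval search" into concept space: q_k = S_k^-1 U_k^T q
    q = np.array([1, 1, 0, 0], dtype=float)
    q_k = np.diag(1.0 / sk) @ Uk.T @ q

    # Cosine similarity between the query and each document in concept space.
    D = Vtk.T
    scores = (D @ q_k) / (np.linalg.norm(D, axis=1) * np.linalg.norm(q_k))
    print(np.argsort(-scores))   # document indices, best match first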

pages: 298 words: 43,745

Understanding Sponsored Search: Core Elements of Keyword Advertising
by Jim Jansen
Published 25 Jul 2011

Journal of the American Society for Information Science and Technology, vol. 56(6), pp. 559–570. [46] Belkin, N. J. 1993. “Interaction with Texts: Information Retrieval as Information-Seeking Behavior.” In Information retrieval ’93. Von der Modellierung zur Anwendung. Konstanz, Germany: Universitaetsverlag Konstanz, pp. 55–66. [47] Saracevic, T. 1997. “Extension and Application of the Stratified Model of Information Retrieval Interaction.” In the Annual Meeting of the American Society for Information Science, Washington, DC, pp. 313–327. [48] Saracevic, T. 1996. “Modeling Interaction in Information Retrieval (IR): A Review and Proposal.” In the 59th American Society for Information Science Annual Meeting, Baltimore, MD, pp. 3–9. [49] Belkin, N., Cool, C., Croft, W.

B., and Callan, J. 1993. "The Effect of Multiple Query Representations on Information Retrieval Systems." In 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 339–346. [50] Belkin, N., Cool, C., Kelly, D., Lee, H.-J., Muresan, G., Tang, M.-C., and Yuan, X.-J. 2003. "Query Length in Interactive Information Retrieval." In 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada, pp. 205–212. [51] Cronen-Townsend, S., Zhou, Y., and Croft, W.

Information overload: refers to the difficulty a person can have understanding an issue and making decisions that can be caused by the presence of too much information (see Chapter 5 customers). Information retrieval: a field of study related to information extraction. Information retrieval is about developing systems to effectively index and search vast amounts of data (Source: SearchEngineDictionary.com) (see Chapter 3 keywords). Information scent: cues related to the desired outcome (see Chapter 3 keywords). Information searching: refers to people’s interaction with information-retrieval systems, ranging from adopting search strategy to judging the relevance of information retrieved (see Chapter 3 keywords). Insertion: actual placement of an ad in a document, as recorded by the ad server (Source: IAB) (see Chapter 2 model).

Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage
by Zdravko Markov and Daniel T. Larose
Published 5 Apr 2007

To do this we look into the technology for text analysis and search developed earlier in the area of information retrieval and extended recently with ranking methods based on web hyperlink structure. All that may be seen as a preprocessing step in the overall process of data mining the web content, which provides the input to machine learning methods for extracting knowledge from hypertext data, discussed in the second part of the book.

CHAPTER 1: INFORMATION RETRIEVAL AND WEB SEARCH
Web Challenges; Crawling the Web; Indexing and Keyword Search; Evaluating Search Quality; Similarity Search

WEB CHALLENGES

As originally proposed by Tim Berners-Lee [1], the Web was intended to improve the management of general information about accelerators and experiments at CERN.
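The "Indexing and Keyword Search" section previewed above centers on the inverted index. As a minimal sketch (not the book's code; the three-document corpus and naive tokenization are hypothetical), conjunctive keyword search over an inverted index looks like this:

    # Build a term -> posting-set inverted index, then answer AND queries
    # by intersecting the postings of the query terms.
    from collections import defaultdict

    docs = {
        1: "the web was intended to improve information management",
        2: "search engines crawl and index the web",
        3: "keyword search uses an inverted index",
    }

    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)

    def search(query):
        """Return ids of documents containing every query term."""
        postings = [index.get(t, set()) for t in query.lower().split()]
        return set.intersection(*postings) if postings else set()

    print(search("web search"))   # -> {2}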

This idea was implemented in one of the first search engines, the World Wide Web Worm system [4], and later used by Lycos and Google. This allows search engines to increase their indices with pages that have never been crawled, are unavailable, or include nontextual content that cannot be indexed, such as images and programs. As reported by Brin and Page [5] in 1998, Google indexed 24 million pages and over 259 million anchors.

EVALUATING SEARCH QUALITY

Information retrieval systems do not have formal semantics (such as that of databases), and consequently, the query and the set of documents retrieved (the response of the IR system) cannot be mapped one to one.
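Because there is no one-to-one mapping between a query and a correct result set, search quality is conventionally measured against human relevance judgments, most simply with precision and recall. A small sketch with hypothetical judged and retrieved sets:

    # Precision: fraction of retrieved docs that are relevant.
    # Recall: fraction of relevant docs that were retrieved.
    relevant  = {1, 3, 5, 7}        # docs judged relevant for some query
    retrieved = [3, 2, 5, 9, 1]     # docs the engine returned, ranked

    hits = [d for d in retrieved if d in relevant]
    precision = len(hits) / len(retrieved)
    recall = len(hits) / len(relevant)
    f1 = 2 * precision * recall / (precision + recall)
    print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")  # P=0.60 R=0.75 F1=0.67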

CONTENTS

PART I: WEB STRUCTURE MINING

1. Information Retrieval and Web Search: Web Challenges (Web Search Engines; Topic Directories; Semantic Web); Crawling the Web (Web Basics; Web Crawlers); Indexing and Keyword Search (Document Representation; Implementation Considerations; Relevance Ranking; Advanced Text Search; Using the HTML Structure in Keyword Search); Evaluating Search Quality; Similarity Search (Cosine Similarity; Jaccard Similarity; Document Resemblance)

2. Hyperlink-Based Ranking: Introduction; Social Networks Analysis; PageRank; Authorities and Hubs; Link-Based Similarity Search; Enhanced Techniques for Page Ranking

PART II: WEB CONTENT MINING

3. Clustering: Introduction; Hierarchical Agglomerative Clustering; k-Means Clustering; Probability-Based Clustering (Finite Mixture Problem; Classification Problem; Clustering Problem); Collaborative Filtering (Recommender Systems)

4. Evaluating Clustering: Approaches to Evaluating Clustering; Similarity-Based Criterion Functions; Probabilistic Criterion Functions; MDL-Based Model and Feature Evaluation (Minimum Description Length Principle; MDL-Based Model Evaluation; Feature Selection); Classes-to-Clusters Evaluation; Precision, Recall, and F-Measure; Entropy

5. Classification: General Setting and Evaluation Techniques; Nearest-Neighbor Algorithm; Feature Selection; Naive Bayes Algorithm; Numerical Approaches; Relational Learning

PART III: WEB USAGE MINING

6. Introduction to Web Usage Mining: Definition of Web Usage Mining; Cross-Industry Standard Process for Data Mining; Clickstream Analysis; Web Server Log Files (Remote Host, Date/Time, HTTP Request, Status Code, and Transfer Volume Fields; Common Log Format; Identification and Authuser Fields; Extended Common Log Format; Referrer and User Agent Fields; Example of a Web Log Record; Microsoft IIS Log Format; Auxiliary Information)

7. Preprocessing for Web Usage Mining: Need for Preprocessing the Data; Data Cleaning and Filtering; Page Extension Exploration and Filtering; De-Spidering the Web Log File; User Identification; Session Identification; Path Completion; Directories and the Basket Transformation; Further Data Preprocessing Steps

8. Exploratory Data Analysis for Web Usage Mining: Introduction; Number of Visit Actions; Session Duration; Relationship between Visit Actions and Session Duration; Average Time per Page; Duration for Individual Pages

9. Modeling for Web Usage Mining (Clustering, Association, and Classification): Introduction; Modeling Methodology; Definition of Clustering; The BIRCH Clustering Algorithm; Affinity Analysis and the A Priori Algorithm; Discretizing the Numerical Variables: Binning; Applying the A Priori Algorithm to the CCSU Web Log Data; Classification and Regression Trees; The C4.5 Algorithm

PREFACE: DEFINING DATA MINING THE WEB

By data mining the Web, we refer to the application of data mining methodologies, techniques, and models to the variety of data forms, structures, and usage patterns that comprise the World Wide Web.

pages: 593 words: 118,995

Relevant Search: With Examples Using Elasticsearch and Solr
by Doug Turnbull and John Berryman
Published 30 Apr 2016

In reality, there is a discipline behind relevance: the academic field of information retrieval. It has generally accepted practices to improve relevance broadly across many domains. But you've seen that what's relevant depends a great deal on your application. Given that, as we introduce information retrieval, think about how its general findings can be used to solve your narrower relevance problem.[2]

2 For an introduction to the field of information retrieval, we highly recommend the classic text Introduction to Information Retrieval by Christopher D. Manning et al. (Cambridge University Press, 2008); see http://nlp.stanford.edu/IR-book/.

That information will solve your problem, and you’ll move on. In information retrieval, relevance is defined as the practice of returning search results that most satisfy the user’s information needs. Further, classic information retrieval focuses on text ranking. Many findings in information retrieval try to measure how likely a given article is going to be relevant to a user’s text search. You’ll learn about several of these invaluable methods throughout this book—as many of these findings are implemented in open source search engines. To discover better text-searching methods, information retrieval researchers benchmark different strategies by using test collections of articles.

Example of making a relevance judgment for the query "Rambo" in Quepid, a judgment list management application

Using judgment lists, researchers aim to measure whether changes to text relevance calculations improve the overall relevance of the results across every test collection. To classic information retrieval, a solution that improves a dozen text-heavy test collections 1% overall is a success. Rather than focusing on one particular problem in depth, information retrieval focuses on solving search for a broad set of problems.

1.3.2. Can we use information retrieval to solve relevance?

You've already seen there's no silver bullet. But information retrieval does seem to systematically create relevance solutions. So ask yourself: Do these insights apply to your application?
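The judgment-list workflow described above (judge, change, re-measure across the whole collection) reduces to averaging a ranking metric over every judged query. A sketch with hypothetical judgments and two hypothetical relevance configurations:

    # Average precision@k over all queries in a judgment list, so two
    # configurations can be compared collection-wide.
    def precision_at_k(ranked, judged_relevant, k):
        return sum(1 for d in ranked[:k] if d in judged_relevant) / k

    def mean_precision_at_k(results_by_query, judgments, k):
        scores = [precision_at_k(results_by_query[q], judgments[q], k)
                  for q in judgments]
        return sum(scores) / len(scores)

    judgments = {"rambo": {"doc1", "doc4"}, "terminator": {"doc2"}}
    baseline  = {"rambo": ["doc3", "doc1"], "terminator": ["doc2", "doc5"]}
    candidate = {"rambo": ["doc1", "doc4"], "terminator": ["doc2", "doc1"]}

    print(mean_precision_at_k(baseline, judgments, k=2))    # 0.50
    print(mean_precision_at_k(candidate, judgments, k=2))   # 0.75: an improvement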

Data Mining: Concepts and Techniques
by Jiawei Han , Micheline Kamber and Jian Pei
Published 21 Jun 2011

Others include Machine Learning (ML), Pattern Recognition (PR), Artificial Intelligence Journal (AI), IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), and Cognitive Science. Textbooks and reference books on information retrieval include Introduction to Information Retrieval by Manning, Raghavan, and Schütze [MRS08]; Information Retrieval: Implementing and Evaluating Search Engines by Büttcher, Clarke, and Cormack [BCC10]; Search Engines: Information Retrieval in Practice by Croft, Metzler, and Strohman [CMS09]; Modern Information Retrieval: The Concepts and Technology Behind Search by Baeza-Yates and Ribeiro-Neto [BYRN11]; and Information Retrieval: Algorithms and Heuristics by Grossman and Frieder [GR04]. Information retrieval research is published in the proceedings of several information retrieval and Web search and mining conferences, including the International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), the International World Wide Web Conference (WWW), the ACM International Conference on Web Search and Data Mining (WSDM), the ACM Conference on Information and Knowledge Management (CIKM), the European Conference on Information Retrieval (ECIR), the Text Retrieval Conference (TREC), and the ACM/IEEE Joint Conference on Digital Libraries (JCDL).

The data cube model not only facilitates OLAP in multidimensional databases but also promotes multidimensional data mining (see Section 1.3.2).

1.5.4. Information Retrieval

Information retrieval (IR) is the science of searching for documents or information in documents. Documents can be text or multimedia, and may reside on the Web. The differences between traditional information retrieval and database systems are twofold: information retrieval assumes that (1) the data under search are unstructured, and (2) the queries are formed mainly by keywords, which do not have complex structures (unlike SQL queries in database systems). The typical approaches in information retrieval adopt probabilistic models. For example, a text document can be regarded as a bag of words, that is, a multiset of words appearing in the document.
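The bag-of-words (multiset) view described here maps directly onto Python's collections.Counter; a tiny illustration over a made-up sentence, not from the book:

    from collections import Counter

    doc = "information retrieval searches documents and documents hold information"
    bag = Counter(doc.lower().split())   # word order gone, multiplicity kept
    print(bag["documents"], bag["information"])   # 2 2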

Other sources of publication include major information retrieval, information systems, and Web journals, such as Journal of Information Retrieval, ACM Transactions on Information Systems (TOIS), Information Processing and Management, Knowledge and Information Systems (KAIS), and IEEE Transactions on Knowledge and Data Engineering (TKDE).

Text Analytics With Python: A Practical Real-World Approach to Gaining Actionable Insights From Your Data
by Dipanjan Sarkar
Published 1 Dec 2016

Important Concepts

Our main objectives in this chapter are to understand text similarity and clustering. Before moving on to the actual techniques and algorithms, this section will discuss some important concepts related to information retrieval, document similarity measures, and machine learning. Even though some of these concepts might be familiar to you from the previous chapters, all of them will be useful to us as we gradually journey through this chapter. Without further ado, let's get started.

Information Retrieval (IR)

Information retrieval (IR) is the process of retrieving or fetching relevant sources of information from a corpus or set of entities that hold information based on some demand.

I recommend using gensim's hellinger() function, available in the gensim.matutils module (which uses the same logic as our preceding function), when building large-scale systems for analyzing similarity.

Okapi BM25 Ranking

There are several techniques that are quite popular in information retrieval and search engines, including PageRank and Okapi BM25. The acronym BM stands for best matching. The technique is often called simply BM25, but for the sake of completeness I refer to it as Okapi BM25: although the concepts behind the BM25 function were originally purely theoretical, City University, London built the Okapi information retrieval system in the 1980s–90s, which implemented the technique to retrieve documents from actual real-world data. The technique can also be described as a framework or model based on probabilistic relevance, developed by several people in the 1970s–80s, including computer scientists S.
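As a hedged illustration of the scoring function itself (a from-scratch sketch over a toy corpus, not gensim's or Okapi's implementation), BM25 with the conventional parameters k1 = 1.5 and b = 0.75 can be written as:

    import math
    from collections import Counter

    docs = [d.lower().split() for d in (
        "okapi bm25 ranks documents for a query",
        "pagerank ranks pages by links",
        "bm25 is a probabilistic relevance model",
    )]
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))   # document frequency per term

    def bm25(query, doc, k1=1.5, b=0.75):
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            n = df.get(term, 0)
            idf = math.log((N - n + 0.5) / (n + 0.5) + 1)   # smoothed IDF
            norm = k1 * (1 - b + b * len(doc) / avgdl)      # doc-length normalization
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        return score

    for d in docs:
        print(round(bm25("bm25 relevance", d), 3), " ".join(d))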

Chapter 4: Text Classification: Automated Text Classification; Text Classification Blueprint; Text Normalization; Feature Extraction (Bag of Words Model; TF-IDF Model; Advanced Word Vectorization Models); Classification Algorithms (Multinomial Naïve Bayes; Support Vector Machines); Evaluating Classification Models; Building a Multi-Class Classification System; Applications and Uses; Summary

Chapter 5: Text Summarization: Text Summarization and Information Extraction; Important Concepts (Documents; Text Normalization; Feature Extraction; Feature Matrix; Singular Value Decomposition); Keyphrase Extraction (Collocations; Weighted Tag-Based Phrase Extraction); Topic Modeling (Latent Semantic Indexing; Latent Dirichlet Allocation; Non-negative Matrix Factorization; Extracting Topics from Product Reviews); Automated Document Summarization (Latent Semantic Analysis; TextRank; Summarizing a Product Description); Summary

Chapter 6: Text Similarity and Clustering: Important Concepts (Information Retrieval (IR); Feature Engineering; Similarity Measures; Unsupervised Machine Learning Algorithms); Text Normalization; Feature Extraction; Text Similarity; Analyzing Term Similarity (Hamming Distance; Manhattan Distance; Euclidean Distance; Levenshtein Edit Distance; Cosine Distance and Similarity); Analyzing Document Similarity (Cosine Similarity; Hellinger-Bhattacharya Distance; Okapi BM25 Ranking); Document Clustering (Clustering Greatest Movies of All Time; K-means Clustering; Affinity Propagation; Ward's Agglomerative Hierarchical Clustering); Summary

Chapter 7: Semantic and Sentiment Analysis: Semantic Analysis (Exploring WordNet; Understanding Synsets; Analyzing Lexical Semantic Relations; Word Sense Disambiguation; Named Entity Recognition; Analyzing Semantic Representations; Propositional Logic; First Order Logic); Sentiment Analysis (Sentiment Analysis of IMDb Movie Reviews; Setting Up Dependencies; Preparing Datasets; Supervised Machine Learning Technique; Unsupervised Lexicon-based Techniques; Comparing Model Performances); Summary

About the Author: Dipanjan Sarkar is a data scientist at Intel, the world's largest silicon company, which is on a mission to make the world more connected and productive.

pages: 263 words: 75,610

Delete: The Virtue of Forgetting in the Digital Age
by Viktor Mayer-Schönberger
Published 1 Jan 2009


The likely medium-term outcome is that storage capacity will continue to double and storage costs to halve about every eighteen to twenty-four months, leaving us with an abundance of cheap digital storage.

Easy Retrieval

Remembering is more than committing information to memory. It includes the ability to retrieve that information later easily and at will. As humans, we are all too familiar with the challenges of information retrieval from our brain's long-term memory. External analog memory, like books, holds huge amounts of information, but finding a particular piece of information in it is difficult and time-consuming. Much of the latent value of stored information remains trapped, unlikely to be utilized. Even though we may have stored it, analog information that cannot be retrieved easily is, in practical terms, no different from having been forgotten.

In contrast, retrieval from digital memory is vastly easier, cheaper, and swifter: a few words in the search box, a click, and within a few seconds the matching information is retrieved and presented in neatly formatted lists. Such trouble-free retrieval greatly enhances the value of information. To be sure, humans have always tried to make information retrieval easier and less cumbersome, but they faced significant hurdles. Take written information. The switch from tablets and scrolls to bound books helped in keeping information together, and certainly improved accessibility, but it did not revolutionize retrieval. Similarly, libraries helped amass information, but didn't do as much in tracking it down.

pages: 290 words: 73,000

Algorithms of Oppression: How Search Engines Reinforce Racism
by Safiya Umoja Noble
Published 8 Jan 2018

Saracevic notes that “the domain of information science is the transmission of the universe of human knowledge in recorded form, centering on manipulation (representation, organization, and retrieval) of information, rather than knowing information.”43 This foregrounds the ways that representations in search engines are decontextualized in one specific type of information-retrieval process, particularly for groups whose images, identities, and social histories are framed through forms of systemic domination. Although there is a long, broad, and historical context for addressing categorizations, the impact of learning from these traditions has not yet been fully realized.44 Attention to “the universe of human knowledge” is suggestive for contextualizing information-retrieval practices this way, leading to inquiries into the ways current information-retrieval practices on the web, via commercial search engines, make some types of information available and suppress others.

For the most part, many of these processes have been automated, or they happen through graphical user interfaces (GUIs) that allow people who are not programmers (i.e., not working at the level of code) to engage in sharing links to and from websites.31 Research shows that users typically use very few search terms when seeking information in a search engine and rarely use advanced search queries, as most queries are different from traditional offline information-seeking behavior.32 This front-end behavior of users appears to be simplistic; however, the information retrieval systems are complex, and the formulation of users’ queries involves cognitive and emotional processes that are not necessarily reflected in the system design.33 In essence, while users use the simplest queries they can in a search box because of the way interfaces are designed, this does not always reflect how search terms are mapped against more complex thought patterns and concepts that users have about a topic. This disjunction between, on the one hand, users’ queries and their real questions and, on the other, information retrieval systems makes understanding the complex linkages between the content of the results that appear in a search and their import as expressions of power and social relations of critical importance.

For this reason, it is important to study the social context of those who are organizing information and the potential impacts of the judgments inherent in informational organization processes. Information must be treated in a context; “it involves motivation or intentionality, and therefore it is connected to the expansive social context or horizon, such as culture, work, or problem-at-hand,” and this is fundamental to the origins of information science and to information retrieval.42 Information retrieval as a practice has become a highly commercialized industry, predicated on federally funded experiments and research initiatives, leading to the formation of profitable ventures such as Yahoo! and Google, and a focus on information relevance continues to be of importance to the field.

pages: 291 words: 77,596

Total Recall: How the E-Memory Revolution Will Change Everything
by Gordon Bell and Jim Gemmell
Published 15 Feb 2009

Doherty, A., C. Gurrin, G. Jones, and A. F. Smeaton. “Retrieval of Similar Travel Routes Using GPS Tracklog Place Names.” SIGIR 2006—Conference on Research and Development on Information Retrieval, Workshop on Geographic Information Retrieval, Seattle, Washington, August 6-11, 2006. Gurrin, C., A. F. Smeaton, D. Byrne, N. O’Hare, G. Jones, and N. O’Connor. “An Examination of a Large Visual Lifelog.” AIRS 2008—Asia Information Retrieval Symposium, Harbin, China, January 16-18, 2008. Lavelle, B., D. Byrne, C. Gurrin, A. F. Smeaton, and G. Jones. “Bluetooth Familiarity: Methods of Calculation, Applications and Limitations.”

“Physical Context for Just-in-Time Information Retrieval.” IEEE Transactions on Computers 52, no. 8 (August): 1011-14. ———. 1997. “The Wearable Remembrance Agent: A System for Augmented Memory.” Special Issue on Wearable Computing, Personal Technologies Journal 1:218-24. Rhodes, Bradley J. “Margin Notes: Building a Contextually Aware Associative Memory” (html), to appear in The Proceedings of the International Conference on Intelligent User Interfaces (IUI ’00), New Orleans, Louisiana, January 9-12, 2000. Rhodes, Bradley, and Pattie Maes. 2000. “Just-in-Time Information Retrieval Agents.” Special issue on the MIT Media Laboratory, IBM Systems Journal 39, nos. 3 and 4: 685-704.

Eighth RIAO Conference—Large-Scale Semantic Access to Content (Text, Image, Video and Sound), Pittsburgh, Pennsylvania, May 30-June 1, 2007. Lee, Hyowon, Alan F. Smeaton, Noel E. O’Connor, and Gareth J. F. Jones. “Adaptive Visual Summary of LifeLog Photos for Personal Information Management.” AIR 2006—First International Workshop on Adaptive Information Retrieval, Glasgow, UK, October 14, 2006. O’Conaire, C., N. O’Connor, A. F. Smeaton, and G. Jones. “Organizing a Daily Visual Diary Using Multi-Feature Clustering.” SPIE Electronic Imaging—Multimedia Content Access: Algorithms and Systems (EI121), San Jose, California, January 28-February 1, 2007. Smeaton, A.

pages: 193 words: 19,478

Memory Machines: The Evolution of Hypertext
by Belinda Barnet
Published 14 Jul 2013

He protested that he was doing neither ‘information retrieval’ nor ‘electrical engineering’, but a new thing somewhere in between, and that it should be recognized as a new field of research. In our interview he remembered that: After I’d given a talk at Stanford, [three angry guys] got me later outside at a table. They said, ‘All you’re talking about is information retrieval.’ I said no. They said, ‘YES, it is, we’re professionals and we know, so we’re telling you don’t know enough so stay out of it, ’cause goddamit, you’re bollocksing it all up. You’re in engineering, not information retrieval.’ (Engelbart 1999) Computers, in large part, were still seen as number crunchers, and computer engineers had no business talking about psychology and the human beings who used these machines.

As Engelbart told the author of this book in 1999, he was often told to mind his own business and keep off well-defined turf: After I’d given a talk at Stanford, [three angry guys] got me later outside at a table. They said, ‘All you’re talking about is information retrieval.’ I said no. They said, ‘YES, it is, we’re professionals and we know, so we’re telling you don’t know enough so stay out of it, ’cause goddamit, you’re bollocksing it all up. You’re in engineering, not information retrieval.’ (Engelbart 1999) My hero; the man who never knew too much about disciplinary confines, professional flocking rules and the mere retrieval of information; the man who straps bricks to pencils, who annoys the specialists, who insists on bollocksing up the computer world in all kinds of fascinating ways.

Gleick quotes a rather different assessment of Babbage from an early twentieth-century edition of the Dictionary of National Biography: Mathematician and scientific mechanician […] obtained government grant for making a calculating machine […] but the work of construction ceased, owing to disagreements with the engineer; offered the government an improved design, which was refused on grounds of expense […] Lucasian professor of mathematics, Cambridge, but delivered no lectures. (Cited in Gleick 2011, 121) In the words of the information retrievers, Babbage seems a resounding failure, no matter if he did (undeservedly, according to the insinuation) have Newton’s chair. Perhaps biography does not belong in dictionaries. Among other blessings that came to Babbage was one of the great friendships in intellectual history, with Augusta Ada King, Countess Lovelace.

pages: 666 words: 181,495

In the Plex: How Google Thinks, Works, and Shapes Our Lives
by Steven Levy
Published 12 Apr 2011

AltaVista’s actual search quality techniques—what determined the ranking of results—were based on traditional information retrieval (IR) algorithms. Many of those algorithms arose from the work of one man, a refugee from Nazi Germany named Gerard Salton, who had come to America, got a PhD at Harvard, and moved to Cornell University, where he cofounded its computer science department. Searching through databases using the same commands you’d use with a human—“natural language” became the term of art—was Salton’s specialty. During the 1960s, Salton developed a system that was to become a model for information retrieval. It was called SMART, supposedly an acronym for “Salton’s Magical Retriever of Text.”

Page’s brother, nine years older, was already in Silicon Valley, working for an Internet start-up. Page chose to work in the department’s Human-Computer Interaction Group. The subject would stand Page in good stead in the future with respect to product development, even though it was not in the HCI domain to figure out a new model of information retrieval. On his desk and permeating his conversations was Apple interface guru Donald Norman’s classic tome The Psychology of Everyday Things, the bible of a religion whose first, and arguably only, commandment is “The user is always right.” (Other Norman disciples, such as Jeff Bezos at Amazon.com, were adopting this creed on the web.)

DEC had been built on the minicomputer, a once innovative category now rendered a dinosaur by the personal computer revolution. “DEC was very much living in the past,” says Monier. “But they had small groups of people who were very forward-thinking, experimenting with lots of toys.” One of those toys was the web. Monier himself was no expert in information retrieval but a big fan of data in the abstract. “To me, that was the secret—data,” he says. What the data was telling him was that if you had the right tools, it was possible to treat everything in the open web like a single document. Even at that early date, the basic building blocks of web search had been already set in stone.

pages: 1,085 words: 219,144

Solr in Action
by Trey Grainger and Timothy Potter
Published 14 Sep 2014

To begin, we need to know how Solr matches home listings in the index to queries entered by users, as this is the basis for all search applications.

1.2.1. Information retrieval engine

Solr is built on Apache Lucene, a popular Java-based, open source information retrieval library. We'll save a detailed discussion of what information retrieval is for chapter 3. For now, we'll touch on the key concepts behind information retrieval, starting with the formal definition taken from one of the prominent academic texts on modern search concepts: Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).[1]

1 Christopher D.
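In practice, an application talks to Solr over HTTP. As a hedged sketch (the host, the core name "listings," and the field names below are assumptions for illustration, not taken from the book), querying Solr's standard /select handler from Python looks like:

    import requests

    resp = requests.get(
        "http://localhost:8983/solr/listings/select",
        params={"q": "description:lakefront AND bedrooms:3",
                "rows": 5, "wt": "json"},
    )
    # Solr wraps matching documents under response["response"]["docs"].
    for doc in resp.json()["response"]["docs"]:
        print(doc.get("id"), doc.get("description"))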


Table of Contents

1. Meet Solr
Chapter 1. Introduction to Solr
1.1. Why do I need a search engine?
1.1.1. Managing text-centric data
1.1.2. Common search-engine use cases
1.2. What is Solr?
1.2.1. Information retrieval engine
1.2.2. Flexible schema management
1.2.3. Java web application
1.2.4. Multiple indexes in one server
1.2.5. Extendable (plugins)
1.2.6. Scalable
1.2.7. Fault-tolerant
1.3. Why Solr?
1.3.1. Solr for the software architect
1.3.2. Solr for the system administrator
1.3.3.

pages: 541 words: 109,698

Mining the Social Web: Finding Needles in the Social Haystack
by Matthew A. Russell
Published 15 Jan 2011

Text Mining Fundamentals

Although rigorous approaches to natural language processing (NLP) that include such things as sentence segmentation, tokenization, word chunking, and entity detection are necessary in order to achieve the deepest possible understanding of textual data, it's helpful to first introduce some fundamentals from Information Retrieval theory. The remainder of this chapter introduces some of its more foundational aspects, including TF-IDF, the cosine similarity metric, and some of the theory behind collocation detection. Chapter 8 provides a deeper discussion of NLP.

Note: If you want to dig deeper into IR theory, the full text of Introduction to Information Retrieval is available online and provides more information than you could ever want to know about the field.

A Whiz-Bang Introduction to TF-IDF

Information retrieval is an extensive field with many specialties.
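As a compressed preview of the two fundamentals just named, TF-IDF weighting and cosine similarity, here is a sketch using scikit-learn rather than the hand-rolled code the chapter develops (the three snippets of text are hypothetical stand-ins for social-web content):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    texts = [
        "mining the social web with python",
        "text mining fundamentals: tf-idf and cosine similarity",
        "a completely unrelated post about cooking",
    ]

    tfidf = TfidfVectorizer().fit_transform(texts)   # docs x terms matrix
    print(cosine_similarity(tfidf[0], tfidf[1]))     # shared topic -> higher score
    print(cosine_similarity(tfidf[0], tfidf[2]))     # no shared terms -> zero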


For comparative purposes, note that it's certainly possible to perform text-based indexing by writing a simple mapping function that associates keywords and documents, like the one in Example 3-10.

Example 3-10. A mapper that tokenizes documents

    def tokenizingMapper(doc):
        """Yield a (token, doc) pair for each interesting token in doc."""
        tokens = doc.split()
        for token in tokens:
            if isInteresting(token):  # Filter out stop words, etc.; assumed defined elsewhere
                yield token, doc

However, you'll quickly find that you need to do a lot more homework about basic Information Retrieval (IR) concepts if you want to establish a good scoring function to rank documents by relevance or anything beyond basic frequency analysis. Fortunately, the benefits of Lucene are many, and chances are good that you'll want to use couchdb-lucene instead of writing your own mapping function for full-text indexing.

pages: 721 words: 197,134

Data Mining: Concepts, Models, Methods, and Algorithms
by Mehmed Kantardzić
Published 2 Jan 2003

For readers interested in practical implementation of some clustering methods, the paper offers useful advice and a large spectrum of references. Miyamoto, S., Fuzzy Sets in Information Retrieval and Cluster Analysis, Kluwer Academic Publishers, Dordrecht, The Netherlands, 1990. This book offers an in-depth presentation and analysis of some clustering algorithms and reviews the possibilities of combining these techniques with fuzzy representation of data. Information retrieval, which, with the development of advanced Web-mining techniques, is becoming more important in the data-mining community, is also explained in the book. 10 ASSOCIATION RULES Chapter Objectives Explain the local modeling character of association-rule techniques.

Any researcher or practitioner in this field needs to be aware of these issues in order to successfully apply a particular methodology, to understand a method's limitations, or to develop new techniques. This book is an attempt to present and discuss such issues and principles and then describe representative and popular methods originating from statistics, machine learning, computer graphics, databases, information retrieval, neural networks, fuzzy logic, and evolutionary computation. In this book, we describe how best to prepare environments for performing data mining and discuss approaches that have proven to be critical in revealing important patterns, trends, and models in large data sets. It is our expectation that once a reader has completed this text, he or she will be able to initiate and perform basic activities in all phases of a data mining process successfully and effectively.

Among the various methods of supervised learning, the nearest neighbor classifier achieves consistently high performance, without a priori assumptions about the distributions from which the training examples are drawn. The reader may have noticed the similarity between the problem of finding nearest neighbors for a test sample and ad hoc retrieval methodologies. In standard information retrieval systems such as digital libraries or web search, we search for the documents (samples) with the highest similarity to the query document represented by a set of key words. Problems are similar, and often the proposed solutions are applicable in both disciplines. Decision boundaries in 1NN are concatenated segments of the Voronoi diagram as shown in Figure 4.28.
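The 1NN decision rule the passage describes fits in a few lines; a toy sketch (made-up points, Euclidean distance) in which a test sample takes the label of its nearest training sample:

    import math

    train = [((1.0, 1.0), "a"), ((1.5, 2.0), "a"),
             ((5.0, 5.0), "b"), ((6.0, 4.5), "b")]

    def classify_1nn(x):
        # Nearest training point wins; the decision boundaries trace the
        # Voronoi cells of the training set, as the text notes.
        nearest = min(train, key=lambda pair: math.dist(pair[0], x))
        return nearest[1]

    print(classify_1nn((1.2, 1.4)))   # -> a
    print(classify_1nn((5.5, 5.0)))   # -> b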

pages: 481 words: 121,669

The Invisible Web: Uncovering Information Sources Search Engines Can't See
by Gary Price , Chris Sherman and Danny Sullivan
Published 2 Jan 2003

As more and more computers connected to the Internet, users began to demand tools that would allow them to search for and locate text and other files on computers anywhere on the Net.

Early Net Search Tools

Although sophisticated search and information retrieval techniques date back to the late 1950s and early '60s, these techniques were used primarily in closed or proprietary systems. Early Internet search and retrieval tools lacked even the most basic capabilities, primarily because it was thought that traditional information retrieval techniques would not work well on an open, unstructured information universe like the Internet. Accessing a file on the Internet was a two-part process.

What was needed was an automated approach to Web page discovery and indexing. The Web had now grown large enough that information scientists became interested in creating search services specifically for the Web. Sophisticated information retrieval techniques had been available since the early 1960s, but they were only effective when searching closed, relatively structured databases. The open, laissez-faire nature of the Web made it too messy to easily adapt traditional information retrieval techniques. New, Web-centric approaches were needed. But how best to approach the problem? Web search would clearly have to be more sophisticated than a simple Archie-type service.

But in the early days of the Web, the reality was that most of the Web consisted of simple HTML text documents. Since few servers offered local site search services, developers of the first Web search engines opted for the model of indexing the full text of pages stored on Web servers. To adapt traditional information retrieval techniques to Web search, they built huge databases that attempted to replicate the Web, searching over these relatively controlled, closed archives of pages rather than trying to search the Web itself in real time. With this fateful architectural decision, limiting search engines to HTML text documents and essentially ignoring all other types of data available via the Web, the Invisible Web was born.

pages: 504 words: 89,238

Natural language processing with Python
by Steven Bird , Ewan Klein and Edward Loper
Published 15 Dec 2009

While named entity recognition is frequently a prelude to identifying relations in Information Extraction, it can also contribute to other tasks. For example, in Question Answering (QA), we try to improve the precision of Information Retrieval by recovering not whole pages, but just those parts which contain an answer to the user's question. Most QA systems take the documents returned by standard Information Retrieval, and then attempt to isolate the minimal text snippet in the document containing the answer. Now suppose the question was Who was the first President of the US?, and one of the documents that was retrieved contained the following passage: (5) The Washington Monument is the most prominent structure in Washington, D.C. and one of the city's early attractions.
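NLTK itself (the library this book teaches) exposes that NER step as a tokenize / POS-tag / ne_chunk pipeline. A sketch over passage (5), assuming the punkt, averaged_perceptron_tagger, maxent_ne_chunker, and words data packages have already been downloaded via nltk.download():

    import nltk

    sent = ("The Washington Monument is the most prominent structure "
            "in Washington, D.C.")
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)))
    for subtree in tree.subtrees():
        if subtree.label() != "S":   # keep only the named-entity chunks
            print(subtree.label(), " ".join(tok for tok, pos in subtree.leaves()))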

Entropy classifiers, 251–254 Maximum Entropy Markov Models, 233 Maximum Entropy principle, 253 memoization, 167 meronyms, 70 metadata, 435 OLAC (Open Language Archives Community), 435 modals, 186 model building, 383 model checking, 379 models interpretation of sentences of logical language, 371 of linguistic patterns, 255 representation using set theory, 367 truth-conditional semantics in first-order logic, 377 General Index | 471 what can be learned from models of language, 255 modifiers, 314 modules defined, 59 multimodule programs, 156 structure of Python module, 154 morphological analysis, 213 morphological cues to word category, 211 morphological tagging, 214 morphosyntactic information in tagsets, 212 MSWord, text from, 85 mutable, 93 N \n newline character in regular expressions, 111 n-gram tagging, 203–208 across sentence boundaries, 208 combining taggers, 205 n-gram tagger as generalization of unigram tagger, 203 performance limitations, 206 separating training and test data, 203 storing taggers, 206 unigram tagging, 203 unknown words, 206 naive Bayes assumption, 248 naive Bayes classifier, 246–250 developing for gender identification task, 223 double-counting problem, 250 as generative classifier, 254 naivete of independence assumption, 249 non-binary features, 249 underlying probabilistic model, 248 zero counts and smoothing, 248 name resolution, LGB rule for, 145 named arguments, 152 named entities commonly used types of, 281 relations between, 284 named entity recognition (NER), 281–284 Names Corpus, 61 negative lookahead assertion, 284 NER (see named entity recognition) nested code blocks, 25 NetworkX package, 170 new words in languages, 212 472 | General Index newlines, 84 matching in regular expressions, 109 printing with print statement, 90 resources for further information, 122 non-logical constants, 372 non-standard words, 108 normalizing text, 107–108 lemmatization, 108 using stemmers, 107 noun phrase (NP), 297 noun phrase (NP) chunking, 264 regular expression–based NP chunker, 267 using unigram tagger, 272 noun phrases, quantified, 390 nouns categorizing and tagging, 184 program to find most frequent noun tags, 187 syntactic agreement, 329 numerically intense algorithms in Python, increasing efficiency of, 257 NumPy package, 171 O object references, 130 copying, 132 objective function, 114 objects, finding data type for, 86 OLAC metadata, 74, 435 definition of metadata, 435 Open Language Archives Community, 435 Open Archives Initiative (OAI), 435 open class, 212 open formula, 374 Open Language Archives Community (OLAC), 435 operators, 369 (see also names of individual operators) addition and multiplication, 88 Boolean, 368 numerical comparison, 22 scope of, 157 word comparison, 23 or operator, 24 orthography, 328 out-of-vocabulary items, 206 overfitting, 225, 245 P packages, 59 parameters, 57 call-by-value parameter passing, 144 checking types of, 146 defined, 9 defining for functions, 143 parent nodes, 279 parsing, 318 (see also grammars) with context-free grammar left-corner parser, 306 recursive descent parsing, 303 shift-reduce parsing, 304 well-formed substring tables, 307–310 Earley chart parser, parsing feature-based grammars, 334 parsers, 302 projective dependency parser, 311 part-of-speech tagging (see POS tagging) partial information, 341 parts of speech, 179 PDF text, 85 Penn Treebank Corpus, 51, 315 personal pronouns, 186 philosophical divides in contemporary NLP, 444 phonetics computer-readable phonetic alphabet (SAMPA), 137 phones, 63 resources for 
further information, 74 phrasal level, 347 phrasal projections, 347 pipeline for NLP, 31 pixel images, 169 plotting functions, Matplotlib, 168 Porter stemmer, 107 POS (part-of-speech) tagging, 179, 208, 229 (see also tagging) differences in POS tagsets, 213 examining word context, 230 finding IOB chunk tag for word's POS tag, 272 in information retrieval, 263 morphology in POS tagsets, 212 resources for further reading, 214 simplified tagset, 183 storing POS tags in tagged corpora, 181 tagged data from four Indian languages, 182 unsimplifed tags, 187 use in noun phrase chunking, 265 using consecutive classifier, 231 pre-sorting, 160 precision, evaluating search tasks for, 239 precision/recall trade-off in information retrieval, 205 predicates (first-order logic), 372 prepositional phrase (PP), 297 prepositional phrase attachment ambiguity, 300 Prepositional Phrase Attachment Corpus, 316 prepositions, 186 present participles, 211 Principle of Compositionality, 385, 443 print statements, 89 newline at end, 90 string formats and, 117 prior probability, 246 probabilistic context-free grammar (PCFG), 320 probabilistic model, naive Bayes classifier, 248 probabilistic parsing, 318 procedural style, 139 processing pipeline (NLP), 86 productions in grammars, 293 rules for writing CFGs for parsing in NLTK, 301 program development, 154–160 debugging techniques, 158 defensive programming, 159 multimodule programs, 156 Python module structure, 154 sources of error, 156 programming style, 139 programs, writing, 129–177 advanced features of functions, 149–154 algorithm design, 160–167 assignment, 130 conditionals, 133 equality, 132 functions, 142–149 resources for further reading, 173 sequences, 133–138 style considerations, 138–142 legitimate uses for counters, 141 procedural versus declarative style, 139 General Index | 473 Python coding style, 138 summary of important points, 172 using Python libraries, 167–172 Project Gutenberg, 80 projections, 347 projective, 311 pronouncing dictionary, 63–65 pronouns anaphoric antecedents, 397 interpreting in first-order logic, 373 resolving in discourse processing, 401 proof goal, 376 properties of linguistic categories, 331 propositional logic, 368–371 Boolean operators, 368 propositional symbols, 368 pruning decision nodes, 245 punctuation, classifier for, 233 Python carriage return and linefeed characters, 80 codecs module, 95 dictionary data structure, 65 dictionary methods, summary of, 197 documentation, 173 documentation and information resources, 34 ElementTree module, 427 errors in understanding semantics of, 157 finding type of any object, 86 getting started, 2 increasing efficiency of numerically intense algorithms, 257 libraries, 167–172 CSV, 170 Matplotlib, 168–170 NetworkX, 170 NumPy, 171 other, 172 reference materials, 122 style guide for Python code, 138 textwrap module, 120 Python Package Index, 172 Q quality control in corpus creation, 413 quantification first-order logic, 373, 380 quantified noun phrases, 390 scope ambiguity, 381, 394–397 474 | General Index quantified formulas, interpretation of, 380 questions, answering, 29 quotation marks in strings, 87 R random text generating in various styles, 6 generating using bigrams, 55 raster (pixel) images, 169 raw strings, 101 raw text, processing, 79–128 capturing user input, 85 detecting word patterns with regular expressions, 97–101 formatting from lists to strings, 116–121 HTML documents, 82 NLP pipeline, 86 normalizing text, 107–108 reading local files, 84 regular expressions for tokenizing 
text, 109– 112 resources for further reading, 122 RSS feeds, 83 search engine results, 82 segmentation, 112–116 strings, lowest level text processing, 87–93 summary of important points, 121 text from web and from disk, 80 text in binary formats, 85 useful applications of regular expressions, 102–106 using Unicode, 93–97 raw( ) function, 41 re module, 101, 110 recall, evaluating search tasks for, 240 Recognizing Textual Entailment (RTE), 32, 235 exploiting word context, 230 records, 136 recursion, 161 function to compute Sanskrit meter (example), 165 in linguistic structure, 278–281 tree traversal, 280 trees, 279–280 performance and, 163 in syntactic structure, 301 recursive, 301 recursive descent parsing, 303 reentrancy, 340 references (see object references) regression testing framework, 160 regular expressions, 97–106 character class and other symbols, 110 chunker based on, evaluating, 272 extracting word pieces, 102 finding word stems, 104 matching initial and final vowel sequences and all consonants, 102 metacharacters, 101 metacharacters, summary of, 101 noun phrase (NP) chunker based on, 265 ranges and closures, 99 resources for further information, 122 searching tokenized text, 105 symbols, 110 tagger, 199 tokenizing text, 109–112 use in PlaintextCorpusReader, 51 using basic metacharacters, 98 using for relation extraction, 284 using with conditional frequency distributions, 103 relation detection, 263 relation extraction, 284 relational operators, 22 reserved words, 15 return statements, 144 return value, 57 reusing code, 56–59 creating programs using a text editor, 56 functions, 57 modules, 59 Reuters Corpus, 44 root element (XML), 427 root hypernyms, 70 root node, 242 root synsets, 69 Rotokas language, 66 extracting all consonant-vowel sequences from words, 103 Toolbox file containing lexicon, 429 RSS feeds, 83 feedparser library, 172 RTE (Recognizing Textual Entailment), 32, 235 exploiting word context, 230 runtime errors, 13 S \s whitespace characters in regular expressions, 111 \S nonwhitespace characters in regular expressions, 111 SAMPA computer-readable phonetic alphabet, 137 Sanskrit meter, computing, 165 satisfies, 379 scope of quantifiers, 381 scope of variables, 145 searches binary search, 160 evaluating for precision and recall, 239 processing search engine results, 82 using POS tags, 187 segmentation, 112–116 in chunking and tokenization, 264 sentence, 112 word, 113–116 semantic cues to word category, 211 semantic interpretations, NLTK functions for, 393 semantic role labeling, 29 semantics natural language, logic and, 365–368 natural language, resources for information, 403 semantics of English sentences, 385–397 quantifier ambiguity, 394–397 transitive verbs, 391–394 ⋏-calculus, 386–390 SemCor tagging, 214 sentence boundaries, tagging across, 208 sentence segmentation, 112, 233 in chunking, 264 in information retrieval process, 263 sentence structure, analyzing, 291–326 context-free grammar, 298–302 dependencies and dependency grammar, 310–315 grammar development, 315–321 grammatical dilemmas, 292 parsing with context-free grammar, 302– 310 resources for further reading, 322 summary of important points, 321 syntax, 295–298 sents( ) function, 41 General Index | 475 sequence classification, 231–233 other methods, 233 POS tagging with consecutive classifier, 232 sequence iteration, 134 sequences, 133–138 combining different sequence types, 136 converting between sequence types, 135 operations on sequence types, 134 processing using generator expressions, 137 strings 
and lists as, 92 shift operation, 305 shift-reduce parsing, 304 Shoebox, 66, 412 sibling nodes, 279 signature, 373 similarity, semantic, 71 Sinica Treebank Corpus, 316 slash categories, 350 slicing lists, 12, 13 strings, 15, 90 smoothing, 249 space-time trade-offs in algorihm design, 163 spaces, matching in regular expressions, 109 Speech Synthesis Markup Language (W3C SSML), 214 spellcheckers, Words Corpus used by, 60 spoken dialogue systems, 31 spreadsheets, obtaining data from, 418 SQL (Structured Query Language), 362 translating English sentence to, 362 stack trace, 158 standards for linguistic data creation, 421 standoff annotation, 415, 421 start symbol for grammars, 298, 334 startswith( ) function, 45 stemming, 107 NLTK HOWTO, 122 stemmers, 107 using regular expressions, 104 using stem( ) fuinction, 105 stopwords, 60 stress (in pronunciation), 64 string formatting expressions, 117 string literals, Unicode string literal in Python, 95 strings, 15, 87–93 476 | General Index accessing individual characters, 89 accessing substrings, 90 basic operations with, 87–89 converting lists to, 116 formats, 117–118 formatting lining things up, 118 tabulating data, 119 immutability of, 93 lists versus, 92 methods, 92 more operations on, useful string methods, 92 printing, 89 Python’s str data type, 86 regular expressions as, 101 tokenizing, 86 structurally ambiguous sentences, 300 structure sharing, 340 interaction with unification, 343 structured data, 261 style guide for Python code, 138 stylistics, 43 subcategories of verbs, 314 subcategorization, 344–347 substrings (WFST), 307 substrings, accessing, 90 subsumes, 341 subsumption, 341–344 suffixes, classifier for, 229 supervised classification, 222–237 choosing features, 224–227 documents, 227 exploiting context, 230 gender identification, 222 identifying dialogue act types, 235 part-of-speech tagging, 229 Recognizing Textual Entailment (RTE), 235 scaling up to large datasets, 237 sentence segmentation, 233 sequence classification, 231–233 Swadesh wordlists, 65 symbol processing, language processing versus, 442 synonyms, 67 synsets, 67 semantic similarity, 71 in WordNet concept hierarchy, 69 syntactic agreement, 329–331 syntactic cues to word category, 211 syntactic structure, recursion in, 301 syntax, 295–298 syntax errors, 3 T \t tab character in regular expressions, 111 T9 system, entering text on mobile phones, 99 tabs avoiding in code indentation, 138 matching in regular expressions, 109 tag patterns, 266 matching, precedence in, 267 tagging, 179–219 adjectives and adverbs, 186 combining taggers, 205 default tagger, 198 evaluating tagger performance, 201 exploring tagged corpora, 187–189 lookup tagger, 200–201 mapping words to tags using Python dictionaries, 189–198 nouns, 184 part-of-speech (POS) tagging, 229 performance limitations, 206 reading tagged corpora, 181 regular expression tagger, 199 representing tagged tokens, 181 resources for further reading, 214 across sentence boundaries, 208 separating training and testing data, 203 simplified part-of-speech tagset, 183 storing taggers, 206 transformation-based, 208–210 unigram tagging, 202 unknown words, 206 unsimplified POS tags, 187 using POS (part-of-speech) tagger, 179 verbs, 185 tags in feature structures, 340 IOB tags representing chunk structures, 269 XML, 425 tagsets, 179 morphosyntactic information in POS tagsets, 212 simplified POS tagset, 183 terms (first-order logic), 372 test sets, 44, 223 choosing for classification models, 238 testing classifier for document 
classification, 228 text, 1 computing statistics from, 16–22 counting vocabulary, 7–10 entering on mobile phones (T9 system), 99 as lists of words, 10–16 searching, 4–7 examining common contexts, 5 text alignment, 30 text editor, creating programs with, 56 textonyms, 99 textual entailment, 32 textwrap module, 120 theorem proving in first order logic, 375 timeit module, 164 TIMIT Corpus, 407–412 tokenization, 80 chunking and, 264 in information retrieval, 263 issues with, 111 list produced from tokenizing string, 86 regular expressions for, 109–112 representing tagged tokens, 181 segmentation and, 112 with Unicode strings as input and output, 97 tokenized text, searching, 105 tokens, 8 Toolbox, 66, 412, 431–435 accessing data from XML, using ElementTree, 429 adding field to each entry, 431 resources for further reading, 438 validating lexicon, 432–435 tools for creation, publication, and use of linguistic data, 421 top-down approach to dynamic programming, 167 top-down parsing, 304 total likelihood, 251 training classifier, 223 classifier for document classification, 228 classifier-based chunkers, 274–278 taggers, 203 General Index | 477 unigram chunker using CoNLL 2000 Chunking Corpus, 273 training sets, 223, 225 transformation-based tagging, 208–210 transitive verbs, 314, 391–394 translations comparative wordlists, 66 machine (see machine translation) treebanks, 315–317 trees, 279–281 representing chunks, 270 traversal of, 280 trie, 162 trigram taggers, 204 truth conditions, 368 truth-conditional semantics in first-order logic, 377 tuples, 133 lists versus, 136 parentheses with, 134 representing tagged tokens, 181 Turing Test, 31, 368 type-raising, 390 type-token distinction, 8 TypeError, 157 types, 8, 86 (see also data types) types (first-order logic), 373 U unary predicate, 372 unbounded dependency constructions, 349– 353 defined, 350 underspecified, 333 Unicode, 93–97 decoding and encoding, 94 definition and description of, 94 extracting gfrom files, 94 resources for further information, 122 using your local encoding in Python, 97 unicodedata module, 96 unification, 342–344 unigram taggers confusion matrix for, 240 noun phrase chunking with, 272 unigram tagging, 202 lookup tagger (example), 200 separating training and test data, 203 478 | General Index unique beginners, 69 Universal Feed Parser, 83 universal quantifier, 374 unknown words, tagging, 206 updating dictionary incrementally, 195 US Presidential Inaugural Addresses Corpus, 45 user input, capturing, 85 V valencies, 313 validity of arguments, 369 validity of XML documents, 426 valuation, 377 examining quantifier scope ambiguity, 381 Mace4 model converted to, 384 valuation function, 377 values, 191 complex, 196 variables arguments of predicates in first-order logic, 373 assignment, 378 bound by quantifiers in first-order logic, 373 defining, 14 local, 58 naming, 15 relabeling bound variables, 389 satisfaction of, using to interpret quantified formulas, 380 scope of, 145 verb phrase (VP), 297 verbs agreement paradigm for English regular verbs, 329 auxiliary, 336 auxiliary verbs and inversion of subject and verb, 348 categorizing and tagging, 185 examining for dependency grammar, 312 head of sentence and dependencies, 310 present participle, 211 transitive, 391–394 W \W non-word characters in Python, 110, 111 \w word characters in Python, 110, 111 web text, 42 Web, obtaining data from, 416 websites, obtaining corpora from, 416 weighted grammars, 318–321 probabilistic context-free grammar (PCFG), 320 well-formed (XML), 425 
well-formed formulas, 368 well-formed substring tables (WFST), 307– 310 whitespace regular expression characters for, 109 tokenizing text on, 109 wildcard symbol (.), 98 windowdiff scorer, 414 word classes, 179 word comparison operators, 23 word occurrence, counting in text, 8 word offset, 45 word processor files, obtaining data from, 417 word segmentation, 113–116 word sense disambiguation, 28 word sequences, 7 wordlist corpora, 60–63 WordNet, 67–73 concept hierarchy, 69 lemmatizer, 108 more lexical relations, 70 semantic similarity, 71 visualization of hypernym hierarchy using Matplotlib and NetworkX, 170 Words Corpus, 60 words( ) function, 40 wrapping text, 120 Z zero counts (naive Bayes classifier), 249 zero projection, 347 X XML, 425–431 ElementTree interface, 427–429 formatting entries, 430 representation of lexical entry from chunk parsing Toolbox record, 434 resources for further reading, 438 role of, in using to represent linguistic structures, 426 using ElementTree to access Toolbox data, 429 using for linguistic structures, 425 validity of documents, 426 General Index | 479 About the Authors Steven Bird is Associate Professor in the Department of Computer Science and Software Engineering at the University of Melbourne, and Senior Research Associate in the Linguistic Data Consortium at the University of Pennsylvania.

Further readings in quantitative data analysis in linguistics are: (Baayen, 2008), (Gries, 2009), and (Woods, Fletcher, & Hughes, 1986). The original description of WordNet is (Fellbaum, 1998). Although WordNet was originally developed for research in psycholinguistics, it is now widely used in NLP and Information Retrieval. WordNets are being developed for many other languages, as documented at http://www.globalwordnet.org/. For a study of WordNet similarity measures, see (Budanitsky & Hirst, 2006). Other topics touched on in this chapter were phonetics and lexical semantics, and we refer readers to Chapters 7 and 20 of (Jurafsky & Martin, 2008).

Bootstrapping: Douglas Engelbart, Coevolution, and the Origins of Personal Computing (Writing Science)
by Thierry Bardini
Published 1 Dec 2000

I was trying to explain what I wanted to do and one guy just kept telling me, "You are just giving fancy names to information retrieval. Why do that? Why don't you just admit that it's information retrieval and get on with the rest of it and make it all work?" He was getting kind of nasty. The other guy was trying to get him to back off. (Engelbart 1996) It seems difficult to dispute, therefore, that the Memex was not conceived as a medium, only as a personal "tool" for information retrieval. Personal access to information was emphasized over communication. The later research of Ted Nelson on hypertext is very representative of that emphasis. It is problematic, however, to grant Bush the status of the "unique forefather" of computerized hypertext systems.

The regnant term at the time for what Bush was proposing was indeed "information retrieval," and Engelbart himself has testified to the power that a preconceived notion of information retrieval held for creating misunderstanding of his work on hypertext networks: I started trying to reach out to make connections in domains of interest and concerns out there that fit along the vector I was interested in. I went to the information retrieval people. I remember one instance when I went to the Ford Foundation's Center for Advanced Study in Social Sciences to see somebody who was there for a year, who was into information retrieval. We sat around. In fact, at coffee break, there were about five people sitting there.

The difference in objectives signals the difference in means that characterized the two approaches. The first revolved around the "association" of ideas on the model of how the individual mind is supposed to work. The second revolved around the intersubjective "connection" of words in the systems of natural languages. What actually differentiates hypertext systems from information-retrieval systems is not the process of "association," the term Bush proposed as analogous to the way the individual mind works. Instead, what constitutes a hypertext system is clear in the definition of hypertext already cited: "a style of building systems for information representation and management around a network of nodes connected together by typed links."
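That definition is concrete enough to sketch in code. A minimal Python illustration (mine, not Bardini's or Nelson's; node names and link types are hypothetical) of a hypertext store as nodes joined by typed links, in contrast to a flat collection queried for retrieval:

# A network of nodes connected by typed links, per the definition quoted above.
nodes = {
    "as-we-may-think": "Bush's 1945 essay ...",        # hypothetical node ids
    "nls-journal": "an Engelbart NLS journal entry ...",
}
links = [("nls-journal", "as-we-may-think", "cites")]  # (source, target, link type)

def targets(node, link_type):
    # Navigation follows links of a given type, rather than matching content.
    return [dst for src, dst, kind in links if src == node and kind == link_type]

print(targets("nls-journal", "cites"))  # ['as-we-may-think']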

pages: 502 words: 107,510

Natural Language Annotation for Machine Learning
by James Pustejovsky and Amber Stubbs
Published 14 Oct 2012

Literary researchers begin compiling systematic collections of the complete works of different authors. Key Word in Context (KWIC) is invented as a means of indexing documents and creating concordances. 1960s: Kucera and Francis publish A Standard Corpus of Present-Day American English (the Brown Corpus), the first broadly available large corpus of language texts. Work in Information Retrieval (IR) develops techniques for statistical similarity of document content. 1970s: Stochastic models developed from speech corpora make Speech Recognition systems possible. The vector space model is developed for document indexing. The London-Lund Corpus (LLC) is developed through the work of the Survey of English Usage. 1980s: The Lancaster-Oslo-Bergen (LOB) Corpus, designed to match the Brown Corpus in terms of size and genres, is compiled.

They are also used in speech disambiguation—if a person speaks unclearly but utters a sequence that does not commonly (or ever) occur in the language being spoken, an n-gram model can help recognize that problem and find the words that the speaker probably intended to say. Another modern corpus is ClueWeb09 (http://lemurproject.org/clueweb09.php/), a dataset “created to support research on information retrieval and related human language technologies. It consists of about 1 billion web pages in ten languages that were collected in January and February 2009.” This corpus is too large to use for an annotation project (it’s about 25 terabytes uncompressed), but some projects have taken parts of the dataset (such as a subset of the English websites) and used them for research (Pomikálek et al. 2012).

So the first word in the ranking occurs about twice as often as the second word in the ranking, and three times as often as the third word in the ranking, and so on. N-grams In this section we introduce the notion of an n-gram. N-grams are important for a wide range of applications in Natural Language Processing (NLP), because fairly straightforward language models can be built using them, for speech, Machine Translation, indexing, Information Retrieval (IR), and, as we will see, classification. Imagine that we have a string of tokens, W, consisting of the elements w1, w2, … , wn. Now consider a sliding window over W. If the sliding window consists of one cell (wi), then the collection of one-cell substrings is called the unigram profile of the string; there will be as many unigrams as there are elements in the string.
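As a quick sketch of the sliding window just described (my example, not the book's), the unigram and bigram profiles of a short token string can be computed as follows:

from collections import Counter

def ngram_profile(tokens, n):
    # Slide an n-cell window over W = w1, w2, ..., wn and count each n-gram.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "the cat sat on the mat".split()
print(ngram_profile(tokens, 1))  # unigram profile: ('the',) occurs twice
print(ngram_profile(tokens, 2))  # bigram profile of the same string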

pages: 392 words: 108,745

Talk to Me: How Voice Computing Will Transform the Way We Live, Work, and Think
by James Vlahos
Published 1 Mar 2019

[This excerpt is the book's index; entries relevant to the query include: information retrieval (IR), 103–4, 146, 149–50, 160; internet search technology, 3, 26, 54, 199–200, 203, 212, 278 (see also question answering).]

The simplest method for getting it to reply is for it to fire off a line of dialogue that its programmer authored in advance. People from Weizenbaum on have done this; even Siri, Alexa, and the Assistant use some prescripted content. But this technique is laborious and limited to the narrow pool of conversational situations designers imagine in advance. A more scalable technique is information retrieval, or IR, in which the AI grabs a suitable response from a database or web page. Because there’s so much content online, IR gives machines vastly more to say than if they were limited to hand-authored utterances. The technique can also be combined with the scripted approach, filling blanks within prewritten templates.

For instance, responding to a question about the weather, a voice assistant might say, “It’ll be sunny with a high of 78. Looks like a great day to go outside!” In that case, the specifics (“sunny,” “78”) were retrieved from a weather service while the surrounding words (“great day to go outside”) were manually authored as reusable boilerplate. Voice AI creators use information retrieval more than any other technique, and IR will pop up again later in this book. So we will focus now on an intriguing new method in which responses are neither written out in advance nor cherry-picked from some preexisting source. For what are known as generative methods, computers use deep learning to come up with words all on their own.
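A minimal sketch of that hybrid scripted-plus-IR pattern (illustrative only; not any assistant's actual code, and the field names are hypothetical):

def weather_response(forecast):
    # The template is hand-authored boilerplate; the blanks are filled with
    # values retrieved from a data source such as a weather service.
    template = ("It'll be {conditions} with a high of {high}. "
                "Looks like a great day to go outside!")
    return template.format(**forecast)

retrieved = {"conditions": "sunny", "high": 78}  # stands in for retrieved data
print(weather_response(retrieved))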

pages: 205 words: 20,452

Data Mining in Time Series Databases
by Mark Last, Abraham Kandel and Horst Bunke
Published 24 Jun 2004

Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases. Proc. 21st Int. Conf. on Very Large Databases (VLDB), pp. 490–501. 3. Baeza-Yates, R. and Gonnet, G.H. (1999). A Fast Algorithm on Average for All-Against-All Sequence Matching. Proc. 6th String Processing and Information Retrieval Symposium (SPIRE), pp. 16–23. 4. Baeza-Yates, R. and Ribeiro-Neto, B. (1999). Modern Information Retrieval. ACM Press/Addison–Wesley Longman Limited. 5. Chakrabarti, K. and Mehrotra, S. (1999). The Hybrid Tree: An Index Structure for High Dimensional Feature Spaces. Proc. 15th Int. Conf. on Data Engineering (ICDE), pp. 440–447. 6. Chan, K. and Fu, A.W. (1999).

Proceedings of the 4th International Conference of Knowledge Discovery and Data Mining, AAAI Press, pp. 239–241. 14. Keogh, E. and Pazzani, M. (1999). Relevance Feedback Retrieval of Time Series Data. Proceedings of the 22nd Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pp. 183–190. 15. Keogh, E. and Smyth, P. (1997). A Probabilistic Approach to Fast Pattern Matching in Time Series Databases. Proceedings of the 3rd International Conference of Knowledge Discovery and Data Mining, pp. 24–30. 16. Last, M., Klein, Y., and Kandel, A. (2001). Knowledge Discovery in Time Series Databases.

Such similarity-based retrieval has attracted a great deal of attention in recent years. Although several different approaches have appeared, most are based on the common premise of dimensionality reduction and spatial access methods. This chapter gives an overview of recent research and shows how the methods fit into a general context of signature extraction. Keywords: Information retrieval; sequence databases; similarity search; spatial indexing; time sequences. 1. Introduction Time sequences arise in many applications—any applications that involve storing sensor inputs, or sampling a value that changes over time. A problem which has received an increasing amount of attention lately is the problem of similarity retrieval in databases of time sequences, so-called “query by example.”
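A minimal sketch of the dimensionality-reduction premise (my illustration, assuming a DFT-based signature; the chapter surveys several such signature-extraction schemes):

import numpy as np

def signature(ts, k=4):
    # Keep only the first k Fourier coefficients as a low-dimensional signature.
    c = np.fft.rfft(np.asarray(ts, dtype=float))[:k]
    return np.concatenate([c.real, c.imag])

def sig_distance(a, b, k=4):
    # Euclidean distance between signatures; with suitable normalization this
    # lower-bounds the true distance, avoiding false dismissals in a first pass.
    return np.linalg.norm(signature(a, k) - signature(b, k))

t = np.linspace(0, 10, 128)
query = np.sin(t)
candidate = np.sin(t) + 0.05 * np.random.randn(128)
print(sig_distance(query, candidate))  # small value suggests a match candidate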

pages: 1,082 words: 87,792

Python for Algorithmic Trading: From Idea to Cloud Deployment
by Yves Hilpisch
Published 8 Dec 2020

[This excerpt is the book's index; entries relevant to the query include: Eikon Data API — retrieving historical structured data, Retrieving Historical Structured Data; retrieving historical unstructured data, Retrieving Historical Unstructured Data; FXCM — retrieving data, Retrieving Data–Retrieving Candles Data; retrieving historical data, Retrieving Historical Data; retrieving streaming data, Retrieving Streaming Data; retrieving tick data, Retrieving Tick Data; Oanda — retrieving account information, Retrieving Account Information; retrieving historical data, Retrieving Historical Data–Factoring In Leverage and Margin.]
Manager-Conda as a Virtual Environment Manager Docker containers, Using Docker Containers-Building a Ubuntu and Python Docker Image using cloud instances, Using Cloud Instances-Script to Orchestrate the Droplet Set Up Python scriptsautomated trading operations, Running the Code, Python Script-Strategy Monitoring backtesting base class, Backtesting Base Class custom streaming class that trades a momentum strategy, Python Script linear regression backtesting class, Linear Regression Backtesting Class long-only backtesting class, Long-Only Backtesting Class long-short backtesting class, Long-Short Backtesting Class real-time data handling, Python Scripts-Sample Data Server for Bar Plot sample time series data set, Python Scripts strategy monitoring, Strategy Monitoring uploading for automated trading operations, Uploading the Code vectorized backtesting, Python Scripts-Mean Reversion Backtesting Class Q Quandlpremium data sets, Working with Open Data Sources working with open data sources, Working with Open Data Sources-Working with Open Data Sources R random numbers, Random Numbers random walk hypothesis, Predicting Index Levels range (iterator object), Control Structures read_csv() function, Reading from a CSV File with pandas real-time data, Working with Real-Time Data and Sockets-Sample Data Server for Bar PlotPython script for handling, Python Scripts-Sample Data Server for Bar Plot signal generation in real time, Signal Generation in Real Time-Signal Generation in Real Time tick data client for, Connecting a Simple Tick Data Client tick data server for, Running a Simple Tick Data Server-Running a Simple Tick Data Server, Sample Tick Data Server visualizing streaming data with Plotly, Visualizing Streaming Data with Plotly-Streaming Data as Bars real-time monitoring, Real-Time Monitoring Refinitiv, Eikon Data API relative maximum drawdown, Case Study returns, predicting future, Predicting Future Returns-Predicting Future Returns risk analysis, for ML-based trading strategy, Risk Analysis-Risk Analysis RSA public/private keys, RSA Public and Private Keys .run_mean_reversion_strategy() method, Long-Only Backtesting Class, Long-Short Backtesting Class .run_simulation() method, Kelly Criterion in Binomial Setting S S&P 500, Algorithmic Trading-Algorithmic Tradinglogistic regression-based strategies and, Generalizing the Approach momentum strategies, Getting into the Basics passive long position in, Kelly Criterion for Stocks and Indices-Kelly Criterion for Stocks and Indices scatter objects, Three Real-Time Streams scientific stack, NumPy and Vectorization, Python, NumPy, matplotlib, pandas scikit-learn, Linear Regression with scikit-learn ScikitBacktester class, Generalizing the Approach-Generalizing the Approach SciPy package project, NumPy and Vectorization seaborn library, matplotlib-matplotlib simple moving averages (SMAs), pandas and the DataFrame Class, Simple Moving Averagestrading strategies based on, Strategies Based on Simple Moving Averages-Generalizing the Approach visualization with price ticks, Three Real-Time Streams .simulate_value() method, Running a Simple Tick Data Server Singer, Paul, CFD Trading with Oanda sockets, real-time data and, Working with Real-Time Data and Sockets-Sample Data Server for Bar Plot sorting list objects, Data Structures SQLite3, Storing Data with SQLite3-Storing Data with SQLite3 SSL certificate, RSA Public and Private Keys storage (see data storage) streaming bar plots, Streaming Data as Bars, Sample Data Server for Bar Plot streaming dataOanda 
and, Working with Streaming Data visualization with Plotly, Visualizing Streaming Data with Plotly-Streaming Data as Bars string objects (str), Data Types-Data Types Swiss Franc event, CFD Trading with Oanda systematic macro hedge funds, Algorithmic Trading T TensorFlow, Using Deep Learning for Market Movement Prediction, Using Deep Neural Networks to Predict Market Direction Thomas, Rob, Working with Financial Data Thorp, Edward, Capital Management tick data client, Connecting a Simple Tick Data Client tick data server, Running a Simple Tick Data Server-Running a Simple Tick Data Server, Sample Tick Data Server time series data setspandas and vectorization, Vectorization with pandas price prediction based on, The Basic Idea for Price Prediction-The Basic Idea for Price Prediction Python script for generating sample set, Python Scripts SQLite3 for storage of, Storing Data with SQLite3-Storing Data with SQLite3 TsTables for storing, Using TsTables-Using TsTables time series momentum strategies, Strategies Based on Momentum(see also momentum strategies) .to_hdf() method, Storing DataFrame Objects tpqoa wrapper package, The Oanda API, Working with Streaming Data trading platforms, factors influencing choice of, CFD Trading with Oanda trading strategies, Trading Strategies-Conclusions(see also specific strategies) implementing in real time with Oanda, Implementing Trading Strategies in Real Time-Implementing Trading Strategies in Real Time machine learning/deep learning, Machine and Deep Learning mean-reversion, NumPy and Vectorization momentum, Momentum simple moving averages, Simple Moving Averages trading, motives for, Algorithmic Trading transaction costs, Long-Only Backtesting Class, Vectorized Backtesting TsTables package, Using TsTables-Using TsTables tuple objects, Data Structures U Ubuntu, Building a Ubuntu and Python Docker Image-Building a Ubuntu and Python Docker Image universal functions, NumPy, ndarray Methods and NumPy Functions V v20 wrapper package, The Oanda API, ML-Based Trading Strategy-Persisting the Model Object, Vectorized Backtesting value-at-risk (VAR), Risk Analysis-Risk Analysis vectorization, NumPy and Vectorization, Strategies Based on Mean Reversion-Generalizing the Approach vectorized backtestingdata snooping and overfitting, Data Snooping and Overfitting-Conclusions ML-based trading strategy, Vectorized Backtesting-Vectorized Backtesting momentum-based trading strategies, Strategies Based on Momentum-Generalizing the Approach potential shortcomings, Building Classes for Event-Based Backtesting Python code with a class for vectorized backtesting of mean-reversion trading strategies, Momentum Backtesting Class Python scripts for, Python Scripts-Mean Reversion Backtesting Class, Linear Regression Backtesting Class regression-based strategy, Vectorized Backtesting of Regression-Based Strategy trading strategies based on simple moving averages, Strategies Based on Simple Moving Averages-Generalizing the Approach vectorization with NumPy, Vectorization with NumPy-Vectorization with NumPy vectorization with pandas, Vectorization with pandas-Vectorization with pandas vectorized operations, Vectorized Operations virtual environment management, Conda as a Virtual Environment Manager-Conda as a Virtual Environment Manager W while loops, Control Structures Z ZeroMQ, Working with Real-Time Data and Sockets About the Author Dr.

Index

A
absolute maximum drawdown, Case Study
AdaBoost algorithm, Vectorized Backtesting
addition (+) operator, Data Types
adjusted return appraisal ratio, Algorithmic Trading
algorithmic trading (generally)
  advantages of, Algorithmic Trading
  basics, Algorithmic Trading-Algorithmic Trading
  strategies, Trading Strategies-Conclusions
alpha seeking strategies, Trading Strategies
alpha, defined, Algorithmic Trading
anonymous functions, Python Idioms
API key, for data sets, Working with Open Data Sources-Working with Open Data Sources
Apple, Inc.
  intraday stock prices, Getting into the Basics
  reading stock price data from different sources, Reading Financial Data From Different Sources-Reading from Excel and JSON
  retrieving historical unstructured data about, Retrieving Historical Unstructured Data-Retrieving Historical Unstructured Data
app_key, for Eikon Data API, Eikon Data API
AQR Capital Management, pandas and the DataFrame Class
arithmetic operations, Data Types
array programming, Making Use of Vectorization
  (see also vectorization)
automated trading operations, Automating Trading Operations-Strategy Monitoring
  capital management, Capital Management-Kelly Criterion for Stocks and Indices
  configuring Oanda account, Configuring Oanda Account
  hardware setup, Setting Up the Hardware
  infrastructure and deployment, Infrastructure and Deployment
  logging and monitoring, Logging and Monitoring-Logging and Monitoring
  ML-based trading strategy, ML-Based Trading Strategy-Persisting the Model Object
  online algorithm, Online Algorithm-Online Algorithm
  Python environment setup, Setting Up the Python Environment
  Python scripts for, Python Script-Strategy Monitoring
  real-time monitoring, Real-Time Monitoring
  running code, Running the Code
  uploading code, Uploading the Code
  visual step-by-step overview, Visual Step-by-Step Overview-Real-Time Monitoring

B
backtesting
  based on simple moving averages, Strategies Based on Simple Moving Averages-Generalizing the Approach
  Python scripts for classification algorithm backtesting, Classification Algorithm Backtesting Class
  Python scripts for linear regression backtesting class, Linear Regression Backtesting Class
  vectorized (see vectorized backtesting)
BacktestLongShort class, Long-Short Backtesting Class, Long-Short Backtesting Class
bar charts, matplotlib
bar plots (see Plotly; streaming bar plot)
base class, for event-based backtesting, Backtesting Base Class-Backtesting Base Class, Backtesting Base Class
Bash script, Building a Ubuntu and Python Docker Image
  for Droplet set-up, Script to Orchestrate the Droplet Set Up-Script to Orchestrate the Droplet Set Up
  for Python/Jupyter Lab installation, Installation Script for Python and Jupyter Lab-Installation Script for Python and Jupyter Lab
Bitcoin, pandas and the DataFrame Class, Working with Open Data Sources
Boolean operations
  NumPy, Boolean Operations
  pandas, Boolean Operations

C
callback functions, Retrieving Streaming Data
capital management
  automated trading operations and, Capital Management-Kelly Criterion for Stocks and Indices
  Kelly criterion for stocks and indices, Kelly Criterion for Stocks and Indices-Kelly Criterion for Stocks and Indices
  Kelly criterion in binomial setting, Kelly Criterion in Binomial Setting-Kelly Criterion in Binomial Setting
Carter, Graydon, FX Trading with FXCM
CFD (contracts for difference)
  algorithmic trading risks, Logging and Monitoring
  defined, CFD Trading with Oanda
  risks of losses, Long-Short Backtesting Class
  risks of trading on margin, FX Trading with FXCM
  trading with Oanda, CFD Trading with Oanda-Python Script
  (see also Oanda)
classification problems
  machine learning for, A Simple Classification Problem-A Simple Classification Problem
  neural networks for, The Simple Classification Problem Revisited-The Simple Classification Problem Revisited
  Python scripts for vectorized backtesting, Classification Algorithm Backtesting Class
.close_all() method, Placing Orders
cloud instances, Using Cloud Instances-Script to Orchestrate the Droplet Set Up
  installation script for Python and Jupyter Lab, Installation Script for Python and Jupyter Lab-Installation Script for Python and Jupyter Lab
  Jupyter Notebook configuration file, Jupyter Notebook Configuration File
  RSA public/private keys, RSA Public and Private Keys
  script to orchestrate Droplet set-up, Script to Orchestrate the Droplet Set Up-Script to Orchestrate the Droplet Set Up
Cocteau, Jean, Building Classes for Event-Based Backtesting
comma separated value (CSV) files (see CSV files)
conda
  as package manager, Conda as a Package Manager-Basic Operations with Conda
  as virtual environment manager, Conda as a Virtual Environment Manager-Conda as a Virtual Environment Manager
  basic operations, Basic Operations with Conda-Basic Operations with Conda
  installing Miniconda, Installing Miniconda-Installing Miniconda
conda remove, Basic Operations with Conda
configparser module, The Oanda API
containers (see Docker containers)
contracts for difference (see CFD)
control structures, Control Structures
CPython, Python for Finance, Python Infrastructure
.create_market_buy_order() method, Placing Orders
.create_order() method, Placing Market Orders-Placing Market Orders
cross-sectional momentum strategies, Strategies Based on Momentum
CSV files
  input-output operations, Input-Output Operations-Input-Output Operations
  reading from a CSV file with pandas, Reading from a CSV File with pandas
  reading from a CSV file with Python, Reading from a CSV File with Python-Reading from a CSV File with Python
.cummax() method, Case Study
currency pairs, Logging and Monitoring
  (see also EUR/USD exchange rate)
  algorithmic trading risks, Logging and Monitoring

D
data science stack, Python, NumPy, matplotlib, pandas
data snooping, Data Snooping and Overfitting
data storage
  SQLite3 for, Storing Data with SQLite3-Storing Data with SQLite3
  storing data efficiently, Storing Financial Data Efficiently-Storing Data with SQLite3
  storing DataFrame objects, Storing DataFrame Objects-Storing DataFrame Objects
  TsTables package for, Using TsTables-Using TsTables
data structures, Data Structures-Data Structures
DataFrame class, pandas and the DataFrame Class-pandas and the DataFrame Class, Reading from a CSV File with pandas, DataFrame Class-DataFrame Class
DataFrame objects
  creating, Vectorization with pandas
  storing, Storing DataFrame Objects-Storing DataFrame Objects
dataism, Preface
DatetimeIndex() constructor, Plotting with pandas
decision tree classification algorithm, Vectorized Backtesting
deep learning
  adding features to analysis, Adding Different Types of Features-Adding Different Types of Features
  classification problem, The Simple Classification Problem Revisited-The Simple Classification Problem Revisited
  deep neural networks for predicting market direction, Using Deep Neural Networks to Predict Market Direction-Adding Different Types of Features
  market movement prediction, Using Deep Learning for Market Movement Prediction-Adding Different Types of Features
  trading strategies and, Machine and Deep Learning
deep neural networks, Using Deep Neural Networks to Predict Market Direction-Adding Different Types of Features
delta hedging, Algorithmic Trading
dense neural network (DNN), The Simple Classification Problem Revisited, Using Deep Neural Networks to Predict Market Direction
dictionary (dict) objects, Reading from a CSV File with Python, Data Structures
DigitalOcean
  cloud instances, Using Cloud Instances-Script to Orchestrate the Droplet Set Up
  droplet setup, Setting Up the Hardware
DNN (dense neural network), The Simple Classification Problem Revisited, Using Deep Neural Networks to Predict Market Direction
Docker containers, Using Docker Containers-Building a Ubuntu and Python Docker Image
  building a Ubuntu and Python Docker image, Building a Ubuntu and Python Docker Image-Building a Ubuntu and Python Docker Image
  defined, Docker Images and Containers
  Docker images versus, Docker Images and Containers
Docker images
  defined, Docker Images and Containers
  Docker containers versus, Docker Images and Containers
Dockerfile, Building a Ubuntu and Python Docker Image-Building a Ubuntu and Python Docker Image
Domingos, Pedro, Automating Trading Operations
Droplet, Using Cloud Instances
  costs, Infrastructure and Deployment
  script to orchestrate set-up, Script to Orchestrate the Droplet Set Up-Script to Orchestrate the Droplet Set Up
dynamic hedging, Algorithmic Trading

E
efficient market hypothesis, Predicting Market Movements with Machine Learning
Eikon Data API, Eikon Data API-Retrieving Historical Unstructured Data
  retrieving historical structured data, Retrieving Historical Structured Data-Retrieving Historical Structured Data
  retrieving historical unstructured data, Retrieving Historical Unstructured Data-Retrieving Historical Unstructured Data
Euler discretization, Python Versus Pseudo-Code
EUR/USD exchange rate
  backtesting momentum strategy on minute bars, Backtesting a Momentum Strategy on Minute Bars-Backtesting a Momentum Strategy on Minute Bars
  evaluation of regression-based strategy, Generalizing the Approach
  factoring in leverage/margin, Factoring In Leverage and Margin-Factoring In Leverage and Margin
  gross performance versus deep learning-based strategy, Using Deep Neural Networks to Predict Market Direction-Using Deep Neural Networks to Predict Market Direction, Adding Different Types of Features-Adding Different Types of Features
  historical ask close prices, Retrieving Historical Data-Retrieving Historical Data
  historical candles data for, Retrieving Candles Data
  historical tick data for, Retrieving Tick Data
  implementing trading strategies in real time, Implementing Trading Strategies in Real Time-Implementing Trading Strategies in Real Time
  logistic regression-based strategies, Generalizing the Approach
  placing orders, Placing Orders-Placing Orders
  predicting, Predicting Index Levels-Predicting Index Levels
  predicting future returns, Predicting Future Returns-Predicting Future Returns
  predicting index levels, Predicting Index Levels-Predicting Index Levels
  retrieving streaming data for, Retrieving Streaming Data
  retrieving trading account information, Retrieving Account Information-Retrieving Account Information
  SMA calculation, Getting into the Basics-Generalizing the Approach
  vectorized backtesting of ML-based trading strategy, Vectorized Backtesting-Vectorized Backtesting
  vectorized backtesting of regression-based strategy, Vectorized Backtesting of Regression-Based Strategy
event-based backtesting, Building Classes for Event-Based Backtesting-Long-Short Backtesting Class
  advantages, Building Classes for Event-Based Backtesting
  base class, Backtesting Base Class-Backtesting Base Class, Backtesting Base Class
  building classes for, Building Classes for Event-Based Backtesting-Long-Short Backtesting Class
  long-only backtesting class, Long-Only Backtesting Class-Long-Only Backtesting Class, Long-Only Backtesting Class
  long-short backtesting class, Long-Short Backtesting Class-Long-Short Backtesting Class, Long-Short Backtesting Class
  Python scripts for, Backtesting Base Class-Long-Short Backtesting Class
Excel
  exporting financial data to, Exporting to Excel and JSON
  reading financial data from, Reading from Excel and JSON

F
features
  adding different types, Adding Different Types of Features-Adding Different Types of Features
  lags and, Using Logistic Regression to Predict Market Direction
financial data, working with, Working with Financial Data-Python Scripts
  data set for examples, The Data Set
  Eikon Data API, Eikon Data API-Retrieving Historical Unstructured Data
  exporting to Excel/JSON, Exporting to Excel and JSON
  open data sources, Working with Open Data Sources-Working with Open Data Sources
  reading data from different sources, Reading Financial Data From Different Sources-Reading from Excel and JSON
  reading data from Excel/JSON, Reading from Excel and JSON
  reading from a CSV file with pandas, Reading from a CSV File with pandas
  reading from a CSV file with Python, Reading from a CSV File with Python-Reading from a CSV File with Python
  storing data efficiently, Storing Financial Data Efficiently-Storing Data with SQLite3
.flatten() method, matplotlib
foreign exchange trading (see FX trading; FXCM)
future returns, predicting, Predicting Future Returns-Predicting Future Returns
FX trading, FX Trading with FXCM-References and Further Resources
  (see also EUR/USD exchange rate)
FXCM
  FX trading, FX Trading with FXCM-References and Further Resources
  getting started, Getting Started
  placing orders, Placing Orders-Placing Orders
  retrieving account information, Account Information
  retrieving candles data, Retrieving Candles Data-Retrieving Candles Data
  retrieving data, Retrieving Data-Retrieving Candles Data
  retrieving historical data, Retrieving Historical Data-Retrieving Historical Data
  retrieving streaming data, Retrieving Streaming Data
  retrieving tick data, Retrieving Tick Data-Retrieving Tick Data
  working with the API, Working with the API-Account Information
fxcmpy wrapper package
  callback functions, Retrieving Streaming Data
  installing, Getting Started
  tick data retrieval, Retrieving Tick Data
fxTrade, CFD Trading with Oanda

G
GDX (VanEck Vectors Gold Miners ETF)
  logistic regression-based strategies, Generalizing the Approach
  mean-reversion strategies, Getting into the Basics-Generalizing the Approach
  regression-based strategies, Generalizing the Approach
generate_sample_data(), Storing Financial Data Efficiently
.get_account_summary() method, Retrieving Account Information
.get_candles() method, Retrieving Historical Data
.get_data() method, Backtesting Base Class, Retrieving Tick Data
.get_date_price() method, Backtesting Base Class
.get_instruments() method, Looking Up Instruments Available for Trading
.get_last_price() method, Retrieving Streaming Data
.get_raw_data() method, Retrieving Tick Data
get_timeseries() function, Retrieving Historical Structured Data
.get_transactions() method, Retrieving Account Information
GLD (SPDR Gold Shares)
  logistic regression-based strategies, Using Logistic Regression to Predict Market Direction-Using Logistic Regression to Predict Market Direction
  mean-reversion strategies, Getting into the Basics-Generalizing the Approach
gold price
  mean-reversion strategies, Getting into the Basics-Getting into the Basics
  momentum strategy and, Getting into the Basics-Getting into the Basics, Generalizing the Approach-Generalizing the Approach
Goldman Sachs, Python and Algorithmic Trading, Algorithmic Trading
.go_long() method, Long-Short Backtesting Class

H
half Kelly criterion, Optimal Leverage
Harari, Yuval Noah, Preface
HDF5 binary storage library, Using TsTables-Using TsTables
HDFStore wrapper, Storing DataFrame Objects-Storing DataFrame Objects
high frequency trading (HFT), Algorithmic Trading
histograms, matplotlib
hit ratio, defined, Vectorized Backtesting

I
if-elif-else control structure, Python Idioms
in-sample fitting, Generalizing the Approach
index levels, predicting, Predicting Index Levels-Predicting Index Levels
infrastructure (see Python infrastructure)
installation script, Python/Jupyter Lab, Installation Script for Python and Jupyter Lab-Installation Script for Python and Jupyter Lab
Intel Math Kernel Library, Basic Operations with Conda
iterations, Control Structures

J
JSON
  exporting financial data to, Exporting to Excel and JSON
  reading financial data from, Reading from Excel and JSON
Jupyter Lab
  installation script for, Installation Script for Python and Jupyter Lab-Installation Script for Python and Jupyter Lab
  RSA public/private keys for, RSA Public and Private Keys
  tools included, Using Cloud Instances
Jupyter Notebook, Jupyter Notebook Configuration File

K
Kelly criterion
  in binomial setting, Kelly Criterion in Binomial Setting-Kelly Criterion in Binomial Setting
  optimal leverage, Optimal Leverage-Optimal Leverage
  stocks and indices, Kelly Criterion for Stocks and Indices-Kelly Criterion for Stocks and Indices
Keras, Using Deep Learning for Market Movement Prediction, Using Deep Neural Networks to Predict Market Direction, Adding Different Types of Features
key-value stores, Data Structures
keys, public/private, RSA Public and Private Keys

L
lags, The Basic Idea for Price Prediction, Using Logistic Regression to Predict Market Direction
lambda functions, Python Idioms
LaTeX, Python Versus Pseudo-Code
leveraged trading, risks of, Factoring In Leverage and Margin, FX Trading with FXCM, Optimal Leverage
linear regression
  generalizing the approach, Generalizing the Approach
  market movement prediction, Using Linear Regression for Market Movement Prediction-Generalizing the Approach
  predicting future market direction, Predicting Future Market Direction
  predicting future returns, Predicting Future Returns-Predicting Future Returns
  predicting index levels, Predicting Index Levels-Predicting Index Levels
  price prediction based on time series data, The Basic Idea for Price Prediction-The Basic Idea for Price Prediction
  review of, A Quick Review of Linear Regression
  scikit-learn and, Linear Regression with scikit-learn
  vectorized backtesting of regression-based strategy, Vectorized Backtesting of Regression-Based Strategy, Linear Regression Backtesting Class
list comprehension, Python Idioms
list constructor, Data Structures
list objects, Reading from a CSV File with Python, Data Structures, Regular ndarray Object
logging, of automated trading operations, Logging and Monitoring-Logging and Monitoring
logistic regression
  generalizing the approach, Generalizing the Approach-Generalizing the Approach
  market direction prediction, Using Logistic Regression to Predict Market Direction-Using Logistic Regression to Predict Market Direction
  Python script for vectorized backtesting, Classification Algorithm Backtesting Class
long-only backtesting class, Long-Only Backtesting Class-Long-Only Backtesting Class, Long-Only Backtesting Class
long-short backtesting class, Long-Short Backtesting Class-Long-Short Backtesting Class, Long-Short Backtesting Class
longest drawdown period, Risk Analysis

M
machine learning
  classification problem, A Simple Classification Problem-A Simple Classification Problem
  linear regression with scikit-learn, Linear Regression with scikit-learn
  market movement prediction, Using Machine Learning for Market Movement Prediction-Generalizing the Approach
  ML-based trading strategy, ML-Based Trading Strategy-Persisting the Model Object
  Python scripts, Linear Regression Backtesting Class
  trading strategies and, Machine and Deep Learning
  using logistic regression to predict market direction, Using Logistic Regression to Predict Market Direction-Using Logistic Regression to Predict Market Direction
macro hedge funds, algorithmic trading and, Algorithmic Trading
__main__ method, Backtesting Base Class
margin trading, FX Trading with FXCM
market direction prediction, Predicting Future Market Direction
market movement prediction
  deep learning for, Using Deep Learning for Market Movement Prediction-Adding Different Types of Features
  deep neural networks for, Using Deep Neural Networks to Predict Market Direction-Adding Different Types of Features
  linear regression for, Using Linear Regression for Market Movement Prediction-Generalizing the Approach
  linear regression with scikit-learn, Linear Regression with scikit-learn
  logistic regression to predict market direction, Using Logistic Regression to Predict Market Direction-Using Logistic Regression to Predict Market Direction
  machine learning for, Using Machine Learning for Market Movement Prediction-Generalizing the Approach
  predicting future market direction, Predicting Future Market Direction
  predicting future returns, Predicting Future Returns-Predicting Future Returns
  predicting index levels, Predicting Index Levels-Predicting Index Levels
  price prediction based on time series data, The Basic Idea for Price Prediction-The Basic Idea for Price Prediction
  vectorized backtesting of regression-based strategy, Vectorized Backtesting of Regression-Based Strategy
market orders, placing, Placing Market Orders-Placing Market Orders
math module, Data Types
mathematical functions, Data Types
matplotlib, matplotlib-matplotlib, Plotting with pandas-Plotting with pandas
maximum drawdown, Risk Analysis, Case Study
McKinney, Wes, pandas and the DataFrame Class
mean-reversion strategies, NumPy and Vectorization, Strategies Based on Mean Reversion-Generalizing the Approach
  basics, Getting into the Basics-Generalizing the Approach
  generalizing the approach, Generalizing the Approach
  Python code with a class for vectorized backtesting, Momentum Backtesting Class
Miniconda, Installing Miniconda-Installing Miniconda
mkl (Intel Math Kernel Library), Basic Operations with Conda
ML-based strategies, ML-Based Trading Strategy-Persisting the Model Object
  optimal leverage, Optimal Leverage-Optimal Leverage
  persisting the model object, Persisting the Model Object
  Python script for, Automated Trading Strategy
  risk analysis, Risk Analysis-Risk Analysis
  vectorized backtesting, Vectorized Backtesting-Vectorized Backtesting
MLPClassifier, The Simple Classification Problem Revisited
MLTrader class, Online Algorithm-Online Algorithm
momentum strategies, Momentum
  backtesting on minute bars, Backtesting a Momentum Strategy on Minute Bars-Backtesting a Momentum Strategy on Minute Bars
  basics, Getting into the Basics-Getting into the Basics
  generalizing the approach, Generalizing the Approach
  Python code with a class for vectorized backtesting, Momentum Backtesting Class
  Python script for custom streaming class, Python Script
  Python script for momentum online algorithm, Momentum Online Algorithm
  vectorized backtesting of, Strategies Based on Momentum-Generalizing the Approach
MomentumTrader class, Implementing Trading Strategies in Real Time-Implementing Trading Strategies in Real Time
MomVectorBacktester class, Generalizing the Approach
monitoring
  automated trading operations, Logging and Monitoring-Logging and Monitoring, Real-Time Monitoring
  Python scripts for strategy monitoring, Strategy Monitoring
Monte Carlo simulation
  sample tick data server, Sample Tick Data Server
  time series data based on, Python Scripts
motives, for trading, Algorithmic Trading
MRVectorBacktester class, Generalizing the Approach
multi-layer perceptron, The Simple Classification Problem Revisited
Musashi, Miyamoto, Python Infrastructure

N
natural language processing (NLP), Retrieving Historical Unstructured Data
ndarray class, Vectorization with NumPy-Vectorization with NumPy
ndarray objects, NumPy and Vectorization, ndarray Methods and NumPy Functions-ndarray Methods and NumPy Functions
  creating, ndarray Creation
  linear regression and, A Quick Review of Linear Regression
  regular, Regular ndarray Object
nested structures, Data Structures
NLP (natural language processing), Retrieving Historical Unstructured Data
np.arange(), ndarray Creation
numbers, data typing of, Data Types
numerical operations, pandas, Numerical Operations
NumPy, NumPy and Vectorization-NumPy and Vectorization, NumPy-Random Numbers
  Boolean operations, Boolean Operations
  ndarray creation, ndarray Creation
  ndarray methods, ndarray Methods and NumPy Functions-ndarray Methods and NumPy Functions
  random numbers, Random Numbers
  regular ndarray object, Regular ndarray Object
  universal functions, ndarray Methods and NumPy Functions
  vectorization, Vectorization with NumPy-Vectorization with NumPy
  vectorized operations, Vectorized Operations
numpy.random sub-package, Random Numbers
NYSE Arca Gold Miners Index, Getting into the Basics

O
Oanda
  account configuration, Configuring Oanda Account
  account setup, Setting Up an Account
  API access, The Oanda API-The Oanda API
  backtesting momentum strategy on minute bars, Backtesting a Momentum Strategy on Minute Bars-Backtesting a Momentum Strategy on Minute Bars
  CFD trading, CFD Trading with Oanda-Python Script
  factoring in leverage/margin with historical data, Factoring In Leverage and Margin-Factoring In Leverage and Margin
  implementing trading strategies in real time, Implementing Trading Strategies in Real Time-Implementing Trading Strategies in Real Time
  looking up instruments available for trading, Looking Up Instruments Available for Trading
  placing market orders, Placing Market Orders-Placing Market Orders
  Python script for custom streaming class, Python Script
  retrieving account information, Retrieving Account Information-Retrieving Account Information
  retrieving historical data, Retrieving Historical Data-Factoring In Leverage and Margin
  working with streaming data, Working with Streaming Data
Oanda v20 RESTful API, The Oanda API, ML-Based Trading Strategy-Persisting the Model Object, Vectorized Backtesting
offline algorithm
  defined, Signal Generation in Real Time
  transformation to online algorithm, Online Algorithm
OLS (ordinary least squares) regression, matplotlib
online algorithm
  automated trading operations, Online Algorithm-Online Algorithm
  defined, Signal Generation in Real Time
  Python script for momentum online algorithm, Momentum Online Algorithm
  signal generation in real time, Signal Generation in Real Time-Signal Generation in Real Time
  transformation of offline algorithm to, Online Algorithm
.on_success() method, Implementing Trading Strategies in Real Time, Online Algorithm
open data sources, Working with Open Data Sources-Working with Open Data Sources
ordinary least squares (OLS) regression, matplotlib
out-of-sample evaluation, Generalizing the Approach
overfitting, Data Snooping and Overfitting

P
package manager, conda as, Conda as a Package Manager-Basic Operations with Conda
pandas, pandas and the DataFrame Class-pandas and the DataFrame Class, pandas-Input-Output Operations
  Boolean operations, Boolean Operations
  case study, Case Study-Case Study
  data selection, Data Selection-Data Selection
  DataFrame class, DataFrame Class-DataFrame Class
  exporting financial data to Excel/JSON, Exporting to Excel and JSON
  input-output operations, Input-Output Operations-Input-Output Operations
  numerical operations, Numerical Operations
  plotting, Plotting with pandas-Plotting with pandas
  reading financial data from Excel/JSON, Reading from Excel and JSON
  reading from a CSV file, Reading from a CSV File with pandas
  storing DataFrame objects, Storing DataFrame Objects-Storing DataFrame Objects
  vectorization, Vectorization with pandas-Vectorization with pandas
password protection, for Jupyter Lab, Jupyter Notebook Configuration File
.place_buy_order() method, Backtesting Base Class
.place_sell_order() method, Backtesting Base Class
Plotly
  basics, The Basics
  multiple real-time streams for, Three Real-Time Streams
  multiple sub-plots for streams, Three Sub-Plots for Three Streams
  streaming data as bars, Streaming Data as Bars
  visualization of streaming data, Visualizing Streaming Data with Plotly-Streaming Data as Bars
plotting, with pandas, Plotting with pandas-Plotting with pandas
.plot_data() method, Backtesting Base Class
polyfit()/polyval() convenience functions, matplotlib
price prediction, based on time series data, The Basic Idea for Price Prediction-The Basic Idea for Price Prediction
.print_balance() method, Backtesting Base Class
.print_net_wealth() method, Backtesting Base Class
.print_transactions() method, Retrieving Account Information
pseudo-code, Python versus, Python Versus Pseudo-Code
publisher-subscriber (PUB-SUB) pattern, Working with Real-Time Data and Sockets
Python (generally)
  advantages of, Python for Algorithmic Trading
  basics, Python and Algorithmic Trading-References and Further Resources
  control structures, Control Structures
  data structures, Data Structures-Data Structures
  data types, Data Types-Data Types
  deployment difficulties, Python Infrastructure
  idioms, Python Idioms-Python Idioms
  NumPy and vectorization, NumPy and Vectorization-NumPy and Vectorization
  obstacles to adoption in financial industry, Python for Finance
  origins, Python for Finance
  pandas and DataFrame class, pandas and the DataFrame Class-pandas and the DataFrame Class
  pseudo-code versus, Python Versus Pseudo-Code
  reading from a CSV file, Reading from a CSV File with Python-Reading from a CSV File with Python
Python infrastructure, Python Infrastructure-References and Further Resources
  conda as package manager, Conda as a Package Manager-Basic Operations with Conda
  conda as virtual environment manager, Conda as a Virtual Environment Manager-Conda as a Virtual Environment Manager
  Docker containers, Using Docker Containers-Building a Ubuntu and Python Docker Image
  using cloud instances, Using Cloud Instances-Script to Orchestrate the Droplet Set Up
Python scripts
  automated trading operations, Running the Code, Python Script-Strategy Monitoring
  backtesting base class, Backtesting Base Class
  custom streaming class that trades a momentum strategy, Python Script
  linear regression backtesting class, Linear Regression Backtesting Class
  long-only backtesting class, Long-Only Backtesting Class
  long-short backtesting class, Long-Short Backtesting Class
  real-time data handling, Python Scripts-Sample Data Server for Bar Plot
  sample time series data set, Python Scripts
  strategy monitoring, Strategy Monitoring
  uploading for automated trading operations, Uploading the Code
  vectorized backtesting, Python Scripts-Mean Reversion Backtesting Class

Q
Quandl
  premium data sets, Working with Open Data Sources
  working with open data sources, Working with Open Data Sources-Working with Open Data Sources

R
random numbers, Random Numbers
random walk hypothesis, Predicting Index Levels
range (iterator object), Control Structures
read_csv() function, Reading from a CSV File with pandas
real-time data, Working with Real-Time Data and Sockets-Sample Data Server for Bar Plot
  Python script for handling, Python Scripts-Sample Data Server for Bar Plot
  signal generation in real time, Signal Generation in Real Time-Signal Generation in Real Time
  tick data client for, Connecting a Simple Tick Data Client
  tick data server for, Running a Simple Tick Data Server-Running a Simple Tick Data Server, Sample Tick Data Server
  visualizing streaming data with Plotly, Visualizing Streaming Data with Plotly-Streaming Data as Bars
real-time monitoring, Real-Time Monitoring
Refinitiv, Eikon Data API
relative maximum drawdown, Case Study
returns, predicting future, Predicting Future Returns-Predicting Future Returns
risk analysis, for ML-based trading strategy, Risk Analysis-Risk Analysis
RSA public/private keys, RSA Public and Private Keys
.run_mean_reversion_strategy() method, Long-Only Backtesting Class, Long-Short Backtesting Class
.run_simulation() method, Kelly Criterion in Binomial Setting

S
S&P 500, Algorithmic Trading-Algorithmic Trading
  logistic regression-based strategies and, Generalizing the Approach
  momentum strategies, Getting into the Basics
  passive long position in, Kelly Criterion for Stocks and Indices-Kelly Criterion for Stocks and Indices
scatter objects, Three Real-Time Streams
scientific stack, NumPy and Vectorization, Python, NumPy, matplotlib, pandas
scikit-learn, Linear Regression with scikit-learn
ScikitBacktester class, Generalizing the Approach-Generalizing the Approach
SciPy package project, NumPy and Vectorization
seaborn library, matplotlib-matplotlib
simple moving averages (SMAs), pandas and the DataFrame Class, Simple Moving Averages
  trading strategies based on, Strategies Based on Simple Moving Averages-Generalizing the Approach
  visualization with price ticks, Three Real-Time Streams
.simulate_value() method, Running a Simple Tick Data Server
Singer, Paul, CFD Trading with Oanda
sockets, real-time data and, Working with Real-Time Data and Sockets-Sample Data Server for Bar Plot
sorting list objects, Data Structures
SQLite3, Storing Data with SQLite3-Storing Data with SQLite3
SSL certificate, RSA Public and Private Keys
storage (see data storage)
streaming bar plots, Streaming Data as Bars, Sample Data Server for Bar Plot
streaming data
  Oanda and, Working with Streaming Data
  visualization with Plotly, Visualizing Streaming Data with Plotly-Streaming Data as Bars
string objects (str), Data Types-Data Types
Swiss Franc event, CFD Trading with Oanda
systematic macro hedge funds, Algorithmic Trading

T
TensorFlow, Using Deep Learning for Market Movement Prediction, Using Deep Neural Networks to Predict Market Direction
Thomas, Rob, Working with Financial Data
Thorp, Edward, Capital Management
tick data client, Connecting a Simple Tick Data Client
tick data server, Running a Simple Tick Data Server-Running a Simple Tick Data Server, Sample Tick Data Server
time series data sets
  pandas and vectorization, Vectorization with pandas
  price prediction based on, The Basic Idea for Price Prediction-The Basic Idea for Price Prediction
  Python script for generating sample set, Python Scripts
  SQLite3 for storage of, Storing Data with SQLite3-Storing Data with SQLite3
  TsTables for storing, Using TsTables-Using TsTables
time series momentum strategies, Strategies Based on Momentum
  (see also momentum strategies)
.to_hdf() method, Storing DataFrame Objects
tpqoa wrapper package, The Oanda API, Working with Streaming Data
trading platforms, factors influencing choice of, CFD Trading with Oanda
trading strategies, Trading Strategies-Conclusions
  (see also specific strategies)
  implementing in real time with Oanda, Implementing Trading Strategies in Real Time-Implementing Trading Strategies in Real Time
  machine learning/deep learning, Machine and Deep Learning
  mean-reversion, NumPy and Vectorization
  momentum, Momentum
  simple moving averages, Simple Moving Averages
trading, motives for, Algorithmic Trading
transaction costs, Long-Only Backtesting Class, Vectorized Backtesting
TsTables package, Using TsTables-Using TsTables
tuple objects, Data Structures

U
Ubuntu, Building a Ubuntu and Python Docker Image-Building a Ubuntu and Python Docker Image
universal functions, NumPy, ndarray Methods and NumPy Functions

V
v20 wrapper package, The Oanda API, ML-Based Trading Strategy-Persisting the Model Object, Vectorized Backtesting
value-at-risk (VAR), Risk Analysis-Risk Analysis
vectorization, NumPy and Vectorization, Strategies Based on Mean Reversion-Generalizing the Approach
vectorized backtesting
  data snooping and overfitting, Data Snooping and Overfitting-Conclusions
  ML-based trading strategy, Vectorized Backtesting-Vectorized Backtesting
  momentum-based trading strategies, Strategies Based on Momentum-Generalizing the Approach
  potential shortcomings, Building Classes for Event-Based Backtesting
  Python code with a class for vectorized backtesting of mean-reversion trading strategies, Momentum Backtesting Class
  Python scripts for, Python Scripts-Mean Reversion Backtesting Class, Linear Regression Backtesting Class
  regression-based strategy, Vectorized Backtesting of Regression-Based Strategy
  trading strategies based on simple moving averages, Strategies Based on Simple Moving Averages-Generalizing the Approach
  vectorization with NumPy, Vectorization with NumPy-Vectorization with NumPy
  vectorization with pandas, Vectorization with pandas-Vectorization with pandas
  vectorized operations, Vectorized Operations
virtual environment management, Conda as a Virtual Environment Manager-Conda as a Virtual Environment Manager

W
while loops, Control Structures

Z
ZeroMQ, Working with Real-Time Data and Sockets

About the Author

Dr.

Bars-Backtesting a Momentum Strategy on Minute Bars basics, Getting into the Basics-Getting into the Basics generalizing the approach, Generalizing the Approach Python code with a class for vectorized backtesting, Momentum Backtesting Class Python script for custom streaming class, Python Script Python script for momentum online algorithm, Momentum Online Algorithm vectorized backtesting of, Strategies Based on Momentum-Generalizing the Approach MomentumTrader class, Implementing Trading Strategies in Real Time-Implementing Trading Strategies in Real Time MomVectorBacktester class, Generalizing the Approach monitoringautomated trading operations, Logging and Monitoring-Logging and Monitoring, Real-Time Monitoring Python scripts for strategy monitoring, Strategy Monitoring Monte Carlo simulationsample tick data server, Sample Tick Data Server time series data based on, Python Scripts motives, for trading, Algorithmic Trading MRVectorBacktester class, Generalizing the Approach multi-layer perceptron, The Simple Classification Problem Revisited Musashi, Miyamoto, Python Infrastructure N natural language processing (NLP), Retrieving Historical Unstructured Data ndarray class, Vectorization with NumPy-Vectorization with NumPy ndarray objects, NumPy and Vectorization, ndarray Methods and NumPy Functions-ndarray Methods and NumPy Functionscreating, ndarray Creation linear regression and, A Quick Review of Linear Regression regular, Regular ndarray Object nested structures, Data Structures NLP (natural language processing), Retrieving Historical Unstructured Data np.arange(), ndarray Creation numbers, data typing of, Data Types numerical operations, pandas, Numerical Operations NumPy, NumPy and Vectorization-NumPy and Vectorization, NumPy-Random NumbersBoolean operations, Boolean Operations ndarray creation, ndarray Creation ndarray methods, ndarray Methods and NumPy Functions-ndarray Methods and NumPy Functions random numbers, Random Numbers regular ndarray object, Regular ndarray Object universal functions, ndarray Methods and NumPy Functions vectorization, Vectorization with NumPy-Vectorization with NumPy vectorized operations, Vectorized Operations numpy.random sub-package, Random Numbers NYSE Arca Gold Miners Index, Getting into the Basics O Oandaaccount configuration, Configuring Oanda Account account setup, Setting Up an Account API access, The Oanda API-The Oanda API backtesting momentum strategy on minute bars, Backtesting a Momentum Strategy on Minute Bars-Backtesting a Momentum Strategy on Minute Bars CFD trading, CFD Trading with Oanda-Python Script factoring in leverage/margin with historical data, Factoring In Leverage and Margin-Factoring In Leverage and Margin implementing trading strategies in real time, Implementing Trading Strategies in Real Time-Implementing Trading Strategies in Real Time looking up instruments available for trading, Looking Up Instruments Available for Trading placing market orders, Placing Market Orders-Placing Market Orders Python script for custom streaming class, Python Script retrieving account information, Retrieving Account Information-Retrieving Account Information retrieving historical data, Retrieving Historical Data-Factoring In Leverage and Margin working with streaming data, Working with Streaming Data Oanda v20 RESTful API, The Oanda API, ML-Based Trading Strategy-Persisting the Model Object, Vectorized Backtesting offline algorithmdefined, Signal Generation in Real Time transformation to online algorithm, Online Algorithm OLS (ordinary least 
squares) regression, matplotlib online algorithmautomated trading operations, Online Algorithm-Online Algorithm defined, Signal Generation in Real Time Python script for momentum online algorithm, Momentum Online Algorithm signal generation in real time, Signal Generation in Real Time-Signal Generation in Real Time transformation of offline algorithm to, Online Algorithm .on_success() method, Implementing Trading Strategies in Real Time, Online Algorithm open data sources, Working with Open Data Sources-Working with Open Data Sources ordinary least squares (OLS) regression, matplotlib out-of-sample evaluation, Generalizing the Approach overfitting, Data Snooping and Overfitting P package manager, conda as, Conda as a Package Manager-Basic Operations with Conda pandas, pandas and the DataFrame Class-pandas and the DataFrame Class, pandas-Input-Output OperationsBoolean operations, Boolean Operations case study, Case Study-Case Study data selection, Data Selection-Data Selection DataFrame class, DataFrame Class-DataFrame Class exporting financial data to Excel/JSON, Exporting to Excel and JSON input-output operations, Input-Output Operations-Input-Output Operations numerical operations, Numerical Operations plotting, Plotting with pandas-Plotting with pandas reading financial data from Excel/JSON, Reading from Excel and JSON reading from a CSV file, Reading from a CSV File with pandas storing DataFrame objects, Storing DataFrame Objects-Storing DataFrame Objects vectorization, Vectorization with pandas-Vectorization with pandas password protection, for Jupyter lab, Jupyter Notebook Configuration File .place_buy_order() method, Backtesting Base Class .place_sell_order() method, Backtesting Base Class Plotlybasics, The Basics multiple real-time streams for, Three Real-Time Streams multiple sub-plots for streams, Three Sub-Plots for Three Streams streaming data as bars, Streaming Data as Bars visualization of streaming data, Visualizing Streaming Data with Plotly-Streaming Data as Bars plotting, with pandas, Plotting with pandas-Plotting with pandas .plot_data() method, Backtesting Base Class polyfit()/polyval() convenience functions, matplotlib price prediction, based on time series data, The Basic Idea for Price Prediction-The Basic Idea for Price Prediction .print_balance() method, Backtesting Base Class .print_net_wealth() method, Backtesting Base Class .print_transactions() method, Retrieving Account Information pseudo-code, Python versus, Python Versus Pseudo-Code publisher-subscriber (PUB-SUB) pattern, Working with Real-Time Data and Sockets Python (generally)advantages of, Python for Algorithmic Trading basics, Python and Algorithmic Trading-References and Further Resources control structures, Control Structures data structures, Data Structures-Data Structures data types, Data Types-Data Types deployment difficulties, Python Infrastructure idioms, Python Idioms-Python Idioms NumPy and vectorization, NumPy and Vectorization-NumPy and Vectorization obstacles to adoption in financial industry, Python for Finance origins, Python for Finance pandas and DataFrame class, pandas and the DataFrame Class-pandas and the DataFrame Class pseudo-code versus, Python Versus Pseudo-Code reading from a CSV file, Reading from a CSV File with Python-Reading from a CSV File with Python Python infrastructure, Python Infrastructure-References and Further Resourcesconda as package manager, Conda as a Package Manager-Basic Operations with Conda conda as virtual environment manager, Conda as a Virtual Environment 
Manager-Conda as a Virtual Environment Manager Docker containers, Using Docker Containers-Building a Ubuntu and Python Docker Image using cloud instances, Using Cloud Instances-Script to Orchestrate the Droplet Set Up Python scriptsautomated trading operations, Running the Code, Python Script-Strategy Monitoring backtesting base class, Backtesting Base Class custom streaming class that trades a momentum strategy, Python Script linear regression backtesting class, Linear Regression Backtesting Class long-only backtesting class, Long-Only Backtesting Class long-short backtesting class, Long-Short Backtesting Class real-time data handling, Python Scripts-Sample Data Server for Bar Plot sample time series data set, Python Scripts strategy monitoring, Strategy Monitoring uploading for automated trading operations, Uploading the Code vectorized backtesting, Python Scripts-Mean Reversion Backtesting Class Q Quandlpremium data sets, Working with Open Data Sources working with open data sources, Working with Open Data Sources-Working with Open Data Sources R random numbers, Random Numbers random walk hypothesis, Predicting Index Levels range (iterator object), Control Structures read_csv() function, Reading from a CSV File with pandas real-time data, Working with Real-Time Data and Sockets-Sample Data Server for Bar PlotPython script for handling, Python Scripts-Sample Data Server for Bar Plot signal generation in real time, Signal Generation in Real Time-Signal Generation in Real Time tick data client for, Connecting a Simple Tick Data Client tick data server for, Running a Simple Tick Data Server-Running a Simple Tick Data Server, Sample Tick Data Server visualizing streaming data with Plotly, Visualizing Streaming Data with Plotly-Streaming Data as Bars real-time monitoring, Real-Time Monitoring Refinitiv, Eikon Data API relative maximum drawdown, Case Study returns, predicting future, Predicting Future Returns-Predicting Future Returns risk analysis, for ML-based trading strategy, Risk Analysis-Risk Analysis RSA public/private keys, RSA Public and Private Keys .run_mean_reversion_strategy() method, Long-Only Backtesting Class, Long-Short Backtesting Class .run_simulation() method, Kelly Criterion in Binomial Setting S S&P 500, Algorithmic Trading-Algorithmic Tradinglogistic regression-based strategies and, Generalizing the Approach momentum strategies, Getting into the Basics passive long position in, Kelly Criterion for Stocks and Indices-Kelly Criterion for Stocks and Indices scatter objects, Three Real-Time Streams scientific stack, NumPy and Vectorization, Python, NumPy, matplotlib, pandas scikit-learn, Linear Regression with scikit-learn ScikitBacktester class, Generalizing the Approach-Generalizing the Approach SciPy package project, NumPy and Vectorization seaborn library, matplotlib-matplotlib simple moving averages (SMAs), pandas and the DataFrame Class, Simple Moving Averagestrading strategies based on, Strategies Based on Simple Moving Averages-Generalizing the Approach visualization with price ticks, Three Real-Time Streams .simulate_value() method, Running a Simple Tick Data Server Singer, Paul, CFD Trading with Oanda sockets, real-time data and, Working with Real-Time Data and Sockets-Sample Data Server for Bar Plot sorting list objects, Data Structures SQLite3, Storing Data with SQLite3-Storing Data with SQLite3 SSL certificate, RSA Public and Private Keys storage (see data storage) streaming bar plots, Streaming Data as Bars, Sample Data Server for Bar Plot streaming dataOanda 
and, Working with Streaming Data visualization with Plotly, Visualizing Streaming Data with Plotly-Streaming Data as Bars string objects (str), Data Types-Data Types Swiss Franc event, CFD Trading with Oanda systematic macro hedge funds, Algorithmic Trading T TensorFlow, Using Deep Learning for Market Movement Prediction, Using Deep Neural Networks to Predict Market Direction Thomas, Rob, Working with Financial Data Thorp, Edward, Capital Management tick data client, Connecting a Simple Tick Data Client tick data server, Running a Simple Tick Data Server-Running a Simple Tick Data Server, Sample Tick Data Server time series data setspandas and vectorization, Vectorization with pandas price prediction based on, The Basic Idea for Price Prediction-The Basic Idea for Price Prediction Python script for generating sample set, Python Scripts SQLite3 for storage of, Storing Data with SQLite3-Storing Data with SQLite3 TsTables for storing, Using TsTables-Using TsTables time series momentum strategies, Strategies Based on Momentum(see also momentum strategies) .to_hdf() method, Storing DataFrame Objects tpqoa wrapper package, The Oanda API, Working with Streaming Data trading platforms, factors influencing choice of, CFD Trading with Oanda trading strategies, Trading Strategies-Conclusions(see also specific strategies) implementing in real time with Oanda, Implementing Trading Strategies in Real Time-Implementing Trading Strategies in Real Time machine learning/deep learning, Machine and Deep Learning mean-reversion, NumPy and Vectorization momentum, Momentum simple moving averages, Simple Moving Averages trading, motives for, Algorithmic Trading transaction costs, Long-Only Backtesting Class, Vectorized Backtesting TsTables package, Using TsTables-Using TsTables tuple objects, Data Structures U Ubuntu, Building a Ubuntu and Python Docker Image-Building a Ubuntu and Python Docker Image universal functions, NumPy, ndarray Methods and NumPy Functions V v20 wrapper package, The Oanda API, ML-Based Trading Strategy-Persisting the Model Object, Vectorized Backtesting value-at-risk (VAR), Risk Analysis-Risk Analysis vectorization, NumPy and Vectorization, Strategies Based on Mean Reversion-Generalizing the Approach vectorized backtestingdata snooping and overfitting, Data Snooping and Overfitting-Conclusions ML-based trading strategy, Vectorized Backtesting-Vectorized Backtesting momentum-based trading strategies, Strategies Based on Momentum-Generalizing the Approach potential shortcomings, Building Classes for Event-Based Backtesting Python code with a class for vectorized backtesting of mean-reversion trading strategies, Momentum Backtesting Class Python scripts for, Python Scripts-Mean Reversion Backtesting Class, Linear Regression Backtesting Class regression-based strategy, Vectorized Backtesting of Regression-Based Strategy trading strategies based on simple moving averages, Strategies Based on Simple Moving Averages-Generalizing the Approach vectorization with NumPy, Vectorization with NumPy-Vectorization with NumPy vectorization with pandas, Vectorization with pandas-Vectorization with pandas vectorized operations, Vectorized Operations virtual environment management, Conda as a Virtual Environment Manager-Conda as a Virtual Environment Manager W while loops, Control Structures Z ZeroMQ, Working with Real-Time Data and Sockets About the Author Dr.

pages: 252 words: 74,167

Thinking Machines: The Inside Story of Artificial Intelligence and Our Race to Build the Future
by Luke Dormehl
Published 10 Aug 2016

‘See it, THINK, and marvel at the mind of man and his machine,’ wrote one giddy reviewer, borrowing the ‘Think’ tagline that had been IBM’s since the 1920s. IBM showed off several impressive technologies at the event. One was a groundbreaking handwriting recognition computer, which the official fair brochure referred to as an ‘Optical Scanning and Information Retrieval’ system. This demo allowed visitors to write an historical date of their choosing (post-1851) in their own handwriting on a small card. That card was then fed into an ‘optical character reader’ where it was converted into digital form, and then relayed once more to a state-of-the-art IBM 1460 computer system.

Simon’s prediction was hopelessly off, but as it turns out, the second thing that registers about the World’s Fair is that IBM wasn’t wrong. All three of the technologies that dropped jaws in 1964 are commonplace today – despite our continued insistence that AI is not yet here. The Optical Scanning and Information Retrieval system has become the Internet: granting us access to more information at a moment’s notice than we could possibly hope to absorb in a lifetime. While we still cannot see the future, we are making enormous advances in this capacity, thanks to the huge datasets generated by users that offer constant forecasts about the news stories, books or songs that are likely to be of interest to us.

Another, called ANALOGY, did the same for the geometric questions found in IQ tests, while STUDENT cracked complex algebra story conundrums such as: ‘If the number of customers Tom gets is twice the square of 20 per cent of the number of advertisements he runs, and the number of advertisements he runs is 45, what is the number of customers Tom gets?’ A particularly impressive display of computational reasoning was a program called SIR (standing for Semantic Information Retrieval). SIR appeared to understand English sentences and was even able to learn relationships between objects in a way that resembled real intelligence. In reality, this ‘knowledge’ relied on a series of pre-programmed templates, such as A is a part of B, with nouns substituting for the variables.
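
For the record, STUDENT’s example has a definite answer; working the arithmetic out shows how mechanical the ‘cracking’ really is:

$$2 \times (0.2 \times 45)^2 = 2 \times 9^2 = 162$$

so Tom gets 162 customers.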

Designing Search: UX Strategies for Ecommerce Success
by Greg Nudelman and Pabini Gabriel-Petit
Published 8 May 2011

He holds a degree in music theory and composition from Harvard. Daniel Tunkelang is a leading industry advocate of human-computer information retrieval (HCIR). He was a founding employee of faceted search pioneer Endeca, where he spent ten years as Chief Scientist. During that time, he established the HCIR workshop, which has taken place annually since 2007. Always working to bring together industry and academia, he co-organized the 2010 Workshop on Search and Social Media and has served as an organizer for the industry tracks of the premier conferences on information retrieval: SIGIR and CIKM. He authored a popular book on faceted search as part of the Morgan & Claypool Synthesis Lectures.

Journal of the American Society for Information Science, 48(11): pp. 1036–1048, 1997.
[Cousins, 1997] S.B. Cousins. “Reification and Affordances in a User Interface for Interacting with Heterogeneous Distributed Applications.” PhD thesis, Stanford University, May 1997.
[Ellis, 1989] D. Ellis. A behavioural model for information retrieval system design. Journal of Information Science, 15: pp. 237–247, 1989.
[Bates, 1979] M.J. Bates. Information search tactics. Journal of the American Society for Information Science, 30(4): pp. 205–214, 1979.
[Norman, 1988] D.A. Norman. The Psychology of Everyday Things. Basic Books, New York, 1988.

[Pirolli and Card, 1999] P. Pirolli and S.K. Card. Information foraging. Psychological Review, 106(4): pp. 643–675, 1999.
[Belkin et al., 1993] N. Belkin, P. G. Marchetti, and C. Cool. Braque – design of an interface to support user interaction in information retrieval. Information Processing and Management, 29(3): pp. 325–344, 1993.
[Chang and Rice, 1993] Shan-Ju Chang and Ronald E. Rice. Browsing: A multidimensional framework. Annual Review of Information Science and Technology, 28: pp. 231–276, 1993.
[Marchionini, 1995] Gary Marchionini. Information Seeking in Electronic Environments.

pages: 1,535 words: 337,071

Networks, Crowds, and Markets: Reasoning About a Highly Connected World
by David Easley and Jon Kleinberg
Published 15 Nov 2010

Before discussing some of the ideas behind the ranking of pages, let’s begin by considering some of the basic reasons why it’s a hard problem. First, search is a hard problem for computers to solve in any setting, not just on the Web. Indeed, the field of information retrieval [35, 354] has dealt with this problem for decades before the creation of the Web: automated information retrieval systems starting in the 1960s were designed to search repositories of newspaper articles, scientific papers, patents, legal abstracts, and other document collections in response to keyword queries. Information retrieval systems have always had to deal with the problem that keywords are a very limited way to express a complex information need; in addition to the fact that a list of keywords is short and inexpressive, it suffers from the problems of synonymy (multiple ways to say the same thing, so that your search for recipes involving scallions fails because the recipe you wanted called them “green onions”) and polysemy (multiple meanings for the same term, so that your search for information about the animal called a jaguar instead produces results primarily about automobiles, football players, and an operating system for the Apple Macintosh).
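
A toy keyword index makes both failure modes concrete. This is a minimal illustrative sketch with invented documents; it is not code from the book:

```python
# Minimal keyword index: each term maps to the set of documents containing it.
docs = {
    1: "braised green onions with ginger",       # relevant recipe, but no "scallions"
    2: "jaguar xk engine maintenance tips",      # "jaguar" the car
    3: "habitat of the jaguar in south america", # "jaguar" the animal
}

index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

# Synonymy: the query term never matches the synonymous phrasing in document 1.
print(index.get("scallions", set()))  # -> set(): the scallion recipe is missed

# Polysemy: one term matches documents about entirely different senses.
print(index.get("jaguar", set()))     # -> {2, 3}: cars and animals mixed together
```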

Even today, such news search features are only partly integrated into the core parts of the search engine interface, and emerging Web sites such as Twitter continue to fill in the spaces that exist between static content and real-time awareness. More fundamental still, and at the heart of many of these issues, is the fact that the Web has shifted much of the information retrieval question from a problem of scarcity to a problem of abundance. The prototypical applications of information retrieval in the pre-Web era had a “needle-in-a-haystack” flavor — for example, an intellectual-property attorney might express the information need, “find me any patents that have dealt with the design of elevator speed regulators based on fuzzy-logic controllers.”

With this in mind, people who depended on the success of their Web sites increasingly began modifying their Web-page authoring styles to score highly in search engine rankings. For people who had conceived of Web search as a kind of classical information retrieval application, this was something novel. Back in the 1970s and 1980s, when people designed information retrieval tools for scientific papers or newspaper articles, authors were not overtly writing their papers or abstracts with these search tools in mind. From the relatively early days of the Web, however, people have written Web pages with search engines quite explicitly in mind.

Cataloging the World: Paul Otlet and the Birth of the Information Age
by Alex Wright
Published 6 Jun 2014

In 1780, an Austrian named Gerhard van Swieten further adapted the technique to create a master catalog for the Austrian National Library, known as the Josephinian Catalog (named for Austria’s “enlightened despot” Joseph II). Van Swieten decided to store his catalog cards in 205 wooden boxes, sealed in an airtight locker—the first recognizable precursor to the once familiar, now rapidly disappearing, library card catalog. Today, we might tend to think of the card catalog as a simplistic information retrieval tool: the dominion of somber librarians in fusty reading rooms. However, to take such a dismissive view of these compact, efficient systems—the direct ancestors of the modern database—may lead us to overlook the critical role they played in the industrial information explosion that would reshape the European world in the nineteenth century.

In the mid-1930s, IBM was building its portfolio of electronic devices (even before it had started manufacturing any of them), long before Vannevar Bush, then dean of engineering at the Massachusetts Institute of Technology, published his famous essay “As We May Think.” Today, most computer science historians have characterized Bush’s Rapid Selector as the first electronic information-retrieval machine. When Bush tried to patent his invention in 1937 and 1940, however, the U.S. Patent Office turned him down, citing Goldberg’s work. And while there is no evidence that Goldberg’s invention directly influenced Bush’s work, Donker Duyvis—Paul Otlet’s eventual successor at the IIB—did tell Bush about Goldberg’s invention in 1946. Despite his considerable achievements, Goldberg remains all but unknown today.

Only when the conflict between nation-states had been eliminated could humanity finally realize its spiritual and intellectual potential. Worldwide dissemination of recorded knowledge was an essential step along that path. Like Otlet, Wells believed that better access to information might help prevent future wars. Beginning with his 1905 work, A Modern Utopia, Wells had developed a fascination with the problem of information retrieval—the need for better methods for organizing the world’s recorded knowledge. This led him to reject old values and institutional strictures and embrace a mechanistic approach, one founded on Taylorist ideals of scientific management and a belief in the power of science to solve humanity’s problems, and the coming war in particular.

pages: 174 words: 56,405

Machine Translation
by Thierry Poibeau
Published 14 Sep 2017

Speech translation has become a hot topic (“speech to speech” applications aim at making it possible to speak in one’s own language with another interlocutor speaking in a foreign language by using live automated translation). The machine translation market is growing fast. Over the last few years we have witnessed the emergence of new applications, particularly on mobile devices.

Cross-Language Information Retrieval

Cross-language information retrieval aims to give access to documents initially written in different languages. Consider research on patents: when a company seeks to know if an idea or a process has already been patented, it must ensure that its research is exhaustive and covers all parts of the world. It is therefore fundamental to cross the language barrier, for both the query (i.e., the information need expressed through keywords) and the analysis of the responses (i.e., documents relevant to the information need).
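
The simplest form of this is dictionary-based query translation: map the query terms into the target language, then run an ordinary monolingual search there. A minimal sketch with an invented two-entry glossary and invented documents (not code from the book):

```python
# Toy cross-language retrieval: translate the query, then search documents
# written in the target language with the translated terms.
glossary = {"patent": "brevet", "process": "procédé"}  # hypothetical EN->FR entries

fr_docs = {
    101: "brevet concernant un procédé de filtrage",
    102: "rapport annuel de la société",
}

def translate_query(terms):
    # Keep the terms the glossary covers; real systems must also handle
    # ambiguity, morphology, and missing vocabulary.
    return [glossary[t] for t in terms if t in glossary]

def search(terms, docs):
    return [doc_id for doc_id, text in docs.items()
            if any(t in text.split() for t in terms)]

print(search(translate_query(["patent", "process"]), fr_docs))  # -> [101]
```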

See Smart glasses; Smart watch; Construction (linguistic), 23; Context (linguistic), 17–21, 31, 34, 54–56, 64–67, 71, 92, 117–119, 129, 150, 176–178, 186, 188, 215–216, 238; Continuous model, 186–187; Conversational agent, 2 (see also Artificial dialogue); Coordination, 175; Corpus alignment, 91–108; Cross-language information retrieval, 238–239; Cryptography, 49, 52, 56, 58–60; Cryptology (see Cryptography); CSLi, 232, 236; Cultural hegemony, 168, 250–251; Czech, 210, 213; DARPA, 200–203, 209, 259; Database access, 241; Date expressions, 115, 152, 160; Deceptive cognate, 11, 261; Decoder, 141, 144, 185, 186, 190; Deep learning, 34–35, 37, 170, 181–195, 228, 234, 247, 253–255; Deepmind, 182; Defense industry, 77, 88, 173, 232–233, 235; De Firmas-Périés, Arman-Charles-Daniel, 41; De Maimieux, Joseph, 41; Descartes, René, 40–42; Determiner, 133, 215; Dialogue.

See Machine translation systems; Ideographic writing system, 105; Idiom (see Idiomatic expression); Idiomatic expression, 10, 11, 15, 23, 28, 30, 33, 115, 125, 178, 217, 219, 262; Iida, Hitoshi, 117; Image recognition, 183; Indirect machine translation, 25–32; Indo-European languages, 165, 213, 214, 250; Information retrieval, 45, 92, 238–239; Informativeness, 201, 206; Intelligence industry (see Intelligence services); Intelligence services, 77, 89, 173, 225, 233, 235, 249; Interception (of communications), 225, 232; Interlingua, 24, 28–32, 40, 58, 63, 66–68, 85, 262; Interlingual machine translation (see Interlingua); Intermediate representation, 25–32, 63; Internet, 33, 93, 97, 98, 100, 102, 164, 166, 168–169, 172, 197, 227–233, 238, 242–243, 247–250 (subentry: link, 98–99); Interpretation, 20, 201; Island of confidence, 102, 108, 150; Isolating language, 215–216; Israel, 60, 69; Japan, 44, 67, 86, 87, 109; Japanese, 11, 88, 117–118, 164–165, 192, 242; Jibbigo, 236; JRC-Acquis corpus, 97, 212–213, 223; Keyword, 92, 99, 238; Kilgarriff, Adam, 18; King, Gilbert, 76; Kircher, Athanasius, 41; Koehn, Philip, 136, 212–213; Korean, 88, 235–236; Language: complexity (see Complexity); diversity, 1, 164–170 (see also typology); exposure (see Child language acquisition); family, 30, 106, 138, 172–174; independent representation (see Interlingua); learning (see Child language acquisition); model, 127, 140, 142, 144, 153, 185; proximity, 163 (see also family); typology, 138, 192 (see also family); universal, 56, 66, 67 (see also Universal language); Lavie, Alon, 206; Learning step (or learning phase).

pages: 394 words: 108,215

What the Dormouse Said: How the Sixties Counterculture Shaped the Personal Computer Industry
by John Markoff
Published 1 Jan 2005

In 1960, Engelbart presented a paper at the annual meeting of the American Documentation Institute, outlining how computer systems of the future might change the role of information-retrieval specialists. The idea didn’t sit at all well with his audience, which gave his paper a blasé reception. He also got into an argument with a researcher who asserted that Engelbart was proposing nothing that was any different from any of the other information-retrieval efforts that were already under way. It was a long and lonely two years. The state of the art of computer science was moving quickly toward mathematical algorithms, and the computer scientists looked down their nose at his work, belittling it as mere office automation and hence beneath their notice.

For a while, he thought that the emergent field of artificial intelligence might provide him with some support, or at least meaningful overlap. But the AI researchers translated his ideas into their own, and the concept of Augmentation seemed pallid when viewed through their eyes, reduced to the more mundane idea of information retrieval, missing Engelbart’s dream entirely. Gradually, he began to understand that the AI community was actually his philosophical enemy. After all, their vision was to replace humans with machines, while he wanted to extend and empower people. Engelbart would later say that he had nothing against the vision of AI but just believed that it would be decades and decades before it could be realized.

There was an abyss between the original work done by Engelbart’s group in the sixties and the motley crew of hobbyists that would create the personal-computer industry beginning in 1975. In their hunger to possess their own computers, the PC hobbyists would miss the crux of the original idea: communications as an integral part of the design. That was at the heart of the epiphanies that Engelbart had years earlier, which led to the realization of Vannevar Bush’s Memex information-retrieval system of the 1940s. During the period from the early 1960s until 1969, when most of the development of the NLS system was completed, Engelbart and his band of researchers remained in a comfortable bubble. They were largely Pentagon funded, but unlike many of the engineering and computing groups that surrounded them at SRI, they weren’t doing work that directly contributed to the Vietnam War.

pages: 893 words: 199,542

Structure and interpretation of computer programs
by Harold Abelson , Gerald Jay Sussman and Julie Sussman
Published 25 Jul 1996

Use the results of exercises 2.63 and 2.64 to give Θ(n) implementations of union-set and intersection-set for sets implemented as (balanced) binary trees.

Sets and information retrieval

We have examined options for using lists to represent sets and have seen how the choice of representation for a data object can have a large impact on the performance of the programs that use the data. Another reason for concentrating on sets is that the techniques discussed here appear again and again in applications involving information retrieval. Consider a data base containing a large number of individual records, such as the personnel files for a company or the transactions in an accounting system.
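
The book develops this in Scheme; as a rough illustration only, the unordered-list version of the lookup it goes on to describe might be rendered in Python as follows (the record contents are invented here):

```python
# Unordered-list representation of a set of keyed records:
# lookup scans linearly, so retrieval cost grows with the number of records.
records = [
    {"key": "alyssa", "salary": 90000},  # hypothetical personnel records
    {"key": "ben", "salary": 85000},
]

def lookup(given_key, set_of_records):
    for record in set_of_records:
        if record["key"] == given_key:
            return record
    return False  # mirrors the Scheme convention of returning false on a miss

print(lookup("ben", records))  # -> {'key': 'ben', 'salary': 85000}
```

Re-representing the same records as a binary tree keyed on the record keys turns this linear scan into a logarithmic search, which is the kind of representation payoff the excerpt describes.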

Also, a central role is played in the implementation by a frame data structure, which determines the correspondence between symbols and their associated values. One additional interesting aspect of our query-language implementation is that we make substantial use of streams, which were introduced in chapter 3.

4.4.1 Deductive Information Retrieval

Logic programming excels in providing interfaces to data bases for information retrieval. The query language we shall implement in this chapter is designed to be used in this way. In order to illustrate what the query system does, we will show how it can be used to manage the data base of personnel records for Microshaft, a thriving high-technology company in the Boston area.

The resulting RSA algorithm has become a widely used technique for enhancing the security of electronic communications. Because of this and related developments, the study of prime numbers, once considered the epitome of a topic in “pure” mathematics to be studied only for its own sake, now turns out to have important practical applications to cryptography, electronic funds transfer, and information retrieval.

1.3 Formulating Abstractions with Higher-Order Procedures

We have seen that procedures are, in effect, abstractions that describe compound operations on numbers independent of the particular numbers. For example, when we (define (cube x) (* x x x)) we are not talking about the cube of a particular number, but rather about a method for obtaining the cube of any number.
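
The cryptographic applications mentioned above rest on fast probabilistic primality tests of the kind the book develops; a compact Python sketch of the Fermat test, written from scratch here rather than transcribed from the text:

```python
import random

def fermat_test(n, trials=20):
    """Probabilistic primality check: for prime n, a^(n-1) = 1 (mod n)."""
    if n < 4:
        return n in (2, 3)
    for _ in range(trials):
        a = random.randrange(2, n - 1)
        if pow(a, n - 1, n) != 1:  # fast modular exponentiation
            return False           # a witness: n is definitely composite
    return True  # probably prime (rare Carmichael numbers can slip through)

print(fermat_test(221))   # -> False with overwhelming probability: 221 = 13 * 17
print(fermat_test(7919))  # -> True: 7919 is prime
```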

pages: 523 words: 143,139

Algorithms to Live By: The Computer Science of Human Decisions
by Brian Christian and Tom Griffiths
Published 4 Apr 2016

Anderson and Milson, “Human Memory,” in turn, draws from a statistical study of library borrowing that appears in Burrell, “A Simple Stochastic Model for Library Loans.”

the missing piece in the study of the mind: Anderson’s initial exploration of connections between information retrieval by computers and the organization of human memory was conducted in an era when most people had never interacted with an information retrieval system, and the systems in use were quite primitive. As search engine research has pushed the boundaries of what information retrieval systems can do, it’s created new opportunities for discovering parallels between minds and machines. For example, Tom and his colleagues have shown how ideas behind Google’s PageRank algorithm are relevant to understanding human semantic memory.

Does it suggest that human memory is good or bad? What’s the underlying story here? These questions have stimulated psychologists’ speculation and research for more than a hundred years. In 1987, Carnegie Mellon psychologist and computer scientist John Anderson found himself reading about the information retrieval systems of university libraries. Anderson’s goal—or so he thought—was to write about how the design of those systems could be informed by the study of human memory. Instead, the opposite happened: he realized that information science could provide the missing piece in the study of the mind.

Basically, all of these theories characterize memory as an arbitrary and non-optimal configuration… I had long felt that the basic memory processes were quite adaptive and perhaps even optimal; however, I had never been able to see a framework in which to make this point. In the computer science work on information retrieval, I saw that framework laid out before me.” A natural way to think about forgetting is that our minds simply run out of space. The key idea behind Anderson’s new account of human memory is that the problem might be not one of storage, but of organization. According to his theory, the mind has essentially infinite capacity for memories, but we have only a finite amount of time in which to search for them.
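
Anderson’s organization-not-storage idea is often glossed with a caching analogy: if search time is the scarce resource, keep recently needed items cheapest to reach and let the stalest drift out of reach. A least-recently-used (LRU) cache is the textbook version of that policy; the sketch below is a loose illustration of the analogy, not Anderson’s actual model:

```python
from collections import OrderedDict

class LRUCache:
    """Keep the k most recently used items cheap to reach; evict the stalest."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def access(self, key):
        if key in self.items:
            self.items.move_to_end(key)     # refresh recency on each use
        else:
            if len(self.items) >= self.capacity:
                self.items.popitem(last=False)  # "forget" the least recently used
            self.items[key] = True

memory = LRUCache(capacity=3)
for fact in ["a", "b", "c", "a", "d"]:
    memory.access(fact)
print(list(memory.items))  # -> ['c', 'a', 'd']: "b" has been forgotten
```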

pages: 250 words: 73,574

Nine Algorithms That Changed the Future: The Ingenious Ideas That Drive Today's Computers
by John MacCormick and Chris Bishop
Published 27 Dec 2011

An example set of web pages that each have a title and a body.

We already know that the Babylonians were using indexing 5000 years before search engines existed. It turns out that search engines did not invent the word-location trick either: this is a well-known technique that was used in other types of information retrieval before the internet arrived on the scene. However, in the next section we will learn about a new trick that does appear to have been invented by search engine designers: the metaword trick. The cunning use of this trick and various related ideas helped to catapult the AltaVista search engine to the top of the search industry in the late 1990s.
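
Concretely, the word-location trick just means the index stores where each word occurs, not merely which documents contain it, which is what lets an engine prefer title hits or answer phrase queries. A minimal sketch with invented documents (not the book’s code):

```python
# Positional inverted index: term -> list of (doc_id, position) pairs.
docs = {
    1: "the cat sat on the mat",
    2: "a mat shop",
}

index = {}
for doc_id, text in docs.items():
    for pos, term in enumerate(text.split()):
        index.setdefault(term, []).append((doc_id, pos))

def phrase_hits(w1, w2):
    # Two words form a phrase where the second appears right after the first.
    second = set(index.get(w2, []))
    return [doc for doc, pos in index.get(w1, []) if (doc, pos + 1) in second]

print(index["mat"])               # -> [(1, 5), (2, 1)]: locations, not just doc IDs
print(phrase_hits("the", "mat"))  # -> [1]: the phrase "the mat" occurs only in doc 1
```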

In other cases, the algorithms may have existed in the research community for some time, waiting in the wings for the right wave of new technology to give them wide applicability. The search algorithms for indexing and ranking fall into this category: similar algorithms had existed for years in the field known as information retrieval, but it took the phenomenon of web search to make these algorithms “great,” in the sense of daily use by ordinary computer users. Of course, the algorithms also evolved for their new application; PageRank is a good example of this. Note that the emergence of new technology does not necessarily lead to new algorithms.

Among the many college-level computer science texts on algorithms, three particularly readable options are Algorithms, by Dasgupta, Papadimitriou, and Vazirani; Algorithmics: The Spirit of Computing, by Harel and Feldman; and Introduction to Algorithms, by Cormen, Leiserson, Rivest, and Stein.

Search engine indexing (chapter 2). The original AltaVista patent covering the metaword trick is U.S. patent 6105019, “Constrained Searching of an Index,” by Mike Burrows (2000). For readers with a computer science background, Search Engines: Information Retrieval in Practice, by Croft, Metzler, and Strohman, is a good option for learning more about indexing and many other aspects of search engines.

PageRank (chapter 3). The opening quotation by Larry Page is taken from an interview by Ben Elgin, published in Businessweek, May 3, 2004. Vannevar Bush’s “As We May Think” was, as mentioned above, originally published in The Atlantic magazine (July 1945).

pages: 397 words: 102,910

The Idealist: Aaron Swartz and the Rise of Free Culture on the Internet
by Justin Peters
Published 11 Feb 2013

His brief remarks to the group at Woods Hole were wistful: “I merely wish I were young enough to participate with you in the fascinating intricacies you will encounter and bring under your control.” Vannevar rhymes with believer, and when it came to government funding of scientific research, Bush certainly was. He was also a lifelong believer in libraries, and the benefits to be derived from their automation. In 1945, he published an article in the Atlantic Monthly that proposed a rudimentary mechanized library called Memex, a linked-information retrieval system. Memex was a desk-size machine that was equal parts stenographer, filing cabinet, and reference librarian: “a device in which an individual stores his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility.” The goal was to build a machine that could capture a user’s thought patterns, compile and organize his reading material and correspondence, and record the resulting “associative trails” between them all, such that the user could trace his end insights back to conception.

The rise of photoduplication technologies that facilitated the rapid spread of information merely underscored the fragility of copyright holders’ claims that intellectual property was indistinguishable from regular physical property. “We know that volumes of information can be stored on microfilm and magnetic tape. We keep hearing about information-retrieval networks,” former senator Kenneth B. Keating told Congress in 1965. “The inexorable question arises—what will happen in the long run if authors’ income is cut down and down by increasing free uses by photocopy and information storage and retrieval? Will the authors continue writing? Will the publishers continue publishing if their markets are diluted, eroded, and eventually, the profit motive and incentive completely destroyed?”

Project Gutenberg had become an eloquent counterargument to copyright advocates’ dismissive claims about the public domain. It demonstrated just how easily a network could be used to breathe new life into classics that might otherwise go unseen. Despite the existence of initiatives such as Project Gutenberg, despite the emergence of the Internet as a new medium for information retrieval and distribution, the same official attitudes about intellectual property prevailed. The public domain was regarded as a penalty rather than as an opportunity. Parochial concerns were conflated with the public interest. The rise of the Internet might portend an informational revolution, but from the standpoint of the people in power, Hart warned, revolution was a bad thing.

pages: 1,387 words: 202,295

Structure and Interpretation of Computer Programs, Second Edition
by Harold Abelson , Gerald Jay Sussman and Julie Sussman
Published 1 Jan 1984

Exercise 2.65: Use the results of Exercise 2.63 and Exercise 2.64 to give Θ(n) implementations of union-set and intersection-set for sets implemented as (balanced) binary trees.

Sets and information retrieval

We have examined options for using lists to represent sets and have seen how the choice of representation for a data object can have a large impact on the performance of the programs that use the data. Another reason for concentrating on sets is that the techniques discussed here appear again and again in applications involving information retrieval. Consider a data base containing a large number of individual records, such as the personnel files for a company or the transactions in an accounting system.

Also, a central role is played in the implementation by a frame data structure, which determines the correspondence between symbols and their associated values. One additional interesting aspect of our query-language implementation is that we make substantial use of streams, which were introduced in Chapter 3.

4.4.1 Deductive Information Retrieval

Logic programming excels in providing interfaces to data bases for information retrieval. The query language we shall implement in this chapter is designed to be used in this way. In order to illustrate what the query system does, we will show how it can be used to manage the data base of personnel records for Microshaft, a thriving high-technology company in the Boston area.
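
The heart of such a query system is matching patterns containing variables against assertions in the data base. As a rough Python illustration of that single step, with invented sample assertions standing in for the book’s Microshaft records:

```python
# Tiny pattern matcher in the spirit of a query language:
# strings starting with "?" are variables; anything else must match literally.
database = [
    ("job", "alyssa", "computer programmer"),
    ("job", "cy", "computer programmer"),
    ("job", "ben", "computer wizard"),
]

def match(pattern, assertion, bindings):
    if len(pattern) != len(assertion):
        return None
    for p, a in zip(pattern, assertion):
        if p.startswith("?"):
            if bindings.get(p, a) != a:   # variable already bound to something else
                return None
            bindings = {**bindings, p: a}
        elif p != a:                      # constants must match exactly
            return None
    return bindings

def query(pattern):
    results = []
    for assertion in database:
        bindings = match(pattern, assertion, {})
        if bindings is not None:
            results.append(bindings)
    return results

print(query(("job", "?who", "computer programmer")))
# -> [{'?who': 'alyssa'}, {'?who': 'cy'}]
```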

(Table-of-contents excerpt, Chapters 2–5; the section relevant here is 4.4 Logic Programming, comprising 4.4.1 Deductive Information Retrieval, 4.4.2 How the Query System Works, 4.4.3 Is Logic Programming Mathematical Logic?, and 4.4.4 Implementing the Query System.)

Unofficial Texinfo Format

This is the second edition SICP book, from Unofficial Texinfo Format.

The Card Catalog: Books, Cards, and Literary Treasures
by Library Of Congress and Carla Hayden
Published 3 Apr 2017

About the same time, the handful of computer companies that existed were making major innovations and had moved away from the punched-card system, advancing to vacuum tubes and magnetic tapes. Seeing new possibilities for cataloging and storing data, Librarian of Congress Lawrence Quincy Mumford established the Committee on Mechanized Information Retrieval in January 1958. In the years that followed, and with the approval of Congress, the Library purchased an IBM 1401, a small-scale computer system the size of a Volkswagen bus. The committee also recommended establishing a group to both design and implement the procedures required to automate the catalog.

See Roman Catholic Church; census, 151; Centennial International Exhibition of 1876, 84; Ch’eng Ti, 15; Chicago Public Library, 7; Christianity, rise of, 15; clay, 12; Clemens, Samuel, 121; codex, 17; Cole, John, 87, 107; Collins, Billy, 156; Collyer, Homer, 148; Collyer, Langley, 148; Committee on Mechanized Information Retrieval, 152; computer punch cards, 151; Computing-Tabulating-Recording Company, 151; Congress Main Reading Room, 159; Copyright Act of 1870, 103; cross-referencing, 17; cuneiform, 12; Cutter, Charles Ammi, 82, 83, 108; D: Dana, John, 146; Descartes, René, 19; Dewey Decimal Classification, 84; Dewey, Melville Louis, 82, 83, 85, 87, 113, 151; Dixson, Kathy, 155; Diderot, Denis, 33; Douglass, Frederick, 102; Dove, Rita, 156; E: Edlund, Paul, 112, 158; Eliot, T.

pages: 402 words: 110,972

Nerds on Wall Street: Math, Machines and Wired Markets
by David J. Leinweber
Published 31 Dec 2008

Reporters were necessary intermediaries in an era when (for example) press releases were sent to a few thousand fax machines and assigned to reporters by editors, and when SEC filings were found on a shelf in the Commission’s reading rooms in major cities. Press releases go to everyone over the Web. SEC filings are completely electronic. The reading rooms are closed. There is a great deal of effort to develop persistent specialized information-retrieval software agents for these sorts of routine newsgathering activities, which in turn creates incentives for reporters to move up from moving information around to interpretation and analysis. Examples and more in-depth discussion on these “new research” topics are forthcoming in Chapters 9 and 10. Innovative algo systems will facilitate the use of news, in processed and raw forms.

Reuters Newscope algorithmic offerings, http://about.reuters.com/productinfo/newsscoperealtime/index.aspx?user=1&.
27. These tools are called Open Calais (www.opencalais.com/).
28. For the technically ambitious reader, Lucene (http://lucene.apache.org/), Lingpipe (http://alias-i.com/lingpipe/), and Lemur (www.lemurproject.org/) are popular open source language and information retrieval tools.
29. Anthony Oettinger, a pioneer in machine translation at Harvard going back to the 1950s, told a story of an early English-Russian-English system sponsored by U.S. intelligence agencies. The English “The spirit is willing but the flesh is weak” went in, was translated to Russian, which was then sent in again to be translated back into English.

Direct access to primary sources of financially relevant information is disintermediating reporters, who now have to provide more than just a conduit to earn their keep. We would be hard-pressed to find more innovation than we see today on the Web. Google Finance, Yahoo! Finance, and their brethren have made more advanced information retrieval and analysis tools available for free than could be purchased for any amount in the not-so-distant past. Other new technologies enable a new level of human-machine collaboration in investment research, such as XML (extensible markup language), discussed in Chapter 2. One of this technology’s most vocal proponents is Christopher Cox, former chairman of the SEC, who has taken the lead in encouraging the adoption of XBRL (extensible Business Reporting Language) to keep U.S. markets, exchanges, companies, and investors ahead of the curve. We constantly hear about information overload, information glut, information anxiety, data smog, and the like.

pages: 223 words: 52,808

Intertwingled: The Work and Influence of Ted Nelson (History of Computing)
by Douglas R. Dechow
Published 2 Jul 2015

J Technol Educ 10(1). http://scholar.lib.vt.edu/ejournals/JTE/v10n1/childress.html 5. Nelson TH (1965) A file structure for the complex, the changing and the indeterminate. In: Proceedings of the ACM 20th national conference. ACM Press, New York, pp 84–100 6. Nelson TH (1967) Getting it out of our system. In: Schechter G (ed) Information retrieval: a critical review. Thompson Books, Washington, DC, pp 191–210 7. Nelson TH (1968) Hypertext implementation notes, 6–10 March 1968. Xuarchives. http://xanadu.com/REF%20XUarchive%20SET%2003.11.06/hin68.tif 8. Nelson TH (1974) Computer lib: you can and must understand computers now/dream machines.

Ted signed my copy of Literary Machines [25] at a talk in the mid-1990s, thus I was in awe of the man when Bill Dutton put us together as visiting scholars in the OII attic, a wonderful space overlooking the Ashmolean Museum. Ted and I arrived at concepts of data and metadata from very different paths. He brought his schooling in the theater and literary theory to the pioneer days of personal computing. I brought my schooling in mathematics, information retrieval, documentation, libraries, and communication to the study of scholarship. While Ted was sketching personal computers to revolutionize written communication [24], I was learning how to pry data out of card catalogs and move them into the first generation of online catalogs [6]. Our discussions that began 30 years later revealed the interaction of these threads, which have since converged. 10.2 Collecting and Organizing Data Ted overwhelms himself in data, hence he needs metadata to manage his collections.

In: Proceedings of the World Documentation Federation Nelson TH (1966–1967) Hypertext notes. http://web.archive.org/web/20031127035740/http://www.xanadu.com/XUarchive/. Unpublished series of ten short essays or “notes” Nelson TH (1967) Getting it out of our system. In: Schechter G (ed) Information retrieval: a critical review. Thompson Books, Washington, DC, pp 191–210 Nelson TH, Carmody S, Gross W, Rice D, van Dam A (1969) A hypertext editing system for the /360. In: Faiman M, Nievergelt J (eds) Pertinent concepts in computer graphics. Proceedings of the Second University of Illinois conference on computer graphics.

pages: 371 words: 93,570

Broad Band: The Untold Story of the Women Who Made the Internet
by Claire L. Evans
Published 6 Mar 2018

With a couple of phones and boxes of index cards, it coordinated extensive group action for quick-response incidents like the 1971 San Francisco Bay oil spill—an early version of the kind of organizing that happens so easily today on social media. Resource One took up where these efforts left off, even inheriting the San Francisco Switchboard’s corporate shell. When Pam and the Chrises moved into the warehouse, their plan was to design a common information retrieval system for all the existing Switchboards in the city, interlinking their various resources into a database running on borrowed computer time. “Our vision was making technology accessible to people,” Pam explains. “It was a very passionate time. And we thought anything was possible.” But borrowing computer time to build such a database was far too limiting; if they were to imbue their politics into a computer system for the people, they’d need to build it from the ground up.

That summer, while the other communards plumbed the building’s twenty-foot hot tub, the Resource One group installed cabinet racks and drum storage units. Nobody on the job had done anything remotely like it—even the lead electrician learned as he went, and the software was written from scratch, encoding the counterculture’s values into the computer at an operating system level. The Resource One Generalized Information Retrieval System, ROGIRS, written by a hacker, Ephrem Lipkin, was designed for the underground Switchboards, as a way to manage the offerings of an alternative economy. Once up and running, the machine would become the heart of Northern California’s underground free-access network, a glimmer of the Internet’s vital cultural importance years before most people would ever hear of it.

Bolton told them how social services agencies in the Bay Area didn’t share a citywide database for referral information; he’d personally observed how social workers at different agencies relied on their own Rolodexes. The quality of referrals they gave varied throughout the city, and people weren’t always connected to the services they needed, even if the services did exist. Chris Macie, who founded Resource One with Pam and stayed on after she left, programmed a new information retrieval system for the project, and the women started calling social workers all over San Francisco. If they kept an updated database of referral information, they asked, would the agencies be interested in subscribing? The answer was a resounding yes. The women of Resource One found their cause: using the computer to help the most disadvantaged people in the city gain access to services.

pages: 480 words: 99,288

Mastering ElasticSearch
by Rafal Kuc and Marek Rogozinski
Published 14 Aug 2013

The Lucene conceptual formula The conceptual version of the TF/IDF formula looks like:

score(q,d) = coord-factor(q,d) · query-boost(q) · (V(q) · V(d) / |V(q)|) · doc-len-norm(d) · doc-boost(d)

The formula presented above is a representation of the Boolean model of Information Retrieval combined with the Vector Space Model of Information Retrieval. Let's not discuss it; let's just jump to the practical formula, which is implemented by Apache Lucene and is actually used. Note: Detailed information about the Boolean model and the Vector Space Model of Information Retrieval is far beyond the scope of this book. If you would like to read more about them, start with http://en.wikipedia.org/wiki/Standard_Boolean_model and http://en.wikipedia.org/wiki/Vector_Space_Model.
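A minimal sketch of the classic tf·idf weighting behind the excerpt above, in Python. This is not Lucene's actual implementation (the practical Lucene formula adds coordination, normalization, and boost factors); the toy corpus and whitespace tokenizer are assumptions for illustration.

```python
import math
from collections import Counter

# Toy corpus; in Lucene the "documents" would be indexed fields.
docs = [
    "elasticsearch is built on lucene",
    "lucene implements tf idf scoring",
    "tf idf weighs rare terms higher",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# Document frequency: number of documents containing each term.
df = Counter(term for toks in tokenized for term in set(toks))

def tf_idf(term, doc_tokens):
    """Classic tf * idf weight; Lucene's practical formula differs in detail."""
    tf = doc_tokens.count(term)
    idf = math.log(N / df[term]) + 1.0  # +1 keeps ubiquitous terms from zeroing out
    return tf * idf

for i, toks in enumerate(tokenized):
    print(i, {t: round(tf_idf(t, toks), 3) for t in set(toks)})
```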

Rafał began his journey with Lucene in 2002 and it wasn't love at first sight. When he came back to Lucene in late 2003, he revised his thoughts about the framework and saw the potential in search technologies. Then Solr came and this was it. He started working with ElasticSearch in the middle of 2010. Currently, Lucene, Solr, ElasticSearch, and information retrieval are his main points of interest. Rafał is also the author of Solr 3.1 Cookbook and its update, Solr 4.0 Cookbook, and a co-author of ElasticSearch Server, all published by Packt Publishing. The book you are holding in your hands was something that I wanted to write after finishing the ElasticSearch Server book and I got the opportunity.

pages: 660 words: 141,595

Data Science for Business: What You Need to Know About Data Mining and Data-Analytic Thinking
by Foster Provost and Tom Fawcett
Published 30 Jun 2013

It forms the core of several prediction algorithms that estimate a target value such as the expected resource usage of a client or the probability that a customer will respond to an offer. It is also the basis for clustering techniques, which group entities by their shared features without a focused objective. Similarity forms the basis of information retrieval, in which documents or webpages relevant to a search query are retrieved. Finally, it underlies several common algorithms for recommendation. A traditional algorithm-oriented book might present each of these tasks in a different chapter, under different names, with common aspects buried in algorithm details or mathematical propositions.

Jaccard distance Cosine distance is often used in text classification to measure the similarity of two documents. It is defined in Equation 6-5.

Equation 6-5. Cosine distance

d_cosine(X, Y) = 1 − (X · Y) / (||X||2 · ||Y||2)

where ||·||2 again represents the L2 norm, or Euclidean length, of each feature vector (for a vector this is simply the distance from the origin). Note The information retrieval literature more commonly talks about cosine similarity, which is simply the fraction in Equation 6-5. Alternatively, it is 1 – cosine distance. In text classification, each word or token corresponds to a dimension, and the location of a document along each dimension is the number of occurrences of the word in that document.
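A small sketch of the cosine distance in Equation 6-5 as reconstructed above; the two term-count vectors are made up.

```python
import math

def cosine_distance(x, y):
    """1 - (x . y) / (||x||2 * ||y||2), as in Equation 6-5."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

# Two toy term-count vectors (one dimension per word).
doc_a = [2, 1, 0, 3]
doc_b = [1, 1, 1, 1]
print(cosine_distance(doc_a, doc_b))  # about 0.198; 0.0 would mean identical direction
```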

The True negative rate and False positive rate are analogous for the instances that are actually negative. These are often taken as estimates of the probability of predicting Y when the instance is actually p, that is p(Y|p), etc. We will continue to explore these measures in Chapter 8. The metrics Precision and Recall are often used, especially in text classification and information retrieval. Recall is the same as true positive rate, while precision is TP/(TP + FP), which is the accuracy over the cases predicted to be positive. The F-measure is the harmonic mean of precision and recall at a given point, and is:

F-measure = 2 · (precision · recall) / (precision + recall)

Practitioners in many fields such as statistics, pattern recognition, and epidemiology speak of the sensitivity and specificity of a classifier:

sensitivity = TP / (TP + FN) (the true positive rate, i.e., recall)
specificity = TN / (TN + FP) (the true negative rate)

You may also hear about the positive predictive value, which is the same as precision.
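The confusion-matrix metrics above fit in a few lines of Python; this sketch simply restates the definitions from the excerpt, and the counts passed in are invented.

```python
def classification_metrics(tp, fp, tn, fn):
    """Confusion-matrix metrics as defined in the text."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # = true positive rate = sensitivity
    specificity = tn / (tn + fp)   # = true negative rate
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, specificity, f_measure

# Made-up counts: precision 0.8, recall ~0.889, specificity ~0.818, F ~0.842.
print(classification_metrics(tp=40, fp=10, tn=45, fn=5))
```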

pages: 413 words: 119,587

Machines of Loving Grace: The Quest for Common Ground Between Humans and Robots
by John Markoff
Published 24 Aug 2015

Engelbart’s researchers, an eclectic collection of buttoned-down white-shirted engineers and long-haired computer hackers, were taking computing in a direction so different it was not even in the same coordinate system. The Shakey project was struggling to mimic the human mind and body. Engelbart had a very different goal. During World War II he had stumbled across an article by Vannevar Bush, who had proposed a microfiche-based information retrieval system called Memex to manage all of the world’s knowledge. Engelbart later decided that such a system could be assembled based on the then newly available computers. He thought the time was right to build an interactive system to capture knowledge and organize information in such a way that it would now be possible for a small group of people—scientists, engineers, educators—to create and collaborate more effectively.

The PageRank algorithm Larry Page developed to improve Internet search results essentially mined human intelligence by using the crowd-sourced accumulation of human decisions about valuable information sources. Google initially began by collecting and organizing human knowledge and then making it available to humans as part of a glorified Memex, the original global information retrieval system first proposed by Vannevar Bush in the Atlantic Monthly in 1945.11 As the company has evolved, however, it has started to push heavily toward systems that replace rather than extend humans. Google’s executives have obviously thought to some degree about the societal consequences of the systems they are creating.

Louis and Stanford, but dropped out of both programs before receiving an advanced degree. Once he was on the West Coast, he had gotten involved with Brewster Kahle’s Internet Archive Project, which sought to save a copy of every Web page on the Internet. Larry Page and Sergey Brin had given Hassan stock for programming PageRank, and Hassan also sold E-Groups, another of his information retrieval projects, to Yahoo! for almost a half-billion dollars. By then, he was a very wealthy Silicon Valley technologist looking for interesting projects. In 2006 he backed both Ng and Salisbury and hired Salisbury’s students to join Willow Garage, a laboratory he’d already created to facilitate the next generation of robotics technology—like designing driverless cars.

The Art of SEO
by Eric Enge , Stephan Spencer , Jessie Stricchiola and Rand Fishkin
Published 7 Mar 2012

However, the search engines recognize an iframe or a frame used to pull in another site’s content for what it is, and therefore ignore the content inside the iframe or frame as it is content published by another publisher. In other words, they don’t consider content pulled in from another site as part of the unique content of your web page. Determining Searcher Intent and Delivering Relevant, Fresh Content Modern commercial search engines rely on the science of information retrieval (IR). This science has existed since the middle of the twentieth century, when retrieval systems powered computers in libraries, research facilities, and government labs. Early in the development of search systems, IR scientists realized that two critical components comprised the majority of search functionality: relevance and importance (which we defined earlier in this chapter).

As far as the search engines are concerned, however, the text in a document—and particularly the frequency with which a particular term or phrase is used—has very little impact on how happy a searcher will be with that page. In fact, quite often a page laden with repetitive keywords in an attempt to please the engines will provide a very poor user experience; thus, although some SEO professionals today do claim to use term weight (a mathematical equation grounded in the real science of information retrieval) or other, more “modern” keyword text usage methods, nearly all optimization can be done very simply. The best way to ensure that you’ve achieved the greatest level of targeting in your text for a particular term or phrase is to use it in the title tag, in one or more of the section headings (within reason), and in the copy on the web page.

Hiding text in Java applets As with text in images, the search engines cannot easily parse content inside Java applets. Using them as a tool to hide text would certainly be a strange choice, though. Forcing form submission Search engines will not submit HTML forms in an attempt to access the information retrieved from a search or submission. Thus, if you keep content behind a forced-form submission and never link to it externally, your content will remain out of the engines’ indexes (as Figure 6-43 demonstrates). Figure 6-43. Content that can only be accessed by submitting a form is unreadable by crawlers The problem comes when content behind forms earns links outside your control, as when bloggers, journalists, or researchers decide to link to the pages in your archives without your knowledge.

pages: 160 words: 45,516

Tomorrow's Lawyers: An Introduction to Your Future
by Richard Susskind
Published 10 Jan 2013

Consider recent progress in artificial intelligence (AI) and, in particular, the achievements of Watson, IBM’s AI-based system that competed—in a live broadcast in 2011—on the US television general knowledge quiz show Jeopardy! Watson beat the show’s two finest ever human contestants. This is a phenomenal technological feat, combining advanced natural language understanding, machine learning, information retrieval, knowledge processing, speech synthesis, and more. While the remarkable Google retrieves information for us that might be relevant, Watson shows how AI-based systems, in years to come, will actually speak with us and solve our problems. It is significant that many new and emerging applications do not simply computerize and streamline pre-existing and inefficient manual processes.

This thinking led me in 1996, in my book The Future of Law, to predict a shift in legal paradigm, by which I meant that many or most of our fundamental assumptions about legal service and legal process would be challenged and displaced by IT and the Internet. It was a 20-year prediction, so I can be called fully to account in 2016. I do not think I will be far out—when I look at IBM’s Watson (see Chapter 1) and think of similar technology in law, or reflect on information retrieval systems that are already outperforming human beings engaged in document review, then I feel we are on the brink of a monumental shift. Crucially, I concluded in 1996 that legal service would move from being a one-to-one, consultative, print-based advisory service to a one-to-many, packaged, Internet-based information service.

pages: 347 words: 97,721

Only Humans Need Apply: Winners and Losers in the Age of Smart Machines
by Thomas H. Davenport and Julia Kirby
Published 23 May 2016

Forms of Augmentation: Superpowers and Leverage In the realm of knowledge work, we’ve seen augmentation by intelligent machines take four forms, and we can further group them into just two categories. The first two we would class as superpowers, and the second two as leverage. When a machine greatly augments your powers of information retrieval, as many information systems do, we would call that gaining a superpower. Indeed, in the Terminator film franchise, out of all the superhuman capabilities Skynet designed into its “cybernetic organisms,” the one filmgoers covet most is the instant pop-up retrieval of biographical information on any humans encountered.

It was the inspiration, for example, for Google Glass, according to the technical lead on that product, Thad Starner.6 (And although we had to say Hasta la vista, baby, to that particular product, Google assures us it will be back.) When Tom wrote a book about knowledge workers a decade ago, there were already some examples of how empowering such information retrieval can be for them. He wrote in some detail, for example, about the idea of “computer-aided physician order entry,” particularly focusing on an example of this type of system at Partners HealthCare, a care network in Boston. When physicians input medical orders (drugs, tests, referrals, etc.) for their patients into the system, it checks to see if the order is consistent with what it thinks is best medical practice.

See also augmentation; specific professions augmentation and, 31–32, 62, 65, 74, 76, 100, 122, 139, 176, 185, 228, 234, 251 big-picture perspective and, 100 codified tasks and automation, 12–13, 14, 16–18, 19, 27–28, 30, 70, 139, 156, 167, 191, 204, 216, 246 creativity and, 120–21 defined, 5 demand peak, 6 deskilling and, 16 five options for, 76–77, 218, 232 (see also specific steps) how job loss happens, 23–24 information retrieval and, 65–66 lack of wage growth, 24 machine encroachment, 13, 24–25 political strategy to help, 239 roles better done by humans, 26–30 signs of coming automation, 19–22 Stepping In, post-automation work, 30–32 taking charge of destiny, 8–9 time frame for dislocation of, 24–26 who they are, 5–6 working hours of, 70 Kraft, Robert, 172–73 Krans, Mike, 102–3, 132, 134–35, 138 Kurup, Deepika, 164 Kurzweil, Ray, 36 labor unions, 1, 16, 25 Lacerte, 22 language recognition technologies, 39–40, 43, 44–45, 50, 53, 56, 212 natural language processing (NLP), 34, 37, 178 Lawton, Jim, 50, 182–83, 193 Learning by Doing (Bessen), 133, 233 legal field augmentation as leverage in, 68 automation (e-discovery), 13, 142–44, 145, 151 content analysis and automation, 20 narrow specializations, 159–60, 162 number of U.S. lawyers, 68 Stepping Up in, 93 Leibniz Institute for Astrophysics, 59 Levasseur, M.

pages: 582 words: 160,693

The Sovereign Individual: How to Survive and Thrive During the Collapse of the Welfare State
by James Dale Davidson and William Rees-Mogg
Published 3 Feb 1997

Digital Lawyers Before agreeing to perform an operation, the skilled surgeon will probably call upon a digital lawyer to draft an instant contract that specifies and limits liability based upon the size and characteristics of the tumor revealed in images displayed by the magnetic resonance machine. Digital lawyers will be information-retrieval systems that automate selection of contract provisions, employing artificial intelligence processes such as neural networks to customize private contracts to meet transnational legal conditions. Participants in most high-value or important transactions will not only shop for suitable partners with whom to conduct a business; they will also shop for a suitable domicile for their transactions.

• Lifetime employment will disappear as "jobs" increasingly become tasks or "piece work" rather than positions within an organization. • Control over economic resources will shift away from the state to persons of superior skills and intelligence, as it becomes increasingly easy to create wealth by adding knowledge to products. • Many members of learned professions will be displaced by interactive information-retrieval systems. • New survival strategies for persons of lower intelligence will evolve, involving greater concentration on development of leisure skills, sports abilities, and crime, as well as service to the growing numbers of Sovereign Individuals as income inequality within jurisdictions rises.

As a consequence, broad paradigmatic understanding, or unspoken theories about the way the world works, are being antiquated more quickly than in the past. This increases the importance of the broad overview and diminishes the value of individual "facts" of the kind that are readily available to almost anyone with an information retrieval system. 3. The growing tribalization and marginalization of life have had a stunting effect on discourse, and even on thinking. Many people have consequently gotten into the habit of shying away from conclusions that are obviously implied by the facts at their disposal. A recent psychological study disguised as a public opinion poll showed that members of individual occupational groups were almost uniformly unwilling to accept any conclusion that implied a loss of income for them, no matter how airtight the logic supporting it.

Sorting Things Out: Classification and Its Consequences (Inside Technology)
by Geoffrey C. Bowker
Published 24 Aug 2000

So one culture sees spirit possession as a valid cause of death, another ridicules this as superstition; one medical specialty sees cancer as a localized phenomenon to be cut out and stopped from spreading, another sees it as a disorder of the whole immune system that merely manifests in one location or another. The implications for both treatment and classification differ. Trying to encode both causes results in serious information retrieval problems. In addition, classifications shift historically. In Britain in 1650 we find that 696 people died of being “aged”; 31 succumbed to wolves, 9 to grief, and 19 to “King’s Evil.” “Mother” claimed 2 in 1647 but none in 1650, but in that year 2 were “smothered and stifled” (see figure 1.3).

indexicality: the 48 points were only recognized if they were at least 0.5 cun from a classic acupuncture point, where a cun is: “the distance between the interphalangeal creases of the patient’s middle finger” (WHO 1991, 14). Formal Classification The structural aspects of classification are themselves a technical specialty in information science, biology, and statistics, among other places. Information scientists design thesauri for information retrieval, valuing parsimony and accuracy of terms, and the overall stability of the system over long periods of time. For biologists the choice of structure reflects how one sees species and the evolutionary process. For transformed cladists and numerical taxonomists, no useful statement about the past can be read out of their classifications; for evolutionary taxonomists that is the very basis of their system.

The ICD, he points out, originated as a means for describing causes of death; a trace of its heritage is its continued difficulty with describing chronic as opposed to acute forms of disease. This is one basis for the temporal fault lines that emerge in its usage. The UMLS originated as a means of information retrieval (the MeSH scheme) and is not as sensitive to clinical conditions as it might be (Musen 1992, 440). The two basic problems for any overarching classification scheme in a rapidly changing and complex field can be described as follows. First, any classificatory decision made now might by its nature block off valuable future developments.

pages: 32 words: 7,759

8 Day Trips From London
by Dee Maldon
Published 16 Mar 2010

8 Day Trips from London A simple guide for visitors who want to see more than the capital By Dee Maldon Bookline & Thinker Ltd Bookline & Thinker Ltd #231, 405 King’s Road London SW10 OBB www.booklinethinker.com Eight Days Out From London Copyright © Bookline & Thinker Ltd 2010 This book is a work of non-fiction A CIP catalogue record for this book is available from the British Library All rights reserved. No part of this work may be reproduced or stored in an information retrieval system without the express permission of the publisher ISBN: 9780956517715 Printed and bound by Lightning Source UK Book cover designed by Donald McColl Contents Bath Brighton Cambridge Canterbury Oxford Stonehenge Winchester Windsor Introduction Why take any day trips from London?

pages: 379 words: 109,612

Is the Internet Changing the Way You Think?: The Net's Impact on Our Minds and Future
by John Brockman
Published 18 Jan 2011

And when a file becomes corrupt, all I am left with is a pointer, a void where an idea should be, the ghost of a departed thought. The New Balance: More Processing, Less Memorization Fiery Cushman Postdoctoral fellow, Mind/Brain/Behavior Interfaculty Initiative, Harvard University The Internet changes the way I behave, and possibly the way I think, by reducing the processing costs of information retrieval. I focus more on knowing how to obtain and use information online and less on memorizing it. This tradeoff between processing and memory reminds me of one of my father’s favorite stories, perhaps apocryphal, about studying the periodic table of the elements in his high school chemistry class.

And when a friend cooks a good meal, I’m more interested to learn what Website it came from than how it was spiced. I don’t know most of the American Psychological Association rules for style and citation, but my computer does. For any particular “computation” I perform, I don’t need the same depth of knowledge, because I have access to profoundly more efficient processes of information retrieval. So the Internet clearly changes the way I behave. It must be changing the way I think at some level, insofar as my behavior is a product of my thoughts. It probably is not changing the basic kinds of mental processes I can perform but it might be changing their relative weighting. We psychologists love to impress undergraduates with the fact that taxi drivers have unusually large hippocampi.

Anthony Aguirre Associate professor of physics, University of California, Santa Cruz Recently I wanted to learn about twelfth-century China—not a deep or scholarly understanding, just enough to add a bit of not-wrong color to something I was writing. Wikipedia was perfect! More regularly, my astrophysics and cosmology endeavors bring me to databases such as arXiv, ADS (Astrophysics Data System), and SPIRES (Stanford Physics Information Retrieval System), which give instant and organized access to all the articles and information I might need to research and write. Between such uses and an appreciable fraction of my time spent processing e-mails, I, like most of my colleagues, spend a lot of time connected to the Internet. It is a central tool in my research life.

pages: 408 words: 105,715

Kingdom of Characters: The Language Revolution That Made China Modern
by Jing Tsu
Published 18 Jan 2022

Only selected state agencies and research institutions were given permission to build mainframe computers or to house them, and their equipment was largely dependent on imported parts if not entire machines. Information retrieval was also a distant goal. Back then, information retrieval meant something more basic than typing a query into a search box on Google or Bing. It was literally about where and how to store data information and how to call it up as a file or other format. Both electronic and informational retrieval would take longer-term planning. For the time being, the only area that was both urgent and achievable was phototypesetting. This method of typesetting involved taking a snapshot of the character to be printed, then transferring the film image to printing plates.

pages: 58 words: 12,386

Big Data Glossary
by Pete Warden
Published 20 Sep 2011

For example, you might want to extract product names and prices from a shopping site. With the tool, you could find a single product page, select the product name and price, and then the same elements would be pulled for every other page it crawled from the site. It relies on the fact that most web pages are generated by combining templates with information retrieved from a database, and so have a very consistent structure. Once you’ve gathered the data, it offers some features that are a bit like Google Refine’s for de-duplicating and cleaning up the data. All in all, it’s a very powerful tool for turning web content into structured information, with a very approachable interface.
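The template-plus-database observation can also be exploited by hand. Below is a minimal sketch using the third-party requests and BeautifulSoup libraries rather than the tool described above; the URL pattern and CSS selectors are hypothetical stand-ins for whatever a real product site uses.

```python
import requests
from bs4 import BeautifulSoup  # pip install requests beautifulsoup4

# Hypothetical listing pages all generated from one template.
urls = [f"https://shop.example.com/products/{i}" for i in range(1, 4)]

for url in urls:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    # These selectors are assumptions; a real site would use its own class names.
    name = soup.select_one(".product-name").get_text(strip=True)
    price = soup.select_one(".product-price").get_text(strip=True)
    print(name, price)
```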

pages: 2,466 words: 668,761

Artificial Intelligence: A Modern Approach
by Stuart Russell and Peter Norvig
Published 14 Jul 2019

Speech recognition can be seen as the first application area that highlighted the success of deep learning, with computer vision following shortly thereafter. Interest in the field of information retrieval was spurred by widespread usage of Internet searching. Croft et al. (2010) and Manning et al. (2008) provide textbooks that cover the basics. The TREC conference hosts an annual competition for IR systems and publishes proceedings with results. Brin and Page (1998) describe the PageRank algorithm, which takes into account the links between pages, and give an overview of the implementation of a Web search engine. Silverstein et al. (1998) investigate a log of a billion Web searches. The journal Information Retrieval and the proceedings of the annual flagship SIGIR conference cover recent developments in the field.
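As a rough illustration of the PageRank idea cited above, here is a minimal power-iteration sketch over a toy link graph; the graph, the damping factor of 0.85, and the iteration count are illustrative assumptions, not the production algorithm.

```python
def pagerank(links, damping=0.85, iters=50):
    """Power-iteration PageRank over a dict: page -> list of outgoing links."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            if not outs:  # dangling page: spread its rank everywhere
                for q in pages:
                    new[q] += damping * rank[p] / len(pages)
            else:
                for q in outs:
                    new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank

toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(toy_web))
```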

Because of these difficulties, probabilistic methods for coping with uncertainty fell out of favor in AI from the 1970s to the mid-1980s. Developments since the late 1980s are described in the next chapter. The naive Bayes model for joint distributions has been studied extensively in the pattern recognition literature since the 1950s (Duda and Hart, 1973). It has also been used, often unwittingly, in information retrieval, beginning with the work of Maron (1961). The probabilistic foundations of this technique, described further in Exercise 12.BAYS, were elucidated by Robertson and Sparck Jones (1976). Domingos and Pazzani (1997) provide an explanation for the surprising success of naive Bayesian reasoning even in domains where the independence assumptions are clearly violated.
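A minimal sketch of naive Bayes text classification in the spirit described above, using Laplace smoothing; the tiny spam/ham training set is invented for illustration.

```python
import math
from collections import Counter, defaultdict

# Tiny labeled corpus (made up) for spam/ham classification.
train = [("buy cheap pills now", "spam"),
         ("cheap flights buy now", "spam"),
         ("meeting notes attached", "ham"),
         ("lunch meeting tomorrow", "ham")]

word_counts = defaultdict(Counter)
class_counts = Counter()
vocab = set()
for text, label in train:
    class_counts[label] += 1
    for w in text.split():
        word_counts[label][w] += 1
        vocab.add(w)

def predict(text):
    scores = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        # log prior + sum of log likelihoods with Laplace smoothing
        score = math.log(class_counts[label] / sum(class_counts.values()))
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("cheap pills"))        # spam
print(predict("meeting tomorrow"))   # ham
```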

For free-form text, techniques include hidden Markov models and rule-based learning systems (as used in TEXTRUNNER and NELL (Never-Ending Language Learning) (Mitchell et al., 2018)). More recent systems use recurrent neural networks, taking advantage of the flexibility of word embeddings. You can find an overview in Kumar (2017). Information retrieval is the task of finding documents that are relevant and important for a given query. Internet search engines such as Google and Baidu perform this task billions of times a day. Three good textbooks on the subject are Manning et al. (2008), Croft et al. (2010), and Baeza-Yates and Ribeiro-Neto (2011).

pages: 586 words: 186,548

Architects of Intelligence
by Martin Ford
Published 16 Nov 2018

From 1996 to 1999, he worked for Digital Equipment Corporation’s Western Research Lab in Palo Alto, where he worked on low-overhead profiling tools, design of profiling hardware for out-of-order microprocessors, and web-based information retrieval. From 1990 to 1991, Jeff worked for the World Health Organization’s Global Programme on AIDS, developing software to do statistical modeling, forecasting, and analysis of the HIV pandemic. In 2009, Jeff was elected to the National Academy of Engineering, and he was also named a Fellow of the Association for Computing Machinery (ACM) and a Fellow of the American Association for the Advancement of Sciences (AAAS). His areas of interest include large-scale distributed systems, performance monitoring, compression techniques, information retrieval, application of machine learning to search and other related problems, microprocessor architecture, compiler optimizations, and development of new products that organize existing information in new and interesting ways.

That made me start thinking about AI again. I eventually figured out that the reason Watson won is because it was actually a narrower AI problem than it first appeared to be. That’s almost always the answer. In Watson’s case it’s because about 95% of the answers in Jeopardy turn out to be the titles of Wikipedia pages. Instead of understanding language, reasoning about it and so forth, it was mostly doing information retrieval from a restricted set, namely the pages that are Wikipedia titles. It was actually not as hard of a problem as it looked like to the untutored eye, but it was interesting enough that it got me to think about AI again. Around the same time, I started writing for The New Yorker, where I was producing a lot of pieces about neuroscience, linguistics, psychology, and also AI.

MARTIN FORD: Of course, that’s not a problem that’s exclusive to AI; humans are subject to the same issues when confronted with flawed data. It’s a bias in the data that results from past decisions that people doing research made. BARBARA GROSZ: Right, but now look what’s going on in some areas of medicine. The computer system can, “read all the papers” (more than a person could) and do certain kinds of information retrieval from them and extract results, and then do statistical analyses. But if most of the papers are on scientific work that was done only on male mice, or only on male humans, then the conclusions the system is coming to are limited. We’re also seeing this problem in the legal realm, with policing and fairness.

pages: 259 words: 73,193

The End of Absence: Reclaiming What We've Lost in a World of Constant Connection
by Michael Harris
Published 6 Aug 2014

As we inevitably off-load media content to the cloud—storing our books, our television programs, our videos of the trip to Taiwan, and photos of Grandma’s ninetieth birthday, all on a nameless server—can we happily dematerialize our mind’s stores, too? Perhaps we should side with philosopher Lewis Mumford, who insisted in The Myth of the Machine that “information retrieving,” however expedient, is simply no substitute for the possession of knowledge accrued through personal and direct labor. Author Clive Thompson wondered about this when he came across recent research suggesting that we remember fewer and fewer facts these days—of three thousand people polled by neuroscientist Ian Robertson, the young were less able to recall basic personal information (a full one-third, for example, didn’t know their own phone numbers).

M., 106–7, 109 4chan, 53–54 Foursquare, 150–51 Frankenstein (Shelley), 56 Frankfurt, Harry G., 92 Franklin, Benjamin, 192 friends, 30–31 Frind, Markus, 182–83 Furbies, 29–30 Füssel, Stephan, 103 Gaddam, Sai, 173 Gallup, 123 genes, 41–43 Gentile, Douglas, 118–21 German Ideology, The (Marx), 12n Gleick, James, 137 Globe and Mail, 81–82, 89 glossary, 211–16 Google, 3, 8, 18–19, 24, 33, 43, 49, 82, 96, 142, 185 memory and, 143–47 search results on, 85–86, 91 Google AdSense, 85 Google Books, 102–3 Google Glass, 99–100 Google Maps, 91 Google Plus, 31 Gopnik, Alison, 33–34 Gould, Glenn, 200–201, 204 GPS, 35, 59, 68, 171 Greenfield, Susan, 20, 25 Grindr, 165, 167, 171, 173–74, 176 Guardian, 66n Gutenberg, Johannes, 11–13, 14, 16, 21, 34, 98 Gutenberg Bible, 83, 103 Gutenberg Galaxy, The (McLuhan), 179, 201 Gutenberg Revolution, The (Man), 12n, 103 GuySpy, 171, 172, 173 Hangul, 12n Harari, Haim, 141 Harry Potter series, 66n Hazlehurst, Ronnie, 74 Heilman, James, 75–79 Henry, William A., III, 84–85 “He Poos Clouds” (Pallett), 164 History of Reading, A (Manguel), 16, 117, 159 Hollinghurst, Alan, 115 Holmes, Sherlock, 147–48 House at Pooh Corner, The (Milne), 93 Hugo, Victor, 20–21 “Idea of North, The” (Gould), 200–201 In Defense of Elitism (Henry), 84–85 Information, The (Gleick), 137 information retrieval, 141–42 Innis, Harold, 202 In Search of Lost Time (Proust), 160 Instagram, 19, 104, 149 Internet, 19, 20, 21, 23, 26–27, 55, 69, 125, 126, 129, 141, 143, 145, 146, 187, 199, 205 brain and, 37–38, 40, 142, 185 going without, 185, 186, 189–97, 200, 208–9 remembering life before, 7–8, 15–16, 21–22, 48, 55, 203 Internship, The, 89 iPad, 21, 31 children and, 26–27, 45 iPhone, see phones iPotty, 26 iTunes, 89 Jobs, Steve, 134 Jones, Patrick, 152n Justification of Johann Gutenberg, The (Morrison), 12 Kaiser Foundation, 27, 28n Kandel, Eric, 154 Kaufman, Charlie, 155 Keen, Andrew, 88 Kelly, Kevin, 43 Kierkegaard, Søren, 49 Kinsey, Alfred, 173 knowledge, 11–12, 75, 80, 82, 83, 86, 92, 94, 98, 141, 145–46 Google Books and, 102–3 Wikipedia and, 63, 78 Koller, Daphne, 95 Kranzberg, Melvin, 7 Kundera, Milan, 184 Lanier, Jaron, 85, 106–7, 189 latent Dirichlet allocation (LDA), 64–65 Leonardo da Vinci, 56 Lewis, R.

pages: 288 words: 86,995

Rule of the Robots: How Artificial Intelligence Will Transform Everything
by Martin Ford
Published 13 Sep 2021

The robot is able to engage in rudimentary conversations and do a number of practical things that center around information retrieval; it can look up things on the internet, get weather and traffic reports, play music and so forth. In other words, Jibo offers a set of capabilities that are broadly similar to Amazon’s Alexa-powered Echo smart speakers. The Echo, of course, can’t move at all, but backed by Amazon’s massive cloud computing infrastructure and far larger team of highly paid AI developers, its information retrieval and natural language capabilities are likely stronger—and certain to become more so over time.

Beautiful Data: The Stories Behind Elegant Data Solutions
by Toby Segaran and Jeff Hammerbacher
Published 1 Jul 2009

The author demonstrates shocking prescience. The title of the paper is “A Business Intelligence System,” and it appears to be the first use of the term “Business Intelligence” in its modern context. In addition to the dissemination of information in real time, the system was to allow for “information retrieval”—search—to be conducted over the entire document collection. Luhn’s emphasis on action points focuses the role of information processing on goal completion. In other words, it’s not enough to just collect and aggregate data; an organization must improve its capacity to complete critical tasks because of the insights gleaned from the data.

These may reflect a range of viewpoints and enable us to begin to consider alternative notions of place as we attempt to describe it more effectively. Consequently, Ross Purves and Alistair Edwardes have been using Geograph as a source of descriptions of place in their research at the University of Zurich. Their ultimate objective involves improving information retrieval by automatically adding indexing terms to georeferenced digital photographs that relate to popular notions of place, such as “mountain,” “remote,” or “hiking.” Their work involves validating previous studies and forming new perspectives by comparing Geograph to existing efforts to describe place and analyzing term co-occurrence in the geograph descriptions (Edwardes and Purves 2007).

Zerfos, and J. Cho. “Downloading textual hidden web content through keyword queries.” JCDL 2005: 100–109. Raghavan, S. and H. Garcia-Molina. “Crawling the Hidden Web.” VLDB 2001: 129–138. Salton, G. and M. J. McGill. Introduction to Modern Information Retrieval. New York: McGraw-Hill, 1983. SpiderMonkey (JavaScript-C) Engine, http://www.mozilla.org/js/spidermonkey/. V8 JavaScript Engine, http://code.google.com/p/v8/. Chapter 10: Building Radiohead’s House of Cards, by Aaron Koblin with Valdean Klump. This is the story of how the Grammy-nominated music video for Radiohead’s “House of Cards” was created entirely with data.

pages: 135 words: 26,407

How to DeFi
by Coingecko , Darren Lau , Sze Jin Teh , Kristian Kho , Erina Azmi , Tm Lee and Bobby Ong
Published 22 Mar 2020

Retrieved from https://www.defisnap.io/#/dashboard ~ Chapter 14: DeFi in Action (n.d.). Retrieved October 19, 2019, from https://slideslive.com/38920018/living-on-defi-how-i-survive-argentinas-50-inflation Gundiuc, C. (2019, September 29). Argentina Central Bank Exposed 800 Citizens' Sensitive Information. Retrieved from https://beincrypto.com/argentina-central-bank-exposed-sensitive-information-of-800-citizens/ Lopez, J. M. S. (2020, February 5). Argentina’s ‘little trees’ blossom as forex controls fuel black market. Retrieved from https://www.reuters.com/article/us-argentina-currency-blackmarket/argentinas-little-trees-blossom-as-forex-controls-fuel-black-market-idUSKBN1ZZ1H1 Russo, C. (2019, December 9).

The Art of Computer Programming: Sorting and Searching
by Donald Ervin Knuth
Published 15 Jan 1998

. — TITUS LIVIUS, Ab Urbe Condita XXXIX.vi (Robert Burton, Anatomy of Melancholy 1.2.2.2) This book forms a natural sequel to the material on information structures in Chapter 2 of Volume 1, because it adds the concept of linearly ordered data to the other basic structural ideas. The title "Sorting and Searching" may sound as if this book is only for those systems programmers who are concerned with the preparation of general-purpose sorting routines or applications to information retrieval. But in fact the area of sorting and searching provides an ideal framework for discussing a wide variety of important general issues: • How are good algorithms discovered? • How can given algorithms and programs be improved? • How can the efficiency of algorithms be analyzed mathematically?

For example, given a large file about stage performers, a producer might wish to find all unemployed actresses between 25 and 30 with dancing talent and a French accent; given a large file of baseball statistics, a sportswriter may wish to determine the total number of runs scored by the Chicago White Sox in 1964, during the seventh inning of night games, against left-handed pitchers. Given a large file of data about anything, people like to ask arbitrarily complicated questions. Indeed, we might consider an entire library as a database, and a searcher may want to find everything that has been published about information retrieval. An introduction to the techniques for such secondary key (multi-attribute) retrieval problems appears below in Section 6.5. Before entering into a detailed study of searching, it may be helpful to put things in historical perspective. During the pre-computer era, many books of logarithm tables, trigonometry tables, etc., were compiled, so that mathematical calculations could be replaced by searching.

Suppose that we want to test a given search argument to see whether it is one of the 31 most common words of English (see Figs. 12 and 13 in Section 6.2.2). The data is represented in Table 1 as a trie structure; this name was suggested by E. Fredkin [CACM 3 (1960), 490–500] because it is a part of information retrieval. A trie — pronounced "try" — is essentially an M-ary tree, whose nodes are M-place vectors with components corresponding to digits or characters. Each node on level l represents the set of all keys that begin with a certain sequence of l characters called its prefix; the node specifies an M-way branch, depending on the (l + 1)st character.
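A minimal sketch of the trie lookup described above, using nested Python dictionaries rather than Knuth's M-place vectors; the word list is a small sample of the common English words mentioned in the excerpt.

```python
def build_trie(words):
    """Nested-dict trie; a special key marks the end of a stored word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker
    return root

def lookup(trie, word):
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node

# A few of the most common English words, as in the example above.
common = build_trie(["the", "of", "and", "to", "a", "in", "that", "is"])
print(lookup(common, "that"), lookup(common, "tha"))  # True False
```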

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
by Martin Kleppmann
Published 17 Apr 2017

• Particle physicists have been doing Big Data–style large-scale data analysis for decades, and projects like the Large Hadron Collider (LHC) now work with hundreds of petabytes! At such a scale custom solutions are required to stop the hardware cost from spiraling out of control [49]. • Full-text search is arguably a kind of data model that is frequently used alongside databases. Information retrieval is a large specialist subject that we won’t cover in great detail in this book, but we’ll touch on search indexes in Chapter 3 and Part III. We have to leave it there for now. In the next chapter we will discuss some of the trade-offs that come into play when implementing the data models described in this chapter.

In LevelDB, this in-memory index is a sparse collection of some of the keys, but in Lucene, the in-memory index is a finite state automaton over the characters in the keys, similar to a trie [38]. This automaton can be transformed into a Levenshtein automaton, which supports efficient search for words within a given edit distance [39]. Other fuzzy search techniques go in the direction of document classification and machine learning. See an information retrieval textbook for more detail [e.g., 40]. Keeping everything in memory The data structures discussed so far in this chapter have all been answers to the limitations of disks. Compared to main memory, disks are awkward to deal with. With both magnetic disks and SSDs, data on disk needs to be laid out carefully if you want good performance on reads and writes.
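Lucene's Levenshtein automata are beyond a short sketch, but the edit-distance test they accelerate is easy to show. This brute-force filter computes the classic dynamic-programming distance over an in-memory term list, which is exactly the scan an automaton avoids; the term list is made up.

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Brute-force fuzzy match; a Levenshtein automaton reaches the same
# answers without scanning every term in the dictionary.
terms = ["hello", "help", "shell", "yellow"]
print([t for t in terms if edit_distance("helo", t) <= 1])  # ['hello', 'help']
```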

Schulz and Stoyan Mihov: “Fast String Correction with Levenshtein Automata,” International Journal on Document Analysis and Recognition, volume 5, number 1, pages 67–85, November 2002. doi:10.1007/s10032-002-0082-8 [40] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze: Introduction to Information Retrieval. Cambridge University Press, 2008. ISBN: 978-0-521-86571-5, available online at nlp.stanford.edu/IR-book [41] Michael Stonebraker, Samuel Madden, Daniel J. Abadi, et al.: “The End of an Architectural Era (It’s Time for a Complete Rewrite),” at 33rd International Conference on Very Large Data Bases (VLDB), September 2007. [42] “VoltDB Technical Overview White Paper,” VoltDB, 2014. [43] Stephen M.

pages: 855 words: 178,507

The Information: A History, a Theory, a Flood
by James Gleick
Published 1 Mar 2011

Shannon’s theory made a bridge between information and uncertainty; between information and entropy; and between information and chaos. It led to compact discs and fax machines, computers and cyberspace, Moore’s law and all the world’s Silicon Alleys. Information processing was born, along with information storage and information retrieval. People began to name a successor to the Iron Age and the Steam Age. “Man the food-gatherer reappears incongruously as information-gatherer,”♦ remarked Marshall McLuhan in 1967.♦ He wrote this an instant too soon, in the first dawn of computation and cyberspace. We can see now that information is what our world runs on: the blood and the fuel, the vital principle.

It is an ancient observation, but one that seemed to bear restating when information became plentiful—particularly in a world where all bits are created equal and information is divorced from meaning. The humanist and philosopher of technology Lewis Mumford, for example, restated it in 1970: “Unfortunately, ‘information retrieving,’ however swift, is no substitute for discovering by direct personal inspection knowledge whose very existence one had possibly never been aware of, and following it at one’s own pace through the further ramification of relevant literature.”♦ He begged for a return to “moral self-discipline.”

♦ “KNOWLEDGE OF SPEECH, BUT NOT OF SILENCE”: T. S. Eliot, “The Rock,” in Collected Poems: 1909–1962 (New York: Harcourt Brace, 1963), 147. ♦ “THE TSUNAMI OF AVAILABLE FACT”: David Foster Wallace, Introduction to The Best American Essays 2007 (New York: Mariner, 2007). ♦ “UNFORTUNATELY, ‘INFORMATION RETRIEVING,’ HOWEVER SWIFT”: Lewis Mumford, The Myth of the Machine, vol. 2, The Pentagon of Power (New York: Harcourt, Brace, 1970), 182. ♦ “ELECTRONIC MAIL SYSTEM”: Jacob Palme, “You Have 134 Unread Mail! Do You Want to Read Them Now?” in Computer-Based Message Services, ed. Hugh T. Smith (North Holland: Elsevier, 1984), 175–76

pages: 153 words: 27,424

REST API Design Rulebook
by Mark Masse
Published 19 Oct 2011

The first web server.[8] The first web browser, which Berners-Lee also named “WorldWideWeb” and later renamed “Nexus” to avoid confusion with the Web itself. The first WYSIWYG[9] HTML editor, which was built right into the browser. On August 6, 1991, on the Web’s first page, Berners-Lee wrote, The WorldWideWeb (W3) is a wide-area hypermedia information retrieval initiative aiming to give universal access to a large universe of documents.[10] From that moment, the Web began to grow, at times exponentially. Within five years, the number of web users skyrocketed to 40 million. At one point, the number was doubling every two months. The “universe of documents” that Berners-Lee had described was indeed expanding.

pages: 281 words: 95,852

The Googlization of Everything:
by Siva Vaidhyanathan
Published 1 Jan 2010

In 2009 the core service of Google—its Web search engine—handled more than 70 percent of the Web search business in the United States and more than 90 percent in much of Europe, and grew at impressive rates elsewhere around the world. 15. Thorsten Joachims et al., “Accurately Interpreting Clickthrough Data as Implicit Feedback,” Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Salvador, Brazil: ACM, 2005), 154–61. 16. B. J. Jansen and U. Pooch, “A Review of Web Searching Studies and a Framework for Future Research,” Journal of the American Society for Information Science and Technology 52, no. 3 (2001): 235–46; Amanda Spink and Bernard J. Jansen, Web Search: Public Searching on the Web (Dordrecht: Kluwer Academic Publishers, 2004); Caroline M.

A Comparison of Websites across Countries and Domains,” Journal of Computer-Mediated Communication 12, no. 3 (2007), http://jcmc.indiana.edu. 69. Wingyan Chung, “Web Searching in a Multilingual World,” Communications of the ACM 51, no. 5 (2008): 32–40; Fotis Lazarinis et al., “Current Research Issues and Trends in Non-English Web Searching,” Information Retrieval 12, no. 3 (2009): 230–50. 70. “Google’s Market Share in Your Country.” 71. Choe Sang-Hun, “Crowd’s Wisdom Helps South Korean Search Engine Beat Google and Yahoo,” New York Times, July 4, 2007. 72. “S. Korea May Clash with Google over Internet Regulation Differences,” Hankyoreh, April 17, 2009; Kim Tong-hyung, “Google Refuses to Bow to Gov’t Pressure,” Korea Times, April 9, 2009. 73.

Beautiful Visualization
by Julie Steele
Published 20 Apr 2010

[2] See http://bit.ly/4iZib. [3] See http://en.wikipedia.org/wiki/George_Washingtons_Farewell_Address. [4] See http://avalon.law.yale.edu/18th_century/washing.asp. Chapter Nine The Big Picture: Search and Discovery Todd Holloway Search and discovery are two styles of information retrieval. Search is a familiar modality, well exemplified by Google and other web search engines. While there is a discovery aspect to search engines, there are more straightforward examples of discovery systems, such as product recommendations on Amazon and movie recommendations on Netflix. These two types of retrieval systems have in common that they can be incredibly complex under the hood.

She’s the author of acclaimed site thisisindexed.com, and her work has appeared in the New York Times, the BBC Magazine Online, Paste, Golf Digest, Redbook, New York Magazine, the National Post of Canada, the Guardian, Time, and many other old and new media outlets. Todd Holloway can’t get enough of information visualization, information retrieval, machine learning, data mining, the science of networks, and artificial intelligence. He is a Grinnell College and Indiana University alumnus. Noah Iliinsky has spent the last several years thinking about effective approaches to creating diagrams and other types of information visualization.

RDF Database Systems: Triples Storage and SPARQL Query Processing
by Olivier Cure and Guillaume Blin
Published 10 Dec 2014

Indeed, while compression is the main objective in URI encoding, the main feature sought in RDF stores related to literals is full-text search. The most popular solution for handling full-text search in literals is Lucene, integrated in RDF stores such as Yars2, Jena TDB/SDB, and GraphDB (formerly OWLIM), and in Big Data RDF databases; it is also popular in other systems, such as IBM OmniFind Yahoo! Edition, Technorati, Wikipedia, Internet Archive, and LinkedIn. Lucene is a very popular open-source information-retrieval library from the Apache Software Foundation (originally created in Java by Doug Cutting). It provides Java-based full-text indexing and searching capabilities for applications through an easy-to-use API. Lucene is based on powerful and efficient search algorithms using indexes.

Therefore, they can be used to identify the fastest index among the six clustered indexes. The overall claim of this multiple-index approach is that, due to a clever compression strategy, the total size of the indexes is less than the size required by a standard triples table solution. The system supports both individual update operations and updates to entire batches. More details on RDF-3X and its extension X-RDF-3X are provided in Chapter 6. The YARS (Harth and Decker, 2005) system combines methods from information retrieval and databases to allow for better query answering performance over RDF data. It stores RDF data persistently by using six B+tree indexes. It not only stores the subject, the predicate, and the object, but also the context information about the data origin. Each element of the corresponding quad (i.e., 4-uplet) is encoded in a dictionary storing mappings from literals and URIs to object IDs (object IDs are stored as number identifiers for compactness). To speed up keyword queries, the lexicon keeps an inverted index on string literals to allow fast full-text searches.
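A rough sketch of the dictionary encoding and permuted indexes described above, using sorted Python lists with binary search to stand in for B+tree range scans; the triples and the helper names are invented.

```python
from bisect import bisect_left

# Dictionary-encode terms, then keep sorted permutations of the id-triples.
triples = [("alice", "knows", "bob"), ("bob", "age", "42"),
           ("alice", "age", "30")]

ids, terms = {}, []
def encode(term):
    if term not in ids:
        ids[term] = len(terms)
        terms.append(term)
    return ids[term]

encoded = [tuple(encode(t) for t in triple) for triple in triples]
# Two of the index orderings a store like YARS or RDF-3X maintains:
spo = sorted(encoded)
pos = sorted((p, o, s) for s, p, o in encoded)

def subjects_with(pred, obj):
    """Range scan on the POS index, like a B+tree prefix lookup."""
    key = (ids[pred], ids[obj])
    i = bisect_left(pos, key)
    out = []
    while i < len(pos) and pos[i][:2] == key:
        out.append(terms[pos[i][2]])
        i += 1
    return out

print(subjects_with("age", "30"))  # ['alice']
```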

Getting the Builders in : How to Manage Homebuilding and Renovation Projects
by Sales, Leonard John.
Published 19 Sep 2008

Published by How To Content, A division of How To Books Ltd, Spring Hill House, Spring Hill Road, Begbroke Oxford OX5 1RX, United Kingdom. Tel: (01865) 375794. Fax: (01865) 379162. info@howtobooks.co.uk www.howtobooks.co.uk All rights reserved. No part of this work may be reproduced or stored in an information retrieval system (other than for purposes of review) without the express permission of the publisher in writing. The right of Leonard Sales to be identified as the author of this work has been asserted by him in accordance with the Copyright, Designs and Patents Act 1988. © 2008 Leonard Sales First published 2004 Second edition 2006 Third edition 2008 First published in electronic form 2008 British Library Cataloguing in Publication Data A catalogue record for this book is available from the British Library ISBN 978 1 84803 285 9 Cover design by Baseline Arts Ltd, Oxford Produced for How To Books by Deer Park Productions, Tavistock, Devon Typeset by TW Typesetting, Plymouth, Devon NOTE: The material contained in this book is set out in good faith for general guidance and no liability can be accepted for loss or expense incurred as a result of relying in particular circumstances on statements made in the book.

pages: 123 words: 32,382

Grouped: How Small Groups of Friends Are the Key to Influence on the Social Web
by Paul Adams
Published 1 Nov 2011

Just as we are surrounded by people throughout our daily life, the web is being rebuilt around people. People are increasingly using the web to seek the information they need from each other, rather than from businesses directly. People always sourced information from each other offline, but up until now, online information retrieval tended to be from a business to a person. The second driving factor is an acknowledgment in our business models of the fact that people live in networks. For many years, we considered people as isolated, independent actors. Most of our consumer behavior models are structured this way—people acting independently, moving down a decision funnel, making objective choices along the way.

pages: 353 words: 104,146

European Founders at Work
by Pedro Gairifo Santos
Published 7 Nov 2011

They're a group who meet every year about music recommendations and information retrieval in music. We ended up hiring a guy called Norman, who was both a great scientist and understood all the algorithms and captive audience sort of things, but also an excellent programmer who was able to implement all these ideas. So we got really lucky. The first person we hired was great and he just took over. He chucked out all of our crappy recommendation systems we had and built something good, and then improved it constantly for the next several years. So we had some A/B testing, split testing systems in there for the radio so they could try out new tweaks to the algorithms and see what was performing better.

__________
2. The International Society for Music Information Retrieval

pages: 648 words: 108,814

Solr 1.4 Enterprise Search Server
by David Smiley and Eric Pugh
Published 15 Nov 2009

The major features found in Lucene are as follows:
• A text-based inverted index persistent storage for efficient retrieval of documents by indexed terms
• A rich set of text analyzers to transform a string of text into a series of terms (words), which are the fundamental units indexed and searched
• A query syntax with a parser and a variety of query types from a simple term lookup to exotic fuzzy matches
• A good scoring algorithm based on sound Information Retrieval (IR) principles to produce the more likely candidates first, with flexible means to affect the scoring
• A highlighter feature to show words found in context
• A query spellchecker based on indexed content
For even more information on the query spellchecker, check out the Lucene In Action book (LINA for short) by Erik Hatcher and Otis Gospodnetić.

pages: 321 words: 113,564

AI in Museums: Reflections, Perspectives and Applications
by Sonja Thiel and Johannes C. Bernhardt
Published 31 Dec 2023

Most importantly, at the end of the text recognition workflow, there should be a digital and structured text with a high accuracy, which can then serve as a suitable input for various natural language processing (NLP) tasks. Within NLP, the SBB focussed on a few common tasks that are supposed to be particularly in demand by users and can also benefit information retrieval in the digitized collections, namely, named entity recognition (NER) and named entity disambiguation and linking (EL). The NER system13 developed by the SBB is based on Bidirectional Encoder Representations from Transformers, or BERT (Devlin/Chang/Lee et al. 2019). To adapt the original BERT model for historical texts containing OCR errors, an unsupervised pre-training was done using a selection of 2,333,647 German-language pages from the SBB’s digitized collections, followed by additional supervised training on openly available gold-standard data for NER (Labusch/Neudecker/Zellhöfer 2019).

Dialogue systems, also known as conversational systems, can be categorized into three types: 1) goal-based systems, designed to complete specific user tasks such as scheduling appointments, and typically gather information through questions until the task is fulfilled (Gao et al. 2018; McTear/Callejas/Griol et al. 2016); 2) chatbots, designed for casual conversation on open-ended topics, and trained end-to-end using large datasets of dialogue examples (ibid.); and 3) question-answering systems, which focus on answering a wide range of questions from a knowledge base, with the emphasis on information retrieval (Dimitrakis et al. 2020). In social robots such as Pepper and Nao (Softbank Robotics), or Furhat (Furhat Robotics), simple goal-based dialogue systems are commonly used due to the ease of controlling the knowledge and dialogue (Foster 2019). This nonetheless requires every spoken interaction with the robots to be handcrafted and a dialogue policy to be programmed.

pages: 913 words: 265,787

How the Mind Works
by Steven Pinker
Published 1 Jan 1997

A piece that has been requested recently is more likely to be needed now than a piece that has not been requested for a while. An optimal information-retrieval system should therefore be biased to fetch frequently and recently encountered items. Anderson notes that that is exactly what human memory retrieval does: we remember common and recent events better than rare and long-past events. He found four other classic phenomena in memory research that meet the optimal design criteria independently established for computer information-retrieval systems. A third notable feature of access-consciousness is the emotional coloring of experience.

So the neural medium itself is not necessarily to blame. The psychologist John Anderson has reverse-engineered human memory retrieval, and has shown that the limits of memory are not a byproduct of a mushy storage medium. As programmers like to say, “It’s not a bug, it’s a feature.” In an optimally designed information-retrieval system, an item should be recovered only when the relevance of the item outweighs the cost of retrieving it. Anyone who has used a computerized library retrieval system quickly comes to rue the avalanche of titles spilling across the screen. A human expert, despite our allegedly feeble powers of retrieval, vastly outperforms any computer in locating a piece of information from its content.

A human expert, despite our allegedly feeble powers of retrieval, vastly outperforms any computer in locating a piece of information from its content. When I need to find articles on a topic in an unfamiliar field, I don’t use the library computer; I send email to a pal in the field. What would it mean for an information-retrieval system to be optimally designed? It should cough up the information most likely to be useful at the time of the request. But how could that be known in advance? The probabilities could be estimated, using general laws about what kinds of information are most likely to be needed. If such laws exist, we should be able to find them in information systems in general, not just human memory; for example, the laws should be visible in the statistics of books requested at a library or the files retrieved in a computer.

Bookkeeping the Easy Way
by Wallace W. Kravitz
Published 30 Apr 1990

Any similarity with the names and types of business of a real person or company is purely coincidental. © Copyright 1999 by Barron's Educational Series, Inc. Prior © copyrights 1990, 1983 by Barron's Educational Series, Inc. All rights reserved. No part of this book may be reproduced in any form, by photostat, microfilm, xerography, or any other means, or incorporated into any information retrieval system, electronic or mechanical, without the written permission of the copyright owner. All inquiries should be addressed to: Barron's Educational Series, Inc. 250 Wireless Boulevard Hauppauge, NY 11788 http://www.barronseduc.com Library of Congress Catalog Card No. 99-17245 International Standard Book No. 0-7641-1079-9 Library of Congress Cataloging-in-Publication Data Kravitz, Wallace W.

pages: 429 words: 114,726

The Computer Boys Take Over: Computers, Programmers, and the Politics of Technical Expertise
by Nathan L. Ensmenger
Published 31 Jul 2010

Daniel McCracken, “The Human Side of Computing,” Datamation 7, no. 1 (1961): 9–11. Chapter 6 1. “The Thinking Machine,” Time magazine, January 23, 1950, 54–60. 2. J. Lear, “Can a Mechanical Brain Replace You?” Colliers, no. 131 (1953), 58–63. 3. “Office Robots,” Fortune 45 (January 1952), 82–87, 112, 114, 116, 118. 4. Cheryl Knott Malone, “Imagining Information Retrieval in the Library: Desk Set in Historical Context,” IEEE Annals of the History of Computing 24, no. 3 (2002): 14–22. 5. Ibid. 6. Ibid. 7. Thorstein Veblen, The Theory of the Leisure Class (New York: McMillan, 1899). 8. Thomas Haigh, “The Chromium-Plated Tabulator: Institutionalizing an Electronic Revolution, 1954–1958,” IEEE Annals of the History of Computing 4, no. 23 (2001), 75–104. 9.

In History of Computing: Software Issues, ed. Ulf Hashagen, Reinhard Keil-Slawik, and Arthur Norberg. Berlin: Springer-Verlag, 2002, 25–48. Mahoney, Michael. “What Makes the History of Software Hard.” IEEE Annals of the History of Computing 30 (3) (2008): 8–18. Malone, Cheryl Knott. “Imagining Information Retrieval in the Library: Desk Set in Historical Context.” IEEE Annals of the History of Computing 24 (3) (2002): 14–22. Mandel, Lois. “The Computer Girls.” Cosmopolitan, April 1967, 52–56. Manion, Mark, and William M. Evan. “The Y2K problem: technological risk and professional responsibility.”

pages: 924 words: 196,343

JavaScript & jQuery: The Missing Manual
by David Sawyer McFarland
Published 28 Oct 2011

JavaScript lets a web page react intelligently. With it, you can create smart web forms that let visitors know when they’ve forgotten to include necessary information; you can make elements appear, disappear, or move around a web page (see Figure 1-1); you can even update the contents of a web page with information retrieved from a web server—without having to load a new web page. In short, JavaScript lets you make your websites more engaging and effective. Figure 1-1. JavaScript lets web pages respond to visitors. On Amazon.com, mousing over the “Gifts & Wish Lists” link opens a tab that floats above the other content on the page and offers additional options.

It can be as simple as this: { firstName : 'Bob', lastName : 'Smith' } In this code, firstName acts like a key with a value of Bob—a simple string value. However, the value can also be another object (see Figure 11-10 on page 376), so you can often end up with a complex nested structure—like dolls within dolls. That’s what Flickr’s JSON feed is like. Here’s a small snippet of one of those feeds. It shows the information retrieved for two photos:

{
  "title": "Uploads from Smithsonian Institution",
  "link": "http://www.flickr.com/photos/smithsonian/",
  "description": "",
  "modified": "2011-08-11T13:16:37Z",
  "generator": "http://www.flickr.com/",
  "items": [
    {
      "title": "East Island, June 12, 1966.",
      "link": "http://www.flickr.com/photos/smithsonian/5988083516/",
      "media": {"m": "http://farm7.static.flickr.com/6029/5988083516_bfc9f41286_m.jpg"},
      "date_taken": "2011-07-29T11:45:50-08:00",
      "description": "Short description here",
      "published": "2011-08-11T13:16:37Z",
      "author": "nobody@flickr.com (Smithsonian Institution)",
      "author_id": "25053835@N03",
      "tags": "ocean birds redfootedbooby"
    },
    {
      "title": "Phoenix Island, April 15, 1966.",
      "link": "http://www.flickr.com/photos/smithsonian/5988083472/",
      "media": {"m": "http://farm7.static.flickr.com/6015/5988083472_c646ef2778_m.jpg"},
      "date_taken": "2011-07-29T11:45:48-08:00",
      "description": "Another short description",
      "published": "2011-08-11T13:16:37Z",
      "author": "nobody@flickr.com (Smithsonian Institution)",
      "author_id": "25053835@N03",
      "tags": ""
    }
  ]
}

Flickr’s JSON object has a bit of information about the feed in general: That’s the stuff at the beginning—“title”, “link”, and so on.

You might add some code in your program to do that like this: $('.weed').click(function() { $(this).remove(); }); // end click The problem with this code is that it only applies to elements that already exist. If you programmatically add new divs—<div class="weed">—the click handler isn’t applied to them. Code that applies only to existing elements is also a problem when you use Ajax as described in Part Four of this book. Ajax lets you update content on a page using information retrieved from a web server. Gmail, for example, can display new mail as you receive it by continually retrieving it from a web server and updating the content in the web browser. In this case, your list of received emails changes after you first started using Gmail. Any events that were applied to the page content when the page loads won’t apply to the new content added from the server.

Scikit-Learn Cookbook
by Trent Hauck
Published 3 Nov 2014

We just need to find some distance metric, compute the pairwise distances, and compare the outcomes to what's expected.

Getting ready

A lower-level utility in scikit-learn is sklearn.metrics.pairwise. This contains several functions to compute the distances between the vectors in a matrix X or the distances between the vectors in X and Y easily. This can be useful for information retrieval. For example, given a set of customers with attributes of X, we might want to take a reference customer and find the closest customers to this customer. In fact, we might want to rank customers by the notion of similarity measured by a distance function. The quality of the similarity depends upon the feature space selection as well as any transformation we might do on the space.

pages: 334 words: 123,463

Shadow Libraries: Access to Knowledge in Global Higher Education
by Joe Karaganis
Published 3 May 2018

Traditional publishers are becoming full-spectrum service providers for classroom learning and research, encroaching on tasks performed by libraries, bookstores, teachers and administrators, and technology providers, and incorporating a variety of other student support services. Increasingly, educational publishers understand their competition not as other publishing companies, but as telecommunications companies, software companies, information retrieval providers, and the like. Unlike the music and film industries, however, the educational publishers have had more time, less pressure to evolve toward digital media, and markets that remain largely embedded in institutions, which are more resistant to disintermediation and reliance on individual textbooks than the various consumer markets for “content.”

Local efforts remain small and lack coordination or even communication among them. As a result, the same lessons are learned, forgotten, and relearned. The same failures are experienced repeatedly. Uneven economic growth has exacerbated this fragmentation. As Banerjee notes, “[at] one end of the spectrum the country can boast of a highly specialized information retrieval system”; at the other, many in the Indian populace lack access “even to basic reading material or advice,” much less to the databases or information networks available to better-funded libraries. In our survey, students in both law and social sciences reported widespread use of online resources for their classes; in both cases over three quarters reported doing the majority of their work this way.

pages: 742 words: 137,937

The Future of the Professions: How Technology Will Transform the Work of Human Experts
by Richard Susskind and Daniel Susskind
Published 24 Aug 2015

Here is a system that undoubtedly performs tasks that we would normally think require human intelligence. The version of Watson that competed on Jeopardy! holds over 200 million pages of documents and implements a wide range of AI tools and techniques, including natural language processing, machine learning, speech synthesis, game-playing, information retrieval, intelligent search, knowledge processing and reasoning, and much more. This type of AI, we stress again, is radically different from the first wave of rule-based expert systems of the 1980s (see section 4.9). It is interesting to note, harking back again to the exponential growth of information technology, that the hardware on which Watson ran in 2011 was said to be about the size of the average bedroom.

In the thaw that has followed the winter, over the past few years, we have seen a series of significant developments—Big Data, Watson, robotics, and affective computing—that we believe point to a second wave of AI. In summary, the computerization of the work of professionals began in earnest in the late 1970s with information retrieval systems. Then, in the 1980s, there were first-generation AI systems in the professions, whose main focus was expert systems technologies. In the next decade, the 1990s, there was a shift towards the field of knowledge management, when professionals started to store and retrieve not just source materials but know-how and working practices.

pages: 165 words: 50,798

Intertwingled: Information Changes Everything
by Peter Morville
Published 14 May 2014

In 1992, I started classes at the School of Information and Library Studies, and promptly began to panic. I was stuck in required courses like Reference and Cataloging with people who wanted to be librarians. In hindsight, I’m glad I took those classes, but at the time I was convinced I’d made a very big mistake. It took a while to find my groove. I studied information retrieval and database design. I explored Dialog, the world’s first commercial online search service. And I fell madly in love with the Internet. The tools were crude, the content sparse, but the promise irresistible. A global network of networks that provides universal access to ideas and information: how could anyone who loves knowledge resist that?

pages: 170 words: 49,193

The People vs Tech: How the Internet Is Killing Democracy (And How We Save It)
by Jamie Bartlett
Published 4 Apr 2018

Apple splashed out $200 million for Turi, a machine learning start-up, in 2016, and Intel has invested over $1 billion in AI companies over the past couple of years.7 Market leaders in AI like Google, with the data, the geniuses, the experience and the computing power, won’t be limited to just search and information retrieval. They will also be able to leap ahead in almost anything where AI is important: logistics, driverless cars, medical research, television, factory production, city planning, agriculture, energy use, storage, clerical work, education and who knows what else. Amazon is already a retailer, marketing platform, delivery and logistics network, payment system, credit lender, auction house, book publisher, TV production company, fashion designer and cloud computing provider.8 What next?

Principles of Protocol Design
by Robin Sharp
Published 13 Feb 2008

Challenge-response mechanism for authentication of client (see Section 11.4.4). Coding: ASCII encoding of all PDUs. Addressing: Uniform Resource Identifier (URI) identifies destination system and path to resource. Fault tolerance: Resistance to corruption via optional MD5 checksumming of resource content during transfer. 11.4.3 Web Caching Since most distributed information retrieval applications involve transfer of considerable amounts of data through the network, caching is commonly used in order to reduce the amount of network traffic and reduce response times. HTTP, which is intended to support such applications, therefore includes explicit mechanisms for controlling the operation of caching.

The proceedings of the two series of international workshops on “Intelligent Agents for Telecommunication Applications”, and on “Cooperative Information Agents” are good places to search for the results of recent research into both theory and applications of agents in the telecommunications and information retrieval areas. A new trend in the construction of very large distributed systems is to base them on Grid technology. This is a technology for coordinating the activities of a potentially huge number of computers, in order to supply users with computer power, in the form of CPU power, storage and other resources.

pages: 286 words: 94,017

Future Shock
by Alvin Toffler
Published 1 Jun 1984

The profession of airline flight engineer, he notes, emerged and then began to die out within a brief period of fifteen years. A look at the "help wanted" pages of any major newspaper brings home the fact that new occupations are increasing at a mind-dazzling rate. Systems analyst, console operator, coder, tape librarian, tape handler, are only a few of those connected with computer operations. Information retrieval, optical scanning, thin-film technology all require new kinds of expertise, while old occupations lose importance or vanish altogether. When Fortune magazine in the mid-1960's surveyed 1,003 young executives employed by major American corporations, it found that fully one out of three held a job that simply had not existed until he stepped into it.

This itself, with its demands for uniform discipline, regular hours, attendance checks and the like, was a standardizing force. Advanced technology will, in the future, make much of this unnecessary. A good deal of education will take place in the student's own room at home or in a dorm, at hours of his own choosing. With vast libraries of data available to him via computerized information retrieval systems, with his own tapes and video units, his own language laboratory and his own electronically equipped study carrel, he will be freed, for much of the time, of the restrictions and unpleasantness that dogged him in the lockstep classroom. The technology upon which these new freedoms will be based will inevitably spread through the schools in the years ahead—aggressively pushed, no doubt, by major corporations like IBM, RCA, and Xerox.

pages: 501 words: 145,943

If Mayors Ruled the World: Dysfunctional Nations, Rising Cities
by Benjamin R. Barber
Published 5 Nov 2013

I myself was fascinated when, nearly thirty years ago, I enthused about emerging interactive technologies and the impact they might have on citizenship and “strong democracy”: The wiring of homes for cable television across America . . . the availability of low frequency and satellite transmissions in areas beyond regular transmission or cable and the interactive possibilities of video, computers, and information retrieval systems open up a new mode of human communication that can be used either in civic and constructive ways or in manipulative and destructive ways.19 Mine was one of the earliest instances of anticipatory enthusiasm (though laced with skepticism), but a decade later with the web actually in development, cyber zealots were everywhere predicting a new electronic frontier for civic interactivity.

—than the founders and CEOs of immensely powerful tech firms that are first of all profit-seeking, market-monopolizing, consumer-craving commercial entities no more virtuous (or less virtuous) than oil or tobacco or weapons manufacturing firms. It should not really be a surprise that Apple will exploit cheap labor at its Foxconn subsidiary glass manufacturer in China or that Google will steer to the wind, allowing states like China to dictate the terms of “information retrieval” in their own domains. Or that the World Wide Web is being called the “walled-wide-web” by defenders of an open network who fear they are losing the battle. Dictators, nowadays mostly faltering or gone, are no longer the most potent threat to democracy: robust corporations are, not because they are enemies of popular sovereignty but because court decisions like Buckley v.

pages: 573 words: 157,767

From Bacteria to Bach and Back: The Evolution of Minds
by Daniel C. Dennett
Published 7 Feb 2017

“Flash Signal Evolution, Mate Choice and Predation in Fireflies.” Annual Review of Entomology 53: 293–321. Lieberman, Matthew D. 2013. Social: Why Our Brains Are Wired to Connect. New York: Crown. Littman, Michael L., Susan T. Dumais, and Thomas K. Landauer. 1998. “Automatic Cross-Language Information Retrieval Using Latent Semantic Indexing.” In Cross-Language Information Retrieval, 51–62. New York: Springer. Lycan, William G. 1987. Consciousness. Cambridge, Mass.: MIT Press. MacCready, P. 1999. “An Ambivalent Luddite at a Technological Feast.” Designfax, August. MacKay, D. M. 1968. “Electroencephalogram Potentials Evoked by Accelerated Visual Motion.”

pages: 189 words: 57,632

Content: Selected Essays on Technology, Creativity, Copyright, and the Future of the Future
by Cory Doctorow
Published 15 Sep 2008

Taken more broadly, this kind of metadata can be thought of as a pedigree: who thinks that this document is valuable? How closely correlated have this person's value judgments been with mine in times gone by? This kind of implicit endorsement of information is a far better candidate for an information-retrieval panacea than all the world's schema combined. Amish for QWERTY (Originally published on the O'Reilly Network, 07/09/2003) I learned to type before I learned to write. The QWERTY keyboard layout is hard-wired to my brain, such that I can't write anything of significance without that I have a 101-key keyboard in front of me.

pages: 144 words: 55,142

Interlibrary Loan Practices Handbook
by Cherie L. Weible and Karen L. Janke
Published 15 Apr 2011

If an electronic resources management system is not available or used, it is important to find the interlibrary loan terms on a license and record this information in the ILL department. The terms of the license should be upheld. Regular communication with library staff who are responsible for licensing will ensure that ILL staff are aware of any new or updated license information.

Retrieving the Item

If the print item is owned and available, the call number or other location-specific information should be noted on the request. Borrowers might request a particular edition or year, so careful attention should be paid to make sure the call number and item are an exact match. All requests should be collected and sorted by location and the items pulled from the stacks at least daily.

pages: 190 words: 62,941

Wild Ride: Inside Uber's Quest for World Domination
by Adam Lashinsky
Published 31 Mar 2017

Still, the simplicity of the product masked the complexity of the software code necessary to build it. Camp was getting a master’s degree in software engineering, and though he and his friends bootstrapped StumbleUpon with their labor and little cash, their graduate research dovetailed with the product. Camp’s thesis was on “information retrieval through collaborative interface design and evolutionary algorithms.” Like Facebook, which began a few years later, StumbleUpon was a dorm-room success. It grew quickly to hundreds of thousands of users with Camp and his cofounders as the only employees. (Revenue would follow in later years with an early form of “native” advertising, full-page ads that would appear after several “stumbles,” or items users were discovering.)

pages: 673 words: 164,804

Peer-to-Peer
by Andy Oram
Published 26 Feb 2001

Suppose you query the Gnutella network for “strawberry rhubarb pie.” You expect a few results that let you download a recipe. That’s what we expect from today’s Gnutella system, but it actually doesn’t capture the unique properties Gnutella offers. Remember, Gnutella is a distributed, real-time information retrieval system wherein your query is disseminated across the network in its raw form. That means that every node that receives your query can interpret your query however it wants and respond however it wants, in free form. In fact, Gnutella file-sharing software does just that. Each flavor of Gnutella software interprets the search queries differently.

[75] “eBay Feedback Removal Policy,” http://pages.ebay.com/help/community/fbremove.html. [76] D. Chaum (1981), “Untraceable Electronic Mail, Return Addresses, and Digital Pseudonyms.” Communications of the ACM, vol. 24, no. 2, pp.84-88. [77] “Electronic Frontiers Georgia Remailer Uptime List,” http://anon.efga.org. [78] Tal Malkin (1999), MIT Ph.D. thesis, “Private Information Retrieval and Oblivious Transfer.” [79] Masayuki Abe (1998), “Universally Verifiable MIX-Network with Verification Work Independent of the Number of MIX Servers,” EUROCRYPT ’98, Springer-Verlag LNCS. [80] We ignore the possibility of traffic analysis here and assume that the user chooses more than one hop

pages: 222 words: 74,587

Paper Machines: About Cards & Catalogs, 1548-1929
by Markus Krajewski and Peter Krapp
Published 18 Aug 2011

Paper Machines History and Foundations of Information Science Edited by Michael Buckland, Jonathan Furner, and Markus Krajewski Human Information Retrieval by Julian Warner Good Faith Collaboration: The Culture of Wikipedia by Joseph Michael Reagle Jr. Paper Machines: About Cards & Catalogs, 1548–1929 by Markus Krajewski Paper Machines About Cards & Catalogs, 1548–1929 Markus Krajewski translated by Peter Krapp The MIT Press Cambridge, Massachusetts London, England © 2011 Massachusetts Institute of Technology © für die deutsche Ausgabe 2002, Kulturverlag Kadmos Berlin All rights reserved.

pages: 245 words: 64,288

Robots Will Steal Your Job, But That's OK: How to Survive the Economic Collapse and Be Happy
by Pistono, Federico
Published 14 Oct 2012

While our brains will stay pretty much the same for the next 20 years, computers’ efficiency and computational power will have doubled about twenty times. That is a million-fold increase. So, for the same $3 million you will have a computer a million times more powerful than Watson, or you could have a Watson-equivalent computer for $3. Watson’s computational power and exceptional skills of advanced Natural Language Processing, Information Retrieval, Knowledge Representation and Reasoning, Machine Learning, and open domain question answering are already being put to better use than showing off at a TV contest. IBM and Nuance Communications Inc. are partnering for the research project to develop a commercial product during the next 18 to 24 months that will exploit Watson’s capabilities as a clinical decision support system to aid the diagnosis and treatment of patients.86 Recall the example of automated radiologists we mentioned earlier.

pages: 222 words: 70,132

Move Fast and Break Things: How Facebook, Google, and Amazon Cornered Culture and Undermined Democracy
by Jonathan Taplin
Published 17 Apr 2017

The effect on the thousand people gathered for the conference was revolutionary. Imagine the first performance of Stravinsky’s The Rite of Spring but without the boos and walkouts. People were thunderstruck by this radical upending of what a computer could be. No longer a giant calculation machine, it was a personal tool of communication and information retrieval. 2. It is not an exaggeration to say that the work of Steve Jobs, Bill Gates, Larry Page, and Mark Zuckerberg stands on the shoulders of Doug Engelbart. Yet Engelbart’s vision of the computing future was different from today’s reality. In the run-up to the demonstration, Bill English had enlisted the help of Whole Earth Catalog publisher Stewart Brand, who had produced the Acid Tests with Ken Kesey two years earlier.

pages: 244 words: 66,599

Insanely Great: The Life and Times of Macintosh, the Computer That Changed Everything
by Steven Levy
Published 2 Feb 1994

That was the intangible benefit of HyperCard-a hastening of what now seems an inevitable reordering of the way we consume information. On a more basic level, HyperCard found several niches, the most prevalent being an easy-to-use control panel, or "front end," for databases, providing easy access for files, pictures, notes, and video clips that otherwise would be elusive to those unschooled in the black arts of information retrieval. Thus it became associated with another use of Macintosh that would become central to the computer's role in nudging digital technology a little closer to the familiar: multimedia. In recent years multimedia has taken on a negative connotation in the computer industry. The term is often used with a suspicious fuzziness, and is often dismissed as a meaningless buzzword, tainted by hucksters invoking the word to move new hardware.

pages: 671 words: 228,348

Pro AngularJS
by Adam Freeman
Published 25 Mar 2014

A URL like this will be requested relative to the main HTML document, which means that I don’t have to hard-code protocols, hostnames, and ports into the application.

GET AND POST: PICK THE RIGHT ONE

The rule of thumb is that GET requests should be used for all read-only information retrieval, while POST requests should be used for any operation that changes the application state. In standards-compliance terms, GET requests are for safe interactions (having no side effects besides information retrieval), and POST requests are for unsafe interactions (making a decision or changing something). These conventions are set by the World Wide Web Consortium (W3C), at www.w3.org/Protocols/rfc2616/rfc2616-sec9.html.

pages: 1,302 words: 289,469

The Web Application Hacker's Handbook: Finding and Exploiting Security Flaws
by Dafydd Stuttard and Marcus Pinto
Published 30 Sep 2007

Application Pages Versus Functional Paths

The enumeration techniques described so far have been implicitly driven by one particular picture of how web application content may be conceptualized and cataloged. This picture is inherited from the pre-application days of the World Wide Web, in which web servers functioned as repositories of static information, retrieved using URLs that were effectively filenames. To publish some web content, an author simply generated a bunch of HTML files and copied these into the relevant directory on a web server. When users followed hyperlinks, they navigated the set of files created by the author, requesting each file via its name within the directory tree residing on the server.

The authors' favorite is sqlmap, which can attack MySQL, Oracle, and MS-SQL, among others. It implements UNION-based and inference-based retrieval. It supports various escalation methods, including retrieval of files from the operating system, and command execution under Windows using xp_cmdshell. In practice, sqlmap is an effective tool for database information retrieval through time-delay or other inference methods and can be useful for union-based retrieval. One of the best ways to use it is with the --sql-shell option. This gives the attacker a SQL prompt and performs the necessary union, error-based, or blind SQL injection behind the scenes to send and retrieve results.

/default/fedefault.aspx
SessionUser.Key f7e50aef8fadd30f31f3aeal04cef26ed2ce2be50073c
SessionClient.ID 306
SessionClient.ReviewID 245
UPriv.2100
SessionUser.NetworkLevelUser 0
UPriv.2200
SessionUser.BranchLevelUser 0
SessionDatabase fd219.prod.wahh-bank.com

The following items are commonly included in verbose debug messages:
■ Values of key session variables that can be manipulated via user input
■ Hostnames and credentials for back-end components such as databases
■ File and directory names on the server
■ Information embedded within meaningful session tokens (see Chapter 7)
■ Encryption keys used to protect data transmitted via the client (see Chapter 5)
■ Debug information for exceptions arising in native code components, including the values of CPU registers, contents of the stack, and a list of the loaded DLLs and their base addresses (see Chapter 16)

When this kind of error reporting functionality is present in live production code, it may signify a critical weakness in the application's security. You should review it closely to identify any items that can be used to further advance your attack, and any ways in which you can supply crafted input to manipulate the application's state and control the information retrieved.

Server and Database Messages

Informative error messages are often returned not by the application itself but by some back-end component such as a database, mail server, or SOAP server. If a completely unhandled error occurs, the application typically responds with an HTTP 500 status code, and the response body may contain further information about the error.

pages: 1,164 words: 309,327

Trading and Exchanges: Market Microstructure for Practitioners
by Larry Harris
Published 2 Jan 2003

In contrast, if the manager wants to buy the stock because he believes that it is fundamentally undervalued, Bob can be more patient. The prices of such stocks usually do not rise so quickly that Bob needs to hurry to trade. The portfolio manager says that he wants to buy Exxon Mobil because he believes it is fundamentally undervalued. Bob then uses an electronic information retrieval system to examine the recent price and trade history for Exxon Mobil. He looks to see whether other traders are trying to fill large orders. If a large seller is pushing prices down, Bob might be able to fill his order quickly at a good price. If Bob must compete with another large buyer, the order may be hard to execute at a good price.

In addition to their regulatory functions, the SEC and CFTC collect and disseminate information useful to traders, investors, speculators, and legislators. The SEC collects various financial reports from issuers and position reports from large traders. Investors who are interested in estimating security values can access these reports over the Internet via the SEC’s EDGAR information retrieval system. The CFTC likewise collects and publishes information about commodity market supply and demand conditions and large trader positions. Traders use this information to value commodities and to forecast what other traders might do in the future. Both organizations also provide information to Congress through their regular annual reports, their special reports on specific issues, their testimony at congressional hearings, and their responses to requests for information from members of Congress and their staffs.

His notes now project price targets of 20 and 25 dollars per share, with the possibility of more than 50 dollars a share by the time the new plant comes on line. 12.1.1 The Successful Ending: Bill Profits Some traders who follow BNB closely see the price change. They immediately query their electronic information retrieval services to determine why the stock is moving, and when it started to move. They find the story about producing in China and see that the price increase immediately followed its publication. Although the news has no particular fundamental value, many traders infer more from the story than they should because of the large positive price change that followed the announcement.

pages: 242 words: 245

The New Ruthless Economy: Work & Power in the Digital Age
by Simon Head
Published 14 Aug 2003

Instead, the company believed that "reducing dependency on people knowledge and skills through expert and artificial intelligence systems" offered the best approach. With the expert system containing "most, if not all, of the knowledge required to perform a task or solve a problem," the knowledgeability of the agent could be confined "largely to data entry and information retrieval procedures"—echoes of Hammer and Champy's deal structurers and case managers.29 The chief KM problem faced by MMR's software engineers was how to achieve an accurate definition of the problem to be solved by CasePoint. The one thing the expert system could not do was provide for itself an accurate description of the symptoms of machine breakdown.

pages: 242 words: 71,938

The Google Resume: How to Prepare for a Career and Land a Job at Apple, Microsoft, Google, or Any Top Tech Company
by Gayle Laakmann Mcdowell
Published 25 Jan 2011

The one thing that would make this slightly stronger is for Bill to list the dates of the projects. Distributed Hash Table (Language/Platform: Java/Linux) Successfully implemented Distributed Hash Table based on chord lookup protocol, Chord protocol is one solution for connecting the peers of a P2P network. Chord consistently maps a key onto a node. Information Retrieval System (Language/Platform: Java/Linux) Developed an indexer to index corpus of file and a Query Processor to process the Boolean query. The Query Processor outputs the file name, title, line number, and word position. Implemented using Java API such as serialization and collections (Sortedset, Hashmaps).

Mastering Book-Keeping: A Complete Guide to the Principles and Practice of Business Accounting
by Peter Marshall
Published 1 Feb 1997

Fax: (01865) 379162. info@howtobooks.co.uk www.howtobooks.co.uk © 2009 Dr Peter Marshall First edition 1992 Second edition 1995 Third edition 1997 Fourth edition 1999 Fifth edition 2001 Sixth edition 2003 Seventh edition 2005 Reprinted 2006 Eighth edition 2009 First published in electronic form 2009 All rights reserved. No part of this work may be reproduced or stored in an information retrieval system (other than for purposes of review) without the express permission of the publisher in writing. The rights of Peter Marshall to be identified as the author of this work have been asserted by him in accordance with the Copyright Designs and Patents Act 1988. British Library Cataloguing in Publication Data A catalogue record for this book is available from the British Library ISBN 978 1 84803 324 5 Produced for How To Books by Deer Park Productions, Tavistock, Devon Typeset by PDQ Typesetting, Newcastle-under-Lyme, Staffordshire Cover design by Baseline Arts Ltd, Oxford NOTE: The material contained in this book is set out in good faith for general guidance and no liability can be accepted for loss or expense incurred as a result of relying in particular circumstances on statements made in the book.

Deep Work: Rules for Focused Success in a Distracted World
by Cal Newport
Published 5 Jan 2016

In this case, I would suggest that you maintain the strategy of scheduling Internet use even after the workday is over. To simplify matters, when scheduling Internet use after work, you can allow time-sensitive communication into your offline blocks (e.g., texting with a friend to agree on where you’ll meet for dinner), as well as time-sensitive information retrieval (e.g., looking up the location of the restaurant on your phone). Outside of these pragmatic exceptions, however, when in an offline block, put your phone away, ignore texts, and refrain from Internet usage. As in the workplace variation of this strategy, if the Internet plays a large and important role in your evening entertainment, that’s fine: Schedule lots of long Internet blocks.

pages: 238 words: 77,730

Final Jeopardy: Man vs. Machine and the Quest to Know Everything
by Stephen Baker
Published 17 Feb 2011

Researchers at Harvard, studying the brain scans of people suffering from tip of the tongue syndrome, have noted increased activity in the anterior cingulate—a part of the brain behind the frontal lobe, devoted to conflict resolution and detecting surprise. Few of these conflicts appeared to interfere with Jennings’s information retrieval. During his unprecedented seventy-four-game streak, he routinely won the buzz on more than half the clues. And his snap judgments that the answers were on call in his head somewhere led him to a remarkable 92 percent precision rate, according to statistics compiled by the quiz show’s fans.

pages: 345 words: 75,660

Prediction Machines: The Simple Economics of Artificial Intelligence
by Ajay Agrawal , Joshua Gans and Avi Goldfarb
Published 16 Apr 2018

To be mobile-first is to drive traffic to your mobile experience and optimize consumers’ interfaces for mobile even at the expense of your full website and other platforms. The last part is what makes it strategic. “Do well on mobile” is something to aim for. But saying you will do so even if it harms other channels is a real commitment. What does this mean in the context of AI-first? Google’s research director Peter Norvig gives an answer: With information retrieval, anything over 80% recall and precision is pretty good—not every suggestion has to be perfect, since the user can ignore the bad suggestions. With assistance, there is a much higher barrier. You wouldn’t use a service that booked the wrong reservation 20% of the time, or even 2% of the time.

Mastering Structured Data on the Semantic Web: From HTML5 Microdata to Linked Open Data
by Leslie Sikos
Published 10 Jul 2015

Oracle (2015) Oracle Spatial and Graph. www.oracle.com/technetwork/database/options/spatialandgraph/overview/index.html. Accessed 10 April 2015.
11. SYSTAP LLC (2015) Blazegraph. www.blazegraph.com/bigdata. Accessed 10 April 2015.

Chapter 7 Querying

While machine-readable datasets are published primarily for software agents, automatic data extraction is not always an option. Semantic Information Retrieval often involves users searching for the answer to a complex question, based on the formally represented knowledge in a dataset or database. While Structured Query Language (SQL) is used to query relational databases, querying graph databases and flat Resource Description Framework (RDF) files can be done using the SPARQL Protocol and RDF Query Language (SPARQL), the primary query language of RDF, which is much more powerful than SQL.

Writing Effective Use Cases
by Alistair Cockburn
Published 30 Sep 2000

System has been setup to require the Shopper to identify themselves: Shopper establishes identity
2f. System is setup to interact with known other systems (parts inventory, process & planning) that will affect product availability and selection:
2f.1. System interacts with known other systems (parts inventory, process & planning) to get the needed information. (Retrieve Part Availability, Retrieve Build Schedule.)
2f.2. System uses the results to filter or show availability of product and/or options (parts).
2g. Shopper was presented and selects a link to an Industry related web-site: Shopper views other web-site.
2h. System is setup to interact with known Customer Information System:
2h.1.

Paper Knowledge: Toward a Media History of Documents
by Lisa Gitelman
Published 26 Mar 2014

Both microform databanks and Sutherland’s Sketchpad gesture selectively toward a prehistory for the pdf page image because both—though differently—mobilized pages and images of pages for a screen-­based interface. The databanks retrieved televisual reproductions of existing source pages, modeling not just information retrieval but also encouraging certain citation norms (since users could indicate that, for example, “the information appears on page 10”). Meanwhile, Sketchpad established a page as a fixed computational field, a visible ground on which further computational objects might be rendered. The portable document format is related more tenuously to mainframes and microform, even though today’s reference databases—the majority of which of course include and serve up pdf —clearly descend in some measure from experiments like Intrex and the Times Information Bank.

pages: 411 words: 80,925

What's Mine Is Yours: How Collaborative Consumption Is Changing the Way We Live
by Rachel Botsman and Roo Rogers
Published 2 Jan 2010

This section was heavily influenced by Richard Grants, “Drowning in Plastic: The Great Pacific Garbage Patch Is Twice the Size of France,” Telegraph (April 24, 2009), www.telegraph.co.uk/earth/environment/5208645/Drowning-in-plastic-The-Great-Pacific-Garbage-Patch-is-twice-the-size-of-France.html. 5. Statistics on annual consumption of plastic materials come from “Plastics Recycling Information.” Retrieved August 2009, www.wasteonline.org.uk/resources/InformationSheets/Plastics.htm. 6. Thomas M. Kostigen, “The World’s Largest Dump: The Great Pacific Garbage Patch,” Discover Magazine (July 10, 2008), http://discovermagazine.com/2008/jul/10-the-worlds-largest-dump. 7. Paul Hawken, Amory Lovins, and L.

pages: 791 words: 85,159

Social Life of Information
by John Seely Brown and Paul Duguid
Published 2 Feb 2000

The definitions of knowledge management that began this chapter perform a familiar two-step. First, they define the core problem in terms of information, so that, second, they can put solutions in the province of information technology.13 Here, retrieval looks as easy as search. If information retrieval were all that is required for such things as knowledge management or best practice, HP would have nothing to worry about. It has an abundance of very good information technology. The persistence of HP's problem, then, argues that knowledge management, knowledge, and learning involve more than information.

pages: 290 words: 83,248

The Greed Merchants: How the Investment Banks Exploited the System
by Philip Augar
Published 20 Apr 2005

They range in size from top firms like UBS, Fidelity, State Street, and Barclays Global Investors, which manage over a trillion dollars apiece, to small hedge funds looking after a few million dollars. They rely heavily on brokers whose job is to provide them with advice, information and share dealing: ‘Our best brokers have a great appetite for information retrieval and dissemination. We get our first Bloomberg messages at 5.20 a.m., it’s an information game. We pay brokers $60 million of commission out of a $3 billion fund and most goes to those that phone us most often. They are fast ten-second conversations, often Bloomberg driven. I get a thousand e-mails a day and I read them all.’7 The broking divisions of the top investment banks flood their clients with information: ‘We give them a view on every single price movement; it’s all about short term momentum.

pages: 382 words: 92,138

The Entrepreneurial State: Debunking Public vs. Private Sector Myths
by Mariana Mazzucato
Published 1 Jan 2011

Available online at http://www.guardian.co.uk/technology/2002/apr/04/internetnews.maths/print (accessed 10 October 2012). DIUS (Department of Innovation, Universities and Skills). 2008. Innovation Nation, March. Cm 7345. London: DIUS. DoD (United States Department of Defense). 2011. Selected Acquisition Report (SAR): RCS: DD-A&T(Q&A)823-166 : NAVSTAR GPS: Defense Acquisition Management Information Retrieval (DAMIR). Los Angeles, 31 December. DoE (United States Department of Energy). 2007. ‘DOE-Supported Researcher Is Co-winner of 2007 Nobel Prize in Physics’. 10 September. Available online at http://science.energy.gov/news/in-the-news/2007/10-09-07/?p=1 (accessed 21 January 2013). _____. 2009.

pages: 344 words: 94,332

The 100-Year Life: Living and Working in an Age of Longevity
by Lynda Gratton and Andrew Scott
Published 1 Jun 2016

There are those who argue that even these skills can be performed by AI – pointing, for example, to the development of IBM’s supercomputer Watson, which is able to perform detailed oncological diagnosis. This means that with diagnostic augmentation, the skill set for the medical profession will shift from information retrieval to deeper intuitive experience, more person-to-person skills and greater emphasis on team motivation and judgement. The same technological developments will occur in the education sector, where digital teaching will replace textbooks and classroom teaching and the valuable skills will move towards the intricate human skills of empathy, motivation and encouragement.

Noam Chomsky: A Life of Dissent
by Robert F. Barsky
Published 2 Feb 1997

He tried to use the features of linguistic analysis for discourse analysis" (qtd. in R. A. Harris 83). From this project discourse analysis was born. Chomsky was in search of transformations "to model the linguistic knowledge in a native speaker's head," while Harris was interested in "such practical purposes as machine translation and automated information retrieval" (R. A. Harris 84). Their linguistic interests were irrevocably diverging. Chomsky's last communications with Harris were in the early 1960s, "when [Harris] asked me to [approach] contacts at the [National Science Foundation] for a research contract for him, which I did. We then spent a couple of days together in Israel, in 1964.

pages: 339 words: 94,769

Possible Minds: Twenty-Five Ways of Looking at AI
by John Brockman
Published 19 Feb 2019

To be fair, the human body needs 100 watts to operate and twenty years to build, hence about 6 trillion joules of energy to “manufacture” a mature human brain. The cost of manufacturing Watson-scale computing is similar. So why aren’t humans displacing computers? For one, the Jeopardy! contestants’ brains were doing far more than information retrieval—much of which would be considered mere distractions by Watson (e.g., cerebellar control of smiling). Other parts allow leaping out of the box with transcendence unfathomable by Watson, such as what we see in Einstein’s five annus mirabilis papers of 1905. Also, humans consume more energy than the minimum (100 watts) required for life and reproduction.

The Internet Trap: How the Digital Economy Builds Monopolies and Undermines Democracy
by Matthew Hindman
Published 24 Sep 2018

Kirshenbaum, E., Forman, G., and Dugan, M. (2012). A live comparison of methods for personalized article recommendation at Forbes.com. Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Bristol, England (pp. 51–66). Knight Foundation. (2016, May). Mobile first news: how people use smartphones to access information. Retrieved from https://www.knightfoundation.org/media /uploads/publication_pdfs/KF_Mobile-Report_Final_050916.pdf. Kohavi, R., Deng, A., Frasca, B., Walker, T., Xu, Y., and Pohlmann, N. (2013). Online controlled experiments at large scale. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge Discovery and Data Mining, Chicago, IL (pp. 1168–76).

The Myth of Artificial Intelligence: Why Computers Can't Think the Way We Do
by Erik J. Larson
Published 5 Apr 2021

Alice easily succeeds at improving Hugh-Machine’s chess competence (despite his being a champion player already), by simply downloading some StockFish chess code off her smartphone. Similarly, she gives the Hugh-Machine perfect arithmetical abilities with a calculator, and supercomputer memory, as well as access to all the information retrievable by Google. System X is optimal, and the Hugh-Machine can do something that Hugh Alexander, for all his seeming intelligence, could not: it can play superhuman chess, and superhumanly add numbers, and excel at many other System X things. The problem is, so too could Bob-Machine. In fact, Alice realizes that Bob-Machine and Hugh-Machine are provably equivalent.

UNIX® Network Programming, Volume 1: The Sockets Networking API, 3rd Edition
by W. Richard Stevens, Bill Fenner, Andrew M. Rudoff
Published 8 Jun 2013

Unfortunately, it is implementation-dependent how an administrator configures a host to use the different types of name services. Solaris 2.x, HP-UX 10 and later, and FreeBSD 5.x and later use the file /etc/nsswitch.conf, and AIX uses the file /etc/netsvc.conf. BIND 9.2.2 supplies its own version named the Information Retrieval Service (IRS), which uses the file /etc/irs.conf. If a name server is to be used for hostname lookups, then all these systems use the file /etc/resolv.conf to specify the IP addresses of the name servers. Fortunately, these differences are normally hidden to the application programmer, so we just call the resolver functions such as gethostbyname and gethostbyaddr.

[Back-of-book index pages, condensed: among several pages of index entries, the matches here are "Information Retrieval Service, see IRS" and "IRS (Information Retrieval Service), 306".]

How to Form Your Own California Corporation
by Anthony Mancuso
Published 2 Jan 1977

California Secretary of State contact Information www.ss.ca.gov/business/corp/corporate.htm Office hours for all locations are Monday through Friday 8:00 a.m. to 5:00 p.m. Sacramento Office 1500 11th Street Sacramento, CA 95814 (916) 657-5448* • Name Availability Unit (*recorded information on how to obtain) • Document Filing Support Unit • Legal Review Unit • Information Retrieval and Certification Unit • Status (*recorded information on how to obtain) • Statement of Information Unit (filings only) P.O. Box 944230 Sacramento, CA 94244-2300 • Substituted Service of Process (must be hand delivered to the Sacramento office) San Francisco Regional Office 455 Golden Gate Avenue, Suite 14500 San Francisco, CA 94102-7007 415-557-8000 Fresno Regional Office 1315 Van Ness Ave., Suite 203 Fresno, CA 93721-1729 559-445-6900 Los Angeles Regional Office 300 South Spring Street, Room 12513 Los Angeles, CA 90013-1233 213-897-3062 San Diego Regional Office 1350 Front Street, Suite 2060 San Diego, CA 92101-3609 619-525-4113 California Department of Corporations contact information www.corp.ca.gov Contact Information The Department of Corporations, the office that receives your Notice of Stock Issuance, as explained in Chapter 5, Step 7, has four offices.

pages: 352 words: 96,532

Where Wizards Stay Up Late: The Origins of the Internet
by Katie Hafner and Matthew Lyon
Published 1 Jan 1996

It was “published” electronically in the MsgGroup in 1977. They went on: “As computer communication systems become more powerful, more humane, more forgiving and above all, cheaper, they will become ubiquitous.” Automated hotel reservations, credit checking, real-time financial transactions, access to insurance and medical records, general information retrieval, and real-time inventory control in businesses would all come. In the late 1970s, the Information Processing Techniques Office’s final report to ARPA management on the completion of the ARPANET research program concluded similarly: “The largest single surprise of the ARPANET program has been the incredible popularity and success of network mail.

pages: 364 words: 102,926

What the F: What Swearing Reveals About Our Language, Our Brains, and Ourselves
by Benjamin K. Bergen
Published 12 Sep 2016

NBC Sports. Retrieved from http://profootballtalk.nbcsports.com/2014/03/03/richard-sherman-calls-nfl-banning-the-n-word-an-atrocious-idea. Snopes. (October 11, 2014). Pluck Yew. Retrieved from http://www.snopes.com/language/apocryph/pluckyew.asp. Social Security Administration. (n.d.). Background information. Retrieved from https://www.ssa.gov/oact/babynames/background.html. Songbass. (November 3, 2008). Obama gives McCain the middle finger. YouTube. Retrieved from https://www.youtube.com/watch?v=Pc8Wc1CN7sY. Spears, A. K. (1998). African-American language use: Ideology and so-called obscenity. In African-American English: Structure, history, and use, Salikoko S.

pages: 386 words: 91,913

The Elements of Power: Gadgets, Guns, and the Struggle for a Sustainable Future in the Rare Metal Age
by David S. Abraham
Published 27 Oct 2015

O’Rourke, “Navy Virginia (SSN-774) Class Attack Submarine Procurement: Background and Issues for Congress,” Congressional Research Service, July 31, 2014, http://www.fas.org/sgp/crs/weapons/RL32418.pdf. For information on Virginia class submarine purchases, see, “DDG 51 Arleigh Burke Class Guided Missile Destroyer,” Defense Acquisition Management Information Retrieval, December 31, 2012, accessed December 18, 2014, http://www.dod.mil/pubs/foi/logistics_material_readiness/acq_bud_fin/SARs/2012-sars/13-F-0884_SARs_as_of_Dec_2012/Navy/DDG_51_December_2012_SAR.pdf. For information on the DDG 51 Aegis Destroyer Ships as of 2012, including expected production until 2016, see “Next Global Positioning System Receiver Equipment,” Committee Reports 113th Congress (2013–2014), House Report 113-102, June 7, 2013, accessed December 18, 2014, thomas.loc.gov/cgi-bin/cpquery/?

pages: 443 words: 98,113

The Corruption of Capitalism: Why Rentiers Thrive and Work Does Not Pay
by Guy Standing
Published 13 Jul 2016

McClintick, ‘How Harvard lost Russia’, Institutional Investor, 27 February 2006. 2 ‘The new age of crony capitalism’, The Economist, 15 March 2014, pp. 9, 53–4; ‘The party winds down’, The Economist, 7 May 2016, pp. 46–8. 3 M. Lupu, K. Mayer, J. Tait and A. J. Trippe (eds), Current Challenges in Patent Information Retrieval (Heidelberg: Springer-Verlag, 2011), p. v. 4 Letter to Isaac McPherson, 13 August 1813. A public good is one that can be consumed or used by one person without affecting its consumption or use by others; it is available to all. 5 ‘A question of utility’, The Economist, 8 August 2015. 6 M.

Your Own Allotment : How to Find It, Cultivate It, and Enjoy Growing Your Own Food
by Russell-Jones, Neil.
Published 21 Mar 2008

YOUR OWN ALLOTMENT If you want to know how… How to Grow Your Own Food A week-by-week guide to wildlife-friendly fruit and vegetable gardening Planning and Creating Your First Garden A step-by-step guide to designing your garden – whatever your experience or knowledge How to Start Your Own Gardening Business An insider guide to setting yourself up as a professional gardener Please send for a free copy of the latest catalogue: How To Books Ltd Spring Hill House, Spring Hill Road, Begbroke Oxford OX5 1RX, United Kingdom info@howtobooks.co.uk www.howtobooks.co.uk YOUR OWN ALLOTMENT How to find it, cultivate it, and enjoy growing your own food Neil Russell-Jones SPRING HILL Published by How To Content, A division of How To Books Ltd, Spring Hill House, Spring Hill Road, Begbroke, Oxford OX5 1RX, United Kingdom. Tel: (01865) 375794. Fax: (01865) 379162. info@howtobooks.co.uk www.howtobooks.co.uk All rights reserved. No part of this work may be reproduced or stored in an information retrieval system (other than for purposes of review), without the express permission of the publisher in writing. The right of Neil Russell-Jones to be identified as author of this work has been asserted by him in accordance with the Copyright, Design and Patents Act 1988. © 2008 Neil Russell-Jones First edition 2008 First published in electronic form 2008 British Library Cataloguing in Publication Data A catalogue record for this book is available from the British Library ISBN 978 1 84803 247 7 Cover design by Mousemat Design Illustrations by Deborah Andrews Produced for How To Books by Deer Park Productions, Tavistock, Devon Typeset by Pantek Arts Ltd, Maidstone, Kent NOTE: The material contained in this book is set out in good faith for general guidance and no liability can be accepted for loss or expense incurred as a result of relying in particular circumstances on statements made in this book. The laws and regulations are complex and liable to change, and readers should check the current positions with the relevant authorities before making personal arrangements.

Lessons-Learned-in-Software-Testing-A-Context-Driven-Approach
by Anson-QA

For example (these examples are all based on successes from our personal experience), imagine bringing in a smart person whose most recent work role was as an attorney who can analyze any specification you can give them and is trained as an advocate, a director of sales and marketing (the one we hired trained our staff in new methods of researching and writing bug reports to draw the attention of the marketing department), a hardware repair technician, a librarian (think about testing databases or other information retrieval systems), a programmer, a project manager (of nonsoftware projects), a technical support representative with experience supporting products like the ones you're testing, a translator (especially useful if your company publishes software in many languages), a secretary (think about all the information that you collect, store, and disseminate and all the time management you and your staff have to do), a system administrator who knows networks, or a user of the software you're testing.

Artificial Whiteness
by Yarden Katz

In the 1990s probabilistic modeling and inference were becoming AI’s dominant new computational engine and starting to displace logic-based approaches to reasoning within the field. These probabilistic frameworks, which crystalized in the 1980s, did not always develop under the umbrella of “AI” but also under headings such as “statistical pattern recognition,” “data mining,” or “information retrieval.”97 Regardless, these frameworks were being absorbed into AI’s familiar narratives. An article in AI Magazine titled “Is AI Going Mainstream at Last? A Look Inside Microsoft Research” (1993) exemplifies this turn. The piece omits AI’s history of shifting in and out of the mainstream, claiming that “AI” merely had its “15 minutes of fame in the mid-80s,” but that new developments in probabilistic modeling could put it back on the map.

JUST ONE DAMNED THING AFTER ANOTHER
by Jodi Taylor
Published 8 Jan 2013

I fell in love with the library, which, together with the Hall, obviously constituted the heart of the building. High ceilings made it spacious and a huge fireplace made it cosy. Comfortable chairs were scattered around and tall windows all along one wall let the sunshine flood in. As well as bays of books they had all the latest electronic information retrieval systems, study areas and data tables and through an archway, a huge archive. ‘You name it, we’ve got it somewhere,’ said Doctor Dowson, introduced to me as Librarian and Archivist and who appeared to be wearing a kind of sou’wester. ‘At least until that old fool upstairs blows us all sky high.

pages: 519 words: 102,669

Programming Collective Intelligence
by Toby Segaran
Published 17 Dec 2008

Algorithms for full-text searches are among the most important collective intelligence algorithms, and many fortunes have been made by new ideas in this field. It is widely believed that Google's rapid rise from an academic project to the world's most popular search engine was based largely on the PageRank algorithm, a variation that you'll learn about in this chapter. Information retrieval is a huge field with a long history. This chapter will only be able to cover a few key concepts, but we'll go through the construction of a search engine that will index a set of documents and leave you with ideas on how to improve things further. Although the focus will be on algorithms for searching and ranking rather than on the infrastructure requirements for indexing large portions of the Web, the search engine you build should have no problem with collections of up to 100,000 pages.
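
As an illustration of the PageRank idea the excerpt credits for Google's rise, here is a toy power-iteration sketch; the three-page link graph and the damping factor 0.85 are made up for the example, and this is not Segaran's code.

```python
# Toy power-iteration PageRank over a tiny made-up link graph.
links = {          # page -> pages it links out to
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}
d = 0.85           # damping factor
pr = {p: 1.0 / len(links) for p in links}

for _ in range(50):  # a fixed iteration count converges for a graph this small
    new = {}
    for p in links:
        # rank flowing into p from every page q that links to it
        incoming = sum(pr[q] / len(links[q]) for q in links if p in links[q])
        new[p] = (1 - d) / len(links) + d * incoming
    pr = new

print(pr)  # c edges out a; b ranks lowest in this graph
```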

pages: 433 words: 106,048

The End of Illness
by David B. Agus
Published 15 Oct 2012

The poll also found that 68 percent of those who have access have used the Internet to look for information about specific medicines, and nearly four in ten use it to look for other patients’ experiences of a condition. Without a doubt new technologies are helping more people around the world to find out more about their health and to make better-informed decisions, but often their online searches lack usefulness because the information retrieved cannot be personalized. Relying on dodgy information can easily lead people to take risks with inappropriate tests and treatments, wasting money and causing unnecessary worry. But with a health-record system like Dell’s and its developing infrastructure to tailor health advice and guidance to individual people based on their personal records, the outcome could be revolutionary to our health-care system, instigating the reform that’s sorely needed.

pages: 540 words: 103,101

Building Microservices
by Sam Newman
Published 25 Dec 2014

However, we still need to know how to set up and maintain these systems in a resilient fashion. Starting Again The architecture that gets you started may not be the architecture that keeps you going when your system has to handle very different volumes of load. As Jeff Dean said in his presentation “Challenges in Building Large-Scale Information Retrieval Systems” (WSDM 2009 conference), you should “design for ~10× growth, but plan to rewrite before ~100×.” At certain points, you need to do something pretty radical to support the next level of growth. Recall the story of Gilt, which we touched on in Chapter 6. A simple monolithic Rails application did well for Gilt for two years.

pages: 416 words: 106,582

This Will Make You Smarter: 150 New Scientific Concepts to Improve Your Thinking
by John Brockman
Published 14 Feb 2012

Sometimes this information is found by directed search using a Web search engine, sometimes by serendipity by following links, and sometimes by asking hundreds of people in your social network or hundreds of thousands of people on a question-answering Web site such as Answers.com, Quora, or Yahoo Answers. I do not actually know of a real findability index, but tools in the field of information retrieval could be applied to develop one. One of the unsolved problems in the field is how to help the searcher to determine if the information simply is not available. An Assertion Is Often an Empirical Question, Settled by Collecting Evidence Susan Fiske Eugene Higgins Professor of Psychology, Princeton University; author, Envy Up, Scorn Down: How Status Divides Us The most important scientific concept is that an assertion is often an empirical question, settled by collecting evidence.

pages: 461 words: 106,027

Zero to Sold: How to Start, Run, and Sell a Bootstrapped Business
by Arvid Kahl
Published 24 Jun 2020

Start by explaining your documents and what they contain in an overview document. Provide a master document that gives your buyer quick access to the data they're looking for at a glance. If you're storing all of your documents in cloud storage like Google Drive, you can cross-link between documents easily. Anything you can do to speed up information retrieval will make the due diligence process less stressful. While the due diligence phase usually comes with certain legal guarantees, don't be naive: there will be bad actors in the field, and some people will just promise more than they're willing to do. While most buyers are serious, some may just want to take a look under the hood of your business.

Reset
by Ronald J. Deibert
Published 14 Aug 2020

Retrieved from https://www.salon.com/2019/06/17/lithium-mining-for-green-electric-cars-is-leaving-a-stain-on-the-planet/ In Chile’s Atacama and Argentina’s Salar de Hombre Muerto regions: Zacune. Lithium. More than half of the world’s cobalt supply is sourced from the Democratic Republic of Congo: U.S. Department of the Interior. (n.d.). Cobalt statistics and information. Retrieved June 16, 2020, from https://www.usgs.gov/centers/nmic/cobalt-statistics-and-information; Eichstaedt, P. (2011). Consuming the Congo: War and conflict minerals in the world’s deadliest place. Chicago Review Press. Cobalt mining operations in the DRC routinely use child labour: Amnesty International. (2016).

The Dream Machine: J.C.R. Licklider and the Revolution That Made Computing Personal
by M. Mitchell Waldrop
Published 14 Apr 2001

I first tried to find close relevance within established disciplines [such as artificial intelligence,] but in each case I found that the people I would talk with would immediately translate my admittedly strange (for the times) statements of purpose and possibility into their own discipline's framework."9 At the 1960 meeting of the American Documentation Institute, a talk he gave was greeted with yawns, and his proposed augmentation environment was dismissed as just another information-retrieval system. No, Engelbart realized, if his augmentation ideas were ever going to fly, he would have to create a new discipline from scratch. And to do that, he would have to give this new discipline a conceptual framework all its own, a manifesto that would lay out his thinking in the most compelling way possible.

To begin with, while he very much liked the idea of having a big influence on PARC's research, he considered Pake's notion of a "graphics research group" a complete nonstarter. Sure, graphics technology was a critical part of this whatever-it-was he wanted to create. But so were text display, mass-storage technology, networking technology, information retrieval, and all the rest. Taylor wanted to go after the whole, integrated vision, just as he'd gone after the whole Intergalactic Network. To focus entirely on graphics would be like trying to build the Arpanet by focusing entirely on the technology of telephone lines. And yet Pake did have a point, damn it.

pages: 470 words: 109,589

Apache Solr 3 Enterprise Search Server
by Unknown
Published 13 Jan 2012

A rich set of chainable text analysis components, such as tokenizers and language-specific stemmers that transform a text string into a series of terms (words). A query syntax with a parser and a variety of query types from a simple term lookup to exotic fuzzy matching. A good scoring algorithm based on sound Information Retrieval (IR) principles to produce the more likely candidates first, with flexible means to affect the scoring. Search enhancing features like: a highlighter feature to show query words found in context. A query spellchecker based on indexed content or a supplied dictionary. A "more like this" feature to list documents that are statistically similar to provided text.
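
A toy analogue, in Python rather than Solr's Java, of what a chainable analysis pipeline plus IR-style scoring does; the tokenizer, "stemmer," TF-IDF weighting, and two-document corpus here are simplified stand-ins, not Lucene's actual implementations.

```python
# Toy analysis chain (tokenize -> stem) feeding a TF-IDF scorer.
import math
import re

def analyze(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # crude tokenizer
    return [t.rstrip("s") for t in tokens]           # very crude "stemmer"

docs = ["Solr indexes documents", "Lucene scores document terms"]
index = [analyze(d) for d in docs]

def tfidf(term, doc_tokens):
    tf = doc_tokens.count(term)
    df = sum(1 for d in index if term in d)          # document frequency
    if tf == 0 or df == 0:
        return 0.0
    return tf * (math.log(len(index) / df) + 1.0)

query = analyze("documents")
for i, doc in enumerate(index):
    print(i, sum(tfidf(t, doc) for t in query))      # both docs match "document"
```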

pages: 437 words: 113,173

Age of Discovery: Navigating the Risks and Rewards of Our New Renaissance
by Ian Goldin and Chris Kutarna
Published 23 May 2016

Goldin, Ian and Kenneth Reinert (2012). Globalization for Development. Oxford: Oxford University Press. 49. Vietnam Food Association (2014). “Yearly Export Statistics.” Retrieved from vietfood.org.vn/en/default.aspx?c=108. 50. Bangladesh Garment Manufacturers and Exporters Association (2015). “Trade Information.” Retrieved from bgmea.com.bd/home/pages/TradeInformation#.U57MMhZLGYU. 51. Burke, Jason (2013, November 14). “Bangladesh Garment Workers Set for 77% Pay Rise.” The Guardian. Retrieved from www.theguardian.com. 52. Goldin, Ian and Kenneth Reinert (2012). Globalization for Development. Oxford: Oxford University Press. 53.

pages: 396 words: 117,149

The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World
by Pedro Domingos
Published 21 Sep 2015

The use of Naïve Bayes in spam filtering is described in “Stopping spam,” by Joshua Goodman, David Heckerman, and Robert Rounthwaite (Scientific American, 2005). “Relevance weighting of search terms,”* by Stephen Robertson and Karen Sparck Jones (Journal of the American Society for Information Science, 1976), explains the use of Naïve Bayes–like methods in information retrieval. “First links in the Markov chain,” by Brian Hayes (American Scientist, 2013), recounts Markov’s invention of the eponymous chains. “Large language models in machine translation,”* by Thorsten Brants et al. (Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2007), explains how Google Translate works.
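
A minimal sketch of Naive Bayes spam filtering as the note describes it, with Laplace smoothing; the four training messages are invented, and a real filter would use far larger data and feature sets.

```python
# Toy Naive Bayes spam filter with add-one (Laplace) smoothing.
import math
from collections import Counter

spam = ["cheap pills now", "win money now"]
ham = ["meeting agenda attached", "lunch tomorrow"]

def word_counts(msgs):
    return Counter(w for m in msgs for w in m.split())

cs, ch = word_counts(spam), word_counts(ham)
vocab = set(cs) | set(ch)
total_msgs = len(spam) + len(ham)

def log_posterior(msg, counts, n_class):
    # log prior + sum of smoothed log likelihoods (unnormalized posterior)
    lp = math.log(n_class / total_msgs)
    total = sum(counts.values())
    for w in msg.split():
        lp += math.log((counts[w] + 1) / (total + len(vocab)))
    return lp

msg = "win pills now"
label = ("spam" if log_posterior(msg, cs, len(spam)) >
         log_posterior(msg, ch, len(ham)) else "ham")
print(label)  # -> spam
```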

pages: 397 words: 110,130

Smarter Than You Think: How Technology Is Changing Our Minds for the Better
by Clive Thompson
Published 11 Sep 2013

the Wikipedia page on “Drone attacks in Pakistan”: “Drone attacks in Pakistan,” Wikipedia, accessed March 24, 2013, en.wikipedia.org/wiki/Drone_attacks_in_Pakistan. 40 percent of all queries are acts of remembering: Jaime Teevan, Eytan Adar, Rosie Jones, and Michael A. S. Potts, “Information Re-Retrieval: Repeat Queries in Yahoo’s Logs,” in SIGIR ’07: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (2007), 151–58. collaborative inhibition: Celia B. Harris, Paul G. Keil, John Sutton, and Amanda J. Barnier, “We Remember, We Forget: Collaborative Remembering in Older Couples,” Discourse Processes 48, no. 4 (2011), 267–303. In his essay “Mathematical Creation”: Henri Poincaré, “Mathematical Creation,” in The Anatomy of Memory: An Anthology (New York: Oxford University Press, 1996), 126–35.

pages: 365 words: 117,713

The Selfish Gene
by Richard Dawkins
Published 1 Jan 1976

By this device, the timing of muscle contractions could be influenced not only by events in the immediate past, but by events in the distant past as well. The memory, or store, is an essential part of a digital computer too. Computer memories are more reliable than human ones, but they are less capacious, and enormously less sophisticated in their techniques of information-retrieval. One of the most striking properties of survival-machine behaviour is its apparent purposiveness. By this I do not just mean that it seems to be well calculated to help the animal's genes to survive, although of course it is. I am talking about a closer analogy to human purposeful behaviour.

pages: 523 words: 112,185

Doing Data Science: Straight Talk From the Frontline
by Cathy O'Neil and Rachel Schutt
Published 8 Oct 2013

Another evaluation metric you could use is precision, defined in Chapter 5. Some of the same formulas go by different names because different academic disciplines developed these ideas separately; precision and recall, for instance, are the quantities used in the field of information retrieval. Note that precision is not the same thing as specificity. Finally, we have accuracy, which is the ratio of the number of correct labels to the total number of labels, and the misclassification rate, which is just 1 − accuracy. Minimizing the misclassification rate then just amounts to maximizing accuracy.
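A small sketch of these definitions, with specificity included to make the contrast with precision visible; the confusion-matrix counts are invented:

```python
# Precision, recall, accuracy, and misclassification rate from raw
# confusion-matrix counts (invented here). Specificity is included to
# show it is a different quantity from precision.
def metrics(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    return {
        "precision": tp / (tp + fp),        # of predicted positives, fraction correct
        "recall": tp / (tp + fn),           # of actual positives, fraction found
        "specificity": tn / (tn + fp),      # of actual negatives, fraction found
        "accuracy": accuracy,
        "misclassification": 1 - accuracy,  # just 1 - accuracy
    }

print(metrics(tp=40, fp=10, fn=20, tn=30))
# precision 0.80, recall ~0.67, specificity 0.75, accuracy 0.70
```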

pages: 1,172 words: 114,305

New Laws of Robotics: Defending Human Expertise in the Age of AI
by Frank Pasquale
Published 14 May 2020

The sociologist Harold Wilensky once observed that “many occupations engage in heroic struggles for professional identification; few make the grade.”43 But if we are to maintain a democratic society rather than give ourselves over to the rise of the robots—or to those who bid them to rise—then we must spread the status and autonomy now enjoyed by professionals in fields like law and medicine to information retrieval, dispute resolution, elder care, marketing, planning, designing, and many other fields. Imagine a labor movement built on solidarity between workers who specialize in non-routine tasks. If they succeed in uniting, they might project a vision of labor far more concrete and realistic than the feudal futurism of techno-utopians.

Succeeding With AI: How to Make AI Work for Your Business
by Veljko Krunic
Published 29 Mar 2020

Although you can use the F-score to measure other characteristics of this system, it's certainly not a good technical metric for a profit curve in which the business metric is cost savings. If this is the case, why do people use F-score at all? Because the F-score makes sense in many areas of information retrieval; it just doesn't fit our particular business case. The F-score is often used in the context of NLP [124], so if you're debating which technical metrics to use, it's a reasonable starting point. The broader teaching point is that just because a certain technical metric is widely used doesn't automatically make it a useful metric for your profit curve.
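For reference, the F-score is the weighted harmonic mean of precision and recall. The sketch below uses hypothetical numbers to show why it can diverge from a profit curve: two systems with identical F1 can have opposite error profiles, and if one error type costs far more than the other, their business value differs sharply.

```python
# F-beta score as the weighted harmonic mean of precision and recall.
# Numbers are hypothetical, chosen so both systems share the same F1.
def f_score(precision, recall, beta=1.0):
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_score(0.9, 0.6))  # system A: 0.72
print(f_score(0.6, 0.9))  # system B: 0.72, yet a very different error profile
```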

pages: 436 words: 124,373

Galactic North
by Alastair Reynolds
Published 14 Feb 2006

"Veda would have figured it out." "We'll never know now, will we?" "What does it matter?" she said. "Gonna kill them anyway, aren't you?" Seven flashed an arc of teeth filed to points and waved a hand towards the female pirate. "Allow me to introduce Mirsky, our loose-tongued but efficient information retrieval specialist. She's going to take you on a little trip down memory lane; see if we can't remember those access codes." "What codes?" "It'll come back to you," Seven said. They were taken through the tunnels, past half-assembled mining machines, onto the surface and then into the pirate ship.

pages: 597 words: 119,204

Website Optimization
by Andrew B. King
Published 15 Mar 2008

This is especially true in the more complex world of the Web where application calls are hidden within the content portion of the page and third parties are critical to the overall download time. You need to have a view into every piece of the page load in order to manage and improve it. * * * [167] Roast, C. 1998. "Designing for Delay in Interactive Information Retrieval." Interacting with Computers 10 (1): 87–104. [168] Balashov, K., and A. King. 2003. "Compressing the Web." In Speed Up Your Site: Web Site Optimization. Indianapolis: New Riders, 412. A test of 25 popular sites found that HTTP gzip compression saved 75% on average off text file sizes and 37% overall
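A measurement of this kind is easy to reproduce with Python's standard gzip module; the URL below is a placeholder and the savings will vary by page:

```python
# Reproducing the kind of measurement cited above: how much gzip saves
# on one page's HTML. The URL is a placeholder; results vary by page.
import gzip
import urllib.request

html = urllib.request.urlopen("https://example.com/").read()
packed = gzip.compress(html)
saved = 100 * (1 - len(packed) / len(html))
print(f"original {len(html)} B, gzipped {len(packed)} B, saved {saved:.0f}%")
```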

pages: 410 words: 119,823

Radical Technologies: The Design of Everyday Life
by Adam Greenfield
Published 29 May 2017

One scenario along these lines is that proposed by Simon Taylor, VP for Blockchain R&D at Barclays Bank, in a white paper on distributed-ledger applications prepared for the British government.19 Taylor imagines all of our personal information stored on a common blockchain, duly encrypted. Any legitimate actor, public or private—the HR department, the post office, your bank, the police—could query the same unimpeachable source of information, retrieve from it only what they were permitted to, and leave behind a trace of their access. Each of us would have read/write access to our own record; should we find erroneous information, we would have to make but one correction, and it would then propagate across the files of every institution with access to the ledger.

pages: 570 words: 115,722

The Tangled Web: A Guide to Securing Modern Web Applications
by Michal Zalewski
Published 26 Nov 2011

The subsequent proposals experimented with an increasingly bizarre set of methods to permit interactions other than retrieving a document or running a script, including such curiosities as SHOWMETHOD, CHECKOUT, or—why not—SPACEJUMP.[122] Most of these thought experiments have been abandoned in HTTP/1.1, which settles on a more manageable set of eight methods. Only the first two request types—GET and POST—are of any significance to most of the modern Web. GET The GET method is meant to signify information retrieval. In practice, it is used for almost all client-server interactions in the course of a normal browsing session. Regular GET requests carry no browser-supplied payloads, although they are not strictly prohibited from doing so. The expectation is that GET requests should not have, to quote the RFC, “significance of taking an action other than retrieval” (that is, they should make no persistent changes to the state of the application).
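A minimal illustration of a payload-free GET using Python's standard http.client; the host is a placeholder:

```python
# A GET in the raw: pure retrieval, no browser-supplied payload, matching
# the RFC expectation that GETs make no persistent state changes.
import http.client

conn = http.client.HTTPConnection("example.com")
conn.request("GET", "/")            # no request body attached
resp = conn.getresponse()
print(resp.status, resp.reason)     # e.g., 200 OK
print(resp.read()[:60])             # first bytes of the retrieved document
conn.close()
```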

pages: 320 words: 87,853

The Black Box Society: The Secret Algorithms That Control Money and Information
by Frank Pasquale
Published 17 Nov 2014

The question now is whether its dictatorship will be benign. Does Google intend Book Search to promote widespread public access, or is it envisioning finely tiered access to content, granted (and withheld) in opaque ways?168 Will Google grant open access to search results on its platform, so experts in library science and information retrieval can understand (and critique) its orderings of results?169 Finally, where will the profits go from this immense cooperative project? Will they be distributed fairly among contributors, or will this be another instance in which the aggregator of content captures an unfair share of revenues from well-established dynamics of content digitization?

pages: 455 words: 138,716

The Divide: American Injustice in the Age of the Wealth Gap
by Matt Taibbi
Published 8 Apr 2014

“Just think what I could do with your emails,” he hissed, adding that he, Spyro, was going to “consider all my options as maintaining our confidentiality,” and that if the executive didn’t cooperate, he could “no longer rely on my discretion.” Contogouris seemed to be playing a triple game. First, he was genuinely trying to deliver an informant to the FBI and set himself up as an FBI informant. Second, he was trying to deliver confidential information to the hedge funds, to whom he had set himself up as an expert at information retrieval. And third, he was playing secret source to “reputable” journalists, to whom he had promised to deliver stunning exposés. Contogouris even referenced one of those contacts in his adolescent coded emails to Sender sent from London that day: CONTOGOURIS: We have been rapping here about the postman.

pages: 303 words: 67,891

Advances in Artificial General Intelligence: Concepts, Architectures and Algorithms: Proceedings of the Agi Workshop 2006
by Ben Goertzel and Pei Wang
Published 1 Jan 2007

NARS can be connected to existing knowledge bases, such as Cyc (for commonsense knowledge), WordNet (for linguistic knowledge), Mizar (for mathematical knowledge), and so on. For each of them, a special interface module should be able to approximately translate knowledge from its original format into Narsese. • The Internet. It is possible for NARS to be equipped with additional modules, which use techniques like semantic web, information retrieval, and data mining, to directly acquire certain knowledge from the Internet, and put them into Narsese. • Natural language interface. After NARS has learned a natural language (as discussed previously), it should be able to accept knowledge from various sources in that language. Additionally, interactive tutoring will be necessary, which allows a human trainer to monitor the establishing of the knowledge base, to answer questions, to guide the system to form a proper goal structure and priority distributions among its concepts, tasks, and beliefs.

pages: 427 words: 134,098

Wonder Boy: Tony Hsieh, Zappos, and the Myth of Happiness in Silicon Valley
by Angel Au-Yeung and David Jeans
Published 25 Apr 2023

Others turned their heads to steal a glance at him: Interviews with multiple anonymous sources. Chapter 22: The Gordian Knot “Pop quiz,” Tony said, turning to him: Interview with anonymous source. Expedia: Expedia listing, https://www.expedia.com/Park-City-Hotels-Massive-10-Bed-10-Bath-Luxury-Mountain-Home-In-Old-Town.h54112819.Hotel-Information, retrieved August 3, 2022. Vrbo: Vrbo listing, https://www.vrbo.com/1997020, retrieved August 3, 2022. Steve-O … steered his bus into Park City: Interviews with anonymous sources. for a fee: Interviews with anonymous sources. a virtual reality company: Interviews with anonymous sources. Tyler found Tony: Interview with anonymous source.

pages: 696 words: 143,736

The Age of Spiritual Machines: When Computers Exceed Human Intelligence
by Ray Kurzweil
Published 31 Dec 1998

Cybernetics A term coined by Norbert Wiener to describe the “science of control and communication in animals and machines.” Cybernetics is based on the theory that intelligent living beings adapt to their environments and accomplish objectives primarily by reacting to feedback from their surroundings. Database The structured collection of data that is designed in connection with an information retrieval system. A database management system (DBMS) allows monitoring, updating, and interacting with the database. Debugging The process of discovering and correcting errors in computer hardware and software. The issue of bugs or errors in a program will become increasingly important as computers are integrated into the human brain and physiology throughout the twenty-first century.

pages: 550 words: 154,725

The Idea Factory: Bell Labs and the Great Age of American Innovation
by Jon Gertner
Published 15 Mar 2012

A visitor could also try something called a portable “pager,” a big, blocky device that could alert doctors and other busy professionals when they received urgent calls.2 New York’s fair would dwarf Seattle’s. The crowds were expected to be immense—probably somewhere around 50 or 60 million people in total. Pierce and David’s 1961 memo recommended a number of exhibits: “personal hand-carried telephones,” “business letters in machine-readable form, transmitted by wire,” “information retrieval from a distant computer-automated library,” and “satellite and space communications.” By the time the fair opened in April 1964, though, the Bell System exhibits, housed in a huge white cantilevered building nicknamed the “floating wing,” described a more conservative future than the one Pierce and David had envisioned.

pages: 492 words: 153,565

Countdown to Zero Day: Stuxnet and the Launch of the World's First Digital Weapon
by Kim Zetter
Published 11 Nov 2014

See “Software Problem Led to System Failure at Dhahran, Saudi Arabia,” US Government Accountability Office, February 4, 1992, available at gao.gov/products/IMTEC-92-26. 22 Bryan, “Lessons from Our Cyber Past.” 23 “The Information Operations Roadmap,” dated October 30, 2003, is a seventy-four-page report that was declassified in 2006, though the pages dealing with computer network attacks are heavily redacted. The document is available at http://information-retrieval.info/docs/DoD-IO.html. 24 Arquilla Frontline “CyberWar!” interview. A Washington Post story indicates that attacks on computers controlling air-defense systems in Kosovo were launched from electronic-jamming aircraft rather than over computer networks from ground-based keyboards. Bradley Graham, “Military Grappling with Rules for Cyber,” Washington Post, November 8, 1999. 25 James Risen, “Crisis in the Balkans: Subversion; Covert Plan Said to Take Aim at Milosevic’s Hold on Power,” New York Times, June 18, 1999.

pages: 527 words: 147,690

Terms of Service: Social Media and the Price of Constant Connection
by Jacob Silverman
Published 17 Mar 2015

As storage costs decrease and analytical powers grow, it’s not unreasonable to think that this capability will be extended to other targets, including, should the political environment allow it, the United States. Some of the NSA’s surveillance capacity derives from deals made with Internet firms—procedures for automating court-authorized information retrieval, direct access to central servers, and even (as in the case of Verizon) fiber optic cables piped from military bases into major Internet hubs. In the United States, the NSA uses the FBI to conduct surveillance authorized under the Patriot Act and to issue National Security Letters (NSLs)—subpoenas requiring recipients to turn over any information deemed relevant to an ongoing investigation.

pages: 467 words: 149,632

If Then: How Simulmatics Corporation Invented the Future
by Jill Lepore
Published 14 Sep 2020

To help his reader picture what he pictured, he conjured a scene set in 2000 in which a person sits at a computer console and attempts to get to the bottom of a research question merely by undertaking a series of searches. Nearly all of what Licklider described in Libraries of the Future later came to pass: the digitization of printed material, the networking of library catalogs and their contents, the development of sophisticated, natural language–based information-retrieval and search mechanisms.23 Licklider described, with a contagious amazement, what would become, in the twenty-first century, the Internet at its very best. In 1962, Licklider left Bolt Beranek and Newman for ARPA, where his many duties included funding behavioral science projects, including Pool’s Project ComCom.

pages: 542 words: 161,731

Alone Together
by Sherry Turkle
Published 11 Jan 2011

From 1996 on, Thad Starner, who like Steve Mann was a member of the MIT cyborg group, worked on the Remembrance Agent, a tool that would sit on your computer desktop (or now, your mobile device) and not only record what you were doing but make suggestions about what you might be interested in looking at next. See Bradley J. Rhodes and Thad Starner, “Remembrance Agent: A Continuously Running Personal Information Retrieval System,” Proceedings of the First International Conference on the Practical Application of Intelligent Agents and Multi Agent Technology (PAAM ’96), 487–495, www.bradleyrhodes.com/Papers/remembrance.html (accessed December 14, 2009). Albert Frigo’s “Storing, Indexing and Retrieving My Autobiography,” presented at the 2004 Workshop on Memory and the Sharing of Experience in Vienna, Austria, describes a device to take pictures of what comes into his hand.

pages: 1,331 words: 163,200

Hands-On Machine Learning With Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
by Aurélien Géron
Published 13 Mar 2017

Build a classification deep neural network, reusing the lower layers of the autoencoder. Train it using only 10% of the training set. Can you get it to perform as well as the same classifier trained on the full training set? Semantic hashing, introduced in 2008 by Ruslan Salakhutdinov and Geoffrey Hinton,13 is a technique used for efficient information retrieval: a document (e.g., an image) is passed through a system, typically a neural network, which outputs a fairly low-dimensional binary vector (e.g., 30 bits). Two similar documents are likely to have identical or very similar hashes. By indexing each document using its hash, it is possible to retrieve many documents similar to a particular document almost instantly, even if there are billions of documents: just compute the hash of the document and look up all documents with that same hash (or hashes differing by just one or two bits).
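A sketch of the retrieval mechanics, substituting simple random-hyperplane hashing for the learned autoencoder codes of Salakhutdinov and Hinton; the document vectors here are random stand-ins for real embeddings:

```python
# Hash-based similarity lookup in the spirit described above, but with
# random-hyperplane LSH standing in for a trained autoencoder: similar
# vectors get similar 30-bit codes, and retrieval is a hash-table probe.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
BITS, DIM = 30, 128
planes = rng.standard_normal((BITS, DIM))   # one hyperplane per code bit

def code(vec):
    """Map a document vector to a 30-bit integer code."""
    bits = (planes @ vec) > 0
    return int("".join("1" if b else "0" for b in bits), 2)

# Index 1,000 random "documents" by their code.
docs = rng.standard_normal((1000, DIM))
index = defaultdict(list)
for i, d in enumerate(docs):
    index[code(d)].append(i)

def lookup(vec, radius=1):
    """Return docs whose code matches exactly or differs by <= radius bits."""
    c = code(vec)
    hits = list(index.get(c, []))
    if radius >= 1:
        for b in range(BITS):
            hits.extend(index.get(c ^ (1 << b), []))
    return hits

query = docs[42] + 0.05 * rng.standard_normal(DIM)  # near-duplicate of doc 42
print(42 in lookup(query))  # very likely True
```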

pages: 574 words: 164,509

Superintelligence: Paths, Dangers, Strategies
by Nick Bostrom
Published 3 Jun 2014

Software polices the world’s email traffic, and despite continual adaptation by spammers to circumvent the countermeasures being brought against them, Bayesian spam filters have largely managed to hold the spam tide at bay. Software using AI components is responsible for automatically approving or declining credit card transactions, and continuously monitors account activity for signs of fraudulent use. Information retrieval systems also make extensive use of machine learning. The Google search engine is, arguably, the greatest AI system that has yet been built. Now, it must be stressed that the demarcation between artificial intelligence and software in general is not sharp. Some of the applications listed above might be viewed more as generic software applications rather than AI in particular—though this brings us back to McCarthy’s dictum that when something works it is no longer called AI.

Smart Grid Standards
by Takuro Sato
Published 17 Nov 2015

Unlike the C12.18 and C12.21 protocols, which support only session-oriented communications, sessionless communication has the advantage of requiring less complex handling on both sides of the communication link and reduces signaling overhead. ANSI C12.22 has a common application layer (layer 7 in the OSI, Open System Interconnection, reference model), which provides a minimal set of services and data structures required to support C12.22 nodes for the purposes of configuration, programming, and information retrieval in a networked environment. The application layer is independent of the underlying network technologies. This enables interoperability between C12.22 and already existing communication systems. C12.22 also defines a number of application layer services, which are combined to realize the various functions of the C12.22 protocol.

pages: 567 words: 171,072

The Greatest Capitalist Who Ever Lived: Tom Watson Jr. And the Epic Story of How IBM Created the Digital Age
by Ralph Watson McElvenny and Marc Wortman
Published 14 Oct 2023

Black ignores the most authoritative study of the Dehomag Hollerith census technology and its use by the Nazis available at the time he wrote his book. The paper, “Locating the Victims: The Nonrole of Punched Card Technology and Census Work,” by Friedrich W. Kistermann, appeared in the peer-reviewed academic journal IEEE Annals of the History of Computing, in 1997. A career IBM Germany patent, information retrieval, and database management specialist, in retirement Kistermann pursued historical studies of data processing and restored punched-card machines like the ones used for the German censuses. After reviewing the capabilities of the Hollerith machines produced and sold by Dehomag and their census use in detail, he concludes, “Nazi organizations and bureaucratic administrations instituted and used every means and procedure to identify, locate, isolate, deprive, exclude, and deport the Jews.

In the Age of the Smart Machine
by Shoshana Zuboff
Published 14 Apr 1988

The designers complained that as systems use became more central to task performance, managers and operators would need a more analytic understanding of their work in order to determine their information requirements. They would also need a deeper level of insight into the systems themselves (procedural reasoning) that would allow them to go beyond simple information retrieval to actually becoming familiar with data and generating new insights. People don't know enough about what goes into making up their job. Time hasn't been spent with them to tell them why. They've just been told, "Here's the system and here's how to use it." But they have to learn more about their job and more about the systems if they are going to figure out not only how to get data but what data they need.

pages: 611 words: 188,732

Valley of Genius: The Uncensored History of Silicon Valley (As Told by the Hackers, Founders, and Freaks Who Made It Boom)
by Adam Fisher
Published 9 Jul 2018

He showed off a way to edit text, a version of e-mail, even a primitive Skype. To modern eyes, Engelbart’s computer system looks pretty familiar, but to an audience used to punch cards and printouts it was a revelation. The computer could be more than a number cruncher; it could be a communications and information-retrieval tool. In one ninety-minute demo Engelbart shattered the military-industrial computing paradigm, and gave the hippies and freethinkers and radicals who were already gathering in Silicon Valley a vision of the future that would drive the culture of technology for the next several decades. Bob Taylor: There was about a thousand or more people in the audience and they were blown away.

pages: 685 words: 203,949

The Organized Mind: Thinking Straight in the Age of Information Overload
by Daniel J. Levitin
Published 18 Aug 2014

All bits are created equal After writing this, I discovered the same phrase “all bits are created equal” in Gleick, J. (2011). The information: A history, a theory, a flood. New York, NY: Vintage. Information has thus become separated from meaning Gleick writes “information is divorced from meaning.” He cites the technology philosopher Lewis Mumford from 1970: “Unfortunately, ‘information retrieving,’ however swift, is no substitute for discovering by direct personal inspection knowledge whose very existence one had possibly never been aware of, and following it at one’s own pace through the further ramification of relevant literature.” Gleick, J. (2011). The information: A history, a theory, a flood.

Mining of Massive Datasets
by Jure Leskovec , Anand Rajaraman and Jeffrey David Ullman
Published 13 Nov 2014

Widom, Database Systems: The Complete Book, Second Edition, Prentice-Hall, Upper Saddle River, NJ, 2009. [4] D.E. Knuth, The Art of Computer Programming, Vol. 3 (Sorting and Searching), Second Edition, Addison-Wesley, Upper Saddle River, NJ, 1998. [5] C.P. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. [6] R.K. Merton, “The Matthew effect in science,” Science 159:3810, pp. 56–63, Jan. 5, 1968. [7] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, Addison-Wesley, Upper Saddle River, NJ, 2005. 1 This startup attempted to use machine learning to mine large-scale data, and hired many of the top machine-learning people to do so.

The Code: Silicon Valley and the Remaking of America
by Margaret O'Mara
Published 8 Jul 2019

The firm had defense industry roots: founded by Martin Marietta president George Bunker and TRW vice president Simon Ramo, the firm was dedicated to what the two founders termed “a national need in the application of electronics to information handling.” An early client was NASA, for which Bunker Ramo built one of the world’s first computerized information retrieval systems, using the networked computer to classify and categorize large data sets a la Vannevar Bush’s memex.16 At first, the system Bunker Ramo designed for the dealers was simply another digital database that put paper stock tables on line. But when the firm added a feature that allowed brokers to buy and sell over the network, AT&T again cried foul.

pages: 933 words: 205,691

Hadoop: The Definitive Guide
by Tom White
Published 29 May 2009

Here are the contents of MaxTemperatureWithCounters_Temperature.properties:

CounterGroupName=Air Temperature Records
MISSING.name=Missing
MALFORMED.name=Malformed

Hadoop uses the standard Java localization mechanisms to load the correct properties for the locale you are running in, so, for example, you can create a Chinese version of the properties in a file named MaxTemperatureWithCounters_Temperature_zh_CN.properties, and they will be used when running in the zh_CN locale. Refer to the documentation for java.util.PropertyResourceBundle for more information.

Retrieving counters

In addition to being available via the web UI and the command line (using hadoop job -counter), you can retrieve counter values using the Java API. You can do this while the job is running, although it is more usual to get counters at the end of a job run, when they are stable.

Red Rabbit
by Tom Clancy and Scott Brick
Published 2 Jan 2002

Not that this would matter all that much to the corpse in question. "Wet" operations interfered with the main mission, which was gathering information. That was something people occasionally forgot, but something that CIA and KGB mainly understood, which was why both agencies had gotten away from it. But when the information retrieved frightened or otherwise upset the politicians who oversaw the intelligence services, then the spook shops were ordered to do things that they usually preferred to avoid—and so, then, they took their action through surrogates and/or mercenaries, mainly… "Arthur, if KGB wants to hurt the Pope, how do you suppose they'd go about it?"

pages: 843 words: 223,858

The Rise of the Network Society
by Manuel Castells
Published 31 Aug 1996

From these he accepts only a few dozen each instant, from which to make an image.19 Because of the low definition of TV, McLuhan argued, viewers have to fill in the gaps in the image, thus becoming more emotionally involved in the viewing (what he, paradoxically, characterized as a “cool medium”). Such involvement does not contradict the hypothesis of the least effort because TV appeals to the associative/lyrical mind, not involving the psychological effort of information retrieving and analyzing to which Herbert Simon’s theory refers. This is why Neil Postman, a leading media scholar, considers that television represents an historical rupture with the typographic mind. While print favors systematic exposition, TV is best suited to casual conversation. To make the distinction sharply, in his own words: “Typography has the strongest possible bias towards exposition: a sophisticated ability to think conceptually, deductively and sequentially; a high valuation of reason and order; an abhorrence of contradiction; a large capacity for detachment and objectivity; and a tolerance for delayed response.”20 While for television, “entertainment is the supra-ideology of all discourse on television.

Seeking SRE: Conversations About Running Production Systems at Scale
by David N. Blank-Edelman
Published 16 Sep 2018

One of their first questions would be something like, “Who are our customers? And why is getting the response in 10 seconds important for them?” Despite the fact that these questions came primarily from the business perspective, the information questions like these reveal can change the game dramatically. What if this service is for an “information retrieval” development team whose purpose is to address the necessity of content validation on the search engine results page, to make sure that the new index serves only live links? And what if we download a page with a million links on it? Now we can see the conflict between the priorities in the SLA and those of the service’s purposes.

pages: 864 words: 222,565

Inventor of the Future: The Visionary Life of Buckminster Fuller
by Alec Nevala-Lee
Published 1 Aug 2022

Unlike the student activists of the New Left, who emphasized politics and protest, its fans—described by one of Brand’s colleagues as “baling wire hippies”—were drawn to technology, and they would advance far beyond Fuller’s sense of what computers could be. On December 9, 1968, Brand assisted with a talk at the Joint Computer Conference in San Francisco by Douglas Engelbart, who treated computers as tools for communication and information retrieval, rather than for data processing alone. Along with advising on logistics, Brand operated a camera that provided a live feed from Menlo Park as Engelbart demonstrated windows, hypertext, and the mouse. At first, its impact was limited to a handful of researchers, but the presentation would be known one day as the Mother of All Demos

pages: 1,201 words: 233,519

Coders at Work
by Peter Seibel
Published 22 Jun 2009

If you don't feel really pretty comfortable swimming around in that world, maybe programming isn't what you should be doing. Seibel: Did you have any important mentors? Deutsch: There were two people. One of them is someone who's no longer around; his name was Calvin Mooers. He was an early pioneer in information systems. I believe he is credited with actually coining the term information retrieval. His background was originally in library science. I met him when I was, I think, high-school or college age. He had started to design a programming language that he thought would be usable directly by just people. But he didn't know anything about programming languages. And at that point, I did because I had built this Lisp system and I'd studied some other programming languages.

pages: 761 words: 231,902

The Singularity Is Near: When Humans Transcend Biology
by Ray Kurzweil
Published 14 Jul 2005

John Smith, director of the ABC Institute—you last saw him six months ago at the XYZ conference" or, "That's the Time-Life Building—your meeting is on the tenth floor." We'll have real-time translation of foreign languages, essentially subtitles on the world, and access to many forms of online information integrated into our daily activities. Virtual personalities that overlay the real world will help us with information retrieval and our chores and transactions. These virtual assistants won't always wait for questions and directives but will step forward if they see us struggling to find a piece of information. (As we wonder about "That actress ... who played the princess, or was it the queen ... in that movie with the robot," our virtual assistant may whisper in our ear or display in our visual field of view: "Natalie Portman as Queen Amidala in Star Wars, episodes 1, 2, and 3.")

pages: 496 words: 174,084

Masterminds of Programming: Conversations With the Creators of Major Programming Languages
by Federico Biancuzzi and Shane Warden
Published 21 Mar 2009

When I write a line of code, I need to rely on understanding what it’s going to do. Don: Well, there are applications where determinism is important and applications where it is not. Traditionally there has been a dividing line between what you might call databases and what you might call information retrieval. Certainly both of those are flourishing fields and they have their respective uses. XQuery and XML Will XML affect the way we use search engines in the future? Don: I think it’s possible. Search engines already exploit the kinds of metadata that are included in HTML tags such as hyperlinks.

pages: 857 words: 232,302

The Evolutionary Void
by Peter F. Hamilton
Published 18 Aug 2010

Whatever.” The Delivery Man was mildly puzzled by Gore’s lack of focus. It wasn’t like him at all. “All right. So what I was thinking is that there has to be some kind of web and database in the cities.” “There is. You can’t access it.” “Why not?” “The AIs are sentient. They won’t allow any information retrieval.” “That’s stupid.” “From our point of view, yes, but they’re the same as the borderguards: They maintain the homeworld’s sanctity; the AIs keep the Anomine’s information safe.” “Why?” “Because that’s what the Anomine do; that’s what they are. They’re entitled to protect what they’ve built, same as anyone.”

pages: 903 words: 235,753

The Stack: On Software and Sovereignty
by Benjamin H. Bratton
Published 19 Feb 2016

Different actors (e.g., telcos, states, standards bodies, hardware original equipment manufacturers, and cloud software platforms) all play different roles and control hardware and software applications in different ways and toward different ends. Internet backbone is generally provided and shared by tier 1 bandwidth providers (such as telcos), but one key trend is for very large platforms, such as Google, to bypass other actors and architect complete end-to-end networks, from browser, to fiber, to data center, such that information retrieval, composition, and analysis are consolidated and optimized on private loops. Consider that if Google's own networks, both internal and external, were compared to others, they would represent one of the largest Internet service providers in the world, and by the time this sentence is published, they may very well be the largest.

pages: 891 words: 253,901

The Devil's Chessboard: Allen Dulles, the CIA, and the Rise of America's Secret Government
by David Talbot
Published 5 Sep 2016

Army and the CIA. The top secret work conducted by the SO Division included research on LSD-induced mind control, assassination toxins, and biological warfare agents like those allegedly being used in Korea. Olson’s division also was involved in research that was euphemistically labeled “information retrieval”—extreme methods of extracting intelligence from uncooperative captives. For the past two years, Olson had been traveling to secret centers in Europe where Soviet prisoners and other human guinea pigs were subjected to these experimental interrogation methods. Dulles began spearheading this CIA research even before he became director of the agency, under a secret program that preceded MKULTRA code-named Operation Artichoke, after the spymaster’s favorite vegetable.

pages: 982 words: 221,145

Ajax: The Definitive Guide
by Anthony T. Holdener
Published 25 Jan 2008

You know nothing about this site until you dig further by following the links on the page. The point of a business site’s main page is to grab your attention with a central focus: We do web design. Our specialty is architectural engineering. We sell fluffy animals. Regardless of the focus, it should be readily apparent. * Chris Roast, “Designing for Delay in Interactive Information Retrieval,” Interacting with Computers 10 (1998): 87–104. “Need for Speed I,” Zona Research, Zona Market Bulletin (1999). “Need for Speed II,” Zona Research, Zona Market Bulletin (2001). Jonathan Klein, Youngme Moon, and Rosalind W. Picard, “This Computer Responds to User Frustration: Theory, Design, and Results,” Interacting with Computers 14 (2) (2002): 119–140.

Obscurity

This can cover two different problems you do not want for your application.

pages: 1,073 words: 314,528

Strategy: A History
by Lawrence Freedman
Published 31 Oct 2013

They operated quickly and automatically when needed, managing cognitive tasks of great complexity and evaluating situations and options before they reached consciousness. This referred to not one but a number of processes, perhaps with different evolutionary roots, ranging from simple forms of information retrieval to complex mental representations.43 They all involved the extraordinary computational and storage power of the brain, drawing on past learning and experiences, picking up on and interpreting cues and signals from the environment, suggesting appropriate and effective behavior, and enabling individuals to cope with the circumstances in which they might find themselves without having to deliberate on every move.

pages: 1,199 words: 332,563

Golden Holocaust: Origins of the Cigarette Catastrophe and the Case for Abolition
by Robert N. Proctor
Published 28 Feb 2012

The Ad Hoc Committee was also responsible for helping to locate medical witnesses and prepare testimony. Edwin Jacob from Jacob, Medinger & Finnegan supervised the Central File with financial support from all parties to the conspiracy. Responsibility for maintaining the Central File Information Center in 1971 was transferred to the CTR, which managed “informational retrieval” and maintenance through a CTR Special Project, organized as part of a new Information Systems division, by which means the CTR became a crucial resource for the industry’s effort to defend itself against litigation. See Kessler’s “Amended Final Opinion,” pp. 165–68. 46. “Congressional Preparation,” Jan. 26, 1968, Bates 955007434–7439; F.

pages: 2,054 words: 359,149

The Art of Software Security Assessment: Identifying and Preventing Software Vulnerabilities
by Justin Schuh
Published 20 Nov 2006

In fact, this type of error is even more relevant in RPC because many factors can cause impersonation functions to fail. Context Handles and State Before you go any further, you need to see how RPC keeps state information about connected clients. RPC is inherently stateless, but it does provide explicit mechanisms for maintaining state. This state information might include session information retrieved from a database or information on whether a client has called procedures in the correct sequence. The typical RPC mechanism for maintaining state is the context handle, a unique token a client can supply to a server that’s similar in function to a session ID stored in an HTTP cookie. From the server’s point of view, the context handle is a pointer to the associated data for that client, so no special translation of the context handle is necessary.
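The context-handle pattern the authors describe reduces to an opaque token that maps to server-side state, much like the HTTP session cookie they compare it to. A minimal sketch with hypothetical names (a real RPC runtime manages handle allocation and rundown for you):

```python
# The context-handle pattern in miniature: the client holds an opaque
# token; the server keeps the real state behind it, as with a session
# cookie. Names here are hypothetical; RPC runtimes manage real handles.
import uuid

_sessions = {}  # token -> per-client state

def open_context(client_name):
    token = uuid.uuid4().hex                 # opaque to the client
    _sessions[token] = {"client": client_name, "step": 0}
    return token

def next_step(token):
    state = _sessions.get(token)
    if state is None:
        raise ValueError("stale or forged context handle")
    state["step"] += 1                       # call sequencing lives server-side
    return state["step"]

h = open_context("alice")
print(next_step(h), next_step(h))  # 1 2
```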

The Art of Computer Programming
by Donald Ervin Knuth
Published 15 Jan 2001

I. Collision test. Chi-square tests can be made only when a nontrivial number of items are expected in each category. But another kind of test can be used when the number of categories is much larger than the number of observations; this test is related to "hashing," an important method for information retrieval that we shall study in Section 6.4. Suppose we have m urns and we throw n balls at random into those urns, where m is much greater than n. Most of the balls will land in urns that were previously empty, but if a ball falls into an urn that already contains at least one ball we say that a "collision" has occurred.
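The collision test is easy to simulate; the sketch below throws n balls into m urns with m much greater than n and compares the observed collision count with the expected value n - m(1 - (1 - 1/m)^n):

```python
# Simulating the collision test: throw n balls into m urns (m >> n),
# count landings in already-occupied urns, and compare with the
# expected value n - m*(1 - (1 - 1/m)**n).
import random

def collisions(m, n, seed=1):
    rng = random.Random(seed)
    occupied = set()
    c = 0
    for _ in range(n):
        urn = rng.randrange(m)
        if urn in occupied:
            c += 1                  # ball fell into a non-empty urn
        else:
            occupied.add(urn)
    return c

m, n = 2**20, 2**14
expected = n - m * (1 - (1 - 1 / m) ** n)
print(collisions(m, n), "observed;", round(expected, 1), "expected")
```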

pages: 889 words: 433,897

The Best of 2600: A Hacker Odyssey
by Emmanuel Goldstein
Published 28 Jul 2008

Immediately, you’ll get a list of everyone with that name, as well as their city and state, which often don’t fit properly on the line. There are no reports of any wildcards that allow you to see everybody at once. (The closest thing is *R, which will show all of the usernames that you’re sending to.) It’s also impossible for a user not to be seen if you get his name or alias right. It’s a good free information retrieval system. But there’s more. MCI Mail can also be used as a free word processor of sorts. The system will allow you to enter a letter, or for that matter, a manuscript. You can then hang up and do other things, come back within 24 hours, and your words will still be there. You can conceivably list them out using your own printer on a fresh sheet of paper and send it through the mail all by yourself, thus sparing MCI Mail’s laser printer the trouble.