full text search

back to index

description: techniques for searching a single computer-stored document or a collection in a full text database

43 results

PostgreSQL: Up and Running, 3rd Edition
by Unknown

You’ll appreciate this automatic type creation when you write functions that loop through tables. pgAdmin doesn’t make the automatic type creation obvious because it does not list them under the types node, but rest assured that they are there. Full text search Full text search (FTS) is a natural language based search. This kind of search has some “intelligence” built in. Unlike regular expression search, FTS can match based on the semantics of an expression, not just its syntactical makeup. For example, if you’re searching for the word running in long piece of text, you may end up with run, running, ran, runner, jog, sprint, dash, and so on. Three objects in PostgreSQL together support FTS: FTS configurations, FTS dictionaries, and FTS parsers. These objects exist to support the built-in Full Text Search engine packaged with PostgreSQL.

Parallelization is a work in progress with the eventual hope that all queries could take advantage of multiple processor cores. See “Parallelized Queries” Phrase full text search Use the distance operator <-> in full text search query to indicate how far two words can be apart from each other and still be considered a match. In prior versions you could indicate only which words should be searched; now you can control the sequence of the words. See “Full Text Search”. psql \gexec options These read an SQL statement from a query and execute it. See “Dynamic SQL Execution”. postgres_fdw Updates, inserts, and deletes are all much faster for simple cases.

Combining tsqueries SELECT plainto_tsquery('business analyst') || phraseto_tsquery( tsquery ------------------------------------------'busi' & 'analyst' | 'data' <-> 'scientist' SELECT plainto_tsquery('business analyst') && phraseto_tsquery( tsquery -------------------------------------------'busi' & 'analyst' & ('data' <-> 'scientist') tsqueries and tsvectors have additional operators for doing things like determining if one is a subset of another, and several functions. All this is detailed in PostgreSQL Manual: Text Search Functions and Operators. Using Full Text Search We have created a tsvector from our text; we have created a tsquery from our search terms. Now, we can perform a full text search. We do so by using the @@ operator. Example 5-51 demonstrates it: Example 5-51. Full text search in action SELECT left(title,50) As title, left(description,50) as description FROM film WHERE fts @@ to_tsquery('hunter & (scientist | chef)') AND title title | description -----------------------+-----------------------------------------------ALASKA PHANTOM | A Fanciful Saga of a Hunter And a Pastry Chef w CAUSE DATE | A Taut Tale of a Explorer And a Pastry Chef who CINCINATTI WHISPERER | A Brilliant Saga of a Pastry Chef And a Hunter COMMANDMENTS EXPRESS | A Fanciful Saga of a Student And a Mad Scientis DAUGHTER MADIGAN | A Beautiful Tale of a Hunter And a Mad Scientis GOLDFINGER SENSIBILITY | A Insightful Drama of a Mad Scientist And a Hun HATE HANDICAP | A Intrepid Reflection of a Mad Scientist And a INSIDER ARIZONA | A Astounding Saga of a Mad Scientist And a Hunt WORDS HUNTER | A Action-Packed Reflection of a Composer And a (9 rows) Example 5-51 finds all films with a title or description containing the word hunter and either the word scientist, or the word chef, or both.

pages: 1,237 words: 227,370

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
by Martin Kleppmann
Published 16 Mar 2017

(Even today, Hadoop MapReduce remains a good way of building indexes for Lucene/Solr [44].) We saw briefly in “Full-text search and fuzzy indexes” how a full-text search index such as Lucene works: it is a file (the term dictionary) in which you can efficiently look up a particular keyword and find the list of all the document IDs containing that keyword (the postings list). This is a very simplified view of a search index—in reality it requires various additional data, in order to rank search results by relevance, correct misspellings, resolve synonyms, and so on—but the principle holds. If you need to perform a full-text search over a fixed set of documents, then a batch process is a very effective way of building the indexes: the mappers partition the set of documents as needed, each reducer builds the index for its partition, and the index files are written to the distributed filesystem.

Composing Data Storage Technologies Over the course of this book we have discussed various features provided by databases and how they work, including: Secondary indexes, which allow you to efficiently search for records based on the value of a field (see “Other Indexing Structures”) Materialized views, which are a kind of precomputed cache of query results (see “Aggregation: Data Cubes and Materialized Views”) Replication logs, which keep copies of the data on other nodes up to date (see “Implementation of Replication Logs”) Full-text search indexes, which allow keyword search in text (see “Full-text search and fuzzy indexes”) and which are built into some relational databases [1] In Chapters 10 and 11, similar themes emerged. We talked about building full-text search indexes (see “The Output of Batch Workflows”), about materialized view maintenance (see “Maintaining materialized views”), and about replicating changes from a database to derived data systems (see “Change Data Capture”).

A 2D index could narrow down by timestamp and temperature simultaneously. This technique is used by HyperDex [36]. Full-text search and fuzzy indexes All the indexes discussed so far assume that you have exact data and allow you to query for exact values of a key, or a range of values of a key with a sort order. What they don’t allow you to do is search for similar keys, such as misspelled words. Such fuzzy querying requires different techniques. For example, full-text search engines commonly allow a search for one word to be expanded to include synonyms of the word, to ignore grammatical variations of words, and to search for occurrences of words near each other in the same document, and support various other features that depend on linguistic analysis of the text.

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
by Martin Kleppmann
Published 17 Apr 2017

(Even today, Hadoop MapReduce remains a good way of building indexes for Lucene/Solr [44].) We saw briefly in “Full-text search and fuzzy indexes” on page 88 how a full-text search index such as Lucene works: it is a file (the term dictionary) in which you can efficiently look up a particular keyword and find the list of all the document IDs con‐ taining that keyword (the postings list). This is a very simplified view of a search index—in reality it requires various additional data, in order to rank search results by relevance, correct misspellings, resolve synonyms, and so on—but the principle holds. If you need to perform a full-text search over a fixed set of documents, then a batch process is a very effective way of building the indexes: the mappers partition the set of documents as needed, each reducer builds the index for its partition, and the index files are written to the distributed filesystem.

Composing Data Storage Technologies Over the course of this book we have discussed various features provided by data‐ bases and how they work, including: • Secondary indexes, which allow you to efficiently search for records based on the value of a field (see “Other Indexing Structures” on page 85) Unbundling Databases | 499 • Materialized views, which are a kind of precomputed cache of query results (see “Aggregation: Data Cubes and Materialized Views” on page 101) • Replication logs, which keep copies of the data on other nodes up to date (see “Implementation of Replication Logs” on page 158) • Full-text search indexes, which allow keyword search in text (see “Full-text search and fuzzy indexes” on page 88) and which are built into some relational databases [1] In Chapters 10 and 11, similar themes emerged. We talked about building full-text search indexes (see “The Output of Batch Workflows” on page 411), about material‐ ized view maintenance (see “Maintaining materialized views” on page 467), and about replicating changes from a database to derived data systems (see “Change Data Capture” on page 454).

A 2D index could narrow down by timestamp and temperature simultaneously. This tech‐ nique is used by HyperDex [36]. Full-text search and fuzzy indexes All the indexes discussed so far assume that you have exact data and allow you to query for exact values of a key, or a range of values of a key with a sort order. What they don’t allow you to do is search for similar keys, such as misspelled words. Such fuzzy querying requires different techniques. For example, full-text search engines commonly allow a search for one word to be expanded to include synonyms of the word, to ignore grammatical variations of words, and to search for occurrences of words near each other in the same document, and support various other features that depend on linguistic analysis of the text.

Learning Flask Framework
by Matt Copperwaite and Charles Leifer
Published 26 Nov 2015

Create the file entries/tag_index.html and add the following code: {% extends "base.html" %} {% block title %}Tags{% endblock %} {% block content_title %}Tags{% endblock %} {% block content %} <ul> {% for tag in object_list.items %} <li><a href="{{ url_for('entries.tag_detail', slug=tag.slug) }}">{{ tag.name }}</a></li> [ 68 ] Chapter 3 {% endfor %} </ul> {% endblock %} If you like, you can add a link to the tag list in the base template's navigation. Full-text search In order to allow users to find posts containing certain words or phrases, we will add simple full-text search to the pages that contain lists of blog entries. To accomplish this, we will do some refactoring. We will be adding a search form to the sidebar of all pages containing lists of blog entries. While we could copy and paste the same code into both entries/index.html and entries/tag_detail.html, we will, instead, create another base template that contains the search widget.

Introducing SQLAlchemy Installing SQLAlchemy [i] www.allitebooks.com 21 22 23 24 Table of Contents Using SQLAlchemy in our Flask app Choosing a database engine Connecting to the database Creating the Entry model Creating the Entry table Working with the Entry model Making changes to an existing entry Deleting an entry Retrieving blog entries Filtering the list of entries Special lookups Combining expressions 24 25 25 26 29 30 32 32 32 33 34 35 Building a tagging system Adding and removing tags from entries Using backrefs Making changes to the schema Adding Flask-Migrate to our project Creating the initial migration Adding a status column Summary 37 41 42 43 43 44 45 46 Negation Operator precedence Chapter 3: Templates and Views Introducing Jinja2 Basic template operations Loops, control structures, and template programming Jinja2 built-in filters Creating a base template for the blog Creating a URL scheme Defining the URL routes Building the index view Building the detail view Listing entries matching a given tag Listing all the tags Full-text search Adding pagination links Enhancing the blog app Summary [ ii ] 36 37 47 48 49 51 55 57 60 62 63 66 67 68 69 71 73 73 Table of Contents Chapter 4: Forms and Validation Getting started with WTForms Defining a form for the Entry model A form with a view The create.html template Handling form submissions Validating input and displaying error messages Editing existing entries The edit.html template Deleting entries Cleaning up Using flash messages Displaying flash messages in the template Saving and modifying tags on posts Image uploads Processing file uploads The image upload template Serving static files Summary 75 75 76 77 78 80 82 85 86 89 90 91 93 94 96 97 99 100 101 Chapter 5: Authenticating Users 103 Chapter 6: Building an Administrative Dashboard 123 Creating a user model Installing Flask-Login Implementing the Flask-Login interface Creating user objects Login and logout views The login template Logging out Accessing the current user Restricting access to views Storing an entry's author Setting the author on blog entries Protecting the edit and delete views Displaying a user's drafts Sessions Summary Installing Flask-Admin Adding Flask-Admin to our app [ iii ] 104 105 107 108 110 112 113 114 114 115 117 117 119 120 121 123 125 Table of Contents Exposing models through the Admin Customizing the list views Adding search and filtering to the list view Customizing Admin model forms Enhancing the User form Generating slugs Managing static assets via the Admin Securing the admin website Creating an authentication and authorization mixin Setting up a custom index page Flask-Admin templates Reading more Summary 126 129 132 134 136 138 140 141 143 144 145 146 146 Chapter 7: AJAX and RESTful APIs 147 Chapter 8: Testing Flask Apps 167 Creating a comment model Creating a schema migration Installing Flask-Restless Setting up Flask-Restless Making API requests Creating comments using AJAX AJAX form submissions Validating data in the API Preprocessors and postprocessors Loading comments using AJAX Retrieving the list of comments Reading more Summary Unit testing Python's unit test module A simple math test Flask and unit testing Testing a page Testing an API Test-friendly configuration Mocking objects 147 149 149 150 151 154 156 159 160 161 163 166 166 167 168 169 171 173 175 176 177 [ iv ] Table of Contents Logging and error reporting Logging 179 180 Error reporting Read more Summary 182 182 182 Logging to file Custom log messages Levels 180 181 181 Chapter 9: Excellent Extensions SeaSurf and CSRF protection of forms Creating Atom feeds Syntax highlighting using Pygments Simple editing with Markdown Caching with Flask-Cache and Redis Creating secure, stable versions of your site by creating static content Commenting on a static site Synchronizing multiple editors Asynchronous tasks with Celery Creating command line instructions with Flask-script References Summary Chapter 10: Deploying Your Application Running Flask with a WSGI server Apache's httpd Serving static files 183 183 185 186 190 192 194 195 195 196 199 200 201 203 203 204 206 Nginx 207 Serving static files Gunicorn Securing your site with SSL Getting your certificate Apache httpd Nginx Gunicorn Automating deployment using Ansible Read more Summary Index [v] 209 210 210 211 212 214 215 216 219 219 221 Preface Welcome to Learning Flask, the book that will teach you the necessary skills to build web applications with Flask, a lightweight Python web framework.

In the following table, I've listed a brief description of the core skills paired with the corresponding features of the blog: Skill Blog site feature(s) Relational databases with SQLAlchemy Store entries and tags in a relational database. Perform a wide variety of queries, including pagination, date-ranges, full-text search, inner and outer joins, and more. Flask-SQLAlchemy Form processing and validation Flask-WTF Template rendering with Jinja2 Jinja2 User authentication and administrative dashboards Flask-Login Ajax and RESTful APIs Flask-API Unit testing unittest Everything else Create and edit blog entries using forms.

RDF Database Systems: Triples Storage and SPARQL Query Processing
by Olivier Cure and Guillaume Blin
Published 10 Dec 2014

The main drawback is that because the ids are guided by the hierarchies, a lot of the candidate integers cannot be attributed. 4.4 ALLOWING A FULL TEXT SEARCH IN LITERALS Regarding literals, compared to URIs, the corresponding encoding strategy does not have the same purpose. Indeed, while compression is the main objective in URI encoding, the main feature sought in RDF stores related to literal is a full text search.The most popular solution for handling a full text search in literals is Lucene, integrated in RDF stores such as Yars2, Jena TDB/SDB, and GraphDB (formerly OWLIM), and in Big Data RDF databases, but it’s also popular for other systems, such as IBM OmnifindY!

Nevertheless, OWLIMLite does not propose any query optimization nor advanced features like full-text search. OWLIM-SE provides a more advanced persistence layer that is based on binary data files and several indexes: POS and PSO. Additional data structures are supported but cannot be configured and are therefore not detailed by Ontotext. OWLIM-SE aims at data sets ranging around a billion triples even on a desktop machine. Just like in OWLIMLite, queries can be expressed in SPARQL or SeRQL (due to the use of SAIL), but OWLIM-SE provides some forms of query optimizations. Advanced features such as RDF rank, full-text search (integrated into SPARQL with a standalone or Lucene approach), and geospatial extension are provided.

Readers interested in Lucene should also have a look at Lucene’s sister package called Solr (see http://lucene.apache.org/solr/tutorial.html), which embeds Lucene in a client-server architecture. Solr is an open-source search server with XML/HTTP APIs, caching and replication functionalities, and a Web administration interface. Solr provides useful features such as a powerful full-text search, hit highlighting, and database integration. Solr is used in RDF stores such as Allegrograph and Jena TDB/SDB, but also in several commercial websites such as Netflix, CNET, and the Smithsonian. 101 102 RDF Database Systems 4.5 COMPRESSING LARGE AMOUNTS OF DATA Traditional centralized approaches have growing difficulties (both in terms of time and space efficiency) for handling very large input.

pages: 713 words: 93,944

Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement
by Eric Redmond , Jim Wilson and Jim R. Wilson
Published 7 May 2012

At first, the dusting wasn’t even enough to cover this morning’s earliest tracks, but the power of the storm took over, replenishing the landscape and delivering the perfect skiing experience with the diversity and quality that we craved. Just this past year, I woke up to the realization that the database world, too, is covered with a fresh blanket of snow. Sure, the relational databases are there, and you can get a surprisingly rich experience with open source RDBMS software. You can do clustering, full-text search, and even fuzzy searching. But you’re no longer limited to that approach. I have not built a fully relational solution in a year. Over that time, I’ve used a document-based database and a couple of key-value datastores. The truth is that relational databases no longer have a monopoly on flexibility or even scalability.

In other cases, we just plain don’t know how to spell “Benn Aflek.” We’ll look into a few PostgreSQL packages that make text searching easy. It’s worth noting that as we progress, this kind of string matching blurs the lines between relational queries and searching frameworks like Lucene.[8] Although some may feel features like full-text search belong with the application code, there can be performance and administrative benefits of pushing these packages to the database, where the data lives. SQL Standard String Matches PostgreSQL has many ways of performing text matches, but the two big default methods are LIKE and regular expressions.

CREATE INDEX movies_title_trigram ON movies​​ ​​USING gist (title gist_trgm_ops);​​ Now you can query with a few misspellings and still get decent results. ​​SELECT *​​ ​​FROM movies​​ ​​WHERE title % 'Avatre';​​ ​​ title​​ ​​---------​​ ​​ Avatar​​ Trigrams are an excellent choice for accepting user input, without weighing them down with wildcard complexity. Full-Text Fun Next, we want to allow users to perform full-text searches based on matching words, even if they’re pluralized. If a user wants to search for certain words in a movie title but can remember only some of them, Postgres supports simple natural-language processing. TSVector and TSQuery Let’s look for a movie that contains the words night and day.

pages: 32 words: 10,468

Getting Things Done for Hackers
by Lars Wirzenius
Published 15 Jun 2012

However, if you’ve never done this kind of thing before, be prepared to re-do it at least once. (“Be prepared to write a prototype, since you’ll make one anyway.”) For digital files, having a computer that can quickly do full text searches helps a lot. Indeed, you may be tempted to rely on search only, and if that works for you, great. However, there are files for which full text search won’t work, such as images, audio, and video. Thus, it is probably best to put your digital, archived files in folders named using the same system you use for your paper files. I recommend having a folder named “Archive” (or something similar in your local language), which is the location where all your archived files shall be.

Under “Archive”, you’ll create a folder for each topic: these are the folders that correspond to the physical manilla folders (or equivalent). Have only one level of these. $HOME/Archive/ Council tax 2011/ Debian DPL plans/ Orange GSM prepaid/ Talk: Debconf 2010/ Three GSM prepaid/ Having only a single level of archive folders makes it easier to look for them manually, when full-text search is not available or isn’t good enough. If you create folders within folders, searching manually becomes at least an order of magnitude harder. Create a folder under “Archive” even if you’re only putting a single file there. Later you might need to archive a second file together with the first one, and if you didn’t create the folder beforehand, you’ll have to move the first file.

pages: 205 words: 47,169

PostgreSQL: Up and Running
by Regina Obe and Leo Hsu
Published 5 Jul 2012

If PostgreSQL automaticaly creates an index for you or you don’t bother picking the type, B-tree will be chosen. It is currently the only index type allowed for primary key and unique indexes. GiST Generalized Search Tree (GiST) is an index type optimized for full text search, spatial data, astronomical data, and hierarchical data. You can’t use it to enforce uniqueness, however, you can use it in exclusion constraints. GIN Generalized Inverted Index (GIN) is an index type commonly used for the built-in full text search of PostgreSQL and the trigram extensions. GIN is a decendent of Gist, but it’s not lossy. GIN indexes are generally faster to search than GiST, but slower to update. You can see an example at Waiting for Faster LIKE/ILIKE.

In prior versions, to make views updatable you used DO INSTEAD rules, which only supported SQL for programming logic. Triggers can be written in most procedural languages—except SQL—and opens the door for more complex abstraction using views. KNN GiST adds improvement to popular extensions like full-text search, trigram (for fuzzy search and case insensitive search), and PostGIS. Database Drivers If you are using or plan to use PostgreSQL, chances are that you’re not going to use it in a vacuum. To have it interact with other applications, you’re going to need database drivers. PostgreSQL enjoys a generous number of freely available database drivers that can be used in many programming languages.

Old Extensions Absorbed into PostgreSQL Prior to PostgreSQL 8.3, the following extensions weren’t part of core: PL/PgSQL wasn’t always installed by default in every database. In old versions, you had to run CREATE LANGUAGE plpgsql; in your database. From around 8.3 on, it’s installed by default, but you retain the option of uninstalling it. tsearch is a suite for supporting full-text searches by adding indexes, operators, custom dictionaries, and functions. It became part of PostgreSQL core in 8.3. You don’t have the option to uninstall it. If you’re still relying on old behavior, you can install the tsearch2 extension, which retained old functions that are no longer available in the newer version.

pages: 481 words: 121,669

The Invisible Web: Uncovering Information Sources Search Engines Can't See
by Gary Price , Chris Sherman and Danny Sullivan
Published 2 Jan 2003

Major business decisions involving significant expense or potential litigation often hinge on the details of a patent search, so using a general-purpose search engine for this type of search is effectively out of the question. Many government patent offices maintain Web sites, but Delphion’s Intellectual Property Network (http://www.delphion.com/) allows full-text searching of U.S. and European patents and abstracts of Japanese patents simultaneously. Additionally, the United States Patent Office (http://www.uspto.gov) provides patent information dating back to 1790, as well as U.S. Trademark data. 6. Out of Print Books. The growth of the Web has proved to be a boon for bibliophiles.

The Scout Report http://scout.cs.wisc.edu/scout/report/current/ The Scout Report provides the closest thing to an “official” seal of approval for quality Web sites. Published weekly, it provides organized summaries of the most valuable and authoritative Web resources available. The Scout Report Signpost provides the full-text search of nearly 6,000 of these summaries. The Scout Report staff is made up of a group 110 The Invisible Web of librarians and information professionals, and their standards for inclusion in the report are quite high. Librarians’ Index to the Internet (LII) http://www.lii.org This searchable, annotated directory of Web resources, maintained by Carole Leita and a volunteer team of more than 70 reference librarians, is organized into categories including “best of,” “directories,” “databases,” and “specific resources.”

Wally chooses the full-text database http://www.uspto.gov/patft/index.html over a bibliographic database that provides only limited information for each patent. Clicking the full-text database link brings up further options. After scanning the page, Wally notices a direct link that allows for full-text searching by patent number (http://164.195.100.11/netahtml/ srchnum.htm). Wally quickly types in the number and in less than a second has a link to the full-text of patent number 3541541. Wally’s job is complete and his boss is very impressed. This is a case where a general-purpose search engine failed to find the desired end result, but was indispensable in helping Wally locate the “front door” of the Invisible Web database that ultimately provided what he was looking for.

pages: 1,085 words: 219,144

Solr in Action
by Trey Grainger and Timothy Potter
Published 14 Sep 2014

I viewed Solr more as an alternate type of datastore designed around an inverted index than as a full-text search engine, and that has helped Solr extend beyond the legacy enterprise search market. By the end of 2005, Solr was powering the search and faceted navigation of a number of CNET sites, and soon it was made open source. Solr was contributed to the Apache Software Foundation in January 2006 and became a subproject of the Lucene PMC (with Lucene Java as its sibling). There had always been a large degree of overlap with Lucene (the core full-text search library used by Solr) committers, and in 2010 the projects were merged.

I could not have done this without their insightful questions about Solr and their giving me the opportunity to build a large-scale search solution using Solr. About this Book Whether handling big data, building cloud-based services, or developing multitenant web applications, it’s vital to have a fast, reliable search solution. Apache Solr is a scalable and ready-to-deploy open source full-text search engine powered by Lucene. It offers key features like multilingual keyword searching, faceted search, intelligent matching, content clustering, and relevancy weighting right out of the box. Solr in Action is the definitive guide to implementing fast and scalable search using Apache Solr. It uses well-documented examples ranging from basic keyword searching to scaling a system for billions of documents and queries.

You’ll learn how to define fields to represent structured data like numbers, dates, prices, and unique identifiers. We also cover how update requests are processed and configured using solrconfig.xml. Chapter 6 builds on the material in chapter 5 by showing how to index text fields using text analysis. Solr was designed to efficiently search and rank documents requiring full-text search. Text analysis is an important part of the search process in that it removes the linguistic variations between indexed text and queries. At this point in the book, you’ll have a solid foundation and will be ready to put Solr to work on your own search needs. As your knowledge of search and Solr grows, so too will your need to go beyond basic keyword searching and implement common search features such as advanced query parsing, hit highlighting, spell-checking, autosuggest, faceting, and result grouping.

pages: 82 words: 17,229

Redis Cookbook
by Tiago Macedo and Fred Oliveira
Published 26 Jul 2011

Implementing an Inverted-Index Text Search with Redis Problem An inverted index is an index data structure that stores mappings of words (or other content) to their locations in a file, document, database, etc. This is generally used to implement full text search, but it requires previous indexing of the documents to be searched. In this recipe, we’ll use Redis as the storage backend for a full-text search implementation. Solution Our implementation will use one set per word, containing document IDs. In order to allow fast searches, we’ll index all the documents beforehand. Search itself is performed by splitting the query into words and intersecting the matching sets.

pages: 960 words: 140,978

Android Cookbook
by Ian F. Darwin
Published 9 Apr 2012

It relaunches the activity using the same intent that fired it: private void reload(){ startActivity(getIntent()); finish(); } 11.8. Performing Advanced Text Searches Claudio Esperanca Problem You want to implement an advanced “search” capability, and you need to know how to build a data layer to store and search text data using SQLite’s Full Text Search. Solution Using an SQLite Full Text Search 3 (FTS3) virtual table and match function from SQLite it’s possible to build such a mechanism. Discussion By following these steps, you will be able to create an example Android project with a data layer where you will be able to store and retrieve some data using an SQLite database.

The DbAdapter class package com.androidcookbook.example.advancedsearch; import java.util.LinkedList; import android.content.ContentValues; import android.content.Context; import android.database.Cursor; import android.database.SQLException; import android.database.sqlite.SQLiteDatabase; import android.database.sqlite.SQLiteOpenHelper; import android.util.Log; public class DbAdapter { public static final String APP_NAME = "AdvancedSearch"; private static final String DATABASE_NAME = "AdvancedSearch_db"; private static final int DATABASE_VERSION = 1; // Our internal database version (e.g. to control upgrades) private static final String TABLE_NAME = "example_tbl"; public static final String KEY_USERNAME = "username"; public static final String KEY_FULLNAME = "fullname"; public static final String KEY_EMAIL = "email"; public static long GENERIC_ERROR = -1; public static long GENERIC_NO_RESULTS = -2; public static long ROW_INSERT_FAILED = -3; private final Context context; private DbHelper dbHelper; private SQLiteDatabase sqlDatabase; public DbAdapter(Context context) { this.context = context; } private static class DbHelper extends SQLiteOpenHelper { private boolean databaseCreated=false; DbHelper(Context context) { super(context, DATABASE_NAME, null, DATABASE_VERSION); } @Override public void onCreate(SQLiteDatabase db) { Log.d(APP_NAME, "Creating the application database"); try{ // Create the full text search 3 virtual table db.execSQL( "CREATE VIRTUAL TABLE ["+TABLE_NAME+"] USING FTS3 (" + "["+KEY_USERNAME+"] TEXT," + "["+KEY_FULLNAME+"] TEXT," + "["+KEY_EMAIL+"] TEXT" + ");" ); this.databaseCreated = true; } catch (Exception e) { Log.e(APP_NAME, "An error occurred while creating the database: " + e.toString(), e); this.deleteDatabaseStructure(db); } } @Override public void onUpgrade(SQLiteDatabase db, int oldVersion, int newVersion) { Log.d(APP_NAME, "Updating the database from the version " + oldVersion + " to " + newVersion + "..."); this.deleteDatabaseStructure(db); // toy example: purge prev. data on upgrade this.onCreate(db); } public boolean databaseCreated(){ return this.databaseCreated; } private boolean deleteDatabaseStructure(SQLiteDatabase db){ try{ db.execSQL("DROP TABLE IF EXISTS ["+TABLE_NAME+"];"); return true; }catch (Exception e) { Log.e(APP_NAME, "An error occurred while deleting the database: " + e.toString(), e); } return false; } } /** * Open the database; if the database can't be opened, try to create it * * @return {@link Boolean} true if database opened/created OK, false otherwise * @throws {@link SQLException] if an error occurred */ public boolean open() throws SQLException { try{ this.dbHelper = new DbHelper(this.context); this.sqlDatabase = this.dbHelper.getWritableDatabase(); return this.sqlDatabase.isOpen(); }catch (SQLException e) { throw e; } } /** * Close the database connection * @return {@link Boolean} true if the connection was terminated, false otherwise */ public boolean close() { this.dbHelper.close(); return !

this.sqlDatabase.isOpen(); } /** * Check if the database is opened * * @return {@link Boolean} true if it was, false otherwise */ public boolean isOpen(){ return this.sqlDatabase.isOpen(); } /** * Check if the database was created * * @return {@link Boolean} true if it was, false otherwise */ public boolean databaseCreated(){ return this.dbHelper.databaseCreated(); } /** * Insert a new row on the table * * @param username {@link String} with the username * @param fullname {@link String} with the fullname * @param email {@link String} with the email * @return {@link Long} with the row id or ROW_INSERT_FAILED (bellow 0 value) on error */ public long insertRow(String username, String fullname, String email) { try{ // Prepare the values ContentValues values = new ContentValues(); values.put(KEY_USERNAME, username); values.put(KEY_FULLNAME, fullname); values.put(KEY_EMAIL, email); // Try to insert the row return this.sqlDatabase.insert(TABLE_NAME, null, values); }catch (Exception e) { Log.e(APP_NAME, "An error occurred while inserting the row: "+e.toString(), e); } return ROW_INSERT_FAILED; } /** * The search method Uses the full text search 3 virtual table and * the MATCH function from SQLite to search for data. * @see http://www.sqlite.org/fts3.html to know more about the syntax * @param search {@link String} with the search expression * @return {@link LinkedList} with the {@link String} search results */ public LinkedList<String> search(String search) { LinkedList<String> results = new LinkedList<String>(); Cursor cursor = null; try{ cursor = this.sqlDatabase.query(true, TABLE_NAME, new String[] { KEY_USERNAME, KEY_FULLNAME, KEY_EMAIL }, TABLE_NAME + " MATCH ?"

pages: 299 words: 88,375

Gray Day: My Undercover Mission to Expose America's First Cyber Spy
by Eric O'Neill
Published 1 Mar 2019

I swiveled my chair to the FBI NET computer and logged into ACS. I used the full-text search to enter a few names of targets I had ghosted into the system. ACS covered most of the information with x’s—but not everything. Hanssen’s words from a few days ago replayed in my mind. ACS is flawed and the FBI hasn’t a clue. I hadn’t listened then because his hands on my shoulders had distracted me, but now I made the connection. All the Russians have to do is recruit someone at the FBI with ACS access, feed him a name, and have the mole conduct full-text searches. Why would a spy reveal such an enormous flaw? I conducted a few additional searches and then sat back with my hands steepled over my mouth.

“Boss…” I scooted forward in my seat at the same time I pushed the chair back. Hanssen chuckled but took his hands off my shoulders. “I’ve been testing security in ACS,” Hanssen said, scooping up my ACS manual as he passed my desk. “I still have my old access from the State Department that gives me full clearance through a full-text search. The account they gave us here doesn’t have full access.” He dropped the manual into my trash can. “An oversight I am certain they will correct.” I glanced from my boss to the trash can and tried not to roll the tension out of my shoulders. Hanssen followed my eyes. “ACS is a joke.

“But if you type in a name and you get a 65 case file back, you know the file is related to the person—” “And the person is under investigation for espionage.” Hanssen slammed the pen into the tray beneath the whiteboard. “All the Russians have to do is recruit someone at the FBI with ACS access, feed him a name, and have the mole conduct full-text searches.” He jammed a finger against the 65A, smearing it across the whiteboard. “All that comes back is a 65A and a bucketload of x’s, but that is enough.” He picked up the eraser. “Our mole now knows that the person he searched is compromised.” How many times had Hanssen searched his own name?

pages: 982 words: 221,145

Ajax: The Definitive Guide
by Anthony T. Holdener
Published 25 Jan 2008

By including every piece of text as part of the search, the developer can ensure that the results will be as inclusive as possible. Of course, there is a downside, and in this case it is the speed of the search. The time it would take to perform a full text search, even on a small site, would be unacceptable to most developers. Example 16-2 illustrates how to handle this kind of search. Example 16-2. Performing a full text search using PHP <?php /** * Example 16-2. Performing a full text search using PHP. */ /* The starting place of the site */ $dir = '/root/of/site'; /* This will hold the results of the parse */ $results = array( ); /** * This function, parse_meta_elements, searches through the passed /p_dir/ and * searches all of the files (of an acceptable type) for the passed * /p_search_string/, recursively searching subdirectories and building a * results array to report its findings. * * @param {String} $p_dir The directory to search for files to parse in. * @param {String} $p_search_string The string to search for in the files. */ function parse_meta_elements($p_dir, $p_search_string) { /* Could we open up a handle to the passed directory?

The user is already on the site, and if he is searching there he is searching for something specific. This is what we will concentrate on in this chapter: how Ajax can aid in these types of searches. Types of Site Searches On the backend, a developer can set up a site for searching in three different ways, each offering a different level of detail in the search: • Keyword searching • Full text searching • Page indexing Developers don’t have to implement these themselves; many search engines now have ways that a local site can use them as their engine for searching the site instead of searching the whole Web. Which search method is most useful in any given situation depends on the purpose of the search.

*/ if (preg_match('/^meta/', strtolower($buffer)) && preg_match('/^keywords/', strtolower($buffer))) /* * Get a count of matches for ranking, 0 is * okay */ $results[] = array(0 => stristr($buffer, $p_search_string), 1 => $file); } } chdir('..'); } } parse_meta_elements($dir, $_REQUEST['searchString']); /* Handle the results of the parse here... */ ?> Types of Site Searches | 567 Full Text Parsing For a more thorough site search, you should use full text searching instead of keywords. This search method searches the full text of each page on the site, yielding a much more complete search of the site. The length of the words being searched should be identified so that commonly used words such as a, an, and, the, of, and so on are not included in the search.

PostgreSQL Administration Essentials
by Hans-Jurgen Schonig
Published 14 Oct 2014

Each index type supports certain algorithms used for different purposes. The following list of index types is available in PostgreSQL (as of Version 9.4.1): btree: These are the high-concurrency B-trees gist: This is an index type for geometric searches (GIS data) and for KNN-search gin: This is an index type optimized for Full-Text Search (FTS) sp-gist: This is a space-partitioned gist As we mentioned before, each type of index serves different purposes. We highly encourage you to dig into this extremely important topic to make sure that you can help software developers whenever necessary. Note Unfortunately, we don't have enough room in this book to discuss all the index types in greater depth.

Index A all variable / Configuring the amount of log output ALTER ROLE clause / Modifying and dropping roles ANALYZE functionabout / Analyzing the performance of a query asynchronous replicationsetting up / Setting up an asynchronous replication, Setting up replication step by step high-level overview, obtaining / Obtaining a high-level overview slave, preparing / Preparing the slave master, configuring / Configuring the master initial backup, fetching / Fetching an initial backup recovery.conf file, creating / Creating and modifying the recovery.conf file slave, firing up / Firing up the slave slaves, turning into masters / Turning slaves into masters B B-tree indexinternal structure / The internal structure of a B-tree index, Understanding the B-tree internals sorted order, providing / Providing a sorted order B-treesadvantages / Understanding the B-tree internals backendsabout / Integrating Nagios backups, performingabout / Performing backups pg_dump, handling / Handling pg_dump entire instance, dumping / Dumping an entire instance basic operations, COPY commandabout / Basic operations of the COPY command bgwriter_delay parameter / Controlling writes bgwriter_lru_maxpages parameter / Controlling writes bgwriter_lru_multiplier parameter / Controlling writes binary packagesinstalling / Installing binary packages btree index typeabout / Dealing with different types of indexes Bucardo projectURL / Integrating Nagios checklist / Integrating Nagios buffers_checkpoint variableabout / Checking out the information in the background writer C cert method / Authentication methods available checkpointsconfiguring / Configuring the checkpoints optimizing / Optimizing the checkpoints distance, configuring / Configuring the distance between checkpoints writes, controlling / Controlling writes checkpoint_completion_target / Controlling writes checkpoint_sync_time variableabout / Checking out the information in the background writer checkpoint_write_time variableabout / Checking out the information in the background writer column permissionsmanaging / Managing column rights SELECT clause / Managing column rights INSERT clause / Managing column rights UPDATE clause / Managing column rights REFERENCES clause / Managing column rights combined indexabout / Combined indexes seq_page_cost variable / Combined indexes connectionabout / Integrating Nagios COPY commandusing / Using the COPY command basic operations / Basic operations of the COPY command pipes, using / Making use of pipes COPY … PROGRAM command / Making use of pipes CREATE clauseabout / Understanding schema-level permissions csvlogabout / Configuring log destinations curl command / Making use of pipes custom format dumpsabout / More sophisticated dumping D dataimporting / Importing and exporting data exporting / Importing and exporting data database-level permissionscontrolling / Controlling database-level permissions databasescreating / Creating databases ddl variable / Configuring the amount of log output DebianPostgreSQL, installing on / Installing PostgreSQL on Debian or Ubuntu E effective_cache_sizeadjusting / Adjusting effective_cache_size environment variablesPGPASSWORD / Passing users and passwords PGUSER / Passing users and passwords PGHOST / Passing users and passwords PGPORT / Passing users and passwords PGDATABASE / Passing users and passwords eventlogabout / Configuring log destinations execution plansabout / Understanding the concept of execution plans costs, calculating / Calculating costs conclusions, drawing / Drawing important conclusions indexes, creating / Creating indexes query performance, analyzing / Analyzing the performance of a query B-tree indexes / The internal structure of a B-tree index combined indexes / Combined indexes partial indexes / Partial indexes EXPLAIN ANALYZEabout / Analyzing the performance of a query explain analyze statement / Tweaking work_mem explain commandusing / Understanding the concept of execution plans F Fedora 20 / Installing PostgreSQL on Red-Hat-based systems Full-Text Search (FTS) / Dealing with different types of indexes G generate_series functionabout / Preparing the data gin index typeabout / Dealing with different types of indexes gist index typeabout / Dealing with different types of indexes GroupAggregate / Tweaking work_mem gss method / Authentication methods available H HashAggregate / Tweaking work_mem hot_standby_delayabout / Integrating Nagios huge pagesabout / Considering huge pages I ident method / Authentication methods available IF EXISTS clause / Modifying and dropping roles index typeslisting / Dealing with different types of indexes about / Dealing with different types of indexes btree / Dealing with different types of indexes gist / Dealing with different types of indexes gin / Dealing with different types of indexes sp-gist / Dealing with different types of indexes instance-level permissionshandling / Handling instance-level permissions roles, creating / Creating roles roles, modifying / Modifying and dropping roles roles, dropping / Modifying and dropping roles L ldap method / Authentication methods available Lehman-Yao B-treesURL / Understanding the B-tree internals Linux cgroupshandling / Handling Linux cgroups setting up / Setting up cgroups Linux kernel control group (cgroup) / Handling Linux cgroups local logfilescreating / Creating local logfiles locksabout / Integrating Nagios log creation, fine-tuningabout / Making log creation more fine grained logging selectively / Logging selectively slow queries, focusing on / Focusing on slow queries notices, silencing / Silencing notices log destinationsconfiguring / Configuring log destinations csvlog / Configuring log destinations syslog / Configuring log destinations eventlog / Configuring log destinations local logfiles, creating / Creating local logfiles syslog, using / Using syslog logs, configuring on Windows / Configuring logs on Windows performance considerations / Performance considerations logging collector process / Understanding the PostgreSQL log architecture log output amountconfiguring / Configuring the amount of log output logs, making readable / Making logs more readable additional settings / Additional settings logsconfiguring, on Windows / Configuring logs on Windows log_timezone variable / Making logs more readable M maintenance_work_mem parameterimproving / Improving maintenance_work_mem maxwritten_clean variableabout / Checking out the information in the background writer md5 method / Authentication methods available memory parametersadjusting / Adjusting memory parameters shared buffers, optimizing / Optimizing shared buffers huge pages, considering / Considering huge pages work_mem parameter, tweaking / Tweaking work_mem maintenance_work_mem, improving / Improving maintenance_work_mem missing indexesdetecting / Detecting missing indexes mod variable / Configuring the amount of log output N Nagios integrationabout / Integrating Nagios network authenticationmanaging / Managing network authentication contradictions, managing / Managing contradictions methods / Authentication methods available examples / Some more examples SSL, handling / Handling SSL pg_hba.conf, changing / Changing pg_hba.conf network authentication methodstrust / Authentication methods available reject / Authentication methods available md5 / Authentication methods available password / Authentication methods available gss / Authentication methods available sspi / Authentication methods available ident / Authentication methods available peer / Authentication methods available ldap / Authentication methods available radius / Authentication methods available cert / Authentication methods available pam / Authentication methods available none variable / Configuring the amount of log output P pagesabout / Considering huge pages pam method / Authentication methods available partial indexesabout / Partial indexes partial replaysperforming / Performing partial replays password method / Authentication methods available peer method / Authentication methods available pg_dumpabout / Handling pg_dump custom format dumps / More sophisticated dumping partial replays, performing / Performing partial replays users and passwords, passing / Passing users and passwords pg_dumpall command-line toolabout / Dumping an entire instance backups / Understanding backups and user creation user creation / Understanding backups and user creation pg_hba.confabout / Managing network authentication changing / Changing pg_hba.conf pg_relation_size commandabout / Calculating costs pg_start_backup variable / Setting up PITR pg_statio_user_indexesabout / Monitoring indexes pg_stat_activity fileabout / Checking out the pg_stat_activity file example / Checking out the pg_stat_activity file pg_stat_bgwriterabout / Checking out the information in the background writer pg_stat_databaseabout / Monitoring databases pg_stat_reset commandabout / How to reset statistics pg_stat_statement functionabout / Detecting slow queries pg_stat_statements moduleabout / Detecting slow queries pg_stat_user_indexesabout / Monitoring indexes pg_stat_user_tablesabout / Monitoring tables pg_stat_user_tables optionabout / Detecting missing indexes pg_stat_xact_user_tables tableabout / Resetting statistics PITRhandling / Handling point-in-time recovery about / Handling point-in-time recovery setting up / Setting up PITR transaction logs, replaying / Replaying transaction logs Pluggable Authentication Module (PAM)about / Managing network authentication URL / Managing network authentication PostgreSQLfiring up / Firing up PostgreSQL databases / Understanding the existing databases databases, creating / Creating databases index types / Dealing with different types of indexes system statistics / Understanding the system statistics of PostgreSQL integrating, with Nagios / Integrating Nagios PostgreSQL, compiling from sourceabout / Compiling PostgreSQL from source working / How it works contrib packages, installing / Installing the contrib packages database instance, creating / Creating a database instance postgresql.confwal_level parameter / Setting up PITR archive_mode parameter / Setting up PITR archive_command parameter / Setting up PITR PostgreSQL documentationURL / Dealing with different types of indexes PostgreSQL installationsetup, preparing / Preparing your setup PostgreSQL version numbers / Understanding the PostgreSQL version numbers binary packages, installing / Installing binary packages performing, on Debian / Installing PostgreSQL on Debian or Ubuntu performing, on Ubuntu / Installing PostgreSQL on Debian or Ubuntu performing, on Red Hat-based systems / Installing PostgreSQL on Red-Hat-based systems finalizing / Finalizing your installation PostgreSQL log architectureabout / Understanding the PostgreSQL log architecture log destinations, configuring / Configuring log destinations PostgreSQL securityabout / Understanding PostgreSQL security TCP, configuring / Configuring the TCP network authentication, managing / Managing network authentication instance-level permissions, handling / Handling instance-level permissions database-level permissions, controlling / Controlling database-level permissions schema-level permissions / Understanding schema-level permissions table-level permissions, handling / Handling table-level permissions column rights, managing / Managing column rights improving, with SELinux / Improving security with SELinux PostgreSQL transaction logabout / Understanding the PostgreSQL transaction log purpose / The purpose of the transaction log size, inspecting / Inspecting the size of the transaction log PostgreSQL versionselecting / Choosing the right version PostgreSQL version numbersabout / Understanding the PostgreSQL version numbers ps ax | grep post command / Passing users and passwords Q query, executing stepsparser / Understanding the concept of execution plans rewrite system / Understanding the concept of execution plans optimizer or planner / Understanding the concept of execution plans executor / Understanding the concept of execution plans query_timeabout / Integrating Nagios R radius method / Authentication methods available Red Hat-based systemsPostgreSQL, installing on / Installing PostgreSQL on Red-Hat-based systems reject method / Authentication methods available replicationimproving / Improving and monitoring the replication, Making things more robust monitoring / Improving and monitoring the replication streaming, monitoring / Keeping an eye on streaming conflicts, managing / Managing conflicts S schema-level permissionsabout / Understanding schema-level permissions schemasabout / Understanding schema-level permissions search_path variableabout / Understanding schema-level permissions SECURITY LABEL / Improving security with SELinux SELinuxused, for improving security / Improving security with SELinux shared buffersabout / Optimizing shared buffers shared_buffers parameterabout / Optimizing shared buffers simple binary treesusing / Using simple binary trees data, preparing / Preparing the data slow queriesdetecting / Detecting slow queries sp-gist index typeabout / Dealing with different types of indexes sspi method / Authentication methods available standard error (stderr) / Understanding the PostgreSQL log architecture statisticsresetting / How to reset statistics synchronous replicationupgrading to / Upgrading to synchronous replication syntax, CREATE ROLESUPERUSER | NOSUPERUSER / Creating roles CREATEDB | NOCREATEDB / Creating roles CREATEROLE | NOCREATEROLE / Creating roles CREATEUSER | NOCREATEUSER / Creating roles INHERIT | NOINHERIT / Creating roles LOGIN | NOLOGIN / Creating roles REPLICATION | NOREPLICATION / Creating roles CONNECTION LIMIT / Creating roles PASSWORD / Creating roles VALID UNTIL / Creating roles IN ROLE | IN GROUP / Creating roles ROLE / Creating roles ADMIN / Creating roles USER / Creating roles SYSID / Creating roles syslogabout / Configuring log destinations, Using syslog using / Using syslog system statistics, PostgreSQLabout / Understanding the system statistics of PostgreSQL pg_stat_activity file / Checking out the pg_stat_activity file databases, monitoring / Monitoring databases tables, monitoring / Monitoring tables indexes, monitoring / Monitoring indexes pg_stat_bgwriter / Checking out the information in the background writer resetting / Resetting statistics T table-level permissionshandling / Handling table-level permissions SELECT / Handling table-level permissions INSERT / Handling table-level permissions UPDATE / Handling table-level permissions DELETE / Handling table-level permissions TRUNCATE / Handling table-level permissions REFERENCES / Handling table-level permissions TRIGGER / Handling table-level permissions TCPconfiguring / Configuring the TCP The Oversized Attribute Storage Technique (TOAST)about / Monitoring tables timelinesabout / Understanding timelines benefits / The importance of timelines trust method / Authentication methods available U UbuntuPostgreSQL, installing on / Installing PostgreSQL on Debian or Ubuntu USAGE clauseabout / Understanding schema-level permissions users and passwordspassing / Passing users and passwords W work_mem parameterabout / Tweaking work_mem tweaking / Tweaking work_mem Write Ahead Log (WAL) / The purpose of the transaction log

pages: 696 words: 111,976

SQL Hacks
by Andrew Cumming and Gordon Russell
Published 28 Nov 2006

You could use this to perform the search: mysql> SELECT author, -> MATCH (body) AGAINST ('+database +systems' IN BOOLEAN MODE) -> AS SCORE -> FROM story -> ORDER BY 2 DESC; +---------------------+-------+ | author | SCORE | +---------------------+-------+ | Atzeni | 1 | | Adams | 0 | | Russell and Cumming | 0 | +---------------------+-------+ 3.1.2. PostgreSQL To get full text searching in PostgreSQL, you need to use the Tsearch2 module. A more detailed guide on how to do this is available from devx (http://www.devx.com/opensource/Article/21674). To install Tsearch2 (from a source-code install) go to your source directory for PostgreSQL and type the following at the Linux or Unix shell prompt (you may need to be root for the install step): $ cd contrib/tsearch2 $ make $ make install To use Tsearch2 in a particular database, you need to issue this command: $ psql dbname < tsearch2.sql tsearch2.sql should be in your install directory's contrib directory (for instance, /usr/local/pgsql/share/contrib/tsearch2.sql).

To use this new searching capability, you need to add a column to the tables to be searched (to hold some system vector data concerning the field to be searched), add an index, and prepare the new column for searching: ALTER TABLE story ADD COLUMN vectors tsvector; CREATE INDEX story_index ON story USING gist(vectors); SELECT set_curcfg('default'); UPDATE story SET vectors = to_tsvector(body); Finally, you can perform your search: dbname=> SELECT author,rank (vectors,q) dbname-> FROM story, to_tsquery('database&systems') AS q dbname-> ORDER BY rank(vectors,q) DESC; author | rank ---------------------+----------- Atzeni | 0.0991032 Adams | 1e-20 Russell and Cumming | 1e-20 3.1.3. SQL Server Implementation of full text searching in SQL Server utilizes the Microsoft Search Engine. This is external to the database and has to be configured separately first. Part of this configuration requires you to specify a location for saving the full text indexes. Because these indexes are stored outside the normal database structure, you need to remember to back these up separately when you are doing a database backup.

VALUES statement INSERT ALL statement (Oracle) INSERT statement creating words table piping sequence of to SQL command-line utility using SELECT instead of VALUES construct web page converted to INSERT TRIGGER INSERTED table (SQL Server) instance name (SQL Server) integers converting dates to Access Oracle SQL Server representing seconds since epoch integers table generating sequence of dates generating the alphabet numbers 0-99 simplifying queries that use self-joins on using with LEFT OUTER JOIN generating data for consecutive dates generating data for each alphabet letter Internet Movie Database (IMDb) Internet, sharing data across INTERVAL notation (Oracle) invoices, reconciling with remittances ISNULL function (MySQL) ISO standard date format isolated transactions, ensuring in database updates isolation level, transactions autocommit, turning off concurrency issues enforcing querying standard, listed iSQL*Plus (Oracle) Index [SYMBOL] [A] [B] [C] [D] [E] [F] [G] [H] [I] [J] [K] [L] [M] [N] [O] [P] [Q] [R] [S] [T] [U] [V] [W] [X] [Y] Java connecting to SQL database from program escaping special characters in strings placeholders Java Virtual Machine javac (Java compiler) JavaScript validation JDBC driver for SQL databases JOIN keyword JOINs ANSI syntax choosing right style for relationships JOIN chain JOIN star combinations, generating comma-delimited lists converting aggregate subqueries to converting subqueries to steps in process including rows left out of query output making use of indexes one-to-many relationships, finding specific rows on many side one-to-many, producing excess repetition in result set self-joins types of updating in MySQL Index [SYMBOL] [A] [B] [C] [D] [E] [F] [G] [H] [I] [J] [K] [L] [M] [N] [O] [P] [Q] [R] [S] [T] [U] [V] [W] [X] [Y] Kevin Bacon game keywords capitalization searching for, without using LIKE Index [SYMBOL] [A] [B] [C] [D] [E] [F] [G] [H] [I] [J] [K] [L] [M] [N] [O] [P] [Q] [R] [S] [T] [U] [V] [W] [X] [Y] labels, adding to SVG pie chart Large Object Blocks [See LOBs] last Thursday of the month, finding latitude, longitude, and altitude values (GPS) LEAST function 2nd 3rd working without LEFT JOIN [See LEFT OUTER JOINs] LEFT OUTER JOINs 2nd JOIN chain example providing missing report data star JOIN example using in CROSS JOIN LEVEL pseudovariable (Oracle) libraries, database LIKE operator case sensitivity JOINs on comma-delimited lists searching for keywords without using LIMIT instruction (MySQL) 2nd limiting number of rows returned line continuation character (\\) Linux Apache web server logs XSLT transformation of web page to SQL xsltproc utility linuxzoo web site LOBs (Large Object Blocks) locking performance and coping with unexpected redo implicit locks within transactions transaction isolation level using optimistic locking using pessimistic locking table comparison and LOG function (SQL Server) logarithms, adding for list of numbers logging creating permanent log processing web server logs checking for broken links investigating user actions queries Webalizer tool, using long-running queries, finding and stopping preventing from running LPAD function Index [SYMBOL] [A] [B] [C] [D] [E] [F] [G] [H] [I] [J] [K] [L] [M] [N] [O] [P] [Q] [R] [S] [T] [U] [V] [W] [X] [Y] Macintosh XSLT transformation of web page to SQL, Mac OS X xsltproc utility magnitude math avoiding division by zero calculating maximum of two fields calculating rank counting disaggregating a COUNT distance between GPS locations, calculation keeping a running total median, calculating minimum of two values, calculating multiplying across a result set progressive tax, applying reconciling invoices and remittances rounding numbers tallying survey results into a chart values and subtotals, getting in one query maximum calculating for three fields calculating for two fields MD5 hashing in the programming language Oracle SQL Server median, calculating finding middle row or rows temporary table creating filling in memory, buffered data from queries returning many rows metadata, obtaining Microsoft Access [See Access] Microsoft Search Engine Microsoft SQL Server 2005 MID function (Access) middle element(s), finding minimum calculating minimum of two values MOD function (Oracle) 2nd MOD operator modulus operator (%) MONTH function monthly totals, reporting months current month, reporting on last Thursday, finding second Tuesday, finding msxsl.exe cleaning up Unicode output multiplication across a result set MySQL anonymous access audit trail, creating auto-numbering calculating a running total checking if table exists before issuing DROP TABLE command-line utility Connector/J driver (JDBC driver) creating tables CURRENT_TIMESTAMP dates casting to strings converting to integers DAYOFWEEK function finding floating calendar date MONTH, YEAR, and EXTRACT functions parsing quarterly reports reducing precision of disk space, managing DROP avoiding constraints dropping FROM clause of SELECT statement errors caused by unrecognized date formats exporting and importing table definitions FULLTEXT pattern match GREATEST and LEAST functions IFNULL function IN BOOLEAN MODE text-searching InnoDB, specifying ISNULL function LIMIT instruction 2nd limiting queries long-running queries, finding and stopping MD5 function MONTH and YEAR functions ODBC driver OFFSET instruction pattern-matching personalized parameters, defining pessimistic locking REPEAT function static functions string concatenation SUBSTRING function support for SQL92 standard temporary table, creating transaction isolation transaction isolation level, querying transposition errors, finding in numbers triggers tunneling into from Access creating secure tunnel MySQL ODBC connector starting tunnel with Visual Basic stopping the tunnel testing the connection updates users and administrators, creating UUID function versions, SQL conventions and web-based interface WITH ROLLUP clause XML features mysql_ functions mysqldump Index [SYMBOL] [A] [B] [C] [D] [E] [F] [G] [H] [I] [J] [K] [L] [M] [N] [O] [P] [Q] [R] [S] [T] [U] [V] [W] [X] [Y] names (column and table), spaces in namespace management navigation features for web applications network connection concepts NEWID( ) 2nd nonrepeatable reads nonstring data, checking format NOT IN subquery 2nd NULL values skipped in a count NULLIF function numbers finding transposition errors generating unique sequential numbers auto-numbering 2nd choosing primary key gaps in sequence multiuser considerations performance random rounding 2nd sequential, generating numbers table [See integers table] numerical keys, table comparisons and Index [SYMBOL] [A] [B] [C] [D] [E] [F] [G] [H] [I] [J] [K] [L] [M] [N] [O] [P] [Q] [R] [S] [T] [U] [V] [W] [X] [Y] ODBC canonical date format MySQL connector source data for Excel pivot tables ODBC/JDBC bridge OFFSET instruction OID datatype (PostgreSQL) old copy of data, returned for transaction ON clause one-row tables one-to-many relationship implemented incorrectly searching for specific rows on the many side one-to-one relationship one-way hashes online applications [See web-based applications] operating systems clocks using epoch time command-line switches finding all the files for an operation optimistic locking optimizer hints, returning subset of results Oracle anonymous access auditing features auto-numbering autocommit, turning off calculating a running total calendar view of important dates checking if table exists before issuing DROP TABLE creating tables CURRENT_TIMESTAMP dates converting to integers day-of-the week EXTRACT and TO_DATE functions finding floating calendar date generating sequence with integers table parsing quarterly reports reducing precision in report data reporting on current month sequence of days up to a year subtracting to get days between disk space, managing DROP avoiding constraints exporting and importing table definitions generating alphabet with integers table GREATEST and LEAST functions GROUPING SETS clause INSERT ALL statement limiting queries long-running queries, finding and stopping LPAD function MD5 hash, generating with DBMS_CRYPTO library (Oracle 10g) MOD function optimizer behavior, changing with ALTER SESSION personalized parameters, defining pessimistic locking RANK( ) function recursive extensions ROWNUM pseudovariable 2nd 3rd 4th running from command line single-row table (dual table) SQL conventions in this book static functions string concatenation SUBSTR function SYS_GUID( ) function temp space, running out of temporary table, creating transaction isolation level, querying transposition errors, finding in numbers triggers UPDATE statement users and administrators, creating web-based interface XML features OUTER JOINs eliminating subqueries by using JOIN chain example OVER clause used with aggregate function overlapping ranges, identifying Index [SYMBOL] [A] [B] [C] [D] [E] [F] [G] [H] [I] [J] [K] [L] [M] [N] [O] [P] [Q] [R] [S] [T] [U] [V] [W] [X] [Y] page rank calculations parameters personalized creating list of defining storing in database parameters, query partial index matching passwords accessing with SQL injection easypass generator tool hashing Oracle recorded in plain text specifying with command-line switches strength of pattern-matching FULLTEXT, MySQL solving crossword puzzle using SQL performance combining queries comparing tables compressing to avoid LOBs executing functions in the database extracting subset of the results improving with indexes locking and coping with unexpected redo implicit locks within transactions transaction isolation level using optimistic locking using pessimistic locking mixing file and database storage adding/removing files changing existing files too many files one-to-many JOIN producing excess repetition in result set queries returning many rows using a large buffer using series of round trips using variable binding SQL calculations unique sequential number generation Perl assigning list of variables in single statement cleaning up Unicode output from msxsl.exe compression/decompression script connecting to SQL database from program converting logfile lines into SQL INSERT statements dates (user supplied), converting to date literals DBI interface creating BLOBs and retrieving data from escaping special characters in strings MD5 hashing opening files in append mode placeholders pop-up list from a table table comparison script Webalizer DNS cache, accessing permissions user, setting on rows and columns personalized parameters creating list of defining pessimistic locking pg_dump command phantom reads PHP connecting to SQL database from program escaping special characters in strings extracting subset of results MD5 hashing navigation features for web application search results, alphabetic placeholders pop-up list from a table user creation form repeatable phpBB program, sharing namespaces phpMyAdmin 2nd phpPgAdmin pipes pivot tables, filling in missing values placeholders 2nd examples in different languages Plink (SSH client) killing the process starting connection using Visual Basic plpgsql language 2nd pnm library pop-up list from a table in Perl in PHP portability, database applications PostgreSQL anonymous access audit trail, creating auto-numbering autocommit (chained mode), turning off calendar view of important dates checking if table exists before issuing DROP TABLE command-line utility (psql) CURRENT_TIMESTAMP dates casting to strings converting to integers day-of-the-week finding floating calendar date generating sequence with integers table quarterly reports reducing precision in report data sequence of days up to a year DROP avoiding constraints exporting table definitions full text searching with Tsearch2 module LIMIT instruction limiting queries long-running queries, finding and stopping LPAD function maximum of three fields, calculating MD5 algorithm OFFSET instruction OID datatype pattern-matching personalized parameters, defining pessimistic locking SQL conventions in this book static functions string concatenation SUBSTRING function temporary table, creating transaction isolation level, querying transposition errors, finding in numbers triggers UPDATE statement users and administrators, creating web-based interfaace precision, reducing for report data custom ranges dates numbers primary keys choosing for number sequence defining to create indexed column email-based usernames image filenames prioritized decision tables processors, XSLT profiles (Oracle) programming languages accessing SQL database C# Java Perl PHP Ruby placeholder examples progressive tax, applying PUBLIC role PuTTY web site Pythagorean theorem 2nd Index [SYMBOL] [A] [B] [C] [D] [E] [F] [G] [H] [I] [J] [K] [L] [M] [N] [O] [P] [Q] [R] [S] [T] [U] [V] [W] [X] [Y] quadratic time, SQL calculations QUARTER function quarterly reports, generating queries combining issuing without using table long-running, finding and stopping preventing from running returning many rows using a large buffer using series of round trips using variable binding Query Builder interface (Access) query caching query parameters queuing in database Index [SYMBOL] [A] [B] [C] [D] [E] [F] [G] [H] [I] [J] [K] [L] [M] [N] [O] [P] [Q] [R] [S] [T] [U] [V] [W] [X] [Y] RAD (rapid application development) tools radians, converting GPS coordinates to random numbers ranges grouping report data by custom ranges reducing precision of dates reducing precision of numbers overlapping, identifying RANK( ) function 2nd rank, calculating rank, page rank calculations READ COMMITTED isolation level 2nd testing INSERT transaction READ UNCOMMITTED isolation level recursive extensions (Oracle) redirection operator (<) regular expressions user-supplied date values converted to date literals remittances, reconciling with invoices REPEAT function (MySQL) REPEATABLE READ isolation level testing INSERT transaction repeating same calculation, avoiding REPLACE function using instead of string concatenations widespread support among database systems REPLICATE function (SQL Server) Report Wizard (Access) reports breaking data down into ranges custom ranges reducing precision of dates reducing precision of numbers building decision tables prioritized decision tables choosing any three of five GROUP BY solution JOIN solution comma-delimited lists in a column finding top n in each group generating a calendar generating sequential or missing data providing missing data with OUTER JOIN sequential data identifying updates uniquely branch transactions central server update duplicate batches pivot tables, filling in missing values queuing in database, setting up six degrees of Kevin Bacon game testing two values from a subquery traversing a simple tree Oracle recursive extensions tree visualization result sets excess repetition produced by one-to-many JOIN extracting subset queries returning large result set result weighting results page REVOKE command RGB values RIGHT OUTER JOINs filling in missing values in pivot table robot spiders, filtering from web server logs ROLLBACK command ROUND function rounding errors ROWNUM pseudovariable (Oracle) 2nd 3rd 4th rows displaying as columns using self-join displaying columns as ungrouping data with repeated columns generating without tables Ruby connecting to SQL database from program placeholders running total, keeping Index [SYMBOL] [A] [B] [C] [D] [E] [F] [G] [H] [I] [J] [K] [L] [M] [N] [O] [P] [Q] [R] [S] [T] [U] [V] [W] [X] [Y] scalable vector graphics [See SVG, generating from SQL] schema, dropping second Tuesday of the month, finding secure tunnel, creating on Windows security user permissions, setting on rows web-based interfaces SELECT statement asterisk (*) wildcard, using dropping FROM clause FROM clause, using other SELECT statements in VALUES list INSERT ...

pages: 519 words: 102,669

Programming Collective Intelligence
by Toby Segaran
Published 17 Dec 2008

Multidimensional scaling in two dimensions is easy to print, but scaling can be done in any number of dimensions. Try changing the code to scale in one dimension (all the points on a line). Now try making it work for three dimensions. Chapter 4. Searching and Ranking This chapter covers full-text search engines, which allow people to search a large set of documents for a list of words, and which rank results according to how relevant the documents are to those words. Algorithms for full-text searches are among the most important collective intelligence algorithms, and many fortunes have been made by new ideas in this field. It is widely believed that Google's rapid rise from an academic project to the world's most popular search engine was based largely on the PageRank algorithm, a variation that you'll learn about in this chapter.

The neural network will use this information to change the ordering of the results to better reflect what people have clicked on in the past. To work through the examples in this chapter, you'll need to create a Python module called searchengine, which has two classes: one for crawling and creating the database, and the other for doing full-text searches by querying the database. The examples will use SQLite, but they can easily be adapted to work with a traditional client-server database. To start, create a new file called searchengine.py and add the following crawler class and method signatures, which you'll be filling in throughout this chapter: class crawler: # Initialize the crawler with the name of database def __init_ _(self,dbname): pass def __del_ _(self): pass def dbcommit(self): pass # Auxilliary function for getting an entry id and adding # it if it's not present def getentryid(self,table,field,value,createnew=True): return None # Index an individual page def addtoindex(self,url,soup): print 'Indexing %s' % url # Extract the text from an HTML page (no tags) def gettextonly(self,soup): return None # Separate the words by any non-whitespace character def separatewords(self,text): return None # Return true if this url is already indexed def isindexed(self,url): return False # Add a link between two pages def addlinkref(self,urlFrom,urlTo,linkText): pass # Starting with a list of pages, do a breadth # first search to the given depth, indexing pages # as we go def crawl(self,pages,depth=2): pass # Create the database tables def createindextables(self): pass A Simple Crawler I'll assume for now that you don't have a big collection of HTML documents sitting on your hard drive waiting to be indexed, so I'll show you how to build a simple crawler.

If you'd like to make sure that the crawl worked properly, you can try checking the entries for a word by querying the database: >>[row for row in crawler.con.execute( .. 'select rowid from wordlocation where wordid=1')] [(1,), (46,), (330,), (232,), (406,), (271,), (192,),... The list that is returned is the list of all the URL IDs containing "word," which means that you've successfully run a full-text search. This is a great start, but it will only work with one word at a time, and will just return the documents in the order in which they were loaded. The next section will show you how to expand this functionality by doing these searches with multiple words in the query. Querying You now have a working crawler and a big collection of documents indexed, and you're ready to set up the search part of the search engine.

pages: 480 words: 99,288

Mastering ElasticSearch
by Rafal Kuc and Marek Rogozinski
Published 14 Aug 2013

You should also know how to send queries to get the documents you are interested in, how to narrow down the results of your queries by using filtering, and how to calculate statistics for your data with the use of the faceting/aggregation mechanism. However, before getting to the exciting functionality that ElasticSearch offers, we think that we should start with a quick tour of Apache Lucene, the full text search library that ElasticSearch uses to build and search its indices, as well as the basic concepts that ElasticSearch is built on. In order to move forward and extend our learning, we need to ensure we don't forget the basics. It is easy to do. We also need to make sure that we understand Lucene correctly as Mastering ElasticSearch requires this understanding.

We don't know for sure, because we were not the ones that made the decision, but we assume that it was because Lucene is mature, highly performing, scalable, light, and yet, very powerful. Its core comes as a single file of Java library with no dependencies, and allows you to index documents and search them with its out of the box full text search capabilities. Of course there are extensions to Apache Lucene that allows different languages handling, enables spellchecking, highlighting, and much more; but if you don't need those features, you can download a single file and use it in your application. Overall architecture Although I would like to jump straight to Apache Lucene architecture, there are some things we need to know first in order to fully understand it, and those are: Document: It is a main data carrier used during indexing and search, containing one or more fields, which contain the data we put and get from Lucene Field: It is a section of the document which is built of two parts, the name and the value Term: It is a unit of search representing a word from text Token: It is an occurrence of a term from the text of the field.

By the end of this chapter we will have covered the following topics: How to use different scoring formulae and what they can bring How to use different posting formats and what they can bring How to handle Near Real Time searching, real-time GET, and what searcher reopening means Looking deeper into multilingual data handling Configuring transaction log to our needs and see how it affects our deployments Segments merging, different merge policies, and merge scheduling Altering Apache Lucene scoring With the release of Apache Lucene 4.0 in 2012, all the users of this great, full text search library, were given the opportunity to alter the default TF/IDF based algorithm. Lucene API was changed to allow easier modification and extension of the scoring formula. However, that was not the only change that was made to Lucene when it comes to documents score calculation. Lucene 4.0 was shipped with additional similarity models, which basically allows us to use different scoring formula for our documents.

Python Web Development With Django
by Jeff Forcier

MySQL lacks some advanced functionality present in Postgres, but is also a bit more common, partly due to its tight integration with the common Web language PHP. Unlike some database servers, MySQL has a couple of different internal database types that determine the effective feature set: One is MyISAM, which lacks transactional support and foreign keys but is capable of full-text searching, and another is InnoDB, which is newer and has a better feature set but currently lacks full-text search.There are others, but these two are the most commonly used. If you’re on Windows or your package manager doesn’t have a recent version of MySQL, its official Web site ishttp://www.mysql.com, and offers binaries for most platforms. Django’s preferred MySQL Python library is MySQLdb, whose official site is http: //www.sourceforge.net/projects/mysql-python, and you need version 1.2.1p2 or newer.

Our search function is handy, but doesn’t offer the power that something as familiar as a Web search engine does; a multiword phrase, for example, should ideally be treated as a collection of independent search terms unless otherwise specified.The implementation here could be made more sophisticated, but if you are doing full-text searching over large numbers of records you probably would benefit from Summary something such as Sphinx, a search engine with available Django integration. For more, see withdjango.com. Status change notifications. We’ve already got a custom save method that handles our Markdown rendering.We could easily extend this to improve our workflow system by detecting when a story’s status has been changed and sending a notification e-mail to the person responsible for handling stories at that stage.A key piece of implementing this would be to replace our status field with a ForeignKey to a full-fledged Status model, which in addition to the numerical value and label fields implied of our STATUS_CHOICES list would have a status_owner field, a ForeignKey field to the User model.

pages: 291 words: 77,596

Total Recall: How the E-Memory Revolution Will Change Everything
by Gordon Bell and Jim Gemmell
Published 15 Feb 2009

Modern databases let you very quickly retrieve all the records with a given attribute. You can rapidly sort, sift, and combine information in just about any way you can imagine. There was once a slight technical distinction to be made between how a database could index and look up records and full-text retrieval of documents, but by now databases have subsumed full-text search; they are happy to store documents and perform Google-like retrieval. In his memex paper, Bush had expressed hope that the search algorithms of the future would be better than simple index-lookup on some attribute like author or date. He held up the human brain’s associative memory as the ideal.

Any team that can take my corpus and turn it into my digitally immortal chatting self will get my support. And that’s not just vanity—if you can imitate me, you can imitate help-desk personnel and make a ton of money. START-UP #8—DOCUMENT MANAGEMENT It sounds great to declutter your life by scanning all your documents, but full-text search on a heap of files is not always the best way to retrieve information. This service (or program that you run) will automatically group similar items. It will build a knowledge base of every kind of document it can learn about, for example from all major utility and phone companies. It will be able to pull out the date, the total, and who the bill is from.

pages: 468 words: 233,091

Founders at Work: Stories of Startups' Early Days
by Jessica Livingston
Published 14 Aug 2008

So what we did was that we searched our directory first, we gave you those results, and then, if we didn’t find anything, we kicked you over to a full-text search. So, when I say we “rented” that technology, we essentially partnered with full-text search companies to be the falloff searches that we had. Livingston: That’s what you did with Google? Brady: Yes. Strategically, it was spot-on until Google showed up. Because we always thought it was going to be a leapfrogging game. No one is ever going to be able to get so far ahead that we’d ever be in strategic risk of kingmaking a full-text search engine, because you just can’t do that. Google ended up doing exactly that.

The ideas that they had really early on were right strategically and creatively. So everything we did through the middle of ’97, invariably we were first and we did it very well. The one thing we didn’t do that all our competitors were spending a lot of time doing was search. They were crawling the Web and doing full text search, and our strategy was, “Look, that’s a technology game. We’re not a technology company, we’re a media company. Since there are so many of them out there, we’re always going to be able to rent it.” That was the thought back then, and until Google came along that strategy was perfect. Because, as things played out, that’s exactly what happened.

See also angel investors Firefox, 395–404 Firefox 1.0, 395 Firefox toolbar, 226 FirePower Systems, 17–18 flagging, 251–252 flash card program, 52 Fletcher, Mark, 233–246 Flickr, 257–264 floppy disk drive, 52, 55 Fog Creek Software, 345–360 FogBugz, 348–350 Index 459 Forum for Women Entrepreneurs, 264 Founders Award, 169 Fourier transform, 178 Frankston, Bob, 73–88, 90, 91 fraud, 6–11 fraud investigators, 8 fraud unit, 9 Fregin, Doug, 141–151 Fried, Jason, 309 Fry’s stores, 199 full-text search companies, 133 Fuzzy. See Mauldin, Michael (Fuzzy) Fylstra, Dan, 76, 83–84, 90 G Galbraith, David, 118 Game Neverending (Ludicorp), 257, 258 Gates, Bill, 292, 307 Gecko, 397 General Magic, 174, 189 General Motors, 141, 145–146 GeoURL, 223 Geschke, Charles, 281–296 GlaxoSmithKline, 106 Gmail, 161–172 Goldman, Phil, 178 Goodger, Ben, 397 Google, 27, 122–123, 161, 167–170 Google’s Founders Award, 168–169 Government Printing Office, 270 Graham, Paul, 205–222 Greenspun, Philip, 317–344 Groove Networks, 103–110 Gruner, Ron, 427–446 H hackers, 230 Hambrecht & Quist, 283–284, 429 Handler, Sheryl, 265 hard-disk drive, 196 hardware vs. software designers, 21 Harris 2200, 79 Harvard Business School, 75–76 Heinemeier Hansson, David, 309, 313 Hembrecht, Bill, 283–284, 429 Hendricks, John, 202 Hewitt, Joe, 395, 402 Hewlett-Packard (HP), 32, 191–192 Hewlett-Packard 3000 minicomputer, 42 Highland Capital, 419 Hillis, Danny, 265, 278 Hoffman, Reid, 261 Homebrew Computer Club, 33 Hong, James, 377–386 HOT or NOT, 377–386 Hotmail, 17–29, 135 Hourihan, Meg, 112, 119, 120 House of Representatives, 270 HP LaserJet printers, 289, 296 Huffman, Steve, 448 Human Resources, 391 Hummer Winblad Venture Partners, 297–298 Hyatt, Dave, 395, 397 460 Index I IBM, 89, 93–94, 289 IBM PC, 51, 94, 186 IDG, 65 IFC (Internet Foundation Class), 154 IGOR, 8–10 IM (Instant Messaging), 316 Imbach, Sarah, 9 InfoWorld, 65 instant messenger application, 259 intellectual property issues, 21 interactive pager, 149 InterActiveCorp.

pages: 125 words: 27,675

Applied Text Analysis With Python: Enabling Language-Aware Data Products With Machine Learning
by Benjamin Bengfort , Rebecca Bilbro and Tony Ojeda
Published 10 Jun 2018

The addition of the WORM store to our data ingestion workflow means that we need to store data in two places: the raw corpus as well as the preprocessed corpus, and leads to the question: where should that data be stored? When we think of data management, the first thought is a database. Databases are certainly valuable tools in building language aware data products, and many provide full-text search functionality and other types of indexing. However, consider the fact that most databases are constructed to retrieve or update only a couple of rows per transaction. In contrast, computational access to a text corpus will be a complete read of every single document, and will cause no in-place updates to the document, nor search or select individual documents.

pages: 1,266 words: 278,632

Backup & Recovery
by W. Curtis Preston
Published 9 Feb 2009

An email archive system would include who sent and received an email, the subject of the email, and all other appropriate metadata. Finally, an archive system may import the full text of the item into its database, allowing for full-text searches against the archive. This can be useful, especially if multiple formats can be supported. It’s particularly expedient to be able to do a full-text search against all emails, Word documents, PDF files, etc. Another important feature of archive systems is their ability to store a predetermined number of copies of an archived item. A company can then decide how many copies it wishes to keep.

The assumption is that if someone asks for widget ABC’s parts (or some other piece of reference data), the appropriate files can just be restored from the system where they used to reside. The first problem with that scenario is remembering where the files were several years ago. While backup products and even some backup devices are starting to offer full-text search against all your backups, the problems in the following paragraph still exist. Even if you can remember where the files belong, the number of operating systems or application versions that have come and gone in the intervening time can stymie the effort. To restore files that were backed up from “Apollo” five years ago, the first requirement is a system named Apollo.

For too long our files and databases have been put into backup formats that required the backup software to extract them. We’ve lived with this for so long that it’s actually quite hard to imagine the possibilities that this brings to the table. Here’s a short list to help you wrap your brain around this one: You can point a full-text search appliance directly at your backups and search the full text of all files ever backed up. If you’re running multiple backup products, users and administrators can use a single method of recovery. Imagine how easy end-user recoveries would be if you could just point them at a mount point such as \\backupserver\yourclientname\date.

pages: 593 words: 118,995

Relevant Search: With Examples Using Elasticsearch and Solr
by Doug Turnbull and John Berryman
Published 30 Apr 2016

And although tokens are typically generated text, as you’ll see in chapter 4, analysis can be applied and tokens generated for nontext values such as floating-point numbers and geographic locations. In chapter 1, we mentioned the notion of features. In machine learning, features are descriptors for the items being classified. Features used to classify fruit may be things such as color, flavor, and shape. With full-text search, the tokens produced during analysis are the dominant features used to match a user’s query with documents in the index. Don’t worry if this seems vague right now; the greater portion of this book is dedicated to making these ideas clear. After analysis is complete, the documents are indexed; the tokens from the analysis step are stored into search engine data structures for document retrieval.

simple signals, 2nd Solr additive, with Boolean queries boosting feature mappings multiplicative, with function queries user ratings vs. filtering breadcrumb navigation browse experience browse interface, Yowl buckets section building signals bulk index API bulkMovies string business and domain awareness business concerns group business weight business-ranking logic BusinessScore C cast.name field, 2nd, 3rd, 4th, 5th cast.name scores cast.name.bigrammed field, 2nd, 3rd, 4th character filtering, 2nd, 3rd character offsets classic similarity classification features cleaning click-through rate co-occurrence counting cold-start problem COLLAB_FILTER filter, 2nd collaboration filtering, using co-occurrence counting search relevance and collation collocation extraction combining fields committed documents common words, removing completion field, 2nd completion suggester completion_analyzer completion_prefix variable complexphrase query parser compound queries, 2nd, 3rd concept search basic methods for building augmenting content with synonyms concept signals building using machine learning personalized search and configurations conflate tokens constant_score query, 2nd content augmentation curation engineer/curator pairing risk of miscommunication with content curator role of content curator exploring extracting into documents providing to search engine searching content group content weight, 2nd ContentScore control analysis controlling field matching converge conversion rate coord (coordinating factor), 2nd, 3rd, 4th, 5th, 6th, 7th copyField, 2nd, 3rd copy_to option, 2nd cosine similarity cross_fields, 2nd, 3rd searching, 2nd, 3rd, 4th Solr solving signal discordance with cuisine field cuisine_hifi field, 2nd cuisine_lofi field curation, search relevance and custom all field custom score query D data-driven culture debugging example search application Elasticsearch first searches with The Movie Database Python matching query matching analysis to solve matching issues comparing query to inverted index fixing by changing analyzers query parsing underlying strategy ranking computing weight explain feature scoring matches to measure relevance search term importance similarity vector-space model decay functions, 2nd deep paging default analyzer defType parameter delimiters acronyms modeling specificity phone numbers synonyms tokenizing geographic data tokenizing integers tokenizing melodies deployment, relevance-focused search application description field, 2nd, 3rd, 4th, 5th descriptive query directors field directors.name field, 2nd, 3rd, 4th directors.name score directors.name.bigrammed, 2nd, 3rd disable_coord option disabling tokenization discriminating fields DisjunctionMaximumQuery dismax, 2nd doc frequency, 2nd doc values document search and retrieval aggregations Boolean search facets filtering Lucene-based search positional and phrase matching ranked results relevance sorting document-ranking system documents analysis enhancement enrichment extraction flattening nested grouping similar matching meaning of scored search completion from documents being searched tokens as features of matching process meaning of documents dot character dot product, 2nd down-boosting title DSL (domain-specific language) E e-commerce search, 2nd easy_install utility edismax query parser, 2nd Elasticsearch example search application overview end sentinels engaged field engaged restaurants English analyzer overview reindexing with english_* filters english_bigrams analyzer english_keywords filter english_possessive_stemmer filter english_stemmer filter english_stop filter enrichment, 2nd ETL (extract, transform, load), 2nd every field gets a vote exact matching, 2nd, 3rd, 4th expert search, 2nd, 3rd explanation field external sources extract function, 2nd, 3rd extracting features extraction F faceted browsing overview Solr facet.prefix option facets, 2nd, 3rd fail fast, 2nd, 3rd, 4th fast vector highlighter feature modeling, 2nd feature selection feature space features creation of overview, 2nd, 3rd feedback at search box search completion search suggestions search-as-you-type business and domain awareness content curation risk of miscommunication with content curator role of content curator in search results listing grouping similar documents information presented snippet highlighting when there are no results search relevance and Solr faceted browsing field collapsing match phrase prefix relevance feedback feature mappings suggestion and highlighting components while browsing alternative results ordering breadcrumb navigation faceted browsing field boosts field collapsing overview Solr field discordance field mappings field normalization field scores, 2nd field synchronicity, signal modeling and field-by-field dismax field-centric methods, 2nd field-centric search, combining term-centric search and combining greedy search and conservative amplifiers like fields precision vs. recall Solr fieldNorms, 2nd, 3rd fields fieldType field_value_factor function fieldWeight, 2nd, 3rd, 4th filter clause filter element filter queries filtering Amazon-style collaborative overview using co-occurrence counting score shaping vs. boosting finite state transducer fire token first_name field floating-point numbers fragment_size parameter fudge factors full bulk command full search string full-text search full_name field function decay function queries, multiplicative boosting with Boolean queries vs. combining high-value tiers scored with simple Solr function_score query, 2nd, 3rd, 4th G garbage features Gaussian decay generalizing matches generate_word_parts genres aggregation genres.name field geographic data, tokenizing geohashing geolocation, 2nd getCastAndCrew function GitHub repository granular fields grouping fields H has_discount field high-quality signals highlighted snippets highlights HTMLStripCharFilter HTTP commands, 2nd I ideal document IDF (inverse document frequency) ignoring when ranking overview, 2nd inconsistent scoring index-time analysis, 2nd index-time personalization indexing documents information and requirements gathering business needs required and available information users and information needs information retrieval, creating relevance solutions through inner objects innermost calculation integers, tokenizing inventory-related files inventory_dir configuration inverse document frequency.

pages: 480 words: 122,663

The Art of SQL
by Stephane Faroult and Peter Robson
Published 2 Mar 2006

One can sometimes use query accelerators other than regular indexes (full-text indexing, for instance), but such accelerators usually have drawbacks, such as not being maintained in real time. Also take note that full-text search may produce awkward results at times. Let's take the example of a color column that contains a description of both body and interior colors. If you search for "blue" because you'd prefer to buy a blue car, gray cars with a blue interior will also be returned. We have all experienced irrelevant full-text search results through web searches. Database-guaranteed data correctness Data-entry is prone to error. More importantly than dissuasive search times, if "ASB" is entered instead of "ABS" into a descriptive string, the database management system will have no way to check whether the string "ASB" is meaningful.

Django Book
by Matt Behrens
Published 24 Jan 2015

The icontains is a lookup type (as explained in Chapter 5 and Appendix B), and the statement can be roughly translated as “Get the books whose title contains q, without being case-sensitive.” This is a very simple way to do a book search. We wouldn’t recommend using a simple icontains query on a large production database, as it can be slow. (In the real world, you’d want to use a custom search system of some sort. Search the Web for open-source full-text search to get an idea of the possibilities.) We pass books, a list of Book objects, to the template. The template code for search_results.html might include something like this: <p>You searched for: <strong>{{ query }}</strong></p> {% if books %} <p>Found {{ books|length }} book{{ books|pluralize }}.

year, month, and day For date/datetime fields, perform exact year, month, or day matches: # Return all entries published in 2005 >>>Entry.objects.filter(pub_date__year=2005) # Return all entries published in December >>> Entry.objects.filter(pub_date__month=12) # Return all entries published on the 3rd of the month >>> Entry.objects.filter(pub_date__day=3) # Combination: return all entries on Christmas of any year >>> Entry.objects.filter(pub_date__month=12, pub_date_day=25) isnull Takes either True or False, which correspond to SQL queries of IS NULL and IS NOT NULL, respectively: >>> Entry.objects.filter(pub_date__isnull=True) search A Boolean full-text search that takes advantage of full-text indexing. This is like contains but is significantly faster due to full-text indexing. Note this is available only in MySQL and requires direct manipulation of the database to add the full-text index. The pk Lookup Shortcut For convenience, Django provides a pk lookup type, which stands for “primary_key”.

pages: 485 words: 74,211

Developing Web Applications with Haskell and Yesod
by Michael Snoyman
Published 22 Apr 2012

My experience has been that you can easily perform 95% of what you need to do with the high-level interface. (In fact, most of my web apps use the high-level interface exclusively.) But occassionally you’ll want to use a feature that’s specific to a backend. One feature I’ve used in the past is full text search. In this case, we’ll use the SQL “LIKE” operator, which is not modeled in Persistent. We’ll get all people with the last name “Snoyman” and print the records out. Note Actually, you can express a LIKE operator directly in the normal syntax due to a feature added in Persistent 0.6, which allows backend-specific operators.

pages: 420 words: 61,808

Flask Web Development: Developing Web Applications With Python
by Miguel Grinberg
Published 12 May 2014

Following is a short list of some additional packages that are worth exploring: Flask-Babel: Internationalization and localization support Flask-RESTful: Tools for building RESTful APIs Celery: Task queue for processing background jobs Frozen-Flask: Conversion of a Flask application to a static website Flask-DebugToolbar: In-browser debugging tools Flask-Assets: Merging, minifying, and compiling of CSS and JavaScript assets Flask-OAuth: Authentication against OAuth providers Flask-OpenID: Authentication against OpenID providers Flask-WhooshAlchemy: Full-text search for Flask-SQLAlchemy models based on Whoosh Flask-KVsession: Alternative implementation of user sessions that use server-side storage If the functionality that you need for your project is not covered by any of the extensions and packages mentioned in this book, then your first destination to look for additional extensions should be the official Flask Extension Registry.

pages: 295 words: 66,912

Walled Culture: How Big Content Uses Technology and the Law to Lock Down Culture and Keep Creators Poor
by Glyn Moody
Published 26 Sep 2022

Unlike the Google Books lawsuit, which dragged on for years, the one against HathiTrust was resolved quickly. In October 2012, the presiding judge ruled that the digitisation programme was clear fair use under US copyright law, because scanning books in order to index them was a ‘transformative’76 act. The judge also recognised that copying entire works was necessary in order to provide full-text searching and access by the visually impaired. As the Publishers Weekly report on the judgment notes: “[The judge] clearly was impressed by the access the project affords to the blind and print disabled (who had intervened in the case via the National Federation for the Blind), citing it often in his opinion and at one point writing that the ‘unprecedented ability of print-disabled individuals to have an equal opportunity to compete with their sighted peers’ was perhaps the ‘most important’ transformative use of the scans.”77 The judge raised a key issue: copyright in general prevents those who are visually impaired from making accessible versions of texts.

pages: 263 words: 75,610

Delete: The Virtue of Forgetting in the Digital Age
by Viktor Mayer-Schönberger
Published 1 Jan 2009

In the United States in the 1970s, Lexis and Westlaw, for example, made available to their customers huge databases with the full text of tens of thousands of court decisions, but these could only be retrieved using a limited set of keys. Customers, however, wanted to find relevant decisions by searching for words in the text of the decision, not just the case name, docket number, date, and a few subject words that had been indexed. The solution was to make searchable every word of every document in the database. Such full-text searches still require input of the precise words or terms and so it is no surefire recipe of finding the desired information, but it is eminently easier and more powerful than a search that is restricted to a small number of predefined search keys. At first, full-text indexing and searches were used by large providers of information databases, but by the beginning of the twenty-first century it had become a standard feature of all major PC operating systems, bringing the power of pinpoint information retrieval to people’s desktops.

pages: 260 words: 77,007

Are You Smart Enough to Work at Google?: Trick Questions, Zen-Like Riddles, Insanely Difficult Puzzles, and Other Devious Interviewing Techniques You ... Know to Get a Job Anywhere in the New Economy
by William Poundstone
Published 4 Jan 2012

As the tech blogger Joel Spolsky wrote, A very senior Microsoft developer who moved to Google told me that Google works and thinks at a higher level of abstraction than Microsoft. “Google uses Bayesian filtering the way Microsoft uses the ‘if’ statement,” he said. That’s true. Google also uses full-text-search-of-the-entire-Internet the way Microsoft uses little tables that list what error IDs correspond to which help text. Look at how Google does spell checking: it’s not based on dictionaries; it’s based on word usage statistics of the entire Internet, which is why Google knows how to correct my name, misspelled, and Microsoft Word doesn’t.

pages: 313 words: 75,583

Ansible for DevOps: Server and Configuration Management for Humans
by Jeff Geerling
Published 9 Oct 2015

You can find the entire example Drupal LAMP server playbook in this book’s code repository at https://github.com/geerlingguy/ansible-for-devops, in the drupal directory. Real-world playbook: Ubuntu Apache Tomcat server with Solr Apache Solr is a fast and scalable search server optimized for full-text search, word highlighting, faceted search, fast indexing, and more. It’s a very popular search server, and it’s pretty easy to install and configure using Ansible. In the following example, we’re going to set up Apache Solr using Ubuntu 12.04 and Apache Tomcat. Apache Solr Server. Include a variables file, and discover pre_tasks and handlers Just like the previous LAMP server example, we’ll begin this playbook by telling Ansible our variables will be in a separate vars.yml file: 1 - hosts: all 2 3 vars_files: 4 - vars.yml Let’s quickly create the vars.yml file, while we’re thinking about it.

pages: 420 words: 79,867

Developing Backbone.js Applications
by Addy Osmani
Published 21 Jul 2012

Sorting will automatically detect if you’re trying to sort numbers (even if they’re strored as strings) and will do the right thing. Collection.setFilter(filterFields, filterWords) - filter the current view. Filtering supports multiple words without any specific order, so you’ll basically get a full-text search ability. Also, you can pass it only one field from the model, or you can pass an array with fields and all of them will get filtered. Last option is to pass it an object containing a comparison method and rules. Currently, only levenshtein method is available. The goTo(), prevPage(), and nextPage() functions do not require the options param since they will be executed synchronously.

pages: 360 words: 96,275

PostgreSQL 9 Admin Cookbook: Over 80 Recipes to Help You Run an Efficient PostgreSQL 9. 0 Database
by Simon Riggs and Hannu Krosing
Published 23 Oct 2010

Reducing the number of rows returned Although often the problem is producing many rows in the first place, it is made worse by returning all the unneeded rows to the client. This is especially true if client and server are not on the same host. Here are some ways to reduce the traffic between the client and server. A full text search returns 10,000 documents, but only first the 20 are displayed to user In this case, order the documents by ranking on the server, and return only the top 20 actually displayed SELECT title, ts_rank_cd(body_tsv, query, 20) AS text_rank FROM articles, plainto_tsquery('spicy potatoes') AS query WHERE body_tsv @@ query ORDER BY rank DESC LIMIT 20 ; If you need the next 20, don't just query with limit 40 and throw away the first 20, but use "OFFSET 20 LIMIT 20" to return just the next 20.

pages: 281 words: 95,852

The Googlization of Everything:
by Siva Vaidhyanathan
Published 1 Jan 2010

From library consultant Karen Coyle: I was chatting with the brother of one of the Google founders. He told me that his brother was working on a new search engine that would be better than anything ever seen before. I tried to argue that it would still be limited by the NOTES TO PAGES 82–91 235 reality of the full-text search. I probably looked at Google when it was first made available, and I was pretty un-impressed. Just more keyword searching. Today I use it constantly, but I’m very aware of the fact that it works quite well for nouns and proper nouns (people, companies, named things), and less well for concepts. . . . .

pages: 353 words: 104,146

European Founders at Work
by Pedro Gairifo Santos
Published 7 Nov 2011

Really try to break those barriers, break those borders, and take inspiration from things around you and be curious, but apply it to problems in a smart, differentiated, useful way. Ilya Segalovich Yandex Ilya Segalovich is co-founder of Yandex, the leading search engine in Russian-speaking countries. The roots of Yandex trace back to a company called Arkadia, which in the early 1990s developed software featuring full-text search supporting the Russian language. In 1993, Segalovich and Arkady Volozh came up with the word “Yandex” to describe their search technologies. The web site, Yandex.ru, was launched in 1997 and in 2000 Yandex was incorporated as a standalone company. In May 2011, Yandex raised $1.3 billion in an initial public offering on NASDAQ.

pages: 541 words: 109,698

Mining the Social Web: Finding Needles in the Social Haystack
by Matthew A. Russell
Published 15 Jan 2011

We’ll pass in empty strings for the URL portion of those tuples, but you could use your imagination and hyperlink to a simple web service that displays a list of tweets containing the entities. (Recall that Example 5-7 provides just about everything you’d need to wire this up by using couchdb-lucene to perform a full-text search on tweets stored in CouchDB.) Another option might be to write a web service and link to a URL that provides any tweet containing the specified entity. Example 5-17. Generating the data for an interactive tag cloud using WP-Cumulus (the_tweet__tweet_tagcloud_code.py) # -*- coding: utf-8 -*- import os import sys import webbrowser import json from cgi import escape from math import log import couchdb from couchdb.design import ViewDefinition DB = sys.argv[1] MIN_FREQUENCY = int(sys.argv[2]) HTML_TEMPLATE = '..

pages: 470 words: 109,589

Apache Solr 3 Enterprise Search Server
by Unknown
Published 13 Jan 2012

The two most common high level libraries for interacting with Solr are acts_as_solr and Sunspot. However, in the last couple of years, Sunspot has become the more popular choice, and comes in a version designed to work explicitly with Rails called sunspot_rails that allows Rails ActiveRecord database objects to be transparently backed by a Solr index for full text search. For lower-level client interface to Solr from Ruby environments, there are two libraries duking it out to be the client of choice: solr-ruby, a client library developed by the Apache Solr project and rsolr, which is a reimplementation of a Ruby centric client library. Both of these solutions are solid and act as great low level API libraries.

pages: 349 words: 114,038

Culture & Empire: Digital Revolution
by Pieter Hintjens
Published 11 Mar 2013

Once upon a time, the list of all websites was twenty pages long. I still have a book that has the entire World Wide Web printed as an appendix. Then the list got too long to print and sites like Yahoo! organized them into categories. Then the category list got too large to keep updated, and Lycos invented the full-text search. This was too slow, so Digital Equipment Corporation built a natty search engine called Altavista to show how to do it properly. The results for any search got too long, so Google invented the ranked search, which pretty much fixed the search issue. Google also threw all the clutter off the main page.

pages: 561 words: 120,899

The Theory That Would Not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant From Two Centuries of Controversy
by Sharon Bertsch McGrayne
Published 16 May 2011

According to Google’s research director, Peter Norvig, “There must have been dozens of times when a project started with naïve Bayes, just because it was easy to do and we expected to replace it with something more sophisticated later, but in the end the vast amount of data meant that a more complex technique was not needed.” Google also uses Bayesian techniques to classify spam and pornography and to find related words, phrases, and documents. A very large Bayesian network finds synonyms of words and phrases. Instead of downloading dictionaries for a spelling checker, Google conducted a full-text search of the entire Internet looking for all the different ways words can be spelled. The result was a flexible system that could recognize that “shaorn” should have been “Sharon” and correct the typo. While Bayes has helped revolutionize modern life on the web, it is also helping to finesse the Tower of Babel that has separated linguistic communities for millennia.

Pragmatic.Programming.Erlang.Jul.2007
by Unknown

. • Chapter 20, Programming Multicore CPUs, on page 367 is about programming multicore computers. We talk about the techniques for ensuring that an Erlang program will run efficiently on multicore computers. We introduce a number of abstractions for speeding up sequential programs on multicore computers. Finally we perform some measurements and develop our third major program, a full-text search engine. To write this, we first implement a function called mapreduce—this is a higher-order function for parallelizing a computation over a set of processing elements. • Appendix A, on page 390, describes the type system used to document Erlang functions. • Appendix B, on page 396, describes how to set up Erlang on the Windows operating system (and how to configure emacs on all operating systems). • Appendix C, on page 399, has a catalog of Erlang resources. • Appendix D, on page 403, describes lib_chan, which is a library for programming socket-based distribution. 15 B EGIN A GAIN • Appendix E, on page 419, looks at techniques for analyzing, profiling, debugging, and tracing your code. • Appendix F, on page 439, has one-line summaries of the most used modules in the Erlang standard libraries. 1.2 Begin Again Once upon a time a programmer came across a book describing a funny programming language.

pages: 602 words: 207,965

Practical Ext JS Projects With Gears
by Frank Zammetti
Published 7 Jul 2009

The Create Index and Drop Index functions will first require that you figure out how to retrieve the list of indexes for a table and then present a Window to enter the index details in the case of Create Index, or a list of existing indexes to choose from in the case of Delete Index. None of this is especially hard, and would make for a good exercise (hint: getting the list of indexes is a slight modification to the query to retrieve a list of tables in a database). • The SQLite engine Gears uses has a full-text search capability, and it would be nice if there was a Text Search tool, similar to the Query tool, where that could be used. • Provide the ability to add a new record from the Browse tab of the Table Details Window, as well as the ability to duplicate, edit, or delete the selected record. • Allow more than 20 fields to be added in the Create Table Window.

pages: 933 words: 205,691

Hadoop: The Definitive Guide
by Tom White
Published 29 May 2009

Link graph database This database is necessary to compute link-based page ranking scores, such as PageRank. For each URL known to Nutch, it contains a list of other URLs pointing to it, and their associated anchor text (from HTML <a href="..">anchor text</a> elements). This database is called LinkDb. Full-text search index This is a classical inverted index, built from the collected page metadata and from the extracted plain-text content. It is implemented using the excellent Lucene library. We briefly mentioned before that Hadoop began its life as a component in Nutch, intended to improve its scalability and to address clear performance bottlenecks caused by a centralized data processing model.