fault tolerance


137 results

pages: 1,237 words: 227,370

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
by Martin Kleppmann
Published 16 Mar 2017

failover, Leader failure: Failover, Glossary (see also leader-based replication)
    in leaderless replication, absence of, Writing to the Database When a Node Is Down
    leader election, The leader and the lock, Total Order Broadcast, Distributed Transactions and Consensus
    potential problems, Leader failure: Failover
failures
    amplification by distributed transactions, Limitations of distributed transactions, Maintaining derived state
    failure detection, Detecting Faults
        automatic rebalancing causing cascading failures, Operations: Automatic or Manual Rebalancing
        perfect failure detectors, Three-phase commit
        timeouts and unbounded delays, Timeouts and Unbounded Delays, Network congestion and queueing
        using ZooKeeper, Membership and Coordination Services
    faults versus, Reliability
    partial failures in distributed systems, Faults and Partial Failures–Cloud Computing and Supercomputing, Summary
fault tolerance, Reliability–How Important Is Reliability?, Glossary
    abstractions for, Consistency and Consensus
    formalization in consensus, Fault-Tolerant Consensus–Limitations of consensus
        use of replication, Single-leader replication and consensus
    human fault tolerance, Philosophy of batch process outputs
    in batch processing, Bringing related data together in the same place, Philosophy of batch process outputs, Fault tolerance
    in log-based systems, Applying end-to-end thinking in data systems, Timeliness and Integrity–Correctness of dataflow systems
    in stream processing, Fault Tolerance–Rebuilding state after a failure
        atomic commit, Atomic commit revisited
        idempotence, Idempotence
        maintaining derived state, Maintaining derived state
        microbatching and checkpointing, Microbatching and checkpointing
        rebuilding state after a failure, Rebuilding state after a failure
    of distributed transactions, XA transactions–Limitations of distributed transactions
    transaction atomicity, Atomicity, Atomic Commit and Two-Phase Commit (2PC)–Exactly-once message processing
faults, Reliability
    Byzantine faults, Byzantine Faults–Weak forms of lying
    failures versus, Reliability
    handled by transactions, Transactions
    handling in supercomputers and cloud computing, Cloud Computing and Supercomputing
    hardware, Hardware Faults
    in batch processing versus distributed databases, Designing for frequent faults
    in distributed systems, Faults and Partial Failures–Cloud Computing and Supercomputing
    introducing deliberately, Reliability, Network Faults in Practice
    network faults, Network Faults in Practice–Detecting Faults
        asymmetric faults, The Truth Is Defined by the Majority
        detecting, Detecting Faults
        tolerance of, in multi-leader replication, Multi-datacenter operation
    software errors, Software Errors
    tolerating (see fault tolerance)

Protocols for making systems Byzantine fault-tolerant are quite complicated [84], and fault-tolerant embedded systems rely on support from the hardware level [81]. In most server-side data systems, the cost of deploying Byzantine fault-tolerant solutions makes them impractical. Web applications do need to expect arbitrary and malicious behavior of clients that are under end-user control, such as web browsers. This is why input validation, sanitization, and output escaping are so important: to prevent SQL injection and cross-site scripting, for example. However, we typically don’t use Byzantine fault-tolerant protocols here, but simply make the server the authority on deciding what client behavior is and isn’t allowed.

deterministic operations, Pros and cons of stored procedures, Faults and Partial Failures, Glossary
    accidental nondeterminism, Fault tolerance
    and fault tolerance, Fault tolerance
    and idempotence, Idempotence, Reasoning about dataflows
    computing derived data, Maintaining derived state, Correctness of dataflow systems, Designing for auditability
    in state machine replication, Using total order broadcast, Databases and Streams, Deriving current state from the event log
    joins, Time-dependence of joins

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
by Martin Kleppmann
Published 17 Apr 2017


It is impossible to reduce the probability of a fault to zero; therefore it is usually best to design fault-tolerance mechanisms that prevent faults from causing failures. In this book we cover several techniques for building reliable systems from unreliable parts. Counterintuitively, in such fault-tolerant systems, it can make sense to increase the rate of faults by triggering them deliberately—for example, by randomly killing individual processes without warning. Many critical bugs are actually due to poor error handling [3]; by deliberately inducing faults, you ensure that the fault-tolerance machinery is continually exercised and tested, which can increase your confidence that faults will be handled correctly when they occur naturally.
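The "triggering faults deliberately" idea can be sketched in a few lines: wrap an operation so that it randomly fails, then check that the retry machinery still delivers a result. This is my own minimal illustration of the pattern, not code from the book; all names are invented.

```python
import random

def flaky(op, failure_rate=0.3, rng=random.random):
    """Wrap `op` so that it sometimes raises, simulating an unreliable dependency."""
    def wrapped(*args, **kwargs):
        if rng() < failure_rate:
            raise ConnectionError("injected fault")
        return op(*args, **kwargs)
    return wrapped

def with_retries(op, attempts=10):
    """The fault-tolerance machinery under test: retry until success or give up."""
    for _ in range(attempts):
        try:
            return op()
        except ConnectionError:
            continue  # fault handled; try again
    raise RuntimeError("all attempts failed")

# Exercise the retry logic against an operation that fails half the time.
unreliable_fetch = flaky(lambda: "payload", failure_rate=0.5,
                         rng=random.Random(0).random)
print(with_retries(unreliable_fetch))  # "payload", once a retry gets through
```

Running the fault injector continuously (rather than only in tests) is what keeps the error-handling paths exercised, which is exactly the point the excerpt makes.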

However, we typically don’t use Byzantine fault-tolerant protocols here, but simply make the server the authority on deciding what client behavior is and isn’t allowed. In peer-to-peer networks, where there is no such central authority, Byzantine fault tolerance is more relevant. A bug in the software could be regarded as a Byzantine fault, but if you deploy the same software to all nodes, then a Byzantine fault-tolerant algorithm cannot save you. Most Byzantine fault-tolerant algorithms require a supermajority of more than two-thirds of the nodes to be functioning correctly (i.e., if you have four nodes, at most one may malfunction). To use this approach against bugs, you would have to have four independent implementations of the same software and hope that a bug only appears in one of the four implementations.
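The two-thirds supermajority requirement amounts to needing n ≥ 3f + 1 nodes to tolerate f Byzantine ones. A small sketch of that arithmetic (my illustration; the function name is made up):

```python
def max_byzantine_faults(n: int) -> int:
    """Largest f such that n >= 3f + 1, i.e. more than two-thirds
    of the nodes remain correct even with f malfunctioning."""
    return (n - 1) // 3

# With four nodes, at most one may malfunction, as the excerpt says.
print(max_byzantine_faults(4))   # 1
# Tolerating two Byzantine nodes already requires seven nodes.
print(max_byzantine_faults(7))   # 2
```

Note how steeply this scales compared to crash fault tolerance, where a majority (n ≥ 2f + 1) suffices; this cost is part of why Byzantine fault-tolerant protocols are rarely deployed in server-side data systems.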

Elixir in Action
by Saša Jurić
Published 30 Jan 2019

As I’ve explained, Erlang goes a long way toward making it possible to write fault-tolerant systems that can run for a long time with hardly any downtime. This is a big challenge and a specific focus of the Erlang platform. Although it’s admittedly unfortunate that the ecosystem isn’t as mature as it could be, my sentiment is that Erlang significantly helps with hard problems, even if simple problems can sometimes be more clumsy to solve. Of course, those difficult problems may not always be important. Perhaps you don’t expect a high load, or a system doesn’t need to run constantly and be extremely fault-tolerant. In such cases, you may want to consider some other technology stack with a more evolved ecosystem.

We’ll spend some time exploring BEAM concurrency, a feature that plays a central role in Elixir’s and Erlang’s support for scalability, fault-tolerance, and distribution. In this chapter, we’ll start our tour of BEAM concurrency by looking at basic techniques and tools. Before we explore the lower-level details, we’ll take a look at higher-level principles.

5.1 Concurrency in BEAM

Erlang is all about writing highly available systems — systems that run forever and are always able to meaningfully respond to client requests. To make your system highly available, you have to tackle the following challenges:

- Fault-tolerance — Minimize, isolate, and recover from the effects of runtime errors.
- Scalability — Handle a load increase by adding more hardware resources without changing or redeploying the code.
- Distribution — Run your system on multiple machines so that others can take over if one machine crashes.

Doing so promotes the scalability and fault-tolerance of the system.

- A process is internally sequential and handles requests one by one. A single process can thus keep its state consistent, but it can also cause a performance bottleneck if it serves many clients.
- Carefully consider calls versus casts. Calls are synchronous and therefore block the caller. If the response isn’t needed, casts may improve performance at the expense of reduced guarantees, because a client process doesn’t know the outcome.
- You can use mix projects to manage more involved systems that consist of multiple modules.

8 Fault-tolerance basics

This chapter covers

- Runtime errors
- Errors in concurrent systems
- Supervisors

Fault-tolerance is a first-class concept in BEAM.
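The sequential-process and calls-versus-casts ideas above can be sketched outside the BEAM too. Below is a rough Python analogue of my own (not Elixir's actual GenServer API): a worker thread owns its state and drains a mailbox one message at a time, replying only for calls.

```python
import queue
import threading

class Counter:
    """A 'process': one thread, one mailbox, state touched only sequentially."""

    def __init__(self):
        self.mailbox = queue.Queue()
        self.value = 0
        threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self):
        while True:
            reply_to, amount = self.mailbox.get()  # one request at a time
            self.value += amount                   # no locks needed: single owner
            if reply_to is not None:               # a call: the caller is waiting
                reply_to.put(self.value)

    def cast(self, amount):
        """Asynchronous: fire-and-forget, caller never learns the outcome."""
        self.mailbox.put((None, amount))

    def call(self, amount):
        """Synchronous: blocks the caller until the worker replies."""
        reply = queue.Queue(maxsize=1)
        self.mailbox.put((reply, amount))
        return reply.get()

counter = Counter()
counter.cast(1)          # no reply expected
print(counter.call(2))   # prints 3: both messages processed in FIFO order
```

Because only the worker thread touches `self.value`, the state stays consistent without locks, and because every request goes through one mailbox, a busy `Counter` is also a potential bottleneck — the same trade-off the summary describes.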

pages: 419 words: 102,488

Chaos Engineering: System Resiliency in Practice
by Casey Rosenthal and Nora Jones
Published 27 Apr 2020

Conclusion

The discoveries made during these exercises and the improvements to Slack’s reliability they inspired were only possible because Disasterpiece Theater gave us a clear process for testing the fault tolerance of our production systems. Disasterpiece Theater exercises are meticulously planned failures that are introduced in the development environment and then, if that goes well, in the production environment by a group of experts all gathered together. It helps minimize the risk inherent in testing fault tolerance, especially when it’s based on assumptions made long ago in older systems that maybe weren’t originally designed to be so fault tolerant. The process is intended to motivate investment in development environments that faithfully match the production environment and to drive reliability improvements throughout complex systems.

Robustness and Stability

To build users’ trust in a newly released distributed database like TiDB, where data is saved in multiple nodes that communicate with each other, data loss or damage must be prevented at all times. But in the real world, failures can happen at any time, anywhere, in ways we can never expect. So how can we survive them? One common way is to make our system fault-tolerant: if one service crashes, a failover service can take charge immediately without affecting online services. In practice, we need to be wary that fault tolerance increases the complexity of a distributed system. How can we ensure that our fault tolerance is robust? Typical ways of testing our tolerance to failures include writing unit tests and integration tests. With the assistance of internal test generation tools, we have developed over 20 million unit test cases.
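The failover idea in the excerpt can be sketched from a client's point of view: try replicas in order until one answers. This is an invented illustration of the general pattern, not TiDB's actual mechanism; all names are hypothetical.

```python
def first_healthy(replicas, request):
    """Crude client-side failover: try each replica until one responds."""
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError as exc:
            errors.append(exc)   # remember the failure, move to the next node
    raise RuntimeError(f"all {len(replicas)} replicas failed")

# Hypothetical replicas: the first is down, the second healthy.
def down(_request):
    raise ConnectionError("node unreachable")

def up(request):
    return f"ok: {request}"

print(first_healthy([down, up], "read x"))  # prints "ok: read x"
```

Even this toy shows where the complexity the excerpt warns about comes from: the client must now distinguish "node down" from other errors, decide how long to wait, and avoid reading stale data from a lagging replica — none of which the happy path required.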

For completeness, I want to address monoliths briefly. There is no precise threshold a system crosses and becomes a monolith—it’s relative. Monolithic systems are not inherently more or less fault tolerant than service-oriented architectures. They may, though, be harder to retrofit because of their sheer surface area, the difficulty of effecting incremental change, and the difficulty of limiting the blast radius of failures. Maybe you decide you’re going to break up your monolith, maybe you don’t. Fault tolerance is reachable via both roads.

Design Patterns Common in Newer Systems

By contrast, systems being designed today are likely to assume individual computers come and go frequently.

pages: 673 words: 164,804

Peer-to-Peer
by Andy Oram
Published 26 Feb 2001

fault tolerance, Performance
    simulating in Freenet, Simulating fault tolerance
    in Gnutella, Fault tolerance and link distribution in Gnutella
Gnutella
    fault tolerance, Fault tolerance and link distribution in Gnutella
    link distribution in, Fault tolerance and link distribution in Gnutella
    node failure
        due to random removal, Fault tolerance and link distribution in Gnutella
        due to targeted attack, Fault tolerance and link distribution in Gnutella
        vs. Freenet, Fault tolerance and link distribution in Gnutella
node failure
    due to random removal in Freenet, Simulating fault tolerance; in Gnutella, Fault tolerance and link distribution in Gnutella
    due to targeted attack in Freenet, Simulating fault tolerance; in Gnutella, Fault tolerance and link distribution in Gnutella
from parties in transactions, Collecting ratings, Reputation–Summary system on eBay, Reputations worth real money: eBay trusting sources of, Credibility FFT (Fast Fourier Transform) algorithm, Radio SETI file sharing between Free Haven and Freenet, One network with a thousand faces DNS and, DNS Freenet and, Freenet–Conclusions next-generation peer-to-peer technologies, Next-generation peer-to-peer file-sharing technologies file tampering, detecting, Message digest functions file-sharing systems searching technologies for, Trust and search engines–Deniability trust issues and, File-sharing systems–Freenet unifying with an OmniNetwork, Interoperability Through Gateways Financial Cryptography conference, Anonymous macropayment digital cash schemes firewalls, Message fanout abuse of port 80, Abusing port 80 as obstacles to peer-to-peer, Firewalls, dynamic IP, NAT: The end of the open network making smarter, Technical solutions: Return to the old Internet no good if internal network is compromised, Groove versus email no longer a guarantee of protection, Groove versus email Fishburn, P.C., General considerations in an economic analysis of micropayment design flat-fee methods vs. 
pay-per-use methods, General considerations in an economic analysis of micropayment design floating-point operations and SETI@home, The world’s most powerful computer flooding attacks, Attacks on documents or the servnet common methods for dealing with, Common methods for dealing with flooding and DoS attacks–Active caching and mirroring foiled by fungible micropayments, Fungible payments for accountability protecting against, Active caching and mirroring protecting peer-to-peer from, Accountability Ford-Fulkerson algorithm and Advogato, A reputation system that resists pseudospoofing: Advogato forgery macropayment techniques for protecting against, Anonymous macropayment digital cash schemes preventing with micropayment schemes, Varieties of micropayments or digital cash thwarted by MicroMint, Micropayment digital cash schemes France and legal status of ISPs, Precedents and parries Frankel, Justin, Gnutella’s first breath Free Haven, Acknowledgments, Free Haven–Acknowledgments accountability, Free Haven buddy system and, Accountability and the buddy system case study, A case study: Accountability in Free Haven–Other considerations from the case study impact on, Peer-to-peer models and their impacts on accountability solving problem of, Purposes of micropayments and reputation systems, Moderating security levels: An accountability slider anonymity, Free Haven–Partial anonymity, An analysis of anonymity–An analysis of anonymity attacks on, Attacks on anonymity properties of, An analysis of anonymity attacks on, Attacks on Free Haven–Attacks on anonymity buddy system ) (see buddy system (Free Haven) choosing good algorithms, Reputation systems communications channel, Elements of the system, Communications channel, Micropayments in the Free Haven context design of, The design of Free Haven–Implementation status documents attacks on, Attacks on documents or the servnet not possible to revoke, Document revocation retrieving, Retrieval storing new, Storage efficiency 
problems, Future work flexibility, Free Haven freeloader problem, Freeloading, Moderating security levels: An accountability slider goals of, Free Haven introducers, The design of Free Haven, Introducers, Reputation systems micropayments in, Micropayments in the Free Haven context–Micropayments in the Free Haven context nonfungible micropayments work best with, Micropayments in the Free Haven context OmniNetwork, what it can add to, Free Haven and Publius persistence, Free Haven privacy in data-sharing systems, Privacy in data-sharing systems–Reliability with anonymity problems in design of, Future work–Conclusion pseudonyms on, Reliability with anonymity public keys, Elements of the system–Retrieval publication system, Elements of the system–Publication receipts, Trading –Receipts attacks on, Attacks on the reputation system reply blocks, Elements of the system reputation (see reputation, Free Haven) reputation referrals, Reputation systems routing techniques, Free Haven scores and ratings, Reputation systems sending broadcasts in batches, Micropayments in the Free Haven context servers adding/removing, Introducers as introducers, The design of Free Haven, Introducers, Reputation systems broadcasting referrals, Reputation systems contracts formed by, The design of Free Haven credibility/confidence values, Reputation systems punishing misbehaving, Reputation systems servnet, The design of Free Haven attacks on, Attacks on documents or the servnet dynamic nature of, Elements of the system introducing files into, Publication shares buddy system, Accountability and the buddy system expiration dates of, Share expiration hoarding, Attacks on documents or the servnet receipts and, Trading –Receipts, Attacks on the reputation system replicating not allowed, Accountability and the buddy system trading, The design of Free Haven, Trading trust issues and, Mojo Nation and Free Haven, Other considerations from the case study vs.

Principles of Protocol Design
by Robin Sharp
Published 13 Feb 2008

The first is a practical objection: Simple languages generally do not correspond to protocols which can tolerate faults, such as missing or duplicated messages. Protocols which are fault-tolerant often require the use of state machines with enormous numbers of states, or they may define context-dependent languages. A more radical objection is that classical analysis of the protocol language from a formal language point of view traditionally concerns itself with the problems of constructing a suitable recogniser, determining the internal states of the recogniser, and so on. This does not help us to analyse or check many of the properties which we may require the protocol to have, such as the properties of fault-tolerance mentioned above. To be able to investigate this we need analytical tools which can describe the parallel operation of all the parties which use the protocol to regulate their communication. 1.2 Protocols as Processes A radically different way of looking at things has therefore gained prominence within recent years.
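The point about state being essential can be made concrete with a small sketch (not from the book): even the simplest protocol that tolerates duplicated messages needs per-connection state, which is exactly what pushes it beyond a simple recogniser for a regular language.

```python
# Hypothetical sketch: a receiver state machine for an alternating-bit
# style protocol that tolerates duplicated messages. The one piece of
# state (the expected sequence number) is what a purely language-based
# view of the protocol fails to capture.

class Receiver:
    def __init__(self):
        self.expected = 0      # next sequence number we will deliver
        self.delivered = []    # messages passed up to the application

    def receive(self, seq, payload):
        """Process a frame; return the ACK to send back."""
        if seq == self.expected:
            self.delivered.append(payload)   # new frame: deliver it
            self.expected ^= 1               # alternate between 0 and 1
        # A duplicate (seq != expected) is re-acknowledged but not
        # redelivered, so a lost ACK cannot duplicate data upstream.
        return ("ACK", seq)

r = Receiver()
r.receive(0, "a")
r.receive(0, "a")   # duplicate caused by a retransmission
r.receive(1, "b")
print(r.delivered)  # ['a', 'b'] - the duplicate was absorbed
```

Analysing properties such as this absorption of duplicates requires reasoning about the sender and receiver running in parallel, which is the motivation for the process view introduced next.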

If no value is received from a particular participant, the algorithm should supply some default, v_def. 5.4.1 Using unsigned messages Solutions to this problem depend quite critically on the assumptions made about the system. Initially, we shall assume the following: Degree of fault-tolerance: Out of the n participants, at most t are unreliable. This defines the degree of fault tolerance required of the system. We cannot expect the protocol to work correctly if this limit is overstepped. Network properties: Every message that is sent is delivered correctly, and the receiver of a message knows who sent it. These assumptions mean that an unreliable participant cannot interfere with the message traffic between the other participants.
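As an illustrative sketch (not the book's full agreement algorithm), the collection step assumed above looks like this: gather one value per participant, substitute the default v_def for anyone who sent nothing, and reduce to a single value by majority.

```python
# Illustrative sketch only: substituting v_def for silent participants
# and taking a majority. A real Byzantine agreement protocol needs
# further rounds of message exchange; this is just the first step.

from collections import Counter

def collect(values, participants, v_def):
    """values: dict participant -> value received; missing entries get v_def."""
    return [values.get(p, v_def) for p in participants]

def majority(vals, v_def):
    """Return the most common value, falling back to v_def on a tie."""
    counts = Counter(vals).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return v_def
    return counts[0][0]

ps = ["p1", "p2", "p3", "p4"]
vals = collect({"p1": 1, "p2": 1, "p4": 1}, ps, v_def=0)  # p3 sent nothing
print(vals)                      # [1, 1, 0, 1]
print(majority(vals, v_def=0))   # 1
```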

Addressing: Hierarchical addressing. T-address formed by concatenating T-selector onto N-address. Fault tolerance: Loss or duplication of data (DT TPDUs) or acknowledgments (AK TPDUs). Whereas the ISO Class 0 protocol provides minimal functionality, and is therefore only suitable for use when the underlying network is comparatively reliable, the Class 4 protocol is designed to be resilient to a large range of potential disasters, including the arrival of spurious PDUs, PDU loss and PDU corruption. To ensure this degree of fault tolerance, the protocol uses a large number of timers, whose identifications and functions are summarised in Table 9.3.
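The retransmission-timer idea behind this style of fault tolerance can be sketched briefly (names and structure are invented here, not the ISO timer identifications of Table 9.3): a DT TPDU is retransmitted until an AK arrives or a retry limit is exceeded.

```python
# Minimal sketch of timer-driven retransmission, the core mechanism a
# Class 4 style protocol uses to survive PDU loss. Purely illustrative.

def send_with_retransmission(send, wait_for_ack, max_retries=3):
    """send() transmits the TPDU; wait_for_ack() returns True if an AK
    arrived before the retransmission timer expired."""
    for attempt in range(1 + max_retries):
        send()
        if wait_for_ack():
            return attempt          # number of retransmissions needed
    raise TimeoutError("peer unreachable: retry limit exceeded")

# Simulate a network that loses the first two acknowledgments.
acks = iter([False, False, True])
sent = []
tries = send_with_retransmission(lambda: sent.append("DT"),
                                 lambda: next(acks))
print(tries, len(sent))  # 2 3 - two timer expiries, three transmissions
```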

pages: 371 words: 78,103

Webbots, Spiders, and Screen Scrapers
by Michael Schrenk
Published 19 Aug 2009

This action sounds silly, but it is exactly what a poorly programmed webbot may do if it is expecting an available seat and has no provision to act otherwise. Types of Webbot Fault Tolerance For a webbot, fault tolerance involves adapting to changes to URLs, HTML content (which affects parsing), forms, cookie use, and network outages and congestion. We'll examine each of these aspects of fault tolerance in the following sections. Adapting to Changes in URLs Possibly the most important type of webbot fault tolerance is URL tolerance, or a webbot's ability to make valid requests for web pages under changing conditions. URL tolerance ensures that your webbot does the following: Download pages that are available on the target site Follow header redirections to updated pages Use referer values to indicate that you followed a link from a page that is still on the website Avoid Making Requests for Pages That Don't Exist Before you determine that your webbot downloaded a valid web page, you should verify that you made a valid request.
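That verification step can be sketched as a status-code triage (the function name and decisions are illustrative, not from the book's library): classify the response before trusting the page body.

```python
# Sketch of the "verify you made a valid request" step: decide what a
# fault-tolerant webbot should do based on the HTTP status, before any
# parsing happens. Names and policy are invented for illustration.

def classify_response(status, location=None):
    """Return (action, detail) for a fetched page."""
    if status in (301, 302, 303, 307, 308):
        return ("follow_redirect", location)   # page moved: chase it
    if 200 <= status < 300:
        return ("parse", None)                 # safe to parse
    if status == 404:
        return ("abort", "page no longer exists")
    return ("retry_later", f"unexpected status {status}")

print(classify_response(301, "https://example.com/new"))
# ('follow_redirect', 'https://example.com/new')
print(classify_response(404))
# ('abort', 'page no longer exists')
```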

Depending on what your webbot does and which website it targets, the identification of a webbot can lead to possible banishment from the website and the loss of a competitive advantage for your business. It's better to avoid these issues by designing fault-tolerant webbots that anticipate changes in the websites they target. Fault tolerance does not mean that everything will always work perfectly. Sometimes changes in a targeted website confuse even the most fault-tolerant webbot. In these cases, the proper thing for a webbot to do is to abort its task and report an error to its owner. Essentially, you want your webbot to fail in the same manner a person using a browser might fail.

In that regard, think about how and when people use browsers, and try to write webbots that mimic that activity. * * * [68] See Chapter 28 for more information about trespass to chattels. [69] You can find the owner of an IP address at http://www.arin.net. Chapter 25. WRITING FAULT-TOLERANT WEBBOTS The biggest complaint users have about webbots is their unreliability: Your webbots will suddenly and inexplicably fail if they are not fault tolerant, or able to adapt to the changing conditions of your target websites. This chapter is devoted to helping you write webbots that are tolerant to network outages and unexpected changes in the web pages you target.

Pragmatic.Programming.Erlang.Jul.2007
by Unknown

The author kept on and on about concurrency and distribution and fault tolerance and about a method of programming called concurrency-oriented programming—whatever that might mean. But some of the examples looked like fun. That evening the programmer looked at the example chat program. It was pretty small and easy to understand, even if the syntax was a bit strange. Surely it couldn’t be that easy. The basic program was simple, and with a few more lines of code, file sharing and encrypted conversations became possible. The programmer started typing.... What’s This All About? It’s about concurrency. It’s about distribution. It’s about fault tolerance. It’s about functional programming.

@spec unlink(Pid) -> true This removes any link between the current process and the process Pid. @spec exit(Why) -> none() This causes the current process to terminate with reason Why. If the clause that executes this statement is not within the scope of Joe Asks. . . How Can We Make a Fault-Tolerant System? To make something fault tolerant, we need at least two computers. One computer does the job, and another computer watches the first computer and must be ready to take over at a moment’s notice if the first computer fails. This is exactly how error recovery works in Erlang. One process does the job, and another process watches the first process and takes over if things go wrong.

In distributed Erlang, the process that does the job and the processes that monitor the process that does the job can be placed on physically different machines. Using this technique, we can start designing fault-tolerant software. This pattern is common. We call it the worker-supervisor model, and an entire section of the OTP libraries is devoted to building supervision trees that use this idea. The basic language primitive that makes all this possible is the link primitive. Once you understand how link works and get yourself access to two computers, then you’re well on your way to building your first fault-tolerant system. a catch statement, then the current process will broadcast an exit signal, with argument Why to all processes to which it is currently linked.
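The worker-supervisor idea is not Erlang-specific; a minimal single-machine stand-in can be sketched in a few lines (this is an illustration of the pattern, not of Erlang's link primitive or the OTP supervision trees mentioned above).

```python
# Sketch of the worker-supervisor pattern: a supervisor runs the worker
# and "takes over" - here, by restarting it - each time the worker dies,
# up to a restart limit, after which the failure is escalated.

def supervise(worker, max_restarts=3):
    """Run worker(); restart it on each crash, up to max_restarts."""
    restarts = 0
    while True:
        try:
            return worker()
        except Exception:
            restarts += 1
            if restarts > max_restarts:
                raise          # give up: escalate to our own supervisor
            # a real system would log the crash reason ('Why') here

attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("worker crashed")
    return "done"

print(supervise(flaky))  # 'done' - after two supervised restarts
```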

pages: 680 words: 157,865

Beautiful Architecture: Leading Thinkers Reveal the Hidden Beauty in Software Design
by Diomidis Spinellis and Georgios Gousios
Published 30 Dec 2008

Project Darkstar high-level architecture Unlike most replication schemes, the different copies of the game logic are not meant to process the same events. Instead, each copy can independently interact with the clients. Replication in this design is used primarily to allow scale rather than to ensure fault tolerance (although, as we will see later, fault tolerance is also achieved). Further, the game logic itself does not know or need to know that there are other copies of the server operating on other machines. The code written by the game programmer runs as if it were on a single machine, with coordination of the different copies done by the Project Darkstar infrastructure.

NIO image transfer Obviously, that leaves the problem of getting the images from the client to the server. One option we considered and rejected early was CIFS—Windows shared drives. Our main concern here was fault-tolerance, but transfer speed also worried us. These machines needed to move a lot of data back and forth, while photographers and customers were sitting around waiting. In our matrix of off-the-shelf options, nothing had the right mix of speed, parallelism, fault-tolerance, and information hiding. Reluctantly, we decided to build our own file transfer protocol, which led us into one of the most complex areas of Creation Center. Image transfer became a severe trial, but we emerged, at last, with one of the most robust features of the whole system.

Computers haven’t been around that long, of course, but here too there have been many examples of beautiful architectures in the past. As with buildings, the style doesn’t always persist, and in this chapter I describe one such architecture and consider why it had so little impact. Guardian is the operating system for Tandem’s fault-tolerant “NonStop” series of computers. It was designed in parallel with the hardware to provide fault tolerance with minimal overhead cost. This chapter describes the original Tandem machine, designed between 1974 and 1976 and shipped between 1976 and 1982. It was originally called “Tandem/16,” but after the introduction of its successor, “NonStop II,” it was retrospectively renamed “NonStop I.”

Mastering Blockchain, Second Edition
by Imran Bashir
Published 28 Mar 2018

Consensus is pluggable, and currently there are two types of ordering services available in Hyperledger Fabric: SOLO: This is a basic ordering service intended to be used for development and testing purposes. Kafka: This is an implementation based on Apache Kafka, which provides the ordering service. It should be noted that currently Kafka only provides crash fault tolerance but does not provide Byzantine fault tolerance. This is acceptable in a permissioned network where chances of malicious actors are almost none. In addition to these mechanisms, the Simple Byzantine Fault Tolerance (SBFT)-based mechanism is also under development, which will become available in the later releases of Hyperledger Fabric. Distributed ledger Blockchain and world state are the two main elements of the distributed ledger.

Now, if the update is rejected by the node, that would result in loss of availability. In that case due to partition tolerance, both availability and consistency are unachievable. This is strange because somehow blockchain manages to achieve all of these properties—or does it? This will be explained shortly. To achieve fault tolerance, replication is used. This is a standard and widely-used method to achieve fault tolerance. Consistency is achieved using consensus algorithms in order to ensure that all nodes have the same copy of the data. This is also called state machine replication. The blockchain is a means for achieving state machine replication. In general, there are two types of faults that a node can experience.
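State machine replication can be illustrated with a toy example (invented here): as long as every replica applies the same deterministic operations in the same agreed order, all copies of the state end up identical, which is exactly the consistency property the consensus algorithm is protecting.

```python
# Toy illustration of state machine replication: three replicas apply an
# identical, consensus-agreed operation log and converge to one state.

class Replica:
    def __init__(self):
        self.balance = 0

    def apply(self, op, amount):
        """Deterministic transition function shared by all replicas."""
        if op == "credit":
            self.balance += amount
        elif op == "debit":
            self.balance -= amount

log = [("credit", 100), ("debit", 30), ("credit", 5)]  # agreed order
replicas = [Replica() for _ in range(3)]
for r in replicas:
    for entry in log:
        r.apply(*entry)

print({r.balance for r in replicas})  # {75} - every node has the same state
```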

As an analogy to distributed systems, the generals can be considered nodes, the traitors as Byzantine (malicious) nodes, and the messenger can be thought of as a channel of communication among the generals. This problem was solved in 1999 by Castro and Liskov who presented the Practical Byzantine Fault Tolerance (PBFT) algorithm, where consensus is reached after a certain number of messages are received containing the same signed content. This type of inconsistent behavior of Byzantine nodes can be intentionally malicious, which is detrimental to the operation of the network. Any unexpected behavior by a node on the network, whether malicious or not, can be categorized as Byzantine.
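The standard resilience arithmetic of PBFT is small enough to show directly: tolerating f Byzantine nodes requires n >= 3f + 1 replicas, and a result is trusted once matching signed messages arrive from a quorum of 2f + 1 of them.

```python
# PBFT's well-known sizing bounds, computed for a few cluster sizes.

def max_byzantine_faults(n):
    """Largest number of Byzantine nodes a PBFT cluster of n can tolerate."""
    return (n - 1) // 3

def quorum(n):
    """Matching messages needed before a result is accepted."""
    return 2 * max_byzantine_faults(n) + 1

for n in (4, 7, 10):
    print(f"n={n}: tolerates f={max_byzantine_faults(n)}, quorum={quorum(n)}")
# n=4: tolerates f=1, quorum=3
# n=7: tolerates f=2, quorum=5
# n=10: tolerates f=3, quorum=7
```

This is why four nodes is the smallest useful PBFT cluster: with n = 3 the tolerated fault count is zero.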

Advanced Software Testing—Vol. 3, 2nd Edition
by Jamie L. Mitchell and Rex Black
Published 15 Feb 2015

The closer this metric is to one, that is, the more test cases that pass in comparison to all that should be run, the better. 4.3.2 Fault Tolerance The second subcharacteristic of reliability is fault tolerance, defined as the capability of a system to maintain a specified level of performance in case of software faults. When fault tolerance is built into the software, it often consists of extra code to avoid and/or survive and handle exceptional conditions. The more critical it is for the system to exhibit fault tolerance, the more code and complexity are likely to be added. Other terms that may be used when discussing fault tolerance are error tolerance and robustness. In the real world, it is not acceptable for a single, isolated failure to bring an entire system down.
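The "extra code" the passage describes is usually mundane: a guard that survives an exceptional condition and degrades to a safe fallback instead of crashing. A purely illustrative sketch:

```python
# Illustration of fault-tolerance code added around a core operation:
# the exceptional condition is survived and the system keeps delivering
# a (degraded) service rather than failing outright.

def tolerant_divide(a, b, fallback=0.0):
    """Division that degrades gracefully instead of failing."""
    try:
        return a / b
    except ZeroDivisionError:
        # survive the fault: degrade to the fallback rather than crash
        return fallback

print(tolerant_divide(10, 4))   # 2.5
print(tolerant_divide(10, 0))   # 0.0 - degraded but still running
```

Note how even this trivial guard adds a branch and a parameter; multiplied across a critical system, that is the complexity growth the text warns about.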

Table of Contents Jamie Mitchell’s Acknowledgements Rex Black’s Acknowledgements Introduction 1 The Technical Test Analyst’s Tasks in Risk-Based Testing 1.1 Introduction 1.2 Risk Identification 1.3 Risk Assessment 1.4 Risk Mitigation or Risk Control 1.5 An Example of Risk Identification and Assessment Results 1.6 Risk-Aware Testing Standard 1.7 Sample Exam Questions 2 Structure-Based Testing 2.1 Introduction 2.1.1 Control Flow Testing Theory 2.1.2 Building Control Flow Graphs 2.1.3 Statement Coverage 2.1.4 Decision Coverage 2.1.5 Loop Coverage 2.1.6 Hexadecimal Converter Exercise 2.1.7 Hexadecimal Converter Exercise Debrief 2.2 Condition Coverage 2.3 Decision Condition Coverage 2.4 Modified Condition/Decision Coverage (MC/DC) 2.4.1 Complicating Issues: Short-Circuiting 2.4.2 Complicating Issues: Coupling 2.5 Multiple Condition Coverage 2.5.1 Control Flow Exercise 2.5.2 Control Flow Exercise Debrief 2.6 Path Testing 2.6.1 Path Testing via Flow Graphs 2.6.2 Basis Path Testing 2.6.3 Cyclomatic Complexity Exercise 2.6.4 Cyclomatic Complexity Exercise Debrief 2.7 API Testing 2.8 Selecting a Structure-Based Technique 2.8.1 Structure-Based Testing Exercise Debrief 2.9 A Final Word on Structural Testing 2.10 Sample Exam Questions 3 Analytical Techniques 3.1 Introduction 3.2 Static Analysis 3.2.1 Control Flow Analysis 3.2.2 Data Flow Analysis 3.2.2.1 Define-Use Pairs 3.2.2.2 Define-Use Pair Example 3.2.2.3 Data Flow Exercise 3.2.2.4 Data Flow Exercise Debrief 3.2.2.5 A Data Flow Strategy 3.2.3 Static Analysis to Improve Maintainability 3.2.3.1 Code Parsing Tools 3.2.3.2 Standards and Guidelines 3.2.4 Call Graphs 3.2.4.1 Call-Graph-Based Integration Testing 3.2.4.2 McCabe’s Design Predicate Approach to Integration 3.2.4.3 Hex Converter Example 3.2.4.4 McCabe Design Predicate Exercise 3.2.4.5 McCabe Design Predicate Exercise Debrief 3.3 Dynamic Analysis 3.3.1 Memory Leak Detection 3.3.2 Wild Pointer Detection 3.3.3 Dynamic Analysis Exercise 3.3.4 Dynamic Analysis Exercise 
Debrief 3.4 Sample Exam Questions 4 Quality Characteristics for Technical Testing 4.1 Introduction 4.2 Security Testing 4.2.1 Security Issues 4.2.1.1 Piracy 4.2.1.2 Buffer Overflow 4.2.1.3 Denial of Service 4.2.1.4 Data Transfer Interception 4.2.1.5 Breaking Encryption 4.2.1.6 Logic Bombs/Viruses/Worms 4.2.1.7 Cross-Site Scripting 4.2.1.8 Timely Information 4.2.1.9 Internal Security Metrics 4.2.1.10 External Security Metrics 4.2.1.11 Exercise: Security 4.2.1.12 Exercise: Security Debrief 4.3 Reliability Testing 4.3.1 Maturity 4.3.1.1 Internal Maturity Metrics 4.3.1.2 External Maturity Metrics 4.3.2 Fault Tolerance 4.3.2.1 Internal Fault Tolerance Metrics 4.3.2.2 External Fault Tolerance Metrics 4.3.3 Recoverability 4.3.3.1 Internal Recoverability Metrics 4.3.3.2 External Recoverability Metrics 4.3.4 Compliance 4.3.4.1 Internal Compliance Metrics 4.3.4.2 External Compliance Metrics 4.3.5 An Example of Good Reliability Testing 4.3.6 Exercise: Reliability Testing 4.3.7 Exercise: Reliability Testing Debrief 4.4 Efficiency Testing 4.4.1 Multiple Flavors of Efficiency Testing 4.4.2 Modeling the System 4.4.2.1 Identify the Test Environment 4.4.2.2 Identify the Performance Acceptance Criteria 4.4.2.3 Plan and Design Tests 4.4.2.4 Configure the Test Environment 4.4.2.5 Implement the Test Design 4.4.2.6 Execute the Test 4.4.2.7 Analyze the Results, Tune and Retest 4.4.3 Time Behavior 4.4.3.1 Internal Time Behavior Metrics 4.4.3.2 External Time Behavior Metrics 4.4.4 Resource Utilization 4.4.4.1 Internal Resource Utilization Metrics 4.4.4.2 External Resource Utilization Metrics 4.4.5 Compliance 4.4.5.1 Internal Compliance Metric 4.4.5.2 External Compliance Metric 4.4.6 Exercise: Efficiency Testing 4.4.7 Exercise: Efficiency Testing Debrief 4.5 Maintainability Testing 4.5.1 Analyzability 4.5.1.1 Internal Analyzability Metrics 4.5.1.2 External Analyzability Metrics 4.5.2 Changeability 4.5.2.1 Internal Changeability Metrics 4.5.2.2 External Changeability Metrics 4.5.3 Stability 
4.5.3.1 Internal Stability Metrics 4.5.3.2 External Stability Metrics 4.5.4 Testability 4.5.4.1 Internal Testability Metrics 4.5.4.2 External Testability Metrics 4.5.5 Compliance 4.5.5.1 Internal Compliance Metric 4.5.5.2 External Compliance Metric 4.5.6 Exercise: Maintainability Testing 4.5.7 Exercise: Maintainability Testing Debrief 4.6 Portability Testing 4.6.1 Adaptability 4.6.1.1 Internal Adaptability Metrics 4.6.1.2 External Adaptability Metrics 4.6.2 Replaceability 4.6.2.1 Internal Replaceability Metrics 4.6.2.2 External Replaceability Metrics 4.6.3 Installability 4.6.3.1 Internal Installability Metrics 4.6.3.2 External Installability Metrics 4.6.4 Coexistence 4.6.4.1 Internal Coexistence Metrics 4.6.4.2 External Coexistence Metrics 4.6.5 Compliance 4.6.5.1 Internal Compliance Metrics 4.6.5.2 External Compliance Metrics 4.6.6 Exercise: Portability Testing 4.6.7 Exercise: Portability Testing Debrief 4.7 General Planning Issues 4.8 Sample Exam Questions 5 Reviews 5.1 Introduction 5.2 Using Checklists in Reviews 5.2.1 Some General Checklist Items for Design and Architecture Reviews 5.2.2 Deutsch’s Design Review Checklist 5.2.3 Some General Checklist Items for Code Reviews 5.2.4 Marick’s Code Review Checklist 5.2.5 The Open Laszlo Code Review Checklist 5.3 Deutsch Checklist Review Exercise 5.4 Deutsch Checklist Review Exercise Debrief 5.5 Code Review Exercise 5.6 Code Review Exercise Debrief 5.7 Sample Exam Questions 6 Test Tools and Automation 6.1 Integration and Information Interchange between Tools 6.2 Defining the Test Automation Project 6.2.1 Preparing for a Test Automation Project 6.2.2 Why Automation Projects Fail 6.2.3 Automation Architectures (Data Driven vs.

In the real world, it is not acceptable for a single, isolated failure to bring an entire system down. Undoubtedly, certain failures may degrade the performance of the system, but the ability to keep delivering services—even in degraded terms—is a required feature that should be tested. Negative testing is often used to test fault tolerance. We partially or fully degrade the system operation via testing while measuring specific performance metrics. Fault tolerance tends to be tested at each phase of testing. During unit test we should test error and exception handling with interface values that are erroneous, including out of range, poorly formed, or semantically incorrect. During integration test we should test incorrect inputs from user interface, files, or devices.
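A small sketch of negative testing at the unit level (the unit under test, parse_percentage, is invented for illustration): feed erroneous interface values and check that every one fails in a controlled way.

```python
# Negative-testing sketch: out-of-range and poorly formed inputs must be
# rejected cleanly, never crash the unit or corrupt its state.

def parse_percentage(text):
    """Unit under test: accepts '0'..'100', rejects everything else."""
    try:
        value = int(text)
    except (TypeError, ValueError):
        raise ValueError(f"poorly formed input: {text!r}")
    if not 0 <= value <= 100:
        raise ValueError(f"out of range: {value}")
    return value

# The negative tests themselves: each bad input must raise ValueError.
for bad in ["abc", None, "101", "-1"]:
    try:
        parse_percentage(bad)
        print("FAIL: accepted", repr(bad))
    except ValueError:
        print("ok: rejected", repr(bad))
```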

pages: 931 words: 79,142

Concepts, Techniques, and Models of Computer Programming
by Peter Van-Roy and Seif Haridi
Published 15 Feb 2004

This solution is harder to implement, but can make the program much simpler. Raising an exception corresponds to aborting the transaction. A third motivation is fault tolerance. Lightweight transactions are important for writing fault-tolerant applications. With respect to a component, e.g., an application doing a transaction, we define a fault as incorrect behavior in one of its subcomponents. Ideally, the application should continue to behave correctly when there are faults, i.e., it should be fault tolerant. When a fault occurs, a fault-tolerant application has to take three steps: (1) detect the fault, (2) contain the fault in a limited part of the application, and (3) repair any problems caused by the fault.
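The three steps can be sketched as a lightweight transaction (an illustration of the idea, not the book's Oz code): an exception detects the fault, the transaction boundary contains it, and rolling back to a snapshot repairs the state.

```python
# Sketch of detect / contain / repair via a lightweight transaction:
# raising an exception aborts the transaction and restores the snapshot.

import copy

def run_transaction(state, body):
    """Run body(state); on any fault, restore the pre-transaction state."""
    snapshot = copy.deepcopy(state)      # taken so repair is possible
    try:
        body(state)                      # a raised exception = detection
        return state
    except Exception:
        state.clear()                    # containment: drop partial work
        state.update(snapshot)           # repair: roll back
        return state

account = {"balance": 100}

def bad_transfer(s):
    s["balance"] -= 150
    if s["balance"] < 0:
        raise RuntimeError("overdraft")  # abort the transaction

print(run_transaction(account, bad_transfer))  # {'balance': 100}
```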

For this exercise, write an abstraction for a replicated server that hides all the fault-handling activities from the clients. 5. (advanced exercise) Fault tolerance and synchronous communication. Section 11.10 says that synchronous communication makes fault confinement easier. Section 5.7 says that asynchronous communication helps keep concurrent components independent, which is important when building fault tolerance abstractions. For this exercise, reconcile these two principles by studying the architecture of fault tolerant applications. 12 Constraint Programming by Peter Van Roy, Raphaël Collet, and Seif Haridi Plans within plans within plans within plans. – Dune, Frank Herbert (1920–1986) Constraint programming consists of a set of techniques for solving constraint satisfaction problems.

Its implementation, the Ericsson OTP (Open Telecom Platform), features fine-grained concurrency (efficient threads), extreme reliability (high-performance software fault tolerance), and hot code replacement ability (update software while the system is running). It is a high-level language that hides the internal representation of data and does automatic memory management. It has been used successfully in several Ericsson products. 5.7.1 Computation model The Erlang computation model has an elegant layered structure. We first explain the model and then we show how it is extended for distribution and fault tolerance. The Erlang computation model consists of concurrent entities called “processes.”

pages: 496 words: 70,263

Erlang Programming
by Francesco Cesarini

This should record enough information to enable billing for the use of the phone. CHAPTER 6 Process Error Handling Whatever the programming language, building distributed, fault-tolerant, and scalable systems with requirements for high availability is not for the faint of heart. Erlang’s reputation for handling the fault-tolerant and high-availability aspects of these systems has its foundations in the simple but powerful constructs built into the language’s concurrency model. These constructs allow processes to monitor each other’s behavior and to recover from software faults. They give Erlang a competitive advantage over other programming languages, as they facilitate development of the complex architecture that provides the required fault tolerance through isolating errors and ensuring nonstop operation.

In conjunction with these projects, the OTP framework was developed and released in 1996. OTP provides a framework to structure Erlang systems, offering robustness and fault tolerance together with a set of tools and libraries. The history of Erlang is important in understanding its philosophy. Although many languages were developed before finding their niche, Erlang was developed to solve the “time-to-market” requirements of distributed, fault-tolerant, massively concurrent, soft real-time systems. The fact that web services, retail and commercial banking, computer telephony, messaging systems, and enterprise integration, to mention but a few, happen to share the same requirements as telecom systems explains why Erlang is gaining headway in these sectors.

A typical example here is a web server: if you are planning a new release of a piece of software, or you are planning to stream video of a football match in real time, distributing the server across a number of machines will make this possible without failure. This performance is given by replication of a service—in this case a web server—which is often found in the architecture of a distributed system.
• Replication also provides fault tolerance: if one of the replicated web servers fails or becomes unavailable for some reason, HTTP requests can still be served by the other servers, albeit at a slower rate. This fault tolerance allows the system to be more robust and reliable.
• Distribution allows transparent access to remote resources, and building on this, it is possible to federate a collection of different systems to provide an overall user service.
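From the client's side, replication-based fault tolerance reduces to a failover loop (a generic sketch, not from the book): try each replicated server in turn, so one failed replica only slows a request down instead of losing it.

```python
# Client-side failover across replicated servers. Servers are modeled as
# callables; the behavior and names are invented for illustration.

def fetch_with_failover(servers, request):
    """Return the first successful reply; raise only if all replicas fail."""
    last_error = None
    for server in servers:
        try:
            return server(request)
        except ConnectionError as e:
            last_error = e           # this replica is down; try the next
    raise last_error                 # every replica failed

def down(req):
    raise ConnectionError("replica unreachable")

def up(req):
    return f"200 OK: {req}"

print(fetch_with_failover([down, up], "/index.html"))
# '200 OK: /index.html' - served despite the first replica being down
```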

pages: 570 words: 115,722

The Tangled Web: A Guide to Securing Modern Web Applications
by Michal Zalewski
Published 26 Nov 2011

Vendors released their products with embedded programming languages such as JavaScript and Visual Basic, plug-ins to execute platform-independent Java or Flash applets on the user’s machine, and useful but tricky HTTP extensions such as cookies. Only a limited degree of superficial compatibility, sometimes hindered by patents and trademarks,[7] would be maintained. As the Web grew larger and more diverse, a sneaky disease spread across browser engines under the guise of fault tolerance. At first, the reasoning seemed to make perfect sense: If browser A could display a poorly designed, broken page but browser B refused to (for any reason), users would inevitably see browser B’s failure as a bug in that product and flock in droves to the seemingly more capable client, browser A.

One such example is the advice on parsing dates in certain HTTP headers, at the request of section 3.3 in RFC 1945. The resulting implementation (the prtime.c file in the Firefox codebase[118]) consists of close to 2,000 lines of extremely confusing and unreadable C code just to decipher the specified date, time, and time zone in a sufficiently fault-tolerant way (for uses such as deciding cache content expiration). Semicolon-Delimited Header Values Several HTTP headers, such as Cache-Control or Content-Disposition, use a semicolon-delimited syntax to cram several separate name=value pairs into a single line. The reason for allowing this nested notation is unclear, but it is probably driven by the belief that it will be a more efficient or a more intuitive approach than using several separate headers that would always have to go hand in hand.
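As a rough illustration of the semicolon-delimited syntax, here is a deliberately naive parser in Python; it ignores quoted semicolons and the other edge cases this chapter warns about:

```python
# Naive parser for semicolon-delimited HTTP header values such as
#   Content-Disposition: attachment; filename="report.pdf"
# Deliberately simplistic: real headers permit quoted strings containing
# semicolons, which this sketch does not handle.

def parse_semicolon_header(value):
    parts = [p.strip() for p in value.split(";")]
    main = parts[0]                      # the primary value, e.g. "attachment"
    params = {}
    for p in parts[1:]:
        name, _, val = p.partition("=")
        params[name.strip()] = val.strip().strip('"')
    return main, params

main, params = parse_semicolon_header('attachment; filename="report.pdf"')
print(main, params)   # attachment {'filename': 'report.pdf'}
```

A fault-tolerant real-world parser must decide what to do with malformed input like `attachment; filename=a;b.txt`, and it is precisely those recovery decisions that differ between implementations.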

Escape or substitute these values as appropriate. When building a new HTTP client, server, or proxy: Do not create a new implementation unless you absolutely have to. If you can’t help it, read this chapter thoroughly and aim to mimic an existing mainstream implementation closely. If possible, ignore the RFC-provided advice about fault tolerance and bail out if you encounter any syntax ambiguities. * * * [24] Public key cryptography relies on asymmetrical encryption algorithms to create a pair of keys: a private one, kept secret by the owner and required to decrypt messages, and a public one, broadcast to the world and useful only to encrypt traffic to that recipient, not to decrypt it.

pages: 1,085 words: 219,144

Solr in Action
by Trey Grainger and Timothy Potter
Published 14 Sep 2014

We only point this out because it’s a testament to the depth and breadth of automated testing in Lucene and Solr. If you have a nightly build off trunk in which all the automated tests pass, then you can be fairly confident that the core functionality is solid. We’ve touched on Solr’s approach to scalability and fault tolerance in sections 1.2.6 and 1.2.7. As an architect, you’re probably most curious about the limitations of Solr’s approach to scalability and fault tolerance. First, you should realize that the sharding and replication features in Solr have been improved in Solr 4 to be robust and easier to manage. The new approach to scaling is called SolrCloud. Under the covers, SolrCloud uses Apache ZooKeeper to distribute configurations across a cluster of Solr servers and to keep track of cluster state.

You may want to use replication either when you want to isolate indexing from searching operations to different servers within your cluster or when you need to increase available queries-per-second capacity. Fault tolerance It’s great that we can increase our overall query capacity by adding another server and replicating the index to that server, but what happens when one of our servers eventually crashes? When our application had only one server, the application clearly would have stopped. Now that multiple, redundant servers exist, one server dying will simply reduce our capacity back to the capacity of however many servers remain. If you want to build fault tolerance into your system, it’s a good idea to have additional resources (extra slave servers) in your cluster so that your system can continue functioning with enough capacity even if a single server fails.
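The spare-capacity advice above reduces to simple arithmetic; a back-of-the-envelope sketch (the function name and numbers are illustrative, not from the book):

```python
# If each server handles qps_per_server queries/sec, how many servers are
# needed so capacity still meets peak demand after "spare" servers fail?
import math

def servers_needed(peak_qps, qps_per_server, spare=1):
    return math.ceil(peak_qps / qps_per_server) + spare

print(servers_needed(peak_qps=900, qps_per_server=400))  # 4: 3 for load + 1 spare
```

With the extra replica, losing any single server still leaves enough capacity to serve the peak load.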

Chapter 13 will demonstrate how you can delegate most of the fault tolerance and distributed query routing concerns to SolrCloud to make scaling Solr a more manageable process. Regardless of which scaling approach you take, it can be useful to understand how to interact with your Solr cores without having to restart Solr to make changes. The next section will dive into Solr’s Core Admin API, which provides rich features for managing Solr cores. 12.6. Solr core management In the previous section we discussed how to scale Solr using sharding (for large document sets) and replication (for fault tolerance and query load). Solr contains a suite of capabilities collectively called SolrCloud (covered in depth in the next chapter) that makes managing collections of documents through shards and replicas somewhat seamless.

pages: 161 words: 44,488

The Business Blockchain: Promise, Practice, and Application of the Next Internet Technology
by William Mougayar
Published 25 Apr 2016

Leslie Lamport, Robert Shostak, and Marshall Pease, The Byzantine Generals Problem. http://research.microsoft.com/en-us/um/people/lamport/pubs/byz.pdf. 6. IT Does not Matter, https://hbr.org/2003/05/it-doesnt-matter. 7. PayPal website, https://www.paypal.com/webapps/mpp/about. 8. Personal communication with Vitalik Buterin, February 2016. 9. Byzantine fault tolerance, https://en.wikipedia.org/wiki/Byzantine_fault_tolerance. 10. Proof-of-stake, https://en.wikipedia.org/wiki/Proof-of-stake. 2 HOW BLOCKCHAIN TRUST INFILTRATES “I cannot understand why people are frightened of new ideas. I’m frightened of the old ones.” –JOHN CAGE REACHING CONSENSUS is at the heart of a blockchain’s operations.

In part, the continuation of some of the trends in crypto 2.0, and particularly generalized protocols that provide both computational abstraction and privacy. But equally important is the current technological elephant in the room in the blockchain sphere: scalability. Currently, all existing blockchain protocols have the property that every computer in the network must process every transaction—a property that provides extreme degrees of fault tolerance and security, but at the cost of ensuring that the network's processing power is effectively bounded by the processing power of a single node. Crypto 3.0—at least in my mind—consists of approaches that move beyond this limitation, in one of various ways to create systems that break through this limitation and actually achieve the scale needed to support mainstream adoption (technically astute readers may have heard of “lightning networks,” “state channels,” and “sharding”).

Game theory is “the study of mathematical models of conflict and cooperation between intelligent rational decision-makers.”4 And this is related to the blockchain because the Bitcoin blockchain, originally conceived by Satoshi Nakamoto, had to solve a known game theory conundrum called the Byzantine Generals Problem.5 Solving that problem consists in mitigating any attempts by a small number of unethical Generals who would otherwise become traitors, and lie about coordinating their attack to guarantee victory. This is accomplished by enforcing a process for verifying the work that was put into crafting these messages, and time-limiting the requirement for seeing untampered messages in order to ensure their validity. Implementing a “Byzantine Fault Tolerance” is important because it starts with the assumption that you cannot trust anyone, and yet it delivers assurance that the transaction has traveled and arrived safely based on trusting the network during its journey, while surviving potential attacks. There are fundamental implications for this new method of reaching safety in the finality of a transaction, because it questions the existence and roles of current trusted intermediaries, who held the traditional authority on validating transactions.

pages: 434 words: 77,974

Mastering Blockchain: Unlocking the Power of Cryptocurrencies and Smart Contracts
by Lorne Lantz and Daniel Cawrey
Published 8 Dec 2020

The following are some of the companies involved, and their roles: Payments: PayU Technology: Facebook, FarFetch, Lyft, Spotify, Uber Telecom: Iliad Blockchain: Anchorage, BisonTrails, Coinbase, Xapo Venture capital: Andreessen Horowitz, Breakthrough Initiatives, Union Square Ventures, Ribbit Capital, Thrive Capital Nonprofits: Creative Destruction Lab, Kiva, Mercy Corps, Women’s World Banking Borrowing from Existing Blockchains The Libra Association intends to create an entirely new payments system on the internet by using a proof-of-stake consensus Byzantine fault-tolerant algorithm developed by VMware, known as HotStuff. The association’s members will be the validators of the system. HotStuff uses a lead validator. It accepts transactions from the clients and uses a voting mechanism for validation. It is fault tolerant because the other validators can take the lead’s place in case of error or downtime. Byzantine fault tolerance is used in other blockchain systems, most notably on some smaller open networks utilizing proof-of-stake. Figure 9-7 illustrates Libra’s consensus mechanism.

While early on Ripple was an open source competitor to Bitcoin, with third-party “gateways” that functioned as a method of anonymous exchange, in 2014 the company pivoted to supporting banks as a faster and cheaper settlement network with a cross-border focus. Instead of using traditional proof-of-work, Ripple introduced a new type of consensus known as the XRP Consensus Protocol. It uses Byzantine fault-tolerant agreement, which requires nodes to come to agreement on transactions. Ripple has hundreds of partnerships with various companies in the banking and payments sectors. The best-known strategic partnership is with the money remittances company MoneyGram, in which Ripple has made a $50 million equity investment.

VMware With support for the EVM, DAML, and Hyperledger, VMware Blockchain is a multiblockchain platform. Developers are also able to use VMware’s cloud technology to set up various types of infrastructure implementations, including the option of hybrid cloud capabilities to increase security and privacy. It also uses a Byzantine fault-tolerant consensus engine to provide features of decentralization. Oracle Oracle’s Blockchain Platform is built on Hyperledger Fabric and supports multicloud implementations—hybrid, on-premise, or a mix of the two for greater flexibility. The purpose is to be able to configure specific environments depending on regulatory requirements.

Scala in Action
by Nilanjan Raychaudhuri
Published 27 Mar 2012

So many things can go wrong in the concurrent/ parallel programming world. What if we get an IOException while reading the file? Let’s learn how to handle faults in an actor-based application. 9.3.4. Fault tolerance made easy with a supervisor Akka encourages nondefensive programming in which failure is a valid state in the lifecycle of an application. As a programmer you know you can’t prevent every error, so it’s better to prepare your application for the errors. You can easily do this through fault-tolerance support provided by Akka through the supervisor hierarchy. Think of this supervisor as an actor that links to supervised actors and restarts them when one dies.

You can have one supervisor linked to another supervisor. That way you can supervise a supervisor in case of a crash. It’s hard to build a fault-tolerant system with one box, so I recommend having your supervisor hierarchy spread across multiple machines. That way, if a node (machine) is down, you can restart an actor in a different box. Always remember to delegate the work so that if a crash occurs, another supervisor can recover. Now let’s look into the fault-tolerant strategies available in Akka. Supervision Strategies in Akka Akka comes with two restarting strategies: One-for-One and All-for-One.
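Outside of Akka, the One-for-One idea can be sketched in a few lines; a minimal Python analogue (Akka's actual supervision API differs, and the class names here are invented):

```python
# Minimal analogue of a One-for-One supervisor: when a worker dies,
# only that worker is restarted; repeated crashes escalate upward.

class Worker:
    def handle(self, msg):
        if msg == "boom":
            raise IOError("worker crashed")
        return f"handled {msg}"

class OneForOneSupervisor:
    def __init__(self, worker_factory, max_restarts=3):
        self.factory = worker_factory
        self.worker = worker_factory()
        self.max_restarts = max_restarts
        self.restarts = 0

    def deliver(self, msg):
        try:
            return self.worker.handle(msg)
        except IOError:
            if self.restarts >= self.max_restarts:
                raise          # escalate to this supervisor's own supervisor
            self.restarts += 1
            self.worker = self.factory()   # restart just the failed worker
            return None                    # the crashing message is dropped

sup = OneForOneSupervisor(Worker)
sup.deliver("boom")           # worker crashes and is restarted
print(sup.deliver("hello"))   # handled hello
```

An All-for-One strategy would restart every supervised worker on any failure; the escalation path (raising past max_restarts) is what lets supervisors themselves be supervised, as described above.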

First I’ll talk about the philosophy behind Akka so you understand the goal behind the Akka project and the problems it tries to solve. 12.1. The philosophy behind Akka The philosophy behind Akka is simple: make it easier for developers to build correct, concurrent, scalable, and fault-tolerant applications. To that end, Akka provides a higher level of abstractions to deal with concurrency, scalability, and faults. Figure 12.1 shows the three core modules provided by Akka for concurrency, scalability, and fault tolerance. Figure 12.1. Akka core modules The concurrency module provides options to solve concurrency-related problems. By now I’m sure you’re comfortable with actors (message-oriented concurrency).

pages: 250 words: 73,574

Nine Algorithms That Changed the Future: The Ingenious Ideas That Drive Today's Computers
by John MacCormick and Chris Bishop
Published 27 Dec 2011

At the time of writing, however, many of the systems that claim to be peer-to-peer in fact use central servers for some of their functionality and thus do not need to rely on distributed hash tables. The technique of “Byzantine fault tolerance” falls in the same category: a surprising and beautiful algorithm that can't yet be classed as great, due to lack of adoption. Byzantine fault tolerance allows certain computer systems to tolerate any type of error whatsoever (as long as there are not too many simultaneous errors). This contrasts with the more usual notion of fault tolerance, in which a system can survive more benign errors, such as the permanent failure of a disk drive or an operating system crash.

The to-do list trick also guarantees consistency in the face of failures. When combined with the prepare-then-commit trick for replicated databases, we are left with iron-clad consistency and durability for our data. The heroic triumph of databases over unreliable components, known by computer scientists as “fault-tolerance,” is the work of many researchers over many decades. But among the most important contributors was Jim Gray, a superb computer scientist who literally wrote the book on transaction processing. (The book is Transaction Processing: Concepts and Techniques, first published in 1992.) Sadly, Gray's career ended early: one day in 2007, he sailed his yacht out of San Francisco Bay, under the Golden Gate Bridge, and into the open ocean on a planned day trip to some nearby islands.

See also certification authority authority trick B-tree Babylonia backup bank; account number; balance; for keys; online banking; for signatures; transfer; as trusted third party base, in exponentiation Battelle, John Bell Telephone Company binary Bing biology biometric sensor Bishop, Christopher bit block cipher body, of a web page brain Brin, Sergey British government browser brute force bug Burrows, Mike Bush, Vannevar Businessweek Byzantine fault tolerance C++ programming language CA. See certification authority calculus Caltech Cambridge CanCrash.exe CanCrashWeird.exe Carnegie Mellon University CD cell phone. See phone certificate certification authority Charles Babbage Institute chat-bot checkbook checksum; in practice; simple; staircase.

pages: 422 words: 86,414

Hands-On RESTful API Design Patterns and Best Practices
by Harihara Subramanian
Published 31 Jan 2019

Thus, API gateway clustering is important for continuously receiving and responding to service messages and the Load Balancer plays a vital role in fulfilling this, as illustrated in the following diagram: High availability and failover In the era of microservices, guaranteeing high availability through fault tolerance, fault detection, and isolation is an important concern for architects. In the recent past, API gateway solutions emerged as a critical component for the microservices era. Microservice architecture is being touted as the soul and savior for facilitating the mandated business adaptivity, process optimization, and automation.

The orchestration/choreography, brokerage, discovery, routing, enrichment, policy enforcement, governance, concierge jobs, and so on are performed by standardized API gateway solutions. On the other hand, API management adds additional capabilities such as analytics and life cycle management. In future, there will be attempts to meet QoS and NFRs such as availability, scalability, high performance/throughput, security, and reliability through replication and fault tolerance, through a combination of API gateways, cluster and orchestration platforms, service mesh solutions, and so on. API gateways for microservice-centric applications The unique contributions of API gateways for operationalizing microservices in a beneficial fashion are growing as days pass.

This strategically-sound transition ultimately enables them to be innately smart in their actions and reactions. And this grandiose and technology-inspired transformation of everyday elements and entities in our daily environments leads to the timely formulation and delivery of service-oriented, event-driven, insight-filled, cloud-enabled, fault-tolerant, mission-critical, multifaceted, and people-centric services. The role of the powerful RESTful paradigm in building and providing these kinds of advanced and next-generation services is steadily growing. This chapter is specially crafted to tell you all about the contributions of the RESTful services paradigm toward designing, developing, and deploying next-generation microservices-centric and enterprise-scale applications.

pages: 412 words: 104,864

Silence on the Wire: A Field Guide to Passive Reconnaissance and Indirect Attacks
by Michal Zalewski
Published 4 Apr 2005

Spanning tree protocol (STP) Lets you build redundant network structures in which switches are interconnected in more than one location, in order to maintain fault tolerance. Traditionally, such a design could cause broadcast traffic and some other packets to loop forever while also causing network performance to deteriorate significantly, because the data received on one interface and forwarded to another in effect bounces back to the originator (see Figure 7-2, left). When designing a network, it is often difficult to avoid accidental broadcast loops. It is also sometimes desirable to design architectures with potential loops (in which one switch connects to two or more switches), because this type of design is much more fault tolerant and a single device or single link can be taken out without dividing the entire network into two separate islands.

A stateful NAT mechanism can be used, among other applications, to implement fault-tolerant setups in which a single, publicly accessible IP address is served by more than one internal server. Or to save address space and improve security, NAT can be implemented to allow the internal network to use a pool of private, not publicly accessible, addresses, while enabling hosts on the network to communicate with the Internet by “masquerading” as a single public IP machine. In the first scenario, NAT rewrites destination addresses on incoming packets to a number of private systems behind the firewall. This provides a fault-tolerant load-balancing setup, in which subsequent requests to a popular website (http://www.microsoft.com, perhaps) or other critical service can be distributed among an array of systems, and if any one fails, other systems can take over.
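The load-balancing flavor of NAT described here can be sketched as a round-robin rewrite of destination addresses; a toy model, with illustrative addresses and names:

```python
# Sketch of fault-tolerant NAT load balancing: packets addressed to one
# public IP are rewritten round-robin to a pool of internal servers, and
# a failed server is skipped. Addresses are illustrative only.
import itertools

class NatBalancer:
    def __init__(self, backends):
        self.backends = backends            # internal server addresses
        self.alive = set(backends)
        self._cycle = itertools.cycle(backends)

    def mark_down(self, addr):
        self.alive.discard(addr)

    def rewrite(self, packet):
        """Return the packet with its destination rewritten to a live backend."""
        for _ in range(len(self.backends)):
            backend = next(self._cycle)
            if backend in self.alive:
                return {**packet, "dst": backend}
        raise RuntimeError("no live backends")

nat = NatBalancer(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
nat.mark_down("10.0.0.2")
pkt = {"src": "198.51.100.7", "dst": "203.0.113.1", "payload": "GET /"}
print(nat.rewrite(pkt)["dst"])   # 10.0.0.1
print(nat.rewrite(pkt)["dst"])   # 10.0.0.3 (the failed 10.0.0.2 is skipped)
```

A real stateful NAT would also track connections so that every packet of an established flow keeps going to the same backend; this sketch shows only the rewrite-and-skip-failures idea.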

Based on the result of this election, a treelike traffic distribution hierarchy is built from this node down, and links that could cause a reverse propagation of broadcast traffic are temporarily disabled (see Figure 7-2, right). You can quickly change this simple self-organizing hierarchy when one of the nodes drops off and reactivate a link previously deemed unnecessary. Figure 7-2. Packet storm problem and STP election scheme; left side shows a fault-tolerant network with no STP, where some packets are bound to loop (almost) forever between switches; right side is the same network where one of the devices was automatically elected a master node using STP, and for which the logical topology was adjusted to eliminate loops. When one of the links fails, the network would be reconfigured to ensure proper operations

pages: 305 words: 89,103

Scarcity: The True Cost of Not Having Enough
by Sendhil Mullainathan
Published 3 Sep 2014

Skipping class in a training program while you’re dealing with scarcity is not the same as playing hooky in middle school. Linear classes that must not be missed can work well for the full-time student; they do not make sense for the juggling poor. It is important to emphasize that fault tolerance is not a substitute for personal responsibility. On the contrary: fault tolerance is a way to ensure that when the poor do take it on themselves, they can improve—as so many do. Fault tolerance allows the opportunities people receive to match the effort they put in and the circumstances they face. It does not take away the need for hard work; rather, it allows hard work to yield better returns for those who are up for the challenge, just as improved levers in the cockpit allow the dedicated pilot to excel.

But why not look at the design of the cockpit rather than the workings of the pilot? Why not look at the structure of the programs rather than the failings of the clients? If we accept that pilots can fail and that cockpits need to be wisely structured so as to inhibit those failures, why can we not do the same with the poor? Why not design programs structured to be more fault tolerant? We could ask the same question of anti-poverty programs. Consider the training programs, where absenteeism is common and dropout rates are high. What happens when, loaded and depleted, a client misses a class? What happens when her mind wanders in class? The next class becomes a lot harder.

Now roll forward a few more weeks. By now you’ve missed another class. And when you go, you understand less than before. Eventually you decide it’s just too much right now; you’ll drop out and sign up another time, when your financial life is more together. The program you tried was not designed to be fault tolerant. It magnified your mistakes, which were predictable, and essentially pushed you out the door. But it need not be that way. Instead of insisting on no mistakes or for behavior to change, we can redesign the cockpit. Curricula can be altered, for example, so that there are modules, staggered to start at different times and to proceed in parallel.

pages: 523 words: 112,185

Doing Data Science: Straight Talk From the Frontline
by Cathy O'Neil and Rachel Schutt
Published 8 Oct 2013

Enter MapReduce In 2004 Jeff and Sanjay published their paper “MapReduce: Simplified Data Processing on Large Clusters” (and here’s another one on the underlying filesystem). MapReduce allows us to stop thinking about fault tolerance; it is a platform that does the fault tolerance work for us. Programming 1,000 computers is now easier than programming 100. It’s a library to do fancy things. To use MapReduce, you write two functions: a mapper function, and then a reducer function. It takes these functions and runs them on many machines that are local to your stored data. All of the fault tolerance is automatically done for you once you’ve placed the algorithm into the map/reduce framework. The mapper takes each data point and produces an ordered pair of the form (key, value).
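The two-function contract can be illustrated with a tiny in-memory word count in Python; this is a sketch of the programming model only, not of Google's distributed implementation:

```python
# Toy word count in the map/reduce style: the mapper emits (key, value)
# pairs, the framework groups values by key, and the reducer combines them.
from collections import defaultdict

def mapper(line):
    for word in line.split():
        yield (word, 1)

def reducer(word, counts):
    return (word, sum(counts))

def run_mapreduce(lines):
    # Shuffle phase: group all mapper outputs by key. The real framework
    # does this across many machines, with replication for fault tolerance.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())

print(run_mapreduce(["to be or not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The point of the framework is that only mapper and reducer are yours to write; distribution, retries of failed machines, and regrouping by key all happen behind this interface.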

If we denote by X the variable that exhibits whether a given computer is working, so X = 1 means it works and X = 0 means it’s broken, then we can assume: P(X = 1) = 1 − ε. But this means, when we have 1,000 computers, the chance that no computer is broken is (1 − ε)^1000, which is generally pretty small even if ε is small. So if ε = 0.001 for each individual computer, then the probability that all 1,000 computers work is 0.999^1000 ≈ 0.37, less than even odds. This isn’t sufficiently robust. What to do? We address this problem by talking about fault tolerance for distributed work. This usually involves replicating the input (the default is to have three copies of everything), and making the different copies available to different machines, so if one blows, another one will still have the good data. We might also embed checksums in the data, so the data itself can be audited for errors, and we will automate monitoring by a controller machine (or maybe more than one?).
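The arithmetic behind the 0.37 figure is easy to check; a quick sketch, taking the per-machine failure probability to be 0.001, which reproduces the number quoted above:

```python
# Probability that all n machines work, when each machine independently
# works with probability 1 - eps.
def p_all_work(n, eps):
    return (1 - eps) ** n

print(round(p_all_work(1, 0.001), 3))      # 0.999 -- one machine is fine
print(round(p_all_work(1000, 0.001), 2))   # 0.37  -- a thousand is a coin flip
```

Even a very reliable individual machine makes an unreplicated thousand-machine job unreliable, which is why the replication and checksumming described above are needed.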

To add efficiency, when some machines finish, we should use the excess capacity to rerun work, again checking for errors. Note Q: Wait, I thought we were counting things?! This seems like some other awful rat’s nest we’ve gotten ourselves into. A: It’s always like this. You cannot reason about the efficiency of fault tolerance easily; everything is complicated. And note, efficiency is just as important as correctness, because a thousand computers are worth more than your salary. It’s like this: The first 10 computers are easy; The first 100 computers are hard; and The first 1,000 computers are impossible. There’s really no hope.

pages: 589 words: 147,053

The Age of Em: Work, Love and Life When Robots Rule the Earth
by Robin Hanson
Published 31 Mar 2016

If emulation hardware is digital, then it could either be deterministic, so that the value and timing of output states are always exactly predictable, or it could be fault-prone and fault-tolerant in the sense of having and tolerating more frequent and larger logic errors and timing fluctuations. Most digital hardware today is deterministic, but large parallel systems are more often fault-tolerant. The design of fault-tolerant hardware and software is an active area of research today (Bogdan et al. 2007). As human brains are large, parallel, and have an intrinsically fault-tolerant design, brain emulation software is likely to need less special adaptation to run on fault-prone hardware.

As human brains are large, parallel, and have an intrinsically fault-tolerant design, brain emulation software is likely to need less special adaptation to run on fault-prone hardware. Such hardware is usually cheaper to design and construct, occupies less volume, and takes less energy to run. Thus em hardware is likely to often be fault-prone and fault-tolerant. Cosmic rays are high-energy particles that come from space and disrupt the operation of electronic devices. Hardware errors resulting from cosmic rays cause a higher rate of errors per operation in hardware that runs more slowly, with all else equal. Because of this, when ems run slower, with each operation taking more time, they either tend to tolerate fewer other errors, or they pay more for error correction.

United States Bureau of Labor Statistics USDL-12–1887, September 18. http://www.bls.gov/news.release/archives/tenure_09182012.pdf. Boehm, Christopher. 1999. Hierarchy in the Forest: The Evolution of Egalitarian Behavior. Harvard University Press, December 1. Bogdan, Paul, Tudor Dumitras, and Radu Marculescu. 2007. “Stochastic Communication: A New Paradigm for Fault-Tolerant Networks-on-Chip.” VLSI Design 2007: 95348. Boning, Brent, Casey Ichniowski, and Kathryn Shaw. 2007. “Opportunity Counts: Teams and the Effectiveness of Production Incentives.” Journal of Labor Economics 25(4): 613–650. Bonke, Jens. 2012. “Do Morning-Type People Earn More than Evening-Type People?

pages: 463 words: 118,936

Darwin Among the Machines
by George Dyson
Published 28 Mar 2012

Von Neumann believed that entirely different logical foundations would be required to arrive at an understanding of even the simplest nervous system, let alone the human brain. His Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components (1956) explored the possibilities of parallel architecture and fault-tolerant neural nets. This approach would soon be superseded by a development that neither nature nor von Neumann had counted on: the integrated circuit, composed of logically intricate yet structurally monolithic microscopic parts. Serial architecture swept the stage. Probabilistic logics, along with vacuum tubes and acoustic delay-line memory, would scarcely be heard from again.

At one level, this language may appear to us to be money, especially the new, polymorphous E-money that circulates without reserve at the speed of light. E-money is, after all, simply a consensual definition of “electrons with meaning,” allowing other levels of meaning to freely evolve. Composed of discrete yet divisible and liquid units, digital currency resembles the pulse-frequency coding that has proved to be such a rugged and fault-tolerant characteristic of the nervous systems evolved by biology. Frequency-modulated signals that travel through the nerves are associated with chemical messages that are broadcast by diffusion through the fluid that bathes the brain. Money has a twofold nature that encompasses both kinds of behavior: it can be transmitted, like an electrical signal, from one place (or time) to another; or it can be diffused in any number of more chemical, hormonelike ways.

The packet chooses a channel that happens to be quiet at that instant and jumps to the next lamppost at the speed of light. The multiplexing of communications across the available network topology is extended to the multiplexing of network topology across the available frequency spectrum. Communication becomes more efficient, fault tolerant, and secure. The way the system works now (in a growing number of metropolitan areas—hence the name) is that you purchase or rent a small Ricochet modem, about the size of a large candy bar and transmitting at about two-thirds of a watt. Your modem establishes contact with the nearest pole-top lunch box or directly with any other modem of its species within range.

pages: 1,758 words: 342,766

Code Complete (Developer Best Practices)
by Steve McConnell
Published 8 Jun 2004

The fact that an environment has a particular error-handling approach doesn't mean that it's the best approach for your requirements. Fault Tolerance The architecture should also indicate the kind of fault tolerance expected. Fault tolerance is a collection of techniques that increase a system's reliability by detecting errors, recovering from them if possible, and containing their bad effects if not. Further Reading For a good introduction to fault tolerance, see the July 2001 issue of IEEE Software. In addition to providing a good introduction, the articles cite many key books and key articles on the topic. For example, a system could make the computation of the square root of a number fault tolerant in any of several ways: The system might back up and try again when it detects a fault.
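The "back up and try again" option can be sketched as a generic retry wrapper; this is an illustration of the idea, not code from the book, and the fault type and function names are invented:

```python
# Fault tolerance by retrying: re-run a computation that may fail
# transiently, giving up after a bounded number of attempts.
def with_retries(compute, attempts=3):
    last_error = None
    for _ in range(attempts):
        try:
            return compute()
        except ValueError as e:   # the transient fault we know how to handle
            last_error = e
    raise last_error              # contain the fault: report it upward

calls = {"n": 0}
def flaky_sqrt():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("transient fault")
    return 2.0   # sqrt(4), succeeding on the third attempt

print(with_retries(flaky_sqrt))   # 2.0, after two failed attempts
```

Bounding the attempts matters: an unbounded retry loop turns a detected fault into a hang, which is worse than a contained failure.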

Each class computes the square root, and then the system compares the results. Depending on the kind of fault tolerance built into the system, it then uses the mean, the median, or the mode of the three results. The system might replace the erroneous value with a phony value that it knows to have a benign effect on the rest of the system. Other fault-tolerance approaches include having the system change to a state of partial operation or a state of degraded functionality when it detects an error. It can shut itself down or automatically restart itself. These examples are necessarily simplistic. Fault tolerance is a fascinating and complex subject—unfortunately, it's one that's outside the scope of this book.
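The multi-implementation variant can likewise be sketched: compute the result several ways and vote, here using the median, one of the options listed above. The three implementations and their names are invented for illustration:

```python
# Triple-redundant square root with median voting: three independent
# implementations compute the result, and the median masks one bad answer.
import statistics

def sqrt_newton(x, iters=20):
    g = x or 1.0
    for _ in range(iters):
        g = (g + x / g) / 2   # Newton's method iteration
    return g

def sqrt_pow(x):
    return x ** 0.5

def sqrt_buggy(x):
    return x / 2              # a faulty implementation, to show masking

def fault_tolerant_sqrt(x):
    results = [sqrt_newton(x), sqrt_pow(x), sqrt_buggy(x)]
    return statistics.median(results)   # one wrong vote is outvoted

print(round(fault_tolerant_sqrt(9.0), 6))   # 3.0, despite the faulty 4.5 vote
```

The median tolerates any single bad result without needing to know which implementation is wrong; with the mean, by contrast, one faulty vote would skew the answer.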

• Does the architecture set space and speed budgets for each class, subsystem, or functionality area?
• Does the architecture describe how scalability will be achieved?
• Does the architecture address interoperability?
• Is a strategy for internationalization/localization described?
• Is a coherent error-handling strategy provided?
• Is the approach to fault tolerance defined (if any is needed)?
• Has technical feasibility of all parts of the system been established?
• Is an approach to overengineering specified?
• Are necessary buy-vs.-build decisions included?
• Does the architecture describe how reused code will be made to conform to other architectural objectives?

pages: 480 words: 99,288

Mastering ElasticSearch
by Rafal Kuc and Marek Rogozinski
Published 14 Aug 2013

This allows us to store various document types in one index and have different mappings for different document types. Node The single instance of the ElasticSearch server is called a node. A single-node ElasticSearch deployment can be sufficient for many simple use cases, but when you have to think about fault tolerance or you have lots of data that cannot fit in a single server, you should think about a multi-node ElasticSearch cluster. Cluster A cluster is a set of ElasticSearch nodes that work together to handle a load bigger than a single instance can handle (both in terms of handling queries and documents).

In the next chapter, we'll look closely at what ElasticSearch offers us when it comes to shard control. We'll see how to choose the right amount of shards and replicas for our index, we'll manipulate shard placement and we will see when to create more shards than we actually need. We'll discuss how the shard allocator works. Finally, we'll use all the knowledge we've got so far to create fault tolerant and scalable clusters. Chapter 4. Index Distribution Architecture In the previous chapter, we've learned how to use different scoring formulas and how we can benefit from using them. We've also seen how to use different posting formats to change how the data is indexed. In addition to that, we now know how to handle near real-time searching and real-time get and what searcher reopening means for ElasticSearch.

Using our knowledge As we slowly approach the end of the fourth chapter, we need something closer to what you may encounter in your everyday work. Because of that, we have decided to divide the real-life example into two sections. In this section, you'll see how to combine the knowledge gained so far to build a fault-tolerant and scalable cluster based on some assumptions. Because this chapter is mostly about configuration, we will concentrate on that. Your mappings and data may be different, but with a similar amount of data and similar queries hitting your cluster, the following sections may be useful for you. Assumptions Before we go into the juicy configuration details, let's make some basic assumptions, using which we will configure our ElasticSearch cluster.

pages: 719 words: 181,090

Site Reliability Engineering: How Google Runs Production Systems
by Betsy Beyer , Chris Jones , Jennifer Petoff and Niall Richard Murphy
Published 15 Apr 2016

The product developers have more visibility into the time and effort involved in writing and releasing their code, while the SREs have more visibility into the service’s reliability (and the state of production in general). These tensions often reflect themselves in different opinions about the level of effort that should be put into engineering practices. The following list presents some typical tensions: Software fault tolerance How hardened do we make the software to unexpected events? Too little, and we have a brittle, unusable product. Too much, and we have a product no one wants to use (but that runs very stably). Testing Again, not enough testing and you have embarrassing outages, privacy data leaks, or a number of other press-worthy events.

Deploying Distributed Consensus-Based Systems The most critical decisions system designers must make when deploying a consensus-based system concern the number of replicas to be deployed and the location of those replicas. Number of Replicas In general, consensus-based systems operate using majority quorums, i.e., a group of 2f + 1 replicas may tolerate f failures (if Byzantine fault tolerance, in which the system is resistant to replicas returning incorrect results, is required, then 3f + 1 replicas may tolerate f failures [Cas99]). For non-Byzantine failures, the minimum number of replicas that can be deployed is three—if two are deployed, then there is no tolerance for failure of any process.
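These two standard replica-count formulas can be written down directly; a minimal sketch (the function names are my own):

```python
def crash_quorum_size(f: int) -> int:
    """Majority quorums: 2f + 1 replicas are needed to tolerate
    f crash (non-Byzantine) failures."""
    return 2 * f + 1

def byzantine_quorum_size(f: int) -> int:
    """Byzantine fault tolerance: 3f + 1 replicas are needed to
    tolerate f replicas that may return arbitrary results."""
    return 3 * f + 1
```

This matches the text's observation that three is the minimum useful deployment: with f = 1, a crash-tolerant group needs 3 replicas, while a Byzantine-tolerant group needs 4.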

[All15] J. Allspaw, “Trade-Offs Under Pressure: Heuristics and Observations of Teams Resolving Internet Service Outages”, MSc thesis, Lund University, 2015. [Ana07] S. Anantharaju, “Automating web application security testing”, blog post, July 2007. [Ana13] R. Ananatharayan et al., “Photon: Fault-tolerant and Scalable Joining of Continuous Data Streams”, in SIGMOD ’13, 2013. [And05] A. Andrieux, K. Czajkowski, A. Dan, et al., “Web Services Agreement Specification (WS-Agreement)”, September 2005. [Bai13] P. Bailis and A. Ghodsi, “Eventual Consistency Today: Limitations, Extensions, and Beyond”, in ACM Queue, vol. 11, no. 3, 2013.

Reactive Messaging Patterns With the Actor Model: Applications and Integration in Scala and Akka
by Vaughn Vernon
Published 16 Aug 2015

• If scheduling tasks is difficult and error prone, leave the task scheduling to software that is best at that job, and focus on your system’s use cases instead. • If errors happen—and errors do happen—design your system to expect errors and react to errors by being fault tolerant. These are powerful assertions. Yet, is there a way to realize these sound concurrency design principles? Or have we just identified a panacea of wishful thinking? Can we actually use multithreaded software development techniques that enable us to reason about our systems, that react to changing conditions, that are scalable and fault tolerant, and that really work? How the Actor Model Helps A system of actors helps you leverage the simultaneous use of multiple processor cores.
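The actor idea the passage describes can be made concrete with a toy Python sketch. The `Actor` class below is my own illustration (not Akka's API): one thread drains a mailbox, so messages are handled sequentially without locks, and `tell` is asynchronous:

```python
import queue
import threading

class Actor:
    """Minimal mailbox-driven actor: a dedicated thread processes
    messages one at a time, so the handler needs no locks."""

    def __init__(self, handler):
        self._mailbox = queue.Queue()
        self._handler = handler
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, message):
        """Fire-and-forget send: enqueue the message and return at once."""
        self._mailbox.put(message)

    def stop(self):
        """Enqueue a sentinel that ends the loop, then wait for it."""
        self._mailbox.put(None)
        self._thread.join()

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message is None:
                return
            self._handler(message)
```

A caller can `tell` an actor from many threads concurrently; the mailbox serializes delivery, which is the property that makes actor-based systems easier to reason about.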

Trying to take full advantage of contemporary hardware improvements such as increasing numbers of processors and cores and growing processor cache is seriously impeded by the very tools and patterns that should be helping us. Thus, implementing event-driven, scalable, resilient, and responsive applications is often deemed too difficult and risky and as a result is generally avoided. The Akka toolkit was created to address the failings of common multithreaded programming approaches, distributed computing, and fault tolerance. It does so by using the Actor model, which provides powerful abstractions that make creating solutions around concurrency and parallelism much easier to reason about and succeed in. This is not to say that Akka removes the need to think about concurrency. It doesn’t, and you must still design for parallelism, latency, and eventually consistent application state and think of how you will prevent your application from unnecessary blocking.

Akka clustering is useful not only for peak demand but also for failover. Even if you have five machines available, clustering can make more efficient use of resources by assigning extra work to the machines least under load and by rebalancing work between machines if a machine crashes unexpectedly. Akka clustering is designed to support a multinode, fault-tolerant, distributed system of actors. It does this by creating a cluster of nodes. A node must be an ActorSystem that is exposed on a TCP port so that it has a unique identifier. Every node must share the same ActorSystem name. Every node member must use a different port number within its host server hardware; no two node members may share a socket port number on the same physical machine.

pages: 721 words: 197,134

Data Mining: Concepts, Models, Methods, and Algorithms
by Mehmed Kantardzić
Published 2 Jan 2003

In the context of data classification, an ANN can be designed to provide information not only about which particular class to select for a given sample, but also about confidence in the decision made. This latter information may be used to reject ambiguous data, should they arise, and thereby improve the classification performance or the performance of the other tasks modeled by the network. 5. Fault Tolerance. An ANN has the potential to be inherently fault-tolerant, or capable of robust computation. Its performance does not degrade significantly under adverse operating conditions such as disconnection of neurons, or noisy or missing data. There is some empirical evidence for robust computation, but usually it is uncontrolled. 6.

SOM applications. (a) Drugs binding to human cytochrome; (b) interest rate classification; (c) analysis of book-buying behavior. 7.8 REVIEW QUESTIONS AND PROBLEMS 1. Explain the fundamental differences between the design of an ANN and “classical” information-processing systems. 2. Why is the fault-tolerance property one of the most important characteristics and capabilities of ANNs? 3. What are the basic components of the neuron’s model? 4. Why are continuous functions such as log-sigmoid or hyperbolic tangent considered common activation functions in real-world applications of ANNs? 5. Discuss the differences between feedforward and recurrent neural networks. 6.

Because of the massive amount of data and the speed of which the data are generated, many data-mining applications in sensor networks require in-network processing such as aggregation to reduce sample size and communication overhead. Online data mining in sensor networks offers many additional challenges, including: limited communication bandwidth, constraints on local computing resources, limited power supply, need for fault tolerance, and asynchronous nature of the network. Obviously, data-mining systems have evolved in a short period of time from stand-alone programs characterized by single algorithms with little support for the entire knowledge-discovery process to integrated systems incorporating several mining algorithms, multiple users, communications, and various and heterogeneous data formats and distributed data sources.

Industry 4.0: The Industrial Internet of Things
by Alasdair Gilchrist
Published 27 Jun 2016

Therefore, we see the following delivery mechanisms:
- At most once delivery—commonly called fire and forget; it rides on unreliable protocols such as UDP.
- At least once delivery—reliable delivery, such as TCP/IP, where every message is delivered to the recipient.
- Exactly once delivery—used in batch jobs to ensure that late packets, delayed through excessive latency or jitter, do not corrupt the results.
Additionally, there are many other factors that need to be taken into consideration, such as lifespan, which allows IISs to discard old data packets, much like the time-to-live field on IP packets. There is also fault tolerance, which ensures fault survivability through alternative routes or hardware redundancy, guaranteeing availability and reliability. Similarly, there is the case of security, which we will discuss in detail in a later chapter. Industry 4.0 Key Functions of the Communication Layer The communication layer functions deliver the data to the correct address and application.
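A common way to get exactly-once behaviour on top of an at-least-once transport is to deduplicate by message ID on the receiving side. A minimal sketch under that assumption (the function and its signature are illustrative, not from any specific IIoT stack):

```python
def process_exactly_once(deliveries, handler, seen=None):
    """At-least-once transports may redeliver a message; remembering
    processed message IDs and skipping duplicates yields exactly-once
    *processing* even though delivery itself is at-least-once."""
    seen = set() if seen is None else seen
    for message_id, payload in deliveries:
        if message_id in seen:
            continue  # duplicate redelivery: already handled
        seen.add(message_id)
        handler(payload)
```

In a real system the `seen` set would need to be bounded (e.g., by the lifespan mechanism mentioned above) and persisted across restarts.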

There is also considerable interest in producing IoT devices capable of harvesting energy from solar, wind, or electromagnetic fields as a power source, as that could be a major technology advance in deploying remote M2M-style mesh networking in rural areas, for example in a smart-agriculture scenario. Energy-harvesting IoT devices would provide the means, through mesh M2M networks, for highly fault-tolerant, unattended long-term solutions that require only minimal human intervention. However, research is not focused solely on the hardware. Researchers are also keenly studying methods that would make application protocols and data formats far more efficient. For instance, devices running on minimal power levels, or harvesting energy at subsistence levels, must communicate their data in a highly efficient and timely manner, and this has serious implications for protocol design.

Therefore, contention ratios—the number of other customers you are sharing the bandwidth with—can be as high as 50:1 for residential use and 10:1 for business use. • SDH/SONET—This optic ring technology is typically deployed as the service provider’s transport core, as it provides high-speed, high-capacity, highly reliable, and fault-tolerant transport for data over sometimes vast geographical regions. However, for customers that require high-speed data links over a large geographical region, typically enterprises or large companies, fiber optic rings are high performance, highly reliable, and high cost.

pages: 757 words: 193,541

The Practice of Cloud System Administration: DevOps and SRE Practices for Web Services, Volume 2
by Thomas A. Limoncelli , Strata R. Chalup and Christina J. Hogan
Published 27 Aug 2014

Sometimes services were also scaled by deploying servers for the application into several geographic regions, or business units, each of which would then use its local server. For example, when Tom first worked at AT&T, there was a different payroll processing center for each division of the company. High Availability Applications requiring high availability required “fault-tolerant” computers. These computers had multiple CPUs, error-correcting RAM, and other technologies that were extremely expensive at the time. Fault-tolerant systems were niche products. Generally only the military and Wall Street needed such systems. As a result they were usually priced out of the reach of typical companies. Costs During this era the Internet was not business-critical, and outages for internal business-critical systems could be scheduled because the customer base was a limited, known set of people.

Failure domains can be any size: a device, a computer, a rack, a datacenter, or even an entire company. The amount of capacity in a system is N + M, where N is the amount of capacity used to provide a service and M is the amount of spare capacity available, which can be used in the event of a failure. A system that is N + 1 fault tolerant can survive one unit of failure and remain operational. The most common way to route around failure is through replication of services. A service may be replicated one or more times per failure domain to provide resilience greater than the domain. Failures can also come from external sources that overload a system, and from human mistakes.
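The N + M capacity model above reduces to simple arithmetic; a sketch (function names are mine):

```python
def tolerated_failures(total_capacity: int, required_capacity: int) -> int:
    """In an N + M system, N units carry the load and M = total - N
    are spare, so the system survives up to M failed units."""
    return total_capacity - required_capacity

def is_operational(total_capacity: int, required_capacity: int,
                   failed_units: int) -> bool:
    """True while enough units survive to carry the full load."""
    return failed_units <= tolerated_failures(total_capacity,
                                              required_capacity)
```

For example, a service that needs 3 units of capacity and deploys 4 is N + 1 fault tolerant: it survives one failure but not two.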

Originally based on applying Agile methodology to operations, the result is a streamlined set of principles and processes that can create reliable services. Appendix B will make the case that cloud or distributed computing was the inevitable result of the economics of hardware. DevOps is the inevitable result of needing to do efficient operations in such an environment. If hardware and software are sufficiently fault tolerant, the remaining problems are human. The seminal paper “Why Do Internet Services Fail, and What Can Be Done about It?” by Oppenheimer et al. (2003) raised awareness that if web services are to be a success in the future, operational aspects must improve: We find that (1) operator error is the largest single cause of failures in two of the three services, (2) operator errors often take a long time to repair, (3) configuration errors are the largest category of operator errors, (4) failures in custom-written front-end software are significant, and (5) more extensive online testing and more thoroughly exposing and detecting component failures would reduce failure rates in at least one service.

pages: 194 words: 49,310

Clock of the Long Now
by Stewart Brand
Published 1 Jan 1999

Hasty opportunists will never get past the foothills because they only pay attention to the slope of the ground under their feet, climb quickly to the immediate hilltop, and get stuck there. Patient opportunists take the longer view to the distant peaks, and toil through many ups and downs on the long trek to the heights. There are two ways to make systems fault-tolerant: One is to make them small, so that correction is local and quick; the other is to make them slow, so that correction has time to permeate the system. When you proceed too rapidly with something mistakes cascade, whereas when you proceed slowly the mistakes instruct. Gradual, incremental projects engage the full power of learning and discovery, and they are able to back out of problems.

Diamond, Jared Digital information and core standards discontinuity of and immortality and megadata and migration preservation of Digital records, passive and active Discounting of value Drexler, Eric Drucker, Peter Dubos, René Dyson, Esther Dyson, Freeman Earth, view of from outer space Earth Day Easterbrook, Gregg Eaton Collection Eberling, Richard Ecological communities systems and change See also Environment Economic forecasting Ecotrust Egyptian civilization and time Ehrlich, Paul Electronic Frontier Foundation Eliade, Mircea Eno, Brian and ancient Egyptian woman and Clock of the Long Now ideas for participation in Clock/Library and tour of Big Ben Environment degradation of and peace, prosperity, and continuity reframing of problems of and technology See also Ecological Environmentalists and long-view Europe-America dialogue Event horizon Evolution of Cooperation, The “Experts Look Ahead, The” Extinction rate Extra-Terrestrial Intelligence programs and time-release services Extropians Family Tree Maker Fashion Fast and bad things Fault-tolerant systems Feedback and tuning of systems Feldman, Marcus Finite and Infinite Games Finite games Florescence Foresight Institute Freefall Free will Fuller, Buckminster Fundamental tracking Future configuration towards continuous of desire versus fate feeling of and nuclear armageddon one hundred years and present moment tree uses of and value Future of Industrial Man, The “Futurismists” Gabriel, Peter Galileo Galvin, Robert Gambling Games, finite and infinite Gender imbalance in Chinese babies Generations Gershenfeld, Neil Gibbon, Edward GI Bill Gibson, William Gilbert, Joseph Henry Global Business Network (GBN) Global collapse Global computer Global perspective Global warming Goebbels, Joseph Goethe, Johann Wolfgang von Goldberg, Avram “Goldberg rule, the” Goldsmith, Oliver Goodall, Jane Governance Governing the Commons Government and the long view Grand Canyon Great Year Greek tragedy Grove, Andy Hale-Bopp comet 
Hampden-Turner, Charles Hardware dependent digital experiences, preservation of Hawking, Stephen Hawthorne, Nathaniel Heinlein, Robert Herman, Arthur Hill climbing Hillis, Daniel definition of technology and design of Clock and digital discontinuity and digital preservation and extra-terrestrial intelligence programs ideas for participation in Clock/Library and Long Now Foundation and long-term responsibility and motivation to build linear Clock and the Singularity and sustained endeavors and types of time History and accessible data as a horror and warning how to apply intelligently Hitler, Adolf Holling, C.

pages: 348 words: 97,277

The Truth Machine: The Blockchain and the Future of Everything
by Paul Vigna and Michael J. Casey
Published 27 Feb 2018

However, instead of its electricity-hungry “proof-of-work” consensus model, they drew upon older, pre-Bitcoin protocols that were more efficient but which couldn’t achieve the same level of security without putting a centralized entity in charge of identifying and authorizing participants. Predominantly, the bankers’ models used a consensus algorithm known as practical byzantine fault tolerance, or PBFT, a cryptographic solution invented in 1999. It gave all approved ledger-keepers in the network confidence that each other’s actions weren’t undermining the shared record even when there was no way of knowing whether one or more had malicious intent to defraud the others. With these consensus-building systems, the computers adopted each updated version of the ledger once certain thresholds of acceptance were demonstrated across the network.

See also R3 CEV Cosmos costs-per-impression measures (CPMs) Craigslist Creative Commons credit default swap (CDS) Crowdfunder crowdfunding crypto-asset analysts crypto-assets Crytpo Company cryptocurrency and criminality and Cypherpunk movement and decentralization and fair distribution and financial sector and Fourth Industrial Revolution hoarding investors and privacy and quantum computing and regulatory challenges See also Bitcoin cryptography and blockchain technology and data storage and financial sector hashes history of and identity and math Merkle Tree practical byzantine fault tolerance (PBFT) and registers and security and privacy signatures and supply chains and tokens triple-entry bookkeeping and trust crypto-impact-economics Cryptokernel (CK) crypto-libertarians cryptomoney Cryptonomos Cuende, Luis Iván Cuomo, Jerry cyber-attacks ransom attacks cybersecurity and decentralized trust model device identity model shared-secret model Cypherpunk manifesto Cypherpunk movement and community DAO, The (The Decentralized Autonomous Organization) Dapps.

MIT Media Lab MIT Media Lab’s Digital Currency Initiative Mizrahi, Alex MME Modi, Narendra Monax Monero monetary and banking systems central bank fiat digital currency and community connections and digital counterfeiting mobile money systems money laundering See also cryptocurrency; financial sector Moore’s law Mooti Morehead, Dan Mozilla M-Pesa Nakamoto, Satoshi (pseudonymous Bitcoin creator) Nasdaq Nelson, Ted New America Foundation New York Department of Financial Services Niederauer, Duncan North American Bitcoin Conference Norway Obama, Barack Occupy Wall Street Ocean Health Coin off-chain environment Olsen, Richard open protocols open-source systems and movement and art and innovation challenges of Cryptokernel (CK) and data storage and financial sector and health care sector and honest accounting Hyperledger and identity and permissioned systems and registries and ride-sharing and tokens See also Ethereum organized crime Pacioli, Luca Pantera Capital Parity Wallet peer-to-peer commerce and economy Pentland, Alex “Sandy” Perkins Coie permissioned (private) blockchains advantages of challenges of and cryptocurrency-less systems definition of and finance sector open-source development of scalability of and security and supply chains permissionless blockchains Bitcoin and Cypherpunks Ethereum financial sector and identity information mobile money systems and scalability and trusted computing Pink Army Cooperative Plasma Polkadot Polychain Capital Poon, Joseph practical byzantine fault tolerance (PBFT) pre-mining pre-selling private blockchains. 
See permissioned (private) blockchains Procivis proof-of-stake algorithm proof of work prosumers Protocol Labs Provenance public key infrastructure (PKI) Pureswaran, Veena R3 CEV consortium ransom attacks Ravikant, Naval Realini, Carol re-architecting record keeping and proof-of-stake algorithm and supply chains and trust See also ledger-keeping Reddit refugee camps Regenor, James reputation scoring Reuschel, Peter Rhodes, Yorke ride-sharing Commuterz Lyft reputation scoring Uber Ripple Labs Rivest Co.

RDF Database Systems: Triples Storage and SPARQL Query Processing
by Olivier Cure and Guillaume Blin
Published 10 Dec 2014

When writing these programs, one does not need to worry about the data distribution and parallelism aspects. In fact, the main contribution of MapReduce-based systems is to orchestrate the distribution and execution of these map and reduce operations on a cluster of machines over very large data sets. It is also fault-tolerant, meaning that if a machine in the cluster fails during the execution of a process, its job will be given to another machine automatically. Therefore, most of the hard tasks from an end-user point of view are automated and taken care of by the system: data partitioning, execution scheduling, handling machine failure, and managing intermachine communication.
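The programming model itself fits in a few lines. Below is a single-process sketch of the MapReduce dataflow the paragraph describes: map each record to (key, value) pairs, group by key, then reduce each group. A real framework distributes these phases over a cluster and reruns the tasks of failed machines for fault tolerance; this toy version only shows the flow:

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Map phase, shuffle (group by key), then reduce phase."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(values) for key, values in groups.items()}

# The canonical example: word count over lines of text.
counts = map_reduce(
    ["a b", "b c"],
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=sum,
)
```

Here `counts` is `{"a": 1, "b": 2, "c": 1}`: each occurrence maps to a `(word, 1)` pair, and `sum` reduces each group.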

Because these index lookups are defined procedurally, we can consider that any form of optimization is quite difficult to perform. This implies that the generated index lookups need to be optimal to ensure efficient query answering. We saw in Chapter 5 that many systems use a MapReduce approach to benefit from a parallel-processing, fault-tolerant environment. PigSPARQL, presented in Schätzle et al. (2013), is a system that maps SPARQL queries to Pig Latin queries. In a nutshell, Pig is a data analysis platform developed by Yahoo! that runs on top of the Hadoop processing framework, and Pig Latin is its query language, which abstracts the creation of the map and reduce functions using a relational algebra–like approach.

These two mechanisms come with support for conflict resolution, i.e., detect whether an update has been correctly replicated to a subscriber. The second strategy is based on partitioning that is specified at the index level using a hash function on key parts. Each partition is replicated on different physical machines to ensure load balancing and fault tolerance. When triple updates are being performed, all copies are updated within the same transaction. The clustering approach of the Mark Logic system distinguishes between two kinds of nodes: data managers (denoted as D-nodes) and evaluators (denoted as E-nodes). The D-nodes are responsible for the management of a data subset, while the E-nodes handle the access to data and the query processing.

pages: 1,409 words: 205,237

Architecting Modern Data Platforms: A Guide to Enterprise Hadoop at Scale
by Jan Kunigk , Ian Buss , Paul Wilkinson and Lars George
Published 8 Jan 2019

Core Components The first set of projects are those that form the core of the Hadoop project itself or are key enabling technologies for the rest of the stack: HDFS, YARN, Apache ZooKeeper, and the Apache Hive Metastore. Together, these projects form the foundation on which most other frameworks, projects, and applications running on the cluster depend. HDFS The Hadoop Distributed File System (HDFS) is the scalable, fault-tolerant, and distributed filesystem for Hadoop. Based on the original use case of analytics over large-scale datasets, HDFS is optimized to store very large amounts of immutable data with files being typically accessed in long sequential scans. HDFS is the critical supporting technology for many of the other components in the stack.

The client then reads the data directly from the DataNodes, preferring replicas that are local or close, in network terms. The design of HDFS means that it does not allow in-place updates to the files it stores. This can initially seem quite restrictive until you realize that this immutability allows it to achieve the required horizontal scalability and resilience in a relatively simple way. HDFS is fault-tolerant because the failure of an individual disk, DataNode, or even rack does not imperil the safety of the data. In these situations, the NameNode simply directs one of the DataNodes that is maintaining a surviving replica to copy the block to another DataNode until the required replication factor is reasserted.
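The re-replication behaviour described above can be sketched as a small simulation. This is an illustration of the idea, not HDFS code; it assumes enough live DataNodes exist to restore the replication factor:

```python
def reassert_replication(block_locations, failed_node, live_nodes, factor=3):
    """Simplified NameNode behaviour: drop the failed DataNode from
    every block's replica set, then copy under-replicated blocks to
    other live nodes until the replication factor is restored."""
    for replicas in block_locations.values():
        replicas.discard(failed_node)
        # Candidate nodes that do not yet hold a copy of this block.
        spares = iter(sorted(set(live_nodes) - replicas))
        while len(replicas) < factor:
            replicas.add(next(spares))
    return block_locations
```

The surviving replicas are the copy sources, so losing one disk, node, or rack never loses data as long as at least one replica remains.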

In Chapter 3, we discuss in detail how HDFS interacts with the servers on which its daemons run and how it uses the locally attached disks in these servers. In Chapter 4, we examine the options when putting a network plan together, and in Chapter 12, we cover how to make HDFS as highly available and fault-tolerant as possible. One final note before we move on. In this short description of HDFS, we glossed over the fact that Hadoop abstracts much of this detail from the client. The API that a client uses is actually a Hadoop-compatible filesystem, of which HDFS is just one implementation. We will come across other commonly used implementations in this book, such as cloud-based object storage offerings like Amazon S3.

Engineering Security
by Peter Gutmann

For example a resolver could decide that although a particular entry may be stale, it came from an authoritative source and so it can still be used until newer information becomes available (the technical name for a resolver that provides this type of service on behalf of the user is “curated DNS”). What DNSSEC does is take the irregularity- and fault-tolerant behaviour of resolvers and turn any problem into a fatal error, since close-enough is no longer sufficient to satisfy a resolver that for security reasons can’t allow a single bit to be out of place. The DNSSEC documents describe in great detail the bits-on-the-wire representation of the packets that carry the data but say nothing about what happens to those bits once they’ve reached their destination [637]. As a result the implicit fault-tolerance of the DNS, which works because resolvers go to great lengths to tolerate any form of vaguely-acceptable (and in a number of cases unacceptable but present in widely-deployed implementations) responses [638], is seriously impacted when glitches are no longer allowed to be tolerated.

Other Threat Analysis Techniques The discussion above has focused heavily on PSMs for threat analysis because that seems to be the most useful technique to apply to product development. Another threat analysis technique that you may run into is the use of attack trees or graphs [97][98][99][100][101][102][103][104][105][106][107][108][109][110][111][112][113][114][115][116][117][118][119][120][121][122], which are derived from the fault trees used in fault-tolerant computing and safety-critical systems [123][124][125][126][127][128]. The general idea behind a fault tree is shown in Figure 70 and involves starting with the general high-level concept that “a failure occurred” and then iteratively breaking it down into more and more detailed failure classes.

The analysis process for these methods is a relatively straightforward modification of the existing FMEA one that involves identifying all of the system components that would be affected by a particular type of attack (typically a computer-based one rather than just a standard component failure) and then applying standard mitigation techniques used with fault-tolerant and safety-critical systems. So although FMEA and RA aren’t entirely useful for dealing with malicious rather than benign faults, they can at least be applied as a general tool to structuring the allocation of resources towards dealing with malicious faults. Another area where FMEA can be useful is in modelling the process of risk diversification that’s covered in “Security through Diversity” on page 315.

pages: 201 words: 63,192

Graph Databases
by Ian Robinson , Jim Webber and Emil Eifrem
Published 13 Jun 2013

Whatever the database, understanding the underlying storage and caching infrastructure will help you construct idiomatic, and hence mechanically sympathetic, queries that maximise performance. Our final observation on availability is that scaling for cluster-wide replication has a positive impact, not just in terms of fault-tolerance, but also responsiveness. Since there are many machines available for a given workload, query latency is low and availability is maintained. But as we’ll now discuss, scale itself is more nuanced than simply the number of servers we deploy. Scale The topic of scale has become more important as data volumes have grown.

Key-Value Stores Key-value stores are cousins of the document store family, but their lineage comes from Amazon’s Dynamo database. 3 They act like large, distributed hashmap data structures that store and retrieve opaque values by key. As shown in Figure A-3, the key space of the hashmap is spread across numerous buckets on the network. For fault-tolerance reasons each bucket is replicated onto several machines. The formula for the number of replicas required is given by R = 2F + 1, where F is the number of failures we can tolerate. The replication algorithm seeks to ensure that machines aren’t exact copies of each other. This allows the system to load-balance while a machine and its buckets recover; it also helps avoid hotspots, which can cause inadvertent self denial-of-service.
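A toy placement routine makes the R = 2F + 1 rule and the bucket-spreading idea concrete. This is a stand-in for Dynamo-style placement under my own simplifying assumptions (a fixed node list walked from a hash-derived start), not the algorithm any particular store uses:

```python
import zlib

def place_replicas(key: str, nodes: list, tolerated_failures: int) -> list:
    """Place a bucket's copies on R = 2F + 1 distinct machines so the
    store can tolerate F machine failures."""
    r = 2 * tolerated_failures + 1
    if r > len(nodes):
        raise ValueError("not enough nodes for the requested tolerance")
    # crc32 gives a deterministic starting point on the node list.
    start = zlib.crc32(key.encode()) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(r)]
```

With F = 1 each key lands on three distinct nodes; different keys start at different points on the list, which spreads load and avoids the hotspots the text mentions.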

pages: 540 words: 103,101

Building Microservices
by Sam Newman
Published 25 Dec 2014

Netflix, for example, is especially concerned with aspects like fault tolerance, to ensure that the outage of one part of its system cannot take everything down. To handle this, a large amount of work has been done to ensure that there are client libraries on the JVM to provide teams with the tools they need to keep their services well behaved. Anyone introducing a new technology stack would mean having to reproduce all this effort. The main concern for Netflix is less about the duplicated effort, and more about the fact that it is so easy to get this wrong. The risk of a service getting newly implemented fault tolerance wrong is high if it could impact more of the system.

Consul also builds in other capabilities that you might find useful, such as the ability to perform health checks on nodes. This means that Consul could well overlap the capabilities provided by other dedicated monitoring tools, although you would more likely use Consul as a source of this information and then pull it into a more comprehensive dashboard or alerting system. Consul’s highly fault-tolerant design and focus on handling systems that make heavy use of ephemeral nodes does make me wonder, though, if it may end up replacing systems like Nagios and Sensu for some use cases. Consul uses a RESTful HTTP interface for everything from registering a service, querying the key/value store, or inserting health checks.

pages: 31 words: 9,168

Designing Reactive Systems: The Role of Actors in Distributed Architecture
by Hugh McKee
Published 5 Sep 2016

These two features of the actor system directly impact the operational costs of your application system: you use the processing capacity that you have more efficiently and you use only the capacity that is needed at a given point in time. The main takeaways in this chapter are: Delegation of work through supervised workers allows for higher levels of concurrency and fault tolerance. Workers are asynchronous and run concurrently, never sitting idle as in synchronous systems. Efficient utilization of system resources (CPU, memory, and threads) results in reduced infrastructure costs. It’s simple to scale elastically at the actor level by increasing or decreasing workers as needed.
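The delegation idea, a supervisor that restarts a crashed worker instead of letting the failure propagate to the caller, can be sketched with plain threads. The restart policy below is a hypothetical simplification of what an actor runtime such as Akka provides:

```python
import threading
import queue

def supervise(task, max_restarts=3):
    """Run `task` in a worker thread; if the worker dies with an
    exception, the supervisor restarts it (up to max_restarts) rather
    than letting the failure take down the caller."""
    result = queue.Queue()

    def worker():
        try:
            result.put(("ok", task()))
        except Exception as exc:
            result.put(("crashed", exc))

    for _attempt in range(max_restarts + 1):
        t = threading.Thread(target=worker, daemon=True)
        t.start()
        t.join()
        status, value = result.get()
        if status == "ok":
            return value
        # Worker crashed; the supervisor's policy here is simply "restart".
    raise RuntimeError("worker kept failing after restarts")
```

A transient fault is absorbed by the supervisor; only a persistent fault escalates.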

pages: 923 words: 516,602

The C++ Programming Language
by Bjarne Stroustrup
Published 2 Jan 1986

Chapter 13 presents templates, that is, C++’s facilities for defining families of types and functions. It demonstrates the basic techniques used to provide containers, such as lists, and to support generic programming. Chapter 14 presents exception handling, discusses techniques for error handling, and presents strategies for fault tolerance. I assume that you either aren’t well acquainted with objectoriented programming and generic programming or could benefit from an explanation of how the main abstraction techniques are supported by C++. Thus, I don’t just present the language features supporting the abstraction techniques; I also explain the techniques themselves.

The exception-handling mechanism is a nonlocal control structure based on stack unwinding (§14.4) that can be seen as an alternative return mechanism. There are therefore legitimate uses of exceptions that have nothing to do with errors (§14.5). However, the primary aim of the exception-handling mechanism, and the focus of this chapter, is error handling and the support of fault tolerance. Standard C++ doesn’t have the notion of a thread or a process. Consequently, exceptional circumstances relating to concurrency are not discussed here. The concurrency facilities available on your system are described in its documentation.

Exactly the same problem can occur in languages that do not support exception handling. For example, the standard C library function longjmp() can cause the same problem. Even an ordinary return-statement could exit use_file without closing f. A first attempt to make use_file() fault-tolerant looks like this:

void use_file(const char* fn)
{
    FILE* f = fopen(fn, "r");
    try {
        // use f
    }

pages: 933 words: 205,691

Hadoop: The Definitive Guide
by Tom White
Published 29 May 2009

The storage subsystem deals with blocks, simplifying storage management (since blocks are a fixed size, it is easy to calculate how many can be stored on a given disk) and eliminating metadata concerns (blocks are just a chunk of data to be stored—file metadata such as permissions information does not need to be stored with the blocks, so another system can handle metadata separately). Furthermore, blocks fit well with replication for providing fault tolerance and availability. To insure against corrupted blocks and disk and machine failure, each block is replicated to a small number of physically separate machines (typically three). If a block becomes unavailable, a copy can be read from another location in a way that is transparent to the client.

There are many other interfaces to HDFS, but the command line is one of the simplest and, to many developers, the most familiar. We are going to run HDFS on one machine, so first follow the instructions for setting up Hadoop in pseudo-distributed mode in Appendix A. Later you’ll see how to run on a cluster of machines to give us scalability and fault tolerance. There are two properties that we set in the pseudo-distributed configuration that deserve further explanation. The first is fs.default.name, set to hdfs://localhost/, which is used to set a default filesystem for Hadoop. Filesystems are specified by a URI, and here we have used an hdfs URI to configure Hadoop to use HDFS by default.
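The fs.default.name property described here is set in the pseudo-distributed configuration as a short XML snippet (the filename varies by Hadoop version; in later releases it is core-site.xml, and the property was eventually renamed fs.defaultFS):

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>
```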

In any case, you’ll need to begin to scale horizontally. You can attempt to build some type of partitioning on your largest tables, or look into some of the commercial solutions that provide multiple master capabilities. Countless applications, businesses, and websites have successfully achieved scalable, fault-tolerant, and distributed data systems built on top of RDBMSs and are likely using many of the previous strategies. But what you end up with is something that is no longer a true RDBMS, sacrificing features and conveniences for compromises and complexities. Any form of slave replication or external caching introduces weak consistency into your now denormalized data.

pages: 444 words: 118,393

The Nature of Software Development: Keep It Simple, Make It Valuable, Build It Piece by Piece
by Ron Jeffries
Published 14 Aug 2015

The exhaustive brute-force approach is clearly impractical for anything but life-critical systems or Mars rovers. What if you actually have to deliver in this decade? Our community is divided about how to handle faults. One camp says we need to make systems fault-tolerant. We should catch exceptions, check error codes, and generally keep faults from turning into errors. The other camp says it’s futile to aim for fault tolerance. It’s like trying to make a fool-proof device: the universe will always deliver a better fool. No matter what faults you try to catch and recover from, something unexpected will always occur. This camp says “let it crash” so you can restart from a known good state.

The alternative, vertical scaling, means building bigger and bigger servers—adding cores, memory, and storage to hosts. Vertical scaling has its place, but most of our interactive workload goes to horizontally scaled farms. If your system scales horizontally, then you will have load-balanced farms or clusters where each server runs the same applications. The multiplicity of machines provides you with fault tolerance through redundancy. A single machine or process can completely bonk while the remainder continues serving transactions. Still, even though horizontal clusters are not susceptible to single points of failure (except in the case of attacks of self-denial; see ​Self-Denial Attacks​), they can exhibit a load-related failure mode.

pages: 58 words: 12,386

Big Data Glossary
by Pete Warden
Published 20 Sep 2011

This horizontal scaling approach tends to be cheaper as the number of operations and the size of the data increases, and the very largest data processing pipelines are all built on a horizontal model. There is a cost to this approach, though. Writing distributed data handling code is tricky and involves tradeoffs between speed, scalability, fault tolerance, and traditional database goals like atomicity and consistency. MapReduce MapReduce is an algorithm design pattern that originated in the functional programming world. It consists of three steps. First, you write a mapper function or script that goes through your input data and outputs a series of keys and values to use in calculating the results.
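The three steps can be sketched in miniature with the canonical word-count example, with single-process Python standing in for a distributed runtime:

```python
from collections import defaultdict

def mapper(line):
    # Map step: emit (key, value) pairs for each input record.
    for word in line.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Shuffle step: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce step: fold each key's values into a final result.
    return key, sum(values)

lines = ["the quick brown fox", "the lazy dog"]
pairs = [kv for line in lines for kv in mapper(line)]
counts = dict(reducer(k, vs) for k, vs in shuffle(pairs).items())
print(counts["the"])  # → 2
```

In a real framework the mapper and reducer run on different machines and the shuffle moves data over the network, but the contract is the same.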

Applied Cryptography: Protocols, Algorithms, and Source Code in C
by Bruce Schneier
Published 10 Nov 1993

ECB:
- Ciphertext is up to one block longer than the plaintext, due to padding.
- No preprocessing is possible.
+ Processing is parallelizable.
Fault-tolerance:
- A ciphertext error affects one full block of plaintext.
- Synchronization error is unrecoverable.

CBC:
- Ciphertext is up to one block longer than the plaintext, not counting the IV.
- No preprocessing is possible.
+/- Encryption is not parallelizable; decryption is parallelizable and has a random-access property.
Fault-tolerance:
- A ciphertext error affects one full block of plaintext and the corresponding bit in the next block.
- Synchronization error is unrecoverable.

CFB:
Security:
+ Plaintext patterns are concealed.
+ Input to the block cipher is randomized.
+ More than one message can be encrypted with the same key, provided that a different IV is used.
+/- Plaintext is somewhat difficult to manipulate; blocks can be removed from the beginning and end of the message, bits of the first block can be changed, and repetition allows some controlled changes.

OFB/Counter:
Security:
+ Plaintext patterns are concealed.
+ Input to the block cipher is randomized.
+ More than one message can be encrypted with the same key, provided that a different IV is used.
- Plaintext is very easy to manipulate; any change in ciphertext directly affects the plaintext.

CFB (continued):
+/- Encryption is not parallelizable; decryption is parallelizable and has a random-access property.
Fault-tolerance:
- A ciphertext error affects the corresponding bit of plaintext and the next full block.
+ Synchronization errors of full block sizes are recoverable; 1-bit CFB can recover from the addition or loss of single bits.

OFB/Counter:
Efficiency:
+ Speed is the same as the block cipher.
- Ciphertext is the same size as the plaintext, not counting the IV.
+ Processing is possible before the message is seen.
-/+ OFB processing is not parallelizable; counter processing is parallelizable.
Fault-tolerance:
+ A ciphertext error affects only the corresponding bit of plaintext.

There are other security considerations: Patterns in the plaintext should be concealed, input to the cipher should be randomized, manipulation of the plaintext by introducing errors in the ciphertext should be difficult, and encryption of more than one message with the same key should be possible. These will be discussed in detail in the next sections. Efficiency is another consideration. The mode should not be significantly less efficient than the underlying cipher. In some circumstances it is important that the ciphertext be the same size as the plaintext. A third consideration is fault-tolerance. Some applications need to parallelize encryption or decryption, while others need to be able to preprocess as much as possible. In still others it is important that the decrypting process be able to recover from bit errors in the ciphertext stream, or dropped or added bits. As we will see, different modes have different subsets of these characteristics. 9.1 Electronic Codebook Mode Electronic codebook (ECB) mode is the most obvious way to use a block cipher: A block of plaintext encrypts into a block of ciphertext.

pages: 232 words: 71,237

Kill It With Fire: Manage Aging Computer Systems
by Marianne Bellotti
Published 17 Mar 2021

A quick trick when two capable engineers cannot seem to agree on a decision is to ask yourself what each one is optimizing for with their suggested approach. Remember, technology has a number of trade-offs where optimizing for one characteristic diminishes another important characteristic. Examples include security versus usability, coupling versus complexity, fault tolerance versus consistency, and so on, and so forth. If two engineers really can’t agree on a decision, it’s usually because they have different beliefs about where the ideal optimization between two such poles is. Looking for absolute truths in situations that are ambiguous and value-based is painful.

Networking issues are not subtle, and they are generally a product of misconfiguration rather than gremlins. The HTTP request solution is wrong in the correct way because migrating from an HTTP request between Service A and Service B to a message queue later is straightforward. While we are temporarily losing built-in fault tolerance and accepting a higher scaling burden, it creates a system that is easier for the current teams to maintain. The counterexample would be if we swapped the order of the HTTP request and had Service B poll Service A for new data. While this is also less complex than a message queue, it is unnecessarily resource-intensive.

pages: 834 words: 180,700

The Architecture of Open Source Applications
by Amy Brown and Greg Wilson
Published 24 May 2011

The coordinator distributes requests to individual CouchDB instances based on the key of the document being requested. Twitter has built the notions of sharding and replication into a coordinating framework called Gizzard. Gizzard takes standalone data stores of any type—you can build wrappers for SQL or NoSQL storage systems—and arranges them in trees of any depth to partition keys by key range. For fault tolerance, Gizzard can be configured to replicate data to multiple physical machines for the same key range. 13.4.3. Consistent Hash Rings Good hash functions distribute a set of keys in a uniform manner. This makes them a powerful tool for distributing key-value pairs among multiple servers. The academic literature on a technique called consistent hashing is extensive, and the first applications of the technique to data stores were in systems called distributed hash tables (DHTs).
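A minimal consistent hash ring with virtual nodes, in the spirit of the DHT technique mentioned here (node names are hypothetical):

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring with virtual nodes: each server owns many
    points on the ring, and a key belongs to the first server point at
    or after the key's hash. Adding or removing a server only remaps
    the keys that fell in that server's arcs."""

    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (point, node)
        for node in nodes:
            self.add(node)

    def _hash(self, s):
        return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self.ring = [(p, n) for p, n in self.ring if n != node]

    def node_for(self, key):
        h = self._hash(key)
        i = bisect.bisect_right(self.ring, (h, ""))
        return self.ring[i % len(self.ring)][1]  # wrap around the ring
```

When a node dies, only the keys it owned move, and they scatter across the surviving nodes rather than piling onto a single neighbor.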

With more complicated rebalancing schemes, finding the right node for a key becomes more difficult. Range partitioning requires the upfront cost of maintaining routing and configuration nodes, which can see heavy load and become central points of failure in the absence of relatively complex fault tolerance schemes. Done well, however, range-partitioned data can be load-balanced in small chunks which can be reassigned in high-load situations. If a server goes down, its assigned ranges can be distributed to many servers, rather than loading the server's immediate neighbors during downtime. 13.5.

Chapter 15. Riak and Erlang/OTP Francesco Cesarini, Andy Gross, and Justin Sheehy Riak is a distributed, fault-tolerant, open source database that illustrates how to build large-scale systems using Erlang/OTP. Thanks in large part to Erlang's support for massively scalable distributed systems, Riak offers features that are uncommon in databases, such as high availability and linear scalability of both capacity and throughput.

pages: 319 words: 72,969

Nginx HTTP Server Second Edition
by Clement Nedelcu
Published 18 Jul 2013

Here is a list of the main features of the web branch, quoted from the official website www.nginx.org:
• Handling of static files, index files, and autoindexing; open file descriptor cache.
• Accelerated reverse proxying with caching; simple load balancing and fault tolerance.
• Accelerated support with caching of remote FastCGI servers; simple load balancing and fault tolerance.
• Modular architecture. Filters include Gzipping, byte ranges, chunked responses, XSLT, SSI, and image resizing filter. Multiple SSI inclusions within a single page can be processed in parallel if they are handled by FastCGI or proxied servers.
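The "simple load balancing and fault tolerance" for proxied servers is typically configured with an upstream block; a minimal sketch with hypothetical backend addresses:

```nginx
# Requests are balanced across two application servers; a peer that
# fails repeatedly is marked unavailable for fail_timeout seconds,
# and the backup server only receives traffic if the others are down.
upstream backend {
    server 10.0.0.1:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.2:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.3:8080 backup;
}

server {
    listen 80;
    location / {
        proxy_pass http://backend;
    }
}
```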

pages: 66 words: 9,247

MongoDB and Python
by Niall O’Higgins

MongoDB ObjectIds have the nice property of being almost-certainly-unique upon generation, hence no central coordination is required. This contrasts sharply with the common RDBMS idiom of using auto-increment primary keys. Guaranteeing that an auto-increment key is not already in use usually requires consulting some centralized system. When the intention is to provide a horizontally scalable, de-centralized and fault-tolerant database—as is the case with MongoDB—auto-increment keys represent an ugly bottleneck. By employing ObjectId as your _id, you leave the door open to horizontal scaling via MongoDB’s sharding capabilities. While you can in fact supply your own value for the _id property if you wish—so long as it is globally unique—this is best avoided unless there is a strong reason to do otherwise.
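The decentralized generation described here works because the id packs a timestamp, per-process entropy, and a counter, so independent clients almost certainly never collide. A sketch of the idea (the exact field layout below is illustrative, not the precise BSON ObjectId specification):

```python
import os
import struct
import threading
import time

class ObjectIdSketch:
    """Illustrative 12-byte identifier in the spirit of MongoDB's
    ObjectId: 4-byte timestamp + 5 random bytes (standing in for
    machine/process identity) + 3-byte counter. No central coordination
    is needed; any client can mint ids independently."""

    _counter = int.from_bytes(os.urandom(3), "big")
    _rand = os.urandom(5)   # fixed per process
    _lock = threading.Lock()

    @classmethod
    def generate(cls):
        with cls._lock:
            cls._counter = (cls._counter + 1) % (1 << 24)
            counter = cls._counter
        ts = struct.pack(">I", int(time.time()))
        return (ts + cls._rand + counter.to_bytes(3, "big")).hex()
```

Because the timestamp leads, ids sort roughly by creation time, which is also friendly to index locality.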

pages: 355 words: 81,788

Monolith to Microservices: Evolutionary Patterns to Transform Your Monolith
by Sam Newman
Published 14 Nov 2019

I often look back at the small part I played in this industry with a great deal of regret. It turns out not knowing what you’re doing and doing it anyway can have some pretty disastrous implications. 7 See Liming Chen and Algirdas Avizienis, “N-Version Programming: A Fault-Tolerance Approach to Reliability of Software Operation,” published in the Twenty-Fifth International Symposium on Fault-Tolerant Computing (1995). Chapter 4. Decomposing the Database As we’ve already explored, there are a host of ways to extract functionality into microservices. However, we need to address the elephant in the room: namely, what do we do about our data?

Service Design Patterns: Fundamental Design Solutions for SOAP/WSDL and RESTful Web Services
by Robert Daigneau
Published 14 Sep 2011

Once a task has completed, the request would be forwarded to the next background process to perform the next task (e.g., reserve hotel), and so on. The request is therefore processed much like a baton is passed from one runner to the next in a relay race. Web server scalability is promoted because the work is off-loaded from the web servers. This pattern also provides a relatively fault-tolerant way to conduct long-running business processes. However, it can be challenging to understand the entire business process at a macro level, and it can also be difficult to change or debug control-flow logic since these rules are typically buried within individual services, configuration files, routing tables, and messages in transit.

These Process Snapshots provide several benefits. One may query the database to determine the status of any process instance. If a process instance crashes, the database may be queried to determine the last task that completed successfully, and the process may be restarted from that step. This is one way Workflow Engines help to ensure fault tolerance. Figure 5.4: Graphical workflow design tools let developers depict control flow through UML activity diagrams and flowcharts. Information may be mapped from one task to another through Process Variables. The Workflow Connector pattern uses web services as a means to launch the business processes managed by workflow engines.
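The snapshot-and-resume behavior can be sketched as a loop that skips tasks already recorded as complete. The task names echo the flight example, and a plain dict stands in for the engine's snapshot store:

```python
def run_process(tasks, snapshot):
    """Run tasks in order, recording each completed task in `snapshot`
    (a stand-in for the Process Snapshot table). On restart, tasks
    already marked complete are skipped, so a crashed process instance
    resumes from the step after the last success."""
    for name, task in tasks:
        if snapshot.get(name) == "complete":
            continue  # already done before the crash
        task()
        snapshot[name] = "complete"
```

Run once, crash mid-process, then run again: completed steps are not re-executed.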

pages: 713 words: 93,944

Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement
by Eric Redmond , Jim Wilson and Jim R. Wilson
Published 7 May 2012

Each component is cheap and expendable, but when used right, it’s hard to find a simpler or stronger structure upon which to build a foundation. Riak is a distributed key-value database where values can be anything—from plain text, JSON, or XML to images or video clips—all accessible through a simple HTTP interface. Whatever data you have, Riak can store it. Riak is also fault-tolerant. Servers can go up or down at any moment with no single point of failure. Your cluster continues humming along as servers are added, removed, or (ideally not) crash. Riak won’t keep you up nights worrying about your cluster—a failed node is not an emergency, and you can wait to deal with it in the morning.

It is based on BigTable, a high-performance, proprietary database developed by Google and described in the 2006 white paper “Bigtable: A Distributed Storage System for Structured Data.”[26] Initially created for natural-language processing, HBase started life as a contrib package for Apache Hadoop. Since then, it has become a top-level Apache project. On the architecture front, HBase is designed to be fault tolerant. Hardware failures may be uncommon for individual machines, but in a large cluster, node failure is the norm. By using write-ahead logging and distributed configuration, HBase can quickly recover from individual server failures. Additionally, HBase lives in an ecosystem that has its own complementary benefits.

pages: 329 words: 95,309

Digital Bank: Strategies for Launching or Becoming a Digital Bank
by Chris Skinner
Published 27 Aug 2013

It is far easier to change and add new front office systems – new trading desks, new channels or new customer service operations – than to replace core back office platforms – deposit account processing, post-trade services and payment systems. Why? Because the core processing needs to be highly resilient; 99.9999999999999999999999% and a few more 9’s fault tolerant; and running 24 by 7. In other words these systems are non-stop and would highly expose the bank to failure if they stop working. It is these systems that cause most of the challenges for a bank however. This is because, being a core system, they were often developed in the 1960s and 1970s. Back then, computing technologies were based upon lines of code fed into the machine through packs and packs of punched cards.
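Each "9" in such an availability figure has a concrete downtime budget, and each extra nine divides it by ten; as a quick calculation:

```python
def downtime_per_year(nines):
    """Allowed downtime (seconds per year) for an availability of
    `nines` nines, e.g. 5 -> 99.999% available."""
    availability = 1 - 10 ** -nines
    return (1 - availability) * 365 * 24 * 3600

# Five nines allows roughly 5.26 minutes of downtime a year.
print(round(downtime_per_year(5) / 60, 2))  # → 5.26
```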

Add to this the regulatory regime change, which would force banks to respond more and more rapidly to new requirements, and the old technologies could not keep up. Finally, the technology had to change. This is why banks have been working hard to consolidate and replace their old infrastructures, and why we are seeing more and more glitches and failures. As soon as you upgrade an old, embedded, non-stop, fault-tolerant machine, however, you are open to risk. The 99.9999+% non-stop machine suddenly has to stop. A competent bank de-risks change by testing, testing, and testing, whilst an incompetent bank may test but not enough. Luckily, most banks and exchanges are competent enough to test these things properly by planning correctly through roll forward and roll back cycles.

pages: 304 words: 91,566

Bitcoin Billionaires: A True Story of Genius, Betrayal, and Redemption
by Ben Mezrich
Published 20 May 2019

Either way, it would be a logistical nightmare—Mission Impossible shit that only worked in the movies—to get ahold of the three shards that made up the bitcoin private key. Moreover, the twins had replicated this model four times across different geographic regions, to build redundancy into their system—removing the final single point of failure—and improving their overall fault tolerance. This way, if a natural disaster like a major tornado decimated the Midwest, there would still be other sets of alpha, bravo, and charlie spread across other regions in the country (the Northeast, Mid-Atlantic, West, etc.) that could be assembled to form the twins’ private key. If a mega tsunami—or hell, Godzilla—hit the eastern seaboard, or a meteor hit Los Angeles, the twins’ private key would still be safe.
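An all-or-nothing key split like the one described, where every shard is required and each shard is useless alone, can be sketched with XOR shares. This is an illustrative stand-in, not necessarily the scheme the twins used; real custody setups often prefer a threshold scheme such as Shamir's secret sharing, which tolerates lost shards:

```python
import os

def combine(shards):
    # XOR all shards together, byte by byte.
    out = bytes(len(shards[0]))
    for s in shards:
        out = bytes(a ^ b for a, b in zip(out, s))
    return out

def split_secret(secret, n=3):
    # n-1 shards are pure randomness; the last is chosen so that the
    # XOR of all n shards reproduces the secret. Every shard is
    # required, and any subset of fewer than n shards is statistically
    # indistinguishable from noise (one-time-pad style).
    shards = [os.urandom(len(secret)) for _ in range(n - 1)]
    shards.append(combine(shards + [secret]))
    return shards
```

Replicating each full shard set across regions, as in the text, then removes the remaining single point of failure: losing one region loses copies, not information.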

Tyler corralled the security expert by the pool table, where McCaleb and Levchin were geeking out on god knows what. “Why all three?” Tyler asked. “Doesn’t one do the job?” Kaminsky shrugged. “The second one is to tell if the first one is broken. The third is to tell if the other two are lying.” It was exactly how Tyler should have expected a security engineer to think—in terms of systems and their fault tolerance and integrity. Over the next ten minutes, he interrogated Kaminsky about his hacking efforts; at first, the security expert had expected to be able to penetrate such a complex piece of code easily—the fact that it was so complex, so long, meant there should have been many weak spots to exploit.
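Kaminsky's three-replica logic is classic triple modular redundancy with a majority vote; a sketch:

```python
def majority(a, b, c):
    """Triple-modular-redundancy vote: return the value at least two
    replicas agree on. If all three disagree, the fault can be
    detected but not masked."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise ValueError("no two replicas agree")

print(majority(7, 7, 9))  # → 7
```

With two replicas you can only detect a disagreement; the third lets you outvote the liar.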

pages: 1,201 words: 233,519

Coders at Work
by Peter Seibel
Published 22 Jun 2009

When we first did Erlang and we went to conferences and said, “You should copy all your data.” And I think they accepted the arguments over fault tolerance—the reason you copy all your data is to make the system fault tolerant. They said, “It'll be terribly inefficient if you do that,” and we said, “Yeah, it will but it'll be fault tolerant.” The thing that is surprising is that it's more efficient in certain circumstances. What we did for the reasons of fault tolerance, turned out to be, in many circumstances, just as efficient or even more efficient than sharing. Then we asked the question, “Why is that?”

The Art of Scalability: Scalable Web Architecture, Processes, and Organizations for the Modern Enterprise
by Martin L. Abbott and Michael T. Fisher
Published 1 Dec 2009

If we have a technology platform comprised of a number of noncommunicating services, we increase the number of airports or runways for which we are managing traffic; as a result, we can have many more “landings” or changes. If the services communicate asynchronously, we would have a few more concerns, but we are also likely more willing to take risks. On the other hand, if the services all communicate synchronously with each other, there isn’t much more fault tolerance than with a monolithic system (see Chapter 21, Creating Fault Isolative Architectural Structures) and we are back to managing a single runway at a single airport. The expected result of the change is important as we want to be able to verify later that the change was successful. For instance, if a change is being made to a Web server and that change is to allow more threads of execution in the Web server, we should state that as the expected result.

If availability and reliability are important to you and your customers, try to be an early majority or late majority adopter of those systems that are critical to the operations of your service, product, or platform. Asynchronous Design Whenever possible, systems should communicate in an asynchronous fashion. Asynchronous systems tend to be more fault tolerant to extreme load and do not easily fall prey to the multiplicative effects of failure that characterize synchronous systems. We will discuss the reasons for this in greater detail in the next section of this chapter. Stateless Systems Although some systems need state, state has a cost in terms of availability, scalability, and overall cost of your system.

The first factor to use in determining which services should be selected for stress testing is the criticality of each service to the overall system performance. If there is a central service such as a data abstraction layer (DAL) or user authorization, this should be included as a candidate for stress testing because the stability of the entire application depends on this service. If you have architected your application into fault tolerant “swim lanes,” which will be discussed in Chapter 21, Creating Fault Isolative Architectural Structures, you still likely have core services that have been replicated across the lanes. The second consideration for determining services to stress test is the likelihood that a service affects performance.

pages: 102 words: 27,769

Rework
by Jason Fried and David Heinemeier Hansson
Published 9 Mar 2010

Rework is not just smart and succinct but grounded in the concreteness of doing rather than hard-to-apply philosophizing. This book inspired me to trust myself in defying the status quo.” —Penelope Trunk, author of Brazen Careerist: The New Rules for Success “[This book’s] assumption is that an organization is a piece of software. Editable. Malleable. Sharable. Fault-tolerant. Comfortable in Beta. Reworkable. The authors live by the credo ‘keep it simple, stupid’ and Rework possesses the same intelligence—and irreverence—of that simple adage.” —John Maeda, author of The Laws of Simplicity “Rework is like its authors: fast-moving, iconoclastic, and inspiring. It’s not just for startups.

pages: 400 words: 94,847

Reinventing Discovery: The New Era of Networked Science
by Michael Nielsen
Published 2 Oct 2011

Nature, 442:981, August 31, 2006. [21] John Bohannon. Gamers unravel the secret life of protein. Wired, 17(5), April 20, 2009. http://www.wired.com/medtech/genetics/magazine/17-05/ff_protein?currentPage=all. [22] Parsa Bonderson, Sankar Das Sarma, Michael Freedman, and Chetan Nayak. A blueprint for a topologically fault-tolerant quantum computer. eprint arXiv:1003.2856, 2010. [23] Christine L. Borgman. Scholarship in the Digital Age. Cambridge, MA: MIT Press, 2007. [24] Kirk D. Borne et al. Astroinformatics: A 21st century approach to astronomy. eprint arXiv: 0909.3892, 2009. Position paper for Astro2010 Decadal Survey State, available at http://arxiv.org/abs/0909.3892

Edge: The Third Culture, 2006. http://www.edge.org/3rd_culture/kelly06/kelly06_index.html. [109] Kevin Kelly. What Technology Wants. New York: Viking, 2010. [110] Richard A. Kerr. Recently discovered habitable world may not exist. Science Now, October 12, 2010. http://news.sciencemag.org/sciencenow/2010/10/recently-discovered-habitable-world.html. [111] A. Yu Kitaev. Fault-tolerant quantum computation by anyons. Annals of Physics, 303(1):2–30, 2003. [112] Helge Kragh. Max Planck: The reluctant revolutionary. Physics World, December 2000. http://physicsworld.com/cws/article/print/373. [113] Greg Kroah-Hartman. The Linux kernel. Online video from Google Tech Talks. http://www.youtube.com/watch?

pages: 178 words: 33,275

Ansible Playbook Essentials
by Gourav Shah
Published 29 Jul 2015

About the Author Gourav Shah (www.gouravshah.com) has extensive experience in building and managing highly available, automated, fault-tolerant infrastructure and scaling it. He started his career as a passionate Linux and open source enthusiast, transformed himself into an operations engineer, and evolved to be a cloud and DevOps expert and trainer. In his previous avatar, Gourav headed IT operations for Efficient Frontier (now Adobe), India.

pages: 554 words: 108,035

Scala in Depth
by Tom Kleenex and Joshua Suereth
Published 2 Jan 2010

These aren’t discussed in the book, but can be found in Akka’s documentation at http://akka.io/docs/. This technique can be powerful when distributed and clustered. The Akka 2.0 framework is adding the ability to create actors inside a cluster and allow them to be dynamically moved around to machines as needed. 9.6. Summary Actors provide a simpler parallelization model than traditional locking and threading. A well-behaved actors system can be fault-tolerant and resistant to total system slowdown. Actors provide an excellent abstraction for designing high-performance servers, where throughput and uptime are of the utmost importance. For these systems, designing failure zones and failure handling behaviors can help keep a system running even in the event of critical failures.

So, while the Scala actors library is an excellent resource for creating actor applications, the Akka library provides the features and performance needed to make a production application. Akka also supports common features out of the box. Actors and actor-related system design is a rich subject. This chapter lightly covered a few of the key aspects to actor-related design. These should be enough to create a fault-tolerant, high-performance actor system. Next let’s look into a topic of great interest: Java interoperability with Scala. Chapter 10. Integrating Scala with Java In this chapter The benefits of using interfaces for Scala-Java interaction The dangers of automatic implicit conversions of Java types The complications of Java serialization in Scala How to effectively use annotations in Scala for Java libraries One of the biggest advantages of the Scala language is its ability to seamlessly interact with existing Java libraries and applications.

HBase: The Definitive Guide
by Lars George
Published 29 Aug 2011

You may have a background in relational database theory or you want to start fresh and this “column-oriented thing” is something that seems to fit your bill. You also heard that HBase can scale without much effort, and that alone is reason enough to look at it since you are building the next web-scale system. I was at that point in late 2007 when I was facing the task of storing millions of documents in a system that needed to be fault-tolerant and scalable while still being maintainable by just me. I had decent skills in managing a MySQL database system, and was using the database to store data that would ultimately be served to our website users. This database was running on a single server, with another as a backup. The issue was that it would not be able to hold the amount of data I needed to store for this new project.

The question is, wouldn’t it be good to trade relational features permanently for performance? You could denormalize (see the next section) the data model and avoid waits and deadlocks by minimizing necessary locking. How about built-in horizontal scalability without the need to repartition as your data grows? Finally, throw in fault tolerance and data availability, using the same mechanisms that allow scalability, and what you get is a NoSQL solution—more specifically, one that matches what HBase has to offer.

Database (De-)Normalization

At scale, it is often a requirement that we design schema differently, and a good term to describe this principle is Denormalization, Duplication, and Intelligent Keys (DDI).[20] It is about rethinking how data is stored in Bigtable-like storage systems, and how to make use of it in an appropriate way.
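The "Intelligent Keys" part of DDI can be illustrated with a small sketch. The layout below (a fixed-width user prefix followed by a reversed timestamp, so one user's newest rows sort first) is a common Bigtable/HBase idiom, but the specific widths and field names are assumptions for illustration, not a schema from the book.

```python
import struct

LONG_MAX = 2**63 - 1

def message_row_key(user_id: str, timestamp_ms: int) -> bytes:
    """Build an 'intelligent' row key: a fixed-width user prefix keeps one
    user's messages contiguous in the table, and (LONG_MAX - timestamp)
    makes newer messages sort first under byte-wise key ordering."""
    reversed_ts = LONG_MAX - timestamp_ms
    return user_id.encode("utf-8").ljust(16, b"\x00") + struct.pack(">q", reversed_ts)

newer = message_row_key("alice", 2_000_000)
older = message_row_key("alice", 1_000_000)
```

Because the store sorts rows lexicographically by key bytes, a scan from the user prefix returns that user's messages newest-first with no secondary index and no join.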

HDFS is the most used and tested filesystem in production. Almost all production clusters use it as the underlying storage layer. It is proven stable and reliable, so deviating from it may impose its own risks and subsequent problems. The primary reason HDFS is so popular is its built-in replication, fault tolerance, and scalability. Choosing a different filesystem should provide the same guarantees, as HBase implicitly assumes that data is stored in a reliable manner by the filesystem. It has no added means to replicate data or even maintain copies of its own storage files. This functionality must be provided by the lower-level system.

pages: 470 words: 109,589

Apache Solr 3 Enterprise Search Server
by Unknown
Published 13 Jan 2012

The distributed search of Solr doesn't adapt to real-time changes in indexing or query load and doesn't provide any failover support. SolrCloud is an ongoing effort to build fault-tolerant, centrally managed support for clusters of Solr instances, and is part of the trunk development path (Solr 4.0). SolrCloud introduces the idea that a logical collection of documents (otherwise known as an index) is distributed across a number of slices. Each slice is made up of shards, which are the physical pieces of the collection. In order to support fault tolerance, there may be multiple replicas of a shard distributed across different physical nodes. To keep all this data straight, Solr embeds Apache ZooKeeper as the centralized service for managing all configuration information for the cluster of Solr instances, including mapping which shards are available on which set of nodes of the cluster.

pages: 541 words: 109,698

Mining the Social Web: Finding Needles in the Social Haystack
by Matthew A. Russell
Published 15 Jan 2011

Sorting by date seems like a good idea and opens the door to certain kinds of time-series analysis, so let’s start there and see what happens. But first, we’ll need to make a small configuration change so that we can write our map/reduce functions to perform this task in Python. CouchDB is especially intriguing in that it’s written in Erlang, a language engineered to support super-high concurrency[16] and fault tolerance. The de facto out-of-the-box language you use to query and transform your data via map/reduce functions is JavaScript. Note that we could certainly opt to write map/reduce functions in JavaScript and realize some benefits from built-in JavaScript functions CouchDB offers—such as _sum, _count, and _stats.
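With a Python view server (such as couchdb-python) configured, CouchDB map/reduce functions can be written as plain Python functions. The sketch below only mimics that map/reduce shape locally rather than running inside CouchDB, and the `date` field and toy corpus are assumptions for illustration.

```python
def map_by_date(doc):
    """Map function: emit a (date, 1) row per document, so rows sort by date."""
    if "date" in doc:
        yield doc["date"], 1

def reduce_count(keys, values, rereduce=False):
    """Reduce function: the equivalent of CouchDB's built-in _count / _sum over 1s."""
    return sum(values)

# Locally simulate what the CouchDB view server does with these functions:
docs = [{"date": "2011-01-01"}, {"date": "2011-01-02"}, {"date": "2011-01-01"}]
rows = sorted(kv for doc in docs for kv in map_by_date(doc))
total = reduce_count([k for k, v in rows], [v for k, v in rows])
```

Inside CouchDB, the rows would be persisted in a view index sorted by key, which is what makes the date-sorted time-series queries described above cheap.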

But before we get too pie-in-the-sky, let’s back up for just a moment and reflect on how we got to where we are right now. The Internet is just a network of networks,[63] and what’s very fascinating about it from a technical standpoint is how layers of increasingly higher-level protocols build on top of lower-level protocols to ultimately produce a fault-tolerant worldwide computing infrastructure. In our online activity, we rely on dozens of protocols every single day, without even thinking about it. However, there is one ubiquitous protocol that is hard not to think about explicitly from time to time: HTTP, the prefix of just about every URL that you type into your browser, the enabling protocol for the extensive universe of hypertext documents (HTML pages), and the links that glue them all together into what we know as the Web.

pages: 210 words: 42,271

Programming HTML5 Applications
by Zachary Kessin
Published 9 May 2011

ws.onmessage { |msg| ws.send "Pong: #{msg}" }
ws.onclose { puts "WebSocket closed" }
end

Erlang Yaws

Erlang is a pretty rigorously functional language that was developed several decades ago for telephone switches and has found acceptance in many other areas where massive parallelism and strong robustness are desired. The language is concurrent, fault-tolerant, and very scalable. In recent years it has moved into the web space because all of the traits that make it useful in phone switches are very useful in a web server. The Erlang Yaws web server also supports web sockets right out of the box. The documentation can be found at the Web Sockets in Yaws web page, along with code for a simple echo server.

pages: 133 words: 42,254

Big Data Analytics: Turning Big Data Into Big Money
by Frank J. Ohlhorst
Published 28 Nov 2012

Big Data analytics requires that organizations choose the data to analyze, consolidate them, and then apply aggregation methods before the data can be subjected to the ETL process. This has to occur with large volumes of data, which can be structured, unstructured, or from multiple sources, such as social networks, data logs, web sites, mobile devices, and sensors. Hadoop accomplishes that by incorporating pragmatic processes and considerations, such as a fault-tolerant clustered architecture, the ability to move computing power closer to the data, parallel and/or batch processing of large data sets, and an open ecosystem that supports enterprise architecture layers from data storage to analytics processes. Not all enterprises require what Big Data analytics has to offer; those that do must consider Hadoop’s ability to meet the challenge.

pages: 179 words: 42,081

DeFi and the Future of Finance
by Campbell R. Harvey , Ashwin Ramachandran , Joey Santoro , Vitalik Buterin and Fred Ehrsam
Published 23 Aug 2021

VII. Risks

As we have emphasized in previous sections, DeFi allows developers to create new types of financial products and services, expanding the possibilities of financial technology. While DeFi can eliminate counterparty risk – cutting out intermediaries and allowing financial assets to be exchanged in a trustless way – all innovative technologies introduce a new set of risks. To provide users and institutions with a robust and fault-tolerant system capable of handling new financial applications at scale, we must confront and properly mitigate these risks; otherwise, DeFi will remain an exploratory technology, restricting its use, adoption, and appeal. The principal risks DeFi faces today are smart contract, governance, oracle, scaling, DEX, custodial, environmental, and regulatory.

Seeking SRE: Conversations About Running Production Systems at Scale
by David N. Blank-Edelman
Published 16 Sep 2018

As we close out, you should take the following points with you: Third parties are an extension of your stack, not ancillary. If it’s critical path, treat it like a service. Consider abandonment during the life cycle of an integration. The quality of your third-party integration depends on good communication. Contributor Bio Jonathan Mercereau has spent his career leading teams and architecting resilient, fault tolerant, and performant solutions working with the biggest players in DNS, CDN, Certificate Authority, and Synthetic Monitoring. Chances are, you’ve experienced the results of his work, from multi-CDN and optimizing streaming algorithms at Netflix to all multi-vendor solutions and performance enhancements at LinkedIn.

The result is that we now have systems that lack unique state. In such a world, reverting a software change can make the system take on a more familiar appearance, but it might not restore the world to the way it once was. Special Knowledge About Complex Systems The situation facing SREs is seldom simple. The fault-tolerance mechanisms built into the design of distributed systems and related automation handle most problems that arise. Because of this, incidents represent situations that fall outside of the “most problems” boundary. Reasoning about cause and effect here is often challenging. For example, simply observing that a process is failing does not necessarily mean that fixing that process will resolve the incident.

Operations groups often take on big projects to increase MTBF and decrease MTTR, usually at the level of hardware components, because this is the only level at which assumptions of rationality hold well enough for “mean time to anything” to be well defined. It’s certainly worthwhile for a team to optimize within one “accountability domain” like this. Even so, when you’re looking at the entire system, an increase in fault tolerance beyond “barely acceptable” tends to be immediately eaten up by another layer. Suppose that you have a distributed storage system that was deployed to tolerate three simultaneous disk failures in an array. Then the hardware team, full of gumption and wishing to be promoted, takes clever measures to “guarantee” that the array will never have more than one disk down.

pages: 377 words: 21,687

Digital Apollo: Human and Machine in Spaceflight
by David A. Mindell
Published 3 Apr 2008

Holliday, Will L., and Dale P. Hoffman. ‘‘Systems Approach to Flight Controls.’’ Astronautics (May 1962): 36–37, 74–80. Hong, Sungook. ‘‘Man and Machine in the 1960s.’’ Techne 7, no. 3 (2004): 49–77. Hopkins, Albert L. ‘‘A Fault-Tolerant Information Processing Concept for Space Vehicles.’’ Cambridge, Mass.: MIT Instrumentation Laboratory, 1970. Hopkins, Albert L. ‘‘A Fault-Tolerant Information Processing System for Advanced Control, Guidance, and Navigation.’’ Cambridge, Mass.: Charles Stark Draper Laboratories, 1970. Hopkins Jr., Albert L., Ramon Alonso, and Hugh Blair-Smith. ‘‘Logical Description for the Apollo Guidance Computer (AGC4).’’

pages: 271 words: 52,814

Blockchain: Blueprint for a New Economy
by Melanie Swan
Published 22 Jan 2014

Consensus without mining is another area being explored, such as in Tendermint’s modified version of DLS (the solution to the Byzantine Generals’ Problem by Dwork, Lynch, and Stockmeyer), with bonded coins belonging to byzantine participants.184 Another idea for consensus without mining or proof of work is through a consensus algorithm such as Hyperledger’s, which is based on the Practical Byzantine Fault Tolerance algorithm. Only focus on the most recent or unspent outputs Many blockchain operations could be based on surface calculations of the most recent or unspent outputs, similar to how credit card transactions operate. “Thin wallets” operate this way, as opposed to querying a full Bitcoind node, and this is how Bitcoin ewallets work on cellular telephones.

Beautiful Data: The Stories Behind Elegant Data Solutions
by Toby Segaran and Jeff Hammerbacher
Published 1 Jul 2009

Acknowledgments
Thanks to Darius Bacon, Thorsten Brants, Andy Golding, Mark Paskin, Franco Salvetti, and Casey Whitelaw for comments, corrections, and code.

Chapter 15. Life in Data: The Story of DNA
Matt Wood and Ben Blackburne

DNA is a biological building block, a concise, schema-less, fault-tolerant database of an organism’s chemical makeup, designed and implemented by a population over millions of years. Over the past 20 years, biologists have begun to move from the study of individual genes to whole genomes, with genomic approaches forming an increasingly large part of modern biomedical research.

It is written in the molecules of DNA, copies of which are stored in each cell of the human body (with a few exceptions). This pattern is repeated across nature, right down to the simplest forms of life. The information encoded within the genome contains the directions to build the proteins that make up the molecular machinery that runs the chemistry of the cell. Now that’s what I call fault-tolerant and redundant storage. Almost every cell in your body contains a central data center, which stores these genomic databases, called the nucleus. Within this are the chromosomes. Like all humans, you are diploid, with two copies of each chromosome, one from your father and one from your mother.

Programming Android
by Zigurd Mednieks , Laird Dornin , G. Blake Meike and Masumi Nakamura
Published 15 Jul 2011

SQLite Android uses the SQLite database engine, a self-contained, transactional database engine that requires no separate server process. Many applications and environments beyond Android make use of it, and a large open source community actively develops SQLite. In contrast to desktop-oriented or enterprise databases, which provide a plethora of features related to fault tolerance and concurrent access to data, SQLite aggressively strips out features that are not absolutely necessary in order to achieve a small footprint. For example, many database systems use static typing, but SQLite does not store database type information. Instead, it pushes the responsibility of keeping type information into high-level languages, such as Java, that map database structures into high-level types.

For a given transaction, SQLite does not modify the database until all statements in the transaction have completed successfully. Given the volatility of the Android mobile environment, we recommend that in addition to meeting the needs for consistency in your app, you also make liberal use of transactions to support fault tolerance in your application. Example Database Manipulation Using sqlite3 Now that you understand the basics of SQL as it pertains to SQLite, let’s have a look at a simple database for storing video metadata using the sqlite3 command-line tool and the Android debug shell, which you can start by using the adb command.
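The all-or-nothing behavior described above is easy to demonstrate with Python's built-in sqlite3 module (Android exposes the same engine through its Java APIs; the `video` table here is just a stand-in for the chapter's metadata example).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE video (id INTEGER PRIMARY KEY, title TEXT NOT NULL)")

try:
    with conn:  # opens a transaction; commits on success, rolls back on exception
        conn.execute("INSERT INTO video (title) VALUES (?)", ("intro.mp4",))
        conn.execute("INSERT INTO video (title) VALUES (?)", (None,))  # violates NOT NULL
except sqlite3.IntegrityError:
    pass  # the whole transaction was rolled back, including the first insert

count = conn.execute("SELECT COUNT(*) FROM video").fetchone()[0]
```

Because the second insert fails, neither row is persisted, which is exactly the property that makes liberal use of transactions valuable on a device that may lose power mid-write.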

pages: 211 words: 58,677

Philosophy of Software Design
by John Ousterhout
Published 28 Jan 2018

In a distributed system, network packets may be lost or delayed, servers may not respond in a timely fashion, or peers may communicate in unexpected ways. The code may detect bugs, internal inconsistencies, or situations it is not prepared to handle. Large systems have to deal with many exceptional conditions, particularly if they are distributed or need to be fault-tolerant. Exception handling can account for a significant fraction of all the code in a system. Exception handling code is inherently more difficult to write than normal-case code. An exception disrupts the normal flow of the code; it usually means that something didn’t work as expected. When an exception occurs, the programmer can deal with it in two ways, each of which can be complicated.
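The two ways of dealing with an exception that the passage mentions, handling it locally versus passing it along, can be sketched briefly. This is a hypothetical config loader of my own invention, not an example from the book.

```python
import json

def load_config(text):
    """Option 1: deal with the exception here, recovering with a default."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return {}          # fall back to an empty configuration

def require_config(text):
    """Option 2: propagate the exception, adding context for the caller."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        raise ValueError(f"configuration is not valid JSON: {err}") from err
```

Either choice complicates the code: option 1 must pick a sensible recovery, and option 2 forces every caller to face the same decision one level up.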

pages: 208 words: 57,602

Futureproof: 9 Rules for Humans in the Age of Automation
by Kevin Roose
Published 9 Mar 2021

Which means that people with unusual combinations of skills—like a zoologist with a math degree, or a graphic designer who knows everything there is to know about folk music—will have an upper hand against AI. Another type of scarce work that will be hard to automate is work that involves rare or high-stakes situations with low fault tolerance. Most AI learns in an iterative way—that is, it repeats a task over and over again, getting it a little more right each time. But in the real world, we don’t always have time to run a thousand tests, and we know, intuitively, that there are things that are too important to entrust to machines.

The Dream Machine: J.C.R. Licklider and the Revolution That Made Computing Personal
by M. Mitchell Waldrop
Published 14 Apr 2001

He and his colleagues would have to give up every engineer's first instinct, which was to control things so that problems could not happen, and instead design a system that was guaranteed to fail, but that would keep running anyhow. Nowadays this is known as a fault-tolerant system, and designing one is still considered a cutting-edge challenge. It means giving the system some of the same quality possessed by a superbly trained military unit, or a talented football team, or, for that matter, any living organism: namely, an ability to react to the unexpected. But in the early 1960s, with CTSS, Corbato and his colleagues had to pioneer fault-tolerant design even as they were pioneering time-sharing itself. For example, among their early innovations were "firewalls," or software barriers that kept each user's area of computer memory isolated from its neighbors, so that a flameout in one program wouldn't necessarily consume the others.

From the book’s index: fault-tolerant systems, 234; firewalls, 234.

pages: 247 words: 60,543

The Currency Cold War: Cash and Cryptography, Hash Rates and Hegemony
by David G. W. Birch
Published 14 Apr 2020

This means that it can take a while to establish consensus, which can remain probabilistic for some time (with Bitcoin, for example, people generally wait for an hour or so in order to see which chain has ‘won’). Nevertheless, the science of consensus protocols is well known, highly developed and widely used to create alternatives (Kravchenko et al. 2018). In particular, cryptographers have been exploring what are known as Byzantine fault tolerant (BFT) protocols that use rounds of voting between participants to agree on the state of the ledger (or anything else). These protocols function as long as no more than a third of the participants are malicious. Thus, they work well when the number of participants is limited, so the voting overhead is not so great, although there are variations that allow for much larger groups of participants to interact, such as the Federated Byzantine Agreement (FBA) used in Stellar.
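The "no more than a third malicious" rule corresponds to the classic BFT requirement that n >= 3f + 1 for f Byzantine participants. A small sketch of the arithmetic (my formulation, not taken from the book):

```python
def bft_bounds(n: int) -> tuple[int, int]:
    """For n voters: f is the most Byzantine (malicious) participants the
    protocol tolerates, and q is the vote quorum, sized so that any two
    quorums overlap in at least one honest participant."""
    f = (n - 1) // 3    # the classic requirement n >= 3f + 1
    q = n - f           # equals 2f + 1 when n = 3f + 1
    return f, q
```

So a four-node network tolerates one malicious node with quorums of three, while tolerating two requires seven nodes, which is why the voting overhead grows quickly with the participant count.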

pages: 265 words: 60,880

The Docker Book
by James Turnbull
Published 13 Jul 2014

This provides the information needed, for example an IP address or port or both, to allow interaction between services. Our example service discovery tool, Consul, is a specialized datastore that uses consensus algorithms. Consul specifically uses the Raft consensus algorithm to require a quorum for writes. It also exposes a key value store and service catalog that is highly available, fault-tolerant, and maintains strong consistency guarantees. Services can register themselves with Consul and share that registration information in a highly-available and distributed manner. Consul is also interesting because it provides: A service catalog with an API instead of the traditional key=value store of most service discovery tools.
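Services register themselves through the Consul agent's HTTP API (a PUT to /v1/agent/service/register). The sketch below only builds the request payload; the service name, port, and check interval are arbitrary, and while the field names follow Consul's agent API, you should verify them against your Consul version.

```python
import json

def registration_payload(name: str, port: int, health_url: str) -> str:
    """JSON body for Consul's PUT /v1/agent/service/register endpoint,
    including an HTTP health check polled on a fixed interval."""
    return json.dumps({
        "Name": name,
        "Port": port,
        "Check": {"HTTP": health_url, "Interval": "10s"},
    })

payload = registration_payload("web", 8080, "http://localhost:8080/health")
# To actually register (requires a running Consul agent on the default port):
#   requests.put("http://localhost:8500/v1/agent/service/register", data=payload)
```

Once registered, other services discover "web" through Consul's catalog or its DNS interface, which is the interaction the paragraph above describes.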

pages: 391 words: 71,600

Hit Refresh: The Quest to Rediscover Microsoft's Soul and Imagine a Better Future for Everyone
by Satya Nadella , Greg Shaw and Jill Tracie Nichols
Published 25 Sep 2017

The same can be said of a dozen other areas in which technology is “stuck”—high temperature superconductors, energy efficient fertilizer production, string theory. A quantum computer would allow a new look at our most compelling problems. Computer scientist Krysta Svore is at the heart of our quest to solve problems on a quantum computer. Krysta received her PhD from Columbia University focusing on fault tolerance and scalable quantum computing, and she spent a year at MIT working with an experimentalist designing the software needed to control a quantum computer. Her team is designing an exotic software architecture that assumes our math, physics, and superconducting experts succeed in building a quantum computer.

PostgreSQL Cookbook
by Chitij Chauhan
Published 30 Jan 2015

High Availability and Replication

In this chapter, we will cover the following recipes:
- Setting up hot streaming replication
- Replication using Slony-I
- Replication using Londiste
- Replication using Bucardo
- Replication using DRBD
- Setting up the Postgres-XC cluster

Introduction

The important requirements for any production database are fault tolerance, 24/7 availability, and redundancy. It is for this purpose that we have different high availability and replication solutions available for PostgreSQL. From a business perspective, it is important to ensure 24/7 data availability in the event of a disaster situation or a database crash due to disk or hardware failure.
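For the first of those recipes, setting up hot streaming replication, the moving parts on each side look roughly like this. The parameter names are PostgreSQL 9.x era settings, and the host and user are placeholders; treat this as a sketch, not a complete recipe.

```
# postgresql.conf on the primary (PostgreSQL 9.x era settings)
wal_level = hot_standby
max_wal_senders = 3

# recovery.conf on the standby
standby_mode = 'on'
primary_conninfo = 'host=primary_host port=5432 user=replicator'

# postgresql.conf on the standby, to allow read-only queries during recovery
hot_standby = on
```

The primary streams WAL records to the standby over the replication connection; if the primary is lost, the standby can be promoted, which is what provides the fault tolerance discussed above.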

pages: 218 words: 68,648

Confessions of a Crypto Millionaire: My Unlikely Escape From Corporate America
by Dan Conway
Published 8 Sep 2019

It was a delightfully odd group, without a single big personality pushing business cards. The first question was something like this: “Vitalik, don’t you think that the Byzantine general’s dilemma could be exploited by the various geographic nodes in a proof of stake architecture? Is there a way to compile the blockchain that is fault tolerant and aligns incentives with the miners?” I had no idea what they were talking about. I especially didn’t understand Vitalik’s response, which he delivered in an even voice seasoned with small bursts of energy, as if he were connected to a gentle electrical current that gave his face a stutter step every so often.

pages: 205 words: 71,872

Whistleblower: My Journey to Silicon Valley and Fight for Justice at Uber
by Susan Fowler
Published 18 Feb 2020

During my Engucation classes, I tried to wrap my head around what the computing infrastructure underneath all of these applications actually looked like. It was that infrastructure—the servers, the operating systems, the networks, and all of the code that connected the applications together—that I would be working on, that I would need to make better, more reliable, and more fault-tolerant. After Engucation came more specialized training to prepare new hires for their particular roles within the company. New data scientists spent time with their data science teams, front-end developers learned how to work with the front-end code, and I would embed with one of the site reliability engineering (SRE) teams and learn the basics before I could join my permanent team.

pages: 227 words: 63,186

An Elegant Puzzle: Systems of Engineering Management
by Will Larson
Published 19 May 2019

Raft is used by etcd14 and influxDB15 among many others. “Paxos Made Simple” One of Leslie Lamport’s numerous influential papers, “Paxos Made Simple” is a gem, both in explaining the notoriously complex Paxos algorithm and because, even at its simplest, Paxos isn’t really that simple: The Paxos algorithm for implementing a fault-tolerant distributed system has been regarded as difficult to understand, perhaps because the original presentation was Greek to many readers. In fact, it is among the simplest and most obvious of distributed algorithms. At its heart is a consensus algorithm—the “synod” algorithm. The next section shows that this consensus algorithm follows almost unavoidably from the properties we want it to satisfy.
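The synod algorithm's two phases can be sketched in a few dozen lines. This Python sketch ignores message loss, timing, and concurrently competing proposers, and the class and function names are mine, so it illustrates the shape of the protocol rather than a production implementation.

```python
class Acceptor:
    def __init__(self):
        self.promised = 0      # highest proposal number promised in phase 1
        self.accepted = None   # (number, value) of the highest accepted proposal

    def prepare(self, n):
        """Phase 1b: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, value):
        """Phase 2b: accept unless a higher-numbered prepare was promised."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False

def propose(acceptors, n, value):
    """One proposer round of the synod algorithm (no failures simulated)."""
    # Phase 1: gather promises from a majority.
    promises = [a.prepare(n) for a in acceptors]
    granted = [acc for ok, acc in promises if ok]
    if len(granted) <= len(acceptors) // 2:
        return None
    # If any acceptor already accepted a value, adopt the highest-numbered one.
    prior = [acc for acc in granted if acc is not None]
    if prior:
        value = max(prior)[1]
    # Phase 2: ask the acceptors to accept; succeed on a majority.
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks > len(acceptors) // 2 else None

acceptors = [Acceptor() for _ in range(5)]
first = propose(acceptors, n=1, value="A")
second = propose(acceptors, n=2, value="B")  # must learn "A", not "B"
```

The second round is the heart of the safety argument: because its majority of promises must overlap the majority that accepted "A", the new proposer is forced to re-propose the already-chosen value.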

pages: 661 words: 185,701

The Future of Money: How the Digital Revolution Is Transforming Currencies and Finance
by Eswar S. Prasad
Published 27 Sep 2021

Moreover, the reserve would feature a “loss-absorbing capital buffer,” meaning that, to offset any doubts about the extent of the backing, the stablecoins would be matched more than one-to-one by the stock of fiat currencies held in reserve. The Libra project also includes some technical innovations. It employs a new programming language, Move, which is designed to keep the Libra blockchain secure while allowing for the use of specific types of smart contracts. The blockchain is Byzantine fault tolerant, which means that its integrity cannot be compromised by a small number of malicious nodes (other consensus mechanisms such as Proof of Work have this property as well). The consensus protocol ensures transaction finality, is more energy-efficient than Proof of Work, and allows the network to function properly even if nearly one-third of the validator nodes fail or are compromised.

Vitalik Buterin, cofounder of Ethereum DeFi relies on smart contract blockchains, of which Ethereum is by far the most widely used. The Bitcoin blockchain, as noted earlier, does not have smart contract capabilities. Vitalik Buterin, a wunderkind who is a cofounder of Ethereum (and is a college dropout, need you ask?), has argued that decentralization confers many advantages over traditional financial systems. One is fault tolerance—failure is less likely because such a system relies on many separate components. Another is attack resistance—there is no central point, such as a major financial institution or centralized exchange, that is vulnerable to attack. A third advantage is collusion resistance—it is difficult for participants in a large decentralized system to collude; corporations and governments, by contrast, have the power to act in ways that might not necessarily benefit common people.

pages: 237 words: 76,486

Mars Rover Curiosity: An Inside Account From Curiosity's Chief Engineer
by Rob Manning and William L. Simon
Published 20 Oct 2014

My first job was as an apprentice electronics tester, helping run tests on what would become the brains of the Galileo spacecraft. I quickly discovered that building spacecraft included many extremely tedious jobs. After Galileo, I worked on Magellan (to Venus) and Cassini (to Saturn), becoming expert in the design of spacecraft computers, computer memory, computer architectures, and fault-tolerant systems. In 1993, after thirteen years at JPL, my career took a sudden leap forward. Brian Muirhead, the most inspiring and level-headed spacecraft leader I have ever met, had recently been named spacecraft manager for a funky little mission to Mars called Pathfinder. We had a conversation in which he explained that he was a master of mechanical systems but had not had much experience with electronics.

pages: 313 words: 75,583

Ansible for DevOps: Server and Configuration Management for Humans
by Jeff Geerling
Published 9 Oct 2015

Use a distributed file system, like Gluster, Lustre, Fraunhofer, or Ceph. Some options are easier to set up than others, and all have benefits—and drawbacks. Rsync, git, or NFS offer simple initial setup, and low impact on filesystem performance (in many scenarios). But if you need more flexibility and scalability, less network overhead, and greater fault tolerance, you will have to consider something that requires more configuration (e.g. a distributed file system) and/or more hardware (e.g. a SAN). GlusterFS is licensed under the AGPL license, has good documentation, and a fairly active support community (especially in the #gluster IRC channel). But to someone new to distributed file systems, it can be daunting to set it up the first time.

Mastering Structured Data on the Semantic Web: From HTML5 Microdata to Linked Open Data
by Leslie Sikos
Published 10 Jul 2015

It is a highly scalable, open source storage and computing platform [11]. Suitable for Big Data applications and selected for the Wikidata Query Service, Blazegraph is specifically designed to support big graphs, offering Semantic Web (RDF/SPARQL) and graph database (tinkerpop, blueprints, vertex-centric) APIs. The robust, scalable, fault-tolerant, enterprise-class storage and query features are combined with high availability, online backup, failover, and self-healing. Blazegraph features an ultra-high performance RDF graph database that supports RDFS and OWL Lite reasoning, as well as SPARQL 1.1 querying. Designed for huge amounts of information, the Blazegraph RDF graph database can load 1 billion graph edges in less than an hour on a 15-node cluster.

pages: 328 words: 77,877

API Marketplace Engineering: Design, Build, and Run a Platform for External Developers
by Rennay Dorasamy
Published 2 Dec 2021

I hope that this position inspires a reader out there to write a book on the “Dummy’s Guide to Persistent Storage on Kubernetes” and I promise to buy a copy. However, this decision was made after intense discussion and deliberation. The deciding factor that helped to settle the matter was that database management would never be a core function of our team. We could easily build a database container, but making it highly available and fault tolerant would make our container cluster configuration much more complex. Our Platform now uses an Enterprise provided service, and from observing the various areas of database management, I have no doubt that we made the right decision. To clarify, our team owns the data. The Enterprise service team owns the database.

pages: 283 words: 78,705

Principles of Web API Design: Delivering Value with APIs and Microservices
by James Higginbotham
Published 20 Dec 2021

Finally, don’t underestimate the effort required to separate a monolithic data store into a data store per service. Distributed Systems Challenges The journey toward microservices requires a deep understanding of distributed systems. Those not as familiar with the concepts of distributed tracing, observability, eventual consistency, fault tolerance, and failover will encounter a more difficult time with microservices. The eight fallacies of distributed computing, written in 1994 and still applicable today, must be understood by every developer. Additionally, many find that architectural oversight is required to initially decompose and subsequently integrate services into solutions.

pages: 619 words: 197,256

Apollo
by Charles Murray and Catherine Bly Cox
Published 1 Jan 1989

“There was only one engine bell, of course, and only one combustion chamber, but all the avionics that fed the signals to that engine and all the mechanical components that had to work, like the little valves that had to be pressurized to open the ball valves, and so forth, were at least single-fault tolerant and usually two-fault tolerant. . . . There were a heck of a lot of ways to start that engine.” And of course they had indeed checked it out carefully before the flight, but nothing they didn’t do for any other mission. All this was still correct as of Christmas Eve, 1968. And yet it ultimately didn’t make any difference to the way many of the people in Apollo felt.

pages: 1,380 words: 190,710

Building Secure and Reliable Systems: Best Practices for Designing, Implementing, and Maintaining Systems
by Heather Adkins , Betsy Beyer , Paul Blankinship , Ana Oprea , Piotr Lewandowski and Adam Stubblefield
Published 29 Mar 2020

Your risk assessment may vary depending on where your organization’s assets are located. For example, a site in Japan or Taiwan should account for typhoons, while a site in the Southeastern US should account for hurricanes. Risk ratings may also change as an organization matures and incorporates fault-tolerant systems, like redundant internet circuits and backup power supplies, into its systems. Large organizations should perform risk assessments on both global and per-site levels, and review and update these assessments periodically as the operating environment changes. Equipped with a risk assessment that identifies which systems need protection, you’re ready to create a response team prepared with tools, procedures, and training.

The IR team should have read access to logs for analysis and event reconstruction, as well as access to tools for analyzing data, sending reports, and conducting forensic examinations. Configuring Systems You can make a number of adjustments to systems before a disaster or incident to reduce an IR team’s initial response time. For example: Build fault tolerance into local systems and create failovers. For more information on this topic, see Chapters 8 and 9. Deploy forensic agents, such as GRR agents or EnCase Remote Agents, across the network with logs enabled. This will aid both your response and later forensic analysis. Be aware that security logs may require a lengthy retention period, as discussed in Chapter 15 (the industry average for detecting intrusions is approximately 200 days, and logs deleted before an incident is detected cannot be used to investigate it).

pages: 275 words: 84,980

Before Babylon, Beyond Bitcoin: From Money That We Understand to Money That Understands Us (Perspectives)
by David Birch
Published 14 Jun 2017

Ripple After Bitcoin and Ethereum, the third biggest cryptocurrency is Ripple, which unlike those first two has its roots in local exchange trading systems (Peck 2013). It is a protocol for value exchange that uses a shared ledger but it does not use a Bitcoin-like blockchain, preferring instead what is known as a ‘Byzantine fault-tolerant consensus-forming process’. Ripple signs every transaction that parties submit to the network with a digital signature. Each user selects a list, called a ‘unique node list’, comprising other users that it trusts as what are known as ‘validating nodes’. Each validating node independently verifies every proposed transaction within its network to determine if it is valid.
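The validation scheme this excerpt describes can be caricatured in a few lines of Python. This is an illustrative sketch only: the 80% quorum threshold and the data shapes are assumptions made for the example, not Ripple's actual wire protocol.

```python
def is_validated(tx_id, approvals_by_node, unl, quorum=0.8):
    """A transaction counts as validated once at least `quorum` of the
    nodes on this user's unique node list (UNL) have approved it."""
    approving = sum(1 for node in unl
                    if tx_id in approvals_by_node.get(node, set()))
    return approving >= quorum * len(unl)
```

Because each participant checks only against its own UNL, two users with different lists can in principle reach different conclusions, which is why overlap between UNLs matters in such designs.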

pages: 362 words: 86,195

Fatal System Error: The Hunt for the New Crime Lords Who Are Bringing Down the Internet
by Joseph Menn
Published 26 Jan 2010

Cerf, who has a generally upbeat tone about most things, gives the impression that he remains pleasantly surprised that the Internet has continued to function and thrive—even though, as he put it, “We never got to do the production engineering,” the version ready for prime time. Even after his years on the front line, Barrett found such statements amazing. “It’s incredibly disturbing,” he said. “The engine of the world economy is based on this really cool experiment that is not designed for security, it’s designed for fault-tolerance,” which is a system’s ability to withstand some failures. “You can reduce your risks, but the naughty truth is that the Net is just not a secure place for business or society.” Cerf listed a dozen things that could be done to make the Internet safer. Among them: encouraging research into “hardware-assisted security mechanisms,” limiting the enormous damage that Web browsers can wreak on operating systems, and hiring more and better trained federal cybercrime agents while pursuing international legal frameworks.

pages: 669 words: 210,153

Tools of Titans: The Tactics, Routines, and Habits of Billionaires, Icons, and World-Class Performers
by Timothy Ferriss
Published 6 Dec 2016

DEC was first in minicomputers. Many other computer companies (and their entrepreneurial owners) became rich and famous by following a simple principle: If you can’t be first in a category, set up a new category you can be first in. Tandem was first in fault-tolerant computers and built a $1.9 billion business. So Stratus stepped down with the first fault-tolerant minicomputer. Are the laws of marketing difficult? No, they are quite simple. Working things out in practice is another matter, however. Cray Research went over the top with the first supercomputer. So Convex put two and two together and launched the first mini supercomputer.

pages: 722 words: 90,903

Practical Vim: Edit Text at the Speed of Thought
by Drew Neil
Published 6 Oct 2012

I’ve borrowed the expressions in series and in parallel from the field of electronics to differentiate between two techniques for executing a macro multiple times. The technique for executing a macro in series is brittle. Like cheap Christmas tree lights, it breaks easily. The technique for executing a macro in parallel is more fault tolerant. Execute the Macro in Series Picture a robotic arm and a conveyor belt containing a series of items for the robot to manipulate (Figure 4, Vim's macros make quick work of repetitive tasks). Recording a macro is like programming the robot to do a single unit of work. As a final step, we instruct the robot to move the conveyor belt and bring the next item within reach.

pages: 350 words: 90,898

A World Without Email: Reimagining Work in an Age of Communication Overload
by Cal Newport
Published 2 Mar 2021

It made it clear that asynchronous communication complicates attempts to coordinate, and therefore, it’s almost always worth the extra cost required to introduce more synchrony. In the context of distributed systems, the added synchrony explored in the aftermath of this famous 1985 paper took several forms. One heavy-handed solution, used in some early fly-by-wire systems and fault-tolerant credit card transaction processing machines, was to connect the machines on a common electrical circuit, allowing them to operate at the same lockstep pace. This approach eliminates unpredictable communication delays and allows your application to immediately detect if a machine has crashed. Because these circuits were sometimes complicated to implement, software approaches to adding synchrony also became popular.

The Internet Trap: How the Digital Economy Builds Monopolies and Undermines Democracy
by Matthew Hindman
Published 24 Sep 2018

Shoenfeld, Z. (2017, June). MTV News—and other sites—are frantically pivoting to video. It won’t work. Newsweek. Retrieved from http://www.newsweek.com/mtv -news-video-vocativ-media-ads-pivot-630223. Shute, J., Oancea, M., Ellner, S., Handy, B., Rollins, E., Samwel, B.,. . . , Jegerlehner, B., et al. (2012). F1: the fault-tolerant distributed RDBMS supporting Google’s ad business. In Proceedings of the 2012 International Conference on Management of Data, Scottsdale, AZ (pp. 777–78). ACM. Sifry, M. (2009, November). Critiquing Matthew Hindman’s “The Myth of Digital Democracy”. TechPresident. Retrieved from http://techpresident.com/blog-entry /critiquing-matthew-hindmans-myth-digital-democracy.

pages: 332 words: 93,672

Life After Google: The Fall of Big Data and the Rise of the Blockchain Economy
by George Gilder
Published 16 Jul 2018

Nick Tredennick and Paul Wu, Transaction Security, Cryptochain, and Chip Level Identity (Cupertino: Jonetix, 2018). See also Tredennick and Wu, “Transaction Security Begins With Chip Level Identity,” Int’l Conference on Internet Computing and Internet of Things, ICOMP, 2017. 8. Leemon Baird, “The Swirlds Hashgraph Consensus Algorithm: Fair, Fast, Byzantine Fault Tolerance,” Swirlds Tech Report, May 31, 2016, revised February 16, 2018. 9. Leemon Baird, Mance Harmon, and Paul Madsen, “Hedera: A Governing Council & Public Hashgraph Network: The Trust Layer of the Internet,” white paper V1.0, March 13, 2018, 22. 10. Ibid., 19. Epilogue: The New System of the World 1. 

Practical Vim, Second Edition (for Stefano Alcazi)
by Drew Neil

I’ve borrowed the expressions in series and in parallel from the field of electronics to differentiate between two techniques for executing a macro multiple times. The technique for executing a macro in series is brittle. Like cheap Christmas tree lights, it breaks easily. The technique for executing a macro in parallel is more fault tolerant. Execute the Macro in Series Picture a robotic arm and a conveyor belt containing a series of items for the robot to manipulate. Recording a macro is like programming the robot to do a single unit of work. As a final step, we instruct the robot to move the conveyor belt and bring the next item within reach.

Practical Vim
by Drew Neil

I’ve borrowed the expressions in series and in parallel from the field of electronics to differentiate between two techniques for executing a macro multiple times. The technique for executing a macro in series is brittle. Like cheap Christmas tree lights, it breaks easily. The technique for executing a macro in parallel is more fault tolerant. Execute the Macro in Series Picture a robotic arm and a conveyor belt containing a series of items for the robot to manipulate. Recording a macro is like programming the robot to do a single unit of work. As a final step, we instruct the robot to move the conveyor belt and bring the next item within reach.

pages: 352 words: 96,532

Where Wizards Stay Up Late: The Origins of the Internet
by Katie Hafner and Matthew Lyon
Published 1 Jan 1996

But imagine a local post office somewhere that decided to go it alone, making up its own rules for addressing, packaging, stamping, and sorting mail. Imagine if that rogue post office decided to invent its own set of ZIP codes. Imagine any number of post offices taking it upon themselves to invent new rules. Imagine widespread confusion. Mail handling begs for a certain amount of conformity, and because computers are less fault-tolerant than human beings, e-mail begs loudly. The early wrangling on the ARPANET over attempts to impose standard message headers was typical of other debates over computer industry standards that came later. But because the struggle over e-mail standards was one of the first sources of real tension in the community, it stood out.

Scratch Monkey
by Stross, Charles
Published 1 Jan 2011

I'm probably grinning like a corpse but I don't care -- she must know by now that blind people often smile. It's easier to grin than to frown; the facial muscles contract into a smirk more easily. Even when you're about to die. "It takes a lot of stress to unbalance a network processor the size of a small moon," she replies calmly; "it shows a remarkable degree of fault tolerance. As for physical assault, the automatic defences are still armed ... as they always have been. So if we want to take it for ourselves, we must overwhelm it by frontal assault, sending uploaded minds out into the simulation space until it overloads and drops into NP-stasis. They do that if you feed them faster than they can transfer capacity elsewhere, you know.

Data and the City
by Rob Kitchin,Tracey P. Lauriault,Gavin McArdle
Published 2 Aug 2017

Such data include public administrative records, operational management information, as well as that produced by sensors, transponders and cameras that make up the internet of things, smartphones, wearables, social media, loyalty cards and commercial sources. In many cases, cities are turning to big data technologies and their novel distributed computational infrastructure for the reliable and fault tolerant storage, analysis and dissemination of data from various sources. In such systems, processing is generally brought to the data, rather than bringing data to the processing. Since each organization uses different platforms, operating systems and software to generate and analyse data, data sharing mechanisms should ideally be provided as platform-independent services so that they can be utilized by various users for different purposes, for example, for research, business, improving existing services of city authorities and organizations, and for facilitating communication between people and policymakers.

pages: 648 words: 108,814

Solr 1.4 Enterprise Search Server
by David Smiley and Eric Pugh
Published 15 Nov 2009

There has been a fair amount of discussion on Solr mailing lists about setting up distributed Solr on a robust foundation that adapts to a changing environment. There has been some investigation regarding using Apache Hadoop, a platform for building reliable, distributed computing, as a foundation for Solr that would provide a robust fault-tolerant filesystem. Another interesting sub project of Hadoop is ZooKeeper, which aims to be a service for centralizing the management required by distributed applications. There has been some development work on integrating ZooKeeper as the management interface for Solr. Keep an eye on the Hadoop homepage for more information about these efforts at http://hadoop.apache.org/ and ZooKeeper at http://hadoop.apache.org/zookeeper/.

pages: 354 words: 26,550

High-Frequency Trading: A Practical Guide to Algorithmic Strategies and Trading Systems
by Irene Aldridge
Published 1 Dec 2009

New York–based MarketFactory provides a suite of software tools to help automated traders get an extra edge in the market, help their models scale, increase their fill ratios, reduce slippage, and thereby improve profitability (P&L). Chapter 18 discusses optimization of execution. Run-time risk management applications ensure that the system stays within prespecified behavioral and P&L bounds. Such applications may also be known as system-monitoring and fault-tolerance software. Mobile applications suitable for monitoring the performance of high-frequency trading systems alert administrators to any issues. Real-time third-party research can stream advanced information and forecasts. Legal, Accounting, and Other Professional Services Like any business in the financial sector, high-frequency trading needs to make sure that “all i’s are dotted and all t’s are crossed” in the legal and accounting departments.

pages: 406 words: 105,602

The Startup Way: Making Entrepreneurship a Fundamental Discipline of Every Enterprise
by Eric Ries
Published 15 Mar 2017

“We spent three or four weeks when the only visible thing we were doing was making everybody come to one place,” he recalls. “When things went wrong, we just went and found the person who was responsible.” In addition, the site architecture was so bad that the slightest problem had the potential to knock the whole thing out. There was no way to track issues, and none of the fault tolerance or resistance that such a massive system should have had in place, as a matter of course, existed. Faced with this quagmire, the team asked a single question: “Why is the site not working on October 22?” Then they worked backward, applying the management and technological practices that by now should sound familiar: small teams, rapid iteration, accountability metrics, and a culture of transparency without fear of recrimination.

pages: 1,266 words: 278,632

Backup & Recovery
by W. Curtis Preston
Published 9 Feb 2009

In addition to these dbcc tasks, you need to choose a transaction log archive strategy. If you follow these tasks, you will help maintain the database, keeping it running smoothly and ready for proper backups. dbcc: The Database Consistency Checker Even though Sybase’s dataserver products are very robust and much effort has gone into making them fault-tolerant, there is always the chance that a problem will occur. For very large tables, some of these problems might not show until very specific queries are run. This is one of the reasons for the database consistency checker, dbcc. This set of SQL commands can review all the database page allocations, linkages, and data pointers, finding problems and, in many cases, fixing them before they become insurmountable.

(As of this writing, the MySQL team is developing other ACID-compliant storage engines.) With PostgreSQL, all data is stored in an ACID-compliant fashion. PostgreSQL also offers sophisticated features such as point-in-time recovery, tablespaces, checkpoints, hot backups, and write-ahead logging for fault tolerance. These are all very good things from a data-protection and data-integrity standpoint. PostgreSQL Architecture From a power-user standpoint, PostgreSQL is like any other database. The following terms mean essentially the same in PostgreSQL as they do in any other relational database: Database Table Index Row Attribute Extent Partition Transaction Clusters A PostgreSQL cluster is analogous to an instance in other RDBMSs, and each cluster works with one or more databases.

pages: 302 words: 82,233

Beautiful security
by Andy Oram and John Viega
Published 15 Dec 2009

With this dynamic control and command infrastructure, the botnet owner can mobilize a massive amount of computing resources from one corner of the Internet to another within a matter of minutes. It should be noted that the control server itself might not be static. Botnets have evolved from a static control infrastructure to a peer-to-peer structure for the purposes of fault tolerance and evading detection. When one server is detected and blocked, other servers can step in and take over. It is also common for the control server to run on a compromised machine or by proxy, so that the botnet’s owner is unlikely to be identified. Botnets commonly communicate through the same method as their creators’ public IRC servers.

pages: 424 words: 114,905

Deep Medicine: How Artificial Intelligence Can Make Healthcare Human Again
by Eric Topol
Published 1 Jan 2019

The K supercomputer in Japan, by contrast, requires about 10 megawatts of power and occupies more than 1.3 million liters.56 Where our brain’s estimated 100 billion neurons and 100 trillion connections give it a high tolerance for failure—not to mention its astonishing ability to learn both with and without a teacher, from very few examples—even the most powerful computers have poor fault tolerance for any lost circuitry, and they certainly require plenty of programming before they can begin to learn, and then only from millions of examples. Another major difference is that our brain is relatively slow, with computation speeds 10 million times slower than machines, so a machine can respond to a stimulus much faster than we can.

pages: 587 words: 117,894

Cybersecurity: What Everyone Needs to Know
by P. W. Singer and Allan Friedman
Published 3 Jan 2014

One is the importance of building in “the intentional capacity to work under degraded conditions.” Beyond that, resilient systems must also recover quickly, and, finally, learn lessons to deal better with future threats. For decades, most major corporations have had business continuity plans for fires or natural disasters, while the electronics industry has measured what it thinks of as fault tolerance, and the communications industry has talked about reliability and redundancy in its operations. All of these fit into the idea of resilience, but most assume some natural disaster, accident, failure, or crisis rather than deliberate attack. This is where cybersecurity must go in a very different direction: if you are only thinking in terms of reliability, a network can be made resilient merely by creating redundancies.

pages: 461 words: 125,845

This Machine Kills Secrets: Julian Assange, the Cypherpunks, and Their Fight to Empower Whistleblowers
by Andy Greenberg
Published 12 Sep 2012

The geekery had gotten so thick that even some of Tor’s modern-day cypherpunks and volunteer coders, loath as they might have been to admit it, might just have gotten lost. Within minutes, Mathewson, wearing a sport jacket over a Tor T-shirt over a dwarfish potbelly, was delving into security issues like “epistemic attacks” and “Byzantine fault tolerances.” By the time he sat down, still grinning, a growing fraction of the room seemed baffled or possibly bored. Appelbaum’s presence, on the other hand, is as much guerrilla as geek. He’s Tor’s field researcher, unofficial revolutionary, and man on the ground in countries from Qatar to Brazil.

pages: 482 words: 125,973

Competition Demystified
by Bruce C. Greenwald
Published 31 Aug 2016

TABLE 6.1 Compaq and Dell, 1990 and 1995 ($ million, costs as a percentage of sales) FIGURE 6.4 Compaq’s return on invested capital and operating income margin, 1990–2001 For a time, the approach was successful, as the company combined strong sales growth with decent operating margins and high return on invested capital (figure 6.4).* But ingrained cultures are difficult to uproot. The engineering mentality and love of technology that was part of Compaq’s tradition did not disappear, even after Rod Canion left. In 1997 the company bought Tandem Computers, a firm that specialized in producing fault-tolerant machines designed for uninterruptible transaction processing. A year later it bought Digital Equipment Corporation, a former engineering star in the computing world which had fallen from grace as its minicomputer bastion was undermined by the personal computer revolution. At the time of the purchase, Compaq wanted DEC for its consulting business, its AltaVista Internet search engine, and some in-process research.

Autonomous Driving: How the Driverless Revolution Will Change the World
by Andreas Herrmann , Walter Brenner and Rupert Stadler
Published 25 Mar 2018

The more traffic situations these algorithms are exposed to, the better prepared they are to master a new situation. Designing this training process so that the accuracy demanded by Jen-Hsun Huang is obtained will be the crucial challenge in the development of autonomous vehicles. When discussing what fault tolerance might be acceptable, it should be borne in mind that people are more likely to forgive mistakes made by other people than mistakes made by machines. This also applies to driving errors, which are more likely to be overlooked if they were committed by a driver and not by a machine. This means that autonomous vehicles will only be accepted if they cause significantly fewer errors than the drivers.

pages: 448 words: 117,325

Click Here to Kill Everybody: Security and Survival in a Hyper-Connected World
by Bruce Schneier
Published 3 Sep 2018

In 2017, traffic to and from several major US ISPs was briefly routed to an obscure Russian Internet provider. And don’t think this kind of attack is limited to nation-states; a 2008 talk at the DefCon hackers conference showed how anyone can do it. When the Internet was developed, what security there was focused on physical attacks against the network. Its fault-tolerant architecture can handle servers and connections failing or being destroyed. What it can’t handle is systemic attacks against the underlying protocols. The base Internet protocols were developed without security in mind, and many of them remain insecure to this day. There’s no security in the “From” line of an e-mail: anyone can pretend to be anyone.

pages: 571 words: 124,448

Building Habitats on the Moon: Engineering Approaches to Lunar Settlements
by Haym Benaroya
Published 12 Jan 2018

“Despite being critical to the reliability of redundant systems, however, mediating systems cannot be redundant themselves, as then they would need mediating, leading to an infinite regress. ‘The daunting truth,’ to quote a 1993 report to the FAA, ‘is that some of the core [mediating] mechanisms in fault-tolerant systems are single points of failure: they just have to work correctly’.” The assumption of independence of elements in the system, whether for purposes of redundancy or as part of the system model, can also be a cause of failure. Interdependencies (correlations) exist in complex systems at the least because they are operating in, and are driven by, the same environment.

pages: 960 words: 125,049

Mastering Ethereum: Building Smart Contracts and DApps
by Andreas M. Antonopoulos and Gavin Wood Ph. D.
Published 23 Dec 2018

While providing high availability, auditability, transparency, and neutrality, it also reduces or eliminates censorship and reduces certain counterparty risks. Compared to Bitcoin Many people will come to Ethereum with some prior experience of cryptocurrencies, specifically Bitcoin. Ethereum shares many common elements with other open blockchains: a peer-to-peer network connecting participants, a Byzantine fault–tolerant consensus algorithm for synchronization of state updates (a proof-of-work blockchain), the use of cryptographic primitives such as digital signatures and hashes, and a digital currency (ether). Yet in many ways, both the purpose and construction of Ethereum are strikingly different from those of the open blockchains that preceded it, including Bitcoin.

pages: 580 words: 125,129

Androids: The Team That Built the Android Operating System
by Chet Haase
Published 12 Aug 2021

Bob’s fix was to catch that failure condition and set the initial time on the phone to the day that he fixed the bug. Bob also tracked down a networking problem that was specific to mobile data. Android phones were experiencing severe outages that seemed like a problem with bad carrier network infrastructure. Networking protocols have built-in fault tolerance, because networks can go down, or packets of data can get lost or delayed. Android was using the congestion window approach in Linux that responds to an outage by halving the size of the data packet, and halving it again, and again, until it gets a response from the server that packets are going through.
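The recovery behavior described in this excerpt, repeatedly halving until traffic gets through, is the multiplicative-decrease half of TCP-style congestion control. A minimal sketch, purely illustrative (the class, sizes, and additive-increase step are invented for this example, not the Linux implementation):

```python
class CongestionWindow:
    """Toy model of TCP-style congestion control: additive increase,
    multiplicative decrease. Sizes are in segments."""

    def __init__(self, size=64):
        self.size = size

    def on_timeout(self):
        # An outage halves the window, but never below one segment.
        self.size = max(1, self.size // 2)

    def on_ack(self):
        # A successful round trip grows the window again, slowly.
        self.size += 1
```

Six consecutive timeouts take a 64-segment window down to a single segment, and recovery is only one segment per acknowledgment, which is how a flaky carrier network can leave a device crawling long after the outage ends.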

pages: 448 words: 71,301

Programming Scala
by Unknown
Published 2 Jan 2010

ScalaModules Scala DSL to ease OSGi development (http://code.google.com/p/scalamodules/). Configgy Managing configuration files and logging for “daemons” written in Scala (http://www.lag.net/configgy/). scouchdb Scala interface to CouchDB (http://code.google.com/p/scouchdb/). Akka A project to implement a platform for building fault-tolerant, distributed applications based on REST, Actors, etc. (http://akkasource.org/). scala-query A type-safe database query API for Scala (http://github.com/szeiger/scala-query/tree/master). We’ll discuss using Scala with several well-known Java libraries after we discuss Java interoperability, next. Java Interoperability Of all the alternative JVM languages, Scala’s interoperability with Java source code is among the most seamless.

pages: 559 words: 130,949

Learn You a Haskell for Great Good!: A Beginner's Guide
by Miran Lipovaca
Published 17 Apr 2011

ghci> solveRPN "2.7 ln"
0.9932517730102834
ghci> solveRPN "10 10 10 10 sum 4 /"
10.0
ghci> solveRPN "10 10 10 10 10 sum 4 /"
12.5
ghci> solveRPN "10 2 ^"
100.0

I think that making a function that can calculate arbitrary floating-point RPN expressions and has the option to be easily extended in 10 lines is pretty awesome. Note: This RPN calculation solution is not really fault tolerant. When given input that doesn’t make sense, it might result in a runtime error. But don’t worry, you’ll learn how to make this function more robust in Chapter 14. Heathrow to London Suppose that we’re on a business trip. Our plane has just landed in England, and we rent a car. We have a meeting really soon, and we need to get from Heathrow Airport to London as fast as we can (but safely!).
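The book's solveRPN is Haskell; a rough Python analogue reproduces both the short core and the fault-tolerance gap the excerpt mentions, patched here by returning None on malformed input instead of crashing. The `sum`, `ln`, and `^` operator names follow the transcript above; the rest of the design is an assumption for illustration.

```python
import math

def solve_rpn(expr):
    """Evaluate a postfix (RPN) expression; return None on bad input
    instead of raising, as a blunt form of fault tolerance."""
    ops = {
        "+": lambda a, b: a + b,
        "-": lambda a, b: a - b,
        "*": lambda a, b: a * b,
        "/": lambda a, b: a / b,
        "^": lambda a, b: a ** b,
    }
    stack = []
    try:
        for tok in expr.split():
            if tok in ops:
                b, a = stack.pop(), stack.pop()
                stack.append(ops[tok](a, b))
            elif tok == "sum":
                stack = [sum(stack)]       # collapse the whole stack
            elif tok == "ln":
                stack.append(math.log(stack.pop()))
            else:
                stack.append(float(tok))
        (result,) = stack                  # exactly one value must remain
        return result
    except (IndexError, ValueError, ZeroDivisionError):
        return None
```

The book does this properly in Chapter 14 with Haskell's type system; the try/except here is the blunt imperative equivalent of that safer handling.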

pages: 458 words: 137,960

Ready Player One
by Ernest Cline
Published 15 Feb 2011

In addition to restricting the overall size of their virtual environments, earlier MMOs had been forced to limit their virtual populations, usually to a few thousand users per server. If too many people were logged in at the same time, the simulation would slow to a crawl and avatars would freeze in midstride as the system struggled to keep up. But the OASIS utilized a new kind of fault-tolerant server array that could draw additional processing power from every computer connected to it. At the time of its initial launch, the OASIS could handle up to five million simultaneous users, with no discernible latency and no chance of a system crash. A massive marketing campaign promoted the launch of the OASIS.

AI 2041: Ten Visions for Our Future
by Kai-Fu Lee and Qiufan Chen
Published 13 Sep 2021

The IBM researchers acknowledge that control of errors caused by decoherence will get much worse the more qubits are added. To deal with this challenge, complex and fragile equipment must be built with new technologies and precision engineering. Also, decoherence errors will require each logical qubit to be represented by many physical qubits to provide stability, error correction, and fault tolerance. It is estimated that a QC will likely need a million or more physical qubits in order to deliver the performance of a 4,000 logical qubit QC. And even when a useful quantum computer is successfully demonstrated, mass production is another matter. Finally, quantum computers are programmed completely differently from classical computers, so new algorithms will need to be invented, and new software tools will need to be built.

pages: 474 words: 130,575

Surveillance Valley: The Rise of the Military-Digital Complex
by Yasha Levine
Published 6 Feb 2018

NRL Review was an in-house navy magazine that showcased all the cool gadgets cooked up by the lab over the previous year. D. M. Goldschlag, M. G. Reed, and P. F. Syverson, “Internet Communication Resistant to Traffic Analysis,” NRL Review, April 1997. 13. This last stage of development was funded by both the Office of Naval Research and DARPA under its Fault Tolerant Networks Program. The amount of the DARPA funding is unknown. “Onion Routing: Brief Selected History,” website formerly operated by the Center for High Assurance Computer Systems in the Information Technology Division of the US Naval Research Lab, 2005, accessed July 6, 2017, https://www.onion-router.net/History.html. 14.

pages: 528 words: 146,459

Computer: A History of the Information Machine
by Martin Campbell-Kelly and Nathan Ensmenger
Published 29 Jul 2013

Although computer technology is at the heart of the Internet, its importance is economic and social: the Internet gives computer users the ability to communicate, to gain access to information sources, and to conduct business. I. From the World Brain to the World Wide Web The Internet sprang from a confluence of three desires, two that emerged in the 1960s and one that originated much further back in time. First, there was the rather utilitarian desire for an efficient, fault-tolerant networking technology, suitable for military communications, that would never break down. Second, there was a wish to unite the world’s computer networks into a single system. Just as the telephone would never have become the dominant person-to-person communications medium if users had been restricted to the network of their particular provider, so the world’s isolated computer networks would be far more useful if they were joined together.

pages: 470 words: 144,455

Secrets and Lies: Digital Security in a Networked World
by Bruce Schneier
Published 1 Jan 2000

These definitions have always struck me as being somewhat circular. We know intuitively what we mean by availability with respect to computers: We want the computer to work when we expect it to as we expect it to. Lots of software doesn’t work when and as we expect it to, and there are entire areas of computer science research in reliability and fault-tolerant computing and software quality ... none of which has anything to do with security. In the context of security, availability is about ensuring that an attacker can’t prevent legitimate users from having reasonable access to their systems. For example, availability is about ensuring that denial-of-service attacks are not possible.

pages: 1,025 words: 150,187

ZeroMQ
by Pieter Hintjens
Published 12 Mar 2013

Rob Gagnon’s Story “We use ØMQ to assist in aggregating thousands of events occurring every minute across our global network of telecommunications servers so that we can accurately report and monitor for situations that require our attention. ØMQ made the development of the system not only easier, but faster to develop and more robust and fault-tolerant than we had originally planned in our original design. “We’re able to easily add and remove clients from the network without the loss of any message. If we need to enhance the server portion of our system, we can stop and restart it as well, without having to worry about stopping all of the clients first.

pages: 598 words: 134,339

Data and Goliath: The Hidden Battles to Collect Your Data and Control Your World
by Bruce Schneier
Published 2 Mar 2015

If systemic imperfections are inevitable, we have to accept them—in laws, in government institutions, in corporations, in individuals, in society. We have to design systems that expect them and can work despite them. If something is going to fail or break, we need it to fail in a predictable way. That’s resilience. In systems design, resilience comes from a combination of elements: fault-tolerance, mitigation, redundancy, adaptability, recoverability, and survivability. It’s what we need in the complex and ever-changing threat landscape I’ve described in this book. I am advocating for several flavors of resilience for both our systems of surveillance and our systems that control surveillance: resilience to hardware and software failure, resilience to technological innovation, resilience to political change, and resilience to coercion.

pages: 496 words: 154,363

I'm Feeling Lucky: The Confessions of Google Employee Number 59
by Douglas Edwards
Published 11 Jul 2011

"Build machines so cheap that we don't care if they fail. And if they fail, just ignore them until we get around to fixing them." That was Google's strategy, according to hardware designer Will Whitted, who joined the company in 2001. "That concept of using commodity parts and of being extremely fault tolerant, of writing the software in a way that the hardware didn't have to be very good, was just brilliant." But only if you could get the parts to fix the broken computers and keep adding new machines. Or if you could improve the machines' efficiency so you didn't need so many of them. The first batch of Google servers had been so hastily assembled that the solder points on the motherboards touched the metal of the trays beneath them, so the engineers added corkboard liners as insulation.

pages: 523 words: 143,139

Algorithms to Live By: The Computer Science of Human Decisions
by Brian Christian and Tom Griffiths
Published 4 Apr 2016

In this algorithm, each item is compared to all the others, generating a tally of how many items it is bigger than. This number can then be used directly as the item’s rank. Since it compares all pairs, Comparison Counting Sort is a quadratic-time algorithm, like Bubble Sort. Thus it’s not a popular choice in traditional computer science applications, but it’s exceptionally fault-tolerant. This algorithm’s workings should sound familiar. Comparison Counting Sort operates exactly like a Round-Robin tournament. In other words, it strongly resembles a sports team’s regular season—playing every other team in the division and building up a win-loss record by which they are ranked.
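The algorithm described in this excerpt can be sketched as follows (a minimal illustration assuming distinct values, not code from the book; the fault tolerance comes from a single bad comparison shifting an item's rank by at most one):

```typescript
// Comparison Counting Sort: compare every pair of items and rank each
// item by how many other items it is bigger than — like a Round-Robin
// tournament where each item "plays" every other item once.
function comparisonCountingSort(items: number[]): number[] {
  // each item's win count: how many other items it beats
  const wins = items.map(
    (x, i) => items.filter((y, j) => j !== i && x > y).length
  );
  // an item that beats k others belongs at index k of the result
  const result: number[] = new Array(items.length);
  items.forEach((x, i) => { result[wins[i]] = x; });
  return result;
}

console.log(comparisonCountingSort([5, 2, 9, 1, 7])); // [1, 2, 5, 7, 9]
```

Since every pair is compared, this performs n(n-1) comparisons, which is why it is quadratic-time like Bubble Sort.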

pages: 590 words: 152,595

Army of None: Autonomous Weapons and the Future of War
by Paul Scharre
Published 23 Apr 2018

Safety under these conditions requires something more than high-reliability organizations. It requires high-reliability fully autonomous complex machines, and there is no precedent for such systems. This would require a vastly different kind of machine from Aegis, one that was exceptionally predictable to the user but not to the enemy, and with a fault-tolerant design that defaulted to safe operations in the event of failures. Given the state of technology today, no one knows how to build a complex system that is 100 percent fail-safe. It is tempting to think that future systems will change this dynamic. The promise of “smarter” machines is seductive: they will be more advanced, more intelligent, and therefore able to account for more variables and avoid failures.

pages: 739 words: 174,990

The TypeScript Workshop: A Practical Guide to Confident, Effective TypeScript Programming
by Ben Grynhaus , Jordan Hudgens , Rayon Hunte , Matthew Thomas Morgan and Wekoslav Stefanovski
Published 28 Jul 2021

; }; const secondaryFn = async () => { console.log('Aye aye!'); }; const asyncFn = async () => { try { await primaryFn(); } catch (e) { console.warn(e); secondaryFn(); } }; asyncFn(); In this case, we just throw a warning and fall back to the secondary system because this program was designed to be fault-tolerant. It's still a good idea to log the warning so that we can trace how our system is behaving. One more variation of this for now. Let's put our try and catch blocks at the top level and rewrite our program like this:export const errorFN = async () => { throw new Error('An error has occurred!')
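The search snippet cuts off the start of this code, so a self-contained version of the fallback pattern it shows might look like the sketch below (primaryFn's body is truncated in the excerpt, so its failure is assumed here; the fallback result is returned so the caller can observe it):

```typescript
// Fault-tolerant fallback: if the primary system throws, log a warning
// and switch to the secondary system instead of crashing.
const primaryFn = async (): Promise<string> => {
  throw new Error('Primary system offline!'); // assumed failure for this sketch
};
const secondaryFn = async (): Promise<string> => {
  return 'Aye aye!';
};
const asyncFn = async (): Promise<string> => {
  try {
    return await primaryFn();
  } catch (e) {
    console.warn(e);      // keep a trace of how the system is behaving
    return secondaryFn(); // fall back to the secondary system
  }
};

asyncFn().then((msg) => console.log(msg)); // prints "Aye aye!"
```

Awaiting (or returning) the secondary call matters: in the excerpt's version, a rejection from secondaryFn() would escape the catch block as an unhandled promise rejection.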

pages: 552 words: 168,518

MacroWikinomics: Rebooting Business and the World
by Don Tapscott and Anthony D. Williams
Published 28 Sep 2010

To make it work, you’ll need to reveal your IP in an appropriate network, socializing it with participants and letting it spawn new knowledge and invention. You’ll need to stay plugged into the community so that you can leverage new contributions as they come in. You’ll also need to dedicate some resources to filtering and aggregating contributions. It can be a lot of work, but these types of collaborations can produce more robust, user-defined, fault-tolerant products in less time and for less expense than the conventional closed approach. 3. LET GO Leaders in business and society who are attempting to transform their organizations have many understandable concerns about moving forward. One of the biggest is a fear of losing control. I can’t open up, it’s too risky.

Turing's Cathedral
by George Dyson
Published 6 Mar 2012

“If the only demerit of the digital expansion system were its greater logical complexity, nature would not, for this reason alone, have rejected it,” von Neumann admitted in 1948.48 Search engines and social networks are analog computers of unprecedented scale. Information is being encoded (and operated upon) as continuous (and noise-tolerant) variables such as frequencies (of connection or occurrence) and the topology of what connects where, with location being increasingly defined by a fault-tolerant template rather than by an unforgiving numerical address. Pulse-frequency coding for the Internet is one way to describe the working architecture of a search engine, and PageRank for neurons is one way to describe the working architecture of the brain. These computational structures use digital components, but the analog computing being performed by the system as a whole exceeds the complexity of the digital code on which it runs.

pages: 604 words: 161,455

The Moral Animal: Evolutionary Psychology and Everyday Life
by Robert Wright
Published 1 Jan 1994

The iron horseshoe and the windpipe-friendly harness seem to have been invented in Asia and then to have leapt from person to person to person—maybe hitching a ride with nomads for a time—all the way to the Atlantic Ocean. One key to the resilience of this giant multicultural brain is its multiculturalness. No one culture is in charge, so no one culture controls the memes (though some try in vain). This decentralization makes epic social setbacks of reliably limited duration; the system is “fault-tolerant,” as computer engineers say. While Europe fell into its slough of despond, Byzantium and southern China stayed standing, India had ups and downs, and the newborn Islamic civilization flourished. These cultures performed two key services: inventing neat new things that would eventually spread into Europe (the spinning wheel probably arose somewhere in the Orient); and conserving useful old things that were now scarce in Europe (the astrolabe, a Greek invention, came to Europe via Islam, as did Ptolemy’s astronomy—which, though ultimately wrong, worked for navigational purposes).

Nonzero: The Logic of Human Destiny
by Robert Wright
Published 28 Dec 2010

The iron horseshoe and the windpipe-friendly harness seem to have been invented in Asia and then to have leapt from person to person to person—maybe hitching a ride with nomads for a time—all the way to the Atlantic Ocean. One key to the resilience of this giant multicultural brain is its multiculturalness. No one culture is in charge, so no one culture controls the memes (though some try in vain). This decentralization makes epic social setbacks of reliably limited duration; the system is “fault-tolerant,” as computer engineers say. While Europe fell into its slough of despond, Byzantium and southern China stayed standing, India had ups and downs, and the newborn Islamic civilization flourished. These cultures performed two key services: inventing neat new things that would eventually spread into Europe (the spinning wheel probably arose somewhere in the Orient); and conserving useful old things that were now scarce in Europe (the astrolabe, a Greek invention, came to Europe via Islam, as did Ptolemy’s astronomy—which, though ultimately wrong, worked for navigational purposes).

pages: 666 words: 181,495

In the Plex: How Google Thinks, Works, and Shapes Our Lives
by Steven Levy
Published 12 Apr 2011

Google’s first CIO, Douglas Merrill, once noted that the disk drives Google purchased were “poorer quality than you would put into your kid’s computer at home.” But Google designed around the flaws. “We built capabilities into the software, the hardware, and the network—the way we hook them up, the load balancing, and so on—to build in redundancy, to make the system fault-tolerant,” says Reese. The Google File System, written by Jeff Dean and Sanjay Ghemawat, was invaluable in this process: it was designed to manage failure by “sharding” data, distributing it to multiple servers. If Google search called for certain information at one server and didn’t get a reply after a couple of milliseconds, there were two other Google servers that could fulfill the request.
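The read path described here — ask one replica, and fall back to others if it is slow — can be sketched roughly like this (all names and the API shape are illustrative assumptions, not Google's actual code):

```typescript
// Hedged replica read: ask the primary replica for a shard, and if it
// hasn't answered within a short deadline, race the backup replicas
// and take whichever replies first.
type Fetch = () => Promise<string>;

const withTimeout = (p: Promise<string>, ms: number): Promise<string> =>
  Promise.race([
    p,
    new Promise<string>((_, reject) =>
      setTimeout(() => reject(new Error('timeout')), ms)
    ),
  ]);

async function hedgedRead(
  primary: Fetch,
  backups: Fetch[],
  deadlineMs: number
): Promise<string> {
  try {
    return await withTimeout(primary(), deadlineMs);
  } catch {
    // primary was slow or failed: any backup holding the shard can serve it
    return Promise.race(backups.map((b) => b()));
  }
}
```

For example, with a primary that takes 100 ms and a 10 ms deadline, `hedgedRead(slowPrimary, [fastBackup], 10)` resolves with the backup's answer; sharding data to multiple servers is what makes this fallback possible.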

pages: 798 words: 240,182

The Transhumanist Reader
by Max More and Natasha Vita-More
Published 4 Mar 2013

The goal of substrate-independence is to continue personality, individual characteristics, a manner of experiencing, and a personal way of processing those experiences (Koene 2011a, 2011b). Your identity, your memories can then be embodied physically in many ways. They can also be backed up and operate robustly on fault-tolerant hardware with redundancy schemes. Achieving substrate-independence will allow us to optimize the operational framework, the hardware, to challenges posed by novel circumstances and different environments. Think, instead of sending extremophile bacteria to slowly terraform another world into a habitat, we ourselves can be extremophiles.

pages: 778 words: 239,744

Gnomon
by Nick Harkaway
Published 18 Oct 2017

Effectively deployed bad practice under the System is a disaster. It would place the most absolute surveillance machine in history in the hands of villainous actors or mob instincts.’ ‘And you stop that from happening?’ ‘Oh no. Not us. The System itself, as designed by its original architects. Firespine is not a back door. It is a fault-tolerant architecture – a protocol of desperation. It adjusts where necessary, pushes people to vote when they are wise and not when they are foolish. It organises instants in time, perfect moments that unlock our better selves, serendipitous encounters to correct negative ones that make us less than we should be.

pages: 945 words: 292,893

Seveneves
by Neal Stephenson
Published 19 May 2015

If the Cloud Ark survived, it would survive on a water-based economy. A hundred years from now everything in space would be cooled by circulating water systems. But for now they had to keep the ammonia-based equipment running as well. Further complications, as if any were wanted, came from the fact that the systems had to be fault tolerant. If one of them got bashed by a hurtling piece of moon shrapnel and began to leak, it needed to be isolated from the rest of the system before too much of the precious water, or ammonia, leaked into space. So, the system as a whole possessed vast hierarchies of check valves, crossover switches, and redundancies that had saturated even Ivy’s brain, normally an infinite sink for detail.

pages: 1,164 words: 309,327

Trading and Exchanges: Market Microstructure for Practitioners
by Larry Harris
Published 2 Jan 2003

They must eliminate all single points of failure. Since failures are inevitable, given current technologies, markets also must invest in systems that allow them to recover from service interruptions. Markets—as well as brokers and dealers—employ many of the following processes to create reliable trading systems:
• They use fault-tolerant computer hardware.
• They build redundant computer systems.
• They build redundant network connections.
* * *
▶ Some Examples of the Risks of Trading Through Unreliable Data Networks
• A trader submits a limit order to an electronic market. After the order is accepted, but before it trades, the trader’s network connection fails.