Handling failures

In a meeting earlier today the question of handling machine failures was raised. Dealing with failures is obviously not something to take lightly, and there are several approaches available. It is always good, however, to first make the requirements explicit. What user experience do we want to provide? In some scenarios (like withdrawing money) it is not OK to fail, while in others (getting your friend’s latest Facebook update) some degree of failure is OK. In extreme cases, it may even be OK to tell the user that the service is unavailable, although that is more likely to cause frustration.

Secondly, one needs to think about what level of failure to handle. There’s a huge difference between handling machine failures and datacenter failures. Machine failures can be addressed within the application or with external hardware components. There’s also a design decision (or philosophical decision) to make about whether the system should be aware of what type of failure guarantees it can provide, i.e. whether it is cluster-aware or machine-aware. Each choice requires different semantics and consistency considerations.

In the service component that we’re building for generating recommendations on-the-fly, we can replicate the data model to increase the availability of recommendations. There are at least four possible replication schemes of varying complexity to consider:

  1. If the whole model fits in RAM, it can be replicated on all machines. Since the model is only updated once a day, at least initially, there are few consistency issues to worry about. As long as the remaining machines can handle all incoming requests, all but one machine may fail.
  2. If the whole model does not fit in RAM, it can be sharded and replicated among a subset of machines in the cluster. This could leave certain parts of the model unavailable for recommendations, but if the index used to keep track of the itemsets is kept up to date, the “most similar” items can still be served. Here, all but one machine in each replica subset may fail and some data will still be served (although the recommendations may not be the most accurate).
  3. An alternative to scheme two is to split the data such that only some users are affected by machine outages. This depends on how the model is split across the cluster and how the index of the itemsets is kept up to date.
  4. Finally, one alternative is to do no replication at all and simply serve static recommendations, or none, if a failure occurs.
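
As a rough illustration of how schemes 2 and 4 could be combined, the sketch below sends a request to any live replica of the shard holding an item, and falls back to a static list when every replica of that shard is down. Everything here is hypothetical and not taken from the actual service: the names `ShardedModel`, `STATIC_FALLBACK` and `build_example`, the modulo-based shard placement, and the in-memory dictionaries standing in for machines.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical static recommendations served when no replica is reachable (scheme 4).
STATIC_FALLBACK = ["popular-1", "popular-2"]


@dataclass
class ShardedModel:
    """Toy version of scheme 2: each shard is replicated on a subset of machines."""

    num_shards: int
    # shard id -> machines holding a replica of that shard
    replicas: dict[int, list[str]]
    # machine -> shard id -> item id -> recommended items
    data: dict[str, dict[int, dict[int, list[str]]]]
    # machines currently considered down
    failed: set[str] = field(default_factory=set)

    def shard_of(self, item_id: int) -> int:
        # Assumed placement rule: plain modulo sharding.
        return item_id % self.num_shards

    def recommend(self, item_id: int) -> list[str]:
        shard = self.shard_of(item_id)
        for machine in self.replicas[shard]:
            if machine in self.failed:
                continue  # skip machines that are down
            # First live replica answers; unknown items get the static list.
            return self.data[machine][shard].get(item_id, STATIC_FALLBACK)
        # Every replica of this shard is down: degrade instead of erroring.
        return STATIC_FALLBACK


def build_example() -> ShardedModel:
    """Three machines, two shards, each shard stored on two machines."""
    shard0 = {0: ["a", "b"], 2: ["c"]}
    shard1 = {1: ["d"], 3: ["e", "f"]}
    return ShardedModel(
        num_shards=2,
        replicas={0: ["m1", "m2"], 1: ["m2", "m3"]},
        data={
            "m1": {0: shard0},
            "m2": {0: shard0, 1: shard1},
            "m3": {1: shard1},
        },
    )
```

In this toy setup, marking `m1` as failed leaves item 0 served from `m2`. Marking `m2` as failed as well makes shard 0 unavailable, so item 0 gets the static list while items in shard 1 are still served from `m3`.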

Perhaps the question we should ask ourselves is: How little redundancy can we get away with?
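
One way to put a number on that question is a back-of-the-envelope calculation. The sketch below assumes that machines fail independently and that each is up with the same probability, which is a simplification (a datacenter failure takes out many machines at once). Under that assumption, data replicated on n machines is available with probability 1 - (1 - a)^n, where a is the availability of a single machine. The function name `min_replicas` and its parameters are made up for illustration.

```python
def min_replicas(machine_availability: float, target: float) -> int:
    """Smallest number of replicas n such that 1 - (1 - a)^n >= target.

    Assumes independent machine failures, each machine being up with
    probability `machine_availability` (a simplifying assumption).
    """
    if not 0.0 < machine_availability < 1.0:
        raise ValueError("machine_availability must be strictly between 0 and 1")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")

    n = 1
    p_all_down = 1.0 - machine_availability  # probability that all n replicas are down
    while 1.0 - p_all_down < target:
        n += 1
        p_all_down *= 1.0 - machine_availability
    return n
```

For example, with machines that are each up 90% of the time, `min_replicas(0.9, 0.995)` returns 3: two replicas give 99% availability, which falls short of the 99.5% target, while three give 99.9%.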

Posted on 20 Feb 2012 in notes
