When all else fails

So, first ever results from the recommendation engine running on five machines.

Not really what I expected...

What now? The graph above suggests that there are more failed requests than successfully answered ones. Not what I had in mind. Moreover, it already starts failing at around 90-100 requests/second, and the failures continue at higher rates. Looking at another graph (not posted), the response times are around 1 second, which is probably what causes the high number of failures, as the hard timeouts are configured to 1 second.
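To make the timeout reasoning concrete, here is a minimal sketch of how a hard timeout turns a slow answer into a failure. It is not the actual client or engine code: the class `HardTimeoutDemo`, the method `answeredInTime` and the sleep that stands in for the recommendation lookup are all made up for illustration. Only the 1 second limit comes from the set-up described above.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class HardTimeoutDemo {

    /**
     * Runs a simulated request that takes workMillis to answer and waits at
     * most timeoutMillis for it. Returns true if the answer arrived in time.
     */
    public static boolean answeredInTime(long workMillis, long timeoutMillis)
            throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<String> reply = executor.submit(() -> {
                Thread.sleep(workMillis); // hypothetical stand-in for the lookup
                return "recommendations";
            });
            try {
                reply.get(timeoutMillis, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                reply.cancel(true); // the caller gives up and counts a failure
                return false;
            }
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // A fast answer fits comfortably inside the 1 second limit...
        System.out.println("fast request ok: " + answeredInTime(10, 1000));
        // ...while an answer that takes slightly longer is counted as failed.
        System.out.println("slow request ok: " + answeredInTime(1200, 1000));
    }
}
```

If the typical response time sits right at the limit, small variations decide whether a request succeeds or fails, which would be consistent with a failure rate around one half.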

What is the cause? There may be several reasons, of course. I think the mistake I’ve made is getting ahead of myself. Here are a few reasons why the results may be so depressing compared to those taken earlier.

  1. I decoupled the HTTP interface into a separate Java application (running in a separate JVM). I didn’t want the REST interface to interfere with the Akka system containing the recommendation engine.
  2. Previously I had only tested performance with static HTTP requests, meaning every request was identical and routed to the same itemset. In the run above, each request is randomly generated. To do this I implemented my own JMeter Sampler. This was a simple exercise, but I’m not sure how much my implementation affects the timing results measured by JMeter. Maybe I’m doing something wrong?
  3. All requests are issued over the network. All machines, however, sit in the same rack and the normal round-trip time is about half a millisecond.
  4. There may be a design flaw in the HTTP interface. While working on the report yesterday I got the feeling that the HTTP server doesn’t handle requests concurrently. I.e., when a request is accepted and forwarded to the native interface, the server blocks until a result is returned and only then processes the next message. To be honest, the error is more likely in the native interface I’ve created than in the HTTP library.
  5. The design of sharing the workload between the nodes does not work. There is a bottleneck elsewhere that I’m missing. Potential suspects are the native interface, routing and the workers. Routing is a sequential part of the code base that I know can cause trouble under high load, but based on previous measurements I didn’t expect it to peak already at ~100 requests/second.
  6. I’m missing something else.
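The effect suspected in point 4 can be sketched in isolation. The example below is not the real HTTP server or native interface: `BlockingVsConcurrent`, `askEngine` and the sleep that simulates the engine call are hypothetical. It only shows how the same workload behaves when each request blocks the next one, compared to handing requests to a thread pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BlockingVsConcurrent {

    /** Hypothetical stand-in for the blocking call into the native interface. */
    static String askEngine(int requestId, long workMillis) throws InterruptedException {
        Thread.sleep(workMillis);
        return "result-" + requestId;
    }

    /** Handles requests one at a time; returns elapsed wall-clock milliseconds. */
    public static long handleSequentially(int requests, long workMillis) throws Exception {
        long start = System.nanoTime();
        for (int i = 0; i < requests; i++) {
            askEngine(i, workMillis); // the next request waits for this one
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    /** Hands each request to a thread pool; returns elapsed wall-clock milliseconds. */
    public static long handleConcurrently(int requests, long workMillis, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            long start = System.nanoTime();
            List<Callable<String>> tasks = new ArrayList<>();
            for (int i = 0; i < requests; i++) {
                final int id = i;
                tasks.add(() -> askEngine(id, workMillis));
            }
            for (Future<String> reply : pool.invokeAll(tasks)) {
                reply.get(); // invokeAll has already waited; get() surfaces failures
            }
            return (System.nanoTime() - start) / 1_000_000;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("sequential: " + handleSequentially(8, 50) + " ms");
        System.out.println("concurrent: " + handleConcurrently(8, 50, 8) + " ms");
    }
}
```

With blocking handling, the time for a request includes the time of every request queued ahead of it, so response times grow with the request rate even though each individual call is cheap. That would match a system that hits the timeout at a fairly low rate.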

Next step will be to try with a smaller set-up (1 node, 1 test machine) locally and see if I get similar results. If not, I’ll try to profile the time spent in the different parts of the application to see where that may lead.
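For the profiling step, a small timer that accumulates the time spent per stage could be a starting point. This is a rough sketch, not code from the project: the `StageTimer` class and the stage names are assumptions, and a real profiler would give far more detail.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

public class StageTimer {
    private final Map<String, Long> totalNanos = new LinkedHashMap<>();
    private final Map<String, Integer> calls = new LinkedHashMap<>();

    /** Runs the stage, adds its elapsed time to the stage total, returns its result. */
    public <T> T time(String stage, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            record(stage, System.nanoTime() - start);
        }
    }

    // Only the bookkeeping is synchronized, so timed stages can run in parallel.
    private synchronized void record(String stage, long nanos) {
        totalNanos.merge(stage, nanos, Long::sum);
        calls.merge(stage, 1, Integer::sum);
    }

    public synchronized int calls(String stage) {
        return calls.getOrDefault(stage, 0);
    }

    public synchronized long totalMillis(String stage) {
        return totalNanos.getOrDefault(stage, 0L) / 1_000_000;
    }

    /** One line per stage: number of calls and total time in milliseconds. */
    public synchronized String report() {
        StringBuilder out = new StringBuilder();
        for (String stage : totalNanos.keySet()) {
            out.append(String.format("%s: %d calls, %d ms total%n",
                    stage, calls.get(stage), totalNanos.get(stage) / 1_000_000));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        StageTimer timer = new StageTimer();
        // Hypothetical stages; the lambdas stand in for the real work.
        timer.time("native-interface", () -> "parsed request");
        timer.time("routing", () -> "itemset");
        timer.time("workers", () -> "recommendations");
        System.out.print(timer.report());
    }
}
```

Wrapping the native interface, the routing and the workers in `time(...)` calls would show which of them accounts for most of the one second, assuming the overhead of the timer itself is negligible compared to the stages.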

When all else fails…

Posted on 15 May 2012 in results
