Thesis by @mljungblad

Online recommendations

marcus@ljungblad.nu (Marcus Ljungblad) — Fri, 22 Jun 2012 00:00:00 -0700

Since January 12, 2012 I’ve been slowly working on my master’s thesis, called Online recommendations at web-scale using matrix factorisation. Today I successfully defended it and can happily say I’m satisfied with the results.

Over the course of the semester this blog has served as a place to vent ideas and clarify problems for myself. Perhaps most of all it has been an experiment where I could document my progress. I wanted to, in retrospect, be able to see how my perception of the problem changed over time. As I learned more and more about the problem, how did my understanding change? What decisions led to progress and when did they not? Essentially it has been a tool for personal reflection on my learning process. After I have let the last few weeks sink in a bit, I will try to do a summary on my personal blog.

Anyway, for those of you who are interested, you can download a full copy of the thesis and read all about its juicy details. If you have any questions about the work, don’t hesitate to shoot me an e-mail at marcus@ljungblad.nu.

Abstract

In social networks, e-commerce systems, and other web-services the sheer size of available content is overwhelming. Highlighting relevant content is the focus of recommender systems. Most previous research in the area has provided several algorithms for personalising the user experience, but few have addressed the issues of scalability. In this study we show how matrix factorisation, one of the more accurate recommendation techniques, can be used to serve recommendations online for millions of items and millions of users. An approach based on dividing all available items into clusters and restricting the computation to a selected few is outlined. Consequently, we developed a prototype using requirements from a production environment to demonstrate its feasibility. Experimental results show that 600 recommendation requests per second can be served with a latency below 30 ms. We conclude that matrix factorisation can be used online in large-scale settings but specific care has to be taken when clustering the items.

And though it may not make much sense without me talking, here are the slides from this morning’s defense.

The presentation

Online recommendations at scale using matrix factorisation


This will also mark the last post on this blog. From now on you can only find me on http://ljungblad.nu.

So long and thanks for all the fish!

Average precision

marcus@ljungblad.nu (Marcus Ljungblad) — Thu, 21 Jun 2012 00:00:00 -0700

In preparing the final presentation for tomorrow, one of the hardest concepts I have to explain has one of the easiest motivations to understand. In other words, why I’m taking certain measurements is easy to understand, but how I get them is certainly not as straightforward.

Why:

to show that the system does not affect the model, and
to show that accuracy depends on performance

How:
By plotting Mean Average Precision against item coverage. There are two abstract concepts here.

Mean Average Precision is a measure of recommendation accuracy and is commonly used in the algorithmic research on recommenders. In essence it works by asking for a ranked set of recommended items. Assuming that we know beforehand which items are relevant to a certain user, we look at where the relevant items appear in the set and average the precision at each of those positions. Aggregating over many users, we can then compute the mean of all the average precision values received.

For example, consider the following set: [ a,b,c,d,e,f ] where a and c are relevant (I determine this by looking at real users’ history). For this set we get the following:

(1/1 + 2/3)/2 = 0.83333...

In other words, average precision for this set is 83%. How many results are returned is irrelevant; only how high the relevant items rank in the recommended set influences the average precision.
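To make the arithmetic concrete, here is a minimal Python sketch of the calculation. It is an illustration only, not the code used for the thesis measurements, and the function names are made up. Note the assumption that the sum is divided by the number of relevant items found in the list, which matches the example above; some definitions divide by the total number of relevant items instead.

```python
def average_precision(ranked, relevant):
    """Average precision of a ranked list, given the set of relevant items."""
    hits = 0
    precisions = []
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            # Precision at this position: relevant items seen so far / position.
            precisions.append(hits / position)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(queries):
    """Mean of the average precision over (ranked, relevant) pairs, one per user."""
    scores = [average_precision(ranked, relevant) for ranked, relevant in queries]
    return sum(scores) / len(scores) if scores else 0.0


ap = average_precision(["a", "b", "c", "d", "e", "f"], {"a", "c"})
print(round(ap, 4))  # 0.8333
```

Moving c from third to second place would raise the value to (1/1 + 2/2)/2 = 1.0, which is the ranking sensitivity described above.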

Item coverage is simpler to understand. It shows what percentage of the entire item catalogue is used to provide the recommendation set. Since the system depends on clustered data, I chose to use the clusters as the input for coverage.
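As a small sketch of how coverage can be derived from clusters, assuming coverage is simply the share of catalogue items that live in the queried clusters (the cluster names and sizes below are hypothetical):

```python
def item_coverage(cluster_sizes, queried_clusters):
    """Fraction of the catalogue contained in the clusters that were queried."""
    total_items = sum(cluster_sizes.values())
    covered = sum(cluster_sizes[cluster] for cluster in queried_clusters)
    return covered / total_items


# Hypothetical cluster sizes (number of items per cluster).
sizes = {"c0": 500, "c1": 300, "c2": 200}
print(item_coverage(sizes, ["c0", "c2"]))  # 0.7
```

With unevenly sized clusters, querying the same number of clusters can give very different coverage, which is one way to see why cluster balance matters.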

Hence, if I can query all clusters then the MAP value should be identical to the MAP value computed offline (i.e. without the online component). If it is, this shows that the system does not affect the model as such.

However, if I query only a subset of the clusters the MAP value should be reduced. This happens because the relevant items for the particular user may not be in the clusters that are used to provide the recommendations. Two conclusions can be drawn from this:

1. The system depends on well-balanced clusters for a stable performance/accuracy ratio. If the clusters contain a varying number of items, performance and accuracy become hard to predict.
2. Being able to reliably determine the most important clusters (if not all of them can be queried) improves the performance/accuracy ratio.

Naturally, querying more clusters will require more computational resources and thus affect performance.

The graph below shows how MAP increases with coverage. It doesn’t show cluster sizes, but a separate comparison of these led me to point 1 above.

In the end I think I will keep it super simple and just mention the conclusions. However, the graph and the above explanation will go into the hidden slides section if I get asked a question about it.

Thesis delivered

marcus@ljungblad.nu (Marcus Ljungblad) — Tue, 19 Jun 2012 00:00:00 -0700

As it turns out I handed in my thesis last Friday! I have quite mixed feelings about it. Part of me says that’s great, the other part says more like “but why didn’t you think of that?!”

The last few weeks were a bit crazy as I was working to get the measurements I wanted to support (or counter) some decisions. The report could certainly do with some more polishing, if not content-wise then at least layout-wise. Making things pretty gives bonus points, right?

Anyway, the last weeks’ hard work culminated in an internal presentation for the engineers at Tuenti. I tend to like doing presentations (and planning them). Although I ended up being very short of time for the preparation, I felt happy with the delivery. The slides are very sketchy (attached below) and I hope to refine them, and the content, for the final presentation this coming Friday.

Questions I remember:

Why matrix factorisation?
How could you address the router bottleneck?
What is the memory and CPU utilisation?
On what machines did you run the experiments?
What clustering algorithm was used to cluster the items?

Comments received afterwards:

More details about the system including the setup and the tools used (Scala+Akka)
A bit too fast through the architecture
Focus a little bit more on scalability
There’s a simpler way to explain matrix factorisation (I’m all for making things simpler)
(Own comment) Too slow through the motivation – cut down and focus on the most essential

This was the first presentation I did where part of the audience was present over video link. It felt awkward and hard to connect with them. Also, I did the presentation sitting down (our video conference room is not great for standing remote presentations). I will try to avoid sitting in the future, however, as it made me feel restricted.

Thesis-presentation: Tuenti Engineering


Writing writing

marcus@ljungblad.nu (Marcus Ljungblad) — Mon, 11 Jun 2012 00:00:00 -0700

I haven’t provided much info here in a while, so here comes a short update on what I’ve done:

writing, rewriting, trashing, and writing introduction again
got first draft of report reviewed by my supervisor
fixing comments from supervisor
making some exploratory development to see the effect of parallelisation – not sure it will make it into the report
taking measurements – specifically something called Mean Average Precision which I’m using to measure the accuracy of the recommendation results. I’ll try to write a more detailed post on this soon.
procrastinating
thinking that my report sucks and mentally throwing it out of the window a few times
finding inspiration and solutions to problems when I least expected it
writing and working from cafés around Barcelona. It is incredible how much more productive I am in these places. 4 hours with a good view of people passing by equals around 6-7 hours of office productivity.

Now, back to writing. Here’s the rest of my TODO:

rewrite abstract
finish conclusion
get the last MAP numbers into the report and explain them
update two figures
review the layout
write acknowledgements

I can kind of see the finishing line somewhere over there.

Rock on!

Iterative writing

marcus@ljungblad.nu (Marcus Ljungblad) — Mon, 11 Jun 2012 00:00:00 -0700

As I was rewriting the conclusion I realised how often I start on sentences and then ignore them. The following is a snippet of abandoned sentences from one paragraph of my conclusion.

As a result, the prototype developed to test the concept

The recommendations are taken from a substantial item catalogue of tens of millions of items.

Our results show that describe the concept here

Secondly, the accuracy and performance results

We identified a number of improvements that can be made:
- scaling routing
- minimizing inter-node traffic
- auto-adjusting load

Throwaway-prototyping…

The value of instrumentation

marcus@ljungblad.nu (Marcus Ljungblad) — Wed, 23 May 2012 00:00:00 -0700

As I’m continuing to take measurements on the recommendation engine’s performance I’m becoming increasingly aware of the value of instrumentation. Several noteworthy bloggers and companies repeatedly talk about it. The mantra goes something along the lines of “if you don’t instrument, you have no clue what is happening.” It couldn’t be more true.

Being a project that has mostly been confined to my development machine and small-scale tests, there hasn’t been a strong need for looking inside. Various log.debug() messages coupled with some grep/awk magic have so far been sufficient.

At the time of writing a 5-minute performance test is running. During this time I have essentially no insight into what the system is up to. Only when I get the results back and can plot the graphs will I know how it performed. Considering that I do not trust the software that I’ve written (I wouldn’t do that till I see it run for a substantial amount of time) it seems it would have been a good time investment to set up better run-time metrics.

I found Metrics, a Java library developed by the guys behind Yammer, which makes it easy to extract information from deployed code. It can export the information to jconsole (a handy tool for seeing what’s happening inside your JVM), as well as to the larger-scale tools Graphite and Ganglia. It looks promising for code running in production (which it was designed for), but is perhaps not the best tool for development/performance testing.

At the moment I don’t have the time to explore Metrics further, but will definitely put it in my toolbox.

Progress!

marcus@ljungblad.nu (Marcus Ljungblad) — Wed, 23 May 2012 00:00:00 -0700

Compared to last week’s miserable results, today it is looking a lot better.

Bad

Better

When all else fails

marcus@ljungblad.nu (Marcus Ljungblad) — Tue, 15 May 2012 00:00:00 -0700

So, first ever results from the recommendation engine running on five machines.

What now? The graph above suggests that there are more failures than successfully answered requests. Not what I had in mind. Moreover, it already fails at around 90-100 requests/second, and there are even failures at a higher rate. Looking at another graph (not posted), the response times are around 1 second, which is probably causing the high number of failures as the hard timeouts are configured to 1 second.

What is the cause? There may be several reasons of course. The mistake I’ve made is to get ahead of myself, I think. Here are a few reasons as to why the results may be quite depressing compared to those taken earlier.

I decoupled the HTTP interface into a separate Java application (running in a separate JVM). I didn’t want the REST interface to interfere with the Akka system containing the recommendation engine.
Previously I have only tested the performance with static HTTP requests. This means every request is identical and is routed to the same itemset. In the run above each request is randomly generated. To generate these I decided to implement my own JMeter Sampler. This was a simple exercise, but I’m not sure how much my implementation affects the timing results measured by JMeter. Maybe I’m doing something wrong?
All requests are issued over the network. All machines, however, sit in the same rack and the normal round-trip time is about half a millisecond.
There is a design flaw with the HTTP interface. As I was writing on the report yesterday I got the feeling that the HTTP server doesn’t handle requests concurrently. I.e. when a request is accepted and forwarded to the native interface, it blocks until the result is returned and only then processes the next message. To be honest, it is more likely that this error is in the native interface that I’ve created than in the HTTP library.
The design of sharing the workload between the nodes does not work. There is a bottleneck elsewhere that I’m missing. Potential suspects are the native interface, routing and the workers. Routing is a sequential part of the code base that I’m aware can cause trouble during high workloads, but judging by previous measurements I didn’t expect it to peak already at ~100 requests/second.
I’m missing something else.

Next step will be to try with a smaller set-up (1 node, 1 test machine) locally and see if I get similar results. If not, I’ll try to profile the time spent in the different parts of the application to see where that may lead.

When all else fails…

Unix tools

marcus@ljungblad.nu (Marcus Ljungblad) — Tue, 15 May 2012 00:00:00 -0700

I’m often impressed by the plethora of tools available in *nix environments. It’s incredible how these small and composable applications have evolved and created an amazing ecosystem. The “do one thing and do it really well” principle truly drives their design.

There are three tools, two of which are default in any *nix environment and one which follows similar principles: mktemp, nohup and dtach, that I’ve used to deploy the recommendation engine.

nohup enables you to start applications that ignore the SIGHUP signal. SIGHUP is sent to the processes spawned from a terminal when that terminal goes away. For example, if I log in using SSH to a remote machine and start an application, it will automatically receive a SIGHUP when I log out from the SSH session. Nifty, but not always desirable. nohup solves this by making the application ignore the signal.

nohup java -jar recsys.jar app.conf > /tmp/recsys.log 2>&1 & echo $! > recsys.pid

mktemp does exactly what it says: it makes temporary files. mktemp /tmp/recsys.XXXX can for example be used to create a log file that I can redirect output to.

dtach is not bundled with the OS per se but I found it to be a pretty handy complement to its more heavy-weight big brother screen. Using dtach -n /tmp/app.session app you start an application in a separate terminal session. With -n you are not attached to the new session (-c creates the session and attaches to it straight away). If you want to attach to it, simply issue dtach -a /tmp/app.session

*nix is a great example of good software design.

Weaving the fabric

marcus@ljungblad.nu (Marcus Ljungblad) — Sat, 12 May 2012 00:00:00 -0700

After the last few days’ despair, fighting with packaging jar files and configuration management in order to deploy the system on a set of servers, yesterday I worked with something delightful: Fabric.

I posed this question to a hacker-friend of mine: what is the easiest way to deploy and run a set of configuration commands on five machines? The only requirement? It shouldn’t take me forever to set up.

He immediately posted a snippet of Python code using the Fabric library.

Fabric is a “tool for streamlining the use of SSH for application deployment or systems administration tasks”. And what’s more, it’s super-easy to use and has good documentation!

By defining small tasks, which are composable, you declare commands to run on remote machines. You can execute commands locally too. With three commands and a few tasks I had the whole procedure set up: packaging jar files, moving them to one of the remote servers, distributing them from there to the others (our upload capacity is kind of slow), uploading the test data, generating a configuration template, uploading this one, generating host-specific configurations based on the template, and finally (re)starting the application.

Generating and uploading configuration files looks like this:

And can be run by issuing fab prepare_config:data=sample,maxcap=20 config.
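As a rough illustration of the "generate host-specific configurations based on the template" step only, here is a minimal sketch in plain Python. It is not the fabfile from this post and does not use Fabric. The template keys, host addresses, port and the prepare_config signature are assumptions made for the example.

```python
from string import Template

# Hypothetical template; the real configuration format is not shown in this post.
CONFIG_TEMPLATE = Template(
    "address: $host:$port\n"
    "filename: $data\n"
    "nameserver: $nameserver\n"
    "nodemaxcapacity: $maxcap\n"
)


def prepare_config(hosts, data, maxcap, nameserver, port=2552):
    """Render one configuration per host from the shared template."""
    return {
        host: CONFIG_TEMPLATE.substitute(
            host=host, port=port, data=data, maxcap=maxcap, nameserver=nameserver
        )
        for host in hosts
    }


# Hypothetical host list and nameserver address.
configs = prepare_config(
    ["10.0.0.1", "10.0.0.2"], data="sample", maxcap=20, nameserver="10.0.0.1:2550"
)
print(configs["10.0.0.2"])
```

In a Fabric task the rendered strings would then be written to files and uploaded to each host; that part is left out here since it depends on the Fabric version in use.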

Happy deploying!

Configuration management

marcus@ljungblad.nu (Marcus Ljungblad) — Tue, 08 May 2012 00:00:00 -0700

One thing that I find notoriously hard to get right in every project I’m working on is configuration management.

It sucks.

There is always a plethora of libraries, tools and techniques claiming to have tamed the unicorn. To make matters worse, everyone uses different notations, different instantiation methods, and different access methods.

For a while I thought Scala+Akka had found the holy grail: ConfigFactory. Now I’m not all that convinced, after having struggled with it for a while. There are probably two reasons at play. First, the documentation is rusty at best, making it hard to interpret what the load() function really does (I ended up looking at the source code to find out it didn’t do at all what I expected it to do). Secondly, I’m probably doing something wrong, especially with respect to getting the design right.

Configuration libraries worthy of the name have at least the majority of the following properties:

use a big-name markup language for describing the config file (json, yaml, or possibly xml although it is usually bloated)
support tree-like organisation of configuration properties
once instantiated in your application be globally accessible (singletons are bad but configurations are application-cutting)
be composable to multiple files if needed
support the overriding of settings using for example system environment or vm variables
be instantiated once and only once in runtime, meaning the settings once instantiated are immutable (runtime variation is a whole other matter (I wrote a thesis on it))
be mockable/replaceable for test environments
provide a small and intuitive API
preferably come with decent documentation and sample implementations

ConfigFactory by Typesafe actually fulfills most of the above out of the box.
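To make a few of the listed properties concrete (tree-like organisation, composing several files, environment overrides, immutability once loaded), here is a minimal Python sketch. It says nothing about how ConfigFactory works internally. The RECSYS_ prefix, the key names and the load_config function are all hypothetical.

```python
import json
import os
from types import MappingProxyType


def _freeze(node):
    """Recursively wrap dicts in read-only views so settings cannot be mutated."""
    if isinstance(node, dict):
        return MappingProxyType({key: _freeze(value) for key, value in node.items()})
    return node


def _merge(base, override):
    """Merge two config trees; values in `override` win."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = _merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def load_config(documents, environ=os.environ, prefix="RECSYS_"):
    """Compose several JSON documents, apply environment overrides, then freeze."""
    tree = {}
    for text in documents:
        tree = _merge(tree, json.loads(text))
    for name, value in environ.items():
        if name.startswith(prefix):
            # e.g. RECSYS_NODE_MAXCAPACITY overrides tree["node"]["maxcapacity"].
            override = value
            for key in reversed(name[len(prefix):].lower().split("_")):
                override = {key: override}
            tree = _merge(tree, override)
    return _freeze(tree)


config = load_config(
    ['{"node": {"maxcapacity": 5, "replicas": 1}}', '{"node": {"replicas": 2}}'],
    environ={"RECSYS_NODE_MAXCAPACITY": "20"},
)
print(config["node"]["maxcapacity"], config["node"]["replicas"])  # 20 2
```

Environment values arrive as strings, so a real library would also need typed accessors. Passing environ in as a parameter is what keeps the loader replaceable in tests.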

The world doesn’t need yet another configuration library, so don’t expect me to start hacking on one. Perhaps I’ll fork ConfigFactory one day and fix its broken parts; for now I’ve hacked my way around its major shortcoming (globally accessible settings).

Your mileage may vary…

</rant>

Balancing the cluster

marcus@ljungblad.nu (Marcus Ljungblad) — Fri, 04 May 2012 00:00:00 -0700

Yesterday I explained briefly what happens when the recommendation cluster is bootstrapped. Today I thought I would go into a little bit more detail, explaining how a new node figures out what to prioritise.

The first node to join is obviously on its own. I assume that a ZooKeeper-like service is available to keep track of members of the cluster. In my current setup this is a dummy nameserver of 10 lines of Scala+Akka code, which fulfills the purpose just fine for testing. This service must be running before any node can join the cluster.

The first thing the node does is to register with the nameserver. The nameserver will reply by sending all the currently registered nodes, including the node issuing the request. If there are no other nodes in the cluster it will become responsible for coordinating the bootstrapping procedure of the cluster. This responsibility includes two things:

Read the file with the itemsets’ signatures and assign an internal id to each. This id is only used internally by the recommendation cluster.
Load as many itemsets to memory as the node can handle (1 itemset = 1 worker) – this can either be manually restricted or determined from within the system. It is manual for now since that makes it more predictable and easier to evaluate.

Assuming that the first node is not able to keep all itemsets in memory on its own, it will mark those that are not yet loaded so that they are given priority.
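The registration handshake described above can be sketched in a few lines of Python. This is a toy, single-process illustration and not the Scala+Akka nameserver itself. The class and function names are made up, and the addresses are examples.

```python
class NameServer:
    """Toy stand-in for a ZooKeeper-like membership service."""

    def __init__(self):
        self.nodes = []

    def register(self, address):
        """Record the node and reply with every registered node, itself included."""
        if address not in self.nodes:
            self.nodes.append(address)
        return list(self.nodes)


def join(nameserver, address):
    """Register a node and work out whether it has to bootstrap the cluster."""
    members = nameserver.register(address)
    peers = [member for member in members if member != address]
    # With no other members, this node coordinates the bootstrapping.
    return {"address": address, "coordinator": not peers, "peers": peers}


nameserver = NameServer()
print(join(nameserver, "127.0.0.1:2552"))
print(join(nameserver, "127.0.0.1:2553"))
```

The first call reports the node as coordinator with no peers. The second call reports one peer, which is the node it would replicate the registry from.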

As the second node joins, it will, similarly to the first node, register with the nameserver and receive a list of existing nodes. When it discovers that other nodes exist, it will:

Replicate the first node’s registry to learn what itemsets are already loaded and where the workers serving those itemsets reside. If it receives registries from more than one node, the registries are merged. This way they will eventually be consistent on all nodes.
Retrieve a list of itemsets to load. Once all nodes have replied it will sort the itemsets according to the following rule: itemsets not previously loaded are prioritised first. If all itemsets are loaded it will sort the itemsets according to their popularity (determined by the number of requests each itemset receives).
Load the prioritised itemsets. One worker per itemset is started, and each worker makes sure to register with both the local node on which it resides and the other nodes in the cluster.
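The merge and prioritisation steps above can be sketched as follows. This is a simplified Python illustration under the assumption that a registry maps itemset ids to the set of workers serving them. The data structures, names and popularity numbers are hypothetical and not taken from the actual system.

```python
def merge_registries(registries):
    """Union of several {itemset_id: set of worker addresses} registries."""
    merged = {}
    for registry in registries:
        for itemset_id, workers in registry.items():
            merged.setdefault(itemset_id, set()).update(workers)
    return merged


def prioritise(all_itemsets, registry, popularity, capacity):
    """Pick which itemsets a joining node should load.

    Itemsets without any worker come first; already loaded ones follow,
    ordered by popularity (number of requests received), most popular first.
    """
    def sort_key(itemset_id):
        loaded = bool(registry.get(itemset_id))
        return (loaded, -popularity.get(itemset_id, 0))

    return sorted(all_itemsets, key=sort_key)[:capacity]


registry = merge_registries([
    {0: {"node1/$a"}, 1: {"node1/$b"}},
    {1: {"node1/$b"}, 2: {"node1/$c"}},
])
popularity = {0: 10, 1: 50, 2: 5}
print(prioritise([0, 1, 2, 3, 4], registry, popularity, capacity=3))  # [3, 4, 1]
```

Itemsets 3 and 4 have no worker yet, so they are chosen first; the remaining slot goes to itemset 1, the most requested of the loaded ones.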

A cluster with a lot of direct links will be expensive to maintain. Well, there are two sides to the argument. The cost of maintaining more links has to be put in perspective against the frequency of expected failures and the cost of fixing/updating the link once a node recovers. I plan to evaluate this more closely in the next few days. An alternative is to add another level of indirection by storing only the node to which each worker belongs. I would expect a small number of nodes (max 5) with a few hundred itemsets each to be sufficient for most cases. Maintaining the links for such a setup may cause a certain spike (increased latency) when a node dies, but should overall be manageable since they otherwise rarely change. The time spent routing is possibly capped by the similarity function anyway. If this proves to be a bottleneck there’s a need for more efficient data structures and some further parallelisation. Actually, when I think about it, there is a lot which can be improved. First get it running…

Failure scenarios are plenty. That said, there is some resilience to node and/or worker failures.

The nodes establish monitors for the workers, such that if one dies, it will be removed from the registry. When the worker is restarted by its supervisor it automatically re-registers with all nodes.
If a node dies, the same procedure as above happens except that there is nothing to restart the node.
Nodes can obviously fail during start-up. The case when the first node joining the cluster fails (before all itemsets are loaded at some node) is, however, not handled.
A node which fails and rejoins continuously could probably overload the system. There is no protection against thrashing, such as throttling the number of join requests.
If all available replicas for a particular itemset are down, they need to be respawned from persistent storage (local file system or wherever they are stored). This is still unresolved and deserves a post of its own.

Now it remains to see if this scales sufficiently and what the impact of node join and leave is, and to rewrite this text more formally so that I can use it in the report.

Towards distributed evaluation

marcus@ljungblad.nu (Marcus Ljungblad) — Thu, 03 May 2012 00:00:00 -0700

Today is big progress day. One of the things that I’ve set out to evaluate is the scalability of the system. Specifically I want to investigate two things:

how the system behaves as the quantity of data increases, and
the throughput and latency of requests that it supports

Previously I had used the MultiJVM plugin for Akka to test with multiple JVMs locally. Today I’ve refactored the code and enabled it to run on a scriptable number of machines.

When a node starts it registers with a nameserver. As confirmation of its registration it is sent a list of all other nodes that are also registered. From here on there are two different paths. If the node is first to join the cluster it will become responsible for populating the cluster with data. Hence, once the second node joins it will, before alleviating load from the first node, ask if there is data that still hasn’t been loaded and prioritise that. The number of itemsets a node can load depends on its memory capacity. If there are no new itemsets to load when a node joins it will automatically alleviate load from the existing nodes. Any node can thereafter serve incoming recommendation requests.

Pretty cool.

Now I need to sort out two bugs that I’ve not been able to catch with the much smaller test cases that I had with the local setting. Afterwards I have to identify all failure-cases that I handle and those that I do not.

> run 127.0.0.1 2552 127.0.0.1:2550
[info] Running recsys.Main 127.0.0.1 2552 127.0.0.1:2550
Using configuration: 
[
 address: 127.0.0.1:2552
 path: /Users/marcus/tensor/
 filename: data.out
 replicas: 1
 nameserver: 127.0.0.1:2550
 nodemaxcapacity: 5
]
[INFO] [05/03/2012 18:03:03.149] [run-main] [ActorSystem(recsys)] 
	REMOTE: RemoteServerStarted@akka://recsys@127.0.0.1:2552
[INFO] [05/03/2012 18:03:03.251] [run-main] [ActorSystem(recsys)] 
	REMOTE: RemoteClientStarted@akka://nameserver@127.0.0.1:2550
Bootstrapping recsys cluster.
New id: 0 worker: Actor[akka://recsys/user/$a]
New id: 1 worker: Actor[akka://recsys/user/$b]
New id: 2 worker: Actor[akka://recsys/user/$c]
New id: 3 worker: Actor[akka://recsys/user/$d]
New id: 4 worker: Actor[akka://recsys/user/$e]
Recsys running. Press 'return' key to exit.

Illustrating matrix factorisation

marcus@ljungblad.nu (Marcus Ljungblad) — Wed, 02 May 2012 00:00:00 -0700

As part of the report I’m including a section on the intuition behind matrix factorisation. Since I’m not the biggest fan of maths (it’s fascinating when it works but basically I suck at it) I want a more illustrative example. Preferring code over mathematical equations, I decided to include some pseudo-code.

However, I’m not very happy with it. Too many abbreviations and complicated syntax. Nick suggested making fluffy functions out of the upm[u,:] parts since they literally translate to “u’th row of matrix upm”.

Alternatively I could just stick to the original Python code.

Hopefully with the examples I’ve provided it should be roughly understandable what one can achieve with matrix factorisation: approximating unknown values.
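For readers who want to see the idea of approximating unknown values in runnable form, here is a minimal sketch in plain Python. It is not the pseudo-code or the Python code from the report. It assumes a basic stochastic gradient descent with regularisation, and the toy ratings, parameter values and function names are made up for the example.

```python
import random


def predict(users, items, u, i):
    """Approximate a rating as the dot product of user row u and item row i."""
    return sum(a * b for a, b in zip(users[u], items[i]))


def squared_error(ratings, users, items):
    """Total squared error over the known ratings only."""
    return sum((r - predict(users, items, u, i)) ** 2 for (u, i), r in ratings.items())


def factorise(ratings, n_users, n_items, k=2, steps=2000, lr=0.01, reg=0.02, seed=0):
    """Learn k factors per user and per item by stochastic gradient descent."""
    rng = random.Random(seed)
    users = [[rng.random() for _ in range(k)] for _ in range(n_users)]
    items = [[rng.random() for _ in range(k)] for _ in range(n_items)]
    for _ in range(steps):
        for (u, i), r in ratings.items():
            error = r - predict(users, items, u, i)
            for f in range(k):
                uf, vf = users[u][f], items[i][f]
                # Move both factors a small step in the direction that reduces the error.
                users[u][f] += lr * (error * vf - reg * uf)
                items[i][f] += lr * (error * uf - reg * vf)
    return users, items


# Hypothetical toy data: (user, item) -> rating; missing pairs are unknown.
ratings = {
    (0, 0): 5, (0, 1): 3, (0, 3): 1,
    (1, 0): 4, (1, 3): 1,
    (2, 0): 1, (2, 1): 1, (2, 3): 5,
    (3, 0): 1, (3, 3): 4,
    (4, 1): 1, (4, 2): 5, (4, 3): 4,
}
users, items = factorise(ratings, n_users=5, n_items=4)
print(round(squared_error(ratings, users, items), 3))  # error on the known ratings
print(round(predict(users, items, 1, 1), 2))           # estimate for an unknown rating
```

Here users[u] plays the role of "the u'th row" of the user matrix, and the last line fills in a rating that was never observed.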

Planning evaluation

marcus@ljungblad.nu (Marcus Ljungblad) — Mon, 30 Apr 2012 00:00:00 -0700

According to my initial plan I want to start doing the evaluation in May. This means that a majority of the code will be frozen soon too. I’m almost on track with that plan, but there are some areas of the code which are still subject to change.

Nevertheless, I’ve started to outline what I want to measure and tried to indicate the difficulty and usefulness of each measurement. They are subject to change.

More on Evaluation

marcus@ljungblad.nu (Marcus Ljungblad) — Thu, 26 Apr 2012 00:00:00 -0700

Guy Shani and Asela Gunawardana contributed a chapter on evaluating recommendation systems to the heavy book: The Recommender Systems Handbook. A book which literally covers everything, with the possible exception of the architecting of recommender systems for scale.

Shani’s chapter did not add much new information beyond that given by Herlocker et al. They build upon it, of course, especially with respect to measuring confidence, and they dig further into user studies. But overall I think there was nothing of particular interest. Probably I don’t have sufficient background knowledge to appreciate it. However, they did include a section on scalability which I thought was relevant.

For anyone with a systems background it isn’t news (but I suspect it may have been for some of the algorithm lovers). Essentially, they suggest measuring:

complexity of the algorithm (including annotations on whether it is CPU or memory bound)
behaviour as number of users and/or number of items grow
time to compute a recommendation (more commonly known as throughput and latency)
coverage, meaning how many of the items can be recommended within a given timeframe
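The throughput and latency point can be illustrated with a small, sequential timing harness. This is a simplified Python sketch and not a substitute for a proper load test, since a real test issues requests concurrently and over the network. The handler below is a made-up stand-in for a recommendation call.

```python
import time


def percentile(sorted_values, q):
    """Value at quantile q (0..1) of an already sorted, non-empty list."""
    return sorted_values[int(q * (len(sorted_values) - 1))]


def measure(handler, requests):
    """Call `handler` once per request; return throughput and latency figures."""
    latencies = []
    started = time.perf_counter()
    for request in requests:
        t0 = time.perf_counter()
        handler(request)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - started
    latencies.sort()
    return {
        "throughput_rps": len(latencies) / elapsed,
        "median_ms": 1000 * percentile(latencies, 0.5),
        "p95_ms": 1000 * percentile(latencies, 0.95),
        "max_ms": 1000 * latencies[-1],
    }


def fake_recommend(user_id):
    """Hypothetical stand-in for a real recommendation request."""
    return sorted(range(1000), key=lambda item: (item * 31 + user_id) % 97)[:10]


stats = measure(fake_recommend, range(500))
print(stats)
```

Reporting a high percentile next to the median matters because a hard timeout is hit by the slowest requests, not the typical ones.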

I need to decide which of these to measure and how, and get going.

How to evaluate a recommendation system?

marcus@ljungblad.nu (Marcus Ljungblad) — Tue, 24 Apr 2012 00:00:00 -0700

As I’m starting to look into the evaluation of the system I was curious to find related work in this area. Linas suggested two papers which form the backbone of recommender system evaluation. Both of them are written by well-known researchers within the field.

Herlocker, Jonathan L., Joseph A. Konstan, Loren G. Terveen, and John T. Riedl. Evaluating Collaborative Filtering Recommender Systems. ACM Transactions on Information Systems 22 (2004): 5–53.
Ricci, Francesco, Lior Rokach, and Paul B. Kantor. Recommender Systems Handbook. Springer, 2010.

I’ve begun reading the “old” one by Herlocker et al. to get an understanding of the basics. Here’s what I’ve found so far.

Evaluating recommendation systems and algorithms is difficult. It depends heavily on the data, its size and its properties. It is also difficult because the goals of evaluation differ. For example, some want to measure accuracy while others may have more marketing-related goals (click-through). In the end it is user satisfaction that counts.
Begin with defining the end-user’s goals and tasks. This should be the first point for any evaluation. For example, does the user simply want to browse around or are they strictly interested in finding good items?
Define the user, define the environment it operates in, and define the data.
Data has different properties:
- Is novelty preferred over accuracy? Items recommended may be highly relevant (i.e. liked by the user) but already known. One may also consider the cost vs benefit of the recommendation. How computationally expensive is it to generate compared to the benefit or the click-through generated?
- Are the ratings implicit or explicit? What other inherent features may exist, such as demographics or time?
- Finally, what does the data itself look like? Sparse? Dense? Size and distribution. These are all important to consider when comparing algorithms with each other. Some algorithms are more suited to implicit ratings, but the same algorithm may perform horribly on data with explicit ratings.
There are a number of commonly used measures in previous literature. The most relevant for me, taking data into consideration, may be:
- Precision and Recall – normally these are measured together as they tend to affect one another inversely. If precision is higher, recall is usually lower, and vice versa.
- Mean Average Precision – is the average precision over several queries.
They make a good point about accuracy not being all there is to recommendations. A more appropriate measure may be usefulness but that is a lot harder to evaluate. By including a measure (percentage) of the dataset that the recommendation system can provide predictions for¹, a measure for how fast an algorithm can produce recommendations, and a novelty measure, we can start to evaluate the usefulness aspect.
- An interesting version of coverage is “What percentage of available items does this recommender ever recommend to users?” This is related to my current experimental results on only few clusters being used to make recommendations.
- Learning rate is more closely related to the offline model in my case.
- What strikes me about the novelty metrics proposed is how subjective they are. In fact, all recommendations are highly subjective and we’re trying to quantify them.
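The item-coverage question quoted above can be measured in a few lines. This is only a sketch: the function name, the recommendation lists and the catalogue size are all made up for the example.

```python
def catalogue_coverage(recommendation_lists, catalogue_size):
    """Fraction of the catalogue that appears in at least one recommendation list."""
    recommended = set()
    for items in recommendation_lists:
        recommended.update(items)
    return len(recommended) / catalogue_size


# Three users, each served two recommendations, from a catalogue of 10 items.
lists = [["v1", "v2"], ["v2", "v3"], ["v1", "v3"]]
coverage = catalogue_coverage(lists, catalogue_size=10)
```

If only a few clusters ever serve requests, this number stays low no matter how accurate the individual recommendations are.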
First rule of recommender systems in e-commerce: “Don’t make me look stupid!”
Eventually there has to be user evaluation. It may take many forms and is a field in its own right. However, it should at the very least focus on the defined goals and tasks that a user performs.

¹ This is particularly interesting as one of the design goals for my thesis has been to be able to recommend from a large set of items, without stripping out for example old items. But with a large item catalogue it becomes impossible within the given time constraints to provide predictions for all items.

Follow your guts

marcus@ljungblad.nu (Marcus Ljungblad) — Mon, 23 Apr 2012 00:00:00 -0700

This weekend I decided to dedicate four hours to an experimental upgrade of my thesis system from Akka 1.3 to Akka 2.0. There were a number of features in Akka 2.0 that had attracted my attention, and I had been running into challenges with 1.3.

Several talented people, including my supervisor, rightly advised against such a move so late into the work. Lalith pointed out that you can get away with your PhD on relatively old pieces of code. And true, a thesis probably shouldn’t be tied to a particular version of your code, but rather to the theories that it explores. Nevertheless, the hunger to tinker was too great to control. To not go completely overboard, I restricted my time for the upgrade to four hours so as to not endlessly continue along the lines of "just this small change too and then I’m do… " Right.

Overall it was fairly successful. It took me four hours and 15 minutes to complete the transition. The pain of upgrading comes from the fact that Akka 2.0 isn’t backwards compatible. The APIs have changed and, although the changes are reasonably well motivated by the team behind it, this certainly isn’t a straightforward upgrade for projects with much larger code-bases than mine (the core is less than 600 lines of Scala excluding tests).

Wins:

Remoting became a lot simpler. I’m now able to run recommendations across several nodes.
Code readability improved with respect to actor creation
Less code (not really due to Akka 2.0, but in the process I got rid of some stuff that wasn’t used)

Losses:

I no longer have a REST-interface as the http-library I used isn’t 2.0 compatible yet. Alternatives exist but I haven’t explored them. For now a command-line interface suffices.
The bugs I haven’t found yet due to the upgrade

In the end I will follow my supervisor’s dangerous advice:

For once (do not abuse this advice), listen to your guts. ;-)

Towards real-world testing

marcus@ljungblad.nu (Marcus Ljungblad) — Thu, 19 Apr 2012 00:00:00 -0700

Up until now I’ve mostly used fictitious examples to test the recommender system that’s being developed. This data was generated with random numbers without any correlation whatsoever to real world events. Today, however, that changed.

One of the advantages of doing a thesis with a social network is that I can test with real data. This means true anonymised logs from users interacting with content on the site. In my case that is two months of click data from November and December last year, in total about 2 GB worth of raw data for videos played.

In order to use the data there are a number of steps to be taken. As mentioned before the recommender system scales on the premise that we can divide the item factors for 40 million+ videos into clusters. Here is the bird’s eye view of the process:

1. count the number of plays per user-video pair (python)
2. run matrix factorisation and store the intermediate item and user vectors separately (java)
3. cluster the item vectors produced in step 2 – currently using scipy.cluster.vq.kmeans2 to do this (python)
4. load the data into the online recommendation system and start serving recommendations (scala)

Counting
The data comes in the following format: ... userid videoid ... (the dots indicate play related data) and there’s one line per play. A simple python script does the trick.

Since I need to verify the accuracy of the recommendation I create two sets: a training set used to build the model, and a test set which will be used to verify the results. The sets are the end-result of this process.
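A minimal sketch of what the counting and splitting step could look like. The column positions, the sample log lines, and the random 80/20 split per user-video pair are all assumptions made for the example; the post does not say where the ids sit in each line or how the two sets were drawn.

```python
import random
from collections import Counter

# Hypothetical column positions of the user id and video id in each log line.
USER_COL, VIDEO_COL = 1, 2


def count_plays(lines):
    """Count plays per (user, video) pair, one log line per play."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) <= max(USER_COL, VIDEO_COL):
            continue  # skip malformed lines
        counts[(fields[USER_COL], fields[VIDEO_COL])] += 1
    return counts


def split_train_test(counts, test_fraction=0.2, seed=42):
    """Assign each user-video pair to either the training set or the test set."""
    rng = random.Random(seed)
    train, test = {}, {}
    for pair, plays in sorted(counts.items()):
        target = test if rng.random() < test_fraction else train
        target[pair] = plays
    return train, test


# Made-up log lines in a "... userid videoid ..." shape.
log = [
    "2011-11-01 u1 v1 web",
    "2011-11-01 u1 v1 web",
    "2011-11-02 u2 v1 mobile",
    "malformed",
]
counts = count_plays(log)
train, test = split_train_test(counts)
```

Fixing the seed keeps the split reproducible, so the same test set can be reused when comparing models.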

Model generation
The implementation of the matrix factorisation algorithm is beyond the scope of my thesis (except for an understanding of how it works). I’m lucky to be able to use an existing implementation of Koren’s algorithm for implicit feedback provided by Linas and his team at Telefonica. It is part of a larger suite of recommendation algorithms which they plan to open source, and I hope to be able to assist them in that process. For now, it’s enough to say that the result of this step consists of two files: one with all items’ latent factors and one with all users’ latent factors.

Clustering
I’m really focusing on scalability more than the accuracy of the algorithms used. This is both an advantage and a drawback of my work. The advantage is that I can be less picky about the model and how the data is supplied to the online recommendation system. Clustering is thus something I do not have to implement myself and, after browsing a little for easy-to-use clustering implementations, I finally settled on scipy’s k-means. The k-means algorithm is one of the easiest: it groups items into K clusters such that each item joins the cluster whose mean is nearest, and the means are then recomputed from the items in each cluster (see the Wikipedia article for a full explanation of the algorithm). Clustering with scipy is a breeze and the gist of the code is only four lines:

The output is one file per cluster with the respective item vectors and a signature file with the centroids. The latter is used to initialise the last step, the online recommendation system.
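To make the algorithm itself concrete, here is a small pure-Python sketch of k-means (Lloyd's iterations). It is a stand-in written for illustration, not the scipy call used in the thesis; the function names, the toy vectors, and the random choice of initial centroids are all assumptions.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def kmeans(vectors, k, iterations=20, seed=0):
    """Return (centroids, labels) after a fixed number of Lloyd iterations."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]  # k distinct items as start
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each item joins the cluster with the nearest mean.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: recompute each mean from the items assigned to it.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # leave an empty cluster's centroid where it was
                centroids[c] = mean(members)
    return centroids, labels


# Toy item vectors forming two obvious groups.
item_vectors = [
    [0.0, 0.1], [0.1, 0.0], [0.0, 0.0],
    [5.0, 5.1], [5.1, 5.0], [5.0, 5.0],
]
centroids, labels = kmeans(item_vectors, k=2)
```

Writing the items of each label to its own file, plus the centroids to a signature file, would give the output layout described above.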

Online
The recommender system reads the centroid file and spawns one process per centroid/cluster, distributed across a set of nodes¹. Each process loads its cluster into memory and registers itself as ready to accept recommendation requests. “Routers” on every node route requests based on the cosine similarity between each cluster’s signature (its centroid) and the user’s request. The request contains the user’s latent factors and possibly contextual information such as the last video played. Once the best cluster is determined, the request is forwarded to the process responsible for that cluster, which in turn computes the recommendations. Finally, the top K recommendations are returned to the user.
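The request flow can be sketched in a single process as follows. This is a simplification in Python rather than the Scala/Akka system itself: the names, the toy vectors, and the use of a plain dot product between user and item factors as the relevance score are assumptions for the example.

```python
import heapq
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def route(user_factors, centroids):
    """Router: pick the cluster whose centroid is most similar to the user."""
    return max(centroids, key=lambda cid: cosine_similarity(user_factors, centroids[cid]))


def top_k(user_factors, items, k):
    """Worker: score every item in the cluster and keep the k best."""
    scored = ((dot(user_factors, vec), item_id) for item_id, vec in items.items())
    return [(item_id, score) for score, item_id in heapq.nlargest(k, scored)]


# Toy signature file (centroids) and clusters of item factors.
centroids = {"c0": [1.0, 0.0], "c1": [0.0, 1.0]}
clusters = {
    "c0": {"v1": [0.9, 0.1], "v2": [0.5, 0.2], "v3": [0.7, 0.0]},
    "c1": {"v4": [0.1, 0.8], "v5": [0.0, 0.6]},
}
user = [0.8, 0.2]
best = route(user, centroids)
recommendations = top_k(user, clusters[best], k=2)
```

Only the items of one cluster are scored per request, which is what keeps the per-request cost bounded as the catalogue grows.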

Easy peasy.

There are still some hurdles to tackle before I can test the system fully and say that it really works. But today’s progress is a good step in the direction of ensuring some form of qualitative results from my thesis.

¹ There are still some details missing before it is fully distributed. It’s coming.

Performance evaluation with JMeter

marcus@ljungblad.nu (Marcus Ljungblad) — Thu, 19 Apr 2012 00:00:00 -0700

A classmate asked in our IRC channel earlier today about performance testing the system he’s working on. We discussed JMeter a bit and I thought I’d share the set-up I used to, in particular, determine the maximum sustained throughput.¹

More specifically:

The goal of these tests is to establish the approximate throughput of serving recommendations on-line and the number of machines required to handle today’s load at Tuenti.

JMeter comes with good documentation for how to set up tests. I suggest starting with this tutorial. It does not, however, come with particularly useful graphing tools. Instead I strongly recommend installing the jmeter-plugins suite.

For our purposes I collected the following metrics:

Response time over time
Response time distribution
Response time percentiles (cumulative distribution)
Transactions completed per second (throughput)
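For reference, the last two metrics in the list are easy to recompute from raw samples. The sketch below assumes each sample is a pair of start timestamp and elapsed time in milliseconds; the sample values are invented, and the nearest-rank percentile is just one of several common definitions, not necessarily the one the plugins use.

```python
from collections import Counter


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling of pct% of n
    return ordered[rank - 1]


def throughput_per_second(samples):
    """Count completed requests per whole second since the start of the test."""
    return Counter((start + elapsed) // 1000 for start, elapsed in samples)


# Invented (start_ms, elapsed_ms) samples.
samples = [
    (0, 4), (100, 5), (200, 6), (900, 7), (1100, 8),
    (1200, 30), (1300, 5), (1900, 9), (1950, 60), (2500, 6),
]
elapsed = [e for _, e in samples]
p90 = percentile(elapsed, 90)
throughput = throughput_per_second(samples)
```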

In order to determine the maximum capacity of the system I defined a test in which I gradually increased the number of requests per second. Luckily there’s a Throughput Shaping Timer to do exactly this! I suspected the system would start to behave weirdly around 1000 requests per second and thus defined the following. It increases faster in the beginning and then slows down, but steadily increases the throughput to a total of 1700 requests per second.

The results below show that at around 1200 requests per second the system does start to behave weirdly. The blue line shows failed requests, and although I don’t know exactly why yet, it turns out that the application starts to spit out exceptions at this point.

It could be worse, it could be better. Measuring is the only way of knowing, as another classmate would have put it.

¹ This may not be the most academic way of doing it, but it gave me sufficient data to know where to focus our efforts.

First user interface

marcus@ljungblad.nu (Marcus Ljungblad) — Tue, 17 Apr 2012 00:00:00 -0700

It rocks! Right?

And a nice quote from the Instagram talk featured on the High Scalability blog

Ideas are disposable: if one doesn’t work, you quickly move on to another.

Working with Scala

marcus@ljungblad.nu (Marcus Ljungblad) — Mon, 16 Apr 2012 00:00:00 -0700

Scala is a fascinating language. It was new to me when I started on the thesis but my colleague from Telefonica promoted it, and there are some legitimate reasons for us to use it too.

It runs on the JVM. This is a significant advantage as we can use any java-library directly from Scala, for example jBlas for some heavy-duty math computations.
It is functional, making it a bliss to convert mathematical algorithms to code. It reads more or less the same way.
It is object oriented (huh?!) providing nice and easy ways to re-use and encapsulate code
It is compact. At least more compact than Java and there is less worrying about catching exceptions thanks to pattern matching.
The Akka-library provides an actor model closely resembling Erlang’s, which I’m already familiar with and which makes parallel programming a heck of a lot easier.
It has ScalaTest and EasyMock (the latter is not Scala specific though) for writing specs and mocks. I prefer the WordSpecs.

Akka and the JVM’s performance are the two main reasons behind using Scala. However, Scala does, like any language, have some drawbacks, particularly for the novice Scala-hacker.

There appears to be one too many ways of solving particular problems. I have a hard time figuring out the de facto standards. Perhaps as this blogger put it: “Scala is almost too clever for its own good.”
The sbt build tool is not as polished as it could be. It feels sluggish at times and repeatedly runs out of memory after a day’s usage or so (despite following the instructions to increase the PermGen size).
I have a hard time getting dependency injection “done right” when testing. Jonas Bonér wrote a good post on the cake pattern, but when objects start to rely on actors and their replies it becomes a mess too. Maybe I’ll write a separate post on this later.

Let’s see what comes next. For now I quite enjoy hacking in Scala and Akka.

New popular items

marcus@ljungblad.nu (Marcus Ljungblad) — Thu, 12 Apr 2012 00:00:00 -0700

One of the challenges with using a matrix factorisation method for generating recommendations is that new items can only be accounted for when the model is generated. And this happens only once in a while. Das et al., who published the Google News paper, argue that because news item churn is high (news gets old within roughly 2 days) they want to include new items in the recommendations immediately.

At first I thought this was an issue for me too. Now I think I’ve concluded that it is not really such a big issue because there are really few cases where you require items to be recommended so fast.

Some researchers at Yahoo! argue along similar lines to Das but with respect to tweets. In other words, they want to be able to recommend content from tweets that are so recent that even waiting for model regeneration would be too time-consuming.

In my opinion, one could argue that for news and tweets it is unnecessary to recommend the very latest content. Instead we can treat the problem of new items separately, possibly as a problem of detecting trends. The results of such a process can easily be mixed in with the recommendations if need be. This would enable one to focus on making good, accurate recommendations without immediately worrying about new items.

Now, there happens to be some work on online updates to matrix factorisation models too, but that is far beyond the scope of my thesis.

REST confusion (again)

marcus@ljungblad.nu (Marcus Ljungblad) — Thu, 05 Apr 2012 00:00:00 -0700

This REST business is very confusing. Well, it is confusing because of the debate over what is really REST and what is not. In the true sense I mean.

I tend to refer to the recommendation system’s external interface as REST although I’m not sure it really lives up to the standards set out by Fielding. REST dictates that the HTTP request should be as readable as possible. Furthermore, GET implies the request is side-effect free and can thus be cached. Fielding’s blog post on the POST/PUT argument explains why. As far as I can tell from his other blog post, “REST APIs must be hypertext-driven”, I think I’m on the right track.

Querying the following today:

http://service.domain.com/api/users/<id>/recommendations?factors=<list of numbers>&quantity=<int>

will simply yield:

{
	"recommendations" : [[Item, Relevance], [Item, Relevance], ... ],
	"itemset" : Id
}

Nothing more, nothing less. Can this be called REST or is it simply an HTTP GET request with a response in a JSON format?
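From the client side, the exchange could look roughly like the sketch below. It only builds the URL and parses a response; the comma-separated encoding of the factors, the user id, and the response body are made up for the example and are not taken from the actual service.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit


def recommendations_url(base, user_id, factors, quantity):
    """Build the GET URL; encoding factors as a comma-separated list is an assumption."""
    query = urlencode({
        "factors": ",".join(str(f) for f in factors),
        "quantity": quantity,
    })
    return f"{base}/api/users/{user_id}/recommendations?{query}"


url = recommendations_url("http://service.domain.com", 42, [0.1, -0.3, 0.7], 2)

# A made-up response body in the shape shown above.
body = '{"recommendations": [["video-1", 0.93], ["video-2", 0.81]], "itemset": 7}'
response = json.loads(body)
top_item, top_relevance = response["recommendations"][0]
```

Because everything the server needs is in the URL, two identical requests are indistinguishable, which is what makes the response cacheable.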

Supervisor meeting

marcus@ljungblad.nu (Marcus Ljungblad) — Tue, 03 Apr 2012 00:00:00 -0700

This morning I headed up to UPC for another meeting with my academic supervisor. I updated him on the latest developments, which in short are:

major rewrite of the core functionality to improve fault-isolation and concurrency (and ended up with a lot cleaner codebase)
primitives for load-balancing and replication in place
performance measures indicate no measurable overhead with the latest developments – request handling time is still primarily spent on getting the top K items.
writing the system architecture in the report (still more to do here)

We agreed that by the end of next week I will send a first draft for him to review my writing and structure of the thesis in more detail.

The upcoming challenges that I’m currently aware of are (in no particular order):

testing using a multi-node setup (multiple JVMs)
generate model based on an experimental offline algorithm implemented by Linas
concretise the load-balancing and replication
write :)

Work work!

Paper review - Fast Top-k retrieval for Model Based Recommendation

marcus@ljungblad.nu (Marcus Ljungblad) — Mon, 02 Apr 2012 00:00:00 -0700

I’ve noted earlier that there is very little research published on the systems side of recommender systems. Either it is completely uninteresting (I don’t think so), research doesn’t need them to be big, and/or companies are not willing to disclose their system details. Or a combination of all. Who knows? However, once in a while I stumble over articles and posts that are related and relevant for my thesis, and last week Linas suggested reading Fast Top-k Retrieval for Model Based Recommendation. While it is not a systems paper directly, the authors (D. Agarwal and M. Gurevich) emphasize that new approaches must be developed to improve recommendation request performance.

In essence they are tackling the same problem as I’m trying to solve at Tuenti. The item inventory (using their terminology) is too vast for brute-force methods to explore when computing the recommendation online. The authors go on to note that previous research mostly focused on reducing the itemset (for example by discarding older items), using some form of heuristics to minimise the set, or optimising the model algorithms.

Similar to the approach we’re exploring at Tuenti, they divide the computation into two stages. Each item is represented by a sparse feature vector, as is each query. The relevance score is computed as the dot-product of the two vectors. What is significantly different between their assumptions and mine is that for them a large item inventory consists of 50 000 items, whereas I assume large is in the range of tens of millions. They address the scale by computing an inverted index for the documents in the first stage (remember we’re trying to route using cosine-similarity).

Recommendation model
Their approach of creating the model is not based on item content and meta information. Instead they “learn [item] vectors by minimising the deviation from the original scores [of a function], while ensuring sparsity to reduce the index size.” It basically boils down to the following equation: ascore(d,q) = sum_i(q_i * d_i), i.e. the dot-product of q and d, where d holds the weights of the document learned from some offline machine-learning model and q is the query context (such as user preferences and session context). When the request arrives, they check q against the index and return the top-K documents using the ascore-function above. There are a lot more details but I will not address them here.
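To show why sparsity and an inverted index go together, here is a much simplified sketch of index-based top-K retrieval with a sparse dot-product score. It illustrates the general idea only and is not the algorithm from the paper; the feature names, weights and function names are invented.

```python
import heapq
from collections import defaultdict


def build_index(documents):
    """Inverted index: feature -> list of (doc_id, weight) with non-zero weight."""
    index = defaultdict(list)
    for doc_id, weights in documents.items():
        for feature, weight in weights.items():
            if weight != 0:
                index[feature].append((doc_id, weight))
    return index


def top_k(index, query, k):
    """Accumulate q_i * d_i only for documents sharing a feature with the query."""
    scores = defaultdict(float)
    for feature, q_weight in query.items():
        for doc_id, d_weight in index.get(feature, ()):
            scores[doc_id] += q_weight * d_weight
    return heapq.nlargest(k, scores.items(), key=lambda pair: pair[1])


# Invented sparse document vectors and query context.
documents = {
    "d1": {"sports": 0.9, "news": 0.1},
    "d2": {"music": 0.8},
    "d3": {"sports": 0.4, "music": 0.5},
}
index = build_index(documents)
query = {"sports": 1.0, "music": 0.5}
results = top_k(index, query, k=2)
```

The sparser the document vectors, the shorter the posting lists, and the fewer multiplications each request needs.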

Using the index, they, as far as I understand, approximate the model. Hence the goal is to reduce the approximation error as much as possible, and they claim to do so by up to 85% on synthetic and real datasets.

There are many interesting points in their paper though, here are some:

Previous research has focused on accuracy but not on retrieval
They treat the original model as a black box, same as I do. In other words, the system is model- and item-agnostic.
In their related work, they note that some people cached results for similar queries. An interesting twist on tackling the problem.
The index construction is parallelisable to item-level.
They can do incremental updates and it is easy to add new items since there is no item-interdependency once the model-function is obtained (during offline computation).
Query distribution changes over time, hence model should be recomputed regularly.

Evaluation
In my opinion they provide a relatively strong evaluation using three different datasets. Two of them are synthetically generated (10k items each) to expose specific properties in which their approach should face challenges. The third dataset (50k items) is taken from an ad-serving site. Unfortunately for me the datasets are far from the sizes that I was hoping for.

Obviously their model outperforms those that it compares to, except for in one synthetic case which was specifically designed to be a pain (basically there are no dependencies between items and their model is designed to work with some inter-dependencies). Interestingly though the two other datasets are highly non-linear and thus should be challenging to approximate. Still they perform well.

Finally, as for retrieval times, they claim on average 14-16 ms for the first stage (inverted index lookup) for 100 documents. I haven’t tried retrieving that many documents yet, but for 20 items I’m in the lower 5 ms range for the 90th percentile, including parsing the HTTP request and packing the response as json-data.

Two favourite quotes

Under-utilised CPU anyone?

The prototype was implemented as a single-threaded Java application and run on an Intel Xeon 2.0 Ghz 8-core machine with 32gb ram.

and a nice metaphor

Since the total number of all possible cross-products are astronomical, the features are hashed into a large number of bins.

Rewriting the core

marcus@ljungblad.nu (Marcus Ljungblad) — Thu, 29 Mar 2012 00:00:00 -0700

Last Monday I began focusing on the details of load-balancing between the nodes in the cluster generating recommendations. In this post I’ll outline roughly what I want to achieve.

The recommendation model consists of a few gigabytes of partially computed recommendations. As I’ve talked about before, it is unrealistic to generate recommendations from the entire set fast enough, and hence I split the data in smaller clusters. Today it is possible to run and load clusters of recommendations on a single machine. There is one worker per cluster and a router ensuring each recommendation request is forwarded to the most relevant worker. A worker is essentially a separate process implemented as an Akka Actor. Once the request reaches the worker it computes the top-K recommendations using whatever-algorithm to do so.

By now you will recognise at least two challenges with this design:

the router may become a bottleneck, and
some clusters are bound to be more popular than others and, hence, receive significantly more requests

Let’s see each of these in more detail.

One can argue that the first point is prematurely trying to optimise something which may not be a problem unless load is extreme¹. The second point, however, is a real issue. We know that a few videos are watched a lot more times than the rest. There are some more tricks here but that’s related to the model generation and out of scope right now. For now all we know is that some clusters will receive more requests than others.

Cluster popularity
Since one worker is responsible for only one itemset, it is easy to spawn more workers responsible for the same itemset. In my first prototype this was not the case. By refactoring the prototype I not only improved fault-isolation and concurrency, but also reduced code complexity and cut the number of lines to a third, while retaining all the functionality from before. It’s quite rewarding to see how a design becomes simpler and simpler the more you work with it.

There are more advantages to essentially replicating the workers working on the same (non-shared) cluster. Workers can be assigned the same itemset but run on different nodes, which also improves fault-tolerance in case of node failure. This, however, is another detail that I have not begun exploring yet and will have to wait too.

Design-wise I’m facing a number of questions related to the replication procedure:

Who replicates workers?
When and on what grounds is replication triggered?
How will this work when a worker may be started at any node?

Compared to Nick I’m not trying to do any fancy machine-learning algorithms to determine when to increase the number of workers per itemset. What I’m looking for is a rudimentary approach that works well enough.

Routing bottleneck
A router consists of a registry which maps cluster signatures to workers. Determining the worker to use is based on calculating the cosine-similarity between the signature and the context that is attached to the request. With a dataset of 40M items, each cluster containing 40K items, the registry contains 1000 signatures. The registry needs updating if a worker process is killed, restarted, or added. The router is also implemented as an Akka Actor and thus shares no memory with other processes. Due to updates to the registry it becomes more cumbersome to replicate the router, as data consistency has to be considered. For example, if a worker dies the registry must be updated in both routers.

Granted, consistency is not our biggest concern. After all, serving recommendations is an add-on feature to the total user experience. And if the routers are not completely in sync or up to date with the available workers, then the controller will automatically return default recommendations (for example most watched videos this week). Needless to say, we want to minimise the number of misses.
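The registry behaviour discussed here can be sketched without any actors. Below is a plain-Python illustration, not the Akka implementation: the class and worker names, the round-robin choice between replicas, and the placeholder default list are all assumptions for the example.

```python
class Registry:
    """Maps an itemset id to the workers (replicas) currently serving it."""

    def __init__(self):
        self.workers = {}  # itemset id -> list of live worker names
        self.cursor = {}   # itemset id -> round-robin counter

    def register(self, itemset, worker):
        self.workers.setdefault(itemset, []).append(worker)

    def unregister(self, worker):
        """Remove a dead worker from every itemset it was serving."""
        for replicas in self.workers.values():
            if worker in replicas:
                replicas.remove(worker)

    def pick(self, itemset):
        """Round-robin over the replicas; None when no worker is alive."""
        replicas = self.workers.get(itemset)
        if not replicas:
            return None
        turn = self.cursor.get(itemset, 0)
        self.cursor[itemset] = turn + 1
        return replicas[turn % len(replicas)]


# Placeholder fallback, e.g. the most watched videos this week.
DEFAULTS = ["most-watched-1", "most-watched-2"]


def handle(registry, itemset, ask_worker):
    """Forward to a live worker, or fall back to the defaults on a miss."""
    worker = registry.pick(itemset)
    return DEFAULTS if worker is None else ask_worker(worker)


registry = Registry()
registry.register("itemset-7", "worker-a")
registry.register("itemset-7", "worker-b")
picked = [registry.pick("itemset-7") for _ in range(3)]

registry.unregister("worker-a")
registry.unregister("worker-b")
fallback = handle(registry, "itemset-7", ask_worker=lambda w: ["from " + w])
```

With several routers, each one would hold its own copy of such a registry, which is exactly where the consistency question comes from.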

My idea for now is to add some kind of listener at each router. It would track all live workers, and if one dies, remove it from its registry. The question then becomes: who is responsible for ensuring a new worker is added in its place? I could have the workers register with each router as they start. The drawback with this approach (I tried it in my first prototype) is that creating the registry becomes a tangled mess with some possibly uninstantiated data that continuously needs to be checked.

Even if I do not run multiple routers on one node, there still have to be routers on each node, all able to tell where all workers are. Akka provides location transparency (same as in Erlang). So keeping the registries up to date is still an issue.

Suggestions? Bring them up in #emdc@freenode or e-mail.

¹ Note to self: measure this

Re-run with bigger dataset

marcus@ljungblad.nu (Marcus Ljungblad) — Tue, 27 Mar 2012 00:00:00 -0700

Last Friday I started to run with real-size data in the prototype. Some rudimentary testing indicated it worked fairly well and today I wanted to see how it ran using the JMeter tests that I had defined a few weeks back.

Everything runs on my machine with 4 cores. However, since the test is always requesting recommendations for the same user this isn’t really a fair test. Each request will be routed to the same cluster of recommendations, and hence, to the same actor. This also means that Tomcat can do its optimisations (whatever those may be) and ensure that the data that is being requested frequently is cached.

I want to refactor some code to increase fault-isolation and concurrency amongst the actors and the data they are in charge of. Once that is done I will try to create a more dynamic test.

The good part is that the graphs so far look okish, except for the blue line which indicates responses are lost. The 90th percentile completes in around 8.5 ms.

Writing every day

marcus@ljungblad.nu (Marcus Ljungblad) — Thu, 22 Mar 2012 00:00:00 -0700

Before I started working on this thesis I had the idea that I would write something every day. To date there are 39 posts here which, two and a half months in, suggests I’m not exactly there. That said, the reason for writing every day is twofold. It is to track my progress with the thesis, and secondly, it is to see how my learning and perception of the work changes over time.

Fred Wilson, the guy behind the popular blog AVC, recently gave a short talk about why he is blogging. One particular thought of his resonates with me:

By putting it down on paper, it helps me crystalise my thoughts on that [matter].

It is many times for this reason that whatever appears on this blog is not always straightforward to anyone else. In other words, I write what is valuable to me more so than for anyone else. No offense. My personal blog at ljungblad.nu contains a broader mix, but mostly I write for exactly the same reason there.

Many people blogging, writing, or producing stuff in general have a fear of shipping. A fear of pressing send on that e-mail, or publish on that blog post. It can be quite intimidating to get things wrong, but that will happen once in a while whether you want it or not. I tend to believe I got over the shipping part, and this ultimately led me to appreciate it more when people take time to give constructive feedback.

Finally, Fred continues by suggesting that you write opinion: this is beautiful and why, or this sucks and why. Always explain why. It relates to “crystalising your thoughts” and is more valuable than just recalling. It’s good advice that I’ll try to adopt a little more.

Food for thought.

Iteration 2 - Routing

marcus@ljungblad.nu (Marcus Ljungblad) — Tue, 20 Mar 2012 00:00:00 -0700

Yesterday I finished the first, albeit fairly naive, version of finding the right chunk of itemsets to use for recommendations (described here). As I kind of hinted at myself I was overdoing it and a much simpler solution was the way to go. But in spite of this awareness, finding the actual solution is never as easy. Many thanks to Linas and Toni for their input which led to a working routing mechanism.

With a “complete” prototype of the system, this also marked the end of the first iteration (three days later than planned). In this second iteration I will narrow in on the routing further by exploring the following:

Alternative routing mechanisms to cosine-similarity
Hierarchical routing, i.e. chunks may be grouped together in a tree
Parallel requests, combining results from multiple chunks in a “roll-up”-like function
Continuous measurements on real(istic) data

Except the obvious surprises of course.
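As an illustration of the parallel-requests idea in the list above, a “roll-up” could be as small as the sketch below. It assumes each chunk returns its own top list of (item, score) pairs and that scores are comparable across chunks, which is an assumption and not something established yet; the names and numbers are invented.

```python
import heapq


def roll_up(partial_results, k):
    """Merge per-chunk top lists of (item, score) into one global top-k."""
    best = {}
    for chunk_result in partial_results:
        for item, score in chunk_result:
            # Keep the highest score if an item is returned by several chunks.
            if item not in best or score > best[item]:
                best[item] = score
    return heapq.nlargest(k, best.items(), key=lambda pair: pair[1])


# Invented partial results from two chunks queried for the same user.
chunk_a = [("v1", 0.9), ("v2", 0.4)]
chunk_b = [("v3", 0.7), ("v4", 0.6)]
merged = roll_up([chunk_a, chunk_b], k=3)
```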

Code coverage in Scala

marcus@ljungblad.nu (Marcus Ljungblad) — Tue, 20 Mar 2012 00:00:00 -0700

Before I continue with iteration 2 I wanted to review how I’m testing what I’m building. In essence, I try to follow TDD as strictly as possible. Meaning, I run the tests before writing any code to ensure that it really does break or fail, implement the feature/function, run the tests again, refactor, test and repeat.

However, as the code evolved I felt there were cases, especially corner-cases, not being tested at all. In particular, I had a creeping suspicion that one of my tests was trying to cover too much in one go, and hence when it failed it was hard to pinpoint the cause, many times leading to the old-school printf() debug technique.

To investigate I hooked in the Jacoco Code Coverage plugin to sbt and ran jacoco:cover.

My immediate reaction was: 25%?! WTF!?!

There’s a catch, or a short-coming in the Jacoco tool, though. As numerous classes are extended or have traits mixed-in, especially the Akka-actors, there is a lot of unused code. One could argue, correctly so, that there should be test-cases trying some of these scenarios. For example, what happens if an actor is killed in the midst of processing a request? This will trigger code in branches that are never executed during normal operation.

On the other hand there are also more mysterious pieces of code.

It’s a bit hard to see in the screenshot, but there seem to be some partial functions that evade the tests almost completely. These, as far as I can tell, come from the Akka libraries. As with the unused functions in traits and extended classes, it would be interesting to have an option in Jacoco to exclude these from being evaluated.

Consequently, after going through the report in more detail my initial WTF has been reduced. The fact remains, however, that one test in particular is absolutely too broad in scope. Fixing this now.

Routing to the most relevant itemset(s)

marcus@ljungblad.nu (Marcus Ljungblad) — Thu, 15 Mar 2012 00:00:00 -0700

The dataset I’m working with contains more than 40M items and we want to be able to use most of it to make recommendations online. It is unfeasible, as we have seen in measurements, to compute the most relevant sets from the entire dataset for three reasons: the average compute time is too high for real-time responses, the entire dataset has to be stored in memory (depends on your machine obviously), and finally, as the dataset grows this approach will not scale horizontally.

There are, at least, two approaches to address the dataset size. First, by discarding “old” data or removing portions of the long-tail it would be possible to reduce the dataset size significantly. Reducing the size may impact the quality and scope, or breadth, of the recommendations made. It remains to be determined how many items are sufficient to generate adequately relevant recommendations.

The second alternative, and the one I’m currently working with, assumes we can split the dataset into smaller chunks during the off-line computation. Each chunk represents a cluster of related items which are constructed offline by some clustering algorithm. For the moment I only assume these chunks are made available to the online cluster somehow. This leads to the following challenge which can be formulated as:

When a recommendation request arrives at a node in the cluster, how do I know which chunk of items to use when generating the recommendations?

The recommendation request has the user’s preference factors, and in the future context information, attached to it. Given that each chunk can be described by a centroid, a unique representative id, I could formulate a distance function to identify which chunk is closest to the user’s preferences and/or session context. The approaches that I’ve looked at for determining the most relevant chunk(s) are locality-sensitive hashing and kd-trees, both of which are known solutions to the nearest-neighbour search problem. Read this for a nice introduction to LSH.

Now, my worries with the LSH approach are several:

can the centroids correctly describe the chunks?
does LSH or the chunks mask the advantages of the matrix factorisation model and the eventual re-ranking with the user preferences?
so far I’ve only tried an LSH implementation using signature hashes (which are used to reduce the dimensionality into an approximation). This is easy as the signatures contain only positive integers. However, if I were to skip the signatures (there are not that many dimensions anyway) the LSH would have to operate on the latent factors directly, making me unsure how to define the hash functions.
maybe this is overly complicated for what I’m trying to achieve, are there any simpler alternatives for nearest-neighbour? Perhaps if I can mix in some domain specific knowledge it becomes a lot simpler?
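On the question of hashing latent factors directly: one commonly used option for real-valued vectors is random hyperplane hashing, where each hash bit records on which side of a random hyperplane the vector falls. The sketch below is an assumption about how that could look here, not something that has been tried in this project; all names and sizes are made up for illustration.

```python
import random

def make_hyperplanes(num_bits, dims, seed=42):
    """Draw num_bits random hyperplanes through the origin (Gaussian normals)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dims)] for _ in range(num_bits)]

def hyperplane_hash(vector, hyperplanes):
    """Pack one bit per hyperplane: 1 if the vector is on its non-negative side."""
    bits = 0
    for plane in hyperplanes:
        side = sum(p * v for p, v in zip(plane, vector))
        bits = (bits << 1) | (1 if side >= 0 else 0)
    return bits

planes = make_hyperplanes(num_bits=8, dims=4)
user = [0.3, -1.2, 0.5, 0.9]          # hypothetical latent factors
scaled = [2 * x for x in user]        # same direction, different length
opposite = [-x for x in user]         # opposite direction
user_hash = hyperplane_hash(user, planes)
```

With this scheme two vectors separated by an angle θ agree on any single bit with probability 1 − θ/π, so the hash only captures direction, not magnitude. Whether that matches what "close" should mean for user preferences and chunk centroids is an open question.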

There’s also a (big?) chance I’m looking at the problem in the wrong way. I’m kind of stuck at this problem at the moment.

Work process

marcus@ljungblad.nu (Marcus Ljungblad) — Thu, 08 Mar 2012 00:00:00 -0800

One of the most exciting things with building an online recommendation system from scratch is that one can apply many different concepts and theories in practice, especially concepts from the courses and literature that we have studied in EMDC. At the same time, it is very dangerous. In this post I thought I would outline a bit about the work process. Partly to get it down on paper for later reference, and partly to track how (and if) it evolves.

Pre-system development

In order to get an idea of the complexity and the architecture of the online recommendation system, both Toni and I spent considerable time in the beginning reading papers, testing libraries, evaluating existing systems, and most importantly, building throw-away prototypes. We built prototypes in Scala, PHP, and Python covering specific aspects of the offline algorithm, the online part, as well as more general architectural concepts.

Iterations

I’ve tried to divide the development of the system into two-week iterations (most teams at Tuenti also work in two-week sprints). The idea is that each iteration should finish with all tests passing, and that every part of the system has received some attention (even if only minor edits).

As with all early development, the code is very unstable and parts sometimes change several times a day. There have been days where I come to work the following morning and think “what the he** did I write yesterday”. Luckily, there are days where the opposite is true too and I can direct my attention to another part of the system.

Before finishing the second iteration (which according to my initial plan should be done on March 15) I wanted to make sure I have touched upon all the areas of the system. Most components so far are very dumb: some return static values, others more comprehensive results. However, it does hang together and seeing the integration tests pass makes you feel good. Gradually I’m replacing each of the stubs and, for example, yesterday I spent most of my time sketching out the gossip algorithm that will share information about the available itemsets and their respective current load. Eventually this data will be made available to the load-balancer (instead of the brute-force load-balancing that is in place now).

Writing

A thesis is no thesis without a written report. At least as far as I know. Though I haven’t started writing anything yet my intention is to do so shortly. My supervisor asked if I could bring something in writing to our next meeting, which so happens to coincide with my second iteration deadline. Text, as much as code, requires thinking, editing, refactoring, in short a lot of attention. Thus I feel it would be a good idea to already now devote some time every week to writing¹.

Deadlines

As I know them today:

Iteration 2: March 15 – all parts outlined in system, at least stubs
Iteration 3: March 30 – recommendation routing
Iteration 4: April 15 – load-balancing
Iteration 5: April 30 – metrics
Iteration +: This is still too far off to say anything about. I could argue that even end of April is way too far ahead. Well, it’s never too late to change it around a bit.

And the current plan is to present the thesis in June.

Work work!

¹ This isn’t really anything new but somehow it is always difficult to start on time and writing tends to be left for the last minute.

Worth migrating from Akka 1.3 to 2.0?

marcus@ljungblad.nu (Marcus Ljungblad) — Thu, 08 Mar 2012 00:00:00 -0800

A few days ago Akka 2.0 was officially released. It’s been available as a release candidate for a while and should be fairly stable (sufficient for what I’m working on). However, the new release isn’t as easy to migrate to as one would wish. It includes several changes to the API and structural changes to how actors are created and accessed.

That said, Typesafe et al have made an effort to smooth the transition for developers by packing the “old api” in an isolated jar which can run alongside the 2.0 code. This enables gradual migration of code. Though it looks “easy” on paper, I’m not convinced it is as easy in practice.

Pros of switching:

ActorSystems which makes it possible to run multiple akka deployments on the same JVM – this is a huge win for testability and to me seems like one of the major design improvements in akka 2.0.
Getting access to the new Router package would simplify the implementation of the load-balancer
Nicer APIs :)
The codebase is still relatively small so it is easier to switch now
Google’s references to the docs would be more useful. Today I often manually have to switch back to 1.3 only to discover that the feature I was looking at doesn’t exist.
Living on the edge is fun

Cons:

Akka 1.3 works
It would take time from new development
As Lalith pointed out “you can do a PhD on an old release”, i.e. there is no need to switch
Most blogs with tips and tricks on akka refer to 1.3 (obviously this changes over time)
All the unknowns that I have to face if switching

Moreover, there are some intriguing items on the (unofficial?) roadmap of akka, specifically with respect to clustering and elasticity. These properties are often challenging to develop (at least in Erlang, and I believe it is the same in akka 1.3) and they keep reappearing as an issue. Making them part of the akka-library should open quite a few new doors.

Let’s think about this migration decision over lunch.

Status update

marcus@ljungblad.nu (Marcus Ljungblad) — Mon, 05 Mar 2012 00:00:00 -0800

Compiled a todo-list of things to implement / look at. No priorities assigned yet and probably not everything is relevant.

Check architecture against original plan and update accordingly
Investigate distributed setup and testing with Akka
Remote actors identification and lookup – discovery of remote nodes
Change load-balancer to support remote actors
Dynamically redistribute itemsets if node is down or itemset is unavailable or overloaded → this implies changing the index-mechanism too.
Fake lookups of preferences
Gossip health status of itemsets
Load and register itemsets to actors and trigger reload
Define bootstrapping procedure
Add proper configurability

And less code related

Define a preliminary title for the thesis
Draft an outline of the report for the next supervisor meeting

Work work!

Mind your language

marcus@ljungblad.nu (Marcus Ljungblad) — Mon, 05 Mar 2012 00:00:00 -0800

In Erlang we often refer to processes as children, and shutting down a child is often called kill children. What a horrible thought! In Scala the processes are referred to as actors and we kill those too. Albeit slightly better than slaughtering unknowing youngsters.

*nix is also a part of a dark conspiracy with killall and its violent sibling kill.

C, on the other hand, is a happy language. There data is set free(), which sounds like we’re letting it off to a better place. I like that.

Where did language designers get their brutal metaphors from?

Curse of Dimensionality

marcus@ljungblad.nu (Marcus Ljungblad) — Fri, 02 Mar 2012 00:00:00 -0800

Came across an interesting phenomenon a few days back when I was reading about dimensionality reduction, i.e. using techniques to reduce the number of dimensions of some data. At that time I didn’t know a term exists to describe it: The curse of dimensionality. Kevin Lacker explains the problem in simple terms on Quora:

Let’s say you have a straight line 100 yards long and you dropped a penny somewhere on it. It wouldn’t be too hard to find. You walk along the line and it takes two minutes.

Now let’s say you have a square 100 yards on each side and you dropped a penny somewhere on it. It would be pretty hard, like searching across two football fields stuck together. It could take days.

Now a cube 100 yards across. That’s like searching a 30-story building the size of a football stadium. Ugh.

After three dimensions it becomes difficult to find physical metaphors, and the problem is more evident when dimensions increase significantly. You see the challenge? The phenomenon is found in several areas related to analysing or interpreting data, for example, data mining. When I first started working on recommendations using matrix factorization I thought 20 latent factors to describe a user’s preferences sounded like quite a lot of dimensions. As far as my memory can tell, I’ve never written an application where I needed more than three or four. Thus, 20 dimensions was rather abstract. Today it’s less so, although some part of the recommendation modelling is still equivalent to black magic.

While working on the implementation of Minhash and Locality-Sensitive hashing I’ve come to realise that 20 dimensions are still pretty few and, luckily, the curse of dimensionality might not be kicking in. The challenge is that some algorithms that are designed for low dimensions perform horribly in high dimensions, and vice versa. Good for me I guess :)

Now let’s see… where was I?

Meeting my supervisor

marcus@ljungblad.nu (Marcus Ljungblad) — Thu, 01 Mar 2012 00:00:00 -0800

I started my thesis work in January and haven’t since met my academic supervisor. In fact, I didn’t really have one until two weeks ago. Today I met Prof. José L Balcázar.

We discussed a number of things, some of which might be useful to highlight:

Deviations

All projects drift. Whether we like it or not does not matter, they all drift. Each project also drifts differently; some fluctuating weekly or even daily, and some over a long period. One of the reasons I’m keeping this blog is to be able to track the project’s drift. Prof. Balcázar also suggested taking a minute or two every day looking at all the pieces of the puzzle and the relationship between the pieces. The day they do not make sense or the process takes more time than two minutes it is time to spend a few hours thinking through everything thoroughly.

Scope

One of my task as your supervisor is to make sure you do not build a website and sell it as a master thesis with fancy words like Joomla. In your case we have a different problem: to make sure you do a master thesis and not a PhD thesis.

This has been, and still is, one of the major challenges for me too. There are simply too many interesting threads to follow up on and narrowing down on the most important ones is hard. Within the next few days I’ll try to write down a title of my thesis as a way of defining the scope more precisely.

The Second System Effect

We also discussed second generations of software, particularly with respect to the problem you are trying to tackle. The term above is from Fred Brooks’s quintessential book The Mythical Man-Month in which he describes the effect as the second version of a system being bloated and clunky compared to the first version. It is certainly something to have in mind, especially as Tuenti already has a naive recommendation engine for video content (although as mentioned earlier it does not account for contextual data or user preferences).

Overall, I’m very happy and I’m looking forward to his valuable insights on this project. For the next meeting I also hope to present something in writing. Should be a good push to get started with the text too, even if only extremely rudimentary.

Finding a needle in a haystack

marcus@ljungblad.nu (Marcus Ljungblad) — Mon, 27 Feb 2012 00:00:00 -0800

As I’ve mentioned previously, doing on-line recommendations for all 40M videos and 13M+ users isn’t exactly reasonable (within the time constraints given). We’ve been approaching the problem by creating a partial model offline and then loading it into the memory of the online cluster. When a request arrives, we complete the recommendation by applying the user preferences and session context. Sounds good in theory and we’re now in the stage of evaluating whether this also works in practice.

Since it is unreasonable to go through all 40M items online and select the top-K items from such a large list the idea is to cluster the items in smaller sets. Let’s say, for example, that each cluster is 10 000 items large¹, then we’ll end up with 4 000 such sets. Out of those we need to pick only a few, and preferably those should also be the most relevant sets to the user’s session.

Enter minhash. The algorithm works by generating signatures of larger sets such that they can be compared against each other faster. One of the more popular (and easy) ways to measure similarity is using the Jaccard similarity which basically is setA.intersection(setB).length / setA.union(setB).length. Applying the Jaccard similarity measure on the signature sets turns out to be a very good approximation of the true similarity.
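The Jaccard expression above translates almost directly into runnable code. A minimal sketch in Python, where the function name and the convention for two empty sets are assumptions made for the example:

```python
def jaccard(set_a, set_b):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not set_a and not set_b:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(set_a & set_b) / len(set_a | set_b)

a = {1, 2, 3, 4}
b = {3, 4, 5, 6}
similarity = jaccard(a, b)  # 2 shared items out of 6 distinct ones
print(round(similarity, 3))
```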

How are signatures generated? I hacked together this function to create them:

It takes a characteristic matrix of the sets that we want to “compress” and a list of random hash functions. Each hash function is some variation of (ax + 1) mod l where l is the height of the characteristic matrix.
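The original function is not reproduced here, so what follows is only a sketch of the general technique under a few assumptions: sets are given as collections of row indices rather than as a full matrix, the hash functions have the form (a*x + 1) mod l with a random multiplier a, and l is chosen prime so that every multiplier permutes the rows. All names are hypothetical.

```python
import random

def make_hash_functions(count, num_rows, seed=1):
    """Build `count` functions of the form h(x) = (a*x + 1) mod num_rows."""
    rng = random.Random(seed)
    multipliers = rng.sample(range(1, num_rows), count)
    return [lambda x, a=a: (a * x + 1) % num_rows for a in multipliers]

def minhash_signature(rows, hash_functions):
    """Signature of one set: per hash function, the smallest hash of its rows."""
    return [min(h(r) for r in rows) for h in hash_functions]

def estimate_similarity(sig_a, sig_b):
    """Fraction of signature positions on which the two signatures agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

NUM_ROWS = 101  # prime, so every multiplier in 1..100 permutes the rows
hashes = make_hash_functions(50, NUM_ROWS)
set_a = set(range(0, 40))
set_b = set(range(20, 60))   # true Jaccard similarity with set_a is 20/60
sig_a = minhash_signature(set_a, hashes)
sig_b = minhash_signature(set_b, hashes)
estimate = estimate_similarity(sig_a, sig_b)
```

Each signature has one entry per hash function regardless of how large the original set is, which is what makes the later comparisons cheap. How close the estimate gets to the true similarity depends on the number and quality of the hash functions.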

As we would still end up with the same number of signatures as the original, it may still be too computationally expensive to discover which sets are the most relevant. We can combine the technique of minhashing with Locality Sensitive Hashing to reduce the number of comparisons required to find similar items. This algorithm groups signatures into buckets by hashing portions of each set (think of it as horizontal slices of the signature matrix). The intuition is that multiple columns can be hashed to the same bucket, thus indicating that they are similar. The likelihood of similarity increases with the size of the slice (or band as it is also called). Consequently, this algorithm’s drawback is that it yields both false positives and false negatives and I’m as yet unsure to what extent this affects the quality of future recommendations.
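The banding step can be sketched as follows. This is an illustrative version only: the function name, the toy signatures, and the use of a Python dict keyed on the band's values (instead of an explicit hash function per band) are assumptions.

```python
from collections import defaultdict

def lsh_candidate_pairs(signatures, bands, rows_per_band):
    """Return pairs of set names whose signatures agree on at least one band.

    signatures maps a set name to a list of bands * rows_per_band integers.
    """
    candidates = set()
    for band in range(bands):
        start = band * rows_per_band
        buckets = defaultdict(list)
        for name, sig in signatures.items():
            # The slice of the signature inside this band acts as the bucket key.
            buckets[tuple(sig[start:start + rows_per_band])].append(name)
        for names in buckets.values():
            for i in range(len(names)):
                for j in range(i + 1, len(names)):
                    candidates.add(tuple(sorted((names[i], names[j]))))
    return candidates

signatures = {
    "set-1": [3, 7, 1, 4, 9, 2],
    "set-2": [3, 7, 5, 6, 9, 2],  # agrees with set-1 in the first and last band
    "set-3": [8, 0, 5, 6, 1, 1],  # agrees with set-2 in the middle band
    "set-4": [2, 2, 2, 2, 2, 2],  # shares no band with anyone
}
pairs = lsh_candidate_pairs(signatures, bands=3, rows_per_band=2)
print(sorted(pairs))
```

Only the candidate pairs need a full comparison afterwards. For two sets whose signatures agree in a fraction s of their positions, the chance of becoming a candidate with b bands of r rows each is 1 − (1 − s^r)^b, which is where the trade-off between false positives and false negatives comes from.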

For a more detailed explanation of the Minhash and Locality Sensitive Hashing algorithms I suggest reading chapter 3 in Rajaraman and Ullman’s book on Mining of Massive Datasets.

¹ Computing the predicted ratings of 10 000 items for a user takes only a few milliseconds using jBlas.

Handling failures

marcus@ljungblad.nu (Marcus Ljungblad) — Mon, 20 Feb 2012 00:00:00 -0800

In a meeting earlier today the question of handling machine failures was raised. Dealing with failures is obviously something one cannot take lightly and there are several approaches available. It is always, however, good to first explicate the requirements. What user-experience do we want to provide? In some scenarios (like when withdrawing money) it is not ok to fail, while in others (getting your friend’s latest Facebook update) some degree of failure is ok. In extreme cases, maybe it is even ok to tell the user that the service is unavailable, although that is more certain to stir up some frustration.

Secondly, one needs to think about what level of failures to handle. There’s a huge difference between handling machine failures and datacenter failures. Dealing with machine failures can be addressed within the application or using external hardware components. There’s also a design decision (or philosophical decision) to make whether the system should be aware of what type of failure guarantees it can provide or not: i.e. cluster or machine-aware. Each will require different semantics and consistency considerations.

In the service component that we’re building for generating recommendations on-the-fly, we can integrate methods for replication of the data model to increase availability of recommendations to serve. There are, at least, four possible replication schemes with varying complexity to consider:

If the whole model fits in RAM it can be replicated on all machines. Since, at least initially, the model is only updated once a day, there are few consistency issues to worry about. As long as the cluster can handle all incoming requests all but one machine may fail.
If the whole model does not fit in RAM, it can be sharded and replicated amongst a subset of machines in the cluster. This could result in certain parts of the model not being available to recommend, but if the index used to keep track of the itemsets is kept up to date, the “most similar” items can still be served. Here all but one machine in a subset may fail while some data is still served (although the recommendations may not be the most accurate).
An alternative to version two is to split the data such that only some users are affected by machine outages. This would depend on how the model is split across the cluster and how the index of the itemsets is kept up to date.
Finally, one alternative is to not do any replication to handle failures and simply serve static or no recommendations at all if a failure occurs.
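To make the second scheme a bit more concrete, here is a toy sketch of sharding itemsets across nodes with a fixed number of replicas, and of checking which itemsets are still servable after some nodes fail. The round-robin placement, the node names, and the function names are all assumptions for illustration; nothing here reflects an implemented design.

```python
def place_itemsets(itemset_ids, nodes, replicas):
    """Assign each itemset to `replicas` consecutive nodes, round-robin."""
    placement = {}
    for i, itemset in enumerate(itemset_ids):
        placement[itemset] = [nodes[(i + r) % len(nodes)] for r in range(replicas)]
    return placement

def available_itemsets(placement, failed_nodes):
    """Itemsets that still have at least one replica on a healthy node."""
    return [
        itemset
        for itemset, hosts in placement.items()
        if any(host not in failed_nodes for host in hosts)
    ]

nodes = ["node-0", "node-1", "node-2"]
placement = place_itemsets(["set-0", "set-1", "set-2", "set-3"], nodes, replicas=2)

one_down = available_itemsets(placement, {"node-0"})
two_down = available_itemsets(placement, {"node-0", "node-1"})
print(one_down, two_down)
```

With two replicas any single node can fail without losing an itemset, while a second failure removes the itemsets that lived only on those two nodes. The index would then have to route requests to the closest itemset that is still available.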

Perhaps the question we should ask ourselves is: How little redundancy can we get away with?

Drawing sequence diagrams

marcus@ljungblad.nu (Marcus Ljungblad) — Tue, 14 Feb 2012 00:00:00 -0800

Yesterday I had a need to persist my (fancy) sequence diagram that I had jotted down on a piece of paper to illustrate the flow of a request in the system. It seems my notebook is not as reliable as a backed-up mercurial repository… I discussed with Nick and Patrik to find easy, expressive yet simple, and free tools for drawing these types of diagrams.

I settled on WebSequenceDiagrams which neatly solved the job (albeit not free in the true sense). Patrik later pointed me to BlockDiag which is free and also looks straightforward. Next time I’ll give it a shot!

Math libraries (cont)

marcus@ljungblad.nu (Marcus Ljungblad) — Thu, 09 Feb 2012 00:00:00 -0800

And the winner is Eigen!

Yes, it wasn’t even mentioned in the previous post. In fact I mixed the names of uBlas and Eigen up. Perhaps not a big surprise that C++ outperforms the others. More significant is that the library is single-threaded and yet is faster than the parallel libraries jBlas and ParallelColt. The difference, however, is marginal: ~6900 ms vs ~7900 ms.

Last place is, also not surprisingly, held by PHP. Memory explodes after 1M items for which it takes a rock-steady 220 seconds to compute m*m.transpose().

Evaluating math libraries

marcus@ljungblad.nu (Marcus Ljungblad) — Thu, 09 Feb 2012 00:00:00 -0800

Matrix factorization isn’t an algorithmically cheap technique. Quite the opposite. Most algorithms are based on approximating the values until the root mean squared error is sufficiently small (the definition of small is yet another variable to decide on).

There is one particular operation which has both a large memory and cpu footprint:

matrix * matrix.transpose()

Looks innocent? However, the matrix has the following dimensions: 20 × 40,000,000 which complicates things a bit. In fact, it could be a lot larger (and many others probably calculate much bigger matrices), but at the same time it is only a tiny part of the whole equation.

That said, Toni and I have spent some time profiling existing math libraries which provide smart matrix data structures and efficient algorithms. At the moment we have tried the following:

Java
- ParallelColt
- Apache Commons Math
- jBLAS
Scala
- Scalala
C++
- uBLAS
PHP
- PEAR_Math

We haven’t got the final verdict yet, but let’s say PHP is struggling way behind. We couldn’t even get it to run on matrices with more than 1,000,000 items.

Stay tuned!

Recommendations from a philosophical view

marcus@ljungblad.nu (Marcus Ljungblad) — Tue, 07 Feb 2012 00:00:00 -0800

Quite some time ago I came across a TED talk about filter bubbles which emphasizes the risk of search engines filtering content to suit your particular taste. In particular, it covers how this filtering is narrowing our worldview and, in the long term, how it may affect democracy. The talk is worth watching and relates clearly to recommendation engines. In fact, there exists a debate about what is personalised search and personalised recommendations. Many argue that it is all the same, and that search is irrelevant these days. Personally I believe the difference in use is what makes the distinction. In search I actively look for material that I wish to find, whereas for recommendations I’m exposed (un)willingly to interesting content.

The New York Times article If you liked this, you’re sure to love that highlights another perspective of recommendations: that of culture.

[Prof. Pattie Maes] notes that there’s something slightly antisocial — “narrow-minded” — about hyperpersonalized recommendation systems. Sure, it’s good to have a computer find more of what you already like. But culture isn’t experienced in solitude. We also consume shows and movies and music as a way of participating in society. That social need can override the question of whether or not we’ll like the movie.

Both the filter bubble and hyperpersonalisation can probably be minimised by introducing some randomness, or negating the result (looking for contradicting views). The latter being more algorithmically challenging. However, that will also impact your trust in the recommendations provided, potentially causing you to ignore them completely. And that isn’t really what we’re after…

On-line computation cost

marcus@ljungblad.nu (Marcus Ljungblad) — Mon, 06 Feb 2012 00:00:00 -0800

Last Friday I hacked together an experimental on-line recommender in PHP to evaluate time spent on computation vs size of datasets. Here are some preliminary (discouraging) results using a local memcache setup and the following configuration:

Users in memory: 500, Concurrent clients: 10, Ranked list length¹: 100, Item vector size: 20

The script runs for 10 seconds during which time each client spams recommendation requests. Requests are independent of each other and are thus treated separately.
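The test script itself is not shown, so the following is only a rough Python sketch of what such a load loop could look like: a number of client threads call a request function back to back until a deadline and record each latency. The function names and the sleeping stand-in request are hypothetical; a real run would call the recommender instead.

```python
import threading
import time

def run_load(handle_request, clients=10, duration=10.0):
    """Run `clients` threads calling handle_request() back to back for
    `duration` seconds and return all request latencies in milliseconds."""
    latencies = []
    lock = threading.Lock()
    deadline = time.monotonic() + duration

    def client():
        local = []
        while time.monotonic() < deadline:
            started = time.perf_counter()
            handle_request()
            local.append((time.perf_counter() - started) * 1000.0)
        with lock:  # merge once per client to keep lock contention low
            latencies.extend(local)

    threads = [threading.Thread(target=client) for _ in range(clients)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return latencies

def fake_recommendation_request():
    """Hypothetical stand-in for a real request; just sleeps briefly."""
    time.sleep(0.001)

latencies = run_load(fake_recommendation_request, clients=4, duration=0.2)
throughput = len(latencies)
average_ms = sum(latencies) / len(latencies)
```

Keeping the raw latencies, rather than only an average, is what makes it possible to plot a distribution and spot clusters of slow requests.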

Increasing the number of users to 5000 with the other variables intact yields the following results:

This figure shows a similar distribution to 500 users, which suggests we are not capped by the number of users in memory. The total throughput is around 2700 requests. Next I decrease the ranked list size stored in memory. The average latency remains the same, but total throughput increases to roughly 6000 requests.

And with shorter preference vectors (set to 10) but ranked list length back to 100, the throughput is around 4000 requests.

I’m unsure why there is a significant portion of requests taking about 120-140 ms.

In summary, preference factors and ranked list length become important configuration variables for on-line computation. Much can, and needs to, be done to further increase the performance, for example, in how the top K recommendations are selected. For a production system we must get the majority of all requests down to, or below, 100 ms. This should also include updating the context vector.

Thoughts on PHP

PHP4 was still the mainstream version when I last hacked stuff using PHP. And obviously a lot has changed since. For starters, PHP has a class system! There are also array functions to do operations equivalent to map and reduce (maybe they were there earlier too, but I didn’t know). My impression so far is that it is pretty easy to get started with (hey, nothing to compile right), but it feels rather bloated.

PHP is extensive. There seem to be functions for almost everything. Thus having a browser and php.net open is essential. On the other hand, keeping the code easy to understand felt like a challenge. With experience this ought to be less of an issue. For example, using arrays as dictionaries, lists, and tuples alike quickly made me realise I was reading my own code over and over again to see what was being passed around. Thank you Netbeans for auto-completing my very long variable and function names. :)

Lastly, I found myself looking for threading capabilities to run multiple clients concurrently. Turns out there is no such thing in PHP. Instead I ended up writing a Python script to wrap my PHP script in. Epic fail. Naturally, PHP doesn’t come from a multi-threaded background; it relies on whatever webserver is in use to handle each request individually, thereby making it multi-threaded.

¹ This is the result of the partial model generated off-line and it is unique for every user.

Load test prototype

marcus@ljungblad.nu (Marcus Ljungblad) — Fri, 03 Feb 2012 00:00:00 -0800

Spent most of the day yesterday doing a simple prototype of the online component. Most of the logic is in place, and the trick now is to test it under load with a realistic storage configuration. Realistic means MySQL and Memcache. In fact, the most realistic would be to run it alongside some production code and use real load to get accurate measures of throughput and latency. That’d be real awesome. However, we’re not there yet, especially as that requires us to generate recommendation models for at least a subset of the users.

What struck me, again, yesterday is how we always lack small easy-to-use tools for generating load. Nick and I faced a similar problem when building our distributed lock service and, especially Nick, considered building his own. Why is it that these tools rarely are part of our toolboxes? Sure, there are tools for load-testing, but too often they are clunky and overly sophisticated (JMeter and Tsung to name two). Basho_bench is one of the better I’ve seen so far. Set-up is relatively small and you could make it work with REST interfaces quite easily¹.

What am I missing? Let me know.

¹ It is primarily targeted at Erlang applications and was initially written to test the Riak database.

Reducing dimensions of the problem

marcus@ljungblad.nu (Marcus Ljungblad) — Thu, 02 Feb 2012 00:00:00 -0800

Yesterday we, Toni and I, went for a meeting at Telefonica’s offices to brainstorm about possible system solutions for building personalised video recommendations. Together with two systems researchers and two algorithm experts we banged our heads against the problem.

The problem, based on the current dataset, can be summarized as:

Given 40M videos and 13M users, recommend a personalised set (of length k) of videos using contextual information which the user may¹ like.

There are three parts to highlight here:

Each video is described by a vector of, for example, 20 factors, but can be many more. Hence, the 40M videos are represented by a 40Mx20 dense matrix. Let’s say each factor can be described by 4 bytes (an int), then this matrix is roughly 3 GB of data.
User preferences are also described by a vector, i.e. also a rather large matrix. While the number of videos can be reduced, for example, by cutting the long-tail, the users cannot. In fact, the number of users is expected to grow significantly as Tuenti is expanding into new markets.
Contextual information are things like browsing history, the last video viewed, browser settings, weather information, time of day, and whatever else you can think of. This too is represented by a vector and may change on every user action.
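The size estimate for the video matrix can be checked with a few lines of arithmetic. The sketch below assumes dense storage with 4 bytes per factor, as in the text, and additionally assumes that users are described by 20 factors as well; the function name is made up.

```python
def dense_matrix_bytes(rows, cols, bytes_per_value=4):
    """Memory needed for a dense rows x cols matrix, ignoring any overhead."""
    return rows * cols * bytes_per_value

video_bytes = dense_matrix_bytes(40_000_000, 20)  # 40M videos, 20 factors
user_bytes = dense_matrix_bytes(13_000_000, 20)   # 13M users, 20 factors (assumed)

print(video_bytes / 1e9, "GB for videos")
print(user_bytes / 1e9, "GB for users")
```

Doubling the number of factors doubles both figures, which is one reason the number of factors shows up among the questions that follow.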

Some questions which arise from this are:

How large a sample of the videos can we use?
How do we select the videos?
How many factors describing the videos / users can we use?
How fast can we serve a recommendation request? (The time to load a page shouldn’t exceed 200 ms)
What time budget for each computation (offline and online) do we have / can we allocate?
If a request cannot be served (in time), what fault-tolerance or fallback mechanisms do we use?

The algorithms we’re looking at use matrix factorisation to compute a relevance matrix. One approach, as discussed in a previous post, is to split the recommendation computation in two parts. One off-line component which calculates a ranked list for each user, but excludes the contextual information. The second part is done on-line and here the contextual information is applied to the ranked list (by calculating the inner product of the two vectors).
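The on-line half of this split can be sketched as scoring each item in the precomputed list against the context vector and keeping the k best. The data layout (a list of item id and factor vector pairs), the names, and the two-factor toy data are assumptions for illustration only.

```python
import heapq

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def rerank(ranked_list, context, k):
    """Score every (item_id, factors) pair against the context vector and
    return the ids of the k highest-scoring items, best first."""
    scored = ((dot(factors, context), item_id) for item_id, factors in ranked_list)
    return [item_id for _, item_id in heapq.nlargest(k, scored)]

# Hypothetical off-line result for one user, with 2 factors per item.
ranked_list = [
    ("v1", [0.9, 0.1]),
    ("v2", [0.5, 0.5]),
    ("v3", [0.1, 0.9]),
    ("v4", [0.4, 0.2]),
]
context = [0.0, 1.0]  # a session that only cares about the second factor
top = rerank(ranked_list, context, k=2)
print(top)
```

Using a heap to select the top k avoids sorting the whole list, which matters when the stored list is much longer than the number of recommendations shown.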

Now, in order to reduce the dimensions of this problem, my next task is to estimate how much contextual information we can use and what size of the ranked list we can store based on current peak-load traffic and beyond.

¹ s/may/will/g depending on your confidence in the algorithm / philosophical view.

Usage analysis

marcus@ljungblad.nu (Marcus Ljungblad) — Tue, 31 Jan 2012 00:00:00 -0800

Moving on from the code analysis that I did yesterday in order to understand the different systems and what they are made up of, I continued today with investigating their usage.

There are currently three recommendation systems in place at Tuenti: Friends, Places, and Videos. The former two have been around for a while, whereas Videos launched more recently. This, if nowhere else, can be seen in the code and the code practices used. There is currently no way of measuring whether a user uses the video recommendations, thus, tomorrow I’ll try to set that up.

I won’t share any specific numbers yet (hopefully I can do that later). But overall, it’s been fun playing around with Hive and MapReduce. On a huge codebase it is obviously easy to get lost, and that happened more than once. Initially I didn’t think there were any statistics at all for Places. It turned out, once I found where I wanted to add the data collection marker, that it was already there. However, it stored the values in another datastore which wasn’t documented. D’oh!

Waiting for the last MapReduce job to complete, then heading home.

A set of requirements for a recommendation framework

marcus@ljungblad.nu (Marcus Ljungblad) — Mon, 30 Jan 2012 00:00:00 -0800

Functional

recommend multiple types of content with different characteristics
- videos
- albums
- friends
- games
- (essentially this means supporting several recommendation algorithms)
use implicit feedback data to calculate recommendations
- views by users, click-through
- support multiple collection points
use explicit feedback data
- ratings by a user
record contextual information and use it instantaneously to update a set of recommendations
generate sets of recommendations for specific types and mixed sets
be able to use data of different types to generate a specific type
support post-processing filters on recommended sets

Non-functional

scale to millions of users
support peak hours during which activity is significantly higher
degrade gracefully if service is limited / unavailable
easy to add and deploy new sets of recommendations
make use of contextual information in realtime to update recommendations

Architecting Recommendation Systems for Web-Scale Data

marcus@ljungblad.nu (Marcus Ljungblad) — Fri, 27 Jan 2012 00:00:00 -0800

To evaluate the current thesis proposal I challenged my EMDC colleague Lalith (who is doing some awesome work with wireless networks at T-Labs in Berlin) with the following question:

Let’s hypothetically say the name of my thesis is “Architecting Recommendation Systems for Web-Scale Data”. What would you expect to read in it?

Well, as it turns out his answer, although significantly shorter, matched the following, highly tentative, outline of a report quite well. I’ve updated it a bit to include Lalith’s feedback. Now I should verify this with my supervisor at UPC too.

Introduction

recommendations for personalisation and increased interaction
problem with scale, optimizing algorithms or sampling the data
little systems research on collaborative filtering and recommender systems (mostly on algorithms)
building a recommendation system which serves millions of users
Supporting a range of content: videos, games, photos, albums, friends, places, pages
main contributions:
- a system which supports several content types
- able to update according to recent contextual information
- evaluation on big data sets

Background

Definitions

web-scale data
collaborative filtering
- model based – common approaches
- memory based – common approaches

Describe current solutions to the growing amount of data

it has mostly focused on algorithm enhancement and/or downsizing the data
some algorithms are being ported to mapreduce, for example through the mahout project
other attempts include graphlab which uses something like a “bulk asynchronous processing” model, but still lacks widespread production use and has limited support for distributed computations
the biggest published recommendation system is Google News personalisation. The algorithms are simplified and specific to Google’s infrastructure

Problems / Limitations of existing systems

Method

Something about the research method(s) used. Big TBD.

System / Architecture

Data collection – capturing user feedback, and using it for online feedback
Algorithms for computing recommendation model – dividing the model in two parts
Serving recommendations
Updating recommendations based on contextual data from a session, i.e. creating relevant recommendations from the most recent user activities.
Components needed / Implementation
- offline (non-realtime)
- online (realtime)

Details

Usage peaks – degrading quality of service depending on load
Blacklisting, i.e. removing recommendations that a user deemed irrelevant or has already seen
Updating / creating new recommendations on the fly
New users / cold start, i.e. what to do when there is no previous history from the user
HCI – How long does it take to serve a recommendation, vs whether it is better to change the UI to improve effectiveness (TBD)

Evaluation

Quantitative
Measure existing recommendations and compare with new system

accuracy of algorithm (not sure how relevant this is for a systems paper)
accuracy vs load
serving recommendations (latency / throughput)
clicks/interaction

Also check if it is quantitatively comparable to any existing systems.

Qualitative
Architecture
Flexibility / modularity
Scalability

Conclusion

It will be awesome ;)

A day of tutorials and code

marcus@ljungblad.nu (Marcus Ljungblad) — Wed, 25 Jan 2012 00:00:00 -0800

The next step, after getting the gist of recommender systems, was to narrow down on the existing systems and practices at Tuenti. Since the company has grown quite rapidly in the last two years, a full suite of training material exists to help introduce engineers to the systems. Despite each tutorial being quite short, it took more or less the entire day to go through the first material, check out the code, and set up the environment properly.

Tomorrow I will focus on analysing the existing recommendation systems.

Production recommender systems

marcus@ljungblad.nu (Marcus Ljungblad) — Tue, 24 Jan 2012 00:00:00 -0800

Here’s a small collection of recommender systems in production that I’ve come across in my background research. It is far from complete, and if you know of any particularly interesting ones (especially where the datasets or item churn are extraordinary), please drop me an e-mail.

The developers at Foursquare made Explore using some pretty big datasets.
At Amazon they’ve been doing item-to-item product recommendations for quite a while. Greg Linden is recommendation king over there.
Google News personalises stories for you based on similar users’ click history.
The game-changing challenge announced by Netflix stirred up some serious activity in the research community. The prize totaled $1M.
Drupal, a popular CMS, provides a recommendation API.
LastFM’s “Audioscrobbler” is based entirely on recommendations, but instead of improving the algorithms, LastFM focused on extracting really good data from the users. That turned out quite well too.

Then I’ve also seen Digg, StumbleUpon, Movielens, Facebook (duh), and many more mentioned, but have no links on how they work.

Mahout vs GraphLab

marcus@ljungblad.nu (Marcus Ljungblad) — Mon, 23 Jan 2012 00:00:00 -0800

Mahout

Mahout is a framework for machine learning and an Apache Software Foundation project. Taste, a sub-framework of Mahout, is used specifically for collaborative filtering.

The Taste framework comes in two tastes (pun intended):

Online, where recommendations are computed on demand, typically on smaller datasets. This version is easily integrated in existing Java applications by using one of the existing Recommender algorithms. Online computations are done in memory (as long as the data fits), and the data can be updated more frequently by, for example, pushing new .csv files or using data from a SQL database.
Offline, which utilises Apache Hadoop to achieve scalability. Mahout points out, however, that map-reduce tasks don’t logically fit all types of algorithms, and the project is hence exploring alternative distribution methods too.

The recommender beginner’s wiki points out that datasets containing up to 100M user-item ratings should be computable online using a decent server.

Not all algorithms provided by Taste are available as Hadoop implementations. There is an iterative algorithm for matrix factorization using Alternating Least Squares. Iterative algorithms incur significant overhead when written as MapReduce jobs in Hadoop (a better way could be to model the computation using bulk synchronous processing, like Pregel).

Building a system which combines Mahout’s offline and online capabilities seems yet to be done. Basically since you want your online computations to be O(1) I’m not sure that Mahout is a good fit. It might be easier to do online updates on data on the side, and possibly use Mahout for the offline computations.

Decent introductions to Mahout can be found here and here.

This page has information about the recommender architecture and how to build your own recommenders. The architecture does not provide an intuitive explanation for how the collaborative framework connects to Hadoop. Based on a brief tour in the source code it looks like Mahout provides a “Hadoop Job Factory” which generates and submits map- and reduce tasks (aka jobs) to your Hadoop cluster.

Mahout has also been shown to run on AWS Elastic MapReduce which, given the readme, does not seem like a trivial task.

Foursquare provides a pretty interesting use-case on Mahout with extremely large datasets, and also emphasizes the fact that Mahout is geared towards industry.

Graphlab

On the other hand, the Graphlab project takes a quite different approach to parallel collaborative filtering (more broadly, machine learning), and is primarily used by academic institutions.

Graphlab jobs operate on a graph data structure, much like Google’s system Pregel. Computation is defined through an update-function which operates on one vertex of the graph at a time. During an update call, new update requests can be scheduled with other vertices of the graph. A central scheduler delegates vertices for processing. For a good example, see the Graphlab implementation of Pagerank.
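
To make the update-function idea concrete, here is a minimal single-threaded sketch in Python. It is not Graphlab's API: the function names, the queue-based scheduler, and the `tolerance` parameter are all assumptions made for illustration. Each update recomputes one vertex's rank from its in-neighbours and, if the value changed noticeably, schedules the vertices it links to.

```python
from collections import deque


def pagerank(out_links, damping=0.85, tolerance=1e-6):
    """Vertex-at-a-time PageRank driven by a simple FIFO scheduler.

    out_links maps every vertex to the list of vertices it links to.
    Every link target must also appear as a key (hypothetical input format).
    """
    in_links = {vertex: [] for vertex in out_links}
    for source, targets in out_links.items():
        for target in targets:
            in_links[target].append(source)

    rank = {vertex: 1.0 for vertex in out_links}
    queue = deque(out_links)      # the "central scheduler"
    scheduled = set(out_links)    # avoids queueing a vertex twice

    def update(vertex):
        # Recompute this vertex from the most recent values of its in-neighbours.
        incoming = sum(rank[s] / len(out_links[s]) for s in in_links[vertex])
        new_rank = (1 - damping) + damping * incoming
        changed = abs(new_rank - rank[vertex]) > tolerance
        rank[vertex] = new_rank
        if changed:
            # Schedule new update requests with the vertices that depend on us.
            for neighbour in out_links[vertex]:
                if neighbour not in scheduled:
                    scheduled.add(neighbour)
                    queue.append(neighbour)

    while queue:
        vertex = queue.popleft()
        scheduled.discard(vertex)
        update(vertex)
    return rank


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(graph))
```

Because an update always reads the latest values of its neighbours, there is no global synchronisation step between iterations, which is the main contrast with a MapReduce-style formulation.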

Contrary to Hadoop, Graphlab is built for multi-core parallelism, although there is on-going work on making it easier to use in a distributed setup. It also seems to lack mechanisms for fault-tolerance (in Hadoop, for example, map or reduce tasks are restarted by the master if they fail to complete).

However, Graphlab boasts that “implementing efficient and provably correct parallel machine algorithms” is easier when compared to MapReduce, especially since computation is not required to be transformed into an embarrassingly parallel form. It is different from Pregel in the sense that communication between vertices is implicit, and that computation is asynchronous. The latter implies that computation on a vertex will happen on the most recent available data. By ensuring that all computations are sequentially consistent, the end result data will eventually also be consistent, programs become easier to debug, and the complexity of parallelism is reduced.

Summary

Mahout looks like a more polished product, especially as it relies on Hadoop for scalability and distribution. Its computational model may, however, be constrained by that same prerequisite. This is where Graphlab excels, since it is built from the ground up for iterative algorithms such as those used in collaborative filtering. On the downside, Graphlab lacks a production-ready distribution framework.

Head-banging

marcus@ljungblad.nu (Marcus Ljungblad) — Fri, 20 Jan 2012 00:00:00 -0800

First whole week at Tuenti completed. Some progress, and a bunch of questions. Here’s the short crackdown:

Acquired an understanding of the problem with recommendations, why it is useful, where it is used, who the big players are, and how it has evolved over the years.
Learned about matrix factorization techniques for predicting recommendations. Although I’m not sure I fully understand how it is done, I get the gist. Probably wouldn’t attempt to implement one of the more complicated algorithms on my own yet.
Tweaked and refactored some code which does simple matrix factorization on very small datasets, and in a sequential manner. Check the refactored code on Github.
Read numerous papers about algorithms (many of which I don’t understand, but am able to classify).
Read the few papers I could find on recommendation systems architecture. Google News personalisation being the most prominent one.
Studied the Mahout architecture; a machine learning framework which is able to utilize Hadoop for larger datasets. It works offline only (although Taste, the recommendation framework, is pluggable for online recommendations too) and does not support model-based algorithms yet, afaik.
Discovered that there isn’t a whole lot of research, papers, or blogposts on the actual implementations of recommendation systems. Good for me, I guess.
Began studying Tuenti’s internal architecture based on technical specifications and requirement documents. Much of it I will never cover here. Next week a colleague is giving me and Toni a walkthrough of the essentials. Will be very interesting as a lot of things here really are at large scale. Things we only studied in papers before.
Rewrote my thesis position paper based on this week’s findings. As soon as the Tuenti architecture background is complete I should be able to finalise it and send it for a final review. My UPC supervisor isn’t responding to e-mails at the moment (and hasn’t for a week) which is an issue though.
Got some rough ideas on a possible system design if I could do it entirely as I want. That’s obviously neither a good idea, nor feasible. But sketching is always fun.

It’s been a lot this week. Although I still haven’t been able to exactly specify the topic and the work, I’m getting more confident that we’ll find something. The best part is that basically whatever I look at is really interesting. Perhaps with an exception for some of the complicated math algorithms.

Now pizza.

Survey paper on CF recommendation algorithms

marcus@ljungblad.nu (Marcus Ljungblad) — Thu, 19 Jan 2012 00:00:00 -0800

Found this paper titled A Survey of Collaborative Filtering Techniques from 2009. It is more recent than the previous survey paper I found and contains some useful references to the challenges with CF techniques.

One particular challenge with recommendation algorithms is to scale them to tens of millions of users and millions of items. Most research predating the Netflix Prize considers “large-scale” to be several orders of magnitude smaller than millions. Even the Netflix data is small in comparison to the datasets used at Google News or Amazon. Essentially O(n) is too slow for those numbers.

Several techniques have been proposed to address the scalability issue. In particular:

Doing matrix factorization once and updating it online (specifically Singular Value Decomposition) using projections.
Using Hadoop MapReduce. While some algorithms can be parallelised and made to support MapReduce, it doesn’t solve the freshness problem if the data changes quickly, something web data has a tendency to do.
Intermediate approaches have also been proposed. For example, instead of computing the top K recommendations on the entire user database, the users are first clustered (as in Google News), and recommendations are calculated on the fly using these smaller clusters.
The above paper mentions Pearson correlation (memory based algorithm) as a viable alternative to scale. I’m not convinced, however. If the data is too large to store in memory it will be too slow as reads have to be made from disk instead.
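
The intermediate, cluster-based approach in the list above can be sketched roughly as follows. This is only an illustration under assumptions: the cluster assignment is taken as given (computed offline somewhere else), and all names (`build_cluster_counts`, `recommend`, the toy users and items) are hypothetical. The point is that the online top-K step only touches the counts of the user's own cluster instead of the entire user database.

```python
import heapq
from collections import Counter, defaultdict


def build_cluster_counts(clicks, user_cluster):
    """Count, per cluster, how often each item was clicked by its members."""
    counts = defaultdict(Counter)
    for user, item in clicks:
        counts[user_cluster[user]][item] += 1
    return counts


def recommend(user, seen_by_user, user_cluster, cluster_counts, k=2):
    """Return the k most clicked items in the user's cluster that the user has not seen."""
    seen = seen_by_user.get(user, set())
    candidates = (
        (count, item)
        for item, count in cluster_counts[user_cluster[user]].items()
        if item not in seen
    )
    return [item for _, item in heapq.nlargest(k, candidates)]


# Toy data: three users in cluster 0, one user in cluster 1.
user_cluster = {"ann": 0, "bob": 0, "cat": 0, "dan": 1}
clicks = [("ann", "a"), ("bob", "a"), ("bob", "b"),
          ("cat", "a"), ("cat", "b"), ("cat", "c"), ("dan", "z")]
seen_by_user = defaultdict(set)
for user, item in clicks:
    seen_by_user[user].add(item)

cluster_counts = build_cluster_counts(clicks, user_cluster)
print(recommend("ann", seen_by_user, user_cluster, cluster_counts))
```

The cost of a request then depends on the number of items clicked within one cluster rather than on the total number of users.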

I haven’t read the entire survey, but it seems to be covering quite a few of the collaborative filtering techniques that I’ve seen mentioned in several other papers.

Summary of "Google News Personalization Scalable Online Collaborative Filtering"

marcus@ljungblad.nu (Marcus Ljungblad) — Thu, 19 Jan 2012 00:00:00 -0800

The authors describe three algorithms for content-agnostic recommendations and the system architecture employed to serve personalised news on Google News. Their contribution is distinct from earlier CF research in two ways: the massive scale and high item (news) churn.

CF algorithms, as mentioned in a previous post, can be categorised in two areas: memory-based and model-based. The former is hard to deploy on massive item-sets since (surprise) everything needs to be kept in memory, which quickly becomes infeasible. In Google News the memory-based part is an online covisitation algorithm which only updates the affected news items, and thus does not need to be maintained in memory. The two other algorithms are model-based; both calculate clusters and are computed offline.

The engineers divided the system in three parts:

An offline part which is basically MapReduce jobs running periodically to compute user clusters based on their click history.
An online update part that continuously updates the statistics when a user clicks a news item.
An online retrieval part which fetches and computes news recommendations from the statistics and clusters stored.

The system is split into five components:

Front-ends which listen to registered user activity. A front-end passes the data on to either of the two following components.
The Statistics engine updates the user clusters and story items based on the clicks received. All information is stored in one of two tables: a user table and a story table.
The third server is the prediction engine which, given a set of user options (for example: language used, regional settings, and so on) and a few news items from these settings, computes a set of ranked stories. The top-K ones are presented to the user. The prediction engine fetches information from both tables and caches them for an “appropriate” time-window.
The BigTable tables which store user and story statistics.
1. The user table is indexed by user id and contains two columns: a list of clusters the user belongs to, and the click history.
2. The story table, indexed by story-id, contains how many times a story was clicked by users in each cluster, and how many times the story was visited along with another story¹.
The offline component operates on a “few months” of user data to create user clusters using the two model-based algorithms.
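
As a rough illustration of the online update part, the sketch below keeps covisitation counts in memory, treating two stories as covisited when the same user clicks them one after the other within a time window. It is not the paper's implementation: the class name, the in-memory dictionaries (standing in for the story table), and the `window_seconds` default are assumptions.

```python
from collections import defaultdict


class CovisitationCounter:
    """Counts how often two stories are clicked consecutively by the same user."""

    def __init__(self, window_seconds=3600.0):
        self.window_seconds = window_seconds
        self.last_click = {}  # user id -> (timestamp, story id)
        self.covisits = defaultdict(lambda: defaultdict(int))

    def record_click(self, user_id, story_id, timestamp):
        """Update only the two affected stories; nothing else is recomputed."""
        previous = self.last_click.get(user_id)
        if previous is not None:
            previous_time, previous_story = previous
            within_window = timestamp - previous_time <= self.window_seconds
            if within_window and previous_story != story_id:
                self.covisits[story_id][previous_story] += 1
                self.covisits[previous_story][story_id] += 1
        self.last_click[user_id] = (timestamp, story_id)

    def related(self, story_id, top_k=3):
        """Stories most often covisited with story_id, most frequent first."""
        counts = self.covisits.get(story_id, {})
        return sorted(counts, key=lambda story: (-counts[story], story))[:top_k]


counter = CovisitationCounter(window_seconds=60)
counter.record_click("u1", "a", 0)
counter.record_click("u1", "b", 10)
counter.record_click("u2", "a", 0)
counter.record_click("u2", "b", 30)
counter.record_click("u3", "a", 0)
counter.record_click("u3", "c", 5)
print(counter.related("a"))
```

Each click touches at most two rows of counts, which is why this kind of statistic can be maintained continuously instead of being rebuilt by a batch job.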

As the components are split, the system can continue to serve personalised requests even if, for example, the statistics engine breaks. Multiple instances of each component increase the availability.

Finally, the system serves prediction requests in less than 100 ms. There is also an evaluation of the prediction accuracy, but that is not as important now.

Reference:
Das, Abhinandan S., Mayur Datar, Ashutosh Garg, and Shyam Rajaram. “Google news personalization: scalable online collaborative filtering.” In Proceedings of the 16th international conference on World Wide Web, 271–280. WWW ’07. New York, NY, USA: ACM, 2007. http://doi.acm.org/10.1145/1242572.1242610.

¹ A story is covisited if a user clicks two stories after each other within a specified time-window.

Motivating my thesis topic

marcus@ljungblad.nu (Marcus Ljungblad) — Thu, 19 Jan 2012 00:00:00 -0800

Found the following quote in another survey paper on Recommender systems from 2010.

The architecture of recommender systems and their evaluation on real-world problems is an active area of research.

\o/

Summary of "A case for distributed recommender system architecture"

marcus@ljungblad.nu (Marcus Ljungblad) — Wed, 18 Jan 2012 00:00:00 -0800

This paper makes it to my top-ten list of worst research papers ever.

The authors correctly identify that recommender systems have historically been built as sequential programs, have difficulties scaling, and are often built for specific purposes. With this as the starting point, they propose four architectural techniques: network centric, client-server, layer, and component patterns, which magically will solve all issues of recommender systems.

While the abstract and introduction start off fine (until you check the references used to support some of the authors’ claims), the solution proposed offers no novelty, implementation, or evaluation. It is purely hypothetical. The authors seem to have a superficial understanding of recommendation algorithms, which are barely touched upon, let alone described, in the paper. Moreover, the claims made in the introduction are based on recommendation research from the early 90s, and the paper was published in 2010.

Unfortunately, nowhere is it explained how the proposed patterns are distributing the load of commercial-scale (hundreds of thousands or more entries) recommendation datasets. And they, likely, haven’t heard of Hadoop.

On the upside of things, there seems to be plenty of room for actually providing something substantial in the field of scalable and fault-tolerant recommender systems.

Next paper: Amazon’s item-to-item recommendations.

More Matrix Factorization

marcus@ljungblad.nu (Marcus Ljungblad) — Tue, 17 Jan 2012 00:00:00 -0800

In order to better understand matrix factorization¹ I wanted to experiment with it in code. I find code much easier to understand than the Greek symbols in papers. Plus, you can tinker with it.

A little digging gave me Albert Au Yeung’s matrix factorization tutorial with some python code. His code, to me, wasn’t very easily understood, and therefore, after some refactoring this is what I came up with. It can surely be made even easier to understand. For more details of the maths, read his post.

The code assumes users are rating items. When the rating is defined as 0 in the input matrix, the user has not yet rated the item. The goal, hence, is to predict those values by discovering two matrices whose product approximates the missing ratings. Obviously, the approximation is based on the already existing ratings. Thus, the mean squared error is calculated by comparing the existing ratings against the predicted ones. When the approximation error is small enough, or when MaxSteps is reached, we quit and take the dot product of the two resulting matrices to yield the predicted ratings for all users and movies.

It is not intuitive where the InitialUserFeatures and InitialMovieFeatures come from. In collaborative filtering algorithms, it is assumed that each user has some initial preferences, for example, based on their previous actions. However, in a new system where no previous data exists, i.e when bootstrapping, this can be merely an educated guess or made up by implicit data. In the example above each user’s preferences are therefore randomised for the NumberOfLatentFeatures we are trying to uncover.
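
For readers without the code at hand, the procedure described in this post can be sketched as below. This is a reconstruction under assumptions, not the refactored code itself: the function name, the learning rate, the regularisation term, and the example ratings matrix are illustrative choices, and plain gradient descent on the known ratings is assumed as the optimisation method.

```python
import numpy as np


def factorize(ratings, number_of_latent_features=2, max_steps=5000,
              learning_rate=0.002, regularization=0.02, tolerance=0.001, seed=0):
    """Approximate `ratings` (0 = not rated) as user_features @ movie_features.T."""
    rng = np.random.default_rng(seed)
    number_of_users, number_of_movies = ratings.shape
    # Randomised initial preferences, since nothing is known when bootstrapping.
    user_features = rng.random((number_of_users, number_of_latent_features))
    movie_features = rng.random((number_of_movies, number_of_latent_features))

    known = ratings > 0
    known_cells = list(zip(*np.nonzero(known)))

    for _ in range(max_steps):
        for user, movie in known_cells:
            error = ratings[user, movie] - user_features[user] @ movie_features[movie]
            old_user = user_features[user].copy()
            # Gradient step on the squared error of this single known rating.
            user_features[user] += learning_rate * (
                2 * error * movie_features[movie] - regularization * old_user)
            movie_features[movie] += learning_rate * (
                2 * error * old_user - regularization * movie_features[movie])

        predicted = user_features @ movie_features.T
        mean_squared_error = np.mean((ratings[known] - predicted[known]) ** 2)
        if mean_squared_error < tolerance:
            break

    return user_features @ movie_features.T


# Example input: 5 users, 4 movies, 0 marks a missing rating.
example_ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)

print(factorize(example_ratings))
```

Only the known ratings contribute to the error, so the zeros never pull the prediction towards zero; the missing cells are filled in purely by the learned latent features.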

The result of one example run is:



[[ 4.98400885  2.96534946  3.77717477  0.99965545]
 [ 3.97376149  2.37641209  3.21611904  0.99749797]
 [ 1.03827929  0.90545444  5.63804813  4.96223337]
 [ 0.9813382   0.81234711  4.59620317  3.97213023]
 [ 1.49546775  1.11543839  4.93859812  4.0289544 ]]
[Finished]

If the ratings are from 1-5, it is not very useful that the algorithm estimates the rating 5.64 in one particular case. Since nothing in the algorithm bounds the predictions, a value may exceed 5 (or, in principle, fall below 1), so it would be good to add some constraints on the final predicted values.
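
One simple way to add such a constraint, sketched here as a post-processing step with hypothetical names, is to clamp the predicted values to the rating scale after the factorisation has finished. The 1-5 bounds are taken from the rating scale mentioned above.

```python
import numpy as np


def clamp_predictions(predicted, min_rating=1.0, max_rating=5.0):
    """Clip every predicted rating into the valid rating range."""
    return np.clip(predicted, min_rating, max_rating)


# Three values taken from the example run: one too high, one too low, one fine.
print(clamp_predictions(np.array([5.63804813, 0.81234711, 3.21611904])))
```

Clamping only changes what is presented; it does not change the learned features, so the ordering among items within the valid range is preserved.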

I’m considering rewriting the code to suit a much larger dataset (100k+ users) next.

¹ Matrix Factorization is probably part of a maths course we never had as Software Engineering students at ITU. Wish I had more maths in undergrad.

Time Computing vs Accuracy

marcus@ljungblad.nu (Marcus Ljungblad) — Mon, 16 Jan 2012 00:00:00 -0800

The algorithm suggested at the moment uses Alternating Least Squares (ALS) to optimise the result. ALS is easy to parallelise, but may not be as accurate as stochastic gradient descent.

Thus, one area to explore is how additional computation time can increase accuracy. Or, conversely, how to degrade the accuracy gracefully if we’re short on time.

Singular Value Decomposition

marcus@ljungblad.nu (Marcus Ljungblad) — Mon, 16 Jan 2012 00:00:00 -0800

While reading up on how Singular Value Decomposition (SVD) works I found this quote by Simon Funk

“In today’s foray, that model is called singular value decomposition, which is just a fancy way of saying what I’ve already eluded to above.”

Nice to hear other people shudder about how we try to over-complicate a lot of things, especially in research and marketing departments.

As far as I understand, the intuition behind SVD (as applied to estimating ratings) is that you have a matrix and you want to find two matrices that, when multiplied, approximate the actual matrix with minimal error. This can be done in iterations to improve the accuracy of the prediction, and the nice thing about SVD is that it will automagically converge on the optimal solution.
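
A small experiment helps with the intuition. The sketch below uses the classical, non-iterative decomposition from NumPy (`numpy.linalg.svd`) on a complete matrix, which is an assumption on my part and not the iterative variant from the quote; the function name and the example matrix are made up. Keeping only the largest singular values gives the closest low-rank approximation in the least-squares sense.

```python
import numpy as np


def low_rank_approximation(matrix, rank):
    """Best rank-`rank` approximation of a complete matrix via truncated SVD."""
    u, singular_values, vt = np.linalg.svd(matrix, full_matrices=False)
    # Scale the kept columns of u by their singular values, then recombine.
    return (u[:, :rank] * singular_values[:rank]) @ vt[:rank, :]


complete_ratings = np.array([
    [5, 3, 4, 1],
    [4, 2, 3, 1],
    [1, 1, 5, 5],
    [1, 1, 4, 4],
    [2, 1, 5, 4],
], dtype=float)

for rank in (1, 2, 4):
    error = np.linalg.norm(complete_ratings - low_rank_approximation(complete_ratings, rank))
    print(rank, round(float(error), 4))
```

The two matrices from the description correspond to the scaled `u` columns and the `vt` rows; the approximation error shrinks as more singular values are kept.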

Alternating Least Squares
However, a complication arises when the matrix you are trying to decompose is not complete. In other words, when some fields in the matrix are empty or undefined, SVD is insufficient for finding the composing matrices. In this case Alternating Least Squares (ALS) is the way to go. More to come about this.
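
Until then, here is a minimal sketch of the idea, under the assumption that 0 marks a missing rating and with made-up parameter names and defaults. ALS fixes the item matrix and solves a small regularised least-squares problem per user using only that user's known ratings, then fixes the user matrix and does the same per item, and repeats.

```python
import numpy as np


def alternating_least_squares(ratings, num_features=2, num_iterations=50,
                              regularization=0.05, seed=0):
    """Factorise an incomplete ratings matrix (0 = missing) with ALS."""
    rng = np.random.default_rng(seed)
    known = ratings > 0
    num_users, num_items = ratings.shape
    users = rng.random((num_users, num_features))
    items = rng.random((num_items, num_features))
    ridge = regularization * np.eye(num_features)

    for _ in range(num_iterations):
        # Fix the items, solve for each user from that user's known ratings only.
        for u in range(num_users):
            rated = known[u]
            if rated.any():
                a = items[rated]
                users[u] = np.linalg.solve(a.T @ a + ridge, a.T @ ratings[u, rated])
        # Fix the users, solve for each item from the users who rated it.
        for i in range(num_items):
            raters = known[:, i]
            if raters.any():
                a = users[raters]
                items[i] = np.linalg.solve(a.T @ a + ridge, a.T @ ratings[raters, i])
    return users, items


incomplete = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)

users, items = alternating_least_squares(incomplete)
print(users @ items.T)
```

Every per-user and per-item solve is independent of the others within a half-step, which is why ALS is considered easy to parallelise.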

Also, here is another remarkable quote I found in a paper from Zhou et al.

“We have found parallel Matlab to be flexible and efficient, and very straightforward to program. Thus, from our experience, it seems to be a strong candidate for widespread, easily scalable parallel/distributed computing [compared to Hadoop and MapReduce].”

Some more references:

Zhou, Yunhong, Dennis Wilkinson, Robert Schreiber, and Rong Pan. “Large-Scale Parallel Collaborative Filtering for the Netflix Prize.” Proc. 4th Int’l Conf. Algorithmic Aspects in Information and Management, LNCS 5034 (2008): 337-348.
Koren, Y., R. Bell, and C. Volinsky. “Matrix Factorization Techniques for Recommender Systems.” Computer 42, no. 8 (August 2009): 30-37.
How does SVD work

Summary of Toward the Next Generation of Recommender Systems

marcus@ljungblad.nu (Marcus Ljungblad) — Fri, 13 Jan 2012 00:00:00 -0800

The paper surveys the recommendation research area from around the mid-90s to around 2005, when the paper was published. The main contribution is a table classifying existing deployments and research in content-based, collaborative, and hybrid recommendation systems.

The general algorithm may be described as RelevanceOfItem = User x Item, or utility(user,item). Due to the size of this set, it is usually not computed for the whole user space. Instead there are a number of predictions available. These predictions are calculated using either a memory (sometimes called heuristic) method or a model-based method to compute the relevance of items for users.

Collaborative Methods
These recommendation systems use the ratings of other users to produce a relevance set for a particular user. This is done with a user-similarity function which can be defined in several ways. The similarity is usually computed as the distance between two users, and this value is used as a weight for the relevance calculation of an item which the user has not yet rated. One technique to find similarity between two users is to look at the items they have both rated previously. However, the calculations should be normalised (see formula 10b in the paper) to account for the fact that different users use, for example, a rating scale differently. 10 doesn’t always mean 10.

Two common approaches to computing the similarity between users in collaborative filtering are the Pearson correlation coefficient and the cosine-based approach.
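
Both measures can be sketched in a few lines over the items two users have both rated. The dictionary-based input format and the function names are assumptions for illustration. Note how Pearson subtracts each user's mean first, which is one way of handling users who use the rating scale differently.

```python
import math


def _co_rated(ratings_a, ratings_b):
    """Items rated by both users."""
    return [item for item in ratings_a if item in ratings_b]


def cosine_similarity(ratings_a, ratings_b):
    """Cosine of the angle between the two users' co-rated rating vectors."""
    items = _co_rated(ratings_a, ratings_b)
    dot = sum(ratings_a[i] * ratings_b[i] for i in items)
    norm_a = math.sqrt(sum(ratings_a[i] ** 2 for i in items))
    norm_b = math.sqrt(sum(ratings_b[i] ** 2 for i in items))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pearson_similarity(ratings_a, ratings_b):
    """Pearson correlation over co-rated items (mean-centred per user)."""
    items = _co_rated(ratings_a, ratings_b)
    if len(items) < 2:
        return 0.0
    mean_a = sum(ratings_a[i] for i in items) / len(items)
    mean_b = sum(ratings_b[i] for i in items) / len(items)
    covariance = sum((ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in items)
    spread_a = math.sqrt(sum((ratings_a[i] - mean_a) ** 2 for i in items))
    spread_b = math.sqrt(sum((ratings_b[i] - mean_b) ** 2 for i in items))
    if spread_a == 0 or spread_b == 0:
        return 0.0
    return covariance / (spread_a * spread_b)


enthusiast = {"x": 5, "y": 3, "z": 1}
moderate = {"x": 4, "y": 3, "z": 2}   # same taste, narrower use of the scale
contrarian = {"x": 1, "y": 3, "z": 5}

print(pearson_similarity(enthusiast, moderate))
print(pearson_similarity(enthusiast, contrarian))
print(cosine_similarity(enthusiast, moderate))
```

In this toy data the first two users agree perfectly under Pearson even though their raw numbers differ, while the cosine measure, which does not centre the ratings, rates them as close but not identical.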

Calculating user similarities can be expensive and thus one approach is to precompute these values for all users (recomputing them once in a while) and calculating the ratings much more efficiently when the user actually asks for them.
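
A hedged sketch of that split: assume the pairwise similarities already sit in a precomputed lookup table, so that the request-time work is only a weighted average over the users who rated the item. The mean-centred weighting used here is one common variant, and the function name, the `(user, other)` key format, and the toy data are all assumptions.

```python
def predict_rating(user, item, ratings, similarity):
    """Predict `user`'s rating of `item` from precomputed user similarities.

    ratings: {user: {item: rating}}
    similarity: {(user, other_user): weight}, computed offline beforehand.
    """
    user_mean = sum(ratings[user].values()) / len(ratings[user])
    weighted_deviation = 0.0
    total_weight = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        weight = similarity.get((user, other), 0.0)
        other_mean = sum(other_ratings.values()) / len(other_ratings)
        # How far above or below their own average the other user rated the item.
        weighted_deviation += weight * (other_ratings[item] - other_mean)
        total_weight += abs(weight)
    if total_weight == 0:
        return user_mean  # nothing to go on: fall back to the user's average
    return user_mean + weighted_deviation / total_weight


ratings = {
    "ann": {"x": 4, "y": 2},
    "bob": {"x": 4, "y": 2, "z": 6},
    "cat": {"x": 1, "y": 5, "z": 0},
}
similarity = {("ann", "bob"): 1.0, ("ann", "cat"): -1.0}

print(predict_rating("ann", "z", ratings, similarity))
```

A dissimilar user disliking an item counts as evidence in favour of it, which is why the negative weight for the third user pushes the prediction up rather than down.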

Predictions can also be made using a model-based technique. This may be probabilistic, for example using clustering or Bayesian networks. Making the model representative is challenging, and clustering may, for example, only limit a user to one single cluster.

Machine learning techniques have also been proposed to address the nature of evolving data.

Issues with collaborative methods

New users may have no or very little similarity to existing users. Relates to bootstrapping data. Take generalised sets?
New items rely on users rating them. Some weird items may be rated very high but only by a small set of users and thus have less total influence.
Sparsity, i.e. there is not enough data. Adding context or using more profile data about the users is a way of overcoming this problem. The paper makes a case for using demographic data.

There are also hybrid versions that can address some of the shortcomings in each of the two main methods.

Comments

OLAP – check the paper Phillipe mentioned in his talk.
Bootstrapping – either make some educated guesses or make sure you have data from the users.
Would be fun to hack a small simple recommendation system just to get the gist of it.
The paper makes a number of references to scalability issues with recommendation systems but provides no discussion of them itself.

First day at Tuenti

marcus@ljungblad.nu (Marcus Ljungblad) — Thu, 12 Jan 2012 00:00:00 -0800

Got impressed.

Got a workplace.

Got introduced.

Forgot everyone’s names.

Except Toni (es), my colleague who is an expert on recommendation systems, Tomasz (swe), the Barcelona boss, Albert (es), my supervisor in the product team responsible for the recommendation system development, Einar (swe), also in the team, and Virginia (ar), the very helpful office assistant. I’ll learn the other names eventually.

Got hungry.

Had an exceptionally late lunch by Swedish standards.

Began reading a survey paper on recommendation systems. More info to come.

Got impressed.

Had a product brainstorm meeting. I had no clue about anything.

Still have no clue.

Is happy.

Entry 4

marcus@ljungblad.nu (Marcus Ljungblad) — Sat, 07 Jan 2012 00:00:00 -0800

I’m tired of course work now. Need one of these:

Entry 3

marcus@ljungblad.nu (Marcus Ljungblad) — Fri, 06 Jan 2012 00:00:00 -0800

Facebook built Timeline by recomputing tonnes of data into a common data format. Aggregators are used to query multiple databases simultaneously. Reminds me of Dremel which Google developed.

Moving to Barcelona in less than a week. Excited.

Entry 2

marcus@ljungblad.nu (Marcus Ljungblad) — Sun, 25 Dec 2011 00:00:00 -0800

I found this post today on how to architect near real-time systems for risk analysis. The problem seems similar to recommendation engines, which use a lot of historical data that rarely changes and only a little data that changes frequently.

Entry 1

marcus@ljungblad.nu (Marcus Ljungblad) — Thu, 15 Dec 2011 00:00:00 -0800

This will be a collection of data, comments, ideas, summaries, links, babbling, and random whatnots which somehow relate to my thesis work at Tuenti.

Expected end-date: mid-June.