To evaluate the current thesis proposal I challenged my EMDC colleague Lalith (who is doing some awesome work with wireless networks at T-Labs in Berlin) with the following question:
Let’s hypothetically say the name of my thesis is “Architecting Recommendation Systems for Web-Scale Data”. What would you expect to read in it?
Well, as it turns out his answer, although significanly shorter, matched the following, highly tentative, outline of a report quite well. I’ve updated it a bit to include Lalith’s feedback. Now I should verify this with my supervisor at UPC too.
Introduction
- recommendations for personalisation and increased interaction
- problem with scale, optimizing algorithms or sampling the data
- little systems research on collaborative filtering and recommender systems (mostly on algorithms)
- building a recommendation system which serves millions of users
- Supporting a range of content: videos, games, photos, albums, friends, places, pages
- main contributions:
- a system which supports several content types
- able to update according to recent contextual information
- evaluation on big data sets
Background
Definitions
- web-scale data
- collaborative filtering
- model based – common approaches
- memory based – common approaches
Describe current solutions to the growing amount of data
- it has mostly focused on algorithm enhancement and/or downsizing the data
- some algorithms are being ported to mapreduce, for example through the mahout project
- other attempts include graphlab which uses something like a “bulk asynchrounous processing” model, but still lacks widespread production use and has limited support for distributed computations
- biggest published system on recommendation systems is google news personalisation. The algorithms are simplified and system specific to Google’s infrastructure
Problems / Limitations of existing systems
Method
Something about the research method(s) used. Big TBD.
System / Architecture
- Data collection – capturing user feedback, and using it for online feedback
- Algorithms for computing recommendation model – dividing the model in two parts
- Serving recommendations
- Updating recommendations based on contextual data from a session, i.e creating relevant recommendations on the most recent user activities.
- Components needed / Implementation
- offline (non-realtime)
- online (realtime)
Details
- Usage peaks – degrading quality of service depending on load
- Blacklisting, i.e removing recommendations that a user deemed irrelevant or has already seen
- Updating / creating new recommendations on the fly
- New users / cold start, i.e what to do when there are no previous history from the user
- HCI – How long time does it take to serve a recommendation vs better to change UI to improve effectiveness (TBD)
Evaluation
Quantitative
Measure existing recommendations and compare with new system
- accuracy of algorithm (not sure how relevant this is for a systems paper)
- accuracy vs load
- serving recommendations (latency / throughput)
- clicks/interaction
Also check if it is quantatively comparable to any existing systems.
Qualitative
Architecture
Flexibility / modularity
Scalability
Conclusion
It will be awesome ;)