Using case studies, we show that different evaluation methodologies lead to contrasting conclusions about recommendation quality.

Recommender systems suggest customized products to users. Most recommender algorithms build collaborative models from web user profiles. In recent years, the Netflix contest has attracted considerable attention from researchers in the area of recommender systems. However, many recent papers on recommender systems present results evaluated with the methodology used in the Netflix contest, even in domains whose objectives differ from those of the contest (e.g., the top-N recommendation task).

In this paper we do not propose new recommender algorithms; rather, we compare different aspects of the official Netflix contest methodology, based on RMSE and holdout, with methodologies based on k-fold cross-validation and classification accuracy metrics. We show, with case studies, that different evaluation methodologies lead to sharply contrasting conclusions about the quality of recommendations.
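
As a rough illustration of how an error metric such as RMSE and a classification accuracy metric can disagree, the sketch below scores two hypothetical predictors on a made-up set of four test ratings. The data, the predictor outputs, the relevance threshold of 4, and the helper names `rmse` and `precision_at_n` are all assumptions chosen for illustration; none of them come from the paper or its case studies.

```python
import math

def rmse(true, pred):
    """Root mean squared error between true and predicted ratings."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true))

def precision_at_n(true, pred, n, threshold=4):
    """Fraction of the n highest-predicted items whose true rating
    is at least `threshold` (an assumed definition of 'relevant')."""
    top = sorted(range(len(pred)), key=lambda i: pred[i], reverse=True)[:n]
    hits = sum(1 for i in top if true[i] >= threshold)
    return hits / n

# Made-up test ratings for a single user (illustrative only).
true_ratings = [5, 4, 3, 3]

# Hypothetical predictor A: close to the true values, but ranks items wrongly.
pred_a = [3.7, 3.7, 3.8, 3.9]
# Hypothetical predictor B: systematically too low, but ranks items correctly.
pred_b = [3.0, 2.9, 1.5, 1.4]

rmse_a, rmse_b = rmse(true_ratings, pred_a), rmse(true_ratings, pred_b)
prec_a = precision_at_n(true_ratings, pred_a, n=2)
prec_b = precision_at_n(true_ratings, pred_b, n=2)

print(f"A: RMSE={rmse_a:.3f}, precision@2={prec_a:.1f}")
print(f"B: RMSE={rmse_b:.3f}, precision@2={prec_b:.1f}")
```

On this toy data, predictor A has the lower RMSE while predictor B has the higher precision at 2, so the two metrics would pick different winners. This is only a constructed example of the kind of disagreement the abstract describes, not a reproduction of the paper's results.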

PDF, 10 pages, 0.3 MB