Best approach for A/B testing two different recommendation systems
I have two music recommendation systems. Each one generates a list of recommended songs for a particular user based on the songs they have saved in their library. The user then rates each recommendation out of 6. I will evaluate the two systems by comparing the average rating given to songs recommended by system A with the average rating given to songs recommended by system B.
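To make the metric concrete, here is a minimal sketch of how I compute it. The function name and the sample ratings are made up for illustration only; they are not real results.

```python
from collections import defaultdict
from statistics import mean


def mean_rating_by_system(ratings):
    """Average the ratings per system.

    `ratings` is an iterable of (system, rating) pairs,
    where system is "A" or "B" and rating is the user's score out of 6.
    """
    by_system = defaultdict(list)
    for system, rating in ratings:
        by_system[system].append(rating)
    return {system: mean(values) for system, values in by_system.items()}


# Hypothetical ratings, invented purely to show the input format.
example_ratings = [("A", 5), ("A", 4), ("A", 6), ("B", 3), ("B", 2), ("B", 4)]
result = mean_rating_by_system(example_ratings)
```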
Let us use A to denote a song recommended by system A and B to denote a song recommended by system B. For a particular user, should the recommendations all come from one system (AAAAAA or BBBBBB), or should they alternate between the two (ABABABAB)? For now I have implemented the first option (AAAAAA or BBBBBB). Thus, in the current setup, each respondent is randomly assigned to A or B and only gets recommendations from that system. Is this the right approach, or does showing each respondent only one system bias their ratings, since they never see what the other system would have recommended?
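The two designs I am choosing between can be sketched roughly as follows. This is only an illustration: the function names, the fixed seed, and the choice of 6 songs per user are assumptions for the example. In the mixed design I shuffle the order rather than strictly alternating ABAB, which is also an assumption on my part, not something I have settled on.

```python
import random


def one_system_per_user(user_ids, n_songs=6, seed=0):
    """Current design: each user is randomly assigned to A or B
    and receives all of their recommendations from that system."""
    rng = random.Random(seed)
    plan = {}
    for uid in user_ids:
        system = rng.choice("AB")       # one draw per user
        plan[uid] = [system] * n_songs  # e.g. AAAAAA or BBBBBB
    return plan


def both_systems_per_user(user_ids, n_songs=6, seed=0):
    """Alternative design: each user receives an equal number of songs
    from A and B, in random order (n_songs is assumed to be even)."""
    rng = random.Random(seed)
    plan = {}
    for uid in user_ids:
        slots = ["A", "B"] * (n_songs // 2)
        rng.shuffle(slots)              # randomise position within the list
        plan[uid] = slots
    return plan


users = list(range(10))
single = one_system_per_user(users)
mixed = both_systems_per_user(users)
```

In the first plan every user's list contains a single label; in the second every user's list contains three A slots and three B slots.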
Let us assume that B is far better than A. If each user only receives recommendations from one system, a user assigned to A would never have heard anything from system B, and their ratings of A would probably have been different (lower) if they had listened to songs from the better system too. Is the ABABAB approach the better one? Which method is best for evaluating the performance of each system while reducing bias?
Thanks.
Topic recommender-system
Category Data Science