Multi-Armed Bandit: determining how to reward a selection
I am new to reinforcement learning and am trying to understand how to apply a multi-armed bandit (MAB) to real-world cases.
Here is my scenario. Since I'm new to this, I'm starting with a small case:
We have to recommend 5 items from a set of 20. Currently we use CTR (click-through rate) to select the top 5 items and display them.
The problems we saw with this approach were cold start and that exploring the other items was not possible, so I started looking at MAB.
Consider the data below to explain the reward in my case:
items : [1,2,3,4,5,6,7,8,9,10,...20]
current CTR%: [5.0,3.23,3.16,3.09,2.87,2.6,2.4,2.1,1.0,0.9,0.76,0.43,...0.00]
reward : [?,?....?]
top items : [1,2,3,4,5]
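For reference, our current selection can be sketched roughly as below. This is only an illustration, not our production code: `top_k_by_ctr` is a hypothetical name, only the twelve CTR values written out above are used, and mapping the 11th and 12th values to items 11 and 12 is my assumption. The elided items are left out.

```python
# CTR (%) per item, limited to the values listed above.
# Items 11 and 12 are assumed to match the 11th and 12th CTR values.
ctr_percent = {
    1: 5.0, 2: 3.23, 3: 3.16, 4: 3.09, 5: 2.87, 6: 2.6,
    7: 2.4, 8: 2.1, 9: 1.0, 10: 0.9, 11: 0.76, 12: 0.43,
}

def top_k_by_ctr(ctr, k=5):
    """Return the k item ids with the highest CTR (purely greedy, no exploration)."""
    return sorted(ctr, key=ctr.get, reverse=True)[:k]

top_items = top_k_by_ctr(ctr_percent)
print(top_items)  # [1, 2, 3, 4, 5]
```

Because the ranking is purely greedy, an item with few or no impressions never gets a chance to enter the top 5, which is the exploration problem described above.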
My questions are:
How should I run the MAB? Should it select all 5 items that we are going to display, or should it be run for only 1 or 2 of the display slots in my case?
Should I run the MAB on all 20 items, or only on the 15 (20 - 5) items that are not currently in the top 5?
How do I determine the reward? I am considering the following:
reward(t) = total user clicks on the item / total number of times the item appeared
- What should I do for the remaining 4 displayed items that were not clicked? Should I still increase their total appeared count?
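To make the question concrete, here is a minimal sketch of the bookkeeping I have in mind. It assumes that every displayed item has its appeared count increased whether or not it was clicked, and that assumption is exactly what I am unsure about. The names `record_impression` and `reward` are hypothetical.

```python
from collections import defaultdict

clicks = defaultdict(int)        # item id -> total clicks
appearances = defaultdict(int)   # item id -> total times displayed

def record_impression(displayed_items, clicked_item=None):
    """Update counts after one page view.

    Assumption: every displayed item counts as one appearance,
    including the items that were shown but not clicked.
    """
    for item in displayed_items:
        appearances[item] += 1
        if item == clicked_item:
            clicks[item] += 1

def reward(item):
    """Clicks divided by appearances; 0.0 for items never displayed."""
    if appearances[item] == 0:
        return 0.0
    return clicks[item] / appearances[item]

# Two page views of the same top 5: one click on item 2, then no click at all.
record_impression([1, 2, 3, 4, 5], clicked_item=2)
record_impression([1, 2, 3, 4, 5])
print(reward(2))  # 0.5
print(reward(1))  # 0.0
```

Under this assumption the reward is just the running CTR of each item, and an item that has never been displayed stays at 0.0.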
Please let me know if there is a better way to do this.