Multi-Armed Bandit: how to determine the reward for a selection

I am new to reinforcement learning and am trying to understand how to apply multi-armed bandits (MAB) to real-world cases.

Here is my scenario. Since I am new to this, I am starting with a small case:

     We recommend 5 items from a set of 20. Currently we use CTR (click-through rate) to select the top 5 items and display them.

The problems we saw with this approach were cold start and the lack of exploration of the other items, so I started looking at MAB.

Consider the data below when explaining the reward in my case:

items        : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ..., 20]
current CTR %: [5.0, 3.23, 3.16, 3.09, 2.87, 2.6, 2.4, 2.1, 1.0, 0.9, 0.76, 0.43, ..., 0.00]
reward       : [?, ?, ..., ?]

top items : [1,2,3,4,5]
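
For reference, the current CTR-only selection can be sketched as below. This is a minimal illustration, not production code: the function name `top_k_by_ctr` is hypothetical, and only the 12 CTR values actually listed above are used, since the rest are elided.

```python
# CTR values (in percent) for the first 12 items, copied from the listing above.
# The values for the remaining items are elided in the post and are not guessed here.
known_ctr = [5.0, 3.23, 3.16, 3.09, 2.87, 2.6, 2.4, 2.1, 1.0, 0.9, 0.76, 0.43]


def top_k_by_ctr(ctr_values, k=5):
    """Return the 1-based ids of the k items with the highest CTR.

    Hypothetical helper: this is pure exploitation, with no exploration step.
    """
    ranked = sorted(range(len(ctr_values)), key=lambda i: ctr_values[i], reverse=True)
    return [i + 1 for i in ranked[:k]]


top_items = top_k_by_ctr(known_ctr)
print(top_items)  # [1, 2, 3, 4, 5]
```

Because the ranking depends only on the observed CTR, an item that has never been displayed has no way to enter the top 5, which is the cold-start issue described above.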

My questions are:

  1. How should I run MAB? Should it select all 5 items that we are going to display, or only 1 or 2 of them?

  2. Should I run MAB on all 20 items, or only on the 15 (20 - 5) items that are not in the current top 5?

  3. How do I determine the reward? I am considering the following:

     reward(t) = total user clicks on the item / total number of times the item appeared

     a. If the user clicks one item, what should I do for the remaining 4 displayed items that were not clicked? Should I still increase their appearance count?
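
A minimal sketch of the reward estimate above, to make the bookkeeping concrete. The names `record_display` and `estimated_reward` are hypothetical, and the sketch assumes that every displayed item counts as one appearance whether or not it was clicked. That assumption is exactly the open point in question 3, so treat it as one possible choice rather than the answer.

```python
from collections import defaultdict

clicks = defaultdict(int)       # item id -> total clicks
impressions = defaultdict(int)  # item id -> total times the item was displayed


def record_display(shown_items, clicked_item=None):
    """Update the counters after one display of a slate of items.

    ASSUMPTION: every shown item gets +1 appearance, including unclicked ones.
    """
    for item in shown_items:
        impressions[item] += 1
    if clicked_item is not None:
        clicks[clicked_item] += 1


def estimated_reward(item):
    """reward = total clicks on the item / total times the item appeared."""
    if impressions[item] == 0:
        return 0.0  # never shown: no evidence yet (the cold-start case)
    return clicks[item] / impressions[item]


# Two displays of the current top 5: one click on item 2, then no click at all.
record_display([1, 2, 3, 4, 5], clicked_item=2)
record_display([1, 2, 3, 4, 5])

print(estimated_reward(2))  # 0.5 (1 click / 2 appearances)
print(estimated_reward(1))  # 0.0 (0 clicks / 2 appearances)
```

Under this assumption an unclicked item's estimate falls each time it is shown without a click, while an item that was never shown stays at 0.0 only because there is no data for it.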

Please let me know if there is a better way to do this.

Topic reinforcement-learning machine-learning

Category Data Science
