In this paper, a spectrum sensing policy employing recency-based exploration is proposed for cognitive radio networks. We formulate the problem of finding a spectrum sensing policy for multiband dynamic spectrum access as a stochastic restless multiarmed bandit problem with stationary, unknown reward distributions. In cognitive radio networks, the multiarmed bandit problem arises when deciding where in the radio spectrum to look for idle frequencies that could be exploited efficiently for data transmission. We consider two models for the dynamics of the frequency bands: 1) the independent model, in which the state of a band evolves randomly and independently of its past, and 2) the Gilbert-Elliott model, in which the state evolves according to a two-state Markov chain. It is shown that, under both models, the proposed sensing policy attains asymptotically logarithmic weak regret.
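For illustration only, the two band-state models can be simulated along the following lines; the idle probability and transition probabilities used here are placeholder values and are not taken from the paper.

    import random

    def step_independent(p_idle=0.6):
        # Independent model: the band is idle (1) or busy (0) i.i.d. over time.
        return 1 if random.random() < p_idle else 0

    def step_gilbert_elliott(state, p01=0.2, p10=0.1):
        # Gilbert-Elliott model: two-state Markov chain, 0 = busy, 1 = idle.
        # p01 = P(busy -> idle), p10 = P(idle -> busy); values are assumed.
        u = random.random()
        if state == 0:
            return 1 if u < p01 else 0
        return 0 if u < p10 else 1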
The policy proposed in this paper is an index policy, in which the index of a frequency band comprises a sample-mean term and a recency-based exploration bonus term. The sample-mean term promotes spectrum exploitation, whereas the exploration bonus encourages continued exploration for idle bands providing high data rates. The recency-based approach readily allows the exploration bonus to be constructed so that the time interval between consecutive sensing instants of a suboptimal band grows exponentially, which in turn leads to logarithmically increasing weak regret; a sketch of such an index is given below. Simulation results confirming the logarithmic weak regret are presented, and it is found that the proposed policy often provides improved performance at low complexity compared to other state-of-the-art policies in the literature.
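The following sketch illustrates the general structure of such an index policy with a recency-based exploration bonus. The specific bonus form used here, (t - tau_k) / 2^(n_k), is a hypothetical choice for illustration and may differ from the bonus defined in the paper; it merely shows how a bonus depending on the time since a band was last sensed can force re-sensing of a suboptimal band only after intervals that roughly double, yielding logarithmically many suboptimal sensing instants.

    import numpy as np

    def select_band(sample_mean, last_sensed, n_senses, t):
        # sample_mean[k]: empirical mean reward (e.g., achieved rate) of band k
        # last_sensed[k]: time instant at which band k was last sensed
        # n_senses[k]:    number of times band k has been sensed so far
        # Hypothetical recency-based bonus (assumed form, not from the paper):
        bonus = (t - last_sensed) / (2.0 ** n_senses)
        index = sample_mean + bonus
        return int(np.argmax(index))   # sense the band with the largest index

With this assumed bonus, a band whose sample mean falls short of the best band's by a gap is selected again only once (t - tau_k) / 2^(n_k) exceeds that gap, i.e., after an interval proportional to 2^(n_k), which is the exponential growth of sensing intervals referred to above.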