Slotted ALOHA with Reinforcement Learning

In this article we will discuss the potential of reinforcement learning (RL) to learn a backoff control policy for slotted ALOHA-type random access. We will use deep reinforcement learning (DRL) to learn a policy for multi user random access system.

Slotted ALOHA Background

Slotted ALOHA (sALOHA) protocol [1] for random access in wireless networks has been around for over 50 years. Thanks to its simple design and configuration it is still a topic of great interest, specifically in the machine-type communication (MTC) paradigm in 5G and beyond communication systems. In sALOHA users in the network randomly transmit in a grant-free manner with the hope that the receiver will decode their packets. For instance, consider a network of N = 10 sensor nodes in a certain geographic area and a central unit that collects the information from these sensors/users. Each sensor reports its readings to the central unit in form of packets containing the information about its reading. Whenever a sensor has some reading during the day it transmits its packet without coordinating with other sensors and without being scheduled by the central unit. In sALOHA we divide time in slots and enforce each sensor through synchronization to always transmit at the beginning of each time slot. 

However, what happens when two or more sensors transmit at the same time? This is where it becomes complicated. Their packets collide and the transmission of every user fails. They will have to send their packet again. Let us assume that each of the 5 users have a packet to transmit and they transmit them with a certain transmit probability pn. Now having the same pn for all the users will lead to what we call unstable behavior, i.e., eventually no user is able to transmit its packet, and the performance goes to zero. The performance is commonly measured with the throughput, i.e., the number of successful transmissions (packets) in a given time. To tackle this issue certain backoff techniques have been proposed in the literature, i.e., using some kind of feedback signal from the receiver to indicate whether the transmission on the channel resulted in a collision or not. There are other feedback signals considered in the literature, such as success/no-success and ternary feedback for success, collision, idle events, but in this work we only consider collision/no-collision feedback. If a collision event happens receiver sends 0 bit as feedback, and if no-collision event happens 1 bit is sent by the receiver. The exponential backoff (EB) is a commonly used backoff technique in which users reduce their transmit probability exponentially for each collision event. For a non-collision event they transmit with a maximum probability pmax and usually it is set to pmax = 1. The protocol is still distributed, i.e., each user independently calculates its pn after each collision or no-collision event. In binary exponential backoff (BEB) [2] the backoff factor is set to 2, i.e., after i consecutive collisions, the probability of a user is reduced to pn*2-i. The theoretical throughput for sALOHA is 1/e. 

This EB technique introduced another issue. It had been observed in the simulation experiments that the overall throughput of the network was very large, i.e., 0.6 or 0.7, compared to theoretical results that is 1/e = 0.36. It took researchers some time to figure out the reason for this behavior. It is commonly referred to as the capture effect. With pmax = 1 a user would capture the channel for some time, while starving other users, resulting in high overall throughput, but causing unfairness in the distribution of resources among users. Therefore, it is important for a protocol not only to have a good throughput but also to be fair.

Motivation

Since then there has been a tremendous amount of work on sALOHA. However, there has been no consensus on the performance and this throughout-fairness tradeoff. In fact, due to underlying assumptions of different models considered, the results are sometimes contradictory and confusing [4]. For instance, there is no consensus what is the optimal backoff factor or what kind of backoff technique is good. What is a good model for traffic arrival process? Moreover, most of the models assume that each user always has a packet to transmit, which is not true for MTC applications where the traffic is sporadic. This complexity of the process motivates us to analyse the performance of sALOHA with reinforcement learning and to learn a backoff protocol (a policy) using the simple aforementioned binary feedback signal. 

Problem Formulation with RL

In RL an agent in a state S interacts with an environment without knowing the dynamics of the environment by choosing an action from a pre-defined action space, receives a reward about how good the action was and moves to the next state S. In this case, the environment is the random access channel, action is to TRANSMIT (1) or NOT TRANSMIT (0), and we define reward as SUCCESS. If a packet has been successfully transmitted over the channel, the reward is 1, and it is 0 otherwise. The environment is show in the Fig. 1. For multiple number of users, every user is an RL agent. The objective of the agent is to maximize the reward in the long run. We assume homogeneous users, and we train them in a centralized way using neural network (DQN algorithm [5]) in such a way that they all learn the same policy that is implemented in a distributed manner for decision making. The purpose of this work is to find a policy for ALOHA-type random access, that provides us better throughput as well as fairness among users.

Fig. 1: Interaction of agents (users) with the environment. At the beginning each time slot k, each agent n first performs an action An(k), then receives the feedback signal F(k) at the end of the time slot k. Users updates their buffers depending on the feedback signal and new packet arrivals to the system.

For this purpose, each user uses its local observation to decide whether to transmit or not. The local observations make the state of the user. For time slot k the state contains previous action, previous feedback and buffer state. Buffer state indicates whether a user has a packet in its queue or not. The state of a user at time k can be written as:

Sn(k) = [An(k-1), F(k-1), Bn(k-1)] ——— (1)

Where An(k) ∈ {0, 1}, and F(k) ∈ {0, 1}, Bn(k) ∈ {0, 1}, which gives us total 23 states. Naturally, when there is no packet in the buffer of user n, i.e., Bn(k) = 0, the transmit probability is set to zero. Therefore, we are only interested in 4 states when Bn(k) = 1.

Procedure

At each time slot k new packet arrives using the Poisson arrival process with average arrival rate λ. We assume that each user can only contain maximum of one packet. If there is a packet in user’s buffer, it follows the learned policy to decide whether to transmit or not. The output of the DQN is the Q-values corresponding each action. These Q-values are used to calculate the transmit probability with softmax policy. After taking action all users receive a common feedback signal F(k) and use this to calculate the reward. The reward is global, i.e., same for all users. If the transmission was successful, every user receives reward that equals 1 or 0 otherwise.

Simulation Results

In Eq. 1 we have seen that there are only 4 possible states for each user, and it is interesting to see how the transmit probability changes for each state as we increase the arrival rate from 0.2 to 2 as shown in Fig. 2. We use transfer learning to train from lower to higher arrival rates. Learning for a small state space allows us to look into the details of how the policy changes for different arrival rates and helps us visualize it.

Fig. 2: DQN-RA policy transition w.r.t the arrival rate λ during training

We analyse the performance of this DQN-based random access (DQN-RA) protocol for throughput and fairness. For fairness we propose to use a parameter we refer to as age of packet (AoP). Popular Jain’s fairness index doesn’t consider short-term fairness, and AoP can measure fairness as well indicate the packet delay. If user transmits a packet, AoP(k) = 0; otherwise the AoP increases linearly with time, i.e., AoP(k) = AoP(k-1) +1 . We compare the performance of DQN-RA with BEB for 10 users. Clearly the proposed algorithm outperforms BEB in terms of both throughput and fairness. We compare the performance of DQN-RA with two types of EB, and we divide them into symmetric EB (SEB) and non-symmetric EB (nSEB). nSEB is the same as mentioned above, where each user changes its transmit probability based on whether there was collision or not, but in case of no-collision, pn = pmax. On the other hand, in SEB every user increases or decreases the pn exponentially if no-collision or collision event happens respectively. SEB is a more fair scheme, because all the users change their probabilities. We compare for two backoff factors, binary (B) and 1.25 with the DQN-RA.

avg throughput
Average Throughput
Average AoP
Average AoP for fairness

Surprisingly, DQN-RA is inherently fair among users compared to the EB without explicitly optimizing it for fairness. Moreover, even though the results show average AoP for fairness over 10 users it is interesting to see the AoP of individual users, and we have observed (not shown here) that BEB is quite unfair due to capture effect. DQN-RA provides us a balanced trade-off between throughput and fairness compared to EB schemes.

For the future of this topic it would be interesting to see if adding more past feedback will improve the performance or not. Moreover, how does this scheme behaves if the number of users is large?

References

  1. N. Abramson, “The throughput of packet broadcasting channels,” IEEETransactions on Communications, vol. 25, no. 1, pp. 117–128, 1977.
  2. B.-J. Kwak, N.-O. Song, and L. Miller, “Performance analysis of exponential backoff,” IEEE/ACM Transactions on Networking, vol. 13,no. 2, pp. 343–355, 2005.
  3. . L. Massey, “Some new approaches to random-access communication,”in Proceedings of the 12th IFIP WG 7.3 International Symposium on Computer Performance Modelling, Measurement and Evaluation,Performance ’87, (NLD), p. 551–569, North-Holland Publishing Co.,1987.
  4. L. Barletta, F. Borgonovo, and I. Filippini, “The throughput and access delay of slotted-aloha with exponential backoff,” IEEE/ACM Transactions on Networking, vol. 26, no. 1, pp. 451–464, 2018.
  5. V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, 2015.

0

Muhammad Jadoon

Muhammad is from Pakistan, where he also received his Bachelor’s degree in telecommunication engineering from the University of Engineering and Technology Peshawar. This achievement was then followed by a Master’s degree from the University of Ulsan South Korea on a fully funded scholarship. As a part of the WindMill project, he is based at CTTC Barcelona under the supervision of Monica Navarro. His project is on machine learning for massive connectivity and massive random access.

Leave a Reply