What exactly is reinforcement learning?
Reinforcement learning is a trial-and-error training method for machine learning models. This approach is especially useful for teaching computers to make sequential, goal-oriented decisions in interactive environments. Learning occurs as the machine repeatedly attempts to solve a problem and is rewarded for actions that move it closer to a solution and penalized for actions that don’t.
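This trial-and-error loop can be sketched with tabular Q-learning, one of the simplest reinforcement learning algorithms. The environment below is a made-up illustration (a five-state corridor where only reaching the last state is rewarded), not an example from the article: the agent repeatedly tries actions, observes the reward, and nudges its value estimates toward what it observed.

```python
import random

# Toy environment (an assumption for illustration): a corridor of states
# 0..4. The agent starts at state 0 and is rewarded only at state 4.
N_STATES = 5
ACTIONS = [-1, +1]          # move left or move right
ALPHA, GAMMA = 0.5, 0.9     # learning rate and discount factor

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

random.seed(0)
for episode in range(200):
    s = 0
    while s != N_STATES - 1:
        a = random.choice(ACTIONS)                    # trial and error
        s_next = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s_next == N_STATES - 1 else 0.0    # reward only at the goal
        best_next = max(Q[(s_next, b)] for b in ACTIONS)
        # Nudge the estimate toward the observed reward plus future value
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s_next

# After training, the learned values prefer moving right in every state
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)}
print(policy)  # each state should map to +1 (move right, toward the goal)
```

Even though the agent is never told "the goal is on the right," the reward signal alone is enough for the value estimates to converge on that strategy.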
Gaming is both a useful way to explain reinforcement learning and a practical application of it. Modern chess programs like AlphaZero are trained using reinforcement learning strategies, specifically by playing games of chess against themselves thousands of times with no objective beyond winning.
With this approach, there is no “strategic” instruction of the kind a human player might receive (such as judgments about the value of particular openings and arrangements). Rather, every move is assessed strictly on the degree to which it contributes to the probability of victory, according to an assessment by its neural network.
Through this seemingly simple reinforcement learning method, both the algorithm and the neural network incrementally improve: the algorithm focuses its search on better moves, while the neural network makes more accurate assessments of winning probability. As a result, the chess program can quickly master the game.
AlphaZero, for example, was trained with a generic reinforcement learning algorithm originally devised for the game of Go. The program was able to achieve a superhuman level of play after only a few hours of self-play.
How is reinforcement learning different from unsupervised learning?
Reinforcement learning may seem similar to unsupervised learning in that both involve training without reference to existing datapoint labels or values.
Reinforcement learning, though, involves entirely different training objectives. The goal of unsupervised learning is to find similarities in datasets and group similar data points together, whereas with reinforcement learning the goal is to maximize the cumulative reward for specific decisions (or sequences of decisions).
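The phrase “maximize the cumulative reward” has a precise meaning: the agent optimizes the sum of per-step rewards, usually weighted by a discount factor so that near-term rewards count more. A minimal sketch, with a made-up reward sequence for illustration:

```python
# Discounted return G: the quantity a reinforcement learning agent
# tries to maximize. The reward sequence below is made up for illustration.
gamma = 0.9                            # discount factor
rewards = [0.0, 0.0, 1.0, 0.0, 5.0]    # rewards at steps t = 0, 1, 2, 3, 4

# G = sum over t of gamma**t * r_t
G = sum(gamma**t * r for t, r in enumerate(rewards))
print(G)  # 0.9**2 * 1 + 0.9**4 * 5 = 4.0905
```

An unsupervised learning algorithm has no such quantity to optimize; it looks only at the structure of the data itself.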
What are the challenges of reinforcement learning?
A cumulative reward training approach introduces a particular set of training challenges. For instance, what if the algorithm finds a single decision that seems to maximize value and thus sticks to that approach rather than branching out and exploring other potential solutions?
To go back to the chess example, what if the algorithm wins a few games using the famous London System opening and thus decides to never deviate from moving out the queen’s pawn first, as this decision produced positive results in the past? This focus on the queen’s pawn means that the algorithm will never branch out and learn more about other openings.
In the context of reinforcement learning, this is called the exploration vs exploitation trade-off. The algorithm naturally wants to exploit a decision that seems to maximize reward, but the programmer needs it to explore more (and even make decisions that may seem sub-optimal in the short term), in order to have a better overall understanding of the environment.
There are several training strategies to overcome this dilemma:
- Thompson sampling: The algorithm maintains a probability distribution over each decision’s reward and chooses by sampling from those distributions, so uncertain but promising options continue to be explored while good known options are still exploited.
- ε-greedy strategy: The algorithm picks the decision with the highest estimated reward most of the time, but with a small probability ε it instead chooses randomly, which guarantees continued exploration.
- Optimistic initialization: Exploration is encouraged until particular decisions prove themselves to be sub-optimal.
- Upper confidence bound: Decisions are made based on the highest plausible (“upper bound”) estimate of reward, which favors options that are either promising or not yet well explored.
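The ε-greedy strategy is the easiest of these to sketch in code. The toy two-armed bandit below is an illustrative assumption (the arm payout probabilities are made up): with probability EPS the agent explores a random arm, and otherwise it exploits the arm with the best estimated reward.

```python
import random

# ε-greedy on a toy two-armed bandit (payout probabilities are assumptions)
random.seed(1)
EPS = 0.1
true_means = [0.3, 0.7]     # hidden expected payouts of each arm
estimates = [0.0, 0.0]      # the agent's running reward estimates
counts = [0, 0]             # how often each arm has been pulled

for step in range(5000):
    if random.random() < EPS:
        arm = random.randrange(2)                   # explore: random arm
    else:
        arm = estimates.index(max(estimates))       # exploit: best estimate
    reward = 1.0 if random.random() < true_means[arm] else 0.0
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean

print(estimates)  # estimates approach the true payout rates
print(counts)     # the better arm ends up pulled far more often
```

Without the random-exploration branch, the agent could lock onto whichever arm paid off first, which is exactly the London System failure mode described above.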
What are the real-world applications of reinforcement learning?
Outside of gaming, there are many other applications of reinforcement learning happening around us.
Reinforcement learning is functionally quite close to how humans and animals learn to interact with their environments and acquire new skills, and it therefore has extensive applications in robotics.
Robotic arms, for instance, can learn from different sequences of movements based on a clearly defined outcome (like whether they did or did not pick up and rotate an object).
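A “clearly defined outcome” like this is encoded as a reward function. The sketch below is a hypothetical illustration (the function name, inputs, and tolerance are assumptions, not a real robotics API): full reward for a successful pick-and-rotate, partial credit for a grasp alone.

```python
# Toy reward signal for the pick-and-rotate task (illustrative assumption:
# the names and the 5-degree tolerance are made up, not a real robotics API).
def reward(grasped: bool, rotation_error_deg: float) -> float:
    """Score one attempt at picking up and rotating an object."""
    if not grasped:
        return 0.0      # failed to pick up the object at all
    if rotation_error_deg <= 5.0:
        return 1.0      # grasped and rotated to within tolerance
    return 0.1          # grasped, but the rotation missed the target

print(reward(True, 2.0))    # 1.0
print(reward(True, 30.0))   # 0.1
print(reward(False, 0.0))   # 0.0
```

Designing this scoring function well is often the hard part in robotics: the reward must be dense enough that random early movements occasionally earn credit, or learning never gets started.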