What is reinforcement learning? It is hard to describe without resorting to jargon – as most new things are. But for most people, you could view it as a way that a model can adapt as it receives new data: from either a real or a simulated environment.

On its face, that may not seem too useful. But it’s really effing useful! And it is useful for precisely this reason: as humans we do this exact same thing. Humans learn (kids especially) by going out, trying things, and adapting their model of reality as they go. Eventually they figure out that walking requires you to move your legs in a certain way, that saying certain words elicit favorable responses, and so on.

As humans grow older, they develop the ability to imagine and predict situations – we do it all the time. We handle a few situations in our head, imagine responses, and use that data to inform how we react. The computer’s version of this are simulations. Because computers can process data very quickly, we can have a reinforcement model learn by doing a task thousands of times slowly rewarding good results and punishing bad ones.

So let’s take a basic example with everyone’s favorite childhood game: Tic Tac Toe. Well, hopefully not your favorite! But it is one that everyone played, and the goal is simple: get three of your shape in a row.

As anyone who’s played this enough knows: once both players have learned the game, it will always result in a tie. It’s been shown that we can train monkeys to get to this point, so it should be possible for us to get computers to the same point!

And of course, we can. In fact, there are at least three ways we can do this:

- We can program a list of instructions into it, resembling a “IF the board looks like this, THEN do this.” It shouldn’t surprise you that in computers this is called If-Then logic. We can be a bit smart with this and say “If there are two in a row and the spot is free, then play there.”

- We can build a “dumb” learning algorithm that stores every possible way the board can look – we’d call this a state. On the first turn, a player can put the X in any of 9 spots, so we expand to 9 possible states. Turn 2 has 8 remaining options
*for each of the 9 initial states*, so we have 72 states after turn two. This expands quickly. We have 504 states after 3 turns, 3024 after 4, and 15,120 after just 5 turns. But if we can store and process all this, we can have an algorithm try things randomly a bit, record success rates at various states, and slowly learn to choose options that get the game into a better state.

Before we get to the third approach – it’s worth stating why the first two are interesting but not sufficiently useful. The first is incredibly time consuming and challenging to get right – the end result is a big messy block of code that will likely have bugs or shortcomings. Many video game AI’s are coded like this, and while this has gotten better it’s still quite likely you can find the exploit to beat the computer each time. The second approach is more interesting, but fails quickly due to processing and capacity for any complex game. Take Chess: the number of possible states expands rapidly to encompass more grains of sand than there are on the planet. Your laptop won’t stand a
chance.

And that brings us to the third option: reinforcement learning. Reinforcement learning still uses the concept of states, but rather than laboriously recording each and every one of them, we build a model – a neural network – that is able to predict the value of a given state using far fewer variables.

*To bring this home a bit for anyone who isn’t a data scientist: any model is simply a way of trying to explain many things using fewer things. And we all build informal models in our heads all the time!*

*For example, if you’ve shopped for a home recently, you’ve seen tens or hundreds of prices. But you quickly get to a point where you can price a new house reasonably if you know the location, the number of bedrooms, and how renovated it is. You now have simplified thousands of potential homes into being described by, in this case, three factors.*

*If you wanted to be a bit more sophisticated, you could gather a bunch of these data points and build a linear regression (or hopefully better models) to mathematically calculate how important each factor is. Here again, we’re attempting to explain many points/many states with fewer.*

Neural networks are a bit complex to try to describe for a general audience, but if you want a description of the architecture of one, you can [read a good description here]. But the essence of what a neural network does is that it uncovers those patterns or, in this case states, that are optimal.

Returning to our case of Tic Tac Toe, what this means is that our neural net has likely uncovered that having a triangular pattern is highly desirable and often leads to a win, so it would then weight situations that could lead to that pattern quite highly as well. This continues, and basically the neural net starts to build a very accurate model of how to get to ideal end states.

Effectively, this approach is how DeepMind (unfortunately acquired by Google) has been able to create algorithms that can beat any human at Chess, Go, and more recently Starcraft. No one is encoding a series of instructions – they are building an algorithm that can play said game, take in the results, and continuously improve itself. Instead of just building one, copy it. Once you have two, have them play each other once and they’ll each get a little smarter. But then have
them play each other 10,000 times, and they’ll both be pretty smart by the end of it.

Now, I’m not here to solve Chess, but what I did do was build a neural network to play Tic Tac Toe! It’s not the most useful thing to do, but the goal was to get a sense of what reinforcement learning entails with a lens toward much more complex systems in the future. Fortunately, I was able to rely on [person’s] great guide and code to build out the meat of it, but I wrote it myself in order to make sure I understood the whole system so that I could expand on it a bit and build more complex ones in the future.

The code itself is fairly straightforward. There’s a basic simulator of a Tic Tac Toe game that stores the state of the board, the outcome/status of the current game, and checks that a play sent by a player is valid. We then have players that interact with the game and have logic to “read” the board, “learn” and make a play.

I built three different kinds of players – each using different logic to make moves. The first one was a dummy that simply chose one random move out of those available to serve as a benchmark. The second relied on the “dumb” learning by saving every possible state of the board and slowly having it increase the weights on outcomes that led to good results and vice-versa. The last player is the reinforcement learning neural network version that takes in the data and updates its model after each game. While the “dumb” learning approach needs to store up to 360,000 states of play and their value, The neural net relies on three layers and thus only 729 different data points. We may be able to reduce this further, but it’s pretty obvious the value of saving only 1/500th the data!

So
with that said, here are the results once you have each of them play 15,000 games of Tic Tac Toe (this only takes a minute or so):

[Graph]

As you can see, the reinforcement model is great and quickly reaches a point where it is able to win or draw almost every game. It is worth noting that these models do rely on occasionally taking a random move rather than the best one available as a way for it to have some material to learn from. So the win rates are a bit hampered by that, but these were slowly decreased over the course of
the 15,000 runs.

As noted, Tic Tac Toe is an ideal learning case but not that useful – we can solve Tic Tac Toe without reinforcement learning! But there are a ton of places where reinforcement learning is useful – and I’ll share some thoughts on that next time!