# Introduction to Reinforcement Learning

## Reminder

We saw what Q-learning is a few minutes ago:
it is one of the fundamental techniques of RL. If you want to go deeper, look up Temporal Difference learning (TD learning).
As a reminder, Q-learning is a technique for learning Q-values, i.e. the quality of an action performed in a given state. From these Q-values we derive a policy (the choice our agent makes in each state).
We also saw how to iterate to find the correct Q-values:  
$$Q_{t}(S_{t-1}, a_{t-1}) = Q_{t-1}(S_{t-1}, a_{t-1}) + \alpha_t (r_t + \gamma \max_{a'} Q_{t-1}(S_t, a') - Q_{t-1}(S_{t-1}, a_{t-1}))$$
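
To make this update rule concrete, here is a minimal sketch of a single update on a toy table. The helper name `q_update` and all the numbers are ours, chosen only for illustration; the table is a plain list of lists so the snippet runs without any dependency.

```python
def q_update(q_table, state, action, reward, next_state, alpha, gamma):
    """Apply one tabular Q-learning update in place and return the new value."""
    best_next = max(q_table[next_state])            # max over a' of Q(S_t, a')
    td_target = reward + gamma * best_next          # r_t + gamma * max(...)
    td_error = td_target - q_table[state][action]   # how far the old estimate is off
    q_table[state][action] += alpha * td_error
    return q_table[state][action]


# Toy table with 2 states and 2 actions (arbitrary values, for illustration only)
q_table = [[0.0, 0.0],
           [1.0, 2.0]]

new_value = q_update(q_table, state=0, action=1, reward=-1.0,
                     next_state=1, alpha=0.1, gamma=0.6)
print(new_value)
```

With these numbers the target is -1 + 0.6 × 2 = 0.2, so the entry moves from 0 to roughly 0.02: a small step (alpha = 0.1) towards the target.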

We are going to use Q-learning to make our agent perform better than random actions in a given environment.

In Reinforcement Learning, we always need an environment. We can either create one or use an existing one. The easiest way is to use a library that offers pre-defined environments as well as a standardized way of building our own.  
We will use [OpenAI Gym](https://gym.openai.com/), which offers many turnkey environments with complete documentation. We invite you to visit their website and discover what they offer. We will use OpenAI Gym again later!

We will use Python 3.9 for this lab.


Start by installing the required dependencies:

In [None]:
!pip install -r requirements.txt

In [None]:
import gym
import numpy as np
import matplotlib.pyplot as plt
import random
import json
import time
import flappy_bird_gym

## Let's discover OpenAI Gym together!

In [None]:
# Build environment for CartPole-v1
env = gym.make("CartPole-v1")

# Initialize
env.reset()

# And we can already use it!
for _ in range(1000):
  env.render()
  action = env.action_space.sample() # Choose a random action
  observation, reward, done, info = env.step(action)  # Perform the action
  # In exchange, we receive the observations, the rewards, a boolean which indicates if the "game" is finished or not, and debugging information

  if done:
    observation = env.reset()
env.close()

### In case of problems:

If you get an error message when running the code above, we are sorry :(  
The rendering doesn't work on Google Colab, and it may also fail on some machines even with Jupyter Notebook...
If this is your case, you can watch the video below, which shows what you should have seen. Afterwards, remove all the env.render() calls: the code will work, you just won't get the visual rendering.

http://s3-us-west-2.amazonaws.com/rl-gym-doc/cartpole-no-reset.mp4

## Taxi-v3

We are going to use Taxi-v3 for its simplicity (the output is displayed as text, so EVERYONE can run it, even on Google Colab), and also because the simpler the "game" is, the faster our agent will learn, which is handy when the lab session is short...
Anyway...
Now it's your turn to play! Write something close to the previous example to create and visualize the environment.

In [None]:
env = gym.make("Taxi-v3")

# TODO





# Correction:
env.reset()
for _ in range(20):
  env.render()
  action = env.action_space.sample()
  observation, reward, done, info = env.step(action)

  if done:
    observation = env.reset()
env.close()

Take a look at the documentation to understand what you just visualized:

In [None]:
help(gym.envs.toy_text.TaxiEnv)
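
The docstring explains that each of the 500 states is a single integer packing four components: taxi row, taxi column, passenger location and destination. As a rough sketch of how such an index can be packed and unpacked, assuming the 5 × 5 × 5 × 4 layout described there (the helper names `encode_taxi_state` and `decode_taxi_state` are ours, not part of Gym):

```python
def encode_taxi_state(taxi_row, taxi_col, passenger_location, destination):
    """Pack the four components into one index in [0, 500)."""
    index = taxi_row
    index = index * 5 + taxi_col             # 5 columns
    index = index * 5 + passenger_location   # 4 fixed locations + "in the taxi"
    index = index * 4 + destination          # 4 possible destinations
    return index


def decode_taxi_state(index):
    """Inverse of encode_taxi_state: unpack an index into its four components."""
    destination = index % 4
    index //= 4
    passenger_location = index % 5
    index //= 5
    taxi_col = index % 5
    taxi_row = index // 5
    return taxi_row, taxi_col, passenger_location, destination


state = encode_taxi_state(3, 1, 2, 0)
print(state, decode_taxi_state(state))
```

This is why the Q table only needs one row per integer state: the whole situation of the game fits in that single number.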

Let's create and train our agent!

In [None]:
start = time.time()

env = gym.make("Taxi-v3")

# The hyperparameters
alpha = 0.1
gamma = 0.6
epsilon = 0.3
train_episodes = 100000


# Create an empty Q table:
# TODO

# correction
Q = np.zeros((env.observation_space.n, env.action_space.n))



# A list to keep track of the total reward of each episode
training_rewards = []


for episode in range(train_episodes):
    # TODO (perform actions and fill the Q table; store the total reward of each episode in training_rewards to be able to plot it later)
    # Don't forget the exploration vs. exploitation trade-off!


    # correction 
    # Reset the environment
    state = env.reset()

    # Init the reward of this episode
    episode_reward = 0

    done = False
    while not done:
        if random.uniform(0, 1) > epsilon: # EXPLOITATION: select the best action to perform
            action = np.argmax(Q[state,:])

        else: # EXPLORATION: choosing a random action
            action = env.action_space.sample()

        observation, reward, done, info = env.step(action)


        # Finally, update the Q-table and update the total reward
        Q[state, action] = Q[state, action] + alpha*(reward + gamma*np.max(Q[observation, :]) - Q[state, action])
        episode_reward += reward

        state = observation

    # Store the episode reward to plot it later
    training_rewards.append(episode_reward)




    
env.close()

print("Training finished in {:.1f} seconds".format(time.time() - start))

In [None]:
# Visualize the behaviour of the algo
x = range(train_episodes)
plt.plot(x, training_rewards)
plt.xlabel('Episode')
plt.ylabel('Total reward')
plt.title('Total rewards per episode')
plt.show()
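
With 100000 episodes, the raw curve is very noisy. One common way to make the trend readable is to plot a moving average instead. Here is a minimal sketch in plain Python; the helper `moving_average` and the sample values are ours, for illustration only.

```python
def moving_average(values, window):
    """Return the average of every run of `window` consecutive values."""
    if window <= 0:
        raise ValueError("window must be a positive integer")
    averages = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the value that just left the window
            running_sum -= values[i - window]
        if i >= window - 1:
            averages.append(running_sum / window)
    return averages


# Made-up episode rewards, only to show the behaviour
sample_rewards = [-200, -150, -100, -50, 0, 10]
smoothed = moving_average(sample_rewards, window=3)
print(smoothed)
```

On the real data, something like plt.plot(moving_average(training_rewards, 1000)) should give a much smoother curve; the window size is a matter of taste.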

Let's see how it performs:

In [None]:
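# The training cell closed its environment, so we create a fresh one for evaluation
env = gym.make("Taxi-v3")

total_epochs = 0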
episodes = 100

for ep in range(episodes):
    state = env.reset()
    epochs = 0

    done = False
    while not done:
        action = np.argmax(Q[state])
        state, _, done, _ = env.step(action)
        epochs += 1

    total_epochs += epochs

print(f"Results after {episodes} episodes:")
print(f"Average timesteps per episode: {total_epochs / episodes}")

In [None]:
env = gym.make("Taxi-v3")

observation = env.reset()
total_reward = 0
for _ in range(100):
    env.render()

    action = np.argmax(Q[observation,:])
    observation, reward, done, info = env.step(action)
    total_reward += reward

    if done:
        break

env.close()
print(total_reward)

## Flappy bird

### The environment

We will use a custom version of the FlappyBird-v0 environment, based on OpenAI Gym. The original version is available [here](https://github.com/Talendar/flappy-bird-gym). In the custom version, we added a few more parameters to describe the states.
So, first, let's see what the environment looks like!

In [None]:
# Create the flappy bird environment using OpenAI Gym
env = flappy_bird_gym.make("FlappyBird-v0")

# Initialize the environment
obs = env.reset()

# Loop until the game is over
while True:
    # Next action to take:
    action = env.action_space.sample() # Choose a random action (flap or none)

    # Processing:
    obs, reward, done, info = env.step(action)


    # Rendering the game:
    env.render()
    time.sleep(1 / 30)  # FPS

    # Checking if the player is still alive
    if done:
        break

# Close the environment
env.close()

So, what happened there?
Well, since the agent picks its actions at random, most of its decisions are bad and the game ends quickly.
Let's build a real agent that uses Q-learning to play the game instead of taking random actions.

In [None]:
class Flappybot:
    """
    A flappy bird agent
    """

    def __init__(self):
        """
        Generate and initialize the agent
        """
        self.nb_of_iteration = 0

        # alive = 0 => add 1 to the total reward
        # dead = 1 => add -1000 to the total reward
        self.rewards = {0: 1, 1: -1000}

        # Q learning params:
        self.gamma = 1
        self.alpha = 0.1

        # State of the agent
        self.x = 0
        self.y = 0

        # Last action taken
        self.last_action = 0

        # Load stored Q values if they exist
        self.load_qvalues()

        # History of the choices taken during the game
        self.moves = []

    def load_qvalues(self):
        """
        Load q values from a JSON file
        """
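        self.qvalues = {"0_0_0": 0}  # a key is x_y_action
        try:
            # The context manager closes the file even if parsing fails
            with open("./data/qvalues.json", "r") as fil:
                self.qvalues = json.load(fil)
        except IOError:
            # No saved Q values yet: keep the default table
            return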
        
    def update_qvalues(self):
        """
        Update Q values once the game is over
        :return:
        """

        # TODO :
        # For every move the agent made except the last 3, update the q value using the +1 reward (don't forget that states are encoded in the q table using the get_key() method)
        # For the last 3 moves, update q values using -1000 reward
        # (We consider that the last 3 moves were "bad" while the others were "good")
        


        # correction
        # For every move the agent made except the last 3
        for e in reversed(self.moves[:-3]):
            previous_state, new_state, action = e
            # Update the Q value of the state (previous_state, action) using Q learning
            previous_key = self.get_key(previous_state, action)
            # Best Q value reachable from the new state (max over both actions)
            best_next = max(self.qvalues[self.get_key(new_state, 0)], self.qvalues[self.get_key(new_state, 1)])

            self.qvalues[previous_key] = self.qvalues[previous_key] + self.alpha * (
                        self.rewards[0] + self.gamma * best_next - self.qvalues[previous_key])

        # The last 3 moves
        for e in reversed(self.moves[-3:]):
            previous_state, new_state, action = e
            previous_key = self.get_key(previous_state, action)
            best_next = max(self.qvalues[self.get_key(new_state, 0)], self.qvalues[self.get_key(new_state, 1)])

            self.qvalues[previous_key] = self.qvalues[previous_key] + self.alpha * (
                    self.rewards[1] + self.gamma * best_next - self.qvalues[previous_key])

    def get_key(self, previous_state, action):
        """
        Generate a key x_y_action
        :param previous_state:
        :param action:
        :return: x_y_action: str
        """
        return str(previous_state[0]) + '_' + str(previous_state[1]) + '_' + str(action)

    def save_qvalues(self):
        """
        Save the qvalues in the JSON file
        """
        # The context manager closes the file even if writing fails
        with open("./data/qvalues.json", "w") as f:
            json.dump(self.qvalues, f)
    
    def take_action(self, obs):
        """
        Store the last move in self.moves and return the best action to take
        :param obs: observation returned by env.step()
        :return: 0 or 1 : action to take
        """
        self.nb_of_iteration += 1

        x = int(obs[0]*1000) # in the original version, coordinates are small floats, but the q table needs discrete keys
        y = int(obs[1]*1000)

        # save the last move taken
        self.moves.append(((self.x, self.y), (x, y), self.last_action))
        self.x = x
        self.y = y


        # TODO
        # Find the best action to take and return it. Pay attention to states we've never been into.
        # (you should also update the last action)

        # correction
        # Find the best action
        if self.get_key((x,y), 1) not in self.qvalues: # We've never been in this state before, so we just create q values initialized to 0
            self.qvalues[self.get_key((x,y), 1)] = 0
            self.qvalues[self.get_key((x,y), 0)] = 0

        if self.qvalues[self.get_key((x, y), 1)] > self.qvalues[self.get_key((x, y), 0)]:
            self.last_action = 1
        else:
            self.last_action = 0

        return self.last_action

    def show_current_position(self):
        """
        Used to print the agent position during the game
        :return:
        """
        print("X = " + str(self.x) + " | Y = " + str(self.y))

    def print_nb_steps(self, epoch):
        """
        Used to print the number of steps the agent survived during this epoch
        :param epoch: current epoch
        :return:
        """
        print("epoch " + str(epoch) + " -> " + str(self.nb_of_iteration) + " steps")

    def reset_counter(self):
        """
        Reset local variables between each epoch
        :return:
        """
        self.nb_of_iteration = 0
        self.moves = []
        self.x = 0
        self.y = 0
        self.last_action = 0
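
The agent stores its Q values in a dictionary rather than in an array: each continuous observation is first turned into integers, then into a string key x_y_action. The standalone sketch below mirrors that idea outside of the class; the function names and the sample observation are made up for illustration, and only the scaling factor (1000) and the key format come from the class above.

```python
def discretize(obs, scale=1000):
    """Turn the first two (float) components of an observation into integers."""
    # int() truncates towards zero, so -57.1 becomes -57
    return int(obs[0] * scale), int(obs[1] * scale)


def make_key(state, action):
    """Build the dictionary key "x_y_action"."""
    return f"{state[0]}_{state[1]}_{action}"


def greedy_action(qvalues, state):
    """Return the action with the highest Q value, creating unseen entries at 0."""
    for action in (0, 1):
        qvalues.setdefault(make_key(state, action), 0)
    # Flap (1) only if it is strictly better than doing nothing (0)
    if qvalues[make_key(state, 1)] > qvalues[make_key(state, 0)]:
        return 1
    return 0


qvalues = {}
obs = (0.8234, -0.0571)   # made-up observation
state = discretize(obs)
print(state, make_key(state, 1), greedy_action(qvalues, state))
```

Note that the scaling factor decides how many distinct states exist: a larger factor gives a finer but much bigger table, which takes longer to fill.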

Now let's try to train the agent!

In [None]:
TRAINING = True  # set to False to watch the agent play (rendering on, no Q value updates)
EPOCH = 30000
VERBOSE = True

# Create the flappy bird environment using OpenAI Gym
env = flappy_bird_gym.make("FlappyBird-v0")

# Generate the agent
agent = Flappybot()

for epoch in range(EPOCH):
    # Initialize the environment
    obs = env.reset()
    while True:
        # Next action to take:
        action = agent.take_action(obs)

        # Processing:
        obs, reward, done, info = env.step(action)


        if not TRAINING:
            # Rendering the game:
            env.render()
            time.sleep(1 / 30)  # FPS

        # Checking if the player is still alive
        if done:
            if VERBOSE:
                agent.print_nb_steps(epoch)
            if TRAINING:
                agent.update_qvalues()
            agent.reset_counter()
            break

# Close the environment
env.close()

# Save the Q values in the JSON file
agent.save_qvalues()

Finally, to watch our agent play, just set TRAINING to False and run the cell above again.

## To go further

Well, we can use Double Q-Learning to improve our agents, but the best idea is to use Deep Q-Learning, and that's what we are going to see next week!
https://en.wikipedia.org/wiki/Q-learning
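
For the curious, here is a rough sketch of what a tabular Double Q-Learning update could look like. The idea is to keep two tables: one picks the best next action, the other evaluates it, which limits the overestimation introduced by the max. The function name, the `rng` parameter and the toy values are ours; this is an illustration under those assumptions, not the code we used above.

```python
import random


def double_q_update(q_a, q_b, state, action, reward, next_state,
                    alpha, gamma, rng=random):
    """Apply one tabular Double Q-learning update in place."""
    # Pick at random which table is updated; the other one evaluates
    if rng.random() < 0.5:
        updated, evaluator = q_a, q_b
    else:
        updated, evaluator = q_b, q_a

    # The updated table selects the best next action...
    next_values = updated[next_state]
    best_next_action = max(range(len(next_values)), key=next_values.__getitem__)

    # ...and the other table provides the value of that action
    td_target = reward + gamma * evaluator[next_state][best_next_action]
    updated[state][action] += alpha * (td_target - updated[state][action])


# Toy tables with 2 states and 2 actions (arbitrary values)
table_a = [[0.0, 0.0], [1.0, 3.0]]
table_b = [[0.0, 0.0], [2.0, 0.5]]
double_q_update(table_a, table_b, state=0, action=0, reward=1.0,
                next_state=1, alpha=0.5, gamma=0.5)
print(table_a, table_b)
```

When acting, the agent would typically choose its action from the sum (or the average) of the two tables.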