
Implementation of Q-Learning with Python

Q-learning is a popular reinforcement learning algorithm for learning how to act in an environment. It enables an agent to learn optimal actions by iteratively updating its Q-values, which estimate the expected cumulative reward for taking a given action in a given state. Here is a step-by-step implementation of Q-learning using Python:
1. Import the necessary libraries:
import numpy as np
import random
2. Define the environment:
# Define the environment: a 10x10 grid world
env = np.zeros((10, 10))

# Define the rewards: every cell is 0 except the goal cell at (9, 9)
rewards = np.zeros((10, 10))
rewards[9, 9] = 1  # assumed: +1 at the goal so the agent has something to learn

# Define the actions
actions = ['up', 'down', 'left', 'right']

3. Define the Q-table:
# Define the Q-table
q_table = np.zeros([env.shape[0], env.shape[1], len(actions)])
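The resulting table has shape (10, 10, 4): one Q-value for every action in every grid cell. As a quick illustration of how it is indexed (the state (3, 5) below is just an arbitrary example):

# All four Q-values for the state in row 3, column 5
print(q_table[3, 5])

# Q-value of moving 'right' from that state
print(q_table[3, 5, actions.index('right')])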

4. Define the hyperparameters:

# Define the hyperparameters
alpha = 0.1    # learning rate
gamma = 0.6    # discount factor
epsilon = 0.1  # exploration rate
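These values plug into the Q-learning update that the training loop applies at every step: alpha controls how strongly each new estimate overwrites the old one, gamma discounts future rewards, and epsilon is the probability of taking a random exploratory action. As a sketch, the update can be written as a small helper (hypothetical, shown only to make the formula explicit):

def q_update(q_sa, reward, max_next_q, alpha=0.1, gamma=0.6):
    # Blend the old estimate with the reward plus the discounted best next-state value
    return (1 - alpha) * q_sa + alpha * (reward + gamma * max_next_q)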
5. Define the training loop:

# Define the training loop
for episode in range(1, 1001):
    state = [0, 0]
    while state != [9, 9]:
        if random.uniform(0, 1) < epsilon:
            action = random.choice(actions)
        else:
            action = actions[np.argmax(q_table[state[0], state[1]])]

        if action == 'up':
            next_state = [max(state[0] - 1, 0), state[1]]
        elif action == 'down':
            next_state = [min(state[0] + 1, 9), state[1]]
        elif action == 'left':
            next_state = [state[0], max(state[1] - 1, 0)]
        else:
            next_state = [state[0], min(state[1] + 1, 9)]

        # Look up the reward for the move and apply the Q-learning (Bellman) update
        reward = rewards[next_state[0], next_state[1]]
        action_index = actions.index(action)
        old_value = q_table[state[0], state[1], action_index]
        next_max = np.max(q_table[next_state[0], next_state[1]])
        q_table[state[0], state[1], action_index] = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)

        state = next_state
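After training, the learned behavior can be read back out of the Q-table by always taking the highest-valued action. A minimal sketch (not part of the original steps) that rolls out this greedy policy from the start state:

# Follow the greedy policy from (0, 0) towards the goal at (9, 9)
state = [0, 0]
path = [tuple(state)]
while state != [9, 9] and len(path) < 100:  # step cap in case the policy is imperfect
    action = actions[np.argmax(q_table[state[0], state[1]])]
    if action == 'up':
        state = [max(state[0] - 1, 0), state[1]]
    elif action == 'down':
        state = [min(state[0] + 1, 9), state[1]]
    elif action == 'left':
        state = [state[0], max(state[1] - 1, 0)]
    else:
        state = [state[0], min(state[1] + 1, 9)]
    path.append(tuple(state))
print(path)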

Neural Network with Q-learning

Neural networks can be used in Q-learning to approximate the Q-value of each state-action pair. This is known as Deep Q-Learning. Here's how to use a neural network with Q-learning:
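First, a minimal environment setup. The training loop below uses the classic Gym API (env.reset() returning a state vector and env.step() returning a 4-tuple); CartPole-v1 is used here purely as a placeholder assumption, and any environment with a flat observation vector and discrete actions works the same way:

import gym

env = gym.make('CartPole-v1')  # placeholder environment (assumption)
state_size = env.observation_space.shape[0]
action_size = env.action_space.n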
1. Define the neural network architecture:
import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(64, input_shape=(state_size,), activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(action_size, activation='linear')
])

# Compile the model so it can be trained with model.fit() in the loop below
model.compile(optimizer='adam', loss='mse')

2. Define the hyperparameters:
# Define the hyperparameters
alpha = 0.1    # learning rate used to blend old and new Q-value targets
gamma = 0.6    # discount factor
epsilon = 0.1  # exploration rate

3. Define the training loop:
# Define the training loop
for episode in range(1, 1001):
    state = env.reset()
    state = np.reshape(state, [1, state_size])
    done = False
    while not done:
        if np.random.rand() <= epsilon:
            action = random.randrange(action_size)
        else:
            q_values = model.predict(state)
            action = np.argmax(q_values[0])

        next_state, reward, done, _ = env.step(action)
        next_state = np.reshape(next_state, [1, state_size])

        # Bellman target; at a terminal state the target is just the immediate reward
        if done:
            target = reward
        else:
            target = reward + gamma * np.amax(model.predict(next_state)[0])
        q_values = model.predict(state)
        q_values[0][action] = (1 - alpha) * q_values[0][action] + alpha * target

        model.fit(state, q_values, verbose=0)

        state = next_state
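A common refinement, not shown above, is to decay epsilon between episodes so the agent explores heavily at first and then exploits what it has learned; a simple sketch (the decay values are illustrative):

epsilon_min = 0.01
epsilon_decay = 0.995

# At the end of each episode:
if epsilon > epsilon_min:
    epsilon *= epsilon_decay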
4. Evaluate the model:
# Evaluate the model
scores = []
for episode in range(100):
    state = env.reset()
    state = np.reshape(state, [1, state_size])
    done = False
    score = 0
    while not done:
        q_values = model.predict(state)
        action = np.argmax(q_values[0])
        next_state, reward, done, _ = env.step(action)
        next_state = np.reshape(next_state, [1, state_size])
        score += reward
        state = next_state
    scores.append(score)
print(np.mean(scores))

In this implementation, we use a neural network with two hidden layers of 64 neurons each and a linear output layer with the same number of neurons as the number of actions. The neural network is trained using the Q-learning algorithm, which updates the Q-values of each state-action pair based on the Bellman equation. The hyperparameters alpha, gamma, and epsilon control the learning rate, discount factor, and exploration rate, respectively.
