Reinforcement Learning Algorithms
In this assignment you will implement and experiment with two fundamental reinforcement learning algorithms: value iteration and policy iteration.
Group Work
This assignment requires programming in Python. If you feel you need help with the programming work, you may form a group and submit a joint solution. Each group member will receive the same grade. Every group should have at most five members.
Setup
You will need to do development in Python as in Assignment 2. You can use the environment of your choice. If you are looking for a good environment, we recommend Google Colab. It offers the following advantages:
- Cloud-based, which facilitates collaboration.
- Common packages are pre-installed.
- Ample computing resources for this project (e.g., code can be run on GPUs).
If you decide to use Google Colab, here are the steps:
- Create a Python notebook in Google Colab.
- Click on "Edit", then "Notebook settings", and select "None" (CPU), "GPU", or "TPU" for hardware acceleration.
Reward depends only on current state and action.
To simplify the data structures, we have adopted a version of an MDP where the reward function depends only on the current state and action (R(s,a) rather than R(s,a,s') as in the book). For the corresponding Bellman equation, see the slides.
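For reference, with rewards of the form R(s,a) the Bellman optimality equation takes the standard form below, where γ is the discount factor and T(s,a,s') is the transition probability (the slides may use slightly different notation):

$$V^*(s) \;=\; \max_{a}\Big[\,R(s,a) \;+\; \gamma \sum_{s'} T(s,a,s')\,V^*(s')\,\Big]$$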
Assignment Instructions
Program and compare value iteration and policy iteration.
- Fill in the functions in the skeleton code of the file MDP.py; rough sketches of both algorithms are given after this list for reference. (50 points)
- Test your code on the simple MDP based on this diagram.
- Run "python TestMDP.py" to make sure your code complies.
- Add print statements to check that the output of each function is what you would expect.
- Apply your code to the problem described in the maze Python file. Report the following:
- The policy, value function and number of iterations needed by value iteration when using a tolerance of 0.01 and starting from a value function set to 0 for all states. (10 points)
- The policy, value function and number of iterations needed by policy iteration to find an optimal policy when starting from the policy that chooses action 0 in all states. (10 points)
- Do the results of value iteration and policy iteration agree with each other? That is, are the policies and value functions the same? If not, why do you think there is a difference? (5 points)
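Since the MDP.py skeleton is not reproduced here, the sketch below shows one way value iteration could look under the R(s,a) convention. The names T, R, discount, and the function signature are illustrative assumptions and may not match the skeleton's actual interface.

```python
# Minimal sketch of value iteration with rewards R(s, a).
# T, R, and the signature are illustrative assumptions, not the MDP.py interface.
import numpy as np

def value_iteration(T, R, discount, initial_V, tolerance=0.01):
    """T: |A| x |S| x |S| transition probabilities, R: |A| x |S| rewards,
    initial_V: length-|S| initial value function.
    Returns the value function, a greedy policy, and the iteration count."""
    V = initial_V.copy()
    n_iterations = 0
    while True:
        n_iterations += 1
        # Q[a, s] = R[a, s] + discount * sum_{s'} T[a, s, s'] * V[s']
        Q = R + discount * (T @ V)
        new_V = Q.max(axis=0)
        done = np.max(np.abs(new_V - V)) <= tolerance
        V = new_V
        if done:
            break
    policy = Q.argmax(axis=0)
    return V, policy, n_iterations

# Tiny 2-state, 2-action MDP, made up purely to exercise the function.
T = np.array([[[0.5, 0.5], [0.0, 1.0]],    # T[a, s, s']
              [[1.0, 0.0], [0.5, 0.5]]])
R = np.array([[0.0, 1.0],                   # R[a, s]
              [1.0, 0.0]])
V, policy, n = value_iteration(T, R, discount=0.9, initial_V=np.zeros(2))
print("V =", V, "policy =", policy, "iterations =", n)
```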
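Similarly, one possible shape for policy iteration (exact policy evaluation via a linear solve, followed by greedy improvement) is sketched below, again with assumed names rather than the real skeleton's interface. Starting from the all-zeros policy corresponds to the setting requested above.

```python
# Minimal sketch of policy iteration with rewards R(s, a).
# T, R, and the signature are illustrative assumptions, not the MDP.py interface.
import numpy as np

def policy_iteration(T, R, discount, initial_policy):
    """T: |A| x |S| x |S| transitions, R: |A| x |S| rewards,
    initial_policy: length-|S| integer array of action indices.
    Returns the final policy, its value function, and the iteration count."""
    n_states = T.shape[1]
    policy = initial_policy.copy()
    n_iterations = 0
    while True:
        n_iterations += 1
        # Policy evaluation: solve (I - discount * T_pi) V = R_pi exactly.
        T_pi = T[policy, np.arange(n_states), :]   # |S| x |S|
        R_pi = R[policy, np.arange(n_states)]      # |S|
        V = np.linalg.solve(np.eye(n_states) - discount * T_pi, R_pi)
        # Policy improvement: act greedily with respect to V.
        Q = R + discount * (T @ V)                 # |A| x |S|
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, V, n_iterations

# Example call, starting from action 0 in every state (as in the assignment),
# using the same toy T and R as in the value-iteration sketch above:
# policy, V, n = policy_iteration(T, R, discount=0.9,
#                                 initial_policy=np.zeros(2, dtype=int))
```

Note that in both sketches the greedy step breaks ties by choosing the lowest-index action, so the two algorithms can return different but equally optimal policies when several actions have the same value; this is worth keeping in mind when answering the comparison question above.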
Submission
- Submit your code (as a .py file).
- Submit a report.