Oddly enough, the AI learns to walk toward the enemy AI when the enemy is frozen, so the code seems to be working. But when the enemy AI is allowed to move, the learning agent can’t learn, and the reward graph looks random. Maybe this is because the environment is stochastic in a sense. I’ve heard that DQN is not able to learn stochastic policies.
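For what it's worth, here is how I understand the "DQN can't learn stochastic policies" claim, as a minimal sketch. The Q-values are made-up numbers for a single state and the function names are just mine for illustration; none of this comes from my actual project or from any library. Acting greedily on Q-values always picks the same action in a given state, while a policy that samples from a softmax can mix actions.

```python
import math
import random

def greedy_action(q_values):
    """Pick the action with the highest Q-value (greedy, deterministic)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

def softmax_action(logits, rng):
    """Sample an action from a softmax distribution (a stochastic policy)."""
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

# Hypothetical Q-values for one state with three actions (assumed numbers).
q_values = [0.2, 0.9, 0.5]
rng = random.Random(0)

# Collect the distinct actions each rule picks over many calls in the same state.
greedy_picks = {greedy_action(q_values) for _ in range(1000)}
sampled_picks = {softmax_action(q_values, rng) for _ in range(1000)}

print("greedy picks:", greedy_picks)
print("sampled picks:", sampled_picks)
```

If I have this right, the greedy rule only ever returns one action for a given state, and epsilon-greedy just adds exploration noise on top of it. That is a statement about the policy being deterministic, which is not the same thing as the environment being stochastic, so I'm not sure it actually explains my problem.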
I should try your library; I just haven’t gotten around to it yet. It looks a bit more complicated than the other library I tried, so I thought it might take some time to learn. If you can link some tutorials and code examples, that would help greatly.