How do I design a reward function that rewards an AI for getting closer to an enemy AI?

I don’t know if this is the right place to put this, but basically I need help designing a reward function that will encourage two sword-fighting AIs to walk towards each other. This is for my deep reinforcement learning sword-fighting AI project. The problem sounds simple on paper, but it becomes more complex once you realize that the enemy is a moving target as well, so I am a bit lost on how to design a reward function that rewards, say, agent A for walking towards agent B. A simple distance-based function doesn’t seem to work, because agent A could stand still and do nothing while agent B walks towards it, rewarding agent A for doing nothing. There’s also the question of whether agent A should still be rewarded if it moved some number of steps towards agent B but, in that same timeframe, agent B moved away from it.
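To make the failure mode concrete, the naive distance-based shaping I mean is something like this (a simplified sketch, reusing the same variable names as the code further down):

-- OldDistance was measured on the previous decision step, NewDistance on this one.
local NewDistance = (SelfHRP.Position - EnemyHRP.Position).Magnitude

-- Positive whenever the gap shrank since the last step...
local Reward = OldDistance - NewDistance

-- ...but the gap also shrinks when agent B walks towards a completely idle
-- agent A, so A gets paid for standing still and doing nothing.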

For more context, I have a gif showing the environment the AIs are trained in: https://gyazo.com/0e6ba8cd39039c60bb9450b05c92323e

There’s also a weird glitch where the AIs somehow fling themselves super far at the start of the round. Not sure how this happens. Might be because of AlignOrientation.

Reward function right now:

if AgentMoved then
	local Vel = Self.HumanoidRootPart.AssemblyLinearVelocity
				
	if Vel.Magnitude > Threshold then
		LastIdleTime = os.clock()
					
		if NewDistance < OldDistance then
			Reward = 1 / NewDistance
		elseif NewDistance > OldDistance then
			Reward = (DefaultParams.MovingBackPunishmentFactor * (NewDistance^2 / 1572.77186483) - 0.1)
		end
		--Reward = -0.00001 * NewDistance
	else
		if LastIdleTime == nil or os.clock() - LastIdleTime >= 2 then
			Reward = -0.1
		end
	end
end
			
local CurRotDif = (Target - CurRot)^2
			
if CurRotDif > OldRotDif then
	Reward += -0.000001 * (CurRotDif)
else
	Reward += 0.5 / (1 + CurRotDif)
end
			
if NewHealth ~= OldHealth or NewEnemyHealth ~= OldEnemyHealth then
	local HealthLostReward = -(1 - (NewHealth / Self.Humanoid.MaxHealth)^2)
	local DamagedReward = 1 - (NewEnemyHealth / Enemy.Humanoid.MaxHealth)^2

	Reward = HealthLostReward + DamagedReward
end
			
local Terminate = false

if Self.Humanoid.Health <= 0 and Enemy.Humanoid.Health > 0 then
	Reward = -1
	Terminate = true
elseif Self.Humanoid.Health > 0 and Enemy.Humanoid.Health <= 0 then
	Reward = 1
	Terminate = true
end

State function:

local function CalculateInputs()
	local Inputs = {}

	assert(Self ~= Enemy,'Cannot set self to enemy!')

	local SelfHRP = Self.HumanoidRootPart
	local EnemyHRP = Enemy.HumanoidRootPart

	table.insert(Inputs,(SelfHRP.Position.X - EnemyHRP.Position.X) / 29.15)
	table.insert(Inputs,(SelfHRP.Position.Y - EnemyHRP.Position.Y) / 7.5)
	table.insert(Inputs,(SelfHRP.Position.Z - EnemyHRP.Position.Z) / 29.15)

	local RotY = math.atan2(SelfHRP.CFrame.LookVector.X,SelfHRP.CFrame.LookVector.Z)

	RotY = math.deg(RotY) % 360

	local EnemyRotY = math.atan2(EnemyHRP.CFrame.LookVector.X,EnemyHRP.CFrame.LookVector.Z)

	EnemyRotY = math.deg(EnemyRotY) % 360

	local StandardizedDif = (RotY - EnemyRotY) / 208.15

	table.insert(Inputs,StandardizedDif)

	local SelfAttacking = 0

	if script.Parent.ClassicSword.Handle.SwordSlash.Playing or script.Parent.ClassicSword.Handle.SwordLunge.Playing then
		SelfAttacking = 1
	end

	table.insert(Inputs,SelfAttacking)

	local EnemyAttacking = 0

	local FoundSword = Enemy:FindFirstChildWhichIsA('Tool')

	if FoundSword then
		if string.match(string.lower(FoundSword.Name),'sword') ~= nil then
			if FoundSword.Handle.SwordSlash.Playing or FoundSword.Handle.SwordLunge.Playing then
				EnemyAttacking = 1
			end
		end
	end

	table.insert(Inputs,EnemyAttacking)

	local NormalizedEnemyXVel = EnemyHRP.AssemblyLinearVelocity.X

	NormalizedEnemyXVel /= 20.5

	table.insert(Inputs,NormalizedEnemyXVel)

	local NormalizedEnemyYVel = EnemyHRP.AssemblyLinearVelocity.Y

	NormalizedEnemyYVel /= 53.15

	table.insert(Inputs,NormalizedEnemyYVel)

	local NormalizedEnemyZVel = EnemyHRP.AssemblyLinearVelocity.Z

	NormalizedEnemyZVel /= 22.1

	table.insert(Inputs,NormalizedEnemyZVel)

	local NormalizedEnemyRotYVelocity = EnemyHRP.AssemblyAngularVelocity.Y

	NormalizedEnemyRotYVelocity /= 84.75482177734375

	table.insert(Inputs,NormalizedEnemyRotYVelocity)

	local NormalizedHP = Self.Humanoid.Health / Self.Humanoid.MaxHealth
	local NormalizedEnemyHP = Enemy.Humanoid.Health / Enemy.Humanoid.MaxHealth

	table.insert(Inputs,NormalizedHP)
	table.insert(Inputs,NormalizedEnemyHP)
			
	local IsJumping = 0
			
	if Self.Humanoid.Jump then
		IsJumping = 1
	end
			
	local EnemyJumping = 0
			
	if Enemy.Humanoid.Jump then
		EnemyJumping = 1
	end
			
	table.insert(Inputs,IsJumping)
	table.insert(Inputs,EnemyJumping)

	return Inputs
end

I have also looked at this implementation: Roblox Sword Fighting AI (Q-Learning) - YouTube, and I am considering using the same type of reward structure they used, which is essentially a step function that takes the distance to the enemy as its input:

Reward = -1

if Magnitude > 30 then
   Reward -= 1000
elseif Magnitude > 15 then
   Reward -= (Magnitude - 15)^2
end

My primary question about this method, though, is: why does this work? Doesn’t it suffer from the same issue I described above (the agent gets rewarded for doing nothing and simply waiting for the enemy to come towards it, which does not encourage it to learn how to move towards the enemy)? After all, it intuitively seems easier for the agent to learn to stand still than to walk towards the enemy.

idk, make them associate better damage and upgrades with rewards

This is vanilla sword-fighting. Also, did you read the title? I am asking how to reward them so that they are encouraged to walk towards each other. I have the actual damage part figured out already.

I have made a reinforcement learning algorithm in Roblox before. What I did was use weighted actions for certain states. Let’s say this occurs:

  • AI 1 moves towards AI 2

The state I made for my AI was the difference in position and orientation between the two, relative to itself (the AI). At the end of the test, the AIs would be rewarded based on the distance between them at first, but later on based on the damage they dealt to each other.

However, this method isn’t very efficient. I use a few action modules, which are simply modules that tell the character what to do (move forward, left, right, etc.). After each action, I can adjust its weight if it helped the AI move closer to the enemy. The weight of an action determines how likely a random value from 0 to 1 is to pick that action: if the action decreased the distance, you add to the weight of the action that caused it.

However, we’re missing something else: states. Reinforcement learning uses states to determine where the agent is, what it’s doing, and what it should do. In my case, the state is the position of AI 1 relative to AI 2, snapped to a grid of about 2-3 studs (a continuous decimal value would take too long to learn), plus the rotation relative to the other AI, rounded to 45 degrees. What I used is a table of states, and each state holds its own table of action weights, so depending on the state, the same action will have a different weight. For example, if the AI is to the left of the enemy, its “move left” action would have a lower weight than “move right”, since the AI wants to move towards the enemy. In another scenario, AI 1 might have lower health than AI 2, so AI 1 runs away.
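A rough sketch of what I mean, with made-up names and numbers:

local Actions = {'Forward', 'Back', 'Left', 'Right', 'Attack', 'Idle'}
local States = {} -- [stateKey] = {[action] = weight}

-- Snap relative position to a ~3-stud grid and rotation to 45-degree steps.
local function GetStateKey(relPos, relRotY)
	return math.floor(relPos.X / 3) .. ',' .. math.floor(relPos.Z / 3) .. ',' .. math.floor(relRotY / 45)
end

local function GetWeights(stateKey)
	if States[stateKey] == nil then
		local weights = {}
		for _, action in ipairs(Actions) do
			weights[action] = 1 -- every action starts equally likely
		end
		States[stateKey] = weights
	end
	return States[stateKey]
end

-- Pick an action with probability proportional to its weight.
local function PickWeighted(weights)
	local total = 0
	for _, w in pairs(weights) do
		total += w
	end
	local roll = math.random() * total
	local last
	for action, w in pairs(weights) do
		last = action
		roll -= w
		if roll <= 0 then
			return action
		end
	end
	return last -- floating-point fallback
end

-- After acting, nudge the chosen action's weight up if the distance shrank, down otherwise.
local function UpdateWeight(stateKey, action, oldDistance, newDistance)
	local weights = GetWeights(stateKey)
	if newDistance < oldDistance then
		weights[action] += 0.25
	else
		weights[action] = math.max(0.1, weights[action] - 0.25)
	end
end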

Additionally, to further improve the AI, you could add state prediction. This adds more data to the states table: the AI predicts the next state, and there is a separate reward for getting the prediction right. In my AI, the prediction is right about 55% of the time. I’m not sure of a use case for state prediction for you yet.

Have a magnitude scoring system: the closer AI 1 is to AI 2, the higher the score.
Also, give the AIs extra points for hitting the target.

Then, once the simulation has ended, take the best AIs, copy them and make slight modifications to them, and repeat.

The repeating process could go on for hours, depending on how good you want the AI to be.
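Very roughly, the copy-and-modify loop I’m describing would look something like this (all names and numbers are made up):

-- Each "brain" is just a flat array of numbers that gets scored each round
-- (closeness + hits); keep the best ones and fill the rest with mutated copies.
local function Mutate(brain, rate)
	local copy = {}
	for i, weight in ipairs(brain) do
		copy[i] = weight + (math.random() - 0.5) * rate -- slight random modification
	end
	return copy
end

local function NextGeneration(population, scores)
	-- Rank brains by their score, best first.
	local order = {}
	for i = 1, #population do
		order[i] = i
	end
	table.sort(order, function(a, b)
		return scores[a] > scores[b]
	end)

	local survivors = math.max(1, math.floor(#population / 4))
	local nextGen = {}
	for i = 1, #population do
		local parent = population[order[((i - 1) % survivors) + 1]]
		if i <= survivors then
			nextGen[i] = parent -- keep the best as-is
		else
			nextGen[i] = Mutate(parent, 0.1) -- copy with slight modifications
		end
	end
	return nextGen
end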

That’s my simple understanding of how AI training works.
There’s a YouTuber called Code Bullet who messes around with AI learning a lot.
He goes over the basics of how it works if you want to look at that, but if I remember correctly it’s pretty much just what I’ve said.

That’s not the same thing as the algorithm I’m using. I am using the DQN algorithm, specifically a D3QN. Also, a simple magnitude scoring system does not work effectively. Like I said in my original post, the AI can simply get rewarded for doing nothing and walking into a wall if the other AI gets closer to it.

Could you expand on what you mean by state prediction? Also, I’m using the DQN algorithm, so I’m not really sure how I would make it use a probability distribution for the policy besides SoftMax. Are you maybe talking about distributional DQN? I haven’t researched it much, though, so all I know is that it tries to predict the distribution of Q-values rather than a single Q-value for each state. The issue for me is how to tell whether AI 1 is moving closer to AI 2. There are a bunch of problems with simple magnitude checks, because AI 1 will get rewarded for standing still and doing nothing if AI 2 got closer. Even when I added an if statement to check whether the agent moved or not, I couldn’t really tell if there was any significant improvement. Here is my reward function right now:

if AgentMoved then
	local Vel = Self.HumanoidRootPart.AssemblyLinearVelocity
				
	if Vel.Magnitude > Threshold then
		LastIdleTime = os.clock()
					
		if NewDistance < OldDistance then
			Reward = 1 / NewDistance
		elseif NewDistance > OldDistance then
			Reward = (DefaultParams.MovingBackPunishmentFactor * (NewDistance^2 / 1572.77186483) - 0.1)
		end
		--Reward = -0.00001 * NewDistance
	else
		if LastIdleTime == nil or os.clock() - LastIdleTime >= 2 then
			Reward = -0.1
		end
	end
end
			
local CurRotDif = (Target - CurRot)^2
			
if CurRotDif > OldRotDif then
	Reward += -0.000001 * (CurRotDif)
else
	Reward += 0.5 / (1 + CurRotDif)
end
			
if NewHealth ~= OldHealth or NewEnemyHealth ~= OldEnemyHealth then
	local HealthLostReward = -(1 - (NewHealth / Self.Humanoid.MaxHealth)^2)
	local DamagedReward = 1 - (NewEnemyHealth / Enemy.Humanoid.MaxHealth)^2

	Reward = HealthLostReward + DamagedReward
end
			
local Terminate = false

if Self.Humanoid.Health <= 0 and Enemy.Humanoid.Health > 0 then
	Reward = -1
	Terminate = true
elseif Self.Humanoid.Health > 0 and Enemy.Humanoid.Health <= 0 then
	Reward = 1
	Terminate = true
end

AgentMoved is a boolean variable for whether or not the agent took an action like Humanoid:MoveTo().

I am not familiar with the DQN algorithm, but this seems like a relatively simple problem to solve. You can use Vector3.FuzzyEq to determine whether they are moving, but I would recommend subtracting AI 2’s velocity. The issue here is that you are not using a probability distribution, so I don’t know if the action will be randomly generated with your method; if not, the AI won’t learn. However, maybe I don’t understand D3QN very well.

The difference is that the algorithm I’m using predicts the state-action values (Q-values) for each state; basically, it predicts the quality of each action. I am then using a Boltzmann exploration policy, which creates a probability distribution over those Q-values with a temperature parameter controlling the exploration vs. exploitation tradeoff. Each probability is the chance of choosing that action. Since the algorithm is trained to estimate better Q-values, these probabilities will eventually reflect a trained policy. Also, can you tell me what I am subtracting AI 2’s velocity from? AI 2’s position? How would that help?
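(For clarity, the Boltzmann part is just a temperature-scaled SoftMax over the predicted Q-values, roughly like this simplified sketch:)

-- Turn predicted Q-values into action probabilities with a temperature parameter.
-- High temperature -> nearly uniform (explore more); low temperature -> nearly greedy (exploit more).
local function BoltzmannProbabilities(QValues, Temperature)
	local maxQ = -math.huge
	for _, q in ipairs(QValues) do
		maxQ = math.max(maxQ, q) -- subtract the max below for numerical stability
	end

	local exps = {}
	local sum = 0
	for i, q in ipairs(QValues) do
		exps[i] = math.exp((q - maxQ) / Temperature)
		sum += exps[i]
	end
	for i = 1, #exps do
		exps[i] /= sum
	end
	return exps
end

-- Sample an action index from that distribution.
local function SampleAction(Probabilities)
	local roll = math.random()
	for i, p in ipairs(Probabilities) do
		roll -= p
		if roll <= 0 then
			return i
		end
	end
	return #Probabilities -- floating-point fallback
end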

I’m not sure. I would add another state, but then again that will take too long, unless the AI is capable of interpolating between states (mine cannot, I’m not sure how to implement that). For my AI, doing nothing is also an action. What I did is I positioned the AIs randomly each time so that they would need to do different actions each time. AI 2 might walk forward and that might lower the reward, so AI 1 will do something different to advance towards AI 2.

Wait, so was yours using a deep reinforcement learning algorithm (neural networks) or just a regular reinforcement learning algorithm? Also, I put the function I use to calculate the state in the original post. I considered making doing nothing (a no-op) an action, but I wasn’t sure if it would significantly improve agent performance. I should be positioning the AIs randomly to hopefully increase robustness, but that was a pain to implement in the original version of my simulation (which I don’t use anymore). What’s the extra state you’re talking about? Could you expand on what new information I would be giving the agent?

It was regular reinforcement learning.

You might want to make a whole new probability distribution table for the state that is most likely to happen because of an action. Then, you can predict whether that will improve the score or not; if not, you could have a second choice of action. I haven’t fully implemented this into my system, but I’ve heard that it seems to work. If AI 1 is doing nothing and thinks of moving forward, the predicted state would have a higher score, since it gets closer to AI 2. AI 1 would then move forward, its prediction would be correct (or close to it), the score would increase some more, the probability distribution would automatically adjust and normalize, and then both AIs advance.

This doesn’t necessarily work for me because I am using neural networks, but I have basically incorporated the same thing. The way I have it set up is that the agent is rewarded for moving, and if the new position it ends up in is closer to the enemy, it is given a positive reward. This essentially incorporates some sort of prediction like you said.
I have found someone’s implementation: Roblox Sword Fighting AI (Q-Learning) - YouTube
It uses a step function for the reward and a regular reinforcement learning algorithm (Q-learning), and I am experimenting with their reward function right now, but it doesn’t look like it has great results.
Experimental reward function:

local ScaleFactor = 5000

Reward = -1 / ScaleFactor

if NewDistance > 20 then
	Reward -= 1000 / ScaleFactor
elseif NewDistance > 10 then
	Reward -= (NewDistance - 15)^2 / ScaleFactor
end

if NewHealth ~= OldHealth or NewEnemyHealth ~= OldEnemyHealth then
	local HealthLostReward = -(1 - (NewHealth / Self.Humanoid.MaxHealth)^2)
	local DamagedReward = 1 - (NewEnemyHealth / Enemy.Humanoid.MaxHealth)^2

	Reward = HealthLostReward + DamagedReward
end
			
local Terminate = false

if Self.Humanoid.Health <= 0 and Enemy.Humanoid.Health > 0 then
	Reward = -1
	Terminate = true
elseif Self.Humanoid.Health > 0 and Enemy.Humanoid.Health <= 0 then
	Reward = 1
	Terminate = true
end

This doesn’t really solve the issue of the AI being rewarded for doing nothing though. Or is this even an issue in the first place? I’m not really sure.
Reward graph:

Loss graph:

Here the loss is the average error between the predicted Q-value and the target Q-value after each round.
I’ve noticed that the loss never seems to go below ~5. It has small spikes, and the improvement is slowing down. Might it be stuck in some sort of local optimum? Should I consider using a different optimizer? I don’t think a loss of 5.06 is good, because it shows the network is unable to predict the correct Q-values for states, which makes it choose suboptimal actions. I would consider a loss of <0.2 good.
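(For reference, by target Q-value I mean the usual one-step TD target. Per transition the error looks roughly like this, with the double-DQN twist of picking the next action with the online network and evaluating it with the target network; all the names here are illustrative, not my actual code.)

-- One-step TD error for a single transition (s, Action, Reward, s2).
-- OnlineQ_s, OnlineQ_s2 and TargetQ_s2 stand for arrays of Q-values predicted
-- by the online and target networks.
local function TDError(Reward, Gamma, Terminal, OnlineQ_s, OnlineQ_s2, TargetQ_s2, Action)
	local target = Reward
	if not Terminal then
		-- Double DQN: pick the next action with the online network...
		local bestNext = 1
		for i, q in ipairs(OnlineQ_s2) do
			if q > OnlineQ_s2[bestNext] then
				bestNext = i
			end
		end
		-- ...but evaluate that action with the target network.
		target += Gamma * TargetQ_s2[bestNext]
	end
	-- Averaging the squared (or absolute) value of this over the round gives a loss curve like the one above.
	return target - OnlineQ_s[Action]
end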

Alright. Now, I think a better idea could be put in place, where the AIs are actually given a negative reward if they don’t do any damage or movement at all (more of a control). Though, this would need a certain threshold, since doing nothing is a valid action in combat situations.

I think I’ll try that, thanks. A high enough gamma should make the agent look past the negative reward for standing still in exchange for a better position in the long term. My main question now, though, is why the loss is stuck at 5. Is the problem really reward-function related? I’m not really sure at this point. If you want, I could send the rbxl file and maybe you might be able to investigate?
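(Coming back to the gamma point above, a quick sanity check with made-up numbers:)

-- Discounted return of "take a -0.1 penalty for 5 steps, then reach the +1 win reward".
local Gamma = 0.99
local Return = 0
for t = 0, 4 do
	Return += Gamma ^ t * -0.1 -- short-term penalty each step
end
Return += Gamma ^ 5 * 1 -- delayed payoff
print(Return) -- about -0.49 + 0.95 = 0.46, so the delayed reward still wins out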

I don’t think it is. Is it possible the AIs are both conflicting with each other’s responses?

Possibly making the reward negative when there is no reward would fix this, but there might be another cause as well.

Not sure what you mean by conflicting with each other’s responses, but I don’t think that can happen. What’s weird is that even when I froze the enemy agents, the learning agents were unable to learn how to walk towards the enemy. I might try again with the new reward function, but I don’t have much hope.