Think of it as Metal Gear Solid V's setup: multiple outpost bases spread around a map, with AI reinforcements coming in from other bases to support nearby ones where the player is spotted trying to infiltrate or demolish. Sort of like one giant hive with separate houses. Each outpost distributes its troops equally and only sends out a few footmen per infiltration detected around it. Sorry if it sounds confusing, I'm not very good at explaining things.
I have a couple of ideas for how agents could be used here: maybe one main agent that manages all of the outposts at the same time, or several agents in a MARL environment that each control their own base and communicate values when they need help. But whenever I think through any of these approaches my mind goes blank, because I can't work out what inputs would be required or what reward functions I'd need.
If this sort of concept isn't possible, I totally understand. I also have another idea: training an agent to work alongside other soldiers to form a sort of artificial squad/brotherhood of 3-5 agents. They would stick together and deal with threats using their own set of policies; essentially groups of foot soldiers that can communicate and handle player threats in unexpected and smart ways using what they learn and evolve with through training.
Okay let’s try this then.
Settings for the agent that controls the strategies for all outposts:

- Use a single AdvantageActorCritic, PPO or PPO-Clip model. (Do not use any other models.)
- The Actor will use Stable Softmax for the last layer. You will also need to set returnOriginalOutput to true in ReinforcementLearningQuickSetup.
- Each of the Actor's outputs will represent a single outpost.
- Since we're using Stable Softmax, it will output probabilities between 0 and 1. We multiply the total number of troops from all outposts by each probability, which gives the number of troops that should stay in that outpost (see the sketch after this list).
- Once the number of troops has been determined, use Roblox's pathfinding to reach the outpost.
- For the reward, maybe punish based on the number of troops lost and reward based on total damage done to the enemy. Also punish if an outpost is destroyed, or on any attempt to send troops to a destroyed outpost.
- For the input, you might want to reserve some input slots that indicate whether each outpost is destroyed or not, so the agent can learn not to send troops to destroyed outposts.
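A minimal Luau sketch of the allocation and reward ideas above; allocateTroops, computeOutpostReward, the isDestroyed flag and the penalty weights are all placeholder assumptions, not part of any particular library:

```lua
-- Hypothetical sketch: turn the actor's per-outpost probabilities into troop counts.
-- `probabilities` is assumed to be the Stable Softmax output, one value per outpost.
local function allocateTroops(probabilities, totalTroops, outposts)
	local allocation = {}
	for index, probability in ipairs(probabilities) do
		if outposts[index].isDestroyed then
			allocation[index] = 0 -- never assign troops to a destroyed outpost
		else
			-- Round down so we never allocate more troops than we actually have.
			allocation[index] = math.floor(totalTroops * probability)
		end
	end
	return allocation
end

-- Example reward shaping following the bullet points above (the weights are arbitrary).
local function computeOutpostReward(troopsLost, damageDealt, outpostDestroyed, sentToDestroyedOutpost)
	local reward = damageDealt - troopsLost
	if outpostDestroyed then
		reward = reward - 50
	end
	if sentToDestroyedOutpost then
		reward = reward - 10
	end
	return reward
end
```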
Settings for agents controlling individual troops:

- Use any models that you see fit. Make sure the model parameters are shared.
- Once a troop reaches the outpost using Roblox pathfinding, use the usual reinforcement learning technique, unless it is attacked midway.
- For the reward, I think you can work this out easily.
Excellent! Thank you so much, I'll start by training the agent to manage individual troops. One thing though: I'm not very sure what inputs I can give the agent controlling the troops, since the number of soldiers is dynamic and won't be the same all the time, especially when they die (e.g. in order to teach them to stick together in squads, ambush together, or build a foundation so that later in training they can learn expert combat techniques like automatic dynamic ambushes or cutting off choke points).
And what type of model would you advise me to use for them? I'm thinking of experimenting with PPO, since I read that it makes small policy updates that refine correct actions rather than over-exploring wrong ones, and can learn more complex behaviour later on. So far I have tried A2C, but it takes a while for them to learn compared to DoubleQLearning.
For the agent controlling troops, you might want to do this (see the sketch after this list):

- Calculate the inverse distance to each of the other troops, e.g.:
inverse = 1 / distance
- Reserve one neuron that holds the total health of all troops, weighted inversely proportionally to distance. Do not include the controlled troop itself; it has its own input for that.
scaledTotalHealth = (troop1HealthPercentage * troop1InverseDistance) + (troop2HealthPercentage * troop2InverseDistance) + ...
- Reserve one neuron that handles the number of troops, doing something similar:
scaledTotalNumberOfTroops = troop1InverseDistance + troop2InverseDistance + ...
-- Remember each troop counts as 1, so we just sum the inverse distances.
- Add other inputs you can think of, such as the troop's own current health.
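A minimal Luau sketch of those squad inputs, assuming each troop model exposes a HumanoidRootPart and a Humanoid; the 1-stud minimum distance used to avoid dividing by zero is an added assumption:

```lua
-- Hypothetical sketch of the inverse-distance squad inputs described above.
-- `selfTroop` is the troop this agent controls; `otherTroops` are its squadmates.
local function computeSquadInputs(selfTroop, otherTroops)
	local scaledTotalHealth = 0
	local scaledTotalNumberOfTroops = 0

	local selfPosition = selfTroop.HumanoidRootPart.Position

	for _, troop in ipairs(otherTroops) do
		local distance = (troop.HumanoidRootPart.Position - selfPosition).Magnitude
		local inverseDistance = 1 / math.max(distance, 1) -- clamp to avoid dividing by zero

		local healthPercentage = troop.Humanoid.Health / troop.Humanoid.MaxHealth

		scaledTotalHealth += healthPercentage * inverseDistance
		scaledTotalNumberOfTroops += inverseDistance -- each troop counts as 1
	end

	-- The controlled troop's own health gets its own, separate input.
	local selfHealthPercentage = selfTroop.Humanoid.Health / selfTroop.Humanoid.MaxHealth

	return { scaledTotalHealth, scaledTotalNumberOfTroops, selfHealthPercentage }
end
```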
Try PPO-Clip, since the outposts are stationary.
For the main agent controlling outpost strategies or the ones controlling individual troops?
PPO-Clip for controlling the outposts.
The troops can use any model. If you feel like Q-learning is efficient, then go ahead.
What do you think would be a great model to use if I wanted them to eventually learn productive tactics like retreating while laying down suppressing fire, or splitting up to pin a player down, etc.?
Double Expected SARSA (State-Action-Reward-State-Action) comes to mind if you do not want the troops to be too overconfident in their decision making.
Could you explain the inverse distance part? I don't understand what you mean by that. Would the inverse distance be the distance from the troop being controlled by the agent to each of the other troops in the squad, or…?
This one. I’m pretty sure the longer the distance, the less “effect” it would have.
Ohh, okay. Also, just curious: why would inverse distance be factored into scaledTotalHealth? Wouldn't that make the agent learn the wrong thing?
Or is it there to represent the total manpower the squad has, so if a troop decides to wander off it will lessen, and I can penalise the agent for that?
That is one of them. The other reason is, well, common sense… If you have a troop that is very far from the others, I'm pretty sure the others wouldn't:

- Sacrifice themselves slightly so that particular troop doesn't take too much damage and die, and so the damage is spread out and can regenerate.

For the numerical reasons:

- If we used the regular distance instead of its inverse, the numbers would be too huge to handle, and I think Roblox supports 32-bit numerical values…
- A neural network isn't going to train effectively with values that large…
I wanted to try A2C for the troop agents.
I get a gradient explosion/nil label value; is there something wrong with this setup?
Above is the function that generates the Actor, below is the function that generates the Critic.
EDIT: it appears my inverse distance was becoming inf because the NPCs were standing inside each other, so I was essentially dividing 1 by 0, which evaluates to inf.
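A minimal guard for that case, assuming the inverse distance is computed the same way as in the earlier squad-input sketch (the 1-stud minimum distance is an arbitrary assumption):

```lua
-- Clamp the distance so troops standing inside each other don't produce 1 / 0 = inf.
local MINIMUM_DISTANCE = 1 -- studs; arbitrary lower bound

local function safeInverseDistance(distance)
	return 1 / math.max(distance, MINIMUM_DISTANCE)
end
```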
Bro, you need to move the addLayer(), setClassesList() and setModelParametersInitializationMode() calls outside the if statement.
Use generateLayers() inside your if statement instead.
Oh, what does generateLayers() do? Does it reload layers from saved parameters, or…?
Just generates new model parameters. Do note that it will reset the current model parameters if you have any stored inside it though.
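A rough sketch of the structure being described, assuming a model object that exposes the functions named above; the constructor, layer sizes, classes list and initialization-mode string are placeholder assumptions, not the library's actual signatures:

```lua
-- Rough structural sketch only: NeuralNetwork.new(), the layer sizes, the
-- classes list and the initialization mode string are placeholder assumptions.
local function buildModel(savedModelParameters)
	local model = NeuralNetwork.new() -- placeholder constructor

	-- These calls run every time, whether or not saved parameters exist.
	model:addLayer(6, true)  -- example sizes only
	model:addLayer(4, false)
	model:setClassesList({1, 2, 3, 4})
	model:setModelParametersInitializationMode("LeCunUniform") -- mode name is an assumption

	if savedModelParameters == nil then
		model:generateLayers() -- only generate fresh parameters when nothing is saved
	else
		-- Otherwise, load savedModelParameters into the model however your setup already does it.
	end

	return model
end
```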
Had a bit of fun with the individual agents and managed to teach them to form formations while sticking together!
Well, after having some fun with them, I took a look at some real-life implementations of AI, specifically Tesla's Autopilot, and got a bit inspired. I understand what inputs I can give the car to stay on the road, but what would I ideally tell it when I want it to drive on the road while actively heading towards a destination (e.g. taking turns that lead towards the destination while avoiding turns that lead away or are unnecessary)? I couldn't really find any resources on this, but I did program an A* algorithm; although not perfect, it can pathfind for the vehicle. What I wanted to know was your opinion on what inputs I'd use to train the model to drive to its destination while staying on the road and making the right turns, and what model you'd recommend if I wanted it to learn to drive naturally over the course of its training (e.g. detecting other cars, lane-switching, braking, etc.).
I think this time raw output would be great, since the steering and throttle values can be set from 0 to 1 / -1 to 1, so I essentially wouldn't need any labels or classes lists.
I did come across a couple of great courses on self-driving vehicles running on neural networks, but most of them just showed the vehicle sticking to a single monotonous path instead of taking turns and driving to a destination across multiple separate roads.
First off, I want to mention that you should try to avoid using raw output values when they are not needed. When you do that, the performance will actually be worse. You can quickly look at the conclusion for the proof here:
I recommend you do this for the output:

- Reserve 2 output neurons that adjust the steering by a specific amount to the left or right (i.e. +0.1 and -0.1) instead of setting the full value.
- Reserve 2 output neurons that adjust the throttle value in the same way.
- Make sure to keep track of the total amount of change applied to the throttle and steering values, because these will be used as inputs for the model.

Now to answer your question about the inputs (see the sketch after this list):

- Use something similar to the sword-fighting AI that marks the target location (i.e. the amount of rotation towards the target and the distance from the target).
- A few raycasts that shoot downwards, some of them at an angle. Make sure each raycast has its own input neuron. If a raycast detects a road part, the value will be 1, otherwise it will be 0. You can also use this to reward the model, depending on how many raycasts hit road parts.
- Reserve 2 input neurons for the total throttle and steering change. These act as a "memory", so the model knows how much throttle and steering it is currently applying.
- Reserve some input neurons for raycasts shooting out of the sides, front and back of the car. Use inverse distance.
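A minimal Luau sketch of that input layout, assuming road parts are tagged "Road", an arbitrary side-ray length, and a caller-supplied RaycastParams that excludes the car itself; all of those details are assumptions for illustration:

```lua
-- Hypothetical sketch of the driving inputs described above.
local MAX_SIDE_RAY_LENGTH = 50 -- studs; arbitrary

local function castRoadRay(origin, direction, raycastParams)
	-- Returns 1 if the downward ray hits a road part, 0 otherwise.
	local result = workspace:Raycast(origin, direction, raycastParams)
	if result and result.Instance:HasTag("Road") then -- tagging road parts is an assumption
		return 1
	end
	return 0
end

local function castObstacleRay(origin, unitDirection, raycastParams)
	-- Returns the inverse distance to whatever the ray hits, or 0 if nothing is hit.
	local result = workspace:Raycast(origin, unitDirection * MAX_SIDE_RAY_LENGTH, raycastParams)
	if result then
		return 1 / math.max((result.Position - origin).Magnitude, 1)
	end
	return 0
end

local function buildDrivingInputs(car, targetPosition, totalThrottleChange, totalSteeringChange, raycastParams)
	local rootCFrame = car.PrimaryPart.CFrame
	local rootPosition = rootCFrame.Position
	local forward = rootCFrame.LookVector
	local toTarget = targetPosition - rootPosition

	-- Signed rotation towards the target in the horizontal plane, plus its distance.
	local angleToTarget = math.atan(toTarget.X, -toTarget.Z) - math.atan(forward.X, -forward.Z)
	angleToTarget = (angleToTarget + math.pi) % (2 * math.pi) - math.pi -- wrap into [-pi, pi]

	local inputs = {
		angleToTarget,
		1 / math.max(toTarget.Magnitude, 1), -- inverse distance keeps the value small
		totalThrottleChange, -- the "memory" of what is currently being applied
		totalSteeringChange,
	}

	-- Downward raycasts: one straight down, two angled, each with its own input neuron.
	for _, localDirection in ipairs({
		Vector3.new(0, -10, 0),
		Vector3.new(5, -10, 0),
		Vector3.new(-5, -10, 0),
	}) do
		table.insert(inputs, castRoadRay(rootPosition, rootCFrame:VectorToWorldSpace(localDirection), raycastParams))
	end

	-- Obstacle raycasts out of the front, back and sides, fed in as inverse distances.
	for _, direction in ipairs({ forward, -forward, rootCFrame.RightVector, -rootCFrame.RightVector }) do
		table.insert(inputs, castObstacleRay(rootPosition, direction, raycastParams))
	end

	return inputs
end
```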
For the model, I guess any could do. But if you want more reckless driving, stick with Q-Learning. Otherwise, stick with Expected SARSA or regular SARSA and its variants. The above input and output configuration should work really well with these models. (Don't ask me about the mathematics behind it; you would need at least a month to understand why that is.)
Also, can you do me a favour? Once you're done with your stuff, can you give me a 3-minute video of it? I don't want to see any code though. I would like to use it for marketing purposes. I will also credit your work. It's fine if you don't want to.
PS: Can I also see a short video of the formation as well? I’m curious now. 0_o
Thanks for the feedback! Hmm, for the 3-minute video I'll see if I can get a couple of good shots of the AI in action and compile them into a little video, if I can even manage to train it properly.
And for the formations…
The AI is a bit slow at doing them since I only trained them for 15-30 minutes, but they still manage to form two basic shapes: a full-coverage SFL and a square thing that I believe are their attempts at sticking together.