DataPredict [Release 1.21] - General Purpose Machine Learning And Deep Learning Library (Learning AIs, Generative AIs, and more!)

Please explain what that is. I searched for “table of observation for large language model” on Google and it returned nothing related to it.

I want to know the mathematics behind it, why it was created that way and so on.

Sure. If you ask a large language model to take on the role of a character, it will do better if it knows who and where the character is and what is surrounding it. For example, with this you can provide the previous response as context, or you can use something like GPT to expand on the context of the environment.

-- HttpService is needed for the JSON encoding and the HTTP request
-- (HTTP requests must be enabled in the game settings).
local HttpService = game:GetService("HttpService")

-- Bearerkey is assumed to be defined elsewhere as your Hugging Face API token.

function conversationaldialogue(str, context, responses, model)

	local API_URL

	-- Pick one of the two models at random if none was specified.
	if (model == nil) then
		model = math.random(1, 2)
	end

	if (model == 1) then
		API_URL = "https://api-inference.huggingface.co/models/microsoft/GODEL-v1_1-base-seq2seq"
	else
		API_URL = "https://api-inference.huggingface.co/models/facebook/blenderbot-400M-distill"
	end

	-- Define the headers for the request
	local headers = {
		["Authorization"] = Bearerkey,
		["Content-Type"] = "application/json"
	}

	-- Append the latest user input to the conversation context
	table.insert(context, str)

	-- Define the payload for the request
	local payload = {
		inputs = {
			past_user_inputs = context,
			generated_responses = responses
		}
	}

	-- Encode the payload as a JSON string
	local payloadJSON = HttpService:JSONEncode(payload)

	-- Send the request and get the response
	local success, response = pcall(function()
		return HttpService:RequestAsync({
			Url = API_URL,
			Method = "POST",
			Headers = headers,
			Body = payloadJSON
		})
	end)

	-- Check if the request was successful
	if success then
		-- Decode the response body as a JSON table
		local responseJSON = HttpService:JSONDecode(response.Body)

		-- Return the generated text if the model produced one
		if responseJSON.generated_text then
			return responseJSON.generated_text
		else
			-- Surface the raw response for debugging
			warn(response.Body)
			return nil
		end
	else
		warn("Request failed: " .. tostring(response))
		return nil
	end

end
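
For example (assuming HTTP requests are enabled for the game and Bearerkey holds your Hugging Face token), a call could look something like this; the strings here are just illustrative:

local context = {} -- previous player inputs
local responses = {"I am Magus Art Studios."} -- previous generated replies / character line

local reply = conversationaldialogue("What should you do?", context, responses)

if reply then
	print(reply)
	table.insert(responses, reply) -- keep the conversation history growing
end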

Also, this module is part of my multi-model AI system that uses old-school database searching.
So if the algorithm saw an enemy that is in the Bestiary, it would follow the observation with a description of the enemy from the databases.


First, what is this?

All you’re basically doing is sending text to a Hugging Face large language model. That’s not the mathematics.

Also, where’s the research paper on this “table of observation” on Google Scholar? I’m pretty sure it can explain more than you do. I’m very interested in this term you made up.


You seem to not be grasping what this is. You must not have much experience working with LLMs, or you are just purposefully being apprehensive and dense.
For example, you input this into the LLM:
local payload = {
	inputs = {
		past_user_inputs = {"What should you do?"},
		generated_responses = {"I am Magus Art Studios,"}
	}
}


You may think this data was constructed by an LLM, but it was not; this is the input data to the LLM that gives it character context.
The LLM has this personality and knows that this is its surroundings. In this example the environment is very empty, but the algorithm would construct a full observation if there were objects located in those directories. Then my chatmodule algorithm connects the observations with a string from the personality database, and the output can be input into a large language model, which will love to roleplay as that character.
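
To make that concrete, here is a stripped-down sketch of the idea; the helper name, object fields, and strings below are illustrative assumptions, not the actual chatmodule code:

-- Illustrative only: builds a "table of observations"-style prompt from a list
-- of nearby objects plus a personality string.
local function buildObservationPrompt(nearbyObjects, personality)

	local observations = {}

	-- One short observation string per nearby object.
	for _, object in ipairs(nearbyObjects) do
		table.insert(observations, "There is a " .. object.Name .. " nearby.")
	end

	-- Combine the individual observations into a full-scene paragraph and
	-- prepend the character's personality line.
	return personality .. " " .. table.concat(observations, " ")

end

print(buildObservationPrompt({ {Name = "Tree"}, {Name = "Rock"} }, "I am Magus Art Studios."))
-- I am Magus Art Studios. There is a Tree nearby. There is a Rock nearby.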

I will reiterate this.

You are not explaining the mathematics or giving me research papers related to your version of “large language model” that I want to look into, because hey, if you know it, then it must be popular, right?

You are blatantly giving an “explanation” based on code: code which sends text to a Hugging Face LLM server and brings back the results. That’s not even an explanation.

Also, you haven’t explained the mathematics behind the “table of observation” related to the LLM you have described. What is it? I want to know it. And explain it. If you can’t explain it, send me a research paper.

You’re a funny guy, gaslighting. I think you’re adorable. Here, let me put it in words you cannot dwell on: a “table of observations” is a constructed database of strings created by my algorithm that gives individual observations of the closest objects and a full observation of the entire scene. These smaller strings are useful, but the main chunk is the paragraph entry.
I don’t need to provide you with anything. You are very rude and demeaning, and I do not appreciate your demeanor in this matter; you are being condescending.
You can exist in your world of exact research papers and not be a free-thinking individual with a subjective thought to evaluate what you see in front of you. Not to mention, this module I graciously shared with the community is related to Text-Vision, as in a text-based vision module.

Finally. Took you long enough. So basically it’s just your own definition that isn’t known to the LLM community.

Why did it take that long to actually pull out that explanation instead of giving me random code?

Because it’s not hard to understand; you misunderstood and are asking foolish questions, trying to be negative. It’s not hard to visualize what you can do with an LLM if it knows what character it is roleplaying as and what its surroundings look like, and if you were to ask it to create a commentary based on those observations and character, then it would. Do you comprende that, compadre?

Is DQN with experience replay in the stable version now? Just to be sure, your DQN includes the target network improvement as well, right?

Yes. Use the :setExperienceReplay() function. Experience replay is disabled by default to avoid eating resources.

Also for your second question, yes. It will automatically improve it.
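
As a rough sketch (the constructor line here is just a placeholder; see the API documentation for the exact call), usage would look like this:

-- Assumed constructor; check the DataPredict API page for the real one.
local DQN = DataPredict.Models.QLearningNeuralNetwork.new()

-- Experience replay is disabled by default; this call enables it.
DQN:setExperienceReplay()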

Okay, cool. It would be great if you could add extensions to DQN like Double DQN, Dueling DQN, and prioritized experience replay. Maybe even Rainbow DQN, as shown in this paper: arxiv.org/pdf/1710.02298.pdf, but that might be too much work.

Double DQN is relatively easy to implement though; it just requires changing the Q_Target when the next state is not terminal to:
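
In the standard Double DQN formulation, that target is (in plain pseudocode):

Target = Reward + Gamma * Q(s', argmax a' Q(s', a', theta), theta')

where theta is the online/policy network that selects the action and theta' is the target network that evaluates it.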

I’ll just leave it here.


Rainbow DQN? Very interesting name.

That being said, I don’t think I can add more stuff to it since I will be focusing on my work.

Thanks for your ideas though. I’d probably release it in the Release 1.3 version.

For now, people should enjoy the stability of the library instead of having to adapt to newer ones. So maybe I’ll release it in a month (if I can still remember to do that).


Also, I made the code readable for you guys to modify to your own needs. So feel free to play around with it. For example, perhaps new optimizers or very interesting neural network variations.

Plus, my custom matrix library integrated into this DataPredict library should easily help you meet those needs.

Be sure to check the “API design” part.


Hi, I looked at the code for the QLearningNeuralNetwork module and I didn’t see any hyperparameter for the target network update frequency. Are you sure you included the target network improvement?

function QLearningNeuralNetworkModel:update(previousFeatureVector, action, rewardValue, currentFeatureVector)

	if (self.ModelParameters == nil) then self:generateLayers() end

	-- The max Q value for the next state comes from the same network being trained.
	local predictedValue, maxQValue = self:predict(currentFeatureVector)

	local target = rewardValue + (self.discountFactor * maxQValue[1][1])

	-- The target vector starts as the network's current predictions for the previous state...
	local targetVector = self:predict(previousFeatureVector, true)

	local actionIndex = table.find(self.ClassesList, action)

	-- ...and only the chosen action's entry is replaced with the computed target.
	targetVector[1][actionIndex] = target

	self:train(previousFeatureVector, targetVector)

end

You seem to be using the same neural network that predicts the Q values to update itself. Also, you seem to be missing the “IsTerminalState” value stored for each experience with the following logic:

if IsTerminalState then
   Target = Reward
else
   Target = Reward + Gamma * max a' Q(s', a', theta')
end

RL — DQN Deep Q-network. Can computers play video games like a… | by Jonathan Hui | Medium

Divergence in Deep Q-Learning: Tips and Tricks | Aman (amanhussain.com)

Some interesting graphs I found online:

[images not included]

I’m not too sure what you mean by the first question, so I’ll skip that for now and explain the rest. You might need to further elaborate on that. It is probably covered in the API here?

For the terminal state, it was removed. The reason is that I expect users to continuously call the :reinforce() function without any limit. Using a terminal state assumes that the model will be trained within a certain period of time. However, this is not true if we apply it to real-life scenarios, as the model needs to continuously learn for an infinite amount of time, and hence we may never reach a terminal state.

Also, while it is true that the model uses the same neural network, you need to understand that a different input gives a different output for the same model parameter weights. I use the prediction for the current feature vector as the training target for the previous feature vector. This updates the weights based on the previous feature vector and the predicted future vector as inputs. (Also refer to the first link under references in the API.)

I think you should’ve kept it. I don’t think this library is going to be used in real-life scenarios and in a game-development context I think most tasks are episodic, not non-episodic. You could just include support for both and have the user specify whether the task is episodic or not.

The target network is an improvement to DQN which helps stabilize its performance. DQN is prone to divergence, as noted by the deadly triad issue (Why is there a Deadly Triad issue and how to handle it ? | by Quentin Delfosse | Medium), where combining TD-learning / bootstrapping, off-policy learning, and function approximation leads to instability and divergence. It works by creating a separate neural network, called the target network, which is a copy of the policy network. The target network, instead of the policy network, is used for calculating the target Q value when training the DQN. The target network is not trained, however. Instead, every x steps the weights from the policy network are copied to the target network. This helps reduce divergence and increases stability by creating a stationary target which is periodically updated.

Essentially, you can think of the target network as a “frozen” target which changes periodically.
Edit: This article: Why is there a Deadly Triad issue and how to handle it ? | by Quentin Delfosse | Medium is much better at explaining the deadly triad issue than the one I linked before.
Also this one: Bootcamp Summer 2020 Week 7 – DQN And The Deadly Triad, Or, Why DQN Shouldn’t Work But Still Does (gatech.edu)
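
In code terms, the mechanism is just a periodic weight copy. A minimal sketch (the getModelParameters/setModelParameters helper names here are assumptions, not necessarily the library's actual API):

local UPDATE_FREQUENCY = 100 -- copy the weights every x steps (a hyperparameter)

local stepCount = 0

local function onTrainingStep(policyNetwork, targetNetwork)

	stepCount = stepCount + 1

	-- The target network itself is never trained. Every UPDATE_FREQUENCY steps,
	-- the policy network's weights are copied into it, so the Q targets are
	-- computed against a (mostly) stationary copy.
	if (stepCount % UPDATE_FREQUENCY) == 0 then

		targetNetwork:setModelParameters(policyNetwork:getModelParameters())

	end

end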


Hmm, when I refer to real life, I also mean game development though. For your games it might be episodic; for my games, it is mostly non-episodic. Also, I want to cover a large range of use cases that includes both episodic and non-episodic scenarios, so it seems that the removal of the terminal state would still satisfy both use cases. If I had introduced the terminal state, it would remove the non-episodic use case.

That being said, the terminal state is just a minor detail anyway. I doubt it changes too much if I don’t have the piece of code Target = Reward.

In addition, if I were to implement that system, I predict I would need to make some changes to the code that would make it more inflexible for programmers who wish to use this library instantly out of the box.

Also, thanks for the other articles by the way. I will have a read of them, but I will always prioritize performance over correctness since Roblox puts a limit on how much hardware we can use.

The target network isn’t that computationally expensive and it’s pretty easy to implement. Nor is the terminal state flag added to each experience.

The removal of the terminal state is an issue in my opinion, because from what I’ve observed in the Q-Learning algorithm, it seems to propagate the reward/punishment in a terminal state backwards so that over time it knows which actions to choose that lead it to the terminal state of higher reward.

I don’t think it would be that inflexible; just add a parameter for episodic or non-episodic. Or even better, make it so that when update is called, if the terminal state flag exists it assumes episodic, and if not it assumes non-episodic.

Also, I think the way you implemented DQN is a bit strange. From what I remember of Python DQN tutorials (although I only have basic knowledge of Python), they have separate functions for training, predicting, and setting up how long the DQN will train for. I didn’t really understand much, so if you want to check it out yourself, here is a link to something I found online: stable-baselines3/stable_baselines3/dqn/dqn.py at master · DLR-RM/stable-baselines3 · GitHub

Setting the target to the reward in terminal states also improves accuracy. How, you ask? Well, if we calculate the Q_Target in a terminal state as usual (without Target = Reward), we’re assuming that the state we’re in is not a terminal state, so we use bootstrapping to estimate the expected discounted cumulative reward. This makes no sense, because it assumes that there are possible states and actions the agent can take after the terminal state, and therefore it isn’t accurate. Instead of using a likely inaccurate estimate of the Q-value for the terminal state, why not just use the true Q-value of the terminal state instead (the reward)?
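
To show what I mean, here is a minimal sketch based on the :update() code quoted earlier. The extra isTerminalState parameter is hypothetical; when it is nil or false (the non-episodic case), the behaviour is exactly the same as before:

function QLearningNeuralNetworkModel:update(previousFeatureVector, action, rewardValue, currentFeatureVector, isTerminalState)

	if (self.ModelParameters == nil) then self:generateLayers() end

	local target

	if isTerminalState then

		-- Terminal state: the true return is just the reward, so no bootstrapping.
		target = rewardValue

	else

		-- Non-terminal (or non-episodic) case: bootstrap from the next state's max Q value.
		local _, maxQValue = self:predict(currentFeatureVector)

		target = rewardValue + (self.discountFactor * maxQValue[1][1])

	end

	local targetVector = self:predict(previousFeatureVector, true)

	local actionIndex = table.find(self.ClassesList, action)

	targetVector[1][actionIndex] = target

	self:train(previousFeatureVector, targetVector)

end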

I don’t know if I’m the only one, but the number of ModuleScript links confused me. So the package version always contains the most up-to-date library, and the ModuleScripts above that contain different versions of the library? So only two modules are required?

  • Package version, option above that, or unstable version
  • Matrix library

Also, I tried the introduction code and it gave me an error. Edit: I managed to fix it. I forgot an extra {} around the predicted vector.
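
For anyone else hitting the same error: the library works on matrices, so a single feature vector is expected as a 1-row matrix (a table of tables). Illustrative values:

local featureVector = {{1, 0, 0.5}}   -- correct: one row, three columns
-- local featureVector = {1, 0, 0.5}  -- errors: missing the outer {}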