[Question] Which of the MAMuJoCo environments are even "solvable"? #141
Comments
Hey, sorry, I am in the process of validating the environments (which is why they are not included yet in a release).
1) Here is a single run of Hopper: (plot attached)
2) I have verified that …
3) Based on the way you are asking about the observation categories, there seems to be some confusion about the supported observation types; I will update the documentation to fix that.
4) I have not read the paper you provided, but giving each agent full observability makes the factorization pointless (it is like solving the single-agent task with extra steps).
Thank you so much for the speedy response. This is helpful, and I look forward to the outcomes of your investigation. Just to be clear, in the plot above the Hopper environment was configured as shown below. If you do not mind, I would also appreciate it if you could clarify what …
domain:
name: Hopper
factorization: 3x1 # agent factorization used, check MaMuJoCo Doc for more info
obsk: 1 # check MaMuJoCo Doc for more info
total_timesteps: 2_000_000 # how many learning steps the agent should take
#episodes: 1000
algo: TD3 # Valid values: 'DDPG', 'TD3', 'MADDPG'
init_learn_timestep: 25001 # at which timestep should the agent start learning
#learning_starts_ep: 10 # Start Learning at episode X, before that fill the ERB with random actions
evaluation_frequency: 5000 # how often should the agent be evaluated
runs: 10 # number of statistical runs
seed: 64 # seeds the environment
DDPG:
gamma: 0.99 # Reward Discount rate
tau: 0.01 # Target Network Update rate
N: 100 # Experience Replay Buffer's mini-batch size
experience_replay_buffer_size: 1000000
sigma: 0.1 # standard deviation of the action process for exploration
optimizer_gamma: 0.001 # the learning rate of the optimizers
mu_bias: True # Bias for the actor module
q_bias: True # Bias for the critic module
TD3:
gamma: 0.99 # Reward Discount rate
tau: 0.005 # Target Network Update rate
N: 256 # Experience Replay Buffer's mini-batch size
experience_replay_buffer_size: 1000000
sigma_policy: 0.2 # Standard deviation of the action process for policy update
sigma_explore: 0.1 # Standard deviation of the action process for exploration
optimizer_gamma: 0.001 # The learning rate of the optimizers
noise_policy_clip: 0.5 # Clamping for the target noise
d: 2 # Policy Update Frequency
mu_bias: True # Bias for the actor module
q_bias: True # Bias for the critics module
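For context, here is a minimal sketch of how the domain settings above (name, factorization, obsk, seed) might map onto constructing the environment. The module name (mamujoco_v0) and keyword names (scenario, agent_conf, agent_obsk) are assumptions based on the pre-release MaMuJoCo API and may differ in your version:

# Hypothetical construction matching the config above (Hopper, factorization 3x1, obsk 1).
# NOTE: module and keyword names are assumptions, not confirmed in this thread.
from gymnasium_robotics import mamujoco_v0

env = mamujoco_v0.parallel_env(
    scenario="Hopper",   # domain name
    agent_conf="3x1",    # factorization
    agent_obsk=1,        # obsk
)
observations, infos = env.reset(seed=64)  # seed taken from the config above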
I noticed in your MATD3 implementation that you use the environment state in the critic instead of the joint observation. Do you think that the environments should be solvable given the joint observation but not the environment state? Or is it by design that an algorithm should incorporate the environment state information in order to succeed? One problem I have with this is that it limits the degree to which one could use this environment for evaluating independent learners, because there will always be missing information for the independent learners. I would be really interested to know what you think.
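To make the distinction concrete, here is a rough sketch (in PyTorch; all names are hypothetical and this is not the repository's actual implementation) of the two centralized-critic variants being contrasted: one conditioned on the environment state, the other on the concatenated per-agent observations (the joint observation):

import torch
import torch.nn as nn

class CentralCritic(nn.Module):
    """Q(x, joint_action) -> scalar, where x is either the environment state
    or the concatenation of all agents' local observations."""
    def __init__(self, input_dim: int, joint_action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim + joint_action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, joint_action):
        return self.net(torch.cat([x, joint_action], dim=-1))

# Variant discussed in the thread: critic conditioned on the environment state.
#   critic = CentralCritic(state_dim, joint_action_dim)
# Alternative raised by the commenter: critic conditioned on the joint observation,
#   i.e. x = torch.cat([obs_agent_0, obs_agent_1, ...], dim=-1)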
Thanks!
Thanks for the detailed response. I think your first point speaks to what I wanted to verify, namely that the intended design is that all the relevant information in the environment's state (i.e. the underlying single-agent MuJoCo state) is also contained in the joint observation of the MAMuJoCo environment (possibly duplicated a couple of times across the agents' observations).
@jcformanek if you want to try decentralized training methods, I would recommend starting with HalfCheetah, since it does not terminate (and therefore you do not have to assign blame for causing a terminal state).
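For illustration, a short random-action rollout on the 2-agent HalfCheetah factorization, sketching the point about termination. The module name and keyword names are assumed from the pre-release MaMuJoCo docs and may differ; the rollout loop follows the standard PettingZoo parallel API:

from gymnasium_robotics import mamujoco_v0  # assumed pre-release module name

# "2x3" splits HalfCheetah's six actuated joints into two agents of three.
env = mamujoco_v0.parallel_env("HalfCheetah", "2x3", agent_obsk=1)
observations, infos = env.reset(seed=0)

for _ in range(1000):
    actions = {agent: env.action_space(agent).sample() for agent in env.agents}
    observations, rewards, terminations, truncations, infos = env.step(actions)
    # HalfCheetah has no failure condition, so terminations should stay False;
    # episodes end only via truncation (the time limit).
    if any(terminations.values()) or any(truncations.values()):
        observations, infos = env.reset()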
Question
TL;DR: do you have baselines for performance on the environments using some popular MARL algorithm, say MADDPG or others?
Hi there, first of all, thanks for maintaining MAMuJoCo. I have been experimenting with it for a few weeks now but am struggling to "solve" several of the scenarios using MATD3 / MADDPG. I was wondering if you have any baselines for the environments, i.e., have you demonstrated that they can be "solved" using some MARL algorithm? By "solved" I just mean reaching some non-trivial return. In particular, my algorithm quickly learns to get a score of around 800 and 1000 on Ant and HalfCheetah respectively, but fails to break out of that local optimum until I added qvel,qpos to the global_categories. After adding qvel,qpos I now get scores of roughly 3000 and 6000 respectively. I originally tried this because I suspected there was some important information missing from the agent observations: I had reduced the problem to a single-agent task on the joint observation and joint action, and my TD3 implementation could not solve it.

I am now struggling to "solve" 2-agent Walker and 3-agent Hopper. I tried adding more values to the global_categories (qvel,qpos,cinert,cvel,qfrc_actuator,cfrc_ext), but my algorithm seems stuck around the ~500 and ~200 return mark. Because of my experience with Ant and HalfCheetah, I fear there is some important information omitted from the joint observation, making the tasks impossible to solve.

To hopefully rule this out and narrow the problem down to a bug in my implementation, I was hoping you had some kind of baseline for performance on these environments. I tried to refer to the results reported in other papers using MAMuJoCo, but they all seem to use non-default settings for the scenarios, which in some cases make the environment no longer a decentralised, partially observed multi-agent environment. For example, this paper gives each of the agents access to the state of the environment as their observation. I would like to avoid this and only give agents access to their partial local observations. However, I feel that if a single-agent RL algorithm can't solve the task on the joint observation, then it's unrealistic to expect a MARL algorithm to succeed. What do you think?
I look forward to hearing from you.
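In case it helps others reading this, a hedged sketch of the global_categories change described above, assuming the environment constructor accepts a global_categories argument as referenced in this thread (the module name, factorization string, and keyword names are assumptions and may differ):

from gymnasium_robotics import mamujoco_v0  # assumed module name

# Default observations vs. the variant with qpos/qvel added to the
# global (shared) observation categories, as described above.
env_default = mamujoco_v0.parallel_env("Ant", "2x4", agent_obsk=1)
env_global = mamujoco_v0.parallel_env(
    "Ant", "2x4", agent_obsk=1,
    global_categories=("qpos", "qvel"),  # extra shared information for every agent
)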