Multi-agent Deep Deterministic Policy Gradient (MADDPG) for available/busy agents

Hello there,

I'm working on MADDPG and I've run into trouble with the critic's input. In my scenario, at each time step only a subset of the agents can select a task, while the other agents are busy with a previously selected task.

Does this mean the critic will receive a joint action of a different size at every time step? If so, how should I handle that?

Moreover, as far as I can tell, the critic network receives the full actor output of each agent (the softmax vector containing the probabilities of all the agent's possible actions). So when an agent is busy, I don't know what the critic should receive for it: only the single action it is already executing? :frowning:
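
To make the question concrete, here is a minimal sketch of one possible way to keep the critic's action input at a fixed size. It is only an assumption for illustration, not an established MADDPG recipe: available agents contribute their softmax vector, busy agents contribute a one-hot vector for the task they are already executing, and an availability mask is appended. The names `N_AGENTS`, `N_TASKS`, `build_joint_action`, `available`, and `current_task` are hypothetical.

```python
import math

N_AGENTS = 3  # hypothetical number of agents
N_TASKS = 4   # hypothetical number of tasks (actions) per agent


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(index, size):
    """One-hot vector of length `size` with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def build_joint_action(logits_per_agent, available, current_task):
    """Build a fixed-length action input for the critic.

    Assumption: a busy agent is represented by a one-hot vector of the
    task it is already executing, so every agent always contributes
    exactly N_TASKS entries regardless of its status.
    """
    joint = []
    for i in range(N_AGENTS):
        if available[i]:
            # Available agent: full softmax output of its actor.
            joint.extend(softmax(logits_per_agent[i]))
        else:
            # Busy agent: fixed one-hot of its ongoing task.
            joint.extend(one_hot(current_task[i], N_TASKS))
    # Availability mask so the critic can tell the two cases apart.
    mask = [1.0 if a else 0.0 for a in available]
    return joint + mask


# Example: agent 1 is busy with task 2, agents 0 and 2 are choosing.
logits = [[0.1, 0.2, 0.3, 0.4], [1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]]
available = [True, False, True]
current_task = [None, 2, None]

critic_action_input = build_joint_action(logits, available, current_task)
print(len(critic_action_input))  # N_AGENTS * N_TASKS + N_AGENTS
```

With this layout the joint action always has `N_AGENTS * N_TASKS + N_AGENTS` entries, so the critic's input size would not change between time steps. Whether this is a sound way to treat busy agents is exactly what I'm unsure about.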