JMS timeouts can occur when communication between the server and its agent(s) stops. A heartbeat signal is sent at a regular interval (roughly every 15 to 30 seconds) to determine whether the agent is still online and responsive. If no response is received, the agent is marked offline.
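The heartbeat check described above can be pictured with a minimal sketch. This is not AnthillPro's actual implementation; the class name HeartbeatMonitor, its methods, and the 30-second window used in the usage note are hypothetical and only illustrate the idea of marking an agent offline when no reply arrives in time.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: remembers when each agent last answered a heartbeat. */
class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Record that an agent answered a heartbeat at the given time. */
    void recordReply(String agentId, long nowMillis) {
        lastSeen.put(agentId, nowMillis);
    }

    /** An agent counts as online only if it replied within the timeout window. */
    boolean isOnline(String agentId, long nowMillis) {
        Long seen = lastSeen.get(agentId);
        return seen != null && nowMillis - seen <= timeoutMillis;
    }
}
```

With a 30000 ms window, an agent that last replied at time 0 is still considered online at 10000 ms but is treated as offline at 40000 ms.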
ERROR - com.urbancode.devilfish.services.jms.ReplyTimeoutException: Timeout expired: 10000ms
ERROR - Timeout expired: 10000ms
Some scenarios where the "Timeout Expired" message can appear:
If an agent becomes overloaded, communication can be interrupted and the agent may be temporarily unable to reach the server. The AnthillPro server then treats the agent as unresponsive and fails the step, because the agent does not have enough free resources to send a response. Typically the agent is already busy with other work when it is asked to run the step. It proceeds to run the step, but is under so much load that it can no longer report the status of its command back to the server. The server therefore times out the step, even if the agent later completes the work and tries to return the logs and results to the server (usually indicated by a 403 or 401 error in the agent logs).
TBA
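The overload scenario comes down to a caller that waits a fixed time for a reply and gives up if none arrives. The sketch below illustrates that pattern with the standard java.util.concurrent API. It is not the com.urbancode.devilfish code; ReplyTimeoutDemo and awaitReply are hypothetical names, and the returned strings are placeholders chosen for the example.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch of waiting for an agent reply with a deadline. */
class ReplyTimeoutDemo {
    /** Returns the reply, or "TIMEOUT" if none arrives within timeoutMillis. */
    static String awaitReply(CompletableFuture<String> reply, long timeoutMillis) {
        try {
            return reply.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // The caller gives up here, even if the reply is completed later.
            return "TIMEOUT";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "INTERRUPTED";
        } catch (ExecutionException e) {
            return "FAILED";
        }
    }
}
```

In this sketch, a reply that is completed after awaitReply has already returned "TIMEOUT" does not change that earlier result, which mirrors a step being failed even though the agent finishes its work afterwards.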
When the network is unstable and connections are flaky or being dropped, the following error message can appear frequently:
Connection reset by peer: socket write error
TBA
You may see errors like the following if the connection is being dropped by the firewall:
java.net.ConnectException: Connection Refused
TBA
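When investigating refused or dropped connections, it can help to test from the agent machine whether the server's port is reachable at all. The following is a minimal sketch using the standard java.net.Socket API. PortProbe and probe are hypothetical names, and the host, port, and timeout are values you would need to supply for your own environment.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

/** Hypothetical diagnostic helper: attempts a plain TCP connection. */
class PortProbe {
    static String probe(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return "OPEN";
        } catch (ConnectException e) {
            // Raised, for example, when the remote side actively refuses the connection.
            return "CONNECT_FAILED: " + e.getMessage();
        } catch (SocketTimeoutException e) {
            // No answer within timeoutMillis; packets may be silently dropped.
            return "TIMED_OUT";
        } catch (IOException e) {
            return "ERROR: " + e.getMessage();
        }
    }
}
```

A "TIMED_OUT" result suggests that traffic is being silently discarded somewhere along the path, while "CONNECT_FAILED" suggests that something actively rejected the connection or that nothing is listening on that port.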