Problem with Torque Scheduler (nodes down)

Hi to all the forum members.
I just registered on the forum. I am managing a Linux computer which has Torque for scheduling jobs.
I'd like your suggestions for the following issue with Torque. I hope this is the right section for this topic.

I have checked all the config files and settings extensively, but the nodes are always down.

I recently upgraded the system packages, and the Torque packages were updated to the latest RPM versions. However, I am unable to get the nodes into an active state (see the command output below).

> qnodes

node01.cluster
state = down
np = 12
properties = allcomp,gpu,compute
ntype = cluster
mom_service_port = 15002
mom_manager_port = 15003
gpus = 1

node02.cluster
state = down
np = 12
properties = allcomp,gpu,compute
ntype = cluster
mom_service_port = 15002
mom_manager_port = 15003
gpus = 1
...

> momctl -d 3 -h node01

Host: node01.cluster/node01.cluster Version: 4.2.10 PID: 12009
Server[0]: XXXXXX.cluster (10.1.1.254:15001)
WARNING: no messages received from server
WARNING: no messages sent to server
HomeDirectory: /var/lib/torque/mom_priv
stdout/stderr spool directory: '/var/lib/torque/spool/' (108669845blocks available)
NOTE: syslog enabled
MOM active: 1755 seconds
Check Poll Time: 45 seconds
Server Update Interval: 45 seconds
LogLevel: 7 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model: TCP
MemLocked: TRUE (mlock)
TCP Timeout: 60 seconds
Prolog: /var/lib/torque/mom_priv/prologue (disabled)
Alarm Time: 0 of 10 seconds
Trusted Client List: 10.1.1.1:0,10.1.1.254:0,127.0.0.1:0: 0
Copy Command: /usr/bin/scp -rpB
NOTE: no local jobs detected

diagnostics complete

The nodes and server were active and functioning properly previously.
I checked all the config files and they seem to be correct.
I can ssh to node01 or to other nodes and back to server, without password.
The hostname of the server is the same in server_name on both the server and the nodes, as well as in the /etc/hosts entries.
The munge and trqauthd daemons are running. The momctl command issued on the server does return output, but the WARNINGs indicate that the server and the node are not communicating. I can't find any clue in the server or mom logs. Could you provide some directions to resolve this?

The following are logs on server and node01.

ON SERVER:

> tail /var/lib/torque/server_logs/20160503
05/03/2016 09:19:04;0008;PBS_Server.2942;Job;0.XXXXX.cluster;Job
Modified at request of
05/03/2016 09:19:04;0040;PBS_Server.2942;Req;node_spec;job allocation
request exceeds currently available cluster nodes, 1 requested, 0 available
05/03/2016 09:19:04;0008;PBS_Server.2942;Job;0.XXXXX.cluster;could not
locate requested resources '1:ppn=1' (node_spec failed) job allocation
request exceeds currently available cluster nodes, 1 requested, 0 available
05/03/2016 09:19:04;0080;PBS_Server.2942;Req;req_reject;Reject reply
code=15046(Resource temporarily unavailable MSG=job allocation request
exceeds currently available cluster nodes, 1 requested, 0 available),
aux=0, type=RunJob, from
05/03/2016 09:19:04;0008;PBS_Server.2942;Job;0.XXXXX.cluster;Job
Modified at request of
05/03/2016 09:19:48;0002;PBS_Server.2958;Svr;PBS_Server;Torque Server
Version = 4.2.10, loglevel = 0
05/03/2016 09:24:56;0002;PBS_Server.2940;Svr;PBS_Server;Torque Server
Version = 4.2.10, loglevel = 0

> tail -f sched_logs/20160503
05/03/2016 07:48:19;0080; pbs_sched.2825;Svr;main;brk point 98287616
05/03/2016 07:58:24;0080; pbs_sched.2825;Svr;main;brk point 98811904
05/03/2016 08:08:29;0080; pbs_sched.2825;Svr;main;brk point 99336192
05/03/2016 08:18:34;0080; pbs_sched.2825;Svr;main;brk point 99860480
05/03/2016 08:28:39;0080; pbs_sched.2825;Svr;main;brk point 100384768
05/03/2016 08:38:44;0080; pbs_sched.2825;Svr;main;brk point 100909056
05/03/2016 08:48:49;0080; pbs_sched.2825;Svr;main;brk point 101433344
05/03/2016 08:58:54;0080; pbs_sched.2825;Svr;main;brk point 102486016
05/03/2016 09:19:04;0080; pbs_sched.2825;Svr;main;brk point 103010304
05/03/2016 09:29:09;0080; pbs_sched.2825;Svr;main;brk point 103534592

ON node01:
> tail /var/lib/torque/mom_logs/20160503
05/03/2016 09:26:57;0002;
pbs_mom.15663;n/a;mom_server_all_update_stat;composing status update for
server
05/03/2016 09:26:57;0008; pbs_mom.15663;Job;scan_for_terminated;entered
05/03/2016 09:26:57;0080; pbs_mom.15663;Svr;mom_get_sample;proc_array
load started
05/03/2016 09:26:57;0002;
pbs_mom.15663;Svr;get_cpuset_pidlist;/dev/cpuset/torque contains 0 PIDs
05/03/2016 09:26:57;0080; pbs_mom.15663;n/a;mom_get_sample;proc_array
loaded - nproc=0
05/03/2016 09:26:57;0008; pbs_mom.15663;Job;scan_for_terminated;pid
15682 not tracked, statloc=0, exitval=0
05/03/2016 09:27:42;0002;
pbs_mom.15663;n/a;mom_server_all_update_stat;composing status update for
server
05/03/2016 09:27:42;0008; pbs_mom.15663;Job;scan_for_terminated;entered
05/03/2016 09:27:42;0080; pbs_mom.15663;Svr;mom_get_sample;proc_array
load started
05/03/2016 09:27:42;0002;
pbs_mom.15663;Svr;get_cpuset_pidlist;/dev/cpuset/torque contains 0 PIDs
05/03/2016 09:27:42;0080; pbs_mom.15663;n/a;mom_get_sample;proc_array
loaded - nproc=0
05/03/2016 09:27:42;0008; pbs_mom.15663;Job;scan_for_terminated;pid
15684 not tracked, statloc=0, exitval=0

Output of other key commands.

> uname -a
Linux stinger.cluster 2.6.32-573.7.1.el6.centos.plus.x86_64 #1 SMP Wed Sep 23 03:02:55 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

> rpm -aq | grep torque

torque-server-4.2.10-9.el6.x86_64
torque-libs-4.2.10-9.el6.x86_64
torque-scheduler-4.2.10-9.el6.x86_64
torque-mom-4.2.10-9.el6.x86_64
torque-client-4.2.10-9.el6.x86_64
torque-4.2.10-9.el6.x86_64

> qstat -q

server: XXXXXX.cluster

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
batch              --      --       --      --    0   1  6   E R
                                               ----- -----

> qmgr -c 'p s'

#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch max_running = 6
set queue batch resources_max.ncpus = 8
set queue batch resources_max.nodes = 1
set queue batch resources_default.ncpus = 1
set queue batch resources_default.neednodes = 1:ppn=1
set queue batch resources_default.walltime = 24:00:00
set queue batch max_user_run = 6
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = XXXXXX.cluster
set server acl_hosts += node01
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 300
set server job_stat_rate = 45
set server poll_jobs = True
set server mom_job_sync = True
set server next_job_number = 1
set server authorized_users =
set server moab_array_compatible = True
set server nppcu = 1
0 1

> cat /var/lib/torque/server_priv/nodes

node01.cluster np=12 gpus=1 allcomp gpu compute
node02.cluster np=12 gpus=1 allcomp gpu compute
node03.cluster np=12 gpus=1 allcomp gpu compute
node04.cluster np=12 gpus=1 allcomp gpu compute
node05.cluster np=12 gpus=1 allcomp gpu compute
node06.cluster np=12 gpus=1 allcomp gpu compute
node07.cluster np=12 gpus=1 allcomp gpu compute
node08.cluster np=12 gpus=1 allcomp gpu compute
node09.cluster np=12 gpus=1 allcomp gpu compute

Looking at those logs, the likely cause is fairly clear. The most important lines in the momctl output are:

WARNING: no messages received from server

WARNING: no messages sent to server

They show that pbs_mom (the node daemon) and pbs_server have not exchanged a single message since the MOM started. The daemons may be running on the nodes, but the server gets no status from them and so reports them as down. From my experience, there are a few classic reasons behind this. Here is how to work through them:

1. Version Mismatch (The Most Likely Culprit)
After an RPM upgrade, the Torque packages on the server and on the nodes may be out of sync. Run this check on every single node:

rpm -qa | grep torque

The version on the nodes must match the server's (4.2.10-9.el6) exactly. If there's a discrepancy, update the nodes:

yum update torque-mom torque-libs torque-client
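
If you have more than a couple of nodes, checking them one by one gets tedious. Here is a rough sketch of how the comparison could be scripted. The function names are made up for this example, and it assumes passwordless ssh to each node (which you said works) and that torque-mom is the package to compare:

```shell
#!/bin/sh
# Sketch only: the function names are invented for this example.

# collect_versions HOST...
# Print "host version" for each host by running rpm over ssh.
# Prints "unavailable" when ssh fails or the package is not installed.
collect_versions() {
    for host in "$@"; do
        ver=$(ssh -n -o BatchMode=yes "$host" rpm -q torque-mom 2>/dev/null) || ver=unavailable
        printf '%s %s\n' "$host" "$ver"
    done
}

# find_mismatched_nodes EXPECTED
# Read "host version" lines on stdin and print the hosts whose
# version string differs from EXPECTED.
find_mismatched_nodes() {
    awk -v want="$1" '$2 != want { print $1 }'
}
```

You would run it on the server as `collect_versions node01.cluster node02.cluster | find_mismatched_nodes torque-mom-4.2.10-9.el6.x86_64`. No output means every listed node reported the same package version as the server.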

2. Port Access & Firewall Blocks
Firewall rules may have been overwritten or changed during the upgrade. Make sure the ports are reachable in both directions:

# From the server to a node (pbs_mom ports):
telnet node01.cluster 15002
telnet node01.cluster 15003

# From a node to the server (pbs_server port):
telnet XXXXXX.cluster 15001

If a connection is refused or times out, check the iptables rules on the machine you were trying to reach. If the grep below shows nothing, look at the full ruleset as well, because a catch-all REJECT or DROP rule will not mention the port numbers:

iptables -L -n | grep 1500
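
If telnet isn't installed on the nodes, the same test can be done with bash alone. This is only a sketch: the function names are mine, and it assumes bash (built with /dev/tcp support) and the coreutils timeout command are available:

```shell
#!/bin/sh
# Sketch only: the function names are invented for this example.

# probe_port HOST PORT
# Succeed if a TCP connection to HOST:PORT can be opened within 3 seconds.
probe_port() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# report_ports HOST PORT...
# Print one "host:port open" or "host:port closed" line per port.
# "closed" here also covers filtered ports and timeouts.
report_ports() {
    host="$1"
    shift
    for port in "$@"; do
        if probe_port "$host" "$port"; then
            state=open
        else
            state=closed
        fi
        printf '%s:%s %s\n' "$host" "$port" "$state"
    done
}
```

From the server, `report_ports node01.cluster 15002 15003` covers the MOM ports, and from a node, `report_ports XXXXXX.cluster 15001` covers the server port.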

3. Trashed pbs_mom Config File
Package upgrades sometimes reset the /var/lib/torque/mom_priv/config file. Check this file on every node:

cat /var/lib/torque/mom_priv/config

You should see lines like these inside (note there is no equals sign between the keyword and the value):

$pbsserver XXXXXX.cluster
$logevent 225
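
To avoid eyeballing the file on every node, the $pbsserver check could be scripted roughly like this. The function name is hypothetical, and the sketch only looks at the $pbsserver keyword, nothing else in the file:

```shell
#!/bin/sh
# Sketch only: the function name is invented for this example.

# check_mom_config FILE SERVER
# Print "ok" if FILE has a $pbsserver line naming SERVER,
# "mismatch" if it has $pbsserver lines but none names SERVER,
# and "missing" if there is no $pbsserver line at all.
check_mom_config() {
    awk -v want="$2" '
        $1 == "$pbsserver" { found = 1; if ($2 == want) ok = 1 }
        END {
            if (!found)  print "missing"
            else if (ok) print "ok"
            else         print "mismatch"
        }
    ' "$1"
}
```

On a node that would be `check_mom_config /var/lib/torque/mom_priv/config XXXXXX.cluster`, with the same server name you have in server_name.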
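
4. trqauthd Socket Conflicts
In the Torque 4.x series, a stale socket file left behind after an upgrade can keep trqauthd from working properly. You said trqauthd is running, so this is lower priority, but it is cheap to rule out. The socket path differs between builds (some 4.2 installs use /tmp/trqauthd-unix), so confirm the path on your system before deleting anything. Then clean up and restart the services on all nodes:

# Remove the stale socket (adjust the path to your build)
rm -f /var/run/trqauthd.socket

# Restart the daemons
service trqauthd restart
service pbs_mom restart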

5. Clean & Sequential Restart (Hard Restart)
Sometimes restarting individual daemons after an upgrade isn't enough, and the whole system has to be brought up cleanly and in the right order. Try this:

# On the server first:
service pbs_server stop
service pbs_sched stop
sleep 5
service pbs_server start
service pbs_sched start

# Then on every node:
service pbs_mom stop
sleep 3
service pbs_mom start

After doing this, monitor the node status from the server with:

watch -n 5 'pbsnodes -a | grep -E "node|state"'
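
If you'd rather see a summary than scroll through the full listing, the state lines can be tallied. A small sketch (the function name is mine) that reads pbsnodes-style output on stdin:

```shell
#!/bin/sh
# Sketch only: the function name is invented for this example.

# count_node_states
# Read `pbsnodes -a` style output on stdin and print "state count"
# for every distinct value found on the "state = ..." lines.
count_node_states() {
    awk '$1 == "state" && $2 == "=" { n[$3]++ }
         END { for (s in n) printf "%s %d\n", s, n[s] }' | sort
}
```

Run it as `pbsnodes -a | count_node_states`. Once the MOMs start reporting in, the down count should shrink and the nodes should show up under other states.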

Quick Diagnostics (In Order of Priority)
To save time, work through the checks in this order:

Version: Ensure versions match with rpm -qa | grep torque.

Ports: Cross-test ports 15001, 15002, 15003 using telnet.

Config: Verify the server name in mom_priv/config.

Socket: Clean up trqauthd leftovers and restart.

Sequential Restart: Server first, then nodes.

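By the way, one more detail from the logs: the sched_logs show nothing but brk point lines. As far as I know those are routine messages logged once per scheduling cycle (they appear roughly every ten minutes, in line with scheduler_iteration = 600). So the scheduler is running but has nothing it can place, which is what you'd expect while every node is down.

Hope this helps you put out the fire. Good luck.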

Sridharacharya Peon

cryptopunk Greenhorn