Hi to all the forum members. I have just registered on the forum. I manage a Linux cluster that uses Torque for job scheduling, and I would like your suggestions on the following Torque issue. I hope this is the right section for this topic.

I have checked all the config files and settings extensively, but the nodes always show as down. I recently upgraded packages, and the Torque packages were updated to the latest RPM versions. Since then I have been unable to bring the nodes to an active state (see the command output below).

> qnodes
node01.cluster
     state = down
     np = 12
     properties = allcomp,gpu,compute
     ntype = cluster
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 1

node02.cluster
     state = down
     np = 12
     properties = allcomp,gpu,compute
     ntype = cluster
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 1

...

> momctl -d 3 -h node01

Host: node01.cluster/node01.cluster   Version: 4.2.10   PID: 12009
Server[0]: XXXXXX.cluster (10.1.1.254:15001)
  WARNING: no messages received from server
  WARNING: no messages sent to server
HomeDirectory:          /var/lib/torque/mom_priv
stdout/stderr spool directory: '/var/lib/torque/spool/' (108669845 blocks available)
NOTE: syslog enabled
MOM active:             1755 seconds
Check Poll Time:        45 seconds
Server Update Interval: 45 seconds
LogLevel:               7 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model:    TCP
MemLocked:              TRUE (mlock)
TCP Timeout:            60 seconds
Prolog:                 /var/lib/torque/mom_priv/prologue (disabled)
Alarm Time:             0 of 10 seconds
Trusted Client List:    10.1.1.1:0,10.1.1.254:0,127.0.0.1:0: 0
Copy Command:           /usr/bin/scp -rpB
NOTE: no local jobs detected

diagnostics complete

The nodes and the server were active and working properly before the upgrade. I have gone through all the config files and they appear to be correct. I can ssh from the server to node01 (or to any other node) and back without a password. The server hostname is the same in server_name on both the server and the clients, and in the /etc/hosts entries, and the munge and trqauthd daemons are running. As the momctl command issued on the server shows, it does produce output, but the WARNINGs indicate that the server and the MOM are not communicating. I cannot find a clue in the server or MOM logs. Could you give me some directions to resolve this?
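In case it helps to suggest further checks, this is roughly what I can run from the server to sanity-check connectivity and the basic config. It is only a sketch: the port numbers come from the qnodes/momctl output above, the paths are the /var/lib/torque defaults these RPMs use, XXXXXX.cluster stands in for the real server name, and it assumes plain nc is available.

# Placeholders: XXXXXX.cluster is the (anonymized) server name from the output above.
SERVER=XXXXXX.cluster
NODE=node01.cluster

# node -> server direction: pbs_server listens on 15001
ssh $NODE "nc -z -w 5 $SERVER 15001; echo pbs_server_15001=\$?"

# server -> node direction: pbs_mom listens on 15002 (service) and 15003 (manager)
nc -z -w 5 $NODE 15002; echo mom_service_15002=$?
nc -z -w 5 $NODE 15003; echo mom_manager_15003=$?

# Files that should agree on the server name (default paths for these packages)
cat /var/lib/torque/server_name
ssh $NODE "cat /var/lib/torque/mom_priv/config"    # should name the server, e.g. $pbsserver XXXXXX.cluster

If any of the nc probes failed I would suspect a firewall change pulled in by the package update, but I have not confirmed anything like that yet.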
The following are the logs on the server and on node01.

ON SERVER:

> tail /var/lib/torque/server_logs/20160503
05/03/2016 09:19:04;0008;PBS_Server.2942;Job;0.XXXXX.cluster;Job Modified at request of
05/03/2016 09:19:04;0040;PBS_Server.2942;Req;node_spec;job allocation request exceeds currently available cluster nodes, 1 requested, 0 available
05/03/2016 09:19:04;0008;PBS_Server.2942;Job;0.XXXXX.cluster;could not locate requested resources '1pn=1' (node_spec failed) job allocation request exceeds currently available cluster nodes, 1 requested, 0 available
05/03/2016 09:19:04;0080;PBS_Server.2942;Req;req_reject;Reject reply code=15046(Resource temporarily unavailable MSG=job allocation request exceeds currently available cluster nodes, 1 requested, 0 available), aux=0, type=RunJob, from
05/03/2016 09:19:04;0008;PBS_Server.2942;Job;0.XXXXX.cluster;Job Modified at request of
05/03/2016 09:19:48;0002;PBS_Server.2958;Svr;PBS_Server;Torque Server Version = 4.2.10, loglevel = 0
05/03/2016 09:24:56;0002;PBS_Server.2940;Svr;PBS_Server;Torque Server Version = 4.2.10, loglevel = 0

> tail -f sched_logs/20160503
05/03/2016 07:48:19;0080; pbs_sched.2825;Svr;main;brk point 98287616
05/03/2016 07:58:24;0080; pbs_sched.2825;Svr;main;brk point 98811904
05/03/2016 08:08:29;0080; pbs_sched.2825;Svr;main;brk point 99336192
05/03/2016 08:18:34;0080; pbs_sched.2825;Svr;main;brk point 99860480
05/03/2016 08:28:39;0080; pbs_sched.2825;Svr;main;brk point 100384768
05/03/2016 08:38:44;0080; pbs_sched.2825;Svr;main;brk point 100909056
05/03/2016 08:48:49;0080; pbs_sched.2825;Svr;main;brk point 101433344
05/03/2016 08:58:54;0080; pbs_sched.2825;Svr;main;brk point 102486016
05/03/2016 09:19:04;0080; pbs_sched.2825;Svr;main;brk point 103010304
05/03/2016 09:29:09;0080; pbs_sched.2825;Svr;main;brk point 103534592

ON node01:

> tail /var/lib/torque/mom_logs/20160503
05/03/2016 09:26:57;0002; pbs_mom.15663;n/a;mom_server_all_update_stat;composing status update for server
05/03/2016 09:26:57;0008; pbs_mom.15663;Job;scan_for_terminated;entered
05/03/2016 09:26:57;0080; pbs_mom.15663;Svr;mom_get_sample;proc_array load started
05/03/2016 09:26:57;0002; pbs_mom.15663;Svr;get_cpuset_pidlist;/dev/cpuset/torque contains 0 PIDs
05/03/2016 09:26:57;0080; pbs_mom.15663;n/a;mom_get_sample;proc_array loaded - nproc=0
05/03/2016 09:26:57;0008; pbs_mom.15663;Job;scan_for_terminated;pid 15682 not tracked, statloc=0, exitval=0
05/03/2016 09:27:42;0002; pbs_mom.15663;n/a;mom_server_all_update_stat;composing status update for server
05/03/2016 09:27:42;0008; pbs_mom.15663;Job;scan_for_terminated;entered
05/03/2016 09:27:42;0080; pbs_mom.15663;Svr;mom_get_sample;proc_array load started
05/03/2016 09:27:42;0002; pbs_mom.15663;Svr;get_cpuset_pidlist;/dev/cpuset/torque contains 0 PIDs
05/03/2016 09:27:42;0080; pbs_mom.15663;n/a;mom_get_sample;proc_array loaded - nproc=0
05/03/2016 09:27:42;0008; pbs_mom.15663;Job;scan_for_terminated;pid 15684 not tracked, statloc=0, exitval=0

Output of other key commands:

> uname -a
Linux stinger.cluster 2.6.32-573.7.1.el6.centos.plus.x86_64 #1 SMP Wed Sep 23 03:02:55 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

> rpm -aq | grep torque
torque-server-4.2.10-9.el6.x86_64
torque-libs-4.2.10-9.el6.x86_64
torque-scheduler-4.2.10-9.el6.x86_64
torque-mom-4.2.10-9.el6.x86_64
torque-client-4.2.10-9.el6.x86_64
torque-4.2.10-9.el6.x86_64

> qstat -q

server: XXXXXX.cluster

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
batch              --      --       --      --    0   1  6   E R
                                                ----- -----
                                                    0     1

> qmgr -c 'p s'
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch max_running = 6
set queue batch resources_max.ncpus = 8
set queue batch resources_max.nodes = 1
set queue batch resources_default.ncpus = 1
set queue batch resources_default.neednodes = 1pn=1
set queue batch resources_default.walltime = 24:00:00
set queue batch max_user_run = 6
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = XXXXXX.cluster
set server acl_hosts += node01
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 300
set server job_stat_rate = 45
set server poll_jobs = True
set server mom_job_sync = True
set server next_job_number = 1
set server authorized_users =
set server moab_array_compatible = True
set server nppcu = 1

> cat /var/lib/torque/server_priv/nodes
node01.cluster np=12 gpus=1 allcomp gpu compute
node02.cluster np=12 gpus=1 allcomp gpu compute
node03.cluster np=12 gpus=1 allcomp gpu compute
node04.cluster np=12 gpus=1 allcomp gpu compute
node05.cluster np=12 gpus=1 allcomp gpu compute
node06.cluster np=12 gpus=1 allcomp gpu compute
node07.cluster np=12 gpus=1 allcomp gpu compute
node08.cluster np=12 gpus=1 allcomp gpu compute
node09.cluster np=12 gpus=1 allcomp gpu compute
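For completeness, this is the kind of restart sequence I have been trying after each change. I am not certain I have the order or the init script names exactly right; these are what I believe the torque-* RPMs install on CentOS 6, alongside the trqauthd and munge services mentioned above, so please correct me if a different set of services or order is expected.

# On the server (assumed init script names from the torque RPMs; adjust if they differ)
service trqauthd restart
service pbs_server restart
service pbs_sched restart

# On each compute node
service pbs_mom restart

# Then re-check what the server thinks of the nodes
pbsnodes -a

The nodes still show as down afterwards. Thanks in advance for any pointers.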