Hi to all the forum members. I have just registered on the forum. I manage a Linux cluster that uses Torque for job scheduling, and I would like your suggestions on the following Torque issue. I hope this is the right section for this topic.

I have checked all the config files and settings extensively, but the nodes always show as down. I recently upgraded packages, and the Torque packages were updated to the latest RPM versions. Since then I have been unable to bring the nodes to an active state (see the command output below).

> qnodes
node01.cluster
     state = down
     np = 12
     properties = allcomp,gpu,compute
     ntype = cluster
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 1

node02.cluster
     state = down
     np = 12
     properties = allcomp,gpu,compute
     ntype = cluster
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 1

...

> momctl -d 3 -h node01

Host: node01.cluster/node01.cluster   Version: 4.2.10   PID: 12009
Server[0]: XXXXXX.cluster (10.1.1.254:15001)
  WARNING: no messages received from server
  WARNING: no messages sent to server
HomeDirectory:          /var/lib/torque/mom_priv
stdout/stderr spool directory: '/var/lib/torque/spool/' (108669845 blocks available)
NOTE: syslog enabled
MOM active:             1755 seconds
Check Poll Time:        45 seconds
Server Update Interval: 45 seconds
LogLevel:               7 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model:    TCP
MemLocked:              TRUE (mlock)
TCP Timeout:            60 seconds
Prolog:                 /var/lib/torque/mom_priv/prologue (disabled)
Alarm Time:             0 of 10 seconds
Trusted Client List:    10.1.1.1:0,10.1.1.254:0,127.0.0.1:0: 0
Copy Command:           /usr/bin/scp -rpB
NOTE: no local jobs detected

diagnostics complete

The nodes and the server were active and working properly before the upgrade. I have gone through all the config files and they appear to be correct. I can ssh from the server to node01 (or to any other node) and back without a password. The server hostname is the same in server_name on both the server and the clients, and in the /etc/hosts entries, and the munge and trqauthd daemons are running. As the momctl command issued on the server shows, it does produce output, but the WARNINGs indicate that the server and the MOM are not communicating. I cannot find a clue in the server or MOM logs. Could you give me some directions to resolve this?
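In case it helps to suggest further checks, this is roughly what I can run from the server to sanity-check connectivity and the basic config. It is only a sketch: the port numbers come from the qnodes/momctl output above, the paths are the /var/lib/torque defaults these RPMs use, XXXXXX.cluster stands in for the real server name, and it assumes plain nc is available.

# Placeholders: XXXXXX.cluster is the (anonymized) server name from the output above.
SERVER=XXXXXX.cluster
NODE=node01.cluster

# node -> server direction: pbs_server listens on 15001
ssh $NODE "nc -z -w 5 $SERVER 15001; echo pbs_server_15001=\$?"

# server -> node direction: pbs_mom listens on 15002 (service) and 15003 (manager)
nc -z -w 5 $NODE 15002; echo mom_service_15002=$?
nc -z -w 5 $NODE 15003; echo mom_manager_15003=$?

# Files that should agree on the server name (default paths for these packages)
cat /var/lib/torque/server_name
ssh $NODE "cat /var/lib/torque/mom_priv/config"    # should name the server, e.g. $pbsserver XXXXXX.cluster

If any of the nc probes failed I would suspect a firewall change pulled in by the package update, but I have not confirmed anything like that yet.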
The following are the logs on the server and on node01.

ON SERVER:

> tail /var/lib/torque/server_logs/20160503
05/03/2016 09:19:04;0008;PBS_Server.2942;Job;0.XXXXX.cluster;Job Modified at request of
05/03/2016 09:19:04;0040;PBS_Server.2942;Req;node_spec;job allocation request exceeds currently available cluster nodes, 1 requested, 0 available
05/03/2016 09:19:04;0008;PBS_Server.2942;Job;0.XXXXX.cluster;could not locate requested resources '1pn=1' (node_spec failed) job allocation request exceeds currently available cluster nodes, 1 requested, 0 available
05/03/2016 09:19:04;0080;PBS_Server.2942;Req;req_reject;Reject reply code=15046(Resource temporarily unavailable MSG=job allocation request exceeds currently available cluster nodes, 1 requested, 0 available), aux=0, type=RunJob, from
05/03/2016 09:19:04;0008;PBS_Server.2942;Job;0.XXXXX.cluster;Job Modified at request of
05/03/2016 09:19:48;0002;PBS_Server.2958;Svr;PBS_Server;Torque Server Version = 4.2.10, loglevel = 0
05/03/2016 09:24:56;0002;PBS_Server.2940;Svr;PBS_Server;Torque Server Version = 4.2.10, loglevel = 0

> tail -f sched_logs/20160503
05/03/2016 07:48:19;0080; pbs_sched.2825;Svr;main;brk point 98287616
05/03/2016 07:58:24;0080; pbs_sched.2825;Svr;main;brk point 98811904
05/03/2016 08:08:29;0080; pbs_sched.2825;Svr;main;brk point 99336192
05/03/2016 08:18:34;0080; pbs_sched.2825;Svr;main;brk point 99860480
05/03/2016 08:28:39;0080; pbs_sched.2825;Svr;main;brk point 100384768
05/03/2016 08:38:44;0080; pbs_sched.2825;Svr;main;brk point 100909056
05/03/2016 08:48:49;0080; pbs_sched.2825;Svr;main;brk point 101433344
05/03/2016 08:58:54;0080; pbs_sched.2825;Svr;main;brk point 102486016
05/03/2016 09:19:04;0080; pbs_sched.2825;Svr;main;brk point 103010304
05/03/2016 09:29:09;0080; pbs_sched.2825;Svr;main;brk point 103534592

ON node01:

> tail /var/lib/torque/mom_logs/20160503
05/03/2016 09:26:57;0002; pbs_mom.15663;n/a;mom_server_all_update_stat;composing status update for server
05/03/2016 09:26:57;0008; pbs_mom.15663;Job;scan_for_terminated;entered
05/03/2016 09:26:57;0080; pbs_mom.15663;Svr;mom_get_sample;proc_array load started
05/03/2016 09:26:57;0002; pbs_mom.15663;Svr;get_cpuset_pidlist;/dev/cpuset/torque contains 0 PIDs
05/03/2016 09:26:57;0080; pbs_mom.15663;n/a;mom_get_sample;proc_array loaded - nproc=0
05/03/2016 09:26:57;0008; pbs_mom.15663;Job;scan_for_terminated;pid 15682 not tracked, statloc=0, exitval=0
05/03/2016 09:27:42;0002; pbs_mom.15663;n/a;mom_server_all_update_stat;composing status update for server
05/03/2016 09:27:42;0008; pbs_mom.15663;Job;scan_for_terminated;entered
05/03/2016 09:27:42;0080; pbs_mom.15663;Svr;mom_get_sample;proc_array load started
05/03/2016 09:27:42;0002; pbs_mom.15663;Svr;get_cpuset_pidlist;/dev/cpuset/torque contains 0 PIDs
05/03/2016 09:27:42;0080; pbs_mom.15663;n/a;mom_get_sample;proc_array loaded - nproc=0
05/03/2016 09:27:42;0008; pbs_mom.15663;Job;scan_for_terminated;pid 15684 not tracked, statloc=0, exitval=0

Output of other key commands:

> uname -a
Linux stinger.cluster 2.6.32-573.7.1.el6.centos.plus.x86_64 #1 SMP Wed Sep 23 03:02:55 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

> rpm -aq | grep torque
torque-server-4.2.10-9.el6.x86_64
torque-libs-4.2.10-9.el6.x86_64
torque-scheduler-4.2.10-9.el6.x86_64
torque-mom-4.2.10-9.el6.x86_64
torque-client-4.2.10-9.el6.x86_64
torque-4.2.10-9.el6.x86_64

> qstat -q

server: XXXXXX.cluster

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
batch              --      --       --      --    0   1  6   E R
                                                ----- -----
                                                    0     1

> qmgr -c 'p s'
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch max_running = 6
set queue batch resources_max.ncpus = 8
set queue batch resources_max.nodes = 1
set queue batch resources_default.ncpus = 1
set queue batch resources_default.neednodes = 1pn=1
set queue batch resources_default.walltime = 24:00:00
set queue batch max_user_run = 6
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = XXXXXX.cluster
set server acl_hosts += node01
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 300
set server job_stat_rate = 45
set server poll_jobs = True
set server mom_job_sync = True
set server next_job_number = 1
set server authorized_users =
set server moab_array_compatible = True
set server nppcu = 1

> cat /var/lib/torque/server_priv/nodes
node01.cluster np=12 gpus=1 allcomp gpu compute
node02.cluster np=12 gpus=1 allcomp gpu compute
node03.cluster np=12 gpus=1 allcomp gpu compute
node04.cluster np=12 gpus=1 allcomp gpu compute
node05.cluster np=12 gpus=1 allcomp gpu compute
node06.cluster np=12 gpus=1 allcomp gpu compute
node07.cluster np=12 gpus=1 allcomp gpu compute
node08.cluster np=12 gpus=1 allcomp gpu compute
node09.cluster np=12 gpus=1 allcomp gpu compute
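For completeness, this is the kind of restart sequence I have been trying after each change. I am not certain I have the order or the init script names exactly right; these are what I believe the torque-* RPMs install on CentOS 6, alongside the trqauthd and munge services mentioned above, so please correct me if a different set of services or order is expected.

# On the server (assumed init script names from the torque RPMs; adjust if they differ)
service trqauthd restart
service pbs_server restart
service pbs_sched restart

# On each compute node
service pbs_mom restart

# Then re-check what the server thinks of the nodes
pbsnodes -a

The nodes still show as down afterwards. Thanks in advance for any pointers.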