Problem with Torque Scheduler (nodes down)

Discussion in 'Cluster Computing' started by Sridharacharya, May 4, 2016.

    #1
    Hi to all the forum members.
    I just registered on the forum. I manage a Linux cluster that uses Torque for job scheduling.
    I'd like your suggestions on the following issue with Torque. I hope this is the right section for this topic.

    I have checked all the config files and settings extensively, but the nodes always show as down.

    I recently upgraded the system packages, and the Torque packages were updated to the latest rpm versions. Since then I have not been able to get the nodes into the active state (see the output of the commands below).

    > qnodes


    node01.cluster
    state = down
    np = 12
    properties = allcomp,gpu,compute
    ntype = cluster
    mom_service_port = 15002
    mom_manager_port = 15003
    gpus = 1

    node02.cluster
    state = down
    np = 12
    properties = allcomp,gpu,compute
    ntype = cluster
    mom_service_port = 15002
    mom_manager_port = 15003
    gpus = 1
    ...


    > momctl -d 3 -h node01

    Host: node01.cluster/node01.cluster Version: 4.2.10 PID: 12009
    Server[0]: XXXXXX.cluster (10.1.1.254:15001)
    WARNING: no messages received from server
    WARNING: no messages sent to server
    HomeDirectory: /var/lib/torque/mom_priv
    stdout/stderr spool directory: '/var/lib/torque/spool/' (108669845 blocks available)
    NOTE: syslog enabled
    MOM active: 1755 seconds
    Check Poll Time: 45 seconds
    Server Update Interval: 45 seconds
    LogLevel: 7 (use SIGUSR1/SIGUSR2 to adjust)
    Communication Model: TCP
    MemLocked: TRUE (mlock)
    TCP Timeout: 60 seconds
    Prolog: /var/lib/torque/mom_priv/prologue (disabled)
    Alarm Time: 0 of 10 seconds
    Trusted Client List: 10.1.1.1:0,10.1.1.254:0,127.0.0.1:0: 0
    Copy Command: /usr/bin/scp -rpB
    NOTE: no local jobs detected

    diagnostics complete
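
    Since the two WARNING lines above are the main clue, these are the basic connectivity checks I still intend to run between the server and node01 (just a sketch, assuming nc is installed; the port numbers are the ones reported by momctl and qnodes above):

    # from the server: are the MOM ports on node01 reachable?
    > nc -zv node01 15002
    > nc -zv node01 15003
    # from node01: is the pbs_server port on the head node reachable?
    > nc -zv XXXXXX.cluster 15001
    # on both sides: look for firewall rules that could drop this traffic
    > iptables -L -n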


    The nodes and the server were active and functioning properly before the upgrade.
    I checked all the config files and they appear to be correct.
    I can ssh from the server to node01 (or any other node) and back to the server without a password.
    The server's hostname is the same in server_name on both the server and the clients, as well as in the /etc/hosts entries.
    The munge and trqauthd daemons are running. As the momctl command issued on the server shows, the MOM does respond, but the WARNINGs indicate that the server and the client are not communicating. I can't find any clue in the server or MOM logs. Could you give me some directions for resolving this?
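
    For completeness, this is roughly how I checked the hostname settings, plus the restart sequence I intend to try again (a sketch; the paths follow the /var/lib/torque layout shown above, and the service names assume the packaged init scripts):

    # on the server and on each node: the server name must match everywhere
    > cat /var/lib/torque/server_name
    > grep pbsserver /var/lib/torque/mom_priv/config    # on the nodes, if a mom config file is present
    > grep -E 'XXXXXX|node0' /etc/hosts

    # restart order after the rpm upgrade
    > service trqauthd restart      # on the server
    > service pbs_server restart    # on the server
    > service pbs_sched restart     # on the server
    > service pbs_mom restart       # on each node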

    The following are the logs on the server and on node01.

    ON SERVER:

    > tail /var/lib/torque/server_logs/20160503
    05/03/2016 09:19:04;0008;PBS_Server.2942;Job;0.XXXXX.cluster;Job Modified at request of
    05/03/2016 09:19:04;0040;PBS_Server.2942;Req;node_spec;job allocation request exceeds currently available cluster nodes, 1 requested, 0 available
    05/03/2016 09:19:04;0008;PBS_Server.2942;Job;0.XXXXX.cluster;could not locate requested resources '1:ppn=1' (node_spec failed) job allocation request exceeds currently available cluster nodes, 1 requested, 0 available
    05/03/2016 09:19:04;0080;PBS_Server.2942;Req;req_reject;Reject reply code=15046(Resource temporarily unavailable MSG=job allocation request exceeds currently available cluster nodes, 1 requested, 0 available), aux=0, type=RunJob, from
    05/03/2016 09:19:04;0008;PBS_Server.2942;Job;0.XXXXX.cluster;Job Modified at request of
    05/03/2016 09:19:48;0002;PBS_Server.2958;Svr;PBS_Server;Torque Server Version = 4.2.10, loglevel = 0
    05/03/2016 09:24:56;0002;PBS_Server.2940;Svr;PBS_Server;Torque Server Version = 4.2.10, loglevel = 0


    > tail -f sched_logs/20160503
    05/03/2016 07:48:19;0080; pbs_sched.2825;Svr;main;brk point 98287616
    05/03/2016 07:58:24;0080; pbs_sched.2825;Svr;main;brk point 98811904
    05/03/2016 08:08:29;0080; pbs_sched.2825;Svr;main;brk point 99336192
    05/03/2016 08:18:34;0080; pbs_sched.2825;Svr;main;brk point 99860480
    05/03/2016 08:28:39;0080; pbs_sched.2825;Svr;main;brk point 100384768
    05/03/2016 08:38:44;0080; pbs_sched.2825;Svr;main;brk point 100909056
    05/03/2016 08:48:49;0080; pbs_sched.2825;Svr;main;brk point 101433344
    05/03/2016 08:58:54;0080; pbs_sched.2825;Svr;main;brk point 102486016
    05/03/2016 09:19:04;0080; pbs_sched.2825;Svr;main;brk point 103010304
    05/03/2016 09:29:09;0080; pbs_sched.2825;Svr;main;brk point 103534592


    ON node01:
    > tail /var/lib/torque/mom_logs/20160503

    05/03/2016 09:26:57;0002;pbs_mom.15663;n/a;mom_server_all_update_stat;composing status update for server
    05/03/2016 09:26:57;0008;pbs_mom.15663;Job;scan_for_terminated;entered
    05/03/2016 09:26:57;0080;pbs_mom.15663;Svr;mom_get_sample;proc_array load started
    05/03/2016 09:26:57;0002;pbs_mom.15663;Svr;get_cpuset_pidlist;/dev/cpuset/torque contains 0 PIDs
    05/03/2016 09:26:57;0080;pbs_mom.15663;n/a;mom_get_sample;proc_array loaded - nproc=0
    05/03/2016 09:26:57;0008;pbs_mom.15663;Job;scan_for_terminated;pid 15682 not tracked, statloc=0, exitval=0
    05/03/2016 09:27:42;0002;pbs_mom.15663;n/a;mom_server_all_update_stat;composing status update for server
    05/03/2016 09:27:42;0008;pbs_mom.15663;Job;scan_for_terminated;entered
    05/03/2016 09:27:42;0080;pbs_mom.15663;Svr;mom_get_sample;proc_array load started
    05/03/2016 09:27:42;0002;pbs_mom.15663;Svr;get_cpuset_pidlist;/dev/cpuset/torque contains 0 PIDs
    05/03/2016 09:27:42;0080;pbs_mom.15663;n/a;mom_get_sample;proc_array loaded - nproc=0
    05/03/2016 09:27:42;0008;pbs_mom.15663;Job;scan_for_terminated;pid 15684 not tracked, statloc=0, exitval=0



    Output of other key commands.

    > uname -a
    Linux stinger.cluster 2.6.32-573.7.1.el6.centos.plus.x86_64 #1 SMP Wed Sep 23 03:02:55 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

    > rpm -aq | grep torque

    torque-server-4.2.10-9.el6.x86_64
    torque-libs-4.2.10-9.el6.x86_64
    torque-scheduler-4.2.10-9.el6.x86_64
    torque-mom-4.2.10-9.el6.x86_64
    torque-client-4.2.10-9.el6.x86_64
    torque-4.2.10-9.el6.x86_64
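
    Since the problem started right after the rpm upgrade, I also want to confirm that the MOM packages on every compute node are at the same 4.2.10-9 version as the server. A quick sketch, using the passwordless ssh mentioned above:

    > for n in node0{1..9}; do echo "== $n"; ssh $n 'rpm -q torque-mom torque-libs'; done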


    > qstat -q

    server: XXXXXX.cluster

    Queue            Memory CPU Time Walltime Node  Run Que Lm  State
    ---------------- ------ -------- -------- ----  --- --- --  -----
    batch              --      --       --      --    0   1  6   E R
                                                    ----- -----
                                                        0     1



    > qmgr -c 'p s'

    #
    # Create queues and set their attributes.
    #
    #
    # Create and define queue batch
    #
    create queue batch
    set queue batch queue_type = Execution
    set queue batch max_running = 6
    set queue batch resources_max.ncpus = 8
    set queue batch resources_max.nodes = 1
    set queue batch resources_default.ncpus = 1
    set queue batch resources_default.neednodes = 1:ppn=1
    set queue batch resources_default.walltime = 24:00:00
    set queue batch max_user_run = 6
    set queue batch enabled = True
    set queue batch started = True
    #
    # Set server attributes.
    #
    set server scheduling = True
    set server acl_hosts = XXXXXX.cluster
    set server acl_hosts += node01
    set server default_queue = batch
    set server log_events = 511
    set server mail_from = adm
    set server scheduler_iteration = 600
    set server node_check_rate = 150
    set server tcp_timeout = 300
    set server job_stat_rate = 45
    set server poll_jobs = True
    set server mom_job_sync = True
    set server next_job_number = 1
    set server authorized_users =
    set server moab_array_compatible = True
    set server nppcu = 1


    > cat /var/lib/torque/server_priv/nodes

    node01.cluster np=12 gpus=1 allcomp gpu compute
    node02.cluster np=12 gpus=1 allcomp gpu compute
    node03.cluster np=12 gpus=1 allcomp gpu compute
    node04.cluster np=12 gpus=1 allcomp gpu compute
    node05.cluster np=12 gpus=1 allcomp gpu compute
    node06.cluster np=12 gpus=1 allcomp gpu compute
    node07.cluster np=12 gpus=1 allcomp gpu compute
    node08.cluster np=12 gpus=1 allcomp gpu compute
    node09.cluster np=12 gpus=1 allcomp gpu compute
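
    To rule out name-resolution mismatches, I can also verify that every name in this file resolves consistently on the server, and that the server name resolves on the nodes (a sketch):

    # on the server
    > for n in node0{1..9}; do getent hosts $n.cluster; done
    # on node01
    > getent hosts XXXXXX.cluster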
     
    Sridharacharya, May 4, 2016