Troubleshooting job failures

Overview

This page covers common job failures and ways to troubleshoot and correct them. It describes general troubleshooting techniques as well as specific errors and how to fix them.

General troubleshooting techniques

condor_q diagnostics

The condor_q command has several options that can be used to diagnose why jobs are not running or are in the held state.  The first are the -analyze and -better-analyze options, which report how many of the available job slots a given job matches.  If the job had errors during execution, these options will also give more detailed messages about the errors that occurred.
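For example, to analyze an individual job (the job ID below is taken from the example later on this page), run one of the following; -better-analyze produces the more verbose report:

condor_q -analyze 371156.8
condor_q -better-analyze 371156.8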

condor_ssh_to_job

This command allows the user to ssh to the compute node that is running a specified job.  Once connected, the user is placed in the job's working directory and can examine the job's environment and run commands.  The _condor_stdout and _condor_stderr files contain the job's stdout and stderr output so far.  Note that this command requires the site running the job to allow users to ssh to the job; most sites allow this, but some do not.
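A minimal sketch of a typical session, assuming the job ID 371156.8 used elsewhere on this page and a site that permits ssh access:

condor_ssh_to_job 371156.8
# now in the job's working directory on the compute node
tail _condor_stdout _condor_stderr   # check the output produced so far
exit                                 # disconnect when done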

Specific issues 

Jobs not matching

If submitted jobs remain in the idle state and don't start, there is usually an issue with the job requirements that prevents the job from being matched with an available resource.  Users can troubleshoot this by running condor_q -better-analyze jobid and then examining the output, e.g.:

condor_q output
[sthapa@login01 osg-stash_chirp-bigData]$ condor_q -better-analyze 371156.8


-- Submitter: login01.osgconnect.net : <192.170.227.195:56174> : login01.osgconnect.net
IndexSet::Init: size out of range: 0
IndexSet::Init: IndexSet not initialized
User priority for sthapa@login01.osgconnect.net is not available, attempting to analyze without it.
---
371156.008:  Run analysis summary.  Of 0 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
      0 are available to run your job

WARNING:  Be advised:
   No resources matched request's constraints

The Requirements expression for your job is:

    ( ( OpSys == "LINUX" && OpSysMajorVer == 10 ) ) &&
    ( TARGET.Arch == "X86_64" ) && ( TARGET.Disk >= RequestDisk ) &&
    ( TARGET.Memory >= RequestMemory ) && ( TARGET.HasFileTransfer )

Your job defines the following attributes:

    DiskUsage = 12
    ImageSize = 12
    RequestDisk = 12
    RequestMemory = 1

The Requirements expression for your job reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]           0  OpSys == "LINUX"
[1]           0  OpSysMajorVer == 10
[3]           0  TARGET.Arch == "X86_64"
[5]           0  TARGET.Disk >= RequestDisk
[7]           0  TARGET.Memory >= RequestMemory
[9]           0  TARGET.HasFileTransfer

Suggestions:

    Condition                         Machines Matched    Suggestion
    ---------                         ----------------    ----------
1   target.OpSys == "LINUX"           0                   REMOVE
2   target.OpSysMajorVer == 10        0                   REMOVE
3   ( TARGET.Arch == "X86_64" )       0                   REMOVE
4   ( TARGET.Disk >= 12 )             0                   REMOVE
5   ( TARGET.Memory >= ifthenelse(MemoryUsage isnt undefined,MemoryUsage,1) )
                                      0                   REMOVE
6   ( TARGET.HasFileTransfer )        0                   REMOVE

WARNING:  Be advised:
   Request did not match any resource's constraints

The output clearly indicates that the job did not match any resources.  Additionally, by looking through the conditions listed, it becomes apparent that the job requires Scientific Linux 10 (target.OpSysMajorVer == 10), which won't be matched by any available resource.  Looking at the submit file for the job shows the following requirement:

Requirements = (OpSys == "LINUX" && OpSysMajorVer == 10)

This can be corrected in two different ways.  The entire job cluster can be removed using condor_rm 371156, followed by editing the submit file and resubmitting (see the sketch after the condor_qedit example below).  Alternatively, condor_qedit can be used to change the requirements of jobs already in the queue:

[sthapa@login01 demo]$ condor_qedit 371156 Requirements '(OpSys == "LINUX" && OpSysMajorVer == 6)'
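If the remove-and-resubmit approach is used instead, the steps would look roughly like the following; job.submit is a hypothetical submit file name, and major version 6 simply mirrors the condor_qedit example above:

condor_rm 371156
# edit job.submit so the requirement reads:
#   Requirements = (OpSys == "LINUX" && OpSysMajorVer == 6)
condor_submit job.submit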

Held jobs

Job output missing

If a job's submit file uses the transfer_output_files setting to indicate that HTCondor should transfer specific files back after the job completes, HTCondor will put the job in the held state if that output is missing.  If condor_q -analyze is run on the job, this is indicated in the error message:

[sthapa@login01 osg-stash_chirp-bigData]$ condor_q -analyze 372993.0
-- Submitter: login01.osgconnect.net : <192.170.227.195:56174> : login01.osgconnect.net
---
372993.000:  Request is held.
Hold reason: Error from glidein_9371@compute-6-28.tier2: STARTER at 10.3.11.39 failed to send file(s) to <192.170.227.195:40485>: error reading from /wntmp/condor/compute-6-28/execute/dir_9368/glide_J6I1HT/execute/dir_16393/outputfile: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <192.84.86.100:50805>

The important parts are the message indicating that HTCondor couldn't transfer the job's output back (SHADOW failed to receive file(s)) and the part just before it that gives the name of the file or directory HTCondor couldn't find.  This failure is usually due to the application encountering an error while executing and exiting before writing its output files.  If you think the error is transient and won't recur, you can run condor_release job_id to requeue the job.  Alternatively, you can use condor_ssh_to_job job_id to examine the job environment and investigate further.
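Using the held job from the output above as an example (and assuming the submit file named the missing file via transfer_output_files), the follow-up commands would look roughly like this:

# the hold was triggered because the submit file told HTCondor to expect the file, e.g.:
#   transfer_output_files = outputfile
# if the failure looks transient, release the held job so it is queued to run again
condor_release 372993.0
# to investigate further, ssh into the job's environment while it is executing
condor_ssh_to_job 372993.0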