CMS Dashboard Job Monitoring

Condor jobs submitted via CMS-Connect are automatically reported to the CMS Dashboard, much as CRAB jobs are. A basic report that requires no action from the user is sent by default, but users are encouraged to provide a few parameters in their submission workflows so that the report also covers, for example, stage-in, stage-out and full error code management.


The reporting procedure is done in 2 steps:

  1. Report from the Submission Machine
    The whole task is registered and sent to Dashboard from the submission machine when condor_submit is run.
  2. Report from the Worker Node
    Each job is reported once it has been assigned to an available machine and executed on it.
    Unlike regular CRAB workflows, CMS-Connect users define their own submission scripts (as in any regular Condor workflow), so tasks like stage-in, stage-out and error code management are implemented and handled by each user. For this reason, only a few parameters are reported by default without any further action from the user.

Basic Report (Default)

The basic report is handled by the CMS-Connect wrappers and requires no user-side action. This report includes the following:

  • Start and End time of report
  • Executable CPU and WallClock time
  • Executable exit code
    Note that if the user submits a wrapper on top of the executable, the wrapper's exit code and times are reported unless the user specifies these values explicitly (see Full Report).
  • Hostname of machine where the job was executed
  • Computing Element Name

See the Full Report section to report stage-in/stage-out times and exit codes or the number of events in the job, or to override some of the default parameters.

Full Report

The following parameters can be specified by the user in order to report more advanced parameters from the worker node to the CMS Dashboard. The only requirement is to print out such parameters in the format:

Parameters report format
PARAMETER = VALUE
# Example: Print this out at the end of your job to report the number of events in it.
CMS_DASHBOARD_N_EVENTS = 5000
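
For jobs driven by a Python script, the format above can be produced with a small helper. This is only a minimal sketch: the function name report_to_dashboard and the event count are hypothetical, and it assumes the report is picked up from the job's standard output, which the text above implies but does not state explicitly.

```python
import sys


def report_to_dashboard(parameter, value, stream=None):
    """Print one 'PARAMETER = VALUE' line and return it.

    Assumption: the line is read from the job's standard output.
    """
    line = f"{parameter} = {value}"
    # Flush so the line is not lost if the job exits right afterwards.
    print(line, file=stream if stream is not None else sys.stdout, flush=True)
    return line


if __name__ == "__main__":
    # Hypothetical event count; a real job would take it from its own output.
    n_events = 5000
    report_to_dashboard("CMS_DASHBOARD_N_EVENTS", n_events)
```

Calling the helper once per parameter at the end of the job keeps all reported values in one place.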

 

The following table lists the parameters that can be reported from the user side, together with the default values used in the basic report.

Parameter                         Description
CMS_DASHBOARD_N_EVENTS            Number of events in the job. Default: 0.
CMS_DASHBOARD_EXE_WC_TIME         Executable wall clock time. Default: Condor executable WC time.
CMS_DASHBOARD_EXE_CPU_TIME        Executable CPU time. Default: Condor executable CPU time.
CMS_DASHBOARD_EXE_EXIT_CODE       Executable exit code. Default: Condor executable exit code.
                                  Note: The user might want to override the default values for EXE_WC_TIME, EXE_CPU_TIME and EXE_EXIT_CODE in cases where, e.g., the Condor executable is just a user wrapper running the actual executable.
CMS_DASHBOARD_STAGEOUT_SE         Storage Element name. Default: unknown.
CMS_DASHBOARD_STAGEOUT_EXIT_CODE  Stage-out exit code.
CMS_DASHBOARD_STAGEOUT_TIME       Stage-out time.
CMS_DASHBOARD_JOB_EXIT_CODE       Job exit code. Default: Executable exit code.
                                  Users can report their own job exit codes to handle the overall completion state of the job.
CMS_DASHBOARD_JOB_EXIT_REASON     Job exit reason. Default: Empty.
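
When the Condor executable is a user wrapper, the wrapper itself can time the actual executable and override the defaults listed above. The following is a hedged sketch rather than an official CMS-Connect tool: run_and_time and build_report are hypothetical names, times are reported as whole seconds (the unit is an assumption, since it is not stated here), and the rule for deriving the job exit code is just one possible choice.

```python
import os
import subprocess
import sys
import time


def run_and_time(command):
    """Run a command and return (exit_code, wall_seconds, cpu_seconds)."""
    cpu_before = os.times()
    wall_before = time.monotonic()
    exit_code = subprocess.call(command)
    wall_seconds = time.monotonic() - wall_before
    cpu_after = os.times()
    # children_user/children_system accumulate the CPU time of child
    # processes that have been waited for (always zero on Windows).
    cpu_seconds = (
        (cpu_after.children_user - cpu_before.children_user)
        + (cpu_after.children_system - cpu_before.children_system)
    )
    return exit_code, wall_seconds, cpu_seconds


def build_report(exe_result, stageout_result=None, storage_element="unknown"):
    """Return 'PARAMETER = VALUE' lines for one job.

    Each result is an (exit_code, wall_seconds, cpu_seconds) tuple.
    """
    exe_code, exe_wall, exe_cpu = exe_result
    report = {
        "CMS_DASHBOARD_EXE_EXIT_CODE": exe_code,
        "CMS_DASHBOARD_EXE_WC_TIME": round(exe_wall),
        "CMS_DASHBOARD_EXE_CPU_TIME": round(exe_cpu),
    }
    job_code = exe_code
    if stageout_result is not None:
        so_code, so_wall, _ = stageout_result
        report["CMS_DASHBOARD_STAGEOUT_SE"] = storage_element
        report["CMS_DASHBOARD_STAGEOUT_EXIT_CODE"] = so_code
        report["CMS_DASHBOARD_STAGEOUT_TIME"] = round(so_wall)
        # Assumed policy: a failed stage-out fails an otherwise good job.
        if exe_code == 0:
            job_code = so_code
    report["CMS_DASHBOARD_JOB_EXIT_CODE"] = job_code
    return [f"{name} = {value}" for name, value in report.items()]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: wrapper.py <actual executable> [arguments...]
    result = run_and_time(sys.argv[1:])
    for line in build_report(result):
        print(line, flush=True)
    sys.exit(result[0] if result[0] >= 0 else 1)
```

A stage-out command can be timed with the same run_and_time call and passed as stageout_result, together with the Storage Element name the job actually wrote to.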

You can follow this Twiki link to find more information about job monitoring with CMS Dashboard.

Historical View Example

Example CMS-Connect jobs reported to Dashboard.

http://dashboard.cern.ch/cms/