Epylog messages¶
.idletime¶
Nov 30 11:09:03 exec6-3 HMDC_periodic_job_is_idle.get_current_idle_time:
Job 33727.0: Unable to read .idletime file: [Errno 2] No such file or
directory: '/tmp/condor/execute/dir_85231/.idletime'
Nov 30 11:09:03 exec6-3 HMDC_periodic_job_is_idle.get_last_check_time:
Job 33727.0: Failure reading last_check_time from#012.idletime file,
exception: [Errno 2] No such file or directory:
'/tmp/condor/execute/dir_85231/.idletime'
HMDC_periodic_job_is_idle.py is run periodically for every currently executing job. It calculates idle time by determining whether the currently executing job is idle and, if so, incrementing the job's idletime stored in $TEMP/.idletime. $TEMP refers to the job's execute directory under /tmp/condor/execute.
If $TEMP/.idletime does not exist, HMDC_periodic_job_is_idle.py will write this error to syslog and create $TEMP/.idletime with a value of 0. $TEMP/.idletime does not exist when a job first begins to execute, so this error is expected and can be ignored.
However, if the logged exception is something other than No such file or directory, Operations should create a new ticket for the RCE Development team.
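For reference, a minimal sketch of the behaviour described above (not the actual HMDC_periodic_job_is_idle.py source; the function name and logging details are illustrative):

import os
import syslog

def read_or_create_idletime(execute_dir):
    """Return the job's accumulated idle time, seeding .idletime with 0 if absent."""
    idletime_path = os.path.join(execute_dir, '.idletime')
    try:
        with open(idletime_path) as f:
            return int(f.read().strip() or 0)
    except (IOError, OSError) as e:
        # Expected on the first periodic check after a job starts:
        # log the error and create the file with a value of 0.
        syslog.syslog(syslog.LOG_WARNING,
                      'Unable to read .idletime file: %s' % e)
        with open(idletime_path, 'w') as f:
            f.write('0')
        return 0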
Note
HMDC_job_wrapper.py should create a blank $TEMP/.idletime upon job execution such that these errors aren't logged unnecessarily.
Unable to find email in either gecos or mail field¶
RceSubmit.esarmien.31662.HMDCLog.__rcelog_str__: _get_email(): Unable
to find email in either gecos or mail field. Investigate
rce_submit.py sets the Email field of the submitted ClassAd by looking for the user's email address first in the gecos field, then in the mail field. If an email address is found in neither, Operations should investigate in order to fix the erroneous LDAP record. If the LDAP record is found not to be erroneous, open a ticket with the RCE Development team.
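A minimal sketch of this lookup order, assuming the directory is queried with python-ldap; the server URI, search base, and attribute handling are illustrative, not taken from rce_submit.py:

import re
import ldap

def get_email(username, base_dn='ou=People,dc=example,dc=org'):
    conn = ldap.initialize('ldap://localhost')
    results = conn.search_s(base_dn, ldap.SCOPE_SUBTREE,
                            'uid=%s' % username, ['gecos', 'mail'])
    if not results:
        return None
    _, attrs = results[0]
    # Prefer an address found in gecos, then fall back to mail.
    for field in ('gecos', 'mail'):
        for value in attrs.get(field, []):
            match = re.search(r'[\w.+-]+@[\w.-]+', value.decode('utf-8'))
            if match:
                return match.group(0)
    return None  # neither field contained an address: investigate the LDAP record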
Unable to open job classad or machinead¶
Unable to open machinead from environment variable _CONDOR_MACHINE_AD:
{Exception}
Unable to open job classad from environment variable _CONDOR_JOB_AD:
{Exception}
HMDCWrapper.py opens the job classad and machine classad referenced by the environment variables that HTCondor sets, _CONDOR_MACHINE_AD and _CONDOR_JOB_AD. When a job executes, HTCondor writes the machine classad and the job classad to the job's execute directory under /tmp/condor/execute as .machine.ad and .job.ad respectively.
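A rough sketch of this lookup, assuming only that HTCondor exports these paths into the job's environment (the helper name is illustrative):

import os

def open_ad(env_var):
    """Open the classad file whose path HTCondor placed in env_var."""
    try:
        path = os.environ[env_var]
        with open(path) as f:
            return f.read()
    except (KeyError, IOError, OSError) as e:
        raise RuntimeError(
            'Unable to open ad from environment variable %s: %s' % (env_var, e))

machine_ad = open_ad('_CONDOR_MACHINE_AD')   # .machine.ad in the execute dir
job_ad = open_ad('_CONDOR_JOB_AD')           # .job.ad in the execute dir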
This is a critical error and you should never see it. If this error does occur, it could mean that free space on /tmp, the parent directory of all HTCondor execute directories, is exhausted and HTCondor is unable to write these files. If you discover that /tmp is indeed full, clear out the problematic files and ask the impacted users to resubmit their jobs.
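One quick way to confirm whether /tmp is exhausted before escalating (a sketch using only the standard library; nothing here is specific to the RCE tools):

import os

st = os.statvfs('/tmp')
free_bytes = st.f_bavail * st.f_frsize
total_bytes = st.f_blocks * st.f_frsize
print('/tmp: %d of %d bytes free' % (free_bytes, total_bytes))
if free_bytes == 0:
    print('/tmp is full: clear out problematic files and ask users to resubmit.')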
If /tmp is not full, this error indicates a more serious problem. Investigate and create a ticket in the RCE Development queue.
Note
See http://research.cs.wisc.edu/htcondor/manual/v8.4/2_5Submitting_Job.html#3382 for more information regarding environment variables set by HTCondor.
Encountered exception setting memory limits¶
Encountered exception setting memory limits: {Exception}
This is a critical error. Before a job executes, HMDCWrapper.py sets the appropriate ulimits on the job according to the slot's memory and cpu allocation. While the job will still succeed in executing, it will not be resource limited via ulimits, only by cgroups. If you discover this error, investigate immediately and create a ticket for RCE Development with the full output of the log.
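A minimal sketch of applying a memory ulimit from a slot's allocation; the MiB-to-bytes conversion and the choice of RLIMIT_AS are assumptions, not a copy of HMDCWrapper.py:

import resource
import syslog

def set_memory_limits(slot_memory_mb):
    try:
        limit_bytes = slot_memory_mb * 1024 * 1024
        resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))
    except (ValueError, resource.error) as e:
        # The condition reported by the log message above: the job still runs,
        # but is then limited only by cgroups.
        syslog.syslog(syslog.LOG_CRIT,
                      'Encountered exception setting memory limits: %s' % e)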
Note
We may do away with ulimits in the future and solely rely on cgroups. For now, this is a critical error.
Found job in history, terminated in error¶
find_job_and_status(): Found job {JobId} in history. Terminated in error.
This is usually not a critical error. This means that a user submitted a job which immediately exited. Unless the user submits a ticket, you can safely ignore this. The following could cause this error:
- An inappropriately sized memory or cpu request.
- Application crashes – if Matlab, XStata, or R crash upon startup for one reason or another, this log message will be produced.
- HTCondor misconfiguration – unlikely, although possible.
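If a user does open a ticket, one way to look the job up after the fact is to query condor_history for its exit status; the attributes shown are common ones to check, and the job id is illustrative:

import subprocess

def job_exit_info(job_id):
    # -limit 1 returns only the most recent matching record;
    # -af (autoformat) prints just the requested attributes.
    return subprocess.check_output(
        ['condor_history', job_id, '-limit', '1',
         '-af', 'JobStatus', 'ExitCode', 'ExitBySignal']).strip()

print(job_exit_info('33727.0'))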
Job took too long to start¶
Oct 21 13:55:35 dev-rce6-1
RceSubmit.esarmien.18147.HMDCLog.__rcelog_str__: run_app(): Job 148 took
too long to start:
Application=shell,Version=2.31.3,RequestMemory=9999,RequestCpu=9999
This is typically not a critical error and can be ignored, unless you notice a substantial number of these messages in epylog.
The above job took too long to start because it requested 9999 GiB of memory and 9999 CPUs. It is unlikely you’ll encounter such an obvious reason why a job took too long to start. Most often, jobs take too long to start when:
- There are not enough resources in the cluster to satisfy the job; run rce-info.sh to determine the amount of resources available.
- The user entered an extreme resource allocation request.
- There are no execute nodes joined to the cluster (use condor_status to determine whether machines have joined the cluster; see the sketch after this list) or all execute machines have START=FALSE.
- Other problems in HTCondor configuration or execute nodes.
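The sketch referenced in the list above: a quick way to see which execute machines have joined the pool and whether their START expression allows new jobs (Machine, State, and Start are standard machine-ad attributes):

import subprocess

output = subprocess.check_output(
    ['condor_status', '-af', 'Machine', 'State', 'Start'])
if not output.strip():
    print('No execute nodes have joined the cluster.')
else:
    print(output)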
Xpra took too long to start¶
run_app(): Job {jobid}, xpra took too long to start. Printing classad.
This is a critical error. When rce_submit.py submits a job, it polls the xpra server running in the job slot on the remote execute machine to determine whether it has started up. Upon startup, rce_submit.py launches an xpra client to connect to the launched xpra server.
If xpra took too long to start, this could mean that:
- The xpra log located in $HOME/.HMDC/interactive/{application}-{version}_{jobid}_{date}/out.txt does not contain a string like Using display number provided by xpra_Xdummy: :3. This is the string that rce_submit.py polls for to determine when xpra has launched. This string could change in newer versions of xpra, but it is unlikely, and checking this is part of the RCE Cluster Tools development process.
- There is an xpra error in $HOME/.HMDC/interactive/{application}-{version}_{jobid}_{date}/err.txt. Note that xpra writes all output to stderr, so you will need to sift through the log and determine which errors are pertinent.
- There was an X server error. Check the appropriate $HOME/.xpra/Xorg.:{display}.log.
While investigating, please also create a ticket for the RCE Development team.
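A simplified sketch of the polling behaviour described above: watch the job's out.txt for the display-ready marker and give up after a timeout (the timeout, interval, and function name are illustrative):

import time

def wait_for_xpra(out_txt,
                  marker='Using display number provided by xpra_Xdummy',
                  timeout=120, interval=5):
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with open(out_txt) as f:
                if marker in f.read():
                    return True
        except (IOError, OSError):
            pass  # the log has not been written yet
        time.sleep(interval)
    return False  # "xpra took too long to start"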
Encountered unknown exception¶
All exceptions should be caught, so this message should never appear. If you see it, attempt to investigate, but also open a ticket for the RCE Development team.
Encountered exception while removing LocalJobDir¶
Encountered exception while removing LocalJobDir: {dir}
All jobs submitted using rce_submit.py have an extra ClassAd element, LocalJobDir, which typically resolves to $HOME/.HMDC/interactive/{application}-{version}_{jobid}_{date}/. Stdout, stderr, and xpra logs are written here. When a job exits successfully, this directory is removed. If this directory cannot be removed successfully, the preceding log message will be produced.
Note
Run condor_q -autoformat LocalJobDir -constraint 'HMDCUseXpra =?= True' for an example.
While this is not a critical error, it should be investigated as soon as time permits. Most likely, it means that a user has corrupted the permissions on the .HMDC directory in their home directory.
Error sending email notification¶
Error sending email notification: {notification}
This error is produced when HMDC_periodic_job_is_idle.py is unable to send an email to the job owner regarding the job's impending preemptibility. This is not a serious error, but it should be investigated. Operations should manually notify the user of their job's impending preemptibility.
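For illustration, a hedged sketch of the kind of notification involved; the sender address, SMTP host, and message text are placeholders, not the script's actual values:

import smtplib
from email.mime.text import MIMEText

def send_idle_notification(recipient, job_id):
    msg = MIMEText('Your RCE job %s has been idle and may soon become '
                   'preemptible.' % job_id)
    msg['Subject'] = 'RCE job %s may soon be preempted' % job_id
    msg['From'] = 'rce-ops@example.org'
    msg['To'] = recipient
    try:
        smtp = smtplib.SMTP('localhost')
        smtp.sendmail(msg['From'], [recipient], msg.as_string())
        smtp.quit()
    except smtplib.SMTPException as e:
        print('Error sending email notification: %s' % e)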