Obsolete:Toolforge/Grid - Wikitech
This page contains historical information. It may be outdated or unreliable.
2024
The Toolforge Grid Engine was shut down in March 2024.
Tools not migrated to newer runtimes were shut down. For details, see News/Toolforge Grid Engine deprecation.
Every non-trivial task performed in Toolforge should be dispatched by the Grid Engine, which ensures that the job is run in a suitable place with sufficient resources.
The basic principle of running jobs is fairly straightforward:
1. You submit a job to a work queue from a submission server (for example login.toolforge.org).
2. The grid engine master finds a suitable execution host to run the job on, and starts it there once resources are available.
3. As it runs, your job will send output and errors to files until the job completes or is aborted.
Jobs can be scheduled synchronously or asynchronously, continuously, or simply executed once. If a continuous job fails, the grid will automatically restart the job so that it keeps going.
What is the grid engine?
The grid engine is a highly flexible system for assigning resources to jobs, including parallel processing. The Toolforge Grid Engine was implemented with Son of Grid Engine (an inactive open-source fork of Oracle Grid Engine, previously known as Sun Grid Engine). You can find more documentation on sourceforge.net and the archived Son of Grid Engine website.
WMCS is deprecating grid engine and replacing it with Kubernetes. Run your tool on the Kubernetes platform instead of grid engine.
Commonly used Grid Engine commands include:
jsub: Toolforge-specific wrapper for qsub that makes submitting a job much easier
qsub: submit jobs to the grid
qalter: modify job settings (while the job is waiting or running)
qstat: get information about a queued or running job
qacct: extract arbitrary accounting information from the cluster logfile (also after job termination; useful for debugging)
qdel: abort or cancel a job
You can find detailed information about these commands in the Grid Engine Manual.
The Grid Engine commands are very flexible but a little complex at first; you might prefer to use the helper scripts (jsub, jstart, jstop) described in more detail in the next sections.
Submitting simple one-off jobs using 'jsub'
Jobs can be submitted to the work queue with either Grid Engine’s 'qsub' command or the 'jsub' helper script, which is simpler to use and described in this section. (For information about qsub, please see the Grid Engine Manual.)
To run a job on demand (on a schedule via cron, for instance, or from a web tool or the command line), simply use the 'jsub' command:
$ jsub [options…] program [args…]
By default, jsub will schedule the job to be run as soon as possible, and print the eventual output to files (‘jobname.out’ and ‘jobname.err’) in your home directory. Unless a job name is explicitly specified with jsub options, the job will have the same name as the program, minus extensions (e.g., if you have a program named foobot.pl and start it with jsub, the job's name will be foobot).
Once your job has been submitted to the grid, you will receive output similar to the line below, which includes the job id and job name.
Your job 120 ("foobot") has been submitted
Example:
The following example uses the jsub command to run mybot.sh as the tool shtest. The 'qstat' command returns job status information. By default, job output is placed in the 'mybot.out' and 'mybot.err' files in the home directory.
$ ssh login.toolforge.org
$ become shtest
$ ls
logs mybot.sh public_html replica.my.cnf
$ jsub mybot.sh
Your job 8326234 ("mybot") has been submitted
$ qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
8326234 0.00000 mybot tools.shtest qw 09/11/2019 19:41:43 1
$ qstat
$ ls
logs mybot.err mybot.out mybot.sh public_html replica.my.cnf
$ cat mybot.out
user_editcount
1363
jsub options
In addition to a number of customized options, jsub supports many, but not all, qsub options:
Option: Behavior
-stderr: Send errors that occur during job submission to stderr rather than the error output file (errors that occur while running the script are always sent to the error file).
-mem value: Request value amount of memory for the job, where value is a number suffixed by 'k', 'm' or 'g'. The default is 256m. For more information, please see Allocating additional memory.
-once: Run the named job only once; fail the job if a job by the same name is already running or queued. For more information, please see Running a job only once.
-continuous: Start a self-restarting job on the continuous queue. For more information, please see Submitting continuous jobs (such as bots) with 'jstart'.
-quiet: Suppress output if the job has been submitted successfully (if set, cron jobs will not send mail on successful submission).
-i, -o, and -e (qsub options): Select the file used for the standard input, output, and error streams of the job, respectively. By default, jsub will append stdout and stderr to the files jobname.out and jobname.err in the tool account's home directory, and the job will not have standard input. If a directory is given for '-o' or '-e', new files jobname.ojobid and jobname.ejobid are created there for each job.
-j y (qsub option): Send standard output and error together to the output file.
-sync y (qsub option): Normally, jsub queues up the job and returns immediately. The '-sync y' option waits for the job to complete instead. For more information, please see Synchronizing jobs.
-cwd (qsub option): Start the script in the same directory you invoked jsub from (for more info, see the qsub docs).
-N jobname (qsub option): Specify a job name. The default is the name of the program run, without extension. For more information, please see Naming jobs.
-M user@host (qsub option): Send email to the specified address.
-m b|e|a|s|n (qsub option): Defines under which circumstances mail is to be sent to the job owner or to the users defined with the '-M' option. Possible arguments are:
    b - Mail is sent at the beginning of the job.
    e - Mail is sent at the end of the job.
    a - Mail is sent when the job is aborted or rescheduled.
    s - Mail is sent when the job is suspended.
    n - No mail is sent.
Run jsub --help and man jsub to learn more.
Naming jobs
The job name identifies the job and can also be used to control it (for example to suspend or stop it). By default, jobs are assigned the name of the program or script, minus its extension. For instance, if you started a program named 'foobot.pl' with jsub, the job's name would be 'foobot'.
It's important to note that you can have more than one job, running or queued, bearing the same name. Some of the utilities that accept a job name may not behave as expected in those cases.
Specify a different name for the job using jsub’s -N option:
$ jsub -N NewName program [args…]
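As a rough illustration of the default naming rule described above, the following sketch derives a job name from a program path. The function name default_job_name is hypothetical and not part of jsub, and the assumption that only the final extension is removed is ours; jsub's own logic may differ.

```shell
# Hypothetical helper, not part of jsub: approximates the default job name.
# Assumption: only the last extension is dropped (foobot.pl -> foobot).
default_job_name() {
    base=${1##*/}               # strip any leading directory components
    printf '%s\n' "${base%.*}"  # strip the final ".ext", if present
}

default_job_name foobot.pl      # prints: foobot
```

A name produced this way can be shared by several jobs, which is why passing an explicit -N is the safer choice when more than one instance may exist.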
Allocating additional memory
By default, jobs are allowed 256 MB of memory (the default stated for jsub's '-mem' option); you can request more (or less) with jsub’s '-mem' option (or qsub's '-l h_vmem=memory'). Keep in mind that a job that requests more resources may be penalized in its priority and may have to wait longer before being run until sufficient resources are available.
$ jsub -mem 500m program [args…]
For example, running a PHP script which requires 350 MB of memory to work properly:
$ jsub -mem 350m php i_like_more_ram.php
Synchronizing jobs
By default, jobs are processed asynchronously in the background. If you need to wait until the job has completed (for instance, to do further processing on its output), you can add the '-sync y' option (the 'y' stands for 'yes') to the jsub command:
$ jsub -sync y program [args...]
Running a job only once
If you need to make certain that the job isn't running multiple times in parallel (such as when you invoke it from a crontab), you can add the '-once' option. If the job is already running or queued the grid engine will simply mark the failed attempt in the error file and return immediately.
$ jsub -once program [args...]
Quoted arguments
Jsub and qsub always strip quotes in the arguments of a job. If the arguments include any special shell characters like spaces, "|" or "&", the job submission will likely fail, even when the arguments are given quoted to jsub (see phab:T50811). For instance, with jsub myScript.php "Foo bar", myScript.php may only see $argv as "Foo" and not the expected "Foo bar".
A workaround is to use two layers of quotes:
$ jsub myScript.php \''Foo bar'\'
Alternatively, you can create a wrapper script, for example runMyScript.sh, that contains
php myScript.php "Foo bar"
and call that with
$ jsub sh runMyScript.sh
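The effect of the two-layer quoting can be reproduced without the grid. The sketch below uses a hypothetical function, reparse, that joins its arguments and hands them to a second shell. This is only one plausible way for a layer of quotes to be lost and is an assumption, not a description of how jsub or qsub work internally.

```shell
# Hypothetical stand-in for a tool that re-parses its arguments in a
# second shell. "$*" joins the arguments with spaces; the inner sh then
# splits the resulting string again, printing one argument per line.
reparse() {
    sh -c "printf '%s\n' $*"
}

# One layer of quotes: the inner shell sees three words.
reparse myScript.php "Foo bar"

# Two layers: the literal single quotes survive the first parse and
# protect the space during the second one, so two words arrive.
reparse myScript.php \''Foo bar'\'
```

The first call prints myScript.php, Foo and bar on separate lines; the second prints myScript.php and Foo bar.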
Specifying an operating system release
The only operating system release currently available on the Toolforge grid is Debian Buster. Stretch was deprecated in 2022. Trusty was deprecated on Monday 2019-03-25.
Prior to 14 March 2017, there were two different versions of Ubuntu in use on Toolforge: Ubuntu 12.04 ('precise') and Ubuntu 14.04 ('trusty'). The -l release=... option to jsub allowed a tool to choose which release to execute under. This option is currently not needed, but may be useful again in the future when multiple Linux distributions are available simultaneously.
Submitting continuous jobs (such as bots) with 'jstart'
Continuous jobs, such as bots, have a dedicated queue ('continuous') which is set up slightly differently from the standard queue:
Jobs started on the continuous queue are automatically restarted if they, or the node they run on, crash
In case of outage or lack of resources, continuous jobs will be stopped and restarted automatically on a working node
Only tool accounts can start continuous jobs
Continuous jobs are not restarted if they end normally (with the exit status 0)
For convenience, the jstart script (which accepts all the jsub options) facilitates the submission of continuous jobs:
$ jstart [options…] program [args…]
The jstart script will start the program in continuous mode (if it is not already running), and ensure that the program keeps running.
The jstart script is exactly equivalent to:
$ jsub -once -continuous [options…] program [args…]
jsub's '-once' option is important for ensuring that the job can be managed reliably with the job and jstop utilities. The '-continuous' option ensures that the job will be restarted automatically until it exits normally with an exit value of zero, indicating completion.
Bigbrother (Deprecated)
Bigbrother was a job monitoring tool that has been decommissioned. It watched jobs specified in a .bigbrotherrc file and restarted them if they were not running.
If you're using Grid jobs that are sometimes killed even when running in the continuous queue (when started with jstart), you could consider having a simple shell script watch over them, like the following:
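#!/bin/bash

set -o pipefail

# Expect exactly two arguments: the job name and the command to run.
if [ $# -ne 2 ]; then
    echo "Usage: $0 <jobname> <command>"
    exit 1
fi

JOBNAME=$1
COMMAND=$2

# Print a message prefixed with an ISO 8601 timestamp.
function log {
    echo "$(date -Iseconds) $1"
}

# Return 0 (restart needed) when no job matching JOBNAME is listed by qstat.
# Only the first 10 characters of the name are compared.
function restart_needed {
    if /usr/bin/qstat | awk '{ print $3 }' | grep "${JOBNAME:0:10}" >/dev/null; then
        return 1
    else
        return 0
    fi
}

# COMMAND is left unquoted so that it can carry its own arguments.
function submit_job {
    /usr/bin/jstart -N "$JOBNAME" $COMMAND
}

if restart_needed; then
    log "Restarting job '$JOBNAME' ('$COMMAND')"
    submit_job
fi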
Save this to a bigbrother.sh file (or any other name you want) in the tool's home directory and add a crontab entry to trigger it:
*/5 * * * * /data/project/tool_name/bigbrother.sh my_job /data/project/tool_name/my_command.sh
Cron will trigger the bigbrother.sh script every 5 minutes. The script will look for a job named my_job and, if it's not running, it will run the command you specify.
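The check at the heart of the watcher can be tried without a grid by feeding it saved qstat output. The sketch below is a hypothetical variant named job_listed that reads qstat-style text on standard input. It assumes the layout shown elsewhere on this page (two header lines, job name in the third column) and, like the script above, compares only the first 10 characters of the name. Unlike the script's grep, it only matches at the start of the name.

```shell
# Hypothetical helper: succeeds if qstat-style output on stdin lists a job
# whose name starts with the first 10 characters of NAME.
# Assumptions: two header lines, job name in column 3.
job_listed() {
    awk -v want="$1" '
        NR > 2 && index($3, substr(want, 1, 10)) == 1 { found = 1 }
        END { exit found ? 0 : 1 }
    '
}
```

With a live grid this would have been used as qstat | job_listed my_job, restarting the job only when the function fails.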
Note that Bigbrother was never necessary for web services. The webservice system uses a built-in system called "manifest monitors" to provide similar functionality automatically.
Managing Jobs
Each job submitted to the grid has a unique job id as well as a job name (which will not be unique if you have more than one instance running). The name and id identify the job, and can also be used to retrieve information about its status.
If you don’t know the job id, you can find it with either the ‘job’ command or the ‘qstat’ command. Both of these commands can also be used to return additional status information, as described in the next sections.
Finding a job id and status with the ‘job’ command
If you know that your job has only one instance running (if you used the -once option when starting it, for example) you can use the ‘job’ command to get its job id:
tools.xbot@tools-login:~$ job xbot
717898
Use the job command’s -v (‘verbose’) option to return additional status information:
tools.xbot@tools-login:~$ job -v xbot
Job 'xbot' has been running since 2013-04-01T21:00:00 as id 717898
The verbose response is particularly useful from scripts or web services.
Once you know the job id, you can use the ‘qstat’ command to return additional information about it. See
Returning the status of a particular job
for more information.
Using ‘qstat’ to return status information
The ‘qstat’ command returns detailed information about the status of queued jobs. If you know the job id of a particular job, you can use qstat’s ‘-j’ option to return information about that job. If you use the ‘qstat’ command without options, it will return the status of all your currently running and pending jobs. More information about running qstat without options and with the -j option is included in the following sections. For more information about qstat in general, please see the Grid Engine Manual.
Returning the status of all your queued jobs
To see the status of all of your running and pending jobs (including the job number), use the ‘qstat’ command without options. ‘qstat’ will then return the job id, priority, name, owner, state (e.g., r(unning) or s(uspended)), the date and time the job was submitted or started, and the name of the assigned job queue (e.g., continuous) for each job.
For example:
tools.xbot@tools-login:~$ qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
120 0.50000 xbot tools.xbot r 04/01/2013 21:00:00 continuous@tools-exec-01.pmtpa 1
Common job states include:
r (running)
qw (queued/waiting)
d (deleted)
E (error)
s (suspended)
See the
Grid Engine Manual
for a complete list of states and abbreviations.
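For scripts that report job status to people, the state abbreviations listed above can be spelled out with a small lookup. This is a minimal sketch covering only the five states named on this page; the function name describe_state is hypothetical. Grid Engine can also report combined codes (for example Eqw), which this sketch simply reports as unknown.

```shell
# Hypothetical helper: translate a single qstat state code into words.
# Only the states listed on this page are covered.
describe_state() {
    case $1 in
        r)  echo "running" ;;
        qw) echo "queued/waiting" ;;
        d)  echo "deleted" ;;
        E)  echo "error" ;;
        s)  echo "suspended" ;;
        *)  echo "unknown ($1)" ;;
    esac
}

describe_state qw   # prints: queued/waiting
```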
Returning the status of a particular job
If you know the job id of a job, you can find out more information about it using the 'qstat' command's ‘-j’ option.
qstat will only show information about currently running jobs. For historical jobs, use qacct (which may take minutes to return information).
For example, the following command returns detailed information about job id 990.
tools.toolname@tools-login:~$ qstat -j 990
==============================================================
job_number: 990
exec_file: job_scripts/990
submission_time: Wed Apr 13 08:32:39 2013
owner: tools.toolname
uid: 40005
group: tools.toolname
gid: 40005
sge_o_home: /data/project/toolname/
sge_o_log_name: tools.toolname
sge_o_path: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/X11R6/bin
sge_o_shell: /bin/bash
sge_o_workdir: /data/project/toolname
sge_o_host: tools-login
account: sge
stderr_path_list: NONE:NONE:/data/project/toolname//taskname.err
hard resource_list: h_vmem=256m
mail_list: tools.toolname@tools-login.pmtpa.wmflabs
notify: FALSE
job_name: epm
stdout_path_list: NONE:NONE:/data/project/toolname//taskname.out
jobshare: 0
hard_queue_list: task
env_list:
script_file: /data/project/toolname/taskname.py
usage 1: cpu=00:21:08, mem=158.09600 GBs, io=0.00373, vmem=127.719M, maxvmem=127.723M
Common shell exit code numbers (returned, for example, by qacct) are listed below. There are no standard exit codes aside from 0 meaning success, and a non-zero code doesn't necessarily mean failure.
| exit_status | Meaning | Example | Comments |
|---|---|---|---|
| 0 | Success | | No errors, meaning success |
| 1 | Catchall for general errors | let "var1 = 1/0" | Miscellaneous errors, such as "divide by zero" and other impermissible operations |
| 2 | Misuse of shell builtins (according to Bash documentation) | empty_function() {} | Missing keyword or command |
| 126 | Command invoked cannot execute | /dev/null | Permission problem or command is not an executable |
| 127 | "command not found" | illegal_command | Possible problem with $PATH or a typo |
| 128 | Invalid argument to exit | exit 3.14159 | exit takes only integer args in the range 0 - 255 |
| 128+n | Fatal error signal "n" | kill -9 $PPID of script | $? returns 137 (=128+9) |
| 128+2=130 | Script terminated by Control-C | Ctrl-C | Control-C generates SIGINT which is fatal error signal 2 |
| 128+9=137 | Process terminated by kernel (no further signal handling performed) | kill -9 $PPID of script | Kernel immediately terminates any process sent this signal, generating SIGKILL which is fatal error signal 9 |
| 128+11=139 | Segmentation fault (kernel killed process due to segfault) | | E.g. the program accessed an unassigned memory location, generating SIGSEGV which is fatal error signal 11 |
| 255 | Exit status out of range | exit -1 | exit takes only integer args in the range 0 - 255 |
See the
signal(.h) man pages
for a more comprehensive list of the values ("n") of the possible fatal error signals (SIG...) issued by the kernel.
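The 128+n convention in the table can be applied mechanically. The following sketch, with the hypothetical name explain_exit_status, turns a numeric exit status into a short description. It follows the table's convention only; a program is free to exit with a value above 128 on its own, so the result is a hint rather than a diagnosis.

```shell
# Hypothetical helper: interpret an exit status using the 128+n convention.
explain_exit_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "success"
    elif [ "$status" -gt 128 ] && [ "$status" -lt 255 ]; then
        # 128+n conventionally means "terminated by signal n"
        echo "terminated by signal $((status - 128))"
    else
        echo "exited with status $status"
    fi
}

explain_exit_status 137   # prints: terminated by signal 9
```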
Stopping jobs with ‘qdel’ and ‘jstop’
If you started a job with the 'jstart' command, or if you know there is only one job with the same name, then you can also use the 'jstop' utility command with the job name to stop it:
jstop job_name
You can also use the underlying ‘qdel’ command with a job’s number or name:
qdel job_number/job_name
This will also delete matching jobs that have only been queued, but not started yet. Note that if you specify a 'job_name', all queued or running jobs with that name are deleted.
If you do not know the job number, you can find it using the ‘qstat’ command.
Stuck jobs
In some cases, jobs can get stuck on a host. This happens, for example, if the job somehow does not respond to SIGSEGV and continues running. In these cases, try the following steps:
1. Find the host the job is running on: run qstat -xml and find the relevant queue_name. The part after the @ is the host the job is running on.
2. SSH to that host, e.g. ssh tools-webgrid-generic-1404.tools.eqiad1.wikimedia.cloud
3. Find all your running jobs using ps ux.
4. Kill them: kill <pid>, where <pid> is the number in the second column of ps ux.
5. Check if the jobs have been killed with ps ux. If not, try again, but using kill -9 <pid>.
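The first step, taking the part of queue_name after the @, is a one-line string operation. A minimal sketch with the hypothetical name host_of_queue:

```shell
# Hypothetical helper: print the host part of a "queue@host" name.
# If there is no "@", the input is printed unchanged.
host_of_queue() {
    printf '%s\n' "${1#*@}"
}

host_of_queue continuous@tools-exec-01.pmtpa   # prints: tools-exec-01.pmtpa
```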
Concurrency limits
Tracked in Phabricator task T67777.
Maximum of 16 active jobs simultaneously allowed per tool user.
The scheduler will hold additional job submissions in the qw (queued/waiting) state until an active slot is available.
Maximum of 50 active and queued jobs simultaneously allowed per tool user.
The scheduler will reject additional job submissions by exiting with a status code of 25 and writing "Unable to run job: job rejected: only 50 jobs are allowed per user (current job count: 50)" to stderr.
Tracked in Phabricator task T123270.
Implementing these limits has allowed us to enable job submission from the continuous and task job queues.
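A submitting script can treat exit status 25 as "try again later" rather than as a hard failure. The sketch below is a generic retry wrapper under that assumption. The function name retry_on_limit and its parameters are hypothetical, and it works with any command, so it can be exercised without a grid.

```shell
# retry_on_limit MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND and, if it exits with status 25 (the
# "job rejected" status described above), waits and tries again, up to
# MAX_TRIES attempts. Any other exit status is returned immediately.
retry_on_limit() {
    max_tries=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        if "$@"; then
            return 0
        else
            status=$?
        fi
        if [ "$status" -ne 25 ] || [ "$attempt" -ge "$max_tries" ]; then
            return "$status"
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}
```

On the grid this might have been invoked as retry_on_limit 5 300 jsub -once program, with a generous delay so that the retries themselves do not add load.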
Scheduling jobs at regular intervals with cron
To schedule jobs to be run on specific days or at specific times of day, you can use cron to submit the jobs to the grid.
Scheduling a command more often than every five minutes (e.g. * * * * * command) is highly discouraged, even if the command is "only" jsub. In these cases, you very probably want to use 'jstart' instead. The grid engine ensures that jobs submitted with 'jstart' are automatically restarted if they exit with a non-zero status.
Creating a crontab
Crontabs are set (as on any Unix system) using crontab -e or crontab FILE.
Please be aware that any submitted crontab is automatically edited to send jobs to the grid directly (by prepending a default jsub invocation unless the cron entry already has one).
If your cron entry only includes a brief script that itself sends any real work to the grid, you may skip that automatic invocation by prepending jlocal, which explicitly marks it as a local job. Any script or job invoked with jlocal should not run for more than a few seconds and should use minimal resources; misuse of that feature may have severe impact on general reliability for all users and is not allowed.
Implementation detail: Because the $PATH environment variable is set differently for interactive shells and cron jobs, please be aware that the crontab command is a symbolic link to a special executable (/usr/bin/oge-crontab) which creates your crontab in a special way so that it is correctly recognized by Toolforge. If you run /usr/bin/crontab directly, that is the local crontab command, which will NOT create a crontab in the grid (it will create only a local crontab on the server you're on at the moment and, since cron does not run on all servers, nothing will run based on your crontab). In other words, just use crontab directly or, if you want to specify the full path, use /usr/local/bin/crontab.
Specifying time zones
The ‘tools’ project, like other hosting environments, uses the time zone UTC (to view UTC time, just run date). If you need to schedule a job for another time zone, you can specify so in the crontab.
For example, to schedule a job for midnight in Germany, you can use the crontab line:
0 22,23 * * * [ "$(TZ=Europe/Berlin date +\%H)" = "00" ] && jsub ...
The above crontab line instructs the system to check at 22:00 UTC (23:00 CET and 0:00 CEST) and at 23:00 UTC (0:00 CET and 1:00 CEST) whether it is midnight in Berlin, and if so, to call jsub.
Note that you can't just replace "Berlin" with "Hamburg"; the values for TZ are limited to those found at /usr/share/zoneinfo. If you're unsure what the offset of your time zone to UTC is, you can run the check hourly by replacing 22,23 with *.
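The hour test inside that crontab line can be moved into a small function, which is easier to try out by hand. The sketch below uses the hypothetical name hour_in_zone. The optional second argument (seconds since the epoch) exists only to make the function testable and relies on GNU date's -d @N syntax, which is an assumption about the host. Note that inside a crontab the percent sign must be written as \%, as in the line above, while in a script it is a plain %.

```shell
# hour_in_zone ZONE [EPOCH]
# Hypothetical helper: print the two-digit hour (00-23) in time zone ZONE,
# either for the current time or for EPOCH seconds since 1970-01-01 UTC.
# Assumption: GNU date (for the "-d @EPOCH" form).
hour_in_zone() {
    if [ -n "$2" ]; then
        TZ=$1 date -d "@$2" +%H
    else
        TZ=$1 date +%H
    fi
}
```

A script run hourly by cron could then start with a guard such as [ "$(hour_in_zone Europe/Berlin)" = "00" ] || exit 0 before submitting its job.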
Note that the deployed version of crontab currently does not support CRON_TZ (phab:T208561).
FAQ
My shell script job fails with "Exec format error"
The program you want to execute must either be a binary executable or a script. In the latter case, it must contain a shebang line with the name of the interpreter (/usr/bin/perl, /usr/bin/python, etc.). For shell scripts that means in most cases the first line needs to be
#!/bin/bash
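Whether a script starts with a shebang can be checked before submitting it. This is a minimal sketch with the hypothetical name has_shebang. It only looks at the first two bytes of the file and does not verify that the named interpreter exists.

```shell
# Hypothetical helper: succeed if FILE begins with the two bytes "#!".
has_shebang() {
    [ "$(head -c 2 "$1")" = '#!' ]
}
```

For example, has_shebang mybot.sh || echo "mybot.sh has no shebang line" would flag a script likely to fail with "Exec format error".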
An error with "ascii" codepage, "file not found", or UnicodeEncodeError
Tracked in
Phabricator
Task T60784
Running a script can fail because of non-ASCII characters. Things to check:
Make sure that the script is saved in UTF-8 encoding.
The error may occur if a bash script was saved with Windows CRLF newlines; it must be saved in Unix LF format.
Set the LANG environment variable; you can add it to a bash script. Or add the -v LC_ALL=en_US.UTF-8 parameter to jsub.
For Python:
Set the LC_ALL environment variable by adding the -v LC_ALL=en_US.UTF-8 parameter to jsub (as with all jsub parameters, make sure to place it before the python command, or else it will be passed to the script and not parsed by jsub).
Run the script on Python 3 via the command $ python3 myscript.py. Or set the first line of the script to #!/usr/bin/env python3 and run it like $ ./myscript.py
Make sure that the line # coding: utf8 is at the beginning of the script.
Set the PYTHONIOENCODING environment variable, either through a bash script or by passing the -v PYTHONIOENCODING=UTF-8 argument to jsub.
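The CRLF problem mentioned above is easy to detect and repair with standard tools. The following sketch defines two hypothetical helpers: has_crlf reports whether any line of a file ends in a carriage return, and to_lf writes a copy without carriage returns to standard output. Note that to_lf removes every carriage return in the file, not only those at line ends.

```shell
# Hypothetical helper: succeed if any line of FILE ends with a carriage return.
has_crlf() {
    cr=$(printf '\r')
    grep -q "$cr\$" "$1"
}

# Hypothetical helper: print FILE with all carriage returns removed.
to_lf() {
    tr -d '\r' < "$1"
}
```

A typical repair would be to_lf mybot.sh > mybot.fixed.sh, followed by checking the result and replacing the original.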
Useful links
The following tools have been built by the Toolforge admin team to help others see grid engine job status:
sge-jobs.toolforge.org: web-based tool for statistics on current and past Grid engine jobs
sge-status.toolforge.org: status board of Grid engine nodes and the jobs they are currently running (also available as an API)
Communication and support
Support and administration of the WMCS resources is provided by the
Wikimedia Foundation Cloud Services team
and
Wikimedia movement volunteers
. Please reach out with questions and join the conversation:
Discuss and receive general support
Chat in real time in the
IRC channel
#wikimedia-cloud
connect
or the bridged
Telegram group
Discuss via email after you have subscribed to the
cloud@
mailing list
Stay aware of critical changes and plans
Subscribe to the
cloud-announce@
mailing list
(all messages are also mirrored to the
cloud@
list)
Read the
News
wiki page
Track work tasks and
report bugs
Use a subproject of the
#Cloud-Services
Phabricator
project to track confirmed bug reports and feature requests about the Cloud Services infrastructure itself
Read stories and WMCS blog posts
Read the
Cloud Services Blog
(for the broader Wikimedia movement, see the Wikimedia Technical Blog)
Notes
google about it
External links
Source code for sge-jobs.toolforge.org
Wikimedia Techblog: Toolforge and Grid Engine
Arturo Borrero González, Site Reliability Engineer, Wikimedia Cloud Services Team, March 14, 2022.