Open Science Grid Overview
The Open Science Grid (OSG) promotes science by:
- enabling a framework of distributed computing and storage resources
- making available a set of services and methods that enable better access to ever increasing computing resources for researchers and communities
- providing resource sharing principles and software that enable distributed high throughput computing (DHTC) for users and communities at all scales.
The OSG facilitates access to DHTC for scientific research in the US. The resources accessible through the OSG are contributed by the community, organized by the OSG, and governed by the OSG Consortium; an overview is available at An Introduction to OSG. In July 2013, OSG is comprised of about 126 institutions with ~120 active sites that collectively support usage of ~2,000,000 core hours per day. Up-to-date usage metrics are available at the OSG Usage Display.
Cores that are not being used at any point in time by the owning communities are made available for shared use by other researchers; this usage mode is called opportunistic access. OSG supports OSG Connect users by providing a virtual cluster that forms an abstraction layer to access the opportunistic cores in the distributed OSG infrastructure. This interface allows OSG Connect users to view the OSG as a single cluster to manage their jobs, provide the inputs and retrieve the outputs. OSG Connect users access the OSG via the OSG Connect login host as discussed elsewhere in this ConnectBook.
Computation that is a good match for OSG
High throughput workflows with simple system and data dependencies are a good fit for OSG. The HTCondor manual has an overview of high throughput computing.
Jobs submitted into the OSG will be executed on machines at several remote physical clusters. These machines may differ in terms of computing environment from the submit node. Therefore it is important that the jobs are as self-contained as possible by generic binaries and data that can be either carried with the job, or staged on demand. Please consider the following guidelines:
- Software should preferably be single threaded, using less than 2 GB memory and each invocation should run for 4-12 hours. There is some support for jobs with longer run time, more memory or multi-threaded codes. Please contact the support listed below for more information about these capabilities. System level check pointing, such as the HTCondor standard universe, is not available. Application level check pointing, for example applications writing out state and restart files, can be made to work on the system.
- Compute sites in the OSG can be configured to use pre-emption, which means jobs can be automatically killed if higher priority jobs enter the system. Pre-empted jobs will restart on another site, but it is important that the jobs can handle multiple restarts.
- Binaries should preferably be statically linked. However, dynamically linked binaries with standard library dependencies, built for a 64-bit Red Hat Enterprise Linux (RHEL) 5 machines will also work. Also, interpreted languages such as Python or Perl will work as long as there are no special module requirements.
- Input and output data for each job should be < 10 GB to allow them to be pulled in by the jobs, processed and pushed back to the submit node. Note that the OSG Virtual Cluster does not currently have a global shared file system, so jobs with such dependencies will not work.
- Software dependencies can be difficult to accommodate unless the software can be staged with the job.
The following are examples of computations that are not good matches for OSG:
- Tightly coupled computations, for example MPI based communication, will not work well on OSG due to the distributed nature of the infrastructure.
- Computations requiring a shared file system will not work, as there is no shared filesystem between the different clusters on OSG.
- Computations requiring complex software deployments are not a good fit. There is limited support for distributing software to the compute clusters, but for complex software, or licensed software, deployment can be a major task.
No shared file system!
The OSG Connect resource is a HTCondor pool overlay on top of OSG resources. The pool is dynamically sized based on the demand, the number of jobs in the queue, and supply, resource availability at the OSG resources. It is important to note that OSG Connect does not have a shared file system. This means that your jobs will have to bring executables and input data. HTCondor can transfer the files for you, but you will have to identify and list the files in your HTCondor job description file.
Most of the clusters in OSG are running Red Hat Enterprise Linux (RHEL) 5 or 6, or some derivative thereof, on an x86_64 architecture. For your application to work well in this environment, it is recommend that the application is compiled on a similar system, for example on the OSG Connect login system: login01.osgconnect.net. It is also recommended that the application be statically linked, or alternatively dynamically linked against just a few standard libraries. What libraries a binary depends on can be checked using the Unix
ldd command-line utility:
$ ldd a.out
a.out is a static executable
In the case of interpreted languages like Python and Perl, applications have to either use only standard modules, or be able to ship the modules with the jobs. Please note that different compute nodes might have different versions of these tools installed.
How to get help using OSG Connect
OSG Connect users of OSG may get technical support by contacting OSG User Support staff at email email@example.com.