What Matters in an Asynchronous Job Queue
An asynchronous job queue is a key element of system scalability. Job queues are particularly well suited to websites where an HTTP request triggers actions that may take longer than a second or two, and where immediate results aren't necessarily required.
Important Properties of a Job Queue
There are several properties of such a queue system that have various levels of importance. Everybody has a different take on the levels of importance of each property. I'm going to list the properties that I find important and why here. I expect lots of people to disagree, and that's perfectly fine as I'd like to see more people's perspectives.
A Single Job is Handled by a Single Worker
This one is seemingly obvious. When a job is enqueued, I want it to be picked up by one worker.
Note that this is not universally true, however. In large systems (for example, at Google), a given job may be handed to more than one worker at a time to ensure it gets done in a timely manner. That approach is more reliable, but it's very hard to reach that level of reliability, because not all jobs are idempotent. If you had a job that formats and sends mail to a bunch of recipients, you would want to make sure the part that sends the email is not done more than once.
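For the mail example, one sketch of an idempotency guard: record each recipient as it's delivered in a store shared by workers (a plain set here stands in for a durable table), so a duplicate run repeats the cheap formatting step but never the send. The names here are made up for illustration.

```python
# Sketch: making a mail job safe to run more than once.
# `sent` stands in for a durable store (e.g. a DB table) shared by workers.
sent = set()

def send_mail_job(recipients, render, deliver):
    for addr in recipients:
        if addr in sent:        # a duplicate run skips this recipient
            continue
        body = render(addr)     # formatting is safe to repeat...
        deliver(addr, body)     # ...delivering is not, so record it
        sent.add(addr)

calls = []
send_mail_job(["a@x", "b@x"], lambda a: "hi " + a,
              lambda a, b: calls.append(a))
send_mail_job(["a@x", "b@x"], lambda a: "hi " + a,
              lambda a, b: calls.append(a))  # second run delivers nothing new
```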
Different Jobs May be Handled by Different Workers
I have different classes of workers dedicated to performing different jobs. These workers may grow independently of each other, and in some cases, get rewritten in different languages for various reasons.
I do often have "general" queues that can process many types of jobs and just shove everything in there, but having the ability to split off dedicated workers has been critical to me in certain applications.
Priority Queues
I've never deployed a worker queue and not needed to start prioritizing jobs. Some jobs are responsible for fanning out (creating more jobs) and should really run while the queue is near empty. For some jobs, timeliness is important, so I'd like to request that they happen fairly soon. Some jobs are just expected to be bigger and slower, so I toss them in at a lower priority.
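The three cases above map naturally onto numeric priorities. A minimal in-process sketch using Python's heapq, where a lower number means more urgent (the job names are invented for illustration):

```python
import heapq
import itertools

# Sketch of a prioritized queue: lower number = more urgent.
# The counter preserves FIFO order among jobs of equal priority.
counter = itertools.count()
heap = []

def put(job, priority):
    heapq.heappush(heap, (priority, next(counter), job))

def take():
    return heapq.heappop(heap)[2]

put("resize-huge-video", priority=10)  # big and slow: low priority
put("fan-out-followers", priority=0)   # creates more jobs: run first
put("send-receipt", priority=1)        # timely, but after the fan-out
```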
Delayed Jobs
Similarly to priorities, being able to push a job in with a delay has been useful in a couple of circumstances.
My #1 reason to delay a job is a temporary failure. That may be because something is broken in a way that I expect will be fixed later, or because of an inability to acquire a lock or similar scarce resource.
By pushing the job back into the queue with a delay, I can do the jobs I can do without having to wait for this job to become available.
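A sketch of that behavior: each job carries a "ready at" timestamp, and take() skips anything whose delay hasn't elapsed, so the rest of the work keeps flowing. This is an in-process illustration, not a real server; the `now` parameter exists only to make the example deterministic.

```python
import heapq
import time

# Sketch: a delayed job is pushed with a "ready at" timestamp; take()
# only hands out jobs whose time has come, so other work proceeds.
heap = []

def put(job, delay=0.0, now=None):
    ready_at = (now if now is not None else time.time()) + delay
    heapq.heappush(heap, (ready_at, job))

def take(now=None):
    now = now if now is not None else time.time()
    if heap and heap[0][0] <= now:
        return heapq.heappop(heap)[1]
    return None

put("needs-lock", delay=30, now=100.0)  # temporary failure: retry later
put("other-work", now=100.0)
```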
Introspection
Introspection is key to monitoring.
- Are there enough workers right now?
- Is a job queue growing faster than consumers can consume things?
- Is the job queue shrinking at all?
- Are some jobs getting stuck?
There are lots of health-related questions that you'll want answers to as you make more of your processing asynchronous.
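Most of those questions reduce to watching queue depth over time. A toy version of the "growing vs. shrinking" check from two depth samples; the function name and return values are my own, not from any particular queue's API:

```python
# Sketch: answering "is the queue growing faster than workers consume?"
# from two depth samples taken `interval` seconds apart.
def queue_trend(depth_then, depth_now, interval):
    rate = (depth_now - depth_then) / interval  # jobs per second
    if rate > 0:
        return "growing"    # add workers, or look for stuck jobs
    if depth_now == 0:
        return "drained"
    return "shrinking"
```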
Blocking Delivery
This is one that I've been seeing missing from a lot of queuing systems and it just baffles me. If I ask for a job, and there isn't one available, can I just wait? Having to poll is not acceptable. I see this kind of code a lot:
    while True:
        job = queue.ask_for_a_job()
        if job:
            process_job(job)
        else:
            time.sleep(sleep_time)  # CODE SMELL!
Sleep is for humans. A sleeping process is a waste of resources. The sleep is there because, without it, this becomes a fast infinite loop (with network IO) that taxes both the client and the server just to see if something's changed. sleep_time is a value that balances how much latency you're willing to tolerate in your jobs against how much of a burden you want to place on your network, client, and server.
Consider the same code with a fully-blocking queue:
    while True:
        process_job(queue.ask_for_a_job())
In addition to being less code, this makes much better use of resources, gets the jobs done at a much lower latency, and overall makes the world a better place.
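Python's standard-library queue shows the shape of blocking delivery: get() parks the consumer until a job arrives, with no polling loop. This is a local, in-process sketch; a networked job queue would block on a socket the same way.

```python
import queue
import threading

q = queue.Queue()
results = []

def worker():
    while True:
        job = q.get()        # blocks until a job arrives: no sleep, no polling
        if job is None:      # sentinel to shut down cleanly
            break
        results.append(job.upper())

t = threading.Thread(target=worker)
t.start()
q.put("resize")
q.put("notify")
q.put(None)
t.join()
```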
Don't get me wrong: long polling, or even quick polling, is OK in some applications. It should be an option, not a technological constraint.
Must Handle Worker Crashes
If a worker takes a job and then crashes, should the job get done? This is a really important question that I think a lot of people who design worker queues ignore, yet it's the most common type of failure I ever see.
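One common answer, and the one beanstalkd takes, is a reservation with a time-to-run (TTR): if the worker doesn't confirm completion within the TTR, the server hands the job out again. The sketch below borrows "reserve" and "ttr" from beanstalkd's vocabulary (its completion command is actually delete; `ack` here is my stand-in), and is illustrative bookkeeping, not its implementation.

```python
# Sketch: reserve/ack with a time-to-run (TTR). A job that a crashed
# worker never acknowledges becomes ready again after its deadline.
ready = ["send-mail"]
reserved = {}               # job -> deadline by which it must be acked

def reserve(now, ttr=60):
    expired = [j for j, deadline in reserved.items() if deadline <= now]
    for j in expired:       # crashed worker: the job becomes ready again
        del reserved[j]
        ready.append(j)
    if not ready:
        return None
    job = ready.pop(0)
    reserved[job] = now + ttr
    return job

def ack(job):               # worker finished: the job is gone for good
    del reserved[job]
```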
Properties that Don't Matter (As Much As You Think)
Since I see these things come up a lot, I'm going to argue against them. If just one person decides not to implement another queue focused on the wrong properties, my work here won't be fruitless.
An Existing Protocol
I've lost count of how many queue systems I've seen written to the memcached protocol. It's just wrong. You simply can't achieve the properties I consider important in a queue with a protocol designed for simple key/value caching.
Both starling and memcacheq attempt to solve the same problem the wrong way. Both require clients to poll the servers for new jobs. Neither has positive job-completion acknowledgments, crash handling, priorities, or delays, nor any room for them, because of the desire to remain compatible with memcached client libraries.
It's just not worth losing all of this just for the sake of not coming up with a new protocol.
Queue Durability
I'm a pretty big fan of beanstalkd. I see a lot of people decide it's not well suited for their environments because it doesn't keep its queue across restarts.
I won't argue that a queue should never be durable, but I will restate that this isn't what has ever caused me to lose a job. People assume queue durability is what makes a queue system reliable, but that's just completely wrong.
Consider the starling case again. It keeps the queue on disk, so you can enqueue an item, crash the server, and the next get will return your job. Awesome.
Now grab an item out of a queue and kill the worker that currently owns the job. I've yet to crash a beanstalkd, but workers crash or restart all the time: every code deploy, a memory leak or similar bug, a broken DB, an unavailable lock server (or just an unavailable lock).
Job workers are just like web servers in our environment. We don't want to care if they crash occasionally.
What's Right for You?
There are many options from a simple DB table to JMS. beanstalkd meets all of my requirements (and in the areas where it didn't, I've modified it to do so).
If you absolutely need queue durability, I'm sure a solution with minimal overhead would be a welcome contribution. Otherwise, make sure that you don't lose job durability in the process.
But whatever you do, please, don't build yet another one on memcached.