Dec. 26, 2008, 10:05 p.m.

Stack Overflow is a Time Machine

I've been using stackoverflow a bit lately, and there are definitely some smart people there. A lot of what I find reminds me of darker times, though.

For example, just about every time anything about revision control comes up, people talk about how awesome this new subversion thing is.

move to svn?

I've been using DVCS for almost a decade now, so I receive the idea of moving towards subversion with a bit of shock.

There also seems to be a bit of a... n00b overflow. Some of the popular questions are really newbie. Like, what's with these arrays (and why is it worth 11,000 views)?

what's with the arrays?

Or perhaps this hot question with over 4,000 views:

the truth is out there

I've had all but one of the questions I've asked answered, which is nice, but the deeper questions that seem more interesting don't get a whole lot of views. Most of the hottest questions are really fluffy.

It's still a good resource, but you've gotta work a bit to keep it from being a frustrating time sink.

Dec. 6, 2008, 8:48 p.m.

Simple Named Job Deduplication

Our Problem

We would like content on our web site available in our search engine as soon after the save as possible. Our search engine is decentralized in that every front-end has a copy of the search index and searches locally. This architecture allows searches to scale quite horizontally, but does so at the cost of simple index updates.

With a centralized search index, we could just push a modification into the central server and be done with it. With our architecture, we need a worker machine to build a search index and distribute it to all of the front-end machines.

Historically, we just had a cron job that'd occasionally rebuild and ship the index. Later, we started trying to keep track of what had changed and doing incremental updates.

Eventually, I figured out it'd be less work and faster if we just sent object changes into the job queue and had the index builder pick these little changes up and ship them to the web servers. This worked quite well for a while.

This became suboptimal when a bunch of content editors were rapidly making changes on a small development system with a couple nodes running in VMWare. The actual index distribution would just kill the machine due to IO on what ended up being the same disks.

The Idea

I wanted to keep the rapid update properties while trying to reduce IO. The obvious thing would be to try to aggregate multiple index updates into a single index distribution.

The Implementation

The first thing that had to be done, of course, was to break the job into two parts:

  1. The index update.
  2. The index distribution.

Now the trick is to ensure that for every index update, there is at least one index distribution without there being an index distribution for every update.

The easy way to do this is to define a job to have a "run after" date. This works very well in things such as index distribution. In this case, we've built an index and we want to make sure that the results of that index build make it to production. The job we queue will do it, but if these jobs block on each other, then any other job that runs after the time the index build completed will do.

So where do we track the timestamps so a job knows it doesn't need to run? Well, memcached ends up being a perfect place for this.

We give each job a name and a "run after" parameter, and store the "last run" timestamp in memcached under the job name. Really simple, and allows us to create as many of these jobs as we need.

Ruby code for doing this looks something like this:

def run_after_cb(name, timestamp, ob, method, *args)
  k = "jobts_#{name}"
  # Treat any memcached error as "never run" so the job still executes.
  t = cache[k] rescue 0
  if t && t > timestamp
    # A run since `timestamp` already covered this work -- log or something
  else
    # Capture the start time *before* running, so changes that land
    # mid-run aren't mistakenly considered covered.
    nt = Time.now.to_i
    ob.send(method, *args)
    begin
      cache[k] = nt
    rescue
      # Can't record a new date (next job will run even if unnecessary)
    end
  end
end

memcached is often the last thing I'd recommend for any sort of thing that isn't exactly a cache, but the semantics fit quite well here. This is treated as an optimization such that only when we know for sure that a distribution is redundant will we drop it.

Specifically, the index will be distributed under the following conditions:

  1. When the key is not in the cache (never seen, dropped, etc...)
  2. When the key is found, but the date is in the past.
  3. When any error occurs when trying to talk to memcached.
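The same logic can be sketched in Python, with a plain dict standing in for memcached (the names and signature here are illustrative, not the production code):

```python
import time

cache = {}  # stands in for memcached; real entries can vanish at any time

def run_after_cb(name, timestamp, job):
    """Run `job` unless a run since `timestamp` has already covered it."""
    key = "jobts_" + name
    last_run = cache.get(key, 0)  # a missing key behaves like "never run"
    if last_run > timestamp:
        return False  # a newer run already covered this work; drop it
    started = int(time.time())
    job()
    cache[key] = started  # record the start time, not the completion time
    return True

done = []
t1 = int(time.time()) - 10           # update 1 completed ten seconds ago
run_after_cb("dist", t1, lambda: done.append("dist 1"))
t2 = t1 + 5                          # update 2 also completed before dist 1 ran
ran = run_after_cb("dist", t2, lambda: done.append("dist 2"))
print(done, ran)  # only the first distribution actually runs
```

Recording the start time rather than the completion time is what keeps the check safe: anything that changes while a distribution is in flight still triggers the next one.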

Example

Imagine five index updates, each requiring a distribution occurring in the following scenario:

Dedup Example

Shortly after completing the first index update, a distribution will start. Content updated in update 2 and update 3 will not be included in this distribution.

After dist 1 completes, dist 2 is ready to go for update 2. update 2 completed at t2, and the timestamp recorded by dist 1 is t1, which is earlier, so we start dist 2.

Because dist 2 begins at t4, it naturally includes the effects of update 2 and update 3, but not those of update 4, which had not completed when dist 2 began.

Now for a bit of imagination because I'm too lazy to draw this better.

Although it's not illustrated here, it should be clear that the next distribution would be dist 3 (queued by update 3 for updates after t3). That one would be dropped because its effects have already been distributed.

Next would be dist 4 which would have been queued from update 4 and that one would not be dropped, but dist 5 would be.

In this example, we distributed our index three times for five updates. In practice, this helps quite a bit -- especially when things start getting slow and the distributions are backing up anyway.

Oct. 16, 2008, 10:16 p.m.

Git Tag Does the Wrong Thing by Default

I'm writing this because I don't think anyone is actually aware of it, but I keep seeing it show up in various projects.

When you have done all the cool work you want to do and get ready to tag it, you may think the right thing to do is this:

# THIS IS WRONG!
git tag 1.0

...but that does not create a tag object. A tag is a special kind of object: it has a date, a tagger (author), its own ID, and optionally a GPG signature. The command above creates what's called a "lightweight" tag, which is just a ref pointer, more like a branch than a tag. If you've used mercurial before, you can liken this to hg tag -l, which creates a "local" tag.

The right way to create a tag is to make either an annotated (-a) or signed (-s) tag:

# This is the right way!
git tag -a 1.0

A signed tag works the same way, but cryptographically signs the tag with your private GPG key.

# This is also the right way!
git tag -s 1.0

Why does this matter, you ask? Because a real tag also works with things like git describe -- which is very useful when you're rolling releases.

You can see the difference here:

dustinmb:/tmp/project 549% git init
Initialized empty Git repository in /private/tmp/project/.git/
dustinmb:/tmp/project 550% touch afile
dustinmb:/tmp/project 551% git add afile 
dustinmb:/tmp/project 552% git commit -m 'added afile'
Created initial commit d1e6305: added afile
 0 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 afile
dustinmb:/tmp/project 553% git describe 
fatal: cannot describe 'd1e6305e4d8e00cf5f6f9cd5143ab96fb3451f0d'
dustinmb:/tmp/project 554% git tag 1.0
dustinmb:/tmp/project 555% git describe 
fatal: cannot describe 'd1e6305e4d8e00cf5f6f9cd5143ab96fb3451f0d'
dustinmb:/tmp/project 556% git tag -am 'Rolled the annotated version' 1.0-a
dustinmb:/tmp/project 557% git describe 
1.0-a
dustinmb:/tmp/project 558% git show 1.0-a
tag 1.0-a
Tagger: Dustin Sallings <dustin@spy.net>
Date:   Thu Oct 16 22:26:28 2008 -0700

Rolled the annotated version
commit d1e6305e4d8e00cf5f6f9cd5143ab96fb3451f0d
Author: Dustin Sallings <dustin@spy.net>
Date:   Thu Oct 16 22:25:46 2008 -0700

    added afile

diff --git a/afile b/afile
new file mode 100644
index 0000000..e69de29
dustinmb:/tmp/project 559% git tag -sm 'Rolled the signed version.' 1.0-s

You need a passphrase to unlock the secret key for
user: "Dustin Sallings (primary) <dustin@spy.net>"
1024-bit DSA key, ID 43E59D54, created 2003-01-18

dustinmb:/tmp/project 560% git show 1.0-s
tag 1.0-s
Tagger: Dustin Sallings <dustin@spy.net>
Date:   Thu Oct 16 22:27:12 2008 -0700

Rolled the signed version.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Darwin)

iEYEABECAAYFAkj4IjAACgkQeWDnv0PlnVTS7gCggImUJawC+cNEppCQ9bTtw+MZ
Nq4An2Vr7gbUAUDEQY97P1hwKK8cehfW
=Al8E
-----END PGP SIGNATURE-----
commit d1e6305e4d8e00cf5f6f9cd5143ab96fb3451f0d
Author: Dustin Sallings <dustin@spy.net>
Date:   Thu Oct 16 22:25:46 2008 -0700

    added afile

diff --git a/afile b/afile
new file mode 100644
index 0000000..e69de29
dustinmb:/tmp/project 561% git for-each-ref 
d1e6305e4d8e00cf5f6f9cd5143ab96fb3451f0d commit refs/heads/master
d1e6305e4d8e00cf5f6f9cd5143ab96fb3451f0d commit refs/tags/1.0
2b521732e0717c9f3f27330133be95284a059252 tag    refs/tags/1.0-a
13cf45a9eca3d3b4b3d1405588f9d6e551515a89 tag    refs/tags/1.0-s

You can pretty clearly see the difference here. One is not a tag, and the other two are. The two that are will work happily with describe and just overall make more sense.

So please, use -a with your tags and make describe and related tools happy.

Oct. 5, 2008, 12:57 a.m.

Using Git Bundle When Your Central Repo Fails

I have an application of mine I did a bunch of work on tonight and wanted to deploy that work on the VPS that runs it.

Unfortunately, after pushing my changes to github, I found that I couldn't pull from this box. I don't know whether it was because of some port filtering stuff or a broken machine at github. It was a great opportunity to try out git bundle, though.

What is Git Bundle?

A bundle is a way to put a bunch of changesets into a file so you can exchange them out of band while maintaining object IDs (as opposed to a git format-patch/git am sequence).

How Do I Make One?

In my case, my remote tree had change 23b730, but I had several changes beyond that, and my normal means of moving code around (git:// from github) wasn't working. Creating a bundle to package up all the changes and blobs after 23b730 was pretty straightforward:

% git bundle create /tmp/cmd.git 23b730..

That created the file /tmp/cmd.git.

Neat, Now How Do I Use It?

First, get that file where you need it. Email it, put it on a web server, scp it, whatever. Once there, you unbundle it into the target repo using the git bundle unbundle command. Here's my example:

% git bundle unbundle /tmp/cmd.git 
0b6bf526dc3c9544288444dbe7eb58c7d091038e HEAD

Note that 0b6bf52 is my dev head I'm wanting to deploy. This does not change any of your branches; it only shoves the objects into the git object database. You can either reset your branch or merge it at this point. I chose a merge (which, as I expected, was a fast-forward):

% git merge 0b6bf52

Now you're done. Code is deployed and all's well.

Oct. 4, 2008, 1:02 p.m.

What Matters in an Asynchronous Job Queue

An asynchronous job queue is a key element of system scalability. Job queues are particularly well-suited for web sites where an HTTP request requires some actions to be performed that may take longer than a second or two and where immediate results aren't necessarily required.

Important Properties of a Job Queue

There are several properties of such a queue system that have various levels of importance. Everybody has a different take on the levels of importance of each property. I'm going to list the properties that I find important and why here. I expect lots of people to disagree, and that's perfectly fine as I'd like to see more people's perspectives.

A Single Job is Handled by a Single Worker

This one is seemingly obvious. When a job is enqueued, I want it to be picked up by one worker.

Note that this is not universally true, however. In large systems (for example, at google), a given job may be handed to more than one worker at a time to ensure it gets done in a timely manner. That's obviously more reliable, but it's very hard to do correctly, because not all jobs are idempotent. If you had a job that formats and sends mail to a bunch of recipients, you'd want to make sure the part that sends the email is not done more than once.
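One way to cope with possible duplicate delivery is to make the dangerous step idempotent yourself, keyed on something unique to the job. A minimal sketch; the names are mine, and in production the `sent` set would live in shared storage, not process memory:

```python
sent = set()  # hypothetical shared store of already-delivered messages

def send_mail_once(message_id, recipient, deliver):
    """Deliver at most once per (message, recipient), even if the job runs twice."""
    key = (message_id, recipient)
    if key in sent:
        return False  # a duplicate run of the same job; skip the send
    deliver(recipient)
    sent.add(key)
    return True

outbox = []
send_mail_once("m1", "a@example.com", outbox.append)
send_mail_once("m1", "a@example.com", outbox.append)  # second worker, same job
print(outbox)  # one delivery despite two runs
```

In a real shared store you'd want an atomic add-if-absent (memcached's add, for instance) rather than this check-then-set, which races between workers.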

Different Jobs May be Handled by Different Workers

I have different classes of workers dedicated to performing different jobs. These workers may grow independently of each other, and in some cases, get rewritten in different languages for various reasons.

I do often have "general" queues that can process many types of jobs and just shove them all in there, but having the ability to split off dedicated workers has been critical to me in certain applications.

Priority Queues

I've never deployed a worker queue and not needed to start prioritizing jobs. Some jobs are responsible for fanning out (creating more jobs) and should really happen nearer an empty queue. For some jobs, timeliness is important, so I'd like to request that they should happen fairly soon. Some jobs are just expected to be bigger and slower so I toss them in at a lower priority.
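The ordering itself is easy to picture with Python's standard PriorityQueue, where (as in beanstalkd) a lower number means run sooner. The job names are made up:

```python
from queue import PriorityQueue

q = PriorityQueue()
q.put((100, "rebuild big report"))  # big and slow; fine at low priority
q.put((0, "fan-out job"))           # creates more jobs; should run near-empty-queue
q.put((10, "send notification"))    # timeliness matters

# Jobs come back in priority order, regardless of insertion order.
order = [q.get()[1] for _ in range(3)]
print(order)
```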

Delayed Jobs

Similarly to priorities, being able to push a job in with a delay has been useful in a couple of circumstances.

My #1 reason to delay a job is a temporary failure. This may be because something is broken in a way that I expect will be fixed later, or because of an inability to acquire a lock or similar scarce resource.

By pushing the job back into the queue with a delay, I can do the jobs I can do without having to wait for this job to become available.
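A delay is just a "not ready before" time attached to the job. A minimal sketch using a heap; the API names here are mine, not beanstalkd's:

```python
import heapq
import itertools
import time

seq = itertools.count()  # tie-breaker so jobs never compare to each other
pending = []             # heap of (ready_at, seq, job)

def put(job, delay=0):
    heapq.heappush(pending, (time.time() + delay, next(seq), job))

def reserve():
    """Return the next ready job, or None while everything is still delayed."""
    if pending and pending[0][0] <= time.time():
        return heapq.heappop(pending)[2]
    return None

put("update search index")
put("retry: couldn't get the lock", delay=30)  # temporary failure; push it back
first = reserve()
second = reserve()
print(first, second)  # the delayed retry stays out of the way for 30 seconds
```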

Introspection

Introspection is key to monitoring.

There are lots of health-related questions that you'll want answers to as you make more of your processing asynchronous.

Blocking Delivery

This is one that I've been seeing missing from a lot of queuing systems and it just baffles me. If I ask for a job, and there isn't one available, can I just wait? Having to poll is not acceptable. I see this kind of code a lot:

while True:
    job = queue.ask_for_a_job()
    if job:
        process_job(job)
    else:
        time.sleep(sleep_time) # CODE SMELL!

Sleep is for humans. A sleeping process is a waste of resources. The reason the sleep is there is because this becomes a fast infinite loop (with network IO) without it. It's taxing on the client and the server just to see if something's changed. sleep_time is a value that balances how much latency you're willing to have in your jobs and how much of a burden you want on your network, client, and server.

Consider the same code with a fully-blocking queue:

while True:
    process_job(queue.ask_for_a_job())

In addition to being less code, this makes much better use of resources, gets the jobs done at a much lower latency, and overall makes the world a better place.

Don't get me wrong, long polling, or even quick polling is OK in some applications. It should be an option, not a technological constraint.

Must Handle Worker Crashes

If a worker takes a job and then crashes, should the job get done? This is a really important part I think a lot of people who design worker queues ignore, but it's the most common type of failure I ever see.
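beanstalkd's answer is a reserve/delete cycle with a time-to-run (TTR): a job that a worker reserves but never deletes goes back on the ready queue once the TTR expires. A toy in-memory model of that behavior (this is not beanstalkd, just an illustration of the semantics):

```python
import time

class ToyQueue:
    """Models reserve/delete with a TTR; not a real queue server."""
    def __init__(self, ttr):
        self.ttr = ttr
        self.ready = ["build index"]
        self.reserved = {}  # job -> deadline for the worker to finish

    def reserve(self):
        now = time.time()
        # Return to ready any job whose worker blew its deadline (crashed?).
        for job, deadline in list(self.reserved.items()):
            if now > deadline:
                del self.reserved[job]
                self.ready.append(job)
        if not self.ready:
            return None
        job = self.ready.pop(0)
        self.reserved[job] = now + self.ttr
        return job

    def delete(self, job):
        # Positive acknowledgment: only now is the job really gone.
        self.reserved.pop(job, None)

q = ToyQueue(ttr=0.01)
job = q.reserve()   # a worker takes the job...
time.sleep(0.02)    # ...and dies without ever calling delete()
job2 = q.reserve()  # another worker picks the same job back up
print(job, job2)
```

Note that a durable on-disk queue doesn't help with this failure at all: the job had already been handed out when the worker died.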

Properties that Don't Matter (As Much As You Think)

Since I see these things come up a lot, I'm going to argue against them. If just one person doesn't implement another queue focused on the wrong properties, my work here won't be fruitless.

An Existing Protocol

I can't remember how many queue systems I've seen written to the memcached protocol. It's just wrong. You simply can't achieve the properties I consider important in a queue with a protocol designed for simple key/value caching.

Both starling and memcacheq attempt to solve the same problem the wrong way. Both require clients to poll the servers for new jobs. Neither has positive job completion acknowledgments, crash handling, priorities, or delays, nor any room for them, because of the desire to remain compatible with memcached client libraries.

It's just not worth losing all of this just for the sake of not coming up with a new protocol.

Queue Durability

I'm a pretty big fan of beanstalkd. I see a lot of people decide it's not well-suited for their environments because it doesn't keep its queue across restarts.

I won't argue that a queue should never be durable, but I will restate that this isn't what's ever caused me to lose a job. People equate queue durability with a reliable queue system, but that's just wrong.

Consider the starling case again. It keeps the queue on disk, so you can enqueue an item, crash the server, and the next get will return your job. Awesome.

Now grab an item out of a queue and kill the worker (which currently owns the job). I've yet to crash a beanstalkd, but workers crash or restart every time code is deployed, or whenever there's a memory leak or similar bug, a broken DB, or an unavailable lock server (or just an unavailable lock).

Job workers are just like web servers in our environment. We don't want to care if they crash occasionally.

What's Right for You?

There are many options from a simple DB table to JMS. beanstalkd meets all of my requirements (and in the areas where it didn't, I've modified it to do so).

If you absolutely need queue durability, I'm sure a solution with minimal overhead would be a welcome contribution. Otherwise, make sure that you don't lose job durability in the process.

But whatever you do, please, don't build yet another one on memcached.

Sept. 25, 2008, 10:27 p.m.

Automating Git Bisection for Rails Apps

Bisection is an awesome strategy for finding the introduction of a flaw. The basic idea is to recognize a failure in a particular version of your code, find a version where the failure did not exist, and use the SCM to automate finding the change that introduced the bug.

I first used it in darcs a few years ago (where it's known as trackdown). mercurial and git both implement it as a bisect command.

While the concept is the same, the implementations vary across systems. In darcs, the trackdown command may only be used in an automated fashion (i.e. you have to write a test script), while in mercurial, the bisect command may only be used in an interactive fashion (i.e. you have to start bisection and manually test each revision as you go). git, however, supports both modes.

In practice, I find the darcs way generally preferable as it's faster (assuming you have a test ready) and harder to get wrong. Somehow, I manage to mark a revision as good when I mean bad or similar and have to start the whole thing over. In an automated mode, there's no thinking required.

Easy Case: An Existing Test Case That's Failing

If you have an existing test that's failing, you've got it quite easy. Find a version where it worked (we'll say HEAD~50) and just let it go:

% git bisect start HEAD HEAD~50
% git bisect run rake

That will spit out the change that caused the unit tests to start failing. If you've had multiple failures (or your tests are slow), you may want to tell it to just run a single test case:

% git bisect start HEAD HEAD~50
% git bisect run ruby test/unit/some_test.rb

Harder Case: Finding a Failure with a New Test

If the test didn't exist when the code was broken, bisection won't be helpful. I've found git stash to be very helpful in this case, however. Write the new, failing test case (that you believe would've succeeded before), and instead of committing it, just stash it (git stash) and write a quick shell script to run the test:

#!/bin/sh

git stash apply
ruby test/unit/modified_or_new_test.rb
rv=$?
git reset --hard
exit $rv

Once that script's in place (say /tmp/try.sh), you run the bisection as you normally would:


% git bisect start HEAD HEAD~50
% git bisect run /tmp/try.sh

A Really Hard Case: HTTP Request Needed to Show Problem

Recently, I had a bug in reloading a module in development mode that caused the second HTTP request sequence after a certain type of modification to attempt to reload a module that couldn't be reloaded. I wanted to bisect this, but I didn't want to use my browser and editor and stuff for every test during a bisection, so I automated it the following way:

#!/bin/sh

http_get() {
  curl -f -s $1 > /dev/null
  rv=$?
  echo "Requested $1 -> $rv"
  if [ $rv -ne 0 ]; then
    echo "Failed to fetch $1 (try #$2)"
    kill $pid
    exit $rv
  fi
}

http_sequence() {
  http_get http://127.0.0.1:3000/page1 $1
  http_get http://127.0.0.1:3000/page2 $1
  # [...]
}

# Start the dev server and capture the PID
./script/server &
pid=$!

# Give the server a chance to start before running sequences
waitforsocket 127.0.0.1 3000

http_sequence 1

touch app/[...]/somefile.rb

http_sequence 2

kill $pid
exit 0 # If we get this far, this version has no sequence bug

Using this script as the bisection command tracked down the first changeset with the reload issue very quickly and accurately. It's easy to adapt to anything where you want to actually make an HTTP request and inspect the traffic/server/log/whatever.

July 6, 2008, 3:26 p.m.

Ruby's HTTP Client Sucks

I implemented what's up in ruby since it seemed to have some of the best XMPP support I know of. I also got to learn the datamapper API, which is alright.

Ruby's HTTP client, however, really sucks. I have to imagine someone else has noticed how much it sucks, but I haven't found much discussion of it, or any solutions to the problems. For a really simple example, consider the following code:

#!/usr/bin/env ruby

require 'net/http'

u = URI.parse $*[0]
puts Net::HTTP.start(u.host, u.port) { |h| h.get u.path }

There are some URLs I simply can't get that thing to deal with. Examples: http://digg.com/ and http://bleu.west.spy.net/diggwatch/comments/dlsspy (these two fail in different ways, but work fine in browsers).

Is this really as bad as it seems to be, or am I just doing it wrong?

Update: Sun Jul 6 16:50:24 PDT 2008

In these two examples, there seems to be something wrong on the server side. digg gets very upset if you don't send a user agent and just kind of hangs for a while. That's...odd. diggwatch (a service I have running here) is emitting incorrect responses through lighttpd. It seems that lighttpd's proxy module is just...wrong (service works fine without lighttpd in the way).

July 6, 2008, 2:58 p.m.

What's Up? — An XMPP-based Web Monitor

I finally got around to building a useful web monitor for myself. I bring up little web services all over the place and generally do a bad job of ensuring they start correctly and/or continue to run correctly. I don't want to get paged when these things are broken, or have my email box flooded or anything, so XMPP seemed to be the right thing for me.

So yesterday I started writing the monitor for me. I called it what's up because I'm not very creative. You can try it out by sending an IM to whatsup@jabber.org. It can do a single get (a la down for everyone or just me) over IM, or it can periodically do requests and validations for you on a collection of URLs.

Resources

June 17, 2008, 10:32 p.m.

Good, Fast, Cheap? Eh, No Thanks

I had a conversation with a guy today about scaling his app. The app looks really simple and is still kind of below radar in popularity, but is expected to be growing for a couple different reasons. They're experiencing a bit of slowness, but nothing too out of control. I heard they just bought about seventy new servers to run the application.

Now, their app is a really simple read-mostly content spewing app. They're adding some more interactivity in it, but it's the kind of stuff that is more realtime-ish — the kind of stuff which I would absolutely not put a database in the critical path of. Admittedly, they don't have any experience with load balancing, caching, etc... However, they do have about seventy servers now.

I'm pretty sure I could meet their current load requirements on one really bored server, so I thought I'd offer some assistance. I suggested as much, showed examples of how I'd done stuff like this in the past, suggested that it was a really bad idea to buy so many machines, and asked what his thoughts were on EC2, since it'd be immediately cheaper and could almost instantly reach whatever scale they needed in the medium-to-long term.

This is where things got a little weird for me, to the point of distracting me away from the conversation about software architecture I'd intended to have toward talking about a lower level of scaling and general cost reduction. The guy said he didn't like the idea of EC2 for a number of reasons, which I'll iterate below.

  1. I don't want an EC2-based business
  2. I want to be running on my own hardware so I'll have more assets for a potential sell in 2-3 years
  3. What if Amazon jacks up the price a lot?
  4. What if Amazon decides to not run this service anymore?
  5. What if they have a huge traffic spike — can't be spinning up instances while they're being beaten down.
  6. I used EC2 before and it cost me $150/mo for an idle machine. Can you imagine multiplying that by 70?
  7. We've got lots of money, so cost efficiency isn't a problem

I found the list a little... backwards. I'll go into detail just in case any of them seem to make sense.

I Don't Want to Be an EC2-Based Business

His is not an EC2-based business. His is a software business that provides services to clients. It's a web site. Getting piles of hardware adds a lot of op ex as he develops a cost center for his business and distracts him from his core competency (which, as he said, is not scaling out hardware).

I Need the Assets to Sell My Business

This one I didn't understand at all. If there's significant value in commodity hardware 2-3 years from now, he'll be in a sad shape.

The interesting thing is that the hardware is already about two years old. He paid about $24k for two racks of 34 machines each. Best part, it came out of an Amazon cluster. So it's older than any machine you'd get in EC2, and if you're planning on being bought in 2-3 years, it'll be 4-5 year-old commodity hardware by then.

You may find after a year or so that newer hardware would result in a lower cost per request served around a time that you need to serve significantly more requests. New hardware would make a lot of sense then. You can throw away all these machines you got, or you can just start rebooting EC2 instances. I know which one I think is easier.

What if Amazon Raises the Price Significantly?

You move.

Right now, it's the cheapest way to deploy an app you want complete control over and want to be able to scale with demand. If it's not tomorrow, pick it up and take it elsewhere.

Here he's betting that Amazon is going to lock him in somehow and then screw him out of more than $24k. That's not the whole story, though. $24k is the acquisition cost. These 68 machines still have to be located somewhere, have connectivity into them, have redundant switches, redundant power supplies, careful management of distribution across different PDUs and switches to prevent local outages from taking you out, spare parts for when MTBF strikes, and lots of other hidden costs. While it's a valid way to do things if you know it'll be cheaper, the op ex is likely to be at least as high as Amazon's, but with a cap ex introduced.

What if There's a Huge Traffic Spike

In the EC2 model fronted by something like fuzed and a bit of preparation in creating a custom AMI, it'd be unlikely to take a full minute to add a node to a cluster. And as Jeff Bezos talked about at startup school, Animoto went from 50 to 3,500 servers in three days. You just can't do that with standard colocation practices.

In contrast, he wants to have all of his new machines running 24/7 in case of a traffic spike. When your traffic is low to normal, you're just burning cash. When the traffic is really high, you're just plain burning. A significant amount of new hardware can't even be acquired in a day even if you have a place to put it. Installation will be a pain. And when the huge spike is over, you'll just be burning cash even faster.

I Used EC2 Before and It Cost a Lot

He said he spent $150/mo on an idle machine in the past and felt that that was an incredible cost. Can you imagine that times 70? Um, no, I can't. Because you'd just be trying to give away money if you chose to run 70 idle high-cpu medium servers 24/7 for a month. Fundamentally, it's the same problem you'll have running them in house, but that always gets calculated differently for some reason. But if $150/mo is what it takes to run his app, then his new cluster will pay for itself in... a bit over thirteen years.

At low traffic times, he can probably survive on, say, three machines ($210/mo minimum). At high traffic times, let's imagine he needs up to 70 servers at peak ($7/hr during the peak hours). That's all that needs to go on.

But imagine for a moment he did need 70 servers running all the time. That's under $5k/mo for small servers (which I'm guessing is what this purchased hardware is, or is at least equivalent to). That means it'd take about five months for the new hardware to pay for itself if electricity, bandwidth, and maintenance were completely free.

But, again, the reality is that he'd be likely paying closer to $500/mo for long enough that by the time he ran up $24k worth of EC2 bills, the hardware would be completely obsolete.

But We're Not Concerned About Cost

Maybe not, but you're still trying to walk a fine line between having enough servers to handle the day everyone on digg and slashdot start using your product and having few enough that you can burn more money on making a good and fast product and less on trying to play with old computers.

With 68 machines, you'll probably have what appears to be a surplus today, so you can afford to be sloppy with your coding. You might find yourself running at 60% or so capacity earlier than you would if you were generally aware of cost. Now, given that cost isn't an issue, that itself isn't a problem. The real problem is what it's going to take to grow it beyond this.

In Conclusion

EC2 isn't right for everyone, but better, faster, and cheaper solutions make better, faster, and cheaper companies. If someone else will do the hard parts that fall outside of your core competency, you'd better have a really good reason to do them yourself.

What he's doing will work. I've seen terrible things work when I really just wish they wouldn't. It'll just be a lot harder and a lot more expensive than it needs to be. What are you going to do, though?

As for me, I run most of my software on old crappy computers at my house where I've got really bad connectivity. If you can even see this post, consider yourself lucky. Seeing as how I'm the only one who finds any of this interesting, it doesn't matter, though. :)

April 24, 2008, 8:51 p.m.

Gems on Github

Github is now a gem server. See the howto page for details.

This is a really big deal. It's not that we need yet another gem server, but what this means is that there's now a standard way to distribute your own custom variations of gems at quite nearly no effort.

For example, I might have a project that's built on sinatra, but requires a few minor tweaks to it that either haven't been, or won't be accepted upstream. All I've got to do is fork the project on github, and add a dependency for my version of the gem and we're good to go. I can keep mine up to date, or if the code from the fork is accepted, it can get pulled back in.

Github is ushering in open source philosophies so rapidly that we should really take a step back to appreciate what it's doing. This is the bazaar.

Critics have called github a throwback to centralized systems because it's, well, a hub. That's kind of a warped view, though. sourceforge tried to do it with a good degree of success, but they actually were centralized. Github is centralized the same way a farmer's market is — we all do our work in our own places with various backend tree swapping, but we show up at this one place to show off and trade our wares. Customers in this model need only pay in code for the right to make someone's project a little better.

This is open source 2.0.