RockStarProgrammer - Full Posts
http://www.rockstarprogrammer.org/
Rants of an accidental Rock Star Programmer
en-us
Fri, 26 Dec 2008 22:05:00 -0800
Stack Overflow is a Time Machine http://www.rockstarprogrammer.org/post/2008/dec/26/stack-overflow-time-machine/
<p>I've been using <a href="http://stackoverflow.com/">stackoverflow</a> a bit lately, and there are definitely some smart people there. A lot of what I find reminds me of darker times, though.</p>
<p>For example, just about every time anything about revision control comes up, people talk about how awesome this new subversion thing is.</p>
<p><img alt="move to svn?" src="http://img.skitch.com/20081227-th9qadgbhduffrbxbybe3ehe5s.png" /></p>
<p>I've been using DVCS for almost a decade now, so I receive the idea of moving <em>towards</em> subversion with a bit of shock.</p>
<p>There also seems to be a bit of a... n00b overflow. Some of the popular questions are really newbie. Like, what's with these arrays (and why is it worth 11,000 views)?</p>
<p><img alt="what's with the arrays?" src="http://img.skitch.com/20081227-jeheg78pchbcjpemymsrqrgsji.png" /></p>
<p>Or perhaps this hot question with over 4,000 views:</p>
<p><img alt="the truth is out there" src="http://img.skitch.com/20081227-juy2bn55w3bk9wkqk6anqkhyp9.png" /></p>
<p>All but one of the questions I've asked have been answered, which is nice, but the deeper questions that seem more interesting don't get a whole lot of views.
Most of the hottest questions are really fluffy.</p>
<p>It's still a good resource, but you've gotta work a bit to keep it from being a frustrating time sink.</p>
Fri, 26 Dec 2008 22:05:00 -0800
Simple Named Job Deduplication http://www.rockstarprogrammer.org/post/2008/dec/06/simple-named-job-deduplication/
<h2>Our Problem</h2>
<p>We would like content on our web site to be available in our search engine as soon as possible after it is saved. Our search engine is decentralized in that every front-end has a copy of the search index and searches locally. This architecture allows searches to scale horizontally quite well, but does so at the cost of simple index updates.</p>
<p>With a centralized search index, we could just push a modification into the central server and be done with it. With our architecture, we need a worker machine to build a search index and distribute it to all of the front-end machines.</p>
<p>Historically, we just had a cron job that'd occasionally rebuild and ship the index. Later, we started trying to keep track of what had changed and doing incremental updates.</p>
<p>Eventually, I figured out it'd be less work and faster if we just sent object changes into the job queue and had the index builder pick these little changes up and ship them to the web servers. This worked quite well for a while.</p>
<p>This became suboptimal when a bunch of content editors were rapidly making changes on a small development system with a couple of nodes running in VMware. The actual index distribution would just kill the machine due to IO on what ended up being the same disks.</p>
<h2>The Idea</h2>
<p>I wanted to keep the rapid update properties while trying to reduce IO.
The obvious thing would be to try to aggregate multiple index updates into a single index distribution.</p>
<h2>The Implementation</h2>
<p>The first thing that had to be done, of course, was to break the job into two parts:</p>
<ol> <li>The index update.</li> <li>The index distribution.</li> </ol>
<p>Now the trick is to ensure that for every index update, there is at least one index distribution without there being an index distribution for every update.</p>
<p>The easy way to do this is to define a job to have a "run after" date. This works very well for things such as index distribution. In this case, we've built an index and we want to make sure that the results of that index build make it to production. The job we queue will do it, but if these jobs block on each other, then any other job that runs after the time the index build completed will do.</p>
<p>So where do we track the timestamps so a job knows it doesn't need to run? Well, memcached ends up being a perfect place for this.</p>
<p>We give each job a name and a "run after" parameter, and store the "last run" timestamp in memcached under the job name. Really simple, and allows us to create as many of these jobs as we need.</p>
<p>Ruby code for doing this looks something like this:</p>
<pre><code>def run_after_cb(name, timestamp, ob, method, *args)
  k = "jobts_#{name}"
  # A cache failure is treated as "never run", so the job runs.
  t = cache[k] rescue 0
  if t &amp;&amp; t &gt; timestamp
    # Ignored -- log or something
  else
    # Capture the start time before running the job.
    nt = Time.now.to_i
    ob.send(method, *args)
    begin
      cache[k] = nt
    rescue
      # Can't record a new date (next job will run even if unnecessary)
    end
  end
end
</code></pre>
<p>memcached is often the last thing I'd recommend for any sort of thing that isn't exactly a cache, but the semantics fit quite well here.
This is treated as an optimization such that only when we know for sure that a distribution is redundant will we drop it.</p>
<p>Specifically, the index <em>will</em> be distributed under the following conditions:</p>
<ol> <li>When the key is not in the cache (never seen, dropped, etc...)</li> <li>When the key is found, but the recorded "last run" time is not later than the job's "run after" time.</li> <li>When any error occurs when trying to talk to memcached.</li> </ol>
<h2>Example</h2>
<p>Imagine five index updates, each requiring a distribution, occurring in the following scenario:</p>
<p><img alt="Dedup Example" src="http://public.west.spy.net/images/jobdedup.png" title="Example dedup scenario" /></p>
<p>Shortly after completing the first index update, a distribution will start. Content updated in <code>update 2</code> and <code>update 3</code> will not be included in this distribution.</p>
<p>After <code>dist 1</code> completes, <code>dist 2</code> is ready to go for <code>update 2</code>. <code>update 2</code> completed at <code>t2</code> and the most recent distribution started at <code>t1</code>, so we start <code>dist 2</code>.</p>
<p>Because <code>dist 2</code> begins at <code>t4</code>, it naturally includes the effects of <code>update 2</code> and <code>update 3</code>, but not <code>update 4</code> which started <em>before</em> <code>dist 2</code> began.</p>
<p>Now for a bit of imagination because I'm too lazy to draw this better.</p>
<p>Although it's not illustrated here, it should be clear that the next distribution would be <code>dist 3</code> (queued by <code>update 3</code> for updates after <code>t3</code>). That distribution would be <em>dropped</em> because the effects of it have already been distributed.</p>
<p>Next would be <code>dist 4</code> which would have been queued from <code>update 4</code> and that one would <em>not</em> be dropped, but <code>dist 5</code> would be.</p>
<p>In this example, we distributed our index three times for five updates.
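</p>
<p>The scenario can be replayed with a small sketch. This is only an illustration under stated assumptions: it is written in Python rather than Ruby, a plain dict stands in for memcached, the helper names <code>run_after</code> and <code>simulate</code> are hypothetical, and the timestamps are made-up numbers chosen to follow the story above, not values read off the diagram.</p>

```python
def run_after(cache, name, run_after_ts, now, job):
    """Run job unless a run recorded under `name` started after run_after_ts.

    Returns True when the job ran and False when it was dropped.
    """
    key = "jobts_" + name
    last_run = cache.get(key)  # None when the key was never set or was evicted
    if last_run is not None and last_run > run_after_ts:
        return False  # a later run already covered this request
    job()
    cache[key] = now  # record the time this run started
    return True


def simulate():
    cache = {}  # stand-in for memcached (assumption for this sketch)
    distributions = []
    # (label, time its update finished, time the queued job gets picked up)
    queued = [
        ("dist 1", 1, 2),
        ("dist 2", 3, 5),
        ("dist 3", 4, 8),
        ("dist 4", 6, 8),
        ("dist 5", 7, 11),
    ]
    for label, run_after_ts, now in queued:
        run_after(cache, "index_dist", run_after_ts, now,
                  lambda label=label: distributions.append(label))
    return distributions


print(simulate())
```

<p>With these assumed timings, <code>dist 3</code> and <code>dist 5</code> are dropped because a distribution had already started after their updates finished, leaving three distributions for five updates.</p>
<p>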
In practice, this helps quite a bit -- especially when things start getting slow and the distributions are backing up anyway.</p>
Sat, 06 Dec 2008 20:48:00 -0800
Git Tag Does the Wrong Thing by Default http://www.rockstarprogrammer.org/post/2008/oct/16/git-tag-does-wrong-thing-default/
<p>I'm writing this because I don't think anyone is actually aware of it, but I keep seeing it show up in various projects.</p>
<p>When you have done all the cool work you want to do and get ready to tag it, you may think the right thing to do is this:</p>
<pre><code># THIS IS WRONG!
git tag 1.0
</code></pre>
<p>...but that does <em>not</em> create a tag. A tag is a special kind of object. It has a date, tagger (author), its own ID, and optionally a GPG signature. The default mechanism above creates something called a "lightweight" tag. A lightweight tag is a ref pointer that is more like a branch than a tag. If you've used <a href="http://www.selenic.com/mercurial/">mercurial</a> before, you can liken this to <code>hg tag -l</code> to create a "local" tag.</p>
<p>The <em>right</em> way to create a tag is to make either an annotated (<code>-a</code>) or signed (<code>-s</code>) tag:</p>
<pre><code># This is the right way!
git tag -a 1.0
</code></pre>
<p>A signed tag works the same way, but cryptographically signs the tag with your private GPG key.</p>
<pre><code># This is also the right way!
git tag -s 1.0
</code></pre>
<p>Why does this matter, you ask?
Because a real tag also works with things like <code>git describe</code> -- which is very useful when you're rolling releases.</p>
<p>You can see the difference here:</p>
<pre><code>dustinmb:/tmp/project 549% git init
Initialized empty Git repository in /private/tmp/project/.git/
dustinmb:/tmp/project 550% touch afile
dustinmb:/tmp/project 551% git add afile
dustinmb:/tmp/project 552% git commit -m 'added afile'
Created initial commit d1e6305: added afile
 0 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 afile
dustinmb:/tmp/project 553% git describe
fatal: cannot describe 'd1e6305e4d8e00cf5f6f9cd5143ab96fb3451f0d'
dustinmb:/tmp/project 554% git tag 1.0
dustinmb:/tmp/project 555% git describe
fatal: cannot describe 'd1e6305e4d8e00cf5f6f9cd5143ab96fb3451f0d'
dustinmb:/tmp/project 556% git tag -am 'Rolled the annotated version' 1.0-a
dustinmb:/tmp/project 557% git describe
1.0-a
dustinmb:/tmp/project 558% git show 1.0-a
tag 1.0-a
Tagger: Dustin Sallings &lt;dustin@spy.net&gt;
Date: Thu Oct 16 22:26:28 2008 -0700

Rolled the annotated version
commit d1e6305e4d8e00cf5f6f9cd5143ab96fb3451f0d
Author: Dustin Sallings &lt;dustin@spy.net&gt;
Date: Thu Oct 16 22:25:46 2008 -0700

    added afile

diff --git a/afile b/afile
new file mode 100644
index 0000000..e69de29
dustinmb:/tmp/project 559% git tag -sm 'Rolled the signed version.' 1.0-s

You need a passphrase to unlock the secret key for
user: "Dustin Sallings (primary) &lt;dustin@spy.net&gt;"
1024-bit DSA key, ID 43E59D54, created 2003-01-18

dustinmb:/tmp/project 560% git show 1.0-s
tag 1.0-s
Tagger: Dustin Sallings &lt;dustin@spy.net&gt;
Date: Thu Oct 16 22:27:12 2008 -0700

Rolled the signed version.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Darwin)

iEYEABECAAYFAkj4IjAACgkQeWDnv0PlnVTS7gCggImUJawC+cNEppCQ9bTtw+MZ
Nq4An2Vr7gbUAUDEQY97P1hwKK8cehfW
=Al8E
-----END PGP SIGNATURE-----
commit d1e6305e4d8e00cf5f6f9cd5143ab96fb3451f0d
Author: Dustin Sallings &lt;dustin@spy.net&gt;
Date: Thu Oct 16 22:25:46 2008 -0700

    added afile

diff --git a/afile b/afile
new file mode 100644
index 0000000..e69de29
dustinmb:/tmp/project 561% git for-each-ref
d1e6305e4d8e00cf5f6f9cd5143ab96fb3451f0d commit refs/heads/master
d1e6305e4d8e00cf5f6f9cd5143ab96fb3451f0d commit refs/tags/1.0
2b521732e0717c9f3f27330133be95284a059252 tag refs/tags/1.0-a
13cf45a9eca3d3b4b3d1405588f9d6e551515a89 tag refs/tags/1.0-s
</code></pre>
<p>You can pretty clearly see the difference here. One is not a tag, and the other two are. The two that are will work happily with describe and just overall make more sense.</p>
<p>So please, use <code>-a</code> with your tags and make describe and related tools happy.</p>
Thu, 16 Oct 2008 22:16:00 -0700
Using Git Bundle When Your Central Repo Fails http://www.rockstarprogrammer.org/post/2008/oct/05/using-git-bundle-when-your-central-repo-fails/
<p>I have <a href="http://github.com/dustin/twitterspy">an application</a> of mine I did a bunch of <a href="http://dlsspy.tumblr.com/post/53133290/twitterspy-adhoc">work</a> on tonight and wanted to deploy that work on the VPS that runs it.</p>
<p>Unfortunately, after pushing my changes to github, I found that I couldn't <code>pull</code> from this box. I don't know whether it was because of some port filtering stuff or a broken machine at github.
It was a great opportunity to try out <code>git bundle</code>, though.</p>
<h2>What is Git Bundle?</h2>
<p>A bundle is a way to put a bunch of changesets into a file so you can exchange them out of band while maintaining object IDs (as opposed to a <code>git format-patch</code>/<code>git am</code> sequence).</p>
<h2>How Do I Make One?</h2>
<p>In my case, my remote tree had change <code>23b730</code>, but I had several changes on top of that and my normal means of moving code around (<code>git://</code> from github) wasn't working. The bundle creation to package up all changes and blobs after <code>23b730</code> was pretty straightforward:</p>
<pre><code>% git bundle create /tmp/cmd.git 23b730..
</code></pre>
<p>That created the file <code>/tmp/cmd.git</code>.</p>
<h2>Neat, Now How Do I Use It?</h2>
<p>First, get that file where you need it. Email it, put it on a web server, scp it, whatever. Once there, you unbundle it into the target repo using the <code>git bundle unbundle</code> command. Here's my example:</p>
<pre><code>% git bundle unbundle /tmp/cmd.git
0b6bf526dc3c9544288444dbe7eb58c7d091038e HEAD
</code></pre>
<p>Note that <code>0b6bf52</code> is the dev head I want to deploy. This <em>does not</em> change any of your branches, it only shoves the objects into the git object store. You can either reset your branch or merge it at this point. I chose a merge (which, as I expected, was a fast-forward):</p>
<pre><code>% git merge 0b6bf52
</code></pre>
<p>Now you're done. Code is deployed and all's well.</p>
Sun, 05 Oct 2008 00:57:00 -0700
What Matters in an Asynchronous Job Queue http://www.rockstarprogrammer.org/post/2008/oct/04/what-matters-asynchronous-job-queue/
<p>An asynchronous job queue is a key element of system scalability.
Job queues are particularly well-suited for web sites where an HTTP request requires some actions to be performed that may take longer than a second or two and where immediate results aren't necessarily required.</p>
<h2>Important Properties of a Job Queue</h2>
<p>There are several properties of such a queue system that have various levels of importance. Everybody has a different take on the levels of importance of each property. I'm going to list the properties that I find important, and why. I expect lots of people to disagree, and that's perfectly fine as I'd like to see more people's perspectives.</p>
<h3>A Single Job is Handled by a Single Worker</h3>
<p>This one is seemingly obvious. When a job is enqueued, I want it to be picked up by <em>one</em> worker.</p>
<p>Note that this is not universally true, however. In large systems (for example, at Google), a given job may be handed to more than one worker at a time to ensure it gets done in a timely manner. This type of thing is obviously more reliable, but it's very hard to reach this level of reliability. For example, not all jobs are idempotent. If you were to have a job that formats and sends mail to a bunch of recipients, you would want to make sure the part that sends the email is not done more than once.</p>
<h3>Different Jobs May be Handled by Different Workers</h3>
<p>I have different classes of workers dedicated to performing different jobs. These workers may grow independently of each other, and in some cases, get rewritten in different languages for various reasons.</p>
<p>I do often have "general" queues that can process many types of jobs and just shove them all in there, but having the ability to split off dedicated workers has been critical to me in certain applications.</p>
<h3>Priority Queues</h3>
<p>I've never deployed a worker queue and not needed to start prioritizing jobs. Some jobs are responsible for fanning out (creating more jobs) and should really happen nearer an empty queue.
For some jobs, timeliness is important, so I'd like to request that they happen fairly soon. Some jobs are just expected to be bigger and slower so I toss them in at a lower priority.</p>
<h3>Delayed Jobs</h3>
<p>Similarly to priorities, being able to push a job in with a delay has been useful in a couple of circumstances.</p>
<p>My #1 reason to delay a job is a temporary failure. This may be either because something is broken in a way that I expect will be fixed later, or because of an inability to acquire a lock or similar scarce resource.</p>
<p>By pushing the job back into the queue with a delay, I can do the jobs I <em>can</em> do without having to wait for this job to become available.</p>
<h3>Introspection</h3>
<p>Introspection is key to monitoring.</p>
<ul> <li>Are there enough workers right now?</li> <li>Is a job queue growing faster than consumers can consume things?</li> <li>Is the job queue shrinking at all?</li> <li>Are some jobs getting stuck?</li> </ul>
<p>There are lots of health-related questions that you'll want answers to as you make more of your processing asynchronous.</p>
<h3>Blocking Delivery</h3>
<p>This is one that I've been seeing missing from a lot of queuing systems and it just baffles me. If I ask for a job, and there isn't one available, can I just wait? Having to poll is not acceptable. I see this kind of code a lot:</p>
<pre><code>while True:
    job = queue.ask_for_a_job()
    if job:
        process_job(job)
    else:
        time.sleep(sleep_time)  # CODE SMELL!
</code></pre>
<p>Sleep is for humans. A sleeping process is a waste of resources. The <em>reason</em> the sleep is there is that this becomes a fast infinite loop (with network IO) without it. It's taxing on the client and the server just to see if something's changed.
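</p>
<p>To put rough numbers on that trade-off, here is a minimal sketch. The helper name <code>polling_cost</code> is hypothetical, and it assumes an idle worker, a job arriving at a uniformly random moment, and a poll request that itself takes no time.</p>

```python
def polling_cost(sleep_time):
    """Return (empty polls per hour, mean added latency, worst added latency).

    sleep_time is the number of seconds the worker sleeps between polls.
    """
    polls_per_hour = 3600.0 / sleep_time
    mean_latency = sleep_time / 2.0   # job lands halfway through a sleep on average
    worst_latency = float(sleep_time)  # job lands right after a poll
    return polls_per_hour, mean_latency, worst_latency


# Illustrative sleep values only; none of these come from a real deployment.
for seconds in (0.1, 1, 5, 30):
    polls, mean, worst = polling_cost(seconds)
    print("sleep %5.1fs: %7.0f polls/hour, latency mean %.2fs, worst %.1fs"
          % (seconds, polls, mean, worst))
```

<p>Shrinking the sleep cuts the added latency but multiplies the pointless requests, and that cost is paid per idle worker.</p>
<p>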
<code>sleep_time</code> is a value that balances how much latency you're willing to have in your jobs and how much of a burden you want on your network, client, and server.</p> <p>Consider the same code with a fully-blocking queue:</p> <pre><code>while True: process_job(queue.ask_for_a_job()) </code></pre> <p>In addition to being less code, this makes much better use of resources, gets the jobs done at a much lower latency, and overall makes the world a better place.</p> <p>Don't get me wrong, long polling, or even quick polling is OK in some applications. It should be an option, not a technological constraint.</p> <h3>Must Handle Worker Crashes</h3> <p>If a worker takes a job and then crashes, should the job get done? This is a really important part I think a lot of people who design worker queues ignore, but it's the most common type of failure I ever see.</p> <h2>Properties that Don't Matter (As Much As You Think)</h2> <p>Since I see these things come up a lot, I'm going to argue against them. If just <em>one</em> person doesn't implement another queue focused on the wrong properties, my work here won't be fruitless.</p> <h3>An Existing Protocol</h3> <p>I can't remember how many queue systems I've seen written to the memcached protocol. It's just wrong. You simply can't achieve the properties I consider important in a queue with a protocol designed for simple key/value caching.</p> <p>Both <a href="http://rubyforge.org/projects/starling/">starling</a> and <a href="http://code.google.com/p/memcacheq/">memcacheq</a> attempt to solve the same problem the wrong way. Both require clients to poll the servers for new jobs. 
Neither has positive job completion acknowledgments, crash handling, priorities, delays, or any room for them because of the desire to remain compatible with memcache client libraries.</p>
<p>It's just not worth losing all of this just for the sake of not coming up with a new protocol.</p>
<h3>Queue Durability</h3>
<p>I'm a pretty big fan of <a href="http://xph.us/software/beanstalkd/">beanstalkd</a>. I see a lot of people decide it's not well-suited for their environments because it doesn't keep its queue across restarts.</p>
<p>I won't argue that a queue should never be durable, but I will restate that this isn't what's ever caused me to lose a job. People equate queue durability with a reliable queue system, but that's just completely wrong.</p>
<p>Consider the starling case again. It keeps the queue on disk, so you can enqueue an item, crash the server, and the next <code>get</code> will return your job. Awesome.</p>
<p>Now grab an item out of a queue and kill the worker (which currently owns the job). With no completion acknowledgment, that job is gone. I've yet to crash a beanstalkd, but workers crash or restart every time code is deployed, or there's a memory leak or similar bug, broken DB, unavailable lock server (or just lock).</p>
<p>Job workers are just like web servers in our environment. We don't want to care if they crash occasionally.</p>
<h2>What's Right for You?</h2>
<p>There are many options from a simple DB table to JMS. <a href="http://xph.us/software/beanstalkd/">beanstalkd</a> meets all of my requirements (and in the areas where it didn't, I've modified it to do so).</p>
<p>If you absolutely need queue durability, I'm sure a solution with minimal overhead would be a welcome contribution.
Otherwise, make sure that you don't lose <em>job</em> durability in the process.</p>
<p>But whatever you do, please, don't build yet another one on memcached.</p>
Sat, 04 Oct 2008 13:02:00 -0700
Automating Git Bisection for Rails Apps http://www.rockstarprogrammer.org/post/2008/sep/25/automating-git-bisection-rails-apps/
<p> Bisection is an awesome strategy for finding the introduction of a flaw. The basic idea is to recognize a failure in a particular version of your code, find a version where the failure did not exist, and use the SCM to automate finding the change that introduced the bug. </p>
<p> I first used it in <a href="http://darcs.net/">darcs</a> a few years ago (where it's known as <a href="http://darcs.net/manual/node8.html#SECTION008114000000000000000">trackdown</a>). <a href="http://www.selenic.com/mercurial/wiki/">mercurial</a> and <a href="http://git.or.cz/">git</a> both implement it as a <code>bisect</code> command. </p>
<p> While the concept is the same, the implementations vary across systems. In darcs, the trackdown command may <em>only</em> be used in an automated fashion (i.e. you have to write a test script), while in mercurial, the bisect command may <em>only</em> be used in an interactive fashion (i.e. you have to start bisection and manually test each revision as you go). git, however, supports both modes. </p>
<p> In practice, I find the darcs way <em>generally</em> preferable as it's faster (assuming you have a test ready) and harder to get wrong. When bisecting interactively, I somehow manage to mark a revision as good when I mean bad, or similar, and have to start the whole thing over. In an automated mode, there's no thinking required. </p>
<h2>Easy Case: An Existing Test Case That's Failing</h2>
<p> If you have an existing test that's failing, you've got it quite easy.
Find a version where it worked (we'll say <code>HEAD~50</code>) and just let it go: </p>
<pre><code>% git bisect start HEAD HEAD~50
% git bisect run rake
</code></pre>
<p> That will spit out the change that caused the unit tests to start failing. If you've had multiple failures (or your tests are slow), you may want to tell it to just run a single test case: </p>
<pre><code>% git bisect start HEAD HEAD~50
% git bisect run ruby test/unit/some_test.rb
</code></pre>
<h2>Harder Case: Finding a Failure with a New Test</h2>
<p> If the test didn't exist when the code was broken, bisecting with the committed tests won't be helpful. I've found <a href="http://www.kernel.org/pub/software/scm/git/docs/git-stash.html">git stash</a> to be very helpful in this case, however. Write the new, failing test case (that you believe would've succeeded before), and instead of committing it, just stash it (<code>git stash</code>) and write a quick shell script to run the test: </p>
<pre><code>#!/bin/sh

# Bring the stashed test into this revision, run it, then clean up.
git stash apply
ruby test/unit/modified_or_new_test.rb
rv=$?
git reset --hard
exit $rv
</code></pre>
<p> Once that script's in place (say <code>/tmp/try.sh</code>), you run the bisection as you normally would: </p>
<pre><code>% git bisect start HEAD HEAD~50
% git bisect run /tmp/try.sh
</code></pre>
<h2>A Really Hard Case: HTTP Request Needed to Show Problem</h2>
<p> Recently, I had a bug in reloading a module in development mode that caused the second HTTP request sequence after a certain type of modification to attempt to reload a module that couldn't be reloaded. I wanted to bisect this, but I didn't want to use my browser and editor and stuff for every test during a bisection, so I automated it the following way: </p>
<pre><code>#!/bin/sh

# Fetch one URL; on failure, stop the server and mark this revision bad.
http_get() {
    curl -f -s "$1" &gt; /dev/null
    rv=$?
    echo "Requested $1 -&gt; $rv"
    if [ $rv -ne 0 ]
    then
        echo "Failed to fetch $1 (try #$2)"
        kill $pid
        exit $rv
    fi
}

http_sequence() {
    http_get http://127.0.0.1:3000/page1 $1
    http_get http://127.0.0.1:3000/page2 $1
    # [...]
}

# Start the dev server and capture the PID
./script/server &amp;
pid=$!

# Give the server a chance to start before running sequences
<a href="http://github.com/dustin/waitforsocket">waitforsocket</a> 127.0.0.1 3000

http_sequence 1
touch app/[...]/somefile.rb
http_sequence 2

kill $pid
exit 0 # If we get this far, this version has no sequence bug
</code></pre>
<p> This script as my bisection command tracked down the first changeset with the reload issue very quickly and accurately. It's easy to adapt it to anything where you want to actually make an HTTP request and inspect the traffic/server/log/whatever. </p>
Thu, 25 Sep 2008 22:27:00 -0700
Ruby's HTTP Client Sucks http://www.rockstarprogrammer.org/post/2008/jul/06/rubys-http-client-sucks/
<p> I implemented <a href="http://www.rockstarprogrammer.org/post/2008/jul/06/whats-up-xmpp-based-web-monitor/">what's up</a> in ruby since it seemed to have some of the best XMPP support I know of. I also got to learn the <a href="http://datamapper.org/">datamapper</a> API, which is alright. </p>
<p> Ruby's HTTP client, however, <em>really</em> sucks. I have to imagine someone else has known how much it sucks, but I haven't found much talking about it, or any solutions to the problems. For a really simple example, consider the following code: </p>
<pre>
#!/usr/bin/env ruby

require 'net/http'

u = URI.parse $*[0]
puts Net::HTTP.start(u.host, u.port) { |h| h.get u.path }
</pre>
<p> There are some URLs I simply can't get that thing to deal with. Examples: <code>http://digg.com/</code> and <code>http://bleu.west.spy.net/diggwatch/comments/dlsspy</code> (these two fail in different ways, but work fine in browsers). </p>
<p> Is this really as bad as it seems to be, or am I just doing it wrong?
</p>
<h2>Update: Sun Jul 6 16:50:24 PDT 2008</h2>
<p> In these two examples, there seems to be something wrong on the server side. digg gets very upset if you don't send a user agent and just kind of hangs for a while. That's...odd. diggwatch (a service I have running here) is emitting incorrect responses through lighttpd. It seems that lighttpd's proxy module is just...wrong (service works fine without lighttpd in the way). </p>
Sun, 06 Jul 2008 15:26:00 -0700
What's Up? &mdash; An XMPP-based Web Monitor http://www.rockstarprogrammer.org/post/2008/jul/06/whats-up-xmpp-based-web-monitor/
<p> I finally got around to building a useful web monitor for myself. I bring up little web services all over the place and generally do a bad job of ensuring they start correctly and/or continue to run correctly. I don't want to get <em>paged</em> when these things are broken, or have my email box flooded or anything, so <a href="http://www.jabber.org/">XMPP</a> seemed to be the right thing for me. </p>
<p> So yesterday I started writing the monitor for me. I called it <a href="http://github.com/dustin/whatsup">what's up</a> because I'm not very creative. You can try it out by sending an IM to <a href="xmpp:whatsup@jabber.org">whatsup@jabber.org</a>. It can do a single get (a la down for everyone or just me) over IM, or it can periodically do requests and validations for you on a collection of URLs. </p>
<h2>Resources</h2>
<ul> <li><a href="xmpp://whatsup@jabber.org">The Service</a></li> <li><a href="http://github.com/dustin/whatsup">The Source</a></li> </ul>
Sun, 06 Jul 2008 14:58:00 -0700
Good, Fast, Cheap?
Eh, No Thanks http://www.rockstarprogrammer.org/post/2008/jun/17/good-fast-cheap-eh-no-thanks/
<p> I had a conversation with a guy today about scaling his app. The app looks really simple and is still kind of below the radar in popularity, but is expected to grow for a couple of different reasons. They're experiencing a bit of slowness, but nothing too out of control. I heard they just bought about seventy new servers to run the application. </p>
<p> Now, their app is a really simple read-mostly content spewing app. They're adding some more interactivity in it, but it's the kind of stuff that is more realtime-ish &mdash; the kind of stuff which I would absolutely not put a database in the critical path of. Admittedly, they don't have any experience with load balancing, caching, etc... However, they <em>do</em> have about seventy servers now. </p>
<p> I'm pretty sure I could meet their current load requirements on one really bored server, so I thought I'd offer some assistance. I suggested that a bit, showed examples of how I'd done stuff like this in the past, and suggested that it was a <em>really</em> bad idea to buy so many machines and asked what his thoughts were on EC2 since it'd be immediately cheaper and could pretty close to instantly reach whatever scale they needed in the medium-long term. </p>
<p> This is where things got a little weird for me, to the point of pulling me away from the conversation about software architecture I'd intended to have and into talking about a lower level of scaling and general cost reduction. The guy said he didn't like the idea of EC2 for a number of reasons which I'll enumerate below.
</p>
<ol> <li>I don't want an EC2-based business</li> <li>I want to be running on my own hardware so I'll have more assets for a potential sale in 2-3 years</li> <li>What if Amazon jacks up the price a lot?</li> <li>What if Amazon decides to not run this service anymore?</li> <li>What if they have a huge traffic spike &mdash; can't be spinning up instances while they're being beaten down.</li> <li>I used EC2 before and it cost me $150/mo for an idle machine. Can you imagine multiplying that by 70?</li> <li>We've got lots of money, so cost efficiency isn't a problem</li> </ol>
<p> I found the list a little... backwards. I'll go into detail just in case any of them seem to make sense. </p>
<h3>I Don't Want to be an EC2 Based Business</h3>
<p> His is not an EC2 based business. His is a software business that provides services to clients. It's a web site. Getting piles of hardware adds a lot of op ex as he develops a cost center for his business and distracts himself from his core competency (which, as he said, is not scaling out hardware). </p>
<h3>I Need the Assets to Sell My Business</h3>
<p> This one I didn't understand at all. If there's significant value in commodity hardware 2-3 years from now, he'll be in sad shape. </p>
<p> The interesting thing is that the hardware is already about two years old. He paid about $24k for two racks of 34 machines each. Best part, it came out of an Amazon cluster. So it's older than any machine you'd get in EC2, and if you're planning on being bought in 2-3 years, it'll be 4-5 year-old commodity hardware by then. </p>
<p> You may find after a year or so that newer hardware would result in a lower cost per request served around a time that you need to serve significantly more requests. New hardware would make a lot of sense then. You can throw away all these machines you got, or you can just start rebooting EC2 instances. I know which one <em>I</em> think is easier.
</p> <h3>What if Amazon Raises the Price Significantly?</h3> <p> You move. </p> <p> Right now, it's the cheapest way to deploy an app you want complete control over and want to be able to scale with demand. If it isn't the cheapest tomorrow, pick it up and take it elsewhere. </p> <p> Here he's betting that Amazon is going to lock him in somehow and then screw him out of more than $24k. That's not the whole story, though. $24k is the acquisition cost. These 68 machines still have to be located somewhere, have connectivity into them, have redundant switches, redundant power supplies, careful management of distribution across different PDUs and switches to prevent local outages from taking you out, spare parts when MTBF strikes, and lots of other hidden costs. While it's a valid way to do things if you know it'll be cheaper, the op ex is likely to be at least as high as Amazon's, with cap ex added on top. </p> <h3>What if There's a Huge Traffic Spike?</h3> <p> In an EC2 setup fronted by something like <a href="http://jointheconversation.org/2008/06/06/fuzed-and-ec2/">fuzed</a>, with a bit of preparation in creating a custom AMI, it'd be unlikely to take a full minute to add a node to a cluster. And as Jeff Bezos talked about at startup school, Animoto went from <a href="http://blog.animoto.com/2008/04/21/amazon-ceo-jeff-bezos-on-animoto/">50 to 3,500 servers in three days</a>. You just can't do that with standard colocation practices. </p> <p> In contrast, he wants to have all of his new machines running 24/7 in case of a traffic spike. When your traffic is low to normal, you're just burning cash. When the traffic is <em>really</em> high, you're just plain burning. A significant amount of new hardware can't even be acquired in a day even if you have a place to put it. Installation will be a pain. And when the huge spike is over, you'll just be burning cash even faster. 
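</p> <p> To put rough numbers on "burning cash", here's a minimal back-of-the-envelope sketch comparing an always-on fleet to one that only grows during peaks. The $24k hardware price is the figure from this post; the $0.10 per server-hour rate is an assumption (it's what the $7/hr for 70 servers figure in the next section works out to), and the 30-day month, the two peak hours a day, and the function names are all hypothetical, picked just for illustration. </p>

```python
# Back-of-the-envelope comparison of renting servers by the hour versus
# buying hardware up front. All rates below are illustrative assumptions.

HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30        # simplification: every month has 30 days
HOURLY_RATE = 0.10         # assumed dollars per server-hour ($7/hr for 70 servers)
HARDWARE_COST = 24_000     # purchase price quoted in the post


def monthly_rental(base_servers, peak_servers, peak_hours_per_day, rate=HOURLY_RATE):
    """Monthly bill for base_servers running all day, growing to
    peak_servers for peak_hours_per_day hours each day."""
    base_hours = base_servers * HOURS_PER_DAY * DAYS_PER_MONTH
    extra_hours = (peak_servers - base_servers) * peak_hours_per_day * DAYS_PER_MONTH
    return (base_hours + extra_hours) * rate


def months_until_rental_exceeds(purchase_price, monthly_bill):
    """Months of rental bills it takes to add up to the purchase price.
    Ignores power, bandwidth, space, and maintenance for owned hardware."""
    return purchase_price / monthly_bill


if __name__ == "__main__":
    always_on = monthly_rental(70, 70, 0)
    elastic = monthly_rental(3, 70, 2)  # hypothetical: 2 peak hours per day
    for label, bill in (("70 servers, always on", always_on),
                        ("3 servers + daily peak", elastic)):
        months = months_until_rental_exceeds(HARDWARE_COST, bill)
        print(f"{label}: ${bill:,.0f}/mo, {months:.1f} months to reach ${HARDWARE_COST:,}")
```

<p> Under those assumptions the always-on fleet comes to about $5,040 a month and the elastic one to about $618 a month. Swap in your own rate and peak hours; the shape of the comparison matters more than these particular numbers. 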
</p> <h3>I Used EC2 Before and It Cost a Lot</h3> <p> He said he spent $150/mo on an idle machine in the past and felt that that was an incredible cost. <q>Can you imagine that times 70?</q> Um, no, I can't. Because you'd just be trying to give away money if you chose to run 70 idle high-cpu medium servers 24/7 for a month. Fundamentally, it's the same problem you'll have running them in house, but that always gets calculated differently for some reason. But if $150/mo is what it takes to run his app, then his new $24k cluster will pay for itself in... a bit over thirteen years (160 months). </p> <p> At low traffic times, he can probably survive on, say, three machines ($210/mo minimum). At high traffic times, let's imagine he needs up to 70 servers at peak ($7/hr during the peak hours). That's all that needs to be running. </p> <p> But imagine for a moment he did need 70 servers running all the time. That's about $5k/mo for small servers (which I'm guessing is what this purchased hardware is, or is at least equivalent to). That means it'd take about five months for the new hardware to pay for itself if electricity, bandwidth, and maintenance were completely free. </p> <p> But, again, the reality is that he'd likely be paying closer to $500/mo for long enough that by the time he ran up $24k worth of EC2 bills, the hardware would be <em>completely</em> obsolete. </p> <h3>But We're Not Concerned About Cost</h3> <p> Maybe not, but you're still trying to walk a fine line between having enough servers to handle the day everyone on digg and slashdot starts using your product and having few enough that you can burn more money on making a good and fast product and less on trying to play with old computers. </p> <p> With 68 machines, you'll probably have what appears to be a surplus today, so you can afford to be sloppy with your coding. You might find yourself running at 60% or so capacity earlier than you would if you were generally aware of cost. 
Now, given that cost isn't an issue, that itself isn't a problem. The real problem is what it's going to take to grow beyond this. </p> <h3>In Conclusion</h3> <p> EC2 isn't right for everyone, but better, faster, and cheaper solutions make better, faster, and cheaper companies. If someone else will do some of the hard parts that fall outside of your core competency, then you'd hopefully have a really good reason to do them yourself. </p> <p> What he's doing will work. I've seen terrible things work when I really just wish they wouldn't. It'll just be a lot harder and a lot more expensive than it needs to be. What are you going to do, though? </p> <p> As for me, I run most of my software on old crappy computers at my house where I've got really bad connectivity. If you can even see this post, consider yourself lucky. Seeing as how I'm the only one who finds any of this interesting, it doesn't matter, though. :) </p> <a href="/post/2008/jun/17/good-fast-cheap-eh-no-thanks/#disqus_thread">Comments</a> Tue, 17 Jun 2008 22:32:00 -0700http://www.rockstarprogrammer.org/post/2008/jun/17/good-fast-cheap-eh-no-thanks/Gems on Github http://www.rockstarprogrammer.org/post/2008/apr/24/gems-github/ <p> <a href="http://github.com/"><img style="float: right" width=120 height=120 src="http://public.west.spy.net/octogem-small.png" alt="octocat"/></a> <a href="http://github.com/">Github</a> is now a gem server. See <a href="http://gems.github.com/">the howto page</a> for details. </p> <p> This is a really big deal. It's not that we need yet another gem server, but what this means is that there's now a standard way to distribute your own custom variations of gems with almost no effort. </p> <p> For example, I might have a project that's built on <a href="http://github.com/bmizerany/sinatra">sinatra</a>, but requires a few minor tweaks to it that either haven't been or won't be accepted upstream. 
All I've got to do is fork the project on github, and add a dependency for my version of the gem and we're good to go. I can keep mine up to date, or if the code from the fork is accepted, it can get pulled back in. </p> <p> Github is ushering in open source philosophies so rapidly that we should really take a step back to appreciate what it's doing. This is the bazaar. </p> <p> Critics have called github a throwback to centralized systems because it's, well, a hub. That's kind of a warped view, though. <a href="http://sf.net/">sourceforge</a> tried to do it with a good degree of success, but they actually were centralized. Github is centralized the same way a <a href="http://en.wikipedia.org/wiki/Farmer%27s_market">farmer's market</a> is &mdash; we all do our work in our own places with various backend tree swapping, but we show up at this one place to show off and trade our wares. <q>Customers</q> in this model need only pay in code for the right to make someone's project a little better. </p> <p> This is open source 2.0. </p> <a href="/post/2008/apr/24/gems-github/#disqus_thread">Comments</a> Thu, 24 Apr 2008 20:51:00 -0700http://www.rockstarprogrammer.org/post/2008/apr/24/gems-github/