2008-06-17 22:32:00

Good, Fast, Cheap? Eh, No Thanks

I had a conversation with a guy today about scaling his app. The app looks really simple and is still kind of below radar in popularity, but is expected to be growing for a couple different reasons. They're experiencing a bit of slowness, but nothing too out of control. I heard they just bought about seventy new servers to run the application.

Now, their app is a really simple read-mostly content spewing app. They're adding some more interactivity in it, but it's the kind of stuff that is more realtime-ish — the kind of stuff which I would absolutely not put a database in the critical path of. Admittedly, they don't have any experience with load balancing, caching, etc... However, they do have about seventy servers now.

I'm pretty sure I could meet their current load requirements on one really bored server, so I thought I'd offer some assistance. I hinted at that, showed examples of how I'd done stuff like this in the past, suggested that it was a really bad idea to buy so many machines, and asked what his thoughts were on EC2, since it'd be immediately cheaper and could almost instantly reach whatever scale they needed in the medium-to-long term.

This is where things got a little weird for me, to the point of distracting me from the conversation about software architecture I'd intended to have and pulling me into one about a lower level of scaling and general cost reduction. The guy said he didn't like the idea of EC2 for a number of reasons, which I'll enumerate below.

  1. I don't want an EC2-based business
  2. I want to be running on my own hardware so I'll have more assets for a potential sell in 2-3 years
  3. What if Amazon jacks up the price a lot?
  4. What if Amazon decides to not run this service anymore?
  5. What if they have a huge traffic spike — can't be spinning up instances while they're being beaten down.
  6. I used EC2 before and it cost me $150/mo for an idle machine. Can you imagine multiplying that by 70?
  7. We've got lots of money, so cost efficiency isn't a problem

I found the list a little... backwards. I'll go into detail just in case any of them seem to make sense.

I Don't Want to Be an EC2-Based Business

His is not an EC2 based business. His is a software business that provides services to clients. It's a web site. Getting piles of hardware adds a lot of op ex as he develops a cost center for his business and distracts himself from his core competency (which, as he said, is not scaling out hardware).

I Need the Assets to Sell My Business

This one I didn't understand at all. If there's significant value in commodity hardware 2-3 years from now, he'll be in a sad shape.

The interesting thing is that the hardware is already about two years old. He paid about $24k for two racks of 34 machines each. Best part, it came out of an Amazon cluster. So it's older than any machine you'd get in EC2, and if you're planning on being bought in 2-3 years, it'll be 4-5 year-old commodity hardware by then.

You may find after a year or so that newer hardware would result in a lower cost per request served around a time that you need to serve significantly more requests. New hardware would make a lot of sense then. You can throw away all these machines you got, or you can just start rebooting EC2 instances. I know which one I think is easier.

What if Amazon Raises the Price Significantly?

You move.

Right now, it's the cheapest way to deploy an app you want complete control over and want to be able to scale with demand. If it's not tomorrow, pick it up and take it elsewhere.

Here he's betting that Amazon is going to lock him in somehow and then screw him out of more than $24k. That's not the whole story, though. $24k is the acquisition cost. These 68 machines still have to be located somewhere, have connectivity into them, have redundant switches, redundant power supplies, careful management of distribution across different PDUs and switches to prevent local outages from taking you out, spare parts when MTBF strikes and lots of other hidden costs. While it's a valid way to do things if you know it'll be cheaper, the op ex is likely to be at least as high as Amazon, but with a cap ex introduced.

What if There's a Huge Traffic Spike

In the EC2 model fronted by something like fuzed and a bit of preparation in creating a custom AMI, it'd be unlikely to take a full minute to add a node to a cluster. And as Jeff Bezos talked about at startup school, Animoto went from 50 to 3,500 servers in three days. You just can't do that with standard colocation practices.

In contrast, he wants to have all of his new machines running 24/7 in case of a traffic spike. When your traffic is low to normal, you're just burning cash. When the traffic is really high, you're just plain burning. A significant amount of new hardware can't even be acquired in a day even if you have a place to put it. Installation will be a pain. And when the huge spike is over, you'll just be burning cash even faster.

I Used EC2 Before and It Cost a Lot

He said he spent $150/mo on an idle machine in the past and felt that that was an incredible cost. "Can you imagine that times 70?" Um, no, I can't. Because you'd just be trying to give away money if you chose to run 70 idle high-cpu medium servers 24/7 for a month. Fundamentally, it's the same problem you'll have running them in house, but that always gets calculated differently for some reason. But if $150/mo is what it takes to run his app, then his new cluster will pay for itself in... a bit over thirteen years.

At low traffic times, he can probably survive on, say, three machines ($210/mo minimum). At high traffic times, let's imagine he needs up to 70 servers at peak ($7/hr during the peak hours). That's all that needs to go on.

But imagine for a moment he did need 70 servers running all the time. That's under $5k/mo for small servers (which I'm guessing is what this purchased hardware is, or is at least equivalent to). That means it'd take about five months for the new hardware to pay for itself if electricity, bandwidth, and maintenance were completely free.

But, again, the reality is that he'd be likely paying closer to $500/mo for long enough that by the time he ran up $24k worth of EC2 bills, the hardware would be completely obsolete.
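The break-even arithmetic in this section can be sketched in a few lines. This uses only the figures quoted above ($24k of hardware against monthly EC2 bills of $150, $500 and roughly $5k) and, like the text, assumes power, bandwidth and maintenance for the owned hardware are free. The function name is made up for illustration.

```python
# Rough break-even sketch using the numbers quoted in the post.
# Assumption (same as the text): running the owned hardware costs nothing.
HARDWARE_COST = 24000  # dollars paid for the two racks


def months_to_break_even(monthly_ec2_bill, hardware_cost=HARDWARE_COST):
    """Months of EC2 spending needed to equal the hardware purchase price."""
    return hardware_cost / monthly_ec2_bill


scenarios = [
    ("one idle instance at $150/mo", 150),
    ("a more likely bill of $500/mo", 500),
    ("70 small instances running 24/7 (~$5k/mo)", 5000),
]

for label, monthly in scenarios:
    months = months_to_break_even(monthly)
    print(f"{label}: {months:.1f} months ({months / 12:.1f} years)")
```

Any real cost for hosting the owned machines pushes every one of these break-even points further out.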

But We're Not Concerned About Cost

Maybe not, but you're still trying to walk a fine line between having enough servers to handle the day everyone on digg and slashdot starts using your product and having few enough that you can burn more money on making a good and fast product and less on trying to play with old computers.

With 68 machines, you'll probably have what appears to be a surplus today, so you can afford to be sloppy with your coding. You might find yourself running at 60% or so capacity earlier than you would if you were generally aware of cost. Now, given that cost isn't an issue, that itself isn't a problem. The real problem is what it's going to take to grow it beyond this.

In Conclusion

EC2 isn't right for everyone, but better, faster and cheaper solutions make better, cheaper, and faster companies. If someone will do some of the hard parts that fall outside of your core competency, then hopefully you have a really good reason to do them yourself.

What he's doing will work. I've seen terrible things work when I really just wish they wouldn't. It'll just be a lot harder and a lot more expensive than it needs to be. What are you going to do, though?

As for me, I run most of my software on old crappy computers at my house where I've got really bad connectivity. If you can even see this post, consider yourself lucky. Seeing as how I'm the only one who finds any of this interesting, it doesn't matter, though. :)

2008-04-24 20:51:00

Gems on Github

Github is now a gem server. See the howto page for details.

This is a really big deal. It's not that we need yet another gem server, but what this means is that there's now a standard way to distribute your own custom variations of gems at quite nearly no effort.

For example, I might have a project that's built on sinatra, but requires a few minor tweaks to it that either haven't been, or won't be accepted upstream. All I've got to do is fork the project on github, and add a dependency for my version of the gem and we're good to go. I can keep mine up to date, or if the code from the fork is accepted, it can get pulled back in.

Github is ushering in open source philosophies so rapidly that we should really take a step back to appreciate what it's doing. This is the bazaar.

Critics have called github a throwback to centralized systems because it's, well, a hub. That's kind of a warped view, though. sourceforge tried to do it with a good degree of success, but they actually were centralized. Github is centralized the same way a farmer's market is — we all do our work in our own places with various backend tree swapping, but we show up at this one place to show off and trade our wares. Customers in this model need only pay in code for the right to make someone's project a little better.

This is open source 2.0.

2008-04-13 23:26:00

My Main Problem with Google App Engine

The community. It's completely full of ungrateful idiots. I mean, sure, some people are doing some really cool stuff in there, but you can't find it because of the complaints.

I signed up the night it was announced, and got an RSS feed from the google group. I expected lots of interesting hackery going on. What I found instead was people complaining about python and how there's no ruby and how there's no java and how there's no c# and how there's no... cold fusion? Seriously? Learn a new tool and see what this free service can do for you, or don't sign up because it doesn't do what you want.

So some of the users moved over to the issue tracker and started filing feature requests there. That's great. I starred some of the features that were interesting to me. I eventually had to unstar many of them because people keep writing +1 comments. A +1 comment on a bug tracker means, "Hey everybody! Look at me! I can't figure out how to do proper priority adjustment, so I'm going to spam everyone with something completely useless."

Case in point.

Now, I have managed to find some people who are actually willing to evaluate the actual offering from google and comment on how it works and all that. There's some signal in there, but it's really frustrating to find any.

This is a google bug. Someone point me to the issue so I can star it. Google: I implore you. Fix your bug tracker so that it is useful again. Let people know how to vote in a really obvious manner. If anyone posts a comment that's fewer than three characters or starts with a plus sign or something, revoke his/her programming license.

For GAE itself, I have an overall good feeling for it. It's simple, yet is designed in such a way that it should be possible to scale to huge numbers of database-backed queries. I like python. BigTable is new to me, but still somewhat exciting. I find many of the constraints of GAE to be beyond what I'm capable of doing for a similar price anywhere else, so it seems like progress to me.

Update: Tue Apr 15 12:45:59 PDT 2008

It looks like google heard my cries and at least cleaned up that bug. I speak for the community when I thank you.

BTW, be sure to star the +1 comment bug at google hosting itself.

2008-04-06 19:28:00

The Differences Between Mercurial and Git

I realized recently that I've been using distributed revision control for several years now. It's always been an exciting landscape for me, although it's been a bit lonely. I used gnu arch for most of my code for a long time, and dabbled some in darcs at the same time. It wasn't until I saw Bryan O'Sullivan's tech talk on mercurial that I started wondering if my needs weren't being fully met. He showed what mercurial provided that my systems lacked and provided a generally useful tool, so I dove in and enjoyed it greatly.

Neither darcs nor gnu arch were particularly fast systems, but they did a reasonably good job. The most obvious difference between them was that darcs had a really nice UI, and the philosophy of gnu arch was all but against having a useful interface. The idea was that it was something on which revision control systems could be built, and wasn't so much one on its own.

I heard about git on the darcs list a while back, but didn't see what it provided that was so great. In its early days it was... well, considered difficult enough to use that only the sickest bothered. Once git started rising in popularity, I tried to understand what it had to offer that I was missing. Linus Torvalds' tech talk on git seemed to just be selling distributed systems (and pretty much said mercurial was decent). Randal Schwartz's git talk promised to tell me why I should use it instead of CVS, Subversion, SVK, Arch, Darcs, Mercurial, Monotone, Bazaar, and just about every other repository manager, but I didn't really get the impression he'd actually used any of these to compare with. Certainly didn't convince me.

But I eventually did start looking at git, only because there were projects to which I contribute that also use git. I've been taking the time to fully understand it and its ways and explored much of its dirty parts (but I'm not quite done yet). So now I feel I can start writing a bit about the differences I've encountered.

Technical Differences

The primary difference between distributed systems and centralized systems is in granularity. i.e. in a centralized system, there's no distinction between saving a change and making it available. In this regard, git is more granular than mercurial. There are lots of small parts that work together. It's possible for this to be mostly transparent to the user, but as we've seen in comparing the difference between centralized and distributed, it can be very beneficial in creating new types of workflows.

History is a DAG

In both git and mercurial, the history is just a directed acyclic graph, but mercurial attempts to provide a linear history and this has a negative effect in a few places. For one, the rev number is displayed a lot and people try to use it, but it varies from repo to repo and probably does little more than cause confusion.

git will show you from a particular point all of the changesets that led up to it by following the history backwards, but otherwise it represents what happens in the real world. From a given point, there's a change. That change may be desirable or may not be. It may have been at one point, but isn't any longer. The only thing that matters is a head.

And I believe this is where mercurial has gone wrong. A head in mercurial is inferred... it's just a point on the DAG where there are no children. A branch is inferred to be inactive when it's not a head.

Contrast to git where all heads are explicit. A tag or a branch points to a particular node in the graph, and there are tools to compare the changes between two nodes. This is obviously right for tags, but it's also really nice for heads. You can simply dump mass changesets into another repo (for example in a mob branch), but only include heads you feel are worth noting. It's this distinction that allows for private branches.
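The inferred-versus-explicit distinction can be shown with a toy model. This is only a sketch: the history is a plain dict mapping each changeset to its parents, and the names (`parents`, `refs`, `inferred_heads`, `published`) are invented for illustration and are not how either tool stores anything.

```python
# Toy history: changeset id -> list of parent ids (a DAG).
parents = {
    "a": [],
    "b": ["a"],
    "c": ["b"],   # mainline work
    "d": ["b"],   # an experiment that forked off "b"
}


def inferred_heads(parents):
    """Mercurial-style: a head is any node that has no children."""
    has_child = {p for ps in parents.values() for p in ps}
    return sorted(node for node in parents if node not in has_child)


def ancestors(parents, start):
    """Every changeset reachable by walking history backwards from start."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen


def published(parents, refs):
    """Git-style: only history reachable from an explicit ref is shared."""
    result = set()
    for target in refs.values():
        result |= ancestors(parents, target)
    return result


# Explicit heads: a name pointing at a node. "d" has no name, so it stays private.
refs = {"master": "c"}

print(inferred_heads(parents))           # both "c" and "d" count as heads
print(sorted(published(parents, refs)))  # "d" is not included
```

In the inferred model the experiment "d" is a head whether you want it to be or not; in the explicit model it is invisible to anyone else until you give it a name and share that name.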

Mutability Tools

The culture of mercurial is one of immutability. This is quite a good thing, and it's one of my favorite aspects of gnu arch. If I commit something, I like to know that it's going to be there. Because of this, there are no tools to manipulate history by default.

git is all about manipulating history. There's rebase, commit amend, reset, filter-branch, and probably other commands I'm not thinking of, many of which make it into day-to-day workflows. Then again, there's reflog, which adds a big safety net around this mutability.

Branch Management

I heard many people say they felt git handled branches better, but nobody could explain why. I've described the foundation above, but I'll go into more detail here.

In mercurial there are two types of branches you might see as a user: named branches and repository clones. Cloning a repository is how you branch in darcs and is a really simple concept. When you want to do something different, you just clone your repo somewhere else and start working. Named branches allow you to have a workflow where you can switch back and forth between branches in the same working directory. However, named branches are flawed.

In mercurial, every changeset belongs to a named branch. That is, the branch name is stored in the changeset. The flaw is that it's quite easy to have more than one branch with the same name, and it's difficult to tell when this has happened. This can cause confusion in a team where one is left wondering what changes, exactly, have made it into the stable branch when multiple people have reopened and merged the branch on different timelines.

Also, because the branch name is in the changeset, the branch lives forever. The only short-term branches are clone branches. That just doesn't encourage quick experiments.

In git, a branch is just a head (see above). Making changes to a branch actually moves the pointer to the new changeset. This head must be explicitly shared across repositories.

In practice, this drops the cost down to approximately zero. You won't accidentally push code you don't mean to. You won't have to be reminded of a failed experiment for the rest of your life, and you won't have to fear naming them in such a way that they don't collide with something someone else has done somewhere else.
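A small sketch of the two branch models described in this section. The data layout and helper names here are hypothetical, chosen only to illustrate the point: in the first model the branch name lives inside each changeset, and in the second a branch is nothing but a movable name.

```python
# Mercurial-style: the branch name is recorded in every changeset.
hg_style = {
    "r1": {"parents": [], "branch": "stable"},
    "r2": {"parents": ["r1"], "branch": "stable"},  # one person's commit
    "r3": {"parents": ["r1"], "branch": "stable"},  # another person's commit
}


def heads_by_branch(changesets):
    """Group childless changesets by the branch name stored inside them."""
    has_child = {p for cs in changesets.values() for p in cs["parents"]}
    result = {}
    for cid, cs in sorted(changesets.items()):
        if cid not in has_child:
            result.setdefault(cs["branch"], []).append(cid)
    return result


# Two different heads both claim to be "stable".
print(heads_by_branch(hg_style))

# Git-style: changesets carry no branch name; a branch is a pointer.
git_parents = {"r1": []}
git_branches = {"stable": "r1"}


def commit(branch, new_id):
    """Record a changeset on top of the branch and move the pointer to it."""
    git_parents[new_id] = [git_branches[branch]]
    git_branches[branch] = new_id


git_branches["experiment"] = git_branches["stable"]  # branching is one new name
commit("experiment", "x1")
del git_branches["experiment"]  # failed experiment: drop the name, nothing else
commit("stable", "r2")

# Only "stable" remains; "x1" still exists as an object but nothing names it.
print(git_branches)
```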

Interoperability Tools

Mercurial has the convert extension, which does a great job of converting repositories from some other system to mercurial. It's clean, consistent, and fast.

git has a variety of tools for common systems, some of which are bidirectional. Not many of them seem similar to each other (mob design), but the most common ones get the best treatment.

git-svn, in particular, is so good it's evil. Some of my earliest experiences with git involved cloning an svn repo, creating a git branch of an svn branch, doing various changes there, git merging svn trunk on top of it, doing more changes, and pushing all that work as individual changesets back onto svn. That's good stuff and just shouldn't work.

There's a similar tool for CVS, but neither really makes up for a proper distributed system. You can do all your work in your git tree, but the git tree itself can't be meaningfully cloned (e.g. you can't train svn or cvs to merge).

Non-Technical Differences

The biggest non-technical difference between git and mercurial is the rabid culture surrounding git. mercurial users fairly happily and quietly use their tool, while I've had to send two separate door-to-door git missionaries away today alone.

While popularity is a terrible reason to use something, it's contributed a lot to making git a usable system, and you can see this happening on the mailing list. Take a quick look at some recent posts. There's generally more development going on on that list than discussion.

And along came github. There was repo.or.cz before that, and gitorious is also quite nice, but github has made people really want it. Tons of people signing up constantly, tons of projects entered, and a very easy way to contribute to these projects.

Although mercurial may still feel nicer today, the change feels inevitable. This flood of people leaving centralized systems means that it's way easier to contribute to their projects than ever before. This is the important part.

In the end, we all win either way.

2008-01-29 16:03:00

Wanted: Git Cheat Sheet for Collaboration

I've seen a lot of git advocacy lately, and a few comparisons to mercurial, but I've not seen anything that suggests using git over mercurial (my personal favorite) that made me feel that the author had ever used git.

Like many, I am not qualified to describe the differences in any sort of compelling fashion. mercurial feels easier to use to me, but it may just be that git is too foreign and I don't understand it. As I do work with git, I figured it'd be good to find out how I might build similar collaboration patterns to those to which I've grown accustomed. It is for this reason that I'm asking for some command equivalents (or at least some guidance on what I should be doing).

How do I know where I am?

hg command: hg identify

In mercurial, I can always ask for tree identity using the command hg identify. The closest I can find in git is git describe, which generally just gives me an error telling me it can't describe the changeset with the given hash. While this does give me something useful, it does so in the form of an error message, which makes me believe that it may not be the right thing.

How do I see the changes in a repo?

hg command: hg log

If I git checkout an older revision of a tree, the newer changes simply disappear from the log. This is a bit inconvenient for me, because I generally want to see what's coming up, not just what's got me to the current revision. The current behavior is a valid case, but it's certainly not the only one.

This is especially confusing after a pull. I can see lots of hashes scroll by the screen making me think git is doing something really important and perhaps I should be able to reverse SHA-1 in my head so I can tell what that is, but when I'm finished, I can't tell what just happened. git log still shows me the state I was in before.

When I do go back to a previous version, how do I get back to the present? I can't see the changes, so I don't know where I'm going. I'm sure there's a simple mechanism somewhere, but it's not obvious to me.

How do I tell what a given upstream needs?

hg command: hg outgoing [url or registered alias]

I managed to figure out I can do git log origin.. to get changes that I have that aren't in my concept of what the origin has. How do I do that with an arbitrary URL? Do I always have to register them? Maybe that's OK.

How do I tell what a given upstream may provide me?

hg command: hg incoming [url or registered alias]

I haven't figured this one out other than to pull and see what happens (though as I mentioned above, I haven't figured out how to see what's happened, either).

How do I share changes when I have read-only access to an upstream repo?

hg command: hg bundle /output/file [url or registered alias]

The above is the easiest way to do it and the way I tell people, but there's also hg export and extensions like patchbomb (which is similar to git's format-patch, except it will actually send emails).

Of course, I can also just serve my repo, which is probably the best way to do it in any case. For small changes, that's a bit of a pain, though.

How do I import an exported changeset?

hg command: hg import /input/file

The best I've been able to figure out here is git apply, but that doesn't actually save a change, it's just patch with git extension support. Surely there's something that will take an exported changeset and reimport it as it was.

2007-11-05 10:49:00

Processing Amazon S3 Logs

I saw a blog post on reddit today about processing Amazon S3 logs. It went into a lot of the detail about how to set up logging and stuff (although I believe I just used jets3t to set mine up), but nothing about how to process them with a log processing tool.

I wrote a log merge tool a long time ago for collecting large numbers of logfiles from various servers, sorting them, and sending them through a normal web processing tool that accepts CLF. Five or so months ago, I needed a way to process my S3 logs, so I modified my tool to accept them as input while producing the same output.

Since one of the primary design goals is to be able to handle a large number of logs, I most commonly take the file list on stdin. Here is a pseudo-shell example of how I process my S3 logs:

# s3sync -> a directory called l
find l -type f | logmerge | webalizer -o report

The above currently processes 4,440 files (I'm not a very heavy S3 user). It does so without hitting my 256 descriptor ulimit, and without needing to have its input sorted in any particular way.

I'm fairly satisfied with the results and the performance. The current tip is written in C++ (sort of) and uses boost's regex to grok the input. I've an older revision in there that is pure C and uses PCRE, but it's no longer maintained.
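The tool itself is C++, but the translation step it performs can be sketched in Python. This is an illustration, not the tool's actual code: it assumes the S3 access log field order of bucket owner, bucket, bracketed time, remote IP, requester, request ID, operation, key, quoted request line, status, error code, bytes sent, and it emits a plain common log format line. The sample line and the function name are made up.

```python
import re

# Assumed leading fields of an S3 access log line (anything after is ignored).
S3_LINE = re.compile(
    r'(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^\]]+)\] (?P<ip>\S+) '
    r'(?P<requester>\S+) (?P<request_id>\S+) (?P<operation>\S+) (?P<key>\S+) '
    r'"(?P<request>[^"]*)" (?P<status>\S+) (?P<error>\S+) (?P<bytes>\S+)'
)


def s3_to_clf(line):
    """Turn one S3 access log line into a CLF line, or None if it doesn't parse."""
    m = S3_LINE.match(line)
    if m is None:
        return None
    return '{ip} - - [{time}] "{request}" {status} {bytes}'.format(**m.groupdict())


# Hypothetical sample line, for illustration only.
sample = (
    '0123abcd examplebucket [05/Nov/2007:10:49:00 +0000] 192.0.2.10 '
    'anonymous REQ123 REST.GET.OBJECT photos/cat.jpg '
    '"GET /photos/cat.jpg HTTP/1.1" 200 - 2048 2048 12 10 "-" "curl/7.16"'
)

print(s3_to_clf(sample))
```

Because the timestamp is already in the bracketed CLF style, the conversion is mostly a matter of dropping and reordering fields.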

I wouldn't consider this a finished project, but it does what I need quite well. I'd welcome any input (or patches!) from anyone who might get more use out of it.

2007-09-21 10:16:00

How to Look Like an Idiot

So, you want to look like an idiot in front of other people? You've come to the right place.

Let's start with an example. You can be like this guy and get on an erlang mailing list and rant about how much you hate python and how you'll never use it because of how much you hate it. It's the most ugly language ever because it has significant whitespace. (be sure to call Guido an idiot and repeatedly talk about how idiotic that is).

Most python programmers I've spoken to admit they thought the significant whitespace was a dumb idea before actually sitting down and writing code in it. In the end, the code only works correctly when it's indented the way a good programmer would indent his code anyway.

I have seen the other side of this. Several times I have had people bring me perl code they'd written with no indentation and asked me to explain why it wasn't working. Every time, I told the programmer I couldn't read it and to go back and indent it properly before showing it to me. After a little argument, he'd storm off and indent it and generally come back to tell me he saw an obvious problem and it works now.

It's OK to sometimes want to avoid something you know little about for silly reasons, but seriously, you're not going to win respect by going to a completely unrelated forum and ranting about how much you hate something you don't understand at all.

2007-09-20 17:23:00

Dis-Integration

When I was a sysadmin, I built a really nice mail server on top of postfix and cyrus imapd with a few CGIs for managing mailbox and alias maintenance. It handled a decently large volume of email for a medium-sized company and nobody ever complained.

Then one day, we got a new CIO, and he wanted exchange. The excuse was calendaring. It was very important to have one server doing both mail and calendaring. Architecturally, that has never made sense to me, and I don't believe it ever will.

I used to use ical a lot for my calendaring needs. I'd get an invitation via email and it'd automatically drop a new item in my calendar. I could accept or decline from my calendar, and it'd pass the message back over to my email box. That type of integration is exactly what makes sense to me. I can choose the tool that works best for email, and the tool that works best for viewing and manipulating my calendar, and they just have to go talk to their respective servers and know how to pass information back and forth and everybody's happy.

This is why we create protocols. With clean, well-defined protocols, I can use a mail client that makes sense to me (and that has varied over time), and someone else can come along and try to compete.

Exchange provides the opposite of this world. Everything has to fit into one logical box. Blackberry's email support is the worst I've seen. BES and BIS sit between you and your email and give you something that looks, but does not act like your normal email client. Over the air synchronization is just...unpredictable. I never figured out what would cause something to be deleted upstream or downstream from either end. It just didn't work right sometimes.

In my company, we run both BES and GoodLink to deal with mobile clients. That's a lot of stuff for IT to manage, and a lot of stuff to try to keep going. I don't use either here, but I used BES at my last company and always managed to crash it on the weekend which prevented me from getting any useful data services while I was out.

There's a lot of controversy about the iPhone, but there's one path they're taking that seems to be exactly what one should be doing: Integration. They solved all of the problems around how to integrate with corporate email servers, web servers, BES, GL, etc... with one simple strategy: To not do any integration.

The iPhone has three things that make these types of integrations unnecessary:

  1. An IMAP client (does POP too, but POP sucks).
  2. A proper web browser.
  3. A VPN client (PPTP and L2TP).

With these three things, I can show up on the network just as I do from my laptop. Throw in the 802.11 support, and I don't even necessarily need the VPN (unless the network requires it). The calendar support is admittedly weak, but I use the web UI for that anyway, so it's no different for me. Once protocols around calendaring are actually standardized and the standards are implemented, that can work just as well.

In a world drowning users with vendor lock-in and proprietary applications and protocols, there is at least a bit of hope in a few areas.

2007-08-31 11:54:00

Imperative Programming is Rotting People's Brains

We have a problem with our database schema upgrade mechanism at work. The problem is that we don't test the migration script we generate against the databases for which it's targeted.

The solution is obvious: Change the way we generate this script.

At least, this is the argument I keep getting. Let me fill in some context and see if I can make sense of all this. First, let me list the goals of this mechanism:

  1. Generate an upgrade script for our application to go from version X to version Y
  2. Create incremental scripts for dev and test systems

This is a fairly common problem with a wide variety of solutions, many of which are far more complex than they need to be for the benefit they provide. We moved past most of these arguments by getting people to accept that a symmetric process enabling downgrades is impossible and it really, really doesn't matter that our script creates a table and then alters it sixteen times because the alternative is for a human to do a lot of work to save the computer from doing a little. So we check in little pieces of work that get concatenated into a big script.

The only thing it really came down to was how we check these scripts in. One proposal was to have the scripts be numbered and just increment the number every time you write one. You can then have a table with a bunch of numbers in it indicating which scripts have been applied for meeting goal #2. This sort of falls apart at the point where you have two people doing db scripts at the same time, though.

My suggestion is ostensibly more complicated, but in practice has been much easier for me (and I've done twice as many commits as the next person in this area): script dependencies. I can declare this a good idea because I stole it from NetBSD (although I wrote the ordering utility myself and the semantics are slightly different). Basically, every script declares that it provides something, and states what it requires, and then a total order is computed using a topological sort and a script is made available. Goal #1 is met, and goal #2 may be met by having a table of what tokens have been provided (e.g. user+home_page or something) and excluding any scripts that provide any of these tokens.
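The provides/requires idea can be sketched as a small topological sort. This is not the ordering utility mentioned above, just a minimal illustration under assumptions: each script is described by a pair of token sets, script names and tokens are invented, and `already_provided` stands in for the table of tokens a given database has already received.

```python
from collections import defaultdict, deque


def order_scripts(scripts, already_provided=frozenset()):
    """Order scripts so every requirement is provided before it is needed.

    scripts maps a script name to (provides, requires), both sets of tokens.
    Scripts providing a token in already_provided are skipped (goal #2).
    """
    pending = {
        name: (set(provides), set(requires))
        for name, (provides, requires) in scripts.items()
        if not set(provides) & set(already_provided)
    }
    provider = {tok: name for name, (prov, _) in pending.items() for tok in prov}

    dependents = defaultdict(set)
    indegree = {name: 0 for name in pending}
    for name, (_, requires) in pending.items():
        for token in requires:
            if token in already_provided:
                continue  # the database already has this
            if token not in provider:
                raise ValueError(f"{name} requires unknown token {token!r}")
            dep = provider[token]
            if name not in dependents[dep]:
                dependents[dep].add(name)
                indegree[name] += 1

    # Kahn's algorithm; sorting keeps the output deterministic.
    ready = deque(sorted(n for n, d in indegree.items() if d == 0))
    order = []
    while ready:
        name = ready.popleft()
        order.append(name)
        for child in sorted(dependents[name]):
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if len(order) != len(pending):
        raise ValueError("dependency cycle between scripts")
    return order


scripts = {
    "create_user.sql": ({"user"}, set()),
    "add_home_page.sql": ({"user+home_page"}, {"user"}),
    "create_audit.sql": ({"audit"}, set()),
}

full = order_scripts(scripts)                                  # goal #1
incremental = order_scripts(scripts, already_provided={"user"})  # goal #2
print(full)
print(incremental)
```

Unrelated scripts such as the hypothetical `create_audit.sql` can land anywhere in the order, and two people can add scripts at the same time without fighting over a number.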

Conceptually, you can consider these the same, except in my approach, you explicitly list your dependencies, and in the sequencing approach, you implicitly assume that everything you do requires everything anyone's done before. The difference is when you have more than one person working on something. Two people check in script 53. Someone loses. This is when the "well, let's just put it all in one big file" or "well, renumber one of them" arguments start (either of which makes goal #2 much more difficult).

While there are some complaints that one must know what dependencies are needed (which I find completely bogus, as that effort is negligible compared to the actual contents), the larger complaint seems to be that people don't get to specifically define the order in which changes occur. This is the turning point of this entire ramble.

I've been pondering this for a couple of days now, and it seems that I've taken imperative programmers out of their comfort zone. Why would anyone want to spend time on the least interesting part of the application? If there is a script A and a script B that create tables A and B respectively, and there is no relationship whatsoever between A and B, it simply doesn't matter in which order they're invoked. No amount of fear will convince me otherwise.

So, back to our actual problems. Your hand doesn't need to have all of its fingers to count the number of times we've had problems with this process due to unstated or incorrect dependencies. We have automated tools in place for generating the script for #1 (from buildbot against every change) and #2 (dynamically determining the state of a DB and building a custom script). Both of these processes at least confirm that your dependencies make sense. However, there is no testing, so we don't confirm that what we're doing is absolutely correct when applied to a real DB until someone manually does so.

In preparing our staging databases for an upgrade of our new app, we ran into a small number of problems:

  1. Unexpected data problems on this new database
  2. Some script was modifying too much data and taking too long.
  3. Some script removed data people wanted.

Nobody complained too much about #1, although that one really hurt the most. I didn't realize that I even had any respect for mysql before starting this job and working with it regularly and finding that I was actually losing some. For #2, people changed some stuff around to not have to make these large changes.

But for #3, the debate started up yet again. "OMG! The records are all gone! We need to use sequential scripts!!" OK, this came down to two scripts with two possible orders: 1) column is added to DB and another script removes everything where that column is null or 2) other script fails because column doesn't exist.

As far as I can tell, our only problem is that we don't test this stuff regularly. This really shouldn't surprise anyone, but I've been surprised to find how difficult it is to get people to focus on what's actually wrong. Inevitably, I started to end all of these conversations with the following words: "If your proposal doesn't end with '...and you don't have to test it,' then I don't want to hear it."

2007-08-07 13:33:00

Easy HTTP Parsing With Python

I'm writing some code to read a pcap file to produce input for a load tester. This is simple HTTP over TCP and I'm picking out the layers in the packet structure as they go. I'm using simple struct stuff to get through IP and TCP, but when I got to HTTP, I wanted to see if there was something easier to do for understanding HTTP.

This is the simultaneously ugly and beautiful parser I ended up with:

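import BaseHTTPServer
import StringIO

class HTTPDecoder(BaseHTTPServer.BaseHTTPRequestHandler):
    def __init__(self, v):
        # Skip the parent constructor: we want its parser, not a socket.
        self.rfile=StringIO.StringIO(v)
        self.raw_requestline=self.rfile.readline()
        self.parse_request()
        # Whatever follows the headers is the request body.
        self.request_data=self.rfile.read()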

Perhaps it's bad abstraction on their part, but parse_request() is exactly the functionality I wanted.
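The class above targets Python 2 (BaseHTTPServer and StringIO). A rough Python 3 equivalent might look like the sketch below, assuming the same trick of feeding captured bytes to parse_request() from http.server. The send_error override is an addition so that a malformed request records an error instead of trying to write to a socket that doesn't exist; the sample request is made up.

```python
import io
from http.server import BaseHTTPRequestHandler


class HTTPDecoder(BaseHTTPRequestHandler):
    """Parse raw HTTP request bytes without a socket or a server."""

    def __init__(self, raw_bytes):
        # Deliberately skip the parent constructor; it expects a live socket.
        self.rfile = io.BytesIO(raw_bytes)
        self.raw_requestline = self.rfile.readline()
        self.error_code = None
        self.error_message = None
        self.parse_request()
        # Whatever follows the headers is the request body.
        self.request_data = self.rfile.read()

    def send_error(self, code, message=None, explain=None):
        # parse_request() calls this on bad input; just record the problem.
        self.error_code = code
        self.error_message = message


# Hypothetical captured request, for illustration only.
raw = (
    b"POST /submit HTTP/1.1\r\n"
    b"Host: example.test\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
)
req = HTTPDecoder(raw)
print(req.command, req.path, req.headers["Host"], req.request_data)
```

After construction, the parsed pieces are available as req.command, req.path, req.request_version and req.headers, the same attributes a real handler would see.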