May 9, 2007, 10:24 p.m.

Fast and Easy Token Classification in C

Occasionally when writing a C program you need to write a function that will take a string and match it against a known set and return a value for that string. The easy way to do that is with a bunch of if/else statements. That's not so difficult to maintain, but it's a fairly poor algorithm.

As an example, consider a month name to numeric month function as you'd find in a common log parser:

int parseMonth(const char *s) {
    int rv=-1;

    if(strncmp(s, "Jan", 3)==0) {
        rv=0;
    } else if(strncmp(s, "Feb", 3)==0) {
        rv=1;
    } else if(strncmp(s, "Mar", 3)==0) {
        rv=2;
    } else if(strncmp(s, "Apr", 3)==0) {
        rv=3;
    } else if(strncmp(s, "May", 3)==0) {
        rv=4;
    } else if(strncmp(s, "Jun", 3)==0) {
        rv=5;
    } else if(strncmp(s, "Jul", 3)==0) {
        rv=6;
    } else if(strncmp(s, "Aug", 3)==0) {
        rv=7;
    } else if(strncmp(s, "Sep", 3)==0) {
        rv=8;
    } else if(strncmp(s, "Oct", 3)==0) {
        rv=9;
    } else if(strncmp(s, "Nov", 3)==0) {
        rv=10;
    } else if(strncmp(s, "Dec", 3)==0) {
        rv=11;
    }

    return(rv);
}

I'm sure code identical to the above can be found in countless apps. It's not terrible, but it's O(n) (where n == number of possible tokens) and it took a lot of copying and pasting to get there.

The last couple of times I had to do this, I wrote a python program that will generate a search function for me based on some input. It generates a function that is roughly twice as fast as the naïve function on average.

To do the same month match, we start with the following mapping file:

Jan 0
Feb 1
Mar 2
Apr 3
May 4
Jun 5
Jul 6
Aug 7
Sep 8
Oct 9
Nov 10
Dec 11

Note that this is already a win because it's just the data we're interested in. Now, using the aforementioned python script, I generate the following new parseMonth function:

int parseMonth(const char *input) {
    int rv=-1;
    const char *p=input;
    int len=strlen(input);

    /* Short cut when the length falls outside of
        expected ranges */
    if(len < 3 || len > 3) {
        return rv;
    }

    /* Collapsing anything with fewer than 4 paths */
    /* 12 matches for "" */
    switch(p[0]) {
        case 'A':
            /* 2 match(es) for "A" */
            if(memcmp(p, "Apr\0", 4) == 0) {
                rv=3;
            } else if(memcmp(p, "Aug\0", 4) == 0) {
                rv=7;
            }
        break;
        case 'D':
            /* 1 match(es) for "D" */
            if(memcmp(p, "Dec\0", 4) == 0) {
                rv=11;
            }
        break;
        case 'F':
            /* 1 match(es) for "F" */
            if(memcmp(p, "Feb\0", 4) == 0) {
                rv=1;
            }
        break;
        case 'J':
            /* 3 match(es) for "J" */
            if(memcmp(p, "Jan\0", 4) == 0) {
                rv=0;
            } else if(memcmp(p, "Jun\0", 4) == 0) {
                rv=5;
            } else if(memcmp(p, "Jul\0", 4) == 0) {
                rv=6;
            }
        break;
        case 'M':
            /* 2 match(es) for "M" */
            if(memcmp(p, "Mar\0", 4) == 0) {
                rv=2;
            } else if(memcmp(p, "May\0", 4) == 0) {
                rv=4;
            }
        break;
        case 'O':
            /* 1 match(es) for "O" */
            if(memcmp(p, "Oct\0", 4) == 0) {
                rv=9;
            }
        break;
        case 'N':
            /* 1 match(es) for "N" */
            if(memcmp(p, "Nov\0", 4) == 0) {
                rv=10;
            }
        break;
        case 'S':
            /* 1 match(es) for "S" */
            if(memcmp(p, "Sep\0", 4) == 0) {
                rv=8;
            }
        break;
    }
    return rv;
}

Now, that is a *lot* more code, for sure, but there are two important things to consider:

It's auto-generated (make it part of your build system and you never have to see it).
It's over twice as fast

To qualify the second point, consider three runs each of these two functions. My test sequentially matches every valid value, and then an invalid value ten million times in a loop. First, the naïve version:

27.585u 0.022s 0:27.73 99.5%    0+0k 0+1io 0pf+0w
27.587u 0.018s 0:27.63 99.8%    0+0k 0+0io 0pf+0w
27.581u 0.014s 0:27.60 99.9%    0+0k 0+0io 0pf+0w

Next, our new generated version:

11.809u 0.018s 0:11.91 99.1%    0+0k 0+2io 0pf+0w
11.786u 0.011s 0:11.81 99.8%    0+0k 0+0io 0pf+0w
11.796u 0.018s 0:11.87 99.4%    0+0k 0+0io 0pf+0w

It was actually only twice as fast before I added the failing case (which I did to make sure my test was complete when I added another algorithm). In the naïve version, there were twelve string comparison tests before failing, while in the optimized version, there was a strlen and a switch over the first character. Worst case, is a couple of character switches and a couple of string compares.

The number of matches before branching is configurable. For our month example, setting this value to 13 will generate code that looks very much like the naïve example. Setting it to 2 means you'll never do a memcmp until you've got something that definitely distinctly matches a prefix. That does lead to slightly better performance, but it's not significant enough for me to care since the branches are all fairly small anyway. To be thorough, this is how it performed guaranteeing no more than one memcmp per parse:

10.352u 0.019s 0:10.45 99.1%    0+0k 0+1io 0pf+0w
10.348u 0.018s 0:10.40 99.5%    0+0k 0+0io 0pf+0w
10.342u 0.014s 0:10.38 99.7%    0+0k 0+0io 0pf+0w

Oh, and for those of you who are clever and make use of the fact that all of the tokens in this example are the same to implement it as a single strstr, that's not faster. It is less code, but it all depends on your priorities. I'll include it for completeness. Example code:

static char *months="JanFebMarAprMayJunJulAugSepOctNovDec";

int parseMonth(const char *s) {
    int rv=-1;
    char *p=strstr(months, s);

    if(p != NULL) {
        rv=(p - months) / 3;
    }

    return rv;
}

Results:

15.773u 0.030s 0:15.91 99.3%    0+0k 0+2io 0pf+0w
15.768u 0.021s 0:15.84 99.6%    0+0k 0+0io 0pf+0w
15.766u 0.020s 0:15.83 99.6%    0+0k 0+0io 0pf+0w

I'm sure none of this is interesting to anyone, but it was interesting to me. Feel free to download my search generator and using it for your token classifiers.

Table of Contents

Feeds

Fast and Easy Token Classification in C