Sprite Building with Stijl

One of the recurring site-optimization problems we face at work is reducing page load time through resource consolidation: JavaScript, CSS, and images. You know the story– the fewer requests you make, the quicker the page loads.

Some images can be consolidated in the form of sprite sheets. At work, our front-end developers tend to shy away from the use of sprite sheets because of the amount of energy invested in creating those sheets. In an agency environment, where developers frequently switch between projects throughout the day, dealing with the finicky details of maintaining these resources becomes a burden.

Of course, computers are really quite good at dealing with finicky details. My goal with this project was to keep interaction with the utility down to a minimum– plug and play, as it were. Using sprites should be a hands-off affair where the machine takes care of the precision for you.

In the simplest case, Stijl will install in an ASP.NET site in minutes, and will rewrite your stylesheets with injected sprite rules right out of the box. Check out the details over at Google Code.


EC2, Counters and Thread.Sleep

I wrote earlier about the difficulties of relying on the RTC (Real Time Clock) while running on virtual x64 Windows machines on EC2. As it turns out, the effects of the erratic clock behavior go a bit further than I had originally realized.

While I was doing some other performance research, I found that the output of performance counters was not accurate, at least when these counters are generated using the .Net PerformanceCounter class. So far, I’ve only looked at the RateOfCountsPerSecond32 type, but I wouldn’t be surprised if other time-based counters produce inaccurate results.

The data produced by this counter are generally good; most samples are an accurate reflection of the real data. However, every 6 seconds or so you’ll get a value which is way off. As it turns out, the pattern of bad values looks a whole lot like the bad values produced by the Stopwatch class. The bad values can, in turn, be tied to the jumps in the RTC. Check out this code:

int count = 0;
PerformanceCounter RateOfCountsPerSecond32 = ...;  // counter setup elided
Stopwatch sw = Stopwatch.StartNew();
DateTime start = DateTime.Now;
for (int i = 0; i < 200; ++i)
{
    if (i % 10 == 0)
        Console.WriteLine("{0,6:00,000}ms  {1,7:000,000}ms  {2,10:#,##0}  {3,7:#,##0.000}",
            (DateTime.Now - start).TotalMilliseconds,
            sw.ElapsedMilliseconds,
            count,
            RateOfCountsPerSecond32.NextValue());
    RateOfCountsPerSecond32.Increment();  // the operation being counted
    ++count;
    Thread.Sleep(100);                    // nominally ten operations per second
}

You would expect the output to be easily predictable, looking something like this (generated on my laptop):

DateTime  Stopwatch  Operations  Counter
--------  ---------  ----------  -------
00,000ms  000,000ms           0    0.000
01,041ms  001,038ms          10    9.627
02,041ms  002,043ms          20    9.958
03,052ms  003,054ms          30    9.890
04,053ms  004,057ms          40    9.968
05,053ms  005,061ms          50    9.958
06,074ms  006,079ms          60    9.830
07,075ms  007,082ms          70    9.968
08,075ms  008,086ms          80    9.958
09,076ms  009,090ms          90    9.963
10,077ms  010,094ms         100    9.954
11,097ms  011,110ms         110    9.848
12,098ms  012,113ms         120    9.964
13,099ms  013,118ms         130    9.955
14,099ms  014,121ms         140    9.966
15,100ms  015,125ms         150    9.952
16,111ms  016,130ms         160    9.964
17,111ms  017,133ms         170    9.969
18,112ms  018,137ms         180    9.951
19,113ms  019,142ms         190    9.955

But instead, it looks something like this (generated on an EC2 m1.large machine running Windows Server 2008):

DateTime  Stopwatch  Operations  Counter
--------  ---------  ----------  -------
00,000ms  000,000ms           0    0.000
01,092ms  052,758ms          10    0.190
02,183ms  053,850ms          20    9.159
03,275ms  054,942ms          30    9.158
04,367ms  056,033ms          40    9.166
05,459ms  057,124ms          50    9.162
06,550ms  161,551ms          60    0.096
07,642ms  162,642ms          70    9.163
08,734ms  163,734ms          80    9.160
09,826ms  164,825ms          90    9.160
10,917ms  165,917ms         100    9.164
12,009ms  167,008ms         110    9.164
13,101ms  271,434ms         120    0.096
14,193ms  272,526ms         130    9.161
15,284ms  273,618ms         140    9.159
16,376ms  274,710ms         150    9.158
17,468ms  275,801ms         160    9.160
18,560ms  276,893ms         170    9.159
19,651ms  277,985ms         180    9.158
20,743ms  382,411ms         190    0.096

So what’s happening here? You’ll notice that when the Stopwatch is tracking time accurately, the counter value is accurate. When the clock jumps forward, however, the counter value is a fraction of what it should be. My assumption is that either .NET or WMI is calculating the average “counts per second” value using the RTC, so when the RTC makes one of its erratic jumps forward, the rate is skewed substantially. The data bears this out: after the first jump the Stopwatch has gained about 52.7 seconds, and 10 operations divided by ~52.7 seconds gives the 0.190 in the table; the later ~104-second jumps produce the 0.096 values the same way.

Thread.Sleep is also a lot less accurate on EC2. Check out the values in the first column of both runs; the EC2 run shows a longer gap between samples. From what I can tell, Sleep is generally about as accurate as DateTime.Now– within 15ms or so. As a result, if you’re iteratively calling Thread.Sleep, you can build up a serious error. Here, where I’m calling Sleep(100) ten times in a row, the actual wait time is more like 1,100ms, not 1,000ms. If you alter the code above to sleep for 1ms, each sleep will actually take around 15ms, producing an error in the four-digit percentage range.
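You can see the drift for yourself with a loop like this (using DateTime subtraction, since Stopwatch can’t be trusted here):

DateTime start = DateTime.Now;
for (int i = 0; i < 10; ++i)
    Thread.Sleep(100);   // request 10 x 100ms = 1,000ms total
Console.WriteLine("Actual: {0:#,##0}ms", (DateTime.Now - start).TotalMilliseconds);
// per the runs above, EC2 prints closer to 1,100ms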

So, a word to the wise– don’t expect time-based operations to work exactly as they do on a physical system. Even accounting for the slop you’ll see in things like Thread.Sleep(), EC2 machines are off by a wide margin.

lock(), SpinLock and CompareExchange() Performance

Recently, I’ve been interested in the performance of different mechanisms for thread synchronization in .NET. There are quite a few of them:

  • native lock() syntax
  • EventWaitHandle
  • Mutex
  • SpinLock
  • SpinWait
  • Interlocked.* methods, such as CompareExchange()
  • Concurrent classes (new in .NET 4.0, under System.Collections.Concurrent)
  • Semaphore

Personally, I don’t find that any of these readily suggest what their performance profile might be. Is there an advantage to EventWaitHandle over lock() or Mutex? How substantial is the difference?

Each of these mechanisms differs slightly from the others, so there are very definite reasons for choosing one over another based on your particular needs. Here’s a quick run-down:


lock()

Useful for blocking access to a shared resource in a single process.

EventWaitHandle and Mutex

These allow threads to wait for or signal an event. E.g., “data is now ready, commence all parallel worker threads.” Both can be used to signal system-wide events across process boundaries.
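A minimal signaling sketch using EventWaitHandle (the handle name and flow are illustrative):

// workers block until the producer signals that shared data is ready
EventWaitHandle ready = new EventWaitHandle(false, EventResetMode.ManualReset);

// on each worker thread:
ready.WaitOne();   // blocks until Set() is called

// on the producer thread, once the data is prepared:
ready.Set();       // releases every waiting worker at once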


SpinLock

A very low-level, CPU-intensive lock. Good for situations where the lock is held for a very short amount of time.
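SpinLock’s API is a little unusual; a quick usage sketch:

SpinLock spin = new SpinLock();

bool lockTaken = false;
try
{
    spin.Enter(ref lockTaken);   // sets lockTaken to true once the lock is held
    // ... very short critical section ...
}
finally
{
    if (lockTaken)
        spin.Exit();
}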


SpinWait

Provides a low-level means for threads to wait while checking a condition. Generally more nuanced than SpinLock, but not as aggressive, either.
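A sketch of the typical SpinWait pattern: spin briefly, then progressively yield (dataReady here is an illustrative flag set by another thread):

SpinWait spinner = new SpinWait();
while (!dataReady)
    spinner.SpinOnce();   // busy-spins at first, then starts yielding the thread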

Interlocked.Exchange(), CompareExchange(), etc.

These are not really a normal means of thread synchronization; rather, they can be used to manage access to a shared resource without traditional locking. Code using the Interlocked methods usually assumes an operation will succeed, falling back to behavior more like SpinLock or SpinWait when two threads contend with one another.
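As a sketch of that optimistic pattern (the names here are mine):

static int _total;

static void Add(int amount)
{
    int seen, updated;
    do
    {
        seen = _total;            // optimistic read
        updated = seen + amount;
    } // commit only if no other thread changed _total in the meantime
    while (Interlocked.CompareExchange(ref _total, updated, seen) != seen);
}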

System.Collections.Concurrent Classes

The 4.0 runtime includes a new namespace with some slick, thread-safe implementations of Stack, Queue, Dictionary, and Bag.


Semaphore

Semaphores will block threads based on the value of the semaphore. Greater than zero? You get it. Zero? You have to wait until another thread increments the value. While this was traditionally the resource-locking device, it’s been supplanted by things like lock().
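For illustration, a semaphore that admits at most three threads at a time:

Semaphore pool = new Semaphore(3, 3);   // initial count 3, maximum 3

pool.WaitOne();       // decrements the count; blocks when it hits zero
try
{
    // ... use the limited resource ...
}
finally
{
    pool.Release();   // increments the count, waking one waiting thread
}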


I’ve been wondering for a while how much performance gain you would actually get out of using “lockless” (CAS/Interlocked-based) implementations for something like a stack or queue. It’s not a lot of overhead from the perspective of code size, but the level of complexity and testing that goes along with something like this shoots through the roof — few people really understand the code, and it’s very easy to make mistakes that only appear during optimization or in quirky thread-switching scenarios. Whatever advantage you gain from these approaches had better be well worth the pain.

To get some hard numbers, I decided to pit different lock implementations against each other in the implementation of a concurrent stack.

I wrote a quick test-bed which does the following:

  1. Adds 100,000 elements to a new stack
  2. Removes all the elements
  3. Starts 12 threads against a stack seeded with 24 elements and has each thread Push and Pop as much as possible for 10 seconds (high contention)

Steps 1 and 2 are done with a single thread, then with 12 threads in parallel.

I wrote up stack implementations (just Push and Pop) which use the following synchronization approaches:

  • lock()
  • SpinLock
  • SpinWait (using Interlocked.CompareExchange on an instance-level int)
  • EventWaitHandle
  • Mutex
  • Interlocked.CompareExchange with a linked list
  • ConcurrentStack

The only non-trivial implementation is the Interlocked.CompareExchange/Linked list approach. I can post the code if you’re really curious.
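In the meantime, here’s the general shape of that approach– a Treiber-style stack where CompareExchange publishes a new head node only if no other thread has moved the head in the interim. This is a sketch of the technique, not my exact test code:

class CasStack<T>
{
    class Node { public T Value; public Node Next; }
    Node _head;

    public void Push(T value)
    {
        Node node = new Node { Value = value };
        Node snapshot;
        do
        {
            snapshot = _head;       // optimistic read of the current head
            node.Next = snapshot;   // link the new node in front of it
        } // retry if another thread changed _head between read and write
        while (Interlocked.CompareExchange(ref _head, node, snapshot) != snapshot);
    }

    public bool TryPop(out T value)
    {
        Node snapshot;
        do
        {
            snapshot = _head;
            if (snapshot == null) { value = default(T); return false; }
        } // claim the head only if it is still the node we read
        while (Interlocked.CompareExchange(ref _head, snapshot.Next, snapshot) != snapshot);
        value = snapshot.Value;
        return true;
    }
}

(A production version has to worry about subtleties well beyond this, which is exactly why I say the complexity shoots through the roof.)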

Here’s how things broke down (numbers are thousands of ticks; the High Contention column is seconds to complete one million operations):

Mono 2.6.3 on an i7-920 (Mono does not have implementations for Spin*)

                   Add   Remove  Parallel Add  Parallel Remove  High Contention (1M ops)
Mutex            558.6    583.8       8,120.4          8,169.4  10.904s
EventWaitHandle  514.8    454.4       8,173.7          8,485.5  10.269s
lock()            30.7     29.5         571.5            667.3   0.720s
CompareExchange   67.2     20.9          97.3            134.5   0.100s

Vista/CLR4.0 on an i7-920

                     Add   Remove  Parallel Add  Parallel Remove  High Contention (1M ops)
Mutex            1,744.7  1,723.9       5,088.3          5,108.2  4.500s
EventWaitHandle  1,735.1  1,609.2       4,496.4          4,819.2  4.073s
SpinLock            93.4     75.0         246.3            474.2  0.258s
SpinWait            77.8     57.9         196.5            352.2  0.186s
lock()              65.5     61.8         121.4            257.0  0.161s
CompareExchange     52.8     36.1         115.9            295.9  0.108s
ConcurrentStack     49.4     39.4          98.0            286.0  0.117s

Surprised by parallel times being longer than single-threaded? You shouldn’t be — we’re forcing all threads to wait while only one thread at a time can access the underlying data structure. This highlights the cost of contention for that data.


As you can see from the data, the performance of lock() is fairly close to an optimized, lockless approach. Everything else is actually slower.

So, unless you’re really fighting for every last ounce of muscle from your application, coding obscure, complicated classes which use CompareExchange is not going to provide you with a substantial benefit. Stick with lock(), or use one of the concurrent classes in .NET 4.0 if you can.

One other surprise for me is the relative performance of Mono vs. MS-CLR. lock() is actually faster in Mono unless there is contention, in which case it’s noticeably slower.

In case you’re curious, ConcurrentStack is using CompareExchange as the fundamental means for controlling access to the stack, plus SpinWait and some other tricks to optimize performance. It’s a little more posh than my bare-minimum CompareExchange implementation.

EC2, Stopwatch, and 64-bit wormholes

So, a while back, we had planned on running some load tests on public-facing webservers from Amazon EC2. For a company of 150 (or so), it’s just not practical to acquire a dozen machines and engineer sufficient bandwidth for a one-week testing engagement.

Getting the machine images up and running and the testing software installed was simple enough. However, once we started running tests, we found that the timing was wildly incorrect– web pages reportedly taking two minutes to load when the test was only 15 seconds in.

After digging around a bit, I found out that accessing the Real Time Clock (RTC) on 64-bit instances produces a very-nearly useless series of numbers. I wrote some code like this:

Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < 10; ++i)
{
    Thread.Sleep(1000);  // one-second intervals, per the output below
    Console.WriteLine("Elapsed time: {0:#,##0}ms", sw.ElapsedMilliseconds);
}

Which produced a series like this:

  Elapsed time: 1,007ms
  Elapsed time: 2,011ms
  Elapsed time: 3,013ms
  Elapsed time: 55,412ms
  Elapsed time: 56,422ms

So, where the testing software was using the RTC, the times it recorded were all over the map. According to Amazon, this is a known issue with the current version of Xen (the virtualization software they use) running on 64-bit hardware. They’re working with MS to fix this, but I have no idea what the priority is for either party.

In the meantime, however, what to do? If you have access to the code for your application, you can rewrite time measurement to use an alternate technique.

DateTime subtraction will give you accuracy of around 15ms. I have also heard, but not verified, that getting the difference in calls to System.Environment.TickCount will yield accuracy of around 1ms.
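Both techniques side by side, as a sketch:

DateTime start = DateTime.Now;
int startTicks = Environment.TickCount;

// ... the work you want to time ...

double elapsedMs = (DateTime.Now - start).TotalMilliseconds;   // ~15ms resolution
int elapsedTicks = Environment.TickCount - startTicks;         // reportedly ~1ms; unverified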

If, however, you can’t change the code… you’re in a bit of a bind. From what I’ve seen, 32-bit instances don’t exhibit the problem, so you could conceivably step down to one of those. If you can scale out your operation, that is. If you need the 64-bit instances for memory or raw CPU power, you’re destined to be randomly transported 50 seconds into the future every few seconds.

Pro tip: this is bad for meeting deadlines.

Luhn One-Liner

I recently discovered an absolutely terrifying implementation of the Luhn credit card checksum algorithm (AKA mod10). Think lots of integer.ToString().Length and other idiomatically abhorrent, cargo-cult programming techniques. In response to the pain that resulted, my esteemed colleague, Mr. Dailey, and I came up with this one-liner:

bool isValid = cardNumber
    .Select((c, i) =>
       c - '0' << (i & 1 ^ cardNumber.Length & 1 ^ 1))
    .Select(d=> d > 9 ? d - 9 : d)
    .Sum() % 10 == 0;
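
For a quick sanity check, run it against a well-known test number:

string cardNumber = "4111111111111111";   // the standard Visa test number, which is Luhn-valid
// ... one-liner from above ...
Console.WriteLine(isValid);               // True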

Yes, obviously, clarity has already gone out the window. But sometimes it’s nice to make the programming equivalent of a Celtic knot.

Velocity versus Memcached

At work, we recently needed to choose a distributed cache provider. The cache is for reducing hosting costs on EC2, where the ultimate backing store incurs a per-request fee (albeit in absurdly fractional cents). “Distributed” is necessary to mitigate random server failures, not so much to handle an overwhelming load.

But I digress. In this situation, we’re looking at a fairly narrow set of criteria: TCO on EC2 and raw performance from a simple put/get perspective. However, I found a complete lack of performance comparisons online; most of what exists is outdated or applies to the wrong machine architecture.

x86 vs x64

Since we’re not encumbered by much legacy code (well, none which can’t be mitigated by a little SOA sleight-of-hand), there’s no advantage to deliberately accepting the memory constraints of the 32-bit world. This does, however, present a bit of a problem for memcached; the only Win64 version is produced by Northscale (here), and their licensing info is strangely absent. It’s currently based on Memcached v1.4.4.

There’s a financial advantage to running this on Linux, as EC2 Linux machines cost about 30% less than their Windows counterparts. So, ideally, our memcached servers will be pure Linux, regardless of the rest of our architecture (.NET 4.0 on Windows). However, for the sake of consistency between tests here, I’ve used the memcached windows service instead of a Linux machine.

Test Setup

I ran tests with 1,000, 10,000, 100,000, and 1,000,000 iterations to compare results. Velocity and memcached were restarted between runs.

The software in use is:

  • Velocity CTP3
  • Northscale Memcached Windows x64 service v1.4.4
  • Enyim v1.2.0.8
  • Windows 7
  • .NET 4.0 client (built from VS2010 RTM)

The test application ran get and put operations on a single memcached/velocity service running on the same machine as the test application. Both serial and parallel operations (8 threads) were tested. Data sizes ranged from around 30 to roughly 500 bytes (small to be sure).
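For reference, the per-operation timing on the memcached side amounts to a loop along these lines (a sketch using the Enyim client; the key names and payload are illustrative, not the actual test-bed code):

using Enyim.Caching;
using Enyim.Caching.Memcached;

MemcachedClient client = new MemcachedClient();
int iterations = 100000;
DateTime start = DateTime.Now;
for (int i = 0; i < iterations; ++i)
    client.Store(StoreMode.Set, "key" + i, "a small payload, 30-500 bytes");
double secPerPut = (DateTime.Now - start).TotalSeconds / iterations;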


Let’s get down to brass tacks, then, shall we? Based on the results from the tests, memcached was consistently 5-10 times faster than Velocity and used about one-third of the memory.

No, that’s not a typo– memcached trounced Velocity in this setup.

For simple string gets and puts, the margin was around 10 times (0.507 versus 0.047 sec/put). With objects, it fell to about 6 times (0.500 versus 0.081 sec/put). I assume the difference is related to serialization overhead, which accounts for a more substantial portion of the time spent on object-based operations.

I didn’t find a noticeable difference between get/put or serial/parallel operations, at least not at the scale I was testing. The number of cached items did not appear to alter the speed at which requests were serviced, which is consistent with the advertised O(1) behavior.


There are substantial feature differences between Velocity and memcached, mostly in favor of Velocity, which I’m not going to address here. I have gone with Velocity in another context based solely on its support for tags.

To be clear, there’s a lot I’m not testing here– the operation of a full, distributed cache cluster and the different configuration options available through Velocity.

I do plan on doing more extensive tests which go beyond the relatively narrow scope of these.

Limpet Chrome Extension

From a certain viewpoint, the history of development tools has been one of ever-decreasing opacity. Thirty years ago, debuggers were terrifying and not for the faint of heart. This was back in the olden days, when men were men and bonesaws were a standard tool of the medical trade. Running GDB was a painful experience (easier to spray your code with printf buckshot than step through the code from the command line). The latest generation of visual debuggers (VS2010, Eclipse) and profilers (like ANTS) provide the equivalent of an MRI for your code.

While it’s understandable that software you release in the form of compiled binaries is outside your diagnostic reach, you want your SOA-based systems to be more accessible. They are, after all, within your administrative grasp– you control the environment in which they run (or, at least, someone at your organization does). The irony is that those systems are often even more opaque than remote client installations, hidden behind firewalls and typically locked down in an aggressive security model.

I work in an agency environment where you often split your time across multiple projects which can start to bear a striking resemblance to each other. When you’re on call and launch into troubleshooting in bed at 3am, that opacity can get a little frustrating (do I have to NAT my traffic through the corporate network to remote into the servers, or was there some other way to authenticate with the server?) And when you’re half-awake, you tend to resent waiting for your 50-hop round trip to a datacenter 4,800 miles away.

Obviously, the reasons for this opacity are twofold– you reduce your attack surface and you obscure your potentially ugly and quirky internals from the innocent and unsuspecting public.

But, given the sheer number of things one can bury in an HTTP/HTML stream, there surely must be a way to embed diagnostic data in your sites without compromising the system (for a good example of constructively hidden data, look at the hints embedded in some sites for screen readers and the visually impaired). It seems that a de facto standard is burying some details in an XML comment somewhere on the page, which savvy technicians can dig out when necessary. That works great until your site ends up on The Daily WTF or someone exploits the data found therein for mysteriously deep ecommerce savings.

Since Google released their API for extending Chrome, I’ve been intrigued by their model of an HTML-based UI. When compared to developing a Firefox extension, I’ll take HTML5 and jQuery (or prototype) over XUL any day of the week.

Cracking this diagnostic nut and reading up on Chrome extensions seemed like a good pairing. And so, Limpet was born. It’s an extension which allows you to embed diagnostic data on a page, but only expose it to approved parties. The data is only included if you submit the right request (using a cookie as a sort of password), from the right IP (custom IP restrictions on the server side), and the response is only understandable if the browser knows the password (256-bit AES encrypted).

Check out the writeup on Google Code– I think the project can be very useful in extending diagnostic data out to the public, and increasing the transparency of your public-facing web servers.

.NET Base64 encoding stream

While working on an extension for Chrome, I found that there is no .NET implementation of a stream which encodes binary data as a Base64 ASCII stream. The implementation (as you might imagine) is relatively simple, assuming your usage is relatively straightforward. There’s an issue with the terminal block (base64 emits four output characters for every three input bytes), where you need to inform the stream that the final, possibly partial block should be committed to the underlying stream. In my case, I was able to use Flush(), but one might also adopt the pattern the .NET crypto API uses, where you provide a FlushFinalBlock() method which does that additional work.

/// <summary>
/// A simple stream which converts written data to base64-encoded ASCII
/// data on the underlying stream.
/// </summary>
/// <summary>
/// A simple stream which converts written data to base64-encoded ASCII
/// data on the underlying stream.
/// </summary>
public class Base64EncodingStream : Stream
{
    // An array which maps acceptable base64 values (0-63) to ASCII characters
    static byte[] _intToChar;

    /// <summary>
    /// Static constructor, initializes the byte array
    /// </summary>
    static Base64EncodingStream()
    {
        _intToChar = System.Text.Encoding.ASCII.GetBytes(
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/");
    }

    // the underlying stream, to which the base64-encoded ASCII data is written
    Stream _baseStream;
    // the current offset within the 3-byte input block (0 - 2)
    int _cbit = 0;
    // leftover bits to be folded into the next output character
    byte _remaining = 0;
    // the working 4-character output block
    byte[] _working = new byte[4];

    /// <summary>
    /// Constructor taking a base stream
    /// </summary>
    /// <param name="baseStream">The stream to which base64 data is written</param>
    public Base64EncodingStream(Stream baseStream)
    {
        _baseStream = baseStream;
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        for (int i = 0; i < count; ++i)
            WriteByte(buffer[offset + i]);
    }

    public override void WriteByte(byte value)
    {
        // each input byte spans two output characters; bitOffset is the number
        // of low-order bits deferred to the next output character
        int bitOffset = (_cbit * 2) + 2;
        _working[_cbit] = _intToChar[(value >> bitOffset) + _remaining];
        if (_cbit == 2)
        {
            // the third input byte completes a 4-character block; emit it
            _working[3] = _intToChar[value & 0x3f];
            _baseStream.Write(_working, 0, 4);
            _remaining = 0;
        }
        else
        {
            // stash the leftover low bits for the next output character
            _remaining = (byte)(value << (6 - bitOffset) & 0x3f);
        }
        _cbit = (_cbit + 1) % 3;
    }

    // (the remaining Stream overrides and the Flush() which pads and commits
    // the terminal block are in the full file linked below)
}

You can catch the whole file here, if you want to steal the whole thing. I needed to put the terminal flush into the normal Flush(), since I don’t have control of the stream once I create it. You may want it in FlushFinalBlock() instead (of course, you should be calling that from Close() anyway; this is only necessary where you need to explicitly flush your end-of-stream).
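
For what it’s worth, typical usage looks something like this (the file name is illustrative, and this assumes the full file’s Flush() implementation described above):

using (Stream output = File.Create("encoded.txt"))
using (Base64EncodingStream b64 = new Base64EncodingStream(output))
{
    byte[] data = System.Text.Encoding.ASCII.GetBytes("Hello, world");
    b64.Write(data, 0, data.Length);
    b64.Flush();   // commits the terminal block before the stream is closed
}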

Ektron 7.5 and Server.Transfer()

If you’ve ever used Ektron to assemble a design-conscious, non-brochure web site, you may have found it lacking. I’m trying to be diplomatic. It’s really one of the most profoundly frustrating software packages I’ve ever worked with (and that includes the aptly named “curses”). The frustration comes primarily from it being almost effective everywhere you use it.

Notice I did not say, “effective almost everywhere you use it”. Instead, you can get 98% of your project done perfectly well. But then bad things happen.

Take, for example, a recent attempt to use Server.Transfer() in the code-behind of a page. The server choked, hard, with “Error executing child request.” No hints whatsoever as to the cause, but on a whim, I removed the Ektron URL-rewriting HTTP handler from web.config and, presto, things started working. It appears the problem is triggered by the page you’re transferring to; reverting that page to the standard PageHandlerFactory made the error disappear:

<add verb="*" path="*/transfer_to.aspx" type="System.Web.UI.PageHandlerFactory" />

Visual Studio Load Agent on EC2

Has anyone else tried to create a VS2008 Test Rig on EC2? I just suffered through this, and it was painful enough to write about.

The main problem is that the Load Agent requires the name/IP of the Load Controller during installation. It then tucks that safely away in a registry setting. When you register the agent with the controller (by name), that gets stored in an XML config file on the controller.

Oh– and the controller must register the agent by machine name, which EC2 assigns dynamically. This is a bit of a problem. I got around it by adding hosts entries for each of the load agent names.

As you might guess, all this configuration is destroyed when you terminate your instances. And who wants to re-install Load Agent on each of the agent machines? Here’s my solution:

I have three images:

  • UI machine (running VS2008 Team Test Edition)
  • Controller
  • Agent

On the controller, I have a PowerShell script that does the following:

  • Connects to EC2 to find all the instances of the agent (by image id)
  • Adds a new host entry to the controller for each instance
  • Modifies each agent’s registry so it points to the controller’s internal IP
  • Adds the agent to the controller’s list of registered load agents (in the XML config file)
  • Restarts the load agent service on each of the agent machines
  • Restarts the load controller service on the controller

So, when I’m starting up my rig, I need only fire up all my machines, RDP into the controller, run that script, and everything’s ready to rock.

If I were a more clever PowerShell and Windows admin hacker, I would be able to launch the instances and make these changes from my desktop. Since this is the first PowerShell script I ever wrote, though… I think it’s serviceable. Some other kindly internet person might spruce it up if so inclined.

Here’s the script, if you’d like to take a look at it. It depends on PowerEC2Dream to manage the communication with EC2.

Addendum: There are a few more caveats for doing this.

First and foremost, you should make an alias for the test controller machine and use that when setting up the test rig. If you keep changing the name/IP of the controller, all your results will end up categorized under the different controller names on the “Test Runs” tab. I added “TestController” as a hosts entry. Ideally the script would set that up for you, too.

But don’t try to use “TestController” in the database connection string! Apparently all the machines connect to it independently– just plug the IP address of that server into your “Administer Test Controllers” setup, unless you want to add a new hosts entry to every machine in the rig.