But it worked before — stability and assumptions

Peter Naulls
Adventures in Software Development
3 min read · Dec 27, 2019


Will it fall over?

Never underestimate the role of luck in correct software operation

Yes, I made that up, and yes, you can quote me.

There’s a dirty secret in software development: much of it is made with shortcuts, assumptions, incomplete testing and bad design. With time and effort, all of that can be polished, but new software can do weird things. There’s a reason we have prototypes: to test ideas.

And as usual, prototypes rapidly become the real software. Or at least, they begin the journey toward real software. Along the way are demos, reworks, fixes and workarounds for design mistakes, and so on.

Did I mention it was a prototype? That means that the software tends to work in a very specific way, with a good wind behind it, and a good deal of luck. Prototypes have a way of working perfectly the very first time they are tried, and then quickly hitting the real world.

This can give the impression that something is working famously, when it’s not.

How things are

In operating systems like Windows, there’s a lot of separation between subsystems; necessarily so, since so many people work on them. APIs for drivers (software which talks directly to hardware) remain stable for many years, and even if a driver is updated or the OS has fixes, you can, by and large, expect things to keep working.

On smaller systems, not so much. They are developed by a small team (sometimes only a couple of people) with cross-system responsibility. Often, adding a feature or driver means a rework of interfaces. This is unfortunate, but hard to avoid.
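
To make that concrete, here is a minimal sketch in C, with entirely invented names, of the kind of interface rework that breaks every existing driver on a small system:

/* Hypothetical in-house driver interface, version 1: every
 * sensor driver implements these two callbacks. */
struct sensor_ops_v1 {
    int (*init)(void);
    int (*read)(int *value);
};

/* Version 2: a rework adds a context pointer so drivers can keep
 * per-device state. Every existing driver must now be updated;
 * anything built against version 1 no longer matches. */
struct sensor_ops_v2 {
    int (*init)(void *ctx);
    int (*read)(void *ctx, int *value);
};

On Windows, that kind of change to a public driver API would be avoided for years. On a two-person embedded project, it can happen in an afternoon.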

Indeed, there is an essay in the Linux kernel documentation that talks about the follies of such a “stable” API:

https://raw.githubusercontent.com/torvalds/linux/master/Documentation/process/stable-api-nonsense.rst

The point of this rant is that the Linux kernel, like so many other things, is a living thing, with ongoing development, and a “stable” interface is sometimes not possible or desirable.

Narrow Design

Many well-designed and otherwise seemingly robust products balance on a knife edge. Software is written to work within the confined constraints of a system, and hardware is designed to meet specific needs. If things are changed due to new features, a rework for a new product, or hardware changes, then sometimes things can fall apart quickly as a cascade of assumptions is challenged, or old design flaws have to be reworked.
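
As a concrete (and invented) example of such a narrow assumption, consider a packet handler that bakes in the size the original hardware happened to deliver:

#include <string.h>

/* Contrived sketch: the original hardware always delivered
 * 64-byte packets, so the size was baked in everywhere. */
#define PACKET_SIZE 64

static unsigned char buffer[PACKET_SIZE];

void handle_packet(const unsigned char *data, size_t len)
{
    /* The original code assumed len == PACKET_SIZE and never
     * checked. It worked for every packet the old hardware sent. */
    memcpy(buffer, data, len);
}

The moment a hardware revision sends 128-byte packets, this overruns the buffer, and the cascade of “but it worked before” begins.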

Even the most robustly designed and broad-ranging system eventually has to be changed to meet new demands, and can come face to face with this.

Luck

And finally, of course, luck can play an outsized role. Sometimes during a prototype or test, nothing that could go wrong actually does. It happens, and I’d suggest it happens quite a lot. Sometimes long-standing bugs simply go unnoticed in the field, because users didn’t push the system in just the right way (or didn’t notice). Sometimes the network conditions are just good enough that degrading performance goes unnoticed. And so the Jenga tower continues to stand.
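
For a flavor of what a “lucky” bug looks like, here is a contrived C sketch: a variable that is never set on every path, but that happens, on the tested platform, to start out as zero:

#include <stdio.h>

/* Contrived for illustration: reading status uninitialized is
 * undefined behaviour in C. On many systems a fresh stack happens
 * to be zeroed, so the check passes for years, until a new
 * compiler, platform, or calling context changes that overnight. */
static int is_ready(void)
{
    int status;          /* bug: not initialized */
    /* ... setup code that was supposed to set status,
       but misses a path ... */
    return status == 0;  /* “works” whenever the stack is clean */
}

int main(void)
{
    printf("ready: %d\n", is_ready());
    return 0;
}

Nothing about this code is correct, and yet it can pass every test and run in the field for years.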

Where does this leave us?

And so, to the casual observer, and sometimes to engineers who probably should know better: just because something works once doesn’t mean it will work a second time.

And in the case of that driver setup, just because the driver worked against a previous version of the system software doesn’t mean it will continue to do so. And just because something worked before doesn’t mean it doesn’t need a lot more work to be robust.

But good luck explaining all this to a layperson who saw the demo working.
