This Place is Taken: The Origin of 0 index arrays

Tuesday, December 10, 2013

The Origin of 0 index arrays

 

Sometimes somebody says something to me, like a whisper of a hint of an echo of something half-forgotten, and it lands on me like an invocation. The mania sets in, and it isn’t enough to believe; I have to know.

I’ve spent far more effort than is sensible this month crawling down a rabbit hole disguised, as they often are, as a straightforward question: why do programmers start counting at zero?

Now: stop right there. By now your peripheral vision should have convinced you that this is a long article, and I’m not here to waste your time. But if you’re gearing up to tell me about efficient pointer arithmetic or binary addition or something, you’re wrong. You don’t think you’re wrong and that’s part of a much larger problem, but you’re still wrong.

For some backstory, on the off chance anyone still reading by this paragraph isn’t an IT professional of some stripe: most computer languages including C/C++, Perl, Python, some (but not all!) versions of Lisp, many others – are “zero-origin” or “zero-indexed”. That is to say, in an array A with 8 elements in it, the first element is A[0], and the last is A[7]. This isn’t universally true, though, and other languages from the same (and earlier!) eras are sometimes one-indexed, going from A[1] to A[8].

While it’s a relatively rare practice in modern languages, one-origin arrays certainly aren’t dead; there’s a lot of blood pumping through Lua these days, not to mention MATLAB, Mathematica and a handful of others. If you’re feeling particularly adventurous Haskell apparently lets you pick your poison at startup, and in what has to be the most lunatic thing I’ve seen on a piece of silicon since I found out the MIPS architecture had runtime-mutable endianness, Visual Basic (up to v6.0) featured the OPTION BASE flag, letting you flip that coin on a per-module basis. Zero- and one-origin arrays in different corners of the same program! It’s just software, why not?

All that is to say that starting at 1 is not an unreasonable position at all; to a typical human thinking about the zeroth element of an array doesn’t make any more sense than trying to catch the zeroth bus that comes by, but we’ve clearly ended up here somehow. So what’s the story there?

The usual arguments involving pointer arithmetic and incrementing by sizeof(struct) and so forth describe features that are nice enough once you’ve got the hang of them, but they’re also post-facto justifications. This is obvious if you take the most cursory look at the history of programming languages; C inherited its array semantics from B, which inherited them in turn from BCPL, and though BCPL arrays are zero-origin, the language doesn’t support pointer arithmetic, much less data structures. On top of that other languages that antedate BCPL and C aren’t zero-indexed. Algol 60 uses one-indexed arrays, and arrays in Fortran are arbitrarily indexed – they’re just a range from X to Y, and X and Y don’t even need to be positive integers.

So by the early 1960′s, there are three different approaches to the data structure we now call an array.

  • Zero-indexed, in which the array index carries no particular semantics beyond its implementation in machine code.
  • One-indexed, identical to the matrix notation people have been using for quite some time. It comes at the cost of a CPU instruction to manage the offset; usability isn’t free.
  • Arbitrary indices, in which the range is significant with regards to the problem you’re up against.

So if your answer started with “because in C…”, you’ve been repeating a good story you heard one time, without ever asking yourself if it’s true. It’s not about*i = a + n*sizeof(x) because pointers and structs didn’t exist. And that’s the most coherent argument I can find; there are dozens of other arguments for zero-indexing involving “natural numbers” or “elegance” or some other unresearched hippie voodoo nonsense that are either wrong or too dumb to rise to the level of wrong.

The fact of it is this: before pointers, structs, C and Unix existed, at a time when other languages with a lot of resources and (by the standard of the day) user populations behind them were one- or arbitrarily-indexed, somebody decided that the right thing was for arrays to start at zero.

So I found that person and asked him.

His name is Dr. Martin Richards; he’s the creator of BCPL, now almost 7 years into retirement; you’ve probably heard of one of his doctoral students Eben Upton, creator of the Raspberry Pi. I emailed him to ask why he decided to start counting arrays from zero, way back then. He replied that…

As for BCPL and C subscripts starting at zero. BCPL was essentially designed as typeless language close to machine code. Just as in machine code registers are typically all the same size and contain values that represent almost anything, such as integers, machine addresses, truth values, characters, etc. BCPL has typeless variables just like machine registers capable of representing anything. If a BCPL variable represents a pointer, it points to one or more consecutive words of memory. These words are the same size as BCPL variables. Just as machine code allows address arithmetic so does BCPL, so if p is a pointer p+1 is a pointer to the next word after the one p points to. Naturally p+0 has the same value as p. The monodic indirection operator ! takes a pointer as it’s argument and returns the contents of the word pointed to. If v is a pointer !(v+I) will access the word pointed to by v+I. As I varies from zero upwards we access consecutive locations starting at the one pointed to by v when I is zero. The dyadic version of ! is defined so that v!i = !(v+I). v!i behaves like a subscripted expression with v being a one dimensional array and I being an integer subscript. It is entirely natural for the first element of the array to have subscript zero. C copied BCPL’s approach using * for monodic ! and [ ] for array subscription. Note that, in BCPL v!5 = !(v+5) = !(5+v) = 5!v. The same happens in C, v[5] = 5[v]. I can see no sensible reason why the first element of a BCPL array should have subscript one. Note that 5!v is rather like a field selector accessing a field in a structure pointed to by v.

This is interesting for a number of reasons, though I’ll leave their enumeration to your discretion. The one that I find most striking, though, is that this is the earliest example I can find of the understanding that a programming language is a user interface, and that there are difficult, subtle tradeoffs to make between resources and usability. Remember, all this was at a time when everything about the future of human-computer interaction was up in the air, from the shape of the keyboard and the glyphs on the switches and keycaps right down to how the ones and zeros were manifested in paper ribbon and bare metal; this note by the late Dennis Ritchie might give you a taste of the situation, where he mentions that five years later one of the primary reasons they went with C’s square-bracket array notation was that it was getting steadily easier to reliably find square brackets on the world’s keyboards.

“Now just a second, Hoye”, I can hear you muttering. “I’ve looked at the BCPL manual and read Dr. Richards’ explanation and you’re not fooling anyone. That looks a lot like the efficient-pointer-arithmetic argument you were frothing about, except with exclamation points.” And you’d be very close to right. That’s exactly what it is – the distinction is where those efficiencies take place, and why.

BCPL was first compiled on an IBM 7094here’s a picture of the console, though the entire computer took up a large room – running CTSS – theCompatible Time Sharing System – that antedates Unix much as BCPL antedates C. There’s no malloc() in that context, because there’s nobody to share the memory core with. You get the entire machine and the clock starts ticking, and when your wall-clock time block runs out that’s it. But here’s the thing: in that context none of the offset-calculations we’re supposedly economizing are calculated at execution time. All that work is done ahead of time by the compiler.

You read that right. That sheet-metal, “wibble-wibble-wibble” noise your brain is making is exactly the right reaction.

Whatever justifications or advantages came along later – and it’s true, you do save a few processor cycles here and there and that’s nice – the reason we started using zero-indexed arrays was because it shaved a couple of processor cycles off of a program’s compilation time. Not execution time; compile time.

No comments:

Post a Comment