## A brief primer on computing number systems

Over at Long Trance, I recently made a somewhat snarky post about third-party Twitter apps that use signed 32-bit integers to hold message IDs, and heaped criticism on them: they must be bad programmers if they did that, so they deserve to fail!

But it occurred to me that this comment should probably be put in a bit of context for everyone who hasn’t taken two years of Computer Science in college.

First, let’s back all the way up: computers are inherently binary, which means everything can be broken down into two states: on and off. On a very fundamental level, computers have no sense of, for example, the number 3 or 7 or 128. They just know “on” and “off”. They signify this binary choice with “bits”, or “binary digits”, where each bit is just a single on/off flag.

So you can use a single bit to represent two numbers: zero and one. Likewise, you can use two bits to represent four numbers: zero, one, two, and three. The computer still just sees them as a series of bits, but by treating those two bits as one unit, programmers can do quite a lot.

This process can also be extended and scaled: if one bit represents two numbers, and two bits represent four, then three bits represent eight different numbers and four bits represent sixteen different numbers. Each additional bit doubles the count. (If you’re curious how this works, this is the best explanation I can find, but I’m still not really happy with it.)
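The doubling can be sketched in a couple of lines of Python, just counting the combinations of on/off flags:

```python
# Each added bit doubles the number of distinct on/off combinations,
# so n bits can represent 2**n different values.
for n in range(1, 5):
    print(n, "bits ->", 2 ** n, "values")

# One bit: 2 values; two bits: 4; three bits: 8; four bits: 16.
assert [2 ** n for n in (1, 2, 3, 4)] == [2, 4, 8, 16]
```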

Here’s the interesting question, though: let’s say we’re using four bits to represent sixteen numbers. *Which* sixteen numbers? As shown earlier, the simplest thing to do is just count the “first” sixteen numbers: zero through fifteen. (In computer science, counting starts from zero instead of one, because the all-off bit pattern is itself a perfectly good value and it would be wasteful to skip it.)

Well, zero through fifteen will work great if we’re recording, say, the par for a given hole of golf: it’s always a positive number. It will never be zero (that’d be a weird golf course) and it’s unlikely to be fifteen, but it will fit nicely in between those extremes. Four bits will work because the par of the hole will never be more than fifteen or less than zero.

But what if we’re trying to store how well a golfer did on the hole versus the par? Golfer A sinks a Par 5 in 7 strokes: his difference is 2. Golfer B sinks that same hole in 5 strokes: his difference is 0. But what about their jerk friend Golfer C who always comes along and ruins the fun by sinking it in 3 strokes? Well, not only did he ruin the day, he just broke our number system.

How do we store a negative 2? We change *which* sixteen numbers we’re representing with those four bits. Instead of representing zero through fifteen, we’ll still represent sixteen different numbers, but start those numbers at negative eight and go up, giving us a range of negative eight to positive seven.

The short explanation of how we do this is to use one of our four bits to indicate whether the number is negative or not. That basically just leaves us three bits to actually count with, so we can only go up to seven. The reason we start at negative eight, reaching one step farther on the negative side than the positive, is all thanks to Two’s Complement, the number system that lets computers work their magic. Unfortunately, that’s outside the scope of this post (it took about a week to explain it in my Assembler class), and the Wikipedia article on the topic sucks, so you’ll have to take it on faith.
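You can at least see the effect without the week-long explanation. Here’s a small Python sketch (the helper name `to_signed` is mine, not anything standard) that reinterprets a four-bit pattern as a two’s-complement signed number:

```python
def to_signed(bits: int, width: int = 4) -> int:
    """Interpret a `width`-bit pattern as a two's-complement signed integer."""
    sign_bit = 1 << (width - 1)          # the top bit marks negative numbers
    return bits - (1 << width) if bits & sign_bit else bits

# Jerk Golfer C's score vs. par: the pattern 1110 means -2.
assert to_signed(0b1110) == -2
# The four-bit range runs from -8 (pattern 1000) up to +7 (pattern 0111).
assert to_signed(0b1000) == -8
assert to_signed(0b0111) == 7
```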

When a number uses one of its bits to indicate whether it is positive or negative, we call it “signed”, so our example above was a “signed four-bit integer”. Integer just means that it only holds mathematical integers (i.e. whole numbers) and doesn’t mess around with fractions and decimals: you can’t be half a stroke ahead of par, for example.

Using the above example again: it doesn’t make sense to use a signed integer to represent the par for the hole, because par will never be negative. When we use a number that will always be positive, we call that “unsigned”, as in “unsigned four-bit integer”.

Switching gears back to the origin of this post: every Twitter message is given a unique ID number, which I’ll call its TweetID. The most important thing about a TweetID is that only one message ever has that ID. If two messages had the same ID, we wouldn’t know which one the user wanted when he asked for tweet number whatever.

The easiest way to do this is just to start at 0 and count upwards, giving each new tweet the next number. The problem then arises of making sure that we can give out enough numbers to give each tweet its own. Obviously, this would be a terrible use of four-bit integers, because then you could only have sixteen unique tweets! Likewise, it makes no sense to use a signed number here because if you start at zero and count upwards, you’ll never have a tweet with a negative number.

So what we need to hold TweetIDs is something nice and big, as well as unsigned: say, an unsigned 64-bit integer. That’ll give us 18 quintillion different numbers, more than we should ever need (but not vastly more like we would get if we used, say, an unsigned 128-bit integer). To give you an idea of scale, Wikipedia says that one quintillion is about the number of insects on earth. And this is 18 times that.
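To make that scale concrete, the largest value an unsigned 64-bit integer can hold is straightforward to compute:

```python
# An unsigned 64-bit integer holds 2**64 distinct values,
# from 0 up to 18,446,744,073,709,551,615 -- about 18 quintillion.
max_u64 = 2 ** 64 - 1
print(f"{max_u64:,}")  # 18,446,744,073,709,551,615
assert max_u64 == 18_446_744_073_709_551_615
```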

So that’s what we’d want to use (and what we hope they’re using at Twitter HQ). But as the site linked to in my Long Trance post points out, if someone was writing a Twitter app and didn’t think specifically about this problem (having enough space to hold a relatively large TweetID), they would probably use the programmer’s standard number type for storing “big” numbers: the signed 32-bit integer.

See the problem yet? Using 32 bits, we can represent over 4 billion numbers, but when we use them in a signed number, half of that representational potential is less than zero and useless here. The biggest number that a signed 32-bit integer can hold is about 2.14 billion, and there are about 2.09 billion tweets currently on the books. It won’t be long now before there are tweets that have an id that can’t be contained in a signed 32-bit integer, and if any third-party apps try, they’ll fail in a myriad of unexpected and fun ways.
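Python integers never overflow, but we can simulate what happens inside a signed 32-bit variable when the IDs keep climbing (the helper name `wrap_i32` is mine, for illustration):

```python
def wrap_i32(n: int) -> int:
    """Simulate storing n in a signed 32-bit integer (two's-complement wraparound)."""
    n &= 0xFFFFFFFF                       # keep only the low 32 bits
    return n - 0x100000000 if n >= 0x80000000 else n

INT32_MAX = 2 ** 31 - 1                   # 2,147,483,647

assert wrap_i32(INT32_MAX) == 2147483647  # the last ID that fits
# One tweet past the maximum, and the stored ID wraps to a huge negative number:
assert wrap_i32(INT32_MAX + 1) == -2147483648
```

That sudden jump from the largest positive value to the most negative one is exactly the kind of “unexpected and fun” failure the apps hit.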

Edited a few days later: Yeah, the rollover came and went, and broke a few apps (notably Twitterific on the iPhone). But there was also a tweet from one of the Twitter guys saying that they indeed do use a 64-bit unsigned integer to hold TweetIDs. Good for them.

Small math error here.

A signed 32-bit integer in standard two’s complement notation has a range of (2^31)-1 to -(2^31), or 2147483647 to -2147483648.

That’s 2.14 billion, not 2.14 million.

So Twitter won’t run out of Tweet IDs anytime soon, but they will run out in the foreseeable future.

So your point was entirely valid, but your numbers were off by a factor of 1000, well, 1024. 🙂

If they had used an unsigned 32-bit integer as you point out, they’d have a range up to (2^32)-1, or 4294967295, or 4.29 billion. Still a lot, but not enough.

They definitely screwed up by not using a longer unsigned integer as you point out. Their architect deserves scorn for all the trouble his/her very poor decision is going to cause Twitter someday.

BTW, it’s easier to see what’s going on with all these signed and unsigned numbers if you look at their internal representations in hex.

For a signed 32-bit integer, the most significant bit is the sign bit, so for a positive number the maximum number has all the other 31 bits turned on and the sign bit off, like this: 0x7fffffff.

Zero is represented by the sign bit off and all other bits off, like this: 0x00000000.

The minimum negative number has the sign bit on and all the other bits off, like this: 0x80000000.

A -1 is represented by the sign bit on and all the other bits on, like this: 0xffffffff.

The maximum unsigned 32-bit number has no sign bit and all bits turned on, like this: 0xffffffff.

See that a signed -1 and a maximum unsigned number are represented by the same arrangement of internal bits? That’s a clue for debugging signed math problems. If you ever see a -1 popping up somewhere you didn’t expect, look for an improper casting of unsigned to signed numbers. That’ll pinpoint the bug.
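All of those sentinel patterns are easy to check in Python by reinterpreting a 32-bit unsigned pattern as signed, which is effectively what the improper cast does (the helper name `u32_to_i32` is mine):

```python
def u32_to_i32(bits: int) -> int:
    """Reinterpret a 32-bit unsigned bit pattern as a two's-complement signed value."""
    return bits - 2 ** 32 if bits >= 2 ** 31 else bits

assert u32_to_i32(0x7FFFFFFF) == 2147483647    # max positive signed
assert u32_to_i32(0x00000000) == 0             # zero
assert u32_to_i32(0x80000000) == -2147483648   # most negative signed
assert u32_to_i32(0xFFFFFFFF) == -1            # max unsigned reads back as -1
```

The last line is the debugging clue: the all-bits-on pattern is the maximum unsigned value, but cast to signed it comes out as -1.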

Extra credit: Two’s complement is not the only way to represent binary numbers. You can also use one’s complement. What would the sentinel numbers look like in one’s complement?
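If you want to check your answer, one’s complement negation is just flipping every bit, which is a one-liner to sketch (helper name mine):

```python
def ones_complement_neg(n: int, width: int = 32) -> int:
    """In one's complement, negating a number means flipping every bit."""
    return n ^ ((1 << width) - 1)

assert ones_complement_neg(1) == 0xFFFFFFFE           # -1 is 0xfffffffe here
assert ones_complement_neg(0) == 0xFFFFFFFF           # a distinct "negative zero"
assert ones_complement_neg(0x7FFFFFFF) == 0x80000000  # most negative: -(2**31 - 1)
```

Note the quirk that makes two’s complement win in practice: one’s complement has two representations of zero.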

You were right, I made a mistake: 32 bits can hold over 4 billion values, with signed holding a maximum positive value of 2.14 billion. But the problem remains: the number of tweets is also in the billions, not millions, so the overflow is still possible.

It is worth noting, though, that there’s no reason to think that Twitter itself will have this problem, just poorly-engineered third-party apps that made the mistake of putting TweetIDs into signed longs.