IntegerFor a few years now, I’ve been meaning to read Peter Norvig’s old book, “Paradigms of Artificial Intelligence Programming: Case Studies in Common Lisp”, but every time I started I’d inevitably get lost in the code due to my poor Lisp skills. For some reason I just couldn’t absorb the introductory Lisp chapters at the beginning of the book. So a little while ago I decided to “begin at the beginning” and slowly started making my way through Peter Seibel’s “Practical Common Lisp”.

Inevitably, one of the first things I learned about was the “format” function. Format, is essentially the Lisp equivalent of printf, so you kinda need it to produce the old “Hello World”. Just like printf and its ilk, format supports string interpolation via embedded directives. For example if you wanted to print two strings side by side but separate them by a specific number of spaces you could do something like this:

CL-USER> (format t "~a~40t~a~%" "abc" "123")
abc                                     123
NIL

One of the most interesting format directives is ~r which will take an integer and produce its English representation e.g.:

CL-USER> (format nil "~r" 123456)
"one hundred twenty-three thousand four hundred fifty-six"

I found this little gem in one of the footnotes, the lesson here being – always read the footnotes. Anyway, I don’t know about you, but I thought that was pretty awesome, which immediately got me thinking about how hard it would be to implement a stand-alone integer to English translation function. I would have loved to try my hand at doing this in Lisp, but I don’t think I am sufficiently Jedi for that as yet (I will however give it a go as soon as I am ready, you will not want to miss that :)), so instead I decided to have a go at implementing it in Ruby.

A Totally Unrelated Musical Trivia Interlude

Speaking of Jedi, I’ve recently been a little obsessed with the “Good Morning” song from “Singing In The Rain”, you know the one I mean. I blame “Family Guy” for this, it all started after I watched the episode where they perform this song in one of their postmodern flashes of randomness. Anyway, it’s just such a nice and happy song, it’s kinda quaint how 1:30 used to be considered “late” in those days and I dig the whole synchronized tap dancing thing that they do (just imagine how much they had to practice to get it right).

Now, here is the trivia, the girl in the song/movie is a 20 year old Debbie Reynolds (did you know that before “Singing In The Rain” she didn’t even know how to dance) who went on to marry singer Eddie Fisher. A few years later she gave birth to a daughter named Carrie (i.e. Carrie Fisher) who, in turn, went on to play Princess Leia in “Star Wars”. That’s how all of this relates to Jedi, and you thought I was going off on a random tangent – now you know better :). Time to get back to some Ruby code.

A Ruby Based Integer To English Converter

It turns out writing it was somewhat harder than I expected because the English language is a little insane when it comes to naming numbers. Everything was going well at the start; it was obvious that we need some kind of lookup table to convert digits to words. The lookup table must contain all the uniquely named numbers that we are likely to encounter.

  • all number between 1 and 10 (e.g. one, two, five etc.)
  • all number between 11 and 20 (e.g. eleven, twelve, sixteen etc.)
  • all multiples of 10 between 20 and 100 (e.g. thirty, forty etc.)
  • all uniquely named powers of 2 over 100 (e.g. hundred, thousand, million, billion etc.)

Here is a cut-down version:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
{
1 => "one",
2 => "two",
3 => "three",
...
19 => "nineteen",
20 => "twenty",
...
90 => "ninety",
100 => "hundred",
1000 => "thousand",
1000000 => "million",
...
}

We don’t really think about it, since we are so used to it, but that is actually a lot more unique numbers than strictly necessary. You can easily identify all possible numbers using just 1 through 9 as well as all the named powers of 10, such as ten, hundred, thousand etc (we’ll explore this thought further shortly). Doing it this way would be pretty logical, instead the English speaking world decided that we needed nine special words to represent numbers between 11 and 19 as well as eight more similar, but still unique, words for the multiples of 10 between 20 and 100. As much as we take it for granted this causes all sorts of pain when we try to implement out integer to words converter.

Consider this, we’re given an integer several digits long, let’s say 12.

  • to get a word representation we start looking at the digits of the integer in reverse order (we have to start with the least significant digit first):
    • digits = 21
  • we use the look-up table to find the word that represents the first digit and add it to an accumulator (which is a list that will contain all the words in the English representation of our integer):
    • digits = 21, accumulator = [two]

But as soon as we move on to the second digit, we’re already in trouble. If this digit is a one, we know we have a number between 11 and 19, so we have to do the following:

  • backtrack the previous digit from the accumulator:
    • digits = 21, accumulator = []
  • get the current digit along with the previous digit and reverse them to get our actual number:
    • digits = 12, accumulator = []
  • do a lookup to get the new word representation and put the new representation in the accumulator:
    • digits = 12, accumulator = [twelve]

The story is similar if the second digit is 2 through 9, we still have to backtrack, then we need to do a lookup on the second digit multiplied by 10 (i.e. 20, 30 etc), combine the result with the word representation of the previous digit and put that back in the accumulator. So if the number is 20, we need to look up 20, combine that with 5 and get twenty five which ends up in the accumulator.

When the number is big enough, this patters comes up again and again every few digits. For example, numbers such as 11000, 23000, 134000 need to be represented as eleven thousand, twenty three thousand and one hundred and thirty four thousand (see what I mean). Luckily it is a pattern so we can handle all occurrences of it as a special case, but it’s a pattern that need not be. And talking about patterns that need not be, what is our obsession with the word hundred:

  • three hundred and fifty
  • one hundred and twenty thousand three hundred and fifty
  • one hundred and twenty five million three hundred and twenty thousand five hundred and six

We have the word hundred appearing 3 times in that last one, we’re used to it and it feels comfortable, but is it strictly necessary? We have to handle that one as a special case as well. Needless to say, the code ends up being long and ugly, you can have a look at the full source here, but the main chunk of it is something along the following lines:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
def convert(number)
    return "zero" if number == 0
    word_representation_accumulator = []
    number_digits_reversed = number.to_s.reverse
    digit_count = 0
    number_digits_reversed.chars.each_with_index do |digit, index|
      digit_as_number = Integer(digit)
      skip_zero(digit_as_number) do
        if digit_count == 0
          word_representation_accumulator << "#{NTW[digit_as_number]}"
        elsif ten_to_twenty?(digit_as_number, digit_count)
          backtrack word_representation_accumulator
          actual_number = Integer("#{digit}#{number_digits_reversed[index - 1]}")
          multiplier = (digit_count > 1 ? 10**(digit_count - 1) : nil)
          word_representation = "#{NTW[actual_number]}"
          word_representation += " #{NTW[multiplier]}" if multiplier
          word_representation += " and" if word_representation_accumulator.size == 1
          word_representation_accumulator << word_representation
        elsif twenty_to_one_hundred?(digit_count)
          backtrack word_representation_accumulator
          multiplier = (digit_count > 1 ? 10**(digit_count - 1) : nil)
          lookup_number = digit_as_number * 10
          word_representation = "#{NTW[lookup_number]}"
          word_representation += " #{NTW[Integer(number_digits_reversed[index - 1])]}"
          word_representation += " #{NTW[multiplier]}" if multiplier
          word_representation += " and" if word_representation_accumulator.size == 1
          word_representation_accumulator << word_representation
        elsif digit_count == 2 || digit_count % 3  == 2
          multiplier = 10**2
          word_representation = "#{NTW[digit_as_number]} #{NTW[multiplier]}"
          word_representation += " and" if word_representation_accumulator.size != 0
          word_representation_accumulator << word_representation
        else
          multiplier = 10**digit_count
          word_representation = "#{NTW[digit_as_number]} #{NTW[multiplier]}"
          word_representation += " and" if word_representation_accumulator.size == 1
          word_representation_accumulator << word_representation
        end
      end
      digit_count += 1
    end
    word_representation_accumulator.reverse.join(" ")
  end

You can have a go at running it, but don’t forget, I only took the lookup table up to trillions, so if you want to try numbers bigger than that you will need to add more named powers of ten (such as quadrillion, quintillion etc.) to the table. Here is some sample output:

315119 - three hundred and fifteen thousand one hundred and nineteen
1000001 - one million and one 
1315119 - one million three hundred and fifteen thousand one hundred and nineteen
11315119 - eleven million three hundred and fifteen thousand one hundred and nineteen

Using Code To Derive The English That Should Have Been

Even though my code was producing reasonable output, I wasn’t entirely happy, the arbitrary nature of the English language made my code uglier than it could have been. Since I’ve taken liberties with the English language before I decided this was a good time to do it again. What if we were to remove all the extraneous words that English had for numbers. Surely, the numbers 1 through 9 are more than sufficient? This would be the simplest possible case e.g.:

135 - one three five
3619 - three six one nine

Unfortunately it just doesn’t cut it. Firstly we would have to introduce 0 as a non-silent number e.g.:

609 - six zero nine

Currently, zero is completely silent in all numbers unless it is by itself e.g.:

609 - six hundred and nine (see, zero is silent)

But even if this wasn’t an issue, it turns out we actually need the named powers of ten to make English representations of integers meaningful. One five six two three, tells us almost nothing, but fifteen thousand six hundred and twenty three gives us a lot of information instantly. Even when we haven’t processed the number fully, we know straight away that it is approximately 15 thousand and a few hundred. The bigger the number gets, the more pronounced this “estimation” effect is. So, we need numbers one through nine as well as named powers of 10. This allows us to cut our lookup table down to the following:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
NTW = {
  1 => "one",
  2 => "two",
  3 => "three",
  4 => "four",
  5 => "five",
  6 => "six",
  7 => "seven",
  8 => "eight",
  9 => "nine",
  10 => "ten",
  100 => "hundred",
  1000 => "thousand",
  1000000 => "million",
  1000000000 => "billion",
  1000000000000 => "trillion"
  }

We can use this table to write code that is much more intuitive than our previous attempt. Infact, we don’t even need to know exactly what we’re trying to achieve, instead we let the code guide us to the “correct” solution. We’re using code to derive the requirements for the real world which is the opposite of the usual scenario. The full implementation I came up with is in this gist, but here is the main bit:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
def convert(number)
    return "zero" if number == 0
    word_representation_accumulator = []
    number_digits_reversed = number.to_s.reverse
    digit_count = 0
    number_digits_reversed.chars.each_with_index do |digit, index|
      digit_as_number = Integer(digit)
      skip_zero(digit_as_number) do
        multiplier = 10**digit_count
        word_representation = "#{NTW[digit_as_number]}"
        word_representation += " #{NTW[multiplier]}" if multiplier > 1 && NTW[multiplier]
        word_representation_accumulator << word_representation
      end
      digit_count += 1
    end
    word_representation_accumulator.reverse.join(" ")
  end

You have to admit, it’s much shorter and nicer looking and here is some of the output produced:

0 - zero
1 - one
3 - three
5 - five
11 - one ten one
15 - one ten five
25 - two ten five
71 - seven ten one
40 - four ten
100 - one hundred
101 - one hundred one
112 - one hundred one ten two
123 - one hundred two ten three
457 - four hundred five ten seven
999 - nine hundred nine ten nine
1000 - one thousand
1001 - one thousand one
1010 - one thousand one ten
1011 - one thousand one ten one
2117 - two thousand one hundred one ten seven
3001 - three thousand one
13101 - one three thousand one hundred one
14001 - one four thousand one
16000 - one six thousand
25119 - two five thousand one hundred one ten nine
65009 - six five thousand nine
315119 - three one five thousand one hundred one ten nine
1000001 - one million one
1315119 - one million three one five thousand one hundred one ten nine
11315119 - one one million three one five thousand one hundred one ten nine
74315119 - seven four million three one five thousand one hundred one ten nine
174315119 - one seven four million three one five thousand one hundred one ten nine
1174315119 - one billion one seven four million three one five thousand one hundred one ten nine
15174315119 - one five billion one seven four million three one five thousand one hundred one ten nine
35174315119 - three five billion one seven four million three one five thousand one hundred one ten nine
935174315119 - nine three five billion one seven four million three one five thousand one hundred one ten nine
2935174315119 - two trillion nine three five billion one seven four million three one five thousand one hundred one ten nine

As you can see, all the named powers of ten are now treated in exactly the same way i.e.:

  • 10 is ten to the power of 1
  • 100 is ten to the power of 2
  • etc.

So 101 is one hundred one and 11 is one ten one, 203 is two hundred three, while 23 is two ten three – logical isn’t it? There is no longer any need for any specially named numbers between 10 and 100 since we can represent all of them just as efficiently with this notation; this means no more special cases for numbers under 100. The other one we had was all the extra usages of the word hundred. We simply eliminate them all together (except for when we are talking about numbers between 100 and 1000), three one five thousand conveys the same amount of information as three hundred and fifteen thousand – the word hundred is totally redundant. Now, every named power of 10 appears only once in the English representation of the integer which allows us to still retain the ability to “estimate” the magnitude instantly without any extraneous information:

one million three one five thousand one hundred one ten nine

Short, consistent and to the point. It does seem a little awkward at first, but after a few minutes it starts to grow on you – after all, it is much more logical than the system we have now. There you go, deriving “better” English through some Ruby code. Not only do we get superior language but it is also easier, faster and less error prone to code it – FTW brothers, FTW :)!!!

Image by plindberg