Converting Integers To Words – Bringing Order To English Through Code

IntegerFor a few years now, I've been meaning to read Peter Norvig's old book, "Paradigms of Artificial Intelligence Programming: Case Studies in Common Lisp", but every time I started I'd inevitably get lost in the code due to my poor Lisp skills. For some reason I just couldn't absorb the introductory Lisp chapters at the beginning of the book. So a little while ago I decided to "begin at the beginning" and slowly started making my way through Peter Seibel's "Practical Common Lisp".

Inevitably, one of the first things I learned about was the "format" function. Format, is essentially the Lisp equivalent of printf, so you kinda need it to produce the old "Hello World". Just like printf and its ilk, format supports string interpolation via embedded directives. For example if you wanted to print two strings side by side but separate them by a specific number of spaces you could do something like this:

CL-USER> (format t "~a~40t~a~%" "abc" "123")
abc                                     123
NIL

One of the most interesting format directives is ~r which will take an integer and produce its English representation e.g.:

CL-USER> (format nil "~r" 123456)
"one hundred twenty-three thousand four hundred fifty-six"

I found this little gem in one of the footnotes, the lesson here being – always read the footnotes. Anyway, I don't know about you, but I thought that was pretty awesome, which immediately got me thinking about how hard it would be to implement a stand-alone integer to English translation function. I would have loved to try my hand at doing this in Lisp, but I don't think I am sufficiently Jedi for that as yet (I will however give it a go as soon as I am ready, you will not want to miss that :)), so instead I decided to have a go at implementing it in Ruby.

A Totally Unrelated Musical Trivia Interlude

Speaking of Jedi, I've recently been a little obsessed with the "Good Morning" song from "Singing In The Rain", you know the one I mean. I blame "Family Guy" for this, it all started after I watched the episode where they perform this song in one of their postmodern flashes of randomness. Anyway, it's just such a nice and happy song, it's kinda quaint how 1:30 used to be considered "late" in those days and I dig the whole synchronized tap dancing thing that they do (just imagine how much they had to practice to get it right).

Now, here is the trivia, the girl in the song/movie is a 20 year old Debbie Reynolds (did you know that before "Singing In The Rain" she didn't even know how to dance) who went on to marry singer Eddie Fisher. A few years later she gave birth to a daughter named Carrie (i.e. Carrie Fisher) who, in turn, went on to play Princess Leia in "Star Wars". That's how all of this relates to Jedi, and you thought I was going off on a random tangent – now you know better :). Time to get back to some Ruby code.

A Ruby Based Integer To English Converter

It turns out writing it was somewhat harder than I expected because the English language is a little insane when it comes to naming numbers. Everything was going well at the start; it was obvious that we need some kind of lookup table to convert digits to words. The lookup table must contain all the uniquely named numbers that we are likely to encounter.

  • all number between 1 and 10 (e.g. one, two, five etc.)
  • all number between 11 and 20 (e.g. eleven, twelve, sixteen etc.)
  • all multiples of 10 between 20 and 100 (e.g. thirty, forty etc.)
  • all uniquely named powers of 2 over 100 (e.g. hundred, thousand, million, billion etc.)

Here is a cut-down version:

{
1 => "one",
2 => "two",
3 => "three",
...
19 => "nineteen",
20 => "twenty",
...
90 => "ninety",
100 => "hundred",
1000 => "thousand",
1000000 => "million",
...
}

We don't really think about it, since we are so used to it, but that is actually a lot more unique numbers than strictly necessary. You can easily identify all possible numbers using just 1 through 9 as well as all the named powers of 10, such as ten, hundred, thousand etc (we'll explore this thought further shortly). Doing it this way would be pretty logical, instead the English speaking world decided that we needed nine special words to represent numbers between 11 and 19 as well as eight more similar, but still unique, words for the multiples of 10 between 20 and 100. As much as we take it for granted this causes all sorts of pain when we try to implement out integer to words converter.

Consider this, we're given an integer several digits long, let's say 12.

  • to get a word representation we start looking at the digits of the integer in reverse order (we have to start with the least significant digit first):
    • digits = 21
  • we use the look-up table to find the word that represents the first digit and add it to an accumulator (which is a list that will contain all the words in the English representation of our integer):
    • digits = 21, accumulator = [two]

But as soon as we move on to the second digit, we're already in trouble. If this digit is a one, we know we have a number between 11 and 19, so we have to do the following:

  • backtrack the previous digit from the accumulator:
    • digits = 21, accumulator = []
  • get the current digit along with the previous digit and reverse them to get our actual number:
    • digits = 12, accumulator = []
  • do a lookup to get the new word representation and put the new representation in the accumulator:
    • digits = 12, accumulator = [twelve]

The story is similar if the second digit is 2 through 9, we still have to backtrack, then we need to do a lookup on the second digit multiplied by 10 (i.e. 20, 30 etc), combine the result with the word representation of the previous digit and put that back in the accumulator. So if the number is 20, we need to look up 20, combine that with 5 and get twenty five which ends up in the accumulator.

When the number is big enough, this patters comes up again and again every few digits. For example, numbers such as 11000, 23000, 134000 need to be represented as eleven thousand, twenty three thousand and one hundred and thirty four thousand (see what I mean). Luckily it is a pattern so we can handle all occurrences of it as a special case, but it's a pattern that need not be. And talking about patterns that need not be, what is our obsession with the word hundred:

  • three hundred and fifty
  • one hundred and twenty thousand three hundred and fifty
  • one hundred and twenty five million three hundred and twenty thousand five hundred and six

We have the word hundred appearing 3 times in that last one, we're used to it and it feels comfortable, but is it strictly necessary? We have to handle that one as a special case as well. Needless to say, the code ends up being long and ugly, you can have a look at the full source here, but the main chunk of it is something along the following lines:

  def convert(number)
    return "zero" if number == 0
    word_representation_accumulator = []
    number_digits_reversed = number.to_s.reverse
    digit_count = 0
    number_digits_reversed.chars.each_with_index do |digit, index|
      digit_as_number = Integer(digit)
      skip_zero(digit_as_number) do
        if digit_count == 0
          word_representation_accumulator << "#{NTW[digit_as_number]}"
        elsif ten_to_twenty?(digit_as_number, digit_count)
          backtrack word_representation_accumulator
          actual_number = Integer("#{digit}#{number_digits_reversed[index - 1]}")
          multiplier = (digit_count > 1 ? 10**(digit_count - 1) : nil)
          word_representation = "#{NTW[actual_number]}"
          word_representation += " #{NTW[multiplier]}" if multiplier
          word_representation += " and" if word_representation_accumulator.size == 1
          word_representation_accumulator << word_representation
        elsif twenty_to_one_hundred?(digit_count)
          backtrack word_representation_accumulator
          multiplier = (digit_count > 1 ? 10**(digit_count - 1) : nil)
          lookup_number = digit_as_number * 10
          word_representation = "#{NTW[lookup_number]}"
          word_representation += " #{NTW[Integer(number_digits_reversed[index - 1])]}"
          word_representation += " #{NTW[multiplier]}" if multiplier
          word_representation += " and" if word_representation_accumulator.size == 1
          word_representation_accumulator << word_representation
        elsif digit_count == 2 || digit_count % 3 == 2
          multiplier = 10**2
          word_representation = "#{NTW[digit_as_number]} #{NTW[multiplier]}"
          word_representation += " and" if word_representation_accumulator.size != 0
          word_representation_accumulator << word_representation
        else
          multiplier = 10**digit_count
          word_representation = "#{NTW[digit_as_number]} #{NTW[multiplier]}"
          word_representation += " and" if word_representation_accumulator.size == 1
          word_representation_accumulator << word_representation
        end
      end
      digit_count += 1
    end
    word_representation_accumulator.reverse.join(" ")
  end

You can have a go at running it, but don't forget, I only took the lookup table up to trillions, so if you want to try numbers bigger than that you will need to add more named powers of ten (such as quadrillion, quintillion etc.) to the table. Here is some sample output:

315119 - three hundred and fifteen thousand one hundred and nineteen
1000001 - one million and one 
1315119 - one million three hundred and fifteen thousand one hundred and nineteen
11315119 - eleven million three hundred and fifteen thousand one hundred and nineteen

Using Code To Derive The English That Should Have Been

Even though my code was producing reasonable output, I wasn't entirely happy, the arbitrary nature of the English language made my code uglier than it could have been. Since I've taken liberties with the English language before I decided this was a good time to do it again. What if we were to remove all the extraneous words that English had for numbers. Surely, the numbers 1 through 9 are more than sufficient? This would be the simplest possible case e.g.:

135 - one three five
3619 - three six one nine

Unfortunately it just doesn't cut it. Firstly we would have to introduce 0 as a non-silent number e.g.:

609 - six zero nine

Currently, zero is completely silent in all numbers unless it is by itself e.g.:

609 - six hundred and nine (see, zero is silent)

But even if this wasn't an issue, it turns out we actually need the named powers of ten to make English representations of integers meaningful. One five six two three, tells us almost nothing, but fifteen thousand six hundred and twenty three gives us a lot of information instantly. Even when we haven't processed the number fully, we know straight away that it is approximately 15 thousand and a few hundred. The bigger the number gets, the more pronounced this "estimation" effect is. So, we need numbers one through nine as well as named powers of 10. This allows us to cut our lookup table down to the following:

  NTW = {
  1 => "one",
  2 => "two",
  3 => "three",
  4 => "four",
  5 => "five",
  6 => "six",
  7 => "seven",
  8 => "eight",
  9 => "nine",
  10 => "ten",
  100 => "hundred",
  1000 => "thousand",
  1000000 => "million",
  1000000000 => "billion",
  1000000000000 => "trillion"
  }

We can use this table to write code that is much more intuitive than our previous attempt. Infact, we don't even need to know exactly what we're trying to achieve, instead we let the code guide us to the "correct" solution. We're using code to derive the requirements for the real world which is the opposite of the usual scenario. The full implementation I came up with is in this gist, but here is the main bit:

  def convert(number)
    return "zero" if number == 0
    word_representation_accumulator = []
    number_digits_reversed = number.to_s.reverse
    digit_count = 0
    number_digits_reversed.chars.each_with_index do |digit, index|
      digit_as_number = Integer(digit)
      skip_zero(digit_as_number) do
        multiplier = 10**digit_count
        word_representation = "#{NTW[digit_as_number]}"
        word_representation += " #{NTW[multiplier]}" if multiplier > 1 && NTW[multiplier]
        word_representation_accumulator << word_representation
      end
      digit_count += 1
    end
    word_representation_accumulator.reverse.join(" ")
  end

You have to admit, it's much shorter and nicer looking and here is some of the output produced:

0 - zero
1 - one
3 - three
5 - five
11 - one ten one
15 - one ten five
25 - two ten five
71 - seven ten one
40 - four ten
100 - one hundred
101 - one hundred one
112 - one hundred one ten two
123 - one hundred two ten three
457 - four hundred five ten seven
999 - nine hundred nine ten nine
1000 - one thousand
1001 - one thousand one
1010 - one thousand one ten
1011 - one thousand one ten one
2117 - two thousand one hundred one ten seven
3001 - three thousand one
13101 - one three thousand one hundred one
14001 - one four thousand one
16000 - one six thousand
25119 - two five thousand one hundred one ten nine
65009 - six five thousand nine
315119 - three one five thousand one hundred one ten nine
1000001 - one million one
1315119 - one million three one five thousand one hundred one ten nine
11315119 - one one million three one five thousand one hundred one ten nine
74315119 - seven four million three one five thousand one hundred one ten nine
174315119 - one seven four million three one five thousand one hundred one ten nine
1174315119 - one billion one seven four million three one five thousand one hundred one ten nine
15174315119 - one five billion one seven four million three one five thousand one hundred one ten nine
35174315119 - three five billion one seven four million three one five thousand one hundred one ten nine
935174315119 - nine three five billion one seven four million three one five thousand one hundred one ten nine
2935174315119 - two trillion nine three five billion one seven four million three one five thousand one hundred one ten nine

As you can see, all the named powers of ten are now treated in exactly the same way i.e.:

  • 10 is ten to the power of 1
  • 100 is ten to the power of 2
  • etc.

So 101 is one hundred one and 11 is one ten one, 203 is two hundred three, while 23 is two ten three – logical isn't it? There is no longer any need for any specially named numbers between 10 and 100 since we can represent all of them just as efficiently with this notation; this means no more special cases for numbers under 100. The other one we had was all the extra usages of the word hundred. We simply eliminate them all together (except for when we are talking about numbers between 100 and 1000), three one five thousand conveys the same amount of information as three hundred and fifteen thousand – the word hundred is totally redundant. Now, every named power of 10 appears only once in the English representation of the integer which allows us to still retain the ability to "estimate" the magnitude instantly without any extraneous information:

one million three one five thousand one hundred one ten nine

Short, consistent and to the point. It does seem a little awkward at first, but after a few minutes it starts to grow on you – after all, it is much more logical than the system we have now. There you go, deriving "better" English through some Ruby code. Not only do we get superior language but it is also easier, faster and less error prone to code it – FTW brothers, FTW :)!!!

Image by plindberg

  • http://www.deploymentzone.com Charles Feduke

    I’ve been reading the same book and remember the ~r option and wondering what the code must look like to arrive at its result. Going to have to search for the Lisp version of it now just to compare with your Ruby version.

    • http://www.skorks.com Alan Skorkin

      I’ll write the lisp version just as soon as I have a better grasp of macros and get used to the syntax a little more :).

  • Pingback: Deployment Zone » Blog Archive » Successful Lisp: How to Understand and Use Common Lisp

  • http://codosaur.us Dave Aronson

    Likely to encounter all uniquely named powers of *what* now? ;-)

  • Peter

    It’s funny that Ruby code led you to this naming convention. We homeschool our kids and picked up an abacus-based math curriculum. It talks about how confusing english language treatment of numbers is, and uses the “one ten two” phrasing for numbers like “12″. I don’t think it’s coincidence that “cutting edge math curriculum” and Ruby-the-beautiful-language-of-poetry-on-computers end up with the same common sense conclusions.

    • http://www.skorks.com Alan Skorkin

      I am not surprised other people have come up with this, but I didn’t realise the idea was out there in the wild, good to know.

      • Joseph

        The ‘better’ naming convention that you proposed is already being used in the Chinese language (at least as far as I know – I’m not a native speaker). If China does end up taking over the world (as the media would sometimes have us believe) then we might all end up using this system after all!

        Excellent post by the way!

  • Andrei Vajna II

    The English language is simply an optimization of your more “logical” language. Instead of a cumbersome ‘one ten two’, we simply use the shortcut ‘twelve’, because the smaller numbers are used more often and such shortcuts are useful.

    And I find it awkward to say ‘three one five thousand’. That is because you don’t have words for every power of 10. In English, by dividing the digits in groups you can pronounce any number. That is why ‘hundred’ is repeated. This is also useful for huge numbers like ‘fifteen billion billion’, which would be very cumbersome in your language.

    Oh, and what about 305 000? How would you say that? ‘Three zero five thousand’ or ‘three five thousand’? I think it’s the latter, which is definitely ambiguous.

    • http://www.skorks.com Alan Skorkin

      I think we only find it awkward because we are not used to this way, but are used to the other, ‘three one five thousand’ doesn’t lose any information by omitting the hundred etc.
      ‘fifteen billion billion’ is ‘one five quintillion’ which is not cumbersome at all in my opinion.

      You are however quite right about the 305 000. It should be three zero five but right now the code would produce three five thousand which is wrong. It is a bug, I was too aggressive with skipping the zeros :(.

      • Andrei Vajna II

        Yes, quintillion is ok, but you have to have a name for *every* power of 10, which is impossible, since they are infinite in number. That’s why repeating a ‘quantifier’ is so useful. You can easily chain them to produce any number: “three billion billion billion billion billion billion billion billion billion billion billion”. Besides, it keeps the number of keywords to remember low, as in your case this number is (should be) infinite.

        And by the way, to be consistent, you should have names for the 4th and 5th powers of ten as well, like say “tenthousand” and “hundredthousand”, but that would be really awkward.

  • Sam

    Thought I’d give this a go because it looks fun. There’s not actually that many rules to displaying numbers in English.
    - English numbers are essentially base 1000.
    - The hundred digit either translates to empty string, something in the format “x hundred” if there are no following non-zero digits, or “x hundred and” if there are following non-zero digits.
    - For the final two digits, numbers under 20 have specific translations, while the rest are of the format “x-y”.


    POWERS_OF_1000 = ["", "thousand", "million", "billion", "trillion", "quadrillion"]
    MULTIPLES_OF_10 = ["", "ten", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]
    NUMBERS = ["", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]

    def to_english(number)
    return "zero" if number == 0
    return "negative #{to_english(-number)}" if number < 0
    words = []
    thousand_power_counter = 0
    while number != 0
    if number % 1000 != 0
    words.insert 0, to_english_3_digits(number)
    words.insert 1, POWERS_OF_1000[thousand_power_counter]
    end
    number /= 1000
    thousand_power_counter += 1
    end

    return words.join(" ")
    end

    def to_english_3_digits(number)
    last_2_digits = number % 100
    hundreds = (number % 1000) / 100
    tens = last_2_digits / 10
    ones = last_2_digits % 10

    words = []
    words += [NUMBERS[hundreds], "hundred"] if hundreds != 0
    words <<= "and" if hundreds != 0 && last_2_digits != 0
    words <<= last_2_digits < 20 ? NUMBERS[last_2_digits] : "#{MULTIPLES_OF_10[tens]}#{"-#{NUMBERS[ones]}" if ones != 0}"

    return words
    end

    • Michael

      I wrote a similar method whilst working through some tutorials.
      I’ve been learning Ruby for about a week, and this is my first attempt at writing something from scratch. It outputs British English, short scale.
      I’m sure there are things I could do better, but I’ve only covered the basics of programming so far.

      def en_uk number
      count_out number, [], 0
      end

      def count_out number, output, magnitude
      count = ['', 'one','two','three','four','five','six','seven','eight','nine', 'ten','eleven',
      'twelve','thirteen','fourteen','fifteen','sixteen','seventeen','eighteen','nineteen']
      x10 = ['','','twenty','thirty','forty','fifty','sixty','seventy','eighty','ninety']
      order = [' thousand',' million',' billion',' trillion',' quadrillion',' quintillion',' sextillion',
      ' septillion',' octillion',' nonillion']

      units = (number.to_s[-1,1]).to_i
      tens = (number.to_s[-2,1]).to_i
      to99 = (tens*10) + units
      hundreds = (number.to_s[-3,1]).to_i

      if (hundreds*100) + to99 != 0 && magnitude != 0
      if output.to_s.length > 14
      output.insert(0, ‘, ‘)
      elsif output.to_s.length > 0
      output.insert(0, ‘ and ‘)
      end
      output.insert(0, order[magnitude-1])
      end

      if number == 0
      return puts ‘zero’
      elsif to99 99 && hundreds != 0
      if output.to_s.length > 0 && to99 > 0
      output.insert(0, ‘and ‘)
      end
      output.insert(0, count[hundreds] + ‘ hundred ‘)
      end

      if number > 999
      number = number/1000
      magnitude = magnitude + 1
      count_out number, output, magnitude
      else
      puts output.to_s
      end
      end

      #examples

      en_uk(0)
      en_uk(1)
      en_uk(11)
      en_uk(71)
      en_uk(111)
      en_uk(999)
      en_uk(1000)
      en_uk(1077)
      en_uk(2117)
      en_uk(13101)
      en_uk(34341)
      en_uk(11315119)
      en_uk(1174315119)
      en_uk(35174315119)
      en_uk(935174315119)
      en_uk(2935174315119)
      en_uk(2935174315119654654654654654)

  • Klaas van Gend

    Your code is absolutely not i18n-proof:
    In German and Dutch, we say “seven-eighty” instead of 87.
    In French, things get more weird with “quatre-vingt-sept” (4×20+7) – or quatre-vingt-douze (4×20+12) for 92.

    I agree that it would be a lot easier if we would adopt a scheme like you suggest – for all languages in the world, please.

    Another interesting one:
    English: million billion trillion (1e6, 1e9 e12)
    Dutch/German: miljoen miljard biljoen biljard, triljoen (1e6 1e9 1e12 1e15 1e18)

    Good luck fixing your code for that :-)
    Does LISP handle i18n correctly?

    • http://www.skorks.com Alan Skorkin

      I am not surprised about the non i18n :).
      The million, milliard, billion, billiard is the same in russian as well, it seems to be a European thing.

      I am not sure if LISP will handle any language besides english, I would be very pleasantly surprised if there was a format directive to convert an integer into french :).

  • Shane Ryans

    This is a bit over my head. What exactly would this do?

  • http://droope.wordpress.com droope

    FTW!!!!!!!!!!!!!!!!!!!!!!!!!! :D :D :d

  • http://droope.wordpress.com droope
    • http://www.skorks.com Alan Skorkin

      Haha, nice

  • Kyle

    Having just completed an introductory AI course I came to wondering if this problem could be effectively expressed in a prolog like language.

    All we really want to do here is ask questions whose answers can be extracted from a set of facts…

    • http://www.skorks.com Alan Skorkin

      I don’t know prolog, but it would be a curious exercise

  • http://www.surgeforward.com SaaS

    I had never thought of the numbers that we use, I know that the english language is terrible. I speak dutch and spanish, and both of those languages treat numbers similarly but I think it would still be a nightmare in the long run.

  • Robert Massaioli

    I just implemented an englishNumbers :: Integer -> String function in Haskell and that was easy in comparison to the inverse function fromEnglishNumbers :: String -> Integer because the English numbers are an absolute PITA to translate. Also you might be interested in this wiki article on the long and short scales: http://en.wikipedia.org/wiki/Long_and_short_scales

    • http://www.skorks.com Alan Skorkin

      Yeah, I’ve see that wikipedia article before, I find it quite interesting

  • Michael

    I can definitely see your point about having a simplified approach to numbers in english, but it would take some getting used to, and I dont think you’d have an easy time getting people to adopt it, or even use it, in any meaningfull way.

    This ‘more logical’ approach to numbers reminds me a little of a very old counting system used by shepherds/farmers in the UK, which largely died out a hundred years ago, but I still remember my grandfather using.

    1. Ain == one
    2. Tain == two
    3. Tethera == three
    4. Methera == four
    5. Mimp == five
    6. Ayta == six
    7. Slayta == seven
    8. Laura == eight
    9. Dora == nine
    10. Dik == ten
    11. Ain-a-dik == one and ten
    12. Tain-a-dik == two and ten
    13. Tethera-a-dik == three and ten
    14. Methera-a-dik == four and ten
    15. Mit == fifteen
    16. Ain-a-mit == one and fifteen
    17. Tain-a-mit == two and fifteen
    18. Tethera-mit == three and fifteen
    19. Gethera-mit == four and fifteen
    20. Ghet == twenty

    This counting system is so old that it’s origins can’t accurately be dated or traced, but you can definitely see a simplified logic in the method. It pretty much stops at 20, and when in use, the shepherd would simply make a note of how many 20s he’d counted so far. I suppose they had no real use for ‘large’ numbers in this context.

    You can see the various dialects of this system @ http://en.wikipedia.org/wiki/Yan_Tan_Tethera

  • David Carlin

    Here’s my solution (for numbers between -1 million to 1 million (exclusive)

    def number_to_words(n)
    return ‘negative ‘ + number_to_words(n * -1) if (n < 0)
    if (n < 20)
    return [ 'zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen', 'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen'][n]
    elsif (n 0 ? (‘ ‘ + number_to_words(n % 10)) : ”)
    elsif (n 0 ? (‘ and ‘ + number_to_words(n % 100)) : ”)
    elsif (n 0 ? (‘ ‘ + number_to_words(n % 1000)) : ”)
    else
    raise ‘number too large for what I’ve coded’
    end
    end

  • Grammar Nazi

    When speaking of integers, “and” is not a conjunction between magnitudes, it is a decimal point, therefore, your function produces ridiculous results. Refer to output from common lisp as a litmus of my statement.

  • Grammar Nazi

    s/, therefore/; therefore/

    Of course, we’re all subject to runons.

  • http://code.dblock.org dB.

    Finally, there’s something generic out there to convert numbers to words that we can extend for many languages – https://github.com/kslazarev/numbers_and_words