What could be more boring than capturing credit card data on a form? Well, it's actually not that boring since you may want to encrypt this particular data, which presents its own set of challenges. Nevertheless, it's still a textbox which takes digits that you store in a database – whoopty doo – not exactly rocket surgery. Well, I've got a piece of data that's got the credit card beat for sheer mundanity – the ABN. If you're an Australian you know all about this. For everybody else, it stands for Australian Business Number which is an 11 digit number, provided by the government to every company. It's not secret (you can look them up online), so you don't even need to encrypt it – difficult to get excited about that. Of course if that was the end of the story, this wouldn't be much of a blog post, so – as you might imagine – things are not as bland as they appear.
At CrowdHired, we don't tend to deal much with credit card numbers, but ABNs are another matter entirely – since companies are one of the two types of users we have in the system (by the way – as you may have deduced – I've been working for a startup for the last few months, I should really talk about how that happened, it's an interesting story). Just like any piece of data, you want to validate the user input if you possibly can. When I started looking into this for ABNs I discovered that they had an interesting trait, one which credit card numbers share. You see, both credit cards and ABNs are self-verifying numbers.
I've been doing web development for many years now, but had no idea this was the case. So naturally – being the curious developer that I am – I had to dig a little further. It turns out that these kinds of numbers are quite common, with other well-known examples being ISBNs, UPCs and VINs. Most of these use a variation of a check digit-based algorithm for both validation and generation. Probably the most well-known of these algorithms is the Luhn algorithm which is what credit cards use. So, we'll use a credit card as an example.
Let us say we have the following credit card number:
4870696871788604
It is 16 digits (Visa and MasterCard are usually 16, but Amex is 15). This number is broken down in the following way:
Issuer Number | Account Number | Check Digit
487069        | 687178860      | 4
You can read lots more about the anatomy of a credit card, but all we want to do is apply the Luhn algorithm to check if this credit card is valid. It goes something like this:
1. Starting from the back, double every second digit
4 | 8 | 7 | 0 | 6 | 9 | 6 | 8 | 7 | 1 | 7 | 8 | 8 | 6 | 0 | 4
8 | 8 |14 | 0 |12 | 9 |12 | 8 |14 | 1 |14 | 8 |16 | 6 |00 | 4
2. If the doubled numbers form a double digit number, add the two digits
4 | 8 | 7 | 0 | 6 | 9 | 6 | 8 | 7 | 1 | 7 | 8 | 8 | 6 | 0 | 4
8 | 8 |14 | 0 |12 | 9 |12 | 8 |14 | 1 |14 | 8 |16 | 6 |00 | 4
8 | 8 | 5 | 0 | 3 | 9 | 3 | 8 | 5 | 1 | 5 | 8 | 7 | 6 | 0 | 4
3. Sum up all the digits of this new number
8+8+5+0+3+9+3+8+5+1+5+8+7+6+0+4 = 80
4. If the number is perfectly divisible by 10 it is a valid credit card number. Which in our case it is.
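The four steps above can be sketched in a few lines of Ruby. This is only an illustration: the method name luhn_valid? is made up for this example, and adding the two digits of a doubled number is done by subtracting 9, which gives the same result for anything between 10 and 18.

```ruby
# Hypothetical helper (not from any library): true when the number
# passes the Luhn check described in the steps above.
def luhn_valid?(number)
  text = number.to_s.delete(' ')
  return false unless text.match?(/\A\d+\z/)

  digits = text.chars.map(&:to_i)
  total = digits.reverse.each_with_index.sum do |digit, index|
    if index.odd?                             # every second digit from the back
      doubled = digit * 2
      doubled > 9 ? doubled - 9 : doubled     # 14 -> 1 + 4 = 5, same as 14 - 9
    else
      digit
    end
  end
  (total % 10).zero?
end

puts luhn_valid?('4870696871788604')  # the number from the walkthrough => true
puts luhn_valid?('4870696871788614')  # one digit changed => false
```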
You can see how we can use the same algorithm to generate a valid credit card number. All we have to do is set the check digit value to X and then perform all the same steps. During the final step we simply pick our check digit in such a way as to make sure the sum of all the digits is divisible by 10. Let's do this for a slightly altered version of our previous credit card number (we simply set the digit before the check digit to 1 making the credit card number invalid).
4 | 8 | 7 | 0 | 6 | 9 | 6 | 8 | 7 | 1 | 7 | 8 | 8 | 6 | 1 | X
8 | 8 |14 | 0 |12 | 9 |12 | 8 |14 | 1 |14 | 8 |16 | 6 | 2 | X
8 | 8 | 5 | 0 | 3 | 9 | 3 | 8 | 5 | 1 | 5 | 8 | 7 | 6 | 2 | X
8+8+5+0+3+9+3+8+5+1+5+8+7+6+2+X = 78+X
X = (78%10 == 0) ? 0 : 10 - 78%10
X = 2
As you can see no matter what the other 15 digits are, we'll always be able to pick a check digit between 0 and 9 that will make the credit card number valid.
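Here is a small sketch of that generation step, assuming we are handed the first 15 digits and want the 16th. The name luhn_check_digit is hypothetical. Because the check digit will sit at the very end, the last digit of the payload is the first one to be doubled.

```ruby
# Hypothetical helper: works out the Luhn check digit for a payload
# (all the digits of the number except the check digit itself).
def luhn_check_digit(payload)
  digits = payload.to_s.chars.map(&:to_i)
  total = digits.reverse.each_with_index.sum do |digit, index|
    if index.even?                            # these land on "every second digit"
      doubled = digit * 2
      doubled > 9 ? doubled - 9 : doubled
    else
      digit
    end
  end
  (10 - total % 10) % 10                      # 0 when the sum is already a multiple of 10
end

puts luhn_check_digit('487069687178861')  # altered number from the example => 2
puts luhn_check_digit('487069687178860')  # original number => 4
```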
Of course not every self-verifying number uses the Luhn algorithm. Most don't use mod(10) to work out what the check digit should be, and for some numbers like the IBAN, the check digit actually consists of 2 digits. And yet, the most curious self-verifying number of the lot is the first one I learned about – the ABN. This is because, for the life of me, I couldn't work out what the check digit of the ABN could be.
Australia is certainly not averse to using check digit-based algorithms. The Australian Tax File Number (TFN) and the Australian Company Number (ACN) are just two examples, but the ABN seems to be different. At first glance the ABN validation algorithm is just more of the same; it just has a larger than normal "mod" step at the end (mod(89)).
In fact, here is some Ruby code to validate an ABN which I appropriated from the Ruby ABN gem (and then rolled into a nice Rails 3, ActiveRecord validator so we could do validates_abn_format_of in all our models :)):
def is_integer?(number)
  Integer(number)
  true
rescue
  false
end

def abn_valid?(number)
  raw_number = number
  number = number.to_s.tr ' ', ''
  return false unless is_integer?(number) && number.length == 11
  weights = [10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
  sum = 0
  (0..10).each do |i|
    c = number[i,1]
    digit = c.to_i - (i.zero? ? 1 : 0)
    sum += weights[i] * digit
  end
  sum % 89 == 0 ? true : false
end
But, while validating ABNs is easy, generating them is a whole other matter. As we've seen, with a check digit-based algorithm, generating the number is the same as validating the number, except we pick the digit in such a way as to make our 'mod' step evaluate to zero. But with a number such as the ABN, where there is no apparent check digit (perhaps I am just having a bout of stupid, so if you can see an obvious check digit with ABNs do let me know), how do you easily generate a valid number? In fact, why would you want to generate these numbers in the first place? Isn't being able to validate them enough?
Well, in the case of CrowdHired, we tend to create object trees that are quite deep, so we build and maintain some infrastructure code to allow us to create fake data for use during development (another interesting thing to talk about at a later date). Before we started using the self-validating properties of ABNs we simply generated any old 11 digit number as fake data for ABN fields, but once the validations started kicking in this was no longer an option. Being the pragmatic developers that we are (even if we do say so ourselves), we took some real ABNs (like our own), chucked them into an array and randomly picked from there. But, this offended the developer gods, or my developer pride – whichever, so one Saturday I decided to take a couple of hours to generate some truly random ABNs that were still valid. Here is the code I came up with (it is now a proud part of our fake data generation script):
def random_abn
  weights = [10,1,3,5,7,9,11,13,15,17,19]
  reversed_weights = weights.reverse
  initial_numbers = []
  final_numbers = []
  9.times {initial_numbers << rand(9)+1}
  initial_numbers = [rand(8)+1, rand(7)+2] + initial_numbers
  products = []
  weights.each_with_index do |weight, index|
    products << weight * initial_numbers[index]
  end
  product_sum = products.inject(0){|sum, value| sum + value}
  remainder = product_sum % 89
  if remainder == 0
    final_numbers = initial_numbers
  else
    current_remainder = remainder
    reversed_numbers = initial_numbers.reverse
    reversed_weights.each_with_index do |weight, index|
      next if weight > current_remainder
      if reversed_numbers[index] > 0
        reversed_numbers[index] -= 1
        current_remainder -= weight
        if current_remainder < reversed_weights[index+1]
          redo
        end
      end
    end
    final_numbers = reversed_numbers.reverse
  end
  final_numbers[0] += 1
  final_numbers.join
end
The idea is pretty simple. Let's go through an example to demonstrate:
1. Firstly we randomly generate 11 digits between 0 and 9 to make up our probable ABN (they are actually not all between 0 and 9 but more on that shortly)
7 5 8 9 8 7 3 4 1 5 3
2. We then perform the validation steps on that number
multiply the digits by their weights to get weight-digit products
7x10=70
5x1=5
8x3=24
9x5=45
8x7=56
7x9=63
3x11=33
4x13=52
1x15=15
5x17=85
3x19=57
70+5+24+45+56+63+33+52+15+85+57 = 505
505 mod 89 = 60
3. Since we do mod(89), at worst we'll be off by 88 (although if we get 0 as the remainder we lucked out with a valid ABN straight away). We now use the weight-digit products to "give change", subtracting from the remainder as we go until we hit zero.
We start with the last digit where the weight is 19. We subtract 1 from this digit, which means we can subtract 19 from our remainder. We then move on to the next digit until the remainder hits zero
Initial | Change  | Remainder
-------------------------------
7x10=70 | 7x10=70 | 0
5x1=5   | 5x1=5   | 0
8x3=24  | 8x3=24  | 0
9x5=45  | 9x5=45  | 0
8x7=56  | 8x7=56  | 0
7x9=63  | 6x9=54  | 0
3x11=33 | 3x11=33 | 9
4x13=52 | 4x13=52 | 9
1x15=15 | 0x15=0  | 9
5x17=85 | 4x17=68 | 24
3x19=57 | 2x19=38 | 41
4. This gives us our new number
7 5 8 9 8 6 3 4 0 4 2
5. Now we just need to add 1 to the very first number (as per the ABN validation steps) and we have our valid ABN
85898634042
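If you want to double-check the arithmetic in that walkthrough, a few lines of Ruby will do it. This is just a sanity check on the worked example, not part of the generator; the constant ABN_WEIGHTS and the helper weighted_sum are names invented for this snippet.

```ruby
ABN_WEIGHTS = [10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19].freeze

# Hypothetical helper: weighted sum of eleven digits, applied to the digits
# as they are *before* 1 is added to the first one.
def weighted_sum(digits)
  digits.zip(ABN_WEIGHTS).sum { |digit, weight| digit * weight }
end

initial  = [7, 5, 8, 9, 8, 7, 3, 4, 1, 5, 3]   # the random starting digits
adjusted = [7, 5, 8, 9, 8, 6, 3, 4, 0, 4, 2]   # after "giving change"

puts weighted_sum(initial)         # => 505
puts weighted_sum(initial) % 89    # => 60
puts weighted_sum(adjusted) % 89   # => 0

abn = adjusted.dup
abn[0] += 1                        # undo the "subtract 1" the validator performs
puts abn.join                      # => 85898634042
```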
There are a couple of nuances to those steps.
Given these nuances, this algorithm won't generate every possible ABN, but it will give you a large percentage of possible ABNs which is good enough for our needs. It took about an hour to get that working (we won't mention the little bug where I forgot the remainder could be zero from the start, which caused much grief to our random data generator :)), but it was a fun little exercise – time well spent as far as I am concerned. And to think, all this learning about self-validating numbers and algorithmic coding fun was triggered by trying to capture the most mundane piece of data on a form. It just goes to show that you can learn and grow no matter where you are and what you're doing, you just need to see the opportunities for what they are.
I was surfing the web the other day and in the course of my random wanderings I ended up at the Dropbox programming challenges page. Apparently, the Dropbox guys have posted up some coding challenges for people who want to apply to work there (and everyone else, I guess, since it's on the internet and all :)). Challenge 3 (The Dropbox Diet) immediately caught my eye since it looked like one of those problems that should have a dynamic programming solution, so I decided to use it as an opportunity to practice. The full description of the problem is on the challenge page, but here is the gist of it.
We get a list of up to 50 activities and an associated calorie value for each (either positive or negative), we need to find a subset of activities where the sum of all the calorie values is zero.
It sounded easy enough until I thought about it and realised it was more complex than it first appeared. So, I went for a walk :) and when I came back I settled in to analyse it for real. The first part of solving any problem is to really understand what problem you're trying to solve (that one sentence really deserves its own article). In this case the activities list is just extraneous information, what we really have is a list of numbers and we need to find a subset of these numbers where the sum of the subset is equal to a particular value. It took me quite a while to come up with that definition, but once you have something like that, you can do some research and see if it is a known problem.
Of course, I did nothing of the kind, I had already decided that there must be a dynamic programming solution so I went ahead and tried to come up with it myself. This wasted about an hour at the end of which I had absolutely nothing; I guess my dynamic programming chops are still lamb-chops as opposed to nice meaty beef-chops :). Having failed I decided to do what I should have done in the first place – some research. Since I had taken the time to come up with a decent understanding of the problem, it only took 5 minutes of Googling to realise that I was dealing with the subset sum problem.
The unfortunate thing about the subset sum problem is the fact that it's NP-complete. This means that if our input is big enough we may be in trouble. Wikipedia does give some algorithmic approaches to the problem (no code though), but just to cross our t's I also cracked open Cormen et al (have you ever noticed how that book has everything when it comes to algorithms :)). In this case the book agreed with Wikipedia, but once again, no code (there are only two things I don't like about Intro To Algorithms, the lack of real code and the lack of examples). I browsed the web some more, in case it would give me further insight into the problem, but there wasn't much more to know – it was time to get my code on.
The problem with the exponential time algorithm is its runtime complexity (obviously), but our maximum input size was only 50 and even if that turned out to be too big, perhaps there were some easy optimizations to be made. Regardless I decided to tackle this one first, if nothing else it would immerse me in the problem. I'll demonstrate how it works via example. Let's say our input looks like this:
[1, -3, 2, 4]
We need to iterate through the values and on every iteration produce all the possible subsets that can be made with all the numbers we've looked at up until now. Here is how it looks:
Iteration 1:
[[1]]
Iteration 2:
[[1], [-3], [1, -3]]
Iteration 3:
[[1], [-3], [1, -3], [2], [1, 2], [-3, 2], [1, -3, 2]]
Iteration 4:
[[1], [-3], [1, -3], [2], [1, 2], [-3, 2], [1, -3, 2], [4], [1, 4], [-3, 4], [1, -3, 4], [2, 4], [1, 2, 4], [-3, 2, 4], [1, -3, 2, 4]]
On every iteration we simply take the number we're currently looking at as well as a clone of the list of all the subsets we have seen so far, we append the new number to all the subsets (we also add the number itself to the list since it can also be a subset) and then we concatenate this new list to the list of subsets that we generated on the previous iteration. Here is the previous example again, but demonstrating this approach:
Iteration 1:
[] + [1]
Iteration 2:
[1] + [-3], [1, -3]
Iteration 3:
[1], [-3], [1, -3] + [2], [1, 2], [-3, 2], [1, -3, 2]
Iteration 4:
[1], [-3], [1, -3], [2], [1, 2], [-3, 2], [1, -3, 2] + [4], [1, 4], [-3, 4], [1, -3, 4], [2, 4], [1, 2, 4], [-3, 2, 4], [1, -3, 2, 4]
This allows us to generate all the possible subsets of our input, all we have to do then is pick out the subsets that sum up to the value we're looking for (e.g. 0).
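Before any optimization, the iteration described above might look something like this. It is a bare-bones sketch (the name all_subsets is made up for the example), and it happily builds all 2^n - 1 subsets, so keep the input small.

```ruby
# Hypothetical helper: builds every non-empty subset, one input number at a time.
def all_subsets(array)
  array.inject([]) do |subsets, element|
    # Append the new number to every subset seen so far...
    extended = subsets.map { |subset| subset + [element] }
    # ...and add the number on its own, since it is a subset too.
    subsets + [[element]] + extended
  end
end

subsets = all_subsets([1, -3, 2, 4])
puts subsets.size                                # => 15
p subsets.select { |subset| subset.sum.zero? }   # => [[1, -3, 2]]
```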
The list of subsets grows exponentially (it being an exponential time algorithm and all :)), but since we know what sum we're looking for, there is one small optimization we can make. We can sort our input list before trying to generate the subsets, this way all the negative values will be first in the list. The implication here is this, once the sum of any subset exceeds the value we're looking for, we can instantly discard it since all subsequent values we can append to it will only make it bigger. Here is some code:
def subsets_with_sum_less_than_or_equal(reference_value, array)
  array = array.sort {|a,b| a <=> b}
  previous_sums = []
  array.each do |element|
    new_sums = []
    new_sums << [element] if element <= reference_value
    previous_sums.each do |previous_sum|
      current_sum = previous_sum + [ element ]
      new_sums << current_sum if current_sum.inject(0){|accumulator,value|accumulator+value} <= reference_value
    end
    previous_sums = previous_sums + new_sums
  end
  previous_sums
end
If we execute that (with our reference value being 0 and our array being [1, -3, 2, 4]), we get the following output:
[[-3], [-3, 1], [-3, 2], [-3, 1, 2]]
All the subsets in that list sum up to less than or equal to our reference value (0). All we need to do now is pick out the ones that we're after.
def subsets_with_sums_equal(reference_value, array)
  subsets_with_sums_less_than_or_equal = subsets_with_sum_less_than_or_equal(reference_value, array)
  subsets_adding_up_to_reference_value = subsets_with_sums_less_than_or_equal.inject([]) do |accumulator, subset|
    accumulator << subset if subset.inject(0){|sum, value| sum+value} == reference_value
    accumulator
  end
  subsets_adding_up_to_reference_value
end
This function calls the previous one and then picks out the subset we're after:
[[-3, 1, 2]]
It's simple and works very well for any input array with less than 20 values or so, and if you try it with more than 25 – good luck waiting for it to finish :). Exponential time is no good if we want it to work with an input size of 50 (or more) numbers.
Both Wikipedia and Cormen tell us that there is a polynomial time approximate algorithm, but that's no good for us since we want the subsets that add up to exactly zero, not approximately zero. Fortunately, just like I suspected, there is a dynamic programming solution, Wikipedia even explains how it works, which is only marginally helpful when it comes to implementing it. I know because that was the solution I tackled next. Here is how it works, using the same input as before:
[1, -3, 2, 4]
Just like with any dynamic programming problem, we need to produce a matrix, the key is to figure out what it's a matrix of (how do we label the rows and how do we label the columns). In this case the rows are simply the indexes of our input array; the columns are labelled with every possible sum that can be made out of the input numbers. In our case, the smallest sum we can make from our input is -3 since that's the only negative number we have, the biggest sum is seven (1 + 2 + 4). So, our uninitialized matrix looks like this:
+---+----+----+----+---+---+---+---+---+---+---+---+
|   | -3 | -2 | -1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+---+----+----+----+---+---+---+---+---+---+---+---+
| 0 |    |    |    |   |   |   |   |   |   |   |   |
| 1 |    |    |    |   |   |   |   |   |   |   |   |
| 2 |    |    |    |   |   |   |   |   |   |   |   |
| 3 |    |    |    |   |   |   |   |   |   |   |   |
+---+----+----+----+---+---+---+---+---+---+---+---+
So far so good, but what should we put in every cell of our matrix? In this case every cell will contain either T (true) or F (false).
A T value in a cell means that the sum that the column is labelled with can be constructed using the input array numbers that are indexed by the current row label and the labels of all the previous rows we have already looked at. An F in a cell means the sum of the column label cannot be constructed. Let's try to fill in our matrix to see how this works.
We start with the first row, the number indexed by the row label is 1, there is only one sum that can be made using that number – 1. So only one cell gets a T in it, all the rest get an F.
+---+----+----+----+---+---+---+---+---+---+---+---+
|   | -3 | -2 | -1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+---+----+----+----+---+---+---+---+---+---+---+---+
| 0 | F  | F  | F  | F | T | F | F | F | F | F | F |
| 1 |    |    |    |   |   |   |   |   |   |   |   |
| 2 |    |    |    |   |   |   |   |   |   |   |   |
| 3 |    |    |    |   |   |   |   |   |   |   |   |
+---+----+----+----+---+---+---+---+---+---+---+---+
The number indexed by the second row label is -3, so in the second row, the column labelled by -3 will get a T in it. However, we're considering the numbers indexed by the current row and all previous rows, which means any sum that can be made using the numbers 1 and -3 will get a T in its column. This means that the column labelled with 1 gets a T and the column labelled with -2 gets a T since
1 + -3 = -2
+---+----+----+----+---+---+---+---+---+---+---+---+
|   | -3 | -2 | -1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+---+----+----+----+---+---+---+---+---+---+---+---+
| 0 | F  | F  | F  | F | T | F | F | F | F | F | F |
| 1 | T  | T  | F  | F | T | F | F | F | F | F | F |
| 2 |    |    |    |   |   |   |   |   |   |   |   |
| 3 |    |    |    |   |   |   |   |   |   |   |   |
+---+----+----+----+---+---+---+---+---+---+---+---+
We continue in the same vein for the next row, we're now looking at number 2 since it's indexed by the third row in our matrix. So, the column labelled by 2 will get a T, all the columns labelled by T in the previous row propagate their T value down, since all those sums are still valid. But we can produce a few other sums given the numbers at our disposal:
2 + -3 = -1
1 + 2 + -3 = 0
1 + 2 = 3
All those sums get a T in their column for this row.
+---+----+----+----+---+---+---+---+---+---+---+---+
|   | -3 | -2 | -1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+---+----+----+----+---+---+---+---+---+---+---+---+
| 0 | F  | F  | F  | F | T | F | F | F | F | F | F |
| 1 | T  | T  | F  | F | T | F | F | F | F | F | F |
| 2 | T  | T  | T  | T | T | T | T | F | F | F | F |
| 3 |    |    |    |   |   |   |   |   |   |   |   |
+---+----+----+----+---+---+---+---+---+---+---+---+
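There are three patterns that are starting to emerge.
1. The column labelled with the number indexed by the current row always gets a T.
2. Any column that has a T in the previous row keeps its T in the current row.
3. For any other column, subtract the number indexed by the current row from the column label; if the column labelled with the result has a T in the previous row, the current cell gets a T.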
Those three patterns are the algorithm that we use to fill in our matrix one row at a time. We can now use them to fill in the last row. The number indexed by the last row is 4. Therefore in the last row, the column labelled by 4 will get a T (via the first pattern). All the columns that already have a T will have that T propagate to the last row (via the second pattern). This means the only columns with an F will be those labelled by 5, 6 and 7. However using pattern 3, if we subtract 4 from 5, 6 and 7 we get:
5 - 4 = 1
6 - 4 = 2
7 - 4 = 3
If we now look at the previous row in the columns labelled by those numbers we can see a T for all three cases, therefore, even the columns labelled with 5, 6 and 7 in the last row will pick up a T via the third pattern. Our final matrix is:
+---+----+----+----+---+---+---+---+---+---+---+---+
|   | -3 | -2 | -1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+---+----+----+----+---+---+---+---+---+---+---+---+
| 0 | F  | F  | F  | F | T | F | F | F | F | F | F |
| 1 | T  | T  | F  | F | T | F | F | F | F | F | F |
| 2 | T  | T  | T  | T | T | T | T | F | F | F | F |
| 3 | T  | T  | T  | T | T | T | T | T | T | T | T |
+---+----+----+----+---+---+---+---+---+---+---+---+
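To make the row-by-row fill concrete, here is a rough sketch that rebuilds the matrix above. It is an illustration only: build_sum_matrix is a made-up name, and it stores plain true/false values in nested arrays rather than T/F strings with a header row and index column.

```ruby
# Hypothetical helper: returns [labels, rows], where rows[i][j] is true when
# the sum labels[j] can be made from some of the numbers array[0..i].
def build_sum_matrix(array)
  min_sum = array.select(&:negative?).sum       # smallest possible sum
  max_sum = array.select(&:positive?).sum       # biggest possible sum
  labels = (min_sum..max_sum).to_a
  rows = []
  array.each_with_index do |number, row|
    previous = row.zero? ? Array.new(labels.size, false) : rows[row - 1]
    current = labels.each_with_index.map do |label, column|
      offset = label - number - min_sum         # column of (label - number), if it exists
      label == number ||                        # the number on its own
        previous[column] ||                     # a T carried down from the row above
        (offset.between?(0, labels.size - 1) && previous[offset])
    end
    rows << current
  end
  [labels, rows]
end

labels, rows = build_sum_matrix([1, -3, 2, 4])
rows.each { |row| puts row.map { |cell| cell ? 'T' : 'F' }.join(' ') }
# F F F F T F F F F F F
# T T F F T F F F F F F
# T T T T T T T F F F F
# T T T T T T T T T T T
```

The between? guard matters because a negative index would otherwise wrap around to the end of the Ruby array and read the wrong column.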
One final problem remains: how can we use this matrix to get the subset that adds up to the value we want (i.e. 0)? This is also reasonably simple. We start in the column labelled by the sum we're after, in our case we start in the column labelled by zero. If this column does not contain a T then our sum is not possible and the input does not have a solution. In our case, the column does have a T so we're in business.
Let's do this for our matrix. We start at the column labelled by 0 since that's the sum we're looking for. We look at the last row and see a T, but there is also a T in the row above so we go up to that row. Now there is an F in the row above, so we write the number indexed by this row into our output:
output = [2]
We now subtract this number from our column label to get the new column label:
0 - 2 = -2
We jump to the column labelled by -2 and go up a row, there is another T there with an F in the row above, so we write the number indexed by the row to our output:
output = [2, -3]
We perform our subtraction step again:
-2 - -3 = -2 + 3 = 1
We now jump to the column labelled by 1 in the first row in the matrix. There is also a T there, so we need to write one last number to our output:
output = [2, -3, 1]
Since we're at the top of the matrix, we're done. As you can see the procedure we perform to reconstruct the output subset is actually a variant of the third pattern we used to construct the matrix. And that's all there is to it.
Oh yeah, I almost forgot the code :), since it is not tiny, I put it in a gist, you can find it here. But, here are the guts of it:
def initialize_first_row
  @matrix[1].each_with_index do |element,i|
    next if i == 0 # skipping the first one since it is the index into the array
    if @array[@matrix[1][0]] == @matrix[0][i] # the only sum we can have is the first number itself
      @matrix[1][i] = "T"
    end
  end
  @matrix
end

def populate
  (2...@matrix.size).each do |row|
    @matrix[row].each_with_index do |element,i|
      next if i == 0
      if @array[@matrix[row][0]] == @matrix[0][i] || @matrix[row-1][i] == 'T' || current_sum_possible(row, i)
        @matrix[row][i] = "T"
      end
    end
  end
  @matrix
end

def current_sum_possible(row, column)
  column_sum = @matrix[0][column] - @array[@matrix[row][0]]
  column_index = @column_value_to_index[column_sum]
  return false unless column_index
  @matrix[row-1][column_index] == "T"
end

def derive_subset_for(reference_value)
  subset = []
  column_index = @column_value_to_index[reference_value]
  (1...@matrix.size).to_a.reverse.each do |row|
    if @matrix[row][column_index] == "F"
      return subset
    elsif @matrix[row-1][column_index] == "T"
      next
    else
      array_value = @array[row - 1] # the -1 is to account for the fact that our rows are 1 larger than indexes of input array due to row 0 in matrix being header
      subset.insert(0, array_value)
      column_index = @column_value_to_index[@matrix[0][column_index] - array_value]
    end
  end
  subset
end
You can recognise the 3 patterns being applied in the 'populate' method. We're, of course, missing the code for instantiating the matrix in the first place. Grab the whole thing from the gist and give it a run, it generates random inputs of size 50 with values between -1000 and 1000. And if you think that would produce quite a large matrix, you would be right :) (50 rows and about 25000 columns give or take a few thousand). But even with input size 100 it only takes a couple of seconds to get an answer, which is MUCH better than the exponential time algorithm; in my book that equals success. Dropbox Challenge 3 – solved (more or less :))!
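If you just want something small to play with, here is a self-contained sketch of the same idea. To be clear, this is not the code from the gist: subset_with_sum is a made-up name, and instead of a T/F matrix it keeps one Set of reachable sums per row, which amounts to the same information (a sum is in the set exactly when its cell would hold a T).

```ruby
require 'set'

# Hypothetical helper: reachable[i] holds every sum that can be made from a
# non-empty subset of array[0..i]. Returns one subset adding up to target,
# or nil when there is none.
def subset_with_sum(array, target)
  reachable = []
  array.each_with_index do |number, row|
    previous = row.zero? ? Set.new : reachable[row - 1]
    current = previous.dup                          # sums carried down from above
    current << number                               # the number on its own
    previous.each { |sum| current << sum + number } # older sums plus the new number
    reachable << current
  end
  return nil if reachable.empty? || !reachable.last.include?(target)

  subset = []
  remaining = target
  (array.size - 1).downto(0) do |row|
    # If the row above can already make this sum, this number is not needed.
    next if row > 0 && reachable[row - 1].include?(remaining)
    subset.unshift(array[row])
    remaining -= array[row]
    break if remaining.zero?                        # the number completed the sum by itself
  end
  subset
end

p subset_with_sum([1, -3, 2, 4], 0)   # => [1, -3, 2]
```

The Sets only store sums that are actually reachable, so for sparse inputs they can be a good deal smaller than the full matrix, at the cost of not being able to print the pretty tables.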
By the way if you want to print out a few more matrices, grab the code and uncomment the relevant line (102) and you'll get a matrix similar to those above along with the rest of the output. Obviously, if you're doing that, make sure your input size is small enough for the matrix to actually fit on the screen. I used the great terminal-table gem to produce the nice ASCII tables.
Lastly, if you're wondering what framework this is:
if ENV["attest"]
  this_tests "generating subset sums using dynamic programming" do
    test("subset should be [1,-3,2]") do
      actual_subset_sum = subset_sum_dynamic([1, -3, 2, 4], 0)
      should_equal([1,-3,2], actual_subset_sum)
    end
    ...
  end
end
That would be me eating my own dog food, I took the time to write it, might as well use it :).
By the way, it took me hours (pretty much the better part of a day) to get all of this stuff working properly, dynamic programming algorithms really are fiddly little beasts. But, I had some fun, and got some good practice and learning out of it – time well spent (and now there is some decent subset sum code on the internet :P). Of course once I finished with this I had to look at the other challenges, number 2 didn't really catch my attention, but I couldn't walk away from number 1 with its ASCII boxes and bin packing goodness – I'll write that one up some other time.
Images by johntrainor, infinitewhite and familymwr