Algorithms, A Dropbox Challenge And Dynamic Programming

AlgorithmLately I've slowly been trying to grok the fullness of dynamic programming. It is an algorithmic technique that the vast majority of developers never master, which is unfortunate since it can help you come up with viable solutions for seemingly intractable problems. The issue with dynamic programming (besides the totally misleading name), is that it can be very difficult to see how to apply it to a particular problem and even when you do, it is a real pain to get it right. Anyway, I don't want to expound on this, I have something more interesting in mind.

The Dropbox Challenges

I was surfing the web the other day and in the course of my random wanderings I ended up at the Dropbox programming challenges page. Apparently, the Dropbox guys have posted up some coding challenges for people who want to apply to work there (and everyone else, I guess, since it's on the internet and all :)). Challenge 3 (The Dropbox Diet) immediately caught my eye since it looked like one of those problems that should have a dynamic programming solution, so I decided to use it as an opportunity to practice. The full description of the problem is on the challenge page, but here is the gist of it.

We get a list of up to 50 activities and an associated calorie value for each (either positive or negative), we need to find a subset of activities where the sum of all the calorie values is zero.

It sounded easy enough until I thought about it and realised it was more complex than it first appeared. So, I went for a walk :) and when I came back I settled in to analyse it for real. The first part of solving any problem is to really understand what problem you're trying to solve (that one sentence really deserves its own article). In this case the activities list is just extraneous information, what we really have is a list of numbers and we need to find a subset of these numbers where the sum of the subset is equal to a particular value. It took me quite a while to come up with that definition, but once you have something like that, you can do some research and see if it is a known problem.

Of course, I did nothing of the kind, I had already decided that there must be a dynamic programming solution so I went ahead and tried to come up with it myself. This wasted about an hour at the end of which I had absolutely nothing; I guess my dynamic programming chops are still lamb-chops as opposed to nice meaty beef-chops :). Having failed I decided to do what I should have done in the first place – some research. Since I had taken the time to come up with a decent understanding of the problem, it only took 5 minutes of Googling to realise that I was dealing with the subset sum problem.

The Subset Sum Problem

Subset

The unfortunate thing about the subset sum problem is the fact that it's NP-complete. This means that if our input is big enough we may be in trouble. Wikipedia does give some algorithmic approaches to the problem (no code though), but just to cross our t's I also cracked open Cormen et al (have you ever noticed how that book has everything when it comes to algorithms :)). In this case the book agreed with Wikipedia, but once again, no code (there are only two things I don't like about Intro To Algorithms, the lack of real code and the lack of examples). I browsed the web some more, in case it would give me further insight into the problem, but there wasn't much more to know – it was time to get my code on.

The Exponential Time Algorithm

The problem with the exponential time algorithm is its runtime complexity (obviously), but our maximum input size was only 50 and even if that turned out to be too big, perhaps there were some easy optimizations to be made. Regardless I decided to tackle this one first, if nothing else it would immerse me in the problem. I'll demonstrate how it works via example. Let's say our input looks like this:

[1, -3, 2, 4]

We need to iterate through the values and on every iteration produce all the possible subsets that can be made with all the numbers we've looked at up until now. Here is how it looks:

Iteration 1:

[[1]]

Iteration 2:

[[1], [-3], [1, -3]]

Iteration 3:

[[1], [-3], [1, -3], [2], [1, 2], [-3, 2], [1, -3, 2]]

Iteration 4:

[[1], [-3], [1, -3], [2], [1, 2], [-3, 2], [1, -3, 2], [4], [1, 4], [-3, 4], [1, -3, 4], [2, 4], [1, 2, 4], [-3, 2, 4], [1, -3, 2, 4]]

On every iteration we simply take the number we're currently looking at as well as a clone of the list of all the subsets we have seen so far, we append the new number to all the subsets (we also add the number itself to the list since it can also be a subset) and then we concatenate this new list to the list of subsets that we generated on the previous iteration. Here is the previous example again, but demonstrating this approach:

Iteration 1:

[] + [1]

Iteration 2:

[1] + [-3], [1, -3]

Iteration 3:

[1], [-3], [1, -3] + [2], [1, 2], [-3, 2], [1, -3, 2]

Iteration 4:

[1], [-3], [1, -3], [2], [1, 2], [-3, 2], [1, -3, 2] + [4], [1, 4], [-3, 4], [1, -3, 4], [2, 4], [1, 2, 4], [-3, 2, 4], [1, -3, 2, 4]

This allows us to generate all the possible subsets of our input, all we have to do then is pick out the subsets that sum up to the value we're looking for (e.g. 0).

The list of subsets grows exponentially (it being an exponential time algorithm and all :)), but since we know what sum we're looking for, there is one small optimization we can make. We can sort our input list before trying to generate the subsets, this way all the negative values will be first in the list. The implication here is this, once the sum of any subset exceeds the value we're looking for, we can instantly discard it since all subsequent values we can append to it will only make it bigger. Here is some code:

def subsets_with_sum_less_than_or_equal(reference_value, array)
  array = array.sort {|a,b| a <=> b}
  previous_sums = []
  array.each do |element|
    new_sums = []
    new_sums << [element] if element <= reference_value
    previous_sums.each do |previous_sum|
      current_sum = previous_sum + [ element ]
      new_sums << current_sum if current_sum.inject(0){|accumulator,value|accumulator+value} <= reference_value
    end
    previous_sums = previous_sums + new_sums
  end
  previous_sums
end

If we execute that (with our reference value being 0 and our array being [1, -3, 2, 4]), we get the following output:

[[-3], [-3, 1], [-3, 2], [-3, 1, 2]]

All the subsets in that list sum up to less than or equal to our reference value (0). All we need to do now is pick out the ones that we're after.

def subsets_with_sums_equal(reference_value, array)
  subsets_with_sums_less_than_or_equal = subsets_with_sum_less_than_or_equal(reference_value, array)
  subsets_adding_up_to_reference_value = subsets_with_sums_less_than_or_equal.inject([]) do |accumulator, subset|
    accumulator << subset if subset.inject(0){|sum, value| sum+value} == reference_value
    accumulator
  end
  subsets_adding_up_to_reference_value
end

This function calls the previous one and then picks out the subset we're after:

[[-3, 1, 2]]

It's simple and works very well for any input array with less than 20 values or so, and if you try it with more than 25 – good luck waiting for it to finish :). Exponential time is no good if we want it to work with an input size of 50 (or more) numbers.

The Dynamic Programming Algorithm

Matrix

Both Wikipedia and Cormen tell us that there is a polynomial time approximate algorithm, but that's no good for us since we want the subsets that add up to exactly zero, not approximately zero. Fortunately, just like I suspected, there is a dynamic programming solution, Wikipedia even explains how it works, which is only marginally helpful when it comes to implementing it. I know because that was the solution I tackled next. Here is how it works, using the same input as before:

[1, -3, 2, 4]

Just like with any dynamic programming problem, we need to produce a matrix, the key is to figure out what it's a matrix of (how do we label the rows and how do we label the columns). In this case the rows are simply the indexes of our input array; the columns are labelled with every possible sum that can be made out of the input numbers. In our case, the smallest sum we can make from our input is -3 since that's the only negative number we have, the biggest sum is seven (1 + 2 + 4). So, our uninitialized matrix looks like this:

+---+----+----+----+---+---+---+---+---+---+---+---+
|   | -3 | -2 | -1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+---+----+----+----+---+---+---+---+---+---+---+---+
| 0 |    |    |    |   |   |   |   |   |   |   |   |
| 1 |    |    |    |   |   |   |   |   |   |   |   |
| 2 |    |    |    |   |   |   |   |   |   |   |   |
| 3 |    |    |    |   |   |   |   |   |   |   |   |
+---+----+----+----+---+---+---+---+---+---+---+---+

So far so good, but what should we put in every cell of our matrix. In this case every cell will contain either T (true) or F (false).

A T value in a cell means that the sum that the column is labelled with can be constructed using the input array numbers that are indexed by the current row label and the labels of all the previous rows we have already looked at. An F in a cell means the sum of the column label cannot be constructed. Let's try to fill in our matrix to see how this works.

We start with the first row, the number indexed by the row label is 1, there is only one sum that can be made using that number – 1. So only one cell gets a T in it, all the rest get an F.

+---+----+----+----+---+---+---+---+---+---+---+---+
|   | -3 | -2 | -1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+---+----+----+----+---+---+---+---+---+---+---+---+
| 0 | F  | F  | F  | F | T | F | F | F | F | F | F |
| 1 |    |    |    |   |   |   |   |   |   |   |   |
| 2 |    |    |    |   |   |   |   |   |   |   |   |
| 3 |    |    |    |   |   |   |   |   |   |   |   |
+---+----+----+----+---+---+---+---+---+---+---+---+

The number indexed by the second row label is -3, so in the second row, the column labelled by -3 will get a T in it. However, we're considering the numbers indexed by the current row and all previous rows, which means any sum that can be made using the numbers 1 and -3 will get a T in its column. This means that the column labelled with 1 gets a T and the column labelled with -2 gets a T since

1 + -3 = -2
+---+----+----+----+---+---+---+---+---+---+---+---+
|   | -3 | -2 | -1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+---+----+----+----+---+---+---+---+---+---+---+---+
| 0 | F  | F  | F  | F | T | F | F | F | F | F | F |
| 1 | T  | T  | F  | F | T | F | F | F | F | F | F |
| 2 |    |    |    |   |   |   |   |   |   |   |   |
| 3 |    |    |    |   |   |   |   |   |   |   |   |
+---+----+----+----+---+---+---+---+---+---+---+---+

We continue in the same vein for the next row, we're now looking at number 2 since it's indexed by the third row in our matrix. So, the column labelled by 2 will get a T, all the columns labelled by T in the previous row propagate their T value down, since all those sums are still valid. But we can produce a few other sums given the numbers at our disposal:

2 + -3 = -1
1 + 2 + -3 = 0
1 + 2 = 3

All those sums get a T in their column for this row.

+---+----+----+----+---+---+---+---+---+---+---+---+
|   | -3 | -2 | -1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+---+----+----+----+---+---+---+---+---+---+---+---+
| 0 | F  | F  | F  | F | T | F | F | F | F | F | F |
| 1 | T  | T  | F  | F | T | F | F | F | F | F | F |
| 2 | T  | T  | T  | T | T | T | T | F | F | F | F |
| 3 |    |    |    |   |   |   |   |   |   |   |   |
+---+----+----+----+---+---+---+---+---+---+---+---+

There are three patterns that are starting to emerge.

  1. For every row, the column which is equivalent to the number indexed by the row get a T in it (e.g. row zero represents the number 1, so the column labelled by 1 gets a T in row 0, row one represents the number -3 so the column labelled by -3 get a T in row 1 etc.), every row will get one T in one of the columns via this pattern. This is because a single number by itself is a valid subset sum.
  2. If a column already has a T in the previous row, this T propagates down to the current row (e.g. when looking at the second row, the column labelled by 1 has a T in the first row and will therefore have a T in the second row also, when looking at the third row columns labelled by -3, -2 and 1 all had a T in the second row and will therefore contain a T in the third row). This is due to the fact that once it is possible to construct a certain sum using a subset of our input numbers, looking at more of the input numbers does not invalidate the existing subsets.
  3. Looking at any column label X in the current row which still has a value of F, if we subtract the number indexed by the current row from this column label we get a new number Y, we then check the row above the current row in the column labelled by Y, if we see a T, this T is propagated into the column X in the current row (e.g. if we're looking at the second row, column labelled with -2, we subtract the number of the current row -3 from the column label, -2 – -3 = -2 + 3 = 1, this new number is the column label in the first row, we can see that in the first row in the column labelled with 1 there is a T, therefore this T gets propagated to the second row into the column labelled with -2). This is due to the fact that if we take a sum that is already possible and add another number to it, this obviously creates a new sum which is now possible.

Those three patterns are the algorithm that we use to fill in our matrix one row at a time. We can now use them to fill in the last row. The number indexed by the last row is 4. Therefore in the last row, the column labelled by 4 will get a T (via the first pattern). All the columns that already have a T will have that T propagate to the last row (via the second pattern). This means the only columns with an F will be those labelled by 5, 6 and 7. However using pattern 3, if we subtract 4 from 5, 6 and 7 we get:

5 - 4 = 1
6 - 4 = 2
7 - 4 = 3

If we now look at the previous row in the columns labelled by those numbers we can see a T for all three cases, therefore, even the columns labelled with 5, 6 and 7 in the last row will pick up a T via the third pattern. Our final matrix is:

+---+----+----+----+---+---+---+---+---+---+---+---+
|   | -3 | -2 | -1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+---+----+----+----+---+---+---+---+---+---+---+---+
| 0 | F  | F  | F  | F | T | F | F | F | F | F | F |
| 1 | T  | T  | F  | F | T | F | F | F | F | F | F |
| 2 | T  | T  | T  | T | T | T | T | F | F | F | F |
| 3 | T  | T  | T  | T | T | T | T | T | T | T | T |
+---+----+----+----+---+---+---+---+---+---+---+---+

One final problem remains, how can we use this matrix to get the subset that adds up to the value we want (i.e. 0). This is also reasonably simple. We start in the column labelled by the sum we're after, in our case we start in the column labelled by zero. If this column does not contain a T then our sum is not possible and the input does not have a solution. In our case, the column does have a T so we're in business.

  • We start at the last row in this column; if it has a T and the row above has a T we go to the row above.
  • If the row above has an F then we take the number which is indexed by the current row and write it into our final output.
  • We then subtract this number from the column label to get the next column label. We jump to the new column label and go up a row.
  • Once again if there is a T there and there is an F above, then we write the number indexed by the row into our output and subtract it from the current column label to get the next column label.
  • We then jump to that column and go up a row again.
  • We keep doing this until we get to the top of the matrix, at this point the numbers we have written to the output will be our solution.

Let's do this for our matrix. We start at the column labelled by 0 since that's the sum we're looking for. We look at the last row and see a T, but there is also a T in the row above so we go up to that row. Now there is an F in the row above, so we write the number indexed by this row into our output:

output = [2]

We now subtract this number from our column label to get the new column label:

0 - 2 = -2

We jump to the column labelled by -2 and go up a row, there is another T there with an F in the row above, so we write the number indexed by the row to our output:

output = [2, -3]

We perform our subtraction step again:

-2 - -3 = -2 + 3 = 1

We now jump to the column labelled by 1 in the first row in the matrix. There is also a T there, so we need to write one last number to our output:

output = [2, -3, 1]

Since we're at the top of the matrix, we're done. As you can see the procedure we perform to reconstruct the output subset is actually a variant of the third pattern we used to construct the matrix. And that's all there is to it.

Oh yeah, I almost forgot the code :), since it is not tiny, I put it in a gist, you can find it here. But, here are the guts of it:

  def initialize_first_row
    @matrix[1].each_with_index do |element,i|
      next if i == 0 # skipping the first one since it is the index into the array
      if @array[@matrix[1][0]] == @matrix[0][i] # the only sum we can have is the first number itself
        @matrix[1][i] = "T";
      end
    end
    @matrix
  end
 
  def populate
    (2...@matrix.size).each do |row|
      @matrix[row].each_with_index do |element,i|
        next if i == 0
        if @array[@matrix[row][0]] == @matrix[0][i] || @matrix[row-1][i] == &#39;T&#39; || current_sum_possible(row, i) 
          @matrix[row][i] = "T";
        end
      end
    end
    @matrix
  end
 
  def current_sum_possible(row, column)
    column_sum = @matrix[0][column] - @array[@matrix[row][0]]
    column_index = @column_value_to_index[column_sum]
    return false unless column_index
    @matrix[row-1][column_index] == "T";
  end
 
  def derive_subset_for(reference_value)
    subset = []
    column_index = @column_value_to_index[reference_value]
    (1...@matrix.size).to_a.reverse.each do |row|
      if @matrix[row][column_index] == "F";
        return subset
      elsif @matrix[row-1][column_index] == "T";
        next
      else
        array_value = @array[row - 1] # the -1 is to account for the fact that our rows are 1 larger than indexes of input array due to row 0 in matrix being header
        subset.insert(0, array_value)
        column_index = @column_value_to_index[@matrix[0][column_index] - array_value]
      end
    end
    subset
  end

You can recognise the 3 patterns being applied in the 'populate' method. We're, of course, missing the code for instantiating the matrix in the first place. Grab the whole thing from the gist and give it a run, it generates random inputs of size 50 with values between -1000 and 1000. And if you think that would produce quite a large matrix, you would be right :) (50 rows and about 25000 columns give or take a few thousand). But even with input size 100 it only takes a couple of seconds to get an answer, which is MUCH better than the exponential time algorithm; in my book that equals success. Dropbox Challenge 3 – solved (more or less :))!

By the way if you want to print out a few more matrices, grab the code and uncomment the relevant line (102) and you'll get a matrix similar to those above along with the rest of the output. Obviously, if you're doing that, make sure your input size is small enough for the matrix to actually fit on the screen. I used the great terminal-table gem to produce the nice ASCII tables.

Lastly, if you're wondering what framework this is:

if ENV["attest"]
  this_tests "generating subset sums using dynamic programming" do
    test("subset should be [1,-3,2]") do
      actual_subset_sum = subset_sum_dynamic([1, -3, 2, 4], 0)
      should_equal([1,-3,2], actual_subset_sum)
    end
    ...
  end
end

That would be me eating my own dog food, I took the time to write it, might as well use it :).

By the way, it took me hours (pretty much the better part of a day) to get all of this stuff working properly, dynamic programming algorithms really are fiddly little beasts. But, I had some fun, and got some good practice and learning out of it – time well spent (and now there is some decent subset sum code on the internet :P). Of course once I finished with this I had to look at the other challenges, number 2 didn't really catch my attention, but I couldn't walk away from number 1 with its ASCII boxes and bin packing goodness – I'll write that one up some other time.

Images by johntrainor, infinitewhite and familymwr

  • david karapetyan

    You mentioned there was an approximation algorithm as well but you dismissed it for some reason. If the sets you are dealing with only contain integers then the approximation algorithm will indeed give you the correct answer. All you have to do is make sure that the error in the approximation is less than 1 in absolute value because if I give you some integers and tell you that |a_1+a_2+a_3+…+a_n| < 1 then there is only one possible integer the sum could possibly be.

    • http://www.skorks.com Alan Skorkin

      Hi David,

      You’re quite right, given that constraint the approximation algorithm would be totally viable. It would perhaps have been worth it to code up the approximation algorithm just to see how it compared.

      • david karapetyan

        Great job with the write up by the way. You should make the coding and explanation of the approximation algorithm an exercise for the reader because like you said we really don’t have enough quality implementations of dynamic algorithms in an awesome language like ruby.

  • Brian Chen

    I appreciate your thoughtful post, and all the helpful code and clear explanations you’ve put up, but I wish you wouldn’t've explicitly noted where this problem came from, because explaining a good way to go about solving a challenge is very contrary to the spirit of the challenge. What I mean to say by this is that many people who make submissions to this challenge who have seen this post might now have nearly identical code, because they saw a great way to do it, and people who came up with this stuff independently will have their efforts marginalized because other people found a tutorial that hand-fed them an answer.

    • http://www.skorks.com Alan Skorkin

      Hi Brian,

      I get what you’re saying, and I did think about it before I wrote this post, however in my opinion it is less of an issue than you might think. People who’re serious about the challenges will still go away and do it themselves and their code will end up being different, sometimes significantly so. I know this for a fact, since I tutored several subjects while at Uni, you wouldn’t think that the same tiny problem can be solved differently enough by dozens of people, but it can :).

      In this case there is also choice of language, people can choose to use whatever language they want and in the case of the dynamic programming algo, translating it from language to language is a significant enough challenge, I know because I’ve done it with a couple of other dynamic programming algorithms.

      Lastly if you want to truly grok how all of this works, you will have to go through all this effort yourself anyway even if you have read a solution. You need to truly grok something if you want to discuss it intelligently (it is quickly apparent if you can’t discuss your code intelligently), so the total amount of work remains the same/similar.

      Having said all that, I do hear you, perhaps in future I will note where a challenge came from in a less direct fashion.

      • http://www.greenlemon.com.au Age

        Maybe Dropbox should hire you? :)

        • http://www.skorks.com Alan Skorkin

          Haha, I appreciate the sentiment :).

  • Pingback: links for 2011-02-27 « that dismal science

  • Pingback: Дайджест #1 | ByteFrames

  • http://users.softlab.ece.ntua.gr/~ttsiod/ Thanassis Tsiodras

    I did very similar work 5 years ago, to find the optimal way to fill-up a DVD from set of files of various sizes. Have a look:

    http://users.softlab.ece.ntua.gr/~ttsiod/fillupDVD.html

    • http://www.skorks.com Alan Skorkin

      Very interesting, always good to see a real-world application, well kinda, considering you were filling DVDs with junk :).

  • harry Yan

    I agree with Brian…..and I saw your reason also.
    To me, the algorithm is generic and applicable to many problems….you can respect other people’s efforts by not explicitly saying this is a dropbox challenge…I think that’s make a win-win for everybody.

  • http://buddylindsey.com Buddy Lindsey

    I am going to vote in favor of posting where the challenge came from. I like these types of challenges and had it not been mentioned in the article, I never would have known about it. I love the Euler project and have tried to find others like it. So thanks for saying where you got it. I might want to try to try this one so I stopped reading about half way through so I can give it a shot someday, and not be completely tainted by it.

  • kpolo

    If you have N elements in a list, a simple way to generate all possible subsets is to use an N-bit binary number and go from 0 to 2^N – 1. The binary rep of the number when going from 0 to the end tells you what indices the elements of the subset come from.

  • http://www.agile-works.com/blog/ Gamal

    Hi,
    It’s a good solution, and also a good problem, I solved it several weeks ago (http://code.google.com/p/java-puzzles-solution/wiki/TheDropboxDiet) and got the following results :
    Number of Elements: 200
    Range of possible subset sums is from -561350 to +502259
    Different subset sums: 1063609
    Time to take to find the solution: 33834 milliseconds
    Groups that sum zero:48
    and :
    Number of Elements: 200
    Range of possible subset sums is from -4151821 to +5784312
    Different subset sums: 9936133
    Time to take to find the solution: 1113679 milliseconds
    Groups that sum zero: 38
    If you want I will send the numbers so you can run your program and take the times.
    Also thanks for sharing your example.

  • Martin

    Can you not stop computing when the column you are interested in (0, in this case) contains a ‘T’? No need to fill the table if you have a solution, right?

    • http://www.skorks.com Alan Skorkin

      You most definitely can, infact with a large matrix it would be the smart thing to do.

  • Jamie

    Hi Alan,
    Stopping by to say hello, I just discovered your website last month. I do not profess to be any kind of developer I am a administrator analyst that hacks at code, but I have to say your website does inspire me. I think Buddy missed out on who issued the challenge and why, but then again he didn’t want to see your solution and come up with his own. I use dropbox all time for personal backups (in the clouds) :p) who wouldn’t want to work for them. I agree with Age dropbox should hire you.

    • http://buddylindsey.com Buddy Lindsey

      I definitely picked up on who issued the challenge and why. I still think it was a good idea to say who issued. More problems can always be thought up if necessary. If he hadn’t mentioned dropbox as the challenger, whether they are to hire or not, I would not have known about them. I am having a fun time playing with doing the syncing challenge.

  • Martin

    Here’s my version, in python, immediately returning a solution when a ‘T’ is in the 0 column.

    https://github.com/mlbright/db3/raw/master/subsetsum.py

    • http://www.agile-works.com/blog/ Gamal

      Please Alan and Martin, could you execute your programs with these data:

      -30592,-9169,-58249,-34142,-43869,12959,-18210,15012,-69174,85253,-26281,-92162,81879,-27680,50737,-35641,32942,-61297,-444,90527,-19427,
      37398,63394,-82246,37874,-69777,15606,83604,92086,-52780,87205,-12087,-58274,93346,7583,-5140,72716,-26967,52061,-15782,-87580,63824,83575,
      40227,-11597,-82970,66962,35203,20080,24456,67245,46397,-50920,-83689,25943,-27288,-34123,67793,80787,-17716,-84817,16192,56083,-37053,-10021,
      94940,94476,16531,60105,-66212,-24838,-7313,-87070,15708,7635,8848,93908,91138,-35092,-26939,88379,-51900,-42146,-65451,72785,-48981,8965,73601,
      -42692,67817,-52878,12247,82542,94914,69731,75849,24254,-21358,-98978,-47567,21253,-83211,16415,-67566,3597,-18502,97983,35463,16968,71836,41472,
      75842,-32052,25194,99743,61611,88361,-16447,27721,-61001,51987,49696,-94395,-83762,-44810,-3896,-38903,31518,-13607,87193,87409,-47195,35717,
      -60860,-76231,72930,-72005,-58454,-78626,71995,66271,-5088,-21597,9348,-12460,-61218,-20810,63287,-4430,37389,-98472,-34260,42040,-11579,
      85541,-32912,62542,48741,43867,63112,-5447,48928,69527,7915,83894,-53855,-15804,48122,-54160,36925,46281,-9623,-13546,-58534,81734,8553,
      94110,-46844,-20899,-76752,-89345,-185,-48721,-73051,47509,-85309,95908,-22123,65648,-79337,62475,89166,27110,63889,87843,-38014,-9344,42961,
      53746,66704
      

      And let me know the groups that sum cero and the time that it takes to your program to find the solution.
      Thanks in advance

      • http://www.skorks.com Alan Skorkin

        As people have pointed out, my dynamic programming solution won’t work when the range of inputs is really large (as it is with your data). That is to say, it would work but the matrix would be so large that it might take quite a while. Of course the program can be reimplemented without using the matrix and in such a way as to make it insensitive to the actual range of inputs (as some people have also pointed out in various places e.g. Hacker News).

      • Harit

        Gamal,

        I wrote a C++ program which takes the following amount of time and gives the following answer:

        > time ./diet
        -9169 -58249 12959 -18210 85253 -27680 50737 -35641
        0.22u 0.00s 0:00.23 95.6%

        It does not do any file processing (I just put the input values in the source code) and stops at the first hit.

        If I change it to return all answers:
        448.07u 0.79s 7:29.20 99.9%
        ~7 1/2 minutes.

    • Charlie Sutikno

      Hmmmm I can’t download it, is it down?

  • Wonder Y

    No offense, but this is a very poor solution to the problem. Your matrix is cute, but it ignores some fundamental realities:
    * the problem, like the “same birthday” problem, usually converges to an answer very quickly unless the range of inputs is really large
    * if the range of inputs is really large you need a really large matrix
    * making a really large, sparsely populated matrix is a really large waste of memory

    A much better answer is to build a data structure from the ground up (in order of ascending absolute value), testing for solutions as you go along. This approach finds an answer so fast that optimizations (such as ignoring cases where one sum is greater than the sum of the other set) are superfluous. Your algorithm will run out of memory before mine starts to slow down. There is a lot of variability in the results of course, but using JavaScript / Chrome / Windows / a Vostro 1720 it doesn’t take more than a couple of seconds to find an answer until the range gets to +/- a billion.

    • http://www.skorks.com Alan Skorkin

      Agreed, the idea was to demonstrate a technique, but several people have pointed out that there is a better way to solve the problem. However the challenge did not ask to solve the general subset sum problem but the specific instance with calorie values which means the range of inputs will never be large enough for it to matter.

    • http://www.agile-works.com/blog/ Gamal

      Wonder Y, you are right when you said: “Your algorithm will run out of memory”, but please could you provide some solution in order to compare the different ones to see severals ways of solving this problem. Currently exist better solutions with all the teorical background such as: “A complete anytime algorithm for number partitioning” (author “Richard E. Korf”) but I think that the point is try to find some of fun solving this problem using only our imagination, please if you found a solution try with the number that I show above in order to compare times and algorithms. Thanks in advance

      • Wonder Y

        Gamal,
        The original question called for finding “a” solution, not all solutions. My algorithm does the following:
        * sort the input array by ascending absolute value
        * create two hash tables with the key being a number and the value being the numbers that sum to that value (in retrospect I probably only need one hash table and an additional conditional statement or two to make sure positive and negative numbers are kept separate)
        * iterate through the array
        * generate sums by adding to existing entries in the corresponding hash table, creating new entries and testing for matches on the other side as you go along

        I’m new to this site and don’t have a convenient way of posting my code publicly but I can tell you that it finds an answer almost immediately even in JavaScript. Only eighteen of the fifty inputs are needed to find a solution.

        The key to the question is that if you can filter data so that you process only the most important entries to start with, most of your scalability concerns will go away. Creating a massive data structure (whether a collection of possible sums or a matrix) should be avoided if at all possible.

        • http://www.agile-works.com/blog/ Gamal

          Wonder Y,
          Please could you use your program with the numbers that I post before (you could see above), and let me know the time that it takes.
          Regards
          Gamal

    • Martin

      Can you point out an example of this technique? perhaps some code on the web? That would be much appreciated.

  • devrandnull

    Very nice explanation, thanks a lot!

    Some doubts that I have:
    What would be the complexity of the dynamic algorithm you have explained?
    Time: O(N * C), where N = number of input values and C = number of possible sums
    Space: O(N * C)

    What if we want to get all the subsets that are valid? Do we have to go for the exponential algorithm?

    • http://www.skorks.com Alan Skorkin

      The complexity is linear, or to be more precise O(n) where n is the number of possible values between the largest possible sum that can be made from the input and the smallest possible sum (the number of input values is a factor but it is a multiplier and so doesn’t matter).

      If you want all the possible subsets, the exponential algorithm is one way, however I am pretty sure that you can backtrack through the matrix in multiple different ways if multiple sums are possible, I haven’t actually done it, but it makes sense that this would be possible.

  • http://droope.wordpress.com droope

    Holy cow!

    That was amazing. I think I have a “dynamic programming” challenge myself, and it’s hackthissite.org ‘s programming excercise number 3.

    Will bookmark it for a later read. As always, awesome ;)

    Thanks.

  • http://www.surgeforward.com Surge

    There is allot of good information here that will help me. I have been studying algorithms and this has helped me understand a little more about how to work it. Thanks for the post.

  • http://gagan.co.nr Gagan Deep

    Awesome article…i wanted to learn DP ….but didnt get proper thing….this one is simple to understand…thanx

  • Prateek Caire

    Thanks for the explanation. But why did not you use Knapsack algo here. I reckon Subset Sum problem is a special case of Knapsack Problem.

  • http://algologix.wordpress.com pawan

    cool dp solution man.

  • Kendal

    Stranger In a strange land reference at the beginning, first person I’ve ever seen besides myself that has read that book

  • roy

    Excellent post ! Simply awesome !!
    Can you please provide the code in C/C++/C#/VB ?

  • psahni

    Mind Blowing

  • psahni

    Thanks a Ton My Brother. May God Bless You. This article really helped me.