Finishing 2nd in Kaggle’s Abstraction and Reasoning Challenge

Alejandro de Miquel Bleier
Oct 14, 2020 · 7 min read

Deep Learning algorithms have become extremely powerful tools over the last few years, outperforming humans at certain tasks. However, they are not very good at solving tasks they have never seen before, even the simplest ones. Below are three examples of tasks that a human could solve within a few seconds, whereas a DL algorithm could not, unless it had been trained specifically for that particular kind of task.

Three ARC tasks

The three tasks above were part of the dataset of the ‘Abstraction and Reasoning Challenge’ (ARC), a competition launched on Kaggle in February 2020. Hosted by Keras creator François Chollet, it aimed to address some of the main gaps that currently exist in the world of AI, as laid out in his paper ‘On the Measure of Intelligence’. In particular, participants were challenged to develop a program capable of solving IQ-test-like problems that it had never seen before. Humans are very good at quickly detecting all sorts of visually clear patterns and symmetries that are not obvious to a program unless it is taught to detect them. A program capable of achieving human-like performance on such tasks would bring us a step closer to what is known as Artificial General Intelligence (AGI).

In this post I will share my personal experience in this competition, in which my teammates and I managed to finish 2nd. I will also share the main ideas of our algorithm, which can be found in this GitHub repository.

Understanding the Dataset: First Approaches

A training dataset of 800 tasks was provided, and a hidden test set of 100 additional tasks determined the competition’s final leaderboard. For each task in the test set, one was allowed to submit three candidate solutions; if any of them was correct, the task was considered solved.

Quite early on, some participants shared high-quality notebooks pointing at the potential usefulness of combining CNNs and recurrence. This notebook in particular was my first source of inspiration for trying a DL-related approach. However, after a few attempts, it became clear to me that the approach was quite limited: while some of the tasks could clearly be solved this way, others were quite unlikely to be solved by a generic CNN. In particular, I didn’t manage to solve a single task of the test set with this approach.

Two tasks easily solvable by a generic CNN, and a task a CNN struggles to solve
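
To make the idea concrete, below is a minimal sketch of the generic per-pixel CNN approach (my own illustration of the concept, not the notebook’s actual code). Each grid is one-hot encoded over ARC’s 10 colours, and a small stack of same-padding convolutions predicts a colour for every pixel; trained only on the handful of demonstration pairs of a single task, such a model can only work when input and output grids share the same shape.

    # Minimal sketch of a generic per-pixel CNN for ARC grids (illustrative only)
    import tensorflow as tf

    NUM_COLORS = 10  # ARC uses colour indices 0-9

    def build_pixelwise_cnn() -> tf.keras.Model:
        # Variable-size grids, one-hot encoded over the 10 colours
        inputs = tf.keras.Input(shape=(None, None, NUM_COLORS))
        x = inputs
        for _ in range(4):  # small stack of same-padding 3x3 convolutions
            x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(x)
        # One colour prediction per output pixel
        outputs = tf.keras.layers.Conv2D(NUM_COLORS, 1, activation="softmax")(x)
        return tf.keras.Model(inputs, outputs)

    model = build_pixelwise_cnn()
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    # model.fit would then be called on the demonstration pairs of one task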

It soon became clear to me that I would need to develop some sort of Domain-Specific Language (DSL), as suggested in Chollet’s paper. This DSL would have to let me express the most basic concepts that we humans take for granted and detect easily. I needed to automatically detect shapes, symmetries, the background color, or features that remained unchanged. And a neural network, as we know them today, wouldn’t do that for me.

So I started developing my own framework, analysing properties at the task, sample, and matrix level, and writing functions that would apply basic transformations. Below is a task that could be solved by a ‘changeColor’ function, turning orange into grey, and another task that could be solved by a ‘pixelwiseAnd’ function (black AND black = red).

Tasks solvable with one simple operation
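
As an illustration, here is roughly what two such single-operation functions might look like on grids stored as numpy arrays of colour indices (the names and signatures are mine, not necessarily those used in the repo):

    import numpy as np

    def change_color(matrix: np.ndarray, from_color: int, to_color: int) -> np.ndarray:
        """Recolour every pixel of from_color into to_color."""
        out = matrix.copy()
        out[out == from_color] = to_color
        return out

    def pixelwise_and(a: np.ndarray, b: np.ndarray, true_color: int,
                      result_color: int, background: int = 0) -> np.ndarray:
        """Where both grids show true_color, paint result_color; else background."""
        out = np.full_like(a, background)
        out[(a == true_color) & (b == true_color)] = result_color
        return out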

But the kinds of tasks shown above are still quite simple, and better tools were needed to solve more complex ones.

Final Algorithm Design

Writing a DSL meant developing lots of small, basic functions, but they needed to be combined in a smart way to make a powerful algorithm out of them. This notebook convinced me that the cleverest way to do it was to develop a genetic algorithm that would concatenate my basic operations until reaching the desired result.

Since up to three guesses were allowed per task, my genetic algorithm would keep track of the three best candidates. A candidate was a sequence of operations, together with the resulting task (the original input matrices and the output matrices generated by applying the sequence of operations to them). A score then determined which candidate was better: the number of pixels differing between the true outputs and the outputs generated by the candidate (lower is better).

With these definitions set, the implemented algorithm was as follows:

  1. Initialize the three best candidates as the identity function.
  2. If all the candidates have a score of 0 or none of them can be improved, END.
  3. Pick a candidate that hasn’t been examined yet. Analyse the task and identify operations that, given the task properties, make sense to try.
  4. Execute each of the selected operations and compute the score of the resulting task. If the score is better than that of any of the three best candidates, generate a new candidate using that operation and add it to the set, replacing the candidate with the worst score.
  5. Go back to step 2.
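
In code, the loop looks roughly as follows. This is a simplified, self-contained sketch with hypothetical names and a fixed depth limit; in the real implementation, the task analysis of step 3 prunes the operation list far more aggressively.

    import numpy as np

    def pixel_score(ops, task):
        # Pixels differing between the true train outputs and those produced
        # by applying the operation sequence (lower is better)
        total = 0
        for x, y in task["train"]:  # task["train"] holds (input, output) pairs
            pred = x
            for op in ops:
                pred = op(pred)
            total += (pred != y).sum() if pred.shape == y.shape else float("inf")
        return total

    def search(task, applicable_ops, max_depth=5):
        # Step 1: the three best candidates start as the identity (empty sequence)
        best = [((), pixel_score((), task))] * 3
        to_examine = [()]
        while to_examine:  # Step 2: stop when no candidate can be improved
            ops = to_examine.pop()
            if len(ops) >= max_depth:
                continue
            for op in applicable_ops:  # Step 3 (task analysis omitted here)
                new_ops = ops + (op,)
                s = pixel_score(new_ops, task)
                worst = max(best, key=lambda c: c[1])
                if s < worst[1]:  # Step 4: replace the worst of the best three
                    best[best.index(worst)] = (new_ops, s)
                    to_examine.append(new_ops)
            if all(s == 0 for _, s in best):  # Step 2: all three are perfect
                break
        return [ops for ops, _ in best]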
Two tasks solvable by concatenating basic operations

This algorithm allowed me to solve many tasks that would otherwise have been very complicated to solve. For example, the two examples above could now be solved by:

  • changeShapeColor(grey, blue, ‘biggest’) → changeShapeColor(grey, red, ‘smallest’) → changeColor(grey, black)
  • cropBackground → multiplyMatrix((1,2))
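
For illustration, the two operations in the second solution might look something like this (the signatures are hypothetical, and whether multiplyMatrix tiles the grid or scales it is my guess):

    import numpy as np

    def crop_background(matrix: np.ndarray, background: int = 0) -> np.ndarray:
        # Crop to the bounding box of all non-background pixels
        ys, xs = np.where(matrix != background)
        return matrix[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

    def multiply_matrix(matrix: np.ndarray, factor) -> np.ndarray:
        # Tile the matrix factor[0] times vertically, factor[1] times horizontally
        return np.tile(matrix, factor)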

Extending and Improving the Set of Operations

Once the algorithm was defined, it was time to extend the set of operations. At that point, I was still solving only one of the tasks in the test set, so I knew that there was a lot of work to be done in very little time. Luckily, my friend Roderic decided to join me, and together we started analysing the whole training and validation datasets, looking for ways of improving our set of functions.

In around two months of hard work, we built a set of over 50 functions, each possibly taking several parameters. Some examples are cropShape, connectPixels or replicateShape; a complete list can be found in the GitHub repository mentioned above.
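
As an example of what such a function could do, here is an illustrative guess at a connectPixels-style operation (not the repo’s actual code): it draws a straight line between pairs of same-coloured pixels that share a row or a column.

    import numpy as np

    def connect_pixels(matrix: np.ndarray, color: int, line_color: int) -> np.ndarray:
        out = matrix.copy()
        ys, xs = np.where(matrix == color)
        points = list(zip(ys.tolist(), xs.tolist()))
        for i, (y1, x1) in enumerate(points):
            for y2, x2 in points[i + 1:]:
                if y1 == y2:  # same row: fill the horizontal gap
                    lo, hi = sorted((x1, x2))
                    out[y1, lo + 1:hi] = line_color
                elif x1 == x2:  # same column: fill the vertical gap
                    lo, hi = sorted((y1, y2))
                    out[lo + 1:hi, x1] = line_color
        return out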

Moreover, we also improved our algorithm by adding preprocessing and postprocessing steps whenever we considered them convenient. For example, sometimes a grid played a role, and we needed either to ignore it or to treat its cells as separate objects. Or sometimes two different colors represented exactly the same thing in different matrices of the same task, making it convenient to perform some recoloring as part of the preprocessing.

Tasks in which preprocessing was useful

Some of these preprocessing ideas, which also included rotating matrices when necessary or augmenting the set of samples, ended up being key to improving our final score.
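
One plausible (and purely illustrative) version of the recolouring idea is to map each matrix’s colours to a canonical order, for instance by frequency, so that two colours playing the same role in different samples end up with the same index:

    import numpy as np
    from collections import Counter

    def canonicalize_colors(matrix: np.ndarray):
        # Relabel colours by decreasing frequency; also return the mapping
        # needed to undo the relabelling on the predicted output
        counts = Counter(matrix.flatten().tolist())
        order = [c for c, _ in counts.most_common()]
        forward = {old: new for new, old in enumerate(order)}
        recolored = np.vectorize(forward.get)(matrix)
        backward = {new: old for old, new in forward.items()}
        return recolored, backward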

Team Merging and End of Competition

As we added functions to our function basis, our leaderboard score kept improving. With around ten days to go, we were quite confident of ending up in the top 10. At that point Yuji, a participant who also had a good score, suggested that we merge teams. We saw it as a good opportunity to finish in the top 3, so we accepted his offer, which in the end helped us solve the extra tasks we needed to finish in 2nd position.

Final leaderboard

Final Thoughts

Taking part in this competition was an amazing experience. I believe that there is a lot of exciting work ahead in the world of AI coming from the branch of AGI. There are just too many simple things that DL algorithms are not capable of doing, and I’m sure that future research in this domain will significantly contribute to humanity’s progress. Identifying the right abstractions has always been an extremely powerful tool in mathematics (that’s what mathematics is really all about), and I think the right abstractions have the potential to become game-changers in the field of AI as well.

However, this competition has shown that this might be quite difficult to achieve. No team out of the 914 participants found a satisfying, AI-focused solution to this problem. Our solution is quite deterministic (even if some of our operations included ML techniques), and so are the solutions of all the other top teams; the winner’s solution is actually quite similar to ours. While this might be a bit disappointing, one needs to keep in mind that the competition only lasted around 3 months; and if the top solutions already managed to solve around 20% of the tasks in the test set, I’m sure that long-term collaborative research could lead to the discovery of techniques that bring computers closer to understanding and mimicking our way of reasoning.


Alejandro de Miquel Bleier

Mathematician and Software Developer with an interest in AI and Deep Learning