NALU

It drives self-driving cars, yet it gets into trouble when it has to count. Do you know what I'm talking about? The answer may be surprising: neural networks. A new research paper from DeepMind tries to solve this problem.

A neural network can learn incredibly complex functions from training examples, but it struggles to generalize well, which has long been a nightmare for researchers. Think about preparing for a math exam. You have two options: you memorize the exercises solved in class and hope that the teacher is lazy and won't invent new ones, or you understand the logic behind the exercises. The two methods don't differ as long as the exam contains the same examples that were solved in class. But what happens when that isn't the case?

Neural networks don't generalize well, so they learn like the first student in the thought experiment. We can easily observe this behavior by teaching a neural network the simple identity function y = x. First, you have to create the training examples:
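
A minimal sketch of such an experiment, assuming a small Keras MLP with sigmoid activations and a training range of about -5 to 5 (the exact setup behind the figure below differs in its details):

```python
import numpy as np
from tensorflow import keras

# Training data: y = x, but only inside a narrow range
x_train = np.random.uniform(-5, 5, size=(10000, 1))
y_train = x_train.copy()

# A small fully connected network with sigmoid activations
model = keras.Sequential([
    keras.layers.Dense(8, activation="sigmoid", input_shape=(1,)),
    keras.layers.Dense(8, activation="sigmoid"),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mae")
model.fit(x_train, y_train, epochs=10, batch_size=64, verbose=0)

# Inside the training range the error is tiny...
print(model.evaluate(x_train, y_train, verbose=0))
# ...but far outside of it the prediction breaks down
print(model.predict(np.array([[100.0]]), verbose=0))  # nowhere near 100
```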

Inside the training range the network can solve this equation almost perfectly, but outside of that range it fails. The question is: why?

The figure shows the Mean Absolute Error for a conventional FC network with different activations. [1]

The answer: the non-linear activations. Without them we couldn't map complex functions between the input and the output, but they hold us back in numeric calculations. The more strongly non-linear the activation, the bigger the extrapolation error. A ReLU activation, for example, can extrapolate with a small error, while a sigmoid activation can only do so with a huge error.

ReLU activation (left) [2], sigmoid activation (right) [3].
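
A quick numerical illustration of the difference, using plain NumPy (the input values are arbitrary, chosen only to show the saturation):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([1.0, 10.0, 100.0])
print(relu(x))     # [  1.  10. 100.] - keeps growing linearly with the input
print(sigmoid(x))  # [~0.73  ~1.0  1.0] - saturates: 10 and 100 become indistinguishable
```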

The recently published paper solves this problem.

NAC

NAC module with variable names. [1]

The key idea is the NAC module. Its job is to add and subtract the input values (a linear combination) without any scaling. It achieves this with a weight matrix W whose entries can only be -1, 0 or 1. We don't apply any activation or bias after the multiplication, so the operation can be stacked an arbitrary number of times. This allows the network to extrapolate outside of the training range.

Equations of the NAC module [1]

The weight matrix could be built by assigning these three discrete values directly, but that isn't differentiable. For that reason we use the equation on the right side, W = tanh(Ŵ) ⊙ σ(M̂): it guarantees that the weights stay close to -1, 0 and 1, with only a small amount of error. A further convenience is that Ŵ and M̂ can be initialized and optimized with the conventional methods.
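
As an illustration, a NAC layer could look like the following PyTorch sketch (not the authors' reference code):

```python
import torch
import torch.nn as nn

class NAC(nn.Module):
    """Neural Accumulator: a = W x, with W = tanh(W_hat) * sigmoid(M_hat)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W_hat = nn.Parameter(torch.randn(out_dim, in_dim) * 0.1)
        self.M_hat = nn.Parameter(torch.randn(out_dim, in_dim) * 0.1)

    def forward(self, x):
        # The element-wise product pushes the effective weights towards -1, 0 and 1
        W = torch.tanh(self.W_hat) * torch.sigmoid(self.M_hat)
        # No bias and no activation, so stacked NACs stay purely additive
        return torch.nn.functional.linear(x, W)
```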

We also need the more complex numerical functions, for example multiplication and division. What can we do? In log space, addition and subtraction act as multiplication and division; we all know these logarithm identities. If we keep the W matrix constructed in the previous section, apply it to the logarithms of the inputs, and then apply the exponential function, we obtain the expected result.

Addition in log space

An epsilon is added to avoid the numerical problems of log(0).
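
The multiplicative path can reuse the NAC sketch above; the epsilon value below is an assumption, not taken from the paper:

```python
import torch

def multiplicative_path(nac, x, eps=1e-8):
    # exp(W log(|x| + eps)): addition in log space becomes multiplication,
    # subtraction becomes division
    return torch.exp(nac(torch.log(torch.abs(x) + eps)))
```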

NALU

NALU module with variable names [1]

We now have two NAC modules (purple on the figure), which are the basic building blocks of our NALU: one adds and subtracts, the other multiplies and divides. These two NACs, combined with a sigmoid gate (orange on the figure), make up the NALU architecture. The gate lets the model decide which basic function is needed for high accuracy. Because the NALU can simulate the basic operations almost perfectly, it generalizes well outside of the training range, and the module can easily be plugged into other networks (RNN, CNN, FC).

Output of the NALU [1]
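
Putting the pieces together, a NALU layer might be sketched like this. It reuses the NAC class from above; implementations differ in whether the two paths share weights, and here they are kept separate to mirror the two purple cells on the figure:

```python
import torch
import torch.nn as nn

class NALU(nn.Module):
    """y = g * a + (1 - g) * m, where g is a learned sigmoid gate."""
    def __init__(self, in_dim, out_dim, eps=1e-8):
        super().__init__()
        self.add_nac = NAC(in_dim, out_dim)   # additive path (NAC from the sketch above)
        self.mul_nac = NAC(in_dim, out_dim)   # multiplicative path, applied in log space
        self.G = nn.Parameter(torch.randn(out_dim, in_dim) * 0.1)
        self.eps = eps

    def forward(self, x):
        a = self.add_nac(x)                                               # add / subtract
        m = torch.exp(self.mul_nac(torch.log(torch.abs(x) + self.eps)))   # multiply / divide
        g = torch.sigmoid(torch.nn.functional.linear(x, self.G))          # gate between the two
        return g * a + (1 - g) * m
```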

In the paper the researchers test this module on different tasks, and it achieves state-of-the-art accuracy on many of them. One example task is to count the different MNIST digits in a sequence, or to sum them up. The networks were trained on input sequences of length 10; the figure shows the results:

Results on two tasks with different activations and regions. [1]

The results clearly demonstrate that both inside and outside of the training range the model achieves results an order of magnitude better than the well-known, widely used models. The new architecture was also tested on other tasks, such as translating written numbers into digits and solving numerical problems for RL agents.

Of course it doesn't fit every problem, but it demonstrates a new modeling strategy that can help add numerical abilities to an existing architecture.

We are confident that there are many tasks inside your company that can be automated with AI. If you would like to enjoy the advantages of artificial intelligence, apply for our free consultation through one of our contacts.

 

References

[1] Neural Arithmetic Logic Units. arXiv:1808.00508v1 [cs.NE]. arxiv.org/abs/1808.00508

[2] https://mlnotebook.github.io/post/transfer-functions/

[3] https://medium.com/@kanchansarkar/relu-not-a-differentiable-function-why-used-in-gradient-based-optimization-7fef3a4cecec
