COMPUTING

Practical Neural Networks (2)
Part 2: Back Propagation Neural Nets
By Chris MacLeod and Grant Maxwell
Back Propagation (BP) Networks are the quintessential Neural Nets.
Probably eighty percent of nets used today are of this type. Actually
though, Back Propagation is the learning or training method, rather than
the network structure itself.
The network operates in the same way as the type we've looked at in Part 1 — you apply the inputs and calculate an output exactly as described. What the Back Propagation part does is allow you to change the weights, so that the network learns and gives you the output you want. The weights the network starts off with are simply set to small random numbers (say between –1 and +1).
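As a reminder of Part 1, here is a minimal sketch (mine, not the article's code) of that starting point: weights set to small random numbers between –1 and +1, and an output calculated with a forward pass through a sigmoid neuron.

import math
import random

def make_weights(n_inputs):
    # Initial, untrained weights: small random numbers between -1 and +1.
    return [random.uniform(-1.0, 1.0) for _ in range(n_inputs)]

def neuron_output(weights, inputs):
    # Forward pass for one neuron: weighted sum of the inputs, squashed by the sigmoid.
    activation = sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-activation))

print(neuron_output(make_weights(4), [0, 1, 1, 0]))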
What is BP good for?

Back Propagation is excellent for simple pattern recognition and mapping tasks. It learns by example.

To give a typical application, we can train a BP network for character recognition. All you need to do is give it examples of the characters and the output you would like the network to produce for each one, and it will learn from them, see Figure 1.
The algorithm works by calculating an error — the amount by which the output differs from an ideal value (chosen by you, and called the Target) — and then changing the weights to minimise this error. Once the network is trained, it will correctly give the output when a character is applied, even if the character is distorted, imperfect or noisy. In this case, because the Target has two bits, we need two output neurons (one for each bit). Each input and its associated Target is called a Training Pair.
Figure 1. Use of a BP network for image recognition.
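As a sketch (not from the article) of how such Training Pairs might be held in code, each pair is simply an input pattern (a flattened pixel grid) together with its two-bit target. The 2 x 2 patterns and target codes below are invented purely for illustration.

training_pairs = [
    # (inputs: flattened pixel grid, target: two output bits)
    ([0, 1,
      1, 0], [0, 1]),
    ([1, 0,
      0, 1], [1, 0]),
    ([1, 1,
      0, 0], [1, 1]),
]

for inputs, target in training_pairs:
    print("inputs:", inputs, "-> target:", target)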
What does a BP network look like?

Figure 2 shows a BP network being used for Pattern Recognition. A common question is: how big should the network be? We can see from Figure 2 that the number of inputs is fixed by the pattern we are trying to process. In the case of four pixels, there must be four inputs.

Likewise, the number of output neurons is fixed by the number of patterns we want to recognise. If we had nine patterns we could either use three output neurons and binary code their outputs, or we could use nine and assign them so that, for example, when pattern 2 appears, output neuron 2 gives a '1' (and the rest are zero).

This only really leaves the number of neurons in the hidden layer to be decided on. Fortunately, networks are quite flexible about this parameter and will operate over a wide range of hidden layer sizes; although, the more patterns the network needs to remember, the more neurons you will need. In a network designed to recognise all 26 letters of the alphabet (26 output neurons) on a 5×7 grid (35 inputs), the network will function with anywhere between about 6 and 22 hidden neurons. If you have too few, then the network hasn't got enough weights to store all the information in; if there are too many, it becomes inefficient and prone to a problem called local minima (discussed later).

Figure 2. A network wired for recognising patterns.
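A small calculation (mine, not the article's) shows how the hidden layer size sets the number of weights available to store the patterns, for the 35-input, 26-output letter-recognition example above. Bias weights, if your network uses them, are not counted here.

n_inputs, n_outputs = 35, 26

for n_hidden in (6, 14, 22):
    # Fully connected: input-to-hidden weights plus hidden-to-output weights.
    n_weights = n_inputs * n_hidden + n_hidden * n_outputs
    print(n_hidden, "hidden neurons ->", n_weights, "weights")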
The BP algorithm

Now let's have a look at the training algorithm itself. To do this, we'll refer to three neurons labelled A, B and C in Figure 3. The weight that we'll train is the one between neuron A and neuron B, labelled W AB in the diagram. The diagram also shows another weight — W AC — and we'll return to that one in a moment.

Figure 3. Three neurons which are part of a larger network.

The algorithm works like this:

1. First, apply the inputs to the network and calculate its outputs as described last month in Part 1 (this is the forward pass).

2. Next, calculate the output error for neuron B. The error is basically: What you want – What you get. What you want is your target and what you get is your output. Mathematically:

Error B = Output B * (1 – Output B) * (Target B – Output B)

The term Output B * (1 – Output B) is present because of the effect of the sigmoid function — if we were just using a binary threshold, we would omit it.

3. Change the weight. Let W+ AB be the new (trained) weight and W AB be the original (untrained) weight:

W+ AB = W AB + η (Error B × Output A)

Notice that we use the error of the second neuron (B), but the output of the feeding neuron (A). The constant η (called the learning rate, and nominally equal to one) is put in to speed up or slow down the learning if required.

4. Change all the other weights in the output layer in this manner.

5. To change the weights of the hidden layers you need to calculate an error for the hidden neurons. We do this by Back Propagating the errors of the output neurons back. For example, suppose we want to calculate the error for neuron A. We use the errors calculated for all the output neurons attached to it, in this case B and C, and propagate them back — hence the name of the algorithm:

Error A = Output A * (1 – Output A) * (Error B * W AB + Error C * W AC)

Again, the Output A * (1 – Output A) term serves the purpose noted in step 2.

6. Having obtained the errors for the hidden layer neurons, we now proceed back to stage 3 and change their weights.
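The rules in steps 2, 3 and 5 can be captured in a few lines of code. The sketch below is in Python and is mine, not the article's; the learning rate η is taken as 1, and the numerical values at the end are invented purely for illustration.

eta = 1.0

def output_error(output, target):
    # Step 2: error of an output neuron (the out * (1 - out) term comes from the sigmoid).
    return output * (1.0 - output) * (target - output)

def trained_weight(old_weight, error_of_fed_neuron, output_of_feeding_neuron):
    # Step 3: W+ = W + eta * (Error of the fed neuron x Output of the feeding neuron).
    return old_weight + eta * error_of_fed_neuron * output_of_feeding_neuron

def hidden_error(output_a, error_b, w_ab, error_c, w_ac):
    # Step 5: back-propagate the output errors through the connecting weights.
    return output_a * (1.0 - output_a) * (error_b * w_ab + error_c * w_ac)

out_A, out_B, out_C = 0.6, 0.8, 0.3     # outputs from a forward pass (invented)
target_B, target_C = 0.0, 1.0           # targets for the two output neurons (invented)
W_AB, W_AC = 0.5, -0.2                  # weights from A to B and from A to C (invented)

err_B = output_error(out_B, target_B)                     # step 2
err_C = output_error(out_C, target_C)
W_AB = trained_weight(W_AB, err_B, out_A)                 # step 3
W_AC = trained_weight(W_AC, err_C, out_A)                 # step 4: same rule, other weight
err_A = hidden_error(out_A, err_B, W_AB, err_C, W_AC)     # step 5
print(err_B, err_C, W_AB, W_AC, err_A)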
Now this might be a little confusing, so let's show a full example, Figure 4.

Using BP to train a network

Now that we've seen the algorithm in detail, let's look at how to use it. One of the most common mistakes made when programming a BP network for the first time is the order in which you apply the patterns to the network. Let us take an example: suppose you wanted to teach the network to recognise the first four letters of the alphabet, placed on a 5×7 grid.

The correct way to train the network is to apply the first letter, and then change all the weights of the network once (i.e., do all the calculations in Figure 4, once only). Then apply the second pattern and do the same again, then the third and finally the fourth. Once you've gone through this cycle once, start all over again with pattern 1. Figure 5 shows the idea.

Figure 5. How a network learns four patterns.
We stop the network when the total error is low enough — that is, when the sum of all the errors (the positive error from every neuron, summed over every pattern) is below a threshold. This threshold is usually set by the user to be some arbitrary low number, like 0.1. In the example above the total error of the network would be:

(Errors of all neurons in pattern 1) + (Pattern 2 errors) + (Pattern 3 errors) + (Pattern 4 errors)
Before doing this, it is necessary to make all the errors positive — we can do this by squaring them. The learning process is shown in the algorithm below:
1. Apply first pattern, perform forward pass, perform reverse pass.
2. Apply second pattern, perform forward pass, perform reverse pass.
3. Apply third pattern, perform forward pass, perform reverse pass.
4. Apply fourth pattern, perform forward pass, perform reverse pass.
5. Test: is the total error small enough? If yes, then go to 7.
6. Go to 1.
7. Stop, network has trained.
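As a concrete illustration of this cycle, here is a short sketch in Python (mine, not the article's code). forward_pass(inputs) and reverse_pass(inputs, targets, outputs) stand for your own routines; reverse_pass is assumed to change all the weights once and return the output errors.

def train_until_done(training_pairs, forward_pass, reverse_pass, threshold=0.1):
    while True:
        total_error = 0.0
        for inputs, targets in training_pairs:               # every pattern, once per cycle
            outputs = forward_pass(inputs)                   # forward pass
            errors = reverse_pass(inputs, targets, outputs)  # reverse pass, once only
            total_error += sum(e * e for e in errors)        # squared, so every term is positive
        if total_error < threshold:
            break                                            # total error low enough: stop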
Figure 4. All the calculations for a complete reverse pass in a network. The example network has two inputs (λ and Ω), three hidden neurons (A, B and C) and two output neurons (α and β).

1. Calculate errors of output neurons:
Error α = out α (1 – out α) (Target α – out α)
Error β = out β (1 – out β) (Target β – out β)

2. Change output layer weights:
W+ Aα = W Aα + η Error α out A
W+ Aβ = W Aβ + η Error β out A
W+ Bα = W Bα + η Error α out B
W+ Bβ = W Bβ + η Error β out B
W+ Cα = W Cα + η Error α out C
W+ Cβ = W Cβ + η Error β out C

3. Calculate (back-propagate) hidden layer errors:
Error A = out A (1 – out A) (Error α W Aα + Error β W Aβ)
Error B = out B (1 – out B) (Error α W Bα + Error β W Bβ)
Error C = out C (1 – out C) (Error α W Cα + Error β W Cβ)

4. Change hidden layer weights:
W+ λA = W λA + η Error A in λ
W+ ΩA = W ΩA + η Error A in Ω
W+ λB = W λB + η Error B in λ
W+ ΩB = W ΩB + η Error B in Ω
W+ λC = W λC + η Error C in λ
W+ ΩC = W ΩC + η Error C in Ω
A common mistake to make is running the program on pattern one until the error is low, then on pattern two and then on pattern three. If you do this, then the network will only learn the last pattern you've presented it with.

Once the network has learned, you can apply any of the inputs to it (just apply the input and run a forward pass with the trained weights) and it should recognise them. We can then use the network to recognise patterns in a real system.
A more accurate way to train the network is to use a validation set. This is similar to the set of patterns which you are training the network with — but with noise or other imperfections added. After the training set has been applied, the validation set is run through the network to check its performance (we don't use the validation set to change the network weights). When the net has fully trained, both the validation set and the training set will give a low error. If you're training the network too much, then the validation set error will increase, as shown in Figure 6.

Figure 6. Use of a validation set.
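One way to wire the validation check into training is sketched below. train_one_cycle and total_error are hypothetical helpers (apply every training pattern once and update the weights; run forward passes only and return the summed squared error), and the threshold and cycle limit are arbitrary choices of mine.

def train_with_validation(training_set, validation_set, train_one_cycle, total_error,
                          threshold=0.1, max_cycles=10000):
    for cycle in range(max_cycles):
        train_one_cycle(training_set)                      # weights change on the training set only
        training_error = total_error(training_set)
        validation_error = total_error(validation_set)     # checked, but never trained on
        if training_error < threshold and validation_error < threshold:
            return cycle                                   # both sets give a low error: fully trained
        # If validation_error starts rising while training_error keeps falling,
        # the network is being trained too much (see Figure 6).
    return max_cycles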
Algorithms in software

In Part 1 we discussed various ways of coding the network. One way was to store the weights in a three-dimensional array, with indexes denoting the layer number, the neuron number and the connection number. A suitable algorithm for a Back Propagation reverse pass in such a network might be:

1. Initialise all unused weights, targets, errors and outputs to zero.
2. Calculate output errors, see Listing 1, first part.
3. Change weights, see Listing 1, second part.
4. Calculate error of hidden layers, see Listing 1, third part.

Listing 1

FOR x = first_output_neuron TO final_output_neuron_number
  E(output_layer, x) = O(output_layer, x) * (1 - O(output_layer, x)) * (T(output_layer, x) - O(output_layer, x))
NEXT x

FOR L = number_of_layers TO 1 STEP -1

  FOR n = 1 TO max_number_of_neurons
    FOR c = 1 TO max_number_of_weights
      W(L, n, c) = W(L, n, c) + E(L + 1, n) * O(L, c)
    NEXT c
  NEXT n

  FOR n = 1 TO max_number_of_neurons
    FOR c = 1 TO max_number_of_weights
      E(L, n) = E(L, n) + E(L + 1, c) * W(L, c, n)
    NEXT c
    E(L, n) = E(L, n) * O(L, n) * (1 - O(L, n))
  NEXT n

NEXT L

Where, in addition to the variables explained in Part 1 of this course, E(L, n) and T(L, n) are the errors and targets respectively of layer L, neuron n.

Putting it all together

Now that we have algorithms for both the forward and reverse pass of the network, we can put them together into a coherent whole. Given below is a suggestion, showing how this can be done:

1. Set up inputs and targets for the network (either in a file, or in arrays).
2. Randomise the weights being used.
3. Apply the first pattern, calculate the network output (forward pass) and error, and use the error to change the weights (reverse pass) — once only. Keep a note of the error.
4. Do the same for the second pattern. Add its error to the running total from pattern one.
5. Repeat for all subsequent patterns, keeping a running total of the error.
6. If the error is too great (network still not fully trained), zero the running total and go to 3, else go to 7.
7. The network is trained and ready to be used; either use it directly or store the trained weights in a file for future use.
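The same scheme can be written out as a complete, runnable sketch in Python rather than the BASIC-style pseudocode above. The data layout here is my own choice (a list of weight layers, with weights[L][n][c] being the weight from neuron or input c of layer L into neuron n of layer L + 1), as are the function names and the example patterns; Part 1's exact array conventions are not reproduced.

import math
import random

def make_network(layer_sizes):
    # layer_sizes such as [4, 3, 2]: 4 inputs, 3 hidden neurons, 2 output neurons.
    # Weights start as small random numbers between -1 and +1.
    return [[[random.uniform(-1.0, 1.0) for _ in range(layer_sizes[L])]
             for _ in range(layer_sizes[L + 1])]
            for L in range(len(layer_sizes) - 1)]

def forward_pass(weights, inputs):
    # Returns the outputs of every layer (the inputs first), needed by the reverse pass.
    outputs = [list(inputs)]
    for layer in weights:
        prev = outputs[-1]
        outputs.append([1.0 / (1.0 + math.exp(-sum(w * o for w, o in zip(neuron, prev))))
                        for neuron in layer])
    return outputs

def reverse_pass(weights, outputs, targets, eta=1.0):
    # Output-layer errors: out * (1 - out) * (target - out).
    output_errors = [o * (1.0 - o) * (t - o) for o, t in zip(outputs[-1], targets)]
    errors = output_errors
    # Work back through the weight layers: change the weights, then back-propagate the errors.
    for L in range(len(weights) - 1, -1, -1):
        below = outputs[L]
        for n, neuron in enumerate(weights[L]):
            for c in range(len(neuron)):
                neuron[c] += eta * errors[n] * below[c]
        errors = [below[c] * (1.0 - below[c]) *
                  sum(errors[n] * weights[L][n][c] for n in range(len(weights[L])))
                  for c in range(len(below))]
    return output_errors

def train(weights, training_pairs, threshold=0.1, max_cycles=10000):
    # Apply every pattern once per cycle, keep a running squared-error total,
    # and stop once that total is small enough.
    for cycle in range(max_cycles):
        total = 0.0
        for inputs, targets in training_pairs:
            outputs = forward_pass(weights, inputs)
            errs = reverse_pass(weights, outputs, targets)
            total += sum(e * e for e in errs)
        if total < threshold:
            return cycle
    return max_cycles

# Usage: train a tiny network on three invented 4-pixel patterns and show its
# response to the first one.
pairs = [([0, 1, 1, 0], [0, 1]), ([1, 0, 0, 1], [1, 0]), ([1, 1, 0, 0], [1, 1])]
net = make_network([4, 3, 2])
train(net, pairs)
print([round(o, 2) for o in forward_pass(net, [0, 1, 1, 0])[-1]])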
Problems and additions

Although BP is a very useful and simple algorithm, it does have some problems and limitations. Let's start with its limitations.

BP is excellent for the sort of simple pattern recognition and mapping tasks explained above and in the first article. However, it only works well when the image it is to recognise is the correct size and placed in a central position on the grid. It's no good at, say, recognising a face in a crowd — unless you can centre the face or make the network 'scan' the picture until it falls onto the face (and even then you still have to make the face the correct size). In other words, many problems need to be 'pre-processed' before being presented to the network.

So these networks need to operate in a controlled environment, which means that applications such as Optical Character Recognition (OCR) are more suitable. They have problems dealing with the crowded and confusing real world.

Incidentally, the human brain solves this problem by first identifying 'features' in an image, for example horizontal or vertical lines, and then integrating these progressively into a whole image in a layered structure. So if you can identify a horizontal line along the top of an image and a vertical line down the middle, you can integrate these to find the letter T. This approach is more tolerant because these features (the two lines) are always present in a T, no matter where it's placed in the image or what size it is.

When running your network, you may run into problems with its training. The most common is known as 'local minima'. This occurs because the algorithm always follows the error downwards (it can't make a change of weights which causes the error to increase). But sometimes, as part of a downwards trend, the error must go up, as shown in Figure 7. In this case the training gets stuck and the weights can't move out of the local minimum.

Figure 7. Local minima: network error plotted against a weight value. The global minimum is the lowest error — the weight value you really want to find.

This problem doesn't really affect small networks, but becomes a problem as the network size increases. One solution is to add 'momentum' to the network. This involves allowing the change of weight to continue for some time in a particular direction, as shown below:

New_weight = Old_weight + weight_change + Weight_change_from_previous_iteration
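In code, the momentum rule might look like the sketch below (mine, not the article's). Each weight then needs a second stored value alongside it: the change that was applied to it on the previous iteration.

def momentum_update(old_weight, weight_change, weight_change_from_previous_iteration):
    # New_weight = Old_weight + weight_change + Weight_change_from_previous_iteration
    new_weight = old_weight + weight_change + weight_change_from_previous_iteration
    return new_weight, weight_change   # remember this change for the next iteration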
However, a simpler way to overcome this problem (and several others which affect training) is simply to monitor the training progress of the network and, if the error gets 'stuck' (does not decrease for some time), reset the initial weights of the network to different random values and start training again.
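This monitor-and-restart idea can be sketched as below. make_network and run_one_cycle stand for your own routines (create a freshly randomised network; run one training cycle over all the patterns and return the squared-error total), and the 'patience' of 200 cycles and the cycle limit are arbitrary choices of mine.

def train_with_restarts(make_network, run_one_cycle, layer_sizes, training_pairs,
                        threshold=0.1, patience=200, max_cycles=100000):
    net = make_network(layer_sizes)
    best_error, cycles_stuck = float("inf"), 0
    for _ in range(max_cycles):
        total = run_one_cycle(net, training_pairs)
        if total < threshold:
            return net                                   # trained
        if total < best_error:
            best_error, cycles_stuck = total, 0          # error still falling
        else:
            cycles_stuck += 1
            if cycles_stuck >= patience:                 # stuck: new random weights, start again
                net = make_network(layer_sizes)
                best_error, cycles_stuck = float("inf"), 0
    return net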
In next month’s instalment of this course,
we’ll have a look at networks which have
recurrent connections including the famous
‘Hopfield’ network.
(020324-2)