TIME SERIES FORECASTING WITH FEED-FORWARD NEURAL NETWORKS:

GUIDELINES AND LIMITATIONS

 

 

 

by

Eric A. Plummer

 

 

 

 

 

 

A thesis submitted to the Department of Computer Science

and The Graduate School of The University of Wyoming

in partial fulfillment of the requirements

 for the degree of

 

 

 

MASTER OF SCIENCE

in

COMPUTER SCIENCE

 

 

 

 

 

Laramie, Wyoming

July, 2000


 

 

 

 

Thank you to my parents, Bill and Fran Plummer, for their constant encouragement and support, Karl Branting for guidance, and Megan, my fiancée, for everything else.

© 2000 by Eric A. Plummer


Table of Contents

1     Introduction

1.1      Time Series Forecasting

1.1.1    Difficulties

1.1.2    Importance

1.2      Neural Networks

1.2.1    Background

1.2.2    Feed-Forward Neural Networks

1.2.3    Backpropagation Training

1.2.4    Data Series Partitioning

1.3      K-Nearest-Neighbor

2     Related Work

3     Application Level - Forecaster

3.1      Neural Networks

3.1.1    Data Parsing

3.1.2    Neural Network Files

3.1.3    The Wizard

3.1.4    Training

3.1.5    Forecasting

3.2      K-Nearest-Neighbor

4     Empirical Evaluation

4.1      Data Series

4.2      Neural Network Parameters and Procedure

4.2.1    Architectures

4.2.2    Training

4.2.3    Forecasting

4.2.4    Evaluation Specifics

4.3      K-Nearest-Neighbor Parameters and Procedure

4.4      Artificial Data Series Results

4.4.1    Heuristically Trained Neural Networks with Thirty-Five Inputs

4.4.2    Simply Trained Neural Networks with Thirty-Five Inputs

4.4.3    Heuristically Trained Neural Networks with Twenty-Five Inputs

4.4.4    K-Nearest-Neighbor

4.5      Real-World Data Series Results

5     Conclusion

6     References

Appendix A: Class-level description of Forecaster

 


1      Introduction

1.1           Time Series Forecasting

Time series forecasting, or time series prediction, takes an existing series of data $x_{t-n}, \ldots, x_{t-2}, x_{t-1}, x_t$ and forecasts the $x_{t+1}, x_{t+2}, \ldots$ data values.  The goal is to observe or model the existing data series to enable future unknown data values to be forecast accurately.  Examples of data series include financial data series (stocks, indices, rates, etc.), physically observed data series (sunspots, weather, etc.), and mathematical data series (Fibonacci sequence, integrals of differential equations, etc.).  The phrase “time series” generically refers to any data series, whether or not the data are dependent on a certain time increment.

Throughout the literature, many techniques have been implemented to perform time series forecasting.  This paper will focus on two techniques: neural networks and k-nearest-neighbor.  This paper will attempt to fill a gap in the abundant neural network time series forecasting literature, where testing arbitrary neural networks on arbitrarily complex data series is common, but not very enlightening.  This paper thoroughly analyzes the responses of specific neural network configurations to artificial data series, where each data series has a specific characteristic.  A better understanding of what causes the basic neural network to become an inadequate forecasting technique will be gained.  In addition, the influence of data preprocessing will be noted.  The forecasting performance of k-nearest-neighbor, which is a much simpler forecasting technique, will be compared to the neural networks’ performance. Finally, both techniques will be used to forecast a real data series.

Difficulties inherent in time series forecasting and the importance of time series forecasting are presented next.  Then, neural networks and k-nearest-neighbor are detailed.  Section 2 presents related work.  Section 3 gives an application level description of the test-bed application, and Section 4 presents an empirical evaluation of the results obtained with the application.

1.1.1     Difficulties

Several difficulties can arise when performing time series forecasting.  Depending on the type of data series, a particular difficulty may or may not exist.  A first difficulty is a limited quantity of data.  With data series that are observed, limited data may be the foremost difficulty.  For example, given a company’s stock that has been publicly traded for one year, only a very limited amount of data is available for use by the forecasting technique.

A second difficulty is noise.  Two types of noisy data are (1) erroneous data points and (2) components that obscure the underlying form of the data series.  Two examples of erroneous data are measurement errors and a change in measurement methods or metrics.  In this paper, we will not be concerned about erroneous data points.  An example of a component that obscures the underlying form of the data series is an additive high-frequency component.  The technique used in this paper to reduce or remove this type of noise is the moving average.  The data series $x_1, x_2, \ldots, x_n$ becomes $\frac{x_1 + x_2 + x_3}{3}, \frac{x_2 + x_3 + x_4}{3}, \ldots, \frac{x_{n-2} + x_{n-1} + x_n}{3}$ after taking a moving average with an interval i of three.  Taking a moving average reduces the number of data points in the series by $i - 1$.
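The moving average can be sketched in a few lines of Python.  This is only an illustration, not the thesis's implementation: the function name `moving_average` is hypothetical, and the sketch assumes each smoothed value is the plain mean of i consecutive points.

```python
def moving_average(series, interval=3):
    """Return the simple moving average of `series` over `interval` points.

    Assumption: each output value is the unweighted mean of `interval`
    consecutive input values, so the result is interval - 1 points shorter.
    """
    if interval < 1 or interval > len(series):
        raise ValueError("interval must be between 1 and len(series)")
    return [
        sum(series[k:k + interval]) / interval
        for k in range(len(series) - interval + 1)
    ]


series = [2.0, 4.0, 6.0, 8.0, 10.0]
smoothed = moving_average(series, interval=3)
print(smoothed)                      # [4.0, 6.0, 8.0]
print(len(series) - len(smoothed))   # 2 points lost, i.e. interval - 1
```

A larger interval removes more of the high-frequency component but also discards more data points, which matters when the series is already short.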

A third difficulty is nonstationarity: data that do not have the same statistical properties (e.g., mean and variance) at each point in time.  A simple example of a nonstationary series is the Fibonacci sequence: at every step the sequence takes on a new, higher mean value.  The technique used in this paper to make a series stationary in the mean is first-differencing.  The data series $x_1, x_2, \ldots, x_n$ becomes $x_2 - x_1, x_3 - x_2, \ldots, x_n - x_{n-1}$ after taking the first-difference.  This usually makes a data series stationary in the mean.  If not, the second-difference of the series can be taken.  Taking the first-difference reduces the number of data points in the series by one.
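As a small illustration of first-differencing, the following Python sketch uses a made-up series with an upward trend; the function name `first_difference` and the sample values are assumptions for this example only.

```python
def first_difference(series):
    """Return the differences between consecutive values of `series`.

    The result has one fewer data point than the input.
    """
    return [later - earlier for earlier, later in zip(series, series[1:])]


# Hypothetical series: a steady upward trend with a small oscillation.
trend = [1, 4, 5, 8, 9, 12]
diffs = first_difference(trend)
print(diffs)                     # [3, 1, 3, 1, 3]

# The second-difference is the first-difference applied twice.
print(first_difference(diffs))   # [-2, 2, -2, 2]
```

The original values climb steadily, while the differenced values oscillate around a constant level, which is the sense in which the series becomes stationary in the mean.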

A fourth difficulty is forecasting technique selection.  From statistics to artificial intelligence, there are myriad choices of techniques.  One of the simplest techniques is to search a data series for similar past events and use the matches to make a forecast.  One of the most complex techniques is to train a model on the series and use the model to make a forecast.  K-nearest-neighbor and neural networks are examples of the first and second techniques, respectively.

1.1.2     Importance

Time series forecasting has several important applications.  One application is preventing undesirable events by forecasting the event, identifying the circumstances preceding the event, and taking corrective action so the event can be avoided.  At the time of this writing, the Federal Reserve Committee is actively raising interest rates to head off a possible inflationary economic period.  The Committee possibly uses time series forecasting with many data series to forecast the inflationary period and then acts to alter the future values of the data series.

Another application is forecasting undesirable, yet unavoidable, events to preemptively lessen their impact.  At the time of this writing, the sun’s cycle of storms, called solar maximum, is of concern because the storms cause technological disruptions on Earth.  The sunspots data series, which counts dark patches on the sun and is related to the solar storms, shows an eleven-year cycle of solar maximum activity and, if accurately modeled, can forecast the severity of future activity.  While solar activity is unavoidable, its impact can be lessened with appropriate forecasting and proactive action.

Finally, many people, primarily in the financial markets, would like to profit from time series forecasting.  Whether this is viable is most likely a never-to-be-resolved question.  Nevertheless, many products are available for financial forecasting.

1.2           Neural Networks

1.2.1     Background

A neural network is a computational model that is loosely based on the neuron cell structure of the biological nervous system.  Given a training set of data, the neural network can learn the data with a learning algorithm; in this research, the most common algorithm, backpropagation, is used.  Through backpropagation, the neural network forms a mapping between inputs and desired outputs from the training set by altering weighted connections within the network.

A brief history of neural networks follows[1].  The origin of neural networks dates back to the 1940s.  McCulloch and Pitts (1943) and Hebb (1949) researched networks of simple computing devices that could model neurological activity and learning within these networks, respectively.  Later, the work of Rosenblatt (1962) focused on computational ability in perceptrons, or single-layer feed-forward networks.  Proofs showing that perceptrons, trained with the Perceptron Rule on linearly separable pattern class data, could correctly separate the classes generated excitement among researchers and practitioners.

This excitement waned with the discouraging analysis of perceptrons presented by Minsky and Papert (1969).  The analysis pointed out that perceptrons could not learn the class of linearly inseparable functions.  It also stated that the limitations could be resolved if networks contained more than one layer, but that no effective training algorithm for multi-layer networks was available.  Rumelhart, Hinton, and Williams (1986) revived interest in neural networks by introducing the generalized delta rule for learning by backpropagation, which is today the most commonly used training algorithm for multi-layer networks.

More complex network types, alternative training algorithms involving network growth and pruning, and an increasing number of application areas characterize the state of the art in neural networks.  But no advancement beyond feed-forward neural networks trained with backpropagation has revolutionized the field.  Therefore, much work remains.

1.2.2     Feed-Forward Neural Networks

Figure 1.1 depicts an example feed-forward neural network.  A neural network can have any number of layers, units per layer, network inputs, and network outputs.  This network has four units in the first layer (layer A) and three units in the second layer (layer B), which are called hidden layers.  This network has one unit in the third layer (layer C), which is called the output layer.  Finally, this network has four network inputs and one network output.  Some texts consider the network inputs to be an additional layer, the input layer, but since the network inputs do not implement any of the functionality of a unit, the network inputs will not be considered a layer in this discussion.


Figure 1.1 A three-layer feed-forward neural network.


Figure 1.2 Unit with its weights and bias.

If a unit is in the first layer, it has the same number of inputs as there are network inputs; if a unit is in succeeding layers, it has the same number of inputs as the number of units in the preceding layer.  Each network-input-to-unit and unit-to-unit connection (the lines in Figure 1.1) is modified by a weight.  In addition, each unit has an extra input that is assumed to have a constant value of one.  The weight that modifies this extra input is called the bias.  All data propagate along the connections in the direction from the network inputs to the network outputs, hence the term feed-forward.  Figure 1.2 shows an example unit with its weights and bias and with all other network connections omitted for clarity.

In this section and the next, subscripts c, p, and n will identify units in the current layer, the previous layer, and the next layer, respectively.  When the network is run, each hidden layer unit performs the calculation in Equation 1.1 on its inputs and transfers the result ($O_c$) to the next layer of units.

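$$O_c = h_{Hidden}\left(\sum_{p=1}^{P} i_{c,p}\, w_{c,p} + b_c\right), \qquad h_{Hidden}(x) = \frac{1}{1 + e^{-x}}$$

Equation 1.1 Activation function of a hidden layer unit.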

$O_c$ is the output of the current hidden layer unit c, P is either the number of units in the previous hidden layer or the number of network inputs, $i_{c,p}$ is an input to unit c from either the previous hidden layer unit p or network input p, $w_{c,p}$ is the weight modifying the connection from either unit p to unit c or from input p to unit c, and $b_c$ is the bias.

In Equation 1.1, $h_{Hidden}(x)$ is the sigmoid activation function of the unit and is charted in Figure 1.3.  Other types of activation functions exist, but the sigmoid was implemented for this research.  To avoid saturating the activation function, which makes training the network difficult, the training data must be scaled appropriately.  Similarly, before training, the weights and biases are initialized to appropriately scaled values.


Figure 1.3 Sigmoid activation function.  Chart limits are x=±7 and y=-0.1, 1.1.
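The calculation performed by a single hidden layer unit can be sketched as follows.  The sketch assumes the standard logistic form of the sigmoid, 1/(1 + e^-x); the function names and the sample weights are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math


def sigmoid(x):
    """Standard logistic sigmoid (assumed form of the hidden activation)."""
    return 1.0 / (1.0 + math.exp(-x))


def hidden_unit_output(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias, passed through the sigmoid."""
    if len(inputs) != len(weights):
        raise ValueError("a unit needs exactly one weight per input")
    net = sum(i * w for i, w in zip(inputs, weights)) + bias
    return sigmoid(net)


# A first-layer unit of a network with four network inputs.
# Weighted sum: 0.5 - 0.5 + 0.0 - 0.5, plus bias 0.5, gives 0.0.
out = hidden_unit_output([1.0, 2.0, 0.0, -1.0], [0.5, -0.25, 0.75, 0.5], bias=0.5)
print(out)            # 0.5, the midpoint of the sigmoid

# Saturation: large sums push the output toward 1 (or 0), where the
# curve is nearly flat, which is why inputs and weights must be scaled.
print(sigmoid(10.0))  # very close to 1
```

The second call shows the saturation concern: once the weighted sum is large in magnitude, changes in the inputs barely move the output.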

Each output layer unit performs the calculation in Equation 1.2 on its inputs and transfers the result ($O_c$) to a network output.

$$O_c = h_{Output}\left(\sum_{p=1}^{P} i_{c,p}\, w_{c,p} + b_c\right), \qquad h_{Output}(x) = x$$

Equation 1.2 Activation function of an output layer unit.

$O_c$ is the output of the current output layer unit c, P is the number of units in the previous hidden layer, $i_{c,p}$ is an input to unit c from the previous hidden layer unit p, $w_{c,p}$ is the weight modifying the connection from unit p to unit c, and $b_c$ is the bias.  For this research, $h_{Output}(x)$ is a linear activation function[2].
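Putting the hidden and output unit calculations together, a forward run of a network shaped like the one in Figure 1.1 (four network inputs, hidden layers of four and three units, one linear output unit) might look like the sketch below.  The helper names, the representation of a unit as a (weights, bias) pair, and the initialization range of ±0.5 are assumptions for illustration, not details taken from this research.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_layer(n_units, n_inputs, rng, scale=0.5):
    """Create a layer; each unit is a (weights, bias) pair.

    Weights and biases start as small random values in [-scale, scale];
    the range is an assumed choice.
    """
    return [
        ([rng.uniform(-scale, scale) for _ in range(n_inputs)],
         rng.uniform(-scale, scale))
        for _ in range(n_units)
    ]


def run_layer(layer, inputs, activation):
    """Compute the output of every unit in one layer."""
    return [
        activation(sum(i * w for i, w in zip(inputs, weights)) + bias)
        for weights, bias in layer
    ]


def run_network(layers, network_inputs):
    """Feed values forward: sigmoid hidden layers, then a linear output layer."""
    values = network_inputs
    for layer in layers[:-1]:
        values = run_layer(layer, values, sigmoid)
    return run_layer(layers[-1], values, lambda x: x)


rng = random.Random(0)
layers = [
    make_layer(4, 4, rng),  # layer A: four units, four network inputs each
    make_layer(3, 4, rng),  # layer B: three units, four inputs each
    make_layer(1, 3, rng),  # layer C: one output unit, three inputs
]
outputs = run_network(layers, [0.1, 0.2, 0.3, 0.4])
print(outputs)  # a list with one value, the single network output
```

Because the output unit is linear, the network output is not confined to the (0, 1) range of the sigmoid.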

1.2.3     Backpropagation Training

To make meaningful forecasts, the neural network has to be trained on an appropriate data series.  Examples in the form of <input, output> pairs are extracted from the data series, where input and output are vectors equal in size to the number of network inputs and outputs, respectively.  Then, for every example, backpropagation training[3] consists of three steps:

1.     Present an example’s input vector to the network inputs and run the network: compute activation functions sequentially forward from the first hidden layer to the output layer (referencing Figure 1.1, from layer A to layer C).

2.     Compute the difference between the desired output for that example, output, and the actual network output (output of unit(s) in the output layer).  Propagate the error sequentially backward from the output layer to the first hidden layer (referencing Figure 1.1, from layer C to layer A).

3.     For every connection, change the weight modifying that connection in proportion to the error.

When these three steps have been performed for every example from the data series, one epoch has occurred.  Training usually lasts thousands of epochs, stopping when either a predetermined maximum number of epochs (the epochs limit) is reached or the network output error falls below an acceptable threshold (the error limit).  Training can be time-consuming, depending on the network size, number of examples, epochs limit, and error limit.
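To make the <input, output> examples concrete, the sketch below extracts them from a short series.  It assumes a window that slides forward one data point at a time, with the outputs taken from the values immediately following the inputs; the function name `extract_examples` and that windowing scheme are assumptions for illustration and may differ from the partitioning used in this research.

```python
def extract_examples(series, n_inputs, n_outputs=1):
    """Slide a window over `series` and return (input, output) vector pairs.

    Assumption: consecutive windows are offset by one data point, and each
    output vector holds the values that directly follow its input vector.
    """
    examples = []
    last_start = len(series) - n_inputs - n_outputs
    for start in range(last_start + 1):
        inputs = series[start:start + n_inputs]
        outputs = series[start + n_inputs:start + n_inputs + n_outputs]
        examples.append((inputs, outputs))
    return examples


examples = extract_examples([1, 2, 3, 4, 5, 6], n_inputs=4)
for inputs, outputs in examples:
    print(inputs, "->", outputs)
# [1, 2, 3, 4] -> [5]
# [2, 3, 4, 5] -> [6]
```

Under this scheme a series of length L yields L - (inputs + outputs) + 1 examples, so a network with many inputs leaves fewer examples to train on.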

Each of the three steps will now be detailed.  In the first step, an input vector is presented to the network inputs; then, for each layer starting with the first hidden layer and for each unit in that layer, the output of the unit’s activation function (Equation 1.1 or Equation 1.2) is computed.  In this way, the network propagates values through all units to the network output(s).

In the second step, for each layer starting with the output layer and for each unit in that layer, an error term is computed.  For each unit in the output layer, the error term in Equation 1.3 is computed.

$$\delta_c = h'_{Output}(x)\,(D_c - O_c)$$

Equation 1.3 Error term for an output layer unit.

$D_c$ is the desired network output (from the output vector) corresponding to the current output layer unit, $O_c$ is the actual network output corresponding to the current output layer unit, and $h'_{Output}(x)$ is the derivative of the output unit linear activation function, i.e. 1.  For each unit in the hidden layers, the error term in Equation 1.4 is computed.

$$\delta_c = h'_{Hidden}(x) \sum_{n=1}^{N} \delta_n\, w_{n,c}$$

Equation 1.4 Error term for a hidden layer unit.

N is the number of units in the next layer (either another hidden layer or the output layer), $\delta_n$ is the error term for a unit in the next layer, and $w_{n,c}$ is the weight modifying the connection from unit c to unit n.  The derivative of the hidden unit sigmoid activation function, $h'_{Hidden}(x)$, is $O_c(1 - O_c)$.
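The derivative of the sigmoid can be written entirely in terms of the unit's own output, which is why no extra evaluation of the activation function is needed during the backward pass.  The short derivation below assumes the standard logistic form of the sigmoid:

```latex
\begin{aligned}
h_{Hidden}(x)  &= \frac{1}{1 + e^{-x}} \\[4pt]
h'_{Hidden}(x) &= \frac{e^{-x}}{\left(1 + e^{-x}\right)^{2}}
                = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} \\[4pt]
               &= h_{Hidden}(x)\,\bigl(1 - h_{Hidden}(x)\bigr)
                = O_c\,(1 - O_c)
\end{aligned}
```

The product $O_c(1 - O_c)$ is largest (0.25) when $O_c = 0.5$ and approaches zero as $O_c$ approaches 0 or 1, so a saturated unit passes back almost no error.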

In the third step, for each connection, Equation 1.5, which is the change in the weight modifying that connection, is computed and added to the weight.

$$\Delta w_{c,p} = \alpha\, \delta_c\, O_p$$

Equation 1.5 Change in the weight modifying the connection from unit p or network input p to unit c.

The weight modifying the connection from unit p or network input p to unit c is $w_{c,p}$, $\alpha$ is the learning rate (discussed later), and $O_p$ is the output of unit p or the network input p.  Therefore, after step three, most, if not all, weights will have a different value.  Changing weights after each example is presented to the network is called on-line training.  Another option, which is not used in this research, is batch training, where changes are accumulated and applied only after the network has seen all examples.
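The three steps of on-line backpropagation can be sketched in Python as follows.  This is a minimal illustration under stated assumptions, not the implementation used in this research: it assumes logistic sigmoid hidden units and linear output units, uses the sum of squared output errors over an epoch as the stopping measure, and all names, the initialization range, the learning rate, and the toy data are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_network(n_inputs, layer_sizes, rng, scale=0.5):
    """Build layers of units; each unit holds weights 'w' and a bias 'b'.

    The initialization range (+/- scale) is an assumed choice.
    """
    layers, fan_in = [], n_inputs
    for size in layer_sizes:
        layers.append([
            {"w": [rng.uniform(-scale, scale) for _ in range(fan_in)],
             "b": rng.uniform(-scale, scale)}
            for _ in range(size)
        ])
        fan_in = size
    return layers


def forward(layers, inputs):
    """Step 1: run the network, keeping the outputs of every layer."""
    outs, values = [], inputs
    for depth, layer in enumerate(layers):
        nets = [sum(i * w for i, w in zip(values, u["w"])) + u["b"] for u in layer]
        is_output_layer = depth == len(layers) - 1
        values = nets if is_output_layer else [sigmoid(x) for x in nets]
        outs.append(values)
    return outs


def train_example(layers, inputs, desired, alpha):
    """Perform steps 1-3 for one example; return its squared output error."""
    outs = forward(layers, inputs)
    # Step 2: error terms, output layer first (linear derivative is 1).
    deltas = [[d - o for d, o in zip(desired, outs[-1])]]
    for depth in range(len(layers) - 2, -1, -1):
        next_layer, next_deltas = layers[depth + 1], deltas[0]
        layer_deltas = []
        for c, o_c in enumerate(outs[depth]):
            back = sum(dn * un["w"][c] for dn, un in zip(next_deltas, next_layer))
            layer_deltas.append(o_c * (1.0 - o_c) * back)  # sigmoid derivative
        deltas.insert(0, layer_deltas)
    # Step 3: on-line update, applied immediately after this example.
    for depth, layer in enumerate(layers):
        prev = inputs if depth == 0 else outs[depth - 1]
        for unit, delta in zip(layer, deltas[depth]):
            unit["w"] = [w + alpha * delta * o_p for w, o_p in zip(unit["w"], prev)]
            unit["b"] += alpha * delta  # the bias input is the constant 1
    return sum((d - o) ** 2 for d, o in zip(desired, outs[-1]))


def train(layers, examples, alpha, epochs_limit, error_limit):
    """Train until the epochs limit or the error limit is reached.

    Returns the per-epoch sum of squared errors (an assumed error measure).
    """
    history = []
    for _ in range(epochs_limit):
        history.append(sum(train_example(layers, x, d, alpha) for x, d in examples))
        if history[-1] < error_limit:
            break
    return history


# Toy problem (hypothetical): learn 0.5*a + 0.25*b on a small grid of inputs.
rng = random.Random(0)
network = make_network(n_inputs=2, layer_sizes=[3, 1], rng=rng)
grid = [0.0, 0.5, 1.0]
examples = [([a, b], [0.5 * a + 0.25 * b]) for a in grid for b in grid]
history = train(network, examples, alpha=0.1, epochs_limit=2000, error_limit=1e-4)
print(len(history), history[0], history[-1])
```

All error terms are computed before any weight is changed, so the backward pass in step two always uses the weights that produced the forward pass in step one.  Converting the sketch to batch training would mean accumulating the changes across all examples in an epoch and applying them once at the end.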

The goal of backpropagation training is to converge to a near-optimal solution based on the total squared error calculated in Equation 1.6.

Equation 1.