TIME SERIES FORECASTING WITH FEED-FORWARD NEURAL NETWORKS:
GUIDELINES AND LIMITATIONS
by
Eric A. Plummer
A thesis submitted to the Department of Computer Science
and The Graduate School of The University of Wyoming
in partial fulfillment of the requirements
for the degree of
MASTER OF SCIENCE
in
COMPUTER SCIENCE
Laramie, Wyoming
July, 2000
Thank you to
my parents, Bill and Fran Plummer, for their constant encouragement and
support, Karl Branting for guidance, and Megan, my fiancée, for everything
else.
© 2000 by Eric A. Plummer
Table of Contents
1.2.2 Feed-Forward Neural Networks
1.2.3 Backpropagation Training
1.2.4 Data Series Partitioning
3 Application Level - Forecaster
4.2 Neural Network Parameters and Procedure
4.3 K-Nearest-Neighbor Parameters and Procedure
4.4 Artificial Data Series Results
4.4.1 Heuristically Trained Neural Networks with Thirty-Five Inputs
4.4.2 Simply Trained Neural Networks with Thirty-Five Inputs
4.4.3 Heuristically Trained Neural Networks with Twenty-Five Inputs
4.5 Real-World Data Series Results
Appendix A: Class-level description of Forecaster
Time series forecasting, or time series prediction, takes an existing series of data $x_{t-n}, \ldots, x_{t-2}, x_{t-1}, x_t$ and forecasts the future data values $x_{t+1}, x_{t+2}, \ldots$. The goal is to observe or model the existing data series so that future, unknown data values can be forecast accurately. Examples of data series include financial data series (stocks, indices, rates, etc.), physically observed data series (sunspots, weather, etc.), and mathematical data series (Fibonacci sequence, integrals of differential equations, etc.). The phrase “time series” generically refers to any data series, whether or not the data are dependent on a certain time increment.
Throughout the literature, many techniques have been implemented to perform time series forecasting. This paper will focus on two techniques: neural networks and k-nearest-neighbor. This paper will attempt to fill a gap in the abundant neural network time series forecasting literature, where testing arbitrary neural networks on arbitrarily complex data series is common, but not very enlightening. This paper thoroughly analyzes the responses of specific neural network configurations to artificial data series, where each data series has a specific characteristic. A better understanding of what causes the basic neural network to become an inadequate forecasting technique will be gained. In addition, the influence of data preprocessing will be noted. The forecasting performance of k-nearest-neighbor, which is a much simpler forecasting technique, will be compared to the neural networks’ performance. Finally, both techniques will be used to forecast a real data series.
Difficulties inherent in time series forecasting and the importance of time series forecasting are presented next. Then, neural networks and k-nearest-neighbor are detailed. Section 2 presents related work. Section 3 gives an application level description of the test-bed application, and Section 4 presents an empirical evaluation of the results obtained with the application.
Several difficulties can arise when performing time series forecasting. Depending on the type of data series, a particular difficulty may or may not exist. A first difficulty is a limited quantity of data. With data series that are observed, limited data may be the foremost difficulty. For example, given a company’s stock that has been publicly traded for only one year, very little data is available for use by the forecasting technique.
A second difficulty is noise. Two types of noisy data are (1) erroneous data points and (2) components that obscure the underlying form of the data series. Two examples of erroneous data are measurement errors and a change in measurement methods or metrics. In this paper, we will not be concerned with erroneous data points. An example of a component that obscures the underlying form of the data series is an additive high-frequency component. The technique used in this paper to reduce or remove this type of noise is the moving average. The data series $x_1, x_2, x_3, \ldots, x_n$ becomes $\frac{x_1 + x_2 + x_3}{3}, \frac{x_2 + x_3 + x_4}{3}, \ldots, \frac{x_{n-2} + x_{n-1} + x_n}{3}$ after taking a moving average with an interval $i$ of three. Taking a moving average reduces the number of data points in the series by $i - 1$.
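The moving average can be illustrated with a minimal sketch. The function name `moving_average` and the sample values are hypothetical and chosen only for illustration; the sketch assumes a simple, unweighted average over each run of consecutive points.

```python
def moving_average(series, interval=3):
    """Return the simple (unweighted) moving average of `series`.

    Each output value is the mean of `interval` consecutive input values,
    so the result has interval - 1 fewer points than the input.
    """
    if interval < 1 or interval > len(series):
        raise ValueError("interval must be between 1 and len(series)")
    return [
        sum(series[start:start + interval]) / interval
        for start in range(len(series) - interval + 1)
    ]


# Hypothetical series: a rising trend with an alternating high-frequency component.
noisy = [1.0, 3.0, 2.0, 4.0, 3.0, 5.0]
smoothed = moving_average(noisy, interval=3)
print(smoothed)                    # [2.0, 3.0, 3.0, 4.0]
print(len(noisy) - len(smoothed))  # 2, i.e. interval - 1
```

The alternating component is damped while the upward trend survives, at the cost of two data points for an interval of three.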
A third difficulty is nonstationarity: data that do not have the same statistical properties (e.g., mean and variance) at each point in time. A simple example of a nonstationary series is the Fibonacci sequence: at every step the sequence takes on a new, higher mean value. The technique used in this paper to make a series stationary in the mean is first-differencing. The data series $x_1, x_2, x_3, \ldots, x_n$ becomes $x_2 - x_1, x_3 - x_2, \ldots, x_n - x_{n-1}$ after taking the first-difference. This usually makes a data series stationary in the mean. If not, the second-difference of the series can be taken. Taking the first-difference reduces the number of data points in the series by one.
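First-differencing can be sketched in a few lines. The function name `first_difference` and both sample series are hypothetical; they are chosen so that one series becomes level after a single difference while the other needs the second-difference.

```python
def first_difference(series):
    """Return the differences between consecutive values of `series`."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


# Hypothetical series with a roughly linear trend: one difference levels the mean.
trending = [2, 5, 7, 10, 12, 15]
print(first_difference(trending))        # [3, 2, 3, 2, 3]

# Hypothetical series with a quadratic trend: the first-difference still rises,
# so the second-difference (the difference of the difference) is taken.
squares = [1, 4, 9, 16, 25]
once = first_difference(squares)
twice = first_difference(once)
print(once)                              # [3, 5, 7, 9]
print(twice)                             # [2, 2, 2]
```

Each application of `first_difference` shortens the series by one point, so the second-difference shortens it by two.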
A fourth difficulty is forecasting technique selection. From statistics to artificial intelligence, there are myriad choices of techniques. One of the simplest techniques is to search a data series for similar past events and use the matches to make a forecast. One of the most complex techniques is to train a model on the series and use the model to make a forecast. K-nearest-neighbor and neural networks are examples of the first and second techniques, respectively.
Time series forecasting has several important applications. One application is preventing undesirable events by forecasting the event, identifying the circumstances preceding the event, and taking corrective action so the event can be avoided. At the time of this writing, the Federal Reserve Committee is actively raising interest rates to head off a possible inflationary economic period. The Committee possibly uses time series forecasting with many data series to forecast the inflationary period and then acts to alter the future values of the data series.
Another application is forecasting undesirable, yet unavoidable, events to preemptively lessen their impact. At the time of this writing, the sun’s cycle of storms, called solar maximum, is of concern because the storms cause technological disruptions on Earth. The sunspots data series, which counts dark patches on the sun and is related to the solar storms, shows an eleven-year cycle of solar maximum activity and, if accurately modeled, can be used to forecast the severity of future activity. While solar activity is unavoidable, its impact can be lessened with appropriate forecasting and proactive action.
Finally, many people, primarily in the financial markets, would like to profit from time series forecasting. Whether this is viable is most likely a never-to-be-resolved question. Nevertheless, many products are available for financial forecasting.
A neural network is a computational model that is loosely based on the neuron cell structure of the biological nervous system. Given a training set of data, the neural network can learn the data with a learning algorithm; in this research, the most common algorithm, backpropagation, is used. Through backpropagation, the neural network forms a mapping between inputs and desired outputs from the training set by altering weighted connections within the network.
A brief history of neural networks follows[1]. The origin of neural networks dates back to the 1940s. McCulloch and Pitts (1943) and Hebb (1949) researched networks of simple computing devices that could model neurological activity and learning within these networks, respectively. Later, the work of Rosenblatt (1962) focused on computational ability in perceptrons, or single-layer feed-forward networks. Proofs showing that perceptrons, trained with the Perceptron Rule on linearly separable pattern class data, could correctly separate the classes generated excitement among researchers and practitioners.
This excitement waned with the discouraging analysis of perceptrons presented by Minsky and Papert (1969). The analysis pointed out that perceptrons could not learn the class of linearly inseparable functions. It also stated that the limitations could be resolved if networks contained more than one layer, but that no effective training algorithm for multi-layer networks was available. Rumelhart, Hinton, and Williams (1986) revived interest in neural networks by introducing the generalized delta rule for learning by backpropagation, which is today the most commonly used training algorithm for multi-layer networks.
More complex network types, alternative training algorithms involving network growth and pruning, and an increasing number of application areas characterize the state-of-the-art in neural networks. But no advancement beyond feed-forward neural networks trained with backpropagation has revolutionized the field. Therefore, much work remains.
Figure 1.1 depicts an example feed-forward neural network. A neural network can have any number of layers, units per layer, network inputs, and network outputs. This network has four units in the first layer (layer A) and three units in the second layer (layer B), which are called hidden layers. This network has one unit in the third layer (layer C), which is called the output layer. Finally, this network has four network inputs and one network output. Some texts consider the network inputs to be an additional layer, the input layer, but since the network inputs do not implement any of the functionality of a unit, the network inputs will not be considered a layer in this discussion.
If a unit is in the first layer, it has the same number of inputs as there are network inputs; if a unit is in succeeding layers, it has the same number of inputs as the number of units in the preceding layer. Each network-input-to-unit and unit-to-unit connection (the lines in Figure 1.1) is modified by a weight. In addition, each unit has an extra input that is assumed to have a constant value of one. The weight that modifies this extra input is called the bias. All data propagate along the connections in the direction from the network inputs to the network outputs, hence the term feed-forward. Figure 1.2 shows an example unit with its weights and bias and with all other network connections omitted for clarity.
In this section and the next, subscripts c, p, and n will identify units in the current layer, the previous layer, and the next layer, respectively. When the network is run, each hidden layer unit performs the calculation in Equation 1.1 on its inputs and transfers the result ($O_c$) to the next layer of units.
Equation 1.1 Activation function of a hidden layer unit.
$$O_c = h_{Hidden}\left(\sum_{p=1}^{P} i_{c,p}\, w_{c,p} + b_c\right), \quad \text{where } h_{Hidden}(x) = \frac{1}{1 + e^{-x}}$$
$O_c$ is the output of the current hidden layer unit c, $P$ is either the number of units in the previous hidden layer or the number of network inputs, $i_{c,p}$ is an input to unit c from either the previous hidden layer unit p or network input p, $w_{c,p}$ is the weight modifying the connection from either unit p to unit c or from input p to unit c, and $b_c$ is the bias.
In Equation 1.1, $h_{Hidden}(x)$ is the sigmoid activation function of the unit and is charted in Figure 1.3. Other types of activation functions exist, but the sigmoid was implemented for this research. To avoid saturating the activation function, which makes training the network difficult, the training data must be scaled appropriately. Similarly, before training, the weights and biases are initialized to appropriately scaled values.
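The hidden unit calculation and the effect of saturation can be illustrated with a minimal sketch. It assumes the standard logistic sigmoid, whose slope at an output $O_c$ is $O_c(1 - O_c)$; the weights, bias, and input values are hypothetical and are not taken from this research.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, written so that large |x| cannot overflow math.exp."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def hidden_unit_output(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias, passed through the sigmoid."""
    weighted_sum = sum(i * w for i, w in zip(inputs, weights)) + bias
    return sigmoid(weighted_sum)


# Hypothetical unit with four inputs.
weights = [0.5, -0.25, 0.1, 0.3]
bias = 0.05

scaled = [0.2, 0.4, 0.6, 0.8]            # inputs scaled to a small range
unscaled = [200.0, 400.0, 600.0, 800.0]  # the same pattern, left unscaled

for name, inputs in (("scaled", scaled), ("unscaled", unscaled)):
    output = hidden_unit_output(inputs, weights, bias)
    slope = output * (1.0 - output)      # sigmoid derivative at this output
    print(name, output, slope)
```

With the unscaled inputs the unit output is pinned at one and the slope is effectively zero, so an error signal passing back through the unit during training would vanish. That is the sense in which saturation makes training difficult.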
Each output layer unit performs the calculation in Equation 1.2 on its inputs and transfers the result ($O_c$) to a network output.
Equation 1.2 Activation function of an output layer unit.
$$O_c = h_{Output}\left(\sum_{p=1}^{P} i_{c,p}\, w_{c,p} + b_c\right), \quad \text{where } h_{Output}(x) = x$$
$O_c$ is the output of the current output layer unit c, $P$ is the number of units in the previous hidden layer, $i_{c,p}$ is an input to unit c from the previous hidden layer unit p, $w_{c,p}$ is the weight modifying the connection from unit p to unit c, and $b_c$ is the bias. For this research, $h_{Output}(x)$ is a linear activation function[2].
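Running the whole network amounts to applying the hidden and output unit calculations layer by layer. The sketch below does this for a network shaped like the one in Figure 1.1 (four network inputs, hidden layers of four and three units, one output unit). All function names are hypothetical, and the random initialization range of -0.5 to 0.5 is an assumption, not the scaling used in this research.

```python
import math
import random


def sigmoid(x):
    """Logistic sigmoid, written so that large |x| cannot overflow math.exp."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def linear(x):
    """Linear activation used by the output layer."""
    return x


def make_layer(n_units, n_inputs, rng):
    """Return (weights, biases) with one weight row and one bias per unit."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
               for _ in range(n_units)]
    biases = [rng.uniform(-0.5, 0.5) for _ in range(n_units)]
    return weights, biases


def run_layer(inputs, weights, biases, activation):
    """Compute the output of every unit in one layer."""
    return [
        activation(sum(i * w for i, w in zip(inputs, unit_weights)) + bias)
        for unit_weights, bias in zip(weights, biases)
    ]


def run_network(layers, network_inputs):
    """Feed values forward: sigmoid in hidden layers, linear in the last layer."""
    values = list(network_inputs)
    for depth, (weights, biases) in enumerate(layers):
        activation = linear if depth == len(layers) - 1 else sigmoid
        values = run_layer(values, weights, biases, activation)
    return values


rng = random.Random(0)           # fixed seed so repeated runs agree
layers = [
    make_layer(4, 4, rng),       # layer A: four units, four network inputs
    make_layer(3, 4, rng),       # layer B: three units fed by layer A
    make_layer(1, 3, rng),       # layer C: one output unit fed by layer B
]
print(run_network(layers, [0.1, 0.2, 0.3, 0.4]))   # a list with one output
```

Because each layer only consumes the outputs of the layer before it, the loop in `run_network` is the whole of the feed-forward computation.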
To make meaningful forecasts, the neural network has to be trained on an appropriate data series. Examples in the form of <input, output> pairs are extracted from the data series, where input and output are vectors equal in size to the number of network inputs and outputs, respectively. Then, for every example, backpropagation training[3] consists of three steps:
1. Present an example’s input vector to the network inputs and run the network: compute activation functions sequentially forward from the first hidden layer to the output layer (referencing Figure 1.1, from layer A to layer C).
2. Compute the difference between the desired output for that example, output, and the actual network output (output of unit(s) in the output layer). Propagate the error sequentially backward from the output layer to the first hidden layer (referencing Figure 1.1, from layer C to layer A).
3. For every connection, change the weight modifying that connection in proportion to the error.
When these three steps have been performed for every example from the data series, one epoch has occurred. Training usually lasts thousands of epochs, continuing until either a predetermined maximum number of epochs (the epochs limit) is reached or the network output error falls below an acceptable threshold (the error limit). Training can be time-consuming, depending on the network size, number of examples, epochs limit, and error limit.
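The extraction of <input, output> examples from a data series can be sketched as follows. The sketch assumes that examples are taken with a window that slides one point at a time, with the input vector immediately followed by the output vector; the function name `extract_examples` and the window scheme are assumptions for illustration rather than a description of the procedure used in this research.

```python
def extract_examples(series, n_inputs, n_outputs=1):
    """Slide a window over `series` and return (input, output) vector pairs.

    Assumption: each example takes n_inputs consecutive values as the input
    vector and the n_outputs values that follow them as the output vector.
    """
    window = n_inputs + n_outputs
    examples = []
    for start in range(len(series) - window + 1):
        input_vector = series[start:start + n_inputs]
        output_vector = series[start + n_inputs:start + window]
        examples.append((input_vector, output_vector))
    return examples


# Hypothetical six-point series, four network inputs, one network output.
series = [0, 1, 2, 3, 4, 5]
for input_vector, output_vector in extract_examples(series, n_inputs=4):
    print(input_vector, "->", output_vector)
# [0, 1, 2, 3] -> [4]
# [1, 2, 3, 4] -> [5]
```

Under this assumption a series of length L yields L - (inputs + outputs) + 1 examples, which is one reason a short data series limits training.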
Each of the three steps will now be detailed. In the first step, an input vector is presented to the network inputs. Then, for each layer starting with the first hidden layer, and for each unit in that layer, the output of the unit’s activation function (Equation 1.1 or Equation 1.2) is computed. Eventually, the network will propagate values through all units to the network output(s).
In the second step, for each layer starting with the output layer and for each unit in that layer, an error term is computed. For each unit in the output layer, the error term in Equation 1.3 is computed.
Equation 1.3 Error term for an output layer unit.
$$\delta_c = h'_{Output}(x)\,(D_c - O_c)$$
$D_c$ is the desired network output (from the output vector) corresponding to the current output layer unit, $O_c$ is the actual network output corresponding to the current output layer unit, and $h'_{Output}(x)$ is the derivative of the output unit linear activation function, i.e., 1. For each unit in the hidden layers, the error term in Equation 1.4 is computed.
Equation 1.4 Error term for a hidden layer unit.
$$\delta_c = h'_{Hidden}(x) \sum_{n=1}^{N} \delta_n\, w_{n,c}$$
$N$ is the number of units in the next layer (either another hidden layer or the output layer), $\delta_n$ is the error term for a unit in the next layer, and $w_{n,c}$ is the weight modifying the connection from unit c to unit n. The derivative of the hidden unit sigmoid activation function, $h'_{Hidden}(x)$, is $O_c(1 - O_c)$.
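The closed form of the sigmoid derivative can be checked with a short derivation. It assumes the standard logistic sigmoid for the hidden units and uses the fact that $O_c$ is the value of the sigmoid at the unit's weighted input sum $x$.

```latex
\begin{align*}
h_{Hidden}(x)  &= \frac{1}{1 + e^{-x}} \\
h'_{Hidden}(x) &= \frac{e^{-x}}{\left(1 + e^{-x}\right)^{2}}
                = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} \\
               &= h_{Hidden}(x)\,\bigl(1 - h_{Hidden}(x)\bigr)
                = O_c\,(1 - O_c)
\end{align*}
```

The derivative therefore needs no extra evaluation of the exponential during training, since $O_c$ is already available from the forward run of the network.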
In the third step, for each connection, Equation 1.5, which is the change in the weight modifying that connection, is computed and added to the weight.
Equation 1.5 Change in the weight modifying the connection from unit p or network input p to unit c.
$$\Delta w_{c,p} = \alpha\, \delta_c\, O_p$$
The weight modifying the connection from unit p or network input p to unit c is $w_{c,p}$, $\alpha$ is the learning rate (discussed later), and $O_p$ is the output of unit p or the network input p. Therefore, after step three, most, if not all, weights will have a different value. Changing weights after each example is presented to the network is called on-line training. Another option, which is not used in this research, is batch training, where changes are accumulated and applied only after the network has seen all examples.
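The three steps of on-line backpropagation training can be drawn together in one sketch. It is a minimal illustration under stated assumptions, not the implementation used in this research: all names are hypothetical, the initialization range and learning rate are arbitrary, the bias is updated as a weight on a constant input of one, and the sum of squared errors over an epoch stands in for the network output error.

```python
import math
import random


def sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def make_network(sizes, rng):
    """sizes = [inputs, hidden..., outputs]; a unit is its weights + [bias]."""
    return [[[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
             for _ in range(n_units)]
            for n_in, n_units in zip(sizes, sizes[1:])]


def forward(network, inputs):
    """Step 1: return the outputs of every layer, network inputs first."""
    outputs = [list(inputs)]
    for depth, layer in enumerate(network):
        is_output_layer = depth == len(network) - 1
        layer_out = []
        for unit in layer:
            # zip stops before the bias, which is stored last in the unit.
            x = sum(w * o for w, o in zip(unit, outputs[-1])) + unit[-1]
            layer_out.append(x if is_output_layer else sigmoid(x))
        outputs.append(layer_out)
    return outputs


def train_example(network, inputs, desired, rate):
    """Run steps 1 to 3 for one example and return its squared error."""
    outputs = forward(network, inputs)
    # Step 2: error terms, output layer first (linear, so the derivative is 1).
    deltas = [[d - o for d, o in zip(desired, outputs[-1])]]
    for depth in range(len(network) - 2, -1, -1):
        next_layer, next_deltas = network[depth + 1], deltas[0]
        layer_deltas = []
        for c, o_c in enumerate(outputs[depth + 1]):
            back = sum(dn * unit[c] for dn, unit in zip(next_deltas, next_layer))
            layer_deltas.append(o_c * (1.0 - o_c) * back)
        deltas.insert(0, layer_deltas)
    # Step 3: add rate * delta_c * O_p to each weight (bias input is one).
    for layer, layer_deltas, layer_in in zip(network, deltas, outputs):
        for unit, delta in zip(layer, layer_deltas):
            for p, o_p in enumerate(layer_in):
                unit[p] += rate * delta * o_p
            unit[-1] += rate * delta
    return sum((d - o) ** 2 for d, o in zip(desired, outputs[-1]))


def train(network, examples, rate=0.1, epochs_limit=2000, error_limit=1e-4):
    """On-line training: weights change after every example; stop on a limit."""
    epoch, total_error = 0, float("inf")
    for epoch in range(1, epochs_limit + 1):
        total_error = sum(train_example(network, i, d, rate) for i, d in examples)
        if total_error < error_limit:
            break
    return epoch, total_error


# Hypothetical ramp series: two inputs forecast the next value.
series = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
examples = [(series[t:t + 2], [series[t + 2]]) for t in range(len(series) - 2)]
network = make_network([2, 3, 1], random.Random(0))
epochs, error = train(network, examples)
print("epochs run:", epochs, "final epoch error:", error)
print("forecast after 0.7, 0.8:", forward(network, [0.7, 0.8])[-1])
```

All error terms are computed before any weight is changed, so each hidden unit's error term is based on the same weights that produced the output. Batch training would instead accumulate the changes in step three and apply them once per epoch.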
The goal of backpropagation training is to converge to a near-optimal solution based on the total squared error calculated in Equation 1.6.
Equation 1.