MLP (Multi-Layer Perceptron) II
This note provides an interactive, visual simulation of a Multi-Layer Perceptron (MLP) with a 2-3-1 architecture (2 inputs, 3 hidden neurons, 1 output). It demonstrates how a hidden layer enables the network to solve non-linearly separable problems like XOR, which a single perceptron cannot solve. Compared to the 2-2-1 architecture in MLP I, the additional hidden neuron provides more capacity and makes training significantly more reliable with faster convergence.
The simulation shows the forward pass through all layers, displaying the activation values at each hidden neuron and the final output. It visualizes the network connections with color-coded weights (green for positive, red for negative) and thickness proportional to weight magnitude.
The training uses backpropagation with momentum to update weights in both the hidden and output layers. The 2-3-1 architecture provides more flexibility than the minimal 2-2-1 architecture, making XOR training much more reliable and less prone to getting stuck in local minima. In practice, you will observe that the network typically converges in far fewer iterations than the 2-2-1 architecture. Weights are initialized in a wider range, [-2.0, 2.0], to break symmetry and provide better starting points. You can watch the network learn to solve XOR and other logic gates by observing the iteration count, error decrease, and weight updates in real time.
NOTE: Refer to this note for the theoretical details.
Parameters
The following are short descriptions of each parameter.
- Activation Function: Selects the transfer function used in all neurons (hidden and output layers). Sigmoid and Tanh work well for XOR training. The Step function cannot be trained with backpropagation (its derivative is zero).
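To see why the Step function blocks training, here are the three activations and their derivatives as a minimal sketch (the simulation's own source is not shown; these are the standard definitions):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(y):
    # derivative expressed in terms of the output y = sigmoid(x)
    return y * (1.0 - y)

def tanh_deriv(y):
    # derivative expressed in terms of the output y = tanh(x)
    return 1.0 - y * y

def step(x):
    return 1.0 if x >= 0.0 else 0.0

def step_deriv(y):
    # zero everywhere (undefined at x = 0), so backpropagation
    # receives no gradient signal and the weights never move
    return 0.0
```

Because backpropagation multiplies errors by the activation derivative, the Step function's zero derivative kills every weight update, while Sigmoid and Tanh keep a usable gradient.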
- Gate Preset: Selects a logic gate problem to solve. XOR is the default to showcase the MLP's ability to solve non-linearly separable problems. Other gates (AND, OR, NAND, NOR) are also available.
- Network Weights: All 9 weights in the network can be manually adjusted using sliders:
- Hidden Layer 1: w11 (input1→hidden1), w21 (input2→hidden1)
- Hidden Layer 2: w12 (input1→hidden2), w22 (input2→hidden2)
- Hidden Layer 3: w13 (input1→hidden3), w23 (input2→hidden3)
- Output Layer: wh1 (hidden1→output), wh2 (hidden2→output), wh3 (hidden3→output)
Weights are initialized in the range [-2.0, 2.0] to help break symmetry and avoid flat regions during training.
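Using the slider names above, the forward pass can be sketched as follows. This is a minimal sketch: the bias names b1-b3 and bo are our assumptions, since the note only mentions that biases exist and are randomized.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x1, x2, w, b):
    """w holds the 9 slider weights; b holds the biases (names assumed)."""
    h1 = sigmoid(w["w11"] * x1 + w["w21"] * x2 + b["b1"])
    h2 = sigmoid(w["w12"] * x1 + w["w22"] * x2 + b["b2"])
    h3 = sigmoid(w["w13"] * x1 + w["w23"] * x2 + b["b3"])
    y = sigmoid(w["wh1"] * h1 + w["wh2"] * h2 + w["wh3"] * h3 + b["bo"])
    return y  # the binary decision is y >= 0.5
```

For example, with all weights and biases at zero, every hidden activation is sigmoid(0) = 0.5 and the output is also 0.5, which is exactly the symmetry that random initialization is meant to break.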
- Input x1, x2: Binary input values (0 or 1) controlled by checkboxes. These represent the two inputs to the network.
- Learning Rate: Controls the step size for weight updates during backpropagation. Higher values learn faster but may overshoot or oscillate. Lower values are more stable but slower. Default is 0.5. Can be adjusted during training.
- Momentum: Controls the momentum factor (0 to 0.99) used in momentum-based gradient descent. Higher values (closer to 0.99) maintain more velocity, helping the network escape local minima and flat regions. Lower values (closer to 0) behave more like standard gradient descent. Default is 0.9. Can be adjusted during training. Momentum adds "velocity" to weight updates, allowing the network to continue moving in a direction even when gradients are small. While the 2-3-1 architecture is more robust than 2-2-1, momentum still helps improve convergence reliability.
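The momentum update described above amounts to keeping a per-weight velocity. A minimal sketch (the variable names are ours, not necessarily the simulation's):

```python
def momentum_step(w, v, grad, lr=0.5, mu=0.9):
    # v carries a decaying sum of past updates, so the weight keeps
    # moving through flat regions even when the current gradient is tiny
    v = mu * v - lr * grad
    return w + v, v

# with mu = 0 this reduces to plain gradient descent: w -= lr * grad
```

Note that even after the gradient drops to zero, the velocity decays gradually (by the factor mu per step) rather than stopping at once, which is what lets the network coast across plateaus.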
- Train Update Speed (sec): Controls the delay between training steps. Smaller values update more frequently (faster visualization). Default is 0.1 seconds.
Buttons
The following are short descriptions of each button.
- Reset: Stops training (if running) and randomizes all weights and biases to new random values in the range [-2.0, 2.0]. Also resets inputs to [0, 0] and clears velocity (momentum) history.
- Randomize: Stops training (if running) and assigns random values to all weights and biases (range [-2.0, 2.0]), randomizes inputs to random binary values, and resets velocity history. Useful for trying different starting points when training gets stuck.
- Train: Starts backpropagation training with momentum using the selected gate's truth table. The network cycles through all input combinations one at a time, updating weights after each example using momentum-based gradient descent. The iteration counter shows the number of training steps completed. Training stops automatically when all decisions are correct (all 4 input combinations produce correct outputs), or can be stopped manually by clicking the button again.
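The training loop just described can be sketched end to end. This is our reconstruction under the note's stated settings (sigmoid units, squared error, online updates after each example, momentum 0.9, learning rate 0.5, initialization in [-2, 2]), not the simulation's actual code:

```python
import math
import random

XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_xor(lr=0.5, mu=0.9, max_iters=10000, seed=0):
    rng = random.Random(seed)
    # Wh[i][j]: input i -> hidden j; Wo[j]: hidden j -> output
    Wh = [[rng.uniform(-2.0, 2.0) for _ in range(3)] for _ in range(2)]
    bh = [rng.uniform(-2.0, 2.0) for _ in range(3)]
    Wo = [rng.uniform(-2.0, 2.0) for _ in range(3)]
    bo = rng.uniform(-2.0, 2.0)
    vWh = [[0.0] * 3 for _ in range(2)]
    vbh, vWo, vbo = [0.0] * 3, [0.0] * 3, 0.0

    def decide(x1, x2):
        h = [sigmoid(Wh[0][j] * x1 + Wh[1][j] * x2 + bh[j]) for j in range(3)]
        y = sigmoid(sum(Wo[j] * h[j] for j in range(3)) + bo)
        return 1 if y >= 0.5 else 0

    for it in range(max_iters):
        (x1, x2), t = XOR[it % 4]  # cycle through the truth table
        h = [sigmoid(Wh[0][j] * x1 + Wh[1][j] * x2 + bh[j]) for j in range(3)]
        y = sigmoid(sum(Wo[j] * h[j] for j in range(3)) + bo)
        dy = (y - t) * y * (1.0 - y)                       # output delta
        dh = [dy * Wo[j] * h[j] * (1.0 - h[j]) for j in range(3)]
        for j in range(3):                                  # momentum updates
            vWo[j] = mu * vWo[j] - lr * dy * h[j]
            Wo[j] += vWo[j]
            for i, xi in enumerate((x1, x2)):
                vWh[i][j] = mu * vWh[i][j] - lr * dh[j] * xi
                Wh[i][j] += vWh[i][j]
            vbh[j] = mu * vbh[j] - lr * dh[j]
            bh[j] += vbh[j]
        vbo = mu * vbo - lr * dy
        bo += vbo
        if all(decide(a, b) == tt for (a, b), tt in XOR):
            return it + 1  # iterations until all four decisions are correct
    return None
```

The early-stop condition mirrors the Train button: training halts as soon as all four input combinations produce the correct decision, rather than waiting for the error to reach an arbitrary threshold.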
- Test: Stops training (if running) and tests all possible input combinations ([0,0], [0,1], [1,0], [1,1]) with the current weights. Visually cycles through each combination with a delay. Shows a "Test PASS" popup if all combinations result in correct decisions (Decision = 1), or "Test FAIL" otherwise.
Tips on Implementation
In the 2-2-1 architecture (MLP I), XOR training often got stuck midway without making further progress: the network would reach an error of around 0.4-0.5 after hundreds or thousands of iterations and then plateau, unable to converge to a solution. The 2-3-1 architecture (this simulation) addresses these issues effectively. The following summarizes what we learned and how the problem was improved.
- XOR Problem Difficulty and Architecture Comparison: The XOR problem is non-linearly separable and requires a hidden layer. A 2-2-1 architecture (2 hidden neurons) is the theoretical minimum to solve XOR, but it's notoriously difficult for standard gradient descent due to flat regions in the error surface where gradients become tiny. The 2-3-1 architecture (3 hidden neurons) provides additional capacity, making training significantly more reliable. In practice, you will observe that the 2-3-1 architecture typically converges in far fewer iterations (often within 100-500 iterations) than the 2-2-1 architecture, which may require thousands of iterations or fail to converge at all.
- Momentum is Essential: Without momentum, the network often gets stuck at error values around 0.4-0.5 after many iterations. Momentum (0.9) helps the network maintain velocity through flat regions and escape local minima, making convergence much more reliable. Combined with the 2-3-1 architecture, momentum further accelerates convergence.
- Wider Initialization Range: Initializing weights in the range [-2.0, 2.0] instead of [-0.5, 0.5] helps break symmetry and places neurons in more active regions of the sigmoid curve. This prevents neurons from starting in saturated regions where gradients vanish.
- Online vs Batch Learning: The current implementation uses online/stochastic learning (one example at a time). This can be less stable than batch learning (processing all examples before updating), but it provides better visual feedback during training.
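The difference between the two schemes can be sketched with a hypothetical gradient callback (grad_fn here is illustrative, not part of the simulation):

```python
def online_epoch(examples, weights, grad_fn, lr=0.5):
    # update after every example (what the simulation does)
    for x, t in examples:
        g = grad_fn(weights, x, t)
        weights = [w - lr * gi for w, gi in zip(weights, g)]
    return weights

def batch_epoch(examples, weights, grad_fn, lr=0.5):
    # accumulate gradients over all examples, then update once
    total = [0.0] * len(weights)
    for x, t in examples:
        g = grad_fn(weights, x, t)
        total = [a + b for a, b in zip(total, g)]
    return [w - lr * gi for w, gi in zip(weights, total)]
```

With a constant gradient the two schemes coincide; they diverge when the gradient depends on the current weights, which is why online updates are noisier but give per-example visual feedback.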
- Learning Rate Sensitivity: The default learning rate of 0.5 works well with momentum. Without momentum, lower rates (0.1-0.2) are often needed, but training becomes slower. With momentum, higher rates are more stable.
- Activation Function Choice: Sigmoid and Tanh work well for XOR. The Step function cannot be used for training (its derivative is zero), but the output is thresholded at 0.5 for binary classification anyway.
- Velocity Reset: Velocities (momentum history) are reset when starting new training or randomizing weights. This prevents momentum from carrying over inappropriate velocity from previous training sessions.
- Convergence Indicators: Watch for the error dropping below 0.1 and all decisions becoming correct (Decision = 1 for all 4 input combinations). With the 2-3-1 architecture, convergence typically occurs much faster than with 2-2-1, often within 100-500 iterations. If training stalls (which is rare with this architecture), try randomizing weights to start from a different point in the weight space.
NOTE: The 2-3-1 architecture provides more capacity than the minimal 2-2-1 architecture: the additional hidden neuron creates a more favorable error surface with fewer flat regions and local minima. Combined with momentum and wider initialization, the network typically converges within 100-500 iterations, whereas the 2-2-1 architecture may require thousands of iterations or fail to converge at all. Convergence failures are rare with this architecture; if training does fail to converge after many iterations, click "Randomize" to restart with different initial weights.