🧠 CNN Algorithm Visualization

Interactive demonstration of Convolutional Neural Network processing MNIST digits

📝 Input (28×28)

MNIST handwritten digit, grayscale with pixel values 0–255.

🔍 Convolution Process

🎯 Prediction

🏗️ CNN Architecture

Input (28×28×1) → Conv1 (26×26×32) → Pool1 (13×13×32) → Conv2 (11×11×64) → Pool2 (5×5×64) → Flatten (1600×1) → Dense (128×1) → Output (10×1)

Detailed Theory of Convolutional Neural Networks (CNNs)

Convolutional Neural Networks (CNNs) are a class of deep learning models primarily used for analyzing visual imagery. They are inspired by the organization of the animal visual cortex and are designed to automatically and adaptively learn spatial hierarchies of features from input data.

1. Convolution Operation

The core building block of a CNN is the convolutional layer. It performs a convolution operation on the input, passing a "filter" (or "kernel") over the input to produce a "feature map." This process helps in extracting features like edges, textures, and patterns.

For a 2D input image $I$ and a 2D kernel $K$, the convolution $(I * K)$ at position $(i, j)$ is given by:

$$(I * K)(i, j) = \sum_m \sum_n I(i-m, j-n) K(m, n)$$

Nomenclature:
$I$: Input image
$K$: Kernel/Filter
$(i, j)$: Output pixel coordinates
$(m, n)$: Kernel coordinates
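As a concrete illustration, here is a minimal NumPy sketch of a valid (no-padding) filter pass. Note that most deep learning libraries actually compute cross-correlation, i.e. the sum above without flipping the kernel; that convention is used here. The tiny image and edge-detecting kernel are purely illustrative.

```python
import numpy as np

def conv2d_valid(image, kernel):
    # Cross-correlation, as CNN frameworks compute it; true textbook
    # convolution would flip the kernel first.
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Tiny image: dark left half, bright right half.
img = np.array([[0, 0, 9, 9],
                [0, 0, 9, 9],
                [0, 0, 9, 9],
                [0, 0, 9, 9]], dtype=float)

# A vertical-edge detector responds at the dark-to-bright boundary.
vertical_edge = np.array([[-1, 0, 1],
                          [-1, 0, 1],
                          [-1, 0, 1]], dtype=float)

fmap = conv2d_valid(img, vertical_edge)
print(fmap.shape)  # (2, 2): a 4x4 input and 3x3 kernel give a 2x2 feature map
```

The loop form is slow but mirrors the formula term by term; real frameworks vectorize this heavily.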

2. Key Concepts in Convolutional Layers

2.1. Kernel/Filter

A kernel is a small matrix that slides over the input data, performing element-wise multiplication with the input values it covers and summing the results. Each kernel is designed to detect a specific feature (e.g., horizontal edges, vertical edges).

2.2. Padding

Padding involves adding extra rows and columns of zeros (or other values) around the input image. This is done for two main reasons: it preserves the spatial dimensions of the output, so feature maps do not shrink with every layer, and it lets the kernel cover border pixels as often as interior pixels, so information at the edges is not lost.

Common types are 'valid' (no padding, output size shrinks) and 'same' (padding added to make the output size equal to the input).
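A quick NumPy sketch of 'same' padding for an odd-sized kernel at stride 1; the sizes match the demo's 28×28 input and 3×3 kernels:

```python
import numpy as np

img = np.zeros((28, 28))
k = 3
p = (k - 1) // 2          # symmetric zero padding that yields 'same' output
padded = np.pad(img, p)   # zeros added on all four sides -> 30x30
out_side = padded.shape[0] - k + 1
print(out_side)           # 28: output size equals input size
```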

2.3. Stride

Stride defines the step size at which the kernel slides over the input image. A stride of 1 moves the kernel one pixel at a time; a stride of 2 skips every other position, roughly halving the output resolution in each dimension.

Stride is therefore also a tool for downsampling the input representation.

Output size calculation for a 2D convolution:

$$O_h = \lfloor \frac{I_h - K_h + 2P}{S} \rfloor + 1$$

$$O_w = \lfloor \frac{I_w - K_w + 2P}{S} \rfloor + 1$$

Nomenclature:
$O_h, O_w$: Output height and width
$I_h, I_w$: Input height and width
$K_h, K_w$: Kernel height and width
$P$: Padding
$S$: Stride
$\lfloor . \rfloor$: Floor function
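The formula can be checked against the architecture listed at the top of the page; the same expression covers the 2×2 pooling layers (kernel 2, stride 2, no padding). A small Python sketch:

```python
def out_size(i, k, p=0, s=1):
    # O = floor((I - K + 2P) / S) + 1; integer division performs the floor.
    return (i - k + 2 * p) // s + 1

side = 28
side = out_size(side, 3)       # Conv1: 26
side = out_size(side, 2, s=2)  # Pool1: 13
side = out_size(side, 3)       # Conv2: 11
side = out_size(side, 2, s=2)  # Pool2: 5
print(side * side * 64)        # 1600, the flattened length before the dense layer
```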

3. Activation Functions (ReLU)

After the convolution operation, an activation function is applied element-wise to the feature map. The Rectified Linear Unit (ReLU) is a popular choice:

$$ReLU(x) = \max(0, x)$$

ReLU introduces non-linearity into the model, allowing it to learn more complex patterns. It converts negative inputs to zero, effectively acting as a threshold.
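In NumPy, ReLU is a single element-wise maximum:

```python
import numpy as np

fmap = np.array([[-2.0, 3.0],
                 [0.5, -1.0]])
activated = np.maximum(0, fmap)  # negatives are clipped to zero
print(activated)
```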

4. Pooling Layers (Max Pooling)

Pooling layers are used to reduce the spatial dimensions (width and height) of the feature maps, thereby reducing the number of parameters and computational cost. Max pooling is the most common type: it takes the maximum value from a window (e.g., 2x2) of the feature map. This helps in making the model more robust to small translations and distortions in the input.
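A 2×2 max pool with stride 2 can be written compactly in NumPy by reshaping into non-overlapping windows; this sketch assumes even height and width, as in the demo architecture:

```python
import numpy as np

def max_pool_2x2(x):
    # Split into non-overlapping 2x2 windows and take each window's maximum.
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
pooled = max_pool_2x2(x)
print(pooled.shape)  # (2, 2): spatial dimensions are halved
```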

5. Flattening

After several convolutional and pooling layers, the 2D feature maps are "flattened" into a single, long 1D vector. This transformation is necessary to connect the convolutional part of the CNN to the fully connected (dense) layers for classification.

6. Dense (Fully Connected) Layers

These are traditional neural network layers where every input neuron is connected to every output neuron. In a CNN, dense layers typically come after the flattening layer and are responsible for performing the final classification based on the features extracted by the convolutional layers.
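A dense layer is a matrix multiply plus a bias; the sizes below follow the demo architecture (1600 flattened features into 128 hidden units), while the random weights are purely illustrative stand-ins for trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
flat = rng.standard_normal(1600)             # flattened 5x5x64 feature maps
W = 0.01 * rng.standard_normal((128, 1600))  # weight matrix (illustrative values)
b = np.zeros(128)                            # bias vector
hidden = np.maximum(0, W @ flat + b)         # dense layer followed by ReLU
print(hidden.shape)  # (128,)
```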

7. Output Layer (Softmax)

The final layer in a classification CNN is usually a dense layer with a Softmax activation function. Softmax converts the raw output scores into probabilities, where each probability corresponds to the likelihood of the input belonging to a particular class. The sum of all probabilities is 1.

For a vector $z$ of $K$ real numbers, the Softmax function is:

$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

Nomenclature:
$z_i$: Input score for class $i$
$K$: Number of classes
$e$: Euler's number (base of natural logarithm)
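A numerically stable implementation subtracts the maximum score before exponentiating, which leaves the result unchanged but avoids overflow; the example logits are illustrative:

```python
import numpy as np

def softmax(z):
    # Shifting by max(z) cancels in the ratio, so the result is identical,
    # but exp() never sees a large positive argument.
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs.argmax(), probs.sum())  # highest logit wins; probabilities sum to 1
```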

8. Advantages and Disadvantages of CNNs

Advantages:

Parameter sharing and local connectivity make CNNs far more parameter-efficient on images than fully connected networks.
They learn feature hierarchies directly from data, removing the need for hand-engineered features.
Weight sharing and pooling provide a useful degree of robustness to small translations.

Disadvantages:

Training is computationally expensive and typically requires large labeled datasets.
They are not inherently robust to rotation or scale changes without data augmentation.
The learned features are difficult to interpret.