What a CNN sees — visualizing the intermediate outputs of the conv layers

Francesco · 4 min read · May 10, 2021

Today you will see how the convolutional layers of a CNN transform an image. Moreover, you’ll see that as we go deeper into the stack of conv layers, the activations become more and more abstract.

To do this, I created a CNN from scratch and trained it on the ‘cats_vs_dogs’ dataset from the TensorFlow Datasets catalog.

Here you will find the most useful parts of the code. For the complete Jupyter notebook, take a look at the link at the bottom of the page.

Ok, let’s start!

Import and pre-processing

import tensorflow as tf
import tensorflow_datasets as tfd
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras import layers, models

Get the data

# load the dataset, keeping the last 20% of 'train' for validation
(train, validation), metadata = tfd.load(
    'cats_vs_dogs',
    split=["train[:80%]", "train[80%:]"],
    as_supervised=True,
    with_info=True,
)
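
As a quick sanity check (my addition, not in the original post), the metadata object returned by tfd.load also exposes the label names and the split size:

# optional: inspect the dataset metadata
print(metadata.features['label'].names)       # expected: ['cat', 'dog']
print(metadata.splits['train'].num_examples)  # size of the full 'train' split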

Pre-process the data

width, height = 150, 150

def preprocess(image, label):
    img = tf.cast(image, tf.float32)
    img = img / 255  # scale pixel values to [0, 1]
    img = tf.image.resize(img, (width, height))
    return img, label

train = train.map(preprocess)
validation = validation.map(preprocess)

# create the batched datasets for training and validation
train_dataset = train.shuffle(100).batch(64)
validation_dataset = validation.shuffle(100).batch(64)
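
To make sure the pipeline works as expected (a quick check I added, not part of the original code), you can pull a single batch and print its shape:

# take one batch and verify the shapes
for images, labels in train_dataset.take(1):
    print(images.shape)  # (64, 150, 150, 3)
    print(labels.shape)  # (64,)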

Model definition and training

There are 4 conv layers and, on top of them, a Flatten layer, a Dropout, and two Dense layers.

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(width, height, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))

model.add(layers.Flatten())
model.add(layers.Dropout(0.5))
model.add(layers.Dense(512, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.summary()
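
The feature-map sizes in the summary can be worked out by hand: each 3×3 conv with ‘valid’ padding (the Keras default) trims one pixel per side, and each 2×2 max pool halves the spatial size. Assuming Keras’s default layer naming, the summary should report shapes like these:

conv2d           (None, 148, 148, 32)
max_pooling2d    (None, 74, 74, 32)
conv2d_1         (None, 72, 72, 64)
max_pooling2d_1  (None, 36, 36, 64)
conv2d_2         (None, 34, 34, 128)
max_pooling2d_2  (None, 17, 17, 128)
conv2d_3         (None, 15, 15, 128)
max_pooling2d_3  (None, 7, 7, 128)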

Training the model

epochs = 10
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.0001),
              # the last Dense layer applies a sigmoid, so the loss receives
              # probabilities, not logits
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
              metrics=['accuracy'])

history = model.fit(train_dataset, epochs=epochs,
                    validation_data=validation_dataset)
model.save("cat_vs_dog.h5")

Result of training after 10 epochs (we are not interested in performance; we want to see the activations of the conv layers):

Epoch 10/10
291/291 [==============================] - 82s 280ms/step - loss: 0.3577 - accuracy: 0.8445 - val_loss: 0.4525 - val_accuracy: 0.7896

Intermediate output of the conv layers

Now the interesting part! We capture the outputs of the first layers and then feed the network a dog image.

I chose to try with a photo of my dog, Rocky.

Hey Rocky, you are on the internet

from tensorflow.keras.preprocessing import image

model_l = tf.keras.models.load_model('cat_vs_dog.h5')

# load image of rocky
rocky = image.load_img('rocky.jpg', target_size=(width,height))
rocky_as_tensor = image.img_to_array(rocky)
rocky_as_tensor = np.expand_dims(rocky_as_tensor, axis=0)
rocky_as_tensor /= 255 # normalize
print("dog shape", rocky_as_tensor.shape)

# grab the outputs of the first 8 layers
layer_outputs = [layer.output for layer in model_l.layers[:8]]
activation_model = models.Model(inputs=model_l.input, outputs=layer_outputs)
#feed the model with the image
activations = activation_model.predict(rocky_as_tensor)

To get the 4 conv layers, we selected model_l.layers[:8]: the first 8 layers are exactly the 4 conv layers and the 4 max-pooling layers. The code below filters out the max-pooling layers and keeps the indexes of the conv layers.

# take only the conv layers (filter out the max-pooling layers)
conv_indices = []
for i in range(len(activations)):
    if "conv2d" in model_l.layers[i].name:
        conv_indices.append(i)
        print("Layer: ", model_l.layers[i].name, " Shape: ", activations[i].shape)

We create a function, plot_layer, to display each layer’s feature maps as a grid of images. We then iterate through the conv layers and call this function for each one.

from mpl_toolkits.axes_grid1 import ImageGrid
# https://matplotlib.org/stable/gallery/axes_grid1/simple_axesgrid.html

def plot_layer(name, activation):
    print("Processing {} layer...".format(name))
    how_many_features_map = activation.shape[3]

    figure_size = how_many_features_map * 2
    fig = plt.figure(figsize=(figure_size, figure_size))

    grid = ImageGrid(fig, 111,
                     nrows_ncols=(how_many_features_map // 16, 16),
                     axes_pad=0.1,  # pad between axes in inches
                     )
    images = [activation[0, :, :, i] for i in range(how_many_features_map)]

    for ax, img in zip(grid, images):
        # iterating over the grid returns the Axes
        ax.matshow(img)
    plt.show()
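
Note the hard-coded 16 columns in nrows_ncols: each row of the grid holds 16 feature maps, so the first layer (32 filters) produces a 2×16 grid and the two 128-filter layers produce 8×16 grids.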

# for each conv2d layer, plot the feature maps
for conv_ix in conv_indices:
    plot_layer(model_l.layers[conv_ix].name, activations[conv_ix])

Result

layers.Conv2D(32, (3, 3), activation='relu', input_shape=(width, height, 3)). The first layer acts as an edge detector.
layers.Conv2D(64, (3, 3), activation='relu')
layers.Conv2D(128, (3, 3), activation='relu')
layers.Conv2D(128, (3, 3), activation='relu')

There are a few things to note here. The first layer is the most visually interpretable: we can say that it acts as an edge detector.

In contrast, as we dig deeper into the network, the activations become less interpretable and more abstract, and the feature maps of the last layer are the most abstract and the least visually interpretable of all.

Complete code

Here is the Colab file
