In this concluding part, we will build the Transformer model with positional encoding.
The Transformer model consists of several key components that work together to process and understand input sequences. Let's break down each component and its role:
The embedding layer converts input tokens into dense vectors of a fixed size. This is the first step in the model, where each word in the vocabulary is mapped to a corresponding dense vector.
embedding = Embedding(input_dim=vocab_size, output_dim=embedding_dim)(inputs)
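The embeddings then pass through a positional encoding layer, which injects token-order information that the attention mechanism alone cannot see. The PositionalEncoding layer was defined in the previous part of this series; as a refresher, here is a minimal sketch of the standard sinusoidal formulation (the exact layer from the previous part may differ in detail):
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Layer

class PositionalEncoding(Layer):
    # Sinusoidal positional encoding (Vaswani et al., 2017). A sketch that may
    # differ in detail from the layer defined earlier in this series.
    def __init__(self, max_length, embedding_dim):
        super().__init__()
        positions = np.arange(max_length)[:, np.newaxis]   # (max_length, 1)
        dims = np.arange(embedding_dim)[np.newaxis, :]     # (1, embedding_dim)
        angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / np.float32(embedding_dim))
        angles = positions * angle_rates                   # (max_length, embedding_dim)
        angles[:, 0::2] = np.sin(angles[:, 0::2])          # sine on even indices
        angles[:, 1::2] = np.cos(angles[:, 1::2])          # cosine on odd indices
        self.pos_encoding = tf.cast(angles[np.newaxis, ...], tf.float32)

    def call(self, inputs):
        # Add the fixed positional encodings to the token embeddings
        return inputs + self.pos_encoding[:, :tf.shape(inputs)[1], :]

positional_encoding = PositionalEncoding(max_length, embedding_dim)(embedding)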
The dropout layer is a regularization technique that prevents overfitting by randomly setting a fraction of input units to 0 at each update during training.
x = Dropout(0.3)(positional_encoding)
The multi-head attention mechanism allows the model to focus on different parts of the input sequence simultaneously. Because each head can capture a different aspect of the input, this mechanism is particularly useful for customer journey analysis and customer insights.
multi_head_attention = MultiHeadAttention(num_heads=num_heads, key_dim=embedding_dim)
x = multi_head_attention(query=x, value=x, key=x)
Layer normalization normalizes the inputs across the features, stabilizing the training process and improving model performance.
x = LayerNormalization()(x)
The feed-forward network applies a series of transformations to the attended representations. It consists of two dense layers with a ReLU activation function in between, adding non-linearity and enabling the model to learn more complex representations.
ff_network = Dense(ff_dim, activation='relu', kernel_regularizer=l2(0.01))(x)
ff_network = Dense(embedding_dim, kernel_regularizer=l2(0.01))(ff_network)
x = x + ff_network
The global max pooling layer reduces the output of the Transformer block to a fixed size by selecting the maximum value over the time dimension. This layer captures the most important features, which is beneficial for customer analysis.
x = GlobalMaxPooling1D()(x)
The output layer consists of dense layers for the final prediction. For tasks such as sentiment analysis and behavior prediction, this layer outputs the final classifications.
outputs = Dense(num_classes, activation='softmax')(x)
Here's the complete code to build and train the Transformer model with positional encoding:
from tensorflow.keras.layers import (Input, Embedding, Dropout, MultiHeadAttention,
                                     LayerNormalization, Dense, GlobalMaxPooling1D)
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

def build_transformer_model_with_positional_encoding(max_length, vocab_size, embedding_dim, num_heads, ff_dim, num_classes):
    inputs = Input(shape=(max_length,))
    # Token embeddings enriched with positional information
    embedding = Embedding(input_dim=vocab_size, output_dim=embedding_dim)(inputs)
    positional_encoding = PositionalEncoding(max_length, embedding_dim)(embedding)
    x = Dropout(0.3)(positional_encoding)
    # Self-attention: queries, keys, and values all come from the same sequence
    multi_head_attention = MultiHeadAttention(num_heads=num_heads, key_dim=embedding_dim)
    x = multi_head_attention(query=x, value=x, key=x)
    x = LayerNormalization()(x)
    # Position-wise feed-forward network with a residual connection
    ff_network = Dense(ff_dim, activation='relu', kernel_regularizer=l2(0.01))(x)
    ff_network = Dense(embedding_dim, kernel_regularizer=l2(0.01))(ff_network)
    x = x + ff_network
    x = LayerNormalization()(x)
    # Reduce the sequence to a fixed-size vector of the strongest features
    x = GlobalMaxPooling1D()(x)
    x = Dropout(0.2)(x)
    outputs = Dense(num_classes, activation='softmax')(x)
    model = Model(inputs=inputs, outputs=outputs)
    return model
# Model Parameters
vocab_size = len(word2vec_model.wv.key_to_index) + 1  # +1 reserves index 0 for padding
embedding_dim = vector_size  # dimensionality of the pretrained Word2Vec vectors
num_heads = 8
ff_dim = 256
num_classes = len(label_encoder.classes_)
# Build and compile the model
model_with_positional_encoding = build_transformer_model_with_positional_encoding(
max_length, vocab_size, embedding_dim, num_heads, ff_dim, num_classes
)
optimizer = Adam(learning_rate=0.001)
model_with_positional_encoding.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Model summary
model_with_positional_encoding.summary()
# Train the model with early stopping
early_stopping = EarlyStopping(monitor='val_loss', patience=15, restore_best_weights=True)
history = model_with_positional_encoding.fit(X_train, y_train, epochs=50, batch_size=256, validation_data=(X_val, y_val), callbacks=[early_stopping])
# Evaluate model
loss, accuracy = model_with_positional_encoding.evaluate(X_val, y_val, verbose=0)
print(f'Loss: {loss:.2f}')
print(f'Accuracy: {accuracy * 100:.2f}%')
Visualizing Model Performance
To better understand the model's performance, we can plot the training and validation loss, as well as the training and validation accuracy. These plots provide insights into how well the model is learning and whether it is overfitting or underfitting.
import matplotlib.pyplot as plt

# Plot training and validation loss
plt.figure(figsize=(12, 6))
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Training and Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.savefig('training_validation_loss.png')
plt.show()
# Plot training and validation accuracy
plt.figure(figsize=(12, 6))
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.savefig('training_validation_accuracy.png')
plt.show()
Training Accuracy: The blue line represents the training accuracy over epochs. The training accuracy starts at a lower value and improves significantly with each epoch, stabilizing around 0.99 towards the end.
Validation Accuracy: The orange line represents the validation accuracy over epochs. The validation accuracy improves rapidly initially, reaching around 0.90, but then it fluctuates slightly and stabilizes around 0.88.
Observations:
There is a significant gap between the training accuracy and validation accuracy starting from around epoch 10, indicating potential overfitting. The model fits the training data very well but does not generalize as effectively to the validation data.
Early stopping helps limit this, but the training accuracy continues to improve even after the validation accuracy has plateaued.
Training Loss: The blue line represents the training loss over epochs. The training loss decreases consistently and becomes almost flat, reaching near zero towards the end.
Validation Loss: The orange line represents the validation loss over epochs. The validation loss decreases rapidly initially, reaching a minimum value around epoch 10, but then starts to increase slightly and stabilizes.
The rapid decrease in training loss indicates that the model is learning effectively from the training data.
The increase in validation loss after epoch 10 suggests overfitting, where the model is fitting the noise in the training data rather than learning the underlying patterns that generalize to new data.
The point where the validation loss starts to increase while the training loss continues to decrease is a clear indication of overfitting.
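You can pinpoint the epoch where the validation loss bottoms out (the weights that early stopping restores) with a quick check on the history object returned by fit:
import numpy as np

best_epoch = int(np.argmin(history.history['val_loss']))  # 0-indexed
print(f"Best epoch: {best_epoch + 1}, validation loss: {history.history['val_loss'][best_epoch]:.3f}")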
The confusion matrix provides a detailed breakdown of the model’s performance by showing the number of correct and incorrect predictions for each class. Here's a detailed analysis:
True Positives (TP): The bottom-right cell (4276) indicates the number of instances where the model correctly predicted the positive class.
True Negatives (TN): The top-left cell (1115) shows the number of instances where the model correctly predicted the negative class.
False Positives (FP): The top-right cell (315) represents the number of instances where the model incorrectly predicted the positive class when it was actually negative.
False Negatives (FN): The bottom-left cell (293) indicates the number of instances where the model incorrectly predicted the negative class when it was actually positive.
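The confusion matrix itself can be generated from the validation predictions, for example with scikit-learn. This is a minimal sketch: the variable names mirror the earlier code, and y_pred is introduced here for illustration.
import numpy as np
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Convert softmax outputs to predicted class indices
y_pred = np.argmax(model_with_positional_encoding.predict(X_val), axis=1)
cm = confusion_matrix(y_val, y_pred)
ConfusionMatrixDisplay(cm, display_labels=label_encoder.classes_).plot(cmap='Blues')
plt.savefig('confusion_matrix.png')
plt.show()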
From the confusion matrix, we can derive important performance metrics:
Accuracy: The overall accuracy of the model can be calculated as:
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (4276 + 1115) / 5999 ≈ 0.899
Precision: Precision for the positive class can be calculated as:
Precision = TP / (TP + FP) = 4276 / (4276 + 315) ≈ 0.931
Recall: Recall for the positive class can be calculated as:
Recall = TP / (TP + FN) = 4276 / (4276 + 293) ≈ 0.936
F1 Score: The F1 Score, which is the harmonic mean of precision and recall, can be calculated as:
F1 = 2 × (Precision × Recall) / (Precision + Recall) ≈ 0.934
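These metrics can also be computed directly with scikit-learn, assuming the y_pred array from the confusion matrix sketch above:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print(f"Accuracy:  {accuracy_score(y_val, y_pred):.3f}")
print(f"Precision: {precision_score(y_val, y_pred):.3f}")
print(f"Recall:    {recall_score(y_val, y_pred):.3f}")
print(f"F1 Score:  {f1_score(y_val, y_pred):.3f}")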
The high values for true positives (4276) and true negatives (1115) indicate that the model is performing well in correctly classifying both the positive and negative classes.
The values for false positives (315) and false negatives (293) are relatively low, but there is still room for improvement. Reducing these errors would further enhance the model’s performance.
The precision, recall, and F1 Score values are all high, suggesting a well-balanced model that performs consistently across different evaluation metrics.
Addressing False Positives and False Negatives: To reduce the false positives and false negatives, consider further tuning the model, exploring more advanced regularization techniques, or augmenting the dataset to improve generalization.
Continuous Monitoring: Regularly update the confusion matrix and related metrics to monitor the model's performance over time and ensure it remains robust across various data distributions.
By leveraging the detailed insights from the confusion matrix, you can continue to refine your model and achieve even greater accuracy and reliability in text classification tasks.