Stack Overflow Questions Quality Rating with LSTM

Using a Kaggle dataset to gain experience with LSTMs and text processing.

Here, you'll find a Stack Overflow question quality classifier based on a multi-input model (title, body and tags of the question) that processes each input with an LSTM using Keras.

Here I show only the most relevant parts of the code. For the complete version, see the Colab notebook at the end of the article.

Dataset

You can find it on Kaggle.

We collected 60,000 Stack Overflow questions from 2016–2020 and classified them into three categories: HQ (high quality), LQ_EDIT (low quality but still open) and LQ_CLOSE (low quality, closed by the community).

Preprocessing

Remove some characters from the training and validation sets
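The exact characters removed are not listed here, so the following is only a minimal sketch of what such a cleaning step could look like. It assumes the bodies contain HTML markup and that tags are stored as `<python><keras>`; the function names `clean_text` and `clean_tags` are hypothetical.

```python
import re

# Assumption: question bodies contain HTML markup such as <p> or <code>.
_HTML_TAG = re.compile(r"<[^>]+>")
# Assumption: keep letters, digits, '+' and '#' (so "c++" and "c#" survive).
_UNWANTED = re.compile(r"[^a-z0-9+#\s]")
_SPACES = re.compile(r"\s+")


def clean_text(text):
    """Lowercase, drop HTML tags and punctuation, collapse whitespace."""
    text = _HTML_TAG.sub(" ", text.lower())
    text = _UNWANTED.sub(" ", text)
    return _SPACES.sub(" ", text).strip()


def clean_tags(tags):
    """Turn '<python><keras>' into 'python keras'."""
    return " ".join(re.findall(r"<([^>]+)>", tags.lower()))
```

The same functions would be applied to the title, body and tags columns of both the training and the validation set.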

Define some settings

One-hot encode the three possible ratings
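As an illustration, one-hot encoding the three ratings could be sketched with NumPy as below. The column order in `CLASSES` and the helper name `one_hot` are assumptions; any fixed order works as long as it is used consistently for training and prediction.

```python
import numpy as np

# Assumed column order of the three ratings.
CLASSES = ["HQ", "LQ_EDIT", "LQ_CLOSE"]


def one_hot(labels, classes=CLASSES):
    """Map each label string to a one-hot row of length len(classes)."""
    index = {name: i for i, name in enumerate(classes)}
    ids = np.array([index[label] for label in labels])
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(len(classes), dtype="float32")[ids]
```

For example, `one_hot(["HQ", "LQ_CLOSE"])` gives one row with the first column set and one row with the last column set.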

Prepare the data for training: tokenize the text, transform it into sequences and apply padding

I define a function that transforms the input into sequences of integers (each integer is mapped to a word) and returns both the sequences and the word index.
I then convert the title, the body and the tags to sequences and apply padding. The code for the validation data is the same.
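To make the mechanics of this step concrete, here is a simplified pure-Python sketch of tokenizing and padding. It is a stand-in for the Keras utilities, not the code from the notebook: the names `to_sequences` and `pad` are hypothetical, words are split on whitespace only, and padding and truncation happen at the front, which matches the default of Keras `pad_sequences`.

```python
from collections import Counter


def to_sequences(texts, num_words=None):
    """Return (sequences, word_index) for a list of cleaned strings.

    Words are indexed from 1 by decreasing frequency; 0 is kept for padding.
    """
    counts = Counter(word for text in texts for word in text.split())
    word_index = {
        word: i for i, (word, _) in enumerate(counts.most_common(num_words), start=1)
    }
    sequences = [
        [word_index[w] for w in text.split() if w in word_index] for text in texts
    ]
    return sequences, word_index


def pad(sequences, maxlen):
    """Left-pad with 0 (or keep only the last maxlen tokens) to a fixed length."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append([0] * (maxlen - len(seq)) + seq)
    return padded
```

In practice the word index should be built on the training texts and reused for the validation texts, so that the same word maps to the same integer in both sets.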

Model definition

As mentioned above, I take three separate inputs: the title of the question, the body and the associated tags. For each one I define an Input, embed it into vectors of dimension 16 and feed the result into an LSTM. The last step is to concatenate the three LSTM outputs and add a final dense layer of dimension 3, one unit per rating.
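The architecture described above can be sketched with the Keras functional API as follows. Only the embedding dimension (16) and the output size (3) come from the text; the vocabulary size, sequence lengths, number of LSTM units, softmax activation, optimizer and loss are assumptions chosen for illustration.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_model(vocab_size=10000, title_len=20, body_len=200, tags_len=5,
                embed_dim=16, lstm_units=32, num_classes=3):
    """Three-input LSTM classifier (sizes other than 16 and 3 are assumed)."""

    def branch(name, seq_len):
        # One Input -> Embedding -> LSTM branch per text field.
        inp = keras.Input(shape=(seq_len,), name=name)
        x = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)(inp)
        x = layers.LSTM(lstm_units)(x)
        return inp, x

    title_in, title_vec = branch("title", title_len)
    body_in, body_vec = branch("body", body_len)
    tags_in, tags_vec = branch("tags", tags_len)

    # Merge the three LSTM outputs and classify into the three ratings.
    merged = layers.concatenate([title_vec, body_vec, tags_vec])
    outputs = layers.Dense(num_classes, activation="softmax")(merged)

    model = keras.Model(inputs=[title_in, body_in, tags_in], outputs=outputs)
    # categorical_crossentropy is assumed because the labels are one-hot encoded.
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Calling `build_model().summary()` prints the three branches and the shared output layer.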

Training
For the training, I define two callbacks, EarlyStopping and ReduceLROnPlateau.

Note: when fitting, I pass the three inputs and the labels, and I also provide the validation data.
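A hedged sketch of the training setup is shown below. The two callbacks are the ones named above, but every hyperparameter (monitored metric, patience, factor, epochs, batch size) is an assumed value, and the helper names `make_callbacks` and `train` are hypothetical.

```python
from tensorflow import keras


def make_callbacks():
    """EarlyStopping and ReduceLROnPlateau with assumed settings."""
    early_stop = keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=3, restore_best_weights=True
    )
    reduce_lr = keras.callbacks.ReduceLROnPlateau(
        monitor="val_loss", factor=0.5, patience=2, min_lr=1e-5
    )
    return [early_stop, reduce_lr]


def train(model, train_inputs, y_train, val_inputs, y_val,
          epochs=20, batch_size=64):
    """Fit a three-input model.

    train_inputs and val_inputs are lists [title, body, tags] of padded
    integer arrays; y_train and y_val are the one-hot labels.
    """
    return model.fit(
        train_inputs, y_train,
        validation_data=(val_inputs, y_val),
        epochs=epochs, batch_size=batch_size,
        callbacks=make_callbacks(),
    )
```

The order of the arrays in the input list must match the order of the inputs given when the model was created.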

Result

The best results are around epoch 4, with:

loss: 0.2923 - accuracy: 0.8923 - val_loss: 0.3161 - val_accuracy: 0.8823

Plus: single prediction of a model with multi-input

While writing this article, I struggled with single predictions: I had difficulty preparing the data for the predict method and kept getting outputs with strange shapes. The problem was that the model always expects the batch as the first dimension, so here is the code that solves it.

Look at the reshape. Here I want one prediction at a time. If you want 10 predictions in one call, set the batch size to 10 and index the data accordingly.
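The reshape idea can be sketched like this, assuming each input has already been tokenized and padded. The helpers `as_batch` and `predict_one` are hypothetical names, and `model` is any object with a Keras-style `predict` method that takes the three inputs as a list.

```python
import numpy as np


def as_batch(padded_seq, batch_size=1):
    """Reshape padded data to (batch_size, seq_len); the batch comes first."""
    return np.asarray(padded_seq).reshape(batch_size, -1)


def predict_one(model, title_seq, body_seq, tags_seq):
    """Predict the class index for a single question."""
    inputs = [as_batch(title_seq), as_batch(body_seq), as_batch(tags_seq)]
    probs = model.predict(inputs)  # expected shape: (1, 3)
    # Index into the class order used when one-hot encoding the labels.
    return int(np.argmax(probs[0]))
```

A padded title of length 20 goes in with shape `(1, 20)` rather than `(20,)`, which is what avoids the unexpected output shapes.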

Colab

Here you can find the Colab

Master’s degree in Computer Engineering for Robotics and Smart Industry — Smart Systems & Data Analytics