Using a Kaggle dataset to gain experience with LSTMs and text processing.
Here you'll find a Stack Overflow quality ranking based on a model that takes multiple inputs (the title, body and tags of a question) and processes them with LSTMs using Keras.
I will include only the most relevant parts of the code here. For the complete code, see the Colab notebook linked at the end of the article.
Dataset
You can find it on Kaggle.
The dataset contains 60,000 Stack Overflow questions from 2016–2020, classified into three categories: HQ (high quality), LQ_EDIT (low quality but still open) and LQ_CLOSE (low quality, closed by the community).
Preprocessing
Removing some characters from the training and validation sets.
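A minimal sketch of this cleaning step, assuming the data is loaded into pandas DataFrames named train and valid with Title, Body and Tags columns (the exact characters removed in the notebook may differ):

import re
import pandas as pd

def clean_text(text):
    # strip HTML tags, newlines/tabs and repeated whitespace, then lowercase
    text = re.sub(r"<[^>]+>", " ", str(text))
    text = re.sub(r"[\n\r\t]", " ", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()

for col in ["Title", "Body", "Tags"]:
    train[col] = train[col].apply(clean_text)
    valid[col] = valid[col].apply(clean_text)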
Define some settings
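As an illustration, the settings could look like the following; the exact values used in the notebook may differ:

vocab_size = 10000    # maximum number of words kept by the tokenizer (assumption)
embedding_dim = 16    # size of the embedding vectors, as described below
max_len_title = 30    # padded length of title sequences (assumption)
max_len_body = 200    # padded length of body sequences (assumption)
max_len_tags = 10     # padded length of tag sequences (assumption)
oov_token = "<OOV>"   # token used for out-of-vocabulary words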
One-hot encode the three possible ratings.
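One possible way to do this, assuming the label column of the Kaggle CSVs is called Y:

import pandas as pd

# map the three classes to one-hot vectors, e.g. HQ -> [1, 0, 0]
y_train = pd.get_dummies(train["Y"]).values.astype("float32")
y_valid = pd.get_dummies(valid["Y"]).values.astype("float32")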
Prepare the data for training: tokenize the text, convert it into sequences, and apply padding.
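A sketch of this step using the Keras Tokenizer and pad_sequences, reusing the settings above; the helper fit_and_transform is just for illustration:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

def fit_and_transform(train_texts, valid_texts, max_len):
    # fit the tokenizer on the training texts only, then pad both sets to max_len
    tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_token)
    tokenizer.fit_on_texts(train_texts)
    train_seq = pad_sequences(tokenizer.texts_to_sequences(train_texts),
                              maxlen=max_len, padding="post", truncating="post")
    valid_seq = pad_sequences(tokenizer.texts_to_sequences(valid_texts),
                              maxlen=max_len, padding="post", truncating="post")
    return train_seq, valid_seq

X_train_title, X_valid_title = fit_and_transform(train["Title"], valid["Title"], max_len_title)
X_train_body,  X_valid_body  = fit_and_transform(train["Body"],  valid["Body"],  max_len_body)
X_train_tags,  X_valid_tags  = fit_and_transform(train["Tags"],  valid["Tags"],  max_len_tags)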
Model definition
As mentioned before, the model takes three inputs separately, i.e. the title of the question, the body and the associated tags. For each one I define an Input layer, embed it into vectors of dimension 16 and feed it into an LSTM. The last step is to concatenate the three branches and add a final dense layer of dimension 3.
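A sketch of this architecture with the Keras functional API; the embedding dimension of 16 and the final dense layer of dimension 3 come from the description above, while the number of LSTM units is an assumption:

from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Embedding, LSTM, Dense, concatenate

def branch(max_len, name):
    # one branch per input: Input -> Embedding(16) -> LSTM
    inp = Input(shape=(max_len,), name=name)
    x = Embedding(vocab_size, embedding_dim)(inp)
    x = LSTM(32)(x)
    return inp, x

title_in, title_out = branch(max_len_title, "title")
body_in,  body_out  = branch(max_len_body,  "body")
tags_in,  tags_out  = branch(max_len_tags,  "tags")

# concatenate the three branches and classify into the three quality categories
merged = concatenate([title_out, body_out, tags_out])
output = Dense(3, activation="softmax")(merged)

model = Model(inputs=[title_in, body_in, tags_in], outputs=output)
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])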
Training
For training, I define two callbacks: EarlyStopping and ReduceLROnPlateau.
Note: for the fit call, I pass the three inputs and the labels, and I also provide the validation data.
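A sketch of the training step; the callback parameters, number of epochs and batch size are assumptions, not necessarily the values used in the notebook:

from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

callbacks = [
    EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True),
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=1),
]

history = model.fit(
    [X_train_title, X_train_body, X_train_tags], y_train,
    validation_data=([X_valid_title, X_valid_body, X_valid_tags], y_valid),
    epochs=10, batch_size=64, callbacks=callbacks,
)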
Results
The best results are around epoch 4, with
loss: 0.2923 - accuracy: 0.8923 - val_loss: 0.3161 - val_accuracy: 0.8823
Plus: single prediction with a multi-input model
As I was writing this article, I struggled with single predictions (I had difficulty preparing the data for the predict method) and kept getting strangely shaped outputs. The issue was that the model always expects the batch as the first dimension, so here is the code for solving this.
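A sketch of the fix: give each of the three inputs an extra batch dimension (here via np.expand_dims) before calling predict:

import numpy as np

# predict() expects a batch dimension, so each input must have shape (1, max_len)
single_title = np.expand_dims(X_valid_title[0], axis=0)
single_body  = np.expand_dims(X_valid_body[0],  axis=0)
single_tags  = np.expand_dims(X_valid_tags[0],  axis=0)

probs = model.predict([single_title, single_body, single_tags])  # shape (1, 3)
predicted_class = np.argmax(probs, axis=1)[0]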
Colab
Here you can find the Colab notebook.