Recurrent Neural Network (RNN) architectures have recently become a very popular choice for Spoken Language Understanding (SLU) problems; however, they represent a large family of architectures that can furthermore be combined to form more complex neural networks. In this work, we compare different recurrent networks, such as simple Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM) networks, Gated Recurrent Units (GRU), and their bidirectional variants, on the popular ATIS dataset and on MEDIA, a more complex French dataset. Additionally, we propose a novel method in which information about the presence of relevant word classes in the dialog history is combined with a bidirectional GRU, and we show that incorporating these word classes improves performance over recurrent networks that analyze only the current sentence.
In this work we do two things: i) we evaluate different (gated and non-gated) recurrent neural network architectures, modeling sequences in either one or both directions, on the task of slot tagging for spoken language understanding, and ii) we model key concepts of the dialog history to make the network aware of crucial past information.
We evaluate the following RNN architectures:

- simple Recurrent Neural Networks (RNN)
- Long Short-Term Memory (LSTM) networks
- Gated Recurrent Units (GRU)
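As a concrete reference for the gated case, the GRU update can be sketched as follows (a minimal NumPy sketch; the weight names and shapes are illustrative, not taken from any specific framework):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h_prev, params):
    """One GRU step: update gate z, reset gate r, candidate state h_tilde.

    `params` maps gate name -> (W, U, b): input-to-hidden weights,
    hidden-to-hidden weights, and bias (names are illustrative).
    """
    W_z, U_z, b_z = params["z"]
    W_r, U_r, b_r = params["r"]
    W_h, U_h, b_h = params["h"]
    z = sigmoid(W_z @ x + U_z @ h_prev + b_z)              # update gate
    r = sigmoid(W_r @ x + U_r @ h_prev + b_r)              # reset gate
    h_tilde = np.tanh(W_h @ x + U_h @ (r * h_prev) + b_h)  # candidate state
    return (1 - z) * h_prev + z * h_tilde                  # interpolated state

# Tiny usage example with random weights (4-dim input, 3-dim hidden state)
rng = np.random.default_rng(0)
params = {k: (rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), np.zeros(3))
          for k in ("z", "r", "h")}
h = gru_cell(rng.normal(size=4), np.zeros(3), params)
```

The simple RNN replaces the gates with a single `tanh` update; the LSTM adds a separate memory cell and an output gate on top of a similar gating scheme.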
Each architecture is tested in two setups:

- unidirectional (processing the sequence left to right)
- bidirectional (processing the sequence in both directions)
A bidirectional architecture can be modeled directly, by adding weights for the backward direction (sometimes done for simple RNNs), or as two separate networks running in opposite directions whose outputs are combined (the usual approach for the more complex architectures in most frameworks).
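The second option can be sketched in a few lines (a NumPy sketch of two simple RNNs run in opposite directions and concatenated per time step; weights and shapes are illustrative):

```python
import numpy as np

def run_rnn(xs, W, U, b, h0):
    """Simple (Elman) RNN over a sequence; returns the hidden state per step."""
    h, states = h0, []
    for x in xs:
        h = np.tanh(W @ x + U @ h + b)
        states.append(h)
    return states

def bidirectional(xs, fwd, bwd, d_h):
    """Two independent RNNs in opposite directions, concatenated per step."""
    h0 = np.zeros(d_h)
    f = run_rnn(xs, *fwd, h0)
    # run backward over the reversed input, then restore the original order
    b = run_rnn(xs[::-1], *bwd, h0)[::-1]
    return [np.concatenate([hf, hb]) for hf, hb in zip(f, b)]

# Usage: a 5-step sequence of 4-dim inputs, 3-dim hidden states per direction
rng = np.random.default_rng(0)
fwd = (rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), np.zeros(3))
bwd = (rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), np.zeros(3))
outs = bidirectional([rng.normal(size=4) for _ in range(5)], fwd, bwd, 3)
```

Each output vector thus sees both the left and the right context of its word, which is why bidirectional variants usually help in slot tagging.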
We additionally test each setup on two datasets:

- ATIS (English air travel queries)
- MEDIA (a more complex French dataset)
In the last part, we try to model key concepts of the dialog. We model this as an additional vector presented with each word as input. The vector is binary, and each element indicates whether a concept has been mentioned from the beginning of the dialog up to the current word. For ATIS, we model 19 concepts (e.g. aircraft_code, airline_code, airline_name, airport_code, airport_name, city_name, class_type, cost_relative, country_name, day_name, etc.), while for MEDIA, we model 37 concepts. The architecture combines a bidirectional GRU that models the input sequence, a fully connected dense layer that models the dialog concepts, and a fully connected dense layer that merges the two and produces the output label y.
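The binary concept vector can be maintained incrementally, for example as below (a hedged Python sketch: the concept inventory and the `mentions` lookup are placeholders for a lexicon or an upstream tagger, not part of the original system):

```python
import numpy as np

# Hypothetical concept inventory (names follow the ATIS classes listed above)
CONCEPTS = ["aircraft_code", "airline_name", "city_name", "day_name"]

def concept_vectors(dialog_tokens, mentions):
    """Per word position, a binary vector marking which concepts have been
    mentioned from the beginning of the dialog up to (and including) that word.

    `mentions` maps a token index to the concept it triggers; how mentions
    are detected (lexicon, tagger, ...) is outside this sketch.
    """
    active = np.zeros(len(CONCEPTS))
    vecs = []
    for i, _tok in enumerate(dialog_tokens):
        if i in mentions:
            active[CONCEPTS.index(mentions[i])] = 1.0
        vecs.append(active.copy())
    return np.stack(vecs)

# Usage: "paris" triggers city_name; every later position keeps that bit set
vecs = concept_vectors(["book", "a", "flight", "to", "paris"],
                       {4: "city_name"})
```

Each word's recurrent hidden state h_t would then be concatenated with its concept vector d_t before the final dense layer, e.g. y_t = softmax(W[h_t; d_t] + b) (our reading of the merge step, not a formula from the source).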
The results are as follows:
| Model | Dataset | F1 (%) | Std. dev. (%) |
|---|---|---|---|
| Bidirectional GRU + dialog aw. | ATIS | 95.54 | 0.16 |
| Bidirectional GRU + dialog aw. | MEDIA | 83.89 | 0.27 |
From these results we conclude that adding dialog awareness (the binary concept vector) improves slot-tagging performance over recurrent networks that analyze only the current sentence, on both ATIS and MEDIA.