Gated Self-matching Networks for Reading Comprehension and Question Answering

1. Introduction

Question Answering (QA) is the task of retrieving an answer to a given question. It is an intelligent search engine based on natural language processing and information retrieval techniques. The user is allowed to ask questions in natural language and the system will return the corresponding answer directly. It has been explored both in the open-domain field [1] and in domain-specific settings, such as BioASQ for the biomedical field [2]. The Reading Comprehension task limits the candidate answer to a given passage.

Question Answering techniques have been improved by the promotion of official evaluations and open data set publication. Wang [3] proposes a generative probability model to compute the matching degree of the dependency trees of question and answer. Heilman and Smith [4] make use of a conditional random field model to estimate the structural distance between question and answer in the dependency tree. Ko [5] puts forward a probability-based ranking model to select the candidate answers, and a logistic regression method is used to estimate the probability of the correct answer. Severyn and Moschitti [6] apply the SVM tree kernel to learn shallow syntactic features for classification of question-answer pairs.

The traditional methods have also been applied on biomedical datasets and achieve better results. The OAQA (Open Advancement of Question Answering) system [7] combines biomedical resources, including domain-specific parsers and entity markers, to retrieve concepts and synonyms. Logistic regression classifiers are used for question classification and candidate answer scoring.

With the publication of large-scale open-domain datasets such as the Stanford Question Answering Dataset (SQuAD) [8,9], TriviaQA [10], WikiReading [11], Children's Book Test [12], etc., neural network based techniques for QA systems have been developed recently [13–16], and they lead to significant improvement over traditional methods.

Neural network based Question Answering differs from the traditional methods. Usually, the neural model is trained in an end-to-end fashion to produce an answer to the given question and passage. Yu [17] uses Convolutional Neural Networks (CNN) to model the distribution of question and answer. Feng [18] publishes a question-answer data set for the insurance domain and proposes several CNN models based on this data set. Because of the superior performance of the attention mechanism [19] in sequence-to-sequence models, researchers have tried to introduce the attention mechanism into the question answering task. With the publication of the SQuAD dataset [8,9], techniques based on deep learning have been well verified. Wang and Jiang [9] propose two kinds of answer prediction models, the Sequence Model and the Boundary Model, and an interaction layer is introduced to compute the attention distribution. Seo [15] improves the model of Wang and Jiang [13] by adding a bidirectional attention mechanism. Though these approaches work well in the Question Answering task, in all but a few cases such methods have a high requirement for computing resources and have difficulties answering complex context-dependent questions.

Most question answering systems based on neural networks use an interactive attention mechanism or bi-attention mechanism to obtain answers [20]. The existing methods mainly focus on the relationship between the question and the passage, and they pay little attention to the interactive verification between candidate answers. The problem is more obvious in open-domain QA, where a question needs to be answered by considering candidate answers from multiple paragraphs. In order to resolve this problem, Wang et al. [21] propose a two-stage extraction process: first extract answer candidates from passages and then select the final answer by combining information from all of the candidates. V-net [22] adopts an end-to-end neural model that enables answer candidates from different passages to verify each other based on their content representations.

To obtain answers to complex questions, reviewing after reading documents for further reasoning is necessary. This can be realized by a multi-round reasoning mechanism, which attempts to combine the information of questions with the new information extracted from previous iterations [23–25]. The Gated-attention reader [26] uses multiplicative interactions between the query embedding and intermediate states of a recurrent neural network reader, which is realized by feeding the question encoding into an attention-based gate in each iteration. Cui et al. [27] further propose that question-specific attention should be extended to a bi-attention mechanism, including both question-to-document and document-to-question attention. ReasonNet [28], unlike those methods which apply fixed iterations, adds a termination module to recognize whether to continue to the next inference step or to terminate the reasoning procedure when the information is sufficient.

To alleviate the problems above, our model allows for significantly more parallelization and even achieves higher accuracy for answer prediction. We propose a Gated Scaled Dot-Product Attention based model for the Reading Comprehension task, which aims to answer questions in a given context passage. The character-level embedding is incorporated into the word embedding, which is helpful to deal with Out-of-Vocabulary (OOV) tokens. The attention distribution is obtained by scaled dot-product and self-matching attention mechanisms. Finally, a Pointer Network is used to predict the starting and ending positions of the answer. The rest of the paper is organized as follows. Section 2 introduces the hierarchical multi-layer mechanism of the proposed model. Section 3 introduces the different levels of encoding for question and passage. Section 4 describes the attention-based passage encoding by question interaction. Section 5 presents the Pointer Network for answer selection. Section 6 gives the experimental results and Section 7 is the conclusion.

2. Hierarchical multi-layer reading comprehension model

We put forward a Gated Scaled Dot-Product Attention based model for the RC task, which is represented as a hierarchical multi-layer mechanism shown in Figure 1. It consists of six components.

Figure 1. Gated Scaled Dot-Product Attention based Model Structure for the RC task.

Character-Level Word Embedding Layer. Each character is mapped into a high-dimensional vector space and the character-level word embedding for each word is generated by a Bi-GRU.

Word Embedding Layer. The word-level vector is concatenated with the character-level vector. The distributed matrix representation of each word in question and passage is generated by the two-layer Highway Network.

Question and Passage Encoding Layer. It utilizes contextual cues from surrounding words to refine the embedding of each word in question and passage.

Gated Scaled Dot-Product Attention Layer. The representation of each word in the candidate passage is encoded in conjunction with the question-aware feature vector.

Self-matching Attention Layer. The representation of the passage is enriched by matching it against the output representation of the previous layer. It captures important cues with long-distance dependencies.

Pointer Network for Answer Selection. For each question, the starting and ending positions of the answer in the passage are predicted by a Pointer Network.

3. Question and passage encoding

3.1. Character-level embedding layer

The character-level embedding layer is responsible for mapping each word into a high-dimensional vector space, which has been shown to be helpful to deal with Out-Of-Vocabulary (OOV) tokens.

Question and contextual passage are represented as the word sets $Q = \{w_1^q, w_2^q, \dots, w_m^q\}$ and $P = \{w_1^p, w_2^p, \dots, w_n^p\}$ respectively. Each word $w_i$ consists of several characters and is represented as the character-level word distribution matrix $w_i = \{c_1, c_2, \dots, c_k\}$. The distributed representation of each character $c_i$ $(i = 1, \dots, k)$ is obtained from the pre-trained character vector. Further, a Bi-directional Gated Recurrent Unit (Bi-GRU) is used to generate the character-level word embedding for each word in question and passage separately.

(1) $u_i^q = \mathrm{BiGRU}(u_{i-1}^q, c_i^q), \quad i = 1, \dots, k$

(2) $u_i^p = \mathrm{BiGRU}(u_{i-1}^p, c_i^p), \quad i = 1, \dots, k$

We employ the last hidden states of the Bi-GRU, $u_k^p$ and $u_k^q$, to represent the character-level word embedding, as shown in Figure 2.

Figure 2. Character-level Word Embedding by Bi-GRU.
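To make Equations (1)–(2) concrete, here is a minimal PyTorch sketch of one way such a character-level word embedding could be computed; the class name, dimensions and toy input are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class CharWordEmbedding(nn.Module):
    """Character-level word embedding: embed each character, run a Bi-GRU,
    and use the final hidden states of both directions as the word vector."""

    def __init__(self, num_chars: int, char_dim: int = 64, hidden_dim: int = 50):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_dim, padding_idx=0)
        self.bigru = nn.GRU(char_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (num_words, max_chars) character indices for each word
        x = self.char_emb(char_ids)            # (num_words, max_chars, char_dim)
        _, h_n = self.bigru(x)                 # h_n: (2, num_words, hidden_dim)
        # concatenate forward and backward final hidden states -> (num_words, 2*hidden_dim)
        return torch.cat([h_n[0], h_n[1]], dim=-1)

# toy usage: 5 words, at most 8 characters each, character vocabulary of 100
emb = CharWordEmbedding(num_chars=100)
words = torch.randint(1, 100, (5, 8))
print(emb(words).shape)  # torch.Size([5, 100])
```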

3.2. Word embedding layer

The Word Embedding Layer also maps each word to a high-dimensional vector space. We use the pre-trained word vector model GloVe [29] to obtain fixed-dimension word vectors $\{e_t^q\}_{t=1}^m$ for the question and $\{e_t^p\}_{t=1}^n$ for the passage.

For each word $w_t$ in question and passage, the character-level vectors $u_t^q, u_t^p$ and the word-level vectors $e_t^q, e_t^p$ are concatenated respectively, as represented in Equations (3)–(4).

(3) $Q = \{[e_t^q, u_t^q]\}_{t=1}^m$

(4) $P = \{[e_t^p, u_t^p]\}_{t=1}^n$

Further, the distributed matrix representation of each word $w_t$ in question and passage is generated by the two-layer Highway Network [30], as shown in Equations (5)–(6).

(5) $y_t^q = \mathrm{ReLU}(x_t^q W_x + b_x) \odot \sigma(x_t^q W_T + b_T) + x_t^q \odot (1 - \sigma(x_t^q W_T + b_T))$

(6) $y_t^p = \mathrm{ReLU}(x_t^p W_x + b_x) \odot \sigma(x_t^p W_T + b_T) + x_t^p \odot (1 - \sigma(x_t^p W_T + b_T))$

where $x_t^q = [e_t^q, u_t^q] \in \mathbb{R}^d$ and $x_t^p = [e_t^p, u_t^p] \in \mathbb{R}^d$, and $d$ represents the dimension of the concatenated vector. Finally, the question and passage are represented as matrices $Q = \{y_t^q\}_{t=1}^m \in \mathbb{R}^{m \times d}$ and $P = \{y_t^p\}_{t=1}^n \in \mathbb{R}^{n \times d}$ respectively.
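A minimal sketch of one highway layer as written in Equations (5)–(6); stacking two such layers gives the two-layer Highway Network. The dimension value and the example input are assumptions.

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """One highway layer: y = ReLU(x W_x + b_x) * gate + x * (1 - gate),
    with gate = sigmoid(x W_T + b_T), as in Equations (5)-(6)."""

    def __init__(self, dim: int):
        super().__init__()
        self.transform = nn.Linear(dim, dim)   # W_x, b_x
        self.gate = nn.Linear(dim, dim)        # W_T, b_T

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        t = torch.sigmoid(self.gate(x))
        h = torch.relu(self.transform(x))
        return h * t + x * (1.0 - t)

# two-layer highway network over concatenated word + character vectors of dimension d
d = 400  # assumed dimension of [e_t, u_t]
highway = nn.Sequential(HighwayLayer(d), HighwayLayer(d))
x = torch.randn(30, d)      # 30 words in a passage
print(highway(x).shape)     # torch.Size([30, 400])
```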

4. Attention-based passage encoding

4.1. Gated scaled dot-product attention layer

An attention mechanism called the Gated Attention-based Recurrent Network has been proposed to generate a new passage representation aligned to the question [31]. We prefer the scaled dot-product to implement question-aware passage encoding. Dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.

An attention function maps a passage and a set of key-value pairs of the question to an output, where the passage (P), keys (Qkey), values (Qvalue) and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the passage with the corresponding key of the question. The question-aware passage embedding is achieved by this particular attention mechanism, Scaled Dot-Product Attention [32], which is shown in Figure 3.

Figure 3. Scaled Dot-Product Attention on Question and Passage.

The input consists of the passage and the question's key-value pairs with dimension $d_q$. We compute the dot products of the passage with all keys of the question, divide each by $\sqrt{d_q}$, and apply a softmax function to obtain the weights on the values.

The attention function on the set of words in the passage is computed simultaneously, packed together into a matrix $P$. The keys and values are also packed together into matrices $Q_{key}$ and $Q_{value}$. Given the question and passage representations $U^q = \{u_1^q, u_2^q, \dots, u_m^q\}$ and $U^p = \{u_1^p, u_2^p, \dots, u_n^p\}$, the output attention matrix is computed as:

(9) $C(P, Q_{key}, Q_{value}) = \mathrm{softmax}\left(\dfrac{P Q_{key}^T}{\sqrt{d_q}}\right) Q_{value}$

Here, $P = \sigma(U^p) \in \mathbb{R}^{n \times d}$, $Q_{key} = \sigma(U^q) \in \mathbb{R}^{m \times d}$, $Q_{value} = U^q \in \mathbb{R}^{m \times d}$, and $\sigma = \mathrm{ReLU}(Wx + b)$ is a non-linear mapping function.
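A small sketch of Equation (9): each passage vector attends over the question's keys and values with scaled dot-product attention, and $\sigma$ is modelled as a ReLU projection as described above. Dimensions and the toy inputs are assumed.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_product_attention(P, Q_key, Q_value):
    """C = softmax(P Q_key^T / sqrt(d_q)) Q_value  (Equation (9))."""
    d_q = Q_key.size(-1)
    scores = P @ Q_key.transpose(-2, -1) / math.sqrt(d_q)   # (n, m)
    weights = F.softmax(scores, dim=-1)                      # attention over question words
    return weights @ Q_value                                 # (n, d)

# sigma: non-linear mapping ReLU(Wx + b) applied to passage and question encodings
d = 150
sigma = nn.Sequential(nn.Linear(d, d), nn.ReLU())

U_p = torch.randn(40, d)   # n = 40 passage words
U_q = torch.randn(12, d)   # m = 12 question words

C = scaled_dot_product_attention(sigma(U_p), sigma(U_q), U_q)
print(C.shape)  # torch.Size([40, 150]): one question-aware vector per passage word
```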

The purpose of attention in an RC system is to read the passage by incorporating the question and to re-encode the passage with the question-aware information.

To make the attention focus on the important parts of the passage that are relevant to the question, we add another gate to the input of the Bi-GRU, which is updated as in Equations (10)–(12). Then the representation of the passage is updated as $U^p = \{u_1^p, u_2^p, \dots, u_n^p\}$.

(10) $u_t^p = \mathrm{BiGRU}(u_{t-1}^p, [u_t^p, c_t])$

(11) $[u_t^p, c_t] = g_t \odot [u_t^p, c_t]$

(12) $g_t = \mathrm{sigmoid}(W_g [u_t^p, c_t])$

Here, $u_t^p$ is from the previous encoding layer and it is an additional input to the recurrent network. $c_t$ is the $t$-th vector of the attention matrix $C$.
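A rough sketch of Equations (10)–(12): the sigmoid gate scales the concatenated input $[u_t^p, c_t]$ before it is fed to a Bi-GRU that re-encodes the passage. The weight shapes and dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GatedBiGRUInput(nn.Module):
    """Gate the concatenated [u_t^p, c_t] input (Equations (11)-(12)) and
    re-encode the passage with a Bi-GRU (Equation (10))."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.W_g = nn.Linear(2 * dim, 2 * dim, bias=False)
        self.bigru = nn.GRU(2 * dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, u_p: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        x = torch.cat([u_p, c], dim=-1)          # [u_t^p, c_t] for every position t
        g = torch.sigmoid(self.W_g(x))           # g_t, Equation (12)
        x = g * x                                # gated input, Equation (11)
        out, _ = self.bigru(x)                   # new passage representation, Equation (10)
        return out

layer = GatedBiGRUInput(dim=150, hidden=75)
u_p = torch.randn(1, 40, 150)   # passage encoding (batch, n, d)
c = torch.randn(1, 40, 150)     # attention vectors from Equation (9)
print(layer(u_p, c).shape)      # torch.Size([1, 40, 150])
```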

5. Pointer network for answer selection

We apply pointer networks [33] to predict the starting and ending positions of the answer in the passage. Attention-pooling over the question representation is used to generate the initial hidden vector for the pointer network. Given the passage representation $\{u_t^p\}_{t=1}^n$, the attention mechanism is utilized to select the starting position $p^{Start}$ and ending position $p^{End}$ in the passage, which can be formulated as follows.

For each word $w_t$ in the passage, we predict the corresponding probability of being the starting ($L = Start$) or ending ($L = End$) word of the answer.

(14) $s_t^L = v^T \tanh(W_p u_t^p + W_h h^L) \ (L = Start, End)$, $\quad a_t^L = \dfrac{\exp(s_t^L)}{\sum_{i=1}^{n} \exp(s_i^L)}$, $\quad p^L = \arg\max(a_1^L, \dots, a_n^L)$

We use the question representation $U^q = \{u_1^q, u_2^q, \dots, u_m^q\}$, which is obtained from the question encoding layer described in Section 3.3, to help locate the starting position of the answer $p^{Start}$. Here, $h^{Start}$ used in Equation (14) is computed as in Equation (15) and is an attention-pooling vector of the question based on the parameter $v^q$.

(15) $s_t^q = v^T \tanh(W_q u_t^q + W_v v^q)$, $\quad a_t^q = \dfrac{\exp(s_t^q)}{\sum_{j=1}^{m} \exp(s_j^q)}$, $\quad h^{Start} = \sum_{i=1}^{m} a_i^q u_i^q$

Also, the attention-based passage representation $\{u_t^p\}_{t=1}^n$ is used to predict the ending position of the answer by Equation (16), where $h^{End}$ represents the last hidden state of the pointer network.

(16) $h^{End} = \mathrm{BiGRU}\left(h^{Start}, \sum_{t=1}^{n} a_t^{Start} u_t^p\right)$
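As a rough sketch of Equations (14)–(16), the snippet below pools the question into an initial state $h^{Start}$, scores every passage word for the start position, advances the state with a GRU cell (standing in for the Bi-GRU of Equation (16)), and scores the end position. All names and dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnswerPointer(nn.Module):
    """Pointer network over the passage: predict start and end positions."""

    def __init__(self, d: int):
        super().__init__()
        self.v = nn.Linear(d, 1, bias=False)      # v^T
        self.W_p = nn.Linear(d, d, bias=False)    # W_p
        self.W_h = nn.Linear(d, d, bias=False)    # W_h
        self.W_q = nn.Linear(d, d, bias=False)    # W_q (question pooling)
        self.v_q = nn.Parameter(torch.randn(d))   # v^q in Equation (15)
        self.cell = nn.GRUCell(d, d)              # advances h^Start -> h^End

    def pool_question(self, U_q):
        s = self.v(torch.tanh(self.W_q(U_q) + self.v_q)).squeeze(-1)      # (m,)
        a = F.softmax(s, dim=-1)
        return a @ U_q                                                    # h^Start

    def point(self, U_p, h):
        s = self.v(torch.tanh(self.W_p(U_p) + self.W_h(h))).squeeze(-1)   # (n,)
        return F.softmax(s, dim=-1)                                       # a_t^L

    def forward(self, U_p, U_q):
        h_start = self.pool_question(U_q)           # Equation (15)
        a_start = self.point(U_p, h_start)          # start distribution, Equation (14)
        h_end = self.cell((a_start @ U_p).unsqueeze(0), h_start.unsqueeze(0)).squeeze(0)
        a_end = self.point(U_p, h_end)              # end distribution
        return a_start.argmax().item(), a_end.argmax().item()

ptr = AnswerPointer(d=150)
start, end = ptr(torch.randn(40, 150), torch.randn(12, 150))
print(start, end)  # predicted start and end word indices in the passage
```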

6. Experiment and results

6.1. Dataset

We evaluate our model on the Stanford Question Answering Dataset (SQuAD) V1.1. It consists of questions posed on a set of Wikipedia articles, where the answer to every question is a segment of text from the corresponding reading passage. With 100,000+ question-answer pairs on 500+ articles, SQuAD is significantly larger than previous reading comprehension datasets (Table 1).

Table 1. SQuAD Data Distribution.

A question-context sample from SQuAD is shown below.

Passage: In 1870, Tesla moved to Karlovac, to attend school at the Higher Real Gymnasium, where he was profoundly influenced by a math teacher Martin Sekulic. The classes...

Question: Who was Tesla's main influence in Karlovac?

Answer: Martin Sekulic

6.3. Experimental results

In order to evaluate the performance impact of different components in our hierarchical multi-layer model, we give a detailed comparison by removing them separately, and the results are shown in Table 2. The scores on the DevSet are evaluated by the official script.

Table 2. Performance impact of different components in GSA-Net.

As can be seen from Table 2, the performance declines when different components are removed from GSA-Net. These components play an important role in selecting the correct answer. Character-level word embedding can handle Out-of-Vocabulary tokens very well. With the Self-Matching Layer removed, the EM and F1 values are lowered by 4.92% and 4.12% respectively. It indicates that long-distance dependencies in the passage can help to locate the right answer efficiently. In addition, the gate mechanism makes the attention focus on the important parts of the passage and boosts the performance.
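For reference, EM and F1 on SQuAD are computed roughly as in the sketch below: answers are normalized (lower-cased, punctuation and articles removed) and then compared by exact string match and by token-overlap F1. This is a simplified re-implementation of the official v1.1 evaluation logic, not the official script itself.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lower-case, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, ground_truth: str) -> float:
    return float(normalize(prediction) == normalize(ground_truth))

def f1_score(prediction: str, ground_truth: str) -> float:
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Martin Sekulic", "a math teacher Martin Sekulic"))           # 0.0
print(round(f1_score("Martin Sekulic", "a math teacher Martin Sekulic"), 2))    # 0.67
```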

In order to show the ability of the model to encode evidence from the passage, we draw the alignment of the passage against the question in the Gated Scaled Dot-Product Attention Layer. The attention weights are visualized in Figure 4. The darker the colour, the higher the attention weight of the word. For example, the answer "Martin Sekulic" in the passage is given more attention with respect to the question. Other words with darker colour, such as "was", "Karlovac" and "Tesla", overlap with the question.

Figure 4. Visualization of the Attention Weight in Gated Scaled Dot-product Layer.
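A visualization like Figure 4 can be produced directly from the attention matrix $C$ of Equation (9). The sketch below uses matplotlib with random weights purely as a stand-in; the word lists and output file name are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in attention weights (rows: passage words, columns: question words).
# In practice these would be rows of the attention matrix from Equation (9).
passage = ["In", "1870", "Tesla", "moved", "to", "Karlovac", "Martin", "Sekulic"]
question = ["Who", "was", "Tesla's", "main", "influence", "in", "Karlovac", "?"]
weights = np.random.dirichlet(np.ones(len(question)), size=len(passage))

fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(weights, cmap="Blues", aspect="auto")
ax.set_xticks(range(len(question)))
ax.set_xticklabels(question, rotation=45, ha="right")
ax.set_yticks(range(len(passage)))
ax.set_yticklabels(passage)
ax.set_xlabel("Question")
ax.set_ylabel("Passage")
fig.colorbar(im, ax=ax, label="attention weight")
plt.tight_layout()
plt.savefig("attention_heatmap.png")
```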

The cross-entropy loss for the RC task in the models GSA-Net and GSA-Net without Gate is shown in Figure 5. The loss value decreases gradually as the training step increases. It shows that the GSA-Net model has a smaller loss and also converges faster.

Figure 5. The Loss during the Training Procedure for Answer Selection.

We also compare the performance of our model GSA-Net with other related work on SQuAD, as shown in Table 3.

Table 3. The performance comparison with different methods.

LR Baseline [8]: A model based on Logistic Regression for the RC task, which extracts several types of features for each candidate and computes the unigram/bigram overlap between the sentence and the question.

Dynamic Chunk Reader [35]: A novel neural network model for joint candidate answer chunking and ranking, where the candidate answer chunks are dynamically constructed and ranked in an end-to-end manner.

Match-LSTM with Ans-Ptr [13]: It proposes two new end-to-end neural network models for the machine comprehension task, which combine match-LSTM and Ptr-Net to handle the special properties of the SQuAD dataset.

Dynamic Co-attention Network [14]: A model called the Dynamic Co-attention Network (DCN) for question answering. The DCN first fuses co-dependent representations of the question and the document in order to focus on relevant parts of both. Then a dynamic pointing decoder iterates over potential answer spans.

RaSoR [36]: A model called RASOR that efficiently builds fixed-length representations of all spans in the evidence document with a recurrent network. It explicitly computes embedding representations for candidate answer spans.

BiDAF [15]: A model named the Bi-Directional Attention Flow (BIDAF) network, a multi-stage hierarchical process that represents the context at different levels of granularity and uses a bidirectional attention flow mechanism to obtain a query-aware context representation without early summarization.

Fine-Grained Gating (ensemble) [37]: A model with a fine-grained gating mechanism to combine word-level and character-level representations dynamically. It further extends the idea of fine-grained gating to model the interaction between question and paragraph for reading comprehension.

The results in Table 3 show that our model GSA-Net performs best in both EM and F1. Our method clearly outperforms the baseline and several strong state-of-the-art systems, both single models and ensembles.

We also compare the performance of our model with the participating systems in BioASQ. For each batch and for different question types, the results of the top 2 competing systems and our model are shown in Table 4.

Table 4. Comparison with competing BioASQ systems.

Our model has a strong ability to process the interactive encoding of questions and contexts, which can produce high-quality question-context alignment representations. A gated mechanism is adopted in our model to address the problem of propagating dependencies over long distances. Moreover, the hierarchical attention mechanisms can locate the segments which are related to the answer step by step, therefore the semantic representation ability of the model is enhanced.

7. Conclusion

Machine Reading Comprehension is an important task for natural language understanding. It evaluates the machine's ability to access knowledge and answer questions from a given passage. This paper proposes an end-to-end machine reading comprehension framework, which can well understand questions and relevant fragments for answer prediction. We present a Gated Scaled dot-product Attention based Neural network (GSA-Net) for the Reading Comprehension task. The different components in this hierarchical multi-layer model play an important part in locating the correct answer. The gated scaled dot-product attention and self-matching attention mechanisms are used to obtain a suitable question-aware representation of the passage. Further, the pointer network predicts the answer position effectively. Our model achieves an exact match (EM) score of 71.1% and an F1 score of 80.1% on SQuAD, which outperforms several strong competing systems. The model has small memory requirements, and it performs even better than models that rely on more computing resources.

Although we have added two attention layers in the proposed model, its interpretability is still limited, which is a common problem in deep learning based natural language processing tasks. In addition, the cross-paragraph reasoning ability of this model needs to be improved, and it is important for answering complex questions. In future work, we will try to combine BERT with our method to improve the reasoning ability of the model. Meanwhile, how to learn the prior knowledge of human language expression from large-scale unstructured data and apply it to machine reading comprehension is also a significant goal of our work.
