Abstract

In this project, the Microsoft Research WikiQA corpus is used. The WikiQA corpus is a set of question and sentence pairs collected and annotated for research on open-domain question answering. The questions were derived from Bing query logs, and each question is linked to a Wikipedia page that potentially contains the answer.

This project presents a question answering framework based on a sequence model with an attention mechanism. The dataset consists of questions, corresponding candidate answer sentences, and labels indicating whether each sentence is a correct answer. Document preprocessing consolidates the candidate sentences into one document per question and records which sentences are the correct answers. Various functions are defined for data processing, including tokenization, word embedding using Word2Vec Skip-Gram, padding, part-of-speech tagging, lemmatization, named entity tagging, and TF-IDF calculation. These functions transform the data and generate the input features for the QA model.

The QA model architecture comprises a question summary model and a document model. The question summary model uses a bidirectional recurrent neural network (Bi-RNN) to encode contextual information from each question, and its output is a question summary. NER tagging, POS tagging, lemmatization, and TF-IDF calculation are performed on the documents, and embeddings of these values are fed into the document model. The document tokens are also converted into word vectors, and the resulting embeddings are fed into a GRU model.
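As a rough sketch of what the document-side GRU computes, the standard GRU update (update gate z, reset gate r, candidate state) can be written in NumPy. This is an illustrative, randomly initialized forward pass, not the project's trained model; the helper names and the 0.1 weight scale are assumptions.

```python
import numpy as np

def gru_step(x, h, W_z, U_z, W_r, U_r, W_h, U_h):
    # One GRU update: update gate z, reset gate r, candidate state h_tilde.
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(W_z @ x + U_z @ h)
    r = sigmoid(W_r @ x + U_r @ h)
    h_tilde = np.tanh(W_h @ x + U_h @ (r * h))
    return (1 - z) * h + z * h_tilde

def run_gru(xs, hidden, rng):
    # Run a randomly initialized GRU over a sequence of input vectors,
    # returning one hidden state per time step.
    d_in = xs[0].shape[0]
    # Parameter order: W_z, U_z, W_r, U_r, W_h, U_h
    params = [rng.standard_normal((hidden, d_in if i % 2 == 0 else hidden)) * 0.1
              for i in range(6)]
    h = np.zeros(hidden)
    states = []
    for x in xs:
        h = gru_step(x, h, *params)
        states.append(h)
    return np.stack(states)
```

A bidirectional variant, as used for the question summary, would run a second GRU over the reversed sequence and concatenate the two hidden states at each step.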

The output of the GRU model is then combined with the question summary through an attention mechanism, and the attended representation is passed to a softmax classifier. For each word, the softmax classifier estimates the probability of belonging to the answer. Extensive testing and ablation studies were conducted to evaluate model performance, covering an input-embedding ablation study, an attention ablation study, and hyperparameter testing.
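The attention-plus-softmax step can be sketched as follows. This is a minimal illustration assuming dot-product attention between each document hidden state and the question summary, followed by a per-token two-way softmax ("in answer" vs. "not in answer"); the exact scoring function and output layer of the project may differ.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_and_classify(doc_states, q_summary, W_out):
    # doc_states: (T, d) GRU states; q_summary: (d,); W_out: (d, 2).
    # Dot-product attention weights each document state by its similarity
    # to the question summary, then a per-token softmax yields the
    # probability that each word belongs to the answer.
    attn = softmax(doc_states @ q_summary)        # (T,) attention weights
    weighted = attn[:, None] * doc_states         # (T, d) attended states
    logits = weighted @ W_out                     # (T, 2) per-token logits
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return probs[:, 1]                            # P(word in answer) per token
```

The attention weights sum to one over the document, so tokens most similar to the question summary dominate the representation seen by the classifier.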