Loading dialog datasets in the QA format

As described in the previous section, we need to convert dialog data from line-by-line conversation turns into a (facts, question, answer) tuple format for each turn of the dialog. For this purpose, we need to write a method that will read lines from the raw dialog corpus and return the desired tuples for training in a memory network paradigm.

Since we will be using word vectors as inputs to our model, we first need to define a tokenize method that converts a sentence into a list of words, dropping punctuation and common stop words:

import re

def tokenize(sent):
    stop_words = {"a", "an", "the"}
    sent = sent.lower()
    if sent == '<silence>':
        return [sent]
    # Convert sentence to tokens, splitting on runs of non-word characters
    result = [word.strip() for word in re.split(r'(\W+)', sent)
              if word.strip() and word.strip() not in stop_words]
    # Cleanup: restore the silence token and drop trailing punctuation
    if not result:
        result = ['<silence>']
    if result[-1] == '.' or result[-1] == '?' or result[-1] == '!':
        result = result[:-1]
    return result
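As a quick sanity check, we can run the tokenizer on a couple of sample inputs (the method is repeated here so the snippet runs on its own; the example sentences are illustrative):

```python
import re

# The tokenize method from above, repeated for a self-contained check
def tokenize(sent):
    stop_words = {"a", "an", "the"}
    sent = sent.lower()
    if sent == '<silence>':
        return [sent]
    result = [word.strip() for word in re.split(r'(\W+)', sent)
              if word.strip() and word.strip() not in stop_words]
    if not result:
        result = ['<silence>']
    if result[-1] == '.' or result[-1] == '?' or result[-1] == '!':
        result = result[:-1]
    return result

print(tokenize('May I have a table for two?'))
# ['may', 'i', 'have', 'table', 'for', 'two']
print(tokenize('a the an'))
# ['<silence>']
```

Note that the sentence is lowercased, the stop word "a" and the trailing question mark are removed, and a sentence that reduces to nothing falls back to the silence token.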

Then, we can define a function to read the raw data files from the bAbI dialog dataset and process them. We parse the text in a file line by line and collect all possible (facts, question, answer) tuples in the data. We keep updating the facts list of a dialog as we move from one line to the next, and reset it when we encounter a blank line, which marks the start of a new dialog. We must also handle lines that do not contain a tab-separated (utterance, response) pair, since these carry knowledge base facts instead:

def parse_dialogs_per_response(lines, candidates_to_idx):
    data = []
    facts_temp = []
    utterance_temp = None
    response_temp = None
    # Parse line by line
    for line in lines:
        line = line.strip()
        if line:
            id, line = line.split(' ', 1)
            if '\t' in line:  # Has utterance and response
                utterance_temp, response_temp = line.split('\t')
                # Convert answer to integer index
                answer = candidates_to_idx[response_temp]
                # Tokenize sentences
                utterance_temp = tokenize(utterance_temp)
                response_temp = tokenize(response_temp)
                # Add (facts, question, answer) tuple to data
                data.append((facts_temp[:], utterance_temp[:], answer))
                # Add utterance/response encoding
                utterance_temp.append('$u')
                response_temp.append('$r')
                # Add turn count temporal encoding
                utterance_temp.append('#' + id)
                response_temp.append('#' + id)
                # Update facts
                facts_temp.append(utterance_temp)
                facts_temp.append(response_temp)
            else:  # Has KB fact
                response_temp = tokenize(line)
                response_temp.append('$r')
                response_temp.append('#' + id)
                facts_temp.append(response_temp)
        else:  # Start of new dialog
            facts_temp = []
    return data
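To make the tuple structure concrete, here is a minimal, self-contained sketch of the same parsing logic run on a hypothetical two-turn dialog (a plain whitespace tokenizer stands in for the tokenize method, and the candidate mapping is written by hand):

```python
# Simplified stand-in for the tokenize() method above
def simple_tokenize(sent):
    return sent.lower().split()

def parse(lines, candidates_to_idx):
    data, facts = [], []
    for line in lines:
        line = line.strip()
        if not line:            # blank line: a new dialog starts
            facts = []
            continue
        turn_id, line = line.split(' ', 1)
        if '\t' in line:        # utterance <TAB> response
            utterance, response = line.split('\t')
            answer = candidates_to_idx[response]
            utterance = simple_tokenize(utterance)
            response = simple_tokenize(response)
            # Snapshot the facts so far, paired with the new question
            data.append((facts[:], utterance[:], answer))
            utterance += ['$u', '#' + turn_id]
            response += ['$r', '#' + turn_id]
            facts += [utterance, response]
        else:                   # knowledge base fact
            facts.append(simple_tokenize(line) + ['$r', '#' + turn_id])
    return data

candidates_to_idx = {'hello what can i help you with today': 0,
                     "i'm on it": 1}
dialog = [
    '1 hi\thello what can i help you with today',
    "2 may i have a table\ti'm on it",
]
data = parse(dialog, candidates_to_idx)
print(data[0])        # ([], ['hi'], 0)
print(data[1][0][0])  # first accumulated fact: ['hi', '$u', '#1']
```

The first turn has an empty facts list, while the second turn sees both sentences of the first exchange, each tagged with its speaker and turn markers.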

An important nuance to note is that we have added two extra symbols (utterance/response encoding and turn count encoding) to the tokenized versions of all facts, questions, and responses in our data. This results in our model treating these encodings as words and building word vectors for them. The utterance/response encoding helps the model to differentiate between sentences spoken by the user and the bot, and the turn count encoding builds temporal understanding in the model.
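For example, an (utterance, response) pair from the third line of a dialog would be stored in the facts list as follows (the token lists here are written by hand for illustration):

```python
# A user turn and a bot turn from line 3 of a dialog, already tokenized
utterance = ['may', 'i', 'have', 'table']
response = ['i', 'm', 'on', 'it']

# $u/$r mark the speaker; #3 marks the line within the dialog
utterance += ['$u', '#3']
response += ['$r', '#3']

facts = [utterance, response]
print(facts[0])  # ['may', 'i', 'have', 'table', '$u', '#3']
```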

Here, the candidates dictionary is a mapping of candidate answers to integer indices. We need such a conversion because our memory network will be performing a softmax over these integer indices, which then point us to the chosen response. The candidates dictionary can be constructed directly from a file that lists all possible response candidates line by line; we also keep tokenized versions of the response candidates themselves, as follows:

candidates = []
candidates_to_idx = {}
with open('dialog-babi/dialog-babi-candidates.txt') as f:
    for i, line in enumerate(f):
        candidates_to_idx[line.strip().split(' ', 1)[1]] = i
        line = tokenize(line.strip())[1:]
        candidates.append(line)
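To illustrate the mapping with two made-up candidate lines in the same format (a numeric ID followed by the response text; the real file contains the full candidate set):

```python
# Two hypothetical lines in the format of dialog-babi-candidates.txt
sample_lines = [
    '1 hello what can i help you with today',
    "1 i'm on it",
]

candidates_to_idx = {}
for i, line in enumerate(sample_lines):
    # Drop the leading line ID and map the raw response to an index
    candidates_to_idx[line.strip().split(' ', 1)[1]] = i

print(candidates_to_idx)
# {'hello what can i help you with today': 0, "i'm on it": 1}
```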

Next, we can use the candidates dictionary to load the training, validation, and testing dialogs in the QA format using the parsing method we just defined:

train_data = []
with open('dialog-babi/dialog-babi-task5-full-dialogs-trn.txt') as f:
    train_data = parse_dialogs_per_response(f.readlines(), candidates_to_idx)

test_data = []
with open('dialog-babi/dialog-babi-task5-full-dialogs-tst.txt') as f:
    test_data = parse_dialogs_per_response(f.readlines(), candidates_to_idx)

val_data = []
with open('dialog-babi/dialog-babi-task5-full-dialogs-dev.txt') as f:
    val_data = parse_dialogs_per_response(f.readlines(), candidates_to_idx)