Data Download and Pre Processing
The answers from each QA system, along with the questions and confidence values, are collected into a CSV file. Each system has a separate CSV file with the fields Question, Top Answer Text, and Top Answer Confidence. We refer to this file as qa-pairs.csv from here on.
How often each question is asked to the QA system is an important factor. Including frequency as a field in qa-pairs.csv is useful for later comparison.
Along with the QA pairs, if multiple systems are to be trained on the ground truth, the ground truth used to train the models must be common to all systems. truth.csv should contain Question ID, Question, and either Answer or Answer ID. If the truth file is created with Answer IDs, then corpus.csv should map each Answer ID to its answer. Here corpus.csv is the whole data set used to answer any question asked to the QA system.
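
To illustrate how the Answer ID ties truth.csv to corpus.csv, here is a minimal Python sketch that looks up each truth entry's answer text in the corpus and sets aside entries whose Answer ID is not found there. The header names (Answer Id, Answer, Question Id, Question) and the sample rows are assumptions made for illustration; this is not the toolkit's own implementation.

```python
import csv
import io

# Hypothetical file contents; the header names are assumptions, not a confirmed schema.
CORPUS_CSV = """Answer Id,Answer
a1,Reset your password from the login page.
a2,Contact support by email.
"""

TRUTH_CSV = """Question Id,Question,Answer Id
q1,How do I reset my password?,a1
q2,How do I reach support?,a2
q3,Where is my invoice?,a9
"""


def resolve_truth(corpus_file, truth_file):
    """Split truth rows by whether their Answer Id appears in the corpus."""
    # Build the Answer Id -> Answer mapping from the corpus.
    answers = {row["Answer Id"]: row["Answer"] for row in csv.DictReader(corpus_file)}
    in_corpus, not_in_corpus = [], []
    for row in csv.DictReader(truth_file):
        answer_id = row["Answer Id"]
        if answer_id in answers:
            # Attach the answer text found through the corpus mapping.
            in_corpus.append({**row, "Answer": answers[answer_id]})
        else:
            not_in_corpus.append(row)
    return in_corpus, not_in_corpus


in_corpus, not_in_corpus = resolve_truth(io.StringIO(CORPUS_CSV), io.StringIO(TRUTH_CSV))
print(len(in_corpus), "resolved;", len(not_in_corpus), "missing from corpus")
```

With real files, the two io.StringIO objects would be replaced by open file handles for corpus.csv and truth.csv.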
For this particular experiment, we used data from Watson Experience Manager (XMGR). corpus.csv, truth.csv, and qa-pairs.csv can be generated from the user logs at the client location. The process for downloading the raw data and processing it into these files follows. The toolkit can, however, be used with data in any format that can be expressed as corpus.csv, truth.csv, and qa-pairs.csv.
Steps for pre-processing using Watson Experience Manager:
-
Download the corpus file:
themis xmgr download-corpus <url> <username> <password>
Here url is the XMGR project instance URL (e.g. https://watson.ihost.com/instance/283/predeploy/$150dd167e4e). The URL might contain a '$' symbol, which needs to be escaped on the command line by putting '\' before it. username and password are the credentials for the client XMGR instance.
This command downloads corpus.csv from the Watson Experience Manager (XMGR) interface; the file is used in the subsequent steps. It may take several hours to run. It saves intermediate state, so if it drops in the middle, running it again picks up where it left off. Optionally, --retries can be specified as a parameter to restart automatically a specified number of times.
-
Download the truth file:
themis xmgr truth <url> <username> <password>
Here url is the XMGR project instance URL, and username and password are the credentials for the client XMGR instance.
This command creates two files in the output directory: a raw truth.json that contains all the information downloaded from XMGR, and a filtered truth.csv file whose answers are associated with Answer IDs in corpus.csv. Truth is used to train the WEA model and the NLC model.
Assumption: all Answer IDs referenced in truth.csv are present in corpus.csv. If this is not the case, then run
themis xmgr validate-truth <corpus.csv> <truth.csv>
If all the Answer IDs are present in the corpus, this command does nothing. If any are missing, it creates two new files: truth.in-corpus.csv and truth.not-in-corpus.csv.
-
Download usage log from XMGR:
Download the usage report from XMGR; it arrives as a zip file. The zip contains QuestionsData.csv, which records the questions that were asked to Watson and the answers it provided. Use it to extract the set of questions asked to Watson along with the answers it gave.
-
Derive Question-Answer pairs:
themis question extract <QuestionsData.csv> > <qa-pairs.csv>
Here QuestionsData.csv is the CSV file from the usage log zip, and qa-pairs.csv is the output file used in the subsequent steps.
This command extracts questions and answers from the usage logs and adds question frequency information. Frequency is used as a factor in the measurement of the final results.
Assumption: a given question always elicits the same answer. If this is not the case, a warning is printed and a single answer is kept for the question; the other answers are dropped arbitrarily.
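
The extraction step can be pictured with a short Python sketch: count how often each question occurs and keep one answer per question, printing a warning when a question has more than one distinct answer. The header names, the sample rows, and the rule for which answer survives (here, the first one seen) are assumptions for illustration; the toolkit's own choice is described only as arbitrary.

```python
import csv
import io
import sys

# Hypothetical usage-log rows; the header names are assumptions.
USAGE_LOG = """Question,Top Answer Text,Top Answer Confidence
How do I reset my password?,Use the login page.,0.91
Where is my invoice?,Check the billing tab.,0.75
How do I reset my password?,Use the login page.,0.91
Where is my invoice?,Ask your account manager.,0.40
How do I reset my password?,Use the login page.,0.91
"""


def extract_qa_pairs(log_file):
    """Return one row per distinct question, with a Frequency count added."""
    pairs = {}
    for row in csv.DictReader(log_file):
        question = row["Question"]
        if question not in pairs:
            # First occurrence: keep this answer and start counting.
            pairs[question] = {**row, "Frequency": 0}
        elif pairs[question]["Top Answer Text"] != row["Top Answer Text"]:
            # Conflicting answer: warn and keep the answer seen first.
            print(f"warning: multiple answers for {question!r}", file=sys.stderr)
        pairs[question]["Frequency"] += 1
    return list(pairs.values())


qa_pairs = extract_qa_pairs(io.StringIO(USAGE_LOG))
for pair in qa_pairs:
    print(pair["Frequency"], pair["Question"], "->", pair["Top Answer Text"])
```

Keeping the first answer seen makes the result depend on row order, which is one way to read "arbitrary"; a different rule, such as keeping the highest-confidence answer, would need only a change in the conflict branch.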