
Data Download and Pre Processing

DharmendraVaghela edited this page Aug 12, 2016 · 2 revisions

The answers from every QA system, along with the questions and confidence scores, can be collected in CSV form. Each system has its own CSV file with Question, Top Answer Text, and Top Answer Confidence fields. We refer to this file as qa-pairs.csv from now on.

How frequently each question is asked of the QA system is an important factor. A qa-pairs.csv that includes frequency as a field is useful for later comparison.

In addition to the qa-pairs files, if multiple systems are to be trained on the ground truth, the ground truth used to train the models must be common to all systems. truth.csv should contain Question ID, Question, and Answer or Answer ID. If the truth file is created with Answer IDs, then corpus.csv should map each Answer ID to its answer. Here corpus.csv is the whole data set that is used to answer any question asked of the QA system.
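
The relationship between truth.csv and corpus.csv can be sketched in Python as follows. This is only an illustration and not part of the toolkit: the column names (`Question Id`, `Question`, `Answer Id`, `Answer`), the sample rows, and the helper `split_truth_by_corpus` are all assumptions, so check them against the headers of your own files.

```python
import csv
import io


def split_truth_by_corpus(truth_rows, corpus_rows, id_field="Answer Id"):
    """Partition truth rows by whether their answer ID appears in the corpus.

    The field name "Answer Id" is an assumption, not a documented header.
    """
    corpus_ids = {row[id_field] for row in corpus_rows}
    in_corpus = [row for row in truth_rows if row[id_field] in corpus_ids]
    not_in_corpus = [row for row in truth_rows if row[id_field] not in corpus_ids]
    return in_corpus, not_in_corpus


# Tiny in-memory stand-ins for corpus.csv and truth.csv (made-up data).
corpus_csv = (
    "Answer Id,Answer\n"
    "a1,Reset it from the login page.\n"
    "a2,Call the help desk.\n"
)
truth_csv = (
    "Question Id,Question,Answer Id\n"
    "q1,How do I reset my password?,a1\n"
    "q2,Where is the cafeteria?,a9\n"
)

corpus = list(csv.DictReader(io.StringIO(corpus_csv)))
truth = list(csv.DictReader(io.StringIO(truth_csv)))

in_corpus, not_in_corpus = split_truth_by_corpus(truth, corpus)
print([row["Question Id"] for row in in_corpus])      # truth rows usable for training
print([row["Question Id"] for row in not_in_corpus])  # truth rows with no corpus entry
```

A truth row whose Answer ID has no corpus entry cannot be resolved to an answer text, which is why the two files need to agree before training.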

For this particular experiment, we used data from Watson Experience Manager (XMGR). corpus.csv, truth.csv, and qa-pairs.csv can be generated from the user logs at the client location. The following process downloads the raw data and processes it into these files. However, this toolkit can work with data in any format that can be expressed as corpus.csv, truth.csv, and qa-pairs.csv.

Steps for pre-processing using Watson Experience Manager:

  1. Download the corpus file:

     themis xmgr download-corpus <url> <username> <password>
    

    Here, url is the XMGR project instance URL (e.g. https://watson.ihost.com/instance/283/predeploy/$150dd167e4e). The URL might contain a '$' symbol, which must be escaped on the command line by putting a backslash ('\') before it. username and password are the credentials for the client XMGR instance.

    This command downloads corpus.csv from the Watson Experience Manager (XMGR) interface; the file is used in the subsequent steps. The download may take several hours. It saves intermediate state, so if it is interrupted, running the command again picks up where it left off. Optionally, --retries can be specified to restart the download automatically a specified number of times.

  2. Download the truth file:

     themis xmgr truth <url> <username> <password>
    

    Here, url, username, and password are the URL and credentials for the client XMGR instance.

    This command creates two files in the output directory: a raw truth.json that contains all the information downloaded from XMGR, and a filtered truth.csv in which answers are associated with Answer IDs from corpus.csv. The truth is used to train the WEA model and the NLC model.

    Assumption: all Answer IDs referenced in truth.csv are present in corpus.csv. If this might not be the case, run

     themis xmgr validate-truth <corpus.csv> <truth.csv>
    

    If all the Answer IDs are present in the corpus, then this command will do nothing. If any are missing, it creates two new files: truth.in-corpus.csv and truth.not-in-corpus.csv.

  3. Download usage log from XMGR:

    Download the usage report from XMGR as a zip file. The zip contains QuestionsData.csv, which records the questions that were asked of Watson and the answers it provided. Use this file to extract the set of questions asked of Watson along with the answers it gave.

  4. Derive Question-Answer pairs:

     themis question extract <QuestionsData.csv> > <qa-pairs.csv>
    

    Here, QuestionsData.csv is the CSV file from the usage log zip, and qa-pairs.csv is the output file that is used in the subsequent steps.

    This command extracts questions and answers from the usage logs and adds question frequency information. Frequency is used as a factor in the final measurement.

    Assumption: a given question always elicits the same answer. If this is not the case, a warning is printed and a single answer is kept for that question; the other answers are dropped arbitrarily.
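
The extraction described in step 4 can be illustrated with a short Python sketch. It is not the toolkit's implementation: the function `extract_qa_pairs`, the `Frequency` field name, the tuple layout of the input records, and the choice to keep the first answer seen are all assumptions made for illustration (the command itself only promises that one answer is kept).

```python
import sys
from collections import Counter


def extract_qa_pairs(records):
    """Collapse usage-log records into one row per question with a frequency.

    records: iterable of (question, answer, confidence) tuples.
    Keeping the first answer seen is an illustrative choice only.
    """
    frequency = Counter()
    chosen = {}        # question -> (answer, confidence); dicts keep insertion order
    conflicts = set()  # questions that were seen with more than one answer
    for question, answer, confidence in records:
        frequency[question] += 1
        if question not in chosen:
            chosen[question] = (answer, confidence)
        elif chosen[question][0] != answer:
            conflicts.add(question)
    for question in sorted(conflicts):
        print("warning: multiple answers for %r, keeping one" % question,
              file=sys.stderr)
    return [
        {
            "Question": question,
            "Top Answer Text": answer,
            "Top Answer Confidence": confidence,
            "Frequency": frequency[question],
        }
        for question, (answer, confidence) in chosen.items()
    ]


# Made-up usage-log records: the first question is asked twice with different answers.
log = [
    ("How do I reset my password?", "Use the login page.", 0.9),
    ("Where is the cafeteria?", "Second floor.", 0.5),
    ("How do I reset my password?", "Call the help desk.", 0.4),
]
pairs = extract_qa_pairs(log)
for row in pairs:
    print(row["Question"], row["Frequency"])
```

Counting every occurrence before collapsing duplicates preserves how often each question was asked, even though only one answer per question survives.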
