# Cognitive Performance Evaluation using Themis
### Goal

The main aim of this experiment is to measure the technical accuracy of several question answering systems side by side and decide which QA system performs better on a given data set. To achieve this, we ask the same questions of more than one QA system and record basic fields such as Question, Answer and Score (the confidence with which the system returns the answer). The score has to be on the same scale for all QA systems; if it is not, every system's confidence scores need to be converted to a common scale so the systems can be compared on equal footing.
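As a minimal illustration of putting scores onto a common scale, the sketch below min-max normalizes each system's confidences to [0, 1]. The column names and the normalization choice are assumptions made for this example, not the toolkit's actual scheme.

```python
import pandas as pd

def normalize_scores(answers: pd.DataFrame) -> pd.DataFrame:
    """Rescale each QA system's confidence scores to a common [0, 1] range.

    Expected columns (hypothetical, for illustration only):
    'system', 'question', 'answer', 'score'.
    """
    def rescale(scores: pd.Series) -> pd.Series:
        span = scores.max() - scores.min()
        if span == 0:
            # A system that returns a constant score carries no ranking
            # information; map everything to 0.
            return pd.Series(0.0, index=scores.index)
        return (scores - scores.min()) / span

    answers = answers.copy()
    answers["score"] = answers.groupby("system")["score"].transform(rescale)
    return answers
```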
Once answers with comparable scores have been collected from all QA systems, the correctness of those answers must be checked. The Annotation Assist tool was developed for this purpose: it adds human judgements to each question set, recording both the correctness of each answer and whether the question is in purview.
This toolkit uses two cognitive metrics to measure performance. Each point on these curves corresponds to a unique confidence (or score) returned by the system on the test data set.
#### 1. Precision Curve
This metric measures the ability of the system to answer in-purview questions. A strong negative slope on the precision curve indicates that the score (or confidence) is highly correlated with the probability that the answer is correct.
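As a rough sketch of how such a curve might be traced, assuming a judged set of in-purview answers with normalized 'score' values and human 'correct' judgements (hypothetical column names, not necessarily the toolkit's own), one can sweep the confidence threshold and record the precision of the answers at or above each threshold:

```python
import pandas as pd

def precision_curve(judged: pd.DataFrame) -> pd.DataFrame:
    """Precision at every distinct confidence threshold.

    `judged` is assumed to hold one row per in-purview question, with
    hypothetical columns 'score' (normalized confidence) and 'correct'
    (boolean human judgement from Annotation Assist).
    """
    points = []
    for threshold in sorted(judged["score"].unique(), reverse=True):
        attempted = judged[judged["score"] >= threshold]
        points.append({
            "threshold": threshold,
            "questions_attempted": len(attempted) / len(judged),
            "precision": attempted["correct"].mean(),
        })
    return pd.DataFrame(points)
```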
#### 2. ROC Curve
This metric represents the ability of the system to correctly detect out-of-purview questions. In this experiment, the idea of ROC curve analysis is extended to question answering: the "True Positive Rate (TPR)" on the y-axis is the accuracy of the answers to in-purview questions, and the "False Positive Rate (FPR)" on the x-axis is the probability that the system returns an answer to an out-of-purview question, which is by definition incorrect.
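A sketch of that construction under the same assumed columns, plus an 'in_purview' flag from the human annotations (again hypothetical names, not the toolkit's exact API): at each confidence threshold, TPR is the fraction of in-purview questions answered correctly, and FPR is the fraction of out-of-purview questions that were answered at all.

```python
import pandas as pd

def roc_curve(judged: pd.DataFrame) -> pd.DataFrame:
    """ROC points for the in/out-of-purview formulation described above.

    `judged` is assumed to have hypothetical boolean columns 'correct' and
    'in_purview' (from Annotation Assist) plus a normalized 'score', with at
    least one in-purview and one out-of-purview question present.
    """
    in_purview = judged[judged["in_purview"]]
    out_of_purview = judged[~judged["in_purview"]]

    points = []
    for threshold in sorted(judged["score"].unique(), reverse=True):
        # Questions the system would answer at this confidence threshold.
        ip_answered = in_purview[in_purview["score"] >= threshold]
        op_answered = out_of_purview[out_of_purview["score"] >= threshold]
        # TPR: in-purview questions answered correctly.
        tpr = ip_answered["correct"].sum() / len(in_purview)
        # FPR: out-of-purview questions answered anyway (always incorrect).
        fpr = len(op_answered) / len(out_of_purview)
        points.append({"threshold": threshold, "fpr": fpr, "tpr": tpr})
    return pd.DataFrame(points)
```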