Each line of the Ktrlf dataset contains one document together with its QA pairs.
id: URL of the document extracted from C4
target_text: The document used for In-Document Search
qa_pairs: Pairs of questions and their corresponding answer entities
entity_info: Metadata extracted from the document using GCP
    mention: The text of the mention as it appears in the document
    entity: The linked wiki entity
    start: The start character index of the mention in the document
    end: The end character index of the mention in the document
{
    'id': <Dump URL of Document>,
    'data': {
        'qa_pairs': [
            {'question': 'Social media platforms', 'target_entities': ['Twitter']},
            {'question': '...', 'target_entities': ['...']},
            ...
        ],
        'target_text': '...',
        'entity_info': [
            {'mention': 'Trump',
             'entity': 'Donald Trump',
             'start': 11,
             'end': 16,
             'wikipedia_link': 'https://en.wikipedia.org/wiki/Donald_Trump',
             'gcp_entity_type': 'Type.ORGANIZATION'},
            {'mention': 'Democratic',
             'entity': 'Democratic Party (United States)',
             'start': 179,
             'end': 189,
             'wikipedia_link': 'https://en.wikipedia.org/wiki/Democratic_Party_(United_States)',
             'gcp_entity_type': 'Type.ORGANIZATION'},
            ...
        ]
    }
}
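
As an illustration of how these fields fit together, here is a minimal Python sketch that works on one record that has already been parsed into a dict. It makes several assumptions that the description above does not state outright: that `end` is exclusive (consistent with the example, where `end - start` equals the length of the mention), that the names in `target_entities` use the same naming as the `entity` field, and that the record is nested under `data` as in the example. The helper names `check_entity_spans` and `find_target_spans` and the contents of `sample_record` are hypothetical and are not part of the dataset.

```python
# Hypothetical record with the same layout as the example above.
# The text, entities, and offsets are made up for illustration only.
sample_record = {
    "id": "https://example.com/doc",
    "data": {
        "qa_pairs": [
            {"question": "Social media platforms",
             "target_entities": ["Twitter", "Facebook"]},
        ],
        "target_text": ("Twitter and Facebook are popular. "
                        "Many people use Twitter daily."),
        "entity_info": [
            {"mention": "Twitter", "entity": "Twitter", "start": 0, "end": 7},
            {"mention": "Facebook", "entity": "Facebook", "start": 12, "end": 20},
            {"mention": "Twitter", "entity": "Twitter", "start": 50, "end": 57},
        ],
    },
}


def check_entity_spans(record):
    """Return the entity_info entries whose span does not match its mention."""
    data = record["data"]
    text = data["target_text"]
    mismatches = []
    for info in data["entity_info"]:
        # Assumption: 'end' is exclusive, so text[start:end] is the mention.
        if text[info["start"]:info["end"]] != info["mention"]:
            mismatches.append(info)
    return mismatches


def find_target_spans(record, question):
    """Return (start, end) spans of every mention linked to a target entity."""
    data = record["data"]
    targets = set()
    for pair in data["qa_pairs"]:
        if pair["question"] == question:
            targets.update(pair["target_entities"])
    # Assumption: target_entities and 'entity' share the same entity names.
    return [
        (info["start"], info["end"])
        for info in data["entity_info"]
        if info["entity"] in targets
    ]
```

Calling `find_target_spans(sample_record, "Social media platforms")` collects every span whose linked entity is one of the question's targets, including repeated mentions of the same entity. Running `check_entity_spans` first is a cheap way to confirm that the offsets line up with `target_text` before relying on them.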