UniversalDependencies/UD_French-ParTUT

Summary

UD_French-ParTUT is a conversion of a multilingual parallel treebank developed at the University of Turin. It covers a variety of text genres, including talks, legal texts and Wikipedia articles, among others.

Introduction

UD_French-ParTUT data is derived from the existing parallel treebank Par(allel)TUT.

ParTUT is a morpho-syntactically annotated collection of Italian, French and English parallel sentences. It includes texts from different sources, representing different genres and domains, and is released in several formats.

ParTUT comprises approximately 167,000 tokens, with an average of about 2,100 sentences per language. The texts currently available in the collection were gathered from a large number of sources and domains.

ParTUT data can be downloaded here and here.

Acknowledgements

We are deeply grateful to Project Syndicate© for letting us download and exploit their articles as text material, under the terms of educational use.

Corpus splitting

The corpus was randomly split using a script. To preserve the 1:1 correspondence among the three language sections, all of them were partitioned in the same way, so the same sentences, in the same order, are found in the training, development and test sets of the English and Italian treebanks as well. Since v2.1, however, UD_Italian-ParTUT has been re-partitioned because of sentences overlapping with UD_Italian, and the French splits were revised to stay aligned with their Italian counterparts. The French section now appears as follows:

  • fr_partut-ud-train.conllu: 24146 words (804 sentences)
  • fr_partut-ud-dev.conllu: 3237 words (160 sentences)
  • fr_partut-ud-test.conllu: 1214 words (56 sentences)
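
The idea of partitioning all language sections in the same way can be sketched as follows. This is only an illustration, not the script that was actually used: the function name split_parallel, the seed and the split sizes are assumptions. A single shuffled list of sentence indices is drawn once and applied to every language section, so aligned sentences end up in the same set and in the same order.

```python
import random


def split_parallel(sections, n_dev, n_test, seed=0):
    """Split aligned sentence lists identically (hypothetical helper).

    sections: dict mapping a language code to a list of sentences; all lists
    must have the same length and be aligned index by index.
    Returns {lang: {"train": [...], "dev": [...], "test": [...]}}.
    """
    lengths = {len(sents) for sents in sections.values()}
    if len(lengths) != 1:
        raise ValueError("all language sections must have the same length")
    n = lengths.pop()
    if n_dev + n_test > n:
        raise ValueError("dev and test sizes exceed the number of sentences")

    # Draw one permutation and reuse it for every language.
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    parts = {
        "dev": indices[:n_dev],
        "test": indices[n_dev:n_dev + n_test],
        "train": indices[n_dev + n_test:],
    }
    return {
        lang: {name: [sents[i] for i in idx] for name, idx in parts.items()}
        for lang, sents in sections.items()
    }
```

Because the permutation is shared, sentence k of the French development set is the translation of sentence k of the Italian and English development sets, as long as the input lists are aligned.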

Basic statistics

  • Tree count: 1020
  • Word count: 28597
  • Token count: 27661
  • Dep. relations: 48, of which 14 are language-specific
  • POS tags: 17
  • Category=value feature pairs: 34
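
The tree, word and token counts above can be recomputed from the CoNLL-U files. The sketch below assumes the usual UD convention: a word is a line with a plain integer ID, while a token is either a multiword-token range line (ID such as 3-4) or a word not covered by such a range. The function name conllu_stats is hypothetical. Under this convention, the difference between the word and token counts (28597 - 27661 = 936) comes from words that sit inside multiword tokens, such as contractions.

```python
def conllu_stats(lines):
    """Count trees, syntactic words and surface tokens in CoNLL-U lines."""
    trees = words = tokens = 0
    covered_until = 0    # last word ID covered by the current multiword token
    in_sentence = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():          # blank line ends a sentence
            in_sentence = False
            covered_until = 0
            continue
        if line.startswith("#"):      # sentence-level comment
            continue
        token_id = line.split("\t", 1)[0]
        if not in_sentence:
            trees += 1
            in_sentence = True
        if "." in token_id:           # empty node (enhanced graph): ignored
            continue
        if "-" in token_id:           # multiword token range, e.g. "3-4"
            tokens += 1
            covered_until = int(token_id.split("-")[1])
            continue
        words += 1
        if int(token_id) > covered_until:
            tokens += 1               # word that is also a surface token
    return {"trees": trees, "words": words, "tokens": tokens}
```

Since the function accepts any iterable of lines, it can be called on an open file object, for example one opened on fr_partut-ud-train.conllu with UTF-8 encoding.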

References

  • Manuela Sanguinetti, Cristina Bosco. 2014. PartTUT: The Turin University Parallel Treebank. In Basili, Bosco, Delmonte, Moschitti, Simi (editors), Harmonization and development of resources and tools for Italian Natural Language Processing within the PARLI project, LNCS, Springer Verlag.

  • Manuela Sanguinetti, Cristina Bosco. 2014. Converting the parallel treebank ParTUT in Universal Stanford Dependencies. In Proceedings of the 1st Conference for Italian Computational Linguistics (CLiC-it 2014), Pisa (Italy).

  • Cristina Bosco, Manuela Sanguinetti. 2014. Towards a Universal Stanford Dependencies parallel treebank. In Proceedings of the 13th Workshop on Treebanks and Linguistic Theories (TLT-13), Tübingen (Germany).

Changelog

2024-05-15 v2.14

  • fixed a few UPOS tags

2021-05-15 v2.8

  • fixed wrong lemmas
  • changed annotation of "pouvoir" and "devoir", which are no longer considered AUX
  • harmonized PronType annotation with other French treebanks
  • changed annotation of present participle

2019-11-15 v2.5

  • fixed common and proper nouns wrongly annotated as amod

2019-05-15 v2.4

  • various corrections to pass new validation

2018-11-15 v2.3

  • corrected wrong lemmas to "être" (15 cases)

2018-04-15 v2.2

  • minor corrections in the training set

2017-11-15 v2.1

  • dates were revised and annotated as flat structures
  • change of xpos for copulas (from VA to V)
  • revised "il + être + ADJ + de/que + VERB" construction
  • revised deprel of "en", "où" and "y" pronouns
  • change of deprel of possessives (from nmod:poss to det)
  • revised deprels of "tout"
  • revised "il y a" construction (both temporal and existential)
  • clefts, pseudo-clefts and causatives annotated according to language-specific guidelines
  • other minor corrections
  • revised splits, in order to align French sentences to Italian counterparts

2017-03-01 v2

  • initial release

=== Machine-readable metadata ================================================

Data available since: UD v2.0
License: CC BY-NC-SA 4.0
Includes text: yes
Genre: legal news wiki
Lemmas: converted from manual
UPOS: converted from manual
XPOS: converted from manual
Features: converted from manual
Relations: converted from manual
Contributors: Bosco, Cristina; Sanguinetti, Manuela
Contributing: elsewhere
Contact: [email protected]

===============================================================================
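
A metadata block of this kind can be read programmatically. The sketch below assumes it is stored with one "Key: value" pair per line, the layout normally used in UD README files. The function name parse_ud_metadata is hypothetical and not part of any UD tool.

```python
def parse_ud_metadata(text):
    """Parse 'Key: value' lines into a dict, skipping blank and '===' lines."""
    meta = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("==="):
            continue                       # separator or empty line
        key, sep, value = line.partition(":")
        if sep:                            # keep only lines that contain a colon
            meta[key.strip()] = value.strip()
    return meta
```

Multi-valued fields are left as plain strings. A caller could, for instance, split the Genre value on whitespace and the Contributors value on semicolons.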
