dbpedia-spotlight/model-quickstarter

Quickstarter for DBpedia Spotlight models

Update, January 2022

The language models are built with the latest versions of the redirects, disambiguations, and instance-types artifacts, downloaded from the DBpedia Databus. Catalan, Finnish, Lithuanian, and Romanian were added to the list of languages for which models can be created.

Update, January 2016

This tool now uses the wikistatsextractor by the great folks over at DiffBot. This means: no more Hadoop and Pig! Running the biggest model (English) takes around 2h on a single machine with around 32GB of RAM. We recommend running this script on an SSD with around 100GB of free space.
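Before starting a long build, it can help to check the machine against the figures above. The following is a minimal sketch, not part of this repository: the helper names (`free_disk_gb`, `total_ram_gb`, `meets`, `check_resources`) are hypothetical, the RAM check assumes Linux (`/proc/meminfo`), and the RAM threshold is set slightly below 32 on the assumption that the kernel reports a little less than the installed memory. The figures apply to the English model; smaller languages likely need less.

```shell
# Thresholds based on the note above (English model); adjust as needed.
REQUIRED_DISK_GB=100
REQUIRED_RAM_GB=30   # assumption: "around 32GB" installed shows up as slightly less

# Print the free space, in whole GB, on the filesystem that holds the given path.
free_disk_gb() {
  df -Pk "$1" | awk 'NR==2 { print int($4 / 1024 / 1024) }'
}

# Print the total RAM in whole GB (Linux only: reads /proc/meminfo).
total_ram_gb() {
  awk '/^MemTotal:/ { print int($2 / 1024 / 1024) }' /proc/meminfo
}

# Succeed if the first number is at least the second.
meets() {
  [ "$1" -ge "$2" ]
}

# Warn about any resource that falls short of the thresholds.
check_resources() {
  dir=${1:-.}
  disk=$(free_disk_gb "$dir")
  ram=$(total_ram_gb)
  meets "$disk" "$REQUIRED_DISK_GB" || echo "warning: only ${disk}GB free in $dir"
  meets "$ram" "$REQUIRED_RAM_GB" || echo "warning: only ${ram}GB of RAM"
}
```

Calling `check_resources wdir` from the repository root prints a warning for each shortfall and nothing otherwise.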

Requirements

  • Git
  • Maven 3

Spotlight model creation

You can use this tool to create DBpedia Spotlight models for your language.

  1. docker run -it dbpediaspotlight/model-quickstarter bash

    Generate the models outside the container: to keep the generated files on the host, map volumes for the folders `/model-quickstarter/wdir`, `/model-quickstarter/data` and `/model-quickstarter/models`, e.g.
     docker run -v /home/user/data/model/wdir:/model-quickstarter/wdir -v /home/user/data/model/data:/model-quickstarter/data -v /home/user/data/model/models:/model-quickstarter/models -it dbpediaspotlight/model-quickstarter bash
    
  2. cd model-quickstarter/

  3. Copy & paste one of the following commands to begin the corresponding language model creation process.

| Language | Language code | Locator code | Analyzer+Stemmer language prefix | Command |
|---|---|---|---|---|
| Catalan | ca | ES | Catalan | `./index_db.sh wdir ca_ES ca/stopwords.list Catalan models/ca` |
| Danish | da | DK | Danish | `./index_db.sh wdir da_DK da/stopwords.list Danish models/da` |
| German | de | DE | German | `./index_db.sh -b de/ignore.list wdir de_DE de/stopwords.list German models/de` |
| English | en | US | English | `./index_db.sh -b en/ignore.list wdir en_US en/stopwords.list English models/en` |
| Spanish | es | ES | Spanish | `./index_db.sh -b es/ignore.list wdir es_ES es/stopwords.list Spanish models/es` |
| Finnish | fi | FI | Finnish | `./index_db.sh wdir fi_FI fi/stopwords.list Finnish models/fi` |
| French | fr | FR | French | `./index_db.sh -b fr/ignore.list wdir fr_FR fr/stopwords.list French models/fr` |
| Hungarian | hu | HU | Hungarian | `./index_db.sh wdir hu_HU hu/stopwords.list Hungarian models/hu` |
| Italian | it | IT | Italian | `./index_db.sh wdir it_IT it/stopwords.list Italian models/it` |
| Lithuanian | lt | LT | Lithuanian | `./index_db.sh wdir lt_LT lt/stopwords.list Lithuanian models/lt` |
| Dutch | nl | NL | Dutch | `./index_db.sh -b nl/ignore.list wdir nl_NL nl/stopwords.list Dutch models/nl` |
| Norwegian | no | NO | Norwegian | `./index_db.sh -b no/ignore.list wdir no_NO no/stopwords.list Norwegian models/no` |
| Portuguese | pt | BR | Portuguese | `./index_db.sh -b pt/ignore.list wdir pt_BR pt/stopwords.list Portuguese models/pt` |
| Romanian | ro | RO | Romanian | `./index_db.sh wdir ro_RO ro/stopwords.list Romanian models/ro` |
| Russian | ru | RU | Russian | `./index_db.sh wdir ru_RU ru/stopwords.list Russian models/ru` |
| Swedish | sv | SE | Swedish | `./index_db.sh -b sv/ignore.list wdir sv_SE sv/stopwords.list Swedish models/sv` |
| Turkish | tr | TR | Turkish | `./index_db.sh -b tr/ignore.list wdir tr_TR tr/stopwords.list Turkish models/tr` |
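
The commands in the table all follow one pattern: an optional `-b <code>/ignore.list`, then `wdir`, the language and locator codes joined by an underscore, the stopword list, the analyzer name, and the output folder. The sketch below reproduces that pattern so the rows can be checked or scripted. The helper `build_index_cmd` is hypothetical and not part of this repository, and the pattern is inferred from the table rows rather than from the `index_db.sh` documentation.

```shell
# Build the index_db.sh command line for one row of the table.
# Arguments: language code, locator code, analyzer name, and "yes" if the
# row passes an ignore list with -b (defaults to "no").
build_index_cmd() {
  lang=$1
  locator=$2
  analyzer=$3
  use_ignore=${4:-no}

  cmd="./index_db.sh"
  if [ "$use_ignore" = "yes" ]; then
    cmd="$cmd -b $lang/ignore.list"
  fi

  # The pattern below is inferred from the table, not from the script itself.
  printf '%s\n' "$cmd wdir ${lang}_${locator} $lang/stopwords.list $analyzer models/$lang"
}
```

For example, `build_index_cmd en US English yes` prints the English command from the table, and `build_index_cmd ca ES Catalan` prints the Catalan one. The function only prints the command; it does not run it.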

Datasets

You can find pre-built datasets created using the model-quickstarter here:

Citation

If you use the current (statistical) version of DBpedia Spotlight or the data/models created using this repository, please cite the following paper.

@inproceedings{isem2013daiber,
  title = {Improving Efficiency and Accuracy in Multilingual Entity Extraction},
  author = {Joachim Daiber and Max Jakob and Chris Hokamp and Pablo N. Mendes},
  year = {2013},
  booktitle = {Proceedings of the 9th International Conference on Semantic Systems (I-Semantics)}
}
