Quickstarter for DBpedia Spotlight models

Update, January 2022

The language models are built with the latest versions of the redirects, disambiguations, and instance-types artifacts, downloaded from the DBpedia Databus. Catalan, Finnish, Lithuanian, and Romanian were added to the list of languages for which models can be created.

Update, January 2016

This tool now uses the wikistatsextractor by the great folks over at DiffBot. This means: no more Hadoop and Pig! Running the biggest model (English) takes around 2h on a single machine with around 32GB of RAM. We recommend running this script on an SSD with around 100GB of free space.
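
The rough figures above (around 32GB of RAM, around 100GB of free disk space) can be checked before starting a long run. The following is a minimal sketch, assuming a Linux host that exposes `/proc/meminfo` and a POSIX `df`. The function names and default thresholds are illustrative and not part of this repository.

```shell
#!/usr/bin/env bash
# Pre-flight resource check (illustrative sketch, not part of model-quickstarter).

# Succeeds when the available amount ($1) is at least the required amount ($2).
meets_minimum() {
  [ "$1" -ge "$2" ]
}

# Free space, in whole GB, on the filesystem that holds directory $1.
free_disk_gb() {
  df -Pk "$1" | awk 'NR==2 { print int($4 / 1024 / 1024) }'
}

# Total RAM in whole GB; assumes Linux, where /proc/meminfo reports MemTotal in kB.
total_ram_gb() {
  awk '/^MemTotal:/ { print int($2 / 1024 / 1024) }' /proc/meminfo
}

# Usage: preflight [DIR] [MIN_DISK_GB] [MIN_RAM_GB]
# Prints a warning for each resource below its threshold and
# returns non-zero if any of them is low.
preflight() {
  local dir="${1:-.}" need_disk="${2:-100}" need_ram="${3:-30}" low=0
  meets_minimum "$(free_disk_gb "$dir")" "$need_disk" ||
    { echo "warning: less than ${need_disk}GB free in $dir" >&2; low=1; }
  meets_minimum "$(total_ram_gb)" "$need_ram" ||
    { echo "warning: less than ${need_ram}GB of RAM" >&2; low=1; }
  return "$low"
}
```

Calling `preflight wdir` before a run prints a warning for each shortfall and returns a non-zero status. The RAM default is set slightly below 32 because a machine sold as 32GB typically reports a little less in `MemTotal`.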

Requirements

Git
Maven 3

Spotlight model creation

You can use this tool to create DBpedia Spotlight models for your language.

Start a container from the published image:

`docker run -it dbpediaspotlight/model-quickstarter bash`

- If you want to generate the models outside the container, map volumes from host directories to the folders `/model-quickstarter/wdir`, `/model-quickstarter/data` and `/model-quickstarter/models`, e.g.
```
 docker run -v /home/user/data/model/wdir:/model-quickstarter/wdir -v /home/user/data/model/data:/model-quickstarter/data -v /home/user/data/model/models:/model-quickstarter/models -it dbpediaspotlight/model-quickstarter bash
```
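
The three `-v` options above follow one pattern: a host directory mapped to the folder of the same name under `/model-quickstarter`. Below is a small sketch of that pattern. It only prints the command so that it can be inspected before running. The helper name `docker_run_cmd` is hypothetical and not part of this repository, and the sketch assumes the host path contains no spaces.

```shell
# Usage: docker_run_cmd HOST_BASE_DIR
# Prints a `docker run` command that maps wdir, data and models from the
# host base directory into the container. Illustrative helper only.
docker_run_cmd() {
  local base cmd dir
  # With -v, a host directory must be given as an absolute path; any other
  # value is interpreted by Docker as the name of a volume.
  case "$1" in
    /*) ;;
    *) echo "error: host base directory must be an absolute path" >&2
       return 1 ;;
  esac
  base="${1%/}"   # drop one trailing slash, if any
  cmd="docker run"
  for dir in wdir data models; do
    cmd="$cmd -v $base/$dir:/model-quickstarter/$dir"
  done
  echo "$cmd -it dbpediaspotlight/model-quickstarter bash"
}

docker_run_cmd /home/user/data/model
```

The call at the end prints the same command as the example above.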
Inside the container, change into the tool's directory:

`cd model-quickstarter/`

Then copy and paste one of the following commands to start building the model for the corresponding language.

| Language | Language code | Locator code | Analyzer+Stemmer language prefix | Command |
|---|---|---|---|---|
| Catalan | ca | ES | Catalan | `./index_db.sh wdir ca_ES ca/stopwords.list Catalan models/ca` |
| Danish | da | DK | Danish | `./index_db.sh wdir da_DK da/stopwords.list Danish models/da` |
| German | de | DE | German | `./index_db.sh -b de/ignore.list wdir de_DE de/stopwords.list German models/de` |
| English | en | US | English | `./index_db.sh -b en/ignore.list wdir en_US en/stopwords.list English models/en` |
| Spanish | es | ES | Spanish | `./index_db.sh -b es/ignore.list wdir es_ES es/stopwords.list Spanish models/es` |
| Finnish | fi | FI | Finnish | `./index_db.sh wdir fi_FI fi/stopwords.list Finnish models/fi` |
| French | fr | FR | French | `./index_db.sh -b fr/ignore.list wdir fr_FR fr/stopwords.list French models/fr` |
| Hungarian | hu | HU | Hungarian | `./index_db.sh wdir hu_HU hu/stopwords.list Hungarian models/hu` |
| Italian | it | IT | Italian | `./index_db.sh wdir it_IT it/stopwords.list Italian models/it` |
| Lithuanian | lt | LT | Lithuanian | `./index_db.sh wdir lt_LT lt/stopwords.list Lithuanian models/lt` |
| Dutch | nl | NL | Dutch | `./index_db.sh -b nl/ignore.list wdir nl_NL nl/stopwords.list Dutch models/nl` |
| Norwegian | no | NO | Norwegian | `./index_db.sh -b no/ignore.list wdir no_NO no/stopwords.list Norwegian models/no` |
| Portuguese | pt | BR | Portuguese | `./index_db.sh -b pt/ignore.list wdir pt_BR pt/stopwords.list Portuguese models/pt` |
| Romanian | ro | RO | Romanian | `./index_db.sh wdir ro_RO ro/stopwords.list Romanian models/ro` |
| Russian | ru | RU | Russian | `./index_db.sh wdir ru_RU ru/stopwords.list Russian models/ru` |
| Swedish | sv | SE | Swedish | `./index_db.sh -b sv/ignore.list wdir sv_SE sv/stopwords.list Swedish models/sv` |
| Turkish | tr | TR | Turkish | `./index_db.sh -b tr/ignore.list wdir tr_TR tr/stopwords.list Turkish models/tr` |
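
Every command in the table has the same shape. Only the language code, the locator code, the stemmer name and the optional `-b` ignore list change. The sketch below rebuilds a command from those parts and prints it instead of running it. The helper name `index_cmd` is hypothetical, and the reading of `-b` as "pass the language's `ignore.list`" is inferred from the table rows, not from the script's own documentation.

```shell
# Usage: index_cmd LANG LOCATOR STEMMER [ignore]
# Prints the index_db.sh invocation for one language. Illustrative helper only.
index_cmd() {
  local lang="$1" locator="$2" stemmer="$3" opts=""
  # Rows for languages with an ignore list add "-b <lang>/ignore.list".
  if [ "${4:-}" = "ignore" ]; then
    opts=" -b $lang/ignore.list"
  fi
  echo "./index_db.sh$opts wdir ${lang}_${locator} $lang/stopwords.list $stemmer models/$lang"
}

index_cmd ca ES Catalan
index_cmd pt BR Portuguese ignore
```

The two calls print the Catalan and Portuguese commands exactly as they appear in the table. For a language that is not listed, the corresponding `stopwords.list` file and a supported stemmer name would still be needed.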

Datasets

You can find pre-built datasets created using the model-quickstarter here:

Citation

If you use the current (statistical) version of DBpedia Spotlight or the data/models created using this repository, please cite the following paper.

@inproceedings{isem2013daiber,
  title = {Improving Efficiency and Accuracy in Multilingual Entity Extraction},
  author = {Joachim Daiber and Max Jakob and Chris Hokamp and Pablo N. Mendes},
  year = {2013},
  booktitle = {Proceedings of the 9th International Conference on Semantic Systems (I-Semantics)}
}

