
Parse Wikipedia dumps and index (some) page data to Elasticsearch


inhortte/wikiparse

 
 


wikiparse

Imports a Wikipedia XML data dump into Elasticsearch.

Usage

  • Download the pages-articles XML dump from the Wikipedia database dumps page. You want pages-articles.xml.bz2. DO NOT UNCOMPRESS THE BZ2 FILE.
  • From the releases page, download the wikiparse JAR.
  • Run the jar on the BZ2 file: java -Xmx1g -jar wikiparse-0.2.1.jar --es http://localhost:9200 /var/lib/elasticsearch/enwiki-latest-pages-articles.xml.bz2
  • The data will be indexed into an index named en-wikipedia by default; this can be changed with the --index parameter.
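The steps above can be collected into one shell session. The dump URL, jar filename, and Elasticsearch address below are assumptions taken from the examples in this README — adjust them for your setup:

```shell
# Example session for the steps above. The dump URL, jar version, and
# Elasticsearch address are assumptions -- adjust them for your setup.
ES_URL="http://localhost:9200"
INDEX="en-wikipedia"                          # default; override with --index
DUMP="enwiki-latest-pages-articles.xml.bz2"

# 1. Fetch the compressed dump (keep it as .bz2 -- do not uncompress):
#    wget "https://dumps.wikimedia.org/enwiki/latest/${DUMP}"

# 2. Run the wikiparse jar against the dump (raise -Xmx if you have more heap):
#    java -Xmx1g -jar wikiparse-0.2.1.jar --es "$ES_URL" --index "$INDEX" "$DUMP"

# 3. Once indexing finishes, check the document count:
#    curl "${ES_URL}/${INDEX}/_count"

echo "indexing ${DUMP} into ${ES_URL}/${INDEX}"
```

The _count endpoint used in step 3 is a standard Elasticsearch API, so it works regardless of the mapping wikiparse creates.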

License

Wikisample.bz2: Copyright per http://en.wikipedia.org/wiki/Wikipedia:Copyrights

All code and other files: Copyright © 2013 Andrew Cholakian, distributed under the Eclipse Public License, the same as Clojure.


Languages

  • Clojure 100.0%