Wikipedia Parser

This library reads the files generated by https://github.com/mlabs-haskell/wikipedia_parser/

Note: Binary mode vs Text mode open()

The files that make up the Wikipedia parser output contain text, but they must be opened in binary mode. The index used to speed up access to individual articles in these files stores byte offsets. In Python, a file opened in text mode is decoded into characters: read(n) counts characters rather than bytes, and seek() is only defined for positions previously returned by tell(). Operations that use the index will therefore return incorrect data unless the data file is opened in binary mode.

This library takes care of opening the data and index files with the correct flags.
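
The byte-offset issue can be reproduced without this library. The sketch below is only an illustration: the sample articles, the in-memory `index` of `(offset, length)` pairs, and the `read_article` helper are all hypothetical and do not reflect the actual file layout or API of this project.

```python
import os
import tempfile


def read_article(path, offset, length):
    """Read `length` bytes starting at byte `offset` and decode them as UTF-8."""
    with open(path, "rb") as f:  # binary mode: seek/read operate on bytes
        f.seek(offset)
        return f.read(length).decode("utf-8")


# Hypothetical sample data containing multi-byte UTF-8 characters.
articles = ["Zürich is a city.", "東京 is the capital of Japan."]

# Write the articles back to back and record (byte offset, byte length).
index = []
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    for text in articles:
        data = text.encode("utf-8")
        index.append((tmp.tell(), len(data)))
        tmp.write(data)
    path = tmp.name

# Binary mode: the byte offsets from the index select exactly one article.
offset, length = index[1]
second = read_article(path, offset, length)

# Text mode: read(n) counts characters, so a byte length reads too far.
with open(path, "r", encoding="utf-8") as f:
    wrong = f.read(index[0][1])

os.remove(path)
```

Here `second` equals the second article, while `wrong` runs one character past the end of the first article, because "ü" occupies two bytes but counts as a single character in text mode.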