Revert "wikipedia-kyoto-japanese-english: increase REXML entity expansion limit during XML parsing (#198)" (#201)

This reverts commit a76b917.

REXML has fixed the bug where
`REXML::Security.entity_expansion_text_limit` incorrectly calculated
text size in both SAX and pull parsers. Therefore, we no longer need to
handle this issue within Red Datasets.

ref: https://github.com/ruby/rexml/releases/tag/v3.3.5
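The effect of the revert can be sketched as follows: once the installed REXML contains the fix, a stream parse runs with `REXML::Security.entity_expansion_text_limit` left untouched. This is only a minimal illustration, not code from Red Datasets. `TextCollector` and `rexml_has_fix?` are hypothetical names, the sample XML is made up, and the version threshold is taken from the release tag referenced above.

```ruby
require "rexml/document"
require "rexml/parsers/streamparser"
require "rexml/streamlistener"

# Hypothetical listener for illustration only; it is NOT the
# ArticleListener used by Red Datasets.
class TextCollector
  include REXML::StreamListener

  attr_reader :texts

  def initialize
    @texts = []
  end

  # Called by the stream parser for each run of character data.
  def text(data)
    @texts << data
  end
end

# Hypothetical helper: true when the loaded REXML is at least 3.3.5,
# the release referenced in the commit message (assumed threshold).
def rexml_has_fix?
  Gem::Version.new(REXML::VERSION) >= Gem::Version.new("3.3.5")
end

# Made-up sample input containing one predefined entity.
xml = "<articles><article>Kyoto &amp; Nara</article></articles>"

limit_before = REXML::Security.entity_expansion_text_limit
collector = TextCollector.new
# Parse directly, without wrapping the call in a limit-raising helper.
REXML::Parsers::StreamParser.new(xml, collector).parse
limit_after = REXML::Security.entity_expansion_text_limit

puts "REXML #{REXML::VERSION}, fix present: #{rexml_has_fix?}"
puts "limit unchanged: #{limit_before == limit_after}"
```

On a REXML older than 3.3.5, large documents may still hit the limit, so a project relying on this revert would presumably also want a matching lower bound on its `rexml` dependency.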
otegami authored Aug 24, 2024
1 parent 6f2ccf8 commit e3691b9
Showing 1 changed file with 1 addition and 14 deletions.
15 changes: 1 addition & 14 deletions lib/datasets/wikipedia-kyoto-japanese-english.rb
@@ -89,9 +89,7 @@ def each(&block)
               next unless base_name.end_with?(".xml")
               listener = ArticleListener.new(block)
               parser = REXML::Parsers::StreamParser.new(entry.read, listener)
-              with_increased_entity_expansion_text_limit do
-                parser.parse
-              end
+              parser.parse
             when :lexicon
               next unless base_name == "kyoto_lexicon.csv"
               is_header = true
@@ -108,9 +106,6 @@ def each(&block)
     end
 
     private
-
-    ENTITY_EXPANSION_TEXT_LIMIT = 163_840
-
     def download_tar_gz
       base_name = "wiki_corpus_2.01.tar.gz"
       data_path = cache_dir_path + base_name
@@ -119,14 +114,6 @@ def download_tar_gz
       data_path
     end
 
-    def with_increased_entity_expansion_text_limit
-      default_limit = REXML::Security.entity_expansion_text_limit
-      REXML::Security.entity_expansion_text_limit = ENTITY_EXPANSION_TEXT_LIMIT
-      yield
-    ensure
-      REXML::Security.entity_expansion_text_limit = default_limit
-    end
-
     class ArticleListener
       include REXML::StreamListener
 
