wikipedia-kyoto-japanese-english: increase REXML entity expansion limit during XML parsing #198

Merged
17 changes: 15 additions & 2 deletions lib/datasets/wikipedia-kyoto-japanese-english.rb
@@ -88,8 +88,10 @@ def each(&block)
 when :article
   next unless base_name.end_with?(".xml")
   listener = ArticleListener.new(block)
-  parser = REXML::Parsers::StreamParser.new(entry.read, listener)
-  parser.parse
+  with_increased_entity_expansion_text_limit do
+    parser = REXML::Parsers::StreamParser.new(entry.read, listener)
otegami marked this conversation as resolved.
+    parser.parse
+  end
 when :lexicon
   next unless base_name == "kyoto_lexicon.csv"
   is_header = true
@@ -106,6 +108,9 @@ def each(&block)
 end

 private
+
+ENTITY_EXPANSION_TEXT_LIMIT = 163_840
+
 def download_tar_gz
   base_name = "wiki_corpus_2.01.tar.gz"
   data_path = cache_dir_path + base_name
@@ -114,6 +119,14 @@ def download_tar_gz
   data_path
 end

+def with_increased_entity_expansion_text_limit
+  default_limit = REXML::Security.entity_expansion_text_limit
+  REXML::Security.entity_expansion_text_limit = ENTITY_EXPANSION_TEXT_LIMIT
+  yield
+ensure
+  REXML::Security.entity_expansion_text_limit = default_limit
+end
Comment on lines +122 to +128
Member

Hmm. We can use this for now but it seems that REXML::Parsers::StreamParser#entity_expansion_text_limit= should exist for local limitation change. Could you propose it to ruby/rexml?

Member Author

It sounds pretty nice. Sure!

Member Author

I've proposed it at ruby/rexml#192.
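
For readers following this thread, here is a minimal standalone sketch of the save-and-restore pattern the helper in this diff uses. The method name with_entity_expansion_text_limit and its limit parameter are hypothetical and chosen for illustration; only the REXML::Security.entity_expansion_text_limit getter and setter come from the diff. Because that setting is process-wide, every REXML parser sees the raised limit until the block exits, which is the motivation for the per-parser setter suggested above.

```ruby
require "rexml/document"

# Hypothetical standalone variant of the helper in this PR: it takes the
# temporary limit as an argument instead of reading a class constant.
def with_entity_expansion_text_limit(limit)
  default_limit = REXML::Security.entity_expansion_text_limit
  REXML::Security.entity_expansion_text_limit = limit
  yield
ensure
  # Runs even when the block raises, so the global setting is put back.
  REXML::Security.entity_expansion_text_limit = default_limit
end

before = REXML::Security.entity_expansion_text_limit
inside = nil
begin
  with_entity_expansion_text_limit(163_840) do
    inside = REXML::Security.entity_expansion_text_limit
    raise "parse failed" # stand-in for a parser error
  end
rescue RuntimeError
  # By the time we get here, ensure has already restored the limit.
end
after = REXML::Security.entity_expansion_text_limit
```

In this sketch, inside holds the temporary value and after equals before, even though the block raised.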


 class ArticleListener
   include REXML::StreamListener
