converter.py crashes #11

Closed
ramanshah opened this issue Jul 23, 2016 · 5 comments

@ramanshah

The crawler worked beautifully for me, but converter.py crashed.

Found 476 answers
Filename: 2013-02-07 Whats-an-efficient-way-to-overcome-procrastination.html
Traceback (most recent call last):
  File "./quora-backup/converter.py", line 233, in <module>
    document = parser.parse(page_html, encoding='utf-8')
  File "/Users/raman/a/2013-2016-quora/venv/lib/python3.5/site-packages/html5lib/html5parser.py", line 235, in parse
    self._parse(stream, False, None, *args, **kwargs)
  File "/Users/raman/a/2013-2016-quora/venv/lib/python3.5/site-packages/html5lib/html5parser.py", line 85, in _parse
    self.tokenizer = _tokenizer.HTMLTokenizer(stream, parser=self, **kwargs)
  File "/Users/raman/a/2013-2016-quora/venv/lib/python3.5/site-packages/html5lib/_tokenizer.py", line 36, in __init__
    self.stream = HTMLInputStream(stream, **kwargs)
  File "/Users/raman/a/2013-2016-quora/venv/lib/python3.5/site-packages/html5lib/_inputstream.py", line 151, in HTMLInputStream
    return HTMLBinaryInputStream(source, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'encoding'

When I experimented by removing encoding from the **kwargs, I got this:

Found 476 answers
Filename: 2013-02-07 Whats-an-efficient-way-to-overcome-procrastination.html
[WARNING] Unrecognized node
Traceback (most recent call last):
  File "./quora-backup/converter.py", line 279, in <module>
    saved_page.write(serializer.htmlserializer.HTMLSerializer(omit_optional_tags=False).render(walker))
AttributeError: module 'html5lib.serializer' has no attribute 'htmlserializer'

I wonder if maybe it's an html5lib version thing or similar.

Here's my configuration:

(venv) $ python3 --version
Python 3.5.2
(venv) $ pip freeze
html5lib==0.999999999
six==1.10.0
webencodings==0.5

Happy to help triangulate, and while I don't code every weekend, I'd offer a pull request eventually if I can figure out how to get it working.

@ideamonger

Same here... the code is getting out of sync with Quora.

Filename: 2015-07-01 What-are-the-biggest-battles-wars-in-this-millennium.html
Traceback (most recent call last):
  File "/Users/smatei/quora-crawl/converter.py", line 233, in <module>
    document = parser.parse(page_html, encoding='utf-8')
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/html5lib/html5parser.py", line 235, in parse
    self._parse(stream, False, None, *args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/html5lib/html5parser.py", line 85, in _parse
    self.tokenizer = _tokenizer.HTMLTokenizer(stream, parser=self, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/html5lib/_tokenizer.py", line 36, in __init__
    self.stream = HTMLInputStream(stream, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/html5lib/_inputstream.py", line 151, in HTMLInputStream
    return HTMLBinaryInputStream(source, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'encoding'

@jbwl

jbwl commented Sep 4, 2016

+1 please provide a fix

@opoudjis

See Kozea/WeasyPrint#334: change the encoding keyword argument to default_encoding in the parser.parse() call.
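
For anyone who needs converter.py to run against both older and newer html5lib releases, the keyword rename can be wrapped in a small fallback. This is only a sketch: `parse_with_encoding` is a hypothetical helper, not part of html5lib or quora-backup, and it assumes the only difference between the two versions is the name of the keyword (`encoding` versus `default_encoding`).

```python
def parse_with_encoding(parse, data, encoding="utf-8"):
    """Call parse(data, ...) with whichever encoding keyword it accepts.

    `parse` is any callable shaped like html5lib's HTMLParser.parse.
    Hypothetical helper; the name and signature are assumptions.
    """
    try:
        # Newer html5lib: the old `encoding` keyword is rejected,
        # so try `default_encoding` first.
        return parse(data, default_encoding=encoding)
    except TypeError:
        # Older html5lib: only the `encoding` keyword is understood.
        return parse(data, encoding=encoding)
```

In converter.py this would presumably be called as `document = parse_with_encoding(parser.parse, page_html)`. Two caveats: a TypeError raised for an unrelated reason inside the first call also triggers the retry, and `default_encoding` appears to act as a fallback hint rather than a hard override, so the result may not be identical to the old `encoding` behaviour.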

@opoudjis

And I think this is what now works:

#walker = treewalkers.getTreeWalker('dom')(new_page)
try:
    with open(args.output_dir + '/' + filename, 'wb', 0o600) as saved_page:
        #saved_page.write('<!DOCTYPE html>')
        #saved_page.write(serializer.htmlserializer.HTMLSerializer(omit_optional_tags=False).render(walker))
        #saved_page.write(serializer.HTMLSerializer(omit_optional_tags=False).render(walker, 'utf-8'))
        saved_page.write(serializer.serialize(new_page, 'dom', 'utf-8', omit_optional_tags=False))

But that may not be what you intended.
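
If the original HTMLSerializer(...).render(walker) call should be kept rather than switching to serializer.serialize, the missing htmlserializer attribute can be handled by feature detection. A minimal sketch, assuming the class is exposed either directly on html5lib.serializer or on its older htmlserializer submodule; `get_html_serializer_class` is a hypothetical name, not an html5lib function.

```python
def get_html_serializer_class(serializer_module):
    """Return HTMLSerializer from the new or the old module layout.

    `serializer_module` is expected to be `html5lib.serializer`.
    Hypothetical helper; not part of html5lib or quora-backup.
    """
    # Newer layout: the class sits directly on the serializer module.
    cls = getattr(serializer_module, "HTMLSerializer", None)
    if cls is not None:
        return cls
    # Older layout: the class lives in the `htmlserializer` submodule.
    return serializer_module.htmlserializer.HTMLSerializer
```

Usage would look like `HTMLSerializer = get_html_serializer_class(serializer)` followed by `HTMLSerializer(omit_optional_tags=False).render(walker, 'utf-8')`, which matches the second commented-out line above and returns bytes suitable for a file opened in 'wb' mode.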

@t3nsor
Owner

t3nsor commented Sep 12, 2016

Should be fixed by ccab555

@t3nsor t3nsor closed this as completed Sep 12, 2016