Export in CDRv2 format #3
Also remove export of found forms, and do not save pages from other domains.
crawler=self.settings.get('CDR_CRAWLER'),
extracted_metadata={},
extracted_text='\n'.join(
    response.xpath('//body//text()').extract()),
Paul suggested the string() XPath function here: scrapy/parsel#34; I've tried it, and the output is a bit cleaner.
Yes, this is nicer, and less string joining happening, thanks!
Following @kmike's suggestion. This gives cleaner output with fewer extra newlines.
What do you think about making CDR export non-optional and putting all extra info in |
Yeah, I like the idea - I'll update the PR. There is also an optional |
What was previously stored in PageItem and FormItem is now stored in extracted_metadata: is_page, depth, forms.
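To make the new layout concrete, here is a rough sketch of how such an item could be assembled as a plain dict. The helper names `build_extracted_metadata` and `build_item` are hypothetical, and only the fields mentioned in this thread are included; the actual CDRv2 schema and the item class in the PR may define more.

```python
def build_extracted_metadata(is_page, depth, forms):
    """Collect the info that used to live in PageItem and FormItem."""
    return {
        "is_page": is_page,
        "depth": depth,
        "forms": list(forms),
    }


def build_item(url, crawler, extracted_text, extracted_metadata=None):
    """Assemble a CDR-like item using only fields named in this thread."""
    return {
        "url": url,
        "crawler": crawler,
        "extracted_text": extracted_text,
        # An empty dict when there is nothing extra to store.
        "extracted_metadata": extracted_metadata or {},
    }


# Example values are made up for illustration.
item = build_item(
    url="http://example.com/login",
    crawler="example-crawler",
    extracted_text="Sign in",
    extracted_metadata=build_extracted_metadata(
        is_page=True, depth=2, forms=[{"form": "login"}]
    ),
)
print(item["extracted_metadata"])
```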
Yeah, it is not clear what is |
url=url,
text=response.text,
if not self.link_extractor.matches(url):
    return
👍
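A standalone sketch of the same idea as the check above, in case it helps: after redirects the final URL can land on another domain, so it is tested before anything is saved. This uses only the standard library as a stand-in for `self.link_extractor.matches(url)`; the functions `same_domain` and `parse_page` are hypothetical and not part of the PR.

```python
from urllib.parse import urlparse


def same_domain(url, allowed_domain):
    """Return True if url's host is allowed_domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    allowed = allowed_domain.lower()
    return host == allowed or host.endswith("." + allowed)


def parse_page(final_url, allowed_domain):
    """Skip pages that ended up on another domain after redirects."""
    if not same_domain(final_url, allowed_domain):
        return None
    return {"url": final_url}


print(parse_page("http://example.com/a", "example.com"))
print(parse_page("http://other.org/a", "example.com"))
```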
This might be for some structured data from the page, like price of an item, etc. So |
Also remove export of found forms, and do not save pages from other domains (we can get there after following redirects).
I've included only the required fields and left extracted_metadata empty. Also, I did not include the required_timestamp field, since the docs say it should be autogenerated by Elasticsearch.