Remove http headers, and html on webpages() #538

Closed
ruebot opened this issue May 26, 2022 · 1 comment · Fixed by #539

Comments

@ruebot
Member

ruebot commented May 26, 2022

In ARCH we remove HTTP headers and HTML on .webpages(). We should be consistent with that here.

If folks need the content with headers and html, they can grab it from .all().
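To make the difference between the two views concrete, here is a minimal, self-contained sketch of what "removing headers and HTML" could look like for a single raw record. It is an illustration only, not the aut or ARCH implementation: the helper names `strip_http_headers`, `strip_html`, and `webpage_content` are hypothetical, and it assumes the raw record is an HTTP response whose headers end at the first blank line.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_http_headers(raw: str) -> str:
    """Hypothetical helper: drop the HTTP status line and headers, if present."""
    if not raw.startswith("HTTP/"):
        return raw  # assumption: no status line means there are no headers to drop
    parts = re.split(r"\r?\n\r?\n", raw, maxsplit=1)
    # A record with headers but no blank line has no body at all.
    return parts[1] if len(parts) == 2 else ""


def strip_html(body: str) -> str:
    """Hypothetical helper: keep only visible text, with whitespace collapsed."""
    collector = _TextCollector()
    collector.feed(body)
    collector.close()
    return " ".join(" ".join(collector.parts).split())


def webpage_content(raw: str) -> str:
    """Roughly what a cleaned content column would hold for one record."""
    return strip_html(strip_http_headers(raw))


raw_record = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html><head><title>Hi</title><style>p{}</style></head>"
    "<body><p>Hello <b>world</b></p></body></html>"
)
print(webpage_content(raw_record))
```

Under these assumptions, the unmodified `raw_record` corresponds to what someone would read from .all(), and the cleaned string to what .webpages() would expose.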

@ruebot
Member Author

ruebot commented May 26, 2022

...we also remove it here: https://github.com/archivesunleashed/aut/blob/main/src/main/scala/io/archivesunleashed/app/WebPagesExtractor.scala#L48=

So there's an inconsistency as well.

ruebot added a commit that referenced this issue May 27, 2022
- Filter HTTP headers and HTML from content on webpages so that it is consistent with the app and ARCH implementations
- Update PlainTextExtractor to use .all() since HTML is removed from content
- Add domain to all()
- Update CSV exports on app so that they are RFC 4180 compliant
- Apply GitHub workflows to main branch
- Consistent formatting on DataFrameLoader.scala
- Update tests as needed
- Resolves #538
ruebot added a commit that referenced this issue May 30, 2022
- Filter HTTP headers and HTML from content on webpages so that it is consistent with the app and ARCH implementations
- Change all content to raw_content.
- Update PlainTextExtractor to use .all() since HTML is removed from content
- Add domain to all()
- Update CSV exports on app so that they are RFC 4180 compliant
- Apply GitHub workflows to main branch
- Consistent formatting on DataFrameLoader.scala
- Update tests as needed
- Update Apache Spark version in README.
- Resolves #538