Remove http headers, and html on webpages() #538

ruebot · 2022-05-26T18:03:43Z

In ARCH we remove HTTP headers and HTML on .webpages(). We should be consistent here.

If folks need the content with headers and html, they can grab it from .all().
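For illustration, here is a minimal, standalone sketch of what "removing headers and HTML" from a record's content could look like. It is not the aut or ARCH implementation: `remove_http_header`, `remove_html`, and `webpage_content` are hypothetical helper names, the sample record is made up, and only the Python standard library is used.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse all runs of whitespace into single spaces.
        return " ".join(" ".join(self._chunks).split())


def remove_http_header(raw):
    """Drop the HTTP header block (up to the first blank line), if present."""
    if not raw.startswith("HTTP/"):
        return raw
    cuts = [(raw.find(sep), len(sep)) for sep in ("\r\n\r\n", "\n\n")]
    cuts = [(pos, n) for pos, n in cuts if pos != -1]
    if not cuts:
        return raw
    pos, n = min(cuts)
    return raw[pos + n:]


def remove_html(markup):
    """Return only the visible text of an HTML string."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    return parser.text()


def webpage_content(raw):
    """Hypothetical equivalent of the cleaned content: headers and HTML removed."""
    return remove_html(remove_http_header(raw))


if __name__ == "__main__":
    # Made-up record content, as it might appear with headers and HTML intact.
    record = (
        "HTTP/1.1 200 OK\r\n"
        "Content-Type: text/html\r\n"
        "\r\n"
        "<html><body><script>var x = 1;</script>"
        "<p>Hello &amp; welcome</p></body></html>"
    )
    print(webpage_content(record))
```

The unmodified record (headers and HTML intact) is what would remain available from .all(), per the note above.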

ruebot · 2022-05-26T18:06:06Z

...we also remove it here: https://github.com/archivesunleashed/aut/blob/main/src/main/scala/io/archivesunleashed/app/WebPagesExtractor.scala#L48=

So there's an inconsistency as well.

- Filter HTTP headers and HTML from content on webpages so that it is consistent with the app implementation and the ARCH implementation
- Update PlainTextExtractor to use .all() since HTML is removed from content
- Add domain to all()
- Update csv exports on app so that they are rfc4180 compliant
- Apply GitHub workflows to main branch
- Consistent formatting on DataFrameLoader.scala
- Update tests as needed
- Resolves #538

- Filter HTTP headers and HTML from content on webpages so that it is consistent with the app implementation and the ARCH implementation
- Change all content to raw_content.
- Update PlainTextExtractor to use .all() since HTML is removed from content
- Add domain to all()
- Update csv exports on app so that they are rfc4180 compliant
- Apply GitHub workflows to main branch
- Consistent formatting on DataFrameLoader.scala
- Update tests as needed
- Update Apache Spark version in README.
- Resolves #538

ruebot added the enhancement label May 26, 2022

ruebot self-assigned this May 26, 2022

ruebot added bug DataFrames labels May 26, 2022

ruebot mentioned this issue May 27, 2022

Make webpages() consistent across aut and ARCH. #539

Merged

ruebot closed this as completed in #539 May 30, 2022
