Releases: GateNLP/ultimate-sitemap-parser

1.1.1

29 Jan 12:14
1.1.1
95b592f

Bug Fixes

  • Changed log level when a suspected gzipped sitemap can't be un-gzipped from error to warning, since parsing can usually continue (#62 by @redreceipt)
  • Line references in logs now reference the correct location instead of lines within the logging helper file (#63)
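
The first fix above can be pictured with a minimal sketch. This is not the library's code: the helper name maybe_gunzip and the logger name are hypothetical. It tries to decompress a payload that looks gzipped and, if that fails, logs a warning and hands back the raw bytes so the caller can still attempt to parse them.

```python
import gzip
import logging
import zlib

log = logging.getLogger("sitemap_example")  # hypothetical logger name


def maybe_gunzip(data: bytes, url: str) -> bytes:
    """Return the decompressed payload, or the original bytes if un-gzipping fails."""
    try:
        return gzip.decompress(data)
    except (OSError, EOFError, zlib.error) as exc:
        # A warning rather than an error: the payload may simply be plain XML
        # served from a URL ending in .gz, so parsing can usually continue.
        log.warning("Unable to un-gzip response from %s: %s", url, exc)
        return data


if __name__ == "__main__":
    packed = gzip.compress(b"<urlset/>")
    print(maybe_gunzip(packed, "https://example.org/sitemap.xml.gz"))
    print(maybe_gunzip(b"<urlset/>", "https://example.org/sitemap.xml.gz"))
```

Both calls in the usage section return the same XML bytes; only the second one emits a warning.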

1.1.0

20 Jan 14:12
1.1.0
ae51209

New Features

  • Added support for alternate localised pages with hreflang.
  • If an HTTP error is encountered, the contents of the error page are logged at INFO level.
  • Added optional configurable wait time to HTTP request client.
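
For readers unfamiliar with hreflang in sitemaps, the sketch below shows the kind of markup involved and one way to read it with the standard library alone. It does not use this package, and the function name alternates_by_page and the sample URLs are made up for illustration. The library exposes this data through its own page objects.

```python
import xml.etree.ElementTree as ET

# XML namespaces used by the sitemap protocol and by hreflang link elements.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}

# Made-up sample sitemap with two localised alternates for one page.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://example.org/en/</loc>
    <xhtml:link rel="alternate" hreflang="de" href="https://example.org/de/"/>
    <xhtml:link rel="alternate" hreflang="fr" href="https://example.org/fr/"/>
  </url>
</urlset>"""


def alternates_by_page(xml_text: str) -> dict:
    """Map each page URL to a list of (hreflang, href) pairs."""
    root = ET.fromstring(xml_text)
    result = {}
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        links = url.findall("xhtml:link[@rel='alternate']", NS)
        result[loc] = [(link.get("hreflang"), link.get("href")) for link in links]
    return result


if __name__ == "__main__":
    for page, alternates in alternates_by_page(SAMPLE).items():
        print(page, alternates)
```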

1.0.0

13 Jan 11:26
1.0.0
91b343e

Ultimate Sitemap Parser is now maintained by the GATE Team at the School of Computer Science, University of Sheffield. We’d like to thank Linas Valiukas and Hal Roberts for their work on this package, and Paige Gulley for coordinating the transfer of the library.

Breaking Changes

New Features

  • CLI tool to parse and list sitemaps on the command line (see CLI Reference)
  • All sitemap objects now implement a consistent interface, allowing traversal of the tree irrespective of type:
      • All sitemaps now have pages and sub_sitemaps properties, returning their children of that type, or an empty list where not applicable
      • Added all_sitemaps() method to iterate over all descendant sitemaps
  • Pickling page sitemaps now includes page data, which previously was not included as it was swapped to disk
  • Sitemaps and pages now implement to_dict() method to convert to dictionaries (requested in #18)
  • Added optional arguments to usp.tree.sitemap_tree_for_homepage() to disable robots.txt-based or known-path-based sitemap discovery. Default behaviour is still to use both.
  • Parse sitemaps from a string with Local Parsing (requested in #26)
  • Support for the Google Image sitemap extension
  • Add proxy support with RequestsWebClient.set_proxies() (#20 by @tgrandje)
  • Add additional sitemap discovery paths for news sitemaps (d3bdaae)
  • Add parameter to RequestsWebClient.__init__() to disable certificate verification (#37 by @japherwocky)
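
To illustrate why a consistent interface helps, here is a toy model of the idea. ToySitemap is a hypothetical stand-in and not a class from the library; only the member names pages, sub_sitemaps, all_sitemaps() and to_dict() are taken from the notes above, and their exact behaviour in the library may differ.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class ToySitemap:
    """Hypothetical sitemap node: every node has both child lists."""

    url: str
    pages: List[str] = field(default_factory=list)  # empty for index sitemaps
    sub_sitemaps: List["ToySitemap"] = field(default_factory=list)  # empty for page sitemaps

    def all_sitemaps(self) -> Iterator["ToySitemap"]:
        """Yield every descendant sitemap, depth first."""
        for child in self.sub_sitemaps:
            yield child
            yield from child.all_sitemaps()

    def to_dict(self) -> dict:
        """Convert this node and its descendants to plain dictionaries."""
        return {
            "url": self.url,
            "pages": list(self.pages),
            "sub_sitemaps": [child.to_dict() for child in self.sub_sitemaps],
        }


index = ToySitemap(
    url="https://example.org/sitemap_index.xml",
    sub_sitemaps=[
        ToySitemap("https://example.org/sitemap_news.xml", pages=["https://example.org/news/1"]),
        ToySitemap("https://example.org/sitemap_blog.xml", pages=["https://example.org/blog/1"]),
    ],
)

# No type checks needed: index sitemaps simply have an empty pages list.
page_urls = [page for sitemap in index.all_sitemaps() for page in sitemap.pages]
```

Because every node answers the same questions, the traversal in the last line never has to check what kind of sitemap it is looking at.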

Performance

  • Improved parse performance by approximately 90%
  • Optimised lookup of page URLs when checking if duplicate
  • Optimised datetime parse in XML Sitemaps by trying full ISO8601 parsers before the general parser
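
The last two items describe general techniques that can be sketched with the standard library. This is an illustration under assumptions, not the package's implementation: the function names are hypothetical, and the RFC 2822 parser from email.utils stands in for whatever general-purpose parser the library actually falls back to.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Iterable, List, Optional


def parse_sitemap_date(value: str) -> Optional[datetime]:
    """Try the cheap strict ISO 8601 parser first, then a slower general one."""
    value = value.strip()
    # Older Python versions do not accept a trailing "Z" in fromisoformat().
    iso = value[:-1] + "+00:00" if value.endswith("Z") else value
    try:
        return datetime.fromisoformat(iso)
    except ValueError:
        pass
    try:
        # Stand-in fallback parser (assumption): handles RFC 2822 style dates.
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # unparseable dates become None instead of raising


def unique_urls(urls: Iterable[str]) -> List[str]:
    """Drop duplicate URLs in order, using a set for constant-time membership checks."""
    seen = set()
    result = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


if __name__ == "__main__":
    print(parse_sitemap_date("2024-05-01T12:30:00Z"))
    print(parse_sitemap_date("Tue, 10 Aug 2010 20:43:53 +0000"))
    print(parse_sitemap_date("not-a-date"))
```

Since most sitemap dates are already ISO 8601, the fast path handles the common case and the fallback only runs for unusual values.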

Bug Fixes

  • Invalid datetimes will be parsed as None instead of crashing (reported in #22, #31)
  • Invalid priorities will be set to the default (0.5) instead of crashing
  • Moved __version__ attribute into main class module
  • Robots.txt index sitemaps now count for the max recursion depth (reported in #29). The default maximum has been increased by 1 to compensate for this.
  • Remove log configuration so it can be specified at application level (reported in #25, #24 by @dsoprea/@antonialoytorrens-ikaue)
  • Resolve warnings caused by http.HTTPStatus usage (3867b6e)
  • Don’t add InvalidSitemap object if robots.txt is not found (#39 by @gbenson)
  • Fix incorrect lowercasing of URLs discovered in robots.txt (reported in #40, #35 by @ArthurMelin)
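
The priority fix follows a common "fall back to the default" pattern, sketched below. The function name parse_priority is hypothetical, and treating out-of-range values as invalid is an assumption here; the notes above only state that invalid priorities become 0.5.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional

DEFAULT_PRIORITY = Decimal("0.5")  # default named in the release notes


def parse_priority(raw: Optional[str]) -> Decimal:
    """Return the priority as a Decimal, or the default if it is missing or invalid."""
    if raw is None:
        return DEFAULT_PRIORITY
    try:
        priority = Decimal(raw.strip())
        # Assumption: values outside 0.0 to 1.0 count as invalid.
        # Comparing a NaN raises InvalidOperation, so it is caught below too.
        in_range = Decimal("0") <= priority <= Decimal("1")
    except InvalidOperation:
        return DEFAULT_PRIORITY
    return priority if in_range else DEFAULT_PRIORITY


if __name__ == "__main__":
    for raw in ("0.8", "high", "1.7", None):
        print(repr(raw), "->", parse_priority(raw))
```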

1.0.0rc1

18 Dec 11:44
1.0.0rc1
a3b066b
Pre-release
Release 1.0.0rc1