Update beautifulsoup4 to 4.12.3 #783

Open: pyup-bot wants to merge 1 commit into master.

This PR updates beautifulsoup4 from 4.6.0 to 4.12.3.

Changelog

4.11.1

This release was done to ensure that the unit tests are packaged along
with the released source. There are no functionality changes in this
release, but there are a few other packaging changes:

* The Japanese and Korean translations of the documentation are included.
* The changelog is now packaged as CHANGELOG, and the license file is
packaged as LICENSE. NEWS.txt and COPYING.txt are still present,
but may be removed in the future.
* TODO.txt is no longer packaged, since a TODO is not relevant for released
code.

4.11.0

* Ported unit tests to use pytest.

* Added special string classes, RubyParenthesisString and RubyTextString,
to make it possible to treat ruby text specially in get_text() calls.
[bug=1941980]

* It's now possible to customize the way output is indented by
providing a value for the 'indent' argument to the Formatter
constructor. The 'indent' argument works very similarly to the
argument of the same name in the Python standard library's
json.dump() function. [bug=1955497]

* If the charset-normalizer Python module
(https://pypi.org/project/charset-normalizer/) is installed, Beautiful
Soup will use it to detect the character sets of incoming documents.
This is also the module used by newer versions of the Requests library.
For the sake of backwards compatibility, chardet and cchardet both take
precedence if installed. [bug=1955346]

* Added a workaround for an lxml bug
(https://bugs.launchpad.net/lxml/+bug/1948551) that causes
problems when parsing a Unicode string beginning with BYTE ORDER MARK.
[bug=1947768]

* Issue a warning when an HTML parser is used to parse a document that
looks like XML but not XHTML. [bug=1939121]

* Do a better job of keeping track of namespaces as an XML document is
parsed, so that CSS selectors that use namespaces will do the right
thing more often. [bug=1946243]

* Some time ago, the misleadingly named "text" argument to find-type
methods was renamed to the more accurate "string." But this supposed
"renaming" didn't make it into important places like the method
signatures or the docstrings. That's corrected in this
version. "text" still works, but will give a DeprecationWarning.
[bug=1947038]
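
As an illustration of the renamed argument, the sketch below uses 'string' with find_all() and find(). The markup is invented for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="/">Home</a><a href="/about">About</a>', "html.parser"
)

# Find <a> tags whose string content is exactly "About".
links = soup.find_all("a", string="About")

# With no tag name, find() returns the matching string itself.
first_text_node = soup.find(string="Home")

print([a["href"] for a in links], first_text_node)
```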

* Fixed a crash when pickling a BeautifulSoup object that has no
tree builder. [bug=1934003]

* Fixed a crash when overriding multi_valued_attributes and using the
html5lib parser. [bug=1948488]

* Standardized the wording of the MarkupResemblesLocatorWarning
warnings to omit untrusted input and make the warnings less
judgmental about what you ought to be doing. [bug=1955450]

* Removed support for the iconv_codec library, which doesn't seem
to exist anymore and was never put up on PyPI. (The closest
replacement on PyPI, iconv_codecs, is GPL-licensed, so we can't use
it--it's also quite old.)

4.10.0

* This is the first release of Beautiful Soup to only support Python
3. I dropped Python 2 support to maintain support for newer versions
(58 and up) of setuptools. See:
https://github.com/pypa/setuptools/issues/2769 [bug=1942919]

* The behavior of methods like .get_text() and .strings now differs
depending on the type of tag. The change is visible with HTML tags
like <script>, <style>, and <template>. Starting in 4.9.0, methods
like get_text() returned no results on such tags, because the
contents of those tags are not considered 'text' within the document
as a whole.

But a user who calls script.get_text() is working from a different
definition of 'text' than a user who calls div.get_text()--otherwise
there would be no need to call script.get_text() at all. In 4.10.0,
the contents of (e.g.) a <script> tag are considered 'text' during a
get_text() call on the tag itself, but not considered 'text' during
a get_text() call on the tag's parent.

Because of this change, calling get_text() on each child of a tag
may now return a different result than calling get_text() on the tag
itself. That's because different tags now have different
understandings of what counts as 'text'. [bug=1906226] [bug=1868861]
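
The difference described above can be sketched as follows, assuming Beautiful Soup 4.10.0 or later and the html.parser tree builder. The markup is illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div><p>Hello</p><script>var x = 1;</script></div>", "html.parser"
)

# On the parent, the script body is not considered text.
div_text = soup.div.get_text()

# On the <script> tag itself, the script body is the text.
script_text = soup.script.get_text()

# Calling get_text() child by child therefore gives a different result
# than calling it once on the parent.
per_child = "".join(child.get_text() for child in soup.div.children)

print(repr(div_text), repr(script_text), repr(per_child))
```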

* NavigableString and its subclasses now implement the get_text()
method, as well as the properties .strings and
.stripped_strings. These methods will either return the string
itself, or nothing, so the only reason to use this is when iterating
over a list of mixed Tag and NavigableString objects. [bug=1904309]
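
A short sketch of the mixed-iteration case this entry mentions, assuming Beautiful Soup 4.10.0 or later:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello, <b>world</b>!</p>", "html.parser")

# .children yields a NavigableString, a Tag, and another NavigableString.
# All three now answer get_text(), so no isinstance() check is needed.
pieces = [child.get_text() for child in soup.p.children]

print(pieces)
```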

* The 'html5' formatter now treats attributes whose values are the
empty string as HTML boolean attributes. Previously (and in other
formatters), an attribute value must be set as None to be treated as
a boolean attribute. In a future release, I plan to also give this
behavior to the 'html' formatter. Patch by Isaac Muse. [bug=1915424]
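
To illustrate, the sketch below serializes the same tag with the 'html5' and 'minimal' formatters. It assumes Beautiful Soup 4.10.0 or later; the markup is made up.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<option selected=""></option>', "html.parser")

# 'html5' renders an empty-string attribute as a bare boolean attribute.
html5_out = soup.option.decode(formatter="html5")

# Other formatters keep the explicit empty value.
minimal_out = soup.option.decode(formatter="minimal")

print(html5_out)
print(minimal_out)
```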

* The 'replace_with()' method now takes a variable number of arguments,
and can be used to replace a single element with a sequence of elements.
Patch by Bill Chandos. [rev=605]
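
A minimal sketch of replacing one element with a sequence, assuming Beautiful Soup 4.10.0 or later. The tags and strings are invented for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>a <b>bold</b> c</p>", "html.parser")

em = soup.new_tag("em")
em.string = "x"
italic = soup.new_tag("i")
italic.string = "y"

# The single <b> element is replaced by three elements, in order.
soup.b.replace_with(em, " and ", italic)

result = str(soup.p)
print(result)
```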

* Corrected output when the namespace prefix associated with a
namespaced attribute is the empty string, as opposed to
None. [bug=1915583]

* Performance improvement when processing tags that speeds up overall
tree construction by 2%. Patch by Morotti. [bug=1899358]

* Corrected the use of special string container classes in cases when a
single tag may contain strings with different containers; such as
the <template> tag, which may contain both TemplateString objects
and Comment objects. [bug=1913406]

* The html.parser tree builder can now handle named entities
found in the HTML5 spec in much the same way that the html5lib
tree builder does. Note that the lxml HTML tree builder doesn't handle
named entities this way. [bug=1924908]

* Added a second way to specify encodings to UnicodeDammit and
EncodingDetector, based on the order of precedence defined in the
HTML5 spec, starting at:
https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding

Encodings in 'known_definite_encodings' are tried first, then
byte-order-mark sniffing is run, then encodings in 'user_encodings'
are tried. The old argument, 'override_encodings', is now a
deprecated alias for 'known_definite_encodings'.

This changes the default behavior of the html.parser and lxml tree
builders, in a way that may slightly improve encoding
detection but will probably have no effect. [bug=1889014]
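
A sketch of the new argument on UnicodeDammit, assuming Beautiful Soup 4.10.0 or later. The byte string is a made-up example that is not valid UTF-8.

```python
from bs4.dammit import UnicodeDammit

# "café" encoded as Latin-1: b'caf\xe9'
raw = "café".encode("latin-1")

# Encodings listed here are tried before byte-order-mark sniffing.
dammit = UnicodeDammit(raw, known_definite_encodings=["latin-1"])
decoded = dammit.unicode_markup

print(decoded, dammit.original_encoding)
```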

* Improve the warning issued when a directory name (as opposed to
the name of a regular file) is passed as markup into the BeautifulSoup
constructor. [bug=1913628]

4.9.3

* Implemented a significant performance optimization to the process of
searching the parse tree. Patch by Morotti. [bug=1898212]

4.9.2

* Fixed a bug that caused too many tags to be popped from the tag
stack during tree building, when encountering a closing tag that had
no matching opening tag. [bug=1880420]

* Fixed a bug that inconsistently moved elements over when passing
a Tag, rather than a list, into Tag.extend(). [bug=1885710]

* Specify the soupsieve dependency in a way that complies with
PEP 508. Patch by Mike Nerone. [bug=1893696]

* Change the signatures for BeautifulSoup.insert_before and insert_after
(which are not implemented) to match PageElement.insert_before and
insert_after, quieting warnings in some IDEs. [bug=1897120]

4.9.1

* Added a keyword argument 'on_duplicate_attribute' to the
BeautifulSoupHTMLParser constructor (used by the html.parser tree
builder) which lets you customize the handling of markup that
contains the same attribute more than once, as in:
<a href="url1" href="url2"> [bug=1878209]
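
As a sketch, the argument can be passed through the BeautifulSoup constructor when the html.parser tree builder is used. This assumes Beautiful Soup 4.9.1 or later.

```python
from bs4 import BeautifulSoup

markup = '<a href="url1" href="url2">'

# Default behavior: the last value of a duplicated attribute wins.
default_soup = BeautifulSoup(markup, "html.parser")
last_wins = default_soup.a["href"]

# 'ignore' keeps the first value and discards later duplicates.
ignoring_soup = BeautifulSoup(
    markup, "html.parser", on_duplicate_attribute="ignore"
)
first_wins = ignoring_soup.a["href"]

print(last_wins, first_wins)
```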

* Added a distinct subclass, GuessedAtParserWarning, for the warning
issued when BeautifulSoup is instantiated without a parser being
specified. [bug=1873787]

* Added a distinct subclass, MarkupResemblesLocatorWarning, for the
warning issued when BeautifulSoup is instantiated with 'markup' that
actually seems to be a URL or the path to a file on
disk. [bug=1873787]
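
Distinct warning classes make it possible to filter one category without touching the others. A minimal sketch using the standard warnings module, assuming Beautiful Soup 4.9.1 or later:

```python
import warnings

from bs4 import BeautifulSoup, GuessedAtParserWarning

# No parser is named, so Beautiful Soup has to guess and warns about it.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    BeautifulSoup("<p>no parser named</p>")
guessed = [w for w in caught if issubclass(w.category, GuessedAtParserWarning)]

# The same call with that one category ignored.
with warnings.catch_warnings(record=True) as caught_again:
    warnings.simplefilter("always")
    warnings.filterwarnings("ignore", category=GuessedAtParserWarning)
    BeautifulSoup("<p>no parser named</p>")
still_guessed = [
    w for w in caught_again if issubclass(w.category, GuessedAtParserWarning)
]

print(len(guessed), len(still_guessed))
```

MarkupResemblesLocatorWarning can be filtered the same way.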

* The new NavigableString subclasses (Stylesheet, Script, and
TemplateString) can now be imported directly from the bs4 package.

* If you encode a document with a Python-specific encoding like
'unicode_escape', that encoding is no longer mentioned in the final
XML or HTML document. Instead, encoding information is omitted or
left blank. [bug=1874955]

* Fixed test failures when run against soupsieve 2.0. Patch by Tomáš
Chvátal. [bug=1872279]

4.9.0

* Added PageElement.decomposed, a new property which lets you
check whether you've already called decompose() on a Tag or
NavigableString.
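
A short sketch of the new property, assuming Beautiful Soup 4.9.0 or later:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>x</p></div>", "html.parser")
p = soup.p

before = p.decomposed   # the tag is still part of the tree
p.decompose()           # destroy the tag and its contents
after = p.decomposed    # the reference we kept now reports it

print(before, after)
```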

* Embedded CSS and JavaScript are now stored in distinct Stylesheet and
Script strings, which are ignored by methods like get_text() since most
people don't consider this sort of content to be 'text'. This
feature is not supported by the html5lib treebuilder. [bug=1868861]

* Added a Russian translation by 'authoress' to the repository.

* Fixed an unhandled exception when formatting a Tag that had been
decomposed. [bug=1857767]

* Fixed a bug that happened when passing a Unicode filename containing
non-ASCII characters as markup into Beautiful Soup, on a system that
allows Unicode filenames. [bug=1866717]

* Added a performance optimization to PageElement.extract(). Patch by
Arthur Darcet.

4.8.2

* Added Python docstrings to all public methods of the most commonly
used classes.

* Added a Chinese translation by Deron Wang and a Brazilian Portuguese
translation by Cezar Peixeiro to the repository.

* Fixed two deprecation warnings. Patches by Colin
Watson and Nicholas Neumann. [bug=1847592] [bug=1855301]

* The html.parser tree builder now correctly handles DOCTYPEs that are
not uppercase. [bug=1848401]

* PageElement.select() now returns a ResultSet rather than a regular
list, making it consistent with methods like find_all().

4.8.1

* When the html.parser or html5lib parsers are in use, Beautiful Soup
will, by default, record the position in the original document where
each tag was encountered. This includes line number (Tag.sourceline)
and position within a line (Tag.sourcepos).  Based on code by Chris
Mayo. [bug=1742921]
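
A sketch of reading the recorded positions with the html.parser tree builder, assuming Beautiful Soup 4.8.1 or later. With this parser, line numbers start at 1 and positions within a line start at 0.

```python
from bs4 import BeautifulSoup

markup = "<p>\n<b>one</b>\n<i>two</i></p>"
soup = BeautifulSoup(markup, "html.parser")

# Each tag remembers where its opening tag appeared in the source.
positions = {
    tag.name: (tag.sourceline, tag.sourcepos) for tag in soup.find_all(True)
}

print(positions)
```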

* When instantiating a BeautifulSoup object, it's now possible to
provide a dictionary ('element_classes') of the classes you'd like to be
instantiated instead of Tag, NavigableString, etc.
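
A minimal sketch of 'element_classes', assuming Beautiful Soup 4.8.1 or later. MyTag and LoudString are hypothetical subclasses defined only for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


class MyTag(Tag):
    """Hypothetical Tag subclass, used only to show it gets instantiated."""


class LoudString(NavigableString):
    """Hypothetical string subclass with one extra helper method."""

    def shout(self):
        return self.upper()


soup = BeautifulSoup(
    "<p>hello</p>",
    "html.parser",
    element_classes={Tag: MyTag, NavigableString: LoudString},
)

text_node = soup.p.string
print(type(soup.p).__name__, text_node.shout())
```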

* Fixed the definition of the default XML namespace when using
lxml 4.4. Patch by Isaac Muse. [bug=1840141]

* Fixed a crash when pretty-printing tags that were not created
during initial parsing. [bug=1838903]

* Copying a Tag preserves information that was originally obtained from
the TreeBuilder used to build the original Tag. [bug=1838903]

* Raise an explanatory exception when the underlying parser
completely rejects the incoming markup. [bug=1838877]

* Avoid a crash when trying to detect the declared encoding of a
Unicode document. [bug=1838877]

* Avoid a crash when unpickling certain parse trees generated
using html5lib on Python 3. [bug=1843545]

4.8.0

This release focuses on making it easier to customize Beautiful Soup's
input mechanism (the TreeBuilder) and output mechanism (the Formatter).

* You can customize the TreeBuilder object by passing keyword
arguments into the BeautifulSoup constructor. Those keyword
arguments will be passed along into the TreeBuilder constructor.

The main reason to do this right now is to change which
attributes are treated as multi-valued attributes (the way 'class'
is treated by default). You can do this with the
'multi_valued_attributes' argument. [bug=1832978]
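
For example, passing multi_valued_attributes=None turns off the special handling altogether. A sketch assuming Beautiful Soup 4.8.0 or later:

```python
from bs4 import BeautifulSoup

markup = '<p class="a b"></p>'

# By default 'class' is split on whitespace into a list of values.
default_value = BeautifulSoup(markup, "html.parser").p["class"]

# With the keyword passed through to the TreeBuilder, it stays one string.
single_value = BeautifulSoup(
    markup, "html.parser", multi_valued_attributes=None
).p["class"]

print(default_value, single_value)
```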

* The role of Formatter objects has been greatly expanded. The Formatter
class now controls the following:

- The function to call to perform entity substitution. (This was
 previously Formatter's only job.)
- Which tags should be treated as containing CDATA and have their
 contents exempt from entity substitution.
- The order in which a tag's attributes are output. [bug=1812422]
- Whether or not to put a '/' inside a void element, e.g. '<br/>' vs '<br>'

All preexisting code should work as before.
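
Two of the responsibilities listed above can be sketched as follows, assuming Beautiful Soup 4.8.0 or later. SourceOrder is a hypothetical subclass written for this example.

```python
from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter

soup = BeautifulSoup('<p z="1" a="2">x<br>y</p>', "html.parser")


class SourceOrder(HTMLFormatter):
    """Hypothetical formatter that keeps attributes in document order."""

    def attributes(self, tag):
        return tag.attrs.items()


# Default output: attributes sorted by name, '/' inside the void element.
default_out = soup.p.decode()

# Custom output: original attribute order, no '/' in the void element.
custom_out = soup.p.decode(formatter=SourceOrder(void_element_close_prefix=None))

print(default_out)
print(custom_out)
```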

* Added a new method to the API, Tag.smooth(), which consolidates
multiple adjacent NavigableString elements. [bug=1697296]
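
A short sketch of Tag.smooth(), assuming Beautiful Soup 4.8.0 or later:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>A one</p>", "html.parser")

# Appending a string leaves two adjacent NavigableString objects.
soup.p.append(", a two")
count_before = len(soup.p.contents)

# smooth() merges adjacent strings into one.
soup.p.smooth()
count_after = len(soup.p.contents)

print(count_before, count_after, soup.p.string)
```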

* &apos; (which is valid in XML, XHTML, and HTML 5, but not HTML 4) is always
recognized as a named entity and converted to a single quote. [bug=1818721]

4.7.1

* Fixed a significant performance problem introduced in 4.7.0. [bug=1810617]

* Fixed an incorrectly raised exception when inserting a tag before or
after an identical tag. [bug=1810692]

* Beautiful Soup will no longer try to keep track of namespaces that
are not defined with a prefix; this can confuse soupsieve. [bug=1810680]

* Tried even harder to avoid the deprecation warning originally fixed in
4.6.1. [bug=1778909]

4.7.0

* Beautiful Soup's CSS Selector implementation has been replaced by a
dependency on Isaac Muse's SoupSieve project (the soupsieve package
on PyPI). The good news is that SoupSieve has a much more robust and
complete implementation of CSS selectors, resolving a large number
of longstanding issues. The bad news is that from this point onward,
SoupSieve must be installed if you want to use the select() method.

You don't have to change anything if you installed Beautiful Soup
through pip (SoupSieve will be automatically installed when you
upgrade Beautiful Soup) or if you don't use CSS selectors from
within Beautiful Soup.

SoupSieve documentation: https://facelessuser.github.io/soupsieve/
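
A minimal select() call, assuming the soupsieve package is installed alongside Beautiful Soup 4.7.0 or later. The markup and selector are illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div><p class="note">a</p><p>b</p></div><p class="note">c</p>',
    "html.parser",
)

# Child combinator plus class selector, evaluated by SoupSieve.
notes = soup.select("div > p.note")
texts = [p.get_text() for p in notes]

print(texts)
```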

* Added the PageElement.extend() method, which works like list.extend().
[bug=1514970]

* PageElement.insert_before() and insert_after() now take a variable
number of arguments. [bug=1514970]
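
The two entries above can be sketched together, assuming Beautiful Soup 4.7.0 or later. The strings are made up for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>mid</b></p>", "html.parser")

# Several elements can be inserted in one call; order is preserved.
soup.b.insert_before("one ", "two ")
soup.b.insert_after(" three", " four")

# extend() appends every item of an iterable to the tag.
soup.p.extend([" five", " six"])

text = soup.p.get_text()
print(text)
```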

* Fixed a number of problems with the tree builder that produced
trees which looked fine superficially but fell apart when bits
were extracted. Patch by Isaac Muse. [bug=1782928,1809910]

* Fixed a problem with the tree builder in which elements that
contained no content (such as empty comments and all-whitespace
elements) were not being treated as part of the tree. Patch by Isaac
Muse. [bug=1798699]

* Fixed a problem with multi-valued attributes where the value
contained whitespace. Thanks to Jens Svalgaard for the
fix. [bug=1787453]

* Clarified ambiguous license statements in the source code. Beautiful
Soup is released under the MIT license, and has been since 4.4.0.

* This file has been renamed from NEWS.txt to CHANGELOG.

4.6.3

* Exactly the same as 4.6.2. Re-released to make the README file
render properly on PyPI.

4.6.2

* Fix an exception when a custom formatter was asked to format a void
element. [bug=1784408]

4.6.1

* Stop data loss when encountering an empty numeric entity, and
possibly in other cases.  Thanks to tos.kamiya for the fix. [bug=1698503]

* Preserve XML namespaces introduced inside an XML document, not just
the ones introduced at the top level. [bug=1718787]

* Added a new formatter, "html5", which represents void elements
as "<element>" rather than "<element/>".  [bug=1716272]

* Fixed a problem where the html.parser tree builder interpreted
a string like "&foo " as the character entity "&foo;"  [bug=1728706]

* Correctly handle invalid HTML numeric character entities like &#147;
which reference code points that are not Unicode code points. Note
that this is only fixed when Beautiful Soup is used with the
html.parser parser -- html5lib already worked and I couldn't fix it
with lxml.  [bug=1782933]

* Improved the warning given when no parser is specified. [bug=1780571]

* When markup contains duplicate elements, a select() call that
includes multiple match clauses will match all relevant
elements. [bug=1770596]

* Fixed code that was causing deprecation warnings in recent Python 3
versions. Includes a patch from Ville Skyttä. [bug=1778909] [bug=1689496]

* Fixed a Windows crash in diagnose() when checking whether a long
markup string is a filename. [bug=1737121]

* Stopped HTMLParser from raising an exception in very rare cases of
bad markup. [bug=1708831]

* Fixed a bug where find_all() was not working when asked to find a
tag with a namespaced name in an XML document that was parsed as
HTML. [bug=1723783]

* You can get finer control over formatting by subclassing
bs4.element.Formatter and passing a Formatter instance into (e.g.)
encode(). [bug=1716272]

* You can pass a dictionary of `attrs` into
BeautifulSoup.new_tag. This makes it possible to create a tag with
an attribute like 'name' that would otherwise be masked by another
argument of new_tag. [bug=1779276]
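
A sketch of the `attrs` dictionary, assuming Beautiful Soup 4.6.1 or later. Here 'name' is set as an HTML attribute, which a keyword argument could not do because new_tag already uses 'name' for the tag name.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<form></form>", "html.parser")

# 'name' in attrs becomes the HTML attribute, not the tag name.
field = soup.new_tag("input", attrs={"name": "q", "type": "search"})
soup.form.append(field)

print(soup.form)
```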

* Clarified the deprecation warning when accessing tag.fooTag, to cover
the possibility that you might really have been looking for a tag
called 'fooTag'.