Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sseclient 0.0.24 loses events due to Unterminated string... ValueError #28

Closed
xqt opened this issue May 9, 2019 · 18 comments
Closed

sseclient 0.0.24 loses events due to Unterminated string... ValueError #28

xqt opened this issue May 9, 2019 · 18 comments

Comments

@xqt
Copy link
Contributor

xqt commented May 9, 2019

sseclient 0.0.24 fails randomly when json.load is processed at the EventSource's data. Randomly means that you can find this failure pretty sure if you process say 50 events from stream. The problem occures with Python 3.7; I've not tested other Python releases. This problem does not occur with sseclient 0.0.22.

I've have implemented a test suite to check this behavior, see https://gerrit.wikimedia.org/r/#/c/pywikibot/core/+/509132/2/tests/eventstreams_tests.py

Additional details you can find on our bug tracker https://phabricator.wikimedia.org/T222885

@xqt
Copy link
Contributor Author

xqt commented May 9, 2019

Digging deeper into this matter, this failure was introduced with 0.0.23 if using this line
chunk = self.resp.raw._fp.fp.read1(self.chunk_size)
ignoring this and using chunk = self.resp.raw._fp.fp.read1(self.chunk_size) my test passes but the sse socket wasn't closed.

xqt added a commit to xqt/sseclient that referenced this issue May 10, 2019
As noted in issue btubbs#28 json.loads(event.data) fails sometimes with ValueError.
This error was introduced with release 0.0.23. This test will indicate whether
the sseclient implementation works as expected without losing data.

See also https://phabricator.wikimedia.org/T222885

Change-Id: I2d042ca1e17f7ec57ad569ce062413f58e1fcd96
@xqt
Copy link
Contributor Author

xqt commented May 10, 2019

I made a pull request PR#29 to check this ugly behaviour

@OverlordQ
Copy link

Running into this exact same issue as well with read1 not actually reading full events before passing it off to the handler.

@OverlordQ
Copy link

OverlordQ commented Aug 3, 2019

Okay, so I think it comes down to this.

Calling read1 using the raw file-pointer is context agnostic. You say read, it goes gobble gobble and spits back everything it finds.

Problem is, SSE streams require context. If you dump the headers returned from connecting to an SSE Endpoint, you'll probably notice a Transfer-Encoding: chunked header.

What gets pre-pended to every chunk? Size of the chunk.

resp.raw is still a http.client.HTTPResponse object, calling read on that will take into account chunked encoding and strip off the relavent bits and return the actual data.

Using resp.raw._fp.fp.read1 bypasses all of that.

So for example, if you're feeding JSON data through an SSE stream, unless it always matches up perfectly to a chunk, using read1 is going to insert the chunksize and accompanying newlines into the middle of your data, breaking JSON formatting.

There's a reason you shouldn't poke into the guts.

@btubbs
Copy link
Owner

btubbs commented Aug 3, 2019

Thanks for digging into that @OverlordQ.

Looks like we probably need to revert the "short read" change from @mutantmonkey from a few months ago. (See #8 and 8585bac.)

@mutantmonkey
Copy link
Contributor

Regarding Transfer-Encoding: chunked, the Server-Sent Events specification recommends against doing this:

Authors are also cautioned that HTTP chunking can have unexpected negative effects on the reliability of this protocol. Where possible, chunking should be disabled for serving event streams unless the rate of messages is high enough for this not to matter.

If this is the issue, then it sounds like we'll need to handle chunked encoding directly in this library, or detect when it is being used and disable the short read functionality. @OverlordQ Do you happen to have an example SSE endpoint that uses chunked encoding that I would be able to test with?

I understand the concern with poking into the guts, but isn't the purpose of this library to support server-sent events without requiring a low-level understanding of HTTP? There's no other way to properly deal with timely delivery of messages shorter than the specified chunk_size while still using http.client.

@OverlordQ
Copy link

I would guess that the Wikimedia test case @xqt gave will exhibit this, but http://stream.pushshift.io/ was where I was running into this behavior.

And yes while it does say should, it doesn't say must, so it is a case that would to be handled

mutantmonkey added a commit to mutantmonkey/sseclient that referenced this issue Aug 5, 2019
When chunked transfer encoding is used, the size of the chunk is
prepended to each chunk. When calling read1 on the raw file-pointer,
this data is not stripped off and breaks the JSON formatting.
@mutantmonkey
Copy link
Contributor

mutantmonkey commented Aug 5, 2019

@OverlordQ I just pushed a branch that falls back to the old method when chunked encoding is enabled. Can you give it a try and confirm that it fixes your issue? https://github.com/mutantmonkey/sseclient/tree/disable_short_reads_when_chunked

Ideally, chunked encoding support should be implemented directly in this library, but that's going to be more involved. Just disabling short reads when using chunked encoding is a much quicker fix since this behavior is broken.

mutantmonkey added a commit to mutantmonkey/sseclient that referenced this issue Aug 11, 2019
When chunked transfer encoding is used, the size of the chunk is
prepended to each chunk. When calling read1 on the raw file-pointer,
this data is not stripped off and breaks the JSON formatting.
@TheSandDoctor
Copy link
Collaborator

@OverlordQ please see the above?

@sudologic
Copy link

@mutantmonkey @TheSandDoctor

I hit the same issue as @OverlordQ today, and stumbled onto this. The "disable_short_reads_when_chunked" fixed the issue for me.

I can't speak to the internals because I haven't spent the time digging in that Overlord did, because I found this pretty quickly and its working... but its working.

@Shoelace
Copy link
Contributor

Shoelace commented Sep 8, 2020

mutantmonkeys fix works for me also.
i have read that handling chunked encoding is a requirement for a http1.1 client
see https://en.wikipedia.org/wiki/Chunked_transfer_encoding

eino pushed a commit to Extrality/sseclient that referenced this issue Mar 29, 2021
@xqt
Copy link
Contributor Author

xqt commented May 26, 2022

I checked this again with sseclient 0.0.23 and had 136 such issues within 10 minutes. During the same time I runned the same script with sseclient 0.0.22 and had no such error. Seems this is caused by this change.

See also https://phabricator.wikimedia.org/T222885

@eino
Copy link

eino commented May 27, 2022

Yes, I advise to try mutantmonkey's patch (#35), it fixed the issue for me

@xqt
Copy link
Contributor Author

xqt commented May 27, 2022

Yes, I advise to try mutantmonkey's patch (#35), it fixed the issue for me

There are two patches which would solve it but they are pending for 3 years.

@grantcurell
Copy link

Can confirm that this bug is still present in May 2023

@sorensenjs
Copy link

sorensenjs commented Aug 8, 2024

This is still happening in sseclient-0.0.27. Mini repo

import json
import sseclient

for event  in sseclient.SSEClient('https://stream.wikimedia.org/v2/stream/recentchange'):
  if event.event == 'message' and event.data:
    try:
      change = json.loads(event.data)
    except json.JSONDecodeError as err:
      print(event, err)
      raise

Fails with

{"$schema":"/mediawiki/recentchange/1.0.0","meta":{"uri":"https://www.wikidata.org/wiki/Q71160922","request_id":"b17b0516-1468-4893-ac7f-42f402e76f44","id":"f06b24db-9549-46e3-ad47-9e23e3b9738b","dt":"2024-08-08T17:22:54Z","domain":"[www.wiki](http://www.wiki/) Unterminated string starting at: line 1 column 233 (char 232)

This appears to be due to a partial read as the input is incomplete JSON.

@pi3ch
Copy link

pi3ch commented Feb 24, 2025

Yes. Issue still present. Expecting ',' delimiter: line 1 column 4073 (char 4072). Any workaround @sorensenjs @TheSandDoctor

@sorensenjs
Copy link

I abandoned trying to use this sseclient and did the following which worked well enough for me. It seems to me not hard to introduce incomplete chunk reading into the unit tests but decided against investing in that here as the project seems unmaintained.

_FIELD_SEPARATOR = ':'

class SSEClient(object):
  """Minimal SSE client.

  Adapted from https://integrand.io/blog/how-to-consume-sse-python
  """

  def __init__(self, event_source: Generator[bytes, None, None]):
    self._event_source = event_source

  def _read(self):
    data = b''
    for chunk in self._event_source:
     for line in chunk.splitlines(True):
       data += line
       if data.endswith((b'\r\r', b'\n\n', b'\r\n\r\n')):
         yield data
         data = b''
    if data:
      yield data

  def events(self):
    for chunk in self._read():
      event = { 'id': None, 'event': 'message', 'data': '' }
      for raw_line in chunk.splitlines():
        line = raw_line.decode('UTF-8')
        if not line.strip() or line.startswith(_FIELD_SEPARATOR):
          continue
        data = line.split(_FIELD_SEPARATOR, 1)
        field = data[0]
        # Ignore unknown fields.
        if field not in event.keys():
          continue
        value = ''
        if len(data) > 1:
          value = data[1].lstrip()
        if field == 'data':
          event['data'] += value + '\n'
        else:
          event[field] = value
      if not event['data']:
        continue
      yield event

  def close(self) -> None:
    self._event_source.close()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests