ControlSymbol entity code #37

osession · 2024-10-22T19:31:53Z

I'm trying to use this package for one of my projects and ran into an issue where the parser tries to decode a character and gets a ValueError in the Control_Symbol initialization. The character it finds is "\'", so I get: ValueError: invalid literal for int() with base 16: "\'"

So I'm trying to understand better what this Control_Symbol init function is doing. I couldn't understand the purpose of these two lines of code:

    if self.text in "\\{}":
        file.seek(file.tell() - SYMBOL)

Could you please provide an explanation for this part of the code?


fleetingbytes · 2024-10-25T11:33:39Z

Hi, @osession. Thanks for the issue. I am sorry, but I don't understand it myself. 😔 The code is terrible. I wish I had known the Clean Code principles when I was writing this. I have issue #32 open for an API rewrite in clean code, but don't really have much urge to do it. It's been a while since I had to work with RTF files.

The code lines in question should have something to do with how the characters \, {, and } are escaped in RTF. Each of them has to be escaped when you want to write it as a literal: \\, \{, \}. Technically, such sequences fall under the ControlSymbol category. A backslash is how the ControlSymbol entity starts, and it is followed by one byte matching the symbol pattern, unless it is an escaped ANSI character, for which two more bytes have to be read, e.g. \'e1 represents the lowercase letter "á" (a with acute accent).
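As a rough illustration of the two cases described above, here is a minimal sketch. It is not the library's actual code: the helper name `read_control_symbol` and the default `cp1252` code page are assumptions made for this example only.

```python
import io


def read_control_symbol(stream: io.BytesIO, encoding: str = "cp1252") -> str:
    """Hypothetical helper: read one control symbol that starts at a backslash."""
    if stream.read(1) != b"\\":
        raise ValueError("a control symbol must start with a backslash")
    symbol = stream.read(1)
    if symbol == b"'":
        # Escaped ANSI character: the next two bytes are hex digits.
        hex_digits = stream.read(2).decode("ascii")
        return bytes([int(hex_digits, 16)]).decode(encoding)
    # Any other control symbol, such as the escaped literals \\, \{ and \}.
    return symbol.decode("ascii")


escaped_brace = read_control_symbol(io.BytesIO(b"\\{"))
escaped_ansi = read_control_symbol(io.BytesIO(b"\\'e1"))
backslash_as_hex = read_control_symbol(io.BytesIO(b"\\'5c"))
```

Here `escaped_brace` is "{", `escaped_ansi` is "á", and `backslash_as_hex` is a single backslash, which is the unusual encoding discussed further down.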

So, when the symbol is ', self.text is ', and once the two extra bytes for this special case have been read and decoded to a Unicode letter, self.text is replaced with that letter. And that's where we encounter our mysterious if block.

Suppose a literal \, {, or } was not encoded in the RTF like a normal ControlSymbol (\\, \{, \}) but rather like an escaped ANSI character, i.e. \'5c, \'7b, or \'7d. (Who in their sane mind would do this? But I guess I had to deal with the products of exactly such brilliant RTF encoders.) Then self.text will be in r"\{}" (or, as I have written, in "\\{}"), and we move the position at which we read the file back by two bytes (remember, SYMBOL is defined effectively as 2). And this is exactly where I don't understand my code, even after all this painful reading (not reading, decryption). I don't know anymore why we backtrack these two bytes. 🤷🏻‍♂️
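To make the mechanics of that backtrack concrete, here is a small sketch on an in-memory stream. It only shows what the seek does to the read position; it does not explain the original intent. It assumes the stream position sits right after the two hex digits when the if block runs, and it hard-codes SYMBOL as 2 and cp1252 as the code page, both of which are assumptions for this example.

```python
import io

SYMBOL = 2  # assumption: "effectively 2", as stated above

stream = io.BytesIO(b"\\'5c rest")
stream.read(2)                       # consume the backslash and the apostrophe
hex_digits = stream.read(2)          # consume the two hex digits, b"5c"
text = bytes([int(hex_digits.decode("ascii"), 16)]).decode("cp1252")

position_before = stream.tell()      # just past the hex digits
if text in "\\{}":
    # Same operation as the two lines in question: step back SYMBOL bytes.
    stream.seek(stream.tell() - SYMBOL)
position_after = stream.tell()

next_bytes = stream.read(2)          # what the parser would see next
```

Under these assumptions the position moves from 4 back to 2, so the next read returns the hex digits b"5c" again rather than the text that follows them.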

If you find out why, please let me know.

fleetingbytes · 2024-10-25T11:54:06Z

@osession in your case, it looks to me like your RTF contains the byte sequence b"\'\'". After the first \' the program gets into the special case of an escaped ANSI character, reading and decoding the next two bytes, which it expects to be two hexadecimal digits. Alas, instead of those digits it finds another \', which cannot be converted to a base-16 integer, so it throws an error.

I am not sure if \'\' is even a valid RTF byte sequence. One could think it is supposed to be two literal apostrophes, '', but an apostrophe can be encoded as plain text, without escaping. A \' signals that the next two bytes are the hex digits of an escaped ANSI character, but in your case we get another \' instead. This, IMO, is incorrectly encoded RTF.
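The failure can be reproduced without the library, since it comes down to calling int() with base 16 on something that is not a pair of hex digits. The sketch below also shows one possible way to validate the two bytes first so that the error names the real problem. The function `decode_hex_escape` and the cp1252 default are hypothetical and not part of the package.

```python
import string


def decode_hex_escape(hex_digits: bytes, encoding: str = "cp1252") -> str:
    """Hypothetical helper: decode the two bytes that follow \\' in RTF."""
    text = hex_digits.decode("ascii", errors="replace")
    if len(text) != 2 or any(ch not in string.hexdigits for ch in text):
        raise ValueError(f"expected two hex digits after \\', got {text!r}")
    return bytes([int(text, 16)]).decode(encoding)


# The raw conversion fails in the way described above.
try:
    int("\\'", 16)
    raw_conversion_failed = False
except ValueError:
    raw_conversion_failed = True

# A well-formed escape decodes normally.
decoded = decode_hex_escape(b"e1")

# A second \' where the hex digits should be is rejected with a clearer message.
try:
    decode_hex_escape(b"\\'")
    malformed_rejected = False
except ValueError:
    malformed_rejected = True
```

Whether a parser should raise here, skip the sequence, or fall back to treating it as literal text is a design choice that this sketch leaves open.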
