ControlSymbol entity code #37

Open
osession opened this issue Oct 22, 2024 · 2 comments
Comments

@osession

osession commented Oct 22, 2024

I'm trying to use this package for one of my projects and ran into an issue where the parser tries to decode a character and raises a ValueError during the Control_Symbol initialization. The character it finds is "\'", so I get ValueError: invalid literal for int() with base 16: "\'"

So I'm trying to understand what this Control_Symbol init function is doing. I couldn't work out the purpose of these two lines of code:

if self.text in "\\{}":
    file.seek(file.tell() - SYMBOL)

Could you please provide an explanation for this part of the code?

@fleetingbytes
Owner

fleetingbytes commented Oct 25, 2024

Hi, @osession. Thanks for the issue. I am sorry, but I don't understand it myself. 😔 The code is terrible. I wish I had known the Clean Code principles when I was writing this. I have issue #32 open for an API rewrite in clean code, but I don't really have much urge to do it. It's been a while since I had to work with RTF files.

The code lines in question should have something to do with how the characters \, {, and } are escaped in RTF. Each of them has to be escaped when you want to write it as a literal: \\, \{, \}. Technically, such sequences fall under the ControlSymbol category. A backslash is how the ControlSymbol entity starts, and it is followed by one byte matching the symbol pattern, unless it is an escaped ANSI character, for which two more bytes have to be read, e.g. \'e1 represents the lowercase letter "á" (a with acute accent).
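A minimal sketch of how a \'xx escape maps to a character may make this concrete. It is not the package's code: the names `CODEPAGE`, `HEX_ESCAPE`, and `decode_hex_escape` are hypothetical, and the code page is assumed to be Windows-1252 (a real RTF file declares its own).

```python
import re

# Assumption: the document's ANSI code page is Windows-1252.
CODEPAGE = "cp1252"

# Hypothetical pattern: a backslash, an apostrophe, then exactly two hex digits.
HEX_ESCAPE = re.compile(rb"\\'([0-9a-fA-F]{2})")


def decode_hex_escape(raw: bytes) -> str:
    """Decode one escaped ANSI character such as b"\\'e1" to text."""
    match = HEX_ESCAPE.fullmatch(raw)
    if match is None:
        raise ValueError(f"not a \\'xx escape: {raw!r}")
    # Two hex digits encode a single byte in the ANSI code page.
    byte_value = int(match.group(1), 16)
    return bytes([byte_value]).decode(CODEPAGE)


print(decode_hex_escape(rb"\'e1"))  # á
```

The same function returns a backslash for \'5c, which is the unusual encoding discussed in this thread.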

So, when the symbol is ', self.text is ', and once the two extra bytes for this special case have been read and converted (decoded) to a Unicode letter, self.text is replaced with that letter. And that's where we encounter our mysterious if block.

In case a literal \, {, or } was not encoded in the RTF like a normal ControlSymbol (\\, \{, \}) but rather like an escaped ANSI character, i.e. \'5c, \'7b, or \'7d (who in their right mind would do this? But I guess I had to deal with the products of exactly such brilliant RTF encoders), self.text will be in r"\{}" (or, as I have written, in "\\{}"). Then we move the position at which we read the file back by two bytes (remember, SYMBOL is defined effectively as 2). And this is exactly where I don't understand my code, even after all this painful reading (not reading, decryption). I no longer know why we backtrack these two bytes. 🤷🏻‍♂️
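To see what the if block does mechanically (not why), here is a small sketch on an in-memory stream. It is a reconstruction from the description above, not the package's actual code: `read_hex_escape` and `CODEPAGE` are hypothetical, `SYMBOL` is assumed to be 2, and the stream is assumed to be opened in binary mode.

```python
import io

SYMBOL = 2           # assumption: "defined effectively as 2"
CODEPAGE = "cp1252"  # assumption: Windows-1252 ANSI code page


def read_hex_escape(file):
    """Read the two hex digits after \\' and return (text, stream position)."""
    text = bytes([int(file.read(SYMBOL), 16)]).decode(CODEPAGE)
    # "\\{}" and r"\{}" are the same three-character string.
    if text in "\\{}":
        # Move the read position back over the two hex digits just consumed.
        file.seek(file.tell() - SYMBOL)
    return text, file.tell()


stream = io.BytesIO(rb"\'7bX")
stream.read(2)  # consume the leading \'
text, position = read_hex_escape(stream)
# The next read returns the hex digits again, because the stream was rewound.
print(text, position, stream.read(SYMBOL))
```

For \'7b the position ends up back at 2, so whatever reads next sees the digits 7b again. For \'e1 the position stays at 4 and the digits are consumed once.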

If you find out why, please let me know.

@fleetingbytes
Owner

@osession in your case, it looks to me like your RTF contains the byte sequence b"\'\'". After the first \' the program enters the special case of an escaped ANSI character and reads the next two bytes, which it expects to be two hexadecimal digits. Instead of those digits, it finds another \', which cannot be converted to a base-16 integer, so it raises the error.

I am not sure whether \'\' is even a valid RTF byte sequence. One could think it is supposed to be two literal apostrophes, '', but an apostrophe can be encoded as plain text, without escaping. A \' signals that the next two bytes are the hex digits of an escaped ANSI character, but in your case we get another \' instead. This, IMO, is an incorrectly encoded RTF.
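The failure can be reproduced outside the parser, and a tolerant reader could check the two bytes before converting them. The following is only a sketch of that idea, not a proposed patch: `read_hex_byte` and `HEX_DIGITS` are hypothetical names, and a binary stream is assumed.

```python
import io
import string

# Byte values of the characters 0-9, a-f, and A-F.
HEX_DIGITS = set(string.hexdigits.encode("ascii"))


def read_hex_byte(file):
    """Return the value of the next two bytes if both are hex digits, else None."""
    start = file.tell()
    digits = file.read(2)
    if len(digits) == 2 and all(b in HEX_DIGITS for b in digits):
        return int(digits, 16)
    file.seek(start)  # leave the stream untouched for the caller
    return None


# What the parser effectively attempts on the sequence \'\':
try:
    int("\\'", 16)
except ValueError as error:
    print("ValueError:", error)

print(read_hex_byte(io.BytesIO(b"e1")))     # 225
print(read_hex_byte(io.BytesIO(b"\\'e1")))  # None
```

Returning None leaves it to the caller to decide whether to skip the stray \', treat it as a literal apostrophe, or report the file as malformed.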
