Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix special characters causing file creation failure #168

Merged
merged 17 commits into from
Sep 22, 2023

Conversation

Luosiyuan
Copy link
Contributor

Description

Fix a bug where the compressed package contains special characters under the window as directory or file names, resulting in file creation failure.
The location where the bug appears:
Fs:: create for bit7z src internal fileextraccallback.cpp file_ The directories function.
Repair plan:
Using regular expressions to replace special characters with the "" character supported by window can solve the problem.
std::wstring CharacterStandard(const std::wstring& src)
{
std::wstring destChar = src;
//Define Rules
std::wregex illegalCharRegex(L"[<>:"/|?*]");
//Replacing illegal characters with underscores using regular expressions
destChar = std::regex_replace(destChar, illegalCharRegex, L"
");
return destChar.c_str();
}
Remaining issues:

  1. Currently, only the wide character version is available
  2. The special file name was not processed, only special characters were processed. Special file names include:
    The file name cannot start or end with a space, nor can it contain one of the following device names: CON, PRN, AUX, NUL, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, LPT9

Motivation and Context

Some users' compressed packages contain special characters as the file name for the compressed content, which causes an exception to be thrown when creating the file and prevents normal decompression.

How Has This Been Tested?

You can generate compressed packages containing the following window special characters as directory or file names on the Mac platform for decompression.
<(less than sign)

(greater than sign)
: (colon)
"(Double quotes)
/(forward slash)
(backslash)
|(Vertical line)
? (Question mark)
*(asterisk)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

@Luosiyuan
Copy link
Contributor Author

If you need to test the sample compression package, leave a message and I will upload it to the project

@Luosiyuan
Copy link
Contributor Author

@rikyoz

@rikyoz rikyoz self-assigned this Sep 18, 2023
@rikyoz
Copy link
Owner

rikyoz commented Sep 18, 2023

Hi!
Thank you for your pull request!
I took the liberty of pushing some changes to your original proposal.
Apart from some code formatting, I made the normalization use standard algorithms rather than std::regex since this latter has awful performance (e.g., https://quick-bench.com/q/jkVrve2vYOhbLLpvj7jZI6RBcIs - this benchmark uses GCC, but MSVC is not so much different, as far as I know).
I also made the "sanitization" happen only on Windows since the character limitations are only on this platform.

As for the remaining issues:

Currently, only the wide character version is available

I think it's best to use only wide characters, which is the native string type on Windows.

The special file name was not processed, only special characters were processed. Special file names include:
The file name cannot start or end with a space, nor can it contain one of the following device names: CON, PRN, AUX, NUL, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, LPT9

Yeah, I'm evaluating what is the best approach in the case of forbidden names.
Probably, it's better to simply throw an exception if a forbidden name appears in any component of the path of the file.

I'm also evaluating whether to add an option to enable/disable the path "sanitization".

@Luosiyuan
Copy link
Contributor Author

I strongly agree with your approach.

As for the file name issue: I have seen some software solutions for special file names by adding the "_" character before their file names. Processing file names is okay. However, if the directory contains special characters (which just split the path) such as C:\abc\NUL\def, it will be very difficult to handle. In short, these compressed packages cannot be generated on the Windows platform, but there is always a strange thing: compressed packages generated from other platforms are decompressed under the window.

@Luosiyuan
Copy link
Contributor Author

There is another issue, as the special string issue was not fixed from the bottom layer of 7z, if string normalization was performed during decompression, it would result in different file names returned by the callback.

@rikyoz
Copy link
Owner

rikyoz commented Sep 20, 2023

Hi!

As for the file name issue: I have seen some software solutions for special file names by adding the "_" character before their file names. Processing file names is okay. However, if the directory contains special characters (which just split the path) such as C:\abc\NUL\def, it will be very difficult to handle. In short, these compressed packages cannot be generated on the Windows platform, but there is always a strange thing: compressed packages generated from other platforms are decompressed under the window.

Between yesterday and today, I pushed some commits that improve the path sanitization:

  • Now the path sanitization is performed on each component of the path (i.e., the elements between path separators);
  • I added a check for special file names: if any component's name is an invalid Windows name, it is prefixed with a _ character as you suggested (I've seen similar implementations too);
    • For example, a path like Test\\COM0\\hello?world<.txt becomes Test\\_COM0\\hello_world_.txt.
  • I made the path sanitization optional: to enable it, you need to set the CMake option BIT7Z_PATH_SANITIZATION to ON (available only on Windows).

There is another issue, as the special string issue was not fixed from the bottom layer of 7z, if string normalization was performed during decompression, it would result in different file names returned by the callback.

If I understand it correctly, you mean that the FileCallback should report the original "unsanitized" path, right?
If this is the case, I also fixed this.

As a side note, I moved all the sanitization functions to fsutil.hpp/fsutil.cpp.

@rikyoz
Copy link
Owner

rikyoz commented Sep 21, 2023

Just so you know, I've pushed some commits that should fix and improve the path sanitization; I also added some unit tests.
Please let me know if I missed anything and if everything is okay for your use case so I can merge the pull request.

@Luosiyuan
Copy link
Contributor Author

After using the code optimized by the author, it is currently functioning normally without any issues. thanks

@rikyoz
Copy link
Owner

rikyoz commented Sep 22, 2023

Perfect, you're welcome!

@rikyoz rikyoz merged commit 08d2def into rikyoz:master Sep 22, 2023
@Luosiyuan
Copy link
Contributor Author

In fact, I was not detailed enough to examine, but just discovered a small problem, and the special text of the window was incompatible with ‘[’ ‘]’
企业微信截图_16967360357176
image

@rikyoz
Copy link
Owner

rikyoz commented Oct 8, 2023

That is not the final code merged into the master branch, as this was a bug that I fixed in the PR with this commit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants