Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: under windows, librime doesn't support resource_id(file name) with non-ANSI characters, which is supported by code page. #804

Closed
wants to merge 2 commits into from

Conversation

fxliang
Copy link
Contributor

@fxliang fxliang commented Feb 2, 2024

Pull request

Issue tracker

Fixes will automatically close the related issue

Fixes #

under windows, librime doesn't support resource_id(file name) with non-ANSI characters, which is supported by code page.

Feature

Describe feature of pull request

Unit test

  • Done

Manual test

  • Done

Code Review

  1. Unit and manual test pass
  2. GitHub Action CI pass
  3. At least one contributor reviews and votes
  4. Can be merged clean without conflicts
  5. PR will be merged by rebase upstream base

Additional Info

@fxliang
Copy link
Contributor Author

fxliang commented Feb 3, 2024

@lotem
under windows with msvc, path initialized by std::string will get wrong result if there are characters not supported by the current code page of Windows system, but get right result by initializing the path with std::wstring.

here is a sample test

#include <iostream>
#include <string>
#include <filesystem>
#include <windows.h>

static std::wstring U8ToU16(const std::string& str) {
	std::wstring ret;
	int length = static_cast<int>(str.length());
	int convcnt = MultiByteToWideChar(CP_UTF8, 0, str.c_str(), length, NULL, 0);
	if (convcnt > 0) {
		ret.resize(convcnt);
		MultiByteToWideChar(CP_UTF8, 0, str.c_str(), length, &ret[0], convcnt);
	}
	return ret;
}

int main() {
	// with a file name ς.txt in the same directory
	// whereas "ς" is not a character supported by cp936
	std::string xlfn = u8"ς";
	std::cout << "file name no ext: " << xlfn << std::endl;
	std::cout << "-----------------------------------------" << std::endl;
	// path with wstirng
	std::wstring xlfnw = U8ToU16(xlfn);
	std::filesystem::path xlpathw(U8ToU16(".\\") + xlfnw + U8ToU16(".txt"));
	bool chk = std::filesystem::exists(xlpathw);
	std::cout << "path by wstring, file exists ? " << (chk ? "yes" : "no") << std::endl;
	std::cout << "-----------------------------------------" << std::endl;
	// path with stirng
	std::filesystem::path xlpath(".\\" + xlfn + ".txt");
	chk = std::filesystem::exists(xlpath);
	std::cout << "path by string, file exists ? " << (chk ? "yes" : "no") << std::endl;

	return 0;
}

If acceptable, the final solution would be( to be tested )

update: failed and don't know why. : (
update: failed caused by ifstream/istream/stringstream/yaml-cpp and so on.

#ifdef _MSC_VER
static std::wstring U8ToU16(const std::string& str) {
  std::wstring ret;
  int length = static_cast<int>(str.length());
  int convcnt = MultiByteToWideChar(CP_UTF8, 0, str.c_str(), length, NULL, 0);
  if (convcnt > 0) {
    ret.resize(convcnt);
    MultiByteToWideChar(CP_UTF8, 0, str.c_str(), length, &ret[0], convcnt);
  }
  return ret;
}
#endif /* _MSC_VER*/

std::filesystem::path ResourceResolver::ResolvePath(const string& resource_id) {
#ifdef _MSC_VER
  // use std::wstring, avoiding codepage issue
  return std::filesystem::absolute(
      root_path_ /
      std::filesystem::path(U8ToU16(type_.prefix) + U8ToU16(resource_id) +
                            U8ToU16(type_.suffix)));
#else
  return std::filesystem::absolute(
      root_path_ /
      std::filesystem::path(type_.prefix + resource_id + type_.suffix));
#endif
}

std::filesystem::path FallbackResourceResolver::ResolvePath(
    const string& resource_id) {
  auto default_path = ResourceResolver::ResolvePath(resource_id);
  if (!std::filesystem::exists(default_path)) {
#ifdef _MSC_VER
    // use std::wstring, avoiding codepage issue
    auto fallback_path = std::filesystem::absolute(
        fallback_root_path_ /
        std::filesystem::path(U8ToU16(type_.prefix) + U8ToU16(resource_id) +
                              U8ToU16(type_.suffix)));
#else
    auto fallback_path = std::filesystem::absolute(
        fallback_root_path_ /
        std::filesystem::path(type_.prefix + resource_id + type_.suffix));
#endif
    if (std::filesystem::exists(fallback_path)) {
      return fallback_path;
    }
  }
  return default_path;
}

@fxliang fxliang changed the title fix: under Windows(codepage not CP_UTF8), ResolvePath(resource_id) failed in some cases. fix: under windows, librime doesn't support resource_id(file name) with non-ANSI characters, which is supported by code page. Feb 3, 2024
lotem added a commit to lotem/librime that referenced this pull request Feb 3, 2024
Follow @fxliang 's PR, use `u8path` on Windows to convert UTF-8 string
to Windows native path.
Closes rime#804

Fixes rime/weasel#576
Fixes rime/weasel#1080

BREAKING CHANGE: `installation.yaml` should be UTF-8 encoded.

Previouly on Windows, the file can be written in local encoding to
enable paths with non-ASCII characters. It should be updated to UTF-8
after this change.
lotem added a commit to lotem/librime that referenced this pull request Feb 3, 2024
Follow @fxliang 's PR, use `u8path` on Windows to convert UTF-8 string
to Windows native path.
Closes rime#804

Fixes rime/weasel#576
Fixes rime/weasel#1080

BREAKING CHANGE: `installation.yaml` should be UTF-8 encoded.

Previouly on Windows, the file can be written in local encoding to
enable paths with non-ASCII characters. It should be updated to UTF-8
after this change.
lotem added a commit to lotem/librime that referenced this pull request Feb 3, 2024
Follow @fxliang 's PR, use `u8path` on Windows to convert UTF-8 string
to Windows native path.
Closes rime#804

Fixes rime/weasel#576
Fixes rime/weasel#1080

BREAKING CHANGE: `installation.yaml` should be UTF-8 encoded.

Previouly on Windows, the file can be written in local encoding to
enable paths with non-ASCII characters. It should be updated to UTF-8
after this change.
lotem added a commit to lotem/librime that referenced this pull request Feb 3, 2024
Follow @fxliang 's PR, use `u8path` on Windows to convert UTF-8 string
to Windows native path.
Closes rime#804

Fixes rime/weasel#576
Fixes rime/weasel#1080

BREAKING CHANGE: `installation.yaml` should be UTF-8 encoded.

Previouly on Windows, the file can be written in local encoding to
enable paths with non-ASCII characters. It should be updated to UTF-8
after this change.
lotem added a commit to lotem/librime that referenced this pull request Feb 3, 2024
Follow @fxliang 's PR, use `u8path` on Windows to convert UTF-8 string
to Windows native path.
Closes rime#804

Fixes rime/weasel#576
Fixes rime/weasel#1080

BREAKING CHANGE: `installation.yaml` should be UTF-8 encoded.

Previouly on Windows, the file can be written in local encoding to
enable paths with non-ASCII characters. It should be updated to UTF-8
after this change.
lotem added a commit to lotem/librime that referenced this pull request Feb 4, 2024
Follow @fxliang 's PR, use `u8path` on Windows to convert UTF-8 string
to Windows native path.
Closes rime#804

Fixes rime/weasel#576
Fixes rime/weasel#1080

BREAKING CHANGE: `installation.yaml` should be UTF-8 encoded.

Previouly on Windows, the file can be written in local encoding to
enable paths with non-ASCII characters. It should be updated to UTF-8
after this change.
@lotem lotem closed this in #806 Feb 6, 2024
lotem added a commit that referenced this pull request Feb 6, 2024
refactor: convert path to native encoding on Windows

feat(rime_api): provide secure version of path getter functions `RimeApi::get_*_dir_s`. 

Follow @fxliang 's PR, use `u8path` on Windows to convert UTF-8 string
to Windows native path.

Closes #804
Fixes rime/weasel#576
Fixes rime/weasel#1080

BREAKING CHANGE: Most `string` filenames in APIs are changed to `path`;
`installation.yaml` should be UTF-8 encoded.

Previouly on Windows, the file can be written in local encoding to
enable paths with non-ASCII characters. It should be updated to UTF-8
after this change.

Details of the code refactor

Wrap `std::filesystem::path` in a thin wrapper class `rime::path` which calls `std::filesystem::u8path` in the constructor on Windows.

Operator `/=` and `/` are also overloaded to convert the right operand from UTF-8 string to native path.

Follow these rules to apply correct conversion between `string` and `rime::path`:

- construct `rime::path` with UTF-8 encoded string;
- get native string by `path::u8string`;
- to extract UTF-8 string from `path`, for example to find schema ID from file name, call `path::u8string`;
- avoid implicit conversion from string, which results in `std::filesystem::path` without performing UTF-8 to native conversion;
- explicitly construct `rime::path` from `std::filesystem::path` before append operation, to ensure the overloaded operator with string conversion is used.
@fxliang fxliang deleted the acp_resource_id branch March 12, 2024 05:30
graphemecluster pushed a commit to TypeDuck-HK/librime that referenced this pull request Mar 18, 2024
refactor: convert path to native encoding on Windows

feat(rime_api): provide secure version of path getter functions `RimeApi::get_*_dir_s`.

Follow @fxliang 's PR, use `u8path` on Windows to convert UTF-8 string
to Windows native path.

Closes rime#804
Fixes rime/weasel#576
Fixes rime/weasel#1080

BREAKING CHANGE: Most `string` filenames in APIs are changed to `path`;
`installation.yaml` should be UTF-8 encoded.

Previouly on Windows, the file can be written in local encoding to
enable paths with non-ASCII characters. It should be updated to UTF-8
after this change.

Details of the code refactor

Wrap `std::filesystem::path` in a thin wrapper class `rime::path` which calls `std::filesystem::u8path` in the constructor on Windows.

Operator `/=` and `/` are also overloaded to convert the right operand from UTF-8 string to native path.

Follow these rules to apply correct conversion between `string` and `rime::path`:

- construct `rime::path` with UTF-8 encoded string;
- get native string by `path::u8string`;
- to extract UTF-8 string from `path`, for example to find schema ID from file name, call `path::u8string`;
- avoid implicit conversion from string, which results in `std::filesystem::path` without performing UTF-8 to native conversion;
- explicitly construct `rime::path` from `std::filesystem::path` before append operation, to ensure the overloaded operator with string conversion is used.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant