Utf8 support #468

OpportunityLiu · 2019-06-24T06:17:01Z

Rewrite the whole io API with stdio/winapi (for std files) and tbox (for disk files)
- file:read("n") is not supported for disk files
- On Windows, io.read(num) is not guaranteed to return num chars, because it reads num UTF-16 chars and converts them to UTF-8
- Add an opt param to most APIs
```
opt = { continuation = "\\", encoding = "utf16le" }
```
  - encoding is detected automatically if not specified; use "binary" to disable detection and read the file as binary
  - continuation is not supported for binary files; for other files, the char sequence continuation .. "\r\n" or continuation .. "\n" is replaced with an empty char sequence in the read APIs; this option does not affect writing
- Add file:path() and file:size() APIs
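The continuation rule above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the code in this PR: `strip_continuation` is a hypothetical helper name, and it works in place on a buffer that has already been decoded to a byte-oriented encoding such as UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not xmake's actual code): removes every occurrence of
 * `cont` followed by "\r\n" or "\n" from buf[0..len) in place and returns
 * the new length. The buffer is not NUL-terminated by this function. */
size_t strip_continuation(char *buf, size_t len, const char *cont)
{
    size_t clen = strlen(cont);
    size_t r = 0, w = 0;

    assert(clen > 0); /* an empty marker would swallow every newline */

    while (r < len) {
        if (len - r >= clen && memcmp(buf + r, cont, clen) == 0) {
            size_t after = r + clen;
            if (after < len && buf[after] == '\n') {
                r = after + 1; /* cont .. "\n" becomes nothing */
                continue;
            }
            if (after + 1 < len && buf[after] == '\r' && buf[after + 1] == '\n') {
                r = after + 2; /* cont .. "\r\n" becomes nothing */
                continue;
            }
        }
        buf[w++] = buf[r++]; /* ordinary byte: keep it */
    }
    return w;
}
```

With `cont = "\\"`, the 13-byte input `"123\\\r\n456\n789"` shrinks to the 10 bytes `"123456\n789"`; a marker that is not followed by a newline is left untouched.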
Add a .clang-format file for ./core
Add some codepage-related APIs to winos
- winos.ansi_cp() gets the ANSI codepage
- winos.oem_cp() gets the OEM codepage
- winos.cp_info(cp) gets info about a codepage
- winos.console_cp(cp) sets the console codepage; if cp is not provided, it returns the current console codepage
- winos.console_output_cp(cp) sets the console output codepage; if cp is not provided, it returns the current console output codepage
- winos.mbstoutf8(cp, data) converts data in codepage cp to a UTF-8 string
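As background for the io.read(num) caveat in the description: the same number of UTF-16 units can turn into very different UTF-8 lengths. The sketch below only counts bytes and assumes that "UTF-16 chars" means UTF-16 code units; `utf8_len_of_utf16` is a hypothetical name and not part of this PR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of UTF-8 bytes needed to encode n UTF-16 code
 * units. An unpaired surrogate is counted as 3 bytes, as if it were replaced
 * by U+FFFD. */
size_t utf8_len_of_utf16(const uint16_t *s, size_t n)
{
    size_t bytes = 0;

    assert(s != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        uint16_t u = s[i];
        if (u < 0x80) {
            bytes += 1; /* ASCII */
        } else if (u < 0x800) {
            bytes += 2;
        } else if (u >= 0xD800 && u <= 0xDBFF && i + 1 < n &&
                   s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            bytes += 4; /* surrogate pair: two units, one 4-byte sequence */
            i++;
        } else {
            bytes += 3; /* rest of the BMP, e.g. CJK */
        }
    }
    return bytes;
}
```

For example, the three units `'a'`, U+4F60, `'b'` need 5 UTF-8 bytes, while three ASCII units need only 3, so a caller that asks for num units cannot predict the size of the UTF-8 result.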

Add table.to_array method

table.to_array(ipairs({"a", "b"})) -- {{1,"a",n=2},{2,"b",n=2}},2
table.to_array(io.lines("file")) -- {"line 1","line 2", ... , "line n"},n

Add raw* methods and a global method to debug
Fix a bug where a circular reference causes a stack overflow in utils.dump

For xmake developers:

~~temp file named "null" and "nullin" will be used as os.nuldev() for windows~~
print is not available in the ./xmake/core domain; use io.write there instead

core/src/demo/xmake.c

core/src/xmake/io/write.c

xmake/core/sandbox/modules/winos.lua

xmake/plugins/weather/xmake.lua

程序/源文件/你好.c

waruqi · 2019-06-24T12:24:50Z

Need to add a PDC_FORCE_UTF8-style macro in tboox as well

To force wcstombs/mbstowcs over to UTF-8?

waruqi · 2019-06-24T12:27:43Z

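I think TBOOX itself needs to be changed. It uses ANSI everywhere: although it converts to wchar_t and calls the W variants when calling the Win API, what it accepts is still ANSI-encoded char*. Now that I have changed everything to UTF-8 char*, all of these API calls have to be converted all over again...

It is all char* handling: pass UTF-8 and it handles UTF-8, pass ANSI and it handles ANSI. Just convert at the point where it finally becomes a Win API call. If that conversion is the problem, adding the PDC_FORCE_UTF8 you just mentioned to change the behavior of wcstombs/mbstowcs should be enough...

What else still needs converting?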

OpportunityLiu · 2019-06-24T12:30:52Z

The rest is mostly done. The command-line part is rather ugly, but it works.

core/src/xmake/io/readfile.c

core/src/tbox/src/tbox/charset/charset.h

core/src/xmake/machine.c

OpportunityLiu · 2019-06-27T03:04:51Z

Rebased, and switched the base to dev for now.

xmake/core/sandbox/modules/io.lua

CHANGELOG.md

OpportunityLiu · 2019-07-02T06:05:16Z

Once it's about ready, switch over to the dev branch; I've already merged the changes from there into this branch.

waruqi · 2019-07-02T06:06:54Z

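> Once it's about ready, switch over to the dev branch; I've already merged the changes from there into this branch.

You don't need to merge dev for now. I still need to merge this into the utf8 branch first, test it locally, and do some polishing; only after confirming there are no problems will I merge into dev.

I'll do the merge into dev myself later. The changes are too large this time, so I still want to do more tweaking and testing on the utf8 branch.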

waruqi · 2019-07-02T06:31:10Z

tests/modules/io/test.lua

@@ -0,0 +1,80 @@
+function test_read(t)
+    t:are_equal(io.readfile("files/utf8bom-lf-eleof"), "123\\\n456\n789\n")
+    t:are_equal(io.readfile("files/utf8-crlf-neleof"), "123\\\n456\n789")


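So readfile's default behavior now automatically converts \r\n to \n, to stay consistent with unix?

It did that before, too. The C standard requires all string-related functions in <stdio.h> to treat '\n' as the platform-specific newline; that is also the only difference between fopen(.., "r") and fopen(.., "rb").

Now that we read the byte stream ourselves, we of course have to clean this up ourselves as well, otherwise everything comes out wrong.

fopen(.., "r") and fopen(.., "rb") behave identically on platforms where '\n' is the line ending.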
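To make the point about reading the byte stream concrete, here is a minimal sketch of the translation that a Windows text-mode stream performs on input and that a reader working on raw bytes has to redo by hand. `normalize_crlf` is a hypothetical helper, not the function used in this PR, and it assumes the buffer is already in a byte-oriented encoding such as UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: replaces every "\r\n" in buf[0..len) with "\n" in
 * place and returns the new length. A lone '\r' is kept as-is. */
size_t normalize_crlf(char *buf, size_t len)
{
    size_t w = 0;

    assert(buf != NULL || len == 0);

    for (size_t r = 0; r < len; r++) {
        if (buf[r] == '\r' && r + 1 < len && buf[r + 1] == '\n')
            continue; /* drop the '\r'; the '\n' is copied on the next iteration */
        buf[w++] = buf[r];
    }
    return w;
}
```

On input that contains no "\r\n" pairs the function is a no-op, which mirrors the remark that "r" and "rb" do not differ where '\n' is already the line ending.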

OpportunityLiu · 2019-07-02T06:31:18Z

I only merged some of the changes from dev in here, so that merging your utf8 branch in later will be a bit easier.

waruqi · 2019-07-02T06:34:41Z

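> I only merged some of the changes from dev in here, so that merging your utf8 branch in later will be a bit easier.

OK, let's leave it like this for now. From a first look there are no major problems. When I have time tonight I'll pull it down and test it; the remaining code-style touch-ups I'll take care of afterwards.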

waruqi · 2019-07-02T07:11:45Z

ruki:xmake ruki$ xmake l
> {"1"}
< { &1    <-- what is this?
    "1"
  }

core/src/xmake/sandbox/interactive.c

OpportunityLiu · 2019-07-02T07:21:39Z

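> ruki:xmake ruki$ xmake l
> > {"1"}
> < { &1    <-- what is this?
>     "1"
>   }

Take a look at debug.global("io").file. I didn't want to traverse everything twice just to decide which tables get an anchor and which don't, so I add it unconditionally.
The notation follows YAML.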

waruqi · 2019-07-02T07:33:01Z

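> debug.global("io").file

? What was the original purpose of adding dump._print_anchor? Is it only for the indentation position in dump, or does the debug module need it?

In what scenario does dump need this? Is there a relevant reference?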

OpportunityLiu · 2019-07-02T07:51:55Z

> debug.global("io").file

That is exactly the scenario. Without this it goes straight to a stackoverflow.

waruqi · 2019-07-02T08:06:47Z

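> debug.global("io").file
>
> That is exactly the scenario. Without this it goes straight to a stackoverflow.

What does this have to do with _print_anchor &1 &2? I didn't see any explanation on stackoverflow. Do you have any reference links?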

OpportunityLiu · 2019-07-02T08:11:29Z

..I meant a stack overflow exception. An obj with circular references cannot be printed out in full; it can only be represented with anchor-ref notation.

waruqi · 2019-07-02T08:21:15Z

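> ..I meant a stack overflow exception. An obj with circular references cannot be printed out in full; it can only be represented with anchor-ref notation.

But when I print an obj with circular references, isn't it shown as *1? As for the &1 output, I only see a check on first_value; I don't see any check for obj references anywhere.

> a
< { &1   ---> so what is this here? I don't see any obj-reference check in the code; it is just printed at the indentation position for the first value
    1,
    2,
    a = *1   -> shows the circularly referenced obj
  }

Shouldn't it be displayed like this instead:

> a
< { 
    1,
    2,
    a = *1   -> shows the circularly referenced obj
  }

Is it there to mark the position of the reference id that *1 points to?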

OpportunityLiu · 2019-07-02T08:28:48Z

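Go read the YAML documentation; it uses that syntax.

> Is it there to mark the position of the reference id that *1 points to?

Otherwise, how would anyone know which one *1 refers to?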

OpportunityLiu · 2019-07-02T08:30:03Z

waruqi · 2019-07-02T08:50:47Z

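But when there are no obj references at all, &x is still shown everywhere, for example on a plain scalar array. Displaying it that way is confusing and not concise.

It would be better to show the corresponding &x only when the current obj is actually referenced by another obj. That would probably require a traversal up front, though, to preprocess and build a table of reference relationships.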
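The pre-pass idea can be sketched as two traversals. The real dump is pure Lua; the C below is only a language-neutral illustration in which a hypothetical `struct node` stands in for a Lua value, and the names `count_refs` and `dump` are made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_KIDS 4

/* Hypothetical stand-in for a Lua value: a scalar when text != NULL,
 * otherwise a table holding up to MAX_KIDS children. */
struct node {
    const char  *text;
    struct node *kids[MAX_KIDS];
    int          nkids;
    int          refs; /* how many times the pre-pass reached this table */
    int          id;   /* anchor id, 0 while unassigned */
};

/* Pass 1: count how often each table is reached. Not descending again on a
 * revisit is also what stops cycles from recursing forever. */
void count_refs(struct node *n)
{
    if (n->text) return;
    if (++n->refs > 1) return;
    for (int i = 0; i < n->nkids; i++) count_refs(n->kids[i]);
}

/* Pass 2: append the dump to the NUL-terminated string in out. Only tables
 * reached more than once get an "&id" anchor; later visits print "*id". */
void dump(struct node *n, int *next_id, char *out, size_t cap)
{
    size_t used = strlen(out);
    assert(used < cap);

    if (n->text) { snprintf(out + used, cap - used, "%s", n->text); return; }
    if (n->id)   { snprintf(out + used, cap - used, "*%d", n->id); return; }

    if (n->refs > 1) {
        n->id = (*next_id)++;
        snprintf(out + used, cap - used, "{ &%d ", n->id);
    } else {
        snprintf(out + used, cap - used, "{ ");
    }
    for (int i = 0; i < n->nkids; i++) {
        if (i) strncat(out, ", ", cap - strlen(out) - 1);
        dump(n->kids[i], next_id, out, cap);
    }
    strncat(out, " }", cap - strlen(out) - 1);
}
```

With this split, a table that contains itself renders as `{ &1 1, 2, *1 }`, while a plain scalar array renders as `{ 1, 2 }` with no anchor. The cost is the extra traversal that the unconditional `&x` output avoids.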

waruqi · 2019-07-02T08:56:15Z

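That one is fine, though; I can improve the display later. I just tested on Win XP, and dump seems to have a problem: every kind of scalar is shown as nil, and displaying an obj simply hangs.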

OpportunityLiu · 2019-07-02T09:01:58Z

This part is implemented in pure Lua. How can there be a platform-specific problem?

waruqi · 2019-07-02T09:02:44Z

Not sure. I've merged it into utf8 for now and will debug this part tonight.

OpportunityLiu · 2019-07-02T09:03:24Z

Remember to open a dev<-utf8 PR; I'll review it.

OpportunityLiu changed the base branch from master to dev June 24, 2019 06:17