fix: fix romaji conversion bugs
Closes #46, #47
hexenq committed Oct 19, 2018
1 parent 9dd0b0d commit 45a921f
Showing 9 changed files with 152 additions and 38 deletions.
16 changes: 16 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,19 @@
<a name="1.1.2"></a>
## [1.1.2](https://github.com/hexenq/kuroshiro/compare/1.1.1...1.1.2) (2018-10-19)

### Bug Fixes

* fix conversion bug when handling chōon with passport-shiki romanization ([#47](https://github.com/hexenq/kuroshiro/issues/47))
* fix kanji->romaji conversion bug when using nippon-shiki/hepburn-shiki romanization ([#46](https://github.com/hexenq/kuroshiro/issues/46))

### Test

* Update test specification

### Miscellaneous

* Update docs, add notice for romaji conversion

<a name="1.1.1"></a>
## [1.1.1](https://github.com/hexenq/kuroshiro/compare/1.1.0...1.1.1) (2018-08-28)

9 changes: 9 additions & 0 deletions README.jp.md
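@@ -220,6 +220,15 @@ kuroshiro supports three kinds of romanization systems.

For a comparison of the romanization tables, see [this page](http://jgrammar.life.coocan.jp/ja/data/rohmaji2.htm).

### Notice for Romaji Conversion
Furigana does not represent pronunciation accurately, so it is impossible to convert __furigana__ to __romaji__ fully automatically. (See [なぜフリガナではダメなのか?](https://green.adam.ne.jp/roomazi/onamae.html#naze).)

Therefore, when converting furigana (kana) -> romaji with the `nippon` or `hepburn` romanization system, kuroshiro does not process long vowels (chōon). (The `passport` system itself ignores long vowels.)

*For example, converting furigana -> romaji with the `nippon`, `passport`, and `hepburn` romanization systems yields "kousi", "koshi", and "koushi" respectively.*

Kanji -> romaji conversion is not affected by this behavior, whether or not furigana mode is used.

## Contributing
Please check [CONTRIBUTING](CONTRIBUTING.md).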

14 changes: 12 additions & 2 deletions README.md
@@ -127,7 +127,7 @@ Convert given string to target syllabary with options available
__Arguments__

* `str` - A String to be converted.
* `options` - *Optional* kuroshiro has several convert options as below. `romajiSystem` is only applied when the value of param `to` is `romaji`
* `options` - *Optional* kuroshiro has several convert options as below.

| Options | Type | Default | Description |
|---|---|---|---|
@@ -207,7 +207,7 @@ Convert input kana string to hiragana.
Convert input kana string to katakana.

#### kanaToRomaji(str, system)
Convert input kana string to romaji. Param `system` accepts `"nippon"`, `"passport"`, `"hepburn"` (Default: "hepburn")
Convert input kana string to romaji. Param `system` accepts `"nippon"`, `"passport"`, `"hepburn"` (Default: "hepburn").

## Romanization System
kuroshiro supports three kinds of romanization systems.
@@ -220,6 +220,16 @@ kuroshiro supports three kinds of romanization systems.

There is a useful [webpage](http://jgrammar.life.coocan.jp/ja/data/rohmaji2.htm) for you to check the difference between these romanization systems.

### Notice for Romaji Conversion
It is impossible to convert __furigana__ directly to __romaji__ fully automatically, because furigana lacks information on pronunciation (refer to [なぜ フリガナでは ダメなのか?](https://green.adam.ne.jp/roomazi/onamae.html#naze)).

kuroshiro does not handle chōon when converting furigana (kana) -> romaji directly with the `nippon` or `hepburn` romanization system. (The `passport` romanization system ignores chōon.)

*For example, converting the kana "こうし" to romaji with the `nippon`, `passport`, and `hepburn` romanization systems yields "kousi", "koshi", and "koushi" respectively.*

The kanji -> romaji conversion, with or without furigana mode, is __unaffected__ by this logic.
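
The "こうし" example can be reproduced with a small self-contained sketch. It is not kuroshiro's conversion code: the three-entry `KANA_TABLE` and the helper `sketchKanaToRomaji` are hypothetical names introduced here, and the passport step is a simplified assumption that only collapses "ou" and "uu".

```javascript
// Illustrative three-kana table; NOT the library's real conversion table.
const KANA_TABLE = {
    nippon: { "こ": "ko", "う": "u", "し": "si" },
    passport: { "こ": "ko", "う": "u", "し": "shi" },
    hepburn: { "こ": "ko", "う": "u", "し": "shi" },
};

// Hypothetical helper: converts kana one character at a time.
// No chōon handling is applied for nippon/hepburn; passport drops the long vowel.
function sketchKanaToRomaji(kana, system) {
    const table = KANA_TABLE[system];
    if (!table) {
        throw new Error("Invalid Romanization System.");
    }
    let result = "";
    for (const ch of kana) {
        if (!(ch in table)) {
            throw new Error(`Unsupported kana in this sketch: ${ch}`);
        }
        result += table[ch];
    }
    if (system === "passport") {
        // Simplified assumption: collapse long vowels written as "ou" / "uu".
        result = result.replace(/ou/g, "o").replace(/uu/g, "u");
    }
    return result;
}

for (const system of ["nippon", "passport", "hepburn"]) {
    console.log(system, sketchKanaToRomaji("こうし", system));
}
// nippon kousi
// passport koshi
// hepburn koushi
```

The kana alone cannot tell the converter whether "こう" is a long "ō" or a plain "ko" followed by "u", which is why the sketch, like the notice above, leaves "ou" untouched for `nippon` and `hepburn`.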

## Contributing
Please check [CONTRIBUTING](CONTRIBUTING.md).

11 changes: 10 additions & 1 deletion README.zh-cn.md
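@@ -208,7 +208,7 @@ const result = Kuroshiro.Util.isHiragana("あ"));
Convert input kana string to katakana.

#### kanaToRomaji(str, system)
Convert input kana string to romaji. Param `system` accepts `"nippon"`, `"passport"`, `"hepburn"` (Default: "hepburn")
Convert input kana string to romaji. Param `system` accepts `"nippon"`, `"passport"`, `"hepburn"` (Default: "hepburn")

## Romanization System
kuroshiro supports three kinds of romanization systems.
@@ -221,6 +221,15 @@ kuroshiro supports three kinds of romanization systems.

For a quick look at the differences between these romanization systems, see this useful [webpage](http://jgrammar.life.coocan.jp/ja/data/rohmaji2.htm).

### Notice for Romaji Conversion
Fully automatic direct conversion from furigana to romaji is impossible, because furigana generally lacks accurate pronunciation information. See [なぜ フリガナでは ダメなのか?](https://green.adam.ne.jp/roomazi/onamae.html#naze).

Therefore, kuroshiro does not handle long vowels (chōon) when converting furigana -> romaji directly with the `nippon` or `hepburn` romanization system. (The `passport` romanization system itself ignores long vowels.)

*For example, converting the kana "こうし" to romaji with the `nippon`, `passport`, and `hepburn` romanization systems yields "kousi", "koshi", and "koushi" respectively.*

Kanji -> romaji conversion is __unaffected__ by this logic, whether or not furigana mode is used.

## Contributing
Please check [CONTRIBUTING](CONTRIBUTING.md).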

11 changes: 10 additions & 1 deletion README.zh-tw.md
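@@ -208,7 +208,7 @@ const result = Kuroshiro.Util.isHiragana("あ"));
Convert input kana string to katakana.

#### kanaToRomaji(str, system)
Convert input kana string to romaji. Param `system` accepts `"nippon"`, `"passport"`, `"hepburn"` (Default: "hepburn")
Convert input kana string to romaji. Param `system` accepts `"nippon"`, `"passport"`, `"hepburn"` (Default: "hepburn")

## Romanization System
kuroshiro supports three kinds of romanization systems.
@@ -221,6 +221,15 @@ kuroshiro supports three kinds of romanization systems.

For a quick look at the differences between these romanization systems, see this useful [webpage](http://jgrammar.life.coocan.jp/ja/data/rohmaji2.htm).

### Notice for Romaji Conversion
Fully automatic direct conversion from furigana to romaji is impossible, because furigana generally lacks accurate pronunciation information. See [なぜ フリガナでは ダメなのか?](https://green.adam.ne.jp/roomazi/onamae.html#naze).

Therefore, kuroshiro does not handle long vowels (chōon) when converting furigana -> romaji directly with the `nippon` or `hepburn` romanization system. (The `passport` romanization system itself ignores long vowels.)

*For example, converting the kana "こうし" to romaji with the `nippon`, `passport`, and `hepburn` romanization systems yields "kousi", "koshi", and "koushi" respectively.*

Kanji -> romaji conversion is __unaffected__ by this logic, whether or not furigana mode is used.

## Contributing
Please check [CONTRIBUTING](CONTRIBUTING.md).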

2 changes: 1 addition & 1 deletion package.json
@@ -1,6 +1,6 @@
{
"name": "kuroshiro",
"version": "1.1.1",
"version": "1.1.2",
"description": "kuroshiro is a Japanese language library for converting Japanese sentence to Hiragana, Katakana or Romaji with furigana and okurigana modes supported.",
"main": "lib/index.js",
"module": "src/index.js",
31 changes: 16 additions & 15 deletions src/core.js
@@ -1,6 +1,7 @@
import {
ROMANIZATION_SYSTEM,
getStrType,
patchTokens,
isHiragana,
isKatakana,
isKana,
@@ -90,7 +91,8 @@ class Kuroshiro {
throw new Error("Invalid Romanization System.");
}

const tokens = await this._analyzer.parse(str);
const rawTokens = await this._analyzer.parse(str);
const tokens = patchTokens(rawTokens);
for (let cr = 0; cr < tokens.length; cr++) {
if (hasJapanese(tokens[cr].surface_form)) {
if (!tokens[cr].reading) {
@@ -119,9 +121,9 @@ class Kuroshiro {
return tokens.map(token => token.reading).join(" ");
case "romaji":
if (options.mode === "normal") {
return tokens.map(token => toRawRomaji(token.reading, options.romajiSystem)).join("");
return tokens.map(token => toRawRomaji(token.pronunciation || token.reading, options.romajiSystem)).join("");
}
return tokens.map(token => toRawRomaji(token.reading, options.romajiSystem)).join(" ");
return tokens.map(token => toRawRomaji(token.pronunciation || token.reading, options.romajiSystem)).join(" ");
case "hiragana":
for (let hi = 0; hi < tokens.length; hi++) {
if (hasKanji(tokens[hi].surface_form)) {
@@ -171,14 +173,12 @@ class Kuroshiro {
}
}
else if (options.mode === "okurigana" || options.mode === "furigana") {
const notations = []; // [basic,basic_type[1=kanji,2=kana,3=others],notation]
const notations = []; // [basic, basic_type[1=kanji,2=kana,3=others], notation, pronunciation]
for (let i = 0; i < tokens.length; i++) {
tokens[i].reading = toRawHiragana(tokens[i].reading);

const strType = getStrType(tokens[i].surface_form);
switch (strType) {
case 0:
notations.push([tokens[i].surface_form, 1, tokens[i].reading]);
notations.push([tokens[i].surface_form, 1, toRawHiragana(tokens[i].reading), tokens[i].pronunciation || tokens[i].reading]);
break;
case 1:
let pattern = "";
@@ -202,30 +202,31 @@ class Kuroshiro {
}
}
const reg = new RegExp(`^${pattern}$`);
const matches = reg.exec(tokens[i].reading);
const matches = reg.exec(toRawHiragana(tokens[i].reading));
if (matches) {
let pickKanji = 1;
for (let c1 = 0; c1 < subs.length; c1++) {
if (isKanji(subs[c1][0])) {
notations.push([subs[c1], 1, matches[pickKanji++]]);
notations.push([subs[c1], 1, matches[pickKanji], toRawKatakana(matches[pickKanji])]);
pickKanji += 1;
}
else {
notations.push([subs[c1], 2, toRawHiragana(subs[c1])]);
notations.push([subs[c1], 2, toRawHiragana(subs[c1]), toRawKatakana(subs[c1])]);
}
}
}
else {
notations.push([tokens[i].surface_form, 1, tokens[i].reading]);
notations.push([tokens[i].surface_form, 1, toRawHiragana(tokens[i].reading), tokens[i].pronunciation || tokens[i].reading]);
}
break;
case 2:
for (let c2 = 0; c2 < tokens[i].surface_form.length; c2++) {
notations.push([tokens[i].surface_form[c2], 2, tokens[i].reading[c2]]);
notations.push([tokens[i].surface_form[c2], 2, toRawHiragana(tokens[i].reading[c2]), (tokens[i].pronunciation && tokens[i].pronunciation[c2]) || tokens[i].reading[c2]]);
}
break;
case 3:
for (let c3 = 0; c3 < tokens[i].surface_form.length; c3++) {
notations.push([tokens[i].surface_form[c3], 3, tokens[i].surface_form[c3]]);
notations.push([tokens[i].surface_form[c3], 3, tokens[i].surface_form[c3], tokens[i].surface_form[c3]]);
}
break;
default:
@@ -263,14 +264,14 @@ class Kuroshiro {
result += notations[n2][0];
}
else {
result += notations[n2][0] + options.delimiter_start + toRawRomaji(notations[n2][2], options.romajiSystem) + options.delimiter_end;
result += notations[n2][0] + options.delimiter_start + toRawRomaji(notations[n2][3], options.romajiSystem) + options.delimiter_end;
}
}
}
else { // furigana
result += "<ruby>";
for (let n3 = 0; n3 < notations.length; n3++) {
result += `${notations[n3][0]}<rp>${options.delimiter_start}</rp><rt>${toRawRomaji(notations[n3][2], options.romajiSystem)}</rt><rp>${options.delimiter_end}</rp>`;
result += `${notations[n3][0]}<rp>${options.delimiter_start}</rp><rt>${toRawRomaji(notations[n3][3], options.romajiSystem)}</rt><rp>${options.delimiter_end}</rp>`;
}
result += "</ruby>";
}
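
The romaji branches above now read `token.pronunciation || token.reading` instead of `token.reading`. A minimal sketch of that fallback follows, assuming analyzer tokens that carry a katakana `reading` and, for known words, a `pronunciation` in which long vowels are written with "ー". The token values and the helper name `kanaForRomaji` are illustrative assumptions, not taken from the library.

```javascript
// Hypothetical tokens shaped like the analyzer output consumed in src/core.js.
const sampleTokens = [
    // Known word: the pronunciation marks the long vowels explicitly.
    { surface_form: "東京", reading: "トウキョウ", pronunciation: "トーキョー" },
    // Token without a pronunciation field: only the reading is available.
    { surface_form: "キョウト", reading: "キョウト" },
];

// Hypothetical helper mirroring the expression used in the diff:
// prefer the pronunciation, fall back to the reading when it is missing.
function kanaForRomaji(token) {
    return token.pronunciation || token.reading;
}

console.log(sampleTokens.map(kanaForRomaji).join(" "));
// トーキョー キョウト
```

Passing the pronunciation to the romaji converter gives it the long-vowel mark that the reading alone does not contain, while the fallback keeps tokens without a pronunciation convertible.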
54 changes: 47 additions & 7 deletions src/util.js
@@ -1396,6 +1396,7 @@ const toRawRomaji = function (str, system) {
}
}

// [ALL] kana to roman chars
const max = str.length;
while (pnt <= max) {
if (r = romajiSystem[system][str.substring(pnt, pnt + 2)]) {
@@ -1423,13 +1424,6 @@ const toRawRomaji = function (str, system) {
result = result.replace(/np/gm, "mp");
}

// [PASSPORT] omit long vowels (other cases)
if (system === ROMANIZATION_SYSTEM.PASSPORT) {
result = result.replace(/uu/gm, "u");
result = result.replace(/ou/gm, "o");
result = result.replace(/oo(?!$)/gm, "o");
}

// [NIPPON] long vowel conversion
if (system === ROMANIZATION_SYSTEM.NIPPON) {
result = result.replace(/a/gm, "â");
@@ -1474,6 +1468,51 @@ const getStrType = function (str) {
return 3;
};

/**
* Patch tokens for conversion
* @param {Object} tokens Given tokens
* @return {Object} Patched tokens
*/
const patchTokens = function (tokens) {
// patch for the auxiliary verb (助動詞) "う" after a verb (動詞)
for (let i = 0; i < tokens.length; i++) {
if (tokens[i].pos && tokens[i].pos === "助動詞" && (tokens[i].surface_form === "う" || tokens[i].surface_form === "ウ")) {
if (i - 1 >= 0 && tokens[i - 1].pos && tokens[i - 1].pos === "動詞") {
tokens[i - 1].surface_form += "う";
if (tokens[i - 1].pronunciation) {
tokens[i - 1].pronunciation += "ー";
}
else {
tokens[i - 1].pronunciation = `${tokens[i - 1].reading}ー`;
}
tokens[i - 1].reading += "ウ";
tokens.splice(i, 1);
i--;
}
}
}

// patch for "っ" at the tail of a verb (動詞) or adjective (形容詞)
for (let j = 0; j < tokens.length; j++) {
if (tokens[j].pos && (tokens[j].pos === "動詞" || tokens[j].pos === "形容詞") && tokens[j].surface_form.length > 1 && (tokens[j].surface_form[tokens[j].surface_form.length - 1] === "っ" || tokens[j].surface_form[tokens[j].surface_form.length - 1] === "ッ")) {
if (j + 1 < tokens.length && tokens[j + 1].pos && (tokens[j + 1].pos === "動詞" || tokens[j + 1].pos === "助動詞")) {
tokens[j].surface_form += tokens[j + 1].surface_form;
if (tokens[j].pronunciation) {
tokens[j].pronunciation += tokens[j + 1].pronunciation;
}
else {
tokens[j].pronunciation = `${tokens[j].reading}${tokens[j + 1].reading}`;
}
tokens[j].reading += tokens[j + 1].reading;
tokens.splice(j + 1, 1);
j--;
}
}
}

return tokens;
};
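
The first patch in `patchTokens` can be illustrated in isolation. The sketch below is a non-mutating variant written for this example: the helper name `mergeVolitionalU` and the sample tokens for "行こう" are assumptions, and the real function edits the token array in place with `splice`.

```javascript
// Hypothetical, non-mutating variant of the first patchTokens step:
// fold an auxiliary "う"/"ウ" into the preceding verb token.
function mergeVolitionalU(tokens) {
    const out = [];
    for (const token of tokens) {
        const prev = out[out.length - 1];
        const isAuxU = token.pos === "助動詞"
            && (token.surface_form === "う" || token.surface_form === "ウ");
        if (isAuxU && prev && prev.pos === "動詞") {
            out[out.length - 1] = {
                ...prev,
                surface_form: `${prev.surface_form}う`,
                // The reading keeps the kana spelling ...
                reading: `${prev.reading}ウ`,
                // ... while the pronunciation records a long vowel.
                pronunciation: `${prev.pronunciation || prev.reading}ー`,
            };
        }
        else {
            out.push({ ...token });
        }
    }
    return out;
}

// Assumed analyzer output for "行こう": verb stem + auxiliary "う".
const merged = mergeVolitionalU([
    { surface_form: "行こ", pos: "動詞", reading: "イコ", pronunciation: "イコ" },
    { surface_form: "う", pos: "助動詞", reading: "ウ", pronunciation: "ウ" },
]);

console.log(merged.length, merged[0].surface_form, merged[0].reading, merged[0].pronunciation);
// 1 行こう イコウ イコー
```

After the merge, the single token carries both spellings, so kana output can use `reading` while romaji output can use `pronunciation` and see the long vowel.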

/**
* Convert kana to hiragana
*
@@ -1509,6 +1548,7 @@ export {
// language
ROMANIZATION_SYSTEM,
getStrType,
patchTokens,
isHiragana,
isKatakana,
isKana,