cutlet
Cutlet is a tool to convert Japanese to romaji. Check out the interactive demo! Also see the docs and the original blog post.
(You do not need to write issues in English.)
Features:
- support for Modified Hepburn, Kunreisiki, and Nihonsiki systems
- custom overrides for individual mappings
- custom overrides for specific words
- built-in exceptions list (Tokyo, Osaka, etc.)
- uses foreign spelling when available in UniDic
- proper nouns are capitalized
- slug mode for URL generation
Things not supported:
- traditional Hepburn n-to-m: Shimbashi
- macrons or circumflexes: Tōkyō, Tôkyô
- passport Hepburn: Satoh (but you can use an exception)
- hyphenating words
- traditional Hepburn in general
Internally, cutlet uses fugashi, so you can use the same dictionary you use for normal tokenization.
Installation
Cutlet can be installed through pip as usual.
```
pip install cutlet
```
Note that if you don't have a MeCab dictionary installed you'll also have to install one. If you're just getting started unidic-lite is a good choice.
```
pip install unidic-lite
```
Usage
A command-line script is included for quick testing. Just run `cutlet`, and each line of stdin will be treated as a sentence. You can specify the system to use (`hepburn`, `kunrei`, `nippon`, or `nihon`) as the first argument.
```
$ cutlet
ローマ字変換プログラム作ってみた。
Roma ji henkan program tsukutte mita.
```
In code:
```python
import cutlet

katsu = cutlet.Cutlet()
katsu.romaji("カツカレーは美味しい")
# => 'Cutlet curry wa oishii'

# you can print a slug suitable for urls
katsu.slug("カツカレーは美味しい")
# => 'cutlet-curry-wa-oishii'

# you can disable using foreign spelling too
katsu.use_foreign_spelling = False
katsu.romaji("カツカレーは美味しい")
# => 'Katsu karee wa oishii'

# kunreisiki, nihonsiki work too
katu = cutlet.Cutlet('kunrei')
katu.romaji("富士山")
# => 'Huzi yama'

# comparison
nkatu = cutlet.Cutlet('nihon')

sent = "彼女は王への手紙を読み上げた。"
katsu.romaji(sent)
# => 'Kanojo wa ou e no tegami wo yomiageta.'
katu.romaji(sent)
# => 'Kanozyo wa ou e no tegami o yomiageta.'
nkatu.romaji(sent)
# => 'Kanozyo ha ou he no tegami wo yomiageta.'
```
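The slug output shown above is a thin post-processing layer over `romaji`: the result is lowercased, runs of non-alphanumeric characters collapse to a single hyphen, and leading/trailing hyphens are stripped. A minimal standalone sketch of just that step (plain stdlib, with the romaji string supplied by hand rather than produced by cutlet):

```python
import re

def slugify(roma):
    # lowercase, collapse each run of non-alphanumerics to one hyphen,
    # then strip hyphens at the ends -- mirrors what Cutlet.slug does
    # after the romaji conversion itself
    return re.sub(r"[^a-z0-9]+", "-", roma.lower()).strip("-")

slugify("Cutlet curry wa oishii!")
# => 'cutlet-curry-wa-oishii'
```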
For reference, the full `Cutlet` class (module-level helpers like `SYSTEMS`, `Token`, `load_exceptions`, and `normalize_text` are defined elsewhere in the file):

````python
class Cutlet:
    def __init__(
        self,
        system="hepburn",
        use_foreign_spelling=True,
        ensure_ascii=True,
        mecab_args="",
    ):
        """Create a Cutlet object, which holds configuration as well as
        tokenizer state.

        `system` is `hepburn` by default, and may also be `kunrei` or
        `nihon`. `nippon` is permitted as a synonym for `nihon`.

        If `use_foreign_spelling` is true, output will use the foreign spelling
        provided in a UniDic lemma when available. For example, "カツ" will
        become "cutlet" instead of "katsu".

        If `ensure_ascii` is true, any non-ASCII characters that can't be
        romanized will be replaced with `?`. If false, they will be passed
        through.

        Typical usage:

        ```python
        katsu = Cutlet()
        roma = katsu.romaji("カツカレーを食べた")
        # "Cutlet curry wo tabeta"
        ```
        """
        # allow 'nippon' for 'nihon'
        if system == "nippon":
            system = "nihon"
        self.system = system
        try:
            # make a copy so we can modify it
            self.table = dict(SYSTEMS[system])
        except KeyError:
            print("unknown system: {}".format(system))
            raise

        self.tagger = fugashi.Tagger(mecab_args)
        self.exceptions = load_exceptions()

        # these are too minor to be worth exposing as arguments
        self.use_tch = self.system in ("hepburn",)
        self.use_wa = self.system in ("hepburn", "kunrei")
        self.use_he = self.system in ("nihon",)
        self.use_wo = self.system in ("hepburn", "nihon")

        self.use_foreign_spelling = use_foreign_spelling
        self.ensure_ascii = ensure_ascii

    def add_exception(self, key, val):
        """Add an exception to the internal list.

        An exception overrides a whole token, for example to replace "Toukyou"
        with "Tokyo". Note that it must match the tokenizer output and be a
        single token to work. To replace longer phrases, you'll need to use a
        different strategy, like string replacement.
        """
        self.exceptions[key] = val

    def update_mapping(self, key, val):
        """Update mapping table for a single kana.

        This can be used to mix common systems, or to modify particular
        details. For example, you can use `update_mapping("ぢ", "di")` to
        differentiate ぢ and じ in Hepburn.

        Example usage:

        ```
        cut = Cutlet()
        cut.romaji("お茶漬け")  # Ochazuke
        cut.update_mapping("づ", "du")
        cut.romaji("お茶漬け")  # Ochaduke
        ```
        """
        self.table[key] = val

    def slug(self, text):
        """Generate a URL-friendly slug.

        After converting the input to romaji using `Cutlet.romaji` and making
        the result lower-case, any runs of non alpha-numeric characters are
        replaced with a single hyphen. Any leading or trailing hyphens are
        stripped.
        """
        roma = self.romaji(text).lower()
        slug = re.sub(r"[^a-z0-9]+", "-", roma).strip("-")
        return slug

    def romaji_tokens(self, words, capitalize=True, title=False):
        """Build a list of tokens from input nodes.

        If `capitalize` is true, then the first letter of the first token will
        be capitalized. This is typically the desired behavior if the input is
        a complete sentence.

        If `title` is true, then words will be capitalized as in a book title.
        This means most words will be capitalized, but some parts of speech
        (particles, endings) will not.

        If the text was not normalized before being tokenized, the output is
        undefined. For details of normalization, see `normalize_text`.

        The number of output tokens will equal the number of input nodes.
        """

        out = []

        for wi, word in enumerate(words):
            po = out[-1] if out else None
            pw = words[wi - 1] if wi > 0 else None
            nw = words[wi + 1] if wi < len(words) - 1 else None

            # handle possessive apostrophe as a special case
            if (
                word.surface == "'"
                and (nw and nw.char_type == CHAR_ALPHA and not nw.white_space)
                and not word.white_space
            ):
                # remove preceding space
                if po:
                    po.space = False
                out.append(Token(word.surface, False))
                continue

            # resolve split verbs / adjectives
            roma = self.romaji_word(word)
            if roma and po and po.surface and po.surface[-1] == "っ":
                po.surface = po.surface[:-1] + roma[0]
            if word.feature.pos2 == "固有名詞":
                roma = roma.title()
            if (
                title
                and word.feature.pos1 not in ("助詞", "助動詞", "接尾辞")
                and not (pw and pw.feature.pos1 == "接頭辞")
            ):
                roma = roma.title()

            foreign = self.use_foreign_spelling and has_foreign_lemma(word)
            tok = Token(roma, False, foreign)
            # handle punctuation with atypical spacing
            if word.surface in "「『":
                if po:
                    po.space = True
                out.append(tok)
                continue
            if roma in "([":
                if po:
                    po.space = True
                out.append(tok)
                continue
            if roma == "/":
                out.append(tok)
                continue

            # preserve spaces between ascii tokens
            if word.surface.isascii() and nw and nw.surface.isascii():
                use_space = bool(nw.white_space)
                out.append(Token(word.surface, use_space))
                continue

            out.append(tok)

            # no space sometimes
            # お酒 -> osake
            if word.feature.pos1 == "接頭辞":
                continue
            # 今日、 -> kyou, ; 図書館 -> toshokan
            if nw and nw.feature.pos1 in ("補助記号", "接尾辞"):
                continue
            # special case for half-width commas
            if nw and nw.surface == ",":
                continue
            # special case for prefixes
            if foreign and roma[-1] == "-":
                continue
            # 思えば -> omoeba
            if nw and nw.feature.pos2 in ("接続助詞",):
                continue
            # 333 -> 333 ; this should probably be handled in mecab
            if word.surface.isdigit() and nw and nw.surface.isdigit():
                continue
            # そうでした -> sou deshita
            if (
                nw
                and word.feature.pos1 in ("動詞", "助動詞", "形容詞")
                and nw.feature.pos1 == "助動詞"
                and nw.surface != "です"
            ):
                continue

            # if we get here, it does need a space
            tok.space = True

        # remove any leftover っ
        for tok in out:
            tok.surface = tok.surface.replace("っ", "")

        # capitalize the first letter
        if capitalize and out and out[0].surface:
            ss = out[0].surface
            out[0].surface = ss[0].capitalize() + ss[1:]
        return out

    def romaji(self, text, capitalize=True, title=False):
        """Build a complete string from input text.

        If `capitalize` is true, then the first letter of the text will be
        capitalized. This is typically the desired behavior if the input is a
        complete sentence.

        If `title` is true, then words will be capitalized as in a book title.
        This means most words will be capitalized, but some parts of speech
        (particles, endings) will not.
        """
        if not text:
            return ""

        text = normalize_text(text)
        words = self.tagger(text)

        tokens = self.romaji_tokens(words, capitalize, title)
        out = "".join([str(tok) for tok in tokens]).strip()
        return out

    def romaji_word(self, word):
        """Return the romaji for a single word (node)."""

        if word.surface in self.exceptions:
            return self.exceptions[word.surface]

        if word.surface.isdigit():
            return word.surface

        if word.surface.isascii():
            return word.surface

        # deal with unks first
        if word.is_unk:
            # at this point it is presumably an unk
            # Check character type using the values defined in char.def.
            # This is constant across unidic versions so far but not guaranteed.
            if word.char_type in (CHAR_HIRAGANA, CHAR_KATAKANA):
                kana = jaconv.kata2hira(word.surface)
                return self.map_kana(kana)

            # At this point this is an unknown word and not kana. Could be
            # unknown kanji, could be hangul, cyrillic, something else.
            # By default ensure ascii by replacing with ?, but allow
            # pass-through.
            if self.ensure_ascii:
                out = "?" * len(word.surface)
                return out
            else:
                return word.surface

        if word.feature.pos1 == "補助記号":
            # If it's punctuation we don't recognize, just discard it
            return self.table.get(word.surface, "")
        elif self.use_wa and word.feature.pos1 == "助詞" and word.feature.pron == "ワ":
            return "wa"
        elif (
            not self.use_he
            and word.feature.pos1 == "助詞"
            and word.feature.pron == "エ"
        ):
            return "e"
        elif (
            not self.use_wo
            and word.feature.pos1 == "助詞"
            and word.feature.pron == "オ"
        ):
            return "o"
        elif self.use_foreign_spelling and has_foreign_lemma(word):
            # this is a foreign word with known spelling
            return word.feature.lemma.split("-", 1)[-1]
        elif word.feature.kana:
            # for known words
            kana = jaconv.kata2hira(word.feature.kana)
            return self.map_kana(kana)
        else:
            # unclear when we would actually get here
            return word.surface

    def map_kana(self, kana):
        """Given a list of kana, convert them to romaji.

        The exact romaji resulting from a kana sequence depend on the preceding
        or following kana, so this handles that conversion.
        """
        out = ""
        for ki, char in enumerate(kana):
            nk = kana[ki + 1] if ki < len(kana) - 1 else None
            pk = kana[ki - 1] if ki > 0 else None
            out += self.get_single_mapping(pk, char, nk)
        return out

    def get_single_mapping(self, pk, kk, nk):
        """Given a single kana and its neighbors, return the mapped romaji."""
        # handle odoriji
        # NOTE: This is very rarely useful at present because odoriji are not
        # left in readings for dictionary words, and we can't follow kana
        # across word boundaries.
        if kk in ODORI:
            if kk in "ゝヽ":
                if pk:
                    return pk
                else:
                    return ""  # invalid but be nice
            if kk in "ゞヾ":  # repeat with voicing
                if not pk:
                    return ""
                vv = add_dakuten(pk)
                if vv:
                    return self.table[vv]
                else:
                    return ""
            # remaining are 々 for kanji and 〃 for symbols, but we can't
            # infer their span reliably (or handle rendaku)
            return ""

        # handle digraphs
        if pk and (pk + kk) in self.table:
            return self.table[pk + kk]
        if nk and (kk + nk) in self.table:
            return ""

        if nk and nk in SUTEGANA:
            if kk == "っ":
                return ""  # never valid, just ignore
            return self.table[kk][:-1] + self.table[nk]
        if kk in SUTEGANA:
            return ""

        if kk == "ー":  # 長音符
            if pk and pk in self.table:
                return self.table[pk][-1]
            else:
                return "-"

        if kk == "っ":
            if nk:
                if self.use_tch and nk == "ち":
                    return "t"
                elif nk in "あいうえおっ":
                    return "-"
                else:
                    return self.table[nk][0]  # first character
            else:
                # seems like it should never happen, but 乗っ|た is two tokens
                # so leave this as is and pick it up at the word level
                return "っ"

        if kk == "ん":
            if nk and nk in "あいうえおやゆよ":
                return "n'"
            else:
                return "n"

        return self.table[kk]
````
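The neighbor-sensitive lookup in `map_kana`/`get_single_mapping` is the heart of the conversion: each kana's romaji can depend on the kana before and after it. A toy illustration of just the ん rule (an apostrophe before vowels and y-kana, so that んい stays distinguishable from に), using a hypothetical three-entry table rather than the real `SYSTEMS` data:

```python
# hypothetical mini-table; the real tables live in cutlet's SYSTEMS dict
TABLE = {"し": "shi", "い": "i", "な": "na"}

def map_kana(kana):
    out = ""
    for i, ch in enumerate(kana):
        nk = kana[i + 1] if i + 1 < len(kana) else None  # next kana, if any
        if ch == "ん":
            # apostrophe before a vowel or y-kana, as in get_single_mapping
            out += "n'" if nk and nk in "あいうえおやゆよ" else "n"
        else:
            out += TABLE[ch]
    return out

map_kana("しんい")  # "shin'i" -- ん before い needs the apostrophe
map_kana("しな")    # "shina"
```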
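Another neighbor-dependent rule worth seeing in isolation: the っ branch of `get_single_mapping` doubles the first consonant of the following kana's romaji, with a special t-ch case in Hepburn. A toy version of only that branch (tiny hypothetical table, not the library code):

```python
# hypothetical mini-table standing in for cutlet's real mapping data
TABLE = {"か": "ka", "ち": "chi", "ぱ": "pa"}

def sokuon(nk, use_tch=True):
    # っ before ち is "t" in Hepburn (まっちゃ -> matcha); otherwise it
    # copies the first consonant of the next kana (きっぷ -> kippu)
    if use_tch and nk == "ち":
        return "t"
    return TABLE[nk][0]

sokuon("ぱ")  # "p"
sokuon("ち")  # "t" with Hepburn's t-ch, "c" otherwise
```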