cutlet
Cutlet is a tool to convert Japanese to romaji. Check out the interactive demo! Also see the docs and the original blog post.
(You don't need to write issues in English.)
Features:
- support for Modified Hepburn, Kunreisiki, Nihonsiki systems
- custom overrides for individual mappings
- custom overrides for specific words
- built-in exceptions list (Tokyo, Osaka, etc.)
- uses foreign spelling when available in UniDic
- proper nouns are capitalized
- slug mode for URL generation
Things not supported:
- traditional Hepburn n-to-m: Shimbashi
- macrons or circumflexes: Tōkyō, Tôkyô
- passport Hepburn: Satoh (but you can add an exception; see the sketch below)
- hyphenating words
- traditional Hepburn in general
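Since exceptions override whole tokens, a passport-style spelling can be restored by hand. A minimal sketch (assuming 佐藤 is a single token in your dictionary; the default output may vary with the dictionary used):

import cutlet

katsu = cutlet.Cutlet()
katsu.romaji("佐藤")               # => 'Satou'
katsu.add_exception("佐藤", "Satoh")
katsu.romaji("佐藤")               # => 'Satoh'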
Internally, cutlet uses fugashi, so you can use the same dictionary you use for normal tokenization.
Installation
Cutlet can be installed through pip as usual.
pip install cutlet
Note that if you don't have a MeCab dictionary installed you'll also have to install one. If you're just getting started, unidic-lite is a good choice.
pip install unidic-lite
Usage
A command-line script is included for quick testing. Just run `cutlet` and each line of stdin will be treated as a sentence. You can specify the system to use (`hepburn`, `kunrei`, `nippon`, or `nihon`) as the first argument.
$ cutlet
ローマ字変換プログラム作ってみた。
Roma ji henkan program tsukutte mita.
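The system argument works the same way on the command line; a quick sketch reusing the kunrei example from the Python snippets below:

$ cutlet kunrei
富士山
Huzi yama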
In code:
import cutlet
katsu = cutlet.Cutlet()
katsu.romaji("カツカレーは美味しい")
# => 'Cutlet curry wa oishii'
# you can print a slug suitable for urls
katsu.slug("カツカレーは美味しい")
# => 'cutlet-curry-wa-oishii'
# You can disable using foreign spelling too
katsu.use_foreign_spelling = False
katsu.romaji("カツカレーは美味しい")
# => 'Katsu karee wa oishii'
# kunreisiki, nihonsiki work too
katu = cutlet.Cutlet('kunrei')
katu.romaji("富士山")
# => 'Huzi yama'
# comparison
nkatu = cutlet.Cutlet('nihon')
sent = "彼女は王への手紙を読み上げた。"
katsu.romaji(sent)
# => 'Kanojo wa ou e no tegami wo yomiageta.'
katu.romaji(sent)
# => 'Kanozyo wa ou e no tegami o yomiageta.'
nkatu.romaji(sent)
# => 'Kanozyo ha ou he no tegami wo yomiageta.'
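The `ensure_ascii` constructor option controls what happens to characters that can't be romanized at all (Hangul, Cyrillic, and so on). A sketch of the expected behavior; treat the output comments as illustrative:

# unromanizable characters become '?' by default
katsu = cutlet.Cutlet()
katsu.romaji("한글")        # => '??' (illustrative)
# ... or pass through unchanged
katsu_raw = cutlet.Cutlet(ensure_ascii=False)
katsu_raw.romaji("한글")    # => '한글' (illustrative)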
API
The main entry point is the `Cutlet` class, reproduced below with its docstrings.
class Cutlet:
    def __init__(
        self,
        system='hepburn',
        use_foreign_spelling=True,
        ensure_ascii=True,
        mecab_args="",
    ):
        """Create a Cutlet object, which holds configuration as well as
        tokenizer state.

        `system` is `hepburn` by default, and may also be `kunrei` or
        `nihon`. `nippon` is permitted as a synonym for `nihon`.

        If `use_foreign_spelling` is true, output will use the foreign spelling
        provided in a UniDic lemma when available. For example, "カツ" will
        become "cutlet" instead of "katsu".

        If `ensure_ascii` is true, any non-ASCII characters that can't be
        romanized will be replaced with `?`. If false, they will be passed
        through.

        Typical usage:

        ```python
        katsu = Cutlet()
        roma = katsu.romaji("カツカレーを食べた")
        # "Cutlet curry wo tabeta"
        ```
        """
        # allow 'nippon' for 'nihon'
        if system == 'nippon': system = 'nihon'
        self.system = system
        try:
            # make a copy so we can modify it
            self.table = dict(SYSTEMS[system])
        except KeyError:
            print("unknown system: {}".format(system))
            raise

        self.tagger = fugashi.Tagger(mecab_args)
        self.exceptions = load_exceptions()

        # these are too minor to be worth exposing as arguments
        self.use_tch = (self.system in ('hepburn',))
        self.use_wa = (self.system in ('hepburn', 'kunrei'))
        self.use_he = (self.system in ('nihon',))
        self.use_wo = (self.system in ('hepburn', 'nihon'))

        self.use_foreign_spelling = use_foreign_spelling
        self.ensure_ascii = ensure_ascii
Create a Cutlet object, which holds configuration as well as tokenizer state.

`system` is `hepburn` by default, and may also be `kunrei` or `nihon`; `nippon` is permitted as a synonym for `nihon`.

If `use_foreign_spelling` is true, output will use the foreign spelling provided in a UniDic lemma when available. For example, "カツ" will become "cutlet" instead of "katsu".

If `ensure_ascii` is true, any non-ASCII characters that can't be romanized will be replaced with `?`. If false, they will be passed through.
Typical usage:
katsu = Cutlet()
roma = katsu.romaji("カツカレーを食べた")
# "Cutlet curry wo tabeta"
    def add_exception(self, key, val):
        """Add an exception to the internal list.

        An exception overrides a whole token, for example to replace "Toukyou"
        with "Tokyo". Note that it must match the tokenizer output and be a
        single token to work. To replace longer phrases, you'll need to use a
        different strategy, like string replacement.
        """
        self.exceptions[key] = val
Add an exception to the internal list.
An exception overrides a whole token, for example to replace "Toukyou" with "Tokyo". Note that it must match the tokenizer output and be a single token to work. To replace longer phrases, you'll need to use a different strategy, like string replacement.
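A short usage sketch (assuming 伊藤 is a single token in your dictionary):

katsu = cutlet.Cutlet()
katsu.romaji("伊藤")               # => 'Itou'
katsu.add_exception("伊藤", "Itoh")
katsu.romaji("伊藤")               # => 'Itoh'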
    def update_mapping(self, key, val):
        """Update mapping table for a single kana.

        This can be used to mix common systems, or to modify particular
        details. For example, you can use `update_mapping("ぢ", "di")` to
        differentiate ぢ and じ in Hepburn.

        Example usage:

        ```
        cut = Cutlet()
        cut.romaji("お茶漬け") # Ochazuke
        cut.update_mapping("づ", "du")
        cut.romaji("お茶漬け") # Ochaduke
        ```
        """
        self.table[key] = val
Update mapping table for a single kana.

This can be used to mix common systems, or to modify particular details. For example, you can use `update_mapping("ぢ", "di")` to differentiate ぢ and じ in Hepburn.
Example usage:
cut = Cutlet()
cut.romaji("お茶漬け") # Ochazuke
cut.update_mapping("づ", "du")
cut.romaji("お茶漬け") # Ochaduke
    def slug(self, text):
        """Generate a URL-friendly slug.

        After converting the input to romaji using `Cutlet.romaji` and making
        the result lower-case, any runs of non-alphanumeric characters are
        replaced with a single hyphen. Any leading or trailing hyphens are
        stripped.
        """
        roma = self.romaji(text).lower()
        slug = re.sub(r'[^a-z0-9]+', '-', roma).strip('-')
        return slug
Generate a URL-friendly slug.

After converting the input to romaji using `Cutlet.romaji` and making the result lower-case, any runs of non-alphanumeric characters are replaced with a single hyphen. Any leading or trailing hyphens are stripped.
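Usage mirrors `romaji`, repeating the example from the README above:

katsu = cutlet.Cutlet()
katsu.slug("カツカレーは美味しい")
# => 'cutlet-curry-wa-oishii'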
    def romaji_tokens(self, words, capitalize=True, title=False):
        """Build a list of tokens from input nodes.

        If `capitalize` is true, then the first letter of the first token will be
        capitalized. This is typically the desired behavior if the input is a
        complete sentence.

        If `title` is true, then words will be capitalized as in a book title.
        This means most words will be capitalized, but some parts of speech
        (particles, endings) will not.

        If the text was not normalized before being tokenized, the output is
        undefined. For details of normalization, see `normalize_text`.

        The number of output tokens will equal the number of input nodes.
        """

        out = []

        for wi, word in enumerate(words):
            po = out[-1] if out else None
            pw = words[wi - 1] if wi > 0 else None
            nw = words[wi + 1] if wi < len(words) - 1 else None

            # handle possessive apostrophe as a special case
            if (word.surface == "'" and
                    (nw and nw.char_type == CHAR_ALPHA and not nw.white_space) and
                    not word.white_space):
                # remove preceding space
                if po:
                    po.space = False
                out.append(Token(word.surface, False))
                continue

            # resolve split verbs / adjectives
            roma = self.romaji_word(word)
            if roma and po and po.surface and po.surface[-1] == 'っ':
                po.surface = po.surface[:-1] + roma[0]
            if word.feature.pos2 == '固有名詞':
                roma = roma.title()
            if (title and
                    word.feature.pos1 not in ('助詞', '助動詞', '接尾辞') and
                    not (pw and pw.feature.pos1 == '接頭辞')):
                roma = roma.title()

            foreign = self.use_foreign_spelling and has_foreign_lemma(word)
            tok = Token(roma, False, foreign)
            # handle punctuation with atypical spacing
            if word.surface in '「『':
                if po:
                    po.space = True
                out.append(tok)
                continue
            if roma in '([':
                if po:
                    po.space = True
                out.append(tok)
                continue
            if roma == '/':
                out.append(tok)
                continue

            # preserve spaces between ascii tokens
            if (word.surface.isascii() and
                    nw and nw.surface.isascii()):
                use_space = bool(nw.white_space)
                out.append(Token(word.surface, use_space))
                continue

            out.append(tok)

            # no space sometimes
            # お酒 -> osake
            if word.feature.pos1 == '接頭辞': continue
            # 今日、 -> kyou, ; 図書館 -> toshokan
            if nw and nw.feature.pos1 in ('補助記号', '接尾辞'): continue
            # special case for half-width commas
            if nw and nw.surface == ',': continue
            # special case for prefixes
            if foreign and roma[-1] == "-": continue
            # 思えば -> omoeba
            if nw and nw.feature.pos2 in ('接続助詞'): continue
            # 333 -> 333 ; this should probably be handled in mecab
            if (word.surface.isdigit() and
                    nw and nw.surface.isdigit()):
                continue
            # そうでした -> sou deshita
            if (nw and word.feature.pos1 in ('動詞', '助動詞', '形容詞')
                    and nw.feature.pos1 == '助動詞'
                    and nw.surface != 'です'):
                continue

            # if we get here, it does need a space
            tok.space = True

        # remove any leftover っ
        for tok in out:
            tok.surface = tok.surface.replace("っ", "")

        # capitalize the first letter
        if capitalize and out and out[0].surface:
            ss = out[0].surface
            out[0].surface = ss[0].capitalize() + ss[1:]
        return out
Build a list of tokens from input nodes.

If `capitalize` is true, then the first letter of the first token will be capitalized. This is typically the desired behavior if the input is a complete sentence.

If `title` is true, then words will be capitalized as in a book title. This means most words will be capitalized, but some parts of speech (particles, endings) will not.

If the text was not normalized before being tokenized, the output is undefined. For details of normalization, see `normalize_text`.

The number of output tokens will equal the number of input nodes.
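Calling this directly requires tokenizer nodes, and the input should be normalized first. A sketch, assuming `normalize_text` is importable from the top-level module as the docstring implies:

import cutlet

katsu = cutlet.Cutlet()
text = cutlet.normalize_text("カツカレーは美味しい")
words = katsu.tagger(text)
tokens = katsu.romaji_tokens(words)
assert len(tokens) == len(words)  # one output token per input node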
    def romaji(self, text, capitalize=True, title=False):
        """Build a complete string from input text.

        If `capitalize` is true, then the first letter of the text will be
        capitalized. This is typically the desired behavior if the input is a
        complete sentence.

        If `title` is true, then words will be capitalized as in a book title.
        This means most words will be capitalized, but some parts of speech
        (particles, endings) will not.
        """
        if not text:
            return ''

        text = normalize_text(text)
        words = self.tagger(text)

        tokens = self.romaji_tokens(words, capitalize, title)
        out = ''.join([str(tok) for tok in tokens]).strip()
        return out
Build a complete string from input text.

If `capitalize` is true, then the first letter of the text will be capitalized. This is typically the desired behavior if the input is a complete sentence.

If `title` is true, then words will be capitalized as in a book title. This means most words will be capitalized, but some parts of speech (particles, endings) will not.
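A sketch of the `title` flag; the exact capitalization depends on the part-of-speech tags your dictionary assigns, so treat the output as illustrative:

katsu = cutlet.Cutlet()
katsu.romaji("カツカレーは美味しい", title=True)
# => 'Cutlet Curry wa Oishii' (illustrative; particles like は stay lower-case)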
    def romaji_word(self, word):
        """Return the romaji for a single word (node)."""

        if word.surface in self.exceptions:
            return self.exceptions[word.surface]

        if word.surface.isdigit():
            return word.surface

        if word.surface.isascii():
            return word.surface

        # deal with unks first
        if word.is_unk:
            # at this point it is presumably an unk
            # Check character type using the values defined in char.def.
            # This is constant across unidic versions so far but not guaranteed.
            if word.char_type in (CHAR_HIRAGANA, CHAR_KATAKANA):
                kana = jaconv.kata2hira(word.surface)
                return self.map_kana(kana)

            # At this point this is an unknown word and not kana. Could be
            # unknown kanji, could be hangul, cyrillic, something else.
            # By default ensure ascii by replacing with ?, but allow pass-through.
            if self.ensure_ascii:
                out = '?' * len(word.surface)
                return out
            else:
                return word.surface

        if word.feature.pos1 == '補助記号':
            # If it's punctuation we don't recognize, just discard it
            return self.table.get(word.surface, '')
        elif (self.use_wa and
                word.feature.pos1 == '助詞' and word.feature.pron == 'ワ'):
            return 'wa'
        elif (not self.use_he and
                word.feature.pos1 == '助詞' and word.feature.pron == 'エ'):
            return 'e'
        elif (not self.use_wo and
                word.feature.pos1 == '助詞' and word.feature.pron == 'オ'):
            return 'o'
        elif (self.use_foreign_spelling and
                has_foreign_lemma(word)):
            # this is a foreign word with known spelling
            return word.feature.lemma.split('-', 1)[-1]
        elif word.feature.kana:
            # for known words
            kana = jaconv.kata2hira(word.feature.kana)
            return self.map_kana(kana)
        else:
            # unclear when we would actually get here
            return word.surface
Return the romaji for a single word (node).
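A sketch of calling this on a single node (assuming the UniDic lemma for カツ carries the foreign spelling, as described in the constructor docs):

katsu = cutlet.Cutlet()
word = katsu.tagger("カツ")[0]
katsu.romaji_word(word)  # => 'cutlet' (illustrative)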
    def map_kana(self, kana):
        """Given a list of kana, convert them to romaji.

        The exact romaji resulting from a kana sequence depend on the preceding
        or following kana, so this handles that conversion.
        """
        out = ''
        for ki, char in enumerate(kana):
            nk = kana[ki + 1] if ki < len(kana) - 1 else None
            pk = kana[ki - 1] if ki > 0 else None
            out += self.get_single_mapping(pk, char, nk)
        return out
Given a list of kana, convert them to romaji.
The exact romaji resulting from a kana sequence depend on the preceding or following kana, so this handles that conversion.
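The input is expected to be hiragana (katakana is converted with `jaconv.kata2hira` before this is called). A small sketch of the context-sensitive handling:

katsu = cutlet.Cutlet()
katsu.map_kana("きょう")    # => 'kyou' (digraph きょ)
katsu.map_kana("がっこう")  # => 'gakkou' (っ doubles the following consonant)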
    def get_single_mapping(self, pk, kk, nk):
        """Given a single kana and its neighbors, return the mapped romaji."""
        # handle odoriji
        # NOTE: This is very rarely useful at present because odoriji are not
        # left in readings for dictionary words, and we can't follow kana
        # across word boundaries.
        if kk in ODORI:
            if kk in 'ゝヽ':
                if pk: return pk
                else: return ''  # invalid but be nice
            if kk in 'ゞヾ':  # repeat with voicing
                if not pk: return ''
                vv = add_dakuten(pk)
                if vv: return self.table[vv]
                else: return ''
            # remaining are 々 for kanji and 〃 for symbols, but we can't
            # infer their span reliably (or handle rendaku)
            return ''

        # handle digraphs
        if pk and (pk + kk) in self.table:
            return self.table[pk + kk]
        if nk and (kk + nk) in self.table:
            return ''

        if nk and nk in SUTEGANA:
            if kk == 'っ': return ''  # never valid, just ignore
            return self.table[kk][:-1] + self.table[nk]
        if kk in SUTEGANA:
            return ''

        if kk == 'ー':  # 長音符
            if pk and pk in self.table: return self.table[pk][-1]
            else: return '-'

        if kk == 'っ':
            if nk:
                if self.use_tch and nk == 'ち': return 't'
                elif nk in 'あいうえおっ': return '-'
                else: return self.table[nk][0]  # first character
            else:
                # seems like it should never happen, but 乗っ|た is two tokens
                # so leave this as is and pick it up at the word level
                return 'っ'

        if kk == 'ん':
            if nk and nk in 'あいうえおやゆよ': return "n'"
            else: return 'n'

        return self.table[kk]
Given a single kana and its neighbors, return the mapped romaji.
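A few sketches of the neighbor-sensitive cases, using the Hepburn defaults:

katsu = cutlet.Cutlet()
katsu.get_single_mapping(None, 'し', None)  # => 'shi'
katsu.get_single_mapping('て', 'ん', 'い')  # => "n'" (n before a vowel)
katsu.get_single_mapping(None, 'っ', 'ち')  # => 't' (Hepburn tch)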