cutlet
Cutlet is a tool to convert Japanese to romaji. Check out the interactive demo! Also see the docs and the original blog post.
You do not need to write issues in English.
Features:
- support for Modified Hepburn, Kunreisiki, Nihonsiki systems
- custom overrides for individual mappings
- custom overrides for specific words
- built in exceptions list (Tokyo, Osaka, etc.)
- uses foreign spelling when available in UniDic
- proper nouns are capitalized
- slug mode for url generation
Things not supported:
- traditional Hepburn n-to-m: Shimbashi
- macrons or circumflexes: Tōkyō, Tôkyô
- passport Hepburn: Satoh (but you can use an exception)
- hyphenating words
- Traditional Hepburn in general is not supported
Internally, cutlet uses fugashi, so you can use the same dictionary you use for normal tokenization.
Installation
Cutlet can be installed through pip as usual.
pip install cutlet
Note that if you don't have a MeCab dictionary installed you'll also have to install one. If you're just getting started unidic-lite is a good choice.
pip install unidic-lite
Usage
A command-line script is included for quick testing. Just use cutlet and each line of stdin will be treated as a sentence. You can specify the system to use (hepburn, kunrei, nippon, or nihon) as the first argument.
$ cutlet
ローマ字変換プログラム作ってみた。
Roma ji henkan program tsukutte mita.
In code:
import cutlet
katsu = cutlet.Cutlet()
katsu.romaji("カツカレーは美味しい")
# => 'Cutlet curry wa oishii'
# you can print a slug suitable for urls
katsu.slug("カツカレーは美味しい")
# => 'cutlet-curry-wa-oishii'
# You can disable using foreign spelling too
katsu.use_foreign_spelling = False
katsu.romaji("カツカレーは美味しい")
# => 'Katsu karee wa oishii'
# kunreisiki, nihonsiki work too
katu = cutlet.Cutlet('kunrei')
katu.romaji("富士山")
# => 'Huzi yama'
# comparison
nkatu = cutlet.Cutlet('nihon')
sent = "彼女は王への手紙を読み上げた。"
katsu.romaji(sent)
# => 'Kanojo wa ou e no tegami wo yomiageta.'
katu.romaji(sent)
# => 'Kanozyo wa ou e no tegami o yomiageta.'
nkatu.romaji(sent)
# => 'Kanozyo ha ou he no tegami wo yomiageta.'
Alternatives
class Cutlet:
    def __init__(
            self,
            system = 'hepburn',
            use_foreign_spelling = True,
            ensure_ascii = True,
            mecab_args = "",
    ):
        """Create a Cutlet object, which holds configuration as well as
        tokenizer state.

        `system` is `hepburn` by default, and may also be `kunrei` or
        `nihon`. `nippon` is permitted as a synonym for `nihon`.

        If `use_foreign_spelling` is true, output will use the foreign spelling
        provided in a UniDic lemma when available. For example, "カツ" will
        become "cutlet" instead of "katsu".

        If `ensure_ascii` is true, any non-ASCII characters that can't be
        romanized will be replaced with `?`. If false, they will be passed
        through.

        Typical usage:

        ```python
        katsu = Cutlet()
        roma = katsu.romaji("カツカレーを食べた")
        # "Cutlet curry wo tabeta"
        ```
        """
        # allow 'nippon' for 'nihon'
        if system == 'nippon': system = 'nihon'
        self.system = system
        try:
            # make a copy so we can modify it
            self.table = dict(SYSTEMS[system])
        except KeyError:
            print("unknown system: {}".format(system))
            raise

        self.tagger = fugashi.Tagger(mecab_args)
        self.exceptions = load_exceptions()

        # these are too minor to be worth exposing as arguments
        self.use_tch = (self.system in ('hepburn',))
        self.use_wa = (self.system in ('hepburn', 'kunrei'))
        self.use_he = (self.system in ('nihon',))
        self.use_wo = (self.system in ('hepburn', 'nihon'))

        self.use_foreign_spelling = use_foreign_spelling
        self.ensure_ascii = ensure_ascii
Create a Cutlet object, which holds configuration as well as tokenizer state.
system is hepburn by default, and may also be kunrei or nihon. nippon is permitted as a synonym for nihon.
If use_foreign_spelling is true, output will use the foreign spelling provided in a UniDic lemma when available. For example, "カツ" will become "cutlet" instead of "katsu".
If ensure_ascii is true, any non-ASCII characters that can't be romanized will be replaced with ?. If false, they will be passed through.
Typical usage:
katsu = Cutlet()
roma = katsu.romaji("カツカレーを食べた")
# "Cutlet curry wo tabeta"
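The choice of system also decides how the particles は, へ, and を are romanized. As a minimal sketch of that logic, the following mirrors the use_wa, use_he, and use_wo flags set in the constructor. The helper name particle_romaji is hypothetical and not part of cutlet, and it does no tokenization.

```python
def particle_romaji(system="hepburn"):
    """Return how the particles は, へ, を come out under a given system.

    Hypothetical helper for illustration only; it mirrors the flag logic
    described for the constructor and nothing else.
    """
    if system == "nippon":  # accepted as a synonym for 'nihon'
        system = "nihon"
    if system not in ("hepburn", "kunrei", "nihon"):
        raise KeyError(system)

    use_wa = system in ("hepburn", "kunrei")
    use_he = system in ("nihon",)
    use_wo = system in ("hepburn", "nihon")
    return {
        "は": "wa" if use_wa else "ha",
        "へ": "he" if use_he else "e",
        "を": "wo" if use_wo else "o",
    }


for name in ("hepburn", "kunrei", "nihon"):
    print(name, particle_romaji(name))
```

The three results line up with the particle spellings in the comparison sentence from the usage section.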
    def add_exception(self, key, val):
        """Add an exception to the internal list.

        An exception overrides a whole token, for example to replace "Toukyou"
        with "Tokyo". Note that it must match the tokenizer output and be a
        single token to work. To replace longer phrases, you'll need to use a
        different strategy, like string replacement.
        """
        self.exceptions[key] = val
Add an exception to the internal list.
An exception overrides a whole token, for example to replace "Toukyou" with "Tokyo". Note that it must match the tokenizer output and be a single token to work. To replace longer phrases, you'll need to use a different strategy, like string replacement.
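To illustrate why an exception only fires on an exact single-token match, here is a rough sketch that works on an already tokenized list. The function romanize_tokens and the readings dictionary are hypothetical stand-ins for the tokenizer and dictionary lookup, not cutlet's API.

```python
def romanize_tokens(tokens, exceptions, fallback):
    """Romanize each token, letting an exact whole-token match win."""
    out = []
    for tok in tokens:
        if tok in exceptions:
            out.append(exceptions[tok])
        else:
            out.append(fallback(tok))
    return out


exceptions = {"東京": "Tokyo"}
# Stand-in readings; a real run would take these from the dictionary.
readings = {"東京": "toukyou", "東京都": "toukyouto", "に": "ni"}

print(romanize_tokens(["東京", "に"], exceptions, readings.get))
# If a tokenizer emitted 東京都 as one token, the 東京 exception would not apply.
print(romanize_tokens(["東京都"], exceptions, readings.get))
```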
    def update_mapping(self, key, val):
        """Update mapping table for a single kana.

        This can be used to mix common systems, or to modify particular
        details. For example, you can use `update_mapping("ぢ", "di")` to
        differentiate ぢ and じ in Hepburn.

        Example usage:

        ```
        cut = Cutlet()
        cut.romaji("お茶漬け") # Ochazuke
        cut.update_mapping("づ", "du")
        cut.romaji("お茶漬け") # Ochaduke
        ```
        """
        self.table[key] = val
Update mapping table for a single kana.
This can be used to mix common systems, or to modify particular details. For example, you can use update_mapping("ぢ", "di") to differentiate ぢ and じ in Hepburn.
Example usage:
cut = Cutlet()
cut.romaji("お茶漬け") # Ochazuke
cut.update_mapping("づ", "du")
cut.romaji("お茶漬け") # Ochaduke
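Because the constructor copies the system table before storing it, a mapping update should stay local to one object. The sketch below shows that copy-then-modify pattern with a hypothetical two-entry table and a MiniMapper class invented for illustration.

```python
# Hypothetical miniature of the shared per-system mapping tables.
SYSTEMS = {"hepburn": {"づ": "zu", "ぢ": "ji"}}


class MiniMapper:
    def __init__(self, system="hepburn"):
        # Copy the shared table so later edits affect only this instance.
        self.table = dict(SYSTEMS[system])

    def update_mapping(self, key, val):
        self.table[key] = val


a = MiniMapper()
b = MiniMapper()
a.update_mapping("づ", "du")
print(a.table["づ"], b.table["づ"], SYSTEMS["hepburn"]["づ"])
```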
    def slug(self, text):
        """Generate a URL-friendly slug.

        After converting the input to romaji using `Cutlet.romaji` and making
        the result lower-case, any runs of non alpha-numeric characters are
        replaced with a single hyphen. Any leading or trailing hyphens are
        stripped.
        """
        roma = self.romaji(text).lower()
        slug = re.sub(r'[^a-z0-9]+', '-', roma).strip('-')
        return slug
Generate a URL-friendly slug.
After converting the input to romaji using Cutlet.romaji and making the result lower-case, any runs of non alpha-numeric characters are replaced with a single hyphen. Any leading or trailing hyphens are stripped.
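The post-processing step can be tried on its own with a string that is already romaji. This is a small sketch of just that step; slugify_romaji is a hypothetical name, and the romanization itself is assumed to have happened elsewhere.

```python
import re


def slugify_romaji(roma: str) -> str:
    """Lower-case, collapse non-alphanumeric runs to '-', trim edge hyphens."""
    lowered = roma.lower()
    return re.sub(r"[^a-z0-9]+", "-", lowered).strip("-")


print(slugify_romaji("Cutlet curry wa oishii"))
print(slugify_romaji("  Roma ji, henkan!  "))
```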
    def romaji_tokens(self, words, capitalize=True, title=False):
        """Build a list of tokens from input nodes.

        If `capitalize` is true, then the first letter of the first token will be
        capitalized. This is typically the desired behavior if the input is a
        complete sentence.

        If `title` is true, then words will be capitalized as in a book title.
        This means most words will be capitalized, but some parts of speech
        (particles, endings) will not.

        If the text was not normalized before being tokenized, the output is
        undefined. For details of normalization, see `normalize_text`.

        The number of output tokens will equal the number of input nodes.
        """

        out = []

        for wi, word in enumerate(words):
            po = out[-1] if out else None
            pw = words[wi - 1] if wi > 0 else None
            nw = words[wi + 1] if wi < len(words) - 1 else None

            # handle possessive apostrophe as a special case
            if (word.surface == "'" and
                    (nw and nw.char_type == CHAR_ALPHA and not nw.white_space) and
                    not word.white_space):
                # remove preceding space
                if po:
                    po.space = False
                out.append(Token(word.surface, False))
                continue

            # resolve split verbs / adjectives
            roma = self.romaji_word(word)
            if roma and po and po.surface and po.surface[-1] == 'っ':
                po.surface = po.surface[:-1] + roma[0]
            if word.feature.pos2 == '固有名詞':
                roma = roma.title()
            if (title and
                    word.feature.pos1 not in ('助詞', '助動詞', '接尾辞') and
                    not (pw and pw.feature.pos1 == '接頭辞')):
                roma = roma.title()

            foreign = self.use_foreign_spelling and has_foreign_lemma(word)
            tok = Token(roma, False, foreign)
            # handle punctuation with atypical spacing
            if word.surface in '「『':
                if po:
                    po.space = True
                out.append(tok)
                continue
            if roma in '([':
                if po:
                    po.space = True
                out.append(tok)
                continue
            if roma == '/':
                out.append(tok)
                continue

            out.append(tok)

            # no space sometimes
            # お酒 -> osake
            if word.feature.pos1 == '接頭辞': continue
            # 今日、 -> kyou, ; 図書館 -> toshokan
            if nw and nw.feature.pos1 in ('補助記号', '接尾辞'): continue
            # special case for half-width commas
            if nw and nw.surface == ',': continue
            # 思えば -> omoeba
            if nw and nw.feature.pos2 in ('接続助詞'): continue
            # 333 -> 333 ; this should probably be handled in mecab
            if (word.surface.isdigit() and
                    nw and nw.surface.isdigit()):
                continue
            # そうでした -> sou deshita
            if (nw and word.feature.pos1 in ('動詞', '助動詞','形容詞')
                    and nw.feature.pos1 == '助動詞'
                    and nw.surface != 'です'):
                continue

            # if we get here, it does need a space
            tok.space = True

        # remove any leftover っ
        for tok in out:
            tok.surface = tok.surface.replace("っ", "")

        # capitalize the first letter
        if capitalize and out and out[0].surface:
            ss = out[0].surface
            out[0].surface = ss[0].capitalize() + ss[1:]
        return out
Build a list of tokens from input nodes.
If capitalize is true, then the first letter of the first token will be capitalized. This is typically the desired behavior if the input is a complete sentence.
If title is true, then words will be capitalized as in a book title. This means most words will be capitalized, but some parts of speech (particles, endings) will not.
If the text was not normalized before being tokenized, the output is undefined. For details of normalization, see normalize_text.
The number of output tokens will equal the number of input nodes.
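Each output token carries its romaji plus a flag saying whether a space follows it, and the final string is built by concatenating the tokens. The sketch below uses a simplified Token stand-in (the real class may have more fields) and a hypothetical join_tokens helper to show how those flags shape the spacing.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Simplified stand-in: romaji surface plus a trailing-space flag."""
    surface: str
    space: bool = False

    def __str__(self):
        return self.surface + (" " if self.space else "")


def join_tokens(tokens):
    """Concatenate tokens and trim whitespace left at the edges."""
    return "".join(str(tok) for tok in tokens).strip()


tokens = [
    Token("Kanojo", True),
    Token("wa", True),
    Token("yomiage", False),  # verb stem: no space before the auxiliary
    Token("ta", False),       # no space before the closing punctuation
    Token(".", True),
]
print(join_tokens(tokens))
```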
    def romaji(self, text, capitalize=True, title=False):
        """Build a complete string from input text.

        If `capitalize` is true, then the first letter of the text will be
        capitalized. This is typically the desired behavior if the input is a
        complete sentence.

        If `title` is true, then words will be capitalized as in a book title.
        This means most words will be capitalized, but some parts of speech
        (particles, endings) will not.
        """
        if not text:
            return ''

        text = normalize_text(text)
        words = self.tagger(text)

        tokens = self.romaji_tokens(words, capitalize, title)
        out = ''.join([str(tok) for tok in tokens]).strip()
        return out
Build a complete string from input text.
If capitalize is true, then the first letter of the text will be capitalized. This is typically the desired behavior if the input is a complete sentence.
If title is true, then words will be capitalized as in a book title. This means most words will be capitalized, but some parts of speech (particles, endings) will not.
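As a rough sketch of the title rule, the following works on (romaji, part-of-speech) pairs instead of real tokenizer nodes. The title_case helper and the sample tags are assumptions for illustration; the rule it applies is that particles (助詞), auxiliaries (助動詞), and suffixes (接尾辞) stay lower-case, as does a word that directly follows a prefix (接頭辞).

```python
# Parts of speech that stay lower-case in title mode.
LOWERCASE_POS = ("助詞", "助動詞", "接尾辞")


def title_case(words):
    """Capitalize (romaji, pos1) pairs the way a book title would be."""
    out = []
    prev_pos = None
    for roma, pos in words:
        after_prefix = prev_pos == "接頭辞"
        if pos not in LOWERCASE_POS and not after_prefix:
            roma = roma.title()
        out.append(roma)
        prev_pos = pos
    return out


words = [
    ("kanojo", "代名詞"),
    ("wa", "助詞"),
    ("tegami", "名詞"),
    ("wo", "助詞"),
    ("yomiageta", "動詞"),
]
print(title_case(words))
```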
    def romaji_word(self, word):
        """Return the romaji for a single word (node)."""

        if word.surface in self.exceptions:
            return self.exceptions[word.surface]

        if word.surface.isdigit():
            return word.surface

        if word.surface.isascii():
            return word.surface

        # deal with unks first
        if word.is_unk:
            # at this point it is presumably an unk
            # Check character type using the values defined in char.def.
            # This is constant across unidic versions so far but not guaranteed.
            if word.char_type in (CHAR_HIRAGANA, CHAR_KATAKANA):
                kana = jaconv.kata2hira(word.surface)
                return self.map_kana(kana)

            # At this point this is an unknown word and not kana. Could be
            # unknown kanji, could be hangul, cyrillic, something else.
            # By default ensure ascii by replacing with ?, but allow pass-through.
            if self.ensure_ascii:
                out = '?' * len(word.surface)
                return out
            else:
                return word.surface

        if word.feature.pos1 == '補助記号':
            # If it's punctuation we don't recognize, just discard it
            return self.table.get(word.surface, '')
        elif (self.use_wa and
                word.feature.pos1 == '助詞' and word.feature.pron == 'ワ'):
            return 'wa'
        elif (not self.use_he and
                word.feature.pos1 == '助詞' and word.feature.pron == 'エ'):
            return 'e'
        elif (not self.use_wo and
                word.feature.pos1 == '助詞' and word.feature.pron == 'オ'):
            return 'o'
        elif (self.use_foreign_spelling and
                has_foreign_lemma(word)):
            # this is a foreign word with known spelling
            return word.feature.lemma.split('-')[-1]
        elif word.feature.kana:
            # for known words
            kana = jaconv.kata2hira(word.feature.kana)
            return self.map_kana(kana)
        else:
            # unclear when we would actually get here
            return word.surface
Return the romaji for a single word (node).
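One branch worth seeing in isolation is the fallback for unknown words that are not kana, which is governed by ensure_ascii. A minimal sketch, with fallback_unknown as a hypothetical helper name:

```python
def fallback_unknown(surface: str, ensure_ascii: bool = True) -> str:
    """Handle an unknown, non-kana word.

    With ensure_ascii, emit one '?' per character; otherwise pass the
    original text through unchanged.
    """
    if ensure_ascii:
        return "?" * len(surface)
    return surface


print(fallback_unknown("한글"))
print(fallback_unknown("한글", ensure_ascii=False))
```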
    def map_kana(self, kana):
        """Given a list of kana, convert them to romaji.

        The exact romaji resulting from a kana sequence depend on the preceding
        or following kana, so this handles that conversion.
        """
        out = ''
        for ki, char in enumerate(kana):
            nk = kana[ki + 1] if ki < len(kana) - 1 else None
            pk = kana[ki - 1] if ki > 0 else None
            out += self.get_single_mapping(pk, char, nk)
        return out
Given a list of kana, convert them to romaji.
The exact romaji resulting from a kana sequence depends on the preceding or following kana, so this handles that conversion.
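Digraphs such as きゃ are a simple case of this context dependence: the pair has its own table entry, so the first kana emits nothing and the second emits the combined reading. Below is a minimal sketch with a hypothetical four-entry table; map_with_digraphs is an illustrative name and covers only this one rule.

```python
# Hypothetical miniature table; a real table covers all kana and digraphs.
TABLE = {"き": "ki", "ゃ": "ya", "く": "ku", "きゃ": "kya"}


def map_with_digraphs(kana: str) -> str:
    """Map kana to romaji, preferring two-kana entries where they exist."""
    out = ""
    for i, kk in enumerate(kana):
        pk = kana[i - 1] if i > 0 else None
        nk = kana[i + 1] if i < len(kana) - 1 else None
        if pk and (pk + kk) in TABLE:
            # Second half of a digraph: emit the combined reading here.
            out += TABLE[pk + kk]
        elif nk and (kk + nk) in TABLE:
            # First half of a digraph: emitted together with the next kana.
            continue
        else:
            out += TABLE[kk]
    return out


print(map_with_digraphs("きゃく"))
print(map_with_digraphs("きく"))
```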
    def get_single_mapping(self, pk, kk, nk):
        """Given a single kana and its neighbors, return the mapped romaji."""
        # handle odoriji
        # NOTE: This is very rarely useful at present because odoriji are not
        # left in readings for dictionary words, and we can't follow kana
        # across word boundaries.
        if kk in ODORI:
            if kk in 'ゝヽ':
                if pk: return pk
                else: return '' # invalid but be nice
            if kk in 'ゞヾ': # repeat with voicing
                if not pk: return ''
                vv = add_dakuten(pk)
                if vv: return self.table[vv]
                else: return ''
            # remaining are 々 for kanji and 〃 for symbols, but we can't
            # infer their span reliably (or handle rendaku)
            return ''


        # handle digraphs
        if pk and (pk + kk) in self.table:
            return self.table[pk + kk]
        if nk and (kk + nk) in self.table:
            return ''

        if nk and nk in SUTEGANA:
            if kk == 'っ': return '' # never valid, just ignore
            return self.table[kk][:-1] + self.table[nk]
        if kk in SUTEGANA:
            return ''

        if kk == 'ー': # 長音符
            if pk and pk in self.table: return self.table[pk][-1]
            else: return '-'

        if kk == 'っ':
            if nk:
                if self.use_tch and nk == 'ち': return 't'
                elif nk in 'あいうえおっ': return '-'
                else: return self.table[nk][0] # first character
            else:
                # seems like it should never happen, but 乗っ|た is two tokens
                # so leave this as is and pick it up at the word level
                return 'っ'

        if kk == 'ん':
            if nk and nk in 'あいうえおやゆよ': return "n'"
            else: return 'n'

        return self.table[kk]
Given a single kana and its neighbors, return the mapped romaji.
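The neighbor-dependent rules for っ, ん, and ー can be sketched with a small table. This is a simplified illustration, assuming a hypothetical Hepburn-style mini table and omitting odoriji, digraphs, and small kana; map_one and map_kana_mini are not cutlet functions.

```python
# Hypothetical Hepburn-style mini table for the examples below.
TABLE = {"あ": "a", "い": "i", "か": "ka", "き": "ki", "ち": "chi",
         "て": "te", "の": "no", "ま": "ma", "れ": "re"}


def map_one(pk, kk, nk, use_tch=True):
    """Map one kana given its previous (pk) and next (nk) neighbors."""
    if kk == "ー":  # long vowel mark repeats the previous vowel
        return TABLE[pk][-1] if pk in TABLE else "-"
    if kk == "っ":  # sokuon doubles the next consonant
        if nk is None:
            return "っ"  # word-final: left for the word level to resolve
        if use_tch and nk == "ち":
            return "t"
        if nk in "あいうえおっ":
            return "-"
        return TABLE[nk][0]
    if kk == "ん":  # apostrophe keeps n separate from a following vowel or y
        return "n'" if nk and nk in "あいうえおやゆよ" else "n"
    return TABLE[kk]


def map_kana_mini(kana, use_tch=True):
    out = ""
    for i, kk in enumerate(kana):
        pk = kana[i - 1] if i > 0 else None
        nk = kana[i + 1] if i < len(kana) - 1 else None
        out += map_one(pk, kk, nk, use_tch)
    return out


for sample in ("きって", "まっち", "かんい", "かれー"):
    print(sample, map_kana_mini(sample))
```

Returning the raw っ for a word-final sokuon keeps the decision with the caller, which can see the first letter of the next token.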