cutlet

[Image: cutlet, by Irasutoya]

Cutlet is a tool to convert Japanese to romaji. Check out the interactive demo! Also see the docs and the original blog post.

Issues do not have to be written in English. (issueを英語で書く必要はありません。)

Features:

  • support for Modified Hepburn, Kunreisiki, Nihonsiki systems
  • custom overrides for individual mappings
  • custom overrides for specific words
  • built-in exceptions list (Tokyo, Osaka, etc.)
  • uses foreign spelling when available in UniDic
  • proper nouns are capitalized
  • slug mode for URL generation

Things not supported:

  • traditional Hepburn n-to-m: Shimbashi
  • macrons or circumflexes: Tōkyō, Tôkyô
  • passport Hepburn: Satoh (but you can use an exception)
  • hyphenating words
  • Traditional Hepburn in general is not supported

Internally, cutlet uses fugashi, so you can use the same dictionary you use for normal tokenization.

Installation

Cutlet can be installed through pip as usual.

pip install cutlet

Note that if you don't have a MeCab dictionary installed you'll also have to install one. If you're just getting started unidic-lite is a good choice.

pip install unidic-lite

Usage

A command-line script is included for quick testing. Just use cutlet and each line of stdin will be treated as a sentence. You can specify the system to use (hepburn, kunrei, nippon, or nihon) as the first argument.

$ cutlet
ローマ字変換プログラム作ってみた。
Roma ji henkan program tsukutte mita.

In code:

import cutlet
katsu = cutlet.Cutlet()
katsu.romaji("カツカレーは美味しい")
# => 'Cutlet curry wa oishii'

# you can generate a slug suitable for URLs
katsu.slug("カツカレーは美味しい")
# => 'cutlet-curry-wa-oishii'

# You can disable using foreign spelling too
katsu.use_foreign_spelling = False
katsu.romaji("カツカレーは美味しい")
# => 'Katsu karee wa oishii'

# kunreisiki, nihonsiki work too
katu = cutlet.Cutlet('kunrei')
katu.romaji("富士山")
# => 'Huzi yama'

# comparison
nkatu = cutlet.Cutlet('nihon')

sent = "彼女は王への手紙を読み上げた。"
katsu.romaji(sent)
# => 'Kanojo wa ou e no tegami wo yomiageta.'
katu.romaji(sent)
# => 'Kanozyo wa ou e no tegami o yomiageta.'
nkatu.romaji(sent)
# => 'Kanozyo ha ou he no tegami wo yomiageta.'

Alternatives

  • kakasi: Historically important, but not updated since 2014.
  • pykakasi: self contained, it does segmentation on its own and uses its own dictionary.
  • kuroshiro: Javascript based.
  • kana: Go based.
"""
.. include:: ../README.md
"""

from .cutlet import *

__all__ = ("Cutlet",)
class Cutlet:
# Excerpt from cutlet.py. The imports (re, fugashi, jaconv) and the module-level
# names used below (SYSTEMS, Token, load_exceptions, normalize_text, ...) are
# defined or imported elsewhere in the module and are not shown here.
class Cutlet:
    def __init__(
            self,
            system='hepburn',
            use_foreign_spelling=True,
            ensure_ascii=True,
            mecab_args="",
    ):
        """Create a Cutlet object, which holds configuration as well as
        tokenizer state.

        `system` is `hepburn` by default, and may also be `kunrei` or
        `nihon`. `nippon` is permitted as a synonym for `nihon`.

        If `use_foreign_spelling` is true, output will use the foreign spelling
        provided in a UniDic lemma when available. For example, "カツ" will
        become "cutlet" instead of "katsu".

        If `ensure_ascii` is true, any non-ASCII characters that can't be
        romanized will be replaced with `?`. If false, they will be passed
        through.

        Typical usage:

        ```python
        katsu = Cutlet()
        roma = katsu.romaji("カツカレーを食べた")
        # "Cutlet curry wo tabeta"
        ```
        """
        # allow 'nippon' for 'nihon'
        if system == 'nippon': system = 'nihon'
        self.system = system
        try:
            # make a copy so we can modify it
            self.table = dict(SYSTEMS[system])
        except KeyError:
            print("unknown system: {}".format(system))
            raise

        self.tagger = fugashi.Tagger(mecab_args)
        self.exceptions = load_exceptions()

        # these are too minor to be worth exposing as arguments
        self.use_tch = (self.system in ('hepburn',))
        self.use_wa  = (self.system in ('hepburn', 'kunrei'))
        self.use_he  = (self.system in ('nihon',))
        self.use_wo  = (self.system in ('hepburn', 'nihon'))

        self.use_foreign_spelling = use_foreign_spelling
        self.ensure_ascii = ensure_ascii

    def add_exception(self, key, val):
        """Add an exception to the internal list.

        An exception overrides a whole token, for example to replace "Toukyou"
        with "Tokyo". Note that it must match the tokenizer output and be a
        single token to work. To replace longer phrases, you'll need to use a
        different strategy, like string replacement.
        """
        self.exceptions[key] = val

    def update_mapping(self, key, val):
        """Update mapping table for a single kana.

        This can be used to mix common systems, or to modify particular
        details. For example, you can use `update_mapping("ぢ", "di")` to
        differentiate ぢ and じ in Hepburn.

        Example usage:

        ```
        cut = Cutlet()
        cut.romaji("お茶漬け") # Ochazuke
        cut.update_mapping("づ", "du")
        cut.romaji("お茶漬け") # Ochaduke
        ```
        """
        self.table[key] = val

    def slug(self, text):
        """Generate a URL-friendly slug.

        After converting the input to romaji using `Cutlet.romaji` and making
        the result lower-case, any runs of non alpha-numeric characters are
        replaced with a single hyphen. Any leading or trailing hyphens are
        stripped.
        """
        roma = self.romaji(text).lower()
        slug = re.sub(r'[^a-z0-9]+', '-', roma).strip('-')
        return slug

    def romaji_tokens(self, words, capitalize=True, title=False):
        """Build a list of tokens from input nodes.

        If `capitalize` is true, then the first letter of the first token will be
        capitalized. This is typically the desired behavior if the input is a
        complete sentence.

        If `title` is true, then words will be capitalized as in a book title.
        This means most words will be capitalized, but some parts of speech
        (particles, endings) will not.

        If the text was not normalized before being tokenized, the output is
        undefined. For details of normalization, see `normalize_text`.

        The number of output tokens will equal the number of input nodes.
        """

        out = []

        for wi, word in enumerate(words):
            po = out[-1] if out else None
            pw = words[wi - 1] if wi > 0 else None
            nw = words[wi + 1] if wi < len(words) - 1 else None

            # handle possessive apostrophe as a special case
            if (word.surface == "'" and
                    (nw and nw.char_type == CHAR_ALPHA and not nw.white_space) and
                    not word.white_space):
                # remove preceding space
                if po:
                    po.space = False
                out.append(Token(word.surface, False))
                continue

            # resolve split verbs / adjectives
            roma = self.romaji_word(word)
            if roma and po and po.surface and po.surface[-1] == 'っ':
                po.surface = po.surface[:-1] + roma[0]
            if word.feature.pos2 == '固有名詞':
                roma = roma.title()
            if (title and
                word.feature.pos1 not in ('助詞', '助動詞', '接尾辞') and
                not (pw and pw.feature.pos1 == '接頭辞')):
                roma = roma.title()

            foreign = self.use_foreign_spelling and has_foreign_lemma(word)
            tok = Token(roma, False, foreign)
            # handle punctuation with atypical spacing
            if word.surface in '「『':
                if po:
                    po.space = True
                out.append(tok)
                continue
            if roma in '([':
                if po:
                    po.space = True
                out.append(tok)
                continue
            if roma == '/':
                out.append(tok)
                continue

            out.append(tok)

            # no space sometimes
            # お酒 -> osake
            if word.feature.pos1 == '接頭辞': continue
            # 今日、 -> kyou, ; 図書館 -> toshokan
            if nw and nw.feature.pos1 in ('補助記号', '接尾辞'): continue
            # special case for half-width commas
            if nw and nw.surface == ',': continue
            # 思えば -> omoeba
            # (one-element tuple; without the trailing comma this was a
            # substring test against a plain string)
            if nw and nw.feature.pos2 in ('接続助詞',): continue
            # 333 -> 333 ; this should probably be handled in mecab
            if (word.surface.isdigit() and
                    nw and nw.surface.isdigit()):
                continue
            # そうでした -> sou deshita
            if (nw and word.feature.pos1 in ('動詞', '助動詞', '形容詞')
                   and nw.feature.pos1 == '助動詞'
                   and nw.surface != 'です'):
                continue

            # if we get here, it does need a space
            tok.space = True

        # remove any leftover っ
        for tok in out:
            tok.surface = tok.surface.replace("っ", "")

        # capitalize the first letter
        if capitalize and out and out[0].surface:
            ss = out[0].surface
            out[0].surface = ss[0].capitalize() + ss[1:]
        return out

    def romaji(self, text, capitalize=True, title=False):
        """Build a complete string from input text.

        If `capitalize` is true, then the first letter of the text will be
        capitalized. This is typically the desired behavior if the input is a
        complete sentence.

        If `title` is true, then words will be capitalized as in a book title.
        This means most words will be capitalized, but some parts of speech
        (particles, endings) will not.
        """
        if not text:
            return ''

        text = normalize_text(text)
        words = self.tagger(text)

        tokens = self.romaji_tokens(words, capitalize, title)
        out = ''.join([str(tok) for tok in tokens]).strip()
        return out

    def romaji_word(self, word):
        """Return the romaji for a single word (node)."""

        if word.surface in self.exceptions:
            return self.exceptions[word.surface]

        if word.surface.isdigit():
            return word.surface

        if word.surface.isascii():
            return word.surface

        # deal with unks first
        if word.is_unk:
            # at this point it is presumably an unk
            # Check character type using the values defined in char.def.
            # This is constant across unidic versions so far but not guaranteed.
            if word.char_type in (CHAR_HIRAGANA, CHAR_KATAKANA):
                kana = jaconv.kata2hira(word.surface)
                return self.map_kana(kana)

            # At this point this is an unknown word and not kana. Could be
            # unknown kanji, could be hangul, cyrillic, something else.
            # By default ensure ascii by replacing with ?, but allow pass-through.
            if self.ensure_ascii:
                out = '?' * len(word.surface)
                return out
            else:
                return word.surface

        if word.feature.pos1 == '補助記号':
            # If it's punctuation we don't recognize, just discard it
            return self.table.get(word.surface, '')
        elif (self.use_wa and
                word.feature.pos1 == '助詞' and word.feature.pron == 'ワ'):
            return 'wa'
        elif (not self.use_he and
                word.feature.pos1 == '助詞' and word.feature.pron == 'エ'):
            return 'e'
        elif (not self.use_wo and
                word.feature.pos1 == '助詞' and word.feature.pron == 'オ'):
            return 'o'
        elif (self.use_foreign_spelling and
                has_foreign_lemma(word)):
            # this is a foreign word with known spelling
            return word.feature.lemma.split('-')[-1]
        elif word.feature.kana:
            # for known words
            kana = jaconv.kata2hira(word.feature.kana)
            return self.map_kana(kana)
        else:
            # unclear when we would actually get here
            return word.surface

    def map_kana(self, kana):
        """Given a list of kana, convert them to romaji.

        The exact romaji resulting from a kana sequence depends on the preceding
        or following kana, so this handles that conversion.
        """
        out = ''
        for ki, char in enumerate(kana):
            nk = kana[ki + 1] if ki < len(kana) - 1 else None
            pk = kana[ki - 1] if ki > 0 else None
            out += self.get_single_mapping(pk, char, nk)
        return out

    def get_single_mapping(self, pk, kk, nk):
        """Given a single kana and its neighbors, return the mapped romaji."""
        # handle odoriji
        # NOTE: This is very rarely useful at present because odoriji are not
        # left in readings for dictionary words, and we can't follow kana
        # across word boundaries.
        if kk in ODORI:
            if kk in 'ゝヽ':
                if pk: return pk
                else: return '' # invalid but be nice
            if kk in 'ゞヾ': # repeat with voicing
                if not pk: return ''
                vv = add_dakuten(pk)
                if vv: return self.table[vv]
                else: return ''
            # remaining are 々 for kanji and 〃 for symbols, but we can't
            # infer their span reliably (or handle rendaku)
            return ''

        # handle digraphs
        if pk and (pk + kk) in self.table:
            return self.table[pk + kk]
        if nk and (kk + nk) in self.table:
            return ''

        if nk and nk in SUTEGANA:
            if kk == 'っ': return '' # never valid, just ignore
            return self.table[kk][:-1] + self.table[nk]
        if kk in SUTEGANA:
            return ''

        if kk == 'ー': # 長音符 (long vowel mark)
            if pk and pk in self.table: return self.table[pk][-1]
            else: return '-'

        if kk == 'っ':
            if nk:
                if self.use_tch and nk == 'ち': return 't'
                elif nk in 'あいうえおっ': return '-'
                else: return self.table[nk][0] # first character
            else:
                # seems like it should never happen, but 乗っ|た is two tokens
                # so leave this as is and pick it up at the word level
                return 'っ'

        if kk == 'ん':
            if nk and nk in 'あいうえおやゆよ': return "n'"
            else: return 'n'

        return self.table[kk]
Cutlet(system='hepburn', use_foreign_spelling=True, ensure_ascii=True, mecab_args='')

Create a Cutlet object, which holds configuration as well as tokenizer state.

system is hepburn by default, and may also be kunrei or nihon. nippon is permitted as a synonym for nihon.

If use_foreign_spelling is true, output will use the foreign spelling provided in a UniDic lemma when available. For example, "カツ" will become "cutlet" instead of "katsu".

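If ensure_ascii is true, any non-ASCII characters that can't be romanized will be replaced with ?. If false, they will be passed through.

mecab_args is passed unchanged to fugashi.Tagger, so it can carry MeCab options, for example to select a dictionary.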

Typical usage:

katsu = Cutlet()
roma = katsu.romaji("カツカレーを食べた")
# "Cutlet curry wo tabeta"
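
How the choice of system changes particle spelling can be sketched without a tokenizer. The helper below is hypothetical (`particle_flags` is not part of cutlet); it only mirrors the flag logic visible in the constructor source, as a rough illustration rather than the library's API.

```python
# Illustrative sketch only: `particle_flags` is a made-up helper, not a cutlet function.

def particle_flags(system="hepburn"):
    """Return the spelling switches implied by a romanization system name."""
    # 'nippon' is accepted as a synonym for 'nihon'
    if system == "nippon":
        system = "nihon"
    if system not in ("hepburn", "kunrei", "nihon"):
        raise KeyError(system)
    return {
        "system": system,
        "use_tch": system in ("hepburn",),          # っち becomes "tchi"
        "use_wa": system in ("hepburn", "kunrei"),  # particle は becomes "wa"
        "use_he": system in ("nihon",),             # particle へ stays "he"
        "use_wo": system in ("hepburn", "nihon"),   # particle を stays "wo"
    }


print(particle_flags("nippon"))
```

This matches the comparison in the README: Hepburn gives "wa / e / wo", Kunreisiki "wa / e / o", and Nihonsiki "ha / he / wo".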

Instance attributes: system, tagger, exceptions, use_tch, use_wa, use_he, use_wo, use_foreign_spelling, ensure_ascii

def add_exception(self, key, val):

Add an exception to the internal list.

An exception overrides a whole token, for example to replace "Toukyou" with "Tokyo". Note that it must match the tokenizer output and be a single token to work. To replace longer phrases, you'll need to use a different strategy, like string replacement.
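
The single-token restriction is easier to see with a toy model. In this sketch the tokens are plain strings and `fallback` is a made-up table of readings; both are assumptions for illustration, since the real class gets its tokens from the tokenizer and looks exceptions up by each token's surface form.

```python
# Sketch only: plain strings stand in for tokenizer nodes.
exceptions = {"東京": "Tokyo"}  # whole-token overrides, keyed by surface form

# hypothetical per-token romaji standing in for the normal kana mapping
fallback = {"東京": "toukyou", "タワー": "tawaa"}


def romanize_tokens(surfaces):
    """Use an exception when a surface matches exactly, else the fallback."""
    return [exceptions.get(s, fallback[s]) for s in surfaces]


print(romanize_tokens(["東京", "タワー"]))  # ['Tokyo', 'tawaa']

# A key spanning two tokens never equals any single surface, so it is ignored.
exceptions["東京タワー"] = "Tokyo Tower"
print(romanize_tokens(["東京", "タワー"]))  # ['Tokyo', 'tawaa']

# For longer phrases, plain string replacement on the output is one option.
joined = " ".join(romanize_tokens(["東京", "タワー"]))
print(joined.replace("Tokyo tawaa", "Tokyo Tower"))  # Tokyo Tower
```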

def update_mapping(self, key, val):

Update mapping table for a single kana.

This can be used to mix common systems, or to modify particular details. For example, you can use update_mapping("ぢ", "di") to differentiate ぢ and じ in Hepburn.

Example usage:

cut = Cutlet()
cut.romaji("お茶漬け") # Ochazuke
cut.update_mapping("づ", "du")
cut.romaji("お茶漬け") # Ochaduke
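
Because the constructor copies the system table before storing it, an override should only affect the instance it is applied to. A minimal sketch of that idea, assuming a tiny stand-in table (`BASE_TABLE`) and a simplified converter (`MiniMapper`), neither of which exists in cutlet:

```python
# Sketch only: BASE_TABLE is a tiny stand-in for one of the system tables.
BASE_TABLE = {"お": "o", "ち": "chi", "ゃ": "ya", "ちゃ": "cha", "づ": "zu", "け": "ke"}


class MiniMapper:
    def __init__(self):
        # copy the shared table so that overrides stay local to this instance
        self.table = dict(BASE_TABLE)

    def update_mapping(self, key, val):
        self.table[key] = val

    def convert(self, kana):
        # simplified: prefer two-character entries (digraphs), else single kana
        out, i = "", 0
        while i < len(kana):
            pair = kana[i:i + 2]
            if len(pair) == 2 and pair in self.table:
                out += self.table[pair]
                i += 2
            else:
                out += self.table[kana[i]]
                i += 1
        return out


a, b = MiniMapper(), MiniMapper()
a.update_mapping("づ", "du")
print(a.convert("おちゃづけ"))  # ochaduke
print(b.convert("おちゃづけ"))  # ochazuke
```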
def slug(self, text):

Generate a URL-friendly slug.

After converting the input to romaji using Cutlet.romaji and making the result lower-case, any runs of non alpha-numeric characters are replaced with a single hyphen. Any leading or trailing hyphens are stripped.
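
The slug step itself needs nothing beyond the standard library. Here it is applied to text that is already romaji; `slugify` is a hypothetical standalone name, and the regular expression is the one used in the source.

```python
import re


def slugify(roma):
    """Lower-case romaji, collapse non-alphanumeric runs to '-', trim hyphens."""
    lowered = roma.lower()
    return re.sub(r'[^a-z0-9]+', '-', lowered).strip('-')


print(slugify("Cutlet curry wa oishii"))    # cutlet-curry-wa-oishii
print(slugify("  Kyou, ashita... 2024! "))  # kyou-ashita-2024
```

Since only a-z and 0-9 survive, any character outside that range is folded into a hyphen as well. This includes non-ASCII text that was passed through with ensure_ascii disabled.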

def romaji_tokens(self, words, capitalize=True, title=False):

Build a list of tokens from input nodes.

If capitalize is true, then the first letter of the first token will be capitalized. This is typically the desired behavior if the input is a complete sentence.

If title is true, then words will be capitalized as in a book title. This means most words will be capitalized, but some parts of speech (particles, endings) will not.

If the text was not normalized before being tokenized, the output is undefined. For details of normalization, see normalize_text.

The number of output tokens will equal the number of input nodes.
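
Two details of the token pass can be tried in isolation: each token carries its own trailing-space flag, and a token that ends in っ borrows the first letter of the token after it (the 乗っ|た case mentioned in the source comments). `Tok` and `join_tokens` below are hypothetical stand-ins, and the spacing flags are supplied by hand instead of being derived from parts of speech.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    """Hypothetical stand-in for a token: text plus a trailing-space flag."""
    surface: str
    space: bool = False

    def __str__(self):
        return self.surface + (" " if self.space else "")


def join_tokens(romas, spaces):
    out = []
    for roma, space in zip(romas, spaces):
        prev = out[-1] if out else None
        # a token ending in っ takes the first letter of the following token
        if roma and prev and prev.surface and prev.surface[-1] == "っ":
            prev.surface = prev.surface[:-1] + roma[0]
        out.append(Tok(roma, space))
    # drop any っ that never found a following consonant
    for tok in out:
        tok.surface = tok.surface.replace("っ", "")
    return "".join(str(tok) for tok in out).strip()


print(join_tokens(["noっ", "ta", "."], [False, False, False]))  # notta.
print(join_tokens(["kyou", "wa", "hare"], [True, True, False]))  # kyou wa hare
```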

def romaji(self, text, capitalize=True, title=False):

Build a complete string from input text.

If capitalize is true, then the first letter of the text will be capitalized. This is typically the desired behavior if the input is a complete sentence.

If title is true, then words will be capitalized as in a book title. This means most words will be capitalized, but some parts of speech (particles, endings) will not.
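
One small point about capitalize: the source upper-cases only the first character and leaves the rest alone, which is not the same as calling str.capitalize on the whole string. A quick sketch of the difference, where `capitalize_first` is a hypothetical helper name:

```python
def capitalize_first(text):
    """Upper-case only the first character, leaving the rest untouched."""
    if not text:
        return text
    return text[0].capitalize() + text[1:]


s = "kanojo wa Tokyo ni itta."
print(capitalize_first(s))  # Kanojo wa Tokyo ni itta.
print(s.capitalize())       # Kanojo wa tokyo ni itta.
```

Leaving the remainder untouched keeps proper nouns that were already capitalized earlier in the pass.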

def romaji_word(self, word):

Return the romaji for a single word (node).
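
The first few checks in the source run in a fixed order: exceptions, then digits and ASCII, then unknown words. A simplified sketch of that ordering follows. `romanize_surface` is a hypothetical function that works on a bare string with an `is_unknown` flag in place of a tokenizer node, and it skips the kana branch for unknown words.

```python
def romanize_surface(surface, exceptions, is_unknown, ensure_ascii=True):
    """Mimic the early-exit checks for one token; None means 'keep going'."""
    if surface in exceptions:            # whole-token overrides win
        return exceptions[surface]
    if surface.isdigit() or surface.isascii():
        return surface                   # already fine as-is
    if is_unknown:
        # unknown, non-kana text: one '?' per character, or pass-through
        return "?" * len(surface) if ensure_ascii else surface
    return None  # known words continue to particle / lemma / kana handling


print(romanize_surface("東京", {"東京": "Tokyo"}, False))        # Tokyo
print(romanize_surface("Москва", {}, True))                      # ??????
print(romanize_surface("Москва", {}, True, ensure_ascii=False))  # Москва
```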

def map_kana(self, kana):

Given a list of kana, convert them to romaji.

The exact romaji resulting from a kana sequence depends on the preceding or following kana, so this handles that conversion.
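
The neighbor lookup is a plain sliding window over the string. This sketch produces the same (previous, current, next) triples that the loop in the source computes; `with_neighbors` is a hypothetical name.

```python
def with_neighbors(kana):
    """Yield (previous, current, next) per character; the ends get None."""
    for i, char in enumerate(kana):
        prev = kana[i - 1] if i > 0 else None
        nxt = kana[i + 1] if i < len(kana) - 1 else None
        yield prev, char, nxt


print(list(with_neighbors("きって")))
# [(None, 'き', 'っ'), ('き', 'っ', 'て'), ('っ', 'て', None)]
```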

def get_single_mapping(self, pk, kk, nk):

Given a single kana and its neighbors, return the mapped romaji.
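
Three of the context-dependent rules can be tried with a toy table: the long vowel mark, the small っ, and ん. This is a cut-down sketch under stated assumptions. `TABLE` holds only a handful of Hepburn-style entries, `single` and `convert` are hypothetical names, and digraphs, small kana and iteration marks are left out.

```python
# Sketch only: a few Hepburn-style entries, not a real system table.
TABLE = {"き": "ki", "て": "te", "ち": "chi", "ま": "ma", "か": "ka",
         "い": "i", "ら": "ra", "め": "me"}


def single(pk, kk, nk, use_tch=True):
    if kk == "ー":  # long vowel mark repeats the previous vowel
        return TABLE[pk][-1] if pk in TABLE else "-"
    if kk == "っ":  # small tsu doubles the following consonant
        if nk is None:
            return "っ"  # left for the word-level pass to resolve
        if use_tch and nk == "ち":
            return "t"
        if nk in "あいうえおっ":
            return "-"
        return TABLE[nk][0]
    if kk == "ん":  # apostrophe keeps ん apart from a following vowel or y
        return "n'" if nk and nk in "あいうえおやゆよ" else "n"
    return TABLE[kk]


def convert(kana, use_tch=True):
    out = ""
    for i, kk in enumerate(kana):
        pk = kana[i - 1] if i > 0 else None
        nk = kana[i + 1] if i < len(kana) - 1 else None
        out += single(pk, kk, nk, use_tch)
    return out


print(convert("きって"))                 # kitte
print(convert("まっち"))                 # matchi
print(convert("まっち", use_tch=False))  # macchi
print(convert("かんい"))                 # kan'i
print(convert("らーめん"))               # raamen
```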