cutlet


cutlet by Irasutoya

Cutlet is a tool to convert Japanese to romaji. Check out the interactive demo! Also see the docs and the original blog post.

Issues do not need to be written in English.

Features:

  • support for Modified Hepburn, Kunreisiki, Nihonsiki systems
  • custom overrides for individual mappings
  • custom overrides for specific words
  • built-in exceptions list (Tokyo, Osaka, etc.)
  • uses foreign spelling when available in UniDic
  • proper nouns are capitalized
  • slug mode for URL generation

Things not supported:

  • traditional Hepburn n-to-m: Shimbashi
  • macrons or circumflexes: Tōkyō, Tôkyô
  • passport Hepburn: Satoh (but you can use an exception)
  • hyphenating words
  • Traditional Hepburn in general

Internally, cutlet uses fugashi, so you can use the same dictionary you use for normal tokenization.

Installation

Cutlet can be installed through pip as usual.

pip install cutlet

Note that if you don't have a MeCab dictionary installed you'll also have to install one. If you're just getting started unidic-lite is a good choice.

pip install unidic-lite

Usage

A command-line script is included for quick testing. Run cutlet, and each line of stdin will be treated as a sentence. You can specify the system to use (hepburn, kunrei, nippon, or nihon) as the first argument.

$ cutlet
ローマ字変換プログラム作ってみた。
Roma ji henkan program tsukutte mita.

In code:

import cutlet
katsu = cutlet.Cutlet()
katsu.romaji("カツカレーは美味しい")
# => 'Cutlet curry wa oishii'

# you can print a slug suitable for urls
katsu.slug("カツカレーは美味しい")
# => 'cutlet-curry-wa-oishii'

# You can disable using foreign spelling too
katsu.use_foreign_spelling = False
katsu.romaji("カツカレーは美味しい")
# => 'Katsu karee wa oishii'

# kunreisiki, nihonsiki work too
katu = cutlet.Cutlet('kunrei')
katu.romaji("富士山")
# => 'Huzi yama'

# comparison
nkatu = cutlet.Cutlet('nihon')

sent = "彼女は王への手紙を読み上げた。"
katsu.romaji(sent)
# => 'Kanojo wa ou e no tegami wo yomiageta.'
katu.romaji(sent)
# => 'Kanozyo wa ou e no tegami o yomiageta.'
nkatu.romaji(sent)
# => 'Kanozyo ha ou he no tegami wo yomiageta.'

Alternatives

  • kakasi: Historically important, but not updated since 2014.
  • pykakasi: Self-contained; it does its own segmentation and uses its own dictionary.
  • kuroshiro: JavaScript-based.
  • kana: Go-based.
"""
.. include:: ../README.md
"""

from .cutlet import *

__all__ = ("Cutlet",)
class Cutlet:
    def __init__(
            self,
            system = 'hepburn',
            use_foreign_spelling = True,
            ensure_ascii = True,
            mecab_args = "",
):
        """Create a Cutlet object, which holds configuration as well as
        tokenizer state.

        `system` is `hepburn` by default, and may also be `kunrei` or
        `nihon`. `nippon` is permitted as a synonym for `nihon`.

        If `use_foreign_spelling` is true, output will use the foreign spelling
        provided in a UniDic lemma when available. For example, "カツ" will
        become "cutlet" instead of "katsu".

        If `ensure_ascii` is true, any non-ASCII characters that can't be
        romanized will be replaced with `?`. If false, they will be passed
        through.

        Typical usage:

        ```python
        katsu = Cutlet()
        roma = katsu.romaji("カツカレーを食べた")
        # "Cutlet curry wo tabeta"
        ```
        """
        # allow 'nippon' for 'nihon'
        if system == 'nippon': system = 'nihon'
        self.system = system
        try:
            # make a copy so we can modify it
            self.table = dict(SYSTEMS[system])
        except KeyError:
            print("unknown system: {}".format(system))
            raise

        self.tagger = fugashi.Tagger(mecab_args)
        self.exceptions = load_exceptions()

        # these are too minor to be worth exposing as arguments
        self.use_tch = (self.system in ('hepburn',))
        self.use_wa  = (self.system in ('hepburn', 'kunrei'))
        self.use_he  = (self.system in ('nihon',))
        self.use_wo  = (self.system in ('hepburn', 'nihon'))

        self.use_foreign_spelling = use_foreign_spelling
        self.ensure_ascii = ensure_ascii

    def add_exception(self, key, val):
        """Add an exception to the internal list.

        An exception overrides a whole token, for example to replace "Toukyou"
        with "Tokyo". Note that it must match the tokenizer output and be a
        single token to work. To replace longer phrases, you'll need to use a
        different strategy, like string replacement.
        """
        self.exceptions[key] = val

    def update_mapping(self, key, val):
        """Update mapping table for a single kana.

        This can be used to mix common systems, or to modify particular
        details. For example, you can use `update_mapping("ぢ", "di")` to
        differentiate ぢ and じ in Hepburn.

        Example usage:

        ```
        cut = Cutlet()
        cut.romaji("お茶漬け") # Ochazuke
        cut.update_mapping("づ", "du")
        cut.romaji("お茶漬け") # Ochaduke
        ```
        """
        self.table[key] = val

    def slug(self, text):
        """Generate a URL-friendly slug.

        After converting the input to romaji using `Cutlet.romaji` and making
        the result lower-case, any runs of non-alphanumeric characters are
        replaced with a single hyphen. Any leading or trailing hyphens are
        stripped.
        """
        roma = self.romaji(text).lower()
        slug = re.sub(r'[^a-z0-9]+', '-', roma).strip('-')
        return slug

    def romaji_tokens(self, words, capitalize=True, title=False):
        """Build a list of tokens from input nodes.

        If `capitalize` is true, then the first letter of the first token will be
        capitalized. This is typically the desired behavior if the input is a
        complete sentence.

        If `title` is true, then words will be capitalized as in a book title.
        This means most words will be capitalized, but some parts of speech
        (particles, endings) will not.

        If the text was not normalized before being tokenized, the output is
        undefined. For details of normalization, see `normalize_text`.

        The number of output tokens will equal the number of input nodes.
        """

        out = []

        for wi, word in enumerate(words):
            po = out[-1] if out else None
            pw = words[wi - 1] if wi > 0 else None
            nw = words[wi + 1] if wi < len(words) - 1 else None

            # handle possessive apostrophe as a special case
            if (word.surface == "'" and
                    (nw and nw.char_type == CHAR_ALPHA and not nw.white_space) and
                    not word.white_space):
                # remove preceding space
                if po:
                    po.space = False
                out.append(Token(word.surface, False))
                continue

            # resolve split verbs / adjectives
            roma = self.romaji_word(word)
            if roma and po and po.surface and po.surface[-1] == 'っ':
                po.surface = po.surface[:-1] + roma[0]
            if word.feature.pos2 == '固有名詞':
                roma = roma.title()
            if (title and
                word.feature.pos1 not in ('助詞', '助動詞', '接尾辞') and
                not (pw and pw.feature.pos1 == '接頭辞')):
                roma = roma.title()

            foreign = self.use_foreign_spelling and has_foreign_lemma(word)
            tok = Token(roma, False, foreign)
            # handle punctuation with atypical spacing
            if word.surface in '「『':
                if po:
                    po.space = True
                out.append(tok)
                continue
            if roma in '([':
                if po:
                    po.space = True
                out.append(tok)
                continue
            if roma == '/':
                out.append(tok)
                continue

            # preserve spaces between ascii tokens
            if (word.surface.isascii() and
                nw and nw.surface.isascii()):
                use_space = bool(nw.white_space)
                out.append(Token(word.surface, use_space))
                continue

            out.append(tok)

            # no space sometimes
            # お酒 -> osake
            if word.feature.pos1 == '接頭辞': continue
            # 今日、 -> kyou, ; 図書館 -> toshokan
            if nw and nw.feature.pos1 in ('補助記号', '接尾辞'): continue
            # special case for half-width commas
            if nw and nw.surface == ',': continue
            # special case for prefixes
            if foreign and roma[-1] == "-": continue
            # 思えば -> omoeba
            if nw and nw.feature.pos2 in ('接続助詞',): continue
            # 333 -> 333 ; this should probably be handled in mecab
            if (word.surface.isdigit() and
                    nw and nw.surface.isdigit()):
                continue
            # そうでした -> sou deshita
            if (nw and word.feature.pos1 in ('動詞', '助動詞', '形容詞')
                   and nw.feature.pos1 == '助動詞'
                   and nw.surface != 'です'):
                continue

            # if we get here, it does need a space
            tok.space = True

        # remove any leftover っ
        for tok in out:
            tok.surface = tok.surface.replace("っ", "")

        # capitalize the first letter
        if capitalize and out and out[0].surface:
            ss = out[0].surface
            out[0].surface = ss[0].capitalize() + ss[1:]
        return out

    def romaji(self, text, capitalize=True, title=False):
        """Build a complete string from input text.

        If `capitalize` is true, then the first letter of the text will be
        capitalized. This is typically the desired behavior if the input is a
        complete sentence.

        If `title` is true, then words will be capitalized as in a book title.
        This means most words will be capitalized, but some parts of speech
        (particles, endings) will not.
        """
        if not text:
            return ''

        text = normalize_text(text)
        words = self.tagger(text)

        tokens = self.romaji_tokens(words, capitalize, title)
        out = ''.join([str(tok) for tok in tokens]).strip()
        return out

    def romaji_word(self, word):
        """Return the romaji for a single word (node)."""

        if word.surface in self.exceptions:
            return self.exceptions[word.surface]

        if word.surface.isdigit():
            return word.surface

        if word.surface.isascii():
            return word.surface

        # deal with unks first
        if word.is_unk:
            # at this point it is presumably an unk
            # Check character type using the values defined in char.def.
            # This is constant across unidic versions so far but not guaranteed.
            if word.char_type in (CHAR_HIRAGANA, CHAR_KATAKANA):
                kana = jaconv.kata2hira(word.surface)
                return self.map_kana(kana)

            # At this point this is an unknown word and not kana. Could be
            # unknown kanji, could be hangul, cyrillic, something else.
            # By default ensure ascii by replacing with ?, but allow pass-through.
            if self.ensure_ascii:
                out = '?' * len(word.surface)
                return out
            else:
                return word.surface

        if word.feature.pos1 == '補助記号':
            # If it's punctuation we don't recognize, just discard it
            return self.table.get(word.surface, '')
        elif (self.use_wa and
                word.feature.pos1 == '助詞' and word.feature.pron == 'ワ'):
            return 'wa'
        elif (not self.use_he and
                word.feature.pos1 == '助詞' and word.feature.pron == 'エ'):
            return 'e'
        elif (not self.use_wo and
                word.feature.pos1 == '助詞' and word.feature.pron == 'オ'):
            return 'o'
        elif (self.use_foreign_spelling and
                has_foreign_lemma(word)):
            # this is a foreign word with known spelling
            return word.feature.lemma.split('-', 1)[-1]
        elif word.feature.kana:
            # for known words
            kana = jaconv.kata2hira(word.feature.kana)
            return self.map_kana(kana)
        else:
            # unclear when we would actually get here
            return word.surface

    def map_kana(self, kana):
        """Given a string of kana, convert it to romaji.

        The romaji for a given kana can depend on the preceding or following
        kana, so this handles that context-dependent conversion.
        """
        out = ''
        for ki, char in enumerate(kana):
            nk = kana[ki + 1] if ki < len(kana) - 1 else None
            pk = kana[ki - 1] if ki > 0 else None
            out += self.get_single_mapping(pk, char, nk)
        return out

    def get_single_mapping(self, pk, kk, nk):
        """Given a single kana and its neighbors, return the mapped romaji."""
        # handle odoriji
        # NOTE: This is very rarely useful at present because odoriji are not
        # left in readings for dictionary words, and we can't follow kana
        # across word boundaries.
        if kk in ODORI:
            if kk in 'ゝヽ':
                if pk: return pk
                else: return '' # invalid but be nice
            if kk in 'ゞヾ': # repeat with voicing
                if not pk: return ''
                vv = add_dakuten(pk)
                if vv: return self.table[vv]
                else: return ''
            # remaining are 々 for kanji and 〃 for symbols, but we can't
            # infer their span reliably (or handle rendaku)
            return ''

        # handle digraphs
        if pk and (pk + kk) in self.table:
            return self.table[pk + kk]
        if nk and (kk + nk) in self.table:
            return ''

        if nk and nk in SUTEGANA:
            if kk == 'っ': return '' # never valid, just ignore
            return self.table[kk][:-1] + self.table[nk]
        if kk in SUTEGANA:
            return ''

        if kk == 'ー': # 長音符
            if pk and pk in self.table: return self.table[pk][-1]
            else: return '-'

        if kk == 'っ':
            if nk:
                if self.use_tch and nk == 'ち': return 't'
                elif nk in 'あいうえおっ': return '-'
                else: return self.table[nk][0] # first character
            else:
                # seems like it should never happen, but 乗っ|た is two tokens
                # so leave this as is and pick it up at the word level
                return 'っ'

        if kk == 'ん':
            if nk and nk in 'あいうえおやゆよ': return "n'"
            else: return 'n'

        return self.table[kk]
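The neighbor-sensitive mapping above can be seen in a stripped-down, standalone form. The tiny table and function here are illustrative assumptions, reduced to the sokuon (っ) rule; cutlet's real tables live in SYSTEMS and cover the full kana inventory:

```python
# Standalone sketch of the context-dependent kana mapping, sokuon only.
# TABLE is a tiny illustrative subset, not cutlet's real mapping.
TABLE = {'か': 'ka', 'た': 'ta', 'ち': 'chi', 'ん': 'n'}

def map_kana_sketch(kana: str) -> str:
    out = []
    for i, ch in enumerate(kana):
        nk = kana[i + 1] if i + 1 < len(kana) else None
        if ch == 'っ':
            if nk == 'ち':
                out.append('t')           # Hepburn: っち -> tchi
            elif nk in TABLE:
                out.append(TABLE[nk][0])  # double the following consonant
            continue
        out.append(TABLE[ch])
    return ''.join(out)

map_kana_sketch('かった')  # 'katta'
map_kana_sketch('かっち')  # 'katchi'
```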