cutlet
Cutlet is a tool to convert Japanese to romaji. Check out the interactive demo! Also see the docs and the original blog post.
(You do not need to write issues in English.)
Features:
- support for Modified Hepburn, Kunreisiki, and Nihonsiki systems
- custom overrides for individual mappings
- custom overrides for specific words
- built-in exceptions list (Tokyo, Osaka, etc.)
- uses foreign spelling when available in UniDic
- proper nouns are capitalized
- slug mode for URL generation
Things not supported:
- traditional Hepburn n-to-m: Shimbashi
- macrons or circumflexes: Tōkyō, Tôkyô
- passport Hepburn: Satoh (but you can use an exception)
- hyphenating words
- traditional Hepburn in general
Internally, cutlet uses fugashi, so you can use the same dictionary you use for normal tokenization.
Installation
Cutlet can be installed through pip as usual.
```
pip install cutlet
```
Note that if you don't have a MeCab dictionary installed you'll also have to install one. If you're just getting started unidic-lite is a good choice.
```
pip install unidic-lite
```
Usage
A command-line script is included for quick testing. Just run `cutlet`, and each line of stdin will be treated as a sentence. You can specify the system to use (`hepburn`, `kunrei`, `nippon`, or `nihon`) as the first argument.
```
$ cutlet
ローマ字変換プログラム作ってみた。
Roma ji henkan program tsukutte mita.
```
In code:
```python
import cutlet

katsu = cutlet.Cutlet()
katsu.romaji("カツカレーは美味しい")
# => 'Cutlet curry wa oishii'

# you can print a slug suitable for urls
katsu.slug("カツカレーは美味しい")
# => 'cutlet-curry-wa-oishii'

# you can disable using foreign spelling too
katsu.use_foreign_spelling = False
katsu.romaji("カツカレーは美味しい")
# => 'Katsu karee wa oishii'

# kunreisiki, nihonsiki work too
katu = cutlet.Cutlet('kunrei')
katu.romaji("富士山")
# => 'Huzi yama'

# comparison
nkatu = cutlet.Cutlet('nihon')

sent = "彼女は王への手紙を読み上げた。"
katsu.romaji(sent)
# => 'Kanojo wa ou e no tegami wo yomiageta.'
katu.romaji(sent)
# => 'Kanozyo wa ou e no tegami o yomiageta.'
nkatu.romaji(sent)
# => 'Kanozyo ha ou he no tegami wo yomiageta.'
```
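The slug output shown above is a thin post-processing layer over `romaji`: the result is lowercased, runs of non-alphanumeric characters collapse to a single hyphen, and leading/trailing hyphens are stripped. A minimal standalone sketch of just that step (plain stdlib, with the romaji string supplied by hand rather than produced by cutlet):

```python
import re

def slugify(roma):
    # lowercase, collapse each run of non-alphanumerics to one hyphen,
    # then strip hyphens at the ends -- mirrors what Cutlet.slug does
    # after the romaji conversion itself
    return re.sub(r"[^a-z0-9]+", "-", roma.lower()).strip("-")

slugify("Cutlet curry wa oishii!")
# => 'cutlet-curry-wa-oishii'
```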
For reference, the full `Cutlet` class (module-level helpers like `SYSTEMS`, `Token`, `load_exceptions`, and `normalize_text` are defined elsewhere in the file):

````python
class Cutlet:
    def __init__(
        self,
        system="hepburn",
        use_foreign_spelling=True,
        ensure_ascii=True,
        mecab_args="",
    ):
        """Create a Cutlet object, which holds configuration as well as
        tokenizer state.

        `system` is `hepburn` by default, and may also be `kunrei` or
        `nihon`. `nippon` is permitted as a synonym for `nihon`.

        If `use_foreign_spelling` is true, output will use the foreign spelling
        provided in a UniDic lemma when available. For example, "カツ" will
        become "cutlet" instead of "katsu".

        If `ensure_ascii` is true, any non-ASCII characters that can't be
        romanized will be replaced with `?`. If false, they will be passed
        through.

        Typical usage:

        ```python
        katsu = Cutlet()
        roma = katsu.romaji("カツカレーを食べた")
        # "Cutlet curry wo tabeta"
        ```
        """
        # allow 'nippon' for 'nihon'
        if system == "nippon":
            system = "nihon"
        self.system = system
        try:
            # make a copy so we can modify it
            self.table = dict(SYSTEMS[system])
        except KeyError:
            print("unknown system: {}".format(system))
            raise

        self.tagger = fugashi.Tagger(mecab_args)
        self.exceptions = load_exceptions()

        # these are too minor to be worth exposing as arguments
        self.use_tch = self.system in ("hepburn",)
        self.use_wa = self.system in ("hepburn", "kunrei")
        self.use_he = self.system in ("nihon",)
        self.use_wo = self.system in ("hepburn", "nihon")

        self.use_foreign_spelling = use_foreign_spelling
        self.ensure_ascii = ensure_ascii

    def add_exception(self, key, val):
        """Add an exception to the internal list.

        An exception overrides a whole token, for example to replace "Toukyou"
        with "Tokyo". Note that it must match the tokenizer output and be a
        single token to work. To replace longer phrases, you'll need to use a
        different strategy, like string replacement.
        """
        self.exceptions[key] = val

    def update_mapping(self, key, val):
        """Update mapping table for a single kana.

        This can be used to mix common systems, or to modify particular
        details. For example, you can use `update_mapping("ぢ", "di")` to
        differentiate ぢ and じ in Hepburn.

        Example usage:

        ```
        cut = Cutlet()
        cut.romaji("お茶漬け")  # Ochazuke
        cut.update_mapping("づ", "du")
        cut.romaji("お茶漬け")  # Ochaduke
        ```
        """
        self.table[key] = val

    def slug(self, text):
        """Generate a URL-friendly slug.

        After converting the input to romaji using `Cutlet.romaji` and making
        the result lower-case, any runs of non alpha-numeric characters are
        replaced with a single hyphen. Any leading or trailing hyphens are
        stripped.
        """
        roma = self.romaji(text).lower()
        slug = re.sub(r"[^a-z0-9]+", "-", roma).strip("-")
        return slug

    def romaji_tokens(self, words, capitalize=True, title=False):
        """Build a list of tokens from input nodes.

        If `capitalize` is true, then the first letter of the first token will
        be capitalized. This is typically the desired behavior if the input is
        a complete sentence.

        If `title` is true, then words will be capitalized as in a book title.
        This means most words will be capitalized, but some parts of speech
        (particles, endings) will not.

        If the text was not normalized before being tokenized, the output is
        undefined. For details of normalization, see `normalize_text`.

        The number of output tokens will equal the number of input nodes.
        """

        out = []

        for wi, word in enumerate(words):
            po = out[-1] if out else None
            pw = words[wi - 1] if wi > 0 else None
            nw = words[wi + 1] if wi < len(words) - 1 else None

            # handle possessive apostrophe as a special case
            if (
                word.surface == "'"
                and (nw and nw.char_type == CHAR_ALPHA and not nw.white_space)
                and not word.white_space
            ):
                # remove preceding space
                if po:
                    po.space = False
                out.append(Token(word.surface, False))
                continue

            # resolve split verbs / adjectives
            roma = self.romaji_word(word)
            if roma and po and po.surface and po.surface[-1] == "っ":
                po.surface = po.surface[:-1] + roma[0]
            if word.feature.pos2 == "固有名詞":
                roma = roma.title()
            if (
                title
                and word.feature.pos1 not in ("助詞", "助動詞", "接尾辞")
                and not (pw and pw.feature.pos1 == "接頭辞")
            ):
                roma = roma.title()

            foreign = self.use_foreign_spelling and has_foreign_lemma(word)
            tok = Token(roma, False, foreign)
            # handle punctuation with atypical spacing
            if word.surface in "「『":
                if po:
                    po.space = True
                out.append(tok)
                continue
            if roma in "([":
                if po:
                    po.space = True
                out.append(tok)
                continue
            if roma == "/":
                out.append(tok)
                continue

            # preserve spaces between ascii tokens
            if word.surface.isascii() and nw and nw.surface.isascii():
                use_space = bool(nw.white_space)
                out.append(Token(word.surface, use_space))
                continue

            out.append(tok)

            # no space sometimes
            # お酒 -> osake
            if word.feature.pos1 == "接頭辞":
                continue
            # 今日、 -> kyou, ; 図書館 -> toshokan
            if nw and nw.feature.pos1 in ("補助記号", "接尾辞"):
                continue
            # special case for half-width commas
            if nw and nw.surface == ",":
                continue
            # special case for prefixes
            if foreign and roma[-1] == "-":
                continue
            # 思えば -> omoeba
            if nw and nw.feature.pos2 in ("接続助詞",):
                continue
            # 333 -> 333 ; this should probably be handled in mecab
            if word.surface.isdigit() and nw and nw.surface.isdigit():
                continue
            # そうでした -> sou deshita
            if (
                nw
                and word.feature.pos1 in ("動詞", "助動詞", "形容詞")
                and nw.feature.pos1 == "助動詞"
                and nw.surface != "です"
            ):
                continue

            # if we get here, it does need a space
            tok.space = True

        # remove any leftover っ
        for tok in out:
            tok.surface = tok.surface.replace("っ", "")

        # capitalize the first letter
        if capitalize and out and out[0].surface:
            ss = out[0].surface
            out[0].surface = ss[0].capitalize() + ss[1:]
        return out

    def romaji(self, text, capitalize=True, title=False):
        """Build a complete string from input text.

        If `capitalize` is true, then the first letter of the text will be
        capitalized. This is typically the desired behavior if the input is a
        complete sentence.

        If `title` is true, then words will be capitalized as in a book title.
        This means most words will be capitalized, but some parts of speech
        (particles, endings) will not.
        """
        if not text:
            return ""

        text = normalize_text(text)
        words = self.tagger(text)

        tokens = self.romaji_tokens(words, capitalize, title)
        out = "".join([str(tok) for tok in tokens]).strip()
        return out

    def romaji_word(self, word):
        """Return the romaji for a single word (node)."""

        if word.surface in self.exceptions:
            return self.exceptions[word.surface]

        if word.surface.isdigit():
            return word.surface

        if word.surface.isascii():
            return word.surface

        # deal with unks first
        if word.is_unk:
            # at this point it is presumably an unk
            # Check character type using the values defined in char.def.
            # This is constant across unidic versions so far but not guaranteed.
            if word.char_type in (CHAR_HIRAGANA, CHAR_KATAKANA):
                kana = jaconv.kata2hira(word.surface)
                return self.map_kana(kana)

            # At this point this is an unknown word and not kana. Could be
            # unknown kanji, could be hangul, cyrillic, something else.
            # By default ensure ascii by replacing with ?, but allow
            # pass-through.
            if self.ensure_ascii:
                out = "?" * len(word.surface)
                return out
            else:
                return word.surface

        if word.feature.pos1 == "補助記号":
            # If it's punctuation we don't recognize, just discard it
            return self.table.get(word.surface, "")
        elif self.use_wa and word.feature.pos1 == "助詞" and word.feature.pron == "ワ":
            return "wa"
        elif (
            not self.use_he
            and word.feature.pos1 == "助詞"
            and word.feature.pron == "エ"
        ):
            return "e"
        elif (
            not self.use_wo
            and word.feature.pos1 == "助詞"
            and word.feature.pron == "オ"
        ):
            return "o"
        elif self.use_foreign_spelling and has_foreign_lemma(word):
            # this is a foreign word with known spelling
            return word.feature.lemma.split("-", 1)[-1]
        elif word.feature.kana:
            # for known words
            kana = jaconv.kata2hira(word.feature.kana)
            return self.map_kana(kana)
        else:
            # unclear when we would actually get here
            return word.surface

    def map_kana(self, kana):
        """Given a list of kana, convert them to romaji.

        The exact romaji resulting from a kana sequence depend on the preceding
        or following kana, so this handles that conversion.
        """
        out = ""
        for ki, char in enumerate(kana):
            nk = kana[ki + 1] if ki < len(kana) - 1 else None
            pk = kana[ki - 1] if ki > 0 else None
            out += self.get_single_mapping(pk, char, nk)
        return out

    def get_single_mapping(self, pk, kk, nk):
        """Given a single kana and its neighbors, return the mapped romaji."""
        # handle odoriji
        # NOTE: This is very rarely useful at present because odoriji are not
        # left in readings for dictionary words, and we can't follow kana
        # across word boundaries.
        if kk in ODORI:
            if kk in "ゝヽ":
                if pk:
                    return pk
                else:
                    return ""  # invalid but be nice
            if kk in "ゞヾ":  # repeat with voicing
                if not pk:
                    return ""
                vv = add_dakuten(pk)
                if vv:
                    return self.table[vv]
                else:
                    return ""
            # remaining are 々 for kanji and 〃 for symbols, but we can't
            # infer their span reliably (or handle rendaku)
            return ""

        # handle digraphs
        if pk and (pk + kk) in self.table:
            return self.table[pk + kk]
        if nk and (kk + nk) in self.table:
            return ""

        if nk and nk in SUTEGANA:
            if kk == "っ":
                return ""  # never valid, just ignore
            return self.table[kk][:-1] + self.table[nk]
        if kk in SUTEGANA:
            return ""

        if kk == "ー":  # 長音符
            if pk and pk in self.table:
                return self.table[pk][-1]
            else:
                return "-"

        if kk == "っ":
            if nk:
                if self.use_tch and nk == "ち":
                    return "t"
                elif nk in "あいうえおっ":
                    return "-"
                else:
                    return self.table[nk][0]  # first character
            else:
                # seems like it should never happen, but 乗っ|た is two tokens
                # so leave this as is and pick it up at the word level
                return "っ"

        if kk == "ん":
            if nk and nk in "あいうえおやゆよ":
                return "n'"
            else:
                return "n"

        return self.table[kk]
````
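The neighbor-sensitive lookup in `map_kana`/`get_single_mapping` is the heart of the conversion: each kana's romaji can depend on the kana before and after it. A toy illustration of just the ん rule (an apostrophe before vowels and y-kana, so that んい stays distinguishable from に), using a hypothetical three-entry table rather than the real `SYSTEMS` data:

```python
# hypothetical mini-table; the real tables live in cutlet's SYSTEMS dict
TABLE = {"し": "shi", "い": "i", "な": "na"}

def map_kana(kana):
    out = ""
    for i, ch in enumerate(kana):
        nk = kana[i + 1] if i + 1 < len(kana) else None  # next kana, if any
        if ch == "ん":
            # apostrophe before a vowel or y-kana, as in get_single_mapping
            out += "n'" if nk and nk in "あいうえおやゆよ" else "n"
        else:
            out += TABLE[ch]
    return out

map_kana("しんい")  # "shin'i" -- ん before い needs the apostrophe
map_kana("しな")    # "shina"
```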
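Another neighbor-dependent rule worth seeing in isolation: the っ branch of `get_single_mapping` doubles the first consonant of the following kana's romaji, with a special t-ch case in Hepburn. A toy version of only that branch (tiny hypothetical table, not the library code):

```python
# hypothetical mini-table standing in for cutlet's real mapping data
TABLE = {"か": "ka", "ち": "chi", "ぱ": "pa"}

def sokuon(nk, use_tch=True):
    # っ before ち is "t" in Hepburn (まっちゃ -> matcha); otherwise it
    # copies the first consonant of the next kana (きっぷ -> kippu)
    if use_tch and nk == "ち":
        return "t"
    return TABLE[nk][0]

sokuon("ぱ")  # "p"
sokuon("ち")  # "t" with Hepburn's t-ch, "c" otherwise
```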