Cutlet is a tool to convert Japanese to romaji. Check out the interactive demo! Also see the docs and the original blog post.



Features:

  • support for Modified Hepburn, Kunreisiki, Nihonsiki systems
  • custom overrides for individual mappings
  • custom overrides for specific words
  • built-in exceptions list (Tokyo, Osaka, etc.)
  • uses foreign spelling when available in UniDic
  • proper nouns are capitalized
  • slug mode for URL generation
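
The slug rule can be tried without a tokenizer. This is a minimal sketch of the steps described in the `slug` docstring: lower-case the romaji, replace each run of non-alphanumeric characters with one hyphen, then strip hyphens from both ends. The helper name `slugify_romaji` is hypothetical and expects text that has already been romanized.

```python
import re


def slugify_romaji(roma: str) -> str:
    """Turn already-romanized text into a URL slug (hypothetical helper)."""
    lowered = roma.lower()
    # Collapse each run of characters outside a-z and 0-9 into one hyphen.
    hyphenated = re.sub(r"[^a-z0-9]+", "-", lowered)
    # Drop hyphens left at either end, e.g. from trailing punctuation.
    return hyphenated.strip("-")


print(slugify_romaji("Cutlet curry wa oishii"))  # cutlet-curry-wa-oishii
```

Because only `a-z` and `0-9` survive, any non-ASCII character left in the romaji is also folded into a hyphen.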

Things not supported:

  • traditional Hepburn n-to-m: Shimbashi
  • macrons or circumflexes: Tōkyō, Tôkyô
  • passport Hepburn: Satoh (but you can use an exception)
  • hyphenating words
  • Traditional Hepburn in general is not supported
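
As a rough illustration of how a word-level exception such as Satoh behaves, the sketch below applies a dictionary of overrides to tokens that have already been split. The function `apply_exceptions`, the token split, and the default spellings are all assumptions made for the example, not real tokenizer output. It shows why an exception only works when its key is exactly one token.

```python
def apply_exceptions(tokens, exceptions):
    """Replace the romaji of any token whose surface form has an override.

    `tokens` is a list of (surface, default_romaji) pairs. Both the split
    and the default spellings are illustrative inputs.
    """
    return [exceptions.get(surface, roma) for surface, roma in tokens]


tokens = [("佐藤", "Satou"), ("さん", "san")]

# A key that matches one token replaces that token.
print(apply_exceptions(tokens, {"佐藤": "Satoh"}))  # ['Satoh', 'san']

# A key that spans two tokens never matches, so nothing changes.
print(apply_exceptions(tokens, {"佐藤さん": "Satoh-san"}))  # ['Satou', 'san']
```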

Internally, cutlet uses fugashi, so you can use the same dictionary you use for normal tokenization.


Cutlet can be installed through pip as usual.

pip install cutlet

Note that if you don't have a MeCab dictionary installed you'll also have to install one. If you're just getting started unidic-lite is a good choice.

pip install unidic-lite


A command-line script is included for quick testing. Just use cutlet and each line of stdin will be treated as a sentence. You can specify the system to use (hepburn, kunrei, nippon, or nihon) as the first argument.

$ cutlet
ローマ字変換プログラム作ってみた。
Roma ji henkan program tsukutte mita.

In code:

import cutlet
katsu = cutlet.Cutlet()
katsu.romaji("カツカレーは美味しい")
# => 'Cutlet curry wa oishii'

# you can print a slug suitable for urls
katsu.slug("カツカレーは美味しい")
# => 'cutlet-curry-wa-oishii'

# You can disable using foreign spelling too
katsu.use_foreign_spelling = False
katsu.romaji("カツカレーは美味しい")
# => 'Katsu karee wa oishii'

# kunreisiki, nihonsiki work too
katu = cutlet.Cutlet('kunrei')
katu.romaji("富士山")
# => 'Huzi yama'

# comparison
nkatu = cutlet.Cutlet('nihon')

sent = "彼女は王への手紙を読み上げた。"
katsu.romaji(sent)
# => 'Kanojo wa ou e no tegami wo yomiageta.'
katu.romaji(sent)
# => 'Kanozyo wa ou e no tegami o yomiageta.'
nkatu.romaji(sent)
# => 'Kanozyo ha ou he no tegami wo yomiageta.'
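
The particle differences in the comparison above (wa/ha, e/he, wo/o) follow from three per-system flags. Here is a standalone sketch of that choice, mirroring the `use_wa`, `use_he` and `use_wo` flags in the class source. The function names are hypothetical, and the fallback spellings (`ha`, `he`, `wo`) are assumptions read off the sample output rather than taken from the real mapping tables.

```python
def particle_flags(system):
    """Per-system switches for how particles are spelled."""
    return {
        "use_wa": system in ("hepburn", "kunrei"),
        "use_he": system in ("nihon",),
        "use_wo": system in ("hepburn", "nihon"),
    }


def particle_romaji(system, pron):
    """Spell a particle given its pronunciation in katakana (ワ, エ or オ)."""
    flags = particle_flags(system)
    if pron == "ワ":  # topic particle は
        return "wa" if flags["use_wa"] else "ha"
    if pron == "エ":  # direction particle へ
        return "he" if flags["use_he"] else "e"
    if pron == "オ":  # object particle を
        return "wo" if flags["use_wo"] else "o"
    raise ValueError("not a particle handled in this sketch: " + pron)


for system in ("hepburn", "kunrei", "nihon"):
    print(system, [particle_romaji(system, p) for p in "ワエオ"])
# hepburn ['wa', 'e', 'wo']
# kunrei ['wa', 'e', 'o']
# nihon ['ha', 'he', 'wo']
```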


Alternatives:

  • kakasi: Historically important, but not updated since 2014.
  • pykakasi: Self-contained; it does segmentation on its own and uses its own dictionary.
  • kuroshiro: JavaScript-based.
  • kana: Go-based.

import re

import fugashi
import jaconv

# SYSTEMS, load_exceptions, Token, has_foreign_lemma, normalize_text,
# add_dakuten, ODORI, SUTEGANA and the CHAR_* constants are defined
# outside this excerpt.

class Cutlet:
    def __init__(
            self,
            system = 'hepburn',
            use_foreign_spelling = True,
            ensure_ascii = True,
            mecab_args = "",
    ):
        """Create a Cutlet object, which holds configuration as well as
        tokenizer state.

        `system` is `hepburn` by default, and may also be `kunrei` or
        `nihon`. `nippon` is permitted as a synonym for `nihon`.

        If `use_foreign_spelling` is true, output will use the foreign spelling
        provided in a UniDic lemma when available. For example, "カツ" will
        become "cutlet" instead of "katsu".

        If `ensure_ascii` is true, any non-ASCII characters that can't be
        romanized will be replaced with `?`. If false, they will be passed
        through.

        Typical usage:

        ```python
        katsu = Cutlet()
        roma = katsu.romaji("カツカレーを食べた")
        # "Cutlet curry wo tabeta"
        ```
        """
        # allow 'nippon' for 'nihon'
        if system == 'nippon': system = 'nihon'
        self.system = system
        try:
            # make a copy so we can modify it
            self.table = dict(SYSTEMS[system])
        except KeyError:
            print("unknown system: {}".format(system))
            raise

        self.tagger = fugashi.Tagger(mecab_args)
        self.exceptions = load_exceptions()

        # these are too minor to be worth exposing as arguments
        self.use_tch = (self.system in ('hepburn',))
        self.use_wa  = (self.system in ('hepburn', 'kunrei'))
        self.use_he  = (self.system in ('nihon',))
        self.use_wo  = (self.system in ('hepburn', 'nihon'))

        self.use_foreign_spelling = use_foreign_spelling
        self.ensure_ascii = ensure_ascii

    def add_exception(self, key, val):
        """Add an exception to the internal list.

        An exception overrides a whole token, for example to replace "Toukyou"
        with "Tokyo". Note that it must match the tokenizer output and be a
        single token to work. To replace longer phrases, you'll need to use a
        different strategy, like string replacement.
        """
        self.exceptions[key] = val

    def update_mapping(self, key, val):
        """Update mapping table for a single kana.

        This can be used to mix common systems, or to modify particular
        details. For example, you can use `update_mapping("ぢ", "di")` to
        differentiate ぢ and じ in Hepburn.

        Example usage:

        ```
        cut = Cutlet()
        cut.romaji("お茶漬け") # Ochazuke
        cut.update_mapping("づ", "du")
        cut.romaji("お茶漬け") # Ochaduke
        ```
        """
        self.table[key] = val

    def slug(self, text):
        """Generate a URL-friendly slug.

        After converting the input to romaji using `Cutlet.romaji` and making
        the result lower-case, any runs of non alpha-numeric characters are
        replaced with a single hyphen. Any leading or trailing hyphens are
        stripped.
        """
        roma = self.romaji(text).lower()
        slug = re.sub(r'[^a-z0-9]+', '-', roma).strip('-')
        return slug

    def romaji_tokens(self, words, capitalize=True, title=False):
        """Build a list of tokens from input nodes.

        If `capitalize` is true, then the first letter of the first token will be
        capitalized. This is typically the desired behavior if the input is a
        complete sentence.

        If `title` is true, then words will be capitalized as in a book title.
        This means most words will be capitalized, but some parts of speech
        (particles, endings) will not.

        If the text was not normalized before being tokenized, the output is
        undefined. For details of normalization, see `normalize_text`.

        The number of output tokens will equal the number of input nodes.
        """

        out = []

        for wi, word in enumerate(words):
            po = out[-1] if out else None
            pw = words[wi - 1] if wi > 0 else None
            nw = words[wi + 1] if wi < len(words) - 1 else None

            # handle possessive apostrophe as a special case
            if (word.surface == "'" and
                    (nw and nw.char_type == CHAR_ALPHA and not nw.white_space) and
                    not word.white_space):
                # remove preceding space
                if po:
                    po.space = False
                out.append(Token(word.surface, False))
                continue

            # resolve split verbs / adjectives
            roma = self.romaji_word(word)
            if roma and po and po.surface and po.surface[-1] == 'っ':
                po.surface = po.surface[:-1] + roma[0]
            if word.feature.pos2 == '固有名詞':
                roma = roma.title()
            if (title and
                word.feature.pos1 not in ('助詞', '助動詞', '接尾辞') and
                not (pw and pw.feature.pos1 == '接頭辞')):
                roma = roma.title()

            foreign = self.use_foreign_spelling and has_foreign_lemma(word)
            tok = Token(roma, False, foreign)
            # handle punctuation with atypical spacing
            if word.surface in '「『':
                if po:
                    po.space = True
                out.append(tok)
                continue
            if roma in '([':
                if po:
                    po.space = True
                out.append(tok)
                continue
            if roma == '/':
                out.append(tok)
                continue

            out.append(tok)

            # no space sometimes
            # お酒 -> osake
            if word.feature.pos1 == '接頭辞': continue
            # 今日、 -> kyou, ; 図書館 -> toshokan
            if nw and nw.feature.pos1 in ('補助記号', '接尾辞'): continue
            # special case for half-width commas
            if nw and nw.surface == ',': continue
            # 思えば -> omoeba
            if nw and nw.feature.pos2 in ('接続助詞',): continue
            # 333 -> 333 ; this should probably be handled in mecab
            if (word.surface.isdigit() and
                    nw and nw.surface.isdigit()):
                continue
            # そうでした -> sou deshita
            if (nw and word.feature.pos1 in ('動詞', '助動詞','形容詞')
                   and nw.feature.pos1 == '助動詞'
                   and nw.surface != 'です'):
                continue

            # if we get here, it does need a space
            tok.space = True

        # remove any leftover っ
        for tok in out:
            tok.surface = tok.surface.replace("っ", "")

        # capitalize the first letter
        if capitalize and out and out[0].surface:
            ss = out[0].surface
            out[0].surface = ss[0].capitalize() + ss[1:]
        return out

    def romaji(self, text, capitalize=True, title=False):
        """Build a complete string from input text.

        If `capitalize` is true, then the first letter of the text will be
        capitalized. This is typically the desired behavior if the input is a
        complete sentence.

        If `title` is true, then words will be capitalized as in a book title.
        This means most words will be capitalized, but some parts of speech
        (particles, endings) will not.
        """
        if not text:
            return ''

        text = normalize_text(text)
        words = self.tagger(text)

        tokens = self.romaji_tokens(words, capitalize, title)
        out = ''.join([str(tok) for tok in tokens]).strip()
        return out

    def romaji_word(self, word):
        """Return the romaji for a single word (node)."""

        if word.surface in self.exceptions:
            return self.exceptions[word.surface]

        if word.surface.isdigit():
            return word.surface

        if word.surface.isascii():
            return word.surface

        # deal with unks first
        if word.is_unk:
            # at this point it is presumably an unk
            # Check character type using the values defined in char.def.
            # This is constant across unidic versions so far but not guaranteed.
            if word.char_type in (CHAR_HIRAGANA, CHAR_KATAKANA):
                kana = jaconv.kata2hira(word.surface)
                return self.map_kana(kana)

            # At this point this is an unknown word and not kana. Could be
            # unknown kanji, could be hangul, cyrillic, something else.
            # By default ensure ascii by replacing with ?, but allow pass-through.
            if self.ensure_ascii:
                out = '?' * len(word.surface)
                return out
            else:
                return word.surface

        if word.feature.pos1 == '補助記号':
            # If it's punctuation we don't recognize, just discard it
            return self.table.get(word.surface, '')
        elif (self.use_wa and
                word.feature.pos1 == '助詞' and word.feature.pron == 'ワ'):
            return 'wa'
        elif (not self.use_he and
                word.feature.pos1 == '助詞' and word.feature.pron == 'エ'):
            return 'e'
        elif (not self.use_wo and
                word.feature.pos1 == '助詞' and word.feature.pron == 'オ'):
            return 'o'
        elif (self.use_foreign_spelling and
                has_foreign_lemma(word)):
            # this is a foreign word with known spelling
            return word.feature.lemma.split('-')[-1]
        elif word.feature.kana:
            # for known words
            kana = jaconv.kata2hira(word.feature.kana)
            return self.map_kana(kana)
        else:
            # unclear when we would actually get here
            return word.surface

    def map_kana(self, kana):
        """Given a list of kana, convert them to romaji.

        The exact romaji resulting from a kana sequence depend on the preceding
        or following kana, so this handles that conversion.
        """
        out = ''
        for ki, char in enumerate(kana):
            nk = kana[ki + 1] if ki < len(kana) - 1 else None
            pk = kana[ki - 1] if ki > 0 else None
            out += self.get_single_mapping(pk, char, nk)
        return out

    def get_single_mapping(self, pk, kk, nk):
        """Given a single kana and its neighbors, return the mapped romaji."""
        # handle odoriji
        # NOTE: This is very rarely useful at present because odoriji are not
        # left in readings for dictionary words, and we can't follow kana
        # across word boundaries.
        if kk in ODORI:
            if kk in 'ゝヽ':
                if pk: return pk
                else: return '' # invalid but be nice
            if kk in 'ゞヾ': # repeat with voicing
                if not pk: return ''
                vv = add_dakuten(pk)
                if vv: return self.table[vv]
                else: return ''
            # remaining are 々 for kanji and 〃 for symbols, but we can't
            # infer their span reliably (or handle rendaku)
            return ''


        # handle digraphs
        if pk and (pk + kk) in self.table:
            return self.table[pk + kk]
        if nk and (kk + nk) in self.table:
            return ''

        if nk and nk in SUTEGANA:
            if kk == 'っ': return '' # never valid, just ignore
            return self.table[kk][:-1] + self.table[nk]
        if kk in SUTEGANA:
            return ''

        if kk == 'ー': # 長音符
            if pk and pk in self.table: return self.table[pk][-1]
            else: return '-'

        if kk == 'っ':
            if nk:
                if self.use_tch and nk == 'ち': return 't'
                elif nk in 'あいうえおっ': return '-'
                else: return self.table[nk][0] # first character
            else:
                # seems like it should never happen, but 乗っ|た is two tokens
                # so leave this as is and pick it up at the word level
                return 'っ'

        if kk == 'ん':
            if nk and nk in 'あいうえおやゆよ': return "n'"
            else: return 'n'

        return self.table[kk]
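
The context-dependent part of the kana mapping is easier to follow in isolation. The sketch below is a cut-down, standalone version of the neighbor-aware lookup, covering only the small っ, ん and the long-vowel mark ー. Digraphs, sutegana and odoriji are left out on purpose. `TABLE` is a tiny illustrative subset made up for this example, not the real mapping table, and `map_one` is a hypothetical name.

```python
# Tiny illustrative kana table; the real tables are far larger.
TABLE = {
    "あ": "a", "い": "i", "か": "ka", "き": "ki", "た": "ta",
    "ち": "chi", "ま": "ma", "め": "me", "ら": "ra",
}


def map_one(prev, cur, nxt, table, use_tch):
    """Map one kana to romaji, looking at its neighbors where needed."""
    if cur == "ー":  # long-vowel mark repeats the previous vowel
        return table[prev][-1] if prev in table else "-"
    if cur == "っ":  # small tsu doubles the following consonant
        if nxt is None:
            return "っ"  # left for a later, word-level pass
        if use_tch and nxt == "ち":
            return "t"  # Hepburn writes "tchi", not "cchi"
        if nxt in "あいうえおっ":
            return "-"
        return table[nxt][0]
    if cur == "ん":  # apostrophe keeps n separate from a vowel or y
        return "n'" if nxt is not None and nxt in "あいうえおやゆよ" else "n"
    return table[cur]


def map_kana(kana, table=TABLE, use_tch=True):
    """Convert a hiragana string by mapping each kana with its neighbors."""
    out = ""
    for i, cur in enumerate(kana):
        prev = kana[i - 1] if i > 0 else None
        nxt = kana[i + 1] if i + 1 < len(kana) else None
        out += map_one(prev, cur, nxt, table, use_tch)
    return out


print(map_kana("まっち"))    # matchi
print(map_kana("かった"))    # katta
print(map_kana("らーめん"))  # raamen
print(map_kana("かんい"))    # kan'i

# Overriding one entry on a copy of the table, in the spirit of update_mapping:
kunrei_like = dict(TABLE, ち="ti")
print(map_kana("まっち", table=kunrei_like, use_tch=False))  # matti
```

Working on a copy of the table mirrors the constructor's `dict(SYSTEMS[system])`, so a per-instance override never changes the shared defaults.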
