Decode any character.

The complete reference for Unicode and character encoding. Look up any of 138,571+ characters and see exactly how it's encoded in UTF-8, UTF-16, ASCII, Latin-1, Windows-1252, Shift-JIS, and more.

Try: é € ☃ 中 &

Common Questions

What is UTF-8?

UTF-8 is a variable-width character encoding that can represent every character in Unicode. It uses 1 to 4 bytes per character and is backwards-compatible with ASCII. It is the dominant encoding on the web, used by over 98% of websites.
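As a quick illustration of the 1 to 4 byte range, the sketch below encodes one character from each width class using Python's built-in `str.encode`. The helper name `utf8_hex` and the sample characters are chosen for this example only.

```python
# One sample character per UTF-8 width class (1, 2, 3 and 4 bytes).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, é, €, 😀


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of ch as space-separated uppercase hex."""
    return ch.encode("utf-8").hex(" ").upper()


for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s) -> {utf8_hex(ch)}")
```

The first sample is plain ASCII, so its single byte (41) is the same byte ASCII itself would use, which is what backwards compatibility means in practice.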

UTF-8 reference →

What is mojibake?

Mojibake (文字化け) is garbled text that results from decoding bytes with the wrong character encoding. For example, reading UTF-8 text as Windows-1252 produces sequences like Ã© instead of é.
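The Ã© example can be reproduced, and reversed, in a few lines of Python. This is a minimal sketch that assumes the text was garbled exactly once, by decoding UTF-8 bytes as Windows-1252 (Python's `cp1252` codec). The function names `garble` and `repair` are illustrative.

```python
def garble(text: str) -> str:
    """Simulate mojibake: encode as UTF-8, then decode as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair(text: str) -> str:
    """Undo the mistake: recover the original bytes, then decode as UTF-8."""
    return text.encode("cp1252").decode("utf-8")


broken = garble("\u00e9")  # é becomes the two characters Ã©
print(broken)
print(repair(broken))      # back to é
```

The round trip only works when every byte involved is defined in Windows-1252 and the text was not garbled more than once. Python's `cp1252` codec leaves five byte values (81, 8D, 8F, 90, 9D) unassigned, so some inputs will raise an error instead of producing garbled text.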

Fix mojibake →

UTF-8 vs UTF-16

UTF-8 is variable-width (1–4 bytes) and ASCII-compatible. UTF-16 is also variable-width: it uses 2 bytes for characters in the Basic Multilingual Plane and 4 bytes (a surrogate pair) for supplementary ones. UTF-8 is preferred for files and the web; UTF-16 is used internally by Windows, Java, and JavaScript engines.
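To make the size difference concrete, the following sketch compares the encoded length of a few characters in both encodings. It uses `utf-16-le` so that Python does not prepend a BOM to the output; the helper name `sizes` is an assumption of this example.

```python
def sizes(ch: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:  # A, é, 中, 😀
    utf8_len, utf16_len = sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

ASCII-heavy text is smaller in UTF-8, while a character such as 中 takes 3 bytes in UTF-8 but only 2 in UTF-16.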

UTF-16 reference →

What is a Unicode codepoint?

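A codepoint is a number that uniquely identifies a character in the Unicode standard. It is written as U+ followed by at least four hexadecimal digits; for example, U+0041 is the Latin capital letter A. The Unicode codespace runs from U+0000 to U+10FFFF, which is 1,114,112 codepoints across 17 planes, and over 140,000 of them are assigned to characters.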
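In Python, `ord` and `chr` convert between a character and its codepoint, which makes the U+ notation easy to reproduce. The helpers `u_plus` and `plane` below are illustrative names, not part of any library.

```python
def u_plus(ch: str) -> str:
    """Format a character's codepoint in U+XXXX notation (at least 4 hex digits)."""
    return f"U+{ord(ch):04X}"


def plane(ch: str) -> int:
    """Return the plane number (0 to 16); each plane holds 65,536 codepoints."""
    return ord(ch) >> 16


print(u_plus("A"), chr(0x41))                       # U+0041 A
print(u_plus("\U0001F600"), plane("\U0001F600"))    # U+1F600 1
```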

Browse characters →

What is ASCII?

ASCII (American Standard Code for Information Interchange) is a 7-bit encoding defining 128 characters: English letters, digits, punctuation, and control codes. Every ASCII character has the same byte value in UTF-8, Latin-1, and Windows-1252.
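The claim that ASCII bytes are shared across these encodings can be checked directly. The sketch below encodes all 128 ASCII characters with four of Python's built-in codecs (`cp1252` is Python's name for Windows-1252) and compares the results.

```python
# All 128 ASCII characters, U+0000 through U+007F.
ascii_text = "".join(chr(i) for i in range(128))

encodings = ["ascii", "utf-8", "latin-1", "cp1252"]
encoded = {enc: ascii_text.encode(enc) for enc in encodings}

# Every codec should produce the byte values 0x00 to 0x7F, unchanged.
identical = all(data == bytes(range(128)) for data in encoded.values())
print(identical)
```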

ASCII reference →

What is a BOM?

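A Byte Order Mark (BOM) is the character U+FEFF placed at the start of a file or stream to signal its encoding and byte order. The UTF-8 BOM is EF BB BF, the UTF-16 LE BOM is FF FE, and the UTF-16 BE BOM is FE FF. UTF-8 has no byte-order ambiguity, so its BOM is optional, and it often causes problems in tools that do not expect it.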
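A short sketch of how a UTF-8 BOM shows up in practice, using the BOM constants from Python's standard `codecs` module and the `utf-8-sig` codec, which strips a leading BOM on decode. The sample payload `"hi"` is arbitrary.

```python
import codecs

# A file's contents as written by a tool that adds a UTF-8 BOM.
data = codecs.BOM_UTF8 + "hi".encode("utf-8")

plain = data.decode("utf-8")         # BOM survives as an invisible U+FEFF
stripped = data.decode("utf-8-sig")  # BOM is detected and removed

print(repr(plain))
print(repr(stripped))
print(codecs.BOM_UTF16_LE.hex(" ").upper())  # FF FE
```

The stray U+FEFF left behind by the plain `utf-8` decode is a typical source of BOM trouble: the string looks correct when printed but no longer compares equal to the text without the BOM.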

Encode text →

Character Encodings

All encodings →

A character encoding is the rule that maps a character's number (codepoint) to the bytes stored on disk or sent over a network. Different encodings cover different languages and use different numbers of bytes — choosing the wrong one is the most common cause of garbled text.
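To show one codepoint mapping to different bytes, the sketch below encodes the euro sign (U+20AC) with several of Python's built-in codecs. The function name `encode_everywhere` and the codec list are assumptions made for this example; an encoding that cannot represent the character is reported as `None`.

```python
ENCODINGS = ["utf-8", "utf-16-le", "utf-32-le", "cp1252", "latin-1", "ascii"]


def encode_everywhere(ch: str, encodings=ENCODINGS) -> dict:
    """Map each encoding name to the character's bytes as hex, or None."""
    result = {}
    for enc in encodings:
        try:
            result[enc] = ch.encode(enc).hex(" ").upper()
        except UnicodeEncodeError:
            # This encoding has no byte sequence for the character.
            result[enc] = None
    return result


for enc, hex_bytes in encode_everywhere("\u20ac").items():
    print(f"{enc:10} {hex_bytes}")
```

The euro sign is three bytes in UTF-8 and a single byte (80) in Windows-1252, and it does not exist at all in Latin-1 or ASCII. Reading bytes with a different encoding than the one that wrote them is what produces garbled text.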

UTF-8 Unicode

The dominant encoding for the web. Variable-width (1–4 bytes). Fully backwards-c...

Since 1993

UTF-16 LE Unicode

Little-endian UTF-16. Used internally by Windows, Java, and .NET. Variable-width...

Since 1996

UTF-16 BE Unicode

Big-endian UTF-16. Network byte-order variant of UTF-16. Used in some network pr...

Since 1996

UTF-32 LE Unicode

Fixed-width encoding using 4 bytes per character. Simple to process but memory-i...

Since 2003

UTF-32 BE Unicode

Fixed-width encoding using 4 bytes per character. Big-endian byte order. Rarely...

Since 2003

ASCII

The original 7-bit character encoding standard. Covers 128 characters: English l...

Since 1963

Latin-1 (ISO-8859-1)

Extends ASCII to 256 characters, covering most Western European languages. The f...

Since 1987

Windows-1252

Microsoft's extension of Latin-1. Assigns printable characters to the C1 control...

Since 1985

ISO-8859-2 (Latin-2)

Covers Central and Eastern European languages using Latin script: Polish, Czech,...

Since 1987

ISO-8859-5 (Cyrillic)

ISO standard for Cyrillic script. Covers Russian, Bulgarian, Serbian, Macedonian...

Since 1988

KOI8-R

Russian character encoding widely used in Unix systems and early internet. Desig...

Since 1993

Shift-JIS

Variable-width encoding for Japanese. Single-byte for ASCII and half-width kana,...

Since 1982

EUC-JP

Extended Unix Code for Japanese. Variable-width encoding common in Unix/Linux Ja...

Since 1991

GBK

Chinese national standard encoding for Simplified Chinese. Superset of GB2312. V...

Since 1993

Big5

Traditional Chinese encoding used in Taiwan, Hong Kong, and Macau. Variable-widt...

Since 1984

Browse by Block

All blocks →

Unicode organises its 138,571+ characters into named blocks — contiguous ranges of codepoints grouped by script or purpose. Select a block to browse every character it contains, or click any character to see how it's encoded across all supported encodings.
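Because a block is a contiguous range, membership is a simple bounds check. Python's standard library has no block lookup, so this sketch hard-codes one range, CJK Unified Ideographs (U+4E00 through U+9FFF), as an illustration; the names `CJK_UNIFIED` and `in_block` are hypothetical.

```python
# CJK Unified Ideographs occupies U+4E00 through U+9FFF inclusive.
CJK_UNIFIED = range(0x4E00, 0x9FFF + 1)


def in_block(ch: str, block: range) -> bool:
    """Return True if the character's codepoint falls inside the block."""
    return ord(ch) in block


print(len(CJK_UNIFIED))                # 20992 codepoints in the range
print(in_block("\u4e2d", CJK_UNIFIED))  # 中 is inside the block
print(in_block("A", CJK_UNIFIED))
```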

CJK Unified Ideographs

20,992 characters