Understanding Mojibake: Why Your Text Looks Like Gibberish
You’ve probably encountered it before: you open an email, a document, or a webpage and, instead of readable text, you see something like CafÃ©, â€œHelloâ€, or donâ€™t. This garbled mess of characters has a name, mojibake, and understanding why it happens is the first step to fixing it.
What Is Mojibake?
Mojibake (文字化け) is a Japanese term meaning “character garbling.” It describes the phenomenon where text becomes unreadable because it was decoded using the wrong character encoding. When software interprets bytes meant for one encoding system as if they were another, the result is a jumble of incorrect characters that replace the original text.
The term comes from Japanese computing, where encoding problems were particularly common due to the complexity of representing Japanese characters. However, mojibake affects all languages and scripts—anywhere text crosses system boundaries, encoding errors can occur.
How Character Encoding Works
To understand mojibake, you need to understand how computers store text. At the most basic level, computers store everything as numbers. Character encoding is the system that maps characters (letters, symbols, punctuation) to specific numbers.
ASCII: The Beginning
The original American Standard Code for Information Interchange (ASCII) used 7 bits to represent 128 characters—enough for English letters, numbers, and basic punctuation. But ASCII couldn’t represent accented characters, non-Latin scripts, or special symbols.
Latin-1 and Windows-1252
To accommodate Western European languages, Latin-1 (ISO-8859-1) extended ASCII to 8 bits. It uses the range 0xA0–0xFF for accented letters like é, ñ, and ü and other symbols, and leaves 0x80–0x9F to control codes. Microsoft’s Windows-1252 reassigned most of that control range to useful printable symbols like curly quotes (“ ”), em dashes (—), and the Euro sign (€).
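The difference between the two encodings shows up only in the 0x80–0x9F range. The following is a minimal Python sketch (the byte values are chosen purely for illustration) that decodes the same bytes both ways:

```python
# Five bytes: 0x93, "H", "i", 0x94, 0x80 (sample values chosen for illustration).
raw = bytes([0x93, 0x48, 0x69, 0x94, 0x80])

# Windows-1252 maps 0x93, 0x94 and 0x80 to curly quotes and the Euro sign.
as_cp1252 = raw.decode("cp1252")

# Latin-1 maps the same bytes to invisible control codes U+0093, U+0094, U+0080.
as_latin1 = raw.decode("latin-1")

print(repr(as_cp1252))
print(repr(as_latin1))
```

This is why text containing curly quotes often survives when read as Windows-1252 but loses characters when read as strict Latin-1.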
UTF-8: The Modern Standard
UTF-8 is today’s dominant encoding, capable of representing every character in Unicode—over 140,000 characters from virtually every writing system. UTF-8 uses variable-length encoding: ASCII characters remain single bytes (maintaining backward compatibility), while other characters use 2–4 bytes.
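To make the variable-length behavior concrete, here is a small Python sketch that measures how many bytes a few sample characters occupy in UTF-8. The sample characters are arbitrary picks, one for each length:

```python
# One sample character for each UTF-8 sequence length.
samples = ["A", "é", "€", "😀"]

# Encode each character and count the resulting bytes.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in utf8_lengths.items():
    print(f"U+{ord(ch):04X} {ch}: {n} byte(s)")
```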
Why Mojibake Happens
Mojibake occurs when there’s a mismatch between how text was encoded and how it’s being decoded. The most common scenario involves UTF-8 text being incorrectly interpreted as Latin-1 or Windows-1252.
UTF-8 Decoded as Latin-1
Consider the word “café”. In UTF-8, the "é" character is encoded as two bytes: 0xC3 0xA9. When software incorrectly decodes this as Latin-1, each byte becomes a separate character:
CafÃ© (UTF-8 decoded as Latin-1)
Café (Correctly decoded UTF-8)
The “Ã” is Latin-1 character 195 (0xC3), and “©” is Latin-1 character 169 (0xA9). Together, they’re the two bytes of the UTF-8 representation of “é”, each misread as a separate Latin-1 character.
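The misreading can be reproduced in a few lines of Python. This is only an illustrative sketch; the variable names are made up for the example:

```python
original = "café"

# Encode correctly as UTF-8: the é becomes the two bytes 0xC3 0xA9.
utf8_bytes = original.encode("utf-8")

# Decode with the wrong encoding: each byte becomes its own character.
misread = utf8_bytes.decode("latin-1")

print(utf8_bytes)  # the raw bytes
print(misread)     # the garbled text a reader would see
```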
Windows-1252 Smart Quote Problems
Microsoft Word and other applications use “smart” or “curly” quotes, which are encoded differently in Windows-1252 than in UTF-8. When these characters are misinterpreted, you get distinctive mojibake patterns:
â€œHelloâ€ becomes “Hello”
donâ€™t becomes don’t
â€” becomes — (em dash)
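These patterns can be derived rather than memorized. The sketch below, in Python, encodes each typographic character as UTF-8 and decodes the bytes as Windows-1252. It assumes `errors="replace"` is an acceptable way to handle the one byte (0x9D, from the closing curly quote) that Windows-1252 leaves undefined:

```python
# Curly quotes and the em dash, each three bytes long in UTF-8.
typographic = ["\u201c", "\u201d", "\u2019", "\u2014"]

# errors="replace" substitutes U+FFFD for bytes that Windows-1252 leaves
# undefined (0x9D here); Python's strict mode would raise UnicodeDecodeError.
patterns = {
    ch: ch.encode("utf-8").decode("cp1252", errors="replace")
    for ch in typographic
}

for ch, garbled in patterns.items():
    print(f"U+{ord(ch):04X} {ch} is displayed as {garbled}")
```

All four results start with “â€” because all four characters share the same first two UTF-8 bytes, 0xE2 0x80.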
Double Encoding
Sometimes text goes through encoding conversion twice, creating multiply-garbled text that’s even harder to decode. If UTF-8 text is treated as Latin-1 and then “converted” to UTF-8 again, each corrupted character gets re-encoded:
Ã© → é (single encoding error)
ÃƒÂ© → é (double encoding error)
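A rough Python sketch of how the damage compounds is shown below. The helper name `misdecode` is hypothetical, and Windows-1252 is used in place of strict Latin-1 so that every resulting character is printable:

```python
def misdecode(text: str) -> str:
    """Simulate one encoding error: UTF-8 bytes read as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")

once = misdecode("é")     # one wrong round trip: 2 characters
twice = misdecode(once)   # two wrong round trips: 4 characters

print(once, len(once))
print(twice, len(twice))
```

Each extra round trip roughly doubles the number of garbage characters per original non-ASCII character, which is a useful clue when estimating how many times text was mangled.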
Common Sources of Mojibake
Mojibake typically appears in these scenarios:
- Email: Messages sent without proper encoding headers or processed through servers with different encoding assumptions.
- Databases: Data stored with one encoding but retrieved with another, particularly common during migrations.
- Web scraping: Extracting text from websites without respecting their declared character encoding.
- Copy-paste: Moving text between applications with different default encodings.
- Legacy systems: Old software or databases that predate UTF-8 adoption.
- File imports: Opening text files with encoding different from what the editor assumes.
Recognizing Mojibake Patterns
With practice, you can learn to recognize common mojibake patterns and even identify the type of encoding error:
Accented Characters
UTF-8 accented characters decoded as Latin-1 produce recognizable “Ã” sequences:
Ã© → é
Ã± → ñ
Ã¼ → ü
Ã  → à (the second character is a non-breaking space)
Currency and Symbols
â‚¬ → €
Â£ → £
Â© → ©
Â° → °
Emojis
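Most emojis use 4-byte UTF-8 sequences, so a single misread emoji turns into up to four garbage characters. Because the first byte is 0xF0, corrupted emojis often appear as sequences starting with “ð” (typically “ðŸ”). Older symbols such as ❤ use 3-byte sequences and start with “â” instead:
ðŸ˜€ → 😀
ðŸ‘ → 👍 (the final byte, 0x8D, has no Windows-1252 character and usually shows as nothing)
â¤ → ❤ (the middle byte, 0x9D, has no Windows-1252 character)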
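The patterns in this section can be turned into a rough automated check. The Python sketch below is a heuristic only: the function name `looks_like_mojibake` and the choice of patterns are assumptions made for illustration, and the check can miss some corrupted text or flag legitimate text that happens to contain these sequences.

```python
import re

# Telltale sequences left behind when UTF-8 is read as Latin-1/Windows-1252.
MOJIBAKE_HINTS = re.compile(
    "Ã[\u0080-\u00bf]"    # accented letters: Ã followed by a second byte
    "|Â[\u00a0-\u00bf]"   # stray Â before symbols such as £, © or °
    "|â€"                 # curly quotes and dashes
    "|â‚¬"                # Euro sign
    "|ðŸ"                 # prefix shared by most 4-byte emojis
)

def looks_like_mojibake(text: str) -> bool:
    """Return True if the text contains a common mojibake sequence."""
    return MOJIBAKE_HINTS.search(text) is not None

print(looks_like_mojibake("CafÃ©"))  # garbled input
print(looks_like_mojibake("Café"))   # clean input
```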
Fixing Mojibake
Repairing mojibake requires reversing the encoding error. The process involves:
- Identifying the corruption type: Determine which encoding was incorrectly applied.
- Re-encoding the bytes: Interpret the garbled characters as their original byte values.
- Decoding correctly: Apply the proper encoding (usually UTF-8) to restore the original text.
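The three steps above can be sketched in Python as follows. This is a minimal illustration, not the implementation of any particular tool: the function name `fix_mojibake` is hypothetical, and it assumes the text was UTF-8 misread as Windows-1252 unless told otherwise.

```python
def fix_mojibake(garbled: str, wrong_encoding: str = "cp1252") -> str:
    """Reverse one layer of mojibake, or return the input unchanged.

    wrong_encoding is the encoding assumed to have been applied by mistake.
    """
    try:
        # Step 2: turn the garbled characters back into their original bytes.
        raw = garbled.encode(wrong_encoding)
        # Step 3: decode those bytes with the encoding they were written in.
        return raw.decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text does not fit the assumed corruption type; leave it alone.
        return garbled

print(fix_mojibake("CafÃ©"))
print(fix_mojibake("donâ€™t"))
print(fix_mojibake(fix_mojibake("ÃƒÂ©")))  # double encoding needs two passes
```

Returning the input unchanged on failure keeps the function safe to run on text that was never corrupted. It will not recover text where undefined bytes were dropped during the original misreading.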
Our Mojibake Decoder automates this process, using pattern recognition and re-encoding techniques to repair corrupted text. It handles UTF-8/Latin-1 mismatches, Windows-1252 smart quote problems, double encoding, and emoji corruption.
Preventing Mojibake
The best approach to mojibake is prevention. Here are key practices:
Use UTF-8 Everywhere
Standardize on UTF-8 for all text handling. Set it as the default encoding in your text editors, development environments, and databases.
Declare Encoding in HTML
Always include the character encoding declaration in your HTML documents:
<meta charset="UTF-8">
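Place this as early as possible in the <head> section, before the title and any other text content. The HTML specification requires the declaration to appear within the first 1024 bytes of the document.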
Configure Databases Properly
Set your database to use UTF-8. In MySQL, use utf8mb4 rather than utf8: utf8 is an alias for utf8mb3, which stores at most 3 bytes per character and so cannot hold 4-byte characters like most emojis. Ensure the database, its tables, and the client connection all use the same encoding.
Specify Encoding in Code
When reading or writing text files in any programming language, explicitly specify UTF-8 encoding rather than relying on system defaults.
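In Python, for example, this means passing `encoding="utf-8"` to `open()` on both the write and the read side. The sketch below uses a temporary directory and a made-up file name so that it can run anywhere:

```python
import os
import tempfile

text = "Café “quoted” 😀"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")  # hypothetical file name

    # Write with an explicit encoding instead of the platform default.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Read back with the same explicit encoding.
    with open(path, "r", encoding="utf-8") as f:
        round_trip = f.read()

print(round_trip == text)
```

Without the `encoding` argument, the result depends on the platform and Python configuration, which is exactly the kind of hidden assumption that produces mojibake.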
The Connection to Typography
Mojibake is closely related to proper typography. Many mojibake issues involve typographic characters—curly quotes, em dashes, and special symbols that distinguish professional writing from plain text. When these characters get corrupted, not only does the text become unreadable, but the careful typographic choices made by the author are lost.
Using a typography tool to properly format your text, combined with correct encoding practices, ensures your polished content reaches readers exactly as intended.
Conclusion
Mojibake might seem like an arcane technical problem, but it affects anyone who works with text across different systems, platforms, or applications. Understanding the underlying cause—mismatched character encodings—demystifies the problem and points to the solution: consistent use of UTF-8 and proper encoding declarations.
When prevention fails and you encounter garbled text, our Mojibake Decoder can restore your corrupted content in seconds. Whether you’re dealing with a single corrupted email or a database full of encoding errors, the right tools and knowledge can turn gibberish back into readable text.