Unicode and UTF-8 Explained: Code Points, Encodings, and Why Emoji Have Length 2

Unicode: a universal character catalog

Before Unicode, every region had its own character encoding: ASCII for English, Latin-1 for Western European languages, Shift-JIS for Japanese. Files would appear garbled when opened on a system that expected a different encoding.

Unicode solves this by assigning a unique number, called a code point, to every character in every writing system. Code points are written as U+XXXX in hexadecimal. For example: U+0041 is "A", U+03B1 is "α" (Greek alpha), U+1F600 is the grinning face emoji.
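As a minimal sketch of moving between characters and code points in JavaScript, the built-in codePointAt and String.fromCodePoint can be used as follows. The helper name toCodePointLabel is illustrative and not part of any library.

```javascript
// Hypothetical helper: format the first code point of a string as U+XXXX.
function toCodePointLabel(char) {
  const cp = char.codePointAt(0);
  // Code points are conventionally padded to at least four hex digits.
  return 'U+' + cp.toString(16).toUpperCase().padStart(4, '0');
}

console.log(toCodePointLabel('A'));          // U+0041
console.log(toCodePointLabel('α'));          // U+03B1
console.log(toCodePointLabel('😀'));         // U+1F600

// The reverse direction: build a string from a numeric code point.
console.log(String.fromCodePoint(0x03B1));   // α
```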

Unicode defines well over 140,000 characters across more than 150 scripts, and the totals grow with each release (version 16.0 lists 154,998 characters in 168 scripts). The catalog is organized into 17 planes, each holding 65,536 code points, for 1,114,112 possible code points in total. The Basic Multilingual Plane (BMP, U+0000 to U+FFFF) covers most everyday characters. Emoji and many historic scripts live in the supplementary planes above U+FFFF.
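Because every plane holds exactly 65,536 (0x10000) code points, the plane number falls out of a single division. The sketch below assumes a hypothetical helper named planeOf; it is only an illustration of the arithmetic.

```javascript
const PLANE_SIZE = 0x10000; // 65,536 code points per plane

// Hypothetical helper: return the plane (0 to 16) that a code point belongs to.
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('Not a valid Unicode code point');
  }
  return Math.floor(codePoint / PLANE_SIZE);
}

console.log(planeOf(0x0041));   // 0  (BMP)
console.log(planeOf(0x1F600));  // 1  (a supplementary plane)
console.log(planeOf(0x10FFFF)); // 16 (the last plane)
console.log(17 * PLANE_SIZE);   // 1114112 possible code points
```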

UTF-8: one way to store code points as bytes

Unicode defines what characters exist and what number each one gets. It does not say how to store those numbers in a file or transmit them over a network. That is what an encoding does. UTF-8 is the dominant encoding for Unicode.

UTF-8 uses a variable number of bytes per character (1 to 4), depending on the code point value. The ASCII range (U+0000 to U+007F) encodes in exactly 1 byte, which makes UTF-8 backward-compatible with all ASCII text: any valid ASCII file is also a valid UTF-8 file.

Char	Code point	Bytes	Hex bytes	Note
A	U+0041	1	41	Basic ASCII, 1 byte
é	U+00E9	2	C3 A9	Latin-1 Supplement, 2 bytes
€	U+20AC	3	E2 82 AC	Currency symbol, 3 bytes
😀	U+1F600	4	F0 9F 98 80	Emoji, 4 bytes (surrogate pair in JS)
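The byte values in the table can be reproduced by hand. The sketch below is one way to write out the standard UTF-8 bit layout; encodeCodePoint and toHex are hypothetical helper names, and in real code the built-in TextEncoder does the same job.

```javascript
// Hypothetical helper: encode one code point into an array of UTF-8 bytes.
function encodeCodePoint(cp) {
  if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
    throw new RangeError('Not an encodable code point');
  }
  if (cp <= 0x7F) return [cp];                          // 0xxxxxxx
  if (cp <= 0x7FF) {
    return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)];      // 110xxxxx 10xxxxxx
  }
  if (cp <= 0xFFFF) {                                   // 1110xxxx + 2 continuation bytes
    return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
  }
  return [                                              // 11110xxx + 3 continuation bytes
    0xF0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3F),
    0x80 | ((cp >> 6) & 0x3F),
    0x80 | (cp & 0x3F),
  ];
}

// Hypothetical helper: format bytes the way the table does.
function toHex(bytes) {
  return bytes.map((b) => b.toString(16).toUpperCase().padStart(2, '0')).join(' ');
}

console.log(toHex(encodeCodePoint(0x0041)));  // 41
console.log(toHex(encodeCodePoint(0x00E9)));  // C3 A9
console.log(toHex(encodeCodePoint(0x20AC)));  // E2 82 AC
console.log(toHex(encodeCodePoint(0x1F600))); // F0 9F 98 80
```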

Why string lengths surprise you

JavaScript does not store strings as UTF-8 internally. It uses UTF-16, where most characters take one 16-bit code unit but supplementary plane characters (like emoji) take two. The built-in .length property counts these code units, not characters.

'A'.length        // 1  (correct)
'é'.length        // 1  (correct)
'😀'.length       // 2  (surprise: two UTF-16 code units)
[...'😀'].length  // 1  (correct: spread iterates code points)

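Other languages differ in what their default length reports. Python 3's len() counts code points. Go's len() and Rust's String::len() count UTF-8 bytes; use utf8.RuneCountInString(s) in Go or s.chars().count() in Rust to count code points. Java's String.length() and C#'s string.Length count UTF-16 code units, the same as JavaScript. When in doubt, use your language's "code point count" function rather than the raw byte or unit length.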
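To see how far the three measurements can drift apart, the sketch below reports UTF-16 code units, code points, and UTF-8 bytes for the same string. The function name measure is an assumption made for this example; it relies only on the built-in TextEncoder and string iteration.

```javascript
// Hypothetical helper: measure one string three different ways.
function measure(str) {
  return {
    utf16Units: str.length,                           // what .length reports
    codePoints: [...str].length,                      // string iteration walks code points
    utf8Bytes: new TextEncoder().encode(str).length,  // size when stored as UTF-8
  };
}

console.log(measure('A'));   // { utf16Units: 1, codePoints: 1, utf8Bytes: 1 }
console.log(measure('é'));   // { utf16Units: 1, codePoints: 1, utf8Bytes: 2 }
console.log(measure('😀'));  // { utf16Units: 2, codePoints: 1, utf8Bytes: 4 }
```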

The BOM (byte order mark)

The BOM is the character U+FEFF placed at the very beginning of a file. In UTF-16, it indicates byte order (big-endian vs little-endian). In UTF-8, there is no byte order to specify, so the BOM is technically unnecessary.

The problem: some Windows tools (notably older versions of Excel and Notepad) add a UTF-8 BOM anyway. This puts a hidden three-byte sequence (EF BB BF) at the start of the file. Code that does not expect it treats the BOM as part of the first value, so a CSV header or JSON document that looks correct fails to match or parse. If you encounter a mystery character at the start of a string, check for a BOM first.
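A minimal sketch of detecting and removing a BOM, assuming the text has already been decoded into a JavaScript string (where the BOM appears as the single code unit U+FEFF). The helper names stripBom and hasUtf8Bom are illustrative.

```javascript
// Hypothetical helper: drop a leading U+FEFF from an already-decoded string.
function stripBom(text) {
  return text.charCodeAt(0) === 0xFEFF ? text.slice(1) : text;
}

// Hypothetical helper: check raw bytes for the UTF-8 BOM sequence EF BB BF.
function hasUtf8Bom(bytes) {
  return bytes.length >= 3 &&
    bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF;
}

const header = '\uFEFFid';                                  // looks like "id" when printed
console.log(header === 'id');                               // false: the BOM is still there
console.log(stripBom(header) === 'id');                     // true
console.log(hasUtf8Bom(new TextEncoder().encode(header)));  // true
console.log(stripBom('id') === 'id');                       // true: nothing to strip
```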

Other encodings you will encounter

UTF-16

Internal use in runtimes

Uses 2 or 4 bytes per character. Used internally by JavaScript, Java, and C#. The source of the "emoji has length 2" issue.

UTF-32

Rare; some internal databases

Uses exactly 4 bytes for every code point. Simple to index but wastes memory: ASCII text takes four times the space it does in UTF-8.

Latin-1 (ISO-8859-1)

Legacy systems only

Legacy 8-bit encoding covering only 256 characters. Cannot represent most non-Western scripts, emoji, or thousands of common symbols.

ASCII

Historical; covered by UTF-8

The original 7-bit encoding. Only 128 characters. UTF-8 is a strict superset: any valid ASCII file is also valid UTF-8.
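The size trade-offs between the three Unicode encodings can be estimated directly from a string. This is a sketch under stated assumptions: the helper name encodedSizes is hypothetical, no BOM is counted, and the input is assumed to be well-formed (no unpaired surrogates).

```javascript
// Hypothetical helper: byte size of a string in each Unicode encoding.
function encodedSizes(str) {
  return {
    utf8: new TextEncoder().encode(str).length, // 1 to 4 bytes per code point
    utf16: str.length * 2,                      // 2 bytes per UTF-16 code unit
    utf32: [...str].length * 4,                 // 4 bytes per code point
  };
}

console.log(encodedSizes('hello'));   // { utf8: 5, utf16: 10, utf32: 20 }
console.log(encodedSizes('日本語'));  // { utf8: 9, utf16: 6, utf32: 12 }
console.log(encodedSizes('😀'));      // { utf8: 4, utf16: 4, utf32: 4 }
```

ASCII-heavy text is smallest in UTF-8, while text made mostly of BMP characters outside ASCII (such as Japanese) can be smaller in UTF-16.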

Practical implications for developers

‣HTML pages: declare the charset. Always include <meta charset="utf-8"> as the first element inside <head>. Browsers use it to determine how to decode the file.
‣HTTP responses: set Content-Type. Add charset=utf-8 to your Content-Type header: Content-Type: text/html; charset=utf-8. This overrides any meta tag.
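‣Databases: use utf8mb4 in MySQL. MySQL's utf8 charset (an alias for utf8mb3) stores at most 3 bytes per character, so it cannot hold 4-byte characters like emoji: depending on the SQL mode, the insert fails with an error or the value is truncated at the first such character. Use utf8mb4 for full Unicode support.
‣String length: count code points, not UTF-16 units. In JavaScript, use Array.from(str).length or [...str].length to count code points. The built-in .length property counts UTF-16 code units. Neither matches what a reader sees in every case: a letter followed by a combining accent is still two code points.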
‣File reading: specify the encoding. In Node.js, fs.readFile returns a Buffer by default. Pass 'utf8' as the second argument to get a decoded string.
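The file-reading point can be sketched in Node.js as follows. The example uses the synchronous fs calls and a temporary directory for brevity; the helper name roundTrip and the file name sample.txt are assumptions made for illustration only.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: write `text` to a temporary file as UTF-8, then read
// it back with and without an encoding argument.
function roundTrip(text) {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'utf8-demo-'));
  const file = path.join(dir, 'sample.txt');
  try {
    fs.writeFileSync(file, text, 'utf8');
    return {
      raw: fs.readFileSync(file),             // no encoding: a Buffer of bytes
      decoded: fs.readFileSync(file, 'utf8'), // encoding given: a decoded string
    };
  } finally {
    fs.rmSync(dir, { recursive: true, force: true }); // clean up the temp dir
  }
}

const result = roundTrip('caf\u00E9 \u{1F600}');
console.log(Buffer.isBuffer(result.raw)); // true
console.log(result.raw.length);           // 10 (UTF-8 bytes)
console.log(typeof result.decoded);       // string
console.log(result.decoded.length);       // 7 (UTF-16 code units)
```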

Frequently asked questions

What is the difference between Unicode and UTF-8?

Unicode is a standard that assigns a unique number (called a code point) to every character in every writing system on earth. UTF-8 is one way to store those numbers as bytes. You can think of Unicode as the phone book (a character's name and number) and UTF-8 as the format you use to write that number down. Other encodings like UTF-16 and UTF-32 are different formats for the same phone book.

Why does my emoji have length 2 in JavaScript?

JavaScript strings are encoded in UTF-16 internally. Most characters fit in one 16-bit code unit, so their .length is 1. Emoji and some other characters have code points above U+FFFF, which require two 16-bit code units called a surrogate pair. The .length property counts code units, not characters, so emoji report a length of 2. Use [...str].length or Array.from(str).length to count code points instead.
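The two code units of a surrogate pair can be derived with a little arithmetic. The sketch below follows the standard UTF-16 rule (subtract 0x10000, then split the remaining 20 bits into two halves of 10); the function name toSurrogatePair is a hypothetical helper.

```javascript
// Hypothetical helper: split a supplementary code point into its surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('Only code points above U+FFFF need a surrogate pair');
  }
  const offset = codePoint - 0x10000;     // a 20-bit value
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F600);
console.log(high.toString(16), low.toString(16));  // d83d de00
console.log('😀'.charCodeAt(0) === high);          // true
console.log('😀'.charCodeAt(1) === low);           // true
```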

What causes the black diamond with a question mark?

That symbol (the Unicode replacement character U+FFFD) appears when a decoder encounters a byte sequence that is not valid in the declared encoding. The most common cause: a file or database field is stored in Latin-1 or Windows-1252 but read as if it were UTF-8. Accented letters (present in both encodings) and curly quotes or em dashes (present in Windows-1252 only) are stored there as single bytes above 0x7F, which are not valid UTF-8 on their own, so the decoder substitutes U+FFFD. Fix: ensure the encoding at every layer (file, database, HTTP header) is consistent.
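The mismatch is easy to reproduce. The sketch below decodes Latin-1 bytes as UTF-8 with the built-in TextDecoder, then decodes them correctly with Buffer, which is specific to Node.js. The helper decodeStrict is a hypothetical name; its fatal option is part of the standard TextDecoder API and assumes a runtime built with full encoding support.

```javascript
// "café" stored as Latin-1: the final byte 0xE9 is "é" in Latin-1 but is not
// a complete UTF-8 sequence.
const latin1Bytes = new Uint8Array([0x63, 0x61, 0x66, 0xE9]);

// Decoding with the wrong encoding substitutes U+FFFD for the bad byte.
const misread = new TextDecoder('utf-8').decode(latin1Bytes);
console.log(misread === 'caf\uFFFD');  // true

// Decoding with the encoding the bytes were written in (Node.js Buffer API).
const correct = Buffer.from(latin1Bytes).toString('latin1');
console.log(correct === 'caf\u00E9');  // true

// Hypothetical helper: throw instead of silently substituting U+FFFD.
function decodeStrict(bytes) {
  return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
}
```

Calling decodeStrict(latin1Bytes) throws rather than returning a mangled string, which can make encoding mismatches easier to catch early.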

Why is UTF-8 the default everywhere?

UTF-8 hits a rare combination of properties: it covers every Unicode character, it is backward-compatible with ASCII, it uses only 1 byte for the entire English alphabet (making ASCII-heavy text compact), it is self-synchronizing (you can find the start of any character from any byte), and it has no byte-order ambiguity. HTML5, JSON, and most modern protocols mandate or default to UTF-8.
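Self-synchronization follows from the bit patterns: every continuation byte has the form 10xxxxxx, and no lead byte does. A minimal sketch of using that to find a character boundary from an arbitrary byte offset, with hypothetical helper names isContinuationByte and startOfCharacter, and assuming the input is valid UTF-8:

```javascript
// Continuation bytes always match the bit pattern 10xxxxxx.
function isContinuationByte(byte) {
  return (byte & 0xC0) === 0x80;
}

// Hypothetical helper: walk backwards from `index` to the first byte of the
// character that contains it.
function startOfCharacter(bytes, index) {
  let i = index;
  while (i > 0 && isContinuationByte(bytes[i])) {
    i--;
  }
  return i;
}

// Byte layout: 61 | E2 82 AC | F0 9F 98 80
const bytes = new TextEncoder().encode('a€😀');
console.log(startOfCharacter(bytes, 0)); // 0 (the "a")
console.log(startOfCharacter(bytes, 2)); // 1 (start of "€")
console.log(startOfCharacter(bytes, 6)); // 4 (start of "😀")
```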

What is Unicode normalization (NFC vs NFD)?

Some characters can be represented in multiple ways. The letter "é" can be stored as a single precomposed code point U+00E9, or as the letter "e" (U+0065) followed by a combining accent (U+0301). These look identical but compare as unequal. NFC (Canonical Decomposition followed by Canonical Composition) prefers precomposed forms. NFD prefers decomposed forms. When comparing or searching user-supplied strings, normalize to the same form first.
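In JavaScript, the built-in String.prototype.normalize handles both forms. The sketch below shows the two spellings of "é" comparing unequal until they are normalized; equalNormalized is a hypothetical helper name, and the choice of NFC as the common form is an assumption (NFD would work equally well as long as both sides use it).

```javascript
const precomposed = '\u00E9';  // é as a single code point
const decomposed = 'e\u0301';  // e followed by a combining acute accent

console.log(precomposed === decomposed);            // false
console.log(precomposed.normalize('NFD').length);   // 2
console.log(decomposed.normalize('NFC').length);    // 1

// Hypothetical helper: compare two strings after normalizing both to NFC.
function equalNormalized(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}

console.log(equalNormalized(precomposed, decomposed)); // true
```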

ASISA Tools
