Text

Strings

Strings are immutable Unicode text sequences.

Compare three words by code-point count and UTF-8 byte count. ASCII characters take one byte each (hello → 5 bytes); the é in café is one code point but two UTF-8 bytes; each Thai character takes three. The str type treats all three words the same way: len() counts code points, however many bytes each one needs once encoded.

Source

english = "hello"
french = "café"
thai = "สวัสดี"

for label, word in [("English", english), ("French", french), ("Thai", thai)]:
    print(label, word, len(word), len(word.encode("utf-8")))

Output

English hello 5 5
French café 4 5
Thai สวัสดี 6 18

Figure: code points of café (c, a, f, é) and their UTF-8 bytes (63 61 66 c3 a9).

Strings are sequences of Unicode code points. UTF-8 encoding turns them into bytes; `é` takes two bytes, `c` takes one.
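
One way to check the byte values for café directly is to encode the word and print the result in hexadecimal. This is a small sketch rather than part of the original example; the names `word`, `encoded`, and `per_char` are illustrative, and the é is written as an escape so that it is unambiguously the single code point U+00E9.

```python
# Illustrative sketch: inspect the UTF-8 bytes behind each code point.
word = "caf\u00e9"  # "café" with a precomposed é (U+00E9)
encoded = word.encode("utf-8")

# bytes.hex() renders the raw bytes as a hexadecimal string.
print(encoded.hex())  # 636166c3a9

# Encode each character on its own to see how many bytes it needs.
per_char = {char: char.encode("utf-8").hex() for char in word}
print(per_char)
```

The per-character mapping shows one byte each for c, a, and f, and the two bytes c3 a9 for é.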

Indexing and iteration work with Unicode code points, not encoded bytes. ord() returns the integer code point, which is often displayed in hexadecimal when teaching text encoding.

Source

print(thai[0])
print([hex(ord(char)) for char in thai[:2]])

Output

ส
['0xe2a', '0xe27']
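
To make the contrast with encoded bytes concrete, the following sketch indexes the same Thai word both as a str and as a bytes object. It is an illustration under the assumption that UTF-8 is the encoding of interest; the names `data`, `first_char`, and `first_byte` are not from the original example.

```python
# Illustrative sketch: the same text indexed as str and as bytes.
thai = "สวัสดี"
data = thai.encode("utf-8")

first_char = thai[0]   # a one-character str
first_byte = data[0]   # an int in the range 0-255
print(hex(ord(first_char)), hex(first_byte))  # 0xe2a 0xe0

# The first three bytes together decode back to the first character.
print(data[:3].decode("utf-8") == first_char)  # True

# chr() is the inverse of ord().
print(chr(ord(first_char)) == first_char)  # True
```

Indexing the str yields a whole character, while indexing the bytes yields only one of the three bytes that encode it.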

String methods return new strings because strings are immutable. Encoding turns text into bytes when another system needs a byte representation.

Source

text = "  café  "
clean = text.strip()
print(clean)
print(clean.upper())
print(clean.encode("utf-8"))

Output

café
CAFÉ
b'caf\xc3\xa9'
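
Immutability can also be observed directly. The sketch below is a hedged illustration, not part of the original example: it shows that strip() leaves the original string untouched, that item assignment on a str raises TypeError, and that decode() reverses encode() when the same encoding is used. The name `round_trip` is illustrative.

```python
# Illustrative sketch: strings cannot be changed in place.
text = "  caf\u00e9  "  # "  café  " with a precomposed é
clean = text.strip()

# strip() returned a new string; the original still has its spaces.
print(len(text), len(clean))  # 8 4

# Item assignment is not supported on str.
try:
    clean[0] = "C"
except TypeError as error:
    print("TypeError:", error)

# decode() reverses encode() when the same encoding is used.
round_trip = clean.encode("utf-8").decode("utf-8")
print(round_trip == clean)  # True
```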
