Text
Strings
Compare three words by code-point count and UTF-8 byte count. ASCII characters take one byte each (hello → 5 bytes); the é in café is one code point but two UTF-8 bytes; each Thai code point in สวัสดี takes three bytes. The str type abstracts over all three.
Source
english = "hello"
french = "café"
thai = "สวัสดี"
for label, word in [("English", english), ("French", french), ("Thai", thai)]:
    print(label, word, len(word), len(word.encode("utf-8")))
Output
English hello 5 5
French café 4 5
Thai สวัสดี 6 18
Indexing and iteration work with Unicode code points, not encoded bytes. ord() returns the integer code point, which is often displayed in hexadecimal when teaching text encoding.
Source
print(thai[0])
print([hex(ord(char)) for char in thai[:2]])
Output
ส
['0xe2a', '0xe27']
String methods return new strings because strings are immutable. Encoding turns text into bytes when another system needs a byte representation.
Source
text = " café "
clean = text.strip()
print(clean)
print(clean.upper())
print(clean.encode("utf-8"))
Output
café
CAFÉ
b'caf\xc3\xa9'
Notes
- Use str for text and bytes for binary data.
- len(text) counts Unicode code points; len(text.encode("utf-8")) counts encoded bytes.
- ASCII text is a useful baseline because each ASCII code point is one UTF-8 byte.
- String methods return new strings because strings are immutable.
- User-visible “characters” can be more subtle than code points; combining marks and emoji sequences may need specialized text handling.
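The last note can be made concrete with a minimal sketch using the standard-library unicodedata module. The variable names (composed, decomposed, normalized, mark_count) are illustrative and not part of the original example. The sketch spells café two ways, once with the precomposed é and once with a plain e followed by a combining accent, and shows that len() and == see them as different until they are normalized. It also counts the nonspacing marks in the Thai greeting, which is one reason its six code points do not correspond to six visually separate characters.

```python
import unicodedata

# Two spellings of the same visible word:
# precomposed é (U+00E9) versus e + combining acute accent (U+0301).
composed = "caf\u00e9"
decomposed = "cafe\u0301"

print(len(composed), len(decomposed))  # code-point counts differ
print(composed == decomposed)          # compared code point by code point

# NFC normalization composes e + U+0301 into the single code point é.
normalized = unicodedata.normalize("NFC", decomposed)
print(len(normalized), normalized == composed)

# Count code points in the Thai word whose general category is
# "Mn" (nonspacing mark); these attach to a preceding base character.
thai = "สวัสดี"
mark_count = sum(1 for char in thai if unicodedata.category(char) == "Mn")
print(len(thai), mark_count)
```

Normalizing before comparing or counting is only one possible approach; splitting text into user-perceived characters (grapheme clusters), including emoji sequences, generally needs a dedicated library rather than str methods alone.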
See also
- prerequisite: Values
- related: String Formatting
- next depth: Bytes and Bytearray
- related: Regular Expressions