Strings

Strings are immutable Unicode text sequences.

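The example below compares three words by code-point count and UTF-8 byte count. Each ASCII character takes one byte, so hello is 5 bytes; the é in café is one code point but two UTF-8 bytes, so café is 4 code points and 5 bytes; each Thai code point in สวัสดี takes three bytes, for 18 bytes in total. The str type presents all three as sequences of code points, independent of how they are encoded.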

Source

english = "hello"
french = "café"
thai = "สวัสดี"

for label, word in [("English", english), ("French", french), ("Thai", thai)]:
    print(label, word, len(word), len(word.encode("utf-8")))

Output

English hello 5 5
French café 4 5
Thai สวัสดี 6 18
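The byte totals above can be broken down per code point. The following is a minimal sketch, not part of the original example; the name byte_widths is introduced here purely for illustration.

```python
# Illustrative sketch: how many UTF-8 bytes each code point needs.
# byte_widths is a hypothetical name, not from the original example.
words = ["hello", "café", "สวัสดี"]

byte_widths = {
    word: [len(char.encode("utf-8")) for char in word]
    for word in words
}

for word, widths in byte_widths.items():
    # The sum of the per-character widths equals len(word.encode("utf-8")).
    print(word, widths, sum(widths))
```

The widths come out as all ones for hello, a single two at the end of café, and six threes for the Thai word, which matches the totals of 5, 5, and 18.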

Indexing and iteration work with Unicode code points, not encoded bytes. ord() returns a character's code point as an integer; hex() shows it in hexadecimal, the form used in Unicode notation, so 0xe2a corresponds to U+0E2A.

Source

print(thai[0])
print([hex(ord(char)) for char in thai[:2]])

Output

ส
['0xe2a', '0xe27']
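To make the contrast with bytes concrete, the sketch below indexes the same word as a str and as its UTF-8 encoding. The variable names are chosen for illustration and do not appear in the original example.

```python
# Illustrative sketch: indexing str versus indexing bytes.
word = "café"
encoded = word.encode("utf-8")

# Indexing a str yields a one-character str (one code point).
last_char = word[-1]

# Indexing bytes yields an int (one byte value), here the
# second byte of the two-byte encoding of é.
last_byte = encoded[-1]

print(last_char, hex(ord(last_char)))  # é 0xe9
print(last_byte, hex(last_byte))       # 169 0xa9
print(list(encoded))                   # [99, 97, 102, 195, 169]
```

The last element of the str is the whole character é, while the last element of the bytes object is only half of its encoding.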

String methods return new strings because strings are immutable. Encoding turns text into bytes when another system needs a byte representation.

Source

text = "  café  "
clean = text.strip()
print(clean)
print(clean.upper())
print(clean.encode("utf-8"))

Output

café
CAFÉ
b'caf\xc3\xa9'
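A short sketch of what immutability means in practice, and of decoding as the reverse of encoding. It assumes the same UTF-8 encoding on both sides; the names upper, error, and round_trip are illustrative only.

```python
# Illustrative sketch: strings cannot be changed in place,
# and decode() reverses encode() for the same encoding.
text = "café"

# upper() returns a new string; text itself is unchanged.
upper = text.upper()

# Item assignment is not supported on str and raises TypeError.
error = None
try:
    text[0] = "C"
except TypeError as exc:
    error = exc

# Round trip: str -> bytes -> str.
data = text.encode("utf-8")
round_trip = data.decode("utf-8")

print(text, upper)
print(type(error).__name__)
print(round_trip == text)
```

To "modify" a string, build a new one, for example "C" + text[1:].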

Notes

Use str for text and bytes for binary data.
len(text) counts Unicode code points; len(text.encode("utf-8")) counts encoded bytes.
ASCII text is a useful baseline because each ASCII code point is one UTF-8 byte.
String methods return new strings because strings are immutable.
User-visible “characters” can be more subtle than code points; combining marks and emoji sequences may need specialized text handling.
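The last note can be illustrated with a combining mark. This is a minimal sketch using the standard-library unicodedata module; the variable names are hypothetical, and normalization is shown as one possible approach rather than a complete answer to user-visible character handling.

```python
import unicodedata

# Two ways to write the same visible text "café".
composed = "caf\u00e9"     # é as a single code point
decomposed = "cafe\u0301"  # e followed by COMBINING ACUTE ACCENT

print(len(composed), len(decomposed))  # 4 5
print(composed == decomposed)          # False

# NFC normalization composes the pair into the single code point.
nfc = unicodedata.normalize("NFC", decomposed)
print(len(nfc), nfc == composed)       # 4 True
```

Normalizing both sides before comparing is a common way to treat such strings as equal. It does not cover every case, such as emoji sequences, which may need grapheme-aware handling.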
