Python Strings


A Python string, like 'Hello', stores text as a sequence of individual characters. Text is central to many computations - URLs, chat messages, the underlying HTML code that makes up web pages.

Python strings are written between single quote marks like 'Hello', or alternatively between double quote marks like "There".

a = 'Hello'
b = "isn't"
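
As a small sketch of the two quote styles (the variable names here are made up for illustration): the quote marks are not part of the string itself, so both styles produce the same value, and switching to the other style is an easy way to include a quote mark in the text.

```python
# Both quote styles build the same string value.
a = 'Hello'
b = "Hello"
print(a == b)     # True

# Double quotes on the outside make it easy to put a single quote inside.
c = "isn't"
print(len(c))     # 5 chars: i s n ' t
```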

Each character in a string is drawn from the unicode character set, which includes the "characters" of pretty much every language on earth, plus many emojis. See the Unicode Characters section below for more information.

String len()

The len() function returns the length of a string, the number of chars in it. It is valid to have a string of zero characters, written just as '', called the "empty string". The length of the empty string is 0. The len() function in Python is omnipresent - it's used to retrieve the length of every data type, with string just a first example.

>>> s = 'Python'
>>> len(s)
6
>>> len('')   # empty string
0

Convert Between Int and String

The formal name of the string type is "str". The str() function serves to convert many values to a string form. For example, this code computes the str form of the number 123:

>>> str(123)
'123'

Looking carefully at the values, 123 is a number, while '123' is a string of length 3, made of the three chars '1' '2' and '3'.

Going the other direction, the formal name of the integer type is "int", and the int() function takes in a value and tries to convert it to be an int value:

>>> int('1234')
1234
>>> int('xx1234')   # fails due to extra chars
ValueError: invalid literal for int() with base 10: 'xx1234'
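
Since int() raises an error on text that is not a number, code that converts outside input often wraps the call. Here is a minimal sketch of that idea; the function name parse_int and its default parameter are hypothetical, chosen just for this example.

```python
def parse_int(text, default=0):
    """Return int(text), or default if text is not a valid int (hypothetical helper)."""
    try:
        return int(text)
    except ValueError:
        # int() could not convert the text, e.g. 'xx1234'
        return default

print(parse_int('1234'))     # 1234
print(parse_int('xx1234'))   # 0
print(parse_int(' 42 '))     # 42, int() ignores surrounding whitespace
```

Whether falling back to a default is appropriate depends on the program; sometimes letting the error stop the program is the better choice.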

String Indexing [ ]

Chars are accessed with zero-based indexing with square brackets, so the first char is at index 0, the next at index 1, and the last char is at index len-1.

[Figure: the string 'Python' shown with index numbers 0..5]

Accessing an index number that is too large is an error. Strings are immutable, so they cannot be changed once created. Code to compute a different string always creates a new string in memory to represent the result (e.g. + below), leaving the original strings unchanged.

>>> s = 'Python'
>>> len(s)
6
>>> s[0]
'P'
>>> s[1]
'y'
>>> s[5]
'n'
>>> s[6]
IndexError: string index out of range
>>> s[0] = 'x'   # no, string is immutable
TypeError: 'str' object does not support item assignment
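
The len-1 rule can be sketched in code like this. The variable names are made up for the example, and checking the index against len(s) first is just one way to avoid the IndexError.

```python
s = 'Python'

# The last char is always at index len(s) - 1.
last = s[len(s) - 1]
print(last)              # n

# Check an index against the length before using it.
i = 6
if i < len(s):
    print(s[i])
else:
    print('index', i, 'is out of range')
```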

String +

The + operator combines (aka "concatenates") two strings to make a bigger string. This creates a new string to represent the result, leaving the original strings unchanged. (See the Working With Immutable section below.)

>>> s1 = 'Hello'
>>> s2 = 'There'
>>> s3 = s1 + ' ' + s2
>>> s3
'Hello There'
>>> s1
'Hello'

Concatenation with + only works between strings, not, for example, between a string and an int. Call the str() function to make a string out of the int, then concatenation works.

>>> 'score:' + 6
TypeError: can only concatenate str (not "int") to str
>>> 'score:' + str(6)
'score:6'

String in

The in operator checks, True or False, if something appears anywhere in a string. In this and other string comparisons, characters must match exactly, so 'a' matches 'a', but does not match 'A'. (Mnemonic: this is the same word "in" as used in the for-loop.)

>>> 'c' in 'abcd'
True
>>> 'c' in 'ABCD'
False
>>> 'aa'  in 'iiaaii'  # test string can be any length
True
>>> 'aaa' in 'iiaaii'
False
>>> '' in 'abcd'       # empty string: in is always True
True
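
Because in requires an exact match, one common approach for a case-insensitive check is to convert both strings to lowercase first. This is a sketch only; the function name contains_ignore_case is hypothetical, and it relies on the lower() function described later in this document.

```python
def contains_ignore_case(text, target):
    """True if target appears in text, ignoring upper/lower case (hypothetical helper)."""
    return target.lower() in text.lower()

print('c' in 'ABCD')                         # False, exact match required
print(contains_ignore_case('ABCD', 'c'))     # True
print(contains_ignore_case('ABCD', 'x'))     # False
```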

Character Class Tests

The characters that make up a string can be divided into several categories or "character classes":

[Figure: chars divided into alpha (lower/upper), digit, space, and misc leftovers]

alphabetic chars - e.g. 'abcXYZ' that make words. Alphabetic chars are further divided into upper and lowercase versions (the details depend on the particular unicode alphabet).

digit chars - e.g. '0' '1' .. '9' to make numbers

space chars - e.g. space ' ' newline '\n' and tab '\t'

Then there are all the other miscellaneous characters like '$' '^' '<' which are not alphabetic, digit, or space.

These test functions return True if all the chars in s are in the given class:

s.isalpha() - True for alphabetic "word" characters like 'abcXYZ' (applies to "word" characters in other unicode alphabets too like 'Σ')

s.isdigit() - True if all chars in s are digits '0..9'

s.isspace() - True for whitespace char, e.g. space, tab, newline

s.isupper(), s.islower() - True for uppercase / lowercase alphabetic chars. False for other characters like '9' and '$' which do not have upper/lower versions.

>>> 'a'.isalpha()
True
>>> '$'.isalpha()
False
>>> 'a'.islower()
True
>>> 'a'.isupper()
False
>>> s = '\u03A3'  # Unicode Sigma char
>>> s
'Σ'
>>> s.isalpha()
True
>>> '6'.isdigit()
True
>>> 'a'.isdigit()
False
>>> '$'.islower()
False
>>> ' '.isspace()
True
>>> '\n'.isspace()
True

Unicode aside: In the roman a-z alphabet, all alphabetic chars have upper/lower versions. In some alphabets, there are chars which are alphabetic, but which do not have upper/lower versions.
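
As a sketch of how the test functions divide up characters, the code below sorts single chars into the four categories described above. The function name classify and the category names it returns are made up for this example. The last lines illustrate the unicode aside with a Chinese character, which is alphabetic but has no upper/lower versions.

```python
def classify(ch):
    """Return a rough class name for a single char (hypothetical helper)."""
    if ch.isalpha():
        return 'alpha'
    if ch.isdigit():
        return 'digit'
    if ch.isspace():
        return 'space'
    return 'other'

classes = []
for ch in 'a9 $':
    classes.append(classify(ch))
print(classes)            # ['alpha', 'digit', 'space', 'other']

han = '中'                # alphabetic, but with no upper/lower forms
print(han.isalpha())      # True
print(han.isupper())      # False
print(han.islower())      # False
print(han.upper() == han) # True, upper() leaves it unchanged
```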

Startswith EndsWith

These functions return a boolean True/False depending on what appears at one end of a string. They are handy when you need to check for something at an end, e.g. if a filename ends with '.html'.

s.startswith(x) - True if s starts with string x

s.endswith(x) - True if s ends with string x

>>> 'Python'.startswith('Py')
True
>>> 'Python'.startswith('Px')
False
>>> 'resume.html'.endswith('.html')
True
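
The filename case mentioned above might look like this in a loop. The list of filenames is invented for the example.

```python
filenames = ['resume.html', 'notes.txt', 'index.html', 'photo.jpg']

# Collect only the names that end with '.html'.
html_files = []
for name in filenames:
    if name.endswith('.html'):
        html_files.append(name)

print(html_files)    # ['resume.html', 'index.html']
```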

String find()

s.find(x) - searches s left to right, returns int index where string x appears, or -1 if not found. Use s.find() to compute the index where a substring first appears.

>>> s = 'Python'
>>> s.find('y')
1
>>> s.find('tho')
2
>>> s.find('xx')
-1

There are some more rarely used variations of s.find(): s.find(x, start_index) - which begins the search at the given index instead of at 0; s.rfind(x) does the search right-to-left from the end of the string.
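
Here is a short sketch of those variations, using an example string chosen because the substring 'an' appears in it twice:

```python
s = 'banana'

first = s.find('an')               # search from index 0
second = s.find('an', first + 1)   # search again, starting just past the first match
last = s.rfind('an')               # search from the right end

print(first)               # 1
print(second)              # 3
print(last)                # 3
print(s.find('an', 4))     # -1, no match at index 4 or later
```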

Change Upper/Lower Case

s.lower() - returns a new version of s where each char is converted to its lowercase form, so 'A' becomes 'a'. Chars like '$' are unchanged. The original s is unchanged - a good example of strings being immutable. (See the Working With Immutable section below.) Each unicode alphabet includes its own rules about upper/lower case.

s.upper() - returns an uppercase version of s

>>> s = 'Python123'
>>> s.lower()
'python123'
>>> s.upper()
'PYTHON123'
>>> s
'Python123'

Strip Whitespace

s.strip() - returns a version of s with the whitespace characters at the very start and very end of the string all removed. Handy to clean up strings parsed out of a file.

>>> '   hi there  \n'.strip() 
'hi there'
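
As a sketch of the clean-up use case, the list below stands in for lines read from a file (the data is invented). Note that strip() only removes whitespace at the two ends; whitespace in the middle of the string is kept.

```python
lines = ['  apple \n', '\tbanana\n', 'cherry']

cleaned = []
for line in lines:
    cleaned.append(line.strip())
print(cleaned)      # ['apple', 'banana', 'cherry']

inner = '  hi   there '.strip()
print(inner)        # hi   there  (the middle spaces remain)
```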

String Replace

s.replace(old, new) - returns a version of s where all occurrences of old have been replaced by new. Does not pay attention to word boundaries, just replaces every instance of old in s. Replacing with the empty string effectively deletes the matching strings.

>>> 'this is it'.replace('is', 'xxx')
'thxxx xxx it'
>>> 'this is it'.replace('is', '')
'th  it'

Working With Immutable x = change(x)

Strings are "immutable", meaning the chars in a string never change. Instead of changing a string, code creates new strings.

Suppose we have a string, and want to change it to uppercase and add an exclamation mark at its end, so 'Hello' becomes 'HELLO!'.

The following code looks reasonable but does not work:

>>> s = 'Hello'
>>> s.upper()  # compute upper, but does not store it
'HELLO'
>>> s          # s is not changed
'Hello'

The correct form computes the uppercase form, and also stores it back in the s variable, a sort of x = change(x) form.

>>> s = 'Hello'
>>> s = s.upper()  # compute upper, store in s
>>> s = s + '!'    # add !, store in s
>>> s              # s is the new, computed string
'HELLO!'
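
The same x = change(x) pattern works inside a function. In this sketch (the function name shout is made up), the parameter s is re-assigned to each new string, while the caller's original string is left alone.

```python
def shout(s):
    """Return an uppercase version of s with '!' added (hypothetical helper)."""
    s = s.upper()    # compute upper, store back in s
    s = s + '!'      # add !, store back in s
    return s

word = 'Hello'
loud = shout(word)
print(loud)    # HELLO!
print(word)    # Hello, the original string is unchanged
```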

Backslash Special Chars

A backslash \ in a string "escapes" a special char we wish to include in the string, such as a quote or \n newline. Common backslash escapes:

\'   # single quote
\"   # double quote
\\   # a backslash
\n   # newline char

A string using \n:

a = 'First line\nSecond line\nThird line\n'
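
A detail that is easy to miss: each backslash escape is written with two characters in the code, but it stands for a single char in the string. A small sketch, with strings invented for the example:

```python
s = 'one\ntwo'
print(len(s))             # 7, the \n counts as one char
print(s)                  # prints "one" and "two" on separate lines

quote = 'isn\'t'          # escaped single quote inside single quotes
print(quote == "isn't")   # True

path = 'C:\\temp'         # \\ is one backslash char
print(len(path))          # 7
print(path)               # C:\temp
```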

Python strings can be written within triple ''' or """, in which case they can span multiple lines. This is useful for writing longer blocks of text.

a = """First line
Second line
Third line
"""

String Format

The string .format() function is a handy way to paste values into a string. It uses the special marker {} within a string to mark where things go, like this:

>>> 'Count: {}'.format(67)
'Count: 67'
>>> 'Count: {} and word: {}'.format(67, 'Yay')
'Count: 67 and word: Yay'

The older approach would be to compute str(67) manually and use + to put the result string together. The str.format() function is a more convenient tool for that situation.

For floating point values, typically you do not want to print all 15 or so digits of a float value. The format marker {:.4g} means print at most 4 significant digits, counting from the first non-zero digit rather than from the decimal point; "g" here is the general format, which works for float and int values as appropriate.

>>> 2/3   # has lots of digits
0.6666666666666666
>>> 'val: {:.4g}'.format(2/3)
'val: 0.6667'
>>> 'val: {:.2g}'.format(2/3)
'val: 0.67'
>>> 'val: {:.2g}'.format(45)
'val: 45'

There are many, many other options for format markers, but {:.4g} is a good one to know for the common situation of printing float values.
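
With the g format, the number in the marker counts significant digits, which shows up once the value has digits to the left of the decimal point. A short sketch with values picked for illustration:

```python
small = '{:.4g}'.format(2/3)
medium = '{:.4g}'.format(123.456)
large = '{:.4g}'.format(1234.5678)
huge = '{:.4g}'.format(123456.789)

print(small)     # 0.6667
print(medium)    # 123.5
print(large)     # 1235
print(huge)      # 1.235e+05, switches to exponent form for big values
```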

String Loops

The standard i/range() loop goes through all the index numbers for s:

s = 'Python'
for i in range(len(s)):
    print(s[i])   # use s[i] in here

The "foreach" loop works on strings too, accessing each char. Unlike the above form, here you do not have access to the index of each char as it is accessed.

s = 'Python'
for char in s:
    print(char)   # use char in here

Calling list() on a string yields a list of its chars, e.g. list('abc') yields ['a', 'b', 'c'].
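
To make the difference between the two loop forms concrete, here is a sketch that uses each one on the same example string. The index loop is the natural choice when the positions matter; the foreach loop is simpler when only the chars matter.

```python
s = 'banana'

# Index loop: record the index of every 'a'.
indexes = []
for i in range(len(s)):
    if s[i] == 'a':
        indexes.append(i)
print(indexes)    # [1, 3, 5]

# Foreach loop: just count the 'a' chars, no index needed.
count = 0
for ch in s:
    if ch == 'a':
        count += 1
print(count)      # 3
```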

More details at official Python String Docs

String Slices

[Figure: the string 'Python' shown with index numbers 0..5]

Slice syntax is a powerful way to refer to sub-parts of a string instead of just 1 char. s[ start : end ] - returns a substring from s beginning at start index, running up to but not including end index. If the start index is omitted, the slice starts from the beginning of the string. If the end index is omitted, the slice runs through the end of the string. If the start index is equal to the end index, the slice is the empty string.

>>> s = 'Python'
>>> s[2:4]
'th'
>>> s[2:]
'thon'
>>> s[:5]
'Pytho'
>>> s[4:4]  # start = end: empty string
''

If the end index is too large (out of bounds), the slice just runs through the end of the string. This is a case where Python is permissive about wrong/out-of-bounds indexes. Similarly, if the start index is larger than the end index, the slice is just the empty string.

>>> s = 'Python'
>>> s[2:999]
'thon'
>>> s[3:2]  # zero chars
''

Negative numbers also work within [ ] and slices: -1 is the rightmost char, -2 is the char to its left, and so on. This is convenient when you want to extract chars relative to their position from the end of the string.

>>> s[-1]
'n'
>>> s[-2:]
'on'
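
Slices combine naturally with s.find(): find computes an index, and slices pull out the text on either side of it. A sketch, using an invented email address as the data:

```python
email = 'alice@example.com'

at = email.find('@')       # index of the '@'
user = email[:at]          # everything before the '@'
host = email[at + 1:]      # everything after the '@'
print(user)                # alice
print(host)                # example.com

suffix = email[-4:]        # last 4 chars, counted from the end
print(suffix)              # .com
```

This sketch assumes the '@' is present. If find() returned -1, the slices would still run without error but would give misleading results, so real code should check for -1 first.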

String split()

str.split(',') is a string function which divides a string up into a list of string pieces based on a "separator" parameter that separates the pieces:

>>> 'a,b,c'.split(',')
['a', 'b', 'c']
>>> 'a:b:c'.split(':')
['a', 'b', 'c']

A returned piece will be the empty string if we have two separators next to each other, e.g. the '::', or the separator is at the very start or end of the string:

>>> ':a:b::c:'.split(':')
['', 'a', 'b', '', 'c', '']

Special whitespace: split with no arguments at all splits on whitespace (space, tab, newline), and it groups multiple whitespace together. So it's a simple way to break a line of text into 'words' based on whitespace (note how the punctuation is lumped onto each 'word'):

>>> 'Hello there,     he said.'.split()
['Hello', 'there,', 'he', 'said.']

File strategy: a common pattern is to use 'for line in f' to loop over the lines in a file and 'line.split()' to break each line up into pieces. Some text file formats are laid out so that split() handles them easily.
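
Here is a minimal sketch of that file strategy. To keep the example self-contained, io.StringIO stands in for an open file, and the "name score" line format is invented for the example.

```python
import io

# io.StringIO acts like an open text file holding these three lines.
f = io.StringIO('alice 10\nbob 7\ncarol 12\n')

names = []
total = 0
for line in f:
    parts = line.split()     # e.g. ['alice', '10'], the newline is dropped too
    names.append(parts[0])
    total += int(parts[1])   # convert the score text to an int

print(names)    # ['alice', 'bob', 'carol']
print(total)    # 29
```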

String Join

','.join(lst) is a string function which is approximately the opposite of split: it takes a list of strings as its parameter and forms them into one big string, using the string it is called on (here ',') as the separator:

>>> ','.join(['a', 'b', 'c'])
'a,b,c'

The elements in the list should be strings, and join just puts them all together to make one big string. Note that split() and join() are both called in noun.verb form on a string; the list is just passed in as a parameter.
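
Since join() requires strings, a list of ints needs to be converted with str() first. The sketch below does that, then uses split() to go back the other way; the data is invented for the example.

```python
nums = [1, 2, 3]

# join() needs strings, so convert each int with str() first.
strs = []
for n in nums:
    strs.append(str(n))

text = '-'.join(strs)
print(text)     # 1-2-3

back = text.split('-')
print(back)     # ['1', '2', '3'], strings again, not ints
```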

Unicode Characters

In the early days of computers, the ASCII character encoding was very common, encoding the roman a-z alphabet. ASCII is simple, and requires just 1 byte to store 1 character, but it has no ability to represent characters of other languages.

Each character in a Python string is a unicode character, so characters for all languages are supported. Also, many emoji have been added to unicode as a sort of character.

Every unicode character is defined by a unicode "code point" which is basically a big int value that uniquely identifies that character. Unicode characters can be written using the "hex" version of their code point, e.g. "03A3" is the "Sigma" char Σ, and "2665" is the heart emoji char ♥.

Hexadecimal aside: hexadecimal is a way of writing an int in base-16 using the digits 0-9 plus the letters A-F, like this: 7F9A or 7f9a. Two hex digits together like 9A or FF represent the value stored in one byte, so hex is a traditional easy way to write out the value of a byte. When you look up an emoji on the web, typically you will see the code point written out in hex, like 1F644, the eye-roll emoji 🙄.

You can write a unicode char out in a Python string with a \u followed by the 4 hex digits of its code point. Notice how each unicode char is just one more character in the string:

>>> s = 'hi \u03A3'
>>> s
'hi Σ'
>>> len(s)
4
>>> s[0]
'h'
>>> s[3]
'Σ'
>>>
>>> s = '\u03A9'  # upper case omega
>>> s
'Ω'
>>> s.lower()     # compute lowercase
'ω'
>>> s.isalpha()   # isalpha() knows about unicode
True
>>>
>>> 'I \u2665'
'I ♥'

For a code point with more than 4 hex digits, use \U (uppercase U) followed by 8 digits with leading 0's as needed, like the fire emoji 1F525, and the inevitable 1F4A9.

>>> 'the place is on \U0001F525'
'the place is on 🔥'
>>> s = 'oh \U0001F4A9'
>>> len(s)
4
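
The link between a char and its code point can be explored with Python's built-in ord() and chr() functions. These are not covered elsewhere in this document, so treat this as an optional sketch: ord() gives the int code point of a char, and chr() goes the other way.

```python
sigma_code = ord('Σ')          # the code point as an int
print(sigma_code == 0x03A3)    # True, 0x03A3 is the hex form of 931

heart = chr(0x2665)            # build the char from its code point
print(heart)                   # ♥

fire_hex = hex(ord('🔥'))      # hex text of the code point
print(fire_hex)                # 0x1f525
print(len('🔥'))               # 1, a single char even with a 5-digit code point
```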

Not all computers have the ability to display all unicode chars, so the display of a string may fall back to something like \x0001F489 - telling you the hex digits for the char, even though it can't be drawn on screen.