Modern Python Cookbook
上QQ阅读APP看书,第一时间看更新

Choosing a data structure

Python offers a number of built-in data structures to help us work with collections of data. It can be confusing to match the data structure features with the problem we're trying to solve.

How do we choose which structure to use? What are the features of lists, sets, and dictionaries? Why do we have tuples and frozen sets?

Getting ready

Before we put data into a collection, we'll need to consider how we'll gather the data, and what we'll do with the collection once we have it. The big question is always how we'll identify a particular item within the collection.

We'll look at a few key questions that we need to answer to decide which of the built-in structures is appropriate.

How to do it...

  1. Is the programming focused on simple existence? Are items present or absent from a collection? An example of this is validating input values. When the user enters something that's in the collection, their input is valid; otherwise, their input is invalid. Simple membership tests suggest using a set:
    def confirm() -> bool:
        yes = {"yes", "y"}
        no = {"no", "n"}
        while (answer := input("Confirm: ")).lower() not in (yes|no):
            print("Please respond with yes or no")
        return answer in yes
    

    A set holds items in no particular order. Once an item is a member, we can't add it again:

    >>> yes = {"yes", "y"}
    >>> no = {"no", "n"}
    >>> valid_inputs = yes | no
    >>> valid_inputs.add("y")
    >>> valid_inputs
    {'yes', 'no', 'n', 'y'}
    

    We have created a set, valid_inputs, by performing a set union using the | operator among sets. We can't add another y to a set that already contains y. There is no exception raised if we try such an addition, but the contents of the set don't change.

    Also, note that the order of the items in the set isn't exactly the order in which we initially provided them. A set can't maintain any particular order to the items; it can only determine if an item exists in the set. If order matters, then a list is more appropriate.

  2. Are we going to identify items by their position in the collection? An example includes the lines in an input file—the line number is its position in the collection. When we must identify an item using an index or position, we must use a list:
    >>> month_name_list = ["Jan", "Feb", "Mar", "Apr",
    ...    "May", "Jun", "Jul", "Aug",
    ...    "Sep", "Oct", "Nov", "Dec"]
    >>> month_name_list[8]
    'Sep'
    >>> month_name_list.index("Feb")
    

    We have created a list, month_name_list, with 12 string items in a specific order. We can pick an item by providing its position. We can also use the index() method to locate the index of an item in the list. List index values in Python always start with a position of zero. While a list has a simple membership test, the test can be slow for a very large list, and a set might be a better idea if many such tests will be needed.

    If the number of items in the collection is fixed—for example, RGB colors have three values—this suggests a tuple instead of a list. If the number of items will grow and change, then the list collection is a better choice than the tuple collection.

  3. Are we going to identify the items in a collection by a key value that's distinct from the item's position? An example might include a mapping between strings of characters—words—and integers that represent the frequencies of those words, or a mapping between a color name and the RGB tuple for that color. We'll look at mappings and dictionaries in the next chapter; the important distinction is mappings don't locate items by position the way lists do.

    In contrast to a list, here's an example of a dictionary:

    >>> scheme = {"Crimson": (220, 14, 60),
    ... "DarkCyan": (0, 139, 139),
    ... "Yellow": (255, 255, 00)}
    >>> scheme['Crimson']
    (220, 14, 60)
    

    In this dictionary, scheme, we've created a mapping from color names to the RGB color tuples. When we use a key, for example "Crimson", to get an item from the dictionary, we can retrieve the value bound to that key.

  4. Consider the mutability of items in a set collection and the keys in a dict collection. Each item in a set must be an immutable object. Numbers, strings, and tuples are all immutable, and can be collected into sets. Since list, dict, are set objects and are mutable, they can't be used as items in a set. It's impossible to build a set of list objects, for example.

    Rather than create a set of list items, we must transform each list item into an immutable tuple. Similarly, dictionary keys must be immutable. We can use a number, a string, or a tuple as a dictionary key. We can't use a list, or a set, or any another mutable mapping as a dictionary key.

How it works...

Each of Python's built-in collections offers a specific set of unique features. The collections also offer a large number of overlapping features. The challenge for programmers new to Python is to identify the unique features of each collection.

The collections.abc module provides a kind of roadmap through the built-in container classes. This module defines the Abstract Base Classes (ABCs) underlying the concrete classes we use. We'll use the names from this set of definitions to guide us through the features.

From the ABCs, we can see that there are actually places for a total of six kinds of collections:

  • Set: Its unique feature is that items are either members or not. This means duplicates are ignored:
  • Mutable set: The set collection
  • Immutable set: The frozenset collection
  • Sequence: Its unique feature is that items are provided with an index position:
  • Mutable sequence: The list collection
  • Immutable sequence: The tuple collection
  • Mapping: Its unique feature is that each item has a key that refers to a value:
  • Mutable mapping: The dict collection.
  • Immutable mapping: Interestingly, there's no built-in frozen mapping.

Python's libraries offer a large number of additional implementations of these core collection types. We can see many of these in the Python Standard Library.

The collections module contains a number of variations of the built-in collections. These include:

  • namedtuple: A tuple that offers names for each item in a tuple. It's slightly clearer to use rgb_color.red than rgb_color[0].
  • deque: A double-ended queue. It's a mutable sequence with optimizations for pushing and popping from each end. We can do similar things with a list, but deque is more efficient when changes at both ends are needed.
  • defaultdict: A dict that can provide a default value for a missing key.
  • Counter: A dict that is designed to count occurrences of a key. This is sometimes called a multiset or a bag.
  • OrderedDict: A dict that retains the order in which keys were created.
  • ChainMap: A dict that combines several dictionaries into a single mapping.

There's still more in the Python Standard Library. We can also use the heapq module, which defines a priority queue implementation. The bisect module includes methods for searching a sorted list very quickly. This allows a list to have performance that is a little closer to the fast lookups of a dictionary.

There's more...

We can find lists of data structures in summary web pages, like this one: https://en.wikipedia.org/wiki/List_of_data_structures.

Different parts of the article provide slightly different summaries of data structures. We'll take a quick look at four additional classifications of data structures:

  • Arrays: The Python array module supports densely packed arrays of values. The numpy module also offers very sophisticated array processing. See https://numpy.org. (Python has no built-in or standard library data structure related to linked lists).
  • Trees: Generally, tree structures can be used to create sets, sequential lists, or key-value mappings. We can look at a tree as an implementation technique for building sets or dicts, rather than a data structure with unique features. (Python has no built-in or standard library data structure implemented via trees).
  • Hashes: Python uses hashes to implement dictionaries and sets. This leads to good speed but potentially large memory consumption.
  • Graphs: Python doesn't have a built-in graph data structure. However, we can easily represent a graph structure with a dictionary where each node has a list of adjacent nodes.

We can—with a little cleverness—implement almost any kind of data structure in Python. Either the built-in structures have the essential features, or we can locate a built-in structure that can be pressed into service. We'll look at mappings and dictionaries in the next chapter: they provide a number of important features for organizing collections of data.

See also