Exploring Dict and HashMap in Python and Rust

Comparison between python and rust code for counting strings in a list.

Learn Python with Rust: Counting String Occurrences

Let’s learn Python by looking at another language, Rust. Here is a Python pedagogical simplistic example that counts string occurrences in a list (could be done using Counter, I know), and the Rust counterpart (I know, could be done using into_iter and fold).

# Python

def foo(spam: list[str]) -> dict[str, int]:
    counter = {}
    for item in spam:
        counter[item] = counter.setdefault(item, 0) + 1
    return counter
// Rust

fn foo(spam: Vec<String>) -> HashMap<String, i32> {
    let mut counter: HashMap<String, i32> = HashMap::new();

    for item in spam {
        *counter
            .entry(item)
            .or_insert(0) += 1;
    }

    counter
}

Data Structures: Dict and HashMap

First, we define a dict in Python or a HashMap in Rust. Python dict can take any type of value by default, but both structures are hashmaps, meaning they bind a hash to a value. The value can be of any kind, but the key must be hashable (implement a __hash__ magic method). Immutable types are usually hashable, but not always. For example, a tuple containing a list cannot be hashed (here is why).

In Rust, the HashMap must be declared as mut, meaning mutable. Python dict are mutable by default. To create an immutable dict, you can use a mapping proxy (or the tier library frozendict):

import types

immutable_dict = types.MappingProxyType({'a': 1, 'b': 2})

Code Comparison

Rust Code

counter.entry(item).or_insert(0) accesses the entry associated with the key item, and if it’s vacant inserts 0 into the entry and returns it. This is the important part: It change the value of the entry and return this entry at the same time.

In order to update the value, we use the wildcard * and += 1. This will access and mutate the entry, adding 1 to the current count.

Python Code

The Python equivalent, counter.setdefault(item, 0), returns the value associated to the key if it exists, and If the key doesn’t exist in the dict (meaning there is no value), it set it to the default value, here 0 and returns it.

However, the result is a simple integer, not a reference to the dict value. Therefore, we need to increment the value explicitly with counter[item] = counter.get(item, 0) + 1.

Mutable vs Immutable Data Structures

In Rust, the or_insert call on a HashMap entry, give us a mutable reference to the entry inside the HashMap. Therefore the mutable state allow us to += its value, directly altering the HashMap. To do that we need * to dereference the mutable and get access to it. In Python, .setdefault returns a value, not a reference, so we must write it back to the dict with counter[item] =.

Hypothetical Python Enhanced Syntax

To achieve something similar to Rust in Python, we might want syntax like using a walrus operator:

(counter[item] := counter.get(item, 0)) += 1

Unfortunately, this isn’t valid Python syntax, although It is possible by altering the bytecode of a similar function (see codereview.stackexchange.com).

Enhanced versions

More pythonic code

To be clear, the python code showed previously in this article is not what you should aim for as It lack simplicity and conciseness. Here is a few improvement you might want to consider if you are looking for a more pythonic code.

Use collections.defaultdict instead of dict

The collections.defaultdict object allows you to define a default factory of value for each missing key that could be asked from the dict. By defining this default factory to int, the dict will generate a new int each time the key is missing. Then, the key is returned, effectively doing in valid Python what the hypothetical implementation using the walrus would do. Of course, it doesn’t looks like the Rust version at all anymore.

import collections

def foo(spam: list[str]) -> dict[str, int]:
    counter = collections.defaultdict(int)
    for item in spam:
        counter[item] += 1
    return counter

Use collections.Counter when you count something

What if we use something called Counter in the collections module? It should count stuff, right? Let’s change the counter definition first.

import collections

def foo(spam: list[str]) -> dict[str, int]:
    counter = collections.Counter()
    for item in spam:
        counter[item] += 1
    return counter

And it works. But that wouldn’t be very useful, isn’t it? I mean, we still have to perform a for loop and increment our-self. A object named “Counter” should be able to count stuff without our help, right? Well, it does.

def foo(spam: list[str]) -> dict[str, int]:
    return collections.Counter(spam)

Yes, now the function is pretty much a useless and confusing abstraction over Counter, but you get the point. Count stuff in python, think about collections.Counter!

More rustonic code

Same as the python code, there is better implementation to perform that task in rust, slightly more complex to do in Python, that’s why I didn’t used it in this article, but we’ll see those implementation now.

Use of Entry.and_modify() when modifying something

You cannot get a reference to an Entry that could be vacant as the compiler will not compile, simply because without making sure an entry is Occupied, there is no guaranty that the reference of the value could be acquired (the Occupied.get() method allows you to get a ref to the value). So, you need to perform a Entry.or_insert() before the +=.

But, you want to perform the += only when there is already a value to increment, right? Let’s change our code to use Entry.and_modify() then.

fn foo(spam: Vec<String>) -> HashMap<String, i32> {
    let mut counter: HashMap<String, i32> = HashMap::new();

    for item in spam {
        counter
            .entry(item)
            .and_modify(|counter| *counter += 1)
            .or_insert(1);
    }

    counter
}

Doing so, the increment will only be performed when the value already exists in the HashMap, and if it doesn’t, the inserted value is set to 1, since we don’t perform the increment after the insertion.

No more loop, let’s go functional

If you are used to functools.reduce() in Python, the following won’t be difficult to understand. If you are not, I advise you to look into it before.

So, what do we do with our code? We want to do something for each value of the list of strings. Each time we perform this “something”, we take the result and give it back to the next operation. Sound pretty familiar, right? Let’s write it in Python first.

def foo(spam: list[str]) -> dict[str, int]:
    return functools.reduce(
        lambda counter, s:
            counter | {s: counter.get(s, 0) + 1},
        spam,
        {},
    )

The Rust functional implementation is slightly different, let’s see how it would hypothetically looks like in Python.

def foo(spam: list[str]) -> dict[str, int]:
    return spam.reduce(
        lambda counter, s:
            counter | {s: counter.get(s, 0) + 1},
        {},
    )

Of course, this will crash because str.reduce doesn’t exists, but it could make sense, right?

This str.reduce signature would looks like this:

def reduce(self: Iterable[_S],
           function: (_T, _S) -> _T,
           initial: _T) -> _T

Let’s alter it a little bit more (still won’t be functional code).

def foo(spam: list[str]) -> dict[str, int]:
    return iter(spam).fold(
        {},
        lambda counter, s:
            counter | {s: counter.get(s, 0) + 1},
    )

Now, we first convert the string is a more generic object, an iterator, that could hypothetically have a function named “fold” that would simply be an alternative reduce where the initial value is the first parameter instead of the last one.

The signature of this hypothetical Iterator.fold function would looks like this:

def fold(self: Iterator[_S],
         initial: _T,
         function: (_T, _S) -> _T) -> _T

You start to see where I’m going? Let’s look at the Rust implementation.

fn foo(spam: Vec<String>) -> HashMap<String, i32> {
    spam
        .into_iter()
        .fold(
            HashMap::new(),
            |mut counter, item| {
                counter
                    .entry(item)
                    .and_modify(|counter| *counter += 1)
                    .or_insert(1);
                counter
            },
        )
}

Yes, this behave exactly like our fictional python code. Easy, right? Maybe not. Let’s see.

  1. spam.into_iter() is roughly equivalent to iter(spam)
  2. .fold() is equivalent to the fictitious fold method we talked before.
  3. |mut counter, item| {...} is called a closure and this is roughly equivalent to our lambdas.

This article was reviewed and improved by following reviewers, thank you
Antony Bara, Maxim Danilov.

Leave a Reply

Discover more from The Way of Python

Subscribe now to keep reading and get access to the full archive.

Continue reading