Bad row factory example in the sqlite docs #110643

kfdf · 2023-10-10T16:30:44Z

The row factory example that returns a namedtuple in the sqlite docs creates a new namedtuple class for every row instance and as a result is really, really slow. A couple of orders of magnitude slower is too much even for an example, people might still use it and assume that one class per row doesn't affect performance that much. Introducing caching makes the example a bit less concise, but at least it becomes usable.

from collections import namedtuple
from functools import lru_cache

@lru_cache
def make_row_class(description):
    return namedtuple('Row', (col[0] for col in description))

def named_tuple_factory(cursor, row):
    return make_row_class(cursor.description)(*row)

The text was updated successfully, but these errors were encountered:

erlend-aasland · 2023-10-10T16:39:30Z

It was a deliberate decision to exclude the optimisation. Please read thoroughly through the discussions on gh-99507.

erlend-aasland · 2023-10-10T16:40:08Z

CC. @merwok, @CAM-Gerlach

kfdf · 2023-10-10T18:05:49Z

I understand this is the relevant excerpt.

Its also unclear how practically impactful the optimization is in most situations; on Python 3.10 on a old laptop, running collections.namedtuple('Row', [f'field{idx}' for idx in range(n_cols] with timeit, I get a time of 199 us per execution for 10 cols, or 1.23 ms for 100 cols. This certainly isn't completely trivial, if its call in a tight hot loop, but seems unlikely to be the main critical path constraint in most real-world situations.

By contrast, even with a best-case toy example with a :memory: db and using constant SQL statements with no actual db access, running con.execute("SELECT " + ', '.join([f'{idx} AS column{idx}' for idx in range(100)])).fetchone() simulating an equivalent db with 100 rows takes 140 us per execution with the default tuple. Presumably, any real use case that needed to iterate through a large db would be substantially more costly.

So the overhead of creating a tuple class is comparable to reading a single row. However, a query that reads a thousand rows is just 60 times slower (just tested it with a ten column table in a on-disk database). So the per row overhead skyrockets. And in a very realistic scenario of reading a few hundred rows we are looking at a tenth of a second per request of just creating named tuple classes. I feel it is quite impactful.

erlend-aasland · 2023-10-12T08:38:30Z

@kfdf, do you want to open a PR to propose a change?

kfdf · 2023-10-14T06:59:16Z

I really am not adept at Github, Python and proper doc writing (which is, judging from the referenced discussion, quite involved, certainly well above "anything is better than nothing" level), so no.

felixxm · 2024-01-11T10:32:23Z

Creating an optimized namedtuple-factory is not straightforward and can make docs less readable. For example, this is what we do in Django:

def unpickle_named_row(names, values):
    return create_namedtuple_class(*names)(*values)


@functools.lru_cache
def create_namedtuple_class(*names):
    # Cache type() with @lru_cache since it's too slow to be called for every
    # QuerySet evaluation.
    def __reduce__(self):
        return unpickle_named_row, (names, tuple(self))

    return type(
        "Row",
        (namedtuple("Row", names),),
        {"__reduce__": __reduce__, "__slots__": ()},
    )

create_namedtuple_class() could be used in Python docs:

def namedtuple_factory(cursor, row):
    fields = [column[0] for column in cursor.description]
    cls = create_namedtuple_class(fields)
    return cls._make(row)

However, this will make the docs less readable. Maybe we can add a warning that

"Creating a namedtuple class for each row may become slow when fetching a large number of rows. You can consider using a caching mechanism."

erlend-aasland · 2024-01-11T10:54:03Z

However, this will make the docs less readable. Maybe we can add a warning that

"Creating a namedtuple class for each row may become slow when fetching a large number of rows. You can consider using a caching mechanism."

Exactly; if you see the linked discussion, you'll find that we concluded similar: putting optimisations into the docs make the docs less readable, and also more unfocused (so, it would be to deviate from our Diatáxis northstar). A note that adds a hint on how to optimise things (similar to your suggestion) is better.

(@felixxm, I restored your post; sorry 'bout my GH clumsiness!)

felixxm · 2024-01-11T10:56:40Z

Do you think it's worth linking to the Django repository? Referencing functools.lru_cache may also work 🤔

erlend-aasland · 2024-01-11T11:00:14Z

Do you think it's worth linking to the Django repository?

Linking to Django could be an option, yes.

Referencing functools.lru_cache may also work 🤔

IIRC, I did that in one of my early revisions of the row factory docs, but my proposed text was rejected, since it became a distraction from the main goal of the how-to.

The main goal of the how-to is to teach how row factories can be created; the main goal is not to explain how to optimise your Python code ;)

felixxm · 2024-01-11T11:09:36Z

The main goal of the how-to is to teach how row factories can be created; the main goal is not to explain how to optimise your Python code ;)

I agree with it, we also follow the principles outlined in the Diátaxis framework. We can close this issue as invalid, I think ⚖️

merwok · 2024-01-11T15:17:07Z

I don’t see this as an optimization issue (adding lru_cache would be a fig leaf), but a maybe confusing example that leads people to better understanding. (I had a different comment drafted, but am rewriting it after understanding!)

The row factory is set on the connection, and is a function that gets information from a cursor to extract the list of fields, and then maps the values from a specific row.
At the time where the row factory is associated to the connection, it is not possible to create a namedtuple or data class once and have it reused, because it would not work for all requests on all cursors!
So we see that the row factory on the connection is a way to control the type of data structure for rows (dicts, Rows, named tuples…) but it’s not a specific data structure (with specific fields).

kfdf added the docs Documentation in the Doc dir label Oct 10, 2023

erlend-aasland added the topic-sqlite3 label Oct 10, 2023

erlend-aasland self-assigned this Oct 10, 2023

erlend-aasland added this to sqlite3 issues Oct 30, 2023

erlend-aasland moved this to TODO: Docs in sqlite3 issues Oct 30, 2023

erlend-aasland added the pending The issue will be closed if no feedback is provided label Jan 11, 2024

erlend-aasland closed this as not planned Won't fix, can't repro, duplicate, stale Jan 11, 2024

github-project-automation bot moved this from TODO: Docs to Done in sqlite3 issues Jan 11, 2024

erlend-aasland removed the pending The issue will be closed if no feedback is provided label Jan 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bad row factory example in the sqlite docs #110643

Bad row factory example in the sqlite docs #110643

kfdf commented Oct 10, 2023 •

edited

Loading

erlend-aasland commented Oct 10, 2023

erlend-aasland commented Oct 10, 2023

kfdf commented Oct 10, 2023 •

edited

Loading

erlend-aasland commented Oct 12, 2023

kfdf commented Oct 14, 2023

felixxm commented Jan 11, 2024 •

edited by erlend-aasland

Loading

erlend-aasland commented Jan 11, 2024

felixxm commented Jan 11, 2024

erlend-aasland commented Jan 11, 2024

felixxm commented Jan 11, 2024

merwok commented Jan 11, 2024

Bad row factory example in the sqlite docs #110643

Bad row factory example in the sqlite docs #110643

Comments

kfdf commented Oct 10, 2023 • edited Loading

erlend-aasland commented Oct 10, 2023

erlend-aasland commented Oct 10, 2023

kfdf commented Oct 10, 2023 • edited Loading

erlend-aasland commented Oct 12, 2023

kfdf commented Oct 14, 2023

felixxm commented Jan 11, 2024 • edited by erlend-aasland Loading

erlend-aasland commented Jan 11, 2024

felixxm commented Jan 11, 2024

erlend-aasland commented Jan 11, 2024

felixxm commented Jan 11, 2024

merwok commented Jan 11, 2024

kfdf commented Oct 10, 2023 •

edited

Loading

kfdf commented Oct 10, 2023 •

edited

Loading

felixxm commented Jan 11, 2024 •

edited by erlend-aasland

Loading