Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bad row factory example in the sqlite docs #110643

Closed
kfdf opened this issue Oct 10, 2023 · 11 comments
Closed

Bad row factory example in the sqlite docs #110643

kfdf opened this issue Oct 10, 2023 · 11 comments
Assignees
Labels
docs Documentation in the Doc dir topic-sqlite3

Comments

@kfdf
Copy link

kfdf commented Oct 10, 2023

The row factory example that returns a namedtuple in the sqlite docs creates a new namedtuple class for every row instance and as a result is really, really slow. A couple of orders of magnitude slower is too much even for an example, people might still use it and assume that one class per row doesn't affect performance that much. Introducing caching makes the example a bit less concise, but at least it becomes usable.

from collections import namedtuple
from functools import lru_cache

@lru_cache
def make_row_class(description):
    return namedtuple('Row', (col[0] for col in description))

def named_tuple_factory(cursor, row):
    return make_row_class(cursor.description)(*row)
@kfdf kfdf added the docs Documentation in the Doc dir label Oct 10, 2023
@erlend-aasland
Copy link
Contributor

It was a deliberate decision to exclude the optimisation. Please read thoroughly through the discussions on gh-99507.

@erlend-aasland
Copy link
Contributor

CC. @merwok, @CAM-Gerlach

@kfdf
Copy link
Author

kfdf commented Oct 10, 2023

I understand this is the relevant excerpt.

Its also unclear how practically impactful the optimization is in most situations; on Python 3.10 on a old laptop, running collections.namedtuple('Row', [f'field{idx}' for idx in range(n_cols] with timeit, I get a time of 199 us per execution for 10 cols, or 1.23 ms for 100 cols. This certainly isn't completely trivial, if its call in a tight hot loop, but seems unlikely to be the main critical path constraint in most real-world situations.

By contrast, even with a best-case toy example with a :memory: db and using constant SQL statements with no actual db access, running con.execute("SELECT " + ', '.join([f'{idx} AS column{idx}' for idx in range(100)])).fetchone() simulating an equivalent db with 100 rows takes 140 us per execution with the default tuple. Presumably, any real use case that needed to iterate through a large db would be substantially more costly.

So the overhead of creating a tuple class is comparable to reading a single row. However, a query that reads a thousand rows is just 60 times slower (just tested it with a ten column table in a on-disk database). So the per row overhead skyrockets. And in a very realistic scenario of reading a few hundred rows we are looking at a tenth of a second per request of just creating named tuple classes. I feel it is quite impactful.

@erlend-aasland
Copy link
Contributor

@kfdf, do you want to open a PR to propose a change?

@kfdf
Copy link
Author

kfdf commented Oct 14, 2023

I really am not adept at Github, Python and proper doc writing (which is, judging from the referenced discussion, quite involved, certainly well above "anything is better than nothing" level), so no.

@felixxm
Copy link
Contributor

felixxm commented Jan 11, 2024

Creating an optimized namedtuple-factory is not straightforward and can make docs less readable. For example, this is what we do in Django:

def unpickle_named_row(names, values):
    return create_namedtuple_class(*names)(*values)


@functools.lru_cache
def create_namedtuple_class(*names):
    # Cache type() with @lru_cache since it's too slow to be called for every
    # QuerySet evaluation.
    def __reduce__(self):
        return unpickle_named_row, (names, tuple(self))

    return type(
        "Row",
        (namedtuple("Row", names),),
        {"__reduce__": __reduce__, "__slots__": ()},
    )

create_namedtuple_class() could be used in Python docs:

def namedtuple_factory(cursor, row):
    fields = [column[0] for column in cursor.description]
    cls = create_namedtuple_class(fields)
    return cls._make(row)

However, this will make the docs less readable. Maybe we can add a warning that

"Creating a namedtuple class for each row may become slow when fetching a large number of rows. You can consider using a caching mechanism."

@erlend-aasland
Copy link
Contributor

However, this will make the docs less readable. Maybe we can add a warning that

"Creating a namedtuple class for each row may become slow when fetching a large number of rows. You can consider using a caching mechanism."

Exactly; if you see the linked discussion, you'll find that we concluded similar: putting optimisations into the docs make the docs less readable, and also more unfocused (so, it would be to deviate from our Diatáxis northstar). A note that adds a hint on how to optimise things (similar to your suggestion) is better.

(@felixxm, I restored your post; sorry 'bout my GH clumsiness!)

@felixxm
Copy link
Contributor

felixxm commented Jan 11, 2024

Do you think it's worth linking to the Django repository? Referencing functools.lru_cache may also work 🤔

@erlend-aasland
Copy link
Contributor

Do you think it's worth linking to the Django repository?

Linking to Django could be an option, yes.

Referencing functools.lru_cache may also work 🤔

IIRC, I did that in one of my early revisions of the row factory docs, but my proposed text was rejected, since it became a distraction from the main goal of the how-to.

The main goal of the how-to is to teach how row factories can be created; the main goal is not to explain how to optimise your Python code ;)

@felixxm
Copy link
Contributor

felixxm commented Jan 11, 2024

The main goal of the how-to is to teach how row factories can be created; the main goal is not to explain how to optimise your Python code ;)

I agree with it, we also follow the principles outlined in the Diátaxis framework. We can close this issue as invalid, I think ⚖️

@erlend-aasland erlend-aasland added the pending The issue will be closed if no feedback is provided label Jan 11, 2024
@erlend-aasland erlend-aasland closed this as not planned Won't fix, can't repro, duplicate, stale Jan 11, 2024
@github-project-automation github-project-automation bot moved this from TODO: Docs to Done in sqlite3 issues Jan 11, 2024
@erlend-aasland erlend-aasland removed the pending The issue will be closed if no feedback is provided label Jan 11, 2024
@merwok
Copy link
Member

merwok commented Jan 11, 2024

I don’t see this as an optimization issue (adding lru_cache would be a fig leaf), but a maybe confusing example that leads people to better understanding. (I had a different comment drafted, but am rewriting it after understanding!)

The row factory is set on the connection, and is a function that gets information from a cursor to extract the list of fields, and then maps the values from a specific row.
At the time where the row factory is associated to the connection, it is not possible to create a namedtuple or data class once and have it reused, because it would not work for all requests on all cursors!
So we see that the row factory on the connection is a way to control the type of data structure for rows (dicts, Rows, named tuples…) but it’s not a specific data structure (with specific fields).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs Documentation in the Doc dir topic-sqlite3
Projects
Status: Done
Development

No branches or pull requests

4 participants