Backfill process never completes during periods of high INSERT query rates #583

Open
santhosh-programmer opened this issue Jan 10, 2025 · 1 comment

@santhosh-programmer

Hello team,

For testing, I tried the example 14_add_reviews_table.json to create a reviews table, followed by 17_add_rating_column.json to add a rating column.

I initially inserted 50k dummy rows into this table.

Then, to test the alter-column operation, I started the 18_change_column_type.json migration and ran pgbench in parallel as follows:

pgbench -f insert_script.sql -c 1 -T 2147483647 postgres

insert_script.sql:

INSERT INTO public.reviews (username, product, review, rating)
SELECT
    'user_' || generate_series(1, 10000),
    'Product_' || generate_series(1, 10000),
    'This is a dummy review for Product_' || generate_series(1, 10000),
    (random() * 5)::integer::text;

Because the table receives a high rate of INSERT queries from pgbench, the backfill procedure never finishes. Although the table initially contained 50k rows, the backfill kept processing all the newer rows inserted by pgbench.

Since the trigger is created before the backfill starts, I think it is redundant to backfill the newer rows inserted after trigger creation, as those rows are already handled by the trigger. Below is a console screenshot where the backfill kept going past 6 million+ rows created by pgbench, effectively running forever.

[Screenshot: backfill progress output continuing past 6 million rows]

In a large-scale production environment we can expect a high volume of INSERT queries, because many connections send INSERTs in parallel. In that case, what should be done to avoid the infinite backfill problem? Should we block INSERT queries by manually acquiring a lock until the backfill completes? If so, doesn't that violate the concept of zero downtime?
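For reference, manually blocking INSERTs would look roughly like the sketch below (table name taken from the example above). An EXCLUSIVE table lock still allows plain SELECTs but conflicts with the ROW EXCLUSIVE lock taken by INSERT/UPDATE/DELETE, so every writer would stall for the entire backfill, which seems to be exactly the downtime the tool is meant to avoid:

BEGIN;
-- EXCLUSIVE conflicts with the locks taken by INSERT/UPDATE/DELETE,
-- so writers block here while plain SELECTs still go through.
LOCK TABLE public.reviews IN EXCLUSIVE MODE;

-- ... run the backfill batches here ...

COMMIT;  -- writers resume only after the whole backfill commits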

@andrew-farries
Collaborator

Thanks for opening this @santhosh-programmer 🙏

I've added the issue to the v1 milestone as it's something we need to address.

We need some way to prevent the backfill process from touching new rows added after the backfill starts. A couple of ideas off the top of my head:

  • Use transaction ids to ensure that only rows visible before the backfill started are considered for each batch
  • Have the backfill run entirely inside one transaction whose isolation level excludes rows committed by other transactions after it begins (see the sketch below)
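A minimal sketch of the second idea, not pgroll's actual implementation: it assumes the backfill works by "touching" rows in primary-key batches so the existing sync trigger rewrites them, that the reviews table's serial primary key is called id, and that :last_id is a placeholder for the batching cursor the tool would maintain.

BEGIN ISOLATION LEVEL REPEATABLE READ;
-- The snapshot taken by the first statement fixes the set of visible rows;
-- anything committed later (e.g. the pgbench INSERTs) is invisible to this
-- transaction and is handled by the trigger instead.

-- One batch; the tool repeats this with an advancing :last_id until it
-- returns no rows.
WITH batch AS (
    SELECT id
    FROM public.reviews
    WHERE id > :last_id
    ORDER BY id
    LIMIT 1000
)
UPDATE public.reviews AS r
SET id = r.id          -- no-op write that fires the sync trigger
FROM batch
WHERE r.id = batch.id
RETURNING r.id;

COMMIT;  -- once the loop sees an empty batch

The tradeoff is that a single long-running transaction keeps its snapshot open for the whole backfill and holds back VACUUM, so the transaction-id filtering in the first idea may be a better fit for very large tables.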

andrew-farries added this to the v1 milestone Jan 13, 2025