-
Notifications
You must be signed in to change notification settings - Fork 134
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
export task progress #958
export task progress #958
Conversation
onadata/libs/utils/export_builder.py
Outdated
meta = {'progress': additions} | ||
if total: | ||
meta.update({'total': total}) | ||
current_task.update_state(state='PENDING', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have any other started other than pending
? For example, started
? http://docs.celeryproject.org/en/latest/reference/celery.states.html#all-states
@@ -542,6 +565,7 @@ def write_row(row, csv_writer, fields): | |||
self.pre_process_row(child_row, section), | |||
csv_writer, fields) | |||
index += 1 | |||
track_task_progress(i, total_records) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better to make this updates in chunks/batches of 10 or 100 or X? Something configurable in settings. Could be an overkill when you are iterating over 10k or 100k records.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you could also only batch for large datasets, i.e. if it's < 10k, do not batch
onadata/libs/utils/export_tools.py
Outdated
else: | ||
records = query_data(xform, query=filter_query, start=start, end=end) | ||
total_records = xform.num_of_submissions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this take into account the filter query? An export could be using filters where the number of records is less than the total number of submissions.
62c2ac7
to
a704a0e
Compare
else: | ||
records = query_data(xform, query=filter_query, start=start, end=end) | ||
|
||
if filter_query: | ||
total_records = query_data(xform, query=filter_query, start=start, | ||
end=end, count=True)[0].get('count') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes a huge assumption that we will always receive a list of at least size one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, since count is True
, looks like this will be a list of length 1 unless the query results in a generator, in which case maybe you'd need to call len
on it? Is this guaranteed to be on line 15 of the apps.viewer.models.parsed_instance
function? Not related, but why does the condition on lines 10-11 of apps.viewer.models.parsed_instance
exist at all? this would put a lot of objects in memory and probably segfault on large datasets
@@ -3,14 +3,18 @@ | |||
PENDING = 0 | |||
SUCCESSFUL = 1 | |||
FAILED = 2 | |||
PROGRESS = 3 | |||
RETRY = 4 | |||
STARTED = 5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the difference between this and PROGRESS
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Between PENDING
? PENDING
is a unkown state, I have added PROGRESS
to indicate that the task has started and is processing.
Docs: http://docs.celeryproject.org/en/latest/reference/celery.states.html#all-states
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that makes sense, I am asking about the difference between, PROGRESS
and STARTED
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no big difference STARTED
means the job has started and PROGRESS
(custom task state) is how the task is progressing. STARTED
is fired when the task is started but is cleared and we require a custom state when updating the task meta field.
onadata/libs/utils/export_builder.py
Outdated
""" | ||
try: | ||
if additions % getattr(settings, 'EXPORT_TASK_PROGRESS_UPDATE_BATCH', | ||
100) == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd put this in a DEFAULT_UPDATE_BATCH
var, @ukanga what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree.
@@ -542,6 +565,7 @@ def write_row(row, csv_writer, fields): | |||
self.pre_process_row(child_row, section), | |||
csv_writer, fields) | |||
index += 1 | |||
track_task_progress(i, total_records) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you could also only batch for large datasets, i.e. if it's < 10k, do not batch
else: | ||
records = query_data(xform, query=filter_query, start=start, end=end) | ||
|
||
if filter_query: | ||
total_records = query_data(xform, query=filter_query, start=start, | ||
end=end, count=True)[0].get('count') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, since count is True
, looks like this will be a list of length 1 unless the query results in a generator, in which case maybe you'd need to call len
on it? Is this guaranteed to be on line 15 of the apps.viewer.models.parsed_instance
function? Not related, but why does the condition on lines 10-11 of apps.viewer.models.parsed_instance
exist at all? this would put a lot of objects in memory and probably segfault on large datasets
Signed-off-by: Dennis Wambua <[email protected]>
Signed-off-by: Dennis Wambua <[email protected]>
Signed-off-by: Dennis Wambua <[email protected]>
records when filter query supplied Signed-off-by: Dennis Wambua <[email protected]>
Signed-off-by: Dennis Wambua <[email protected]>
a704a0e
to
df1e41a
Compare
Is there a reason not to combine these into a single state constant?
|
I introduced Especially here
|
cool, got it
|
closes #951