Suites with a large number of cycles can have high cpu usage #1751
Going through the logs for the first suite, it looks like a new event hit the suite every 2-5 seconds.
As discussed, maybe worth using Python's resource module to do some profiling. (Or just use the ...)
Will do.
Irritatingly, this is a problem that seems to build up with time. It's taken roughly 12 hours for the first example suite above to build up to a consistently high cpu usage on my machine.
Analysis is still ongoing, but these increasing numbers of "succeeded with unreported" entries in the logs are suggestive of some kind of unintended leak/backlog building up. Initially I don't see any of these, but they become regular once the suite's at that high cpu point.
Current diagnosis is that #1053 is to blame: the evaluation of the next cycle point is expensive after a large number of previous cycles. This is more likely to be seen in a rapidly cycling suite, and is more of an issue because the suite is carrying out those calculations nearly continuously, so what might normally be a periodic spike is seen as a continual load. Keeping open pending #1053 to see if that fixes this or just moves the problem elsewhere.
The underlying cause appears (as suspected) to be the number of cycle points into the suite; it's just that you're more likely to reach a stage where the issue is visible/noticeable in a simple rapidly cycling suite than in one of our more complex suites with clock triggers etc. See #1850 (comment) for a good way of reproducing the problem without needing a long-running suite to get to that point.
We know we want to do more profiling and @arjclark will do it at some point. |
[meeting]
Question for @benfitzpatrick - can we save a recent valid point in the db to avoid the initial hit on restart? |
I guess we already have a valid point in the form of the cycle point for the created task proxy on restart (or the min cycle point from those tasks of the same name in the task pool on restart)? Presumably with a bit of black magic we could re-use that point. I guess the only problem for that (or any caching) would be if the restart applies a change in graphing that means the existing sequence of dates used is no longer valid.
@hjoliver - do we still want to worry about the initial hit on a restart? Otherwise I think we're suitably performant at this point.
@arjclark - it depends on how big the hit is - have you got a handle on that? Maybe close this issue and put up a new one for the start-up fix. Your task-cycle-point-on-restart (or caching) idea above sounds good to me. Easy enough to check if the old point is still valid for the current sequence settings.
Haven't got a real handle on it, no. I'll take a crack at using the cached point once the monster #2157 pull is merged.
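For illustration, the caching idea above might look roughly like this minimal sketch. The helper and the `sequence.is_valid()` interface are hypothetical stand-ins, not cylc's actual API:

```python
def restart_start_point(sequence, initial_point, task_pool_points):
    """Choose a point to seed next-point iteration from on restart.

    `task_pool_points` are the cycle points of task proxies recovered
    from the run database; `sequence.is_valid(point)` is assumed to
    check whether a point still lies on the sequence under the current
    graphing.  Both are hypothetical stand-ins for the real objects.
    """
    if task_pool_points:
        candidate = min(task_pool_points)
        # Only reuse the cached point if it is still valid for the
        # current sequence settings - a restart may have changed the
        # graphing so that previously computed points no longer apply.
        if sequence.is_valid(candidate):
            return candidate
    # Otherwise fall back to the expensive walk from the initial point.
    return initial_point
```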
An example:

```
[scheduling]
    initial cycle point = 2018
    max active cycle points = 96
    [[dependencies]]
        [[[T00, T01, T02, T03, T04, T05, T06, T07, T08, T09, T10, T11, T12, T13, T14, T15, T16, T17, T18, T19, T20, T21, T22, T23]]]
        #[[[PT1H]]]
        #[[[T-00]]]
            graph = 't1 => t2'
[runtime]
    [[t1, t2]]
        script = true
```

This suite uses 100% CPU in my environment while running, spending all its time parsing cycle points and comparing them. (Replacing the explicit sequences with ...)
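That matches the diagnosis earlier in the thread: if the next point on a sequence is found by stepping forward from the sequence's start, the cost of each evaluation grows with the number of cycles already elapsed. A stand-in sketch of that linear-cost behaviour (plain `datetime`, not cylc's actual cycling code):

```python
from datetime import datetime, timedelta

def next_point_naive(start, interval, after):
    """Walk forward from the sequence start until we pass `after`.

    Illustrative only: the cost grows linearly with the number of
    cycles already elapsed, which is why a rapidly cycling suite
    degrades over time if such results are recomputed rather than
    cached.
    """
    point = start
    while point <= after:
        point += interval
    return point

# After a year of hourly cycling this is ~8760 iterations per call,
# and it is called repeatedly as the task pool churns.
print(next_point_naive(datetime(2018, 1, 1), timedelta(hours=1),
                       datetime(2019, 1, 1)))
```

Caching previously computed points (the memoization discussed further down) avoids re-walking the sequence on every evaluation.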
Ouch.
One underlying problem is that cycle point objects actually store the cycle point values as strings, so in order to compare 2 date-time cycle points, it has to parse the cycle point strings back into date-time objects.

The logic is even stranger for integer cycle points - which are stored as strings for some reason - and we seem to have a whole industry converting the strings back and forth to integers in order to do some basic integer arithmetic.

I think the code should be modified so we actually store date-time cycle points and intervals as sub-classes of the underlying isodatetime date-time and duration classes.

We can also optimise the comparison logic between 2 date-time cycle point objects. E.g. if 2 (non-truncated) time points are in the same time zone and have the same parse/dump format (which would normally be the case for cycle points in a suite), then it should be quickest to simply compare the two strings directly.
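A rough sketch of that suggested fast path (illustrative only, not cylc's or isodatetime's code; the dump format and helper name are assumptions): when two non-truncated point strings share the same dump format and the same time zone, ISO 8601 strings of that form sort lexically in chronological order, so they can be compared directly without parsing:

```python
from datetime import datetime

# Assumed common suite dump format, e.g. "20180101T0000Z" (UTC).
DUMP_FORMAT = "%Y%m%dT%H%MZ"

def compare_points(p1, p2):
    """Compare two cycle point strings; returns -1, 0 or 1.

    Fast path: same length and same "Z" time zone designator is taken
    (for this sketch) to imply the same dump format, so plain string
    comparison preserves chronological order.  The slow path - parsing
    back to date-time objects - is what every comparison currently pays.
    """
    if len(p1) == len(p2) and p1.endswith("Z") and p2.endswith("Z"):
        return (p1 > p2) - (p1 < p2)
    t1 = datetime.strptime(p1, DUMP_FORMAT)
    t2 = datetime.strptime(p2, DUMP_FORMAT)
    return (t1 > t2) - (t1 < t2)

assert compare_points("20180101T0000Z", "20180101T0100Z") == -1
```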
I feel a bit guilty about this because I think I helped write the many-cycle-point suite in question... intended to get around weekend problems for hourly cycling.
The relevant code is here: https://github.com/cylc/cylc/blob/01331bc/lib/cylc/cycling/iso8601.py#L104-L112

In terms of representing them as actual date-time objects - making the points based on the isodatetime point representation might be pretty slow, since there is deliberately no underlying representation of the datetimes as e.g. Unix time, days since 1900, etc. Parsing string values is not too slow, but going the other way requires dumping. TimePoints as-is are not immutable, so the dump value can't just be cached.

The core problem for that suite looks like the (2x) sorting and the point comparisons that go with it.

Profiling runs (suite stopped after ~60 seconds each time) were made on current cylc master and again with a 100,000 memoize limit. The popitem count reduces to 3000 for a 1,000,000 memoize limit, but that has little effect on speed. Basically, lifting the memoize limit in iso8601.py will help (also true, to a lesser extent, for the Sequence cached points limit).
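For context, the bounded memoization being discussed behaves roughly like the sketch below (not the actual cylc decorator): once the cache reaches the limit, an arbitrary entry is evicted with `popitem`, so a suite that keeps generating more distinct keys than the limit spends its time churning the cache instead of hitting it - hence raising the limit helps.

```python
def memoize(function, limit=10000):
    """Cache results of `function` by its arguments, up to `limit` entries.

    Illustrative sketch only.  When the cache is full, `popitem()`
    discards an arbitrary entry; with a limit that is too small for the
    number of distinct cycle points in play, almost every call both
    misses and evicts.
    """
    cache = {}

    def wrapper(*args):
        try:
            return cache[args]
        except KeyError:
            pass
        if len(cache) >= limit:
            cache.popitem()  # arbitrary eviction; churns when over-full
        result = function(*args)
        cache[args] = result
        return result

    return wrapper

@memoize
def point_parse(point_string):
    # Stand-in for an expensive ISO 8601 parse.
    return tuple(point_string.replace("T", " ").split())
```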
An attempt including isodatetime changes (which may not be optimal generally) was also tried, using this diff:

```diff
diff --git a/lib/cylc/cycling/iso8601.py b/lib/cylc/cycling/iso8601.py
index 49f2e808f..19b58465f 100755
--- a/lib/cylc/cycling/iso8601.py
+++ b/lib/cylc/cycling/iso8601.py
@@ -34,7 +34,7 @@ from parsec.validate import IllegalValueError
CYCLER_TYPE_ISO8601 = "iso8601"
CYCLER_TYPE_SORT_KEY_ISO8601 = "b"
-MEMOIZE_LIMIT = 10000
+MEMOIZE_LIMIT = 100000
DATE_TIME_FORMAT = "CCYYMMDDThhmm"
EXPANDED_DATE_TIME_FORMAT = "+XCCYYMMDDThhmm"
@@ -321,7 +321,7 @@ class ISO8601Sequence(SequenceBase):
TYPE = CYCLER_TYPE_ISO8601
TYPE_SORT_KEY = CYCLER_TYPE_SORT_KEY_ISO8601
- _MAX_CACHED_POINTS = 100
+ _MAX_CACHED_POINTS = 1000
__slots__ = ('dep_section', 'context_start_point', 'context_end_point',
'offset', '_cached_first_point_values',
diff --git a/lib/isodatetime/data.py b/lib/isodatetime/data.py
index 539c070f5..eaecb0d3a 100644
--- a/lib/isodatetime/data.py
+++ b/lib/isodatetime/data.py
@@ -619,6 +619,15 @@ class TimeZone(Duration):
return TimeZone(hours=self.hours, minutes=self.minutes,
unknown=self.unknown)
+ def __cmp__(self, other):
+ if not isinstance(other, TimeZone):
+ raise TypeError(
+ "Invalid type for comparison: " +
+ "'%s' should be TimeZone." %
+ type(other).__name__
+ )
+ return cmp((self.hours, self.minutes), (other.hours, other.minutes))
+
def __str__(self):
if self.unknown:
return ""
@@ -1301,8 +1310,6 @@ class TimePoint(object):
raise TypeError(
"Cannot compare truncated to non-truncated " +
"TimePoint: %s, %s" % (self, other))
- if self.get_props() == other.get_props():
- return 0
if self.truncated:
for attribute in self.DATA_ATTRIBUTES:
other_attr = getattr(other, attribute)
@@ -1310,8 +1317,9 @@ class TimePoint(object):
if other_attr != self_attr:
return cmp(self_attr, other_attr)
return 0
- other = other.copy()
- other.set_time_zone(self.get_time_zone())
+ if other.get_time_zone() != self.get_time_zone():
+ other = other.copy()
+ other.set_time_zone(self.get_time_zone())
if self.get_is_calendar_date():
my_date = self.get_calendar_date()
other_date = other.get_calendar_date()
```

I think this would need profiling against other suites, particularly the get_props commenting out. It does save quite a bit of time here though. Memory use might perhaps be an issue?
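For the profiling mentioned, one generic approach (not cylc-specific; `run_suite` below is just a placeholder for whatever drives the code under test) is to run under `cProfile` and sort by cumulative time to see whether the cycle point parse/compare calls still dominate:

```python
import cProfile
import pstats

def run_suite():
    # Placeholder: drive the scheduling/cycling code being profiled.
    pass

if __name__ == "__main__":
    cProfile.run("run_suite()", "suite.prof")
    stats = pstats.Stats("suite.prof")
    # The top entries by cumulative time show whether point parsing
    # and comparison are still where the time goes.
    stats.sort_stats("cumulative").print_stats(20)
```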
👍
That's certainly my recollection of why we ended up with the 10000 cap we did.
Another fun (hypothetical) use case for many cycle points would be an inside-out post-processing suite representing multiple incoming forecasts cycling on validity time, pushing model+cycle+leadtime into tasks under that cycle... wasteful but an interesting exercise! |
This has never been reliably reproduced or observed again in real suites. Let's say it will be solved by that giant #3304, and close this down.
Oddly, some quite simple suites which rapidly cycle can result in high cpu usage.
For example, this suite:
was regularly using 100% cpu on one of our servers (N.B. there is no runtime so this is all sleep 10 operations)
Similarly, this suite also results in high server cpu load:
I think these problems are a result of the rapid turnover of tasks, which means the cylc daemon is constantly doing something, but we should make sure we're not doing anything that makes things worse than they need to be.