Review Workload pipeline CSV parser methods #5068

Closed
fdalmaup opened this issue Mar 4, 2024 · 6 comments

@fdalmaup
Member

fdalmaup commented Mar 4, 2024

Description

During wazuh/wazuh#22179, it was found that some of the thresholds set for the resource measurement comparisons were wrongly obtained due to misbehavior of the CSV Parser module (found in deps/wazuh_testing/wazuh_testing/tools/performance/csv_parser.py). The values from a Workload test run near the date the threshold was established, especially the ones for the File Descriptors, seem to be correct:

{
    "setup_phase": {
        "wazuh-clusterd": {
            "USS(KB)": {
                "workers": {
                    "mean": ("worker_23", 79302.4262295082),
                    "max": ("worker_16", 157676.0),
                    "reg_cof": ("worker_23", 887.0418113746235)
                },
                "master": {
                    "mean": ("master", 127014.2972972973),
                    "max": ("master", 267488.0),
                    "reg_cof": ("master", 596.5368609261227)
                }
            },
            "CPU(%)": {
                "workers": {
                    "mean": ("worker_14", 9.087704918032788),
                    "max": ("worker_14", 50.699999999999996),
                    "reg_cof": ("worker_14", 0.11926369947888614)
                },
                "master": {
                    "mean": ("master", 72.27432432432431),
                    "max": ("master", 179.8),
                    "reg_cof": ("master", 0.37989057404206)
                }
            },
            "FD": {
                "workers": {
                    "mean": ("worker_14", 63.122950819672134),
                    "max": ("worker_1", 65),
                    "reg_cof": ("worker_24", 0.04868202764976976)
                },
                "master": {
                    "mean": ("master", 98.82432432432432),
                    "max": ("master", 123),
                    "reg_cof": ("master", -0.31726124151819285)
                }
            }
        }
    },
    "stable_phase": {
        "wazuh-clusterd": {
            "USS(KB)": {
                "workers": {
                    "mean": ("worker_11", 116553.51381215469),
                    "max": ("worker_13", 205164.0),
                    "reg_cof": ("worker_11", 10.248787385086283)
                },
                "master": {
                    "mean": ("master", 141715.49247311827),
                    "max": ("master", 267820.0),
                    "reg_cof": ("master", -117.77925544357818)
                }
            },
            "CPU(%)": {
                "workers": {
                    "mean": ("worker_17", 10.236627906976745),
                    "max": ("worker_10", 28.4),
                    "reg_cof": ("worker_11", -0.0028954773944346317)
                },
                "master": {
                    "mean": ("master", 57.59978494623656),
                    "max": ("master", 118.4),
                    "reg_cof": ("master", -0.050952369585662786)
                }
            },
            "FD": {
                "workers": {
                    "mean": ("worker_17", 54.61046511627907),
                    "max": ("worker_1", 64),
                    "reg_cof": ("worker_11", -0.0159273770496572)
                },
                "master": {
                    "mean": ("master", 75.8752688172043),
                    "max": ("master", 121),
                    "reg_cof": ("master", -0.0680563048117365)
                }
            }
        }
    }
}

Values from the issue that established the latest threshold values:

{
    "setup_phase": {
        "wazuh-clusterd": {
            "USS(KB)": {
                "workers": {
                    "mean": ("worker_10", 85625.87368421053),
                    "max": ("worker_14", 170588.0),
                    "reg_cof": ("worker_10", 479.74241012653823)
                },
                "master": {
                    "mean": ("master", 196930.5723905724),
                    "max": ("master", 542408.0),
                    "reg_cof": ("master", 159.20029975063525)
                }
            },
            "CPU(%)": {
                "workers": {
                    "mean": ("worker_10", 7.013684210526315),
                    "max": ("worker_15", 37.4),
                    "reg_cof": ("worker_8", 0.03610734796000005)
                },
                "master": {
                    "mean": ("master", 46.47138047138047),
                    "max": ("master", 96.9),
                    "reg_cof": ("master", 0.04855447271554653)
                }
            },
            "FD": {
                "workers": {
                    "mean": ("worker_10", 63.28947368421053),
                    "max": ("worker_1", 65),
                    "reg_cof": ("worker_22", 0.01732982882868463)
                },
                "master": {
                    "mean": ("master", 80.21548821548822),
                    "max": ("master", 134),
                    "reg_cof": ("master", -0.19400746078598444)
                }
            }
        }
    },
    "stable_phase": {
        "wazuh-clusterd": {
            "USS(KB)": {
                "workers": {
                    "mean": ("worker_12", 136051.86324786325),
                    "max": ("worker_14", 183948.0),
                    "reg_cof": ("worker_14", 689.7257023353081)
                },
                "master": {
                    "mean": ("master", 140449.8947368421),
                    "max": ("master", 270984.0),
                    "reg_cof": ("master", 288.0156528220629)
                }
            },
            "CPU(%)": {
                "workers": {
                    "mean": ("worker_9", 12.6),
                    "max": ("worker_4", 34.8),
                    "reg_cof": ("worker_15", 0.06064032391874895)
                },
                "master": {
                    "mean": ("master", 48.4587044534413),
                    "max": ("master", 99.4),
                    "reg_cof": ("master", 0.09772510089603965)
                }
            },
            "FD": {
                "workers": {
                    "mean": ("worker_9", 61.92727272727273),
                    "max": ("worker_1", 64),
                    "reg_cof": ("worker_15", 0.28031911840670387)
                },
                "master": {
                    "mean": ("master", 52.33198380566802),
                    "max": ("master", 62),
                    "reg_cof": ("master", 0.09891076872111303)
                }
            }
        }
    }
}

Important

The mentioned artifacts will not return the expected results due to the recent changes applied to the csv_parser.py file in #4780. A previous version of the file should be used to replicate the behavior.

We should review the methods involved in the statistics calculations to avoid wrong results and therefore failed performance tests in the future.

@GGP1 GGP1 self-assigned this Mar 6, 2024
@GGP1
Member

GGP1 commented Mar 6, 2024

Update

I created a script to test the CSV parser functionality and generated the statistics for the following artifacts. The results seem to be correct; I will continue to investigate any issues I may be missing.

Script
import argparse

from wazuh_testing.tools.performance.csv_parser import ClusterCSVTasksParser, ClusterCSVResourcesParser

def get_script_arguments():
    parser = argparse.ArgumentParser(usage="%(prog)s [options]", description="Script to generate data visualizations",
                                     formatter_class=argparse.RawTextHelpFormatter)
    parser.add_argument('-a', '--artifacts_path', dest='artifacts_path', required=True, type=str,
                        action='store', help='Directory where the cluster CSVs can be found.')

    return parser.parse_args()

def main():
    options = get_script_arguments()

    tasks = ClusterCSVTasksParser(artifacts_path=options.artifacts_path)
    print(tasks.get_stats())

    resources = ClusterCSVResourcesParser(artifacts_path=options.artifacts_path)
    print(resources.get_stats())


if __name__ == '__main__':
    main()
Tasks
{
  "setup_phase": {
    "integrity_sync": {
      "time_spent(s)": {
        "workers": {
          "mean":
            ("worker_2", 0.6762659145850121),
          "max":
            ("worker_2", 12.629)
        },
        "master": {
          "mean":
            ("master", 0.6762659145850121),
          "max":
            ("master", 12.629)
        }
      }
    },
    "agent-info_sync": {
      "time_spent(s)": {
        "workers": {
          "mean":
            ("worker_4", 0.8725),
          "max":
            ("worker_22", 14.119)
        },
        "master": {
          "mean":
            ("master", 0.6845116607773851),
          "max":
            ("master", 12.945)
        }
      }
    },
    "integrity_check": {
      "time_spent(s)": {
        "workers": {
          "mean":
            ("worker_9", 0.42400746268656714),
          "max":
            ("worker_22", 14.419)
        },
        "master": {
          "mean":
            ("master", 0.06794084084084084),
          "max":
            ("master", 1.148)
        }
      }
    }
  },
  "stable_phase": {
    "agent-info_sync": {
      "time_spent(s)": {
        "workers": {
          "mean":
            ("worker_2", 0.14200000000000002),
          "max":
            ("worker_2", 0.251)
        },
        "master": {
          "mean":
            ("master", 0.14200000000000002),
          "max":
            ("master", 0.251)
        }
      }
    },
    "integrity_check": {
      "time_spent(s)": {
        "workers": {
          "mean":
            ("worker_22", 0.0131304347826087),
          "max":
            ("worker_8", 0.042)
        },
        "master": {
          "mean":
            ("master", 0.004652727272727274),
          "max":
            ("master", 0.022)
        }
      }
    }
  }
}
Resources
{
  "setup_phase": {
    "wazuh-clusterd": {
      "USS(KB)": {
        "workers": {
          "mean":
            ("worker_4", 53349.6),
          "max":
            ("worker_1", 80688.0),
          "reg_cof":
            ("worker_2", 179.27511093704192)
        },
        "master": {
          "mean":
            ("master", 62363.29770992367),
          "max":
            ("master", 95100.0),
          "reg_cof":
            ("master", 91.86956877118291)
        }
      },
      "CPU(%)": {
        "workers": {
          "mean":
            ("worker_13", 1.3464684014869888),
          "max":
            ("worker_6", 12.4),
          "reg_cof":
            ("worker_6", -0.0037710241657507556)
        },
        "master": {
          "mean":
            ("master", 33.14236641221375),
          "max":
            ("master", 143.5),
          "reg_cof":
            ("master", -0.18340651315418535)
        }
      },
      "FD": {
        "workers": {
          "mean":
            ("worker_2", 73.23255813953489),
          "max":
            ("worker_25", 76.0),
          "reg_cof":
            ("worker_2", -0.0072866796240246366)
        },
        "master": {
          "mean":
            ("master", 130.6030534351145),
          "max":
            ("master", 161.0),
          "reg_cof":
            ("master", -0.0478306111507564)
        }
      }
    }
  },
  "stable_phase": {
    "wazuh-clusterd": {
      "USS(KB)": {
        "workers": {
          "mean":
            ("worker_2", 74357.9512195122),
          "max":
            ("worker_2", 75844.0),
          "reg_cof":
            ("worker_2", 208.56585365853752)
        },
        "master": {
          "mean":
            ("master", 62881.46341463415),
          "max":
            ("master", 62940.0),
          "reg_cof":
            ("master", 3.854355400696879)
        }
      },
      "CPU(%)": {
        "workers": {
          "mean":
            ("worker_2", 0.8097560975609757),
          "max":
            ("worker_2", 16.0),
          "reg_cof":
            ("worker_14", 0.001393728222996515)
        },
        "master": {
          "mean":
            ("master", 1.9365853658536583),
          "max":
            ("master", 8.2),
          "reg_cof":
            ("master", -0.042822299651567884)
        }
      },
      "FD": {
        "workers": {
          "mean":
            ("worker_25", 73.0),
          "max":
            ("worker_17", 73.0),
          "reg_cof":
            ("worker_23", 0.10480217456961556)
        },
        "master": {
          "mean":
            ("master", 133.0),
          "max":
            ("master", 133.0),
          "reg_cof":
            ("master", 4.1492513876640575e-16)
        }
      }
    }
  }
}

@GGP1
Member

GGP1 commented Mar 7, 2024

Update

I tested the script against the artifacts that caused this issue to be opened and found that the values are lower because the timestamps of the parent process (wazuh-clusterd) and the child processes (wazuh-clusterd_child1, wazuh-clusterd_child2) do not match.

The files have 313, 314 and 314 lines, whereas the concatenated one has 560 (there are 246 timestamp mismatches). As a result, the mean and max values are not accurate: for the mismatched timestamps, the sum was computed from only one value (the parent process) or, in other cases, two values (the child processes) instead of all three.
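The effect can be reproduced with a minimal sketch. It assumes that each process file reduces to (timestamp, FD) pairs and that values sharing the exact same timestamp string are summed; the helper `sum_by_timestamp` and the sample values are hypothetical and not taken from csv_parser.py.

```python
from collections import defaultdict


def sum_by_timestamp(*series):
    """Sum the values of several (timestamp, value) series per literal timestamp."""
    totals = defaultdict(int)
    for rows in series:
        for timestamp, value in rows:
            totals[timestamp] += value
    # Sort by timestamp so the output follows chronological order.
    return [totals[ts] for ts in sorted(totals)]


# Hypothetical samples: the parent drifts by one second on its second row.
parent = [("12:00:00", 57), ("12:00:06", 57), ("12:00:10", 57)]
child_1 = [("12:00:00", 29), ("12:00:05", 29), ("12:00:10", 29)]
child_2 = [("12:00:00", 31), ("12:00:05", 31), ("12:00:10", 31)]

fd = sum_by_timestamp(parent, child_1, child_2)
print(fd)                 # [117, 60, 57, 117]
print(sum(fd) / len(fd))  # 87.75, whereas every complete sample sums to 117
```

A single drifted row turns three expected samples into four, and both the extra rows carry partial sums that pull the mean down.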

Considering this and the fact that the tests performed on other artifacts were successful, I conclude that the problem lies in those specific artifacts having mismatched timestamps rather than in the script itself.

Artifacts: 50000A25W.zip

Before grouping, concatenation and sum

List
wazuh-clusterd
0      57
1      57
2      57
3      57
4      57
       ..
308    58
309    59
310    61
311    62
312    61
Name: FD, Length: 313, dtype: int64
wazuh-clusterd_child_1
0      29
1      29
2      29
3      29
4      29
       ..
309    29
310    29
311    29
312    30
313    29
Name: FD, Length: 314, dtype: int64
wazuh-clusterd_child_2
0      31
1      31
2      31
3      31
4      31
       ..
309    31
310    31
311    32
312    31
313    31
Name: FD, Length: 314, dtype: int64

After grouping, concatenation and sum

List
wazuh-clusterd
0      117 -> correct
1       57 -> incorrect: parent process value
2       60 -> incorrect: child processes values sum
3      117 -> correct
4      117 -> correct
      ... 
555     61 -> incorrect: child processes values sum 
556     62 -> incorrect: parent process value
557     61 -> incorrect: child processes values sum
558     61 -> incorrect: parent process value
559     60 -> incorrect: child processes values sum
Name: FD, Length: 560, dtype: int64
Full list
wazuh-clusterd
0      117
1       57
2       60
3      117
4      117
5      117
6      117
7      117
8      117
9      117
10     117
11     117
12     117
13     117
14     117
15     117
16     117
17     117
18     117
19     117
20     117
21     117
22     117
23     117
24     117
25     117
26     117
27     117
28     120
29     120
30     121
31     117
32     121
33     118
34     117
35     122
36     117
37     117
38     118
39     120
40     117
41     120
42     120
43     121
44     118
45     124
46     126
47     126
48     123
49     126
50     126
51     123
52     122
53     124
54     126
55     121
56     126
57     126
58     125
59     127
60     125
61     122
62     127
63     128
64     130
65     128
66     127
67     127
68      62
69      68
70      60
71      67
72      63
73      62
74      60
75      64
76      61
77      69
78      64
79      62
80      61
81      69
82      62
83      60
84      62
85      62
86      60
87      68
88      60
89      63
90      61
91      65
92      63
93      60
94      60
95      64
96      60
97      70
98      62
99      65
100     60
101     70
102     60
103     68
104     60
105     67
106     61
107     66
108     60
109     66
110    128
111    128
112    124
113    121
114    120
115    130
116    133
117    128
118    124
119    126
120    125
121    126
122    131
123    127
124    126
125    121
126    130
127    125
128    127
129    124
130    129
131     61
132     60
133     60
134     63
135     60
136     67
137     60
138     64
139     61
140     71
141     60
142     64
143     60
144     62
145     62
146     69
147     60
148     64
149     61
150     65
151     61
152     71
153     60
154     60
155     60
156     68
157     60
158     69
159     60
160     75
161     60
162     69
163     60
164     64
165     60
166     62
167     60
168     62
169     29
170    102
171     29
172    101
173    127
174    124
175    130
176    125
177    129
178    134
179    130
180    130
181    126
182    124
183    126
184     62
185     66
186     63
187     69
188     61
189     65
190     61
191     71
192     60
193     68
194     63
195     62
196     62
197     62
198     62
199     63
200     62
201     60
202     63
203     64
204     60
205     65
206     62
207     63
208     60
209     65
210     60
211     63
212     61
213     62
214     61
215     64
216     60
217     62
218     60
219     64
220     61
221     63
222     60
223     61
224     62
225     59
226     60
227     61
228     60
229     59
230     60
231     61
232     61
233     61
234     60
235     62
236     60
237     62
238     61
239     61
240     61
241     60
242     60
243     60
244     61
245     61
246     60
247     62
248     60
249     64
250     60
251     64
252     61
253     66
254     61
255     62
256     62
257     70
258     60
259     69
260     61
261     66
262     63
263     70
264     61
265     69
266     62
267     65
268     61
269     68
270     60
271     68
272     61
273     68
274     64
275     66
276     62
277     69
278     62
279     68
280     60
281     70
282     60
283     66
284     60
285     64
286     61
287     66
288     62
289     64
290     60
291     68
292     60
293     70
294     61
295     70
296     61
297     67
298     62
299     67
300     63
301     57
302     61
303     64
304     60
305     60
306     62
307     63
308     61
309     61
310     60
311     58
312     61
313     59
314     60
315     58
316     60
317     58
318     60
319     59
320     61
321     58
322     60
323     58
324     30
325     31
326     58
327     29
328     31
329     58
330     29
331     31
332     59
333     29
334     31
335     60
336     30
337     31
338     58
339     29
340     32
341     60
342     30
343     32
344     59
345     30
346     31
347     59
348     29
349     31
350     59
351     29
352     31
353     58
354     29
355     31
356     57
357     29
358     31
359     59
360     29
361     31
362     59
363     29
364     31
365     59
366     29
367     32
368     59
369     30
370     31
371     60
372     29
373     31
374     60
375     30
376     31
377     59
378     29
379     31
380     60
381     29
382     32
383     58
384     29
385     31
386     59
387     29
388     31
389     58
390     29
391     31
392     59
393     29
394     31
395     59
396     29
397     31
398     58
399     30
400     32
401     60
402     29
403     31
404     60
405     29
406     31
407     59
408     30
409     31
410     58
411     29
412     31
413     60
414     30
415     31
416     60
417     60
418     58
419     60
420     58
421     60
422     59
423     61
424     58
425     60
426     59
427     61
428     58
429     60
430     60
431     60
432     59
433     61
434     60
435     61
436     60
437     61
438     61
439     60
440     60
441     61
442     60
443     60
444     58
445     60
446     60
447     60
448     60
449     62
450     59
451     60
452     59
453     60
454     59
455     60
456     60
457     61
458     59
459     60
460     60
461     61
462     59
463     62
464     61
465     61
466     59
467     61
468     59
469     61
470     58
471     60
472     58
473     61
474     59
475     60
476     60
477     60
478     60
479     61
480     59
481     61
482     60
483     60
484     59
485     61
486     60
487     61
488     59
489     60
490     59
491     60
492     58
493     60
494     60
495     61
496     60
497     61
498     59
499     60
500     59
501     60
502     58
503     60
504     61
505     61
506     59
507     60
508     59
509     61
510     60
511     60
512     60
513     61
514     59
515     60
516     60
517     61
518     59
519     61
520     59
521     60
522     59
523     61
524     58
525     60
526     60
527     60
528     59
529     61
530     60
531     60
532     59
533     61
534     58
535     61
536     62
537     61
538     60
539     60
540     59
541     60
542     60
543     61
544     58
545     62
546     59
547     61
548     59
549     60
550     58
551     60
552     59
553     60
554     61
555     61
556     62
557     61
558     61
559     60

Timestamp differences

[Image: timestamps]

@fdalmaup
Member Author

fdalmaup commented Mar 11, 2024

Review

The analysis was carried out correctly, taking a deeper look at the data that caused the misbehavior. The CSV parser methods group by the literal Timestamp value, so if the timestamps differ even slightly (say, by a second), data that should have been grouped together is treated as separate values.
The only remaining doubt is whether we should take preventive measures to avoid wrongly measured values in the future, or accept the behavior and leave it to whoever interprets the test results to analyze the values. It is important to mention that, if this behavior happens again, we might not notice that a value has exceeded a threshold in the test_cluster_performance tests.
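One possible preventive measure, sketched here only as an illustration, is to flag timestamps that are not shared by every process before computing any statistics. The function `find_incomplete_timestamps` and the sample data are hypothetical; the process names follow the file names shown earlier in this thread.

```python
from collections import Counter


def find_incomplete_timestamps(series_by_process):
    """Return the timestamps that are not present in every process series."""
    counts = Counter()
    for timestamps in series_by_process.values():
        # A set avoids counting a timestamp twice for the same process.
        counts.update(set(timestamps))
    expected = len(series_by_process)
    return sorted(ts for ts, seen in counts.items() if seen != expected)


samples = {
    "wazuh-clusterd": ["12:00:00", "12:00:06", "12:00:10"],
    "wazuh-clusterd_child_1": ["12:00:00", "12:00:05", "12:00:10"],
    "wazuh-clusterd_child_2": ["12:00:00", "12:00:05", "12:00:10"],
}

incomplete = find_incomplete_timestamps(samples)
print(incomplete)  # ['12:00:05', '12:00:06']
```

Whether a non-empty result should fail the test or only emit a warning is a policy decision; rows that are missing for legitimate reasons, such as a process restart, would be flagged too.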

@GGP1
Member

GGP1 commented Mar 12, 2024

Update

A potential solution could be to group the values by timestamp ranges instead of by a single value. However, there is no guarantee that every range would contain the same number of values, and this could alter the results.
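A minimal sketch of the range-based grouping, together with its caveat, could look like the following. It assumes 'HH:MM:SS' timestamps and a fixed 5-second bucket width; `to_seconds`, `sum_by_bucket` and the sample values are hypothetical names and data, not part of the current parser.

```python
from collections import defaultdict


def to_seconds(clock):
    """Convert an 'HH:MM:SS' string to seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def sum_by_bucket(series, width_s=5):
    """Sum values per time bucket and count the samples that fell into each one."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for rows in series:
        for clock, value in rows:
            bucket = to_seconds(clock) // width_s
            totals[bucket] += value
            counts[bucket] += 1
    return [(totals[b], counts[b]) for b in sorted(totals)]


child_1 = [("12:00:00", 29), ("12:00:05", 29), ("12:00:10", 29)]
child_2 = [("12:00:00", 31), ("12:00:05", 31), ("12:00:10", 31)]

# The parent is one second late: the sample stays inside the same bucket.
late = [("12:00:00", 57), ("12:00:06", 57), ("12:00:10", 57)]
print(sum_by_bucket([late, child_1, child_2]))
# [(117, 3), (117, 3), (117, 3)]

# The parent is one second early: the sample crosses a bucket boundary.
early = [("12:00:00", 57), ("12:00:04", 57), ("12:00:10", 57)]
print(sum_by_bucket([early, child_1, child_2]))
# [(174, 4), (60, 2), (117, 3)]
```

The second call shows the caveat: buckets end up with 4 and 2 samples, so the sums are distorted again. Returning the sample count next to each sum at least makes such buckets detectable.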

The only value that the files share and that distinguishes each of the rows is the timestamp, so if the solution mentioned above is not useful, the way in which execution times are evaluated would have to change, potentially dropping use of this script altogether.

Considering this and that the bug occurred in version 4.3.0 and has not occurred since, I think it would be appropriate to investigate a solution in a separate issue with a different priority.

@fdalmaup
Member Author

Review

I agree with the last update: the alternative solution should be handled in a separate issue, and we must keep the current behavior in mind when reviewing the output metrics.

LGTM!

@Selutario
Contributor

Final review

Good analysis @GGP1.

I have also taken a look at recent artifacts, and as you say, it seems that it is not common to find different timestamps. What I have seen is that the number of rows is different sometimes, probably due to the fact that the API Performance tests restart the master and then the entire cluster. These cases do not seem worrying since the existing rows do maintain the same timestamp.

The only thing that could cause the timestamps to be different, as in the artifacts you mentioned here, is if some of the threads in the script take longer than the rest to obtain metrics. It only needs to happen once for the timestamps to no longer match:

while not self.event.is_set():
    data = dict()
    try:
        data = self.get_process_info(self.proc)
    except Exception as e:
        logger.error(f'Exception with {self.process_name} | {e}')
        print(e.with_traceback())
    finally:
        self._write_csv(data)
    sleep(self.time_step)

A possible alternative would be to run the loop in UTC time blocks instead of every n seconds. For example, every 5 seconds:

  • 15:51:15
  • 15:51:20
  • 15:51:25
  • 15:51:30

In other words, instead of always sleeping 5 seconds, sleep as long as necessary until the next expected moment.
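The idea can be sketched as follows. This is only an illustration under the assumption that the loop can compute its own sleep time; `seconds_until_next_block` and `sleep_until_next_block` are hypothetical helpers, not part of the monitoring script.

```python
import math
import time


def seconds_until_next_block(now, step):
    """Return how long to sleep so the next sample lands on a multiple of step."""
    next_block = (math.floor(now / step) + 1) * step
    return next_block - now


def sleep_until_next_block(step):
    """Sleep until the next block boundary, based on the epoch clock (UTC)."""
    time.sleep(seconds_until_next_block(time.time(), step))


# 15:51:17.5 expressed as seconds since midnight; the next 5-second block is 15:51:20.
now = 15 * 3600 + 51 * 60 + 17.5
print(seconds_until_next_block(now, 5))  # 2.5
```

With this approach a slow iteration shortens the following sleep instead of shifting every later sample. The timestamp written to the CSV would probably also need to be the block boundary rather than the moment the metrics were read, so that all threads log the same value.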

I agree with opening a new issue to study possible fixes for this. The priority will depend on how often we encounter these problems in the future, but it doesn't seem to be common behavior.
