Log converter drops trace attributes #199
Comments
Dear ppfeiff, the issue is that your case attributes are NaN. From version 2.1.3, we perform by default a post-processing step that searches for and removes NaN values from the streams, so attributes that are NaN simply disappear.
If you do not like this behavior, please set the parameter "stream_postprocessing" to False.
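For illustration, a minimal sketch of what that could look like in the conversion call; it reuses the converter API shown later in this thread, passes the parameter key as the plain string named above, and reads an illustrative input file, so treat it as a sketch rather than the exact recommended usage:

import pandas as pd
from pm4py.objects.conversion.log import converter as log_converter

# illustrative input file and separator; adapt to your own data
df = pd.read_csv("your_log.csv", sep=";")

parameters = {
    log_converter.Variants.TO_EVENT_LOG.value.Parameters.CASE_ID_KEY: "case",
    # keep NaN-valued attributes instead of stripping them from the stream
    "stream_postprocessing": False,
}
event_log = log_converter.apply(df, parameters=parameters,
                                variant=log_converter.Variants.TO_EVENT_LOG)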
Hi Alessandro, thanks for your quick reply. I understand that you want to get rid of NaN attributes. However, I find this behaviour weird for a couple of reasons:
Hej, yes, it's always going to be a compromise. Try to imagine loading an event log such as roadtraffic (570k events) from the XES file. Many attributes appear only on the first event of the trace. If you then convert that to a dataframe, and then back to a log, you get an event log filled with NaN on the events that are not the first of their trace, values that in principle should not be there. This situation is also very common when starting from CSV (only some rows have some columns populated, and the conversion to XES was producing logs with huge amounts of NaN). In your case, you should get the "case:travel_start" attribute when it's not empty. Ah, a small remark: it is assumed that in the dataframe all the events of the same case have the same value for the case attribute. If that is not the case, it is better to leave it as an event attribute in the conversion phase and move it from an "event" to a "trace" attribute as a subsequent post-processing step.
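As an illustration of that last remark, a rough sketch of such a post-processing step on an already converted log; the helper name promote_event_attribute is made up for this example, and it assumes the pm4py EventLog/Trace/Event interface where traces expose .attributes and events behave like dicts:

def promote_event_attribute(log, attr):
    # Copy the first non-missing value of an event attribute onto each trace,
    # so it can be treated as a case-level attribute afterwards.
    for trace in log:
        for event in trace:
            value = event.get(attr)
            if value is not None and value == value:  # 'value == value' is False for NaN
                trace.attributes[attr] = value
                break
    return log

event_log = promote_event_attribute(event_log, "travel_start")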
Hi, I get this point but I can't really agree. What you are saying is that you prefer data sparsity over having the full information about the observations in the log. In PM I need all the information. Deleting data to save some space is not what I would expect from these functionalities. In my specific dataset, there are 12.475 NaN rows in the .csv file but more than 70.781 non-NaN rows. It's your tool and your choices, but as an end user I am surprised that you say these 70.781 rows should not be there. (I can send you the log in case you want to check it out)
Guideline 11 in "Guidelines for Logging" in the PM book: "Do not remove events and ensure provenance. Reproducibility is key for process mining. For example, do not remove a student from the database after he dropped out since this may lead to misleading analysis results. Mark objects as not relevant (a so-called 'soft delete') rather than deleting them: concerts are not deleted—they are canceled; employees are not deleted—they are fired, etc."
Hi @ppfeiff, we discussed this internally, and we agree that this type of 'data stripping' should not be the default. We will create a hotfix release for this, i.e., 2.1.4.1, which we will release in a few minutes.
Hi,
we noted that in version 2.1.4 the behaviour of the log_converter has changed. It now drops the trace attributes (except the case id).
import os
import pandas as pd

# load the MobIS challenge log and mark the travel columns as case attributes
el_mobis_csv = pd.read_csv(os.path.join("MobIS", "mobis_challenge_log_2019.csv"), sep=";")
el_mobis_csv.rename(columns={"travel_start": "case:travel_start", "travel_end": "case:travel_end"}, inplace=True)
gives:
activity case start end type user case:travel_start case:travel_end cost
0 pay expenses 1 16.01.2017 13:29 16.01.2017 13:40 Accounting FI12 NaN NaN 167,52
1 pay expenses 5 16.01.2017 08:38 16.01.2017 08:48 Accounting JH2172 NaN NaN 262,11
2 calculate payments 6 04.01.2017 06:59 16.01.2017 09:40 Accounting WE5108 NaN NaN 413,14
3 pay expenses 6 06.02.2017 09:27 06.02.2017 09:36 Accounting WE5108 NaN NaN 413,14
4 send original documents to archive 7 01.01.2017 03:46 09.01.2017 06:23 Employee UL2786 NaN NaN NaN
... ... ... ... ... ... ... ... ... ...
83251 check if travel request needs preliminary pric... 7267 29.12.2017 18:41 29.12.2017 18:43 NaN NaN NaN NaN NaN
83252 decide on approval requirements 7267 29.12.2017 18:43 29.12.2017 18:44 NaN NaN NaN NaN NaN
83253 file travel request 7268 29.12.2017 19:52 29.12.2017 19:54 Employee RT4514 NaN NaN 458,31
83254 check if travel request needs preliminary pric... 7268 29.12.2017 19:54 29.12.2017 19:56 NaN NaN NaN NaN NaN
83255 decide on approval requirements 7268 29.12.2017 19:56 29.12.2017 19:56 NaN NaN NaN NaN NaN
After converting using
from pm4py.objects.conversion.log import converter as log_converter

parameters = {
    log_converter.Variants.TO_EVENT_LOG.value.Parameters.CASE_ID_KEY: 'case',
    log_converter.Variants.TO_EVENT_LOG.value.Parameters.CASE_ATTRIBUTE_PREFIX: 'case:'
}
event_log = log_converter.apply(el_mobis_csv, parameters=parameters, variant=log_converter.Variants.TO_EVENT_LOG)
the log changes to
[{'attributes': {'concept:name': 1},
  'events': [{'activity': 'pay expenses', 'case': 1, 'start': '16.01.2017 13:29', 'end': '16.01.2017 13:40', 'type': 'Accounting', 'user': 'FI12', 'cost': '167,52'}]},
 '....',
 {'attributes': {'concept:name': 7268},
  'events': [{'activity': 'file travel request', 'case': 7268, 'start': '29.12.2017 19:52', 'end': '29.12.2017 19:54', 'type': 'Employee', 'user': 'RT4514', 'cost': '458,31'},
             '..',
             {'activity': 'decide on approval requirements', 'case': 7268, 'start': '29.12.2017 19:56', 'end': '29.12.2017 19:56'}]}]
It drops the case:travel_start and case:travel_end columns.
In version 2.0.1.3 everything worked fine.
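One possible pre-conversion workaround is to make the case attribute uniform per case before converting (a rough sketch only, assuming each case has at most one non-missing value per case attribute, as discussed above; column names are the ones from the snippet at the top):

# Broadcast the first non-missing value of each case attribute to every row of
# that case, so the converter sees a consistent case-level value instead of NaN.
for col in ["case:travel_start", "case:travel_end"]:
    el_mobis_csv[col] = el_mobis_csv.groupby("case")[col].transform("first")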