Log converter drops trace attributes #199

Closed
ppfeiff opened this issue Jan 20, 2021 · 7 comments

@ppfeiff

ppfeiff commented Jan 20, 2021

Hi,

We noticed that in version 2.1.4 the behaviour of the log_converter has changed: it now drops the trace attributes (except the case id).

import os
import pandas as pd

el_mobis_csv = pd.read_csv(os.path.join("MobIS", "mobis_challenge_log_2019.csv"), sep=";")
el_mobis_csv.rename(columns={"travel_start": "case:travel_start", "travel_end": "case:travel_end"}, inplace=True)

gives:
       activity                                           case  start             end               type        user    case:travel_start  case:travel_end  cost
0      pay expenses                                       1     16.01.2017 13:29  16.01.2017 13:40  Accounting  FI12    NaN                NaN              167,52
1      pay expenses                                       5     16.01.2017 08:38  16.01.2017 08:48  Accounting  JH2172  NaN                NaN              262,11
2      calculate payments                                 6     04.01.2017 06:59  16.01.2017 09:40  Accounting  WE5108  NaN                NaN              413,14
3      pay expenses                                       6     06.02.2017 09:27  06.02.2017 09:36  Accounting  WE5108  NaN                NaN              413,14
4      send original documents to archive                 7     01.01.2017 03:46  09.01.2017 06:23  Employee    UL2786  NaN                NaN              NaN
... ... ... ... ... ... ... ... ...
83251  check if travel request needs preliminary pric...  7267  29.12.2017 18:41  29.12.2017 18:43  NaN         NaN     NaN                NaN              NaN
83252  decide on approval requirements                    7267  29.12.2017 18:43  29.12.2017 18:44  NaN         NaN     NaN                NaN              NaN
83253  file travel request                                7268  29.12.2017 19:52  29.12.2017 19:54  Employee    RT4514  NaN                NaN              458,31
83254  check if travel request needs preliminary pric...  7268  29.12.2017 19:54  29.12.2017 19:56  NaN         NaN     NaN                NaN              NaN
83255  decide on approval requirements                    7268  29.12.2017 19:56  29.12.2017 19:56  NaN         NaN     NaN                NaN              NaN

After converting using

# usual pm4py import for the converter (assumed; not shown in the original post)
from pm4py.objects.conversion.log import converter as log_converter

parameters = {
    log_converter.Variants.TO_EVENT_LOG.value.Parameters.CASE_ID_KEY: 'case',
    log_converter.Variants.TO_EVENT_LOG.value.Parameters.CASE_ATTRIBUTE_PREFIX: 'case:',
}
event_log = log_converter.apply(el_mobis_csv, parameters=parameters, variant=log_converter.Variants.TO_EVENT_LOG)

the log changes to

[{'attributes': {'concept:name': 1}, 'events': [{'activity': 'pay expenses', 'case': 1, 'start': '16.01.2017 13:29', 'end': '16.01.2017 13:40', 'type': 'Accounting', 'user': 'FI12', 'cost': '167,52'}]}, '....', {'attributes': {'concept:name': 7268}, 'events': [{'activity': 'file travel request', 'case': 7268, 'start': '29.12.2017 19:52', 'end': '29.12.2017 19:54', 'type': 'Employee', 'user': 'RT4514', 'cost': '458,31'}, '..', {'activity': 'decide on approval requirements', 'case': 7268, 'start': '29.12.2017 19:56', 'end': '29.12.2017 19:56'}]}]

It drops the case:travel_start and case:travel_end columns.

In version 2.0.1.3 everything worked fine.

@fit-alessandro-berti
Contributor

Dear ppfeiff,

The issue is that your case attributes are NaN.

Since version 2.1.3, the converter runs a post-processing step by default that searches for NaN values in the event stream and removes them. Attributes whose value is NaN therefore simply disappear.

@fit-alessandro-berti
Contributor

If you do not want this behavior, please set the parameter "stream_postprocessing" to False.
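
As a rough illustration of what such a post-processing step does, here is a pure-Python sketch. It is not pm4py's actual implementation: the function `strip_nan`, its keyword argument, and the sample events are hypothetical and only mirror the parameter name mentioned above.

```python
import math


def is_nan(value):
    """Return True only for float NaN values."""
    return isinstance(value, float) and math.isnan(value)


def strip_nan(stream, stream_postprocessing=True):
    """Hypothetical stand-in for the converter's NaN-removal step.

    With stream_postprocessing=True, NaN-valued keys are dropped from
    every event; with False, events are copied unchanged.
    """
    if not stream_postprocessing:
        return [dict(event) for event in stream]
    return [
        {key: value for key, value in event.items() if not is_nan(value)}
        for event in stream
    ]


# Invented sample events, loosely shaped like the rows in this issue.
events = [
    {"activity": "pay expenses", "case": 1, "case:travel_start": float("nan")},
    {"activity": "file travel request", "case": 42, "case:travel_start": "01.02.2017 08:00"},
]

stripped = strip_nan(events)
kept = strip_nan(events, stream_postprocessing=False)

print(sorted(stripped[0]))  # the NaN-valued key is gone
print(sorted(kept[0]))      # the NaN-valued key is still present
```

With the step enabled, the first event loses its `case:travel_start` key entirely, while the second event keeps it because its value is not NaN.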

@ppfeiff
Author

ppfeiff commented Jan 21, 2021

Hi alessandro,

Thanks for your quick reply. I understand that you want to get rid of NaN attributes. However, I find this behaviour weird for a couple of reasons:

  1. There is an information loss if you delete attributes, even though they are NaN. Knowing that there is an attribute which is NaN all the time is something different from not knowing that this attribute exists.
  2. You have inconsistencies in your algorithms. Imagine you have two logs (A and B) from the same system with the same set of attributes. In A, there is a trace/case attribute that is NaN; in B it is not. Loading and converting A and B will result in different logs.
  3. In my case, not all values in travel_start are NaN. There are a lot of rows that have datetime values.
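
Point 2 can be made concrete with a small pure-Python sketch. The helper names and the two one-event logs are invented for illustration, and the NaN removal is a simplified stand-in for the converter's behaviour, not its real code.

```python
import math


def drop_nan_keys(event):
    """Simplified NaN removal: drop every key whose value is a float NaN."""
    return {
        key: value
        for key, value in event.items()
        if not (isinstance(value, float) and math.isnan(value))
    }


def attribute_names(log):
    """Collect the attribute names that survive NaN removal."""
    names = set()
    for event in log:
        names.update(drop_nan_keys(event))
    return names


# Two invented logs from the "same system": identical columns,
# but in log A the case attribute happens to be NaN.
log_a = [{"case": 1, "activity": "pay expenses", "case:travel_start": float("nan")}]
log_b = [{"case": 1, "activity": "pay expenses", "case:travel_start": "01.02.2017 08:00"}]

schema_a = attribute_names(log_a)
schema_b = attribute_names(log_b)

print(schema_b - schema_a)  # {'case:travel_start'}
```

Although both inputs have the same columns, the resulting attribute sets differ, so downstream code cannot rely on an attribute being present.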

@fit-alessandro-berti
Contributor

Hej,

Yes, it's always going to be a compromise. Imagine loading an event log such as roadtraffic (570k events) from the XES file. Many attributes appear only on the first event of the trace. If you convert that to a dataframe and then back to a log, you get an event log in which the events that are not the first of their trace are filled with NaN values that, in principle, should not be there. This situation is also very common when starting from CSV (only some rows have some columns populated, and the conversion to XES was producing logs with huge amounts of NaN).
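
The round trip described above can be sketched in plain Python, without pm4py or pandas. The events and attribute values are invented; the table is modelled as a list of rows whose missing cells are filled with NaN, which is the assumption behind this sketch.

```python
import math

# Invented events: only the first event of the trace carries "amount".
events = [
    {"case": "A1", "activity": "Create Fine", "amount": 35.0},
    {"case": "A1", "activity": "Send Fine"},
    {"case": "A1", "activity": "Payment"},
]

# Log -> table: the column set is the union of all attribute names,
# and a missing cell is represented as NaN.
columns = []
for event in events:
    for key in event:
        if key not in columns:
            columns.append(key)
table = [[event.get(col, math.nan) for col in columns] for event in events]

# Table -> log without any NaN handling: every event now carries every column.
round_tripped = [dict(zip(columns, row)) for row in table]

nan_cells = sum(
    1
    for event in round_tripped
    for value in event.values()
    if isinstance(value, float) and math.isnan(value)
)
print(nan_cells)  # 2
```

The two later events gain an `amount` key with a NaN value that was never in the original log, which is the effect the NaN removal was meant to undo.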

In your case, you should get the "case:travel_start" attribute when it's not empty. Ah, a small remark: it is assumed that in the dataframe all the events of the same case have the same value for the case attribute. If that is not the case, it is better to leave it as an event attribute in the conversion phase and to move it from "event" to "trace" attributes in a subsequent post-processing step.
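
A minimal sketch of such a subsequent step, in pure Python: attributes with the case prefix are lifted to trace level only when every event of the case carries the same value. The function `promote_case_attributes`, the output layout, and the sample data are hypothetical and do not reproduce pm4py's own conversion logic.

```python
from collections import defaultdict


def promote_case_attributes(events, case_key="case", prefix="case:"):
    """Group events by case and lift constant, prefixed attributes to the trace."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event[case_key]].append(dict(event))  # copy, keep input intact

    traces = []
    for case_id, case_events in grouped.items():
        attributes = {"concept:name": case_id}
        keys = {k for e in case_events for k in e if k.startswith(prefix)}
        for key in sorted(keys):
            values = [e[key] for e in case_events if key in e]
            constant = len(values) == len(case_events) and all(
                v == values[0] for v in values
            )
            if constant:
                # Same value on every event: safe to treat as a trace attribute.
                attributes[key[len(prefix):]] = values[0]
                for e in case_events:
                    del e[key]
        traces.append({"attributes": attributes, "events": case_events})
    return traces


# Invented data: case 1 is consistent, case 2 has conflicting values.
events = [
    {"case": 1, "activity": "file travel request", "case:travel_start": "01.02.2017"},
    {"case": 1, "activity": "pay expenses", "case:travel_start": "01.02.2017"},
    {"case": 2, "activity": "file travel request", "case:travel_start": "03.02.2017"},
    {"case": 2, "activity": "pay expenses", "case:travel_start": "04.02.2017"},
]

traces = promote_case_attributes(events)
print(traces[0]["attributes"])
print(traces[1]["attributes"])
```

For case 1 the value is promoted to the trace; for case 2 it stays on the events because the values disagree. Since NaN never compares equal to itself, an all-NaN column is also left on the events by this sketch.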

@ppfeiff
Author

ppfeiff commented Jan 21, 2021

Hi,

I get this point, but I can't really agree. What you are saying is that you prefer data sparsity over having the full information about the observations in the log. In PM I need all the information. Deleting data to save some space is not what I would expect when using these functionalities.

On my specific dataset, there are 12.475 NaN rows in the .csv file but more than 70.781 non-NaN rows.

It's your tool and your choice, but as an end user I am surprised that you say these 70.781 rows should not be there.

(I can send you the log in case you want to check it out)

@ppfeiff
Author

ppfeiff commented Jan 21, 2021

Guideline 11 in "Guidelines for Logging" in the PM book: Do not remove events and ensure provenance. Reproducibility is key for process mining. For example, do not remove a student from the database after he dropped out since this may lead to misleading analysis results. Mark objects as not relevant (a so-called “soft delete”) rather than deleting them: concerts are not deleted, they are canceled; employees are not deleted, they are fired, etc.

@fit-sebastiaan-van-zelst
Contributor

Hi @ppfeiff,

We discussed this internally, and we agree that this type of 'data stripping' should not be the default.
We will keep supporting it in case people would like to use it; however, it is no longer applied by default.

We will create a hotfix release for this, i.e., 2.1.4.1, which we will release in a few minutes.
