variants_filter.filter_log_variants_percentage is not working as expected. #179

Closed
selenecodes opened this issue Oct 8, 2020 · 1 comment

Comments

@selenecodes
Contributor

When using the variants percentage filter (variants_filter.filter_log_variants_percentage), the size of the filtered log doesn't scale linearly with the percentage.
With my dataset, the following calls give these results:

from pm4py.algo.filtering.log.variants import variants_filter

variants_filter.filter_log_variants_percentage(log, percentage=1)  # Length is 4617
variants_filter.filter_log_variants_percentage(log, percentage=0.1)  # Length is still 4617
variants_filter.filter_log_variants_percentage(log, percentage=0.05)  # Length is 241
@fit-alessandro-berti
Contributor

Dear Selene Codes, the variants filter on percentage works as follows. Given a percentage P:

  • The variants of the log are found, along with their number of occurrences.
  • A number N is chosen such that taking all the variants with at least N occurrences includes a percentage of cases that is at least P, while choosing N+1 would include a percentage of cases that is below P.
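
The two steps above can be sketched in plain Python. This is a minimal illustration of the rule as described in this comment, not pm4py's actual implementation: the function names `min_occurrences_for_percentage` and `filter_by_variant_percentage`, and the representation of a log as a list with one variant string per case, are assumptions made for the example.

```python
from collections import Counter


def min_occurrences_for_percentage(case_variants, percentage):
    """Return the largest N such that the variants with at least N
    occurrences cover at least `percentage` of the cases.

    Hypothetical helper: `case_variants` holds one variant label per case.
    """
    if not case_variants:
        raise ValueError("the log contains no cases")
    counts = Counter(case_variants)  # step 1: variant -> number of occurrences
    total = len(case_variants)
    covered = 0
    # Step 2: walk the distinct occurrence counts from most to least frequent,
    # accumulating the cases covered so far.
    for n in sorted(set(counts.values()), reverse=True):
        covered += sum(c for c in counts.values() if c == n)
        if covered / total >= percentage:
            return n
    # Only reached if percentage > 1: fall back to keeping every variant.
    return min(counts.values())


def filter_by_variant_percentage(case_variants, percentage):
    """Keep only the cases whose variant occurs at least N times."""
    n = min_occurrences_for_percentage(case_variants, percentage)
    counts = Counter(case_variants)
    return [v for v in case_variants if counts[v] >= n]


# Made-up log with 10 cases: 6 x "XY", 3 x "XZ", 1 x "X".
toy_log = ["XY"] * 6 + ["XZ"] * 3 + ["X"]
for p in (0.5, 0.7, 0.95):
    print(p, len(filter_by_variant_percentage(toy_log, p)))
```

In this made-up log the result size moves in steps (6, 9, or 10 cases) rather than in proportion to the requested percentage, because whole variants are either kept or dropped.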

If the log contains the following variants:
ABC (2 occurrences)
A,B,C,AB,AC,BC,CB,BA,CA,ABCD,ABCE,ABCF,ABCG,ABCH,ABCI,ABCL,ABCM,ABCN (1 occurrence each)

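Then with percentage=1, all 20 cases are kept.

If we choose percentage=0.15, then N=1 includes all the cases, while N=2 includes only the cases of the first variant (2 of 20 cases, i.e. 10% of the log, which is below P, hence N=2 is not valid according to the above principle). So N=1 is chosen and all 20 cases are kept.

If we choose percentage=0.1, then N=2 includes exactly 10% of the cases of the log, which is the minimum requirement, so only the 2 cases of the first variant are kept.

The same effect explains the numbers you observed: the size of the result changes only when the percentage crosses one of these coverage thresholds, so it does not scale linearly.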
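
To check the arithmetic of this example, the coverage reached by each candidate N can be tabulated directly. The snippet below is only a sketch: it rebuilds the example log from the variant list above as a plain list of strings, and the names `cases` and `coverage` are illustrative, not part of pm4py.

```python
from collections import Counter

# Example log from the comment above: "ABC" occurs twice,
# the 18 other variants occur once each.
singletons = (
    "A,B,C,AB,AC,BC,CB,BA,CA,"
    "ABCD,ABCE,ABCF,ABCG,ABCH,ABCI,ABCL,ABCM,ABCN"
).split(",")
cases = ["ABC", "ABC"] + singletons

counts = Counter(cases)
total = len(cases)


def coverage(n):
    """Fraction of cases whose variant occurs at least n times."""
    return sum(c for c in counts.values() if c >= n) / total


# Print the coverage obtained for each candidate threshold N.
for n in (1, 2, 3):
    print(f"N={n}: coverage={coverage(n)}")
```

Coverage can only take the values printed here, so any two percentages that fall between the same pair of values select the same N and return the same filtered log.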

fit-alessandro-berti added a commit that referenced this issue Dec 11, 2023
…eir-visualization-in-the-performance-dfg' into 'integration'

[Priority 2] Support for the computation of sojourn times and their visualization in the performance DFG

Closes #179

See merge request process-mining/pm4py/pm4py-core!1163