
[improvements] Evaluation of existing approaches and optimizations #12

Closed

inishchith opened this issue Jun 24, 2019 · 14 comments

Labels: coding-period-three (task completed during coding period #3), enhancement (New feature or request), ready (work completed, minor meta related work left)

Comments

inishchith (Owner) commented Jun 24, 2019

This thread is for discussion related to the evaluation of the existing implementations and their optimizations.

/cc: We also have another issue ticket @ chaoss/grimoirelab-graal#18

inishchith added the enhancement (New feature or request) and coding-period-three (task completed during coding period #3) labels on Jun 24, 2019
inishchith (Owner, Author) commented:

(A possible solution, suggested by @valeriocos, to avoid executing lizard at the repository level)

The idea is to save a file containing the output of lizard at the end of each execution. The process could be as follows:

  • If the file is not found, lizard is executed over the whole repository.
  • If the file is found, lizard is executed only on the files in the commit(s), and the process takes care of updating those files' info in the log file.
    Thus, the next execution won't need to run lizard over the whole repository.

(more information about current implementation can be found @ chaoss/grimoirelab-graal#39)
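The cached-log process above can be sketched as follows. This is a minimal illustration, not Graal's actual implementation: `measure_file` is a hypothetical stand-in for a real lizard call (e.g. `lizard.analyze_file`), and the cache is a plain JSON file keyed by path.

```python
import json
import os

def measure_file(path):
    # Hypothetical stand-in for a real lizard invocation such as
    # lizard.analyze_file(path); returns per-file metrics.
    return {"loc": len(path) * 10}  # dummy metric, for illustration only

def analyze(repo_files, changed_files, cache_path="lizard_cache.json"):
    """Run analysis incrementally, per the proposal above."""
    if os.path.exists(cache_path):
        # Cache hit: analyze only the files touched by the commit(s).
        with open(cache_path) as f:
            cache = json.load(f)
        targets = changed_files
    else:
        # First run: analyze the whole repository.
        cache = {}
        targets = repo_files
    for path in targets:
        cache[path] = measure_file(path)
    # Persist the updated log file for the next execution.
    with open(cache_path, "w") as f:
        json.dump(cache, f)
    return cache
```

On the second and subsequent runs only the changed files are re-measured, while the cache still holds up-to-date entries for every file in the repository.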

inishchith commented Jul 2, 2019

CoCom Integration

(currently uses repository-level implementation)

  • [integration] Add support of Graal's CoCom Backend to ELK | Issue Ticket
  • [visualization] Dashboard for CoCom backend | Issue Ticket
  • [graal-x-elk] Add support of Graal's CoCom Backend to ELK | PR

inishchith commented Jul 12, 2019

CoLic Integration

(currently uses commit-level implementation)

  • [integration] Add support of Graal's CoLic Backend to ELK | Issue Ticket
  • [visualization] Dashboard for CoLic backend | Issue Ticket
  • [graal-x-elk] Add support of Graal's CoLic Backend to ELK | PR

inishchith commented Jul 14, 2019

Context

  • Earlier this week, while inspecting the work done so far on the Code Complexity integration with ELK, we noticed incorrect results on the (Metric) Line Chart: for multiple repositories, each point should be the sum of a metric (say LOC), which wasn't the case.

  • This triggered a discussion about re-iterating over the implementation and what improvements could be made to solve the issue.

  • We had used repository-level analysis (as it provides data for the entire repository) as the key data for the CoCom dashboard, which gave us good results. But it was a huge issue in terms of execution time, which was long even for small repositories, for obvious reasons.

Evaluation

  • @valeriocos and I discussed the above issue and decided to conduct a study on the commit-level results from Graal. (A study lets us access all the data in an enriched/raw index, manipulate it as convenient, and then insert it into another index, which we can use to produce more visualizations.) /working-branch

  • EVALUATION RESULTS

Time

| Repository                    | Number of Commits | File Level | Repository Level | File-Level + Study |
|-------------------------------|-------------------|------------|------------------|--------------------|
| chaoss/grimoirelab-kingarthur | 208               | 3:12 min   | 34:23 min        | 5:06 min           |
| chaoss/grimoirelab-graal      | 171               | 2:22 min   | 28:35 min        | 3:58 min           |

Memory

  • Repository Level: (number of files) * (number of commits)
  • File-Level + Study (non-incremental): sum over commits of (number of files affected in each commit)
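The two memory bounds above can be checked with a toy computation. The per-commit file lists below are made up for illustration, not real Graal data.

```python
# Made-up per-commit lists of touched files.
commits = [["a.py", "b.py"], ["a.py"], ["c.py", "a.py", "b.py"]]
all_files = {f for commit in commits for f in commit}

# Repository level: every file is re-analyzed at every commit.
repository_level_items = len(all_files) * len(commits)      # 3 * 3 = 9

# File-level + study (non-incremental): only the touched files per commit.
file_level_items = sum(len(commit) for commit in commits)   # 2 + 1 + 3 = 6

print(repository_level_items, file_level_items)  # prints "9 6"
```

With real numbers (~200 files, ~180 commits) the same arithmetic yields the 36,000 vs. a-few-hundred gap discussed later in this thread.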

/cc @valeriocos @jgbarah @aswanipranjal

valeriocos commented:

Thank you for the summary @inishchith :)

inishchith commented:

@valeriocos Please have a look at /working-branch whenever convenient, and let me know whether the approach can be improved, so that we can proceed further with a focus on its incremental implementation.

valeriocos commented Jul 14, 2019

Thank you @inishchith for sharing your work. I had a look at the approach you pointed to above, and it looks good as an initial implementation. One potential issue we could run into is the size of the cache_dict, which may consume a lot of memory for a large number of repos and/or files. Another possible issue is the aggregation of cocom data from different repos with the current approach: if I understood the code correctly, there isn't a common datetime value that allows summing up/merging the data of different repos.

We could try to explore approaches that leverage more complex queries [1] on the enriched index. That way, the study code would only have to upload the data obtained from the query to the study index, thus avoiding keeping a dictionary in memory. For instance, the query [2] that calculates the line chart cocom_project_wise_evolution_loc could be tweaked and executed for each origin, and the output could be something similar to the table below (where every row represents an item in the enriched index).

| study_datetime | origin               | file_path   | total LOC | total comments | total_comments_ratio | ... |
|----------------|----------------------|-------------|-----------|----------------|----------------------|-----|
| 2017-11-06     | https://.../graal    | graal.py    | 1,972     | 806            | ...                  | ... |
| 2017-11-06     | https://.../perceval | perceval.py | 2,472     | 926            | ...                  | ... |

WDYT?

[1] Please have a look at remove_commits and authors_min_max_dates to see how to issue queries to ElasticSearch in ELK.

[2] To find the query/request executed for cocom_project_wise_evolution_loc check the gif below:
(GIF: Peek 14-07-2019 20-03)
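The kind of aggregation query suggested here could look roughly like the sketch below, built as a plain Python dict. The field names (`grimoire_creation_date`, `origin`, `loc`) and the `calendar_interval` parameter (Elasticsearch ≥ 7) are assumptions, not necessarily the actual enriched-index mapping or query used by ELK.

```python
def build_loc_evolution_query(origin, interval="month"):
    """Bucket enriched cocom items per interval for one origin
    and sum LOC per bucket (hypothetical field names)."""
    return {
        "size": 0,  # we only want the aggregation buckets, not hits
        "query": {"term": {"origin": origin}},
        "aggs": {
            "evolution": {
                "date_histogram": {
                    "field": "grimoire_creation_date",
                    "calendar_interval": interval,
                },
                "aggs": {"total_loc": {"sum": {"field": "loc"}}},
            }
        },
    }

query = build_loc_evolution_query("https://github.com/chaoss/grimoirelab-graal")
```

Executed once per origin, each date bucket would become one row of the table above, with no in-memory dictionary kept between runs.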

inishchith commented:

@valeriocos Thanks for the response

If I understood correctly the code, there isn't a common datetime value that allows to sum up/merge the data of different repos.

Yes, correct. The problem is that we might not have data at a common datetime, which makes me doubt whether this can be tackled somehow (along with the incremental updates).
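One hedged way around the missing common datetime (an idea for illustration, not something proposed in this thread) is to carry each repository's last known value forward onto shared interval boundaries before summing:

```python
def align_and_sum(series_by_repo, boundaries):
    """For every shared boundary date, take each repo's most recent
    sample at or before that date and sum across repos."""
    totals = []
    for boundary in boundaries:
        total = 0
        for samples in series_by_repo.values():
            last = None
            for when, value in sorted(samples):
                if when <= boundary:
                    last = value  # keep the latest value up to the boundary
            if last is not None:
                total += last
        totals.append((boundary, total))
    return totals
```

Repos sampled at different commit dates then contribute a well-defined value to every shared point of the line chart.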

(About the memory issue in the current implementation (cache_dict) and using queries:)

Yes, I'm aware of the memory issue.
Thanks for introducing me to the idea of using a query; I wasn't sure whether this could be a possible implementation.
I'll try to explore this idea, make changes to the current enricher, and share some results soon :)

Thanks!

inishchith commented Jul 15, 2019

@valeriocos

  • The data we're trying to access here are the results of repository-level analysis, whereas our main purpose for using a study was to produce repository information from the commit-level data.
  • The cache_dict is an overhead (with space complexity of about number_of_repositories x number_of_files). As you rightly pointed out earlier, this would be an issue for large repositories.
    • Note: the cache_dict space complexity is much smaller than that of the repository-level enriched data. For instance, considering one item per file, the results for the Graal repository (no. of files: ~200, no. of commits: ~180) would be as follows:
      • Repository Level: 36,000 items
      • File-Level + Study: 200 items
    • But even a space complexity proportional to the number of files could be a performance issue.
  • I had thought about avoiding the extra memory (cache_dict), but couldn't find a better way to properly structure the data and insert it into the study index.

  • I later thought of using a similar approach, but for commit-level information. On reflection, it would require memory to keep track of what changes were made to a file and the resulting changes to the corresponding metrics.

  • Experimenting with the study index was very insightful, but I'll have to think of a better and more efficient approach to deal with the above problem, and I would appreciate suggestions for further steps :)

In case any point above isn't clear or I've missed something, please do let me know.
Thanks!

inishchith commented Jul 19, 2019

# About the File-level + Study (query-chunk) approach

  • We discussed the cache_dict approach in the last meeting: it was a good improvement over the repository-level approach (as per the comparison above), but memory was still a concern, so we decided to trade a bit more time for it and see whether there's another approach we could try.
  • The current approach completely gets rid of the extra memory (cache_dict); instead, we query the instance to retrieve the latest details of all files up to a given time and perform aggregation on them.
  • The time window is incremental and can be passed in terms of interval_months. The current implementation also supports incremental updates by checking the last insertion date in the study index.
  • Time complexity: (number_of_repositories x number_of_intervals x number_of_files)
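The interval loop described above can be sketched as follows. This is a simplified illustration, not the actual study code: `fetch_latest_files` and `write_study_item` are hypothetical stand-ins for the enriched-index query and the study-index insert.

```python
from datetime import datetime

def month_boundaries(start, end, interval_months=1):
    """Yield chunk boundary datetimes from start to end."""
    current = start
    while current < end:
        month = current.month - 1 + interval_months
        current = datetime(current.year + month // 12, month % 12 + 1, 1)
        yield min(current, end)

def run_study(fetch_latest_files, write_study_item, start, end,
              interval_months=1):
    # For each chunk boundary, query the latest version of every file
    # up to that time, aggregate, and write one item to the study index.
    for boundary in month_boundaries(start, end, interval_months):
        files = fetch_latest_files(until=boundary)
        item = {
            "study_datetime": boundary,
            "total_loc": sum(f["loc"] for f in files),
            "num_files": len(files),
        }
        write_study_item(item)
```

Incremental updates would then amount to setting `start` to the last `study_datetime` already present in the study index.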

# Updated Evaluation

| Repository             | Number of Commits | File Level | Repository Level | File-level + Study (cache_dict) | File-level + Study (query-chunk) |
|------------------------|-------------------|------------|------------------|---------------------------------|----------------------------------|
| grimoirelab-kingarthur | 208               | 3:12 min   | 34:23 min        | 5:06 min                        | 5:13 min                         |
| grimoirelab-graal      | 171               | 2:22 min   | 28:35 min        | 3:58 min                        | 4:08 min                         |

  • Chunk interval is set to 1 month

valeriocos commented Jul 19, 2019

Great results, thank you @inishchith for the evaluation!
If possible, could you try one or more large repos (10,000–100,000 commits), something like https://github.com/elastic/elasticsearch? (The idea is to see whether file-level + study (query-chunk) performs better on large repos than file-level + study (cache_dict).)

Thanks

inishchith commented:

@valeriocos Sure. I'll do it and update the thread :)

inishchith commented Jul 20, 2019

| Repository            | Number of Commits | File-level + Study (cache_dict) | File-level + Study (query-chunk) |
|-----------------------|-------------------|---------------------------------|----------------------------------|
| kennethreitz/requests | 5,646             | 1:43:28 hr                      | 1:56:27 hr                       |
| aio-libs/aiohttp      | 6,445             | 2:31:13 hr                      | 2:29:47 hr                       |
  • The readings are not directly comparable across repositories due to differences in the number of files

inishchith added the ready (work completed, minor meta related work left) label on Jul 27, 2019
inishchith added this to the 📈 Second Evaluation milestone on Aug 8, 2019
inishchith commented:

Closed in reference to #17.
