
[improvements] Evaluation of existing approaches and optimizations #12

Closed

inishchith opened this issue Jun 24, 2019 · 14 comments

Labels: coding-period-three (task completed during coding period #3), enhancement (New feature or request), ready (work completed, minor meta related work left)

Comments

inishchith (Owner) commented Jun 24, 2019

This thread is for discussion related to the evaluation of the existing implementations and their optimizations.

/cc: We also have another issue ticket @ chaoss/grimoirelab-graal#18

inishchith added the enhancement (New feature or request) and coding-period-three (task completed during coding period #3) labels on Jun 24, 2019
inishchith (Owner, Author) commented:

(A possible solution, suggested by @valeriocos, to avoid executing lizard at the repository level)

The idea is to save a file containing the output of lizard at the end of each execution. The process could be as follows:

  • If the file is not found, lizard is executed over the whole repository.
  • If the file is found, lizard is executed only on the files in the commit(s), and the process takes care of updating those files' info in the log file.
    Thus, the next execution won't need to run lizard over the whole repository.

(more information about current implementation can be found @ chaoss/grimoirelab-graal#39)
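The cached-log process above can be sketched as follows. This is a minimal illustration, not Graal's actual implementation: `measure_file` is a hypothetical stand-in for a real lizard call (e.g. `lizard.analyze_file`), and the cache is a plain JSON file keyed by path.

```python
import json
import os

def measure_file(path):
    # Hypothetical stand-in for a real lizard invocation such as
    # lizard.analyze_file(path); returns per-file metrics.
    return {"loc": len(path) * 10}  # dummy metric, for illustration only

def analyze(repo_files, changed_files, cache_path="lizard_cache.json"):
    """Run analysis incrementally, per the proposal above."""
    if os.path.exists(cache_path):
        # Cache hit: analyze only the files touched by the commit(s).
        with open(cache_path) as f:
            cache = json.load(f)
        targets = changed_files
    else:
        # First run: analyze the whole repository.
        cache = {}
        targets = repo_files
    for path in targets:
        cache[path] = measure_file(path)
    # Persist the updated log file for the next execution.
    with open(cache_path, "w") as f:
        json.dump(cache, f)
    return cache
```

On the second and subsequent runs only the changed files are re-measured, while the cache still holds up-to-date entries for every file in the repository.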

inishchith commented Jul 2, 2019

CoCom Integration

(currently uses repository-level implementation)

  • [integration] Add support of Graal's CoCom Backend to ELK | Issue Ticket
  • [visualization] Dashboard for CoCom backend | Issue Ticket
  • [graal-x-elk] Add support of Graal's CoCom Backend to ELK | PR

inishchith commented Jul 12, 2019

CoLic Integration

(currently uses commit-level implementation)

  • [integration] Add support of Graal's CoLic Backend to ELK | Issue Ticket
  • [visualization] Dashboard for CoLic backend | Issue Ticket
  • [graal-x-elk] Add support of Graal's CoLic Backend to ELK | PR

inishchith commented Jul 14, 2019

Context

  • Earlier this week, while inspecting the work done so far on the Code Complexity integration with ELK, we noticed incorrect results on the (Metric) Line Chart: for multiple repositories, each point should be the sum of a metric (say LOC), which wasn't the case.

  • This triggered a discussion about re-iterating over the implementation and what improvements could be made to solve the issue.

  • We had used repository-level analysis (as it provides data for the entire repository) as the key data for the CoCom dashboard, which gave us good results. But it was a huge issue in terms of execution time, which was long even for small repositories, for obvious reasons.

Evaluation

  • @valeriocos and I discussed the above issue and decided to conduct a study on the commit-level results from Graal. (A study lets us access all the data in an enriched/raw index, manipulate it as convenient, and then insert it into another index, which we can use to produce more visualizations.) /working-branch

  • EVALUATION RESULTS

Time

| Repository                    | Number of Commits | File Level | Repository Level | File-Level + Study |
|-------------------------------|-------------------|------------|------------------|--------------------|
| chaoss/grimoirelab-kingarthur | 208               | 3:12 min   | 34:23 min        | 5:06 min           |
| chaoss/grimoirelab-graal      | 171               | 2:22 min   | 28:35 min        | 3:58 min           |

Memory

  • Repository Level: (number of files) * (number of commits)
  • File-Level + Study (non-incremental): sum over commits of (number of files affected in each commit)
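The two memory bounds above can be checked with a toy computation. The per-commit file lists below are made up for illustration, not real Graal data.

```python
# Made-up per-commit lists of touched files.
commits = [["a.py", "b.py"], ["a.py"], ["c.py", "a.py", "b.py"]]
all_files = {f for commit in commits for f in commit}

# Repository level: every file is re-analyzed at every commit.
repository_level_items = len(all_files) * len(commits)      # 3 * 3 = 9

# File-level + study (non-incremental): only the touched files per commit.
file_level_items = sum(len(commit) for commit in commits)   # 2 + 1 + 3 = 6

print(repository_level_items, file_level_items)  # prints "9 6"
```

With real numbers (~200 files, ~180 commits) the same arithmetic yields the 36,000 vs. a-few-hundred gap discussed later in this thread.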

/cc @valeriocos @jgbarah @aswanipranjal

valeriocos commented:

Thank you for the summary @inishchith :)

inishchith commented:

@valeriocos Please have a look at /working-branch whenever convenient, and let me know whether the approach can be improved, so that we can proceed further with a focus on its incremental implementation.

valeriocos commented Jul 14, 2019

Thank you @inishchith for sharing your work. I had a look at the approach you pointed to above, and it looks good as an initial implementation. One potential issue we could run into is the size of the cache_dict, which may consume a lot of memory for a large number of repos and/or files. Another possible issue is the aggregation of cocom data from different repos with the current approach: if I understood the code correctly, there isn't a common datetime value that allows summing up/merging the data of different repos.

We could try to explore approaches that leverage more complex queries [1] on the enriched index. That way, the study code would only have to upload the data obtained from the query to the study index, thus avoiding keeping a dictionary in memory. For instance, the query [2] that calculates the line chart cocom_project_wise_evolution_loc could be tweaked and executed for each origin, and the output could be something similar to the table below (where every row represents an item in the enriched index).

| study_datetime | origin               | file_path   | total LOC | total comments | total_comments_ratio | ... |
|----------------|----------------------|-------------|-----------|----------------|----------------------|-----|
| 2017-11-06     | https://.../graal    | graal.py    | 1,972     | 806            | ...                  | ... |
| 2017-11-06     | https://.../perceval | perceval.py | 2,472     | 926            | ...                  | ... |

WDYT?

[1] Please have a look at remove_commits and authors_min_max_dates to see how to issue queries to ElasticSearch in ELK.

[2] To find the query/request executed for cocom_project_wise_evolution_loc check the gif below:
(GIF: Peek 14-07-2019 20-03)
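The kind of aggregation query suggested here could look roughly like the sketch below, built as a plain Python dict. The field names (`grimoire_creation_date`, `origin`, `loc`) and the `calendar_interval` parameter (Elasticsearch ≥ 7) are assumptions, not necessarily the actual enriched-index mapping or query used by ELK.

```python
def build_loc_evolution_query(origin, interval="month"):
    """Bucket enriched cocom items per interval for one origin
    and sum LOC per bucket (hypothetical field names)."""
    return {
        "size": 0,  # we only want the aggregation buckets, not hits
        "query": {"term": {"origin": origin}},
        "aggs": {
            "evolution": {
                "date_histogram": {
                    "field": "grimoire_creation_date",
                    "calendar_interval": interval,
                },
                "aggs": {"total_loc": {"sum": {"field": "loc"}}},
            }
        },
    }

query = build_loc_evolution_query("https://github.com/chaoss/grimoirelab-graal")
```

Executed once per origin, each date bucket would become one row of the table above, with no in-memory dictionary kept between runs.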

inishchith commented:

@valeriocos Thanks for the response

If I understood correctly the code, there isn't a common datetime value that allows to sum up/merge the data of different repos.

Yes, correct. The problem is that we might not have data at a common datetime, which makes me doubt whether this can be tackled somehow (along with the incremental updates).
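One hedged way around the missing common datetime (an idea for illustration, not something proposed in this thread) is to carry each repository's last known value forward onto shared interval boundaries before summing:

```python
def align_and_sum(series_by_repo, boundaries):
    """For every shared boundary date, take each repo's most recent
    sample at or before that date and sum across repos."""
    totals = []
    for boundary in boundaries:
        total = 0
        for samples in series_by_repo.values():
            last = None
            for when, value in sorted(samples):
                if when <= boundary:
                    last = value  # keep the latest value up to the boundary
            if last is not None:
                total += last
        totals.append((boundary, total))
    return totals
```

Repos sampled at different commit dates then contribute a well-defined value to every shared point of the line chart.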

(About the memory issue in the current implementation (cache_dict) and using queries:)

Yes, I'm aware of the memory issue.
Thanks for introducing me to the idea of using a query; I wasn't sure whether this could be a possible implementation.
I'll try to explore this idea, make changes to the current enricher, and share some results soon :)

Thanks!

inishchith commented Jul 15, 2019

@valeriocos

  • The data we're trying to access here are the results of repository-level analysis, whereas our main purpose for using a study was to produce repository information from the commit-level data.
  • The cache_dict is an overhead (with space complexity of about number_of_repositories x number_of_files). As you rightly pointed out earlier, this would be an issue for large repositories.
    • Note: the cache_dict space complexity is much smaller than that of the repository-level enriched data. For instance, considering one item per file, the results for the Graal repository (no. of files: ~200, no. of commits: ~180) would be as follows:
      • Repository Level: 36,000 items
      • File-Level + Study: 200 items
    • But even a space complexity proportional to the number of files could be a performance issue.
  • I had thought about avoiding the extra memory (cache_dict), but couldn't find a better way to properly structure the data and insert it into the study index.

  • I later thought of using a similar approach, but for commit-level information. On reflection, it would require memory to keep track of what changes were made to a file and the resulting changes to the corresponding metrics.

  • Experimenting with the study index was very insightful, but I'll have to think of a better and more efficient approach to deal with the above problem, and I would appreciate suggestions for further steps :)

In case any point above isn't clear or I've missed something, please do let me know.
Thanks!

inishchith commented Jul 19, 2019

# About the File-level + Study (query-chunk) approach

  • We discussed the cache_dict approach in the last meeting: it was a good improvement over the repository-level approach (as per the comparison above), but memory was still a concern, so we decided to trade a bit more time for it and see whether there's another approach we could try.
  • The current approach completely gets rid of the extra memory (cache_dict); instead, we query the instance to retrieve the latest details of all files up to a given time and perform aggregation on them.
  • The time window is incremental and can be passed in terms of interval_months. The current implementation also supports incremental updates by checking the last insertion date in the study index.
  • Time complexity: (number_of_repositories x number_of_intervals x number_of_files)
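The interval loop described above can be sketched as follows. This is a simplified illustration, not the actual study code: `fetch_latest_files` and `write_study_item` are hypothetical stand-ins for the enriched-index query and the study-index insert.

```python
from datetime import datetime

def month_boundaries(start, end, interval_months=1):
    """Yield chunk boundary datetimes from start to end."""
    current = start
    while current < end:
        month = current.month - 1 + interval_months
        current = datetime(current.year + month // 12, month % 12 + 1, 1)
        yield min(current, end)

def run_study(fetch_latest_files, write_study_item, start, end,
              interval_months=1):
    # For each chunk boundary, query the latest version of every file
    # up to that time, aggregate, and write one item to the study index.
    for boundary in month_boundaries(start, end, interval_months):
        files = fetch_latest_files(until=boundary)
        item = {
            "study_datetime": boundary,
            "total_loc": sum(f["loc"] for f in files),
            "num_files": len(files),
        }
        write_study_item(item)
```

Incremental updates would then amount to setting `start` to the last `study_datetime` already present in the study index.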

# Updated Evaluation

| Repository             | Number of Commits | File Level | Repository Level | File-level + Study (cache_dict) | File-level + Study (query-chunk) |
|------------------------|-------------------|------------|------------------|---------------------------------|----------------------------------|
| grimoirelab-kingarthur | 208               | 3:12 min   | 34:23 min        | 5:06 min                        | 5:13 min                         |
| grimoirelab-graal      | 171               | 2:22 min   | 28:35 min        | 3:58 min                        | 4:08 min                         |

  • Chunk interval is set to 1 month

valeriocos commented Jul 19, 2019

Great results, thank you @inishchith for the evaluation!
If possible, could you try one or more large repos (10,000–100,000 commits), something like https://github.com/elastic/elasticsearch? (The idea is to see whether file-level + study (query-chunk) performs better on large repos than file-level + study (cache_dict).)

Thanks

inishchith commented:

@valeriocos Sure. I'll do it and update the thread :)

inishchith commented Jul 20, 2019

| Repository            | Number of Commits | File-level + Study (cache_dict) | File-level + Study (query-chunk) |
|-----------------------|-------------------|---------------------------------|----------------------------------|
| kennethreitz/requests | 5,646             | 1:43:28 hr                      | 1:56:27 hr                       |
| aio-libs/aiohttp      | 6,445             | 2:31:13 hr                      | 2:29:47 hr                       |
  • The readings are not directly comparable across repositories due to differences in the number of files

inishchith added the ready (work completed, minor meta related work left) label on Jul 27, 2019
inishchith added this to the 📈 Second Evaluation milestone on Aug 8, 2019
inishchith commented:

Closed in reference to #17.
