Move sample data to a repository #62

JQFonseca · 2020-05-31T15:40:02Z

I installed DefDap on a different computer today and it took so long to download everything, primarily because the example data (which is needed) is relatively large. Could I suggest we move it to a repository like Zenodo, and then have a command to download it in the example notebook?
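
A download step of the kind suggested here could look roughly like the sketch below. It is only an illustration: `fetch_example_data` is a hypothetical helper, not part of DefDap, and the URL in the usage comment is a placeholder rather than a real Zenodo record. It uses only the standard library, skips the download if the file is already present, and optionally checks a SHA-256 digest.

```python
import hashlib
import urllib.request
from pathlib import Path


def fetch_example_data(url, dest, sha256=None):
    """Download `url` to `dest` unless it already exists.

    Hypothetical helper for illustration; not part of DefDap.
    If `sha256` is given, the file's digest is checked against it.
    """
    dest = Path(dest)
    if not dest.exists():
        # Create the cache directory on first use, then download.
        dest.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(dest))
    if sha256 is not None:
        digest = hashlib.sha256(dest.read_bytes()).hexdigest()
        if digest != sha256:
            raise ValueError(f"Checksum mismatch for {dest}")
    return dest


# Usage in a notebook cell (placeholder URL, replace with the real record):
# data_path = fetch_example_data(
#     "https://example.org/defdap-example-data.zip",
#     "example_data/defdap-example-data.zip",
# )
```

Caching on the destination path means re-running the notebook does not download the data again.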

rhysgt · 2020-06-01T09:37:39Z

I believe the data that we use in the example notebook is now contained within the tests folder and is small compared to the old example data. The old example data is still in the repo in the example_data folder though, and I don't think we use it (@mikesmic can confirm). If we remove that, the repo will be ~20 MB, which I think is reasonable.

Were you downloading through GitHub Desktop by the way? I have found that to be very slow for some reason, which is not directly related to repo size. It sometimes takes a long time even on a fast connection.

mikesmic · 2020-06-01T10:08:45Z

The example notebook now only uses the data from the tests directory, which contains 8.7 MB of data. Of that, 4.9 MB is a ctf file, which we should maybe make smaller as it's not used in the example notebook. I will delete the example data directory in develop (I thought I had already done this tbh), which will cut out 36 MB (60-70% of the total size).

It would be great to work towards having a library of example datasets, defined with consistent filenames and formats to automatically pull into a notebook.


mikesmic · 2020-06-26T16:42:52Z

This is still an issue. Cloning downloads 321.29 MiB of data. What's being downloaded? Does cloning include the whole history? Any ideas @merrygoat ?

merrygoat · 2020-06-28T19:05:44Z

Yes, the hidden .git folder holds the full history, including every old version of every file. You should be able to check this by doing a shallow clone:

git clone --depth [depth] [remote-url]

Where depth is the number of most recent commits to fetch.

You can use git filter-branch to edit the history and remove the files, but do be careful: editing the history is slightly dangerous. Best to do it locally first and make sure you are happy before a force push.
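
Before rewriting anything, it can help to see which files in the history are taking up the space. The sketch below is one way to do that, assuming `git` is on the PATH. It pipes `git rev-list --objects --all` into `git cat-file --batch-check` and sorts the blobs by size. The function names are made up for this example, and the sizes reported are uncompressed object sizes, so they will not add up exactly to the clone size.

```python
import subprocess


def parse_blob_sizes(batch_check_output):
    """Parse 'type sha size path' lines and return (size, path) for blobs,
    largest first."""
    blobs = []
    for line in batch_check_output.splitlines():
        parts = line.split(" ", 3)
        if len(parts) < 3 or parts[0] != "blob":
            continue  # skip commits, trees and malformed lines
        path = parts[3] if len(parts) == 4 else ""
        blobs.append((int(parts[2]), path))
    return sorted(blobs, reverse=True)


def largest_blobs(repo=".", n=10):
    """List the n largest blobs reachable from any ref in `repo`."""
    # Every object in the history, with its path where it has one.
    objects = subprocess.run(
        ["git", "rev-list", "--objects", "--all"],
        cwd=repo, check=True, capture_output=True, text=True,
    ).stdout
    # Look up type and size for each object; %(rest) carries the path through.
    sizes = subprocess.run(
        ["git", "cat-file",
         "--batch-check=%(objecttype) %(objectname) %(objectsize) %(rest)"],
        cwd=repo, input=objects, check=True, capture_output=True, text=True,
    ).stdout
    return parse_blob_sizes(sizes)[:n]


# Usage, run from inside a full (not shallow) clone:
# for size, path in largest_blobs("."):
#     print(f"{size / 1e6:8.1f} MB  {path}")
```

Files that were deleted in later commits still show up here, which is why removing the example data from the working tree does not shrink a full clone.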

aplowman · 2020-06-29T08:40:07Z

Publishing on PyPI is a better approach than fiddling with the git history, isn't it?

merrygoat · 2020-06-29T08:50:53Z

I didn't think of it like that, but yes, certainly.

mikesmic · 2020-06-29T09:49:31Z

I had a look at finding big files and deleting them from the history. I found a decent guide (https://web.archive.org/web/20190207210108/http://stevelorek.com/how-to-shrink-a-git-repository.html) but it scares me. I will publish to PyPI for now; it's daft that I haven't done that yet.

mikesmic added a commit that referenced this issue Jun 1, 2020

Remove old example data

a2dfc68


mikesmic added the enhancement label Jun 1, 2020
