Move sample data to a repository #62

Open
JQFonseca opened this issue May 31, 2020 · 7 comments

Comments

@JQFonseca
Contributor

I installed DefDap on a different computer today and it took a long time to download everything, primarily because the example data (which is needed) is relatively large. Could I suggest we move it to a repository such as Zenodo, and then have a command in the example notebook to download it?
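
A minimal sketch of what such a download command could look like, assuming the data were hosted at a plain download URL. The function name `fetch_example_data`, the `.part` temporary suffix and the optional checksum argument are illustrative assumptions, not part of DefDAP, and no real Zenodo record is implied.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path


def fetch_example_data(url, dest, sha256=None):
    """Download url to dest unless dest already exists, then return dest.

    Hypothetical helper: the name and signature are illustrative only.
    """
    dest = Path(dest)
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        # Write to a temporary name first so an interrupted transfer is
        # not mistaken for a complete file on the next run.
        tmp = dest.with_name(dest.name + ".part")
        with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
            shutil.copyfileobj(response, out)
        tmp.replace(dest)
    if sha256 is not None:
        # Optional integrity check against a known SHA-256 hex digest.
        digest = hashlib.sha256(dest.read_bytes()).hexdigest()
        if digest != sha256:
            raise ValueError(f"checksum mismatch for {dest}")
    return dest
```

In a notebook this would be called once near the top with the real download URL and a local path. Later runs reuse the cached file instead of downloading it again.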

@rhysgt
Contributor

rhysgt commented Jun 1, 2020

I believe the data that we use in the example notebook is now contained within the tests folder and is small compared to the old example data. The old example data is still in the repo in the example_data folder, though, and I don't think we use it (@mikesmic can confirm). If we remove that, the repo will be ~20 MB, which I think is reasonable.

Were you downloading through GitHub Desktop, by the way? I have found that to be very slow for some reason that is not directly related to repo size. It sometimes takes a long time even on a fast connection.

@mikesmic
Collaborator

mikesmic commented Jun 1, 2020

The example notebook now only uses the data from the tests directory, which contains 8.7 MB of data. Of that, 4.9 MB is a ctf file, which we should maybe make smaller as it's not used in the example notebook. I will delete the example data directory in develop (I thought I had already done this tbh), which will cut out 36 MB (60-70% of the total size).

It would be great to work towards having a library of example datasets, defined with consistent filenames and formats to automatically pull into a notebook.
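
One possible shape for such a library is a small registry that maps dataset names to consistently named files, so a notebook only needs the dataset name. The sketch below is an assumption for illustration: the dataset names, file names and the `dataset_paths` helper are all hypothetical and do not exist in DefDAP.

```python
from pathlib import Path

# Hypothetical convention: each dataset lives in its own folder and uses
# the same file name per data kind, so notebooks can refer to it by name.
DATASETS = {
    "demo_a": {"ebsd": "ebsd.ctf", "dic": "dic.txt"},
    "demo_b": {"ebsd": "ebsd.ctf"},
}


def dataset_paths(name, root="example_data"):
    """Return a mapping of data kind to file path for a named dataset."""
    try:
        files = DATASETS[name]
    except KeyError:
        known = ", ".join(sorted(DATASETS))
        raise KeyError(
            f"unknown dataset {name!r}; known datasets: {known}"
        ) from None
    return {kind: Path(root) / name / filename
            for kind, filename in files.items()}
```

With this layout, `dataset_paths("demo_a")["ebsd"]` resolves to `example_data/demo_a/ebsd.ctf`, and adding a dataset means adding one registry entry rather than editing every notebook.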

mikesmic added a commit that referenced this issue Jun 1, 2020
@mikesmic
Collaborator

This is still an issue. Cloning downloads 321.29 MiB of data. What's being downloaded? Does cloning include the whole history? Any ideas, @merrygoat?

@merrygoat
Contributor

Yes, the hidden .git folder has all of the historical diffs - you should be able to check this by doing a shallow clone:

git clone --depth [depth] [remote-url]

where depth is the number of most recent commits to fetch.

You can use git filter-branch to edit the history and remove the files, but do be careful: editing the history is slightly dangerous. It is best to do it locally first and make sure you are happy with the result before a force push.
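
Before rewriting anything, it can help to see which files actually dominate the history. The sketch below is one way to list them, assuming git is on the PATH. It chains `git rev-list --objects --all` into `git cat-file --batch-check`. The helper names `parse_batch_check` and `largest_blobs` are illustrative, and the sizes reported are uncompressed object sizes, not the packed size on disk.

```python
import subprocess


def parse_batch_check(text):
    """Return (size_in_bytes, path) for every blob line, largest first."""
    blobs = []
    for line in text.splitlines():
        parts = line.split(" ", 3)
        # Commits and trees are skipped; only blobs (file contents) count.
        if len(parts) == 4 and parts[0] == "blob":
            blobs.append((int(parts[2]), parts[3]))
    return sorted(blobs, reverse=True)


def largest_blobs(repo=".", limit=10):
    """List the largest files stored anywhere in the history of repo."""
    # Every object reachable from any ref, as "<sha> <path>" lines.
    objects = subprocess.run(
        ["git", "rev-list", "--objects", "--all"],
        cwd=repo, check=True, capture_output=True, text=True,
    ).stdout
    # Look up type and size for each object; %(rest) echoes the path.
    sizes = subprocess.run(
        ["git", "cat-file",
         "--batch-check=%(objecttype) %(objectname) %(objectsize) %(rest)"],
        cwd=repo, input=objects, check=True, capture_output=True, text=True,
    ).stdout
    return parse_batch_check(sizes)[:limit]
```

Calling `largest_blobs()` inside a full (not shallow) clone returns the biggest historical files. Entries whose paths no longer exist in the working tree are the ones that only a history rewrite would remove.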

@aplowman

Publishing on PyPI is a better approach than fiddling with the git history, isn't it?

@merrygoat
Contributor

I didn't think of it like that, but yes certainly.

@mikesmic
Collaborator

I had a look at finding big files and deleting them from the history. I found a decent guide (https://web.archive.org/web/20190207210108/http://stevelorek.com/how-to-shrink-a-git-repository.html), but it scares me. I will publish to PyPI for now; it's daft that I haven't done that yet.
