Size of repository 1.3GB #473
I just checked: my brainstorm folder is 3.37 GB :o
This is also something that had been bothering me for a while: I could very efficiently pack it down from 1.4 GB to 200 MB using BFG:
The annoying part is that this rewrites the entire commit history, making all the existing clones obsolete. We could notify the collaborators who have been committing to the brainstorm3 repo, and they could simply update everything. If I force-push the current .git folder I have on my computer (200 MB), what would be the procedure to port all the development branches to this new compressed repo?
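To illustrate the underlying problem: a binary that was deleted from the worktree still lives in the git object store, which is why only a history rewrite (e.g. with BFG) can reclaim the space. Below is a minimal, self-contained sketch on a throwaway repo (file and repo names invented for the demo); the usual BFG cleanup commands are shown as comments, since they need the bfg jar and a mirror clone:

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo
git config user.email demo@example.com
git config user.name demo

# Commit a ~5 MB incompressible "jar", then delete it in a later commit.
head -c 5000000 /dev/urandom > brainstorm.jar
git add brainstorm.jar && git commit -qm "add jar"
git rm -q brainstorm.jar && git commit -qm "remove jar"

# The file is gone from the worktree, yet its blob remains in .git:
git count-objects -v | awk '/^size:/ {print "loose object size:", $2, "KiB"}'

# A typical BFG cleanup (run against a fresh mirror clone) would be:
#   git clone --mirror <url> repo.git
#   java -jar bfg.jar --strip-blobs-bigger-than 1M repo.git
#   cd repo.git
#   git reflog expire --expire=now --all && git gc --prune=now --aggressive
```

The demo prints a loose-object size of several thousand KiB even though the worktree is empty of binaries, which is exactly the 1.3 GB situation scaled down.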
I moved the compiled distribution ( The packed brainstorm3.git folder is now 85 MB (45 MB code + 40 MB history). Counterpart: now the compiled
That is great. Thank you.
My experience with rewriting history is that clones remain sufficiently independent: if they pull from the upstream master, they get that branch updated, not obsolete. Importantly, links to specific commits or lines of code (permalinks, etc.) will no longer work, which is inconvenient. Forks under active development would probably need to rebase their work after you force-push the pruned master branch, so they can easily merge their work later on. But that is not too painful: they rewrite their own history, and I don't see that as too limiting.
The way to go. 😄 Long-term, I'd suggest using both the GitHub Actions and Releases functionalities, which exist on GitHub precisely to solve the problems we're having, and on top of that are free for open-source projects.
If you want to keep the neuroimage server as the storage server for the binary files, you can upload the file with a specific action: for instance, every time there is a new change in the repo, a script can automatically be triggered to run. The server address, user name, password, protocol, and port can be stored in the Secrets section of this repository, to which only admins have access. Those secrets are environment variables automatically exported to any shell running in a GitHub action.

But perhaps this is a good point to start storing the binary files in the Releases section. The URL they are stored at is very predictable, something like: https://github.com/brainstorm-tools/brainstorm3/releases/download/<>/<<name-of-file.extension>>

Or it makes even more sense to use the release functionality in the brainstorm-tools/bst-java repository. In either case you can automatically compile the .jar file and upload it. Another benefit is that, as far as I know, there is no limit to the number of files you can keep in the download area of an open-source project. That is all provided by GitHub, so you can easily keep a compiled file version for every single change in the Java code.
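As a rough sketch of what such an action could look like, a workflow that builds a .jar and attaches it to a release might be something like the following. All names, paths, and the build step here are hypothetical, not the actual Brainstorm setup:

```yaml
# Hypothetical workflow sketch: publish a compiled .jar as a release asset.
name: publish-jar
on:
  push:
    tags: ['v*']            # runs whenever a version tag is pushed
jobs:
  release:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build jar
        run: ./build.sh     # placeholder for the actual compilation step
      - name: Create release and upload the jar
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: gh release create "$GITHUB_REF_NAME" dist/brainstorm.jar
```

The resulting asset would then live at the predictable `releases/download/<tag>/<file>` URL mentioned above.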
@rcassani @Edouard2laire @HosseinShahabi @Moo-Marc @tmedani @LeoNouvelle @DavideNuzzi
All the commits will change ID?...
I don't see how GitHub Actions/Releases can help shrink our history...
Uploading the compiled Brainstorm package on GitHub is not much easier than on the web server.
The repository The big .jar file that causes the dramatic history growth is the one created by the Matlab Compiler: the full compiled Brainstorm application. This is not something we can compile from GitHub.
I see. Thanks for the detailed explanation. I've been discussing this with Gabriel (@gabrielbmotta) and some of the ideas I share come from that conversation. Two different problems: (1) reduce the repo size and (2) avoid this problem in the future. Issue 1 is quite stimulating because it kind of goes against git's principles. Very cool. Regarding your post.
I'm applying logic here, so I'm not sure. But I'm assuming that if you modify a commit SHA, all the 'dependent' commits will also require modification. There might be something I'm not understanding, though.
It doesn't. I brought it up to try to solve 2.
Yes, it is possible! You can have a job run on anything, e.g. a defined schedule (say, Mon-Wed-Friday :)), not only on a commit. It could be once a month; or you can have a specific named branch which, when pushed to, triggers a runner that can do this for you; or a specific text tagging a commit, etc.

One problem is that GitHub-hosted runners don't have Matlab installed, so you'd have to install Matlab and then run the compilation script. But for a project the size and complexity of Brainstorm, I think it is worth investing the effort long-term into providing a self-hosted runner: you have your own machine with Matlab installed, plus a client app that lets GitHub trigger a script to run on that machine. That way you can have Matlab installed there, no problem. Being an open-source project, we have to be careful with security, but that self-hosted runner can, only when you decide (not on every commit!), automatically check out the master branch, run

Now, regarding the reduction of the size of the repo: see this interesting discussion. Apparently it is not completely clear that you can seamlessly reduce a repo's size without rewriting history: https://stackoverflow.com/questions/17470780/is-it-possible-to-slim-a-git-repository-without-rewriting-history

Think about the first solution I mentioned: create a "vault"/"hibernate" repository in brainstorm-tools which clones the master branch as it is, and start a cleaner ±40 MB master branch without old dependencies, either the way you proposed or the way proposed in the previous link. Check this tool referenced in the GitHub docs (you also shared it!): https://rtyley.github.io/bfg-repo-cleaner/

Hope it helps.
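The trigger options described above might look roughly like this in a workflow file; the branch name, cron schedule, and script name are all illustrative, not real Brainstorm configuration:

```yaml
# Sketch of scheduled / branch-triggered builds on a self-hosted runner.
on:
  schedule:
    - cron: '0 6 * * 1,3,5'      # e.g. Mon/Wed/Fri at 06:00 UTC
  push:
    branches: [compile-me]        # or: a dedicated branch that triggers the build
jobs:
  compile:
    runs-on: self-hosted          # a machine you control, with Matlab installed
    steps:
      - uses: actions/checkout@v4
      - run: ./compile_brainstorm.sh   # placeholder for the Matlab compile step
```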
But what is the advantage of having this done on GitHub rather than on our server? As a reminder: we want to keep the idea that general users register on our website rather than download/clone directly from GitHub. The GitHub distribution is mostly for developers, not for general users. The registration and download statistics are major arguments when requesting public funding for this project :)
As I reference commits in the forum posts related to bug fixing, it would be good to keep the links to all the old commits working.
Simplicity? I mean, since we're all using GitHub already, if we use GitHub's (free) service you could give anyone in the development team the capacity to generate these bst compilation files (from anywhere) without any extra work.

Going back to the use of the "release" section in GitHub: regardless of how the binary files are actually compiled, if we were to use this release section we could automatically get a publication DOI for each stable release. Releases can be generated, as explained previously, manually or automatically, on a schedule or triggered by some event. This can help with ease of use when referencing in the academic context. There is a description of the process in GitHub's own documentation. See:

In relation to this, I think it is a mistake to continue uploading the compiled Java code to brainstorm3/java, regardless of how small the files are or how infrequently they are updated. brainstorm-tools/bst-java is Java code that generates compiled code, and that compiled code can be stored in the release section of that repository. Code repositories, I think, should be kept for text files containing code and code-related material, like documentation, etc.

Going back to the actual problem we're dealing with here, the size of the repo, let me recapitulate: it is clear by now that you cannot add/delete/modify any file in a repository without modifying its history. And what we want is a slim repository without the file

We have a few options:
Great initiative Juan, and good idea to remove the JARs from the repository and its history! I don't think there's any way to do this without rewriting the git history, which would indeed fork the code base. I would create a new stable branch, and I believe a simple

Finally, as Juan alluded to, it could be a good time to rename the stable branch from
Thank you @martcous ! After thinking about it for a while, I think I can offer a few specific ideas. I've created a testing repo and I think it could work. I've been reading about the ability to merge unrelated histories, but I'm not sure how that can help. I think we can achieve a smaller repo size while considering the following:
Option A. Two branches: master & history
But I think that, given git's and GitHub's APIs, either we stick to 1.3 GB or we modify the history. And in order to modify the history while keeping the links and references valid, we need two branches.
Interestingly, because of the way BFG works, GitHub will think that changes have been made in the
or removing the references
Option B. Two branches:

Notes.
You'll see that we have blobs bigger than 1 MB that are not related to the compiled Java file.
I think that these

Whatever you decide, I'm sure it will be the best option.
Thanks for all this research!

New branch vs. new repository

We loosely discussed moving to This major update would not be handled like the regular updates: changing the "version" number (and the logo/splash screen and installation folder) would be a clear way to show users that something changed. If we want to create a new repository

java/brainstorm.jar

I thought initially that we would need to link the correct version of the What would be your suggestion for having the Can we easily create a dependency link from the

(Note that in the future we might not have much more than 5 commits per year for this file, so 330 KB x 5 = 1.6 MB added every year. I would tend to consider this negligible and keep it as it is now, with the

Other libraries

Now that we have the plugin interface that allows dynamic download of files on the fly, we could move most of the third-party I/O libraries from the

Additionally, we will soon be able to decommission the support for JOGL completely (needed only for the old connectivity graph display). It causes trouble on many computers, and the new graph display seems to be working OK (we haven't had bug reports about it in a while).

ICBM152 template

We want the ICBM152 template that is copied to all new protocols to be completely versioned, fully linked with the commit history. A minor change in these files (e.g. the number of vertices of the cortex surface used for source reconstruction) would impact the results of any group study. It would be good to have it copied into every clone instead of downloaded on the fly.

I could move these files to the neuroimage server and have the daily cron job copy them into the .zip package available on the download page, but that would make it more complicated for any user to reconstruct the software at any given point in history... Can we have git automatically clone files from external links (copy files from the neuroimage server, for example)?

Sorry, this is a bit drafty, I'm thinking out loud... I'm OK with moving these away from the GitHub repo, but I'm not sure in which direction I should go.

Reproducibility considerations

For full reproducibility of an analysis, we want users to archive the version of the software they used for processing together with the dataset (and get a DOI attached to the whole thing). The ICBM152 template should always be included in the cloned brainstorm3 folder, but so should all the plugins it depends on.
Thank you, I see. Thanks for all the background info.
That sounds great, and it makes a lot of sense. Brainstorm is now a mature project with thousands of things occurring all at once; I'm very happy to be witnessing this. Through this conversation I'm gaining insight into what version-control good practices might mean, so thank you for this.

While doing this, allow me for a minute to share my thoughts: this is brainstorm (without

Perhaps only slightly related to the current issue, but I think it is worth thinking about how 16 MB can turn into 1.3 GB. The conversation might be helpful in the near future, with either the new ...

But, going back to the actual issue at hand: if you go for this new-repo option, it would solve the repo size, no problem. You could (1) archive brainstorm3, so all links would be preserved and confusion avoided, and (2) rewrite history on brainstorm4 so that the repo goes back to somewhere around 90 MB or less.

At some point you'll face this decision: whether to keep all the commits or to squash them into one single commit. You could set an initial commit in brainstorm4 that squashes all contributions to brainstorm3 into a single commit; that squash of the whole history would probably bring the repo down to the 25-30 MB area. I'm not sure which is the best option; I'm just sharing some decisions that would have to be made eventually. Personally, I'd keep the back history but avoid any non-text file, so the repo would probably shrink to the ~30 MB area, which for a repo of Brainstorm's importance seems fair.
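For what it's worth, the squash-into-one-commit option can be done with git's orphan-branch mechanism: start a branch with no parent from the current tree and commit it once. A minimal sketch on a throwaway repo (all names invented):

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo
git config user.email demo@example.com
git config user.name demo

# Simulate an existing history of three commits.
for i in 1 2 3; do echo "$i" > "f$i.m"; git add .; git commit -qm "commit $i"; done

# Start an orphan (parentless) branch from the current tree,
# then commit everything as a single initial commit.
git checkout -q --orphan squashed
git commit -qm "Import brainstorm3 sources (history squashed)"

git rev-list --count HEAD   # -> 1
```

The working tree is unchanged; only the history behind it is collapsed to one commit, which is what would shrink the pack down to roughly the size of the current sources.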
The first thing I think of is git submodules, but I wouldn't go that route: the different languages (Matlab vs. Java) and the difficulty of operating git for inexperienced users make me think we should just keep them separated.

Forgive me for replying with a question: why would you want a binary file with compiled code to always be available when you clone the repo? A version control system is a great tool for maintaining source code, but maybe not so much for deploying an application. I don't think you need compiled Java code to maintain versions of a Matlab codebase; you might need it to run the application. So then, make the

In order to download from GitHub, you can either drag and drop or use a script (which can also be automatically triggered by a GitHub action, either on a calendar schedule or whenever there is an event). I just created this release for demo purposes. Here you can see how we upload a tar file to the release section through a GitHub action.
See the linked release I just created. The release-related URLs are immutable and direct you to the .jar file. Not only that: you can link to a specific release or to a

Maybe we could compile it manually on our computers using NetBeans, and then manually create a release to upload the resulting dist/brainstorm.jar to GitHub?

Drag and drop in GitHub's page. See: https://docs.github.com/en/repositories/releasing-projects-on-github/managing-releases-in-a-repository

Related to the ICBM152 template: perhaps making a
You can make GitHub run a script on a schedule and have that script do exactly that, for sure. Here is an example of a script that downloads some files from Dropbox (see). Alternatively, you can implement a git post-checkout hook that runs another script, but hooks are complicated because (i) they depend on each user's configuration and (ii) they are difficult for inexperienced users. So I'd go with actions.

Related to the reproducibility considerations: I would add to your proposition some sort of report similar to Matlab's version report (

@ftadel and all, thanks. It's been great to find possible solutions to this issue.
Thank you for all these suggestions! I started implementing some of them in order to move all the binary files out of the github repository:
The next things on my todo list:
For the actual repository cleaning (new repo vs. new branch) we'll discuss with @rcassani and the rest of the team at our next weekly meeting in January.
I already started something like this with the first version of the plugin manager. I need to find a way to get the last commit hash, instead of
That was quick, great work, François!
You could use GitHub's public API and parse the JSON result. Example: https://api.github.com/repos/brainstorm-tools/brainstorm3/commits/master
Yes, apparently it is as easy as @martcous points out. You just do an HTTP request:

```matlab
uri = matlab.net.URI("https://api.github.com/repos/brainstorm-tools/brainstorm3/commits/master")
request = matlab.net.http.RequestMessage;
r = send(request,uri)
r.Body.Data
```

shows...

so, then

gives you the string.

```matlab
uri = matlab.net.URI("https://api.github.com/repos/brainstorm-tools/brainstorm3/commits/master")
request = matlab.net.http.RequestMessage
request = request.addFields(matlab.net.http.HeaderField("Accept", "application/vnd.github.VERSION.sha"))
r = send(request,uri)
sha = r.Body.Data
```

In the docs pages it is said that
Thanks! Implemented here: 101e36f
While this is still a work in progress, I just wanted to share that it is possible to clone a repository without its history (and avoid downloading gigabytes of useless, outdated binary files). By adding

This is definitely useful for testing pull requests...
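Git's standard mechanism for a history-less clone is the shallow-clone option, `--depth 1`, which is likely what is being referred to here. A self-contained demo against a local repository (all names invented; `file://` is used because git honors `--depth` over that transport):

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q origin-repo && cd origin-repo
git config user.email demo@example.com
git config user.name demo

# Build a small repo with three commits to stand in for brainstorm3.
for i in 1 2 3; do echo "$i" > "f$i.m"; git add .; git commit -qm "commit $i"; done
cd ..

# Shallow clone: only the latest commit, none of the binary-laden history.
git clone -q --depth 1 "file://$tmp/origin-repo" shallow
cd shallow
git rev-list --count HEAD   # -> 1
```

The shallow clone has the full current worktree but only one commit of history, so none of the old packed blobs are downloaded.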
Hi all,
I'd like to bring up how big the brainstorm3 repository folder is (1.3 GB). I'm not sure whether this is a problem for everybody, or whether I should bring this up as an issue at all. As always, I'm just trying to help make bs (even) better. If this is not the time, please feel free to close the issue.
This only became clear to me after cloning the repo from scratch. Previously, updating through pulls, it didn't become so obvious or problematic, which is perhaps why some of you haven't noticed.
I think it is problematic in the sense that the actual code of the repo is 20 times smaller, and a lighter codebase is preferable. 1 GB of data is not limiting on today's computers either, but I still think it is desirable to aim for a lighter repository, or at least worth a conversation. On top of that, the main reason for such a big repository is the presence of old versions of binary files in the repo's history. Typically, binary files should be kept separate from the repo, we all know that, but I'd agree that brainstorm's nature is a bit particular in that sense. So I get it.
I'd propose generating a "vault-previous-to-2022" type of branch where the history of these files could be kept exactly as it is right now in master, and then relieving the master branch of such a big load.
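Creating such a vault branch is essentially a one-liner: branch off the current tip before rewriting anything. A sketch on a throwaway repo (the branch name is taken from the proposal above; everything else is invented):

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo
git config user.email demo@example.com
git config user.name demo
echo "code" > brainstorm.m && git add . && git commit -qm "initial"

# Freeze the current state of the default branch under a vault name;
# the main branch can then be rewritten/pruned while the vault branch
# still points at the original commits (push both with: git push origin --all).
git branch vault-previous-to-2022
git branch --list
```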
A long-term solution would be to use the release functionality within GitHub. If needed, I could help out with that.
Interesting link: https://stackoverflow.com/questions/11050265/remove-large-pack-file-created-by-git
Thanks for reading.