Files which are never created or modified #42
Comments
Hey @maelick, can you provide more information on this topic?
Using the latest version of CVSAnalY:

git clone https://github.com/MetricsGrimoire/CVSAnalY.git
cd CVSAnalY
git checkout 3d67e700902e54d8ac3cfac60169e80b68d70f4e
cd ..
git clone git://git.gnome.org/tomboy
cd tomboy
git checkout cea2c730f3fe135067a26aafd6dd258348932662
cd ..
./CVSAnalY/cvsanaly2 -u <user> -p <pass> -d <db> tomboy/

Then I used the following SQL queries to compute the total number of files:

SELECT COUNT(*) FROM files;
SELECT COUNT(DISTINCT f.id) FROM files f, actions a
WHERE f.id = a.file_id AND a.type IN ("C", "A");

This gives 3218 files that have been added or copied out of 5181
For example let's have a look at the README file at the root of

CREATE VIEW not_created_files AS
SELECT * FROM files WHERE id NOT IN (
SELECT DISTINCT f.id FROM files f, actions a
WHERE f.id = a.file_id
AND a.type IN ("C", "A")
);
CREATE VIEW file_links_count AS
SELECT f.id, COUNT(*) n FROM files f, file_links fl
WHERE f.id = fl.file_id GROUP BY f.id;

Then I looked at those files and the different actions that were done

SELECT f.id, fl.file_path, flc.n, a.type, a.branch_id, l.date
FROM files f, file_links fl, file_links_count flc, actions a, scmlog l
WHERE fl.file_id = f.id AND flc.id = f.id AND f.id = a.file_id
AND f.file_name = "README" AND fl.file_path = "README" AND a.commit_id = l.id
ORDER BY l.date;

This gives the following result:
The "n" column is used to ensure that the file has only one file link
First a README file was added at the same time on two different
Another example, the files never created on the master branch:

SELECT f.id, fl.file_path, flc.n, a.type, a.branch_id, l.date
FROM not_created_files f, file_links fl, file_links_count flc,
actions a, scmlog l, branches b
WHERE fl.file_id = f.id AND flc.id = f.id AND f.id = a.file_id
AND a.branch_id = b.id AND b.name = "master" AND a.commit_id = l.id
ORDER BY l.date;
Here most of them are files that were only deleted.
My first intuition was that when a file is created in branch X, then
But looking at the file

SELECT f.id, fl.file_path, flc.n, a.type, a.branch_id, l.date
FROM files f, file_links fl, file_links_count flc, actions a, scmlog l
WHERE fl.file_id = f.id AND flc.id = f.id AND f.id = a.file_id
AND fl.file_path = "Mono.Addins/Mono.Addins/Mono.Addins/AddinLocalizer.cs"
AND a.commit_id = l.id ORDER BY l.date;
I have found this problem to be recurrent across most GNOME git repositories.
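To make the counting above easier to reproduce without a full CVSAnalY database, here is a minimal sketch using Python's built-in sqlite3 module. The two-table layout and the three rows are toy data invented for illustration (they are not taken from the Tomboy run), and the real schema has more columns; the query itself mirrors the "added or copied" count used above.

```python
import sqlite3


def count_never_created(conn):
    """Return (total_files, created_files, never_created_files)."""
    total = conn.execute("SELECT COUNT(*) FROM files").fetchone()[0]
    created = conn.execute(
        "SELECT COUNT(DISTINCT f.id) FROM files f "
        "JOIN actions a ON f.id = a.file_id "
        "WHERE a.type IN ('A', 'C')"
    ).fetchone()[0]
    return total, created, total - created


# Toy database (assumed, simplified layout).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE files (id INTEGER PRIMARY KEY, file_name TEXT);
    CREATE TABLE actions (id INTEGER PRIMARY KEY, file_id INTEGER,
                          type TEXT, branch_id INTEGER);
    -- file 1 is added on branch 1
    -- file 2 is the same path, first seen through an M action on branch 2
    -- file 3 is only ever deleted
    INSERT INTO files VALUES (1, 'README'), (2, 'README'), (3, 'old.txt');
    INSERT INTO actions (file_id, type, branch_id) VALUES
        (1, 'A', 1), (2, 'M', 2), (3, 'D', 1);
""")

print(count_never_created(conn))
```

On this toy data only one of the three file entries has an 'A' or 'C' action, which is the same pattern as the real counts reported above, just at a much smaller scale.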
Thanks for this detailed description.
Here's more information. I created a simple test git repository as follows:

mkdir test
cd test
git init
echo "hello" > file
git add file
git commit -m "added"
git checkout -b test
echo "hello world" > file
git commit -am "modified"
git checkout master
echo "hello foo" > file
git commit -am "modified master"
cd ..
./CVSAnalY/cvsanaly2 -u <user> -p <pass> -d <db> ./test

It contains a single file and two branches, each with one commit modifying the file.
Running CVSAnalY in debug mode shows that when the second commit (the
I wonder why the branch id is added to the file path? Is this
Does anyone know if adding the branch id to the file path is needed? And if it is, why?
Hey @maelick, a big thank you for the detailed test case. Would you be so kind as to create a pull request with your changes? What do you think :) ?
Pull request sent ;) However, this solves the problem only partially. For example, there is still a problem with this dummy repository:

mkdir test
cd test
git init
echo "hello" > file
git add file
git commit -m "added"
git branch test
git rm file
sleep 1
git commit -m "file removed"
git checkout test
echo "hello world" > file
sleep 1
git commit -am "modified"
git checkout master
cd ..
./CVSAnalY/cvsanaly2 -u <user> -p <pass> -d <db> ./test

In this case the file is removed from branch master but was modified on the second branch. What happens is that CVSAnalY first processes the commits from the master branch, and then the commits from the second branch. This means that when the commit on the second branch is processed, CVSAnalY finds no file matching the path in the cache (see here) and thus adds a new file entry when it shouldn't. This is more difficult to fix because it will probably require knowledge of the commit DAG.

I suppose other weird things could happen even in a repository without explicit branches because of implicit branching. For example, what happens if the "test" branch is merged back into master? The result may depend on the order in which the commits are made. I also wonder whether this is an issue in a centralized VCS as well.

A solution which could partially resolve or at least minimize it is to parse only the master branch. I have already written code that adds a "--no-ref" option to the CVSAnalY CLI, which removes the "--all" option from the git command line called by RepositoryHandler. I will create a pull request for this soon.
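The ordering problem described in this comment can be illustrated with a small simulation. This is only a sketch under a strong assumption: the cache is modelled as a plain dict from path to file id, where a 'D' action drops the entry and any action on an unknown path creates a new file. The function name `replay` and the tuple layout are hypothetical and are not CVSAnalY's actual code.

```python
def replay(log):
    """Replay (branch, action, path) tuples in order, returning (file_id, action, branch) rows.

    Hypothetical simplification: a single path-keyed cache shared by all branches.
    """
    cache = {}
    next_id = 1
    rows = []
    for branch, action, path in log:
        if path not in cache:
            # Unknown path: a new file entry is created, whatever the action is.
            cache[path] = next_id
            next_id += 1
        rows.append((cache[path], action, branch))
        if action == "D":
            # Deleting the file removes it from the cache.
            del cache[path]
    return rows


# Master commits first, then the commit on "test" (the order described above).
branch_order = replay([
    ("master", "A", "file"),
    ("master", "D", "file"),
    ("test", "M", "file"),
])

# Same commits, but the "test" commit is replayed before the deletion.
other_order = replay([
    ("master", "A", "file"),
    ("test", "M", "file"),
    ("master", "D", "file"),
])

print(branch_order)
print(other_order)
```

In the first ordering the modification on "test" ends up attached to a second file id, while in the second ordering all three actions share one id. Under this simplified model the number of file entries depends on replay order alone, which is why a fix probably needs to follow the commit DAG rather than a flat list of commits.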
This is a matter of how CVSAnalY was designed. I'm going to use the last case that @maelick posted to show you the rationale behind this. It's not really easy to explain (even in Spanish! and my English is not so good) so please ask me about anything that you don't understand.

First of all, you have to take into account that our main source is the repository log. The key idea is to track the changes in the repository using just the log. Every action stored in the database is there because it was found in the log. We don't guess or invent anything (with minor exceptions, of course ;) ). This means that if you don't find an action for a file, it is because it doesn't exist in the log.

When a branch is created there are no actions saying which 'branched files' were added (SVN is the exception, read below). You will only find 'A' (add) actions for those files that are new on that branch. When we were coding how to track branches and their files in the database, we were tempted to create add actions for the 'branched files', but in the end we considered that extremely inefficient in terms of memory and database performance. To add those 'branched files' we would have to keep in memory the directory structure of every branch (tracking their changes) and store in the database thousands of entries for files that will never be modified, deleted, copied, etc.

We rejected that idea and followed another approach. We decided to consider a file in a branch as new the first time there is an action on it in that branch. Take into account that at this point CVSAnalY doesn't know anything about which files are in the tree, or which file corresponds to which in another branch. When this happens, a new file_id is created for that file in that branch. Let's move to @maelick's example to see how this is done:
The first commit (id 1) creates the file "file" on the master branch (branch_id 1). Then a new branch 'test' (branch_id 2) is created and "file" is modified. When the branch is created there are no 'A' actions in the log, so no new file is created. Then 'file' is modified. It's the first time that CVSAnalY learns anything about this file on that branch, so the new file_id 2 is created for it and added to the actions table.

And now, what the hell happens with SVN? Well, SVN is our nemesis, the mother of all evil... the reason for our workarounds and tricks in CVSAnalY. In SVN there are no real branches. In SVN a branch is a directory that someone says is a branch. Creating a branch in SVN means copying the trunk directory to another place. That's the reason why you can find 'A' actions for branch files: there were explicit add actions on those files (svn add commands). But you may also not find any of these actions, because if instead of adding files you just add the directory containing those files, the SVN log only records that a directory was added. Damn it!

Why do we need a branch_id for the files and actions? The documentation is really clear about that.
Keep in mind that CVSAnalY was designed first for CVS and SVN. Git was added later. Nowadays CVS and SVN are deprecated. Rethinking how to do these things can lead us to a better design.
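The rule described in this comment (a file in a branch is treated as new the first time an action touches it in that branch) can be sketched in a few lines of Python. This is an illustration only: the function `assign_file_ids`, the tuple layout and the commit/branch numbers are assumptions made for the example, not the tool's real data structures.

```python
def assign_file_ids(log):
    """Assign a file_id per (branch_id, path) the first time an action touches it.

    `log` is a list of (commit_id, branch_id, action_type, path) tuples,
    a hypothetical flattening of what the repository log provides.
    """
    ids = {}
    actions = []
    for commit_id, branch_id, action_type, path in log:
        key = (branch_id, path)
        if key not in ids:
            # First action on this path in this branch: treat it as a new file.
            ids[key] = len(ids) + 1
        actions.append({
            "commit_id": commit_id,
            "branch_id": branch_id,
            "type": action_type,
            "file_id": ids[key],
        })
    return actions


# The first test repository from this thread: one add on master (branch 1),
# one modification on "test" (branch 2), one modification on master.
actions = assign_file_ids([
    (1, 1, "A", "file"),
    (2, 2, "M", "file"),
    (3, 1, "M", "file"),
])
for action in actions:
    print(action)
```

With this rule the modification on branch 2 gets file_id 2 and never has an 'A' action, which matches the "never created" entries reported at the start of the thread.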
Thanks for your clarification. I understand why there is a branch_id field in the actions table. However, my concern was not about the branch_id field in the actions table. Moreover, you also mentioned a branch_id field in the files table, but there isn't one (only repository_id). When I mentioned the branch id, I was not referring to the database schema but to the Python source code of CVSAnalY. For example those lines add the branch id before the file path and thus create a file entry for each branch when there is an action on that branch and file.

Again, I clearly understand why actions need to be related to branches, but why do files? The only answer I can find to this question is that it can lead to problems when files are deleted on one branch and not on the other... which is exactly the kind of problem I describe here.

The main related problem I encountered is that it is impossible to reconstruct the list (or number) of files that existed in a repository at a given time, even for a centralized VCS. Maybe one way of solving it would be to actually add a branch_id field to the files table.
Talking of design, I think the problem is closely tied to the difference between files and file_links (see). I think it's a really nice feature but unfortunately I fear it causes a lot of problems. Thinking about the Evolution case that I mentioned earlier, there are a little under 5 million entries in the files table. Actually there shouldn't be that many (there are fewer than 20000 unique absolute file paths). I think that when working with big repositories with a lot of branches it becomes impossible to fulfill the original goal of the files/file_links feature: "Assigning identifiers to the
Files need to be related to branches because the content of a file can differ between branches, a file can be deleted (as you wrote) in one branch and not in the others, can be replaced by other files, etc. This is useful for extensions like Metrics. The Metrics extension retrieves the contents of files (to calculate SLOC and other metrics) and you have to specify which branch you get that file from.

Files are linked to branches via the actions table. If I remember well, it is some kind of optimization to avoid replicating data among tables. branch_id could also be included in the file_links table, but as common queries go through the actions table you can get the id from there.

As for why branch_id is added to the path of a file: CVSAnalY stores a cache of files (the class DBContentHandler that you mentioned) and this path is used as the key to know whether the file exists in the cache or not.
I forgot to mention that you can reconstruct the file tree of a repository for a given revision... but it's very tricky. CVSAnalY was never designed to do that in an easy way because the fastest and most reliable way of getting it is to use the source code repository itself. Repositories were designed for that :)
I still don't get why there needs to be a branch id in both actions and files. What bothers me even more is that the branch_id is not in the files table but in the cache. You said that files need to be specific to each branch because SVN can modify the same file on many branches in the same commit. OK, but then why not put it into the files table? Moreover, if the goal is also to avoid data replication, it would make more sense to have the branch_id in files rather than in actions. A branch id in actions makes sense if there is one file entry for all branches. A branch id in files makes sense if there is one file entry for each branch. If it's important to keep files for each branch, I think that we should move the branch_id field from actions to files.

Regarding reconstructing the file tree, this is not what I want to do. I am rather interested in doing things like counting how many code files were in the repository at a given time, for example to get an idea of the size of the repository. Reconstructing the file tree is easy with a VCS (in particular git). Counting the number of code files is not as trivial, and should be as easy and efficient as:

SELECT r.name, COUNT(DISTINCT f.id) FROM repositories r, files f
WHERE r.id = f.repository_id AND f.file_name LIKE "%.py" GROUP BY r.id;

If you do that now on a repository with a lot of branches, you'll get a completely biased result. You could argue that then maybe you can simply restrict to the main branch. You'll miss files that have been created outside the main branch, but maybe that's not important. But then in this case we should have the branch_id in files rather than in actions.
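To show the bias mentioned in this comment on a tiny scale, here is a sketch with Python's sqlite3 module. The table layouts are simplified assumptions and the rows are made-up toy data: one real file, src/main.py, that has three entries in files because it received actions on three branches. Counting distinct paths through file_links is used here only as a point of comparison; it is not a proposed fix, since file_links may hold several paths for renamed files.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE files (id INTEGER PRIMARY KEY, file_name TEXT,
                        repository_id INTEGER);
    CREATE TABLE file_links (id INTEGER PRIMARY KEY, file_id INTEGER,
                             file_path TEXT);
    -- Toy data: the same path is registered once per branch.
    INSERT INTO files VALUES (1, 'main.py', 1), (2, 'main.py', 1), (3, 'main.py', 1);
    INSERT INTO file_links (file_id, file_path) VALUES
        (1, 'src/main.py'), (2, 'src/main.py'), (3, 'src/main.py');
""")

# Count by file id, as in the query from the comment above.
by_id = conn.execute(
    "SELECT COUNT(DISTINCT f.id) FROM files f WHERE f.file_name LIKE '%.py'"
).fetchone()[0]

# Count by distinct path, for comparison.
by_path = conn.execute(
    "SELECT COUNT(DISTINCT fl.file_path) FROM files f "
    "JOIN file_links fl ON fl.file_id = f.id "
    "WHERE f.file_name LIKE '%.py'"
).fetchone()[0]

print(by_id, by_path)
```

On this toy data the id-based count reports three Python files where only one path exists, which is the kind of inflation described for repositories with many branches.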
Regarding @sduenas's comment about "Files need to be related to branches", I would say yes and no.
Yes, file content needs to be related to branches, and with
And no, the file id should not be related to branches. If we want to analyze how a specific file evolves over time, we would like to see its history starting with an 'A' action, not an 'M' action. Having several file_ids for the same file in different branches essentially makes it impossible to perform the change analysis on branches other than the "master" branch (assuming "master" is the oldest one).
There are many files which are never created. For example, in Tomboy I have 5143 entries in the files table but only 4514 are actually referenced in actions (i.e. 629 are never touched). Moreover, only 3202 have been created (added or copied) at least once. With a bigger repository like Evolution this becomes even more extreme: of 4941692 file entries, only 19672 have been created!
Here are the queries I've used:
I have tried to find the source of the problem by crawling through the code, but I still haven't found its origin. In general there are too many entries created in the files table (like the enormous number of entries in Evolution) and my intuition is that this might be related to branches. For example, if a file is created in the master branch, then a new branch is created and the file is modified in this new branch, a new file entry will be created (and also one for each of the parent directories).
This might be related to issue #3, as I have also seen in Tomboy 5 files for which there are two entries associated with the same commits. For one of them, the file is renamed in a branch but was created in another one, thus new entries are created in files and file_links here, then a second file_links entry is created for the renaming action here