Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed FileRevs to handle renaming in another branch #60

Merged
merged 6 commits into from
Mar 23, 2014

Conversation

linzhp
Copy link
Contributor

@linzhp linzhp commented Feb 12, 2014

Problem

Commit B and commit C share the same parent A, but B and C are in different branches. B renamed file1 to file1.renamed, while C modified file1. C are made at a later time than B. When the branch of C is finally merged into the branch of B, commit C will appear after commit B in git log. When deciding the path of file1 at commit C, FileRevs found that the latest record in file_links table is the one related to commit B, so it used the file1.renamed as the path, which is wrong.

Solution

Use is_ancestor method in repositoryhandler to decide whether a commit is an ancestor for another. In this case, B is not an ancestor of C. Only use the latest file_links record related to ancestors of a commit to decide the file path.

An alternative implementation is to use commit graph to determine the ancestry reation. Not sure which one is better. Since #40 is not yet accepted, I try not to make this PR dependent on it.

@andygrunwald @sduenas please review, thanks!

@linzhp linzhp mentioned this pull request Feb 13, 2014
@andygrunwald
Copy link
Contributor

Hey @linzhp,
after a code review (read only at the moment) the code looks fine!
I have to test your changes with a small test repository to reproduce the described problem and i have to get a deeper look at the path_query which was changed in linzhp@dd465e2

What i really like that you added some unit tests!

I try to do this later this day or tomorrow. Will inform you about more progress.

@sduenas If i test this successful, is it ok that i will merge this?

@linzhp
Copy link
Contributor Author

linzhp commented Feb 22, 2014

This pull request contains an updated test repository. To reproduce the problem, you could run the old code on the test repository attached in this PR.

@andygrunwald
Copy link
Contributor

He @linzhp,

i tried to reproduce your code, but i was not successful.
I run it at the test repo (input.tat.gz) without extensions on a fresh database.
If i add the Metrics extension, the program failed:

./cvsanaly2 --config-file ./config --db-database cvsanaly_after /Users/andygrunwald/Development/tmp/input
Parsing log for /Users/andygrunwald/Development/tmp/input (git)
Warning: Detected empty branch 'master', it'll be ignored
Executing extensions
Executing extension FileTypes
Executing extension Metrics
Error obtaining aaa/otherthing.renamed@51a3b654f252210572297f47597b31527c475fb8. Command ['git', 'show', u'51a3b654f252210572297f47597b31527c475fb8:aaa/otherthing.renamed'] returned 128 (fatal: Path 'aaa/otherthing.renamed' exists on disk, but not in '51a3b654f252210572297f47597b31527c475fb8'.
)

Can you help me to test it?
Which database table has to be another value in which field? :)
THX

@linzhp
Copy link
Contributor Author

linzhp commented Feb 23, 2014

I think you just reproduced the bug that this PR is trying to fix. So try applying the changes in this PR, and the error will be gone.

__path_query__ = '''SELECT file_path FROM file_links,
(SELECT MAX(id) id FROM file_links WHERE file_id = ? AND commit_id <= ? ORDER BY commit_id DESC) fp
WHERE file_links.id = fp.id'''
__path_query__ = '''SELECT rev, file_path FROM file_links fl JOIN scmlog s
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@linzhp why did you remove MAX() function from path_query? MAX() function is required to get the last file link at a given commit/revision because, for unknown reasons, a file can have 2 links in the same revison. See issue #3

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't ORDER BY commit_id DESC, fl.id DESC sufficient to get max file_links id?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I try to Test this next days again. Today i was not Successful :(

Am Sonntag, 23. Februar 2014 schrieb Zhongpeng Lin :

In pycvsanaly2/extensions/FileRevs.py:

@@ -32,9 +34,8 @@ class FileRevs:
from scmlog s, action_files af where s.id = af.commit_id and s.repository_id = ? order by s.id'''
# This query selects the newest entry for those cases with two filepaths
# for the same file. See #3 for more info.

  • path_query = '''SELECT file_path FROM file_links,
    -(SELECT MAX(id) id FROM file_links WHERE file_id = ? AND commit_id <= ? ORDER BY commit_id DESC) fp
    -WHERE file_links.id = fp.id'''
  • path_query = '''SELECT rev, file_path FROM file_links fl JOIN scmlog s

Isn't ORDER BY commit_id DESC, fl.id DESC sufficient to get max
file_links id?

Reply to this email directly or view it on GitHubhttps://github.com//pull/60/files#r9977111
.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably it is enough but I think your query can return more than a row while the current query returns only one. Moreover, I think the current query is more readable.

@andygrunwald what do you think? Which one do you think is better?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you come up with a query that only returns one row and is guaranteed to be the right one, i.e., without the error in #65?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sduenas I checked the queries.
The new query is fine, too. In my point of view the new query is much more readable, because there is no subquery to generate a virtual table like in the last one.
You are right, the new query can return more than one query. The last query only returns one.
But it is possible to add a LIMIT 1 to the new query to return only one query.
What do you think about this @linzhp ?

@sduenas
Copy link
Member

sduenas commented Feb 23, 2014

@andygrunwald if you find that it works, merge the changes, but before that, I think the problem with the query that I mentioned above should be fixed.

Thanks both for your work.

@sduenas
Copy link
Member

sduenas commented Mar 3, 2014

Same problem here. I've been unable to make this patch work with input.tar.gz repo.

@andygrunwald
Copy link
Contributor

Zhongpeng, can you please post your config file and your Commands to Start
CVSAnalY?

And which was the result in the Database before and after.
Maybe then we have to reproduce this behaviour. Thanks

Am Montag, 3. März 2014 schrieb Santiago Dueñas :

Same problem here. I've been unable to make this patch work with
input.tar.gz repo.

Reply to this email directly or view it on GitHubhttps://github.com//pull/60#issuecomment-36475025
.

@linzhp
Copy link
Contributor Author

linzhp commented Mar 3, 2014

Sorry, it was my bad. I made the test pass, but I didn't update the existing calls to FileRevs to use the new get_path method. The error @andygrunwald and @sduenas got is exactly what this pull request was trying to fix. Could you take another look?

@andygrunwald
Copy link
Contributor

So if have to enable Metrics Extension or Blame extension to see the new results, right?

@linzhp
Copy link
Contributor Author

linzhp commented Mar 4, 2014

Yes

发自我的 iPad

在 Mar 3, 2014,11:35 PM,Andy Grunwald [email protected] 写道:

So if have to enable Metrics Extension or Blame extension to see the new results, right?


Reply to this email directly or view it on GitHub.

@linzhp
Copy link
Contributor Author

linzhp commented Mar 13, 2014

Any problems with testing this pull request?

@andygrunwald
Copy link
Contributor

I executes it on our test repository with metrics extension enabled.
The Metrics error mentioned in (#65) is gone.
The metrics table is different.
Without patch it contains 11 rows. With patch it contains 7 rows.
Metrics for commit 2 (renamed file), 4 (deleted and renamed file) and 7 (rename aaa/otherthing) are missing.
But for commit id 8 (modify aaa/otherthing) the metrics seems to be correct (with loc of 1). The file aaa/otherthing.renamed contains only one line of text.

But there seems to be a change in behaviour.
If a commit renames a file, for this commit there are no new metrics. Without this patch, the metrics will be collected as well.
At the moment i do not know if this is correct. Of course, if a commit only renames a file, there are no metric changes.
But now there are only metric entries for commits who modifies something.
Without this pr there are metrics for touched files (independent from the action (deleted, renamed, ...))

In my point of view, the behaviour of the patch (metrics only if a file changes) is correct.
But existing analysis can be break with this new behaviour.
Did i miss something?

Conflicts:
	pycvsanaly2/extensions/Blame.py
	pycvsanaly2/extensions/FileRevs.py
	pycvsanaly2/extensions/Metrics.py
	tests/file_path_test.py
@linzhp
Copy link
Contributor Author

linzhp commented Mar 19, 2014

Interesting... I just merged the master branch into this branch, deleted all tables, and ran CVSAnalY and Metrics again, I could see 11 rows. Could you check out the branch https://github.com/linzhp/CVSAnalY/tree/concurrent_file_path to your local directory, and try again?

@andygrunwald
Copy link
Contributor

I made a bigger test today. Maybe this helps to understand my results.

My config file:

## Run in debug mode?
debug = True
#
## Run quiet
quiet = False
#
## Enable profiling
# profile = False
#
# repo_logfile = None
# save_logfile = None
# no_parse = False
#
## Database parameters
db_driver = 'mysql'
db_user = 'root'
db_password = None
# db_database = 'cvsanaly'
db_hostname = 'localhost'

## Extensions
## No extensions enable by default
extensions = ['Metrics']
#extensions = ['Content']
#
## Metrics extension options
# metrics_all = False
# metrics_noerr = False

My command:

$ ./cvsanaly2 --config-file ./config --db-database DATABASE /Users/andygrunwald/Development/tmp/input > master.log

Current master:
Output-File: https://gist.github.com/andygrunwald/9725721
MySQL-Dump: https://gist.github.com/andygrunwald/9725737
Shell-Output:

Error obtaining bbb/something@c0d66f92a95e31c77be08dc9d0f11a16715d1885. Command ['git', 'show', u'c0d66f92a95e31c77be08dc9d0f11a16715d1885:bbb/something'] returned 128 (fatal: Path 'bbb/something' does not exist in 'c0d66f92a95e31c77be08dc9d0f11a16715d1885'
)
Error obtaining aaa/otherthing.renamed@51a3b654f252210572297f47597b31527c475fb8. Command ['git', 'show', u'51a3b654f252210572297f47597b31527c475fb8:aaa/otherthing.renamed'] returned 128 (fatal: Path 'aaa/otherthing.renamed' exists on disk, but not in '51a3b654f252210572297f47597b31527c475fb8'.
)

Current pr/60:
Output-File: https://gist.github.com/andygrunwald/9725910
MySQL-Dump: https://gist.github.com/andygrunwald/9725917
Shell-Output:

Error obtaining bbb/something@c0d66f92a95e31c77be08dc9d0f11a16715d1885. Command ['git', 'show', u'c0d66f92a95e31c77be08dc9d0f11a16715d1885:bbb/something'] returned 128 (fatal: Path 'bbb/something' does not exist in 'c0d66f92a95e31c77be08dc9d0f11a16715d1885'
)

linzhp/CVSAnalY/tree/concurrent_file_path*
The same as Current pr/60 (expect of some cache hashes).

** Results **
As you see in the database results: Now there are 11 entries in the metrics table (at both versions).
But in the current master two errors were thrown + two entries of table metrics (ID 6 + 11) got "-1" at every column.
In the current PR #60 only one error was thrown + one entry of table metrics (ID 6) got "-1" at every column. ID 11 seems to be correct.
I attached all logs files (in debug mode). @linzhp maybe this helps to get the last error done? Or did i make a mistake during testing?

@linzhp
Copy link
Contributor Author

linzhp commented Mar 23, 2014

All you did seem good to me. As far as I can tell, the remaining error and the ID 6 were not introduced by this PR, and this PR didn't mean to fix bugs in metrics extension. So you may want to create another issue for them. I don't see any problem with this PR now.

@andygrunwald
Copy link
Contributor

You are right.
I will check the change in the SQL query again (was mentioned by @sduenas as well).

@andygrunwald
Copy link
Contributor

I checked everything.
There is only one small thing: The SQL query can return more than one record. But afaik this does not break anything. We can add a LIMIT 1 to this query later.
I will merge this change now.

A huge thanks to @linzhp for his endurance to fix and explain and answers of my testing :)
Thanks. I hope we can merge the other PRs soon :)

andygrunwald added a commit that referenced this pull request Mar 23, 2014
Fixed FileRevs to handle renaming in another branch
@andygrunwald andygrunwald merged commit d307b25 into MetricsGrimoire:master Mar 23, 2014
@linzhp
Copy link
Contributor Author

linzhp commented Mar 23, 2014

Adding LIMIT 1 to the query will break the functionality of this PR.

Thanks @andygrunwald for reviewing!

@linzhp linzhp deleted the concurrent_file_path branch March 23, 2014 18:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants