Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implements #304: Adds deprecated-bugs.csv and refactors commit-db into active-bugs.csv #312

Merged
merged 44 commits into from
Jul 3, 2020

Conversation

Greg4cr
Copy link
Collaborator

@Greg4cr Greg4cr commented Mar 11, 2020

This PR fixes #304.

  • commit-db has been refactored into active-bugs.csv.
  • A header has been added to the first line of active-bugs.csv noting what each column represents.
  • All code handling active-bugs.csv has been adjusted to handle the header.
  • deprecated-bugs.csv has been added, containing the entries for all deprecated bugs.
  • deprecated-bugs.csv utilizes the same header as active-bugs.csv, with the addition of the D4J version where the bugs was deprecated and the reason for deprecation.
  • Constants.pm has been updated with constants representing the column names.

@Greg4cr Greg4cr requested review from jose and rjust March 11, 2020 12:42
@Greg4cr Greg4cr self-assigned this Mar 11, 2020
@rjust
Copy link
Owner

rjust commented Mar 11, 2020

@Greg4cr, thanks for implementing this. Can you please define the file names in Constants.pm as well and make sure that every script uses the constants? Linking to these constants in the pod (as opposed to using the file name) would also be good to avoid inconsistencies in case these file names change.

@Greg4cr
Copy link
Collaborator Author

Greg4cr commented Mar 12, 2020

@rjust -

I have pushed commits that:

  • Add constants for the filenames, and use that constant elsewhere.
  • Renamed the column names in the CSV files to match style and usage elsewhere in the framework.
  • Refactored test_export_command to use the method get_bug_ids instead of duplicating the code in that method.

I looked at the use of system calls to tail in the bug mining framework. There is only one use of tail in the bug mining framework, in initialize-project-and-collect-issues.pl. This is when we mine more bugs for an existing project, and are filtering commit hashes that already appear in the existing project. I don't think this is worth refactoring right now. I don't know if there is a good way to surface that information via DBI, and it represents a fairly specialized corner case. This is something we may want to file a bug report for and tackle as part of a broader clean-up of the bug mining code.

@jose
Copy link
Collaborator

jose commented Mar 12, 2020

Thanks @Greg4cr for implementing this.

  1. I'm wondering whether we want to rename commit-db to something else. This single change will break all my scripts and I believe others' as well.
  2. The brand new active-bugs.csv and deprecated-bugs.csv do include a header line, which will also break all my scripts (even if I go ahead and update all my scripts to handle 1).

How about we keep commit-db as it is and add an extra column status which could be active or deprecated?

@jose
Copy link
Collaborator

jose commented Mar 12, 2020

By the way, @Greg4cr, I noticed in a couple pull requests that you've been mentioning @rene instead of @rjust. :-)

@Greg4cr
Copy link
Collaborator Author

Greg4cr commented Mar 12, 2020

@jose Haha, oops! I should proofread the comments more thoroughly.

This change came out of discussions @rjust and I had earlier this week. We discussed the concerns you mentioned. At the time, this is what we decided on:

  • active_bugs.csv and deprecated_bugs.csv makes a clear distinction between current and deprecated metadata that commit-db does not.
  • Two files instead of one enforces separation between bugs that should be used and those that are no longer considered valid - but that we retain information on for historical reasons.
  • A header should be added to document what these fields actually represent. This helps reduce confusion and questions when using the framework.

Something that Rene and I discussed, and that I certain agree with, is that internal changes to Defects4J can and should go forward even if it would impact external scripts. Even if we didn't make this particular change, other changes to Defects4J will break someone's scripts. We shouldn't let that prevent changes to the framework.

I'm certainly willing to discuss alternative implementations, though. @rjust - thoughts?

@rjust
Copy link
Owner

rjust commented Mar 12, 2020

I have strong opinions about improving the internals of the framework such as separating active and deprecated bugs, consistent use of headers, consistent naming, removing code clones, etc., but I do agree that backward compatibility is a concern to keep in mind. IMO, the problem with commit-db (and why this file is used in so many wrapper scripts) is the lack of a simple, public API to obtain the list of valid project ids and bug ids -- we already floated some ideas for addressing this issue.

I am OK with leaving the commit-db file in the repo for now and clearly documenting that (1) it is deprecated and (2) will be removed in the future. At the same time, we should offer a clear alternative, which is not use the new file name instead and skip the header. If we want others to migrate away from using commit-db, which is not part of the public API, we should make sure that there is an alternative public API. Iterating over all bug ids should be as simple as for bid in $(defects4j <info, export, or new cmd> -p $pid). Note that v2.0.0 will break existing scripts that assume that bug ids correspond to line numbers in commit-db. Having documented a preferred way of obtaining valid project and bug ids would clearly help users.

framework/core/Constants.pm Outdated Show resolved Hide resolved
framework/core/Vcs.pm Outdated Show resolved Hide resolved
framework/core/Vcs.pm Outdated Show resolved Hide resolved
framework/core/Vcs.pm Outdated Show resolved Hide resolved
framework/test/test_bug_mining.sh Outdated Show resolved Hide resolved
framework/test/test_gen_tests.sh Outdated Show resolved Hide resolved
framework/test/test_verify_bugs.sh Show resolved Hide resolved
@Greg4cr
Copy link
Collaborator Author

Greg4cr commented Mar 13, 2020

@rjust @jose

I am OK with leaving the commit-db file in the repo for now and clearly documenting that (1) it is deprecated and (2) will be removed in the future.

I agree. We can include the commit-db as well, with a disclaimer that it will be deprecated and removed in the future.

At the same time, we should offer a clear alternative, which is not use the new file name instead and skip the header. If we want others to migrate away from using commit-db, which is not part of the public API, we should make sure that there is an alternative public API. Iterating over all bug ids should be as simple as for bid in $(defects4j <info, export, or new cmd> -p $pid).

I also agree with this. I was initially thinking about this as an extension of defects4j info, but it might be clumsy to build it on top of that existing command. Perhaps it would be better to have a new command built around exporting info from active/deprecated bugs files:
defects4j bug-query -p $pid -q $query
(or some other name)

  • Basic call (just project, no query) would print the entire active-bugs.csv for a project
  • $query is a comma separated list of fields. These must correspond to the field names defined in the CSV header. This allows definition of a custom set of fields to be returned
    • For example, "bug_id" would return just bug IDs, "bug_id,bug_report_id" would return bug IDs and issue tracker IDs.
  • Additional flags then control active/deprecated inclusion:\
    • No flag prints info on active bugs only.
    • -D prints info on deprecated bugs only.
    • -A returns both active and deprecated bugs.
  • -H includes a header in what is printed. By default, a header will not be printed.

@Greg4cr
Copy link
Collaborator Author

Greg4cr commented Mar 13, 2020

I added another commit making minor tweaks (see @rjust's questions above) and restoring the commit-db files (with the intention that they are deprecated and will later be phased out, and that the framework makes use instead of active-bugs.csv).

@bkushigian
Copy link
Contributor

This makes sense.

@Greg4cr
Copy link
Collaborator Author

Greg4cr commented Apr 30, 2020

@rjust @jose

Some updates:

README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
framework/bin/d4j/d4j-bugs Outdated Show resolved Hide resolved
framework/bin/d4j/d4j-bugs Outdated Show resolved Hide resolved
framework/bin/d4j/d4j-bugs Outdated Show resolved Hide resolved
framework/bin/d4j/d4j-bugs Outdated Show resolved Hide resolved
framework/bin/d4j/d4j-bugs Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
@rjust rjust self-requested a review July 3, 2020 02:38
@rjust rjust merged commit fe7c3c6 into master Jul 3, 2020
@rjust rjust deleted the bugs-csv branch July 3, 2020 02:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

We should add a "deprecated-commit-db"
4 participants