Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implements #288: support for two additional version ids (<bid>b.min and <bid>b.orig) #309

Open
wants to merge 26 commits into
base: master
Choose a base branch
from

Conversation

rjust
Copy link
Owner

@rjust rjust commented Mar 11, 2020

I took a first stab at implementing the two new version ids (#288 ). I focused on backward compatibility to avoid breaking existing functionality in the 2.0.0 release.

I ended up using b.min and b.orig (as opposed to b-min and b-orig) because I found Lang-2b.min and -v2b.min easier to read.

I added a basic test for checking those new version ids.

rjust added 20 commits March 9, 2020 22:44
@rjust rjust requested review from Greg4cr and jose March 11, 2020 06:27
@Greg4cr
Copy link
Collaborator

Greg4cr commented Mar 11, 2020

@rjust - one minor suggestion/thought:

Should we also add version id <bid>f.orig as an alias for <bid>f? The ideas of "min" and "orig" have no meaning for the fixed version, but it could prevent accidental misuse.

I'm fine accepting this, as is, but the thought occurred to me.

@jose
Copy link
Collaborator

jose commented Mar 11, 2020

Thanks @rjust. Quick questions,

  1. Should we also provide the original/unminimized patches as we provide the minimal ones? I.e., defects4j/framework/projects/Chart/patches/1.src.patch and defects4j/framework/projects/Chart/patches/1.src.patch.orig? I'm aware that D4J can figure out the original patch at checkout, but I'm wondering whether that will be useful for others. Or perhaps we can create, yet another d4j-command to provide the original patch of a minimized patch.

  2. Don't we have to update the scripts/D4J commands that depend on the metadata of each patch? Let me give you an example. Suppose that I checkout the original buggy version of Chart-1 and then run the d4j-coverage command. Quoting the d4j-coverage command:

By default, code coverage is measured only for the classes modified by the bug fix.

In other words, by default, code coverage is measured only for the classes modified by the minimal bug fix. Because I checkout the original buggy version of Chart-1, shouldn't the d4j-coverage command measure coverage for the classes modified by the original bug fix instead?

@Greg4cr
Copy link
Collaborator

Greg4cr commented Mar 11, 2020

@jose @rjust - If we provide a .orig patch, we may also want to do provide .orig versions of modified_classes, loaded_classes, and relevant_tests as well. Any thoughts on this?

@rjust
Copy link
Owner Author

rjust commented Mar 11, 2020

Duplicating all meta data for the .orig versions seems a bit overkill and would grow the repo even more quickly. Given that most files would be identical for .min and .orig, maybe symlinks would be an option, though. Also, computing patches and the set of modified files is super quick and could be done on-the-fly.

I agree that the notion of instrumenting modified classes by default is confusing. We could fix this by more precisely documenting the default. Each command does support custom class lists, so a user can always override the default -- having such class lists precomputed would make this easier. Alternatively, we can compute the set of modified files prior to running a command rather than using pre-computed metadata. Computing the set of relevant tests would be more of an issue, though.

Based on the discussions that we had so far, it seems that we have the following goals:

  • Consistent defaults between .min and .orig: either (1) always use the .min metadata or (2) precompute (where necessary) metadata for .min and .orig
  • Minimize runtime: computing loaded classes and relevant tests on-the-fly is way too time-consuming.
  • Convenience: accessing metadata on Github (or locally) allows for quick sanity checks and analyses.

Here is a concrete proposal:

  1. Command: we add a top-level command that can export (any) metadata for a bug.
  2. Consistency: all commands use the metadata for the checked-out project version. For example, coverage will instrument different classes for .min and .orig.
  3. Meta-data repo: we move the precomputed metadata to a dedicated repository and add metadata for .min and .orig.

@Greg4cr, I am inclined to leave f.min and f.orig for later. We want to refactor b to be an actual alias rather than a dedicated version, we could tackle the f versions at the same time. This also allows us to get broader feedback first, which could influence future changes.

@Greg4cr
Copy link
Collaborator

Greg4cr commented Mar 12, 2020

@rjust - I like your proposal. If we have a command that can generate the metadata for the original version, we can actually keep our repo quite clean while still enabling access to data:

  • Our repo contains artifacts for the min version, as those require curation.
  • The artifacts for the orig version do not need curation, and can be extracted on-command.

@jose
Copy link
Collaborator

jose commented Mar 12, 2020

@rjust, I do agree with your proposal. It would be tricky to keep the metadata repository up-to-date and synchronized with the D4J repository but I think we can manage that. (I do prefer to have all metadata in place, i.e., without having to run any command or even clone D4J. It's just easier to consult metadata if it has been pre-computed.)

@rjust
Copy link
Owner Author

rjust commented Mar 12, 2020

@jose I was thinking about a submodule solution, but I agree it will be less convenient. As an alternative, having a snapshot of all pre-computed metadata in a dedicated repo (clearly linked from the D4J README) and keeping only necessary metadata in D4J proper might be acceptable.

Btw, I just noticed that the current metadata in D4J proper is not "complete" -- for example, modified_classes directory does not provide the list of modified test classes, but the patches directory does provide all test patches.

Let's group all the metadata into:

  • Curated (e.g., minimized src patches)
  • Pre-computed [instant] (e.g., modified files, original src patches, test patches)
  • Pre-computed [long-running]: (e.g., loaded classes, relevant tests)

We can put all of the above metadata (and possibly more) into a dedicated repo. We need to keep the curated files in D4J proper -- and probably want to keep the results of long-running analyses as well. I don't think we need to store metadata that can be instantly computed with a command (the dedicated metadata repo will still provide immediate access online). If we want to make pre-computed [instant] available locally, then the init script could populate the corresponding folders.

Any thoughts?

@rjust
Copy link
Owner Author

rjust commented Mar 12, 2020

@Greg4cr and @jose, for this particular PR, let's make a decision on what the semantics of the -r flag (test, coverage, etc.) should be for b.orig project versions and make sure that this is implemented correctly. This likely involves more code changes and possibly a few metadata changes.

I think that we should tackle the re-org of the metadata, too, but in another PR.

@jose
Copy link
Collaborator

jose commented Mar 13, 2020

IMO, the -r flag should use .orig metadata whenever the original buggy version is checkout and use .min metadata if the minimized buggy version is checkout. It might be a good idea to first address the re-org of the metadata in another PR and then revisit this.

@Greg4cr
Copy link
Collaborator

Greg4cr commented Mar 16, 2020

I agree with @jose - the current relevant_tests are generated based on the minimized patch. We should have a second relevant_tests generated for the original patch. If -r is used for the original version, we should run the relevant tests for that, not the ones calculated for the minimized version.

Given the growing complexity of the .orig/.min split, we may want to go ahead and tackle the creation of a metadata repo (or, at least, refactor of the metadata in our repo), then refactor to use that.

One suggestion for the metadata repo - maintain separate .orig and .min directories for each type of metadata. This could prevent each folder from becoming too messy. For a given project, we would have the following directory structure:

  • Project
    • failing_tests
    • loaded_classes
      • orig
      • min
    • modified_classes
      • orig
      • min
    • patches
      • orig
      • min
    • relevant_tests
      • orig
      • min
    • trigger_tests

(assuming we create static metadata files instead of generating on the fly)

@jose
Copy link
Collaborator

jose commented Mar 16, 2020

Yes, lets go ahead and address the metadata repository first. @Greg4cr, I like your suggestion.

@mernst
Copy link
Collaborator

mernst commented Aug 4, 2024

@rjust What is the status of this pull request?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants