[BUG] Strangeness with dataset.merge() #2015
Comments
Hi @daniel-falk, thank you for raising the issue and for the reproduction script. We'll take a look at these issues ASAP.
Hi @daniel-falk, we had some internal miscommunication that led to the merge strangeness. Long story short, merging will bring back samples that were popped on other branches, and there are some technical limitations that resulted in this behavior. Enabling merges to respect popped samples is not a problem if a merge happens only once. The technical problem occurs when a merge has already happened, another sample is then popped on one of the branches, and a merge happens again. One way for us to move forward is to leave that behavior alone (merge respects popped samples at the first merge, but not afterwards). Alternatively, we could go one step further and lock branches after they've been merged to main. Do you have any thoughts on which option is better from a user perspective?
Hey @daniel-falk.
Further down the line we got requests for merge, which doesn't follow assumption 1. We got around this by making merge commits act as independent commits in which all the history is replayed. With the introduction of pop, assumption 2 was also discarded. The way merge works is that we compare the ids of tensors in the commits and add the missing ids from the target commit to the original commit. The issue with pop is that it is hard to differentiate between an id that was added on target and an id that was deleted on original but not on target. Regarding hash values, we deliberately don't use them and prefer UUIDs instead, because we want to keep track of samples and how they change over time. For example, say a label was added with value 5 at index 10. The user then branches off to an alternate branch (alt1), and there, after a few pops in the dataset, the label is at index 7. After that it is merged into another branch (alt2), which didn't have the sample at all, at index 25. The value is then changed from 5 to 1000.
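To make the ambiguity concrete, here is a minimal sketch in plain Python. It is not Hub's actual implementation: `union_merge` and the sample ids are hypothetical, and the only thing taken from the description above is the idea of adding ids that are missing on the original commit.

```python
def union_merge(original_ids, target_ids):
    """Two-way merge: keep original as-is, then add ids that only exist on target."""
    present = set(original_ids)
    missing = [i for i in target_ids if i not in present]
    return list(original_ids) + missing


# Case A: "c" was appended on target after the branches diverged.
case_a_original = ["a", "b"]
case_a_target = ["a", "b", "c"]

# Case B: "c" existed before the branches diverged and was popped on original.
case_b_original = ["a", "b"]
case_b_target = ["a", "b", "c"]

merged_a = union_merge(case_a_original, case_a_target)
merged_b = union_merge(case_b_original, case_b_target)

# The inputs are identical, so the merge cannot tell the two histories apart
# and "c" comes back in both cases.
print(merged_a, merged_b)
```

With only the two current id lists to compare, the popped sample in case B is indistinguishable from the newly appended sample in case A.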
I see, I was wrong above stating that this
would prevent correct merging. The IDs are never changed when writing to the tensor by index; an ID is only created when appending or extending and only goes away on a pop. Merging should thus be fine unless a compute function was used. I understand the issue with pop and merge; I guess it is hard to avoid without using some kind of journal. Is the idea to use journals when overhauling the versioning? That would also open up for more advanced commands like rebasing a commit. It feels like there should be a good library for dealing with journals, since it's a common thing to do both in version control and in databases/filesystems.
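As a rough illustration of what a journal could look like, here is a small sketch in plain Python. Everything in it (`Op`, `replay`, the sample ids) is hypothetical and only meant to show the idea of recording appends and pops per branch and replaying them on merge; it says nothing about how Hub stores versions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    kind: str        # "append" or "pop"
    sample_id: str


def replay(ids, journal):
    """Apply recorded operations, in order, to a list of sample ids."""
    result = list(ids)
    for op in journal:
        if op.kind == "append":
            if op.sample_id not in result:
                result.append(op.sample_id)
        elif op.kind == "pop":
            if op.sample_id in result:
                result.remove(op.sample_id)
        else:
            raise ValueError(f"unknown operation: {op.kind}")
    return result


base = ["a", "b", "c"]
main_journal = [Op("pop", "c")]       # main popped "c" after branching
alt_journal = [Op("append", "d")]     # alt appended "d" after branching

# First merge: replay alt's journal on top of main's state.
main_state = replay(base, main_journal)
merged = replay(main_state, alt_journal)

# Second merge: alt pops "a" later; only the new journal entries are replayed.
alt_journal_since_merge = [Op("pop", "a")]
merged_again = replay(merged, alt_journal_since_merge)

print(merged, merged_again)
```

Because the pops are recorded explicitly, the popped sample "c" stays gone after the first merge and the later pop of "a" is still respected by the second merge.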
🐛🐛 Bug Report
To start with, awesome work getting the merge functionality implemented! I have used it for a while now and discovered a few issues with it:
Text tensors are not merged
When two branches are merged, where both branches have the same tensors, some tensors are empty after the merge. This applies to e.g. htype="text" and htype="generic". See example on colab here.
UI shows errors during a merge
If the ActiveLoop user interface is used while two branches are merging, the dataset fails to load and the error message "Dataset is corrupted or does not exist" is shown.
Exception about modification on read-only tensor during merge
While merging two branches I got this exception once. Rerunning the exact same code a second time worked.
Removed samples are not removed after a merge
Data samples that are removed on a branch do not get removed when that branch is merged.
This might be a design decision, that when merging it is the current state of the source branch that is merged, not the changes/deltas on the branch. This "union" behaviour feels very unnatural for a "merge" and is not how it works in e.g. git. I could not find any documentation explaining this behaviour.
See example on colab here.
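The difference between the two behaviours can be sketched in plain Python. This is only an illustration of "union" versus a git-like delta merge against the common ancestor; `three_way_merge` and the sample ids are hypothetical and not part of Hub.

```python
def three_way_merge(base, ours, theirs):
    """Merge id lists using the common ancestor, so removals on either side are kept."""
    base_set, ours_set, theirs_set = set(base), set(ours), set(theirs)
    # An id that was in base but is gone on either branch counts as removed.
    removed = (base_set - ours_set) | (base_set - theirs_set)
    merged = [i for i in ours if i not in removed]
    merged += [i for i in theirs if i not in ours_set and i not in removed]
    return merged


base = ["a", "b", "c"]            # state when the feature branch was created
main = ["a", "b", "c", "d"]       # main appended "d"
feature = ["a", "c", "e"]         # feature popped "b" and appended "e"

# "Union" behaviour: everything present on either branch survives.
union = main + [i for i in feature if i not in main]

# Delta behaviour: the pop of "b" on the feature branch is applied to main.
delta = three_way_merge(base, main, feature)

print(union)
print(delta)
```

In this sketch the union result still contains "b", while the delta result drops it and keeps both appended samples.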
Merged commits are not visible in history
One of the main advantages of version control for data is that I can look in the log and see what changes people have made to the dataset. This does not really work right now, since a merge only creates a "merged ..." commit message and does not show the commits that were made on the other branches. This means that with trunk-based development, where changes are always made on a feature branch, committed with a self-explanatory commit message and then merged into master, there will not be any useful commit messages on the main branch, only "merge ..." messages. I think we would really need something like gitk that shows the different commits on the branches that are merged.
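A small sketch of what I mean, in plain Python with a hypothetical `Commit` class (this is not Hub's data model): if a merge commit keeps a reference to both parents, a log can walk the whole graph instead of only the first-parent chain.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    commit_id: str
    message: str
    parents: list = field(default_factory=list)  # first parent = branch merged into


def first_parent_log(head):
    """Messages along the main line only; commits from merged branches are hidden."""
    messages, commit = [], head
    while commit is not None:
        messages.append(commit.message)
        commit = commit.parents[0] if commit.parents else None
    return messages


def full_log(head):
    """Depth-first walk over all parents, so merged-in commits are listed too."""
    seen, messages, stack = set(), [], [head]
    while stack:
        commit = stack.pop()
        if commit.commit_id in seen:
            continue
        seen.add(commit.commit_id)
        messages.append(commit.message)
        stack.extend(reversed(commit.parents))
    return messages


root = Commit("c0", "initial import")
main_1 = Commit("c1", "add validation split", [root])
feature_1 = Commit("c2", "relabel blurry images", [root])
merge = Commit("c3", "merged feature", [main_1, feature_1])

print(first_parent_log(merge))
print(full_log(merge))
```

The first log only shows the "merged feature" message plus main's own commits, while the second also lists "relabel blurry images" from the feature branch.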