Behavior regression CI #1642
Conversation
To me, this looks very nice, @3Rafal! @voodoos , what do you think?
A couple of general thoughts (that we can think about after merging this):
Even though this CI is more resource-efficient than the General Behavior CI will be, it will still be resource-heavy once we make it run on a significant sample set:
- Increase the number of samples per file (see one of the comments below).
- Add a larger/different code-base than only Irmin to act on.
- Add more Merlin queries to `merl-an`.
When to run the CI?
Given what I've just said, would you run this CI on every PR? Or would you enable it on concrete PRs via a GitHub flag or similar?
Which code base to act on?
About the point "Add a larger/different code-base than only Irmin to act on" above: So far, we've chosen the Irmin code-base to act on. The reasoning is that it seems an interesting code-base because of its deep functor depth and wide use of general module types. So it's particularly interesting wrt Merlin performance, but also wrt Merlin behavior. What other OCaml features are interesting to test Merlin behavior on, and which projects do we know that use them? E.g.
- Classes? Which projects use classes? The first ones that come to my mind are `eio` and `ppxlib`.
- Interesting type stuff? Which projects could we use to test that? My OCaml code-base knowledge is quite limited. I only know `ppxlib`'s `Version` module with "interesting type stuff". Maybe we can use Sherlocode to find something?
- Something else?
Which Merlin queries to test?
About the point "Add more Merlin queries to `merl-an`" above: The queries `merl-an` supports so far were chosen with performance in mind. For this CI, which tests whether Merlin gives a successful response, I can also think of `destruct` and `construct` as interesting. What do you two think?
```shell
cd irmin
git checkout 8da4d16e7cc8beddfc8a824feca325426bae08a9
sudo apt install -y gnuplot-x11 libgmp-dev pkg-config libffi-dev
opam install . --deps-only --with-test --no-checksums -y
```
Suggested change:

```diff
-opam install . --deps-only --with-test --no-checksums -y
+opam install . --deps-only --with-test -y
```

?
One of the Irmin deps (duff) fails on checksum, so I had to make it this way.
```
(env (_
  (binaries build-irmin)))
```
Do you happen to know if the `build-irmin` result gets cached? It takes about 5 min, so it's not a huge deal if it doesn't. If it's easy to do (or already the case), it would still be nice though, among other things to reduce the cost of adding more source code than Irmin to test on.
No, it doesn't. Big parts of our pipelines aren't cached. I don't know of an easy way to do this, but I agree it would be nice.
> I don't know of an easy way to do this, but I agree it would be nice.

If we had the script be called by the GH action rather than by Dune, could we make use of GH Actions caching? (It's a genuine question. I don't know myself either.)
Running this script via GH Actions vs. dune seems orthogonal to caching. I run it via dune because checking out the Irmin code in the test directory makes `dune test` run the Irmin tests as well. This could probably be fixed, but I didn't want to spend too much time on this.
Caching build dependencies can be done via a GH action, but I'm not sure about an actual implementation. It would be a nice optimisation, but realistically we would shave a couple of minutes off a build job that currently takes ~40 minutes. I don't think it's the most valuable thing to spend time on at this moment.
```
$ build-irmin >> /dev/null 2>&1
$ merl-an error-regression -s 1 --data=test-data 2> /dev/null
```
Suggested change:

```diff
-$ merl-an error-regression -s 1 --data=test-data 2> /dev/null
+$ merl-an error-regression --data=test-data 2> /dev/null
```
One sample per file is very little. `merl-an`'s default is 30. It'd be interesting to see how long the run takes with 30 samples per file.
(Side-note: this has reminded me that `merl-an` should distribute the samples better throughout the project by making the number of samples per file depend on the size of the file/AST. I've just opened an issue on `merl-an` to remember: pitag-ha/merl-an#28)
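The size-proportional sampling idea could look something like the following. This is a hypothetical sketch in Python, not `merl-an`'s actual implementation; the function name and the minimum-one-sample rule are assumptions:

```python
# Hypothetical sketch: split a global sample budget across files in
# proportion to their size, guaranteeing at least one sample per file.
def distribute_samples(file_sizes, budget):
    total = sum(file_sizes.values())
    return {
        name: max(1, round(budget * size / total))
        for name, size in file_sizes.items()
    }

sizes = {"a.ml": 100, "b.ml": 900}
print(distribute_samples(sizes, 30))  # larger files get more samples
```

Rounding means the per-file counts may not sum exactly to the budget; a real implementation would have to decide how to handle the remainder.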
Can we merge it like this and later use higher numbers? I think it's easier to review a smaller file. You can test bigger runs on different branches.
Also, I think that having more than 2.5k tests isn't small. I feel like we will quickly get diminishing returns.
Also, having `-s 30` means that I have to wait pretty long to get results, and it slows down my dev process. It's easier to iterate on something smaller, especially in the early stages.
> Also, I think that having more than 2.5k tests isn't small.

I agree that even with one sample per file, this isn't small. In fact, that's the reason why I've suggested making it a bit more realistic: that way, we could find out whether this approach scales or whether we should consider choosing a different one.

> You can test bigger runs on different branches.

That sounds good. Have you done that already?
If we merge it like this without investigating more, we just need to be aware that we might end up removing this and implementing a different approach later. I have a tendency to first investigate things and then merge them, but I know that both you and Thibaut like doing it the other way around, and that's ok as well.
I have tested this on my machine:

```
-- Results --
machine: Laptop with Intel i7 11 gen
sample size: 30 (default)
time:
real 203m8.524s
user 0m35.541s
sys 0m17.571
```

The results between CI and my dev env are comparable because we use a single thread.
> If we merge it like this without investigating more, we just need to be aware that we might end up removing this and implementing a different approach later.

I'm surprised that we are considering a different approach at this point. I thought we had already discussed this CI and agreed on its implementation. I don't see why we would need to remove it instead of simply adding some tweaks.

> I have a tendency to first investigate things and then merge them, but I know that both you and Thibaut like doing it the other way around and that's ok as well.

It's hard for me to agree with this remark. I think I have investigated a lot before creating this PR. What I'm suggesting is that this PR already provides value, so we can merge it and then improve it according to specific needs.
Oh, I'm sorry, I wasn't aware you had already tested it with sample-size 30. How big is the test file when you do that?
If you've already investigated enough on your side and are convinced that this approach is the right way to go, I'm happy to believe you and stop the thinking/discussion around the approach itself to focus only on details/concrete improvements. Is that what you were saying?
Artifact size:

```
$ ls -l --block-size=M
total 11M
-rwxr-xr-x 1 opam opam 8M Jun 25 23:43 commands.json
-rwxr-xr-x 1 opam opam 1M Jun 25 23:43 logs.json
-rwxr-xr-x 1 opam opam 2M Jun 25 23:43 results.json
```

It was measured without cmds, so it could be bigger (let's say 2x).
I think this approach looks promising. I can't guarantee that it will scale perfectly to all requirements. I think it's something we can ask @voodoos about too.
It would be better to put the test in a subfolder of the `tests` folder: `tests/regression`.
```yaml
pull_request:
push:
  branches:
    - master
```
We should restrict the CI's run to changes made to .ml / .mli files.
Since the CI only checks errors, and >30 min is quite a lot of time, maybe we should have it run only after PRs are merged, and revert them afterwards if it fails?
As in #1635