Show some output validator feedback #140

Open
wants to merge 1 commit into develop

Conversation

thorehusfeldt
Contributor

@thorehusfeldt thorehusfeldt commented Oct 21, 2019

If the output validator put something in feedbackdir/judgemessage.txt
or feedbackdir/teammessage.txt, then show (a summary of) that
information, preferring the former.

The summary is just a truncation of the txt-file, except if it
is recognisable as the output from the standard output validator,
in which case it is summarised more carefully.

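To make the idea concrete, here is a rough sketch (not the actual patch) of how such a digest could be read from the feedback directory; the helper name and the truncation limit are placeholders:

import os

def feedback_digest(feedbackdir, limit=50):
    # Hypothetical helper: return a short digest of the validator
    # feedback, preferring judgemessage.txt over teammessage.txt.
    for name in ('judgemessage.txt', 'teammessage.txt'):
        path = os.path.join(feedbackdir, name)
        if os.path.isfile(path):
            with open(path) as f:
                text = f.read().strip()
            if text:
                # Plain truncation; output from the standard output
                # validator could be special-cased here.
                return text if len(text) <= limit else text[:limit] + '...'
    return None
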
@austrin
Contributor

austrin commented Nov 9, 2019

Sorry for the slow response; various other duties have demanded my attention recently.

If I understand correctly, the purpose of this is to show a digest of the validator feedback even for submissions which get their expected judgement? Can you post a screenshot of what it is supposed to look like?

I'm not necessarily against adding this, but I'm slightly hesitant because, if anything, I think the submission summary is already a bit too verbose. Maybe there should be a command-line flag controlling whether or not this is shown?

@thorehusfeldt
Contributor Author

thorehusfeldt commented Nov 9, 2019

Here’s the intended output for an interactive problem, askmarilyn (basically the Monty Hall problem). The judge (or team) messages are shown in square brackets. The interesting part is the WA section, where you can see the (wrong) submissions implementing various broken strategies or protocols and triggering various judge/team messages. For instance, the WA submission “random door”, which just picks doors at random, receives 346 cars/drinks (out of 1000), which is too few (more than 600 are needed). During problem development, this increases my confidence that the WA submissions achieve full code coverage in the output validator and that the team messages make sense.

root@77c1d7592d08:/katdev/will-code-for-drinks-F2019# verifyproblem askmarilyn
Loading problem askmarilyn
Checking config
Checking statement
Checking validators
Checking graders
Checking data
Checking submissions
   AC sub sl.cpp (C++) OK: AC [Congratulations! You got 650 drinks, CPU: 0.08s @ tc secret/1-0]
   AC sub thore.py (Python 3) OK: AC [Congratulations! You got 697 drinks, CPU: 0.15s @ tc secret/2-1]
   Slowest AC runtime: 0.150, setting timelim to 1 secs, safety margin to 2 secs
   WA sub always_door_A.py (Python 3) OK: WA [Too few drinks: 330, tc secret/1-0, CPU: 0.04s @ tc secret/1-0]
   WA sub break_protocol.py (Python 2 w/PyPy) OK: WA [Round 1(1): Invalid door name: 1, tc secret/1-0, CPU: 0.10s @ tc secret/1-0]
   WA sub first_a.py (Python 3) OK: WA [Too few drinks: 471, tc secret/1-1, CPU: 0.13s @ tc secret/1-0]
   WA sub first_b.py (Python 3) OK: WA [Too few drinks: 469, tc secret/1-2, CPU: 0.15s @ tc secret/1-2]
   WA sub first_c.py (Python 3) OK: WA [Too few drinks: 476, tc secret/1-3, CPU: 0.19s @ tc secret/1-2]
   WA sub ignore_positive_hint.py (Python 3) OK: WA [Too few drinks: 0, tc secret/1-0, CPU: 0.16s @ tc secret/1-0]
   WA sub plays-1001-rounds.py (Python 3) OK: WA [Trailing output, tc secret/1-0, CPU: 0.06s @ tc secret/1-0]
   WA sub plays-1999-halfrounds.py (Python 3) OK: WA [Round 1000(2): No door guessed, tc secret/1-0, CPU: 0.07s @ tc secret/1-0]
   WA sub plays-999-rounds.py (Python 3) OK: WA [Round 1000(1): No door guessed, tc secret/1-0, CPU: 0.06s @ tc secret/1-0]
   WA sub plays-forever.py (Python 3) OK: WA [Trailing output, tc secret/1-0, CPU: 0.16s @ tc secret/1-0]
   WA sub random_door.py (Python 3) OK: WA [Too few drinks: 346, tc secret/1-0, CPU: 0.18s @ tc secret/1-0]
   WA sub silent-death.py (Python 2 w/PyPy) OK: WA [Round 1(1): No door guessed, tc secret/1-0, CPU: 0.03s @ tc secret/1-0]
   WA sub spam.py (Python 3) OK: WA [Round 1(1): Invalid door name: s, tc secret/1-0, CPU: 0.02s @ tc secret/1-0]
   TLE sub forever-silent.py (Python 2 w/PyPy) OK: TLE [tc secret/1-0, CPU: 4.00s @ tc secret/1-0]

@thorehusfeldt
Contributor Author

Here’s the output for a simpler, non-interactive problem. The judge message from the standard output validator is shown, in a truncated but still useful way, in square brackets.

Checking submissions
   AC sub sl.cpp (C++) OK: AC [CPU: 0.00s @ tc sample/1]
   AC sub tgif-py3.py (Python 3) OK: AC [CPU: 0.03s @ tc sample/3]
   AC sub tgif-simulate-py3.py (Python 3) OK: AC [CPU: 0.01s @ tc sample/1]
   Slowest AC runtime: 0.030, setting timelim to 1 secs, safety margin to 2 secs
   WA sub tgif_31days-py3.py (Python 3) OK: WA [1/1: judge: "not", team: ":(", tc secret/18, CPU: 0.01s @ tc sample/1]
   WA sub tgif_leapday-py3.py (Python 3) OK: WA [1/1: judge: "TGIF", team: "not", tc secret/32, CPU: 0.02s @ tc secret/14]
   WA sub tgif_too_unsure-py3.py (Python 3) OK: WA [1/1: judge: ":(", team: "not", tc sample/1, CPU: 0.01s @ tc sample/1]
   RTE sub tgif-feb29-py3.py (Python 3) OK: RTE [tc secret/32, CPU: 0.02s @ tc secret/22]
   TLE sub tgif-simulate-buggy-py3.py (Python 3) OK: TLE [tc sample/1, CPU: 2.99s @ tc sample/1]
   TLE sub tgif-simulate-py3.py (Python 3) OK: TLE [tc sample/1, CPU: 3.01s @ tc sample/1]
tgif tested: 0 errors, 0 warnings

@thorehusfeldt
Contributor Author

All of these examples show that “submission” has become “sub” and “test case” is abbreviated to “tc”, following the current conventions for data generator scripts.

@thorehusfeldt
Contributor Author

While we’re at it: the way verification summaries are shown by https://github.com/RagnarGrootKoerkamp/BAPCtools (colours, alignment) is a big improvement in readability. I’m happy to submit a PR with a similar result. (But I’m a big fan of not interrupting the workflows and UI conventions of experienced users. It’s highly plausible that I’m overlooking things, and I would prefer to have a lot more experience with actually using verifyproblem. I’m very much a newcomer.)

@thorehusfeldt
Contributor Author

thorehusfeldt commented Nov 9, 2019

As for the command-line flag, the cleanest way is an option to verifyproblem that takes a format string, with a sensible default. This avoids overloading verifyproblem with options. git’s pretty is a good model.

https://git-scm.com/docs/pretty-formats
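
For illustration only, such an option could be wired up roughly like this (the flag name and the default string are placeholders, not a concrete proposal):

import argparse

parser = argparse.ArgumentParser(prog='verifyproblem')
# Hypothetical flag, in the spirit of git's --pretty=<format>.
parser.add_argument('--submission-format',
                    default='%e submission %n (%l)',
                    help='format string for per-submission output lines')
args = parser.parse_args(['--submission-format', '%e/%n'])
print(args.submission_format)   # -> %e/%n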

@austrin
Contributor

austrin commented Nov 10, 2019

> While we’re at it: the way verification summaries are shown by https://github.com/RagnarGrootKoerkamp/BAPCtools (colours, alignment) is a big improvement in readability. I’m happy to submit a PR with a similar result.

That would be great! I've been meaning to do things like that for a long time but never get around to it. I also know that @RagnarGrootKoerkamp has been wanting to merge those features of BAPCtools into problemtools.

> As for the command-line flag, the cleanest way is an option to verifyproblem that takes a format string, with a sensible default. This avoids overloading verifyproblem with options. git’s pretty is a good model.

I agree; I was also thinking that a format string is the way to go. Something along the lines of git's pretty sounds perfectly reasonable to me.

> All of these examples show that “submission” has become “sub” and “test case” is abbreviated to “tc”, following the current conventions for data generator scripts.

I think the conventions you refer to are more localized than you realize, and I don't really like this part. But of course this becomes near-moot with the format string: then everyone can use whatever abbreviations they want.

@thorehusfeldt
Contributor Author

OK, I think I know what to do. The cleanest way would be to implement __format__ for (at least) SubmissionResult and SourceCode, supporting syntax like %e for the string currently in expected_verdict, so that, for instance, %e submission %n (%l) describes what is currently in the string variable desc (namely, AC submission binarysearch.py (Python 3)). Then all the strings created by Submissions.check_submission() should be given in this mini-language. This rewrite should be its own commit and simply maintain current functionality.
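
As a rough sketch of that mini-language (an illustrative stand-in, not the real SubmissionResult class):

class SubmissionResult:
    # Illustrative stand-in; the real class has many more fields.
    def __init__(self, expected_verdict, name, language):
        self.expected_verdict = expected_verdict
        self.name = name
        self.language = language

    def __format__(self, spec):
        # With no spec, reproduce the current 'desc' string.
        if not spec:
            spec = '%e submission %n (%l)'
        return (spec.replace('%e', self.expected_verdict)
                    .replace('%n', self.name)
                    .replace('%l', self.language))

r = SubmissionResult('AC', 'binarysearch.py', 'Python 3')
print(f'{r}')        # AC submission binarysearch.py (Python 3)
print(f'{r:%e/%n}')  # AC/binarysearch.py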

Once that is done, (some of) the formatting strings can become command-line options to verifyproblem.

Once that is done, I can add “my” team and judge message digests to the formatting mini language.

I suggest we close this PR and I task myself with (some version of) the above plan.

@thorehusfeldt
Contributor Author

Note to self (or others who understand the code better): the fields SubmissionResult.ac_runtime and SubmissionResult.ac_runtime_testcase seem never to be used and should maybe just be removed. (My guess is that runtime and runtime_testcase contain that information today.)

@austrin
Contributor

austrin commented Nov 11, 2019

Sounds good. It would probably be good to settle on the details of the formatting notation. It may be more Pythonesque to use {name} notation, e.g. {expected_result} submission {submission_name} ({submission_language}), and then format these using .format()? That would also automatically enable things like writing to a fixed field width to get alignment.
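
For example (the field names and values here are purely illustrative):

fmt = '{expected_result:<4} submission {submission_name:<22} ({submission_language})'
for expected, name, lang in [('AC', 'sl.cpp', 'C++'),
                             ('WA', 'random_door.py', 'Python 3')]:
    print(fmt.format(expected_result=expected,
                     submission_name=name,
                     submission_language=lang))
# Both lines print with the submission names left-aligned in a
# 22-character column, so the results line up.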

@thorehusfeldt
Contributor Author

thorehusfeldt commented Nov 11, 2019

I’m doing that right now. Here’s a teaser from my current code:

 fmt_string = '   {exp}/{name:<{width}} \033[;32m OK:\033[0;0m {result}'

But I have to play around with this a bit and field-test it before I make a concrete suggestion.

Any comment on ac_runtime_testcase and ac_runtime?

Using __format__ makes immediate sense as a method for SourceCode. Submissions, on the other hand, are a bit in limbo, not having their own object and not really knowing their own expected_result. (So the current desc variable can’t really turn into a __format__ method of any object.) But I have to familiarize myself with the code a bit more. It’s very … organic.

@thorehusfeldt
Contributor Author

thorehusfeldt commented Nov 12, 2019

After thinking a bit more: would you object to either of the following?

  1. Extend SourceCode with self.expected_verdict. Then check_submission in

    def check_submission(self, sub, args, expected_verdict, timelim, timelim_low, timelim_high):
    could lose one argument (because it’s a field of sub), but the primary value would be that SourceCode could get a sensible __format__ method that would default to the string currently in desc
    desc = '%s submission %s' % (expected_verdict, sub)

    i.e., “PAC submission brute_force.py (Python 3)”, but which could also become “partially_accepted/brute_force.py” or “the file brute_force.py which I tentatively assume should produce a PAC result based on its directory partially_accepted and which smells like Python 3 to me”, depending on options to __format__.

  2. If you dislike SourceCode objects knowing about their expected verdict, create class Submission(SourceCode) with the desired fields. (I think this is overkill, but it’s quite clean; see the sketch below.)
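
A bare-bones sketch of option 2, using a stand-in for the real SourceCode constructor (which I am glossing over here):

class SourceCode:
    # Minimal stand-in for problemtools' real SourceCode class.
    def __init__(self, name, language):
        self.name = name
        self.language = language

class Submission(SourceCode):
    # A SourceCode that also knows its expected verdict.
    def __init__(self, name, language, expected_verdict):
        super().__init__(name, language)
        self.expected_verdict = expected_verdict

    def __format__(self, spec):
        # With no spec, reproduce the current 'desc' string.
        if not spec:
            return '%s submission %s (%s)' % (
                self.expected_verdict, self.name, self.language)
        return spec.format(expected_verdict=self.expected_verdict,
                           name=self.name, language=self.language)

sub = Submission('brute_force.py', 'Python 3', 'PAC')
print(format(sub))                               # PAC submission brute_force.py (Python 3)
print(format(sub, '{expected_verdict}/{name}'))  # PAC/brute_force.py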

@austrin
Contributor

austrin commented Nov 12, 2019

Hmm, I think I would actually prefer the second option, since there are other things (validators) that are SourceCode instances which don't have an expected verdict, and I can imagine that at some point one wants to display their names.

@thorehusfeldt
Contributor Author

Where does the new class Submission live? problemtools/problemtools/run/submission.py? (Going this route would lead to quite a bit of refactoring. I’m not sure I have enough experience with the code to do that alone.)

@thorehusfeldt
Contributor Author

thorehusfeldt commented Nov 12, 2019

First proposal for the syntax of the formatting strings:

There are (at least) two formatting strings, SUBMISSION_FORMAT and AC_RESULT_FORMAT.

SUBMISSION_FORMAT has access to expected_result, language, directory and name. The default value is

SUBMISSION_FORMAT = '{expected_result} submission {name} ({language})'

Now for the problem: the format for results is much, much harder to specify, because not all fields are present in all cases. It would make sense to specify the two standard cases (AC, and anything else with a reason). Even then there would be whitespace issues. Check out these, just to get the ball rolling:

This would reproduce the current behaviour:

AC_RESULT_FORMAT = 'OK: {scored_verdict} CPU: {max_running_time:.2f} @ test case: {max_tc}'

And this would match the BAPCtools style:

AC_RESULT_FORMAT = ' \033[;32mOK: {verdict}\033[0;0m sum: {sum_running_time:.2f} max: {max_running_time:.2f} ({max_tc})'

And here is a suggestion with judge or team messages and Pythonesque truncation:

AC_RESULT_FORMAT = 'OK: {verdict:<5} {team_or_judge_message:.30} {max_running_time:.2f} ({max_tc})'

These examples show some of the fields I currently have in play.

But what about AC submissions with self.runtime == -1 or self.test_case == None? I’m not sure in which contexts these appear, but the current implementation carefully assembles the details depending on which of these fields are present. This is very hard to specify in Python’s format mini-language, unless we want to live with None in the output.

And what about non-AC submissions? Do we really need REJECTED_RESULT_FORMAT_WITH_REASON and REJECTED_RESULT_WITHOUT_REASON_BUT_POSITIVE_RUNTIME?
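
(For reference, here is a sketch of the kind of conditional assembly I mean, reconstructed from the sample output above rather than from the actual implementation; the helper name and arguments are made up:)

def result_details(verdict, reason=None, reason_tc=None,
                   runtime=None, runtime_tc=None):
    # Join only the pieces that are actually available.
    details = []
    if reason is not None:
        details.append(reason if reason_tc is None
                       else '%s, tc %s' % (reason, reason_tc))
    if runtime is not None and runtime >= 0:
        details.append('CPU: %.2fs' % runtime if runtime_tc is None
                       else 'CPU: %.2fs @ tc %s' % (runtime, runtime_tc))
    return verdict if not details else '%s [%s]' % (verdict, ', '.join(details))

print(result_details('WA', reason='Too few drinks: 346', reason_tc='secret/1-0',
                     runtime=0.18, runtime_tc='secret/1-0'))
# WA [Too few drinks: 346, tc secret/1-0, CPU: 0.18s @ tc secret/1-0]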

@thorehusfeldt
Contributor Author

thorehusfeldt commented Nov 13, 2019

Here’s a concrete suggestion for which keywords to recognise. Expected usage is a format string like this:

fmt = '{expected_verdict:<3} submission {name:<20}: {verdict_score:<5}, cpu: {max_time:1.2f} ({max_time_tc:<20}) {team_message:.20}'
      '''Return submission result in the specified format.

        The following keywords are recognised in the format string:

        keyword             type   examples
        verdict           * str    AC
        score               int    12
        verdict_score     * str    AC, AC(12)
        expected_verdict  * str    TLE
        name              * str    binarysearch.py
        directory         * str    accepted
        language          * str    Python 3
        max_time            float  .455411 
        total_time          float  10.125315
        max_time_tc         str    group1/021-n-100000.in
        reason              str    validator produced "score.txt" ...
        reason_tc           str    group2/03-overflow.in
        judge_message       str    l 43: int expected
        team_message        str    Congratulations! 534 drinks
        
        The starred keywords are never None; the others may be.
        When a numerical keyword (score, max_time, total_time) is
        None, it degrades gracefully to dashes but still respects
        formatting directives. For instance, {score:03d} will
        result in '012' if score is 12, and '---' if score is None.
        '''
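
One way to implement that graceful degradation (a sketch only, not a proposed problemtools implementation) is a string.Formatter subclass that substitutes dashes for None while reusing whatever width the directive asks for:

import re
import string

class DashingFormatter(string.Formatter):
    # Render None values as dashes, respecting width directives.
    def format_field(self, value, format_spec):
        if value is None:
            m = re.search(r'(\d+)', format_spec)  # crude width extraction
            return '-' * (int(m.group(1)) if m else 1)
        return super().format_field(value, format_spec)

fmt = DashingFormatter()
print(fmt.format('score: {score:03d}', score=12))    # score: 012
print(fmt.format('score: {score:03d}', score=None))  # score: ---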

Related: the current version of verifyproblem seems to be fine with reporting negative running times when max_runtime is None. For instance, on the input-free problem hello:

Slowest AC runtime: -1.000, setting timelim to 1 secs, safety margin to 2 secs

That’s way easier to implement, of course. Still, something is fishy in that part of the code, and I don’t quite understand the expected semantics of running times being -1, None, or something else.

@pehrsoderman
Contributor

@thorehusfeldt Can you rebase and clean up this PR as a preparation to get it merged?

@thorehusfeldt
Contributor Author

I’d be very happy to take this up again. My own conclusion is that there are two subtasks with well-defined but different goals:

  1. make the output of verifyproblem more readable and flexible (as per the conversation with Austrin in the thread above, from November 2019) by using some kind of formatting notation.
  2. actually show more validator feedback, which was the original motivation for the present PR.

1 precedes 2, and solving 1 well trivialises 2. So I guess the best way to proceed is to close the present pull request and task myself with solving 1 (resulting in a new pull request).

@pehrsoderman
Contributor

@thorehusfeldt Sounds like a good plan.

@thorehusfeldt
Copy link
Contributor Author

@pehrsoderman: I would need an answer to the question from 12 Nov 2019.

The conversation with @austrin points to the solution of introducing class Submission(SourceCode), which is a good idea in its own right, but requires some refactoring.

Would it be useful to have that as a first subgoal? The counterargument is that this would lead to quite a lot of changes to the code. (A thumbs-up is enough.)

@thorehusfeldt
Contributor Author

thorehusfeldt commented Jun 9, 2021

@pehrsoderman: I have now thought through one way of making verifyproblem’s output both more flexible and more informative: using Python’s already existing, and quite well-thought-out, logging framework. Doing so first involves some refactoring of the existing code (basically moving responsibility from the superclass ProblemAspect to a logger.py module), which I have done in PR #198.

Before continuing along this avenue I’d like you (or anybody else) to comment on that suggestion.

(A next step would then be to separate Submission from Submissions; the current architecture would not allow that, because Submission(ProblemAspect) would be confused about where to send its log messages (as errors in Submissions?), since the logger and the base class are coupled.)
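
For concreteness, a minimal sketch of the kind of per-aspect logger I have in mind (the logger name and format string are illustrative, not what PR #198 does verbatim):

import logging

# One named logger per problem aspect, configured centrally
# (e.g. from a logger.py module).
logging.basicConfig(format='%(levelname)s in %(name)s: %(message)s',
                    level=logging.INFO)

log = logging.getLogger('submissions')
log.warning('Too few drinks: 346 @ tc secret/1-0')
# -> WARNING in submissions: Too few drinks: 346 @ tc secret/1-0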
