-
Notifications
You must be signed in to change notification settings - Fork 543
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
can't enlarge buffer, but no REJECT #620
Comments
You have defined "variable length" rules when you use constructions like As you've found, the input buffer is big but not huge. The challenge is biting off chunks of whitespace small enough that they fit in the buffer without needing to enlarge it. I'd refactor your scanner to use scopes, with |
I understand your answer to mean the behavior I described is expected. Thank you; I suspected as much. IMO there are problems with the documentation, and I'm not sure I understand your recommendation. The term "variable length rule" is not defined in the manual. There are warnings against variable-length trailing context, as a performance consideration. There is no indication that a simple "+" on any regex may cause a problem of any kind. (I have several persistent warnings from flex about "dangerous trailing context" that appear only in the context of a complex flex input file. My attempts to reduce the reported cases to a minimal example invariably result in a file for which flex reports no problem. For example, see, https://gitlab.cobolworx.com/COBOLworx/gcc-cobol/-/blob/master+cobol/gcc/cobol/scan.l?ref_type=heads) By "scope" I understand you to mean "start condition". By yybegin, which is not documented, I assume you mean BEGIN. I do not know what Is there any safe extent of "variable length" rules? Is If the pattern |
You're right, I left out a path component: examples/manual/string2.lex. |
Answering your broader question about safe lengths. You're safe using variable length repetitions if you can reasonably expect them to match text that's about the length of your other tokens. Whitespace, comments, and similar throw-away strings are hard to tokenize because they can run on so long. About the best we can do is delete them a character at a time or scan them into manageable chunks. What else... Yes, I said scope but meant start condition. yybegin and BEGIN are the same action. Flex user functions and actions are often prefixed with yy and the names are used interchangeably. You can set your own prefix that will replace yy if you ever need two independent scanners. But that's getting off topic. |
Thank you, I think I see my way home now. Just one more question. One useful tactic has been to remove trailing whitespace prior to scanning. If blank lines consist only of a single Unfortunately, my scanner must contend with many multi-string tokens such as That said, I wonder if the problem can be minimized by throwing memory at it. This problem is not, I think you agree, anything fundamental about regular expressions. After all,
works just fine. The problem is introduced, isn't it, by the buffer itself? Would I be better off setting the buffer size to, say, 10 MB, and telling the user that more than, oh, 5 MB of whitespace is disallowed? |
I'd think of the problem differently. Let Flex scan the words Now you can tell Flex to match and discard each single whitespace character it encounters. (Which results in Flex discarding all the whitespace it encounters.) Totally solves the problem of users trying to embed Brady & Morris Whitespace programs in their COBOL. |
Brady & Morris Whitespace takes me back. :-) It's a perfectly reasonable suggestion. What I have discovered over the years, though, is that There are many such, including I was serious, though. What harm in defining the input buffer 1000x bigger than currently? Memory is not a constraint. In my system, YY_INPUT is defined over a function that begins with mmap(2) over the whole input. It's a shame the input need be copied to the input buffer at all; it would be more efficient to use pointers into the mapped file. |
Flex supports systems that run everywhere, including on tiny hardware.
Minimal resource requirements is a design goal. You can edit your generated
scanner however you like, of course.
…On Wed, Feb 14, 2024, 15:51 James K. Lowden ***@***.***> wrote:
*Brady & Morris Whitespace* takes me back. :-)
It's a perfectly reasonable suggestion. What I have discovered over the
years, though, is that END serves numerous purposes in the COBOL grammar.
As I'm sure you appreciate, returning a unique END_PROGRAM token to the
parser simplifies the grammar.
There are many such, including LESS THAN OR EQUAL TO, which returns LE.
If the individual strings are returned as tokens, OR winds up both in the
midst of relation expressions *and* as their master in logical
expressions. Trapping the whole business in the scanner as a single string
(and token) was the way out of days spent in a maze of twisty shift/reduce
errors, all alike.
I was serious, though. What harm in defining the input buffer 1000x bigger
than currently? Memory is not a constraint. In my system, YY_INPUT is
defined over a function that begins with *mmap*(2) over the whole input.
It's a shame the input need be copied to the input buffer at all; it would
be more efficient to use pointers into the mapped file.
—
Reply to this email directly, view it on GitHub
<#620 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAVJXIP3BARP6P2O2ZPRCKTYTUPURAVCNFSM6AAAAABCNV5QGCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNBUGU3TANBVGQ>
.
You are receiving this because you commented.Message ID:
***@***.***>
|
I've been thinking about your reply, Joseph. It doesn't make sense to me, literally. And I wonder if the defaults are really what anyone wants. I want to enlarge my buffers, and I want to check with you guys if I'm going to regret it. In the generated scanner, we have this definition:
Why for IA64 only? It's hardly the only 64-bit chip, and it's obsolete. I am using
and I'll bet you there are 1000 of those for every IA64 still in use. is there something mysterious and arcane that makes it special? My plan is to define these in my input file:
It's not obvious to me why I should stop at 10; even if we supported a 32-bit processor, our compiler wouldn't mind a 1600 KB input buffer. I have to read the whole file anyway, and it's already mapped into memory (in preparation for scanning). Is there any reason that would dissuade me from taking that approach? (Kudos, btw, for using macros in this way. I redefined YY_FATAL_ERROR to good effect.) I would also like to renew my complaint, for clarity. It may be that variable length rules use some of the same machinery as REJECT, as you say. But they're hardly a "corner case". Thompson used To be clear, I appreciate my problem domain is unusual. COBOL programs are regularly subjected to preprocessing, long before the lexer sees them. Those preprocessors have a mean streak: they output valid Cobol (FSVO valid), but neither readability nor parsimony are evident design goals. Why one might output 1000 blank lines as part of an interpolation is more than I can understand. But, like so many programmers, they didn't ask me. You might say code flows downhill. |
You can enlarge your buffers. Look at the FAQ section in the manual for the question, "Is backing up a big deal?" There's information about the changes you'll need to back to the scanner and possibly to flexdef.h to support a larger state buffer. The defaults support systems that were written 30 or 40 years ago on lex and we still support them. I tried changing them once and got an earful from the community. I really don't know what's up with the IA64 special case. I suspect it had to do with a default memory page size, but I have no evidence. When that sort of magic number exists it almost always leads back to page sizes or superstition. Answering your main questions: Your problem isn't that unusual. You should be able to use a character-at-a-time solution to collecting your strings of whitespace. If you need to return the string of whitespace as a token, you can accumulate it in an action. |
I reluctantly accept this report will not yield anything, except perhaps (hopefully, still) a limited solution to my particular situation. Your advice for a misleading message -- and a design choice -- is not to ask of flex what it cannot do. That's clear enough. It is a bug, though, as you tacitly admit, because flex is not behaving as advertised. The message
does not say what it means; I'm sure you agree. But an accurate replacement,
is practically tautological. How can anyone define a too-large token without using the '*' or '+' operator? You said, "Flex [does not] generate regular expression parsers." You will forgive my confusion, because the casual observer might expect otherwise,
Here's the thing: the user who gets bitten by this is blindsided. It crops up long after the project has committed to flex and has composed a substantial ruleset. Although the user assiduously avoids REJECT, it's to no avail because the extend-the-buffer logic is moribund. Nothing but nothing warns him. In fact, under "The Default Memory Management" he is told the opposite. At least, he should be warned. WorkaroundsI don't think "Is backing up a big deal?" is the FAQ you mean. It references the 32000 state limit. Nor do I find any discussion of the input buffer size in On my system, "unnamed-faq-82" is the only one that mentions I would nevertheless like to pursue that course, unless there is good reason not to. I appreciate your frankness regarding IA64. It sounds to me like "once burnt, twice shy", and it's made you reluctant to change any basic parameter. But changing a default is always dicey. Making it configurable shouldn't be. Why is As for what I am able to do, are you sure? Of course, in the context of flex, dealing with whitespace "by hand" is one workaround. But are you prepared to tell the Bison user that his 1800-rule parser needs a few hundred more to deal with 42 multi-string tokens? And to help out with the ensuing shift/reduce ambiguities? I'd like to point out that on one hand, you call this a "corner case", while on the other, you say it's not that unusual. But there is no such thing as a corner case that's not unusual. This is flex's job, I'm sure you agree: recognize regular expressions, and act on them. Your answer to my bug report is that
will not work if the input it matches exceeds N characters, for some unspecified value of N. That you don't consider that a defect utterly astounds me. |
I am using flex 2.6.4. I compile with
-Ca
, following random advice on the Internet, because at some earlier juncture it solved some internal error. I also use:My scanner uses yyless(), but not yymore() or
REJECT
. It has 54 start conditions and reads only one input file.When confronted with input of about 40 KB of ASCII spaces (0x20), it produces the message:
It looks like the code that would expand the buffer is not generated:
My scanner has 249 rules involving whitespace. Fortunately, whitespace is never returned to the parser. While it's not a great deal of work to convert every
[[:space:]]+
to[[:space:]]{1,4096}
or similar, the result is a bit awkward and counterintuitive. Nor do I want to just redefine the buffer size statically, because some other file will have even more whitespace.Why does the scanner think REJECT is being used? How can I convince it to expand the buffer as needed?
The text was updated successfully, but these errors were encountered: