Include files - overall situation #4022

Korporal · 2022-12-16T13:28:47Z

I've been looking at closed issues and fork activity related to supporting include files, that is, lexing/parsing languages like C, C++, PL/I and so on, where the source contains metadata referring to additional source.

I found several mentions, including a valiant attempt by a forker that looked substantial but never became a pull request.

What is the status of this feature, and what are people's views on it? Is it considered important, or is it a throwback not valued in a modern grammar?

kaby76 · 2022-12-17T00:10:22Z

Please add links to the Issues and forks you are referring to.

2 replies

Korporal Dec 17, 2022
Author

> Please add links to the Issues and forks you are referring to.

Sure, here it is: #306

kaby76 Dec 17, 2022

Thanks.

antlr4/tool/test/org/antlr/v4/test/TestLexerIncludeStrategy.java

Line 68 in ed16b7c

"[@4,6:13='COPY CD.',<1>,2:0]\n"+

I'm not sure this would help in implementing preprocessing for C or C++. First, the "COPY CD." gets inserted into the token stream, not replaced. (There might be a way to tell the lexer to ignore this token and continue with the included stream.) The equivalent in C and C++ would be an include directive, e.g., "#include ". The preprocessor does not emit the directive itself into the token stream. Further, the ISO spec says that although the preprocessor and post-processor lexers share some common rules, there are differences. So, in C/C++, the preprocessor and post-processor need to have completely different lexers and parsers.

Korporal · 2022-12-17T12:25:44Z

I think I can do this with the current tooling, though it's a little contrived. Essentially I'd define a small grammar that just parses out include statements. That could then be used to write a utility function that takes any source file and recursively identifies, opens and reads the included files, however deeply nested.

That utility could then be the basis for a class NestedTextReader that exposes a file and all its associated includes as a single stream, and we'd pass that stream to the usual tokenizer.

The Antlr lexer would then just work: it would read that stream, unaware there were any includes, and see one large stream as if each include had simply been copied and pasted into one source file.

In principle the NestedTextReader could be generated by Antlr, the user would supply a callback that implements the actual file IO.

Such an approach would have no impact on the existing lexing/parsing logic, it would just make a set of nested included files appear like a single file to the lexer.

Korporal · 2022-12-17T12:35:30Z

In my case I'm designing a grammar primarily based on PL/I (Subset G). This has gone very well: I can parse basic source now, have no reserved words, and can support keywords in different languages (English, Spanish etc.). It's looking very good.

Anyway, PL/I typically includes files with % include 'common'; for example. The actual file access, location etc. would be implementation defined in some way.

Korporal · 2022-12-17T20:40:30Z

While I'm on this subject, does the code (in my case C# code) generated by Antlr always read the entire source in one hit with ReadAllLines? Can we get the lexer to read the source stream, say, line by line or char by char? If so, I could pass an IEnumerable<char> into it and then I'd control the character enumeration, taking care of reading from nested includes; the lexer would just see a sequence of chars...
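To make the idea concrete, here is a minimal sketch of a character sequence that descends into included files as it goes. The names `NestedChars`, `Read`, the `open` callback and the `name` regex group are all hypothetical, and whether the Antlr C# runtime can consume such a sequence directly is not established here; the characters can always be gathered into a string first.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public static class NestedChars
{
    // Lazily yields the characters of `file`, descending into included files
    // whenever a line matches `include`. The file name is taken from a regex
    // group called "name", which is an assumption of this sketch.
    public static IEnumerable<char> Read(string file, Regex include, Func<string, TextReader> open)
    {
        using (var reader = open(file))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                var m = include.Match(line);
                if (m.Success)
                {
                    // The directive line itself is not yielded, only the included text.
                    foreach (var c in Read(m.Groups["name"].Value, include, open))
                        yield return c;
                }
                else
                {
                    foreach (var c in line)
                        yield return c;
                    yield return '\n'; // line endings are normalised to '\n'
                }
            }
        }
    }
}
```

Because the enumeration is lazy, an included file is only opened when the consumer reaches the directive that names it.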

2 replies

kaby76 Dec 17, 2022

Not sure what the runtime does, but maybe look at this code and other places

antlr4/runtime/CSharp/src/CharStreams.cs

Line 40 in 76fa05c

var pathContents = File.ReadAllText(path, encoding);

.

Korporal Dec 17, 2022
Author

> Not sure what the runtime does, but maybe look at this code and other places
>
> antlr4/runtime/CSharp/src/CharStreams.cs
>
> Line 40 in 76fa05c
>
> var pathContents = File.ReadAllText(path, encoding);

Thanks. Yes, it could be passed a class derived from TextReader, and the overridden ReadToEnd would do the work. It would read a file and then recursively open and read any included file. The consumer would see all lines, with the expansion of included files handled invisibly.

I think this is doable. As I mentioned, I'd create a tiny grammar that recognizes include directives for the language, and with the lexer generated from that grammar we'd be able to transform a file with nested include directives into a single file, held in an instance of this derived NestedTextReader. This seems doable, but...

If user code looked like this:

int counter;  #include "definitions.h"; // our std header

Now that's admittedly rare; I've never seen an include share a line with other code. But if we did encounter it, what would we do?

return a line:

int counter;

followed by all lines in the include followed by the line?

// our std header

It might lead to subtle problems, with line endings and stuff, then again if a C compiler can handle it, it must have some policy I guess...

Korporal · 2022-12-17T21:22:57Z

It seems that in C the # include <abc> or # include "xyz" must appear on its own dedicated line, so that makes this easy.

My own language isn't C but I can certainly do includes that way too, on their own on a single line...

Korporal · 2022-12-17T21:28:09Z

The file XXX

abcdef

included like this in file YYY

123
%include XXX;
789

would become

123
abcdef
789

That's what the lexer would see, no probs, include file support without any changes to Antlr...

Korporal · 2022-12-17T21:35:20Z

Actually this could be even simpler: just read the source line by line, and if we see a line literally beginning # include <abc>, for example, we open that include and start returning its content. Recognizing the include directive is trivial, so there's no need for a mini grammar...

Korporal · 2022-12-17T21:37:35Z

I'll write this tomorrow; it could become an Antlr utility class. Let me get it running in C# with my own language source files...

Korporal · 2022-12-18T16:54:30Z

@kaby76 - Hi, OK I have implemented this for my C# target. It is a simple pattern actually. The pattern could - in principle - be added to Antlr where it generates the target lexer code.

The pattern is that the consumer (the person leveraging the generated Antlr code) must provide a callback method that accepts a string, uses it to open the file, and returns a TextReader, e.g.

private static TextReader ReadFileCallback (string Filename)
{
    return File.OpenText($@"..\..\..\..\..\Antlr\{Filename}");
}

Then I created a simple class that derives from TextReader and overrides its ReadToEnd method:

using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public class NestedSourceReader : TextReader
{
    private readonly string sourceFile;
    private readonly Func<string, TextReader> fileReader;
    private readonly Regex regex;

    public NestedSourceReader(string SourceFile, string IncludePattern, Func<string, TextReader> FileReader)
    {
        fileReader = FileReader;
        sourceFile = SourceFile;
        regex = new Regex(IncludePattern);
    }

    // Returns the source file with every include directive replaced by the included file's text.
    public override string ReadToEnd()
    {
        StringBuilder builder = new StringBuilder();

        append_stream(sourceFile, builder);

        return builder.ToString();

        /* Internal recursive method. */

        void append_stream(string SourceFile, StringBuilder Builder)
        {
            // Dispose each reader once its file has been fully consumed.
            using (var rdr = fileReader(SourceFile))
            {
                var line = rdr.ReadLine();

                while (line != null)
                {
                    if (regex.IsMatch(line))
                    {
                        // temp code to get the filename part, this could be handled generically...
                        // Strips the quotes or the angle brackets, since the pattern may accept both forms.
                        var filename = line.Replace("#", "").Replace("include", "").Replace(";", "").Trim().Trim('"', '<', '>');

                        append_stream(filename, Builder);
                    }
                    else
                    {
                        Builder.AppendLine(line);
                    }

                    line = rdr.ReadLine();
                }
            }
        }
    }
}

Because that's a TextReader, we can pass it to Antlr's AntlrInputStream.

We create the NestedSourceReader like this:

NestedSourceReader reader = new NestedSourceReader("test_3.nr", "\\#include\\s*(<([^\"<>|\\b]+)>|\"([^\"<>|\\b]+)\")", ReadFileCallback);

That regex is just a contrived one that recognizes the C # include syntax. In my case I will actually use % include etc., but this is just a proof that this works.

Well, it does work. Antlr sees a single source text that is basically the source file with every include statement replaced by the content of the included file, and includes can include other includes as you'd expect.

It parses fine.

There are two issues here though. First, line numbering: the line numbers seen/reported/recorded by Antlr in its tree or diagnostic messages reflect the location inside the fully expanded text, rather than the line number in the actual file. Second, the regex test is weak because we could be reading a line that's part of a comment; only Antlr itself can distinguish. Now if it was a comment we're likely fine, because although the included file is expanded into the source text, it too will be commented out.

But for strings that contain text matching an include directive, this will break...

Now this code could be tidied up a bit. For example, rather than extracting the include file's name in the NestedSourceReader, we could pass the entire line, #include "somefile.h"; for example, into the user's ReadFile callback. That way, if the code were generated by Antlr, it would be free of user-specific include file manipulations; the callback would do the work of extracting the file name from the include statement text.
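As a rough sketch of that tidy-up, the callback below receives the whole directive line and extracts the file name itself. Everything here is hypothetical: the names `IncludeCallbackSketch`, `ExtractName` and `ResolveInclude`, the directive regex, and the in-memory dictionary that stands in for real file access. Text that is not a directive is assumed to already be a file name, so the same callback can open the root source file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public static class IncludeCallbackSketch
{
    // Hypothetical in-memory stand-in for the file system.
    private static readonly Dictionary<string, string> files = new Dictionary<string, string>
    {
        ["somefile.h"] = "int shared;\n"
    };

    // Matches #include "name" or #include <name>, allowing spaces around '#'.
    private static readonly Regex directive =
        new Regex("^\\s*#\\s*include\\s*(?:\"(?<quoted>[^\"]+)\"|<(?<angled>[^>]+)>)");

    // Returns the file name inside a directive line. Text that is not a
    // directive is assumed to be a plain file name and is returned trimmed.
    public static string ExtractName(string text)
    {
        var m = directive.Match(text);
        if (!m.Success)
            return text.Trim();
        return m.Groups["quoted"].Success ? m.Groups["quoted"].Value : m.Groups["angled"].Value;
    }

    // The user's callback: it is handed the entire directive line, not a file name.
    public static TextReader ResolveInclude(string text)
    {
        return new StringReader(files[ExtractName(text)]);
    }
}
```

With this split, the reader class only needs to detect that a line is a directive; how the directive maps to a file stays entirely in user code.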

I doubt that an equivalent implementation for other languages would be a big challenge either, so Java etc. could readily do this.

Does anyone have any thoughts on this at all? My own problem is more or less solved, but the line numbering is a minor hassle.
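One possible way to ease the line numbering hassle, sketched under assumptions: while expanding, record which file and line each emitted line came from, so a line number in the expanded text can be mapped back afterwards. `LineMappedExpander`, `Origin`, and the `name` regex group are hypothetical names, not part of Antlr.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public sealed class LineMappedExpander
{
    private readonly Func<string, TextReader> open;
    private readonly Regex include;
    private readonly StringBuilder text = new StringBuilder();
    private readonly List<(string File, int Line)> origins = new List<(string File, int Line)>();

    // includePattern must capture the file name in a group called "name"
    // (an assumption of this sketch).
    public LineMappedExpander(string includePattern, Func<string, TextReader> open)
    {
        this.open = open;
        include = new Regex(includePattern);
    }

    // The fully expanded source text built up so far.
    public string Text => text.ToString();

    // Maps a 1-based line number in the expanded text to its original file and 1-based line.
    public (string File, int Line) Origin(int expandedLine) => origins[expandedLine - 1];

    // Appends `file` to the expanded text, recursing into any included files.
    public void Expand(string file)
    {
        using (var reader = open(file))
        {
            string line;
            int lineNo = 0;
            while ((line = reader.ReadLine()) != null)
            {
                lineNo++;
                var m = include.Match(line);
                if (m.Success)
                {
                    Expand(m.Groups["name"].Value); // the directive line is not emitted
                }
                else
                {
                    text.Append(line).Append('\n');
                    origins.Add((file, lineNo));
                }
            }
        }
    }
}
```

Taking the XXX/YYY example from earlier in the thread, line 2 of the expanded text (abcdef) maps back to line 1 of XXX, and line 3 (789) maps back to line 3 of YYY. A line number reported by the lexer or parser could be passed through `Origin` before being shown to the user.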
