Newlines and eventing lines #667

enebo · 2023-03-27T14:30:29Z

enebo
Mar 27, 2023
Collaborator

[Note: I changed this to reflect discussions had in #757]

In all runtimes we need to know when to emit line number changes. Historically parsers would mark statements (stmts and top_stmts in CRuby/JRuby) with a flag indicating the node is a newline node. In compile.c or IRBuilder we track the last emitted line number, and when we see a newline node we emit its line only if it differs from the last one.
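A minimal sketch of that bookkeeping, in Python for brevity. The `Node` class and `emit_line_events` are hypothetical names for illustration, not code from compile.c or IRBuilder:

```python
from dataclasses import dataclass


@dataclass
class Node:
    line: int      # 1-based source line of the node
    newline: bool  # flag the parser sets on statement nodes


def emit_line_events(nodes):
    """Return the line numbers that would be emitted, in order."""
    emitted = []
    last_line = None
    for node in nodes:
        # Emit only for flagged nodes whose line differs from the last emitted one.
        if node.newline and node.line != last_line:
            emitted.append(node.line)
            last_line = node.line
    return emitted


nodes = [Node(1, True), Node(1, True), Node(2, False), Node(3, True)]
print(emit_line_events(nodes))  # [1, 3]
```

The second node on line 1 is skipped because its line was already emitted, and the node on line 2 is skipped because it is not flagged.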

Currently YARP has no concept of newline nodes, and it derives the line by binary searching a table of newline offsets for start_offset (so it doesn't store the line). The last PR opened for this would check this in all nodes (since it has no newline flag).

So both serialize and compile in YARP have an interest in figuring out when to emit lines and they can either share something built during parse or do something later.

Concerns:

1. Flag newlines, or just emit as we see elements change line?
2. Store line or always derive it?
   a. memory of parser
   b. time of binary search
3. Whose responsibility (parser OR serialize/compile)?
4. Is the old way too overzealous in marking newlines? (e.g. what is a newline anyways)

kddnewton · 2023-03-27T14:46:24Z

kddnewton
Mar 27, 2023
Maintainer

@enebo Yeah this is a great question, and something we've been trying to figure out for the compilation stage in CRuby as well.

At the moment we're doing something a little hokey where we use memchr to gather up a list of newlines, and then binary search through them to find the position. You can see that in place here in Ruby: https://github.com/Shopify/yarp/blob/1ef4b55ae630c5d3fdef80470c8fa6b5b79d8cae/lib/yarp/lex_compat.rb#L642-L655. It's done in C on the compiler branch.
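For readers following along, the approach can be sketched roughly as below. This is an illustration in Python, not the YARP code: `line_start_offsets` and `line_for_offset` are hypothetical names, and `bytes.find` stands in for memchr.

```python
from bisect import bisect_right


def line_start_offsets(source: bytes):
    """Byte offsets at which each line starts (line 1 starts at offset 0)."""
    starts = [0]
    pos = source.find(b"\n")
    while pos != -1:
        starts.append(pos + 1)  # the next line begins right after the newline
        pos = source.find(b"\n", pos + 1)
    return starts


def line_for_offset(starts, offset):
    """1-based line containing the byte offset, found by binary search."""
    return bisect_right(starts, offset)


source = b"a = 1\nb = 2\n\nputs a + b\n"
starts = line_start_offsets(source)
print(starts)                       # [0, 6, 12, 13, 24]
print(line_for_offset(starts, 13))  # 4
```

The table costs one integer per source line rather than one per node, which is the memory trade-off being discussed; the price is a logarithmic search on every lookup.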

I don't love this solution but don't have a better one yet. The issue is we don't want to throw line numbers onto every token and node, as it would add quite a bit of memory when we don't actually need them in that many places. So still trying to figure out the right solution here.

0 replies

enebo · 2023-03-27T15:24:02Z

enebo
Mar 27, 2023
Collaborator Author

@kddnewton yeah I cannot really say I fully remember the logic of when a node is marked newline but I think it is possible to only adorn the nodes which need them with this info. From a C struct side of things perhaps there are complications.

I was going to chat with you about saving space on serialization by changing the fields stored from start/end offsets to start offset, length. I have some other ideas for reducing size on serialization side of things as well.
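To illustrate why start offset plus length could be smaller, here is a sketch that assumes the blob uses a variable-length integer encoding (LEB128-style). That encoding is an assumption made for the example, not something stated about the serialization format, and `varint` is a hypothetical helper:

```python
def varint(n: int) -> bytes:
    """Encode a non-negative integer 7 bits at a time, low bits first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


start, end = 70000, 70012
as_start_end = varint(start) + varint(end)
as_start_length = varint(start) + varint(end - start)
print(len(as_start_end), len(as_start_length))  # 6 4
```

Lengths are usually small even deep into a large file, so they fit in fewer bytes than a second absolute offset. With fixed-width integers the saving only appears if the length field is given a narrower type.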

0 replies

enebo · 2023-03-27T15:25:49Z

enebo
Mar 27, 2023
Collaborator Author

Err. I should have added: a binary lookup of lines is, I think, fine for things like error reporting, so not all nodes may need line numbers (although if it works out that way then that is great). But error reporting is already a slow path activity so I am less concerned about it.

0 replies

eregon · 2023-04-04T13:23:57Z

eregon
Apr 4, 2023
Maintainer

> In all runtimes we need to know when to emit line number changes. In Impls we have a newline marker on AST nodes and the parser historically has determined that point. How do we know what this point is?

For the :line TracePoint and Coverage I guess? Any other need?
For these two it should be at most one node per line marked as such.
Maybe we could have a bit of the node type to mark that, or a special node type that just means the next node is marked as first node of the line.
Not sure how to represent that efficiently in Java, probably just a boolean?

We could also compute that on our own with an array I guess. That would likely need an extra tree walk, though, which feels suboptimal. Or maybe it could be done by Loader while deserializing, given nodes are evaluated in post-order there. That would be heavy in offset->line conversion though, so it seems better to serialize this info.

2 replies

eregon Apr 11, 2023
Maintainer

> Maybe we could have a bit of the node type to mark that, or a special node type that just means the next node is marked as first node of the line.
> Not sure how to represent that efficiently in Java, probably just a boolean?

We didn't discuss this on the call, but it seems the obvious way. I.e. use a boolean field in each node to mark as newline and for serialization we can just serialize it as the first byte after the node type. Or if we want to be more compact and have < 128 node types, use the high bit of the node type for that.
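A sketch of the compact variant, in Python for brevity. `pack_type` and `unpack_type` are hypothetical names, and the only assumption is the one stated above, that there are fewer than 128 node types:

```python
NEWLINE_BIT = 0x80  # high bit of the single type byte


def pack_type(node_type: int, newline: bool) -> int:
    """Combine a node type (< 128) and the newline flag into one byte."""
    if not 0 <= node_type < 128:
        raise ValueError("node type must fit in 7 bits")
    return node_type | (NEWLINE_BIT if newline else 0)


def unpack_type(byte: int):
    """Split a serialized type byte back into (node_type, newline)."""
    return byte & 0x7F, bool(byte & NEWLINE_BIT)


packed = pack_type(5, True)
print(hex(packed))          # 0x85
print(unpack_type(packed))  # (5, True)
```

This costs no extra bytes in the blob, at the price of a mask on every node read and a hard cap on the number of node types.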

enebo Apr 11, 2023
Collaborator Author

@eregon yeah. I don't know if you saw my outline, but I am not sure isNewline will change performance or correctness if the check is just a line != last_line sort of check. That appears to be what @HParker did in his patch, and I can try it and see if it matters or not. The actual logic was if line != last_line && node.isNewline. So just removing that boolean check potentially changes nothing; what remains is some other pretty simple checks.
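To make the comparison concrete, here is a small sketch of the two checks side by side. `emitted_lines` is a hypothetical helper and the node list is invented; whether real trees contain cases where the two differ in a way that matters is exactly the open question.

```python
def emitted_lines(nodes, require_flag):
    """nodes is a list of (line, is_newline) pairs in visit order."""
    emitted, last_line = [], None
    for line, is_newline in nodes:
        if line != last_line and (is_newline or not require_flag):
            emitted.append(line)
            last_line = line
    return emitted


# foo(      line 1, statement, flagged
#   bar)    line 2, argument, not flagged
# baz       line 3, statement, flagged
nodes = [(1, True), (2, False), (3, True)]
print(emitted_lines(nodes, require_flag=True))   # [1, 3]
print(emitted_lines(nodes, require_flag=False))  # [1, 2, 3]
```

In this made-up input the flag-free check emits an extra line for the argument on line 2, so the two only agree when every line change coincides with a flagged node.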

I closed this discussion but we can reopen if we have more to discuss here.

enebo · 2023-04-04T20:07:22Z

enebo
Apr 4, 2023
Collaborator Author

@eregon the most important one is just for :line and coverage. There are secondary ones like leaving line when exiting a module but those are not marked (although both MRI and JRuby store those end lines for that purpose).

MRI already has a flags field on node and so it marks nodes as newline. A long time ago they used to use a full node but it is expensive space wise. JRuby has a newline boolean field on Node. Both impls store start line and column in the node.

At worst I would prefer that serialize.c generates newline state and probably line numbers; more preferably, we figure out a way for the parser to do it without wasting space. For error reporting, where we need a line, doing the binary lookup is no big deal, but I don't think we want to do that search for every Ruby line in Java. A big draw of YARP is not needing stuff to warm up. The less bytecode we need the better.

0 replies

enebo · 2023-04-05T17:10:18Z

enebo
Apr 5, 2023
Collaborator Author

Linking PR: #757. A lot of comments happened related to this topic there.

0 replies

enebo · 2023-04-05T17:33:22Z

enebo
Apr 5, 2023
Collaborator Author

I updated description to match what I think we want to talk about.

I am going to add some more details that were also mentioned in #757 about serialization.

Some coverage needs an endLine. Those nodes are Module, Def, Defs, Lambda, Iter, For, SClass, Class, PreExe (sorry, JRuby node names 😄). This is static knowledge. For serialize, I personally would like line numbers put into these nodes. For compile?

Line for serialization. I wonder whether, instead of stmts, this is potentially static as well. Are all CallNodes always a newline? If so, since serialize uses config.yml we can mark those as line: true and then emit the line into the blob, so we do not have to generate and use a binary search table (warmup/execution in Java we want to keep low).
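A rough sketch of what config-driven line emission could look like, in Python. The `CONFIG` dict stands in for config.yml, the `line` key is the hypothetical marker from the paragraph above, and the node names and flat field list are only for illustration:

```python
# Stand-in for config.yml: which node types carry a line in the blob.
CONFIG = {
    "CallNode": {"line": True},
    "IntegerNode": {"line": False},
}


def serialize(node_type, start_offset, line):
    """Return the fields written for one node."""
    fields = [node_type, start_offset]
    if CONFIG[node_type]["line"]:
        fields.append(line)  # only marked node types pay for the line
    return fields


def load(fields):
    """Read one node back using the same config; line is None if absent."""
    node_type, start_offset = fields[0], fields[1]
    line = fields[2] if CONFIG[node_type]["line"] else None
    return node_type, start_offset, line


print(serialize("CallNode", 13, 4))     # ['CallNode', 13, 4]
print(serialize("IntegerNode", 4, 1))   # ['IntegerNode', 4]
print(load(serialize("CallNode", 13, 4)))  # ('CallNode', 13, 4)
```

Because the writer and the reader are both driven by the same config, the loader never needs the newline table for marked nodes.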

8 replies

enebo Apr 12, 2023
Collaborator Author

Yeah after talking I realized this is a difference between the two impls. I expect to see some differences in approach.

eregon Apr 12, 2023
Maintainer

Given TruffleRuby and I guess JRuby too plan to vendor YARP, we could maybe have some if in e.g. Loader.java.erb and serialize.c.erb based on some ENV var or other "global" config and that config could tell to serialize line or not and offsets or not, and that would then be encoded directly in the produced files (no runtime check).
I think that would be a good way to deal with small differences like this while sharing most of the logic.

enebo Apr 12, 2023
Collaborator Author

Yeah I am unsure. Syncing what helps us both is important but how to manage that is unclear. ENVs/ifs I think work up to a point so it might be where to start. Perhaps it will work out overall too.

I mentioned loading staticScope in the meeting, but the reason for wanting it in Loader was that it prevents walking the whole tree a second time to set up the scopes. I cannot just construct StaticScopes because they are also a tree of lexical variable relationships. I also mentioned capturing other values that are simple to add, like detecting assignments in a subtree (we emit some things like hashes differently based on conditions like that).

This is not to say it has to do this stuff as part of serialize/load but I am just pointing out we could deviate in approach over time to the point where we maybe don't want conditionals/envs.

enebo Apr 12, 2023
Collaborator Author

@eregon one semi-related thing to generation. JRuby needs to support yarp as a gem as well. So we will end up making equiv to cext for that. TR maybe can just use the cext but if perf is a concern there we may both need to make our own equivalents based on what the generic format of the serialized blob is.

eregon Apr 15, 2023
Maintainer

> This is not to say it has to do this stuff as part of serialize/load but I am just pointing out we could deviate in approach over time to the point where we maybe don't want conditionals/envs.

Yeah, I think let's see how it goes over time and let's start with sharing as much as possible and the simplest way like ENV vars (when they become needed).

Regarding the latest comment, I think #836 is the way.

enebo · 2023-04-11T17:54:40Z

enebo
Apr 11, 2023
Collaborator Author

This can be closed now in favor of #808 after having met about it.

0 replies
