Replace all Interpolated*Node by InterpolatedStringNode ? #1708

eregon · 2023-10-18T12:07:51Z

eregon
Oct 18, 2023
Maintainer

I have been thinking about e.g. InterpolatedXStringNode and InterpolatedRegularExpressionNode, etc.
Those are basically the same as doing something on top of InterpolatedStringNode.
So another way to represent the AST would be:

code -> current -> suggestion
`abc` -> XStringNode("abc") -> XStringNode(StringNode("abc"))
`a#{'b'}c` -> InterpolatedXStringNode(...) -> XStringNode(InterpolatedStringNode(...))

The same for RegularExpressionNode and SymbolNode and InterpolatedMatchLastLineNode.

One concern might be that this effectively encourages creating a Ruby String for those nodes, even if the child is just a StringNode. That seems needed anyway, though, for Regexp#source, Symbol#name, etc. Implementations that care could create those strings lazily, storing only the bytes or so, and handle a StringNode child differently from an InterpolatedStringNode child.

Then XStringNode would mean just call Kernel#` with the string, which seems nice and clean.
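
To make the suggested shape concrete, here is a rough sketch in Java. All class and method names are hypothetical and simplified for illustration; they are not Prism's real node definitions, interpolated parts are reduced to nodes that each yield a string, and `backtick` is only a stand-in for Kernel#` that does not run anything.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified node classes for illustration; not Prism's real definitions.
final class XStringShapeSketch {
    interface Node {}

    static final class StringNode implements Node {
        final String value;
        StringNode(String value) { this.value = value; }
    }

    // Parts are simplified to nodes that each produce a string.
    static final class InterpolatedStringNode implements Node {
        final List<Node> parts;
        InterpolatedStringNode(Node... parts) { this.parts = Arrays.asList(parts); }
    }

    // Suggested shape: the x-string only wraps a string-producing child.
    static final class XStringNode implements Node {
        final Node child;
        XStringNode(Node child) { this.child = child; }
    }

    // Builds the string for either kind of child.
    static String evalString(Node node) {
        if (node instanceof StringNode) {
            return ((StringNode) node).value;
        }
        if (node instanceof InterpolatedStringNode) {
            StringBuilder sb = new StringBuilder();
            for (Node part : ((InterpolatedStringNode) node).parts) {
                sb.append(evalString(part));
            }
            return sb.toString();
        }
        throw new IllegalArgumentException("not a string-producing node");
    }

    // Stand-in for Kernel#`: describes the command instead of running it.
    static String backtick(String command) {
        return "run(" + command + ")";
    }

    // With the suggested shape, evaluating an x-string is one uniform step.
    static String evalXString(XStringNode node) {
        return backtick(evalString(node.child));
    }

    public static void main(String[] args) {
        XStringNode plain = new XStringNode(new StringNode("abc"));
        XStringNode interpolated = new XStringNode(new InterpolatedStringNode(
                new StringNode("a"), new StringNode("b"), new StringNode("c")));
        System.out.println(evalXString(plain));        // prints run(abc)
        System.out.println(evalXString(interpolated)); // prints run(abc)
    }
}
```

In this sketch the x-string handling no longer needs to know whether the content was interpolated; that distinction lives entirely in the child.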

OTOH there would likely be if conditions to handle e.g. a RegularExpressionNode, because if it's not interpolated then the result can be cached in the AST.
There would also be one extra node in the non-interpolated case, so that's probably a good argument against.
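
The kind of conditional meant here could look roughly like the following Java sketch. `RegexNode`, its `interpolated` field, and `compileCount` are hypothetical names used only for illustration, and `java.util.regex.Pattern` stands in for a Ruby Regexp.

```java
import java.util.function.Supplier;
import java.util.regex.Pattern;

final class RegexCacheSketch {
    // Hypothetical node: `source` yields the pattern text on each evaluation,
    // `interpolated` says whether that text can differ between evaluations.
    static final class RegexNode {
        final Supplier<String> source;
        final boolean interpolated;
        private Pattern cached;   // only populated when the node is not interpolated
        int compileCount = 0;     // instrumentation for the example

        RegexNode(Supplier<String> source, boolean interpolated) {
            this.source = source;
            this.interpolated = interpolated;
        }

        Pattern evaluate() {
            if (interpolated) {
                // The text may change, so compile on every evaluation.
                compileCount++;
                return Pattern.compile(source.get());
            }
            if (cached == null) {
                // Static text: compile once and keep the result on the node.
                compileCount++;
                cached = Pattern.compile(source.get());
            }
            return cached;
        }
    }

    public static void main(String[] args) {
        RegexNode fixed = new RegexNode(() -> "fo+", false);
        fixed.evaluate();
        fixed.evaluate();
        System.out.println(fixed.compileCount); // prints 1
    }
}
```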

Just a suggestion, WDYT?

enebo · 2023-10-18T13:32:46Z

enebo
Oct 18, 2023
Collaborator

I don't have a strong opinion here. I think all of them use flags, so I don't think consolidating these types will use more space. I have also already implemented them, so this is more work (but very, very little extra work).

This issue made me go back and audit the D*Node triplet and they are effectively the same code. Here is the change I just made:

    Operand buildDStr(Variable result, U[] nodePieces, Encoding encoding, boolean isFrozen, int line) {
        if (result == null) result = temp();

        Operand[] pieces = new Operand[nodePieces.length];
        int estimatedSize = 0;

        for (int i = 0; i < pieces.length; i++) {
            estimatedSize += dynamicPiece(pieces, i, nodePieces[i]);
        }

        addInstr(new BuildCompoundStringInstr(result, pieces, encoding, estimatedSize, isFrozen, getFileName(), line));

        return result;
    }

    Operand buildDSymbol(Variable result, U[] nodePieces, Encoding encoding, int line) {
        return copy(new DynamicSymbol(buildDStr(result, nodePieces, encoding, false, line)));
    }

    public Operand buildDXStr(Variable result, U[] nodePieces, Encoding encoding, int line) {
        return fcall(result, Self.SELF, "`", buildDStr(result, nodePieces, encoding, false, line));
    }

I guess I wonder now why frozen is false and not true for dsym and dxstr. Since the string is only an intermediate used internally it probably doesn't matter, but it does lead me to your main point...

If these types are only non-interpolated or immediate values (fixnum, bool, ...) then perhaps we do not need to construct a RubyString at all. We do for the DStr case, but even then we might be able to remove the dynamic aspect and create a single string during build/compile. Merging these into a single node type does not rule out this sort of optimization, but perhaps people will have a harder time seeing the possibility. Note: as for source(), we construct a new string per call to it. It appears to return a modifiable string, so it seems to need to be at least dup'd (ruby -e '/foo/.source[0] = "b"'). Symbol name is frozen and is supposed to stay the same string (or frozen at a minimum).

So the point of the last paragraph goes beyond making a static value from only non-interpolating strings. We can potentially make these nodes static whenever the result is known to be representable without making a Ruby string for it first. I think this is pretty rare, so I doubt it is worth it. It is still something possible if we are talking about optimizations.

From the JRuby perspective, unless I change the old parser to do the same thing, I will have these two extra tiny methods. I don't care either way. I have resisted moving the older parser towards prism just because it is simpler to compare with MRI's parser, but at some point I will probably start pushing on the tree so the building code keeps getting more in sync.

A case for this change: if you look at DSymbolNode in JRuby, you will see that it extends DNode and contains nothing but the required overridable methods, which more or less just indicate a type. It is largely similar to being a flag, but because it is the node type we do not have an extra if statement (DNode would need an extra conditional in processing that does not exist when it is a separate type). That is an extremely minor concern 😄

kddnewton · 2023-11-19T02:40:25Z

kddnewton
Nov 19, 2023
Maintainer

I'm not very into the idea of adding an extra node in the simple case. I'm assuming/hoping that the majority of use cases are going to be the simple option, in which case it would be adding a lot of nodes for not a lot of gain.

That being said, I'm sympathetic to what you're saying here. I actually have been having some thoughts lately that Interpolated* is confusing since there's not always interpolation. Especially once we land #1799. When that happens, any string concat is automatically going to become an interpolated string node, which is confusing.

The way I see it, we have a couple of options:

1. Leave things as they are. It's not a particularly bad option honestly. For the most part it's working just fine, the only real issue is that we're calling something interpolated but then requiring our consumers to check if there actually is interpolation to determine the right behavior.
2. Rename all Interpolated*Node nodes to MultiPart*Node, but make no other changes. This would indicate to the consumer that the node doesn't necessarily have interpolation, but wouldn't do anything else to help.
3. Add a flag to the Interpolated*Node that indicates whether or not they actually have interpolation.
4. Combine options 2 and 3: rename everything to MultiPart*Node and add the flag.
5. Create a new set of nodes. Something like MultiPart{MatchLastLine,String,Symbol,RegularExpression,XString}Node. These would be exactly the same nodes as their Interpolated*Node counterparts, but would indicate to the consumer that there isn't any interpolation.
6. The option Benoit mentions above, which is replacing the contents of every XString, RegularExpression, Symbol, String, MatchLastLine node to include a child node that is the actual contents.

As I said before, I'm not particularly keen on options 1 or 6. I think 2 is probably the simplest, but it's frustrating that we would be requiring all of our consumers to scan for interpolated content. 3 would be okay, but the naming is still confusing. 4 gets us closer to actually being intuitive naming. 5 is probably the most work, but every node can be consistently compiled/handled, so no surprise that's my favorite option.
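
As a rough illustration of what the flag in options 3 and 4 would buy a consumer, here is a minimal Java sketch. `MultiPartStringNode`, `Literal`, `Embedded`, and `hasInterpolation` are hypothetical names chosen for the example, not an actual Prism API; the flag is computed once at construction so consumers do not have to scan the parts themselves.

```java
import java.util.Arrays;
import java.util.List;

final class MultiPartFlagSketch {
    interface Part {}

    // A plain string piece, e.g. one of the literals in "foo" "bar".
    static final class Literal implements Part {
        final String text;
        Literal(String text) { this.text = text; }
    }

    // An embedded expression, e.g. the #{b} in "a#{b}c" (kept as source text here).
    static final class Embedded implements Part {
        final String expression;
        Embedded(String expression) { this.expression = expression; }
    }

    static final class MultiPartStringNode {
        final List<Part> parts;
        final boolean hasInterpolation; // hypothetical flag, set once by the producer

        MultiPartStringNode(Part... parts) {
            this.parts = Arrays.asList(parts);
            boolean found = false;
            for (Part part : parts) {
                if (part instanceof Embedded) found = true;
            }
            this.hasInterpolation = found;
        }

        // Consumers can take the static path after checking a single flag.
        String staticValue() {
            if (hasInterpolation) {
                throw new IllegalStateException("value is only known at runtime");
            }
            StringBuilder sb = new StringBuilder();
            for (Part part : parts) {
                sb.append(((Literal) part).text);
            }
            return sb.toString();
        }
    }

    public static void main(String[] args) {
        MultiPartStringNode concat = new MultiPartStringNode(new Literal("foo"), new Literal("bar"));
        System.out.println(concat.hasInterpolation); // prints false
        System.out.println(concat.staticValue());    // prints foobar
    }
}
```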

Thoughts?

cc @jemmaissroff, @eregon, @enebo, @seven1m

enebo · 2023-11-19T14:21:04Z

enebo
Nov 19, 2023
Collaborator

@kddnewton The naming changes don't bother me at all. Interpolated becoming MultiPart really makes no difference to me, but I can see that the term interpolated will bother some subset of people. If you want to eliminate the somewhat rare complaints that a non-interpolating string is marked as interpolating, then with the rename you will never have to hear someone mention it again.

I am ok with option 5, but with #1799 I am wondering: why not boil the ocean and combine all of these nodes during parse if there is no interpolation? Is it just the complexity of collecting pieces and then, when you realize there is no interpolation, having to make a different type? Or is the problem that we lose information about syntax? From a compilation/build perspective, preserving MultipartString in the tree instead of turning non-interpolating nodes into a String adds time and space we don't really want.

kddnewton · 2023-11-19T16:43:04Z

kddnewton
Nov 19, 2023
Maintainer

@enebo The issue is if we combined everything during parsing, there would be no way to round-trip. We lose the location information of the individual strings, and formatters/linters have no way of getting it back without referencing the source.
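
A small Java sketch of the information loss, under simplifying assumptions: `StringPart` and `fold` are hypothetical, and locations are reduced to start/end character offsets (end exclusive) into the source text `"foo" 'bar'`.

```java
import java.util.Arrays;
import java.util.List;

final class RoundTripSketch {
    // Hypothetical string piece with its source span (end is exclusive).
    static final class StringPart {
        final String value;
        final int start;
        final int end;

        StringPart(String value, int start, int end) {
            this.value = value;
            this.start = start;
            this.end = end;
        }
    }

    // Folds adjacent literals into one node at parse time.
    // Expects a non-empty list; only the outer span survives.
    static StringPart fold(List<StringPart> parts) {
        StringBuilder sb = new StringBuilder();
        for (StringPart part : parts) {
            sb.append(part.value);
        }
        int start = parts.get(0).start;
        int end = parts.get(parts.size() - 1).end;
        return new StringPart(sb.toString(), start, end);
    }

    public static void main(String[] args) {
        // Source text: "foo" 'bar'  (two adjacent literals with different quotes)
        List<StringPart> parts = Arrays.asList(
                new StringPart("foo", 0, 5),
                new StringPart("bar", 6, 11));
        StringPart folded = fold(parts);
        // prints foobar @ 0..11
        System.out.println(folded.value + " @ " + folded.start + ".." + folded.end);
    }
}
```

After folding, nothing in the node says where the first literal ended or which quote style each piece used, so a formatter or linter would have to go back to the source text to find out.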

enebo · 2023-11-19T17:14:27Z

enebo
Nov 19, 2023
Collaborator

@kddnewton ok. Info loss is a problem so long as parsing will only create a single universal tree. If we ever have more info-losing needs then we might want to consider splitting the tree between semantic and syntactic.
