Avoid boxing in Framing #1247

JD557 · 2024-04-03T20:30:49Z

Adds a specialized indexOf(Byte, Int) to all ByteString implementations, similar to what is done on the ByteIterator, in order to avoid boxing/unboxing when searching for a byte.

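This speeds up framing quite a bit: in the benchmark below, throughput increases by roughly 1.9x to 2.3x for 512- and 1024-byte messages, while the gain for 32-byte messages ranges from about 1% to 23%.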

Benchmark results (Java 11, Scala 2.13, Apple M3 Max):

Old

Benchmark                 (framePerSeq)  (messageSize)   Mode  Cnt         Score         Error  Units
FramingBenchmark.framing              1             32  thrpt    3  10113843.601 ±  347860.528  ops/s
FramingBenchmark.framing              1             64  thrpt    3   9649511.788 ±  275242.133  ops/s
FramingBenchmark.framing              1            128  thrpt    3   8121220.174 ± 1320460.364  ops/s
FramingBenchmark.framing              1            256  thrpt    3   5466330.854 ± 3260665.704  ops/s
FramingBenchmark.framing              1            512  thrpt    3   2859562.317 ±  489750.599  ops/s
FramingBenchmark.framing              1           1024  thrpt    3   1594647.037 ±   52601.259  ops/s
FramingBenchmark.framing              8             32  thrpt    3   1772199.560 ±   36471.572  ops/s
FramingBenchmark.framing              8             64  thrpt    3   1595775.427 ±   94933.072  ops/s
FramingBenchmark.framing              8            128  thrpt    3   1074911.259 ±  198661.935  ops/s
FramingBenchmark.framing              8            256  thrpt    3    715528.264 ±  117832.304  ops/s
FramingBenchmark.framing              8            512  thrpt    3    388890.655 ±   23628.881  ops/s
FramingBenchmark.framing              8           1024  thrpt    3    218333.423 ±    3502.994  ops/s
FramingBenchmark.framing             16             32  thrpt    3   1073203.353 ±   56959.738  ops/s
FramingBenchmark.framing             16             64  thrpt    3    835697.061 ±   61190.351  ops/s
FramingBenchmark.framing             16            128  thrpt    3    620562.797 ±    4529.555  ops/s
FramingBenchmark.framing             16            256  thrpt    3    380776.167 ±    3069.815  ops/s
FramingBenchmark.framing             16            512  thrpt    3    201483.021 ±   27772.140  ops/s
FramingBenchmark.framing             16           1024  thrpt    3    110883.293 ±    2468.349  ops/s
FramingBenchmark.framing             32             32  thrpt    3    589338.870 ±   48664.767  ops/s
FramingBenchmark.framing             32             64  thrpt    3    438839.627 ±   98301.907  ops/s
FramingBenchmark.framing             32            128  thrpt    3    318200.080 ±    3745.204  ops/s
FramingBenchmark.framing             32            256  thrpt    3    196484.642 ±    1439.287  ops/s
FramingBenchmark.framing             32            512  thrpt    3    102559.681 ±    1066.612  ops/s
FramingBenchmark.framing             32           1024  thrpt    3     55713.288 ±    1021.259  ops/s
FramingBenchmark.framing             64             32  thrpt    3    260941.068 ±   11029.238  ops/s
FramingBenchmark.framing             64             64  thrpt    3    213252.824 ±    2763.842  ops/s
FramingBenchmark.framing             64            128  thrpt    3    152106.933 ±    3544.177  ops/s
FramingBenchmark.framing             64            256  thrpt    3     97187.752 ±   16306.733  ops/s
FramingBenchmark.framing             64            512  thrpt    3     51460.170 ±     851.556  ops/s
FramingBenchmark.framing             64           1024  thrpt    3     27853.852 ±    1002.975  ops/s
FramingBenchmark.framing            128             32  thrpt    3    143485.697 ±    2294.380  ops/s
FramingBenchmark.framing            128             64  thrpt    3    108585.695 ±    1580.502  ops/s
FramingBenchmark.framing            128            128  thrpt    3     74709.074 ±    1104.879  ops/s
FramingBenchmark.framing            128            256  thrpt    3     49446.714 ±    3098.863  ops/s
FramingBenchmark.framing            128            512  thrpt    3     25852.970 ±    2643.999  ops/s
FramingBenchmark.framing            128           1024  thrpt    3     13979.600 ±     221.422  ops/s

New

Benchmark                 (framePerSeq)  (messageSize)   Mode  Cnt         Score        Error  Units
FramingBenchmark.framing              1             32  thrpt    3  12488386.857 ± 550524.197  ops/s
FramingBenchmark.framing              1             64  thrpt    3  10825562.059 ± 619829.640  ops/s
FramingBenchmark.framing              1            128  thrpt    3  10298423.557 ± 379279.669  ops/s
FramingBenchmark.framing              1            256  thrpt    3   7195103.283 ± 489354.505  ops/s
FramingBenchmark.framing              1            512  thrpt    3   5318261.366 ± 271701.067  ops/s
FramingBenchmark.framing              1           1024  thrpt    3   3380494.654 ± 293146.013  ops/s
FramingBenchmark.framing              8             32  thrpt    3   1876127.541 ± 228271.371  ops/s
FramingBenchmark.framing              8             64  thrpt    3   1777053.701 ± 146526.586  ops/s
FramingBenchmark.framing              8            128  thrpt    3   1430280.685 ±  97618.551  ops/s
FramingBenchmark.framing              8            256  thrpt    3   1130490.909 ± 134372.173  ops/s
FramingBenchmark.framing              8            512  thrpt    3    784673.367 ±  42187.315  ops/s
FramingBenchmark.framing              8           1024  thrpt    3    496979.878 ±  65898.234  ops/s
FramingBenchmark.framing             16             32  thrpt    3   1085230.634 ±  80359.239  ops/s
FramingBenchmark.framing             16             64  thrpt    3    972758.829 ±  47127.839  ops/s
FramingBenchmark.framing             16            128  thrpt    3    794532.378 ±  35834.092  ops/s
FramingBenchmark.framing             16            256  thrpt    3    601945.868 ± 120485.938  ops/s
FramingBenchmark.framing             16            512  thrpt    3    414966.392 ± 187654.169  ops/s
FramingBenchmark.framing             16           1024  thrpt    3    259000.528 ±  58471.624  ops/s
FramingBenchmark.framing             32             32  thrpt    3    593169.099 ±   4840.018  ops/s
FramingBenchmark.framing             32             64  thrpt    3    528762.400 ±  23527.622  ops/s
FramingBenchmark.framing             32            128  thrpt    3    429482.997 ±  13070.825  ops/s
FramingBenchmark.framing             32            256  thrpt    3    321971.065 ±  35385.581  ops/s
FramingBenchmark.framing             32            512  thrpt    3    214747.165 ±  34767.609  ops/s
FramingBenchmark.framing             32           1024  thrpt    3    130871.717 ±  30464.051  ops/s
FramingBenchmark.framing             64             32  thrpt    3    318087.922 ±   7819.818  ops/s
FramingBenchmark.framing             64             64  thrpt    3    259738.166 ±   9262.097  ops/s
FramingBenchmark.framing             64            128  thrpt    3    208251.362 ±   9641.613  ops/s
FramingBenchmark.framing             64            256  thrpt    3    154602.853 ±  17896.718  ops/s
FramingBenchmark.framing             64            512  thrpt    3    101678.754 ±  20105.737  ops/s
FramingBenchmark.framing             64           1024  thrpt    3     62230.188 ±  19234.156  ops/s
FramingBenchmark.framing            128             32  thrpt    3    150634.966 ±  15251.295  ops/s
FramingBenchmark.framing            128             64  thrpt    3    127661.514 ±   2498.963  ops/s
FramingBenchmark.framing            128            128  thrpt    3    105351.600 ±   6728.938  ops/s
FramingBenchmark.framing            128            256  thrpt    3     80119.663 ±  11005.101  ops/s
FramingBenchmark.framing            128            512  thrpt    3     52383.251 ±  11319.119  ops/s
FramingBenchmark.framing            128           1024  thrpt    3     28313.662 ±  11664.417  ops/s

Both (Interleaved)

Benchmark                     (framePerSeq)  (messageSize)   Mode  Cnt         Score        Error  Units
FramingBenchmark.framing_old              1             32  thrpt    3  10113843.601 ±  347860.528  ops/s
FramingBenchmark.framing_new              1             32  thrpt    3  12488386.857 ± 550524.197  ops/s
FramingBenchmark.framing_old              1             64  thrpt    3   9649511.788 ±  275242.133  ops/s
FramingBenchmark.framing_new              1             64  thrpt    3  10825562.059 ± 619829.640  ops/s
FramingBenchmark.framing_old              1            128  thrpt    3   8121220.174 ± 1320460.364  ops/s
FramingBenchmark.framing_new              1            128  thrpt    3  10298423.557 ± 379279.669  ops/s
FramingBenchmark.framing_old              1            256  thrpt    3   5466330.854 ± 3260665.704  ops/s
FramingBenchmark.framing_new              1            256  thrpt    3   7195103.283 ± 489354.505  ops/s
FramingBenchmark.framing_old              1            512  thrpt    3   2859562.317 ±  489750.599  ops/s
FramingBenchmark.framing_new              1            512  thrpt    3   5318261.366 ± 271701.067  ops/s
FramingBenchmark.framing_old              1           1024  thrpt    3   1594647.037 ±   52601.259  ops/s
FramingBenchmark.framing_new              1           1024  thrpt    3   3380494.654 ± 293146.013  ops/s
FramingBenchmark.framing_old              8             32  thrpt    3   1772199.560 ±   36471.572  ops/s
FramingBenchmark.framing_new              8             32  thrpt    3   1876127.541 ± 228271.371  ops/s
FramingBenchmark.framing_old              8             64  thrpt    3   1595775.427 ±   94933.072  ops/s
FramingBenchmark.framing_new              8             64  thrpt    3   1777053.701 ± 146526.586  ops/s
FramingBenchmark.framing_old              8            128  thrpt    3   1074911.259 ±  198661.935  ops/s
FramingBenchmark.framing_new              8            128  thrpt    3   1430280.685 ±  97618.551  ops/s
FramingBenchmark.framing_old              8            256  thrpt    3    715528.264 ±  117832.304  ops/s
FramingBenchmark.framing_new              8            256  thrpt    3   1130490.909 ± 134372.173  ops/s
FramingBenchmark.framing_old              8            512  thrpt    3    388890.655 ±   23628.881  ops/s
FramingBenchmark.framing_new              8            512  thrpt    3    784673.367 ±  42187.315  ops/s
FramingBenchmark.framing_old              8           1024  thrpt    3    218333.423 ±    3502.994  ops/s
FramingBenchmark.framing_new              8           1024  thrpt    3    496979.878 ±  65898.234  ops/s
FramingBenchmark.framing_old             16             32  thrpt    3   1073203.353 ±   56959.738  ops/s
FramingBenchmark.framing_new             16             32  thrpt    3   1085230.634 ±  80359.239  ops/s
FramingBenchmark.framing_old             16             64  thrpt    3    835697.061 ±   61190.351  ops/s
FramingBenchmark.framing_new             16             64  thrpt    3    972758.829 ±  47127.839  ops/s
FramingBenchmark.framing_old             16            128  thrpt    3    620562.797 ±    4529.555  ops/s
FramingBenchmark.framing_new             16            128  thrpt    3    794532.378 ±  35834.092  ops/s
FramingBenchmark.framing_old             16            256  thrpt    3    380776.167 ±    3069.815  ops/s
FramingBenchmark.framing_new             16            256  thrpt    3    601945.868 ± 120485.938  ops/s
FramingBenchmark.framing_old             16            512  thrpt    3    201483.021 ±   27772.140  ops/s
FramingBenchmark.framing_new             16            512  thrpt    3    414966.392 ± 187654.169  ops/s
FramingBenchmark.framing_old             16           1024  thrpt    3    110883.293 ±    2468.349  ops/s
FramingBenchmark.framing_new             16           1024  thrpt    3    259000.528 ±  58471.624  ops/s
FramingBenchmark.framing_old             32             32  thrpt    3    589338.870 ±   48664.767  ops/s
FramingBenchmark.framing_new             32             32  thrpt    3    593169.099 ±   4840.018  ops/s
FramingBenchmark.framing_old             32             64  thrpt    3    438839.627 ±   98301.907  ops/s
FramingBenchmark.framing_new             32             64  thrpt    3    528762.400 ±  23527.622  ops/s
FramingBenchmark.framing_old             32            128  thrpt    3    318200.080 ±    3745.204  ops/s
FramingBenchmark.framing_new             32            128  thrpt    3    429482.997 ±  13070.825  ops/s
FramingBenchmark.framing_old             32            256  thrpt    3    196484.642 ±    1439.287  ops/s
FramingBenchmark.framing_new             32            256  thrpt    3    321971.065 ±  35385.581  ops/s
FramingBenchmark.framing_old             32            512  thrpt    3    102559.681 ±    1066.612  ops/s
FramingBenchmark.framing_new             32            512  thrpt    3    214747.165 ±  34767.609  ops/s
FramingBenchmark.framing_old             32           1024  thrpt    3     55713.288 ±    1021.259  ops/s
FramingBenchmark.framing_new             32           1024  thrpt    3    130871.717 ±  30464.051  ops/s
FramingBenchmark.framing_old             64             32  thrpt    3    260941.068 ±   11029.238  ops/s
FramingBenchmark.framing_new             64             32  thrpt    3    318087.922 ±   7819.818  ops/s
FramingBenchmark.framing_old             64             64  thrpt    3    213252.824 ±    2763.842  ops/s
FramingBenchmark.framing_new             64             64  thrpt    3    259738.166 ±   9262.097  ops/s
FramingBenchmark.framing_old             64            128  thrpt    3    152106.933 ±    3544.177  ops/s
FramingBenchmark.framing_new             64            128  thrpt    3    208251.362 ±   9641.613  ops/s
FramingBenchmark.framing_old             64            256  thrpt    3     97187.752 ±   16306.733  ops/s
FramingBenchmark.framing_new             64            256  thrpt    3    154602.853 ±  17896.718  ops/s
FramingBenchmark.framing_old             64            512  thrpt    3     51460.170 ±     851.556  ops/s
FramingBenchmark.framing_new             64            512  thrpt    3    101678.754 ±  20105.737  ops/s
FramingBenchmark.framing_old             64           1024  thrpt    3     27853.852 ±    1002.975  ops/s
FramingBenchmark.framing_new             64           1024  thrpt    3     62230.188 ±  19234.156  ops/s
FramingBenchmark.framing_old            128             32  thrpt    3    143485.697 ±    2294.380  ops/s
FramingBenchmark.framing_new            128             32  thrpt    3    150634.966 ±  15251.295  ops/s
FramingBenchmark.framing_old            128             64  thrpt    3    108585.695 ±    1580.502  ops/s
FramingBenchmark.framing_new            128             64  thrpt    3    127661.514 ±   2498.963  ops/s
FramingBenchmark.framing_old            128            128  thrpt    3     74709.074 ±    1104.879  ops/s
FramingBenchmark.framing_new            128            128  thrpt    3    105351.600 ±   6728.938  ops/s
FramingBenchmark.framing_old            128            256  thrpt    3     49446.714 ±    3098.863  ops/s
FramingBenchmark.framing_new            128            256  thrpt    3     80119.663 ±  11005.101  ops/s
FramingBenchmark.framing_old            128            512  thrpt    3     25852.970 ±    2643.999  ops/s
FramingBenchmark.framing_new            128            512  thrpt    3     52383.251 ±  11319.119  ops/s
FramingBenchmark.framing_old            128           1024  thrpt    3     13979.600 ±     221.422  ops/s
FramingBenchmark.framing_new            128           1024  thrpt    3     28313.662 ±  11664.417  ops/s

actor-tests/src/test/scala/org/apache/pekko/util/ByteStringSpec.scala

actor/src/main/scala-2.12/org/apache/pekko/util/ByteString.scala

He-Pin · 2024-04-04T06:22:13Z

The byte search could use SIMD too, but I don't have the time to do this :(

actor/src/main/scala-2.13/org/apache/pekko/util/ByteString.scala

He-Pin

lgtm, thanks

He-Pin · 2024-04-12T06:24:52Z

@Roiocam @jxnu-liguobin @jrudolph Would you like to take a look at this?

He-Pin · 2024-04-15T03:58:56Z

actor/src/main/scala-2.12/org/apache/pekko/util/ByteString.scala

+      else {
+        var found = -1
+        var i = math.max(from, 0)
+        while (i < length && found == -1) {


I think if we swap the condition to (found == -1 && i < length), we can save a comparison once the element has been found :)

He-Pin · 2024-04-15T04:08:46Z

actor/src/main/scala-2.12/org/apache/pekko/util/ByteString.scala

+          if (bytes(startIndex + i) == elem) found = i
+          i += 1
+        }
+        found


How about we hoist the startIndex + i computation to the start of the loop and return found - startIndex?
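The two loop suggestions above can be sketched together as follows. This is only an illustration on a plain array, assuming that `bytes`, `startIndex` and `length` mean the same as in the diff; `LoopSketch` and its method signature are hypothetical and are not the code that was merged.

```scala
object LoopSketch {
  // Searches the window bytes[startIndex, startIndex + length) for elem and
  // returns an index relative to startIndex, or -1 if elem is absent.
  def indexOf(bytes: Array[Byte], startIndex: Int, length: Int, elem: Byte, from: Int): Int = {
    val end = startIndex + length
    // Clamp from into [0, length] and add the offset once, not per iteration.
    var i = startIndex + math.min(math.max(from, 0), length)
    var found = -1
    // found == -1 is tested first, so the bounds check is skipped after a hit.
    while (found == -1 && i < end) {
      if (bytes(i) == elem) found = i
      i += 1
    }
    if (found == -1) -1 else found - startIndex
  }
}
```

Whether either change is measurable would still need to be checked with the benchmark; the JIT may already produce equivalent code for both forms.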

He-Pin · 2024-04-15T05:31:32Z

I think we can delegate indexOf[B >: Byte](elem: B, from: Int): Int to indexOf(elem: Byte, from: Int): Int with Byte.unbox to reduce the duplicated code. @JD557, wdyt? Or we can just merge this and do that later.

I have checked the bytecode, will do this after work.

JD557 · 2024-04-15T09:26:07Z

Unfortunately, I don't think that would work, as the signature requires [B >: Byte]. B could be Any and in that case we need the slow implementation that falls back to Object.equals 😕

Code example:

case class MyByteSeq(data: List[Byte]) {
  def fastIndexOf(byte: Byte): Int = data.indexOf(byte)
  //def indexOf1[B >: Byte](x: B): Int = fastIndexOf(Byte.unbox(x)) // This won't compile
  def indexOf2[B >: Byte](x: B): Int = fastIndexOf(Byte.unbox(x.asInstanceOf[Object])) // This can fail at runtime
  def indexOf3[B >: Byte](x: B): Int = x match {
    case b: Byte => fastIndexOf(b) // I think the pattern match already unboxes it
    case _ => ??? // I can't call fastIndexOf here, so I need a duplicated version of the code anyway
  }
}

He-Pin · 2024-04-15T09:50:20Z

Thanks. But I checked the bytecode and the parameter type was java.lang.Object, so I'm not sure why that would fail at runtime.
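A minimal sketch of the runtime failure, assuming the delegation is written like indexOf2 in the example above. The parameter is indeed erased to java.lang.Object, but Byte.unbox casts that object to java.lang.Byte (see unboxToByte quoted later in this thread), so any other class ends in a ClassCastException. `UnboxSketch` and its method names are hypothetical.

```scala
object UnboxSketch {
  // Mirrors indexOf2: the cast to Object always succeeds, but Byte.unbox
  // then casts to java.lang.Byte, which fails for every other class.
  def unboxUnchecked[B >: Byte](x: B): Byte = Byte.unbox(x.asInstanceOf[Object])

  // True if unboxing the given value throws a ClassCastException.
  def failsToUnbox(x: Any): Boolean =
    try { unboxUnchecked(x); false }
    catch { case _: ClassCastException => true }
}
```

With these definitions, `UnboxSketch.failsToUnbox(500)` and `UnboxSketch.failsToUnbox("x")` are true, while a real Byte unboxes without error.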

He-Pin · 2024-04-15T09:53:55Z

@JD557 do you have any further improvements for this PR? I would like to merge this; improvements can come later.

JD557 · 2024-04-15T10:07:08Z

@JD557 do you have any further improvements for this PR? I would like to merge this; improvements can come later.

Not really, feel free to merge this and improve later

jrudolph

That still feels like a footgun, since it is quite unclear in which situations the new overload will be selected? Should we document in which case the static overload resolution will choose the new method?

Is it somehow possible to avoid the massive code duplication here (probably hard since the whole existing scheme is built around the function of Any.==)? If not it should be documented.

Ultimately, the main problem of ByteString has always been the attempt to make it fit seamlessly into the rest of the Scala collections by making it part of the collections type hierarchy (instead of making it fast and useful in the first place and then consider useful conversions/views to the Scala collection types).

actor/src/main/scala-3/org/apache/pekko/util/ByteString.scala

JD557 · 2024-04-15T12:45:05Z

That still feels like a footgun, since it is quite unclear in which situations the new overload will be selected? Should we document in which case the static overload resolution will choose the new method?

I don't have any strong feelings about the naming. I used this scheme because it's the same one that's used by ByteIterator (see pekko/actor/src/main/scala-3/org/apache/pekko/util/ByteIterator.scala, lines 498 to 502 in 3cd0801):

    def indexOf(elem: Byte): Int = indexOf(elem, 0)
    def indexOf(elem: Byte, from: Int): Int = indexWhere(_ == elem, from)
    override def indexOf[B >: Byte](elem: B): Int = indexOf(elem, 0)
    override def indexOf[B >: Byte](elem: B, from: Int): Int = indexWhere(_ == elem, from)

If we change the method here (e.g. to indexOfByte, like in 18c5db1), then I think it should also be changed in ByteIterator.

However, on that note:

Is it somehow possible to avoid the massive code duplication here (probably hard since the whole existing scheme is built around the function of Any.==)? If not it should be documented.

Looks like ByteIterator avoids the duplication by delegating the comparison to indexWhere. Not sure if the lambda could have a performance impact, though. I would need to benchmark this again.

(Actually, I think the equality on that ByteIterator#indexOf call to indexWhere is backwards... the stdlib does the right thing with elem == _ instead of _ == elem)
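The direction of the comparison only becomes visible when the searched element has its own notion of equality. The sketch below assumes a hypothetical `ByteLike` class (not part of Pekko) whose equals accepts a matching Byte, and uses a plain array instead of ByteIterator.

```scala
object EqualityDirectionSketch {
  // Hypothetical element type that considers itself equal to a matching Byte.
  final class ByteLike(val value: Byte) {
    override def equals(that: Any): Boolean = that match {
      case b: Byte => b == value
      case _       => false
    }
    override def hashCode: Int = value.toInt
  }

  val bytes: Array[Byte] = Array[Byte](3, 1, 2)
  val elem: Any = new ByteLike(1)

  // Same operand order as the standard library: elem's own equals decides.
  val elemFirst: Int = bytes.indexWhere(b => elem == b)
  // Same operand order as ByteIterator#indexOf: the boxed Byte's equals decides.
  val byteFirst: Int = bytes.indexWhere(b => b == elem)
}
```

Under these assumptions `elemFirst` is 1 and `byteFirst` is -1. For plain numeric arguments both orders agree, which is probably why the difference has gone unnoticed.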

He-Pin · 2024-04-15T13:01:45Z

@jrudolph True, but as far as I can tell from the bytecode, the current method is compiled to indexOf(java.lang.Object, int) and each byte is then boxed in the method body. I think that's what @JD557 is addressing.

With the specialized version, a primitive if_icmp comparison is used in the bytecode instead.
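The difference between the two paths can be sketched on a plain array. This is a simplified illustration rather than the ByteString code itself; `BoxingSketch` and both method names are hypothetical.

```scala
object BoxingSketch {
  // Generic variant: elem is erased to Object, so every byte is boxed
  // before being compared through the universal == on Any.
  def indexOfGeneric[B >: Byte](bytes: Array[Byte], elem: B, from: Int): Int = {
    var i = math.max(from, 0)
    while (i < bytes.length) {
      if (bytes(i) == elem) return i
      i += 1
    }
    -1
  }

  // Specialized variant: both operands are primitive bytes, so the
  // comparison needs no boxing.
  def indexOfByte(bytes: Array[Byte], elem: Byte, from: Int): Int = {
    var i = math.max(from, 0)
    while (i < bytes.length) {
      if (bytes(i) == elem) return i
      i += 1
    }
    -1
  }
}
```

The two bodies are textually identical; only the static type of `elem` changes which comparison the compiler emits. That is also why the duplication discussed in this thread is hard to remove.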

JD557 · 2024-04-15T15:10:16Z

So, I was doing some more tests with a specialized version of indexWhere, but that's noticeably slower (even though it's faster than ByteIterator#indexWhere). So I don't see many ways to reduce the duplication.

As such, I'm not sure how to proceed with this PR:

Should I introduce a specialized indexWhere anyway, leading to code multiplication?
Do I keep the indexOf name (like the ByteIterator does) or do I name it something else?

He-Pin

Lgtm

He-Pin · 2024-04-15T15:28:08Z

@JD557 I think if you want to change the method name, you can use the name firstIndexOf.

As for the API: because it overrides the Scala collections' indexOf, the input B, which in practice is a Byte, can now be Any.

As for selection, I think the compiler will choose this new method when it knows the elem being tested is of type Byte, but not when it doesn't.

As for the old indexOf, it will always return -1 if the elem is not a Number that can be boxed to a Byte, so a quick check followed by delegation to the specialized one would not hurt much. Who would call indexOf on a ByteString with an Any?

He-Pin · 2024-04-15T16:10:14Z

In BoxesRunTime.java

    public static byte unboxToByte(Object b) {
        return b == null ? 0 : ((java.lang.Byte)b).byteValue();
    }

    public static java.lang.Byte boxToByte(byte b) {
        return java.lang.Byte.valueOf(b);
    }

In Byte.java

    public boolean equals(Object obj) {
        if (obj instanceof Byte) {
            return value == ((Byte)obj).byteValue();
        }
        return false;
    }

Update: BoxesRunTime.toByte will not work either.

I think there is some implicit conversion going on, because elem == (bytes(startIndex)) returns true and elem.equals(bytes(startIndex)) returns false 😢

JD557 · 2024-04-15T16:40:01Z

I don't think that BoxesRunTime trick is enough, unfortunately.

Say one asks ByteString.indexOf(500). This should obviously return -1, as it is impossible for a byte to have that value.
However BoxesRunTime.toByte(500) returns -12: Byte.

Ideally we would check if the number is between Byte.MinValue and Byte.MaxValue before boxing, but even that is not enough, due to floating point values (BoxesRunTime.toByte(127.9) == 127.toByte == 127.0).

(There's also the annoying possibility that someone creates their own class RichByte(b: Byte) { override def equals(that: Any): Boolean = b.equals(that) || ... })

Although I imagine most problems would come from Int and Char.
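The narrowing problem described above can be reproduced with standard library calls only. In this sketch `java.lang.Number#byteValue` stands in for the conversion step, and `narrowed` is a hypothetical delegation that nobody in the thread proposed in exactly this form.

```scala
object NarrowingSketch {
  val data: Vector[Byte] = Vector(1, -12, 127).map(_.toByte)

  // Existing semantics: numeric equality against each element.
  def generic(elem: Any): Int = data.indexOf(elem)

  // Naive delegation (hypothetical): narrow any number to a byte first,
  // then search for that byte.
  def narrowed(elem: Any): Int = elem match {
    case n: java.lang.Number => data.indexOf(n.byteValue)
    case _                   => -1
  }
}
```

Here `generic(500)` and `generic(127.9)` both return -1, while `narrowed(500)` finds the byte -12 at index 1 and `narrowed(127.9)` finds 127 at index 2. A delegating implementation would therefore need an exact-value check before narrowing, and even that would not cover user-defined equals methods.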

He-Pin · 2024-04-15T16:43:37Z

Yes, @JD557, I just checked again: it's using BoxesRunTime.equals for the comparison. Otherwise we would need something like indexOfWhere and extract the comparison into a lambda.

    BALOAD
    INVOKESTATIC scala/runtime/BoxesRunTime.boxToByte (B)Ljava/lang/Byte;
    ALOAD 1
    INVOKESTATIC scala/runtime/BoxesRunTime.equals (Ljava/lang/Object;Ljava/lang/Object;)Z
    IFEQ L6

He-Pin · 2024-04-15T17:04:49Z

@som-snytt Friendly ping: do you know of a good way to handle this? Thanks.

jrudolph · 2024-04-16T07:26:52Z

As for selection, I think the compiler will choose this new method when it knows the elem being tested is of type Byte, but not when it doesn't.

Probably, yes. If the general overload did not exist, the new one would also work for literals like 12 or 'a'. So, maybe it's really the best we can do right now? With this solution it will at least choose the faster version whenever the user is explicitly looking for a byte.
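A small sketch of the selection rule for explicitly typed arguments, assuming a standalone class with the same pair of signatures. `OverloadSketch`, `Bytes` and `which` are hypothetical names, and the methods return the name of the chosen overload instead of an index.

```scala
object OverloadSketch {
  final class Bytes {
    // Chosen when the static type of elem is Byte.
    def which(elem: Byte, from: Int): String = "specialized"
    // Chosen for every other static type, such as Int or Any.
    def which[B >: Byte](elem: B, from: Int): String = "generic"
  }

  val bs = new Bytes
  val asByte: Byte = 2
  val asInt: Int = 2
  val asAny: Any = asByte

  val forByte: String = bs.which(asByte, 0)
  val forInt: String = bs.which(asInt, 0)
  val forAny: String = bs.which(asAny, 0)
}
```

Note that `asAny` holds a boxed Byte at runtime but still goes through the generic method, because overload resolution is static. Bare literals such as 12 or 'a' are deliberately left out of the sketch.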

He-Pin · 2024-04-20T07:20:34Z

@JD557 this needs another update to make MiMa happy.

JD557 · 2024-04-20T07:41:19Z

I can try to fix it, but out of curiosity, isn't it OK to have MiMa issues, since this only targets 1.1.x?

I was ignoring the issue because I thought 1.1.0 was going to break bincompat with 1.0.x anyway

mdedetrich · 2024-04-20T08:14:15Z

I can try to fix it, but out of curiosity, isn't it OK to have MiMa issues, since this only targets 1.1.x?

Pekko core follows SemVer, so the only acceptable MiMa issues are for internal code (i.e. @InternalApi) or with false positives

I was ignoring the issue because I thought 1.1.0 was going to break bincompat with 1.0.x anyway

No it doesn't, that would be for Pekko 2.x.x.

JD557 · 2024-04-22T12:22:52Z

actor/src/main/scala-3/org/apache/pekko/util/ByteString.scala

@@ -823,7 +874,33 @@ sealed abstract class ByteString
  override def indexWhere(p: Byte => Boolean, from: Int): Int = iterator.indexWhere(p, from)

  // optimized in subclasses
-  override def indexOf[B >: Byte](elem: B, from: Int): Int = indexOf(elem, from)
+  override def indexOf[B >: Byte](elem: B, from: Int): Int = super.indexOf(elem, from)


I am aware that this override is a bit weird, but:

I think the old code was an infinite loop

Removing this override breaks the MiMa checks

This should never be called anyway, but I think having it like this is a bit safer.
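The shape of the fix can be sketched with a stand-in base class. `Base` is hypothetical and merely plays the role of the inherited collection method.

```scala
object OverrideSketch {
  // Stand-in for the inherited collection implementation (hypothetical).
  abstract class Base(data: Array[Byte]) {
    def indexOf[B >: Byte](elem: B, from: Int): Int =
      data.indexWhere(elem == _, math.max(from, 0))
  }

  class Fixed(data: Array[Byte]) extends Base(data) {
    // Writing `indexOf(elem, from)` here would resolve to this same method
    // and recurse without end; `super.indexOf` reaches the inherited one.
    override def indexOf[B >: Byte](elem: B, from: Int): Int = super.indexOf(elem, from)
  }
}
```

Keeping the override, rather than deleting it, leaves the method in the compiled class, which is consistent with the MiMa observation above.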

JD557 · 2024-04-22T12:24:09Z

OK, I think the MiMa issues should be fixed now.

pjfanning

lgtm

pjfanning · 2024-04-24T19:43:54Z

@jrudolph do you still think this PR change needs more work?

He-Pin · 2024-04-24T20:15:36Z

@sirthias Would you like to give some input on this too?

sirthias · 2024-04-25T07:15:44Z

The only comment I have is that there would be a potentially much more efficient implementation of indexOf(elem: Byte, from: Int) if we somehow had low-level (unsafe) access to the byte array and could read 8 bytes at once into a long.
Then we could use SWAR (SIMD within a register) and reduce the loop count by 8 (asymptotically) over the current implementation.
Doesn't pekko already have an Unsafe access construct somewhere?
IIRC Akka did...
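A rough sketch of the SWAR idea for a single known byte. It assumes a little-endian `java.nio.ByteBuffer` view as a safe stand-in for raw (Unsafe) access, so it illustrates the bit trick rather than the performance. `SwarIndexOf` is a hypothetical name and not an existing Pekko utility.

```scala
import java.nio.{ ByteBuffer, ByteOrder }

object SwarIndexOf {
  private final val Ones = 0x0101010101010101L
  private final val Highs = 0x8080808080808080L

  def indexOf(bytes: Array[Byte], elem: Byte, from: Int): Int = {
    // Little-endian, so the lowest byte of each long is the lowest index.
    val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
    // elem repeated in all 8 byte lanes.
    val pattern = (elem & 0xFFL) * Ones
    var i = math.max(from, 0)
    while (i + 8 <= bytes.length) {
      // Lanes equal to elem become zero after the XOR.
      val x = buf.getLong(i) ^ pattern
      // Classic zero-byte test; the lowest set bit always marks a real match.
      val hit = (x - Ones) & ~x & Highs
      if (hit != 0L) return i + (java.lang.Long.numberOfTrailingZeros(hit) >>> 3)
      i += 8
    }
    // Fewer than 8 bytes left: fall back to the simple loop.
    while (i < bytes.length) {
      if (bytes(i) == elem) return i
      i += 1
    }
    -1
  }
}
```

The zero-byte test can flag lanes above the first match incorrectly, which is harmless here because only the lowest flagged lane is used. Whether this beats the simple loop depends on how cheaply the 8-byte read can be done.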

He-Pin · 2024-04-25T07:31:00Z

@sirthias Yes, I opened an issue for SIMD in #1264. It would be nice to have that after this has been merged. I think https://github.com/sirthias/borer must have already done this.

sirthias · 2024-04-25T07:42:01Z

Yes, there is a (much more complicated) SWAR loop implemented in borer's JSON parser.
Here we only have to look for a single known byte rather than a whole set of different characters and we also don't have to copy segments and do UTF8 decoding at the same time.

But the whole thing only makes sense if we can really get down to raw byte access via Unsafe or more modern means. And on ScalaJS a SWAR approach will just create overhead and be a lot slower than the simple loop.

pjfanning · 2024-05-02T10:12:32Z

@He-Pin @mdedetrich @Roiocam @samueleresca @raboof @jxnu-liguobin should we get this sorted out before 1.1.0-M1 RC or should we look at this again after the 1.1.0-M1 release?

He-Pin · 2024-05-02T15:46:20Z

Sorry for the lack of response, holidays here. I think we can include this in M1 and try SWAR after that.

changes have been made but I will leave the PR open a few days just in case

pjfanning · 2024-05-06T19:19:56Z

Merged - thanks @JD557

JD557 added 3 commits April 3, 2024 19:56

Fix FramingBenchmark

70bbeba

Add specialized indexOfByte

18c5db1

Rename indexOfByte to indexOf

37a532f

He-Pin reviewed Apr 4, 2024

actor-tests/src/test/scala/org/apache/pekko/util/ByteStringSpec.scala

He-Pin reviewed Apr 4, 2024

actor/src/main/scala-2.12/org/apache/pekko/util/ByteString.scala

He-Pin reviewed Apr 4, 2024

actor/src/main/scala-2.12/org/apache/pekko/util/ByteString.scala

JD557 added 2 commits April 4, 2024 10:54

Add missing @since(1.1.0) and reformat scaladoc

44cf4bc

Fix Scala 2.12 ambiguity problem

407de55

He-Pin added the performance label Apr 6, 2024

He-Pin added this to the 1.1.0-M1 milestone Apr 6, 2024

He-Pin reviewed Apr 6, 2024

actor/src/main/scala-2.13/org/apache/pekko/util/ByteString.scala

He-Pin reviewed Apr 6, 2024

actor/src/main/scala-2.13/org/apache/pekko/util/ByteString.scala

He-Pin approved these changes Apr 6, 2024

He-Pin mentioned this pull request Apr 6, 2024

Feature request: Using SIMD for byte search #1264

Open

Inline nextString

e923cba

He-Pin requested a review from Roiocam April 12, 2024 06:24

He-Pin added the late-release-note label Apr 15, 2024

He-Pin reviewed Apr 15, 2024

jrudolph reviewed Apr 15, 2024

jrudolph previously requested changes Apr 15, 2024

actor/src/main/scala-3/org/apache/pekko/util/ByteString.scala

JD557 force-pushed the faster-framing branch from 668a8fc to 2cce3ea Compare April 15, 2024 14:47

He-Pin approved these changes Apr 15, 2024

Fix MiMa issues

19fb7af

JD557 commented Apr 22, 2024

He-Pin requested review from pjfanning, jrudolph and mdedetrich April 22, 2024 13:25

pjfanning approved these changes Apr 22, 2024

pjfanning merged commit cce5f9b into apache:main May 6, 2024
17 of 18 checks passed

pjfanning removed the late-release-note label May 6, 2024
