Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added fs2.text.linesWithEndings #2418

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Conversation

fizzy33
Copy link

@fizzy33 fizzy33 commented May 23, 2021

Worth noting that if we wanted to replace fs2.text.lines with

  /** Transforms a stream of `String` such that each emitted `String` is a line from the input. */
  def lines[F[_]]: Pipe[F, String, String] =
    s => linesWithEndings(s).map(_._1)

The performance hit using LinesBenchmark on my mac mini 2021 was down 3K ops/s from 96K ops/s to 93K ops/s. Which seems sensible as there are a few more allocations and one more map.

The unit test is a robotic copy of the test for lines.

So one open question I have is to replace fs2.text.lines or not.

While more LineEnding's could be expressed the current code doesn't support them so the effort was on expressing simply what the existing system does.

Copy link
Contributor

@diesalbla diesalbla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code seems a positive contribution as is, so I am approving.

However, I would like to ask you to consider two possible optimisations:

Reuse existing streams.

At present, your new method is going to allocate a new string, with its new byte array, for each element in the output. However, suppose that your input contains the following strings (one per line):

Once upon a midnight dreary,\r while I pondered, weak and 
weary,\r Over many a quaint and curious volume of forgotten lore\r

Here, the output should look as follows:

Once upon a midnight dreary,
while I pondered, weak and weary,
Over many a quaint and curious volume of forgotten lore

If you notice, the first and last strings of output are each contained in the strings from input. One possible optimisation, thus, would be to find out which output-lines are entirely contained within one input line, and use the .substring method to get them out, without the extra process of the StringBuilder.

NOTE I also hoped that substring would be reusing the underlying array, but that may not be the case... https://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/lang/String.java#l1945

Copy link
Contributor

@nikiforo nikiforo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't had a task, where I needed to preserve different linebreaks. However, I believe there can be such a task.
Concerning the PR, I think that EOF is unnecessary. I understand that the intention was to get rid of nasty Option in (String, Option[LineEnding]), but from my POV the introduction of EOF does more harm than good.
I thought a bit about adding line breaking symbols into the StringBuilder, but that would make it harder to implement lines in terms of linesWithEnding(we would have to look at the last and the second to the last symbols and then cut them off with substring).
So, I’d prefer this signature: def linesWithEndings[F[_]]: Pipe[F, String, (String,Option[LineEnding])] to def linesWithEndings[F[_]]: Pipe[F, String, (String, LineEnding)], and the implementation of lines that you’ve proposed

first: Boolean
): Pull[F, (String, LineEnding), Unit] =
stream.pull.uncons.flatMap {
case None => if (first) Pull.done else Pull.output1(stringBuilder.result() -> LineEnding.EOF)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the EOF remains, I think we should replace Pull.done with Pull.output1("", LineEnding.EOF). Otherwise there would be no EOF at the end of the stream, which breaks "EOF at the end of the stream" consistency.

// case object CR extends LineEnding("\r") looking at the existing code this won't be produced so leaving it out for now
case object LF extends LineEnding("\n")
case object CRLF extends LineEnding("\r\n")
case object EOF extends LineEnding("")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I think that EOF is technically not a LineEnding, but I can't come up with the appropriate name for the group(LF, CRLF, EOF).
  2. lines is not only about files, but about Stream[F, String]. Thus socket, pure streams are also consumable by .lines operator.

@fizzy33
Copy link
Author

fizzy33 commented May 24, 2021

@diesalbla fwiw I spent about 2 hours in morbid curiousity trying to eek more performance including trying what you suggested in various incarnations and couldn't eek out any more performance. Any case where substring's were used were signficantly less performant. Honestly really surprised by this. I also wanted to see if I could get the performance we wanted using scanChunks but no success. Someone already worked really hard to get it to perform. I suspect the StringBuilder allocated on the call stack is very cheap along with the arrays that StringBuilder.toString generates. They show to be significantly cheaper than String.substring in the profiler I was using.

@nikiforo we have the use case for this in http4s where in roundtrip unit tests it is needed to properly reify the input data with the expected and actual data.

Agreed on the EOF I have similar mis-givings. Does EOS (end of string | end of stream) lessen it ? Suggested from the comment that lines is about streams not files. I toss and turn on the Option[LineEnding] versus having EOF as well. Chose EOF because it was less allocations and assuming all else is equal (which it isn't).

@mpilquist
Copy link
Member

If we went the option route, we'd probably need to cache instances of Some(CRLF) and Some(CR), right?

The Java socket API uses end-of-input and end-of-output. I think either EOF or EOS would be fine here.

@nikiforo
Copy link
Contributor

If we went the option route, we'd probably need to cache instances of Some(CRLF) and Some(CR), right?

😃

The Java socket API uses end-of-input and end-of-output. I think either EOF or EOS would be fine here.

Consider this line.
If you Stream("one\r\ntwo").linesWithEndings you would get Stream(("one", CRLF), ("two", EOS)). That looks fine.
However, if you Stream.empty[String].linesWithEndings you would get either Stream.empty and no EOS, or Stream("", EOS). Both of them look wrong from where I stand.

@fizzy33
Copy link
Author

fizzy33 commented May 26, 2021

So oddly enough I spent the morning hacking through ServerSentEvent unit tests in http4s. ServerSentEvent defines end event as an empty line terminated with a \n. The existing protocol decoder was using fs2.text.lines and in the absence of line endings assumed \n which created corner cases when combined with this related behaviour fs2.Stream("\n").through(fs2.text.lines).toVector.size == 2.

@nikiforo great point the little response I have for you there is the existing behaviour I mentioned of fs2.Stream("\n").through(fs2.text.lines).toVector.size == 2 (not quite an exact analogy)

I have no strong opinion on EOS or Option[LineEnding] I can make both work. Being new here I am not sure how we come to consensus. There are more "exotic" ways to encode this (just throwing spaghetti against the wall),

sealed trait TextOrLineEnding
case class Text(value: String) extends TextOrLineEnding
case class LineEnding(value: String) extends TextOrLineEnding
def linesWithEndings[F[_]]: Pipe[F, String, TextOrLineEnding] = {

Noting with this encoding you lose the obvious no two Text instances in a row but it does have legs.

Another alternate is have the text be optional

def linesWithEndings[F[_]]: Pipe[F, String, (Option[String],LineEnding)] = {

You could arguably pass the @nikiforo test with
Stream.empty[String].linesWithEndings == Stream(None -> EOS)

Anyway I am having a bit of fun just to see if any bright ideas pop out. There is a practical need here. One needs the actual line endings if you are decoding protocols where \n \r\n \r and End are not considered equivalent

Anyone else find it fascinating something as simple as this can get so intricate ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants