Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add "format string" support for tensors #655

Merged
merged 1 commit into from
Oct 11, 2024

Conversation

AngelEzquerra
Copy link
Contributor

This comes in 2 forms:

  1. Add a version of the pretty function that takes a format string as its input instead of a precision value. Also add a new "showHeader" argument to the original pretty function. - This new version of pretty let's you specify the format string that must be used to display each element. It also adds some new tensor-specific format string "tokens" that are used to control how tensors are displayed (beyond the format of each element).
  2. Add a formatValue procedure that takes a tensor as its first input. This makes it possible to control how tensors are displayed in format strings, in the exact same way as if you were using the new pretty(specifier) procedure.

The special, tensor-specific tokens added by this change are:

  • "[:]": Display the tensor as if it were a nim "array". This makes it easy to use the representation of a tensor in your own code. No header is shown.
  • "[]": Same as "[:]" but displays the tensor in a single line. No header is shown.
  • "<>": Combined with the 2 above (i.e. "<>[:]" or "<>[]") adds a header with basic tensor info (type and shape). "<:>" can be used as a shortcut for "<>[:]" while "<>" on its own is equivalent to "<>[]". Can also be combined with "<>||" (see below).
  • "||": "Pretty-print" the tensor without a header. This can also be combined with "<>" (i.e. "<>||") to explicitly enable the default mode, which is pretty printing with a header.
  • 'j': Formats complex values as A+Bj like in mathematics. Ignored for non-Complex tensors

Note that these are only used to control how the tensors themselves are displayed as a whole and are removed before displaying the individual elements.

@Vindaar
Copy link
Collaborator

Vindaar commented Jun 7, 2024

I think this PR could use some other eyes / opinions. Not so much because of the added code (you're welcome to review of course), but more so because of the added syntax. @mratsim @HugoGranstrom @Clonkk (feel free to ping anyone else!)

If we decide to add it to arraymancer directly (and not e.g. as a submodule that can be imported import arraymancer / pretty_printing), we might want to be more certain that we like the syntax. Any input would be appreciated.

##
## Inputs:
## - Input Tensor
## - specifier: A format specifier similar to those used in format strings,
Copy link
Contributor

@Clonkk Clonkk Jun 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't those be an enum instead of static[string] ?
In my experience custom string format specifier are rather confusing.
Multiple format output could even be different proc so it's easier to navigate code and change it. We could also introduce a CT switch -d:... to choose which prettyPrint version is used on $

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using an enum is the first thing I tried but it was not very flexible. Using a string is how I let the user configure the individual element display in the same flexible manner that nim's strformat allows. For example, you can display integers as hex or as binary values, you can force the display of the sign for positive values, you can specify the number of digits in total and on the decimal part or (with nim 2.2) you can ask to use the "math-like" "j" complex display format (which displays complex numbers as 2.3-7.1j) which is particularly useful for a tensor library in the context of signal processing.

@Clonkk
Copy link
Contributor

Clonkk commented Jun 7, 2024

Good stuff over all. To me the biggest discussion to be had is about using string as specifier.

For non-mutually exclusive options we could use a set (something done in the std lib).
Maybe, another possibility is to have different proc and pass an enum to do the dispatch in prettyPrint ?

@HugoGranstrom
Copy link

Thanks for working on this @AngelEzquerra 😄

I agree with Clonkk that string formats are quite confusing.
I don't see what the <> format would add that a separate showHeader parameter couldn't bring. And the different modes ([], [:], ||) could be replaced with a printMode parameter that takes an enum as Clonkk suggested. So my suggestion would be a function with this signature:

type
  PrintMode = enum
    arrayMode
    singleLineMode
    prettyMode

proc pretty*[T](t: Tensor[T], formatString: string, showHeader: bool = true, printMode: PrintModeEnum = prettyMode): string =
    ...

Or is it something I'm missing here, some combination of tokens that can't be handled this way?

@Clonkk
Copy link
Contributor

Clonkk commented Jun 7, 2024

The only downside of enum I can see is you have 2 non mutually formatting options you need 3 enums (optA, optB, optA_B). And the more non mutually exclusive options you add the worse it gets.

That is why I.think we should also.consider set instead of enum since it can still work well even if we want/need different formatting options.

On the other hand if we want to keep limited possibility then enum is simpler.

@HugoGranstrom
Copy link

The only downside of enum I can see is you have 2 non mutually formatting options you need 3 enums (optA, optB, optA_B). And the more non mutually exclusive options you add the worse it gets.

I agree with you that it is more flexible. Do we have any plans to introduce any mutually inclusive options, though? If we do, it is easy enough to add an overload that just calls the set-version with the single argument in a set. :)

@AngelEzquerra
Copy link
Contributor Author

Thank you @Vindaar for asking others to comment and thank you for those comments guys! :-)

Just to give more context to other people, the main reason I made this PR is that while the current pretty printing of tensors is nice (and pretty! :-) ), it makes it hard to compare arraymancer's results with numpy's, which uses pretty different format, and, above all, it makes it really hard to use arraymancer's own output in your arraymancer code (e.g. when creating a test or using arraymancer to build a simulator). Another reason is to be able to control the format of the elements themselves (e.g. to use hex format, etc). Finally, I wanted to be able to do all of those while using nim's strformat, which I use extensively and find very useful and nice.

Another comment I'd make is that I tried several options before landing on the current proposal. The first thing I tried is to add an enum argument to pretty. The problem with that was that it was not usable with format strings nor let me configure the display of the individual elements. That is why I decided to simply add support for format strings. At first, I tried to use simple letters to indicate that the format should be array-like ("a"), single line array-like ("s") or that it should not have a header ("n"). However, after some thought I got worried that nim's own strformat specifiers (in order to support the control of the format of the individual elements) might evolve and potentially add a use for those letters in the future. That's why I decided to use something that seems much more unlikely to collide with nim's strformat. In particular, I decided to use symbols (and in fact more than one of them). This has the nice side-effect of making the chosen tokens kind of look like the desired output (<> for headers, [] for single-line arrays, [:] more arrays with more than one line and || for the existing table format).

One small tweak that could be done to the current proposal is to force these "tokens" to surround the "element level format". That is, instead of (or perhaps in addition to) adding "[]" to "+06.2f" (i.e. "[]+06.2f"), we could do "[+06.2f]".

@HugoGranstrom
Copy link

HugoGranstrom commented Jun 7, 2024

Yes it is great and something I have also been a bit annoyed at some times that the string representation is a bit odd. 😅

The first thing I tried is to add an enum argument to pretty. The problem with that was that it was not usable with format strings nor let me configure the display of the individual elements.

One doesn't exclude the other. You can specify the array format with an enum and the element format with a string (like in my example). Or was there something that didn't work out with that approach? :)

@AngelEzquerra
Copy link
Contributor Author

Thanks for working on this @AngelEzquerra 😄

I agree with Clonkk that string formats are quite confusing. I don't see what the <> format would add that a separate showHeader parameter couldn't bring. And the different modes ([], [:], ||) could be replaced with a printMode parameter that takes an enum as Clonkk suggested. So my suggestion would be a function with this signature:

type
  PrintMode = enum
    arrayMode
    singleLineMode
    prettyMode

proc pretty*[T](t: Tensor[T], formatString: string, showHeader: bool = true, printMode: PrintModeEnum = prettyMode): string =
    ...

Or is it something I'm missing here, some combination of tokens that can't be handled this way?

That is almost identical to the first thing I tried to do :-)
What that does not let you do (and the reason why I landed on this proposal) is configure the format of the individual elements. It also does not let you use nim's format strings (which I use very often) to print your tensors. With the current proposal you can do something like this:

echo &"Error the input tensor ({t:[]}) does not have the right rank ({t.rank})"

Or something like this:

echo {t=:+06.8f[]}

And get a nice representation of my tensor in which every element has the "+06.f" format and I can copy it into my code because it is valid nim syntax.

@Clonkk
Copy link
Contributor

Clonkk commented Jun 7, 2024

The problem with that was that it was not usable with format strings nor let me configure the display of the individual elements.

Yeah I understand, enum felt limited too when i thought about it.

In your opinion do you think a set of options would work ?

{HexaFmt, ArrayDisplay, NoHeader}

for example would display hexadecimal in array format without the header.

This seems flexible enough while allowing for high cardinality of options and at the same time doesn't require sanitizing input and is resilient to change in the specs (if an option disappears or is split into 2, it will instantly be a compile time error).

Since enums can have a string representation, we could define

type DisplayOpts = FormatNumbersEnum|HeadersEnum|...

Since enum can have a string value it shouldn't be too difficult to use the enum in a strformat.

EDIT :

With the current proposal you can do something like this:

This is a great argument, I am on my way to the airport and will look a bit more into your proposal.

@HugoGranstrom
Copy link

What that does not let you do (and the reason why I landed on this proposal) is configure the format of the individual elements.

That is what the formatString parameter is for, the format string for the individual elements. I could have been clearer about what I meant :D.

It also does not let you use nim's format strings (which I use very often) to print your tensors.

Okay, I think I'm starting to see the problem, it is specifically the &"{t:[]}" syntax you are after. I didn't read well enough and thought the updated pretty function was the important part here. 😅

@AngelEzquerra
Copy link
Contributor Author

The problem with that was that it was not usable with format strings nor let me configure the display of the individual elements.

Yeah I understand, enum felt limited too when i thought about it.

In your opinion do you think a set of options would work ?

´´{HexaFmt, ArrayDisplay, NoHeader}´´ for example would display hexadecimal in array format without the header.

This seems flexible enough while allowing for high cardinality of options and at the same time doesn't require sanitizing input and is resilient to change in the specs (if an option disappears or is split into 2, it will instantly be a compile time error).

FYI, strformat strings also give compile time errors because they are all static strings :-) This helped me more than once when I was working on this feature.

Since enums can have a string representation, we could define

´´´ type DisplayOpts = FormatNumbersEnum|HeadersEnum|... ´´´

Since enum can have a string value it shouldn't be too difficult to use the enum in a strformat.

Sorry, can you explain that a bit more? how would you do that? I'm not sure I understand your proposal.

@AngelEzquerra
Copy link
Contributor Author

AngelEzquerra commented Jun 7, 2024

What that does not let you do (and the reason why I landed on this proposal) is configure the format of the individual elements.

That is what the formatString parameter is for, the format string for the individual elements. I could have been clearer about what I meant :D.

It also does not let you use nim's format strings (which I use very often) to print your tensors.

Okay, I think I'm starting to see the problem, it is specifically the &"{t:[]}" syntax you are after. I didn't read well enough and thought the updated pretty function was the important part here. 😅

Yes, I guess I should have been clear about that too 😅. I wanted an easy way to use the non pretty syntax in format strings. I could have done that with a new function (e.g. toString and then do echo &"{t.toString}". But I also wanted to control the individual element format, which can only be done very flexibly by using nim's format strings. So once I added format string support it felt very natural to go all the way because using a format string as an argument to a function call inside a format string would be weird! that is echo &"{t.toString(\"6.2f\")" (which I don't know if it works?).

Plus echo &"{t:[]}" is pretty fast to write (and natural once you get used to it) for quick debug printing :)

@Clonkk
Copy link
Contributor

Clonkk commented Jun 7, 2024

Sorry, can you explain that a bit more? how would you do that? I'm not sure I understand your proposal.

My idea was to use set to pass options to pretty print :

Typically something like :

var A : Tensor[float] = randomTensor(3, 5 ,6, 1.0)
echo a.prettyPrint({optNoHeader, optHexFormat, ... })

This make having inclusive options easy and from a doc / api point of view it makes a lot of sense and it is easy to know what's possible and what isn't. It's nice to have but won't solve
the std/strformat problem.

We can always go for the string specifier and build a way to generate the string specifier from a set of enums (like have {optNoHeader, optHexFormat, ... } be converted to a specific string specifier at compile time ) so we could have the best of both world.

@AngelEzquerra
Copy link
Contributor Author

Sorry, can you explain that a bit more? how would you do that? I'm not sure I understand your proposal.

My idea was to use set to pass options to pretty print :

Typically something like :

var A : Tensor[float] = randomTensor(3, 5 ,6, 1.0)
echo a.prettyPrint({optNoHeader, optHexFormat, ... })

This make having inclusive options easy and from a doc / api point of view it makes a lot of sense and it is easy to know what's possible and what isn't. It's nice to have but won't solve the std/strformat problem.

We can always go for the string specifier and build a way to generate the string specifier from a set of enums (like have {optNoHeader, optHexFormat, ... } be converted to a specific string specifier at compile time ) so we could have the best of both world.

What if we extended the version of pretty that doesn't take a format string (the one that takes a precision value) with these new set-based options, but kept the version that takes the format string as is? That way people who don't want to learn about this custom syntax or who don't care about format strings can use the more functional style while those that like format strings can use the new syntax?

@Clonkk
Copy link
Contributor

Clonkk commented Jun 8, 2024

What if we extended the version of pretty that doesn't take a format string (the one that takes a precision value) with these new set-based options, but kept the version that takes the format string as is? That way people who don't want to learn about this custom syntax or who don't care about format strings can use the more functional style while those that like format strings can use the new syntax?

Yes, this is what I had in mind in my last paragraph. But it can be done after this PR once string specifier are merged

## - Input Tensor.
## - precision: The number of decimals printed (for float tensors),
## _including_ the decimal point.
## - showHeader: If true (the default) show a dscription header
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add row-major or col-major indication in the header ? That would be very helpful when passing arraymancer's buffer to other interface (like Numpy or Julia) ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point, but #660 raises a good point about our colMajor support.

## - "||": "Pretty-print" the tensor _without_ a header. This
## can also be combined with "<>" (i.e. "<>||") to
## explicitly enable the default mode, which is pretty
## printing with a header.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AngelEzquerra Could we have a specifier to choose between showing the content 'in-memory' order OR in 'index' order ?

@Clonkk
Copy link
Contributor

Clonkk commented Aug 24, 2024

This seems good enough for me. I think we should merge this whenever CI is green to avoir PR rot.

cc @Vindaar cc @mratsim cc @HugoGranstrom

## - Input Tensor.
## - precision: The number of decimals printed (for float tensors),
## _including_ the decimal point.
## - showHeader: If true (the default) show a dscription header
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
## - showHeader: If true (the default) show a dscription header
## - showHeader: If true (the default) show a description header

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix this typo in a commit taking care of all your comments.

func getTensorDispMode(specifier: string): tensorDispMode =
## Get the display mode based on the special "tensor specifier tokens"
let brackets = "[" in specifier and "]" in specifier
let colon_brackets = specifier.count(':') == 2
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we test for 2 : here? Shouldn't it be one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"brackets" is when you use [] while "colon brackets" is when you use :: that's why we must check for a count of 2 colons.

# <:> is a shortcut for <>[:]
multi_line_array
elif brackets or logic_brackets:
# <> is a shortcut for <>[] _when_ not combined with ||, ::, [:] or [::]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, :: is not mentioned in the doc strings, no?

Comment on lines 69 to 144
result = specifier.replace("::")
.replace("[:]")
.replace("<:>")
.replace("<>")
.replace("[]")
.replace("||")

Copy link
Collaborator

@Vindaar Vindaar Sep 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One could use multiReplace, which will be more faster. But it doesn't really matter for this purpose (the overhead of the actual printing far exceeds this of course).

Comment on lines 221 to 224
# when T is SomeFloat:
# result = t.map_inline((x.formatBiggestFloat(precision = precision)).len).max
# else:
# result = t.map_inline(($x).len).max
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be removed, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely.

## Describe the tensor in terms of its shape and type (in a "compact" way)
## Only used for array-style printing
# Most if not all tensors element types are part of the system or complex
# modules so we can remove them from the type without must information loss
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# modules so we can remove them from the type without must information loss
# modules so we can remove them from the type without much information loss

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix this typo in a commit taking care of all your comments.


proc prettyImpl*[T](
t: Tensor[T], precision: int, specifier: static string): string =
## Non public implementation of the pretty function
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
## Non public implementation of the pretty function
## Public implementation of the pretty function

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users are not expected to use the prettyImpl function. It is "public" in the sense that it is exported from p_disp.nim, but p_disp.nim itself is not exported so in the end it should be non-public?

@Vindaar
Copy link
Collaborator

Vindaar commented Sep 20, 2024

Awesome work and sorry again for the very long delay!

This comes in 2 forms:
1. Add a version of the pretty function that takes a format string as its input instead of a precision value. Also add a new "showHeader" argument to the original pretty function.
    - This new version of pretty lets you specify the format string that must be used to display each element. It also adds some new tensor-specific format string "tokens" that are used to control how tensors are displayed (beyond the format of each element). See below for more details.
2. Add a formatValue procedure that takes a tensor as its first input. This makes it possible to control how tensors are displayed in format strings, in the exact same way as if you were using the new pretty(specifier) procedure.

This new formatValue procedure takes all of the standard format specifiers, such as "f", "g", "+", etc. (and including, in nim 2.2 and above, the 'j' specifier for complex tensors) which are used to configure of each tensor element is displayed.
In addition to those standard format specifiers you can use a few, tensor specific modifiers which can be combined to achieve different results:

- "[:]": Display the tensor as if it were a nim "array".
         This makes it easy to use the representation of a tensor in your own code. No header is shown.
- "<>[:]": Same as "[:]" but displays a header describing the tensor type and shape.
- "[]": Same as "[:]" but displays the tensor in a single line. No header is shown.
- "<>[:]": Same as "[]" but displays a header describing the tensor type and shape.
- "||": "Pretty-print" the tensor _without_ a header.
- "<>||": Same as "||" but displays a header describing the tensor type and shape. This is the default display mode.
- "<>": When used on its own (i.e. not combined with "[:]", "[]" and "||" as shown above) it is equivalent to "<>[]" (i.e. single-line, nim-like representation with a header). Cannot wrap the element specifier.

Notes:
- The default format (i.e. when none of these tensor-specific tokens is used) is pretty printing with a header (i.e. "<>||").
- These tensor specific modifiers can placed at the start or at the end of the format specifier (e.g. "<>[:]06.2f", or "06.2f<>[:]"), or they
can "wrap" the element specifier (e.g. "<>[:06.2f]").
- When wrapping you cannot use the "compact forms" of these specifiers (i.e. "<:>" and "<|>").
- When using the wrapping form of "[:]", start with "[:" and end with either "]" or ":]" (i.e. both "[:06.2f]" and "[:06.2f:]" are valid).
- This version of this function does not have a `showHeader` argument because to disable the header you can simply select the right format
specifier.

Fixes after PR comments

More fixes
@Vindaar Vindaar merged commit cd02bf7 into mratsim:master Oct 11, 2024
3 of 6 checks passed
@Vindaar
Copy link
Collaborator

Vindaar commented Oct 11, 2024

Thanks! 🥳

darkestpigeon pushed a commit to darkestpigeon/Arraymancer that referenced this pull request Oct 18, 2024
This comes in 2 forms:
1. Add a version of the pretty function that takes a format string as its input instead of a precision value. Also add a new "showHeader" argument to the original pretty function.
    - This new version of pretty lets you specify the format string that must be used to display each element. It also adds some new tensor-specific format string "tokens" that are used to control how tensors are displayed (beyond the format of each element). See below for more details.
2. Add a formatValue procedure that takes a tensor as its first input. This makes it possible to control how tensors are displayed in format strings, in the exact same way as if you were using the new pretty(specifier) procedure.

This new formatValue procedure takes all of the standard format specifiers, such as "f", "g", "+", etc. (and including, in nim 2.2 and above, the 'j' specifier for complex tensors) which are used to configure of each tensor element is displayed.
In addition to those standard format specifiers you can use a few, tensor specific modifiers which can be combined to achieve different results:

- "[:]": Display the tensor as if it were a nim "array".
         This makes it easy to use the representation of a tensor in your own code. No header is shown.
- "<>[:]": Same as "[:]" but displays a header describing the tensor type and shape.
- "[]": Same as "[:]" but displays the tensor in a single line. No header is shown.
- "<>[:]": Same as "[]" but displays a header describing the tensor type and shape.
- "||": "Pretty-print" the tensor _without_ a header.
- "<>||": Same as "||" but displays a header describing the tensor type and shape. This is the default display mode.
- "<>": When used on its own (i.e. not combined with "[:]", "[]" and "||" as shown above) it is equivalent to "<>[]" (i.e. single-line, nim-like representation with a header). Cannot wrap the element specifier.

Notes:
- The default format (i.e. when none of these tensor-specific tokens is used) is pretty printing with a header (i.e. "<>||").
- These tensor specific modifiers can placed at the start or at the end of the format specifier (e.g. "<>[:]06.2f", or "06.2f<>[:]"), or they
can "wrap" the element specifier (e.g. "<>[:06.2f]").
- When wrapping you cannot use the "compact forms" of these specifiers (i.e. "<:>" and "<|>").
- When using the wrapping form of "[:]", start with "[:" and end with either "]" or ":]" (i.e. both "[:06.2f]" and "[:06.2f:]" are valid).
- This version of this function does not have a `showHeader` argument because to disable the header you can simply select the right format
specifier.

Fixes after PR comments

More fixes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants