Feature request: transformations to format and parse values in the API #2310

Closed
aljungberg opened this issue Jun 10, 2022 · 30 comments · Fixed by #2839
Labels
enhancement a feature, ready for implementation

Comments

@aljungberg
Contributor

Summary: a way to have PostgREST apply formatting and parsing functions automatically on a per-column basis.

Problem Description

Often the external data representation as seen by our customers (consumers of the API) is different from how we store it internally. For example, perhaps we store a colour as an int4 integer but customers see CSS colour hex strings like 'ff00fa' when interacting with the API. The storage is an implementation detail.

Today, there seem to be only three ways to handle this scenario with PostgREST, each with some rather inconvenient drawbacks:

  1. create a custom type in Postgres and compile an extension with C functions to transform to and from the representational form (e.g. CREATE TYPE colour (INTERNALLENGTH = 4, INPUT = my_colour_parser, OUTPUT = my_colour_formatter)), then use that as your column type;
  2. format the column in a view (SELECT colour_format(colour) AS colour); or
  3. transform the values at the proxy level (e.g. with OpenResty, using body_filter_by_lua_block and friends).

The first solution is awkward because it lets "how the data is presented" dictate "how the data is stored", which feels backwards. What if the data is presented two different ways in two different APIs? Maybe this is solvable using views that cast the column into a different presentation type, but there's a bigger problem: the solution is plain impossible with managed cloud hosting like Amazon's and Google's SQL offerings, which don't allow you to install custom C extensions or even create "shell types" with CREATE TYPE colour;.

The second solution makes that column not updatable, since Postgres doesn't know how to reverse the transform. This can be worked around using INSTEAD OF triggers, at some complexity cost. But when filtering by this column we also get full table scans for the same reason: Postgres must call colour_format(x) on every row in the source table of the view when evaluating WHERE, JOIN etc. The performance loss here can be avoided with an expression index, or using a materialised generated column. There's a third problem though: if the formatted column is used as a foreign key, PostgREST can no longer detect that relationship and embedding breaks. Embedding is nice.
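For concreteness, here is a rough sketch of workaround 2 together with its mitigations; the api/private schemas, the todos table and the colour_format/colour_parse functions are all hypothetical names:

-- Hypothetical illustration of workaround 2: a formatting view, an INSTEAD OF
-- trigger to keep writes working, and an expression index so filters on the
-- formatted value can still use an index (colour_format must be IMMUTABLE).
CREATE VIEW api.todos AS
  SELECT id, colour_format(colour) AS colour
  FROM private.todos;

CREATE FUNCTION api.todos_insert() RETURNS trigger AS $$
BEGIN
  INSERT INTO private.todos (id, colour)
  VALUES (NEW.id, colour_parse(NEW.colour));
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER todos_insert INSTEAD OF INSERT ON api.todos
  FOR EACH ROW EXECUTE FUNCTION api.todos_insert();

CREATE INDEX ON private.todos (colour_format(colour));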

The third solution, proxy-level rewriting, is inefficient. Now you have to parse the JSON of each response from PostgREST at the Lua layer, find the relevant fields and rewrite them, and then serialise back to JSON again. What if the response is large? You have to buffer the whole thing in nginx before you can parse it. JSON parsing isn't free in terms of CPU usage. And it feels so wasteful for Postgres to encode the JSON only for Lua to immediately decode it. Similarly, for incoming POSTs, PATCHes etc. you have to decode the incoming JSON, rewrite and re-encode.

Furthermore, some of PostgREST's own features might make the rewriting hard. You can select a column and then rename it with query parameter syntax. That's a cool feature, but now how will the Lua code know which fields it needs to reformat if they can be named anything? We'd have to parse the query string and understand it in detail. PostgREST's query syntax is complex, so this will be a source of bugs and surprises. Similarly with embedding, the fields we care about can appear in unexpected places.

And on the note of query strings, it gets messy for filtering. If the client submits a request like thing?colour=in.(ff00f0,ff00dd), again Lua has to parse the query string, understand what is what, and then apply the appropriate reverse transformation.

Resource embedding further increases the requirement for a complete understanding of the PostgREST query language. It's not very DRY, is it, reimplementing all of PostgREST's query parsing in Lua? That then has to be maintained going forward as PostgREST evolves.

This problem also relates to date and time formatting and probably a number of issues on querying by tsrange and whatnot.

Proposal

Unlike Lua and OpenResty, PostgREST knows both its own query language syntax and the types of the columns it works with (from the cache). I'd like to propose a method to ask PostgREST to always apply some_parser(x) to incoming values for a certain column and format_for_api(x) to outgoing values before putting them into json_agg.

Here are two alternative ideas to accomplish this; I'm not sure which is better.

Automatically use parse and format functions for certain domain types

  • use user-definable types made with CREATE DOMAIN (bypassing the cloud hosting ban on true custom types)
  • when defining API views we would cast to these types, like SELECT colour::colour_representation_type AS colour
  • ...and PostgREST would know to apply some predefined parsing and formatting functions as appropriate for this pseudo-type colour_representation_type
  • ...because there exists a function with a sentinel name like pgrsttransformer that accepts a single parameter of colour_representation_type and returns json, and there exists a reverse function which takes json and returns the representation type
  • (or if we don't want to automatically configure it in this way it could be an explicit mapping in the postgrest settings per column or per type)
  • We would do this before putting values into WHERE statements
  • And when selecting values from the incoming json body in POST/PATCH/PUT
  • And when selecting values to output before sticking them into json_agg

Automatically use parse and format functions for specially named columns

  • same as above but instead use the name of the column to trigger the functions
  • when defining the view we'd do something like SELECT colour AS colour__pgrsttransformer1
  • when selecting a column with the __pgrsttransformerX suffix PostgREST does something like SELECT pgrsttransformer1(colour) AS colour
  • and in reverse when parsing values to insert them, if the incoming column is called colour and there exists colour__pgrsttransformer1 in the target table, apply pgrsttransformer1(incoming) (the function would be user defined and polymorphic to go forwards or backwards depending on the type)
  • and do the same before using such values in WHERE statements

Discussion

I realise this is a big feature request. I think it will make PostgREST significantly more flexible and powerful without adding much complexity to PostgREST itself, though.

It is a way to solve this class of problem in general without the significant drawbacks of the above-mentioned workarounds. (The other option would be to convince the Postgres maintainers to allow casting to and from domains, but alas such casts are ignored today.)

Transforming values to be more friendly for API consumers seems like a very ordinary thing to do. I'm surprised it hasn't come up more often. Other examples of such column types needing formatting and parsing for API use could be:

  • rational numbers such as 1/3 stored as a complex type
  • the aforementioned date and time formatting with various levels of precision or particular formats
  • data stored as binary strings presented and consumed as base64
  • a data type changing internally but needing to remain the same in the API for backwards compatibility, when Postgres doesn't know how to automatically cast between the two types

The Lua workaround is "ok" if you don't care about the performance impact but having to reimplement a complicated part of PostgREST (its query language) in Lua just seems like crossing the stream to get to water. That code already exists, it's in PostgREST!

Related:

https://stackoverflow.com/questions/72537716/with-postgrest-convert-a-column-to-and-from-an-external-encoding-in-the-api?noredirect=1#comment128140735_72537716

@steve-chavez
Member

@aljungberg Sorry for the late reply here.

This can be worked around using INSTEAD OF triggers at some complexity costs. But also when filtering by this column, we get full table scans for the same reason. Postgres must call colour_format(x) on every row in the source table of the view when evaluating WHERE, JOIN etc. The performance loss here can be avoided with a computed index, or using a materialised generated column.

Yes, I've done the above before. I used a view to get a geojson format and then RULEs (INSTEAD OF triggers would be better) to insert geojson and convert it to the geometry type.

There's a third problem though: if the formatted column is used as a foreign key, PostgREST can no longer detect that relationship and embedding breaks. Embedding is nice.

Now, this is a different issue. If the foreign key column is transformed, then yes you'd lose the ability to use it for embedding.

However, we have #2144, where you can define an embed operation manually in SQL (with a computed column, where you can transform your fk column). I think that would solve your issue. WDYT?
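Something along these lines, sketched with assumed names (api.todos, api.projects, a formatted project column, and a hypothetical parse_project_ref to reverse the transformation):

-- Hypothetical sketch of a manually defined embed (per #2144): a function taking
-- a todos row and returning its related project row, so embedding keeps working
-- even though the exposed fk column holds a transformed value.
CREATE FUNCTION project(api.todos) RETURNS SETOF api.projects AS $$
  SELECT * FROM api.projects WHERE id = parse_project_ref($1.project);
$$ LANGUAGE SQL STABLE ROWS 1;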

@wolfgangwalther
Member

I think this is a very reasonable request and I have faced the same challenge in the past.

What I tried was using a DOMAIN - and then adding a CAST from the domain's base type to the domain itself and vice versa. This is not supported by PostgreSQL, however; it just ignores this cast, as stated in the docs:

A cast to or from a domain type currently has no effect. Casting to or from a domain uses the casts associated with its underlying type.

BUT: It is possible to create that cast.

What if we were to parse all columns for exposed tables/views, check their types and, if the type is a domain, see whether there is a cast between the domain and its base type available?

That should be possible in a query for the schema cache.

We could then apply the functions referenced in the cast as formatters / parsers on the input - thus allowing the actual queries to use the underlying type, use indexes, etc.
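A rough sketch of the kind of catalog query that could feed the schema cache (the alias names here are just illustrative):

-- Find casts that involve a domain type, along with the function implementing them.
-- castcontext: 'i' = implicit, 'a' = assignment, 'e' = explicit.
SELECT
  d.typname     AS domain_name,
  src.typname   AS source_type,
  tgt.typname   AS target_type,
  p.proname     AS cast_function,
  c.castcontext AS context
FROM pg_cast c
JOIN pg_type src ON src.oid = c.castsource
JOIN pg_type tgt ON tgt.oid = c.casttarget
JOIN pg_proc p   ON p.oid   = c.castfunc
JOIN pg_type d   ON d.typtype = 'd' AND d.oid IN (c.castsource, c.casttarget);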

@wolfgangwalther wolfgangwalther added enhancement a feature, ready for implementation idea Needs of discussion to become an enhancement, not ready for implementation labels Aug 11, 2022
@aljungberg
Contributor Author

What if we were to parse all columns for exposed tables/views, check their types and, if the type is a domain, see whether there is a cast between the domain and its base type available?

Yes that would absolutely work if it's technically possible. Since the domain type is ignored for casting, I hope that doesn't also mean it's difficult/impossible to query such casting definitions.

This sounds like an almost perfect solution actually. It's just what one would expect to happen when defining such casts, so it's completely intuitive. (And I wish Postgres would do it automatically.)

@aljungberg
Contributor Author

After some light research on this, yes, the casts can be read from pg_cast even for domain casts. You have the source and the destination type, plus the oid of the function to do the conversion.

So if the table has a domain type like color and there's a cast available to json, that's when we'd go ahead and use it.

Optionally, to make it harder to unintentionally trigger this feature (path of least surprise), we could require the AS ASSIGNMENT flag to be set on the cast as well.

If the cast is marked AS ASSIGNMENT then it can be invoked implicitly when assigning a value to a column of the target data type.

In this case we're taking the value and assigning it to a "column" in the output of JSON type, so semantically the AS ASSIGNMENT flag seems to match.
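For illustration, the kind of cast being described might look like this (the color domain and color_to_json function are hypothetical names):

-- A domain over integer plus a formatting cast to json, marked AS ASSIGNMENT
-- as proposed above.
CREATE DOMAIN color AS integer;

CREATE FUNCTION color_to_json(color) RETURNS json AS $$
  SELECT to_json('#' || lpad(to_hex($1), 6, '0'));
$$ LANGUAGE SQL IMMUTABLE;

CREATE CAST (color AS json) WITH FUNCTION color_to_json(color) AS ASSIGNMENT;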

@wolfgangwalther
Member

So if the table has a domain type like color and there's a cast available to json, that's when we'd go ahead and use it.

Oh, that's better than what I suggested above.

Optionally, to make it harder to unintentionally trigger this feature (path of least surprise), we could require the AS ASSIGNMENT flag to be set on the cast as well.

Hm. Maybe for the cast from json to color - because that would be used for the INSERT / UPDATE case. But it doesn't make much sense for the color to json case, I think. This is not an assignment.

I would turn it around and only use the cast if it was marked implicit. Because that's what we're doing, right? We're not requiring the user to specify that cast anywhere.

@aljungberg
Contributor Author

Okay, having had a play with this it looks reasonably straightforward. I think we may implement this feature for our use at Screenly and submit a pull request. It feels like a feature useful to almost anyone: massaging data for API input and output is something everyone needs to do from time to time.

@aljungberg
Contributor Author

Okay, one quirk came up: in the incoming direction, API->Postgres, we may need two casts defined:

  1. json->domain (when the input is a JSON document e.g. in POST)
  2. text->domain (when the input is in a query string like a filter meant for a where clause, e.g. todos?color=#C0FFEE)

I think requiring 3 casts total is preferable to requiring that the user JSON encode their filters by hand before using them in a query (you'd have to do todos?color=%22%23C0FFEE%22 if you wanted valid JSON).
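For illustration, the extra text->domain cast might look like this, reusing the hypothetical color domain from above (color_from_text is likewise an assumed name):

-- Parse the query-string (text) form of a colour, e.g. '#C0FFEE' or 'C0FFEE'.
CREATE FUNCTION color_from_text(text) RETURNS color AS $$
  SELECT (('x' || lpad(ltrim($1, '#'), 8, '0'))::bit(32)::int)::color;
$$ LANGUAGE SQL IMMUTABLE;

CREATE CAST (text AS color) WITH FUNCTION color_from_text(text) AS IMPLICIT;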

@wolfgangwalther
Member

I think requiring 3 casts total is preferable to requiring that the user JSON encode their filters by hand before using them in a query (you'd have to do todos?color=%22%23C0FFEE%22 if you wanted valid JSON).

Ah, interesting - hadn't thought of that. Yes, I agree.

@aljungberg
Contributor Author

aljungberg commented Oct 10, 2022

I have this working for read queries and where filters.

The input case is made a little challenging by json_populate_recordset. Consider the query PostgREST generates for an insert:

WITH pgrst_source AS (
  WITH pgrst_payload AS (
    SELECT
      '{"id":5, "color": "#001100"}' :: json AS json_data
  ),
  pgrst_body AS (
    SELECT
      CASE WHEN json_typeof(json_data) = 'array' THEN json_data ELSE json_build_array(json_data) END AS val
    FROM
      pgrst_payload
  )
  INSERT INTO
    "api"."todos"("color", "id")
  SELECT
    "color",
    "id"
  FROM
    json_populate_recordset (
      null :: "api"."todos",
      (
        SELECT
          val
        FROM
          pgrst_body
      )
    ) _ RETURNING "api"."todos".*
)
SELECT
  '' AS total_result_set,
  pg_catalog.count(_postgrest_t) AS page_total,
  array [] :: text [] AS header,
  coalesce(json_agg(_postgrest_t), '[]') :: character varying AS body,
  nullif(current_setting('response.headers', true), '') AS response_headers,
  nullif(current_setting('response.status', true), '') AS response_status
FROM
  (
    SELECT
      "todos".*
    FROM
      "pgrst_source" AS "todos"
  ) _postgrest_t;

Note the SELECT "color" part. Even if we change it to SELECT color_parser("color") to parse that external representation, it's too late. json_populate_recordset already tries to parse the incoming JSON using the base type of the color column, which in this case is an integer. The string "#001100" is not a valid integer.

ERROR: invalid input syntax for type integer: "#001100"

This happens even if we don't try to select that column at all (e.g. SELECT "id" FROM json_populate_recordset ... will also fail).

It feels like the solution is to switch to json_to_recordset(...) AS x(id integer, color json,...), giving the column names and types by hand (we know which we want and care about). We'd give the type json for the ones we want to manually parse, and Postgres should then preserve the input for those values as is without trying to parse it.

But I'm not familiar enough with all the ways incoming data might look. There could be edge cases I haven't considered, like maybe nested stuff, arrays, compound values. Any thoughts?
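For reference, here is a rough sketch of how the insert source could change under this idea (color_parser stands in for the user's json->domain parsing function, and the column types are assumed):

-- json_to_recordset with explicit column definitions: "color" is declared json,
-- so its value is passed through unparsed and handed to the custom parser.
INSERT INTO "api"."todos"("color", "id")
SELECT
  color_parser("color"),
  "id"
FROM
  json_to_recordset('[{"id":5, "color": "#001100"}]'::json)
    AS _("id" bigint, "color" json)
RETURNING "api"."todos".*;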

@aljungberg
Contributor Author

From the notes on json_populate_recordset:

If the output column is a composite (row) type, and the JSON value is a JSON object, the fields of the object are converted to columns of the output row type by recursive application of these rules.
Likewise, if the output column is an array type and the JSON value is a JSON array, the elements of the JSON array are converted to elements of the output array by recursive application of these rules.

So there is definitely some recursive parsing going on here based on the output column types. However, from my reading of the docs json_to_recordset should have the exact same functionality (except when we disable it by changing the column types).

In the cases where we do override the record column type to be json, that recursive parsing would be disabled. That feels like exactly the right behaviour. Your custom JSON parser gets called with the data as is, and if it's an array or whatever, you're the one who knows how to handle that properly for your custom data type (or you wouldn't have created this json->domain cast to begin with).

So at this stage json_to_recordset with explicitly specified target columns (and types) still feels viable. It is, hopefully, a smaller change than it looks like since json_populate_recordset and json_to_recordset work almost exactly the same from what I can see in the docs.

We'd need to carry the column types in MutateRequest instead of just the field names. This interacts with the columns parameter, but not in a complicated way (just need to use our DbStructure cache to find the fields by name and then pass in Field instead of FieldName in insCols, defaulting to all the fields if not specified which I believe is implicitly what happens today with json_populate_recordset since it'll use all the columns from that row type).

@steve-chavez
Member

steve-chavez commented Jun 23, 2023

Okay one quirk came up: in the incoming direction, API->Postgres, we may need two casts defined:

  1. json->domain (when the input is a JSON document e.g. in POST)
  2. text->domain (when the input is in a query string like a filter meant for a where clause, e.g. todos?color=#C0FFEE)

Seems we can avoid the text->domain cast by converting the query string parameters to json. We're going to need this for #2125 (sending filters in the request body) anyway.

-- final shape of the query
  SELECT
    "test"."datarep_todos"."id",
    test.json("test"."datarep_todos"."due_at") AS "due_at"
  FROM "test"."datarep_todos"
  JOIN LATERAL (
    SELECT json_build_object('label_color', '000100') as val -- comes from `?label_color=eq.000100`
  ) args ON TRUE
  WHERE  "test"."datarep_todos"."label_color" = test.color(args.val->'label_color');
 id |         due_at
----+------------------------
  2 | "2018-01-03T00:00:00Z"


-- using the test fixtures from PostgREST/postgrest#2523 
DROP DOMAIN IF EXISTS public.color CASCADE;
CREATE DOMAIN public.color AS INTEGER CHECK (VALUE >= 0 AND VALUE <= 16777215);

CREATE TABLE datarep_todos (
  id bigint primary key,
  name text,
  label_color public.color default 0,
  due_at public.isodate default '2018-01-01'::date,
  icon_image public.bytea_b64,
  created_at public.unixtz default '2017-12-14 01:02:30'::timestamptz,
  budget public.monetary default 0
);

-- this CAST function is slightly changed from the one in PostgREST/postgrest#2523
CREATE OR REPLACE FUNCTION color(json) RETURNS public.color AS $$
  WITH x AS (
    SELECT $1 #>> '{}' AS val
  )
  SELECT (('x' || lpad((CASE WHEN SUBSTRING(x.val, 1, 1) = '#' THEN SUBSTRING(x.val, 2) ELSE x.val END), 8, '0'))::bit(32)::int)::public.color
  FROM x;
$$ LANGUAGE SQL IMMUTABLE;

CREATE OR REPLACE FUNCTION json(public.color) RETURNS json AS $$
  SELECT
    CASE WHEN $1 IS NULL THEN to_json(''::text)
    ELSE to_json('#' || lpad(upper(to_hex($1)), 6, '0'))
  END;
$$ LANGUAGE SQL IMMUTABLE;

CREATE CAST (public.color AS json) WITH FUNCTION json(public.color) AS IMPLICIT;
CREATE CAST (json AS public.color) WITH FUNCTION color(json) AS IMPLICIT;

Any concerns? So far this seems to work ok. I'll try to apply this change in the WHERE clause only for the "field representations" feature.

(I thought "field representations" is a better name bc we already have computed fields and doesn't clash with resource representations).

@aljungberg
Contributor Author

Okay one quirk came up: in the incoming direction, API->Postgres, we may need two casts defined:

  1. json->domain (when the input is a JSON document e.g. in POST)
  2. text->domain (when the input is in a query string like a filter meant for a where clause, e.g. todos?color=#C0FFEE)

Seems we can avoid the text->domain cast by converting the query string parameters to json. We're going to need this for #2125 (sending filters in the request body) anyway.
Any concerns? So far this seems to work ok. I'll try to apply this change in the WHERE clause only for the "field representations" feature.

Well, it's the concern I noted, which is that using JSON in your query string is not very nice UX: "I think requiring 3 casts total is preferable to requiring that the user JSON encode their filters by hand before using them in a query (you'd have to do todos?color=%22%23C0FFEE%22 if you wanted valid JSON)."

We're going to need this for #2125 (sending filters in the request body) anyway.

But in the body you can just specify your content type as json, right? Query string parameters, OTOH, are kind of text by definition, so using a data representation seems like a very natural fit. You have data in one format, you want it in another... that's the whole feature!

The formats might even be different. In text maybe a colour is just #ABC and in JSON maybe it's [1, 20, 200]. So it's nice to have that flexibility.

I already implemented it in the PR so it shouldn't add any more work, I think.

(I thought "field representations" is a better name bc we already have computed fields and doesn't clash with resource representations).

Right, but data representations don't apply only to fields (or columns). You can use them to represent fields, sure, but also the output of procedures, complex types pulled from multiple fields, query parameters as mentioned above, and, as we discussed in #2523, they could indeed apply to entire tables/relations (describing, e.g., how you transform table X into CSV).

And they'll in fact not clash with resource representations since they're a superset of resource representations. Indeed, if one goes all the way with #2523, data reps are the new resource representations.

@steve-chavez
Member

steve-chavez commented Jun 23, 2023

Well, it's the concern I noted, which is that using JSON in your query string is not very nice UX: "I think requiring 3 casts total is preferable to requiring that the user JSON encode their filters by hand before using them in a query (you'd have to do todos?color=%22%23C0FFEE%22 if you wanted valid JSON)."

@aljungberg Oh, I'm not referring to forcing the user to do JSON on the query string. So you'd still use plain text:

GET /datarep_todos?label_color=eq.000100

But internally we convert the query parameters to json. Note the json_build_object here:

-- final shape of the query
  SELECT
    "test"."datarep_todos"."id",
    test.json("test"."datarep_todos"."due_at") AS "due_at"
  FROM "test"."datarep_todos"
  JOIN LATERAL (
   -- '000100' comes from `?label_color=eq.000100`
    SELECT json_build_object('label_color', '000100') as val 
  ) args ON TRUE
  WHERE  "test"."datarep_todos"."label_color" = test.color(args.val->'label_color');

json_build_object can also take the values as unknown. No casting for text needed there.

We only need to use the already-defined CREATE OR REPLACE FUNCTION color(json) RETURNS public.color... function, since args.val->'label_color' obtains the value as json.

This should offer better DX because you only worry about the casting from/to json.

@steve-chavez
Member

Btw, I'm continuing your great work on #2839. If we agree, then the remaining task for merging would be removing the need for the text casting.

@steve-chavez
Member

The formats might even be different. In text maybe a colour is just #ABC and in JSON maybe it's [1, 20, 200]. So it's nice to have that flexibility.

I think it's preferable to give up on that flexibility to keep a simpler interface (just 2 casts). The tests on #2523 also operate under the assumption that the text rep will be the same as json.

(Maybe it's also possible to integrate field representations with resource representations later. If we hardcode the query parameters to text then we would lose that ability)

@wolfgangwalther Also curious on your thoughts on this.

@steve-chavez
Member

steve-chavez commented Jun 24, 2023

Hm, one thing that seems unsolvable with my above proposal is the in filter. Haven't been able to come up with a query that doesn't require the text[] casting here: https://github.com/PostgREST/postgrest/pull/2523/files#r1105262523

Edit: Scratch the above, the in filter is solvable and this way is a bit better than what we currently do (build the unknown array literal ourselves).

Basically using json_build_array plus json_array_elements:

SELECT test.color(json_array_elements(json_build_array('000100','000300'))) as val;
 val
-----
 256
 768
(2 rows)

The final query would be like:

-- '000100','000300' would come from the `in` filter: ?label_color=in.(000100, 000300)

SELECT
  "test"."datarep_todos"."id",
  test.json("test"."datarep_todos"."due_at") AS "due_at"
FROM "test"."datarep_todos"
JOIN LATERAL (
  SELECT json_build_object('label_color', json_build_array('000100','000300')) as val
) args ON TRUE
WHERE "test"."datarep_todos"."label_color" = ANY(SELECT test.color(json_array_elements(args.val->'label_color')));

 id |         due_at
----+------------------------
  2 | "2018-01-03T00:00:00Z"
(1 row)

Which is similar to the subquery done for #2523 on https://github.com/Screenly/postgrest/blob/8c35fce90f1722c3696c1666acfe1dc8a5d89e36/src/PostgREST/Query/SqlFragment.hs#L302 but without the text[] casting.

@wolfgangwalther
Member

I think it's preferable to give up on that flexibility to keep a simpler interface (just 2 casts).
[...]
@wolfgangwalther Also curious on your thoughts on this.

I disagree.

@aljungberg had a good example:

The formats might even be different. In text maybe a colour is just #ABC and in JSON maybe it's [1, 20, 200]. So it's nice to have that flexibility.

In general, this will be most useful when the JSON representation is "complex", i.e. an array or even more so when it's an object. It's not possible to use them in a nice way in the query string.

@steve-chavez your proposal essentially adds a built-in, non-replaceable text->json transformation for everything that is in the query string. But you can't know about custom representations, so you can't deal with them properly.

Example:

I currently have a cast added to represent tstzrange as the following json:

{
  "lower": ...,
  "upper": ...,
  "includeLower": ...,
  "includeUpper": ...
}

(you'll need to create your own custom tstzrange range type to make this work, because pg will ignore custom casts for built-in types in to_json())

This makes it much nicer to work with those ranges when select=ing them or when POSTing and PATCHing values to the endpoint - I can use the json object style. However, when filtering in the query string, I can use the [lower,upper) pg syntax.
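For context, a rough sketch of that workaround (all names assumed), relying on to_json() honouring a cast to json for non-built-in types:

-- A custom range type plus a cast to json that renders it as an object with
-- explicit bounds, as described above.
CREATE TYPE api_tstzrange AS RANGE (SUBTYPE = timestamptz);

CREATE FUNCTION api_tstzrange_to_json(api_tstzrange) RETURNS json AS $$
  SELECT json_build_object(
    'lower',        lower($1),
    'upper',        upper($1),
    'includeLower', lower_inc($1),
    'includeUpper', upper_inc($1)
  );
$$ LANGUAGE SQL IMMUTABLE;

CREATE CAST (api_tstzrange AS json)
  WITH FUNCTION api_tstzrange_to_json(api_tstzrange) AS IMPLICIT;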

This is essentially the same thing that the data rep feature is going to solve in a more generic way, allowing it for all kinds of types, some of which might already have casts defined etc. - all via domains.

With the proposed change to convert the query string to json - and then apply everything else, this will not work anymore. In fact it will not only prevent the data rep feature from being used in that way, but it will also break my already existing use-case. I won't be able to use [lower,upper) in the query string anymore, because it will be treated as json, not text.

@steve-chavez
Member

I currently have a cast added to represent tstzrange as the following json:
(you'll need to create your own custom tstzrange range type to make this work, because pg will ignore custom casts for built-in types in to_json())

Yeah, I documented the tsrange case here some time ago. I'm also trying to get casts on domains to work in postgres core (see the mailing list; so far they don't seem opposed).

@steve-chavez your proposal essentially adds a built-in, not replaceable text->json transformation for everything that is in the query string

The query string is not text by default, it's unknown. #2523 does unknown->text; I'm suggesting an unknown->json alternative.

On #2310 (comment), @aljungberg mentions that the text cast is a "quirk". Seems to me the most intuitive/expected way is to just deal with json.


The major problem with the text->domain cast is #2125, considering users of client libraries. An example with supabase-js (vue-postgrest and other client libs would present the same problem):

supabase
.from('projects')
.eq('mytsrangecol', 'special-format-w-lower-upper')

Let's say the user does custom json casts, plus gets clever and defines a text->domain cast for handling 'special-format-w-lower-upper'. Then #2125 is done, the library switches to json filters (because 414 “Request-URI Too Long” errors keep surprising users), so the filter will take the json representation. 'special-format-w-lower-upper' won't work anymore. Major breaking change. This would be avoided if we don't differentiate the type of the query string from the json body.

It would be much better to keep client libraries users abstracted from a difference in the query string format. It would prevent the breaking change.

In general, this will be most useful when the JSON representation is "complex"
I won't be able to use [lower,upper) in the query string anymore, because it will be treated as json, not text.

IMHO it'd be better to let that flexibility go in exchange for consistency in behavior (and reduced impact on the codebase). But maybe there's a way to make this work.

We already have the Prefer: params=single-object header (ref).

We can add new ones:

  1. Prefer: params=values-as-text: the Data Representations (#2523) behavior.
  2. Prefer: params=values-as-unknown: the default behavior.
  3. Prefer: params=values-as-media-type: what I'm proposing. I believe it would play better with custom media types later too.

@wolfgangwalther
Member

I don't like the headers.

What about we just use the text->domain cast if it's defined, and if not, we fall back to treating the value as json, possibly applying the cast?

Then every user can decide themselves whether they actually add any of the text casts or not.

@aljungberg
Contributor Author

@steve-chavez I understand the motivation behind your proposal. Treating all user input as JSON would be pretty close to how people use PostgREST today, I imagine, and I appreciate the effort to simplify the interface.

Yet I think overall simplicity would be worse off.

Deep and comprehensive data reps support could empower users to solve their own problems without adding complexity to PostgREST. That's a beautiful kind of simplicity in the product as a whole. I feel like we are actually addressing an underlying source of complexity head on, something that's come up time and time again in the history of this project.

I'm doing a quick search here so please forgive any mistakes but here are examples where deep data reps support would enable a user to solve their own problems, with no added complexity in PostgREST. Note how many of these examples involve filtering and querying -- it's not just about output, but everywhere we touch user data.

(I hope this also serves to support my previous argument that data reps is more than "field reps".)

I think I wrote this before, but if taken to the full extent, CSV, binary downloads, XML, protobuf, none of it would be a special case anymore, not on the input side nor on the output side. PostgREST can be viewed as an HTTP->JSON->SQL->JSON->HTTP service and data reps would turn that into HTTP->SQL->HTTP. If that's not some true top-level simplification, I don't know what is!

Constraining the type of query string field to JSON would hinder the user's ability to tackle diverse scenarios. Although it simplifies in one sense, it adds complexity in another. It leaves this presumption of JSON hardcoded into PostgREST.

Maintaining the minimum assumptions (e.g., query strings are strings) provides power to users. The less we need to decide about the form of data, the more generic our solution. Kind of like in Haskell itself: a strong type system can eliminate lots of code by supporting composition and function reuse. We don't need fmap_for_lists, fmap_for_sets and fmap_for_trees; fmap works for all types that have the appropriate Functor instance. So the parallel here is that we make our query string functionality able to operate on anything that suits the use case, and the type points to the necessary user-provided implementations.

Think of all the future conversations where you can just sit back and relax. The next time someone asks for a new output format, whether that'll be protobuf, yaml or iambic pentameter, you have the answer. Filter by value equal to any cells in the 3rd column of this CSV featuring Eye of Horus? Sure you'll want to use the IN operator on a custom type with a text->csv->subset->hieroglyph decomposer (wadjet) transformation chain...

@aljungberg
Contributor Author

The query string is not text by default, it's unknown. #2523 does unknown->text, I'm suggesting an unknown->json alternative.

Yep. No disagreement there, we're both looking at something of unknown type and trying to figure out how to resolve that into its true identity.

The reason I defaulted to text is that it's a more generous (less restrictive) starting interpretation. It's the only thing we do know for sure. It's text because query strings are text. Even if that turns out just to be like a superclass or a typeclass, or an in transit encoding to be unwrapped, it seems like a fair starting point. Then user code can take it from there if necessary.

@steve-chavez
Member

steve-chavez commented Jun 26, 2023

How about this. On PostgREST/postgrest-docs#652 (comment) I realized we can use our tx-scoped settings to change the casting function.

In text maybe a colour is just #ABC and in JSON maybe it's [1, 20, 200]

With that the above should be possible like:

CREATE OR REPLACE FUNCTION color(json) RETURNS public.color AS $$
  WITH x AS (
    SELECT $1 #>> '{}' AS val
  )
  SELECT
    CASE WHEN current_setting('request.get.params.color', true) IS NOT NULL
      THEN NULL::public.color -- my array format [1, 20, 200] (parsing elided in this sketch)
    ELSE
     -- #ABC
      (('x' || lpad((CASE WHEN SUBSTRING(x.val, 1, 1) = '#' THEN SUBSTRING(x.val, 2) ELSE x.val END), 8, '0'))::bit(32)::int)::public.color
    END
  FROM x;
$$ LANGUAGE SQL IMMUTABLE;

This would involve working on a feature similar to #1710. It should work for different representations of tsrange as well.

Although it simplifies in one sense, it adds complexity in another. It leaves this presumption of JSON hardcoded into PostgREST.
Maintaining the minimum assumptions provides power to users.

Yes, I'm fully in line with that. And actually I'm trying to remove the assumption that query strings are text, which is done on #2523 (code here). This is why my proposal is to act as Prefer: params=values-as-media-type; then we don't really need the text or json assumption. We only cast the query parameters based on the media type, and for application/json that means json casting, per the discussion on #1582 (at the bottom; long thread).

@wolfgangwalther
Member

This is why my proposal is to act as Prefer: params=values-as-media-type, then we don't need the text or json assumption really.

While I understand that you want to somehow simplify #2125 for client libraries... I really don't see any connection between the media type (we are talking about the Content-Type header, right?) and the query string. This header just doesn't make sense to me from an API perspective.

The query string is, in the most generic sense, TEXT. It's not BYTEA or any other binary type - it's text. Specializing a bit more, you could say that the query string is also application/x-www-form-urlencoded, but the values would still be text. JSON is also a subset of TEXT, but a different one.

I think we're talking about two different things here actually:

  • @aljungberg's "quirk" with the TEXT cast is about the values in the query string. The query string itself, as the format that contains both "keys" and "values" is of type application/x-www-form-urlencoded and @aljungberg argues, that the values in that format should be treated as raw text instead of JSON - because that really doesn't work nicely in the querystring with all the escaping.
  • However, @steve-chavez your requirement for #2125 (Implement new SEARCH/QUERY HTTP method to pass arguments in request body for read request) is not about values. It's about the container format. You don't want to use application/x-www-form-urlencoded, but application/json (JSON) to get to the "keys and values". This will allow you to use it in the request body.

The key to supporting both nicely is to make sure that all values in the JSON body for such a SEARCH request are json strings. Nothing more complex, only json strings. Because strings as values are supported by both application/x-www-form-urlencoded and application/json.

We need to differentiate between two types of request bodies here: The request bodies we use currently pass a whole "document" or "entity", where everything in there is the resource. Those should go through the data representation transformation chain as a whole. However, the request bodies passed in a SEARCH request are not that. They use a container format of application/json as a replacement for the size-limited query string. This document should never go through a data representation transformation as a whole. The extracted values (text) should go through it, though. (btw: Why not just use application/x-www-form-urlencoded as the body format for those SEARCH requests? This would make the concept much clearer and be a lot easier for client-libraries to implement: Just build the query string and if it's too long, put it exactly like this in the body..)

If I understood correctly, @steve-chavez, you made two suggestions:

  1. To rewrite the SQL query to deal with the values passed in via query string as a JSON object, too. This would make it much easier to support the SEARCH with json body in a performant way - because you would not have to parse the body at all, but just pass it to the query. (This is a valid point for using JSON as the container here instead of application/x-www-form-urlencoded as mentioned above)
  2. To leave the text->domain cast out.

I think we can still do 1, but not do 2. We can keep the text->domain cast, and you can add it in the query after extracting the values from the json container. In your example above, you'd basically just do json_array_elements_text(args.val->'label_color') instead of json_array_elements(args.val->'label_color') or for single values you'd do args.val->>'label_color' instead of args.val->'label_color'.
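As a sketch of what that would look like in the generated query, assuming a hypothetical test.color(text) function implementing the text->domain cast (overloading the existing test.color(json)):

SELECT
  "test"."datarep_todos"."id",
  test.json("test"."datarep_todos"."due_at") AS "due_at"
FROM "test"."datarep_todos"
JOIN LATERAL (
  -- '000100' comes from `?label_color=eq.000100`
  SELECT json_build_object('label_color', '000100') AS val
) args ON TRUE
-- ->> extracts the value as text, so the text->domain cast still applies
WHERE "test"."datarep_todos"."label_color" = test.color(args.val->>'label_color');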

@steve-chavez
Member

steve-chavez commented Jun 27, 2023

Specializing a bit more, you could say that the query string is also application/x-www-form-urlencoded, but the values would still be text.
However, the request bodies passed in a SEARCH request are not that. They use a container format of application/json as a replacement for the size-limited query string. This document should never go through a data representation transformation as a whole.

@wolfgangwalther Very nice explanation. Fully agree.

In your example above, you'd basically just do json_array_elements_text(args.val->'label_color') instead of json_array_elements(args.val->'label_color') or for single values you'd do args.val->>'label_color' instead of args.val->'label_color'.

Aha, so with that there would be no breaking change on client libraries once #2125 is introduced. The text->domain cast would keep working once the switch from query params filters to body filters is made.

All good then. I'd say let's merge #2839 as is. I'd still like to do some refactoring, but since there's no concern of a breaking change, that can be done later. We can release a v11.2 afterwards for feedback.

@aljungberg
Contributor Author

All good then. I'd say let's merge #2839 as is. I'd still like to do some refactoring, but since there's no concern of a breaking change, that can be done later. We can release a v11.2 afterwards for feedback.

Fantastic!

There'll be some issues that can be closed right away after landing it, I think. All issues of the form "I want to filter by a custom type given in a custom format" and more generally "I want to send or receive type X in format Y".

(And then on the longer roadmap: when/if the path finder content negotiation engine we talked about lands, CSV and XML output etc. will stop being special cases, so that's a nice piece of low-hanging fruit coming up there, taking us on that evolution from HTTP<->JSON<->SQL to a dynamic HTTP<->SQL engine.)

@steve-chavez
Member

I'm adding the docs for the feature on PostgREST/postgrest-docs#655. It's based on the discussions here.

Also, though "data representations" is the abstraction that powers this (+ custom media types later), I thought of calling the feature "domain representations". I find it will be more understandable for users this way and it's probably going to be more googleable (found "data representations" to refer to other things).

@aljungberg
Contributor Author

Also, though "data representations" is the abstraction that powers this (+ custom media types later), I thought of calling the feature "domain representations". I find it will be more understandable for users this way and it's probably going to be more googleable (found "data representations" to refer to other things).

I bet we'll eventually add an alternative non-domain method to employ representations (because as we identified, even using a binary compatible domain cast in a view makes the field non-updateable as far as Postgres is concerned). But okay, I can live with domain reps, it's accurate as things stand. Googleability is definitely a factor.

@steve-chavez
Member

steve-chavez commented Jun 30, 2023

I bet we'll eventually add an alternative non-domain method to employ representations (because as we identified, even using a binary compatible domain cast in a view makes the field non-updateable as far as Postgres is concerned)

Yeah, actually I'm not opposed to including "Data Representations" in the docs. Eventually it could be a superset of:

  • Domain Representations
  • Resource Representations (or media-type representations, for #1582, "Add pgrst.accept setting to RPC"). Although this will require aggregates/functions in addition to domains + casts.
  • Type Representations (for the problem you mention)

"Data Representations" would be a page like API that has other sub pages.

But okay, I can live with domain reps, it's accurate as things stand

Yes, the motivation is being specific for this feature.

@steve-chavez
Member

Note the similarities between data representations and "data independence":

Data independence is therefore defined as the separation of the database user's experience from the database's means of in-memory or on-disk representation.

From project-m36 docs: https://github.com/agentm/project-m36/blob/master/docs/data_independence.markdown

project-m36 also mentions (ref) the drawbacks with views:

Furthermore, SQL views violate the "Principle of Interchangeability" because of the special nature of views. Whether views are updateable or not is at the discretion of the database user. Some SQL views are proveably not updateable such as with summary or report views which intentionally collate data. Therefore, views do not behave identically to tables/relation variables.

Also see https://en.wikipedia.org/wiki/Data_independence.

Data independence is the type of data transparency that matters for a centralized DBMS. It refers to the immunity of user applications to changes made in the definition and organization of data. Application programs should not, ideally, be exposed to details of data representation and storage. The DBMS provides an abstract view of the data that hides such details.

Clearly we're on to something fundamental with domain representations 💥.


Recently I noted that domain representations alone cannot provide two different json representations for the same column. By adding resource representations, we might be able to achieve that like:

create domain app_uuid as uuid;

CREATE CAST (app_uuid AS "application/json") 
WITH FUNCTION "application/json"(app_uuid) AS IMPLICIT;

CREATE CAST (app_uuid AS "application/vnd.co.myjson") 
WITH FUNCTION "application/vnd.co.myjson"(app_uuid) AS IMPLICIT;

@steve-chavez steve-chavez removed the idea Needs of discussion to become an enhancement, not ready for implementation label Aug 28, 2023
@aljungberg
Contributor Author

Note the similarities between data representations and "data independence":

Oh wow, great find! I hadn't come across that term before but it's right.

Recently I noted that domain representations alone cannot provide two different json representations for the same column. By adding resource representations, we might be able to achieve that like:

Right, in that use case you would pretty much have to have a method for the user to specify which of the available reps they want. We wouldn't pick one JSON representation at random if two are available! And the natural fit is, as you say, content negotiation. Although one could also imagine specifying a cast in the PostgREST query string.

This ties into, somewhat, the broader discussion of having alternative ways (beyond setting the domain) to describe the representation you want.
