Feature request: transformations to format and parse values in the API #2310
Comments
@aljungberg Sorry for the late reply here.
Yes, I've done the above before. I used a view to get a geojson format and then RULEs (INSTEAD OF triggers would be better) to insert geojson and convert it to the geometry type.
Now, this is a different issue. If the foreign key column is transformed, then yes you'd lose the ability to use it for embedding. However, we have #2144, where you can define an embed operation manually in SQL (with a computed column, where you can transform your fk col). I think that would solve your issue. WDYT? |
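For illustration, a minimal sketch of that #2144-style manual embed; the orders/customers tables, the formatted_customer_id column and the parse_customer_id helper are all hypothetical names, not from the thread:

-- a computed "column" that resolves the transformed fk back to the target
-- row, so embedding can be defined manually in SQL
CREATE FUNCTION customer(o orders) RETURNS SETOF customers ROWS 1 AS $$
  SELECT * FROM customers c
  WHERE c.id = parse_customer_id(o.formatted_customer_id);
$$ LANGUAGE SQL STABLE;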
I think this is a very reasonable request and I have faced the same challenge in the past. What I tried was using a domain with a cast to its base type - but Postgres ignores casts on domains.
BUT: It is possible to create that cast. What if we were to parse all columns for exposed tables/views, check their types, and if the type is a domain, see whether there is a cast between the domain and its base type available? That should be possible in a query for the schema cache. We could then apply the functions referenced in the cast as formatters / parsers on the input - thus allowing the actual queries to use the underlying type, use indexes, etc. |
Yes that would absolutely work if it's technically possible. Since the domain type is ignored for casting, I hope that doesn't also mean it's difficult/impossible to query such casting definitions. This sounds like an almost perfect solution actually. It's just what one would expect to happen when defining such casts, so it's completely intuitive. (And I wish Postgres would do it automatically.) |
After some light research on this: yes, the casts can be read from pg_cast. So if the table has a domain type like the color example below, we can look up the casts between the domain and json and apply the referenced functions. Optionally, to make it harder to unintentionally trigger this feature (path of least surprise) we could require that the cast be created in a specific way.
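A sketch of what that schema-cache lookup could be; pg_cast, pg_type and pg_proc are the real catalogs involved, though the exact query shape here is only an assumption:

SELECT src.typname AS domain_name,
       p.proname   AS cast_function,
       c.castcontext -- 'i' = implicit, 'a' = assignment, 'e' = explicit
FROM pg_cast c
JOIN pg_type src ON src.oid = c.castsource
JOIN pg_type tgt ON tgt.oid = c.casttarget
JOIN pg_proc p   ON p.oid   = c.castfunc
WHERE src.typtype = 'd'               -- source type is a domain
  AND tgt.typname IN ('json', 'jsonb');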
In this case we're taking the value and assigning it to a "column" in the output of JSON type, so semantically the ASSIGNMENT cast seems like the right fit. |
Oh, that's better than what I suggested above.
Hm. Maybe for the cast from the domain to json I would turn it around and only use the cast if it was marked implicit. Because that's what we're doing, right? We're not requiring the user to specify that cast anywhere. |
Okay, having had a play with this it looks reasonably straightforward. I think we may implement this feature for our use at Screenly and submit a pull request. It feels like a feature useful to almost anyone: massaging data for API input and output is something everyone needs to do from time to time. |
Okay, one quirk came up: in the incoming direction, API->Postgres, we may need two casts defined: one from json to the domain (for values arriving in request bodies), and one from text to the domain (for values arriving in the query string, which is text by definition).
I think requiring 3 casts total is preferable to requiring that the user JSON encode their filters by hand before using them in a query (you'd have to do todos?color=%22%23C0FFEE%22 if you wanted valid JSON). |
Ah, interesting - hadn't thought of that. Yes, I agree. |
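Sketched concretely for the colour example (function bodies elided; the text overload is the hypothetical third cast for query string filters):

CREATE CAST (public.color AS json) WITH FUNCTION json(public.color); -- format output
CREATE CAST (json AS public.color) WITH FUNCTION color(json);        -- parse request bodies
CREATE CAST (text AS public.color) WITH FUNCTION color(text);        -- parse query string filters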
I have this working for read queries and where filters. The input case is made a little challenging by json_to_recordset, which turns the request body into rows.

A type error comes up even if we don't try to select that column at all. It feels like the solution is to switch the declared type of such columns to json in the column definition list, so the parse function can be applied afterwards. But I'm not familiar enough with all the ways incoming data might look. There could be edge cases I haven't considered, like maybe nested stuff, arrays, compound values. Any thoughts? |
From the notes on json_to_recordset in the Postgres docs, there is definitely some recursive parsing going on here based on the output column types. However, from my reading of the docs the conversion is driven entirely by the declared type of each output column. In the cases where we do override the record column type to be json, the raw json value is handed through unconverted. So at this stage we'd need to carry the column types in the query builder, so we know which columns to override. |
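A minimal illustration of that type-driven parsing (not PostgREST's actual generated SQL):

SELECT * FROM json_to_recordset('[{"id": 1, "label_color": "000100"}]')
  AS t(id bigint, label_color json);
-- declaring label_color as json hands the raw json value through, which a
-- parse function like test.color(json) could then consume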
Seems we can avoid the text cast entirely:

-- final shape of the query
SELECT
"test"."datarep_todos"."id",
test.json("test"."datarep_todos"."due_at") AS "due_at"
FROM "test"."datarep_todos"
JOIN LATERAL (
SELECT json_build_object('label_color', '000100') as val -- comes from `?label_color=eq.000100`
) args ON TRUE
WHERE "test"."datarep_todos"."label_color" = test.color(args.val->'label_color');
id | due_at
----+------------------------
2 | "2018-01-03T00:00:00Z"
-- using the test fixtures from PostgREST/postgrest#2523
DROP DOMAIN IF EXISTS public.color CASCADE;
CREATE DOMAIN public.color AS INTEGER CHECK (VALUE >= 0 AND VALUE <= 16777215);
CREATE TABLE datarep_todos (
id bigint primary key,
name text,
label_color public.color default 0,
due_at public.isodate default '2018-01-01'::date,
icon_image public.bytea_b64,
created_at public.unixtz default '2017-12-14 01:02:30'::timestamptz,
budget public.monetary default 0
);
-- this CAST function is slightly changed from the one in PostgREST/postgrest#2523
CREATE OR REPLACE FUNCTION color(json) RETURNS public.color AS $$
WITH x AS (
SELECT $1 #>> '{}' AS val
)
SELECT (('x' || lpad((CASE WHEN SUBSTRING(x.val, 1, 1) = '#' THEN SUBSTRING(x.val, 2) ELSE x.val END), 8, '0'))::bit(32)::int)::public.color
FROM x;
$$ LANGUAGE SQL IMMUTABLE;
CREATE OR REPLACE FUNCTION json(public.color) RETURNS json AS $$
SELECT
CASE WHEN $1 IS NULL THEN to_json(''::text)
ELSE to_json('#' || lpad(upper(to_hex($1)), 6, '0'))
END;
$$ LANGUAGE SQL IMMUTABLE;
CREATE CAST (public.color AS json) WITH FUNCTION json(public.color) AS IMPLICIT;
CREATE CAST (json AS public.color) WITH FUNCTION color(json) AS IMPLICIT;

Any concerns? So far this seems to work ok. I'll try to apply this change in the WHERE clause only for the "field representations" feature. (I thought "field representations" is a better name because we already have computed fields and it doesn't clash with resource representations.) |
Well it's the one I noted which is that using JSON in your query string is not very nice UX: "I think requiring 3 casts total is preferable to requiring that the user JSON encode their filters by hand before using them in a query (you'd have to do todos?color=%22%23C0FFEE%22 if you wanted valid JSON)."
But in the body you can just specify your content type as json, right? I think query string parameters are kind of text by definition OTOH, so using a data representation seems like a very natural fit. You have data in one format, you want it in another... that's the whole feature! The formats might even be different. In text maybe a colour is just C0FFEE, while in JSON it's "#C0FFEE". I already implemented it in the PR so it shouldn't add any more work, I think.
Right, but data representations don't apply to fields (or columns) specifically. You can use them to represent fields, sure, but also the output of procedures, complex types pulled from multiple fields, query parameters as mentioned above and, like we discussed in #2523, they could indeed apply to entire tables/relations (describing e.g. how to transform table X into CSV). And they'll in fact not clash with resource representations since they're a superset of resource representations. Indeed, if one goes all the way with #2523, data reps are the new resource representations. |
@aljungberg Oh, I'm not proposing to force the user to do JSON on the query string. So you'd still use plain text:

GET /datarep_todos?label_color=eq.000100

But internally we convert the query parameters to json. Note the json_build_object in the query below:

-- final shape of the query
SELECT
"test"."datarep_todos"."id",
test.json("test"."datarep_todos"."due_at") AS "due_at"
FROM "test"."datarep_todos"
JOIN LATERAL (
-- '000100' comes from `?label_color=eq.000100`
SELECT json_build_object('label_color', '000100') as val
) args ON TRUE
WHERE "test"."datarep_todos"."label_color" = test.color(args.val->'label_color');
We only need to use the already defined json casts. This should offer better DX because you only worry about the casting from/to json. |
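A quick sanity check of that, reusing the test.color(json) parser defined earlier in the thread:

SELECT test.color(json_build_object('label_color', '000100') -> 'label_color');
-- returns 256 (0x000100), exactly what the same parser yields for a body field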
Btw, I'm continuing your great work on #2839. If we agree, then the remaining task for merging would be removing the need for the text casting. |
I think it's preferable to give up on that flexibility in exchange for a simpler interface (just 2 casts). The tests on #2523 also operate under the assumption that only the json casts exist. (Maybe it's also possible to integrate field representations with resource representations later, if we hardcode the query parameters to json.)

@wolfgangwalther Also curious on your thoughts on this. |
Hm, one thing that seems unsolvable with my above proposal is the in filter.

Edit: Scratch the above, the in filter can be done too. Basically using json_array_elements:

SELECT test.color(json_array_elements(json_build_array('000100','000300'))) as val;
val
-----
256
768
(2 rows)

The final query would be like:

-- '000100','000300' would come from the `in` filter: ?label_color=in.(000100, 000300)
SELECT
"test"."datarep_todos"."id",
test.json("test"."datarep_todos"."due_at") AS "due_at"
FROM "test"."datarep_todos"
JOIN LATERAL (
SELECT json_build_object('label_color', json_build_array('000100','000300')) as val
) args ON TRUE
WHERE "test"."datarep_todos"."label_color" = ANY(SELECT test.color(json_array_elements(args.val->'label_color')));
id | due_at
----+------------------------
2 | "2018-01-03T00:00:00Z"
(1 row)

Which is similar to the subquery done for #2523 on https://github.com/Screenly/postgrest/blob/8c35fce90f1722c3696c1666acfe1dc8a5d89e36/src/PostgREST/Query/SqlFragment.hs#L302, but without the text cast. |
I disagree. @aljungberg had a good example:
In general, this will be most useful when the JSON representation is "complex", i.e. an array, or even more so when it's an object. It's not possible to use them in a nice way in the query string. @steve-chavez your proposal essentially adds a built-in, not replaceable text->json transformation for everything that is in the query string. But you can't know about custom representations, so you can't deal with them properly. Example: I currently have a cast added to represent tstzrange values as objects like: {
"lower": ...,
"upper": ...,
"includeLower": ...,
"includeUpper": ...
}

(you'll need to create your own custom tstzrange range type to make this work, because pg will ignore custom casts for built-in types when converting to json). This makes it much nicer to work with those ranges on the client side. This is essentially the same thing that the data rep feature is going to solve in a more generic way, allowing it for all kinds of types, some of which might already have casts defined etc. - all via domains. With the proposed change to convert the query string to json - and then apply everything else - this will not work anymore. In fact it will not only prevent the data rep feature from being used in that way, but it will also break my already existing use-case. I won't be able to use my custom representation in query string filters anymore. |
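For reference, a hedged sketch of the kind of cast being described; the comment above uses a custom range type, while a domain is used here purely for illustration, and api_tstzrange plus the function name are assumed:

CREATE DOMAIN api_tstzrange AS tstzrange;

CREATE FUNCTION json(api_tstzrange) RETURNS json AS $$
  SELECT json_build_object(
    'lower',        lower($1),
    'upper',        upper($1),
    'includeLower', lower_inc($1),
    'includeUpper', upper_inc($1)
  );
$$ LANGUAGE SQL IMMUTABLE;

CREATE CAST (api_tstzrange AS json) WITH FUNCTION json(api_tstzrange) AS IMPLICIT;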
Yeah, I documented the tsrange case here some time ago. I'm also trying to get casts on domains to work in postgres core (see the mailing list; so far they don't seem opposed).
The query string is not json today. On #2310 (comment), @aljungberg mentions that query string parameters are text by definition. The major problem would be with the supabase client libraries:

.from('projects')
.eq('mytsrangecol', 'special-format-w-lower-upper')

Let's say the user does custom json casts, plus gets clever and defines a special text format. It would be much better to keep client library users abstracted from a difference in the query string format. It would prevent the breaking change.
IMHO it'd be better to let that flexibility go in exchange for consistency in behavior (and reduced impact on the codebase). But maybe there's a way to make this work. We already have the Prefer header; we can add new ones to select the behavior. |
I don't like the headers. What about we just use the text cast whenever it's defined? Then every user can decide themselves whether they actually add any of the text casts or not. |
@steve-chavez I understand the motivation behind your proposal. Treating all user input as JSON would be pretty close to how people use PostgREST today, I imagine, and I appreciate the effort to simplify the interface. Yet I think overall simplicity would be worse off. Deep and comprehensive data reps support could empower users to solve their own problems without adding complexity to PostgREST. That's a beautiful kind of simplicity in the product as a whole. I feel like we are actually addressing an underlying source of complexity head on, something that's come up time and time again in the history of this project. I'm doing a quick search here so please forgive any mistakes but here are examples where deep data reps support would enable a user to solve their own problems, with no added complexity in PostgREST. Note how many of these examples involve filtering and querying -- it's not just about output, but everywhere we touch user data.
(I hope this also serves to support my previous argument that data reps is more than "field reps".) I think I wrote this before but if taken to the full extent, CSV, binary downloads, XML, protobuf, none of it would be a special case anymore, not on the input side nor on the output side. PostgREST can be viewed as a HTTP->JSON->SQL->JSON->HTTP service and data reps would turn that into HTTP->anything->SQL->anything->HTTP. Constraining the type of query string fields to JSON would hinder the user's ability to tackle diverse scenarios. Although it simplifies in one sense, it adds complexity in another. It leaves this presumption of JSON hardcoded into PostgREST. Maintaining the minimum assumptions (e.g., query strings are strings) provides power to users. The less we need to decide about the form of data, the more generic our solution. Kind of like in Haskell itself. A strong type system can eliminate lots of code by supporting composition and function reuse. We don't need special cases when the general machinery handles them all. Think of all the future conversations where you can just sit back and relax. The next time someone asks for a new output format, whether that'll be protobuf, yaml or iambic pentameter, you have the answer. Filter by value equal to any cell in the 3rd column of this CSV featuring Eye of Horus? Sure, you'll want to use the IN operator on a custom type with a custom parser. |
Yep. No disagreement there, we're both looking at something of unknown type and trying to figure out how to resolve that into its true identity. The reason I defaulted to text is that it's a more generous (less restrictive) starting interpretation. It's the only thing we do know for sure. It's text because query strings are text. Even if that turns out just to be like a superclass or a typeclass, or an in-transit encoding to be unwrapped, it seems like a fair starting point. Then user code can take it from there if necessary. |
How about this. On PostgREST/postgrest-docs#652 (comment) I realized we can use our tx-scoped settings to change the casting function.
With that, the above should be possible like:

CREATE OR REPLACE FUNCTION color(json) RETURNS public.color AS $$
WITH x AS (
SELECT $1 #>> '{}' AS val
)
SELECT
  CASE WHEN current_setting('request.get.params.color', true) IS NOT NULL
    -- my array format [1, 20, 200]; an assumed r/g/b parse, for illustration
    THEN ((($1->>0)::int << 16) | (($1->>1)::int << 8) | ($1->>2)::int)::public.color
  ELSE
    -- #ABC
    (('x' || lpad((CASE WHEN SUBSTRING(x.val, 1, 1) = '#' THEN SUBSTRING(x.val, 2) ELSE x.val END), 8, '0'))::bit(32)::int)::public.color
  END
FROM x;
$$ LANGUAGE SQL IMMUTABLE;
This would involve working on a feature similar to #1710. It should work for different representations of the same type.
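A hypothetical invocation of the setting-aware parser above, simulating the query-string case; the GUC name is the one assumed in the sketch:

BEGIN;
SET LOCAL request.get.params.color = '1';  -- PostgREST would set this per request
SELECT color('[1, 20, 200]'::json);        -- takes the query-string (array) branch
COMMIT;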
Yes, I'm fully in line with that. And actually I'm trying to remove the assumption that query strings are anything other than text. |
While I understand that you want to somehow simplify #2125 for client libraries... I really don't see any connection between the media-type we are talking about there and the query string. The query string is, in the most generic sense, just text. I think we're talking about two different things here actually: filters passed in the query string, and data passed in the request body.

We need to differentiate between two types of request bodies here: the request bodies we use currently pass a whole "document" or "entity", where everything in there is the resource. Those should go through the data representation transformation chain as a whole. However, the request bodies passed in a filtering request are really a collection of query-string-style arguments, and the key to supporting both nicely is to make sure that all values in the JSON body for such a request are treated the same way as query string values.

If I understood correctly, @steve-chavez, you made two suggestions: (1) wrap the query string values in a json container, and (2) drop the text cast in favour of the json cast. I think we can still do 1, but not do 2. We can keep the text->domain cast, and you can add it in the query after extracting the values from the json container. In your example above, you'd basically just do args.val->>'label_color' (note ->>, which extracts text) and apply the text parser to that. |
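Concretely, keeping suggestion 1 but not 2 could generate something like the following, where test.color(text) is an assumed text-parsing overload:

SELECT d.id
FROM test.datarep_todos d
JOIN LATERAL (
  SELECT json_build_object('label_color', '000100') AS val
) args ON TRUE
-- ->> extracts the value as text, so the user's text->domain parser still applies
WHERE d.label_color = test.color(args.val ->> 'label_color');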
@wolfgangwalther Very nice explanation. Fully agree.
Aha, so with that there would be no breaking change on client libraries once #2125 is introduced. The query string stays text. All good then. I'd say let's merge #2839 as is. I'd still like to do some refactoring, but since there's no concern of a breaking change that can be done later. We can release a pre-release to try it out. |
Fantastic! There'll be some issues that can be closed right away after landing it, I think. All issues of the form "I want to filter by a custom type given in a custom format" and more generally "I want to send or receive type X in format Y". (And then on the longer roadmap: when/if the path finder content negotiation engine we talked about is landed, CSV and XML output etc will stop being special cases, so that's a nice low hanging fruit coming up there, taking us on that evolution from HTTP->JSON->SQL->JSON->HTTP to HTTP->anything->SQL->anything->HTTP.) |
I'm adding the docs for the feature on PostgREST/postgrest-docs#655. It's based on the discussions here. Also, though "data representations" is the abstraction that powers this (+ custom media types later), I thought of calling the feature "domain representations". I think it will be more understandable for users this way and it's probably going to be more googleable (I found "data representations" being used to refer to other things). |
I bet we'll eventually add an alternative non-domain method to employ representations (because as we identified, even using a binary compatible domain cast in a view makes the field non-updateable as far as Postgres is concerned). But okay, I can live with domain reps, it's accurate as things stand. Googleability is definitely a factor. |
Yeah, actually I'm not opposed to including "Data Representations" in the docs. Eventually it could be a superset of domain representations plus the custom media types.

"Data Representations" would be a page like API that has other sub pages.

Yes, the motivation is being specific for this feature. |
Note the similarities between data representations and "data independence". From the project-m36 docs: https://github.com/agentm/project-m36/blob/master/docs/data_independence.markdown

project-m36 also mentions (ref) the drawbacks with views.
Also see https://en.wikipedia.org/wiki/Data_independence.
Clearly we're on to something fundamental with domain representations 💥. Recently I noted that domain representations alone cannot provide two different json representations for the same column. By adding resource representations, we might be able to achieve that like:

create domain app_uuid as uuid;
CREATE CAST (app_uuid AS "application/json")
WITH FUNCTION "application/json"(app_uuid) AS IMPLICIT;
CREATE CAST (app_uuid AS "application/vnd.co.myjson")
WITH FUNCTION "application/vnd.co.myjson"(app_uuid) AS IMPLICIT; |
Oh wow, great find! I hadn't come across that term before but it's right.
Right, in that use case you would pretty much have to have a method for the user to specify which of the available reps they want. We wouldn't pick one JSON representation at random if two are available! And the natural fit is, as you say, content negotiation. Although one could also imagine specifying a cast in the PostgREST query string. This ties into, somewhat, the broader discussion of having alternative ways (beyond setting the domain) to describe the representation you want. |
Summary: a way to have PostgREST apply formatting and parsing functions automatically on a per-column basis.
Problem Description
Often the external data representation as seen by our customers (consumers of the API) is different from how we store it internally. For example, perhaps we store a colour as an int4 integer but customers see CSS colour hex strings like 'ff00fa' when interacting with the API. The storage is an implementation detail.

Today, there seem to be only three ways to handle this scenario with PostgREST, each with some rather inconvenient drawbacks:

1. Create a true custom type (CREATE TYPE colour (INTERNALLENGTH = 4, INPUT = my_colour_parser, OUTPUT = my_colour_formatter)) then use that as your column type; or
2. format the column in a view (SELECT colour_format(colour) AS colour); or
3. rewrite requests and responses at the proxy level (body_filter_by_lua_block etc).
etc)The first solution is awkward because it lets "how the data is presented" dictate "how the data is stored" which feels backwards. What if the data is presented two different ways in two different APIs? Maybe this is solvable using views that cast the column into a different presentation type, but there's a bigger problem. The solution is plain impossible with cloud hosting like Amazon's and Google SQL which don't allow you to install custom C code extensions nor even create "shell types" with
CREATE TYPE colour;
The second solution makes that column not updatable since Postgres doesn't know how to reverse the transform. This can be worked around using INSTEAD OF triggers at some complexity cost. But also when filtering by this column, we get full table scans for the same reason: Postgres must call colour_format(x) on every row in the source table of the view when evaluating WHERE, JOIN etc. The performance loss here can be avoided with a computed index, or using a materialised generated column. There's a third problem though: if the formatted column is used as a foreign key, PostgREST can no longer detect that relationship and embedding breaks. Embedding is nice.

The third solution, proxy level rewriting, is inefficient. Now you have to parse the JSON of each response from PostgREST at the Lua layer, find the relevant fields and rewrite them, and then serialise back to JSON again. What if the response is large? You have to buffer the whole thing in nginx before you can parse it. JSON parsing isn't free in terms of CPU usage. And it feels so wasteful for Postgres to encode the JSON only for Lua to immediately decode it. Similarly, for incoming POSTs, PATCHes etc. you have to decode the incoming JSON, rewrite and re-encode.
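For instance, the computed-index workaround mentioned for the second solution might look like this (hypothetical names; colour_format would have to be IMMUTABLE to be indexable):

CREATE INDEX things_colour_api_idx ON things (colour_format(colour));
-- WHERE colour_format(colour) = 'ff00fa' can now use an index scan
-- instead of a full table scan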
Furthermore, some of PostgREST's own features might make the rewriting hard. You can select a column and then rename it with query parameter syntax. That's a cool feature, but now how will the Lua code know which fields it needs to reformat if they can be named anything? We'd have to parse the query string and understand it in detail. PostgREST's query syntax is complex, so this will be a source of bugs and surprises. Similarly with embedding, the fields we care about can appear in unexpected places.
And on the note of query strings, it gets messy for filtering. If the client submits a request like thing?colour=in.(ff00f0,ff00dd), again Lua has to parse the query string, understand what is what, and then apply the appropriate reverse transformation.

Resource embedding further increases the requirement for a complete understanding of the PostgREST query language. It's not very DRY, is it, reimplementing all of PostgREST's query parsing in Lua? That has to then be maintained going forward as PostgREST evolves.
This problem also relates to date and time formatting and probably a number of issues on querying by tsrange and whatnot.

Proposal
Unlike Lua and OpenResty, PostgREST knows both its own query language syntax and the types of the columns it works with (from the cache). I'd like to propose a method to ask PostgREST to always apply some_parser(x) on incoming values for a certain column and format_for_api(x) for outgoing values before putting them into json_agg.

Here are two alternative ideas to accomplish this; I'm not sure which is better.
Automatically use parse and format functions for certain domain types

1. Define the representation type with CREATE DOMAIN (bypassing the cloud hosting ban on true custom types).
2. In your view, cast the column to it: SELECT colour::colour_representation_type AS colour.
3. PostgREST notices the output column is of type colour_representation_type, that there exists a function (named along the lines of pgrsttransformer) that accepts a single parameter of colour_representation_type and returns json, and there exists a reverse function which takes json and returns the representation type.
4. The reverse function is applied to incoming values for that column, including values in WHERE statements.

Automatically use parse and format functions for specially named columns
1. In your view, rename the column with a magic suffix: SELECT colour AS colour__pgrsttransformer1.
2. When a selected column has a __pgrsttransformerX suffix, PostgREST does something like SELECT pgrsttransformer1(colour) AS colour.
3. When an incoming value targets a column colour and there exists colour__pgrsttransformer1 in the target table, apply pgrsttransformer1(incoming) (the function would be user defined and polymorphic to go forwards or backwards depending on the type).
4. The same applies to values in WHERE statements.

Discussion
I realise this is a big feature request. I think it will make PostgREST significantly more flexible and powerful without adding much complexity to PostgREST itself, though.
It is a way to solve this class of problem in general without the significant drawbacks of the above-mentioned workarounds. (The other way would be to convince the Postgres maintainers to allow casting to and from domains, but alas such casts are ignored today.)
Transforming values to be more friendly for API consumers seems like a very ordinary thing to do. I'm surprised it hasn't come up more often. Other examples of such column types needing formatting and parsing for API use could be: dates in ISO format, unix timestamps, base64-encoded binary, monetary values, or a fraction like 1/3 stored as a complex type.

The Lua workaround is "ok" if you don't care about the performance impact, but having to reimplement a complicated part of PostgREST (its query language) in Lua just seems like crossing the stream to get to water. That code already exists, it's in PostgREST!
Related:
https://stackoverflow.com/questions/72537716/with-postgrest-convert-a-column-to-and-from-an-external-encoding-in-the-api?noredirect=1#comment128140735_72537716