Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial database schema #53

Draft
wants to merge 73 commits into
base: v5-develop
Choose a base branch
from
Draft

Initial database schema #53

wants to merge 73 commits into from

Conversation

mitchdowney
Copy link
Member

No description provided.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using Serial cleans up the setup for 1) auto-increment, 2) NOT NULL and 3) Unique Primary Keys.
For the Foreign Keys (references), the default behavior is that you will not be able to delete a row if its being referenced in another table. There is nothing wrong with this approach, the only thing I would say is if there are going to be frequent deletes to tables with foreign keys, then the deletes have to be done in a specific order, or you can change the behavior of the foreign key to "CASCADE" on delete/update. This will make sure the other references are updated/deleted as well. There is a potential performance impact here.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the account_id column (referencing the ID in account)...I think you have to have the same "ON DELETE CASCADE" behavior in other tables that are also referencing ID in the account table. I will circle back on this

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for start_time/end_time-is there a reason why you want to use numeric? Is to store epoch time?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the account_id column (referencing the ID in account)...I think you have to have the same "ON DELETE CASCADE" behavior in other tables that are also referencing ID in the account table. I will circle back on this

Ah, I am missing all of the ON DELETE CASCADE rules in the account sql file. I will review the full code and try to determine which are missing, and which are unnecessary.

for start_time/end_time-is there a reason why you want to use numeric? Is to store epoch time?

No, that is a good catch. We only need up to 2 decimal points for start/end time.

Since we only need 2 decimal points, I guess we will use the FLOAT type...however, I suppose we could make it a numeric and limit the numeric to 2 decimal points? While the latter seems more efficient, I suspect numeric values maybe quirkier than floats (I think I saw that both 0 and 0.0 are possible separate numeric values), so I'll go ahead with FLOAT unless we can think of good reasons to go with numeric.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you not just use the datatype "time" there? its format is 00:00:00.

https://www.postgresql.org/docs/current/datatype-datetime.html

Copy link
Member Author

@mitchdowney mitchdowney Aug 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I hadn't thought of that. I guess we could use it since it has microsecond resolution? The only reason I can think of possibly not however, is that when parsing "time position" values from RSS feeds, or using them in a media player (ex. start playback from 90 seconds in), the values will be integers or decimals, not hh:mm:ss:ms...so I'm guessing we would need to convert those values to and from hh:mm:ss:ms time?

If there isn't an advantage, like a performance benefit, for using datatype "time", then I think I'd prefer to use "float".

Copy link

@randoneering randoneering Aug 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know code reuse is something we all strive for, which 'INHERITS' allows for, but this can cause some performance overhead because you will be querying more than one table. https://www.postgresql.org/docs/current/tutorial-inheritance.html. I generally don't use this in my tables, only because I don't mind writing out the same columns over (and over) again. Sure, it makes for more reading, but it isn't going to cause any potential performance hit that 'INHERITS' does. But, as the documentation says, its great for object-oriented databases. I would suggest keeping this in place for now, but if you feel like you want to start squeezing performance out of the database-I would start here.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am coming back from speaking with my colleague and I think it would be wise to not use inherit. Here is why;

By using inherit, you are essentially created a link between two tables for the sake of cleaning up code (not having to write out the same code to create the same type of columns in several tables). The postgres documentation actually gives a good example using 'Cities' table and 'Capitals' table. The 'Capitals' table inherits the 'Cities' columns plus addition columns added specific to 'Capitals'. The query hitting the Cities table will then, essentially be querying two tables (not just Cities). If you are to look for cities that are not capitals, you then have to use specific syntax of 'FROM ONLY CITIES'. Seems kind of silly when you could just essentially have a cities table with a boolean column labeled 'capital' with a default of false.

So-TLDR-I would avoid using inherit and just write out the code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very glad you and your colleagues looked into this. The INHERITS usage is totally new to me, and I'm not sure if it would help or hurt more long run...

Part of why I used INHERITS is not only for code usability, but because we have concepts like a "playlist", and a playlist can have multiple different table types associated with it. For example, a playlist consists of playlist_items, and each playlist_item can have a foreign key relationship with 4 different data types: item, clip, chapter, soundbite.

I was thinking that if we use inherits, then when it is time to query the database for ALL of the playlist_items for a account's/user's playlist, then we could query the playlist_item_base table using the playlist_id, and use UNION so each playlist_item returned would select the different columns that correspond with the different tables that inherit playlist_item_base.

Does that make sense, or am I making incorrect assumptions? Is there another / better way to do this besides this INHERITS plus UNION approach?

(P.S. I know the table names may semantically sound like they can be consolidated into one table, like why can't "clip" and "soundbite" be the same table? But these are distinct entities according to RSS spec, so I'd suggest ignoring the apparent similarities for now.)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I mentioned this in the matrix channel, but I think the idea would be less on the table setup and more in the query to get the data you need. Because you can get the same values using joins, and if there are indexes on those tables, the performance should be better than using INHERITS .

For example

SELECT a.column, b.column
FROM table_1 a
INNER JOIN table_2 b on a.id = b.id
WHERE a.column <> 42

But this would mean your sql queries will need to do the work with the joins, and not using the backend "magic" of INHERITS

Copy link
Member Author

@mitchdowney mitchdowney Aug 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, if the performance should be better than using inherits, we're definitely trying to choose performance wherever possible. If not using inherits allows that, I'd like go with that option, but I'd like to explain the use case one more time to be sure, before I stop using inherits.

Our playlists will have "playlist items" in them like this, of 5 potentially different types:

  • clip
  • item (episode)
  • chapter
  • soundbite
  • item (episode)
  • chapter
  • clip
  • add_by_rss_item (episode)
  • etc

We'll need to make paginated queries, for example for the first 20 of those playlist items for a specific playlist, ordered by the position column.

Also, when updating a position of a playlist item (ex. the playlist item at position 7 moves to position 3), we will need to get the playlist items with the next highest and lowest position values, so that the playlist item in the new position can be given a position value in between those 2 values.

If there is a playlist_item_base table, then I was thinking it might be more performant to query that table for the items ordered by position, rather than querying all 5 of the playlist item tables, and ordering the items by position, and then somehow stitching the requested range (ex. the next 20 playlist items by position from a playlist) of items together and returning them.

Do you think that querying the 5 tables separately could be more performant than a base table for this purpose? (Part of my problem may be that I don't yet know what a query that can stitch and return this paginated data together would look like.)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Id say stick with the inherit since you have already been investing in the design (in time and effort). What we could do down the line would be to put time into writing the query to get what you need and compare the performance. Ultimately, you want to do something that not just you will be able to understand, and inherit gives people that breadcrumb trail. Keep on keeping on.


CREATE OR REPLACE FUNCTION check_account_id_uniqueness()
RETURNS TRIGGER AS $$
BEGIN

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good use case for a UNION instead of a JOIN here. Nice work.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you :)

That said...we are actually discussing possibly merging all 3 of these tables...

history
now_playing
queue

And making them one table, where the position would be negative for history items, 0 for now playing item, and positive for upcoming queue items.

If that's the case, then I think we may not need a union like this.

CREATE TABLE queue_content_base (
id SERIAL PRIMARY KEY,
account_id INTEGER NOT NULL REFERENCES account(id) ON DELETE CASCADE,
position numeric_20_11 NOT NULL

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again-for the numeric data type. Are you anticipating the numbers to be large for position of "queued content?" What would this table hold specifically? Queued uploads to rss feeds? Just trying to clarify things for myself !

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, and I got the idea for this blog post. @suorcd and I have discussed this quite a bit, and aren't 100% set on if we want to go this way or not.

The reasoning is basically 1) using integer for position can lead to problems, as reordering items in the list means every item after that item also needs its position updated. 2) using a decimal column allows you to reorder by fitting a reordered item between 2 positions, by calculating the middle point between the two.

An alternative to this could be a jsonb column that maintains the position order data on a higher order table (like playlist), which may be doable for our use case, but we'll lose some referential benefits.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this particular case, I think you are good to use the numeric datatype. Looking at this again, it makes more sense from a positioning perspective.

('mixed'), ('podcastL'), ('musicL'), ('videoL'), ('filmL'), ('audiobookL'), ('newsletterL'), ('blogL'), ('publisherL'), ('courseL')
;

CREATE TABLE medium_base (

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AH there we go-found the table :) Ignore my comment about missing the 'Medium' table

Copy link

@randoneering randoneering left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to still review the 0001_init_podcasting_20_database.sql file. I will do this tomorrow if I get a chance.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After speaking with my colleagues, what you are doing here makes sense-you are enforcing constraints/checks on commonly used columns that will be amongst your tables. You are essentially creating 'global check constraints.' So, in my eyes, its a great way to keep data integrity. You could also do the same thing per table, setting constraints on the columns by each table (instead of doing the domain). However, its just another way to 'skin a cat'. This just makes it easier across the board.

Some examples my colleague brought up was enforcing format of birthdays or email addresses.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for checking, and yes exactly. A DOMAIN like varchar_url I think is convenient because we can easily ensure each url field is limited to 2083 characters and checks it begins with http:// or https://

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants