Replies: 1 comment
TL;DR: The example described is a great use-case for Materialize (because it is unbounded change on bounded sets of data), and it handles it well. But that shouldn't be too surprising because this is not the type of "late-arriving data" that typically causes issues with streaming systems. This answer can be broken into a few parts:
1. How would Materialize handle the exact scenario described above?

This is very straightforward. As humans interpreting the data we might classify it as "late-arriving," but to a stream processor it's just more change events. To recap: You have an

Originating from a database

If the
In a properly functioning setup*, Materialize will treat this like an update because it already has records on

*Properly functioning = Tools like Debezium avoid the issues that come from connecting to a database that already has state and activity: they start with a big snapshot and replay of the "current state" of the database before switching to streaming live changes.

Originating from a stream

If your events originate from a stream, the answer is mostly the same, with the caveat that if you need Materialize to have access to the entire history of your stream, you'll either need:
2. How does Materialize generally handle time?

"Time" in Materialize is something you get to define, either explicitly by specifying it in your CDC input, or implicitly by letting us pick a timestamp. If you let us do it, data are never "late" (like wizards), and if you define times yourself, we will hold back results until we are certain they are correct (part of the CDC input is a set of statements about the completion of times, like transaction boundaries in logical replication).

3. Are there other kinds of late-arriving data issues that are more difficult to handle?

TBD

For additional reference, see: Eventual consistency isn't for streaming
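To make part 1 a bit more concrete, here is a toy Python sketch of the idea that a status change on an old order is "just more change events." It is not how Materialize is implemented; the `Order` and `DisputedRevenueView` names are made up for illustration, and it assumes each change event carries the full current row (including `amount`), as a CDC feed from a database typically would. An update is handled as a retraction of the previous row for that key plus an insertion of the new one, so the running sum stays correct no matter how old the order is.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Order:
    # Hypothetical row shape; assumes every change event carries the full row.
    order_id: int
    status: str
    amount: Decimal


class DisputedRevenueView:
    """Toy model of an incrementally maintained
    SUM(amount) over orders WHERE status = 'disputed'."""

    def __init__(self):
        self.current = {}          # latest known row per order_id (the bounded set)
        self.total = Decimal("0")  # running sum over disputed orders

    def apply(self, new_row):
        """Apply one change event and return the updated sum."""
        old_row = self.current.get(new_row.order_id)
        # Retract the old row's contribution, if it had one.
        if old_row is not None and old_row.status == "disputed":
            self.total -= old_row.amount
        # Insert the new row's contribution, if it has one.
        if new_row.status == "disputed":
            self.total += new_row.amount
        self.current[new_row.order_id] = new_row
        return self.total


view = DisputedRevenueView()
view.apply(Order(1, "pending", Decimal("40.00")))
view.apply(Order(2, "pending", Decimal("15.50")))
view.apply(Order(1, "complete", Decimal("40.00")))
# Much later, a dispute comes through for the old order 1.
total = view.apply(Order(1, "disputed", Decimal("40.00")))
print(total)  # 40.00
```

The point of the sketch is that nothing special happens for the "late" dispute: the view only needs the latest row per order, not the full event history, which is why unbounded change over a bounded set of orders is manageable.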
Let's say I have an `order` entity that might change at any moment from one status to another, and never becomes fully finalized.

For example, imagine there is a `status` field on the `order` entity that typically goes from `pending` to `complete`, but every once in a while a customer calls to complain about an old order and the status gets updated to `disputed`. Say you have a materialized view that filters for all `disputed` orders to calculate the SUM of disputed order revenue.

I've read elsewhere that Materialize is strongly consistent and other stream processing systems are eventually consistent. When a `disputed` status comes through on a really old order, where does Materialize go to find the `amount` for that order?

To put it another way, can Materialize deal with data that isn't just an immutable stream of events (e.g. clicks, impressions)?
(xPosting this question from a community slack discussion for future reference.)