Psyberg: Automated end to end catch up
By Abhinaya Shetty, Bharath Mummadisetty
This blog post will cover how Psyberg helps automate the end-to-end catchup of different pipelines, including dimension tables.
In the previous installments of this series, we introduced Psyberg and delved into its core operational modes: Stateless and Stateful Data Processing. Now, let’s explore the state of our pipelines after incorporating Psyberg.
Let’s explore how different modes of Psyberg could help with a multistep data pipeline. We’ll return to the sample customer lifecycle:
Processing Requirement:
Keep track of the end-of-hour state of accounts, e.g., Active/Upgraded/Downgraded/Canceled.
Solution:
One potential approach is a stateful, sequential load that derives each account's end-of-hour state from the incoming lifecycle facts.
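As a rough sketch of that approach (the table and column names below are illustrative, not Psyberg's actual schema), the end-of-hour state can be derived by keeping each account's latest lifecycle event within every hour:

```python
# Hypothetical sketch: derive each account's end-of-hour state from its
# lifecycle events. Table and column names are illustrative only.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# account_id, event_ts, state (Active/Upgraded/Downgraded/Canceled)
events = spark.table("lifecycle_events")

# Keep only the latest event per (account, hour); its state is the
# account's end-of-hour state for hours that saw activity.
hourly = events.withColumn("hour", F.date_trunc("hour", "event_ts"))
w = Window.partitionBy("account_id", "hour").orderBy(F.col("event_ts").desc())
end_of_hour_state = (
    hourly.withColumn("rn", F.row_number().over(w))
          .filter("rn = 1")
          .select("account_id", "hour", F.col("state").alias("end_of_hour_state"))
)
```

Hours with no events still need the previous state carried forward, which is what the stateful, sequential load described below takes care of.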
Let’s look at how this can be integrated with Psyberg to auto-handle late-arriving data and corresponding end-to-end data catchup.
We follow a generic workflow structure for both stateful and stateless processing with Psyberg; this helps maintain consistency and makes debugging and understanding these pipelines easier. The following is a concise overview of the various stages involved; for a more detailed exploration of the workflow specifics, please turn to the second installment of this series.
The workflow starts with the Psyberg initialization (init) step.
The session metadata table can then be read to determine the pipeline input.
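For illustration, here is a minimal sketch of what reading that session metadata might look like; the table name, columns, and pipeline id are assumptions for the example, not the actual Psyberg schema:

```python
# Hypothetical sketch: read Psyberg's session metadata to determine what
# the current run should process. Table name, columns, and pipeline id
# are assumptions for illustration, not the actual Psyberg schema.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

batch = spark.sql("""
    SELECT processing_start_hour, processing_end_hour
    FROM psyberg_session_metadata
    WHERE pipeline_id = 'customer_lifecycle'
    ORDER BY session_ts DESC
    LIMIT 1
""").first()

# Restrict the pipeline input to the hours this session covers.
input_df = spark.table("signup_fact").where(
    F.col("event_hour").between(batch.processing_start_hour,
                                batch.processing_end_hour)
)
```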
This is the general pattern we use in our ETL pipelines; a code sketch of the full flow follows the list below.
a. Write
Apply the ETL business logic to the input data identified in Step 1 and write to an unpublished Iceberg snapshot, based on the Psyberg mode.
b. Audit
Run various quality checks on the staged data. Psyberg's session metadata table is used to identify the partitions included in a batch run, and several audits, such as verifying source and target counts, are performed on this batch of data.
c. Publish
If the audits are successful, cherry-pick the staging snapshot to publish the data to production.
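With Apache Iceberg, this write-audit-publish flow maps naturally onto Iceberg's WAP support: stage the write under a wap.id, audit the staged snapshot, then cherry-pick it into the table's main history. A minimal PySpark sketch, with illustrative table names and audit condition:

```python
# Sketch of write-audit-publish with Iceberg's WAP support. The table
# names and the audit check are illustrative; the target table needs the
# property write.wap.enabled=true so writes stage as unpublished
# snapshots instead of becoming visible immediately.
import uuid
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
output_df = spark.table("staging.lifecycle_batch")  # result of the ETL logic

# a. Write: stage the batch under a WAP id; the snapshot stays unpublished.
wap_id = uuid.uuid4().hex
spark.conf.set("spark.wap.id", wap_id)
output_df.writeTo("prod.db.account_state").append()

# b. Audit: locate the staged snapshot via its WAP id and check it.
snapshot_id = spark.sql(f"""
    SELECT snapshot_id
    FROM prod.db.account_state.snapshots
    WHERE summary['wap.id'] = '{wap_id}'
""").first().snapshot_id

staged = spark.read.option("snapshot-id", snapshot_id).table("prod.db.account_state")
# Example audit: the staged snapshot should hold the published rows plus
# the new batch; real audits would compare source/target counts and more.
expected = spark.table("prod.db.account_state").count() + output_df.count()
if staged.count() != expected:
    raise ValueError("Audit failed: staged snapshot row count mismatch")

# c. Publish: cherry-pick the audited snapshot into the main table history.
spark.sql(f"CALL prod.system.cherrypick_snapshot('db.account_state', {snapshot_id})")
```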
Now that the data pipeline has been executed successfully, the new high watermark identified in the initialization step is committed to Psyberg’s high watermark metadata table. This ensures that the next instance of the workflow will pick up newer updates.
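The commit itself can be as simple as an upsert into the high watermark table once publishing succeeds; a sketch, assuming a hypothetical psyberg_hwm table:

```python
# Hypothetical sketch: persist the new high watermark identified during
# the init step. The psyberg_hwm table and its columns are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
new_hwm = "2023-10-09 06:00:00"  # value determined by the init step

spark.sql(f"""
    MERGE INTO psyberg_hwm t
    USING (SELECT 'customer_lifecycle' AS pipeline_id,
                  TIMESTAMP '{new_hwm}' AS high_watermark) s
    ON t.pipeline_id = s.pipeline_id
    WHEN MATCHED THEN UPDATE SET t.high_watermark = s.high_watermark
    WHEN NOT MATCHED THEN INSERT (pipeline_id, high_watermark)
                          VALUES (s.pipeline_id, s.high_watermark)
""")
```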
Let’s go back to our customer lifecycle example. Once we integrate all four components with Psyberg, here’s how we would set it up for automated catchup.
The three fact tables serve as inputs to the stateful sequential load ETL pipeline: the signup and plan facts are handled in Psyberg's stateless mode, while the cancel fact runs in stateful mode. This pipeline tracks the various stages of the customer lifecycle.
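To make the sequential load concrete, here is a hedged sketch of one hour of the stateful pass, assuming illustrative table and column names: union the three facts for the hour, keep each account's latest event, and merge it into the lifecycle table.

```python
# Hedged sketch of one hour of the sequential (stateful) load: union the
# three fact tables, keep each account's latest event within the hour,
# and merge it into the lifecycle table. All names are illustrative.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

def load_hour(hour: str) -> None:
    facts = (
        spark.table("signup_fact").select("account_id", "event_ts", "state")
        .unionByName(spark.table("plan_fact").select("account_id", "event_ts", "state"))
        .unionByName(spark.table("cancel_fact").select("account_id", "event_ts", "state"))
        .where(F.date_trunc("hour", "event_ts") == F.lit(hour).cast("timestamp"))
    )
    w = Window.partitionBy("account_id").orderBy(F.col("event_ts").desc())
    latest = facts.withColumn("rn", F.row_number().over(w)).filter("rn = 1")
    latest.createOrReplaceTempView("hour_updates")
    spark.sql(f"""
        MERGE INTO account_lifecycle t
        USING hour_updates s
        ON t.account_id = s.account_id
        WHEN MATCHED THEN UPDATE SET t.state = s.state,
                                     t.as_of_hour = TIMESTAMP '{hour}'
        WHEN NOT MATCHED THEN INSERT (account_id, state, as_of_hour)
                              VALUES (s.account_id, s.state, TIMESTAMP '{hour}')
    """)

# Hours must be replayed in order so later states overwrite earlier ones.
for h in ["2023-10-09 02:00:00", "2023-10-09 03:00:00"]:
    load_hour(h)
```

Processing hours strictly in order is what makes the load "sequential": each hour's merge builds on the states left behind by the previous one.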
The sequential load ETL layers a few additional features on top of this general pattern.
Here is a walkthrough on how this system would automatically catch up in the event of late-arriving data:
Premise: All the tables were last loaded up to hour 5, meaning that any data from hour 6 onwards is considered new, and anything before that is classified as late-arriving data.
Fact-level catchup: The Psyberg-enabled fact pipelines pick up the late-arriving data (hour 2) along with the new data (hour 6) and reprocess the affected partitions.
Dimension-level catchup: The downstream stateful sequential load then replays every hour from the earliest changed partition onward, keeping the end-of-hour states consistent.
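For intuition, here is a toy sketch of how the dimension-level replay window follows from the fact-level changes, with the hour values taken from the example above:

```python
# Toy sketch of deriving the dimension-level replay window. Because each
# hour's state builds on the previous one, the stateful load must replay
# from the earliest changed hour through the newest data.
late_or_new_hours = {2, 6}   # late data landed in hour 2; hour 6 is new
replay_start = min(late_or_new_hours)
replay_end = max(late_or_new_hours)

for hour in range(replay_start, replay_end + 1):
    print(f"sequential load replays hour {hour}")  # hours 2 through 6
```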
By chaining these Psyberg workflows, we can automate the catchup of the late-arriving data from hours 2 and 6. The data engineer does not need to intervene manually and can instead focus on more important things!
The introduction of Psyberg into our workflows has served as a valuable tool in enhancing accuracy and performance, and several key areas have seen measurable improvements from using it. These results suggest that adopting Psyberg has been beneficial to the efficiency of our data processing workflows.
Integrating Psyberg into our operations has improved our data workflows and opened up exciting possibilities for the future. As we continue to innovate, Netflix’s data platform team is focused on creating a comprehensive solution for incremental processing use cases. This platform-level solution is intended to enhance our data processing capabilities across the organization. Stay tuned for a new post on this!
In conclusion, Psyberg has proven to be a reliable and effective solution for our data processing needs. As we look to the future, we’re excited about the potential for further advancements in our data platform capabilities.