Predictions about data for 2023 and beyond
End of the year: it’s the time for predictions. Let’s have a look at some predictions regarding data. There are many predictions for Machine Learning, Deep Learning, and AI - explainability, professionalisation, and...
Data Vault and Star Schema with PlantUML: Entity Relationship Diagram as Code
Entity Relationship Diagram as code means developers use the same tools for creating the diagrams - or documentation in general - as for coding. Documentation includes more than just source code and some comments. If the documentation is textual and not binary,...
Materialization examples of Data Engineering with dbt
dbt offers several materialization options for building ETL/ELT processes. The article shows and compares various approaches to using dbt for ETL/ELT. A previous post contains an introduction to dbt: Data Engineering with dbt – first steps using PostgreSQL and...
Data Engineering with dbt – first steps using PostgreSQL and Oracle
dbt is a Data Engineering tool supporting version control with CI/CD for transformations and materialization. The approach with dbt differs from tools like SSIS, Data Factory, and Informatica. The developer models the target tables/views and the transformations. dbt uses...
PostgreSQL application_name
PostgreSQL application_name can be set in the connection string. The view pg_stat_activity then shows the application_name, which helps identify the sessions. The article shows how to set application_name and how to benefit from it. It is highly recommended to set the...
PostgreSQL columnar extension cstore_fdw
PostgreSQL columnar extension cstore_fdw is a storage extension suited to OLAP-/DWH-style queries and data-intensive applications. Columnar analytical databases have unique characteristics compared to row-oriented data access. Many commercial products exist:...
PostgreSQL partitioning guide
PostgreSQL partitioning is a powerful feature when dealing with huge tables. Partitioning allows breaking a table into smaller chunks, aka partitions. Logically, there appears to be only one table when accessing the data, but physically there are several partitions....
Anonymization techniques and data privacy
Anonymization techniques are essential for data analytics and for test/dev databases. Anonymization and pseudonymization are very different but often confused. GDPR no longer applies to anonymized data. GDPR is still applicable for pseudonymized data that can be...
Log-based Change Data Capture - lessons learnt
My article on Medium summarizes experiences from various projects with log-based change data capture (CDC). There are many use cases for which CDC is beneficial. Some DBs even have CDC functionality integrated without requiring a separate tool. The article first...
Calvin: distributed ACID transactions
Most distributed databases do not offer ACID transactions. Supporting linear scalability is the main reason that distributed NoSQL databases like MongoDB, Cassandra, AWS DynamoDB and many others offer only reduced transactional support. Abadi et al. propose in a paper...
Study on Knowledge Sharing – Spotify Guilds / CoPs
Communications of the ACM published a study on Spotify Guilds / CoPs (Communities of Practice). A CoP is a group of people with similar interests who share their knowledge, solve problems or establish standards. The study examines the challenge of knowledge sharing...
The Zettabyte challenge
IDC published a white paper about the challenge of Big Data volume in a data-driven world. IDC expects that the data volume will grow from 45 zettabytes (ZB) in 2020 to 175 ZB in 2025. The data will be produced in various forms like transactional data, text, voices,...