References to blogs, tweets, discussions, etc. that caught my attention during the last week.

Data Modeling

Blog post, link to a web session, and source code on how to use BIML to generate a Data Vault. BIML (Business Intelligence Markup Language) is an XML dialect for defining BI assets like tables, ETL flows, etc. See “Auto Generate Data Vault using Biml – Part 1 – Webinar Content” by Peter Avenent.

A LinkedIn discussion about “Data Vault, Data Virtualisation and agile DW” addresses the current approach of making Data Vault more agile for Data Marts as well. With faster hardware and/or in-memory column-oriented DBs, views in the Data Mart layer are often sufficient, so the marts do not have to be loaded as separate physical tables.
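
To make the idea of a virtual Data Mart a bit more concrete, here is a minimal sketch of my own, not taken from the discussion. It uses Python's built-in sqlite3 module with an in-memory database. The table and column names (hub_customer, sat_customer, dim_customer, customer_hk, load_dts) are hypothetical, and the Data Vault model is heavily simplified. The dimension is only a view that picks the latest satellite row per hub key.

```python
import sqlite3

# In-memory database; all table, view, and column names are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Simplified hub: one row per business key
CREATE TABLE hub_customer (
    customer_hk   TEXT PRIMARY KEY,
    customer_bk   TEXT NOT NULL,
    load_dts      TEXT NOT NULL,
    record_source TEXT NOT NULL
);

-- Simplified satellite: one row per change of the descriptive attributes
CREATE TABLE sat_customer (
    customer_hk TEXT NOT NULL REFERENCES hub_customer (customer_hk),
    load_dts    TEXT NOT NULL,
    name        TEXT,
    city        TEXT,
    PRIMARY KEY (customer_hk, load_dts)
);

-- Virtual Data Mart dimension: a view exposing the current satellite row per hub key
CREATE VIEW dim_customer AS
SELECT h.customer_bk AS customer_id, s.name, s.city
FROM hub_customer h
JOIN sat_customer s ON s.customer_hk = h.customer_hk
WHERE s.load_dts = (
    SELECT MAX(s2.load_dts)
    FROM sat_customer s2
    WHERE s2.customer_hk = h.customer_hk
);
""")

conn.executemany(
    "INSERT INTO hub_customer VALUES (?, ?, ?, ?)",
    [
        ("hk1", "C-001", "2015-01-01", "crm"),
        ("hk2", "C-002", "2015-01-01", "crm"),
    ],
)
conn.executemany(
    "INSERT INTO sat_customer VALUES (?, ?, ?, ?)",
    [
        ("hk1", "2015-01-01", "Alice", "Berlin"),
        ("hk1", "2015-02-01", "Alice", "Hamburg"),  # later change for the same customer
        ("hk2", "2015-01-01", "Bob", "Munich"),
    ],
)

# The mart is queried like a table, but nothing was loaded into it.
rows = conn.execute(
    "SELECT customer_id, name, city FROM dim_customer ORDER BY customer_id"
).fetchall()
print(rows)
```

Whether such a view is fast enough depends on data volume and the database engine, which is exactly the point about faster hardware and in-memory column stores. If it is not, the same SELECT can still be used to load a physical table.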

Data Storage

“Archiving everything with Hadoop” by Mark Cusack on Roberto Zicari’s blog mentions three key features that Hadoop has to provide in order to be suitable as long-term storage: schema preservation, security/governance, and SQL access.

Data Flow

Mark Rittman started a three-part blog series on Hadoop ETL using MapReduce, YARN, Tez, and Spark, with examples and an overview of how the tools work.

Gwen Shapira lists several links for more information about Kafka: “Getting started with Kafka – Resources”.

ETL tools are widely used in the classical DWH because of their supposed productivity and maintenance advantages compared to manual coding. But what is the role of ETL tools if the code for data loading is generated automatically? Roelant Vos gives his view in the blog article “Do we still want to automate against ETL tools?”.

Data Tools

Reference to the “Impala Cookbook” compiled by Cloudera’s Impala team covering schema and physical design, memory usage, query tuning basics, etc.

Data Visualization

Mike Bostock’s d3.js 3.5 is now available on GitHub. d3.js is a powerful JavaScript visualization library for HTML and SVG.

“Star statistician Hans Rosling takes on Ebola” by ScienceMagazine / Kai Kupferschmidt. Rosling is well known for his inspiring talks and the great visualisations he shows in them.