References to blogs, tweets, discussions, etc. that caught my attention during the last week.

Data Modeling

The blog article “Data Vault 2.0 Staging Area learnings & suggestions” by Roelant Vos shows an approach to generating Data Vault 2.0 hash keys in the staging layer, covering business keys, link relationships and hash differences for change detection.
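To make the idea of staging-layer hashing concrete, here is a minimal sketch in Python. It is not taken from the article: the delimiter, the trim/upper-case normalization, the choice of MD5, and all column names (`customer_id`, `order_id`, `status`, `amount`) are assumptions made for illustration only.

```python
import hashlib

DELIMITER = "|"  # assumed separator between key parts and attributes


def _md5_hex(text):
    """Return the upper-cased MD5 hex digest of a UTF-8 string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest().upper()


def hash_key(*business_key_parts):
    """Hash one or more business key parts (several parts form a link key).

    Parts are trimmed and upper-cased so that ' abc ' and 'ABC' map to the
    same key; None is treated as an empty string.
    """
    normalized = (
        "" if part is None else str(part).strip().upper()
        for part in business_key_parts
    )
    return _md5_hex(DELIMITER.join(normalized))


def hash_diff(record, attributes):
    """Hash the descriptive attributes in a fixed order for change detection."""
    values = (
        "" if record.get(name) is None else str(record.get(name)).strip()
        for name in attributes
    )
    return _md5_hex(DELIMITER.join(values))


def stage_row(row):
    """Add hypothetical hash columns to a staging row (a plain dict)."""
    return {
        **row,
        "customer_hk": hash_key(row["customer_id"]),
        "order_hk": hash_key(row["order_id"]),
        "customer_order_lhk": hash_key(row["customer_id"], row["order_id"]),
        "hash_diff": hash_diff(row, ["status", "amount"]),
    }
```

Computing the hashes once in staging means that hub, link and satellite loads can all reuse the same values. The hash difference changes only when one of the listed descriptive attributes changes, which is what lets a satellite load skip unchanged rows.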

“Introduction to Cassandra Data Modeling”, a video by dbtube covering the Cassandra storage model and data modeling.

“Pour Some Schema On Me: The Secret Behind Every Enterprise Information Lake” by Murthy Mathiprakasam on the Informatica blog stresses the need to care about schemas and metadata: just pouring log data, sensor data, etc. into a data lake is not sufficient to get data quality in the long run.

Data Architecture

Links to ThoughtWorks’ “Rethink Dallas” videos on agile topics:

  • Agile architecture (Molly Bartlett Dishman, Martin Fowler)
  • The death of agile (Dave Thomas)
  • Rethinking the agile enterprise (Brandon Byars)

Data Storage

“The Top 10 Posts of 2014 from the Cloudera Engineering Blog” by Justin Kestelyn contains many articles dealing with right-time capabilities for the Hadoop ecosystem, e.g. Spark, Kafka, Impala.


MongoDB acquires storage engine WiredTiger. Press release: “MongoDB acquires WiredTiger Inc.”

Frits Hoogland started an in-depth blog series about Oracle PGA.

Data Flow

Hortonworks webinar recap on Kafka & Storm with recording, slides, and Q&A: “Discover HDP 2.2: Apache Kafka and Apache Storm for Stream Data Processing”. Apache Storm 0.9.3 was recently released with improvements in HDFS, HBase and Kafka integration. The new release allows Storm to write into Kafka, so Storm can now use Kafka as both source and target: “Storm 0.9.3 Released”.

Data Visualization

“10 significant visualization developments: July to December 2014” by Andy Kirk and “The Best Data Visualization Projects of 2014” by Nathan Yau showcase great visualisations from 2014.

Data Statistics

The Google Research blog post “Automatically making sense of data” is about automatically discovering insights from data and providing a human-readable explanation (see The Automatic Statistician project site).

The title says it all: “Open-Sourced Advanced Analytics is increasing…” by Alexander Linden.

Data Quotes

“You’ve got to have a DR [disaster recovery] plan. It’s amazing how many bits of software you install over time. It’s also surprising how many odd little commands and config entries you put in over time.” Source of the quote: “The importance of backups and disaster recovery plans” by Tim Hall.

“If you have data, you have a schema. Whether you want one or not.” tweeted by Karen Lopez.