The static schema-based approach may not work in the following scenarios:
- Source data keeps changing; new attributes keep getting added or removed
- The data needs to be viewed from different systems (analytic cubes, services, etc.)
- Time to market is short
- The model should adapt to change with minimum effort
Traditional modeling presents the following challenges:
- The model needs to be modified every time the data changes
  - This involves significant data architect effort, DBA effort, and system downtime
- Changes to the ETL flow
- Testing effort
The approach is explained with a common business scenario.
A traditional bank has acquired a new payment gateway and needs to incorporate the system's user activity into its existing data warehouse.
The user activity is stored in MongoDB and arrives at the traditional ETL as a file with varying fields.
The data fields vary by geography, which traditional data modeling may not be able to accommodate.
NoSQL-based systems could be used to store the user activity data, but the tech team wants to leverage the existing warehouse for it and use SQL to query the data and join it with the existing legacy DW system.
There are modeling techniques such as the Entity-Attribute-Value (EAV) method, also known as open schema, that address some of the issues in such scenarios.
The architecture specified below is a hybrid approach that brings out the best of both worlds.
Input data arrives as CSV: a stream of comma-separated strings with header information.
The data model captures the session ID, the attribute names (column names), and the attribute values in a row-per-attribute fashion (a DDL sketch is given after the list below).
- The model can accommodate a varying list of columns
- The model maintains the order of columns in the input data
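A minimal DDL sketch of such a base table is given below. The table name and the first three columns come from the view definition later in this post; the ATTRIBUTE_ORDER column is an assumption added here to capture the column ordering mentioned above, and the VARCHAR sizes are illustrative:

-- EAV-style base table (sketch; column sizes are illustrative)
-- A header line such as  loginame,location,Time_logged_in,Total_time,Browser_id
-- is unpivoted so that each attribute of a session lands in its own row
CREATE TABLE ashok.FACT_USER_ACTIVITY (
    SESSION_ID      VARCHAR(64)   NOT NULL,  -- session key from the source feed
    ATTRIBUTE_NAME  VARCHAR(128)  NOT NULL,  -- column name taken from the CSV header
    ATTRIBUTE_VALUE VARCHAR(1024),           -- the value, stored as text
    ATTRIBUTE_ORDER INTEGER                  -- assumed column: preserves the input column order
);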
Data presentation for traditional and legacy systems
The fact table can be exposed as a view to the traditional dimensional models, for legacy systems such as cubes and other reporting tools.
Views can be created with pivot SQL, and this can be done in multiple ways: SQL Server has a PIVOT function, while Oracle, DB2, and Teradata offer recursive SQL and table functions to generate the column list dynamically.
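As a sketch of the SQL Server route (assuming the same base table exists in a SQL Server database under the ashok schema), the PIVOT operator version would look roughly like this:

-- SQL Server PIVOT sketch; table and column names are taken from the model above
SELECT SESSION_ID, [loginame], [location], [Time_logged_in], [Total_time], [Browser_id]
FROM (
    SELECT SESSION_ID, ATTRIBUTE_NAME, ATTRIBUTE_VALUE
    FROM   ashok.FACT_USER_ACTIVITY
) AS src
PIVOT (
    MAX(ATTRIBUTE_VALUE)
    FOR ATTRIBUTE_NAME IN ([loginame], [location], [Time_logged_in], [Total_time], [Browser_id])
) AS pvt;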
A generic ANSI SQL way to achieve the same functionality, using MAX(CASE ...) expressions, is given below:
db2 " Create view ashok.view_FACT_USER_ACTIVITY as SELECT SESSION_ID,
MAX(DECODE(ATTRIBUTE_NAME, 'loginame' , ATTRIBUTE_VALUE)) AS loginame,
MAX(DECODE(ATTRIBUTE_NAME, 'location' , ATTRIBUTE_VALUE)) AS location,
MAX(DECODE(ATTRIBUTE_NAME, 'Time_logged_in' , ATTRIBUTE_VALUE)) AS Time_logged_in,
MAX(DECODE(ATTRIBUTE_NAME, 'Total_time' , ATTRIBUTE_VALUE)) AS Total_time,
MAX(DECODE(ATTRIBUTE_NAME, 'Browser_id' , ATTRIBUTE_VALUE)) AS Browser_id
FROM ashok.FACT_USER_ACTIVITY
GROUP BY SESSION_ID"
Base table and the output of the view
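For illustration, here is what a few hypothetical rows in the base table could look like; the session IDs and values below are made up for the example:

-- Hypothetical sample rows; the values are illustrative only
INSERT INTO ashok.FACT_USER_ACTIVITY
    (SESSION_ID, ATTRIBUTE_NAME, ATTRIBUTE_VALUE, ATTRIBUTE_ORDER)
VALUES
    ('S1', 'loginame',       'jdoe',       1),
    ('S1', 'location',       'London',     2),
    ('S1', 'Time_logged_in', '09:15',      3),
    ('S1', 'Browser_id',     'chrome-101', 5),
    ('S2', 'loginame',       'asmith',     1),
    ('S2', 'Total_time',     '42',         4);

Selecting from the view (SELECT * FROM ashok.view_FACT_USER_ACTIVITY) pivots these rows back into one row per session. Session S1 has no Total_time and S2 has no location, so those cells simply come back as NULL, which is how the model tolerates a varying column list.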
Integration with existing dimensional schema
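Because the pivoted view exposes ordinary columns, it can be joined to the existing dimension tables like any other fact. The DIM_USER table and its columns in the sketch below are hypothetical, used only to illustrate the join; only view_FACT_USER_ACTIVITY comes from the model above:

-- Hypothetical integration query; DIM_USER, USER_KEY, USER_SEGMENT and
-- LOGIN_NAME are assumed names for illustration
SELECT u.USER_KEY,
       u.USER_SEGMENT,
       a.SESSION_ID,
       a.location,
       a.Total_time
FROM   ashok.view_FACT_USER_ACTIVITY a
JOIN   ashok.DIM_USER u
  ON   u.LOGIN_NAME = a.loginame;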
Key Benefits
- Easy to load and maintain; less snowflaking and better performance
- Can use RDBMS indexes, compression, and other engine features
- Integrates with the existing dimensional model
- Integrates well with the append/insert-oriented write pattern of big data systems like Spark
- Easy to version the data