Data Warehouse Requirements Engineering: A Decision Based Approach

Metadata

Highlights

development model. The aim of this model is to provide an iterative and incremental development


Aside from being expensive, data warehouse projects are risky. Ericson [3] cites a survey showing that data warehouse projects whose cost averaged above $12M failed 65% of the time. Hayen et al. [4] refer to studies indicating that the typical cost of a data warehouse project is one million dollars in the very first year, and that one-half to two-thirds of data warehouse projects fail. Loshin [1] points out that data warehouse projects generate high expectations but bring many disappointments, owing to failures in the way DW projects are taken from conception through to implementation. This is corroborated by [2], who concludes that the lessons learnt from data warehouse projects covered the entire development life cycle:
• Requirements: The business had little idea of what they wanted because they had never experienced a data warehouse. Further, requirements gathering should have been done better.
• Design: Design took a very long time, and conflicting definitions of data made designs worthless.
• Coding: Coding went slowly, testing came under pressure, and crucial defects were not caught.


The Requirements Engineering Stage

The introduction of the requirements engineering stage in the DWSDLC addressed the concerns raised in conceptual schema-driven techniques. A clear effort was made to take into account the needs of stakeholders in the data warehouse to-be. The definition of multidimensional structures was based on an understanding of business goals, business services, and business processes, gained through interaction with decision-makers. Thus, the context of the data warehouse was explored, and the requirements of the data warehouse to-be were seen to originate in the business. Since an ab initio investigation into decision-maker needs is carried out in the requirements engineering stage, existing data sources and/or their conceptualization in conceptual schemas did not impose any limitations. Rather, the determined requirements could use data from existing data sources or call for completely new data not available in these sources. With the introduction of the requirements engineering stage in the DWSDLC, there is today no difference between the stages of the TSDLC of transactional systems and the stages of the DWSDLC. However, the tasks carried out in these stages are different, as brought out in Table 2.2.


Monolithic Versus Bus Architecture

Breadth-first traversal can be done based on two different assumptions. The first assumption is that the deliverable is the complete data warehouse. Hence, the requirements of the entire data warehouse must be identified; the multidimensional model for the enterprise must be designed and then taken into implementation. Thereafter, specific subject-oriented data marts are defined so as to make appropriate subsets of the data warehouse available to specific users. This is shown in Fig. 2.3, where sales, purchase, and production data marts are built on top of the enterprise-wide data warehouse. Defining data marts in this manner is analogous to constructing subschemas on the schema of a database: the main idea in both is to provide a limited view of the totality, restricted to what is relevant to specific users (see the sketch below). This approach of constructing the monolithic data warehouse follows the waterfall model; each stage of the DWSDLC must be completed before moving to the next. This implies that the lead time in delivering the project is very high, and there is a danger that the requirements might change even as work is in progress.
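A minimal sketch of the monolithic idea: data marts carved out of one enterprise-wide warehouse, much as subschemas are defined over a database schema. The table and column names are illustrative, not from the book.

```python
import pandas as pd

# Enterprise-wide warehouse covering sales, purchase, and production.
warehouse = pd.DataFrame({
    "date":       ["2024-01-05", "2024-01-05", "2024-01-06"],
    "department": ["sales", "purchase", "production"],
    "product":    ["P1", "P2", "P1"],
    "amount":     [1200.0, 450.0, 300.0],
})

def data_mart(department: str) -> pd.DataFrame:
    """Return the subset of the warehouse relevant to one subject area."""
    return warehouse[warehouse["department"] == department]

sales_mart = data_mart("sales")  # limited view for sales users only
print(sales_mart)
```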


A different process model results if the assumption of delivering the full data warehouse is relaxed. Rather than building the entire monolithic data warehouse, this approach calls for first building data marts and then integrating them by putting them on a common bus. This bus consists of conformed dimensions, dimensions that are common across data marts and therefore allow the drill-across operation to be performed. Consequently, data held in different data marts can be retrieved. This approach is shown in Fig. 2.4. Data marts are built independently of one another. Since a data mart is smaller than the entire data warehouse, the lead time for a release is shorter. Therefore, business value can be provided even with the release of the first data mart. Freshly built data marts can then be added to the bus. Thus, the data warehouse consists of a number of integrated, self-contained data marts rather than one big centralized data warehouse. Evidently, the bus approach promotes iterative and incremental development, and no complete plan is required upfront. The risks are that data marts may contain missing or incompatible measures, and dimensions may contain replicated data and display inconsistent results. The success of the bus architecture is crucially dependent on conforming facts and dimensions. Thus, if one data mart keeps product information as the number of cases shipped and another keeps product information as units sold, then drilling across these data marts yields incompatible information. Such facts must be conformed by keeping unit data alongside the shipping data, which allows units shipped to be compared with units sold (see the sketch below). Dimensions need to be conformed too. If one data mart has attributes day, quarter, and year for the dimension time, and another has day, month, and half-year, then drill-across becomes difficult. The dimension attributes must be made to conform, and the lowest granularity attribute kept in both dimensions. In our example, the product information must also be available on a daily basis.
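A hedged sketch of drill-across over a bus of conformed dimensions, using the book's cases-shipped versus units-sold example. The two marts share the conformed dimensions product and day; the shipment fact is conformed to units before joining. The units_per_case conversion data is an illustrative assumption.

```python
import pandas as pd

shipments = pd.DataFrame({   # data mart 1: product info in cases shipped
    "product": ["P1", "P2"], "day": ["2024-01-05"] * 2, "cases_shipped": [10, 4],
})
sales = pd.DataFrame({       # data mart 2: product info in units sold
    "product": ["P1", "P2"], "day": ["2024-01-05"] * 2, "units_sold": [110, 40],
})
units_per_case = {"P1": 12, "P2": 10}   # assumed conversion factors

# Conform the shipment fact: keep unit data alongside the case data.
shipments["units_shipped"] = shipments.apply(
    lambda r: r["cases_shipped"] * units_per_case[r["product"]], axis=1)

# Drill across: join the two marts on their conformed dimensions.
report = shipments.merge(sales, on=["product", "day"])
print(report[["product", "day", "units_shipped", "units_sold"]])
```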


The introduction of developer stories requires a number of additional roles beyond the three roles of product owner, Scrum master, and development team. These are as follows:
• Project architect: conceptualizes the application and communicates it to both business stakeholders and technical people. The job involves relating source data to target data in a presentation layer and formulating the major functions of the dashboards.
• Data architect: ensures that the semantics of the data are clear, manages the data models of the various layers, implements normalization, etc.
• Systems analyst: starting from user stories, determines the transformations of source data required to meet business needs. In doing so, the systems analyst will need to look at developer stories to determine the transformations across the multiple layers of the DWBI reference architecture. This role may also need to work with the data architect to define any integrity constraints that must be satisfied by the data before it is accepted into the next layer.
• Systems tester: ascertains whether the build is correct and complete. Testing is normally done at the end of each day, at the end of each iteration, and when a release is issued.


Agile Data Modeling [13] is for exploring data-oriented structures. It provides for incremental, iterative, and collaborative data modeling. Incremental data modeling refers to the availability of more requirements as they become better understood or clearer to the stakeholder. The additional requirements are obtained "on the fly" when the developer needs them to complete the implementation task at hand. Iterative data modeling emphasizes reworking to improve existing work. As requirements become better understood and the need for changing data schemas is felt, correcting errors, including missing information just discovered, and other such rework, referred to as refactoring in the data warehouse community, is carried out (a small sketch follows below). Collaborative data modeling calls for close interaction between developers and stakeholders in obtaining and modeling data requirements. Thus, it goes beyond merely eliciting and documenting data requirements with stakeholder participation to include stakeholder participation in the modeling of the data itself.
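An illustrative sketch of iterative data modeling: a first-cut dimension is reworked once an error and a newly discovered requirement surface. The schema and attribute names are hypothetical, not taken from [13].

```python
customer_dim_v1 = {
    "customer_id": "int",
    "name": "str",
    "adress": "str",          # misspelling discovered during review
}

def refactor(schema: dict, renames: dict, additions: dict) -> dict:
    """Rework an existing schema: correct names, include newly found data."""
    out = {renames.get(k, k): v for k, v in schema.items()}
    out.update(additions)
    return out

# Iteration 2: correct the error and add information obtained "on the fly".
customer_dim_v2 = refactor(
    customer_dim_v1,
    renames={"adress": "address"},
    additions={"segment": "str"},   # requirement that emerged later
)
print(customer_dim_v2)
```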


BEAM* uses the notion of data stories, told by stakeholders, to capture data about the business events that comprise business processes. These data stories are answers to seven types of questions about events, and each answer provides a fact or dimension of the multidimensional schema. These questions, called the 7Ws, are: (1) Who is involved in the event? (2) What did they do, and to what was it done? (3) When did it happen? (4) Where did it take place? (5) Why did it happen? (6) How did it happen, that is, in what manner? (7) How many or how much was recorded, that is, how can it be measured? Of these, the first six supply dimensions whereas the last one supplies facts. As an example, the event order delivered can have three "who"-type dimensions, namely Customer, Product, and Carrier, and two "when"-type dimensions, order date and shipment date (see the sketch below).
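A small sketch of the 7W framing for the book's order delivered example: the first six Ws supply dimensions, the seventh supplies facts. The who- and when-type entries come from the text; the remaining attribute names are illustrative assumptions.

```python
order_delivered = {
    "who":      ["Customer", "Product", "Carrier"],  # who is involved?
    "what":     ["Order"],                           # what is done / to what? (assumed)
    "when":     ["OrderDate", "ShipmentDate"],       # when did it happen?
    "where":    ["DeliveryAddress"],                 # where did it take place? (assumed)
    "why":      ["Promotion"],                       # why did it happen? (assumed)
    "how":      ["DeliveryMode"],                    # in what manner? (assumed)
    "how_many": ["QuantityDelivered"],               # how can it be measured? (assumed)
}

# The first six question types yield dimensions; the seventh yields facts.
dimensions = [a for w in ("who", "what", "when", "where", "why", "how")
              for a in order_delivered[w]]
facts = order_delivered["how_many"]
print(dimensions, facts)
```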


It is possible to follow the waterfall model and build the event matrix for all events in all processes in the organization. However, agility is obtained when just enough events have been identified to enable defining the next sprint. Further, the backlog is prioritized on the basis of the Importance value. Thus, the event matrix is the backlog. Events have event stories associated with them. Since conformed dimensions have already been identified, it is expected that event stories will be written using these. An event is expressed as a table whose attributes are (a) specific to the event and (b) the conformed dimensions already obtained from the event matrix. The table is filled in with event stories; each event story is a row of the event table (see the sketch below). An event table is filled in with several event stories so as to ensure that all stakeholders agree on the meaning of each attribute in the event table. If there is no agreement, then attributes that are homonyms have been discovered, and separate attributes for each meaning must be defined. Reports desired by stakeholders are captured in report stories. These stories are taken further to do data profiling and then on to development in a sprint. This is the BI application aspect of data warehouse development.
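A sketch of an event table under these assumptions: its columns are event-specific attributes plus conformed dimensions from the event matrix, and each event story is one row. The attribute names and story values are illustrative.

```python
# Columns: conformed dimensions (Customer, Product, OrderDate) plus an
# event-specific attribute (Quantity).
columns = ["Customer", "Product", "OrderDate", "Quantity"]

event_stories = [   # one row per event story told by stakeholders
    {"Customer": "Acme Ltd", "Product": "P1", "OrderDate": "2024-01-05", "Quantity": 10},
    {"Customer": "Brite Co", "Product": "P2", "OrderDate": "2024-01-06", "Quantity": 3},
]

# Agreement check: every story must give every attribute a value that all
# stakeholders recognize; a disputed attribute signals a homonym to split.
for story in event_stories:
    assert set(story) == set(columns)
print("all stories fit the event table")
```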


• Dependent data marts: These are built from an already operational data warehouse, and so data for the data mart is extracted directly from the latter. Therefore, such data marts have data that has already been integrated as part of developing the data warehouse. Data in a dependent data mart will also, quite naturally, be consistent with the data in the enterprise data warehouse: the enterprise data warehouse represents the "single version of the truth", and these data marts comply with it.
• Independent data marts: Developed independently of the enterprise data warehouse, these are populated with data taken, often directly, from an application, an OLTP database, or operational data sources. Consequently, the data is not integrated and is likely to be inconsistent with the data warehouse. Independent data marts are built by several different teams using the technologies preferred by those teams. Therefore, there is a proliferation of tools, software, hardware, and processes. Clearly, the foregoing happens if conformity across data marts is handled on the fly. Notice, however, that it happens even if consolidation is designed for, as in BEAM*, because post-design the data marts are developed independently and independent teams work on the several data marts.


Data mart proliferation raises a number of issues:
• A large number of data marts implies increased hardware and software costs as well as higher support and maintenance costs.
• Each data mart has its own ETL process, and so there are several such processes in a business.
• The same data existing in a large number of data marts leads to redundancy and inconsistency between data.
• There is no common data model. Multiple data definitions, differing update cycles, and differing data sources abound. This leads to inconsistent or inaccurate reports and analyses.
• Due to the lack of consistency between similar data, decision-making may be inaccurate or inconsistent.
Data mart proliferation can be a drain on company resources. Industry surveys [14] show that 59% of companies maintain 30 data marts, and there are companies that maintain 100 or more. Maintenance of a single data mart can cost as much as $2 million annually, and 35-70% of these costs are redundant (a back-of-the-envelope reading follows).
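A back-of-the-envelope reading of the survey figures quoted above, assuming 30 data marts at the cited $2 million annual ceiling per mart; the 35-70% band is the share of that spend reported as redundant.

```python
n_marts = 30                               # figure cited from the survey [14]
annual_cost_per_mart = 2_000_000           # upper cost figure cited in the text
total = n_marts * annual_cost_per_mart     # $60M per year across the marts
redundant = (0.35 * total, 0.70 * total)   # $21M to $42M of redundant spend
print(total, redundant)
```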


Goal-Oriented DWRE Techniques

Goal-Oriented DWRE (GODWRE) techniques draw heavily on the notion of goals developed in GORE, considered in Chap. 1. Thus, the organizational context of the data warehouse is represented in terms of the goals that the business wants to achieve. Goal reduction techniques as in GORE are adopted to yield the goal hierarchy. Thereafter, facts and dimensions are associated with goals. An early proposal for GODWRE was due to Bonifati et al. [27], who obtained DW structures from users' goals and operational databases. This was done by three levels of analysis: (i) top-down, using goals; (ii) bottom-up, for operational databases; and (iii) integration, for integrating the data warehouse structures obtained from steps (i) and (ii). Our interest is in step (i) only, which relies on Goal-Question-Metric analysis. Users' requirements are collected through traditional techniques like interviewing and brainstorming to obtain goals. A goal is expressed in terms of the following (see the sketch after this list):
• Object of study: the part of the reality being studied,
• Purpose: why the study is being done,
• Quality focus: the characteristics of interest in the study,
• Viewpoints: who is interested in the study, and
• Environment: the application context in which the study is being done.
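A minimal sketch of a goal expressed with the five components listed above, in the Goal-Question-Metric style of Bonifati et al. [27]. The sample values are illustrative, not from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class GQMGoal:
    object_of_study: str   # the part of the reality being studied
    purpose: str           # why the study is being done
    quality_focus: str     # the characteristics of interest in the study
    viewpoints: list = field(default_factory=list)  # who is interested
    environment: str = ""  # the application context of the study

# Hypothetical example goal for a retail decision-maker.
goal = GQMGoal(
    object_of_study="order fulfilment process",
    purpose="improve",
    quality_focus="delivery delay",
    viewpoints=["logistics manager"],
    environment="retail enterprise",
)
print(goal)
```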


Approaches to obtaining information

| Approach | Obtaining data warehouse information |
| --- | --- |
| Boehnlein and Ulbricht | Relevant business objects and attributes; edges of SERM schema |
| Bonifati | Quality focus and variation factors |
| Prakash and Gosain | Information scenarios |
| Mazón et al. | Measures, context |
| Giorgini et al. | Goal achievement measures; dimensions from leaves of goal hierarchy |
| Nasiri et al. [37] | Follows the approach of Mazón et al. (row 4 of this table) |
| Corr and Stagnitto | Uses the 7W framework |