Data development is the third Data Management Function in the data management framework shown in Figures 1.3 and 1.4. It is the second data management function that interacts with and is influenced by the Data Governance function. Chapter 5 defines the data development function and explains the concepts and activities involved in data development.
5.1 Introduction
Data development is the analysis, design, implementation, deployment, and maintenance of data solutions to maximize the value of the data resources to the enterprise. Data development is the subset of project activities within the system development lifecycle (SDLC) focused on defining data requirements, designing the data solution components, and implementing these components. The primary data solution components are databases and other data structures. Other data solution components include information products (screens and reports) and data access interfaces.
The context of the Data Development Function is shown in the context diagram in Figure 5.1.
Project team members must collaborate with each other for effective solution design.
5.2 Concepts and Activities
The activities necessary to carry out the data development function are described below.
5.2.1 System Development Lifecycle (SDLC)
Data development activities occur in the context of systems development and maintenance efforts, known as the system development life cycle (SDLC). Projects manage most of these efforts. A project is an organized effort to accomplish a defined objective. A very small maintenance effort may be completed in a day; very large, multi-phase projects can take years to complete.
Figure 5.1 Data Development Context Diagram
System development and maintenance projects perform selected activities within the systems development lifecycle. The stages of the SDLC represent very high-level steps commonly taken to implement systems, as shown in Figure 5.2. There is no standardized outline of these stages, but in general, the SDLC includes the following specification and implementation activities:
Figure 5.2 The System Development Lifecycle (SDLC)
System maintenance efforts also generally follow the same high-level SDLC processes in very rapid sequence, performing some small amounts of analysis, design, coding, testing, and deployment.
Many organizations have adopted SDLC methods that integrate systems development methods and techniques into a comprehensive approach to systems development. Methods guide system development project planning and performance. Most methods recommend detailed tasks and specific techniques to perform activities within each SDLC stage. These tasks and techniques create a series of deliverables leading ultimately to an implemented system. The outputs from early tasks serve as the inputs guiding subsequent tasks.
Different methods portray the SDLC in different ways, each with its own distinctive use of terms. Some methods define a waterfall approach to performing SDLC stages. Some methods define a spiral, iterative approach. These methods deliver complete solutions in increments by performing SDLC stages in multiple project phases, guided by some high-level planning, analysis, and design.
Information systems capture and deliver information (data in context with relevance and a time frame) to support business functions. These functions range from strategic planning to operational performance. Data stores and information products are integral components of every information system. An effective systems development project will maintain a balanced emphasis on data, process, and technology.
5.2.2 Styles of Data Modeling
Several different data modeling methods are available, each using different diagramming conventions or styles. The syntax for each of these styles differs slightly. While all data models use boxes and lines, each style uses different symbols and box contents to communicate detailed specifications. The DAMA-DMBOK Guide offers only a very brief introduction to these styles.
UML defines several different types of models and diagrams. Class diagrams closely resemble other data model styles. In addition to modeling object-oriented software, semantic models for XML-based web services commonly use UML class diagrams. In fact, conceptual, logical, and even physical data modeling can use UML class diagrams.
Some practitioners see no need or value in modeling objects and data separately. Conceptual object class models are equivalent to conceptual data models. However, logical and physical data models usually differ substantially from logical and physical object-oriented program designs. Logical data models normalize data attributes, while object models do not. The attributes of an object represent data in program memory, while the attributes of a physical data model represent the data stored in a database, usually as columns in relational database tables. Recognizing these differences, most data professionals prefer to model data and/or databases in separate models with different diagramming styles.
When used consistently, the different diagramming conventions can quickly differentiate and communicate the purpose of each model. For example, some practitioners use IE notation for logical data modeling and use IDEF1X for physical data modeling, especially dimensional modeling. However, this is confusing for business data stewards reviewing different kinds of models. Data stewards do not need to become data modelers, but they should be fluent in reading and interpreting one primary diagramming convention.
5.2.3 Data Modeling, Analysis, and Solution Design
Data modeling is an analysis and design method used to 1) define and analyze data requirements, and 2) design data structures that support these requirements. A data model is a set of data specifications and related diagrams that reflect data requirements and designs. For the most part, conceptual data modeling and logical data modeling are requirements analysis activities, while physical data modeling is a design activity.
A model is a representation of something in our environment. It makes use of standard symbols that allow one to grasp its content quickly. Maps, organization charts, and building blueprints are examples of models in everyday use. Think of a data model as a diagram that uses text and symbols to represent data elements and the relationships between them. In fact, a single diagram may be one of several views provided for a single integrated data model. More formally, a data model is the integrated collection of specifications and related diagrams that represent data requirements and designs.
Although there are well-defined techniques and processes, there is an art to making data available in usable forms to a variety of different applications, as well as visually understandable. Data modeling is a complex process involving interactions between people and technology that must not compromise the integrity or security of the data. Good data models accurately express and effectively communicate data requirements and a quality solution design. Some model diagrams try to communicate too much detail, reducing their effectiveness.
Two formulas guide a modeling approach:
The purpose of a data model is to facilitate:
Data models that include the same data may differ by:
Use data models to specify the data required to meet information needs. Data flows through business processes packaged in information products. The data contained in these information products must meet business requirements. Data modeling is, in that sense, an analysis activity, reflecting business requirements. However, data modeling presents creative opportunities at every step, making it, at the same time, a design activity. Generally, there is more analysis involved in conceptual data modeling, and more design involved in physical data modeling, with a more balanced mixture of both in logical data modeling.
5.2.3.1 Analyze Information Requirements
Information is data in context that has relevance and is timely. To identify information requirements, we need first to identify business information needs, often in the context of one or more business processes. Business processes consume, as input, information products output from other business processes. The names of these information products often identify an essential business vocabulary that serves as the basis for data modeling. Regardless of whether processes or data are modeled sequentially (in either order) or concurrently, effective analysis and design should ensure a relatively balanced view of data (nouns) and processes (verbs), with equal emphasis on both process and data modeling.
Projects typically begin with a project request and the definition of a project charter that defines project objectives, deliverables, and scope boundaries. Initial project plans estimate the resources, effort, time, and cost required to accomplish project objectives. Every project charter should include data-specific objectives and identify the data within its scope. Reference to an enterprise data model provides the vocabulary to define the data scope of the project effectively.
Requirements analysis includes the elicitation, organization, documentation, review, refinement, approval, and change control of business requirements. Some of these requirements identify business needs for data and information. Express requirement specifications in both words and diagrams.
Logical data modeling is an important means of expressing business data requirements. For many people, as the old saying goes, “a picture is worth a thousand words.” However, some people do not relate easily to pictures; they relate better to the reports and tables created by data modeling tools. Many organizations have formal requirements-management disciplines to guide the drafting and refinement of formal requirement statements, such as “The system shall …”. Written data requirement specification documents may be maintained using requirements management tools. Carefully synchronize the contents of any such documentation with the specifications captured within data models.
Some methods include enterprise planning activities that define the enterprise data model, using techniques such as business systems planning (BSP) or information systems planning. Methods may also include the definition of related enterprise-wide data delivery architecture in the planning phase. Chapter 4 on Data Architecture Management covers these activities.
5.2.3.2 Develop and Maintain Conceptual Data Models
A conceptual data model is a visual, high-level perspective on a subject area of importance to the business. It contains only the basic and critical business entities within a given realm and function, with a description of each entity and the relationships between entities. Conceptual data models define the semantics (nouns and verbs) of the essential business vocabulary. Conceptual data model subject areas may reflect the data associated with a business process or application function. A conceptual data model is independent of technology (database, files, etc.) and usage context (whether the entity is in a billing system or a data warehouse).
Included in a conceptual data model is a glossary that defines each object within the conceptual data model. The definitions include business terms, relationship terms, entity synonyms, and security classifications. An example of a conceptual data model is shown in Figure 5.3.
Figure 5.3 Conceptual Data Model Example
To create a conceptual data model, start with one subject area from the subject area model. Determine what objects are included within that subject area, and how they relate to each other. For example, a Customer subject area may contain the following entities: Account Owner, Sub Account, Contact Preferences, and Contact Information. One Account Owner relates to one or more Sub Accounts. Each Account Owner has one set of Contact Preferences and one set of Contact Information at any time.
To maintain a conceptual data model, adopt a process that checks any proposed changes to the production system against the conceptual model. If a project involves changes, create an intermediate conceptual model and make the changes there. Copy the model changes to the production version of the conceptual model when implementing changes to the production system as part of the release process, to ensure that the model stays in sync with current reality.
5.2.3.2.1 Entities
A business entity is something of interest to the organization, an object, or an event. A data entity is a collection of data about something that the business deems important and worthy of capture. An entity is a noun:
An entity occurrence is the instantiation of a particular business entity. The entity Customer can have instances named Bob, Joe, Jane, and so forth. The entity Account can have instances of Bob’s checking account, Bob’s savings account, Joe’s brokerage account, and so on.
An entity can appear in a conceptual or logical data model. Conceptual business entities describe the things about which we collect data, such as Customer, Product, and Account. Logical data entities follow the rules of normalization and abstraction, and therefore the concept Customer becomes numerous components such as Customer, Customer Type, and Customer Preference. Physical data models define tables that may or may not relate directly to entities in a comparable logical model.
Entities are either independent or dependent entities. An independent entity (or kernel entity) does not depend on any other entity for its existence. Each occurrence of an independent entity exists without referring to any other entity in the data model. A dependent entity depends on one or more other entities for its existence. There are three main types of dependent entity:
5.2.3.2.2 Relationships
Business rules define constraints on what can and cannot be done. Business rules divide into two major categories:
Data models express two primary types of data rules:
Express cardinality and referential integrity business rules as relationships between entities in data models. Combine the examples above to express the relationship between Company and Person as follows:
Relationship labels are verb phrases describing the business rules in each direction between two entities, along with the words that describe the “many” aspect of each relationship (cardinality) and the “zero or one” side of each relationship (referential integrity).
A relationship between two entities may be one of three relationship types:
A recursive relationship relates instances of an entity to other instances of the same entity. Recursive relationships may be one-to-one, one-to-many, or many-to-many.
5.2.3.3 Develop and Maintain Logical Data Models
A logical data model is a detailed representation of data requirements and the business rules that govern data quality, usually in support of a specific usage context (application requirements). Logical data models are still independent of any technology or specific implementation technical constraints. A logical data model often begins as an extension of a conceptual data model, adding data attributes to each entity. Organizations should have naming standards to guide the naming of logical data objects. Logical data models transform conceptual data model structures by applying two techniques: normalization and abstraction. An example of a logical data model is shown in Figure 5.4.
Normalization is the process of applying rules to organize business complexity into stable data structures. It requires a deeper understanding of each data element, seen in relation to every other data element. The basic goal of normalization is to keep each data element in only one place.
Normalization rules sort data elements according to primary and foreign keys. These rules sort into levels, with each level applying more granularity and specificity in search of the correct primary and foreign keys. Each level comprises a separate normal form, and each successive level includes the previous levels. Normalization levels include:
Figure 5.4 Logical Data Model Example
The term normalized model usually means the data is in 3NF. Situations requiring BCNF, 4NF, 5NF, and 6NF occur rarely; these forms are considered advanced topics in data modeling.
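For illustration, the following sketch (in Python, using SQLite; the Customer and SalesOrder names and data are hypothetical) shows a structure in third normal form: the customer's name is stored in exactly one place, and a join reassembles the combined view on demand.

```python
import sqlite3

# A hypothetical 3NF structure: each non-key column depends on
# the key, the whole key, and nothing but the key.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE Customer (
        customer_id   INTEGER PRIMARY KEY,
        customer_name TEXT NOT NULL
    )""")
cur.execute("""
    CREATE TABLE SalesOrder (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES Customer(customer_id),
        order_date  TEXT NOT NULL
    )""")
cur.execute("INSERT INTO Customer VALUES (1, 'Bob')")
cur.executemany("INSERT INTO SalesOrder VALUES (?, ?, ?)",
                [(100, 1, '2011-01-05'), (101, 1, '2011-02-17')])

# The name appears once, however many orders Bob places; a join
# reassembles the combined view on demand.
rows = cur.execute("""
    SELECT o.order_id, c.customer_name
    FROM SalesOrder o JOIN Customer c ON c.customer_id = o.customer_id
    ORDER BY o.order_id""").fetchall()
print(rows)  # [(100, 'Bob'), (101, 'Bob')]
```

If the customer's name were repeated on every order row instead, a name change would require updating many rows, with the attendant risk of inconsistency.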
Abstraction is the redefinition of data entities, elements, and relationships by removing details to broaden the applicability of data structures to a wider class of situations, often by implementing super-types rather than sub-types. Using the generic Party Role super-type to represent the Customer, Employee, and Supplier sub-types is an example of applying abstraction.
Use normalization to show known details of entities. Use abstraction when some details of entities are missing or not yet discovered, or when the generic version of entities is more important or useful than the subtypes.
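A minimal sketch of abstraction under the same assumptions (hypothetical names, SQLite via Python): a single generic PartyRole super-type table stands in for separate Customer, Employee, and Supplier sub-type tables.

```python
import sqlite3

# One PartyRole super-type table replaces three sub-type tables.
# Table and column names are illustrative, not prescribed.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE PartyRole (
        party_id       INTEGER NOT NULL,
        role_type_code TEXT    NOT NULL
            CHECK (role_type_code IN ('CUSTOMER', 'EMPLOYEE', 'SUPPLIER')),
        PRIMARY KEY (party_id, role_type_code)
    )""")
# One party may play several roles without any schema change.
cur.executemany("INSERT INTO PartyRole VALUES (?, ?)",
                [(1, 'CUSTOMER'), (1, 'EMPLOYEE'), (2, 'SUPPLIER')])
roles = [r[0] for r in cur.execute(
    "SELECT role_type_code FROM PartyRole WHERE party_id = 1 ORDER BY 1")]
print(roles)  # ['CUSTOMER', 'EMPLOYEE']
```

The trade-off is the one the text describes: the generic structure accommodates new roles without change, but carries less explicit detail than distinct sub-type entities would.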
5.2.3.3.1 Attributes
An attribute is a property of an entity; a type of fact important to the business whose values help identify or describe an entity instance. For example, the attribute Student Last Name describes the last name of each student. Attributes translate in a physical data model to a field in a file or a column in a database table. Attributes use business names, while fields and columns use technical names that frequently include technical abbreviations. In a logical data model, business entities represent the essential nouns in the organization’s vocabulary, and attributes represent adjectives.
An attribute in a logical model should be atomic. It should contain one and only one piece of data (fact) that cannot be divided into smaller pieces. For example, a conceptual data element called phone number divides into several logical data elements: phone type code (home, office, fax, mobile, etc.), country code (1 for the US and Canada), area code, prefix, base phone number, and extension.
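The phone number decomposition above can be sketched as follows (Python; the field names and the formatting rule are illustrative only):

```python
from dataclasses import dataclass

# One conceptual element "phone number" decomposed into atomic attributes.
@dataclass
class PhoneNumber:
    phone_type_code: str   # e.g. 'HOME', 'OFFICE', 'FAX', 'MOBILE'
    country_code: str      # e.g. '1' for the US and Canada
    area_code: str
    prefix: str
    base_number: str
    extension: str = ""

    def formatted(self) -> str:
        """Reassemble the atomic pieces for display."""
        s = f"+{self.country_code} ({self.area_code}) {self.prefix}-{self.base_number}"
        return f"{s} x{self.extension}" if self.extension else s

office = PhoneNumber("OFFICE", "1", "630", "555", "0100", "12")
print(office.formatted())  # +1 (630) 555-0100 x12
```

Because each piece is atomic, any one of them (the area code, say) can be queried, validated, or changed without string parsing.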
An instance of an attribute is the value of the attribute for a particular entity instance. An occurrence of a data value is its appearance as an attribute instance for an entity instance. The data element instance 60106, for example, belongs to the Customer Zip Code data element, which exists for the Customer instance Bob.
Entity and attribute definitions are essential contributors to the business value of any data model. High-quality definitions clarify the meaning of business vocabulary and provide rigor to the business rules governing entity relationships. High-quality definitions assist business professionals in making intelligent business decisions, and they assist IT professionals in making intelligent application design decisions. High-quality data definitions exhibit three essential characteristics: clarity, accuracy, and completeness.
5.2.3.3.2 Domains
The complete set of all possible values for an attribute is a domain. An attribute can never contain values outside of its assigned domain. Some domains have a limited number of specific defined values, or minimum or maximum limits for numbers. Business rules can also restrict domains.
Attributes often share the same domain. For example, an employee hire date and a purchase order date must be:
A data dictionary contains a collection of domains and the attributes that relate to each domain, among other things.
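A sketch of how a DBMS can enforce a domain (SQLite CHECK-constraint syntax via Python; the calendar-date format rule shown is illustrative):

```python
import sqlite3

# A CHECK constraint restricts hire_date to its domain: a value
# in YYYY-MM-DD form. Table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Employee (
        employee_id INTEGER PRIMARY KEY,
        hire_date   TEXT NOT NULL
            CHECK (hire_date GLOB '[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]')
    )""")
conn.execute("INSERT INTO Employee VALUES (1, '2010-06-15')")  # in the domain
in_domain_only = True
try:
    conn.execute("INSERT INTO Employee VALUES (2, 'yesterday')")  # outside it
    in_domain_only = False
except sqlite3.IntegrityError:
    pass  # the DBMS rejects values outside the domain
row_count = conn.execute("SELECT COUNT(*) FROM Employee").fetchone()[0]
print(in_domain_only, row_count)  # True 1
```

A purchase order date column could reuse the same CHECK expression, reflecting the shared domain described above.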
5.2.3.3.3 Keys
Attributes assigned to entities are either key or non-key attributes. A key data element helps identify one unique entity instance from all others, either fully (by itself) or partially (in combination with other key elements). Non-key data elements describe the entity instance but do not help uniquely identify it.
A key (or candidate key) represents the one or more attributes whose values uniquely identify an entity instance. A composite key is a key containing two or more attributes. One of these candidate keys becomes the primary key. There should be only one primary key. All other candidate keys become alternate keys.
To avoid using composite primary keys, or key attributes with values that change over time, use a surrogate key. A surrogate key contains a randomly generated value uniquely assigned to an entity instance. ‘Surrogate’ means ‘substitute’. Use a surrogate key when no truly unique data element or set of data elements exists within the entity. Other names for surrogate keys are anonymous keys or non-intelligent keys. Note that a key generated by sequence number still carries some intelligence: a person can tell from the sequence the order in which the rows were inserted into the table, similar to a row number. True surrogate keys are random, not sequential.
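A minimal sketch of generating true surrogate keys (Python; random UUIDs are one common choice, not the only one):

```python
import uuid

# A true (random, non-intelligent) surrogate key, as opposed to a
# sequence number that still reveals insertion order.
def new_surrogate_key() -> str:
    """Return a random key carrying no business meaning."""
    return str(uuid.uuid4())

key_a, key_b = new_surrogate_key(), new_surrogate_key()
print(key_a)  # a random value such as '3f2c9a4e-...' -- no ordering information
```

Because the values are random, nothing about the business entity, or about when it was inserted, can be inferred from the key.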
A foreign key is an attribute that provides a link to another entity. Simply put, a foreign key is an attribute that appears in both entities in a relationship, and partially or fully identifies either one or both of the entities. When a one-to-many relationship exists between two entities, the entity on the child side of the relationship inherits the primary key attributes from the entity on the parent side of the relationship. The foreign key enables navigation between data structures.
An identifying relationship occurs when the foreign key attribute(s) of a parent entity appears as part of the composite primary key of a child entity. A non-identifying relationship occurs when the foreign key of a parent entity is a non-key attribute(s) describing the child entity.
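Referential integrity between parent and child can be sketched as follows (SQLite via Python, which enforces foreign keys only when the pragma is set; the Company and Person entities echo the relationship example earlier in this section, and the non-identifying foreign key shown is one of the two forms just described):

```python
import sqlite3

# The child's foreign key must match an existing parent primary key.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforcement is opt-in
conn.execute("CREATE TABLE Company (company_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE Person (
        person_id  INTEGER PRIMARY KEY,
        company_id INTEGER REFERENCES Company(company_id)  -- non-identifying FK
    )""")
conn.execute("INSERT INTO Company VALUES (10, 'Acme')")
conn.execute("INSERT INTO Person VALUES (1, 10)")   # parent exists: accepted
orphan_rejected = False
try:
    conn.execute("INSERT INTO Person VALUES (2, 99)")  # no such parent
except sqlite3.IntegrityError:
    orphan_rejected = True
print(orphan_rejected)  # True
```

Here company_id is a non-key attribute of Person; making it part of Person's primary key instead would turn this into an identifying relationship.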
5.2.3.4 Develop and Maintain Physical Data Models
A physical data model optimizes the implementation of detailed data requirements and business rules in light of technology constraints, application usage, performance requirements, and modeling standards. Design relational databases with the specific capabilities of a particular database management system in mind (such as IBM DB2 or UDB, Oracle, Teradata, Sybase, Microsoft SQL Server, or Microsoft Access). Organizations should have naming standards to guide the naming of physical data objects. An example of a physical data model is shown in Figure 5.5.
Figure 5.5 Physical Data Model Example
Physical data model design includes making decisions about:
Going forward, we will use the term ‘tables’ to refer to tables, files, and schemas; the term ‘columns’ to refer to columns, fields, and elements; and the term ‘rows’ to refer to rows, records, or instances.
Physical data modeling transforms the logical data model using several techniques, including:
5.2.4 Detailed Data Design
Detailed data design activities include:
Database administrators (DBAs) take the lead role in database design, and a collaborative role in designing information products (XML schemas, messages, screens, and reports) and related data services (data access services, data integration services, and business intelligence services). Data analysts take the lead role in designing information products and related data services, and a collaborative role in database design.
5.2.4.1 Design Physical Databases
Detailed design includes database implementation specifications. A physical database design may take advantage of the unique functions and capabilities of a specific database management system, which may or may not be included in the data model itself.
For relational databases, the primary design deliverables are the Data Definition Language (DDL) specifications. DDL is a subset of Structured Query Language (SQL) used to create tables, indexes, views, and other physical database objects. For XML databases, the primary design deliverable is the namespace.
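A minimal sketch of DDL at work (SQLite via Python; object names are illustrative): each statement creates a physical database object, which then appears in the system catalog.

```python
import sqlite3

# DDL: the SQL subset that creates physical database objects.
conn = sqlite3.connect(":memory:")
ddl = [
    "CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, cust_nm TEXT NOT NULL)",
    "CREATE INDEX ix_customer_nm ON customer (cust_nm)",
    "CREATE VIEW v_customer AS SELECT cust_id, cust_nm FROM customer",
]
for statement in ddl:
    conn.execute(statement)

# The catalog now lists each physical object the DDL created.
objects = sorted((row[0], row[1]) for row in
                 conn.execute("SELECT type, name FROM sqlite_master"))
print(objects)
# [('index', 'ix_customer_nm'), ('table', 'customer'), ('view', 'v_customer')]
```

In practice, the DDL deliverable is a reviewed, version-controlled script rather than statements issued ad hoc.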
A complete, high-quality database design document is more than just DDL statements. Section 5.2.4.1.3 describes a complete physical design document.
Whether or not the DBA collaborates in physical data modeling, the DBA has primary responsibility for detailed database design, including:
5.2.4.1.1 Physical Database Design
Choose a database design based on both a choice of architecture and a choice of technology. Base the choice of architecture (for example, relational, hierarchical, network, object, star schema, snowflake, cube, etc.) on several considerations:
The choice of implementation technology (for example, relational, XML, OLAP, or Object technology) may be governed by many different factors, including how long the data needs to be kept, whether it must be integrated with other data or passed across system or application boundaries, and on requirements of data security, integrity, recoverability, accessibility, and reusability.
There may also be organizational or political factors, including organizational biases and developer skill sets, that lean toward a particular technology or vendor. Other factors influencing physical database design include:
Database designers must find the answers to several questions, including:
In designing and building the database, the DBA should keep the following design principles firmly in mind (remember the acronym PRISM):
Here are some recommended best practices for physical database design:
5.2.4.1.2 Performance Modifications
When implementing a physical database, consider how the database will perform when applications make requests to access and modify data. There are several techniques used to optimize database performance.
Indexing can improve query performance in many cases. The database designer must select and define appropriate indexes for database tables. An index is an alternate path for accessing data in the database to optimize query (data retrieval) performance. Major RDBMS products support many types of indexes. Indexes can be unique or non-unique, clustered or non-clustered, partitioned or non-partitioned, single column or multi-column, b-tree or bitmap or hashed. Without an appropriate index, the DBMS will revert to reading every row in the table (table scan) to retrieve any data. On large tables, this is very costly. Try to build indexes on large tables to support the most frequently run queries, using the most frequently referenced columns, particularly keys (primary, alternate, and foreign).
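The effect of an index on the access path can be sketched as follows (SQLite via Python; table and index names are illustrative, and EXPLAIN QUERY PLAN wording varies slightly between SQLite versions):

```python
import sqlite3

# Without an index on the queried column the optimizer scans every row;
# with one, it searches the index instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, cust_id INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, i % 50) for i in range(1000)])

def plan(sql: str) -> str:
    """Return the optimizer's chosen access path for a query."""
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM orders WHERE cust_id = 7"
before = plan(query)                                     # full table scan
conn.execute("CREATE INDEX ix_orders_cust ON orders (cust_id)")
after = plan(query)                                      # index search
print(before)  # e.g. SCAN orders
print(after)   # e.g. SEARCH orders USING INDEX ix_orders_cust (cust_id=?)
```

The same inspection technique helps verify that the most frequently run queries actually use the indexes built for them.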
Denormalization is the deliberate transformation of a normalized logical data model into tables with redundant data. In other words, it intentionally puts one data element in multiple places. This process introduces a risk of data errors due to duplication, so implement data quality checks to ensure the copies of the data elements remain consistent. Denormalize only to improve database query performance: by segregating or combining data to reduce query set sizes, by combining data to reduce joins, or by performing and storing costly data calculations in advance. Denormalization techniques include (among others):
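As one illustration of the trade-off, the following sketch (SQLite via Python; hypothetical tables) stores a pre-computed order total redundantly and pairs it with the kind of data quality check recommended above:

```python
import sqlite3

# order_total is derivable from the line items, but is stored redundantly
# on the order to avoid a join and SUM at query time.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sales_order
                (order_id INTEGER PRIMARY KEY, order_total REAL)""")
conn.execute("""CREATE TABLE order_line
                (order_id INTEGER, line_no INTEGER, amount REAL,
                 PRIMARY KEY (order_id, line_no))""")
conn.executemany("INSERT INTO order_line VALUES (?, ?, ?)",
                 [(1, 1, 10.0), (1, 2, 2.5)])
conn.execute("INSERT INTO sales_order VALUES (1, 12.5)")  # pre-computed total

# Data quality check: the stored total must match the recomputed total.
mismatches = conn.execute("""
    SELECT o.order_id FROM sales_order o
    JOIN order_line l ON l.order_id = o.order_id
    GROUP BY o.order_id, o.order_total
    HAVING o.order_total <> SUM(l.amount)""").fetchall()
print(mismatches)  # [] -- the redundant copy is consistent
```

Any row returned by the check identifies an order whose redundant copy has drifted from the source data.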
5.2.4.1.3 Physical Database Design Documentation
The physical database design document guides implementation and maintenance. It is reviewable to catch and correct errors in the design before creating or updating the database. It is modifiable for ease of implementation of future iterations of the design. A physical database design document consists of the following components:
5.2.4.2 Design Information Products
While database design is the primary focus of data development, data professionals should also participate in the design of related data deliverables.
Data analysts may assist software designers and developers in the design of information products, including screens and reports, to meet business data requirements. Data analysts should ensure consistent use of business data terminology, and should ensure that presentation formats add appropriate context to the data for data producers and information consumers.
The DBA will often assist in the development of applications that make data more readily available, in a more usable form, to business users and managers. Many exciting new technologies exist for this purpose, and the DBA should be familiar with them:
The DBA’s involvement with the development of these products may include data analysis, the creation of data structures (such as XML schemas, OLAP cubes, or data marts) and database objects to support these products, enabling access to data, and assisting with data integration and delivery.
DBAs may assist software developers by creating and maintaining database access statements. In SQL, these statements are known as Data Manipulation Language (DML) and include SELECT, INSERT, UPDATE, and DELETE statements. DBAs often review these statements and recommend alternate approaches and performance tuning modifications.
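The four DML statement types can be sketched together as follows (SQLite via Python; the account table is hypothetical):

```python
import sqlite3

# The four DML statement types: INSERT, UPDATE, DELETE, SELECT.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (acct_id INTEGER PRIMARY KEY, balance REAL)")

conn.execute("INSERT INTO account VALUES (1, 100.0)")                        # INSERT
conn.execute("INSERT INTO account VALUES (2, 250.0)")
conn.execute("UPDATE account SET balance = balance + 50 WHERE acct_id = 1")  # UPDATE
conn.execute("DELETE FROM account WHERE acct_id = 2")                        # DELETE
rows = conn.execute("SELECT acct_id, balance FROM account").fetchall()       # SELECT
print(rows)  # [(1, 150.0)]
```

Statements like these are what a DBA reviews for correctness and tuning opportunities before they reach production.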
DBAs may collaborate with software designers and developers on designing data access-layer services in a service-oriented architecture (SOA). Data access services standardize data access and insulate programs from database changes.
5.2.4.3 Design Data Access Services
It will oftentimes be necessary (and desirable) to access data in remote databases, and to combine that data with data in the local database. Several mechanisms exist for doing this, and the DBA should be familiar with the strengths and weaknesses of each. Some of the most common methods of accessing and reusing remote data are as follows:
Remember that the end goal is to enable easy and inexpensive reuse of data across the enterprise, to avoid costly data replication schemes wherever possible, and to prevent redundant and inconsistent data.
5.2.4.4 Design Data Integration Services
A database transaction is an atomic unit of recoverable work. A transaction can include multiple database instructions. Upon completion of all the steps within the transaction, issue a database COMMIT to make all the changes permanent together. Up to that point, the changes can be rolled back. A transaction is atomic, meaning “all or nothing”: it performs either all of the instructions or none of them. Application developers define database transactions by determining when to COMMIT changes.
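Transaction atomicity can be sketched as follows (SQLite via Python; the transfer table is hypothetical): a debit and a credit either both persist or both disappear.

```python
import sqlite3

# Two inserts form one atomic unit of work: COMMIT keeps both,
# ROLLBACK undoes both.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transfer (transfer_id INTEGER PRIMARY KEY, amount REAL)")

conn.execute("INSERT INTO transfer VALUES (1, -100.0)")  # debit
conn.execute("INSERT INTO transfer VALUES (2,  100.0)")  # credit
conn.rollback()                                          # undo both
after_rollback = conn.execute("SELECT COUNT(*) FROM transfer").fetchone()[0]

conn.execute("INSERT INTO transfer VALUES (1, -100.0)")
conn.execute("INSERT INTO transfer VALUES (2,  100.0)")
conn.commit()                                            # keep both
after_commit = conn.execute("SELECT COUNT(*) FROM transfer").fetchone()[0]
print(after_rollback, after_commit)  # 0 2
```

Note that Python's sqlite3 module opens a transaction implicitly before the first data-modifying statement, so the rollback above undoes both inserts.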
A critical aspect of database design is determining appropriate update mechanisms. Whenever multiple users can concurrently update tables, implement some concurrency control mechanism to ensure that two users cannot update the same record at the same time. This usually involves adding a data element of type “timestamp” or “datetime” to each of these tables, checking the value of this field before modifying a record, and updating the field whenever the record changes.
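The concurrency check just described can be sketched as follows (SQLite via Python; for simplicity the sketch uses an incrementing integer row_version column rather than a timestamp, but the logic is the same):

```python
import sqlite3

# Optimistic concurrency control: an UPDATE succeeds only if the row
# is still at the version the user originally read.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE customer
                (cust_id INTEGER PRIMARY KEY, cust_nm TEXT, row_version INTEGER)""")
conn.execute("INSERT INTO customer VALUES (1, 'Bob', 1)")

def update_name(cust_id: int, new_name: str, version_read: int) -> bool:
    cur = conn.execute(
        """UPDATE customer
           SET cust_nm = ?, row_version = row_version + 1
           WHERE cust_id = ? AND row_version = ?""",
        (new_name, cust_id, version_read))
    return cur.rowcount == 1  # False: someone else changed the row first

first = update_name(1, 'Robert', 1)   # version still 1: succeeds
second = update_name(1, 'Bobby', 1)   # stale version: rejected
print(first, second)  # True False
```

The second user's update is rejected rather than silently overwriting the first user's change; the application can then re-read the row and retry.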
Use locks to ensure the integrity of data, permitting only one user to change a database row at any one time. Lock data at different levels, known as lock granularity. DBAs determine the appropriate level of locking for each database object, such as column, row, page, table, file, or database.
Data analysts and data integration specialists define source-to-target mappings and data transformation designs for extract-transform-load (ETL) programs and other technology for on-going data movement, cleansing, and integration. DBAs may collaborate in this design activity.
Data analysts, data integration specialists, and DBAs also design programs and utilities for data migration and conversion from old data structures to new data structures.
Several methods are available, but any method chosen must satisfy the following criteria:
Consider the following possible update mechanisms:
5.2.5 Data Model and Design Quality Management
Data analysts and designers act as intermediaries between information consumers (the people with business requirements for data) and the data producers who capture the data in usable form. Data professionals must juggle the business data requirements of the information consumers, including executives, and the application requirements of data producers. System requirements documents express application data requirements in the form of use cases, an application class model, and service level agreements (SLAs).
Data professionals must also balance the short-term versus long-term business interests. Information consumers need data in a timely fashion to meet short-term business obligations and to take advantage of current business opportunities. System-development project teams must meet time and budget constraints. However, they must also meet the long-term interests of all stakeholders by ensuring that an organization’s data resides in data structures that are secure, recoverable, sharable, and reusable, and that this data is as correct, timely, relevant, and usable as possible. Therefore, data models and database designs should be a reasonable balance between the short-term needs and the long-term needs of the enterprise.
5.2.5.1 Develop Data Modeling and Design Standards
Data modeling and database design standards serve as the guiding principles to effectively meet business data needs, conform to data architecture, and ensure data quality. Data architects, data analysts, and database administrators must jointly develop these standards. They must complement and not conflict with related IT standards.
Publish data model and database naming standards for each type of modeling object and database object. Naming standards are particularly important for entities, tables, attributes, keys, views, and indexes. Names should be unique and as descriptive as possible.
Logical names should be meaningful to business users, using full words as much as possible and avoiding all but the most familiar abbreviations. Physical names must conform to the maximum length allowed by the DBMS and use abbreviations where necessary. While logical names use blank spaces as separators between words, physical names typically use underscores as word separators.
Naming standards should minimize name changes across environments. Names should not reflect their specific environment, such as test, QA, or production. Class words can be useful to distinguish attributes from entities, and column names from table names. They can also show which attributes and columns are quantitative rather than qualitative, which can be important when analyzing the contents of those columns.
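These conventions can be partly automated. The following sketch derives a physical column name from a logical attribute name; the abbreviation list and the 30-character length limit are illustrative assumptions, since each DBMS and each standards document sets its own.

```python
# Hypothetical naming-standard helper: derive a physical name from a logical
# attribute name. Abbreviations and the length limit are assumptions.
ABBREVIATIONS = {"number": "nbr", "amount": "amt", "description": "desc"}
MAX_PHYSICAL_LENGTH = 30

def physical_name(logical_name: str) -> str:
    words = logical_name.lower().split()               # logical names use blanks
    words = [ABBREVIATIONS.get(w, w) for w in words]   # abbreviate where needed
    name = "_".join(words)                             # physical names use underscores
    return name[:MAX_PHYSICAL_LENGTH]                  # respect the DBMS limit

print(physical_name("Customer Account Number"))  # customer_account_nbr
print(physical_name("Order Amount"))             # order_amt
```

Because the derivation is deterministic, the same logical model always yields the same physical names across environments, which supports the goal of minimizing name changes.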
Data modeling and database design standards should include:
5.2.5.2 Review Data Model and Database Design Quality
Project teams should conduct requirements reviews and design reviews as appropriate. These reviews should include a conceptual data model review, a logical data model review, and a physical database design review.
Conduct design reviews with a group of subject matter experts representing different backgrounds, skills, expectations, and opinions. Participants must be able to discuss different viewpoints and reach group consensus without personal conflict, as all participants share the common goal of promoting the most practical, best-performing, and most usable design. Chair each design review with one leader who facilitates the meeting. The leader creates and follows an agenda, ensures all required documentation is available and distributed, solicits input from all participants, maintains order and keeps the meeting moving, and summarizes the group’s consensus findings. Many design reviews also use a scribe to capture points of discussion.
5.2.5.2.1 Conceptual and Logical Data Model Reviews
Conceptual data model and logical data model design reviews should ensure that:
5.2.5.2.2 Physical Database Design Review
Physical database design reviews should ensure that:
All concerned stakeholders, including the DBA group, the data analyst / architect, the business data owners and / or stewards, the application developers, and the project managers, should review and approve the physical database design document. The complete design document should be ready as part of the production turnover of the database.
5.2.5.2.3 Data Model Validation
Validate data models against modeling standards, business requirements, and database requirements. Here are some sample validation questions:
5.2.5.3 Manage Data Model Versioning and Integration
Data models and other design specifications require careful change control, just like requirements specifications and other SDLC deliverables. Note each change to a data model to preserve the lineage of changes over time. If a change involves the logical model, such as a new or changed business data requirement, the data analyst or architect must review and approve the change.
Each change should note:
Changes may be made to multiple parts of enterprise models simultaneously as part of the normal process. It is important to integrate any changes to a model part back into the enterprise model, especially the enterprise logical model, to prevent errors in data and databases during future development.
Some data modeling tools include repositories that provide data model versioning and integration functionality. Otherwise, preserve data models in DDL exports or XML files, checking them in and out of a standard source code management (SCM) system just like application code.
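When no modeling repository is available, change lineage can still be preserved by diffing successive DDL exports kept under source control. A minimal Python sketch, using hypothetical table and file names:

```python
import difflib

# Sketch of model version control without a repository tool: keep each DDL
# export under source control and diff revisions to preserve change lineage.
ddl_v1 = """CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    customer_name TEXT
);"""

ddl_v2 = """CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    customer_name TEXT,
    customer_type TEXT
);"""

changes = list(difflib.unified_diff(
    ddl_v1.splitlines(), ddl_v2.splitlines(),
    fromfile="customer.ddl@v1", tofile="customer.ddl@v2", lineterm=""))
print("\n".join(changes))
```

The diff output makes each change reviewable: here the added `customer_type` column shows up as a `+` line, which the data analyst or architect can approve before the change reaches the enterprise model.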
5.2.6 Data Implementation
Data implementation consists of data management activities that support system building, testing, and deployment, including:
After design, the DBA is responsible for implementing the designed data structures in the development and test environments. These structures include database tables or files, views, stored procedures, and functions, OLAP data cubes, XSLT schemas, and other similar objects. The DBA is responsible for change control of the development database environment and its configuration. Change control procedures for development and test environments should be similar or the same as those used to control production environments. The DBA should manage configuration changes to database design specification (DDL) files using the same change and configuration management tools and practices used for other information system deliverables.
5.2.6.1 Implement Development / Test Database Changes
As changes to the database are required during the course of application development, the DBA either implements or oversees them. These changes usually come from the developer. Implementation happens depending on roles and responsibilities:
If an iterative development method is being used (for example, Agile Development), then some of the work of reviewing and approving changes, and updating the logical and physical models, may need to be done asynchronously. Consider giving approvals verbally so that development can proceed without undue interruption, and do the update of the models as a follow-on task. However, take care to ensure that the database does not get “out-of-synch” with the logical model, and that the database does not become “stove-piped” by being tightly coupled to a single application. Implement application-specific database requirements as much as possible, using views, stored procedures, functions, and other forms of data virtualization.
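Data virtualization through views can be sketched briefly. In the hypothetical example below, the application reads only from a view, so the base table can later be restructured without tightly coupling the database to that single application. All object names are illustrative.

```python
import sqlite3

# Sketch: satisfy an application-specific requirement with a view rather than
# letting the application couple itself to the base table. Names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE customer (
    customer_id   INTEGER PRIMARY KEY,
    customer_name TEXT,
    customer_type TEXT)""")
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                 [(1, "Acme", "WHL"), (2, "Globex", "RTL")])

# The application sees only the virtualized subset it asked for; the base
# table can later change shape without breaking the application.
conn.execute("""CREATE VIEW v_wholesale_customer AS
    SELECT customer_id, customer_name
    FROM customer WHERE customer_type = 'WHL'""")

rows = conn.execute("SELECT * FROM v_wholesale_customer").fetchall()
print(rows)  # [(1, 'Acme')]
```

If the base table is later split or renamed, only the view definition changes; the application’s queries do not, which is exactly the decoupling the paragraph above argues for.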
DBAs should carefully monitor all database code to ensure that it is written to the same standards as application code. All database code should be well documented, testable (ideally, containing built-in diagnostic code that can be triggered via a passed parameter), understandable, consistent with the agreed-upon standards, and easily maintainable. The DBA should also identify, as early as possible, poor SQL coding practices that could lead to errors or performance problems, and bring them to the attention of the developers before multiple stored procedures or functions replicate poor SQL code. A little extra attention at the beginning of a project can save everyone a great deal of grief later on.
5.2.6.2 Create and Maintain Test Data
The DBA and software developers and testers may collaborate to populate databases in the development environment with test data. Either generate test data, or extract a representative subset of production data. Strictly observe privacy and confidentiality requirements and practices for test data. Delete obsolete, unusable, and no longer needed test data.
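Both approaches to test data, generating synthetic rows and masking a production extract, can be sketched as follows. This is a hypothetical illustration; the field names and the masking rule are assumptions, and real masking must satisfy the organization’s privacy and confidentiality requirements.

```python
import random

# Sketch of two common approaches to test data: generating synthetic rows,
# and masking a "production" extract before use in a test environment.
random.seed(42)  # make the generated test data reproducible

def generate_customers(n: int) -> list[dict]:
    """Fabricate synthetic rows -- no real customer data involved."""
    return [{"customer_id": i,
             "customer_name": f"Test Customer {i}",
             "credit_limit": random.choice([1000, 2500, 5000])}
            for i in range(1, n + 1)]

def mask(row: dict) -> dict:
    """Mask identifying fields in a production extract before test use."""
    masked = dict(row)
    masked["customer_name"] = f"Customer {row['customer_id']:05d}"
    return masked

print(generate_customers(2))
print(mask({"customer_id": 7, "customer_name": "Real Person", "credit_limit": 1000}))
```

Synthetic generation avoids privacy risk entirely; masking preserves realistic data distributions but requires verifying that no identifying value survives in any column.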
The DBA may also assist the developers with the creation of SQL scripts and data integration “packages”, such as DTS or SSIS packages, used to create and maintain test data. Usually, this work is the primary responsibility of the development team, but they often need and appreciate the expertise of the DBA. This is another way that DBAs can add value to the development effort.
5.2.6.3 Migrate and Convert Data
A key component of many projects is the migration of legacy data to a new database environment, including any necessary data cleansing and reformatting. This is a significant effort. The time and cost required should not be (but often are) underestimated. It requires the collaborative effort of the data architect / analyst(s) familiar with the legacy data model(s) and the target data model, the DBA, business users, and developers familiar with the legacy application(s). Depending on where the legacy data is stored, this effort may involve the use of many different technologies, including SQL, COBOL, Unix scripting, DBMS integration packages such as DTS or SSIS, non-relational DBMSs, third-party ETL applications, data integration web services, FTP, RPC, ODBC, OLE / DB, and so on. Data migration efforts can easily consume thousands of hours of effort.
5.2.6.4 Build and Test Information Products
Data professionals, including the DBA, should collaborate with software developers on development and testing of information products created by the system, including:
Software developers are responsible for coding and testing programs, including database access calls. Software developers are also responsible for creating, testing, and maintaining information products, including screens and reports. Testing includes unit, integration, and performance testing.
5.2.6.5 Build and Test Data Access Services
DBAs are responsible for developing data access services. The DBA collaborates with software developers in developing, testing, and executing data access services, first for development and test environments, and later for production deployment.
Data requirements should include business rules for data access to guide the implementation of data access services; DBAs collaborate with software developers on this implementation. Business data stewards and other subject matter experts (SMEs) should validate the correct implementation of data access requirements and performance through user acceptance testing.
5.2.6.6 Build and Test Data Integration Services
Data integration specialists are responsible for developing ETL programs and technology for data integration, as well as data migration and conversion from old data structures into new structures. The DBA collaborates with software developers in developing, testing, and executing data migration and conversion programs and procedures, first for development and test data, and later for production deployment.
Data requirements should include business rules for data quality to guide the implementation of application edits and database referential integrity constraints. Business data stewards and other subject matter experts (SMEs) should validate the correct implementation of data requirements through user acceptance testing.
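Referential integrity constraints and rule-based edits can both be declared in the database so that invalid data is rejected at the database layer regardless of which application writes it. A minimal sketch using Python’s sqlite3, with illustrative table names and an assumed business rule that credit limits cannot be negative:

```python
import sqlite3

# Sketch: a business rule implemented as a CHECK constraint plus referential
# integrity, so bad data is rejected at the database layer itself.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs per connection
conn.execute("CREATE TABLE customer_type (type_code TEXT PRIMARY KEY)")
conn.execute("""CREATE TABLE customer (
    customer_id  INTEGER PRIMARY KEY,
    type_code    TEXT NOT NULL REFERENCES customer_type (type_code),
    credit_limit INTEGER CHECK (credit_limit >= 0))""")
conn.execute("INSERT INTO customer_type VALUES ('WHL')")

conn.execute("INSERT INTO customer VALUES (1, 'WHL', 1000)")  # passes both rules

try:
    conn.execute("INSERT INTO customer VALUES (2, 'XXX', 1000)")  # unknown type
    fk_ok = True
except sqlite3.IntegrityError:
    fk_ok = False

try:
    conn.execute("INSERT INTO customer VALUES (3, 'WHL', -50)")   # negative limit
    check_ok = True
except sqlite3.IntegrityError:
    check_ok = False

print(fk_ok, check_ok)  # False False
```

Application edits can still provide friendlier error handling, but declaring the rules in the database means user acceptance testing can verify them once, centrally.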
5.2.6.7 Validate Information Requirements
The responsibilities of data professionals within the SDLC do not end with design. They continue to interact as part of project teams for system development through the implementation of these designs. Database administrators are particularly active in these SDLC stages. Business data stewards may also remain involved after analysis and design, or a separate independent quality assurance team may control the test process. The primary work is testing and validating that the solution meets the requirements, along with planning deployment, developing training materials, and writing documentation.
In any application development project, especially those using iterative (“Agile”) methods, data (and database) requirements may change abruptly, in response to either new or changed business requirements, invalidated assumptions regarding the data, or re-prioritization of existing requirements. The data modeler may serve as the intermediary between the developers and the data analyst / architect, reviewing any additions or changes to business data requirements. The data modeler would also properly reflect them in the logical and physical data models. The DBA would implement any changes in the most effective manner in the database. The DBA then works with the developers to test the implementation of the data requirements, and make sure that the application requirements are satisfied.
5.2.6.8 Prepare for Data Deployment
While database administrators resolve technical implementation and testing issues, data analysts can leverage the business knowledge captured in data modeling to define clear and consistent language in user training and documentation. Business concepts, terminology, definitions, and rules depicted in data models are an important part of application user training, even if data models themselves are not useful as teaching illustrations. The data stewards that contribute business knowledge to the definition of the data models, and who are accountable for system data quality, are often also the process and application owners responsible for user acceptance of both the system and related training and documentation. Use their nomenclature consistently.
Data stewards and data analysts should participate in deployment preparation, including development and review of training materials and system documentation, especially to ensure consistent use of defined business data terminology. Help desk support staff also requires orientation and training in how system users appropriately access, manipulate, and interpret data.
The DBA is primarily responsible for implementing new and changed database objects into the production environment (see Chapter 6 on Data Operations Management). Database administrators should carefully control the installation of new databases and changes to existing databases in the production environment. Once installed, business data stewards and data analysts should monitor the early use of the system to see that business data requirements are indeed met.
5.3 Summary
The guiding principles for implementing data development into an organization, a summary table of the roles for each data development activity, and organization and cultural issues that may arise during data development are summarized below.
5.3.1 Guiding Principles
The implementation of the data development function into an organization follows nine guiding principles:
5.3.2 Data Development Process Summary
The process summary for the data development function is shown in Table 5.1. The deliverables, responsible roles, approving roles, and contributing roles are shown for each activity in the data development function. The Table is also shown in Appendix A9.
Activities | Deliverables | Responsible Roles | Approving Roles | Contributing Roles
3.1.1 Analyze Information Requirements (D) | Information Requirement Specification Statements | Data Architects, Data Analysts | Data Stewards | Data Stewards, Other SMEs
3.1.2 Develop and Maintain Conceptual Data Models (D) | Conceptual Data Model Diagrams and Reports | Data Architects, Data Analysts | Data Stewards, Data Architects | Data Stewards, Other SMEs
3.1.3 Develop and Maintain Logical Data Models (D) | Logical Data Model Diagrams and Reports | Data Architects, Data Analysts, Data Modelers | Data Stewards, Data Architects | Data Stewards, Other SMEs
3.1.4 Develop and Maintain Physical Data Models (D) | Physical Data Model Diagrams and Reports | Data Architects, Data Modelers, DBAs | DBAs, Data Architects | Software Developers
3.2.1 Design Physical Databases (D) | DDL Specifications, OLAP Cube Specs, XML schemas | DBAs, Application Architects, Software Developers | Data Architects, DBAs, Application Architects | Data Analysts, Data Modelers, Software Developers
3.2.2 Design Information Products (D) | Application Screens, Reports | Software Developers | Application Architects | Data Analysts, DBAs
3.2.3 Design Data Access Services (D) | Data Access Service Design Specifications | Software Developers, DBAs | Application Architects, Data Architects | Data Analysts, DBAs
3.2.4 Design Data Integration Services (D) | Source-to-Target Maps, ETL Design Specs, Conversion Designs | Data Integration Specialists, DBAs, Data Analysts | DBAs, Data Architects, Application Architects | Data Analysts, Data Stewards, DBAs
3.3.1 Develop Data Modeling and Database Design Standards (P) | Data Modeling Standards Documents, Database Design Standards Documents | Data Architects, Data Analysts, Data Modelers, DBAs | DM Executive, Data Governance Council | Data Stewards, Application Architects, Software Developers
3.3.2 Review Data Model and Database Design Quality (C) | Design Review Findings | Data Architects, Data Analysts, Data Modelers, DBAs | DM Executive, Project Manager | Application Architects, Software Developers
3.3.3 Manage Data Model Versioning and Integration (C) | Model Management Libraries and Contents | Data Model Administrators, Data Modelers | Data Architects, DM Executive | Data Analysts, DBAs
3.4.1 Implement Development and Test Database Changes (D) | Dev and Test DB Environments, Database Tables, Other DB Objects | DBAs | DM Executive | Data Architects, Data Analysts, Software Developers
3.4.2 Create and Maintain Test Data (D) | Test Databases, Test Data | DBAs, Data Analysts, Software Developers, Test Analysts | Data Architects, Application Architects, Data Stewards | Data Stewards, Software Developers, Data Analysts
3.4.3 Migrate and Convert Data (D) | Migrated and Converted Data | DBAs, Software Developers | Data Stewards, Data Architects | Data Analysts
3.4.4 Build and Test Information Products (D) | Information Products: Screens, Reports | Software Developers | Data Stewards, Application Architects, Data Architects | DBAs, Data Analysts
3.4.5 Build and Test Data Access Services (D) | Data Access Services (interfaces) | Software Developers | Data Architects, Application Architects | DBAs
3.4.6 Build and Test Data Integration Services (D) | Data Integration Services (ETL, etc.) | Data Integration Specialists | Data Stewards, Data Architects | DBAs, Data Analysts
3.4.7 Validate Information Requirements (D) | Validated Requirements, User Acceptance Signoff | Data Stewards, Testing Specialists | Data Stewards | Data Analysts, Data Architects, DBAs
3.4.8 Prepare for Data Deployment (D) | User Training, User Documentation | Data Stewards, Business SMEs, Training Specialists, Data Analysts | Data Stewards, Data Architects | Data Stewards, Data Architects, DBAs
Table 5.1 Data Development Process Summary
5.3.3 Organizational and Cultural Issues
Q1: What is the biggest issue with data delivery?
A1: The biggest organizational and cultural issue regarding data delivery is simply recognizing the need for it and taking advantage of what data development offers. Many organizations focus on application development, overlooking the importance of the data itself. Simply discovering the importance and usefulness of data analysis and data modeling can be transformational to an organization. Both the business and IT sides start considering the impact on data when considering system changes, sometimes realizing that they already have similar data and functionality in another application, or that they do not really need what they thought they had or wanted.
Q2: How does one begin formal data development?
A2: To start the transformation, it is necessary to begin documenting systems from a data point of view. Data flows, data models, and data quality analyses all factor into this documentation. Start with one system, then move to systems that either send data to or receive data from the first system. Network diagrams from the infrastructure team can help with this.
Next, distribute pictures of the systems’ data flows and data models to the stakeholders of that system, both business and IT. Sit down with them to verify that what the pictures show is what they understand the system to do, or what they see the system does. Make sure that all stakeholders believe that the documentation shows the current reality of the system.
Then, publicize the existence of these new documents. Create a master version of the documents and implement changes to them as part of the SDLC. When a project goes into production, part of the production release is to distribute the updated data flows and data models.
Once the word gets out, data analysts and data modelers will be very busy both documenting additional systems and helping software engineers to use these new documents during project work. Additional headcount for that team will probably become necessary.
It will be an iterative process to get access to all systems in order to analyze them. Be persistent. Reduced system redundancy, reduced data storage redundancy, and more efficient development can save the organization millions of dollars.
The last step is to change the culture of the organization, making it standard operating procedure to refer to these documents during the requirements and design phases of projects. Once data development is part of the culture, the team devoted to maintaining it will grow to fit the organization’s needs.
5.4 Recommended Reading
The references listed below provide additional reading that supports the material presented in Chapter 5. These recommended readings are also included in the Bibliography at the end of the Guide.
5.4.1 Data Modeling and Database Design
Ambler, Scott. Agile Database Techniques: Effective Strategies for the Agile Software Developer. Wiley & Sons, 2003. ISBN 0-471-20283-5.
Ambler, Scott W. and Pramodkumar J. Sadalage. Refactoring Databases: Evolutionary Database Design. Addison-Wesley, 2006. ISBN 0-321-29353-3.
Avison, David and Christine Cuthbertson. A Management Approach to Database Applications. McGraw Hill, 2002. ISBN 0-077-09782-3.
Brackett, Michael H. Practical Data Design. Prentice Hall, 1990. ISBN 0-136-90827-6.
Bruce, Thomas A. Designing Quality Databases with IDEF1X Information Models. Dorset House, 1991. ISBN 0-932633-18-8.
Carlis, John and Joseph Maguire. Mastering Data Modeling - A User-Driven Approach. Addison Wesley, 2000. ISBN 0-201-70045-X.
Date, C. J. An Introduction to Database Systems, 8th Edition. Addison-Wesley, 2003. ISBN 0-321-19784-4.
Date, C. J. and Hugh Darwen. Databases, Types and the Relational Model: The Third Manifesto, 3rd Edition. Addison Wesley, 2006. ISBN 0-321-39942-0.
DeAngelis, Carla. Data Modeling with Erwin. Indiana: Sams Publishing, 2000. ISBN 0-672-31868-7.
Dorsey, Paul. Enterprise Data Modeling Using UML. McGraw-Hill Osborne Media, 2007. ISBN 0-072-26374-1.
Fleming, Candace C. and Barbara Von Halle. The Handbook of Relational Database Design. Addison Wesley, 1989. ISBN 0-201-11434-8.
Halpin, Terry. Information Modeling and Relational Databases: From Conceptual Analysis to Logical Design. Morgan Kaufmann, 2001. ISBN 1-558-60672-6.
Halpin, Terry, Ken Evans, Pat Hallock, and Bill McLean. Database Modeling with Microsoft Visio for Enterprise Architects. Morgan Kaufmann, 2003. ISBN 1-558-60919-9.
Harrington, Jan L. Relational Database Design Clearly Explained, 2nd Edition. Morgan Kaufmann, 2002. ISBN 1-558-60820-6.
Hay, David C. Data Model Patterns: A Metadata Map. Morgan Kaufmann, 2006. ISBN 0-120-88798-3.
Hay, David C. Data Model Patterns: Conventions of Thought. Dorset House Publishing, 1996. ISBN 0-932633-29-3.
Hay, David C. Requirements Analysis From Business Views to Architecture. Prentice Hall, 2003. ISBN 0-120-28228-6.
Hernandez, Michael J. Database Design for Mere Mortals: A Hands-On Guide to Relational Database Design, 2nd Edition. Addison-Wesley, 2003. ISBN 0-201-75284-0.
Hoberman, Steve. The Data Modeler’s Workbench. Tools and Techniques for Analysis and Design. John Wiley & Sons, 2001. ISBN 0-471-11175-9.
Hoberman, Steve. Data Modeling Made Simple: A Practical Guide for Business & Information Technology Professionals. Technics Publications, LLC, 2005. ISBN 0-977-14000-8.
Hoffer, Jeffrey A., Joey F. George, and Joseph S. Valacich. Modern Systems Analysis and Design, 4th Edition. Prentice Hall, 2004. ISBN 0-131-45461-7.
Krogstie, John, Terry Halpin, and Keng Siau, editors. Information Modeling Methods and Methodologies: Advanced Topics in Database Research. Idea Group Publishing, 2005. ISBN 1-591-40375-8.
Muller, Robert. J. Database Design for Smarties: Using UML for Data Modeling. San Francisco, CA, USA, Morgan Kaufmann, 1999. ISBN 1-558-60515-0.
Newton, Judith J. and Daniel Wahl, editors. Manual For Data Administration. Washington, DC: GPO, NIST Special Publications 500-208, 1993.
Pascal, Fabian. Practical Issues In Database Management: A Reference For The Thinking Practitioner. Addison-Wesley, 2000. ISBN 0-201-48555-9.
Reingruber, Michael. C. and William W. Gregory. The Data Modeling Handbook: A Best-Practice Approach to Building Quality Data Models. John Wiley & Sons, 1994. ISBN 0-471-05290-6.
Riordan, Rebecca M. Designing Effective Database Systems. Addison-Wesley, 2005. ISBN 0-321-20903-3.
Rob, Peter and Carlos Coronel. Database Systems: Design, Implementation, and Management, 7th Edition. Course Technology, 2006. ISBN 1-418-83593-5.
Schmidt, Bob. Data Modeling for Information Professionals. Prentice Hall, 1999. ISBN 0-13-080450-9.
Silverston, Len. The Data Model Resource Book, Volume 1: A Library of Universal Data Models for All Enterprises, 2nd Edition, John Wiley & Sons, 2001. ISBN 0-471-38023-7.
Silverston, Len. The Data Model Resource Book, Volume 2: A Library of Data Models for Specific Industries, 2nd Edition. John Wiley & Sons, 2001. ISBN 0-471-35348-5.
Simsion, Graeme C. and Graham C. Witt. Data Modeling Essentials, 3rd Edition. Morgan Kaufmann, 2005. ISBN 0-126-44551-6.
Teorey, Toby, Sam Lightstone, and Tom Nadeau. Database Modeling and Design, 4th Edition. Morgan Kaufmann, 2006. ISBN 1-558-60500-2.
Thalheim, Bernhard. Entity-Relationship Modeling: Foundations of Database Technology. Springer, 2000. ISBN 3-540-65470-4.
Van der Lans, Rick F. Introduction to SQL: Mastering the Relational Database Language, 4th Edition. Addison-Wesley, 2006. ISBN 0-321-30596-5.
Watson, Richard T. Data Management: Databases And Organization, 5th Edition. John Wiley & Sons, 2005. ISBN 0-471-71536-0.
5.4.2 Business Rules
Chisholm, Malcolm. How to Build a Business Rules Engine: Extending Application Functionality Through Metadata Engineering. Morgan Kaufmann, 2003. ISBN 1-558-60918-0.
Date, C. J. What Not How: The Business Rules Approach To Application Development. Addison-Wesley, 2000. ISBN 0-201-70850-7.
Morgan, Tony. Business Rules and Information Systems: Aligning IT with Business Goals. Addison-Wesley, 2002. ISBN 0-201-74391-4.
Ross, Ronald G. Business Rules Concepts, 2nd Edition. Business Rule Solutions, 2005. ISBN 0-941-04906-X.
Ross, Ronald G. Principles of the Business Rule Approach. Addison-Wesley, 2003. ISBN 0-201-78893-4.
Von Halle, Barbara. Business Rules Applied: Building Better Systems Using the Business Rules Approach. John Wiley & Sons, 2001. ISBN 0-471-41293-7.
5.4.3 Information Engineering
Finkelstein, Clive. An Introduction to Information Engineering: From Strategic Planning to Information Systems. Addison-Wesley, 1990. ISBN 0-201-41654-9.
Finkelstein, Clive. Information Engineering: Strategic Systems Development. Addison-Wesley, 1993. ASIN B000XUA41C.
Inmon, W. H. Advanced Topics in Information Engineering. John Wiley & Sons - QED, 1989. ISBN 0-894-35269-5.
Inmon, W. H. Information Engineering For The Practitioner. Prentice-Hall (Yourdon Press), 1988. ISBN 0-13-464579-0.
Martin, James. Information Engineering Book 1: Introduction. Prentice-Hall, 1989. ISBN 0-13-464462-X. Also see Book 2: Analysis and Design and Book 3: Design and Construction.
5.4.4 Agile Development
Ambler, Scott. Agile Database Techniques: Effective Strategies for the Agile Software Developer. Wiley & Sons, 2003. ISBN 0-471-20283-5.
5.4.5 Object Orientation and Object-Oriented Design
Wirfs-Brock, Rebecca, Brian Wilkerson, and Lauren Wiener. Designing Object-Oriented Software. NJ: Prentice Hall, 1990. ISBN 0-13-629825-7.
Coad, Peter. Object Models: Strategies, Patterns And Applications, 2nd Edition. Prentice Hall PTR, 1996. ISBN 0-13-840117-9.
Entsminger, Gary. The Tao Of Objects. M & T Books, 1990. ISBN 1-55851-155-5.
Goldberg, Adele and Kenneth S. Rubin. Succeeding With Objects. Addison-Wesley, 1995. ISBN 0-201-62878-3.
Graham, Ian. Migrating To Object Technology. Addison-Wesley, 1995. ISBN 0-201-59389-0.
Jacobson, Ivar, Maria Ericsson, and Agneta Jacobson. The Object Advantage. Addison-Wesley, 1995. ISBN 0-201-42289-1.
Taylor, David. Business Engineering With Object Technology. New York: John Wiley, 1995. ISBN 0-471-04521-7.
Taylor, David. Object Oriented Technology: A Manager’s Guide. Reading, MA: Addison-Wesley, 1990. ISBN 0-201-56358-4.
5.4.6 Service-Oriented Architecture (SOA)
Barry, Douglas K. Web Services and Service-Oriented Architectures: The Savvy Manager’s Guide. Morgan Kaufmann, 2003. ISBN 1-55860-906-7.
Erl, Thomas. Service-Oriented Architecture: A Field Guide to Integrating XML and Web Services. Prentice Hall, 2004. ISBN 0-131-42898-5.
Erl, Thomas. Service-Oriented Architecture: Concepts, Technology and Design. Prentice Hall, 2004. ISBN 0-131-85858-0.
5.4.7 SQL
Celko, Joe. Joe Celko’s SQL for Smarties: Advanced SQL Programming, 3rd Edition. Morgan Kaufmann, 2005. ISBN 0-123-69379-9.
Celko, Joe. Joe Celko’s Trees and Hierarchies in SQL for Smarties. Morgan Kaufmann, 2004. ISBN 1-558-60920-2.
Date, C. J., with Hugh Darwen. A Guide to the SQL Standard, 4th Edition. Addison-Wesley, 1997. ISBN 0-201-96426-0.
Kline, Kevin, with Daniel Kline. SQL in a Nutshell. O’Reilly, 2001. ISBN 0-471-16518-2.
Van der Lans, Rick F. Introduction to SQL: Mastering the Relational Database Language, 4th Edition. Addison-Wesley, 2006. ISBN 0-321-30596-5.
5.4.8 Software Process Improvement
Humphrey, Watts S. Managing The Software Process. Addison Wesley, 1989. ISBN 0-201-18095-2.
5.4.9 XML
Aiken, Peter and M. David Allen. XML in Data Management: Understanding and Applying Them Together. Morgan Kaufmann, 2004. ISBN 0-12-045599-4.
Bean, James. XML for Data Architects: Designing for Reuse and Integration. Morgan Kaufmann, 2003. ISBN 1-558-60907-5.
Finkelstein, Clive and Peter Aiken. Building Corporate Portals with XML. McGraw-Hill, 1999. ISBN 0-079-13705-9.
Melton, Jim and Stephen Buxton. Querying XML: XQuery, XPath and SQL/XML in Context. Morgan Kaufmann, 2006. ISBN 1-558-60711-0.