Feature Article: August 2014
Almost everyone has heard of “big data”, the popular term for data so massive that it is difficult to manage. Today, the volume of search engine queries, online retail sales and Twitter messages regularly exceeds the capabilities of traditional databases.
There’s a complement to big data that we call “big schema”. Modern data not only arrives in vast quantities and at high rates, but can also have diverse structure. Big schema can arise with enterprise data models, large data warehouses and scientific data.
Enterprise Data Models
An enterprise data model (EDM) describes the essence of an organization – it abstracts multiple apps, combining and reconciling their content. EDMs have many purposes, such as integrating app data, driving consistency across apps, documenting enterprise scope, finding functional gaps and overlaps, and providing a vision for future apps. Many enterprises have dozens of apps, so schema size can be very large.
The UK financial software vendor Avelo has been using an EDM to coordinate and integrate apps. Avelo was formed by the merger of four predecessor companies, so its apps aligned poorly. The apps have different abstractions, naming approaches and development styles. As a result, it was difficult to construct an EDM.
We limited the scope of Avelo’s EDM to cope with the poor alignment. We started by seeding the EDM via rapid reverse engineering. We browsed each app’s schema to find core concepts – the tables with the most foreign key connections – and used only the top 10. Business experts helped us reconcile the concepts to create a high-level EDM.
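The ranking step described above can be sketched in a few lines of Python. This is only an illustration, not the tooling used at Avelo: the function name `core_tables`, the `(child, parent)` pair format and the sample table names are all assumptions, and a real reverse-engineering pass would read the foreign keys from the database catalog.

```python
from collections import Counter


def core_tables(foreign_keys, top_n=10):
    """Rank tables by the number of foreign-key connections that touch them.

    foreign_keys is an iterable of (child_table, parent_table) pairs
    (a hypothetical input format chosen for this sketch).
    """
    counts = Counter()
    for child, parent in foreign_keys:
        # A connection counts for both ends: the referencing table
        # and the referenced table.
        counts[child] += 1
        counts[parent] += 1
    # Most-connected first; break ties alphabetically so the result is stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [table for table, _ in ranked[:top_n]]


# Made-up foreign keys for a small financial app schema.
sample_fks = [
    ("policy", "client"),
    ("claim", "policy"),
    ("payment", "policy"),
    ("address", "client"),
    ("claim", "client"),
]
print(core_tables(sample_fks, top_n=3))
```

Counting both ends of each foreign key is one possible reading of "most foreign key connections"; counting only incoming references would be an equally reasonable choice.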
Large Data Warehouses
Data warehouses can also involve big schema. A data warehouse combines data from day-to-day operational apps and places it on a common basis for analysis and reporting. A large enterprise can have a great deal of data to analyze, leading to many data warehouse tables.
We can’t do much to restrain the size of a large data warehouse. But by using agile data modeling, we can make sure that payoff occurs incrementally, as the warehouse is constructed.
By Michael Blaha
We recently worked on a large data warehouse encompassing multiple departments that illustrates both good and bad approaches. One department’s staff focused on building their portion of the warehouse and deferring usage. After many months of work, they are still building. Another department chose to build incrementally, according to business demand. This latter approach has been more successful and easier to justify for continued funding.
Scientific Data
Scientific data is a third source of big schema. Scientific apps are extremely complex, involving time series, complex data types, and deep dependencies and constraints. Scientific schema is often not only large, but also difficult to represent.
Many years ago, we worked on the PDXI project sponsored by the AIChE. The purpose of PDXI was to produce a data model to serve as the basis for a data exchange standard for chemical engineering apps. Chemical plants have a wide variety of equipment, complex mixtures of substances and a range of operating conditions, so there is a lot of data to represent. The PDXI model was several hundred pages. This was too much to manage, too much to explain and too much to understand.
In retrospect, we now realize that we should have used more generic data structures. For example, the PDXI model had fifty pages for equipment, such as tanks, reactors, pumps and distillation columns. A better model would have avoided all this detail by combining data and metadata. Then the fine particulars of each kind of equipment could have been specified elsewhere.
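As a rough illustration of combining data and metadata, the sketch below stores the kinds of equipment and their attributes as data rather than as one class per kind. It is not the PDXI model; the class names `AttributeDef`, `EquipmentType` and `Equipment`, the attribute names and the units are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AttributeDef:
    """Metadata: one attribute that a kind of equipment may carry."""
    name: str
    unit: str


@dataclass
class EquipmentType:
    """Metadata: a kind of equipment, such as a tank or a pump."""
    name: str
    attributes: dict = field(default_factory=dict)  # attribute name -> AttributeDef

    def define(self, name, unit):
        self.attributes[name] = AttributeDef(name, unit)


@dataclass
class Equipment:
    """Data: one piece of equipment, described by its type's attributes."""
    tag: str
    kind: EquipmentType
    values: dict = field(default_factory=dict)  # attribute name -> value

    def set(self, name, value):
        # The metadata decides which attributes are legal for this kind.
        if name not in self.kind.attributes:
            raise KeyError(f"{self.kind.name} has no attribute {name!r}")
        self.values[name] = value

    def describe(self):
        return {n: f"{v} {self.kind.attributes[n].unit}" for n, v in self.values.items()}


# The particulars of each kind of equipment are specified as data, not schema.
tank = EquipmentType("tank")
tank.define("volume", "m3")
pump = EquipmentType("pump")
pump.define("flow_rate", "m3/h")

t101 = Equipment("T-101", tank)
t101.set("volume", 50)
print(t101.describe())
```

With this shape, adding a distillation column means adding rows of metadata rather than new pages of schema. The trade-off is that constraints specific to one kind of equipment must be enforced in code or in further metadata.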
So when you build applications, think not only about big data, but also big schema. For where there is big data, there is often big schema. And big schema can even arise by itself.
About the Author:
Michael Blaha is a consultant and trainer who specializes in conceiving, architecting, modeling, designing and tuning databases. He has worked with dozens of organizations around the world. Blaha has authored seven U.S. patents, seven books and many articles. His most recent book is the "UML Database Modeling Workbook." He received his doctorate from Washington University in St. Louis, and is an alumnus of GE Global Research in Schenectady, New York.
Designers can encounter diverse structure of data when building apps.