The goals of the DFG priority program on Scalable Data Management for Future Hardware are based on the observation that data management architectures will undergo a radical shift in the coming years. This shift is driven by two developments: on the one hand, the range of applications that need to handle large data sets has broadened significantly; on the other hand, new trends in hardware and at the operating system level offer great opportunities for rethinking current system architectures.
Over the past thirty years, database management systems have been established as one of the most successful software concepts. In today's business environment they constitute the centerpiece of almost all critical IT systems. The reasons for this success are manifold. On the one hand, such systems provide abstractions that hide the details of the underlying hardware and operating system layers. This covers the existence of a memory hierarchy, memory organization, data representation, and efficient data access for multiple users or application developers. On the other hand, database management systems are ACID compliant, which enables them to represent an accurate picture of a real-world scenario and ensures the correctness of the managed data even in extreme cases (e.g., a high number of concurrent database operations or possible system failures).
Today, the application of database systems has moved beyond pure transaction-oriented scenarios. Instead, they are increasingly used as data integration platforms that realize a unified access model (perhaps limited to read operations) for heterogeneous or even distributed data. In addition, database technology in a broader sense is exploited in purely analytical applications (e.g., building models for data mining tasks such as classification, clustering, and recommendation). These analyses are based on clickstream data or on experimental results in the scientific environment (e.g., protein analyses in microbiology or galaxy detection in astrophysical research projects). For such applications, the transaction-oriented architecture of classical database systems often proves too rigid and not scalable to the required extent. This led to the development of NoSQL database systems and the MapReduce paradigm. During the ensuing diversification of data management solutions, some of the well-established functionalities of classical database systems fell by the wayside: for instance, in eventually consistent systems, consistency has to be realized at the application level (e.g., using versioning); in some NoSQL systems, even join operations need to be re-implemented at the application level.
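To illustrate what re-implementing a join at the application level can look like, the following is a minimal sketch of an in-memory hash join over records fetched from a store that offers no join operator. The function name `hash_join` and the record layout (lists of dictionaries) are assumptions chosen for illustration, not part of any particular NoSQL system.

```python
from collections import defaultdict


def hash_join(left, right, key):
    """Inner-join two lists of dict records on `key` (hypothetical helper)."""
    # Build phase: index the right-hand records by their join key.
    index = defaultdict(list)
    for row in right:
        if key in row:
            index[row[key]].append(row)

    # Probe phase: look up each left-hand record in the index.
    joined = []
    for lrow in left:
        if key not in lrow:
            continue  # records without the join key cannot match
        for rrow in index.get(lrow[key], []):
            # On field-name collisions the left-hand value wins.
            joined.append({**rrow, **lrow})
    return joined
```

A classical database system would choose and execute such a join strategy itself; here the application has to write, tune, and maintain it.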
Consequently, the vision of this priority program for future database systems is to loosen or even shed the tight corset imposed by current assumptions about the required level of abstraction and the available hardware, and to replace it with a more flexible architectural approach. This requires
- to provide additional data types for novel applications. Data management solutions need, for example, to support matrix operations (and with them linear algebra operators) or graph operations for the analysis of large networks.
- to provide different levels of abstraction for data management services and, with them, support for “unhinging” certain functionalities.
- to enable the exploitation of current and future hardware and, more generally, of system-level services. For example, it may become necessary to relocate certain database operators into specialized hardware chips to achieve optimal system throughput. As another example, the scheduling of hardware resources may be improved by a tighter integration with operating system services.
- to enable a tighter interweaving between application code and the runtime system. For instance, the efficient processing of complex data mining and statistical algorithms requires a “push-down” of application code towards the data level. This, in turn, requires massive parallelization of the code.
- to provide processing mechanisms particularly tailored to the target application. In many scenarios, an “update in place” strategy may, for instance, be replaced by a more efficient “append only” strategy. Similarly, for many scenarios it may be sufficient to rely on BASE criteria instead of ACID to ensure correctness. The resulting degrees of freedom can then be exploited for more flexibility at the architecture level.
- to consider “schema on read” as an additional design pattern. A significant amount of data cannot be typed before proper data cleaning and data transformation, which are typically handled inside the database system. Often this step of schema definition is deliberately delayed until query time.
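Two of the requirements above, “append only” storage and “schema on read”, can be sketched together in a few lines. The class `AppendOnlyStore`, its methods, and the schema representation (a mapping from field name to a casting function) are hypothetical names introduced for illustration only; a real system would add persistence, indexing, and error handling for values that fail to cast.

```python
import json


class AppendOnlyStore:
    """Toy append-only store with schema-on-read (illustrative sketch)."""

    def __init__(self):
        # Raw JSON lines; entries are only ever appended, never overwritten.
        self._log = []

    def put(self, key, raw_fields):
        # An "update" is just a new version appended to the end of the log.
        self._log.append(json.dumps({"key": key, "fields": raw_fields}))

    def history(self, key):
        # All stored versions of a key, oldest first, still untyped.
        records = (json.loads(line) for line in self._log)
        return [r["fields"] for r in records if r["key"] == key]

    def read(self, key, schema):
        # Schema on read: types are applied only now, at query time.
        # `schema` maps field names to casting functions, e.g. {"temp": float}.
        versions = self.history(key)
        if not versions:
            return None
        latest = versions[-1]
        return {name: cast(latest[name])
                for name, cast in schema.items() if name in latest}
```

Because old versions are never overwritten, the same raw data can later be read under a different schema, which is one of the degrees of freedom such a design opens up.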
In summary, this priority program will radically break with previous concepts to create new common architectural concepts for data management services on current and future hardware platforms, and will explore and evaluate these concepts for new fields of application. Whereas correctness was regarded as the leitmotif for designing database systems over the past thirty years, in the coming years this process will be dominated by variability and scalability. The broad field of applications brings distinctly varying requirements and, to a certain extent, significantly different domain-specific languages (DSLs). Combined with the diversity of modern hardware and system-level services (such as operating-system-level replication, remote direct memory access, and persistent RAM), this calls for novel architectural designs with a focus on variability, investigated from both top-down and bottom-up perspectives.