GriDMan: Data Management for Scientific Applications in a Grid Environment

English title: GriDMan: Data Management for Scientific Applications in a Grid Environment
Applicant: Schuldt Heiko
Number: 132201
Funding instrument: Project funding (Div. I-III)
Research institution: Computer Science unit, Department of Mathematics and Computer Science, University of Basel
University: Universität Basel – BS
Main discipline: Computer Science
Start/End: 01.05.2011 - 31.10.2013
Approved amount: CHF 137'646.00

Keywords (8)

Data Grid; Replication Management; Transaction Management; Scientific Data Management; Replication; Consistency; Data Archiving; Scientific Data

Lay Summary (English)

Current data analysis tools for scientific applications are faced with terabytes or even petabytes of data. Dealing with such amounts of data makes traditional approaches to data management unworkable. Firstly, tools for managing data do not keep up with the amount of data and its distribution. Secondly, many scientific projects are carried out by large groups of scientists distributed across several geographic locations.

The Data Grid integrates distributed data sources into a single virtual resource which provides its users with potentially unlimited storage capacity. Each data source may contain databases, files, web pages, semi-structured and unstructured data, data streams, raw sensor data, or multimedia data. But the Grid goes beyond sharing and distributing data and computing resources: the Data Grid has become a prevalent computing environment for data analysis in scientific applications, and eScience applications can greatly benefit from Grid environments that adopt state-of-the-art database technology.

However, the primary data unit currently used in the Grid is the flat file. Consequently, querying, importing, analyzing, and updating data requires writing programs that operate on these files. This is labor-intensive and requires re-programming whenever the format of a file changes, new sources are added, or a query is slightly modified. In addition, since there is no global control in a Grid environment and providers can withdraw nodes at their sole discretion, guaranteeing a certain level of data availability means replicating data across several nodes. Current Data Grids delegate replication management to their users and provide no built-in support for maintaining a certain degree of replication, for updating replicated data, or for providing data in different versions and levels of freshness.
In contrast to the first Data Grid applications, which mostly dealt with read-only data, novel eScience applications need to handle both read-only and updatable data. These applications therefore require tools that guarantee consistent data replication without a substantial increase in data processing costs. Applications' demands for data freshness should also be honored. Finally, the Data Grid should be able to distribute application load uniformly among many Grid nodes. The GriDMan project focuses on these aspects, in particular on novel solutions for data management in a Grid environment, including dynamic replication and different levels of consistency.
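The interplay between replication and consistency mentioned above can be made concrete with a minimal quorum-replication sketch. This is an illustrative toy, not GriDMan's actual protocol: with N replicas, a write quorum of W nodes and a read quorum of R nodes guarantee that every read overlaps the latest write whenever R + W > N, so a versioned read always returns the most recent committed value.

```python
import random

class QuorumStore:
    """Toy quorum-replicated key-value store (illustration only).

    Each replica maps key -> (version, value). The quorum condition
    R + W > N ensures every read quorum intersects every write quorum,
    so the highest version seen by a reader is the latest write.
    """

    def __init__(self, n, r, w):
        assert r + w > n, "R + W > N is required for strong consistency"
        self.replicas = [dict() for _ in range(n)]
        self.r, self.w = r, w

    def _latest_version(self, key):
        # Any read quorum intersects the last write quorum, so the
        # maximum version across R replicas is the current version.
        chosen = random.sample(self.replicas, self.r)
        return max(rep.get(key, (0, None))[0] for rep in chosen)

    def write(self, key, value):
        # Stamp the new value with the next version and store it on W replicas.
        version = self._latest_version(key) + 1
        for rep in random.sample(self.replicas, self.w):
            rep[key] = (version, value)

    def read(self, key):
        # Contact R replicas and return the value with the highest version.
        chosen = random.sample(self.replicas, self.r)
        _, value = max(
            (rep.get(key, (0, None)) for rep in chosen),
            key=lambda pair: pair[0],
        )
        return value

store = QuorumStore(n=5, r=3, w=3)
store.write("sensor", "v1")
store.write("sensor", "v2")
print(store.read("sensor"))  # v2 (guaranteed: 3 + 3 > 5)
```

Choosing R and W also sets the cost balance: small R makes reads cheap at the expense of writes, and vice versa, which is exactly the kind of trade-off a cost-based consistency protocol can tune at runtime.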
Last updated: 21.02.2013

Responsible applicant and further applicants

Employees

Name Institute

Publications

Publication
Comparison of Eager and Quorum-based Replication in a Cloud Environment
Stiemer Alexander, Fetai Ilir, Schuldt Heiko (2015), Comparison of Eager and Quorum-based Replication in a Cloud Environment, in Proceedings of the 3rd International Workshop on Scalable Cloud Data Management (SCDM'15), Santa Clara, CA, USA. IEEE, Piscataway, NJ, USA.
Workload-Driven Adaptive Data Partitioning and Distribution – The Cumulus Approach
Fetai Ilir, Murezzan Damian, Schuldt Heiko (2015), Workload-Driven Adaptive Data Partitioning and Distribution – The Cumulus Approach, in Proceedings of the 3rd International Workshop on Scalable Cloud Data Management (SCDM'15), Santa Clara, CA, USA. IEEE, Piscataway, NJ, USA.
SO-1SR: Towards a self-optimizing One-Copy Serializability Protocol for Data Management in the Cloud
Fetai Ilir, Schuldt Heiko (2013), SO-1SR: Towards a self-optimizing One-Copy Serializability Protocol for Data Management in the Cloud, in Proceedings of the 5th International CIKM Workshop on Cloud Data Management (CloudDB 2013), San Francisco, CA, USA. ACM, New York, NY, USA.
Cost-Based Data Consistency in a Data-as-a-Service Cloud Environment
Fetai Ilir, Schuldt Heiko (2012), Cost-Based Data Consistency in a Data-as-a-Service Cloud Environment, in Proceedings of the 5th International Conference on Cloud Computing (CLOUD 2012), USA. IEEE, Piscataway, NJ, USA.
Cost-Based Adaptive Concurrency Control in the Cloud
Fetai Ilir, Schuldt Heiko, Cost-Based Adaptive Concurrency Control in the Cloud, Technical Report, Universität Basel.

Awards

Title Year
Amazon (AWS) in Education Research Grant 2013

Associated projects

Number Title Start Funding instrument
150061 ClouDMan: Cost-based Data Management in Cloud Environments 01.11.2013 Project funding (Div. I-III)

Abstract

Current data analysis tools for scientific applications are faced with terabytes or even petabytes (PB) of data. For example, the volume of data in Earth Observation applications was expected to reach 9 PB by 2010 and 14 PB by 2014. Dealing with such amounts of data makes traditional approaches to data management unworkable. Firstly, the data analysis algorithms used by scientists are at least of complexity O(N²) and cannot deal with this massive amount of data in reasonable time. Secondly, tools for managing data do not keep up with the amount of data and its distribution. Thirdly, many scientific projects are carried out by large groups of scientists distributed across several geographic locations.

Grid computing attempts to address all of these problems jointly by providing the tools needed to share large quantities of resources within Virtual Organizations. In particular, Computational Grids focus on sharing CPU cycles, which allows inherently complex data analysis algorithms to be parallelized. Service Grids are dedicated to application services deployed on Grid nodes. Finally, in Data Grids, possibly heterogeneous nodes contribute local storage capacity to support the management of large data volumes within a virtual organization.

The Data Grid integrates distributed data sources into a single virtual resource which provides its users with potentially unlimited storage capacity. Each data source may contain databases, files, web pages, semi-structured and unstructured data, data streams, raw sensor data, or multimedia data. But the Grid goes beyond sharing and distributing data and computing resources: the Data Grid has become a prevalent computing environment for data analysis in scientific applications, and eScience applications can greatly benefit from Grid environments that adopt state-of-the-art database technology. However, the primary data unit currently used in the Grid is the flat file.
Consequently, querying, importing, analyzing, and updating data requires writing programs that operate on these files. This is labor-intensive and requires re-programming whenever the format of a file changes, new sources are added, or a query is slightly modified. In addition, since there is no global control in a Grid environment and providers can withdraw nodes at their sole discretion, guaranteeing a certain level of data availability means replicating data across several nodes. Current Data Grids delegate replication management to their users and provide no built-in support for maintaining a certain degree of replication, for updating replicated data, or for providing data in different versions and levels of freshness.

Thus, Grid applications face two basic problems:

i.) Data Gathering & Integration: A typical eScience application processes extremely large amounts of data and also generates large amounts of data, so the scalability of data gathering is a serious problem. In addition, data sets are heterogeneous and span multiple sites and granularities. Depending on the particular application, different data models and views may be required.

ii.) Data Management: In contrast to the first Data Grid applications, which mostly dealt with read-only data, novel eScience applications need to handle both read-only and updatable data. New eScience applications therefore require tools that guarantee consistent data replication without a substantial increase in data processing costs. Applications' demands for data freshness should also be honored. Finally, the Data Grid should be able to distribute application load uniformly among many Grid nodes.

The GriDMan project focuses on the second aspect: novel solutions for data management in a Grid environment, including dynamic replication and different levels of consistency.
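The idea of serving data "in different versions and levels of freshness" can be sketched with a hypothetical freshness-bounded read (an illustration under assumed names, not GriDMan's actual mechanism): each read request carries a staleness bound, and the system answers from a cheap, possibly stale replica when the replica's copy is recent enough, falling back to the up-to-date primary otherwise.

```python
import time

class Replica:
    """A possibly stale copy: values plus the time each key was last synced."""
    def __init__(self):
        self.data = {}       # key -> value (may lag behind the primary)
        self.synced_at = {}  # key -> timestamp of last sync with the primary

class FreshnessRouter:
    """Route reads by a per-request freshness bound (hypothetical sketch)."""
    def __init__(self, primary, replica):
        self.primary = primary  # dict holding the current values
        self.replica = replica

    def read(self, key, max_staleness):
        # Serve from the replica if its copy is younger than the bound,
        # otherwise pay the higher cost of contacting the primary.
        age = time.time() - self.replica.synced_at.get(key, 0.0)
        if age <= max_staleness:
            return self.replica.data[key], "replica"
        return self.primary[key], "primary"

primary = {"temperature": "21.4"}
replica = Replica()
replica.data["temperature"] = "21.1"           # stale copy
replica.synced_at["temperature"] = time.time() - 10  # synced 10 s ago

router = FreshnessRouter(primary, replica)
print(router.read("temperature", max_staleness=60))  # stale value is acceptable
print(router.read("temperature", max_staleness=5))   # bound exceeded: go to primary
```

A tolerant bound trades freshness for cheaper, load-balanced reads; a tight bound forces current data at higher cost, which is the trade-off a cost-based consistency scheme exposes to applications.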
The former aspect, data gathering and integration, is addressed in a parallel project that has been submitted by Prof. Y. Breitbart and Prof. R. Jin (Kent University, Ohio, USA) to the National Science Foundation in December 2009 (proposal No. 1018558). Although both projects are highly complementary, they have been defined and submitted independently. However, it is planned to run both projects in parallel and in close collaboration of the research groups involved, in order to exploit the synergies that will arise in the best possible way.