Harnessing the Power of Rucio for the DaFab Project: A Leap Towards Advanced Metadata Management

By Dimitris Xenakis and Martin Barisits, July 1, 2024

Introduction

In both scientific research and production environments, managing metadata efficiently is crucial. Metadata serves as the backbone for data discovery, organization, and retrieval, enabling effective data usage across various fields. This is particularly important in areas like Earth Observation (EO), where vast amounts of satellite data need to be processed and analysed to monitor and understand our planet.

The DaFab project, an ambitious initiative, aims to enhance the exploitation of Copernicus data through advanced AI and High-Performance Computing (HPC) technologies. By integrating these technologies, DaFab seeks to improve the timeliness, accuracy, and accessibility of EO data. At the heart of this endeavour lies Rucio, a robust data management system developed by CERN. Rucio’s role is pivotal in achieving key objectives of the project such as creating a unified, searchable catalogue of interlinked EO metadata, improving metadata ingestion and retrieval speeds, and facilitating seamless integration with AI-driven workflows and HPC systems.

What is Rucio? History and Adoption.

Rucio was initially developed by CERN to meet the demanding requirements of the ATLAS experiment at the Large Hadron Collider (LHC). The ATLAS experiment generates petabytes of data that need to be stored, managed, and accessed efficiently across a globally distributed network of computing centres. To address this challenge, CERN created Rucio as an advanced data management system capable of handling the complex needs of large-scale scientific projects.

Since its inception, Rucio has evolved into a flexible and highly scalable system, continually improved to meet the growing demands of data-intensive research and production environments. It has been widely adopted in the scientific community and beyond: it is currently used by numerous high-energy physics experiments, such as ATLAS and CMS at the LHC, and by other large-scale data projects that require robust data management solutions.

Rucio’s Main Architecture Components

Rucio’s architecture is distributed across four main layers: Clients, Server, Core, and Daemons, with additional support from storage resources and transfer tools (Figure 1).

Figure 1: High-level Rucio component overview

The Clients layer includes command-line tools, Python clients, and a JavaScript-based web interface, enabling users to interact with Rucio for tasks like data upload, download, and management.

The Server layer provides authentication, a REST API, and a web UI. It processes incoming queries, forwarding them to the core for action. The server ensures efficient handling of requests and delegates complex tasks to the daemons for asynchronous processing.

The Core layer handles the main system logic, managing components such as accounts, replication rules, data identifiers, metadata, quotas, and scopes. It represents the global state of the system and abstracts all core Rucio concepts.
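To make these concepts more concrete, here is a minimal in-memory sketch of how scopes, data identifiers, metadata, and replication rules relate to each other. It is a toy model and not Rucio code: the Catalogue class, the scope and file names, and the "tier=1" expression are illustrative assumptions. The one detail taken from Rucio itself is that a data identifier is written as scope:name.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DID:
    """A data identifier: a name that is unique within its scope."""
    scope: str
    name: str

    def __str__(self) -> str:
        return f"{self.scope}:{self.name}"


@dataclass
class Catalogue:
    """Toy stand-in for the global state kept by the Core layer (hypothetical)."""
    metadata: dict = field(default_factory=dict)  # DID -> {key: value}
    rules: list = field(default_factory=list)     # (DID, copies, rse_expression)

    def set_metadata(self, did: DID, key: str, value) -> None:
        self.metadata.setdefault(did, {})[key] = value

    def add_rule(self, did: DID, copies: int, rse_expression: str) -> None:
        if copies < 1:
            raise ValueError("a rule must request at least one copy")
        self.rules.append((did, copies, rse_expression))


catalogue = Catalogue()
tile = DID(scope="copernicus", name="S2A_tile_0001.zip")  # hypothetical names
catalogue.set_metadata(tile, "cloud_cover", 12.5)
catalogue.add_rule(tile, copies=2, rse_expression="tier=1")
print(tile, catalogue.metadata[tile])
```

The frozen dataclass makes each identifier hashable, so it can serve directly as a dictionary key for its metadata.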

The Daemons layer manages continuous and asynchronous workflows such as data transfers, rule evaluations, data deletion, consistency checks, dynamic data placement, rebalancing, messaging, and tracing. The daemons ensure that large tasks are processed efficiently in the background, maintaining system performance.
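The general pattern of accepting a request quickly and doing the heavy work in the background can be sketched with Python's standard queue and threading modules. This illustrates the idea only and does not reflect how Rucio's daemons are implemented. The function names and the SITE_A_DISK destination are assumptions made for the example.

```python
import queue
import threading

requests = queue.Queue()  # work handed over by the request-handling side
completed = []            # results recorded by the background worker


def submit_transfer(did: str, destination: str) -> str:
    """Accept a request immediately and defer the actual work."""
    requests.put((did, destination))
    return "queued"


def transfer_daemon() -> None:
    """Process queued work in the background until a None sentinel arrives."""
    while True:
        item = requests.get()
        if item is None:
            break
        did, destination = item
        completed.append(f"{did} -> {destination}")  # stand-in for a real transfer


worker = threading.Thread(target=transfer_daemon)
worker.start()
status = submit_transfer("copernicus:S2A_tile_0001.zip", "SITE_A_DISK")
requests.put(None)  # ask the worker to stop once earlier items are handled
worker.join()
print(status, completed)
```

The caller gets its answer ("queued") without waiting for the transfer, which is the property that keeps a request-handling layer responsive under load.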

Storage and Transfer Tools: These components manage interactions with various storage systems and transfer services. Rucio Storage Elements (RSEs) abstract the complexities of distributed storage, while transfer tools provide interfaces for submitting, querying, and cancelling data transfers.
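One way to picture the storage abstraction is as a mapping from a logical scope:name pair to a physical location that differs per endpoint. The sketch below is a simplified assumption and not Rucio's actual path algorithm or RSE configuration. The StorageElement class, host names, and prefixes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageElement:
    """Hypothetical description of one storage endpoint."""
    name: str
    scheme: str
    host: str
    prefix: str

    def physical_url(self, scope: str, name: str) -> str:
        """Map a logical scope:name pair to a location on this endpoint."""
        return f"{self.scheme}://{self.host}{self.prefix}/{scope}/{name}"


site_a = StorageElement("SITE_A_DISK", "https", "storage-a.example.org", "/eo")
site_b = StorageElement("SITE_B_TAPE", "root", "storage-b.example.org", "/archive")

# The same logical file resolves to a different physical URL on each endpoint.
for rse in (site_a, site_b):
    print(rse.name, rse.physical_url("copernicus", "S2A_tile_0001.zip"))
```

Users and workflows only ever refer to the logical name, while protocol and path details stay inside the storage description.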

Functionality, Enhancements, and the Road Ahead for Rucio in DaFab

While Rucio excels in scientific data management, the specific needs of Earth Observation (EO) metadata require tailored improvements. A key goal is to develop a unified, searchable catalogue of interlinked EO metadata, facilitating powerful and intuitive data searches. This involves enhancing Rucio’s metadata management with a semantic layer to support sophisticated queries and meaningful metadata relationships.
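As a rough illustration of what a metadata search could look like, the sketch below filters a small in-memory catalogue by key, operator, and value. It is not the DaFab design or Rucio's query interface. The search function, the metadata keys, and the catalogue entries are assumptions chosen for the example.

```python
import operator

# Hypothetical catalogue entries; keys and values are illustrative only.
CATALOGUE = {
    "copernicus:S2A_tile_0001": {"mission": "Sentinel-2", "cloud_cover": 12.5, "region": "Attica"},
    "copernicus:S2A_tile_0002": {"mission": "Sentinel-2", "cloud_cover": 64.0, "region": "Attica"},
    "copernicus:S1A_scene_0001": {"mission": "Sentinel-1", "region": "Thessaly"},
}

OPERATORS = {"==": operator.eq, "<": operator.lt, "<=": operator.le,
             ">": operator.gt, ">=": operator.ge}


def search(catalogue, conditions):
    """Return the sorted DIDs whose metadata satisfies every (key, op, value) condition."""
    matches = []
    for did, meta in catalogue.items():
        # Entries that lack a key never match a condition on that key.
        if all(key in meta and OPERATORS[op](meta[key], value)
               for key, op, value in conditions):
            matches.append(did)
    return sorted(matches)


clear_attica = search(CATALOGUE, [("region", "==", "Attica"), ("cloud_cover", "<", 20)])
print(clear_attica)
```

A semantic layer would go further than this flat filter, for example by following links between related products, but the basic shape of a query (conditions in, matching identifiers out) stays the same.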

Additionally, seamless integration with AI-driven metadata extraction processes and HPC workflows is crucial. To achieve this, SKIM will manage heterogeneous data and generate enriched metadata, which Rucio will store and query. DASI will provide in-HPC and in-Cloud access to data within workflows driven by domain-specific metadata, while FORTH will design a multi-site workflow orchestration system, enabling workflows to run efficiently across cloud and HPC environments. These advancements will allow users to extract more value from EO data, making it easier to discover and utilize relevant information, ultimately enhancing the effectiveness of the DaFab project.

Conclusion: A New Horizon for Earth Observation Data

By leveraging Rucio’s enriched metadata capabilities, DaFab is set to transform how we understand and interact with Earth Observation (EO) data. This transformation will enable real-time insights and informed decision-making in areas like flood monitoring and crop yield prediction. As DaFab evolves, it promises to set new standards in data interoperability, scalability, and usability. The future of EO data management is here, unlocking the full potential of our planet’s data. Stay tuned for more exciting developments.

Dimitris Xenakis, Martin Barisits
European Organization for Nuclear Research (CERN)
Rucio - Scientific Data Management