\section{Data Management}
\label{data}
% \fixme{(authors: Anders, Brian, Maxim, Mike, Simon)}
\subsection{Definition}
\textbf{Data Management} is any \textit{content-neutral} interaction with the data, i.e. it is the \textit{data flow}
component of the larger domain of \textbf{Workflow Management} (see \ref{workflow_workload}). It addresses issues of data storage
and archival, mechanisms of data access and distribution, and curation, over the full life cycle of the data. In order to remain within
the scope of this document we
concentrate on issues related to data distribution, metadata and catalogs, and do not cover mass storage
in much detail (it is addressed by the \textit{Systems} working group). Likewise, network-specific issues for the most part fall outside of our purview.
% It includes technical solutions, procedures and policies and deals with the full life cycle of the data.
\subsection{Moving Data}
\subsubsection{Modes of Data Transfer and Access}
\label{data_xfer}
In any distributed environment (and most HEP experiments are prime examples) the data are typically stored at multiple locations,
for a variety of reasons, and over their lifetime undergo a series of transmissions over networks, replications and/or deletions, with attendant bookkeeping
in appropriate catalogs. Data networks utilized in research can span \textit{hundreds} of Grid sites across multiple continents.
In HEP, we observe a few distinct modes of moving and accessing data, which can, however, be used in a complementary fashion.
Consider the following:
\begin{description}
\item[Bulk Transfer] In this conceptually simple case, data transport from point A to point B is automated and augmented
with redundancy and verification mechanisms so as to minimize the chances of data loss. Such an implementation may be needed,
for example, to transport ``precious'' data from the detector to the point of permanent storage.
Examples include SPADE (the data transfer system used in Daya Bay) and components of SAM~\cite{SAM} and the File Transfer Service at FNAL.
Similar functionality (as part of a wider feature set) is implemented in the \textit{Globus Online} middleware kit~\cite{globus}.
\item[Managed Replication] In many cases the data management strategy involves creating replicas of certain segments of the data (datasets, blocks, etc.)
at participating Grid sites. Such distribution is done according to a previously developed policy, which may be based on the storage capacities of
the sites, specific processing plans (cf. the concept of \textit{subscription}), resource quotas and any number of other factors. Good examples of this type of system are found in
ATLAS (\textit{Rucio}) and CMS (\textit{PhEDEx}), among other experiments~\cite{rucio_chep13,phedex_chep09}.
\item[``Data in the Grid'' (or Cloud)] In addition to processing data which is local to the processing element (e.g. on the local storage of a Worker Node
on the Grid), it is possible to access data over the network, provided there is enough bandwidth between the remote storage
facility and the processing element. There are many ways to achieve this, for example:
\begin{itemize}
\item using \textit{HTTP} to pull data from a remote location before executing the payload job; this can involve private data servers or public cloud facilities (a minimal sketch of this pattern appears after this list).
\item utilizing \xrootd~\cite{xrootd,xrootd_web} over WAN to federate storage resources and locate and deliver files transparently in a ``just-in-time'' manner.
\item sharing data using middleware like Globus~\cite{globus}.
\end{itemize}
\end{description}
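
As a concrete illustration of the last mode, the following is a minimal sketch (in Python, using only the standard library) of a job wrapper that pulls an input file over HTTP to local scratch space and verifies its checksum before handing it to the payload. The URL, paths and checksum are hypothetical placeholders and are not taken from any of the systems mentioned above.
\begin{verbatim}
import hashlib
import urllib.request

# Hypothetical inputs: a file published by a data server or cloud store,
# and its expected checksum as recorded in a file catalog.
REMOTE_URL = "https://data.example.org/run123/events_0042.root"
LOCAL_PATH = "/scratch/events_0042.root"
EXPECTED_SHA256 = "0123...abcd"   # placeholder

def fetch_and_verify(url, dest, expected_sha256):
    """Pull the file over HTTP and verify its integrity before use."""
    urllib.request.urlretrieve(url, dest)
    digest = hashlib.sha256()
    with open(dest, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    if digest.hexdigest() != expected_sha256:
        raise RuntimeError("checksum mismatch -- refusing to run payload")
    return dest

if __name__ == "__main__":
    input_file = fetch_and_verify(REMOTE_URL, LOCAL_PATH, EXPECTED_SHA256)
    # ... hand input_file to the payload job here ...
\end{verbatim}
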
Choosing the right approaches and technologies is a two-tiered process. First, one needs to identify the most
relevant use cases and match them to categories such as those outlined above (e.g. managed replication vs. data on demand over the network). Second, within
the chosen scenario, proper solutions must be identified (and hopefully reused rather than reimplemented).
\subsubsection{From MONARC to a Flat Universe}
The \textbf{MONARC} architecture is a useful case study, in part because it was used in the LHC Run 1 data processing campaign,
and also because it motivated the evolution of approaches to data management which is currently under way.
It stands for \textit{Models of Networked Analysis at Regional Centers} \cite{monarc}.
At the heart of MONARC is a manifestly hierarchical organization of computing centers in terms of
data flow, with storage and distribution policies defined according to the characteristics and goals of the participating sites. The sites
are classified into ``Tiers'' according to the scale of their resources and their planned functionality: the ``Tier-0'' denomination is
reserved for the central facilities at CERN, ``Tier-1'' corresponds to major regional centers, and ``Tier-2'' describes
sites of smaller scale, configured and used mainly for analysis of the data (they also handle
a large fraction of the Monte Carlo simulation workload). Smaller installations and what are termed ``non-pledged resources''
belong to Tier-3 in this scheme, implying a more \textit{ad hoc} approach to data distribution and to handling of the computational
workload on these sites. The topology of the data flow among the
Tiers resembles a Directed Acyclic Graph (DAG): data undergoes processing steps
and is distributed from Tier-0 to a number of Tier-1 facilities, and then on to Tier-2 sites, but Tiers of the same rank
do not share data on a peer-to-peer (``p2p'') basis; a toy illustration of this policy appears below.
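
To make the hierarchical constraint concrete, here is a toy sketch (in Python) that encodes a MONARC-style tier topology as a set of allowed edges and checks whether a proposed transfer is permitted. It illustrates the policy only, is not any experiment's actual transfer system, and the site names are placeholders; in the ``flat'' model discussed later in this section the check degenerates to allowing essentially any pair of sites.
\begin{verbatim}
# Toy MONARC-style topology: data flows "down" the hierarchy only.
# Site names are illustrative placeholders.
ALLOWED_EDGES = {
    ("T0_CERN", "T1_A"), ("T0_CERN", "T1_B"),   # Tier-0 -> Tier-1
    ("T1_A", "T2_A1"), ("T1_B", "T2_B1"),       # Tier-1 -> its Tier-2s
}

def transfer_allowed(src, dst, flat_model=False):
    """Hierarchical (MONARC) policy vs. a 'flat', peer-to-peer policy."""
    if flat_model:
        return src != dst            # any site may serve any other site
    return (src, dst) in ALLOWED_EDGES

print(transfer_allowed("T1_A", "T1_B"))                   # False: same rank
print(transfer_allowed("T1_A", "T1_B", flat_model=True))  # True: flat model
\end{verbatim}
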
This architecture depends on coordinated operation of two major components:
\begin{itemize}
\item The Data Management System, which includes the databases necessary to maintain records of data parameters and locations,
and which is equipped with automated tools to move data between
computing centers according to chosen data processing and analysis strategies and algorithms.
An important component of the data handling is a subsystem for managing \textit{Metadata}, i.e. information
derived from the actual data which is used to locate specific data segments for processing based on
certain selection criteria.
\item The Workload Management System (WMS) -- see Section~\ref{wms} -- which distributes computational payload in accordance
with optimal resource availability and various applicable policies. It typically also takes
into account data locality in order to minimize network traffic and expedite execution. A mature
and robust WMS also provides efficient and user-friendly monitoring capabilities, which allow
its operators to monitor and troubleshoot workflows executed on the Grid.
\end{itemize}
While there were a variety of factors which motivated this architecture, considerations of overall efficiency, given
limits of storage capacity and network throughput, were the primary drivers in the MONARC model. In particular,
reconstruction, reprocessing and some initial stages of data reduction are typically done at the sites with
ample storage capacity, so as to avoid moving large amounts of data over the network. As such, it can be argued
that the MONARC architecture was ultimately shaped by assumptions about network bandwidth, performance
and reliability which some authors now call ``pessimistic''~\cite[p.~105]{lhc_model_update}.
As LHC computing was maturing, great progress was made in improving the
characteristics of the networks serving the LHC projects. The new generation of networks has lower
latencies, lower cost per unit of bandwidth and higher capacity, for both local and wide
area networks~\cite[p.~104]{lhc_model_update}. This development opens significant new possibilities
which were not available until relatively recently; as stated in \cite{lhc_model_update},
\begin{quote}
The performance of the network has allowed a more flexible model in terms of data access:
\begin{itemize}
\item Removal of the strict hierarchy of data moving down the tiers, and allowing a
more peer-2-peer data access policy (a site can obtain data from more or less any
other site);
\item The introduction of the ability to have remote access to data, either in obtaining
missing files needed by a job from over the WAN, or in some cases actually
streaming data remotely to a job.
\end{itemize}
\end{quote}
In practice, this new model results in a structure which is more ``flat'' and less hierarchical~\cite{lhc_model_update,courier_update} -- and indeed
resembles a ``p2p'' architecture.
In principle, this updated architecture does not necessarily require new networking and data transmission
technologies when compared to MONARC, as it mainly represents a different logic and policies for
distribution of, and access to, data across multiple Grid sites. Still, a number of protocols and systems are more conducive
to implementing this approach than others, for a variety of reasons:
\begin{itemize}
\item Reliance on proven, widely available and low-maintenance tools to actuate data transfer (e.g. utilizing HTTP/WebDAV).
\item Automation of data discovery in the distributed storage.
\item Transparent and automated ``pull'' of required data to local storage.
\end{itemize}
One outstanding example of leveraging the improved networking technology is XRootD -- a system which facilitates \textit{federation} of widely
distributed resources~\cite{xrootd_fed,xrootd_snowmass}. While its use in HEP is widespread, two large-scale applications deserve special mention:
CMS's ``Any Data, Anytime, Anywhere'' (AAA)
project and ATLAS's ``Federating ATLAS storage systems using Xrootd'' (FAX) project, both of which rely
on XRootD as their underlying technology. ``These systems are already giving experiments and
individual users greater flexibility in how and where they run their workflows by making data more globally
available for access. Potential issues with bandwidth can be solved through optimization and prioritization''\cite{xrootd_snowmass}.
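
The ``just-in-time'' access pattern enabled by such federations can be illustrated with a short sketch: a job first looks for its input on site-local storage and, failing that, falls back to a global redirector. The redirector host, paths and tree name below are placeholders, and reading root:// URLs is assumed to be available through a client such as uproot with the XRootD bindings installed; this illustrates the access logic only and is not a description of the actual AAA or FAX configuration.
\begin{verbatim}
import os
import uproot   # assumed available, together with XRootD bindings

LOCAL_PREFIX = "/data/store"                     # site-local storage (placeholder)
REDIRECTOR   = "root://redirector.example.org/"  # federation entry point (placeholder)

def open_input(lfn):
    """Open a file locally if present, otherwise stream it via the federation."""
    local_path = os.path.join(LOCAL_PREFIX, lfn.lstrip("/"))
    if os.path.exists(local_path):
        return uproot.open(local_path)
    # Fall back to remote, "just-in-time" access over the WAN.
    return uproot.open(REDIRECTOR + lfn)

# "Events" is a placeholder tree name.
events = open_input("/store/data/run123/events_0042.root")["Events"]
\end{verbatim}
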
\subsection{Metadata, Data Catalogs and Levels of Data Aggregation}
To be able to locate, navigate and manage the data, it has to be \textit{described}, or characterized. Metadata (data derived from the data) is
therefore a necessary part of data management. The list of possible types of metadata is long; some key ones are:
\begin{itemize}
\item Data Provenance: for raw data, this may include information on when and where it was taken. For processed data,
it may specify which raw data were used. For many kinds of data, it is important to track information about calibrations used, etc.
\item Parentage and Production Information: one must keep track of software releases and their configuration details in each production step,
and be able to trace a piece of data to its origin (e.g. where it was produced and by which Task ID).
% seems to not be aligned with the subsequent definition
%\item Data location i.e. where the files are located - especially since data may be moved around.
%Note that the meaning of this changes in a Federated Storage Model.
\item Physics: this may include analysis summary information or a specific feature characterizing a segment of data, e.g. the type of events selected, the trigger stream from which the data was derived, or the detector configuration.
\item Physical information: this might include the file size, checksum, file name, location, format, etc.
\end{itemize}
Generally speaking, a data catalog combines a file catalog, i.e. information about where the data files are stored,
with additional metadata that may contain a number of attributes (physics, provenance, etc.). This enables the construction of logical (virtual)
data sets like ``WIMPcandidatesLoose'' and makes it possible for users to select a subset of the available data, and/or ``discover'' the presence and locality of data which
is of interest to them.
Grouping data into datasets and even larger aggregation units helps manage the complexity of processing that involves a very large number of individual files.
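
The notion of a logical (virtual) dataset defined by a metadata query can be captured in a few lines. The sketch below keeps the catalog in memory for clarity; a real catalog would hold the same information in a database, but the selection logic is the same, and all file names and attribute values are invented for illustration.
\begin{verbatim}
# A minimal file catalog: each entry combines "physical" metadata
# (location, size) with physics and provenance attributes.
CATALOG = [
    {"lfn": "evt_001.root", "site": "SITE_A", "size": 2_100_000_000,
     "trigger": "WIMP_loose", "release": "v4.2", "n_events": 51234},
    {"lfn": "evt_002.root", "site": "SITE_B", "size": 1_900_000_000,
     "trigger": "WIMP_loose", "release": "v4.3", "n_events": 48012},
    {"lfn": "evt_003.root", "site": "SITE_A", "size": 2_000_000_000,
     "trigger": "calibration", "release": "v4.3", "n_events": 50000},
]

def define_dataset(**selection):
    """A 'virtual' dataset is just a metadata query, evaluated on demand."""
    return [f for f in CATALOG
            if all(f.get(k) == v for k, v in selection.items())]

# "WIMPcandidatesLoose" as a query rather than a fixed list of files:
wimp_loose = define_dataset(trigger="WIMP_loose")
print([f["lfn"] for f in wimp_loose])
print({f["site"] for f in wimp_loose})   # where replicas can be found
\end{verbatim}
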
Some examples of metadata and catalog systems used in practice follow:
\begin{description}
\item[Fermi Data Catalog:] Typically, metadata is created when a file is registered in the database. A slightly
different approach was chosen for the Fermi Space Telescope Data Catalog: in addition to the initial metadata, it has a
data crawler that goes through all registered files and extracts metadata such as the number of events. The advantage is
that the set of metadata can easily be expanded after the fact -- the crawler is simply re-run with a list
of new quantities to extract, which are then added to the existing metadata. Obviously, this only works for
metadata contained in the files and file headers. Note that since the Fermi Data Catalog is independent of any
workflow management system, any data processing metadata has to be added explicitly.
\item[SAM:] (Sequential Access Model) is a data handling system
developed at Fermilab. It is designed to track locations of files
and other file metadata. A small portion of this metadata schema is
reserved for SAM use, and experiments may extend it in order to store
their own quantities associated with any given file. SAM allows the
definition of a \textit{dataset} as a query on this file metadata.
These datasets are then shorthand which can be expanded to
provide input data for experiment processes. Beyond this role as
a file catalog, SAM has additional functionality. It can manage the
storage of files and it can serve an extended role as part of a workflow
management system. It does this through a concept called
\textit{projects} which are managed processes that may deliver files
to SAM for storage management and deliver files from storage elements
to managed processes. SAM maintains state information for files in
active projects to determine which files have been processed, which process
analyzed each file, and which files were consumed by failed processes. The installation footprint required
for SAM to be used at a participating site depends on the
functionality required. Lightweight access to catalog functionality
is provided via the SAM Web Services component through a REST web
interface which includes a Python client module. The full feature set,
including file management, requires a SAM \textit{station}
installation; these exist at a small number of locations.
\item[ATLAS:] Distributed Data Management in ATLAS (often termed \textit{DDM}) has always been one of its focus areas, in part due to the sheer volume of
data being stored, shared and distributed worldwide (on a multi-petabyte scale), and to the importance of optimal data placement for ensuring efficiency
and high throughput of processing~\cite{atlas_ddm_chep12}. As with other major components of its systems, ATLAS has evolved its
data management infrastructure over the years; the system currently in use is Rucio~\cite{rucio_chep13}. We briefly consider the basic
concepts and entities in this system that pertain to this section.
The atomic unit of data in ATLAS is the file. Above that, there are higher levels of data aggregation (sketched schematically after this list), such as:
\begin{itemize}
\item \textit{dataset}: Datasets are the operational unit of replication for DDM, i.e. they may be transferred to grid sites, whereas single files
may not. Datasets in DDM may contain files that are in other datasets, i.e. datasets may overlap.
\item \textit{container}: A container is a collection of datasets. Containers are not units of replication, but they allow
large blocks of data, which could not be replicated to a single site, to be described in the system.
\end{itemize}
There are a few categories of metadata in Rucio:
\begin{itemize}
\item System-defined attributes (e.g. size, checksum, etc.)
\item Physics attributes (such as number of events)
\item Production attributes (parentage)
\item Data management attributes
\end{itemize}
\item[CMS:] CMS also employs the concept of a dataset. Metadata resides in, and is handled by, the ``Data Bookkeeping Service'' (DBS).
This service maintains information regarding the provenance, parentage, physics attributes and other types of metadata necessary for efficient processing.
The Data Aggregation Service (DAS) is another critical component of the CMS Data Management System. It ``provides the ability to query CMS data-services via
a uniform query language without worrying about security policies and differences in underlying data representations''~\cite{phedex_chep13}.
\end{description}
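
The file/dataset/container aggregation described for ATLAS above, where only datasets are units of replication and datasets may overlap, can be sketched generically as follows. This is a toy model of the concepts, not Rucio's actual schema or client API.
\begin{verbatim}
from dataclasses import dataclass, field

@dataclass
class Dataset:
    name: str
    files: set = field(default_factory=set)   # datasets may overlap in files
    replicable: bool = True                   # the unit of replication for DDM

@dataclass
class Container:
    name: str
    datasets: list = field(default_factory=list)
    replicable: bool = False                  # containers only describe data

    def all_files(self):
        return set().union(*[d.files for d in self.datasets])

d1 = Dataset("data15.periodA", {"f1.root", "f2.root"})
d2 = Dataset("data15.periodB", {"f2.root", "f3.root"})   # overlaps with d1
c  = Container("data15.allPeriods", [d1, d2])
print(sorted(c.all_files()))   # ['f1.root', 'f2.root', 'f3.root']
\end{verbatim}
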
\subsection{Small and Medium Scale Experiments}
Small and medium scale research programs often have smaller needs compared to the LHC or other large experiments. In these cases, it is
neither economical nor feasible to deploy and operate the same kind of middleware on the scale described in the previous sections.
Data is often stored in a single, or just a few, geographical locations (``host laboratories''), and
data processing itself is less distributed. However, many experiments today have (or will have) data characterized by volumes
and complexity large enough to demand a real data management system instead of a manual approach (files in Unix directories and wiki pages).
In fact, some of the same elements, e.g. extensive metadata, data catalogs and XRootD, are already used by some smaller
experiments. The main challenge here is the very limited technical expertise and manpower available to develop, adapt and
operate such tools.
With Cloud technology recently becoming more affordable, available and transparent for use in a variety of applications, smaller
scale collaborations are making use of services such as Globus~\cite{globus} to perform automated managed data transfers (cf. Section~\ref{data_xfer}),
implement data sharing and realize the ``Data in the Cloud'' approach. For small and mid-scale projects, platforms like \textit{Google Drive}
and \textit{Dropbox} offer attractive possibilities to share and store data at a reasonable cost, without having to own much storage and
networking equipment or to deploy a complex middleware stack.
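
As an illustration of the managed-transfer use of Globus mentioned above, the following sketch submits a transfer between two endpoints with the Globus Python SDK. It assumes that an access token and the endpoint UUIDs have already been obtained (the OAuth flow is omitted); all names and IDs are placeholders, and the exact calls should be checked against the globus-sdk documentation for the installed version.
\begin{verbatim}
import globus_sdk

# Placeholders: a transfer-scoped token and endpoint UUIDs are assumed.
TOKEN = "..."
SRC_ENDPOINT = "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
DST_ENDPOINT = "11111111-2222-3333-4444-555555555555"

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TOKEN))

# Describe the transfer: sync only changed files and verify checksums.
tdata = globus_sdk.TransferData(tc, SRC_ENDPOINT, DST_ENDPOINT,
                                label="raw data copy to host laboratory",
                                sync_level="checksum",
                                verify_checksum=True)
tdata.add_item("/detector/raw/run123/", "/archive/raw/run123/",
               recursive=True)

task = tc.submit_transfer(tdata)
print("submitted transfer task", task["task_id"])
\end{verbatim}
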
\subsection{Opportunities For Improvement}
\subsubsection{Modularity}
One problem with Data Management systems is that they often tend to become monolithic as more and more functionality is
added (organically) -- see Sec.~\ref{3domains}. While this may make it easier to operate in the short term, it makes it more difficult to maintain over
the long term. In particular, it makes it difficult to react to technical developments and to update parts of the system. It is
therefore critical to make the system as modular as possible and to avoid tight coupling to either specific technologies or
to other parts of the ecosystem, in particular to the Workload Management System. Modularity should therefore be
part of the core design, specifically by separating the Metadata Catalogs from the Data Movement tools, with carefully designed
object models and APIs. This would also make reuse easier to achieve.
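
One way to express this modularity argument is in terms of small, independent interfaces: a metadata/replica catalog and a data mover that know nothing about each other's implementation and are wired together only at the top level. The sketch below is schematic and illustrates the separation of concerns rather than proposing a concrete API.
\begin{verbatim}
from abc import ABC, abstractmethod

class MetadataCatalog(ABC):
    """Knows what the data is and where replicas live; never moves bytes."""
    @abstractmethod
    def find(self, **query): ...      # -> list of logical file names
    @abstractmethod
    def replicas(self, lfn): ...      # -> list of physical URLs

class DataMover(ABC):
    """Moves bytes between URLs; knows nothing about physics metadata."""
    @abstractmethod
    def copy(self, source_url, dest_url): ...

def stage_dataset(catalog: MetadataCatalog, mover: DataMover,
                  dest_prefix, **query):
    """Top-level glue: the only place where the two components meet."""
    for lfn in catalog.find(**query):
        src = catalog.replicas(lfn)[0]          # naive choice of replica
        mover.copy(src, dest_prefix + "/" + lfn)
\end{verbatim}
Either component can then be replaced (a different catalog back end, a different transfer tool) without touching the other, which is precisely the kind of decoupling argued for above.
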
\subsubsection{Smaller Projects}
Smaller experiments have different problems. Most small experiments have entered, or will soon enter, the petabyte era and can no longer use a manual
data management system built and operated by an army of graduate students. They need modern data management tools. However, they have neither the
expertise to adapt LHC-scale tools for their use, nor the technical manpower to operate them. Simplifying and downscaling existing
large-scale tools to minimize the technical expertise and manpower needed to operate them, even at the price of reduced functionality,
may therefore be a good option.
A second option is to take existing medium-scale data handling tools and repackage them for more general use. The problem is,
however, somewhat similar to what is described above. Often these systems have become monolithic, with strong couplings to certain
technologies, and significant work may be necessary to make them modular. This can be difficult to achieve within the limited resources
available and will need dedicated support.
Finally, a number of Cloud solutions have recently become available (and are already used by small to medium size projects), such as Globus~\cite{globus},
Google Drive and Dropbox, among others. They provide much of the necessary functionality for data distribution and sharing, and perhaps
offer an optimal solution at this scale, when combined with a flexible and reusable Metadata system (see the notes on modularity above).
\subsubsection{Federation}
Lastly, the success of Federated Storage built on XRootD shows the importance of good building blocks and how they can be arranged into
larger successful systems.
% -mxp- I don't see how this ties into anything else, plus scalability of Postgres was not ascertained for this application
% The migration of SAM away from an Oracle database on the backend is an example of migration away from a limiting building block.
%\subsection{Common Elements vs Experiment Specific Ones}