

A View on Meta Data

Lassi A. Tuura

April 28, 1997

!!! Under Construction !!!

Abstract: This note discusses some ideas related to meta data: what it is, why one should have such a thing, and finally how meta data could be exploited in an LHC physics experiment. To a certain degree this note is a response to [1], attempting to argue that there are better alternatives. However, this note also takes a stab at several other interesting---and challenging---data management related topics such as data mining and object inspection.

Table of Contents

1   Introduction
2   What is meta data?
3   Why should we have meta data?
4   How would it be used?
4.1   Reconstruction
4.2   Data mining
4.3   Query optimisation
4.4   Data inspection
5   Meta data design
5.1   Do we need a meta language?
5.2   Design assumptions
5.3   Notation
5.4   Catalog and parts
5.5   Detector description
5.6   Other parts
6   Conclusions
7   References

1   Introduction

To be able to digest this note, the reader is expected to be familiar with two key concepts: objects and databases. Virtually all of this note is somehow related to one or the other, or to the combination of both. Hence, when terms such as ``data'' are mentioned below, they usually refer to data implemented as objects, not just raw numerical values, potentially structured in some way. Similarly, a database is not considered to be just a huge bit bucket but a persistent store for objects.{1}

2   What is meta data?

ATLAS will have massive amounts of data: objects describing the detector response, reconstructed physics objects and analysis results. To a certain degree these objects are self-describing, and we can perform most if not all our calculations based on them. However, we can do better than that: if we arrange to describe lower level objects in a suitable manner, we can both automate certain tasks and at the same time raise the level of abstraction to perform operations closer to the real problem domain, the physics research. This ``suitable description'' would be the meta data, in other words data describing data.

Viewed from a different angle, meta modeling attempts to capture the application domain knowledge above and beyond the actual objects we have. Some of this knowledge is difficult to represent in a form comprehensible to computers or to the software we write. Some quite significant portion can, however, be cast to a form that can benefit us. To a large extent it is up to us to decide what to model and how. The choice whether to model some particular aspect must be evaluated against the cost of representing that knowledge throughout the application domain.

As mentioned in [1], the meta data enables us to automate a number of tasks. It is important not to underestimate the value of this, especially taken against the context of the previous paragraph. Namely, the application domain knowledge is not in general centralised to any particular location in the application---the knowledge can be implicit in the way we design the applications, or explicitly written out as coding patterns.{2} A side effect of automating tasks is that the knowledge associated with those tasks becomes more centralised; it is no longer spread throughout the application, nor present only in more abstract and implicit forms. Modifying the parameters of the automated tasks is relatively easy since the application need not be touched at all---one only needs to modify the meta data and the behaviour of the application changes.

In summary, we may say that by exploiting meta data we will be able to understand our real problem and the data better.

3   Why should we have meta data?

By far the biggest drive for meta data should always come from the need to model the application domain. For trivial and medium size projects this rarely pays off due to the costs of extracting and representing the knowledge. The application at hand, the ATLAS software, is large enough that meta data is not only useful, but it can be argued even to be crucial.

To give some substance to these claims, one should consider the following.

4   How would it be used?

In this section we look in more detail at how meta data could be used in practice. The list of scenarios is not, of course, exhaustive.

4.1   Reconstruction

Consider the case of reconstruction of some particular subdetector. Data may be available in more detail than some reconstruction algorithm would really care about. For instance, reconstruction in the muon chambers might not want to see hits associated to wires, but to the chambers containing the wires. This mapping could be done using the meta data.

To be a bit more concrete, suppose the raw data in an event would be stored as a tree data structure (exactly what the data structure is is irrelevant for this example). Each object in the event structure would carry an unambiguous identifier with it; the purpose of the identifier is to describe the exact detector element that originated the data. Now, to request detector data for reconstruction, one would build another structure, presumably convenient for the reconstruction algorithm at hand. Objects in the structure would have identifiers just like in the event. This custom structure would then be passed to the event in a message telling it to populate the data structure with the event data. Objects in the event data would be mapped to those in the reconstruction structure by using the identifiers and the detector description---a use of meta data. Note that the object mapping cannot be one-to-one because of the scenario described above: several ``more detailed'' identifiers (wires in a muon chamber) could map to a ``more generic'' identifier in the custom data structure (muon chamber).
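The mapping just described can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration only: the identifier strings, the PARENT table standing in for the detector description, and the function names are hypothetical and are not part of any ATLAS design.

```python
# Hypothetical detector description: each element identifier maps to its parent.
PARENT = {
    "muon/chamber1/wire1": "muon/chamber1",
    "muon/chamber1/wire2": "muon/chamber1",
    "muon/chamber2/wire1": "muon/chamber2",
    "muon/chamber1": "muon",
    "muon/chamber2": "muon",
}


def ancestor_in(identifier, wanted):
    """Walk up the detector description until an identifier in `wanted` is found."""
    current = identifier
    while current is not None:
        if current in wanted:
            return current
        current = PARENT.get(current)
    return None


def populate(structure, event_hits):
    """Fill the custom structure (identifier -> list of values) with event data."""
    for identifier, value in event_hits:
        target = ancestor_in(identifier, structure)
        if target is not None:
            structure[target].append(value)
    return structure


# Raw data carries wire-level identifiers; reconstruction asks for chambers.
event_hits = [
    ("muon/chamber1/wire1", 0.7),
    ("muon/chamber1/wire2", 1.2),
    ("muon/chamber2/wire1", 0.4),
]
chambers = populate({"muon/chamber1": [], "muon/chamber2": []}, event_hits)
```

The two wire hits of the first chamber end up in the same entry of the custom structure, which is the many-to-one mapping mentioned above.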

4.2   Data mining

Data mining is currently a buzz-word that is used to describe virtually all sorts of database accesses. However, the physics analysis performed in an LHC experiment is perhaps as true to the original meaning of the term as possible: on the one hand trying to find details scattered in massive amounts of data, and on the other hand trying to generalise trends and other, mostly unknown, valuable information from the data.

Physics analysis activities such as selecting events based on some qualifications are not perhaps purely ``data mining'' operations, but they do include similar aspects. Whether these operations involve meta data in one form or the other is also open to discussion. However, it seems that at least one important area, query optimisation, could benefit from it; the next subsection is devoted to that topic.

It is difficult to say at this point how useful it would be to have true data mining capabilities. For instance, one might employ agent-like technologies that would roam the database trying to look for various kinds of patterns and then to generalise them. The results and even the state of these agents would be stored as meta data. The meta data could even include information that would somehow explain to the agents what they should look for, although it may be easier to just construct specialised agents for each particular purpose, in essence incorporating the application domain knowledge into the design of the agent.

A more conservative use of data mining would be an application of content based searching (see also [2]). It is somewhat different from the traditional key based searching, where queries are made based on values of some predefined keys such as track momentum; the most important keys are usually indexed. However, a content based query relies on the computer to examine the objects and to report on their attributes. This is convenient if it is impossible to formulate a query based on the object itself. For example, when searching in an image database it is difficult if not impossible to search exactly for any particular image. A content based query can help by allowing users to search based on texture, color or shape measures. In the case of ATLAS software, it is easy to imagine that once the physicist has found an interesting event, she might issue a query for similar events (for some definition of similarity). Since the user is hardly interested in events with exactly the same data, this calls for a content based query.{5} The similarity measures for content based queries would be defined in meta data, and could include structural similarities (numbers and types of tracks), similar responses in certain subdetectors (using higher level objects such as clusters or using raw data and autocorrelation or Fourier transforms to achieve locality and shape independence) or even learned similarities (taught by self organising neural networks).
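As a rough illustration of the simplest of these measures, structural similarity, the sketch below compares events by the number of tracks of each type. The event layout (a dictionary with a "tracks" list), the track type names and the use of a Euclidean distance are all assumptions chosen for the example; a real measure would be defined in the meta data.

```python
import math
from collections import Counter


def structural_features(event):
    """Count the tracks of each type in an event (hypothetical event layout)."""
    return Counter(track["type"] for track in event["tracks"])


def distance(a, b):
    """Euclidean distance between the track-type counts of two events."""
    fa, fb = structural_features(a), structural_features(b)
    keys = set(fa) | set(fb)
    # A Counter returns 0 for a missing key, so absent track types count as zero.
    return math.sqrt(sum((fa[k] - fb[k]) ** 2 for k in keys))


def most_similar(reference, candidates, limit=1):
    """Return the `limit` candidates closest to the reference event."""
    return sorted(candidates, key=lambda event: distance(reference, event))[:limit]


def make_event(name, **counts):
    """Helper building a toy event with the given number of tracks per type."""
    tracks = [{"type": kind} for kind, n in counts.items() for _ in range(n)]
    return {"name": name, "tracks": tracks}


reference = make_event("interesting", muon=2, electron=1)
candidates = [
    make_event("a", jet=5),
    make_event("b", muon=2, electron=1),
    make_event("c", muon=1, electron=1),
]
best = most_similar(reference, candidates, limit=2)
```

Here the query returns events "b" and "c", in that order, although neither was found through a predefined key.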

Certainly, more research is necessary in this area. It seems plausible that there should be co-operation between the physicists and the database experts: the physicists do have the application domain knowledge, but in general not a very good idea of what the state of the art in large-scale data mining is.

4.3   Query optimisation

One important part of data mining is that of query optimisation. It is highly important since the way the queries are carried out can have a big impact on the performance of both a single client and the entire distributed system.

One way to optimise the queries is to use the database's indexing capabilities, but it seems likely that this will not be sufficient alone. One would probably add classification attributes to events or parts thereof, and build indices based on those. For instance, one could classify events according to the physics channels they may be interesting for.{6} This still leaves some difficulties, including that of ordering the selection attributes. If the presence of some attribute severely cuts down the amount of result data, it is reasonable to apply the criterion on that attribute first, and then refine using other attributes. The database itself may not be sufficiently informed to make this choice, and meta data may be used to provide the additional guidance.
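The ordering of selection attributes can be sketched as follows. The SELECTIVITY table plays the role of the meta data; its attribute names and the fractions in it are invented for the example, as are the event fields.

```python
# Hypothetical meta data: estimated fraction of events passing each criterion.
SELECTIVITY = {"has_muon_pair": 0.01, "high_pt": 0.2, "good_run": 0.9}


def plan(criteria):
    """Order criteria so that the one cutting most of the data is applied first."""
    return sorted(criteria, key=lambda name: SELECTIVITY.get(name, 1.0))


def select(events, predicates):
    """Apply the predicates (name -> function) in the planned order."""
    result = events
    for name in plan(predicates):
        result = [event for event in result if predicates[name](event)]
    return result


predicates = {
    "good_run": lambda e: e["run_ok"],
    "high_pt": lambda e: e["pt"] > 20.0,
    "has_muon_pair": lambda e: e["muons"] >= 2,
}
events = [
    {"id": 1, "muons": 2, "pt": 45.0, "run_ok": True},
    {"id": 2, "muons": 0, "pt": 60.0, "run_ok": True},
    {"id": 3, "muons": 2, "pt": 5.0, "run_ok": True},
]
order = plan(predicates)
selected = select(events, predicates)
```

The final selection is the same whatever the order; only the amount of data each later criterion has to look at changes.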

How users actually access the data is a big factor when the database and application architectures are designed for optimal search performance. We may have some ad hoc guesstimates of ``typical physicist behaviour'' that may or may not be valid. Certainly we do have knowledge of how things were done in the past, but it is not obvious this will apply in the future---in particular because the FORTRAN-based systems are so different from the object oriented systems that are being designed now.

A typical way to analyse usage patterns is to instrument the physics analysis programs to collect information on the selections and searches made by the users, gather this information together, draw conclusions and go and modify the system's source code or meta data to optimise for the most common uses. Again, it is not obvious that this yields unbiased results (all analysis programs should be instrumented, which is unlikely to happen), nor that the results will reflect both short and long term usage patterns.

An alternative to the above approach is to design a system that from day one is able to adapt to the behaviour of the researcher community.{7} The difference from the above scenario is that the feedback to the running system would be mostly automatic, instead of a slow collect-analyse-implement cycle. Moreover, the clients (the physics analysis applications) could be learning both in short and long term. An example of the former would be adaptation to a particular job, while an example of the latter would be adaptation for the entire community or parts thereof---for instance by reclustering some significant portion of the event data, by repartitioning the database federation, or by moving data to a regional centre or to an institute-local server for faster access.

Since most of the analysis work is distributed around the world, a learning system requires some central repository for the knowledge gathered over time. The information could be collected to some designated part of the meta data. One could then assign a continuously learning system to look for and generalise patterns in this information and to instruct individual clients to assume different behaviours. Of course, in some cases it will be necessary for end users to retain the ultimate veto on what behaviour a client obeys.
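A very small sketch of the collecting side of such a system is given below. The class name UsageProfile, its methods and the attribute names are hypothetical; the sketch only counts which attributes queries select on and reports the most used ones, leaving open what a learning system would then do with that information.

```python
from collections import Counter


class UsageProfile:
    """Hypothetical part of the meta data collecting which attributes are queried."""

    def __init__(self):
        self.counts = Counter()

    def record(self, attributes):
        """Called by a client each time a query selects on the given attributes."""
        self.counts.update(attributes)

    def hot_attributes(self, n):
        """The n most frequently used attributes, candidates for indexing."""
        return [name for name, _ in self.counts.most_common(n)]


profile = UsageProfile()
profile.record(["track_momentum", "channel"])
profile.record(["track_momentum"])
profile.record(["track_momentum", "vertex_z", "channel"])
hot = profile.hot_attributes(2)
```

The end user's veto mentioned above would sit between such a report and any automatic change of client behaviour.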

4.4   Data inspection

Interestingly enough, the inspection and modification of individual objects or groups of objects is one area where meta data can effectively simplify application design. This thinking starts from the point of view that it is very undesirable to restrict ``viewable'' objects to a small set of predefined categories or to hard-code the ways in which they can be viewed. In other words, in the extreme case{8} it should be possible to inspect any persistent object with an object browser that automatically adapts to the object it is viewing.

This train of thought leads us to the concept of universal browsers, similar to the design of the latest versions of Netscape and Microsoft Web browsers (Constellation and Internet Explorer 4.0/Windows 97 desktop, respectively). That is, there is a universal data browser that can be used to manipulate virtually any kind of an object. However, the universal browser itself knows nothing about how to handle these objects---it just provides the basic infrastructure. What happens is that for each content type (object type) there are one or more registered handlers, which are activated by and in the universal browser.

For instance, suppose that there is a graphical ATLAS data manipulation interface, based on some component technology (e.g., CORBA, Java Beans or some home cooked simpler model). Now, selecting a track object one could launch an object examiner that showed the track's parameters in one of those property viewers that are now so common. Or, selecting a group of tracks one could shoot up a viewer that presented a histogram of some of their attributes. Even large-scale visualisers, such as event display, could be just yet another object browser. There could also be an event visualiser that just showed the hierarchical structure of an event, and would shoot up other object browsers when the user selects nodes in the tree, just as Windows Explorer is used to examine normal directory trees. The possibilities are virtually unlimited, yet the underlying design is very simple.

So, where does the meta data step in? First, it is necessary to have a registry storing information for each object type that has viewers or editors available. In essence this can also be seen as an extension of the database schema. The other necessary information is where these viewers and editors can be found. It could be at the very minimum the path of the object browser component implementation. In a more exciting scenario, the object browser could be a Java applet or a Java Bean and it could be stored directly in the database in the bytecode form and instantiated from there on request.
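The registry part of this can be sketched in a few lines. The names ViewerRegistry, Track, MuonTrack and property_sheet are made up for the example, and the viewers are plain functions returning text instead of real graphical components.

```python
class ViewerRegistry:
    """Hypothetical type registry mapping object types to viewer components."""

    def __init__(self):
        self._handlers = {}

    def register(self, object_type, viewer):
        """Register one more viewer for the given object type."""
        self._handlers.setdefault(object_type, []).append(viewer)

    def viewers_for(self, obj):
        """Viewers for the most specific registered type of the object."""
        # Walk the class hierarchy so a subclass can fall back to viewers
        # registered for one of its base classes.
        for cls in type(obj).__mro__:
            if cls in self._handlers:
                return list(self._handlers[cls])
        return []


class Track:
    def __init__(self, pt):
        self.pt = pt


class MuonTrack(Track):
    pass


def property_sheet(track):
    """Stand-in for a property viewer: renders the track parameters as text."""
    return f"Track pt={track.pt}"


registry = ViewerRegistry()
registry.register(Track, property_sheet)

muon = MuonTrack(12.5)
rendered = [viewer(muon) for viewer in registry.viewers_for(muon)]
```

The browser itself only calls viewers_for; it never needs to know what a track is.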

Features similar to what has just been described are not yet bundled with object databases; they would have to be built in the application domain. However, it is not quite so unreasonable to expect that some degree of support in that direction will exist in the near future. For instance, Figure 1 below (extracted from April 1997 BYTE, page 116), shows a data browser from Computer Associates' OODBMS, Jasmine. If you substitute nice dresses for LHC events, and a property sheet editor for the bitmap, you are not far from what was described above.


Figure 1: Jasmine Application Development Environment

Another use of meta data in data inspection is as a track-keeping mechanism for data. The problem is that at some point some data will be removed. Since individual physicists might have some analysis result objects still pointing to the removed data, this effectively leaves dangling pointers that cannot be traversed.{9} One idea is to use meta data to record what has been removed so that users' programs can recover from disappeared data---or better yet, check in advance whether the data is still available.
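A minimal sketch of such a track-keeping mechanism follows. The class RemovalLog, the function resolve and the object identifiers are hypothetical, and an ordinary dictionary stands in for the database.

```python
class RemovalLog:
    """Hypothetical meta data part recording which objects have been removed."""

    def __init__(self):
        self._removed = {}

    def record_removal(self, object_id, reason):
        self._removed[object_id] = reason

    def is_available(self, object_id):
        return object_id not in self._removed

    def reason(self, object_id):
        """Why the object was removed, or None if it has not been removed."""
        return self._removed.get(object_id)


def resolve(store, log, object_id):
    """Check the log before following a reference, instead of failing on it."""
    if not log.is_available(object_id):
        return None
    return store.get(object_id)


store = {"event-17": {"tracks": 12}}
log = RemovalLog()
log.record_removal("event-16", "raw data purged")

present = resolve(store, log, "event-17")
missing = resolve(store, log, "event-16")
```

A user's program can thus find out in advance that "event-16" is gone, and why, without ever traversing the dangling reference.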

5   Meta data design

5.1   Do we need a meta language?

C. Arnault et al. have raised in [1] the question of using a custom language to describe not only meta data, but also data structures and object schemas. One may notice that so far this note has not discussed the possibility of a language at all. Shortly put, the author considers a meta language unnecessary and even detrimental to the software development. At best a language could serve as an irrelevant implementation detail. The long story follows.

Let us first tackle the question of object schema and data structure description. Firstly, we note that for the database the schema must be described using ODL, the Object Definition Language, in any case. Basically ODL defines the interface and storage layout of an object class, plus any associations to other persistent classes. What is left undefined is the implementation of the object behaviour and how the associations are used to establish connections between individual objects. Since it is undesirable that a custom language would unduly restrict design choices, the language should be a superset of ODL---a target which the current proposal does not even attempt to reach, not to mention that it does not even describe objects, just raw data with some structuring.

If ODL or an ODL-like language is used to specify the schema of individual objects, that leaves only data structures as the extension capability of the custom language. However, anyone with any reasonable amount of programming experience must realise immediately that it is not sufficient to just describe data structures---they must also be used. Typically data structures are inherent in the application design and cannot be changed easily: it is the relationships between individual objects that collaboratively establish a data structure. Hence, it would be necessary to modify numerous object relationships, and therefore modify the code that creates, traverses, or assumes properties of these relationships. There are three ways to address this:

  1. embed the entire application into a language that automatically manages complex relationships (such languages do exist, for example BEEF which is a derivative of CLOS, the Common Lisp Object System)
  2. develop a class library that models these concepts
  3. forget about automatic relationship management and do it explicitly in the source code.

The last choice seems completely reasonable---that is, after all, what the large majority of software projects do. The second alternative implies that no custom language is necessary, since the class library already allows one to express the relationships directly. The first choice seems quite unattractive, in particular since ATLAS would be stuck for several decades with a programming language nobody else in the world is interested in.

Returning to the other aspects of meta data, it is noted that there is no particular reason why they should be implemented using a custom language. More precisely, there is no need for yet another formal specification, since there are very obvious, very widely known and understood formal specification methods for object oriented systems: the object modeling techniques such as OMT, Booch and UML. Moreover, these systems are far more descriptive and enlightening than what the proposed language can ever be. Thus, the formal specification of the meta data objects should be nothing more or less than their interface definitions in an object model.

Hence, starting from the assumption that an object oriented system should be defined first and foremost by an object oriented design model, we simply reach the conclusion that there is no need for a custom language. For instance, in the RD45 group there is experience in how automatic clustering information can be defined using fairly simple objects.

Taking this point further, it should be obvious that a meta model based on objects is independent of any concrete method used to view or modify it. That is, it is the meta data itself that should drive its design, not external and usually very transient access methods such as textual descriptions, graphical user interfaces and what not. Trying to design the meta model around what some particular tool can do today is about the worst possible design choice.

All that said, it can be acknowledged there might be some use for a language to load the meta data into the database. The syntax of this language is hardly cornerstone technology, and it is possible that some existing language or notation (such as OIF,{10} C++ or Java) will fit the bill just fine. Hence, the design of the loading mechanism should be irrelevant at this point. It is better to start with the actual object design, which is exactly what is discussed next.

5.2   Design assumptions

There are several possible approaches to the design of the meta data objects. This note starts from the following assumptions.

5.3   Notation

What follows is a sketchy design of some aspects of the meta data. It is not presented graphically, but in a textual form. As usual in object design, prominent nouns are likely to indicate classes, verbs are candidates for methods and so forth. Some words have been emphasised and spelled differently to indicate they are good candidates for classes.

5.4   Catalog and parts

It is proposed here that all the meta data, regardless of its use, should be stored logically in one place; physically it may span multiple databases.{11} For that purpose it is convenient to have a special entity, a meta data catalog MetaCatalog, as a single root, entry point and manager of the various types of meta data. The catalog should probably be easily accessible from the database federation root, possibly as a named object available from the ``main'' or ``default'' database.

The areas of the application domain knowledge stored as a meta model are likely to be quite unrelated to each other. Thus, it is sensible to talk about parts of meta data, MetaParts, that address a specific area of this knowledge. Examples of such parts would be detector description, clustering information or the type registry for object browsers; each of them would inherit from the general MetaPart interface. All these are known services, either directly hard wired to the meta catalog definition, or available using some lookup mechanism based on advertised names. While the latter is more flexible, it must be said that new kinds of meta data do not spring into existence or disappear very often, not necessarily even every year. Hence, the former approach should be quite acceptable, too.
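The relationship between the catalog and its parts can be sketched as below. Only the names MetaCatalog and MetaPart come from the text; the subclasses, the methods advertise and lookup, and the advertised names are assumptions. The sketch shows the lookup variant; the hard wired variant would simply replace the dictionary by fixed attributes of the catalog.

```python
class MetaPart:
    """General interface of a part of the meta data (methods are hypothetical)."""

    name = "unnamed"


class DetectorDescription(MetaPart):
    name = "detector-description"


class TypeRegistry(MetaPart):
    name = "type-registry"


class MetaCatalog:
    """Single root and entry point for all parts of the meta data."""

    def __init__(self):
        self._parts = {}

    def advertise(self, part):
        """Make a part available under its advertised name."""
        self._parts[part.name] = part

    def lookup(self, name):
        """Find a part by its advertised name, or None if it is not known."""
        return self._parts.get(name)

    def names(self):
        return sorted(self._parts)


catalog = MetaCatalog()
catalog.advertise(DetectorDescription())
catalog.advertise(TypeRegistry())
detector = catalog.lookup("detector-description")
```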

5.5   Detector description

The detector description meta data part could represent a fairly faithful tree-structured breakdown of the physical detector structure. Each detector element would have an identifier that uniquely names the element. An absolute path from the detector root could be described by an identifier path, consisting of all identifiers from the root to the element.

The above is made slightly more complicated by the fact that there are several uses for the detector description: simulation, reconstruction algorithms and others. Therefore, as proposed in [1], it would be beneficial to apply qualifiers to the identifiers. The qualifiers would enable a single detector description structure to serve multiple purposes, in essence slicing unselected qualifiers out. For best service, it would probably be useful to be able to alias a single qualifier as a short-cut to a group of other qualifiers. This way the qualifiers assigned to the detector description nodes could have very fine grained meanings, while users of the description would query information using more coarse grained qualifiers that would combine the fine grained meanings to a particular point of view.
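Identifier paths, qualifiers and aliases can be combined in a short sketch. The class Element, the qualifier names and the ALIASES table are all invented for the example; the sketch returns the identifier paths of the elements that carry at least one of the selected qualifiers and slices the others out.

```python
class Element:
    """Node of a hypothetical tree-structured detector description."""

    def __init__(self, identifier, qualifiers=(), children=()):
        self.identifier = identifier
        self.qualifiers = set(qualifiers)
        self.children = list(children)


# Hypothetical alias: a coarse grained qualifier standing for fine grained ones.
ALIASES = {"reconstruction": {"sensitive", "readout"}}


def expand(qualifier):
    """Replace an alias by the group of qualifiers it stands for."""
    return ALIASES.get(qualifier, {qualifier})


def paths(element, qualifier, prefix=()):
    """Identifier paths of all elements carrying one of the wanted qualifiers."""
    wanted = expand(qualifier)
    path = prefix + (element.identifier,)
    found = [path] if wanted & element.qualifiers else []
    for child in element.children:
        found.extend(paths(child, qualifier, path))
    return found


detector = Element("atlas", {"geometry"}, [
    Element("muon", {"geometry"}, [
        Element("chamber1", {"geometry", "sensitive"}),
        Element("support", {"material"}),
    ]),
])
for_reconstruction = paths(detector, "reconstruction")
for_material = paths(detector, "material")
```

The same tree thus serves two points of view without being duplicated.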

Another complication for the detector description is that its definition should be relatively distributed. In other words, each subgroup should be responsible for defining its parts of the description. This means that the description cannot be monolithic, but assembled from multiple sources. Similar arguments apply to the definition of the qualifiers: it should be easy to add and remove qualifiers without disturbing other users of the detector description too much.

[FIXME: navigation]

5.6   Other parts

[FIXME: type registry]

[FIXME: data access locality profiles]

[FIXME: examples]

6   Conclusions

7   References

[1] C. Arnault, A. Perus, R. D. Schaffer: ``Management of Atlas persistent objects: ADb'', April 1997.

[2] Mike Hurwicz: ``Opening Doors to Complex Data'', BYTE Magazine 22(4), April 1997, pp. 112-113.


{1} Most object databases do not store the behaviour of an object, just its state.

{2} It is assumed that the application is already designed using object oriented methods, and thus useful concepts and entities are already encapsulated and abstracted as objects. The patterns referred to here are ways in which several objects interact and which can be identified and described in detail, but cannot be encapsulated to any useful objects.

{3} Just to be clear on this: for maintenance reasons and because it would be too costly (humanly and in tangible resources) to train everybody to use the database interface.

{4} Note that we are not talking about mapping one kind of object to another kind of object, but mapping object structures to other structures.

{5} Strictly speaking a key based query could still be used. However, that would restrict queries to whatever keys have been predefined. A content based query allows queries on any similarity measure.

{6} Note that these classifications need not be simple truth values. They could also be numbers anywhere in the range [0, 1], or even fuzzy values. This would enable interesting likelihood searches, much as today's web search engines perform quite intelligent searches over millions of web documents based on only a few well chosen words and some qualifications combining them.

{7} This concept is presented here because it seems interesting at the moment. In particular, this should not be considered as advocating learning systems, just as a note that they should be investigated. Should one start to implement these concepts, it would be likely that the clients would first start out only adaptive, and learning aspects would be added later. Moreover, parameters affecting behaviour could be added little by little, adding degrees of freedom to the adaptiveness.

{8} To be more realistic, not all object types would have a specialised browser for them.

{9} This is unavoidable unless it is decided that ``experiment'' data, such as raw data, is never deleted. The reason is that individual physicists cannot be granted write access to the experiment data. Hence, it is impossible to create bidirectional associations from private analysis data to the experiment data, as the mere action of creation of bidirectional associations requires write access. Since bidirectional associations are out, the database will not automatically remove links to the data that disappears. It is possible to circumvent this with an indirection scheme---there is no problem in computer science that cannot be solved by adding one more layer of indirection---or using a meta data facility as described above.

{10} Object Interchange Format, the textual notation used to portably transfer object database contents (not only portably between platforms but also between different database products).

{11} Physical separation may be necessary if not for any other reason then at least to separate read only meta data objects from those that are universally writable.



LAT - April 1997