e4Graph Introduction

Introduction

This page explains what the e4Graph library is, the features it provides and the concepts it embodies, and shows by example how you would use it in your programs.

[*] What is the e4Graph Library?
[*] Release Methodology.
[*] A Taste of e4Graph.
[*] Installing and building e4Graph.
[*] Features of e4Graph.
[*] Concepts of e4Graph.
[*] Memory Management.
[*] Language Bindings for e4Graph.
[*] A Map of the Rest of the e4Graph Documentation.
[*] Acknowledgments for everyone who helped make e4Graph possible.

What is the e4Graph Library?

Are you a programmer, and do you work with information whose "shape" is irregular, data whose pieces inter-relate to each other in many different and non-uniform ways? And do you need to represent a collection of data with the ability to add new connections between any two data items on the fly? Do you need to efficiently work with huge data sets, and do you need guaranteed data consistency for your persistent data? And do you need to work with the data over many invocations of your programs, each invocation starting off at exactly the point where the previous invocation left off? Do you need to access the data from many different programs running on a variety of operating systems and machine architectures? Do you need to extract data from the data sets in XML format and do you need efficient access to external data available in XML format? If so, e4Graph can help you.

e4Graph is a C++ library that allows programs to store graph-like data persistently and to access and manipulate that data efficiently. With e4Graph, you can arrange your data in the most natural form that reflects the relationships between its parts, rather than having to force it into a table-like format. The e4Graph library allows you to concentrate on the relationships you want to represent, and not on how to store them in a database. You can modify data items, and add and remove connections and relationships between pieces of data on the fly. e4Graph allows you to represent an unlimited number of different connections between pieces of data, and your program can selectively manipulate the data according to the relationships it cares about, not having to know about other connections represented in the data set.

[back to top of page]

Release Methodology

The e4Graph library is free software, provided under a BSD style license. It is also an open source project, meaning that the source code for the whole library, instructions on how to build and use it, and full documentation, is provided in the distribution. Its development is being managed from http://e4graph.sourceforge.net/ kindly hosted by the good folks at SourceForge. My employer, Sun Microsystems Inc., has graciously allowed me to release this software freely, and I am solely responsible for its content, function and operation. You are allowed to use the e4Graph library for any purpose, and you can include it in your commercial, shareware or freeware products and projects. The library is provided in source form and, as a convenience, in binary form for several platforms, including Windows 95, Windows 98, Windows NT, Windows 2000, Windows XP, MacOS X 10.2 and later releases, Solaris 2.8 and later releases, and Linux. Ports to new platforms are easy, as I've made a conscious attempt to write the code portably; if you make a port to a new platform, please email jyl@mod3.net to allow your port to be included in a future distribution. If you run into portability problems then I'd appreciate the opportunity to learn from your experience and to help. I also ask that you send me email if you decide to include the library in your product. This will allow me to better support your needs.

The e4Graph library is provided as is, with no expressed or implied guarantee of usefulness, operation or correctness. I promise to reply to every email question or request. I wrote this software because I thought it's an interesting problem area, so I'm very interested in hearing about how you use it and what problems you ran into; unfortunately, I cannot promise commercial level support.

[back to top of page]

A Taste of e4Graph

Before I present the details, a small example to give you a glimpse of what e4Graph is about. I expect you to know a little bit of C++ in order to understand the example. Explaining C++ is beyond the scope of this document, sorry.

The example shows how a program uses e4Graph to find out whether John's grocery store has any sugar, and if it has, how many pounds are available and for what price.

e4_Storage s("John's grocery store", E4_METAKIT);
e4_Node root;
e4_Node items;
e4_Node sugar;
double price_per_pound;
int pounds_in_store;

if (!s.GetRootNode(root)) {
    throw new BadStoreException(s);
}
if {!root.GetNode("on hand items", items) {
    throw new NoItemsInStoreException(root);
}
if (!items.GetVertex("sugar", sugar)) {
    throw new OutOfSugarException(root);
}
if (!sugar.GetVertex("pounds in store", pounds_in_store)) {
    throw new OutOfSugarException(root);
}
if (!sugar.GetVertex("price per pound", price_per_pound)) {
    throw new NoPriceForSugarException(root);
}
out   << "John's store has "
      << pounds_in_store
      << " pounds of sugar at "
      << price_per_pound
      << " per pound\n";
if (!sugar.SetVertex("price per pound", price_per_pound+0.05)) {
    throw new CouldNotUpdatePriceOfSugarException(root);
}

This is quite straightforward. All of the concepts are explained in much more detail below, but for now its works to think in terms related to the data being represented. First, we open the e4Graph data that describes John's grocery store. Next, we obtain the root node of the data graph stored in the newly opened storage. Then, we obtain a node that contains all the items on hand. Next, we obtain all the related data that we care about for sugar in John's store, in this case its price per pound and how many pounds are on hand. Then we obtain an integer representing the number of pounds of sugar in the store, and a 64-bit floating point number that tells the price per pound. Then, we print out a report detailing how many pounds of sugar are available and at which price. Finally, we raise the price of sugar by five cents. This change takes effect immediately and will be made persistent when the storage is closed or committed. Along the way we handle various kinds of error conditions, such as there not being any sugar in John's store, by throwing C++ exceptions.

Note that the root node may contain many more items besides the catalog of available for-sale merchandise, such as accounts receivable and supplier information; since these are not used in this example, the program can ignore them. Also, there probably are many more items for sale in the store. However, in this case we only care about sugar. Also, the information about sugar may have much more detail, such as brands, links to one or more suppliers, incentive program information (discounts), links to information collected about client purchases of sugar, and so forth. In this example we only care about the number of units available and their price.

This is a very simplistic example that doesn't even begin to show off all the features of e4Graph. It's just a first taste. Hopefully it's sweet and gives you appetite for more!

[back to top of page]

Installing and Building e4Graph

You might want to pause in reading this page and find out how to install, build and use e4Graph in your own projects.

[back to top of page]

Features of e4Graph

The e4Graph library enables your program to represent and manipulate graph-like data efficiently and to store this data persistently. The overhead for persistence is etremely low, so e4Graph is useful also for when you do not care about keeping the data around between runs of your program; you can use it purely as a data structure library, without concern for persistence.

The key benefit provided by e4Graph is that it frees you to think about your data structures and the relations between the various entities without worrying about how to build the data structure and how to efficiently store it. Your program accesses and manipulates the data according to the relationships it represents, and e4Graph takes care of how to represent the data efficiently and persistently. Data is loaded into the executing program on demand, as a connection from an already loaded item to an on-disk item is followed; this data loading step is transparent to your program, and in-memory storage is recovered automatically when the data is no longer accessible by your program. This allows e4Graph to efficiently manipulate data graphs whose sizes are several orders of magnitude larger than the machine's available memory.

e4Graph makes changes to stored data permanent through a commit step. Persistent data cannot be corrupted by a program crash, because intermediate states are represented only in program memory, not in the persistent storage. At all times the persistent data represents the last committed state, while the program's in-memory state represents the state of data accessed or modified since the last commit. The e4Graph library by default automatically commits the changes made during execution at the time a storage is closed. All storages are automatically closed and optionally committed at normal program termination. This frees you from having to explicitly commit before program termination. Of course, you can still commit the intermediate state at any time during program execution.

The e4Graph library is layered on top of a generalized interface to table oriented or relational databases, and the library includes a database driver for the Metakit database package. It is possible, but currently not documented how, to write additional drivers for other database packages. If you are interested, send me an email and I'll help you get started.

Each database driver provides its own guarantees regarding the consistency of data stored persistently; the Metakit database provides extremely strong consistency guarantees, preventing data corruption even if a program crash happens in the middle of a commit operation.

The e4Graph library allows you to model any kind of relationship between data that can be represented by a directed graph, including circular graphs of connections between data items. e4Graph is unique in that it allows these circular relationships to be represented directly rather than implicitly or through meta-data, as is necessitated by other approaches such as relational databases. A bi-directional link between two data items can be represented by two directed connections between those data items.

Your program can open any number of persistent storages at a given time, up to the resource limitations imposed by the hosting operating system and machine. This means that your program can partition its data into separate persistent storages to divide it into "consistency units". All data stored in a single storage is guaranteed to be consistent, while relationships across storage units may not be consistent at all times. This feature also allows your program to use e4Graph data stored in different databases through a uniform interface.

[back to top of page]

e4Graph Concepts

The concepts embodied by e4Graph are necessarily related to each other. The description below therefore sometimes mentions a concept before fully explaining it. Later, it loops back to provide a complete explanation.

As already stated, data is organized into persistent storages represented by the C++ class e4_Storage. Each storage has a name and is controlled by a specific database driver; its internal layout is according to the rules imposed by that database driver. For storages managed by Metakit, the persistent data is contained in files on your machine's disk, and storages are named by the name of the file which contains them.

Each storage maintains a single root node, obtained by the GetRootNode method. That root node in turn may refer to other entities in the storage. The root node is not special in any way; it is just a convenient way to obtain a reference point in a graph stored in a newly opened storage.

A node is represented by the C++ class e4_Node. Nodes represent clusters of related data items. A node may contain zero or more vertices. Vertices store data items, and one kind of data item is a node; this allows the formation of graphs of nodes connected by vertices. A node A is a parent of a node B if A contains a vertex whose value is the node B. Thus, a node may also have zero or more parents. The e4Graph library provides methods for traversing from a node to a vertex contained in that node, or from a node to its parents.

While a storage is open, the program may hold references to detached nodes; such nodes are automatically and silently removed from the storage, and their memory is recycled, when the user program loses the last reference or when the storage is closed. A node is considered detached if it is not the value of any vertices that are contained within a node. If it is the value of at least one vertex that is contained within a node, it is considered attached. Detached nodes may result from operations on the storage such as detaching a vertex from a node, or may be created directly with the CreateDetachedNode method of class e4_Storage. Detached nodes are removed silently by e4Graph and their memory is recycled when the storage is closed or the program loses the last reference to them.

Vertices are the last concept left to explain. A vertex is a named piece of data. Each vertex has a name, type and associated data. Most vertices are contained within a node. If a vertex is not contained within a node, it is considered to be detached. Detached vertices are removed silently from the storage and their memory recycled when the program loses the last reference to them or when the storage is closed. A detached vertex results from operations on a storage such as detaching a vertex from a node, and may also be created explicitly by the user program using the CreateDetachedVertex method of class e4_Storage.

A vertex's name need not be unique within its node, and can be an ascii string of any length greater than zero. Vertices are arranged in a sibling relationship, and the e4Graph library provides methods to traverse from a vertex to its containing node and to the preceding and following vertex within its containing node. A vertex can contain data of a variety of types, including integers, 64-bit floating point numbers, null-terminated strings and binary uninterpreted data of arbitrary size. Additionally, a vertex value can be another node.

e4Graph supports direct representation of relationships only within a single storage; all of the vertices within a node are contained within the storage containing the node, as must the node's parents. Of course, your program is free to implicitly represent relationships between pieces of data stored within different storages. However, e4Graph does not know about these higher level relationships and does not extend any guarantee of consistency to their representation. For example, your program may choose to represent cross-storage relationships as URLs, and provide a mechanism for following such links automatically.

[back to top of page]

Memory Management

As explained above, nodes in the storage may be attached or detached, and the same is true for vertices. The e4_Node and e4_Vertex classes are derived from a common e4_RefCount base class, which provides reference counting and automatic memory management. When an instance of e4_Node or e4_Vertex is no longer referenced by the user program, its in-memory storage is automatically recycled. Additionally, the e4Graph library provides a garbage collection mechanism that recycles the memory occupied by detached nodes and vertices that are not referenced anymore by the user program.

The garbage collector uses reachability to decide when some memory can be recycled. The root node is considered to always be reachable, and nodes referenced by the user program are also considered reachable while the storage is open. Additionally, any nodes and vertices that can be reached by traversing some path from one of the reachable nodes is itself reachable. All nodes and vertices that are not reachable according to this definition are recycled when a garbage collection occurs. The garbage collector also recycles all circular graphs that cannot be reached by traversing from the root node or any nodes referenced by the user program.

[back to top of page]

e4Graph and Programming Languages

You must use e4Graph through an application programming interface (API) in your favorite programming language. The library comes with a C++ interface, and can be used directly from C++ programs. It also provides a very efficient Tcl binding that allows use of e4Graph in Tcl programs, and a Java binding through JNI.

There's a new, experimental Python 2.2 binding, contributed by Michael Krimerman. I welcome use of e4Graph in other programming languages. If you implement a binding to your favorite language, please drop me some email and let me know; I am very interested in including your binding in a future release. A PHP or Perl binding would be really nice to have, too!

[back to top of page]

The Rest

There's some more documentation:

This page explains how to install, build and use e4Graph in your own projects.
This page explains the various C++ classes that make up the e4Graph library. You will need this information to write C++ programs that use the library, or to create new language bindings.
This page explains the Tcl binding of e4Graph. Users that want to write Tcl programs using e4Graph need to read this.
This page explains the Java binding of e4Graph. Users that want to write Java programs that use e4Graph need to read this.
This page explains the Python 2.2 binding of e4Graph. Users that want to write Python programs using e4Graph need to read this.
This page provides a very extensive C++ example that demonstrates many features of e4Graph.
This page shows the same example, implemented in Tcl.
This page shows how to do XML input and output with e4Graph in C++.
This page shows how to work with XML data and e4Graph in Tcl.
This page shows how to work with XML data and e4Graph in Python.
This page discusses the database interface; you will need this information to write your own database driver for your favorite database. This interface is fluid and will not be finalized until the e4Graph 1.0 release.

[back to top of page]

Acknowledgments

Any piece of software is a labor of love, a birth. Many people are involved directly and indirectly. As the primary author of e4Graph, I've had the fortune to have many friends whose support helped make this all happen. Thank you all!

First and foremost, thanks to my family. I recognize the signs of an obsession; it's when even your eleven year old daughter who doesn't program knows what e4Graph is and when it's going to be released. :) Thanks to my wife, son and daughter for tolerating my michegas.

Thanks to Jean-Claude Wippler for help in ferreting out many of the ideas and concepts embodied by e4Graph. Jean-Claude also provided excellent support for his exceptional product, the Metakit package. Jean-Claude's patient help made it possible for this software to see the light of day.

Andreas Kupries (http://www.oche.de/~akupries/) provided valuable insights into how to generalize an earlier version of this package. At that time it only dealt with trees. Andreas suggested a way to generalize it to handle directed graphs, and as a consequence the package was renamed to e4Graph.

Jeffrey Hobbs (http://tcl.activestate.com/community/hobbs/) provided extensive help in clarifying the concepts behind the Tcl binding. Without Jeff's help the Tcl binding would not have reached maturity so quickly. Jeff also helped me overcome a couple of tight spots in implementing the Tcl binding.

Michael Krimerman provided the new Python 2.2 binding for e4Graph, and Daniel Steffen helped a lot with getting e4Graph to compile and run on MacOS X 10.2. Thanks to both of you! Daniel also includes e4Graph in his excellent Tcl/Tk precompiled distributions for MacOS X 10.2.

My employer, Sun Microsystems Inc., deserves special thanks for allowing me to release this hobby project. The idea has been knocking around in my head for a while, and Sun graciously tolerated my after-hours work on its implementation. Sun has no connection whatsoever to this package, and I bear complete and exclusive responsibility for the ideas, concepts and implementation of e4Graph.

Thanks to Jean-Claude Wippler for Metakit, John Ousterhout for Tcl, Sun Microsystems Inc. for Java, Guido Van Rossum and the Python crew for the Python programming language, and James Clark for expat. The e4Graph library depends on, or uses, all of these packages.

[back to top of page]