HyperGraphDB is cool. The main page already gives an excellent overview, but in particular look at the example applications at the bottom: its expressivity is so high that things as different as artificial intelligence with neural networks, Prolog unification over properly typed HGDB atoms as a fact base, and semantic web triple stores or OWL databases can all be hosted naturally in a HyperGraphDB. And that is just naming a few examples; there is much more.
However, the user base of HyperGraphDB appears to be rather small, which is a pity. There are also plenty of interesting things that could be done, but we are lacking manpower.
This blog post investigates why this is the case and how to overcome it. One point I do not want to go into here is improving usability; that will be treated soon in the next posts.
People who consider HyperGraphDB for a particular project generally have several other options to choose from, and they invest a rather limited amount of time and brainpower in each option. If the concepts, the design, and their usefulness for the job do not sink in after a reasonable amount of time, the option is simply dropped in favor of the others.
HyperGraphDB brings in many uncommon concepts, which also differ from the standard versions of those concepts in several respects.
Wait, you could also slightly reword that last phrase as:
"HyperGraphDB is the mother of all databases that do things differently from the mainstream":
- it is not a relational database, but some form of NoSQL
- it's not only a graph database, but also an object-oriented database
- it's not just a graph database, but a hypergraph database
- its hypergraphs are tuple-based, not pairs-of-sets-based
- its hyperedges don't have two sides, but they can be directed anyway
- it has its own type system, which is so mighty that it cannot be fully expressed in Java's type system
- as a general rule of thumb, Java's typed objects get mapped to corresponding HGDB typed atoms and vice versa. But... that does not always hold true
- links, too, are fully typed within the type hierarchy, and they can point to other links. In fact, nodes are special cases of links.
Ok, it gets clearer now why people have problems adopting HyperGraphDB!
Some explanations regarding the graph/hypergraph confusion:
A typical graph is nice and simple: there are nodes and edges, and each edge represents a pair of nodes. It is hard not to understand a typical graph. It sinks in fast.
Hypergraphs are a generalization, i.e. they drop the restriction to binary relationships. One could intuitively understand that not-only-binary aspect in two possible ways: allow more than one node on each side of an edge, i.e. pairs of sets of nodes (directed), or allow any number of nodes with no sides at all (undirected).
Here, too, HyperGraphDB doesn't stick to the common definitions: an HGDB link is not a pair of sets of nodes but a single tuple of nodes, and yet it can be directed anyway!
So, in a way, there are now two reasons why it no longer fits the common understanding of "a graph".
Wait, so how can it be directed without having two sides?
That is possible because, unlike a set, a tuple has a notion of order, and in your very own definition of an HGDB link you can hard-code the meaning of each position in the tuple. In other words, the tuple approach allows you to supersede mere "direction". This is illustrated in the example below.
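To make that concrete, here is a minimal sketch in plain Java, with no HyperGraphDB on the classpath: Handle merely stands in for HGDB's HGHandle, the class and reaction names are illustrative assumptions. It shows how an ordered tuple plus a positional convention subsumes plain direction:

```java
import java.util.Arrays;
import java.util.List;

public class TupleLinkSketch {
    // Stand-in for HyperGraphDB's HGHandle, just for illustration.
    record Handle(String name) {}

    // A link is just an ordered tuple of targets. By convention we decide:
    // position 0 = substrate, position 1 = product -> an ordinary directed
    // edge. But position 2 carries an extra role that plain direction
    // could never express.
    static class CatalyzedReaction {
        final List<Handle> targets; // [substrate, product, enzyme]
        CatalyzedReaction(Handle substrate, Handle product, Handle enzyme) {
            this.targets = Arrays.asList(substrate, product, enzyme);
        }
        Handle substrate() { return targets.get(0); }
        Handle product()   { return targets.get(1); }
        Handle enzyme()    { return targets.get(2); }
    }

    public static void main(String[] args) {
        CatalyzedReaction r = new CatalyzedReaction(
            new Handle("glucose"),
            new Handle("glucose-6-phosphate"),
            new Handle("hexokinase"));
        // "Direction" falls out of the positional convention alone:
        System.out.println(r.substrate().name() + " -> " + r.product().name()
            + " (enzyme: " + r.enzyme().name() + ")");
    }
}
```

Note that nothing in the tuple itself says "directed"; the meaning of each slot is fixed by the link definition, which is exactly the hard-coding of positions described above.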
Ok, but why is it worth doing it that way, and not the standard way?
The short answer is that neither graphs nor normal hypergraphs are expressive enough to capture a large class of problems.
Maybe an example helps to illustrate why it is useful to have n-ary relationships, why the tuple approach in particular is interesting, and why it is helpful to have fully typed atoms.
Ok then, give me an example why you need that stuff!
I pick biology here because it is my domain, but you can find situations everywhere in which modeling with binary relationships is either too strong a simplification, or implies splitting up "one thing" into many smaller ones. This means the representation of your domain entity (forgive me if I get the terminology wrong) is fragmented into many nodes and edges. This can be a good thing too, when it amounts to decoupling in some situations, but often it just doesn't make sense to separate what belongs together. Therefore, things are often simplified to the point where they fit the system. It doesn't have to be that way.
Ok, an example. Graphs are useful for modeling enzyme reactions, where one typical graph representation is
substrate ⇌ product
where the edge itself is simply understood as the enzyme. Actually, this is the first example of where the oversimplification of binary graphs hurts: the edge now has a twofold meaning; it is at the same time a reaction and an enzyme (which quite often correlate well, but by no means always).
Ok, let's illustrate. Understanding the details doesn't matter here, just look at the pic. Interestingly, this example, glycolysis, is what happens billions of times in each cell of your body, every hour or so. It is one of the most important ways chemical energy is converted into biologically useful forms, starting from glucose. You would not exist without it:
Even without understanding anything, there are several spots where you can see visually that the above graph simplification could never come near the truth, and that it would generally be a huge mess to model anything like this with a binary graph or even a regular hypergraph:
- in some reactions, not only substrates, products, and the (omitted) enzyme are involved, but also cofactors such as ATP or NAD (energy intermediates) or magnesium. Cofactors can undergo state changes in both directions too. This is critical for knowing in which direction the whole pathway runs.
- some reactions are reversible, others are not (this correlates to some extent with the cofactors involved and the kind of state change)
- even when considering only substrates and products, some reactions are not binary; see the triple-headed arrow in the lower right corner.
- although it is convenient to distinguish substrates/products vs. enzymes vs. cofactors, these are generally in a continuous flux of interconversion, limited only by the stoichiometry of the elements actually present, and of course by thermodynamics.
Furthermore, you would still be unable to represent n-ary reactions such as the one catalyzed by F6BP aldolase. Note that it is both n-ary and directed (one 6-carbon molecule is converted into two different 3-carbon molecules).
Afaik, when your formalism, such as binary graphs, has limited expressivity, you have to make a trade-off between simplification and an exploding number of nodes and edges. Obviously, simplification has a price in the expressivity of the model. The question is hence: what questions can still be answered with your simplified model?
Some interesting example questions for metabolic models like that:
How can the production of a specific desired product be maximized? How can the back-reaction of that desired product be avoided without disturbing production? How can production be optimized, and which substrate should the cell be fed? What are the rate-limiting reactions in the pathway, and due to which limiting factor in each reaction? Which products accumulate when a particular enzyme is knocked out, or is that just compensated for by another reaction (btw, an interesting property of biological systems called robustness)?
For these questions, you would need a lot of information about the enzymes, and a variety of parameters in the cell, such as the concentration of each molecular species and how it changes. For example, cofactors are commonly just simplified away, but you would also need to keep track of the concentrations of ATP/ADP and NAD+/NADH2, and how they are changed by the reactions involving them. At second glance, you'd also need to know how much phosphate is bound to which kinds of molecules (in the first reaction, for example, glucose is activated by one of those yellow circles of phosphate split off from ATP). This matters because, when ATP is lacking, most but not all of these can easily be converted back to ATP when the phosphorylation state of the cell is low. These are just some improvised examples of why it is bad to be forced into oversimplification. That is probably also one of the reasons why graphs are not used as much in bioinformatics as one would expect.
So ok, modeling is complicated, whatever field you are in. How is HyperGraphDB different?
I speculate here that a single HyperGraphDB hyperedge could be designed such that it accommodates one entire glycolysis reaction. Whether that is wise in a particular case is another question.
The only major limitation is that you could not have a single link with several variable-length groups of arguments. Just as with Java varargs, you could only have one variable-length group, in the final position. But here there are no varargs at all, so positions have to be fixed up front.
The Java class would have to implement HGLink and, the ugly part, would need to map specific positions in the link to particular fields in the class. Hence, we hereby create a HyperGraphDB link in which the positions have reserved meanings, each with a type constraint on the allowed atoms:
position 1: type of the reaction
position 2: key enzyme
positions 3 to 4: other involved enzymes
positions 5 to 10: substrates
positions 11 to 12: cofactors (in a specific state)
positions 13 to 15: intermediary enzyme-substrate complexes
positions 16 to 20: products
positions 21 to xy: rate constants, Km values, pH and temperature optima, and similar metrics
This link type would probably cover at least half of all reactions, but it would be more suitable to have a small type hierarchy of reaction links, where the positions are mapped to the parameters of particular reaction types. Analogously, one could encode the n-ary directed reactions.
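As a sketch of what such a reaction link class could look like, here is a self-contained, deliberately shrunken version in plain Java (far fewer positions than the layout above). The Link interface only mimics the shape of HyperGraphDB's HGLink (getArity/getTargetAt); Handle again stands in for HGHandle, and the specific positions and accessors are my own illustrative assumptions:

```java
import java.util.Arrays;

public class ReactionLinkSketch {
    // Stand-in for HyperGraphDB's HGHandle.
    record Handle(String name) {}

    // Mimics the shape of org.hypergraphdb.HGLink, without the dependency.
    interface Link {
        int getArity();
        Handle getTargetAt(int i);
    }

    // Reserved positions, a simplified cut of the layout in the text:
    // 0: key enzyme, 1-2: substrates, 3: cofactor, 4-5: products
    static class EnzymeReaction implements Link {
        private final Handle[] targets = new Handle[6];
        EnzymeReaction(Handle enzyme, Handle s1, Handle s2,
                       Handle cofactor, Handle p1, Handle p2) {
            targets[0] = enzyme; targets[1] = s1; targets[2] = s2;
            targets[3] = cofactor; targets[4] = p1; targets[5] = p2;
        }
        public int getArity() { return targets.length; }
        public Handle getTargetAt(int i) { return targets[i]; }
        // Position-aware accessors give the tuple its meaning:
        Handle enzyme() { return targets[0]; }
        Handle[] substrates() { return Arrays.copyOfRange(targets, 1, 3); }
        Handle cofactor() { return targets[3]; }
        Handle[] products() { return Arrays.copyOfRange(targets, 4, 6); }
    }

    public static void main(String[] args) {
        EnzymeReaction r = new EnzymeReaction(
            new Handle("hexokinase"),
            new Handle("glucose"), new Handle("ATP"),
            new Handle("Mg2+"),
            new Handle("glucose-6-phosphate"), new Handle("ADP"));
        System.out.println(r.enzyme().name() + ": "
            + r.substrates()[0].name() + " + " + r.substrates()[1].name()
            + " -> " + r.products()[0].name() + " + " + r.products()[1].name());
    }
}
```

The "ugly part" mentioned above is exactly the hard-coded mapping in the constructor and accessors: the class, not the database, decides that slot 0 is the enzyme and slots 4-5 are the products.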
Now, with HyperGraphDB's type system, you could also reflect the fact that all sorts of molecules are constantly interconverted, even across the oversimplifying roles we spoke about above (substrate vs. enzyme etc.). To achieve this, you would define a type hierarchy that carries in its type definitions the information about what the things actually are: polymers are a specific sequence of monomers (a field in the corresponding Java class), monomers are specific compounds of elements in specific amounts with specific bonds among them. Hence the type Enzyme would be a subtype of Protein, which in turn would hold a specific sequence of amino acids, which in turn are composed of specific functional groups and elements. It could thus be reflected which amino acids or even compounds would be released into the cell when a particular enzyme is degraded (enzymes are indeed constantly formed and degraded, because this allows negative feedback and prevents malfunctioning enzymes from forming and doing damage). In HyperGraphDB terms, that would mean that one or several atoms of a given type are erased and other atoms are created, probably within a transaction, so that no mass just vanishes or appears out of thin air.
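A minimal sketch of such a type hierarchy in plain Java, with illustrative class names and fields of my own choosing (in a real application, each class would be registered with HyperGraphDB and its instances stored as typed atoms; the tiny three-residue "sequence" is obviously not a real enzyme):

```java
import java.util.List;

public class MoleculeHierarchySketch {
    static class Molecule {
        final String name;
        Molecule(String name) { this.name = name; }
    }
    static class AminoAcid extends Molecule {
        AminoAcid(String name) { super(name); }
    }
    static class Protein extends Molecule {
        final List<AminoAcid> sequence; // a polymer is a sequence of monomers
        Protein(String name, List<AminoAcid> sequence) {
            super(name); this.sequence = sequence;
        }
        // Degradation: the protein atom would be erased and its
        // monomers released as atoms of their own, ideally in one
        // transaction so no mass appears or vanishes.
        List<AminoAcid> degrade() { return sequence; }
    }
    static class Enzyme extends Protein { // an enzyme is a special protein
        Enzyme(String name, List<AminoAcid> seq) { super(name, seq); }
    }

    public static void main(String[] args) {
        Enzyme hexokinase = new Enzyme("hexokinase",
            List.of(new AminoAcid("Met"), new AminoAcid("Ala"),
                    new AminoAcid("Gly")));
        // Because Enzyme is a subtype of Protein, the degradation
        // products are known from the type hierarchy alone:
        for (AminoAcid aa : hexokinase.degrade())
            System.out.println("released: " + aa.name);
    }
}
```

The point of the sketch is that the "what it actually is" information lives in the type definitions themselves, so queries across the substrate/enzyme/cofactor role boundaries become possible.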
Related answer by the creator of HyperGraphDB on the HGDB forum:
https://groups.google.com/d/msg/hypergraphdb/zj82-LPm_t8/veLs1PrLeIAJ