DSGRN Documentation

Purpose

Here we document technical information regarding the DSGRN (Dynamic Signatures Generated by Regulatory Networks) project.

Installation

The installation proceeds in two stages. First, install the prerequisites. Second, build and install the DSGRN software.

Dependencies

The dependencies are:

C++11 compiler
Boost 1.58
Open MPI
cluster-delegator
sqlite3

About the installation

DSGRN is installed by invoking the install.sh script in the source root. The location it installed is determined by the so-called installation prefix. This can be specified as in the following example:

./install.sh --prefix=/opt/local

If the installation prefix is not specified than a default value is used. This value is /usr/local. If, however, you do not have write privileges to /usr/local, then the installer will check to see if you have a directory named ~/.local. If you do, then that will be the installation prefix. Otherwise the installer fails.

Assuming an installation prefix of /usr/local the files installed are as follows:

  /usr/local/bin        (dsgrn)
  /usr/local/lib        (libdsgrn.a, libdsgrn.so, libdsgrn.dylib)
  /usr/local/share      (DSGRN/logic/*.dat)

Mac OS X:

On Mac OS X, C++11 and sqlite3 come along with the standard command line tools. For Boost and OpenMPI I recommend using the "homebrew" installer. Cluster-delegator can be retrieved from the git repository.

Example installation using homebrew:

brew install boost
brew install openmpi
git clone https://github.com/shaunharker/cluster-delegator.git
./cluster-delegator/install.sh
git clone https://github.com/shaunharker/DSGRN.git
./DSGRN/install.sh

Linux:

Logs of successful Linux installations are available from Travis CI:

https://travis-ci.org/shaunharker/DSGRN

You may find the included scripts

  ./.travis.yml
  ./.install/boost.sh
  ./.install/openmpi.sh
  ./.install/sqlite3.sh

useful.

Network Components

A network component is a building block of a network. It consists of a node, the number of input edges, the number of output edges, and a logic, which is an algebraic combination of variables associated with the input edges. In particular, we assume this algebraic combination is a product of sums, each input variable occurs precisely once, and all coefficients are unity.

Logics and Partitions

The logic \(L\), as a product of sums, can be associated with a partitioning of the inputs into \(k\) groups of sizes \(p_1, p_2, p_3, \cdots, p_k\), where \(p_1 + p_2 + p_3 + ... + p_k = n\) and without loss \(p_1 \leq p_2 \leq p_3 \leq \cdots \leq p_k\). We call \((p_1, p_2, \cdots, p_k)\) the partition associated with the logic.

Description Files

Given a network node, its description triple is the tuple \((n,m,(p_1, p_2, \cdots, p_k))\) where \(n\) is the number of inputs, \(m\) is the number of outputs, and \((p_1, p_2, \cdots, p_k)\) is the partition associated with the logic.

We maintain a repository of files which contain information about various network components. They are named in accordance to their description triple, using underscores as separators:

n_m_p1_p2_p3_..._pk

To clarify we give an example. Consider a network node named X1 with 3 inputs (X0, X3, and X4) and 2 outputs (X2 and X5), with a logic of \((\mathrm{X3}+\mathrm{X0})\mathrm{X4}\). See the following figure:

Figure 1. Network Component Example

Then we have \(n = 3\), \(m = 2\), and the logic \(L\) is associated with a partition of \(n\) into \(k=2\) groups of sizes \(p_1 = 1\) and \(p_2 = 2\). This results in the following name:

3_2_1_2.txt

Parameter Representation

A parameter can be understood as a set of inequalities between various input combinations and output thresholds for a network node. We give two conventions for representing parameters. The first is the binning representation and the second is the hex representation. The binning representation is more succinct, but we will prefer the hex representation because it lends itself to a more convenient comparison of parameters.

Binning Representation

The binning representation of a parameter arises by giving the bin for each of the \(2^n\) input combinations. A bin is a number in \(\{0,1,\cdots,m\}\) which determines the comparison between the input combination and the output edge. In particular a bin number of \(k\) indicates that the input combination activates edges \(j\) for \(j < k\). More precisely, a binning representation of parameter is the function

\(p: \{0,1\}^n \rightarrow \{0,1,\cdots,m\},\)

where \(p(i) > j\) iff the input combination \(i\) activates edge \(j\).

Programmatically, \(p\) may be encoded as an array of \(2^n\) binning values. For example in our C++ implementation we store the binning representation as

std::vector<int64_t> bin_;

Note	This is not as space efficient as possible. A more careful implementation would only require \(2^n \lceil \log_2 m \rceil\) bits rather than \(64 \cdot 2^n\).

Hex Representation

In the hex representation, we explicitly record the status of each of the comparisons between the \(2^n\) input combinations and the \(m\) output thresholds. Generically, comparisons are either \(<\) or \(>,\) which we may represent respectively as the binary bits \(0\) or \(1\). Accordingly, a hex representation encodes a \(2^n m\) length sequence of bits as a hexadecimal code. The hex codes strings are presented in big-endian fashion, meaning that reading from left to right the most significant nybbles (4-bit characters) come first.

We choose the following convention for the ordering of the bits. For \(0 \leq i < 2^n\) and \(0 \leq j < m\), let \(b_{ij} = 1\) if input i activates output j and let \(b_{ij} = 0\) otherwise. We order the bits \(b_{ij}\) so that the \((mi + j)\)th binary digit (where the 0th digit is the least significant) is \(b_{ij}\).

To continue the above example (Network Component Example), we list the contents of the network component description file for the \((3,2,(1,2))\) component:

> cat 3_2_1_2.txt | tr '\n' ' '
0000 4000 4040 4400 4440 4444 5000 5040 5050 5400 5440 5444 5450 5454 5500 5540 5544 5550 5554 5555 C000 C040 C0C0 C400 C440 C444 C4C0 C4C4 CC00 CC40 CC44 CCC0 CCC4 CCCC D000 D040 D050 D0C0 D0D0 D400 D440 D444 D450 D454 D4C0 D4C4 D4D0 D4D4 D500 D540 D544 D550 D554 D555 D5D0 D5D4 D5D5 DC00 DC40 DC44 DCC0 DCC4 DCCC DCD0 DCD4 DCDC DD00 DD40 DD44 DD50 DD54 DD55 DDC0 DDC4 DDCC DDD0 DDD4 DDD5 DDDC DDDD F000 F040 F050 F0C0 F0D0 F0F0 F400 F440 F450 F4C0 F4D0 F4D4 F4F0 F4F4 F500 F540 F550 F554 F555 F5D0 F5D4 F5D5 F5F0 F5F4 F5F5 FC00 FC40 FCC0 FCC4 FCCC FCD0 FCD4 FCDC FCF0 FCF4 FCFC FD00 FD40 FD44 FD50 FD54 FD55 FDC0 FDC4 FDCC FDD0 FDD4 FDD5 FDDC FDDD FDF0 FDF4 FDF5 FDFC FDFD FF00 FF40 FF44 FF50 FF54 FF55 FFC0 FFC4 FFCC FFD0 FFD4 FFD5 FFDC FFDD FFF0 FFF4 FFF5 FFFC FFFD FFFF

Note	The `cat` command reads the file and writes it to standard output, which is piped `\|` to the `tr '\n' ' '` command, which replaces newlines with spaces. In the file itself the hex codes are separated by newlines.

We see that the network component description file contains the 155 hex codes which represent the network node parameters.

Comparison

As mentioned, the binning representation (if implemented carefully) has a space advantage over the hex representation. The hex representation has its own advantages:

They can be encoded as a simple string
Adjacent parameters differ by one bit in this representation
They have meaning independent of the threshold ordering

Conversion

The following C++ code converts between the binning representation and the hex representation:

  /// hex
  ///   Return a hex code X which represents the parameter. The
  ///   hex code represents a binary string in big-endian fashion.
  ///   For 0 <= i < 2^n and 0 <= j < m, let b_{ij} be the
  ///   the (i*m + j)th binary digit of X (where the 0th digit is
  ///   the least significant).
  ///   Then b_{ij} = 1 if input i activates output j
  ///        b_{ij} = 0    otherwise
  std::string hex ( void ) const {
    std::string X;
    int64_t N = ( 1 << n );
    char nybble = 0, mask = 1;
    auto flush_nybble = [&] () {
      // Hex digits 0-9
      if ( nybble < 10 ) X.push_back((char)(nybble + '0'));
      // Hex digits A-F
      else X.push_back((char)(nybble - 10 + 'A'));
      nybble = 0; mask = 1;
    };
    for ( int64_t i = 0; i < N; ++ i ) {
      for ( int64_t j = 0; j < m; ++ j ) {
        if ( bin_[i] > j ) nybble |= mask;
        mask <<= 1; if ( mask == 16 ) flush_nybble();
      }
    }
    if ( mask != 1 ) flush_nybble ();
    // Put into big-endian form.
    std::reverse( X . begin(), X . end () );
    return result;
  }

Networks

Network Spec Files

Figure 2. Network Specification Example

A network specification file is a text file which describes the interactions in a gene regulatory network. The specification file contains newline separated lines which describe each network node and the structure of its inputs. Each line is of the form X : L, where X is the name of the network node (which we will call the target of the line, and L is an expression corresponding to the logic. Network node names may be any sequence of alphanumeric characters. In particular, spaces are not allowed. Logic expressions are sequences of node names joined together with whitespace, parentheses, + signs, and ~ signs. The whitespace and parentheses are optional and ignored except as separators. Nodes appearing in a logic are called source nodes of the line and indicate a repressing edge when prefixed by ~ and an activating edge otherwise. Source nodes are partitioned into factors: Consecutively appearing source nodes in a logic expression belong to the same factor provided there is an intervening + symbol, and we extend this to non-consecutive source nodes via transitive closure. The partitioning of source nodes into factors corresponds to the partition associated with the logic of the target node.

To the left we show a graphviz-rendering of a network specification file, where the activating edges are represented by normal arrowheads, the repressing edges are represented by blunt arrowheads, and the partitioning of input edges is represented by color coding.

X : (X + Y3) ~Z ~W1
Y1 : X
Y2 : (Y1 + W1)
Y3 : Y2(~W1)
Z : X(~W2)
W1 : Y2(~W2)
W2 : W1

Network Specification File

Note	The syntax for network specification files will be extended to accommodate imposition of constraints on parameters such as threshold ordering and activation ordering. These extensions will be done in a backward-compatible way.

Network Builder

We provide a tool, Network Builder, for graphically constructing networks and outputting network specification files. This tool also determines the number of parameters associated with a given network. The tool is browser based — so in fact we can embed it in the documentation! Here it is:

High-level API

DSGRN Tutorial

To begin with, we tell DSGRN what network we are interested in. Let’s study the network 2D_Example_C.txt:

X : (X)(~Y)
Y : (~X)(Y)

To do this we type:

> dsgrn network ./networks/2D_Example_C.txt

This produces a file called dsgrn.session in the current working directory. You can delete it when you are done. It just remembers which network we want to study. If you change directories and call dsgrn again, it won’t find the session file and will have forgotten about the network.

Anyhow, we can visualize the network using DSGRN by asking it for a graphviz representation of the network. To do this, we type:

> dsgrn network draw > network.gv

Here we used the shell redirection operator > to pipe the output to the file network.gv. We can open this up with Graphviz and we get the following image:

Next, we ask about parameters. Let’s find out how many parameters there are.

> dsgrn parameter

There are 1600 parameters.

Looks good. Later we should add a feature that tells us how many there are for each cohort of output orderings. In this case it happens to be 4 cohorts of 400, since there are 2 output orders for both \(X\) and \(Y\).

Let’s pick a parameter out of a hat: 126. Let’s ask DSGRN about parameter 126. First, we will ask it to give us a JSON-string representing the parameter:

> dsgrn parameter json 126

[["X",[2,2,"C0"],[0,1]],["Y",[2,2,"C0"],[0,1]]]

This is telling us that for this parameter, node \(X\) has the logic [2,2,"C0"] associated with it (2 inputs, 2 outputs, and hex code C0) and the output ordering [0,1]. The outputs have a natural ordering inherited from the sequence in which they appear in the network file; the [0,1] indicates an identity permutation of this natural ordering. Thus the outputs are ordered X then Y. Had it been [1,0] this would mean the other way around! In general, the \(k\)th out-ordered edge is the p[k] th node, where p is the permutation array.

We can ask about which parameter inequalities this logic/order corresponds to:

> dsgrn parameter inequalities 126

["{
L(X,X) L(Y,X) < THETA(X,X),
U(X,X) L(Y,X) < THETA(X,X),
L(X,X) U(Y,X) < THETA(X,X),
THETA(X,Y) < U(X,X) U(Y,X)
},{
THETA(X,X) < THETA(X,Y)
}",
"{
L(X,Y) L(Y,Y) < THETA(Y,X),
U(X,Y) L(Y,Y) < THETA(Y,X),
L(X,Y) U(Y,Y) < THETA(Y,X),
THETA(Y,Y) < U(X,Y) U(Y,Y)
},{
THETA(Y,X) < THETA(Y,Y)
}"]

Neat. Let’s try to turn around and find out the index (i.e. 126) from the JSON-string it hands us:

> dsgrn parameter index '[["X",[2,2,"C0"],[0,1]],["Y",[2,2,"C0"],[0,1]]]'

libc++abi.dylib: terminating with uncaught exception of type std::runtime_error: Feature not implemented

Oh no! The function to make indices out of the internal representation of parameters hasn’t been written yet. Don’t worry, this will be implemented soon. (Maybe it already has been and I haven’t gotten around to updating this doc yet!) In the meantime, how do we know the internal representation is correct? Well, we can print out the parameter inequalities:

> dsgrn parameter inequalities '[["X",[2,2,"C0"],[0,1]],["Y",[2,2,"C0"],[0,1]]]'

["{
L(X,X) L(Y,X) < THETA(X,X),
U(X,X) L(Y,X) < THETA(X,X),
L(X,X) U(Y,X) < THETA(X,X),
THETA(X,Y) < U(X,X) U(Y,X)
},{
THETA(X,X) < THETA(X,Y)
}",
"{
L(X,Y) L(Y,Y) < THETA(Y,X),
U(X,Y) L(Y,Y) < THETA(Y,X),
L(X,Y) U(Y,Y) < THETA(Y,X),
THETA(Y,Y) < U(X,Y) U(Y,Y)
},{
THETA(Y,X) < THETA(Y,Y)
}"]

They are the same. Hooray!

Now let’s start doing dynamics. We can ask it to create a domain graph:

> dsgrn domaingraph json '[["X",[2,2,"C0"],[0,1]],["Y",[2,2,"C0"],[0,1]]]'

[[0],[2],[2],[0],[1,3],[2,4],[6],[6],[5,7]]

Here we had asked for json output, so what we have is a nested json array giving an adjacency list representation of the domain graph. This is not visually appealing, so let’s ask for a graphviz representation instead:

> dsgrn domaingraph graphviz '[["X",[2,2,"C0"],[0,1]],["Y",[2,2,"C0"],[0,1]]]’ > dg.gv

Here we used > to pipe the output to the file dg.gv. We can open this up with Graphviz and we get the following image:

We might improve this by setting the positions using the actual positions of domain in space (though it isn’t clear the best way to do this for higher than 2 dimensions).

And we could also pass the parameter by it’s index:

> dsgrn domaingraph json 126

[[0],[2],[2],[0],[1,3],[2,4],[6],[6],[5,7]]

> dsgrn domaingraph graphviz 126

Very similarly, we can also get wall graphs:

> dsgrn wallgraph json 126

[[12,1],[13],[12],[2],[1],[3,4],[13],[2,14],[14],[3,4,8],[8],[5,6],[],[],[]]

> dsgrn wallgraph graphviz 126 > wg.gv

Again, this might be improved by setting the position of the nodes using where the walls are in space rather than letting Graphviz choose.

Time for Morse theory. Here we ask for a Morse decomposition, which is essentially a Morse Graph which is annotated with the list of domains (the software is set up to calculate the Morse decomposition using the domain graph representation):

> dsgrn morsedecomposition json 126

[[],[],[]]

Hmm, this isn’t great, since it lacks the phase space annotatation. That will be fixed! Meanwhile the graphviz output does have the appropriate annotations:

> dsgrn morsedecomposition graphviz 126 > md.gv

Here we see the vertices are annotated with all the domains in the strongly connected component of the domain graph. Which in this case is just singletons sets. How boring! Maybe I should have picked a more interesting example.

Next up: Morse graphs. Same drill: we can give it the parameter either by index or by json string, and we can request either json output or graphviz output:

> dsgrn morsegraph json 126

{"poset":[[],[],[]],"annotations":[["FP"],["FP"],["FP OFF"]]}

> dsgrn morsegraph json '[["X",[2,2,"C0"],[0,1]],["Y",[2,2,"C0"],[0,1]]]'

{"poset":[[],[],[]],"annotations":[["FP"],["FP"],["FP OFF"]]}

> dsgrn morsegraph graphviz 126

To wrap up, here is a summary of DSGRN’s current syntax. Any path from the green diamond to the red octagon is a valid DSGRN command:

Database Tutorial

Signatures is a program which computes databses of Morse graphs for an entire Parameter graph. Here is a demonstration.

First, we use the Signatures program to create an SQLite database. From the DSGRN folder, type:

> mpiexec -np 5 ./software/Signatures/bin/Signatures networks/2D_Example_C.txt ./2D_Example_C.db

Tada! Now we have an SQLite database. This is a small example that computes in seconds even on a laptop. So… looks like we’re done here, everybody go home.

Just kidding! We have work to do. Let’s examine the output. How can we get at this data and how can we use it? To get the full power here, we need to use a database query language. The most widely known is called SQL, which stands for Structured Query Language, and SQLite uses this.

A crash course in SQL

Ultimately, we will have wonderful web interfaces to examine this database, but for right now, since it is an SQLite database, we can use the SQL query interface. So what follows is a tutorial to SQL, tailored to the needs of using it to query the database of dynamic sigatures computed by DSGRN.

Let’s fire up SQLite’s interactive interface:

sqlite3 2D_Example_C.db

Now we should see the prompt:

sqlite>

So far, so good.

Tables

A database is comprised of tables; we can ask SQLite about what tables we have:

sqlite> .tables

MorseGraphAnnotations  MorseGraphVertices     Signatures
MorseGraphEdges        MorseGraphViz

The information produced by Signatures is in these tables. Each table has a number of columns and rows. The columns are named. We can find out about the columns of each table by asking about the database’s "schema":

sqlite> .schema

CREATE TABLE Signatures (ParameterIndex INTEGER PRIMARY KEY, MorseGraphIndex INTEGER);
CREATE TABLE MorseGraphViz (MorseGraphIndex INTEGER PRIMARY KEY, Graphviz TEXT);
CREATE TABLE MorseGraphVertices (MorseGraphIndex INTEGER, Vertex INTEGER);
CREATE TABLE MorseGraphEdges (MorseGraphIndex INTEGER, Source INTEGER, Target INTEGER);
CREATE TABLE MorseGraphAnnotations (MorseGraphIndex INTEGER, Vertex INTEGER, Label TEXT);
CREATE INDEX Signatures2 on Signatures (MorseGraphIndex, ParameterIndex);
CREATE INDEX MorseGraphAnnotations3 on MorseGraphAnnotations (Label, MorseGraphIndex);
CREATE INDEX MorseGraphViz2 on MorseGraphViz (Graphviz, MorseGraphIndex);
CREATE INDEX MorseGraphVertices1 on MorseGraphVertices (MorseGraphIndex, Vertex);
CREATE INDEX MorseGraphVertices2 on MorseGraphVertices (Vertex, MorseGraphIndex);
CREATE INDEX MorseGraphEdges1 on MorseGraphEdges (MorseGraphIndex);
CREATE INDEX MorseGraphAnnotations1 on MorseGraphAnnotations (MorseGraphIndex);

What we see here are the commands that were invoked to create the tables and indices. Indices are data structures formed on tables to speed up different types of queries. Indexing is extremely important; we’ll talk about it later. For now, we’ll focus on the tables. Consider the Signatures table. There are two columns: ParameterIndex and MorseGraphIndex. Both hold integer data. The ParameterIndex is marked INTEGER PRIMARY KEY, which tells SQLite that it doesn’t need to add its customary implicit rowid column; it can assume the ParameterIndex data can serve that function equally well (in particular, we will never attempt to insert two rows with the same ParameterIndex value).

Selecting things

To learn more about tables we will want to see their contents. In order to get at the contents of a table, we use the SELECT command. Time for an example. The table MorseGraphViz contains a list of MorseGraphs in a somewhat human-readable graphviz format, so let’s take a look at that one.

We’ll type

sqlite> select * from MorseGraphViz;

The * here says you want to know the values for all the columns.

0|digraph {
0[label="FP"];
1[label="FP ON"];
}

1|digraph {
0[label="FP ON"];
}

2|digraph {
0[label="FP ON"];
1[label="FP"];
}

skip a few…

18|digraph {
0[label="FP ON"];
1[label="FP"];
2[label="FP ON"];
3[label="FP"];
4[label="FP"];
}

19|digraph {
0[label="FP ON"];
1[label="FP"];
2[label="FP"];
3[label="FP ON"];
4[label="FP"];
}

skip a few more…

27|digraph {
0[label="FP ON"];
1[label="FP ON"];
2[label="FP"];
3[label="FP"];
4[label="FP OFF"];
}

As we can see, the database computation recorded 28 distinct Morse graphs. By way of example, we notice Morse graphs 18 and 19 are isomorphic, so by distinct we mean having a distinct representation rather than not being isomorphic. Later, we will add a feature that attempts to reduce the number of Morse graphs by heuristic approaches to graph canonicalization. For now, we can query about both representations at the same time, which is a nice example anyway. We will look in the table Signatures, which provides a mapping between parameter indices and the Morse graphs that occur:

sqlite> select * from Signatures where MorseGraphIndex=18 or MorseGraphIndex=19;

Indexing

Let’s stop a moment and think about how the database answered that last query. Was it efficient? One approach which would be absolutely terrible is if the database software scanned the Signature table in entirety and returned all the matching rows. For large databases, this would be ridiculously slow. It is called a full table scan. Full table scans are to be prevented whenever possible. The means of preventing full table scans is indexing. With indexes, the database maintains a copy of the database sorted by other columns. So having sorted the Signatures table by the MorseGraphIndex column, one can pull out the matching rows extremely efficiently (just as one can look up a word in a dictionary efficiently). Looking back at the schema, we see that such an index was in fact constructed by DSGRN’s computation:

CREATE INDEX Signatures2 on Signatures (MorseGraphIndex, ParameterIndex);

Thus, these sorts of queries should be fast. We can actually ask SQLite how it plans to accomplish a query, and make sure it is using our index to do it right, rather than do a full table scan:

sqlite> explain query plan select * from Signatures where MorseGraphIndex=18 or MorseGraphIndex=19;

0|0|0|SEARCH TABLE Signatures USING COVERING INDEX Signatures2 (MorseGraphIndex=?)
0|0|0|EXECUTE LIST SUBQUERY 1

Seems legit. What happens if it didn’t have that index?

sqlite> drop index Signatures2;
sqlite> explain query plan select * from Signatures where MorseGraphIndex=18 or MorseGraphIndex=19;

0|0|0|SCAN TABLE Signatures

Oh no! The dreaded full table scan! Quick, rebuild the index!

sqlite> CREATE INDEX Signatures2 on Signatures (MorseGraphIndex, ParameterIndex);

Let’s try another query. For the network we are considering, there were 4 cohorts of 400 parameters. We might only be interested in a particular cohort (i.e. ordering of thresholds) so we only care about, for instance, parameters 0 through 399. We could make a similar query, but restricted to the first 400 parameters:

sqlite> select * from Signatures where (MorseGraphIndex=18 or MorseGraphIndex=19) and ParameterIndex < 400;

Here SQLite should again use the index, first quickly determining which rows satisfy MorseGraphIndex=18 or MorseGraphIndex=19 and then among these rows using the sorting of ParameterIndex to quickly retrieve the rows associated with the ParameterIndex < 400 cohort. This is a more complex usage of the index; see SQLite’s nice explanation of query planning for more details.

Selecting Individual Columns

Back to the example. Let’s say we do not care which parameter was assigned Morse graph 18 and which was assigned 19 since they are, after all, isomorphic. We can instruct to only give the column corresponding the parameter index, dropping the column which distinguishes between 18 and 19:

sqlite> select ParameterIndex from Signatures where (MorseGraphIndex=18 or MorseGraphIndex=19) and ParameterIndex < 400;

We could also ask for the columns out of order, or the same column more than once. For example:

sqlite> select ParameterIndex,ParameterIndex,MorseGraphIndex,ParameterIndex from Signatures where (MorseGraphIndex=18 or MorseGraphIndex=19) and ParameterIndex < 400;

147|147|18|147
207|207|18|207
307|307|18|307
150|150|19|150
155|155|19|155
210|210|19|210
215|215|19|215
310|310|19|310
315|315|19|315

Group concatenation

We may prefer a comma separated list rather than a long column, so we can use the GROUP_CONCAT feature:

sqlite> select GROUP_CONCAT(ParameterIndex) from Signatures where MorseGraphIndex=18 or MorseGraphIndex=19 and ParameterIndex < 400;

147,207,307,1389,1390,1392,150,155,210,215,310,315

If this seems sort of strange, good. It’s an example of an aggregate function, and we’ll talk about those more later.

Joining tables

An important tool to use database structures is the join. This lets us combine two tables based on matches in a certain way. Let’s consider an example. Suppose we want to find all parameters where the Morse graph has at least 4 vertices. Since vertices in a Morse graph are indexed contiguously beginning at zero, this is the collection of parameters for which the Morse graph contains the vertex 3. This gives us a hackish trick for quickly selecting them by querying the MorseGraphVertices table for Vertex=3. The select will be fast, since we have the following index available:

CREATE INDEX MorseGraphVertices2 on MorseGraphVertices (Vertex, MorseGraphIndex);

SQLite will use this index to quickly answer the following query:

sqlite> select GROUP_CONCAT(MorseGraphIndex) from MorseGraphVertices where Vertex=3;

8,10,13,16,18,19,21,22,24,25,26,27

But wait; these aren’t the parameters with Morse graphs having at least 4 vertices; these are the Morse graphs themselves. We need to turn around and look up parameters associated with these Morse graphs. Here is how NOT to do it:

sqlite> select ParameterIndex from Signatures where MorseGraphIndex=8 or MorseGraphIndex=10 or ...

Nope! Ack! Don’t do that. Instead, try this:

sqlite> create temp table GotAtLeastFour as select MorseGraphIndex from MorseGraphVertices where Vertex=3;

Now GotAtLeastFour is the table of results we were just considering. We told SQLite it was temporary, so it knows this isn’t going to be a permanent citizen of our database. We can use it with the Signatures table as follows:

sqlite> select GROUP_CONCAT(ParameterIndex) from Signatures natural join GotAtLeastFour;

693,713,1007,1047,1269,1270,1271,1272,1273,1391,1393,1493,990,995,1010,1015,1050,1055,122,127,182,187,282,287,288,688,821,822,861,862,866,867,868,922,927,982,986,988,1082,1087,1088,1266,1267,1268,1386,1387,1388,1488,142,202,222,227,302,322,327,328,728,732,753,1289,1290,1291,1292,1293,1411,1413,1451,1453,1513,1553,147,207,307,1389,1390,1392,150,155,210,215,310,315,1409,1410,1412,1449,1450,1452,308,312,317,692,708,717,752,1492,1512,1552,144,204,224,230,235,304,324,330,332,335,337,737,549,550,552,609,610,612,709,710,712,733,738,773,778,987

The natural join was the magic mustard here. The natural keyword is shorthand to tell it to perform a join on the columns with the same name. A join is a procedure whereby two tables are combined together into a single table on the basis of matching entries.

We can get rid of that temporary table now:

sqlite> drop table GotAtLeastFour;

Good riddance. In fact, we never needed that pesky tempory table in the first place:

sqlite> select GROUP_CONCAT(ParameterIndex) from Signatures natural join (select MorseGraphIndex from MorseGraphVertices where Vertex=3);

693,713,1007,1047,1269,1270,1271,1272,1273,1391,1393,1493,990,995,1010,1015,1050,1055,122,127,182,187,282,287,288,688,821,822,861,862,866,867,868,922,927,982,986,988,1082,1087,1088,1266,1267,1268,1386,1387,1388,1488,142,202,222,227,302,322,327,328,728,732,753,1289,1290,1291,1292,1293,1411,1413,1451,1453,1513,1553,147,207,307,1389,1390,1392,150,155,210,215,310,315,1409,1410,1412,1449,1450,1452,308,312,317,692,708,717,752,1492,1512,1552,144,204,224,230,235,304,324,330,332,335,337,737,549,550,552,609,610,612,709,710,712,733,738,773,778,987

We should stay in the habit of always making sure the queries we run are efficient:

sqlite> explain query plan select GROUP_CONCAT(ParameterIndex) from Signatures natural join (select MorseGraphIndex from MorseGraphVertices where Vertex=3);

0|0|1|SEARCH TABLE MorseGraphVertices USING COVERING INDEX MorseGraphVertices2 (Vertex=?)
0|1|0|SEARCH TABLE Signatures USING COVERING INDEX Signatures2 (MorseGraphIndex=?)

We see its plan involves using the proper indices, so it seems right.

Counting things

The last example was pretty fancy, so let’s do something a little more basic: counting.

There is a command COUNT which lets you count things:

sqlite> select COUNT(*) from Signatures;

Those are our 1600 parameters. There is also a distinct keyword, which lets us select only when a new item is found. Here is an example of its use:

sqlite> select COUNT(distinct MorseGraphIndex) from Signatures;

Looks great, but beware! You might assume these are fast operations, the former taking \(O(1)\) time and the latter taking (for an indexed table) \(O(n \log N)\) time, where \(n\) is the count returned and \(N\) is the number of rows. However, SQLite has apparently made a deliberate design decision which forces a full table scan whenever COUNT is used. We can actually speed up the latter:

sqlite> select COUNT(*) from (select distinct MorseGraphIndex from Signatures);

This still runs a full table scan with COUNT, but it does it on the small table returned by the interior SELECT. Meanwhile, the select distinct MorseGraphIndex from Signatures does the right thing:

sqlite> explain query plan select distinct MorseGraphIndex from Signatures;

0|0|0|SCAN TABLE Signatures USING COVERING INDEX Signatures2

Well, hopefully does the right thing. Technically, the asymptotically best solution is based on skipping from key to key with repeated binary searches, not scanning the entire indexed table. Word on the street is that this so-called skip-scan algorithm is only considered when there are more than 50 rows for each distinct value. In the current example, there are around 57 (we’ll see this later in an example), but the database is very small, so it might not bother. Definitely worth checking out as things scale up!

Aggregate functions

The two expressions we have seen, GROUP_CONCAT and COUNT are examples of what are known as aggregate functions. The idea of an aggregate function is that it takes a column and compresses it into a single item. If A is an aggregate function, then there is some seed value (for GROUP_CONCAT it is the empty string, for COUNT it is 0) and upon processing a new item it updates the value according to what it sees. For GROUP_CONCAT it affixes the next item with a comma delimiter. For COUNT it adds one.

In fact, we can combine aggregate functions together in queries in comma-concatenated lists. When we do this, we can regard vanilla column entries as aggregate functions as well: MorseGraphIndex can be understood as the aggregate function that reads a row and updates its value to be the entry in the MorseGraphIndex column. Here is an example:

sqlite> select MorseGraphIndex,COUNT(Vertex) from MorseGraphVertices;

Which yields:

27|89

The 89 makes sense; this is the total number of vertices in all the Morse graphs combined. But what is the 27? It isn’t the number of Morse graphs; there are 28 of them. The answer is that it is simply the last value of the ``MorseGraphIndex`` field encountered while processing the rows with the aggregate function.

For a very ridiculous example (purely to illustrate the behavior of aggregate functions):

sqlite> select GROUP_CONCAT(MorseGraphIndex),GROUP_CONCAT(Vertex) from MorseGraphVertices;

0,0,1,2,2,3,3,3,4,4,4,5,5,6,7,7,7,8,8,8,8,9,9,9,10,10,10,10,11,11,12,12,12,13,13,13,13,14,14,15,16,16,16,16,17,17,17,18,18,18,18,18,19,19,19,19,19,20,20,20,21,21,21,21,22,22,22,22,23,23,23,24,24,24,24,25,25,25,25,25,26,26,26,26,27,27,27,27,27|0,1,0,0,1,0,1,2,0,1,2,0,1,0,0,1,2,0,1,2,3,0,1,2,0,1,2,3,0,1,0,1,2,0,1,2,3,0,1,0,0,1,2,3,0,1,2,0,1,2,3,4,0,1,2,3,4,0,1,2,0,1,2,3,0,1,2,3,0,1,2,0,1,2,3,0,1,2,3,4,0,1,2,3,0,1,2,3,4

This has made comma-concatenated lists out of every MorseGraphIndex field and every Vertex field, and made a row with two columns out of them.

Grouping things

At this point one might get the idea that aggregate functions are a one-trick pony, good for counting the number of rows or concatenating entries from a single column but yielding nonsense in all other situations. This is because we have not yet covered the GROUP BY syntax, which makes aggregate functions far more useful.

The GROUP BY command tells a SELECT statement to break the rows into groups based on chosen columns. For example, GROUP BY MorseGraphIndex would tell SQLite you wanted to look at the table in several groups, each group having some given value in the MorseGraphIndex field. It automatically assumes you mean to use aggregate functions. The aggregate functions don’t process the entire table and produce a single row, but rather process each group and produce a table with rows corresponding to each group. This is quite useful.

We’ll illustrate this by example. Let’s figure out which Morse graphs have more than 50 parameters corresponding to them. We tackle this by first creating a table with rows reporting the number of parameters associated to each Morse graph. Then we select those rows for which there are more than 50 parameters.

sqlite> select MorseGraphIndex,COUNT(*) as Frequency from Signatures group by MorseGraphIndex order by Frequency desc;

Wonderful! But how did it work? There are several new pieces of syntax. First, we see the count(*) as Frequency. This tells SQLite that the thing it is computing with the aggregate function count(*) is something that you want to refer to by the name Frequency. Later, we have the phrase order by Frequency desc which tells SQLite you want the final table to be ordered by the Frequency column in descending order. Finally, saving the best for last, we have the group by MorseGraphIndex. The GROUP BY feature tells SQLite that we want to break the table into groups based on the value of the MorseGraphIndex column. It then "evaluates" the aggregate function MorseGraphIndex,COUNT(*) over each group, and the output for each group is the row consisting of the MorseGraphIndex for the group and the number of rows in that group. Which is what we wanted! So along with GROUP BY, aggregate functions become powerful tools.

As a quick aside, we can now calculate the average number of parameters associated to a Morse graph by using the AVG aggregate function:

sqlite> select AVG(Frequency) from MorseGraphClasses;

57.1428571428571

Anyhow, we are now ready to give the list of all Morse graphs have more than 50 parameters corresponding to them:

sqlite> select GROUP_CONCAT(MorseGraphIndex) from (select MorseGraphIndex,count(*) as Frequency from Signatures group by MorseGraphIndex order by Frequency desc) where Frequency > 50;
6,14,2,1,17,0,9,11

More grouping

Let’s do an even more glamorous example of grouping. Let’s create a table with three columns. The first column will be MorseGraphIndex, the second column will be Frequency, and the third column will be ParameterIndices. Also, let’s make sure the table is sorted in descending order on the Frequency column.

sqlite> create temp table MorseGraphClasses as select MorseGraphIndex,COUNT(*) as Frequency,GROUP_CONCAT(ParameterIndex) from Signatures group by MorseGraphIndex order by Frequency desc;

Is this table what we want? Instead of SELECT * FROM MorseGraphClasses, let’s instead go with

sqlite> select * from MorseGraphClasses where Frequency < 10

24|8|549,550,552,609,610,612,709,710
20|7|777,1429,1430,1432,1469,1470,1472
4|6|372,377,1030,1035,1070,1075
10|6|990,995,1010,1015,1050,1055
18|6|147,207,307,1389,1390,1392
26|4|733,738,773,778
23|3|232,237,637
25|1|712
27|1|987

Great! Just what we wanted. But is it doing it efficiently? You’d think because we told it order by Frequency desc it would know that it could use binary search to do this query. Nope!

sqlite> explain query plan select * from MorseGraphClasses where Frequency < 10;

0|0|0|SCAN TABLE MorseGraphClasses

Ah, the dreaded full table scan. Probably the rationale is that you could later insert rows which break this property, so it cannot make this assumption. To prevent the full table scan we could index our new table:

sqlite> create index MorseGraphClasses2 on MorseGraphClasses (Frequency);

Now the database software has twigged to the better plan:

sqlite> explain query plan select * from MorseGraphClasses where Frequency < 10;

0|0|0|SEARCH TABLE MorseGraphClasses USING INDEX MorseGraphClasses2 (Frequency<?)

Even more grouping

Let’s revisit the example with finding Morse graphs with a certain number of vertices. Before, we used a hackish trick that relied on the fact a Morse graph had a vertex with index \(k\) if and only if it had at least \(k+1\) vertices. Using grouping, we don’t need to rely on such tricks.

We’ll make a table which lists each Morse graph according to the number of vertices it has in descending order.

sqlite> create temp table MorseGraphVertexCount as select MorseGraphIndex,COUNT(*) as VertexCount from MorseGraphVertices group by MorseGraphIndex order by VertexCount desc;

Now let’s select the ones with more than 3 vertices:

sqlite> select GROUP_CONCAT(MorseGraphIndex) from MorseGraphVertexCount where VertexCount > 3;

18,19,25,27,8,10,13,16,21,22,24,26

This is the same collection of Morse graphs we found earlier, albeit in a different order.

Combinations of Labels

Our Morse graphs come equipped with annotations. These are visible in the graphviz records, but they are also available in the MorseGraphAnnotations table. This lets us use the querying capabilities. We might be interested in querying for various logical combinations of annotations applied to a Morse graph, and then learning which parameters are associated.

Let’s say we are interested in finding all parameters which have a Morse graph that has an annotation "FP OFF", an annotation "FC", but do not have the annotation "FP ON". We can accomplish this as follows:

sqlite> create temp table HasFPOFF as select MorseGraphIndex from MorseGraphAnnotations where Label="FP OFF";
sqlite> create temp table HasFC as select MorseGraphIndex from MorseGraphAnnotations where Label="FC";
sqlite> create temp table HasFPON as select MorseGraphIndex from MorseGraphAnnotations where Label="FP ON";
sqlite> select GROUP_CONCAT(ParameterIndex) from Signatures natural join (select * from HasFPOFF union select * from HasFC except select * from HasFPON);

1,2,6,7,8,21,22,26,27,28,61,62,66,67,68,120,180,280,406,407,408,426,427,428,466,467,468,520,521,522,580,581,582,680,681,682,801,802,806,807,808,820,860,920,980,1080,1206,1207,1208,1220,1221,1222,1260,1261,1262,1320,1321,1322,1380,1381,1382,1480,1481,1482,121,126,128,181,186,188,281,286,526,527,528,586,587,588,686,687,826,827,828,921,926,928,981,1081,1086,1226,1227,1228,1326,1327,1328,1486,1487,0,20,60,400,401,402,420,421,422,460,461,462,800,1200,1201,1202

We check the query plans to make sure this is happening efficiently:

sqlite> explain query plan select MorseGraphIndex from MorseGraphAnnotations where Label="FP ON";

0|0|0|SEARCH TABLE MorseGraphAnnotations USING COVERING INDEX MorseGraphAnnotations3 (Label=?)

sqlite> explain query plan select GROUP_CONCAT(ParameterIndex) from Signatures natural join (select * from HasFPOFF union select * from HasFC except select * from HasFPON);

3|0|0|SCAN TABLE HasFPOFF
4|0|0|SCAN TABLE HasFC
2|0|0|COMPOUND SUBQUERIES 3 AND 4 USING TEMP B-TREE (UNION)
5|0|0|SCAN TABLE HasFPON
1|0|0|COMPOUND SUBQUERIES 2 AND 5 USING TEMP B-TREE (EXCEPT)
0|0|1|SCAN SUBQUERY 1
0|1|0|SEARCH TABLE Signatures USING AUTOMATIC COVERING INDEX (MorseGraphIndex=?)

Looks good.

A Lesson in Query Planning

It is important to make sure that the database is being efficient in answering queries. Just because there is an efficient way to do something does not mean that SQLite will translate our intent into an efficient search. We earlier saw an example where we needed to create an index to get SQLite to do the right thing. We noticed with the COUNT function that SQLite forced full table scans. Here we give another example where SQLite actually has the correct indices, but doesn’t do the most efficient thing. To this end we try using the distinct keyword with GROUP_CONCAT:

sqlite> select GROUP_CONCAT(distinct MorseGraphIndex) from Signatures;

15,11,6,2,0,1,14,7,9,12,13,16,22,5,17,18,19,3,23,21,4,24,8,25,26,20,27,10

This result is a little bit disturbing: it is simply the Morse graph indices occurring in the Signatures table in the order which they appear. This implies that SQLite chose a full table scan to satisfy this query! Let’s check:

sqlite> explain query plan select GROUP_CONCAT(distinct MorseGraphIndex) from Signatures;

0|0|0|SCAN TABLE Signatures

A full table scan! Terrible! The culprit is the GROUP_CONCAT and not distinct, as the following shows:

sqlite> select distinct MorseGraphIndex from Signatures;

which returns

0
1
2
...

This indicates the query used the index efficiently:

sqlite> explain query plan select distinct MorseGraphIndex from Signatures;

0|0|0|SCAN TABLE Signatures USING COVERING INDEX Signatures2

Is this a bug in SQLite? Well, we could argue that SQLite believed that we were very much interested in the order the distinct MorseGraphIndex elements occurred, row by row, in the Signatures table. Thus a full table scan was required to ascertain this ordering.

So if I wanted the comma-separated list without a slow full-table scan, a solution would be

sqlite> select GROUP_CONCAT(MorseGraphIndex) from (select distinct MorseGraphIndex from Signatures);

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27

sqlite> explain query plan select GROUP_CONCAT(MorseGraphIndex) from (select distinct MorseGraphIndex from Signatures);

1|0|0|SCAN TABLE Signatures USING COVERING INDEX Signatures2
0|0|0|SCAN SUBQUERY 1

The moral of the story here is to check the query plans, especially when you are using aggregate functions. The difference between \(O(N)\) time full table scan versus a \(O(n \log N)\) time binary search is dramatic for usual choices of \(n\) and \(N\).

Low-level API

Network Interface

The C++ interface for handling networks is as follows:

/// Network
///   This class holds network data.
///     * Loads specification files
///     * Outputs Graphviz visualizations
///     * Provides an interface to networks
class Network {
public:
  /// load
  ///   load from network specification file
  void
  load ( const char* filename );

  /// size
  ///   Return the number of nodes in the network
  int
  size ( void ) const;

  /// index
  ///   Return index of node given name string
  int
  index ( std::string const& name ) const;

  /// name
  ///   Return name of node (given by index)
  std::string const&
  name ( int index ) const;

  /// inputs
  ///   Return a list of inputs to a node
  std::vector<int> const&
  inputs ( int index ) const;

  /// outputs
  ///   Return a list of outputs to a node
  std::vector<int> const&
  outputs ( int index ) const;

  /// logic
  ///   Return the logic of a node (given by index)
  std::vector<std::vector<int>> const&
  logic ( int index ) const;

  /// interaction
  ///   Return the interaction type of an edge
  ///   False for repression, true for activation
  bool
  interaction ( int source, int target ) const;

  /// graphviz
  ///   Return a graphviz string (dot language)
  std::string
  graphviz ( std::vector<std::string> const& theme =
  { "aliceblue", // background color
    "beige",     // node color
    "black", "darkgoldenrod", "blue", "orange", "red", "yellow" // edge group colors
  } ) const;

This is more-or-less self-documenting. The methods name and index provide inverse look-up tables that give either the name of a node or its index given the other piece of information. Edges may be queried via interaction to see if they are activating or repressing; a query on a non-existent edge results in undefined behavior (likely a segmentation fault). The logic of a network node can be obtained through the method logic and the return type is

std::vector<std::vector<int>> const&

The inner std::vector<int> represents a logic factor as an array of node indices, and the outer std::vector stores a logic as an array of logic factors.

The graphviz method returns a dot language representation of the network (see Network Specification Example for an example). The color theme is an array of color names starting with the background color, followed by node color, and then input factor colors (i.e. input edges are colored according to which factor they belong to in the target node’s logic). If there are more factors for some node than colors specified, then it rotates back to the first (and some groups become indistinguishable). This behavior can be used to disable the coloring of edges according to group by only giving three colors in the theme, which would then be background color, node color, and edge color.