Cathy posted some cool data yesterday coming from the new visualization features of the magnificent Stacks Project. Summary: you can make a directed graph whose vertices are the 10,445 tagged assertions in the Stacks Project, and whose edges are logical dependency. So this graph (hopefully!) doesn’t have any directed cycles. (Actually, Cathy tells me that the Stacks Project autovomits out any contribution that would create a logical cycle! I wish LaTeX could do that.)
Given any assertion v, you can construct the subgraph G_v of vertices which are the terminus of a directed path starting at v. And Cathy finds that if you plot the number of vertices and number of edges of each of these graphs, you get something that looks really, really close to a line.
Why is this so? Does it suggest some underlying structure? I tend to say no, or at least not much — my guess is that in some sense it is “expected” for graphs like this to have this sort of property.
Because I am trying to get strong at sage I coded some of this up this morning. One way to make a random directed graph with no cycles is as follows: start with N edges, and a function f on natural numbers k that decays with k, and then connect vertex N to vertex N-k (if there is such a vertex) with probability f(k). The decaying function f is supposed to mimic the fact that an assertion is presumably more likely to refer to something just before it than something “far away” (though of course the stack project is not a strictly linear thing like a book.)
Here’s how Cathy’s plot looks for a graph generated by N= 1000 and f(k) = (2/3)^k, which makes the mean out-degree 2 as suggested in Cathy’s post.
Pretty linear — though if you look closely you can see that there are really (at least) a couple of close-to-linear “strands” superimposed! At first I thought this was because I forgot to clear the plot before running the program, but no, this is the kind of thing that happens.
Is this because the distribution decays so fast, so that there are very few long-range edges? Here’s how the plot looks with f(k) = 1/k^2, a nice fat tail yielding many more long edges:
My guess: a random graph aficionado could prove that the plot stays very close to a line with high probability under a broad range of random graph models. But I don’t really know!
Update: Although you know what must be happening here? It’s not hard to check that in the models I’ve presented here, there’s a huge amount of overlap between the descendant graphs; in fact, a vertex is very likely to be connected all but c of the vertices below it for a suitable constant c.
I would guess the Stacks Project graph doesn’t have this property (though it would be interesting to hear from Cathy to what extent this is the case) and that in her scatterplot we are not measuring the same graph again and again.
It might be fun to consider a model where vertices are pairs of natural numbers and (m,n) is connected to (m-k,n-l) with probability f(k,l) for some suitable decay. Under those circumstances, you’d have substantially less overlap between the descendant trees; do you still get the approximately linear relationship between edges and nodes?