Large graph mining methods

There are three canonical types of large graph mining methods (Методы обработки больших графов) that cover the vast majority of use cases and algorithms.


Geodesic neighborhood-based computations involve a vertex, its neighboring vertices, and the edges among them. They are among the easiest methods to scale to large graphs, even when run simultaneously over all of the vertices, and they support a surprising variety of applications, for example, anomaly detection. The main issue in determining whether an algorithm built on this primitive will scale to even larger graphs is the degree of the highest-degree vertex. Two examples of tasks that can be accomplished with geodesic neighborhood-based graph mining are counting triangles and computing extremal eigenvalues of all neighborhoods.
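A minimal sketch of this primitive, assuming an adjacency-set representation in Python, is per-vertex triangle counting: for each vertex, intersect its neighborhood with the neighborhoods of its neighbors. The names build_adjacency and triangles_at are illustrative, not taken from the source.

  from collections import defaultdict

  def build_adjacency(edges):
      """Build an undirected graph as a dict mapping vertex -> set of neighbors."""
      adj = defaultdict(set)
      for u, v in edges:
          if u != v:
              adj[u].add(v)
              adj[v].add(u)
      return adj

  def triangles_at(adj, v):
      """Count triangles incident to v: pairs of v's neighbors that are adjacent to each other."""
      neighbors = adj[v]
      count = 0
      for u in neighbors:
          # every vertex in both neighborhoods closes a triangle with the edge (v, u)
          count += len(neighbors & adj[u])
      return count // 2  # each triangle at v is counted twice, once per incident edge order

  edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 0)]
  adj = build_adjacency(edges)
  print({v: triangles_at(adj, v) for v in adj})  # {0: 2, 1: 1, 2: 2, 3: 1}

The cost of the intersection step is governed by the neighborhood sizes, which is why the highest-degree vertex is the bottleneck for scaling.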

Diffusion neighborhood-based computations can be thought of as a softer, fuzzier version of geodesic neighborhood-based computations. The use of graph diffusions in graph mining typically formalizes the following idea: importance flows from a source node along the edges of the graph to target nodes. The mechanics of how the diffusion behaves on an edge of the graph determine the particular type of diffusion. Well-known examples are the PageRank diffusion, the Katz diffusion, the heat kernel diffusion, and the truncated random walk diffusion. These methods can be used to answer questions such as “What does the region of this graph look like around this specific object?” Random walks with restart are an instance of this idea. Diffusions are also easy to scale to massive graphs because they do not need to explore the entire graph; one should think of running these diffusion neighborhoods on only [math]\displaystyle{ O(\log n) }[/math] or [math]\displaystyle{ O(\sqrt n) }[/math] of the nodes rather than on all of them.
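The following is a minimal sketch of one such diffusion, assuming a PageRank-style random walk with restart seeded at a single source vertex; the function name and the restart parameter alpha are illustrative choices, not from the source. Scalable implementations would use push-style updates that touch only the vertices near the seed, whereas this sketch iterates over the whole (small) graph for simplicity.

  def seeded_pagerank(adj, seed, alpha=0.85, iters=50):
      """Approximate a personalized PageRank vector: importance flows from `seed`
      along edges, and at each step a (1 - alpha) fraction of the mass restarts at the seed."""
      nodes = list(adj)
      x = {v: 0.0 for v in nodes}
      x[seed] = 1.0
      for _ in range(iters):
          nxt = {v: 0.0 for v in nodes}
          for v in nodes:
              deg = len(adj[v])
              if deg == 0:
                  nxt[seed] += alpha * x[v]  # dangling mass returns to the seed
              else:
                  share = alpha * x[v] / deg
                  for u in adj[v]:
                      nxt[u] += share  # importance flows across each edge
          nxt[seed] += 1.0 - alpha  # restart (teleportation) mass
          x = nxt
      return x

  adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
  scores = seeded_pagerank(adj, seed=0)
  # vertices with large scores form the diffusion neighborhood around vertex 0
  print(sorted(scores, key=scores.get, reverse=True))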

Mining with generalized matrix-vector products is one of the most flexible and scalable ways to mine large graphs. As the name suggests, these methods emerge from a generalized notion of a matrix-vector product applied to generalizations of the adjacency matrix of a graph. More generally, a matrix-vector product can be seen as a special case of the following computational primitive: update each vertex's data based on a function of its neighbors' data. The standard matrix-vector product uses summation as that function. Iterative sequences of these operations, with different functions, compute connected components, single-source shortest paths, label propagation, effective diameters, and distance histograms. Operations such as minimum spanning trees, maximum weight matchings, and message passing methods fit into the same framework as well. All of these methods and algorithms apply to batch systems rather than online systems.
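A minimal sketch of the primitive, assuming the summation of an ordinary matrix-vector product is replaced by a minimum: connected components by repeatedly setting each vertex's label to the smallest label among itself and its neighbors. The names connected_components and label are illustrative, not from the source.

  def connected_components(adj):
      """Return a dict vertex -> component label (the smallest vertex id in its component)."""
      label = {v: v for v in adj}  # each vertex starts in its own component
      changed = True
      while changed:
          changed = False
          new_label = {}
          for v in adj:
              # generalized matrix-vector product step: combine neighbor values with min
              best = min([label[v]] + [label[u] for u in adj[v]])
              new_label[v] = best
              if best != label[v]:
                  changed = True
          label = new_label
      return label

  adj = {0: {1}, 1: {0, 2}, 2: {1}, 3: {4}, 4: {3}, 5: set()}
  print(connected_components(adj))  # {0: 0, 1: 0, 2: 0, 3: 3, 4: 3, 5: 5}

The number of iterations is bounded by the graph's diameter; swapping min for other update functions yields the other computations listed above, such as shortest paths or label propagation.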

Literature

  • Bühlmann P., Drineas P., Kane M., van der Laan M. (eds.) Handbook of Big Data. — CRC Press, 2016.