2.2. Temporal Triangles

This is a synthetic dataset used as a benchmark kernel for assessing graph search performance.

2.2.1. Dataset Description

We use an RMAT Graph generator to generate a graph with a power-law degree distribution. Useful information about RMAT generation can be found here:

  • PaRMAT is a parallel RMAT graph generator

  • Graph500 is a benchmark aimed at assessing graph algorithms, but not specifically graph search. It has a description of RMAT generation.

We then attach a single property value to each edge called timestamp that has a uniformly random integer between 0 and 10,000.

2.2.2. Algorithm / Pattern

The goal of this benchmark is to find all patterns in the generated graph that have a directed 3-cycle where the timestamps on the 3 edges are in monotonically increasing order around the cycle. That is, the timestamps are non-decreasing around the cycle.

../../_images/TT.png

A final constraint is that the elapsed time from the first to the third edge be within some small threshold of time. The idea is that the presence of the cycle represents a collection of things that were observed in the data around some single event (or cause) of the three edges.

2.2.3. Reference Implementation

2.2.3.1. Cypher Description of This Graph Pattern:

MATCH (a)-[e0]->(b)-[e1]->(c)-[e2]->(a)
WHERE a <> b AND b <> c AND a <> c
  AND e0.time <= e1.time
  AND e1.time <= e2.time
  AND e2.time - e0.time < 42
RETURN a, e0.time AS t0, b, e1.time AS t1, c, e2.time AS t2

2.2.3.2. Gremlin Description of This Graph Pattern:

t.V().match(
  __.as('a').outE('edge').as('e0').values('time').as('t0').math('_ + 42').as('t3').select('e0').inV().as('b').where('b',neq('a')),
  __.as('b').outE('edge').as('e1').values('time').as('t1').where('t1',gte('t0')).select('e1').inV().as('c').where('c',neq('b')).where('c',neq('a')),
  __.as('c').outE('edge').as('e2').values('time').as('t2').where('t2', lt('t3')).where('t2',gte('t1')).select('e2').inV().as('a')
)

2.2.4. Dataset Generator

The datasets are generated in two phases:

  1. Generate the topology using an RMAT generator.

  2. Enrich the graph structure by adding a uniformly-distributed timestamp to each edge.

2.2.4.1. Generating the Topology

We used PaRMAT to generate the initial topology of the graph using the following options: PaRMAT -nVertices XXXXX -nEdges YYYYY. This generates a file called out.txt.

Our convention was to generate one vertex for every ten edges, and to consider the size of the dataset to be the number of edges.

To generate a size-100 graph, we would do: PaRMAT -nVertices 10 -nEdges 100.

2.2.4.2. Enrich with Timestamps

#!/usr/bin/env python

import argparse
import random
import sys

def genOutput(infile, outfile, seed=None, timestampRange=10000):
    """Generate the output file from the input file"""
    rand = random.SystemRandom()
    if seed:
        rand = random.Random(long(seed))
    with open(infile, 'r') as f:
        for line in f:
            t = rand.randint(0,timestampRange)
            flds = line.split(",")
            row = [int(x) for x in flds]
            rowstr = "%d" % row[0]
            for k in row[1:]:
               rowstr += ",%d" % k
            outfile.write("%s,%d\n" % (rowstr,t))

def temporalTriangleEnrichment(argv):
    """The main entry point for the temporal triangle enrichment generator"""
    # ---- Step 1:  process the arguments
    parser = argparse.ArgumentParser(prog='tt-enrichment',
                                description="Temporal Triangle data enrichment")
    parser.add_argument('-d', '--debug', dest='debug', action="store_true",
                        help="provide debugging information")
    parser.add_argument('-o', '--outfile', dest='outfile', default='out.csv',
                        help="the output data file, default: out.csv")
    parser.add_argument('-r', '--rmat', dest='rmat', default='out.txt',
                        help="the filename holding the RMAT generated file (default \"out.txt\")")
    parser.add_argument('-s', '--seed', dest='seed', default=None,
                        help="the seed to use for the randome generator (default None)")
    parser.add_argument('-t', '--timeRange', dest='timeRange', default=10000,
                        help="the timestamp range (0 to N-1), default N=10000")

    args = parser.parse_args(argv[1:])
    if args.debug:
        print "Args:", str(args)

    # ---- Step 2: Generate the output
    with open(args.outfile, 'w') as outfile:
        genOutput(args.rmat, outfile,
                  seed=args.seed, timestampRange=args.timeRange)

#----------------------------------------------------------------------------
if __name__ == "__main__":
    temporalTriangleEnrichment(sys.argv)

2.2.5. Pre-generated Datasets

We have already generated many datasets over a wide range of sizes that can be downloaded directly from our site:

Size

Edges

Vertices

Parquet

CSV

10K

10,000

1,000

10K Parquet

10K CSV

100K

100,000

10,000

100K Parquet

100K CSV

1M

1,000,000

100,000

1M Parquet

1M CSV

10M

10,000,000

1,000,000

10M Parquet

10M CSV

100M

100,000,000

10,000,000

100M Parquet

100M CSV

250M

250,000,000

25,000,000

250M Parquet

250M CSV

500M

500,000,000

50,000,000

500M Parquet

500M CSV

1B

1,000,000,000

100,000,000

1B Parquet

1B CSV

2B

2,000,000,000

200,000,000

2B Parquet

2B CSV

5B

5,000,000,000

500,000,000

5B Parquet

5B CSV

10B

10,000,000,000

1,000,000,000

10B Parquet

10B CSV

15B

15,000,000,000

1,500,000,000

15B Parquet

15B CSV

20B

20,000,000,000

2,000,000,000

20B Parquet

20B CSV

30B

30,000,000,000

3,000,000,000

30B Parquet

30B CSV