>You can also lay the graph array order out to minimize cache misses. This is something I've been looking into, but haven't implemented personally.
The issue with RCM is that it only works for undirected graphs, from what I remember. I'm less familiar with GNNs, but if you are trying to optimize for arbitrary BFS type traversals, an easy trick is to just do a few traversals from random sources and average the results. This can be used to reorder the graph and works pretty well in practice.
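To make the trick concrete, here's a minimal sketch in Python (the function names are mine, not from any library): run BFS from a few random sources, average each node's depth over the runs, and sort nodes by the average to get a reordering.

```python
from collections import deque
import random

def bfs_levels(adj, src):
    """BFS from src; returns dict node -> distance from src."""
    level = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                q.append(v)
    return level

def average_bfs_order(adj, n_sources=4, seed=0):
    """Order nodes by BFS depth averaged over a few random sources.
    Nodes unreached in a run get a large penalty so they sink to the end."""
    rng = random.Random(seed)
    nodes = list(adj)
    penalty = len(nodes)
    total = {u: 0.0 for u in nodes}
    for _ in range(n_sources):
        src = rng.choice(nodes)
        lvl = bfs_levels(adj, src)
        for u in nodes:
            total[u] += lvl.get(u, penalty)
    return sorted(nodes, key=lambda u: total[u])
```

The returned order is then used as the new node numbering, so nodes that tend to be visited close together in a BFS end up close together in memory.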
Doing something like a single pass of 1D embedding algos like GGVec [1] (note: I wrote the algo, but it's effectively just the GloVe algo for general graphs) or ProNE (if you have enough RAM) is also very cheap computationally.
> I believe you are not guaranteed for the edge data of adjacent nodes to be adjacent in memory
The edge data of a particular node is contiguous, but yes, the edge data of a collection of nodes is not contiguous. You can reorder (permute) the graph for some metric as a preprocessing step so that you get better locality for your average access pattern. This only works for static graphs though.
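For a CSR-layout graph, applying such a permutation as a preprocessing step looks roughly like this (a sketch, assuming `perm[i]` gives the new id of old node `i`; the function name is just for illustration):

```python
import numpy as np

def permute_csr(indptr, indices, perm):
    """Apply a node permutation to a CSR graph: old node i becomes perm[i].
    Returns new (indptr, indices) with neighbor lists relabeled and sorted."""
    n = len(indptr) - 1
    inv = np.empty(n, dtype=np.int64)
    inv[perm] = np.arange(n)                    # inv[new_id] = old_id
    degs = indptr[1:] - indptr[:-1]
    new_indptr = np.zeros(n + 1, dtype=np.int64)
    new_indptr[1:] = np.cumsum(degs[inv])       # degrees in new order
    new_indices = np.empty_like(indices)
    for new_u in range(n):
        old_u = inv[new_u]
        nbrs = indices[indptr[old_u]:indptr[old_u + 1]]
        new_indices[new_indptr[new_u]:new_indptr[new_u + 1]] = np.sort(perm[nbrs])
    return new_indptr, new_indices
```

Since the permutation rewrites both the offsets and the neighbor ids, it's a one-time cost, which is why this only pays off for static graphs.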
> For float-based edge data I think quantization works well, and I believe you can further compress the ROW/COL indices
Yes, index compression is pretty well studied and understood. The challenge is mostly getting a good compression ratio together with high decompression performance. There are a couple of works I'm aware of that do this for GPUs. This repo by Mo Sha et al. (https://github.com/desert0616/GCGT) is pretty good, and I also did some work in this space (https://github.com/pgera/efg).
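As a toy example of the delta + varint idea that these schemes build on (this sketch is mine and much simpler than what GCGT or efg actually do): sort each neighbor list, store gaps between consecutive ids, and encode each gap in as few bytes as possible.

```python
def encode_varint(nums):
    """Delta-encode a sorted index list, then varint-encode the gaps (LEB128-style)."""
    out = bytearray()
    prev = 0
    for x in nums:
        gap = x - prev
        prev = x
        while True:
            b = gap & 0x7F
            gap >>= 7
            if gap:
                out.append(b | 0x80)   # high bit set: more bytes follow
            else:
                out.append(b)
                break
    return bytes(out)

def decode_varint(data):
    """Inverse of encode_varint: decode the gaps and prefix-sum them back."""
    nums, cur, acc, shift = [], 0, 0, 0
    for b in data:
        acc |= (b & 0x7F) << shift
        shift += 7
        if not b & 0x80:               # last byte of this gap
            cur += acc
            nums.append(cur)
            acc, shift = 0, 0
    return nums
```

Gaps are small on reordered graphs, so most neighbors fit in one byte instead of four or eight; the decompression-speed question is about doing this branchy decoding fast, especially on GPUs.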
Oh one more question (checked out your repo): why the usage of GPUs here vs CPUs? The possibility of more parallelism when traversing the graph versus CPU?
Yes, you can get good performance on GPUs due to the parallelism and memory bandwidth. On a new GPU like H100, I believe you can do ~ 50-100 GTEPS (billions of traversed edges per sec) in a BFS. I'm not sure where the state of the art on CPUs is at, but you can certainly do efficient implementations on CPUs too. This paper is a few years old (https://people.csail.mit.edu/jshun/spaa2018.pdf), but has some numbers.
For CPU based index compression/decompression, the webgraph framework is quite mature and widely used and there's also Ligra+ that does it.
Very cool, will digest all this! For my use-case CPU + lots of RAM has been fast enough, and there's a balance between per-thread latency and throughput (serving data concurrently). I'm definitely interested in compressing down the graph data further to see if I can drop latency further, will see if I can adapt some of this. Super neat to see this on GPUs, will check that out too.
Btw, core Elias-Fano (EF) is quite efficient (perf-wise) on the decoding side even on GPUs. I wanted to do partitioned EF (PEF), but that seemed a bit more involved and I didn't have the time to do it. Here's a GPU implementation for graph problems if anyone is interested: https://github.com/pgera/efg. I also used folly on the encoding side and it works great.
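For anyone curious what core EF looks like, here's a toy encoder/decoder using Python ints as bit vectors (no rank/select structures, so it's nothing like a production implementation): values split into low bits stored verbatim and high bits stored unary in a bitvector.

```python
import math

def ef_encode(values, universe):
    """Elias-Fano encode a sorted list of ints in [0, universe).
    Returns (low_bits, high_bits, l) packed into Python ints."""
    n = len(values)
    l = max(0, int(math.floor(math.log2(universe / n)))) if n else 0
    low, high = 0, 0
    for i, v in enumerate(values):
        low |= (v & ((1 << l) - 1)) << (i * l)   # low l bits, fixed width
        high |= 1 << ((v >> l) + i)              # upper bits, unary-coded
    return low, high, l

def ef_decode(low, high, l, n):
    """Inverse of ef_encode: scan the high bitvector, reattach low bits."""
    values, i, pos = [], 0, 0
    while i < n:
        if high >> pos & 1:
            hi = pos - i
            lo = (low >> (i * l)) & ((1 << l) - 1)
            values.append((hi << l) | lo)
            i += 1
        pos += 1
    return values
```

The real decoding cost is in select/rank over the high bitvector, which is where most of the GPU-side engineering goes.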