getting started with graph-frames

Mini project to get me started with graph-frames. I want to compute several metrics for each node.

For each node compute percentage of fraudulent connections for

direct node (directed)
direct node (undirected)
the friendship network from the node (directed)
the friendship network from the node (undirected) in total and per connection type.

Getting started with graph-frames I am not sure how to move forward. Looking forward to some suggestions.

The following nodes are present:

---+---------+-----+
| id|     name|fraud|
+---+---------+-----+
|  a|    Alice|    1|
|  b|      Bob|    0|
....
|  f|    Fanny|    0|
|  g|    Gabby|    0|
|  h|Fraudster|    1|
+---+---------+-----+

and edges

+---+---+------------+
|src|dst|relationship|
+---+---+------------+
|  a|  b|        call|
|  b|  c|         sms|
....
|  a|  h|        call|
|  f|  h|        call|
+---+---+------------+

direct friends

TODO check: I did not cache the graph but still make sure spark is not caching the results, otherwise comparing them one after the other has no meaning

naive calculation

Using a naive method (join) the calculation can be performed. It takes around 4 seconds on my MacBook. It is measured including the df.show statement, but still even if it takes one second for this amount of vertices, that is very slow.

But:

this is a false friend. This method will ignore vertices not present as outgoing.

improved calculation

Using sparks RDD API and a statCounter can speed up the computation by a lot. Now it takes only 007 seconds. For such a small graph the DAG does not really look simpler. But the timing shows the advantage.

naive method with different types of edges

improved method with different types of edges This does not work really well taking Array(String, String) as the key -> otherwise non unique values would handled wrong, i.e. the key is id of vertex and type of connection, if only id is considered we are back to the first example.

pregel API: Message passing via AggregateMessages graph frames and graphX both do support this type of API.

TODO figure out how to use these to implement the same thing.

ongoing discussions

some graph-frames tutorials

cool libraries

https://github.com/sparkling-graph/sparkling-graph

visualization

from generated graphml gephi visualization looks like

open questions

make pregel apis work / learn how to use them
Are objects or primitives better for graphs?

GraphX optimizes the representation of vertex and edge types when they are primitive data types (e.g., int, double, etc?) reducing the in memory footprint by storing them in specialized arrays. http://lamastex.org/courses/ScalableDataScience/2016/graphXProgGuide.html

But scala tuples are objects as well ...

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
project		project
src/main		src/main
.gitignore		.gitignore
README.md		README.md
build.sbt		build.sbt
graph.png		graph.png
graphPercentageWithJoins.jpg		graphPercentageWithJoins.jpg
sync.sh		sync.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

getting started with graph-frames

direct friends

ongoing discussions

some graph-frames tutorials

cool libraries

visualization

open questions

About

Releases

Packages

Languages

geoHeil/graphFrameStarter

Folders and files

Latest commit

History

Repository files navigation

getting started with graph-frames

direct friends

ongoing discussions

some graph-frames tutorials

cool libraries

visualization

open questions

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages