Identity Resolution using Spark 2.4 and DSE GraphFrames (DSE v6.8) with BYOS

versaurabh/spark-dsegf-identity-resolution

spark-dsegf-identity-resolution

Identity Resolution using Spark 2.4 and DSE GraphFrames (DSE v6.8) with BYOS

What is Identity Resolution

  • Identity resolution (IDR) is the process of matching identifiers across devices and touchpoints to a single profile. This helps build a cohesive, omnichannel view of a consumer and enables brands to deliver relevant messaging throughout the customer journey
  • The underlying data source for IDR is generally an Identity Graph (IDG) which is a profile database housing all known identifiers correlated to individuals. The IDG also stores the metadata about the identifiers
  • In an IDG, a single user is represented by multiple identifiers, all connected to each other directly or transitively
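
To make "connected directly or transitively" concrete, here is a minimal in-memory sketch using a union-find structure. It is an illustration only, not the repository's implementation (which uses DSE GraphFrames on Spark); the class name IdentityGraphSketch and the identifier strings such as cookie:c1 are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration: identifiers linked directly or transitively
// collapse into one profile. Not the repository's code.
class IdentityGraphSketch {
    // Union-find parent pointer for every identifier seen so far.
    private final Map<String, String> parent = new HashMap<>();

    // Returns the representative identifier of the profile containing id,
    // registering id as its own profile if it was unknown.
    String find(String id) {
        parent.putIfAbsent(id, id);
        String p = parent.get(id);
        if (!p.equals(id)) {
            p = find(p);
            parent.put(id, p); // path compression
        }
        return p;
    }

    // Records an edge between two identifiers, merging their profiles.
    void link(String a, String b) {
        String ra = find(a);
        String rb = find(b);
        if (!ra.equals(rb)) {
            parent.put(ra, rb);
        }
    }

    // True if both identifiers belong to the same profile.
    boolean sameProfile(String a, String b) {
        return find(a).equals(find(b));
    }

    // Groups all known identifiers by profile representative.
    Map<String, Set<String>> profiles() {
        Map<String, Set<String>> groups = new TreeMap<>();
        for (String id : new ArrayList<>(parent.keySet())) {
            groups.computeIfAbsent(find(id), k -> new TreeSet<>()).add(id);
        }
        return groups;
    }

    public static void main(String[] args) {
        IdentityGraphSketch g = new IdentityGraphSketch();
        g.link("cookie:c1", "email:e1");  // direct link
        g.link("email:e1", "device:d1");  // makes cookie:c1 and device:d1 transitively linked
        g.link("cookie:c9", "email:e9");  // a second, unrelated user
        System.out.println(g.sameProfile("cookie:c1", "device:d1"));
        System.out.println(g.profiles().values());
    }
}
```

In this sketch, cookie:c1 and device:d1 never share an edge, yet they land in the same profile because both connect to email:e1.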

What's in this demo

The demo performs the following steps using Apache Spark 2.4 with DSE 6.8 via BYOS (Bring Your Own Spark) support:

  • Populating the IDG using JSON identifier data sets (id-graph-loader)
  • Computing the count of identifiers that are connected to a set of input identifiers via IDG (id-graph-resolver)
  • Exporting the matched identifiers that are connected to a set of input identifiers via IDG (id-graph-resolver)
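
The two resolver steps above (counting and exporting connected identifiers) can be sketched on a small in-memory graph with a breadth-first traversal. This is a plain-Java stand-in under stated assumptions, not the id-graph-resolver code: the class ResolverSketch, its method names, and the sample identifiers are hypothetical, and the sketch assumes the count excludes the input identifier itself.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of resolving input identifiers against a graph.
class ResolverSketch {
    // Builds an undirected adjacency map from (source, target) edge pairs.
    static Map<String, Set<String>> adjacency(List<String[]> edges) {
        Map<String, Set<String>> adj = new HashMap<>();
        for (String[] e : edges) {
            adj.computeIfAbsent(e[0], k -> new HashSet<>()).add(e[1]);
            adj.computeIfAbsent(e[1], k -> new HashSet<>()).add(e[0]);
        }
        return adj;
    }

    // Breadth-first search: every identifier reachable from input,
    // excluding input itself (an assumption of this sketch).
    static Set<String> resolve(Map<String, Set<String>> adj, String input) {
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        seen.add(input);
        queue.add(input);
        while (!queue.isEmpty()) {
            String current = queue.poll();
            for (String next : adj.getOrDefault(current, Collections.<String>emptySet())) {
                if (seen.add(next)) {
                    queue.add(next);
                }
            }
        }
        seen.remove(input);
        return new TreeSet<>(seen); // sorted for stable output
    }

    public static void main(String[] args) {
        List<String[]> edges = Arrays.asList(
                new String[] {"cookie:c1", "email:e1"},
                new String[] {"email:e1", "device:d1"},
                new String[] {"cookie:c9", "email:e9"});
        Map<String, Set<String>> adj = adjacency(edges);
        for (String input : Arrays.asList("cookie:c1", "email:e9", "unknown:x")) {
            Set<String> matched = resolve(adj, input);
            // One CSV-style line per input: input, count, matched identifiers.
            System.out.println(input + "," + matched.size() + "," + String.join("|", matched));
        }
    }
}
```

An input that is absent from the graph simply resolves to an empty set with a count of zero.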

IDG vertex (V) and edge (E) data generation

How to run

  • id-graph-loader

    • Build:
      cd id-graph-loader
      ./gradlew build 
      
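    • Generate the byos.properties file (for example with the DSE command: dse client-tool configuration byos-export <file>) and place it at the path passed to --properties-file below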
    • Run:
      cd spark-2.4.8-bin-hadoop2.7/
      bin/spark-submit --jars ~/dse-6.8.18/clients/dse-byos_2.11-6.8.18.jar --properties-file ~/dse-6.8.18/byos.properties --class com.datastax.examples.dsegf.Loader id-graph-loader-1.0-SNAPSHOT.jar <vertexJsonPath> <edgeJsonPath>
    
  • id-graph-resolver

    • Build:
      cd id-graph-resolver
      ./gradlew build 
    
    • Run:
      cd spark-2.4.8-bin-hadoop2.7/
      bin/spark-submit --jars /Users/saurabh.verma/Downloads/dse-6.8.18/clients/dse-byos_2.11-6.8.18.jar --properties-file /Users/saurabh.verma/Downloads/dse-6.8.18/byos.properties --class com.datastax.examples.dsegf.IDGResolver <idGraphResolverJar> <toBeMatchedIDsPath.csv> <resolvedIDsOutputPath.csv>
    

Sample Data

  • The sample data for loading and resolution are located inside the resources folder of the loader and resolver modules
