Investigate possibility of PipelineDP API for Spark SQL #288

dvadym · 2022-06-02T08:55:45Z

Context

PipelineDP supports anonymization through the Spark RDD API (example). It would be useful to support the Spark SQL API as well.

Goal

To investigate and design a Spark SQL API for PipelineDP.

A possible example of a PipelineDP Spark SQL API:

private_sql = pipeline_dp.PrivateSparkSql(<DP parameters>)
result = private_sql.sql("SELECT * ...")

Note: This task consists of researching possible options (both API and implementation design) and proposing something that is useful for users and can be implemented reasonably simply.

Additional information

On PipelineDP Architecture

The DPEngine class (code) implements the differential privacy (DP) logic independently of the pipeline framework. Running with Apache Spark, with Apache Beam, and without a framework is currently supported.

DPEngine.aggregate() is the main method; it can perform any supported DP aggregation. It is roughly equivalent to running the SQL query

SELECT dp_aggregate_function_1(value), ..., dp_aggregate_function_n(value)
FROM collection
GROUP BY partition_key
<with additional parameters required for DP>

where the supported dp_aggregate_function values are those in the metric list.
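To make the shape of such a query concrete, here is a toy, standard-library-only sketch of one DP aggregation: a per-partition count of distinct privacy IDs with Laplace noise. It is not PipelineDP's implementation and does not use its API. The function names, the `(privacy_id, partition_key)` row layout, and the parameter names are assumptions made for illustration, and the sketch assumes the partition keys are public, since it does no private partition selection or budget accounting.

```python
import random
from collections import defaultdict


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential samples with mean `scale`
    # is Laplace-distributed with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def toy_dp_privacy_id_count(rows, epsilon, max_partitions_contributed, rng):
    """Noisy count of distinct privacy IDs per partition (illustration only).

    rows: iterable of (privacy_id, partition_key) pairs (hypothetical layout).
    """
    # Collect the distinct partitions each privacy ID appears in.
    partitions_by_id = defaultdict(set)
    for privacy_id, partition_key in rows:
        partitions_by_id[privacy_id].add(partition_key)

    # Contribution bounding: keep at most `max_partitions_contributed`
    # partitions per privacy ID, chosen at random.
    counts = defaultdict(int)
    for keys in partitions_by_id.values():
        kept = rng.sample(sorted(keys), min(len(keys), max_partitions_contributed))
        for key in kept:
            counts[key] += 1

    # One privacy ID changes at most `max_partitions_contributed` counts by 1,
    # so the L1 sensitivity is `max_partitions_contributed`.
    scale = max_partitions_contributed / epsilon
    return {key: count + laplace_noise(scale, rng) for key, count in counts.items()}


rows = [("u1", "mon"), ("u1", "mon"), ("u2", "mon"), ("u2", "tue")]
noisy = toy_dp_privacy_id_count(rows, epsilon=1.0,
                                max_partitions_contributed=2,
                                rng=random.Random(0))
```

The extra arguments (`epsilon`, `max_partitions_contributed`) are the kind of information the "additional parameters required for DP" line in the query above would have to carry.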

On implementation

The implementation will likely parse the SQL and call DPEngine.aggregate().

Open questions from Spark

Is it possible to add additional operators/functions to Spark SQL, for example something like

 SELECT ...
 FROM ...
 DP_PARAMETERS = (...)
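Whether Spark's own grammar can be extended is the open question here. One option that avoids the question altogether is to strip the extra clause in Python before the remaining text reaches Spark. The sketch below assumes a hypothetical clause syntax, `DP_PARAMETERS = (name=value, ...)` with numeric values, placed at the end of the query; the function name and the parameter names (`epsilon`, `delta`) are illustrative and not part of any existing API.

```python
import re

# Hypothetical syntax: a trailing "DP_PARAMETERS = (name=value, ...)" clause.
_DP_CLAUSE = re.compile(
    r"\bDP_PARAMETERS\s*=\s*\((?P<body>[^()]*)\)\s*;?\s*$",
    re.IGNORECASE,
)


def split_dp_parameters(query):
    """Return (plain_sql, dp_params) for a query ending in a DP_PARAMETERS clause."""
    match = _DP_CLAUSE.search(query)
    if match is None:
        raise ValueError("query has no trailing DP_PARAMETERS = (...) clause")

    dp_params = {}
    for item in match.group("body").split(","):
        if not item.strip():
            continue  # tolerate an empty list or a trailing comma
        name, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected name=value, got {item!r}")
        # Assumption: every DP parameter in this sketch is numeric.
        dp_params[name.strip().lower()] = float(value)

    # Everything before the clause is ordinary SQL.
    plain_sql = query[:match.start()].strip()
    return plain_sql, dp_params


plain_sql, dp_params = split_dp_parameters(
    "SELECT day, COUNT(*) FROM visits GROUP BY day "
    "DP_PARAMETERS = (epsilon=1.0, delta=1e-6)"
)
```

With this approach the string handed on is plain SQL, at the cost of the clause being invisible to Spark's own tooling (query plans, syntax errors, and so on).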

Given a SQL string "SELECT ...", how can it be transformed into Python code that calls DPEngine.aggregate()?
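As a starting point for this question, the sketch below handles only the restricted query shape from the architecture section (aggregates over one table, grouped by one column) and turns it into a plain description of what an aggregate call would need. `AggregateSpec`, `parse_aggregate_query`, and the `SUPPORTED` set are hypothetical names invented for this illustration. A real design would probably rely on a proper SQL parser rather than regular expressions.

```python
import re
from dataclasses import dataclass

# Restricted shape only: SELECT <items> FROM <table> GROUP BY <column>
_QUERY = re.compile(
    r"^\s*SELECT\s+(?P<select>.+?)\s+FROM\s+(?P<table>\w+)"
    r"\s+GROUP\s+BY\s+(?P<key>\w+)\s*;?\s*$",
    re.IGNORECASE | re.DOTALL,
)
_AGGREGATE = re.compile(r"^(?P<func>\w+)\s*\(\s*(?P<column>\w+|\*)\s*\)$")

# Assumed subset of aggregations for this sketch.
SUPPORTED = {"COUNT", "SUM", "MEAN"}


@dataclass
class AggregateSpec:
    table: str
    partition_key: str
    metrics: list  # list of (metric_name, column) pairs


def parse_aggregate_query(sql):
    """Translate a restricted GROUP BY query into an AggregateSpec."""
    match = _QUERY.match(sql)
    if match is None:
        raise ValueError("only SELECT ... FROM <table> GROUP BY <column> is supported")

    key = match.group("key")
    metrics = []
    for item in (part.strip() for part in match.group("select").split(",")):
        if item == key:
            continue  # the grouping column itself may be selected
        agg = _AGGREGATE.match(item)
        if agg is None or agg.group("func").upper() not in SUPPORTED:
            raise ValueError(f"unsupported select item: {item!r}")
        metrics.append((agg.group("func").upper(), agg.group("column")))

    if not metrics:
        raise ValueError("query selects no aggregate")
    return AggregateSpec(match.group("table"), key, metrics)


spec = parse_aggregate_query(
    "SELECT day, COUNT(*), SUM(spent) FROM visits GROUP BY day"
)
```

From such a spec, the metric names could be mapped to `pipeline_dp.Metrics` values, and `partition_key` and the value column to the extractor functions passed to `DPEngine.aggregate()`. The privacy ID column does not appear anywhere in standard SQL, so it would have to be supplied separately, for example together with the DP parameters.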

dvadym added the labels Type: New Feature ➕ and Type: Research 🔬 on Jun 2, 2022
