
Investigate possibility of PipelineDP API for Spark SQL #288

Open
dvadym opened this issue Jun 2, 2022 · 0 comments
Labels
Type: New Feature ➕ Introduction of a completely new addition to the codebase Type: Research 🔬 When further investigation into a subject is required

dvadym commented Jun 2, 2022

Context

PipelineDP supports anonymization with the Spark RDD API (example). It seems interesting to also support the Spark SQL API.

Goal

To investigate and design a Spark SQL API for PipelineDP.

Possible example of PipelineDP SparkSQL API:

private_sql = pipeline_dp.PrivateSparkSql(<DP parameters>)
result = private_sql.sql("SELECT * ...")

Note: This task consists of researching possible options (both API and implementation design) and proposing something that is useful for users and can be implemented reasonably simply.
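To make the discussion more concrete, here is one possible shape of such a wrapper. Everything below is hypothetical and only a sketch for the design discussion: `PrivateSparkSql`, `DPParameters`, and the engine hook-up are assumptions, not existing PipelineDP code, and the parsing/dispatch step is left open on purpose because that is exactly the open design question of this issue.

```python
# Hypothetical sketch of a PrivateSparkSql wrapper (all names are
# illustrative; nothing here is part of the current PipelineDP API).
from dataclasses import dataclass


@dataclass
class DPParameters:
    """DP budget shared by all queries issued through one wrapper object."""
    epsilon: float
    delta: float


class PrivateSparkSql:
    """Accepts SQL strings; a real implementation would translate them
    into DPEngine.aggregate() calls running on Spark."""

    def __init__(self, dp_params: DPParameters):
        self._dp_params = dp_params
        self._queries = []  # queries issued so far, for budget tracking

    def sql(self, query: str):
        # A real implementation would parse `query`, validate that it only
        # uses supported DP aggregations, charge the budget, and dispatch
        # to DPEngine.aggregate. That translation is the open question.
        self._queries.append(query)
        raise NotImplementedError("SQL-to-DPEngine translation is undesigned")
```

A caller would then construct the wrapper once with the total budget and issue queries against it, mirroring the `private_sql = pipeline_dp.PrivateSparkSql(<DP parameters>)` example above.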

Additional information

On PipelineDP Architecture

DPEngine (code) is the class that implements differentially private (DP) logic independently of the pipeline framework (running with Apache Spark, with Apache Beam, and without any framework is currently supported).

DPEngine.aggregate() is the main method; it can perform any supported DP aggregation. Basically it's the equivalent of running the SQL query

SELECT dp_aggregate_function_1(value), ..., dp_aggregate_function_n(value)
FROM collection
GROUP BY partition_key
<with additional parameters required for DP>

where the supported dp_aggregate_functions come from the metric list.
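To make that equivalence concrete, here is a minimal, framework-free sketch of the semantics of such a query for COUNT: group rows by partition key, bound each user's contributions, aggregate, and add Laplace noise. This is only an illustration of the idea (it omits DP partition selection and is not PipelineDP's actual implementation; all names are made up for the sketch):

```python
import math
import random
from collections import defaultdict


def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) via the inverse CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))


def dp_count_per_partition(rows, epsilon, max_partitions_contributed,
                           max_contributions_per_partition):
    """DP sketch of `SELECT COUNT(value) FROM rows GROUP BY partition_key`.

    `rows` are (privacy_id, partition_key, value) tuples. Contribution
    bounding caps each user's influence so the Laplace noise scale
    (sensitivity / epsilon) is finite.
    """
    per_user_partition = defaultdict(int)
    partitions_of_user = defaultdict(set)
    counts = defaultdict(int)
    for privacy_id, partition_key, _value in rows:
        if (partition_key not in partitions_of_user[privacy_id]
                and len(partitions_of_user[privacy_id])
                >= max_partitions_contributed):
            continue  # user already contributed to too many partitions
        if (per_user_partition[(privacy_id, partition_key)]
                >= max_contributions_per_partition):
            continue  # user already contributed enough to this partition
        partitions_of_user[privacy_id].add(partition_key)
        per_user_partition[(privacy_id, partition_key)] += 1
        counts[partition_key] += 1
    # Sensitivity of COUNT under the contribution bounds above.
    sensitivity = max_partitions_contributed * max_contributions_per_partition
    scale = sensitivity / epsilon
    return {pk: c + laplace_noise(scale) for pk, c in counts.items()}
```

The `<with additional parameters required for DP>` clause of the query corresponds to the `epsilon` and contribution-bound arguments here.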

On implementation

The implementation will likely consist of parsing the SQL and calling DPEngine.aggregate().
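As a rough illustration of that direction, a minimal parser could extract the aggregations, source table, and grouping key from a restricted query shape and turn them into the arguments a translator would hand to DPEngine.aggregate(). The grammar and names below are assumptions for discussion, not a proposed final design:

```python
# Hypothetical sketch: map a restricted SQL shape to aggregate() arguments.
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ParsedDPQuery:
    """Pieces a translator would pass on to DPEngine.aggregate()."""
    metrics: List[str] = field(default_factory=list)  # e.g. ["COUNT", "SUM"]
    value_column: Optional[str] = None
    table: Optional[str] = None
    partition_key: Optional[str] = None


_QUERY_RE = re.compile(
    r"SELECT\s+(?P<select>.+?)\s+FROM\s+(?P<table>\w+)"
    r"\s+GROUP\s+BY\s+(?P<group>\w+)",
    re.IGNORECASE | re.DOTALL)
_AGG_RE = re.compile(r"(?P<fn>COUNT|SUM|MEAN)\s*\(\s*(?P<col>\w+)\s*\)",
                     re.IGNORECASE)


def parse_dp_query(sql: str) -> ParsedDPQuery:
    """Parse the restricted shape `SELECT agg(col), ... FROM t GROUP BY k`."""
    m = _QUERY_RE.search(sql)
    if not m:
        raise ValueError("unsupported query shape")
    parsed = ParsedDPQuery(table=m.group("table"),
                           partition_key=m.group("group"))
    for agg in _AGG_RE.finditer(m.group("select")):
        parsed.metrics.append(agg.group("fn").upper())
        parsed.value_column = agg.group("col")
    return parsed
```

A production design would of course use a real SQL parser (e.g. Spark's own) rather than regexes; this only shows that the query pieces map naturally onto aggregate()'s parameters.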

Open questions about Spark

  1. Is it possible to add additional operators/functions to Spark SQL, for example something like
 SELECT ...
 FROM ...
 DP_PARAMETERS = (...)
  2. Having a SQL string "SELECT ...", how to transform it into Python code that calls DPEngine.aggregate?