
Spark Development Plan [OUTDATED]


To find out how to contribute, please join the Slack and drop Edwin Chan a message, or join the #feat-spark-profiling channel!

Rough Timeline (and this is outdated)

  • Oct 2021 - build MVP features: config injection, describe Categorical/Timestamp
  • Oct/Nov 2021 - testing and optimisation
  • Early Dec 2021 - release!

New Timeline

  • Beta release ASAP (Q1 2022)

Objective

Enable pandas-profiling to use a Spark backend in order to profile Spark DataFrames.

General Constraints

  • Use native Spark operations - As much as possible, use spark.sql and native Spark DataFrame functions. Do not bring the data to the driver: if we could do that, we could simply call .toPandas() and apply the usual profile report.
  • Do not overload the user's Spark server - Assume that we can only perform read operations on the Spark DataFrame, never modify/write operations, so that we do not inadvertently crash the user's Spark server. A minimal sketch of this "stay in Spark" style of computation follows the list.
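
Below is a minimal sketch of what staying in Spark looks like in practice: the aggregation runs on the cluster and only a single summary row is returned to the driver. The function name and column handling are illustrative, not the actual pandas-profiling implementation.

```python
# Sketch of the "stay in Spark" constraint: compute summary statistics with
# native DataFrame functions instead of collecting the data with .toPandas().
import pyspark.sql.functions as F
from pyspark.sql import DataFrame


def describe_numeric_column(df: DataFrame, column: str) -> dict:
    """Read-only aggregation; only one small row is returned to the driver."""
    row = df.select(
        F.count(F.col(column)).alias("count"),
        F.mean(F.col(column)).alias("mean"),
        F.stddev(F.col(column)).alias("std"),
        F.min(F.col(column)).alias("min"),
        F.max(F.col(column)).alias("max"),
    ).first()
    return row.asDict()
```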

Broad implementation goal

We will try our best to replace every function that operates on a pandas DataFrame with a Spark function operating on a Spark DataFrame, without interfering with the rest of the result flow. This ensures that we retain as much of the config builder, report builder, and visualisation code as possible.
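
As an illustration of the idea (not the actual library code), the same logical operation can be expressed against either backend while producing the same downstream result shape:

```python
# Illustrative only: swap a pandas operation for its Spark equivalent while
# keeping the result shape that the downstream report code expects.
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql import DataFrame as SparkDataFrame


def value_counts_pandas(series: pd.Series) -> pd.Series:
    # pandas backend: the data is already local.
    return series.value_counts()


def value_counts_spark(df: SparkDataFrame, column: str) -> pd.Series:
    # Spark backend: aggregate on the cluster; only the small counts table
    # is brought back to the driver.
    counts = df.groupBy(column).count().orderBy(F.desc("count")).toPandas()
    return counts.set_index(column)["count"]
```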

Implementation Strategy and Outline

  1. Build Features
  • Configurations
  • Column Level Profiling
  • Table Level Profiling
  2. Optimisations and Testing
  3. Release, bug fixes, reprioritize tasks

Getting Started


Design Overview

Refer to the design diagram.


Implementation Details

Configurations (See Configurations)

We need to properly set Spark default configurations and inject them at profile time (see the sketch after the table below).

| Description / Functions | Progress |
|---|---|
| Config injection | In progress |

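As a hedged sketch (the exact config-injection API was still in progress when this page was written), profiling a Spark DataFrame might look roughly like this. Passing a Spark DataFrame straight to ProfileReport is an assumption about the eventual interface, not a documented API, and the path and config key are illustrative.

```python
# Sketch only: SparkSession setup is standard PySpark; the ProfileReport call
# on a Spark DataFrame is an assumed, not yet final, interface.
from pyspark.sql import SparkSession
from pandas_profiling import ProfileReport

spark = (
    SparkSession.builder.appName("profiling")
    # Example of a Spark default we may want to set before profiling.
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .getOrCreate()
)

df = spark.read.parquet("path/to/data.parquet")  # illustrative path

report = ProfileReport(df, title="Spark profiling report")
report.to_file("report.html")
```
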
Column Level Profiling (See Column Level Profiling)

An explanation on types: pandas-profiling uses the visions typing system, which maps Spark types to more generic type objects (e.g. Spark TimestampType -> visions DateTime). The full list of visions types can be found here. This enables us to perform generic operations using visions types, without worrying about how each is implemented. The table below describes the status of profiling for native Spark types and their respective visions-mapped types; a minimal sketch of such a mapping follows the table.

| Spark type | Visions type | Description Functions | Progress |
|---|---|---|---|
| All types | Unsupported | describe_counts | done |
| All types | Unsupported | describe_generic | done |
| All types | Unsupported | describe_supported | done |
| Spark numeric types | Numeric | describe_numeric_1d | done |
| TimestampType | DateTime | describe_date_1d | In progress |
| DateType | Nil (not in visions) | Nil - not until supported by visions | Nil |
| String types (StringType, VarcharType, CharType) | Categorical | describe_categorical_1d | In progress |
| Nested types (ArrayType, MapType) | Categorical | describe_categorical_1d | In progress |
| BinaryType | Categorical | describe_categorical_1d | In progress |
| BooleanType | Boolean | nil | done |
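
A minimal sketch of the kind of Spark-to-visions mapping described above. The real mapping lives in the profiling typeset; the dictionary and helper below are illustrative and use string labels rather than the actual visions type objects.

```python
# Illustrative mapping from native Spark types to generic (visions-style) types.
import pyspark.sql.types as T

SPARK_TO_GENERIC = {
    T.IntegerType: "Numeric",
    T.LongType: "Numeric",
    T.FloatType: "Numeric",
    T.DoubleType: "Numeric",
    T.TimestampType: "DateTime",
    T.StringType: "Categorical",
    T.BooleanType: "Boolean",
}


def generic_type_of(spark_type: T.DataType) -> str:
    """Fall back to Categorical for string-like, binary, and nested types, per the table above."""
    return SPARK_TO_GENERIC.get(type(spark_type), "Categorical")
```
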
Table Level Profiling (See Table Level Profiling)
| Description / Functions | Progress |
|---|---|
| Correlations - Spearman | done |
| Correlations - Pearson | done |
| Correlations - Phi-K | help wanted! |
| Correlations - Cramer's V | help wanted! |
| Correlations - Kendall's | help wanted! |
| Scatterplot | help wanted! |
| Table stats | done |
| Missing - bar | done |
| Missing - dendrogram | help wanted! |
| Missing - heatmap | help wanted! |
| Missing - matrix | help wanted! |
| Sample | done |
| Duplicates | done |
| Alerts | done |
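
For the correlations marked done, here is a hedged sketch of how Spearman/Pearson can be computed natively in Spark via pyspark.ml.stat.Correlation. The helper name and intermediate column name are illustrative, not the library's actual code.

```python
# Sketch: compute a correlation matrix without collecting the data locally.
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation
from pyspark.sql import DataFrame


def correlation_matrix(df: DataFrame, numeric_columns: list, method: str = "spearman"):
    """Return the Spearman/Pearson correlation matrix as a nested Python list."""
    assembled = VectorAssembler(
        inputCols=numeric_columns,
        outputCol="_corr_features",
        handleInvalid="skip",  # drop rows with nulls in the selected columns
    ).transform(df)
    matrix = Correlation.corr(assembled, "_corr_features", method=method).head()[0]
    return matrix.toArray().tolist()
```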

Specific software patterns

multimethod - backend multi-dispatcher/switcher

The multimethod library is the main dispatcher that lets a function decide whether it should use the pandas or the Spark implementation. This allows us to abstract a function's implementation and underlying engine (Spark functions vs pandas functions) from how it is called (see get_series_descriptions and its respective implementations in pandas and spark).
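
A minimal sketch of the dispatch pattern follows; the real pandas-profiling functions take more parameters, so treat these signatures as illustrative.

```python
# Sketch: multimethod picks the pandas or Spark implementation based on the
# type annotations of the positional arguments.
import pandas as pd
import pyspark.sql.functions as F
from multimethod import multimethod
from pyspark.sql import DataFrame as SparkDataFrame


@multimethod
def describe_counts(series: pd.Series) -> dict:
    # pandas backend: the data is already in local memory.
    return {"count": int(series.count()), "n_missing": int(series.isna().sum())}


@multimethod
def describe_counts(df: SparkDataFrame, column: str) -> dict:
    # Spark backend: push the aggregation to the cluster.
    row = df.select(
        F.count(F.col(column)).alias("count"),
        F.sum(F.col(column).isNull().cast("int")).alias("n_missing"),
    ).first()
    return {"count": row["count"], "n_missing": row["n_missing"]}
```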