Code that reproduces the bug
from datetime import datetime

import pandas as pd
from ydata_profiling import ProfileReport
from ydata_profiling.config import Settings


def create_profile_report(self,
                          dataset_to_analyze: pd.DataFrame,
                          report_name: str,
                          dataset_description_url: str) -> ProfileReport:
    """
    Creates a profile report for a given dataset.

    Args:
        dataset_to_analyze (pd.DataFrame): The dataset to analyze and generate a profile report for.
        report_name (str): The name of the report.
        dataset_description_url (str): The URL of the dataset description.

    Returns:
        ProfileReport: The generated profile report.
    """
    # Perform data quality operations and generate a profile report
    # ...
    # variables preferred characterization settings
    variables_settings = {
        "num": {"low_categorical_threshold": 5, "chi_squared_threshold": 0.999},
        "cat": {"length": True, "characters": False, "words": False,
                "cardinality_threshold": 50, "imbalance_threshold": 0.5,
                "n_obs": 5, "chi_squared_threshold": 0.999},
        "bool": {"n_obs": 3, "imbalance_threshold": 0.5}
    }
    missing_diagrams_settings = {
        "heatmap": False,
        "matrix": False,
        "bar": False
    }
    # Plot rendering option, way how to pass arguments to the underlying matplotlib visualization engine
    plot_rendering_settings = {
        # "histogram": {"x_axis_labels": True, "bins": 5, "max_bins": 10},
        "dpi": 200,
        "image_format": "png",
        "missing": {"cmap": "RdBu_r", "force_labels": True},
        "pie": {"max_unique": 10, "colors": ["gold", "b", "#FF796C"]},
        "correlation": {"cmap": "RdBu_r", "bad": "#000000"}
    }
    # Correlation matrices through description_set
    correlations_settings = {
        "auto": {"calculate": True, "warn_high_correlations": True, "threshold": 0.9},
        "pearson": {"calculate": False, "warn_high_correlations": False, "threshold": 0.9},
        "spearman": {"calculate": False, "warn_high_correlations": False, "threshold": 0.9},
        "kendall": {"calculate": False, "warn_high_correlations": False, "threshold": 0.9},
        "phi_k": {"calculate": False, "warn_high_correlations": True, "threshold": 0.9},
        "cramers": {"calculate": False, "warn_high_correlations": False, "threshold": 0.9},
    }
    interactions_settings = {
        "continuous": False,
        "targets": []
    }
    # Customizing the report's theme
    html_report_styling = {
        "style": {
            "theme": "flatly",
            "full_width": True,
            # A list rather than a set, so that the colour order is preserved
            "primary_colors": ["#66cc00", "#ff9933", "#ff0099"]
        }
    }
    current_datetime = datetime.now()
    current_date = current_datetime.date()
    current_year = current_date.strftime("%Y")
    # compute amount of data used for profiling
    # NOTE: this value is a fraction between 0 and 1, although the description below labels it "%"
    samples_percent_size = (
        (min(len(dataset_to_analyze.columns.tolist()), 20) * min(dataset_to_analyze.shape[0], 100000))
        / (len(dataset_to_analyze.columns.tolist()) * dataset_to_analyze.shape[0])
    )
    samples = {
        "head": 0,
        "tail": 0,
        "random": 0
    }
    dataset_description = {
        "description": f"This profiling report was generated using a sample of {samples_percent_size}% of the filtered original dataset.",
        "copyright_holder": "SLS Data platform",
        "copyright_year": current_year,
        "url": dataset_description_url
    }
    # Identify time series variables if any
    # Enable tsmode to True to automatically identify time-series variables
    # and provide the column name that provides the chronological order of your time-series
    # time_series_type_schema = {}
    time_series_mode = False
    # time_series_sortby = None
    # for column_name in dataset_to_analyze.columns.tolist():
    #     if any(keyword in column_name.lower() for keyword in ["date", "timestamp"]):
    #         self.logger.info("candidate column_name as timeseries %s", column_name)
    #         time_series_type_schema[column_name] = "timeseries"
    # if len(time_series_type_schema) > 0:
    #     time_series_mode = True
    #     time_series_sortby = "Date Local"
    # is_run_minimal_mode = self.determine_run_minimal_mode(dataset_to_analyze.columns.tolist(), dataset_to_analyze.shape[0])

    # Convert the Pandas DataFrame to a Spark DataFrame
    # Configure pandas-profiling to handle Spark DataFrames
    # while preserving the categorical encoding
    # Enable Arrow-based columnar data transfers
    self.spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
    pd.DataFrame.iteritems = pd.DataFrame.items
    # psdf = ps.from_pandas(dataset_to_analyze)
    # data_to_analyze = psdf.to_spark()
    data_to_analyze = self.spark.createDataFrame(dataset_to_analyze)
    ydata_profiling_instance_config = Settings()
    ydata_profiling_instance_config.infer_dtypes = True
    # ydata_profiling_instance_config.Config.set_option("profilers", {"Spark": {"verbose": True}})
    return ProfileReport(
        # dataset_to_analyze,
        data_to_analyze,
        title=report_name,
        dataset=dataset_description,
        sort=None,
        progress_bar=False,
        vars=variables_settings,
        explorative=True,
        plot=plot_rendering_settings,
        correlations=correlations_settings,
        missing_diagrams=missing_diagrams_settings,
        samples=samples,
        # correlations=None,
        interactions=interactions_settings,
        html=html_report_styling,
        # minimal=is_run_minimal_mode,
        minimal=True,
        tsmode=time_series_mode,
        # tsmode=False,
        # sortby=time_series_sortby,
        # type_schema=time_series_type_schema
    )

def is_categorical_column(self, df, column_name, n_unique_threshold=20, ratio_unique_values=0.05, exclude_patterns=[]):
    """
    Determines whether a column in a pandas DataFrame is categorical.

    Args:
        df (pandas.DataFrame): The DataFrame to check.
        column_name (str): The name of the column to check.
        n_unique_threshold (int): The threshold for the number of unique values.
        ratio_unique_values (float): The threshold for the ratio of unique values to total values.
        exclude_patterns (list): A list of patterns to exclude from consideration.

    Returns:
        bool: True if the column is categorical, False otherwise.
    """
    if df[column_name].dtype in [object, str]:
        # Check if the column name matches any of the exclusion patterns
        if any(pattern in column_name for pattern in exclude_patterns):
            return False
        # Check if the number of unique values is less than a threshold
        if df[column_name].nunique() < n_unique_threshold:
            return True
        # Check if the ratio of unique values to total values is less than a threshold
        if 1. * df[column_name].nunique() / df[column_name].count() < ratio_unique_values:
            return True
    # Check if any of the other conditions are true
    return False


def get_categorical_columns(self, df, n_unique_threshold=10, ratio_threshold=0.05, exclude_patterns=[]):
    """
    Determines which columns in a pandas DataFrame are categorical.

    Args:
        df (pandas.DataFrame): The DataFrame to check.
        n_unique_threshold (int): The threshold for the number of unique values.
        ratio_threshold (float): The threshold for the ratio of unique values to total values.
        exclude_patterns (list): A list of patterns to exclude from consideration.

    Returns:
        list: A list of the names of the categorical columns.
    """
    categorical_cols = []
    for column_name in df.columns:
        if self.is_categorical_column(df, column_name, n_unique_threshold, ratio_threshold, exclude_patterns):
            categorical_cols.append(column_name)
    return categorical_cols


# Call site (fragment from another method of the same class)
profile = self.create_profile_report(dataset_to_analyze=data_to_analyze,
                                     report_name=report_name,
                                     dataset_description_url=description_url)
return profile
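The description string in `create_profile_report` labels `samples_percent_size` as a percentage, while the expression that computes it yields a fraction between 0 and 1. A small sketch of the same arithmetic as a standalone function that returns a true percentage; `sampled_percent` and its parameter names are hypothetical, and the caps of 20 columns and 100000 rows are taken from the code above:

```python
def sampled_percent(n_rows, n_cols, max_rows=100_000, max_cols=20):
    """Return the share of cells covered by the capped sample, in percent."""
    total_cells = n_rows * n_cols
    if total_cells == 0:
        # Avoid dividing by zero on an empty dataset.
        return 0.0
    sampled_cells = min(n_rows, max_rows) * min(n_cols, max_cols)
    return 100.0 * sampled_cells / total_cells
```

With the 26-row, 20-column sample from this report, both caps are inactive and the function returns 100.0.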
Current Behaviour
The generated report does not contain the correlations.
It used to work with the previous release of ydata-profiling.
/home/spark/.local/lib/python3.10/site-packages/ydata_profiling/model/correlations.py:66: UserWarning: There was an attempt to calculate the auto correlation, but this failed.
To hide this warning, disable the calculation
(using
df.profile_report(correlations={"auto": {"calculate": False}})
If this is problematic for your use case, please report this as an issue:
https://github.com/ydataai/ydata-profiling/issues
(include the error message: '('compute: 0 methods found', (<class 'ydata_profiling.config.Settings'>, <class 'pyspark.sql.dataframe.DataFrame'>, <class 'dict'>), [])')
warnings.warn(
master_stock_data_exploratory_report (1).json
Use case: computing correlations among the columns of large datasets.
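The error text `compute: 0 methods found`, followed by a tuple of argument types, reads like a type-dispatch failure: no `compute` implementation appears to be registered for the combination `(Settings, pyspark.sql.dataframe.DataFrame, dict)`. The sketch below illustrates that failure mode with the standard library's `functools.singledispatch` as a stand-in. It is not ydata-profiling's actual code, and `PandasFrame`, `SparkFrame`, `compute_auto`, and `safe_compute` are hypothetical names.

```python
from functools import singledispatch


class PandasFrame:
    """Hypothetical stand-in for pandas.DataFrame."""


class SparkFrame:
    """Hypothetical stand-in for pyspark.sql.DataFrame."""


@singledispatch
def compute_auto(df):
    # Fallback used when no implementation is registered for type(df).
    raise TypeError(f"compute: 0 methods found for {type(df).__name__}")


@compute_auto.register
def _(df: PandasFrame):
    # In this sketch only the pandas-like backend has an implementation.
    return "auto correlation matrix"


def safe_compute(df):
    """Turn the dispatch failure into a warning and return None."""
    try:
        return compute_auto(df)
    except TypeError as exc:
        print(f"warning: auto correlation failed ({exc})")
        return None
```

Under this reading, the report is still generated for the Spark DataFrame, but the correlation section is empty because the only requested correlation (`auto`) has no matching implementation.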
Expected Behaviour
The report should contain the correlations.
Data Description
correlations_settings = {
"auto": {"calculate": True, "warn_high_correlations": True, "threshold": 0.9},
...
}
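One possible workaround, following the hint in the warning itself, is to disable `auto` when the input is a Spark DataFrame and request specific correlation types instead. The helper below is only a sketch: `build_correlations_settings` is a hypothetical name, and the assumption that the Spark backend can compute `pearson` and `spearman` should be checked against the installed ydata-profiling release.

```python
def build_correlations_settings(is_spark, threshold=0.9):
    """Build a `correlations` dict that avoids `auto` for Spark input."""
    names = ["auto", "pearson", "spearman", "kendall", "phi_k", "cramers"]
    # Assumption: pearson and spearman are available on the Spark backend.
    enabled = {"pearson", "spearman"} if is_spark else {"auto"}
    return {
        name: {
            "calculate": name in enabled,
            # Only warn about high correlations for matrices that are computed.
            "warn_high_correlations": name in enabled,
            "threshold": threshold,
        }
        for name in names
    }
```

The result would be passed as `correlations=build_correlations_settings(is_spark=True)` in place of `correlations_settings` in the `ProfileReport(...)` call.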
My datasets are private, but I can provide an anonymized sample:
record_timestamp,plant,material_part_number,storage_location,unrestricted_use_stock,stock_in_transfer,stock_in_quality_inspection,all_restricted_stock,blocked_stock,block_stock_returns,stock_in_transit,stock_in_transfer_plant_to_plant,stock_at_vendor,valuated_stock_quantities,non_valuated_stock_quantities,stock_value,valuation_class,material_type,gl_account,account_description
2022-09-25,P006,79-2997197-11,,3.1111111640930176,null,null,1.1111111640930176,null,null,null,null,null,4.222222328186035,null,2442.855712890625,null,null,null,null
2022-09-25,P006,79-2997102-11,,0.1111111119389534,null,null,1.1111111640930176,null,null,null,null,null,1.2222222089767456,null,37961.89453125,null,null,null,null
2022-09-25,P006,72-2997190-11,,1.1111111640930176,null,null,1.1111111640930176,null,null,null,null,null,2.222222328186035,null,21672.736328125,null,null,null,null
2022-09-25,P006,72-2997192-11,,1.1111111640930176,null,null,1.1111111640930176,null,null,null,null,null,2.222222328186035,null,20033.513671875,null,null,null,null
2022-09-25,P006,72-2997197-11,,9.11111068725586,null,null,1.1111111640930176,null,null,null,null,null,10.222222328186035,null,5912.4423828125,null,null,null,null
2022-09-25,P006,72-2997102-11,,3.1111111640930176,null,null,1.1111111640930176,null,null,null,null,null,4.222222328186035,null,22345.1875,null,null,null,null
2022-09-25,P006,72-2997102-11252,,9.11111068725586,null,null,1.1111111640930176,null,null,null,null,null,10.222222328186035,null,325588.53125,null,null,null,null
2022-09-25,P006,72-2997132-11,,7.111111164093018,null,null,1.1111111640930176,null,null,null,null,null,8.222222328186035,null,76378.453125,null,null,null,null
2022-09-25,P006,72-2997138-11,,7.111111164093018,null,null,1.1111111640930176,null,null,null,null,null,8.222222328186035,null,78339.78125,null,null,null,null
2022-09-25,P006,82-2997112-19,,1.1111111640930176,null,null,3.1111111640930176,null,null,null,null,null,4.222222328186035,null,157067.140625,null,null,null,null
2022-09-25,P006,82-2997112-12,,1.1111111640930176,null,null,1.1111111640930176,null,null,null,null,null,2.222222328186035,null,50453.15625,null,null,null,null
2022-09-25,P006,82-2997139-11,,9.11111068725586,null,null,1.1111111640930176,null,null,null,null,null,10.222222328186035,null,9203917,null,null,null,null
2022-09-25,P006,82-2997139-19,,0.1111111119389534,null,null,1.1111111640930176,null,null,null,null,null,1.2222222089767456,null,1100468.375,null,null,null,null
2022-09-25,P006,855112191,,87.11111450195312,null,null,1.1111111640930176,null,null,null,null,null,88.22222137451172,null,2399.089599609375,null,null,null,null
2022-09-25,P006,mj92119-9-r,,1.1111111640930176,null,null,1.1111111640930176,null,null,null,null,null,2.222222328186035,null,219.49136352539062,null,null,null,null
2022-09-25,P006,mj92119-9h,,977.111083984375,null,null,1.1111111640930176,null,null,null,null,null,978.2222290039062,null,74072.5234375,null,null,null,null
2022-09-25,P006,mj92119-3-o,,1.1111111640930176,null,null,1.1111111640930176,null,null,null,null,null,2.222222328186035,null,2.4691357612609863,null,null,null,null
2022-09-25,P006,mj92119-3-r,,927.111083984375,null,null,1.1111111640930176,null,null,null,null,null,928.2222290039062,null,27188.12109375,null,null,null,null
2022-09-25,P006,uj92119-3,,1.1111111640930176,null,null,1.1111111640930176,null,null,null,null,null,2.222222328186035,null,5.713580131530762,null,null,null,null
2022-09-25,P006,uj92119-3r,,1.1111111640930176,null,null,1.1111111640930176,null,null,null,null,null,2.222222328186035,null,6.3358025550842285,null,null,null,null
2022-09-25,P006,r002719,,302.1111145019531,null,null,1.1111111640930176,null,null,null,null,null,303.22222900390625,null,785.6824951171875,null,null,null,null
2022-09-25,P006,r002712,,72.11111450195312,null,null,1.1111111640930176,null,null,null,null,null,73.22222137451172,null,12.655077934265137,null,null,null,null
2022-09-25,P006,j932rm2111,,1.1111111640930176,null,null,1.1111111640930176,null,null,null,null,null,2.222222328186035,null,2.4691357612609863,null,null,null,null
2022-09-25,P006,222l0112-9,,1.1111111640930176,null,null,93.11111450195312,null,null,null,null,null,94.22222137451172,null,733587,null,null,null,null
2022-09-25,P006,222l0112-9u,,2.1111111640930176,null,null,1.1111111640930176,null,null,null,null,null,3.222222328186035,null,71704.1796875,null,null,null,null
2022-09-25,P006,222l0112-9uom9,,1.1111111640930176,null,null,1.1111111640930176,null,null,null,null,null,2.222222328186035,null,4913.580078125,null,null,null,null
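In the sample above, missing values appear both as the literal string `null` and as empty fields, so most columns carry no numeric data at all. A minimal sketch, using only the standard library, that lists the columns that actually hold numbers and could therefore take part in a correlation matrix. The embedded text is a two-row excerpt of the sample with a reduced set of columns, and `numeric_columns` is a hypothetical helper name:

```python
import csv
import io

# Excerpt of the anonymized sample (first two rows, subset of columns).
SAMPLE = """record_timestamp,plant,material_part_number,storage_location,unrestricted_use_stock,stock_in_transfer,all_restricted_stock,valuated_stock_quantities,stock_value
2022-09-25,P006,79-2997197-11,,3.1111111640930176,null,1.1111111640930176,4.222222328186035,2442.855712890625
2022-09-25,P006,79-2997102-11,,0.1111111119389534,null,1.1111111640930176,1.2222222089767456,37961.89453125
"""

# Both spellings of "no value" that occur in the sample.
MISSING = {"", "null"}


def _is_number(text):
    try:
        float(text)
    except ValueError:
        return False
    return True


def numeric_columns(csv_text):
    """Return the columns whose non-missing values all parse as floats."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    result = []
    for name in rows[0].keys():
        values = [row[name] for row in rows if row[name] not in MISSING]
        # A column with no values at all is skipped, not treated as numeric.
        if values and all(_is_number(v) for v in values):
            result.append(name)
    return result


print(numeric_columns(SAMPLE))
```

When the same file is loaded with pandas, passing `na_values=["null"]` to `pd.read_csv` should give the equivalent treatment of the `null` strings.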
pandas-profiling version
v.4.6.3
Dependencies
OS
linux