Replacing pandas-profiling (deprecated) with ydata-profiling

The package previously known as `pandas-profiling` is now part of a larger project and is moving away from the assumption that it is used only with pandas DataFrames. The package has been renamed to `ydata-profiling`, and the last version of `pandas-profiling` was released more than a year ago. The GitHub workflow for profiling new datasets is not working as it should because of deprecated dependencies.
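
For environments that may still have only the old package installed, the rename can be checked before importing. The following is a minimal sketch, not part of this commit: the helper name `find_profiling_module` is hypothetical, and it only assumes that the new package is importable as `ydata_profiling` and the old one as `pandas_profiling`, as the import change in `pmlb/profiling.py` indicates.

```python
import importlib.util


def find_profiling_module(candidates=("ydata_profiling", "pandas_profiling")):
    """Return the first importable module name from `candidates`, or None.

    Hypothetical helper: the new package name is listed first so it is
    preferred, with the deprecated name kept only as a fallback.
    """
    for name in candidates:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is not None:
            return name
    return None


if __name__ == "__main__":
    module_name = find_profiling_module()
    if module_name is None:
        print("No profiling package found; try: pip install ydata-profiling")
    else:
        print(f"Profiling reports can be generated with: {module_name}")
```

The returned name can then be passed to `importlib.import_module` to obtain `ProfileReport`. In this repository the fallback should not be needed, because the commit switches the import and the `dev` extra to `ydata-profiling` outright.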
EpistasisLab · Sep 10, 2024 · 40b1918
1 parent 872ec0a
commit 40b1918

Showing 4 changed files with 7 additions and 7 deletions.
diff --git a/docs_sources/index.Rmd b/docs_sources/index.Rmd
@@ -12,7 +12,7 @@ These datasets cover a broad range of applications including binary/multi-class
 In the interactive  [plotly](https://plotly.com/) chart below, each dot represents a dataset colored based on its associated task (classification vs. regression).
 In log scale, the *x* and *y* axis shows the number of observations and features respectively.
 Please click on the legend to hide/show the groups of datasets.
-Click on each dot to access the dataset's [pandas-profiling](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/) report.
+Click on each dot to access the dataset's [ydata-profiling](https://docs.profiling.ydata.ai/latest/) report.
 
 *Note*: If a dataset has more than 20 features, we randomly chose 20 to be displayed in its profiling report. Therefore, please disregard the `Number of variables` in the corresponding report and, instead, use the correct `n_features` in the chart and table below.
 
@@ -84,7 +84,7 @@ ply
 
 Browse, sort, filter and search the complete table of summary statistics below.
 
-* Click on the dataset's name to access its [pandas-profiling](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/) report.
+* Click on the dataset's name to access its [ydata-profiling](https://docs.profiling.ydata.ai/latest/) report.
 
 * Click on the GitHub Octocat <i class="fab fa-github"></i> to access its metadata.
 

diff --git a/paper/paper.md b/paper/paper.md
@@ -122,10 +122,10 @@ API reference guides that detail all user-facing functions and variables in PMLB
 
 ## Pandas profiling reports 
 
-For each dataset, we use [`pandas-profiling`](https://pandas-profiling.github.io/pandas-profiling/) to generate summary statistic reports.
-In addition to the descriptive statistics provided by the commonly-used `pandas.describe` (Python) [@McKinney2010] or `skimr::skim` (R) functions, `pandas-profiling` gives a more extensive exploration of the dataset, including correlation structure within the dataset and flagging of duplicate samples.
+For each dataset, we use [`ydata-profiling`](https://docs.profiling.ydata.ai/latest/) to generate summary statistic reports.
+In addition to the descriptive statistics provided by the commonly-used `pandas.describe` (Python) [@McKinney2010] or `skimr::skim` (R) functions, `ydata-profiling` gives a more extensive exploration of the dataset, including correlation structure within the dataset and flagging of duplicate samples.
 Browsing a report allows users and contributors to easily assess dataset quality and make any necessary changes.
-For example, if a feature is flagged by `pandas-profiling` as having a single value replicated in all samples, it is likely that this feature is uninformative for ML analysis and should be removed from the dataset.
+For example, if a feature is flagged by `ydata-profiling` as having a single value replicated in all samples, it is likely that this feature is uninformative for ML analysis and should be removed from the dataset.
 
 The profiling reports can be accessed by clicking on the dataset name in the interactive data table or the data point in the interactive chart on the PMLB website.
 Alternatively, all reports can be viewed on the repository's [gh-pages](https://github.com/EpistasisLab/pmlb/tree/gh-pages/profile) branch, or generated manually by users on their local computing resources.

diff --git a/pmlb/profiling.py b/pmlb/profiling.py
@@ -3,7 +3,7 @@
 import subprocess
 
 import pandas as pd
-from pandas_profiling import ProfileReport
+from ydata_profiling import ProfileReport
 
 from .pmlb import (
     fetch_data, get_updated_datasets, last_commit_message

diff --git a/setup.py b/setup.py
@@ -41,7 +41,7 @@ def calculate_version():
                     ],
     extras_require={
         'dev': ['nose', 'numpy', 'scipy', 'tabulate', 'parameterized',
-        'matplotlib', 'seaborn', 'pandas-profiling'],
+        'matplotlib', 'seaborn', 'ydata-profiling'],
     },
     classifiers=[
         'Intended Audience :: Developers',