Replacing pandas-profiling (deprecated) with ydata-profiling
The package formerly known as pandas-profiling is now part of a larger
project and is moving away from the idea that it is intended to be
used only with dataframes.

The name of the package has changed, and the last version of
`pandas-profiling` was released more than a year ago.

The GitHub workflow for profiling new datasets is not working
as it should, due to deprecated dependencies.
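
For code that must run in environments where only one of the two package names is installed, the rename can be absorbed with a guarded import. The following is a minimal sketch and not part of this commit: `load_profile_report` is a hypothetical helper, and it assumes both packages expose a top-level `ProfileReport` class, as the import lines changed in `pmlb/profiling.py` suggest.

```python
import importlib


def load_profile_report(candidates=("ydata_profiling", "pandas_profiling")):
    """Return (module_name, ProfileReport) from the first importable candidate.

    Hypothetical helper: returns (None, None) if no candidate can be imported.
    Only ImportError is handled; an old package that is installed but fails
    for another reason will still raise.
    """
    for name in candidates:
        try:
            module = importlib.import_module(name)
        except ImportError:
            continue  # package not installed, try the next name
        report_cls = getattr(module, "ProfileReport", None)
        if report_cls is not None:
            return name, report_cls
    return None, None
```

Listing the new name first means the maintained package is preferred whenever both are present.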
gAldeia committed Sep 10, 2024
1 parent 872ec0a commit 40b1918
Showing 4 changed files with 7 additions and 7 deletions.
4 changes: 2 additions & 2 deletions docs_sources/index.Rmd
@@ -12,7 +12,7 @@ These datasets cover a broad range of applications including binary/multi-class
In the interactive [plotly](https://plotly.com/) chart below, each dot represents a dataset colored based on its associated task (classification vs. regression).
In log scale, the *x* and *y* axis shows the number of observations and features respectively.
Please click on the legend to hide/show the groups of datasets.
-Click on each dot to access the dataset's [pandas-profiling](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/) report.
+Click on each dot to access the dataset's [ydata-profiling](https://docs.profiling.ydata.ai/latest/) report.

*Note*: If a dataset has more than 20 features, we randomly chose 20 to be displayed in its profiling report. Therefore, please disregard the `Number of variables` in the corresponding report and, instead, use the correct `n_features` in the chart and table below.

@@ -84,7 +84,7 @@ ply

Browse, sort, filter and search the complete table of summary statistics below.

-* Click on the dataset's name to access its [pandas-profiling](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/) report.
+* Click on the dataset's name to access its [ydata-profiling](https://docs.profiling.ydata.ai/latest/) report.

* Click on the GitHub Octocat <i class="fab fa-github"></i> to access its metadata.

6 changes: 3 additions & 3 deletions paper/paper.md
@@ -122,10 +122,10 @@ API reference guides that detail all user-facing functions and variables in PMLB

## Pandas profiling reports

-For each dataset, we use [`pandas-profiling`](https://pandas-profiling.github.io/pandas-profiling/) to generate summary statistic reports.
-In addition to the descriptive statistics provided by the commonly-used `pandas.describe` (Python) [@McKinney2010] or `skimr::skim` (R) functions, `pandas-profiling` gives a more extensive exploration of the dataset, including correlation structure within the dataset and flagging of duplicate samples.
+For each dataset, we use [`ydata-profiling`](https://docs.profiling.ydata.ai/latest/) to generate summary statistic reports.
+In addition to the descriptive statistics provided by the commonly-used `pandas.describe` (Python) [@McKinney2010] or `skimr::skim` (R) functions, `ydata-profiling` gives a more extensive exploration of the dataset, including correlation structure within the dataset and flagging of duplicate samples.
Browsing a report allows users and contributors to easily assess dataset quality and make any necessary changes.
-For example, if a feature is flagged by `pandas-profiling` as having a single value replicated in all samples, it is likely that this feature is uninformative for ML analysis and should be removed from the dataset.
+For example, if a feature is flagged by `ydata-profiling` as having a single value replicated in all samples, it is likely that this feature is uninformative for ML analysis and should be removed from the dataset.

The profiling reports can be accessed by clicking on the dataset name in the interactive data table or the data point in the interactive chart on the PMLB website.
Alternatively, all reports can be viewed on the repository's [gh-pages](https://github.com/EpistasisLab/pmlb/tree/gh-pages/profile) branch, or generated manually by users on their local computing resources.
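
The paper text above notes that reports can be generated locally, and the website note says that datasets with more than 20 features have 20 chosen at random for the report. A rough sketch of how that could look is given here; it is not the code in `pmlb/profiling.py`. The helper names `sample_features` and `write_report`, the seed handling, and the report title are all assumptions made for illustration. Only `ProfileReport(df, title=...)` and `report.to_file(...)` come from the `ydata-profiling` package itself.

```python
import random


def sample_features(columns, max_features=20, seed=None):
    """Pick at most `max_features` column names at random, keeping their original order."""
    columns = list(columns)
    if len(columns) <= max_features:
        return columns
    rng = random.Random(seed)
    # Sample positions rather than names so the original column order is preserved.
    keep = sorted(rng.sample(range(len(columns)), max_features))
    return [columns[i] for i in keep]


def write_report(df, output_path, max_features=20, seed=0):
    """Write an HTML profiling report for a pandas DataFrame (hypothetical helper)."""
    # Imported lazily so sample_features stays usable without ydata-profiling installed.
    from ydata_profiling import ProfileReport

    selected = sample_features(df.columns, max_features=max_features, seed=seed)
    report = ProfileReport(df[selected], title="Dataset profiling report")
    report.to_file(output_path)
```

With `ydata-profiling` installed, calling `write_report(df, "report.html")` on a DataFrame `df` would write a single HTML file. Whether the target column should always be kept in the sample is left open in this sketch.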
2 changes: 1 addition & 1 deletion pmlb/profiling.py
@@ -3,7 +3,7 @@
import subprocess

import pandas as pd
-from pandas_profiling import ProfileReport
+from ydata_profiling import ProfileReport

from .pmlb import (
fetch_data, get_updated_datasets, last_commit_message
2 changes: 1 addition & 1 deletion setup.py
@@ -41,7 +41,7 @@ def calculate_version():
],
extras_require={
'dev': ['nose', 'numpy', 'scipy', 'tabulate', 'parameterized',
-'matplotlib', 'seaborn', 'pandas-profiling'],
+'matplotlib', 'seaborn', 'ydata-profiling'],
},
classifiers=[
'Intended Audience :: Developers',