From 65d6db26fe4e8fb1948377317f1914dbd4bdf79f Mon Sep 17 00:00:00 2001
From: romainx
Date: Sat, 6 Jan 2018 17:52:13 +0100
Subject: [PATCH] Update README file to indicate new features and available options

---
 README.md | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 39f732752..e215bfddc 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 # pandas-profiling
 
-Generates profile reports from a pandas DataFrame. The *df.describe()* function is great but a little basic for serious exploratory data analysis.
+Generates profile reports from a pandas `DataFrame`. The pandas `df.describe()` function is great but a little basic for serious exploratory data analysis.
 
 For each column the following statistics - if relevant for the column type - are presented in an interactive HTML report:
 
@@ -9,6 +9,7 @@ For each column the following statistics - if relevant for the column type - are
 * **Descriptive statistics** like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
 * **Most frequent values**
 * **Histogram**
+* **Correlations** highlighting of highly correlated variables, Spearman and Pearson matrixes
 
 ## Demo
 
@@ -60,19 +61,31 @@ To retrieve the list of variables which are rejected due to high correlation:
     profile = pandas_profiling.ProfileReport(df)
     rejected_variables = profile.get_rejected_variables(threshold=0.9)
 
-If you want to generate a HTML report file, save the ProfileReport to an object and use the *to_file()* function:
+If you want to generate a HTML report file, save the `ProfileReport` to an object and use the `to_file()` function:
 
     profile = pandas_profiling.ProfileReport(df)
     profile.to_file(outputfile="/tmp/myoutputfile.html")
 
 ### Python
 
-For standard formatted CSV files that can be read immediately by pandas, you can use the **profile_csv.py** script. Run
+For standard formatted CSV files that can be read immediately by pandas, you can use the `profile_csv.py` script. Run
 
     python profile_csv.py -h
 
 for information about options and arguments.
 
+### Advanced usage
+
+A set of options are available in order to adapt the report generated.
+
+* `bins` (`int`): Number of bins in histogram (10 by default).
+* Correlation settings:
+  * `check_correlation` (`boolean`): Whether or not to check correlation (`True` by default)
+  * `correlation_threshold` (`float`): Threshold to determine if the variable pair is correlated (0.9 by default).
+  * `correlation_overrides` (`list`): Variable names not to be rejected because they are correlated (`None` by default).
+  * `check_recoded` (`boolean`): Whether or not to check recoded correlation (`False` by default). Since it's an expensive computation it can be activated for small datasets.
+* `pool_size` (`int`): Number of workers in thread pool. The default is equal to the number of CPU.
+
 ## Dependencies
 
 * **An internet connection.** Pandas-profiling requires an internet connection to download the Bootstrap and JQuery libraries. I might change this in the future, let me know if you want that sooner than later.
@@ -80,5 +93,3 @@ for information about options and arguments.
 * pandas (>=0.19)
 * matplotlib (>=1.4)
 * six (>=1.9)
-
-