Commit

Update README file to indicate new features and available options
romainx committed Jan 6, 2018
1 parent 6a4a810 commit 65d6db2
Showing 1 changed file with 16 additions and 5 deletions.
21 changes: 16 additions & 5 deletions README.md
@@ -1,6 +1,6 @@
# pandas-profiling

Generates profile reports from a pandas DataFrame. The *df.describe()* function is great but a little basic for serious exploratory data analysis.
Generates profile reports from a pandas `DataFrame`. The pandas `df.describe()` function is great but a little basic for serious exploratory data analysis.

For each column the following statistics - if relevant for the column type - are presented in an interactive HTML report:

@@ -9,6 +9,7 @@ For each column the following statistics - if relevant for the column type - are
* **Descriptive statistics** like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
* **Most frequent values**
* **Histogram**
* **Correlations**: highlighting of highly correlated variables, Spearman and Pearson matrices

## Demo

@@ -60,25 +61,35 @@ To retrieve the list of variables which are rejected due to high correlation:
    profile = pandas_profiling.ProfileReport(df)
    rejected_variables = profile.get_rejected_variables(threshold=0.9)
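
To make the idea of rejecting highly correlated variables concrete, here is a minimal, dependency-free sketch. It is not pandas-profiling's implementation: the helper names `pearson` and `rejected_by_correlation` are hypothetical, and the rule used (reject a column when its absolute Pearson correlation with an already kept column exceeds the threshold) is an assumption made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant column has no defined correlation; treat as uncorrelated
    return cov / (norm_x * norm_y)


def rejected_by_correlation(columns, threshold=0.9):
    """Return names of columns that correlate strongly with an earlier, kept column.

    Assumed rule for illustration only: columns are visited in order, and a
    column is rejected when |r| with any kept column exceeds the threshold.
    """
    kept, rejected = [], []
    for name, values in columns.items():
        if any(abs(pearson(values, columns[k])) > threshold for k in kept):
            rejected.append(name)
        else:
            kept.append(name)
    return rejected


# Hypothetical data: height in inches is a near-exact rescaling of height in cm.
columns = {
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "height_in": [59.1, 63.0, 66.9, 70.9],
    "age": [40.0, 10.0, 30.0, 20.0],
}
print(rejected_by_correlation(columns, threshold=0.9))  # prints ['height_in']
```

With the hypothetical data above, `height_in` is rejected because it duplicates the information in `height_cm`, while `age` is kept.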

If you want to generate an HTML report file, save the ProfileReport to an object and use the *to_file()* function:
If you want to generate an HTML report file, save the `ProfileReport` to an object and use the `to_file()` function:

    profile = pandas_profiling.ProfileReport(df)
    profile.to_file(outputfile="/tmp/myoutputfile.html")

### Python

For standard formatted CSV files that can be read immediately by pandas, you can use the **profile_csv.py** script. Run
For standard formatted CSV files that can be read immediately by pandas, you can use the `profile_csv.py` script. Run

    python profile_csv.py -h

for information about options and arguments.

### Advanced usage

A set of options is available to adapt the generated report.

* `bins` (`int`): Number of bins in histogram (10 by default).
* Correlation settings:
    * `check_correlation` (`boolean`): Whether or not to check correlation (`True` by default).
    * `correlation_threshold` (`float`): Threshold to determine if the variable pair is correlated (0.9 by default).
    * `correlation_overrides` (`list`): Variable names not to be rejected because they are correlated (`None` by default).
    * `check_recoded` (`boolean`): Whether or not to check recoded correlation (`False` by default). Since it's an expensive computation it can be activated for small datasets.
* `pool_size` (`int`): Number of workers in thread pool. The default is equal to the number of CPUs.
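
As a rough illustration of how the options above might be combined, the sketch below collects them in a dictionary and forwards them as keyword arguments to `ProfileReport`. That the options are passed this way is an assumption (this section does not say so explicitly), and `report_options`, `make_profile` and the `price` column are hypothetical names.

```python
import multiprocessing

# Option names and defaults are taken from the list above.
# The non-default values are arbitrary choices for illustration.
report_options = {
    "bins": 20,                                # default: 10
    "check_correlation": True,                 # default: True
    "correlation_threshold": 0.8,              # default: 0.9
    "correlation_overrides": ["price"],        # hypothetical column name; default: None
    "check_recoded": False,                    # default: False
    "pool_size": multiprocessing.cpu_count(),  # default: number of CPUs
}


def make_profile(df, options=report_options):
    """Build a report for df, assuming ProfileReport accepts these keyword arguments."""
    import pandas_profiling  # imported lazily so the sketch loads without the package

    return pandas_profiling.ProfileReport(df, **options)
```

Calling `make_profile(df)` with a pandas `DataFrame` would then produce a report using these settings, provided the keyword-argument assumption holds for the installed version.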

## Dependencies

* **An internet connection.** Pandas-profiling requires an internet connection to download the Bootstrap and jQuery libraries. I might change this in the future; let me know if you want that sooner rather than later.
* python (>= 2.7)
* pandas (>=0.19)
* matplotlib (>=1.4)
* six (>=1.9)

