Use Modern backend (#96)
* add html2image and playwright backend

* bump lowest version to 3.8

* ensure test output path

* update readme

* remove ChromeController dependency

* increase waiting time in notebook convert
PaleNeutron authored Sep 1, 2023
1 parent 0a4fc5e commit f71441b
Showing 26 changed files with 582 additions and 376 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/python-package.yml
Original file line number Diff line number Diff line change
@@ -17,7 +17,7 @@ jobs:
strategy:
matrix:
os: [ubuntu-latest, windows-latest, macos-latest]
python-version: ["3.7", "3.8", "3.9", "3.10"]
python-version: ["3.8", "3.9", "3.10", "3.11"]
include:
- os: ubuntu-latest
pippath: ~/.cache/pip
@@ -104,7 +104,7 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip wheel
pip install pytest matplotlib selenium jupyter pandoc
pip install pytest matplotlib selenium jupyter pandoc playwright
pip install . --upgrade
- name: mac nbconvert patch fix # this is a tmp fix, related to https://github.com/jupyter/nbconvert/issues/1773
if: ${{ startsWith(matrix.os, 'macos') }}
75 changes: 60 additions & 15 deletions README.md
@@ -2,22 +2,10 @@

[![](https://img.shields.io/pypi/v/dataframe_image)](https://pypi.org/project/dataframe_image)
[![PyPI - License](https://img.shields.io/pypi/l/dataframe_image)](LICENSE)
[![Python Version](https://img.shields.io/pypi/pyversions/dataframe_image)](https://pypi.org/project/dataframe_image)
A package to convert pandas DataFrames to images.

A package to convert Jupyter Notebooks to PDF and/or Markdown embedding pandas DataFrames as images.

## Overview

When converting Jupyter Notebooks to pdf using nbconvert, pandas DataFrames appear as either raw text or as simple LaTeX tables. The left side of the image below shows this representation.

![png](https://github.com/dexplo/dataframe_image/raw/gh-pages/images/dataframe_image_compare.png)

This package was first created to embed DataFrames into pdf and markdown documents as images so that they appear exactly as they do in Jupyter Notebooks, as seen from the right side of the image above. It has since added much more functionality.

## Usage

Upon installation, the option `DataFrame as Image (PDF or Markdown)` will appear in the menu `File -> Download as`. Clicking this option will open up a new browser tab with a short form to be completed.

![png](https://github.com/dexplo/dataframe_image/raw/gh-pages/images/form.png)
It can also convert Jupyter Notebooks to PDF and/or Markdown, embedding DataFrames as images.

### Exporting individual DataFrames

@@ -39,13 +27,70 @@ Here, an example of how exporting a DataFrame would look like in a notebook.

![png](https://github.com/dexplo/dataframe_image/raw/gh-pages/images/dfi_export.png)

### Export Jupyter Notebook

When converting Jupyter Notebooks to pdf using nbconvert, pandas DataFrames appear as either raw text or as simple LaTeX tables. The left side of the image below shows this representation.

![png](https://github.com/dexplo/dataframe_image/raw/gh-pages/images/dataframe_image_compare.png)

This package was first created to embed DataFrames into pdf and markdown documents as images so that they appear exactly as they do in Jupyter Notebooks, as seen from the right side of the image above. It has since added much more functionality.

#### Usage

Upon installation, the option `DataFrame as Image (PDF or Markdown)` will appear in the menu `File -> Download as`. Clicking this option will open up a new browser tab with a short form to be completed.

![png](https://github.com/dexplo/dataframe_image/raw/gh-pages/images/form.png)


## Installation

Install with either:

* `pip install dataframe_image`
* `conda install -c conda-forge dataframe_image`

## Configuration

### table_conversion

When converting a DataFrame to an image, two kinds of backend are available: browser and matplotlib. The default is browser; to switch, set the `table_conversion` parameter to `'matplotlib'`.

The major difference between the two is that the browser backend renders the DataFrame exactly as it appears in the notebook, while the matplotlib backend works without a browser, can export to any image format (e.g. `svg`), and is very fast. Currently, however, matplotlib can only simulate the header and cells, so `set_caption` will not work.

```python
dfi.export(df.style.background_gradient(), "df_style.png", table_conversion="matplotlib")
```

#### Browser backend

There are currently 4 different browser backend libraries: `playwright`, `html2image`, `selenium` and `chrome`. The default is `chrome`.

`chrome` converts the image using your local Chromium-based browser from the command line.

`html2image` is a backup method for `chrome` that uses the `html2image` library.

`playwright` is a much more stable method, but you have to install playwright first.

`selenium` uses the `Firefox` driver. Chrome sometimes introduces breaking changes that break the methods above, in which case `Firefox` is a good backup. It is less stable and hard to install, but it can be installed in Google Colab.
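
Since each browser backend can fail for its own reasons (a Chrome update, a missing driver), one option is to try them in order and keep the first that succeeds. The following is a minimal sketch and not part of dataframe_image: `export_with_fallback` is a hypothetical helper, and it only assumes that the export function accepts a `table_conversion` keyword as described above.

```python
from typing import Any, Callable, Sequence


def export_with_fallback(
    export: Callable[..., Any],
    obj: Any,
    filename: str,
    backends: Sequence[str] = ("chrome", "html2image", "playwright", "selenium"),
    **kwargs: Any,
) -> str:
    """Try each backend in order; return the name of the first one that works.

    `export` is any callable with the shape export(obj, filename,
    table_conversion=..., **kwargs). This helper is hypothetical.
    """
    errors = {}
    for backend in backends:
        try:
            export(obj, filename, table_conversion=backend, **kwargs)
            return backend
        except Exception as exc:  # backends fail in different ways
            errors[backend] = exc
    raise RuntimeError(f"all backends failed: {errors}")
```

With the package installed, this could be called as `export_with_fallback(dfi.export, df, "df.png")`.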

### Other parameters

```python
dfi.export(
    obj: pd.DataFrame,
    filename,
    fontsize=14,
    max_rows=None,
    max_cols=None,
    table_conversion: Literal[
        "chrome", "matplotlib", "html2image", "playwright", "selenium"
    ] = "chrome",
    chrome_path=None,
    dpi=None,  # enlarges the image; default is 100, larger values give a larger image
    use_mathjax=False,  # enable MathJax support, so LaTeX can be used in the DataFrame
)
```

## PDF Conversion - LaTeX vs Chrome Browser

By default, conversion to pdf happens via LaTeX, which you must have pre-installed on your machine. If you do not have the correct LaTeX installation, you'll need to select the Chrome Browser option to make the conversion.
10 changes: 6 additions & 4 deletions dataframe_image/_browser_pdf.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,11 +10,10 @@
from tempfile import TemporaryDirectory, mkstemp

import aiohttp
import ChromeController
from nbconvert import TemplateExporter
from nbconvert.exporters import Exporter, HTMLExporter

from ._screenshot import get_chrome_path
from .converter.browser.chrome_converter import get_chrome_path


async def handler(ws, data, key=None):
@@ -62,7 +61,7 @@ async def main(file_name, p):
frameId = await handler(ws, data, "frameId")

# second - enable page
# await asyncio.sleep(1)
await asyncio.sleep(1)
data = {"id": 2, "method": "Page.enable"}
await handler(ws, data)

@@ -72,14 +71,16 @@ async def main(file_name, p):
await handler(ws, data, "content")

# fourth - get pdf
prev_len = 0
for _ in range(10):
await asyncio.sleep(1)
params = {"displayHeaderFooter": False, "printBackground": True}
data = {"id": 4, "method": "Page.printToPDF", "params": params}
pdf_data = await handler(ws, data, "data")
pdf_data = base64.b64decode(pdf_data)
if len(pdf_data) > 1000:
if len(pdf_data) > 1000 and len(pdf_data) == prev_len:
break
prev_len = len(pdf_data)
else:
raise TimeoutError("Could not get pdf data")
return pdf_data
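
The loop above now accepts the PDF only once two consecutive `Page.printToPDF` results have the same length, instead of returning the first result over 1000 bytes. A minimal standalone sketch of that poll-until-stable pattern follows; `poll_until_stable` and the `fetch` callable (standing in for the DevTools request) are hypothetical names and not part of this commit.

```python
import time
from typing import Callable


def poll_until_stable(
    fetch: Callable[[], bytes],
    min_size: int = 1000,
    attempts: int = 10,
    delay: float = 1.0,
) -> bytes:
    """Call `fetch` repeatedly until its output is large enough and stops growing."""
    prev_len = 0
    for _ in range(attempts):
        time.sleep(delay)
        data = fetch()
        # Accept only when the size exceeds the threshold AND matches the
        # previous poll, i.e. the page has presumably finished rendering.
        if len(data) > min_size and len(data) == prev_len:
            return data
        prev_len = len(data)
    raise TimeoutError("output never stabilized")
```

The trade-off is at least one extra poll (and one extra `delay`) per conversion in exchange for not capturing a partially rendered page.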
@@ -131,6 +132,7 @@ def get_pdf_data(file_name):


def get_pdf_data_chromecontroller(file_name):
import ChromeController
additional_options = get_launch_args()
# ChromeContext will shlex.split binary, so add quote to it
with ChromeController.ChromeContext(
37 changes: 21 additions & 16 deletions dataframe_image/_convert.py
Original file line number Diff line number Diff line change
Expand Up @@ -7,9 +7,9 @@
import tempfile
import time
import urllib.parse
import warnings
from pathlib import Path
from tempfile import TemporaryDirectory
import warnings

import nbformat
from nbconvert import MarkdownExporter, PDFExporter
@@ -25,6 +25,7 @@

_logger = logging.getLogger(__name__)


class Converter:
KINDS = ["pdf", "md"]
DISPLAY_DATA_PRIORITY = [
@@ -190,26 +191,26 @@ def get_resources(self):
if self.table_conversion == "html2image":
pass
elif self.table_conversion == "chrome":
from ._screenshot import Screenshot
from .converter.browser.chrome_converter import ChromeConverter

converter = Screenshot(
converter = ChromeConverter(
center_df=self.center_df,
max_rows=self.max_rows,
max_cols=self.max_cols,
chrome_path=self.chrome_path,
).run
elif self.table_conversion == "selenium":
from .selenium_screenshot import SeleniumScreenshot
from .converter.browser.selenium_converter import SeleniumConverter

converter = SeleniumScreenshot(
converter = SeleniumConverter(
center_df=self.center_df,
max_rows=self.max_rows,
max_cols=self.max_cols,
).run
else:
from ._matplotlib_table import TableMaker
from .converter.matplotlib_table import MatplotlibTableConverter

converter = TableMaker(fontsize=22).run
converter = MatplotlibTableConverter(fontsize=22).run

resources = {
"metadata": {"path": str(self.nb_home), "name": self.document_name},
@@ -295,15 +296,15 @@ def to_pdf_latex(self):
# get long path name of self.td
temp_dir = Path(self.td.name).resolve()
self.resources["temp_dir"] = temp_dir
print("TEMP_DIR", temp_dir) # TODO just for debug
print("TEMP_DIR", temp_dir) # TODO just for debug
MarkdownHTTPPreprocessor().preprocess(self.nb, self.resources)

for filename, image_data in self.resources["image_data_dict"].items():
fn_pieces = filename.split("_")
cell_idx = int(fn_pieces[1])
ext = fn_pieces[-1].split(".")[-1]
new_filename = str(temp_dir / filename)
print(new_filename) # TODO just for debug
new_filename = str(temp_dir / filename)
print(new_filename) # TODO just for debug

# extract first image from gif and use as png for latex pdf
if ext == "gif":
@@ -328,7 +329,9 @@ def to_pdf_latex(self):
try:
pdf_data, self.resources = pdf.from_notebook_node(self.nb, self.resources)
except Exception as ex:
latex, _ = super(PDFExporter, pdf).from_notebook_node(self.nb, self.resources)
latex, _ = super(PDFExporter, pdf).from_notebook_node(
self.nb, self.resources
)
_logger.error(f"nbconvert failed to create PDF via latex \n\n{latex}")
with open("notebook.tex", "w", encoding="utf-8") as f:
f.write(latex)
@@ -374,11 +377,13 @@ def convert(self):
# Step 2: if exporting as pdf with browser, do this first
# as it requires no other preprocessing
if "pdf_browser" in self.to:
warnings.warn("to pdf_browser method is deprecated"
"We suggest using nbconvert, install it using `pip install nbconvert[webpdf]`"
"and then run"
"`jupyter nbconvert --to WebPDF --allow-chromium-download notebook.ipynb`"
, DeprecationWarning)
warnings.warn(
# adjacent string literals are concatenated as-is, so each piece needs its own trailing space
"to pdf_browser method is deprecated. "
"We suggest using nbconvert, install it using `pip install nbconvert[webpdf]` "
"and then run "
"`jupyter nbconvert --to WebPDF --allow-chromium-download notebook.ipynb`",
DeprecationWarning,
)
self.to_pdf_browser()

if "md" in self.to or "pdf_latex" in self.to:
59 changes: 0 additions & 59 deletions dataframe_image/_html2image.py

This file was deleted.
