Use Modern backend (#96)

* add html2image and playwright backend
* bump lowest version to 3.8
* ensure test output path
* update readme
* remove ChromeController dependency
* increase waiting time in notebook convert
dexplo · Sep 1, 2023 · f71441b
1 parent 0a4fc5e
Showing 26 changed files with 582 additions and 376 deletions.
diff --git a/.github/workflows/python-package.yml b/.github/workflows/python-package.yml
@@ -17,7 +17,7 @@ jobs:
     strategy:
       matrix:
         os: [ubuntu-latest, windows-latest, macos-latest]
-        python-version: ["3.7", "3.8", "3.9", "3.10"]
+        python-version: ["3.8", "3.9", "3.10", "3.11"]
         include:
         - os: ubuntu-latest
           pippath: ~/.cache/pip
@@ -104,7 +104,7 @@ jobs:
     - name: Install dependencies
       run: |
         python -m pip install --upgrade pip wheel
-        pip install pytest matplotlib selenium jupyter pandoc
+        pip install pytest matplotlib selenium jupyter pandoc playwright
         pip install . --upgrade
     - name: mac nbconvert patch fix # this is a tmp fix, related to https://github.com/jupyter/nbconvert/issues/1773
       if: ${{ startsWith(matrix.os, 'macos') }}

diff --git a/README.md b/README.md
@@ -2,22 +2,10 @@
 
 [![](https://img.shields.io/pypi/v/dataframe_image)](https://pypi.org/project/dataframe_image)
 [![PyPI - License](https://img.shields.io/pypi/l/dataframe_image)](LICENSE)
+[![Python Version](https://img.shields.io/pypi/pyversions/dataframe_image)](https://pypi.org/project/dataframe_image)
+A package to convert pandas DataFrames to images.
 
-A package to convert Jupyter Notebooks to PDF and/or Markdown embedding pandas DataFrames as images.
-
-## Overview
-
-When converting Jupyter Notebooks to pdf using nbconvert, pandas DataFrames appear as either raw text or as simple LaTeX tables. The left side of the image below shows this representation.
-
-![png](https://github.com/dexplo/dataframe_image/raw/gh-pages/images/dataframe_image_compare.png)
-
-This package was first created to embed DataFrames into pdf and markdown documents as images so that they appear exactly as they do in Jupyter Notebooks, as seen from the right side of the image above. It has since added much more functionality.
-
-## Usage
-
-Upon installation, the option `DataFrame as Image (PDF or Markdown)` will appear in the menu `File -> Download as`. Clicking this option will open up a new browser tab with a short form to be completed.
-
-![png](https://github.com/dexplo/dataframe_image/raw/gh-pages/images/form.png)
+It can also convert Jupyter Notebooks to PDF and/or Markdown, embedding DataFrames as images.
 
 ### Exporting individual DataFrames
 
@@ -39,13 +27,70 @@ Here, an example of how exporting a DataFrame would look like in a notebook.
 
 ![png](https://github.com/dexplo/dataframe_image/raw/gh-pages/images/dfi_export.png)
 
+### Export Jupyter Notebook
+
+When converting Jupyter Notebooks to pdf using nbconvert, pandas DataFrames appear as either raw text or as simple LaTeX tables. The left side of the image below shows this representation.
+
+![png](https://github.com/dexplo/dataframe_image/raw/gh-pages/images/dataframe_image_compare.png)
+
+This package was first created to embed DataFrames into pdf and markdown documents as images so that they appear exactly as they do in Jupyter Notebooks, as seen from the right side of the image above. It has since added much more functionality.
+
+#### Usage
+
+Upon installation, the option `DataFrame as Image (PDF or Markdown)` will appear in the menu `File -> Download as`. Clicking this option will open up a new browser tab with a short form to be completed.
+
+![png](https://github.com/dexplo/dataframe_image/raw/gh-pages/images/form.png)
+
+
 ## Installation
 
 Install with either:
 
 * `pip install dataframe_image`
 * `conda install -c conda-forge dataframe_image`
 
+## Configuration
+
+### table_conversion
+
+When converting a DataFrame to an image, two kinds of backend are available: browser and matplotlib. The default is browser; to use matplotlib instead, set the `table_conversion` parameter to `'matplotlib'`.
+
+The major difference between the two is that the browser backend renders the DataFrame exactly as it appears in the notebook, while the matplotlib backend works without a browser, can export to all image formats, e.g. `svg`, and is extremely fast. However, matplotlib can currently only simulate the header and cells, so `set_caption` will not work.
+
+```python
+dfi.export(df.style.background_gradient(), "df_style.png", table_conversion="matplotlib")
+```
+
+#### Browser backend
+
+Currently we provide four different browser backend libraries: `playwright`, `html2image`, `selenium` and `chrome`. The default is `chrome`.
+
+`chrome` converts the image with your local Chromium-based browser via the command line.
+
+`html2image` is a backup method for `chrome` that uses the `html2image` library.
+
+`playwright` is a much more stable method, but you have to install playwright first.
+
+`selenium` uses the `Firefox` driver. Chrome sometimes makes breaking changes that break the methods above, so `Firefox` is a good backup. It is not stable and is hard to install, but it can be installed in Google Colab.
+
+### Other parameters
+
+```python
+dfi.export(
+    obj: pd.DataFrame,
+    filename,
+    fontsize=14,
+    max_rows=None,
+    max_cols=None,
+    table_conversion: Literal[
+        "chrome", "matplotlib", "html2image", "playwright", "selenium"
+    ] = "chrome",
+    chrome_path=None,
+    dpi=None,  # enlarges the image; default is 100, larger values produce a larger image
+    use_mathjax=False,  # enable MathJax support, which means you can use LaTeX in your dataframe
+)
+```
+
 ## PDF Conversion - LaTeX vs Chrome Browser
 
 By default, conversion to pdf happens via LaTeX, which you must have pre-installed on your machine. If you do not have the correct LaTeX installation, you'll need to select the Chrome Browser option to make the conversion.
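
Since the README describes several browser backends as backups for one another, a caller may want to try them in turn. The following is only a sketch of that idea, not part of this commit: `export_with_fallback` is a hypothetical helper, and it takes the export function as an argument so that it makes no assumption beyond the `table_conversion` keyword shown in the signature above.

```python
from typing import Any, Callable, Sequence


def export_with_fallback(
    export_fn: Callable[..., Any],
    obj: Any,
    filename: str,
    backends: Sequence[str] = (
        "chrome", "html2image", "playwright", "selenium", "matplotlib",
    ),
    **kwargs: Any,
) -> str:
    """Try each backend in order; return the name of the first that succeeds."""
    errors = {}
    for backend in backends:
        try:
            export_fn(obj, filename, table_conversion=backend, **kwargs)
            return backend
        except Exception as exc:  # each backend can fail in its own way
            errors[backend] = exc
    raise RuntimeError(f"all backends failed: {errors}")


# Intended usage (assumes dataframe_image and pandas are installed):
#   import dataframe_image as dfi
#   used = export_with_fallback(dfi.export, df, "df.png")
```

Catching every `Exception` is deliberately broad here, because a missing browser, a missing driver and a missing package surface as different error types.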

diff --git a/dataframe_image/_browser_pdf.py b/dataframe_image/_browser_pdf.py
@@ -10,11 +10,10 @@
 from tempfile import TemporaryDirectory, mkstemp
 
 import aiohttp
-import ChromeController
 from nbconvert import TemplateExporter
 from nbconvert.exporters import Exporter, HTMLExporter
 
-from ._screenshot import get_chrome_path
+from .converter.browser.chrome_converter import get_chrome_path
 
 
 async def handler(ws, data, key=None):
@@ -62,7 +61,7 @@ async def main(file_name, p):
             frameId = await handler(ws, data, "frameId")
 
             # second - enable page
-            # await asyncio.sleep(1)
+            await asyncio.sleep(1)
             data = {"id": 2, "method": "Page.enable"}
             await handler(ws, data)
 
@@ -72,14 +71,16 @@ async def main(file_name, p):
             await handler(ws, data, "content")
 
             # fourth - get pdf
+            prev_len = 0
             for _ in range(10):
                 await asyncio.sleep(1)
                 params = {"displayHeaderFooter": False, "printBackground": True}
                 data = {"id": 4, "method": "Page.printToPDF", "params": params}
                 pdf_data = await handler(ws, data, "data")
                 pdf_data = base64.b64decode(pdf_data)
-                if len(pdf_data) > 1000:
+                if len(pdf_data) > 1000 and len(pdf_data) == prev_len:
                     break
+                prev_len = len(pdf_data)
             else:
                 raise TimeoutError("Could not get pdf data")
             return pdf_data
@@ -131,6 +132,7 @@ def get_pdf_data(file_name):
 
 
 def get_pdf_data_chromecontroller(file_name):
+    import ChromeController
     additional_options = get_launch_args()
     # ChromeContext will shlex.split binary, so add quote to it
     with ChromeController.ChromeContext(
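
The `prev_len` change in `main` above makes the loop wait until the PDF is both larger than 1000 bytes and the same size on two consecutive polls. As a rough illustration of that pattern in isolation, here is a sketch in which `poll_until_stable` and its `fetch` callback are hypothetical names standing in for the `Page.printToPDF` request.

```python
import asyncio
from typing import Awaitable, Callable


async def poll_until_stable(
    fetch: Callable[[], Awaitable[bytes]],
    min_size: int = 1000,
    attempts: int = 10,
    delay: float = 1.0,
) -> bytes:
    """Return the payload once it exceeds min_size and its size stops changing."""
    prev_len = 0
    for _ in range(attempts):
        await asyncio.sleep(delay)
        data = await fetch()
        # Accept only when two consecutive polls report the same length.
        if len(data) > min_size and len(data) == prev_len:
            return data
        prev_len = len(data)
    raise TimeoutError("Could not get stable data")
```

Comparing lengths is a cheap proxy for "the page has finished rendering"; it costs at least one extra poll compared with the earlier size-only check.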

diff --git a/dataframe_image/_convert.py b/dataframe_image/_convert.py
@@ -7,9 +7,9 @@
 import tempfile
 import time
 import urllib.parse
+import warnings
 from pathlib import Path
 from tempfile import TemporaryDirectory
-import warnings
 
 import nbformat
 from nbconvert import MarkdownExporter, PDFExporter
@@ -25,6 +25,7 @@
 
 _logger = logging.getLogger(__name__)
 
+
 class Converter:
     KINDS = ["pdf", "md"]
     DISPLAY_DATA_PRIORITY = [
@@ -190,26 +191,26 @@ def get_resources(self):
         if self.table_conversion == "html2image":
             pass
         elif self.table_conversion == "chrome":
-            from ._screenshot import Screenshot
+            from .converter.browser.chrome_converter import ChromeConverter
 
-            converter = Screenshot(
+            converter = ChromeConverter(
                 center_df=self.center_df,
                 max_rows=self.max_rows,
                 max_cols=self.max_cols,
                 chrome_path=self.chrome_path,
             ).run
         elif self.table_conversion == "selenium":
-            from .selenium_screenshot import SeleniumScreenshot
+            from .converter.browser.selenium_converter import SeleniumConverter
 
-            converter = SeleniumScreenshot(
+            converter = SeleniumConverter(
                 center_df=self.center_df,
                 max_rows=self.max_rows,
                 max_cols=self.max_cols,
             ).run
         else:
-            from ._matplotlib_table import TableMaker
+            from .converter.matplotlib_table import MatplotlibTableConverter
 
-            converter = TableMaker(fontsize=22).run
+            converter = MatplotlibTableConverter(fontsize=22).run
 
         resources = {
             "metadata": {"path": str(self.nb_home), "name": self.document_name},
@@ -295,15 +296,15 @@ def to_pdf_latex(self):
         # get long path name of self.td
         temp_dir = Path(self.td.name).resolve()
         self.resources["temp_dir"] = temp_dir
-        print("TEMP_DIR", temp_dir) # TODO just for debug
+        print("TEMP_DIR", temp_dir)  # TODO just for debug
         MarkdownHTTPPreprocessor().preprocess(self.nb, self.resources)
 
         for filename, image_data in self.resources["image_data_dict"].items():
             fn_pieces = filename.split("_")
             cell_idx = int(fn_pieces[1])
             ext = fn_pieces[-1].split(".")[-1]
-            new_filename =  str(temp_dir / filename)
-            print(new_filename) # TODO just for debug
+            new_filename = str(temp_dir / filename)
+            print(new_filename)  # TODO just for debug
 
             # extract first image from gif and use as png for latex pdf
             if ext == "gif":
@@ -328,7 +329,9 @@ def to_pdf_latex(self):
         try:
             pdf_data, self.resources = pdf.from_notebook_node(self.nb, self.resources)
         except Exception as ex:
-            latex, _ = super(PDFExporter, pdf).from_notebook_node(self.nb, self.resources)
+            latex, _ = super(PDFExporter, pdf).from_notebook_node(
+                self.nb, self.resources
+            )
             _logger.error("nbconvert failed to create PDF via latex \n\n{latex}")
             with open("notebook.tex", "w", encoding="utf-8") as f:
                 f.write(latex)
@@ -374,11 +377,13 @@ def convert(self):
         # Step 2: if exporting as pdf with browser, do this first
         # as it requires no other preprocessing
         if "pdf_browser" in self.to:
-            warnings.warn("to pdf_browser method is deprecated"
-                          "We suggest using nbconvert, install it using `pip install nbconvert[webpdf]`"
-                          "and then run"
-                          "`jupyter nbconvert --to WebPDF --allow-chromium-download notebook.ipynb`"
-                          , DeprecationWarning)
+            warnings.warn(
+                "to pdf_browser method is deprecated. "
+                "We suggest using nbconvert, install it using `pip install nbconvert[webpdf]` "
+                "and then run "
+                "`jupyter nbconvert --to WebPDF --allow-chromium-download notebook.ipynb`",
+                DeprecationWarning,
+            )
             self.to_pdf_browser()
 
         if "md" in self.to or "pdf_latex" in self.to:
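
Python joins adjacent string literals with no separator, so a message split across several lines, like the deprecation warning in `convert` above, needs an explicit space at each break. The sketch below shows the effect; `warn_pdf_browser_deprecated` is a hypothetical helper and its wording is illustrative, not the project's.

```python
import warnings

# Adjacent literals are concatenated as-is, with nothing inserted between them.
broken = (
    "to pdf_browser method is deprecated"
    "We suggest using nbconvert"
)
assert broken == "to pdf_browser method is deprecatedWe suggest using nbconvert"


def warn_pdf_browser_deprecated() -> None:
    """Emit the deprecation notice with explicit spaces between fragments."""
    message = (
        "The pdf_browser method is deprecated. "
        "We suggest using nbconvert: install it with "
        "`pip install nbconvert[webpdf]` and then run "
        "`jupyter nbconvert --to WebPDF --allow-chromium-download notebook.ipynb`."
    )
    # stacklevel=2 attributes the warning to the caller rather than this helper.
    warnings.warn(message, DeprecationWarning, stacklevel=2)
```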

diff --git a/dataframe_image/_html2image.py b/dataframe_image/_html2image.py