feat: turn PDF into HTML (#189)
* feat: turn PDF into HTML

* ci: install mupdf

* docs: improve

* refactor: sort
Kikobeats authored Jan 24, 2024
1 parent 23022bc commit cbf0835
Showing 9 changed files with 250 additions and 78 deletions.
8 changes: 4 additions & 4 deletions .github/dependabot.yml
@@ -1,11 +1,11 @@
version: 2
updates:
  - package-ecosystem: npm
    directory: "/"
    directory: '/'
    schedule:
      interval: daily
  - package-ecosystem: "github-actions"
    directory: "/"
  - package-ecosystem: 'github-actions'
    directory: '/'
    schedule:
      # Check for updates to GitHub Actions every weekday
      interval: "daily"
      interval: 'daily'
42 changes: 33 additions & 9 deletions .github/workflows/main.yml
@@ -1,22 +1,46 @@
name: test
name: main

on:
  push:
    branches:
      - master
  pull_request:
    branches:
      - master

jobs:
  test:
  contributors:
    if: "${{ github.event.head_commit.message != 'build: contributors' }}"
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          token: ${{ secrets.GH_TOKEN }}
          fetch-depth: 0
          token: ${{ secrets.GITHUB_TOKEN }}
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: lts/*
      - name: Contributors
        run: |
          git config --global user.email ${{ secrets.GIT_EMAIL }}
          git config --global user.name ${{ secrets.GIT_USERNAME }}
          npm run contributors
      - name: Push changes
        run: |
          git push origin ${{ github.head_ref }}
  release:
    if: |
      !startsWith(github.event.head_commit.message, 'chore(release):') &&
      !startsWith(github.event.head_commit.message, 'docs:') &&
      !startsWith(github.event.head_commit.message, 'ci:')
    needs: [contributors]
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          fetch-depth: 2
          token: ${{ secrets.GITHUB_TOKEN }}
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
@@ -27,19 +51,19 @@ jobs:
          version: latest
          run_install: true
      - name: Test
        run: npm test
        run: pnpm test
      - name: Report
        run: npx c8 report --reporter=text-lcov > coverage/lcov.info
      - name: Coverage
        uses: coverallsapp/github-action@main
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
      - name: Release
        if: ${{ github.ref == 'refs/heads/master' && !startsWith(github.event.head_commit.message, 'chore(release):') && !startsWith(github.event.head_commit.message, 'docs:') }}
        env:
          GH_TOKEN: ${{ secrets.GH_TOKEN }}
          NPM_TOKEN: ${{ secrets.NPM_TOKEN }}
        run: |
          git config --global user.email ${{ secrets.GIT_EMAIL }}
          git config --global user.name ${{ secrets.GIT_USERNAME }}
          npm run release
          git pull origin master
          pnpm run release
38 changes: 38 additions & 0 deletions .github/workflows/pull_request.yml
@@ -0,0 +1,38 @@
name: pull_request

on:
  push:
    branches:
      - master
  pull_request:
    branches:
      - master

jobs:
  test:
    if: github.ref != 'refs/heads/master'
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: lts/*
      - name: Setup PNPM
        uses: pnpm/action-setup@v2
        with:
          version: latest
          run_install: true
      - name: Install mupdf-tools
        run: sudo apt-get install -y mupdf-tools
      - name: Test
        run: pnpm test
      - name: Report
        run: npx c8 report --reporter=text-lcov > coverage/lcov.info
      - name: Coverage
        uses: coverallsapp/github-action@main
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
84 changes: 38 additions & 46 deletions README.md
@@ -9,26 +9,16 @@
[![Coverage Status](https://img.shields.io/coveralls/microlinkhq/html-get.svg?style=flat-square)](https://coveralls.io/github/microlinkhq/html-get)
[![NPM Status](https://img.shields.io/npm/dm/html-get.svg?style=flat-square)](https://www.npmjs.org/package/html-get)

> Get the HTML from any website, using prerendering when is necessary.
> Get the HTML from any website, fine-tuned for correction & speed.
## Features

- Get HTML markup from any website (client side apps as well).
- Prerendering detection based on domains list.
- Speed up process blocking ads trackers.
- Encoding body response properly.
- Get HTML markup for any URL, including images, video, audio, or pdf.
- Block ad trackers or any unnecessary network subrequest.
- Handle unreachable or timeout URLs gracefully.
- Ensure HTML markup is appropriately encoded.

<br>

Headless technology like [puppeteer](https://github.com/GoogleChrome/puppeteer) lets us get the HTML markup from any website, even when the target URL is a client-side app and we need to wait until DOM events fire to get the real markup.

Generally this approach is better than a simple GET request to the target URL, but because you need to wait for DOM events, prerendering can be slow and in some scenarios unnecessary (sites that use server-side rendering can be resolved with a simple GET).

**html-get** brings the best of both worlds, using the following algorithm:

- Determine whether the target URL actually needs prerendering (internally it has a [list of popular site domains](https://github.com/microlinkhq/html-get/blob/master/src/auto-domains.js) that don't need it).
- If it needs prerendering, perform the action using headless technology, blocking ad tracker requests to speed up the process and trying to resolve the main request in the minimum amount of time.
- If it does not need prerendering, or prerendering fails for any reason (for example, a timeout), the request is resolved with a GET request.
**html-get** takes advantage of [puppeteer](https://github.com/GoogleChrome/puppeteer) headless technology when needed, such as for client-side apps that need to be prerendered.

## Install

@@ -89,68 +79,70 @@ Type: `string`

The target URL for getting the HTML markup.

#### options

##### encoding

Type: `string`
Default: `'utf-8'`

It ensures the HTML markup is encoded using the provided encoding.

The value will be passed to [`html-encode`](https://github.com/kikobeats/html-encode).

##### getBrowserless

*Required*<br>
Type: `function`<br>
Type: `function`

A function that should return a [browserless](https://browserless.js.org/) instance to be used for interacting with puppeteer:

#### options

##### prerender
##### getMode

Type: `boolean`|`string`<br>
Default: `'auto'`
Type: `function`

Enable or disable prerendering as the mechanism for getting the HTML markup explicitly.
It determines the strategy to use based on the `url`; the possible values are `'fetch'` or `'prerender'`.

The value `auto` means that it internally uses a list of websites that don't need prerendering by default. This list is used to speed up the process, using `fetch` mode for these websites.
##### getTemporalFile

See the [getMode parameter](#getMode) to learn more.
Type: `function`

##### encoding
It creates a temporary file.

Type: `string`<br>
Default: `'utf-8'`
##### gotOpts

Encoding the HTML markup properly from the body response.
Type: `object`

It determines the encoding to use, relying on a Node.js library for converting HTML documents of arbitrary encoding into a target encoding (utf8, utf16, etc).
It passes a configuration object to [got](https://www.npmjs.com/package/got) under the `'fetch'` strategy.

##### headers

Type: `object`<br>
Type: `object`

Request headers that will be passed to the fetch/prerender process.

##### getMode
##### mutoolPath

Type: `function`<br>
Type: `function`

A function that will be invoked to determine the `mode` used for getting the HTML markup from the target URL.
It returns the path for [mutool](https://mupdf.com/) binary, used for turning PDF files into HTML markup.
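
As a rough illustration of how such a path could be resolved and used, the sketch below looks up a `mutool` binary and builds a command line for PDF-to-HTML conversion. Everything here is an assumption made for the example: `findMutool`, `pdfToHtmlArgs`, and `pdfToHtml` are hypothetical names, the lookup relies on a POSIX `which` command, and the `draw -q -F html` arguments follow the MuPDF command-line tool's documented options rather than anything this diff shows about html-get's internals.

```js
const { execFileSync } = require('child_process')

// Hypothetical helper: absolute path of `mutool`, or undefined when it cannot be found.
const findMutool = () => {
  try {
    const stdout = execFileSync('which', ['mutool'], {
      encoding: 'utf8',
      stdio: ['ignore', 'pipe', 'ignore']
    })
    return stdout.trim() || undefined
  } catch (_) {
    // `which` is missing or exited non-zero: treat the binary as unavailable.
    return undefined
  }
}

// Assumed arguments: quiet mode, HTML output format, result written to stdout.
const pdfToHtmlArgs = pdfPath => ['draw', '-q', '-F', 'html', pdfPath]

// Convert a local PDF when a binary path is given; return undefined otherwise
// so the caller can fall back to another strategy.
const pdfToHtml = (pdfPath, mutoolPath) =>
  mutoolPath
    ? execFileSync(mutoolPath, pdfToHtmlArgs(pdfPath), { encoding: 'utf8' })
    : undefined
```

Returning `undefined` instead of throwing keeps the missing-binary case cheap to detect, which fits environments where `mupdf-tools` is optional.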

The default `getMode` is:
##### prerender

```js
const getMode = (url, { prerender }) => {
  if (prerender === false) return 'fetch'
  if (prerender !== 'auto') return 'prerender'
  return autoDomains.includes(getDomain(url)) ? 'fetch' : 'prerender'
}
```
Type: `boolean`|`string`<br>
Default: `'auto'`

##### gotOptions
Enable or disable prerendering as the mechanism for getting the HTML markup explicitly.

Type: `object`<br>
The value `auto` means that it internally uses a list of websites that don't need prerendering by default. This list is used to speed up the process, using `fetch` mode for these websites.

Under `mode=fetch`, pass configuration object to [got](https://www.npmjs.com/package/got).
See the [getMode parameter](#getMode) to learn more.
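
The relationship between `prerender` and `getMode` can be sketched as a small self-contained function. This is only an illustration: the `autoDomains` entries and the `getDomain` helper are hypothetical stand-ins for the library's own domain list and domain parsing, and the hostname handling is deliberately naive (it keeps only the last two labels).

```js
// Hypothetical list of domains assumed to work without prerendering.
const autoDomains = ['github.com', 'wikipedia.org']

// Naive helper (an assumption for this sketch): keep the last two hostname labels.
const getDomain = url => new URL(url).hostname.split('.').slice(-2).join('.')

const getMode = (url, { prerender = 'auto' } = {}) => {
  // Prerendering explicitly disabled: always use a plain GET.
  if (prerender === false) return 'fetch'
  // Any other explicit value forces prerendering.
  if (prerender !== 'auto') return 'prerender'
  // 'auto': skip prerendering only for domains on the list.
  return autoDomains.includes(getDomain(url)) ? 'fetch' : 'prerender'
}
```

With these assumptions, `getMode('https://en.wikipedia.org/wiki/Node.js')` resolves to `'fetch'`, while an unlisted domain falls through to `'prerender'`.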

##### puppeteerOpts

Type: `object`

Under non `mode=fetch`, pass configuration object to [puppeteer](https://www.npmjs.com/package/puppeteer).
It passes a configuration object to [puppeteer](https://www.npmjs.com/package/puppeteer) under the `'prerender'` strategy.

##### rewriteUrls

20 changes: 13 additions & 7 deletions package.json
@@ -1,8 +1,8 @@
{
  "name": "html-get",
  "description": "Get the HTML from any website, using prerendering when is necessary.",
  "description": "Get the HTML from any website, fine-tuned for correction & speed",
  "homepage": "https://nicedoc.com/microlinkhq/html-get",
  "version": "2.14.4",
  "version": "2.15.0-1",
  "main": "src/index.js",
  "bin": {
    "html-get": "bin/index.js"
@@ -20,20 +20,25 @@
    "url": "https://github.com/microlinkhq/html-get/issues"
  },
  "keywords": [
    "audio",
    "fetch",
    "get",
    "got",
    "headless",
    "html",
    "image",
    "markup",
    "pdf",
    "prerender",
    "request"
    "request",
    "video"
  ],
  "dependencies": {
    "@kikobeats/time-span": "~1.0.3",
    "@metascraper/helpers": "~5.43.0",
    "@metascraper/helpers": "~5.43.4",
    "cheerio": "~1.0.0-rc.12",
    "css-url-regex": "~4.0.0",
    "debug-logfmt": "~1.2.0",
    "debug-logfmt": "~1.2.2",
    "execall": "~2.0.0",
    "got": "~11.8.6",
    "html-encode": "~2.1.6",
@@ -44,7 +49,8 @@
    "p-cancelable": "~2.1.0",
    "p-retry": "~4.6.0",
    "replace-string": "~3.1.0",
    "top-sites": "~1.1.202"
    "tinyspawn": "~1.2.6",
    "top-sites": "~1.1.205"
  },
  "devDependencies": {
    "@commitlint/cli": "latest",
@@ -82,7 +88,7 @@
    "lint": "standard-markdown README.md && standard",
    "postinstall": "node scripts/postinstall",
    "postrelease": "npm run release:tags && npm run release:github && (ci-publish || npm publish --access=public)",
    "prerelease": "npm run update:check && npm run contributors",
    "prerelease": "npm run update:check",
    "pretest": "npm run lint",
    "release": "standard-version -a",
    "release:github": "github-generate-release",