Merge pull request #711 from masylum/patch-1
feat: allow passing `htmlDom`
Kikobeats authored Jun 24, 2024
2 parents d641cbd + 6302e95 commit 75e1d8e
Showing 9 changed files with 1,285 additions and 1,573 deletions.
6 changes: 6 additions & 0 deletions README.md
@@ -264,6 +264,12 @@ Type: `String`

The HTML markup for extracting the content.

+##### htmlDom
+
+Type: `object`
+
+The DOM representation of the HTML markup. When it is not provided, it is parsed from the `html` parameter.

#### rules

Type: `Array`
3 changes: 2 additions & 1 deletion packages/metascraper/src/index.js
@@ -14,6 +14,7 @@ module.exports = rules => {
   return async ({
     url,
     html = '',
+    htmlDom,
     rules: inlineRules,
     validateUrl = true,
     ...props
@@ -27,7 +28,7 @@ module.exports = rules => {

     return getData({
       url,
-      htmlDom: load(html, { baseURI: url }),
+      htmlDom: htmlDom ?? load(html, { baseURI: url }),
       rules: mergeRules(inlineRules, loadedRules),
       ...props
     })
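
The fallback introduced in this hunk can be sketched in isolation. The snippet below is a minimal illustration, not the library's code: `load` is a hypothetical stand-in for cheerio's `load` that only records its inputs, and `resolveHtmlDom` is a name invented here to wrap the changed expression. It assumes a Node.js version that supports the `??` operator.

```javascript
// Hypothetical stand-in for cheerio's `load`; it only records its inputs so the
// fallback can be observed without installing any dependency.
const load = (html, { baseURI } = {}) => ({ parsedFrom: html, baseURI })

// Hypothetical helper mirroring the changed line: use the caller's DOM when it
// is given, otherwise parse the `html` string.
const resolveHtmlDom = ({ url, html = '', htmlDom }) =>
  htmlDom ?? load(html, { baseURI: url })

const url = 'https://microlink.io'
const html = '<title>Original HTML</title>'
const preParsed = { parsedFrom: '<title>htmlDom</title>' }

// A provided `htmlDom` wins over `html`.
const fromDom = resolveHtmlDom({ url, html, htmlDom: preParsed })

// Omitting `htmlDom`, or passing `null`, falls back to parsing `html`.
const fromHtml = resolveHtmlDom({ url, html })
const fromNull = resolveHtmlDom({ url, html, htmlDom: null })
```

Because `??` only falls back on `null` or `undefined`, any other value passed as `htmlDom` is used as-is, so callers are responsible for passing a DOM object the rules can actually query.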
1,300 changes: 202 additions & 1,098 deletions packages/metascraper/test/integration/bfi/input.html


@@ -9,13 +9,13 @@ Generated by [AVA](https://avajs.dev).
> Snapshot 1
{
-audio: 'https://podcast-stream.wbez.org/recast/the-pie/20210113155621-ThePie-E04.mp3',
+audio: null,
 author: null,
-date: '2021-01-14T03:00:06.000Z',
+date: '2021-02-12T16:09:42.000Z',
description: 'The expanding market influence of tech companies has sparked new fear of an old economic problem – monopoly power. In this episode, Eric Posner and Chad Syverson discuss whether these […]',
image: 'https://bfi.uchicago.edu/wp-content/uploads/2018/11/pie-web-banner_6.png',
lang: 'en',
-logo: 'https://bfi.uchicago.edu/wp-content/uploads/2019/03/favicon-228.png',
+logo: 'https://bfi.uchicago.edu/wp-content/uploads/2024/03/BFI-Core-Logo-RGB.svg',
publisher: 'BFI',
title: 'The Big Tech Threat? | BFI',
url: 'https://bfi.uchicago.edu/podcast/the-big-tech-threat/',
Binary file not shown.
1,515 changes: 1,053 additions & 462 deletions packages/metascraper/test/integration/los-angeles-times/input.html


@@ -10,14 +10,14 @@ Generated by [AVA](https://avajs.dev).
{
audio: null,
-author: 'Los Angeles Times',
+author: 'Tracey Lien',
 date: '2016-05-02T10:03:18.000Z',
-description: 'Tech start-up Appthority’s office has plush conference rooms, soundproof phone booths, an enormous kitchen and a view of San Francisco Bay. It has ping-pong and foosball tables, beer on tap and 11 types of tea.',
-image: 'http://www.trbimg.com/img-572421a4/turbine/la-fi-tn-tech-downturn-20160429',
+description: 'Tech start-up Appthority’s office has plush conference rooms, soundproof phone booths, an enormous kitchen and a view of San Francisco Bay.',
+image: 'https://ca-times.brightspotcdn.com/dims4/default/78d090f/2147483647/strip/true/crop/2048x1075+0+146/resize/1200x630!/quality/75/?url=https%3A%2F%2Fcalifornia-times-brightspot.s3.amazonaws.com%2F07%2F12%2F36025c1a8b6f34ec37234f02980c%2Fla-la-fi-adv-start-up-funding007-jpg-20160429',
 lang: 'en',
-logo: 'https://www.latimes.com/favicon.ico',
-publisher: 'latimes.com',
+logo: 'https://www.latimes.com/apple-touch-icon.png',
+publisher: 'Los Angeles Times',
 title: 'As venture capital dries up, tech start-ups discover frugality',
-url: 'http://www.latimes.com/business/technology/la-fi-tn-tech-downturn-20160429-story.html',
+url: 'https://www.latimes.com/business/technology/la-fi-tn-tech-downturn-20160429-story.html',
video: null,
}
Binary file not shown.
16 changes: 13 additions & 3 deletions packages/metascraper/test/unit/interface.js
@@ -4,6 +4,7 @@ const test = require('ava')

const createMetascraper = require('../..')
const titleRules = require('metascraper-title')()
+const { load } = require('cheerio')

test('`url` is required', async t => {
t.plan(9)
@@ -31,7 +32,7 @@ test('`url` is required', async t => {
}
})

-test('Disable URL validation using `validateUrl`', async t => {
+test('passing `{ validateUrl: false }`', async t => {
const metascraper = createMetascraper([titleRules])

const html = `
@@ -66,7 +67,7 @@ test('Disable URL validation using `validateUrl`', async t => {
t.is(metadata.title, 'Document')
})

-test('load extra `rules`', async t => {
+test('passing `rules`', async t => {
const url = 'https://microlink.io'

const html = `
@@ -104,7 +105,7 @@ test('load extra `rules`', async t => {
t.is(metadata.foo, 'bar')
})

-test('associate test function with rules', async t => {
+test('skip `rules` via `test` function', async t => {
const url = 'https://microlink.io'

const html = `
@@ -148,3 +149,12 @@ test('associate test function with rules', async t => {
t.is(metadata.foo, null)
t.true(isCalled)
})

+test('passing `htmlDom`', async t => {
+  const url = 'https://microlink.io'
+  const htmlDom = load('<title>htmlDom</title>')
+  const html = '<title>Original HTML</title>'
+  const metascraper = createMetascraper([titleRules])
+  const metadata = await metascraper({ url, htmlDom, html })
+  t.is(metadata.title, 'htmlDom')
+})
