Improve readme
Tamara Robichet committed May 2, 2019
1 parent a7c9b46 commit 2f4cd47
Showing 7 changed files with 211 additions and 49 deletions.
205 changes: 180 additions & 25 deletions README.md
@@ -1,20 +1,19 @@
# contract-scraper

This package lets you scrape a HTML page and easily return the data in a structure that you define (a contract).
With contract-scraper you can scrape a HTML page and easily return the data as an array of objects in a structure that you define.

[![Build Status](https://travis-ci.org/tamarasaurus/contract-scraper.svg?branch=master)](https://travis-ci.org/tamarasaurus/contract-scraper)

s

## Installation

```bash
npm install contract-scraper --save
```


## Usage

Let's say that you want to scrape data about a list of toys from this HTML page:
Let's say that you want to scrape data about a list of toys from a HTML page:

```html
<html>
@@ -41,15 +40,20 @@ You have the following properties that you can collect:
- Photo URL `[data-profile]`
- Link `a[href]`

So you can construct a contract with the following properties:
So you can use `contract-scraper` to scrape the page and return an array of JSON objects with the values of these properties:

```javascript
import Scraper from 'contract-scraper'

// Tell contract-scraper which selectors to scrape, and from where
const contract = {
// The selector of the list item
// The selector for an individual item on the page
itemSelector: 'ul li',

// Option to scrape with a headless browser
// Whether or not to use puppeteer to scrape the page
scrapeAfterLoading: false,

// A list of properties to scrape for each item
attributes: {
name: {
type: 'text',
@@ -67,28 +71,14 @@ const contract = {
}
},
};
```

For each attribute that you want to scrape, you have the following options:

| Property | Options | Description |
|---------------------- |-------------------------------------------------------------------------------------------------------------------------------------------------- |-------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| name | 'text', 'background-image', 'price', 'size', 'link' | Each attribute type is able to scrape data from a HTML element differently. See here for more: Attributes |
| selector | '.class' | Any jQuery/DOM selector that is compatible with https://github.com/cheeriojs/cheerio |
| attribute (optional) | 'href' | You can use any HTML attribute that exists on the element that you want to scrape |
| data (optional) | data: { name: 'country' } <div data-name="country">Country</div> OR data: {name: 'price', key: 'currency'} <div data-price="{currency: 'aud'}" | For a string, you can simply specify the name of the data attribute to scrape the contents. Otherwise you can specify the key if the data attribute contains some JSON. |


Once you have your contract ready, you can pass it to the scraper with a url:

```typescript
import Scraper from 'contract-scraper'

// Create a new instance of the scraper with a URL and contract
const scraper = new Scraper(
'http://characters.com',
contract
)

// Resolve a promise that returns a list of scraped items
scraper.scrapePage().then((items) => {
console.log(items);
// [
@@ -107,6 +97,171 @@ scraper.scrapePage().then((items) => {

```

## Attributes
You create a new instance of the scraper by passing in the URL that you want to scrape, a contract, and optionally an object of custom attribute types, in that order.

The contract can have the following properties:

* `itemSelector` (string) - A CSS selector for the set of items that you want to scrape. E.g. `ul li`
* `scrapeAfterLoading` (boolean = false) - Whether or not the provided URL needs to load with a browser before scraping.
* `attributes` (object) - An object that defines the properties that you want to scrape.

Each attribute in your contract represents some data that you want to scrape and format. Each attribute can have a type, which tells the scraper how the value should be formatted for the final output. For example, you can tell the scraper to only extract a number from an element for the following page:

```html
<ul>
<li>
<div class="name">Iron man</div>
<div class="price">100 euros</div>
</li>
<li>
<div class="name">Captain America</div>
<div class="price">500 euros</div>
</li>
</ul>
```

```javascript
const contract = {
itemSelector: 'li',
scrapeAfterLoading: false,
attributes: {
name: {
type: 'text',
selector: '.name'
},
price: {
type: 'digit',
selector: '.price'
}
}
}

const scraper = new Scraper(
'http://characters.com',
contract
)

scraper.scrapePage().then(items => {
console.log(items)
// [
// {
// name: 'Iron man',
// price: 100
// },
// {
// name: 'Captain America',
// price: 500
// }
// ]
})

```

By using the in-built attribute type `digit`, you can tell the scraper that you only want the number from the contents of the element.
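
The extraction that `digit` performs can be sketched in isolation. The helper below is a hypothetical stand-in, not the library's code: the name `extractDigits` is an assumption, and it assumes that whitespace is stripped first and that the first run of digits is returned as a number, with `null` when there is nothing to extract.

```javascript
// Hypothetical helper illustrating what the `digit` type is described as doing.
// Assumption: whitespace is removed first, then the first run of digits is used.
function extractDigits(input) {
  if (input === undefined || input === null) {
    return null;
  }

  // '345 000 €' becomes '345000€' so that grouped thousands read as one number
  const stripped = String(input).replace(/\s+/g, '');
  const match = stripped.match(/\d+/);

  return match === null ? null : Number(match[0]);
}

console.log(extractDigits('100 euros'));
console.log(extractDigits('345 000 €'));
console.log(extractDigits('no number here'));
```

Under these assumptions, decimals and negative numbers are not handled: only the first unbroken run of digits is kept.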

Each attribute represents a HTML element to scrape, and it can have the following properties:

* `name` (string) - The name of the property to return in the final list of results
* `selector` (string) - The CSS selector for the element that matches this attribute.
* `type` (string) - One of the in-built types that tells the scraper how to format the contents of this element:
* `background-image`: Use this when you want to extract a background-image URL from an element's inline style.
* `link`: For scraping an absolute or relative link from an element
* `digit`: For returning a number from a string
* `size`: Use this for extracting a number from a string for a size in m².
* `text`: Returns trimmed text from an element.
* `attribute (optional)` (string) - The name of the HTML attribute to scrape data from. E.g. for an element:
```html
<a href="http://linktoscrape">Homepage</a>
```
```javascript
{
name: 'URL',
type: 'link',
selector: 'a',
attribute: 'href'
}
```
By default the scraper will format the inner text of the element if `attribute` is not specified.
* `data (optional)` (object) - If you want to scrape data attributes you can do it in two ways:
* For directly scraping a data attribute:
```html
<div data-country="Australia">
```
```javascript
{
name: 'Country',
type: 'text',
selector: '[data-country]',
data: { name: 'country' }
}
```
This will return "Australia" in your list of results.
* For scraping a JSON value inside a data attribute:
```html
<div data-price='{"currency": "aud"}'></div>
```
```javascript
{
name: 'Price',
type: 'text',
selector: '[data-price]',
data: { name: 'price', key: 'currency' }
}
```
This will return "aud" in your list of results.
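
The two `data` modes above can be sketched as a single lookup. This is an illustration only, not how the library resolves data attributes internally: `readDataValue` is a hypothetical name, and the sketch assumes that the attribute holds valid JSON whenever a `key` is given.

```javascript
// Hypothetical sketch of resolving a `data` option against a raw attribute value.
// Assumption: when `key` is given, the attribute value is valid JSON.
function readDataValue(rawValue, key) {
  if (rawValue === undefined || rawValue === null) {
    return null;
  }

  // No key: the attribute value itself is the result
  if (key === undefined) {
    return rawValue;
  }

  try {
    const parsed = JSON.parse(rawValue);
    if (parsed === null || typeof parsed !== 'object') {
      return null;
    }
    return Object.prototype.hasOwnProperty.call(parsed, key) ? parsed[key] : null;
  } catch (error) {
    // Values that are not valid JSON cannot be looked up by key
    return null;
  }
}

console.log(readDataValue('Australia'));
console.log(readDataValue('{"currency": "aud"}', 'currency'));
```

Returning `null` for unparseable values keeps one malformed element from aborting the whole scrape, which is a design choice of this sketch.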

## Custom attribute types

In addition to the in-built attribute types, you can provide your own when you create a new instance of the scraper. A custom attribute type needs to be a class or a constructor function whose instances expose a `value` property. For example, if you wanted to extract a list of tags as an array:

```html
<ul>
<li>
<div class="name">Australia</div>
<div class="tags">spiders,vegemite,scorching,heat</div>
</li>
</ul>
```

```javascript
import Scraper from 'contract-scraper';

const contract = {
itemSelector: 'li',
scrapeAfterLoading: false,
attributes: {
countryName: {
type: 'text',
selector: '.name'
},
tags: {
type: 'list',
selector: '.tags'
}
}
}

// The custom type that receives the raw string as an argument
function ListFromString(commaSeparatedString) {
this.value = commaSeparatedString.split(',');
}

const scraper = new Scraper(
'http://countries.com',
contract,
{ 'list': ListFromString }
)

scraper.scrapePage().then(items => {
console.log(items);
// [
// {
// countryName: 'Australia',
// tags: [ 'spiders', 'vegemite', 'scorching', 'heat' ]
// }
// ]
})

```
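
A custom type can also be written as a class with a `value` getter, which is the shape the in-built types in `src/attribute` use. The sketch below is a hypothetical variant of `ListFromString`; the trimming and empty-entry filtering are additions for illustration, and it assumes the scraper constructs the type with the raw scraped string.

```javascript
// Hypothetical class-based custom type.
// Assumption: the scraper calls `new Type(rawString)` and then reads `.value`.
class ListFromString {
  constructor(commaSeparatedString) {
    this.inputValue = commaSeparatedString;
  }

  get value() {
    if (typeof this.inputValue !== 'string') {
      return [];
    }

    // Split on commas, trim each entry and drop empty ones
    return this.inputValue
      .split(',')
      .map((entry) => entry.trim())
      .filter((entry) => entry.length > 0);
  }
}

console.log(new ListFromString('spiders, vegemite,scorching,heat').value);
```

It would be registered the same way as the function version, by passing `{ 'list': ListFromString }` as the third constructor argument.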


## Custom attributes
4 changes: 2 additions & 2 deletions index.ts
@@ -1,6 +1,6 @@
import BackgroundImage from './src/attribute/background-image';
import Link from './src/attribute/link';
import Price from './src/attribute/price';
import Digit from './src/attribute/digit';
import Size from './src/attribute/size';
import Text from './src/attribute/text';

@@ -18,7 +18,7 @@ class Scraper {
public defaultAttributes: any = {
'background-image': BackgroundImage,
link: Link,
price: Price,
digit: Digit,
size: Size,
text: Text,
};
12 changes: 6 additions & 6 deletions src/attribute/price.ts → src/attribute/digit.ts
@@ -1,22 +1,22 @@
import { Attribute } from './attribute';

export default class Price implements Attribute {
export default class Digit implements Attribute {
private inputValue: string = null;

public constructor(price: string) {
this.inputValue = price;
public constructor(digit: string) {
this.inputValue = digit;
}

public get value(): number {
return this.normalize(this.inputValue);
}

public normalize(price: string): number {
if (price === undefined || price === null) {
public normalize(digit: string): number {
if (digit === undefined || digit === null) {
return null;
}

const strippedString = price.replace(/\s+/g, '');
const strippedString = digit.replace(/\s+/g, '');
const getValue = /\d+/gm;
const parsedString = strippedString.match(getValue);

2 changes: 1 addition & 1 deletion src/examples/blot-puppeteer.ts
@@ -9,7 +9,7 @@ const contract = {
'description': { 'type': 'text', 'selector': '.title_part2' },
'size': { 'type': 'size', 'selector': '.chiffres_cles span:nth-child(2) strong' },
'link': { 'type': 'link', 'selector': '.title_part1', 'attribute': 'href' },
'price': { 'type': 'price', 'selector': '.prix strong' },
'digit': { 'type': 'digit', 'selector': '.prix strong' },
'photo': { 'type': 'link', 'selector': '.visuel img', 'attribute': 'src' },
},
};
11 changes: 9 additions & 2 deletions src/examples/blot-request.ts
@@ -4,18 +4,25 @@ const contract = {
'itemSelector': '.bloc_annonce_habitat',
'pageQuery': 'page',
'attributes': {
'name': { 'type': 'text', 'selector': '.title_part1' },
'name': { 'type': 'custom-text', 'selector': '.title_part1' },
'description': { 'type': 'text', 'selector': '.title_part2' },
'size': { 'type': 'size', 'selector': '.chiffres_cles span:nth-child(2) strong' },
'link': { 'type': 'link', 'selector': '.title_part1', 'attribute': 'href' },
'price': { 'type': 'price', 'selector': '.prix strong' },
'digit': { 'type': 'digit', 'selector': '.prix strong' },
'photo': { 'type': 'link', 'selector': '.visuel img', 'attribute': 'src' },
},
};

function customText(inputValue) {
this.value = `${inputValue}-custom`;
}

const scraper = new Scraper(
'https://www.blot-immobilier.fr/habitat/achat-location/immobilier/loire-atlantique/nantes/',
contract,
{
'custom-text': customText,
}
);

scraper.scrapePage().then((data) => {
24 changes: 12 additions & 12 deletions test/unit/attribute/price.test.ts
@@ -1,20 +1,20 @@
import * as assert from 'assert';
import Price from '../../../src/attribute/price';
import Digit from '../../../src/attribute/digit';

describe('creates a price attribute', () => {
it('returns the number from a price string', () => {
assert.equal(new Price('345 000 €').value, 345000);
assert.equal(new Price('TAXE FONCIÈRE 423 €.').value, 423);
describe('creates a digit attribute', () => {
it('returns the number from a digit string', () => {
assert.equal(new Digit('345 000 €').value, 345000);
assert.equal(new Digit('TAXE FONCIÈRE 423 €.').value, 423);
});

it('returns zero for an empty price', () => {
assert.equal(new Price('There is no size in this string').value, null);
it('returns null when the string contains no digits', () => {
assert.equal(new Digit('There is no size in this string').value, null);
});

it('returns null if the price is not valid', () => {
assert.equal(new Price(null).value, null);
assert.equal(new Price(undefined).value, null);
assert.equal(new Price('A string with no price').value, null);
assert.equal(new Price(' ').value, null);
it('returns null if the digit is not valid', () => {
assert.equal(new Digit(null).value, null);
assert.equal(new Digit(undefined).value, null);
assert.equal(new Digit('A string with no digit').value, null);
assert.equal(new Digit(' ').value, null);
});
});
2 changes: 1 addition & 1 deletion test/unit/index.test.ts
@@ -12,7 +12,7 @@ const contract = {
scrapeAfterLoading: true,
attributes: {
name: { type: 'text', selector: '[itemprop="name"]' },
price: { type: 'price', selector: '[itemprop="price"]' },
price: { type: 'digit', selector: '[itemprop="price"]' },
},
};
