contract-scraper

With contract-scraper you can scrape an HTML page and return its data in a structured format.

Installation

npm install contract-scraper --save
yarn add contract-scraper

Usage

To scrape a page, you can create a new instance of contract-scraper with these parameters:

import Scraper from 'contract-scraper';

const contract = {
  itemSelector: 'li',
  puppeteer: true,
  attributes: {
    name: {
      type: 'text',
      selector: '.name'
    },
    link: {
      type: 'link',
      selector: 'a',
      attribute: 'href'
    }
  }
}

const puppeteerOptions = {
  headless: false,
}

const scraper = new Scraper('http://website.com', contract, puppeteerOptions)

A scraper can be initialised with custom puppeteer launch options.

A contract accepts the following properties:

itemSelector (string)

A CSS selector for the element to be scraped. The scraper will process all the elements matching this selector.

puppeteer (boolean)

If set to true, contract-scraper will use Puppeteer to load and scrape the page contents.

waitForPageLoadSelector (string)

Puppeteer will wait for this CSS selector to exist in the DOM before scraping the page. Must be used in conjunction with puppeteer: true.
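
As a rough illustration, a contract for a page that renders its list on the client might combine the two options like this. The selectors (.results, .name) are made-up values chosen for the example, not taken from a real site.

```javascript
// Hypothetical contract for a JavaScript-rendered page.
const contract = {
  itemSelector: '.results li',
  // Load the page with Puppeteer
  puppeteer: true,
  // Only start scraping once this element exists in the DOM
  waitForPageLoadSelector: '.results',
  attributes: {
    name: { type: 'text', selector: '.name' },
  },
};
```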

attributes (object)

Defines the data to scrape for each item.

Each attribute matches an HTML element to scrape. The attribute type defines how data will be extracted from the element and how it is formatted in the final output. For example, you can use one of the in-built types to extract a number from an element:

<ul>
  <li>
    <div class="name">Iron man</div>
    <div class="price">100 euros</div>
  </li>
  <li>
    <div class="name">Captain America</div>
    <div class="price">500 euros</div>
  </li>
</ul>

const contract = {
  itemSelector: 'li',
  attributes: {
    name: {
      type: 'text',
      selector: '.name',
    },
    price: {
      type: 'number',
      selector: '.price',
    },
  },
};

const scraper = new Scraper('http://characters.com', contract);

scraper.scrapePage().then(items => {
  console.log(items);
  // [
  //   {
  //     name: 'Iron man',
  //     price: 100
  //   },
  //   {
  //     name: 'Captain America',
  //     price: 500
  //   }
  // ]
});
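
In the example above, the number type turns "100 euros" into 100. The sketch below shows one way such an extraction could work. The extractNumber helper is hypothetical and is not contract-scraper's actual implementation; it ignores thousands separators and locale-specific formats.

```javascript
// Hypothetical helper, not part of contract-scraper's API.
// Returns the first number found in a string (optional minus sign and
// decimal part), or null when the string contains no digits.
function extractNumber(text) {
  const match = String(text).match(/-?\d+(?:\.\d+)?/);
  return match ? Number(match[0]) : null;
}

console.log(extractNumber('100 euros')); // 100
console.log(extractNumber('no price')); // null
```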

Each attribute can have the following properties:

  • name (string) - A label for this attribute for the final output

  • selector (string) - The CSS selector for the element (scoped to itemSelector).

  • type (string) - A custom type, or one of the in-built types, which return:

    • background-image: A background-image url from a style string
    • link: An absolute URL
    • number: A number
    • size: A number for size in m².
    • text: Inner text of the element
  • attribute (optional) (string)

    The name of the HTML attribute to scrape data from. E.g. for an element:

    <a href="http://linktoscrape">Homepage</a>
      {
        name: 'URL',
        type: 'link',
        selector: 'a',
        attribute: 'href'
      }

    If attribute is not specified, the type receives the element's innerText by default.

  • data (optional) (object) - If you want to scrape HTML data attributes you can do it in two ways:

    • Directly scraping a data attribute:
      <div data-country="Australia"></div>
      {
        name: 'Country',
        type: 'text',
        selector: 'data-country',
        data: { name: 'country' }
      }
      This will return "Australia" in your list of results.
    • For scraping a JSON value inside a data attribute:
      <div data-price='{"currency": "aud"}'></div>
      {
        name: 'Price',
        type: 'text',
        selector: 'data-price',
        data: { name: 'price', key: 'currency'}
      }
      This will return "aud" in your list of results.
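
To make the two data modes concrete, here is a minimal sketch of how a data option could be resolved against an element's attributes. The readDataAttribute helper and its plain-object attributes argument are assumptions made for illustration; the library's own lookup may work differently.

```javascript
// Hypothetical helper, not contract-scraper's actual code.
// `attributes` is a plain map of an element's HTML attributes,
// `data` is the `data` option from the contract.
function readDataAttribute(attributes, data) {
  const raw = attributes[`data-${data.name}`];
  if (raw === undefined) return undefined;
  // Without a key, the attribute value is used as-is
  if (!data.key) return raw;
  // With a key, the value is parsed as JSON and one property is picked
  return JSON.parse(raw)[data.key];
}

console.log(readDataAttribute({ 'data-country': 'Australia' }, { name: 'country' }));
// Australia
console.log(
  readDataAttribute({ 'data-price': '{"currency": "aud"}' }, { name: 'price', key: 'currency' })
);
// aud
```

JSON.parse only accepts strict JSON (double-quoted keys and strings), so this sketch would throw on a looser object-literal style value.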

Nested attributes

It's also possible to scrape nested attributes, like a list inside an item:

<ul class="friends">
  <li>
    <span>Spiderman</span>
    <ul>
      <li><strong>Iron</strong><em>Man</em></li>
      <li><strong>Captain</strong><em>America</em></li>
    </ul>
  </li>
</ul>

The contract:

{
  "itemSelector": ".friends > li",
  "attributes": {
    "name": { "type": "text", "selector": "span" },
    "friends": {
      "itemSelector": "ul li",
      "attributes": {
        "firstName": { "type": "text", "selector": "strong" },
        "lastName": { "type": "text", "selector": "em" }
      }
    }
  }
}

This returns each item's friends as an array. The nested attributes can use any type:

[
  {
    name: 'Spiderman',
    friends: [
      { firstName: 'Iron', lastName: 'Man' },
      { firstName: 'Captain', lastName: 'America' },
    ],
  },
];

Custom attribute types

In addition to the in-built attribute types, you can provide your own when you create a new instance of the scraper. A custom attribute type must be a class or a constructor function whose instances expose a value property. Its constructor receives the innerText of the matching element as a string, which you can then format however you like.

For example, to extract a list of tags and format them as an array:

<ul>
  <li>
    <div class="name">Australia</div>
    <div class="tags">spiders,vegemite,scorching,heat</div>
  </li>
</ul>

import Scraper from 'contract-scraper';

const contract = {
  itemSelector: 'li',
  attributes: {
    countryName: {
      type: 'text',
      selector: '.name',
    },
    tags: {
      type: 'list',
      selector: '.tags',
    },
  },
};

// Custom type: receives the element's innerText and exposes the parsed list as `value`
function ListFromString(commaSeparatedString) {
  this.value = commaSeparatedString.split(',');
}

const scraper = new Scraper('http://countries.com', contract, {
  list: ListFromString,
});

scraper.scrapePage().then(items => {
  console.log(items);
  // [
  //   {
  //     countryName: 'Australia',
  //     tags: [ 'spiders', 'vegemite', 'scorching', 'heat' ]
  //   }
  // ]
});
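
The same kind of type could also be written as a class. This is a sketch that assumes, as described above, that the scraper constructs the type with the element's innerText and then reads value from the instance. The TagList name is hypothetical.

```javascript
// Sketch of a class-based custom type (hypothetical name).
class TagList {
  constructor(commaSeparatedString) {
    this.raw = commaSeparatedString;
  }

  // Split on commas and trim stray whitespace around each tag
  get value() {
    return this.raw.split(',').map(tag => tag.trim());
  }
}

console.log(new TagList('spiders, vegemite,scorching,heat').value);
// [ 'spiders', 'vegemite', 'scorching', 'heat' ]
```

It would be registered the same way as the function version, for example as { list: TagList }.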

Parsing JSON inside script tags

Sometimes you may want to extract values from inside a script tag on the page. For the moment, contract-scraper only supports parsing JSON. For example:

<html>
  <head>
    <title>Page with a script tag</title>
  </head>
  <body>
    <script type="application/ld+json" id="info">
      {
        "characters": [
          {
            "name": "Jon Snow",
            "friends": [
              { "firstName": "Sansa", "lastName": "Stark" },
              { "firstName": "Bran", "lastName": "Stark" },
              { "firstName": "Arya", "lastName": "Stark" }
            ],
            "photo": "http://images.com/jonsnow",
            "price": {
              "amount": "12345 dollars"
            }
          },
          {
            "name": "Ned Stark",
            "friends": [
              { "firstName": "Sansa", "lastName": "Stark" },
              { "firstName": "Bobby", "lastName": "B" },
              { "firstName": "Little", "lastName": "finger" }
            ],
            "photo": "http://images.com/nedstark",
            "price": {
              "amount": "6789 euros"
            }
          }
        ]
      }
    </script>
  </body>
</html>

const contract = {
  scriptTagSelector: '#info',
  itemSelector: 'characters',
  attributes: {
    name: { type: 'text', selector: 'name' },
    friends: {
      itemSelector: 'friends',
      attributes: {
        firstName: { type: 'text', selector: 'firstName' },
        lastName: { type: 'text', selector: 'lastName' },
      },
    },
    photo: { type: 'link', selector: 'photo' },
    price: { type: 'number', selector: 'price.amount' },
  },
};

const scraper = new Scraper('http://characters.com', contract);

scraper.scrapePage().then(items => {
  console.log(items);
  // [
  //   {
  //     "name": "Jon Snow",
  //     "friends": [
  //       {
  //         "firstName": "Sansa",
  //         "lastName": "Stark"
  //       },
  //       {
  //         "firstName": "Bran",
  //         "lastName": "Stark"
  //       },
  //       {
  //         "firstName": "Arya",
  //         "lastName": "Stark"
  //       }
  //     ],
  //     "photo": "http://images.com/jonsnow",
  //     "price": 12345
  //   },
  //   {
  //     "name": "Ned Stark",
  //     "friends": [
  //       {
  //         "firstName": "Sansa",
  //         "lastName": "Stark"
  //       },
  //       {
  //         "firstName": "Bobby",
  //         "lastName": "B"
  //       },
  //       {
  //         "firstName": "Little",
  //         "lastName": "finger"
  //       }
  //     ],
  //     "photo": "http://images.com/nedstark",
  //     "price": 6789
  //   }
  // ]
});
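
In script tag mode the selectors are property paths rather than CSS selectors, including dotted paths such as price.amount. The sketch below shows how such a contract could be walked over already-parsed JSON. getPath and applyContract are hypothetical helpers, not the library's internals, and type formatting is deliberately left out, so price stays a raw string here.

```javascript
// Hypothetical helpers, not contract-scraper's internals.

// Follow a dotted path such as 'price.amount' through nested objects.
function getPath(source, path) {
  return path
    .split('.')
    .reduce((node, key) => (node == null ? undefined : node[key]), source);
}

// Apply a contract to already-parsed JSON. Nested contracts (those with
// their own itemSelector) recurse; leaf values are returned unformatted.
function applyContract(json, contract) {
  const items = getPath(json, contract.itemSelector) || [];
  return items.map(item => {
    const result = {};
    for (const [name, attribute] of Object.entries(contract.attributes)) {
      result[name] = attribute.itemSelector
        ? applyContract(item, attribute)
        : getPath(item, attribute.selector);
    }
    return result;
  });
}

const json = {
  characters: [
    { name: 'Jon Snow', friends: [{ firstName: 'Sansa' }], price: { amount: '12345 dollars' } },
  ],
};

const contract = {
  itemSelector: 'characters',
  attributes: {
    name: { type: 'text', selector: 'name' },
    friends: {
      itemSelector: 'friends',
      attributes: { firstName: { type: 'text', selector: 'firstName' } },
    },
    price: { type: 'number', selector: 'price.amount' },
  },
};

console.log(JSON.stringify(applyContract(json, contract)));
// [{"name":"Jon Snow","friends":[{"firstName":"Sansa"}],"price":"12345 dollars"}]
```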