Web Article Extractor is a PHP library that detects and extracts the primary 'article' content from a web page, detecting and removing the 'clutter' to give you the clean article. Additionally, it will also filter information from the article that can be used for indexing, such as language and keywords.
- Extracts a clean article and headline from a web page quickly.
- Identifies the language of the extracted article.
- Identifies keywords for the extracted article.
- Designed to easily integrate into pipeline or microservice project architectures.
There are two ways to use Web Article Extractor, the first way is to use the provided Docker file (See 'Installation'), this will create an instance that you can start using straight away and is ideal for pipeline architectures, the second way is to add the PHP library into your project through Composer.
To build with Docker execute the build command inside this project's root directory
$ docker build -t zackslash/web-article-extractor .
You should now be able to run the article extractor script with the following command
$ docker run zackslash/web-article-extractor <URL>
Example:
$ docker run zackslash/web-article-extractor http://uk.ign.com/articles/2015/03/19/gabe-newell-discusses-possibility-of-half-life-3
The first step to using Web Article Extractor in PHP is to download Composer:
$ curl -s http://getcomposer.org/installer | php
Now add PHP Web Article Extractor to your project with Composer:
$ php composer.phar require zackslash/php-web-article-extractor
And that's it! Composer will automatically handle the rest.
Alternatively, you can manually add the dependency to composer.json
file...
{
"require": {
"zackslash/php-web-article-extractor": "*"
}
}
... and then install our dependencies using:
$ php composer.phar install
PHP simple example:
<?php
// This file is generated by Composer
require_once 'vendor/autoload.php';
// Extract article directly from a URL
$extractionResult = WebArticleExtractor\Extract::extractFromURL('http://uk.ign.com/articles/2015/03/19/gabe-newell-discusses-possibility-of-half-life-3');
// Display the extracted article in JSON form
echo json_encode($extractionResult);
?>
- PHP >= 5.5.0
To run the unit tests, you'll need to install PHPUnit, once installed, just launch the following command inside this libraries' 'build' directory:
$ phpunit
Parts of PHP Web Article Extractor are based on algorithms from the whitepaper 'Boilerplate detection using Shallow Text Features' and 'Boilerpipe' by Christian Kohlschuetter, Peter Fankhauser, Wolfgang Nejdl
PHP Web Article Extractor implements the Rapid Automatic Keyword Extraction (RAKE) algorithm as described in the book 'Text Mining: Theory and Applications' and the implementation was based on aneesha's open source Python version
The Stop Word dictionary used in this project was pulled from Peter Graham's 'stopwords' repository
PHP Web Article Extractor is released under the MIT License. See the bundled LICENSE file for details.