Commit

updated README, fixed hard-coded userAgent
betolink committed Mar 2, 2015
1 parent b2cf6e5 commit 460b2a5
Showing 3 changed files with 8 additions and 7 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -5,4 +5,5 @@ conf/slaves
 build/
 runtime/
 logs/
+crawl*
 hadoop.log
11 changes: 5 additions & 6 deletions README.md
@@ -1,14 +1,14 @@
 Nutch Crawler
 =========
-The BCube Crawler is a fork of the Apache Nutch project (version 1.9) tweaked to run on Amazon's ElasticMapReduce and optimized for data and web services discovery.
+The BCube Crawler is a fork of the Apache Nutch project (version 1.9) tweaked to run on Amazon's ElasticMapReduce and optimized for web services and data discovery.
 
 
 Motivation
 ----
 
 Setting up a healthy Hadoop cluster is not always an easy task, and the variables that make an application perform well can add a lot of overhead to a project. ElasticMapReduce is -as its name suggests- a “MapReduce as a service” platform that allows users to create resizable Hadoop clusters and run MapReduce jobs. A key advantage of Amazon’s EMR platform is the flexibility to resize a cluster on the fly and the possibility of using spot instances to dynamically increase computational power at lower cost.
 
-There are some limitations on EMR, like preconfigured MR settings or the fact that it only supports a set of [Hadoop distributions](http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-hadoop-version.html). This, and the fact that the “all-in-one” Crawl class has been deprecated since Nutch 1.8, creates 2 issues if we try to run Nutch using the EMR API.
+There are some limitations on EMR, like preconfigured MR settings or the fact that it only supports a set of [Hadoop distributions](http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-hadoop-version.html). This, and the fact that the “all-in-one” Crawl class has been deprecated since Nutch 1.8, creates issues if we try to run Nutch using the EMR API.
 
 In order to make use of the EMR API and automate our crawls we need to send a jar and a main class as the entry point. This project adds back the main Crawl class deprecated in Nutch 1.6 and offers 3 important features when crawling using EMR
@@ -49,15 +49,14 @@ emr=true
 First we need to clone Nutch from NSIDC's repo and switch to the aws-emr branch.
 
 ```sh
-git clone https://github.com/nsidc/nutch.git
-cd nutch
-git checkout aws-emr
+git clone https://github.com/b-cube/nutch-crawler.git
+cd nutch-crawler
 ```
 
 Then we need to update the nutch-site.xml configuration file with the appropriate values; see this [Nutch tutorial](https://groups.drupal.org/node/105774) and the [Nutch FAQ](http://wiki.apache.org/nutch/FAQ). Once we modify nutch-site.xml, we proceed to compile Nutch with Ant.
 Before we compile Nutch we need to set JAVA_HOME to a working JRE; we can use `readlink -f $(which java)` to find our Java path.
 
 ```sh
 export JAVA_HOME=`readlink -f $(which java)`
 ant clean runtime
 ```

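One caveat about the JAVA_HOME command in the diff above: `readlink -f $(which java)` resolves to the java *binary* itself (e.g. `/usr/lib/jvm/java-7-openjdk/bin/java`, a hypothetical path), while JAVA_HOME conventionally points at the JDK/JRE root directory. A hedged variant that strips the trailing `bin/java`:

```shell
# Resolve the real java binary behind any symlinks,
# then drop the trailing /bin/java to get the JDK/JRE root
# (e.g. /usr/lib/jvm/java-7-openjdk).
JAVA_BIN=$(readlink -f "$(which java)")
export JAVA_HOME=${JAVA_BIN%/bin/java}
echo "$JAVA_HOME"
```

Whether the bare binary path works for Ant depends on the environment; the suffix-stripped form matches the usual JAVA_HOME convention.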
@@ -116,6 +116,7 @@ public void setConf(Configuration conf) {
     this.proxyUsername = conf.get("http.proxy.username", "");
     this.proxyPassword = conf.get("http.proxy.password", "");
     this.proxyRealm = conf.get("http.proxy.realm", "");
+    this.userAgent = conf.get("http.agent.name");
     agentHost = conf.get("http.agent.host", "");
     authFile = conf.get("http.auth.file", "");
     configureClient();
@@ -182,7 +183,7 @@ private void configureClient() {
     HostConfiguration hostConf = client.getHostConfiguration();
     ArrayList<Header> headers = new ArrayList<Header>();
     // Set the User Agent in the header
-    headers.add(new Header("User-Agent", "Bcube-test-crawler"));
+    headers.add(new Header("User-Agent", this.userAgent));
     // prefer English
     headers.add(new Header("Accept-Language", acceptLanguage));
     // prefer UTF-8
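With the hard-coded "Bcube-test-crawler" string removed, the User-Agent header is now read from the `http.agent.name` property, which Nutch picks up from `conf/nutch-site.xml`. A minimal fragment, placed inside the file's `<configuration>` element (the agent name value here is hypothetical):

```xml
<property>
  <name>http.agent.name</name>
  <value>my-bcube-crawler</value>
</property>
```

Note that `conf.get("http.agent.name")` is called here without a default value, so leaving the property unset would yield a null agent string.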
