Commit

updated README, fixed hard-coded userAgent
betolink committed Mar 2, 2015
1 parent b2cf6e5 commit 460b2a5
Showing 3 changed files with 8 additions and 7 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -5,4 +5,5 @@ conf/slaves
 build/
 runtime/
 logs/
+crawl*
 hadoop.log
11 changes: 5 additions & 6 deletions README.md
@@ -1,14 +1,14 @@
 Nutch Crawler
 =========
-The BCube Crawler is a fork of the Apache Nutch project (version 1.9) tweaked to run on Amazon's ElasticMapReduce and optimized for data and web services discovery.
+The BCube Crawler is a fork of the Apache Nutch project (version 1.9) tweaked to run on Amazon's ElasticMapReduce and optimized for web services and data discovery.
 
 
 Motivation
 ----
 
 Setting up a healthy Hadoop cluster is not always an easy task, and the variables that make an application perform well can add a lot of overhead to a project. ElasticMapReduce is -as its name suggests- a “MapReduce as a service” platform that allows users to create resizable Hadoop clusters and run MapReduce jobs. A key advantage of Amazon’s EMR platform is the flexibility to resize a cluster on the fly and the possibility of using spot instances to dynamically increase computational power at lower cost.
 
-There are some limitations on EMR, like preconfigured MR settings or the fact that it only supports a set of [Hadoop distributions](http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-hadoop-version.html). This, and the fact that the “all-in-one” Crawl class has been deprecated since Nutch 1.8, creates 2 issues if we try to run Nutch using the EMR API.
+There are some limitations on EMR, like preconfigured MR settings or the fact that it only supports a set of [Hadoop distributions](http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-hadoop-version.html). This, and the fact that the “all-in-one” Crawl class has been deprecated since Nutch 1.8, creates issues if we try to run Nutch using the EMR API.
 
 In order to make use of the EMR API and automate our crawls we need to send a jar and a main class as the entry point. This project adds back the main Crawl class deprecated in Nutch 1.6 and offers 3 important features when crawling using EMR
@@ -49,15 +49,14 @@ emr=true
 First we need to clone Nutch from NSIDC's repo and switch to the aws-emr branch.
 
 ```sh
-git clone https://github.com/nsidc/nutch.git
-cd nutch
-git checkout aws-emr
+git clone https://github.com/b-cube/nutch-crawler.git
+cd nutch-crawler
 ```
 
 Then we need to update the nutch-site.xml configuration file with the appropriate values; see this [Nutch tutorial](https://groups.drupal.org/node/105774) and the [Nutch FAQ](http://wiki.apache.org/nutch/FAQ). Once we modify nutch-site.xml, we proceed to compile Nutch with Ant.
 Before we compile Nutch we need to set JAVA_HOME to a working JRE; we can use `readlink -f $(which java)` to find our Java path.
 
 ```sh
 export JAVA_HOME=`readlink -f $(which java)`
 ant clean runtime
 ```

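One caveat about the JAVA_HOME command in the diff above: `readlink -f $(which java)` resolves to the java *binary* itself (e.g. `/usr/lib/jvm/java-7-openjdk/bin/java`, a hypothetical path), while JAVA_HOME conventionally points at the JDK/JRE root directory. A hedged variant that strips the trailing `bin/java`:

```shell
# Resolve the real java binary behind any symlinks,
# then drop the trailing /bin/java to get the JDK/JRE root
# (e.g. /usr/lib/jvm/java-7-openjdk).
JAVA_BIN=$(readlink -f "$(which java)")
export JAVA_HOME=${JAVA_BIN%/bin/java}
echo "$JAVA_HOME"
```

Whether the bare binary path works for Ant depends on the environment; the suffix-stripped form matches the usual JAVA_HOME convention.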
@@ -116,6 +116,7 @@ public void setConf(Configuration conf) {
     this.proxyUsername = conf.get("http.proxy.username", "");
     this.proxyPassword = conf.get("http.proxy.password", "");
     this.proxyRealm = conf.get("http.proxy.realm", "");
+    this.userAgent = conf.get("http.agent.name");
     agentHost = conf.get("http.agent.host", "");
     authFile = conf.get("http.auth.file", "");
     configureClient();
@@ -182,7 +183,7 @@ private void configureClient() {
     HostConfiguration hostConf = client.getHostConfiguration();
     ArrayList<Header> headers = new ArrayList<Header>();
     // Set the User Agent in the header
-    headers.add(new Header("User-Agent", "Bcube-test-crawler"));
+    headers.add(new Header("User-Agent", this.userAgent));
     // prefer English
     headers.add(new Header("Accept-Language", acceptLanguage));
     // prefer UTF-8
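With the hard-coded "Bcube-test-crawler" string removed, the User-Agent header is now read from the `http.agent.name` property, which Nutch picks up from `conf/nutch-site.xml`. A minimal fragment, placed inside the file's `<configuration>` element (the agent name value here is hypothetical):

```xml
<property>
  <name>http.agent.name</name>
  <value>my-bcube-crawler</value>
</property>
```

Note that `conf.get("http.agent.name")` is called here without a default value, so leaving the property unset would yield a null agent string.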
