This repository contains a Hadoop MapReduce program to extract all the emails from the Common Crawl public datasets.
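As a rough illustration of the kind of extraction involved (not necessarily how this repository's mapper does it), the sketch below pulls email-like strings out of a block of text with a regular expression. The class name `EmailExtractor` and the pattern are assumptions made for this example; the pattern is deliberately simple and does not cover every valid address.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of this repository.
class EmailExtractor {
    // Simplified pattern (an assumption): local part, "@", domain, dot, TLD of 2+ letters.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Returns every substring of the text that matches the pattern, in order of appearance.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = EMAIL.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(extract("Contact alice@example.com or bob.smith@mail.example.org."));
    }
}
```

In a MapReduce job, logic like this would typically run inside the map function on each record's text, with each match emitted as an output key.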
- You need an AWS account
- Create awsAccessKeyId and awsSecretAccessKey
- Create an S3 bucket
- Create an Amazon EMR cluster to run MapReduce
- Check out this repository locally
- Import it into Eclipse as an existing Maven project
- Run
mvn install
- Build the project using
mvn package
or right-click pom.xml and run it as a Maven build
- Upload the standalone built jar to the S3 bucket
- Add a MapReduce step to the Amazon EMR cluster using the custom jar option. Browse to the jar in the S3 bucket and add it
- Add the arguments as explained below
- Run the MapReduce step
The program requires 4 arguments to run:
- awsAccessKeyId
- awsSecretAccessKey
- Common Crawl WARC S3n path
- mapreduce output path
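The four arguments could be validated and mapped to job settings roughly as in the sketch below. This is a minimal sketch and may differ from the repository's actual driver: the class name `CrawlJobArgs` and the `input.path` / `output.path` keys are hypothetical, while `fs.s3n.awsAccessKeyId` and `fs.s3n.awsSecretAccessKey` are the standard Hadoop property names for s3n credentials. A plain `Map` stands in for Hadoop's `Configuration` so the example runs without Hadoop on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of this repository.
class CrawlJobArgs {
    // Checks that exactly 4 arguments were given and maps them to named settings.
    static Map<String, String> parse(String[] args) {
        if (args.length != 4) {
            throw new IllegalArgumentException(
                "Usage: <awsAccessKeyId> <awsSecretAccessKey> <input-path> <output-path>");
        }
        Map<String, String> settings = new LinkedHashMap<>();
        // Standard Hadoop property names for the s3n filesystem credentials.
        settings.put("fs.s3n.awsAccessKeyId", args[0]);
        settings.put("fs.s3n.awsSecretAccessKey", args[1]);
        // Hypothetical keys, used only in this sketch.
        settings.put("input.path", args[2]);
        settings.put("output.path", args[3]);
        return settings;
    }

    public static void main(String[] args) {
        try {
            // Print only the setting names so the secret key is never echoed.
            parse(args).keySet().forEach(System.out::println);
        } catch (IllegalArgumentException e) {
            System.err.println(e.getMessage());
        }
    }
}
```

In a real driver, the two credential values would be set on the job's Hadoop configuration and the two paths passed to the job's input and output formats.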
To run all WARC files under an S3 directory:
<awsAccessKeyId> <awsSecretAccessKey> s3n://commoncrawl/<segment-path>/*.warc.gz s3n://<bucket-name>/output
Or, to run only a single WARC file:
<awsAccessKeyId> <awsSecretAccessKey> s3n://commoncrawl/<segment-path>/xxxxx.warc.wet.gz s3n://<bucket-name>/output
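Note that the two examples above use different file suffixes (`.warc.gz` and `.warc.wet.gz`), and a wildcard ending in `.warc.gz` does not select `.warc.wet.gz` files. The sketch below checks this with Java's built-in glob matcher, used here as a stand-in for Hadoop's path globbing on the assumption that both treat `*` as "any run of characters within a file name". The class name `WarcGlobCheck` is hypothetical.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// Hypothetical helper, not part of this repository.
class WarcGlobCheck {
    // Returns true if the file name matches the glob pattern (java.nio glob syntax).
    static boolean matches(String glob, String fileName) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(fileName));
    }

    public static void main(String[] args) {
        System.out.println(matches("*.warc.gz", "xxxxx.warc.gz"));
        System.out.println(matches("*.warc.gz", "xxxxx.warc.wet.gz"));
        System.out.println(matches("*.warc.wet.gz", "xxxxx.warc.wet.gz"));
    }
}
```

If the files in the chosen segment end in `.warc.wet.gz`, the wildcard in the input path should end in the same suffix.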