Email Extractor from Common Crawl dataset

This repository contains a Hadoop MapReduce program that extracts all email addresses from the Common Crawl public datasets.
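
As a rough illustration of the extraction step, the sketch below pulls email-like strings out of a block of text with a regular expression. It is a minimal sketch under assumptions, not this repository's implementation: the class name `EmailRegexSketch`, the method `extractEmails`, and the pattern itself are hypothetical, and the pattern is deliberately simple rather than a full RFC 5322 matcher.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not taken from this repository.
public class EmailRegexSketch {
    // Simplified pattern (an assumption): local part, "@", domain, and a TLD of 2+ letters.
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Returns every substring of the text that matches the pattern, in order of appearance.
    public static List<String> extractEmails(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = EMAIL.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Sample input made up for this sketch.
        System.out.println(extractEmails("Contact alice@example.com or bob@example.org."));
    }
}
```

In a MapReduce job, a mapper could plausibly apply a method like this to each record's text and emit each match as a key, leaving de-duplication to the reducer.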

Prerequisites

An AWS account
An awsAccessKeyId and awsSecretAccessKey pair
An S3 bucket
An Amazon EMR cluster to run the MapReduce job

How to run in Amazon EMR

Check out this repository locally
Import it into Eclipse as an existing Maven project
Run mvn install
Build the project using mvn package, or right-click pom.xml and run it as a Maven build
Upload the standalone build JAR to the S3 bucket
Add a MapReduce step to the Amazon EMR cluster using the custom JAR option, then browse to and select the JAR from the S3 bucket
Add the arguments as explained below
Run the MapReduce step

Arguments

The program requires 4 arguments, in this order:

awsAccessKeyId
awsSecretAccessKey
Common Crawl WARC s3n path
MapReduce output path
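
The four positional arguments above could be validated and mapped to named settings along these lines. This is a hedged sketch only: the class `JobArgsSketch`, the method `parse`, and the `input.path` / `output.path` keys are hypothetical. `fs.s3n.awsAccessKeyId` and `fs.s3n.awsSecretAccessKey` are the standard Hadoop property names for s3n credentials, but whether this project sets them in this way is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; class, method, and path keys are not from this repository.
public class JobArgsSketch {
    // Maps the four positional arguments to named settings, rejecting any other count.
    public static Map<String, String> parse(String[] args) {
        if (args.length != 4) {
            throw new IllegalArgumentException(
                "Usage: <awsAccessKeyId> <awsSecretAccessKey> <warc-input-path> <output-path>");
        }
        Map<String, String> settings = new LinkedHashMap<>();
        // Standard Hadoop property names for s3n:// credentials.
        settings.put("fs.s3n.awsAccessKeyId", args[0]);
        settings.put("fs.s3n.awsSecretAccessKey", args[1]);
        // Hypothetical keys used only in this sketch.
        settings.put("input.path", args[2]);
        settings.put("output.path", args[3]);
        return settings;
    }

    public static void main(String[] args) {
        // Print only the setting names so the secret key is never echoed.
        System.out.println(parse(args).keySet());
    }
}
```

In a real Hadoop driver, the same values would typically be applied to the job's `Configuration` and to its input and output paths instead of a plain map.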

Example

To run all WARC files under an S3 directory:

<awsAccessKeyId> <awsSecretAccessKey> s3n://commoncrawl/<segment-path>/*.warc.gz s3n://<bucket-name>/output

or

To run a single WARC file only:

<awsAccessKeyId> <awsSecretAccessKey> s3n://commoncrawl/<segment-path>/xxxxx.warc.wet.gz s3n://<bucket-name>/output
