Currently this ingests XML files that come from TR's DB and outputs SQL that we can load into our DB for analysis.
This block of code creates the schema needed for the output of the ParseXML jar:
-- tear down any previous run (the DELETEs are redundant right before a DROP, but harmless)
delete from x;
delete from subject_hash;
delete from author_country_year;
drop table x;
drop table author_country_year;
drop table subject_hash;
CREATE TABLE x
(
year int,
author varchar(255),
subject_hash bigint, -- bigint so it matches subject_hash.hash below
value_total decimal
);
CREATE TABLE author_country_year
(
author varchar(255),
country varchar(255),
year int
);
CREATE TABLE subject_hash
(
subject varchar(255) UNIQUE,
hash BIGINT UNIQUE
);
CREATE INDEX ac_country_index ON author_country_year (country);
CREATE INDEX ac_year_index ON author_country_year (year);
CREATE INDEX x_year_index ON x (year);
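For reference, the SQL that ParseXML emits is presumably plain INSERTs against these tables. A hypothetical example of the shape (the values are invented, except 'tordo, p'/france/1980, which come from the sample line further down):

INSERT INTO subject_hash (subject, hash) VALUES ('nuclear physics', 123456);
INSERT INTO x (year, author, subject_hash, value_total) VALUES (1980, 'tordo, p', 123456, 2.5);
INSERT INTO author_country_year (author, country, year) VALUES ('tordo, p', 'france', 1980);

Once loaded, the tables join up like this (a sketch; the result has the same shape as year_country_subject_x below):

SELECT x.year, acy.country, x.subject_hash, SUM(x.value_total) AS x
FROM x
JOIN author_country_year acy ON acy.author = x.author AND acy.year = x.year
GROUP BY x.year, acy.country, x.subject_hash;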
The SecondPhase processor uses the following SQL schema:
-- tear down any previous run; note rca_country_year is dropped here but not recreated in this schema
delete from rca_country_year;
drop table rca_country_year;
delete from year_country_subject_x;
drop table year_country_subject_x;
CREATE TABLE year_country_subject_x
(
year int,
country varchar(255),
subject_hash bigint, -- bigint so it matches subject_hash.hash above
x decimal
);
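The teardown above drops rca_country_year, but its schema isn't defined in this file (presumably the processor creates and fills it itself). Assuming rca stands for the usual revealed-comparative-advantage ratio (a country's share of a subject divided by that subject's share of the world total), it could be derived from year_country_subject_x along these lines:

-- sketch only: the real rca_country_year schema/definition isn't shown here
CREATE TABLE rca_country_year AS
SELECT t.year, t.country, t.subject_hash,
       (t.x / c.total) / (s.total / w.total) AS rca
FROM year_country_subject_x t
JOIN (SELECT year, country, SUM(x) AS total
      FROM year_country_subject_x GROUP BY year, country) c
  ON c.year = t.year AND c.country = t.country
JOIN (SELECT year, subject_hash, SUM(x) AS total
      FROM year_country_subject_x GROUP BY year, subject_hash) s
  ON s.year = t.year AND s.subject_hash = t.subject_hash
JOIN (SELECT year, SUM(x) AS total
      FROM year_country_subject_x GROUP BY year) w
  ON w.year = t.year;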
The SecondPhase reads in files named year.xml. It also needs an author_country DB saved at /tmp/author_country_year.PipeDelimitedFile in the format
tordo, p|france|1980
(i.e. author|country|year), which can be produced after the first phase has run and populated the author_country_year table.
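One way to produce that pipe-delimited file (assuming a DB where || is the string-concatenation operator) is to dump the table and redirect the result to /tmp/author_country_year.PipeDelimitedFile:

SELECT author || '|' || country || '|' || CAST(year AS varchar(8))
FROM author_country_year;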
The SubSubject processor is specifically used to find correlations between subSubjects by processing their UIDs. The input is a CSV file given as an argument, and the output is an MMA input stream sent to standard out.
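A hypothetical invocation (the jar name is assumed; only the CSV argument and the stdout output are described above):

java -jar SubSubject.jar subSubjects.csv > correlations.m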