madhukarsamak/TACIT_TEMP

This code base is ported from the Turbo Topics Python code written by David M. Blei of Princeton University.
You can find the original code base at: https://github.com/blei-lab/turbotopics
The ported Java code produces exactly the same output as the Python code.

Clone this repository and follow the steps below.

STEPS TO GENERATE TURBOTOPICS:

 STEP 1: Before using this tool to generate TurboTopics, first generate a set of topics, and the significant words in those topics, using LDA.

 STEP 2: Generate the files required to produce TurboTopics. These files are explained, along with the tool's arguments, as follows:

 <corpus_file>: This is a single file which has a list of lines. Each line is a line of text from the same corpus that was used to generate topics through LDA.

 <vocabulary_file>: This is a vocabulary file that lists all unique words used in the corpus, one per line

 <word_topic_assignment_file>: This is a word-to-topic assignment file. It is generated using the words from the topics produced by LDA.
 The format of the file is as follows:
 for every line in the corpus file, we generate a line with the word-to-topic mapping, i.e.,
 <line-number> <vocab-index-of-the-word>:<topic-to-which-this-word-is-assigned>. For example:
 Let us assume the vocabulary list to be:
 slow
 and
 study
 wins
 the
 race
 Let us assume the corpus is
 [slow wins]
 [study wins]
 [lose race and wins race]
 Let the word-topic assignment obtained from LDA be
 slow - 1
 and - 3
 study - 2
 wins - 1
 the - 2
 race - 4
 Note: topics are represented as integers (i.e., topic0 is 0, topic1 is 1), and each word is assigned to a single topic: the most probable topic for that word.
 With all of the above, the generated assignment file will be:
 0 0:1 3:1
 1 2:2 3:1
 2 5:4 1:3 3:1 5:4
 Note that "lose" does not appear in the vocabulary, so it is skipped on line 2. A sketch of how this file can be produced appears after the argument list.

 <output_path>: The output path in which the files for each topic will be generated

 <prefix>: The prefix for the output files

 <number_of_topics>: The number of topics to process. This should be the same as the number of topics generated by LDA.

 <pvalue>: the p-value threshold (a value of type Double) used in the collocation significance test

 <use_permutation>: "true" if the likelihood ratio score should be computed using permutations, "false" otherwise

 <min_count>: the minimum count of word occurrences for a word to be considered
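
 For illustration, here is a minimal, self-contained Java sketch of how the word-topic assignment file described above could be produced. The file names and the wordTopic map here are hypothetical; in this repository, the script generate_word_assignment.py (described below) derives this map from the LDA output by picking the most probable topic for each word.

   import java.io.IOException;
   import java.io.PrintWriter;
   import java.nio.file.Files;
   import java.nio.file.Paths;
   import java.util.HashMap;
   import java.util.List;
   import java.util.Map;

   // Sketch only: writes the word-to-topic assignment format described above.
   public class AssignmentFileSketch {
       public static void main(String[] args) throws IOException {
           // Vocabulary: one word per line; its line number is its vocab index.
           List<String> vocab = Files.readAllLines(Paths.get("vocabulary_file"));
           Map<String, Integer> vocabIndex = new HashMap<>();
           for (int i = 0; i < vocab.size(); i++) vocabIndex.put(vocab.get(i), i);

           // Hypothetical word -> most-probable-topic map, as obtained from LDA.
           Map<String, Integer> wordTopic = new HashMap<>();
           wordTopic.put("slow", 1);  wordTopic.put("and", 3);
           wordTopic.put("study", 2); wordTopic.put("wins", 1);
           wordTopic.put("the", 2);   wordTopic.put("race", 4);

           List<String> corpus = Files.readAllLines(Paths.get("corpus_file"));
           try (PrintWriter out = new PrintWriter("word_topic_assignment_file")) {
               for (int line = 0; line < corpus.size(); line++) {
                   StringBuilder sb = new StringBuilder(String.valueOf(line));
                   for (String word : corpus.get(line).split("\\s+")) {
                       Integer idx = vocabIndex.get(word);
                       if (idx == null) continue; // out-of-vocabulary words (e.g. "lose") are skipped
                       sb.append(' ').append(idx).append(':').append(wordTopic.get(word));
                   }
                   out.println(sb);
               }
           }
       }
   }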


 To compile and run, enter the "src/main/java/turbotopics" directory and type:
 javac LDAtopics.java
 java LDAtopics <corpus_file> <word_topic_assignment_file> <vocabulary_file> <output_path> <number_of_topics> <min_count> <pvalue> <use_permutation> <prefix>

 There is also a convenience script called "execute_tt.sh", which builds a JAR from the code and uses it. To use the script, enter the "TACIT_TEMP" directory and type:
 ./execute_tt.sh <corpus_file> <word_topic_assignment_file> <vocabulary_file> <output_path> <number_of_topics> <min_count> <pvalue> <use_permutation> <prefix>
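
 For example, with hypothetical file names (corpus.txt, assignment.txt, vocab.txt), 10 topics, a minimum count of 5, and a p-value of 0.001, an invocation might look like:
 ./execute_tt.sh corpus.txt assignment.txt vocab.txt ./output 10 5 0.001 true topic_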


STEPS TO GENERATE NGRAMS:

ComputeNGrams is the main class that generates NGrams for a given corpus.
It recursively finds collocations in the corpus and writes the marginal counts out to a specified file.

The arguments required to generate NGrams are as follows:
 
  <corpus_file>: This is a single file which has a list of lines. Each line is a line of text from the files in the corpus
 
  <output_path>: The output path in which the file containing the NGrams and marginal counts will be generated
 
  <pvalue>: the p-value threshold (a value of type Double) used in the collocation significance test
 
  <use_permutation>: "true" if the likelihood ratio score should be computed using permutations, "false" otherwise
 
  <min_count>: minimum count of word occurrences to be considered

  <prefix>: The prefix for the output files
 
  To compile and run, enter the "src/main/java/turbotopics" directory and type:
  javac ComputeNGrams.java
  java ComputeNGrams <corpus_file> <pvalue> <use_permutation> <output_path> <min_count> <prefix>

  There is also a convenience script called "execute_ngrams.sh", which builds a JAR from the code and uses it. To use the script, enter the "TACIT_TEMP" directory and type:
   ./execute_ngrams.sh <corpus_file> <pvalue> <use_permutation> <output_path> <min_count> <prefix>
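
  For example, with a hypothetical corpus file, an invocation might look like:
   ./execute_ngrams.sh corpus.txt 0.001 false ./output 5 ngrams_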



There are several other convenience scripts available:

1. generate_corpus.py :-
This script is used to generate the single corpus file that is required to generate turbotopics or ngrams.
It takes a directory path as input and prints each line from the files in that directory; the output can be
redirected to a file. (A Java sketch of this logic appears after this list.)
python generate_corpus.py <directory_path_for_corpus_files>  >  <output_path/corpus_file_name>

2. generate_word_assignment.py :-
This script is used to generate the word-topic assignment file that is required to generate turbotopics.
It takes the corpus_file generated above and the TACIT-LDA-generated word_weight.txt file as input
and generates the word-topic assignment.
python generate_word_assignment.py <corpus_file> <word_weight_file>  >  <word_topic_assignment_file>

3. generate_vocab.py :-
This script is used to generate the vocabulary file that is required to generate turbotopics.
It takes the TACIT-LDA-generated word_weight.txt file as input and generates a file with the list of words (the vocabulary).
python generate_vocab.py <word_weight_file>  > <vocabulary_file>

4. compare.py :-
This script is used to compare the output of the TurboTopics and NGrams Python code with that of the Java code.
It prints each file name first, and then lists the differences, if any. (A Java sketch of this comparison appears after this list.)
python compare.py <directory_of_python_generated_files> <directory_of_java_generated_files>
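
For illustration, here is a minimal Java sketch of the corpus-generation logic described in item 1. It is an assumption of what generate_corpus.py does, based on its description above, not code from the repository:

  import java.io.IOException;
  import java.nio.file.DirectoryStream;
  import java.nio.file.Files;
  import java.nio.file.Path;
  import java.nio.file.Paths;

  // Sketch: print every line of every file in the given directory to stdout,
  // which can then be redirected to a single corpus file.
  public class GenerateCorpusSketch {
      public static void main(String[] args) throws IOException {
          try (DirectoryStream<Path> dir = Files.newDirectoryStream(Paths.get(args[0]))) {
              for (Path file : dir) {
                  if (!Files.isRegularFile(file)) continue; // skip subdirectories
                  for (String line : Files.readAllLines(file)) {
                      System.out.println(line);
                  }
              }
          }
      }
  }

And here is a sketch of the line-by-line comparison performed by compare.py, assuming that files with the same name exist in both directories:

  import java.io.IOException;
  import java.nio.file.DirectoryStream;
  import java.nio.file.Files;
  import java.nio.file.Path;
  import java.nio.file.Paths;
  import java.util.List;

  // Sketch: for each Python-generated file, print its name, then print any
  // lines that differ from the Java-generated file of the same name.
  public class CompareSketch {
      public static void main(String[] args) throws IOException {
          try (DirectoryStream<Path> dir = Files.newDirectoryStream(Paths.get(args[0]))) {
              for (Path pyFile : dir) {
                  if (!Files.isRegularFile(pyFile)) continue;
                  System.out.println(pyFile.getFileName());
                  List<String> py = Files.readAllLines(pyFile);
                  List<String> jv = Files.readAllLines(Paths.get(args[1]).resolve(pyFile.getFileName()));
                  for (int i = 0; i < Math.max(py.size(), jv.size()); i++) {
                      String a = i < py.size() ? py.get(i) : "<missing>";
                      String b = i < jv.size() ? jv.get(i) : "<missing>";
                      if (!a.equals(b)) System.out.println("  line " + (i + 1) + ": " + a + " | " + b);
                  }
              }
          }
      }
  }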




Other Details:

How the porting was performed:
The porting was performed using a simple, understandable methodology.

1: First off, I started with the easiest part of all - the classes. It was easy to recognize which part of the code should be a class and which part should be a method,
so I created all the classes first.

2: I went through every line of code in those classes to understand how the Python implementation works.

3: I created methods, and variables inside those methods, with exactly the same names, so that the two code bases are easy to compare.

4: The biggest challenge was to understand the underlying type of every object in Python, since Java is statically typed and it is important to know what type to serialize and deserialize to.
Once the underlying types became clear, each such object was coded in Java as a generic "Object", and whenever the underlying value was required, a plain type cast recovered it (see the sketch after this list).

5. For every lambda function used in Python, a similar lambda using Function<T,T>, Consumer<T>, or BiFunction<T,T,T> was used in Java (also illustrated in the sketch after this list). This helped keep the code very close to
that of the Python code.

6. Lastly, testing was done by comparing every line of every output file generated for the same set of inputs.

7. This increased confidence in the code. I also observed that the Java code runs considerably faster than the Python code.
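
For illustration, here is a minimal, self-contained Java sketch of the two patterns described in points 4 and 5. It is illustrative only, not code from the repository:

  import java.util.HashMap;
  import java.util.Map;
  import java.util.function.BiFunction;
  import java.util.function.Consumer;
  import java.util.function.Function;

  // Sketch of the porting patterns from points 4 and 5 above.
  public class PortingPatternsSketch {
      public static void main(String[] args) {
          // Point 4: store values as the generic Object type, mirroring Python's
          // dynamic typing, and cast back to the underlying type at the point of use.
          Map<String, Object> node = new HashMap<>();
          node.put("count", 2);                                // underlying type: Integer
          node.put("children", new HashMap<String, Object>()); // underlying type: nested map

          int count = (Integer) node.get("count");             // a plain cast recovers the value
          @SuppressWarnings("unchecked")
          Map<String, Object> children = (Map<String, Object>) node.get("children");

          // Point 5: Python lambdas map onto Java functional interfaces.
          Function<Integer, Integer> doubler = x -> x * 2;              // Python: lambda x: x * 2
          BiFunction<Integer, Integer, Integer> add = (x, y) -> x + y;  // Python: lambda x, y: x + y
          Consumer<String> show = System.out::println;                  // Python: lambda s: print(s)

          show.accept(count + " " + children.size() + " "
                  + doubler.apply(21) + " " + add.apply(1, 2));
      }
  }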


Testing:

Since I am not from a machine learning background, I do not have an in-depth understanding of how LDA works or how TurboTopics or NGram generation works.
But I had a general idea about LDA, TurboTopics, and NGram generation. With that, I did the following:

1. Downloaded the crawl data from us_congress_crawler: around 800 files for "Barack-Obama".

2. Used a smaller set of files, around 5 or 10, and generated the word-weight file using the LDA plugin. (Only sets of 5 or 10 files were used, since LDA takes an extremely long time when more files are
selected.)

3. Used the same set of files to generate the single corpus file in the format mentioned in the steps above.

4. Similarly generated the word-topic-assignment file, as explained above.

5. Similarly generated the vocabulary file, as explained above.

6. With this set of files, I executed TurboTopics on the same inputs using both the Python code and the Java code, and used compare.py to compare the outputs.
I did this for various combinations of inputs: different pvalues, with and without permutations for the likelihood ratio score, and different numbers of topics.
This comparison helped me fix a couple of bugs that became very evident when the outputs were compared.

7. I repeated the procedure for different sets of files.

8. After each round of bug fixes, I found that the outputs matched perfectly (by string comparison of the outputs), including the print statements.

9. This verifies that the Java code works exactly the way the Python code works.




About

This repository is temporary, until the code is merged into the main TACIT.
