
How to use Time Matters MultipleDocs

JMendes1995 edited this page Oct 8, 2019 · 1 revision

Please do not change any wiki page without permission from Time-Matters developers.


In this wiki, we will explain:

How to use Time Matters MultipleDocs

Time-Matters-MultipleDocs aims to score temporal expressions found within multiple texts. Given an identified temporal expression it offers the user three scoring options:

  • ByCorpus: retrieves a single, unique score for each temporal expression found in the corpus of documents, regardless of whether it occurs multiple times in different documents; that is, multiple occurrences of a temporal expression in different documents will always return the same score (e.g., 0.92);

  • ByDoc: retrieves multiple (possibly different) scores, one for each occurrence of a temporal expression found in the set of documents; that is, multiple occurrences of a temporal expression in different documents will return multiple (possibly different) scores (e.g., 0.92 for the occurrence of 2019 in document 1; and 0.77 for the occurrence of 2019 in document 2);

  • ByDocSentence: retrieves multiple (possibly different) scores, one for each occurrence of a temporal expression found in a given document; that is, multiple occurrences of a temporal expression in different sentences of a document (e.g., 2019....... 2019) will return multiple (possibly different) scores (e.g., 0.92 for the occurrence of 2019 in sentence 1 of document 1; and 0.77 for the occurrence of 2019 in sentence 2 of document 1).

The first option (ByCorpus) evaluates the score of a given candidate date in the context of a corpus of texts, with regard to all the relevant keywords that it co-occurs with (regardless of whether they are in document 1 or 2). The following example illustrates one such case, in which all the relevant keywords (w1, w2, w3) that co-occur with the temporal expression (d1) are considered in the computation of the temporal score given by the GTE equation, i.e., (Median([IS(d1, w1); IS(d1, w2); IS(d1, w3)])), subject to the constraint that those keywords occur in at least two different documents (to avoid considering keywords that are too specific to a given document):

The second (ByDoc) evaluates the score of a given candidate date with regard to the documents where it occurs, thus taking into account only the relevant keywords of each document (within the search space defined), subject to the constraint that those keywords occur in at least two different documents (to avoid considering keywords that are too specific to a given document). This means that, if 2010 co-occurs with w1 in document 1, only this relevant keyword will be considered to compute the temporal score of 2010 for that particular document. Likewise, if 2010 co-occurs with w2 and w3 in document 2, only these relevant keywords will be considered to compute the temporal score of 2010 for that particular document. We would thus have a temporal score of 2010 for document 1 computed by the GTE equation as (Median([IS(d1, w1)])), and a temporal score of 2010 for document 2 computed as (Median([IS(d1, w2); IS(d1, w3)])):

Finally, the third (ByDocSentence) evaluates the score of a given candidate date with regard to the documents and sentences where it occurs, thus taking into account only the relevant keywords of each sentence of a given document (within the search space defined), subject to the constraint that those keywords occur in at least two different documents (to avoid considering keywords that are too specific to a given document). This means that, if 2010 co-occurs with w1 in sentence 1 of document 1, only this relevant keyword will be considered to compute the temporal score of 2010 for that particular sentence. Likewise, if 2010 co-occurs with w2 and w3 in sentence 2 of document 1, only these relevant keywords will be considered to compute the temporal score of 2010 for that particular sentence of that document. We would thus have a temporal score of 2010 for sentence 1 of document 1 computed by the GTE equation as (Median([IS(d1, w1)])), and a temporal score of 2010 for sentence 2 of document 1 computed as (Median([IS(d1, w2); IS(d1, w3)])).
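All three options reduce to taking the median of the IS (InfoSimba) similarities over a different set of co-occurring keywords. The following is a minimal illustrative sketch of that aggregation step only; the `gte_score` helper and the similarity values are ours, not part of the library's API:

```python
from statistics import median

def gte_score(is_similarities):
    # Sketch of the GTE aggregation: the score of a candidate date is the
    # median of the IS (InfoSimba) similarities between the date and the
    # relevant keywords it co-occurs with inside the chosen search space.
    return median(is_similarities)

# ByCorpus: every keyword co-occurring with d1 anywhere in the corpus
print(gte_score([0.9, 0.8, 0.95]))  # -> 0.9

# ByDoc: only the keywords of one document, e.g. w1 in document 1
print(gte_score([0.2]))             # -> 0.2
```

What changes between ByCorpus, ByDoc, and ByDocSentence is not the formula but which IS values are fed into the median.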

How to work with each option will be explained next. Before that, we explain how to import the libraries and a set of text documents. We suggest you play with your own texts or, alternatively, download the set of 28 documents (MultiDocTexts.zip) that we make available in this repository as a running example. These documents were collected in November 2016 by issuing the query "Boston Marathon Bombing" on the Bing search engine. The Diffbot Article API was then used to collect the full text of the top-50 web pages retrieved by Bing. We ended up with 28 documents, as some of them did not contain any useful text.

Note: if you want to use the following code, don't forget to put the texts under a folder named data/MultiDocTexts.

from Time_Matters_MultipleDocs import Time_Matters_MultipleDocs

import os

path = 'data/MultiDocTexts'
ListOfDocs = []
for file in os.listdir(path):
    with open(os.path.join(path, file), 'r') as f:
        ListOfDocs.append(f.read())

Score


The structure of the score depends on the type of extraction considered: ByCorpus, ByDoc, or ByDocSentence.

ByCorpus

Getting temporal scores over a corpus of documents is possible through the following code: results = Time_Matters_MultipleDocs(ListOfDocs). This configuration assumes "py_heideltime" as the default temporal tagger, "ByCorpus" as the default score_type, and the default parameters of time_matters. In this configuration, a single score will be retrieved for a temporal expression regardless of whether it occurs in different documents.

Running this code, however, will take a considerable amount of time (depending on the PC used), as the Heideltime temporal tagger will be running on top of 28 texts. If you want a quicker (though less effective) solution, you can use a rule-based approach instead (more about this in the Optional Parameters section). Also, letting py_heideltime extract all the possible temporal expressions from the text may become too cumbersome. For that reason, we opt to set the date granularity to year and the document creation time to '2013-04-15' (the date of the Boston Marathon bombings).

results = Time_Matters_MultipleDocs(ListOfDocs, temporal_tagger=['py_heideltime', 'English', 'year', 'news', '2013-04-15'])
#results = Time_Matters_MultipleDocs(ListOfDocs, score_type="ByCorpus", temporal_tagger=['py_heideltime', 'English', 'year', 'news', '2013-04-15'])

The output is a dictionary where the key is the normalized temporal expression and the value is a list with two positions. The first is the score of the temporal expression. The second is a dictionary of the instances of the temporal expression (as they were found in each document). For example, {'2011-01-12': [1.0, {0: ['2011-01-12', '12 January 2011'], 6: ['2011-01-12']}]} means that the normalized temporal expression 2011-01-12 has a score of 1 and occurs twice (first as 2011-01-12, then as 12 January 2011) in document 0 and once (as '2011-01-12') in document 6.
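To make this structure concrete, the following sketch walks over a hand-built dictionary in the ByCorpus format just described (the values are illustrative, not real output of the library):

```python
# Hand-built sample mimicking the ByCorpus output format (illustrative values).
results = {'2011-01-12': [1.0, {0: ['2011-01-12', '12 January 2011'],
                                6: ['2011-01-12']}]}

for date, (score, instances) in results.items():
    n_occurrences = sum(len(forms) for forms in instances.values())
    print(f"{date}: score={score}, "
          f"{n_occurrences} occurrence(s) across docs {sorted(instances)}")
```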

ByDoc

Getting temporal scores by document is possible through the following code. This configuration is set to consider "py_heideltime" as the default temporal tagger, "ByDoc" as the score_type, and the default parameters of time_matters. In this configuration, multiple occurrences of a temporal expression in different documents will return multiple (possibly different) scores (e.g., 0.92 for the occurrence of 2019 in document 1; and 0.77 for the occurrence of 2019 in document 2). Once again, we apply the year granularity to avoid getting too many fine-grained temporal expressions. Yet, you are more than welcome to alternatively run the following code: results = Time_Matters_MultipleDocs(ListOfDocs, score_type='ByDoc').

results = Time_Matters_MultipleDocs(ListOfDocs, score_type='ByDoc', temporal_tagger=['py_heideltime', 'English', 'year', 'news', '2013-04-15'])

The output is a dictionary where the key is the normalized temporal expression and the value is a dictionary where the key is the DocID and the value is a list with two positions. The first is the score of the temporal expression in that particular document. The second is a list of the instances of the temporal expression (as they were found in the text of that particular document). For example, {'2010': {1: [0.2, ['2010']], 5: [0.983, ['2010', '2010']]}} means that the normalized temporal expression 2010 has a score of 0.2 in the document with ID 1, and a score of 0.983 in the document with ID 5 (where it occurs twice).
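A short sketch of how this per-document structure can be traversed, reusing the illustrative values from the example above:

```python
# Hand-built sample mimicking the ByDoc output format (illustrative values).
results = {'2010': {1: [0.2, ['2010']], 5: [0.983, ['2010', '2010']]}}

for date, per_doc in results.items():
    # Find the document where this temporal expression scores highest.
    best_doc = max(per_doc, key=lambda doc_id: per_doc[doc_id][0])
    score, instances = per_doc[best_doc]
    print(f"{date} scores highest in doc {best_doc}: "
          f"{score} ({len(instances)} occurrence(s))")
```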

ByDocSentence

Getting temporal scores by document and sentence is possible through the following code. This configuration is set to consider "py_heideltime" as the default temporal tagger, "ByDocSentence" as the score_type, and the default parameters of time_matters. In this configuration, multiple occurrences of a temporal expression in different sentences of a given document will return multiple (possibly different) scores (e.g., 0.92 for its occurrence in sentence 1 of document 1; and 0.77 for its occurrence in sentence 2 of document 1). Once again, we apply the year granularity to avoid getting too many fine-grained temporal expressions. Yet, you are more than welcome to alternatively run the following code: results = Time_Matters_MultipleDocs(ListOfDocs, score_type='ByDocSentence').

results = Time_Matters_MultipleDocs(ListOfDocs, score_type='ByDocSentence', temporal_tagger=['py_heideltime', 'English', 'year', 'news', '2013-04-15'])

The output is a dictionary where the key is the normalized temporal expression and the value is a dictionary where the key is the DocID and the value is, in turn, a dictionary where the key is the SentenceID and the value is a list with two positions. The first is the score of the temporal expression in that particular sentence. The second is a list of the instances of the temporal expression (as they were found in the text of that particular sentence of that document). For example, {'2011': {0: {5: [0.983, ['2011', '2011']], 6: [0.183, ['2011']]}}} means that the normalized temporal expression 2011 has a score of 0.983 in the sentence with ID 5 (where it occurs twice) of docID 0, and a score of 0.183 in the sentence with ID 6 of docID 0.
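The three nested levels (expression, document, sentence) can be unpacked as follows, again using hand-built illustrative values rather than real library output:

```python
# Hand-built sample mimicking the ByDocSentence output format (illustrative).
results = {'2011': {0: {5: [0.983, ['2011', '2011']],
                        6: [0.183, ['2011']]}}}

for date, per_doc in results.items():
    for doc_id, per_sentence in per_doc.items():
        for sent_id, (score, instances) in sorted(per_sentence.items()):
            print(f"{date}: doc {doc_id}, sentence {sent_id} "
                  f"-> score {score}, {len(instances)} occurrence(s)")
```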

Remaining Output


  • TempExpressions: A dictionary, where the key is the docID and the value is a list of tuples, each having two positions. The first is the normalized temporal expression. The second is the temporal expression as it was found in the text. The order in which the elements appear in the list reflects the order of the temporal expressions in the text. Example: {0: [('1975-02-11TAF', 'the afternoon of February 11, 1975'),..]}.

  • RelevantKWs: A dictionary where the key is the docID and the value is a dictionary of the relevant keywords (and corresponding scores). In our algorithm, keywords are detected by YAKE!. If you want to know more about the role of YAKE! in Time-Matters, please refer to the following link. Example: {0: {'haiti': 0.03, 'haiti earthquake': 0.07}} means that the tokens haiti and haiti earthquake were determined to be relevant keywords by the YAKE! keyword extractor, with scores of 0.03 and 0.07 (the lower the score, the more relevant the keyword), in docID 0.

  • TextNormalized: A normalized version of the text, a dictionary, where the key is the docID and the value is a string, where temporal expressions are marked with the tag <d> and relevant keywords with the tag <kw>. Example: {0: 'As of <d>2010</d> (see 1500 photos here), the following major earthquakes have been recorded in <kw>haiti</kw>.'}

  • TextTokens: A dictionary where the key is the docID and the value is a list of the text tokens. Tokens that are temporal expressions are marked with the tag <d>, whereas relevant keywords are marked with the tag <kw>. Example: {0: ['As', 'of', '<d>2010</d>', 'see', '1500',...]}.

  • SentencesNormalized: A dictionary, where the key is the docID and the value is a list of the normalized version of the sentence text (position 0 of the list corresponds to sentence 0, etc). Temporal expressions found in the text are marked with the tag <d> while relevant keywords are marked with the tag <kw>; Example: {0: [..., 'As of <d>2010</d> (see 1500 photos here), the following major earthquakes have been recorded in <kw>haiti</kw>.',...]}.

  • SentencesTokens: A dictionary, where the key is the docID and the value is a list of the text tokens by sentence, that is a list of lists (position 0 of the list gives the tokens of sentence 0, etc). Tokens that are temporal expressions are marked with the tag <d>, whereas relevant keywords are marked with the tag <kw>. Example: {0: [[...,..], ['As', 'of', '<d>2010</d>', 'see', '1500',...], [...,..],]}.
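Since temporal expressions and relevant keywords are marked with <d> and <kw> tags in the normalized fields, they can be pulled back out with a simple regular expression. A sketch over the example string used in the field descriptions above:

```python
import re

# Example TextNormalized value, taken from the field description above.
normalized = ('As of <d>2010</d> (see 1500 photos here), the following major '
              'earthquakes have been recorded in <kw>haiti</kw>.')

# Non-greedy matches pull out the tagged spans in document order.
dates = re.findall(r'<d>(.*?)</d>', normalized)
keywords = re.findall(r'<kw>(.*?)</kw>', normalized)
print(dates, keywords)  # ['2010'] ['haiti']
```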

Optional Parameters


Apart from the score_type (ByCorpus, ByDoc and ByDocSentence) there are also parameters regarding the temporal_tagger and time_matters.

Temporal Tagger

While 'py_heideltime' is the default temporal tagger, a 'rule_based' approach can be used instead. In the following, we assume the default parameters of the rule-based approach, that is: date_granularity is "full" (the highest possible granularity detected will be retrieved), begin_date is 0, and end_date is 2100, which means that all the dates within this range will be retrieved. Instead, we could specify a more fine-grained granularity, such as year, and a begin and end date, which would result in the following code: results = Time_Matters_MultipleDocs(ListOfDocs, temporal_tagger=['rule_based', 'year', 2000, 2011]). However, in the following code, we resort to the default parameter values of the rule-based approach.

results = Time_Matters_MultipleDocs(ListOfDocs, temporal_tagger=['rule_based'])

In addition, a few other parameters (already seen above) are available for py_heideltime, namely:

  • language: English - default; Portuguese; Spanish; Germany; Dutch; Italian; and French. To know how to configure py_heideltime for other languages please refer to this link;
  • date granularity: "full" - default (Highest possible granularity detected will be retrieved); "year" (YYYY will be retrieved); "month" (YYYY-MM will be retrieved); "day" (YYYY-MM-DD will be retrieved). Note that this parameter can also be used with the rule_based model.
  • document type: "news" - default (news-style documents); "narrative" (narrative-style documents (e.g., Wikipedia articles)); "colloquial" (English colloquial (e.g., Tweets and SMS)); "scientific" (scientific articles (e.g., clinical trials));
  • document creation time: in the format YYYY-MM-DD

Time Matters

  • n-gram: maximum number of terms a keyword might have. Default value is 1 (but any value > 0 is considered. For instance n = 1 means that single tokens such as "keyword" can be considered; instead n = 2 means that "keyword" but also "keyword extractor" can be considered). More about this here and here;
  • num_of_keywords: number of YAKE! keywords to extract from the text. Default value is 10 (but any value > 0 is considered) meaning that the system will extract 10 relevant keywords from the text. More about this here and here;
  • n_contextual_window: defines the n-contextual window distance. Default value is "full_document" when the score type is ByCorpus, or "full_sentence" when the score type is ByDoc or ByDocSentence (but an n-window where n > 0 can be considered as an alternative). More about this here;
  • N: size of the context vector for X and Y at InfoSimba. Default value is '10' (but any value > 0 is considered). You can also define 'max', meaning that the context vector should have the maximum number of n-terms co-occurring with X (likewise with Y). This option, however, will require a huge amount of time (depending on the PC) to execute. More about this here;
  • TH: minimum threshold value from which terms are eligible for the context vectors X and Y at InfoSimba. Default value is 0.05 (but any value > 0 is considered), meaning that any terms co-occurring with a DICE similarity value > 0.05 are eligible for the n-size vector. More about this here.
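The TH threshold is applied to DICE similarities between terms. For reference, the DICE coefficient of two terms X and Y is 2·|X∩Y| / (|X| + |Y|); a minimal sketch follows (the `dice` helper and the counts are ours, for illustration only):

```python
def dice(cooccurrences, freq_x, freq_y):
    # DICE similarity between terms X and Y: twice the number of times
    # they co-occur, over the sum of their individual frequencies.
    return 2 * cooccurrences / (freq_x + freq_y)

# Terms that always occur together get a DICE of 1.0
print(dice(4, 4, 4))    # -> 1.0

# Terms that rarely co-occur fall below the default TH of 0.05
print(dice(1, 30, 20))  # 2/50 = 0.04
```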

The following code assumes a score_type of ByCorpus, the default parameters of the temporal_tagger (i.e., py_heideltime), and specifies the five parameters (also the default ones) for time_matters.

results = Time_Matters_MultipleDocs(ListOfDocs, time_matters=[1, 10, 'full_document', 10, 0.05])

More interestingly, we can consider a different n-gram size for the keywords. In the following, we consider n = 3.

results = Time_Matters_MultipleDocs(ListOfDocs, time_matters=[3, 10, 'full_document', 10, 0.05])

Debug Mode


We also offer a debug mode where users can access a more detailed version of the results. Thus, in addition to the fields already explained, we also make available the InvertedIndex, the DiceMatrix, and the ExecutionTime.

To this end, we consider the following code with debug_mode=True, thus assuming the score_type ByCorpus and the default parameters of temporal_tagger (thus with py_heideltime) and of time_matters.

results = Time_Matters_MultipleDocs(ListOfDocs, debug_mode=True)
  • InvertedIndex: An inverted index of the entire set of documents, most notably of its relevant keywords and temporal expressions. It follows the following dictionary structure: {'term' : [DF, TotFreq, {DocID : [FreqInDoc, [OffsetsDoc], {SentenceID : [FreqInSentence, [OffsetsSentence]]}]}], where DF is the Document Frequency, TotFreq is the total frequency of the term within the entire corpus of documents, DocID is the ID of the document (knowing that IDs start at 0), FreqInDoc is the frequency of the term in the document, [OffsetsDoc] is a list of the document offsets, that is, a list of the position(s) where the term appears in the document, SentenceID is the ID of the sentence, FreqInSentence is the frequency of the term in the sentence, and [OffsetsSentence] is a list of the sentence offsets, that is, a list of the position(s) where the term appears in the sentence. For instance, a term with the following structure '2010': [1, 4, {1 : [4, [6, 13, 20, 27], {0 : [2, [6, 13]], 1: [2, [20, 27]] }]}] means that it has 4 occurrences in 1 document, in particular in the document with ID 1, namely in positions 6, 13, 20, and 27; the first two occur in the sentence with ID 0, and the latter two in the sentence with ID 1.
  • DiceMatrix: It retrieves (in pandas format) the DICE matrix between each term according to the n-contextual window distance defined. For instance, a DICE similarity of 1 between prime and minister means that, whenever each of these terms occur, they always occur together. If you want to know more about the role of DICE in our algorithm please refer to this link.
  • ExecutionTime: It retrieves information about the processing times of our algorithm, in particular the TotalTime required to execute the algorithm, but also the time of each of its most important components, namely: heideltime_processing, py_heideltime_text_normalization, keyword_text_normalization, YAKE, InvertedIndex, DICEMatrix, and GTE. As can be observed from the example, most of the time is consumed by the py_heideltime component (which entails the heideltime_processing and the text normalization process, that is, the tagging of the text with the <d> tag).
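The InvertedIndex entry from the example above can be unpacked level by level; a sketch over that same illustrative entry:

```python
# Example InvertedIndex entry, following the structure described above:
# {'term': [DF, TotFreq, {DocID: [FreqInDoc, [OffsetsDoc],
#                                 {SentenceID: [FreqInSentence, [OffsetsSentence]]}]}]}
inverted_index = {'2010': [1, 4, {1: [4, [6, 13, 20, 27],
                                      {0: [2, [6, 13]],
                                       1: [2, [20, 27]]}]}]}

doc_freq, total_freq, postings = inverted_index['2010']
print(f"'2010' appears {total_freq}x in {doc_freq} document(s)")
for doc_id, (freq_in_doc, doc_offsets, sentences) in postings.items():
    print(f"  doc {doc_id}: {freq_in_doc}x at offsets {doc_offsets}")
    for sent_id, (freq_in_sent, sent_offsets) in sentences.items():
        print(f"    sentence {sent_id}: {freq_in_sent}x at {sent_offsets}")
```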

CLI


Help

$ Time_Matters_MultipleDocs --help

Usage Examples

Usage examples (make sure the input parameters are within quotes):

Default Parameters: This configuration assumes "py_heideltime" as the default temporal tagger, "ByCorpus" as the default score_type, and the default parameters of time_matters.

Time_Matters_MultipleDocs -i "['path', 'c:\path_to_directory']"

All the Parameters:

Time_Matters_MultipleDocs -i "['path', 'c:\path_to_directory']" -tt "['py_heideltime','English', 'full', 'news', '2019-05-05']" -tm "[1, 10,'full_document', 10, 0.05]" -st ByCorpus -dm False
Options
 [not required]
 ----------------------------------------------------------------------------------------------------------------------------------
  -tt, --temporal_tagger   Specifies the temporal tagger and the corresponding parameters.
                           Default: "py_heideltime"
			   Options:
			   	    "py_heideltime"
				    "rule_based"
				 
			   py_heideltime (parameters):
			   ____________________________
			   - temporal_tagger_name
			     Options:
				     "py_heideltime"

			   - language
			     Default: "English"
			     Options:
			   	      "English";
				      "Portuguese";
				      "Spanish";
				      "Germany";
				      "Dutch";
				      "Italian";
				      "French".

		          - date_granularity
			    Default: "full"
			    Options:
			           "full": means that all types of granularity will be retrieved, from the coarsest to the 
					   finest-granularity.
			           "day": means that for the date YYYY-MM-DD-HH:MM:SS it will retrieve YYYY-MM-DD;
				   "month": means that for the date YYYY-MM-DD-HH:MM:SS only the YYYY-MM will be retrieved;
				   "year": means that for the date YYYY-MM-DD-HH:MM:SS only the YYYY will be retrieved;

			  - document_type
			    Default: "News"
			    Options:
			  	    "News": for news-style documents - default param;
				    "Narrative": for narrative-style documents (e.g., Wikipedia articles);
				    "Colloquial": for English colloquial (e.g., Tweets and SMS);
				    "Scientific": for scientific articles (e.g., clinical trials).

			  - document_creation_time
			    Document creation date in the format YYYY-MM-DD. Taken into account when "News" or "Colloquial" texts
		            are specified.
		            Example: "2019-05-30".

			  - Example: 
			  	    -tt "['py_heideltime','English', 'full', 'news', '2019-05-05']"	 

		          
			  Rule_Based (parameters):
		          ____________________________
			  - temporal_tagger_name
			    Options:
			  	    "rule_based"

			  - date_granularity
			    Default: "full"
			    Options:
			           "full": means that all types of granularity will be retrieved, from the coarsest to the 
					   finest-granularity.
			           "day": means that for the date YYYY-MM-DD-HH:MM:SS it will retrieve YYYY-MM-DD;
				   "month": means that for the date YYYY-MM-DD-HH:MM:SS only the YYYY-MM will be retrieved;
				   "year": means that for the date YYYY-MM-DD-HH:MM:SS only the YYYY will be retrieved;

                          - begin_date
			    Default: 0
                            Options: any number > 0

			  - end_date
			    Default: 2100
                            Options: any number > 0

			  - Example: 
			  	    -tt "['rule_based','full','2000','2100']"
 [not required]
 ----------------------------------------------------------------------------------------------------------------------------------
  -tm, --time_matters     Specifies information about Time-Matters, namely:
			  - n-gram: maximum number of terms a keyword might have. 
			    Default: 1
			    Options:
				    any integer > 0

			  - num_of_keywords: number of YAKE! keywords to extract from the text
			    Default: 10
			    Options:
				    any integer > 0

		          - n_contextual_window: defines the search space where co-occurrences between terms may be counted.
			    Default: "full_sentence"
			    Options:
                                    "full_sentence": the system will look for co-occurrences between terms that occur within the search 
				                    space of a sentence;
			            n: where n is any value > 0, that is, the system will look for co-occurrences between terms that 
				       occur within a window of n terms;
				       
		          - N: N-size context vector for InfoSimba vectors
			    Default: 10
			    Options: 
                                    any integer > 0
			            "max": where "max" is given by the maximum number of terms eligible to be part of the vector
				    
				    
			  - TH: all the terms with a DICE similarity > TH threshold are eligible to the context vector of InfoSimba
			    Default: 0.05
			    Options: 
				    any float > 0


			  - Example: 
			  	    -tm "[1, 10, 'full_sentence', 'max', 0.05]"
 [not required]
 ----------------------------------------------------------------------------------------------------------------------------------
  -st, --score_type       Specifies the type of score for the temporal expressions found in the texts
  			  Default: "ByCorpus"
                          Options:
                                  "ByCorpus": returns a single score for each temporal expression, regardless of the documents where it occurs;
                                  "ByDoc": returns multiple scores (one for each document where the temporal expression occurs);
                                  "ByDocSentence": returns multiple scores (one for each sentence of each document where it occurs)
				  
			  - Example: 
			  	    -st ByDoc
 [not required]
 ----------------------------------------------------------------------------------------------------------------------------------
  -dm, --debug_mode      Returns detailed information about the results
  	                 Default: False
			 Options:
			          False: when set to False debug mode is not activated
				  True: activates debug mode. In that case it returns 
                                        "Text";
					"TextNormalized"
					"Score"
					"CandidateDates"
					"NormalizedCandidateDates"
					"RelevantKWs"
					"InvertedIndex"
					"Dice_Matrix"
					"ExecutionTime"
					
			  - Example: 
			  	    -dm True
				    
  --help                 Show this message and exit.
Output

The output is a JSON list that returns the following information: score, temporal expressions, relevant keywords, normalized text, text tokens, normalized sentences, and sentence tokens.