How to use Time Matters MultipleDocs
Please do not change any wiki page without permission from Time-Matters developers.
In this wiki, we will explain:
- how to get temporal scores ByCorpus;
- how to get temporal scores ByDoc;
- how to get temporal scores ByDocSentence;
- how to play with Optional Parameters, namely those related to the temporal tagger and to Time-Matters;
- how to get scores using the Debug Mode, where further structures are returned to the user;
- how to execute Time-Matters MultipleDocs at the command prompt.
Time-Matters-MultipleDocs aims to score temporal expressions found within multiple texts. Given an identified temporal expression, it offers the user three scoring options:
- ByCorpus: retrieves a single score for each temporal expression found in the corpus of documents, regardless of whether it occurs multiple times in different documents; that is, multiple occurrences of a temporal expression in different documents will always return the same score (e.g., 0.92);
- ByDoc: retrieves a (possibly different) score for each occurrence of a temporal expression found in the set of documents; that is, multiple occurrences of a temporal expression in different documents will return multiple (possibly different) scores (e.g., 0.92 for the occurrence of 2019 in document 1, and 0.77 for the occurrence of 2019 in document 2);
- ByDocSentence: retrieves a (possibly different) score for each occurrence of a temporal expression found in a given document; that is, multiple occurrences of a temporal expression in different sentences of a document (e.g., 2019 ... 2019) will return multiple (possibly different) scores (e.g., 0.92 for the occurrence of 2019 in sentence 1 of document 1, and 0.77 for the occurrence of 2019 in sentence 2 of document 1).
The first one (ByCorpus) evaluates the score of a given candidate date in the context of a corpus of texts, with regard to all the relevant keywords it co-occurs with (regardless of whether they occur in document 1 or 2). The following example illustrates one such case, in which all the relevant keywords (w1, w2, w3) that co-occur with the temporal expression (d1) will be considered in the computation of the temporal score given by the GTE equation, i.e., Median([IS(d1, w1); IS(d1, w2); IS(d1, w3)]), provided that those keywords occur in at least two different documents (to avoid considering keywords that are too specific to a given document):
The second (ByDoc) evaluates the score of a given candidate date with regard to the documents where it occurs, thus taking into account only the relevant keywords of each document (within the defined search space), provided that those keywords occur in at least two different documents (to avoid considering keywords that are too specific to a given document). This means that, if 2010 co-occurs with w1 in document 1, only this relevant keyword will be considered to compute the temporal score of 2010 for that particular document. Likewise, if 2010 co-occurs with w2 and w3 in document 2, only these relevant keywords will be considered to compute the temporal score of 2010 for that particular document. We would thus have a temporal score of 2010 for document 1 computed by the GTE equation as Median([IS(d1, w1)]), and a temporal score of 2010 for document 2 computed as Median([IS(d1, w2); IS(d1, w3)]):
Finally, the third (ByDocSentence) evaluates the score of a given candidate date with regard to the documents and sentences where it occurs, thus taking into account only the relevant keywords of each sentence of a given document (within the defined search space), provided that those keywords occur in at least two different documents (to avoid considering keywords that are too specific to a given document). This means that, if 2010 co-occurs with w1 in sentence 1 of document 1, only this relevant keyword will be considered to compute the temporal score of 2010 for that particular sentence. Likewise, if 2010 co-occurs with w2 and w3 in sentence 2 of document 1, only these relevant keywords will be considered to compute the temporal score of 2010 for that particular sentence of that document. We would thus have a temporal score of 2010 for sentence 1 of document 1 computed by the GTE equation as Median([IS(d1, w1)]), and a temporal score of 2010 for sentence 2 of document 1 computed as Median([IS(d1, w2); IS(d1, w3)]).
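As a toy illustration of the GTE equation above, the following sketch computes the ByCorpus and ByDoc scores with Python's statistics module (ByDocSentence is analogous, at sentence level). The IS values are invented for the example:

```python
from statistics import median

# Hypothetical InfoSimba similarities between candidate date d1 (e.g., 2010)
# and the relevant keywords it co-occurs with (values invented for illustration).
is_values = {'w1': 0.9, 'w2': 0.7, 'w3': 0.95}

# ByCorpus: every relevant keyword co-occurring with d1 in the corpus counts.
score_corpus = median(is_values.values())                # 0.9

# ByDoc: only the keywords of each document count.
score_doc1 = median([is_values['w1']])                   # 0.9   (d1 with w1 in doc 1)
score_doc2 = median([is_values['w2'], is_values['w3']])  # 0.825 (d1 with w2, w3 in doc 2)
print(score_corpus, score_doc1, score_doc2)
```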
How to work with each one will be explained next. Before that, we explain how to import the libraries and a set of text documents. We suggest you play with your own texts or, alternatively, download the set of 28 documents (MultiDocTexts.zip) that we make available in this repository as a running example. These documents were collected in November 2016 by issuing the query "Boston Marathon Bombing" on the Bing search engine. The Diffbot Article API was then used to collect the full text of the top-50 web pages retrieved by Bing. We ended up with 28 documents, as some of them didn't have any useful text.
Note: if you want to use the following code, don't forget to put the texts under a folder named data/MultiDocTexts.
from Time_Matters_MultipleDocs import Time_Matters_MultipleDocs
import os

path = 'data/MultiDocTexts'
ListOfDocs = []
for file in os.listdir(path):
    with open(os.path.join(path, file), 'r') as f:
        txt = f.read()
        ListOfDocs.append(txt)
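Note that os.listdir returns files in arbitrary order, and the document IDs used throughout the results (which start at 0) presumably follow the position of each text in ListOfDocs. If you want reproducible IDs across runs, you can sort the listing first, as in this small variant of the loop above:

```python
# Same loading loop, but with a deterministic document order so that
# docIDs stay stable across runs (sorting is a suggestion, not required).
ListOfDocs = []
for file in sorted(os.listdir(path)):
    with open(os.path.join(path, file), 'r') as f:
        ListOfDocs.append(f.read())
```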
The structure of the score depends on the type of extraction considered: ByCorpus, ByDoc or ByDocSentence.
Getting temporal scores by a corpus of documents is possible through the following code: results = Time_Matters_MultipleDocs(ListOfDocs). This configuration assumes "py_heideltime" as the default temporal tagger, "ByCorpus" as the default score_type and the default parameters of time_matters. In this configuration, a single score will be retrieved for each temporal expression, regardless of whether it occurs in different documents.
Running this code, however, will take a considerable amount of time (depending on the PC used), as the Heideltime temporal tagger will be running on top of 28 texts. If you want a quicker (though less effective) solution, you should use a rule-based approach instead (more about this in the Optional Parameters section). Also, letting py_heideltime retrieve all the possible temporal expressions from the text might become too cumbersome. For that reason, we opt to set the date granularity to year and the document timestamp to '2013-04-15' (the date of the Boston Marathon bombings).
results = Time_Matters_MultipleDocs(ListOfDocs, temporal_tagger=['py_heideltime', 'English', 'year', 'news', '2013-04-15'])
#results = Time_Matters_MultipleDocs(ListOfDocs, score_type="ByCorpus", temporal_tagger=['py_heideltime', 'English', 'year', 'news', '2013-04-15'])
The output is a dictionary where the key is the normalized temporal expression and the value is a list with two positions. The first is the score of the temporal expression. The second is a dictionary of the instances of the temporal expression (as they were found in each document). For example, {'2011-01-12': [1.0, {0: ['2011-01-12', '12 January 2011'], 6: ['2011-01-12']}]} means that the normalized temporal expression 2011-01-12 has a score of 1 and occurs twice (first as 2011-01-12 and then as 12 January 2011) in document 0 and once (as 2011-01-12) in document 6.
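For instance, a minimal sketch of how to walk over this structure (assuming results holds the ByCorpus output described above):

```python
# Each normalized expression maps to [score, {docID: [instances found]}].
for norm_exp, (score, docs) in results.items():
    print(f'{norm_exp}: score = {score}')
    for doc_id, instances in docs.items():
        print(f'  document {doc_id}: {instances}')
```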
Getting temporal scores by document is possible through the following code. This configuration considers "py_heideltime" as the default temporal tagger, "ByDoc" as the score_type and the default parameters of time_matters. In this configuration, multiple occurrences of a temporal expression in different documents will return multiple (possibly different) scores (e.g., 0.92 for the occurrence of 2019 in document 1, and 0.77 for the occurrence of 2019 in document 2). Once again, we apply the year granularity to avoid getting too many fine-grained temporal expressions. Yet, you are more than welcome to alternatively run the following code: results = Time_Matters_MultipleDocs(ListOfDocs, score_type='ByDoc').
results = Time_Matters_MultipleDocs(ListOfDocs, score_type='ByDoc', temporal_tagger=['py_heideltime', 'English', 'year', 'news', '2013-04-15'])
The output is a dictionary where the key is the normalized temporal expression and the value is a dictionary where the key is the DocID and the value is a list with two positions: the first is the score of the temporal expression in that particular document; the second is a list of the instances of the temporal expression (as they were found in the text of that particular document). For example, {'2010': {1: [0.2, ['2010']], 5: [0.983, ['2010', '2010']]}} means that the normalized temporal expression 2010 has a score of 0.2 in the document with ID 1, and a score of 0.983 in the document with ID 5 (where it occurs twice).
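Likewise, a minimal sketch for iterating over the ByDoc structure:

```python
# Each normalized expression maps to {docID: [score, [instances found]]}.
for norm_exp, docs in results.items():
    for doc_id, (score, instances) in docs.items():
        print(f'{norm_exp} in document {doc_id}: score = {score}, instances = {instances}')
```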
Getting temporal scores by document and sentence is possible through the following code. This configuration considers "py_heideltime" as the default temporal tagger, "ByDocSentence" as the score_type and the default parameters of time_matters. In this configuration, multiple occurrences of a temporal expression in different sentences of a given document will return multiple (possibly different) scores (e.g., 0.2 for its occurrence in sentence 1 of a document, and 0.982 for its occurrence in sentence 2 of the same document). Once again, we apply the year granularity to avoid getting too many fine-grained temporal expressions. Yet, you are more than welcome to alternatively run the following code: results = Time_Matters_MultipleDocs(ListOfDocs, score_type='ByDocSentence').
results = Time_Matters_MultipleDocs(ListOfDocs, score_type='ByDocSentence', temporal_tagger=['py_heideltime', 'English', 'year', 'news', '2013-04-15'])
The output is a dictionary where the key is the normalized temporal expression and the value is a dictionary where the key is the DocID and the value is a new dictionary, where the key is the SentenceID and the value is a list with two positions: the first is the score of the temporal expression in that particular sentence; the second is a list of the instances of the temporal expression (as they were found in that particular sentence of that document). For example, {'2011': {0: {5: [0.983, ['2011', '2011']], 6: [0.183, ['2011']]}}} means that the normalized temporal expression 2011 has a score of 0.983 in the sentence with ID 5 (where it occurs twice) of docID 0, and a score of 0.183 in the sentence with ID 6 of docID 0.
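And a minimal sketch for the doubly nested ByDocSentence structure:

```python
# Each normalized expression maps to
# {docID: {sentenceID: [score, [instances found]]}}.
for norm_exp, docs in results.items():
    for doc_id, sentences in docs.items():
        for sent_id, (score, instances) in sentences.items():
            print(f'{norm_exp} in doc {doc_id}, sentence {sent_id}: '
                  f'score = {score}, instances = {instances}')
```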
In addition to the scores, the following structures are also made available:
- TempExpressions: a dictionary where the key is the docID and the value is a list of tuples, each having two positions. The first is the normalized temporal expression; the second is the temporal expression as it was found in the text. The order in which the elements appear in the list reflects the order of the temporal expressions in the text. Example: {0: [('1975-02-11TAF', 'the afternoon of February 11, 1975'),..]}.
- RelevantKWs: a dictionary where the key is the docID and the value is a dictionary of the relevant keywords (and corresponding scores). In our algorithm, keywords are detected by YAKE!. If you want to know more about the role of YAKE! in Time-Matters, please refer to the following link. Example: {0: {'haiti': 0.03, 'haiti earthquake': 0.07}} means that the tokens haiti and haiti earthquake were determined as relevant keywords by the YAKE! keyword extractor with the scores 0.03 and 0.07 (the lower the score, the more relevant the keyword) in docID 0.
- TextNormalized: a normalized version of the text; a dictionary where the key is the docID and the value is a string in which temporal expressions are marked with the tag <d> and relevant keywords with the tag <kw>. Example: {0: 'As of <d>2010</d> (see 1500 photos here), the following major earthquakes have been recorded in <kw>haiti</kw>.'}
- TextTokens: a dictionary where the key is the docID and the value is a list of the text tokens. Tokens that are temporal expressions are marked with the tag <d>, whereas relevant keywords are marked with the tag <kw>. Example: {0: ['As', 'of', '<d>2010</d>', 'see', '1500',...]}.
- SentencesNormalized: a dictionary where the key is the docID and the value is a list of the normalized version of each sentence's text (position 0 of the list corresponds to sentence 0, etc.). Temporal expressions found in the text are marked with the tag <d>, while relevant keywords are marked with the tag <kw>. Example: {0: [..., 'As of <d>2010</d> (see 1500 photos here), the following major earthquakes have been recorded in <kw>haiti</kw>.',...]}.
- SentencesTokens: a dictionary where the key is the docID and the value is a list of the text tokens by sentence, that is, a list of lists (position 0 of the list gives the tokens of sentence 0, etc.). Tokens that are temporal expressions are marked with the tag <d>, whereas relevant keywords are marked with the tag <kw>. Example: {0: [[...,..], ['As', 'of', '<d>2010</d>', 'see', '1500',...], [...,..],]}.
Apart from the score_type (ByCorpus, ByDoc and ByDocSentence) there are also parameters regarding the temporal_tagger and time_matters.
While 'py_heideltime' is the default temporal tagger, a 'rule_based' approach can be used instead. In the following, we assume the default parameters of the rule-based approach, that is: date_granularity is "full" (the highest possible granularity detected will be retrieved), begin_date is 0 and end_date is 2100, which means that all the dates within this range will be retrieved. Instead, we could specify a more fine-grained granularity, such as year, and a begin and end date, which would result in the following code: results = Time_Matters_MultipleDocs(ListOfDocs, temporal_tagger=['rule_based', 'year', 2000, 2011]). However, in the following code, we resort to the default parameter values of the rule-based approach.
results = Time_Matters_MultipleDocs(ListOfDocs, temporal_tagger=['rule_based'])
In addition, a few other parameters (already used before) are available to py_heideltime, namely:
- language: English (default); Portuguese; Spanish; Germany; Dutch; Italian; and French. To know how to configure py_heideltime for other languages, please refer to this link;
- date_granularity: "full" (default; the highest possible granularity detected will be retrieved); "year" (YYYY will be retrieved); "month" (YYYY-MM will be retrieved); "day" (YYYY-MM-DD will be retrieved). Note that this parameter can also be used with the rule_based model;
- document_type: "news" (default; news-style documents); "narrative" (narrative-style documents, e.g., Wikipedia articles); "colloquial" (English colloquial, e.g., tweets and SMS); "scientific" (scientific articles, e.g., clinical trials);
- document_creation_time: in the format YYYY-MM-DD.
- n-gram: maximum number of terms a keyword might have. Default value is 1 (but any value > 0 is considered; for instance, n = 1 means that single tokens such as "keyword" can be considered, whereas n = 2 means that "keyword" but also "keyword extractor" can be considered). More about this here and here;
- num_of_keywords: number of YAKE! keywords to extract from the text. Default value is 10 (but any value > 0 is considered), meaning that the system will extract 10 relevant keywords from the text. More about this here and here;
- n_contextual_window: defines the n-contextual window distance. Default value is "full_document" when the score type is ByCorpus, or "full_sentence" when the score type is ByDoc or ByDocSentence (but an n-window where n > 0 can be considered as an alternative). More about this here;
- N: size of the context vector for X and Y in InfoSimba. Default value is 10 (but any value > 0 is considered). You can also define 'max', meaning that the context vector should have the maximum number of n-terms co-occurring with X (likewise with Y). This option, however, will require a huge amount of time (depending on the PC) to execute. More about this here;
- TH: minimum threshold value from which terms are eligible for the context vectors X and Y in InfoSimba. Default value is 0.05 (but any value > 0 is considered), meaning that any terms co-occurring with a DICE similarity value > 0.05 are eligible for the n-size vector. More about this here.
The following code assumes a score_type of ByCorpus, the default parameters of the temporal_tagger (that is, py_heideltime) and explicitly specifies the five parameters (here, the default values) for time_matters.
results = Time_Matters_MultipleDocs(ListOfDocs, time_matters=[1, 10, 'full_document', 10, 0.05])
More interestingly, we can consider a different n-gram size for the keywords. In the following, we consider n = 3.
results = Time_Matters_MultipleDocs(ListOfDocs, time_matters=[3, 10, 'full_document', 10, 0.05])
We also offer the user a debug mode that gives access to a more detailed version of the results. Thus, in addition to the fields already explained before, we also make available the InvertedIndex, the DiceMatrix and the ExecutionTime. To this end, we consider the following code with debug_mode=True, thus assuming the score_type ByCorpus and the default parameters of temporal_tagger (thus with py_heideltime) and of time_matters.
results = Time_Matters_MultipleDocs(ListOfDocs, debug_mode=True)
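As a rough sketch, the extra structures might then be accessed as below. The access pattern is an assumption, not a documented API: we assume the debug-mode result can be indexed by the field names listed in the --help output further below, so inspect the returned object in your own session.

```python
# Hypothetical access by field name -- the exact keys of the debug-mode
# output are an assumption; print(results) to check its real structure.
inverted_index = results['InvertedIndex']
dice_matrix = results['Dice_Matrix']
execution_time = results['ExecutionTime']
print(execution_time)
```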
- InvertedIndex: an inverted index of the entire set of documents, most notably of its relevant keywords and temporal expressions. It follows this dictionary structure: {'term': [DF, TotFreq, {DocID: [FreqInDoc, [OffsetsDoc], {SentenceID: [FreqInSentence, [OffsetsSentence]]}]}]}, where DF is the Document Frequency, TotFreq is the total frequency of the term within the entire corpus of documents, DocID is the ID of the document (knowing that IDs start at 0), FreqInDoc is the frequency of the term in the document, [OffsetsDoc] is a list of the document offsets, that is, a list of the position(s) where the term appears in the document, SentenceID is the ID of the sentence, FreqInSentence is the frequency of the term in the sentence, and [OffsetsSentence] is a list of the sentence offsets, that is, a list of the position(s) where the term appears in the sentence. For instance, a term with the structure '2010': [1, 4, {1: [4, [6, 13, 20, 27], {0: [2, [6, 13]], 1: [2, [20, 27]]}]}] means that it has 4 occurrences in 1 document, in particular the document with ID 1, namely at positions 6, 13, 20 and 27; the first two occur in sentence ID 0, and the latter two in sentence ID 1.
- DiceMatrix: retrieves (in pandas format) the DICE matrix between each pair of terms, according to the n-contextual window distance defined. For instance, a DICE similarity of 1 between prime and minister means that, whenever each of these terms occurs, they always occur together. If you want to know more about the role of DICE in our algorithm, please refer to this link.
- ExecutionTime: retrieves information about the processing times of our algorithm, in particular the TotalTime required to execute the algorithm, but also the time of each of its most important components, namely: heideltime_processing, py_heideltime_text_normalization, keyword_text_normalization, YAKE, InvertedIndex, DICEMatrix and GTE. As can be observed from the example, most of the time is consumed by the py_heideltime component (which entails the heideltime_processing and the text normalization process, that is, the tagging of the text with the <d> tag).
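To make the InvertedIndex and DiceMatrix structures concrete, here is a minimal sketch that unpacks one inverted-index entry following the layout documented above and looks up a DICE similarity. It continues the hypothetical inverted_index and dice_matrix variables from the debug-mode sketch above, and it assumes the DataFrame's rows and columns are labeled by the terms themselves:

```python
# Unpack one inverted-index entry, following the documented layout:
# [DF, TotFreq, {DocID: [FreqInDoc, [OffsetsDoc],
#                        {SentenceID: [FreqInSentence, [OffsetsSentence]]}]}]
df, tot_freq, docs = inverted_index['2010']
print(f"'2010' occurs {tot_freq} time(s) across {df} document(s)")
for doc_id, (freq_in_doc, offsets_doc, sentences) in docs.items():
    print(f'doc {doc_id}: {freq_in_doc} occurrence(s) at positions {offsets_doc}')
    for sent_id, (freq_in_sent, offsets_sent) in sentences.items():
        print(f'  sentence {sent_id}: {freq_in_sent} occurrence(s) at {offsets_sent}')

# DICE similarity lookup (label-based indexing is an assumption).
print(dice_matrix.loc['prime', 'minister'])
```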
$ Time_Matters_MultipleDocs --help
Usage_examples (make sure that the input parameters are within quotes):
Default Parameters: This configuration assumes "py_heideltime" as default temporal tagger, "ByCorpus" as the default score_type and the default parameters of time_matters.
Time_Matters_MultipleDocs -i "['path', 'c:\path_to_directory']"
All the Parameters:
Time_Matters_MultipleDocs -i "['path', 'c:\path_to_directory']" -tt "['py_heideltime','English', 'full', 'news', '2019-05-05']" -tm "[1, 10,'full_document', 10, 0.05]" -st ByCorpus -dm False
[not required]
----------------------------------------------------------------------------------------------------------------------------------
-tt, --temporal_tagger Specifies the temporal tagger and the corresponding parameters.
Default: "py_heideltime"
Options:
"py_heideltime"
"rule_based"
py_heideltime (parameters):
____________________________
- temporal_tagger_name
Options:
"py_heideltime"
- language
Default: "English"
Options:
"English";
"Portuguese";
"Spanish";
"Germany";
"Dutch";
"Italian";
"French".
- date_granularity
Default: "full"
Options:
"full": means that all types of granularity will be retrieved, from the coarsest to the
finest-granularity.
"day": means that for the date YYYY-MM-DD-HH:MM:SS it will retrieve YYYY-MM-DD;
"month": means that for the date YYYY-MM-DD-HH:MM:SS only the YYYY-MM will be retrieved;
"year": means that for the date YYYY-MM-DD-HH:MM:SS only the YYYY will be retrieved;
- document_type
Default: "News"
Options:
"News": for news-style documents - default param;
"Narrative": for narrative-style documents (e.g., Wikipedia articles);
"Colloquial": for English colloquial (e.g., Tweets and SMS);
"Scientific": for scientific articles (e.g., clinical trails).
- document_creation_time
Document creation date in the format YYYY-MM-DD. Taken into account when "News" or "Colloquial" texts
are specified.
Example: "2019-05-30".
- Example:
-tt "['py_heideltime','English', 'full', 'news', '2019-05-05']"
Rule_Based (parameters):
____________________________
- temporal_tagger_name
Options:
"rule_based"
- date_granularity
Default: "full"
Options:
"full": means that all types of granularity will be retrieved, from the coarsest to the
finest-granularity.
"day": means that for the date YYYY-MM-DD-HH:MM:SS it will retrieve YYYY-MM-DD;
"month": means that for the date YYYY-MM-DD-HH:MM:SS only the YYYY-MM will be retrieved;
"year": means that for the date YYYY-MM-DD-HH:MM:SS only the YYYY will be retrieved;
- begin_date
Default: 0
Options: any number >= 0
- end_date
Default: 2100
Options: any number > 0
- Example:
-tt "['rule_based','full','2000','2100']"
[not required]
----------------------------------------------------------------------------------------------------------------------------------
-tm, --time_matters Specifies information about Time-Matters, namely:
- n-gram: maximum number of terms a keyword might have.
Default: 1
Options:
any integer > 0
- num_of_keywords: number of YAKE! keywords to extract from the text
Default: 10
Options:
any integer > 0
- n_contextual_window: defines the search space where co-occurrences between terms may be counted.
Default: "full_document" when the score_type is ByCorpus; "full_sentence" otherwise
Options:
"full_document": the system will look for co-occurrences between terms that occur within the search
space of the whole document;
"full_sentence": the system will look for co-occurrences between terms that occur within the search
space of a sentence;
n: where n is any value > 0, that is, the system will look for co-occurrences between terms that
occur within a window of n terms;
- N: N-size context vector for InfoSimba vectors
Default: 10
Options:
any integer > 0
"max": where "max" is given by the maximum number of terms eligible to be part of the vector
- TH: all the terms with a DICE similarity > TH threshold are eligible to the context vector of InfoSimba
Default: 0.05
Options:
any float > 0
- Example:
-tm "[1, 10, 'full_sentence', 'max', 0.05]"
[not required]
----------------------------------------------------------------------------------------------------------------------------------
-st, --score_type Specifies the type of score for the temporal expressions found in the texts
Default: "ByCorpus"
Options:
"ByCorpus": returns a single score for each temporal expression, regardless of whether it occurs in different documents;
"ByDoc": returns multiple scores (one for each document where the expression occurs);
"ByDocSentence": returns multiple scores (one for each sentence of each document where the expression occurs)
- Example:
-st ByDoc
[not required]
----------------------------------------------------------------------------------------------------------------------------------
-dm, --debug_mode Returns detailed information about the results
Default: False
Options:
False: when set to False debug mode is not activated
True: activates debug mode. In that case it returns
"Text";
"TextNormalized"
"Score"
"CandidateDates"
"NormalizedCandidateDates"
"RelevantKWs"
"InvertedIndex"
"Dice_Matrix"
"ExecutionTime"
- Example:
-dm True
--help Show this message and exit.
The output is a JSON list that retrieves the following information: score, temporal expressions, relevant keywords, text normalized, text tokens, sentences normalized and sentences tokens.