Data Preprocessing
The data we receive in the Collection phase is most likely not in the desired format and needs to be processed, with additional metadata, to bring the collections into a standard format. The methods that convert raw data into this standard format belong to this stage.
There are many crucial steps:
- Data License verification
- Data cleaning
  - HTML/CSS tags removal
  - Punctuation removal
  - English word removal from the Odia part of the pair
- Alignment (Phrase, Sentence, Paragraph)
- Metadata generation/addition
  - POS tags
  - NER
- Filtering out the pairs based on a threshold
- Classification
  - Business domain based (IT, Religion, Tourism, Politics, etc.)
  - Morphology based (Word, Phrase, Sentence, Paragraph, etc.)
- Converting the input raw format into standard format
Let us go through these steps individually:
- The data we receive through the data collectors needs to be checked.
- Provide proper attributions wherever needed.
- Weed out corpus entries that do not carry a proper license.
- Contact the original authors where needed.
Licensing is a crucial part of open-source projects and therefore should not be taken lightly; legal obligations may follow if it is not handled carefully.
The data we receive needs to go through a set of cleaning steps before going further.
- Remove duplicates from the corpus.
- In the initial phase we may need to remove punctuation marks from the pairs; later, during tuning, we may want to keep them.
- HTML and CSS tags definitely need to be removed; a program can be written for this (see the cleaning sketch after this list).
- English words may appear in the Odia part of a pair, for example because of a brand name or trademark. In these cases we can:
  - Keep it as it is (recommended).
    - NER and POS tagging come into the picture here: they can identify whether the English word is a proper noun.
    - If yes, keep it; otherwise try one of the other two options.
  - Transliterate to substitute the English word with Odia.
  - Remove that sentence.
- Again, remove duplicates from the corpus. In fact, we should remove duplicates after each step to reduce the workload on the following steps.
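A minimal cleaning sketch in Python, assuming the corpus is held as a list of (English, Odia) string tuples; the regexes, the helper names (`clean_segment`, `latin_ratio`, `deduplicate`), and the choice to strip punctuation are illustrative rather than part of any agreed pipeline:

```python
import re

HTML_TAG = re.compile(r"<[^>]+>")            # crude HTML/CSS tag stripper
PUNCT = re.compile(r"[^\w\s]", re.UNICODE)   # punctuation (removal is optional, see above)
LATIN = re.compile(r"[A-Za-z]")

def clean_segment(text, strip_punct=True):
    """Strip markup (and optionally punctuation) from one side of a pair."""
    text = HTML_TAG.sub(" ", text)
    if strip_punct:
        text = PUNCT.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

def latin_ratio(odia_text):
    """Fraction of Latin (English) letters on the Odia side; used later for filtering."""
    letters = [c for c in odia_text if c.isalpha()]
    return sum(1 for c in letters if LATIN.match(c)) / len(letters) if letters else 0.0

def deduplicate(pairs):
    """Drop exact duplicate (english, odia) pairs while preserving order."""
    seen, unique = set(), []
    for en, od in pairs:
        if (en, od) not in seen:
            seen.add((en, od))
            unique.append((en, od))
    return unique
```

The `latin_ratio` helper is reused later when filtering pairs by the percentage of English letters on the Odia side.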
Alignment of the pairs is the crucial part of this process; an entire codebase could be written just to align pairs. We have thousands of corpus segments lying around, unusable because of this alignment issue. The alignment can be done based on the morphology of the original text.
- A dictionary-based approach (matching words across the two sides of a pair) can be tried to align the sentences. However, a strong plan needs to be created first; a rough sketch follows.
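A rough, greedy dictionary-based alignment sketch, assuming a bilingual dictionary that maps each English word to a single Odia word form (a simplification that ignores inflection); `dictionary_align` and `min_overlap` are illustrative names, not an agreed design:

```python
def dictionary_align(en_sentences, od_sentences, en_od_dict, min_overlap=2):
    """Greedy alignment: for every English sentence, pick the unmatched Odia
    sentence that shares the most dictionary translations with it."""
    aligned = []
    used = set()
    for en in en_sentences:
        en_words = set(en.lower().split())
        # Odia words we expect to see, according to the bilingual dictionary
        expected = {en_od_dict[w] for w in en_words if w in en_od_dict}
        best, best_score = None, 0
        for idx, od in enumerate(od_sentences):
            if idx in used:
                continue
            score = len(expected & set(od.split()))
            if score > best_score:
                best, best_score = idx, score
        if best is not None and best_score >= min_overlap:
            used.add(best)
            aligned.append((en, od_sentences[best]))
    return aligned
```

In practice, sentence-length statistics would probably need to be combined with the dictionary overlap, since a word-level dictionary alone will miss inflected forms.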
Odia does not have any free, open-source POS tagger or NER tagger. There are word2vec embeddings available, which I still need to test. Combining these features should push the translation accuracy up considerably.
For this, I have asked the community for this information during the Data Collection phase.
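A quick smoke test of the word2vec idea could look like the sketch below, assuming gensim 4.x and a plain-text Odia corpus; the file name `odia_monolingual.txt` and the probe word are placeholders, and a pre-trained vector file would instead be loaded with gensim's `KeyedVectors` rather than trained from scratch:

```python
from gensim.models import Word2Vec

# Whitespace-tokenised Odia sentences; "odia_monolingual.txt" is a placeholder path.
with open("odia_monolingual.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f]

# Train a small model just to sanity-check that the corpus yields sensible neighbours.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=2, workers=4)

# Nearest neighbours of "ଭାରତ" (India) as a quick sanity check.
print(model.wv.most_similar("ଭାରତ", topn=5))
```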
After all these steps, a threshold will be set to analyze the validity and uniqueness of each pair. The threshold can be based on:
- Minimum number of letters that need to be present in a sentence pair.
- Minimum number of words that need to be present in a sentence pair.
- Percentage of English letters in the Odia part of the pair.
- Minimum number of high-weight POS tags (nouns/adjectives/pronouns/verbs) needed for the pair to be declared valid.
Any pair that fails any of the above conditions should be filtered out; a minimal filter sketch follows.
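A minimal filter sketch, assuming the `pairs` list and the `latin_ratio` helper from the cleaning sketch above; the threshold values are illustrative, the per-side interpretation of the letter and word minimums is an assumption, and the POS check only applies once an Odia tagger becomes available:

```python
MIN_LETTERS = 10        # illustrative thresholds, to be tuned on real data
MIN_WORDS = 3
MAX_ENGLISH_RATIO = 0.2
MIN_CONTENT_TAGS = 1    # nouns/adjectives/pronouns/verbs, once an Odia POS tagger exists

CONTENT_TAGS = {"NOUN", "ADJ", "PRON", "VERB"}

def is_valid_pair(en, od, od_pos_tags=None):
    """Return True only if the pair passes every threshold check listed above."""
    for side in (en, od):
        if sum(ch.isalpha() for ch in side) < MIN_LETTERS:
            return False
        if len(side.split()) < MIN_WORDS:
            return False
    if latin_ratio(od) > MAX_ENGLISH_RATIO:   # helper from the cleaning sketch above
        return False
    if od_pos_tags is not None:
        if sum(1 for tag in od_pos_tags if tag in CONTENT_TAGS) < MIN_CONTENT_TAGS:
            return False
    return True

filtered = [(en, od) for en, od in pairs if is_valid_pair(en, od)]
```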
We cannot build a Swiss Army knife where any kind of translation is handled by one generic model. We have to find our niche domain; for the initial stage it may be the domain for which we get the maximum number of pairs.
This concept is critical and needs to be understood at an early stage. You cannot train with agriculture data and expect to test it on medical terms; the result will be poor.
That is why the MT models that grow quickly initially pick a specific sector and specialize in it first. If we want to specialize in generic "Hi"/"Bye" phrases, we need to weed out the other domain-specific data pairs.
For this reason, domain-based classification is critical during the processing phase, both for the MT model and for the business; a rough sketch is given below.
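A very rough keyword-based domain tagger as a starting point; the domain keyword lists and the `guess_domain` name are purely illustrative, and a real pipeline would eventually want a trained classifier instead:

```python
# Very rough keyword-based domain tagging on the English side of each pair.
DOMAIN_KEYWORDS = {
    "IT": {"software", "computer", "internet", "server"},
    "Religion": {"temple", "prayer", "festival", "god"},
    "Tourism": {"hotel", "travel", "beach", "monument"},
    "Politics": {"election", "minister", "parliament", "party"},
}

def guess_domain(en_sentence):
    """Return the domain whose keyword set overlaps most with the English side."""
    words = set(en_sentence.lower().split())
    scores = {domain: len(words & kw) for domain, kw in DOMAIN_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "General"
```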
We also need to classify the pairs into words, phrases, and sentences.
Finally, convert the input raw format into the standard format, making any changes needed to the data that will be used later in the training process; a sketch of both steps follows.
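A sketch of the morphology split and the export step, assuming a simple TSV layout as the standard format (the project has not fixed one yet, so the columns, the four-token phrase cut-off, and the reuse of `guess_domain` from the previous sketch are all assumptions):

```python
import csv

def morphology_class(text):
    """Classify a segment as Word, Phrase, or Sentence by a simple token-count rule."""
    n = len(text.split())
    if n == 1:
        return "Word"
    if n <= 4:          # illustrative cut-off for phrases
        return "Phrase"
    return "Sentence"

def write_standard_format(pairs, path="parallel_corpus.tsv"):
    """Write pairs into a simple TSV layout: english, odia, morphology, domain."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["english", "odia", "morphology", "domain"])
        for en, od in pairs:
            writer.writerow([en, od, morphology_class(en), guess_domain(en)])
```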