Using NLP

capitalize it! uses a natural language processing (NLP) application programming interface (API) called TextRazor to properly capitalize text. It capitalizes proper nouns, the beginnings of sentences, and some acronyms. I used NLP because named entity (proper noun) recognition is not as simple as checking text against a hard-coded dictionary of names, places, holidays, etc. Often, context determines what qualifies as a named entity. Similarly, punctuation doesn’t reliably mark the end of a sentence: a period could appear in an email address, an abbreviation, or another context. NLP uses linguistics and AI to examine patterns in human language and determine what qualifies as a named entity or sentence in context.
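To illustrate why punctuation alone is unreliable, here is a minimal sketch of a punctuation-only heuristic. The function name and the sample sentence are made up for illustration and are not part of the app.

```python
import re


def naive_sentence_starts(text):
    """Return each word that follows '.', '!' or '?' plus whitespace.

    This is the punctuation-only heuristic, shown here only to expose
    its weakness; it is not how the app detects sentences.
    """
    return re.findall(r"[.!?]\s+(\w+)", text)


sample = "the trip took approx. five hours. email me at j.doe@example.com. thanks."
starts = naive_sentence_starts(sample)
print(starts)
```

The abbreviation "approx." causes "five" to be flagged as the start of a sentence even though it is mid-sentence, which is the kind of mistake that context-aware sentence detection is meant to avoid.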

Capitalization function

Capitalizeit.py contains the main functionality for capitalizing text. In the function called "cap," the API client first analyzes the text for words and entities. The analyzed result is then tokenized by sentence to find the first word of each sentence, and those positions are stored in a set. I used sets rather than lists because they’re much faster at testing for membership and they ensure each position is stored only once. Next, I tokenize by basic token (words, punctuation, symbols) and save the tokens as a list. Positions of tokens tagged as "NNP" or "NNPS" (proper nouns) are stored in another set. I also check for named entities because the proper noun tagging isn’t foolproof. I store the positions of entities that have both dbpedia_type and freebase_type attributes because I noticed that this combination most likely marks a proper noun. Entity IDs (or their properly formatted names) are also stored in a dict keyed by position so that the dict maps cleanly onto the position sets and the token list. The ID is stored because it helps show which words in a phrase need to be capitalized and which don’t.
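The bookkeeping described above can be sketched roughly as follows. This is not the code in Capitalizeit.py: the function name, the plain-dict input shape, and the sample type values are all assumptions for illustration, and they stand in for whatever the TextRazor client actually returns.

```python
def collect_positions(sentences, words, entities):
    """Gather the positions that are candidates for capitalization.

    All three arguments use a hypothetical plain-dict shape; they are
    not TextRazor's actual response objects.
    """
    # First word of every sentence; a set gives fast membership tests
    # and stores each position only once.
    sentence_starts = {sentence[0] for sentence in sentences if sentence}
    tokens = [word["token"] for word in words]
    # Tokens tagged as singular or plural proper nouns.
    proper_nouns = {
        word["position"] for word in words if word["pos"] in ("NNP", "NNPS")
    }
    entity_positions = set()
    entity_ids = {}  # position -> entity ID (properly formatted name)
    for entity in entities:
        # Keep only entities that carry both kinds of type information.
        if entity.get("dbpedia_type") and entity.get("freebase_type"):
            for position in entity["positions"]:
                entity_positions.add(position)
                entity_ids[position] = entity["id"]
    return tokens, sentence_starts, proper_nouns, entity_positions, entity_ids


# Made-up sample input for "we visited the statue of liberty ."
text = "we visited the statue of liberty ."
tags = ["PRP", "VBD", "DT", "NN", "IN", "NN", "."]
words = [
    {"position": i, "token": tok, "pos": tag}
    for i, (tok, tag) in enumerate(zip(text.split(), tags))
]
sentences = [list(range(len(words)))]
entities = [{
    "id": "Statue of Liberty",
    "positions": [3, 4, 5],
    "dbpedia_type": ["Place"],               # sample value, assumed
    "freebase_type": ["/location/location"],  # sample value, assumed
}]

tokens, sentence_starts, proper_nouns, entity_positions, entity_ids = (
    collect_positions(sentences, words, entities)
)
print(sentence_starts, proper_nouns, entity_positions, entity_ids)
```

In this sample the tagger marks no proper nouns at all, so only the entity check catches "statue of liberty", which is the situation the second check is there for.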

Once the positions of all candidate words are stored, I iterate through the token list and capitalize the positions that are in any of the sets (with some exceptions). Some words are often false positives, like "the" and "of", while others represent acronyms and should be in all caps. I tested lots of inputs and adjusted the capitalization criteria as I went along. I also capitalize the word "I" and its contractions. Once the entire text has been iterated through and capitalized accordingly, I untokenize it (put it back together) using an untokenizer from GitHub called metanl. I couldn’t simply use the "join" method because some tokens (e.g. words) are separated by spaces, but some aren’t (e.g. punctuation). I also made some edits to the untokenizer because I noticed it missed some replacements and TextRazor replaces brackets, braces, and parentheses with letters.
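A simplified sketch of the capitalization pass and of why a plain join is not enough might look like this. The exception list, the contraction list, and both function names are assumptions, and the untokenizer here is a toy stand-in, not metanl's.

```python
# Hypothetical exception list; the app's real criteria were tuned by testing.
LOWERCASE_EXCEPTIONS = {"the", "of", "and"}
I_FORMS = {"i", "i'm", "i'll", "i've", "i'd"}

NO_SPACE_BEFORE = {".", ",", "!", "?", ";", ":", ")", "]", "}"}
NO_SPACE_AFTER = {"(", "[", "{"}


def capitalize_tokens(tokens, sentence_starts, marked):
    """Capitalize sentence starts, "I" forms, and marked positions."""
    result = []
    for position, token in enumerate(tokens):
        lowered = token.lower()
        if position in sentence_starts:
            result.append(token[:1].upper() + token[1:])
        elif lowered in I_FORMS:
            result.append("I" + token[1:])
        elif position in marked and lowered not in LOWERCASE_EXCEPTIONS:
            result.append(token[:1].upper() + token[1:])
        else:
            result.append(token)
    return result


def untokenize(tokens):
    """Rejoin tokens, omitting the space that " ".join would add around punctuation."""
    text = ""
    for token in tokens:
        if text and token not in NO_SPACE_BEFORE and text[-1] not in NO_SPACE_AFTER:
            text += " "
        text += token
    return text


tokens = ["i", "think", "i", "saw", "the", "statue", "of", "liberty", "."]
capitalized = capitalize_tokens(tokens, sentence_starts={0}, marked={4, 5, 6, 7})
print(untokenize(capitalized))
print(" ".join(capitalized))  # leaves a stray space before the period
```

Positions 4 and 6 are marked but stay lowercase because "the" and "of" are on the exception list, mirroring the false-positive handling described above.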

Web Application

In application.py, the main page (whose route is "/") shows index.html when loaded via a GET request. Index.html has the text box for users to type in text to be capitalized, as well as a button to submit the request and another button to redirect the user to a file uploader if they choose to capitalize a file instead. If the user chooses the text box method and submits, capitalized.html is rendered (under the same route for convenience). Capitalized.html has everything index.html has, except that the original text is kept in the text box for easy comparison and, of course, the capitalized text is displayed. Additionally, it has a rating system at the bottom.

With message flashing, I ask users to rate the accuracy of their first request in a while so I can see how well the app is doing. The rating is stored in the ratings table of the ratings database with the fields request_id, rating, and datetime. I decided to store the timestamp of the request so I can compare ratings across versions and see how they change over time. I only ask users to rate their first request because I don’t want to annoy them, but they’re free to rate every one if they’d like. I check for first requests using the Flask session: after the first request, the user’s session is assigned a random integer between 1 and 1,000,000. It’s not the most secure method, but there’s not much at stake in this case. Once the session ID expires, they’re asked to rate again. The rating system is at the bottom of the page and uses star buttons that glow when rolled over (code I got from a site that is cited in styles.css). I tried radio buttons at first, but found that submit buttons were more efficient because they submit right away upon being clicked.
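The first-request check and the ratings insert could be sketched like this. A plain dict stands in for the Flask session, the session key name and both function names are assumptions, and the in-memory SQLite table only mirrors the three fields named above; the app's actual database setup may differ.

```python
import random
import sqlite3
from datetime import datetime, timezone


def should_ask_for_rating(session):
    """Return True on the first request of a session, then tag the session.

    `session` is any dict-like object; a plain dict stands in for
    flask.session here. The "request_id" key name is an assumption.
    """
    if "request_id" in session:
        return False
    session["request_id"] = random.randint(1, 1_000_000)
    return True


def save_rating(conn, request_id, rating):
    """Store a rating together with the current timestamp."""
    conn.execute(
        "INSERT INTO ratings (request_id, rating, datetime) VALUES (?, ?, ?)",
        (request_id, rating, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ratings (request_id INTEGER, rating INTEGER, datetime TEXT)"
)

session = {}
first = should_ask_for_rating(session)   # first request: ask for a rating
second = should_ask_for_rating(session)  # later requests: don't ask again
save_rating(conn, session["request_id"], 4)
rows = conn.execute("SELECT request_id, rating FROM ratings").fetchall()
print(first, second, rows)
```

Because the ID is just a random integer, two sessions can collide; that matches the trade-off acknowledged above.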

Users can also capitalize the contents of a .txt file by clicking "Or upload file" on the home page. I added this feature for more flexibility. It redirects them to an uploader where they can choose a file to capitalize. If a user chooses a file that’s not .txt or fails to select a file at all, an error message is flashed in red. For now, .txt is the only supported format because it was the easiest to parse. I also use different colors for different types of messages so users can easily tell if something’s wrong. I pass the most appropriate Bootstrap alert class as the second argument of flash() so I can put it directly into the HTML.
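As a rough sketch of the validation and of how a category can double as a Bootstrap alert class: the function names and message wording below are made up, and the HTML helper only imitates what a template might render from a flashed (message, category) pair.

```python
from html import escape


def check_upload(filename):
    """Return (ok, message, category) for a submitted file name.

    `category` is meant to be passed as the second argument of flash()
    and matches a Bootstrap alert suffix ("danger" renders in red).
    """
    if not filename:
        return False, "No file selected.", "danger"
    if not filename.lower().endswith(".txt"):
        return False, "Only .txt files are supported.", "danger"
    return True, "File accepted.", "success"


def alert_html(message, category):
    """Imitate the template: the category drops straight into the class name."""
    return f'<div class="alert alert-{category}">{escape(message)}</div>'


ok, message, category = check_upload("report.pdf")
print(alert_html(message, category))
```

Keeping the category identical to the Bootstrap suffix means the template needs no mapping between message types and colors.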

Once a valid file is submitted, the file is saved to disk, opened, and read. The file is saved under a secure version of the original name (produced by a function in werkzeug.utils) to protect against harmful user input. A random integer between 1 and 1,000,000 is also appended to the file name in case two users simultaneously submit files with the same name and their requests get mixed up. I’m not sure that would ever happen, but I wanted to be safe. Once the file is read, its contents are stored in a variable and the file is deleted, because I don’t want to permanently store uploads or breach user privacy. The file’s contents get capitalized, and the original and capitalized text are displayed on the home page. Again, the user is asked to rate if it’s their first request in a while. I decided not to automatically download the capitalized .txt file to the user’s computer, or even offer the option, because I thought it was an unnecessary feature.
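The save, read, and delete sequence might be sketched as below. The sanitizer is a simplified stand-in written for this example (in Werkzeug the relevant helper is secure_filename), and the function names, the temporary directory, and the sample file are all assumptions.

```python
import os
import random
import re
import tempfile


def sanitize_name(name):
    """Simplified stand-in for werkzeug.utils.secure_filename."""
    # Drop any directory components, then replace unusual characters.
    name = os.path.basename(name.replace("\\", "/"))
    name = re.sub(r"[^A-Za-z0-9_.-]", "_", name).lstrip(".")
    return name or "upload"


def read_upload_once(original_name, data, upload_dir):
    """Save the upload under a safe, randomized name, read it, delete it."""
    suffix = random.randint(1, 1_000_000)  # reduces same-name clashes
    path = os.path.join(upload_dir, f"{sanitize_name(original_name)}{suffix}")
    with open(path, "wb") as handle:
        handle.write(data)
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    finally:
        os.remove(path)  # never keep the upload around


with tempfile.TemporaryDirectory() as upload_dir:
    text = read_upload_once("../notes.txt", b"hello world.", upload_dir)
    leftover = os.listdir(upload_dir)
print(text, leftover)
```

The try/finally ensures the file is removed even if reading fails, which fits the goal of not retaining user uploads.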

The application listens for errors, and if one occurs, the error code and name are shown in a large, header-style font within the site layout so the user at least knows they’re on the correct domain.
