portuguese-nlp

NLP work on Brazilian Portuguese newswire text

You can browse the dataset online and see the annotations on Drive.

We have x newswire articles collected between 1994 and 2016. Because the articles are in HTML format, preprocessing first cleans the tags and then renames every file so that its date path becomes part of the file name, for example:

folca/data/2005/01/01/19.html --> folca/parsed-data/2005_01_01_19.html

and collect them all in one folder.
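
The renaming step can be sketched in Python as follows. This is a minimal sketch, not the project's actual script: the function names `flat_name` and `flatten` are hypothetical, the directory names are taken from the example above, and the files are only copied (tag cleaning is left out).

```python
from pathlib import Path
import shutil


def flat_name(path: Path, root: Path) -> str:
    """Join the path components below `root` with underscores.

    Hypothetical helper: data/2005/01/01/19.html -> 2005_01_01_19.html
    """
    return "_".join(path.relative_to(root).parts)


def flatten(src_root: Path, dst_dir: Path) -> None:
    """Copy every .html file under `src_root` into the single folder `dst_dir`."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(src_root.rglob("*.html")):
        # copy2 keeps file timestamps; the originals stay untouched
        shutil.copy2(src, dst_dir / flat_name(src, src_root))


if __name__ == "__main__":
    # Assumed layout, following the example path in this README
    flatten(Path("folca/data"), Path("folca/parsed-data"))
```

Because the year, month, and day are all kept in the new name, files from different dates cannot collide once they share one folder.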
