I addressed your comment in the linked issue, so let's continue our discussion there.
Hi. I want to run a simple regular expression over a dataset, but I am running into the issue that the result is not being cached. There was a similar issue (and a pull request to fix it) for the nlp library (the predecessor of the datasets library). Does anyone have pointers on how to do this properly?
More concretely, I want to apply this expression with re.findall() to a dataset:
pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
The expression itself comes from GPT2Tokenizer, but I couldn't find any specific function that makes compiled regular expressions work under datasets.map() with caching. Thanks!
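One workaround that is sometimes suggested for this kind of problem is to keep the pattern as a plain string and let the mapped function look it up by value, so that the function does not capture a compiled pattern object. The sketch below is only an illustration under several assumptions: the function name find_tokens, the "text" and "tokens" column names, and the stdlib fallback pattern are all made up for this example, and whether this actually restores caching depends on the datasets version in use. Note also that \p{L} and \p{N} are supported by the third-party regex package, not by the standard library re module, so the fallback is only a rough approximation.

```python
import re

# The pattern from the question, kept as a plain string rather than a
# compiled object. It needs the third-party `regex` package for \p{L}, \p{N}.
TOKEN_PATTERN = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

# Hypothetical stdlib-only approximation (an assumption of this sketch):
# [^\W\d_] stands in for \p{L}, \d for \p{N}.
STDLIB_PATTERN = r"""'s|'t|'re|'ve|'m|'ll|'d| ?[^\W\d_]+| ?\d+| ?(?:[^\s\w]|_)+|\s+(?!\S)|\s+"""


def find_tokens(batch, text_column="text"):
    """Batched map function: returns a "tokens" column with the regex matches."""
    try:
        import regex  # third-party; assumed to be installed in the real setup
        findall, pattern = regex.findall, TOKEN_PATTERN
    except ImportError:
        findall, pattern = re.findall, STDLIB_PATTERN
    # Both modules keep an internal cache of compiled patterns, so passing
    # the string on every call does not recompile it each time.
    return {"tokens": [findall(pattern, text) for text in batch[text_column]]}


# Possible usage with datasets (not executed here; the data is illustrative):
# from datasets import Dataset
# ds = Dataset.from_dict({"text": ["Hello world's best!"]})
# ds = ds.map(find_tokens, batched=True)
```

Calling find_tokens directly on a small dict of lists is a quick way to check the matches before handing the function to map().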