New dataset: OCR for meal expenses since 2015 using Google's Cloud Vision API #188
Comments
Wow… great progress, great data extracted from the PDFs ; ) Some general comments trying to help in the direction of PRs to this repo:
What do you think? |
Hey, my idea here was really just to share the data in case u guys or
others wanna analyse it. Also I have the 300 dollars credit for signing up
for the Google cloud platform that I need to spend over the next 2 weeks
(got 200 left after this first batch of OCRs now).
Making this more robust / making it work on other platforms is "out of
scope" for me for now so feel free to close this issue to avoid the noise.
My plan is to do the same for a few other reimbursement categories that Ana
pointed out to me over Telegram, I'll do my best to get the code in a good
shape but no promises yet as I need to wrap up other PRs I have open ;). If
u believe there are better ways to share that dataset instead of a GitHub
issue LMK.
--
Fábio Rehm
On Feb 11, 2017 12:43 PM, "Eduardo Cuducos" wrote:
Wow… great progress, great data extracted from the PDFs ; )
Some general comments trying to help in the direction of PRs to this repo:
- We already have a script to fetch all PDFs
<https://github.com/datasciencebr/serenata-de-amor/blob/master/src/fetch_receipts.py>
so we need to find a way to convert PDFs to PNG (or whatever)
without using os.system("pdftoppm …") for greater compatibility
(Windows users? Or even different *nix users: brew install vs.
apt-get install vs. macports install vs. yum install and so on),
but maybe that's too utopian…
- A script to actually use the Google service and create the data/txt
What do you think?
|
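On the PDF-to-PNG point raised above, here is a minimal sketch of calling `pdftoppm` through `subprocess` with an argument list instead of `os.system`. It assumes Poppler's `pdftoppm` is installed and on the `PATH`, so it only removes the shell dependency and does not solve the cross-platform installation concern. The function names `build_pdftoppm_command` and `pdf_to_png` are hypothetical and are not part of the project's scripts.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftoppm_command(pdf_path, output_prefix, dpi=300):
    """Return the argument list for a pdftoppm call that renders PNG pages."""
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf_path), str(output_prefix)]


def pdf_to_png(pdf_path, output_dir, dpi=300):
    """Render every page of a PDF as PNG files inside output_dir."""
    # Fail early with a readable message if Poppler is not installed.
    if shutil.which("pdftoppm") is None:
        raise RuntimeError("pdftoppm not found; install Poppler first")
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(pdf_path).stem
    # Passing a list (no shell) avoids quoting problems across platforms.
    subprocess.run(build_pdftoppm_command(pdf_path, output_dir / stem, dpi), check=True)
    return sorted(output_dir.glob(stem + "*.png"))
```

Something like `pdf_to_png("receipt.pdf", "data/png")` would then return the list of generated PNG paths, one per page.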
No worries at all, mate. I think this Issue is pretty useful as it is. People interested in OCR can learn from your experience, try it, get access to data etc. I'll leave it as it is ; ) |
No worries, I've finished OCR of I still have $100 left 😱 but that's not enough to OCR the 140k meal reimbursements, which is what @anaschwendler suggested I do next in a chat over Telegram. I guess for now I'll proceed with subquotas that have fewer reimbursements until I'm done with those credits, and I'll share a single zip file with all of that at some point this week 🤘 As a side note, I know the generated text is not 100% accurate, but I think we could still have this data on an Elasticsearch instance somewhere for full-text search 💭 |
Also, I've "merged" the code from the 3 initial notebooks into a couple classes that u can find on the gist above. If that's good enough for a PR to toolbox LMK and I'll try to get it going when I have a chance |
@fgrehm is there no way to ask Google for a free tier for use in open source projects? I guess they are very welcoming in cases like that. |
@pedrommone Probably yes, but that's something that serenata's core team will have to do once this has been wired up with rosie / jarbas / etc... My idea with this is to provide enough ammo for analysis to check if it is going to be worth the trouble having this in the first place 😄 |
❤️ |
Here's the dataset with the receipts texts and some numbers about it:
The process of obtaining those texts can be seen on the following gists:
I believe that @Irio already uploaded the zip file with raw JSON responses to S3 but I couldn't find the "easy to use CSV" version linked above. Once that file is uploaded to S3 I guess we can close this issue and GH-173. An example of usage is coming up as a PR in a bit 🎉 🍻 EDIT: Here is the breakdown of OCRed stuff per subquota:
|
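For anyone working from the raw JSON responses rather than the CSV, here is a rough sketch of how the two could relate. It assumes each raw file holds one `images:annotate` response body, where the first entry of `textAnnotations` carries the full detected text. The column names `document_id` and `text`, and both function names, are hypothetical and may not match the published CSV.

```python
import csv
import io


def extract_text(raw_response):
    """Return the full OCR text from one images:annotate JSON response."""
    responses = raw_response.get("responses", [])
    if not responses:
        return ""
    annotations = responses[0].get("textAnnotations", [])
    # The first annotation holds the whole detected text; the rest are words.
    return annotations[0]["description"] if annotations else ""


def responses_to_csv(raw_by_document_id):
    """Build CSV text with one row per document (column names are assumptions)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["document_id", "text"])
    for document_id, raw in sorted(raw_by_document_id.items()):
        writer.writerow([document_id, extract_text(raw)])
    return buffer.getvalue()
```

Receipts with no detected text simply end up with an empty `text` cell, which keeps them countable in later analysis.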
@fgrehm I'm so sorry. I had a couple of unforeseen situations this week and I couldn't follow up with you in time. Now it looks like the file at WeTransfer is not available anymore. Can you re-upload it or send it via PVT so I can upload it (and the one from #173) to S3? Also, I haven't checked, but just as a reminder in case it hasn't been done: would you mind documenting these new datasets in the |
Hi, all. My recommendation is to run the API again, because Google updated the API to v1.1 with new features, for example Document Text Detection. |
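As a small illustration of what re-running with the newer feature involves, here is a sketch that builds the JSON body for the Cloud Vision REST `images:annotate` method, with the feature type switched to `DOCUMENT_TEXT_DETECTION`. The helper name `build_annotate_request` is hypothetical. Sending the request also needs an API key or OAuth credentials, which are not shown here.

```python
import base64

# REST endpoint the body below would be POSTed to (authentication not shown).
VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"


def build_annotate_request(image_bytes, feature="DOCUMENT_TEXT_DETECTION"):
    """Build the request body for a single image and a single feature."""
    return {
        "requests": [
            {
                # Image content travels as base64-encoded text.
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [{"type": feature}],
            }
        ]
    }
```

Passing `feature="TEXT_DETECTION"` instead gives the older style of request, which makes it easy to compare both outputs on the same receipt.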
@michelpereira yeah, tks for the heads up. That feature got released after I was done with this initial processing and I only found out about it while working on the dataset documentation I have in the works 😄 I'm on a work trip right now but I'm going to wrap up that PR as soon as I'm back home |
Hi guys, So, we have some data like that: According to the previously mentioned article, the reimbursements can't be a generalization. So, could we consider it as an invalid reimbursement? |
Sure thing. If we could identify all receipts with handwritten descriptions such as that, we would have thousands of new suspicions ; ) |
@fgrehm, @pedrommone Maybe you could try Tensorflow Research Cloud. |
The datasets are on S3 and docs have been merged. Please follow up on ☝️ for additional efforts on this |
Hide availability in the Chamber's dataset while we don't update the db
is this dataset still available for download? the wetransfer (https://we.tl/i1C2z6sBJX) link looks expired.. would love to get my hands on this. |
Yep.
That's expected. We use sites such as WeTransfer just to quickly exchange files between collaborators and core developers. Later, as @fgrehm said, it's uploaded to our file storage.
Hell yeah 🤘 Just go ahead and use the toolbox mentioned in the |
@cuducos the toolbox link is broken :( |
Toolbox link : https://github.com/okfn-brasil/serenata-toolbox |
Hello! Can someone direct me to a link containing the JSON files for the 57K meal reimbursements? Also where can I find the images corresponding to the JSON files? Thanks!! |
@michaelyan-coupa, you can easily create a CSV using
Yes. With data from the CSV mentioned in my last paragraph, you can download from the source by concatenating the URL as we do in Jarbas.
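A minimal sketch of that concatenation follows. The URL pattern and the field names `applicant_id`, `year` and `document_id` are assumptions based on how the project's scripts appear to build receipt links, so please check them against Jarbas before relying on this. The `receipt_url` helper is hypothetical.

```python
# Assumed pattern for a receipt PDF on the Chamber of Deputies website.
RECEIPT_URL = (
    "http://www.camara.gov.br/cota-parlamentar/documentos/publ/"
    "{applicant_id}/{year}/{document_id}.pdf"
)


def receipt_url(applicant_id, year, document_id):
    """Build the download URL for one reimbursement receipt."""
    return RECEIPT_URL.format(
        applicant_id=applicant_id, year=year, document_id=document_id
    )
```

Each row of the CSV would supply the three values, so looping over the rows gives one URL per receipt.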
@fgrehm |
@liminghao1630 the file you're looking for is |
Is it possible to download these receipt images? |
Yes. I described the process three messages above your question, @sks4world. |
Could you or anyone please make those raw Cloud Vision API JSON responses available for download again? I have tried |
How can I download this dataset, with all receipts in PDF format? |
Some stats:
/cc @Irio @anaschwendler