Commit
feat(extractors): add image extractor
CorentinTh committed Jan 22, 2025
1 parent 8dcd6bc commit f97e5f8
Showing 11 changed files with 143 additions and 6 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -30,3 +30,4 @@ logs
coverage
cache
.zed
*.traineddata
6 changes: 6 additions & 0 deletions fixtures/006.expected
@@ -0,0 +1,6 @@
Lorem ipsum

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sapien ante conubia vestibulum
ultrices quisque nam nascetur consectetur. Viverra amet lacinia massa donec gravida primis
leo tellus. Montes nulla sit cras odio penatibus cum aenean metus. Per per eros fusce et
platea et feugiat ullamcorper.
Binary file added fixtures/006.png
41 changes: 41 additions & 0 deletions fixtures/007.expected
@@ -0,0 +1,41 @@
at his touch of a certain icy pang along my blood. “Come, sir,” said I.
“You forget that I have not yet the pleasure of your acquaintance. Be
seated, if you please.” And I showed him an example, and sat down
myself in my customary seat and with as fair an imitation of my or-
dinary manner to a patient, as the lateness of the hour, the nature of
my preoccupations, and the horror I had of my visitor, would suffer
me to muster.

“I beg your pardon, Dr. Lanyon,” he replied civilly enough. “What
you say is very well founded; and my impatience has shown its heels
to my politeness. I come here at the instance of your colleague, Dr.
Henry Jekyll, on a piece of business of some moment; and I under-
stood...” He paused and put his hand to his throat, and I could see,
in spite of his collected manner, that he was wrestling against the
approaches of the hysteria—“T understood, a drawer...”

But here I took pity on my visitor's suspense, and some perhaps
on my own growing curiosity.

“There it is, sir,” said I, pointing to the drawer, where it lay on the
floor behind a table and still covered with the sheet.

He sprang to it, and then paused, and laid his hand upon his
heart: I could hear his teeth grate with the convulsive action of his
jaws; and his face was so ghastly to see that I grew alarmed both for
his life and reason.

“Compose yourself,” said I.

He turned a dreadful smile to me, and as if with the decision of
despair, plucked away the sheet. At sight of the contents, he uttered
one loud sob of such immense relief that I sat petrified. And the
next moment, in a voice that was already fairly well under control,
“Have you a graduated glass?” he asked.

I rose from my place with something of an effort and gave him
what he asked.

He thanked me with a smiling nod, measured out a few min-
ims of the red tincture and added one of the powders. The mix-
ture, which was at first of a reddish hue, began, in proportion as the
Binary file added fixtures/007.jpg
6 changes: 6 additions & 0 deletions fixtures/008.expected
@@ -0,0 +1,6 @@
Lorem ipsum

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sapien ante conubia vestibulum
ultrices quisque nam nascetur consectetur. Viverra amet lacinia massa donec gravida primis
leo tellus. Montes nulla sit cras odio penatibus cum aenean metus. Per per eros fusce et
platea et feugiat ullamcorper.
Binary file added fixtures/008.gif
1 change: 1 addition & 0 deletions package.json
@@ -46,6 +46,7 @@
"release": "bumpp --commit --tag --push"
},
"dependencies": {
"tesseract.js": "^6.0.0",
"unpdf": "^0.12.1"
},
"devDependencies": {
69 changes: 63 additions & 6 deletions pnpm-lock.yaml

Some generated files are not rendered by default.

2 changes: 2 additions & 0 deletions src/extractors.registry.ts
@@ -1,10 +1,12 @@
import type { ExtractorDefinition } from './extractors.models';
import { imageExtractorDefinition } from './extractors/img.extractor';
import { pdfExtractorDefinition } from './extractors/pdf.extractor';
import { txtExtractorDefinition } from './extractors/txt.extractor';

export const extractorDefinitions: ExtractorDefinition[] = [
  pdfExtractorDefinition,
  txtExtractorDefinition,
  imageExtractorDefinition,
];

export function getExtractor({
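The diff cuts off before the body of `getExtractor`, so its actual implementation is not shown here. As a rough illustration of why registering `imageExtractorDefinition` in the array is enough to make it discoverable, the sketch below looks up a definition by MIME type. Everything in it is hypothetical: the `SketchExtractor` type, the `findExtractorByMimeType` helper, and the PDF and plain-text entries (their names and MIME types are assumptions; only the image entry mirrors this commit).

```typescript
// Hypothetical, simplified stand-in for ExtractorDefinition (the real type
// lives in './extractors.models' and is not part of this diff).
type SketchExtractor = {
  name: string;
  mimeTypes: string[];
};

// The 'pdf' and 'text' entries are assumed for illustration; the 'image'
// entry copies the MIME types declared in img.extractor.ts.
const sketchDefinitions: SketchExtractor[] = [
  { name: 'pdf', mimeTypes: ['application/pdf'] },
  { name: 'text', mimeTypes: ['text/plain'] },
  { name: 'image', mimeTypes: ['image/png', 'image/jpeg', 'image/webp', 'image/gif'] },
];

// Hypothetical helper: returns the first definition that declares the given
// MIME type, or undefined when no extractor supports it.
function findExtractorByMimeType(
  mimeType: string,
  definitions: SketchExtractor[] = sketchDefinitions,
): SketchExtractor | undefined {
  return definitions.find(definition => definition.mimeTypes.includes(mimeType));
}
```

With a lookup of this shape, supporting a new file type only requires appending a definition to the array; callers do not change.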
23 changes: 23 additions & 0 deletions src/extractors/img.extractor.ts
@@ -0,0 +1,23 @@
import { Buffer } from 'node:buffer';
import { createWorker } from 'tesseract.js';
import { defineTextExtractor } from '../extractors.models';

export const imageExtractorDefinition = defineTextExtractor({
  name: 'image',
  mimeTypes: [
    'image/png',
    'image/jpeg',
    'image/webp',
    'image/gif',
  ],
  extract: async ({ arrayBuffer }) => {
    const buffer = Buffer.from(arrayBuffer);

    // With no arguments, createWorker() falls back to the tesseract.js defaults.
    const worker = await createWorker();

    try {
      const { data: { text } } = await worker.recognize(buffer);

      return { content: text };
    } finally {
      // Release the worker even when recognition throws.
      await worker.terminate();
    }
  },
});
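The extractor follows a create, recognize, terminate lifecycle. A minimal sketch of that lifecycle, written so it can be exercised without tesseract.js, is shown below. The `OcrWorkerLike` interface, `recognizeAndTerminate`, and `makeStubWorker` are hypothetical names introduced for illustration; the interface only mirrors the two worker methods the extractor calls and is not the library's own type.

```typescript
// Hypothetical interface covering only the two methods used by the extractor.
interface OcrWorkerLike {
  recognize(image: Uint8Array): Promise<{ data: { text: string } }>;
  terminate(): Promise<unknown>;
}

// Creates a worker, runs recognition, and always terminates the worker,
// whether recognition succeeds or throws.
async function recognizeAndTerminate(
  createWorkerFn: () => Promise<OcrWorkerLike>,
  image: Uint8Array,
): Promise<string> {
  const worker = await createWorkerFn();

  try {
    const { data: { text } } = await worker.recognize(image);
    return text;
  } finally {
    await worker.terminate();
  }
}

// Hypothetical stub for tests: returns a fixed text (or fails on demand)
// and records whether terminate() was called.
function makeStubWorker(text: string, shouldFail = false) {
  const state = { terminated: false };

  const worker: OcrWorkerLike = {
    recognize: async () => {
      if (shouldFail) {
        throw new Error('recognition failed');
      }
      return { data: { text } };
    },
    terminate: async () => {
      state.terminated = true;
    },
  };

  return { worker, state };
}
```

Passing the worker factory as a parameter is a design choice made for this sketch: it lets a test substitute the stub, while production code could pass the real `createWorker` from tesseract.js, assuming its worker satisfies the same two-method shape.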
