feat(extractors): refactor image extraction to use child_process for Tesseract
CorentinTh committed Jan 22, 2025
1 parent a7fbf21 commit b625f61
Showing 4 changed files with 28 additions and 75 deletions.
10 changes: 5 additions & 5 deletions fixtures/007.expected
@@ -1,20 +1,20 @@
-at his touch of a certain icy pang along my blood. “Come, sir, said I.
+at his touch ofa certain icy pang along my blood. “Come, sir, said I.
 “You forget that I have not yet the pleasure of your acquaintance. Be
 seated, if you please.” And I showed him an example, and sat down
 myself in my customary seat and with as fair an imitation of my or-
 dinary manner to a patient, as the lateness of the hour, the nature of
 my preoccupations, and the horror I had of my visitor, would suffer
 me to muster.
 
-“I beg your pardon, Dr. Lanyon, he replied civilly enough. “What
+“I beg your pardon, Dr. Lanyon, he replied civilly enough. “What
 you say is very well founded; and my impatience has shown its heels
 to my politeness. I come here at the instance of your colleague, Dr.
 Henry Jekyll, on a piece of business of some moment; and I under-
 stood...” He paused and put his hand to his throat, and I could see,
 in spite of his collected manner, that he was wrestling against the
-approaches of the hysteria—“T understood, a drawer...”
+approaches of the hysteria—“I understood, a drawer...”
 
-But here I took pity on my visitor's suspense, and some perhaps
+But here I took pity on my visitors suspense, and some perhaps
 on my own growing curiosity.
 
 “There it is, sir,” said I, pointing to the drawer, where it lay on the
@@ -25,7 +25,7 @@ heart: I could hear his teeth grate with the convulsive action of his
 jaws; and his face was so ghastly to see that I grew alarmed both for
 his life and reason.
 
-“Compose yourself, said I.
+“Compose yourself, said I.
 
 He turned a dreadful smile to me, and as if with the decision of
 despair, plucked away the sheet. At sight of the contents, he uttered
1 change: 0 additions & 1 deletion package.json
@@ -46,7 +46,6 @@
     "release": "bumpp --commit --tag --push"
   },
   "dependencies": {
-    "tesseract.js": "^6.0.0",
     "unpdf": "^0.12.1"
   },
   "devDependencies": {
69 changes: 6 additions & 63 deletions pnpm-lock.yaml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

23 changes: 17 additions & 6 deletions src/extractors/img.extractor.ts
@@ -1,5 +1,6 @@
 import { Buffer } from 'node:buffer';
-import { createWorker } from 'tesseract.js';
+import { exec } from 'node:child_process';
+import { env } from 'node:process';
 import { defineTextExtractor } from '../extractors.models';
 
 export const imageExtractorDefinition = defineTextExtractor({
@@ -11,13 +12,23 @@ export const imageExtractorDefinition = defineTextExtractor({
     'image/gif',
   ],
   extract: async ({ arrayBuffer }) => {
-    const buffer = Buffer.from(arrayBuffer);
+    const binary = env.LECTURE_TESSERACT_BINARY ?? 'tesseract';
 
-    const worker = await createWorker();
+    const { stdout } = await new Promise<{ stdout: string }>((resolve, reject) => {
+      const child = exec(`${binary} stdin stdout`, (error, stdout) => {
+        if (error) {
+          reject(error);
+        } else {
+          resolve({ stdout });
+        }
+      });
 
-    const { data: { text } } = await worker.recognize(buffer);
-    await worker.terminate();
+      child.stdin.write(Buffer.from(arrayBuffer));
+      child.stdin.end();
+    });
 
-    return { content: text };
+    return {
+      content: stdout,
+    };
   },
 });

Check failure on line 26 in src/extractors/img.extractor.ts (the `child.stdin.write(Buffer.from(arrayBuffer));` line)
GitHub Actions / CI - Lib
Unhandled error (reported twice)

Error: write EPIPE
 ❯ afterWriteDispatched node:internal/stream_base_commons:159:15
 ❯ writeGeneric node:internal/stream_base_commons:150:3
 ❯ Socket._writeGeneric node:net:964:11
 ❯ Socket._write node:net:976:8
 ❯ writeOrBuffer node:internal/streams/writable:572:12
 ❯ _write node:internal/streams/writable:501:10
 ❯ Socket.Writable.write node:internal/streams/writable:510:10
 ❯ content src/extractors/img.extractor.ts:26:19
 ❯ Object.extract src/extractors/img.extractor.ts:17:30

Serialized Error: { errno: -32, code: 'EPIPE', syscall: 'write' }

This error originated in "src/extractors.usecases.test.ts" test file. It doesn't mean the error was thrown inside the file itself, but while it was running. The latest test that might've caused the error is "fixture fixtures/006.png" in the first report and "fixture fixtures/007.jpg" in the second. It might mean one of the following:
- The error was thrown, while Vitest was running this test.
- If the error occurred after the test had been completed, this was the last documented test before it was thrown.
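
The CI annotation on this change reports `write EPIPE` raised from `child.stdin.write`. EPIPE generally means the child closed its end of the pipe before the write completed, for example because it exited early or could not be started; the annotation alone does not say which. The following is a minimal sketch of one way to keep such a write failure from surfacing as an unhandled error. It is not the author's fix: `runWithStdin` is a hypothetical helper, and swapping `exec` for `spawn` is an assumption made here so that the binary path is passed as an argument array instead of being interpolated into a shell command.

```typescript
import { Buffer } from 'node:buffer';
import { spawn } from 'node:child_process';

// Hypothetical helper (not part of the commit): pipes `input` to a command's
// stdin and resolves with everything the command wrote to stdout.
export function runWithStdin({ binary, args, input }: {
  binary: string;
  args: string[];
  input: Buffer;
}): Promise<string> {
  return new Promise((resolve, reject) => {
    // spawn takes an argument array, so nothing is parsed by a shell.
    const child = spawn(binary, args, { stdio: ['pipe', 'pipe', 'pipe'] });

    const stdoutChunks: Buffer[] = [];
    const stderrChunks: Buffer[] = [];
    child.stdout.on('data', (chunk: Buffer) => stdoutChunks.push(chunk));
    child.stderr.on('data', (chunk: Buffer) => stderrChunks.push(chunk));

    // Emitted when the process cannot be started (for example, binary not found).
    child.on('error', reject);

    // Without a listener here, a failed write is emitted as an unhandled
    // 'error' event. EPIPE is ignored because the 'close' handler below
    // reports the exit code; any other write error rejects immediately.
    child.stdin.on('error', (error: NodeJS.ErrnoException) => {
      if (error.code !== 'EPIPE') {
        reject(error);
      }
    });

    // 'close' fires once the process has ended and its stdio streams are closed.
    child.on('close', (code) => {
      if (code === 0) {
        resolve(Buffer.concat(stdoutChunks).toString('utf8'));
      } else {
        const stderr = Buffer.concat(stderrChunks).toString('utf8');
        reject(new Error(`${binary} exited with code ${code}: ${stderr}`));
      }
    });

    // end(chunk) writes the final chunk and then closes stdin.
    child.stdin.end(input);
  });
}
```

Under these assumptions, the extractor body could call `runWithStdin({ binary, args: ['stdin', 'stdout'], input: Buffer.from(arrayBuffer) })` and return the resolved string as `content`. A missing or failing binary would then reject the promise with the exit code and stderr text, which is easier to diagnose than a bare EPIPE.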
