Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
0 parents
commit a91d98f
Showing 28 changed files with 6,131 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
github: | ||
- CorentinTh | ||
|
||
buy_me_a_coffee: cthmsst |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,35 @@ | ||
name: CI - Lib | ||
|
||
on: | ||
pull_request: | ||
push: | ||
branches: | ||
- main | ||
|
||
jobs: | ||
ci-lib: | ||
name: CI - Lib | ||
runs-on: ubuntu-latest | ||
|
||
steps: | ||
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4 | ||
- run: corepack enable | ||
- uses: actions/setup-node@v4 | ||
with: | ||
node-version: 22 | ||
cache: pnpm | ||
|
||
- name: Install dependencies | ||
run: pnpm i --frozen-lockfile | ||
|
||
- name: Run linters | ||
run: pnpm lint | ||
|
||
- name: Type check | ||
run: pnpm typecheck | ||
|
||
- name: Run unit tests | ||
run: pnpm test | ||
|
||
- name: Build the lib | ||
run: pnpm build |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,32 @@ | ||
# Nuxt dev/build outputs | ||
.output | ||
.data | ||
.nuxt | ||
.nitro | ||
.cache | ||
dist | ||
dist-app | ||
dist-node | ||
dist-cloudflare | ||
|
||
# Node dependencies | ||
node_modules | ||
|
||
# Logs | ||
logs | ||
*.log | ||
|
||
# Misc | ||
.DS_Store | ||
.fleet | ||
.idea | ||
|
||
# Local env files | ||
.env | ||
.env.* | ||
!.env.example | ||
|
||
.wrangler | ||
coverage | ||
cache | ||
.zed |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
MIT License | ||
|
||
Copyright (c) 2025 Corentin THOMASSET | ||
|
||
Permission is hereby granted, free of charge, to any person obtaining a copy | ||
of this software and associated documentation files (the "Software"), to deal | ||
in the Software without restriction, including without limitation the rights | ||
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | ||
copies of the Software, and to permit persons to whom the Software is | ||
furnished to do so, subject to the following conditions: | ||
|
||
The above copyright notice and this permission notice shall be included in all | ||
copies or substantial portions of the Software. | ||
|
||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | ||
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | ||
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | ||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE | ||
SOFTWARE. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,141 @@ | ||
# @papra/lecture | ||
|
||
`@papra/lecture` is a robust and lightweight library for extracting text content from various file formats. Whether you're processing documents for indexing or for LLM readability, this library simplifies the task of reading text content programmatically. | ||
|
||
## Features | ||
|
||
- **Wide Format Support**: Extract text from PDFs, plain text files, YAML, Markdown, CSV, and all `text/*` MIME types. | ||
- **Promise-based API**: Extraction functions are asynchronous and return promises, so they fit directly into `async`/`await` code. | ||
- **Extensible and Modular**: Built with future compatibility in mind. Support for more file formats is on the way. | ||
- **Error Handling**: Provides detailed error information when extraction fails. | ||
|
||
## Installation | ||
|
||
To install the package, use pnpm, npm, or yarn: | ||
|
||
```bash | ||
pnpm install @papra/lecture | ||
|
||
npm install @papra/lecture | ||
|
||
yarn add @papra/lecture | ||
``` | ||
|
||
## Usage | ||
|
||
### Importing the Library | ||
|
||
You can import the library using ES Modules or CommonJS syntax: | ||
|
||
```javascript | ||
// ES Modules | ||
import { extractText, extractTextFromBlob } from '@papra/lecture'; | ||
``` | ||
|
||
```javascript | ||
// CommonJS | ||
const { extractText, extractTextFromBlob } = require('@papra/lecture'); | ||
``` | ||
|
||
### Functions | ||
|
||
#### `extractText` | ||
|
||
Extracts text from a file's binary data using its MIME type. | ||
|
||
**Parameters**: | ||
|
||
- `arrayBuffer` (`ArrayBuffer`): The binary content of the file. | ||
- `mimeType` (`string`): The MIME type of the file. | ||
|
||
**Returns**: | ||
A promise that resolves to an object with the following properties: | ||
|
||
- `extractorName` (`string | undefined`): The name of the extractor used. | ||
- `textContent` (`string | undefined`): The extracted text content, if available. | ||
- `error` (`Error | undefined`): An error object, if an issue occurred during extraction. | ||
|
||
**Example**: | ||
|
||
```javascript | ||
const file = await fetch('example.pdf').then(res => res.arrayBuffer()); | ||
const mimeType = 'application/pdf'; | ||
|
||
const result = await extractText({ arrayBuffer: file, mimeType }); | ||
|
||
if (result.textContent) { | ||
console.log('Extracted Text:', result.textContent); | ||
} else { | ||
console.error('Error:', result.error); | ||
} | ||
``` | ||
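One caveat with the truthiness check in the example above: an empty string is falsy, so a file that legitimately contains no text would take the error branch even when no error occurred. A minimal sketch of a stricter check follows, assuming the documented result shape and that `error` is set only when extraction fails. The `describeResult` helper is hypothetical and not part of `@papra/lecture`.

```javascript
// Hypothetical helper, not part of @papra/lecture.
// Assumes the documented result shape: { extractorName, textContent, error }.
function describeResult(result) {
  if (result.error !== undefined) {
    return { ok: false, reason: result.error.message };
  }
  if (result.textContent === undefined) {
    // No error and no text: presumably nothing could be extracted for this input.
    return { ok: false, reason: 'No text content was produced' };
  }
  // An empty string still counts as a successful extraction.
  return { ok: true, text: result.textContent };
}

console.log(describeResult({ extractorName: 'text', textContent: '' }));
console.log(describeResult({ error: new Error('Invalid PDF') }));
```

Checking `error` first keeps the success and failure paths distinct, regardless of what the extracted text looks like.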
|
||
#### `extractTextFromBlob` | ||
|
||
Extracts text from a `Blob` object (e.g., files or data retrieved from APIs). | ||
|
||
**Parameters**: | ||
|
||
- `blob` (`Blob`): A Blob representing the file content. | ||
|
||
**Returns**: | ||
A promise that resolves with the same structure as `extractText`. | ||
|
||
**Example**: | ||
|
||
```javascript | ||
const inputFile = document.querySelector('#file-input').files[0]; // HTML File Input | ||
|
||
const result = await extractTextFromBlob(inputFile); | ||
|
||
if (result.textContent) { | ||
console.log('Extracted Text:', result.textContent); | ||
} else { | ||
console.error('Error:', result.error); | ||
} | ||
``` | ||
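Because a `Blob` exposes both its bytes (`arrayBuffer()`) and its MIME type (`type`), a Blob-based helper can be a thin wrapper over an ArrayBuffer-based one. The sketch below only illustrates that idea: `fakeExtractText` is a stand-in that decodes UTF-8, `extractFromBlobSketch` is a hypothetical name, and nothing here is claimed to match the library's internals.

```javascript
// Stand-in for an ArrayBuffer-based extractor (hypothetical, UTF-8 only).
async function fakeExtractText({ arrayBuffer, mimeType }) {
  return {
    extractorName: `stub:${mimeType}`,
    textContent: new TextDecoder().decode(arrayBuffer),
  };
}

// Sketch of a Blob wrapper: read the bytes and reuse the Blob's own MIME type.
async function extractFromBlobSketch(blob) {
  const arrayBuffer = await blob.arrayBuffer();
  return fakeExtractText({ arrayBuffer, mimeType: blob.type });
}

const yamlBlob = new Blob(['foo: bar\n'], { type: 'text/yaml' });
extractFromBlobSketch(yamlBlob).then(result => console.log(result));
```

A `File` from an HTML file input is also a `Blob`, which is why the example above can pass `files[0]` directly.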
|
||
## Supported File Formats | ||
|
||
Currently, `@papra/lecture` supports the following file formats: | ||
|
||
- **PDF** | ||
- **Plain Text** (e.g., `.txt`) | ||
- **YAML** (e.g., `.yaml`, `.yml`) | ||
- **Markdown** (e.g., `.md`) | ||
- **CSV** | ||
- All `text/*` MIME types | ||
- Coming soon: **Microsoft Office Documents** (e.g., `.docx`, `.xlsx`, `.pptx`) | ||
- Coming soon: **eBooks** (e.g., `.epub`, `.mobi`) | ||
- Coming soon: **Image OCR** (e.g., `.jpg`, `.png`) | ||
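The list above, including the `text/*` wildcard, suggests that an extractor is chosen from the MIME type. Here is a minimal sketch of how such a lookup could work. The registry, the `matchesMimeType` and `extractTextSketch` names, and the chosen MIME strings are all assumptions for illustration, not the library's actual implementation.

```javascript
// Hypothetical extractor registry; names and structure are illustrative only.
const decodeUtf8 = arrayBuffer => new TextDecoder().decode(arrayBuffer);

const extractors = [
  { name: 'yaml', mimeTypes: ['application/yaml', 'application/x-yaml'], extract: decodeUtf8 },
  { name: 'text', mimeTypes: ['text/*'], extract: decodeUtf8 },
];

// A pattern matches exactly, or by prefix when it ends with '/*'.
function matchesMimeType(pattern, mimeType) {
  return pattern.endsWith('/*')
    ? mimeType.startsWith(pattern.slice(0, -1))
    : pattern === mimeType;
}

async function extractTextSketch({ arrayBuffer, mimeType }) {
  const extractor = extractors.find(({ mimeTypes }) =>
    mimeTypes.some(pattern => matchesMimeType(pattern, mimeType)));

  if (!extractor) {
    // No extractor registered for this MIME type.
    return { extractorName: undefined, textContent: undefined };
  }

  try {
    return { extractorName: extractor.name, textContent: await extractor.extract(arrayBuffer) };
  } catch (error) {
    // Report failures in the result object instead of rejecting the promise.
    return { extractorName: extractor.name, textContent: undefined, error };
  }
}

const bytes = new TextEncoder().encode('id,jedi\n1,Yoda\n');
const csvBuffer = bytes.buffer.slice(bytes.byteOffset, bytes.byteOffset + bytes.byteLength);
extractTextSketch({ arrayBuffer: csvBuffer, mimeType: 'text/csv' }).then(result => console.log(result));
```

Keeping the wildcard entry last lets more specific extractors win when several patterns could match.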
|
||
### Coming Soon | ||
|
||
We are actively working on adding support for more file formats. Stay tuned for updates! | ||
|
||
## Contributing | ||
|
||
Contributions are welcome! Feel free to open issues or submit pull requests. Let's make `@papra/lecture` better together. | ||
|
||
## Testing | ||
|
||
You can run the tests with the following command: | ||
|
||
```bash | ||
# one shot | ||
pnpm run test | ||
|
||
# watch mode | ||
pnpm run test:watch | ||
``` | ||
|
||
Automated tests are run against the files in the [`fixtures`](./fixtures) directory. Add files to this directory named `[0-9]{3}.ext` (for example `001.js`), following the incremental numbering. The test runner automatically picks up new fixtures and generates a matching `[0-9]{3}.expected` file containing the expected output. | ||
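To make the naming convention concrete, here is a small sketch that computes the next fixture name and the matching `.expected` file name from a directory listing. The `nextFixtureName` and `expectedFileName` helpers are hypothetical and are not part of the project's test runner.

```javascript
// Fixture inputs are named with three digits plus an extension, e.g. "001.js".
// Generated "NNN.expected" files are excluded from the match.
const FIXTURE_PATTERN = /^(\d{3})\.(?!expected$)[^.]+$/;

// Given the file names in the fixtures directory, compute the next fixture name.
function nextFixtureName(fileNames, extension) {
  const indexes = fileNames
    .map(name => FIXTURE_PATTERN.exec(name))
    .filter(match => match !== null)
    .map(match => Number(match[1]));
  const next = Math.max(0, ...indexes) + 1;
  return `${String(next).padStart(3, '0')}.${extension}`;
}

// Name of the generated expectation file for a given input fixture.
function expectedFileName(fixtureName) {
  return `${fixtureName.slice(0, 3)}.expected`;
}

const existing = ['001.txt', '001.expected', '002.pdf', '002.expected'];
console.log(nextFixtureName(existing, 'md'));
console.log(expectedFileName('002.pdf'));
```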
|
||
## License | ||
|
||
This project is licensed under the MIT License. See the [LICENSE](./LICENSE) file for more information. | ||
|
||
## Credits and Acknowledgements | ||
|
||
This project is crafted with ❤️ by [Corentin Thomasset](https://corentin.tech). | ||
If you find this project helpful, please consider [supporting my work](https://buymeacoffee.com/cthmsst). |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
import { defineBuildConfig } from 'unbuild'; | ||
|
||
export default defineBuildConfig({ | ||
entries: [ | ||
'src/index', | ||
], | ||
clean: true, | ||
declaration: true, | ||
sourcemap: true, | ||
rollup: { | ||
emitCJS: true, | ||
}, | ||
}); |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
import antfu from '@antfu/eslint-config'; | ||
|
||
export default antfu({ | ||
stylistic: { | ||
semi: true, | ||
}, | ||
|
||
ignores: ['README.md'], | ||
|
||
rules: { | ||
// To allow export on top of files | ||
'ts/no-use-before-define': ['error', { allowNamedExports: true, functions: false }], | ||
'curly': ['error', 'all'], | ||
'vitest/consistent-test-it': ['error', { fn: 'test' }], | ||
'ts/consistent-type-definitions': ['error', 'type'], | ||
'style/brace-style': ['error', '1tbs', { allowSingleLine: false }], | ||
'unused-imports/no-unused-vars': ['error', { | ||
argsIgnorePattern: '^_', | ||
varsIgnorePattern: '^_', | ||
caughtErrorsIgnorePattern: '^_', | ||
}], | ||
}, | ||
}); |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
Lorem ipsum Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sapien ante conubia vestibulum ultrices quisque nam nascetur consectetur. Viverra amet lacinia massa donec gravida primis leo tellus. Montes nulla sit cras odio penatibus cum aenean metus. Per per eros fusce et platea et feugiat ullamcorper. Nunc suscipit senectus suscipit convallis duis accumsan feugiat maecenas. Turpis suscipit proin etiam nam proin interdum mattis netus. Montes sociis justo non pharetra neque lectus dolor lacinia. Varius ac dictumst et nec massa taciti gravida nullam. Elit sagittis velit placerat vivamus at at non donec. Dolor conubia turpis nostra eu id habitant sollicitudin aliquet. Vitae eros elementum nisl vestibulum euismod mattis conubia praesent. Lacus sapien vehicula luctus purus class at mattis lacinia. Luctus nulla ullamcorper congue facilisi platea interdum metus montes. Habitant a iaculis phasellus faucibus faucibus potenti tincidunt vivamus. Sociosqu cum hendrerit neque ante aenean nunc convallis tempus. Dapibus molestie odio condimentum mollis eget malesuada aliquet aptent. Lobortis scelerisque dictumst pellentesque penatibus ornare lectus pharetra fermentum. Nostra inceptos tempus varius tempus facilisi faucibus in suspendisse. Pretium consequat ornare tempus est molestie vestibulum congue ad. |
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam primis posuere sit nec integer eu sem lectus condimentum rhoncus odio. Pretium urna mus nam posuere mus aliquet a curabitur massa eleifend volutpat. Diam tincidunt diam montes aliquam integer metus tellus gravida ad montes dui. Nisl vivamus primis tincidunt platea ligula scelerisque mollis lacinia arcu est torquent. Laoreet varius ridiculus class eros suscipit lobortis curabitur morbi condimentum porta penatibus. Senectus dolor ullamcorper litora auctor elementum massa netus quis dolor aptent turpis. Feugiat volutpat nec varius faucibus porta pulvinar parturient lacus nisi rhoncus neque. Cubilia luctus ac gravida fringilla fringilla fusce nisi bibendum potenti convallis natoque. Mi tellus sit in blandit ante mus venenatis commodo lorem diam pharetra. | ||
|
||
Leo sapien rutrum integer diam magnis magnis gravida montes placerat viverra suscipit. A lacus molestie suspendisse etiam curae mollis donec class primis luctus elit. Massa sagittis euismod tempor lobortis felis eros massa purus volutpat at leo. Gravida neque nam magnis aptent dictumst semper ultricies eget dictum molestie netus. Eros cum ullamcorper euismod ad condimentum ipsum vehicula fusce purus nostra bibendum. Primis fermentum tempus faucibus dui amet eget gravida aliquet neque sollicitudin sapien. Varius nostra mus laoreet maecenas facilisi tincidunt integer tempor quis mi dignissim. Eros luctus nisi dapibus inceptos ligula suspendisse accumsan neque ridiculus donec vehicula. Ac massa enim luctus suscipit eu lacinia imperdiet orci lacinia donec consequat. Nulla eu fermentum donec mauris quis convallis et gravida pellentesque non mi. | ||
|
||
Turpis erat dictumst arcu quisque commodo ad urna nisi semper iaculis mollis. Sed praesent suscipit montes ligula quam mi nisi auctor pellentesque diam auctor. Suscipit laoreet dictumst hac sed morbi molestie est montes semper interdum vulputate. Vel eget phasellus cum per venenatis magna dolor mi inceptos ad etiam. Tellus velit aptent gravida bibendum varius felis sollicitudin odio dapibus volutpat ultrices. Ipsum eleifend magnis imperdiet ultricies volutpat arcu non non integer fermentum et. Dapibus phasellus neque vestibulum consequat natoque vel id sagittis senectus senectus eros. Quam magna tincidunt praesent neque orci imperdiet neque sit cubilia lacinia per. Senectus penatibus felis pretium risus ultrices duis dignissim fermentum amet elit congue. Velit pharetra aptent aenean magna potenti sed litora cubilia pellentesque aliquet nunc. | ||
|
||
Quis dignissim vehicula sem eu posuere nisi nisl praesent pellentesque quam mauris. Odio donec nisl conubia consectetur dolor mattis sed consectetur sollicitudin lobortis rhoncus. Mi fringilla curae dictum donec gravida himenaeos eu euismod sit tincidunt consequat. Volutpat per pretium consequat diam commodo metus aenean tortor nisi senectus cubilia. Varius porta euismod morbi sapien dignissim varius conubia venenatis fermentum at lacus. Luctus aliquet ultrices lectus dolor vehicula erat mattis eu ridiculus amet eu. Consequat cras massa curae purus egestas elementum neque porta nisl himenaeos ligula. Sit auctor lorem eros hendrerit sagittis nunc rhoncus eu iaculis pharetra tellus. Senectus hendrerit id egestas mus commodo lectus iaculis ac conubia placerat lobortis. Natoque massa vivamus venenatis scelerisque viverra est ad pulvinar primis sagittis nostra. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam primis posuere sit nec integer eu sem lectus condimentum rhoncus odio. Pretium urna mus nam posuere mus aliquet a curabitur massa eleifend volutpat. Diam tincidunt diam montes aliquam integer metus tellus gravida ad montes dui. Nisl vivamus primis tincidunt platea ligula scelerisque mollis lacinia arcu est torquent. Laoreet varius ridiculus class eros suscipit lobortis curabitur morbi condimentum porta penatibus. Senectus dolor ullamcorper litora auctor elementum massa netus quis dolor aptent turpis. Feugiat volutpat nec varius faucibus porta pulvinar parturient lacus nisi rhoncus neque. Cubilia luctus ac gravida fringilla fringilla fusce nisi bibendum potenti convallis natoque. Mi tellus sit in blandit ante mus venenatis commodo lorem diam pharetra. | ||
|
||
Leo sapien rutrum integer diam magnis magnis gravida montes placerat viverra suscipit. A lacus molestie suspendisse etiam curae mollis donec class primis luctus elit. Massa sagittis euismod tempor lobortis felis eros massa purus volutpat at leo. Gravida neque nam magnis aptent dictumst semper ultricies eget dictum molestie netus. Eros cum ullamcorper euismod ad condimentum ipsum vehicula fusce purus nostra bibendum. Primis fermentum tempus faucibus dui amet eget gravida aliquet neque sollicitudin sapien. Varius nostra mus laoreet maecenas facilisi tincidunt integer tempor quis mi dignissim. Eros luctus nisi dapibus inceptos ligula suspendisse accumsan neque ridiculus donec vehicula. Ac massa enim luctus suscipit eu lacinia imperdiet orci lacinia donec consequat. Nulla eu fermentum donec mauris quis convallis et gravida pellentesque non mi. | ||
|
||
Turpis erat dictumst arcu quisque commodo ad urna nisi semper iaculis mollis. Sed praesent suscipit montes ligula quam mi nisi auctor pellentesque diam auctor. Suscipit laoreet dictumst hac sed morbi molestie est montes semper interdum vulputate. Vel eget phasellus cum per venenatis magna dolor mi inceptos ad etiam. Tellus velit aptent gravida bibendum varius felis sollicitudin odio dapibus volutpat ultrices. Ipsum eleifend magnis imperdiet ultricies volutpat arcu non non integer fermentum et. Dapibus phasellus neque vestibulum consequat natoque vel id sagittis senectus senectus eros. Quam magna tincidunt praesent neque orci imperdiet neque sit cubilia lacinia per. Senectus penatibus felis pretium risus ultrices duis dignissim fermentum amet elit congue. Velit pharetra aptent aenean magna potenti sed litora cubilia pellentesque aliquet nunc. | ||
|
||
Quis dignissim vehicula sem eu posuere nisi nisl praesent pellentesque quam mauris. Odio donec nisl conubia consectetur dolor mattis sed consectetur sollicitudin lobortis rhoncus. Mi fringilla curae dictum donec gravida himenaeos eu euismod sit tincidunt consequat. Volutpat per pretium consequat diam commodo metus aenean tortor nisi senectus cubilia. Varius porta euismod morbi sapien dignissim varius conubia venenatis fermentum at lacus. Luctus aliquet ultrices lectus dolor vehicula erat mattis eu ridiculus amet eu. Consequat cras massa curae purus egestas elementum neque porta nisl himenaeos ligula. Sit auctor lorem eros hendrerit sagittis nunc rhoncus eu iaculis pharetra tellus. Senectus hendrerit id egestas mus commodo lectus iaculis ac conubia placerat lobortis. Natoque massa vivamus venenatis scelerisque viverra est ad pulvinar primis sagittis nostra. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
foo: bar | ||
biz: baz |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
foo: bar | ||
biz: baz |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
# My file | ||
|
||
## Heading | ||
|
||
Lorem ipsum dolor sit amet, consectetur adipiscing elit | ||
|
||
```js | ||
console.log('Hello, World!'); | ||
``` | ||
|
||
| Jedi | Light saber | | ||
|-------|-----| | ||
| Obi-Wan Kenobi | Blue | | ||
| Yoda | Green | | ||
| Ahoska Tano | White | | ||
| Luke Skywalker | Green | |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
# My file | ||
|
||
## Heading | ||
|
||
Lorem ipsum dolor sit amet, consectetur adipiscing elit | ||
|
||
```js | ||
console.log('Hello, World!'); | ||
``` | ||
|
||
| Jedi | Light saber | | ||
|-------|-----| | ||
| Obi-Wan Kenobi | Blue | | ||
| Yoda | Green | | ||
| Ahoska Tano | White | | ||
| Luke Skywalker | Green | |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
id,jedi,lightsaber_color | ||
1,Obi-Wan Kenobi,blue | ||
2,Yoda,green | ||
3,Ahoska Tano,white |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
id,jedi,lightsaber_color | ||
1,Obi-Wan Kenobi,blue | ||
2,Yoda,green | ||
3,Ahoska Tano,white |