chore(setup): first commit
CorentinTh committed Jan 22, 2025
0 parents commit a91d98f
Showing 28 changed files with 6,131 additions and 0 deletions.
4 changes: 4 additions & 0 deletions .github/FUNDING.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
github:
- CorentinTh

buy_me_a_coffee: cthmsst
35 changes: 35 additions & 0 deletions .github/workflows/ci-lib.yaml
@@ -0,0 +1,35 @@
name: CI - Lib

on:
  pull_request:
  push:
    branches:
      - main

jobs:
  ci-lib:
    name: CI - Lib
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4
      - run: corepack enable
      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: pnpm

      - name: Install dependencies
        run: pnpm i --frozen-lockfile

      - name: Run linters
        run: pnpm lint

      - name: Type check
        run: pnpm typecheck

      - name: Run unit test
        run: pnpm test

      - name: Build the lib
        run: pnpm build
32 changes: 32 additions & 0 deletions .gitignore
@@ -0,0 +1,32 @@
# Nuxt dev/build outputs
.output
.data
.nuxt
.nitro
.cache
dist
dist-app
dist-node
dist-cloudflare

# Node dependencies
node_modules

# Logs
logs
*.log

# Misc
.DS_Store
.fleet
.idea

# Local env files
.env
.env.*
!.env.example

.wrangler
coverage
cache
.zed
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2025 Corentin THOMASSET

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
141 changes: 141 additions & 0 deletions README.md
@@ -0,0 +1,141 @@
# @papra/lecture

`@papra/lecture` is a robust and lightweight library for extracting text content from various file formats. Whether you're processing documents for indexing or for LLM readability, this library simplifies the task of reading text content programmatically.

## Features

- **Wide Format Support**: Extract text from PDFs, plain text files, YAML, Markdown, CSV, and all `text/*` MIME types.
- **Promise-based API**: Designed with asynchronous functions to provide a seamless integration experience.
- **Extensible and Modular**: Built with future compatibility in mind. Support for more file formats is on the way.
- **Error Handling**: Provides detailed error information when extraction fails.

## Installation

To install the package, use pnpm, npm, or yarn:

```bash
pnpm install @papra/lecture

npm install @papra/lecture

yarn add @papra/lecture
```

## Usage

### Importing the Library

You can import the library using ES Modules or CommonJS syntax:

```javascript
// ES Modules
import { extractText, extractTextFromBlob } from '@papra/lecture';
```

```javascript
// CommonJS
const { extractText, extractTextFromBlob } = require('@papra/lecture');
```

### Functions

#### `extractText`

Extracts text from a file's binary data using its MIME type.

**Parameters**:

- `arrayBuffer` (`ArrayBuffer`): The binary content of the file.
- `mimeType` (`string`): The MIME type of the file.

**Returns**:
A promise that resolves to an object with the following properties:

- `extractorName` (`string | undefined`): The name of the extractor used.
- `textContent` (`string | undefined`): The extracted text content, if available.
- `error` (`Error | undefined`): An error object, if an issue occurred during extraction.

**Example**:

```javascript
const file = await fetch('example.pdf').then(res => res.arrayBuffer());
const mimeType = 'application/pdf';

const result = await extractText({ arrayBuffer: file, mimeType });

if (result.textContent) {
  console.log('Extracted Text:', result.textContent);
} else {
  console.error('Error:', result.error);
}
```
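
Note that an empty string is a valid `textContent` but is falsy in JavaScript, so a truthiness check would report an empty document as a failure. Below is a minimal sketch of a stricter check, assuming only the result shape documented above. `summarizeExtraction` is a hypothetical helper, not part of the library, and treating "no error and no text" as an unsupported file is an assumption rather than documented behavior.

```javascript
// Hypothetical helper, not part of @papra/lecture.
// Takes the documented result shape: { extractorName, textContent, error }.
function summarizeExtraction({ extractorName, textContent, error }) {
  // Check the error first so a failure is never mistaken for "no text".
  if (error !== undefined) {
    return { ok: false, reason: error.message };
  }

  // Assumption: both fields undefined means no text could be produced.
  if (textContent === undefined) {
    return { ok: false, reason: 'No text content was extracted' };
  }

  // An empty string is still a successful extraction.
  return { ok: true, extractorName, text: textContent };
}
```

With this helper, `summarizeExtraction(await extractText({ arrayBuffer, mimeType }))` can be branched on `ok` instead of on the truthiness of the text.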

#### `extractTextFromBlob`

Extracts text from a `Blob` object (e.g., files or data retrieved from APIs).

**Parameters**:

- `blob` (`Blob`): A Blob representing the file content.

**Returns**:
A promise that resolves with the same structure as `extractText`.

**Example**:

```javascript
const inputFile = document.querySelector('#file-input').files[0]; // HTML File Input

const result = await extractTextFromBlob(inputFile);

if (result.textContent) {
  console.log('Extracted Text:', result.textContent);
} else {
  console.error('Error:', result.error);
}
```
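
If the MIME type needs to be controlled by hand (for example when a `Blob` has no type), the blob can be converted into the `{ arrayBuffer, mimeType }` input that `extractText` takes. The sketch below relies only on the standard `Blob` API. `blobToExtractInput` and its fallback value are hypothetical, and it is not a description of how `extractTextFromBlob` is implemented.

```javascript
// Hypothetical helper, not part of @papra/lecture.
// Converts a Blob into the `{ arrayBuffer, mimeType }` shape used by `extractText`.
async function blobToExtractInput(blob, fallbackMimeType = 'application/octet-stream') {
  // Standard Blob API: reads the whole content into memory.
  const arrayBuffer = await blob.arrayBuffer();

  // `blob.type` is an empty string when the type is unknown.
  const mimeType = blob.type || fallbackMimeType;

  return { arrayBuffer, mimeType };
}
```

The result can then be passed along as `extractText(await blobToExtractInput(blob, 'text/plain'))`.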

## Supported File Formats

Currently, `@papra/lecture` supports the following file formats:

- **PDF**
- **Plain Text** (e.g., `.txt`)
- **YAML** (e.g., `.yaml`, `.yml`)
- **Markdown** (e.g., `.md`)
- **CSV**
- All `text/*` MIME types
- Coming soon: **Microsoft Office Documents** (e.g., `.docx`, `.xlsx`, `.pptx`)
- Coming soon: **eBooks** (e.g., `.epub`, `.mobi`)
- Coming soon: **Images OCR** (e.g., `.jpg`, `.png`)
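
To check up front whether a MIME type falls into the list above, a wildcard match such as `text/*` can be sketched as follows. This is an illustration under stated assumptions, not the library's internal matching logic: `matchesMimePattern`, `normalizeMimeType`, and the `SUPPORTED_PATTERNS` list are hypothetical names, and the list only covers the two entries that are unambiguous from this README.

```javascript
// Hypothetical helpers, not part of @papra/lecture.

// Strips parameters such as "; charset=utf-8" and lowercases the type.
function normalizeMimeType(mimeType) {
  return mimeType.split(';')[0].trim().toLowerCase();
}

// Returns true when `mimeType` matches `pattern`, where the pattern
// subtype may be the wildcard `*` (e.g. `text/*`).
function matchesMimePattern(pattern, mimeType) {
  const [patternType, patternSubtype] = pattern.toLowerCase().split('/');
  const [type, subtype] = normalizeMimeType(mimeType).split('/');

  if (patternType !== type) {
    return false;
  }

  return patternSubtype === '*' || patternSubtype === subtype;
}

// Assumed from the list above: PDF plus every `text/*` type.
const SUPPORTED_PATTERNS = ['application/pdf', 'text/*'];

function isLikelySupported(mimeType) {
  return SUPPORTED_PATTERNS.some(pattern => matchesMimePattern(pattern, mimeType));
}
```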

### Coming Soon

We are actively working on adding support for more file formats. Stay tuned for updates!

## Contributing

Contributions are welcome! Feel free to open issues or submit pull requests. Let's make `@papra/lecture` better together.

## Testing

You can run the tests with the following command:

```bash
# one shot
pnpm run test

# watch mode
pnpm run test:watch
```

Fixture-based tests run against the [`fixtures`](./fixtures) directory. To add one, drop a file named `[0-9]{3}.ext` (for example `001.js`) into this directory, following the incremental numbering. The test runner automatically picks up new fixtures and generates a matching `[0-9]{3}.expected` file containing the expected output.
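
As an illustration of the naming scheme, the next free fixture name can be computed from the files already present. This is a minimal sketch, not part of the test runner: `nextFixtureName` and `FIXTURE_PATTERN` are hypothetical, and the regular expression is one reading of the `[0-9]{3}.ext` format that skips the generated `.expected` files.

```javascript
// Hypothetical helper, not part of the repository's test runner.

// Matches names such as `001.pdf`, but not the generated `001.expected`.
const FIXTURE_PATTERN = /^([0-9]{3})\.(?!expected$)[^.]+$/;

// Returns the next fixture file name for the given extension,
// based on the highest number already used.
function nextFixtureName(fileNames, extension) {
  const numbers = fileNames
    .map(name => FIXTURE_PATTERN.exec(name))
    .filter(match => match !== null)
    .map(match => Number(match[1]));

  const next = numbers.length === 0 ? 1 : Math.max(...numbers) + 1;

  return `${String(next).padStart(3, '0')}.${extension}`;
}
```

Given the fixtures in this commit (`001.pdf` through `005.csv`), a new JSON fixture would be named `006.json`.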

## License

This project is licensed under the MIT License. See the [LICENSE](./LICENSE) file for more information.

## Credits and Acknowledgements

This project is crafted with ❤️ by [Corentin Thomasset](https://corentin.tech).
If you find this project helpful, please consider [supporting my work](https://buymeacoffee.com/cthmsst).
13 changes: 13 additions & 0 deletions build.config.ts
@@ -0,0 +1,13 @@
import { defineBuildConfig } from 'unbuild';

export default defineBuildConfig({
  entries: [
    'src/index',
  ],
  clean: true,
  declaration: true,
  sourcemap: true,
  rollup: {
    emitCJS: true,
  },
});
23 changes: 23 additions & 0 deletions eslint.config.js
@@ -0,0 +1,23 @@
import antfu from '@antfu/eslint-config';

export default antfu({
  stylistic: {
    semi: true,
  },

  ignores: ['README.md'],

  rules: {
    // To allow export on top of files
    'ts/no-use-before-define': ['error', { allowNamedExports: true, functions: false }],
    'curly': ['error', 'all'],
    'vitest/consistent-test-it': ['error', { fn: 'test' }],
    'ts/consistent-type-definitions': ['error', 'type'],
    'style/brace-style': ['error', '1tbs', { allowSingleLine: false }],
    'unused-imports/no-unused-vars': ['error', {
      argsIgnorePattern: '^_',
      varsIgnorePattern: '^_',
      caughtErrorsIgnorePattern: '^_',
    }],
  },
});
1 change: 1 addition & 0 deletions fixtures/001.expected
@@ -0,0 +1 @@
Lorem ipsum Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sapien ante conubia vestibulum ultrices quisque nam nascetur consectetur. Viverra amet lacinia massa donec gravida primis leo tellus. Montes nulla sit cras odio penatibus cum aenean metus. Per per eros fusce et platea et feugiat ullamcorper. Nunc suscipit senectus suscipit convallis duis accumsan feugiat maecenas. Turpis suscipit proin etiam nam proin interdum mattis netus. Montes sociis justo non pharetra neque lectus dolor lacinia. Varius ac dictumst et nec massa taciti gravida nullam. Elit sagittis velit placerat vivamus at at non donec. Dolor conubia turpis nostra eu id habitant sollicitudin aliquet. Vitae eros elementum nisl vestibulum euismod mattis conubia praesent. Lacus sapien vehicula luctus purus class at mattis lacinia. Luctus nulla ullamcorper congue facilisi platea interdum metus montes. Habitant a iaculis phasellus faucibus faucibus potenti tincidunt vivamus. Sociosqu cum hendrerit neque ante aenean nunc convallis tempus. Dapibus molestie odio condimentum mollis eget malesuada aliquet aptent. Lobortis scelerisque dictumst pellentesque penatibus ornare lectus pharetra fermentum. Nostra inceptos tempus varius tempus facilisi faucibus in suspendisse. Pretium consequat ornare tempus est molestie vestibulum congue ad.
Binary file added fixtures/001.pdf
Binary file not shown.
7 changes: 7 additions & 0 deletions fixtures/002.expected
@@ -0,0 +1,7 @@
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam primis posuere sit nec integer eu sem lectus condimentum rhoncus odio. Pretium urna mus nam posuere mus aliquet a curabitur massa eleifend volutpat. Diam tincidunt diam montes aliquam integer metus tellus gravida ad montes dui. Nisl vivamus primis tincidunt platea ligula scelerisque mollis lacinia arcu est torquent. Laoreet varius ridiculus class eros suscipit lobortis curabitur morbi condimentum porta penatibus. Senectus dolor ullamcorper litora auctor elementum massa netus quis dolor aptent turpis. Feugiat volutpat nec varius faucibus porta pulvinar parturient lacus nisi rhoncus neque. Cubilia luctus ac gravida fringilla fringilla fusce nisi bibendum potenti convallis natoque. Mi tellus sit in blandit ante mus venenatis commodo lorem diam pharetra.

Leo sapien rutrum integer diam magnis magnis gravida montes placerat viverra suscipit. A lacus molestie suspendisse etiam curae mollis donec class primis luctus elit. Massa sagittis euismod tempor lobortis felis eros massa purus volutpat at leo. Gravida neque nam magnis aptent dictumst semper ultricies eget dictum molestie netus. Eros cum ullamcorper euismod ad condimentum ipsum vehicula fusce purus nostra bibendum. Primis fermentum tempus faucibus dui amet eget gravida aliquet neque sollicitudin sapien. Varius nostra mus laoreet maecenas facilisi tincidunt integer tempor quis mi dignissim. Eros luctus nisi dapibus inceptos ligula suspendisse accumsan neque ridiculus donec vehicula. Ac massa enim luctus suscipit eu lacinia imperdiet orci lacinia donec consequat. Nulla eu fermentum donec mauris quis convallis et gravida pellentesque non mi.

Turpis erat dictumst arcu quisque commodo ad urna nisi semper iaculis mollis. Sed praesent suscipit montes ligula quam mi nisi auctor pellentesque diam auctor. Suscipit laoreet dictumst hac sed morbi molestie est montes semper interdum vulputate. Vel eget phasellus cum per venenatis magna dolor mi inceptos ad etiam. Tellus velit aptent gravida bibendum varius felis sollicitudin odio dapibus volutpat ultrices. Ipsum eleifend magnis imperdiet ultricies volutpat arcu non non integer fermentum et. Dapibus phasellus neque vestibulum consequat natoque vel id sagittis senectus senectus eros. Quam magna tincidunt praesent neque orci imperdiet neque sit cubilia lacinia per. Senectus penatibus felis pretium risus ultrices duis dignissim fermentum amet elit congue. Velit pharetra aptent aenean magna potenti sed litora cubilia pellentesque aliquet nunc.

Quis dignissim vehicula sem eu posuere nisi nisl praesent pellentesque quam mauris. Odio donec nisl conubia consectetur dolor mattis sed consectetur sollicitudin lobortis rhoncus. Mi fringilla curae dictum donec gravida himenaeos eu euismod sit tincidunt consequat. Volutpat per pretium consequat diam commodo metus aenean tortor nisi senectus cubilia. Varius porta euismod morbi sapien dignissim varius conubia venenatis fermentum at lacus. Luctus aliquet ultrices lectus dolor vehicula erat mattis eu ridiculus amet eu. Consequat cras massa curae purus egestas elementum neque porta nisl himenaeos ligula. Sit auctor lorem eros hendrerit sagittis nunc rhoncus eu iaculis pharetra tellus. Senectus hendrerit id egestas mus commodo lectus iaculis ac conubia placerat lobortis. Natoque massa vivamus venenatis scelerisque viverra est ad pulvinar primis sagittis nostra.
7 changes: 7 additions & 0 deletions fixtures/002.txt
@@ -0,0 +1,7 @@
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam primis posuere sit nec integer eu sem lectus condimentum rhoncus odio. Pretium urna mus nam posuere mus aliquet a curabitur massa eleifend volutpat. Diam tincidunt diam montes aliquam integer metus tellus gravida ad montes dui. Nisl vivamus primis tincidunt platea ligula scelerisque mollis lacinia arcu est torquent. Laoreet varius ridiculus class eros suscipit lobortis curabitur morbi condimentum porta penatibus. Senectus dolor ullamcorper litora auctor elementum massa netus quis dolor aptent turpis. Feugiat volutpat nec varius faucibus porta pulvinar parturient lacus nisi rhoncus neque. Cubilia luctus ac gravida fringilla fringilla fusce nisi bibendum potenti convallis natoque. Mi tellus sit in blandit ante mus venenatis commodo lorem diam pharetra.

Leo sapien rutrum integer diam magnis magnis gravida montes placerat viverra suscipit. A lacus molestie suspendisse etiam curae mollis donec class primis luctus elit. Massa sagittis euismod tempor lobortis felis eros massa purus volutpat at leo. Gravida neque nam magnis aptent dictumst semper ultricies eget dictum molestie netus. Eros cum ullamcorper euismod ad condimentum ipsum vehicula fusce purus nostra bibendum. Primis fermentum tempus faucibus dui amet eget gravida aliquet neque sollicitudin sapien. Varius nostra mus laoreet maecenas facilisi tincidunt integer tempor quis mi dignissim. Eros luctus nisi dapibus inceptos ligula suspendisse accumsan neque ridiculus donec vehicula. Ac massa enim luctus suscipit eu lacinia imperdiet orci lacinia donec consequat. Nulla eu fermentum donec mauris quis convallis et gravida pellentesque non mi.

Turpis erat dictumst arcu quisque commodo ad urna nisi semper iaculis mollis. Sed praesent suscipit montes ligula quam mi nisi auctor pellentesque diam auctor. Suscipit laoreet dictumst hac sed morbi molestie est montes semper interdum vulputate. Vel eget phasellus cum per venenatis magna dolor mi inceptos ad etiam. Tellus velit aptent gravida bibendum varius felis sollicitudin odio dapibus volutpat ultrices. Ipsum eleifend magnis imperdiet ultricies volutpat arcu non non integer fermentum et. Dapibus phasellus neque vestibulum consequat natoque vel id sagittis senectus senectus eros. Quam magna tincidunt praesent neque orci imperdiet neque sit cubilia lacinia per. Senectus penatibus felis pretium risus ultrices duis dignissim fermentum amet elit congue. Velit pharetra aptent aenean magna potenti sed litora cubilia pellentesque aliquet nunc.

Quis dignissim vehicula sem eu posuere nisi nisl praesent pellentesque quam mauris. Odio donec nisl conubia consectetur dolor mattis sed consectetur sollicitudin lobortis rhoncus. Mi fringilla curae dictum donec gravida himenaeos eu euismod sit tincidunt consequat. Volutpat per pretium consequat diam commodo metus aenean tortor nisi senectus cubilia. Varius porta euismod morbi sapien dignissim varius conubia venenatis fermentum at lacus. Luctus aliquet ultrices lectus dolor vehicula erat mattis eu ridiculus amet eu. Consequat cras massa curae purus egestas elementum neque porta nisl himenaeos ligula. Sit auctor lorem eros hendrerit sagittis nunc rhoncus eu iaculis pharetra tellus. Senectus hendrerit id egestas mus commodo lectus iaculis ac conubia placerat lobortis. Natoque massa vivamus venenatis scelerisque viverra est ad pulvinar primis sagittis nostra.
2 changes: 2 additions & 0 deletions fixtures/003.expected
@@ -0,0 +1,2 @@
foo: bar
biz: baz
2 changes: 2 additions & 0 deletions fixtures/003.yaml
@@ -0,0 +1,2 @@
foo: bar
biz: baz
16 changes: 16 additions & 0 deletions fixtures/004.expected
@@ -0,0 +1,16 @@
# My file

## Heading

Lorem ipsum dolor sit amet, consectetur adipiscing elit

```js
console.log('Hello, World!');
```

| Jedi | Light saber |
|-------|-----|
| Obi-Wan Kenobi | Blue |
| Yoda | Green |
| Ahoska Tano | White |
| Luke Skywalker | Green |
16 changes: 16 additions & 0 deletions fixtures/004.md
@@ -0,0 +1,16 @@
# My file

## Heading

Lorem ipsum dolor sit amet, consectetur adipiscing elit

```js
console.log('Hello, World!');
```

| Jedi | Light saber |
|-------|-----|
| Obi-Wan Kenobi | Blue |
| Yoda | Green |
| Ahoska Tano | White |
| Luke Skywalker | Green |
4 changes: 4 additions & 0 deletions fixtures/005.csv
@@ -0,0 +1,4 @@
id,jedi,lightsaber_color
1,Obi-Wan Kenobi,blue
2,Yoda,green
3,Ahoska Tano,white
4 changes: 4 additions & 0 deletions fixtures/005.expected
@@ -0,0 +1,4 @@
id,jedi,lightsaber_color
1,Obi-Wan Kenobi,blue
2,Yoda,green
3,Ahoska Tano,white