Skip to content

Latest commit

 

History

History
722 lines (512 loc) · 25.8 KB

examples.rst

File metadata and controls

722 lines (512 loc) · 25.8 KB

EXAMPLES

Examples to show how to execute the different subcommands from the ebooktools.py script.

Check also some important tips about using the ebooktools.py script.

In order to avoid data loss, use the --dry-run or --symlink-only option to test that a given subcommand would do what you expect it to do, as explained in the Security and safety section.

When --dry-run is enabled, no file rename/move/symlink/etc. operations will actually be executed.

When --symlink-only is enabled, symbolic links to ebook files are created instead of moving them.

Thus, these flags are convenient with the subcommands organize, rename, and split which will modify the input files. Though only --dry-run applies to split.

When using these two flags, you will be able to know what the ebooktools.py script would have done by looking at the messages printed out in the terminal. Use the --log-level debug flag to get more information about what is the ebooktools.py performing as operation (e.g. rename/move/symlink/ect.).

Sometimes, it might be more convenient to edit the main configuration file config.py instead of building a long command in the terminal with all the options for a given subcommand (e.g. organize).

For example, you might have the following long command in the terminal:

$ ebooktools organize ~/folder_to_organize/ -o ~/output_folder/ --ofp ~/output_folder_pamphlets/ --owi

Instead of providing command-line arguments, we will edit the configuration file config.py by running the following edit subcommand:

$ ebooktools edit main

The config.py file will be opened by the default source code editor associated with this type of file and then you can modify the right configuration variables for the given subcommand.

You can then run the given subcommand with the -u flag (short name for the --use-config option) and all the updated options in the configuration file will be used:

$ ebooktools organize -u

where organize can also be any of the other supported subcommands.

The -u, --use-config flag tells the ebooktools.py script to only use the parameters defined in the config file config.py.

See edit for more info about this subcommand.
$ ebooktools convert --ocr always pdf_to_convert.pdf -o converted.txt

By setting --ocr to always, the pdf file will be first OCRed before trying the simple conversion tools (pdftotext or calibre's ebook-convert if the former command is not found).

Output:

Running pyebooktools v0.1.0a3
Verbose option disabled
OCR=always, first try OCR then conversion
Will run OCR on file 'pdf_to_convert.pdf' with 1 page...
OCR successful!
$ ebooktools convert pdf_to_convert.pdf -o converted.txt

If pdftotext is present, it is used to convert the pdf file to text. Otherwise, calibre's ebook-convert is used for the conversion.

Output:

Running pyebooktools v0.1.0a3
Verbose option disabled
OCR=false, try only conversion...
Conversion successful!

The two config files that can be edited are the main and logging config files, named config.py and logging.py respectively. We will only focus in the main config file because it is the most important one since it contains all the options for the ebooktools.py script.

To edit the main config file with PyCharm:

$ ebooktools edit -a charm main

A tab with the main config file will be opened in PyCharm's Editor window:

Example: opened tab with config file in PyCharm

To reset the main config file with factory settings as defined in default_config.py:

$ ebooktools edit -r main

Find ISBNs in the string '978-159420172-1 978-1892391810 0000000000 0123456789 1111111111':

$ ebooktools find '978-159420172-1 978-1892391810 0000000000 0123456789 1111111111'

The input string can be enclosed within single or double quotes.

Output:

Running pyebooktools v0.1.0a3
Verbose option disabled
Extracted ISBNs:
9781594201721
9781892391810

The other sequences '0000000000 0123456789 1111111111' are rejected because they are matched with the regular expression isbn_blacklist_regex.

By default, the extracted ISBNs are separated by newlines, \n.

ℹ️

If you want to search ISBNs in a multiple-lines string, e.g. you copied many pages from a document, you must follow the find subcommand with a backslash \ and enclose the string within double quotes, like so:

$ ebooktools find \
"
978-159420172-1

blablabla
blablabla
blablabla

978-1892391810
0000000000 0123456789

blablabla
blablabla
blablabla

1111111111
blablabla
blablabla
"
$ ebooktools find pdf_file.pdf

Output:

Running pyebooktools v0.1.0a3
Verbose option disabled
Searching file 'pdf_file.pdf' for ISBN numbers...
Extracted ISBNs:
9789580158448
1000100111

The search for ISBNs starts in the first pages of the document to increase the likelihood that the first extracted ISBN is the correct one. Then the last pages are analyzed in reverse. Finally, the rest of the pages are searched.

Thus, in this example, the first extracted ISBN is the correct one associated with the book since it was found in the first page.

The last sequence 1000100111 was found in the middle of the document and is not an ISBN even though it is a technically valid but wrong ISBN that the regular expression isbn_blacklist_regex didn't catch. Maybe it is a binary sequence that is part of a problem in a book about digital system.

The following examples show how to organize ebooks depending on different cases:

You can use organize to check ebooks for corruption without organizing them by using the --corruption-check-only flag. See the Check ebooks for corruption only example for more details.

ℹ️

You can also combine all these cases by using all of the output folders along with the --owi flag in the command-line when calling the organize subcommand.

Or better you can also do it through the config file config.py by running the following edit subcommand:

$ ebooktools edit main

The config.py file will be opened by the default source code editor associated with this type of file and then you can modify the right configuration variables.

Then run the organize subcommand with the -u flag and the updated options in the configuration file will be used:

$ ebooktools organize -u

See edit for more info about this subcommand.

We only want to check the following ebook files for corruption (e.g. zero-filled files, broken pdfs, corrupt archive, etc.):

Example: content of ``folder_to_organize``

This is the command to check these ebooks for corruption only:

$ ebooktools organize --cco ~/folder_to_organize/

where

  • --cco is the short name for the corruption-check-only flag and checks ebooks for corruption only without organizing them
  • folder_to_organize contains the ebooks that need to be organized or checked (as in our case)

Output:

Example: output terminal

ℹ️

We want to organize the following ebook files:

Example: content of ``folder_to_organize``

This is the command to organize these ebooks:

$ ebooktools organize ~/folder_to_organize/ -o ~/output_folder/

where

Output:

Example: output terminal

Content of output_folder:

Example: content of ``output_folder``

ℹ️

Since the --owi flag was not used, two ebook files that didn't contain ISBNs could not be further processed and thus were left as they are in the original directory folder_to_organize. See Organize ebooks with output_folder_uncertain where this flag is enabled to organize ebooks without ISBNs by getting these book identifiers through other means (e.g. calibre's ebook-meta).

We want to organize the following ebook files, one of which is corrupted:

Example: content of ``folder_to_organize``

This is the command to organize these ebooks as wanted:

$ ebooktools organize ~/folder_to_organize/ -o ~/output_folder/ --ofc ~/output_folder_corrupt/

where

  • output_folder will contain all the renamed ebooks for which an ISBN was found in it
  • output_folder_corrupt will contain any corrupted ebook (e.g. zero-filled files, corrupt archives or broken .pdf files)

Output:

Example: output terminal

Content of output_folder:

Example: content of ``output_folder``

Content of output_folder_corrupt:

Example: content of ``output_folder_corrupt``

ℹ️

Along each corrupted file, a metadata file is saved containing information about the corruption reason and the ebook's old file path.

We want to organize the following ebook files, some of which are pamphlets:

Example: content of ``folder_to_organize``

ℹ️

If no ISBN was found for a non-pdf file and the file size is less than pamphlet_max_filesize_kib, then it is considered as a pamphlet.

This is the command to organize these ebooks as wanted:

$ ebooktools organize ~/folder_to_organize/ -o ~/output_folder/ --ofp ~/output_folder_pamphlets/ --owi

where

  • output_folder will contain all the renamed ebooks for which an ISBN was found in it
  • output_folder_pamphlets will contain all the pamphlets-like documents
  • --owi is a flag to enable the organization of documents without ISBNs such as pamphlets

Output:

Example: output terminal

Content of output_folder:

Example: content of ``output_folder``

Content of output_folder_pamphlets:

Example: content of ``output_folder_pamphlets``

We want to organize the following ebook files, some of which do not contain any ISBNs:

Example: content of ``folder_to_organize``

This is the command to organize these ebooks as wanted:

$ ebooktools organize ~/folder_to_organize/ -o ~/output_folder/ --ofu ~/output_folder_uncertain/ --owi

where

  • output_folder will contain all the renamed ebooks for which an ISBN was found in it
  • output_folder_uncertain will contain all the renamed ebooks for which no ISBNs could be found in them
  • --owi is a flag to enable the organization of ebooks without ISBNs

Output:

Example: output terminal

Content of output_folder:

Example: content of ``output_folder``

Content of output_folder_uncertain:

Example: content of ``output_folder_uncertain``

ℹ️

For those ebooks for which no ISBNs could be found in them, the ebooktools.py script takes the following steps to organize them:

  1. Use calibre's ebook-meta to extract the author and title metadata from the ebook file
  2. Search the online metadata sources (Goodreads,Amazon.com,Google) by the extracted author & title and just by title
  3. If there is no useful metadata or nothing is found online, the script will try to use the filename for searching

[OWI]

We want to rename ebook files from a calibre library folder and save their symlinks along with their copied metadata.opf files in a separate folder.

Content of calibre_folder:

Example: content of ``folder_to_organize``

This is the command to rename these ebooks as wanted:

$ ebooktools rename ~/calibre_folder/ -o ~/output_folder/ --sm opfcopy --sl

where

  • output_folder is where the renamed books (or their symbolic links) will be moved to along with their metadata files
  • --sm opfcopy copies calibre's metadata.opf next to each renamed file with a output_metadata_extension extension
  • --sl is a flag for creating symbolic links to ebooks, instead of moving them to the output_folder

Output:

Running pyebooktools v0.1.0a3
Verbose option disabled
Files sorted in asc
Parsing metadata for 'Title1 - Author1.pdf'...
Saving book file and metadata...
Parsing metadata for 'Title2 - Author2.epub'...
Saving book file and metadata...
Parsing metadata for 'Title3 - Author3.pdf'...
Saving book file and metadata...
Parsing metadata for 'Title4 - Author4.epub'...
Saving book file and metadata...

Content of output_folder:

Example: content of ``output_folder``

ℹ️

  • The ebook files are renamed based on the content of their associated metadata.opf files and the new filenames follow the output_filename_template format.
  • The metadata.opf files are copied with the meta extension (default) beside the symlinks to the ebook files.

We have a folder containing four ebooks and the metadata file for two of them:

Example: content of ``folder_with_books``

We want to split these ebook files into folders containing two files each and their numbering should start at 1:

$ ebooktools split ~/folder_with_books/ -o ~/output_folder/ -s 1 --fpf 2

where

  • output_folder in which all the new consecutively named folders will be created
  • -s is the number of the first folder
  • --fpf is the number of files per folder

Content of output_folder

Example: content of ``output_folder``

ℹ️

Note that the metadata folders contain only one file each as expected.
[OWI]https://github.com/raul23/pyebooktools#organize-without-isbn-label