docx/odt parsing for Seismica

These scripts are for converting docx/odt manuscript submissions into latex, specifically a tex file that will work with the Seismica latex publication template. The scripts are not perfect and depend heavily on the docx/odt templates being used correctly, so if something goes wrong, it might be due to a template error.

If you hit an error running these scripts, we recommend looking at the traceback and deciding where it would be easiest to fix it: you might want to fix styles/formatting in the docx/odt file, or it might be simpler to make some quick edits in the bib file generated by anystyle or the tex file generated by pandoc.

dependencies:

pandoc and anystyle are not python packages and need to be installed on their own. To install anystyle locally on a linux machine, run:

  • sudo apt install ruby-dev
  • sudo gem install anystyle-cli

Note that you will also need to have make installed on your system for this to work.

A conda environment is a nice way to set up the python dependencies. You could, for example, use these commands:

conda create -n seismica numpy habanero python-dateutil
cd /path/to/put/biblib/files
git clone git@github.com:WeAreSeismica/biblib.git
conda activate seismica
pip install /path/to/put/biblib/files/biblib

Note that biblib is not available through conda or PyPI (the one available from PyPI is not the same module), though it can be installed in a conda env by using pip with the downloaded source code. We use a fork of the original source code that has been updated for compatibility with Python 3.10+.
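Before starting a conversion with a local install, it can help to confirm that everything listed above is actually reachable. The following is a minimal sketch, not part of the repository: the function names are hypothetical, and the module list simply assumes the import names of the packages mentioned above (python-dateutil is imported as dateutil).

```python
import importlib.util
import shutil

# External tools and Python modules named in the dependency notes above.
# "dateutil" is the import name of the python-dateutil package.
REQUIRED_TOOLS = ("pandoc", "anystyle")
REQUIRED_MODULES = ("numpy", "habanero", "dateutil", "biblib")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the command-line tools that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the top-level Python modules that cannot be located."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    problems = missing_tools() + missing_modules()
    if problems:
        print("Missing dependencies: " + ", ".join(problems))
    else:
        print("All listed dependencies were found.")
```

Run it inside the activated conda environment. Note that this check only confirms that a module named biblib is importable; it cannot tell the fork apart from the unrelated PyPI package of the same name.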

Alternatively, you can use docker, with instructions provided below.

docker setup

  1. Install and start the docker client on your computer.
  2. From the repository directory, run the following command in your terminal:
    docker-compose run --build -it docx-converter
    
  3. This will open a terminal session with all the necessary dependencies already installed. You can follow the remaining instructions in this terminal.
  4. After completing the conversions, exit this terminal session and run the following to shut down the docker service:
    docker-compose down
    

general steps for the conversion process

  1. Check the docx/odt file before running pandoc to ensure that template formatting is generally intact. If you run into trouble later, incorrect template formatting is probably at least part of the reason. There are some common errors we encounter; a checklist for avoiding them is below:

    • 2nd and 3rd language abstract headers must be in the format "Second language abstract: Resumen (Spanish)"
    • headers must use header styles (and other text should not use header styles)
    • authors must be listed on one line, with affiliations immediately after the authors, one affiliation per line
    • references or bibliography must be the last section of the document
    • author contributions are given as CRediT statements at the beginning, and there is no separate Author Contributions Statement elsewhere
  2. convert article file to latex

    • docx: pandoc file.docx -f docx -t latex --wrap=none -s -o file_pandoc.tex
    • odt: pandoc file.odt -f odt -t latex --wrap=none -s -o file_pandoc.tex
  3. convert text bibliography into a .bib file by either:

    • running the anystyle gem locally: copy the text of the references into a plain text file (e.g. refs.txt) with one reference per line, and then run anystyle -f bib parse refs.txt . (the trailing . is the output directory) to generate the file refs.bib (filename will be the same as the input, with suffix replaced by .bib)
      • note that anystyle will throw a processing error if the output file (default refs.bib) already exists
    • copy-pasting bibliography from docx/odt into anystyle.io, and copy-pasting the output into a new .bib file
  4. fix the year fields and keys in the anystyle bibtex file and write a new .bib file. The script queries crossref to try to add missing DOIs to references. Interactive prompts let you check the result of each crossref query so that errors are not introduced.

    • python3 -m fix_bibtex --ifile refs.bib --ofile refs_better.bib
  5. parse the pandoc output tex file to a better tex format

    • python3 -m parse_pandoc_file --bibfile refs_better.bib --ifile file_pandoc.tex --ofile file_better.tex
    • bits that can't be parsed automatically will go in junk.tex
  6. make pdf:

    • compiling all the _better files with the publication template:
      • lualatex file_better.tex -> file_better.pdf
      • bibtex file_better.aux
      • lualatex file_better.tex
      • lualatex file_better.tex
        • running at least twice gives inline references a chance to sort themselves out
    • the publication template files are on github.
  7. complete other tasks to fix the latex files:

    • manually set options (breakmath, languages, report, etc) as necessary
    • adjust figure sizes (1- or 2-column), link figure files that the author uploaded separately
    • manually correct any citations that we couldn't parse (most should be marked in red)
    • manually adjust table formatting
    • add extra hyphenation rules for words latex doesn't know if columns are overfull
    • look at junk file and edit/replace any pieces that were not parsed
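
The commands in steps 2 to 6 follow a fixed naming pattern, so they can be generated from the manuscript filename. The sketch below is only an illustration under that assumption: conversion_commands is a hypothetical helper, not something the repository provides, and it assumes the references were saved as refs.txt as in step 3. It prints the commands rather than running them, because fix_bibtex is interactive and each step should be checked before moving on.

```python
import shlex
from pathlib import Path


def conversion_commands(manuscript, refs_txt="refs.txt"):
    """Return the commands for steps 2-6 as argument lists, in order."""
    src = Path(manuscript)
    fmt = src.suffix.lower().lstrip(".")
    if fmt not in ("docx", "odt"):
        raise ValueError(f"expected a .docx or .odt file, got {src.name!r}")

    # File names follow the pattern used in the steps above.
    pandoc_tex = f"{src.stem}_pandoc.tex"
    better_tex = f"{src.stem}_better.tex"
    refs_stem = Path(refs_txt).stem
    refs_bib = f"{refs_stem}.bib"            # written by anystyle
    better_bib = f"{refs_stem}_better.bib"   # written by fix_bibtex

    return [
        ["pandoc", str(src), "-f", fmt, "-t", "latex",
         "--wrap=none", "-s", "-o", pandoc_tex],
        ["anystyle", "-f", "bib", "parse", refs_txt, "."],
        ["python3", "-m", "fix_bibtex", "--ifile", refs_bib,
         "--ofile", better_bib],
        ["python3", "-m", "parse_pandoc_file", "--bibfile", better_bib,
         "--ifile", pandoc_tex, "--ofile", better_tex],
        ["lualatex", better_tex],
        ["bibtex", f"{src.stem}_better.aux"],
        ["lualatex", better_tex],
        ["lualatex", better_tex],
    ]


if __name__ == "__main__":
    # Print the commands for a manuscript called file.docx, one per line.
    for command in conversion_commands("file.docx"):
        print(shlex.join(command))
```

Passing an .odt filename switches the pandoc input format accordingly. The manual checks in steps 1 and 7 are not covered by this sketch.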

TODO:

  • multi-line equations in brackets/parens! These are a problem, they break the paren parsing.
  • print warning for references that start with special characters
  • handle hyphenated author names vs un-hyphenated bib keys
  • check YYYYa/YYYYb citations
  • single refs that aren't matched should be in red if there's something like a year in them
  • clean out enumerate stuff when word messes up the template
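
As a starting point for the TODO item about references that start with special characters, the check could look something like the sketch below. This is an assumption about what that warning might do, not existing code: special_start_warnings is a hypothetical name, and "special character" is taken here to mean any first character that is not a letter or digit.

```python
def special_start_warnings(lines):
    """Return (line_number, text) pairs for references whose first
    non-blank character is not a letter or digit."""
    flagged = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        # Blank lines are skipped; accented letters count as letters.
        if text and not text[0].isalnum():
            flagged.append((number, text))
    return flagged


if __name__ == "__main__":
    # Made-up reference lines, for illustration only.
    sample = [
        "Smith, J. (2020). An example title.",
        "*Jones, A. (2019). Another example title.",
    ]
    for number, text in special_start_warnings(sample):
        print(f"warning: reference on line {number} starts with {text[0]!r}")
```

The function accepts any iterable of lines, so an open refs.txt file handle from step 3 can be passed in directly.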