Upgrade codebase to recent changes in Python ecosystem #25

Open
wants to merge 9 commits into
base: main
15 changes: 8 additions & 7 deletions .github/workflows/ci.yaml
@@ -6,34 +6,35 @@ jobs:
strategy:
max-parallel: 3
matrix:
python-version: ["3.7", "3.8", "3.9", "3.10"]
python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]

steps:
- name: checkout
uses: actions/checkout@v2
uses: actions/checkout@v4

- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v1
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}

- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -U black pytest pytest-cov
python setup.py -q install
python -m pip install --upgrade ".[dev]"

- name: Style Check
if: matrix.python-version == '3.8'
run: |
black --check cdxj_indexer/*
black --check test/*

- name: Test with pytest
run: |
set -e
pytest -v --cov=cdxj_indexer --cov-report=xml
python -m pytest -v --cov=cdxj_indexer --cov-report=xml

- name: Upload coverage to Codecov
uses: codecov/codecov-action@v1
if: matrix.python-version == '3.8'
uses: codecov/codecov-action@v4
with:
verbose: true
6 changes: 5 additions & 1 deletion README.rst
@@ -4,7 +4,7 @@ CDXJ Indexer
A command-line tool for generating CDXJ (and CDX) indexes from WARC and ARC files.
The indexer is a new tool redesigned for fast and flexible indexing. (Based on the indexing functionality from `pywb <https://github.com/ikreymer/pywb>`_)

Install with ``pip install cdxj-indexer`` or install locally with ``python setup.py install``
Install with ``pip install cdxj-indexer``, or install locally with ``pip install .``. For development, ``pip install -e ".[dev]"`` installs the package in editable mode together with the dev dependencies (black, pytest, pytest-cov).


The indexer supports classic CDX index format as well as the more flexible CDXJ. With CDXJ, the indexer supports custom fields and ``request`` record access for WARC files. See the examples below and the command line ``-h`` option for latest features. (This is a work in progress).
@@ -42,5 +42,9 @@ More advanced use cases: add additional http headers as fields. ``http:`` prefix
The CDXJ Indexer extends the ``Indexer`` functionality in `warcio <https://github.com/webrecorder/warcio>`_ and should be flexible to extend.


Contributing
~~~~~~~~~~~~~~~~~~~~

Run tests with ``python -m pytest -v --cov=cdxj_indexer --cov-report term-missing``

To build the sdist and wheel, first install the ``build`` package with ``python -m pip install build``, then run ``python -m build --sdist --wheel``.
2 changes: 2 additions & 0 deletions cdxj_indexer/__init__.py
@@ -1,3 +1,5 @@
from cdxj_indexer.main import CDXJIndexer, iter_file_or_dir
from cdxj_indexer.postquery import append_method_query_from_req_resp
from cdxj_indexer.bufferiter import buffering_record_iter

__version__ = "1.5.0-dev0"
30 changes: 11 additions & 19 deletions cdxj_indexer/postquery.py
@@ -1,3 +1,4 @@
from multipart import MultipartParser
from warcio.utils import to_native_str

from urllib.parse import unquote_plus, urlencode
@@ -6,12 +7,13 @@
from cdxj_indexer.amf import amf_parse

import base64
import cgi
import json
import sys
import re

MAX_QUERY_LENGTH = 4096


# ============================================================================
def append_method_query_from_req_resp(req, resp):
len_ = req.http_headers.get_header("Content-Length")
@@ -93,27 +95,17 @@ def handle_binary(query_data):
query = handle_binary(query_data)

elif mime.startswith("multipart/"):
env = {
"REQUEST_METHOD": "POST",
"CONTENT_TYPE": mime,
"CONTENT_LENGTH": len(query_data),
}

args = dict(fp=BytesIO(query_data), environ=env, keep_blank_values=True)

args["encoding"] = "utf-8"

try:
data = cgi.FieldStorage(**args)
except ValueError:
# Content-Type multipart/form-data may lack "boundary" info
query = handle_binary(query_data)
else:
if boundary_match := re.match(r".*boundary=(\w*?)(?:\s|$|;).*", mime):
data = MultipartParser(
stream=BytesIO(query_data), boundary=boundary_match[1], charset="utf-8"
)
values = []
for item in data.list:
for item in data.parts():
values.append((item.name, item.value))

query = urlencode(values, True)
else:
# Content-Type multipart/form-data may lack "boundary" info
query = handle_binary(query_data)

elif mime.startswith("application/json"):
try:
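
Review note on the boundary match in the multipart branch: `\w` does not match `-`, so the pattern `.*boundary=(\w*?)(?:\s|$|;).*` finds no match when the boundary contains hyphens (browser-generated values such as `----WebKitFormBoundary...` usually do), and those requests fall through to `handle_binary`. Below is a minimal sketch of a more permissive extraction, assuming the standard library's `email.message.Message` parameter parsing is acceptable here. The helper name `extract_boundary` and the sample header values are hypothetical and not part of this PR.

```python
import re
from email.message import Message

# Pattern copied from the diff, kept only to show where it stops matching.
PR_BOUNDARY_RE = re.compile(r".*boundary=(\w*?)(?:\s|$|;).*")


def extract_boundary(mime):
    """Return the multipart boundary from a Content-Type value, or None.

    Hypothetical helper: it delegates parameter parsing (quoting,
    hyphens, extra parameters) to the stdlib email package.
    """
    msg = Message()
    msg["Content-Type"] = mime
    return msg.get_boundary()


# Sample Content-Type values (illustrative, not taken from the test data).
simple = "multipart/form-data; boundary=abc123"
hyphenated = "multipart/form-data; boundary=----WebKitFormBoundaryABC"
missing = "multipart/form-data"

simple_regex_result = PR_BOUNDARY_RE.match(simple)
hyphen_regex_result = PR_BOUNDARY_RE.match(hyphenated)

simple_boundary = extract_boundary(simple)
hyphen_boundary = extract_boundary(hyphenated)
missing_boundary = extract_boundary(missing)
```

If a helper like this were adopted, the existing `else` branch could stay as it is: a `None` result would still route the payload to `handle_binary`.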
51 changes: 51 additions & 0 deletions pyproject.toml
@@ -0,0 +1,51 @@
[project]
name = "cdxj_indexer"
description = "CDXJ Indexer for WARC and ARC files"
readme = "README.rst"
authors = [
{ name = "Ilya Kreymer", email = "[email protected]" }
]
license = { text = "Apache 2.0" }
dynamic = ["version"]
dependencies = [
'warcio',
'surt',
'py3amf',
'multipart'
]
classifiers = [
"Development Status :: 4 - Beta",
"Environment :: Web Environment",
"License :: OSI Approved :: Apache Software License",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
"Topic :: Software Development :: Libraries :: Python Modules",
"Topic :: Utilities",
]

[project.optional-dependencies]
lint = [
"black",
]
test = [
"pytest",
"pytest-cov",
]
dev = [
"cdxj_indexer[lint]",
"cdxj_indexer[test]",
]

[project.scripts]
cdxj-indexer = "cdxj_indexer.main:main"

[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

[tool.setuptools.dynamic]
version = {attr = "cdxj_indexer.__version__"}
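
A note on the dynamic version: the setuptools documentation describes `attr` as first trying to read the attribute statically and only importing the module as a fallback. That matters here because `cdxj_indexer/__init__.py` imports from `cdxj_indexer.main` before `__version__` is assigned, and those imports may be unavailable in an isolated build environment. The sketch below illustrates the static-read idea with the standard `ast` module. It is not setuptools' actual code, and `read_static_version` is a hypothetical name.

```python
import ast


def read_static_version(source, attr="__version__"):
    """Return the literal assigned to `attr` at module top level, or None.

    The source is parsed, never executed, so import statements in it
    are not run.
    """
    tree = ast.parse(source)
    for node in tree.body:
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id == attr:
                    return ast.literal_eval(node.value)
    return None


# Abbreviated copy of the __init__.py content shown in this diff.
init_source = '''
from cdxj_indexer.main import CDXJIndexer, iter_file_or_dir

__version__ = "1.5.0-dev0"
'''

version = read_static_version(init_source)
missing = read_static_version("x = 1\n")
```

Keeping `__version__` a plain string literal, as the diff does, is what allows this kind of static read to succeed.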
78 changes: 0 additions & 78 deletions setup.py

This file was deleted.

1 change: 0 additions & 1 deletion test/test_indexer.py
@@ -10,7 +10,6 @@

from cdxj_indexer.main import write_cdx_index, main, CDXJIndexer

import pkg_resources

TEST_DIR = os.path.join(os.path.dirname(os.path.realpath(__file__)), "data")

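
The `pkg_resources` import removed above appears to have been unused, since the file changes nowhere else. `pkg_resources` is deprecated in setuptools. If the tests ever need the installed package version, the standard library's `importlib.metadata` (available from Python 3.8, the lowest version in the CI matrix) is the usual replacement. A minimal sketch, assuming the distribution name is passed in by the caller. The helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string for dist_name, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# A distribution name that should not exist in any environment.
absent = installed_version("definitely-not-a-real-distribution-xyz")
```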