GTIN extraction, Python 3.12 support #11

Merged 7 commits on Dec 26, 2023
Changes from 4 commits
42 changes: 28 additions & 14 deletions .flake8
@@ -1,20 +1,34 @@
[flake8]
ignore =
E203, # whitespace before ':'
E501, # line too long
# whitespace before ':'
E203,
# line too long
E501,

D100, # Missing docstring in public module
D101, # Missing docstring in public class
D102, # Missing docstring in public method
D103, # Missing docstring in public function
D104, # Missing docstring in public package
D105, # Missing docstring in magic method
D107, # Missing docstring in __init__
D200, # One-line docstring should fit on one line with quotes
D205, # 1 blank line required between summary line and description
D400, # First line should end with a period
D401, # First line should be in imperative mood
D403, # First word of the first line should be properly capitalized
# Missing docstring in public module
D100,
# Missing docstring in public class
D101,
# Missing docstring in public method
D102,
# Missing docstring in public function
D103,
# Missing docstring in public package
D104,
# Missing docstring in magic method
D105,
# Missing docstring in __init__
D107,
# One-line docstring should fit on one line with quotes
D200,
# 1 blank line required between summary line and description
D205,
# First line should end with a period
D400,
# First line should be in imperative mood
D401,
# First word of the first line should be properly capitalized
D403,

per-file-ignores =
# F401: Ignore "imported but unused" errors in __init__ files, as those
8 changes: 4 additions & 4 deletions .github/workflows/publish.yml
@@ -13,16 +13,16 @@ jobs:
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v4
uses: actions/setup-python@v5
with:
python-version: '3.x'
python-version: '3.12'
- name: Install dependencies
run: |
pip install --upgrade build twine
python -m build
- name: Publish to PyPI
uses: pypa/[email protected].6
uses: pypa/[email protected].11
with:
password: ${{ secrets.PYPI_TOKEN }}
11 changes: 6 additions & 5 deletions .github/workflows/test.yml
@@ -20,11 +20,12 @@ jobs:
- python-version: "3.9"
- python-version: "3.10"
- python-version: "3.11"
- python-version: "3.12"

steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
@@ -43,13 +44,13 @@ jobs:
strategy:
fail-fast: false
matrix:
python-version: ['3.11'] # Keep in sync with .readthedocs.yml
python-version: ['3.12'] # Keep in sync with .readthedocs.yml
tox-job: ["pre-commit", "mypy", "docs", "twinecheck"]

steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
6 changes: 3 additions & 3 deletions .pre-commit-config.yaml
@@ -4,11 +4,11 @@ repos:
- hooks:
- id: black
repo: https://github.com/ambv/black
rev: 23.3.0
rev: 23.12.1
- hooks:
- id: isort
repo: https://github.com/PyCQA/isort
rev: 5.11.5
rev: 5.13.2
- hooks:
- id: flake8
additional_dependencies:
@@ -18,4 +18,4 @@ repos:
- flake8-docstrings
- flake8-string-format
repo: https://github.com/pycqa/flake8
rev: 5.0.4
rev: 6.1.0
2 changes: 1 addition & 1 deletion .readthedocs.yml
@@ -6,7 +6,7 @@ sphinx:
build:
os: ubuntu-22.04
tools:
python: "3.11" # Keep in sync with .github/workflows/tests.yml
python: "3.12" # Keep in sync with .github/workflows/tests.yml

python:
install:
9 changes: 9 additions & 0 deletions docs/intro.rst
@@ -25,6 +25,15 @@ Breadcrumbs

.. autofunction:: zyte_parsers.extract_breadcrumbs

GTIN
----

.. autoclass:: zyte_parsers.Gtin
:members:
:undoc-members:

.. autofunction:: zyte_parsers.extract_gtin

Price
-----

6 changes: 6 additions & 0 deletions pyproject.toml
@@ -19,15 +19,19 @@ classifiers = [
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
"Programming Language :: Python :: Implementation :: CPython",
]
requires-python = ">=3.8"
dependencies = [
"attrs>=21.3.0",
"gtin-validator>=1.0.3",
"html-text",
"lxml",
"parsel",
"price-parser>=0.3.4",
"python-stdnum>=1.19",
"six", # unstated dependency of gtin-validator
"w3lib",
]
dynamic = ["version"]
@@ -47,7 +51,9 @@ multi_line_output = 3

[[tool.mypy.overrides]]
module = [
"gtin.validator.*",
"html_text.*",
"stdnum.*",
]
ignore_missing_imports = true

182 changes: 182 additions & 0 deletions tests/test_gtin.py
@@ -0,0 +1,182 @@
import pytest
from lxml.html import fromstring
from parsel import Selector

from zyte_parsers.gtin import Gtin, extract_gtin, extract_gtin_id, gtin_classification

GTIN_CLASSIFICATION_CASES = [
("978-1-933624-34-1", "isbn13"),
("978-1-933624-76-1", "isbn13"),
(" 978-1-62544-118-8", "isbn13"),
("9781625441775", "isbn13"),
("978-1-62544-167-6", "isbn13"),
("ISBN: 978-1-62544-175-1", "isbn13"),
("978-0-9801023-6-9", "isbn13"),
("9780062315007", "isbn13"),
("978-1-56619-909-4 ", "isbn13"),
# Negative examples for ISBN: last digit changed, a digit appended, or the check digit removed
("978-1-56619-909-5 ", None),
("978-1-56619-909-6 ", None),
("978-1-56619-909-70 ", None),
("978-1-56619-909 ", None),
("0-545-01022-5", "isbn10"),
(" 1 86197 271-7", "isbn10"),
("7350053850019", "gtin13"),
("EAN: 8808993650040", "gtin13"),
("4015600608835", "gtin13"),
("4031778810191", "gtin13"),
("042100005264", "upc"),
# Negative examples by changing last digit in above number
("042100005265", None),
("042100005266", None),
("042100005267", None),
("042100005268", None),
# Test cases for ISSN
("03178471", "issn"),
("0083-2421", "issn"),
("0500-0270", "issn"),
("ISSN: 0500-0270", "issn"),
("1562-6865", "issn"),
("10637710", "issn"),
# Negative examples by changing the last digit of the last number
("10637711", None),
("10637712", None),
("10637713", None),
# Test cases for ISMN
("979-0-65001-268-3", "ismn"),
("9790035236338", "ismn"),
("9790035057292", "ismn"),
# Negative examples for ismn
("979-0-65001-268-4", None),
("979-0-65001-268-5", None),
("979-0-65001-268-6", None),
# Test cases from the annotated data
("978-2-89455-671-9", "isbn13"),
("9780285640856", "isbn13"),
("ISBN-13: 978-1607439677", "isbn13"),
("8590875345921", "gtin13"),
("4001868008067", "gtin13"),
("7897396607622", "gtin13"),
("857392003023", "upc"),
("4250586357265", "gtin13"),
("3661276011335", "gtin13"),
("(9782894556726)", "isbn13"),
(" 0161019884293", "gtin13"),
("8717801048538", "gtin13"),
("9780285640856", "isbn13"),
# Test cases for gtin14
("10614141543219", "gtin14"),
("0001 2345 6000 12", "gtin14"),
("4 07007 1967 072 0", "gtin14"),
("1061-4141-543-219", "gtin14"),
("gtin14: 10614141543219", "gtin14"),
("gtin: 09501101530003", "gtin14"),
("00075678164125", "gtin14"),
# Negative test cases for gtin14
("10614141543218", None),
("10614141543217", None),
("10614141543216", None),
("10614141543215", None),
("00012345600013", None),
("00012345600014", None),
("00012345600015", None),
("00012345600016", None),
# Test cases for gtin8
("40170725", "gtin8"),
(" gtin8 : 12345670", "gtin8"),
("(93123457)", "gtin8"),
("Gtin : 47612341", "gtin8"),
("59012344", "gtin8"),
# Negative test cases for gtin8
("40170726", None),
("40170727", None),
("40170728", None),
("40170729", None),
("40170720", None),
]


@pytest.mark.parametrize(["value", "expected"], GTIN_CLASSIFICATION_CASES)
def test_gtin_classification(value, expected):
assert expected == gtin_classification(value)


GTIN_IDS = [
# Simple cases
("978-1-933624-34-1", "9781933624341"),
("978-1-933624-76-1", "9781933624761"),
(" 978-1-62544-118-8", "9781625441188"),
("ISBN# 9781625441775", "9781625441775"),
("978-1-62544-167-6", "9781625441676"),
("ISBN: 978-1-62544-175-1", "9781625441751"),
("ISBN - 978-0-9801023-6-9", "9780980102369"),
("isbn 9780062315007", "9780062315007"),
("978-1-56619-909-4 ", "9781566199094"),
("EAN# 8808993650040", "8808993650040"),
# More complex examples: the digits alone would pass the GTIN
# checksum, but they are interspersed with non-numeric characters,
# so these values must not be extracted as GTIN IDs.
("HD978193-INT3624341", None),
("TFG-05451-HOP-0225", None),
("1063HP7710", None),
# Cases with the prefix having numeric values (eg. gtin13, isbn10)
("EAN13: 8808993650040", "8808993650040"),
("EAN13#8717801048538", "8717801048538"),
("EAN13 8808993650040", "8808993650040"),
("Isbn13: 978-1-62544-175-1", "9781625441751"),
("ISBN10: 0-545-01022-5", "0545010225"),
("ISBN10-0-545-01022-5", "0545010225"),
("ISBN10 0-545-01022-5", "0545010225"),
("Isbn-10: 0-545-01022-5", "0545010225"),
# Cases where the prefix ends in digits and the code starts with the same
# digits (e.g. a gtin13 prefix with a code starting with 13, like 1334567890125,
# or a gtin14 prefix with a code starting with 14, like 14334567890129)
("EAN131334567890125", "1334567890125"),
("EAN1334567890125", "1334567890125"),
("Gtin13 1334567890125", "1334567890125"),
("Gtin 1334567890125", "1334567890125"),
("Gtin1414334567890129", "14334567890129"),
("Gtin14334567890129", "14334567890129"),
# Example test cases
("TSF8UP-R407-26A44", None),
("978-2-89455-671-9", "9782894556719"),
("9780285640856", "9780285640856"),
("ISBN-13: 978-1607439677", "9781607439677"),
("8590875345921", "8590875345921"),
("(9782894556726)", "9782894556726"),
# Example test cases from the real world product pages
("EAN: 9781101872239", "9781101872239"),
("ISBN: 0722539312", "0722539312"),
("UPC4960759145062", "4960759145062"),
("ISBN13: 9780525555360 ", "9780525555360"),
("ISBN10: 0525555366 ", "0525555366"),
("ISBN-13: 9780525576709", "9780525576709"),
("ISBN-10: 0525576703", "0525576703"),
("UPC: 884116293835", "884116293835"),
("8423490261447 -", "8423490261447"),
]


@pytest.mark.parametrize(["value", "expected"], GTIN_IDS)
def test_extract_gtin_id(value, expected):
assert expected == extract_gtin_id(value)


GTINS = [
("TSF8UP-R407-26A44", None),
("978-1-933624-34-1", Gtin("isbn13", "9781933624341")),
]


@pytest.mark.parametrize(["value", "expected"], GTINS)
def test_extract_gtin(value, expected):
assert expected == extract_gtin(fromstring(f"<p>{value}</p>"))


def test_extract_gtin_types():
value = "978-1-933624-34-1"
expected = Gtin("isbn13", "9781933624341")
assert expected == extract_gtin(value)
assert expected == extract_gtin(fromstring(f"<p>{value}</p>"))
assert expected == extract_gtin(Selector(text=f"<p>{value}</p>"))
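
Many negative cases in tests/test_gtin.py are built by changing only the last digit of a valid code, which breaks its check digit. The sketch below shows the standard GS1 mod-10 check that GTIN-8, UPC (GTIN-12), GTIN-13 (including ISBN-13 and ISMN) and GTIN-14 codes share. It is an illustration only, not the code this PR adds: the function name `has_valid_gs1_check_digit` is hypothetical, and it assumes the input has already been stripped of prefixes, spaces and hyphens.

```python
def has_valid_gs1_check_digit(code: str) -> bool:
    """Return True if `code` is all digits, has a GTIN length,
    and its final digit is a correct GS1 mod-10 check digit.

    Hypothetical helper for illustration; not part of zyte_parsers.
    """
    # Only plain ASCII digits and the four GTIN lengths are accepted.
    if not (code.isascii() and code.isdigit()):
        return False
    if len(code) not in (8, 12, 13, 14):
        return False
    # Walking from the right, the check digit has weight 1 and the
    # weights then alternate 3, 1, 3, ... A code is valid when the
    # weighted sum is a multiple of 10.
    total = sum(
        int(digit) * (3 if index % 2 else 1)
        for index, digit in enumerate(reversed(code))
    )
    return total % 10 == 0


has_valid_gs1_check_digit("042100005264")  # True: the "upc" test case
has_valid_gs1_check_digit("042100005265")  # False: last digit changed
```

Because the weights are anchored at the right-hand end, the same function covers all four lengths without padding the shorter codes with leading zeros.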
10 changes: 5 additions & 5 deletions tox.ini
@@ -19,11 +19,11 @@ commands = pre-commit run --all-files --show-diff-on-failure

[testenv:mypy]
deps =
mypy==1.3.0
mypy==1.8.0
types-attrs==19.1.0
types-lxml==2023.3.28
pytest==7.3.1
commands = mypy --show-error-codes {posargs:zyte_parsers tests}
types-lxml==2023.10.21
pytest==7.4.3
commands = mypy {posargs:zyte_parsers tests}

[testenv:docs]
basepython = python3
@@ -37,7 +37,7 @@ commands =
basepython = python3
deps =
twine==4.0.2
build==0.10.0
build==1.0.3
commands =
python -m build --sdist
twine check dist/*
1 change: 1 addition & 0 deletions zyte_parsers/__init__.py
@@ -3,4 +3,5 @@
from .api import SelectorOrElement
from .brand import extract_brand_name
from .breadcrumbs import Breadcrumb, extract_breadcrumbs
from .gtin import Gtin, extract_gtin
from .price import extract_price
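
The ISBN-10 and ISSN cases in tests/test_gtin.py use a different checksum from the GTIN family: a weighted sum modulo 11, where the last character may be `X` (worth 10). The sketch below illustrates that standard scheme. It is not the code this PR adds: `has_valid_mod11_check` is a hypothetical name, and it assumes separators and prefixes were removed beforehand.

```python
def has_valid_mod11_check(code: str) -> bool:
    """Check an ISBN-10 (10 characters) or ISSN (8 characters).

    Hypothetical helper for illustration; not part of zyte_parsers.
    """
    if len(code) not in (8, 10):
        return False
    body, check = code[:-1], code[-1].upper()
    if not (body.isascii() and body.isdigit()):
        return False
    # The check character is a digit, or X standing for the value 10.
    if check == "X":
        check_value = 10
    elif check.isascii() and check.isdigit():
        check_value = int(check)
    else:
        return False
    # Weights count down from the code length to 2 over the body;
    # the check character itself has weight 1.
    weights = range(len(code), 1, -1)
    total = sum(int(digit) * weight for digit, weight in zip(body, weights))
    return (total + check_value) % 11 == 0


has_valid_mod11_check("0545010225")  # True: the "isbn10" test case, hyphens removed
has_valid_mod11_check("10637711")    # False: ISSN with its last digit changed
```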