broken resource url/file size scripts #1079

Closed
wants to merge 12 commits into from


maxFredenburgh
Contributor

No description provided.

@wardi
Member

wardi commented May 12, 2020

Please use vscode, pycharm or some other editor that lints your python code for you. There are lots of pep8 problems here, like spaces missing around == and =, spaces missing after commas, and unnecessary parentheses around if conditions.
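
As a small illustration of the style points above (not code from this pull request; the `describe` helper is hypothetical), the commented-out lines show the kind of formatting a PEP 8 linter such as pycodestyle or flake8 typically flags, followed by the same logic written the conventional way:

```python
# Flagged by a PEP 8 linter (kept as comments so the file stays clean):
#   def describe(status,size=None):
#       if(status==200 and size!=None):
#           label="ok"

def describe(status, size=None):
    """Return a short label for a fetch result (hypothetical helper)."""
    # Spaces around == and after commas, no parentheses around the condition.
    if status == 200 and size is not None:
        return "ok, {} bytes".format(size)
    return "failed with status {}".format(status)
```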

Commit messages are important too: just 60 characters or so about what the change does, to help make sense of our git logs.

@wardi
Member

wardi commented May 12, 2020

Here's how I think this should work:

  1. script that reads urls on stdin and updates a url database (could be a simple csv file) containing:

    • url (primary key)
    • utc timestamp (may be used for removing stale requests and retrying in the future)
    • server response
    • file size (if successful)
    • file type (if successful)

    if a url is already present in the database this script won't attempt to fetch it again. This allows repeated runs to refresh the url database and prevents requesting the same url multiple times when it is present in many resources.

    this script can run parallel requests with multiple processes or by using async requests.

  2. a script that combines the url database and the od-do-canada json lines file to generate a report on broken links and incorrect file types

  3. a script that combines the url database and the od-do-canada json lines file to update file types and sizes in ckan using ckanapi (this should be python2 so it can update directly from the config file instead of over http + api key)
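
A minimal sketch of the url database update in step 1, assuming a csv file and Python 3. The column names, the `load_url_db` and `update_url_db` helpers and the caller-supplied `fetch` function are all hypothetical; the request itself is left to `fetch` so that it could be backed by multiple processes or async requests:

```python
import csv
import os
from datetime import datetime, timezone

# Hypothetical column names for the fields listed in step 1.
FIELDS = ["url", "timestamp", "response", "size", "type"]


def load_url_db(path):
    """Return {url: row} from the csv file, or {} if it does not exist yet."""
    if not os.path.exists(path):
        return {}
    with open(path, newline="") as f:
        return {row["url"]: row for row in csv.DictReader(f)}


def update_url_db(path, urls, fetch):
    """Fetch every url not already in the database and append its result.

    fetch(url) is supplied by the caller and returns
    (response, size, type); size and type may be None on failure.
    Returns the number of rows added.
    """
    known = load_url_db(path)
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    added = 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if needs_header:
            writer.writeheader()
        for url in urls:
            if url in known:
                continue  # already recorded, do not request it again
            response, size, file_type = fetch(url)
            row = {
                "url": url,
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "response": response,
                "size": "" if size is None else size,
                "type": file_type or "",
            }
            writer.writerow(row)
            known[url] = row
            added += 1
    return added
```

Because rows are only appended for urls that are not yet in the file, running the script again with the same input makes no further requests.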

We can feed script #1 with a command like zcat od-do-canada.jl.gz | jq .resources[].url so there's no need for another script to pull out urls.
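
One detail when reading that pipeline on stdin: without the -r option, jq prints each url as a JSON string in double quotes, and prints null where a resource has no url. A minimal Python 3 sketch of the reading side, where `read_urls` is a hypothetical helper that accepts both quoted and raw lines:

```python
import json


def read_urls(lines):
    """Yield each distinct url once from jq output, one value per line."""
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            value = json.loads(line)  # default jq output: a JSON string
        except ValueError:
            value = line  # raw text, e.g. from jq -r
        if not isinstance(value, str) or value in seen:
            continue  # skips null urls and repeats
        seen.add(value)
        yield value
```

Script #1 could then loop over `read_urls(sys.stdin)`, so a url shared by many resources is only considered once per run.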

@wardi wardi changed the title from "initial commit" to "broken resource url/file size scripts" May 21, 2020
ckanext/canada/commands.py Outdated
@wardi
Member

wardi commented Jun 2, 2020

looking good, just a few comments. Please assume each comment applies to all files below it (I didn't bother repeating at each location)

@wardi
Member

wardi commented Jun 10, 2020

the content-type returned from a web server is not reliable because it gets ignored by users. Can you collect the file type from the Content-Disposition filename=... header instead? https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Disposition
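
A minimal sketch of reading the file type from the Content-Disposition filename, assuming Python 3 and a plain mapping of response headers. `file_type_from_headers` is a hypothetical helper; it relies on the standard library's `email.message.Message` to parse the header parameters, and returns None when the header or the filename is missing:

```python
import os
from email.message import Message


def file_type_from_headers(headers):
    """Return the upper-case extension from Content-Disposition, or None."""
    value = None
    for name, header_value in headers.items():
        # Header names are case-insensitive, a plain dict is not.
        if name.lower() == "content-disposition":
            value = header_value
            break
    if value is None:
        return None
    msg = Message()
    msg["Content-Disposition"] = value
    filename = msg.get_filename()  # parses the filename=... parameter
    if not filename:
        return None
    extension = os.path.splitext(filename)[1].lstrip(".")
    return extension.upper() or None
```

The upper-cased extension will not always match a format name in the schema exactly, so a mapping step would likely still be needed.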

'MOV', 'MPEG', 'MPEG-1', 'MP3', 'MXD', 'NetCDF', 'ODP', 'ODS', 'ODT', 'PDF', 'PDF/A-1', 'PDF/A-2',
'PDF/UA', 'PNG', 'PPT', 'PPTX', 'RDF', 'TTL', 'NT', 'RDFa', 'RSS', 'RTF', 'SAR', 'SAV', 'SEGY', 'SHP',
'SQL', 'SQLITE3', 'SQLITE', 'SVG', 'TIFF', 'TRIG', 'TRIX', 'TFW', 'TXT', 'VPF', 'WAV', 'WCS', 'WFS',
'WMS', 'WMTS', 'WMV', 'WPS', 'XML', 'XLS', 'XLSM', 'XLSX', 'ZIP', 'other']
Member

this duplicates a list from the metadata schema. Is it necessary to check against this? If so can you reference the one in ckanext/canada/schemas/presets.yaml instead?

@maxFredenburgh
Contributor Author

maxFredenburgh commented Jun 11, 2020

> the content-type returned from a web server is not reliable because it gets ignored by users. Can you collect the file type from the Content-Disposition filename=... header instead? https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Disposition

@wardi The headers returned from each request do not contain content-disposition. Here's an example of what is returned:
{'Content-Length': '968', 'Content-Language': 'en', 'X-AspNet-Version': '4.0.30319', 'Set-Cookie': 'BIGipServer~NRCAN-FGP-PROD~NRCAN-FGP-VIP1.app~NRCAN-FGP-VIP1_pool=rd1o00000000000000000000ffffa72883cco80; path=/; Httponly; Secure', 'X-Powered-By': 'ASP.NET', 'Server': 'Microsoft-IIS/8.5, ', 'Connection': 'close', 'Cache-Control': 'private', 'Date': 'Thu, 11 Jun 2020 16:27:40 GMT', 'Content-Type': 'text/html;charset=utf-8'}

@JVickery-TBS JVickery-TBS deleted the resource-management branch January 8, 2025 20:16

3 participants