Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generic XML adapter #389

Closed
Tracked by #78
blackerby opened this issue Oct 17, 2023 · 12 comments · Fixed by #391
Closed
Tracked by #78

Generic XML adapter #389

blackerby opened this issue Oct 17, 2023 · 12 comments · Fixed by #391

Comments

@blackerby
Copy link

Is your feature request related to a problem? Please describe.
Currently, I am not able to query XML files using Shillelagh. I have conducted Google searches, searches of project documentation, and a search of the Apache Superset Slack but have not found an off-the-shelf solution.

Describe the solution you'd like
I would like to provide source data for an Apache Superset chart from an XML file not on a local file system but available from a URL.

Describe alternatives you've considered
The only other thing I can think of would be to read the XML file into a Pandas DataFrame and query that (cf. #388 and related Slack conversation), which I admit I have not yet tried.

Additional context
The issue and Slack conversation referenced above established a need for a custom adapter. I expect my issue could be solved by a customer adapter, too. I am happy to put in the learning and work to develop such an adapter, but I want to make sure I'm not missing something obvious before I head down that path.

@blackerby blackerby changed the title XML adapter Generic XML adapter Oct 18, 2023
@betodealmeida
Copy link
Owner

@blackerby do you have an example of the XML response you'd want to query?

The generic JSON adapter allows the user to define a JSONPath expression; it shouldn't be hard to duplicate it and have a XML adapter using XPath.

@blackerby
Copy link
Author

I suspected duplicating the generic JSON adapter for XML would be the way to go. Right now I'm working on a different custom adapter (your tutorial is awesome by the way, thanks so much for the good docs), but when that's in good shape I can try to get started on this or pitch in if someone else wants to start work on this.

I've pasted the response to a GET request to the following URL below:

https://api.congress.gov/v3/bill/118?format=xml&offset=0&limit=2&api_key=<API_KEY_HERE>

Note: this specific API offers a JSON response option, but I've got other use cases that only offer XML, so this is just offered for the sake of example.

<?xml version="1.0" encoding="utf-8"?>
<api-root>
   <bills>
      <bill>
         <congress>
            118
         </congress>
         <type>
            SRES
         </type>
         <originChamber>
            Senate
         </originChamber>
         <originChamberCode>
            S
         </originChamberCode>
         <number>
            416
         </number>
         <url>
            https://api.congress.gov/v3/bill/118/sres/416?format=xml
         </url>
         <title>
            A resolution to authorize testimony and representation in United States v. Sullivan.
         </title>
         <updateDateIncludingText>
            2023-10-19T12:43:41Z
         </updateDateIncludingText>
         <latestAction>
            <actionDate>
               2023-10-18
            </actionDate>
            <text>
               Submitted in the Senate, considered, and agreed to without amendment and with a preamble by Unanimous Consent. (consideration: CR S5082-5083; text: CR S5091)
            </text>
         </latestAction>
         <updateDate>
            2023-10-19
         </updateDate>
      </bill>
      <bill>
         <congress>
            118
         </congress>
         <type>
            SRES
         </type>
         <originChamber>
            Senate
         </originChamber>
         <originChamberCode>
            S
         </originChamberCode>
         <number>
            415
         </number>
         <url>
            https://api.congress.gov/v3/bill/118/sres/415?format=xml
         </url>
         <title>
            A resolution to authorize testimony and representation in United States v. Samsel.
         </title>
         <updateDateIncludingText>
            2023-10-19T12:43:40Z
         </updateDateIncludingText>
         <latestAction>
            <actionDate>
               2023-10-18
            </actionDate>
            <text>
               Submitted in the Senate, considered, and agreed to without amendment and with a preamble by Unanimous Consent. (consideration: CR S5082-5083; text: CR S5091)
            </text>
         </latestAction>
         <updateDate>
            2023-10-19
         </updateDate>
      </bill>
   </bills>
   <pagination>
      <count>
         10503
      </count>
      <next>
         https://api.congress.gov/v3/bill/118?offset=2&amp;limit=2&amp;format=xml
      </next>
   </pagination>
   <request>
      <congress>
         118
      </congress>
      <contentType>
         application/xml
      </contentType>
      <format>
         xml
      </format>
   </request>
</api-root>

@betodealmeida
Copy link
Owner

Thanks! I started working on the XML adapter today, I'll test it against https://api.congress.gov/.

@betodealmeida betodealmeida self-assigned this Oct 20, 2023
@betodealmeida
Copy link
Owner

betodealmeida commented Oct 20, 2023

@blackerby how do you see the format of the response? For the endpoint above, eg, should each row be a string with the XML:

sql> SELECT * FROM "https://api.congress.gov/v3/bill/118#/api-root/bills/bill" LIMIT 1;
bill
----
<bill>
    <congress>118</congress>
    <type>SRES</type>
    ...
</bill>
(1 row in 0.00s) 

Or should it return a JSON representation of the data? (so it can be processed with the JSON functions in SQLite) Something like:

sql> SELECT * FROM "https://api.congress.gov/v3/bill/118#/api-root/bills/bill" LIMIT 1;
bill
----
{"congress": 118, "type": "SRES", ...}
(1 row in 0.00s) 

Even better, we could explode the payload to columns and have:

sql> SELECT * FROM "https://api.congress.gov/v3/bill/118#/api-root/bills/bill" LIMIT 1;
  congress  type    ...  latestAction
----------  ------       --------------------------------------------------------
       118  SRES         {"actionDate": "2023-10-18", "text": "Submitted in ..."}
(1 row in 0.00s) 

@betodealmeida
Copy link
Owner

Are XML attributes important? Or do we care more about the text?

@blackerby
Copy link
Author

To your first question, I think the third option (exploding the payload to columns) is the way to go. Then columns with JSON in them (like the latestAction column) can be further processed with SQLite's JSON functions.

To your second question about XML attributes, I will have use cases in which attributes are important, but they may be specific enough that they require a custom adapter, e.g., for MODS. Is it easy enough to incorporate attribute processing syntax into the XPath URL fragment?

@betodealmeida
Copy link
Owner

To your second question about XML attributes, I will have use cases in which attributes are important, but they may be specific enough that they require a custom adapter, e.g., for MODS. Is it easy enough to incorporate attribute processing syntax into the XPath URL fragment?

I'm using xml.etree.ElementTree from the Python standard library to support XPath, and it supports attributes, so it should be fine. I'm more concerned about the process of converting XML to JSON, eg:

<foo bar="baz">hi</foo>

What should we map that to?

  • {"foo": "hi"} is what I have currently working.
  • {"foo": {"text": "hi", "@bar": "baz"}} is a common format, but seems too verbose.
  • {"foo": "hi", "foo:bar": "baz"} is a more concise option.

@blackerby
Copy link
Author

I hear you on the verbosity concern, but the second option ({"foo": {"text": "hi", "@bar": "baz"}}) makes more intuitive sense to me.

@cwegener
Copy link
Collaborator

I'm using xml.etree.ElementTree from the Python standard library to support XPath, and it supports attributes, so it should be fine. I'm more concerned about the process of converting XML to JSON, eg:

I think defusedxml is the sane drop-in replacement for the built-in xml.etree package.

https://github.com/tiran/defusedxml

@betodealmeida betodealmeida linked a pull request Oct 21, 2023 that will close this issue
@betodealmeida
Copy link
Owner

@cwegener thanks for the tip on defusedxml!

@betodealmeida
Copy link
Owner

I hear you on the verbosity concern, but the second option ({"foo": {"text": "hi", "@bar": "baz"}}) makes more intuitive sense to me.

@blackerby, I released 1.2.8 with a simple generic XML adapter that only cares about text. If we need we can later implement a different algorithm that exposes XML attributes, and have a way of specifying which one should be used.

@blackerby
Copy link
Author

That is great, thanks @betodealmeida. I'll play with the new release this week and open a new issue if/when access to XML attributes becomes a challenge. I'm also looking forward to digging into the commit that added the XML adapter -- seems like a great way to learn.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants