Generic XML adapter #389

blackerby · 2023-10-17T17:56:18Z

Is your feature request related to a problem? Please describe.
Currently, I am not able to query XML files using Shillelagh. I have conducted Google searches, searches of project documentation, and a search of the Apache Superset Slack but have not found an off-the-shelf solution.

Describe the solution you'd like
I would like to provide source data for an Apache Superset chart from an XML file not on a local file system but available from a URL.

Describe alternatives you've considered
The only other thing I can think of would be to read the XML file into a Pandas DataFrame and query that (cf. #388 and related Slack conversation), which I admit I have not yet tried.

Additional context
The issue and Slack conversation referenced above established a need for a custom adapter. I expect my issue could be solved by a customer adapter, too. I am happy to put in the learning and work to develop such an adapter, but I want to make sure I'm not missing something obvious before I head down that path.

betodealmeida · 2023-10-19T14:56:33Z

@blackerby do you have an example of the XML response you'd want to query?

The generic JSON adapter allows the user to define a JSONPath expression; it shouldn't be hard to duplicate it and have a XML adapter using XPath.

blackerby · 2023-10-20T00:43:39Z

I suspected duplicating the generic JSON adapter for XML would be the way to go. Right now I'm working on a different custom adapter (your tutorial is awesome by the way, thanks so much for the good docs), but when that's in good shape I can try to get started on this or pitch in if someone else wants to start work on this.

I've pasted the response to a GET request to the following URL below:

https://api.congress.gov/v3/bill/118?format=xml&offset=0&limit=2&api_key=<API_KEY_HERE>

Note: this specific API offers a JSON response option, but I've got other use cases that only offer XML, so this is just offered for the sake of example.

<?xml version="1.0" encoding="utf-8"?>
<api-root>
   <bills>
      <bill>
         <congress>
            118
         </congress>
         <type>
            SRES
         </type>
         <originChamber>
            Senate
         </originChamber>
         <originChamberCode>
            S
         </originChamberCode>
         <number>
            416
         </number>
         <url>
            https://api.congress.gov/v3/bill/118/sres/416?format=xml
         </url>
         <title>
            A resolution to authorize testimony and representation in United States v. Sullivan.
         </title>
         <updateDateIncludingText>
            2023-10-19T12:43:41Z
         </updateDateIncludingText>
         <latestAction>
            <actionDate>
               2023-10-18
            </actionDate>
            <text>
               Submitted in the Senate, considered, and agreed to without amendment and with a preamble by Unanimous Consent. (consideration: CR S5082-5083; text: CR S5091)
            </text>
         </latestAction>
         <updateDate>
            2023-10-19
         </updateDate>
      </bill>
      <bill>
         <congress>
            118
         </congress>
         <type>
            SRES
         </type>
         <originChamber>
            Senate
         </originChamber>
         <originChamberCode>
            S
         </originChamberCode>
         <number>
            415
         </number>
         <url>
            https://api.congress.gov/v3/bill/118/sres/415?format=xml
         </url>
         <title>
            A resolution to authorize testimony and representation in United States v. Samsel.
         </title>
         <updateDateIncludingText>
            2023-10-19T12:43:40Z
         </updateDateIncludingText>
         <latestAction>
            <actionDate>
               2023-10-18
            </actionDate>
            <text>
               Submitted in the Senate, considered, and agreed to without amendment and with a preamble by Unanimous Consent. (consideration: CR S5082-5083; text: CR S5091)
            </text>
         </latestAction>
         <updateDate>
            2023-10-19
         </updateDate>
      </bill>
   </bills>
   <pagination>
      <count>
         10503
      </count>
      <next>
         https://api.congress.gov/v3/bill/118?offset=2&amp;limit=2&amp;format=xml
      </next>
   </pagination>
   <request>
      <congress>
         118
      </congress>
      <contentType>
         application/xml
      </contentType>
      <format>
         xml
      </format>
   </request>
</api-root>

betodealmeida · 2023-10-20T04:06:23Z

Thanks! I started working on the XML adapter today, I'll test it against https://api.congress.gov/.

betodealmeida · 2023-10-20T16:53:23Z

@blackerby how do you see the format of the response? For the endpoint above, eg, should each row be a string with the XML:

sql> SELECT * FROM "https://api.congress.gov/v3/bill/118#/api-root/bills/bill" LIMIT 1;
bill
----
<bill>
    <congress>118</congress>
    <type>SRES</type>
    ...
</bill>
(1 row in 0.00s)

Or should it return a JSON representation of the data? (so it can be processed with the JSON functions in SQLite) Something like:

sql> SELECT * FROM "https://api.congress.gov/v3/bill/118#/api-root/bills/bill" LIMIT 1;
bill
----
{"congress": 118, "type": "SRES", ...}
(1 row in 0.00s)

Even better, we could explode the payload to columns and have:

sql> SELECT * FROM "https://api.congress.gov/v3/bill/118#/api-root/bills/bill" LIMIT 1;
  congress  type    ...  latestAction
----------  ------       --------------------------------------------------------
       118  SRES         {"actionDate": "2023-10-18", "text": "Submitted in ..."}
(1 row in 0.00s)

betodealmeida · 2023-10-20T17:16:52Z

Are XML attributes important? Or do we care more about the text?

blackerby · 2023-10-20T20:35:18Z

To your first question, I think the third option (exploding the payload to columns) is the way to go. Then columns with JSON in them (like the latestAction column) can be further processed with SQLite's JSON functions.

To your second question about XML attributes, I will have use cases in which attributes are important, but they may be specific enough that they require a custom adapter, e.g., for MODS. Is it easy enough to incorporate attribute processing syntax into the XPath URL fragment?

betodealmeida · 2023-10-20T21:09:27Z

To your second question about XML attributes, I will have use cases in which attributes are important, but they may be specific enough that they require a custom adapter, e.g., for MODS. Is it easy enough to incorporate attribute processing syntax into the XPath URL fragment?

I'm using xml.etree.ElementTree from the Python standard library to support XPath, and it supports attributes, so it should be fine. I'm more concerned about the process of converting XML to JSON, eg:

<foo bar="baz">hi</foo>

What should we map that to?

{"foo": "hi"} is what I have currently working.
{"foo": {"text": "hi", "@bar": "baz"}} is a common format, but seems too verbose.
{"foo": "hi", "foo:bar": "baz"} is a more concise option.

blackerby · 2023-10-20T22:42:14Z

I hear you on the verbosity concern, but the second option ({"foo": {"text": "hi", "@bar": "baz"}}) makes more intuitive sense to me.

cwegener · 2023-10-21T05:05:07Z

I'm using xml.etree.ElementTree from the Python standard library to support XPath, and it supports attributes, so it should be fine. I'm more concerned about the process of converting XML to JSON, eg:

I think defusedxml is the sane drop-in replacement for the built-in xml.etree package.

https://github.com/tiran/defusedxml

betodealmeida · 2023-10-22T12:33:35Z

@cwegener thanks for the tip on defusedxml!

betodealmeida · 2023-10-22T12:34:41Z

I hear you on the verbosity concern, but the second option ({"foo": {"text": "hi", "@bar": "baz"}}) makes more intuitive sense to me.

@blackerby, I released 1.2.8 with a simple generic XML adapter that only cares about text. If we need we can later implement a different algorithm that exposes XML attributes, and have a way of specifying which one should be used.

blackerby · 2023-10-22T12:39:04Z

That is great, thanks @betodealmeida. I'll play with the new release this week and open a new issue if/when access to XML attributes becomes a challenge. I'm also looking forward to digging into the commit that added the XML adapter -- seems like a great way to learn.

blackerby changed the title ~~XML adapter~~ Generic XML adapter Oct 18, 2023

betodealmeida added enhancement help wanted good first issue labels Oct 19, 2023

betodealmeida mentioned this issue Oct 19, 2023

More adapters #78

Open

14 tasks

betodealmeida self-assigned this Oct 20, 2023

betodealmeida mentioned this issue Oct 21, 2023

feat: generic XML adapter #391

Merged

betodealmeida linked a pull request Oct 21, 2023 that will close this issue

feat: generic XML adapter #391

Merged

betodealmeida closed this as completed in #391 Oct 21, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Generic XML adapter #389

Generic XML adapter #389

blackerby commented Oct 17, 2023

betodealmeida commented Oct 19, 2023

blackerby commented Oct 20, 2023

betodealmeida commented Oct 20, 2023

betodealmeida commented Oct 20, 2023 •

edited

Loading

betodealmeida commented Oct 20, 2023

blackerby commented Oct 20, 2023

betodealmeida commented Oct 20, 2023

blackerby commented Oct 20, 2023

cwegener commented Oct 21, 2023

betodealmeida commented Oct 22, 2023

betodealmeida commented Oct 22, 2023

blackerby commented Oct 22, 2023

Generic XML adapter #389

Generic XML adapter #389

Comments

blackerby commented Oct 17, 2023

betodealmeida commented Oct 19, 2023

blackerby commented Oct 20, 2023

betodealmeida commented Oct 20, 2023

betodealmeida commented Oct 20, 2023 • edited Loading

betodealmeida commented Oct 20, 2023

blackerby commented Oct 20, 2023

betodealmeida commented Oct 20, 2023

blackerby commented Oct 20, 2023

cwegener commented Oct 21, 2023

betodealmeida commented Oct 22, 2023

betodealmeida commented Oct 22, 2023

blackerby commented Oct 22, 2023

betodealmeida commented Oct 20, 2023 •

edited

Loading