post on Markdown #751

Draft
wants to merge 6 commits into
base: main

@maelle maelle commented Apr 9, 2024


Programmatically parsing and editing R code is out of the scope of this post, but closely related enough to throw in a few tips.
As with Markdown, you might need to use regular expressions but try not to.
You can parse the code to XML using base R parsing and [xmlparsedata](https://r-lib.github.io/xmlparsedata/), then you manipulate the XML with [XPath](https://masalmon.eu/2022/04/08/xml-xpath/).
Member Author

and https://masalmon.eu/2024/05/15/refactoring-xml/ instead of my general xpath post
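
For readers of this thread, the approach in the quoted paragraph could look roughly like the sketch below. It assumes the xmlparsedata and xml2 packages are installed; the code string and the XPath query are made up for illustration and are not taken from the post.

```r
library(xmlparsedata)
library(xml2)

# Hypothetical snippet of R code to inspect.
code <- "x <- mean(c(1, 2, 3))\nprint(x)"

# Parse with base R (keeping source references), then convert the parse data to XML.
parsed <- parse(text = code, keep.source = TRUE)
xml <- read_xml(xml_parse_data(parsed))

# Query the XML with XPath: which functions are called in the code?
calls <- xml_text(xml_find_all(xml, "//SYMBOL_FUNCTION_CALL"))
calls
```

The same XML can be queried for other token types (for instance assignments or string constants) without writing any regular expression against the code itself.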

@zkamvar zkamvar left a comment

I think it looks good so far! I made some modifications in #788.



Furthermore there are different _flavors_ of Markdown, and some supplementary features added depending on what your Markdown files will be used by, like emoji written so: `:grin:`.
Member

Hugo will render this emoji. Did you want to show the shortcode?

In #788, I add a bit more context, linking to the engines that generate markdown and the extended syntax guide.

I was very tempted to include a comment somewhere about the horrors of Jekyll, but that's for another day.

Member Author

I will need to escape this then 😅

I'm curious about the Jekyll horrors even if out of scope for this post. Recently I was not even able to install it so I'm spared further horrors locally 😂

Member

The horrors of kramdown are many, but I think the biggest annoyance is the use of postfix tags, as shown in this nearly 180-line function I had to write to parse nested block quotes.

This and the fact that the parser was really forgiving for block quote definitions in that you could do things like:

> ## title
text that belongs in the block quote without a leading carrot
and that's totally cool, I guess. 
> other text that's still in the same block quote

text outside of block quote

And this was a hack that people did to get knitr code to run inside of block quotes without throwing warnings/errors because of the "unrecognised leading symbol '>'".

Member Author

🙀 🤯

content/blog/2024-04-16-markdown-programmatic/index.Rmd (outdated)

The [tinkr package](http://docs.ropensci.org/tinkr/) maintained by Zhian Kamvar parses Markdown to XML using Commonmark, and writes it back to Markdown using XSLT. The YAML metadata is available as a string.
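
A minimal sketch of the tinkr workflow described above, assuming the tinkr and xml2 packages are installed. The input file is a made-up temporary file, and the XPath query is only one possible way to inspect the document.

```r
library(tinkr)
library(xml2)

# Hypothetical input file, written to a temporary location.
path <- tempfile(fileext = ".md")
writeLines(c("# Title", "", "* first item", "* second item"), path)

# Parse the Markdown file; the body is exposed as an xml2 document.
doc <- yarn$new(path)

# Query the body with XPath, using the namespace prefix provided by tinkr.
headings <- xml_find_all(doc$body, ".//md:heading", ns = md_ns())
xml_text(headings)

# Write the document back to Markdown so it can be compared with the input.
out <- tempfile(fileext = ".md")
doc$write(out)
readLines(out)
```

Comparing `readLines(path)` with `readLines(out)` is a quick way to see which details of the original file survive the round trip.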

With Pandoc that we presented in a [tech note last year](blog/2023/06/01/troubleshooting-pandoc-problems-as-an-r-user/#raw-attributes), you can parse a Markdown file to a Pandoc Abstract Syntax Tree, or to, say, HTML, and then back to Markdown.
Member

I included Nic Crane's experimental package here because it uses this process under the hood.

Member Author

I thought it'd use Quarto directly not Pandoc?

Member

It does use Quarto directly, but that's just a throughline to Pandoc, so it's basically the equivalent, right?

Member Author

but maybe worth mentioning the strategy as it's a nice one 🤔


With Pandoc that we presented in a [tech note last year](blog/2023/06/01/troubleshooting-pandoc-problems-as-an-r-user/#raw-attributes), you can parse a Markdown file to a Pandoc Abstract Syntax Tree, or to, say, HTML, and then back to Markdown.

The [parsermd package](https://rundel.github.io/parsermd/) maintained by Colin Rundel is "implementation of a formal grammar and parser for R Markdown documents using the Boost Spirit X3 library. It also includes a collection of high level functions for working with the resulting abstract syntax tree."
Member

I added more context to this and put it into a separate section because it differs from the other parsers in that it doesn't parse beyond headers or code blocks.

* first pass at edits

- add links and context
- reword some sections

* Separate parsers into sections

I also added some more context.

* add additional context

* Apply suggestions from Maëlle Code Review

Co-authored-by: Maëlle Salmon <[email protected]>

* Add Zhian as author; define AST

* Add code example for templating

---------

Co-authored-by: Maëlle Salmon <[email protected]>

maelle commented Jun 25, 2024

@zkamvar I am not sure we should use my name as is in a filename 😅

@yabellini yabellini added the blog post Blog posts to be published when merged label Jul 1, 2024
@cderv cderv left a comment

👋 @maelle !

Long overdue - here is a first review with some thoughts before contributing some more content about Pandoc and Lua.

I am sharing this because I want to discuss with you where and how it would fit.

The main problem I have is that:

  • If we talk about pure Markdown documents, this would fit well.
  • But if we only focus on Rmd or Qmd editing, for documents with code blocks, this is where it fails.

Pandoc wouldn't know how to deal with specific Qmd or Rmd syntax, especially code cells.

How does other tooling like tinkr handle it?

I see we talk about it a lot, but would it keep code cells? I guess so, considering the work you've done with it.

This would not be possible with only Pandoc and Lua or Pandoc to json and json processing.

This is where parsermd or lightparser are useful as they work to handle the code cells.

So I commented where I could mention more about Pandoc and Lua and / or Pandoc and JSON.

As a general comment after reading the article, I wonder if the organization should be different to present all this.

  • Are we only listing tools ? But we added an example of code for the Markdown templating. Why not other ?
  • Or should we then, have a use case and show an example for all solutions we list.

More complex, but could be

  • What are we trying to do ?
    • Parse to retrieve some content ?
    • Parse to modify and write back ?
  • Show the different solutions, maybe with an example for each.

Anyhow, in the current state of content, I was thinking of TOC more like this

## What is Markdown?

... 

## Templating Tools for Boilerplate Documents

.... 

## String Manipulation Tools to change small part of a document

...

## Parsing Tools for heavier manipulation

### AST ?

### Different representation for different processing tools 

- Markdown to XML -> xml2
- Markdown to HTML -> rvest / xml2
- Markdown to JSON -> jsonlite 
- Markdown to R lists -> purrr / rlist
- Markdown to Pandoc AST -> Pandoc Lua filters 

### What about code cells ? 

yes - Rmd or Qmd are quite specific because they have executable code cells which are not in Markdown specs

R specific Package for that: 
- parsermd -> uses C++ to parse document as string content
- lightparser -> uses knitr internal parser (dangerous ! 😅 )
- parseqmd (experimental) -> leverage to JSON conversion

Hard task! Those tools help to split content, and then process each part accordingly
So, in a way could be seen as 
- Parsing Markdown -> what we have seen
- Parsing R code -> r-tree-sitter / xmlparsedata

## How to choose the tools I need ? 

- Simple tweak -> string manipulation is fine
- Extracting content from a markdown file 
	- Interested in markdown content = Choose what you know best to query -> XML / JSON / HTML
	- Code cells matter = look at R packages handling Qmd / Rmd structure

### The Impossibility of a Perfect Roundtrip

- Round trip is important
	- Consider tinkr for commonmark 
	- Pandoc for reading markdown -> writing markdown using JSON or Lua filters

Best chance of success with Markdown

## Examples 

## Conclusion

Long review again - happy to discuss it live on this basis, and then I'll do a PR based on what I need to add.

Hopefully, this is helpful and not too far away from what you expected.

Sorry again for the delay, and thanks a lot for the wait !!

[extended syntax]: https://www.markdownguide.org/extended-syntax/


Markdown formats that R users will commonly interact with include: R Markdown (uses Pandoc under the hood), Quarto (uses Pandoc under the hood... see any trend here?), GitHub, Hugo (for blogdown or hugodown websites).
Contributor

Shouldn't we make a difference between tools that use a Markdown flavor, and the Markdown flavor in question?

This is quite technical but usually good to know.

Regarding the R space and markdown flavor, citing CRAN commonmark package could also be good among useful tools.
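
As an illustration of the commonmark suggestion, here is a small sketch that converts Markdown to the CommonMark XML representation and queries it with xml2. It assumes both packages are installed; the Markdown string is invented for the example.

```r
library(commonmark)
library(xml2)

# Hypothetical Markdown content.
md <- "# Title\n\nSome *emphasised* text.\n\n## Section\n"

# Convert Markdown to CommonMark XML, then read it with xml2.
xml <- read_xml(markdown_xml(md))

# Drop the default namespace so that XPath queries do not need a prefix.
xml_ns_strip(xml)

# Collect the level and text of each heading.
headings <- xml_find_all(xml, ".//heading")
toc <- data.frame(
  level = as.integer(xml_attr(headings, "level")),
  text = xml_text(headings)
)
toc
```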

Comment on lines +74 to +76
Most often R users will write Markdown manually, or with the help of an editor such as the RStudio IDE visual editor.
But sometimes, one will have to create or edit a bunch of Markdown files at once, and editing all those files by hand is a huge waste of time.
This blog post will give you resources in R that you can use to create, parse, and edit Markdown documents, so that you can become the Markdown wizard you have always dreamed of becoming :mage:!
Contributor

Suggested change
Most often R users will write Markdown manually, or with the help of an editor such as the RStudio IDE visual editor.
But sometimes, one will have to create or edit a bunch of Markdown files at once, and editing all those files by hand is a huge waste of time.
This blog post will give you resources in R that you can use to create, parse, and edit Markdown documents, so that you can become the Markdown wizard you have always dreamed of becoming :mage:!
Most often R users will write Markdown manually, or with the help of an editor such as the [RStudio IDE visual editor][rstudio-md-editor].
But sometimes, one will have to create or edit a bunch of Markdown files at once, and editing all those files by hand is a huge waste of time.
This blog post will give you resources in R that you can use to create, parse, and edit Markdown documents, so that you can become the Markdown wizard you have always dreamed of becoming :mage:!
[rstudio-md-editor]: https://posit.co/blog/exploring-rstudio-visual-markdown-editor/

Good to add a link to what this is ? Could also be quarto doc website: https://quarto.org/docs/visual-editor/


Templating tools include:

- [`knitr::knit_expand()`](https://cran.r-project.org/web/packages/knitr/vignettes/knit_expand.html) by Yihui Xie;
Contributor

I am wondering if the cookbook page recipe is easier as an entry point (I'll add the link to the vignette there too).

https://bookdown.org/yihui/rmarkdown-cookbook/knit-expand.html

- [`knitr::knit_expand()`](https://cran.r-project.org/web/packages/knitr/vignettes/knit_expand.html) by Yihui Xie;
- the [whisker package](https://github.com/edwindj/whisker) maintained by Edwin de Jonge (used in, for instance, pkgdown);
- the [brew package](https://github.com/gregfrog/brew) maintained by Greg Hunt;
- [Pandoc](/blog/2023/06/01/troubleshooting-pandoc-problems-as-an-r-user/) by John MacFarlane.
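
To make the idea of templating concrete, here is a minimal sketch with `knitr::knit_expand()`, the first tool in the list above. The template text and the values `name` and `n` are hypothetical; the placeholders use the default `{{ }}` delimiters.

```r
library(knitr)

# Hypothetical template: {{name}} and {{n}} are placeholders filled by knit_expand().
template <- c(
  "# Homework for {{name}}",
  "",
  "You have {{n}} exercises to complete."
)

# Fill the placeholders; the result is a single string.
filled <- knit_expand(text = template, name = "Ada", n = 3)
cat(filled, sep = "\n")
```

Looping over a vector of names and writing each result with `writeLines()` would produce one Markdown file per person.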
Contributor

Why are we mentioning pandoc here ? Is it for its Templating system specifically ? https://pandoc.org/MANUAL.html#templates

or for its general use to convert to markdown ?

It is not clear to me as we put it in the same list of templating tools, and regarding what we say below in "A common workflow would be:" ...

Comment on lines +103 to +106
````{r show-markdown, echo = FALSE, warn = FALSE, message = FALSE, results = 'asis', comment = ""}
md <- readLines("hw-template.md")
writeLines(c("````markdown", md, "````"), con = stdout())
````
Contributor

BTW, this should be possible to do using only chunk options (https://yihui.org/en/2022/01/knitr-news/#the-new-engines-comment-verbatim-and-embed)

Suggested change
````{r show-markdown, echo = FALSE, warn = FALSE, message = FALSE, results = 'asis', comment = ""}
md <- readLines("hw-template.md")
writeLines(c("````markdown", md, "````"), con = stdout())
````
```{embed show-markdown, file="hw-template.md", lang="markdown"}
```

This should produce the right markdown even for Hugo .md

With Pandoc that we presented in a [tech note last year](blog/2023/06/01/troubleshooting-pandoc-problems-as-an-r-user/#raw-attributes), you can parse a Markdown file to a Pandoc Abstract Syntax Tree (in JSON format).
Nic Crane has an experimental package called [parseqmd](https://github.com/thisisnic/parseqmd) that uses this strategy, parsing
the output with the jsonlite package.
You can also parse to, say HTML, and then back to Markdown. The benefit of parsing it to HTML is that you can use a package such as rvest to extract and manipulate the elements.
Contributor

such as rvest to extract and manipulate the elements

I usually use the xml2 package directly for handling local HTML files (no scraping done). Is rvest really still the one to use for this? Or should we mention xml2?

The [md4r package](https://rundel.github.io/md4r/) is a recent experimental package maintained by Colin Rundel. It is an R wrapper around the MD4C (Markdown for C) library and represents the AST as a nested list with attributes in R.
The development version of the package has utilities for constructing Markdown documents programmatically.

With Pandoc that we presented in a [tech note last year](blog/2023/06/01/troubleshooting-pandoc-problems-as-an-r-user/#raw-attributes), you can parse a Markdown file to a Pandoc Abstract Syntax Tree (in JSON format).
Contributor

you can parse a Markdown file to a Pandoc Abstract Syntax Tree (in JSON format).

I don't know why, but it puzzles me that we say "parse" to talk about the transformation to AST.

Anyhow, here I think we can say more about how Pandoc works. Pandoc can transform to the AST in its own native representation, but also to JSON format.

Basically, if you want to do

md -> do something -> back to md

then with Pandoc there are two choices:

  • Using a Lua filter: Pandoc converts to the AST in its native format, Lua filters process and tweak it, and then Pandoc can write back to Markdown.

  • Using a JSON filter: Pandoc converts to the AST, outputting a JSON representation of it; any tool can then modify this JSON and provide the modified version to Pandoc to convert back to Markdown.

Maybe by parsing, we really mean "How to retrieve part of the document?", for example, "I want all headers"?

Then indeed working with the Json output may be simpler than producing an output from Lua.

Trying to get the correct meaning of this part. I feel this is where I can add something about Lua and Pandoc.
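
To make the "I want all headers" case concrete, here is a hedged sketch of working with Pandoc's JSON output from R, assuming the jsonlite package is installed. The JSON below is a small hand-written AST rather than real Pandoc output (real output is more verbose and handles many more element types), and `inlines_to_text()` is a hypothetical helper that only covers `Str` and `Space`.

```r
library(jsonlite)

# Hand-written, minimal Pandoc JSON AST for illustration only.
# In practice it would come from Pandoc, e.g. `pandoc -f markdown -t json file.md`.
ast_json <- '{
  "pandoc-api-version": [1, 23, 1],
  "meta": {},
  "blocks": [
    {"t": "Header", "c": [1, ["title", [], []], [{"t": "Str", "c": "Title"}]]},
    {"t": "Para", "c": [{"t": "Str", "c": "Some"}, {"t": "Space"}, {"t": "Str", "c": "text"}]},
    {"t": "Header", "c": [2, ["sub-section", [], []],
      [{"t": "Str", "c": "Sub"}, {"t": "Space"}, {"t": "Str", "c": "section"}]]}
  ]
}'

# Keep the nested list structure instead of simplifying to vectors or data frames.
ast <- fromJSON(ast_json, simplifyVector = FALSE)

# Hypothetical helper: flatten inline elements to plain text (Str and Space only).
inlines_to_text <- function(inlines) {
  parts <- vapply(inlines, function(x) {
    switch(x$t, Str = x$c, Space = " ", "")
  }, character(1))
  paste(parts, collapse = "")
}

# Keep only Header blocks, then collect their level and text.
headers <- Filter(function(block) block$t == "Header", ast$blocks)
toc <- data.frame(
  level = vapply(headers, function(h) as.integer(h$c[[1]]), integer(1)),
  text = vapply(headers, function(h) inlines_to_text(h$c[[3]]), character(1))
)
toc
```

Retrieving content this way only needs the JSON; modifying the document would additionally require writing the edited JSON back out and passing it to Pandoc with `-f json -t markdown`.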

Although string manipulation tools are of a limited usefulness when parsing Markdown, they can _complement_ the actual parsing tools.
Even if using specific Markdown parsing tools will help you write less regular expressions yourself... they won't completely free you from them.
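
A small base R sketch of why regular expressions alone are fragile here. The Markdown lines are invented; the second approach assumes plain three-backtick fences only and would still miss other cases (indented code, tilde fences, Setext headings).

```r
# Hypothetical document: one line inside the code chunk looks like a heading.
md <- c(
  "# Title",
  "",
  "```r",
  "# a comment inside a code chunk, not a heading",
  "```",
  "",
  "## Section"
)

# Naive approach: any line starting with 1 to 6 '#' followed by a space.
naive <- grep("^#{1,6} ", md, value = TRUE)

# Slightly better: ignore lines located between fence markers.
in_fence <- cumsum(grepl("^```", md)) %% 2 == 1
headings <- md[grepl("^#{1,6} ", md) & !in_fence]
```

Here `naive` wrongly includes the R comment, while `headings` keeps only the two real headings. A parser handles this distinction without any extra bookkeeping.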

## Parsing Tools
Contributor

I mention this later, but the Parsing word does not feel right to me. I wonder if this should be called

Suggested change
## Parsing Tools
## Abstract Representation Manipulation Tools

in opposition to String Manipulation.

Or something conveying the conversion to another format to correctly parse the content.

But maybe that is what a parser is about 😅 🙄 is it ?

the output with the jsonlite package.
You can also parse to, say HTML, and then back to Markdown. The benefit of parsing it to HTML is that you can use a package such as rvest to extract and manipulate the elements.

### High-level Parsing
Contributor

There is a recent lightparser package for Rmd and Qmd https://cloud.r-project.org/web/packages/lightparser/index.html

Split your 'rmarkdown' or 'quarto' files by sections into a tibble: titles, text, chunks. Rebuild the file from the tibble.

It lets you manipulate the content.

Should it be added as a High-Level Parsing ?

I am not sure I see the difference between parsermd and parseqmd, with the former being in the High-Level Parsing part and the latter considered Fine-Grain Parsing.

Should we just list tools and give examples without classifying them this way?

For instance, with [tinkr](http://docs.ropensci.org/tinkr/#general-principles-and-solution) list items all start with a `-` even if in the original document they started with a `*`. With md4r, lists that are indented with extra space will be readjusted.

Depending on your use case you might want to find ways to mitigate such losses, for instance only re-writing the lines you made intentional edits to.
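
One possible mitigation, sketched in base R: keep the original lines and replace only the range you edited. `replace_lines()` is a hypothetical helper, and the sketch assumes you already know which lines an edited element spans (for instance from source positions, which `commonmark::markdown_xml()` can record with `sourcepos = TRUE`).

```r
# Replace lines `from` to `to` of `lines` with `replacement`, leaving the rest untouched.
replace_lines <- function(lines, from, to, replacement) {
  stopifnot(from >= 1, to >= from, to <= length(lines))
  before <- lines[seq_len(from - 1)]
  after <- lines[seq_len(length(lines) - to) + to]
  c(before, replacement, after)
}

# Hypothetical document: the list keeps its original `*` markers.
original <- c("* first item", "* second item", "", "Some text with a tpyo.")
edited <- replace_lines(original, 4, 4, "Some text with a typo.")
edited
```

The list markers are untouched because those lines were never re-serialized; only the fourth line changes.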

Contributor

Maybe this is where Pandoc will shine, with a Lua filter or JSON filter, as you read from Markdown to write to Markdown?

Problem is Rmd to Rmd or .qmd to .qmd 🤔 This is hard because of specific Rmd or Quarto syntax ... (thinking out loud here)


maelle commented Jul 19, 2024

@zkamvar @cderv I'll try to come up with an updated structure and ✨ diagrams ✨ soon-ish


maelle commented Jul 19, 2024

Some notes of mine, partly repeating what @cderv said.

  • Important to use "do you want to use R code cells" as criterion (can the tool preserve them, can the tool operate on Rmd/qmd or the resulting thing only)
  • Lua filters documented in Quarto extensions.
  • add flavors explanation to the section that explains what Markdown is.

I'd like to add

  • a table with one line per tool and criterion such as code cells yes/no, intermediary formats provided (tinkr: XML, Pandoc: native thing or JSON, etc)
  • a decision tree based on what you want to do. Use excalidraw for that.

The title needs to be tweaked as it's not all about edits. "handle" maybe.


maelle commented Jul 19, 2024

https://astgrepr.etiennebacher.com/


maelle commented Aug 30, 2024

lightparser https://edenian-prince.github.io/blog/posts/2024-08-21-translate-md-files/
