post on Markdown #751

maelle · 2024-04-09T10:17:00Z

add https://github.com/thisisnic/parseqmd

maelle · 2024-04-09T14:08:57Z

content/blog/2024-04-16-markdown-programmatic/index.md

+
+Programmatically parsing and editing R code is out of the scope of this post, but closely related enough to throw in a few tips.
+As with Markdown, you might need to use regular expressions but try not to.
+You can parse the code to XML using base R parsing and [xmlparsedata](https://r-lib.github.io/xmlparsedata/), then you manipulate the XML with [XPath](https://masalmon.eu/2022/04/08/xml-xpath/).


maybe mention https://github.com/DavisVaughan/r-tree-sitter

and https://masalmon.eu/2024/05/15/refactoring-xml/ instead of my general xpath post
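
For reference, a minimal sketch of the xmlparsedata + XPath approach from the diff above. It assumes the xmlparsedata and xml2 packages are installed; the R code being parsed is made up for illustration.

```r
library(xmlparsedata)
library(xml2)

# Made-up R code to analyse; keep.source = TRUE is needed to get parse data.
code <- "x <- mean(c(1, 2, 3))\nprint(x)"
parsed <- parse(text = code, keep.source = TRUE)

# Convert the parse data to XML, then query it with XPath.
xml <- read_xml(xml_parse_data(parsed))
calls <- xml_text(xml_find_all(xml, "//SYMBOL_FUNCTION_CALL"))
calls
```

The same XPath idea extends to other token types in the parse data, which is what makes it more robust than regular expressions on the raw code.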

zkamvar

I think it looks good so far! I made some modifications in #788.

zkamvar · 2024-06-21T18:04:52Z

content/blog/2024-04-16-markdown-programmatic/index.Rmd

+
+```
+
+Furthermore there are different _flavors_ of Markdown, and some supplementary features added depending on what your Markdown files will be used by, like emoji written so: `:grin:`.


Hugo will render this emoji. Did you want to show the shortcode?

In #788, I add a bit more context, linking to the engines that generate markdown and the extended syntax guide.

I was very tempted to include a comment somewhere about the horrors of Jekyll, but that's for another day.

I will need to escape this then 😅

I'm curious about the Jekyll horrors even if out of scope for this post. Recently I was not even able to install it so I'm spared further horrors locally 😂

The horrors of kramdown are many, but I think the biggest annoyance is the use of postfix tags, as shown in this nearly 180 line function I had to write to parse nested block quotes.

This and the fact that the parser was really forgiving for block quote definitions in that you could do things like:

> ## title
text that belongs in the block quote without a leading carrot and that's totally cool, I guess.
> other text that's still in the same block quote

text outside of block quote

And this was a hack that people did to get knitr code to run inside of block quotes without throwing warnings/errors because of the "unrecognised leading symbol '>'".

zkamvar · 2024-06-21T18:08:43Z

content/blog/2024-04-16-markdown-programmatic/index.Rmd

+
+The [tinkr package](http://docs.ropensci.org/tinkr/) maintained by Zhian Kamvar parses Markdown to XML using Commonmark, and writes it back to Markdown using XSLT. The YAML metadata is available as a string.
+
+With Pandoc that we presented in a [tech note last year](blog/2023/06/01/troubleshooting-pandoc-problems-as-an-r-user/#raw-attributes), you can parse a Markdown files to a Pandoc Abstract Syntax Tree, or to, say HTML, and then back to Markdown.


I included Nic Crane's experimental package here because it uses this process under the hood.

I thought it'd use Quarto directly not Pandoc?

It does use Quarto directly, but that's just a throughline to Pandoc, so it's basically the equivalent, right?

but maybe worth mentioning the strategy as it's a nice one 🤔

zkamvar · 2024-06-21T18:09:22Z

content/blog/2024-04-16-markdown-programmatic/index.Rmd

+
+With Pandoc that we presented in a [tech note last year](blog/2023/06/01/troubleshooting-pandoc-problems-as-an-r-user/#raw-attributes), you can parse a Markdown files to a Pandoc Abstract Syntax Tree, or to, say HTML, and then back to Markdown.
+
+The [parsermd package](https://rundel.github.io/parsermd/) maintained by Colin Rundel is "implementation of a formal grammar and parser for R Markdown documents using the Boost Spirit X3 library. It also includes a collection of high level functions for working with the resulting abstract syntax tree."


I added more context to this and put it into a separate section because it differs from the other parsers in that it doesn't parse beyond headers or code blocks.

* first pass at edits
  - add links and context
  - reword some sections
* Separate parsers into sections

  I also added some more context.
* add additional context
* Apply suggestions from Maëlle Code Review

  Co-authored-by: Maëlle Salmon <[email protected]>
* Add Zhian as author; define AST
* Add code example for templating

---------

Co-authored-by: Maëlle Salmon <[email protected]>

maelle · 2024-06-25T09:51:05Z

@zkamvar I am not sure we should use my name as is in a filename 😅

cderv

👋 @maelle !

Long overdue - here is a first review with some thoughts before contributing some more content about Pandoc and Lua.

I am sharing this because I want to discuss with you where and how it would fit.

Main problem I have is that:

If we talk about pure Markdown documents, this would fit well.
But if we only focus on Rmd or Qmd editing, for documents with code cells, then this is where it fails.

Pandoc wouldn't know how to deal with specific Qmd or Rmd syntax, especially code cells.

How does other tooling like tinkr handle it ?

I see we talk about it a lot, but would it keep code cells ? I guess so considering the work you've done with it.

This would not be possible with only Pandoc and Lua or Pandoc to json and json processing.

This is where parsermd or lightparser are useful as they work to handle the code cells.

So I commented where I could mention more about Pandoc and Lua and / or Pandoc and JSON.

As a general comment after reading the article, I wonder if the organization should be different to present all this.

Are we only listing tools ? But we added an example of code for the Markdown templating. Why not other ?
Or should we then, have a use case and show an example for all solutions we list.

More complex, but could be

What are we trying to do ?
- Parse to retrieve some content ?
- Parse to modify and write back ?
Show the different solutions and maybe an example with each.

Anyhow, in the current state of content, I was thinking of TOC more like this

## What is Markdown?

... 

## Templating Tools for Boilerplate Documents

.... 

## String Manipulation Tools to change small part of a document

...

## Parsing Tools for heavier manipulation

### AST ?

### Different representation for different processing tools 

- Markdown to XML -> xml2
- Markdown to HTML -> rvest / xml2
- Markdown to JSON -> jsonlite 
- Markdown to R lists -> purrr / rlists
- Markdown to Pandoc AST -> Pandoc Lua filters 

### What about code cells ? 

yes - Rmd or Qmd are quite specific because they have executable code cells which are not in Markdown specs

R specific Package for that: 
- parsermd -> uses C++ to parse document as string content
- lightparser -> uses knitr internal parser (dangerous ! 😅 )
- parseqmd (experimental) -> leverages conversion to JSON

Hard task ! Those tools help to split content, and then process each part accordingly.
So, in a way, it could be seen as
- Parsing Markdown -> what we have seen
- Parsing R code -> r-tree-sitter / xmlparsedata

## How to choose the tools I need ? 

- Simple tweak -> string manipulation is fine
- Extracting content from a markdown file 
	- Interested in markdown content = Choose what you know best to query -> XML / JSON / HTML
	- Code cells matter = look at R packages handling Qmd / Rmd structure

### The Impossibility of a Perfect Roundtrip

- Round trip is important
	- Consider tinkr for commonmark 
	- Pandoc for reading markdown -> writing markdown using JSON or Lua filters

Best chance of success with Markdown

## Examples 

## Conclusion

Long review again - happy to discuss it live on this basis, and then I'll do a PR based on what I need to add.

Hopefully, this is helpful and not too far away from what you expected.

Sorry again for the delay, and thanks a lot for the wait !!

cderv · 2024-07-17T14:31:59Z

content/blog/2024-04-16-markdown-programmatic/index.Rmd

+[extended syntax]: https://www.markdownguide.org/extended-syntax/
+
+
+Markdown formats that R users will commonly interact with include: R Markdown (uses Pandoc under the hood), Quarto (uses Pandoc under the hood... see any trend here?), GitHub, Hugo (for blogdown or hugodown websites).


Shouldn't we make a difference between tools that use a Markdown flavor, and the Markdown flavor in question ?

Quarto, R Markdown are using Pandoc's Markdown per https://pandoc.org

Github is using GFM (https://github.github.com/gfm/)

Hugo is using GoldMark (which supports the CommonMark and GFM specs: https://gohugo.io/content-management/formats/#markdown)

This is quite technical but usually good to know.

Regarding the R space and markdown flavor, citing CRAN commonmark package could also be good among useful tools.
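
For anyone wanting to try the commonmark package mentioned here, a minimal sketch, assuming commonmark and xml2 are installed; the Markdown string is made up. It converts Markdown to the CommonMark XML representation and lists the headings.

```r
library(commonmark)
library(xml2)

# Made-up Markdown content for illustration.
md <- "# Title\n\nSome *text* and a [link](https://example.com).\n\n## Section\n"

# Convert Markdown to CommonMark XML, then drop the default namespace
# so that plain XPath expressions such as //heading work.
doc <- xml_ns_strip(read_xml(markdown_xml(md)))

headings <- xml_find_all(doc, "//heading")
data.frame(
  level = xml_attr(headings, "level"),
  text = xml_text(headings)
)
```

Note this only covers the CommonMark flavor (plus the extensions the package exposes); Pandoc's Markdown has syntax that this parser will not recognise.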

cderv · 2024-07-17T14:34:38Z

content/blog/2024-04-16-markdown-programmatic/index.Rmd

+Most often R users will write Markdown manually, or with the help of an editor such as the RStudio IDE visual editor.
+But sometimes, one will have to create or edit a bunch of Markdown files at once, and editing all those files by hand is a huge waste of time. 
+This blog post will give you resources in R that you can use to create, parse, and edit Markdown documents, so that you can become the Markdown wizard you have always dreamed of becoming :mage:!


Suggested change

Most often R users will write Markdown manually, or with the help of an editor such as the RStudio IDE visual editor.

But sometimes, one will have to create or edit a bunch of Markdown files at once, and editing all those files by hand is a huge waste of time.

This blog post will give you resources in R that you can use to create, parse, and edit Markdown documents, so that you can become the Markdown wizard you have always dreamed of becoming :mage:!

Most often R users will write Markdown manually, or with the help of an editor such as the [RStudio IDE visual editor][rstudio-md-editor].

But sometimes, one will have to create or edit a bunch of Markdown files at once, and editing all those files by hand is a huge waste of time.

This blog post will give you resources in R that you can use to create, parse, and edit Markdown documents, so that you can become the Markdown wizard you have always dreamed of becoming :mage:!

[rstudio-md-editor]: https://posit.co/blog/exploring-rstudio-visual-markdown-editor/

Good to add a link to what this is ? Could also be quarto doc website: https://quarto.org/docs/visual-editor/

cderv · 2024-07-17T14:36:37Z

content/blog/2024-04-16-markdown-programmatic/index.Rmd

+
+Templating tools include:
+
+- [`knitr::knit_expand()`](https://cran.r-project.org/web/packages/knitr/vignettes/knit_expand.html) by Yihui Xie;


I am wondering if the cookbook page recipe is easier as an entry point (I'll add the link to the vignette there too).

https://bookdown.org/yihui/rmarkdown-cookbook/knit-expand.html
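
As a quick illustration of what that entry point covers, a minimal sketch of `knitr::knit_expand()`, assuming knitr is installed; the template text and the values passed in are hypothetical.

```r
library(knitr)

# Hypothetical template; {{ }} marks the values to fill in.
template <- "# Homework for {{name}}\n\nDue date: {{due}}"

out <- knit_expand(text = template, name = "Ada", due = "2024-07-01")
cat(out)
```

The result is a character string that can be written to an `.md` or `.Rmd` file with `writeLines()`.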

cderv · 2024-07-17T14:40:24Z

content/blog/2024-04-16-markdown-programmatic/index.Rmd

+- [`knitr::knit_expand()`](https://cran.r-project.org/web/packages/knitr/vignettes/knit_expand.html) by Yihui Xie;
+- the [whisker package](https://github.com/edwindj/whisker) maintained by Edwin de Jonge (used in for instance pkgddown);
+- the [brew package](https://github.com/gregfrog/brew) maintained by Greg Hunt;
+- [Pandoc](/blog/2023/06/01/troubleshooting-pandoc-problems-as-an-r-user/) by John MacFarlane.


Why are we mentioning pandoc here ? Is it for its Templating system specifically ? https://pandoc.org/MANUAL.html#templates

or for its general use to convert to Markdown ?

It is not clear to me as we put it in the same list of templating tools, and regarding what we say below in "A common workflow would be:" ...

cderv · 2024-07-17T14:52:22Z

content/blog/2024-04-16-markdown-programmatic/index.Rmd

+````{r show-markdown, echo = FALSE, warn = FALSE, message = FALSE, results = 'asis', comment = ""}
+md <- readLines("hw-template.md")
+writeLines(c("````markdown", md, "````"), con = stdout())
+````


BTW, this should be possible to do using only chunk options (https://yihui.org/en/2022/01/knitr-news/#the-new-engines-comment-verbatim-and-embed)

Suggested change

````{r show-markdown, echo = FALSE, warn = FALSE, message = FALSE, results = 'asis', comment = ""}

md <- readLines("hw-template.md")

writeLines(c("````markdown", md, "````"), con = stdout())

````

```{embed show-markdown, file="hw-template.md", lang="markdown"}

```

This should produce the right markdown even for Hugo .md

cderv · 2024-07-17T18:07:44Z

content/blog/2024-04-16-markdown-programmatic/index.Rmd

+With Pandoc that we presented in a [tech note last year](blog/2023/06/01/troubleshooting-pandoc-problems-as-an-r-user/#raw-attributes), you can parse a Markdown files to a Pandoc Abstract Syntax Tree (in JSON format). 
+Nic Crane has an experimental package called [parseqmd](https://github.com/thisisnic/parseqmd) that uses this strategy, parsing
+the output with the jsonlite package.
+You can also parse to, say HTML, and then back to Markdown. The benefit of parsing it to HTML is that you can use a package such as rvest to extract and manipulate the elements.


such as rvest to extract and manipulate the elements

I usually use the xml2 package directly for handling local HTML files (no scraping done). Is rvest really still the one to use for this ? Or should we mention xml2 ?
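
A minimal sketch of the xml2-only route, assuming xml2 is installed; the HTML string is made up and stands in for the result of a Markdown-to-HTML conversion.

```r
library(xml2)

# Made-up HTML, standing in for the output of a Markdown-to-HTML conversion.
html <- "<h1>Title</h1><p>See <a href='https://ropensci.org'>rOpenSci</a> and <a href='https://pandoc.org'>Pandoc</a>.</p>"

doc <- read_html(html)

# Extract the target of every link with XPath.
links <- xml_find_all(doc, "//a")
urls <- xml_attr(links, "href")
urls
```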

cderv · 2024-07-17T18:15:04Z

content/blog/2024-04-16-markdown-programmatic/index.Rmd

+The [md4r package](https://rundel.github.io/md4r/), is a recent experimental package maintained by Colin Rundel, and is an R wrapper around the MD4C (Markdown for C) library and represents the AST as a nested list with attributes in R. 
+The development version of the package has utilities for constructing Markdown documents programmatically.
+
+With Pandoc that we presented in a [tech note last year](blog/2023/06/01/troubleshooting-pandoc-problems-as-an-r-user/#raw-attributes), you can parse a Markdown files to a Pandoc Abstract Syntax Tree (in JSON format). 


you can parse a Markdown files to a Pandoc Abstract Syntax Tree (in JSON format).

I don't know why, but it puzzles me that we say "parse" to talk about the transformation to AST.

Anyhow, here I think we can say more about how Pandoc works. Pandoc can transform to AST in its own native representation, but also to json format.

Basically, if you want to do

md -> do something -> back to md

then with Pandoc there are two choices

Using a Lua filter: Pandoc converts to the AST in its native format, Lua filters let you process and tweak it, and then Pandoc can write back to Markdown.

Using a JSON filter: Pandoc converts to the AST, outputting a JSON representation of it; any tool can then modify this JSON and provide the modified version to Pandoc to convert back to Markdown.

In R: the pandocfilters package leverages this

Maybe by parsing, we really mean "How to retrieve part of the document ?" . for example, "I want all headers" ?

Then indeed working with the Json output may be simpler than producing an output from Lua.

Trying to get the correct meaning of this part. I feel this is where I can add something about Lua and Pandoc.
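
To make the "I want all headers" case concrete, here is a rough sketch of working with the JSON representation from R. It assumes jsonlite is installed; the AST string is hand-written to stand in for what `pandoc -t json` would output on a small file, so treat its exact shape as illustrative.

```r
library(jsonlite)

# Hand-written, minimal Pandoc JSON AST (an assumption for illustration),
# standing in for the output of `pandoc -t json` on a small Markdown file.
ast_json <- '{
  "pandoc-api-version": [1, 23],
  "meta": {},
  "blocks": [
    {"t": "Header", "c": [1, ["intro", [], []], [{"t": "Str", "c": "Intro"}]]},
    {"t": "Para", "c": [{"t": "Str", "c": "Hello"}]},
    {"t": "Header", "c": [2, ["details", [], []], [{"t": "Str", "c": "Details"}]]}
  ]
}'

# Keep the nested list structure instead of simplifying to vectors.
ast <- fromJSON(ast_json, simplifyVector = FALSE)

# Keep only the Header blocks; each stores level, attributes, and inlines.
is_header <- vapply(ast$blocks, function(b) identical(b$t, "Header"), logical(1))
headers <- ast$blocks[is_header]

header_ids <- vapply(headers, function(h) h$c[[2]][[1]], character(1))
header_levels <- vapply(headers, function(h) as.integer(h$c[[1]]), integer(1))
```

This only reads from the AST. Writing a modified AST back to Markdown would still go through Pandoc, with the caveat about Rmd / Qmd code cells raised above.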

cderv · 2024-07-17T18:18:00Z

content/blog/2024-04-16-markdown-programmatic/index.Rmd

+Although string manipulation tools are of a limited usefulness when parsing Markdown, they can _complement_ the actual parsing tools.
+Even if using specific Markdown parsing tools will help you write less regular expressions yourself... they won't completely free you from them.
+
+## Parsing Tools


I mention this later, but the Parsing word does not feel right to me. I wonder if this should be called

Suggested change

## Parsing Tools

## Abstract Representation Manipulation Tools

in opposition to String Manipulation.

Or something conveying the conversion to another format to correctly parse the content.

But maybe that is what a parser is about 😅 🙄 is it ?

cderv · 2024-07-17T18:21:33Z

content/blog/2024-04-16-markdown-programmatic/index.Rmd

+the output with the jsonlite package.
+You can also parse to, say HTML, and then back to Markdown. The benefit of parsing it to HTML is that you can use a package such as rvest to extract and manipulate the elements.
+
+### High-level Parsing


There is a recent lightparser package for Rmd and Qmd https://cloud.r-project.org/web/packages/lightparser/index.html

Split your 'rmarkdown' or 'quarto' files by sections into a tibble: titles, text, chunks. Rebuild the file from the tibble.

It allows to manipulate the content.

Should it be added as a High-Level Parsing ?

I am not sure I see the difference between parsermd and parseqmd, with the former being in the High-Level Parsing part and the latter considered Fine-Grained Parsing.

Should we just list tools and gives examples without classifying them this way ?

cderv · 2024-07-17T18:23:07Z

content/blog/2024-04-16-markdown-programmatic/index.Rmd

+For instance, with [tinkr](http://docs.ropensci.org/tinkr/#general-principles-and-solution) list items all start with a `-` even if in the original document they started with a `*`. With md4r, lists that are indented with extra space will be readjusted. 
+
+Depending on your use case you might want to find ways to mitigate such losses, for instance only re-writing the lines you made intentional edits to.
+


Maybe this is where Pandoc will shine with a Lua filter or JSON filter, as you read from Markdown to write to Markdown ?

Problem is Rmd to Rmd or .qmd to .qmd 🤔 This is hard because of specific Rmd or Quarto syntax ... (thinking out loud here)

maelle · 2024-07-19T12:50:51Z

@zkamvar @cderv I'll try to come up with an updated structure and ✨ diagrams ✨ soon-ish

maelle · 2024-07-19T13:03:49Z

Some notes of mine, partly repeating what @cderv said.

Important to use "do you want to use R code cells" as criterion (can the tool preserve them, can the tool operate on Rmd/qmd or the resulting thing only)
Lua filters documented in Quarto extensions.
add flavors explanation to the section that explains what Markdown is.

I'd like to add

a table with one line per tool and criterion such as code cells yes/no, intermediary formats provided (tinkr: XML, Pandoc: native thing or JSON, etc)
a decision tree based on what you want to do. Use excalidraw for that.

The title needs to be tweaked as it's not all about edits. "handle" maybe.

maelle · 2024-07-19T13:44:28Z

https://astgrepr.etiennebacher.com/

maelle · 2024-08-30T07:35:34Z

lightparser https://edenian-prince.github.io/blog/posts/2024-08-21-translate-md-files/

post on Markdown

ff91b68

maelle force-pushed the markdown branch from fd8fe3a to ff91b68 Compare April 9, 2024 10:24

zkamvar mentioned this pull request Jun 21, 2024

Structure packages and add more context #788

Merged

zkamvar reviewed Jun 21, 2024

maelle added 4 commits June 25, 2024 12:07

fix case 💅

dfcb687

add @zkamvar's homework in full

c98e8b1

add missing s

68e2ea4

add link

f0ea694

yabellini added the blog post Blog posts to be published when merged label Jul 1, 2024

cderv reviewed Jul 17, 2024
