Discussion:
How to programmatically enforce a pandoc markdown style
Kolen Cheung
2016-10-22 09:09:32 UTC
Hi, all,
pandoc from markdown to markdown

Ever since I read issue #2814 <https://github.com/jgm/pandoc/issues/2814>,
I have found converting from pandoc markdown back to pandoc markdown a very
useful trick.

I am now working on a project that, starting next semester, will open up to
about 100 GSIs who will collaboratively update a series of workbooks. I want
to use this trick as a cleanup tool to normalize the source code (in pandoc
markdown with minimal raw LaTeX).

Most things work very well, but I found a few problems. Is there any way to
get around these?

1. `### Main Goals {-}` becomes `### Main Goals {#main-goals .unnumbered}`:
I want to keep using {-} for 2 reasons: it is shorter, and it does not depend
on the header text (which will get repeated after cat).
2. `1. abcd...` becomes `1.  abcd...`: it seems that pandoc enforces 2 spaces
after the marker of an enumerated or bullet list. Is there a way to change
this behavior? I suppose I could use a regex to transform it back, but that
seems too prone to error.
3. inline footnotes: I found that pandoc converts inline footnotes to
explicit footnotes with [^1], [^2].... The use of inline_notes cannot be
enforced. I opened issue #3172
<https://github.com/jgm/pandoc/issues/3172>. I suppose I can change the
source code to use explicit footnotes only, but it seems difficult to
enforce this and to tell people not to use inline footnotes.
4. &trade; becomes ™: after studying how trademarks should be typeset,
and considering that I aim at HTML+LaTeX output with no non-ASCII characters
in the source code, I chose &trade;. But pandoc happily converts that to ™
without my consent. I suppose other such HTML entities behave similarly.
(By the way, &trade; in the markdown input becomes ™ in the TeX output, and
pdflatex has no problem with that. The resulting PDF looks identical to one
where I use \texttrademark. Does anyone know why? I thought pdflatex does
not like Unicode.)
5. pipe tables become HTML tables: I believe this is a bug, so I opened issue
#3171 <https://github.com/jgm/pandoc/issues/3171>. Even more
interestingly, the pipe tables were obtained by a .docx to .md
conversion.
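
Point 4 depends on keeping the source free of non-ASCII characters. A minimal shell sketch of a check for that follows, assuming a POSIX shell and grep. The function name find_non_ascii is hypothetical and this is not a pandoc feature.

```shell
# find_non_ascii FILE...: list the lines that contain bytes outside printable
# ASCII (tab is allowed). Hypothetical helper, not part of pandoc.
find_non_ascii() {
  tab=$(printf '\t')
  # LC_ALL=C makes grep compare single bytes, so the range " -~" covers
  # 0x20-0x7E and every byte of a multi-byte UTF-8 sequence falls outside it.
  LC_ALL=C grep -n "[^ -~$tab]" "$@"
}
```

Something like `find_non_ascii */*.md` would then print the offending lines with their line numbers, and the exit status is non-zero when nothing is found.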

The command I used to enforce “pandoc style” is:

find . -maxdepth 2 -mindepth 2 -iname "*.md" -exec pandoc -f markdown+abbreviations+autolink_bare_uris+markdown_attribute+mmd_header_identifiers+mmd_link_attributes+mmd_title_block+tex_math_double_backslash-latex_macros -t markdown+raw_tex-native_spans-simple_tables-multiline_tables-grid_tables-latex_macros --normalize -s --wrap=none --atx-headers -o {} {} \;

“pandoc lint”

By the way, does anyone know how to do some sort of “pandoc lint”?
Currently I check the TeX output with chktex -q and lacheck, which
sometimes give useful typographical hints on what to correct.

And I remember reading somewhere that @jgm mentioned that any random string
should be valid markdown (part of the markdown philosophy, so to speak). In
this sense it seems very difficult to enforce a “right” syntax in markdown.

cat a lot of markdown files into one

Lastly, there’s a very minor issue: if I cat lots of markdown files into
one, then between the end of one file and the beginning of the next, the lack
of enough newlines might produce wrong markdown syntax. (*e.g.* a file
starts with a heading, and some text editors (*e.g.* Atom) normalize the
trailing newlines of the previous file to a single one without my consent.
The heading then starts immediately after the last paragraph, and pandoc
does not parse it as a heading.)

I currently get around this problem with a script that normalizes every file
to end with exactly 2 trailing newlines.

I suppose cat-ing markdown files is a very common process. How do others
normally do it?
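
One way to make the concatenation itself robust, whatever each file ends with, can be sketched in shell. This is only an illustration under the assumption that a POSIX shell is available; the function name join_md is hypothetical.

```shell
# join_md FILE...: print the files one after another, separated by exactly
# one empty line. Hypothetical helper, not part of pandoc.
join_md() {
  first=1
  for f in "$@"; do
    # Emit the separating empty line before every file except the first.
    if [ "$first" -eq 0 ]; then
      printf '\n'
    fi
    first=0
    # $(...) strips all trailing newlines; printf adds exactly one back.
    printf '%s\n' "$(cat "$f")"
  done
}
```

The result could then be piped into pandoc, for example `join_md ch1.md ch2.md | pandoc -f markdown -t latex`, where the file names are placeholders.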

Thanks in advance,
Kolen
​
--
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pandoc-discuss+***@googlegroups.com.
To post to this group, send email to pandoc-***@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/pandoc-discuss/e82e943f-604e-4a5b-a621-4b3dd82e42c0%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
John MacFarlane
2016-10-22 20:54:06 UTC
In order to function as a proper linter, pandoc would have
to store much more detail in the AST than it currently does.
Is a header an ATX or setext style header? How many spaces
before, and how many after, a list delimiter? Inline or
regular footnote? If regular, what's the label for the
marker? Inline or reference link? If reference, what's
the label? Fenced or indented code? Was a particular
unicode character entered as an entity or verbatim? Etc.

I just didn't design pandoc to be used as a linter; the
underlying design would need to be very different.
Kolen Cheung
2016-10-22 22:41:23 UTC
Thanks for the info.

So, in a sense, to use pandoc as a linter, the style to enforce would be the
output of the pandoc markdown writer, taken as the *de facto* standard. In
that case, I think point 1 can still cause an issue. Is there any way for
the writer to write a header that is unnumbered but does not have the ID?
If not, maybe div & amsthm can be used to get around the issue.

I also want to discover tricks for using linters. As I said, I currently use
lacheck and chktex to check the style of the TeX output, and the source code
(in markdown) actually benefits from it. In a sense, this is a blessing
from pandoc, since I can output to any format pandoc supports and use a
linter there. Maybe someone will point out that I could output to ConTeXt
or docx and find a linter there too.

And lastly, I know that if I throw a bunch of files at pandoc at the same
time, pandoc will cat them together first. How does pandoc handle the
empty-line issue between the end of one file and the beginning of the next?

Thanks again.

Sergio Correia
2016-10-23 02:36:24 UTC
Post by Kolen Cheung
And lastly, I know that if I throw a bunch of files at pandoc at the same
time, pandoc will cat them together first. How does pandoc handle the
empty-line issue between the end of one file and the beginning of the next?

I also had problems with this. If I don't leave an empty line at the end of
a file, and the next one starts with a header ("# A Section"), then
everything will compile but no title will be created...
John MacFarlane
2016-10-23 21:39:56 UTC
Post by Sergio Correia
I also had problems with this. If I don't leave an empty line at the
end of a file, and the next one starts with a header ("# A Section"),
then everything will compile but no title will be created...

This is intentional. For some purposes you might want to be
able to combine things without blank spaces. If so, just
leave off the newline from the final line of the first file.

Normally text files end with a newline, so pandoc should
in effect insert a blank line.
Kolen Cheung
2016-10-24 07:59:07 UTC
Currently, my “fix” for the problem is to enforce a style in the source. An
example is in ickc/travis-ci-pandoc-latex-config/makefile
<https://github.com/ickc/travis-ci-pandoc-latex-config/blob/master/makefile#L83-L85>:

# Normalize white spaces: 1. Add 2 trailing newlines 2. delete all CONSECUTIVE blank lines from file except the first; deletes all blank lines from top and end of file; allows 0 blanks at top, 0,1,2 at EOF 3. delete trailing whitespace (spaces, tabs) from end of each line
normalize:
	find . -maxdepth 2 -iname "*.md" -exec bash -c 'printf "\n\n" >> "$$0"' {} \; -exec sed -i -e '/./,/^$$/!d' -e 's/[ \t]*$$//' {} \;

where some of the one-liners are from sed.sourceforge.net/sed1line.txt
<http://sed.sourceforge.net/sed1line.txt>.

The one-liner's description says it allows at most 1 trailing newline, but
in my tests it actually allows 0-2 trailing newlines. So the way I enforce 2
trailing newlines is to append 2 more and let sed remove any beyond 2.

This way, when 2 md files are cat together, the end of the first text and
the beginning of the second are separated by exactly 1 empty line.
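
The end-of-file part of this normalization can also be sketched without sed -i, whose syntax differs between GNU and BSD sed. The function name normalize_eof is hypothetical, and unlike the makefile rule above this sketch does not collapse blank lines inside the file or at its top.

```shell
# normalize_eof FILE: strip trailing whitespace from every line and make the
# file end with exactly 2 newlines (that is, one empty line after the text).
# Hypothetical helper, not the makefile rule itself.
normalize_eof() {
  # $(...) drops every trailing newline, so extra blank lines at EOF vanish.
  content=$(sed -e 's/[[:space:]]*$//' "$1") || return 1
  # The file has been read completely by now, so it is safe to overwrite it.
  printf '%s\n\n' "$content" > "$1"
}
```

Running it a second time on the same file leaves the file unchanged.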

Kolen Cheung
2016-10-24 08:42:52 UTC
Excuse me for getting back to the pandoc lint concept. In a sense, the
minimal requirement for it to work is this:

Consider a mapping $x \mapsto y \mapsto z$ where each map is done by
pandoc -f markdown -t markdown. It is not realistic to expect $x \equiv y$:
as you pointed out, much information is lost in the reader (and if
$x \equiv y$ always held, it would not be a linter either, because it would
not change anything). *But* if $y \equiv z$, then it should be a good
“linter”, *i.e.* if it is *idempotent*. The point is that I could use it to
enforce a style, and repeatedly applying it would not change the text
further, so the styling is fixed.

In this sense, most of the problems listed in the original post are not
problems. The one thing I am worried about is the last point about tables,
as in issue #3171 <https://github.com/jgm/pandoc/issues/3171>.

Just brainstorming: it seems reasonable to expect that *all* reader-writer
pairs should be *idempotent*. I don’t know if the tests check for this. If
not, it seems a simple test for detecting potential errors. Maybe add it as
an “allow failure” kind of test?
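
The idempotence condition ($y \equiv z$) can be checked mechanically for any stdin-to-stdout filter. A minimal shell sketch, where the function name is_idempotent is hypothetical and the filter to test is passed as arguments:

```shell
# is_idempotent FILE CMD [ARGS...]: succeed when running the filter CMD on
# FILE once gives the same text as running it twice (y == z).
# Trailing newlines are ignored because $(...) strips them.
is_idempotent() {
  file=$1
  shift
  once=$("$@" < "$file") || return 2                   # y = f(x)
  twice=$(printf '%s\n' "$once" | "$@") || return 2    # z = f(y)
  [ "$once" = "$twice" ]
}
```

With pandoc, which reads standard input when no input file is given, a call could look like `is_idempotent chapter.md pandoc -f markdown -t markdown --wrap=none`, where chapter.md is a placeholder file name.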

By the way, I improved the first problem by using -f
markdown-auto_identifiers.... It still expands {-} to {.unnumbered} though.
Maybe this is actually better, since people who are unfamiliar with
pandoc syntax would still recognize what it means.

Kolen Cheung
2016-10-27 07:05:07 UTC
I found that the pandoc markdown writer will use Unicode output when
possible: not only in cases like the trademark symbol `&trade;`, but also
when a non-breaking space is used. The markdown writer will output a Unicode
non-breaking space character rather than `\ `. Since the latter is more
markdown-ish and is the recommended way of typing a non-breaking space in
the manual, it seems the markdown writer should use that instead.
John MacFarlane
2016-10-27 19:02:43 UTC
I don't know if it's more markdown-ish. The goal of getting
a text that reads naturally without special processing is
better met by using a unicode nonbreaking space. The `\ ` is
pretty ugly. I don't think the manual recommends `\ ` as
preferable to a literal nonbreaking space.
Kolen Cheung
2016-10-28 00:58:44 UTC
On Thursday, October 27, 2016 at 12:02:43 PM UTC-7, John MacFarlane wrote:
I don’t know if it’s more markdown-ish. The goal of getting a text that
reads naturally without special processing is better met by using a unicode
nonbreaking space. The \ is pretty ugly. I don’t think the manual
recommends \ as preferable to a literal nonbreaking space.

I was referring to:

If you just want a regular inline image, just make sure it is not the only
thing in the paragraph. One way to do this is to insert a nonbreaking space
after the image:

![This image won't be a figure](/url/of/image.png)\

If the Unicode non-breaking space were the recommended form, the example
should instead read ![This image won't be a figure](/url/of/image.png)
followed by a literal non-breaking space.

I would say a Unicode non-breaking space actually doesn’t quite “read
naturally”, since it requires me to turn on a feature in my text editor to
show invisible characters, and even then the non-breaking space and the
“normal” space look almost the same, with only a different shade. (It took
me quite some time to realize that. I thought pandoc was converting `\ `
into a normal space, but it actually converts it into a non-breaking space,
which probably looks identical here.)

Inspired by the reply of Jesse Rosenthal, and by what --smart does for any
writer, maybe there should be an --unsmart option for the markdown writer
that represents characters in pure ASCII whenever possible.

I actually have another related problem with the markdown writer: currently,
if there’s a string with intraword_underscores, after pandoc -f
markdown... -t markdown... it becomes intraword\_underscores. In this case
the writer adds an unnecessary escape character, which makes the text read
less naturally.
BP Jonsson
2016-10-28 13:15:14 UTC
Permalink
The way I see it `\<space>` is preferred for input and U+00a0 for output,
which IMO is perfectly sensible. I have configured Vim to distinguish nbsp,
no-break hyphen, soft hyphen, dashes and a few other things with
highlighting.
If I want to convert U+00a0 into `\<space>` I just do `:%s/\%xa0/\\ /g`. I
also have a perl script which converts non-ASCII characters to entities,
selecting them by regex, thus by codepoint, range, general category, block
or whatever properties perl regexes support (which are very many).
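
For readers without Vim or perl at hand, roughly the same two conversions can be sketched in Python. This is not the perl script mentioned above, only an approximation under stated assumptions: the function names are hypothetical, characters are selected with a plain regex character class (the standard `re` module has no Unicode property classes), and named entities come from the standard library table `html.entities.codepoint2name`, with a numeric entity as fallback.

```python
import re
from html.entities import codepoint2name

NBSP = "\u00a0"


def nbsp_to_backslash_space(text: str) -> str:
    """Replace every U+00A0 with pandoc's backslash-space input form."""
    return text.replace(NBSP, "\\ ")


def to_entity(ch: str) -> str:
    """Return a named HTML entity if one exists, else a hex numeric entity."""
    cp = ord(ch)
    name = codepoint2name.get(cp)
    return f"&{name};" if name else f"&#x{cp:X};"


def non_ascii_to_entities(text: str, pattern: str = r"[^\x00-\x7F]") -> str:
    """Convert every character matched by `pattern` into an entity."""
    return re.sub(pattern, lambda m: to_entity(m.group()), text)


print(nbsp_to_backslash_space("e.g.\u00a0this"))
print(non_ascii_to_entities("Pandoc\u2122"))
```

Passing a narrower `pattern` (for example a single codepoint range) limits which characters are converted.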
Kolen Cheung
2016-10-28 20:45:47 UTC
Permalink
I tried to use grep --color='auto' -P -H -n "[^\x00-\x7F]" to highlight all
the non-ASCII characters. I was puzzled by the output (on hundreds of files)
for some time, because it printed a lot of lines with nothing visibly
highlighted. I debugged it for quite a while until I later discovered (in a
different context: adding a non-breaking space after an image to prevent an
implicit figure) that those were actually Unicode non-breaking spaces.
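
Since highlighting cannot make an invisible character visible, it may help to print the codepoint and Unicode name of each hit instead. A minimal sketch, assuming the text has already been read as UTF-8; `find_non_ascii` is a hypothetical helper name.

```python
import unicodedata


def find_non_ascii(text: str):
    """Return (line, column, codepoint, name) for every non-ASCII character."""
    hits = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 0x7F:
                name = unicodedata.name(ch, "UNKNOWN")
                hits.append((lineno, col, f"U+{ord(ch):04X}", name))
    return hits


for hit in find_non_ascii("![img](/url/of/image.png)\u00a0\nplain line"):
    print(hit)
```

A non-breaking space then shows up as `U+00A0 NO-BREAK SPACE` rather than as an apparently empty match.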

Now that this is “discovered”, it will be easy to write a script to regex
it back. (I put it under the unSmartyPants.sh
<https://github.com/ickc/markdown-variants/blob/master/bin/unSmartyPants.sh>
umbrella, although technically it has nothing to do with SmartyPants.)

After @jgm mentioned --ascii, it got me thinking whether this option could
be expanded to cover other output formats (currently HTML only), e.g.
markdown (using “native” ASCII sequences before falling back to HTML
entities), or even LaTeX.

John MacFarlane
2016-10-29 06:19:05 UTC
Permalink
Post by Kolen Cheung
After @jgm mentioned --ascii, it got me thinking if the function of
this option could be expanded to cover other output (currently HTML
output only), e.g. markdown (that use “native” ASCII before using HTML
ASCII); or even LaTeX.
Supporting --ascii in Markdown would be feasible, since we
can use entities. (This is ugly, though!)

Supporting it in LaTeX would not in general be possible.
For some western european accented characters there are
TeX control sequences, like \"a, but in general there's
no way to get ascii equivalents of arbitrary unicode
characters.
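
The limitation can be illustrated with a small Python sketch. It is only an illustration, not how pandoc's LaTeX writer works: the accent table is a hand-picked assumption covering five combining marks, and `to_tex_ascii` is a hypothetical name. A character gets an ASCII form only if it canonically decomposes into an ASCII letter plus one of those marks; anything else returns `None`.

```python
import unicodedata

# Combining mark -> TeX accent command character (a deliberately tiny table).
TEX_ACCENTS = {
    "\u0300": "`",  # grave
    "\u0301": "'",  # acute
    "\u0302": "^",  # circumflex
    "\u0303": "~",  # tilde
    "\u0308": '"',  # diaeresis
}


def to_tex_ascii(ch: str):
    """Return an ASCII TeX control sequence for `ch`, or None if there is none."""
    if ord(ch) < 0x80:
        return ch
    decomposed = unicodedata.normalize("NFD", ch)
    if (
        len(decomposed) == 2
        and ord(decomposed[0]) < 0x80
        and decomposed[1] in TEX_ACCENTS
    ):
        return "\\" + TEX_ACCENTS[decomposed[1]] + "{" + decomposed[0] + "}"
    return None


for ch in ["\u00e4", "\u00e8", "\u4e2d"]:
    print(repr(ch), "->", to_tex_ascii(ch))
```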
Kolen Cheung
2016-10-29 06:53:13 UTC
Permalink
But actually what I want to do is not exactly “entities” in markdown.
Essentially I’m talking about “un-SmartyPants”, so that the markdown writer
will use the character sequence *before* SmartyPants, i.e. non-breaking
space is `\ `, em-dash is ---. So it would be best described as --unsmart vs.
--smart, and is not really related to --ascii.
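
Until such an option exists, the mapping can be approximated outside pandoc. The following is a rough sketch, not a pandoc feature: `unsmart` and its table are hypothetical, it covers only dashes, ellipsis and curly quotes, and it rewrites code blocks too, which may not be wanted.

```python
# Typographic character -> the ASCII sequence a "smart" reader would accept.
UNSMART = {
    "\u2014": "---",  # em-dash
    "\u2013": "--",   # en-dash
    "\u2026": "...",  # ellipsis
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
}

_TABLE = str.maketrans(UNSMART)


def unsmart(text: str) -> str:
    """Replace typographic punctuation with its plain ASCII source form."""
    return text.translate(_TABLE)


print(unsmart("\u201cdon\u2019t\u201d \u2014 1\u20132\u2026"))
```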

I also agree that leaving “entities” in the markdown source is ugly. I am
rethinking my strategy in my project to relax that requirement. Right now
I’m using [^™[:ascii:]] to check for “illegal” characters. And by the way,
interestingly, grep would regard Ampère and Schrödinger as [:ascii:]. Only
[^\x00-\x7F] would detect them.

Kolen Cheung
2016-10-29 07:01:58 UTC
Permalink
Ah, sorry, I might have mixed up the issue here. I said it before but mixed
it up again: a non-breaking space written as a backslash escape has nothing
to do with SmartyPants. But you get the idea: some special sequences stand
for a special character in markdown, and whenever possible, characters
should be written that way. I think this way of thinking is not really about
markdown (e.g. how easy it is to read) but about source code (enforcing a
style). It would be great if such an --unsmart or --ascii option for the
markdown writer were created, but I’m OK with settling for the regex I used,
since it is a very simple one-to-one correspondence.

Kolen Cheung
2016-10-28 03:45:29 UTC
Permalink
Another case I found puzzling in how pandoc handles special characters is

illegal
>block
>in
>pandoc

pandoc -s -o test-pandoc.md test.md will output:

illegal &gt;block &gt;in &gt;pandoc

The focus here is not the illegal block quote, but that > becomes &gt;. It
is the least expected result, considering how pandoc turns &trade; into ™.
It seems either > or \> would be more reasonable.
John MacFarlane
2016-10-28 08:25:05 UTC
Permalink
There are a few characters that must always be escaped as
entities in HTML: <, >, &, ".

For others we use UTF-8.
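
Python's standard `html.escape` follows the same convention, which makes the rule easy to see. This only illustrates the HTML escaping rule; it says nothing about pandoc's internals.

```python
import html

sample = 'illegal >block & "quotes" \u2122'

# The markup-significant characters become entities; the trademark sign,
# which has no special meaning in HTML, is passed through unchanged.
escaped = html.escape(sample)
print(escaped)
```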
Kolen Cheung
2016-10-29 09:33:20 UTC
Permalink
I just encountered a somewhat related problem:

I’m writing a README for the project, having a line like this:

End shortform in non-breaking space, like this: `e.g.\ `, `i.e.\ `. The backslash escaped space, `\ `...

However, I found that the space got eaten, and found this behavior
documented in the manual:

(The spaces after the opening backticks and before the closing backticks
will be ignored.)

Is there any way to get around this? E.g. I notice that ending with a space
in a code block is OK.

Code used to test:

# printf "%s\n\n" '`e.g.\ `' | pandoc -f markdown -t native
[Para [Code ("",[],[]) "e.g.\\"]]
# printf "%s\n\n" '``e.g.\ ``' | pandoc -f markdown -t native
[Para [Code ("",[],[]) "e.g.\\"]]
# printf "%s\n\n" '```e.g.\ ```' | pandoc -f markdown -t native
[Para [Code ("",[],[]) "e.g.\\"]]
# printf "%s\n\n" '    end with a space \ ' | pandoc -f markdown -t native
[CodeBlock ("",[],[]) "end with a space \\ "]
# printf "%s\n\n" '```' 'end with a space \ ' '```' | pandoc -f markdown -t native
[CodeBlock ("",[],[]) "\nend with a space \\ \n"]
John MacFarlane
2016-10-29 18:54:45 UTC
Permalink
This space collapsing is unfortunately part of the Markdown
syntax description. There's no way around it that I can
think of.
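
The rule quoted from the manual can be mimicked with a toy model. This is not pandoc's parser, only a simplified regex sketch of "spaces after the opening backticks and before the closing backticks are ignored"; `code_span_contents` is a hypothetical name and edge cases such as unbalanced backtick runs are not handled.

```python
import re

# A run of N backticks, some content, then a run of exactly N backticks.
_CODE_SPAN = re.compile(r"(`+)(.+?)(?<!`)\1(?!`)", re.S)


def code_span_contents(markdown: str):
    """Return the contents of inline code spans with edge spaces stripped."""
    return [m.group(2).strip(" ") for m in _CODE_SPAN.finditer(markdown)]


# The trailing space of the backslash-space example is lost in every variant.
print(code_span_contents("`e.g.\\ `"))
print(code_span_contents("``e.g.\\ ``"))
```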
Kolen Cheung
2016-10-29 19:27:45 UTC
Permalink
Thanks. I will just lengthen my example so that it doesn’t end on a space.

Going back to the point of using pandoc from markdown to markdown to
enforce a style: it is a prerequisite for applying pandoc filters to the
source. For example, I am updating my project to use
--top-level-division=part. For this I need to increment the header levels
by 1. As suggested in Pandoc - Scripting with pandoc
<http://pandoc.org/scripting.html>, using regex could mess something else
up. But in order to write and use a pandoc filter for this task, one needs
to make sure the read/write cycle doesn’t change anything else.
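
For concreteness, such a filter could look roughly like the sketch below. It works directly on pandoc's JSON AST using only the standard library, and assumes the usual JSON encoding of a header, `{"t": "Header", "c": [level, attr, inlines]}`. The function name `shift_headers` is made up for this example.

```python
import json
import sys


def shift_headers(node, by=1):
    """Return a copy of the JSON AST with every Header level increased by `by`."""
    if isinstance(node, list):
        return [shift_headers(item, by) for item in node]
    if isinstance(node, dict):
        if node.get("t") == "Header":
            level, attr, inlines = node["c"]
            return {"t": "Header", "c": [level + by, attr, inlines]}
        return {key: shift_headers(value, by) for key, value in node.items()}
    return node


def main():
    # A JSON filter reads the AST on stdin and writes the modified AST to stdout.
    json.dump(shift_headers(json.load(sys.stdin)), sys.stdout)


# pandoc passes the output format as the first argument when it runs a filter,
# so this only triggers when the script is actually used as one.
if __name__ == "__main__" and len(sys.argv) > 1:
    main()
```

Saved as, say, `shift_headers.py` and made executable, it would be run with something like `pandoc -f markdown -t markdown --filter ./shift_headers.py`.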

As I said earlier, expecting the read/write cycle to be an identity is
unreasonable, but as long as it is idempotent it would make this use
possible. There only needs to be one commit that uses a read/write cycle to
enforce a style; then the said hypothetical filter can be applied to change
the header levels only, making the commits a lot cleaner.

So, in addition to serving control freaks like me who want to enforce a
style on the source, being able to do that also has other real-world use
cases.

Melroch
2016-10-29 21:03:05 UTC
Permalink
Incrementing the level of all headers with a filter would be easy. Note,
however, that Pandoc apparently is OK with headings of level 7 and higher. I
ran into that when converting a very long and detailed outliner document
from OPML to Markdown. Since that document didn't contain any ambiguous
lines I could use regex -- even a perl one-liner -- to convert too-deep
headings to bullet lists, although I managed to break even LaTeX's limit on
list nesting depth in a few places!

You can of course have a commit hook which has pandoc convert your
documents from markdown to markdown, running any number of filters. I will
certainly try it now that I've thought of it (and I hope that the Markdown
writer will soon be able to output bracketed spans! :-)
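
A hook along those lines might be sketched as follows. Everything here is an assumption made for illustration: the helper names, the choice to rewrite files in place, and the idea that the hook is handed a list of Markdown paths. Only the pandoc options `-f`, `-t` and `--filter` are real.

```python
import subprocess


def pandoc_cmd(path, filters=()):
    """Build a markdown-to-markdown pandoc command line for one file."""
    cmd = ["pandoc", "-f", "markdown", "-t", "markdown"]
    for filter_path in filters:
        cmd += ["--filter", filter_path]
    return cmd + [path]


def normalize_file(path, filters=()):
    """Run pandoc on `path` and overwrite the file with the normalized output."""
    result = subprocess.run(
        pandoc_cmd(path, filters),
        check=True,
        capture_output=True,
        encoding="utf-8",
    )
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(result.stdout)


print(pandoc_cmd("workbook.md", ["./shift_headers.py"]))
```

A pre-commit script would call `normalize_file` for each staged Markdown file and then re-stage it.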
Kolen Cheung
2016-10-29 21:44:12 UTC
Permalink
I'm not talking about the possibility (or ease) of writing such a filter, but about the practicality of using one (that outputs markdown and overwrites the source) if the markdown source hasn't been "standardized" by the markdown writer. So if we dismiss the usefulness of pandoc as a markdown styling tool, we also dismiss the usefulness of pandoc's parser and filter system for acting on the source.

The example I gave is supposed to be easy in both cases: regex (on the source) or filter (on the AST). But (one of) the very purposes of a filter is to address the shortcomings of not having a parser.
John MacFarlane
2016-10-30 09:42:23 UTC
Permalink
Achieving idempotency (write markdown == write markdown ->
read markdown -> write markdown) is actually not so easy.
Partly this is because unless we escape things VERY
aggressively, regular characters can turn into syntax.

In cmark (commonmark library), I worked hard at this and
managed to get idempotence for all the test cases in the
test suite (except for a few special cases). But
pandoc's writer is not so close.

We can make steps towards this, but a guarantee is going
to be very tough. I tried adding a QuickCheck property for
this, and each time I ran it, it falsified idempotency on
the first try...
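
A rough way to probe this from the command line side is to run the Markdown-to-Markdown cycle twice and compare the two outputs. The sketch below assumes pandoc is on the PATH; the helper names `pandoc_md_to_md` and `is_stable_after_one_pass` are made up for illustration. Note that it starts from Markdown text rather than from an arbitrary AST, so it only checks that the output of one pass is a fixed point of the next pass, which is weaker than the property stated above.

```python
import subprocess


def pandoc_md_to_md(text: str) -> str:
    """One read/write cycle: Markdown in, Markdown out (needs pandoc on PATH)."""
    result = subprocess.run(
        ["pandoc", "-f", "markdown", "-t", "markdown"],
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def is_stable_after_one_pass(text: str, convert=pandoc_md_to_md) -> bool:
    """Return True if a second pass leaves the first pass's output unchanged."""
    once = convert(text)
    twice = convert(once)
    return once == twice
```

The `convert` parameter is there so the check can be tried with any reader/writer pair, or with a stand-in function when pandoc is not installed.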
Kolen Cheung
2016-10-31 01:50:37 UTC
Permalink
Very interesting to know these!

After thinking more about it, I can see that it is complicated. (@jgm
probably understands all of this and may have thought about it already;
this serves as a note to myself and to anyone else who is interested.)

Note: inline math is used; in the Google Groups web view it is rendered
correctly.

Let $A$ be the set of all valid ASTs, $M$ be the set of all valid Markdown,
$f: A \rightarrow M$ be the Markdown writer, and $g: M \rightarrow A$ be
the Markdown reader.

Criterion 1 (write markdown == write markdown -> read markdown -> write
markdown) can be written as $\forall a \in A, f(a) = f\circ g\circ f(a)$,
which is equivalent to $\forall m\in f(A), m=f\circ g(m)$.

Criterion 2: idempotence of $f\circ g$ means $\forall m \in M, (f\circ
g)\circ(f\circ g)(m)=f\circ g(m)$. Since $g$ might not be surjective,
criterion 1 implies criterion 2 and is stronger than it.

Criterion 3: Ideally, $f$ is one-to-one and $g$ is surjective. In a sense
this means the Markdown writer does not lose information and $g$ can
recover all of that information. It also means that every feature supported
by the AST has a Markdown representation. Given that people write in
Markdown rather than in the AST, this is a desirable property. In this
case the criterion is $\forall a \in A, a = g\circ f(a)$, i.e. $g\circ
f=I$. However, Markdown syntax doesn’t (yet) allow this: e.g. a space at
the end of verbatim is eaten, and none of the native table syntaxes
supports all features/properties of the internal AST. The requirement is
too strong.

Criterion 4: So, to relax criterion 3 to a practical level: $\forall a \in
g(M), a = g\circ f(a)$, i.e. every AST that could be generated by the
Markdown reader fulfills the one-to-one requirement on the Markdown writer.
This is equivalent to $\forall m \in M, g(m) = g\circ f \circ g(m)$, which
implies criterion 2 and is the “opposite” of criterion 1.

To summarize, criterion 3 is the strongest (implying all other criteria)
but impossible. Criterion 2 is the weakest (all the other criteria imply
it); it allows the reader/writer pair to be used as a Markdown styling
tool and hence, say, allows using the pandoc filter system to act on the
source (after fixing the styling). Criterion 1 is important when the AST is
obtained from somewhere else, e.g. the docx reader. Criterion 4 is
important to guarantee the correctness of a Markdown writer that uses
Markdown to represent the information in the AST (as far as Markdown syntax
allows). So criteria 1 & 4 should be “the goal”. To summarize it in one
statement (and generalize to any format): $\forall i,j, \forall b_i \in
f_{ji}(B_j), b_i=f_{ji}\circ f_{ij}(b_i)$, where $i,j$ run through the
formats and $f_{ij}$ maps from the $i$-format to the $j$-format, i.e.
every reader & writer pair forms an identity when constrained to a subset.
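
The two implications used in this summary (criterion 1 implies 2, and criterion 4 implies 2) each take only one step from the definitions of $f$ and $g$. The following is just a sketch of that argument, written out for reference:

```latex
% Criterion 1 => 2: for any m in M, f(g(m)) lies in f(A),
% so criterion 1 can be applied to it.
\forall m \in M:\quad m' := f\circ g(m) \in f(A)
  \;\Rightarrow\; m' = f\circ g(m')
  \;\Rightarrow\; f\circ g(m) = (f\circ g)\circ(f\circ g)(m).

% Criterion 4 => 2: apply f to both sides of g(m) = g . f . g(m).
\forall m \in M:\quad g(m) = g\circ f\circ g(m)
  \;\Rightarrow\; f\circ g(m) = f\circ g\circ f\circ g(m)
                               = (f\circ g)\circ(f\circ g)(m).
```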

After thinking about all this, the criteria have become more important to
me than what I originally asked for (a Markdown styler): they guarantee
that the output from the AST won’t be misinterpreted. And now I can see why
it is difficult: at least in the case of Markdown output, the target format
is “not very well defined”. I remember @jgm mentioning somewhere that any
string is valid Markdown. This is probably what is behind the statement
“unless we escape things VERY aggressively, regular characters can turn
into syntax”.

I guess one way to help solve the problem is this (for the time being
concerning the Markdown-AST pair only): say we have an experimental
command-line option, --safe. Immediately after the Markdown writer writes
the Markdown, it reads it back into an AST and calculates the diff. If
there’s a diff, it starts escaping more aggressively until identity is
reached. It might even calculate the diff on smaller subsets of the
document to hunt down the “trouble-maker”. (It is easier said than done,
and can be much slower in corner cases, hence the “experimental
command-line option”.)
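
To make the idea concrete, here is a minimal sketch of that loop. Everything in it is hypothetical: pandoc has no such --safe option, and `write(ast, level)`, `read(text)` and the notion of numbered "escaping levels" are assumptions made only for illustration. The toy reader and writer treat the AST as a plain string in which `*` is the only syntax character.

```python
from typing import Callable, Sequence, TypeVar

Ast = TypeVar("Ast")


def safe_write(
    ast: Ast,
    write: Callable[[Ast, int], str],
    read: Callable[[str], Ast],
    levels: Sequence[int] = (0, 1, 2),
) -> str:
    """Try each escaping level in turn; return the first output that reads back as `ast`."""
    for level in levels:
        text = write(ast, level)
        if read(text) == ast:  # empty "diff": the round trip is an identity
            return text
    raise ValueError("no escaping level round-trips this AST")


# Toy stand-ins (not pandoc): "*" is syntax unless written as "\*".
def toy_write(ast: str, level: int) -> str:
    return ast.replace("*", "\\*") if level >= 1 else ast


def toy_read(text: str) -> str:
    # Protect escaped stars, drop unescaped ones as "syntax", then restore.
    return text.replace("\\*", "\0").replace("*", "").replace("\0", "*")
```

With these stand-ins, `safe_write("abc", toy_write, toy_read)` succeeds at level 0, while `safe_write("2*3", toy_write, toy_read)` only succeeds once the star is escaped at level 1.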
​
Kolen Cheung
2016-11-30 02:46:31 UTC
Permalink
When I was testing some filters I wrote, I found it is actually not very
difficult to achieve a looser condition: $P^3 = P^2$. My tests aren’t very
complicated though. But it seems reasonable to me to expect something like
$\exists n, \forall x, \forall m > 0, P^{n+m} x = P^n x$, because if you
allow enough iterations, eventually things will die down. (If it doesn’t
converge, then I guess the reader-writer pair should be “fixed”. It would
be interesting if no such $n$ exists but it can get arbitrarily large,
though. In practice I hope $n=2$.)

If it truly works (for some $n$), then the benefits are:

1. Automated tests: no matter what has changed and needs testing, the
   bottom line is to satisfy the “weaker idempotence” requirement. (To be
   fancy, imagine discovering bugs by crawling random documents across the
   internet and feeding them into this test.)
2. A corollary is that $P^n$ will be idempotent, which is useful for:
   1. pandoc as a “linter”;
   2. as discussed in the last post, ideally all reader-writer pairs should
      be idempotent, so this automatically gives us something like a --safe
      option that is slower but guaranteed to be idempotent.
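
The condition above can be tested mechanically by applying a pass repeatedly until the output stops changing. This is only a sketch: `passes_until_fixed_point` is a name invented here, `p` stands for one full read/write pass (for instance a function that shells out to pandoc), and the cap of 10 passes is an arbitrary assumption.

```python
from typing import Callable, Optional


def passes_until_fixed_point(
    p: Callable[[str], str], x: str, max_passes: int = 10
) -> Optional[int]:
    """Return the smallest n <= max_passes with p(p^n(x)) == p^n(x), else None."""
    current = x
    for n in range(max_passes + 1):
        nxt = p(current)
        if nxt == current:
            return n  # p^(n+m)(x) == p^n(x) for every m > 0 from here on
        current = nxt
    return None  # did not settle within the cap: a candidate "runaway" input
```

A return value of `None` does not prove divergence, it only flags inputs worth a closer look.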

​
Kolen Cheung
2016-11-30 03:05:06 UTC
Permalink
Oops… just as I was hopeful, I just found a runaway situation with this
example:

Both equations give the same result, and you may choose whichever is more convenient for a given problem. *F\~ $\perp$ ~*means\\\\\\\\\\\\ the\\\\\\\\\\\\ component\\\\\\\\\\\\ of\\\\\\\\\\\\ $F$\\\\\\\\\\\\ perpendicular\\\\\\\\\\\\ to\\\\\\\\\\\\ $r$,\\\\\\\\\\\\ while*r~ $\perp$ \~* means the component of $r$ perpendicular to $F_.$

This came from an erroneous conversion from .doc to .docx to .md. But
basically, if you apply pandoc -f markdown -t markdown to it, the long
runs of escape sequences \\\\... keep getting longer. (The “source” has
such long sequences of \\\... perhaps because I use pandoc as a linter
from time to time and didn’t look too carefully at this file.)
​
Sergio Correia
2016-11-30 04:18:50 UTC
Permalink
TBH it kinda feels like a bug in Pandoc's markdown writer:

example.md:

*a cat*
~a cat~
*~a cat~*
~*a cat*~

pandoc example.md --to=native

[Para [Emph [Str "a",Space,Str "cat"]]
,Para [Str "~a",Space,Str "cat~"]
,Para [Emph [Str "~a",Space,Str "cat~"]]
,Para [Subscript [Emph [Str "a",Space,Str "cat"]]]]

So if instead of writing `~a cat~` you write `~*a cat*~`, pandoc recognizes
the text as subscript (same as if you just type `~cat~`). So far so good.

However, when writing to Markdown, somehow two backslashes get added (maybe
for escaping reasons?). This is the part that looks like a bug, because two
consecutive backslashes mean that a literal backslash will be produced as
output.

This is even easier to see in HTML:

example.html:

<p><sub><em>a cat</em></sub></p>

pandoc example.html --to=markdown

~*a\\ cat*~

Or even worse:

pandoc example.html --to=markdown | pandoc --to=html

<p><sub><em>a\ cat</em></sub></p>
John MacFarlane
2016-11-30 11:59:16 UTC
Permalink
I just tried your example with the current dev version
of pandoc and could not reproduce:

% pandoc -f html -t markdown
<p><sub><em>a cat</em></sub></p>
~*a\ cat*~

But I could reproduce it with 1.18 release, so this bug
has already been fixed.
John MacFarlane
2016-11-30 11:59:45 UTC
Permalink
It's bug #3225 by the way.
Sergio Correia
2016-11-30 14:29:37 UTC
Permalink
Great, thanks!
Post by John MacFarlane
It's bug #3225 by the way.
John MacFarlane
2016-11-30 11:52:07 UTC
Permalink
Could you come up with a more minimal example to isolate the
problem?
BP Jonsson
2016-11-08 07:25:26 UTC
Permalink
Sorry for awakening an old thread, but this would actually be a use case
for the Unicode non-breaking space, wouldn't it? Perhaps you could use some
visible character which you don't actually use inside code and have a
filter substitute to nbspace. To fool the Markdown parser I mean.

Thinking of that: when you use regex to 'unsmart' your source you probably
would like to leave code alone. There is an idiom for that which works at
least in Perl:

````perl
my %unsmart_chars_for = ( '“' => '"', '”' => '"', ... );
$text =~ s{(([\`]+).+?\2)|([“”...])}{ $1 || $unsmart_chars_for{$3} }egs;
````

First you set up an associative array with the 'smart' chars as keys and
their unsmart equivalents as values. The idea is that you use a regex with
an alternation which captures substrings you don't want to change before
the stuff you want to change. If you get a match on a substring you don't
want to change you just put it back in, while if you get a match on a
substring you want to change/replace you put in the changed
substring/replacement as usual. Thus the regex captures the opening
backticks of code into $2 aka \2 and the whole code span, up to and
including the closing backticks, into $1 (note the non-greedy
quantifier!), or a smart char into $3. In the replacement: if there was a
match for code, $1 is non-empty/true so you just put the code back in. If
there wasn't a code match you got a smart char in $3, so you use it as the
key into the associative array to retrieve the unsmart equivalent. Finally
the s modifier makes dot match newlines too, so that $1 captures code
blocks too. You should be able to do this in Python by using a replacement
function with re.sub(): http://stackoverflow.com/a/12597709
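
For what it's worth, a Python version of the same idiom might look like the sketch below. The `UNSMART` table and the `unsmart` function name are illustrative choices, not taken from any existing tool, and the table only lists a few quote characters. As with the Perl version, code is recognised purely by matching runs of backticks, so indented code blocks are not protected.

```python
import re

# Hypothetical mapping; extend it with whatever "smart" characters you use.
UNSMART = {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"}

# Alternative 1: a run of backticks, then the shortest text up to the same run.
# Alternative 2: any single "smart" character from the mapping.
PATTERN = re.compile(
    r"((`+).+?\2)|([" + "".join(map(re.escape, UNSMART)) + r"])",
    re.DOTALL,  # like Perl's /s: "." also matches newlines
)


def unsmart(text: str) -> str:
    """Replace smart characters everywhere except inside backtick-delimited code."""

    def replace(match):
        if match.group(1) is not None:  # matched code: put it back unchanged
            return match.group(1)
        return UNSMART[match.group(3)]  # matched a smart char: look it up

    return PATTERN.sub(replace, text)
```

Checking `match.group(1) is not None` rather than relying on truthiness mirrors the `$1 || ...` test in the Perl one-liner while avoiding surprises with falsy matches.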
John MacFarlane
2016-10-29 18:55:34 UTC
Permalink
Some relevant discussion here:
https://talk.commonmark.org/t/leading-and-trailing-white-spaces-in-code-blocks/628/3
Post by Kolen Cheung
End shortform in non-breaking space, like this: `e.g.\ `, `i.e.\ `. The
backslash escaped space, `\ `...
However I found the space got eaten, and found this in the documentation:
(The spaces after the opening backticks and before the closing
backticks will be ignored.)
Is there any way to get around this? e.g. I notice that ending with a
space in a code block is ok.
# printf "%s\n\n" '`e.g.\ `' | pandoc -f markdown -t native
[Para [Code ("",[],[]) "e.g.\\"]]
# printf "%s\n\n" '``e.g.\ ``' | pandoc -f markdown -t native
[Para [Code ("",[],[]) "e.g.\\"]]
# printf "%s\n\n" '```e.g.\ ```' | pandoc -f markdown -t native
[Para [Code ("",[],[]) "e.g.\\"]]
# printf "%s\n\n" '    end with a space \ ' | pandoc -f markdown -t native
[CodeBlock ("",[],[]) "end with a space \\ "]
# printf "%s\n\n" '```' 'end with a space \ ' '```' | pandoc -f markdown -t native
[CodeBlock ("",[],[]) "\nend with a space \\ \n"]
Jesse Rosenthal
2016-10-27 19:10:34 UTC
Permalink
Post by Kolen Cheung
I found that the pandoc markdown writer will use unicode output if
possible. Not only in the case like the trademark symbol `&trade;`,
but also when non-breaking space is used. The markdown writer will
output a unicode non-breaking space character rather than `\ `. Since
the latter is more markdown-ish and is the recommended way of typing
non-breaking space in the manual, it seems the markdown writer should
use that instead.
I prefer to keep my markdown files with ascii dashes, quotes, etc. So I
have a filter called `dumbDowner` that I run on files that I'm
converting to markdown. It doesn't replace a nbsp with a '\ ', but that
would certainly be possible. See below (it's in haskell, but I hope that
what it does should be pretty clear):

~~~{.haskell}
import Text.Pandoc.JSON


-- convert unicode chars into their dumb versions
dumbDownChar :: Char -> String
dumbDownChar '\160' = " "
dumbDownChar '\8211' = "--"
dumbDownChar '\8212' = "---"
dumbDownChar '\8230' = "..."
dumbDownChar '\8216' = "'"
dumbDownChar '\8217' = "'"
dumbDownChar '\8220' = "\""
dumbDownChar '\8221' = "\""
dumbDownChar c = [c]

-- convert an inline into a list of dumb inlines
dumbDown' :: Inline -> [Inline]
dumbDown' (Str cs) = [Str $ concatMap dumbDownChar cs]
dumbDown' (Quoted SingleQuote ils) = [Str "'"] ++ ils ++ [Str "'"]
dumbDown' (Quoted DoubleQuote ils) = [Str "\""] ++ ils ++ [Str "\""]
dumbDown' il = [il]

-- do the conversion if it's going out to a lightweight markup format;
-- any other format (or no format at all) passes through unchanged.
dumbDown :: Maybe Format -> Inline -> [Inline]
dumbDown (Just (Format f)) il
  | f `elem` ["markdown", "plain", "textile", "org", "rst"] = dumbDown' il
dumbDown _ il = [il]

main :: IO ()
main = toJSONFilter dumbDown
~~~
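The nonbreaking-space variant mentioned above could be sketched roughly as follows. This is not part of `dumbDowner`: it is a minimal sketch that works on already-generated markdown text rather than on the pandoc AST, it uses only `base`, and the name `escapeNbsp` is made up for illustration.

```haskell
-- Hypothetical helper (not from dumbDowner): rewrite each Unicode
-- nonbreaking space (U+00A0, i.e. '\160') as the "\ " escape.
escapeNbsp :: String -> String
escapeNbsp = concatMap go
  where
    go '\160' = "\\ "   -- a backslash followed by an ordinary space
    go c      = [c]
```

Wired up as `main = interact escapeNbsp` (assuming a UTF-8 locale), it could be run over the markdown writer's output. Because it works on plain text, it would also rewrite nonbreaking spaces inside code blocks, so a real filter would probably want to act on `Str` elements only.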
Kolen Cheung
2016-10-28 01:21:58 UTC
Permalink
Thanks! I have a similar script in ickc/markdown-variants/unSmartyPants.sh
<https://github.com/ickc/markdown-variants/blob/master/bin/unSmartyPants.sh>.
I use it to process files that have already been converted to markdown.

It seems yours would be very useful as a filter during the conversion
to markdown. Have you hosted it somewhere already? If not, I suggest hosting
it. I hope to better document available pandoc filters, and have started a
collaborative spreadsheet: pandoc-filters - Google Sheets
<https://docs.google.com/spreadsheets/d/1eqMwPyxT0rN3z_tXpsISGBys0QR25W0x-tYDRsFBKAE/edit#gid=0>.
Even very simple filters can serve as both a productivity tip and
an example of how to write filters.

By the way, may I ask why you use the unicode number but not the unicode
character itself?
Sergio Correia
2016-10-28 01:31:44 UTC
Permalink
I think a useful filter would be one that efficiently converts Unicode to
HTML entities (and LaTeX, etc.), and vice versa.

If you are interested, this issue
<https://github.com/mmechtley/pandoc-filter-test/issues/1> might be useful,
as well as this XML file
<http://stackoverflow.com/questions/2354067/map-between-latex-commands-and-unicode-points/2356160#2356160>
that maps between Unicode, HTML, and LaTeX.
Kolen Cheung
2016-10-28 01:53:06 UTC
Permalink
Interesting. I once wanted to type Greek in math using just Greek
characters. But relying on XeLaTeX and unicode-math seems too
restrictive (especially when one has no control over which LaTeX engine
is used). LaTeX 3 might offer hope, but only in the far future. If the filter you
referred to matures, it will be very helpful for decoupling the
requirements of the output from those of the source.

FYI, I have a short script that kind of does that in
ickc/markdown-variants/unicode-to-math.sh
<https://github.com/ickc/markdown-variants/blob/master/bin/unicode-to-math.sh>.
It is not designed to be general purpose, however, but for dealing with some
messy .md files converted from ancient .doc files (which were converted to docx by
Word first).

John MacFarlane
2016-10-28 08:23:39 UTC
Permalink
Post by Sergio Correia
I think a useful filter would be one that efficiently converts Unicode
to HTML entities (and LaTeX, etc.), and vice versa.
Do you know about the --ascii option to pandoc?

% pandoc --ascii
“hi\ there”
^D
<p>&#8220;hi&#160;there&#8221;</p>
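The effect on the text in the example above can be mimicked with a few lines of Haskell. This is only a sketch of the idea (every non-ASCII character becomes a decimal numeric character reference), not pandoc's implementation of `--ascii`; `asciiEntities` is a made-up name and only `base` is assumed.

```haskell
import Data.Char (ord)

-- Hypothetical helper (not pandoc's code): replace every non-ASCII
-- character with a decimal numeric character reference, leaving
-- ASCII characters untouched.
asciiEntities :: String -> String
asciiEntities = concatMap go
  where
    go c
      | ord c > 127 = "&#" ++ show (ord c) ++ ";"
      | otherwise   = [c]
```

In GHCi, `asciiEntities "\8220hi\160there\8221"` evaluates to `"&#8220;hi&#160;there&#8221;"`, which matches the paragraph content in the output above.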
Kolen Cheung
2017-02-07 22:36:16 UTC
Permalink
As for the un-SmartyPants conversion we discussed earlier, pandoc 2.0 will support it.
See https://github.com/jgm/pandoc/issues/3416#issuecomment-277425921

(Basically, `smart` becomes an extension and can be toggled off.)