Discussion:
En-dashes vs em-dashes revisited
Eric Shade
2011-12-31 00:18:11 UTC
The issue of using -- for an en-dash (when --smart is enabled) seems
not to have been conclusively resolved, so I'd like to reopen it.
Pandoc currently renders both -- and --- as an em-dash, and tries to
be smart about converting some uses of the hyphen to an en-dash.

I'm strongly in favor of using -- for an en-dash, and I'm not alone.

History favors the change. John Gruber, creator of Markdown, defined a
SmartyPants extension in 2004 that uses -- for an en-dash. TeX has
been using this convention since 1978.

Precision favors the change. Careful writers need to use both kinds of
dashes. Pandoc doesn't (and really can't) correctly convert *all* uses
of hyphens to en-dashes where appropriate. Pandoc handles numeric
ranges like 15-20 correctly, mishandles numeric dates like 2011-12-31
(where it incorrectly replaces the hyphens with en-dashes), and
mishandles phrases like "pro-Emacs-anti-Vim debate", where the second
hyphen should be an en-dash. Careful writers have no easy way around
this.

Coding theory favors the change. The -- and --- sequences are two of
only a handful that don't "look like markup", thus preserving the
spirit of Markdown. To use *both* of them to mean the same thing is
wasteful. Since we're all agreed that --- means an em-dash, that
leaves -- as the only natural choice for an en-dash.

I18n favors the change. Different languages have different conventions
about dashes, and some don't use the em-dash at all. Allowing writers
to choose for themselves solves the problem.

Time favors the change. The longer the wait to implement the change,
the larger the specter of "backwards compatibility" looms. (If that's
an issue, I suggest an --old-smart option that leaves -- as em-dash.)

The only serious argument *against* using -- as an en-dash seems to be
that some casual writers might be confused, because they'll write --
and expect an em-dash. I don't understand this argument. We're talking
about a group of people who (1) are sophisticated enough to know about
em-dashes, (2) are observant enough to see that the en-dash they
get from --, though longer than a hyphen, is not long enough to be a
proper em-dash, yet (3) have never heard of en-dashes. How large can
this group be? And is it larger than the group of people who *do* know
the difference and want natural ways to write both kinds of dashes? I
suspect that the Pandoc community is quite sophisticated on the whole,
because most casual writers would never leave the comfort of a WYSIWYG
editor.

(The argument that --- is too long and troublesome to type for an
em-dash is hard to take seriously.)
fiddlosopher
2011-12-31 02:09:29 UTC
I welcome these points; this is an issue I remain undecided about.
Post by Eric Shade
History favors the change. John Gruber, creator of Markdown, defined a
SmartyPants extension in 2004 that uses -- for an en-dash. TeX has
been using this convention since 1978.
Back in the days of typewriters (which alas I am old enough to
remember), a '--' was always used in manuscripts for (what would be
typeset as) an em-dash. Take a look at earlier editions of the Chicago
Manual of Style.
I think the weight of this history outweighs what John Gruber did in
2004. As a TeX user, I am myself quite comfortable using -- for
en-dashes and --- for em-dashes, but it still looks unnatural to me. I
believe that MS Word's "smart punctuation" also converts '--' to an
em-dash (at least it used to).
Post by Eric Shade
Precision favors the change. Careful writers need to use both kinds of
dashes. Pandoc doesn't (and really can't) correctly convert *all* uses
of hyphens to en-dashes where appropriate. Pandoc handles numeric
ranges like 15-20 correctly, mishandles numeric dates like 2011-12-31
(where it incorrectly replaces the hyphens with en-dashes), and
mishandles phrases like "pro-Emacs-anti-Vim debate", where the second
hyphen should be an en-dash. Careful writers have no easy way around
this.
Not true. You can always use a unicode en-dash or em-dash when
you want to be precise, and pandoc will preserve it. Most text
editors allow you to define easy ways to type these symbols.

John
Chris Lott
2011-12-31 02:32:36 UTC
Post by fiddlosopher
Back in the days of typewriters (which alas I am old enough to
remember), a '--' was always used in manuscripts for (what would be
typeset as) an em-dash. Take a look at earlier editions of the Chicago
Manual of Style.
I think the weight of this history outweighs what John Gruber did in
2004. As a TeX user, I am myself quite comfortable using -- for
en-dashes and --- for em-dashes, but it still looks unnatural to me.
I don't know that -- and --- look any more unnatural than _foo_ or
*foo* or pretty much any other markup when used in their plain text
form. It seems to me that -- and --- are an easy, readable solution
that doesn't require resorting to unicode dashes (which, to me,
detracts from the idea of a plain text markup's portability).

It also seems a little strange to say: ok, you get a hyphen or an
em-dash predictably, but you have to use unicode to get an en-dash.
Logically it would make more sense for the em-dash to be the odd-man
out. But, of course, none of them *have* to be.

But that's just me. I don't regularly use --- in plain text, but only
because it's meaningless right now.

c
--
You received this message because you are subscribed to the Google Groups "pandoc-discuss" group.
To post to this group, send email to pandoc-discuss-/JYPxA39Uh5TLH3MbocFF+G/***@public.gmane.org
To unsubscribe from this group, send email to pandoc-discuss+***@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/pandoc-discuss?hl=en.
John MacFarlane
2011-12-31 05:58:59 UTC
Post by Chris Lott
Post by fiddlosopher
Back in the days of typewriters (which alas I am old enough to
remember), a '--' was always used in manuscripts for (what would be
typeset as) an em-dash. Take a look at earlier editions of the Chicago
Manual of Style.
I think the weight of this history outweighs what John Gruber did in
2004. As a TeX user, I am myself quite comfortable using -- for
en-dashes and --- for em-dashes, but it still looks unnatural to me.
I don't know that -- and --- look any more unnatural than _foo_ or
*foo* or pretty much any other markup when used in their plain text
form. It seems to me that -- and --- are an easy, readable solution
that doesn't require resorting to unicode dashes (which, to me,
detracts from the idea of a plain text markup's portability).
It also seems a little strange to say: ok, you get a hyphen or an
em-dash predictably, but you have to use unicode to get an en-dash.
Logically it would make more sense for the em-dash to be the odd-man
out. But, of course, none of them *have* to be.
But that's just me. I don't regularly use --- in plain text, but only
because it's meaningless right now.
I'm starting to become convinced. I agree that the unicode dashes aren't a
great solution. Even if they're easy to write, they're not easy to read:
on my terminal, the em-dash is indistinguishable from the en-dash, and far too
narrow to "read" as an em-dash. I also agree that pandoc's current guessing
algorithm to distinguish en-dashes from hyphens isn't accurate enough, and I
don't see good prospects for improving it.

I do worry about backwards compatibility, but maybe adding an --old-dashes
flag or something like that would suffice.

Would anyone be upset if pandoc moved to a consistent policy of
treating --- as em-dash and -- as en-dash?
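
A minimal sketch of the consistent policy proposed here, written in Python purely for illustration. It is not pandoc's implementation (pandoc's parser is written in Haskell), and the name smart_dashes is made up for this example:

```python
EN_DASH = "\u2013"  # Unicode EN DASH
EM_DASH = "\u2014"  # Unicode EM DASH


def smart_dashes(text: str) -> str:
    """Map '---' to an em-dash and '--' to an en-dash.

    Single hyphens are never touched, so no guessing is involved.
    """
    # Replace the longer sequence first; otherwise '---' would be
    # consumed as '--' followed by a stray '-'.
    return text.replace("---", EM_DASH).replace("--", EN_DASH)
```

Under this sketch, "15--20" becomes a range with an en-dash, "yes---no" gets an em-dash, and a date such as "2011-12-31" keeps its plain hyphens because nothing is inferred from context.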

John
HansBKK
2011-12-31 12:45:43 UTC
As a newbie here I'm not presuming to "vote", but IMO Eric makes a pretty
compelling case - this seems the best solution to ensure clarity on the
"master source" side where it is of course most important. Backwards
compatibility is taken care of, as are those who don't know/care about the
difference between en and em dashes.
Eric Shade
2011-12-31 17:49:41 UTC
Assuming this change is made, only one dash is affected. I originally
suggested --old-smart to get the old behavior. John mentioned
--old-dashes, which is better, but implies that more than one dash is
affected. What about just --em-dash, which implies --smart? It has a
simple definition: "treat -- as an em-dash instead of the default
en-dash". It has no pejorative "new vs. old" connotations, it's
meaningful without knowing the history of Pandoc, and it's useful for
those who never use en-dashes and want a shorter way to write an
em-dash.
fiddlosopher
2011-12-31 19:17:44 UTC
Post by Eric Shade
Assuming this change is made, only one dash is affected. I originally
suggested --old-smart to get the old behavior. John mentioned
--old-dashes, which is better, but implies that more than one dash is
affected. What about just --em-dash, which implies --smart? It has a
simple definition: "treat -- as an em-dash instead of the default
en-dash". It has no pejorative "new vs. old" connotations, it's
meaningful without knowing the history of Pandoc, and it's useful for
those who never use en-dashes and want a shorter way to write an
em-dash.
Actually both dashes are affected. Activating the old behavior would
also require turning hyphens before numerals into en-dashes. So
--old-dashes is better, I think.
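
For illustration only, the old behavior as described in this thread ('--' and '---' both rendered as an em-dash, plus hyphens before numerals turned into en-dashes) can be approximated in a few lines of Python. The function name old_style_dashes and the regular expression are assumptions made for this sketch; pandoc's real parser may differ in detail:

```python
import re

EN_DASH = "\u2013"  # Unicode EN DASH
EM_DASH = "\u2014"  # Unicode EM DASH


def old_style_dashes(text: str) -> str:
    """Rough approximation of the old-style dash handling."""
    # Both '---' and '--' become an em-dash (longer sequence first).
    text = text.replace("---", EM_DASH).replace("--", EM_DASH)
    # Guess: a single hyphen immediately followed by a digit is an en-dash.
    return re.sub(r"-(?=\d)", EN_DASH, text)
```

With this guess, "15-20" comes out as a proper range, but the hyphens in "2011-12-31" are also converted, which is the mishandling of dates mentioned earlier in the thread.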
Eric Shade
2011-12-31 21:15:11 UTC
Actually both dashes are affected.  Activating the old behavior would
also require turning hyphens before numerals into en-dashes.  So
--old-dashes is better, I think.
Yes, you're right, assuming that the new default is that no hyphen-to-
en-dash guessing is performed. It's not clear from the discussion
whether that has been decided. The current scheme would be improved if
hyphens were only converted to en-dashes between exactly two numbers,
but not three or more. But there would still be some cases that are
impossible to disambiguate, though one could make good guesses. For
example, 867-1234 needs a hyphen if it's a phone number, but an en-
dash if it's a range; guessing that it's a phone number will be right
most of the time. And 12.4-8 needs a hyphen in "Exercise 12.4-8", but
an en-dash in "see sections 12.4-8"; checking whether the previous
word is plural will lead you to the right guess most of the time, but
that's language-dependent.
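
The "exactly two numbers" refinement could be sketched roughly as follows. This is a hypothetical Python illustration, not code from pandoc; the name guess_range and the pattern are assumptions:

```python
import re

EN_DASH = "\u2013"  # Unicode EN DASH

# One hyphen between two digit runs, where neither run is attached to a
# further digit or hyphen. Chains of three or more numbers, such as
# 2011-12-31, therefore never match.
_PAIR = re.compile(r"(?<![\d-])(\d+)-(\d+)(?![\d-])")


def guess_range(text: str) -> str:
    """Convert a hyphen to an en-dash only between exactly two numbers."""
    return _PAIR.sub(lambda m: m.group(1) + EN_DASH + m.group(2), text)
```

This leaves "2011-12-31" alone and converts "15-20", but it also converts "867-1234" and the "4-8" in "Exercise 12.4-8". Those are the ambiguous cases described above, which a purely local pattern cannot resolve.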

So is it better to improve the guessing algorithm so that it's right
most of the time, and require writers to use \- to force a hyphen or
-- to force an en-dash when it guesses wrong? Or is it better to
eliminate the guessing entirely? I don't have strong feelings either
way as long as the guessing algorithm is fairly accurate and clearly
documented, but I suppose it's simpler and clearer to do no guessing
at all.
John Haltiwanger
2011-12-31 21:38:58 UTC
Post by Eric Shade
Post by fiddlosopher
Actually both dashes are affected. Activating the old behavior would
also require turning hyphens before numerals into en-dashes. So
--old-dashes is better, I think.
Yes, you're right, assuming that the new default is that no hyphen-to-
en-dash guessing is performed. It's not clear from the discussion
whether that has been decided. The current scheme would be improved if
hyphens were only converted to en-dashes between exactly two numbers,
but not three or more. But there would still be some cases that are
impossible to disambiguate, though one could make good guesses. For
example, 867-1234 needs a hyphen if it's a phone number, but an en-
dash if it's a range; guessing that it's a phone number will be right
most of the time. And 12.4-8 needs a hyphen in "Exercise 12.4-8", but
an en-dash in "see sections 12.4-8"; checking whether the previous
word is plural will lead you to the right guess most of the time, but
that's language-dependent.
So is it better to improve the guessing algorithm so that it's right
most of the time, and require writers to use \- to force a hyphen or
-- to force an en-dash when it guesses wrong? Or is it better to
eliminate the guessing entirely? I don't have strong feelings either
way as long as the guessing algorithm is fairly accurate and clearly
documented, but I suppose it's simpler and clearer to do no guessing
at all.
Doesn't your previous paragraph make it plainly clear that the guessing
approach is overly complicated?
HansBKK
2012-01-01 09:53:26 UTC
Post by John Haltiwanger
Doesn't your previous paragraph make it plainly clear that the guessing
approach is overly complicated?
If we're now getting the ability for the user to have complete control over
these dashes, then IMO no guessing should take place except when explicitly
invoked, presumably only to be used for the sake of preserving backwards
compatibility.
Eric Shade
2012-01-01 19:14:12 UTC
It seems we agree that -- should mean en-dash, which is the main
issue.

A good guesser would indeed be complex, and can't ever be 100%
accurate. But one that's (say) 95% accurate might be useful. That's
the point I was trying (and failing) to make.

I did express a slight preference for eliminating the guesser on the
grounds of simplicity and clarity, so I have no complaint if that's
the consensus.
David Sanson
2011-12-31 21:51:57 UTC
I'd be happy with '--' for en-dashes only, and I value the greater predictability thereby gained.

I don't think --old-dashes should imply --smart. Something like

pandoc -f markdown -t markdown --old-dashes

should output the new standard '--' en-dash and '---' em-dash based on the old guessing algorithm, so we have an easy way to convert existing documents to the new standard.

David
fiddlosopher
2012-01-01 01:35:33 UTC
--old-smart or --legacy-smart seems the best name for the flag, since
that will make it clear that it is an alternative to --smart, rather
than an addition.
Post by David Sanson
I'd be happy with '--' for en-dashes only, and I value the greater predictability thereby gained.
I don't think --old-dashes should imply --smart. Something like
    pandoc -f markdown -t markdown --old-dashes
should output the new standard '--' en-dash and '---' em-dash based on the old guessing algorithm, so we have an easy way to convert existing documents to the new standard.
David
John MacFarlane
2012-01-01 22:17:02 UTC
Thanks, everyone, for helping me think through this. I've pushed the
changes to pandoc's dash parsing.

https://github.com/jgm/pandoc/commit/da8425598a8ab4a98388e8ee346a2ae7ec540aa0

This affects not just markdown, but HTML and RST with --smart. Textile
will continue to use the old dashes, because that is what the textile
spec requires.

Note: the --smart and --old-dashes options only affect parsing. The
markdown writer will print a unicode character for an em- or en-dash
(same as in 1.8.2.1). I think that is reasonable behavior, but it means
that you won't be able to convert a document from old-style dashes to
new-style ones using 'pandoc --old-dashes -S -f markdown -t markdown'.
You'll still be able to convert a document to one that renders properly
with the new --smart, but it will have unicode characters for the
dashes. If you want '---' and '--' instead, you should be able to
achieve this by piping the output of pandoc through a perl one-liner
that substitutes these for the unicode dashes. (For better precision,
you could also use the techniques described in "Scripting with pandoc,"
but the perl one-liner should work fine unless you've got unicode dashes
in code blocks.)
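
The message mentions a perl one-liner without giving it. As a hedged stand-in, the same substitution can be sketched in Python; the function name ascii_dashes is made up for this example:

```python
def ascii_dashes(text: str) -> str:
    """Replace Unicode em- and en-dashes with '---' and '--'."""
    # U+2014 is the em-dash, U+2013 the en-dash. The two replacements
    # cannot interfere, because each targets a distinct character.
    return text.replace("\u2014", "---").replace("\u2013", "--")


# Example: text as the markdown writer might emit it.
print(ascii_dashes("pages 15\u201320 \u2014 see below"))
```

To use this as a filter after pandoc, one could read standard input, pass it through ascii_dashes, and write the result to standard output. As noted above, a blanket substitution like this also rewrites any Unicode dashes that appear inside code blocks.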

John

Bruce
2012-01-01 18:52:23 UTC
On Dec 31 2011, 12:58 am, John MacFarlane wrote:

...
Post by John MacFarlane
Would anyone be upset if pandoc moved to a consistent policy of
treating --- as em-dash and -- as en-dash?
Not me; I support the change.

Bruce
Eric Shade
2011-12-31 06:01:48 UTC
Not true.  You can always use a unicode en-dash or em-dash when
you want to be precise, and pandoc will preserve it.  Most text
editors allow you to define easy ways to type these symbols.
Yes, but I suspect that many (if not most) Pandoc users write using a
general purpose text editor and a monospaced font. It's very hard to
tell the difference between a hyphen, an en-dash, and an em-dash in
most monospaced fonts, unless the symbols are adjacent (which they
never are in practice). And using the standard Menlo font on a Mac, an
en-dash and em-dash look *identical*. Asking writers to keep track of
subtly different (or even identical) dashes with different meanings
seems onerous when a better option exists.

(In a similar vein, I think the only big mistake Gruber made in
designing Markdown was to use two spaces at the end of a line to
indicate a line break, precisely because those spaces are invisible by
default in most editors. Many of us configure our editors to remove
trailing spaces automatically. Pandoc wisely added a trailing
backslash as an alternate, easily visible syntax.)
John Gabriele
2011-12-31 08:17:49 UTC
As a TeX user, I am myself quite comfortable using -- for en-dashes
and --- for em-dashes, but it still looks unnatural to me.
I agree, but I suspect that the simplicity you get from consistent
dashes makes it a worthwhile tradeoff.

Another thing to consider: if a Pandoc user is choosing to pass the
`--smart` option, then they very likely know about the differences
between the various dashes and so should have simple manual control
over which one they get in their output.

One more point: If I'm centrally setting the `--smart` option for
other writers (say, Pandoc is generating website pages from the
writers' plain text content which they email to me), it's easier to
explain consistent dashes to them rather than explain how Pandoc does
its guessing.

(BTW, aside: at <http://johnmacfarlane.net/pandoc/README.html>, in the
docs for the --smart option, there's a typo: s/ande/and/.)

---John