Merged
19 changes: 11 additions & 8 deletions spec/fixtures/gfm-extensions.txt
@@ -538,7 +538,7 @@ size (100) for parsing delimiters in inlines.c

## Autolinks

```````````````````````````````` example pending
```````````````````````````````` example autolink
: http://google.com https://google.com

<http://google.com/å> http://google.com/å
@@ -559,8 +559,6 @@ This is a mailto:scyther@pokemon.com

mailto:scyther@pokemon.com.

mmmmailto:scyther@pokemon.com

mailto:scyther@pokemon.com/

mailto:scyther@pokemon.com/message
@@ -587,7 +585,7 @@ Underscores not allowed in host name www.xxx._yyy.zzz

Underscores allowed in domain name www._xxx.yyy.zzz

**Autolink and http://inlines**
**Autolink and http://inlines.com**

![http://inline.com/image](http://inline.com/image)

@@ -613,7 +611,6 @@ http://🍄.ga/ http://x🍄.ga/
<p><a href="mailto:scyther@pokemon.com">mailto:scyther@pokemon.com</a></p>
<p>This is a <a href="mailto:scyther@pokemon.com">mailto:scyther@pokemon.com</a></p>
<p><a href="mailto:scyther@pokemon.com">mailto:scyther@pokemon.com</a>.</p>
<p>mmmmailto:<a href="mailto:scyther@pokemon.com">scyther@pokemon.com</a></p>
<p><a href="mailto:scyther@pokemon.com">mailto:scyther@pokemon.com</a>/</p>
<p><a href="mailto:scyther@pokemon.com">mailto:scyther@pokemon.com</a>/message</p>
<p><a href="mailto:scyther@pokemon.com">mailto:scyther@pokemon.com</a>/<a href="mailto:beedrill@pokemon.com">mailto:beedrill@pokemon.com</a></p>
@@ -627,7 +624,7 @@ http://🍄.ga/ http://x🍄.ga/
<p>Underscores not allowed in host name www.xxx.yyy._zzz</p>
<p>Underscores not allowed in host name www.xxx._yyy.zzz</p>
<p>Underscores allowed in domain name <a href="http://www._xxx.yyy.zzz">www._xxx.yyy.zzz</a></p>
<p><strong>Autolink and <a href="http://inlines">http://inlines</a></strong></p>
<p><strong>Autolink and <a href="http://inlines.com">http://inlines.com</a></strong></p>
<p><img src="http://inline.com/image" alt="http://inline.com/image" /></p>
<p><a href="mailto:a.w@b.c">a.w@b.c</a></p>
<p>Full stop outside parens shouldn't be included <a href="http://google.com/ok">http://google.com/ok</a>.</p>
@@ -638,6 +635,12 @@ http://🍄.ga/ http://x🍄.ga/
````````````````````````````````

```````````````````````````````` example pending
mmmmailto:scyther@pokemon.com
.
<p>mmmmailto:<a href="mailto:scyther@pokemon.com">scyther@pokemon.com</a></p>
````````````````````````````````

```````````````````````````````` example
This shouldn't crash everything: (_A_@_.A
.
<IGNORE>
@@ -800,7 +803,7 @@ Hello[^"><script>alert(1)</script>]

Autolink and strikethrough.

```````````````````````````````` example pending
```````````````````````````````` example autolink
~~www.google.com~~

~~http://google.com~~
@@ -811,7 +814,7 @@ Autolink and strikethrough.

Autolink and tables.

```````````````````````````````` example pending
```````````````````````````````` example autolink
| a | b |
| --- | --- |
| https://github.com www.github.com | http://pokemon.com |
2 changes: 2 additions & 0 deletions spec/spec_helper.cr
@@ -62,6 +62,8 @@ def assert_example(file, section, index, example, smart, gfm = false)
else
it "- #{index}\n#{show_space(markdown)}", file, line do
output = Markd.to_html(markdown, options)
next if html == "<IGNORE>\n"

output.should eq(html), file: file, line: line
end
end
126 changes: 85 additions & 41 deletions src/markd/parsers/inline.cr
@@ -1,4 +1,5 @@
require "html"
require "uri"

module Markd::Parser
class Inline
@@ -68,7 +69,13 @@ module Markd::Parser
when 'h'
# Catch http:// and https:// autolinks for GFM
# Do not match if it's <http:// ... because that was matched by '<'
if @options.gfm && @options.autolink && (@pos == 0 || char_at?(@pos - 1) != '<')
if @options.gfm && @options.autolink && (
@pos == 0 || (
# Do not match if it's <http:// ... because that was matched by '<'
char_at?(@pos - 1) != '<' &&
# Do not match ![http:// ... because that was matched by '!'
char_at?(@pos - 1) != '['
)
)
auto_link(node)
else
false
@@ -81,6 +88,20 @@ module Markd::Parser
else
false
end
when 'x'
# Catch xmpp: autolinks for GFM
if @options.gfm && @options.autolink && (@pos == 0 || char_at?(@pos - 1) != '<')
auto_link(node)
else
false
end
when 'm'
# Catch mailto: autolinks for GFM
if @options.gfm && @options.autolink && (@pos == 0 || char_at?(@pos - 1) != '<')
auto_link(node)
else
false
end
when '&'
entity(node)
when ':'
@@ -461,33 +482,50 @@ module Markd::Parser
elsif matched_text = match(Rule::AUTO_LINK)
node.append_child(link(matched_text, false))
return true
elsif @options.gfm && (matched_text = match(Rule::WWW_AUTO_LINK))
clean_text = autolink_cleanup(matched_text)
link = link(clean_text, false, true)
node.append_child(link)
node.append_child(text(matched_text[clean_text.size..])) if clean_text != matched_text
return true
elsif @options.gfm && (matched_text = match(Rule::PROTOCOL_AUTO_LINK))
clean_text = autolink_cleanup(matched_text)
link = link(clean_text, false, false)
node.append_child(link)
node.append_child(text(matched_text[clean_text.size..])) if clean_text != matched_text
return true
elsif @options.gfm && (matched_text = match(Rule::EXTENDED_EMAIL_AUTO_LINK))
# Emails that end in - or _ are declared not to be links by the spec:
#
# `.`, `-`, and `_` can occur on both sides of the `@`, but only `.` may occur at
# the end of the email address, in which case it will not be considered part of
# the address:
elsif @options.gfm && @options.autolink
# These are all the extended autolinks from the
# autolink extension

if matched_text = match(Rule::WWW_AUTO_LINK)
clean_text = autolink_cleanup(matched_text)
if clean_text.empty?
node.append_child(text(matched_text))
else
_, post = @text.split(clean_text, 2)
node.append_child(link(clean_text, false, true))
node.append_child(text(post)) if post.size > 0 && matched_text != clean_text
end
return true
elsif matched_text = (
match(Rule::PROTOCOL_AUTO_LINK) ||
match(Rule::XMPP_AUTO_LINK) ||
match(Rule::MAILTO_AUTO_LINK)
)
clean_text = autolink_cleanup(matched_text)
if clean_text.empty?
node.append_child(text(matched_text))
else
_, post = @text.split(clean_text, 2)
node.append_child(link(clean_text, false, false))
node.append_child(text(post)) if post.size > 0 && matched_text != clean_text
end
return true
elsif matched_text = match(Rule::EXTENDED_EMAIL_AUTO_LINK)
# Emails that end in - or _ are declared not to be links by the spec:
#
# `.`, `-`, and `_` can occur on both sides of the `@`, but only `.` may occur at
# the end of the email address, in which case it will not be considered part of
# the address:

# a.b-c_d@a.b_ => <p>a.b-c_d@a.b_</p>
# a.b-c_d@a.b_ => <p>a.b-c_d@a.b_</p>

if "-_".includes?(matched_text[-1])
node.append_child(text(matched_text))
else
node.append_child(link(matched_text, true, false))
if "-_".includes?(matched_text[-1])
node.append_child(text(matched_text))
else
node.append_child(link(matched_text, true, false))
end
return true
end
return true
end

false
@@ -924,14 +962,12 @@ module Markd::Parser

private def special_string?(full_text : String, pos : Int) : Int
text = full_text.byte_slice(pos)
if text.starts_with?("http://") || text.starts_with?("https://") || text.starts_with?("ftp://")
# All such recognized autolinks can only come at the beginning of
# a line, after whitespace, or any of the delimiting characters `*`, `_`, `~`,
# and `(`.
if pos > 0 && !("*_~( \n\t".includes? char_at(pos - 1))
return 0
end

# All such recognized autolinks can only come at the beginning of
# a line, after whitespace, or any of the delimiting characters `*`, `_`, `~`,
# and `(`.
if pos > 0 && !("*_~( \n\t".includes? char_at(pos - 1))
0
elsif text.starts_with?("http://") || text.starts_with?("https://") || text.starts_with?("ftp://")
# This should not be an autolink:
# < ftp://example.com >
if full_text[...pos].includes?("<") && full_text[...pos].matches?(/<\s*$/)
@@ -940,14 +976,10 @@ module Markd::Parser

m = autolink_cleanup(text.match(Rule::PROTOCOL_AUTO_LINK).to_s)
m.size
elsif text.starts_with?("www.") && text.matches?(Rule::WWW_AUTO_LINK)
m = autolink_cleanup(text.match(Rule::WWW_AUTO_LINK).to_s)
m.size
elsif text.includes?("@") && text.matches?(Rule::EXTENDED_EMAIL_AUTO_LINK)
# All such recognized autolinks can only come at the beginning of
# a line, after whitespace, or any of the delimiting characters `*`, `_`, `~`,
# and `(`.
if pos > 0 && !("*_~( \n\t".includes? char_at(pos - 1))
return 0
end

# m = autolink_cleanup(text.match(Rule::EMAIL_AUTO_LINK).to_s)
matched_text = text.match(Rule::EMAIL_AUTO_LINK).to_s

@@ -967,6 +999,7 @@ module Markd::Parser
# These cleanups are defined in the spec

private def autolink_cleanup(text : String) : String
return text if text.empty?
# When an autolink ends in `)`, we scan the entire autolink for the total number
# of parentheses. If there is a greater number of closing parentheses than
# opening ones, we don't consider the unmatched trailing parentheses part of the
@@ -978,7 +1011,7 @@ module Markd::Parser
# Trailing punctuation (specifically, `?`, `!`, `.`, `,`, `:`, `*`, `_`, and `~`)
# will not be considered part of the autolink, though they may be included in the
# interior of the link
while "?!.,:*~_".includes?(text[-1])
while "\"'?!.,:*~_".includes?(text[-1])
text = text[0..-2]
end

@@ -994,6 +1027,17 @@ module Markd::Parser
end
end

# If the autolink has a domain and the last component has a `_` then
# it's invalid.
if text.starts_with?("www.")
uri = URI.parse("http://#{text}")
else
uri = URI.parse(text)
end
if uri.host && !uri.host.to_s.match(Rule::VALID_DOMAIN_NAME)
text = ""
end

text
end

27 changes: 24 additions & 3 deletions src/markd/rule.cr
@@ -63,11 +63,32 @@ module Markd

LINK_DESTINATION_BRACES = Regex.new("^(?:[<](?:[^<>\\t\\n\\\\\\x00]|" + ESCAPED_CHAR_STRING + ")*[>])")

# A valid domain name is:
#
# segments of alphanumeric characters, underscores (_) and hyphens (-)
# separated by periods (.). There must be at least one period, and no
# underscores may be present in the last two segments of the domain.
#
# Alphanumeric characters in this context include emojis.
LAST_DOMAIN_SEGMENT = /(?:[a-zA-Z0-9\-\p{Emoji_Presentation}\-]+)/
OTHER_DOMAIN_SEGMENTS = /(?:[a-zA-Z0-9\p{Emoji_Presentation}\-_]+)/
# The spec wants to capture greedily, even invalid domain names and then
# reject the invalid ones later.
# For example: www.xxx._yyy.zzz is never linked because of the
# _ in the last segment.
DOMAIN_NAME = /(?:#{OTHER_DOMAIN_SEGMENTS}\.)*#{OTHER_DOMAIN_SEGMENTS}/
VALID_DOMAIN_NAME = /^(?:#{OTHER_DOMAIN_SEGMENTS}\.)*(?:#{LAST_DOMAIN_SEGMENT}\.)+#{LAST_DOMAIN_SEGMENT}$/
VALID_URL_PATH = /(?:\/[^\s<]*)?/

AUTOLINK_PROTOCOLS = /^(?:http|https|ftp):\/\//

EMAIL_AUTO_LINK = /^<([a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*)>/
EXTENDED_EMAIL_AUTO_LINK = /^([a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)+)[-_]*/
EXTENDED_EMAIL_AUTO_LINK = /^([a-zA-Z0-9][a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)+)[-_]*/
AUTO_LINK = /^<[A-Za-z][A-Za-z0-9.+-]{1,31}:[^<>\x00-\x20]*>/i
WWW_AUTO_LINK = /^www(\.[a-zA-Z0-9\-]{1,})+(\/[^\s<]*[^\s<?!.,:*_~])?/
PROTOCOL_AUTO_LINK = /^(?:http|https|ftp):\/\/([a-zA-Z0-9\-_.]{2,})+(\/[^\s<]*[^\s?!.,:*_~])?/
WWW_AUTO_LINK = /^www\.#{DOMAIN_NAME}#{VALID_URL_PATH}/
XMPP_AUTO_LINK = /^xmpp:[A-Za-z0-9]+@#{DOMAIN_NAME}#{VALID_URL_PATH}/
MAILTO_AUTO_LINK = /^mailto:[A-Za-z0-9]+@#{DOMAIN_NAME}/
PROTOCOL_AUTO_LINK = /#{AUTOLINK_PROTOCOLS}#{DOMAIN_NAME}#{VALID_URL_PATH}[^\s?!.,:*_~]/

WHITESPACE_CHAR = /^[ \t\n\x0b\x0c\x0d]/
WHITESPACE = /[ \t\n\x0b\x0c\x0d]+/
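
The domain-name rule described in the `src/markd/rule.cr` comments (at least one period, no underscores in the last two segments) can be illustrated with a small standalone sketch. This is not the code in the PR: `valid_autolink_domain?` is a hypothetical helper name, it works on segments rather than the `VALID_DOMAIN_NAME` regex, and it deliberately ignores the emoji case to stay ASCII-only.

```crystal
# Hypothetical helper, not part of markd: a segment-based reading of the
# "valid domain name" rule from the GFM autolink extension.
# Assumption: ASCII only; the emoji allowance in the real rule is left out.
def valid_autolink_domain?(host : String) : Bool
  segments = host.split('.')

  # There must be at least one period, i.e. at least two segments.
  return false if segments.size < 2

  # Every segment is a non-empty run of alphanumerics, `_` and `-`.
  return false unless segments.all?(&.matches?(/\A[a-zA-Z0-9_-]+\z/))

  # No underscores may appear in the last two segments.
  segments.last(2).none?(&.includes?('_'))
end

puts valid_autolink_domain?("www.xxx.yyy._zzz") # false: `_` in the last segment
puts valid_autolink_domain?("www.xxx._yyy.zzz") # false: `_` in the second-to-last segment
puts valid_autolink_domain?("www._xxx.yyy.zzz") # true: `_` only in an earlier segment
```

The three inputs are the host names used in the "Underscores not allowed / allowed" fixture lines, and the results agree with the expected HTML there: only `www._xxx.yyy.zzz` is linked.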