Build Regex To Find And Replace Invalid Html Attributes
Solution 1:
I think it's better not to cram everything into a single mega-regex. I'd prefer several steps:
- Identify the tag:
<([^>]+)/?>
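- Replace wrong attributes with correct ones by iterating through the tag string: replace the pattern
\s+([\w]+)\s*=\s*(['"]?)(\S+)(\2)
with
$1="$3"
(with a space after the last quote; the pattern also consumes the whitespace before the attribute name, so the replacement needs a leading space as well). I believe .NET lets you track the boundaries of each match, which can help avoid searching through the part of the tag that has already been corrected.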
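The two steps above can be sketched in VB.NET roughly as follows. This is only an illustration: the names TwoStepSketch, FixTags and FixAttributes are hypothetical, and the replacement string uses a leading space instead of a trailing one, which is an assumption made to keep the output tidy. The attribute pattern is applied to the captured tag contents (group 1) rather than to the whole tag, because \S+ would otherwise swallow the closing >. Values containing whitespace are not handled by \S+.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module TwoStepSketch
    ' Step 1: find each tag and hand it to FixAttributes (hypothetical helper).
    Function FixTags(ByVal html As String) As String
        Return Regex.Replace(html, "<([^>]+)/?>", AddressOf FixAttributes)
    End Function

    ' Step 2: rewrite name=value pairs inside one tag as name="value".
    ' Works on group 1 (the text between < and >) so that \S+ cannot reach ">".
    Private Function FixAttributes(ByVal tag As Match) As String
        Dim inner As String = tag.Groups(1).Value
        Dim fixedInner As String = Regex.Replace(inner, "\s+([\w]+)\s*=\s*(['""]?)(\S+)(\2)", " $1=""$3""")
        Return "<" & fixedInner & ">"
    End Function

    Sub Main()
        Console.WriteLine(FixTags("<font color=red size=4>text</font>"))
    End Sub
End Module
```

Keeping the tag finder and the attribute fixer in separate functions means either pattern can be tightened later without touching the other.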
Solution 2:
Drop the word 'attribute', i.e.
Dim test As String = "=(?:(['""])(?<attribute>(?:(?!\1).)*)\1|(?<attribute>\S+))"
which would find every "='something'" string. That is fine as long as the pages contain no other code, e.g. JavaScript.
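To see what that pattern captures, a small check along these lines may help. It is a sketch only: the module name AttributeValueDemo and the sample tag are made up for illustration. It relies on .NET allowing the same group name in both alternatives, so a single lookup returns the value whether or not it was quoted.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module AttributeValueDemo
    Sub Main()
        Dim test As String = "=(?:(['""])(?<attribute>(?:(?!\1).)*)\1|(?<attribute>\S+))"
        ' Sample input chosen for illustration: bare, single-quoted and double-quoted values.
        Dim html As String = "<tag border=2 style='display: none' width=""100%"">"
        ' Both alternatives share the group name, so one lookup covers quoted and bare values.
        For Each m As Match In Regex.Matches(html, test)
            Console.WriteLine(m.Groups("attribute").Value)
        Next
    End Sub
End Module
```

For this sample tag the loop should print 2, display: none and 100%. Any other = in the page, such as an assignment inside a script block, would be matched as well, which is the caveat mentioned above.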
Solution 3:
I had trouble with the final update (8/21/09): it would replace
<font color=red size=4>
with
<font color="red" size="4>"
(placing the closing quote of the second attribute outside the tag's closing bracket).
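I changed the attributes string in EvaluateTag to:
Dim attributes As String = "\s*=\s*(?:('|"")(?<g1>(?:(?!\1).)*)\1|(?<g1>[^>|\s]+))"
The change is the [^>|\s] near the end. (Inside a character class the | is a literal pipe, not alternation, so [^>\s] is enough to stop at the closing bracket.)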
This returns my desired results of:
<font color="red" size="4">
It works on my exhaustive testcase of one.
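The difference can be reproduced with a short comparison. This is a sketch under one assumption: the earlier pattern is not quoted in this answer, so the first string below guesses that its unquoted-value branch was \S+, which is consistent with the symptom described. The module name UnquotedValueDemo is hypothetical.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module UnquotedValueDemo
    Sub Main()
        Dim tag As String = "<font color=red size=4>"
        ' Assumed earlier form: the bare-value branch is \S+, which also consumes ">".
        Dim greedy As String = "\s*=\s*(?:('|"")(?<g1>(?:(?!\1).)*)\1|(?<g1>\S+))"
        ' Form from this answer: the bare-value branch stops at ">" or whitespace.
        Dim bounded As String = "\s*=\s*(?:('|"")(?<g1>(?:(?!\1).)*)\1|(?<g1>[^>|\s]+))"
        Console.WriteLine(Regex.Replace(tag, greedy, "=""${g1}"""))   ' <font color="red" size="4>"
        Console.WriteLine(Regex.Replace(tag, bounded, "=""${g1}"""))  ' <font color="red" size="4">
    End Sub
End Module
```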
Solution 4:
Here is the final product. I hope this helps somebody!
Imports System.Text.RegularExpressions

Module Module1
    Sub Main()
        Dim input As String = "<tag border=2 style='display: none' width=""100%"">Some stuff""""""in between tags==="""" that could be there</tag>" & _
            "<sometag border=2 width=""100%"" /><another that=""is"" completely=""normal"">with some content, of course</another>"
        Console.WriteLine(ConvertMarkupAttributeQuoteType(input, "'"))
        Console.ReadKey()
    End Sub

    Public Function ConvertMarkupAttributeQuoteType(ByVal html As String, ByVal quoteChar As String) As String
        Dim findTags As String = "</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>"
        Return Regex.Replace(html, findTags, New MatchEvaluator(Function(m) EvaluateTag(m, quoteChar)))
    End Function

    Private Function EvaluateTag(ByVal match As Match, ByVal quoteChar As String) As String
        Dim attributes As String = "\s*=\s*(?:(['""])(?<g1>(?:(?!\1).)*)\1|(?<g1>[^>\s]+))"
        Return Regex.Replace(match.Value, attributes, String.Format("={0}$2{0}", quoteChar))
    End Function
End Module
I kept the tag finder and the attribute-fixing regex separate from each other in case I want to change how either of them works in the future. Thanks for all your input.