I wrote a simple regex to validate the DOCTYPE declaration of an HTML document (W3C HTML 5.1 standard). However, I noticed that most of the code in the "obsolete permitted DOCTYPE string" part repeats itself, so my question is: how can this expression be simplified?
$regex=<<<'REGEX'
@(*UTF8)^
<!((?i)DOCTYPE)
(?<space_characters>[\x20\x09\x0A\x0C\x0D])+
((?i)HTML)
(
    \g<space_characters>+
    (
        (
            # DOCTYPE legacy string
            ((?i)SYSTEM)
            \g<space_characters>+
            (?<quote_mark>["'])
            about:legacy-compat
            \k<quote_mark>
        )|(
            # obsolete permitted DOCTYPE string
            ((?i)PUBLIC)
            \g<space_characters>+
            (?<first_quote_mark>["'])
            (
                (
                    -//W3C//DTD\ HTML\ 4\.0//EN
                    \k<first_quote_mark>
                    (
                        \g<space_characters>+
                        (?<third_quote_mark_1>["'])
                        http://www\.w3\.org/TR/REC-html40/strict\.dtd
                        \k<third_quote_mark_1>
                    )?
                )|(
                    -//W3C//DTD\ HTML\ 4\.01//EN
                    \k<first_quote_mark>
                    (
                        \g<space_characters>+
                        (?<third_quote_mark_2>["'])
                        http://www\.w3\.org/TR/html4/strict\.dtd
                        \k<third_quote_mark_2>
                    )?
                )|(
                    -//W3C//DTD\ XHTML\ 1\.0\ Strict//EN
                    \k<first_quote_mark>
                    \g<space_characters>+
                    (?<third_quote_mark_3>["'])
                    http://www\.w3\.org/TR/xhtml1/DTD/xhtml1-strict\.dtd
                    \k<third_quote_mark_3>
                )|(
                    -//W3C//DTD\ XHTML\ 1\.1//EN
                    \k<first_quote_mark>
                    \g<space_characters>+
                    (?<third_quote_mark_4>["'])
                    http://www\.w3\.org/TR/xhtml11/DTD/xhtml11\.dtd
                    \k<third_quote_mark_4>
                )
            )
        )
    )
)?
\g<space_characters>*
>
$@suxDX
REGEX;

echo "1 - "; var_dump(preg_match($regex, '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN" \'http://www.w3.org/TR/REC-html40/strict.dtd\'>'));
echo "0 - "; var_dump(preg_match($regex, '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN" \'http://www4w3.org/TR/REC-html40/strict.dtd\'>'));
echo "0 - "; var_dump(preg_match($regex, '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" \'http://www.w3.org/TR/REC-html40/strict.dtd\'>'));
echo "1 - "; var_dump(preg_match($regex, '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">'));
echo "1 - "; var_dump(preg_match($regex, '<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" >'));
echo "0 - "; var_dump(preg_match($regex, '<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN">'));
I thought about caching third_quote_mark, but I can't do that for an array type.
Does anyone have an idea?
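One possible direction, offered only as a rough sketch and not as the definitive answer: since the four PUBLIC branches differ only in their identifiers and in whether the system identifier is required, the pattern could be generated from a small table instead of being written out by hand. The helper `quoted()`, the `$obsolete` table, and the other variable names below are hypothetical. The sketch also assumes it is acceptable to drop the `x` and `X` flags and the `(*UTF8)` prefix (so that spaces in the identifiers can stay literal), and to replace the quote backreferences with an explicit alternation of both quote characters.

```php
<?php
// Hypothetical helper: match $literal wrapped in a matching pair of quotes,
// either "..." or '...'. This replaces the named quote groups and backreferences.
function quoted(string $literal): string
{
    $q = preg_quote($literal, '@');
    return "(?:\"$q\"|'$q')";
}

// Hypothetical table: [public identifier, system identifier, system identifier required?]
$obsolete = [
    ['-//W3C//DTD HTML 4.0//EN',         'http://www.w3.org/TR/REC-html40/strict.dtd',         false],
    ['-//W3C//DTD HTML 4.01//EN',        'http://www.w3.org/TR/html4/strict.dtd',              false],
    ['-//W3C//DTD XHTML 1.0 Strict//EN', 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd',  true],
    ['-//W3C//DTD XHTML 1.1//EN',        'http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd',       true],
];

// Same space characters as in the original pattern.
$ws = '[\x20\x09\x0A\x0C\x0D]';

// One alternative per table row; the system identifier part is optional
// for the HTML 4.0 / 4.01 rows and mandatory for the XHTML rows.
$branches = [];
foreach ($obsolete as [$publicId, $systemId, $required]) {
    $tail = $ws . '+' . quoted($systemId);
    $branches[] = quoted($publicId) . ($required ? $tail : "(?:$tail)?");
}

$legacy     = '(?i:SYSTEM)' . $ws . '+' . quoted('about:legacy-compat');
$publicPart = '(?i:PUBLIC)' . $ws . '+(?:' . implode('|', $branches) . ')';

$regex = '@^<!(?i:DOCTYPE)' . $ws . '+(?i:HTML)'
       . '(?:' . $ws . '+(?:' . $legacy . '|' . $publicPart . '))?'
       . $ws . '*>$@suD';

var_dump(preg_match($regex, '<!DOCTYPE html>'));
var_dump(preg_match($regex, '<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN">'));
```

Adding another permitted DOCTYPE would then mean adding one row to the table. The trade-off is that the final pattern is no longer readable in one place, so it may help to print `$regex` while debugging.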