Hello!

I wrote a simple regex to validate the DOCTYPE of an HTML document (W3C HTML 5.1 standard):
echo "<pre>";

$regex = <<<'REGEX'
@(*UTF8)^
<!((?i)DOCTYPE)
(?<space_characters>[\x20\x09\x0A\x0C\x0D])+
((?i)HTML)
(
\g<space_characters>+
(
( # DOCTYPE legacy string
((?i)SYSTEM)
\g<space_characters>+
(?<quote_mark>["'])
about:legacy-compat
\k<quote_mark>
)|( # obsolete permitted DOCTYPE string
((?i)PUBLIC)
\g<space_characters>+
(?<first_quote_mark>["'])
(
(
-//W3C//DTD\ HTML\ 4\.0//EN
\k<first_quote_mark>
(
\g<space_characters>+
(?<third_quote_mark_1>["'])
http://www\.w3\.org/TR/REC-html40/strict\.dtd
\k<third_quote_mark_1>
)?
)|(
-//W3C//DTD\ HTML\ 4\.01//EN
\k<first_quote_mark>
(
\g<space_characters>+
(?<third_quote_mark_2>["'])
http://www\.w3\.org/TR/html4/strict\.dtd
\k<third_quote_mark_2>
)?
)|(
-//W3C//DTD\ XHTML\ 1\.0\ Strict//EN
\k<first_quote_mark>
\g<space_characters>+
(?<third_quote_mark_3>["'])
http://www\.w3\.org/TR/xhtml1/DTD/xhtml1-strict\.dtd
\k<third_quote_mark_3>
)|(
-//W3C//DTD\ XHTML\ 1\.1//EN
\k<first_quote_mark>
\g<space_characters>+
(?<third_quote_mark_4>["'])
http://www\.w3\.org/TR/xhtml11/DTD/xhtml11\.dtd
\k<third_quote_mark_4>
)
)
)
)
)?
\g<space_characters>*
>
$@suxDX
REGEX;

// The number printed before each result is the expected return value of preg_match().
echo "0 - "; var_dump(preg_match($regex, '<!DOCTYPE>'));
echo "0 - "; var_dump(preg_match($regex, '<!DOCTYPE >'));
echo "1 - "; var_dump(preg_match($regex, '<!DOCTYPE htmL>'));
echo "1 - "; var_dump(preg_match($regex, '<!DOCTYPE htmL >'));
echo "0 - "; var_dump(preg_match($regex, '<!DOCTYPE htmL SYSTEM>'));
echo "1 - "; var_dump(preg_match($regex, '<!DOCTYPE htmL SYSTEM \'about:legacy-compat\' >'));
echo "1 - "; var_dump(preg_match($regex, '<!DOCTYPE htmL SYSTEM "about:legacy-compat">'));
echo "0 - "; var_dump(preg_match($regex, '<!DOCTYPE htmL SYSTEM "about:legacy-compat\'>'));
echo "0 - "; var_dump(preg_match($regex, '<!DOCTYPE htmL PUBLIC "about:legacy-compat">'));
echo "1 - "; var_dump(preg_match($regex, '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">'));
echo "0 - "; var_dump(preg_match($regex, '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//en">'));
echo "0 - "; var_dump(preg_match($regex, '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//PL">'));
echo "1 - "; var_dump(preg_match($regex, '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN" \'http://www.w3.org/TR/REC-html40/strict.dtd\'>'));
echo "0 - "; var_dump(preg_match($regex, '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN" \'http://www4w3.org/TR/REC-html40/strict.dtd\'>'));
echo "0 - "; var_dump(preg_match($regex, '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" \'http://www.w3.org/TR/REC-html40/strict.dtd\'>'));
echo "1 - "; var_dump(preg_match($regex, '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">'));
echo "1 - "; var_dump(preg_match($regex, '<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" >'));
echo "0 - "; var_dump(preg_match($regex, '<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN">'));

echo "</pre>";
However, I noticed that most of the "obsolete permitted DOCTYPE string" part repeats itself almost verbatim, so here is my question: how can this expression be simplified?
I thought about reusing a single third_quote_mark group across all the branches, but I can't do that for an array-like type.
Does anyone have an idea?
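
To make the question more concrete, here is one direction that might work: generate the repeated branches from a PHP array instead of writing them by hand. This is only a sketch under some assumptions. The function name buildDoctypeRegex and the layout of the $doctypes table are made up for illustration. It also assumes that PCRE's relative back reference \g{-1} (the most recently opened capturing group) is acceptable in place of the numbered third_quote_mark_N groups, and it drops the x modifier so that spaces in the identifiers need no manual escaping.

```php
<?php
// Hypothetical table: public identifier => [system identifier, is it required?]
$doctypes = [
    '-//W3C//DTD HTML 4.0//EN'         => ['http://www.w3.org/TR/REC-html40/strict.dtd', false],
    '-//W3C//DTD HTML 4.01//EN'        => ['http://www.w3.org/TR/html4/strict.dtd', false],
    '-//W3C//DTD XHTML 1.0 Strict//EN' => ['http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd', true],
    '-//W3C//DTD XHTML 1.1//EN'        => ['http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd', true],
];

// Hypothetical helper: builds the whole pattern from the table above.
function buildDoctypeRegex(array $doctypes): string
{
    $ws = '[\x20\x09\x0A\x0C\x0D]'; // HTML space characters
    $branches = [];
    foreach ($doctypes as $public => [$system, $required]) {
        // Opening quote is captured; \g{-1} refers back to that same group.
        $tail = $ws . '+(["\'])' . preg_quote($system, '@') . '\g{-1}';
        $branches[] = preg_quote($public, '@') . '\k<q1>'
            . ($required ? $tail : '(?:' . $tail . ')?');
    }
    $legacy   = '(?i:SYSTEM)' . $ws . '+(["\'])about:legacy-compat\g{-1}';
    $obsolete = '(?i:PUBLIC)' . $ws . '+(?<q1>["\'])(?:' . implode('|', $branches) . ')';

    return '@^<!(?i:DOCTYPE)' . $ws . '+(?i:HTML)'
        . '(?:' . $ws . '+(?:' . $legacy . '|' . $obsolete . '))?'
        . $ws . '*>$@suD';
}

$regex = buildDoctypeRegex($doctypes);
var_dump(preg_match($regex, '<!DOCTYPE htmL>'));
var_dump(preg_match($regex, '<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN">'));
```

With this layout, adding another permitted DOCTYPE would mean adding one row to the table. I have not checked it against every case in my test list above, so treat it as a starting point rather than a finished solution.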