Yesterday I asked a question about detecting invalid XML characters in Java, and this expression works as expected:

String xml10pattern = "[^"
                + "\u0009\r\n" // #x9 | #xA | #xD 
                + "\u0020-\uD7FF" // [#x20-#xD7FF]
                + "\uE000-\uFFFD" // [#xE000-#xFFFD] 
                + "\ud800\udc00-\udbff\udfff" // [#x10000-#x10FFFF]
                + "]";

However, I realized it would be better to check for invalid characters on the client side using JavaScript, but I didn't succeed.

I almost got it working, except for the range U+10000–U+10FFFF: http://jsfiddle/mymxyjaf/15/

For the last range, I tried:

 var rg = /[^\u0009\r\n\u0020-\uD7FF\uE000-\uFFFD\ud800\udc00-\udbff\udfff]/g; 

but it doesn't work. regextester reports "Range values reversed". I think this is because \ud800\udc00-\udbff\udfff is interpreted as three expressions:

\ud800; \udc00-\udbff; \udfff  

and, of course, the middle one fails.
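The reading above can be checked directly. The following is a minimal sketch, assuming an engine that supports the ES2015 `u` flag (newer than this question); `compiles` is a hypothetical helper introduced only for this illustration. Without `u`, the class is parsed one UTF-16 code unit at a time, so `\udc00-\udbff` is a reversed range. With `u`, each escaped surrogate pair is read as a single code point and the same class compiles.

```javascript
// Hypothetical helper: reports whether a pattern source compiles with the given flags.
function compiles(source, flags) {
    try {
        new RegExp(source, flags);
        return true;
    } catch (e) {
        return false; // SyntaxError, e.g. a range written out of order
    }
}

// The character class from the question, as a pattern source string.
var source = '[^\\u0009\\r\\n\\u0020-\\uD7FF\\uE000-\\uFFFD\\ud800\\udc00-\\udbff\\udfff]';

console.log(compiles(source, 'g'));  // false: \udc00-\udbff is a reversed range
console.log(compiles(source, 'gu')); // true: each escaped pair is one code point

// With the u flag, the negated class finds characters that are not valid XML 1.0.
var invalidXmlChar = new RegExp(source, 'u');
console.log(invalidXmlChar.test('eo\uD800\uDC01')); // false: U+10001 is allowed
console.log(invalidXmlChar.test('eo\uD801'));       // true: lone surrogate
```

Whether the `u` flag is an option depends on the browsers that have to be supported, which the question does not state.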

So, my question is how to convert the above Java regular expression into JavaScript.

Thanks.

==== UPDATE ====

Thanks to @collapsar's comments, I tried using two regular expressions.
While doing that, I realized I can't use a negated character class [^...]:
it would discard valid characters like U+10001. I mean, this is not right:

function validateIllegalChars(str) {
    var re1 = /[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD]/g; 
    var re2 = /[^[\uD800-\uDBFF][\uDC00-\uDFFF]]/g;
    var str2 = str.replace(re1, '').replace(re2, ''); // First replace would remove all valid characters [#x10000-#x10FFFF]
    alert('str2:' + str2);
    if (str2 != str) return false;
    return true;
}
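A short sketch of why the negated re1 fails, using only the pattern from the function above. JavaScript stores U+10001 as the surrogate pair \uD800\uDC01, and both halves fall outside the ranges listed in re1, so the negated class removes them.

```javascript
// re1 exactly as in validateIllegalChars above.
var re1 = /[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD]/g;

// U+10001 as JavaScript stores it: a high and a low surrogate.
var astral = '\uD800\uDC01';
var input = 'eo' + astral;

// Each surrogate is matched by the negated class on its own and stripped.
var stripped = input.replace(re1, '');

console.log(stripped.length);    // 2: only 'eo' is left
console.log(stripped === input); // false: a valid string is reported as invalid
```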

Then I tried the following (http://jsfiddle/mymxyjaf/18/):

function valPos(str) { 
    var re1 = /[\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD]/g; 
    var re2 = /[\uD800-\uDBFF][\uDC00-\uDFFF]/g;

    var str2 = str.replace(re1, '').replace(re2, ''); 
    if (str2.length === 0) return true; 
    alert('str2:' + str2 + '; length: ' + str2.length);
    return false; 
}

However, when I call this function as valPos('eo' + String.fromCharCode(65537)), where 65537 is U+10001, it returns false. What is wrong, and how can I fix it?

asked Mar 13, 2015 at 12:05 by Albert, edited May 23, 2017 at 12:22
  • The \u notation (so far) only supports 16-bit code units. This SO answer will solve your problem (specify the code points in question as surrogate pairs). However, you should be able to use the original solution if you create a RegExp object from a string: new RegExp(xml10pattern); with xml10pattern defined as in your question. – collapsar Commented Mar 13, 2015 at 12:11
  • @collapsar, I think it does not work. For instance, U+D801 shouldn't be accepted (it's not valid XML), yet it seems to be accepted: jsfiddle/mymxyjaf/16. What is wrong? – Albert Commented Mar 13, 2015 at 12:42
  • In your fiddle, you have nested character classes in your first regex. This is a syntax error. Follow the recipe in the cited answer: you cannot build a single negated character class (or a single regex) because the limits of the offending code points will be represented by 2 characters. – collapsar Commented Mar 13, 2015 at 12:53
  • @collapsar, the expression I just used is var re = /[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD[\uD800-\uDBFF][\uDC00-\uDFFF]]/g;. It looks like it doesn't treat U+D801 as part of a surrogate pair. It seems it only checks the first part, [\uD800-\uDBFF] – Albert Commented Mar 13, 2015 at 12:53
  • @collapsar, so you mean I must use two regular expressions? One for 16-bit code points, and the other for U+10000–U+10FFFF? – Albert Commented Mar 13, 2015 at 12:56
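On the nested brackets discussed in these comments, here is a small sketch of how the expression behaves in engines I am aware of, assuming it is used without the `u` flag. It is accepted, but not as one class: the first `]` closes the negated class (the inner `[` is just a literal member), `[\uDC00-\uDFFF]` becomes a second class, and the last `]` is a literal character. The pattern therefore only matches a three-unit sequence, which would explain why U+D801 alone appears to be accepted.

```javascript
// The expression from the comment above, without the g flag so that
// repeated test() calls do not depend on lastIndex.
var nested = /[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD[\uD800-\uDBFF][\uDC00-\uDFFF]]/;

// A lone high surrogate is listed inside the negated class, so it never matches.
console.log(nested.test('\uD801'));         // false

// Even NUL alone does not match: two more units are required after it.
console.log(nested.test('\u0000'));         // false

// Disallowed unit + low surrogate + literal ']' is what actually matches.
console.log(nested.test('\u0000\uDC00]'));  // true
```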

1 Answer

I finally solved it.

The answer to my own question, as @collapsar suggested, could be:

function validateIllegalChars(str) {

    var re1 = /[\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD]/g;  // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
    var re2 = /[\uD800-\uDBFF][\uDC00-\uDFFF]/g; // [#x10000-#x10FFFF], written as surrogate pairs

    // Remove surrogate pairs first. Removing the BMP characters first could join two
    // lone surrogates that were separated by valid text into a pair that looks valid.
    var res = str.replace(re2, '').replace(re1, ''); // Should remove every valid character

    return res.length === 0; // Any remaining character means the input str is not valid
}
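The same idea can also be written as a single anchored pattern. This is a sketch under my own naming (`XML10_TEXT` and `isValidXml10` are hypothetical, not from the question): the string is valid when it consists only of allowed BMP characters or well-formed surrogate pairs. Because the pattern consumes the string left to right, a lone surrogate can never be paired with one that only becomes adjacent after other characters are removed.

```javascript
// Allowed BMP characters, or a high surrogate immediately followed by a low one,
// repeated over the whole string. No u flag needed.
var XML10_TEXT = /^(?:[\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD]|[\uD800-\uDBFF][\uDC00-\uDFFF])*$/;

// Hypothetical helper: true when every character is allowed in XML 1.0.
function isValidXml10(str) {
    return XML10_TEXT.test(str);
}

console.log(isValidXml10('eo\uD800\uDC01')); // true: U+10001 as a surrogate pair
console.log(isValidXml10('\uD801'));         // false: lone high surrogate
console.log(isValidXml10('\uD800a\uDC00'));  // false: surrogates split by 'a'
console.log(isValidXml10('a\u0000b'));       // false: NUL is not allowed
```

The two alternatives never match the same code unit, so the pattern should not backtrack heavily on long inputs.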

The previous examples (the ones I posted on jsfiddle) didn't work for me because String.fromCharCode(65537) does not generate the character with code point U+10001, as I thought, but U+0001: fromCharCode keeps only the low 16 bits of its argument, and 65537 is 0x10001.
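A small sketch of that truncation and of two ways to build U+10001 for a test. `String.fromCodePoint` is ES2015, so its availability is an assumption about the target browsers; `toSurrogatePair` is a hypothetical helper for older engines.

```javascript
// fromCharCode keeps only the low 16 bits: 0x10001 becomes 0x0001.
var truncated = String.fromCharCode(65537);
console.log(truncated.length);        // 1
console.log(truncated.charCodeAt(0)); // 1

// ES2015: fromCodePoint produces the surrogate pair for U+10001.
var astral = String.fromCodePoint(0x10001);
console.log(astral.length);              // 2
console.log(astral === '\uD800\uDC01');  // true

// Hypothetical helper for engines without fromCodePoint.
// Only meaningful for code points in U+10000..U+10FFFF.
function toSurrogatePair(codePoint) {
    var offset = codePoint - 0x10000;
    return String.fromCharCode(
        0xD800 + (offset >> 10),   // high surrogate: top 10 bits
        0xDC00 + (offset & 0x3FF)  // low surrogate: bottom 10 bits
    );
}

console.log(toSurrogatePair(0x10001) === astral); // true
```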

Thanks for the help.
