[SCM] WebKit Debian packaging branch, debian/unstable, updated. debian/1.1.15-1-40151-g37bb677

darin darin at 268f45cc-cd09-0410-ab3c-d52691b4dbfc
Sat Sep 26 07:52:07 UTC 2009


The following commit has been merged in the debian/unstable branch:
commit 6c9bbfd314147447f01a8828b73b5b8e41e75984
Author: darin <darin at 268f45cc-cd09-0410-ab3c-d52691b4dbfc>
Date:   Mon Aug 18 18:51:25 2003 +0000

            Reviewed by Maciej.
    
            - fixed 3247528 -- encodeURI missing from JavaScriptCore (needed by Crystal Reports)
            - fixed 3381297 -- escape method does not escape the null character
            - fixed 3381299 -- escape method produces incorrect escape sequences ala WinIE, rather than correct ala Gecko
            - fixed 3381303 -- unescape method treats escape sequences as Latin-1 ala WinIE rather than as UTF-8 ala Gecko
            - fixed 3381304 -- unescape method garbles strings with bad escape sequences in them
    
            * kjs/function.h: Added constants for decodeURI, decodeURIComponent, encodeURI, and
            encodeURIComponent.
            * kjs/function.cpp:
            (encode): Added. New helper function for escape, encodeURI, and encodeURIComponent.
            (decode): Added. New helper function for unescape, decodeURI, and decodeURIComponent.
            (GlobalFuncImp::call): Added decodeURI, decodeURIComponent, encodeURI, and encodeURIComponent
            implementations. Changed escape and unescape to use the new helper functions, which fixes
            the four escape and unescape problems above.
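
The encode helper described above can be sketched roughly as follows. This is a minimal illustration, not the KJS code: it assumes the input has already been converted to UTF-8 and uses std::string in place of UString and CString. The name percentEncodeUTF8 is hypothetical.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical stand-in for the commit's encode() helper.
// Every byte of the UTF-8 input that is not listed in doNotEscape
// is written as "%XX", including the null byte.
std::string percentEncodeUTF8(const std::string &utf8, const char *doNotEscape)
{
    assert(doNotEscape);
    std::string result;
    for (std::string::size_type i = 0; i < utf8.size(); ++i) {
        const char c = utf8[i];
        // strchr also matches the terminating null of doNotEscape,
        // so a null byte has to be excluded explicitly.
        if (c != '\0' && std::strchr(doNotEscape, c)) {
            result += c;
        } else {
            char tmp[4];
            std::snprintf(tmp, sizeof tmp, "%%%02X",
                          static_cast<unsigned>(static_cast<unsigned char>(c)));
            result += tmp;
        }
    }
    return result;
}
```

Under these assumptions, escape, encodeURI, and encodeURIComponent would differ only in the doNotEscape set they pass in.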
    
            * kjs/internal.cpp: (InterpreterImp::initGlobalObject): Add decodeURI, decodeURIComponent,
            encodeURI, and encodeURIComponent to the global object.
    
            * kjs/ustring.h: Added a length to the CString class so it can hold strings with null
            characters in them, not just null-terminated strings. This allows a null character from
            a UString to survive the conversion from UTF-16 to UTF-8. Added new overloads of
            UString::append, and added UString::UTF8String, UTF8SequenceLength, decodeUTF8Sequence,
            convertUTF16OffsetsToUTF8Offsets, and convertUTF8OffsetsToUTF16Offsets.
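
To illustrate why CString needs an explicit length, here is a small sketch comparing terminator-based and length-based equality. CountedBuffer and both function names are hypothetical and are not part of KJS.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical minimal model of a C string that carries its own length.
struct CountedBuffer {
    const char *data;
    int length;
};

// Terminator-based comparison: stops at the first null byte,
// so anything after an embedded null is ignored.
bool equalByTerminator(const CountedBuffer &a, const CountedBuffer &b)
{
    return std::strcmp(a.data, b.data) == 0;
}

// Length-based comparison: compares all stored bytes,
// including any embedded null bytes.
bool equalByLength(const CountedBuffer &a, const CountedBuffer &b)
{
    return a.length == b.length
        && (a.length == 0 || std::memcmp(a.data, b.data, a.length) == 0);
}
```

With the buffers "a\0b" and "a\0c" (both of length 3), the terminator-based check reports equality while the length-based check does not.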
    
            * kjs/ustring.cpp:
            (CString::CString): Set up the length properly in all the constructors. Also add a new
            constructor that takes a length.
            (CString::append): Use and set the length properly.
            (CString::operator=): Use and set the length properly.
            (operator==): Use the length and memcmp instead of strcmp.
            (UString::append): Added new overloads for const char * and for a single character to make
            it more efficient to build up a UString from pieces. Previously, a temporary UString was
            created and destroyed on each append.
            (UTF8SequenceLength): New. Helper for decoding UTF-8.
            (decodeUTF8Sequence): New. Helper for decoding UTF-8.
            (UString::UTF8String): New. Converts from UTF-16 to UTF-8. Same as the function that
            was in regexp.cpp, except that it handles UTF-16 surrogates properly.
            (compareStringOffsets): Moved from regexp.cpp.
            (createSortedOffsetsArray): Moved from regexp.cpp.
            (convertUTF16OffsetsToUTF8Offsets): New. Converts UTF-16 offsets to UTF-8 offsets, given
            a UTF-8 string. Same as the function that was in regexp.cpp, except that it handles
            UTF-16 surrogates properly.
            (convertUTF8OffsetsToUTF16Offsets): New. Converts UTF-8 offsets to UTF-16 offsets, given
            a UTF-8 string. Same as the function that was in regexp.cpp, except that it handles
            UTF-16 surrogates properly.
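
The surrogate handling mentioned for UString::UTF8String can be sketched as below. This is an illustration only: it uses std::u16string and std::string rather than UString and CString, and the name utf16ToUTF8 is hypothetical. As in the commit, an unpaired surrogate simply falls through to the 3-byte case.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch of a UTF-16 to UTF-8 conversion that combines
// a high/low surrogate pair into one 4-byte UTF-8 sequence.
std::string utf16ToUTF8(const std::u16string &s)
{
    std::string out;
    const std::size_t length = s.size();
    for (std::size_t i = 0; i < length; ++i) {
        const unsigned c = s[i];
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else if (c < 0x800) {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else if (c >= 0xD800 && c <= 0xDBFF && i + 1 < length
                   && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            // Combine the pair into a code point above U+FFFF.
            const unsigned sc = 0x10000 + (((c & 0x3FF) << 10) | (s[i + 1] & 0x3FF));
            out += static_cast<char>(0xF0 | (sc >> 18));
            out += static_cast<char>(0x80 | ((sc >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((sc >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (sc & 0x3F));
            ++i; // the low surrogate has been consumed as well
        } else {
            out += static_cast<char>(0xE0 | (c >> 12));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

Note that the pair check looks only at index i + 1, and only when that index is inside the string.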
    
            - fixed 3381296 -- regular expression matches with UTF-16 surrogates will treat sequences as two characters
    
            * kjs/regexp.cpp:
            (RegExp::RegExp): Use the new UString::UTF8String function instead of a function in this file.
            (RegExp::match): Use the new convertUTF16OffsetsToUTF8Offsets (and the corresponding
            reverse) instead of convertCharacterOffsetsToUTF8ByteOffsets in this file.
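
The offset conversion that RegExp::match relies on might look roughly like this for a single offset. This is a simplified sketch under stated assumptions: the real functions convert whole arrays of offsets in one sorted pass, the input is assumed to be well-formed UTF-8, and the name utf16OffsetToUTF8Offset is hypothetical.

```cpp
#include <cassert>
#include <string>

// Hypothetical single-offset version of the UTF-16 to UTF-8 offset mapping.
// A 4-byte UTF-8 sequence corresponds to a surrogate pair, so it counts as
// two UTF-16 units; every other sequence counts as one.
// Offsets past the end map to the end of the string. In this sketch an
// offset that falls between the two halves of a pair maps past the pair.
int utf16OffsetToUTF8Offset(const std::string &utf8, int utf16Offset)
{
    int units = 0;
    std::size_t pos = 0;
    while (pos < utf8.size() && units < utf16Offset) {
        const unsigned char b0 = static_cast<unsigned char>(utf8[pos]);
        units += ((b0 & 0xF8) == 0xF0) ? 2 : 1;
        // Skip the lead byte and any continuation bytes (10xxxxxx).
        do {
            ++pos;
        } while (pos < utf8.size()
                 && (static_cast<unsigned char>(utf8[pos]) & 0xC0) == 0x80);
    }
    return static_cast<int>(pos);
}
```

Before this commit, a character outside the Basic Multilingual Plane counted as a single character on the UTF-8 side but as two units on the UTF-16 side, which is how offsets could drift apart.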
    
    
    git-svn-id: http://svn.webkit.org/repository/webkit/trunk@4837 268f45cc-cd09-0410-ab3c-d52691b4dbfc

diff --git a/JavaScriptCore/ChangeLog b/JavaScriptCore/ChangeLog
index 3d9a610..e8fc08d 100644
--- a/JavaScriptCore/ChangeLog
+++ b/JavaScriptCore/ChangeLog
@@ -1,3 +1,60 @@
+2003-08-17  Darin Adler  <darin at apple.com>
+
+        Reviewed by Maciej.
+
+        - fixed 3247528 -- encodeURI missing from JavaScriptCore (needed by Crystal Reports)
+        - fixed 3381297 -- escape method does not escape the null character
+        - fixed 3381299 -- escape method produces incorrect escape sequences ala WinIE, rather than correct ala Gecko
+        - fixed 3381303 -- unescape method treats escape sequences as Latin-1 ala WinIE rather than as UTF-8 ala Gecko
+        - fixed 3381304 -- unescape method garbles strings with bad escape sequences in them
+
+        * kjs/function.h: Added constants for decodeURI, decodeURIComponent, encodeURI, and
+        encodeURIComponent.
+        * kjs/function.cpp:
+        (encode): Added. New helper function for escape, encodeURI, and encodeURIComponent.
+        (decode): Added. New helper function for unescape, decodeURI, and decodeURIComponent.
+        (GlobalFuncImp::call): Added decodeURI, decodeURIComponent, encodeURI, and encodeURIComponent 
+        implementations. Changed escape and unescape to use new helper functions, which fixes
+        the four problems above.
+
+        * kjs/internal.cpp: (InterpreterImp::initGlobalObject): Add decodeURI, decodeURIComponent,
+        encodeURI, and encodeURIComponent to the global object.
+
+        * kjs/ustring.h: Added a length to the CString class so it can hold strings with null
+        characters in them, not just null-terminated strings. This allows a null character from
+        a UString to survive the process of UTF-16 to UTF-8 decoding. Added overloads to
+        UString::append, UString::UTF8String, UTF8SequenceLength, decodeUTF8Sequence,
+        convertUTF16OffsetsToUTF8Offsets, and convertUTF8OffsetsToUTF16Offsets.
+        
+        * kjs/ustring.cpp:
+        (CString::CString): Set up the length properly in all the constructors. Also add a new
+        constructor that takes a length.
+        (CString::append): Use and set the length properly.
+        (CString::operator=): Use and set the length properly.
+        (operator==): Use the length and memcmp instead of strcmp.
+        (UString::append): Added new overloads for const char * and for a single string to make
+        it more efficient to build up a UString from pieces. The old way, a UString was created
+        and destroyed each time you appended.
+        (UTF8SequenceLength): New. Helper for decoding UTF-8.
+        (decodeUTF8Sequence): New. Helper for decoding UTF-8.
+        (UString::UTF8String): New. Decodes from UTF-16 to UTF-8. Same as the function that
+        was in regexp.cpp, except has proper handling for UTF-16 surrogates.
+        (compareStringOffsets): Moved from regexp.cpp.
+        (createSortedOffsetsArray): Moved from regexp.cpp.
+        (convertUTF16OffsetsToUTF8Offsets): New. Converts UTF-16 offsets to UTF-8 offsets, given
+        a UTF-8 string. Same as the function that was in regexp.cpp, except has proper handling
+        for UTF-16 surrogates.
+        (convertUTF8OffsetsToUTF16Offsets): New. Converts UTF-8 offsets to UTF-16 offsets, given
+        a UTF-8 string. Same as the function that was in regexp.cpp, except has proper handling
+        for UTF-16 surrogates.
+
+        - fixed 3381296 -- regular expression matches with UTF-16 surrogates will treat sequences as two characters
+
+        * kjs/regexp.cpp:
+        (RegExp::RegExp): Use the new UString::UTF8String function instead of a function in this file.
+        (RegExp::match): Use the new convertUTF16OffsetsToUTF8Offsets (and the corresponding
+        reverse) instead of convertCharacterOffsetsToUTF8ByteOffsets in this file.
+
 === Safari-93 ===
 
 2003-08-14  Vicki Murley  <vicki at apple.com>
diff --git a/JavaScriptCore/ChangeLog-2003-10-25 b/JavaScriptCore/ChangeLog-2003-10-25
index 3d9a610..e8fc08d 100644
--- a/JavaScriptCore/ChangeLog-2003-10-25
+++ b/JavaScriptCore/ChangeLog-2003-10-25
@@ -1,3 +1,60 @@
+2003-08-17  Darin Adler  <darin at apple.com>
+
+        Reviewed by Maciej.
+
+        - fixed 3247528 -- encodeURI missing from JavaScriptCore (needed by Crystal Reports)
+        - fixed 3381297 -- escape method does not escape the null character
+        - fixed 3381299 -- escape method produces incorrect escape sequences ala WinIE, rather than correct ala Gecko
+        - fixed 3381303 -- unescape method treats escape sequences as Latin-1 ala WinIE rather than as UTF-8 ala Gecko
+        - fixed 3381304 -- unescape method garbles strings with bad escape sequences in them
+
+        * kjs/function.h: Added constants for decodeURI, decodeURIComponent, encodeURI, and
+        encodeURIComponent.
+        * kjs/function.cpp:
+        (encode): Added. New helper function for escape, encodeURI, and encodeURIComponent.
+        (decode): Added. New helper function for unescape, decodeURI, and decodeURIComponent.
+        (GlobalFuncImp::call): Added decodeURI, decodeURIComponent, encodeURI, and encodeURIComponent 
+        implementations. Changed escape and unescape to use new helper functions, which fixes
+        the four problems above.
+
+        * kjs/internal.cpp: (InterpreterImp::initGlobalObject): Add decodeURI, decodeURIComponent,
+        encodeURI, and encodeURIComponent to the global object.
+
+        * kjs/ustring.h: Added a length to the CString class so it can hold strings with null
+        characters in them, not just null-terminated strings. This allows a null character from
+        a UString to survive the process of UTF-16 to UTF-8 decoding. Added overloads to
+        UString::append, UString::UTF8String, UTF8SequenceLength, decodeUTF8Sequence,
+        convertUTF16OffsetsToUTF8Offsets, and convertUTF8OffsetsToUTF16Offsets.
+        
+        * kjs/ustring.cpp:
+        (CString::CString): Set up the length properly in all the constructors. Also add a new
+        constructor that takes a length.
+        (CString::append): Use and set the length properly.
+        (CString::operator=): Use and set the length properly.
+        (operator==): Use the length and memcmp instead of strcmp.
+        (UString::append): Added new overloads for const char * and for a single string to make
+        it more efficient to build up a UString from pieces. The old way, a UString was created
+        and destroyed each time you appended.
+        (UTF8SequenceLength): New. Helper for decoding UTF-8.
+        (decodeUTF8Sequence): New. Helper for decoding UTF-8.
+        (UString::UTF8String): New. Decodes from UTF-16 to UTF-8. Same as the function that
+        was in regexp.cpp, except has proper handling for UTF-16 surrogates.
+        (compareStringOffsets): Moved from regexp.cpp.
+        (createSortedOffsetsArray): Moved from regexp.cpp.
+        (convertUTF16OffsetsToUTF8Offsets): New. Converts UTF-16 offsets to UTF-8 offsets, given
+        a UTF-8 string. Same as the function that was in regexp.cpp, except has proper handling
+        for UTF-16 surrogates.
+        (convertUTF8OffsetsToUTF16Offsets): New. Converts UTF-8 offsets to UTF-16 offsets, given
+        a UTF-8 string. Same as the function that was in regexp.cpp, except has proper handling
+        for UTF-16 surrogates.
+
+        - fixed 3381296 -- regular expression matches with UTF-16 surrogates will treat sequences as two characters
+
+        * kjs/regexp.cpp:
+        (RegExp::RegExp): Use the new UString::UTF8String function instead of a function in this file.
+        (RegExp::match): Use the new convertUTF16OffsetsToUTF8Offsets (and the corresponding
+        reverse) instead of convertCharacterOffsetsToUTF8ByteOffsets in this file.
+
 === Safari-93 ===
 
 2003-08-14  Vicki Murley  <vicki at apple.com>
diff --git a/JavaScriptCore/kjs/function.cpp b/JavaScriptCore/kjs/function.cpp
index d15b501..19b6fd2 100644
--- a/JavaScriptCore/kjs/function.cpp
+++ b/JavaScriptCore/kjs/function.cpp
@@ -408,13 +408,113 @@ bool GlobalFuncImp::implementsCall() const
   return true;
 }
 
+static Value encode(ExecState *exec, const List &args, const char *do_not_escape)
+{
+  UString r = "", s, str = args[0].toString(exec);
+  CString cstr = str.UTF8String();
+  const char *p = cstr.c_str();
+  for (int k = 0; k < cstr.size(); k++, p++) {
+    char c = *p;
+    if (c && strchr(do_not_escape, c)) {
+      r.append(c);
+    } else {
+      char tmp[4];
+      sprintf(tmp, "%%%02X", (unsigned char)c);
+      r += tmp;
+    }
+  }
+  return String(r);
+}
+
+static Value decode(ExecState *exec, const List &args, const char *do_not_unescape, bool strict)
+{
+  UString s = "", str = args[0].toString(exec);
+  int k = 0, len = str.size();
+  const UChar *d = str.data();
+  UChar u;
+  while (k < len) {
+    const UChar *p = d + k;
+    UChar c = *p;
+    if (c == '%') {
+      int charLen = 0;
+      if (k <= len - 3 && isxdigit(p[1].uc) && isxdigit(p[2].uc)) {
+        const char b0 = Lexer::convertHex(p[1].uc, p[2].uc);
+        const int sequenceLen = UTF8SequenceLength(b0);
+        if (sequenceLen != 0 && k <= len - sequenceLen * 3) {
+          charLen = sequenceLen * 3;
+          char sequence[5];
+          sequence[0] = b0;
+          for (int i = 1; i < sequenceLen; ++i) {
+            const UChar *q = p + i * 3;
+            if (q[0] == '%' && isxdigit(q[1].uc) && isxdigit(q[2].uc))
+              sequence[i] = Lexer::convertHex(q[1].uc, q[2].uc);
+            else {
+              charLen = 0;
+              break;
+            }
+          }
+          if (charLen != 0) {
+            sequence[sequenceLen] = 0;
+            const int character = decodeUTF8Sequence(sequence);
+            if (character < 0 || character >= 0x110000) {
+              charLen = 0;
+            } else if (character >= 0x10000) {
+              // Convert to surrogate pair.
+              s.append(static_cast<unsigned short>(0xD800 | ((character - 0x10000) >> 10)));
+              u = static_cast<unsigned short>(0xDC00 | ((character - 0x10000) & 0x3FF));
+            } else {
+              u = static_cast<unsigned short>(character);
+            }
+          }
+        }
+      }
+      if (charLen == 0) {
+        if (strict) {
+	  Object error = Error::create(exec, URIError);
+          exec->setException(error);
+          return error;
+        }
+        // The only case where we don't use "strict" mode is the "unescape" function.
+        // For that, it's good to support the wonky "%u" syntax for compatibility with WinIE.
+        if (k <= len - 6 && p[1] == 'u'
+            && isxdigit(p[2].uc) && isxdigit(p[3].uc)
+            && isxdigit(p[4].uc) && isxdigit(p[5].uc)) {
+	  charLen = 6;
+	  u = Lexer::convertUnicode(p[2].uc, p[3].uc, p[4].uc, p[5].uc);
+        }
+      }
+      if (charLen && (u.uc == 0 || u.uc >= 128 || !strchr(do_not_unescape, u.low()))) {
+        c = u;
+        k += charLen - 1;
+      }
+    }
+    k++;
+    s.append(c);
+  }
+  return String(s);
+}
+
 Value GlobalFuncImp::call(ExecState *exec, Object &/*thisObj*/, const List &args)
 {
   Value res;
 
-  static const char non_escape[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
-				   "abcdefghijklmnopqrstuvwxyz"
-				   "0123456789@*_+-./";
+  static const char do_not_escape[] =
+    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
+    "abcdefghijklmnopqrstuvwxyz"
+    "0123456789"
+    "*+-./@_";
+  static const char do_not_escape_when_encoding_URI_component[] =
+    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
+    "abcdefghijklmnopqrstuvwxyz"
+    "0123456789"
+    "!'()*-._~";
+  static const char do_not_escape_when_encoding_URI[] =
+    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
+    "abcdefghijklmnopqrstuvwxyz"
+    "0123456789"
+    "!#$&'()*+,-./:;=?@_~";
+  static const char do_not_unescape_when_decoding_URI[] =
+    "#$&+,/:;=?@";
 
   switch (id) {
   case Eval: { // eval()
@@ -502,54 +602,28 @@ Value GlobalFuncImp::call(ExecState *exec, Object &/*thisObj*/, const List &args
     res = Boolean(!isNaN(n) && !isInf(n));
     break;
   }
-  case Escape: {
-    UString r = "", s, str = args[0].toString(exec);
-    const UChar *c = str.data();
-    for (int k = 0; k < str.size(); k++, c++) {
-      int u = c->uc;
-      if (u > 255) {
-	char tmp[7];
-	sprintf(tmp, "%%u%04X", u);
-	s = UString(tmp);
-      } else if (strchr(non_escape, (char)u)) {
-	s = UString(c, 1);
-      } else {
-	char tmp[4];
-	sprintf(tmp, "%%%02X", u);
-	s = UString(tmp);
-      }
-      r += s;
-    }
-    res = String(r);
+  case DecodeURI:
+    res = decode(exec, args, do_not_unescape_when_decoding_URI, true);
     break;
-  }
-  case UnEscape: {
-    UString s, str = args[0].toString(exec);
-    int k = 0, len = str.size();
-    UChar u;
-    while (k < len) {
-      const UChar *c = str.data() + k;
-      if (*c == UChar('%') && k <= len - 6 && *(c+1) == UChar('u')) {
-	u = Lexer::convertUnicode((c+2)->uc, (c+3)->uc,
-				  (c+4)->uc, (c+5)->uc);
-	c = &u;
-	k += 5;
-      } else if (*c == UChar('%') && k <= len - 3) {
-	u = UChar(Lexer::convertHex((c+1)->uc, (c+2)->uc));
-	c = &u;
-	k += 2;
-      }
-      k++;
-      s += UString(c, 1);
-    }
-    res = String(s);
+  case DecodeURIComponent:
+    res = decode(exec, args, "", true);
+    break;
+  case EncodeURI:
+    res = encode(exec, args, do_not_escape_when_encoding_URI);
+    break;
+  case EncodeURIComponent:
+    res = encode(exec, args, do_not_escape_when_encoding_URI_component);
+    break;
+  case Escape:
+    res = encode(exec, args, do_not_escape);
+    break;
+  case UnEscape:
+    res = decode(exec, args, "", false);
     break;
-  }
 #ifndef NDEBUG
-  case KJSPrint: {
-    UString str = args[0].toString(exec);
-    puts(str.ascii());
-  }
+  case KJSPrint:
+    puts(args[0].toString(exec).ascii());
+    break;
 #endif
   }
 
diff --git a/JavaScriptCore/kjs/function.h b/JavaScriptCore/kjs/function.h
index 8279d7a..7c0d290 100644
--- a/JavaScriptCore/kjs/function.h
+++ b/JavaScriptCore/kjs/function.h
@@ -127,7 +127,8 @@ namespace KJS {
     virtual bool implementsCall() const;
     virtual Value call(ExecState *exec, Object &thisObj, const List &args);
     virtual CodeType codeType() const;
-    enum { Eval, ParseInt, ParseFloat, IsNaN, IsFinite, Escape, UnEscape 
+    enum { Eval, ParseInt, ParseFloat, IsNaN, IsFinite, Escape, UnEscape,
+           DecodeURI, DecodeURIComponent, EncodeURI, EncodeURIComponent
 #ifndef NDEBUG
 	   , KJSPrint
 #endif
diff --git a/JavaScriptCore/kjs/internal.cpp b/JavaScriptCore/kjs/internal.cpp
index a53dece..549be2d 100644
--- a/JavaScriptCore/kjs/internal.cpp
+++ b/JavaScriptCore/kjs/internal.cpp
@@ -664,6 +664,10 @@ void InterpreterImp::unlock()
   global.put(globExec,"isFinite",   Object(new GlobalFuncImp(globExec,funcProto,GlobalFuncImp::IsFinite,   1)), DontEnum);
   global.put(globExec,"escape",     Object(new GlobalFuncImp(globExec,funcProto,GlobalFuncImp::Escape,     1)), DontEnum);
   global.put(globExec,"unescape",   Object(new GlobalFuncImp(globExec,funcProto,GlobalFuncImp::UnEscape,   1)), DontEnum);
+  global.put(globExec,"decodeURI",  Object(new GlobalFuncImp(globExec,funcProto,GlobalFuncImp::DecodeURI,  1)), DontEnum);
+  global.put(globExec,"decodeURIComponent", Object(new GlobalFuncImp(globExec,funcProto,GlobalFuncImp::DecodeURIComponent, 1)), DontEnum);
+  global.put(globExec,"encodeURI",  Object(new GlobalFuncImp(globExec,funcProto,GlobalFuncImp::EncodeURI,  1)), DontEnum);
+  global.put(globExec,"encodeURIComponent", Object(new GlobalFuncImp(globExec,funcProto,GlobalFuncImp::EncodeURIComponent, 1)), DontEnum);
 #ifndef NDEBUG
   global.put(globExec,"kjsprint",   Object(new GlobalFuncImp(globExec,funcProto,GlobalFuncImp::KJSPrint,   1)), DontEnum);
 #endif
diff --git a/JavaScriptCore/kjs/regexp.cpp b/JavaScriptCore/kjs/regexp.cpp
index 4865c4c..610ee25 100644
--- a/JavaScriptCore/kjs/regexp.cpp
+++ b/JavaScriptCore/kjs/regexp.cpp
@@ -21,151 +21,12 @@
 
 #include "regexp.h"
 
+#include <assert.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 
-using KJS::CString;
-using KJS::RegExp;
-using KJS::UString;
-
-#ifdef HAVE_PCREPOSIX
-
-static CString convertToUTF8(const UString &s)
-{
-    // Allocate a buffer big enough to hold all the characters.
-    const int length = s.size();
-    const unsigned bufferSize = length * 3 + 1;
-    char fixedSizeBuffer[1024];
-    char *buffer;
-    if (bufferSize > sizeof(fixedSizeBuffer)) {
-        buffer = new char [bufferSize];
-    } else {
-        buffer = fixedSizeBuffer;
-    }
-
-    // Convert to runs of 8-bit characters.
-    char *p = buffer;
-    for (int i = 0; i != length; ++i) {
-        unsigned short c = s[i].unicode();
-        if (c < 0x80) {
-            *p++ = (char)c;
-        } else if (c < 0x800) {
-            *p++ = (char)((c >> 6) | 0xC0); // C0 is the 2-byte flag for UTF-8
-            *p++ = (char)((c | 0x80) & 0xBF); // next 6 bits, with high bit set
-        } else {
-            *p++ = (char)((c >> 12) | 0xE0); // E0 is the 3-byte flag for UTF-8
-            *p++ = (char)(((c >> 6) | 0x80) & 0xBF); // next 6 bits, with high bit set
-            *p++ = (char)((c | 0x80) & 0xBF); // next 6 bits, with high bit set
-        }
-    }
-    *p = 0;
-
-    // Return the result as a C string.
-    CString result(buffer);
-    if (buffer != fixedSizeBuffer) {
-        delete [] buffer;
-    }
-    return result;
-}
-
-struct StringOffset {
-    int offset;
-    int locationInOffsetsArray;
-};
-
-static int compareStringOffsets(const void *a, const void *b)
-{
-    const StringOffset *oa = static_cast<const StringOffset *>(a);
-    const StringOffset *ob = static_cast<const StringOffset *>(b);
-    
-    if (oa->offset < ob->offset) {
-        return -1;
-    }
-    if (oa->offset > ob->offset) {
-        return +1;
-    }
-    return 0;
-}
-
-const int sortedOffsetsFixedBufferSize = 128;
-
-static StringOffset *createSortedOffsetsArray(const int offsets[], int numOffsets,
-    StringOffset sortedOffsetsFixedBuffer[sortedOffsetsFixedBufferSize])
-{
-    // Allocate the sorted offsets.
-    StringOffset *sortedOffsets;
-    if (numOffsets <= sortedOffsetsFixedBufferSize) {
-        sortedOffsets = sortedOffsetsFixedBuffer;
-    } else {
-        sortedOffsets = new StringOffset [numOffsets];
-    }
-
-    // Copy offsets.
-    for (int i = 0; i != numOffsets; ++i) {
-        sortedOffsets[i].offset = offsets[i];
-        sortedOffsets[i].locationInOffsetsArray = i;
-    }
-
-    // Sort them.
-    qsort(sortedOffsets, numOffsets, sizeof(StringOffset), compareStringOffsets);
-
-    return sortedOffsets;
-}
-
-static void convertCharacterOffsetsToUTF8ByteOffsets(const char *s, int *offsets, int numOffsets)
-{
-    // Allocate buffer.
-    StringOffset fixedBuffer[sortedOffsetsFixedBufferSize];
-    StringOffset *sortedOffsets = createSortedOffsetsArray(offsets, numOffsets, fixedBuffer);
-
-    // Walk through sorted offsets and string, adjusting all the offests.
-    // Offsets that are off the ends of the string map to the edges of the string.
-    int characterOffset = 0;
-    const char *p = s;
-    for (int oi = 0; oi != numOffsets; ++oi) {
-        const int nextOffset = sortedOffsets[oi].offset;
-        while (*p && characterOffset < nextOffset) {
-            // Skip to the next character.
-            ++characterOffset;
-            do ++p; while ((*p & 0xC0) == 0x80); // if 1 of the 2 high bits is set, it's not the start of a character
-        }
-        offsets[sortedOffsets[oi].locationInOffsetsArray] = p - s;
-    }
-
-    // Free buffer.
-    if (sortedOffsets != fixedBuffer) {
-        delete [] sortedOffsets;
-    }
-}
-
-static void convertUTF8ByteOffsetsToCharacterOffsets(const char *s, int *offsets, int numOffsets)
-{
-    // Allocate buffer.
-    StringOffset fixedBuffer[sortedOffsetsFixedBufferSize];
-    StringOffset *sortedOffsets = createSortedOffsetsArray(offsets, numOffsets, fixedBuffer);
-
-    // Walk through sorted offsets and string, adjusting all the offests.
-    // Offsets that are off the end of the string map to the edges of the string.
-    int characterOffset = 0;
-    const char *p = s;
-    for (int oi = 0; oi != numOffsets; ++oi) {
-        const int nextOffset = sortedOffsets[oi].offset;
-        while (*p && (p - s) < nextOffset) {
-            // Skip to the next character.
-            ++characterOffset;
-            do ++p; while ((*p & 0xC0) == 0x80); // if 1 of the 2 high bits is set, it's not the start of a character
-        }
-        offsets[sortedOffsets[oi].locationInOffsetsArray] = characterOffset;
-    }
-
-    // Free buffer.
-    if (sortedOffsets != fixedBuffer) {
-        delete [] sortedOffsets;
-    }
-}
-
-#endif // HAVE_PCREPOSIX
+namespace KJS {
 
 RegExp::RegExp(const UString &p, int flags)
   : _flags(flags), _numSubPatterns(0)
@@ -181,7 +42,7 @@ RegExp::RegExp(const UString &p, int flags)
 
   const char *errorMessage;
   int errorOffset;
-  _regex = pcre_compile(convertToUTF8(p).c_str(), options, &errorMessage, &errorOffset, NULL);
+  _regex = pcre_compile(p.UTF8String().c_str(), options, &errorMessage, &errorOffset, NULL);
   if (!_regex) {
 #ifndef NDEBUG
     fprintf(stderr, "KJS: pcre_compile() failed with '%s'\n", errorMessage);
@@ -258,8 +119,8 @@ UString RegExp::match(const UString &s, int i, int *pos, int **ovector)
     offsetVector = new int [offsetVectorSize];
   }
 
-  const CString buffer(convertToUTF8(s));
-  convertCharacterOffsetsToUTF8ByteOffsets(buffer.c_str(), &i, 1);
+  const CString buffer(s.UTF8String());
+  convertUTF16OffsetsToUTF8Offsets(buffer.c_str(), &i, 1);
   const int numMatches = pcre_exec(_regex, NULL, buffer.c_str(), buffer.size(), i, 0, offsetVector, offsetVectorSize);
 
   if (numMatches < 0) {
@@ -272,7 +133,7 @@ UString RegExp::match(const UString &s, int i, int *pos, int **ovector)
     return UString::null();
   }
 
-  convertUTF8ByteOffsetsToCharacterOffsets(buffer.c_str(), offsetVector, (numMatches == 0 ? 1 : numMatches) * 2);
+  convertUTF8OffsetsToUTF16Offsets(buffer.c_str(), offsetVector, (numMatches == 0 ? 1 : numMatches) * 2);
 
   *pos = offsetVector[0];
   if (ovector)
@@ -314,3 +175,5 @@ UString RegExp::match(const UString &s, int i, int *pos, int **ovector)
 
 #endif
 }
+
+} // namespace KJS
diff --git a/JavaScriptCore/kjs/ustring.cpp b/JavaScriptCore/kjs/ustring.cpp
index b6fae6c..1922249 100644
--- a/JavaScriptCore/kjs/ustring.cpp
+++ b/JavaScriptCore/kjs/ustring.cpp
@@ -42,22 +42,30 @@
 #include "dtoa.h"
 
 namespace KJS {
-  extern const double NaN;
-  extern const double Inf;
-};
 
-using namespace KJS;
+extern const double NaN;
+extern const double Inf;
 
 CString::CString(const char *c)
 {
-  data = new char[strlen(c)+1];
+  length = strlen(c);
+  data = new char[length+1];
   strcpy(data, c);
 }
 
+CString::CString(const char *c, int len)
+{
+  length = len;
+  data = new char[len+1];
+  memcpy(data, c, len);
+  data[len] = 0;
+}
+
 CString::CString(const CString &b)
 {
-  data = new char[b.size()+1];
-  strcpy(data, b.c_str());
+  length = b.length;
+  data = new char[length+1];
+  memcpy(data, b.data, length);
 }
 
 CString::~CString()
@@ -68,14 +76,13 @@ CString::~CString()
 CString &CString::append(const CString &t)
 {
   char *n;
-  if (data) {
-    n = new char[strlen(data)+t.size()+1];
-    strcpy(n, data);
-  } else {
-    n = new char[t.size()+1];
-    n[0] = '\0';
-  }
-  strcat(n, t.c_str());
+  n = new char[length+t.length+1];
+  if (length)
+    memcpy(n, data, length);
+  if (t.length)
+    memcpy(n+length, t.data, t.length);
+  length += t.length;
+  n[length] = 0;
 
   delete [] data;
   data = n;
@@ -87,7 +94,8 @@ CString &CString::operator=(const char *c)
 {
   if (data)
     delete [] data;
-  data = new char[strlen(c)+1];
+  length = strlen(c);
+  data = new char[length+1];
   strcpy(data, c);
 
   return *this;
@@ -100,20 +108,17 @@ CString &CString::operator=(const CString &str)
 
   if (data)
     delete [] data;
-  data = new char[str.size()+1];
-  strcpy(data, str.c_str());
+  length = str.length;
+  data = new char[length + 1];
+  memcpy(data, str.data, length + 1);
 
   return *this;
 }
 
-int CString::size() const
-{
-  return strlen(data);
-}
-
 bool KJS::operator==(const KJS::CString& c1, const KJS::CString& c2)
 {
-  return (strcmp(c1.c_str(), c2.c_str()) == 0);
+  int len = c1.size();
+  return len == c2.size() && (len == 0 || memcmp(c1.c_str(), c2.c_str(), len) == 0);
 }
 
 UString::Rep UString::Rep::null = { 0, 0, 0, 1, 1 };
@@ -470,6 +475,53 @@ UString &UString::append(const UString &t)
   return *this;
 }
 
+UString &UString::append(const char *t)
+{
+  int l = size();
+  int tLen = strlen(t);
+  int newLen = l + tLen;
+  if (rep->rc == 1 && newLen <= rep->capacity) {
+    for (int i = 0; i < tLen; ++i)
+      rep->dat[l+i] = t[i];
+    rep->len = newLen;
+    rep->_hash = 0;
+    return *this;
+  }
+  
+  int newCapacity = (newLen * 3 + 1) / 2;
+  UChar *n = new UChar[newCapacity];
+  memcpy(n, data(), l * sizeof(UChar));
+  for (int i = 0; i < tLen; ++i)
+    n[l+i] = t[i];
+  release();
+  rep = Rep::create(n, newLen);
+  rep->capacity = newCapacity;
+
+  return *this;
+}
+
+UString &UString::append(unsigned short c)
+{
+  int l = size();
+  int newLen = l + 1;
+  if (rep->rc == 1 && newLen <= rep->capacity) {
+    rep->dat[l] = c;
+    rep->len = newLen;
+    rep->_hash = 0;
+    return *this;
+  }
+  
+  int newCapacity = (newLen * 3 + 1) / 2;
+  UChar *n = new UChar[newCapacity];
+  memcpy(n, data(), l * sizeof(UChar));
+  n[l] = c;
+  release();
+  rep = Rep::create(n, newLen);
+  rep->capacity = newCapacity;
+
+  return *this;
+}
+
 CString UString::cstring() const
 {
   return ascii();
@@ -894,3 +946,241 @@ int KJS::compare(const UString& s1, const UString& s2)
   }
   return (l1 < l2) ? 1 : -1;
 }
+
+// Given a first byte, gives the length of the UTF-8 sequence it begins.
+// Returns 0 for bytes that are not legal starts of UTF-8 sequences.
+// Only allows sequences of up to 4 bytes, since that works for all Unicode characters (U-00000000 to U-0010FFFF).
+int UTF8SequenceLength(char b0)
+{
+  if ((b0 & 0x80) == 0)
+    return 1;
+  if ((b0 & 0xC0) != 0xC0)
+    return 0;
+  if ((b0 & 0xE0) == 0xC0)
+    return 2;
+  if ((b0 & 0xF0) == 0xE0)
+    return 3;
+  if ((b0 & 0xF8) == 0xF0)
+    return 4;
+  return 0;
+}
+
+// Takes a null-terminated C-style string with a UTF-8 sequence in it and converts it to a character.
+// Only allows Unicode characters (U-00000000 to U-0010FFFF).
+// Returns -1 if the sequence is not valid (including presence of extra bytes).
+int decodeUTF8Sequence(const char *sequence)
+{
+  // Handle 0-byte sequences (never valid).
+  const unsigned char b0 = sequence[0];
+  const int length = UTF8SequenceLength(b0);
+  if (length == 0)
+    return -1;
+
+  // Handle 1-byte sequences (plain ASCII).
+  const unsigned char b1 = sequence[1];
+  if (length == 1) {
+    if (b1)
+      return -1;
+    return b0;
+  }
+
+  // Handle 2-byte sequences.
+  if ((b1 & 0xC0) != 0x80)
+    return -1;
+  const unsigned char b2 = sequence[2];
+  if (length == 2) {
+    if (b2)
+      return -1;
+    const int c = ((b0 & 0x1F) << 6) | (b1 & 0x3F);
+    if (c < 0x80)
+      return -1;
+    return c;
+  }
+
+  // Handle 3-byte sequences.
+  if ((b2 & 0xC0) != 0x80)
+    return -1;
+  const unsigned char b3 = sequence[3];
+  if (length == 3) {
+    if (b3)
+      return -1;
+    const int c = ((b0 & 0xF) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
+    if (c < 0x800)
+      return -1;
+    // UTF-16 surrogates should never appear in UTF-8 data.
+    if (c >= 0xD800 && c <= 0xDFFF)
+      return -1;
+    // Backwards BOM and U+FFFF should never appear in UTF-8 data.
+    if (c == 0xFFFE || c == 0xFFFF)
+      return -1;
+    return c;
+  }
+
+  // Handle 4-byte sequences.
+  if ((b3 & 0xC0) != 0x80)
+    return -1;
+  const unsigned char b4 = sequence[4];
+  if (length == 4) {
+    if (b4)
+      return -1;
+    const int c = ((b0 & 0x7) << 18) | ((b1 & 0x3F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F);
+    if (c < 0x10000 || c > 0x10FFFF)
+      return -1;
+    return c;
+  }
+
+  return -1;
+}
+
+CString UString::UTF8String() const
+{
+  // Allocate a buffer big enough to hold all the characters.
+  const int length = size();
+  const unsigned bufferSize = length * 3;
+  char fixedSizeBuffer[1024];
+  char *buffer;
+  if (bufferSize > sizeof(fixedSizeBuffer)) {
+    buffer = new char [bufferSize];
+  } else {
+    buffer = fixedSizeBuffer;
+  }
+
+  // Convert to runs of 8-bit characters.
+  char *p = buffer;
+  const UChar *d = data();
+  for (int i = 0; i != length; ++i) {
+    unsigned short c = d[i].unicode();
+    if (c < 0x80) {
+      *p++ = (char)c;
+    } else if (c < 0x800) {
+      *p++ = (char)((c >> 6) | 0xC0); // C0 is the 2-byte flag for UTF-8
+      *p++ = (char)((c | 0x80) & 0xBF); // next 6 bits, with high bit set
+    } else if (c >= 0xD800 && c <= 0xDBFF && i + 1 < length && d[i+1].uc >= 0xDC00 && d[i+1].uc <= 0xDFFF) {
+      unsigned sc = 0x10000 + (((c & 0x3FF) << 10) | (d[i+1].uc & 0x3FF));
+      *p++ = (char)((sc >> 18) | 0xF0); // F0 is the 4-byte flag for UTF-8
+      *p++ = (char)(((sc >> 12) | 0x80) & 0xBF); // next 6 bits, with high bit set
+      *p++ = (char)(((sc >> 6) | 0x80) & 0xBF); // next 6 bits, with high bit set
+      *p++ = (char)((sc | 0x80) & 0xBF); // next 6 bits, with high bit set
+      ++i;
+    } else {
+      *p++ = (char)((c >> 12) | 0xE0); // E0 is the 3-byte flag for UTF-8
+      *p++ = (char)(((c >> 6) | 0x80) & 0xBF); // next 6 bits, with high bit set
+      *p++ = (char)((c | 0x80) & 0xBF); // next 6 bits, with high bit set
+    }
+  }
+
+  // Return the result as a C string.
+  CString result(buffer, p - buffer);
+  if (buffer != fixedSizeBuffer) {
+    delete [] buffer;
+  }
+  return result;
+}
+
+struct StringOffset {
+    int offset;
+    int locationInOffsetsArray;
+};
+
+static int compareStringOffsets(const void *a, const void *b)
+{
+    const StringOffset *oa = static_cast<const StringOffset *>(a);
+    const StringOffset *ob = static_cast<const StringOffset *>(b);
+    
+    if (oa->offset < ob->offset) {
+        return -1;
+    }
+    if (oa->offset > ob->offset) {
+        return +1;
+    }
+    return 0;
+}
+
+const int sortedOffsetsFixedBufferSize = 128;
+
+static StringOffset *createSortedOffsetsArray(const int offsets[], int numOffsets,
+    StringOffset sortedOffsetsFixedBuffer[sortedOffsetsFixedBufferSize])
+{
+    // Allocate the sorted offsets.
+    StringOffset *sortedOffsets;
+    if (numOffsets <= sortedOffsetsFixedBufferSize) {
+        sortedOffsets = sortedOffsetsFixedBuffer;
+    } else {
+        sortedOffsets = new StringOffset [numOffsets];
+    }
+
+    // Copy offsets.
+    for (int i = 0; i != numOffsets; ++i) {
+        sortedOffsets[i].offset = offsets[i];
+        sortedOffsets[i].locationInOffsetsArray = i;
+    }
+
+    // Sort them.
+    qsort(sortedOffsets, numOffsets, sizeof(StringOffset), compareStringOffsets);
+
+    return sortedOffsets;
+}
+
+// Note: This function assumes valid UTF-8.
+// It can even go into an infinite loop if the passed in string is not valid UTF-8.
+void convertUTF16OffsetsToUTF8Offsets(const char *s, int *offsets, int numOffsets)
+{
+    // Allocate buffer.
+    StringOffset fixedBuffer[sortedOffsetsFixedBufferSize];
+    StringOffset *sortedOffsets = createSortedOffsetsArray(offsets, numOffsets, fixedBuffer);
+
+    // Walk through sorted offsets and string, adjusting all the offsets.
+    // Offsets that are off the ends of the string map to the edges of the string.
+    int UTF16Offset = 0;
+    const char *p = s;
+    for (int oi = 0; oi != numOffsets; ++oi) {
+        const int nextOffset = sortedOffsets[oi].offset;
+        while (*p && UTF16Offset < nextOffset) {
+            // Skip to the next character.
+            const int sequenceLength = UTF8SequenceLength(*p);
+            assert(sequenceLength >= 1 && sequenceLength <= 4);
+            p += sequenceLength;
+            // Characters that take a 4-byte sequence in UTF-8 take two 16-bit code units (a surrogate pair) in UTF-16.
+            UTF16Offset += sequenceLength < 4 ? 1 : 2;
+        }
+        offsets[sortedOffsets[oi].locationInOffsetsArray] = p - s;
+    }
+
+    // Free buffer.
+    if (sortedOffsets != fixedBuffer) {
+        delete [] sortedOffsets;
+    }
+}
+
+// Note: This function assumes valid UTF-8.
+// It can even go into an infinite loop if the passed in string is not valid UTF-8.
+void convertUTF8OffsetsToUTF16Offsets(const char *s, int *offsets, int numOffsets)
+{
+    // Allocate buffer.
+    StringOffset fixedBuffer[sortedOffsetsFixedBufferSize];
+    StringOffset *sortedOffsets = createSortedOffsetsArray(offsets, numOffsets, fixedBuffer);
+
+    // Walk through sorted offsets and string, adjusting all the offsets.
+    // Offsets that are off the end of the string map to the edges of the string.
+    int UTF16Offset = 0;
+    const char *p = s;
+    for (int oi = 0; oi != numOffsets; ++oi) {
+        const int nextOffset = sortedOffsets[oi].offset;
+        while (*p && (p - s) < nextOffset) {
+            // Skip to the next character.
+            const int sequenceLength = UTF8SequenceLength(*p);
+            assert(sequenceLength >= 1 && sequenceLength <= 4);
+            p += sequenceLength;
+            // Characters that take a 4-byte sequence in UTF-8 take two 16-bit code units (a surrogate pair) in UTF-16.
+            UTF16Offset += sequenceLength < 4 ? 1 : 2;
+        }
+        offsets[sortedOffsets[oi].locationInOffsetsArray] = UTF16Offset;
+    }
+
+    // Free buffer.
+    if (sortedOffsets != fixedBuffer) {
+        delete [] sortedOffsets;
+    }
+}
+
+} // namespace KJS
diff --git a/JavaScriptCore/kjs/ustring.h b/JavaScriptCore/kjs/ustring.h
index 9d59cf1..5765674 100644
--- a/JavaScriptCore/kjs/ustring.h
+++ b/JavaScriptCore/kjs/ustring.h
@@ -169,8 +169,9 @@ namespace KJS {
    */
   class CString {
   public:
-    CString() : data(0) { }
+    CString() : data(0), length(0) { }
     CString(const char *c);
+    CString(const char *c, int len);
     CString(const CString &);
 
     ~CString();
@@ -180,10 +181,11 @@ namespace KJS {
     CString &operator=(const CString &);
     CString &operator+=(const CString &c) { return append(c); }
 
-    int size() const;
+    int size() const { return length; }
     const char *c_str() const { return data; }
   private:
     char *data;
+    int length;
   };
 
   /**
@@ -300,6 +302,10 @@ namespace KJS {
      * Append another string.
      */
     UString &append(const UString &);
+    UString &append(const char *);
+    UString &append(unsigned short);
+    UString &append(char c) { return append(static_cast<unsigned short>(static_cast<unsigned char>(c))); }
+    UString &append(UChar c) { return append(c.uc); }
 
     /**
      * @return The string converted to the 8-bit string type @ref CString().
@@ -313,6 +319,16 @@ namespace KJS {
      * instances.
      */
     char *ascii() const;
+
+    /**
+     * Convert the string to UTF-8, assuming it is UTF-16 encoded.
+     * Since this function is tolerant of badly formed UTF-16, it can create UTF-8
+     * strings that are invalid because they have characters in the range
+     * U+D800-U+DFFF, U+FFFE, or U+FFFF, but the UTF-8 string is guaranteed to
+     * be otherwise valid.
+     */
+    CString UTF8String() const;
+
     /**
      * @see UString(const QString&).
      */
@@ -335,6 +351,7 @@ namespace KJS {
      * Appends the specified string.
      */
     UString &operator+=(const UString &s) { return append(s); }
+    UString &operator+=(const char *s) { return append(s); }
 
     /**
      * @return A pointer to the internal Unicode data.
@@ -454,6 +471,26 @@ namespace KJS {
   
   int compare(const UString &, const UString &);
 
+  // Given a first byte, gives the length of the UTF-8 sequence it begins.
+  // Returns 0 for bytes that are not legal starts of UTF-8 sequences.
+  // Only allows sequences of up to 4 bytes, since that works for all Unicode characters (U-00000000 to U-0010FFFF).
+  int UTF8SequenceLength(char);
+
+  // Takes a null-terminated C-style string with a UTF-8 sequence in it and converts it to a character.
+  // Only allows Unicode characters (U-00000000 to U-0010FFFF).
+  // Returns -1 if the sequence is not valid (including presence of extra bytes).
+  int decodeUTF8Sequence(const char *);
+
+  // Given a UTF-8 string, converts offsets from the UTF-16 form of the string into offsets into the UTF-8 string.
+  // Note: This function can overrun the buffer if the string contains a partial UTF-8 sequence, so it should
+  // not be called with strings that might contain such sequences.
+  void convertUTF16OffsetsToUTF8Offsets(const char *UTF8String, int *offsets, int numOffsets);
+
+  // Given a UTF-8 string, converts offsets from the UTF-8 string into offsets into the UTF-16 form of the string.
+  // Note: This function can overrun the buffer if the string contains a partial UTF-8 sequence, so it should
+  // not be called with strings that might contain such sequences.
+  void convertUTF8OffsetsToUTF16Offsets(const char *UTF8String, int *offsets, int numOffsets);
+
 }; // namespace
 
 #endif
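
The offset conversion declared above walks the UTF-8 string once, counting one UTF-16 code unit for each 1- to 3-byte sequence and two for each 4-byte sequence. Below is a minimal standalone sketch of that mapping for a single offset. The names sequenceLength and utf16OffsetToUTF8Offset are hypothetical and not part of this commit, and unlike the committed code the sketch stops at an invalid lead byte or truncated sequence rather than assuming valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the commit): length of the UTF-8 sequence
// that starts with byte b0, or 0 if b0 cannot start a sequence.
static int sequenceLength(unsigned char b0)
{
    if ((b0 & 0x80) == 0) return 1;
    if ((b0 & 0xE0) == 0xC0) return 2;
    if ((b0 & 0xF0) == 0xE0) return 3;
    if ((b0 & 0xF8) == 0xF0) return 4;
    return 0;
}

// Hypothetical helper (not from the commit): map one UTF-16 code-unit offset
// to the corresponding byte offset in the UTF-8 form of the same string.
// Offsets past the end map to the end of the string.
static std::size_t utf16OffsetToUTF8Offset(const std::string& utf8, std::size_t utf16Offset)
{
    std::size_t bytePos = 0; // position in UTF-8 bytes
    std::size_t unitPos = 0; // position in UTF-16 code units
    while (bytePos < utf8.size() && unitPos < utf16Offset) {
        const int len = sequenceLength(static_cast<unsigned char>(utf8[bytePos]));
        // Assumption of this sketch: stop on bad input instead of looping forever.
        if (len == 0 || bytePos + len > utf8.size())
            break;
        bytePos += len;
        // A 4-byte UTF-8 sequence is a surrogate pair (two code units) in UTF-16.
        unitPos += (len < 4) ? 1 : 2;
    }
    assert(bytePos <= utf8.size());
    return bytePos;
}
```

For the 11-byte string "a", U+00E9, U+20AC, U+1F600, "z", UTF-16 offset 3 (the start of the surrogate pair) maps to byte 6, while offset 4 (the middle of the pair) maps to byte 10, just past the 4-byte sequence.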

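The UTF-16 to UTF-8 conversion in UString::UTF8String can be illustrated in isolation. The following is a rough sketch only, written against std::u16string (C++11) rather than the UString and CString classes touched by this commit. The function name utf16ToUTF8 is hypothetical. It checks that the low surrogate at index i + 1 is in bounds and within U+DC00-U+DFFF before combining the pair, and, like the tolerant conversion documented in ustring.h, it passes unpaired surrogates through as 3-byte sequences.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the commit): encode UTF-16 code units as UTF-8.
static std::string utf16ToUTF8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const unsigned c = in[i];
        if (c < 0x80) {
            // 1 byte: plain ASCII.
            out += static_cast<char>(c);
        } else if (c < 0x800) {
            // 2 bytes: 110xxxxx 10xxxxxx.
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else if (c >= 0xD800 && c <= 0xDBFF && i + 1 < in.size()
                   && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // 4 bytes: combine the high and low surrogate into one code point.
            const unsigned sc = 0x10000 + (((c & 0x3FF) << 10) | (in[i + 1] & 0x3FF));
            out += static_cast<char>(0xF0 | (sc >> 18));
            out += static_cast<char>(0x80 | ((sc >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((sc >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (sc & 0x3F));
            ++i; // the low surrogate has been consumed
        } else {
            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx (also used for unpaired surrogates).
            out += static_cast<char>(0xE0 | (c >> 12));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    // Each code unit yields at most 3 bytes, which is why a buffer of
    // three bytes per code unit is always large enough.
    assert(out.size() <= in.size() * 3);
    return out;
}
```

With this sketch, the pair 0xD83D 0xDE00 (U+1F600) encodes as F0 9F 98 80, and a lone 0xD800 encodes as ED A0 80, which is not valid UTF-8.
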
-- 
WebKit Debian packaging


