[SCM] WebKit Debian packaging branch, debian/unstable, updated. debian/1.1.15-1-40151-g37bb677
darin
darin at 268f45cc-cd09-0410-ab3c-d52691b4dbfc
Sat Sep 26 07:52:07 UTC 2009
The following commit has been merged in the debian/unstable branch:
commit 6c9bbfd314147447f01a8828b73b5b8e41e75984
Author: darin <darin at 268f45cc-cd09-0410-ab3c-d52691b4dbfc>
Date: Mon Aug 18 18:51:25 2003 +0000
Reviewed by Maciej.
- fixed 3247528 -- encodeURI missing from JavaScriptCore (needed by Crystal Reports)
- fixed 3381297 -- escape method does not escape the null character
- fixed 3381299 -- escape method produces incorrect escape sequences ala WinIE, rather than correct ala Gecko
- fixed 3381303 -- unescape method treats escape sequences as Latin-1 ala WinIE rather than as UTF-8 ala Gecko
- fixed 3381304 -- unescape method garbles strings with bad escape sequences in them
* kjs/function.h: Added constants for decodeURI, decodeURIComponent, encodeURI, and
encodeURIComponent.
* kjs/function.cpp:
(encode): Added. New helper function for escape, encodeURI, and encodeURIComponent.
(decode): Added. New helper function for unescape, decodeURI, and decodeURIComponent.
(GlobalFuncImp::call): Added decodeURI, decodeURIComponent, encodeURI, and encodeURIComponent
implementations. Changed escape and unescape to use new helper functions, which fixes
the four problems above.
* kjs/internal.cpp: (InterpreterImp::initGlobalObject): Add decodeURI, decodeURIComponent,
encodeURI, and encodeURIComponent to the global object.
* kjs/ustring.h: Added a length to the CString class so it can hold strings with null
characters in them, not just null-terminated strings. This allows a null character from
a UString to survive the process of UTF-16 to UTF-8 conversion. Added overloads to
UString::append, UString::UTF8String, UTF8SequenceLength, decodeUTF8Sequence,
convertUTF16OffsetsToUTF8Offsets, and convertUTF8OffsetsToUTF16Offsets.
* kjs/ustring.cpp:
(CString::CString): Set up the length properly in all the constructors. Also add a new
constructor that takes a length.
(CString::append): Use and set the length properly.
(CString::operator=): Use and set the length properly.
(operator==): Use the length and memcmp instead of strcmp.
(UString::append): Added new overloads for const char * and for a single string to make
it more efficient to build up a UString from pieces. The old way, a UString was created
and destroyed each time you appended.
(UTF8SequenceLength): New. Helper for decoding UTF-8.
(decodeUTF8Sequence): New. Helper for decoding UTF-8.
(UString::UTF8String): New. Decodes from UTF-16 to UTF-8. Same as the function that
was in regexp.cpp, except has proper handling for UTF-16 surrogates.
(compareStringOffsets): Moved from regexp.cpp.
(createSortedOffsetsArray): Moved from regexp.cpp.
(convertUTF16OffsetsToUTF8Offsets): New. Converts UTF-16 offsets to UTF-8 offsets, given
a UTF-8 string. Same as the function that was in regexp.cpp, except has proper handling
for UTF-16 surrogates.
(convertUTF8OffsetsToUTF16Offsets): New. Converts UTF-8 offsets to UTF-16 offsets, given
a UTF-8 string. Same as the function that was in regexp.cpp, except has proper handling
for UTF-16 surrogates.
- fixed 3381296 -- regular expression matches with UTF-16 surrogates will treat sequences as two characters
* kjs/regexp.cpp:
(RegExp::RegExp): Use the new UString::UTF8String function instead of a function in this file.
(RegExp::match): Use the new convertUTF16OffsetsToUTF8Offsets (and the corresponding
reverse) instead of convertCharacterOffsetsToUTF8ByteOffsets in this file.
git-svn-id: http://svn.webkit.org/repository/webkit/trunk@4837 268f45cc-cd09-0410-ab3c-d52691b4dbfc
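The surrogate handling described in the log above can be illustrated with a minimal standalone sketch. This is not the KJS code: it uses std::u16string and std::string instead of UString and CString, and the name utf16ToUtf8 is hypothetical. It assumes the same policy as the patch, where a valid high/low surrogate pair becomes one 4-byte UTF-8 sequence and anything else in that range falls through to the 3-byte branch.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of KJS): converts UTF-16 code units to UTF-8.
// Because the result is a std::string with an explicit length, an embedded
// null character survives the conversion.
std::string utf16ToUtf8(const std::u16string &in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned c = in[i];
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else if (c < 0x800) {
            out += static_cast<char>(0xC0 | (c >> 6));          // 2-byte lead
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else if (c >= 0xD800 && c <= 0xDBFF && i + 1 < in.size()
                   && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // High surrogate followed by low surrogate: one code point above U+FFFF.
            unsigned sc = 0x10000 + (((c & 0x3FF) << 10) | (in[i + 1] & 0x3FF));
            out += static_cast<char>(0xF0 | (sc >> 18));        // 4-byte lead
            out += static_cast<char>(0x80 | ((sc >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((sc >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (sc & 0x3F));
            ++i; // the low surrogate has been consumed as well
        } else {
            out += static_cast<char>(0xE0 | (c >> 12));         // 3-byte lead
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

Note the bounds check `i + 1 < in.size()` before the low surrogate is read; without it the last code unit of the string could trigger a read past the end.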
diff --git a/JavaScriptCore/ChangeLog b/JavaScriptCore/ChangeLog
index 3d9a610..e8fc08d 100644
--- a/JavaScriptCore/ChangeLog
+++ b/JavaScriptCore/ChangeLog
@@ -1,3 +1,60 @@
+2003-08-17 Darin Adler <darin at apple.com>
+
+ Reviewed by Maciej.
+
+ - fixed 3247528 -- encodeURI missing from JavaScriptCore (needed by Crystal Reports)
+ - fixed 3381297 -- escape method does not escape the null character
+ - fixed 3381299 -- escape method produces incorrect escape sequences ala WinIE, rather than correct ala Gecko
+ - fixed 3381303 -- unescape method treats escape sequences as Latin-1 ala WinIE rather than as UTF-8 ala Gecko
+ - fixed 3381304 -- unescape method garbles strings with bad escape sequences in them
+
+ * kjs/function.h: Added constants for decodeURI, decodeURIComponent, encodeURI, and
+ encodeURIComponent.
+ * kjs/function.cpp:
+ (encode): Added. New helper function for escape, encodeURI, and encodeURIComponent.
+ (decode): Added. New helper function for unescape, decodeURI, and decodeURIComponent.
+ (GlobalFuncImp::call): Added decodeURI, decodeURIComponent, encodeURI, and encodeURIComponent
+ implementations. Changed escape and unescape to use new helper functions, which fixes
+ the four problems above.
+
+ * kjs/internal.cpp: (InterpreterImp::initGlobalObject): Add decodeURI, decodeURIComponent,
+ encodeURI, and encodeURIComponent to the global object.
+
+ * kjs/ustring.h: Added a length to the CString class so it can hold strings with null
+ characters in them, not just null-terminated strings. This allows a null character from
+ a UString to survive the process of UTF-16 to UTF-8 decoding. Added overloads to
+ UString::append, UString::UTF8String, UTF8SequenceLength, decodeUTF8Sequence,
+ convertUTF16OffsetsToUTF8Offsets, and convertUTF8OffsetsToUTF16Offsets.
+
+ * kjs/ustring.cpp:
+ (CString::CString): Set up the length properly in all the constructors. Also add a new
+ constructor that takes a length.
+ (CString::append): Use and set the length properly.
+ (CString::operator=): Use and set the length properly.
+ (operator==): Use the length and memcmp instead of strcmp.
+ (UString::append): Added new overloads for const char * and for a single string to make
+ it more efficient to build up a UString from pieces. The old way, a UString was created
+ and destroyed each time you appended.
+ (UTF8SequenceLength): New. Helper for decoding UTF-8.
+ (decodeUTF8Sequence): New. Helper for decoding UTF-8.
+ (UString::UTF8String): New. Decodes from UTF-16 to UTF-8. Same as the function that
+ was in regexp.cpp, except has proper handling for UTF-16 surrogates.
+ (compareStringOffsets): Moved from regexp.cpp.
+ (createSortedOffsetsArray): Moved from regexp.cpp.
+ (convertUTF16OffsetsToUTF8Offsets): New. Converts UTF-16 offsets to UTF-8 offsets, given
+ a UTF-8 string. Same as the function that was in regexp.cpp, except has proper handling
+ for UTF-16 surrogates.
+ (convertUTF8OffsetsToUTF16Offsets): New. Converts UTF-8 offsets to UTF-16 offsets, given
+ a UTF-8 string. Same as the function that was in regexp.cpp, except has proper handling
+ for UTF-16 surrogates.
+
+ - fixed 3381296 -- regular expression matches with UTF-16 surrogates will treat sequences as two characters
+
+ * kjs/regexp.cpp:
+ (RegExp::RegExp): Use the new UString::UTF8String function instead of a function in this file.
+ (RegExp::match): Use the new convertUTF16OffsetsToUTF8Offsets (and the corresponding
+ reverse) instead of convertCharacterOffsetsToUTF8ByteOffsets in this file.
+
=== Safari-93 ===
2003-08-14 Vicki Murley <vicki at apple.com>
diff --git a/JavaScriptCore/ChangeLog-2003-10-25 b/JavaScriptCore/ChangeLog-2003-10-25
index 3d9a610..e8fc08d 100644
--- a/JavaScriptCore/ChangeLog-2003-10-25
+++ b/JavaScriptCore/ChangeLog-2003-10-25
@@ -1,3 +1,60 @@
+2003-08-17 Darin Adler <darin at apple.com>
+
+ Reviewed by Maciej.
+
+ - fixed 3247528 -- encodeURI missing from JavaScriptCore (needed by Crystal Reports)
+ - fixed 3381297 -- escape method does not escape the null character
+ - fixed 3381299 -- escape method produces incorrect escape sequences ala WinIE, rather than correct ala Gecko
+ - fixed 3381303 -- unescape method treats escape sequences as Latin-1 ala WinIE rather than as UTF-8 ala Gecko
+ - fixed 3381304 -- unescape method garbles strings with bad escape sequences in them
+
+ * kjs/function.h: Added constants for decodeURI, decodeURIComponent, encodeURI, and
+ encodeURIComponent.
+ * kjs/function.cpp:
+ (encode): Added. New helper function for escape, encodeURI, and encodeURIComponent.
+ (decode): Added. New helper function for unescape, decodeURI, and decodeURIComponent.
+ (GlobalFuncImp::call): Added decodeURI, decodeURIComponent, encodeURI, and encodeURIComponent
+ implementations. Changed escape and unescape to use new helper functions, which fixes
+ the four problems above.
+
+ * kjs/internal.cpp: (InterpreterImp::initGlobalObject): Add decodeURI, decodeURIComponent,
+ encodeURI, and encodeURIComponent to the global object.
+
+ * kjs/ustring.h: Added a length to the CString class so it can hold strings with null
+ characters in them, not just null-terminated strings. This allows a null character from
+ a UString to survive the process of UTF-16 to UTF-8 decoding. Added overloads to
+ UString::append, UString::UTF8String, UTF8SequenceLength, decodeUTF8Sequence,
+ convertUTF16OffsetsToUTF8Offsets, and convertUTF8OffsetsToUTF16Offsets.
+
+ * kjs/ustring.cpp:
+ (CString::CString): Set up the length properly in all the constructors. Also add a new
+ constructor that takes a length.
+ (CString::append): Use and set the length properly.
+ (CString::operator=): Use and set the length properly.
+ (operator==): Use the length and memcmp instead of strcmp.
+ (UString::append): Added new overloads for const char * and for a single string to make
+ it more efficient to build up a UString from pieces. The old way, a UString was created
+ and destroyed each time you appended.
+ (UTF8SequenceLength): New. Helper for decoding UTF-8.
+ (decodeUTF8Sequence): New. Helper for decoding UTF-8.
+ (UString::UTF8String): New. Decodes from UTF-16 to UTF-8. Same as the function that
+ was in regexp.cpp, except has proper handling for UTF-16 surrogates.
+ (compareStringOffsets): Moved from regexp.cpp.
+ (createSortedOffsetsArray): Moved from regexp.cpp.
+ (convertUTF16OffsetsToUTF8Offsets): New. Converts UTF-16 offsets to UTF-8 offsets, given
+ a UTF-8 string. Same as the function that was in regexp.cpp, except has proper handling
+ for UTF-16 surrogates.
+ (convertUTF8OffsetsToUTF16Offsets): New. Converts UTF-8 offsets to UTF-16 offsets, given
+ a UTF-8 string. Same as the function that was in regexp.cpp, except has proper handling
+ for UTF-16 surrogates.
+
+ - fixed 3381296 -- regular expression matches with UTF-16 surrogates will treat sequences as two characters
+
+ * kjs/regexp.cpp:
+ (RegExp::RegExp): Use the new UString::UTF8String function instead of a function in this file.
+ (RegExp::match): Use the new convertUTF16OffsetsToUTF8Offsets (and the corresponding
+ reverse) instead of convertCharacterOffsetsToUTF8ByteOffsets in this file.
+
=== Safari-93 ===
2003-08-14 Vicki Murley <vicki at apple.com>
diff --git a/JavaScriptCore/kjs/function.cpp b/JavaScriptCore/kjs/function.cpp
index d15b501..19b6fd2 100644
--- a/JavaScriptCore/kjs/function.cpp
+++ b/JavaScriptCore/kjs/function.cpp
@@ -408,13 +408,113 @@ bool GlobalFuncImp::implementsCall() const
return true;
}
+static Value encode(ExecState *exec, const List &args, const char *do_not_escape)
+{
+ UString r = "", s, str = args[0].toString(exec);
+ CString cstr = str.UTF8String();
+ const char *p = cstr.c_str();
+ for (int k = 0; k < cstr.size(); k++, p++) {
+ char c = *p;
+ if (c && strchr(do_not_escape, c)) {
+ r.append(c);
+ } else {
+ char tmp[4];
+ sprintf(tmp, "%%%02X", (unsigned char)c);
+ r += tmp;
+ }
+ }
+ return String(r);
+}
+
+static Value decode(ExecState *exec, const List &args, const char *do_not_unescape, bool strict)
+{
+ UString s = "", str = args[0].toString(exec);
+ int k = 0, len = str.size();
+ const UChar *d = str.data();
+ UChar u;
+ while (k < len) {
+ const UChar *p = d + k;
+ UChar c = *p;
+ if (c == '%') {
+ int charLen = 0;
+ if (k <= len - 3 && isxdigit(p[1].uc) && isxdigit(p[2].uc)) {
+ const char b0 = Lexer::convertHex(p[1].uc, p[2].uc);
+ const int sequenceLen = UTF8SequenceLength(b0);
+ if (sequenceLen != 0 && k <= len - sequenceLen * 3) {
+ charLen = sequenceLen * 3;
+ char sequence[5];
+ sequence[0] = b0;
+ for (int i = 1; i < sequenceLen; ++i) {
+ const UChar *q = p + i * 3;
+ if (q[0] == '%' && isxdigit(q[1].uc) && isxdigit(q[2].uc))
+ sequence[i] = Lexer::convertHex(q[1].uc, q[2].uc);
+ else {
+ charLen = 0;
+ break;
+ }
+ }
+ if (charLen != 0) {
+ sequence[sequenceLen] = 0;
+ const int character = decodeUTF8Sequence(sequence);
+ if (character < 0 || character >= 0x110000) {
+ charLen = 0;
+ } else if (character >= 0x10000) {
+ // Convert to surrogate pair.
+ s.append(static_cast<unsigned short>(0xD800 | ((character - 0x10000) >> 10)));
+ u = static_cast<unsigned short>(0xDC00 | ((character - 0x10000) & 0x3FF));
+ } else {
+ u = static_cast<unsigned short>(character);
+ }
+ }
+ }
+ }
+ if (charLen == 0) {
+ if (strict) {
+ Object error = Error::create(exec, URIError);
+ exec->setException(error);
+ return error;
+ }
+ // The only case where we don't use "strict" mode is the "unescape" function.
+ // For that, it's good to support the wonky "%u" syntax for compatibility with WinIE.
+ if (k <= len - 6 && p[1] == 'u'
+ && isxdigit(p[2].uc) && isxdigit(p[3].uc)
+ && isxdigit(p[4].uc) && isxdigit(p[5].uc)) {
+ charLen = 6;
+ u = Lexer::convertUnicode(p[2].uc, p[3].uc, p[4].uc, p[5].uc);
+ }
+ }
+ if (charLen && (u.uc == 0 || u.uc >= 128 || !strchr(do_not_unescape, u.low()))) {
+ c = u;
+ k += charLen - 1;
+ }
+ }
+ k++;
+ s.append(c);
+ }
+ return String(s);
+}
+
Value GlobalFuncImp::call(ExecState *exec, Object &/*thisObj*/, const List &args)
{
Value res;
- static const char non_escape[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
- "abcdefghijklmnopqrstuvwxyz"
- "0123456789@*_+-./";
+ static const char do_not_escape[] =
+ "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
+ "abcdefghijklmnopqrstuvwxyz"
+ "0123456789"
+ "*+-./@_";
+ static const char do_not_escape_when_encoding_URI_component[] =
+ "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
+ "abcdefghijklmnopqrstuvwxyz"
+ "0123456789"
+ "!'()*-._~";
+ static const char do_not_escape_when_encoding_URI[] =
+ "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
+ "abcdefghijklmnopqrstuvwxyz"
+ "0123456789"
+ "!#$&'()*+,-./:;=?@_~";
+ static const char do_not_unescape_when_decoding_URI[] =
+ "#$&+,/:;=?@";
switch (id) {
case Eval: { // eval()
@@ -502,54 +602,28 @@ Value GlobalFuncImp::call(ExecState *exec, Object &/*thisObj*/, const List &args
res = Boolean(!isNaN(n) && !isInf(n));
break;
}
- case Escape: {
- UString r = "", s, str = args[0].toString(exec);
- const UChar *c = str.data();
- for (int k = 0; k < str.size(); k++, c++) {
- int u = c->uc;
- if (u > 255) {
- char tmp[7];
- sprintf(tmp, "%%u%04X", u);
- s = UString(tmp);
- } else if (strchr(non_escape, (char)u)) {
- s = UString(c, 1);
- } else {
- char tmp[4];
- sprintf(tmp, "%%%02X", u);
- s = UString(tmp);
- }
- r += s;
- }
- res = String(r);
+ case DecodeURI:
+ res = decode(exec, args, do_not_unescape_when_decoding_URI, true);
break;
- }
- case UnEscape: {
- UString s, str = args[0].toString(exec);
- int k = 0, len = str.size();
- UChar u;
- while (k < len) {
- const UChar *c = str.data() + k;
- if (*c == UChar('%') && k <= len - 6 && *(c+1) == UChar('u')) {
- u = Lexer::convertUnicode((c+2)->uc, (c+3)->uc,
- (c+4)->uc, (c+5)->uc);
- c = &u;
- k += 5;
- } else if (*c == UChar('%') && k <= len - 3) {
- u = UChar(Lexer::convertHex((c+1)->uc, (c+2)->uc));
- c = &u;
- k += 2;
- }
- k++;
- s += UString(c, 1);
- }
- res = String(s);
+ case DecodeURIComponent:
+ res = decode(exec, args, "", true);
+ break;
+ case EncodeURI:
+ res = encode(exec, args, do_not_escape_when_encoding_URI);
+ break;
+ case EncodeURIComponent:
+ res = encode(exec, args, do_not_escape_when_encoding_URI_component);
+ break;
+ case Escape:
+ res = encode(exec, args, do_not_escape);
+ break;
+ case UnEscape:
+ res = decode(exec, args, "", false);
break;
- }
#ifndef NDEBUG
- case KJSPrint: {
- UString str = args[0].toString(exec);
- puts(str.ascii());
- }
+ case KJSPrint:
+ puts(args[0].toString(exec).ascii());
+ break;
#endif
}
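The idea behind the new encode() helper (percent-escape every UTF-8 byte that is not in a caller-supplied pass-through set, including the null byte) can be sketched outside KJS roughly as follows. This is an illustration only: percentEncode is a hypothetical name, std::string stands in for CString and UString, and std::snprintf replaces the sprintf call used in the patch.

```cpp
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical standalone analogue of the patch's encode() helper.
// `utf8` holds UTF-8 bytes (possibly with embedded nulls); `doNotEscape`
// is a null-terminated set of ASCII characters that pass through unchanged.
std::string percentEncode(const std::string &utf8, const char *doNotEscape)
{
    std::string out;
    for (unsigned char c : utf8) {
        // strchr matches the set's own terminator when c == 0, so test c first.
        if (c != 0 && std::strchr(doNotEscape, c)) {
            out += static_cast<char>(c);
        } else {
            char tmp[4]; // '%' + two hex digits + terminator
            std::snprintf(tmp, sizeof tmp, "%%%02X", static_cast<unsigned>(c));
            out += tmp;
        }
    }
    return out;
}
```

Calling it with different pass-through sets gives the three behaviours the patch distinguishes (escape, encodeURI, encodeURIComponent); for example, a set without '/' and '?' turns "a/b?c" into "a%2Fb%3Fc".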
diff --git a/JavaScriptCore/kjs/function.h b/JavaScriptCore/kjs/function.h
index 8279d7a..7c0d290 100644
--- a/JavaScriptCore/kjs/function.h
+++ b/JavaScriptCore/kjs/function.h
@@ -127,7 +127,8 @@ namespace KJS {
virtual bool implementsCall() const;
virtual Value call(ExecState *exec, Object &thisObj, const List &args);
virtual CodeType codeType() const;
- enum { Eval, ParseInt, ParseFloat, IsNaN, IsFinite, Escape, UnEscape
+ enum { Eval, ParseInt, ParseFloat, IsNaN, IsFinite, Escape, UnEscape,
+ DecodeURI, DecodeURIComponent, EncodeURI, EncodeURIComponent
#ifndef NDEBUG
, KJSPrint
#endif
diff --git a/JavaScriptCore/kjs/internal.cpp b/JavaScriptCore/kjs/internal.cpp
index a53dece..549be2d 100644
--- a/JavaScriptCore/kjs/internal.cpp
+++ b/JavaScriptCore/kjs/internal.cpp
@@ -664,6 +664,10 @@ void InterpreterImp::unlock()
global.put(globExec,"isFinite", Object(new GlobalFuncImp(globExec,funcProto,GlobalFuncImp::IsFinite, 1)), DontEnum);
global.put(globExec,"escape", Object(new GlobalFuncImp(globExec,funcProto,GlobalFuncImp::Escape, 1)), DontEnum);
global.put(globExec,"unescape", Object(new GlobalFuncImp(globExec,funcProto,GlobalFuncImp::UnEscape, 1)), DontEnum);
+ global.put(globExec,"decodeURI", Object(new GlobalFuncImp(globExec,funcProto,GlobalFuncImp::DecodeURI, 1)), DontEnum);
+ global.put(globExec,"decodeURIComponent", Object(new GlobalFuncImp(globExec,funcProto,GlobalFuncImp::DecodeURIComponent, 1)), DontEnum);
+ global.put(globExec,"encodeURI", Object(new GlobalFuncImp(globExec,funcProto,GlobalFuncImp::EncodeURI, 1)), DontEnum);
+ global.put(globExec,"encodeURIComponent", Object(new GlobalFuncImp(globExec,funcProto,GlobalFuncImp::EncodeURIComponent, 1)), DontEnum);
#ifndef NDEBUG
global.put(globExec,"kjsprint", Object(new GlobalFuncImp(globExec,funcProto,GlobalFuncImp::KJSPrint, 1)), DontEnum);
#endif
diff --git a/JavaScriptCore/kjs/regexp.cpp b/JavaScriptCore/kjs/regexp.cpp
index 4865c4c..610ee25 100644
--- a/JavaScriptCore/kjs/regexp.cpp
+++ b/JavaScriptCore/kjs/regexp.cpp
@@ -21,151 +21,12 @@
#include "regexp.h"
+#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
-using KJS::CString;
-using KJS::RegExp;
-using KJS::UString;
-
-#ifdef HAVE_PCREPOSIX
-
-static CString convertToUTF8(const UString &s)
-{
- // Allocate a buffer big enough to hold all the characters.
- const int length = s.size();
- const unsigned bufferSize = length * 3 + 1;
- char fixedSizeBuffer[1024];
- char *buffer;
- if (bufferSize > sizeof(fixedSizeBuffer)) {
- buffer = new char [bufferSize];
- } else {
- buffer = fixedSizeBuffer;
- }
-
- // Convert to runs of 8-bit characters.
- char *p = buffer;
- for (int i = 0; i != length; ++i) {
- unsigned short c = s[i].unicode();
- if (c < 0x80) {
- *p++ = (char)c;
- } else if (c < 0x800) {
- *p++ = (char)((c >> 6) | 0xC0); // C0 is the 2-byte flag for UTF-8
- *p++ = (char)((c | 0x80) & 0xBF); // next 6 bits, with high bit set
- } else {
- *p++ = (char)((c >> 12) | 0xE0); // E0 is the 3-byte flag for UTF-8
- *p++ = (char)(((c >> 6) | 0x80) & 0xBF); // next 6 bits, with high bit set
- *p++ = (char)((c | 0x80) & 0xBF); // next 6 bits, with high bit set
- }
- }
- *p = 0;
-
- // Return the result as a C string.
- CString result(buffer);
- if (buffer != fixedSizeBuffer) {
- delete [] buffer;
- }
- return result;
-}
-
-struct StringOffset {
- int offset;
- int locationInOffsetsArray;
-};
-
-static int compareStringOffsets(const void *a, const void *b)
-{
- const StringOffset *oa = static_cast<const StringOffset *>(a);
- const StringOffset *ob = static_cast<const StringOffset *>(b);
-
- if (oa->offset < ob->offset) {
- return -1;
- }
- if (oa->offset > ob->offset) {
- return +1;
- }
- return 0;
-}
-
-const int sortedOffsetsFixedBufferSize = 128;
-
-static StringOffset *createSortedOffsetsArray(const int offsets[], int numOffsets,
- StringOffset sortedOffsetsFixedBuffer[sortedOffsetsFixedBufferSize])
-{
- // Allocate the sorted offsets.
- StringOffset *sortedOffsets;
- if (numOffsets <= sortedOffsetsFixedBufferSize) {
- sortedOffsets = sortedOffsetsFixedBuffer;
- } else {
- sortedOffsets = new StringOffset [numOffsets];
- }
-
- // Copy offsets.
- for (int i = 0; i != numOffsets; ++i) {
- sortedOffsets[i].offset = offsets[i];
- sortedOffsets[i].locationInOffsetsArray = i;
- }
-
- // Sort them.
- qsort(sortedOffsets, numOffsets, sizeof(StringOffset), compareStringOffsets);
-
- return sortedOffsets;
-}
-
-static void convertCharacterOffsetsToUTF8ByteOffsets(const char *s, int *offsets, int numOffsets)
-{
- // Allocate buffer.
- StringOffset fixedBuffer[sortedOffsetsFixedBufferSize];
- StringOffset *sortedOffsets = createSortedOffsetsArray(offsets, numOffsets, fixedBuffer);
-
- // Walk through sorted offsets and string, adjusting all the offests.
- // Offsets that are off the ends of the string map to the edges of the string.
- int characterOffset = 0;
- const char *p = s;
- for (int oi = 0; oi != numOffsets; ++oi) {
- const int nextOffset = sortedOffsets[oi].offset;
- while (*p && characterOffset < nextOffset) {
- // Skip to the next character.
- ++characterOffset;
- do ++p; while ((*p & 0xC0) == 0x80); // if 1 of the 2 high bits is set, it's not the start of a character
- }
- offsets[sortedOffsets[oi].locationInOffsetsArray] = p - s;
- }
-
- // Free buffer.
- if (sortedOffsets != fixedBuffer) {
- delete [] sortedOffsets;
- }
-}
-
-static void convertUTF8ByteOffsetsToCharacterOffsets(const char *s, int *offsets, int numOffsets)
-{
- // Allocate buffer.
- StringOffset fixedBuffer[sortedOffsetsFixedBufferSize];
- StringOffset *sortedOffsets = createSortedOffsetsArray(offsets, numOffsets, fixedBuffer);
-
- // Walk through sorted offsets and string, adjusting all the offests.
- // Offsets that are off the end of the string map to the edges of the string.
- int characterOffset = 0;
- const char *p = s;
- for (int oi = 0; oi != numOffsets; ++oi) {
- const int nextOffset = sortedOffsets[oi].offset;
- while (*p && (p - s) < nextOffset) {
- // Skip to the next character.
- ++characterOffset;
- do ++p; while ((*p & 0xC0) == 0x80); // if 1 of the 2 high bits is set, it's not the start of a character
- }
- offsets[sortedOffsets[oi].locationInOffsetsArray] = characterOffset;
- }
-
- // Free buffer.
- if (sortedOffsets != fixedBuffer) {
- delete [] sortedOffsets;
- }
-}
-
-#endif // HAVE_PCREPOSIX
+namespace KJS {
RegExp::RegExp(const UString &p, int flags)
: _flags(flags), _numSubPatterns(0)
@@ -181,7 +42,7 @@ RegExp::RegExp(const UString &p, int flags)
const char *errorMessage;
int errorOffset;
- _regex = pcre_compile(convertToUTF8(p).c_str(), options, &errorMessage, &errorOffset, NULL);
+ _regex = pcre_compile(p.UTF8String().c_str(), options, &errorMessage, &errorOffset, NULL);
if (!_regex) {
#ifndef NDEBUG
fprintf(stderr, "KJS: pcre_compile() failed with '%s'\n", errorMessage);
@@ -258,8 +119,8 @@ UString RegExp::match(const UString &s, int i, int *pos, int **ovector)
offsetVector = new int [offsetVectorSize];
}
- const CString buffer(convertToUTF8(s));
- convertCharacterOffsetsToUTF8ByteOffsets(buffer.c_str(), &i, 1);
+ const CString buffer(s.UTF8String());
+ convertUTF16OffsetsToUTF8Offsets(buffer.c_str(), &i, 1);
const int numMatches = pcre_exec(_regex, NULL, buffer.c_str(), buffer.size(), i, 0, offsetVector, offsetVectorSize);
if (numMatches < 0) {
@@ -272,7 +133,7 @@ UString RegExp::match(const UString &s, int i, int *pos, int **ovector)
return UString::null();
}
- convertUTF8ByteOffsetsToCharacterOffsets(buffer.c_str(), offsetVector, (numMatches == 0 ? 1 : numMatches) * 2);
+ convertUTF8OffsetsToUTF16Offsets(buffer.c_str(), offsetVector, (numMatches == 0 ? 1 : numMatches) * 2);
*pos = offsetVector[0];
if (ovector)
@@ -314,3 +175,5 @@ UString RegExp::match(const UString &s, int i, int *pos, int **ovector)
#endif
}
+
+} // namespace KJS
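RegExp::match has to translate its start index from UTF-16 code units to UTF-8 bytes before calling pcre_exec. A simplified single-offset sketch of that translation is shown below. It is not the patch's implementation: the names are hypothetical, it handles one offset rather than a sorted array, and, unlike UTF8SequenceLength in the patch, it steps one byte on an invalid lead byte so that the loop always advances.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: length of the UTF-8 sequence that starts with b0.
static std::size_t sequenceLength(unsigned char b0)
{
    if (b0 < 0x80) return 1;
    if ((b0 & 0xE0) == 0xC0) return 2;
    if ((b0 & 0xF0) == 0xE0) return 3;
    if ((b0 & 0xF8) == 0xF0) return 4;
    return 1; // invalid lead byte: step one byte so the caller's loop terminates
}

// Hypothetical single-offset variant of the conversion; assumes valid UTF-8.
// Offsets past the end of the string map to the end of the string.
std::size_t utf16OffsetToUtf8Offset(const std::string &utf8, std::size_t utf16Offset)
{
    std::size_t byteOffset = 0;
    std::size_t units = 0; // UTF-16 code units consumed so far
    while (byteOffset < utf8.size() && units < utf16Offset) {
        const std::size_t len = sequenceLength(static_cast<unsigned char>(utf8[byteOffset]));
        byteOffset += len;
        // A 4-byte UTF-8 sequence corresponds to a surrogate pair (two code units).
        units += (len == 4) ? 2 : 1;
    }
    return byteOffset < utf8.size() ? byteOffset : utf8.size();
}
```

The reverse direction follows the same walk, stopping on the byte offset and returning the code-unit count instead.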
diff --git a/JavaScriptCore/kjs/ustring.cpp b/JavaScriptCore/kjs/ustring.cpp
index b6fae6c..1922249 100644
--- a/JavaScriptCore/kjs/ustring.cpp
+++ b/JavaScriptCore/kjs/ustring.cpp
@@ -42,22 +42,30 @@
#include "dtoa.h"
namespace KJS {
- extern const double NaN;
- extern const double Inf;
-};
-using namespace KJS;
+extern const double NaN;
+extern const double Inf;
CString::CString(const char *c)
{
- data = new char[strlen(c)+1];
+ length = strlen(c);
+ data = new char[length+1];
strcpy(data, c);
}
+CString::CString(const char *c, int len)
+{
+ length = len;
+ data = new char[len+1];
+ memcpy(data, c, len);
+ data[len] = 0;
+}
+
CString::CString(const CString &b)
{
- data = new char[b.size()+1];
- strcpy(data, b.c_str());
+ length = b.length;
+ data = new char[length+1];
+ memcpy(data, b.data, length + 1);
}
CString::~CString()
@@ -68,14 +76,13 @@ CString::~CString()
CString &CString::append(const CString &t)
{
char *n;
- if (data) {
- n = new char[strlen(data)+t.size()+1];
- strcpy(n, data);
- } else {
- n = new char[t.size()+1];
- n[0] = '\0';
- }
- strcat(n, t.c_str());
+ n = new char[length+t.length+1];
+ if (length)
+ memcpy(n, data, length);
+ if (t.length)
+ memcpy(n+length, t.data, t.length);
+ length += t.length;
+ n[length] = 0;
delete [] data;
data = n;
@@ -87,7 +94,8 @@ CString &CString::operator=(const char *c)
{
if (data)
delete [] data;
- data = new char[strlen(c)+1];
+ length = strlen(c);
+ data = new char[length+1];
strcpy(data, c);
return *this;
@@ -100,20 +108,17 @@ CString &CString::operator=(const CString &str)
if (data)
delete [] data;
- data = new char[str.size()+1];
- strcpy(data, str.c_str());
+ length = str.length;
+ data = new char[length + 1];
+ memcpy(data, str.data, length + 1);
return *this;
}
-int CString::size() const
-{
- return strlen(data);
-}
-
bool KJS::operator==(const KJS::CString& c1, const KJS::CString& c2)
{
- return (strcmp(c1.c_str(), c2.c_str()) == 0);
+ int len = c1.size();
+ return len == c2.size() && (len == 0 || memcmp(c1.c_str(), c2.c_str(), len) == 0);
}
UString::Rep UString::Rep::null = { 0, 0, 0, 1, 1 };
@@ -470,6 +475,53 @@ UString &UString::append(const UString &t)
return *this;
}
+UString &UString::append(const char *t)
+{
+ int l = size();
+ int tLen = strlen(t);
+ int newLen = l + tLen;
+ if (rep->rc == 1 && newLen <= rep->capacity) {
+ for (int i = 0; i < tLen; ++i)
+ rep->dat[l+i] = t[i];
+ rep->len = newLen;
+ rep->_hash = 0;
+ return *this;
+ }
+
+ int newCapacity = (newLen * 3 + 1) / 2;
+ UChar *n = new UChar[newCapacity];
+ memcpy(n, data(), l * sizeof(UChar));
+ for (int i = 0; i < tLen; ++i)
+ n[l+i] = t[i];
+ release();
+ rep = Rep::create(n, newLen);
+ rep->capacity = newCapacity;
+
+ return *this;
+}
+
+UString &UString::append(unsigned short c)
+{
+ int l = size();
+ int newLen = l + 1;
+ if (rep->rc == 1 && newLen <= rep->capacity) {
+ rep->dat[l] = c;
+ rep->len = newLen;
+ rep->_hash = 0;
+ return *this;
+ }
+
+ int newCapacity = (newLen * 3 + 1) / 2;
+ UChar *n = new UChar[newCapacity];
+ memcpy(n, data(), l * sizeof(UChar));
+ n[l] = c;
+ release();
+ rep = Rep::create(n, newLen);
+ rep->capacity = newCapacity;
+
+ return *this;
+}
+
CString UString::cstring() const
{
return ascii();
@@ -894,3 +946,241 @@ int KJS::compare(const UString& s1, const UString& s2)
}
return (l1 < l2) ? 1 : -1;
}
+
+// Given a first byte, gives the length of the UTF-8 sequence it begins.
+// Returns 0 for bytes that are not legal starts of UTF-8 sequences.
+// Only allows sequences of up to 4 bytes, since that works for all Unicode characters (U-00000000 to U-0010FFFF).
+int UTF8SequenceLength(char b0)
+{
+ if ((b0 & 0x80) == 0)
+ return 1;
+ if ((b0 & 0xC0) != 0xC0)
+ return 0;
+ if ((b0 & 0xE0) == 0xC0)
+ return 2;
+ if ((b0 & 0xF0) == 0xE0)
+ return 3;
+ if ((b0 & 0xF8) == 0xF0)
+ return 4;
+ return 0;
+}
+
+// Takes a null-terminated C-style string with a UTF-8 sequence in it and converts it to a character.
+// Only allows Unicode characters (U-00000000 to U-0010FFFF).
+// Returns -1 if the sequence is not valid (including presence of extra bytes).
+int decodeUTF8Sequence(const char *sequence)
+{
+ // Handle 0-byte sequences (never valid).
+ const unsigned char b0 = sequence[0];
+ const int length = UTF8SequenceLength(b0);
+ if (length == 0)
+ return -1;
+
+ // Handle 1-byte sequences (plain ASCII).
+ const unsigned char b1 = sequence[1];
+ if (length == 1) {
+ if (b1)
+ return -1;
+ return b0;
+ }
+
+ // Handle 2-byte sequences.
+ if ((b1 & 0xC0) != 0x80)
+ return -1;
+ const unsigned char b2 = sequence[2];
+ if (length == 2) {
+ if (b2)
+ return -1;
+ const int c = ((b0 & 0x1F) << 6) | (b1 & 0x3F);
+ if (c < 0x80)
+ return -1;
+ return c;
+ }
+
+ // Handle 3-byte sequences.
+ if ((b2 & 0xC0) != 0x80)
+ return -1;
+ const unsigned char b3 = sequence[3];
+ if (length == 3) {
+ if (b3)
+ return -1;
+ const int c = ((b0 & 0xF) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
+ if (c < 0x800)
+ return -1;
+ // UTF-16 surrogates should never appear in UTF-8 data.
+ if (c >= 0xD800 && c <= 0xDFFF)
+ return -1;
+ // Backwards BOM and U+FFFF should never appear in UTF-8 data.
+ if (c == 0xFFFE || c == 0xFFFF)
+ return -1;
+ return c;
+ }
+
+ // Handle 4-byte sequences.
+ if ((b3 & 0xC0) != 0x80)
+ return -1;
+ const unsigned char b4 = sequence[4];
+ if (length == 4) {
+ if (b4)
+ return -1;
+ const int c = ((b0 & 0x7) << 18) | ((b1 & 0x3F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F);
+ if (c < 0x10000 || c > 0x10FFFF)
+ return -1;
+ return c;
+ }
+
+ return -1;
+}
+
+CString UString::UTF8String() const
+{
+ // Allocate a buffer big enough to hold all the characters.
+ const int length = size();
+ const unsigned bufferSize = length * 3;
+ char fixedSizeBuffer[1024];
+ char *buffer;
+ if (bufferSize > sizeof(fixedSizeBuffer)) {
+ buffer = new char [bufferSize];
+ } else {
+ buffer = fixedSizeBuffer;
+ }
+
+ // Convert to runs of 8-bit characters.
+ char *p = buffer;
+ const UChar *d = data();
+ for (int i = 0; i != length; ++i) {
+ unsigned short c = d[i].unicode();
+ if (c < 0x80) {
+ *p++ = (char)c;
+ } else if (c < 0x800) {
+ *p++ = (char)((c >> 6) | 0xC0); // C0 is the 2-byte flag for UTF-8
+ *p++ = (char)((c | 0x80) & 0xBF); // next 6 bits, with high bit set
+ } else if (c >= 0xD800 && c <= 0xDBFF && i + 1 < length && d[i+1].uc >= 0xDC00 && d[i+1].uc <= 0xDFFF) {
+ unsigned sc = 0x10000 + (((c & 0x3FF) << 10) | (d[i+1].uc & 0x3FF));
+ *p++ = (char)((sc >> 18) | 0xF0); // F0 is the 4-byte flag for UTF-8
+ *p++ = (char)(((sc >> 12) | 0x80) & 0xBF); // next 6 bits, with high bit set
+ *p++ = (char)(((sc >> 6) | 0x80) & 0xBF); // next 6 bits, with high bit set
+ *p++ = (char)((sc | 0x80) & 0xBF); // next 6 bits, with high bit set
+ ++i;
+ } else {
+ *p++ = (char)((c >> 12) | 0xE0); // E0 is the 3-byte flag for UTF-8
+ *p++ = (char)(((c >> 6) | 0x80) & 0xBF); // next 6 bits, with high bit set
+ *p++ = (char)((c | 0x80) & 0xBF); // next 6 bits, with high bit set
+ }
+ }
+
+ // Return the result as a C string.
+ CString result(buffer, p - buffer);
+ if (buffer != fixedSizeBuffer) {
+ delete [] buffer;
+ }
+ return result;
+}
+
+struct StringOffset {
+ int offset;
+ int locationInOffsetsArray;
+};
+
+static int compareStringOffsets(const void *a, const void *b)
+{
+ const StringOffset *oa = static_cast<const StringOffset *>(a);
+ const StringOffset *ob = static_cast<const StringOffset *>(b);
+
+ if (oa->offset < ob->offset) {
+ return -1;
+ }
+ if (oa->offset > ob->offset) {
+ return +1;
+ }
+ return 0;
+}
+
+const int sortedOffsetsFixedBufferSize = 128;
+
+static StringOffset *createSortedOffsetsArray(const int offsets[], int numOffsets,
+ StringOffset sortedOffsetsFixedBuffer[sortedOffsetsFixedBufferSize])
+{
+ // Allocate the sorted offsets.
+ StringOffset *sortedOffsets;
+ if (numOffsets <= sortedOffsetsFixedBufferSize) {
+ sortedOffsets = sortedOffsetsFixedBuffer;
+ } else {
+ sortedOffsets = new StringOffset [numOffsets];
+ }
+
+ // Copy offsets.
+ for (int i = 0; i != numOffsets; ++i) {
+ sortedOffsets[i].offset = offsets[i];
+ sortedOffsets[i].locationInOffsetsArray = i;
+ }
+
+ // Sort them.
+ qsort(sortedOffsets, numOffsets, sizeof(StringOffset), compareStringOffsets);
+
+ return sortedOffsets;
+}
+
+// Note: This function assumes valid UTF-8.
+// Without assertions it can loop forever on invalid UTF-8, since UTF8SequenceLength returns 0 for a bad first byte.
+void convertUTF16OffsetsToUTF8Offsets(const char *s, int *offsets, int numOffsets)
+{
+ // Allocate buffer.
+ StringOffset fixedBuffer[sortedOffsetsFixedBufferSize];
+ StringOffset *sortedOffsets = createSortedOffsetsArray(offsets, numOffsets, fixedBuffer);
+
+ // Walk through sorted offsets and string, adjusting all the offsets.
+ // Offsets that are off the ends of the string map to the edges of the string.
+ int UTF16Offset = 0;
+ const char *p = s;
+ for (int oi = 0; oi != numOffsets; ++oi) {
+ const int nextOffset = sortedOffsets[oi].offset;
+ while (*p && UTF16Offset < nextOffset) {
+ // Skip to the next character.
+ const int sequenceLength = UTF8SequenceLength(*p);
+ assert(sequenceLength >= 1 && sequenceLength <= 4);
+ p += sequenceLength;
+ // Characters that take a 4-byte sequence in UTF-8 take two 16-bit code units (a surrogate pair) in UTF-16.
+ UTF16Offset += sequenceLength < 4 ? 1 : 2;
+ }
+ offsets[sortedOffsets[oi].locationInOffsetsArray] = p - s;
+ }
+
+ // Free buffer.
+ if (sortedOffsets != fixedBuffer) {
+ delete [] sortedOffsets;
+ }
+}
+
+// Note: This function assumes valid UTF-8.
+// Without assertions it can loop forever on invalid UTF-8, since UTF8SequenceLength returns 0 for a bad first byte.
+void convertUTF8OffsetsToUTF16Offsets(const char *s, int *offsets, int numOffsets)
+{
+ // Allocate buffer.
+ StringOffset fixedBuffer[sortedOffsetsFixedBufferSize];
+ StringOffset *sortedOffsets = createSortedOffsetsArray(offsets, numOffsets, fixedBuffer);
+
+ // Walk through sorted offsets and string, adjusting all the offsets.
+ // Offsets that are off the ends of the string map to the edges of the string.
+ int UTF16Offset = 0;
+ const char *p = s;
+ for (int oi = 0; oi != numOffsets; ++oi) {
+ const int nextOffset = sortedOffsets[oi].offset;
+ while (*p && (p - s) < nextOffset) {
+ // Skip to the next character.
+ const int sequenceLength = UTF8SequenceLength(*p);
+ assert(sequenceLength >= 1 && sequenceLength <= 4);
+ p += sequenceLength;
+ // Characters that take a 4-byte sequence in UTF-8 take two 16-bit code units (a surrogate pair) in UTF-16.
+ UTF16Offset += sequenceLength < 4 ? 1 : 2;
+ }
+ offsets[sortedOffsets[oi].locationInOffsetsArray] = UTF16Offset;
+ }
+
+ // Free buffer.
+ if (sortedOffsets != fixedBuffer) {
+ delete [] sortedOffsets;
+ }
+}
+
+} // namespace KJS
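
Editorial note (not part of the commit): the offset walk in convertUTF16OffsetsToUTF8Offsets above can be illustrated for a single offset with the standalone sketch below. The names sequenceLengthForLeadByte and utf16OffsetToUTF8Offset are hypothetical and do not appear in the patch. The sketch also assumes two behaviours the patch does not have: it steps over an invalid lead byte rather than relying on valid input, and it clamps at the end of the string instead of reading past a truncated sequence.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the patch's UTF8SequenceLength: length of the
// UTF-8 sequence that starts with this byte, or 0 for an invalid lead byte.
static int sequenceLengthForLeadByte(unsigned char b)
{
    if (b < 0x80) return 1;
    if (b >= 0xC2 && b <= 0xDF) return 2;
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;
    return 0;
}

// Maps one UTF-16 code-unit offset to a byte offset in the UTF-8 string.
// Offsets past the end of the string map to the end of the string.
static std::size_t utf16OffsetToUTF8Offset(const std::string& utf8, int utf16Offset)
{
    assert(utf16Offset >= 0);
    std::size_t p = 0;
    int units = 0;
    while (p < utf8.size() && units < utf16Offset) {
        int len = sequenceLengthForLeadByte(static_cast<unsigned char>(utf8[p]));
        if (len == 0)
            len = 1; // Assumption: skip an invalid byte instead of looping forever.
        if (p + len > utf8.size()) {
            p = utf8.size(); // Assumption: clamp on a truncated trailing sequence.
            break;
        }
        p += len;
        // A 4-byte UTF-8 sequence corresponds to a surrogate pair (two code units).
        units += len < 4 ? 1 : 2;
    }
    return p;
}
```

As in the patch, an offset that falls in the middle of a surrogate pair maps to the byte just after the 4-byte sequence, because the walk only stops once it has consumed at least the requested number of code units.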
diff --git a/JavaScriptCore/kjs/ustring.h b/JavaScriptCore/kjs/ustring.h
index 9d59cf1..5765674 100644
--- a/JavaScriptCore/kjs/ustring.h
+++ b/JavaScriptCore/kjs/ustring.h
@@ -169,8 +169,9 @@ namespace KJS {
*/
class CString {
public:
- CString() : data(0) { }
+ CString() : data(0), length(0) { }
CString(const char *c);
+ CString(const char *c, int len);
CString(const CString &);
~CString();
@@ -180,10 +181,11 @@ namespace KJS {
CString &operator=(const CString &);
CString &operator+=(const CString &c) { return append(c); }
- int size() const;
+ int size() const { return length; }
const char *c_str() const { return data; }
private:
char *data;
+ int length;
};
/**
@@ -300,6 +302,10 @@ namespace KJS {
* Append another string.
*/
UString &append(const UString &);
+ UString &append(const char *);
+ UString &append(unsigned short);
+ UString &append(char c) { return append(static_cast<unsigned short>(static_cast<unsigned char>(c))); }
+ UString &append(UChar c) { return append(c.uc); }
/**
* @return The string converted to the 8-bit string type @ref CString().
@@ -313,6 +319,16 @@ namespace KJS {
* instances.
*/
char *ascii() const;
+
+ /**
+ * Convert the string to UTF-8, assuming it is UTF-16 encoded.
+ * Since this function is tolerant of badly formed UTF-16, it can create UTF-8
+ * strings that are invalid because they have characters in the range
+ * U+D800-U+DFFF, U+FFFE, or U+FFFF, but the UTF-8 string is guaranteed to
+ * be otherwise valid.
+ */
+ CString UTF8String() const;
+
/**
* @see UString(const QString&).
*/
@@ -335,6 +351,7 @@ namespace KJS {
* Appends the specified string.
*/
UString &operator+=(const UString &s) { return append(s); }
+ UString &operator+=(const char *s) { return append(s); }
/**
* @return A pointer to the internal Unicode data.
@@ -454,6 +471,26 @@ namespace KJS {
int compare(const UString &, const UString &);
+ // Given a first byte, gives the length of the UTF-8 sequence it begins.
+ // Returns 0 for bytes that are not legal starts of UTF-8 sequences.
+ // Only allows sequences of up to 4 bytes, since that works for all Unicode characters (U-00000000 to U-0010FFFF).
+ int UTF8SequenceLength(char);
+
+ // Takes a null-terminated C-style string with a UTF-8 sequence in it and converts it to a character.
+ // Only allows Unicode characters (U-00000000 to U-0010FFFF).
+ // Returns -1 if the sequence is not valid (including presence of extra bytes).
+ int decodeUTF8Sequence(const char *);
+
+ // Given a UTF-8 string, converts offsets from the UTF-16 form of the string into offsets into the UTF-8 string.
+ // Note: This function can overrun the buffer if the string contains a partial UTF-8 sequence, so it should
+ // not be called with strings that might contain such sequences.
+ void convertUTF16OffsetsToUTF8Offsets(const char *UTF8String, int *offsets, int numOffsets);
+
+ // Given a UTF-8 string, converts offsets from the UTF-8 string into offsets into the UTF-16 form of the string.
+ // Note: This function can overrun the buffer if the string contains a partial UTF-8 sequence, so it should
+ // not be called with strings that might contain such sequences.
+ void convertUTF8OffsetsToUTF16Offsets(const char *UTF8String, int *offsets, int numOffsets);
+
}; // namespace
#endif
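
Editorial note (not part of the commit): the tolerant UTF-16 to UTF-8 conversion documented for UString::UTF8String can be sketched in standalone form as below. The function name utf16ToUTF8 and the use of std::u16string and std::string are assumptions made for illustration; the patch itself works on UChar arrays and returns a CString. The one-byte ASCII branch is also assumed, since that part of the original function is not visible in this excerpt.

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch: converts UTF-16 code units to UTF-8. An unpaired
// surrogate is encoded as a 3-byte sequence, so the output can contain
// code points in U+D800-U+DFFF, as the header comment warns.
static std::string utf16ToUTF8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned c = in[i];
        if (c < 0x80) {
            // Assumed ASCII branch: one byte, including the null character.
            out += static_cast<char>(c);
        } else if (c < 0x800) {
            out += static_cast<char>((c >> 6) | 0xC0);          // 2-byte lead
            out += static_cast<char>((c & 0x3F) | 0x80);        // low 6 bits
        } else if (c >= 0xD800 && c <= 0xDBFF && i + 1 < in.size()
                   && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a high and a low surrogate into one code point.
            unsigned sc = 0x10000 + (((c & 0x3FF) << 10) | (in[i + 1] & 0x3FF));
            out += static_cast<char>((sc >> 18) | 0xF0);        // 4-byte lead
            out += static_cast<char>(((sc >> 12) & 0x3F) | 0x80);
            out += static_cast<char>(((sc >> 6) & 0x3F) | 0x80);
            out += static_cast<char>((sc & 0x3F) | 0x80);
            ++i; // The low surrogate has been consumed.
        } else {
            out += static_cast<char>((c >> 12) | 0xE0);         // 3-byte lead
            out += static_cast<char>(((c >> 6) & 0x3F) | 0x80);
            out += static_cast<char>((c & 0x3F) | 0x80);
        }
    }
    return out;
}
```

Because std::string carries its own length, a null character in the input survives the conversion, which is the same reason the patch adds a length member to CString.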
--
WebKit Debian packaging