[Python-modules-commits] [python-tidylib] 03/11: Import python-tidylib_0.2.4~dfsg.orig.tar.gz

Dmitry Shachnev mitya57 at moszumanska.debian.org
Mon Dec 14 18:03:44 UTC 2015


This is an automated email from the git hooks/post-receive script.

mitya57 pushed a commit to branch master
in repository python-tidylib.

commit dca25b5942b5e9770f12412cff3760ac67213d7a
Author: Dmitry Shachnev <mitya57 at gmail.com>
Date:   Mon Dec 14 20:46:14 2015 +0300

    Import python-tidylib_0.2.4~dfsg.orig.tar.gz
---
 ._LICENSE                                 | Bin 186 -> 0 bytes
 ._MANIFEST.in                             | Bin 186 -> 0 bytes
 ._README                                  | Bin 186 -> 0 bytes
 ._setup.py                                | Bin 187 -> 0 bytes
 LICENSE                                   |   2 +-
 PKG-INFO                                  |  36 +++++-----
 README                                    |   4 ++
 docs/rst/._conf.py                        | Bin 187 -> 0 bytes
 docs/rst/._index.rst                      | Bin 186 -> 0 bytes
 docs/rst/conf.py                          |  15 +---
 docs/rst/index.rst                        |  25 +++----
 setup.py                                  |  26 ++++---
 tests/._DocsTest.py                       | Bin 184 -> 0 bytes
 tests/._FragsTest.py                      | Bin 187 -> 0 bytes
 tests/._SinkMemTest.py                    | Bin 186 -> 0 bytes
 tests/._threadsafety.py                   | Bin 186 -> 0 bytes
 tests/__init__.py                         |   0
 tests/{DocsTest.py => test_docs.py}       |  90 +++++++++++++----------
 tests/{FragsTest.py => test_fragments.py} |  57 +++++++--------
 tests/{SinkMemTest.py => test_memory.py}  |  22 +++---
 tests/threadsafety.py                     |  17 ++---
 tidylib/.___init__.py                     | Bin 187 -> 0 bytes
 tidylib/._sink.py                         | Bin 187 -> 0 bytes
 tidylib/__init__.py                       | 115 +++++++++++++++++-------------
 tidylib/sink.py                           |  50 +++++++------
 25 files changed, 246 insertions(+), 213 deletions(-)

diff --git a/._LICENSE b/._LICENSE
deleted file mode 100644
index 80a3389..0000000
Binary files a/._LICENSE and /dev/null differ
diff --git a/._MANIFEST.in b/._MANIFEST.in
deleted file mode 100644
index f0c5a50..0000000
Binary files a/._MANIFEST.in and /dev/null differ
diff --git a/._README b/._README
deleted file mode 100644
index f15dfbb..0000000
Binary files a/._README and /dev/null differ
diff --git a/._setup.py b/._setup.py
deleted file mode 100644
index cd54e91..0000000
Binary files a/._setup.py and /dev/null differ
diff --git a/LICENSE b/LICENSE
index 730e8b4..73bb525 100644
--- a/LICENSE
+++ b/LICENSE
@@ -1,4 +1,4 @@
-Copyright 2009 Jason Stitt
+Copyright 2009-2014 Jason Stitt
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
diff --git a/PKG-INFO b/PKG-INFO
index 80d4911..a1ac0e5 100644
--- a/PKG-INFO
+++ b/PKG-INFO
@@ -1,37 +1,40 @@
-Metadata-Version: 1.0
+Metadata-Version: 1.1
 Name: pytidylib
-Version: 0.2.1
-Summary: Python wrapper for HTML Tidy (tidylib)
+Version: 0.2.4
+Summary: Python wrapper for HTML Tidy (tidylib) on Python 2 and 3
 Home-page: http://countergram.com/open-source/pytidylib/
 Author: Jason Stitt
 Author-email: js@jasonstitt.com
 License: UNKNOWN
-Download-URL: http://cloud.github.com/downloads/countergram/pytidylib/pytidylib-0.2.1.tar.gz
-Description: 0.2.0: Works on Windows! See documentation for available DLL download
-        locations. Documentation rewritten and expanded.
-        
-        `PyTidyLib`_ is a Python package that wraps the `HTML Tidy`_ library. This
+Description: `PyTidyLib`_ is a Python package that wraps the `HTML Tidy`_ library. This
         allows you, from Python code, to "fix" invalid (X)HTML markup. Some of the
         library's many capabilities include:
         
         * Clean up unclosed tags and unescaped characters such as ampersands
         * Output HTML 4 or XHTML, strict or transitional, and add missing doctypes
         * Convert named entities to numeric entities, which can then be used in XML
-        documents without an HTML doctype.
+          documents without an HTML doctype.
         * Clean up HTML from programs such as Word (to an extent)
         * Indent the output, including proper (i.e. no) indenting for ``pre`` elements,
-        which some (X)HTML indenting code overlooks.
+          which some (X)HTML indenting code overlooks.
+        
+        Version usage
+        =============
+        
+        * Windows: 0.2.0 and later
+        * Python 3: Tests pass on 0.2.3
+        * tidylib itself is not actively updated and may have problems with newer HTML
         
         Small example of use
         ====================
         
         The following code cleans up an invalid HTML document and sets an option::
         
-        from tidylib import tidy_document
-        document, errors = tidy_document('''<p>f&otilde;o <img src="bar.jpg">''',
-        options={'numeric-entities':1})
-        print document
-        print errors
+            from tidylib import tidy_document
+            document, errors = tidy_document('''<p>f&otilde;o <img src="bar.jpg">''',
+              options={'numeric-entities':1})
+            print document
+            print errors
         
         Docs
         ====
@@ -43,11 +46,12 @@ Description: 0.2.0: Works on Windows! See documentation for available DLL downlo
         .. _`PyTidyLib`: http://countergram.com/open-source/pytidylib/
         
 Platform: UNKNOWN
-Classifier: Development Status :: 4 - Beta
+Classifier: Development Status :: 5 - Production/Stable
 Classifier: Environment :: Other Environment
 Classifier: Intended Audience :: Developers
 Classifier: License :: OSI Approved :: MIT License
 Classifier: Programming Language :: Python
+Classifier: Programming Language :: Python :: 3
 Classifier: Natural Language :: English
 Classifier: Topic :: Utilities
 Classifier: Topic :: Text Processing :: Markup :: HTML
diff --git a/README b/README
index dd5738d..a471b26 100644
--- a/README
+++ b/README
@@ -8,3 +8,7 @@ document, errors = tidy_document('''<p>f&otilde;o <img src="bar.jpg">''',
     options={'numeric-entities':1})
 print document
 print errors
+
+NOTE: HTML Tidy itself has currently not been updated for a long time, and may
+not be, and it may have trouble with newer HTML. This is just a thin Python
+wrapper around HTML Tidy, which is a separate project.
diff --git a/docs/rst/._conf.py b/docs/rst/._conf.py
deleted file mode 100644
index 0c4d34e..0000000
Binary files a/docs/rst/._conf.py and /dev/null differ
diff --git a/docs/rst/._index.rst b/docs/rst/._index.rst
deleted file mode 100644
index f46160c..0000000
Binary files a/docs/rst/._index.rst and /dev/null differ
diff --git a/docs/rst/conf.py b/docs/rst/conf.py
index 29a547e..3f95db2 100644
--- a/docs/rst/conf.py
+++ b/docs/rst/conf.py
@@ -5,22 +5,9 @@ extensions = ['sphinx.ext.autodoc']
 master_doc = "index"
 
 project = "pytidylib"
-copyright = "2009 Jason Stitt"
+copyright = "2009-2014 Jason Stitt"
 version = "0.1"
 language = "en"
 
 html_title = "pytidylib module"
 
-latex_use_modindex = False
-
-latex_documents = [
-    (
-    master_doc,
-    'pytidylib.tex',
-    'PyTidyLib documentation',
-    'Jason Stitt',
-    'howto',
-    False,
-    )
-    ]
-
diff --git a/docs/rst/index.rst b/docs/rst/index.rst
index 1550ce3..b4d7fb4 100644
--- a/docs/rst/index.rst
+++ b/docs/rst/index.rst
@@ -9,14 +9,16 @@ PyTidyLib: A Python Interface to HTML Tidy
 * Clean up HTML from programs such as Word (to an extent)
 * Indent the output, including proper (i.e. no) indenting for ``pre`` elements, which some (X)HTML indenting code overlooks.
 
-PyTidyLib is intended as as replacement for uTidyLib, which fills a similar purpose. The author previously used uTidyLib but found several areas for improvement, including OS X support, 64-bit platform support, unicode support, fixing a memory leak, and better speed.
+As of the latest PyTidyLib maintenance updates, HTML Tidy itself has currently not been updated since 2008, and it may have trouble with newer HTML. This is just a thin Python wrapper around HTML Tidy, which is a separate project.
+
+As of 0.2.3, both Python 2 and Python 3 are supported with passing tests.
 
 Naming conventions
 ==================
 
 `HTML Tidy`_ is a longstanding open-source library written in C that implements the actual functionality of cleaning up (X)HTML markup. It provides a shared library (``so``, ``dll``, or ``dylib``) that can variously be called ``tidy``, ``libtidy``, or ``tidylib``, as well as a command-line executable named ``tidy``. For clarity, this document will consistently refer to it by the project name, HTML Tidy.
 
-`PyTidyLib`_ is the name of the Python package discussed here. As this is the package name, ``easy_install pytidylib`` or ``pip install pytidylib`` is correct (they are case-insenstive). The *module* name is ``tidylib``, so ``import tidylib`` is correct in Python code. This document will consistently use the package name, PyTidyLib, outside of code examples.
+`PyTidyLib`_ is the name of the Python package discussed here. As this is the package name, ``pip install pytidylib`` is correct (they are case-insenstive). The *module* name is ``tidylib``, so ``import tidylib`` is correct in Python code. This document will consistently use the package name, PyTidyLib, outside of code examples.
 
 Installing HTML Tidy
 ====================
@@ -27,7 +29,7 @@ You must have both `HTML Tidy`_ and `PyTidyLib`_ installed in order to use the f
 
 **OS X:** You may already have HTML Tidy installed. In the Terminal, run ``locate libtidy`` and see if you get any results, which should end in ``dylib``. Otherwise see *Building from Source*, below.
 
-**Windows:** (Use PyTidyLib version 0.2 or later!) Prebuilt HTML Tidy DLLs are available from at least two locations. The `int64.org Tidy Binaries`_ page provides binaries that were built in 2005, for both 32-bit and 64-bit Windows, against a patched version of the source. The `HTML Tidy`_ web site links to a DLL built in 2006, for 32-bit Windows only, using the vanilla source (scroll near the bottom to "Other Builds" -- use the one that reads "exe/lib/dll", *not* the "exe"-only version.)
+**Windows:** (Do not use pre-0.2.0 PyTidyLib.) You may be able to find prebuild DLLs. The DLL sources that were linked to in previous versions of this documentation have since gone 404 without obvious  replacements.
 
 Once you have a DLL (which may be named ``tidy.dll``, ``libtidy.dll``, or ``tidylib.dll``), you must place it in a directory on your system path. If you are running Python from the command-line, placing the DLL in the present working directory will work, but this is unreliable otherwise (e.g. for server software).
 
@@ -36,19 +38,17 @@ See the articles `How to set the path in Windows 2000/Windows XP <http://www.com
 **Building from Source:** The HTML Tidy developers have chosen to make the source code downloadable *only* through CVS, and not from the web site. Use the following CVS checkout at the command line::
 
     cvs -z3 -d:pserver:anonymous@tidy.cvs.sourceforge.net:/cvsroot/tidy co -P tidy
-    
+
 Then see the instructions packaged with the source code or on the `HTML Tidy`_ web site.
 
 Installing PyTidyLib
 ====================
 
-PyTidyLib is available on the Python Package Index and may be installed in the usual ways if you have `pip`_ or `setuptools`_ installed::
+PyTidyLib is available on the Python Package Index::
 
     pip install pytidylib
-    # or:
-    easy_install pytidylib
-    
-You can also download the latest source distribution from the `PyTidyLib`_ web site.
+
+You can also download the latest source distribution from PyPI manually.
 
 Small example of use
 ====================
@@ -60,7 +60,7 @@ The following code cleans up an invalid HTML document and sets an option::
         options={'numeric-entities':1})
     print document
     print errors
-    
+
 Configuration options
 =====================
 
@@ -71,7 +71,6 @@ The Python interface allows you to pass options directly to HTML Tidy. For a com
 This module sets certain default options, as follows::
 
     BASE_OPTIONS = {
-        "output-xhtml": 1,     # XHTML instead of HTML4
         "indent": 1,           # Pretty; not too much of a performance hit
         "tidy-mark": 0,        # No tidy meta tag in output
         "wrap": 0,             # No wrapping
@@ -95,6 +94,4 @@ Function reference
 
 .. _`HTML Tidy`: http://tidy.sourceforge.net/
 .. _`PyTidyLib`: http://countergram.com/open-source/pytidylib/
-.. _`int64.org Tidy Binaries`: http://int64.org/projects/tidy-binaries
-.. _`setuptools`: http://pypi.python.org/pypi/setuptools
-.. _`pip`: http://pypi.python.org/pypi/pip
+
diff --git a/setup.py b/setup.py
index b034821..49e1d71 100644
--- a/setup.py
+++ b/setup.py
@@ -1,15 +1,15 @@
 # Copyright 2009 Jason Stitt
-# 
+#
 # Permission is hereby granted, free of charge, to any person obtaining a copy
 # of this software and associated documentation files (the "Software"), to deal
 # in the Software without restriction, including without limitation the rights
 # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 # copies of the Software, and to permit persons to whom the Software is
 # furnished to do so, subject to the following conditions:
-# 
+#
 # The above copyright notice and this permission notice shall be included in
 # all copies or substantial portions of the Software.
-# 
+#
 # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
@@ -21,9 +21,6 @@
 from distutils.core import setup
 
 longdesc = """\
-0.2.0: Works on Windows! See documentation for available DLL download
-locations. Documentation rewritten and expanded.
-
 `PyTidyLib`_ is a Python package that wraps the `HTML Tidy`_ library. This
 allows you, from Python code, to "fix" invalid (X)HTML markup. Some of the
 library's many capabilities include:
@@ -36,6 +33,13 @@ library's many capabilities include:
 * Indent the output, including proper (i.e. no) indenting for ``pre`` elements,
   which some (X)HTML indenting code overlooks.
 
+Version usage
+=============
+
+* Windows: 0.2.0 and later
+* Python 3: Tests pass on 0.2.3
+* tidylib itself is not actively updated and may have problems with newer HTML
+
 Small example of use
 ====================
 
@@ -46,7 +50,7 @@ The following code cleans up an invalid HTML document and sets an option::
       options={'numeric-entities':1})
     print document
     print errors
-    
+
 Docs
 ====
 
@@ -57,24 +61,24 @@ the `PyTidyLib`_ web page.
 .. _`PyTidyLib`: http://countergram.com/open-source/pytidylib/
 """
 
-VERSION = "0.2.1"
+VERSION = "0.2.4"
 
 setup(
     name="pytidylib",
     version=VERSION,
-    description="Python wrapper for HTML Tidy (tidylib)",
+    description="Python wrapper for HTML Tidy (tidylib) on Python 2 and 3",
     long_description=longdesc,
     author="Jason Stitt",
     author_email="js@jasonstitt.com",
     url="http://countergram.com/open-source/pytidylib/",
-    download_url="http://cloud.github.com/downloads/countergram/pytidylib/pytidylib-%s.tar.gz" % VERSION,
     packages=['tidylib'],
     classifiers=[
-          'Development Status :: 4 - Beta',
+          'Development Status :: 5 - Production/Stable',
           'Environment :: Other Environment',
           'Intended Audience :: Developers',
           'License :: OSI Approved :: MIT License',
           'Programming Language :: Python',
+          'Programming Language :: Python :: 3',
           'Natural Language :: English',
           'Topic :: Utilities',
           'Topic :: Text Processing :: Markup :: HTML',
diff --git a/tests/._DocsTest.py b/tests/._DocsTest.py
deleted file mode 100644
index 1b86477..0000000
Binary files a/tests/._DocsTest.py and /dev/null differ
diff --git a/tests/._FragsTest.py b/tests/._FragsTest.py
deleted file mode 100644
index 5b65ab2..0000000
Binary files a/tests/._FragsTest.py and /dev/null differ
diff --git a/tests/._SinkMemTest.py b/tests/._SinkMemTest.py
deleted file mode 100644
index 33912b2..0000000
Binary files a/tests/._SinkMemTest.py and /dev/null differ
diff --git a/tests/._threadsafety.py b/tests/._threadsafety.py
deleted file mode 100644
index bb9efcf..0000000
Binary files a/tests/._threadsafety.py and /dev/null differ
diff --git a/tests/__init__.py b/tests/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/tests/DocsTest.py b/tests/test_docs.py
similarity index 51%
rename from tests/DocsTest.py
rename to tests/test_docs.py
index 5028dc5..45ced58 100644
--- a/tests/DocsTest.py
+++ b/tests/test_docs.py
@@ -1,16 +1,16 @@
 # -*- coding: utf-8 -*-
-# Copyright 2009 Jason Stitt
-# 
+# Copyright 2009-2014 Jason Stitt
+#
 # Permission is hereby granted, free of charge, to any person obtaining a copy
 # of this software and associated documentation files (the "Software"), to deal
 # in the Software without restriction, including without limitation the rights
 # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 # copies of the Software, and to permit persons to whom the Software is
 # furnished to do so, subject to the following conditions:
-# 
+#
 # The above copyright notice and this permission notice shall be included in
 # all copies or substantial portions of the Software.
-# 
+#
 # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
@@ -19,12 +19,13 @@
 # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 # THE SOFTWARE.
 
+from __future__ import unicode_literals
+
 import unittest
-from tidylib import tidy_document
+from tidylib import tidy_document, release_tidy_doc, thread_local_doc
 
-DOC = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
-    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
-<html xmlns="http://www.w3.org/1999/xhtml">
+DOC = u'''<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
+<html>
   <head>
     <title></title>
   </head>
@@ -34,48 +35,65 @@ DOC = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
 </html>
 '''
 
+
 class TestDocs1(unittest.TestCase):
+
     """ Test some sample documents """
-    
-    def test_doc_with_unclosed_tag(self):
+
+    def test_p_element_closed(self):
         h = "<p>hello"
-        expected = DOC % '''<p>
-      hello
-    </p>'''
+        expected = DOC % '''<p>\n      hello\n    </p>'''
         doc, err = tidy_document(h)
         self.assertEqual(doc, expected)
-        
-    def test_doc_with_incomplete_img_tag(self):
+
+    def test_alt_added_to_img(self):
         h = "<img src='foo'>"
-        expected = DOC % '''<img src='foo' alt="" />'''
+        expected = DOC % '''<img src='foo' alt="">'''
         doc, err = tidy_document(h)
         self.assertEqual(doc, expected)
-        
-    def test_doc_with_entity(self):
-        h = "&eacute;"
-        expected = DOC % "&eacute;"
+
+    def test_entity_preserved_using_bytes(self):
+        h = b"&eacute;"
+        expected = (DOC % "&eacute;").encode('utf-8')
         doc, err = tidy_document(h)
         self.assertEqual(doc, expected)
-        
-        expected = DOC % "&#233;"
-        doc, err = tidy_document(h, {'numeric-entities':1})
+
+    def test_numeric_entities_using_bytes(self):
+        h = b"&eacute;"
+        expected = (DOC % "&#233;").encode('utf-8')
+        doc, err = tidy_document(h, {'numeric-entities': 1})
         self.assertEqual(doc, expected)
-    
-    def test_doc_with_unicode(self):
+
+    def test_non_ascii_preserved(self):
         h = u"unicode string ß"
-        expected = unicode(DOC, 'utf-8') % h
+        expected = DOC % h
         doc, err = tidy_document(h)
         self.assertEqual(doc, expected)
-        
-    def test_doc_with_unicode_subclass(self):
-        class MyUnicode(unicode):
-            pass
-        
-        h = MyUnicode(u"unicode string ß")
-        expected = unicode(DOC, 'utf-8') % h
+
+    def test_large_document(self):
+        h = u"A" * 10000
+        expected = DOC % h
         doc, err = tidy_document(h)
         self.assertEqual(doc, expected)
-        
-    
+
+    def test_xmlns_large_document_xml_corner_case(self):
+        # Test for a super weird edge case in Tidy that can cause it to return
+        # the wrong required buffer size.
+        body = '<span><span>A</span></span>' + 'A' * 7937
+        html = '<html xmlns="http://www.w3.org/1999/xhtml">' + body
+        doc, err = tidy_document(html, {'output-xml': 1})
+        self.assertEqual(doc.strip()[-7:], "</html>")
+
+    def test_keep_document(self):
+        h = "hello"
+        expected = DOC % h
+        for i in range(4):
+            doc, err = tidy_document(h, keep_doc=True)
+            self.assertEqual(doc, expected)
+        assert hasattr(thread_local_doc, 'doc')
+        release_tidy_doc()
+        assert not hasattr(thread_local_doc, 'doc')
+
+
 if __name__ == '__main__':
-    unittest.main()
\ No newline at end of file
+    unittest.main()
diff --git a/tests/FragsTest.py b/tests/test_fragments.py
similarity index 70%
rename from tests/FragsTest.py
rename to tests/test_fragments.py
index 1a5fbee..dcc7a3a 100644
--- a/tests/FragsTest.py
+++ b/tests/test_fragments.py
@@ -1,16 +1,16 @@
 # -*- coding: utf-8 -*-
-# Copyright 2009 Jason Stitt
-# 
+# Copyright 2009-2014 Jason Stitt
+#
 # Permission is hereby granted, free of charge, to any person obtaining a copy
 # of this software and associated documentation files (the "Software"), to deal
 # in the Software without restriction, including without limitation the rights
 # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 # copies of the Software, and to permit persons to whom the Software is
 # furnished to do so, subject to the following conditions:
-# 
+#
 # The above copyright notice and this permission notice shall be included in
 # all copies or substantial portions of the Software.
-# 
+#
 # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
@@ -19,53 +19,44 @@
 # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 # THE SOFTWARE.
 
+from __future__ import unicode_literals
+
 import unittest
 from tidylib import tidy_fragment
 
+
 class TestFrags1(unittest.TestCase):
     """ Test some sample fragment documents """
-    
-    def test_frag_with_unclosed_tag(self):
+
+    def test_p_element_closed(self):
         h = "<p>hello"
-        expected = '''<p>
-      hello
-    </p>'''
+        expected = '''<p>\n  hello\n</p>'''
         doc, err = tidy_fragment(h)
         self.assertEqual(doc, expected)
-        
-    def test_frag_with_incomplete_img_tag(self):
+
+    def test_alt_added_to_img(self):
         h = "<img src='foo'>"
-        expected = '''<img src='foo' alt="" />'''
+        expected = '''<img src='foo' alt="">'''
         doc, err = tidy_fragment(h)
         self.assertEqual(doc, expected)
-        
-    def test_frag_with_entity(self):
-        h = "&eacute;"
-        expected = "&eacute;"
+
+    def test_entity_preserved_using_bytes(self):
+        h = b"&eacute;"
+        expected = b"&eacute;"
         doc, err = tidy_fragment(h)
         self.assertEqual(doc, expected)
-        
-        expected = "&#233;"
-        doc, err = tidy_fragment(h, {'numeric-entities':1})
+
+    def test_numeric_entities_using_bytes(self):
+        h = b"&eacute;"
+        expected = b"&#233;"
+        doc, err = tidy_fragment(h, {'numeric-entities': 1})
         self.assertEqual(doc, expected)
-    
-    def test_frag_with_unicode(self):
+
+    def test_non_ascii_preserved(self):
         h = u"unicode string ß"
         expected = h
         doc, err = tidy_fragment(h)
         self.assertEqual(doc, expected)
 
-    def test_frag_with_unicode_subclass(self):
-        class MyUnicode(unicode):
-            pass
-
-        h = MyUnicode(u"unicode string ß")
-        expected = h
-        doc, err = tidy_fragment(h)
-        self.assertEqual(doc, expected)
-    
 if __name__ == '__main__':
     unittest.main()
-
-
-
diff --git a/tests/SinkMemTest.py b/tests/test_memory.py
similarity index 94%
rename from tests/SinkMemTest.py
rename to tests/test_memory.py
index e186c26..539deb1 100644
--- a/tests/SinkMemTest.py
+++ b/tests/test_memory.py
@@ -1,15 +1,15 @@
-# Copyright 2009 Jason Stitt
-# 
+# Copyright 2009-2014 Jason Stitt
+#
 # Permission is hereby granted, free of charge, to any person obtaining a copy
 # of this software and associated documentation files (the "Software"), to deal
 # in the Software without restriction, including without limitation the rights
 # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 # copies of the Software, and to permit persons to whom the Software is
 # furnished to do so, subject to the following conditions:
-# 
+#
 # The above copyright notice and this permission notice shall be included in
 # all copies or substantial portions of the Software.
-# 
+#
 # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
@@ -21,21 +21,27 @@
 import unittest
 from tidylib import tidy_document, tidy_fragment, sink
 
+try:
+    xrange
+except NameError:
+    xrange = range
+
+
 class TestSinkMemory(unittest.TestCase):
     """ Make sure error sinks are cleared properly """
-    
+
     def test_tidy_document(self):
         h = "<p>hello"
         for i in xrange(100):
             doc, err = tidy_document(h)
         self.assertEqual(sink.sinks, {})
-        
+
     def test_tidy_fragment(self):
         h = "<p>hello"
         for i in xrange(100):
             doc, err = tidy_fragment(h)
         self.assertEqual(sink.sinks, {})
-        
+
+
 if __name__ == '__main__':
     unittest.main()
-    
\ No newline at end of file
diff --git a/tests/threadsafety.py b/tests/threadsafety.py
index a7b1a72..cc2a128 100644
--- a/tests/threadsafety.py
+++ b/tests/threadsafety.py
@@ -1,15 +1,15 @@
-# Copyright 2009 Jason Stitt
-# 
+# Copyright 2009-2014 Jason Stitt
+#
 # Permission is hereby granted, free of charge, to any person obtaining a copy
 # of this software and associated documentation files (the "Software"), to deal
 # in the Software without restriction, including without limitation the rights
 # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 # copies of the Software, and to permit persons to whom the Software is
 # furnished to do so, subject to the following conditions:
-# 
+#
 # The above copyright notice and this permission notice shall be included in
 # all copies or substantial portions of the Software.
-# 
+#
 # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
@@ -41,13 +41,15 @@ SAMPLE = "hello, world"
 NUM_THREADS = 100
 NUM_TRIES = 100
 
+
 class TidyingThread(threading.Thread):
     def run(self):
         for x in xrange(NUM_TRIES):
             output, errors = tidy_document(SAMPLE, keep_doc=True)
             if output != DOC:
                 error_queue.put(output)
-            
+
+
 def run_test():
     threads = []
     for i in xrange(NUM_THREADS):
@@ -56,11 +58,10 @@ def run_test():
         t.start()
     for t in threads:
         t.join()
-            
+
+
 if __name__ == '__main__':
     run_test()
     if not error_queue.empty():
         print "About %s errors out of %s" % (error_queue.qsize(), NUM_THREADS * NUM_TRIES)
         print error_queue.get()
-    
-    
\ No newline at end of file
diff --git a/tidylib/.___init__.py b/tidylib/.___init__.py
deleted file mode 100644
index e9092ae..0000000
Binary files a/tidylib/.___init__.py and /dev/null differ
diff --git a/tidylib/._sink.py b/tidylib/._sink.py
deleted file mode 100644
index cf4e892..0000000
Binary files a/tidylib/._sink.py and /dev/null differ
diff --git a/tidylib/__init__.py b/tidylib/__init__.py
index 2ac83c9..5a3864c 100644
--- a/tidylib/__init__.py
+++ b/tidylib/__init__.py
@@ -1,15 +1,15 @@
-# Copyright 2009 Jason Stitt
-# 
+# Copyright 2009-2014 Jason Stitt
+#
 # Permission is hereby granted, free of charge, to any person obtaining a copy
 # of this software and associated documentation files (the "Software"), to deal
 # in the Software without restriction, including without limitation the rights
 # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 # copies of the Software, and to permit persons to whom the Software is
 # furnished to do so, subject to the following conditions:
-# 
+#
 # The above copyright notice and this permission notice shall be included in
 # all copies or substantial portions of the Software.
-# 
+#
 # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
@@ -20,35 +20,32 @@
 
 import ctypes
 import threading
-import re
 import platform
-from sink import create_sink, destroy_sink
+from tidylib.sink import create_sink, destroy_sink
 
 __all__ = ['tidy_document', 'tidy_fragment', 'release_tidy_doc']
 
-#----------------------------------------------------------------------------#
+# -------------------------------------------------------------------------- #
 # Constants
 
 LIB_NAMES = ['libtidy', 'libtidy.so', 'libtidy-0.99.so.0', 'cygtidy-0-99-0',
              'tidylib', 'libtidy.dylib', 'tidy']
 ENOMEM = -12
-RE_BODY = re.compile(r"<body>[\r\n]*(.+?)</body>", re.S)
 BASE_OPTIONS = {
-    "output-xhtml": 1,     # XHTML instead of HTML4
     "indent": 1,           # Pretty; not too much of a performance hit
     "tidy-mark": 0,        # No tidy meta tag in output
     "wrap": 0,             # No wrapping
     "alt-text": "",        # Help ensure validation
     "doctype": 'strict',   # Little sense in transitional for tool-generated markup...
     "force-output": 1,     # May not get what you expect but you will get something
-    }
-    
+}
+
 # Note: These are meant as sensible defaults. If you don't like these being
 # applied by default, just set tidylib.BASE_OPTIONS = {} after importing.
 # You can of course override any of these options when you call the
 # tidy_document() or tidy_fragment() function
 
-#----------------------------------------------------------------------------#
+# -------------------------------------------------------------------------- #
 # Globals
 
 tidy = None
@@ -66,42 +63,64 @@ for name in LIB_NAMES:
         break
     except OSError:
         pass
-        
+
 if tidy is None:
     raise OSError("Could not load libtidy using any of these names: %s" % (",".join(LIB_NAMES)))
 
-tidy.tidyCreate.restype = ctypes.POINTER(ctypes.c_void_p) # Fix for 64-bit systems
+tidy.tidyCreate.restype = ctypes.POINTER(ctypes.c_void_p)  # Fix for 64-bit systems
+
+# -------------------------------------------------------------------------- #
+# 3.x/2.x cross-compatibility
+
+try:
+    unicode  # 2.x
+
+    def is_unicode(obj):
+        return isinstance(obj, unicode)
+
+    def encode_key_value(k, v):
+        return unicode(k).encode('utf-8'), unicode(v).encode('utf-8')
+except NameError:
+    # 3.x
+    def is_unicode(obj):
+        return isinstance(obj, str)
+
+    def encode_key_value(k, v):
+        return str(k).encode('utf-8'), str(v).encode('utf-8')
 
-#----------------------------------------------------------------------------#
+# -------------------------------------------------------------------------- #
 # Functions
 
+
 def tidy_document(text, options=None, keep_doc=False):
     """ Run a string with markup through HTML Tidy; return the corrected one.
-    
-    text (str): The markup, which may be anything from an empty string to a
-    complete (X)HTML document. Unicode values are supported; they will be
-    encoded as UTF-8, and HTML Tidy's output will be decoded back to a unicode
-    object.
-    
+
+    text: The markup, which may be anything from an empty string to a complete
+    (X)HTML document. If you pass in a unicode type (py3 str, py2 unicode) you
+    get one back out, and tidy will have some options set that may affect
+    behavior (e.g. named entities converted to plain unicode characters). If
+    you pass in a bytes type (py3 bytes, py2 str) you will get one of those
+    back.
+
     options (dict): Options passed directly to HTML Tidy; see the HTML Tidy docs
     (http://tidy.sourceforge.net/docs/quickref.html) or run tidy -help-config
-    from the command line.    
-    
+    from the command line.
+
     keep_doc (boolean): If True, store 1 document object per thread and re-use
     it, for a slight performance boost especially when tidying very large numbers
     of very short documents.
-    
-    returns (str, str): The tidied markup [0] and warning/error messages[1].
+
+    returns (str, str): The tidied markup and unparsed warning/error messages.
     Warnings and errors are returned just as tidylib returns them.
     """
     global tidy, option_names
-    
+
     # Unicode approach is to encode as string, then decode libtidy output
     use_unicode = False
-    if isinstance(text, unicode):
+    if is_unicode(text):
         use_unicode = True
         text = text.encode('utf-8')
-    
+
     # Manage thread-local storage of persistent document object
     if keep_doc:
         if not hasattr(thread_local_doc, 'doc'):
@@ -109,11 +128,11 @@ def tidy_document(text, options=None, keep_doc=False):
         doc = thread_local_doc.doc
     else:
         doc = tidy.tidyCreate()
-    
+
     # This is where error messages are sent by libtidy
     sink = create_sink()
     tidy.tidySetErrorSink(doc, sink)
-    
+
     try:
         # Set options on the document
         # If keep_doc=True, options will persist between calls, but they can
@@ -129,23 +148,23 @@ def tidy_document(text, options=None, keep_doc=False):
             key = key.replace('_', '-')
             if value is None:
                 value = ''
-            tidy.tidyOptParseValue(doc, key, str(value))
+            key, value = encode_key_value(key, value)
+            tidy.tidyOptParseValue(doc, key, value)
             error = str(sink)
             if error:
                 raise ValueError("(tidylib) " + error)
-    
+
         # The point of the whole thing
         tidy.tidyParseString(doc, text)
         tidy.tidyCleanAndRepair(doc)
-        
+
         # Guess at buffer size; tidy returns ENOMEM if the buffer is too
         # small and puts the required size into out_length
         out_length = ctypes.c_int(8192)
         out = ctypes.c_buffer(out_length.value)
-        if ENOMEM == tidy.tidySaveString(doc, out, ctypes.byref(out_length)):
+        while ENOMEM == tidy.tidySaveString(doc, out, ctypes.byref(out_length)):
             out = ctypes.c_buffer(out_length.value)
-            tidy.tidySaveString(doc, out, ctypes.byref(out_length))
-            
+
         document = out.value
         if use_unicode:
             document = document.decode('utf-8')
@@ -156,33 +175,29 @@ def tidy_document(text, options=None, keep_doc=False):
             tidy.tidyRelease(doc)
 
     return (document, errors)
-    
-    
+
+
 def tidy_fragment(text, options=None, keep_doc=False):
     """ Tidy a string with markup and return only the <body> contents.
-    
+
     HTML Tidy normally returns a full (X)HTML document; this function returns only
     the contents of the <body> element and is meant to be used for snippets.
     Calling tidy_fragment on elements that don't go in the <body>, like <title>,
     will produce incorrect behavior.
-    
+
     Arguments and return value are the same as tidy_document. Note that HTML
     Tidy will always complain about the lack of a doctype and <title> element
     in fragments, and these errors are not stripped out for you. """
+    options = dict(options) if options else dict()
+    options["show-body-only"] = 1
     document, errors = tidy_document(text, options, keep_doc)
-    match = RE_BODY.search(document)
-    if match:
-        document = match.group(1).strip()
-        return (document, errors)
-    else:
-        raise ValueError("tidy_fragment failed to process text")
-    
+    document = document.strip()
+    return document, errors
+
+
 def release_tidy_doc():
     """ Release the stored document object in the current thread. Only useful
     if you have called tidy_document or tidy_fragment with keep_doc=True. """
     if hasattr(thread_local_doc, 'doc'):
         tidy.tidyRelease(thread_local_doc.doc)
         del thread_local_doc.doc
-    
-#----------------------------------------------------------------------------#
-    
\ No newline at end of file
diff --git a/tidylib/sink.py b/tidylib/sink.py
index 1dd168a..25bc791 100644
--- a/tidylib/sink.py
+++ b/tidylib/sink.py
@@ -1,15 +1,15 @@
-# Copyright 2009 Jason Stitt
-# 
+# Copyright 2009-2014 Jason Stitt
+#
 # Permission is hereby granted, free of charge, to any person obtaining a copy
 # of this software and associated documentation files (the "Software"), to deal
 # in the Software without restriction, including without limitation the rights
 # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 # copies of the Software, and to permit persons to whom the Software is
 # furnished to do so, subject to the following conditions:
-# 
+#
 # The above copyright notice and this permission notice shall be included in
 # all copies or substantial portions of the Software.
-# 
+#
 # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
@@ -26,18 +26,21 @@ import platform
 try:
     from cStringIO import StringIO
 except ImportError:
-    from StringIO import StringIO 
+    try:
+        from StringIO import StringIO
+    except ImportError:
... 90 lines suppressed ...
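
Editor's note on the tidySaveString hunk in tidylib/__init__.py above: the
commit changes a single "if ENOMEM == ..." retry into a "while" loop, so the
buffer is reallocated until the save call stops reporting ENOMEM. The sketch
below illustrates that pattern without needing libtidy installed. It is only
an illustration under stated assumptions: fake_save_string and
save_with_retry are hypothetical names that do not exist in tidylib, and
ctypes.create_string_buffer is used in place of the c_buffer alias seen in
the diff.

```python
import ctypes

ENOMEM = -12  # same value as the ENOMEM constant in the diff


def fake_save_string(payload, buf, length):
    """Hypothetical stand-in for tidy.tidySaveString (not the real API).

    If the buffer is too small, store the required size in `length` and
    return ENOMEM; otherwise copy the payload into the buffer.
    """
    needed = len(payload) + 1  # leave room for the trailing NUL byte
    if length.value < needed:
        length.value = needed
        return ENOMEM
    buf.value = payload
    length.value = len(payload)
    return 0


def save_with_retry(payload, initial_size=8):
    """Reallocate the output buffer until the save call no longer fails."""
    length = ctypes.c_int(initial_size)
    buf = ctypes.create_string_buffer(length.value)
    while fake_save_string(payload, buf, length) == ENOMEM:
        # `length` now holds the size requested by the callee
        buf = ctypes.create_string_buffer(length.value)
    return buf.value
```

With the stand-in above a single reallocation is always enough; the loop
form simply does not assume that the first reported size is final.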

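Editor's note on the encode_key_value hunk above: option names and values are
now converted to UTF-8 bytes before being handed to tidyOptParseValue, which
matters on Python 3 because ctypes passes bytes, not str, for C char*
arguments. A minimal sketch of the Python 3 branch follows. The
encode_key_value body is copied from the diff; normalize_option is a
hypothetical helper that only bundles the underscore and None handling shown
in the option loop, and is not part of tidylib.

```python
def encode_key_value(k, v):
    # Python 3 branch of the compatibility shim from the diff
    return str(k).encode('utf-8'), str(v).encode('utf-8')


def normalize_option(key, value):
    """Hypothetical helper mirroring the option loop in tidy_document."""
    key = key.replace('_', '-')  # allow show_body_only for show-body-only
    if value is None:
        value = ''               # None is sent to libtidy as an empty value
    return encode_key_value(key, value)


if __name__ == "__main__":
    print(normalize_option("show_body_only", 1))
    print(normalize_option("alt-text", None))
```
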
-- 
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/python-modules/packages/python-tidylib.git