html5lib

How to remove namespace value from inside lxml.html.html5parser element tag

Submitted by 一个人想着一个人 on 2019-12-06 16:47:31
Is it possible to keep the namespace out of the element tag when using html5parser from the lxml.html package? Example:

    from lxml import html
    print(html.parse('http://example.com').getroot().tag)
    # You will get 'html'

    from lxml.html import html5parser
    print(html5parser.parse('http://example.com').getroot().tag)
    # You will get '{http://www.w3.org/1999/xhtml}html'

The easiest solution I found is to strip it out with a regex, but maybe it is possible not to include that text at all? There is a specific namespaceHTMLElements boolean flag that controls this behavior:

    from lxml.html import html5parser
    from
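The answer excerpt above is cut off. A minimal sketch of how that flag is typically passed (assuming lxml's html5parser accepts a custom html5lib-based HTMLParser via its parser keyword) might look like this:

    from lxml.html import html5parser

    # Sketch (assumption): build a parser that does not place elements in the
    # XHTML namespace, and hand it over explicitly via the `parser=` keyword.
    parser = html5parser.HTMLParser(namespaceHTMLElements=False)
    root = html5parser.parse('http://example.com', parser=parser).getroot()
    print(root.tag)  # expected: 'html' rather than '{http://www.w3.org/1999/xhtml}html'

Note, though, that a later entry on this page reports the namespaceHTMLElements=False flag apparently being ignored in some lxml/html5lib combinations, so the behavior is worth verifying against your installed versions.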

Remove contents of <style>…</style> tags using html5lib or bleach

Submitted by 为君一笑 on 2019-12-06 03:28:55
I've been using the excellent bleach library for removing bad HTML. I've got a load of HTML documents which have been pasted in from Microsoft Word, and they contain things like:

    <STYLE> st1:*{behavior:url(#ieooui) } </STYLE>

Using bleach (with the style tag implicitly disallowed) leaves me with:

    st1:*{behavior:url(#ieooui) }

which isn't helpful. Bleach seems to have options only to escape tags, or to remove the tags (but not their contents). I'm looking for a third option: remove the tags and their contents. Is there any way to use bleach or html5lib to completely remove the style tag and its contents?
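A common workaround (a sketch, not something bleach offers directly) is to drop the style elements and their text with an HTML parser first, and only then run bleach over what remains:

    import bleach
    from lxml import html

    def strip_style_then_bleach(raw):
        # Sketch: drop <style> elements *and* their contents with lxml first,
        # because bleach on its own only escapes or strips the tags themselves.
        tree = html.fromstring(raw)
        for style in tree.findall('.//style'):
            style.drop_tree()
        return bleach.clean(html.tostring(tree, encoding='unicode'), strip=True)

    doc = ('<html><head><style> st1:*{behavior:url(#ieooui) } </style></head>'
           '<body><p>ok</p></body></html>')
    print(strip_style_then_bleach(doc))  # the CSS rule is gone, only 'ok' survives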

BeautifulSoup: A Quick Introduction to a Handy Web-Page Parsing Tool

Submitted by 主宰稳场 on 2019-12-05 04:22:53
We have already covered plenty of crawler examples and techniques, but most of the earlier articles focused on how to fetch the content of a web page. Today we look at the next step: once the content has been crawled, how do you extract the specific information you need from it? A fetched page is usually a str string object, and the most direct way to dig information out of it is the string find method together with slicing:

    s = '<p>价格:15.7 元</p>'
    start = s.find('价格:')
    end = s.find(' 元')
    print(s[start + 3:end])  # 15.7

This can cope with some extremely simple cases, but anything slightly more complex and writing code this way becomes exhausting. A more general approach is to use regular expressions:

    import re
    s = '<p>价格:15.7 元</p>'
    r = re.search(r'[\d.]+', s)
    print(r.group())  # 15.7

Regular expressions are the all-purpose tool of text parsing and can handle just about anything. Unfortunately, mastering them takes some learning: as the joke goes, we originally had one problem (extracting data from a web page); we used a regular expression, and now we have two problems. An HTML document is itself structured text with fixed rules, and that structure can simplify information extraction. This is where web-page extraction libraries such as lxml, pyquery, and BeautifulSoup come in. Generally we use these libraries to extract page information.
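The excerpt stops before showing BeautifulSoup itself. A minimal sketch of the same extraction done with BeautifulSoup (using html5lib as the parser backend here, since that is this page's topic; 'lxml' or the built-in 'html.parser' would also work) might look like this:

    from bs4 import BeautifulSoup

    s = '<p>价格:15.7 元</p>'
    soup = BeautifulSoup(s, 'html5lib')   # parse the fragment the way a browser would
    text = soup.p.get_text()              # '价格:15.7 元'
    print(text.replace('价格:', '').replace(' 元', ''))  # 15.7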

lxml html5parser ignores “namespaceHTMLElements=False” option

Submitted by 你说的曾经没有我的故事 on 2019-12-01 19:38:38
The lxml html5parser seems to ignore any namespaceHTMLElements=False option I pass to it. It puts all elements I give it into the HTML namespace instead of the (expected) void namespace. Here's a simple case that reproduces the problem:

    echo "<p>" | python -c "from sys import stdin; \
      from lxml.html import html5parser as h5, tostring; \
      print tostring(h5.parse(stdin, h5.HTMLParser(namespaceHTMLElements=False)))"

The output from that is this:

    <html:html xmlns:html="http://www.w3.org/1999/xhtml"><html:head></html:head><html:body><html:p> </html:p></html:body></html:html>

As can be seen, the html elements all end up in the XHTML namespace.
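One detail worth checking (an observation about lxml's API rather than a confirmed fix for this report): lxml.html.html5parser.parse takes guess_charset as its second positional parameter, so a parser passed positionally, as in the command above, may never be used at all. Passing it by keyword looks like this:

    from lxml.html import html5parser as h5, tostring

    # Sketch: hand the custom parser over via the `parser=` keyword so it is
    # not mistaken for the guess_charset argument.
    parser = h5.HTMLParser(namespaceHTMLElements=False)
    doc = h5.document_fromstring("<p>hi</p>", parser=parser)
    print(tostring(doc))
    # If the flag is honoured, the output uses plain <html>, <body>, <p> tags
    # rather than the html:-prefixed, namespaced form shown above.

If the elements still come out namespaced after that, the behavior likely depends on the installed lxml/html5lib versions.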

Obtaining position info when parsing HTML in Python

Submitted by 依然范特西╮ on 2019-11-29 10:59:11
I'm trying to find a way to parse (potentially malformed) HTML in Python and, if a set of conditions is met, output that piece of the document with its position (line, column). The position information is what is tripping me up here. And to be clear, I have no need to build an object tree. I simply want to find certain pieces of data and their position in the original document (think of a spell checker, for example: 'word "foo" at line x, column y, is misspelled'). As an example, I want something like this (using ElementTree's Target API):

    import xml.etree.ElementTree as ET

    class EchoTarget:
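The ElementTree sketch above is cut off. As a rough alternative approach (a sketch, not necessarily the asker's eventual solution), the standard library's html.parser is tolerant of malformed markup and reports the current position through getpos():

    from html.parser import HTMLParser

    class PositionReporter(HTMLParser):
        def handle_data(self, data):
            line, col = self.getpos()      # 1-based line, 0-based column
            word = data.strip()
            if word:
                print(f'data {word!r} at line {line}, column {col}')

    PositionReporter().feed('<p>foo\n<b>bar</b></p>')
    # data 'foo' at line 1, column 3
    # data 'bar' at line 2, column 3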

How to Install TensorFlow (CPU) on Ubuntu 18.04

Submitted by 六眼飞鱼酱① on 2019-11-29 02:40:22
Recently I started learning the deep-learning framework TensorFlow. At first I installed it under Anaconda on Windows, but after a few reinstalls the Anaconda Navigator kept crashing on launch, so I decided to switch to Ubuntu and keep working on TensorFlow there. My laptop has no NVIDIA graphics card, so I only installed the CPU version, and everything runs inside a virtual machine. Let's get started. First install Ubuntu 18.04 (any release from 14.04 up should do). Ubuntu already ships with Python 3, so there is no need to install Python separately.

1. First, switch the package sources to the Aliyun mirror to speed up downloads

(1) Back up the current (default, official) source list:

    sudo cp /etc/apt/sources.list /etc/apt/sources.list.backup

(2) Open sources.list and delete all of its contents:

    sudo gedit /etc/apt/sources.list

(3) Replace the contents with the following, then save:

    deb http://mirrors.aliyun.com/ubuntu/ bionic main restricted universe multiverse
    deb-src http://mirrors.aliyun.com/ubuntu/ bionic main restricted universe multiverse
    deb http:/

beautifulsoup, html5lib: module object has no attribute _base

Submitted by 孤人 on 2019-11-28 04:37:30
When I updated my packages I started getting this new error:

    class TreeBuilderForHtml5lib(html5lib.treebuilders._base.TreeBuilder):
    AttributeError: 'module' object has no attribute '_base'

I tried updating beautifulsoup, but that did not help. How can I fix this?

One answer: I upgraded beautifulsoup4 and html5lib and that resolved the issue:

    pip install --upgrade beautifulsoup4
    pip install --upgrade html5lib

Another answer: this is an issue with the upstream html5lib package: https://bugs.launchpad.net/beautifulsoup/+bug/1603299 . To fix it, force a downgrade to an older version:

    pip install --upgrade html5lib==1.0b8

edit nov, 2017: it seems

html5lib: TypeError: __init__() got an unexpected keyword argument 'encoding'

Submitted by 生来就可爱ヽ(ⅴ<●) on 2019-11-28 02:02:06
I'm trying to install html5lib. At first I tried to install the latest version (8 or 9 nines), but it came into conflict with my BeautifulSoup, so I decided to try an older version (0.9999999, seven nines). I installed it, but when I try to use it:

    >>> with urlopen("http://example.com/") as f:
            document = html5lib.parse(f, encoding=f.info().get_content_charset())

I get an error:

    Traceback (most recent call last):
      File "<pyshell#11>", line 2, in <module>
        document = html5lib.parse(f, encoding=f.info().get_content_charset())
      File "C:\Python\Python35-32\lib\site-packages\html5lib\html5parser.py",
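The traceback is cut off, but the TypeError in the title usually means the installed html5lib no longer accepts an encoding= keyword. In html5lib 1.x the charset taken from the HTTP headers is passed as transport_encoding instead (worth verifying against the version you actually have installed):

    import html5lib
    from urllib.request import urlopen

    with urlopen("http://example.com/") as f:
        # html5lib 1.x: the old encoding= keyword was removed; the
        # header-declared charset goes in as transport_encoding.
        document = html5lib.parse(f, transport_encoding=f.info().get_content_charset())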