Font Anti-Scraping: 58 Rental Listings

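58 is a site whose font anti-scraping is relatively simple: it only obfuscates digits. That makes it a good place to start learning font anti-scraping.

Code first, detailed notes after. Everything here is beginner-level, so don't worry about not being able to follow: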

import requests
import base64
import re
from fontTools.ttLib import TTFont


# Fetch the page; return the Base64-encoded font matched in the @font-face rule, plus the raw response text
def get_params(url):
    resp = requests.get(url)
    content = resp.text
    # print(resp.text)
    font_face = re.search(r"font-face\{.*?base64,(.*?)'.*?\}", content, re.S).group(1).strip()
    # print(font_face)
    return font_face, content


# Decode the Base64 data and write it out as a ttf font file
def parse_font_face(font_face):
    font_face = base64.b64decode(font_face)
    with open('58.ttf', 'wb') as f:
        f.write(font_face)

    font = TTFont('58.ttf')
    font.saveXML('58.xml')
    # Use fontTools' built-in getBestCmap() to get the mapping
    bestcmap = font['cmap'].getBestCmap()
    # print(bestcmap)
    # Build a new mapping dictionary
    newmap = dict()
    for key in bestcmap.keys():
        value = int(re.search(r'(\d+)', bestcmap[key]).group(1)) - 1
        key = hex(key)
        key = re.sub('0x', '&#x', key) + ';'
        newmap[key] = value
    return newmap


# Use the mapping to replace the obfuscated characters so the response text is readable
def parse_message(newmap, content):
    for key, value in newmap.items():
        if key in content:
            content = content.replace(key, str(value))
    titles = re.findall('<h2>.*?<a .*?>(.*?)</a>', content, re.S)
    # print(titles)
    prices = re.findall('<div class="money">.*?<b .*?>(.*?)</b>', content, re.S)
    # print(prices)
    # print(len(titles), len(prices))
    message = dict()
    # zip() stops at the shorter list, so a missing price cannot raise IndexError
    for title, price in zip(titles, prices):
        title = re.sub('\n|      ', '', title)
        message[title] = price + ' yuan/month'
    for k, v in message.items():
        print(k, v)


if __name__ == '__main__':
    url = 'https://sz.58.com/chuzu/'
    font_face, content = get_params(url)
    newmap = parse_font_face(font_face)
    parse_message(newmap, content)

Result:

Next, a detailed look at the most important part of the code:

def parse_font_face(font_face):
    font_face = base64.b64decode(font_face)
    with open('58.ttf', 'wb') as f:
        f.write(font_face)

    font = TTFont('58.ttf')
    font.saveXML('58.xml')
    # Use fontTools' built-in getBestCmap() to get the mapping
    bestcmap = font['cmap'].getBestCmap()
    # print(bestcmap)
    # Build a new mapping dictionary
    newmap = dict()
    for key in bestcmap.keys():
        value = int(re.search(r'(\d+)', bestcmap[key]).group(1)) - 1
        key = hex(key)
        key = re.sub('0x', '&#x', key) + ';'
        newmap[key] = value
    return newmap

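This is the part that processes the font file.

Let's go through it line by line.

First, the function receives a font_face value: the Base64-encoded ttf file that was matched in the page source.

The first lines of the function body decode the Base64 data and write the resulting bytes to 58.ttf.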
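The extract-and-decode step can be tried without hitting the site. The following is a minimal sketch, assuming the page embeds the font as a Base64 data URL inside an @font-face rule, as the regex in get_params expects. SAMPLE_HTML, the font-family name demo-font, and the helper extract_font_bytes are made up for illustration; the real payload is far longer than the 16 characters used here.

```python
import base64
import re

# Hypothetical stand-in for the page source; only the overall shape matters.
SAMPLE_HTML = (
    "<style>@font-face{font-family:'demo-font';"
    "src:url('data:application/font-ttf;charset=utf-8;base64,AAEAAAALAIAAAwAw') "
    "format('truetype')}</style>"
)

# Same pattern as get_params, written as a raw string with the braces escaped.
FONT_RE = re.compile(r"font-face\{.*?base64,(.*?)'.*?\}", re.S)


def extract_font_bytes(html):
    """Return the decoded font bytes embedded in html, or None if no font is found."""
    match = FONT_RE.search(html)
    if match is None:
        return None
    return base64.b64decode(match.group(1).strip())


font_bytes = extract_font_bytes(SAMPLE_HTML)
print(font_bytes[:4])  # a TrueType file starts with the bytes 00 01 00 00
```

Returning None instead of calling .group(1) on a failed match makes it easier to notice when the page layout changes and the font is no longer where the pattern expects it.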

Once the file is written, open it in FontCreator:

font = TTFont('58.ttf')
font.saveXML('58.xml')

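The third-party Python library fontTools parses the ttf file and generates an XML file. The XML file plays no role in the code itself; it is there mainly for analysis.

Open the XML file; it has more than 600 lines:

Look at the cmap table: it is the mapping. It contains several mapping tables, which means the mapping is dynamic rather than static. In other words, working out the pattern once and hard-coding the mapping will not work.

The mapping actually in use is the first one, at the top:

Take the first line as an example: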

<map code="0x9476" name="glyph00008"/><!-- CJK UNIFIED IDEOGRAPH-9476 -->

Note: 0x9476 glyph00008
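To compare many entries at once instead of reading the XML by eye, the map elements can be pulled out with the standard library. This is only a sketch for analysis: the fragment below is a hypothetical, shortened cmap subtable in the shape shown above, where the first entry is the one from this article and the second is made up, and read_cmap_entries is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened cmap fragment; a real 58.xml has many more entries.
CMAP_XML = """
<cmap_format_4 platformID="0" platEncID="3" language="0">
  <map code="0x9476" name="glyph00008"/>
  <map code="0x958f" name="glyph00003"/>
</cmap_format_4>
""".strip()


def read_cmap_entries(xml_text):
    """Return {code point (int): glyph name} for every <map> element in xml_text."""
    root = ET.fromstring(xml_text)
    return {int(m.get("code"), 16): m.get("name") for m in root.iter("map")}


entries = read_cmap_entries(CMAP_XML)
for code, name in entries.items():
    print(hex(code), name)
```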

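Now look again at the file opened in FontCreator:

Note: uni9476 7

Comparing several pairs, after stripping the uni and 0x prefixes and the glyph0000 prefix, we find that:

the value for 9476 in the XML is 1 greater than the value shown for it in the ttf file.

That is:

xml: 9476 8

ttf: 9476 7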
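The off-by-one rule observed here can be written as a small function. This is a sketch under the assumption that every glyph name in the font follows the glyphNNNNN pattern seen above; sample_cmap is made-up data in the shape getBestCmap() returns (code point to glyph name), and only its first entry comes from this article.

```python
import re


def glyph_name_to_digit(glyph_name):
    """Map a glyph name such as 'glyph00008' to the digit it draws (8 - 1 = 7)."""
    match = re.search(r"(\d+)", glyph_name)
    if match is None:
        raise ValueError("unexpected glyph name: %r" % glyph_name)
    return int(match.group(1)) - 1


# Hypothetical sample; the second entry is invented for illustration.
sample_cmap = {0x9476: "glyph00008", 0x958F: "glyph00001"}
digits = {hex(code): glyph_name_to_digit(name) for code, name in sample_cmap.items()}
print(digits)
```

Raising on an unexpected name, rather than failing with an AttributeError on None, gives an early and readable signal if the site changes its glyph naming.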

So, in the code that follows:

bestcmap = font['cmap'].getBestCmap()

gets the cmap mapping from the ttf file.

# Build a new mapping dictionary
newmap = dict()
for key in bestcmap.keys():
    value = int(re.search(r'(\d+)', bestcmap[key]).group(1)) - 1
    key = hex(key)
    key = re.sub('0x', '&#x', key) + ';'
    newmap[key] = value
return newmap

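This builds a new mapping table in three steps:

Take the number from the original cmap entry and subtract 1, so that it matches the mapping we saw in FontCreator.

Convert the code point to hexadecimal.

Replace the 0x prefix with &#x and append a semicolon. This step prepares for processing the page later: in the browser's developer tools the digits show up as unreadable characters, and in the response text they appear as character references of the form &#x9476;.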
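The three steps above can be sketched end to end on a made-up snippet. The digit values in sample_digits, the markup in sample_html, and the helper names are assumptions for illustration only; the real values change with each font the site serves.

```python
def build_entity_map(cmap_digits):
    """Turn {code point: digit} into {'&#x....;': 'digit'} for text replacement."""
    return {"&#x{:x};".format(code): str(digit) for code, digit in cmap_digits.items()}


def decode_entities(html, entity_map):
    """Replace every obfuscated character reference in html with its digit."""
    for entity, digit in entity_map.items():
        html = html.replace(entity, digit)
    return html


# Hypothetical mapping and markup, not taken from a real response.
sample_digits = {0x9476: 7, 0x958F: 0, 0x9A4B: 2}
sample_html = '<div class="money"><b class="demo">&#x9a4b;&#x9476;&#x958f;&#x958f;</b></div>'

decoded = decode_entities(sample_html, build_entity_map(sample_digits))
print(decoded)  # <div class="money"><b class="demo">2700</b></div>
```

Formatting with {:x} yields lowercase hex, the same as hex() in the original code, so the keys match references written in lowercase in the page.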