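Installation and Environment Setup
You need to install Appium, the Android SDK, and the Nox emulator (夜神模拟器), then configure environment variables for the Android SDK and Nox. This section gives a brief walkthrough; more detailed tutorials are easy to find online.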
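Installing Appium
Download Appium from the official site, http://appium.io, and install it.
Installing the Android SDK
Download an Android SDK, either by searching for one or from http://tools.android-studio.org/index.php/sdk. You can also use my copy: Android SDK (password: cyup).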
After installation, add two entries to the PATH environment variable (adjust them to wherever you installed the SDK): D:\Android\Sdk\tools and D:\Android\Sdk\platform-tools
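A quick way to confirm the PATH change took effect is to check it from Java. The following is only a minimal sketch: the class name PathCheck and its helper methods are made up for illustration, and the two directories are the example locations used above.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper: reports which required directories are absent from a PATH value.
class PathCheck {
    static List<String> missingEntries(String pathValue, String separator, List<String> required) {
        List<String> present = new ArrayList<>();
        for (String entry : pathValue.split(Pattern.quote(separator))) {
            present.add(normalise(entry));
        }
        List<String> missing = new ArrayList<>();
        for (String dir : required) {
            if (!present.contains(normalise(dir))) {
                missing.add(dir);
            }
        }
        return missing;
    }

    // Trim, drop trailing slashes, and ignore case (Windows paths are case-insensitive).
    static String normalise(String path) {
        String s = path.trim();
        while (s.endsWith("\\") || s.endsWith("/")) {
            s = s.substring(0, s.length() - 1);
        }
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String path = System.getenv("PATH");
        List<String> required = Arrays.asList(
                "D:\\Android\\Sdk\\tools", "D:\\Android\\Sdk\\platform-tools");
        System.out.println("Missing from PATH: "
                + missingEntries(path == null ? "" : path, File.pathSeparator, required));
    }
}
```

An empty list means both directories are visible to newly started processes; remember to reopen the command-line window after editing environment variables.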
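Installing the Nox emulator
Testing is normally done on a real device, but to save cost and simplify deployment I recommend installing the Nox emulator (other emulators on the market should offer similar functionality). Download and install it from the official site.
After installation, add its bin directory to the PATH environment variable, for example: D:\software\夜神模拟器\Nox\bin.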
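Next, make sure the adb in the Android SDK and the adb bundled with Nox are the same version. Check the version of adb in your local Android SDK; if it differs from Nox's nox_adb, replace Nox's nox_adb.exe with a copy of the SDK's adb.exe (copy it, rename the copy to nox_adb.exe, and paste it over the original).
# Check the Android SDK adb version: adb --version
# Check the Nox emulator adb version: nox_adb --version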
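The comparison can also be scripted. Here is a rough sketch in Java, assuming both adb and nox_adb are on the PATH and that the first line of their output looks like "Android Debug Bridge version 1.0.41" (the output format is my assumption; AdbVersionCheck is a hypothetical class name).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that compares the versions reported by adb and nox_adb.
class AdbVersionCheck {
    // Assumed output format: "Android Debug Bridge version 1.0.41"
    private static final Pattern VERSION =
            Pattern.compile("Android Debug Bridge version\\s+(\\d+(?:\\.\\d+)*)");

    // Returns the version number found in the output, or null if there is none.
    static String parseVersion(String output) {
        Matcher m = VERSION.matcher(output);
        return m.find() ? m.group(1) : null;
    }

    // Runs a command and returns its combined stdout/stderr.
    static String run(String... command) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(command).redirectErrorStream(true).start();
        StringBuilder sb = new StringBuilder();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        p.waitFor();
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        String sdk = parseVersion(run("adb", "--version"));
        String nox = parseVersion(run("nox_adb", "--version"));
        System.out.println("adb: " + sdk + ", nox_adb: " + nox);
        System.out.println(sdk != null && sdk.equals(nox)
                ? "Versions match"
                : "Versions differ: replace nox_adb.exe with the SDK's adb.exe");
    }
}
```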
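Configuring Fiddler to Capture App Traffic
The configuration above applies to Fiddler itself.
To capture responses for URLs that match a fixed pattern and save them to files, you can edit the FiddlerScript in Fiddler. A simple example is to modify the OnBeforeResponse method: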
static function OnBeforeResponse(oSession: Session) {
var oRegEx = /aweme\/v1\/search\/item\.*/i;
var file = "d://douyinvideo/";
// Format the timestamp
var date = new Date();
var month = date.getMonth() + 1;
var strDate = date.getDate();
var strHours = date.getHours();
var strMinutes = date.getMinutes();
var strSeconds = date.getSeconds();
var strMilliSeconds = date.getMilliseconds();
// Start with "" so the numeric parts are concatenated as text instead of being summed
var currentdate = "" + date.getFullYear() + month + strDate
+ '_' + strHours + strMinutes + strSeconds + '_' + strMilliSeconds;
if ((oSession.responseCode == 200) && oRegEx.test(oSession.fullUrl))
{
var oriBody = oSession.GetResponseBodyAsString();
oSession.utilSetResponseBody(oriBody);
oSession.SaveResponseBody(file+currentdate+".txt");
oSession.utilSetResponseBody(oriBody);
}
if (m_Hide304s && oSession.responseCode == 304) {
oSession["ui-hide"] = "true";
}
}
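The file-naming rule in that script is easier to see in isolation. The sketch below restates it in Java with zero-padded fields, so that for example 1 November and 11 January cannot produce the same date prefix. It is not part of the FiddlerScript: the class CaptureFileName, its method names, and the example.com URL in the test are illustrative assumptions.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.regex.Pattern;

// Hypothetical restatement of the script's "which responses to save, and under what name" logic.
class CaptureFileName {
    // Zero-padded timestamp: year, month, day, then time, then milliseconds.
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyyMMdd_HHmmss_SSS");
    // Same URL fragment the script looks for, matched case-insensitively.
    private static final Pattern SEARCH_URL =
            Pattern.compile("aweme/v1/search/item", Pattern.CASE_INSENSITIVE);

    // Save only successful responses whose URL contains the search endpoint.
    static boolean shouldSave(int responseCode, String fullUrl) {
        return responseCode == 200 && SEARCH_URL.matcher(fullUrl).find();
    }

    // Builds the target file name, e.g. d:/douyinvideo/20190411_080509_007.txt
    static String fileNameFor(String dir, LocalDateTime time) {
        return dir + STAMP.format(time) + ".txt";
    }

    public static void main(String[] args) {
        System.out.println(fileNameFor("d:/douyinvideo/", LocalDateTime.now()));
    }
}
```

If you want the same padding inside Fiddler, each field has to be padded by hand in the script before concatenation.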
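Configuring the Nox Emulator
To let the program perform system-level operations, the emulator needs some configuration. After starting Nox, enable developer mode, set the screen to never turn off, and turn on USB debugging.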
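If you need to capture traffic, then after installing and configuring Fiddler, set a network proxy in the emulator: open WLAN, long-press the connected network, and configure the proxy with the IP of the machine running Fiddler, for example 192.168.11.52, and port 8888 (Fiddler's default listening port). Once the proxy is set, the emulator can no longer get online until Fiddler's certificate is installed: open 192.168.11.52:8888 in the emulator's browser (substitute your own IP), follow the link on that page to download the certificate, and install it.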
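Next, install the app you want to test or capture traffic from. It is best to download it from the official site and copy it into the emulator, which installs it automatically. This article uses Douyin (抖音) as the example.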
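Writing the Script
Getting Device and App Information
To get the package name and Activity of the app currently in the foreground, run the following commands in order in a command-line window: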
adb shell
dumpsys window windows | grep -E 'mFocusedApp'
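The mFocusedApp line contains the package and Activity joined by a slash. A small sketch of pulling them apart in Java follows; the sample line is only my assumption of what the output typically looks like (the exact format varies between Android versions), and FocusedAppParser is a hypothetical name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for the mFocusedApp line printed by dumpsys.
class FocusedAppParser {
    // Illustrative sample only; real output differs between Android versions.
    static final String SAMPLE =
            "  mFocusedApp=AppWindowToken{2a9d3c5 token=Token{1f2e3d4 "
            + "ActivityRecord{3c4b5a6 u0 com.ss.android.ugc.aweme/.main.MainActivity t5}}}";

    // A whitespace, then "package/activity": the package before the slash, the Activity after it.
    private static final Pattern COMPONENT = Pattern.compile("\\s([\\w.]+)/([\\w.$]+)");

    // Returns {appPackage, appActivity}, or null when no component is present.
    static String[] parse(String line) {
        Matcher m = COMPONENT.matcher(line);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] result = parse(SAMPLE);
        System.out.println("appPackage  = " + result[0]);
        System.out.println("appActivity = " + result[1]);
    }
}
```

The two values map directly onto the appPackage and appActivity capabilities that Appium expects.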
Below is the Nox device information that Appium needs, using the Douyin app as the example:
{
"platformName": "Android",
"deviceName": "127.0.0.1:62001",
"appPackage": "com.ss.android.ugc.aweme",
"appActivity": ".main.MainActivity"
}
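Note: the Nox emulator's default device name is 127.0.0.1:62001. For a real device, look the name up on the command line.
After starting Nox, run adb devices on the command line; the output must look like this: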
C:\Users\zfh>adb devices
List of devices attached
127.0.0.1:62001 device
If the device is not listed, connect to it from the command line with nox_adb connect 127.0.0.1:62001.
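If you want the crawler to verify this before it starts, the adb devices output is easy to check programmatically. A minimal sketch, where AdbDevices is a hypothetical class and the output text is passed in as a string:

```java
// Hypothetical check that a given serial appears in `adb devices` output in the "device" state.
class AdbDevices {
    static boolean isAttached(String adbDevicesOutput, String serial) {
        for (String line : adbDevicesOutput.split("\\r?\\n")) {
            // Each device line is "<serial><whitespace><state>"; the header line has more fields.
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2 && parts[0].equals(serial) && parts[1].equals("device")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String sample = "List of devices attached\n127.0.0.1:62001\tdevice\n";
        System.out.println(isAttached(sample, "127.0.0.1:62001")
                ? "Emulator attached"
                : "Not attached: run nox_adb connect 127.0.0.1:62001");
    }
}
```

A device listed as offline or unauthorized is treated as not attached, since Appium cannot drive it in that state.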
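Douyin Crawler Script
Using the Douyin app as the example, I wrote a script that searches Douyin videos for a fixed list of keywords.
Required dependencies: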
<dependency>
<groupId>org.seleniumhq.selenium</groupId>
<artifactId>selenium-api</artifactId>
<version>3.141.59</version>
</dependency>
<dependency>
<groupId>org.seleniumhq.selenium</groupId>
<artifactId>selenium-remote-driver</artifactId>
<version>3.141.59</version>
</dependency>
<dependency>
<groupId>io.appium</groupId>
<artifactId>java-client</artifactId>
<version>6.1.0</version>
</dependency>
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.17</version>
</dependency>
<dependency>
<groupId>cn.hutool</groupId>
<artifactId>hutool-all</artifactId>
<version>4.3.2</version>
</dependency>
For the DesiredCapabilities options, see: DesiredCapabilities内容详解(较全), a fairly complete reference.
Code:
import io.appium.java_client.TouchAction;
import io.appium.java_client.android.AndroidDriver;
import io.appium.java_client.remote.MobileCapabilityType;
import io.appium.java_client.touch.WaitOptions;
import io.appium.java_client.touch.offset.PointOption;
import org.apache.commons.io.FileUtils;
import org.openqa.selenium.By;
import org.openqa.selenium.OutputType;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.remote.DesiredCapabilities;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.File;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.time.Duration;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.Random;
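/**
* Douyin crawler thread (Nox emulator version).
* Drives the app through Appium while Fiddler saves the Douyin video data to files.
* Note: this code, especially the coordinates and some of the XPath expressions, only fits the device it was written for.
* Also: if Douyin is not logged in, or the cache is cleared on startup, Toutiao (今日头条) must already be installed and logged in on the same device before this crawler starts.
* @author zfh
* @version 1.0
* @since 2019/4/11 18:27
*/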
public class DouYinCrawlerThread extends Thread {
public static Integer PAGE_LIMIT = 20; // Paging limit per keyword: stop after 20 pages
private static Logger logger = LoggerFactory.getLogger(DouYinCrawlerThread.class);
private static List<String> keyList = new ArrayList<>();
private static Integer keyIndex = 0; // Index of the keyword currently being searched
private static boolean noReset = false;
static {
keyList.add("java");
keyList.add("python");
keyList.add("php");
keyList.add("sql");
keyList.add("程序员");
}
public DouYinCrawlerThread(boolean noReset) {
this.setName("DouyinVideoCrawlerThread_Nox");
DouYinCrawlerThread.noReset = noReset;
}
@Override
public void run() {
logger.info("Starting: " + this.getName());
while (true) {
Date startTime = new Date();
try {
keyIndex = 0;
startCrawler();
} catch (MalformedURLException e) {
e.printStackTrace();
}
long consumeTime = new Date().getTime() - startTime.getTime();
long sleepTime = 60 * 60 * 1000 - consumeTime;
if (sleepTime <= 0) {
sleepTime = (20 + new Random().nextInt(40)) * 60 * 1000;
}
if (!noReset) {
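noReset = true; // From the second round on, the cache no longer needs to be cleared
}
logger.info("Round finished in " + consumeTime/(60*1000) + " minutes; sleeping " + sleepTime/(60*1000) + " minutes before the next round");
sleepTime(sleepTime); // Aim for one full search round per hour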
}
}
/**
* Starts the crawler, retrying until one round completes without an exception.
*/
private static void startCrawler() throws MalformedURLException {
DesiredCapabilities desiredCapabilities = new DesiredCapabilities();
desiredCapabilities.setCapability(MobileCapabilityType.DEVICE_NAME, "127.0.0.1:62001"); // Default Nox device name
desiredCapabilities.setCapability("platformName", "Android");
desiredCapabilities.setCapability("appPackage","com.ss.android.ugc.aweme");
desiredCapabilities.setCapability("appActivity",".main.MainActivity");
desiredCapabilities.setCapability("unicodeKeyboard", true);
desiredCapabilities.setCapability("resetKeyboard", true); // Restore the original input method afterwards
desiredCapabilities.setCapability("noReset", noReset);
logger.info(noReset ? "Starting without clearing the cache" : "Starting with the cache cleared");
URL url = new URL("http://127.0.0.1:4723/wd/hub");
boolean fail = false;
do {
try {
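logger.info("Launching the Douyin app...");
startApp(desiredCapabilities, url);
fail = false;
} catch (Exception ex) {
fail = true;
// keyIndex equals keyList.size() if the failure happened after the last keyword, so guard the lookup
String interrupted = keyIndex < keyList.size() ? keyList.get(keyIndex) : "(none)";
logger.info("Unexpected error; interrupted at keyword: " + interrupted + "; restarting");
ex.printStackTrace();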
}
} while (fail) ;
}
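/**
* Launches the Douyin app, logs in if needed, and searches each keyword in turn.
* @param desiredCapabilities capabilities for the Appium session
* @param url address of the Appium server
* @throws Exception if any step fails
*/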
private static void startApp(DesiredCapabilities desiredCapabilities, URL url) throws Exception {
AndroidDriver webDriver = new AndroidDriver(url, desiredCapabilities);
Thread.sleep(30*1000);
String pageSource = webDriver.getPageSource();
// Check for pop-ups that need to be closed
while (needToCheckAndClick(pageSource)) {
exCheckAndReset(webDriver);
Thread.sleep(5 * 1000);
pageSource = webDriver.getPageSource();
}
/*// Tap to accept the policy
try {
webDriver.findElement(By.xpath("/hierarchy/android.widget.FrameLayout/android.widget.FrameLayout/android.widget." +
"FrameLayout/android.widget.FrameLayout/android.widget.LinearLayout/android.widget.TextView[3]")).click();
} catch (Exception ex) {
ex.printStackTrace();
}
// Grant permissions
try {
Thread.sleep(3*1000);
webDriver.findElement(By.id("com.android.packageinstaller:id/permission_allow_button")).click();
Thread.sleep(3*1000);
webDriver.findElement(By.id("com.android.packageinstaller:id/permission_allow_button")).click();
} catch (Exception ex) {
ex.printStackTrace();
}*/
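// Tap the screen to enter
Thread.sleep(3*1000);
logger.info("Tapping the screen to enter");
click(340, 720, webDriver);
if (!noReset) {
// The cache was reset, so we have to log in again
logger.info("Cache was reset; logging in again");
try {
login(webDriver);
} catch (Exception ex) {
logger.info("Toutiao authorized login failed");
getScreenShot(webDriver);
ex.printStackTrace();
throw ex; // Login failed: rethrow
}
Thread.sleep(5 * 1000);
pageSource = webDriver.getPageSource();
// After logging in, check again for pop-ups that need closing
while (needToCheckAndClick(pageSource)) {
exCheckAndReset(webDriver);
Thread.sleep(5 * 1000);
pageSource = webDriver.getPageSource(); // Refresh, otherwise this loop never ends
}
}
// Tap to open the search page
Thread.sleep(3*1000);
logger.info("Opening the search page");
click(675, 75, webDriver);
// Start searching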
int time = 0;
int successTime = 0;
boolean isNew = true;
for (; keyIndex < keyList.size(); keyIndex++) {
time++;
String keywords = keyList.get(keyIndex);
logger.info((time) + ": starting a new search, keyword: " + keywords);
try {
boolean res = search(webDriver, keywords, isNew);
if (res) {
successTime++;
}
if (isNew) {
isNew = false; // Only the first search uses the initial search-page layout
}
} catch (Exception ex) {
logger.info("Search failed:");
getScreenShot(webDriver); // Take a screenshot
throw ex; // Abort this round; the caller restarts the app
}
Thread.sleep(5 * 1000);
}
logger.info("Closing the app");
webDriver.closeApp();
Thread.sleep(10 * 1000);
logger.info("Ran " + (time) + " searches: " + successTime + " succeeded, " + (time - successTime) + " failed");
}
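/**
* Logs in through Toutiao authorization.
* @param webDriver driver for the current Appium session
* @throws Exception if any step fails
*/
public static void login(AndroidDriver webDriver) throws Exception {
// Log in
Thread.sleep(5 * 1000);
logger.info("Tapping Log in");
click(650, 1240, webDriver);
// Tap "log in with Toutiao"
Thread.sleep(5 * 1000);
// If the login page ("密码登录" = "password login") has not loaded yet, wait a little longer
if (!webDriver.getPageSource().contains("密码登录")) {
Thread.sleep(5 * 1000);
}
logger.info("Choosing Toutiao login");
click(215, 580, webDriver);
// Tap "authorize with Toutiao"
Thread.sleep(10 * 1000);
logger.info("Tapping Toutiao authorized login");
click(340, 650, webDriver);
// Skip binding a phone number
Thread.sleep(10 * 1000);
logger.info("Skipping phone-number binding");
click(670, 75, webDriver);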
}
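/**
* Searches for one keyword and pages through the video results.
* @param webDriver driver for the current Appium session
* @param keywords keyword to search for
* @param isNew true for the first search after entering the search page
* @return false if the search returned no results, true otherwise
* @throws Exception if any step fails
*/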
private static boolean search(AndroidDriver webDriver, String keywords, boolean isNew) throws Exception {
if (!isNew) {
Thread.sleep(3*1000);
// Tap the input box
click(320, 80, webDriver);
logger.info("Entering the keyword");
WebElement editElement = webDriver.findElement(By.xpath("/hierarchy/android.widget.FrameLayout/android.widget.LinearLayout/android.widget.FrameLayout/android.widget.FrameLayout/android.widget.FrameLayout[2]/android.widget.RelativeLayout/android.widget.RelativeLayout/android.widget.FrameLayout/android.widget.EditText"));
editElement.clear(); // Clear the input box. Some sources say the cursor must first be moved to the end, but in testing on a Samsung phone it worked without moving it
editElement.click();
editElement.sendKeys(keywords);
} else {
Thread.sleep(3*1000);
webDriver.findElement(By.xpath("/hierarchy/android.widget.FrameLayout/android.widget.LinearLayout/android.widget.FrameLayout/android.widget.FrameLayout[1]/android.widget.LinearLayout/android.widget.FrameLayout/android.widget.RelativeLayout/android.widget.LinearLayout/android.widget.FrameLayout[2]"))
.click();
Thread.sleep(3 * 1000);
logger.info("Entering the keyword");
webDriver.findElement(By.xpath("/hierarchy/android.widget.FrameLayout/android.widget.LinearLayout/android.widget.FrameLayout/android.widget.FrameLayout[1]/android.widget.LinearLayout/android.widget.FrameLayout/android.widget.RelativeLayout/android.widget.LinearLayout/android.widget.FrameLayout[2]/android.widget.EditText"))
.sendKeys(keywords); // Type the keyword
}
Thread.sleep(5*1000);
logger.info("Tapping Search");
click(670, 70, webDriver); // Tap the search button
if (isSearchNull(webDriver)) {
return false;
}
Thread.sleep(3*1000);
// Switch to the Videos tab
webDriver.findElement(By.xpath("/hierarchy/android.widget.FrameLayout/android.widget.LinearLayout/android.widget.FrameLayout/android.widget.FrameLayout/android.widget.FrameLayout[2]/android.widget.RelativeLayout/android.widget.FrameLayout/android.widget.RelativeLayout/android.widget.HorizontalScrollView/android.widget.LinearLayout/android.support.v7.app.ActionBar.Tab[2]"))
.click();
int width = webDriver.manage().window().getSize().width;
int height = webDriver.manage().window().getSize().height;
int i = 0; // Number of pages turned so far
int swipeFailNum = 0; // Number of failed page turns
while (swipeFailNum < 1) {
if (i >= PAGE_LIMIT) {
logger.info("Reached the paging limit of " + PAGE_LIMIT + " pages; stopping");
break;
}
logger.info("Turning to page " + (i+1));
try {
Thread.sleep( 2 * 1000);
swipe(webDriver, width/2, (int)(height * 0.75), width/2, (int)(height/3));
swipe(webDriver, width/2, (int)(height * 0.75), width/2, (int)(height/3)); // Swipe twice per page
logger.info("Page turn succeeded");
} catch (Exception ex) {
logger.info("Page turn failed");
swipeFailNum++;
ex.printStackTrace();
} finally {
i++;
}
// Check whether this is the last page
String source = webDriver.getPageSource();
if (source.contains("没有搜索到相关")) { // UI text: "nothing relevant was found"
getScreenShot(webDriver); // Possibly garbled input, or the device is temporarily blocked
logger.info(source);
logger.info("No relevant content found");
break;
}
if (source.contains("没有更多")) { // UI text: "no more"
logger.info(source);
logger.info("Reached the last page; no more data");
break;
}
try {
viewAndReturn(webDriver); // Open a random video, watch briefly, then go back
} catch (Exception e) {
e.printStackTrace();
logger.info("Error while opening a video");
}
}
logger.info("Search completed successfully");
return true;
}
/**
* Opens a random video on the current page, waits a random time, then goes back.
*/
private static void viewAndReturn(AndroidDriver driver) throws Exception {
Thread.sleep(5 * 1000);
logger.info("Tapping a video to watch");
int width = driver.manage().window().getSize().width;
int height = driver.manage().window().getSize().height;
logger.info("width = " + width + ", height = " + height);
int randomX = width/2 + (new Random().nextBoolean() ? new Random().nextInt(width/4) : - new Random().nextInt(width/4));
int randomY = height/2 + (new Random().nextBoolean() ? new Random().nextInt(height/4) : - new Random().nextInt(height/4));
TouchAction action = new TouchAction(driver);
action.tap(PointOption.point(randomX, randomY));
action.perform();
logger.info("randomX = " + randomX + ", randomY = " + randomY);
Thread.sleep(3 * 1000);
String resource = driver.getPageSource();
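if (resource != null && resource.contains("视频") && resource.contains("综合")) {
// Still on the results page (its "视频"/"综合" tabs are visible), so no video was opened
logger.info("Failed to open a video; returning");
return;
}
int sleepTime = 3 + new Random().nextInt(9);
logger.info("Sleeping " + sleepTime + " seconds");
Thread.sleep(sleepTime * 1000);
logger.info("Done watching; going back");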
action.tap(PointOption.point(35, 80));
action.perform();
action.release();
}
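/**
* Taps the given screen coordinates.
* @param x horizontal coordinate in pixels
* @param y vertical coordinate in pixels
* @param driver driver for the current Appium session
*/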
private static void click(int x, int y, AndroidDriver driver) {
TouchAction action = new TouchAction(driver);
action.tap(PointOption.point(x, y));
action.perform();
action.release();
}
/**
* Swipes from one point to another.
* @param driver driver for the current Appium session
* @param fromX start x
* @param fromY start y
* @param toX end x
* @param toY end y
*/
private static void swipe(AndroidDriver driver, int fromX, int fromY, int toX, int toY) {
Duration duration = Duration.ofMillis(800);
TouchAction action = new TouchAction(driver)
.press(PointOption.point(fromX, fromY))
.waitAction(WaitOptions.waitOptions(duration))
.moveTo(PointOption.point(toX, toY)).release();
action.perform();
action.release();
}
/**
* Saves a screenshot to the working directory.
* @param driver driver for the current Appium session
*/
private static void getScreenShot(AndroidDriver driver) {
File img = driver.getScreenshotAs(OutputType.FILE);
if (img != null && img.exists()) {
File file = new File(new Date().getTime() + "." + img.getName().split("\\.")[1]);
try {
FileUtils.copyFile(img, file);
img.delete();
logger.info("Screenshot saved: " + file.getAbsolutePath());
} catch (IOException e) {
e.printStackTrace();
}
}
}
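/**
* Checks whether the search returned no results.
* @param driver driver for the current Appium session
* @return true if the page reports an empty result
*/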
private static boolean isSearchNull(AndroidDriver driver) {
String resource = driver.getPageSource();
return resource != null && resource.contains("搜索结果为空");
}
/**
* Detects and dismisses pop-ups that interrupt the app at runtime.
* @param driver driver for the current Appium session
*/
private static void exCheckAndReset(AndroidDriver driver) {
String resource = driver.getPageSource();
if (resource != null) {
if (resource.contains("隐私政策") && resource.contains("仅浏览") && resource.contains("同意")) {
logger.info("Accepting the privacy policy");
click(460, 940, driver);
} else if (resource.contains("青少年模式")) {
// Dismiss the prompt to enable teen mode (青少年模式)
logger.info("Dismissing the teen-mode prompt");
click(350, 875, driver);
} else if ((resource.contains("通知") && resource.contains("去打开")) ||
(resource.contains("新版本") && resource.contains("升级")) ||
(resource.contains("通讯录好友"))) {
/*getScreenShot(driver);*/
logger.info("Declining notifications / version upgrade / contacts friends");
click(250, 900, driver);
}
}
}
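/**
* Decides whether a pop-up is present that needs to be closed.
* @param pageSource XML source of the current page
* @return true if the page contains any known pop-up text
*/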
private static boolean needToCheckAndClick(String pageSource) {
if (pageSource != null) {
return pageSource.contains("青少年模式")
|| pageSource.contains("隐私政策")
|| pageSource.contains("通知")
|| pageSource.contains("去打开")
|| pageSource.contains("通讯录好友");
}
return false;
}
/**
* Thread.sleep(millis) with the InterruptedException caught and printed.
* @param millis time to sleep in milliseconds
*/
private static void sleepTime(long millis) {
try {
Thread.sleep(millis);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
public static void main(String[] args) {
DouYinCrawlerThread thread = new DouYinCrawlerThread(false);
thread.start();
}
}
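The script mixes hard-coded tap coordinates with positions derived from the screen size. The derived ones, used in search() and viewAndReturn(), can be isolated as plain arithmetic, which makes them easy to test without a device. This is only a sketch: GestureMath and its methods are hypothetical names, and the 720x1280 screen is my guess based on the hard-coded coordinates.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical helper holding the coordinate arithmetic from search() and viewAndReturn().
class GestureMath {
    // Upward swipe along the vertical centre line: from 75% of the height to one third of it.
    static int[] swipeUp(int width, int height) {
        return new int[] { width / 2, (int) (height * 0.75), width / 2, height / 3 };
    }

    // Random tap point within a quarter of the screen size around the centre.
    static int[] randomTap(int width, int height, Random random) {
        int dx = random.nextInt(width / 4);
        int dy = random.nextInt(height / 4);
        int x = width / 2 + (random.nextBoolean() ? dx : -dx);
        int y = height / 2 + (random.nextBoolean() ? dy : -dy);
        return new int[] { x, y };
    }

    // True when the point lies strictly inside the central region used by randomTap.
    static boolean inCentralRegion(int[] point, int width, int height) {
        return Math.abs(point[0] - width / 2) < width / 4
                && Math.abs(point[1] - height / 2) < height / 4;
    }

    public static void main(String[] args) {
        // 720x1280 is an assumed screen size, not a value taken from the script.
        System.out.println("swipe: " + Arrays.toString(swipeUp(720, 1280)));
        System.out.println("tap:   " + Arrays.toString(randomTap(720, 1280, new Random())));
    }
}
```

Passing a single Random instance in, rather than creating a new one for each call as the script does, also makes the behaviour reproducible with a fixed seed.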
Once the environment is configured, run the script and it will automatically search Douyin for each keyword in order.
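How often a round runs is decided by the sleep calculation in run(): the thread aims for one round per hour, and when a round overruns the hour it sleeps for a random 20 to 59 minutes instead. Restated as a standalone sketch (CrawlSchedule is a hypothetical name; the numbers are the ones used in run()):

```java
import java.util.Random;

// Hypothetical restatement of the sleep calculation in run().
class CrawlSchedule {
    static final long HOUR_MS = 60L * 60 * 1000;

    // Time to sleep after a round that took consumedMillis.
    static long nextSleepMillis(long consumedMillis, Random random) {
        long remaining = HOUR_MS - consumedMillis;
        if (remaining > 0) {
            return remaining; // Finish out the hour
        }
        // The round took an hour or more: pause for a random 20 to 59 minutes
        return (20 + random.nextInt(40)) * 60L * 1000;
    }

    public static void main(String[] args) {
        long tenMinutes = 10L * 60 * 1000;
        System.out.println("Sleep after a 10-minute round: "
                + nextSleepMillis(tenMinutes, new Random()) / 60000 + " minutes");
        System.out.println("Sleep after a 2-hour round: "
                + nextSleepMillis(2 * HOUR_MS, new Random()) / 60000 + " minutes");
    }
}
```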
Source: CSDN
Author: eknown
Link: https://blog.csdn.net/qq_28379809/article/details/89362551