
Douyin (TikTok China) crawler: Appium + Mitmproxy, a powerful combination for crawling Douyin data

2020-12-07 10:20:21 TiToData

An introduction to APP crawlers: Appium + Mitmproxy as a powerful combination for collecting Douyin data.
Lately I have been studying how to implement APP crawlers, so let's use Douyin as an example and build an introductory demo. The reason I picked Douyin is mainly that a small project I have on hand is based on the Douyin APP, but that is a story for later 🤭.
Capturing packets from an APP follows the same logic as capturing packets from a web page. The simplest and most efficient approach is to analyze the requests directly: construct and send the request yourself to simulate the client, receive the response, and extract the fields you need from the returned data, exactly as you would when scraping the web. The difficulty is that for web scraping the request parameters can usually be reconstructed by analyzing the JS scripts, whereas for an APP the parameters are built inside the APP itself. I don't have the skill to reverse-engineer the APP, so that road is closed, and that is the main reason for turning to Appium. It is roughly analogous to falling back on Selenium when you can't figure out the JS ( ﹁ ﹁ ) ~→.

The goal

Since this is an introductory routine, the example is relatively simple; it is just a summary of what I have learned over the past few days. Even so, it has to accomplish something, otherwise writing meaningless code is just a waste of time. So what should it do? We won't reinvent the wheel. Some time ago a project that scored the looks of Douyin users was quite popular; naturally we don't want to simply imitate it. So let's try writing a crawler that crawls a Douyin user's fans and, while we're at it, calls Baidu's AI interface to score each fan's looks!
Obviously, rating users by their avatars is a silly idea. Most fans do not use photos of themselves as avatars, and many avatars are not even human faces, so this feature is of limited practical value. In short, the idea is there; anyone who needs better results can add extra weighting to the data to improve accuracy, for example the number of followers, the number of likes received, and so on. Generally these figures correlate positively with a blogger's looks; after all, this is an age of judging by faces.
Approach
Use Appium to open the specified user's fan page and automatically traverse the fan list. Meanwhile mitmproxy intercepts the traffic as a man-in-the-middle, processes the returned packets, calls Baidu's face recognition API to score the avatars, and then labels and saves the data. The idea is clear, and so is the code. Since I don't have a real device at hand, this tutorial runs on a virtual machine, where everything is of course much slower.
Setting up the development environment is a little involved; if anything is unclear, see the previous article, which covers it in more detail. The actual business code, however, is really simple. This is also where Python's strength shows: it has a huge ecosystem of third-party libraries.

Code implementation


(Figure: overall business flow)


Environment: Python 3.6.5, Appium, MitmProxy
Support libraries: requests, Appium-Python-Client, mitmproxy (each can be installed with pip install; a one-line command is shown after this list)

Virtual machine: Genymotion, Android 5.1.1

Douyin version: 6.3.0
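
For reference, assuming pip points at the Python 3.6 environment, all three support libraries can be installed with a single command:

pip install requests Appium-Python-Client mitmproxy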


Finally, the extracted data is saved in JSON format. Each record consists of four fields: short_id (the Douyin number), nickname, uid (the internal user key), and beauty (the looks score). The fourth value is highly uncertain because the code is far from perfect; adjust it to your actual needs, as this article is only meant as a learning introduction. The code is split into two modules, one for controlling the phone with Appium and one for capturing the data with mitmproxy, and we will walk through them in that order. Before analyzing the code, though, we need to briefly introduce mitmproxy.
A quick MitmProxy primer
mitmproxy is a proxy for MITM, i.e. man-in-the-middle attacks (Man-in-the-middle attack). A man-in-the-middle proxy first forwards requests like a normal proxy, keeping the client and server talking to each other; at the same time it inspects and records the data it intercepts, and can even tamper with it to trigger specific server-side or client-side behavior. Unlike packet-capture tools such as Fiddler, mitmproxy does not just intercept requests so developers can view and analyze them; it can also be extended through custom scripts. Its Python support library lets me intercept packets directly from Python, which is the foundation of this crawler.
This crawler only gives you a small taste of mitmproxy; for a detailed tutorial go to the mitmproxy GitHub repository, which provides complete official examples from simple to complex that are great for beginners. If English documentation is hard going, I also found a good Chinese write-up: "Using mitmproxy + python as an intercepting proxy" on the Langsha blog, a very good introductory summary. Understand how it works first, and the official examples will then feel much more familiar.
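
As a minimal sketch (the class and file names here are my own, not from this project), a mitmproxy addon is just a class with hook methods such as response, registered through a module-level addons list:

# peek.py - a minimal mitmproxy addon that logs every response URL
import mitmproxy.http

class Peek:
    def response(self, flow: mitmproxy.http.HTTPFlow) -> None:
        # Called once for every completed response that passes through the proxy
        print(flow.request.url, flow.response.status_code)

addons = [Peek()]

Run it with mitmdump -s peek.py -p 8080 and point the phone's Wi-Fi proxy at the machine running mitmdump.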


The Appium part

(Figure: the effect Appium needs to achieve)


import time
from functools import reduce

from appium import webdriver

# Douyin ID of the target user whose fan list will be traversed
AIM_ID = "target_douyin_id"


def init_device():
    desired_caps = {}
    desired_caps['platformName'] = 'Android'
    desired_caps['udid'] = "192.168.13.107:5555"
    desired_caps['deviceName'] = "second"
    desired_caps['platformVersion'] = "5.1.1"
    desired_caps['appPackage'] = 'com.ss.android.ugc.aweme'
    desired_caps['appActivity'] = 'com.ss.android.ugc.aweme.main.MainActivity'
    desired_caps["unicodeKeyboard"] = True
    desired_caps["resetKeyboard"] = True
    desired_caps["noReset"] = True
    desired_caps["newCommandTimeout"] = 600
    device = webdriver.Remote('http://127.0.0.1:4723/wd/hub', desired_caps)
    device.implicitly_wait(3)
    return device


def move_to_fans(device):
    # Open the search page, search for the target Douyin ID and enter its fan list
    device.find_element_by_id("com.ss.android.ugc.aweme:id/au1").click()
    device.find_element_by_id("com.ss.android.ugc.aweme:id/a86").send_keys(AIM_ID)
    device.find_element_by_id("com.ss.android.ugc.aweme:id/d5h").click()
    device.find_elements_by_id("com.ss.android.ugc.aweme:id/cwm")[0].click()
    device.find_element_by_id("com.ss.android.ugc.aweme:id/adf").click()


def fans_cycle(device):
    fans_done = []  # temporary list of fans that have already been visited
    while True:
        elements = device.find_elements_by_id("com.ss.android.ugc.aweme:id/d9x")
        all_fans = [x.text for x in elements]
        # If every fan on the current screen has been seen before, the list did not
        # refresh after swiping, which means we have reached the end of the fan list
        if reduce(lambda x, y: x and y, [(x in fans_done) for x in all_fans]) and fans_done:
            print("Traversal finished, terminating session")
            break
        for element in elements:
            if element.text not in fans_done:
                element.click()            # open the fan's profile (triggers the detail request)
                time.sleep(2)
                device.press_keycode(4)    # Android BACK key
                time.sleep(1)
                fans_done.append(element.text)
                print(element.text)
        device.swipe(600, 1600, 600, 900, duration=1000)  # swipe up to load more fans
        if len(fans_done) > 30:
            fans_done = fans_done[10:]  # cap the list length to save memory
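
The original post does not show the glue code that ties these functions together; a minimal driver of my own (assuming fans_cycle takes the device as an argument, as above) might look like this:

if __name__ == "__main__":
    device = init_device()       # connect to the Appium server and launch Douyin
    move_to_fans(device)         # search for the target user and open the fan list
    try:
        fans_cycle(device)       # click through every fan until the list stops refreshing
    finally:
        device.quit()            # always close the Appium session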


Appium's main job is to search for the target Douyin user, open that user's fan page, and then traverse the whole fan list with the fans_cycle function. While the fans are being traversed, the APP keeps sending requests to the server; all we have to do is intercept the response data in a mitmproxy handler and process it however we like.
Note: testing shows that the element IDs differ between Douyin versions, so the code above is not universal; you may need to use Appium to look up the element IDs again for your version.
The few lines that check whether the fan list has reached the end deserve an explanation, because the code is rather abstract. My idea is to keep a temporary list of the fans that have already been visited. When every fan shown on the current page is already in that list, the page did not refresh after swiping, which proves we are at the end of the list and the loop can stop. On the very first iteration both lists are empty, so we also require fans_done to be non-empty. The reduce call checks whether all of the names currently on screen exist in the temporary list: it evaluates to true if they all do and false otherwise. The temporary list is sliced once per loop so its length never grows much beyond 30, saving memory.
This check is not the easiest to read, but the idea itself is simple, and it shows off how concise Python can be; in Java I might need dozens of lines for the same requirement. In a word: life is short, I use Python.
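
Incidentally, the same end-of-list check can be written more readably with the built-in all(); this is just an equivalent alternative, not what the code above uses:

# True when every fan currently on screen has already been visited
if fans_done and all(name in fans_done for name in all_fans):
    print("Traversal finished, terminating session")
    break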
The Mitmproxy part
This part covers both the data interception and the Baidu API call. We filter the intercepted Douyin traffic to find the user-detail request and capture its response data. Since the returned data is JSON, it is very easy to parse in Python. The four fields we need are all in that packet (the Douyin number may be empty for some users, which the code handles separately). To score a user's looks we use the user's high-definition avatar.

import json

import mitmproxy.http

from spider.api.baidu import FaceDetect
from lib.shortid import Short_ID

face = FaceDetect()
spider_id = Short_ID()


class Fans:
    def response(self, flow: mitmproxy.http.HTTPFlow):
        # Only handle the user-detail responses triggered by clicking a fan
        if "aweme/v1/user/?user_id" in flow.request.url:
            user = json.loads(flow.response.text)["user"]
            short_id = user["short_id"]
            nickname = user["nickname"]
            uid = user["uid"]
            avatar = user["avatar_larger"]["url_list"][0]
            beauty = face(avatar)  # score the HD avatar with Baidu's face API
            # Some users hide their Douyin number; fall back to a generated one
            short_id = spider_id(uid) if short_id == "0" else short_id
            data = {
                "short_id": short_id,
                "nickname": nickname,
                "uid": uid,
                "beauty": beauty
            }
            print(data)


# Register the addon so mitmdump/mitmproxy picks it up when loading this script
addons = [Fans()]
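
To actually run this part, save the script (say as fans.py, a name of my own choosing) and launch it through mitmdump, for example:

mitmdump -s fans.py -p 8080

Then set the Genymotion device's Wi-Fi proxy to the host's IP and that port, with the mitmproxy CA certificate installed on the device (it can be downloaded from mitm.it while the proxy is active) so HTTPS traffic can be decrypted.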


As for Baidu AI face recognition: it is simple enough that you can write the code you need straight from the API documentation, so I won't go into much detail here (documentation link). My request code is posted below for anyone who needs it. A user's avatar may contain more than one face, so when several faces are detected we average their beauty scores and keep four decimal places. Note: a score of 0 does not necessarily mean "very ugly"; perhaps there is no face in the picture, or the API request limit has been reached. Judge according to the actual situation and improve the code yourself if needed.

import json

import requests


class FaceDetect:
    def __init__(self):
        self.ak = "your Baidu access_key"
        self.sk = "your Baidu secret_key"
        self.token = self.__access_token()

    def __access_token(self):
        # Exchange the API key pair for an OAuth access token
        url = 'https://aip.baidubce.com/oauth/2.0/token?' \
              'grant_type=client_credentials&client_id={}&client_secret={}'.format(self.ak, self.sk)
        headers = {'Content-Type': 'application/json; charset=UTF-8'}
        req = requests.get(url, headers=headers)
        token = json.loads(req.text)["access_token"]
        return token

    def __face_detect(self, pic):
        # Ask the face-detect endpoint to analyze the avatar referenced by URL
        url = "https://aip.baidubce.com/rest/2.0/face/v3/detect?access_token={}".format(self.token)
        params = {
            "image": pic,
            "image_type": "URL",
            "face_field": "age,beauty,expression,gender,face_shape,emotion",
            "max_face_num": "10"
        }
        req = requests.post(url, params=params)
        return req.text

    def __average_beauty(self, data):
        # Average the beauty score over every detected face; 0 if the call failed
        if data["error_code"] == 0:
            average_beauty = []
            for face in data["result"]["face_list"]:
                average_beauty.append(face["beauty"])
            return "{:.4f}".format(sum(average_beauty) / len(average_beauty))
        return 0

    def __call__(self, url):
        r = self.__face_detect(url)
        data = json.loads(r)
        return self.__average_beauty(data)
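
A quick way to sanity-check the class on its own (the avatar URL below is a placeholder, and real keys must be filled in first):

if __name__ == "__main__":
    face = FaceDetect()
    # Any publicly reachable image URL will do for a smoke test
    score = face("https://example.com/some_avatar.jpg")
    print("average beauty:", score)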


(Figure: sample of the running output)

Summary

All in all, this article touches on quite a lot of material; take your time digesting it.

——————————————————————————————————————————

TiToData: a professional data interface service platform for short video and live streaming.

For more information, please contact: TiToData

Covered mainstream platforms: Douyin, Kuaishou, Xiaohongshu, TikTok, YouTube

Copyright notice
This article was created by [TiToData]. Please include the original link when reposting. Thank you.
https://chowdera.com/2020/12/202012071017260738.html