TikTok crawler tutorial, from 0 to 1: getting TikTok user data

2020-12-08 09:53:44 TiToData

Preface

The ultimate goal is to scrape TikTok video data, and the request for a user's video data requires that user's sec_id. We don't know how this encrypted id is generated, but we can see it in the user data packets, and it does not change. So we can crawl user data first and collect other users' sec_id values; once we have the user data, we can crawl those users' video data. This article covers crawling TikTok user data.


I. Analyzing the user request packets

If you have already configured the environment (the TikTok app is installed on the phone and the packet capture software is set up), you can read on. Otherwise, please go through the Environment configuration article first and then continue with this one.

1. Analyze user data

Open TikTok and go to a video publisher's profile page (tap the publisher's avatar on the right to enter it):
image.png
Now check the packets captured in Fiddler. The request URL for the corresponding data contains the string "aweme". In Fiddler, the upper-right pane shows our request data and the lower-right pane shows the corresponding response data:
image.png
My crawling approach is to take a user's following list and then crawl the following lists of the users in it, so I don't pay much attention to the user profile data itself. I'm more interested in the user's following list and follower data. So how do we view a user's following list?
image.png
This shows the user's following list. What does the packet capture software show for it?
image.png
From the picture we can see that follower is the user's follower (fan) data and following is the users that this user follows. The corresponding request headers and response data are shown in the figure below:
image.png

1.1. Request header analysis

We first analyze the request data :
image.png
The request data includes the URL (that is, the API) and the header data. The headers are Host, Connection, Cookie, Accept-Encoding, X-SS-QUERIES, X-SS-REQ-TICKET, X-Tt-Token, sdk-version, User-Agent, X-Khronos, X-Gorgon, and X-Pods.

  • First, the corresponding URL:
    api = (
        "https://api.amemv.com/aweme/v1/user/follower/list/?"  # URL
        "user_id={}"  # user_id; can be found in a user's following list
        "&max_time={}"  # current timestamp
        "&count=20&offset=0&source_type=1&address_book_access=2&gps_access=2"  # not important
        "&ts={}"  # current timestamp
        "&js_sdk_version=1.16.3.5&app_type=normal&manifest_version_code=630"  # not important
        "&_rticket={}"  # current timestamp
        "&ac=wifi&device_id=47012747444&iid=1846815477740845"  # not important
        "&os_version=8.0.0&channel=wandoujia_aweme1&version_code=630"  # not important
        "&device_type=HUAWEI%20NXT-AL10&language=zh&resolution=1080*1812&openudid=b202a24eb8c1538a"  # not important
        "&update_version_code=6302&app_name=aweme&version_name=6.3.0&os_api=26&device_brand=HUAWEI&ssmix=a"  # not important
        "&device_platform=android&dpi=480&aid=1128"  # not important
        "&sec_user_id={}"  # encrypted uid; can be found in a user's following list
    ).format(user_id, max_time, ts, _rticket, sec_user_id)

Looking at the API above, we can construct most of the parameters ourselves. Only user_id and the encrypted sec_user_id cannot be generated. However, a user's following list gives us the user_id and sec_user_id of every user they follow. So we only need to know one user's user_id and sec_user_id, and from there we can get the user_id and sec_user_id of the users associated with them, and then of the users associated with those users.
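
The idea of starting from one known user and expanding through following lists can be sketched as a breadth-first walk. This is only an illustration of the strategy, not the author's code: `crawl_users`, `fetch_following`, and the in-memory `fake_graph` are hypothetical names, and the real request (with its signed headers) is replaced by a stand-in so the example runs offline.

```python
from collections import deque


def crawl_users(seed, fetch_following, max_users=100):
    """Breadth-first walk over following lists.

    seed: a (user_id, sec_user_id) pair for one known user.
    fetch_following: hypothetical callable that takes user_id and
        sec_user_id and returns the (user_id, sec_user_id) pairs of the
        users that this user follows.
    """
    seen = {seed}
    queue = deque([seed])
    while queue and len(seen) < max_users:
        current = queue.popleft()
        for pair in fetch_following(*current):
            if pair not in seen:
                seen.add(pair)
                queue.append(pair)
                if len(seen) >= max_users:
                    break
    return seen


# Stand-in for the real request: a tiny in-memory "following" graph.
fake_graph = {
    ("1", "sec_a"): [("2", "sec_b"), ("3", "sec_c")],
    ("2", "sec_b"): [("3", "sec_c"), ("4", "sec_d")],
}


def fake_fetch(user_id, sec_user_id):
    return fake_graph.get((user_id, sec_user_id), [])


found = crawl_users(("1", "sec_a"), fake_fetch)
print(sorted(found))
```

The `seen` set keeps the walk from requesting the same user twice, and `max_users` bounds the crawl so that it stops even on a very large graph.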

  • Next, we analyze the corresponding request headers:
Host: api.amemv.com # the corresponding host; constant
Connection: keep-alive # constant; not important
Cookie: "cookies"  # important; constant; your own cookie, visible in Fiddler
Accept-Encoding: gzip # constant
X-SS-REQ-TICKET: 1606999477776 # current timestamp (milliseconds); we can generate it ourselves
X-Tt-Token:  003ea17385e4...23bbe199e41467-1.0.0 # your own token; important; constant; visible in Fiddler
sdk-version: 1 # constant
User-Agent: com.ss.a....0.2991.0) # important; your own UA, visible in Fiddler
X-Khronos: 1606999477 # current timestamp (seconds)
X-Gorgon: 03006cc00000d7464322a76ab998c12eef987b81af552788dabd # important; how to obtain it is covered below
X-Pods: # not important; can be left out
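
Several of these values are the same instant at two resolutions: X-Khronos, ts, and max_time use seconds, while X-SS-REQ-TICKET and _rticket use milliseconds. A minimal sketch, assuming it is preferable to derive all of them from a single clock reading so they stay consistent with each other; `build_time_fields` is a hypothetical helper name, not something from the app or the original code.

```python
import time


def build_time_fields(now_ms=None):
    """Derive the second- and millisecond-resolution values from one reading."""
    if now_ms is None:
        now_ms = time.time_ns() // 1_000_000  # current Unix time in milliseconds
    seconds = str(now_ms // 1000)
    millis = str(now_ms)
    return {
        "ts": seconds,
        "max_time": seconds,
        "X-Khronos": seconds,
        "_rticket": millis,
        "X-SS-REQ-TICKET": millis,
    }


# Using the millisecond value from the captured headers above as input:
fields = build_time_fields(1606999477776)
print(fields["X-Khronos"], fields["X-SS-REQ-TICKET"])
```

Integer arithmetic on a millisecond reading avoids the float-to-string conversion and guarantees that the seconds value is the truncated form of the milliseconds value.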

Analyzing the request headers shows that most of the values are constant and can all be read from Fiddler. The only one that cannot be read directly, and that changes between requests, is X-Gorgon. By decompiling the TikTok APK, we found that it is generated from the request URL, the cookies, the token, and so on.
The code for requesting the corresponding X-Gorgon is posted here.
Assuming we already know our own cookies and token and the URL we want to request, we can get the corresponding X-Gorgon:

import json
import time

import requests

# Get the current timestamp:
ts = str(int(time.time()))  # seconds
_rticket = str(int(time.time() * 1000))  # milliseconds
max_time = ts
user_id = "96244072243"
sec_user_id = "MS4wLjABAAAAtk0pVzYt82o_R5jUjN4FEpRlautyPFGSgioxrH-jfvg"

# Fill in your own cookies and token below
cookies = "Your own cookies"
token = "Your own token"

# Construct the request URL
url = (
    "https://api.amemv.com/aweme/v1/user/follower/list/?"
    "user_id={}"
    "&max_time={}"
    "&count=20&offset=0&source_type=1&address_book_access=2&gps_access=2"
    "&ts={}"
    "&js_sdk_version=1.16.3.5&app_type=normal&manifest_version_code=630"
    "&_rticket={}"
    "&ac=wifi&device_id=47012747444&iid=1846815477740845"
    "&os_version=8.0.0&channel=wandoujia_aweme1&version_code=630"
    "&device_type=HUAWEI%20NXT-AL10&language=zh&resolution=1080*1812&openudid=b202a24eb8c1538a"
    "&update_version_code=6302&app_name=aweme&version_name=6.3.0&os_api=26&device_brand=HUAWEI&ssmix=a"
    "&device_platform=android&dpi=480&aid=1128"
    "&sec_user_id={}"
).format(user_id, max_time, ts, _rticket, sec_user_id)


def get_gorgon(url, cookies, token, queries=""):
    # Initiate a request to get X-Gorgon; returns None on failure
    headers = {
        "dou-url": url,  # the API URL to be requested
        "dou-cookies": cookies,  # your cookies
        "dou-token": token,  # your token
        "dou-queries": queries,  # your request queries (leave as "" if there are none)
    }
    res = requests.get("http://8.131.59.252:8080", headers=headers)
    if res.status_code != 200:
        print("request error when get gorgon")
        return None
    res_gorgon = json.loads(res.text)
    if res_gorgon.get("status") != 0:
        print("param error when get gorgon")
        return None
    return res_gorgon.get("X-gorgon")


gorgon = get_gorgon(url, cookies, token)
print("gorgon: " + str(gorgon))  # this is your X-Gorgon
  • Use the X-Gorgon obtained above to request the response data (the user's follower list):
# Construct the request headers:
headers = {
    "Host": "api.amemv.com",
    "Connection": "keep-alive",
    "Cookie": cookies,  # your own cookies
    "Accept-Encoding": "gzip",
    "X-SS-REQ-TICKET": _rticket,  # current timestamp, generated in the code fragment above
    "X-Tt-Token": "0095a45e5cc.....c42c97e37d7350",  # your own token
    "sdk-version": "1",
    "User-Agent": "Your own user-agent",
    "X-Khronos": ts,  # current timestamp, generated in the code fragment above
    "X-Gorgon": gorgon,  # X-Gorgon, generated in the code fragment above
}

# Initiate the request
result = doGetGzip(url, headers)  # a function I wrote myself; see the code snippet below
print(result)
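  • The doGetGzip function:
import gzip
from urllib import request


def doGetGzip(url, headers):
    # Send a GET request with the given headers and return the gunzipped body as text
    req = request.Request(url)
    for key in headers:
        req.add_header(key, headers[key])
    with request.urlopen(req) as f:
        data = f.read()
        return gzip.decompress(data).decode()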
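
One caveat with doGetGzip is that `gzip.decompress` raises an error if the body is not actually gzip-compressed. A minimal sketch of a more tolerant decoder, assuming the server may occasionally answer with an uncompressed body; `decode_body` is a hypothetical helper, and the check relies only on the standard two-byte gzip magic number.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def decode_body(data, encoding="utf-8"):
    """Return the body as text, decompressing only when it is gzip data."""
    if data[:2] == GZIP_MAGIC:
        data = gzip.decompress(data)
    return data.decode(encoding)


sample = '{"has_more": true}'
print(decode_body(gzip.compress(sample.encode())))  # compressed input
print(decode_body(sample.encode()))  # plain input
```

In doGetGzip, the last line could call such a helper on `data` instead of calling `gzip.decompress` directly.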

2. Analyzing the response data to obtain user data

2.1. Response data format :

The response data is in JSON format, so we usually parse it as JSON before processing it. In Fiddler we can see that the response mainly contains the following parts, and that the user information is inside "followers". The other fields are mainly used for paging, because one request returns only 20 records: has_more indicates whether there is more data, and max_time is the cursor for the next page of data. Our main concern is the data in followers.
image.png
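
The paging logic described above can be sketched as a small generator. This is an assumption-laden illustration rather than the author's implementation: `iter_followers` and `fetch_page` are hypothetical names, `fetch_page` stands for whatever function sends the signed request for a given max_time cursor and returns the JSON text, and the fake pages and the "uid" key in the demo are placeholders. Only the top-level field names followers, has_more, and max_time come from the response shown above.

```python
import json


def iter_followers(fetch_page, max_time, max_pages=50):
    """Yield user records page by page until has_more is false."""
    for _ in range(max_pages):
        page = json.loads(fetch_page(max_time))
        yield from page.get("followers", [])
        if not page.get("has_more"):
            break
        max_time = page["max_time"]  # cursor for the next page


# Stand-in responses keyed by cursor, so the example runs offline.
fake_pages = {
    100: json.dumps({"followers": [{"uid": "1"}, {"uid": "2"}],
                     "has_more": True, "max_time": 90}),
    90: json.dumps({"followers": [{"uid": "3"}],
                    "has_more": False, "max_time": 80}),
}

users = list(iter_followers(lambda cursor: fake_pages[cursor], 100))
print([user["uid"] for user in users])
```

The `max_pages` limit is a safety net in case the server keeps reporting has_more indefinitely.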
Now we can see that followers contains 20 records, each holding the information of one user:
image.png
The following are the specific fields for each user :
image.png
There are many user fields, and most of them are of no use to me; I only care about user_id and the corresponding sec_uid. You can of course extract more data depending on your own needs. The picture above shows that both values are available, so my goal is achieved and I can save them. Next, we use these two values to request the same two pieces of information for the users that this user follows. The next article will focus on how to crawl video information.
image.png
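
Pulling the two ids out of each record and saving them can be sketched as follows. The key names are assumptions: sec_uid is mentioned in the text above, but "uid" as the key holding user_id is a guess that should be checked against the actual response in Fiddler, which is why both are parameters. `extract_id_pairs` and the sample records are hypothetical, and the CSV is written to an in-memory buffer only to keep the example self-contained.

```python
import csv
import io


def extract_id_pairs(followers, id_key="uid", sec_key="sec_uid"):
    """Return unique (user_id, sec_uid) pairs, skipping incomplete records."""
    pairs = []
    seen = set()
    for user in followers:
        pair = (user.get(id_key), user.get(sec_key))
        if all(pair) and pair not in seen:
            seen.add(pair)
            pairs.append(pair)
    return pairs


# Made-up records standing in for the "followers" array of a response.
sample_followers = [
    {"uid": "111", "sec_uid": "sec_111", "nickname": "a"},
    {"uid": "111", "sec_uid": "sec_111", "nickname": "a"},  # duplicate
    {"uid": "", "sec_uid": "sec_missing"},  # incomplete
    {"uid": "222", "sec_uid": "sec_222"},
]

pairs = extract_id_pairs(sample_followers)

buffer = io.StringIO()  # replace with open("users.csv", "w", newline="") to write a file
writer = csv.writer(buffer)
writer.writerow(["user_id", "sec_user_id"])
writer.writerows(pairs)
print(buffer.getvalue())
```

Each saved pair can then serve as the user_id and sec_user_id for the next request.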
That is the whole process of getting TikTok user data. Later I will explain how to capture video data. Writing this up took effort, so please like and follow, and leave a comment if you have any questions.

——————————————————————————————————————————

TiToData: a professional short-video and live-streaming data API service platform.
For more information, please contact: TiToData

Mainstream platforms covered: Douyin, Kuaishou, Xiaohongshu, TikTok, YouTube

Copyright notice
This article was written by [TiToData]. Please include a link to the original when reposting. Thank you.
https://chowdera.com/2020/12/202012080953224779.html