Scrape articles from WSJ with requests, curl and BeautifulSoup

时光说笑 2020-12-20 04:45

I'm a paid member of WSJ and I tried to scrape articles for my NLP project. I thought I had kept the session.

rs = requests.session()
login_url="https://sso.a


        
1 Answer
  • 2020-12-20 05:42

    Your attempts failed because the site uses OAuth 2.0, not basic authentication, so posting the username and password to a single URL is not enough.

    The login flow works like this:

    • When the login URL https://accounts.wsj.com/login is called, the server generates two values: connection and client_id.
    • Submitting the username and password calls https://sso.accounts.dowjones.com/usernamepassword/login, which needs the connection and client_id from the previous step plus some static OAuth2 parameters: scope, response_type and redirect_uri.
    • The response to that login call is a form that auto-submits. The form has three parameters: wa, wresult and wctx (wresult is a JWT). It posts to https://sso.accounts.dowjones.com/login/callback, which returns a URL with a code parameter such as code=AjKK8g0pZZfvYpju.
    • Calling https://accounts.wsj.com/auth/sso/login?code=AjKK8g0pZZfvYpju then returns the cookies for a valid user session.
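
Since the question is about Python, here is a minimal sketch of how the first and third steps could read values out of a redirect URL using only the standard library. The helper name `query_params` and the sample URL are assumptions for illustration; the real redirect target and parameter order may differ.

```python
from urllib.parse import urlsplit, parse_qs


def query_params(location: str) -> dict:
    """Return the query parameters of a URL as a flat dict (first value wins)."""
    parsed = parse_qs(urlsplit(location).query)
    return {key: values[0] for key, values in parsed.items()}


# Hypothetical Location header value; the host, path and values are placeholders.
sample_location = (
    "https://sso.example.com/login"
    "?client_id=EXAMPLE_CLIENT_ID&connection=EXAMPLE_CONNECTION&response_type=code"
)

params = query_params(sample_location)
print(params["connection"], params["client_id"])
```

The same helper would apply to the callback redirect, where the parameter of interest is `code`.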

    Here is a bash script that does this using curl, grep, pup and jq:

    username="user@gmail.com"
    password="YourPassword"
    
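    # fetch only the response headers; the redirect's Location header carries connection and client_id
    # (-i makes the match case-insensitive, since some curl/HTTP versions print the header as "location:")
    login_url=$(curl -s -I "https://accounts.wsj.com/login")
    connection=$(echo "$login_url" | grep -oiP "Location:\s+.*connection=\K(\w+)")
    client_id=$(echo "$login_url" | grep -oiP "Location:\s+.*client_id=\K(\w+)")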
    
    #connection=$(echo "$login_url" | gawk 'match($0, /Location:\s+.*connection=(\w+)&/, data) {print data[1]}')
    #client_id=$(echo "$login_url" | gawk 'match($0, /Location:\s+.*client_id=(\w+)&/, data) {print data[1]}')
    
    rm -f cookies.txt
    
    # submit the credentials; pup and jq extract the form's input values (wa, wresult, wctx)
    IFS='|' read -r wa wresult wctx < <(curl -s 'https://sso.accounts.dowjones.com/usernamepassword/login' \
          --data-urlencode "username=$username" \
          --data-urlencode "password=$password" \
          --data-urlencode "connection=$connection" \
          --data-urlencode "client_id=$client_id" \
          --data 'scope=openid+idp_id&tenant=sso&response_type=code&protocol=oauth2&redirect_uri=https%3A%2F%2Faccounts.wsj.com%2Fauth%2Fsso%2Flogin' | pup 'input json{}' | jq -r 'map(.value) | join("|")')
    
    # decode the HTML entity &#34; back into a double quote
    wctx=$(echo "$wctx" | sed 's/&#34;/"/g')
    
    code_url=$(curl -D - -s -c cookies.txt 'https://sso.accounts.dowjones.com/login/callback' \
         --data-urlencode "wa=$wa" \
         --data-urlencode "wresult=$wresult" \
         --data-urlencode "wctx=$wctx" | grep -oP "Location:\s+\K(\S*)")
    
    # call the URL that carries the code parameter; the session cookies are written to cookies.txt
    curl -s -c cookies.txt "$code_url"
    
    # now request any article URL, loading the cookies from cookies.txt
    curl -s -b cookies.txt "https://www.wsj.com/articles/singapore-prime-minister-lee-rejects-claims-he-misused-state-powers-in-family-feud-1499094761?tesla=y"
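
For a Python port, the `pup`/`jq`/`sed` step in the script could be approximated with the standard library's `html.parser`, as in the sketch below. The class and function names and the sample form are assumptions for illustration, not the actual page markup. Note that `HTMLParser` already decodes character references such as `&#34;` inside attribute values, so no separate replacement step is needed.

```python
from html.parser import HTMLParser


class InputCollector(HTMLParser):
    """Collect the name/value pairs of every <input> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:
            self.fields[name] = attributes.get("value") or ""


def form_fields(html_text: str) -> dict:
    """Return {input name: input value} for the given HTML text."""
    collector = InputCollector()
    collector.feed(html_text)
    collector.close()
    return collector.fields


# Hypothetical auto-submit form; all values are placeholders.
sample_form = """
<form method="post" action="https://sso.example.com/login/callback">
  <input type="hidden" name="wa" value="EXAMPLE_WA">
  <input type="hidden" name="wresult" value="EXAMPLE.JWT.TOKEN">
  <input type="hidden" name="wctx" value="{&#34;key&#34;:&#34;value&#34;}">
</form>
"""

fields = form_fields(sample_form)
print(fields["wctx"])
```

The resulting dict could then be posted as form data to the callback URL, for example through the `data` argument of a `requests` session's `post` method.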
    