I am learning Python and I am currently scraping Reddit. Somehow Reddit has figured out that I am a bot (which my software actually is), but how do they know that? And how we
There's a large array of techniques that online services use to detect and combat bots and scrapers. At the core of all of them is building heuristics and statistical models that can identify non-human-like behavior. Things such as:
The total number of requests from a given IP within a specific time frame: for example, anything more than 50 requests per second, 500 per minute, or 5,000 per day may look suspicious or even malicious. Counting requests per IP per unit of time is a very common, and arguably effective, technique.
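A minimal sketch of that idea, assuming a server-side hook that sees each request's source IP. The window size, the threshold, and the function name are all made up for illustration:

```python
import time
from collections import defaultdict, deque

# Illustrative thresholds only; a real service tunes these per endpoint.
WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 500

_recent_requests = defaultdict(deque)  # ip -> timestamps of recent requests

def record_and_check(ip, now=None):
    """Record one request from `ip`; return True if it exceeds the window limit."""
    now = time.time() if now is None else now
    window = _recent_requests[ip]
    window.append(now)
    # Evict timestamps that have fallen out of the sliding window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    return len(window) > MAX_REQUESTS_PER_WINDOW
```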
Regularity of the incoming request rate: for example, a perfectly sustained flow of 10 requests per second looks like a robot programmed to make a request, wait a fixed interval, make the next request, and so on.
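One way to capture that regularity, as a rough sketch: look at how uniform the gaps between a client's requests are. Human traffic tends to be bursty, while a simple loop produces near-constant intervals. The 0.1 cutoff below is an arbitrary illustration, not a recommended value:

```python
import statistics

def looks_machine_regular(timestamps, cv_threshold=0.1):
    """Return True if inter-request intervals are suspiciously uniform.

    Uses the coefficient of variation (stdev / mean) of the gaps between
    consecutive requests; a very low value means an almost constant rate.
    """
    if len(timestamps) < 10:  # not enough data to judge
        return False
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mean_gap = statistics.mean(gaps)
    if mean_gap == 0:
        return True
    return statistics.stdev(gaps) / mean_gap < cv_threshold
```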
HTTP headers. Browsers send a predictable User-Agent header with each request that helps the server identify the browser's vendor, version, and other information. In combination with other headers, a server might be able to figure out that requests are coming from an unknown or otherwise exploitative source.
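For example, Python's requests library sends a User-Agent like `python-requests/2.x` by default, which is trivially flagged. A rough server-side sketch of such a check; the patterns and the missing-header heuristic are illustrative, not exhaustive:

```python
# Fragments commonly seen in automated clients (illustrative list only).
SUSPICIOUS_UA_FRAGMENTS = ("python-requests", "curl/", "scrapy", "httpclient")

def headers_look_automated(headers):
    ua = headers.get("User-Agent", "").lower()
    if not ua:
        return True  # real browsers always send a User-Agent
    if any(fragment in ua for fragment in SUSPICIOUS_UA_FRAGMENTS):
        return True
    # Browsers also send Accept-Language; many naive scrapers do not.
    if "Accept-Language" not in headers:
        return True
    return False
```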
A stateful combination of authentication tokens, cookies, encryption keys, and other ephemeral pieces of information that require subsequent requests to be formed and submitted in a special manner. For example, the server may send down a certain key (via cookies, headers, the response body, etc.) and expect your browser to include or otherwise use that key in the subsequent requests it makes. If too many requests fail to satisfy that condition, it's a telltale sign they might be coming from a bot.
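A toy illustration of that handshake, assuming a server that issues a token with the initial page and expects it echoed back on the follow-up API call. The cookie and header names here are invented for the example:

```python
import secrets

# Tokens handed out with the initial page load (in-memory for illustration).
_issued_tokens = set()

def issue_token():
    """Called when serving the initial page; the token goes out in a cookie."""
    token = secrets.token_urlsafe(16)
    _issued_tokens.add(token)
    return token

def follow_up_request_is_valid(cookies, headers):
    """The follow-up API request must echo the cookie token in a custom header."""
    token = cookies.get("session_token")
    return token in _issued_tokens and headers.get("X-Session-Token") == token
```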
Mouse and keyboard tracking techniques: if the server knows that a certain API can only be called when the user clicks a certain button, it can ship front-end code that ensures the proper mouse activity was detected (i.e., the user actually clicked the button) before the API request is made.
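A hypothetical server-side complement to that front-end tracking: the page's click handler posts a small "interaction" beacon, and the server only honors API calls that arrive shortly after such a beacon. Every name and number below is invented for the sketch:

```python
import time

_last_interaction = {}  # session_id -> timestamp of the last click beacon

MAX_BEACON_AGE = 5.0  # seconds allowed between the click and the API call

def record_click_beacon(session_id):
    """Called when the front-end reports that the button was actually clicked."""
    _last_interaction[session_id] = time.time()

def api_call_allowed(session_id):
    """Reject API calls that were not preceded by a recent click beacon."""
    last = _last_interaction.get(session_id)
    return last is not None and time.time() - last <= MAX_BEACON_AGE
```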
And many, many more techniques. Imagine you are the person trying to detect and block bot activity. What approaches would you take to ensure that requests are coming from human users? How would you define human behavior as opposed to bot behavior, and what metrics could you use to discern the two?
There's also a question of practicality: some approaches are more costly and difficult to implement than others. The question then becomes: to what extent (how reliably) do you need to detect and block bot activity? Are you combating bots trying to hack into user accounts? Or do you simply need to prevent them (perhaps on a best-effort basis) from scraping data from otherwise publicly visible web pages? What would you do about false-negative and false-positive detections? These questions inform the complexity and ingenuity of the approach you might take to identify and block bot activity.