Bot Detection
Implementations used for bot detection.
- searx.botdetection.get_network(real_ip: IPv4Address | IPv6Address, cfg: Config) -> IPv4Network | IPv6Network
Returns the (client) network that the real_ip is part of.
- searx.botdetection.get_real_ip(request: SXNG_Request) -> str
Returns the real IP of the request. Since not all proxies set all the HTTP headers, and incoming headers can be faked, it may happen that the IP cannot be determined correctly.
This function tries to get the remote IP in the order listed below. In addition, some tests are done, and if inconsistencies or errors are detected, they are logged.
The remote IP of the request is taken from (first match):
- searx.botdetection.too_many_requests(network: IPv4Network | IPv6Network, log_msg: str) -> Response | None
Returns an HTTP 429 response object and writes an ERROR message to the ‘botdetection’ logger. This function is used in part by the filter methods to return the default Too Many Requests response.
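As a rough illustration of what get_network describes, the sketch below maps an address to its enclosing network with the standard ipaddress module. The function name client_network and the prefix lengths (32 for IPv4, 48 for IPv6) are assumptions made for this example; in SearXNG the values would presumably be read from cfg.

```python
from ipaddress import ip_address, ip_network


def client_network(real_ip, ipv4_prefix=32, ipv6_prefix=48):
    """Return the network that contains real_ip.

    The prefix lengths are hypothetical defaults, not values taken
    from the SearXNG configuration.
    """
    prefix = ipv4_prefix if real_ip.version == 4 else ipv6_prefix
    # strict=False masks the host bits instead of raising ValueError
    return ip_network(f"{real_ip}/{prefix}", strict=False)


if __name__ == "__main__":
    print(client_network(ip_address("192.0.2.7"), ipv4_prefix=24))
    print(client_network(ip_address("2001:db8:1:2::1")))
```

Grouping clients by network rather than by single address means that all addresses inside the same prefix share one counter.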
IP lists
Method ip_lists
The ip_lists method implements IP block- and
pass-lists.
[botdetection.ip_lists]
pass_ip = [
'167.235.158.251', # IPv4 of check.searx.space
'192.168.0.0/16', # IPv4 private network
'fe80::/10' # IPv6 linklocal
]
block_ip = [
'93.184.216.34', # IPv4 of example.org
'257.1.1.1', # invalid IP --> will be ignored, logged in ERROR class
]
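The lookup behind such a list can be sketched with the standard ipaddress module. This is only an illustration under assumptions: the helper name ip_is_listed and the logger name are hypothetical, and the real pass_ip and block_ip functions take a Config object rather than a plain list.

```python
import logging
from ipaddress import ip_address, ip_network

logger = logging.getLogger("botdetection.sketch")  # hypothetical logger name


def ip_is_listed(real_ip, entries):
    """Return (matched, message) for real_ip against a list of IPs/networks."""
    for entry in entries:
        try:
            # a single address becomes a /32 (IPv4) or /128 (IPv6) network
            net = ip_network(entry, strict=False)
        except ValueError:
            # e.g. '257.1.1.1': skip the entry and log it as an error
            logger.error("invalid IP %r in list, entry ignored", entry)
            continue
        if real_ip.version == net.version and real_ip in net:
            return True, f"IP matches {net.compressed}"
    return False, "IP is not a member of the list"


if __name__ == "__main__":
    pass_list = ["167.235.158.251", "192.168.0.0/16", "fe80::/10"]
    print(ip_is_listed(ip_address("192.168.1.20"), pass_list))
```

The tuple return mirrors the Tuple[bool, str] signature documented for pass_ip and block_ip; the message text itself is made up for this sketch.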
- searx.botdetection.ip_lists.SEARXNG_ORG = ['167.235.158.251', '2a01:04f8:1c1c:8fc2::/64']
Passlist of IPs from the SearXNG organization, e.g. check.searx.space.
- searx.botdetection.ip_lists.pass_ip(real_ip: IPv4Address | IPv6Address, cfg: Config) -> Tuple[bool, str]
Checks whether the IP is in one of the members of the
botdetection.ip_lists.pass_ip list.
- searx.botdetection.ip_lists.block_ip(real_ip: IPv4Address | IPv6Address, cfg: Config) -> Tuple[bool, str]
Checks whether the IP is in one of the members of the
botdetection.ip_lists.block_ip list.
Rate limit
Method ip_limit
The ip_limit method counts requests from an IP in sliding windows. If
there are too many requests in a sliding window, the request is evaluated as a
bot request. This method requires a redis DB and needs an HTTP X-Forwarded-For
header. To protect privacy, only the hash value of an IP is stored in the redis
DB, and for a maximum of 10 minutes.
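The counting idea can be sketched without redis. The class below is an in-memory stand-in, not the SearXNG implementation: the class name, the injectable clock and the plain SHA-256 hash are all assumptions made for this example (a real deployment would more likely use a keyed hash and let the DB expire the entries).

```python
import hashlib
import time
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Count requests per (hashed) IP inside a sliding time window."""

    def __init__(self, window, clock=time.monotonic):
        self.window = window          # window length in seconds
        self.clock = clock            # injectable clock, eases testing
        self._hits = defaultdict(deque)

    @staticmethod
    def _key(ip):
        # only the hash of the IP is kept, never the IP itself
        return hashlib.sha256(ip.encode("utf-8")).hexdigest()

    def incr(self, ip):
        """Record one request and return the count inside the window."""
        now = self.clock()
        hits = self._hits[self._key(ip)]
        hits.append(now)
        # drop timestamps that have left the window
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        return len(hits)


if __name__ == "__main__":
    counter = SlidingWindowCounter(window=20)
    for _ in range(3):
        print(counter.incr("192.0.2.1"))
```

A request would be rated as a bot request as soon as the returned count exceeds the maximum configured for that window.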
The link_token method can be used to investigate whether a request is
suspicious. To activate the link_token method in the
ip_limit method add the following configuration:
[botdetection.ip_limit]
link_token = true
If the link_token method is activated and a request is suspicious,
the request rates are reduced to BURST_MAX_SUSPICIOUS and LONG_MAX_SUSPICIOUS.
To intercept bots that get their IPs from a range of IPs, there is a
SUSPICIOUS_IP_WINDOW. In this window the suspicious IPs are stored
for a longer time. IPs stored in this sliding window have a maximum of
SUSPICIOUS_IP_MAX accesses before they are blocked. As soon as the IP
makes a request that is not suspicious, the sliding window for this IP is
dropped.
- searx.botdetection.ip_limit.BURST_WINDOW = 20
Time (sec) before sliding window for burst requests expires.
- searx.botdetection.ip_limit.BURST_MAX = 15
Maximum requests from one IP in the BURST_WINDOW.
- searx.botdetection.ip_limit.BURST_MAX_SUSPICIOUS = 2
Maximum of suspicious requests from one IP in the BURST_WINDOW.
- searx.botdetection.ip_limit.LONG_WINDOW = 600
Time (sec) before the longer sliding window expires.
- searx.botdetection.ip_limit.LONG_MAX = 150
Maximum requests from one IP in the LONG_WINDOW.
- searx.botdetection.ip_limit.LONG_MAX_SUSPICIOUS = 10
Maximum suspicious requests from one IP in the LONG_WINDOW.
- searx.botdetection.ip_limit.API_WINDOW = 3600
Time (sec) before sliding window for API requests (format != html) expires.
- searx.botdetection.ip_limit.API_MAX = 4
Maximum requests from one IP in the API_WINDOW.
- searx.botdetection.ip_limit.SUSPICIOUS_IP_WINDOW = 2592000
Time (sec) before sliding window for one suspicious IP expires.
- searx.botdetection.ip_limit.SUSPICIOUS_IP_MAX = 3
Maximum requests from one suspicious IP in the SUSPICIOUS_IP_WINDOW.
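How the limits listed above might be combined can be sketched as follows. The constant values are copied from this page; the function exceeds_limits and its strict "greater than" comparison are assumptions for illustration, and the API and SUSPICIOUS_IP windows are left out to keep the example short.

```python
BURST_MAX = 15
BURST_MAX_SUSPICIOUS = 2
LONG_MAX = 150
LONG_MAX_SUSPICIOUS = 10


def exceeds_limits(burst_count, long_count, suspicious):
    """Return True if either window counter is above its maximum.

    burst_count and long_count are the request counts already measured
    in the BURST_WINDOW and LONG_WINDOW; suspicious selects the reduced
    limits that apply when link_token rates the request as suspicious.
    """
    burst_max = BURST_MAX_SUSPICIOUS if suspicious else BURST_MAX
    long_max = LONG_MAX_SUSPICIOUS if suspicious else LONG_MAX
    return burst_count > burst_max or long_count > long_max


if __name__ == "__main__":
    print(exceeds_limits(burst_count=3, long_count=3, suspicious=False))
    print(exceeds_limits(burst_count=3, long_count=3, suspicious=True))
```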
Method link_token
The link_token method evaluates a request as suspicious if the URL /client<token>.css is not requested by the
client. By adding a random component (the token) to the URL, a bot cannot send
a ping by requesting a static URL.
Note
This method requires a redis DB and needs a HTTP X-Forwarded-For header.
To use this method, a flask URL route needs to be added:
@app.route('/client<token>.css', methods=['GET', 'POST'])
def client_token(token=None):
    link_token.ping(request, token)
    return Response('', mimetype='text/css')
And in the HTML template from flask a stylesheet link is needed (the value of
link_token comes from get_token):
<link rel="stylesheet"
href="{{ url_for('client_token', token=link_token) }}"
type="text/css" >
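The interplay of token, ping and suspicion can be sketched in memory. This is a simplified stand-in under assumptions, not the redis-backed implementation: the class LinkTokenSketch, the injectable clock and the plain string client_key are hypothetical; only the two lifetimes are taken from the constants documented on this page.

```python
import secrets
import time

TOKEN_LIVE_TIME = 600    # lifetime (sec) of the CSS token
PING_LIVE_TIME = 3600    # lifetime (sec) of a client's ping


class LinkTokenSketch:
    """In-memory stand-in for the token and ping storage."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self._token = None
        self._token_expires = 0.0
        self._pings = {}   # client key -> expiry time of the ping

    def get_token(self):
        """Return the current token, creating a new one after expiry."""
        now = self.clock()
        if self._token is None or now >= self._token_expires:
            self._token = secrets.token_urlsafe(16)
            self._token_expires = now + TOKEN_LIVE_TIME
        return self._token

    def ping(self, client_key, token):
        """Store a ping for the client if the token is the current one."""
        if token == self.get_token():
            self._pings[client_key] = self.clock() + PING_LIVE_TIME

    def is_suspicious(self, client_key):
        """A client without a valid (unexpired) ping is suspicious."""
        expires = self._pings.get(client_key)
        return expires is None or self.clock() >= expires


if __name__ == "__main__":
    store = LinkTokenSketch()
    print(store.is_suspicious("client-a"))
    store.ping("client-a", store.get_token())
    print(store.is_suspicious("client-a"))
```

A browser that renders the page fetches the stylesheet and thereby sends the ping; a client that never loads the CSS stays suspicious.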
- searx.botdetection.link_token.TOKEN_LIVE_TIME = 600
Lifetime (sec) of limiter’s CSS token.
- searx.botdetection.link_token.PING_LIVE_TIME = 3600
Lifetime (sec) of the ping-key from a client (request).
- searx.botdetection.link_token.PING_KEY = 'SearXNG_limiter.ping'
Prefix of all ping-keys generated by get_ping_key.
- searx.botdetection.link_token.TOKEN_KEY = 'SearXNG_limiter.token'
Key under which the current token is stored in the DB.
- searx.botdetection.link_token.is_suspicious(network: IPv4Network | IPv6Network, request: SXNG_Request, renew: bool = False)
Checks whether a valid ping exists for this (client) network; if not, this request is rated as suspicious. If a valid ping exists and the argument
renew is True, the expire time of this ping is reset to PING_LIVE_TIME.
- searx.botdetection.link_token.ping(request: SXNG_Request, token: str)
This function is called by a request to URL
/client<token>.css. If token is valid, a PING_KEY for the client is stored in the DB. The expire time of this ping-key is PING_LIVE_TIME.
- searx.botdetection.link_token.get_ping_key(network: IPv4Network | IPv6Network, request: SXNG_Request) -> str
Generates a hashed key that fits (more or less) to a WEB-browser session in a network.
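One way such a key could be built is sketched below. Which request properties go into the fingerprint is not stated on this page, so the choice of the Accept-Language and User-Agent headers, the HMAC-SHA256 hash, the secret parameter and the function name ping_key are all assumptions made for this example.

```python
import hashlib
import hmac
from ipaddress import ip_network

PING_KEY = "SearXNG_limiter.ping"


def ping_key(network, headers, secret):
    """Build a hashed key from the client network and a few headers.

    Clients in the same network that send the same headers share a
    key, which approximates (more or less) one browser session.
    """
    fingerprint = "|".join(
        [
            str(network),
            headers.get("Accept-Language", ""),
            headers.get("User-Agent", ""),
        ]
    )
    digest = hmac.new(secret, fingerprint.encode("utf-8"), hashlib.sha256)
    return f"{PING_KEY}[{digest.hexdigest()}]"


if __name__ == "__main__":
    net = ip_network("192.0.2.0/24")
    hdrs = {"Accept-Language": "en-US", "User-Agent": "Mozilla/5.0"}
    print(ping_key(net, hdrs, secret=b"example-secret"))
```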
Probe HTTP headers
Method http_accept
The http_accept method evaluates a request as the request of a bot if the
Accept header does not contain text/html.
Method http_accept_encoding
The http_accept_encoding method evaluates a request as the request of a
bot if the Accept-Encoding header:
- does not contain gzip AND deflate (that is, both values are missing)
- does not contain text/html
Method http_accept_language
The http_accept_language method evaluates a request as the request of a bot
if the Accept-Language header is unset.
Method http_connection
The http_connection method evaluates a request as the request of a bot if
the Connection header is set to close.
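The four header probes described so far can be sketched as one function over a plain dict of headers. This is an illustration, not the SearXNG filter code: the function name probe_headers, the reason strings, the stripping of q-values and the case-insensitive comparison of the Connection value are assumptions of this example.

```python
def probe_headers(headers):
    """Return a reason string if the headers look bot-like, else None."""
    # http_accept: a browser asks for text/html
    if "text/html" not in headers.get("Accept", ""):
        return "Accept did not contain text/html"

    # http_accept_encoding: neither gzip nor deflate is offered
    encodings = [
        item.split(";")[0].strip()   # drop q-values such as ';q=0.8'
        for item in headers.get("Accept-Encoding", "").split(",")
    ]
    if "gzip" not in encodings and "deflate" not in encodings:
        return "Accept-Encoding did not contain gzip nor deflate"

    # http_accept_language: header is unset (or empty)
    if headers.get("Accept-Language", "").strip() == "":
        return "Accept-Language is unset"

    # http_connection: header is set to close
    if headers.get("Connection", "").strip().lower() == "close":
        return "Connection is set to close"

    return None


if __name__ == "__main__":
    print(probe_headers({"Accept": "*/*"}))
```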
Method http_user_agent
The http_user_agent method evaluates a request as the request of a bot if
the User-Agent header is unset or matches the regular expression
USER_AGENT.
- searx.botdetection.http_user_agent.USER_AGENT = '(unknown|[Cc][Uu][Rr][Ll]|[wW]get|Scrapy|splash|JavaFX|FeedFetcher|python-requests|Go-http-client|Java|Jakarta|okhttp|HttpClient|Jersey|Python|libwww-perl|Ruby|SynHttpClient|UniversalFeedParser|Googlebot|GoogleImageProxy|bingbot|Baiduspider|yacybot|YandexMobileBot|YandexBot|Yahoo! Slurp|MJ12bot|AhrefsBot|archive.org_bot|msnbot|MJ12bot|SeznamBot|linkdexbot|Netvibes|SMTBot|zgrab|James BOT|Sogou|Abonti|Pixray|Spinn3r|SemrushBot|Exabot|ZmEu|BLEXBot|bitlybot|HeadlessChrome|Mozilla/5\\.0\\ \\(compatible;\\ Farside/0\\.1\\.0;\\ \\+https://farside\\.link\\)|.*PetalBot.*)'
Regular expression that matches the User-Agent of known bots.
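A minimal sketch of the User-Agent check, using a shortened excerpt of the expression above. Whether the real filter anchors the expression at the start of the header or searches the whole value is not stated here; this sketch assumes re.match (anchored at the start), and the names USER_AGENT_EXCERPT and is_bot_user_agent are hypothetical.

```python
import re

# shortened excerpt of USER_AGENT, for illustration only
USER_AGENT_EXCERPT = r"(unknown|[Cc][Uu][Rr][Ll]|[wW]get|python-requests|Googlebot|.*PetalBot.*)"

_bot_ua = re.compile(USER_AGENT_EXCERPT)


def is_bot_user_agent(user_agent):
    """Return True if the User-Agent is unset or matches a known bot."""
    if not user_agent:
        return True
    # re.match only matches at the beginning of the string
    return _bot_ua.match(user_agent) is not None


if __name__ == "__main__":
    print(is_bot_user_agent("curl/8.5.0"))
    print(is_bot_user_agent("Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0"))
```

With an anchored match, an alternative such as .*PetalBot.* needs its leading wildcard to catch the name in the middle of a header value.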
Config
Configuration class Config with deep-update, schema validation
and deprecated names.
The Config class implements a configuration that is based on
structured dictionaries. The configuration schema is defined in a dictionary
structure and the configuration data is given in a dictionary structure.
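What a deep update of structured dictionaries usually means can be sketched as follows. The function deep_update is a generic illustration written for this page, not the method of the Config class, and the example keys are only sample data.

```python
def deep_update(base, upd):
    """Recursively merge upd into base and return base.

    Nested dictionaries are merged key by key; any other value in upd
    replaces the value in base.
    """
    for key, val in upd.items():
        if isinstance(val, dict) and isinstance(base.get(key), dict):
            deep_update(base[key], val)
        else:
            base[key] = val
    return base


if __name__ == "__main__":
    defaults = {"botdetection": {"ip_limit": {"link_token": False}}}
    user_cfg = {"botdetection": {"ip_limit": {"link_token": True}}}
    print(deep_update(defaults, user_cfg))
```

Unlike dict.update, sibling keys in nested sections survive the merge, so a user configuration only needs to name the values it changes.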
- class searx.botdetection.config.Config(cfg_schema: Dict, deprecated: Dict[str, str])
Base class used for configuration.
- validate(cfg: dict)
Validation of dictionary
cfg on Config.SCHEMA. Validation is done by validate.
- get(name: str, default: Any = <UNSET>, replace: bool = True) -> Any
Returns the value to which
name points in the configuration. If there is no such
name in the config and the default is UNSET, a KeyError is raised.
- set(name: str, val)
Set the value to which
name points in the configuration. If there is no such
name in the config, a KeyError is raised.
- path(name: str, default=<UNSET>)
Get a
pathlib.Path object from a config string.
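The get and set semantics described above can be sketched with a small class. This is not the Config implementation: the class ConfigSketch, the UNSET sentinel object and the dotted-name lookup (for example botdetection.ip_limit.link_token) are assumptions made for this example.

```python
UNSET = object()   # sentinel: distinguishes "no default" from default=None


class ConfigSketch:
    """Dictionary-based configuration addressed by dotted names."""

    def __init__(self, cfg):
        self.cfg = cfg

    def _parent(self, name):
        """Return (parent dict, leaf key) for a dotted name."""
        *parents, leaf = name.split(".")
        node = self.cfg
        for part in parents:
            node = node[part]
        if not isinstance(node, dict) or leaf not in node:
            raise KeyError(name)
        return node, leaf

    def get(self, name, default=UNSET):
        try:
            node, leaf = self._parent(name)
        except (KeyError, TypeError):
            if default is UNSET:
                raise KeyError(name) from None
            return default
        return node[leaf]

    def set(self, name, val):
        try:
            node, leaf = self._parent(name)
        except (KeyError, TypeError):
            raise KeyError(name) from None
        node[leaf] = val


if __name__ == "__main__":
    cfg = ConfigSketch({"botdetection": {"ip_limit": {"link_token": False}}})
    cfg.set("botdetection.ip_limit.link_token", True)
    print(cfg.get("botdetection.ip_limit.link_token"))
```

Raising KeyError from set for unknown names keeps typos in option names from silently creating new settings.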