traefik-crawler-user-agents

This repository contains a list of HTTP user-agents used by robots, crawlers, and spiders, in a single JSON file.

Adapted for the Yaegi interpreter, so it can be used in Traefik plugins.

Forked from the Go package: https://pkg.go.dev/github.com/monperrus/crawler-user-agents
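Because Traefik plugins are plain Go interpreted by Yaegi, the package can be called directly from middleware code. Below is a minimal sketch of such a plugin; it assumes the standard Traefik plugin entry points (CreateConfig and New), while the plugin package name and the Config field are hypothetical. IsCrawler is documented in the Go section below.

// Hypothetical Traefik middleware plugin that rejects known crawlers.
package traefik_bot_blocker

import (
  "context"
  "net/http"

  agents "github.com/stape-io/traefik-crawler-user-agents"
)

// Config holds the plugin configuration (field name is illustrative).
type Config struct {
  BlockCrawlers bool `json:"blockCrawlers,omitempty"`
}

// CreateConfig returns the default configuration, as required by Traefik.
func CreateConfig() *Config {
  return &Config{BlockCrawlers: true}
}

type botBlocker struct {
  next   http.Handler
  config *Config
}

// New constructs the middleware; this is the standard Traefik plugin signature.
func New(ctx context.Context, next http.Handler, config *Config, name string) (http.Handler, error) {
  return &botBlocker{next: next, config: config}, nil
}

// ServeHTTP returns 403 for any request whose User-Agent matches a known
// crawler pattern, and passes everything else through.
func (b *botBlocker) ServeHTTP(rw http.ResponseWriter, req *http.Request) {
  if b.config.BlockCrawlers && agents.IsCrawler(req.UserAgent()) {
    rw.WriteHeader(http.StatusForbidden)
    return
  }
  b.next.ServeHTTP(rw, req)
}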

Each pattern is a regular expression. It should work out-of-the-box with your favorite regex library.

If you use this project in a commercial product, please sponsor it.

Install

Direct download

Download the crawler-user-agents.json file from this repository directly.
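If you work from the raw JSON file, each entry can be decoded and matched with a regex library directly. The following is a minimal Go sketch, assuming the entry fields shown in the Contributing example below; it uses only the standard encoding/json and regexp packages.

package main

import (
  "encoding/json"
  "fmt"
  "os"
  "regexp"
)

// Crawler mirrors one entry of crawler-user-agents.json
// (field names as in the Contributing example below).
type Crawler struct {
  Pattern      string   `json:"pattern"`
  AdditionDate string   `json:"addition_date"`
  URL          string   `json:"url"`
  Instances    []string `json:"instances"`
}

func main() {
  data, err := os.ReadFile("crawler-user-agents.json")
  if err != nil {
    panic(err)
  }

  var crawlers []Crawler
  if err := json.Unmarshal(data, &crawlers); err != nil {
    panic(err)
  }

  ua := "Mozilla/5.0 (compatible; Discordbot/2.0; +https://discordapp.com)"
  for _, c := range crawlers {
    // Each pattern is a regular expression.
    if regexp.MustCompile(c.Pattern).MatchString(ua) {
      fmt.Println("matched:", c.Pattern, "->", c.URL)
    }
  }
}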

Go

Use this package; it provides the global variable Crawlers (synchronized with crawler-user-agents.json) and the functions IsCrawler and MatchingCrawlers.

Example Go program:

package main

import (
  "fmt"

  agents "github.com/stape-io/traefik-crawler-user-agents"
)

func main() {
  userAgent := "Mozilla/5.0 (compatible; Discordbot/2.0; +https://discordapp.com)"

  // IsCrawler reports whether the user-agent matches any known crawler pattern.
  isCrawler := agents.IsCrawler(userAgent)
  fmt.Println("isCrawler:", isCrawler)

  // MatchingCrawlers returns the indices of all matching entries in Crawlers.
  indices := agents.MatchingCrawlers(userAgent)
  fmt.Println("crawlers' indices:", indices)
  fmt.Println("crawler's URL:", agents.Crawlers[indices[0]].URL)
}

Output:

isCrawler: true
crawlers' indices: [237]
crawler's URL: https://discordapp.com

Contributing

We welcome additions contributed as pull requests.

The pull requests should:

  • contain a single addition
  • specify a discriminating, relevant syntactic fragment (for example "totobot", not the full "Mozilla/5 totobot v20131212.alpha1")
  • contain the pattern (a generic regular expression), the discovery date (year/month/day), and the official URL of the robot
  • result in a valid JSON file (don't forget the comma between items; a validation sketch follows the example below)

Example:

{
  "pattern": "rogerbot",
  "addition_date": "2014/02/28",
  "url": "http://moz.com/help/pro/what-is-rogerbot-",
  "instances" : ["rogerbot/2.3 example UA"]
}
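To check that an edit keeps the file well-formed, a quick Go sketch using the standard library's json.Valid:

package main

import (
  "encoding/json"
  "fmt"
  "os"
)

func main() {
  data, err := os.ReadFile("crawler-user-agents.json")
  if err != nil {
    panic(err)
  }
  // json.Valid reports whether data is syntactically valid JSON.
  if !json.Valid(data) {
    fmt.Println("crawler-user-agents.json is not valid JSON")
    os.Exit(1)
  }
  fmt.Println("crawler-user-agents.json is valid JSON")
}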

License

The list is under an MIT License. Versions prior to Nov 7, 2016 were under a CC-SA license.

Related work

There are a few wrapper libraries that use this data to detect bots:

Other systems for spotting robots, crawlers, and spiders that you may want to consider are:
