BIOSCHEMAS.ORG GO CRAWL IT!

Crawls a given website and extracts bioschemas.org/schema.org JSON-LD and Microdata from its pages. The extracted information is stored in a JSON file and can optionally be pushed to a local Elasticsearch service.
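For reference, here is a minimal, hypothetical example of the kind of schema.org JSON-LD block the crawler looks for inside a page (the values are made up; only the script-tag format is the standard embedding):

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Dataset",
  "name": "Example biosample dataset",
  "url": "https://example.org/dataset/1"
}
</script>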

How to use it

Usage examples:

./bioschemas-gocrawlit_mac_64 -p -u "https://www.ebi.ac.uk/biosamples/samples"
./bioschemas-gocrawlit_mac_64 -q -u https://tess.elixir-europe.org/sitemaps/events.xml
./bioschemas-gocrawlit_mac_64 -u http://159.149.160.88/pscan_chip_dev/

A folder named bioschemas_gocrawlit_cache will be created in the directory the crawler is run from. It holds cached copies of crawled pages so the same page is not downloaded more than once. It is safe to delete this folder.
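For example, to force a full re-crawl, clear the cache between runs:

rm -rf bioschemas_gocrawlit_cache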

Output

Scraped data will be stored in a JSON file named <website_host>_schema.json in the directory the program runs from; for example, crawling https://www.ebi.ac.uk/biosamples/samples produces www.ebi.ac.uk_schema.json.

Available commands

  • -p: Stay on the current path, i.e. when crawling a page like https://www.ebi.ac.uk/biosamples/samples, don't let the crawler wander off to the rest of the site, e.g. https://www.ebi.ac.uk. (Flags can be combined; see the example after this list.)
  • -m: Maximum recursion depth for visited URLs; defaults to unlimited. (The crawler never revisits a URL.)
  • -e: Adds crawled data to an Elasticsearch (v6) service at http://127.0.0.1:9200.
  • -u: URL of the page to start crawling from.
  • -q: Removes the query string from discovered link URLs.
  • --page: Use together with -q so the crawler follows only links containing the query word provided, e.g. ./bioschemas-gocrawlit_mac_64 -u https://tess.elixir-europe.org/events -q --page page
  • -h: Print help and exit.
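Several of these flags can be combined in a single invocation. A hypothetical run that stays on the starting path, limits recursion depth to 3, and pushes results into Elasticsearch (using the binary name from the examples above):

./bioschemas-gocrawlit_mac_64 -p -m 3 -e -u "https://www.ebi.ac.uk/biosamples/samples"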

Building binaries


To create a binary for your current OS, use:

make build

To create binaries for Windows, macOS, and Linux, use:

make build-all

The binaries will be placed under the build/ directory.
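The Makefile itself is not reproduced here; below is a minimal sketch of what the two targets might look like, assuming standard Go cross-compilation (binary names other than the mac one are guesses):

# Hypothetical Makefile sketch; the repository's actual targets may differ.
build:
	mkdir -p build
	go build -o build/bioschemas-gocrawlit

build-all:
	mkdir -p build
	GOOS=darwin GOARCH=amd64 go build -o build/bioschemas-gocrawlit_mac_64
	GOOS=linux GOARCH=amd64 go build -o build/bioschemas-gocrawlit_linux_64
	GOOS=windows GOARCH=amd64 go build -o build/bioschemas-gocrawlit_win_64.exe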

Elasticsearch quick setup with Docker


Steps for starting dockerized Elasticsearch and Kibana locally. This requires Docker.

Create a custom network for your elastic-stack:

docker network create elastic-stack

Pull and run an elasticsearch image:

docker run -it --network=elastic-stack -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" --name elasticsearch docker.elastic.co/elasticsearch/elasticsearch:6.2.4

Avoid changing the container's name, since the Kibana Docker image points to http://elasticsearch:9200 by default.
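Once the container is up, you can check that Elasticsearch answers on the mapped port:

curl http://127.0.0.1:9200

It should reply with a short JSON document describing the cluster.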

Pull and run a Kibana image:

docker run --network=elastic-stack --rm -it -p 5601:5601 --name kibana docker.elastic.co/kibana/kibana:6.2.4

Remember that the --rm flag will delete the container once it is stopped.
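Kibana can take a little while to start; once it has, it is served on the mapped port and can be checked the same way (or simply opened in a browser):

curl -I http://127.0.0.1:5601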

ToDo

  • Crawl website
  • URL via command-line parameters
  • JSON-LD extraction
  • Microdata extraction
  • Better file output
  • Sitemap.xml crawl option
  • Pagination option
  • Connecting to a flexible storage
  • RDFa extraction support
  • Writing the file as it scrapes