This is a web crawling program implemented in Go. It uses ferret to define and execute FQL queries, go-co-op/gocron to schedule tasks, and viper to read configuration files, covering both program configuration and task configuration.
After comparing several scraping tools, I chose ferret for the web data crawling because the crawling logic is written in the FQL DSL: it can be modified easily, and there is no need to recompile the program after a modification.
To use ferret as a library, I studied the examples in the ferret source code and followed them to embed ferret in the program, so the ferret CLI is not needed.
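For reference, a minimal sketch of that kind of embedding, modeled on ferret's own library example; the exact packages and signatures may differ between ferret versions, and the query here is only a placeholder:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/MontFerret/ferret/pkg/compiler"
	"github.com/MontFerret/ferret/pkg/drivers"
	"github.com/MontFerret/ferret/pkg/drivers/http"
)

func main() {
	// A trivial fql query; real queries live in the *.fql files under fqldir.
	query := `RETURN "hello from ferret"`

	comp := compiler.New()

	program, err := comp.Compile(query)
	if err != nil {
		log.Fatal(err)
	}

	// Register an HTML driver; DOCUMENT() and related functions need at least one.
	ctx := drivers.WithContext(context.Background(), http.NewDriver(), drivers.AsDefault())

	// Run returns the query result as JSON-encoded bytes.
	out, err := program.Run(ctx)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(string(out))
}
```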
To support multiple FQL scripts, each with its own scheduling strategy, go-co-op/gocron is used as the scheduler.
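As an illustration, a minimal gocron (v1 API) sketch of the two scheduling styles used further below, interval-based and cron-based; job names, intervals, and the cron expression are only examples:

```go
package main

import (
	"fmt"
	"time"

	"github.com/go-co-op/gocron"
)

func main() {
	s := gocron.NewScheduler(time.Local)

	// Cap how many jobs may run at the same time (see scheduler.max below).
	s.SetMaxConcurrentJobs(4, gocron.WaitMode)

	// Interval-based job, the equivalent of schedule.every: 3m.
	if _, err := s.Every("3m").Do(func() { fmt.Println("run zhihu/hotlist") }); err != nil {
		panic(err)
	}

	// Cron-based job, the equivalent of schedule.cron (expression is illustrative).
	if _, err := s.Cron("0 18 * * 5").Do(func() { fmt.Println("run bilibili/weekly") }); err != nil {
		panic(err)
	}

	s.StartBlocking()
}
```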
The configuration is divided into program configuration, job configuration, and FQL scripts.
Program configuration file: configs/crawl.[yaml|json|toml]; the current version of the program cannot change this path. Because the program uses spf13/viper as its configuration library, any format supported by viper can be used.
The currently valid configuration items are:
jobsdir: configs/jobs # default configDir/jobs
fqldir: configs/fql # default configDir/fql
outdir: output # default current working directory
scheduler:
  # maximum number of fql crawling jobs running concurrently
  max: 4
jobsdir : The program scans this directory for crawling job configuration files
fqldir : Root directory of the ferret scripts
outdir : Output location for the results of ferret script execution
scheduler.max : Maximum concurrency of gocron scheduling
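A minimal sketch of how these items could be read with viper; the key names come from the list above, while the default values shown are assumptions for the example:

```go
package main

import (
	"fmt"

	"github.com/spf13/viper"
)

func main() {
	v := viper.New()

	// Look for configs/crawl.(yaml|json|toml); any format viper supports works.
	v.SetConfigName("crawl")
	v.AddConfigPath("configs")

	// Defaults mirroring the descriptions above (assumed values).
	v.SetDefault("jobsdir", "configs/jobs")
	v.SetDefault("fqldir", "configs/fql")
	v.SetDefault("outdir", ".")
	v.SetDefault("scheduler.max", 4)

	if err := v.ReadInConfig(); err != nil {
		fmt.Println("config not found, using defaults:", err)
	}

	fmt.Println(v.GetString("jobsdir"), v.GetString("fqldir"),
		v.GetString("outdir"), v.GetInt("scheduler.max"))
}
```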
The jobsdir in configs/crawl.yaml points to the directory where the job configuration files are located. The program scans all yaml job configuration files in this directory and its subdirectories.
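A sketch of such a scan, written as a hypothetical helper on top of the standard library (not the program's actual code):

```go
package main

import (
	"fmt"
	"io/fs"
	"log"
	"path/filepath"
	"strings"
)

// findJobFiles collects every yaml job configuration file under jobsDir,
// including files in subdirectories.
func findJobFiles(jobsDir string) ([]string, error) {
	var files []string
	err := filepath.WalkDir(jobsDir, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			return nil
		}
		ext := strings.ToLower(filepath.Ext(path))
		if ext == ".yaml" || ext == ".yml" {
			files = append(files, path)
		}
		return nil
	})
	return files, err
}

func main() {
	files, err := findJobFiles("configs/jobs")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(files)
}
```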
Spec of the job configuration file is as follows:
enable: true
fqljobs:
  - name: zhihu/hotlist
    desc: Zhihu hot list
    script: zhihu/hotlist.fql
    output: zhihu_hotlist.json
    enable: true
    schedule:
      every: 3m
  - name: bilibili/weekly
    script: bilibili/weekly.fql
    desc: Bilibili weekly must-watch
    schedule:
      cron: "* 3 18/1 * * 5 *" # Every Friday 18:00-23:00
enable : Required, default false. It lets you enable or disable all the jobs in one yaml file at once.
job.name : Required, the job name; it must be unique within a single yaml configuration file
job.desc : Optional, descriptive information to help memorize and understand the job
job.script : Required, the relative path of the ferret query script. The program searches for it under the directory specified by configs/crawl.yaml#fqldir
job.output : Optional, the relative path of the file the fql result is saved to. It is saved under the directory specified by configs/crawl.yaml#outdir. If it is missing, the program derives a default name from configs/crawl.yaml#fqldir and the script path, replacing path separators with underscores and using a .json extension. For example, bilibili/weekly above does not specify output, so the default output file name generated by the program is configs_fql_bilibili_weekly.json
job.enable : Optional, default false. Only jobs with enable=true are loaded.
job.schedule : cron is read first; if it is missing, every is tried. If neither exists, the default is 7m. Because the program uses the go-co-op/gocron scheduling library, cron takes a cron expression; every takes a time interval using the units s -> seconds, m -> minutes, h -> hours. If the interval is very long, such as many days or a month, a cron expression is recommended for more precise control.
A job configuration file can define multiple jobs, and the number of job configuration files is not limited, as long as they are in the directory, or a subdirectory, specified by configs/crawl.yaml#jobsdir. Plan according to your needs.
Ferret query scripts are searched for under the configs/crawl.yaml#fqldir directory and its subdirectories. These fql scripts are written in the ferret query language and can be developed and tested in advance at montferret.dev/try/.
Make sure the program has been compiled, or download a prebuilt executable.
- Download
github.com/lizhuoqi/crawling/releases gitee.com/lizhuoqi/crawling/releases
- Compile by yourself
> git clone <repository url of this program>
> go mod tidy -v
> go build -v
> # Or use make
> make build
> ./crawl
When the program runs, the generated json file overwrites the existing one; history is not kept. If you need to pick up updates in time, install a file watcher to trigger the subsequent actions.
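For example, a small watcher built on the fsnotify library could react to rewritten result files; the watched directory and the follow-up action here are placeholders:

```go
package main

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

func main() {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Close()

	// Watch the directory that configs/crawl.yaml#outdir points to.
	if err := watcher.Add("output"); err != nil {
		log.Fatal(err)
	}

	for event := range watcher.Events {
		// A rewritten result file shows up as a Write (or Create) event.
		if event.Op&fsnotify.Write == fsnotify.Write {
			log.Println("updated:", event.Name)
			// Trigger your follow-up action here.
		}
	}
}
```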