Tutorial

Plugins

Basic plugins

parse plugin

Enabled by default. Parses HTML content with complex rules. After a task's execution, a parsed property will be added to the task.

Options for task

task = {
  ...
  parse: "title",

  parseCheck(res) {return true;}, //optional
  callback(task) {
    console.log(task.parsed);
  }
}

If parseCheck is a function, it will be called first. Return true to continue the parsing execution. You can also change the parse rule dynamically in the function.
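
For example, parseCheck can be used to skip parsing for failed responses (a minimal sketch; the URL and the got-style statusCode check are illustrative assumptions):

task = {
  url: "http://quotes.toscrape.com/",
  parse: "title",
  // only run the parse rule for successful responses
  parseCheck(res) {
    return res.statusCode === 200;
  },
  callback(task) {
    console.log(task.parsed);
  }
}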

Options for crawler

{
  ...
  filters: {
    trim: v=>v.trim()
  }
}
  • Select an object
task = {
  ...
  parse: { title: "title", warning: ".alert" },
}
  • Select an array
task = {
  ...
  // starts with '[' and ends with ']'
  parse: "[ul li a @href]"
}
  • Using scope
task = {
  ...
  // first element of array: scope selector; wrap it in `[]` to process each division as an array item
  // second element of array: element selector applied within each scope
  parse: ["[ul li]", "a @href"]
}
  • Select array of objects
task = {
  ...
  parse: ["[.product_pod]", { title: "h3 a" }]
}
  • Available attributes
"@html";
"@outerHtml";
"@text"; // default
"@string"; // direct text node's value
"@nextNode"; // next dom node's value
  • Using filters

To apply filters to a value, append them to the selector using |.

task = {
  parse: "title | trim"
};

Or use the array format to provide functions directly:

task = {
  parse: ["title", s => s.trim()]
};

To register filters on the instance, use the filters property in the crawler's options.
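
Putting this together, a filter registered in the crawler's options can then be referenced by name from any task's selector (a sketch reusing the trim filter and .product_pod rule from above; filters that take arguments are invoked with a colon, as in the slice:0,20 call in the quotes example further down):

// crawler options
{
  ...
  filters: {
    trim: v => v.trim()
  }
}

// task using the registered filter by name
task = {
  ...
  parse: ["[.product_pod]", { title: "h3 a | trim" }]
}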

attempt plugin

Enabled by default. In addition to letting the got library handle retries, you can also handle them through the attempt plugin.

Options for crawler and task

{
  // [retries, allowedStatuses, callback({ err, shouldRetry, maxRetry, task, crawler })]
  attempts: [3, [404]],
  // or simply a number
  // attempts: 3
}

The default callback function logs the error to the console and re-adds or resolves the task based on the shouldRetry variable. When you provide your own callback, the default callback will not be called, so make sure the task gets resolved. Alternatively, return a truthy value from your custom callback and the plugin will call the default callback automatically.
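
For example, a custom callback can log extra context and then return true to hand control back to the default callback (a sketch, assuming the callback receives the destructured fields shown in the comment above):

{
  ...
  attempts: [
    3,
    [404],
    ({ err, shouldRetry, task }) => {
      console.error(task.url, "failed:", err.message, "- will retry:", shouldRetry);
      // returning true lets the plugin run the default callback,
      // which re-adds or resolves the task
      return true;
    }
  ]
}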

delay plugin

Enabled by default.

Options for task

{
  // the number of milliseconds to wait before sending the http request
  delay: 3000
}
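
For example, a per-task delay of three seconds before the request is sent (a sketch; the URL is illustrative and x is the crawler instance used elsewhere on this page):

x({
  url: "http://quotes.toscrape.com/",
  delay: 3000,
  parse: "title",
  callback(task) {
    console.log(task.parsed);
  }
});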

follow plugin

Enabled by default.

Prerequisite: parse plugin

Workflow for this plugin: parse with the selector => apply the filter function => run the callback for each URL.

The selector for parsing new URLs and the callback function are required. The callback function should receive a URL as its parameter and return a task. The filter function is used to filter the selected results; the default filter applied to URLs from the selector is urls => urls.filter(v => v).

let selector = "[a@href]";
let callbackFunc = url => ({ url, parse: "title", callback(task) { /*...*/ } });
let filterFunc = urls => urls.filter(v => v); // the default filter
let followRule = [selector, callbackFunc, filterFunc];

Spawner mode is also supported: a falsy or omitted callback enables spawner mode for the selected URLs, as followRule2 below shows.

let followRule2 = ['[a@href]']

Options for task


let task = {
  ...
  follow: followRule,
  // or use follows: an array of followRule
  follows: [followRule1, followRule2]
}
...
function xMain(url) {
  return {
    // main page task
    url,
    parse: [
      // divide by '.quote' selector
      "[.quote]",
      // rule on each division
      {
        author: ".author",
        authorUrl: ".author+a@href",
        text: ".text | slice:0,20",
        tags: "[a.tag]"
      }
    ],
    follow: [".next a@href", xMain]
  };
}

x(xMain('http://quotes.toscrape.com/'))
$ DEBUG=crawlx* node quotes.js
  crawlx GET http://quotes.toscrape.com/ -> 200 +0ms
  crawlx GET http://quotes.toscrape.com/page/2/ -> 200 +1s
  crawlx GET http://quotes.toscrape.com/page/3/ -> 200 +822ms
  crawlx GET http://quotes.toscrape.com/page/4/ -> 200 +1s
  crawlx GET http://quotes.toscrape.com/page/5/ -> 200 +648ms
  crawlx GET http://quotes.toscrape.com/page/6/ -> 200 +996ms
  crawlx GET http://quotes.toscrape.com/page/7/ -> 200 +695ms
  crawlx GET http://quotes.toscrape.com/page/8/ -> 200 +2s
  crawlx GET http://quotes.toscrape.com/page/9/ -> 200 +712ms
  crawlx GET http://quotes.toscrape.com/page/10/ -> 200 +646ms

dupFilter plugin

Disabled by default.

Prerequisite: normalize-url

$ npm i normalize-url

Usage:

const x = require("crawlx").default;
const plugins = require("crawlx").plugins;

x.crawler.use(plugins.dupFilter());
// options for normalize-url
x.crawler.use(plugins.dupFilter({ stripHash: true }));

Options for task

{
  // set to true if the task should not be filtered by the dupFilter plugin
  dontFilter: true
}
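
For example, queueing the same URL twice, with the second task opting out of the filter (a sketch; the URL is illustrative):

x("http://quotes.toscrape.com/");
// added again even though the URL was already seen
x({ url: "http://quotes.toscrape.com/", dontFilter: true, parse: "title" });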

other plugins

Others

Plugin Structure

const getPlugin = options => ({
  name: "",                             // unique plugin name
  priority: 0,                          // execution priority relative to other plugins
  before: (task, crawler) => {},        // runs before each task's request
  after: (task, crawler) => {},         // runs after each task's response
  start: crawler => {},                 // runs once at startup
  finish: crawler => {},                // runs once when the crawler finishes
  onError: (error, task, crawler) => {} // runs when a task throws an error
});
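
For example, a minimal custom plugin that times every task could look like this (a sketch, assuming the hook signatures above; the _startedAt field is illustrative):

const timerPlugin = {
  name: "timer",
  priority: 0,
  before(task) {
    // record when the task's request starts
    task._startedAt = Date.now();
  },
  after(task) {
    console.log(task.url, "took", Date.now() - task._startedAt, "ms");
  },
  onError(error, task) {
    console.error(task.url, "failed:", error.message);
  }
};

x.crawler.use(timerPlugin);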

Enable plugins in order

Sometimes plugins should be enabled in order (that is, their start methods should be called in order). For example, the resume plugin also saves and loads the dupFilter set, so the resume plugin should be enabled first.

(async () => {
  await x.crawler.use(pluginA);
  await x.crawler.use(pluginB);

  x("http://quotes.toscrape.com/");
})();